The dataset is available in a figshare repository (Weng et al.22) and is licensed under CC BY.
The database, which includes three main components (occurrence records, DNA barcode data and functional traits), is organized into three files in ‘.xlsx’ format. Missing values in these files are consistently coded as NA.
The “Occurrence_data.xlsx” file consists of two sheets: one entitled ‘species distribution records’ and the other ‘species list’. Each record in the ‘species distribution records’ sheet gives the taxonomic placement of the species (family, genus, and species) together with specimen details: latitude and longitude, date of collection, habitat type, depth, source of data, and country/region. The taxonomic columns give the classification hierarchy, with scientific names accompanied by the author’s surname and the year of naming. Records sourced from public databases are marked with the respective database name, such as GBIF or OBIS, whereas literature-derived records carry the title of the publication. The depth column gives the water depth (in meters) at which the species was found, and the habitat column describes the environment from which the specimen was collected. The ‘species list’ sheet provides the taxonomic classification of each species, including family, genus, and species.
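As an illustration of how the source-of-data convention described above could be used, the following minimal sketch tallies records as database-derived or literature-derived. It is not part of the published workflow. The column header `source of data`, the helper names `load_sheet` and `count_by_source`, and the set `KNOWN_DATABASES` are assumptions made for this example; only GBIF and OBIS are named in the text, so the set may need to be extended to match the actual file.

```python
from collections import Counter

# Assumption: only GBIF and OBIS are named in the text; extend as needed.
KNOWN_DATABASES = {"GBIF", "OBIS"}


def load_sheet(path, sheet_name):
    """Read one sheet into a list of dicts, turning 'NA' cells into None."""
    import pandas as pd  # imported lazily; reading .xlsx also needs openpyxl

    frame = pd.read_excel(path, sheet_name=sheet_name, na_values=["NA"])
    # Cast to object so that missing cells can be stored as None.
    return frame.astype(object).where(frame.notna(), None).to_dict("records")


def count_by_source(rows, source_column="source of data"):
    """Tally records as 'database', 'literature', or 'missing'.

    The column name is hypothetical and should match the real header.
    """
    counts = Counter()
    for row in rows:
        value = row.get(source_column)
        if value is None or str(value).strip() == "":
            counts["missing"] += 1
        elif str(value).strip().upper() in KNOWN_DATABASES:
            counts["database"] += 1
        else:
            # Anything else is treated as a publication title.
            counts["literature"] += 1
    return counts
```

With pandas and openpyxl installed, `count_by_source(load_sheet("Occurrence_data.xlsx", "species distribution records"))` would give the split for the full sheet, provided the header assumption holds.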
The dataset comprises approximately 39,310 records of polychaete annelid worms, representing 2,831 species in 696 genera and 75 families and covering the period from 1776 to 2024. Notably, about 13% of these records are derived from the scientific literature and are not included in existing public databases. Most species records are marine, with a small number from terrestrial or freshwater environments (Fig. 3). The period from 1991 to 2010 had the highest number of sampling events (12,089), well above the 6,258 events documented from 1971 to 1990. Australia was the country with the most sampling events, contributing 61.4% of the total, followed by Indonesia, China, India, and the Philippines. Most sampling occurred within the 0–100 m depth range. The five most species-rich families are Syllidae (329 species), Nereididae (220 species), Terebellidae (215 species), Spionidae (174 species), and Polynoidae (142 species), as presented in Fig. 4.
The “Functional_traits_data.xlsx” file consists of a matrix of 2,831 species by 13 trait variables. A total of 11,953 valid trait recordings were collected, with temperature tolerance, salinity tolerance, depth zonation, and branching structure/branchiae being the four most frequently recorded traits. Conversely, the traits with the fewest recordings were population spawning frequency, epistasis, and longevity (Fig. 5).
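The figures above imply how sparsely the trait matrix is populated. The sketch below works this out and shows one way per-trait completeness could be computed. It assumes that each valid recording fills exactly one cell of the 2,831 by 13 matrix, which the text does not state explicitly. The function name `trait_fill_rates` and the trait keys in the usage note are hypothetical.

```python
# Figures taken from the text.
N_SPECIES = 2831
N_TRAITS = 13
N_RECORDINGS = 11953

# Assumption: one valid recording corresponds to one filled matrix cell.
overall_fill = N_RECORDINGS / (N_SPECIES * N_TRAITS)


def trait_fill_rates(matrix, traits, missing=("NA", None)):
    """Return {trait: fraction of species with a recorded value}.

    `matrix` maps each species name to a dict of trait values;
    cells equal to 'NA' or None are counted as missing.
    """
    n_species = len(matrix)
    if n_species == 0:
        return {trait: 0.0 for trait in traits}
    rates = {}
    for trait in traits:
        filled = sum(
            1 for values in matrix.values() if values.get(trait) not in missing
        )
        rates[trait] = filled / n_species
    return rates
```

Under the stated assumption, `overall_fill` comes to roughly one third of the matrix. A call such as `trait_fill_rates(matrix, ["longevity", "depth zonation"])` would return the corresponding per-trait fractions.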
The dataset entitled “DNA_barcode_data.xlsx” consists of five separate sheets.
The first sheet, identified as “COI”, includes data relevant to the COI gene sequence, detailing information such as class, family, genus, and species, as well as the gene name (abbreviated as COI), gene length, GenBank ID, BOLD ID, and the nucleotide sequence.
The second sheet, labeled “16S”, contains information on the 16S gene sequences, while the third sheet, named “18S”, provides the same information for the 18S gene sequences. The columns in these two sheets match those of the COI sheet, ensuring uniformity across the dataset. The fourth sheet, titled “mtDNA”, contains mitochondrial genome data, with columns for class, family, genus, species name, length, molecule type, GenBank ID, and sequence. Finally, the fifth sheet summarizes the gene data available for each species in four columns (COI, 16S, 18S, and mtDNA), where each cell gives the number of sequences for the corresponding gene.
In the present study, we catalogued a total of 3,973 COI sequences, representing 20.10% of all species. We also recorded 1,574 16S sequences, corresponding to 17.20% of the species, and 1,505 18S sequences, corresponding to 20.28% of the species. In addition, we catalogued 154 mitochondrial genome sequences, of which 55 were generated in the present study. These mitochondrial genomes span 33 families, with Nereididae and Spionidae the most represented (Fig. 6).
Statistics of DNA Barcode Data. (A) COI Gene: Depicts the proportion of species and the cumulative sequence count that incorporate the Cytochrome Oxidase I (COI) gene within the dataset. (B) 16S rRNA Gene: Displays the proportion of species and the total sequence number that harbor the 16S ribosomal RNA (16S rRNA) gene in the dataset. (C) 18S rRNA Gene: Illustrates the proportion of species and the aggregated sequence count that encompass the 18S ribosomal RNA (18S rRNA) gene in the dataset. (D) Mitochondrial Genome Distribution: Provides a statistical overview of the families represented, based on the presence of mitochondrial genomes in the dataset.
In the pie charts, grey sections represent species with sequences and white sections represent species without sequences.