Curated global occurrence dataset of the insect order Zoraptera


To incorporate all species currently classified in this insect order into the dataset, the most recent comprehensive catalog of Zoraptera was utilized17. Subsequently described taxa were then added (e.g.3,4,5,6,18,19,20), while taxa that were not zorapterans were removed21. We followed the currently used and widely accepted higher classification with two families (each with two subfamilies) and ten genera. This classification was recently proposed by Kočárek et al.2 and Kočárek & Kočárková22, and it is based on the results of analyses of molecular phylogeny in combination with morphological characters. The current version of the dataset includes all recent members of the order Zoraptera described before October 1, 2024.

Data sources

The geographical position of each species was obtained from published sources, as well as from material deposited in museum collections and other material collected and/or identified by the authors of this contribution. Additionally, data from iNaturalist23 and GBIF24 were included. Initially, data from all original descriptions of new species were incorporated, and subsequently, all Zoraptera distributional records found in the remaining literature were added. A comprehensive search strategy was employed: we searched all references cited in the original descriptions and in the catalog by Hubbard17, and then all the references cited in those works. This process was repeated until no new references or occurrence records were identified. To ensure the comprehensiveness of the results, systematic searches were conducted on Google Scholar, Google, and Web of Knowledge using the keyword “Zoraptera” and all supraspecific taxon names historically used in this order (Zorotypidae, Spiralizoridae, Zorotypinae, Spermozorinae, Latinozorinae, Spiralizorinae, Aspiralizoros, Brazilozoros, Centrozoros, Cordezoros, Floridazoros, Latinozoros, Meridozoros, Scapulizoros, Spermozoros, Spiralizoros, Zorotypus, and Usazoros). We thus reviewed not only the taxonomic and faunistic literature, but also studies focusing on the biology, morphology, and phylogeny of Zoraptera (many of which included useful distributional data of the material examined), as well as various general books and other documents. Most publications were in English; the few studies written in other languages (German, Latin, Portuguese, Spanish, Chinese) were analyzed in consultation with colleagues and translated using online translation websites (DeepL or Google Translate).
The references included in the final dataset were either those providing original data or, in cases where multiple references reported the same distribution information, only those with the first record or with the most complete information.

All records from iNaturalist were revised and identified to the lowest reliably determinable taxonomic rank directly on the iNaturalist website by Petr Kočárek, and we imported only records with an appropriate license (CC-BY, CC-BY-NC, CC0). GBIF was then queried, excluding iNaturalist data, resulting in a GBIF dataset24 that was further manually revised (see Technical Validation section below for common errors in Zoraptera identification). Specifically, we excluded sequence records imported from genetic databases (Barcode of Life Data System [BOLD] and European Nucleotide Archive [ENA]) whenever no voucher specimen deposited in a publicly available collection was specified, as well as fossil records from amber and records without any geographic information. Finally, we matched the corresponding records by adding a GBIF id to existing records in our dataset and then added the remaining records with revised taxon identification.
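The licensing and exclusion rules described above can be sketched as two small predicates. This is only an illustration in Python, not the authors' workflow (which involved manual revision); the dictionary keys `license`, `source`, `voucher`, `is_fossil`, `latitude`, `longitude`, and `locality` are hypothetical names chosen for the example, and the last GBIF criterion is read as "records lacking any geographic information".

```python
# Hypothetical field names throughout; real iNaturalist/GBIF exports use
# their own column names and would need to be mapped onto these keys.
ALLOWED_LICENSES = {"CC-BY", "CC-BY-NC", "CC0"}
SEQUENCE_SOURCES = {"BOLD", "ENA"}


def keep_inaturalist_record(record):
    """Keep only iNaturalist records published under an accepted license."""
    return record.get("license") in ALLOWED_LICENSES


def keep_gbif_record(record):
    """Apply the three exclusion rules to a single GBIF record."""
    # Rule 1: sequence-derived records need a specified voucher specimen.
    if record.get("source") in SEQUENCE_SOURCES and not record.get("voucher"):
        return False
    # Rule 2: fossil (amber) records are out of scope.
    if record.get("is_fossil"):
        return False
    # Rule 3: the record must carry some geographic information.
    has_coordinates = (
        record.get("latitude") is not None
        and record.get("longitude") is not None
    )
    has_locality = bool(record.get("locality"))
    return has_coordinates or has_locality
```

Records that pass these checks would still need the manual taxonomic revision described in the text; the predicates only cover the mechanical part of the screening.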

Digitizing locations

Information about the geographic location of the records was stored in the available sources in different ways and with varying quality. In all cases, we attempted to derive the most likely accurate coordinates and then determine the degree of positional uncertainty. If the record already had coordinates, they were converted to decimal degree format, and the uncertainty was set based on the coordinate precision according to Wieczorek25. In cases where coordinates were missing or uninterpretable and only a record description was available, we obtained the coordinates by digitizing the areas based on all available information, including maps in publications. To find locations by name or address, we used the Nominatim geocoding tool, which searches for features in OpenStreetMap (OSM) data26. The digitizing process was conducted in QGIS 3.3827. We accessed Nominatim from QGIS through the plugin OSM place search 1.4.528, which enabled us to import OSM geometry features with attributes directly into QGIS for further processing. We followed the Georeferencing Quick Reference Guide29, and each record was treated separately and carefully based on the context of the environment; i.e., we excluded from the digitized area any parts that the collector would more likely have described in another way, for example by reference to a major city or other notable geographic feature. We used OSM geometry features in various ways depending on the amount of available information and the context surrounding the target area. If there was little information, such as only the name of a state or city, or part of it, we used the exact OSM feature. In other cases, the OSM feature was edited or used only as a reference point, e.g., if the location was described as ‘near’ a given place or by distance and direction from it. The identification numbers (ids) of the original OSM features were stored in the dataset and can be retrospectively examined and compared.
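For records that already carried coordinates, the conversion to decimal degrees is simple arithmetic. The following Python sketch shows it for degrees-minutes-seconds input; the function name and argument layout are assumptions made for illustration, and the precision-based uncertainty rule of Wieczorek25 is not reproduced here.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees.

    Southern and western hemispheres yield negative values.
    """
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must lie in [0, 60)")
    value = abs(degrees) + minutes / 60.0 + seconds / 3600.0
    return -value if hemisphere in {"S", "W"} else value
```

For example, `dms_to_decimal(10, 30, 0, "S")` gives a latitude of ten and a half degrees south as a negative decimal value.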
If the record site could not be localized with Nominatim, we used other sources, such as Google Maps or various sites and publications found through Google searches, and either digitized the site manually or used the corresponding OSM features. If altitude was taken into account, we used OpenTopoMap30 to derive the area corresponding to that altitude. If the description referred to a line or point feature (e.g., a road, river, hill, or other feature), we converted these features to polygons with a buffer of at least 100 meters. In general, if the polygon extended beyond the coastline (e.g., Java), it was cropped to the coastline from the OSM data (OSM tag ‘natural=coastline’). Finally, all polygons were simplified using the QGIS native Simplify tool with the Visvalingam algorithm and a tolerance threshold of 100. This removed redundant polygon vertices and reduced the storage size of the dataset with negligible loss of information. The resulting polygons were stored in a GeoPackage file that was published as part of the dataset repository. In addition, the polygons were written into the dataset itself as well-known text (WKT). This allows users to check the individual areas from which the coordinates were derived, and to make further edits and updates.
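To illustrate what area-based simplification does to a polygon outline, here is a minimal Python sketch of the Visvalingam idea: repeatedly drop the vertex whose triangle with its two neighbours has the smallest area, until every remaining triangle reaches a threshold. It is not the QGIS implementation; the function names are hypothetical, and the `min_area` threshold is simply an area in squared coordinate units, which is an assumption and not necessarily how QGIS interprets its tolerance value.

```python
def triangle_area(a, b, c):
    """Area of the triangle spanned by three (x, y) points."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def simplify_visvalingam(points, min_area, min_points=2):
    """Simplify a vertex sequence by removing low-area vertices.

    The first and last points are always kept. For a closed ring
    (first point repeated at the end), pass min_points=4 so that the
    ring never collapses below a triangle.
    """
    pts = list(points)
    while len(pts) > min_points:
        # Effective area of every interior vertex with its neighbours.
        areas = [
            triangle_area(pts[i - 1], pts[i], pts[i + 1])
            for i in range(1, len(pts) - 1)
        ]
        smallest = min(areas)
        if smallest >= min_area:
            break  # every remaining vertex is significant enough
        del pts[areas.index(smallest) + 1]
    return pts
```

This straightforward version recomputes all areas on each pass, which is fine for a demonstration; production implementations typically use a priority queue instead.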

We assigned a country name and code to every record based on ISO 3166-1 alpha-2. To obtain coordinates and positional uncertainty from polygon geometries, we computed enclosing circles and their centroids. If such a centroid was outside the original polygon geometry, we calculated the nearest point on the polygon instead. We then calculated the geodetic distance from that point to the most distant point on the polygon (i.e., the radius). These points represent the coordinates of the record, and the distances represent the positional uncertainties. These values were calculated in R 4.3.331 with the packages sf32 and lwgeom33.
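The final step, taking the distance from the representative point to the farthest point of the polygon as the positional uncertainty, can be sketched in a few lines. The Python example below is an approximation under stated assumptions, not the authors' R code: it uses the haversine formula on a sphere with a mean Earth radius instead of an ellipsoidal geodesic, it assumes the representative point has already been chosen, and it checks only the polygon vertices. The function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def positional_uncertainty_m(point, vertices):
    """Distance from `point` to the most distant polygon vertex.

    `point` and each vertex are (latitude, longitude) tuples.
    """
    return max(haversine_m(point[0], point[1], lat, lon) for lat, lon in vertices)
```

For small polygons the difference between this spherical distance and an ellipsoidal one is minor, but for country-sized areas an ellipsoidal calculation is the more careful choice.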

Dataset updates

Records in the dataset can be updated by directly editing ‘zoraptera_occs.csv’ or semiautomatically from iNaturalist and GBIF. The iNaturalist update workflow starts by revising the identifications directly in iNaturalist; we then use the rinat R package34 to check for new records or identification updates verified by specific users (for the initial version of the dataset, only Petr Kočárek). Compliant data are then automatically appended to the ‘zoraptera_occs.csv’ dataset, and the date of the update is recorded in the log file. The GBIF update workflow starts by downloading the current Zoraptera data from GBIF using the rgbif R package35. On each update, the current GBIF dataset is compared with the last downloaded and revised GBIF dataset based on the gbifID of the records. The date and the dataset DOI are stored in a log file so that the process can be repeated. All new GBIF records are temporarily stored, manually revised, and then incorporated into ‘zoraptera_occs.csv’. Polygon geometry can be manually added or edited within the GeoPackage, and the coordinates with positional uncertainties can be automatically recalculated and updated in ‘zoraptera_occs.csv’.
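The comparison step of the GBIF update can be pictured as a set difference on gbifID. The sketch below is in Python for illustration only (the published workflow uses R and rgbif); the function name and the representation of records as dictionaries are assumptions.

```python
def find_new_gbif_records(current_records, previous_records):
    """Return records from the current download whose gbifID was not
    present in the previously downloaded and revised dataset."""
    known_ids = {record["gbifID"] for record in previous_records}
    return [record for record in current_records if record["gbifID"] not in known_ids]
```

The returned records correspond to the ones that would be set aside for manual revision before being added to ‘zoraptera_occs.csv’.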


