Introduction
By differentiating species based on standardized DNA regions (
Hebert et al. 2003;
Hebert et al. 2003), DNA barcoding makes it possible to map species distributions at scale. It not only enables estimates of species diversity but also aids the recognition of unknown species and the tracking of biodiversity changes over time (
Deiner et al. 2017). DNA barcoding also makes it possible to monitor diversity in lineages whose taxonomy is weakly developed, providing the spatially explicit and taxonomically comprehensive data needed to evaluate changes in species distributions in response to environmental change (
Archambault et al. 2010;
Hochkirch et al. 2021;
Borgelt et al. 2022). Building on the capabilities provided by the Barcode of Life Data System (BOLD) (
Ratnasingham and Hebert 2007,
2013), Canada has been the global leader in this approach, significantly advancing the understanding and conservation of its biodiversity. However, gaps in the sequence database hinder scalable, high-resolution biosurveillance (
Leray and Knowlton 2016;
Sinniger et al. 2016;
Weigand et al. 2019).
DNA barcode records include a voucher specimen, its taxonomic assignment, details on its point and time of collection, and a barcode sequence (
Ratnasingham and Hebert 2013). Once a species enters BOLD, members of the same taxon can be rapidly identified from their DNA barcode whether derived from a single specimen, from bulk samples (DNA metabarcoding), or from environmental samples (eDNA) such as water, air, soil, or sediment (
Ruppert et al. 2019). However, when analysis of metabarcoding or eDNA samples reveals sequences unrepresented in the reference library, the inferred taxonomic composition is incomplete or inaccurate (
Beng and Corlett 2020). While taxonomy-free approaches for assessing biodiversity from molecular analysis escape this problem (
Apothéloz-Perret-Gentil et al. 2017;
Callahan et al. 2017;
Mächler et al. 2021), they fail to connect sequences with the biological attributes of the species that served as their source. As a consequence, species identifications remain essential for understanding ecology, functional roles, and population genetics, and for tracking invasive or endangered species.
A key step in directing future work in library construction involves summarizing progress in securing DNA barcode coverage from a biogeographic perspective. Ecological land classifications delineate geographic regions (e.g., provinces, ecoregions, and ecozones) based on geological, climatic, and biotic differences (
Omernik 1995). As such, these units provide a framework for understanding spatial complexity, conducting biological risk assessments, and directing conservation action (
Omernik 1995;
Smith et al. 2020;
Spalding et al. 2007).
In this study, we evaluate DNA barcode coverage for each of Canada’s 28 ecozones to identify variation in coverage among them and to highlight taxonomic groups lacking data. Summarizing the number of barcode records by ecozone allowed a comparison of terrestrial and aquatic habitats. Additionally, we investigated barcode coverage for Canadian marine invertebrates across taxonomic groups to highlight gaps and aid prioritization of future efforts.
Materials and methods
To assess DNA barcode coverage for the Canadian fauna, we downloaded all public Canadian records from BOLD (
Ratnasingham and Hebert 2007,
2013; accessed 6 June 2022). Each record with GPS coordinates was assigned to one of Canada’s 28 ecozones (15 terrestrial, 12 marine, and 1 freshwater) (
Fig. 1). Canada’s National Ecological Framework (
Ecological Stratification Working Group 1995;
Marshall et al. 1996;
Hirvonen 2001) was used to define ecozone boundaries (
Fig. 1). Shapefiles were downloaded from the National Ecological Framework website (
Government of Canada 2017; accessed 8 October 2022) for terrestrial ecozones, and from the Canadian Council on Ecological Areas website (
CCEA 2015; accessed 8 October 2022) for aquatic ecozones. Both sets of ecozones have 1 m resolution. We compared the total number of records, records per ecozone, and records per 100 km² for terrestrial and aquatic ecozones. In addition, we plotted the accumulation of unique BINs and species as a function of the number of barcode records for terrestrial and aquatic organisms. To do this, we randomized the list of records and sampled without replacement, counting the number of unique BINs or species accumulated until the records were exhausted. This process was repeated 100 times, and values at each sample size were averaged to produce a smooth line.
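The randomized accumulation procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name `accumulation_curve`, the fixed seed, and the toy record labels are assumptions made for the example, and each record is assumed to carry exactly one BIN (or species) label.

```python
import random
from typing import Hashable, List, Sequence


def accumulation_curve(labels: Sequence[Hashable],
                       n_reps: int = 100,
                       seed: int = 0) -> List[float]:
    """Mean number of unique labels seen after 1..N records.

    Each replicate shuffles the full record list, which is equivalent to
    sampling every record once without replacement, and tracks the running
    count of unique labels. Counts are averaged over n_reps replicates.
    """
    rng = random.Random(seed)      # hypothetical fixed seed, for repeatability
    order = list(labels)
    totals = [0] * len(order)
    for _ in range(n_reps):
        rng.shuffle(order)
        seen = set()
        for i, label in enumerate(order):
            seen.add(label)
            totals[i] += len(seen)  # unique labels after i + 1 records
    return [t / n_reps for t in totals]


# Toy input (hypothetical): one BIN label per barcode record.
records = ["BIN-A", "BIN-A", "BIN-B", "BIN-A", "BIN-C", "BIN-B"]
curve = accumulation_curve(records)
print(curve)
```

The first value of the curve is always 1 and the last equals the total number of unique labels, so the shape in between is what indicates how close sampling is to an asymptote.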
For marine animals, we also summarized barcode progress by taxonomic group. We downloaded publicly available BOLD records (accessed 20 August 2022) whose identification matched marine species or genera in the Canadian Register of Marine Species (CaRMS) (
CaRMS 2022, accessed 8 October 2022). Records using synonymized or otherwise invalid names were relabelled with the accepted name. BOLD assigns a Barcode Index Number (BIN; a species proxy) to each sequence record that meets specified quality requirements (>500 bp in length, <1% ambiguous bases, no frameshifts, contaminants, or chimeras) and possesses complete metadata (sample ID, field or museum voucher ID, phylum, country, and institution storing the specimen) (
Ratnasingham and Hebert 2013). For each animal phylum, we summarized the number of accepted Canadian species, the number of accepted species with at least one record on BOLD, the number of accepted species with at least one BIN, the total number of records, and the total number of unique BINs. We calculated coverage as the percentage of described species with at least one BIN. We also summarized coverage and average BINs per species by family (see Supp. Data Table S1).
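The per-phylum summary described above could be computed along the following lines. This is a sketch under stated assumptions, not the authors' workflow: the checklist, synonym table, and records are hypothetical placeholders, the function name `coverage_by_phylum` is invented for the example, and matching is done on species names only (the genus-level matching mentioned above is omitted).

```python
from collections import defaultdict

# Hypothetical checklist: accepted species name -> phylum.
checklist = {
    "Species a": "Chordata",
    "Species b": "Chordata",
    "Species c": "Mollusca",
    "Species d": "Mollusca",
}
# Hypothetical synonym table: invalid name -> accepted name.
synonyms = {"Species b-old": "Species b"}
# Hypothetical records: (name on the record, BIN or None if no BIN assigned).
records = [
    ("Species a", "BIN-1"),
    ("Species a", "BIN-2"),
    ("Species b-old", "BIN-3"),
    ("Species c", None),
]


def coverage_by_phylum(checklist, synonyms, records):
    """Summarize records, BINs, and species coverage for each phylum."""
    with_record = defaultdict(set)   # accepted species with >= 1 record
    with_bin = defaultdict(set)      # accepted species with >= 1 BIN
    bins = defaultdict(set)          # unique BINs
    n_records = defaultdict(int)
    for name, bin_id in records:
        accepted = synonyms.get(name, name)   # relabel invalid names
        phylum = checklist.get(accepted)
        if phylum is None:
            continue                          # name not on the checklist
        n_records[phylum] += 1
        with_record[phylum].add(accepted)
        if bin_id is not None:
            with_bin[phylum].add(accepted)
            bins[phylum].add(bin_id)
    summary = {}
    for phylum in sorted(set(checklist.values())):
        n_species = sum(1 for p in checklist.values() if p == phylum)
        summary[phylum] = {
            "species": n_species,
            "species_with_record": len(with_record[phylum]),
            "species_with_bin": len(with_bin[phylum]),
            "records": n_records[phylum],
            "unique_bins": len(bins[phylum]),
            # Coverage = percentage of checklist species with at least one BIN.
            "coverage_pct": 100.0 * len(with_bin[phylum]) / n_species,
        }
    return summary


summary = coverage_by_phylum(checklist, synonyms, records)
for phylum, row in summary.items():
    print(phylum, row)
```

Keeping "species with a record" separate from "species with a BIN" mirrors the distinction drawn above between any BOLD record and a record that meets the quality requirements for BIN assignment.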
Results
In total, 2.3 million Canadian animal specimens have been barcoded, providing coverage for 85 706 BINs and 37 327 named species. Most of these records (94.7%) include GPS coordinates for the collection location, enabling their assignment to an ecozone. Of these georeferenced records, 98.5% were arthropods. The total area of Canada’s terrestrial ecozones is about 50% greater than that of its aquatic ecozones, but the difference in barcode records was far greater, as 95.6% (2 213 305 records) derived from land, while just 4.4% (100 894 records) were from water (
Tables 1 and
2). Coverage among ecozones showed >1000-fold variation (
Fig. 2). Terrestrial ecozones showed a 100-fold range from 6023 records (Taiga Shield;
Fig. 2A, #15;
Table 1) to 667 342 (Mixedwood Plains;
Fig. 2A, #7;
Table 1). Marine ecozones showed nearly a 1000-fold range from 29 (Arctic Archipelago;
Fig. 2B, #22) to 22 062 (Gulf of Saint Lawrence;
Fig. 2B, #27). The number of records on an areal basis averaged 22.4 records per 100 km² for land (
Table 1) versus 1.7 records per 100 km² for water (
Table 2), but records were not evenly distributed within ecozones (
Fig. 3). Specimens collected within terrestrial ecozones included representatives for 78 139 BINs and 23 775 named species, while the specimens collected within aquatic ecozones included 12 535 BINs and 6 117 species (
Fig. 4). The ratio of BINs to species was 3.29 for terrestrial records and 2.04 for aquatic records (
Fig. 4).
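The shares, fold ranges, and BIN-to-species ratios quoted in this paragraph follow directly from the reported counts. The short Python check below recomputes them as a worked example; it uses only numbers stated in the text, and the variable names are chosen for illustration.

```python
# Record counts reported for ecozone-assigned (georeferenced) records.
terrestrial_records = 2_213_305
aquatic_records = 100_894
total_records = terrestrial_records + aquatic_records

# Percentage of records from land and from water.
terrestrial_share = 100 * terrestrial_records / total_records
aquatic_share = 100 * aquatic_records / total_records

# Fold ranges between the best- and least-sampled ecozones.
terrestrial_fold = 667_342 / 6_023   # Mixedwood Plains vs Taiga Shield
marine_fold = 22_062 / 29            # Gulf of Saint Lawrence vs Arctic Archipelago
overall_fold = 667_342 / 29          # across all ecozones

# Ratio of BINs to named species.
terrestrial_bins_per_species = 78_139 / 23_775
aquatic_bins_per_species = 12_535 / 6_117

# Fold difference in records per 100 km² between land and water.
density_fold = 22.4 / 1.7

print(f"shares: {terrestrial_share:.1f}% land, {aquatic_share:.1f}% water")
print(f"fold ranges: {terrestrial_fold:.0f} (land), {marine_fold:.0f} (marine), "
      f"{overall_fold:.0f} (overall)")
print(f"BINs per species: {terrestrial_bins_per_species:.2f} (land), "
      f"{aquatic_bins_per_species:.2f} (water)")
print(f"areal density ratio: {density_fold:.1f}")
```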
Despite the low sampling effort geographically, 60.5% of the marine animal species in CaRMS (
n = 6373) were represented in BOLD by at least one record (
Table 3). Most of these taxa possessed a high-quality sequence as 57.8% of the species on the CaRMS list were represented by at least one BIN (
Fig. 5;
Table 3) with an average of 39.3 records (±115 SD) per species, but 9.7% had just a single record. On average, 45.5% of known species for the 23 phyla in CaRMS possessed a BIN on BOLD. One phylum (Nematomorpha) lacked coverage, while all four species of Phoronida and 88.1% of the 1834 species of Chordata had coverage (
Table 3). Among the 151 601 records for Canadian marine taxa, 74.7% derived from two phyla (
Fig. 5;
Table 3): Chordata (75 881 records) and Arthropoda (37 304 records). Coverage for the major marine groups was 54.4% for Annelida (55.3% for Polychaeta); 51.7% for Arthropoda (51.1% for Crustacea, 90.9% for Balanomorpha, 81.4% for Decapoda, and 54.3% for Peracarida); 88.1% for Chordata (94.6% for Tetrapoda, 92.0% for Chondrichthyes, 38.7% for Tunicata, and 89.0% for Actinopterygii); 39.2% for Echinodermata (33.4% for Asteroidea, 100% for Crinoidea, 47.2% for Holothuroidea, and 76.0% for Ophiuroidea); and 45.5% for Mollusca (45.9% for Bivalvia, 43.3% for Gastropoda, and 72.7% for Polyplacophora) (
Tables 3 and
4). Family-level coverage averaged 56.3% (±42.3%) across phyla, and the highest ratio of BINs to species was 39 for one species in the family Hippothoidae (Bryozoa) (Supp. Data Table S1).
Discussion and conclusions
Considerable progress has been made in assembling a DNA barcode reference library for the Canadian fauna, but 95.6% of the 2.3 million georeferenced records derive from terrestrial settings versus 4.4% from aquatic environments. On an areal basis, the number of records differs 13-fold between these two environments (22.4 versus 1.7 records per 100 km²). Viewed from a taxonomic perspective, BOLD has barcodes for 57.8% of described Canadian marine species, but certain phyla were underrepresented or absent. We highlight several important gaps that warrant targeted efforts going forward.
Our geographic analysis revealed major gaps in coverage for marine ecozones and northern regions. Marine records were concentrated along coastlines and inshore areas close to population centers, reflecting the fact that sampling remote marine settings requires substantial resources (
Archambault et al. 2010). For this reason, northerly ecozones have received little attention, as evidenced by two Arctic marine ecozones (Arctic Basin and Arctic Archipelago) with fewer than 100 records (74 and 29, respectively). Significant gaps in barcode coverage were also evident in northern terrestrial ecozones. More than 0.5 million records derive from the Mixedwood Plains, Canada’s most southerly ecozone, while five of six arctic and subarctic ecozones had fewer than 30 000 records each. A direct comparison of sampling effort among biomes and ecozones is difficult because the distribution of organisms is not uniform. Since the number of records from remote regions may be doubly impacted by low biodiversity and by the concentration of sampling in more easily accessible areas, environments such as the open ocean and the North should be prioritized for sampling. Addressing gaps in northern areas experiencing rapid warming is crucial for understanding the impacts of climate change (
WWF 2022;
IPCC 2023).
Sampling continued to capture new BINs as more specimens were analyzed in both terrestrial and aquatic ecozones. Despite the higher number of BINs relative to species, the terrestrial curves were closer to an asymptote than the aquatic curves, reflecting the extensive sampling effort on land. Aquatic records had twice as many BINs as species, a value higher than the 1.5 BINs per species found in a similar analysis in Europe (
Leite et al. 2020). This difference may be driven by a few groups with many BINs per species, such as the bryozoan family Hippothoidae. The summary of barcode coverage by family (Supp. Data Table S1) provides a roadmap for prioritizing future efforts where the concentration of genetic diversity is likely greatest.
Our summary of coverage by ecozone did not recover the same data as our summary by taxonomy. In fact, our geographic summary recovered 6117 aquatic species while our taxonomic summary recovered records for just 3858 species. Given this discordance, our sampling may be even lower for marine ecozones than suggested by our geographic analysis. The additional taxa attributed to aquatic ecozones in the geographic analysis likely included some terrestrial specimens collected along shorelines. As well, our aquatic ecozone analysis included the freshwater Great Lakes ecozone while the taxonomic analysis was restricted to marine taxa. Finally, some species within the taxonomic dataset with ranges extending either southwards or into Alaska may have been collected in the USA. While specimens collected elsewhere remain important as reference sequences, they may differ genetically from their Canadian conspecifics. Future work could assess whether any species have only been collected outside of Canada so sampling efforts can prioritize these groups.
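The follow-up check suggested above, flagging checklist species whose reference barcodes all come from outside Canada, amounts to grouping records by species and inspecting their collection countries. The Python sketch below illustrates the idea with hypothetical records; the field layout and country labels are assumptions, not the structure of an actual BOLD export.

```python
from collections import defaultdict

# Hypothetical records: (accepted species name, country of collection).
records = [
    ("Species a", "Canada"),
    ("Species a", "United States"),
    ("Species b", "United States"),
    ("Species c", "Canada"),
]

# Collect the set of collection countries for each species.
countries_by_species = defaultdict(set)
for species, country in records:
    countries_by_species[species].add(country)

# Species with barcodes but no Canadian-collected specimen.
no_canadian_record = sorted(
    species
    for species, countries in countries_by_species.items()
    if "Canada" not in countries
)
print(no_canadian_record)
```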
The barcode coverage for described Canadian marine species from our taxonomic analysis is encouraging.
Weigand et al. (2019) reported barcodes for just 22% of European marine species, while
Leite et al. (2020) found 37% coverage within Atlantic Iberia. Coverage within major marine phyla was also lower in Europe than Canada (
Weigand et al. 2019;
Leite et al. 2020). However, among phyla with 50 or more described species, Nematoda and Platyhelminthes were severely under-represented in Canada. In addition, the checklist of marine animals for Canada is suspiciously short: it includes just 6373 species compared with Europe’s 16 962 species (
Weigand et al. 2019) for a much smaller area. Certainly, taxonomic studies on the European marine fauna have been underway for far longer than similar work in Canada, and the number of taxonomists is much higher (
Archambault et al. 2010). Accordingly, the apparently high coverage for the Canadian marine fauna may be an artefact of an incomplete species checklist rather than more comprehensive DNA barcode coverage.
Additional limitations of current barcode coverage are the small number of public records for many species and discordances in taxonomic assignments. While our analysis did not include private records, coverage would be substantially improved by their inclusion (
Weigand et al. 2019). Understanding species distributions, abundances, and trends over time requires multiple records per species (
Hochkirch et al. 2021;
Cowie et al. 2022), but many species on BOLD are currently represented by a single record (
Weigand et al. 2019). For organisms with a broad geographic range, 14–25 records may be needed to evaluate the extent of regional variation (
van Proosdij et al. 2016). Acquiring multiple barcodes per species is especially important for groups with high intraspecific variation or multiple BINs per species, where multiple sequences are needed to understand barcode variation (
Meyer and Paulay 2005;
Leite et al. 2020). A review of major marine invertebrate groups on BOLD found that 24% of species include misidentifications, ambiguities, or discordances, often reflecting taxonomic challenges (
Radulovici et al. 2021). Addressing these limitations is crucial to enhance the effectiveness of DNA barcode reference libraries (
Weigand et al. 2019;
Fontes et al. 2021).
As arthropods are, by far, the most diverse animal phylum (
Larsen et al. 2017), most Canadian barcoding efforts have focused on them (
Pentinsaari et al. 2020;
Young et al. 2021;
Lowe et al. 2022). The resulting records address a critical need for biodiversity monitoring and conservation planning for species weakly represented in frameworks such as the IUCN Red List (
IUCN 2023), which is focused on mammals and birds (
Hochkirch et al. 2021). Geographic coordinates enhance the value of barcode data for tracking species distributions, range size, regional differences, invasive species, and susceptibility to anthropogenic pressures (
Pimm et al. 2014;
Seebens et al. 2021). Combining large-scale biodiversity databases like BOLD with spatial data on land and sea will provide a powerful tool for advancing progress towards biodiversity protection targets (
Pimm et al. 2014).
In conclusion, this study has documented significant progress in building a DNA barcode reference library for the Canadian fauna, but it has also revealed geographic and taxonomic gaps. Marine environments, particularly northern and open ocean regions, require far more attention. Our results provide a roadmap for DNA barcoding efforts to address data gaps and to improve the quality of reference libraries. Potential paths for expanding taxonomic and regional coverage include dedicated initiatives that prioritize indicator species and target phylogenetic gaps (
Weigand et al. 2019) and the use of regular offshore surveys conducted by federal agencies such as Fisheries and Oceans Canada. This strategic approach will advance the scalability and cost-effectiveness of the molecular biosurveillance programs crucial for global biodiversity assessment and conservation (
Compson et al. 2020;
Grant et al. 2021;
Ray et al. 2021).