Difference between revisions of "Taxonomic datasets"

From TETTRIs

Revision as of 11:58, 22 August 2025

Repositories

The sites listed here host datasets that were uploaded or imported from various sources. All of them also provide an API and name matching services for the datasets they include. Only ChecklistBank and GBIF also allow the datasets themselves to be downloaded.

ChecklistBank (CLB)

CLB was developed by the Catalogue of Life (COL) and the Global Biodiversity Information Facility (GBIF). It is a repository holding a very large number of individual datasets, ranging from large checklists (like COL or World Flora Online editions) to data extracted from individual publications (taxonomic treatments). The latter, provided by Plazi, form the bulk of the submissions (as of August 2025, more than 58,000 datasets of a total of about 61,100). The functioning of the portal is documented in a tutorial for users, and the code is managed on GitHub.
Downloads: All data can be downloaded, either in the original format or in one of several other formats. Downloading requires a login with a free GBIF user account. Parts of a checklist can be downloaded by selecting a root taxon (e.g. a genus within the checklist) or a specific taxonomic rank.

Global Biodiversity Information Facility (GBIF)

GBIF stores uploaded taxonomic datasets, assigns each a DOI and makes them available for download. As of August 2025, nearly 61,000 datasets are listed. Metadata descriptions are usually comprehensive. The datasets have been used to assemble the GBIF Backbone Taxonomy (see under datasets below) as a "single, synthetic management classification with the goal of covering all names GBIF is dealing with". The GBIF Backbone Taxonomy will be replaced by the Catalogue of Life eXtended Release (COL XR, also see below) in the future [pers. comm. *** aug 2025].
Downloads: All datasets can be downloaded as Darwin Core Archive (DwC-A) files.
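A downloaded Darwin Core Archive is a zip file whose meta.xml descriptor names the core data file and maps its columns to Darwin Core terms. The following is a minimal sketch of reading the core records with the Python standard library. The function name read_core_rows is hypothetical, and the sketch assumes an unquoted, delimited core file with a single location entry; it ignores extension files and default values.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by the meta.xml descriptor of a Darwin Core Archive.
NS = "{http://rs.tdwg.org/dwc/text/}"


def read_core_rows(archive):
    """Yield one dict per record of the core file (hypothetical helper).

    `archive` is a path or a binary file-like object holding the zip file.
    """
    with zipfile.ZipFile(archive) as zf:
        meta = ET.fromstring(zf.read("meta.xml"))
        core = meta.find(NS + "core")
        location = core.find(f"{NS}files/{NS}location").text.strip()

        # meta.xml usually spells a tab as the two characters backslash + t.
        raw_delim = core.get("fieldsTerminatedBy", ",")
        delimiter = {"\\t": "\t"}.get(raw_delim, raw_delim)
        skip = int(core.get("ignoreHeaderLines", "0"))

        # Collect (column index, short name) pairs: the record id plus
        # each mapped term, named by the last segment of its term URI.
        columns = []
        id_el = core.find(NS + "id")
        if id_el is not None:
            columns.append((int(id_el.get("index")), "id"))
        for fld in core.findall(NS + "field"):
            if fld.get("index") is not None:
                name = fld.get("term").rsplit("/", 1)[-1]
                columns.append((int(fld.get("index")), name))

        with zf.open(location) as fh:
            text = io.TextIOWrapper(
                fh, encoding=core.get("encoding", "utf-8"), newline=""
            )
            reader = csv.reader(text, delimiter=delimiter, quoting=csv.QUOTE_NONE)
            for line_no, row in enumerate(reader):
                if line_no < skip:
                    continue
                yield {name: row[i] for i, name in columns if i < len(row)}
```

Because the function is a generator, large archives can be processed record by record, for example with `for row in read_core_rows("dataset.zip"): ...`, where "dataset.zip" stands for any locally saved archive.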

Global Names Architecture (GNA)

The Global Names Architecture (GNA) is a system of web services that helps people to register, find, index, check and organize biological scientific names and to interconnect online information about species. Its name matching tool (Global Names Verifier) draws on a number of imported taxonomic datasets, which are updated regularly where possible [checked 22 aug 2025]. Name matching can be restricted to one or more of these datasets. However, the datasets cannot be downloaded directly from the GNA website.

Botanical Information and Ecology Network (BIEN)

BIEN is a network based at the National Center for Ecological Analysis and Synthesis (NCEAS). BIEN aims at integrating global botanical data (see https://bien.nceas.ucsb.edu/bien/about/). For its name matching service (Taxonomic Name Resolution Service, TNRS) it hosts botanical datasets that can be selected for the matching process.

Taxonomic datasets

These are structured lists of scientific names that follow a single classification system. They typically present a hierarchical, tree-like taxonomy in which each taxon represents a node. Each scientific name is either assigned as the accepted name of a taxon or treated as a synonym (except in cases where the name exists but cannot currently be resolved).
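The structure described above can be sketched as a small in-memory model: taxa are nodes of a tree, and every name is either the accepted name of a taxon, a synonym pointing to one, or unresolved. The class and method names below (Taxon, Checklist, resolve, classification) are hypothetical and serve illustration only; real datasets carry many more fields (authorship, identifiers, references).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Taxon:
    """One node of the classification tree (illustrative model)."""
    accepted_name: str
    rank: str
    parent: Optional["Taxon"] = None
    children: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)


class Checklist:
    """A single classification: every name has exactly one status."""

    def __init__(self):
        self._accepted = {}       # accepted name -> Taxon
        self._synonyms = {}       # synonym -> Taxon it belongs to
        self._unresolved = set()  # names that cannot currently be placed

    def add_taxon(self, name, rank, parent=None):
        taxon = Taxon(name, rank, parent)
        if parent is not None:
            parent.children.append(taxon)
        self._accepted[name] = taxon
        return taxon

    def add_synonym(self, name, taxon):
        taxon.synonyms.append(name)
        self._synonyms[name] = taxon

    def add_unresolved(self, name):
        self._unresolved.add(name)

    def resolve(self, name):
        """Return (status, accepted name or None) for a known name."""
        if name in self._accepted:
            return ("accepted", name)
        if name in self._synonyms:
            return ("synonym", self._synonyms[name].accepted_name)
        if name in self._unresolved:
            return ("unresolved", None)
        raise KeyError(name)

    def classification(self, name):
        """Return the accepted names from the root down to the taxon."""
        _status, accepted = self.resolve(name)
        if accepted is None:
            return []
        path = []
        node = self._accepted[accepted]
        while node is not None:
            path.append(node.accepted_name)
            node = node.parent
        return path[::-1]
```

As a usage example, after adding Solanaceae, Solanum and Solanum lycopersicum as nested taxa and registering Lycopersicon esculentum as a synonym of the species, `resolve("Lycopersicon esculentum")` returns the status "synonym" together with the accepted name, and `classification` returns the path from family to species.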

Catalogue of Life (COL)

Dataset description: https://www.catalogueoflife.org/about/catalogueoflife
Scope: All organisms / global
Downloads: The entire COL in its latest version can be downloaded from https://www.catalogueoflife.org/data/download in ColDP Archive, Darwin Core Archive, ACEF Archive, or TextTree format. COL-ChecklistBank also offers partial downloads (in various formats, with DOI); this requires a free GBIF user account.

Encyclopedia of Life (EOL)

Dataset description: https://eol.org/docs/eol-dynamic-hierarchy
Scope: All organisms / global
Downloads: "EOL Dynamic Hierarchy Version 2.2", format: tsv. EOL Dynamic Hierarchy Active Version, format tsv (".tab"). See also https://eol.org/docs/what-is-eol/data-services

Fungal Names

Dataset description: https://nmdc.cn/fungalnames/toabout
Scope: Fungi, global
Downloads: see https://nmdc.cn/fungalnames/towebservice

Integrated Taxonomic Information System (ITIS)

Dataset description: https://www.itis.gov/about_itis.html
Scope: All organisms / US and global
Downloads: Up to 32,727 records of a specific taxonomic group can be downloaded in Taxonomic Workbench format or as DwC-A - see https://www.itis.gov/access.html.

MycoBank

Dataset description: Crous & al. 2004
Scope: Fungi, global
Downloads: An Excel version of the list of taxa present in MycoBank (export date: 13th of January 2025) can be downloaded from https://www.mycobank.org/Images/MBList.zip.

NCBI Taxonomy

Dataset description: Schoch 2011 / 2021
Scope: All organisms / global
Downloads: The full taxonomy database, along with files associating nucleotide and protein sequence records with their taxonomy IDs, can be downloaded from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
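The taxonomy dump offered there is distributed as ".dmp" text files in which fields are separated by a tab, a vertical bar and a tab, and each line ends with a tab and a vertical bar. The following minimal sketch assumes the files names.dmp (tax_id, name, unique name, name class) and nodes.dmp (tax_id, parent tax_id, rank, further fields) have already been extracted from the archive; the function names are hypothetical.

```python
def parse_dmp_line(line):
    """Split one line of an NCBI .dmp file into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-line terminator
    return line.split("\t|\t")


def load_names(lines):
    """Map tax_id -> scientific name, skipping other name classes."""
    names = {}
    for line in lines:
        tax_id, name_txt, _unique_name, name_class = parse_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name_txt
    return names


def load_nodes(lines):
    """Map tax_id -> (parent tax_id, rank); further fields are ignored."""
    nodes = {}
    for line in lines:
        tax_id, parent_id, rank = parse_dmp_line(line)[:3]
        nodes[int(tax_id)] = (int(parent_id), rank)
    return nodes


def lineage(tax_id, nodes, names):
    """Return (rank, name) pairs from the root down to tax_id.

    The root node is recognised by being its own parent.
    """
    path = []
    while True:
        parent_id, rank = nodes[tax_id]
        path.append((rank, names.get(tax_id, str(tax_id))))
        if parent_id == tax_id:
            break
        tax_id = parent_id
    return path[::-1]
```

With the extracted files in the working directory, the tables could be loaded with `load_names(open("names.dmp", encoding="utf-8"))` and `load_nodes(open("nodes.dmp", encoding="utf-8"))`; both functions accept any iterable of lines.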

(The) Paleobiology Database

Dataset description: Uhen & al. 2023
Scope: All organisms (fossils only) / global
Downloads: Records can be downloaded in several formats using the Download Generator.

World Checklist of Vascular Plants (WCVP)

Dataset description: Govaerts & al. 2021
Scope: Vascular plants
Downloads: According to Rafael Govaerts (pers. comm. July 17, 2024), the primary place for the WCVP dataset is POWO, under WCVP "DATA". Both text (CSV) and Darwin Core downloads are available, with citation and other metadata provided in a readme.txt and in the eml.xml file, respectively. The latest version of the WCVP data can also be checked and downloaded via ChecklistBank (the metadata there are less precise).

World Flora Online (WFO) Plant List

Dataset description: https://wfoplantlist.org/background
Scope: Plants (Vascular plants and "bryophytes")
Downloads: Available for the published versions (currently updated every six months and published via Zenodo, with DOI). The current version can be found at https://zenodo.org/records/8079052. If a later version is available, this is indicated at the top of that page.
The following formats are available (example for the December 2022 version):

  • wfo_plantlist_2022-12.zip The Catalogue of Life Data Package of the WFO Plant List. This is the most expressive standards based form of the list.
  • plant_list_2022-12.json.zip JSON formatted version of the WFO Plant List. This has been designed for direct import into a schemaless instance of a SOLR index and is used to drive the WFO Plant List API (https://list.worldfloraonline.org) which in turn drives the WFO Plant List in the portal. This is recommended if you want a local, read only version of the list rather than use the API.
  • plant_list_2022-12.sql.gz This is the complete production database (minus logging data and API keys) as a MySQL backup file. It can be restored directly to a MySQL 5.7 or later instance if you require the list in SQL format.
  • ipni_to_wfo.csv.gz A file mapping all the IPNI IDs we track to their associated WFO IDs.
  • families_dwc.tar.gz Individual Darwin Core Archive files for each of 718 recognized families. If you want a single family in DwC but can't load the whole list, download and expand this file. Family and genus files are also available for download through the portal.
  • DwC_backbone_R.zip A single Darwin Core Archive file containing non-deprecated names and taxa for use in the existing R package.
  • _uber.zip A single Darwin Core Archive file containing all names and taxa, even those that are deprecated, along with some extra columns.

Weekly updated DwC-Archive files for all families and for the _uber.zip are available at https://list.worldfloraonline.org/rhakhis/api/downloads/dwc/

Nomenclators

Nomenclators are a special class of datasets that systematically catalog scientific names along with their authorship, publication date and references, nomenclatural status, and (sometimes) type information. They focus on nomenclatural accuracy, such as correct spelling and nomenclatural validity, rather than on providing taxonomic opinion or classification.