Taxonomic datasets

From TETTRIs
Revision as of 15:45, 21 August 2025 by WalterBerendsohn (talk | contribs) (Fungal Names)

Repositories

Repositories are sites that host uploaded or imported datasets from various sources. All listed here also provide some kind of API and name matching services.

ChecklistBank (CLB)

CLB was developed by the Catalogue of Life (COL) and the Global Biodiversity Information Facility (GBIF). It is a repository holding a very large number of individual datasets, ranging from large checklists (such as COL or World Flora Online editions) to data extracted from individual publications (taxonomic treatments). The latter, provided by Plazi, form the bulk of the submissions (as of February 2025, nearly 55,000 of about 57,500 datasets in total). How the portal works is documented in a tutorial for users, and the code is managed on GitHub.
Downloads: All data can be downloaded, either in the original format or in various other formats. Downloading requires a login with a free GBIF user account. Parts of a checklist can be downloaded by selecting a root taxon (e.g. a genus within the checklist) or a specific taxonomic rank.

Taxonomic datasets

These are structured lists of scientific names that follow a single classification system. They typically present a hierarchical, tree-like taxonomy in which each taxon represents a node. Each scientific name is either assigned as the accepted name of a taxon or treated as a synonym (except in cases where the name exists but cannot currently be resolved).
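
The structure described above can be sketched as a small data model. This is a minimal illustration in Python, not the schema of any dataset listed on this page: the class Taxon, its fields, and the helper resolve are hypothetical names, and the tomato example is included only to show an accepted name with one synonym.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)  # compare taxa by identity; avoids recursing through parent/children
class Taxon:
    """One node of the classification tree (hypothetical model)."""
    accepted_name: str
    rank: str
    synonyms: List[str] = field(default_factory=list)
    parent: Optional["Taxon"] = field(default=None, repr=False)
    children: List["Taxon"] = field(default_factory=list, repr=False)

    def add_child(self, child: "Taxon") -> "Taxon":
        """Attach a child taxon below this node and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def lineage(self) -> List[str]:
        """Accepted names from the root of the tree down to this taxon."""
        path = []
        node: Optional[Taxon] = self
        while node is not None:
            path.append(node.accepted_name)
            node = node.parent
        return list(reversed(path))


def resolve(name: str, root: Taxon) -> Optional[Tuple[Taxon, str]]:
    """Find the taxon a name belongs to and whether it is accepted or a synonym.

    Returns None for names that cannot be resolved within this tree.
    """
    stack = [root]
    while stack:
        node = stack.pop()
        if name == node.accepted_name:
            return node, "accepted"
        if name in node.synonyms:
            return node, "synonym"
        stack.extend(node.children)
    return None


# Illustrative data only: one family, one genus, one species with a synonym.
family = Taxon("Solanaceae", "family")
genus = family.add_child(Taxon("Solanum", "genus"))
tomato = genus.add_child(
    Taxon("Solanum lycopersicum", "species", synonyms=["Lycopersicon esculentum"])
)
```

Calling resolve("Lycopersicon esculentum", family) returns the tomato node together with the status "synonym", while a name that appears nowhere in the tree yields None, mirroring the unresolved case mentioned above.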

Catalogue of Life (COL)

Dataset description: https://www.catalogueoflife.org/about/catalogueoflife
Scope: All organisms / global
Downloads: The latest version of the entire COL can be downloaded from https://www.catalogueoflife.org/data/download as a ColDP Archive, Darwin Core Archive, ACEF Archive, or in TextTree format. ChecklistBank also offers partial downloads of COL (in various formats, with DOI); this requires a free GBIF user account.

Encyclopedia of Life (EOL)

Dataset description: https://eol.org/docs/eol-dynamic-hierarchy
Scope: All organisms / global
Downloads: "EOL Dynamic Hierarchy Version 2.2" (format: tsv) and "EOL Dynamic Hierarchy Active Version" (format: tsv, ".tab"). See also https://eol.org/docs/what-is-eol/data-services

Fungal Names

Dataset description: https://nmdc.cn/fungalnames/toabout
Downloads: see https://nmdc.cn/fungalnames/towebservice

Integrated Taxonomic Information System (ITIS)

Downloads: Up to 32,727 records of a specific taxonomic group can be downloaded in Taxonomic Workbench format or as DwC-A - see https://www.itis.gov/access.html.

MycoBank

https://www.mycobank.org
Downloads: An Excel version of the list of taxa present in MycoBank (export date: 13th of January 2025) can be downloaded from https://www.mycobank.org/Images/MBList.zip.

NCBI Taxonomy

Downloads: The full taxonomy database, along with files associating nucleotide and protein sequence records with their taxonomy IDs, is available at https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
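
As a hedged sketch of how such a dump can be read: the taxonomy dump archives contain ".dmp" text files in which fields are separated by a tab, a vertical bar and a tab, and each row ends with a tab and a vertical bar. The Python below assumes the four-column layout of names.dmp (taxonomy ID, name, unique name, name class); the function names parse_dmp and scientific_names are hypothetical, and the two sample rows stand in for a real file.

```python
import io
from typing import Dict, Iterable, Iterator, List

FIELD_SEPARATOR = "\t|\t"  # between fields
ROW_TERMINATOR = "\t|"     # at the end of each row, before the newline


def parse_dmp(lines: Iterable[str]) -> Iterator[List[str]]:
    """Split each row of a .dmp file into its fields."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.endswith(ROW_TERMINATOR):
            line = line[: -len(ROW_TERMINATOR)]
        yield line.split(FIELD_SEPARATOR)


def scientific_names(lines: Iterable[str]) -> Dict[int, str]:
    """Map taxonomy IDs to their scientific name.

    Assumes the names.dmp column order: tax_id, name, unique name, name class.
    """
    names = {}
    for tax_id, name_txt, _unique_name, name_class in parse_dmp(lines):
        if name_class == "scientific name":
            names[int(tax_id)] = name_txt
    return names


# Two sample rows in names.dmp layout, used instead of a downloaded file.
SAMPLE = (
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n"
)

names = scientific_names(io.StringIO(SAMPLE))
```

With a real download, the same function can be given an open file handle, for example scientific_names(open("names.dmp", encoding="utf-8")), where the path is whatever location the archive was extracted to.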

(The) Paleobiology Database

Downloads: Records can be downloaded in several formats using the Download Generator.

World Checklist of Vascular Plants (WCVP)

Downloads: According to Rafael Govaerts (pers. comm., 17 July 2024), the primary source for the WCVP dataset is POWO, under WCVP "DATA". Both text (csv) and Darwin Core downloads are available, with citation and other metadata provided in a readme.txt and the eml.xml file, respectively. The latest version of the WCVP data can also be checked and downloaded via ChecklistBank (the metadata there are less precise).

World Flora Online (WFO) Plant List Downloads

Downloads: Available for the published versions (currently updated every six months and published via Zenodo, with DOI). The current version can be found at https://zenodo.org/records/8079052. If a later version is available, this is indicated at the top of the page.
The following formats are available (example for the December 2022 version):

  • wfo_plantlist_2022-12.zip The Catalogue of Life Data Package of the WFO Plant List. This is the most expressive standards-based form of the list.
  • plant_list_2022-12.json.zip JSON-formatted version of the WFO Plant List. This has been designed for direct import into a schemaless instance of a SOLR index and is used to drive the WFO Plant List API (https://list.worldfloraonline.org), which in turn drives the WFO Plant List in the portal. This is recommended if you want a local, read-only version of the list rather than using the API.
  • plant_list_2022-12.sql.gz This is the complete production database (minus logging data and API keys) as a MySQL backup file. It can be restored directly to a MySQL 5.7 or later instance if you require the list in SQL format.
  • ipni_to_wfo.csv.gz A file mapping all the IPNI IDs we track to their associated WFO IDs.
  • families_dwc.tar.gz Individual Darwin Core Archive files for each of 718 recognized families. If you want a single family in DwC but can't load the whole list, download and expand this file. Family and genus files are also available for download through the portal.
  • DwC_backbone_R.zip A single Darwin Core Archive file containing non-deprecated names and taxa for use in the existing R package.
  • _uber.zip A single Darwin Core Archive file containing all names and taxa, even those that are deprecated, along with some extra columns.

Weekly updated DwC-Archive files for all families and for the _uber.zip are available at https://list.worldfloraonline.org/rhakhis/api/downloads/dwc/
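
Several of the downloads on this page are Darwin Core Archives, so a rough sketch of reading one may be useful. The Python below assumes a zip file with a meta.xml descriptor at its top level that names a single core data file, as laid out in the Darwin Core text guidelines; the function name read_core_rows is hypothetical, and the sketch ignores extension files and default values. It has not been checked against the specific WFO files listed above, some of which are packaged as tar.gz rather than zip.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict, Iterator

# Namespace of the meta.xml descriptor in a Darwin Core Archive.
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def read_core_rows(archive) -> Iterator[Dict[str, str]]:
    """Yield the rows of the core data file as dictionaries.

    `archive` is a path or file-like object for a zip-based archive.
    Column names are the last path segment of each term URI.
    """
    with zipfile.ZipFile(archive) as zf:
        meta = ET.fromstring(zf.read("meta.xml"))
        core = meta.find("dwc:core", NS)
        location = core.find("dwc:files/dwc:location", NS).text.strip()

        # meta.xml writes the tab delimiter as the two characters "\t".
        delimiter = core.get("fieldsTerminatedBy", ",").replace("\\t", "\t")
        enclosed_by = core.get("fieldsEnclosedBy", '"')
        header_lines = int(core.get("ignoreHeaderLines", "0"))
        encoding = core.get("encoding", "utf-8")

        # Map column index -> short column name.
        columns = {}
        id_element = core.find("dwc:id", NS)
        if id_element is not None:
            columns[int(id_element.get("index"))] = "id"
        for field_element in core.findall("dwc:field", NS):
            index = field_element.get("index")
            if index is not None:  # fields without an index carry only defaults
                term = field_element.get("term")
                columns[int(index)] = term.rsplit("/", 1)[-1]

        with zf.open(location) as raw:
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            if enclosed_by:
                reader = csv.reader(text, delimiter=delimiter, quotechar=enclosed_by)
            else:
                reader = csv.reader(text, delimiter=delimiter, quoting=csv.QUOTE_NONE)
            for _ in range(header_lines):
                next(reader, None)
            for row in reader:
                yield {columns[i]: v for i, v in enumerate(row) if i in columns}
```

Because the function is a generator, a large checklist can be scanned row by row, for example with "for row in read_core_rows(path_to_archive): ...", without loading the whole file into memory.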

Nomenclators

Nomenclators are a special class of datasets that systematically catalog scientific names along with their authorship, publication date and references, nomenclatural status, and (sometimes) type information. They focus on nomenclatural accuracy, such as correct spelling and nomenclatural validity, rather than providing taxonomic opinion or classification.