Wish list for name matching services

From TETTRIs
Revision as of 20:19, 10 December 2024 by WalterBerendsohn

Terminology

  • candidates = returned partial matches, as opposed to exact matches
  • canonical name = name string without authors/year, but with rank indicator (where required)
  • asynchronous output = (here) output produced as a downloadable file after internal processing, which may take time; the user is notified when it is ready

General

Input interface

  • Allow input of a pasted column of names
  • Allow upload of a table with names (with a dialogue to select the name-bearing column[s])
  • Allow upload of any other columns alongside the names
  • Accept standard name tables (DwC:"Taxon", ColDP:"Name")
  • Accept standard data packages (DwC-A, ColDP)
  • Allow wildcards
  • Do not limit the input (in asynchronous mode)
  • Allow input of name components in separate fields (at least in uploads):
    • For ICNAFP: Monomial / genus component; infrageneric rank; infrageneric epithet; species epithet; infraspecific rank; infraspecific epithet; basionym author (team); combination author (team); year of publication.
    • For ICZN: Genus name, subgenus name, specific name, subspecific name, original combination author (team), year of publication
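The separate-field input for ICNAFP names could be represented roughly as follows. This is a minimal sketch only: the class name `BotanicalNameParts` and its field names are hypothetical and simply mirror the components listed above; they are not taken from DwC, ColDP or any existing service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BotanicalNameParts:
    """Hypothetical container for the ICNAFP name components listed above."""
    genus_or_monomial: str
    infrageneric_rank: Optional[str] = None
    infrageneric_epithet: Optional[str] = None
    species_epithet: Optional[str] = None
    infraspecific_rank: Optional[str] = None
    infraspecific_epithet: Optional[str] = None
    basionym_author: Optional[str] = None
    combination_author: Optional[str] = None
    year: Optional[str] = None

    def canonical_name(self) -> str:
        """Name string without authors/year, but with rank indicators."""
        parts = [
            self.genus_or_monomial,
            self.infrageneric_rank,
            self.infrageneric_epithet,
            self.species_epithet,
            self.infraspecific_rank,
            self.infraspecific_epithet,
        ]
        # Skip components that were not supplied
        return " ".join(p for p in parts if p)


name = BotanicalNameParts(
    genus_or_monomial="Pinus",
    species_epithet="nigra",
    infraspecific_rank="subsp.",
    infraspecific_epithet="laricio",
)
print(name.canonical_name())
```

Keeping the components apart on input means the matching step does not have to re-parse a string the user has already split.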

Initial input processing

  • Filter out terms for uncertainty (e.g. c., cf., aff., sp. prox., (?) etc.) (optional)
  • Filter out terms for species groups/aggregates/complexes (agg., species group, species complex) (optional)
  • Recognise the different structures of zoological (ICZN) and "botanical" (ICNAFP) names (see "Structure of names" below)
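The optional filtering of uncertainty and aggregate terms might look like the sketch below. The term lists contain only the examples given above and are certainly not exhaustive; the function name `strip_qualifiers` and its switches are assumptions made for illustration. Matching is case-sensitive on purpose, so that an abbreviated genus such as "C." is not mistaken for the qualifier "c.".

```python
import re

# Terms taken from the examples above; a real service would need a longer list.
UNCERTAINTY_RE = re.compile(r"(?<!\S)(?:sp\.\s+prox\.|cf\.|aff\.|c\.|\(\?\))(?!\S)")
AGGREGATE_RE = re.compile(r"(?<!\S)(?:agg\.|species\s+group|species\s+complex)(?!\S)")


def strip_qualifiers(name, uncertainty=True, aggregates=True):
    """Return (cleaned name, list of removed terms).

    Both filters are optional, as requested in the wish list.
    """
    removed = []

    def _drop(match):
        removed.append(match.group(0))
        return " "

    if uncertainty:
        name = UNCERTAINTY_RE.sub(_drop, name)
    if aggregates:
        name = AGGREGATE_RE.sub(_drop, name)
    # Collapse the whitespace left behind by removed terms
    return " ".join(name.split()), removed


print(strip_qualifiers("Quercus cf. robur"))
print(strip_qualifiers("Rubus fruticosus agg."))
```

Returning the removed terms alongside the cleaned name lets the output report what was filtered instead of silently discarding it.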

Content (aggregator)

  • Include name records with unique IDs for all names or name-like designations that have been in use in a taxonomic context (depending on the aggregator's editorial policy, some may be marked as deprecated):
    • Orthographical variants and original spellings of the canonical name
    • Erroneous name author designations (referring to a wrong place of publication, e.g. later citations ascribed to the citing author; this does not refer to variations in author abbreviations or the inclusion of ex-authors)
    • For ICNAFP: Isonyms
    • Later homonyms
    • For ICNAFP: Names with and without hybrid symbol (nomenclaturally the same)

Matching

  • If the name is entered as a string, parse the name and match the name components independently
  • For ICNAFP: Tolerate different abbreviations for infrageneric and infraspecific rank designations (e.g. "subspecies", "subsp.", "ssp.")
  • For ICNAFP: Exactly match the rank of the name, if unambiguous in the input – i.e. do not return a subspecies for a variety, a genus name for a family, etc. (default)
  • For ICNAFP: Tolerate superfluous spaces in standard author abbreviations
  • For ICZN: Exactly match the rank of the name, if unambiguous in the input – even more important here, because redundant names are allowed (Bufo bufo bufo, Meles meles)
  • Avoid returning names that are completely improbable as candidates
    • Weigh probabilities hierarchically; e.g., in a species name, a full or near full match on a genus name is more important than that of the species epithet (epithets may be used in many genera).
    • Make phonetic matching optional
  • Allow parameters for matching (see below)
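The hierarchical weighting of name components can be sketched as a weighted similarity score. Everything specific here is an assumption: the weights, the cut-off `min_score`, and the use of `difflib.SequenceMatcher` as the string similarity are placeholders chosen for illustration, not a recommendation for a particular algorithm.

```python
from difflib import SequenceMatcher

# Assumed weights: the genus counts more than the epithets.
WEIGHTS = {"genus": 0.5, "species": 0.3, "infraspecific": 0.2}


def similarity(a, b):
    """Plain string similarity in [0, 1]; stands in for any fuzzy measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(query, candidate):
    """Weighted similarity over the components present in the query."""
    total = 0.0
    weight_sum = 0.0
    for part, weight in WEIGHTS.items():
        q = query.get(part)
        if not q:
            continue  # component not given in the input
        weight_sum += weight
        total += weight * similarity(q, candidate.get(part) or "")
    return total / weight_sum if weight_sum else 0.0


def rank_candidates(query, candidates, min_score=0.5):
    """Drop improbable candidates and sort the rest, best first."""
    scored = [(score(query, c), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


query = {"genus": "Bellis", "species": "perennis"}
candidates = [
    {"genus": "Acer", "species": "rubrum"},
    {"genus": "Mercurialis", "species": "perennis"},
    {"genus": "Bellis", "species": "annua"},
]
for s, c in rank_candidates(query, candidates):
    print(round(s, 2), c["genus"], c["species"])
```

With these weights a candidate sharing the genus ranks above one that shares only the epithet, and a candidate that shares neither falls below the cut-off and is not returned.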

Interactive mode

  • Allow for interactive selection of the matching record

Output

  • Return matched records and candidates in a single table (at least in asynchronous mode)
  • Return all input columns with matching results (at least in asynchronous mode)
  • Return a stable resolvable aggregator ID for matched names
  • List candidates sorted by probability (see above)
  • Provide output in standard formats (DwC, ColDP)

Parameters for exact matches

  • optional: include deprecated records in name matching (explicit)
  • optional (ICNAFP): accept IPNI, Tropicos and fully spaced author abbreviations as exact matches (default)
  • optional (ICNAFP): ignore ex authors (the author or team preceding the ex) (explicit)
  • optional (ICNAFP): ignore hybrid symbol (or “x” space or space “x” space) in name (explicit)
  • optional (ICZN): ignore indicators for hybrids (kl. for klepton, sk. for subklepton) (default)
  • optional (ICNAFP): ignore authors in autonyms (explicit)
  • optional (ICNAFP): ignore endings in name elements (monomials, species epithet/name, infraspecific epithet/name) (explicit)
  • optional (ICZN): ignore the classic genitive mistake (-ii instead of -i, -iae instead of -ae). [This is a very common mistake. In zoology, -ii/-iae is correct only if the person's name ends in -i. In botany, the double ii is more common. There was a time when -ii was also the norm in zoology; e.g. Theraphosa blondi (correct spelling) was described as blondii, dedicated to Leblond.] (E. Saliba, pers. comm.)
  • optional: treat canonical matches as exact if only one match is found (i.e. ignore authors/year) (explicit)

Parameters for candidate matches

  • allow different ranks (explicit)
  • activate phonetic matching (explicit)
  • increase tolerance (accept wider range of near matches)
  • ...

Machine Interface

  • OpenRefine Reconciliation Interface
  • ...
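For the OpenRefine item, the sketch below shows how matched records and candidates might be packaged. It loosely follows the batch query and result shape used by the Reconciliation Service API (queries keyed `q0`, `q1`, ...; results with `id`, `name`, `score`, `match`); the exact fields should be checked against the version of the specification a service targets. The helper names and the IDs in the example are hypothetical.

```python
import json


def build_queries(names, limit=5):
    """Batch of queries keyed q0, q1, ... as a client would send them."""
    return {f"q{i}": {"query": n, "limit": limit} for i, n in enumerate(names)}


def build_response(ranked, exact_threshold=1.0):
    """Turn {query_key: [(id, name, score), ...]} into a response-shaped dict.

    Candidates are sorted by score; "match" is set only for scores that
    reach the (assumed) exact-match threshold.
    """
    response = {}
    for key, candidates in ranked.items():
        ordered = sorted(candidates, key=lambda c: c[2], reverse=True)
        response[key] = {
            "result": [
                {"id": cid, "name": name, "score": score,
                 "match": score >= exact_threshold}
                for cid, name, score in ordered
            ]
        }
    return response


queries = build_queries(["Bellis perennis"])
# Hypothetical aggregator IDs and scores, for illustration only
ranked = {"q0": [("agg:2", "Bellis annua", 0.74), ("agg:1", "Bellis perennis", 1.0)]}
print(json.dumps(build_response(ranked), indent=2))
```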

Structure of names

A botanical name strictly consists of a maximum of three canonical name elements plus a rank indicator, where appropriate. Author(s) given in parentheses indicate the basionym author(s), preceding the combination author(s).
In zoology, there is only one infraspecific rank, the subspecies, so there is no need to specify the rank. This is not open to interpretation; the Code is quite clear about it. Officially there is also only one infrageneric rank, the subgenus, but there seem to be other opinions. A discussion of the ICZN occurred in 2018; if we follow Erna Aescht’s position in Opinion 2420, “the taxonomic terms ‘section’, ‘division’ and ‘superspecies’ are given in quotation marks [in the Code], because they are not nomenclatural terms”, and the subgenus is the only possible rank in the genus series other than the genus rank.
Subgenera are usually given in parentheses when used in a full species name (Art. 6.1). The Code recommends: "No genus-group name other than a valid subgeneric name should be interpolated between a generic name and a specific name, even in square brackets or parentheses. An author who desires to refer to a former generic combination should do so in some explicit form such as 'Branchiostoma lanceolatum [formerly in Amphioxus]'."
The ICZN permits citing the combination author, but that is usually not done. Where a new combination was made, parentheses are put around the original author name and date.
[Elie Saliba, pers. comm. 27 May 2024]
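The zoological structure described above can be sketched as a small parser: the rank follows from the number of name elements, a subgenus appears in parentheses after the genus, and parentheses around author and date signal a changed combination. The regular expression and the function names are illustrative assumptions; among other limitations, the sketch would misread author names that begin with a lower-case particle.

```python
import re

ZOO_NAME_RE = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)"
    r"(?:\s+\((?P<subgenus>[A-Z][a-z]+)\))?"   # subgenus in parentheses
    r"(?:\s+(?P<species>[a-z]+))?"
    r"(?:\s+(?P<subspecies>[a-z]+))?"
    r"(?:\s+(?P<authorship>.+))?$"
)


def parse_authorship(text):
    """Split 'Author, year' and note whether it is wrapped in parentheses."""
    recombined = text.startswith("(") and text.endswith(")")
    inner = text[1:-1] if recombined else text
    m = re.match(r"^(?P<author>.+?),?\s*(?P<year>\d{4})$", inner)
    if m:
        return m.group("author"), m.group("year"), recombined
    return inner, None, recombined


def parse_zoological_name(text):
    """Return a dict of name parts, or None if the string does not fit."""
    m = ZOO_NAME_RE.match(text.strip())
    if not m:
        return None
    parts = m.groupdict()
    # The rank is implied by which elements are present
    if parts["subspecies"]:
        parts["rank"] = "subspecies"
    elif parts["species"]:
        parts["rank"] = "species"
    elif parts["subgenus"]:
        parts["rank"] = "subgenus"
    else:
        parts["rank"] = "genus"
    authorship = parts.pop("authorship")
    if authorship:
        parts["author"], parts["year"], parts["recombined"] = parse_authorship(authorship)
    else:
        parts["author"], parts["year"], parts["recombined"] = None, None, False
    return parts


print(parse_zoological_name("Bufo bufo bufo (Linnaeus, 1758)"))
print(parse_zoological_name("Aedes (Stegomyia) aegypti"))
```

Deriving the rank from the element count is what allows "Bufo bufo" and "Bufo bufo bufo" to be kept apart without a rank indicator in the input.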

Literature and Links

  • Conti, M., Nimis, P.L., Martellos, S. 2021: Match Algorithms for Scientific Names in FlorItaly, the Portal to the Flora of Italy. Plants 2021, 10, 974. https://doi.org/10.3390/plants10050974