Wish list for name matching services

From TETTRIs

Latest revision as of 09:33, 29 May 2024

Terminology

  • candidates = returned partial matches, as opposed to exact matches
  • canonical name = name string without authors/year, but with rank indicator (where required)
  • asynchronous output = (here) output produced as a downloadable file after internal processing, which may take time; the user is notified when it is ready.

General

Input

  • Allow input of a pasted column of names
  • Allow upload of a table with names (dialogue: name bearing column[s])
  • Allow upload of any other column
  • Accept standard name tables (DwC:"Taxon", ColDP:"Name")
  • Accept standard data packages (DwC-A, ColDP)
  • Allow wildcards
  • Do not limit the input (in asynchronous mode)
  • Filter out terms for uncertainty (e.g. c., cf., aff., sp. prox., (?) etc.)
  • Filter out terms for species groups/aggregates/complexes (agg., species group, species complex)
  • Filter out
  • Recognise the different structures of zoological (ICZN) and "botanical" (ICNAFP) names (see "Structure of names" below)
  • Allow input of name components in separate fields (at least in uploads):
    • For ICNAFP: Monomial / genus component; infrageneric rank; infrageneric epithet; species epithet; infraspecific rank; infraspecific epithet; basionym author (team); combination author (team); year of publication.
    • For ICZN: Genus name, subgenus name, specific name, subspecific name, original combination author (team), year of publication
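The two "filter out" items for uncertainty and aggregate terms could be handled before parsing. The following is a minimal sketch, assuming the marker lists are just the examples given above (a real service would need a curated vocabulary); the function name `strip_qualifiers` and the pattern names are hypothetical. Matching is case-sensitive so that a lower-case "c." is not confused with an author initial such as "C.".

```python
import re

# Hypothetical marker lists, copied from the examples in the wish list.
UNCERTAINTY = r"(?:cf\.|aff\.|sp\.\s+prox\.|c\.|\(\?\))"
AGGREGATE = r"(?:agg\.|species\s+group|species\s+complex)"

# A marker must stand alone: whitespace (or string boundary) on both sides.
_MARKER = re.compile(rf"(?<!\S)(?:{UNCERTAINTY}|{AGGREGATE})(?!\S)")


def strip_qualifiers(name: str) -> str:
    """Remove uncertainty and aggregate markers and collapse whitespace."""
    cleaned = _MARKER.sub(" ", name)
    return " ".join(cleaned.split())


print(strip_qualifiers("Quercus cf. robur"))
print(strip_qualifiers("Rubus fruticosus agg."))
```

A service may prefer to keep the removed marker in a separate output column rather than discard it, so that the uncertainty expressed by the user is not lost.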

Interactive mode

  • Allow for interactive selection of the matching record

Matching

  • If the name is entered as a string, parse the name and match the name components independently
  • For ICNAFP: Tolerate different abbreviations for infrageneric and infraspecific rank designations (e.g. "subspecies", "subsp.", "ssp.")
  • For ICNAFP: Exactly match the rank of the name, if unambiguous in the input – i.e. do not return a subspecies for a variety, a genus name for a family, etc. (default)
  • For ICZN: Exactly match the rank of the name, if unambiguous in the input – even more important here, because names that repeat the same word (tautonyms) are allowed (Bufo bufo bufo, Meles meles)
  • Avoid returning names that are completely improbable as candidates
    • Weigh probabilities hierarchically; e.g., in a species name, a full or near full match on a genus name is more important than that of the species epithet (epithets may be used in many genera).
    • Make phonetic matching optional
  • Allow parameters for matching (see below)
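The hierarchical weighting described above can be sketched as a weighted average of per-element similarities. This is only an illustration under stated assumptions: the weights, the threshold of 0.8 and the use of Python's `difflib.SequenceMatcher` as the similarity measure are hypothetical choices, not part of the wish list.

```python
from difflib import SequenceMatcher

# Hypothetical weights: the genus counts most, then the species epithet,
# then the infraspecific epithet.
WEIGHTS = {"genus": 0.5, "species": 0.3, "infraspecies": 0.2}


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(query: dict, candidate: dict) -> float:
    """Weighted similarity over the name elements present in the query."""
    total = 0.0
    weight_sum = 0.0
    for part, weight in WEIGHTS.items():
        value = query.get(part)
        if not value:
            continue  # element not given in the query, so it is not scored
        total += weight * similarity(value, candidate.get(part, ""))
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def rank_candidates(query: dict, candidates: list, threshold: float = 0.8) -> list:
    """Drop improbable candidates and sort the rest by descending score."""
    scored = [(score(query, c), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


query = {"genus": "Quercas", "species": "alba"}  # genus misspelt
candidates = [
    {"genus": "Morus", "species": "alba"},
    {"genus": "Quercus", "species": "alba"},
]
for s, c in rank_candidates(query, candidates):
    print(round(s, 2), c["genus"], c["species"])
```

With these weights, Morus alba is rejected although its epithet matches exactly, because the genus is far from the query; Quercus alba is kept as a candidate.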

Output

  • Return matched records and candidates in a single table (at least in asynchronous mode)
  • Return all input columns with matching results (at least in asynchronous mode)
  • Return a stable resolvable aggregator ID for matched names
  • List candidates sorted by probability (see above)
  • Provide output in standard formats (DwC, ColDP)
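A single result table that keeps all input columns might look like the sketch below. The result column names (`match_type`, `matched_name`, `matched_id`, `score`) and the `example:` identifiers are assumptions made for illustration; the wish list does not prescribe them.

```python
import csv
import io

# Hypothetical result columns appended to whatever columns the user uploaded.
RESULT_COLUMNS = ["match_type", "matched_name", "matched_id", "score"]


def write_results(input_columns, input_rows, hits_per_row, out):
    """Write one row per hit, repeating all input columns for each hit."""
    writer = csv.DictWriter(out, fieldnames=list(input_columns) + RESULT_COLUMNS)
    writer.writeheader()
    for row, hits in zip(input_rows, hits_per_row):
        if not hits:
            # Keep unmatched input rows in the output as well.
            writer.writerow({**row, "match_type": "none"})
            continue
        # Exact matches first, then candidates by descending score.
        ordered = sorted(hits, key=lambda h: (h["match_type"] != "exact", -h["score"]))
        for hit in ordered:
            writer.writerow({**row, **hit})


rows = [
    {"name": "Quercus robur", "locality": "Berlin"},
    {"name": "Quercas alba", "locality": "Ohio"},
]
hits = [
    [{"match_type": "exact", "matched_name": "Quercus robur",
      "matched_id": "example:1", "score": 1.0}],
    [{"match_type": "candidate", "matched_name": "Quercus alba",
      "matched_id": "example:2", "score": 0.91}],
]
buffer = io.StringIO()
write_results(["name", "locality"], rows, hits, buffer)
print(buffer.getvalue())
```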

Parameters for exact matches

  • optional (ICNAFP): accept IPNI, Tropicos and full spaced author abbreviations as exact matches (default)
  • optional (ICNAFP): ignore ex authors (the author or team preceding the ex) (explicit)
  • optional (ICNAFP): ignore hybrid symbol (“×”, or “x” followed by a space, or “x” with a space on both sides) in name (explicit)
  • optional (ICZN): ignore indicators for hybrids (kl. for klepton, sk. for subklepton) (default)
  • optional (ICNAFP): ignore authors in autonyms (explicit)
  • optional (ICNAFP): ignore endings in name elements (monomials, species epithet/name, infraspecific epithet/name) (explicit)
  • optional (ICZN): ignore the classic genitive mistake (-ii instead of -i, -iae instead of -ae) [it's such a common mistake. In zoology, we would only have -ii/-iae if someone's name ends in -i. In botany, the use of the double ii is more common. There was a time when the -ii was also the norm in zoology (e.g. Theraphosa blondi [correct spelling] was described as blondii, dedicated to Leblond)] (E. Saliba, pers. comm.)
  • optional: treat canonical matches as exact if only one match is found (i.e. ignore authors/year) (explicit)
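Several of these options amount to normalising both the input and the reference name before comparing them. The following is a rough sketch of three such normalisations; the function names are hypothetical, and `drop_ex_author` assumes it receives a single author team without parentheses (for example the content of one author field).

```python
import re


def normalise_genitive(epithet: str) -> str:
    """Collapse -ii/-i and -iae/-ae to a single form (ICZN option)."""
    if epithet.endswith("iae"):
        return epithet[:-3] + "ae"
    if epithet.endswith("ii"):
        return epithet[:-2] + "i"
    return epithet


def strip_hybrid_marker(name: str) -> str:
    """Remove the hybrid sign, or a stand-alone "x" followed by a space."""
    without_sign = name.replace("×", " ")
    without_x = re.sub(r"(?<!\S)[xX]\s+", "", without_sign)
    return " ".join(without_x.split())


def drop_ex_author(author_team: str) -> str:
    """Keep only the author (team) that follows the last " ex "."""
    return author_team.rsplit(" ex ", 1)[-1]


print(normalise_genitive("blondii"))
print(strip_hybrid_marker("Mentha x piperita"))
print(drop_ex_author("Hook. ex Benth."))
```

Because `normalise_genitive` is applied to both sides of the comparison, epithets that legitimately end in -ii or -iae are also collapsed; this is why the option should be switchable.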

Parameters for candidate matches

  • allow different ranks (explicit)
  • activate phonetic matching (explicit)
  • increase tolerance (accept wider range of near matches)
  • ...

Machine Interface

  • OpenRefine Reconciliation Interface
  • ...
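For the OpenRefine item, a reconciliation service receives a batch of queries and answers each with a ranked list of results carrying an id, a name, a score and a match flag. The sketch below only illustrates that general request/response shape; the in-memory `NAMES` index, the `example:` identifiers and the scoring via `difflib` are assumptions, and a real endpoint would also have to serve a service manifest and follow the exact version of the specification it targets.

```python
import json
from difflib import SequenceMatcher

# Hypothetical in-memory index standing in for the real name database.
NAMES = {
    "example:1": "Quercus robur",
    "example:2": "Quercus rubra",
    "example:3": "Morus alba",
}


def reconcile(queries: dict) -> dict:
    """Answer a batch of queries keyed by arbitrary ids such as "q0"."""
    response = {}
    for key, query in queries.items():
        text = query["query"]
        results = []
        for name_id, name in NAMES.items():
            ratio = SequenceMatcher(None, text.lower(), name.lower()).ratio()
            results.append({
                "id": name_id,
                "name": name,
                "score": round(100 * ratio, 1),
                "match": text == name,  # only an identical string is a sure match
            })
        results.sort(key=lambda r: r["score"], reverse=True)
        response[key] = {"result": results[: query.get("limit", 3)]}
    return response


batch = {"q0": {"query": "Quercus robur", "limit": 2}}
print(json.dumps(reconcile(batch), indent=2))
```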

Structure of names

A botanical name strictly consists of a maximum of three canonical name elements plus a rank indicator, where appropriate. Author(s) given in parentheses indicate the basionym authors, preceding the combination author(s).
In zoology, there is only one infraspecific rank, the subspecies, so there is no need to specify the rank. This is not open to interpretation; the Code is quite clear about it. Officially there is also only one infrageneric rank, the subgenus, but there seem to be other opinions. A discussion of the ICZN took place in 2018, and if we follow Erna Aescht’s position in Opinion 2420, “the taxonomic terms ‘section’, ‘division’ and ‘superspecies’ are given in quotation marks [in the Code], because they are not nomenclatural terms”, and the subgenus is the only possible rank in the genus series other than the genus rank. Subgenera are usually indicated in parentheses when used in a full species name (Art. 6.1). The ICZN Code recommends: “No genus-group name other than a valid subgeneric name should be interpolated between a generic name and a specific name, even in square brackets or parentheses. An author who desires to refer to a former generic combination should do so in some explicit form such as "Branchiostoma lanceolatum [formerly in Amphioxus]".” The ICZN permits citing the combination author, but that is usually not done; where a new combination was made, parentheses are put around the original author name and date.
[Elie Saliba, pers. comm. 27 May 2024]
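The two name structures described above can be sketched as separate record types. This is an illustration only: the class and field names are hypothetical, infrageneric botanical names are left out for brevity, and the `combination_changed` flag is an assumed way of recording that the species was originally described in another genus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BotanicalName:
    genus: str  # or monomial
    species_epithet: Optional[str] = None
    infraspecific_rank: Optional[str] = None  # e.g. "subsp.", "var."
    infraspecific_epithet: Optional[str] = None
    basionym_authors: Optional[str] = None
    combination_authors: Optional[str] = None

    def canonical(self) -> str:
        """Name without authors, but with the rank indicator where required."""
        parts = [self.genus, self.species_epithet]
        if self.infraspecific_epithet:
            parts += [self.infraspecific_rank, self.infraspecific_epithet]
        return " ".join(p for p in parts if p)

    def full(self) -> str:
        """Canonical name plus "(basionym authors) combination authors"."""
        parts = [self.canonical()]
        if self.basionym_authors:
            parts.append(f"({self.basionym_authors})")
        if self.combination_authors:
            parts.append(self.combination_authors)
        return " ".join(parts)


@dataclass
class ZoologicalName:
    genus: str
    specific: Optional[str] = None
    subgenus: Optional[str] = None
    subspecific: Optional[str] = None  # no rank indicator needed
    author: Optional[str] = None  # original author (team)
    year: Optional[int] = None
    combination_changed: bool = False

    def canonical(self) -> str:
        """Name without author/year; the subgenus goes in parentheses."""
        subgenus = f"({self.subgenus})" if self.subgenus else None
        parts = [self.genus, subgenus, self.specific, self.subspecific]
        return " ".join(p for p in parts if p)

    def full(self) -> str:
        """Canonical name plus author and year, in parentheses if recombined."""
        if not self.author:
            return self.canonical()
        citation = f"{self.author}, {self.year}" if self.year else self.author
        if self.combination_changed:
            citation = f"({citation})"
        return f"{self.canonical()} {citation}"


print(BotanicalName("Picea", "abies", basionym_authors="L.",
                    combination_authors="H.Karst.").full())
print(ZoologicalName("Bufo", "bufo", subspecific="bufo", author="Linnaeus",
                     year=1758, combination_changed=True).full())
```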

Literature and Links

  • Conti, M., Nimis, P.L., Martellos, S. 2021: Match Algorithms for Scientific Names in FlorItaly, the Portal to the Flora of Italy. Plants 2021, 10, 974. https://doi.org/10.3390/plants10050974