Wish list for name matching services

From TETTRIs
Revision as of 10:06, 23 May 2024

Terminology

  • candidates = returned partial matches, as opposed to exact matches
  • canonical name = name string without authors/year, but with rank indicator (where required)
  • asynchronous output = (here) output that is produced as a downloadable file after internal processing, which may take time; the user is notified when it is ready.

General

Input

  • Allow input of a pasted column of names
  • Allow upload of a table with names (with a dialogue to select the name-bearing column[s])
  • Allow upload of any other column
  • Allow wildcards
  • Do not limit the input (in asynchronous mode)
  • Allow input of name components in separate fields (at least in uploads):
    • For ICNAFP: Monomial / genus component; infrageneric rank; infrageneric epithet; species epithet; infraspecific rank; infraspecific epithet; basionym author (team); combination author (team); year of publication.
    • For ICZN: work in progress.
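
The separate ICNAFP fields listed above could be held in a simple record. The following is only a minimal sketch: the class name, the field names and the rule for composing the canonical name are assumptions made for illustration, not part of any existing service. In particular, it leaves out the infrageneric part whenever a species epithet is present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BotanicalNameParts:
    """Hypothetical record for the ICNAFP name components listed above."""
    genus_or_monomial: str
    infrageneric_rank: Optional[str] = None
    infrageneric_epithet: Optional[str] = None
    species_epithet: Optional[str] = None
    infraspecific_rank: Optional[str] = None
    infraspecific_epithet: Optional[str] = None
    basionym_author: Optional[str] = None
    combination_author: Optional[str] = None
    year: Optional[str] = None

    def canonical_name(self) -> str:
        """Name string without authors/year, but with rank indicator."""
        parts = [self.genus_or_monomial]
        if self.species_epithet:
            parts.append(self.species_epithet)
            if self.infraspecific_epithet:
                parts += [self.infraspecific_rank, self.infraspecific_epithet]
        elif self.infrageneric_epithet:
            # Assumption: infrageneric parts are only shown for names above species rank.
            parts += [self.infrageneric_rank, self.infrageneric_epithet]
        # Skip components that were left empty in the upload.
        return " ".join(p for p in parts if p)


species = BotanicalNameParts("Bellis", species_epithet="perennis", combination_author="L.")
print(species.canonical_name())
```

Keeping authors and year in their own fields means the canonical name can be derived without parsing a free-text string.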

Interactive mode

  • Allow for interactive selection of the matching record

Matching

  • If the name is entered as a string, parse the name and match the name components independently
  • Tolerate different abbreviations for infrageneric and infraspecific rank designations (e.g. "subspecies", "subsp.", "ssp.")
  • Exactly match the rank of the name, if unambiguous in the input – i.e. do not return a subspecies for a variety, a genus name for a family, etc.
  • Avoid returning names that are completely improbable as candidates
    • Weigh probabilities hierarchically; e.g., in a species name, a full or near-full match on a genus name is more important than that of the species epithet (epithets may be used in many genera).
    • Make phonetic matching optional
  • Allow parameters for matching (see below)
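
Two of the matching wishes above, tolerant rank abbreviations and hierarchical weighting, can be sketched as follows. This is an illustration under stated assumptions, not a description of an existing matcher: the alias table, the use of `difflib.SequenceMatcher` as similarity measure, the genus weight of 0.6 and the `min_score` cut-off are all hypothetical choices.

```python
from difflib import SequenceMatcher

# Assumed alias table; a real service would need a fuller, curated list.
RANK_ALIASES = {
    "subspecies": "subsp.", "subsp.": "subsp.", "subsp": "subsp.",
    "ssp.": "subsp.", "ssp": "subsp.",
    "variety": "var.", "varietas": "var.", "var.": "var.", "var": "var.",
}


def normalize_rank(token):
    """Map a rank designation to one standard form, or None if unknown."""
    return RANK_ALIASES.get(token.strip().lower())


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(query, candidate, genus_weight=0.6):
    """Weight the genus match more heavily than the epithet match.

    query and candidate are (genus, epithet) tuples.
    """
    genus_sim = similarity(query[0], candidate[0])
    epithet_sim = similarity(query[1], candidate[1])
    return genus_weight * genus_sim + (1 - genus_weight) * epithet_sim


def rank_candidates(query, names, min_score=0.0):
    """Return (score, name) pairs, best first, dropping improbable names."""
    scored = [(score(query, n), n) for n in names]
    kept = [item for item in scored if item[0] >= min_score]
    return sorted(kept, key=lambda item: item[0], reverse=True)


known = [("Lolium", "perenne"), ("Bellis", "annua"), ("Bellis", "perennis")]
for s, name in rank_candidates(("Belis", "perennis"), known, min_score=0.6):
    print(round(s, 2), " ".join(name))
```

With the genus weighted more heavily, a misspelt "Belis perennis" ranks other Bellis names above a name from an unrelated genus with a similar epithet, and the cut-off drops the latter entirely.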

Output

  • Return matched records and candidates in a single table (at least in asynchronous mode)
  • Return all input columns with matching results (at least in asynchronous mode)
  • List candidates sorted by probability (see above)
  • Provide output in standard formats (DwC, ColDP)
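
The single-table output described above might look like the sketch below, which repeats every input column on each result row and sorts the matches for one input name by score. The column names `match_type`, `matched_name` and `score` are placeholders chosen for this example; they are not DwC or ColDP terms.

```python
import csv
import io


def build_output_rows(input_rows, results):
    """Combine each input row with its matches, best match first.

    input_rows: list of dicts (all uploaded columns).
    results: parallel list; each entry is a list of
             (match_type, matched_name, score) tuples.
    """
    out = []
    for row, matches in zip(input_rows, results):
        if not matches:
            # Keep unmatched input rows so nothing uploaded is lost.
            out.append({**row, "match_type": "none", "matched_name": "", "score": ""})
        for match_type, name, score in sorted(matches, key=lambda m: m[2], reverse=True):
            out.append({**row, "match_type": match_type,
                        "matched_name": name, "score": f"{score:.2f}"})
    return out


def to_csv(rows):
    """Serialize the combined rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


uploaded = [{"id": "1", "name": "Belis perennis", "collector": "X"},
            {"id": "2", "name": "Unknownia", "collector": "Y"}]
matches = [[("candidate", "Bellis annua", 0.67), ("candidate", "Bellis perennis", 0.95)], []]
print(to_csv(build_output_rows(uploaded, matches)))
```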

Parameters for exact matches

  • optional (ICNAFP): accept IPNI, Tropicos and full spaced author abbreviations as exact matches
  • optional (ICNAFP): ignore ex authors (the author or team preceding the ex)
  • optional (ICNAFP): ignore the hybrid symbol "×" (or the letter “x” followed by a space, or with a space on either side) in the name
  • optional (ICNAFP): ignore authors in autonyms
  • optional: ignore endings in name elements (monomials, species epithet/name, infraspecific epithet/name)
  • optional: treat canonical matches as exact if only one match is found (i.e. ignore authors/year)
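
Some of the optional parameters above can be pictured as switches applied before names are compared. The sketch below covers three of them (ex authors, hybrid symbol, unique canonical match). The option names, the regular expressions and the `classify` helper are assumptions for illustration only, and authorship strings come in more shapes than this handles.

```python
import re
from dataclasses import dataclass


@dataclass
class ExactMatchOptions:
    """Hypothetical switches mirroring three of the parameters listed above."""
    ignore_ex_authors: bool = False
    ignore_hybrid_symbol: bool = False
    canonical_if_unique: bool = False


def strip_ex_authors(authorship):
    """Drop the author or team preceding each 'ex', keeping the one after it."""
    return re.sub(r"[^()\s][^()]*?\s+ex\s+", "", authorship).strip()


def strip_hybrid_marker(name):
    """Remove the hybrid symbol, or a standalone letter x used in its place."""
    name = name.replace("×", " ")
    name = re.sub(r"(^|\s)[xX]\s+", r"\1", name)
    return " ".join(name.split())


def normalize(name, authors, opts):
    if opts.ignore_hybrid_symbol:
        name = strip_hybrid_marker(name)
    if opts.ignore_ex_authors:
        authors = strip_ex_authors(authors)
    return name, authors


def classify(name, authors, records, opts):
    """Return ('exact' | 'candidate', matching records).

    records is a list of (canonical_name, authors) tuples.
    """
    q_name, q_authors = normalize(name, authors, opts)
    normalized = [(normalize(n, a, opts), (n, a)) for n, a in records]
    canonical_hits = [orig for (n, a), orig in normalized if n == q_name]
    exact = [orig for (n, a), orig in normalized if n == q_name and a == q_authors]
    if exact:
        return "exact", exact
    if opts.canonical_if_unique and len(canonical_hits) == 1:
        # Only one canonical match: authors/year are ignored.
        return "exact", canonical_hits
    return "candidate", canonical_hits


opts = ExactMatchOptions(ignore_hybrid_symbol=True, canonical_if_unique=True)
print(classify("Mentha x piperita", "", [("Mentha × piperita", "L.")], opts))
```

Applying the same normalization to both the query and the reference records keeps the comparison symmetric, so a switch never makes a previously exact match fail.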

Parameters for candidate matches

  • allow different ranks
  • activate phonetic matching
  • increase tolerance (accept wider range of near matches)
  • ...

Machine Interface

  • OpenRefine Reconciliation Interface
  • ...

Literature and Links

  • Conti, M., Nimis, P.L., Martellos, S. 2021: Match Algorithms for Scientific Names in FlorItaly, the Portal to the Flora of Italy. Plants 2021, 10, 974. https://doi.org/10.3390/plants10050974