What is name matching?
Latest revision as of 12:47, 31 October 2024
The process of combining biodiversity data from multiple sources currently starts with matching of the Latin name strings for the organisms used in each dataset.
Studies often contain names that can not be unambiguously matched or miss out some names entirely.
When combining datasets, between 10% and 20% of names will fail to match perfectly and may need some human interaction or accepted error.
With datasets of many thousands of species this soon becomes a major hurdle that has to be crossed every time datasets are used in analyses, and it is exacerbated when more than two datasets are used.
It is better if study data can be matched once, at source, then linked on unambiguous name IDs rather than by matching potentially ambiguous name strings.
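As a minimal sketch of linking on name IDs rather than on name strings, assume two datasets have each been matched once, at source, against the same register. The IDs (`name-0001` and so on), the field names and the values are all invented for illustration.

```python
# Two datasets that were each matched to a name register at source.
# The IDs, field names and values are hypothetical.
traits = {
    "name-0001": {"verbatim": "Aus bus Smith", "height_m": 12.0},
    "name-0002": {"verbatim": "Aus cus", "height_m": 3.5},
}
occurrences = {
    "name-0001": {"verbatim": "Aus bus", "records": 40},
    "name-0003": {"verbatim": "Dus eus Jones", "records": 7},
}


def link_on_ids(left, right):
    """Pair up rows that share a name ID; name strings are never compared."""
    return {
        name_id: (left[name_id], right[name_id])
        for name_id in left.keys() & right.keys()
    }


linked = link_on_ids(traits, occurrences)
```

The two rows for `name-0001` are linked even though their verbatim name strings differ, because the ambiguity was resolved once when the IDs were assigned.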
What is discussed here is name matching, not matching to a specific taxon, which is a subsequent process. First we identify the name that a string of characters applies to, then we check which taxon that name applies to within a specific classification. See Potential caveats for discussion.
How Latin names are ambiguous
It is worthwhile to summarise some of the mechanisms whereby scientific names can be ambiguous, either because different strings of characters refer to the same published name or because the same string of characters refers to different published names.
Homonyms
Homonyms are names that are spelt the same but refer to different things. Under the codes of nomenclature one of the names will always have precedence over the other, but because there have not been universal name registries it has not been possible to prevent the creation of duplicate names. In the strict sense homonym means the full names, including the author(s) names, are identical. Homonym is often used in a looser sense of just applying to the words that make up the name, excluding the author string. This is because author strings are often not standardised or are omitted entirely. Homonyms may occur within or between codes, that is, the same name string may be used for two plants or for a plant and an animal. Furthermore there are two types of homonyms:
Isonyms
Isonyms occur when a name is based on the same type specimen but published in multiple places. The majority of isonyms are created by the author publishing the name again (perhaps in a paper and in a flora, fauna or catalogue) and so have the same author(s). There is no scope for taxonomic confusion in botany and the only scope for nomenclatural confusion caused by isonyms is citing the wrong reference as a place of original publication. In zoology the name string may have different dates thus causing matching failures even though the intent of the author(s) was to name the same taxon.
True Homonyms
True homonyms are names based on different type specimens and, usually, published by different authors. If they are published by different authors (homonym in the loose sense) and the author(s) names are included in the full name then they should not be ambiguous during matching. However, author(s) names may be omitted, causing false matches, or use nonstandard forms, causing false mismatches. Because zoology, unlike botany, does not require the authors of new combinations (species epithets used in different genera) to be cited, the potential for ambiguity there is higher.
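The effect of an omitted or nonstandard author string can be sketched as follows. This is only an illustration: the register entries, the IDs and the `candidates` function are hypothetical, and the lookup is a plain exact comparison.

```python
from collections import defaultdict

# Hypothetical register entries: (ID, name words, author string).
REGISTER = [
    ("id-1", "Aus bus", "Smith"),
    ("id-2", "Aus bus", "Jones"),  # a homonym in the loose sense
    ("id-3", "Aus cus", "Smith"),
]

by_words = defaultdict(list)
by_full_name = {}
for name_id, words, author in REGISTER:
    by_words[words].append(name_id)
    by_full_name[f"{words} {author}"] = name_id


def candidates(name_string):
    """Return the IDs that a name string could refer to."""
    if name_string in by_full_name:
        # Author string present and in the expected form: unambiguous.
        return [by_full_name[name_string]]
    # Author string omitted (or unrecognised): fall back to the name words.
    return list(by_words.get(name_string, []))
```

With the author included, `candidates("Aus bus Jones")` yields one ID. With the author omitted, `candidates("Aus bus")` yields two, so picking either risks a false match. A nonstandard form such as `"Aus bus Jones."` yields none, which is a false mismatch.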
Author(s) String variation
Most ambiguity caused by identically spelt scientific names can be resolved if the author(s) of the name are included in the full name. Unfortunately this is fraught with difficulties in real data.
- The author(s) are frequently omitted. If material is created for a general audience then inclusion of author(s) names can be considered confusing especially if the scientific (Latin) form of the name is being used in addition to a well known vernacular name. When data is being shared within a specialist scientific community who only work on a few species then the author(s) are omitted because there is no chance of confusion in that particular research context. Omission has also been influenced by legacy systems having restrictions on the length of data fields and restricted character encoding.
- The zoological code of nomenclature does not consider the author(s) to be part of the name and inclusion is only customary although usually advisable. It is recommended on first use in a publication. Zoologists do not include the names of authors of new combinations (species placed in different genera). Botanists are more consistent in use of authors but do not include the year of publication of a name which is customary in zoology.
- Standard author abbreviations are not mandated in either botany or zoology, although in botany there is a more established convention to use the author abbreviations as maintained by IPNI and also community curated in Wikidata property P428. Publication editors will sometimes mandate changes to, or further abbreviation of, author strings, for example the use of et al when there are more than two authors and the addition of spaces after periods.
- Use of ex is inconsistently applied and error prone. The nomenclatural codes allow the author(s) who validly publish a name to acknowledge previous author(s) who published the name incorrectly by including the original authors' names in the citation. When they do this the two sets of author(s) names are separated by 'ex'. In botany the original authors come before the ex but in zoology the two sets of authors are presented the other way around. Citing the original authors on subsequent use of the name is considered optional in both codes. Often there is confusion as to which set of authors to leave out, resulting in multiple combinations of author strings for a single name being in circulation, some legal and some illegal. Some of those versions will, by chance, match unrelated names.
- Encoding issues, particularly in data created prior to the widespread adoption of UTF-8. Author names may include accented characters but taxon names should only include common characters from the original ASCII code page.
In summary: Even within the scope of vascular plants covered by IPNI, where it should be possible to follow a standard, there are always errors in trying to match full name strings that include the author(s).
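Two of the variations listed above, spaces after periods and accented characters, can be reduced to a crude comparison key. The sketch below is an assumption about how one might do this, not a description of any existing tool, and `normalise_author` is a hypothetical name. It does nothing about omitted authors, 'ex' or 'et al', and any such folding is lossy, so it may itself cause false matches.

```python
import re
import unicodedata


def normalise_author(author):
    """Reduce an author string to a crude, lossy comparison key."""
    # Decompose accented characters and drop the combining marks (ü -> u).
    decomposed = unicodedata.normalize("NFKD", author)
    without_marks = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    # Remove whitespace between a period and the next word ("A. B." -> "A.B.").
    tight = re.sub(r"\.\s+(?=\w)", ".", without_marks)
    # Collapse any remaining runs of whitespace and ignore letter case.
    return re.sub(r"\s+", " ", tight).strip().casefold()
```

With this key the strings `"Müll. Arg."` and `"Mull.Arg."` compare as equal, although neither is confirmed to be the standard form.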
Orthographical variants
Permitted spelling variations and corrections.
- Correctable spellings - list from botany
- Gender changes - check what zoology do
- Mandated changes - e.g. changing derivatives of the word caffra, an apartheid-era racial slur used against Black people in southern Africa, to derivatives of “afr”, signaling the species' African origins.
Other changes to come with each code update. Rules are retrospective and all versions of the spelling of the name will continue to be in circulation from historical data even if there is a prescribed version that should be used from now on.
FIXME
Errors
Only at the end do we get to stochastic or unforced errors. These may occur when a name is mistyped by a human or through OCR failure.
FIXME
Zombies
Zombie names are name strings that may have occurred in the literature or a database (via a bad OCR or typo) just once or may have been legitimately published for a long discredited classification. They have subsequently been propagated from one dataset to the next without ever being deleted. They soak up time and resources because each time large datasets are combined the process of resolving zombie names to an original place of publication starts again. Zombie names are particularly problematic in the age of big data. If we delete them they will keep coming back again from different data sources and each time they are rediscovered they will use up more resources.
The only way to deal with zombie names is to have a central register that includes them and flags them as not being resolvable. This is the approach World Flora Online has taken with the notion of deprecated names. Because the names are tracked and have IDs similar to other names they can be resurrected in the unlikely event they are discovered to have been correctly published.
Matching vs Searching
The processes of matching and searching are often confused. It is worth defining what is meant by "name matching" here.
In general parlance matching means one person or thing resembles or corresponds to another. In computing an exact match is usually implied, for example, all the lines in a file that contain a certain string match the query. Both senses imply a comparison.
In taxonomic name matching one of the two things compared is a string of characters representing the name of an organism and the other is a list of known names of organisms. The intent is to go from a string of characters on the page to a protologue and the type specimen that anchors the name into a classification. To achieve this the matching process must return only a single result. If it returns multiple results it is searching the list for potential matches but not actually matching the name string.
Matching | Searching | Explanation |
---|---|---|
Normative | Informative | A normative response is prescriptive. It gives an answer on how to comply with a standard. An informative response provides useful or interesting information to help to understand what the name means. |
Definitive | Indefinite | A matching process will return the value from a controlled vocabulary that has been specified a priori. Searching returns a result (or results) and the domain can be vaguely defined. |
One or zero returned | Zero to many returned | A match is only expected to return a single result (usually an ID) or to fail. A search may result in many candidate answers that are of interest to a user. |
Correct or Incorrect | Relevant or not | If a matching process succeeds the result is either correct or incorrect. Searching produces results with a degree of relevance (depending on the algorithm used). |
Targeted at machines | Targeted at humans | The result of a matching process is typically applicable to machine interpretation, e.g. linking resources. Search results are usually consumed by humans. |
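The contrast in the table can be sketched as two functions with different return types. This is an illustration under stated assumptions: the reference list and IDs are hypothetical, matching is reduced to an exact lookup, and relevance is scored with `difflib.SequenceMatcher` from the Python standard library purely as a stand-in for whatever ranking a real search would use.

```python
from difflib import SequenceMatcher

# Hypothetical reference list mapping full name strings to IDs.
REFERENCE = {
    "Aus bus Smith": "id-1",
    "Aus bus Jones": "id-2",
    "Aus cus Smith": "id-3",
}


def match(name_string):
    """Matching: return exactly one ID, or None on failure."""
    return REFERENCE.get(name_string)


def search(query, limit=10, min_score=0.5):
    """Searching: return zero to many (name, score) pairs, most relevant first."""
    scored = [
        (name, SequenceMatcher(None, query.casefold(), name.casefold()).ratio())
        for name in REFERENCE
    ]
    hits = [(name, score) for name, score in scored if score > min_score]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:limit]
```

Here `match("Aus bus")` fails outright, while `search("Aus bus")` returns several candidates ranked by relevance and leaves the choice to the caller.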
FIXME: Define input.
Examples of Matching
Looking up a telephone number for a name in an old fashioned telephone book is an example of matching. The domain is the city the book covers. The answer will be the number prescribed by the telephone company for that name. It will be found or not found. It could be the wrong number but it couldn't be only partially correct. The number is for consumption by a machine, the telephone, not a human.
Finding the DOI for a publication based on a written citation in APA (American Psychological Association) format is another example. The publication will only have a single DOI, which is authoritative and controlled by the International DOI Foundation. The DOI is used to find the official metadata and full text of the publication and is of no interest to a human in itself.
Examples of Searching
A Google search of the internet. Many results are typically returned from the provided search terms ranked by relevance.
Booking a flight route from New York to Paris via a flight aggregator.
Using BLAST to find regions of similarity between a nucleotide or protein sequence and those in a database.
Areas of Confusion
It is possible to look up a DOI for a publication by searching for the authors and title on Google. It is very likely that the DOI of the paper will be in the first few results returned. This feels like matching but requires a human to pick the most appropriate result. Google is helping a human do the matching.
Most matching algorithms will fall back to searching if an unambiguous match is not found. When presented with a list of names an algorithm might automatically match the majority of them, perhaps using a degree of "fuzzy matching" as Taxamatch does, but if confidence falls below a predefined threshold a list is presented for a human to pick a preferred result if possible. At this point the matching algorithm has done a search for candidates and the human is actually doing the matching, not the machine.
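A minimal sketch of this fallback behaviour follows. It is not how Taxamatch works: the similarity score comes from `difflib.SequenceMatcher`, and the threshold, the reference list and the function name are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference list mapping name strings to IDs.
REFERENCE = {"Aus bus": "id-1", "Aus cus": "id-2", "Dus eus": "id-3"}


def match_or_search(name_string, reference, threshold=0.9, max_candidates=5):
    """Match automatically when confident, otherwise hand candidates to a human."""
    scored = sorted(
        (
            (SequenceMatcher(None, name_string, name).ratio(), name)
            for name in reference
        ),
        reverse=True,
    )
    if not scored:
        return {"matched": None, "candidates": []}
    best_score, best_name = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Only match when the best candidate is both confident and unambiguous.
    if best_score >= threshold and best_score > runner_up:
        return {"matched": reference[best_name], "candidates": []}
    # Below the threshold this is a search: a human must do the matching.
    return {
        "matched": None,
        "candidates": [name for _, name in scored[:max_candidates]],
    }
```

A near miss such as `"Aus buss"` is matched automatically, whereas `"Aus dus"` is equally close to two entries and comes back as a candidate list.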
Necessity for a reference list
Matching is asserting identity not equality. If two name strings "match" then they are considered to reference the same name, with a protologue and type specimen, rather than to be merely variations on the same string (equality). When any form of fuzzy or approximate matching is used then identity between strings cannot be assumed to be transitive. Just because A and B match (reference the same name) and B and C match it does not mean that A and C will necessarily match.
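The lack of transitivity is easy to demonstrate with one simple form of fuzzy matching, an edit distance of at most one character. The three strings are invented placeholders and `fuzzy_equal` is a hypothetical helper, not a recommended matching rule.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def fuzzy_equal(a, b, max_edits=1):
    """Treat two strings as 'matching' if they differ by at most max_edits."""
    return edit_distance(a, b) <= max_edits


A, B, C = "Aus alba", "Aus alta", "Aus atta"
a_matches_b = fuzzy_equal(A, B)
b_matches_c = fuzzy_equal(B, C)
a_matches_c = fuzzy_equal(A, C)
```

A matches B and B matches C, but A does not match C. Comparing strings pairwise is therefore not enough: each string has to be resolved against a reference list of names.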
Requirement for human interaction
Consider two lists of names. If we match list X against list Y there are six possible outcomes:
1. A match IS NOT found for name Xn in Y
  1.1. Xn is missing from list Y
  1.2. Xn is in list Y but the matching algorithm has failed to resolve the ambiguities
  1.3. Xn is a malformed name and wouldn't match any list
2. A match IS found for name Xn in Y
  2.1. the matching name in Y is the correct one
  2.2. the matching name in Y is NOT the correct name
    2.2.1. Xn IS in Y but the matching algorithm has erroneously returned the wrong name.
    2.2.2. Xn IS NOT in Y but the matching algorithm has erroneously returned a match anyway.
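The six outcomes can be written down as a small classification, using the outline numbers (1.1 to 2.2.2) that the text refers to. The sketch assumes the true answer is already known for each name, which in practice is exactly the information that requires human verification. The enum member names and the `classify` function are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    MISSING_FROM_Y = "1.1"  # no match: the name is not in Y
    UNRESOLVED = "1.2"      # no match: the name is in Y but was not resolved
    MALFORMED = "1.3"       # no match: the name string is malformed
    CORRECT = "2.1"         # match: the right name was returned
    WRONG_NAME = "2.2.1"    # match: the name is in Y but the wrong one came back
    SPURIOUS = "2.2.2"      # match: the name is not in Y but a match came back


def classify(returned_id, true_id, malformed=False):
    """Classify one result given the ID returned and the verified true ID.

    true_id is None when the name is genuinely absent from list Y.
    """
    if returned_id is None:
        if malformed:
            return Outcome.MALFORMED
        return Outcome.UNRESOLVED if true_id is not None else Outcome.MISSING_FROM_Y
    if returned_id == true_id:
        return Outcome.CORRECT
    return Outcome.WRONG_NAME if true_id is not None else Outcome.SPURIOUS
```

Tallying these outcomes over a verified sample gives the proportions that the two factors discussed next are meant to improve.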
There are two factors that can affect the proportion of different outcomes.
Comprehensiveness and Authoritativeness of list Y
If list Y is comprehensive then we can eliminate 2.2.2 and 1.1 because the list contains all known names. In practice we can have very well curated name lists but, in the absence of a mandatory registry of names, we can only minimise these factors. For botany (excluding fungi) a comprehensive, curated list would contain in excess of two million names and a list for zoology would be much larger.
Because no list is entirely comprehensive no list can be totally authoritative. There is often the chance that a name string that doesn't match may be a new name and should be added to the list.
For ad hoc comparison of smaller lists any unmatched name must be considered a potential new name until it can be fully researched.
The more comprehensive list Y is, the more effective the algorithm can be. For example, detecting malformed names is easier if the system contains all known genus names and specific epithets.
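As a rough sketch of that last point, a name string can be checked word by word against known genus names and epithets. The vocabularies below are tiny hypothetical stand-ins and only two-word names are handled. Because no list is entirely comprehensive, an unknown word is a flag for review, not proof that the name is malformed.

```python
# Hypothetical vocabularies; a real system would load these from list Y.
KNOWN_GENERA = {"Aus", "Dus"}
KNOWN_EPITHETS = {"bus", "cus", "eus"}


def check_binomial(name_string):
    """Return a list of problems found in a two-word name string."""
    parts = name_string.split()
    if len(parts) != 2:
        return ["expected exactly two words"]
    genus, epithet = parts
    problems = []
    if genus not in KNOWN_GENERA:
        problems.append(f"unknown genus: {genus}")
    if epithet not in KNOWN_EPITHETS:
        problems.append(f"unknown epithet: {epithet}")
    return problems
```

The larger the vocabularies, the more weight an "unknown" flag carries, which is one way in which the comprehensiveness of list Y makes the algorithm more effective.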
Effectiveness of the algorithm
FIXME
Kinds of algorithm
There are four main approaches, though a combination can be adopted.
- semantic - uses subject domain knowledge
- syntactic - uses generic lexical analysis
- Machine Learning
  - Trained models
  - Large language models / generative AI
Trained models need training sets and are functionally very similar to lexical analysis.
If we allow an LLM to take the context of a matching into account, which seems like the only reason we would want to use an LLM, we must accept that a different context would result in a different matching result. This may not be desirable if our goal is reproducibility of results in scientific analyses.
Bottom line: We can't expect a matching algorithm to do something a human couldn't do more slowly - or could we? Discuss!
Describing Matching Algorithm
- Input
  - Missing ranks - how handled
  - Missing authors string - how handled
  - Missing year - how handled
- Result
- Scope
- Failure behaviour
- Parameters
  - Affecting result
  - Affecting failure behaviour
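One possible way to record such a description in a machine-readable form is sketched below. The class names, field names and example values are all hypothetical. They simply mirror the outline above and do not describe any existing matching service.

```python
from dataclasses import dataclass, field


@dataclass
class InputHandling:
    """How the algorithm treats incomplete input."""
    missing_ranks: str
    missing_author_string: str
    missing_year: str


@dataclass
class MatcherDescription:
    """A structured description of one matching algorithm."""
    input: InputHandling
    result: str
    scope: str
    failure_behaviour: str
    result_parameters: dict = field(default_factory=dict)
    failure_parameters: dict = field(default_factory=dict)


# Example values for an imaginary matcher, for illustration only.
example = MatcherDescription(
    input=InputHandling(
        missing_ranks="treated as unknown",
        missing_author_string="match on name words only",
        missing_year="ignored",
    ),
    result="a single name ID",
    scope="a hypothetical list of plant names",
    failure_behaviour="return no ID and a list of candidates",
    result_parameters={"threshold": 0.9},
    failure_parameters={"max_candidates": 5},
)
```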