Difference between revisions of "Usecase Specify database at the herbarium Madrid"


Revision as of 21:43, 9 April 2024

The herbarium at the Real Jardín Botánico Madrid (MA) is one of the major European herbaria, with 1.1 million specimens, very important historical collections, and the principal collection used for Flora Iberica. The collection is partly digitised; the data are held in the widely used SPECIFY collection management system, which makes this use case relevant beyond the specific institution. Further European SPECIFY users according to https://www.specifysoftware.org/members/ include RBG Edinburgh, Natural History Museum Basel, Gothenburg Natural History Museum, Institut de Recherche pour le Développement (IRD France), University of Zurich and (in Israel) the Hebrew University of Jerusalem. In Spain, the national node for DiSSCo at the CSIC organises workshops and supports the migration of collection data to SPECIFY.

The principal use case is data cleaning of the name data in the herbarium database and to try out the concept-linking mechanisms using the TETTRIs workflow.

Silvia Lusa (SL), the database administrator at the Real Jardín Botánico Madrid, provided the central Taxon table from SPECIFY.


Walter Berendsohn (TETTRIs) made the following observations from looking at it in Microsoft Access:

  • 138,320 names, with their name relations (accepted name, higher taxon). Only 356 of these are not accepted. 25 are duplicates (only 9 exact; the rest differ in the author string). At first view, taxa above species level do not carry authors (77 genera do). Rarely, authors are also missing in species and ranks below (mostly for autonyms). Complete names: 127,802.
  • The content of the field TaxonID is unique. The GUID field is not used here.
  • Name contains the last epithet (or monomial) used for recursive name concatenation.
  • Cultivar is empty (it is unclear whether it is used in other SPECIFY instances).
  • Title contains the verbatim rank (with first capital letter except for forma).
  • RankID contains a string with a number, which represents the rank sequence. This was probably originally a number; it can be converted into a number without errors.
  • FullName contains the canonical name of the species or infraspecific name, preceded by the family (and subfamily in Leguminosae) and a colon (e.g. Umbelliferae: Laserpitium latifolium subsp. nevadense). The canonical name can be derived as the substring of FullName starting at the string position of “:” + 2 (VBA Code).
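
The derivation of the canonical name and the conversion of RankID can be sketched as follows. This is a minimal Python illustration, not the VBA code used in the project; the function names canonical_name and rank_number are hypothetical, and the handling of values without a colon is an assumption.

```python
def canonical_name(full_name: str) -> str:
    """Return the part of a FullName value that follows the 'Family: ' prefix."""
    pos = full_name.find(":")
    if pos == -1:
        # Assumption: values without a family prefix are returned unchanged.
        return full_name.strip()
    # Take everything after the colon and drop the separating space.
    return full_name[pos + 1:].strip()


def rank_number(rank_id: str) -> int:
    """Convert the RankID string into an integer rank sequence number."""
    return int(rank_id.strip())


if __name__ == "__main__":
    print(canonical_name("Umbelliferae: Laserpitium latifolium subsp. nevadense"))
```

Searching for the first colon also covers the Leguminosae entries, where the prefix contains family and subfamily before the colon.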

There are no separate fields for the other name elements (which can be found by means of the recursive structure), so there is no easy way to identify autonyms without resolving the recursion. However, autonyms can be found by counting the occurrences of the string contained in Name in the canonicalName[3].
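
The counting approach can be illustrated with a short Python sketch. The function name is_autonym is hypothetical, and the sketch counts whole words rather than raw substring occurrences, which is a slightly stricter variant of the rule described above.

```python
def is_autonym(name: str, canonical: str) -> bool:
    """Treat a name as an autonym if its last epithet occurs at least twice.

    name      -- content of the Name field (last epithet or monomial)
    canonical -- canonical name derived from FullName
    """
    # Whole-word comparison avoids false hits where the epithet is
    # merely a substring of another name element.
    return canonical.split().count(name) >= 2


if __name__ == "__main__":
    print(is_autonym("latifolium", "Laserpitium latifolium subsp. latifolium"))
    print(is_autonym("nevadense", "Laserpitium latifolium subsp. nevadense"))
```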

  • Author contains the abbreviated author string in one of three styles: without spaces after dots (TDWG/IPNI standard), with a space before the last name (abbreviation) as in Tropicos, or with spaces everywhere. The standard authors can be constructed using the “. ” replacement routine (re-introducing spaces before ampersand, “ex” and “in”)[4].
  • There are a few “in”-authors; the nomenclatural status is stated in about 96 places (filter for “*nom. ”), as well as some other nomenclatural notes (such as “no type indicated”); in 179 cases, an indication of a misapplication is given (“sensu”). Given the low incidence of these cases, they should be ignored or corrected manually.
  • The field FullNameWithStandardAuthors needed for the name matching process can then be generated from the new CanonicalName, concatenated with a space and StandardAuthors except in the case of autonyms.
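
The author standardisation and the construction of FullNameWithStandardAuthors might look roughly like this in Python. It is a sketch under assumptions rather than the routine referenced in [4]: the function names are hypothetical, and the name and author strings in the example are made up for illustration.

```python
import re


def standard_authors(author: str) -> str:
    """Normalise an author string to the no-space-after-dot style."""
    # Collapse repeated whitespace first.
    s = re.sub(r"\s+", " ", author.strip())
    # Remove the space after every dot ("A. Gray" -> "A.Gray").
    s = s.replace(". ", ".")
    # Re-introduce the space before "&", "ex" and "in".
    return re.sub(r"\.(&|ex\b|in\b)", r". \1", s)


def full_name_with_standard_authors(canonical: str, authors: str, autonym: bool) -> str:
    """Concatenate canonical name and authors; autonyms carry no authors."""
    if autonym or not authors:
        return canonical
    return f"{canonical} {authors}"


if __name__ == "__main__":
    # "Aus bus" is a placeholder name, not a real taxon.
    authors = standard_authors("L. ex  Willd.")
    print(full_name_with_standard_authors("Aus bus", authors, autonym=False))
```

The few records with nomenclatural notes or “sensu” in the Author field are not handled here, in line with the suggestion to treat them manually.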


The example shows that some initial data transformation and data cleaning measures are necessary even if the data come from established systems. A specific workflow should therefore be developed for each of these systems, if possible in cooperation with the system developers.


The name matching with WFO using OpenRefine resulted in 68,658 exact matches (and WFO-ID assignment). This represents about 54 % of the complete names (names with authors).
IPNI matching yields slightly more matches (76,211), but uses a less strict matching algorithm.

SL looked at the results and suggested the following workflow: