Use case Specify database at the herbarium Madrid

From TETTRIs
Revision as of 00:08, 10 April 2024 by WalterBerendsohn (talk | contribs) (Created page with "The herbarium at the Real Jardín Botánico Madrid (MA) is one of the major European herbaria, with 1.1 m specimens, very important historical collections, and the principal c...")

The herbarium at the Real Jardín Botánico Madrid (MA) is one of the major European herbaria, with 1.1 million specimens, very important historical collections, and the principal collection used for Flora Iberica. The collection is partly digitised, and the data are held in the widely used SPECIFY collection management system, which makes the use case relevant beyond this specific herbarium. Further European SPECIFY users according to https://www.specifysoftware.org/members/ include RBG Edinburgh, Natural History Museum Basel, Gothenburg Natural History Museum, Institut de Recherche pour le Développement (IRD France), University of Zurich and, in Israel, the Hebrew University of Jerusalem. In Spain, the national node for DiSSCo at the CSIC organises workshops and supports the migration of collection data to SPECIFY.

The principal use case is to clean the name data in the herbarium database and to try out the concept-linking mechanisms using the TETTRIs workflow. An additional use case at a later stage would be to provide herbarium curators with synonyms to aid the search for specimens for a specific loan request to their collection, where specimens may be stored under different names.

Silvia Lusa (SL), the database administrator at the Real Jardín Botánico Madrid, provided the central Taxon table from SPECIFY.


Walter Berendsohn (TETTRIs) made the following observations from looking at it in Microsoft Access:

  • 138,320 names, with their name relations (accepted name, higher taxon). Only 356 of these are not accepted. 25 are duplicates (only 9 exact; the rest differ in the author string). At first view, taxa above species level do not carry authors (though 77 genera do). Rarely, authors are also missing at species rank and below (mostly for autonyms). Complete names: 127,802.
  • The content of the field TaxonID is unique.
  • The GUID field is not used here.
  • Name contains the last epithet (or monomial) used for recursive name concatenation.
  • Cultivar is empty (it is unclear whether it is used in other SPECIFY instances).
  • Title contains the verbatim rank (with first capital letter except for forma).
  • RankID contains a string with a number, which represents the rank sequence. This was probably originally a number; it can be converted to a number without errors.
  • FullName contains the canonical name of the species or infraspecific name, preceded by the family (and subfamily in Leguminosae) and a colon (e.g. Umbelliferae: Laserpitium latifolium subsp. nevadense). The canonical name can be derived using a substring of the FullName starting at string position “:” + 2. See VBA Code.

There are no separate fields for the other name elements (which can be found by means of the recursive structure). So there is no easy way to identify autonyms without resolving the recursion. However, finding autonyms can be achieved by counting the occurrence of the string contained in Name in the canonicalName. See VBA Code.
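The two derivations described here can be sketched in Python as an alternative to the VBA routines. This is a minimal sketch, not the code actually used: the function names are hypothetical, and the autonym test counts whole words rather than raw substrings (an assumption made here to avoid counting an epithet that merely occurs inside another word).

```python
def canonical_name(full_name: str) -> str:
    """Drop the 'Family: ' (or 'Family-Subfamily: ') prefix from a FullName value."""
    pos = full_name.find(":")
    if pos == -1:
        # No family prefix present: treat the whole string as the canonical name.
        return full_name.strip()
    # Take everything after the colon and trim the separating space.
    return full_name[pos + 1:].strip()


def is_autonym(name: str, canonical: str) -> bool:
    """True if the last epithet (field Name) occurs more than once in the canonical name."""
    return canonical.split().count(name) > 1


full = "Umbelliferae: Laserpitium latifolium subsp. nevadense"
canon = canonical_name(full)
print(canon)
print(is_autonym("nevadense", canon))
print(is_autonym("latifolium", "Laserpitium latifolium subsp. latifolium"))
```

Stripping whitespace after the colon, instead of assuming exactly one space, keeps the routine tolerant of records with irregular spacing.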

  • Author contains the abbreviated author string in varying formats: without spaces after dots according to the TDWG/IPNI standard, with a space before the last name (abbreviation) according to Tropicos, or with spaces everywhere. The standard authors can be constructed using the “. ” replacement routine (re-introducing spaces before the ampersand, “ex” and “in”). See VBA Code.
  • There are a few “in”-authors; the nomenclatural status is stated in about 96 places (filter for “*nom. ”), as well as some other nomenclatural notes (such as “no type indicated”); in 179 cases, an indication of a misapplication is given (“sensu”). Given the low incidence of these cases, they should be ignored or corrected manually.
  • The field FullNameWithStandardAuthors needed for the name matching process can then be generated by taking the new CanonicalName and, except for autonyms, concatenating it with a space and StandardAuthors.
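The author normalisation and the construction of FullNameWithStandardAuthors could look roughly as follows in Python. This is a sketch under assumptions: the function names are hypothetical, the regular expressions are one possible reading of the “. ” replacement routine, and records carrying “nom.” or “sensu” notes are not handled.

```python
import re


def standard_authors(author: str) -> str:
    """Normalise an author string towards the no-space-after-dot convention."""
    s = re.sub(r"\s+", " ", author.strip())    # collapse runs of whitespace
    s = s.replace(". ", ".")                   # remove the space after every dot
    # Re-introduce the space before "&", "ex" and "in" where a dot now touches them.
    s = re.sub(r"\.(?=&|ex\b|in\b)", ". ", s)
    return s


def full_name_with_standard_authors(canonical: str, authors: str, autonym: bool) -> str:
    """Concatenate canonical name and standard authors, except for autonyms."""
    if autonym or not authors:
        return canonical
    return f"{canonical} {authors}"


print(standard_authors("A. DC."))
print(standard_authors("Boiss. & Reut."))
print(standard_authors("L. ex Willd."))
```

Names whose Author field is empty are returned without an author part, so they can be filtered out before matching if only complete names are wanted.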

The example shows that some initial data transformations and data cleaning measures are necessary even if the data are coming from established systems. So for each of these systems a specific workflow should be generated, if possible in cooperation with the system developers.

The name matching with WFO using OpenRefine resulted in 68,658 exact matches (with WFO-ID assignment). This represents about 54 % of the complete names (names with authors).
IPNI matching yields slightly more matches (76,211) but uses a less strict matching algorithm.
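The match rates can be recomputed from the counts given on this page, reading the number of complete names as 127,802. The variable and function names below are only illustrative.

```python
complete_names = 127_802   # names with authors
wfo_exact = 68_658         # exact WFO matches found with OpenRefine
ipni_matches = 76_211      # IPNI matches (less strict algorithm)


def match_rate(matches: int, total: int) -> float:
    """Share of matched names in percent."""
    return 100 * matches / total


print(f"WFO:  {match_rate(wfo_exact, complete_names):.1f} %")
print(f"IPNI: {match_rate(ipni_matches, complete_names):.1f} %")
```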


SL looked at the results and suggested the following workflow:

1. Pipeline to revise names

  • Create a defined export from SPECIFY, generating a JSON file containing the parameters used. This can be used by other SPECIFY users with no or minimal adaptations.
  • For that dataset, create further OpenRefine routines replacing the Visual Basic data preparation routines used, in order to have a seamless OpenRefine process including data preparation and reconciliation. TO DO
  • Use WFO reconciliation to resolve all names that did not match exactly to a single matching WFO-name, adding the corresponding WFO-ID and marking the record as not exactly matched
  • Export the non-exact matches and pass them to a taxonomic data curator at the herbarium, who checks and, if necessary, corrects the matches
  • Introduce the data in SPECIFY: either unite the two resulting datasets and replace the taxon table in SPECIFY; or: replace the respective records with the corrected records (BatchEdit is a data modification tool in SPECIFY that may be up to the task); or: change the underlying database (MariaDB) directly using an SQL-script.
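The separation of exact and non-exact matches for curator review might be sketched like this in Python. The column names (`match_type`, `wfo_id`) and the flag `exactly_matched` are assumptions about an exported reconciliation result, not fields defined by SPECIFY, OpenRefine or WFO.

```python
def split_matches(rows):
    """Split reconciled records into exact matches and records needing curator review.

    Each row is a dict; 'match_type' is a hypothetical column written during
    reconciliation. Every output row gets an 'exactly_matched' flag.
    """
    exact, review = [], []
    for row in rows:
        is_exact = row.get("match_type") == "exact"
        flagged = dict(row, exactly_matched=is_exact)
        (exact if is_exact else review).append(flagged)
    return exact, review


# Placeholder records for illustration only.
rows = [
    {"TaxonID": 1, "wfo_id": "wfo-A", "match_type": "exact"},
    {"TaxonID": 2, "wfo_id": "wfo-B", "match_type": "fuzzy"},
]
exact, review = split_matches(rows)
print(len(exact), len(review))
```

Keeping the flag on every record means the two datasets can later be united without losing the information which matches were checked manually.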

2. Pipeline to introduce synonyms

  • Based on the unified taxon table, retrieve synonyms from WFO. SPECIFY can hold these, but the user interface only allows them to be entered one by one. An SQL script will therefore be used to do this as a batch process.
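The idea of a batch insert can be illustrated with a small Python sketch. It runs against an in-memory SQLite database as a stand-in for the MariaDB database behind SPECIFY, and the table layout (`taxon` with `TaxonID`, `FullName`, `AcceptedID`, `IsAccepted`) is a deliberately simplified assumption; the real table has further mandatory fields that a production script would have to fill.

```python
import sqlite3


def insert_synonyms(conn, synonyms):
    """Insert (synonym_name, accepted_taxon_id) pairs as non-accepted taxa in one batch."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO taxon (FullName, AcceptedID, IsAccepted) VALUES (?, ?, 0)",
            list(synonyms),
        )


# Simplified, hypothetical stand-in for the taxon table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon ("
    "TaxonID INTEGER PRIMARY KEY, FullName TEXT, AcceptedID INTEGER, IsAccepted INTEGER)"
)
conn.execute("INSERT INTO taxon (TaxonID, FullName, IsAccepted) VALUES (1, 'Accepted name', 1)")

# Placeholder synonyms pointing at the accepted taxon with TaxonID 1.
insert_synonyms(conn, [("Synonym one", 1), ("Synonym two", 1)])
count = conn.execute("SELECT COUNT(*) FROM taxon WHERE AcceptedID = 1").fetchone()[0]
print(count)
```

Running the whole batch in a single transaction means a failure leaves the table unchanged, which is safer than row-by-row inserts when modifying the database directly.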

3. Pipeline for periodical revisions of the state of names in the database

  • Using the WFO-ID, it is possible to periodically check the state of the names in the WFO database.
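A periodic check could compare the status stored with each WFO-ID against a fresh snapshot obtained from WFO. The sketch below covers only the comparison step; how the snapshot is retrieved is left open, and the identifiers and status values are placeholders, not real WFO data.

```python
def changed_names(stored, current):
    """Return {wfo_id: (old_status, new_status)} for every ID whose status changed.

    'stored' maps WFO-IDs to the status recorded in the herbarium database,
    'current' maps WFO-IDs to the status in a fresh WFO snapshot. IDs absent
    from the snapshot are reported with the new status 'missing'.
    """
    changes = {}
    for wfo_id, old_status in stored.items():
        new_status = current.get(wfo_id, "missing")
        if new_status != old_status:
            changes[wfo_id] = (old_status, new_status)
    return changes


# Placeholder identifiers and status values for illustration only.
stored = {"wfo-A": "accepted", "wfo-B": "accepted", "wfo-C": "synonym"}
current = {"wfo-A": "accepted", "wfo-B": "synonym"}
print(changed_names(stored, current))
```

The resulting list of changed names could be passed to the taxonomic data curator in the same way as the non-exact matches in the first pipeline.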