BiNHum Harvesting and Indexing Toolkit

Pre-requisites

Basics

Tested with Java 8, on Linux and Windows.

Can be deployed with Tomcat (tested with Apache Tomcat/6.0.39, JRE 1.8.0_05-b13) or Jetty (tested with Jetty 8.1.14), or can be run inside Eclipse (tested with Eclipse Luna and Kepler). If Eclipse deletes the libraries, run mvn eclipse:eclipse to automatically restore all libraries defined in the pom.xml file. The project encoding must be UTF-8 (also defined in the pom.xml).

MySQL Database v5.5

Web-services

Local Gisgraphy installation (used for the coordinates check), or possibly the online version; see http://www.gisgraphy.com/free-access.htm

CoordinatesKDTree webservice (based on https://github.com/AReallyGoodName/OfflineReverseGeocode )

Configuration

Edit the application.properties file

Quality tests

* Quality tests: activated / deactivated. The tab won't be displayed if turned OFF

qualityOnOff=OFF / ON

* Where the CoordinatesKDTree Web service is running

coordinatesWebservice=http://localhost:8080/CoordinatesWS

A local folder for temporary file manipulation

Windows paths: please use the format C:\\Users\\username\\folder\\

temporaryFolder=/home/user/test

Set the installation type (production true/false)

production=true

Database properties

* DB credentials

dataSource.username=USER
dataSource.password=PASS

* DB name

dataSource.name=DATABASE_NAME

* MySQL server IP + DB name

dataSource.url=jdbc:mysql://DATABASE-IP:3306/DATABASE_NAME?autoReconnect=true&useUnicode=true&characterEncoding=UTF8&characterSetResults=UTF8&MVCC=true

Application properties

* Application server IP (e.g. localhost)

dataSource.servername=APPLICATIONSERVER_IP

* Default URL with port (e.g. http://localhost:8040/)

The port must match your Tomcat/Jetty configuration!

baseUrl=http://APPLICATIONSERVER:PORT/

* directory (with all permissions for Tomcat/Jetty) where to store the harvested files

Windows paths: please use the format C:\\Users\\username\\folder\\

harvest.directory=/home/USER/PROJECT/

Classes default output folder: binhum/WebContent/WEB-INF/classes
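The doubled backslashes required for Windows paths follow from how Java reads .properties files: a backslash is an escape character, so each literal backslash has to be written twice. The sketch below illustrates this with java.util.Properties. The class name PropertiesPathSketch and the helper read are hypothetical and are not part of B-HIT.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, for illustration only (not part of B-HIT).
final class PropertiesPathSketch {

    /** Parses .properties text and returns the value stored under key (null if absent). */
    static String read(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            // Properties.load applies the escape rules of the .properties format,
            // so "\\" in the file becomes a single backslash in the value.
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }
}
```

Calling read with the file line harvest.directory=C:\\Users\\username\\folder\\ returns the path with single backslashes, which is the form the operating system expects.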

Edit applicationContext-security.xml

Edit the admin password, which is stored MD5-encoded. The default password corresponds to « banana! »
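A new MD5 value can be computed with the standard Java API. This is a minimal sketch that assumes the security configuration expects a plain, unsalted MD5 digest written as lowercase hexadecimal of the UTF-8 password bytes; please check the encoder declared in applicationContext-security.xml before relying on it. The class name Md5PasswordSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only (not part of B-HIT).
final class Md5PasswordSketch {

    /** Returns the MD5 digest of the password as 32 lowercase hex characters. */
    static String md5Hex(String password) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(password.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The returned string would then replace the existing hash in the XML file.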

Before starting the app

Run the database creation script (schemaOnly.sql).

Starting B-HIT

Depending on your configuration and the application name chosen (Eclipse configuration / WAR file name), open your web browser at http://localhost:8040/Bibhum/datasource/list.html .

Overview

Main menu:

Datasources: main entry point to add datasources/datasets, launch metadata operations, and run the inventory, harvesting and processing of data
Associated datasources: entry point for associated data (relationships), to harvest and process associated data only
Extra units: entry point for single units retrieval, based on a list of unit IDs
Jobs: overview of waiting and running jobs
Console: overview of log events
Report: generation of reports (statistics)
Datasource management: to hide or delete datasets
Data quality: to launch quality tests, export test results and display results
Data viewer: to display data stored in the database, either the raw data or the improved data from the quality tests

Add a datasource

Click on the button « add bioDatasource ». A form is displayed with mandatory fields to populate.

Enter a name for the datasource, the abbreviated provider name (e.g. BGBM), the full provider name (e.g. Botanischer Garten und Botanisches Museum Berlin-Dahlem), the accesspoint (e.g. http://ww3.bgbm.org/biocase/pywrapper.cgi?dsa=BoBo), the provider website (e.g. http://www.bgbm.org), the provider postal address and country, and the type of datasource (Biocase/digir/Tapir/darwincore archive). Then save. The new datasource is now listed on the http://localhost:8040/Bibhum/datasource/list.html page.

If you have already added some providers, you can reuse them by selecting an existing provider from the dropdown menu on the right. This automatically populates the form fields.

To see the available operations for this datasource (see Figure 4), click on the « available methods » checkbox on the left or on the expand/collapse icon.

Check accesspoints

On the datasource page, you can use the function “Check accesspoints validity” to check if the accesspoints are valid and accessible.

DEBUGGING HINT: The most common reason for a failed metadata update is a wrong accesspoint. A classic Biocase URL looks like http://host/biocase/pywrapper.cgi?dsa=datasourceName
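Before scheduling a metadata update, the shape of an accesspoint can be checked offline. The sketch below only tests whether a URL resembles the classic Biocase pattern quoted above (a host, a path ending in pywrapper.cgi and a non-empty dsa parameter). It does not contact the server and is not the logic behind “Check accesspoints validity”. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, for illustration only (not part of B-HIT).
final class AccesspointShapeCheck {

    /** True if the URL has a host, a path ending in /pywrapper.cgi and a non-empty dsa parameter. */
    static boolean looksLikeBiocase(String accesspoint) {
        if (accesspoint == null) {
            return false;
        }
        try {
            URI uri = new URI(accesspoint.trim());
            String path = uri.getPath();
            String query = uri.getQuery();
            if (uri.getHost() == null || path == null || query == null) {
                return false;
            }
            if (!path.endsWith("/pywrapper.cgi")) {
                return false;
            }
            for (String param : query.split("&")) {
                if (param.startsWith("dsa=") && param.length() > "dsa=".length()) {
                    return true;
                }
            }
            return false;
        } catch (URISyntaxException e) {
            return false; // not even a syntactically valid URI
        }
    }
}
```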

Retrieve datasets

In order to trigger the « Metadata update » operation, check the box and click on « Schedule » at the bottom of the page.

A new Job is created and can be followed under the « Job » tab (/Bibhum/job/list.html) (see Figure 5).

Refresh the page to update the job status (see Figure 6).

Back on the main page, a new line has appeared in the list of datasources for the discovered dataset (see Figure 7), with its corresponding available methods. These methods depend on the type of provider: ABCD-Archive enables the methods Download ABCD-Archive / Process harvested, and DwC-A enables Download / Process harvested records.


The target column contains the theoretical number of units, according to the provider's metadata.

Perform inventory

Only available for classic BIOCASe (and DiGIR); not needed for ABCD-Archives or DwC-A.

Select the first operation and click on « schedule ». The queries are stored on disk, in compressed files. The location path is defined in the configuration file, and the directory for each datasource can be found in the MySQL table bio_datasource, column « basedirectory ».

Harvest data

Data will be harvested if and only if a list of units has been retrieved with the Inventory operation; if the datasource provides an archive (ABCD or DwC), harvesting can be done directly. The outputs from this operation are:

* For classic datasources:

One or more search requests (with enumerated extensions corresponding to the order in which they were dispatched, e.g. search_request.000)
One or more search responses (with enumerated extensions corresponding to the order in which they were dispatched, e.g. search_response.000). Often there will be only a single response per request, but sometimes there can be multiple responses for a single request.

The default range is 200 units per request.

* For archive datasources (ABCD-Archives, DwC-Archive):

One search response: the downloaded archive, unzipped in a folder named “archive”.
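For classic datasources, the number of request files to expect follows from the inventory size and the range. The sketch below assumes that requests are split into consecutive blocks of at most 200 units and that the enumerated extension is a zero-padded three-digit counter, as the example search_request.000 suggests. Both points are assumptions about the file layout, and the class name is hypothetical.

```java
// Hypothetical helper, for illustration only (not part of B-HIT).
final class HarvestBatchSketch {

    /** Default number of units per request, as stated in the text. */
    static final int DEFAULT_RANGE = 200;

    /** Number of search requests needed to cover unitCount units (ceiling division). */
    static int requestCount(int unitCount, int range) {
        if (unitCount < 0 || range <= 0) {
            throw new IllegalArgumentException("unitCount must be >= 0 and range > 0");
        }
        return (unitCount + range - 1) / range;
    }

    /** File name of the request dispatched at the given zero-based position (assumed format). */
    static String requestFileName(int index) {
        return String.format("search_request.%03d", index);
    }
}
```

With these assumptions, an inventory of 450 units would lead to three requests, named search_request.000 to search_request.002.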

Process harvested records

In this operation, an Operator (BioDatasource) collects all the search responses, parses them, and writes the parsed values to the database. When a dataset is re-harvested, the system checks whether the data changed since the last harvest: it calculates the new checksum of each downloaded file and modifies data in the DB if and only if the checksum changed. Each checksum is stored in the DB table « sha1responses » (see the BHit_documentation for more information).

Tables starting with “raw” are filled first (e.g. rawoccurrence, rawcoordinates).
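The checksum comparison used when re-harvesting can be pictured as follows. The table name « sha1responses » suggests SHA-1, which is what this sketch assumes; how B-HIT actually stores and compares the values is not shown here, and the class and method names are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only (not part of B-HIT).
final class ChecksumSketch {

    /** Returns the SHA-1 digest of the content as 40 lowercase hex characters. */
    static String sha1Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** True if the file is new (no stored checksum) or its content differs from the stored one. */
    static boolean hasChanged(String storedChecksum, byte[] downloaded) {
        return storedChecksum == null || !storedChecksum.equalsIgnoreCase(sha1Hex(downloaded));
    }
}
```

Only files for which hasChanged returns true would need to be parsed and written to the database again.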

Sibling units

Sibling units are units associated within a dataset (e.g. split herbarium sheets).

Associated datasets

Units can have associations to other specimen or observation data, within the same dataset or with an external dataset. These associations are stored in the association table (link between 2 units).

If it is required to add an external or new dataset, the datasource checkbox will be slightly different (see Figure 8).


Figure 8 Display of dataset with or without associated datasources

Extra units

A selection of records can be retrieved for existing datasources. The tool loads a text file containing the record identifiers (UnitID) and retrieves only those records. Format: one UnitID per line.
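A file in this format can be prepared or checked with a few lines of code. The sketch below reads one UnitID per line; trimming whitespace, skipping blank lines and dropping repeated identifiers are assumptions made for the example and not documented B-HIT behaviour. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only (not part of B-HIT).
final class UnitIdListSketch {

    /** Splits the file content into UnitIDs, keeping the order of first appearance. */
    static List<String> parse(String fileContent) {
        Set<String> ids = new LinkedHashSet<>();
        for (String line : fileContent.split("\\R")) { // \R matches any line break
            String id = line.trim();
            if (!id.isEmpty()) {
                ids.add(id); // a LinkedHashSet silently ignores repeated identifiers
            }
        }
        return new ArrayList<>(ids);
    }
}
```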

Logs

Logs can be viewed in the Console tab. The Tomcat logs (e.g. catalina.out) can be helpful for debugging.

Datasource management

Datasources (harvester and metadata factories) can be deleted from the database with a simple button click. The selected source will be completely removed from the database, together with all the records belonging to it. Be careful!

Quality

Quality tests can be triggered from the Quality tab if the option was activated (see configuration “qualityOnOff”).

Country names

The first test translates the country name into English. The accepted input values are extracted from the multiple language files and are complemented with the most common errors and typos encountered (e.g. “Iatlien” instead of “Italien”, missing spaces, etc.). Providers also commonly map states and regions as countries: their affiliation has also been added to the known inputs.

If the original country no longer exists or if its borders do not fit any current country, the corrected value will be set to “Unknown or unspecified continent” (continent being replaced by Europe, Eurasia, Asia, North and Central America, South America, Oceania or Antarctica). “Unknown or unspecified country” will be inserted in the database for empty values and for values that cannot be characterised.

The second test tries to extract the country from the locality and the gathering areas.
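The first test can be pictured as a lookup in a table of known inputs. The sketch below is a simplified illustration: the handful of entries is made up for the example (only “Iatlien” / “Italien” comes from the text), whereas the real list is built from the language files and the recorded typos. The class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of B-HIT).
final class CountryNameSketch {

    static final String UNKNOWN = "Unknown or unspecified country";

    // Tiny sample of accepted inputs; the keys are lowercase.
    private static final Map<String, String> KNOWN_INPUTS = new HashMap<>();
    static {
        KNOWN_INPUTS.put("italy", "Italy");
        KNOWN_INPUTS.put("italien", "Italy");      // German name
        KNOWN_INPUTS.put("iatlien", "Italy");      // typo quoted in the text
        KNOWN_INPUTS.put("germany", "Germany");
        KNOWN_INPUTS.put("deutschland", "Germany");
    }

    /** Returns the English country name, or the "unknown" label for empty or unrecognised input. */
    static String toEnglish(String raw) {
        if (raw == null) {
            return UNKNOWN;
        }
        String english = KNOWN_INPUTS.get(raw.trim().toLowerCase(Locale.ROOT));
        return english != null ? english : UNKNOWN;
    }
}
```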

ISO-codes

The third test standardises the ISO code to its 2-letter form. The original value can be correct or unavailable, can be replaced (ISO code that no longer exists, e.g. SU) or corrected (3-letter code to 2-letter code).

The fourth test compares the validated ISO code and the validated country name. During this comparison, if no coordinates are available, a missing value is inferred from the available country or from the available ISO code (e.g. ISO-code=ZZ (unknown), country name=Germany, coordinates empty -> ISO-code inferred as “DE”). Inferring a value leads to a warning in the database.
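The 3-letter to 2-letter correction and the inference of a missing code can be sketched with the country data shipped with the JDK. This is only an approximation of the third and fourth tests: it relies on java.util.Locale rather than on the toolkit's own tables, does not handle retired codes such as SU, and uses the hypothetical names IsoCodeSketch, normalise and inferMissingIso.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;
import java.util.Set;

// Hypothetical helper, for illustration only (not part of B-HIT).
final class IsoCodeSketch {

    static final String UNKNOWN = "ZZ";

    private static final Set<String> ALPHA2 = new HashSet<>(Arrays.asList(Locale.getISOCountries()));
    private static final Map<String, String> ALPHA3_TO_ALPHA2 = new HashMap<>();
    private static final Map<String, String> NAME_TO_ALPHA2 = new HashMap<>();
    static {
        for (String alpha2 : ALPHA2) {
            Locale country = new Locale.Builder().setRegion(alpha2).build();
            NAME_TO_ALPHA2.put(country.getDisplayCountry(Locale.ENGLISH), alpha2);
            try {
                ALPHA3_TO_ALPHA2.put(country.getISO3Country(), alpha2);
            } catch (MissingResourceException e) {
                // no 3-letter code known for this entry: skip it
            }
        }
    }

    /** Returns the 2-letter code for a 2- or 3-letter input, or ZZ if it is not recognised. */
    static String normalise(String raw) {
        if (raw == null) {
            return UNKNOWN;
        }
        String code = raw.trim().toUpperCase(Locale.ROOT);
        if (ALPHA2.contains(code)) {
            return code;
        }
        return ALPHA3_TO_ALPHA2.getOrDefault(code, UNKNOWN);
    }

    /** Infers a missing code from the English country name, but only when no coordinates exist. */
    static String inferMissingIso(String isoCode, String englishCountryName, boolean hasCoordinates) {
        String code = normalise(isoCode);
        if (!UNKNOWN.equals(code) || hasCoordinates) {
            return code;
        }
        return NAME_TO_ALPHA2.getOrDefault(englishCountryName, UNKNOWN);
    }
}
```

In a real run, a value returned by the inference branch would additionally have to be recorded as a warning.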

Coordinates

The fifth test checks the validity of the coordinates: the text values are parsed into decimal values and their ranges are checked (latitude between -90 and +90, longitude between -180 and +180).
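A minimal sketch of such a validity check is given below. Accepting a decimal comma is an assumption added for the example, and the class and method names are hypothetical.

```java
import java.util.Optional;

// Hypothetical helper, for illustration only (not part of B-HIT).
final class CoordinateRangeSketch {

    /** Parses a text value into a finite decimal number, or returns empty if that is not possible. */
    static Optional<Double> parseDecimal(String text) {
        if (text == null) {
            return Optional.empty();
        }
        try {
            // Assumption: providers sometimes use a decimal comma.
            double value = Double.parseDouble(text.trim().replace(',', '.'));
            return Double.isFinite(value) ? Optional.of(value) : Optional.empty();
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    /** True if both values parse and lie within [-90, 90] and [-180, 180] respectively. */
    static boolean isValid(String latitude, String longitude) {
        Optional<Double> lat = parseDecimal(latitude);
        Optional<Double> lon = parseDecimal(longitude);
        return lat.isPresent() && lon.isPresent()
                && Math.abs(lat.get()) <= 90.0
                && Math.abs(lon.get()) <= 180.0;
    }
}
```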

The sixth test will compare the cleaned country data and the coordinates, using different methods and services. If the system previously detected an inconsistency between the ISO-code and the country name, it will try to improve the data based on the coordinates, and correct either the ISO-code or the country name.

If the original coordinate values fit neither the country field nor the ISO-code field, the quality process checks the opposite values of the coordinates (+latitude +longitude; +latitude -longitude; -latitude +longitude; -latitude -longitude) and tries to add a leading digit (many latitudes were truncated and are missing their first digit). It also tries swapping the latitude and longitude values.
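The alternative coordinates tried by this step can be enumerated as in the sketch below. The order of the candidates and the way a leading digit is added (a tens digit placed in front of a value below 10) are assumptions made for the example. Each candidate would then be checked against the country, which is not shown. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of B-HIT).
final class CoordinateCandidatesSketch {

    static boolean inRange(double lat, double lon) {
        return Math.abs(lat) <= 90.0 && Math.abs(lon) <= 180.0;
    }

    /** All sign combinations of (lat, lon) and of the swapped pair that stay within valid ranges. */
    static List<double[]> signAndSwapCandidates(double lat, double lon) {
        List<double[]> out = new ArrayList<>();
        double[][] orders = { { lat, lon }, { lon, lat } }; // original order, then swapped
        double[] signs = { 1.0, -1.0 };
        for (double[] pair : orders) {
            for (double latSign : signs) {
                for (double lonSign : signs) {
                    double candLat = latSign * Math.abs(pair[0]);
                    double candLon = lonSign * Math.abs(pair[1]);
                    if (inRange(candLat, candLon)) {
                        out.add(new double[] { candLat, candLon });
                    }
                }
            }
        }
        return out;
    }

    /** Values obtained by putting a digit 1..9 in front of a value below 10, keeping the sign. */
    static List<Double> withLeadingDigit(double value, double maxAbs) {
        List<Double> out = new ArrayList<>();
        double abs = Math.abs(value);
        if (abs >= 10.0) {
            return out; // already has a tens digit: nothing to add in this sketch
        }
        double sign = value < 0 ? -1.0 : 1.0;
        for (int digit = 1; digit <= 9; digit++) {
            double candidate = digit * 10.0 + abs;
            if (candidate <= maxAbs) {
                out.add(sign * candidate);
            }
        }
        return out;
    }
}
```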

Tests on coordinates reuse existing code from Gisgraphy, KDTree and Geonames.

Date

Finally, the gathering dates are checked and converted into the YYYY-MM-DD format, and the gathering year is extracted.
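The conversion can be illustrated with java.time. The three accepted input patterns below are assumptions chosen for the example; the formats actually recognised by B-HIT are not listed in this document. The class name is hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, for illustration only (not part of B-HIT).
final class GatheringDateSketch {

    // Assumed input formats; STRICT rejects impossible dates such as 31 February.
    private static final List<DateTimeFormatter> ACCEPTED = Arrays.asList(
            DateTimeFormatter.ofPattern("uuuu-M-d").withResolverStyle(ResolverStyle.STRICT),
            DateTimeFormatter.ofPattern("d.M.uuuu").withResolverStyle(ResolverStyle.STRICT),
            DateTimeFormatter.ofPattern("d/M/uuuu").withResolverStyle(ResolverStyle.STRICT));

    /** Tries each accepted format in turn and returns the first successful parse. */
    static Optional<LocalDate> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        for (DateTimeFormatter format : ACCEPTED) {
            try {
                return Optional.of(LocalDate.parse(raw.trim(), format));
            } catch (DateTimeParseException e) {
                // not this format: try the next one
            }
        }
        return Optional.empty();
    }

    /** The date in YYYY-MM-DD form, if the input could be parsed. */
    static Optional<String> toIso(String raw) {
        return parse(raw).map(date -> date.format(DateTimeFormatter.ISO_LOCAL_DATE));
    }

    /** The gathering year, if the input could be parsed. */
    static Optional<Integer> year(String raw) {
        return parse(raw).map(LocalDate::getYear);
    }
}
```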

Scientific names

During the quality task, the scientific names are parsed using the GBIF Name Parser. Unfortunately, this tool cannot handle all names: understandably not the names containing errors, but also some names that are correct according to the taxonomic nomenclatures (emendavit, nominate…). About 50 regular expressions have been defined in order to parse the scientific names that the GBIF Name Parser could not handle (Unicode characters, ex., forma, emend., names that do not conform to the taxonomic nomenclatures, most common typos).
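To give an idea of what such a fallback expression can look like, the sketch below recognises a genus, a species epithet and an optional infraspecific rank (such as forma) with its epithet. It is an invented illustration and not one of the roughly 50 expressions used by B-HIT; the class name and the rank list are assumptions.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of B-HIT).
final class NameFallbackSketch {

    // Genus, species epithet, then an optional rank marker followed by an infraspecific epithet.
    private static final Pattern NAME = Pattern.compile(
            "^(\\p{Lu}\\p{Ll}+)\\s+(\\p{Ll}[\\p{Ll}-]*)"
            + "(?:\\s+(subsp\\.|var\\.|f\\.|forma)\\s+(\\p{Ll}[\\p{Ll}-]*))?");

    /** Returns {genus, species, rank, infraspecific epithet}; the last two are null when absent. */
    static Optional<String[]> parse(String name) {
        if (name == null) {
            return Optional.empty();
        }
        Matcher m = NAME.matcher(name.trim());
        if (!m.lookingAt()) { // only the beginning has to match; authorship is ignored
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2), m.group(3), m.group(4) });
    }
}
```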

See the BHit_documentation for more information regarding the quality tests.

Quality reports

Every quality test generates logs, which are saved in the database in specific tables. From the Quality tab, it is also possible to extract these logs from the database and export them into text files with tab-separated values. One file is created per dataset and per test, if and only if the test modified the original content. In the worst case, this generates n x m files (n = number of datasources, m = number of tests run). The system only extracts problematic rows, i.e. rows for which the test failed or generated a warning. These log files contain the name of the test executed, the original values, the edited values, and the list of units concerned by the changes. These files can be sent to the data provider, who will then decide whether to modify the data.
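One row of such a tab-separated export could be built as in the sketch below. The column order follows the description above (test name, original value, edited value, units), but the exact layout of the exported files, the comma used to join the unit IDs and the replacement of embedded tabs are assumptions. The class name is hypothetical.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only (not part of B-HIT).
final class QualityReportRowSketch {

    /** Replaces tabs and line breaks so that a field cannot break the column layout. */
    static String clean(String field) {
        return field == null ? "" : field.replaceAll("[\\t\\r\\n]+", " ");
    }

    /** Builds one tab-separated row: test name, original value, edited value, unit IDs. */
    static String row(String testName, String original, String edited, List<String> unitIds) {
        StringJoiner joiner = new StringJoiner("\t");
        joiner.add(clean(testName));
        joiner.add(clean(original));
        joiner.add(clean(edited));
        joiner.add(clean(String.join(",", unitIds))); // assumed separator for the unit list
        return joiner.toString();
    }
}
```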

Reports

The lists of missing units (based on the datasources inventory files) can be generated from the report tab.

Data viewer

The cleaned data can be displayed in this tab. If the quality tests did not run, it will remain empty!
