
With the rapidly growing number of data publishers, harvesting and indexing information to offer advanced search and discovery has become a critical bottleneck in globally distributed primary biodiversity data infrastructures. The Global Biodiversity Information Facility (GBIF) implemented a Harvesting and Indexing Toolkit (HIT), which largely automates data harvesting activities for hundreds of collection and observational data providers. The team of the Botanic Garden and Botanical Museum Berlin-Dahlem has extended this well-established system into the Berlin Harvesting and Indexing Toolkit (B-HIT), adding a range of functions, including improved processing of multiple taxon identifications, the ability to represent associations between specimen and observation units, new data quality control and new reporting capabilities. The open source software B-HIT can be freely installed and used to set up thematic networks serving the demands of particular user groups.

B-HIT harvests biodiversity data and indexes them in a MySQL database, using established pipelines such as BioCASe and IPT. After harvesting, B-HIT performs several data cleaning steps. The original provider data are stored in parallel to the cleaned data.
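The idea of keeping original provider values next to their cleaned counterparts can be illustrated with a minimal sketch. It is not B-HIT's actual schema or cleaning logic: the table `unit`, its columns, and the `clean_name` step are hypothetical, and an in-memory SQLite database stands in for MySQL so that the example runs without a server.

```python
import sqlite3


def clean_name(raw):
    """Hypothetical cleaning step: trim and collapse whitespace."""
    return " ".join(raw.split())


def index_records(conn, records):
    """Store each harvested value twice: as delivered and cleaned."""
    # Assumed layout: one row per unit, original and cleaned side by side.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS unit ("
        "unit_id TEXT PRIMARY KEY, "
        "name_original TEXT, "
        "name_cleaned TEXT)"
    )
    rows = [(unit_id, name, clean_name(name)) for unit_id, name in records]
    conn.executemany("INSERT OR REPLACE INTO unit VALUES (?, ?, ?)", rows)
    conn.commit()


# SQLite in memory is only a stand-in for the MySQL index database.
conn = sqlite3.connect(":memory:")
index_records(conn, [("unit-1", "  Abies   alba Mill. ")])
print(conn.execute("SELECT name_original, name_cleaned FROM unit").fetchall())
```

Because the original value is never overwritten, a cleaning rule can be revised and re-applied later without harvesting the provider again.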

B-HIT is currently used by BiNHum, OpenUp!, World Flora Online and GGBN. We recommend setting up a SOLR instance in addition to the MySQL database to speed up queries in a data portal. See the Wiki of the GGBN Portal Software for further details about the SOLR instance and about an open source portal software optimized for use with B-HIT.

Supported schemata and protocols

B-HIT supports all established collection data exchange standards. In addition, extensions developed within special interest networks (GGBN, EFG (Geosciences)) are supported.

ABCD: 2.06, 2.1, EFG, GGBN, GGBN Enviro, ABCD - Archive

DwC: DwC 1.0, 1.4, 1.4-Geospatial, 1.4-Curatorial, MaNIS 1.0, MaNIS 1.21, DwC Archive, DwC GGBN