Open Refine Data Cleaning
== E-Mesh Extension ==

=== Overview ===

The e-mesh extension is an OpenRefine plugin for managing and validating botanical specimen data. It provides specialized tools for data quality checking, field normalization, and enrichment through integration with the JACQ herbarium management system. The extension supports two primary data standards: CDF (Collection Data Form) and JACQ import files; further data standards can be added as required.

This section provides a high-level overview of the extension's architecture, components, and workflows; a step-by-step user guide follows further below.
=== System Architecture ===

The e-mesh extension operates as a three-tier system within OpenRefine, consisting of client-side JavaScript UI enhancements, server-side Java data processing, and configuration-driven validation rules.
  
=== Core Data Standards ===

The extension enforces data quality through two constraint schemas that define required fields, data types, controlled vocabularies, and other validation rules.

==== CDF Standard Fields (Subset) ====

The CDF standard defines specimen documentation requirements:
{| class="wikitable"
! Example Categories !! Example Fields !! Example Validation
|-
| Collection Data || Collectors, Collection Date from, Collection Date to || Required, ISO-8601 dates
|-
| Geographic || Latitude, Longitude, Geocode Method || Required, decimal degrees, range validation
|-
| Taxonomic || Family, Genus, Specific Epithet || Required fields
|}
 
  
'''Sources'''
  
<code>src/main/resources/rules/constraints-CDF.json</code> (lines 1–575)
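To illustrate how such a schema is structured, the sketch below shows what a constraint entry could look like, written as a JavaScript object literal. The property names (<code>required</code>, <code>type</code>, <code>aliases</code>, <code>enum</code>) and the example values are illustrative and may differ from the keys actually used in <code>constraints-CDF.json</code>.

<pre>
// Illustrative sketch only; the real schema lives in constraints-CDF.json.
const exampleConstraints = {
  "Latitude": {
    required: true,                    // must be present (RequiredRule)
    type: "number",                    // numeric value (NumberRule)
    minimum: -90,                      // decimal-degree range check
    maximum: 90,
    aliases: ["lat", "latitude_dec"]   // hypothetical alternative headers
  },
  "Geocode Method": {
    required: true,
    type: "string",
    enum: ["GPS", "map", "gazetteer"]  // hypothetical controlled vocabulary
  }
};
</pre>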
==== JACQ Standard Fields (Subset) ====
The JACQ standard covers the fields required for importing herbarium specimen records into the JACQ herbarium management system:
{| class="wikitable"
! Example Categories !! Example Fields !! Example Validation
|-
| Specimen Identity || HerbNummer, collectionID, CollectionNumber || Required
|-
| Collection || Sammler, Datum, Datum2 || Required collector and date
|-
| DMS Coordinates || coord_NS, lat_degree, lat_minute, lat_second || Dependent required, 0–90° latitude
|}
 
  
'''Sources'''
<code>src/main/resources/rules/constraints-JACQ.json</code> (lines 1–137)
=== Major Subsystems ===
==== Column Menu Extensions ====
The column menu extensions provide custom operations accessible from column headers, organized into four functional groups:
[[File:e-mesh-column-menu-extension-groups.png|500px]]
'''Sources'''
<code>module/scripts/project/data-table-column-header/column-header-extend-menu.js</code> (lines 1–1532)
'''Truncate Utility Operation'''
 
 
A single utility operation "To length of..." provides an interactive dialog for truncating string values to a specified length.
 
 
 
This operation demonstrates integration with OpenRefine's <code>ExpressionPreviewDialog</code> Widget, allowing users to:
 
 
 
* Enter a custom truncation length (default: 5)
 
* Preview the transformation results
 
* Specify error handling behavior
 
* Name the output column
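The transformation generated by this dialog essentially cuts a string down to the chosen length. A minimal sketch of that logic (not the actual implementation):

<pre>
// Sketch of the truncation logic behind "To length of..." (illustrative only).
function truncateToLength(value, maxLength) {
  if (value === null || value === undefined) {
    return value;                     // leave empty cells untouched
  }
  const text = String(value);
  return text.length > maxLength ? text.substring(0, maxLength) : text;
}

truncateToLength("Berolinensis", 5);  // "Berol"
</pre>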
 
 
 
'''Sources'''
 
 
 
<code>module/scripts/project/data-table-column-header/column-header-extend-menu.js</code> (lines 184-253)
 
 
 
==== Schema Dialogs ====
 
 
 
Two specialized dialogs enable mapping project columns to standardized schemas:
 
 
 
* <code>RenameDialog</code>: Normalizes project column headers to CDF or JACQ standards
 
* <code>ExportDialog</code>: Maps columns and exports data in standardized formats with validation
 
 
 
Both dialogs implement a four-stage auto-mapping algorithm:
 
 
 
# Exact case-insensitive match
# Alias match against any entry in <code>constraints[field].aliases</code>
# Predefined mapping lookup from <code>mappings.json</code>
# Cosine similarity matching (threshold: 0.75; see the sketch below)
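As an illustration of the last stage, cosine similarity can be computed over character-frequency vectors of the two header names. The sketch below is a simplified, character-based version, not the extension's actual implementation:

<pre>
// Illustrative cosine similarity between two header strings (character-based).
function charCounts(s) {
  const counts = {};
  for (const ch of s.toLowerCase()) {
    counts[ch] = (counts[ch] || 0) + 1;
  }
  return counts;
}

function cosineSimilarity(a, b) {
  const ca = charCounts(a), cb = charCounts(b);
  let dot = 0, na = 0, nb = 0;
  for (const ch in ca) { na += ca[ch] * ca[ch]; if (cb[ch]) dot += ca[ch] * cb[ch]; }
  for (const ch in cb) { nb += cb[ch] * cb[ch]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// A column is auto-mapped only if the best match reaches the 0.75 threshold.
cosineSimilarity("Collection Date from", "CollectionDateFrom") > 0.75;  // true
</pre>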
 
 
 
'''Sources'''
 
 
 
<code>module/scripts/project/data-table-header-all/rename-dialog.js</code> (lines 1–404)
 
 
 
<code>module/scripts/project/project-controls-export/exporter-dialog.js</code> (lines 1–539)
 
 
 
==== Validation System ====
 
 
 
Real-time client-side validation using a constraint-driven rule engine:
 
 
 
[[File:E-mesh-validation-system.png|900px]]
 
 
 
'''Validation Rule Classes'''
 
 
 
All validation rules implement a common interface:
 
 
 
<pre>
class ExampleRule {
  validate(value) { /* Returns boolean */ }
  getMessage(columnName) { /* Returns error message string */ }
  getSuggestion(columnName) { /* Returns suggestion string */ }
  getTransformationFunction() { /* Returns function or empty string */ }
}
</pre>
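As an example, a rule for decimal-degree latitudes could implement this interface roughly as follows (a simplified sketch, not the code from <code>validator.js</code>):

<pre>
// Simplified sketch of a latitude rule following the interface above.
class LatitudeRuleSketch {
  validate(value) {
    if (value === null || value === undefined || String(value).trim() === "") return false;
    if (!/^-?\d+(\.\d+)?$/.test(String(value).trim())) return false;   // numeric format
    const num = parseFloat(value);
    return num >= -90 && num <= 90;                                    // decimal-degree range
  }
  getMessage(columnName) {
    return columnName + " must be a decimal latitude between -90 and 90.";
  }
  getSuggestion(columnName) {
    return "Check the sign and remove any degree symbols in " + columnName + ".";
  }
  getTransformationFunction() {
    return "";   // no automatic fix offered
  }
}
</pre>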
 
 
 
'''Rules'''
 
 
 
The validator.js file defines 17 validation rule classes:
 
 
 
# RequiredRule (lines 39-55): Validates that a field has a non-null, non-undefined, non-empty value.
 
# DependentRequiredRule (lines 58-121): Validates that a field is required when any of its dependent columns have values (OR logic).
 
# StringRule (lines 123-142): Validates that a value is of type string.
 
# NumberRule (lines 144-187): Validates that a value is a valid number by parsing with parseFloat() and ensuring the entire string is numeric.
 
# BooleanRule (lines 189-208): Validates that a value is of type boolean.
 
# DateRule (lines 210-229): Validates that a value can be parsed as a valid date using Date.parse().
 
# MySqlIntRule (lines 231-253): Validates that a value matches the MySQL integer format (optional minus sign followed by digits only).
 
# IntMaximumRule (lines 255-282): Validates that an integer value is less than or equal to a specified maximum.
 
# IntMinimumRule (lines 284-311): Validates that an integer value is greater than or equal to a specified minimum.
 
# MySqlFloatRule (lines 313-333): Validates that a value can be parsed as a MySQL floating-point number.
 
# DateTimeRule (lines 335-355): Validates that a value matches the MySQL datetime format YYYY-MM-DD HH:MM:SS.
 
# TimeRule (lines 357-379): Validates that a value matches the MySQL time format HH:MM:SS.
 
# ISO8601CalendarDateRule (lines 381-414): Validates that a value is a valid ISO 8601 calendar date in format YYYY-MM-DD with actual date validation.
 
# LatitudeDecimalDegreeRule (lines 416-445): Validates that a value is a valid latitude in decimal degrees (between -90 and 90).
 
# LongitudeDecimalDegreeRule (lines 447-476): Validates that a value is a valid longitude in decimal degrees (between -180 and 180).
 
# EnumRule (lines 478-504): Validates that a value matches one of the allowed values in a controlled vocabulary (case-insensitive).
 
# LengthRule (lines 506-535): Validates that a string value has exactly the expected character length and provides a transformation function to truncate to that length.
 
# TextContainsRule (lines 537-565): Validates that a value contains at least one of the required substrings (case-insensitive).
 
 
 
These rules are instantiated and applied by the Validator class's <code>validateCell()</code> method (lines 580-801), which checks constraints in a specific order: required rules first, then dependent required rules, then type-specific rules, and finally value constraint rules.
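Conceptually, this ordering can be pictured as follows; the sketch assumes each rule object is tagged with a hypothetical <code>stage</code> property and is not the actual <code>validateCell()</code> code:

<pre>
// Simplified sketch of the per-cell rule ordering (illustrative only).
function validateCellSketch(columnName, value, rules) {
  const order = ["required", "dependentRequired", "type", "valueConstraint"];
  for (const stage of order) {
    for (const rule of rules.filter(r => r.stage === stage)) {
      if (!rule.validate(value)) {
        // First violated rule determines the message shown in the tooltip.
        return { valid: false, message: rule.getMessage(columnName) };
      }
    }
  }
  return { valid: true };
}
</pre>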
 
 
 
'''Visual Feedback''':
 
 
 
* Green: Valid cells
 
* Red: Invalid cells (with tooltip showing violation)
 
* Gray: No rule applies
 
* Fix buttons: Quick-action remediation for common issues
 
 
 
'''Sources'''
 
 
 
<code>module/scripts/project/data-table-header-all/validation-cell-renderer-jacq.js</code> (lines 1–224)
 
 
 
<code>module/scripts/project/data-table-header-all/validation-cell-renderer-cdf.js</code> (lines 1–224)
 
 
 
<code>module/scripts/validator.js</code> (lines 1–840)
 
 
 
==== Database Search Operations ====
 
 
 
The "Search databases" submenu provides five operations that query the JACQ  database and to validate or enrich data with identifiers.
 
 
 
'''Sources'''
 
 
 
<code>module/scripts/project/data-table-column-header/column-header-extend-menu.js</code> (lines 258-383)
 
 
 
<code>src/main/java/org/openrefine/extensions/emesh/db/DBConnection.java</code> (lines 258-383)
 
 
 
==== JACQ API Integration ====
 
 
 
The <code>JacqClient</code> class provides HTTP communication with the JACQ API.
 
 
 
'''Key Features:'''
 
 
 
* Caching: Caffeine cache with 10,000-entry maximum and 5-minute expiration
 
* Rate Limiting: Bursty rate limiter allowing 5 requests/second
 
* Concurrency Control: Bulkhead pattern limiting 6 concurrent requests
 
* Retry Logic: Exponential backoff (150ms to 2s) with jitter, honors Retry-After headers
 
* Error Handling: Custom TransientHttpException for retriable failures
 
 
 
'''Sources'''
 
 
 
<code>src/main/java/org/openrefine/extensions/emesh/jacq/JacqClient.java</code> (lines 1-355)
 
 
 
=== Primary Data Workflows ===
 
 
 
The extension supports three primary workflows that users execute to clean, enrich, and export specimen data:
 
 
 
==== Data Enrichment Workflow ====
 
 
 
[[File:E-mesh-data-enrichment-workflow.png|900px]]
 
 
 
Users select enrichment operations from column menus. GREL expressions invoke Java functions that call <code>JacqClient</code>, which applies resilience policies before querying the JACQ API. Results populate new columns.
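As a rough illustration, such a menu item could submit a GREL expression through OpenRefine's standard <code>Refine.postCoreProcess()</code> call; the GREL function name <code>jacqTaxonName()</code> used below is hypothetical, and the real operations are defined in <code>column-header-extend-menu.js</code>:

<pre>
// Sketch: add an enrichment column via a (hypothetical) custom GREL function.
var column = { name: "taxonID" };              // the column the menu was opened on
var columnIndex = 3;                           // its position in the data table
var expression = "grel:jacqTaxonName(value)";  // hypothetical GREL function

Refine.postCoreProcess(
  "add-column",
  {
    baseColumnName: column.name,
    newColumnName: column.name + " (JACQ name)",
    columnInsertIndex: columnIndex + 1,
    onError: "set-to-blank",
    expression: expression
  },
  null,
  { modelsChanged: true }
);
</pre>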
 
 
 
'''Sources'''
 
 
 
<code>module/scripts/project/data-table-column-header/column-header-extend-menu.js</code> (lines 389-495)
 
 
 
<code>src/main/java/org/openrefine/extensions/emesh/jacq/JacqClient.java</code> (lines 108-126)
 
 
 
==== Data Validation Workflow ====
 
 
 
[[File:E-mesh-data-validation-workflow.png|900px]]
 
 
 
Users toggle validation renderers from the View menu. Renderers fetch constraint definitions, then use the <code>Validator</code> class to check each cell against applicable rules. Visual feedback is rendered directly in the data table.
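The visual feedback amounts to mapping a validation result onto a CSS class for each cell. A minimal sketch of the idea (the class names and the <code>validateCell()</code> signature shown here are simplified assumptions):

<pre>
// Sketch: derive a CSS class from a validation result (illustrative only).
function cellClassFor(columnName, value, validator, constraints) {
  if (!constraints[columnName]) {
    return "emesh-cell-norule";                // gray: no rule applies
  }
  var result = validator.validateCell(columnName, value);
  return result.valid ? "emesh-cell-valid"     // green
                      : "emesh-cell-invalid";  // red; tooltip shows result.message
}
</pre>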
 
 
 
'''Sources'''
 
 
 
<code>module/scripts/project/data-table-header-all/header-all-extend-menu.js</code> (lines 92-157)
 
 
 
==== Schema Export Workflow ====
 
 
 
[[File:E-mesh-schema-export-workflow.png|700px]]
 
 
 
Users open the export dialog, which auto-maps project columns to schema fields. After manual adjustments, the <code>SchemaExporter</code> validates required fields, applies mappings and fixed values, and generates a standards-compliant output file.
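The pre-export check is essentially a set comparison between the required schema fields and the fields covered by column mappings or fixed values. A sketch of the idea:

<pre>
// Sketch: verify that every required schema field is covered before exporting.
function findUnmappedRequiredFields(constraints, columnMappings, fixedValues) {
  return Object.keys(constraints)
    .filter(field => constraints[field].required)
    .filter(field => !(field in columnMappings) && !(field in fixedValues));
}

// Export is blocked (and the dialog lists the missing fields) if this is non-empty.
</pre>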
 
 
 
'''Sources'''
 
 
 
<code>module/scripts/project/project-controls-export/exporter-dialog.js</code> (lines 1-539)
 
 
 
<code>src/main/java/org/openrefine/extensions/emesh/exporters/SchemaExporter.java</code> (lines 1-221)
 
  
 

== Introduction ==

Data cleaning is an important part of the scientific data management and curation process. It involves activities such as the identification and correction of spelling errors (for example in scientific names), mapping data to agreed controlled vocabularies (e.g. ISO country names), setting correct references to external files (e.g. digital images), as well as re-arranging data so that they comply with the standards required for specific data processing steps (e.g. imports into the JACQ collection management system at the BGBM).

The BGBM scientific data workflows are based on the assumption that scientists clean their data themselves to meet the BGBM data formats and standards. To support this important step of the scientific process, we provide an open installation of OpenRefine (formerly known as Google Refine) on a BGBM server at http://services.bgbm.org/ws/OpenRefine/.

!! Please note that the BGBM OpenRefine server is only intended for use as a tool for working on your data. It does not offer functions for storing data permanently and safely. We will, however, not delete any data without notice !!

Like Microsoft Excel, OpenRefine operates primarily on tabular data, with rows representing, for example, a measurement, occurrence record, or specimen, and columns representing specific attributes (e.g. collectors, locality, longitude, latitude). Unlike Microsoft Excel, OpenRefine focuses specifically on functions and operations which are particularly useful for cleaning and arranging scientific data. This includes, for example, the automated detection and merging of (potential) duplicate data items such as collector names which have been captured with slightly different spellings but refer to the same persons or person teams. The BGBM BDI team is constantly adding functions to the OpenRefine server installation which are particularly useful in the context of botanic collection data processing. Please contact us for further information (email).

== (Very) first steps ==

OpenRefine is extremely powerful, and we will not attempt to compete with the many excellent manuals and tutorial videos that already exist; please refer to the Resources section for further information. Here, we would like to guide you through the very first steps with OpenRefine using the example of a BGBM Collection Data Form:

# Open your web browser and navigate to the BGBM OpenRefine installation at http://services.bgbm.org/ws/OpenRefine/.
# Press the "Choose File/Browse" button and select a CDF file of your choice.
# Press "Next".
# You will now be greeted with a first preview of your data. Since your CDF file consists of several (Excel) worksheets, OpenRefine needs to know which of them should be considered when importing data. In this case, de-select all worksheets except the collection dataform worksheet (see Fig. 1). OpenRefine also needs to know which row contains the relevant header of your worksheet. For the CDF you will have to "ignore first 3 line(s)" and "parse next 1 line(s)".
# Finally, press the "create project" button and you are ready to go.


[[File:OpRef Screenshot.jpg]]
Fig.1: selection of worksheets and header row.

== E-Mesh Extension for OpenRefine: User Guide ==

The E-Mesh extension helps you clean, validate, and enrich botanical specimen data in OpenRefine. It supports two data standards: CDF (Collection Data Form) and JACQ import files for herbarium specimens.

=== What You Can Do ===

==== Validate Your Data ====

Turn on real-time validation to see which cells contain errors:

# Go to View menu → Show / Hide data validity by CDF standard or JACQ standard
# Valid cells appear green, invalid cells appear red with error tooltips
# Click View → Collapse all-valid columns to hide columns with no errors

==== Transform Geographic Coordinates ====

Convert between coordinate formats using column menus (see the sketch below):

# DMS to Decimal Degrees: Convert coordinates like 40°26'46"N to 40.446111
# Decimal to DMS: Convert decimal coordinates back to degrees/minutes/seconds format
# Split DMS: Break DMS coordinates into separate degree, minute, second, and direction columns
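The arithmetic behind the DMS to decimal conversion is decimal = degrees + minutes/60 + seconds/3600, with the sign flipped for S and W. A small sketch:

<pre>
// Sketch of the DMS → decimal degrees arithmetic (illustrative only).
function dmsToDecimal(degrees, minutes, seconds, direction) {
  var decimal = degrees + minutes / 60 + seconds / 3600;
  return (direction === "S" || direction === "W") ? -decimal : decimal;
}

dmsToDecimal(40, 26, 46, "N");   // ≈ 40.446111
</pre>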

==== Enrich Data from JACQ Database ====

Look up taxonomic and specimen information from the JACQ herbarium management system:

# Search by taxonID: Get scientific names, UUIDs, and taxonomic details
# Search by UUID: Retrieve taxon information from universal identifiers
# Search by scientific name: Find matching taxa with wildcard search
# Search by specimenID: Get specimen details and stable identifiers
# Autocomplete scientific names: Get suggestions for partial scientific names

==== Verify Data Against Local Database ====

Check your data against reference databases:

# CollectorID for Collector: Verify collector names and get IDs
# Is Voucher ID in JACQ?: Check if specimen vouchers exist
# SeriesID for Series: Look up series identifiers
# AuthorID for Author: Verify author names
# Is specimen_ID checked and accessible?: Validate specimen accessibility in JACQ

==== Normalize Column Headers ====

Map your column names to standard names:

# Go to Edit headers menu
# Choose Normalize headers to CDF standard or JACQ standard
# The dialog can auto-suggest mappings based on similarity
# Drag and drop to adjust mappings manually

==== Export to Standard Formats ====

Export your cleaned data in standardized formats:

# Click Export → CDF... or JACQ...
# Map your columns to required standard fields
# Add fixed values for missing required fields
# The extension validates that all required fields are mapped before export

==== Transform Collection Data ====

Convert between collection name formats (see the sketch below):

# To coll_short: Convert to abbreviated collection codes
# To collection: Convert to full collection names
# To collectionID: Convert to collection identifiers
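These conversions are essentially lookups in a collection table. Conceptually, something like the following sketch; the identifiers and names shown are made-up examples, not values from the actual database:

<pre>
// Sketch: collection name conversions as table lookups (example values only).
var collections = [
  { collectionID: 101, coll_short: "B",       collection: "Herbarium Berolinense" },
  { collectionID: 102, coll_short: "B-Algae", collection: "Algae Collection Berlin" }
];

function toCollShort(name) {
  var match = collections.find(function (c) {
    return c.collection === name || String(c.collectionID) === String(name);
  });
  return match ? match.coll_short : null;
}
</pre>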

=== How to Access Features ===

All features are available through:

* Column dropdown menus (click the triangle next to any column name)
* All → View menu (for validation toggles)
* Edit headers menu (for column normalization)
* Export menu (for standardized exports)

== Resources ==

The OpenRefine website (http://openrefine.org/) provides a very nice compilation of tutorial videos, which will help you understand the basic concepts of data exploration, cleaning, and transformation. There is also a written manual in the form of a wiki site with common use cases and links to external tutorials in different languages (https://github.com/OpenRefine/OpenRefine/wiki).