Open Refine Data Cleaning
Contents
1 Introduction
Data cleaning is an important part of the scientific data management and curation process and involves activities such as identification and correction of spelling errors (for example on scientific names), mapping data to agreed controlled vocabularies (e.g. ISO country names), setting correct references to external files (e.g. digital images), as well as re-arranging data so that they comply with standards required for specific data processing steps (e.g. imports into the JACQ collection management system at the BGBM).
The BGBM scientific data workflows are based on the assumption that scientists are cleaning their data themselves to meet the BGBM data formats and standards. To support this important step of the scientific process, we provide an open installation of OpenRefine (formerly known as Google Refine) on a BGBM server at http://services.bgbm.org/ws/OpenRefine/.
!! Please note that the BGBM OpenRefine server is only intended for the use as a tool for working on your data. It does not offer functions for storing data permanently and safely. We will however not delete any data without notice !!
Like Microsoft Excel, OpenRefine operates primarily on tabular data with rows representing a measurement, occurrence record, or specimen for example and columns representing specific attributes (e.g. collectors, locality, longitude, latitude). Unlike Microsoft Excel, OpenRefine focusses specifically on functions and operations which are particularly useful for cleaning and arranging scientific data. This includes, for example, the automated detection and merging of (potential) duplicate data items such as collector names, which have been captured using slightly different spellings but are referring to the same persons or person teams. The BGBM BDI team is constantly adding functions to the OpenRefine server installation, which are particularly useful in the context of botanic collection data processing. Please contact us for further information (email).
2 (Very) first steps
OpenRefine is extremely powerful and we will definitely not be able to compete with dozens of already existing excellent manuals and tutorial videos. Please refer to the Resources section for further information. Here, we would like to guide you through the very first steps with OpenRefine using the example of a BGBM Collection Data Form:
- Open your web browser and navigate to the BGBM OpenRefine installation at http://services.bgbm.org/ws/OpenRefine/.
- Press the “Choose File/Browse”-Button and select a CDF-File of your choice.
- Press “Next”.
- You will now be greeted with a first preview of your data. Since your CDF-File consisted of several (Excel) worksheets, OpenRefine needs to know, which of them need to be considered for importing data. In this case, de-select all worksheets except the collection dataform worksheet (see fig. 1). OpenRefine also needs to know, which row contains the relevant header of your worksheet. For the CDF you will have to “ignore first 3 line(s)” and “parse next 1 line(s)”.
- Finally, press the “create project” button and you are ready to go.
Fig.1: selection of worksheets and header row.
3 E-Mesh Extension for OpenRefine: User Guide
The E-Mesh extension helps you clean, validate, and enrich botanical specimen data in OpenRefine. It supports two data standards: CDF (Collection Data Format) and JACQ for herbarium specimens.
3.1 What You Can Do
3.1.1 Validate Your Data
Turn on real-time validation to see which cells contain errors:
- Go to View menu → Show / Hide data validity by CDF standard or JACQ standard
- Valid cells appear green, invalid cells appear red with error tooltips
- Click View → Collapse all-valid columns to hide columns with no errors
3.1.2 Transform Geographic Coordinates
Convert between coordinate formats using column menus:
- DMS to Decimal Degrees: Convert coordinates like "40°26'46"N" to 40.446111
- Decimal to DMS: Convert decimal coordinates back to degrees/minutes/seconds format
- Split DMS: Break DMS coordinates into separate degree, minute, second, and direction columns
3.1.3 Enrich Data from JACQ Database
Look up taxonomic and specimen information from the JACQ herbarium management system:
- Search by taxonID: Get scientific names, UUIDs, and taxonomic details
- Search by UUID: Retrieve taxon information from universal identifiers
- Search by scientific name: Find matching taxa with wildcard search
- Search by specimenID: Get specimen details and stable identifiers
- Autocomplete scientific names: Get suggestions for partial scientific names
3.1.4 Verify Data Against Local Database
Check your data against reference databases:
- CollectorID for Collector: Verify collector names and get IDs
- Is Voucher ID in JACQ?: Check if specimen vouchers exist
- SeriesID for Series: Look up series identifiers
- AuthorID for Author: Verify author names
- Is specimen_ID checked and accessible?: Validate specimen accessibility in JACQ
3.1.5 Reconcile Taxonomic Names
Match scientific names in your data against the JACQ database using OpenRefine's reconciliation interface:
- Click the column dropdown menu → Reconcile → Start reconciling...
- Choose JACQ as the service
- Select "Taxon" as the reconciliation type
- Review suggested matches with confidence scores
- Accept matches individually or in bulk
- Preview taxon details in the right-side panel before accepting
3.1.6 Normalize Column Headers
Map your column names to standard names:
- Go to Edit headers menu
- Choose Normalize headers to CDF standard or JACQ standard
- The dialog can auto-suggest mappings based on similarity
- Drag and drop to adjust mappings manually
3.1.7 Export to Standard Formats
Export your cleaned data in standardized formats:
- Click Export → CDF... or JACQ...
- Map your columns to required standard fields
- Add fixed values for missing required fields
- The extension validates that all required fields are mapped before export
3.1.8 Transform Collection Data
Convert between collection name formats:
- To coll_short: Convert to abbreviated collection codes
- To collection: Convert to full collection names
- To collectionID: Convert to collection identifiers
3.2 How to Access Features
All features are available through:
- Column dropdown menus (click the triangle next to any column name)
- All → View menu (for validation toggles)
- Edit headers menu (for column normalization)
- Export menu (for standardized exports)
4 Ressources
The OpenRefine website (http://openrefine.org/) provides a very nice compilation of tutorial videos, which will help you to understand the basic concepts of data exploration, cleaning, and transformation. There is also a written manual in the form of a wiki-site with common use cases and links to external tutorials in different languages (https://github.com/OpenRefine/OpenRefine/wiki).