Difference between revisions of "Supporting data preparation software"

From reBiND Documentation
Latest revision as of 17:15, 10 November 2014

Software products to support data preparation

In addition to the core reBiND software (described in the installation guide), several other software tools were created to support data preparation through data cleaning, data substitution and other modifications. These are outlined below.

Data Splitter

A Java-based program was written to facilitate preparation of data where a single field of data in a text file needs to be split into several fields. This ‘Data Splitter’ program requires the user to specify a regular expression that defines where the data in a single field should be split.

The user interface is shown in the screenshot below.

[Screenshot: Data splitter.PNG]

For example, the user enters a regular expression in the Regex box, which atomises the data by splitting it at the specified character; in this case the character was a comma, because the example was a comma-separated values (CSV) file. In the example file the full locality information occurred in one field, but we required it to be atomised. After clicking ‘Split Data’, the original data shown in column 1 is split into several numbered columns (1-7) to the right of this data.

After splitting, if any of the atomised data needs to be recombined, highlight the relevant cells (for example cells 5, 6 and 7) and press Ctrl + J to re-join the columns.
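The split and re-join steps described above can be sketched in a few lines of Java. This is only an illustration of the technique, not the Data Splitter source: the class name `FieldSplitSketch`, the methods `split` and `rejoin`, the separator used when re-joining, and the sample locality string are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the actual Data Splitter code. */
final class FieldSplitSketch {

    /** Splits one field at every match of the user-supplied regular expression. */
    static List<String> split(String field, String regex) {
        // A negative limit keeps trailing empty strings, so column positions stay stable.
        return new ArrayList<>(Arrays.asList(Pattern.compile(regex).split(field, -1)));
    }

    /** Re-joins columns from..to (1-based, inclusive) into a single column. */
    static List<String> rejoin(List<String> columns, int from, int to, String separator) {
        List<String> result = new ArrayList<>(columns.subList(0, from - 1));
        result.add(String.join(separator, columns.subList(from - 1, to)));
        result.addAll(columns.subList(to, columns.size()));
        return result;
    }

    public static void main(String[] args) {
        // Invented sample value standing in for a locality field with seven parts.
        String locality = "Country, Region, Town, Site, Habitat, Note A, Note B";

        // Split at each comma, also consuming any whitespace that follows it.
        List<String> columns = split(locality, ",\\s*");
        System.out.println(columns);

        // Re-join columns 5, 6 and 7, comparable to selecting them and pressing Ctrl + J.
        System.out.println(rejoin(columns, 5, 7, ", "));
    }
}
```

The regular expression `,\s*` is one possible choice for comma-separated data; it removes the space after each comma so that the resulting columns do not start with whitespace.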


Character Encoding Correcter

Stand-alone Correction Manager

The correction manager has been described in detail in the context of the reBiND data portal. It could also be used in stand-alone mode. To enable this, an additional Java main method could be written that specifies the import and export files and the correction configuration file. This would allow the correction modules to be used directly from the command line or via an entirely different user interface, so they could be incorporated into other projects that require automated correction of XML files.
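As a rough sketch of what such a main method might look like, the example below reads an import file, applies a list of correction modules and writes an export file. Everything in it is an assumption made for illustration: the class `StandaloneCorrectionSketch`, the nested `CorrectionModule` interface, the `loadModules` placeholder and the order of the command-line arguments are hypothetical and do not reflect the actual reBiND classes or configuration format.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-alone entry point; the real reBiND API may look quite different. */
final class StandaloneCorrectionSketch {

    /** Hypothetical stand-in for a correction module that transforms XML text. */
    interface CorrectionModule {
        String correct(String xml);
    }

    /** Applies each module in turn to the document text. */
    static String applyAll(String xml, List<CorrectionModule> modules) {
        String result = xml;
        for (CorrectionModule module : modules) {
            result = module.correct(result);
        }
        return result;
    }

    /** Placeholder: a real version would build the module list from the configuration file. */
    static List<CorrectionModule> loadModules(Path configFile) {
        // Assumption: one whitespace-trimming module stands in for the configured modules.
        CorrectionModule trimWhitespace = String::trim;
        return Collections.singletonList(trimWhitespace);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println(
                "Usage: java StandaloneCorrectionSketch <import-file> <export-file> <config-file>");
            return;
        }
        Path importFile = Paths.get(args[0]);
        Path exportFile = Paths.get(args[1]);
        Path configFile = Paths.get(args[2]);

        // Read the import file, run the corrections, and write the export file.
        String xml = new String(Files.readAllBytes(importFile), StandardCharsets.UTF_8);
        String corrected = applyAll(xml, loadModules(configFile));
        Files.write(exportFile, corrected.getBytes(StandardCharsets.UTF_8));
    }
}
```

Keeping the file handling in `main` and the correction logic in `applyAll` means a different user interface could call `applyAll` directly without going through the command line.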