Difference between revisions of "Export DataPerfect"

From reBiND Documentation
Jump to: navigation, search
(additional information)
 
(5 intermediate revisions by 2 users not shown)
Line 4: Line 4:
  
 
=== Install and run DOSBox ===
 
=== Install and run DOSBox ===
: To run DataPerfect a DOS emulator is needed. The free emulator DOSBox works quite well. Go to http://www.dosbox.com/download.php?main=1 or http://sourceforge.net/projects/dosbox/, download the latest release and install it. For this articles DOSBox 0.74 will be used and installed under Windows 7.  
+
: To run DataPerfect a DOS emulator is needed. The free emulator DOSBox works quite well. Go to http://www.dosbox.com/download.php?main=1 or http://sourceforge.net/projects/dosbox/, download the latest release and install it. For this article DOSBox 0.74 will be used and installed under Windows 7.  
  
 
===  Download DataPerfect ===
 
===  Download DataPerfect ===
Line 50: Line 50:
 
* '''Down''' on a field which has references to entries in another table, that referenced entry is displayed
 
* '''Down''' on a field which has references to entries in another table, that referenced entry is displayed
 
* '''Up''' The list of entries in this table where the content of the current field is displayed. Navigate though this list using the '''Up''' and '''Down''' keys. When typing characters this list the focus will jump to the entry whose unique key column is like the typed characters. For example in the table displayed in the image below, typing a number will focus on the entry with that id. The corresponding entry is automatically displayed. To edit this entry, press '''Enter'''.  
 
* '''Up''' The list of entries in this table where the content of the current field is displayed. Navigate though this list using the '''Up''' and '''Down''' keys. When typing characters this list the focus will jump to the entry whose unique key column is like the typed characters. For example in the table displayed in the image below, typing a number will focus on the entry with that id. The corresponding entry is automatically displayed. To edit this entry, press '''Enter'''.  
* '''F7''' Goes to the next upper level in the hierarchy. For example if the list of entries is opened, it will jump back to the  
+
* '''F7''' Goes to the next upper level in the hierarchy. For example if the list of entries is opened, it will jump back to the next upper level in the hierarchy.
 
[[File:DPE10_DIAS_Table1_browse.png|frame|none|Browsing through the first table of DIAS]]
 
[[File:DPE10_DIAS_Table1_browse.png|frame|none|Browsing through the first table of DIAS]]
  
 
=== Select Report ===
 
=== Select Report ===
 +
From the list of available reports select the entry at the top: '''Build-In Short Reports'''.
 
[[File:DPE11_DIAS_Table1_Report_List.png|frame|none|The list of available reports.]]
 
[[File:DPE11_DIAS_Table1_Report_List.png|frame|none|The list of available reports.]]
  
Line 72: Line 73:
  
 
=== Converting special characters ===
 
=== Converting special characters ===
 +
The best way to handle special characters is to know the character encoding by the original file system. If it is a DOS based system, the command '''CHCP''' will display what Character Code Page used. With this knowledge the file can be easily converted. In DOSBox the Command is '''KEYB''' however, it only helps if all of the special characters are displayed correctly within the DataPerfect. In the case of the Rohwer data set, the KEYB command showed Codepage 437, however Codepage 850 was actually used, must of the special characters are however identical between the two sets.
 +
 +
Once the character encoding is known the exported file converted by a program that is able to read that encoding. Under Windows Notepad++ does a good job. After opening the file (the special characters are probably distorted), one must select the original character encoding as the encoding of the file and then convert the file to UTF-8.
 +
 +
Google Refine is another software that is able to read the CP 850 encoding and allows for further processing of the files.
 +
 +
Additional Infomation about Codepages can be found on the Wikipedia Article ''[http://en.wikipedia.org/wiki/Code_page Code page]''. Also the article ''[http://www.joelonsoftware.com/articles/Unicode.html The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)]'' by Joel Spolsky is very helpful.
 +
 +
In retrospective we now know that the Rohwer data set is encoded using the Codepage 850 (also called OEM 850), however some single characters are in the wrong encoding. This however is the result of the process where we parsed the file, got a list of all the different characters used and looked them up and converted them by hand. Though this process is a lot more work, but it works even in the case of mixed character sets. It is documented in detail below.
 +
 +
 +
==== converting characters individually ====
 +
The export out of DataPerfect has problems with special characters which are not converted correctly in a proper character set. Though all of the occurrences of such special characters can be replaced automatically, each of the character to be replaced has to be defined manually one. The files as they are generated by DataPerfect use the ANSI Character Encoding. This needs to be converted to UTF8. Under Windows the free program Notepad++ is very suited for this. When selecting the menu item '''Encoding''' (''Kodierung'' in the image, because it is the German version of Notepad++), the entry ''ANSI'' should be marked as the current encoding. Now click on '''Convert to UTF-8''' and save the file.
 +
[[File:DPE16 Notepad CharSet.png|frame|none|Converting the Character Set in Notepad++]]
 +
The next step is to run the file through a program that converts the characters. The small Java program CharReplacer was written just to do that. There are two different files to that program: CharReplacer.class (the acutual program) and CharReplacer.settings (the settings file, which specifies, what characters to replace. The first line in ''CharReplacer.settings'' are all the characters which will not be replaced. All the following lines have the number value of the character which will be replaced, followed by a tab and the character it will be replaced with.
 +
 +
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",.-; ()?\':/=&[]!`<>#+%
 +
8222 ä
 +
8221 ö
 +
129 ü
 +
225 ß
 +
381 Ä
 +
8482 Ö
 +
353 Ü
 +
Excerpt from the ''CharReplacer.settings'' file.
 +
 +
To run the CharReplacer program, the Java Runtime Envirement (JRE) must be installed on a computer. It must be run via the command line interface (also known as console).
 +
 +
Under Windows the Command Prompt (cmd.exe) can be used, but it needs to be adjusted to show the UTF-8 characters.
 +
 +
''(The following steps displayed in italics only need to be done if the Windows Command Prompt is used.)''
 +
* Copy the files CharReplacer.class and CharReplacer.settings in the folder in which the exported files are (in the example from above it is C:\DOS\DP26Y\DIAS\)
 +
* Start the console.
 +
* Change to the directory in which the files are.
 +
* ''Set the Font for the Command Prompt to '''Lucida Console'''''
 +
* ''Switch the Character Set for the console by typing '''chcp 65001'''''
 +
* run the program by typing '''java -Dfile.encoding=UTF-8 CharReplacer''' followed by the name or names of the files you want to run the program on, e.g. ''java -Dfile.encoding=UTF-8 CharReplacer OUTPUT1.EXP'' and hit Enter.
 +
 +
{|
 +
|[[File:DPE17 Console change Font.png|frame|none|Changing the Font in the Windows Command Prompt (German version)]]
 +
|[[File:DPE18 Console start CharReplacer.png|frame|none|Running the CharReplacer]]
 +
|}
 +
 +
The output of the program will be a file with the name of the input file, but with an additional '''.csv''' extension at the end. So '''OUTPUT1.EXP''' will become '''OUTPUT1.EXP.csv'''.
 +
 +
If the program find characters which are not in the list of allowed characters (the first line of the file) and for which no replacement rule exists, it will not replace this character, but it will show a message, informing the user that an unknown character was found, where it was found (file, line and column), what it looks like in UTF-8 and what its code is. If this occurs it is necessary to review the file at this position and create a new rule on how to handle that character. If the program prints out an unknown character, open the file and look at the given position. If the character is correct and represents precisely what was meant in the original dataset, then add this character at the end of the first line in the settings file. If it is a wrong character however, try to figure out what character it is meant to be. This could be derived from the context in which this unknown character appears. In the example below, the unknown character is displayed as '''†''' and appears in the work '''"R†dhusplassen"'''. People how are familiar with Norwegian might recognize the word as '''"Rådhusplassen"''' (the Norwegian word for "City Hall Place" or "Town Hall Place"). In other cased it could be not so clear. So it becomes necessary to look up the entry in the original DataPerfect table and see, what character was originally entered. In the third image, it can be seen that in this case the unknown character is indeed a '''å'''. So a new line can be added at the end of the settings file with the following content:
 +
8224 å
 +
The 8224 is the code for the character, as shown in the message of the CharReplacer program in the first image.
 +
 +
To edit the ''CharReplacer.settings'' file, use a regular text editor (like Notepad++ under Windows).
 +
 +
{|
 +
|[[File:DPE19 Console run CharReplacer.png‎|frame|none|some unknown characters were found]]
 +
|[[File:DPE20_Notepad_Special_Characters.png‎|frame|none|Viewing the unconverted character in Notepad++]]
 +
|[[File:DPE21 DataPerfect Special Characters.png|frame|none|Viewing the unconverted character in the original data entry in DataPerfect.]]
 +
|}
  
 
=== Documenting Foreign Key Relations ===
 
=== Documenting Foreign Key Relations ===
 +
When the data is exported out of DataPerfect the associations between the tables are lost. So it is important to take a look at the exported data and see the relation between the tables. What columns are unique keys for their table and what columns are foreign key links to other tables. Looking at the tables in DataPerfect is helpful in this context, as the UI of DataPerfect sometimes shows columns which are not part of the export of that table, so they must be loaded from a different table. Sometimes when selecting a field in DataPerfect it also opens the window to another table, making it obvious that this column is a foreign key connection.
  
 
=== Importing into a modern relational database ===
 
=== Importing into a modern relational database ===
 +
Once the data is exported, converted into UTF-8 and the foreign key relations have been analyzed, it is possible to import the data in a modern relational data base, so it can be accessed by the BioCASe Provider Software. To do this, one must first create a database for it and the respective tables. The columns of the tables must already be prepared. It is important that the maximal length allowed for the columns is not smaller then the longest entry in this columns. Also the data formats must be correct. Though it is possible to store any kind of data in a text field, it will be more useful if the data is actually stored using the format it actually is.
  
 
=== Next Steps ===
 
=== Next Steps ===
 +
Depending on how the tables are structured, it might become necessary to do a controlled denormalization so that the entries can be easier mapped to ABCD. This is described on the [http://wiki.bgbm.org/bps/index.php/Preparation Preparation page] of the [http://wiki.bgbm.org/bps BioCASe Provider Software Wiki].
 +
 +
The other following steps are also described at the BioCASe Provider Software Wiki, like the mapping of the ABCD concepts.
  
 
=== Alternative Ways of exporting data from a DataPerfect database ===
 
=== Alternative Ways of exporting data from a DataPerfect database ===
 +
Another way of exporting DataPerfect files is the [http://dans-dp-lib.sourceforge.net/ DANS DataPerfect Library]. There is a reference implementation of a [http://dans-dp-lib.sourceforge.net/Dp2MySqlExport.java DP2MySQLConverter]. After testing it on small sample databases, it worked fine, but it ran intro problems exporting the DIAS database, causing huge and corrupt output files. For other DataPerfect files it could still be useful, especially since it already handles the special characters correctly.

Latest revision as of 01:45, 19 November 2014

This article describes how to export data from a DataPerfect database, based on the Rohwer data set.

Data Perfect is a DOS based database system. It's latest release is 2.6Y from June 2008. It can be downloaded via http://dataperfect.nl/ .

Install and run DOSBox

To run DataPerfect a DOS emulator is needed. The free emulator DOSBox works quite well. Go to http://www.dosbox.com/download.php?main=1 or http://sourceforge.net/projects/dosbox/, download the latest release and install it. For this article DOSBox 0.74 will be used and installed under Windows 7.

Download DataPerfect

Go to http://dataperfect.nl/ and download the DataPerfect. Unzip the files into a specific folder. The folder used for this article is C:/DOS/DP26Y.

Copy your data files in the DataPerfect Folder

You can either copy them directly in the folder or create a subfolder for your data files and copy them in there. The data files for this article are located in the directory DIAS/ within the DataPerfect folder.

Start DOSBox

When starting DOSBox a second console window is opened. When running several instances of DOSBox these additional windows can bloat the window bar quite a bit. To avoid this just start the version DOXBox-0.74/Extras/DOSBox 0.74 (noconsole) from the Program Menu.
File:DPE01 DOSBox.png
The DOSBOX Start Window

Mount the DataPerfect Folder

After starting DOSBox the directory of the DataPerfect files needs to be mounted as a virtual drive. Type mount c C:\DOS\DP27Y to mount the folder and C: to change to the drive.
File:DPE02 DOSBox mount.png
Mounting the DataPerfect Folder

Start DataPerfect

Start DataPerfect by typing DP26YU and pressing Enter.
File:DPE03 DataPerfect start.png
Switch to C: and start DataPerfect
File:DPE04 DataPerfect welcome.png
The Welcome Screen of DataPerfect

Change to the folder of your data files

If the data files are not in the directory of DataPerfect you need to change the directory. Press 2 to change the directory and type in the path to the directory of your data.
File:DPE05 DataPerfect open.png
The initial screen of DataPerfect
File:DPE06 DataPerfect change directory.png
Change to the data directory

Open the data files

If you are in the correct directory you see the DataPerfect datasets within the folder. Use the cursor keys to select the correct data set and press enter.
File:DPE07 DataPerfect select DIAS.png
Selecting the DIAS data set.

Select the table to export

You can now see the tables within this database project. Select the table you want to export by using the up or down keys and pressing enter.

File:DPE08 DIAS Overview.png
The list of available tables.
File:DPE09 DIAS Table1.png
The first entry in the first table of DIAS

Navigate the Table

This these steps are not necessary for the export, but will be documented here anyway. Here are important keys for navigating thought a table:

  • Tab the next field is highlighted
  • Down on a field which has references to entries in another table, that referenced entry is displayed
  • Up The list of entries in this table where the content of the current field is displayed. Navigate though this list using the Up and Down keys. When typing characters this list the focus will jump to the entry whose unique key column is like the typed characters. For example in the table displayed in the image below, typing a number will focus on the entry with that id. The corresponding entry is automatically displayed. To edit this entry, press Enter.
  • F7 Goes to the next upper level in the hierarchy. For example if the list of entries is opened, it will jump back to the next upper level in the hierarchy.
File:DPE10 DIAS Table1 browse.png
Browsing through the first table of DIAS

Select Report

From the list of available reports select the entry at the top: Build-In Short Reports.

File:DPE11 DIAS Table1 Report List.png
The list of available reports.

Set Export Properties

Set the export settings as shown in the first image below. Press the number of the property you want to change. For example, to change the file name of the output file, press 2 and the file options will appear, as can be seen in the second image. Press 1 to create a new file and enter the name of the file (image 3). The filename of the export file must not be longer than 8 characters without file extension and the file extension itself (without the dot) must not be longer than 3 characters.

File:DPE12 DIAS Table1 Export 1.png
Export Settings Overview
File:DPE13 DIAS Table1 Export 2.png
Changing the File Options

Run Export

To start the export press Shift + F7 again. The screen now shows a counter of how many elements from the table have already been exported. DOSBox starts with a limited CPU speed for the programs running in it. The speed can be increased by pressing Ctrl + F12. Increasing the speed to much, however creates a big overhead which will also result in a slower export. For some reason when increasing the speed the CPU load will jump from around 5% of one core to 100% within just 2-3 increase steps. So the CPU load should be watched when increasing the speed of DOSBox. On a DualCore 3 GHz Processor a speed of 50000-55000 cycles (can be seen in the title bar of the DOSBox Window) appears to be a good export speed without overhead.

File:DPE15 DIAS Export.png
The export is running.

After the export is done, you will see the list of available reports again. To step to the next higher level of the hierarchy press F7. Repeat this until you see the list of tables within this database project again. Now repeat the export process for all the other tables you want to export.

Converting special characters

The best way to handle special characters is to know the character encoding by the original file system. If it is a DOS based system, the command CHCP will display what Character Code Page used. With this knowledge the file can be easily converted. In DOSBox the Command is KEYB however, it only helps if all of the special characters are displayed correctly within the DataPerfect. In the case of the Rohwer data set, the KEYB command showed Codepage 437, however Codepage 850 was actually used, must of the special characters are however identical between the two sets.

Once the character encoding is known the exported file converted by a program that is able to read that encoding. Under Windows Notepad++ does a good job. After opening the file (the special characters are probably distorted), one must select the original character encoding as the encoding of the file and then convert the file to UTF-8.

Google Refine is another software that is able to read the CP 850 encoding and allows for further processing of the files.

Additional Infomation about Codepages can be found on the Wikipedia Article Code page. Also the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky is very helpful.

In retrospective we now know that the Rohwer data set is encoded using the Codepage 850 (also called OEM 850), however some single characters are in the wrong encoding. This however is the result of the process where we parsed the file, got a list of all the different characters used and looked them up and converted them by hand. Though this process is a lot more work, but it works even in the case of mixed character sets. It is documented in detail below.


converting characters individually

The export out of DataPerfect has problems with special characters which are not converted correctly in a proper character set. Though all of the occurrences of such special characters can be replaced automatically, each of the character to be replaced has to be defined manually one. The files as they are generated by DataPerfect use the ANSI Character Encoding. This needs to be converted to UTF8. Under Windows the free program Notepad++ is very suited for this. When selecting the menu item Encoding (Kodierung in the image, because it is the German version of Notepad++), the entry ANSI should be marked as the current encoding. Now click on Convert to UTF-8 and save the file.

File:DPE16 Notepad CharSet.png
Converting the Character Set in Notepad++

The next step is to run the file through a program that converts the characters. The small Java program CharReplacer was written just to do that. There are two different files to that program: CharReplacer.class (the acutual program) and CharReplacer.settings (the settings file, which specifies, what characters to replace. The first line in CharReplacer.settings are all the characters which will not be replaced. All the following lines have the number value of the character which will be replaced, followed by a tab and the character it will be replaced with.

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",.-; ()?\':/=&[]!`<>#+%
8222	ä
8221	ö
129	ü
225	ß
381	Ä
8482	Ö
353	Ü

Excerpt from the CharReplacer.settings file.

To run the CharReplacer program, the Java Runtime Envirement (JRE) must be installed on a computer. It must be run via the command line interface (also known as console).

Under Windows the Command Prompt (cmd.exe) can be used, but it needs to be adjusted to show the UTF-8 characters.

(The following steps displayed in italics only need to be done if the Windows Command Prompt is used.)

  • Copy the files CharReplacer.class and CharReplacer.settings in the folder in which the exported files are (in the example from above it is C:\DOS\DP26Y\DIAS\)
  • Start the console.
  • Change to the directory in which the files are.
  • Set the Font for the Command Prompt to Lucida Console
  • Switch the Character Set for the console by typing chcp 65001
  • run the program by typing java -Dfile.encoding=UTF-8 CharReplacer followed by the name or names of the files you want to run the program on, e.g. java -Dfile.encoding=UTF-8 CharReplacer OUTPUT1.EXP and hit Enter.
File:DPE17 Console change Font.png
Changing the Font in the Windows Command Prompt (German version)

The output of the program will be a file with the name of the input file, but with an additional .csv extension at the end. So OUTPUT1.EXP will become OUTPUT1.EXP.csv.

If the program find characters which are not in the list of allowed characters (the first line of the file) and for which no replacement rule exists, it will not replace this character, but it will show a message, informing the user that an unknown character was found, where it was found (file, line and column), what it looks like in UTF-8 and what its code is. If this occurs it is necessary to review the file at this position and create a new rule on how to handle that character. If the program prints out an unknown character, open the file and look at the given position. If the character is correct and represents precisely what was meant in the original dataset, then add this character at the end of the first line in the settings file. If it is a wrong character however, try to figure out what character it is meant to be. This could be derived from the context in which this unknown character appears. In the example below, the unknown character is displayed as and appears in the work "R†dhusplassen". People how are familiar with Norwegian might recognize the word as "Rådhusplassen" (the Norwegian word for "City Hall Place" or "Town Hall Place"). In other cased it could be not so clear. So it becomes necessary to look up the entry in the original DataPerfect table and see, what character was originally entered. In the third image, it can be seen that in this case the unknown character is indeed a å. So a new line can be added at the end of the settings file with the following content:

8224	å

The 8224 is the code for the character, as shown in the message of the CharReplacer program in the first image.

To edit the CharReplacer.settings file, use a regular text editor (like Notepad++ under Windows).

File:DPE19 Console run CharReplacer.png
some unknown characters were found
File:DPE20 Notepad Special Characters.png
Viewing the unconverted character in Notepad++
File:DPE21 DataPerfect Special Characters.png
Viewing the unconverted character in the original data entry in DataPerfect.

Documenting Foreign Key Relations

When the data is exported out of DataPerfect the associations between the tables are lost. So it is important to take a look at the exported data and see the relation between the tables. What columns are unique keys for their table and what columns are foreign key links to other tables. Looking at the tables in DataPerfect is helpful in this context, as the UI of DataPerfect sometimes shows columns which are not part of the export of that table, so they must be loaded from a different table. Sometimes when selecting a field in DataPerfect it also opens the window to another table, making it obvious that this column is a foreign key connection.

Importing into a modern relational database

Once the data is exported, converted into UTF-8 and the foreign key relations have been analyzed, it is possible to import the data in a modern relational data base, so it can be accessed by the BioCASe Provider Software. To do this, one must first create a database for it and the respective tables. The columns of the tables must already be prepared. It is important that the maximal length allowed for the columns is not smaller then the longest entry in this columns. Also the data formats must be correct. Though it is possible to store any kind of data in a text field, it will be more useful if the data is actually stored using the format it actually is.

Next Steps

Depending on how the tables are structured, it might become necessary to do a controlled denormalization so that the entries can be easier mapped to ABCD. This is described on the Preparation page of the BioCASe Provider Software Wiki.

The other following steps are also described at the BioCASe Provider Software Wiki, like the mapping of the ABCD concepts.

Alternative Ways of exporting data from a DataPerfect database

Another way of exporting DataPerfect files is the DANS DataPerfect Library. There is a reference implementation of a DP2MySQLConverter. After testing it on small sample databases, it worked fine, but it ran intro problems exporting the DIAS database, causing huge and corrupt output files. For other DataPerfect files it could still be useful, especially since it already handles the special characters correctly.