This article describes how to export data from a DataPerfect database, based on the Rohwer data set.
Data Perfect is a DOS based database system. It's latest release is 2.6Y from June 2008. It can be downloaded via http://dataperfect.nl/ .
- 1 Install and run DOSBox
- 2 Download DataPerfect
- 3 Copy your data files in the DataPerfect Folder
- 4 Start DOSBox
- 5 Mount the DataPerfect Folder
- 6 Start DataPerfect
- 7 Change to the folder of your data files
- 8 Open the data files
- 9 Select the table to export
- 10 Navigate the Table
- 11 Select Report
- 12 Set Export Properties
- 13 Run Export
- 14 Converting special characters
- 15 Documenting Foreign Key Relations
- 16 Importing into a modern relational database
- 17 Next Steps
- 18 Alternative Ways of exporting data from a DataPerfect database
Install and run DOSBox
- To run DataPerfect a DOS emulator is needed. The free emulator DOSBox works quite well. Go to http://www.dosbox.com/download.php?main=1 or http://sourceforge.net/projects/dosbox/, download the latest release and install it. For this article DOSBox 0.74 will be used and installed under Windows 7.
- Go to http://dataperfect.nl/ and download the DataPerfect. Unzip the files into a specific folder. The folder used for this article is C:/DOS/DP26Y.
Copy your data files in the DataPerfect Folder
- You can either copy them directly in the folder or create a subfolder for your data files and copy them in there. The data files for this article are located in the directory DIAS/ within the DataPerfect folder.
- When starting DOSBox a second console window is opened. When running several instances of DOSBox these additional windows can bloat the window bar quite a bit. To avoid this just start the version DOXBox-0.74/Extras/DOSBox 0.74 (noconsole) from the Program Menu.
Mount the DataPerfect Folder
- After starting DOSBox the directory of the DataPerfect files needs to be mounted as a virtual drive. Type mount c C:\DOS\DP27Y to mount the folder and C: to change to the drive.
- Start DataPerfect by typing DP26YU and pressing Enter.
Change to the folder of your data files
- If the data files are not in the directory of DataPerfect you need to change the directory. Press 2 to change the directory and type in the path to the directory of your data.
Open the data files
- If you are in the correct directory you see the DataPerfect datasets within the folder. Use the cursor keys to select the correct data set and press enter.
Select the table to export
You can now see the tables within this database project. Select the table you want to export by using the up or down keys and pressing enter.
This these steps are not necessary for the export, but will be documented here anyway. Here are important keys for navigating thought a table:
- Tab the next field is highlighted
- Down on a field which has references to entries in another table, that referenced entry is displayed
- Up The list of entries in this table where the content of the current field is displayed. Navigate though this list using the Up and Down keys. When typing characters this list the focus will jump to the entry whose unique key column is like the typed characters. For example in the table displayed in the image below, typing a number will focus on the entry with that id. The corresponding entry is automatically displayed. To edit this entry, press Enter.
- F7 Goes to the next upper level in the hierarchy. For example if the list of entries is opened, it will jump back to the next upper level in the hierarchy.
From the list of available reports select the entry at the top: Build-In Short Reports.
Set Export Properties
Set the export settings as shown in the first image below. Press the number of the property you want to change. For example, to change the file name of the output file, press 2 and the file options will appear, as can be seen in the second image. Press 1 to create a new file and enter the name of the file (image 3). The filename of the export file must not be longer than 8 characters without file extension and the file extension itself (without the dot) must not be longer than 3 characters.
To start the export press Shift + F7 again. The screen now shows a counter of how many elements from the table have already been exported. DOSBox starts with a limited CPU speed for the programs running in it. The speed can be increased by pressing Ctrl + F12. Increasing the speed to much, however creates a big overhead which will also result in a slower export. For some reason when increasing the speed the CPU load will jump from around 5% of one core to 100% within just 2-3 increase steps. So the CPU load should be watched when increasing the speed of DOSBox. On a DualCore 3 GHz Processor a speed of 50000-55000 cycles (can be seen in the title bar of the DOSBox Window) appears to be a good export speed without overhead.
After the export is done, you will see the list of available reports again. To step to the next higher level of the hierarchy press F7. Repeat this until you see the list of tables within this database project again. Now repeat the export process for all the other tables you want to export.
Converting special characters
The best way to handle special characters is to know the character encoding by the original file system. If it is a DOS based system, the command CHCP will display what Character Code Page used. With this knowledge the file can be easily converted. In DOSBox the Command is KEYB however, it only helps if all of the special characters are displayed correctly within the DataPerfect. In the case of the Rohwer data set, the KEYB command showed Codepage 437, however Codepage 850 was actually used, must of the special characters are however identical between the two sets.
Once the character encoding is known the exported file converted by a program that is able to read that encoding. Under Windows Notepad++ does a good job. After opening the file (the special characters are probably distorted), one must select the original character encoding as the encoding of the file and then convert the file to UTF-8.
Google Refine is another software that is able to read the CP 850 encoding and allows for further processing of the files.
Additional Infomation about Codepages can be found on the Wikipedia Article Code page. Also the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky is very helpful.
In retrospective we now know that the Rohwer data set is encoded using the Codepage 850 (also called OEM 850), however some single characters are in the wrong encoding. This however is the result of the process where we parsed the file, got a list of all the different characters used and looked them up and converted them by hand. Though this process is a lot more work, but it works even in the case of mixed character sets. It is documented in detail below.
converting characters individually
The export out of DataPerfect has problems with special characters which are not converted correctly in a proper character set. Though all of the occurrences of such special characters can be replaced automatically, each of the character to be replaced has to be defined manually one. The files as they are generated by DataPerfect use the ANSI Character Encoding. This needs to be converted to UTF8. Under Windows the free program Notepad++ is very suited for this. When selecting the menu item Encoding (Kodierung in the image, because it is the German version of Notepad++), the entry ANSI should be marked as the current encoding. Now click on Convert to UTF-8 and save the file.
The next step is to run the file through a program that converts the characters. The small Java program CharReplacer was written just to do that. There are two different files to that program: CharReplacer.class (the acutual program) and CharReplacer.settings (the settings file, which specifies, what characters to replace. The first line in CharReplacer.settings are all the characters which will not be replaced. All the following lines have the number value of the character which will be replaced, followed by a tab and the character it will be replaced with.
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",.-; ()?\':/=&!`<>#+% 8222 ä 8221 ö 129 ü 225 ß 381 Ä 8482 Ö 353 Ü
Excerpt from the CharReplacer.settings file.
To run the CharReplacer program, the Java Runtime Envirement (JRE) must be installed on a computer. It must be run via the command line interface (also known as console).
Under Windows the Command Prompt (cmd.exe) can be used, but it needs to be adjusted to show the UTF-8 characters.
(The following steps displayed in italics only need to be done if the Windows Command Prompt is used.)
- Copy the files CharReplacer.class and CharReplacer.settings in the folder in which the exported files are (in the example from above it is C:\DOS\DP26Y\DIAS\)
- Start the console.
- Change to the directory in which the files are.
- Set the Font for the Command Prompt to Lucida Console
- Switch the Character Set for the console by typing chcp 65001
- run the program by typing java -Dfile.encoding=UTF-8 CharReplacer followed by the name or names of the files you want to run the program on, e.g. java -Dfile.encoding=UTF-8 CharReplacer OUTPUT1.EXP and hit Enter.
The output of the program will be a file with the name of the input file, but with an additional .csv extension at the end. So OUTPUT1.EXP will become OUTPUT1.EXP.csv.
If the program find characters which are not in the list of allowed characters (the first line of the file) and for which no replacement rule exists, it will not replace this character, but it will show a message, informing the user that an unknown character was found, where it was found (file, line and column), what it looks like in UTF-8 and what its code is. If this occurs it is necessary to review the file at this position and create a new rule on how to handle that character. If the program prints out an unknown character, open the file and look at the given position. If the character is correct and represents precisely what was meant in the original dataset, then add this character at the end of the first line in the settings file. If it is a wrong character however, try to figure out what character it is meant to be. This could be derived from the context in which this unknown character appears. In the example below, the unknown character is displayed as † and appears in the work "R†dhusplassen". People how are familiar with Norwegian might recognize the word as "Rådhusplassen" (the Norwegian word for "City Hall Place" or "Town Hall Place"). In other cased it could be not so clear. So it becomes necessary to look up the entry in the original DataPerfect table and see, what character was originally entered. In the third image, it can be seen that in this case the unknown character is indeed a å. So a new line can be added at the end of the settings file with the following content:
The 8224 is the code for the character, as shown in the message of the CharReplacer program in the first image.
To edit the CharReplacer.settings file, use a regular text editor (like Notepad++ under Windows).
Documenting Foreign Key Relations
When the data is exported out of DataPerfect the associations between the tables are lost. So it is important to take a look at the exported data and see the relation between the tables. What columns are unique keys for their table and what columns are foreign key links to other tables. Looking at the tables in DataPerfect is helpful in this context, as the UI of DataPerfect sometimes shows columns which are not part of the export of that table, so they must be loaded from a different table. Sometimes when selecting a field in DataPerfect it also opens the window to another table, making it obvious that this column is a foreign key connection.
Importing into a modern relational database
Once the data is exported, converted into UTF-8 and the foreign key relations have been analyzed, it is possible to import the data in a modern relational data base, so it can be accessed by the BioCASe Provider Software. To do this, one must first create a database for it and the respective tables. The columns of the tables must already be prepared. It is important that the maximal length allowed for the columns is not smaller then the longest entry in this columns. Also the data formats must be correct. Though it is possible to store any kind of data in a text field, it will be more useful if the data is actually stored using the format it actually is.
Depending on how the tables are structured, it might become necessary to do a controlled denormalization so that the entries can be easier mapped to ABCD. This is described on the Preparation page of the BioCASe Provider Software Wiki.
The other following steps are also described at the BioCASe Provider Software Wiki, like the mapping of the ABCD concepts.
Alternative Ways of exporting data from a DataPerfect database
Another way of exporting DataPerfect files is the DANS DataPerfect Library. There is a reference implementation of a DP2MySQLConverter. After testing it on small sample databases, it worked fine, but it ran intro problems exporting the DIAS database, causing huge and corrupt output files. For other DataPerfect files it could still be useful, especially since it already handles the special characters correctly.