Archiving

From BioCASe Provider Software
Revision as of 16:02, 15 November 2011 by JoergHoletschek (XML Archives)

With the archiving feature, you can create archive files that store all data published by your BioCASe web service. This is useful if your database holds a very large number of records (several hundred thousand or even millions) and the traditional harvesting approach of paging through the records published by the web service takes unacceptably long. Once the data is stored in an archive, the file can be downloaded from your web server or sent to the harvester via email and ingested much faster.

There are two types of archives. XML Archives simply store the records as compressed XML documents, each holding a customizable number of records. For example, if your web service publishes 500,000 records using the ABCD data schema, an ABCD Archive created for that service could hold 500 ABCD documents, each storing 1,000 records. For each schema supported by a BioCASe web service, a separate XML Archive can be created.
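The relationship between record count, records per document and number of documents can be checked with a short calculation. This is a minimal sketch; the function name `documents_needed` is made up for illustration and is not part of the BioCASe Provider Software.

```python
def documents_needed(total_records: int, records_per_document: int) -> int:
    """Return how many documents are needed if each holds at most
    records_per_document records (integer ceiling division)."""
    if records_per_document <= 0:
        raise ValueError("records_per_document must be positive")
    return (total_records + records_per_document - 1) // records_per_document


# The example from the text: 500,000 records, 1,000 records per document.
print(documents_needed(500_000, 1_000))  # 500
```

If the record count is not an exact multiple of the document size, the last document simply holds fewer records.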

DarwinCore Archives (DwCAs), in contrast, consist of one or several text files storing the core records and related information, zipped up together with a descriptor file and a metadata document. Similar to a BioCASe web service, DwCAs can be used to publish data to the Global Biodiversity Information Facility (GBIF). A detailed specification can be found on the TDWG site.

The BioCASe Provider Software can create XML Archives for each schema supported by a web service. Archives using the ABCD 2.06 schema (dubbed “ABCD dumps”) can be converted into one or several DarwinCore Archives in a subsequent step. Several archives may result because one BioCASe web service can publish several datasets, whereas a DarwinCore Archive can store only a single dataset. DwCAs use the DarwinCore data standard, which is flat and less complex than ABCD. Consequently, a DarwinCore Archive will store less information than the ABCD dump it was created from.
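To illustrate why one ABCD dump can yield several DarwinCore Archives, the following sketch groups records by the dataset they belong to, so that each group could go into its own archive. It is purely illustrative: the record layout, the `dataset` key and the sample values are assumptions, and this is not how the BioCASe converter is actually implemented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_by_dataset(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group records by their (hypothetical) 'dataset' key."""
    groups = defaultdict(list)
    for record in records:
        groups[record["dataset"]].append(record)
    return dict(groups)


# Made-up sample records published by a single web service.
records = [
    {"dataset": "Herbarium", "unit_id": "H-1"},
    {"dataset": "Fungi", "unit_id": "F-1"},
    {"dataset": "Herbarium", "unit_id": "H-2"},
]

groups = split_by_dataset(records)
print(sorted(groups))  # ['Fungi', 'Herbarium']
```

Here two datasets are found, so two separate archives would be needed.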

XML Archives

Once you’ve finished mapping a schema (see ABCD2Mapping to learn how to map ABCD, for example) and tested the resulting web service, you can create an XML Archive for that web service. To go to the archiving page, simply use the Archive link on the overview page of the datasource setup. If you haven’t mapped a schema, you won’t be able to create any archives.

ArchArchivingPage.png

In the box “Existing Archives” you will see a list of already existing XML archives. If you go to the archiving page for the first time, it will be empty. Once you’ve created archives, they will show up here and can be downloaded and removed by using the links next to the entries.

The box “Create a new Archive” allows you to create a new archive for each mapped schema. Simply select the schema and press Create Archive (leave the defaults unless you have a good reason to change them). During the archiving process, the Log will display messages indicating the progress:

ArchProcess.png

You can cancel the archiving process at any time by pressing Cancel next to the respective entry listed under In progress. However, the cancellation will only take effect once the current paging step has been completed. Depending on the paging size set and the speed of your server, this can take up to one minute. Always wait until the process has stopped before starting a new dump!

As the archiving progresses, the first lines of the log will drop out of view. Pressing Show full log will open the full log in a separate browser tab. You can navigate away from the archiving page during the process (or simply close it) and return later. Once finished, the dump will show up in the Finished list; if it’s still being processed, it will be listed under In progress. Pressing the Info link next to an entry will display the log again. Download and Remove will do exactly what the links suggest.

Even though you can create several archives in parallel, it is not a good idea. The archiving process puts a heavy load on both of your servers: the database server for data retrieval and the BioCASe server for creating the XML documents. It is therefore advisable to create just one archive at a time so that both servers can still respond to other requests.

There are four parameters for customizing the archiving process:

Destination schema: Data schema that will be used for storing the published records in the archive.

Paging size: Number of records stored in each document. The default 1,000 should be suitable for most cases. Important note: Be aware that a BioCASe web service can restrict the number of records per request! If that limit is below the paging size set for archiving, the paging size will be adjusted to the limit of the web service.

Max number of consecutive errors: The default is “1”, which means that the archiving process is aborted as soon as an error occurs during data retrieval or XML/archive creation. If you set this to a value larger than 1, the process is cancelled only after that number of consecutive errors. This allows you to complete the archiving in spite of errors, then view the log file afterwards and inspect the problematic records. Once you’ve corrected all errors breaking the dump, you should set the threshold back to 1 and redo the archiving to create an archive storing all records.

Archive file name: Name of the archive file to be created. The default will be constructed by concatenating web service name and data schema.
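The interplay of the paging size and the error threshold can be sketched as follows. This is a simplified model, not the actual BioCASe code: the function names are invented, and the assumption that a successful paging step resets the error counter is derived only from the word “consecutive” in the parameter name.

```python
from typing import Iterable, Optional, Tuple


def effective_paging_size(requested: int, service_limit: Optional[int]) -> int:
    """Clamp the requested paging size to the web service's record limit,
    if the service defines one."""
    if service_limit is not None and service_limit < requested:
        return service_limit
    return requested


def simulate_archiving(step_ok: Iterable[bool],
                       max_consecutive_errors: int = 1) -> Tuple[bool, int]:
    """Walk through paging steps (True = success, False = error).

    Returns (completed, documents_written). The run is aborted once
    max_consecutive_errors failures occur in a row (assumed behaviour).
    """
    consecutive = 0
    written = 0
    for ok in step_ok:
        if ok:
            written += 1
            consecutive = 0  # assumption: a success resets the counter
        else:
            consecutive += 1
            if consecutive >= max_consecutive_errors:
                return False, written
    return True, written


print(effective_paging_size(1000, 500))             # 500
print(simulate_archiving([True, False, True], 1))   # (False, 1)
print(simulate_archiving([True, False, True], 2))   # (True, 2)
```

With the default threshold of 1, the single failing step aborts the run; with a threshold of 2, the run completes but the archive lacks the records of the failed step, which is why a final run with the threshold set to 1 is recommended.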

DarwinCore Archives