Summary
Various architectural solutions exist today for software-driven data acquisition in the biodiversity sector, and they should be taken into account when defining the StanDAP-Herb architecture. Most systems in operation are proprietary solutions that offer no facilities for easy integration, interoperability or distribution of functional blocks. In contrast, StanDAP-Herb intends to enforce standardization through the consistent use of standardized interfaces and established IT standards. The definition of a standard process aims to facilitate the integration of systems operated by external institutions. In many cases, these institutions have made major investments in the systems they use; hence, StanDAP-Herb does not require existing solutions to be replaced. Discussions between the project partners revealed that StanDAP-Herb should be based on an architecture capable of defining service interfaces at certain entry points along the process chain, so that specialized external services (e.g. text recognition or image processing services) can be integrated into a workflow. The workflow itself should be as flexible as possible and easy to adapt to changing environments. Moreover, a sustainable standard architecture as foreseen for StanDAP-Herb should not be sensitive to upcoming technological innovations, but stable and long-lived. This requires the definition of architectural aspects on two levels: a technology-independent level and a technology-dependent level. The technology-independent part of the architecture definition ensures long-term viability, while the technology-dependent level can be adapted to new requirements. Finally, the architecture should not be restricted to data acquisition for dedicated collections, e.g. herbarium specimens, but should rather serve as a blueprint for other types of object collections.
For these reasons, it is recommended to base the StanDAP-Herb architecture on the principles of the Reference Model for Open Distributed Processing (ODP), which supports distribution, interworking, platform and technology independence, and portability. Guided by the ODP framework and by projects that formally applied this approach (e.g. ORCHESTRA [Uslaender2007], LifeWatch [Ernst2009], SERVUS [Uslaender2013]), StanDAP-Herb is intended to provide a Service Oriented Architecture (SOA) on a generic, platform- and technology-independent level. The focus should be on the workflow architecture: highly automated services provided by any institution should be easy to integrate into a deployment of a StanDAP-Herb workflow chain.
- (Uslaender2007) Reference model for the ORCHESTRA architecture (RM-OA) V2, Usländer, T. (ed.), Open Geospatial Consortium Inc., 2007
- (Ernst2009) Ernst, V. H., Poigné, A., Giddy, J., Hardisty, A., Voss, A. and Voss, H., Towards a Reference Model for the LifeWatch ICT Infrastructure, GI Jahrestagung, 2009, pp. 654-668
- (Uslaender2013) Usländer, T., Batz, T. and van der Schaaf, H., SERVUS – Collaborative Tool Support for Agile Requirements Analysis, International Symposium on Environmental Software Systems (ISESS), 2013
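The idea of service interfaces at entry points along the process chain can be illustrated with a minimal sketch. It is not part of the StanDAP-Herb specification: the names `SpecimenRecord`, `WorkflowService`, `StubTextRecognition`, `WhitespaceCleanup` and `run_chain` are hypothetical, and the text recognition step is a stub that returns a fixed string instead of calling a real external service.

```python
from dataclasses import dataclass, field
from typing import Iterable, Protocol


@dataclass
class SpecimenRecord:
    """Hypothetical container handed from one workflow step to the next."""
    image_uri: str
    fields: dict = field(default_factory=dict)


class WorkflowService(Protocol):
    """Technology-independent contract that any provider could implement."""

    def process(self, record: SpecimenRecord) -> SpecimenRecord:
        ...


class StubTextRecognition:
    """Stand-in for an external text recognition service (no real OCR)."""

    def process(self, record: SpecimenRecord) -> SpecimenRecord:
        record.fields["label_text"] = "  Bellis   perennis L.  "
        return record


class WhitespaceCleanup:
    """Example of a second, independently provided processing step."""

    def process(self, record: SpecimenRecord) -> SpecimenRecord:
        text = record.fields.get("label_text", "")
        record.fields["label_text"] = " ".join(text.split())
        return record


def run_chain(services: Iterable[WorkflowService],
              record: SpecimenRecord) -> SpecimenRecord:
    """Pass the record through each service in the configured order."""
    for service in services:
        record = service.process(record)
    return record


result = run_chain([StubTextRecognition(), WhitespaceCleanup()],
                   SpecimenRecord("specimen-0001.tif"))
print(result.fields["label_text"])
```

Because the chain depends only on the `process` contract, a step could in principle be replaced by a wrapper around a remote service without changing the other steps.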
Several workflow management systems target the scientific community, most notably Taverna [Oinn2002] and Kepler [Altintas2004]. Both systems are open source and have been under development for several years. They are oriented towards data-driven computations and provide specialized primitives for data flow control and data transformation to facilitate data exchange between computation tasks [Yildiz2009] [Lin2014]. While these specialized features may be very convenient for data-driven applications, they are not seen as advantageous for fulfilling the StanDAP-Herb requirements. Additionally, no standard notation is supported across multiple systems. Another interesting project is Argo [rak:2012], described as “a workbench for analyzing (primarily annotating) textual data”. Argo relies on the UIMA standard [Ferrucci2004] to support interoperability between processing components. Users who develop UIMA-based components can deposit them on the system, and it is also possible to develop Argo clients that interact with the system through web services. Argo focuses mainly on the curation of biomedical literature and is currently in beta. It is suited for creating pipelines of text analysis tasks, but it lacks primitives to express the complex workflows required to cover the whole digitization process. Since none of the aforementioned systems provides a compelling solution with clear advantages for supporting the digitization process, the more generic BPMN notation [WhiteBPMN], recognized as an ISO standard [ISO/IEC 19510:2013], has been chosen to design the StanDAP-Herb workflow for processing herbarium specimens. A number of BPMN engines, including open source projects such as Activiti [activiti2015], provide the run-time support to execute and manage workflow instances.
- (Altintas2004) Altintas, I., Jaeger, E., Lin, K., Ludaescher, B. and Memon, A., A Web Service Composition and Deployment Framework for Scientific Workflows, IEEE International Conference on Web Services (ICWS), IEEE Computer Society, 2004, pp. 814
- (Ferrucci2004) Ferrucci, D. and Lally, A., UIMA: an architectural approach to unstructured information processing in the corporate research environment, Natural Language Engineering, Cambridge Univ Press, 2004, Vol. 10(3-4), pp. 327-348
- (Lin2014) Lin, Y., Mougenot, I. and Libourel, T., Method and components for creating scientific workflow, 2014 IEEE 30th International Conference on Data Engineering Workshops (ICDEW), 2014, pp. 147-153
- (Oinn2002) Oinn, T., Greenwood, M., Addis, M. J., Alpdemir, M. N., Ferris, J., Glover, K., Goble, C., Goderis, A., Hull, D., Marvin, D. and others, Taverna: lessons in creating a workflow environment for the life sciences, Journal of Concurrency and Computation: Practice and experience, John Wiley & Sons Ltd, 2002
- (rak:2012) Rak, R., Rowley, A., Black, W. and Ananiadou, S., Argo: an integrative, interactive, text mining-based workbench supporting curation, Database: The Journal of Biological Databases and Curation, 2012, Vol. 2012
- (Yildiz2009) Yildiz, U., Guabtni, A. and Ngu, A., Business versus Scientific Workflows: A Comparative Study, 2009 World Conference on Services – I, 2009, pp. 340-343
- (WhiteBPMN) White, S. A., Introduction to BPMN, http://www.omg.org/bpmn/Documents/Introduction_to_BPMN.pdf
- (activiti2015) Activiti BPM Platform, http://activiti.org/
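To make the choice of BPMN more concrete, the following sketch embeds a small BPMN 2.0 process definition and lists its tasks in execution order. The process itself (its ids, task names and the `http://example.org/standap-herb` target namespace) is a hypothetical illustration, not the actual StanDAP-Herb workflow, and the traversal assumes a strictly linear flow without gateways. A BPMN engine would execute such a definition; here it is only parsed with the Python standard library.

```python
import xml.etree.ElementTree as ET

# Namespace of the BPMN 2.0 model elements.
BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"

# Hypothetical, strictly linear example process (not the project workflow).
BPMN_XML = f"""<definitions xmlns="{BPMN_NS}"
             targetNamespace="http://example.org/standap-herb">
  <process id="specimenDigitization" isExecutable="true">
    <startEvent id="start"/>
    <serviceTask id="ocr" name="Text recognition"/>
    <serviceTask id="nameCheck" name="Scientific name check"/>
    <userTask id="review" name="Manual review"/>
    <endEvent id="end"/>
    <sequenceFlow id="f1" sourceRef="start" targetRef="ocr"/>
    <sequenceFlow id="f2" sourceRef="ocr" targetRef="nameCheck"/>
    <sequenceFlow id="f3" sourceRef="nameCheck" targetRef="review"/>
    <sequenceFlow id="f4" sourceRef="review" targetRef="end"/>
  </process>
</definitions>"""


def linear_task_order(bpmn_xml: str) -> list:
    """Return task names in flow order, assuming one outgoing flow per node."""
    root = ET.fromstring(bpmn_xml)
    process = root.find(f"{{{BPMN_NS}}}process")
    # Map each node id to the id of its single successor.
    next_of = {flow.get("sourceRef"): flow.get("targetRef")
               for flow in process.findall(f"{{{BPMN_NS}}}sequenceFlow")}
    # Collect service tasks, user tasks and other *Task elements by id.
    tasks = {el.get("id"): el.get("name")
             for el in process if el.tag.endswith("Task")}
    current = process.find(f"{{{BPMN_NS}}}startEvent").get("id")
    order, visited = [], {current}
    while current in next_of:
        current = next_of[current]
        if current in visited:  # guard against accidental cycles
            break
        visited.add(current)
        if current in tasks:
            order.append(tasks[current])
    return order


print(linear_task_order(BPMN_XML))
```

The service tasks mark the points where automated external services could be attached, while the user task represents a manual step.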
Text analysis frameworks
Producing structured information that is more easily searchable and exploitable from unstructured text sources is a common problem in many knowledge areas. Frameworks such as GATE, UIMA and OpenNLP facilitate the development and integration of tools for processing unstructured text. OpenNLP is an initiative to bring together development efforts in the field of natural language processing. It provides a large number of algorithms, but lacks integration mechanisms and a global architecture [Essalihe2006]. Both GATE and UIMA provide an extensive set of tools to implement language processing solutions, and are also capable of dealing with images and audio. While GATE may benefit from a larger number of available analysis components, UIMA is an OASIS standard [Lally2009] and provides better support for parallelized and distributed processing [Bank2012]. The Argo workflow management system is based on UIMA [Rak2014].
- (Essalihe2006) Es-salihe, M. and Bond, S., Étude des frameworks UIMA, GATE et OpenNLP, Computer Research Institute of Montreal, 2006
- (Bank2012) Bank, M. and Schierle, M., A Survey of Text Mining Architectures and the UIMA Standard, LREC 2012, pp. 3479-3486
- (Lally2009) Lally, A., Verspoor, K. and Nyberg, E., Unstructured Information Management Architecture (UIMA) Version 1.0, 2009
- (Rak2014) Rak, R., Batista-Navarro, R. T., Carter, J., Rowley, A. and Ananiadou, S., Processing biological literature with customizable Web services supporting interoperable formats, Database, 2014, Vol. 2014
Text analysis services and algorithms
Thessen et al. [Thessen2012] provide an excellent review of applications of natural language processing in biodiversity science. Their review of tools for named entity recognition and of algorithms for morphological character extraction is of special interest. Some of these tools are already available as web services. The Taxonomic Name Resolution Service (TNRS) offered by iPlant [Goff2011] makes use of Taxamatch algorithms. TNRS corrects spelling errors and alternative spellings against a standard list of names and converts out-of-date names to the currently accepted name. The Global Names Recognition and Discovery (GNRD) service provided by the Global Names Architecture (GNA) [Patterson2010] relies on NetiNeti and TaxonFinder to find scientific names in text. The GBIF Name Parser service [GBIFNameParser2015] uses regular expressions to extract the different parts that form a scientific name. Services like Unlock Text [Barker2012] and Yahoo’s PlaceSpotter [PlaceSpotter2015] offer named entity recognition of locations in unstructured text. The Unlock Text API provides access to both the LTG Parser from the University of Edinburgh [Grover2010], which is offered under a dual-license model, and the open source CLAVIN geoparser [Clavin2015]. Yahoo’s PlaceSpotter is a paid service. Other services, like OpenCalais [OpenCalais2015] and AlchemyAPI [AlchemyAPI2015], extend geoparsing with the recognition of additional entity types, such as persons and organizations. Both are offered as paid subscription-based services, but can be used free of charge up to a certain number of transactions per day: the free package in OpenCalais allows 5000 submissions per day, and AlchemyAPI allows 1000 free transactions per day. These services are not tailored to a particular domain. There are numerous libraries and algorithm implementations that can be integrated in UIMA and GATE for implementing custom text processing pipelines.
However, these are generic building blocks that need to be adapted with specific gazetteers and may involve machine learning techniques that require training. Additionally, the characteristics of herbarium specimens make the application of such processing tools especially challenging. The labels on the specimens may be old and handwritten, making it difficult to produce good-quality OCR text. Besides, they may contain text in multiple languages, old orthography, acronyms, abbreviations and archaic terms. Since the labels offer limited space for documenting the specimen, some grammatical rules may not have been respected for the sake of brevity. The search for, evaluation and integration of available text analysis algorithms and services is an ongoing process that will continue during the implementation phase.
- (Thessen2012) Thessen, A. E., Cui, H. and Mozzherin, D., Applications of natural language processing in biodiversity science, Advances in bioinformatics, Hindawi Publishing Corporation, 2012, Vol. 2012
- (Goff2011) Goff, S. A., Vaughn, M., McKay, S., Lyons, E., Stapleton, A. E., Gessler, D., Matasci, N., Wang, L., Hanlon, M., Lenards, A. and others, The iPlant collaborative: cyberinfrastructure for plant biology, Frontiers in plant science, Frontiers Media SA, 2011, Vol. 2
- (Patterson2010) Patterson, D. J., Cooper, J., Kirk, P. M., Pyle, R. and Remsen, D. P., Names are key to the big new biology, Trends in ecology & evolution, Elsevier, 2010, Vol. 25(12), pp. 686-691
- (GBIFNameParser2015) GBIF Name Parser, http://tools.gbif.org/nameparser/
- (Barker2012) Barker, E., Byrne, K., Isaksen, L., Kansa, E. and Rabinowitz, N., The Geographic Annotation Platform – a Framework for Unlocking the Places in Free-text Corpora, NeDiMAH workshop at Digital Humanities 2012 Conference (DH2012), 2012
- (PlaceSpotter2015) BOSS PlaceSpotter API, https://developer.yahoo.com/boss/placespotter/
- (Grover2010) Grover, C., Tobin, R., Byrne, K., Woollard, M., Reid, J., Dunn, S. and Ball, J., Use of the Edinburgh geoparser for georeferencing digitized historical collections, Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, The Royal Society, 2010, Vol. 368(1925), pp. 3875-3889
- (Clavin2015) CLAVIN (Cartographic Location And Vicinity INdexer), https://clavin.bericotechnologies.com/
- (AlchemyAPI2015) AlchemyAPI, http://www.alchemyapi.com/
- (OpenCalais2015) Open Calais, http://new.opencalais.com/
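The regular-expression approach to splitting a scientific name into its parts can be sketched as follows. This is a deliberately simplified illustration and not the pattern used by the GBIF Name Parser: the pattern `NAME_PATTERN` and the function `parse_scientific_name` are assumptions made for this example, and they ignore hybrids, subgenera, abbreviated genera and many other cases that real labels contain.

```python
import re
from typing import Optional

# Simplified pattern: genus, specific epithet, optional infraspecific
# rank with epithet, optional authorship. Real names need far more cases.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)"
    r"\s+(?P<epithet>[a-z][a-z-]+)"
    r"(?:\s+(?P<rank>subsp\.|var\.|f\.)\s+(?P<infra_epithet>[a-z][a-z-]+))?"
    r"(?:\s+(?P<authorship>.+))?$"
)


def parse_scientific_name(name: str) -> Optional[dict]:
    """Return the recognised name parts, or None if the pattern fails."""
    match = NAME_PATTERN.match(" ".join(name.split()))
    if match is None:
        return None
    # Drop the optional groups that did not take part in the match.
    return {key: value for key, value in match.groupdict().items()
            if value is not None}


print(parse_scientific_name("Bellis perennis L."))
print(parse_scientific_name("Pinus nigra subsp. laricio"))
print(parse_scientific_name("herbarium label"))
```

A pattern of this kind fails on noisy OCR output, which is one reason why the spelling-tolerant matching offered by services such as TNRS is relevant for herbarium labels.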
The StanDAP-Herb system relies on a number of third-party services to provide some of its capabilities. This strategy aligns with the growing trend in the biodiversity community to facilitate access to data and processing algorithms through web services. Acknowledging this trend, the BiodiversityCatalogue [BioCat2015] provides a centralized registry of curated biodiversity web services to facilitate their discovery and reuse. While this is certainly a step in the right direction, such services are often not mature enough, or not properly documented or maintained. Software components built during the project will also rely on web services for their interconnection. SOAP and REST are the most popular approaches to implementing machine-to-machine communication on the web. Although SOAP defines a protocol and REST is an architectural style, they are often directly compared. SOAP-based APIs lead to tightly coupled designs similar to remote procedure calls. In REST, systems are more loosely coupled and their operation resembles navigation using web links [ZurMuehlen2005]. While many early APIs were written using SOAP, REST has since become the dominant approach [Mason2011]. Technologies such as message-oriented middleware (MOM) provide alternative ways to build distributed applications [Curry2004]. With a MOM, the system may benefit from publish/subscribe communication patterns and from queue mechanisms to deliver and process messages. In this case, an extra component, the message broker, is required.
- (BioCat2015) BiodiversityCatalogue, https://www.biodiversitycatalogue.org/
- (Curry2004) Curry, E., Message-oriented middleware, Middleware for communications, John Wiley & Sons, 2004, pp. 1-28
- (Mason2011) Mason, R., How REST replaced SOAP on the Web: What it means to you, http://www.infoq.com/articles/rest-soap/, 2011
- (ZurMuehlen2005) Zur Muehlen, M., Nickerson, J. V. and Swenson, K. D., Developing web services choreography standards—the case of REST vs. SOAP, Decision Support Systems, Elsevier, 2005, Vol. 40(1), pp. 9-29
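The publish/subscribe pattern with a queue and a broker, as offered by message-oriented middleware, can be sketched in a few lines. The class `InMemoryBroker` and the topic names are hypothetical and exist only for illustration; a real deployment would use a dedicated message broker rather than an in-process queue.

```python
from collections import defaultdict, deque
from typing import Any, Callable


class InMemoryBroker:
    """Toy message broker: queues messages and delivers them by topic."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)  # topic -> list of handlers
        self._queue = deque()                  # pending (topic, message)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        """Register a handler that is called for every message on a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Any) -> None:
        """Enqueue a message; the publisher does not know the receivers."""
        self._queue.append((topic, message))

    def dispatch(self) -> int:
        """Deliver all queued messages in order; return the delivery count."""
        delivered = 0
        while self._queue:
            topic, message = self._queue.popleft()
            for handler in self._subscribers.get(topic, ()):
                handler(message)
                delivered += 1
        return delivered


broker = InMemoryBroker()
received = []
broker.subscribe("specimen.imaged", received.append)
broker.publish("specimen.imaged", {"image_uri": "specimen-0001.tif"})
broker.publish("specimen.archived", {"image_uri": "specimen-0002.tif"})
print(broker.dispatch(), received)
```

The publisher and the subscriber share only the topic name, which is the loose coupling that distinguishes this style from a direct request to a SOAP or REST endpoint.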
Data quality tools
Chapman [Chapman2005] provides an extensive list of recommendations and best practices for error prevention and data cleaning, and also lists tools that support these practices. However, in most cases such tools are highly interactive or work on a limited set of data sources, for instance collections provided through a certain portal or network, which makes their integration into third-party solutions difficult. A basic quality check may consist of ensuring that all important fields have been assigned a value. Additionally, a number of checks can be carried out depending on the field type. For instance, date fields should be stored in an ISO-compliant format, and geographical coordinates should be stored in degrees, hold values within fixed limits, and be consistent with accompanying country codes. Scientific names, location names and collector names can be checked against available lists and gazetteers. The OpenUp! Data Quality Toolkit performs some of these checks on a given set of ABCD records, usually coming from an access point of BioCASE providers [Biocase2015]. Because the Data Quality Toolkit user interface layer is decoupled from the underlying data quality services, the services themselves can be used in other contexts [Berendsohn2012]. The speciesLink portal also features a data cleaning module [SpeciesLink2015]. In this case, quality checks are applicable to the collections listed in the portal. The portal also provides tools that target the spatial data component of biodiversity records. It can return information such as the country, state and municipality of a specified point, help in detecting outliers, and determine whether the given coordinates fall on land or on water. Timpute and OpenRefine are open source tools that can support data cleanup. Timpute tries to predict a value in a database cell, given the values in the other database cells [VandenBosch2009]. OpenRefine [OpenRefine2015] is a tool that allows exploration, cleanup and transformation of tabular data.
- (Chapman2005) Chapman, A. D., Principles and methods of data cleaning, GBIF, 2005
- (Biocase2015) Biological Collection Access Services (BioCASE), http://www.biocase.org/
- (Berendsohn2012) Berendsohn, W. G. and Güntsch, A., OpenUp! Creating a cross-domain pipeline for natural history data, ZooKeys, Pensoft Publishers, 2012(209), pp. 47
- (SpeciesLink2015) speciesLink data & tools, http://splink.cria.org.br/tools?criaLANG=en
- (VandenBosch2009) Van den Bosch, A., Van Erp, M. and Sporleder, C., Making a clean sweep of cultural heritage, IEEE Intelligent Systems, 2009, Vol. 24(2), pp. 54-63
- (OpenRefine2015) OpenRefine, http://openrefine.org/
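The basic checks described above (required fields, ISO-compliant dates, coordinate limits) can be sketched as a single validation function. The field names follow Darwin Core terms as an assumption for this example, the function `check_record` is hypothetical, and the consistency check between coordinates and country codes is left out because it needs an external gazetteer.

```python
from datetime import date

# Assumed set of important fields; a real profile would be project-specific.
REQUIRED_FIELDS = ("scientificName", "eventDate",
                   "decimalLatitude", "decimalLongitude")

# Absolute limits in degrees for each coordinate field.
COORDINATE_LIMITS = {"decimalLatitude": 90.0, "decimalLongitude": 180.0}


def check_record(record: dict) -> list:
    """Return a list of human-readable issues; empty if all checks pass."""
    issues = []

    # 1. Every important field must hold a non-empty value.
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            issues.append(f"missing value: {name}")

    # 2. Dates must parse as ISO 8601 calendar dates (e.g. 1893-07-14).
    raw_date = record.get("eventDate")
    if raw_date:
        try:
            date.fromisoformat(raw_date)
        except (TypeError, ValueError):
            issues.append("not an ISO 8601 date: eventDate")

    # 3. Coordinates must be numeric degrees within fixed limits.
    for name, limit in COORDINATE_LIMITS.items():
        raw = record.get(name)
        if raw is None or not str(raw).strip():
            continue  # already reported as missing
        try:
            value = float(raw)
        except (TypeError, ValueError):
            issues.append(f"not a number: {name}")
            continue
        if not -limit <= value <= limit:
            issues.append(f"out of range: {name}")

    return issues


print(check_record({"scientificName": "Bellis perennis L.",
                    "eventDate": "14.07.1893",
                    "decimalLatitude": "95.2",
                    "decimalLongitude": ""}))
```

Returning a list of issues instead of raising on the first error lets a workflow decide whether a record goes to manual review or is rejected outright.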
While OCR tools can greatly facilitate the digitization process, the recognition process is often unable to generate correct results for some parts of the text, especially when handwritten text is involved. A number of tools exist that facilitate the transcription of text with input from users. In general, these tools are oriented towards the transcription of text from books and articles rather than the short texts found on specimen labels. Notes from Nature [NotesFromNature2015], however, provides an online tool to transcribe museum records from biodiversity collections. This is a citizen science project in which anyone can collaborate in the transcription effort. The Biodiversity Heritage Library [BHL2015] also enables volunteers to contribute to its digitization efforts. In this case, collaborators can choose to contribute through a traditional interface, with book sheet images and input text boxes [DigiVol2015], or to provide their input while playing one of the two available purposeful games [BHLGaming2015]. The European project TranScriptorium [Transcriptorium2015] specializes in handwritten text recognition and provides a web-based interface that enables the correction of the extracted text. Transcribo [Transcribo2015], developed by the University of Trier, is a desktop tool aimed at facilitating the transcription of texts. It also allows metadata to be added to the text in parallel to the transcription. Another desktop tool, PerfectDoc [Yacoub2005], is a Java-based application from HP that also offers transcription functionality. An interesting aspect of PerfectDoc is that, after the OCR process, it highlights the words that are suspected to be wrong and offers the possibility to correct them directly in the text or in a table listing the words. The tools referenced above cover a wide range of approaches to supporting transcription efforts. The evaluation of transcription tools and integration possibilities is ongoing work.
- (NotesFromNature2015) Notes from Nature, http://www.notesfromnature.org/
- (BHL2015) Biodiversity Heritage Library, http://www.biodiversitylibrary.org/
- (DigiVol2015) DigiVol, http://volunteer.ala.org.au/
- (BHLGaming2015) BHL Purposeful Gaming, https://biodivlib.wikispaces.com/Purposeful+Gaming
- (Transcriptorium2015) tranScriptorium, http://transcriptorium.eu/
- (Transcribo2015) Transcribo, http://transcribo.org/
- (Yacoub2005) Yacoub, S., Saxena, V. and Sami, S. N., Perfectdoc: A ground truthing environment for complex documents, Proceedings of the Eighth International Conference on Document Analysis and Recognition, 2005, pp. 452-456
The results yielded by the processing of herbarium specimens need to be encoded in a specific format before they can be stored for publication and exploitation. The Biodiversity Information Standards association (TDWG), also known as the Taxonomic Databases Working Group, supports the development of standards for the exchange of biological and biodiversity data [TDWG2015]. The most widely deployed formats for biodiversity occurrence data are Darwin Core and ABCD. Darwin Core includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. Such terms can be used to encode information about taxa in several formats: using the Resource Description Framework (RDF), as text with accompanying metadata files, and as XML documents. The Access to Biological Collection Data (ABCD) standard defines an XML schema with a reconciled set of element names to specify biological collection units. It covers nearly 1200 concepts organized hierarchically. It is not expected (or even possible) for any collection to use more than a fraction of the elements defined in the standard. Mappings exist for matching Darwin Core terms to ABCD elements.
- (TDWG2015) Biodiversity Information Standards TDWG, http://www.tdwg.org/
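As an illustration of encoding results as text using Darwin Core terms, the sketch below writes records to CSV with term names as column headers. The selection of terms in `DWC_TERMS`, the function `to_dwc_csv` and the sample values (including the catalogue number `EX-0001`) are assumptions made for this example; the accompanying metadata file mentioned above is not generated here.

```python
import csv
import io

# A small, assumed selection of Darwin Core terms used as column headers.
DWC_TERMS = ["catalogNumber", "basisOfRecord", "scientificName",
             "recordedBy", "eventDate", "countryCode",
             "decimalLatitude", "decimalLongitude"]


def to_dwc_csv(records: list) -> str:
    """Serialise records to CSV text with Darwin Core terms as header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_TERMS,
                            restval="",             # missing terms stay empty
                            extrasaction="ignore",  # drop unknown keys
                            lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


# Hypothetical sample record; values are for illustration only.
sample = {"catalogNumber": "EX-0001",
          "basisOfRecord": "PreservedSpecimen",
          "scientificName": "Bellis perennis L.",
          "eventDate": "1893-07-14",
          "countryCode": "DE",
          "decimalLatitude": "49.75",
          "decimalLongitude": "6.64",
          "internalNote": "not a Darwin Core term, ignored on export"}

print(to_dwc_csv([sample]))
```

Keeping the term list in one place makes it straightforward to extend the export or to apply one of the existing Darwin Core to ABCD mappings at a later stage.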