Developments: Text Mining, Domain-Specific Knowledge, System Architectures - IOSB
This page presents developments (publications and tools) in the areas of text mining, domain-specific knowledge, and system architectures that are relevant to extracting content-related metadata from digitized images of herbarium specimens.
- 1 Text Mining
- 2 Domain-Specific Knowledge
- 3 System Architectures
Unsupervised Post-Correction of OCR Errors
Description: The thesis proposes an unsupervised, fully automatic approach for correcting OCR errors. It combines several methods for retrieving the best correction proposal for a misspelled word: general spelling correction (Anagram Hash), a new OCR-adapted method based on the shape of characters (OCR-Key), and context information (bigrams).
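To make the Anagram Hash idea more concrete, here is a minimal sketch, not the thesis's implementation. It assumes the key is the sum of character code points raised to a fixed power (the exponent 5 and all function names are assumptions for illustration). Because the key ignores character order, transposed characters map to the same key, and a single-character substitution can be simulated by simple arithmetic on the key.

```python
from collections import defaultdict


def anagram_key(word: str, power: int = 5) -> int:
    """Order-independent key: sum of the code points raised to a fixed power (assumed: 5)."""
    return sum(ord(ch) ** power for ch in word.lower())


def build_index(lexicon):
    """Group lexicon words by their anagram key."""
    index = defaultdict(set)
    for word in lexicon:
        index[anagram_key(word)].add(word.lower())
    return index


def candidates(token, index, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Lexicon words reachable by reordering characters or substituting one character."""
    token = token.lower()
    key = anagram_key(token)
    found = set(index.get(key, ()))  # same multiset of characters (e.g. transpositions)
    for old in set(token):
        for new in alphabet:
            # Replacing one character shifts the key by a known amount.
            shifted = key - anagram_key(old) + anagram_key(new)
            found |= index.get(shifted, set())
    found.discard(token)
    return found


index = build_index(["herbarium", "specimen", "label"])
print(candidates("1abel", index))  # the digit "1" misread for "l"
print(candidates("lable", index))  # transposed characters
```

The candidates found this way would still have to be ranked, for example with the shape-based and bigram information the thesis describes; that step is not covered by this sketch.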
Utilizing Big Data in Identification and Correction of OCR Errors
Description: Google Search is used to access the big data resources available to identify possible candidates for correction. The proper candidate is automatically picked using a combination of the Longest Common Subsequences (LCS) and Bayesian estimates.
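The LCS part of the candidate selection can be sketched as follows. This is only an illustration under stated assumptions: the candidate list is given (the paper obtains it via web search), the normalisation by the longer string length is an assumption, and the Bayesian estimates are left out entirely. The function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_candidate(token: str, candidates: list) -> str:
    """Pick the candidate whose LCS with the token is longest relative to string length."""
    def score(candidate: str) -> float:
        return lcs_length(token, candidate) / max(len(token), len(candidate))
    return max(candidates, key=score)


# "rn" is a typical OCR confusion for "m".
print(best_candidate("herbariurn", ["herbal", "barium", "herbarium"]))
```

In a fuller system the LCS score would be one factor among several; here it is the only criterion, so ties are resolved by list order.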
Lexical Postcorrection of OCR-Results: The Web as a Dynamic Secondary Dictionary?
Description: Most tokens that occur in a text of a given area can be found in web pages with a direct relationship to the given thematic field. Relevant web pages can be retrieved using simple queries to search engines. By analyzing the vocabulary of such pages, thematic dictionaries can be built automatically. These dictionaries significantly improve the coverage of standard dictionaries, yield estimates for the occurrence frequencies of words in the given area that are more reliable than frequencies derived from general-purpose corpora, and thus help to improve the correction adequacy of systems for lexical postcorrection.
GBIF introductory videos
Description: A series of short introductory training videos published by GBIF (gbif.org) about biodiversity data quality and data fitness for use.
List of data quality related tools
Description: This is a complete list of biodiversity informatics tools related to data quality. It is the basis for the community review of that subsection of the GBIF catalogue of tools.
NBN Record Cleaner
Description: NBN Record Cleaner is a new, free software tool to help people improve the quality of their wildlife records and databases.
Whether you are an individual recorder or work in an organisation such as a Local Record Centre or a Recording Scheme, the NBN Record Cleaner is designed to help you spot common problems in your data. The goal is to aid the process of data cleaning and ensure the quality of any datasets you pass on to others.
It is designed to access biological records stored in a wide variety of formats such as text files (CSV, tab-delimited, etc.), Excel spreadsheets and databases, including those in biological recording packages such as Recorder and MapMate. It also allows you to check that your dataset is in the NBN Exchange Format prior to submission to the NBN Gateway.
OpenUp! Data Quality Toolkit
Description: The Data Quality Toolkit is an open web-based application for BioCASE providers participating in the OpenUp! Project who wish to perform data quality checks on their data. Based on a user-defined filter, the system retrieves records from a given BioCASE provider software installation and applies a set of selected data quality rules.
Timpute
Description: Timpute is a Perl package based on the TiMBL software that self-corrects the contents of each cell in a database based on the rest of the database. Timpute is essentially a wrapper that processes the database and passes it piece by piece to TiMBL, whose output is then parsed back into a CSV file.
Paper: Van den Bosch, Antal, Marieke Van Erp, and Caroline Sporleder. "Making a clean sweep of cultural heritage." IEEE Intelligent Systems 24.2 (2009): 54-63.
Validato
Description: Knowledge about a domain can be used to identify inconsistencies in data from that domain. This is the main assumption behind the ontology-driven data cleaning method implemented in Validato.
Paper: van Erp, Marieke, et al. "Natural Selection: Finding Specimens in a Natural History Collection." CHANGING DIVERSITY IN CHANGING ENVIRONMENT (2011): 375.
OpenRefine
Description: OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase. See also https://wiki.biovel.eu/display/doc/Tutorial+for+Data+Refinement+Workflow+v16
Darwin Core Archive (DwC-A) validation
Description: The dwca-validator library shall: (1) Take a Darwin Core Archive file (or a URL pointing to an archive) as input. (2) Run some validation routines. (3) Return results as structured content (object). Each concrete record evaluator class shall implement the RecordEvaluator interface in order to become a possible member of a validation chain. Evaluators run at the record level, with a post-iteration phase at the end of the archive (to allow validation with full archive scope, e.g. uniqueness).
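The validation-chain design described above can be illustrated with a short Python sketch. It is not the dwca-validator API: only the name RecordEvaluator comes from the description, while the method names (`handle_record`, `post_iterate`), the result type, the two example evaluators, and the dictionary-based records are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass


@dataclass
class ValidationResult:
    record_id: str
    message: str


class RecordEvaluator(ABC):
    """Member of a validation chain (method names are hypothetical)."""

    @abstractmethod
    def handle_record(self, record: dict) -> list:
        """Check one record and return a list of ValidationResult."""

    def post_iterate(self) -> list:
        """Called once after all records; for checks that need archive scope."""
        return []


class RequiredFieldEvaluator(RecordEvaluator):
    def __init__(self, field_name: str):
        self.field_name = field_name

    def handle_record(self, record: dict) -> list:
        if not record.get(self.field_name):
            return [ValidationResult(record.get("id", ""), f"missing {self.field_name}")]
        return []


class UniquenessEvaluator(RecordEvaluator):
    def __init__(self, field_name: str):
        self.field_name = field_name
        self.seen = Counter()

    def handle_record(self, record: dict) -> list:
        self.seen[record.get(self.field_name, "")] += 1  # only collect; report later
        return []

    def post_iterate(self) -> list:
        return [ValidationResult(value, f"duplicate {self.field_name}")
                for value, count in self.seen.items() if count > 1]


def validate(records, chain) -> list:
    results = []
    for record in records:           # record-level phase
        for evaluator in chain:
            results.extend(evaluator.handle_record(record))
    for evaluator in chain:          # post-iteration phase
        results.extend(evaluator.post_iterate())
    return results


records = [{"id": "1", "scientificName": "Abies alba"},
           {"id": "1", "scientificName": ""}]
chain = [RequiredFieldEvaluator("scientificName"), UniquenessEvaluator("id")]
for result in validate(records, chain):
    print(result)
```

Separating the per-record phase from the post-iteration phase lets checks such as uniqueness stay in the same chain without holding the full archive in memory.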
Kurator: A Kepler Package for Data Curation Workflows
Description: This paper presents Kurator, a software package for automating data curation pipelines in the Kepler scientific workflow system. Several curation tools and services are integrated into this package as actors to enable construction of workflows to perform and document various data curation tasks. The integration of Google cloud services (e.g., Google spreadsheets) allows workflow steps to invoke human experts outside the workflow in a manner that greatly simplifies the complex data handling in distributed, multi-user curation workflows.
See also the presentation "Workflow Support for Continuous Data Quality Control in a FilteredPush Network" by Ludaescher at TDWG 2014.
Taxonomic Name Resolution Service
Description: The Taxonomic Name Resolution Service (TNRS) is a tool for the computer-assisted standardization of plant scientific names. The TNRS corrects spelling errors and alternative spellings to a standard list of names, and converts out-of-date names (synonyms) to the current accepted name. The TNRS can process many names at once, saving hours of tedious and error-prone manual name correction. For names that cannot be resolved automatically, the TNRS presents a list of possibilities and provides tools for researching and selecting the preferred name.
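The two steps described above (correcting the spelling against a standard list, then mapping synonyms to the accepted name) can be sketched locally in a few lines. This is not how the TNRS is implemented and does not call its service: the tiny name table, the `resolve` function, and the similarity cutoff are illustrative assumptions, and fuzzy matching is done with Python's standard `difflib`.

```python
import difflib

# Illustrative table only: listed name -> accepted name (a synonym maps to another name).
NAMES = {
    "Abies alba": "Abies alba",
    "Picea abies": "Picea abies",
    "Pinus abies": "Picea abies",
}


def resolve(name: str, names: dict = NAMES, cutoff: float = 0.8):
    """Return (matched listed name, accepted name), or None if nothing is close enough."""
    matches = difflib.get_close_matches(name, list(names), n=1, cutoff=cutoff)
    if not matches:
        return None  # would be left for manual resolution
    matched = matches[0]
    return matched, names[matched]


print(resolve("Pinus abeis"))    # misspelled synonym
print(resolve("Quercus robur"))  # not in the illustrative table
```

Names that fall below the cutoff return `None`, which corresponds to the cases the TNRS hands back to the user with a list of possibilities.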
The Plant List
Description: The Plant List is a working list of all known plant species. It aims to be comprehensive for species of vascular plants (flowering plants, conifers, ferns and their allies) and of bryophytes (mosses and liverworts).
Description: The quantity and heterogeneity of data in the biodiversity sciences have given rise to many distributed resources. Typically, researchers wish to combine these resources into multi-step computational tasks for a range of analytical purposes. Workflows, made of modularised units that can be repeated, shared, reused and repurposed, offer a practical solution for this task.
Semantic Markup / Annotation According to Standards
CharaParser
Description: CharaParser is a software application for semantic annotation of morphological descriptions. CharaParser annotates semistructured morphological descriptions in such a way that all stated morphological characters of an organ are marked up in Extensible Markup Language (XML) format.
Paper: Cui, Hong. "CharaParser for fine‐grained semantic annotation of organism morphological descriptions." Journal of the American Society for Information Science and Technology 63.4 (2012): 738-754.
brat rapid annotation tool
Description: brat is a web-based tool for text annotation; that is, for adding notes to existing text documents. brat is designed in particular for structured annotation, where the notes are not freeform text but have a fixed form that can be automatically processed and interpreted by a computer.
Terminizer
Description: The Terminizer detects ontological terms in pieces of text such as publications or experimental annotations. Trial page: http://wlnebc1-prod.nwl.ac.uk/index.html
DBpedia Spotlight
Description: DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. DBpedia Spotlight recognizes that names of concepts or entities have been mentioned (e.g. "Michael Jordan"), and subsequently matches these names to unique identifiers (e.g. dbpedia:Michael_I._Jordan, the machine learning professor, or dbpedia:Michael_Jordan, the basketball player). It can also be used for building your own solution for Named Entity Recognition, Keyphrase Extraction, Tagging, and other information extraction tasks.
MBT
Description: MBT is a memory-based tagger-generator and tagger in one. The tagger-generator part can generate a sequence tagger on the basis of a training set of tagged sequences; the tagger part can tag new sequences. MBT can, for instance, be used to generate part-of-speech taggers or chunkers for natural language processing. It has also been used for named-entity recognition, information extraction in domain-specific texts, and disfluency chunking in transcribed speech.
GoldenGATE
Description: GoldenGATE is an editor which allows the creation of new XML content from plain text / HTML data. It is ideally suited to the special needs of marking up OCR output with XML. The idea is to support a user in creating XML markup as far as possible. This comprises automation support for manual editing of XML as well as fully automated creation of markup. Its functionality ranges from easy, intuitive tagging through markup conversion to dynamic binding of configurable plug-ins provided by third parties, e.g. NLP plug-ins.
Information Extraction and Search
General Architecture for Text Engineering (GATE)
Description: GATE can be used for all types of computational task involving human language. GATE excels at text analysis of all shapes and sizes. GATE has grown over the years to include a desktop client for developers, a workflow-based web application, a Java library, an architecture and a process. GATE provides an integrated development environment for language processing components bundled with a very widely used Information Extraction system (ANNIE) and a comprehensive set of other plugins. ANNIE can be used to create RDF or OWL metadata for unstructured content. The Java Annotation Patterns Engine (JAPE) of GATE provides finite state transduction over annotations based on regular expressions. GATE includes core components for diverse language processing tasks, e.g. parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others.
Description: This project serves to support the processing of organism names and represents a consolidation of effort by a number of partnering organisations and initiatives. The project includes links to web applications, source code, shared dictionaries and test files, and other useful resources for those in need of tools and services for finding, parsing, and processing taxonomic names. This project was initiated by the participants of the Nomina IV workshop (May 2009) supported by the Encyclopedia of Life and the Global Biodiversity Information Facility. This project serves as the basis for a Working Session on Name Processing at the TDWG 2009 conference.
Citation: Thessen, Anne E., Hong Cui, and Dmitry Mozzherin. "Applications of natural language processing in biodiversity science." Advances in bioinformatics 2012 (2012).
Description: This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life-wide and most still require testing and development.
Citation: Wei, Qin, P. Bryan Heidorn, and Chris Freeland. "Name Matters: Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library (BHL)." (2010).
Description: Taxonomic Name Recognition (TNR) is a prerequisite for more advanced processing and mining of full-text taxonomic literature. This paper investigates three issues of current TNR tools in detail: (1) The difficulties and methods used in TNR. (2) The performance of Optical Character Recognition (OCR) and TNR tools on samples from the Biodiversity Heritage Library (BHL). (3) The methods for potential improvement.
Citation: Bank, Mathias, and Martin Schierle. "A Survey of Text Mining Architectures and the UIMA Standard." Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, 23-25 2012 (2012).
Description: In recent years, different NLP frameworks have been proposed to provide an efficient, robust and convenient architecture for information processing tasks. This paper presents an overview of the most common approaches with their advantages and shortcomings, and discusses them with respect to the first standardized architecture, the Unstructured Information Management Architecture (UIMA).
CRF++: Yet Another CRF toolkit
Description: CRF++ is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. CRF++ is designed for generic purpose and will be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.
Global Names Recognition and Discovery
Description: Find scientific names on web pages, PDFs, Microsoft Office documents, images, or in freeform text. Uses TaxonFinder and NetiNeti name discovery algorithms. See http://globalnames.org/Available_code for more services.
Citation: Lendvai, Piroska, and Steve Hunt. "From Field Notes towards a Knowledge Base." LREC. 2008.
Description: We describe the process of converting plain text cultural heritage data to elements of a domain-specific knowledge base, using general machine learning techniques. First, digitised expedition field notes are segmented and labelled automatically. In order to obtain perfect records, we create an annotation tool that features selective sampling, allowing domain experts to validate automatically labelled text, which is then stored in a database. Next, the records are enriched with semi-automatically derived secondary metadata. Metadata enable fine-grained querying, the results of which are additionally visualised using maps and photos.
Citation: Sporleder, Caroline. "Natural language processing for cultural heritage domains." Language and Linguistics Compass 4.9 (2010): 750-768.
Description: Recently, more and more cultural heritage institutes have started to digitise their collections, for instance to make them accessible via web portals. However, while digitisation is a necessary first step towards improved information access, to fully unlock the knowledge contained in these collections, users have to be able to easily browse, search and query these collections. This requires cleaning, linking and enriching the data, a process that is often too time-consuming to be performed manually. Information technology can help with (partially) automating this task.
Morphisto
Description: Morphisto is a morphological analyzer and generator for German wordforms. The basis of Morphisto is the open-source SMOR morphology for the German language developed by the University of Stuttgart (GPL v2), for which a free lexicon is provided under the Creative Commons 3.0 BY-SA Non-Commercial license.
ENVIRONMENTS
Description: A standalone command line application capable of identifying environment descriptive terms, such as "coral reef, cultivated land, glacier, pelagic, forest, lagoon", in text. The Environment Ontology (EnvO), a community resource offering a controlled, structured vocabulary for biomes, environmental features, and environmental materials, serves as the source of names and synonyms for this identification process. Given a folder with plain text files, ENVIRONMENTS uses its name and synonym dictionary to report each detected environment descriptive term, its start and end position in each document, and the corresponding Environment Ontology identifier.
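Dictionary-based tagging of this kind can be sketched in a few lines of Python. This is an illustration only, not the ENVIRONMENTS implementation: the `tag` function is hypothetical, it works on a single string rather than a folder of files, and the identifiers in the dictionary are placeholders, not real Environment Ontology identifiers.

```python
import re

# Placeholder dictionary: lower-cased term -> placeholder identifier (NOT real EnvO IDs).
TERMS = {
    "coral reef": "ENVO:PLACEHOLDER_1",
    "reef": "ENVO:PLACEHOLDER_2",
    "forest": "ENVO:PLACEHOLDER_3",
    "lagoon": "ENVO:PLACEHOLDER_4",
}


def build_pattern(terms: dict):
    """Alternation of all terms, longest first, so 'coral reef' wins over 'reef'."""
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
                      re.IGNORECASE)


def tag(text: str, terms: dict = TERMS) -> list:
    """Return (matched text, start, end, identifier) for every dictionary hit."""
    pattern = build_pattern(terms)
    return [(m.group(0), m.start(), m.end(), terms[m.group(0).lower()])
            for m in pattern.finditer(text)]


text = "Collected near a coral reef and a lagoon."
for hit in tag(text):
    print(hit)
```

Sorting the alternatives by length is a simple way to prefer the most specific term when dictionary entries overlap.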
see: "Projektrelevante domänenspezifische Infrastrukturen (1.2)" (project-relevant domain-specific infrastructures)
UIMA
Description: UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
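The component decomposition described above can be illustrated with a small Python sketch. It does not use the UIMA API (UIMA components are written in Java or C++ and configured via XML descriptors); the `Document` class, the `process` method, and both example components are assumptions that only mirror the idea of components sharing one interface and adding annotations to a common data structure.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Shared structure passed between components: text plus stand-off annotations."""
    text: str
    annotations: list = field(default_factory=list)  # (type, start, end)


class SentenceDetector:
    def process(self, doc: Document) -> None:
        for m in re.finditer(r"[^.!?]+[.!?]?", doc.text):
            if m.group(0).strip():
                doc.annotations.append(("Sentence", m.start(), m.end()))


class CapitalizedTokenDetector:
    """Very rough stand-in for entity detection: marks capitalized words."""

    def process(self, doc: Document) -> None:
        for m in re.finditer(r"\b[A-Z][a-z]+\b", doc.text):
            doc.annotations.append(("Capitalized", m.start(), m.end()))


def run_pipeline(doc: Document, components: list) -> Document:
    for component in components:  # the driver controls order and data flow
        component.process(doc)
    return doc


doc = run_pipeline(Document("Linnaeus described it. It grows in Europe."),
                   [SentenceDetector(), CapitalizedTokenDetector()])
for annotation_type, start, end in doc.annotations:
    print(annotation_type, repr(doc.text[start:end]))
```

Because every component reads from and writes to the same document object, components can be added, removed, or reordered without changing the others.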
Akka
Description: Build powerful concurrent & distributed applications more easily. Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.
Scratchpads
Description: Scratchpads are an online virtual research environment for biodiversity, allowing anyone to share their data and create their own research networks. Sites are hosted at the Natural History Museum London, and offered freely to any scientist who completes an online registration form.
Paper: Smith, Vincent S., et al. "Scratchpads 2.0: a Virtual Research Environment supporting scholarly collaboration, communication and data publication in biodiversity science." ZooKeys 150 (2011): 53.
ComTax Project, A Community-driven Curation Process for Taxonomic Databases
Description: To motivate multiple research communities to engage in the identification of potentially new taxonomic names in the literature and linking these to known hierarchies, the project will build curation web services under the Scratchpad framework. These will make use of the extensive and flexible data models and potential wider biodiversity community network provided by Scratchpads to enhance the applicability and utilisation of the service.
Paper: Yang, Hui, et al. "Literature-driven Curation for Taxonomic Name Databases." (2013).
LifeWatch
Description: LifeWatch is the European research infrastructure on biodiversity. It is building virtual, instead of physical, laboratories supplied by the most advanced facilities to capture, standardise, integrate, analyse and model biodiversity, and to consider scenarios of change. The LifeWatch ICT Infrastructure will be a distributed system of nodes that provide access to and processing of biodiversity data from a variety of sources through common open interfaces. The LifeWatch Reference Model is based on the ORCHESTRA Reference Model (RM OA).
Paper: Basset, A., and W. Los. "Biodiversity e-Science: LifeWatch, the European infrastructure on biodiversity and ecosystem research." Plant Biosystems - An International Journal Dealing with all Aspects of Plant Biology 146 (2012): 780-782.
Paper: Ernst, Vera Hernández, et al. "Towards a Reference Model for the LifeWatch ICT Infrastructure." GI Jahrestagung. 2009.
ORCHESTRA
Description: ORCHESTRA is a generic specification framework for geospatial service-oriented architectures and service networks that is endorsed as best practice by the Open Geospatial Consortium (OGC).
Paper: Lauzán, José Fernando Esteban, et al. "Orchestra: an open service architecture for risk management." (2008).
SERVUS (requirements analysis for service-oriented information systems)
Description: This method is used to analyse and document use cases and the requirements associated with them. In the meantime, we have also developed a web-based software tool that supports and simplifies the creation and annotation of use cases and requirements.
Paper: Usländer, Thomas, and Batz, Thomas. "How to analyse user requirements for service-oriented environmental information systems." Environmental Software Systems. Frameworks of eEnvironment. Springer Berlin Heidelberg, 2011. 161-168.
Paper: Usländer, Thomas, Batz, Thomas, and Schaaf, Hylke van der. "SERVUS - collaborative tool support for agile requirements analysis." Presentation held at 10th IFIP WG 5.11 International Symposium on Environmental Software Systems, ISESS 2013, Workshop ENVIP, Neusiedl am See, Austria, October 9-11, 2013
Tools, algorithms and services developed by uBio
Description: findIT identifies scientific names in any text; the Author Abbreviation Resolver is a thesaurus for resolving abbreviations of author names in scientific nomenclature; ParseIT accepts a complex scientific name and breaks it into its component parts ...
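What "breaking a scientific name into its component parts" means can be shown with a deliberately simple sketch. This is not uBio's ParseIT: the regular expression and the `parse_name` function are assumptions, and the sketch handles only plain binomials with optional authorship (anything after the epithet, including infraspecific ranks, ends up in the authorship field).

```python
import re

# Assumed pattern: capitalized genus, lower-case epithet, optional trailing authorship.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+(?P<epithet>[a-z][a-z-]*)(?:\s+(?P<authorship>.+))?$"
)


def parse_name(name: str):
    """Split a binomial into genus, epithet and authorship; None if it does not fit."""
    match = NAME_PATTERN.match(name.strip())
    return match.groupdict() if match else None


print(parse_name("Picea abies (L.) H.Karst."))
print(parse_name("Abies alba"))
print(parse_name("not a name"))
```

A production parser has to deal with hybrids, infraspecific ranks, and abbreviated genera, which is why dedicated tools exist for this task.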
Citation: Bauer, A. "Assisted interpretation of infrastructure facilities from aerial imagery." SPIE Europe Security+ Defence. International Society for Optics and Photonics, 2009.
Description: While designed for a completely different domain, this kind of tool could serve as inspiration (or even be adapted) for a solution that supports the steps where human interaction is necessary. The evaluation of a country's critical infrastructure requires a detailed analysis of facilities such as airfields, harbors, communication lines and heavy industry. To improve the interpretation process, an interactive support system for the interpretation of infrastructure facilities from aerial imagery is developed. The aim is to facilitate the training phase for beginners, increase the flexibility in the assignment of interpreters and improve the overall quality of the interpretation.