TechnicalSpecification

From Annotationssystem für Biodiversitätsdaten

Overview

Based on the requirements identified in the Functional Specification, the AnnoSys system will be implemented according to the Technical Specification described in this document. AnnoSys will be built on the following interacting system components:

  • Repository
  • Data Sharing
  • Message System
  • User Interface
  • Security
Figure: Overview of the technical specification draft

The repository component persists annotations together with a copy of the related XML record document as presented to the user at creation time. The data sharing component provides service interfaces that enable external clients or services to search the repository or to retrieve data from it. The message system component organises information flows, such as informing annotators, curators or other users about status changes in annotation workflows. A web-based but desktop-oriented user interface shall enable annotators, curators and users to manage the partly complex annotation workflows intuitively. Finally, the security component is responsible for secure authentication, authorisation and data privacy within the system.

The following chapters first explain the annotation context model. Based on that context model, the underlying data model and data exchange format are introduced. Subsequently, the system components are described in detail.

Annotation context modelling

AnnoSys' annotation context modelling approach is founded on the presumption that annotations relate to data records that are uniquely identifiable by Globally Unique Identifiers (GUIDs)[1][2]. Beyond that, it is presumed that these records can be made accessible as XML documents in a well-defined standard format such as ABCD[3] or Darwin Core[4]. Several use cases, e.g. annotating geolocations, require combining multiple record data elements, each with its proposed new value and/or an annotator comment, into a single annotation. That is, the annotation's context model has to integrate general meta information (e.g. identifying the annotated record) with an aggregated list of annotated elements, each proposing a new value and/or commenting in natural language on a selected set of data record elements.

Therefore, in AnnoSys the annotation context will be defined by

  • GUID of a unit for which a relating data record exists (tripleID)
  • An XML representation of the data record
  • A selection of XML elements (XPath expression) within the XML data record representation affected by the annotation

The following sections introduce AnnoSys' interpretation and implementation of the annotation context elements mentioned above.

GUIDs in AnnoSys

With regard to biodiversity collections, the term GUID initially reflects data records describing physical unit objects in museum collections. As such, a GUID is usually represented as a tripleID constructed by the following elements:

  • institutionID
  • collectionID
  • unitID

As such, TDWG XML record document standards like ABCD[3] or Darwin Core[4] define elements reflecting that data triple.

In general, AnnoSys' basic annotation workflow starts by downloading and analysing an XML record document. As part of this analysis, the system extracts the tripleID from the document and compares the document with the most recent version available for that tripleID within the repository. If any differences are detected, a new revision of the data record is assumed, and the system adds a new version of the related XML record document to the repository. This is required because any change within the XML record document may invalidate annotations created before. Thus, annotations in AnnoSys must be connected with a specific version of the annotated XML record document. Therefore, AnnoSys uses LSIDs[5][6], which extend the previously introduced notion of tripleID by the desired version information.
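
The revision check described above can be sketched as follows. This is a minimal illustration, not the AnnoSys implementation: the in-memory dictionary standing in for the repository, the function name `store_if_changed`, the sample tripleID and the byte-wise comparison of documents are all assumptions made for the example.

```python
import time


def store_if_changed(repository, triple_id, document, now_ms=None):
    """Return the revision under which `document` is stored for `triple_id`.

    `repository` maps a tripleID to a list of (revision, document) pairs,
    oldest first. A new revision is added only if the downloaded document
    differs from the most recent stored version (byte-wise comparison is an
    assumption of this sketch).
    """
    versions = repository.setdefault(triple_id, [])
    if versions and versions[-1][1] == document:
        return versions[-1][0]  # unchanged: reuse the latest revision
    # Revision = milliseconds since 01/01/1970 at download time.
    revision = now_ms if now_ms is not None else int(time.time() * 1000)
    versions.append((revision, document))
    return revision


repo = {}
unit = ("BGBM", "Herbarium", "B 10 0000001")  # hypothetical tripleID
first = store_if_changed(repo, unit, b"<Unit>v1</Unit>", now_ms=1000)
same = store_if_changed(repo, unit, b"<Unit>v1</Unit>", now_ms=2000)
changed = store_if_changed(repo, unit, b"<Unit>v2</Unit>", now_ms=3000)
```

Here the second download leaves the repository untouched, while the third one, whose content differs, creates a new version that later annotations would refer to.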

The same information may be available in different data formats. AnnoSys can only reproduce annotations correctly if the standard of the annotated XML record document has also been recorded. Therefore, AnnoSys has further extended the LSID specification by adding the namespace URI of the annotated XML record document. Thus, currently a persistent identifier within AnnoSys is defined as follows:

The URN namespace "annosys" has been chosen to avoid different interpretations of the term GUID; subsequent redefinitions or adaptations to other persistent identifier schemes may be implemented instead.

As all namespace specific strings forming the GUID URN may contain special characters, it is required to transform them according to RFC 2141[7].
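
The escaping step can be illustrated with a short sketch. The set of characters left unescaped follows RFC 2141; everything else here is an assumption made for the example: the function names, the sample values, the order of the components and the use of ":" as component separator (which is why ":" inside a component is escaped as well).

```python
import string

# Characters RFC 2141 permits unescaped in a namespace specific string,
# minus ":" which this sketch reserves as the component separator.
_ALLOWED = set(string.ascii_letters + string.digits + "()+,-.=@;$_!*'")


def encode_component(value):
    """%-escape every UTF-8 byte that is not in the allowed set."""
    out = []
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in _ALLOWED else "%{:02X}".format(byte))
    return "".join(out)


def build_urn(institution_id, collection_id, unit_id, revision, format_uri):
    """Join the escaped components into a URN (hypothetical layout)."""
    parts = [institution_id, collection_id, unit_id, str(revision), format_uri]
    return "urn:annosys:" + ":".join(encode_component(p) for p in parts)


urn = build_urn(
    "BGBM",                                    # hypothetical institutionID
    "Herbarium Berolinense",                   # hypothetical collectionID
    "B 10 0000001",                            # hypothetical unitID
    1349000000000,                             # revision in milliseconds
    "http://www.tdwg.org/schemas/abcd/2.06",   # format namespace URI
)
```

Escaping each component before joining keeps the separator unambiguous, so the identifier can be split back into its parts without inspecting their content.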

XML Data Record representations

AnnoSys expects XML data record documents in one of the following formats:

  • ABCD 2.06
  • Darwin Core 1.2

Other formats or versions may be added as required.

Annotation context selection

An annotator may identify several items within a data record as worth annotating. As AnnoSys is planned to work with XML record documents, a mechanism to identify the corresponding data elements within the related XML document has to be implemented. Evaluation showed that using XML diff representations, or inserting annotated values and/or natural language comments as XML comments into XML record documents, results in cryptic result representations, low performance or conceptual difficulties. Therefore, AnnoSys introduces a mechanism realising a sort of "recorder" functionality that captures the XML data record elements addressed by annotators.

As XPath[8] expressions are universally applicable within the XML domain, regardless of the XML data record document format, XPath can be used to define expressions selecting (sets of) record data elements. These selectors may uniquely identify a single data element, but can also encompass sets of data elements like

  • "all elements named "xxx" located below the element "yyy""
  • "all elements named "xxx" with a value "zzz""

The range of use cases extends from, e.g., typo corrections in a single data element to mass annotations of erroneous geographic location references.
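
The two kinds of set selectors listed above can be tried out with the limited XPath support of Python's standard library. The record below is a made-up, namespace-free document, not ABCD or Darwin Core; its element names and values are assumptions chosen only to make the selectors visible.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified record document (no real standard format).
RECORD = """<Unit>
  <Gathering>
    <Country>Germany</Country>
    <Locality>Berlin</Locality>
  </Gathering>
  <Identification>
    <Country>Germany</Country>
    <Country>France</Country>
  </Identification>
</Unit>"""

root = ET.fromstring(RECORD)

# "all elements named Country located below the element Gathering"
below = root.findall(".//Gathering//Country")

# "all elements named Country with the value Germany"
by_value = root.findall(".//Country[.='Germany']")

for element in by_value:
    print(element.tag, element.text)
```

The first selector addresses a single element in this record, while the second addresses a set of elements in different branches, which is the situation a mass annotation has to deal with.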

Note: XPath filter expressions may become highly complex. Not every XPath filter expression (whether produced manually or automatically) can be adequately presented in a graphical user interface that shows an annotation's differences in relation to the original document.

Data Model

The data model specified in the next sections builds mainly on the results of chapter Annotation context modelling. As such, annotations will be described within the data model by two interrelated parts:

  • Annotation Metadata Specification
  • Annotated Elements Specification

While the metadata specification provides general meta information (e.g. determining the annotated record and document), it will be augmented by a list of elements recording the annotation's context selections. The following sections will provide more detailed information on both parts.

Annotation Data Model

Annotation Metadata Specification

Annotation metadata provide general information about

  • annotation creation
  • annotated record
  • annotating agent
  • optional annotation aspects

Except for the optional annotation aspects, all data described in the following sections is mandatory in order to instantiate a valid annotation.

Annotation creation

Any creation of an annotation entails the generation of the following data elements

  • Motivation
  • Datetime (of the creation)
  • GUID (of the annotation)

Usually, annotations will be proposed for well determined reasons. In the first instance, these reasons were thought to be included as "annotation type" outlining particular intentions of the given modification or comment. Inspired by the extensions[9] proposed by the TDWG Annotations Interest Group[10] the term motivation was introduced synonymously to the former notion of annotation type. The term shall express the reason for which an annotator has made the annotation. Currently, AnnoSys supports the following motivations (annotation types):

  • Determination
  • Locality of duplicates
  • Gathering
  • Nomenclatural type
  • Sequence
  • Record basis
  • Scientific name
  • Label text
  • Other

The creation datetime describes the point in time at which the annotation was published by the system. This time usually differs from the time at which the annotation was created within the system for user processing. The time's granularity is milliseconds.

The GUID of the annotation shall uniquely identify each annotation created. In accordance with the extended GUID interpretation in AnnoSys, the GUID's "institutionID" should match the organisation running the AnnoSys instance that creates the annotation. The "collectionID" should match the name of the annotation repository (e.g. "AnnoSys"). The "unitID" expresses the time in milliseconds since 01/01/1970 at which the annotation was first instantiated in the system for processing, whereas the revision is represented by the time in milliseconds since 01/01/1970 at which the annotation was published by the system. Finally, the format is determined by the namespace URI identifying the standard format and version used to publish the annotation (e.g. Open Annotation: http://www.w3.org/ns/openannotation/core/).

Annotated record

The term annotated record circumscribes information regarding the underlying data record representation of the annotated unit:

  • record GUID
  • record document format
  • record document source
  • record document revision

The recordGUID identifies the collection unit, on which any kind of data records may rely. In general, that recordGUID is expected to be represented in tripleID format as described in section GUIDs in AnnoSys. Optionally, other formats may be supported.

The record document format describes the format of the annotated record document. It is expected to be represented by the namespace URI of the related XML standard format including its version (e.g. ABCD: http://www.tdwg.org/schemas/abcd/2.06).

The document source states the source from which the annotated record document has been downloaded. Usually, this will be represented in URL form. Optionally, any URI denoting a standard transport or retrieval protocol like FTP, or DOIs, may be supported.

The revision information reflects any kind of version information regarding the annotated record document. If no revision information can be extracted from the record document based on the given document format, the time in milliseconds since 01/01/1970 at which the document was first downloaded into the system will be set as revision information.

Annotating agent

An agent can be either a person or a machine conducting the annotation. Conforming to the common practice in botanical museums, only a minimum set of data has to be captured:

  • name
  • email
  • institution

The name field should include first name(s) and last name(s) of the annotating person, or the name and version of the software or service used to create the annotation.

The email address should hold a valid email address of the annotating person, or the email address of the annotating machine's administrator.

The institution denotes the organisation the annotating agent belongs to. If the institution field is empty, the authorisation component will consider the agent a member of the "private taxonomist" group.

Optional annotation aspects

The provision of the following aspects is not mandatory for annotations. Mainly, they provide additional information regarding the annotation management processed by the collection manager or the access control component of the issuing annotation system.

  • evidence
  • expectation
  • constraints

The "evidence" aspect provides a means to justify an annotation by substantial proof. For instance, the evidence may be provided as a list of references, links to multimedia content or other instruments of evidence. It is a point for discussion whether evidence should be provided for the annotation as a whole or for each annotated element separately.

Using the "expectation" aspect, an annotator may express the expected outcome of the annotation, especially with regard to the holder(s) of the annotated record unit(s). The expectation aspect is part of this data model mainly for conformance with the proposal of the TDWG Annotations Interest Group[10]. In terms of AnnoSys, the general expectation can be assumed to be the annotator's wish that the curator enters the annotated data into the collection database.

The "constraints" aspect may be used by annotators to define constraints on a specific annotation (e.g. because of endangered species or a planned scientific publication), for instance to prevent unwanted publication or unintended usage, or to impose time, copyright or other restrictions. This aspect mainly applies to the system creating annotations on behalf of the annotator. That is, authorisation restrictions must be applied before the system shows annotation data to users or exchanges annotation data with other services.

Annotated Elements Specification

According to the definition described in chapter Annotation context modelling, any entry on the annotated element list mentioned earlier combines the following information to propose a new value and/or comment on a selected set of data record elements to the data collection manager:

  • Element selector (XPath expression)
  • Value proposal
  • Text comment

While both "value proposal" and "text comment" are optional, at least one of them must be given in order to provide valuable information. That is, annotation data that includes neither a value proposal nor a free text comment has to be considered invalid.
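
The validity rule for entries of the annotated element list can be sketched as a small data class. The class and attribute names are assumptions for illustration and are not taken from the AnnoSys code base. `None` stands for "not given"; an empty string as value proposal is kept distinct, since this specification interprets an empty value proposal as a proposal to remove the selected elements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnnotatedElement:
    selector: str                          # XPath expression
    value_proposal: Optional[str] = None   # None means "not given"
    comment: Optional[str] = None          # None means "not given"

    def __post_init__(self):
        # At least one of value proposal and text comment is required.
        if self.value_proposal is None and self.comment is None:
            raise ValueError("value proposal or text comment required")

    @property
    def proposes_removal(self):
        # An empty value proposal is read as "remove the selected elements".
        return self.value_proposal == ""


fix = AnnotatedElement("//Country", value_proposal="Germany")
note = AnnotatedElement("/", comment="Label text is hard to read.")
drop = AnnotatedElement("//Duplicate", value_proposal="")
```

Keeping "not given" and "empty" apart matters here, because merging them would turn every comment-only entry into a removal proposal.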

The "element selector" will be realised as an XPath expression according to section Annotation context selection. AnnoSys' selector implementation will interpret the following special cases as follows:

  1. If the XPath expression selects an XML element for which no value is assigned in the relating XML Schema, only a text comment may be given in order to make an annotation in natural language.
  2. Selecting the root element ("/") will be interpreted as addressing the entire data record. In particular, a general comment about the referenced document can be stated this way.

"Value proposal" enables annotators to propose a new value for any XML element selected by the element selector. The intended usage is to propose replacing the values of the selected XML elements with the given value; currently, the value proposal is expected to be represented as a text string. Note: An empty value proposal will be interpreted as the annotator's proposal to remove all selected elements from the record.

"Text comment" permits annotators to annotate or comment on the selected set of record elements in natural language. The text comment is also expected to be represented as a text string.

Exchange Format

Compared to the situation during AnnoSys' evaluation phase, the activities of the W3C Open Annotation Community Group[11] significantly ease the decision for a suitable storage and exchange format for annotations. The W3C Open Annotation Community Group pools the forces of two former initiatives:

The consolidation of both initiatives under the umbrella of the W3C promises long-term sustainability of the upcoming standard. As such, the specifications W3C Open Annotation Core Data Model[14] and W3C Open Annotation Extension Specification[15] provide powerful means to express all of AnnoSys' basic requirements defined earlier. Additionally, they provide a number of further concepts for potential system enhancements.

Even though AnnoSys' primary focus is on XML record documents, W3C Open Annotation is a semantic approach defining an ontology for annotation use cases in general. Therefore, annotation data has to be stored using RDF[16] terminology. Due to the early stage of W3C Open Annotation Community Group, the further directions of standard specification can actively be influenced by providing use cases to the W3C Open Annotation Community Group. As such, some extension proposals discussed within the TDWG Annotations Interest Group[10] and AnnoSys' implementation concerns have already raised group discussions and influenced decisions toward the next standard release scheduled for the end of 2012. Thus, W3C Open Annotation is the standard AnnoSys will support and contribute to.

As mentioned above, the upcoming version of W3C Open Annotation will bring some substantial changes. In particular, these changes will provide solutions to problems that cannot be solved adequately with the current Community Draft 1. The following diagrams show the current interpretation of Open Annotation Community Draft 1 as implemented in the first AnnoSys prototype. Due to the pending changes, it will not be discussed in detail. Nevertheless, some comments regarding necessary extensions and the mentioned implementation issues will be given.

Figure: W3C Open Annotation representation of Annotation Metadata

While applying W3C Open Annotation to the annotation metadata defined in the AnnoSys data model was quite straightforward, the realisation of the list of annotated elements caused some problems.

First, AnnoSys introduced the extension XPathSelector, which simply represents an XPath expression as a new selector type for specific target resources.

Second, W3C Open Annotation supports addressing multiple (specific) targets. Unfortunately, it offers no way to relate different bodies to individual targets, because it limits the total number of bodies in an annotation to one. Thus, AnnoSys had to create multiple OA annotations in order to represent the data model's list of annotated elements. The relationship between the annotated elements and the original annotation holding the metadata has been realised by "misusing" the oa:hasSemanticTag property to link the annotated elements to the meta annotation. As a consequence, a number of otherwise unnecessary annotation instances have to be created within the RDF model. The upcoming standard release will permit multiple bodies and provide means to relate bodies to specific targets.

Figure: W3C Open Annotation representation of Annotated Elements

Repository

The previous chapter defines the format of the W3C Open Annotation Community Group[11] as exchange and storage format for annotations in AnnoSys. While W3C Open Annotation produces metadata in RDF format, the record documents are delivered in standardised XML formats (like ABCD or Darwin Core). Both kinds of data (record documents and annotations) are closely related to each other. Thus, both have to be archived within the AnnoSys repository in order to be precisely reproducible at a later point in time. Besides XML and RDF data, the AnnoSys repository also holds user profile information including login credentials and authorisation details.

Record documents

Based on the experience gained from the current implementation of the BioCASE annotation system, the file-system-based, hierarchical storage architecture used there proved to be an acceptable solution for efficiently storing XML record documents. Therefore, the AnnoSys repository also stores XML record documents in a configurable location on the server's file system. Using ABCD terms for the unit's tripleID, the following file path creation scheme is used:

  • $RECORD_BASE_DIR/SourceInstitutionID/SourceID/format/UnitID.revision.xml

The tripleID information (SourceInstitutionID, SourceID and UnitID) will be extracted from the XML record document according to the related standard format. The format will be represented by the XML namespace prefix of the record document's XML namespace URI; a predefined mapping scheme between namespace URI and namespace prefix will be used by all AnnoSys system components. Finally, the revision information will describe the datetime at which the XML record document was first downloaded into the repository. In line with the respective data elements defined within the Data Model, the revision's value will be given as time in milliseconds since 01/01/1970.

Optionally, if required for substantial reasons, record documents may also be stored in a dedicated XML database in later revisions.
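
The path creation scheme can be sketched in a few lines. The function name, the base directory, the sample identifiers and the single entry of the URI-to-prefix mapping are assumptions for illustration; the actual mapping scheme is not given in this document. The sketch also does not deal with identifiers containing characters that are unsafe in file names.

```python
from pathlib import PurePosixPath

# Hypothetical excerpt of the namespace URI -> namespace prefix mapping.
FORMAT_PREFIXES = {
    "http://www.tdwg.org/schemas/abcd/2.06": "abcd",
}


def record_path(base_dir, institution_id, source_id, format_uri,
                unit_id, revision_ms):
    """Build $RECORD_BASE_DIR/SourceInstitutionID/SourceID/format/UnitID.revision.xml."""
    prefix = FORMAT_PREFIXES[format_uri]
    file_name = "{}.{}.xml".format(unit_id, revision_ms)
    return PurePosixPath(base_dir) / institution_id / source_id / prefix / file_name


path = record_path(
    "/var/annosys/records",                    # hypothetical $RECORD_BASE_DIR
    "BGBM", "Herbarium",                       # hypothetical tripleID parts
    "http://www.tdwg.org/schemas/abcd/2.06",
    "B100000001",
    1349000000000,                             # revision in milliseconds
)
print(path)
```

Because the path is derived purely from the identifier parts, any component that knows the mapping can locate a record document without a lookup table.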

Annotations

As AnnoSys uses the format of the W3C Open Annotation Community Group[11], an RDF-based standard for managing annotations, annotations will be stored in so-called RDF stores. RDF stores (also known as triplestores) can be viewed as specialised databases for the storage and retrieval of RDF data (triples). Comparable to SQL in relational database systems, the SPARQL[17] query language is used to retrieve information stored in a triplestore.

Recent evaluation reports and RDF store benchmarking studies[18][19][20] provide a sound overview of suitable RDF store implementations. In particular, these reports find that native stores significantly outperform RDBMS-backed RDF stores. Even so, the benchmarks show that native RDF stores have individual strengths and weaknesses that have to be considered. For AnnoSys, the focus can be narrowed down to the most relevant and best-performing open source solutions supported by the Apache Jena[21] framework.

The benchmarks show that Apache Jena's TDB[22] is an adequately performing solution for development and prototype implementations. As the number of triples to be stored approaches 50 million, Virtuoso Universal Server[23] turned out to be the best-performing triple store. That number will be reached at approximately 500,000 annotations (that is, assuming roughly 100 triples per annotation). AnnoSys will therefore start with the Apache Jena TDB triple store and reserves the option of switching to Virtuoso if the system's performance suggests doing so.

The following sections will provide a brief introduction to both triple store implementations.

TDB

TDB[22] is the native triplestore implementation of the Apache Jena framework. Thus, it is supported by the full range of Jena APIs and is advertised by Jena as a "high performance RDF store on a single machine". Nevertheless, benchmarks show that TDB performs quite well when importing triples, whereas its query and update performance is merely adequate.

Due to the file based approach of TDB, replication, backup or transactions are not directly provided by TDB. On the other hand, SPARQL and full-text queries as well as inference engines are well supported through the Jena API.

Virtuoso Universal Server

Virtuoso Universal Server[23] represents a hybrid storage solution, which can be used for RDF as well as for XML, relational data and free text documents. Through its unified storage, it can serve as an integration point for heterogeneous data sources. While it is offered as an open source product, there are several license models for commercial usage as well. Virtuoso is known for hosting the large datasets of e.g. DBpedia[24].

Benchmarks show that Virtuoso outperforms its competitors in most query and update disciplines. Apparently, this advantage has grown over the last few years. Regarding loading times, Virtuoso scales linearly, but is somewhat slower than its competitors.

Further on, Virtuoso supports transactions, replication, backup, inference engines as well as SPARQL and full-text queries. Moreover, Virtuoso offers back-end providers for being seamlessly integrated with Apache Jena.

User Profiles

The term "user profiles" aggregates the following information concerning user agents:

  • metadata of annotating agents
  • authentication credentials
  • private annotation store

Metadata of annotating agents

"Metadata of annotating agents" addresses the storage of the agent information described within the Data Model. Following the best practice recommendations of the W3C Open Annotation Extension Specification[15], the FOAF vocabulary will be used to describe agents as well as institutions. The metadata will be stored along with the annotation data within AnnoSys' triple store. Also, each of these resources will be referenced within AnnoSys according to the extended GUID interpretation in AnnoSys. Thereby, the URN of an agent resource in AnnoSys might be similar to the following expression:

Please note that if user profile data changes, a new resource with an updated version datetime will be generated in order to retain the annotator information as it was available at annotation creation time.

Furthermore, an agent resource may be represented by the following properties:

property    | definition
rdf:type    | agent type (foaf:Person, foaf:Organization or dcTypes:Software)
foaf:name   | agent's name
foaf:mbox   | agent's mailbox
foaf:member | optional list of members (for agent type foaf:Organization only)

Authentication credentials

Authentication credentials contain the login information for a given agent, as well as a copy of the agent's current metadata. This information will be stored in a SQLite database in which the users table defines the following columns:

column     | description
uid        | agent's login name
credential | agent's password
resourceid | agent's resource URI in AnnoSys triple store
name       | agent's name
mailbox    | agent's mailbox
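
A minimal sketch of such a users table, using Python's built-in SQLite driver, could look as follows. Only the column names come from the table above. The column types, the primary key, the sample values (including the placeholder resource URI) and the decision to store a salted hash instead of the plain password are assumptions of this sketch.

```python
import hashlib
import os
import sqlite3


def hash_credential(password, salt):
    """Derive a salted hash; storing a hash is an assumption of this sketch."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, 100_000)
    return salt.hex() + "$" + digest.hex()


conn = sqlite3.connect(":memory:")  # in-memory database for illustration
conn.execute("""
    CREATE TABLE users (
        uid        TEXT PRIMARY KEY,   -- agent's login name
        credential TEXT NOT NULL,      -- agent's password (hashed here)
        resourceid TEXT NOT NULL,      -- agent's resource URI in the triple store
        name       TEXT NOT NULL,      -- agent's name
        mailbox    TEXT NOT NULL       -- agent's mailbox
    )
""")
conn.execute(
    "INSERT INTO users VALUES (?, ?, ?, ?, ?)",
    ("jdoe", hash_credential("secret", os.urandom(16)),
     "urn:annosys:example-agent",      # hypothetical placeholder URI
     "Jane Doe", "jane.doe@example.org"),
)
row = conn.execute(
    "SELECT resourceid, name FROM users WHERE uid = ?", ("jdoe",)
).fetchone()
```

The resourceid column is what ties a login to the agent resource in the triple store, so the copy of name and mailbox kept here only serves quick access during authentication.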

Private annotation store

AnnoSys maintains a private annotation store for every registered agent within the system. Initially, these stores hold all annotation data created by the user. Agents must explicitly publish annotations in order to move them from the agent's private store to the AnnoSys triple store and make them generally accessible. After publication, the annotation will be removed from the agent's private triple store.

The private triple stores will be created in a separate directory named according to the agent's resource id, with any colons (":") replaced by underscores ("_") in the directory name. The private triple stores will be created using Apache Jena's TDB implementation.

Data Sharing

In order to permit external services to access annotations stored in AnnoSys, there must be interfaces to access RDF-based annotation information as well as XML record documents. Internally, annotations refer to XML record documents through the GUID defined for the XML record document instance as described in the repository specification. The next sections describe how access is provided to both kinds of information.

RDF

Access to the AnnoSys triple store will be given by an open SPARQL[17] endpoint. A SPARQL endpoint enables clients to run any SPARQL query on the AnnoSys triple store over common web protocols like HTTP. Both Apache Jena (Fuseki) and Virtuoso provide such SPARQL endpoints.
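
How a client might address such an endpoint can be sketched as follows. The SPARQL protocol passes the query text in a `query` parameter of an HTTP GET request. The endpoint URL is hypothetical, and the query assumes that annotations are stored with the `oa:hasTarget` property of the Open Annotation namespace named earlier in this document; no request is actually sent here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://annosys.example.org/sparql"  # hypothetical endpoint URL

# List up to ten annotations together with their targets.
QUERY = """\
PREFIX oa: <http://www.w3.org/ns/openannotation/core/>
SELECT ?annotation ?target
WHERE { ?annotation oa:hasTarget ?target }
LIMIT 10"""


def build_query_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL could then be fetched with any HTTP client, for example `urllib.request.urlopen`, with an `Accept` header such as `application/sparql-results+json` to ask for JSON results.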

XML

Along with the related annotations, the AnnoSys repository for XML record documents must be opened for queries by external services as well. This is necessary because the representation of annotated elements depends on the XML record representation available at the time of annotation creation. The following options have been considered:

  • WebDAV/FTP connection to repository files
  • BioCASE Provider Software

Providing WebDAV[25] or FTP[26] services to enable direct read access to the file-based hierarchy of the record repository is an easy and lightweight solution. The mapping scheme from a record document's GUID to the related file path in the record repository is described in section Record documents. Thus, external services simply have to implement that mapping scheme in order to access annotation-related record documents.

Alternatively, the BioCASE Provider Software implementation has to be adapted in order to enable the delivery of XML record documents from the record repository's file hierarchy.

Message System

The message system provides mechanisms to inform system actors like collection managers (curators), annotators or other users about the current status of annotation workflow processes. The message system supports notifications for the following annotation workflow activities:

  • inform collection managers whenever annotations referencing their collection(s) have been issued
  • inform annotators regarding the acceptance or rejection of issued annotations on behalf of the collection manager
  • notify users about ongoing workflow activities

The following sections provide an overview of the technical components used to implement the AnnoSys message system and describe the integration of the AnnoSys message system concept into the technical environment.

Technical components

Technically, the message system builds on Apache ActiveMQ[27], a provider implementation of the Java Message Service (JMS)[28]. System-external message transport protocols like email will be realised through Apache Camel[29]. The following sections briefly introduce the basic concepts of the Java Message Service (JMS), Apache ActiveMQ and Apache Camel.

Java Message Service (JMS)

The Java Message Service (JMS)[28] standard was developed by the Java Community Process and is defined within the specification JSR 914[30]. JMS is an application programming interface (API) for message communication between two or more software components or client applications.

The term messaging describes a loosely coupled form of reliable, asynchronous, distributed message exchange between software components. Senders do not need precise knowledge of their receivers and vice versa. Nevertheless, the message format must be agreed upon by the communication partners. Furthermore, communication partners are not required to be online at the same time. Reliability means that a message is delivered once and only once, even in case of system failures.

JMS provides two communication models:

  • point-to-point
  • publish/subscribe

Using point-to-point communication, the producer sends a message to a queue which is connected to a dedicated receiver. If the receiver is currently not available, the message will be stored by the message service and the receiver can fetch it when it reconnects.

The publish/subscribe model requires the publisher to create a message topic to which an arbitrary number of clients can subscribe. Messages sent to that topic must be actively consumed by connected subscribers or will be lost. Optionally, subscribers may opt for a durable subscription. In that case, messages will be stored persistently by the message service and delivered to the subscribers upon reconnect.
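
The difference between the two communication models can be made tangible with a toy, in-memory stand-in for a message broker. This is not the JMS API and not ActiveMQ: the class `ToyBroker`, its method names and the destination names are invented for this sketch, and durable subscriptions are deliberately left out.

```python
from collections import defaultdict, deque


class ToyBroker:
    """In-memory illustration of queues (point-to-point) and topics."""

    def __init__(self):
        self.queues = defaultdict(deque)      # queue name -> stored messages
        self.subscribers = defaultdict(list)  # topic name -> callbacks

    def send(self, queue, message):
        # Point-to-point: the message is stored until the receiver fetches it.
        self.queues[queue].append(message)

    def receive(self, queue):
        stored = self.queues[queue]
        return stored.popleft() if stored else None

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Publish/subscribe: only currently subscribed clients get the
        # message; without subscribers it is lost.
        for callback in self.subscribers[topic]:
            callback(message)


broker = ToyBroker()

# Queue: receiver is "offline" while the message is sent, fetches it later.
broker.send("agent.jdoe", "annotation accepted")
fetched = broker.receive("agent.jdoe")

# Topic: a message published before anyone subscribed is never delivered.
inbox = []
broker.publish("records.unit1", "first event")
broker.subscribe("records.unit1", inbox.append)
broker.publish("records.unit1", "second event")
```

After running the sketch, the queue message has reached its receiver despite the delay, whereas the subscriber only sees the event published after it subscribed.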

In addition, JMS provides several types of messages that can be sent to queues or topics, like text, byte streams or (mapped) Java objects.

In order to use JMS, a JMS provider implementation like Apache ActiveMQ is required, which provides functionality to manage topics, queues and sessions. JMS providers may operate in stand-alone or embedded mode. In stand-alone mode, the JMS provider runs in its own process; in embedded mode, it runs in the same process (Java Virtual Machine (JVM)) as the client application, e.g. AnnoSys.

Apache ActiveMQ and Camel

Apache ActiveMQ is one of the most popular open source JMS provider implementation and fully compliant to the JMS API specification. It also includes an administration tool permitting to maintain queues, topics and messages externally. Moreover, it supports different transport mechanisms like e.g. TCP(OpenWire), SSL or VM(in-process connections for embedded broker) and integrates different platforms and programming languages like Java, C/C++ or .Net.

In combination with Apache Camel[29], more than 70 network transport protocols (e.g. email, news feeds, FTP, HTTP) can be configured to transport messages by using Apache Camel components. Apache Camel routes messages from or to destinations (e.g. queues, topics) to communication endpoints defined by URIs.

Apache Camel[29] provides a Java API supporting all of the 65 Enterprise Integration Patterns (EIPs)[31] for configuring routing and mediation rules. EIPs represent a catalogue of accepted solutions to recurring problems in enterprise applications. In particular, they address messaging, routing and transformation solutions such as content based message routing.

Integrating annotation workflows and Java Message Service (JMS)

Basically, the AnnoSys message system requires the creation of message queue instances for

  • agents
  • records
  • topic subscriptions

As the naming of JMS queues and topics is based on free text, any naming scheme could be introduced to map messages related to specimen records or annotations to individual destinations. It is common practice to use dot based separators to indicate hierarchical levels in a naming context (e.g. Java packages, domain names), so adopting a similar scheme for annotation queues is a natural choice. Thus, AnnoSys names agent and record message queues according to their GUID representations.

Agent message queues serve as point-to-point communication channels with users and will be backed by email endpoints through the corresponding Camel component. That way, agents, curators and topic subscribers will be notified by email of events they have registered for or subscribed to. Optionally, any other message transport protocol provided by Apache Camel (e.g. news feeds) may be integrated.

Messages concerning annotated records or annotations will be directed to topics named analogously to their corresponding GUIDs.

Due to this GUID based scheme, event driven topic subscriptions related to collections, records or annotations can be managed through wild-card selectors supported by ActiveMQ. That way, subscriptions like "keep me informed about events regarding unitID x in any institution and collection" can be expressed (e.g. "*.*.unitID").

Through further extensions of that scheme, e.g. by annotation types, users might subscribe to issued annotations of a specific type (e.g. in general "*.*.*.*.*.annotationType" or record related "instId.collId.1234.*.*.annotationType"). Alternatively, that feature could also be implemented through content based routing rules in Apache Camel. However, for more specific and record data oriented topics like genus, geographic locations etc., a suitable approach will have to be developed and is a subject of further research.
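The dot based naming scheme and the "*" wild-card can be sketched as follows. This is a simplified illustration and not ActiveMQ's own matching code: it only handles "*" as a placeholder for exactly one hierarchy level, and the class name DestinationNames as well as the identifiers "instA", "collB" and "Determination" are hypothetical.

```java
/** Simplified illustration of dot-separated destination names with "*" wild-cards. */
final class DestinationNames {

    /** Joins hierarchy levels (e.g. institution, collection, unit ID) with dots. */
    static String name(String... levels) {
        return String.join(".", levels);
    }

    /**
     * Returns true if the destination matches the pattern level by level.
     * A "*" in the pattern stands for exactly one arbitrary level.
     */
    static boolean matches(String pattern, String destination) {
        String[] p = pattern.split("\\.");
        String[] d = destination.split("\\.");
        if (p.length != d.length) {
            return false; // "*" never spans more than one level in this sketch
        }
        for (int i = 0; i < p.length; i++) {
            if (!p[i].equals("*") && !p[i].equals(d[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String recordTopic = name("instA", "collB", "1234");
        // Subscription to unit ID 1234 in any institution and collection.
        System.out.println(matches("*.*.1234", recordTopic));
        // A different unit ID does not match.
        System.out.println(matches("*.*.1234", name("instA", "collB", "9999")));
    }
}
```

Because each level is compared separately, the sketch assumes that the individual identifiers themselves contain no dots; GUIDs that do would need escaping or a different separator.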

Message types

AnnoSys includes the following message types:

message type                definition
ANNOTATION_ISSUED           annotation issued by an annotator
ANNOTATION_ACCEPTED         annotation accepted by collection manager
ANNOTATION_PARTLY_ACCEPTED  annotation partly accepted by collection manager
ANNOTATION_REJECTED         annotation rejected by collection manager

Messages of type ANNOTATION_ISSUED will automatically be generated and sent to the related collection manager as well as to the record's topic. All other message types have to be explicitly submitted by collection managers once they have decided whether to update the collection database record related to a given annotation. They may either reject the annotation, or accept it by completely or partially updating the corresponding data record.

While annotators can be automatically addressed through their user profile data, collection managers will have to be recognised based on their role approved by the AnnoSys authentication system. Thus, whenever a new annotation is issued, the message system requests a current list of users who are registered as curators for the given collection from the authentication system. A similar request to the authentication system ensures that only users registered as curators for a given collection may issue acceptance or rejection messages towards annotators.

For any message type, notifications will not only be sent to the curator(s) or annotator(s) respectively, but also to the topic of the annotated record. That way, those status messages are automatically propagated to related topic subscribers.
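The notification rules described above can be summarised in a small sketch. It is plain Java without JMS or Camel; the names NotificationRouting, destinations and maySubmit as well as the parameter names are hypothetical, and the curator check is reduced to a boolean that would in practice come from the authentication system.

```java
import java.util.ArrayList;
import java.util.List;

/** The four message types listed in the specification. */
enum MessageType {
    ANNOTATION_ISSUED,
    ANNOTATION_ACCEPTED,
    ANNOTATION_PARTLY_ACCEPTED,
    ANNOTATION_REJECTED
}

/** Hypothetical helper that decides where a status message is sent. */
final class NotificationRouting {

    /**
     * ANNOTATION_ISSUED goes to the curators of the collection, every other
     * type goes to the annotator. The record topic always receives a copy.
     */
    static List<String> destinations(MessageType type,
                                     String annotatorQueue,
                                     List<String> curatorQueues,
                                     String recordTopic) {
        List<String> result = new ArrayList<>();
        if (type == MessageType.ANNOTATION_ISSUED) {
            result.addAll(curatorQueues);
        } else {
            result.add(annotatorQueue);
        }
        result.add(recordTopic);
        return result;
    }

    /**
     * ANNOTATION_ISSUED is generated automatically; all other types may only
     * be submitted by a user registered as curator of the collection.
     */
    static boolean maySubmit(MessageType type, boolean senderIsCuratorOfCollection) {
        return type == MessageType.ANNOTATION_ISSUED || senderIsCuratorOfCollection;
    }

    public static void main(String[] args) {
        List<String> curators = new ArrayList<>();
        curators.add("curator-queue-1");
        System.out.println(destinations(MessageType.ANNOTATION_ISSUED,
                "annotator-queue", curators, "instA.collB.1234"));
        System.out.println(destinations(MessageType.ANNOTATION_REJECTED,
                "annotator-queue", curators, "instA.collB.1234"));
    }
}
```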

User Interface

Providing and managing annotations may become a complex job for both annotators and data curators. For instance, annotators have to gather specimen data, analyse a potentially large number of data objects and annotate or comment on specific data elements accurately. Data curators, in turn, have to analyse these annotations, accept or reject them, possibly along with individual comments, and integrate them back into their data collections. Furthermore, all of these activities should be documented and communicated back to annotators and other subscribing members of the community.

As the main focus of both annotators and curators should be on preparing accurate information, a suitable user interface should give advice and support users through all these workflows. Furthermore, it should allow a clear presentation of information, adaptation to individual user preferences and convenient guidance through the most common workflow activities. While common web interfaces do their job quite well, they mostly feel somewhat static and inflexible in comparison to modern user interfaces of desktop applications.

Fortunately, due to recent web toolkit developments based on the Asynchronous JavaScript and XML (Ajax)[32] technology, desktop-like applications can now be implemented for the web as well. Therefore, AnnoSys uses the Eclipse Rich Ajax Platform (RAP) for developing the desktop oriented user interface.

The aim of the Eclipse Rich Ajax Platform (RAP)[33] is to make the Rich Client Platform (RCP)[34] available for the web. That is, desktop applications originally developed with Eclipse RCP can be executed within a client's web browser. Currently, some parts such as individual manipulation of the drawing context and working with styled text components are not yet supported. However, all other kinds of widgets required to realise comfortable user interfaces are available.

Eclipse RAP applications run entirely in a Java EE server container. AnnoSys uses the Jetty 8 platform as recommended by the Eclipse RAP developers.

Security

The security component tasks comprise the following functionalities:

  • verified user registration including user profile information
  • secure user login
  • role based authorisation

As the validity of each user's email address is a prerequisite for the communication workflow within the system, the verification of the claimed address is the basic functionality of the user registration procedure. Moreover, the registration procedure is accompanied by appropriate measures to reduce spam bot registrations. With respect to data privacy protection, the repository stores user profile data in a location that is inaccessible to external services.

A password based, secure user login will be provided in order to reliably ascertain a user's identity. Optionally, logins initiated by federated identity management systems such as SAML or OpenID may be integrated if there is sufficient demand.

AnnoSys restricts access to certain functions based on a user's role assignment. In particular, AnnoSys defines the following roles:

role                 definition
annotator            default role enabling annotating workflow activities
administrator        role for system administrators
manager_collectionX  role for curators of a given collection X
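A role check along these lines might look as follows. This is a plain Java sketch that does not use a security framework; the class RoleCheck and its methods are hypothetical, and it assumes that the curator role name is formed by prefixing the collection identifier with "manager_".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical role check based on the role names listed above. */
final class RoleCheck {

    static final String ANNOTATOR = "annotator";
    static final String ADMINISTRATOR = "administrator";

    /** Assumption: the curator role is "manager_" followed by the collection identifier. */
    static String managerRole(String collectionId) {
        return "manager_" + collectionId;
    }

    /** Any user holding the default annotator role may issue annotations. */
    static boolean mayAnnotate(Set<String> roles) {
        return roles.contains(ANNOTATOR);
    }

    /** Only curators of the given collection may accept or reject its annotations. */
    static boolean mayDecide(Set<String> roles, String collectionId) {
        return roles.contains(managerRole(collectionId));
    }

    public static void main(String[] args) {
        Set<String> roles = new HashSet<>(Arrays.asList(ANNOTATOR, managerRole("collectionX")));
        System.out.println(mayAnnotate(roles));
        System.out.println(mayDecide(roles, "collectionX"));
        System.out.println(mayDecide(roles, "collectionY"));
    }
}
```

Whether administrators should implicitly hold curator rights is not stated here, so the sketch does not grant them any.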

The technical requirement is easy integration with other AnnoSys components like repository, message system and user interface. Since all of these components are based on the Java EE[35] platform, the solution must in particular be interoperable with Java web application containers like Jetty. Based on evaluation results, AnnoSys prefers Apache Shiro[36] over the common solution Java Authentication and Authorisation Service (JAAS)[37].

The main advantage of Apache Shiro is that it is a production quality security framework which is easy to use, understand and configure, and which supports authentication, authorisation, session management and cryptography. In particular, Apache Shiro facilitates the definition of complex authorisation rules and the handling of session management.

References

  1. TDWG, Welcome to the Globally Unique Identifiers (GUID) Wiki, http://wiki.tdwg.org/GUID (2009), accessed 22 Oct 2012
  2. Richards, Kevin; TDWG GUID Applicability Statement, 09 September 2009, http://www.tdwg.org/standards/150/download/, accessed 19 December 2011
  3. Holetschek, Jörg, ABCD - Access to Biological Collection Data, http://wiki.tdwg.org/twiki/bin/view/ABCD/ (02 March 2010), accessed 05 Dec 2011
  4. Wieczorek, John, Döring, Markus, De Giovanni, Renato, Robertson, Tim, Vieglais, Dave, Darwin Core, http://rs.tdwg.org/dwc/index.htm (2009), accessed 05 Dec 2011
  5. TDWG; LSID Authority Identifications, www.omg.org/cgi-bin/doc?dtc/04-05-01, ???, not accessible 19 December 2011
  6. Pereira, Ricardo; Richards, Kevin; Hobern, Donald; Hyam, Roger; Belbin, Lee; Blum, Stan; TDWG Life Sciences Identifiers (LSID) Applicability Statement, 03 September 2009, http://www.tdwg.org/standards/150/download/, accessed 19 December 2011
  7. Moats, R.; URN Syntax, May 1997, http://tools.ietf.org/html/rfc2141, accessed 24 October 2012
  8. Berglund, Anders, Boag, Scott, et al., XML Path Language (XPath) 2.0 (Second Edition) - W3C Recommendation 14 December 2010 (Link errors corrected 3 January 2011), http://www.w3.org/TR/xpath20/ (14 December 2010), accessed 05 Dec 2011
  9. Morris, Paul J., Proposed AOD Extensions to AO, http://wiki.tdwg.org/twiki/bin/view/AnnotationsIG/DataExtensionsToAO (06 July 2011), accessed 05 Dec 2011
  10. Morris, Paul J., TDWG Wiki AnnotationsIG, http://wiki.tdwg.org/AnnotationsIG (2011), accessed 05 Dec 2011
  11. W3C, W3C Community and Business Groups - Open Annotation Community Group, http://www.w3.org/community/openannotation/ (2012), accessed 25 Oct 2012
  12. Google, annotation-ontology, http://code.google.com/p/annotation-ontology/wiki/Homepage (2011), accessed 05 Dec 2011
  13. Sanderson, Robert; Van de Sompel, Herbert, Open Annotation: Beta Data Model Guide, http://www.openannotation.org/spec/beta/ (10 August 2011), accessed 25 Oct 2011
  14. Sanderson, Robert; Ciccarese, Paolo; Van de Sompel, Herbert, W3C Open Annotation Core Data Model - Community Draft (09 May 2012), http://www.openannotation.org/spec/core/, accessed 25 Oct 2012
  15. Sanderson, Robert; Ciccarese, Paolo; Van de Sompel, Herbert, Open Annotation Extension Specification - Community Draft (09 May 2012), http://www.openannotation.org/spec/extension/, accessed 25 Oct 2012
  16. W3C RDF Working Group, Resource Description Framework (RDF), http://www.w3.org/RDF/, 10 February 2004, accessed 13 Dec 2011
  17. Eric Prud'hommeaux and Andy Seaborn (2008). SPARQL Query Language for RDF.
  18. Bernhard Haslhofer, Elaheh Momeni, et al. (2011). Europeana RDF store report.
  19. Chris Bizer and Andreas Schultz (2011). "BSBM V3 Results (February 2011)". Retrieved 15 February 2012, from http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html.
  20. Chris Bizer and Andreas Schultz (2009). "Berlin SPARQL Benchmark Results." Retrieved 15 February 2012, from http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html.
  21. The Apache Software Foundation (2012). "Apache Jena - Welcome to Jena." Retrieved 15 February, 2012, from http://incubator.apache.org/jena/.
  22. The Apache Software Foundation (2012). "Apache Jena - TDB." Retrieved 15 February, 2012, from http://incubator.apache.org/jena/documentation/tdb/.
  23. OpenLink Software (2011). "Virtuoso Open-Source Edition." Retrieved 15 February 2012, from http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/.
  24. Pablo Mendes (2011). "The DBpedia Knowledge Base." Retrieved 15 February 2012, from http://dbpedia.org.
  25. L. Dusseault, Ed., HTTP Extensions for Web Distributed Authoring and Versioning (WebDAV) (June 2007), http://tools.ietf.org/html/rfc4918, accessed 07 Nov 2012
  26. J. Postel, Ed.; J. Reynolds, FILE TRANSFER PROTOCOL (FTP) (October 1985), http://tools.ietf.org/html/rfc959, accessed 07 Nov 2012
  27. Apache Software Foundation (2011). "ActiveMQ." Retrieved 30 January, 2012, from http://activemq.apache.org/.
  28. Oracle. "Java Message Service (JMS)." Retrieved 30 January, 2012, from http://www.oracle.com/technetwork/java/jms-136181.html.
  29. Claus Ibsen and Jonathan Anstey (2010). "Camel in Action", Manning.
  30. Rich Burridge, Mark Hapner, Rahul Sharma, Joseph Fialli, Kate Stout (2002). Java Message Service.
  31. Gregor Hohpe and Bobby Woolf (2003). "Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions", Addison-Wesley. ISBN 0-321-20068-3.
  32. Garrett, J. J. (2005, 18 February 2005). "Ajax: A New Approach to Web Applications." from http://www.adaptivepath.com/ideas/ajax-new-approach-web-applications.
  33. The Eclipse Foundation (2012). "Enabling modular business apps for desktop, browser and mobile." Retrieved 08 February, 2012, from http://www.eclipse.org/rap/.
  34. The Eclipse Foundation (2012). "Rich Client Platform." Retrieved 08 February, 2012, from http://www.eclipse.org/rcp/.
  35. Oracle; Java EE at a Glance, http://www.oracle.com/technetwork/java/javaee/index.html, 2012, accessed 04 January 2012.
  36. The Apache Software Foundation (2012). "Welcome to Apache Shiro." Retrieved 20 February, 2012, from http://shiro.apache.org/.
  37. Oracle (2011). "JavaTM Authentication and Authorization Service (JAAS) Reference Guide." Retrieved 20 February, 2012, from http://docs.oracle.com/javase/6/docs/technotes/guides/security/jaas/JAASRefGuide.html.