Tree Annotation

Synopsis: Annotate a small set of large trees used as sources of phylogenetic knowledge in an automated delivery system for tree-of-life knowledge called "Phylotastic".

Quick links

Overview: Enrico's final presentation
- Also see example annotation graph for the Peters et al. tree and alignment.
AnnotatedPhylotasticSourceTrees - report on the set of source trees
TreestoreMetadataQueryDemonstration - report on the workflow and querying
AdvancingMIAPA - report on MIAPA-relevant findings and developments
Reconciled_draft_checklist - draft checklist from TDWG 2011

Synopsis

We (1) identified a set of 10 large trees useful as phylotastic source trees, (2) created free-text annotations (metadata) for citations, sources and methods, (3) encoded the data and metadata as RDF using CDAO and a new ontology, (4) loaded the encoded information into a triplestore, and (5) demonstrated logical querying based on data and metadata attributes. During the hackathon, group members spent their time developing and revising a strategy, interpreting source materials, developing language support, encoding annotations, working out technical bugs in the workflow, and addressing emerging challenges. The tangible outcomes of this exercise include:

a set of 10 source trees (720 to 250,000 species) with low-res metadata (see AnnotatedPhylotasticSourceTrees)
a demonstration of semantic annotation, data-basing, and querying (see TreestoreMetadataQueryDemonstration)
- a workflow plan for encoding tree data and metadata and loading them into a treestore
- a treestore instance populated with some of these data and metadata
Advances in Minimum Information About a Phylogenetic Analysis (see AdvancingMIAPA)
- a new "MIAPA" ontology that leverages several existing ontologies
- recommendations on the draft checklist and input form
optionally
- a screencast
- a draft GSOC proposal

Background, Motivation, and Aims

Metadata annotations represent an essential part of the design of phylotastic systems, for two reasons. First, while we do not have a robust and detailed understanding of how users will make use of phylotastic systems, we assume that they will wish to identify trees based on sources and methods. For instance, a user may restrict a phylotastic query so as to include only trees inferred by Maximum Likelihood, or to exclude grafted trees, or to use only the tree associated with the publication by Bininda-Emonds, 2007.

Second, one of the design criteria of phylotastic systems is to provide credible results, which in the scientific world means providing a description of provenance suitable for a scientific publication. To be credible, a tree generated by a phylotastic system must include a description of how it was derived, which includes information on source trees as well as a description of any subsequent manipulations. Yet, metadata play little or no role in current phylotastic component implementations.

Some guidance may be obtained from prior art relating to databases and to metadata. Two databases for trees exist already, TreeBASE and Dryad. Dryad does not provide any explicit support for tree-specific annotations. The TreeBASE input interface accepts citation data, creates links to species identifiers, and links a matrix to a tree with an "analysis" link that may identify a particular software program. Though useful, the TreeBASE model falls far short of the recommendations for a "minimum information" standard for phylogeny metadata known as MIAPA, or "Minimum Information About a Phylogenetic Analysis". For instance, the draft MIAPA checklist from TDWG 2011 calls for an explicit indication of whether a tree is a gene or species tree, whether it is rooted, what software (and version number) was used to derive it, and so on.

The TreeAnnotation team of hackathon 2 (Enrico, Hilmar, Joachim, Arlin, Ramona and 0.5 of Andrea) decided to conduct an annotation exercise that would cover the flow of information from initial annotation of trees, to querying of treestores (not including the annotation of subsequent phylotastic manipulations such as pruning or scaling).

The motivation for this exercise relates partly to the aspirational nature of MIAPA, which was proposed many years ago but has never evolved into a clear standard supported with convenient technology. Those of us who have been involved in MIAPA-related efforts sensed a need for practical experiences in real-world uses of annotations. While the scope of Phylotastic is narrower than that of MIAPA, in the sense of covering only species trees, and mainly large ones, the challenge of supporting useful metadata queries in a treestore represents a critical test of the relevance of the MIAPA checklist and the technology for encoding and managing semantic annotations.

We also hoped to enrich current phylotastic implementations by providing metadata for a specific set of useful trees. Hackathon participants have been using a handful of trees (APGIII, Bininda-Emonds, etc) without any metadata on citations or methods.

Thus, our approach has 3 inter-connected aims:

to create a set of 10 usefully annotated source trees
to demonstrate the feasibility of metadata-based querying in a treestore
to leverage a practical annotation exercise to advance the MIAPA project

Approach

Our approach consisted of the following steps:

identify 10 useful source trees with available publications
generate free-text annotations
encode citations and annotations in computable form
load the citation, annotations, and trees into a treestore
demonstrate querying based on metadata

In particular, we chose to gather metadata corresponding to the MIAPA draft checklist, to encode it as RDF using a new ontology that imports several other ontologies, and to load the results into Ben Morris's Virtuoso-based treestore implementation. On Day 3, we decided to begin by focusing on citations, which are not in the MIAPA checklist, with the plan to carry citation data through steps 3 to 5.
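
As a rough illustration of step 3 applied to citations, the sketch below turns one citation record into N-Triples using BIBO and Dublin Core terms. It is not the team's actual encoding: the record values, the `example.org` URI, and the helper names (`literal`, `uri`, `citation_to_ntriples`) are hypothetical placeholders chosen for the example.

```python
# Sketch only: the record values and example.org URI are placeholders.
BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(text):
    """Format a plain string as an N-Triples literal with basic escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def uri(value):
    """Format an absolute IRI as an N-Triples term."""
    return f"<{value}>"


def citation_to_ntriples(record):
    """Return a list of N-Triples lines describing one citation record."""
    subject = uri(record["uri"])
    triples = [
        (subject, uri(RDF_TYPE), uri(BIBO + "AcademicArticle")),
        (subject, uri(DCTERMS + "title"), literal(record["title"])),
        (subject, uri(DCTERMS + "date"), literal(record["year"])),
    ]
    # Only emit a DOI triple when the record actually has one.
    if record.get("doi"):
        triples.append((subject, uri(BIBO + "doi"), literal(record["doi"])))
    return [f"{s} {p} {o} ." for s, p, o in triples]


example_record = {
    "uri": "http://example.org/citation/tree-01",  # hypothetical identifier
    "title": 'A "large" example tree',              # placeholder title
    "year": "2007",
    "doi": "",                                      # unknown, so omitted
}

ntriples = citation_to_ntriples(example_record)
print("\n".join(ntriples))
```

Writing one triple per line keeps the output easy to inspect and to concatenate across the 10 trees before loading it into a triplestore.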

workflow, in more detail

Identify 10 trees for use as phylotastic source trees. Most of the trees were identified pre-hackathon by Arlin. One of the trees was replaced on day 2 due to lack of metadata (the unpublished fish tree of Westneat & Lundberg).
Annotate them in free-text form. This was done on day #2 as a team effort by Ramona, Enrico, Arlin and Andrea.
- create web form in Google docs for input of annotations, based on MIAPA draft checklist from TDWG 2011 workshop
- the spreadsheet has pull-down menus, plus options for free-text entries under "other"
Transform annotations into formal-language statements in RDF. This was done on Days 3 and 4 by Ramona, Arlin, Enrico and Hilmar.
- Literature Citations
  - after some discussion, we decided to use BIBO (not dc or prism alone)
  - we spent 6 to 8 person hours trying to do this interactively in Protege before finding an automated pathway of discovery and conversion via PubMed--> EndNote --(bibtex export)--> Zotero --> bibo export (bibliontology RDF).
  - here is the File:10trees bibliontology.rdf
- Hilmar developed an annotation ontology that incorporates CDAO, OBI, PROV and other ontologies
Load trees into TreeStore. On Days 2 to 3, Joachim worked on the technology for getting our encodings into a triplestore. Part of the challenge was deciding on a URL scheme.
Execute queries to demonstrate success. On Day 3 we had success in querying for citation metadata. On Day 5 ? ? ?
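
A citation query of the kind run on Day 3 might look roughly like the sketch below, which builds (but does not send) a SPARQL-protocol GET request. The endpoint URL is a placeholder, and the query assumes citations were typed as `bibo:AcademicArticle` with `dcterms:title` and `dcterms:date`; the actual treestore address and encoding may differ.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the treestore exposes a SPARQL endpoint at this placeholder URL.
ENDPOINT = "http://localhost:8890/sparql"

# Assumption: citations are typed as bibo:AcademicArticle with Dublin Core
# title and date properties.
CITATION_QUERY = """
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?article ?title ?date
WHERE {
  ?article a bibo:AcademicArticle ;
           dcterms:title ?title ;
           dcterms:date ?date .
  FILTER(STR(?date) = "2007")
}
"""


def build_sparql_request(endpoint, query):
    """Build, without sending, a SPARQL-protocol GET request for a query."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, CITATION_QUERY)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send the query to a running endpoint; that step is left out here so the sketch does not depend on a live treestore.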

Model for semantic encoding

more annotations

miapa ontology

topology
- gene tree vs species tree: Network:Tree:'Gene tree' or SpeciesTree
- rooted: Network:Tree:RootedTree or UnrootedTree
- 'Consensus tree'
otus
- toTaxon, object property, points to taxon concept, can be URI from NCBI or other authority
- derived_from specimen
- location imported from geo
branch properties
- branch lengths:
  - data property edge length
  - object property has_Annotation edge_length
- branch support: data property has support value either bootstrap or posterior prob
character matrix
alignment method
- name of software, version
- parameters
- manual correction
tree inference method
- name of software, version: tree wasGeneratedBy (activity=) software procedure; software procedure wasAssociatedWith instance of software agent named "RAxML"
- parameters: (activity) used instance of a parameter specification (which is a kind of plan)
- character weights
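
The provenance pattern for the tree inference method (tree wasGeneratedBy activity; activity wasAssociatedWith software agent; activity used parameter specification) can be sketched with a small in-memory set of triples. The PROV property names come from the W3C PROV namespace, while the `example.org` resources, the use of `rdfs:label` for the software name, and the helper functions are assumptions made for illustration, not the team's ontology.

```python
PROV = "http://www.w3.org/ns/prov#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
EX = "http://example.org/"  # hypothetical namespace for illustration

# Each triple is (subject, predicate, object); all resources are made up.
triples = {
    (EX + "tree1", PROV + "wasGeneratedBy", EX + "inference1"),
    (EX + "inference1", PROV + "wasAssociatedWith", EX + "raxml"),
    (EX + "inference1", PROV + "used", EX + "params1"),
    (EX + "raxml", RDFS_LABEL, "RAxML"),
    (EX + "tree2", PROV + "wasGeneratedBy", EX + "inference2"),
    (EX + "inference2", PROV + "wasAssociatedWith", EX + "othertool"),
    (EX + "othertool", RDFS_LABEL, "OtherTool"),
}


def objects(store, subject, predicate):
    """Return all objects of triples matching a subject and predicate."""
    return {o for s, p, o in store if s == subject and p == predicate}


def trees_generated_with(store, software_name):
    """Find trees whose generating activity is associated with the named agent."""
    found = set()
    for tree, predicate, activity in store:
        if predicate != PROV + "wasGeneratedBy":
            continue
        for agent in objects(store, activity, PROV + "wasAssociatedWith"):
            if software_name in objects(store, agent, RDFS_LABEL):
                found.add(tree)
    return sorted(found)


raxml_trees = trees_generated_with(triples, "RAxML")
print(raxml_trees)
```

The two-hop traversal (tree to activity, activity to agent) mirrors what a metadata query such as "only trees inferred with a given program" would need to do in a triplestore.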
