Synopsis
Annotate a small set of large trees used as sources of phylogenetic knowledge in an automated delivery system for tree-of-life knowledge called "Phylotastic".
- Overview: Enrico's final presentation
- AnnotatedPhylotasticSourceTrees - report on the set of source trees
- TreestoreMetadataQueryDemonstration - report on the workflow and querying
- AdvancingMIAPA - report on MIAPA-relevant findings and developments
- Reconciled_draft_checklist - draft checklist from TDWG 2011
We (1) identified a set of 10 large trees useful as phylotastic source trees, (2) created free-text annotations (metadata) for citations, sources and methods, (3) encoded the data and metadata as RDF using CDAO and a new ontology, (4) loaded the encoded information into a triplestore, and (5) demonstrated logical querying based on data and metadata attributes. During the hackathon, group members spent their time developing and revising a strategy, interpreting source materials, developing language support, encoding annotations, working out technical bugs in the workflow, and addressing emerging challenges. The tangible outcomes of this exercise include:
- a set of 10 source trees (720 to 250,000 species) with low-res metadata (see AnnotatedPhylotasticSourceTrees)
- a demonstration of semantic annotation, data-basing, and querying (see TreestoreMetadataQueryDemonstration)
- a workflow plan for encoding tree data and metadata and loading them into a treestore
- a treestore instance populated with some of these data and metadata
- Advances in Minimum Information About a Phylogenetic Analysis (see AdvancingMIAPA)
- a new "MIAPA" ontology that leverages several existing ontologies
- recommendations on the draft checklist and input form
- a screencast
- a draft GSOC proposal
Background, Motivation, and Aims
Metadata annotations represent an essential part of the design of phylotastic systems, for two reasons. First, while we do not have a robust and detailed understanding of how users will make use of phylotastic systems, we assume that they will wish to identify trees based on sources and methods. For instance, a user may restrict a phylotastic query so as to include only trees inferred by Maximum Likelihood, or to exclude grafted trees, or to select only the tree associated with the publication by Bininda-Emonds, 2007.
Second, one of the design criteria of phylotastic systems is to provide credible results, which in the scientific world means providing a description of provenance suitable for a scientific publication. To be credible, a tree generated by a phylotastic system must include a description of how it was derived, which includes information on source trees as well as a description of any subsequent manipulations. Yet, metadata play little or no role in current phylotastic component implementations.
Some guidance may be obtained from prior art relating to databases and to metadata. Two databases for trees exist already, TreeBASE and Dryad. Dryad does not provide any explicit support for tree-specific annotations. The TreeBASE input interface allows citation data, creates links to species identifiers, and links a matrix to a tree with an "analysis" link that may implicate a particular software program. Though useful, the TreeBASE model falls far short of the recommendations for a "minimum information" standard for phylogeny metadata known as MIAPA, or "Minimum Information About a Phylogenetic Analysis". For instance, the draft MIAPA checklist from TDWG2011 calls for an explicit indication of whether a tree is a gene or species tree, whether it is rooted, what software (and version number) was used to derive it, and so on.
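To make the checklist idea concrete, here is a minimal Python sketch of a record holding only the items named above (gene vs species tree, rootedness, software name and version), with a helper that reports which items are still unannotated. The class name, field names and helper are hypothetical, chosen for illustration; they are not part of the MIAPA draft or of any existing tool.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class MiapaRecord:
    """Hypothetical record covering only the checklist items named in the text."""
    tree_kind: Optional[str] = None         # "gene" or "species"
    rooted: Optional[bool] = None           # True if the tree is rooted
    software_name: Optional[str] = None     # program used to infer the tree
    software_version: Optional[str] = None  # version of that program


def missing_items(record: MiapaRecord) -> List[str]:
    """Return the names of checklist fields that have not been filled in."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Example annotation with the software version left out.
record = MiapaRecord(tree_kind="species", rooted=True, software_name="RAxML")
print(missing_items(record))
```

A check of this kind could flag incompletely annotated trees before they are encoded, although the real checklist has more items than the four shown here.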
The TreeAnnotation team of hackathon 2 (Enrico, Hilmar, Joachim, Arlin, Ramona and 0.5 of Andrea) decided to conduct an annotation exercise that would cover the flow of information from initial annotation of trees, to querying of treestores (not including the annotation of subsequent phylotastic manipulations such as pruning or scaling).
The motivation for this exercise relates partly to the aspirational nature of MIAPA, which was proposed many years ago but has never evolved into a clear standard supported with convenient technology. Those of us who have been involved in MIAPA-related efforts sensed a need for practical experiences in real-world uses of annotations. While the scope of Phylotastic is narrower than that of MIAPA, in the sense of covering only species trees, and mainly large ones, the challenge of supporting useful metadata queries in a treestore represents a critical test of the relevance of the MIAPA checklist and the technology for encoding and managing semantic annotations.
We also hoped to enrich current phylotastic implementations by providing metadata for a specific set of useful trees. Hackathon participants have been using a handful of trees (APGIII, Bininda-Emonds, etc) without any metadata on citations or methods.
Thus, our approach has 3 inter-connected aims:
- to create a set of 10 usefully annotated source trees
- to demonstrate the feasibility of metadata-based querying in a treestore
- to leverage a practical annotation exercise to advance the MIAPA project
Our approach consisted of the following steps:
- identify 10 useful source trees with available publications
- generate free-text annotations
- encode citations and annotations in computable form
- load the citation, annotations, and trees into a treestore
- demonstrate querying based on metadata
In particular, we chose to gather metadata corresponding to the MIAPA draft checklist, to encode it as RDF using a new ontology that imports several other ontologies, and to load the results into Ben Morris's Virtuoso-based treestore implementation. On Day 3, we decided to begin by focusing on citations, which are not in the MIAPA checklist, with the plan to carry citation data through steps 3 to 5.
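The encode, load and query steps can be illustrated with a toy in-memory stand-in for a triplestore. This is only a sketch of the idea in plain Python: it does not use Virtuoso, SPARQL or the actual ontology, and every identifier and predicate name in it (`tree:A`, `ex:inferenceMethod`, `ex:isGrafted`) is a placeholder invented for the example.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Steps 3 and 4 (encode, load): metadata as subject-predicate-object triples.
# All names below are hypothetical placeholders, not terms from a real ontology.
TRIPLES: Set[Triple] = {
    ("tree:A", "ex:inferenceMethod", "maximum likelihood"),
    ("tree:A", "ex:isGrafted", "false"),
    ("tree:B", "ex:inferenceMethod", "supertree"),
    ("tree:B", "ex:isGrafted", "true"),
}


def match(store: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def subjects_with(store: Set[Triple], predicate: str, value: str) -> List[str]:
    """Return the subjects that carry the given predicate and value."""
    return [subj for subj, _, _ in match(store, p=predicate, o=value)]


# Step 5 (query): restrict to Maximum Likelihood trees, or to ungrafted trees.
ml_trees = subjects_with(TRIPLES, "ex:inferenceMethod", "maximum likelihood")
ungrafted = subjects_with(TRIPLES, "ex:isGrafted", "false")
print(ml_trees)
print(ungrafted)
```

In a real treestore the same restrictions would be expressed as queries against the loaded RDF; the pattern-matching function here only mimics that behavior on a handful of triples.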
Workflow, in more detail
- Identify 10 trees for use as phylotastic source trees. Most of the trees were identified pre-hackathon by Arlin. One of the trees was replaced on day 2 due to lack of metadata (the unpublished fish tree of Westneat & Lundberg).
- Annotate them in free-text form. This was done on Day 2 as a team effort by Ramona, Enrico, Arlin and Andrea.
- Transform annotations into formal-language statements in RDF. This was done on Days 3 and 4 by Ramona, Arlin, Enrico and Hilmar.
- Literature Citations
- after some discussion, we decided to use BIBO (not dc or prism alone)
- we spent 6 to 8 person-hours trying to do this interactively in Protege before finding an automated pathway of discovery and conversion: PubMed --> EndNote --(BibTeX export)--> Zotero --> BIBO export (bibliontology RDF).
- here is the File:10trees bibliontology.rdf
- Hilmar developed an annotation ontology that incorporates CDAO, OBI, PROV and other ontologies
- Load trees into TreeStore. On Days 2 to 3, Joachim worked on the technology for getting our encodings into a triplestore. Part of the challenge was deciding on a URL scheme.
- Execute queries to demonstrate success. On Day 3 we had success in querying for citation metadata. On Day 5 ? ? ?
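For readers unfamiliar with what a BIBO-encoded citation looks like, the sketch below serializes one citation as Turtle using the `bibo:` and `dcterms:` vocabularies. It is not the Zotero export used in the workflow above, only a hand-rolled approximation; the function names, the example URI and the placeholder title are assumptions made for illustration.

```python
from typing import Optional

# Namespaces for the Bibliographic Ontology (BIBO) and Dublin Core terms.
PREFIXES = {
    "bibo": "http://purl.org/ontology/bibo/",
    "dcterms": "http://purl.org/dc/terms/",
}


def turtle_literal(text: str) -> str:
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def citation_to_turtle(uri: str, title: str, year: str,
                       doi: Optional[str] = None) -> str:
    """Serialize one article citation as a small Turtle document."""
    lines = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in PREFIXES.items()]
    lines.append("")
    statements = [
        f"<{uri}> a bibo:AcademicArticle",
        f"    dcterms:title {turtle_literal(title)}",
        f"    dcterms:date {turtle_literal(year)}",
    ]
    if doi is not None:
        statements.append(f"    bibo:doi {turtle_literal(doi)}")
    lines.append(" ;\n".join(statements) + " .")
    return "\n".join(lines)


# Hypothetical URI and placeholder title; only the year comes from the text.
print(citation_to_turtle(
    "http://example.org/citation/1",
    'A "placeholder" title',
    "2007",
))
```

An export tool would normally also emit authors, journal and page data; those are omitted here to keep the example short.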
Model for semantic encoding
- gene tree vs species tree: Network:Tree:'Gene tree' or SpeciesTree
- rooted: Network:Tree:RootedTree or UnrootedTree
- 'Consensus tree'
- toTaxon, object property, points to taxon concept, can be URI from NCBI or other authority
- derived_from specimen
- location imported from geo
- branch properties
- branch lengths:
- data property edge length
- object property has_Annotation edge_length
- branch support: data property has support value, either bootstrap or posterior probability
- character matrix
- alignment method
- name of software, version
- manual correction
- tree inference method
- name of software, version: tree wasGeneratedBy (activity=) software procedure; software procedure wasAssociatedWith instance of software agent named "RAxML"
- parameters: (activity) used instance of a parameter specification (which is a kind of plan)
- character weights
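The provenance pattern for the tree inference method (tree wasGeneratedBy an activity, the activity wasAssociatedWith a software agent, and the activity used a parameter specification that is a kind of plan) can be sketched as a handful of triples built on the W3C PROV-O vocabulary. The `http://example.org/` base URI, the URI layout, the function names and the version string are all placeholders assumed for this example; the actual ontology developed at the hackathon may model these relations differently.

```python
from typing import List, Tuple

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical base URI, for illustration only

Triple = Tuple[str, str, str]


def provenance_triples(tree_id: str, software: str, version: str) -> List[Triple]:
    """Build the tree / activity / software agent / plan pattern as URI triples."""
    tree = f"{EX}tree/{tree_id}"
    activity = f"{EX}activity/{tree_id}-inference"
    agent = f"{EX}software/{software}-{version}"
    plan = f"{EX}parameters/{tree_id}"
    return [
        (tree, PROV + "wasGeneratedBy", activity),
        (activity, RDF_TYPE, PROV + "Activity"),
        (activity, PROV + "wasAssociatedWith", agent),
        (agent, RDF_TYPE, PROV + "SoftwareAgent"),
        (activity, PROV + "used", plan),
        (plan, RDF_TYPE, PROV + "Plan"),
    ]


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize URI-only triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# "t1" and version "0.0" are placeholders, not values from the source trees.
triples = provenance_triples("t1", "RAxML", "0.0")
print(to_ntriples(triples))
```

Keeping every term a URI avoids literal escaping in this sketch; a fuller encoding would attach the software name and version as literals on the agent.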