Tree Annotation

From Evolutionary Interoperability and Outreach

Synopsis: Annotate a small set of large trees used as sources of phylogenetic knowledge in "Phylotastic", an automated delivery system for tree-of-life knowledge.

Overview

The current Phylotastic system (which is at a very early stage of development) delivers neither (1) metadata for the original source trees nor (2) any description of further manipulations applied to them. Ultimately, Phylotastic won't be useful for research without this kind of information. Documenting sources and methods is not the only reason to have this annotation: the (hypothetical) design of Phylotastic calls for ways to identify source trees based on their metadata (e.g., a user might want to select a particular source tree, either directly or by specifying search criteria).

An initial attempt at prioritizing

  1. support the most common user criteria for searching a treestore to get the right tree (whatever they are)
    • query on OTU identifiers for tips (most common), internal identifiers (e.g., taxonomy markups), sources or types of data, method?
  2. support adequate annotation of the provenance of the resulting Phylotastic tree
    • imagine that you have to publish an analysis using a phylogenetic tree: how do you describe it?
  3. support the proper assignment of credit (blame) for tree-producers and Phylotastic service-providers
  4. support licensing that protects creators, resource-providers and end-users


Questions & possible approaches

What are sensible rules for treating annotations of trees subject to manipulation? For instance, a bootstrap value properly describes a split (bipartition) of an unrooted tree, but people think of it as being associated with a node on a rooted tree. If we have pruned several groups on one side of the split, the split no longer has the same meaning. It seems to me that, under some conditions, the original value becomes an underestimate of the true support value.
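The pruning concern can be made concrete with a small sketch. It assumes a split is stored as a pair of tip sets and that the bootstrap replicates themselves are available; the six taxa and four replicate trees are invented for illustration, and the function names are hypothetical.

```python
# Toy illustration: how bootstrap support for a split changes after pruning.
# Taxa and replicate trees are invented for this example.

TAXA = frozenset("ABCDEF")

def make_split(side):
    """Represent a split of an unrooted tree as an unordered pair of tip sets."""
    side = frozenset(side)
    return frozenset({side, TAXA - side})

def restrict(split, kept):
    """Induce a split on the kept tips; return None if it becomes trivial."""
    kept = frozenset(kept)
    sides = [s & kept for s in split]
    if any(len(s) < 2 for s in sides):
        return None  # one side has fewer than two tips: no information left
    return frozenset(sides)

def prune(replicate, kept):
    """Induce every split of one replicate tree on the kept tips."""
    induced = {restrict(s, kept) for s in replicate}
    induced.discard(None)
    return induced

def support(replicates, split):
    """Fraction of replicates that contain the split."""
    return sum(split in rep for rep in replicates) / len(replicates)

# Nontrivial splits of three hypothetical replicate topologies.
t1 = {make_split("AB"), make_split("ABC"), make_split("EF")}
t2 = {make_split("AC"), make_split("ABC"), make_split("EF")}
t3 = {make_split("AD"), make_split("ABD"), make_split("EF")}
replicates = [t1, t1, t2, t3]

ab = make_split("AB")                 # AB | CDEF on the full tree
kept = TAXA - {"C"}                   # prune tip C
pruned_replicates = [prune(rep, kept) for rep in replicates]
ab_pruned = restrict(ab, kept)        # AB | DEF on the pruned tree

original_support = support(replicates, ab)
pruned_support = support(pruned_replicates, ab_pruned)
print(original_support, pruned_support)
```

In this toy case the value carried over from the full tree is lower than the value recomputed on the pruned replicates, because replicate `t2` contains ABC|DEF, which also induces AB|DEF once C is removed. Any replicate that contains the original split still contains the induced one, so the carried-over value can only be too low, never too high, as long as the induced split is nontrivial.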

Probably we want to develop some concrete test-cases. These could be real trees, stripped-down versions of real trees, or imaginary (but realistic) trees, but the important thing is that we have multiple cases of trees for which there are concrete instances of metadata. Some of these will be tree-level annotations (apply to whole tree) and some will be OTU- or node- or branch-associated annotations (in principle).

We could create free-text versions of annotations, based on the most important criteria from the MIAPA checklist. We could carry out a thought-experiment of asking what we need to represent in order to process queries and create annotations for modified trees.

The next step would be to try to encode some of this stuff more formally. For instance, we could use NeXML files.

If we have NeXML plus ontology plus translation to CDAO RDF (all stuff that has been used at previous hackathons), then we can feed the test files into a triple store and try to execute some queries using SPARQL.
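Before setting up a real triple store, the shape of such a query can be tried on plain tuples. The sketch below is a toy stand-in, not SPARQL and not CDAO: the `example.org` URIs, the predicate names and the annotation values are all hypothetical.

```python
# Toy stand-in for a triple store: annotations as (subject, predicate, object).
# All URIs and values below are hypothetical placeholders, not CDAO terms.

EX = "http://example.org/"
DATA_TYPE = EX + "term/dataType"
METHOD = EX + "term/method"
TREE1, TREE2, TREE3 = (EX + "tree/%d" % i for i in (1, 2, 3))

triples = {
    (TREE1, DATA_TYPE, "molecular"),
    (TREE1, METHOD, "maximum likelihood"),
    (TREE2, DATA_TYPE, "morphological"),
    (TREE2, METHOD, "parsimony"),
    (TREE3, DATA_TYPE, "molecular"),
    (TREE3, METHOD, "parsimony"),
}

def subjects(store, predicate, obj):
    """Return all subjects that have the given predicate/object pair."""
    return {s for (s, p, o) in store if p == predicate and o == obj}

# Roughly what a SPARQL query of this shape would ask for
# (hypothetical `ex:` prefix):
#   SELECT ?tree WHERE { ?tree ex:dataType "molecular" ;
#                              ex:method   "maximum likelihood" . }
molecular_ml = (subjects(triples, DATA_TYPE, "molecular")
                & subjects(triples, METHOD, "maximum likelihood"))
print(sorted(molecular_ml))
```

Intersecting the two subject sets plays the role of the join that a SPARQL engine performs when two patterns share the `?tree` variable.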

Resources

MIAPA

Annotation

Possible deliverables

  • a set of >10 trees with a succinct version of minimal information
    • a free-text version
    • an encoded version (NeXML, NEXUS, PhyloXML)
  • sample queries based on metadata
    • a set of queries that a TreeStore should be able to process (cf. Nakhleh et al., 2003)
    • a set of tests based on input trees (e.g., find_molecular_trees( TestTreeSet ) ==> return the correct list of trees annotated as being based on molecular data)
  • formal language support for this annotation
    • a list of terms with free-text definitions
    • a reference list of relevant ontologies, e.g., OBI, CDAO
    • an ontology or extension to existing ontologies
  • a token TreeStore implementation that satisfies tests
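As a starting point for the last two deliverables, a token TreeStore and one metadata test might look like the sketch below. Only the name `find_molecular_trees` comes from the list above; the class names, the `data_type` annotation key and the three test records are assumptions made for illustration.

```python
# Minimal in-memory TreeStore sketch with one metadata-based test query.
# Class names, annotation keys and records are hypothetical.
from dataclasses import dataclass, field

@dataclass
class TreeRecord:
    tree_id: str
    tips: frozenset                      # OTU identifiers at the tips
    annotations: dict = field(default_factory=dict)

class TokenTreeStore:
    def __init__(self):
        self._records = {}

    def add(self, record):
        self._records[record.tree_id] = record

    def find(self, tips=(), **criteria):
        """Return sorted IDs of trees containing all tips and matching all criteria."""
        wanted = frozenset(tips)
        hits = []
        for rec in self._records.values():
            if not wanted <= rec.tips:
                continue  # tree lacks at least one requested tip
            if all(rec.annotations.get(k) == v for k, v in criteria.items()):
                hits.append(rec.tree_id)
        return sorted(hits)

def find_molecular_trees(store):
    """Trees annotated as being based on molecular data."""
    return store.find(data_type="molecular")

test_tree_set = TokenTreeStore()
test_tree_set.add(TreeRecord("T1", frozenset({"Homo sapiens", "Pan troglodytes"}),
                             {"data_type": "molecular"}))
test_tree_set.add(TreeRecord("T2", frozenset({"Homo sapiens", "Gorilla gorilla"}),
                             {"data_type": "morphological"}))
test_tree_set.add(TreeRecord("T3", frozenset({"Mus musculus"}),
                             {"data_type": "molecular"}))

print(find_molecular_trees(test_tree_set))
```

The same `find` method also covers the tip-based query from the priority list, for example `test_tree_set.find(tips=["Homo sapiens"], data_type="molecular")`.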


Getting started

List of trees with description and links to sources & methods

Hackathon workflow

  • Create a web form in Google Docs for input of annotations, based on the MIAPA draft checklist from the TDWG 2011 workshop
    • Spreadsheet has pull-down menus, plus options for free-text entries under "other"
  • Hackers read papers or other documentation to fill in the spreadsheet
  • Load trees into TreeStore
    • Will need to have trees in the correct format
  • Encode spreadsheet/free-text entries, using Protege
    • Encoding process is iterative with ontology editing
    • Don't need to load whole tree into Protege
      • Get URI for tree from TreeStore, add annotations to that URI in Protege
  • Export triples from Protege for import into TreeStore
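The last two steps (annotate a tree URI, then hand triples to the TreeStore) can also be prototyped without Protege. The sketch below writes literal-valued annotations for one tree URI as N-Triples lines; the tree URI, the predicate URIs and the helper names are hypothetical, and it does not reproduce what Protege itself exports.

```python
# Sketch: serialize annotations attached to a tree URI as N-Triples.
# The URIs and helper names are hypothetical placeholders.

def ntriples_literal(value):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n")
                    .replace("\r", "\\r"))
    return '"' + escaped + '"'

def to_ntriples(tree_uri, annotations):
    """Return one N-Triples line per (predicate URI, literal value) pair."""
    lines = []
    for predicate_uri, value in sorted(annotations.items()):
        lines.append("<%s> <%s> %s ." % (tree_uri, predicate_uri,
                                         ntriples_literal(value)))
    return "\n".join(lines) + "\n"

tree_uri = "http://example.org/treestore/tree/42"
annotations = {
    "http://example.org/terms/dataType": "molecular",
    "http://example.org/terms/note": 'curator said "check rooting"',
}
output = to_ntriples(tree_uri, annotations)
print(output, end="")
```

Because every line names the tree only by its URI, the tree itself never has to be loaded into the annotation tool, which matches the workflow above.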


A note from Brian O: btw, I made an R package (phyloorchard) to hold large trees; it has a few in there now, and I'll add the ones above. People should feel free to request to be added to that project if they want to do more with it. --BrianOMeara 15:56, 25 April 2012 (EDT)