Tree Annotation: Difference between revisions

From Evolutionary Interoperability and Outreach
* [[MIAPA]], primarily:   
** [http://wiki.tdwg.org/twiki/bin/view/Phylogenetics/MIAPADraft#Reconciled_draft_checklist MIAPA draft checklist] from TDWG 2011 workshop
** [http://www.slideshare.net/ElliottHauser/phylogenetics-data-provenance-survey-results Slideshow] with some results from the [http://wiki.tdwg.org/twiki/bin/view/Phylogenetics/MIAPADraft#Community_survey MIAPA community survey] orchestrated by the Open Tree of Life project in fall 2012.
* ontologies  
** [http://bioportal.bioontology.org/ontologies/49276?p=terms phylont] from [http://arnetminer.org/publication/phylont-a-domain-specific-ontology-for-phylogeny-analysis-3601839.html;jsessionid=E93A88FA38E826E1113A957BBA5BB352.tt Maryam Panahiazar, et al]

Revision as of 23:00, 28 January 2013

Synopsis: Annotate a small set of large trees used as sources of phylogenetic knowledge in an automated delivery system for tree-of-life knowledge called "Phylotastic".

Overview

The current Phylotastic system (which is at a very early stage of development) fails to deliver (1) metadata for the original source trees and (2) any description of further manipulations. Ultimately, Phylotastic won't be useful for research without this kind of information. Documenting sources and methods is not the only reason to have this annotation: the (hypothetical) design of Phylotastic calls for ways to identify source trees based on their metadata (e.g., a user might want to select a particular source tree, either directly or via satisfaction of search criteria).

An initial attempt at prioritizing

  1. support the most common user criteria for searching a treestore to get the right tree (whatever they are)
    • query on OTU identifiers for tips (most common), internal identifiers (e.g., taxonomy markups), sources or types of data, method?
  2. support adequate annotation of the provenance of the resulting Phylotastic tree
    • imagine that you have to publish an analysis using a phylogenetic tree: how do you describe it?
  3. support the proper assignment of credit (blame) for tree-producers and Phylotastic service-providers
  4. support licensing that protects creators, resource-providers and end-users


Questions & possible approaches

What are sensible rules for treating annotations of trees subject to manipulation? For instance, a bootstrap value typically refers to a split on an unrooted tree, but people think of it as being associated with a node on a rooted tree. If we have pruned several groups on one side of the split, the split doesn't have the same meaning anymore. It seems to me that, under some conditions, the original value becomes an underestimate of the true support value for the pruned split.
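To make the pruning problem concrete, here is a minimal sketch in Python. It assumes that a split is stored as a pair of tip-label sets and that support values are kept in a plain dictionary; the function names restrict_split and prune_supports and the toy support values are made up for illustration and are not part of Phylotastic.

```python
from collections import defaultdict


def restrict_split(split, kept_tips):
    """Restrict a split (pair of tip-label sets) to the tips that survive pruning.

    Returns None when the restricted split is trivial, i.e. fewer than
    two tips remain on either side.
    """
    side_a, side_b = split
    a = frozenset(side_a) & kept_tips
    b = frozenset(side_b) & kept_tips
    if len(a) < 2 or len(b) < 2:
        return None
    return frozenset((a, b))


def prune_supports(supports, kept_tips):
    """Map each surviving split to the list of original support values that land on it."""
    kept = frozenset(kept_tips)
    pruned = defaultdict(list)
    for split, value in supports.items():
        restricted = restrict_split(split, kept)
        if restricted is not None:
            pruned[restricted].append(value)
    return dict(pruned)


# Toy tree on tips A-F with two internal splits (support values are invented).
supports = {
    (frozenset("AB"), frozenset("CDEF")): 60,
    (frozenset("ABC"), frozenset("DEF")): 95,
}

# Prune tip C: both original splits collapse onto AB | DEF.
pruned = prune_supports(supports, "ABDEF")
for split, values in pruned.items():
    print(sorted(sorted(side) for side in split), values)
```

After pruning C, two different original values (60 and 95) compete for the same split. Any bootstrap replicate containing either original split also contains the pruned split, so each original value is only a lower bound on the support of the pruned split. An annotation rule therefore has to decide whether to drop the value, keep the maximum, or flag it as a lower bound.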

Probably we want to develop some concrete test-cases. These could be real trees, stripped-down versions of real trees, or imaginary (but realistic) trees, but the important thing is that we have multiple cases of trees for which there are concrete instances of metadata. Some of these will be tree-level annotations (apply to whole tree) and some will be OTU- or node- or branch-associated annotations (in principle).

We could create free-text versions of annotations, based on the most important criteria from the MIAPA checklist. We could carry out a thought-experiment of asking what we need to represent in order to process queries and create annotations for modified trees.

The next step would be to try to encode some of this stuff more formally. For instance, we could use NeXML files.

If we have NeXML plus ontology plus translation to CDAO RDF (all stuff that has been used at previous hackathons), then we can feed the test files into a triple store and try to execute some queries using SPARQL.
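As a rough illustration of the kind of query this would enable, the sketch below uses a pure-Python stand-in for a triple store. It is not a real SPARQL engine, and the "ex:" predicates (ex:dataType, ex:hasTip) and tree identifiers are hypothetical placeholders, not actual CDAO or MIAPA terms. In a real setup the same pattern would be written in SPARQL and run against the store.

```python
def match(triples, pattern):
    """Yield variable bindings for one (s, p, o) pattern; terms starting with '?' are variables."""
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


def query(triples, patterns):
    """Join several triple patterns, roughly like a basic SPARQL graph pattern."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Substitute variables that earlier patterns have already bound.
            bound = tuple(solution.get(term, term) for term in pattern)
            for binding in match(triples, bound):
                next_solutions.append({**solution, **binding})
        solutions = next_solutions
    return solutions


# Hypothetical tree-level and tip-level annotations, flattened into triples.
triples = [
    ("ex:tree1", "ex:dataType", "molecular"),
    ("ex:tree1", "ex:hasTip", "ex:Homo_sapiens"),
    ("ex:tree2", "ex:dataType", "morphological"),
    ("ex:tree2", "ex:hasTip", "ex:Homo_sapiens"),
    ("ex:tree3", "ex:dataType", "molecular"),
    ("ex:tree3", "ex:hasTip", "ex:Mus_musculus"),
]

# Roughly: SELECT ?tree WHERE { ?tree ex:dataType "molecular" . ?tree ex:hasTip ex:Homo_sapiens }
results = query(triples, [
    ("?tree", "ex:dataType", "molecular"),
    ("?tree", "ex:hasTip", "ex:Homo_sapiens"),
])
molecular_human_trees = [r["?tree"] for r in results]
print(molecular_human_trees)
```

The point of the exercise is less the engine than the vocabulary: writing even toy queries like this forces a decision about which predicates the annotations need.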

Resources

Possible deliverables

  • a set of >10 trees with a succinct version of minimal information
    • a free-text version
    • an encoded version (NeXML, NEXUS, PhyloXML)
  • sample queries based on metadata
    • a set of queries that a TreeStore should be able to process (cf. Nakhleh et al., 2003)
    • a set of tests based on input trees (e.g., find_molecular_trees( TestTreeSet ) ==> return the correct list of trees annotated as being based on molecular data)
  • formal language support for this annotation
    • a list of terms with free-text definitions
    • a reference list of relevant ontologies, e.g., OBI, CDAO
    • an ontology or extension to existing ontologies
  • a token TreeStore implementation that satisfies tests
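A token TreeStore and the find_molecular_trees test mentioned above could be sketched as follows. This is only an assumption about what such a thing might look like: the AnnotatedTree class, the metadata keys (data_type, method), and the three toy trees are invented for illustration and do not come from any existing implementation.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedTree:
    """A tree plus tree-level metadata (all field names are hypothetical)."""
    tree_id: str
    newick: str
    tips: frozenset
    metadata: dict = field(default_factory=dict)


class TreeStore:
    """In-memory store that answers simple metadata and tip-label queries."""

    def __init__(self, trees=()):
        self._trees = {t.tree_id: t for t in trees}

    def add(self, tree):
        self._trees[tree.tree_id] = tree

    def find_by_metadata(self, **criteria):
        """Return sorted ids of trees whose metadata matches every criterion exactly."""
        return sorted(
            t.tree_id
            for t in self._trees.values()
            if all(t.metadata.get(key) == value for key, value in criteria.items())
        )

    def find_by_tips(self, tip_labels):
        """Return sorted ids of trees that contain all the requested tip labels."""
        wanted = frozenset(tip_labels)
        return sorted(t.tree_id for t in self._trees.values() if wanted <= t.tips)


def find_molecular_trees(store):
    """Trees annotated as being based on molecular data."""
    return store.find_by_metadata(data_type="molecular")


# Imaginary test set; real test cases would use the annotated source trees.
test_tree_set = TreeStore([
    AnnotatedTree("toy1", "((A,B),(C,D));", frozenset("ABCD"),
                  {"data_type": "molecular", "method": "ML"}),
    AnnotatedTree("toy2", "((A,C),(B,D));", frozenset("ABCD"),
                  {"data_type": "morphological"}),
    AnnotatedTree("toy3", "((A,B),E);", frozenset("ABE"),
                  {"data_type": "molecular"}),
])

assert find_molecular_trees(test_tree_set) == ["toy1", "toy3"]
print(test_tree_set.find_by_tips(["A", "E"]))
```

Keeping the tests independent of the storage back end means the same assertions could later be pointed at a triple store or a database-backed TreeStore.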


Getting started

List of trees with description and links to sources & methods

A note from Brian O: btw, I made an R package (phyloorchard) to hold large trees; it has a few in there now, and I'll add the ones above. People should feel free to request to be added to that project if you want to do more with it. --BrianOMeara 15:56, 25 April 2012 (EDT)