Tree Annotation

Synopsis: Annotate a small set of large trees used as sources of phylogenetic knowledge in "Phylotastic", an automated delivery system for tree-of-life knowledge.

Overview

The current Phylotastic system (which is at a very early stage of development) fails to deliver (1) metadata for the original source trees and (2) any description of further manipulations. Ultimately, Phylotastic won't be useful for research without this kind of information. Documenting sources and methods is not the only reason to have this annotation: the (hypothetical) design of Phylotastic also calls for ways to identify source trees based on their metadata (e.g., a user might want to select a particular source tree, either directly or by specifying search criteria that it satisfies).

An initial attempt at prioritizing

support the most common user criteria for searching a treestore to get the right tree (whatever they are)
- return tree with maximal coverage of list ::= { species }
  - specify namespace of <list>
- limits on source trees
  - restrict by publication status (published or not)
  - restrict by publication year (e.g., no trees older than 5 years)
  - restrict by author (e.g., author = bininda-emonds)
  - restrict by other citation information (e.g., jrnl = index fungorum)
  - restrict by method
    - use (exclude) trees made with method class = { parsimony, likelihood, Bayesian, distance, supertree, supermatrix, hand-crafted. . . }
    - use (exclude) trees made with evolutionary model = { }
    - use (exclude) trees made with software = { RAxML, BEAST, PAUP*, . . . }
    - other restrictions (e.g., supertree, . . . )
  - restrict by availability of source data such as character matrix
    - restrict by minimum number of characters in matrix
  - restrict by type of source data = { molecular, morphological, mixed }
  - restrict by annotation level = { platinum, gold, silver, bronze, polystyrene }
  - require feature
    - require support values
    - require rooting
    - require branch lengths
    - require other features
- limits on phylotastic manipulations
  - disallow species substitution
  - disallow grafting of source trees
  - scaling: provide median age estimates
  - scaling: provide only lower age estimates
  - TNRS: prohibit fuzzy matching
  - TNRS: use matches from source = { NCBI, ITIS, etc. }
support adequate annotation of the provenance of the resulting phylotastic tree
- imagine that you have to publish an analysis using a phylogenetic tree: how do you describe it? "The phylogeny of 40 mammal species was obtained via phylotastic (phylotastic.org
- citation information
support the proper assignment of credit (blame) to tree-producers and phylotastic service-providers
support licensing that protects creators, resource-providers and end-users
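
The search criteria listed above can be sketched as a filter over annotated source trees followed by a choice of the tree with maximal coverage of the query list. This is only a minimal illustration: the `SourceTree` record, its field names, and the `select_tree` function are hypothetical and do not reflect an actual Phylotastic or MIAPA schema.

```python
from dataclasses import dataclass


@dataclass
class SourceTree:
    # Hypothetical metadata record; field names are illustrative only.
    name: str
    species: frozenset
    year: int
    method: str
    has_branch_lengths: bool = False


def select_tree(trees, query, min_year=None, exclude_methods=(),
                require_branch_lengths=False):
    """Return the eligible tree covering the most query species, or None."""
    wanted = set(query)
    best, best_coverage = None, 0
    for tree in trees:
        # Limits on source trees: publication year, method class, features.
        if min_year is not None and tree.year < min_year:
            continue
        if tree.method in exclude_methods:
            continue
        if require_branch_lengths and not tree.has_branch_lengths:
            continue
        coverage = len(wanted & tree.species)
        if coverage > best_coverage:
            best, best_coverage = tree, coverage
    return best


# Toy data, not real annotations.
trees = [
    SourceTree("tree_A", frozenset({"sp1", "sp2", "sp3"}), 2007,
               "supertree", has_branch_lengths=True),
    SourceTree("tree_B", frozenset({"sp1", "sp2", "sp3", "sp4"}), 2012,
               "hand-crafted"),
]
query = ["sp1", "sp3", "sp4"]
print(select_tree(trees, query).name)
print(select_tree(trees, query, require_branch_lengths=True).name)
```

In this sketch each restriction only removes candidates, so adding a criterion can change which tree wins but never raises coverage. A real treestore would presumably express the same constraints as queries over the stored annotations.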

Resources

MIAPA

The main MIAPA page. Primarily:
- MIAPA draft checklist from TDWG 2011 workshop.
- Emily McTavish's Tree Annotation Vocabulary as resulting from the Phylotastic I hackathon. Includes a mapping to MIAPA draft checklist attributes.
- Slideshow with some results from the MIAPA community survey orchestrated by the Open Tree of Life project in fall 2012.
Ontologies
- phylont from Maryam Panahiazar, et al
- CDAO
Development:
- MIAPA repo on Github
- Template development after the TNRS ontology developed at (and after) Phylotastic I.

Annotation

representation of metadata: NeXML
Rutger's work on mapping the metadata from the ToLWeb XML format onto semantic annotations in NeXML:
- ToLWeb XML described here: http://tolweb.org/tree/home.pages/downloadtree.html
- a simple script that does the conversion: https://github.com/ncbnaturalis/bio-phylo/blob/master/experimental/tolconvert.pl
- Here's an example input file: https://raw.github.com/ncbnaturalis/bio-phylo/master/experimental/tol.xml
- Here's the resulting output file (indented): https://raw.github.com/ncbnaturalis/bio-phylo/master/experimental/tol-nexml-pp.xml

Getting started

Source trees targeted for annotation

Annotated

Supertree of mammals from Bininda-Emonds, et al 2007
- 4510 species
- file NEXUS format, species-level, includes branch lengths (File:Bininda-emonds 2007 mammals.nex)
- we are using the "mammalST_bestDates" tree out of 3 in the NEXUS file
- link to supplementary data including description of phylogeny methods: http://www.nature.com/nature/journal/v446/n7135/suppinfo/nature05634.html
- URI: http://phylotastic.org/data/Bininda-emonds_2007_mammals.nex

Angiosperm phylogeny group (APG) tree of APGIII
- free full text version
- file File:Phylomatictree.nex
- Nodes with IDs: 1,827
- max ID Length: 34 (harrimanelloideae_to_vaccinioideae)
- Manually curated, right?
- URI: http://phylotastic.org/data/APGIII_Phylomatic_tree.nex

Peters, et al hymenoptera tree (1100 species)
- File:Tree 2 Peters et al.tre has properly formatted bootstraps like "(<node contents>):<length>[<support>]"
- File:Tree 1 Peters et al.tre has improperly formatted bootstraps like "(<node contents>)<support>:<length>"
- URI: http://phylotastic.org/data/Peters_etal_Hymenoptera.nwk
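
The difference between the two bootstrap layouts described above can be illustrated with a small rewrite from "(<node contents>)<support>:<length>" to "(<node contents>):<length>[<support>]". This is a sketch, not the procedure actually used for the Peters et al. files. The function name is hypothetical, and it assumes that every purely numeric label directly after a closing parenthesis is a support value and that labels are unquoted.

```python
import re

# ")<support>:<length>" where both parts are plain numbers
# (the length may also use exponent notation).
_SUPPORT_BEFORE_LENGTH = re.compile(
    r"\)(\d+(?:\.\d+)?):(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def move_support_to_comment(newick: str) -> str:
    """Rewrite ')<support>:<length>' as '):<length>[<support>]'."""
    return _SUPPORT_BEFORE_LENGTH.sub(r"):\2[\1]", newick)


print(move_support_to_comment("((A:0.1,B:0.2)95:0.05,C:0.3);"))
```

Internal nodes that carry a genuine numeric name would be misread as support values by this pattern, so the output should be checked against the publication before use.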

Tree of Life Web Project Structure, zipped version of proprietary XML format, spans all of life, family level and above (File:TOL.xml.zip)
- reference is 2007 paper by Maddison, Schulz and Maddison
- 16K tips
- needs conversion to Newick or NEXUS format
- (note: this tree structure in TOL.xml.zip is *old*, it is from October 22, 2006)
- URI: http://tolweb.org/tree/

Angiosperm phylogeny from Smith et al. 2011 [1]
- data file at Dryad
- file Newick format, species-level (File:Smith 2011 angiosperms.txt)
- Nodes with IDs: 55,473
- max ID Length: 63 (Aesculus_glabra_var__arguta_x_Aesculus_sylvatica_var__pubescens)
- URI: http://dx.doi.org/10.5061/dryad.8790/1
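
Figures such as "Nodes with IDs" and "max ID Length" can be recomputed from a Newick string along the following lines. This is a minimal sketch that assumes "nodes with IDs" means nodes carrying a label, that labels are unquoted, and that bracketed comments can be discarded. The function name `label_stats` is hypothetical.

```python
import re


def label_stats(newick: str):
    """Return (number of labelled nodes, longest label length, longest label)."""
    # Drop bracketed comments such as [95] before scanning for labels.
    text = re.sub(r"\[[^\]]*\]", "", newick)
    # A label directly follows "(", "," or ")"; branch lengths follow ":".
    labels = re.findall(r"(?<=[(,)])[^(),:;\s]+", text)
    longest = max(labels, key=len, default="")
    return len(labels), len(longest), longest


print(label_stats("((A:0.1,B:0.2)clade_AB:0.05[95],C:0.3)root;"))
```

Numeric support values written in the label position would be counted as IDs here, which is one reason counts obtained this way may differ between tools.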

Tree of 720 taxa from The Genomic Encyclopedia of Bacteria and Archaea (GEBA)
- file: Nexus, includes branch lengths (File:GEBAtree.nex)
  - NEXML and Phyml trees available from TreeBASE
- link to website; includes link to publication
- link to TreeBASE page
- URI: http://purl.org/phylo/treebase/phylows/tree/TB2:Tr25470

Avian phylogeny of Jetz, et al. 2012
- 9,993 birds
- http://www.nature.com/nature/journal/v491/n7424/full/nature11631.html#/supplementary-information
  - methods are described in the supplementary info PDF
  - the trees are provided (NEXUS format) in the supplementary data package "MCC_trees.zip" in the supplementary data files above
    - we took one arbitrary tree (File:One arbitrarily chosen jetz tree.tre) from birdtree.org
    - this is the first tree from the file EricsonStage2_9001_10000.zip
- URI: http://phylotastic.org/data/Jetz_etal_2012_one_birdtree.nwk

All-species living tree of life based on SSU rRNA
- publication: http://eigr.grupoei.com/i/i8031/publicaciones/80-LIVING_TREE.pdf
- web site with tree: http://www.arb-silva.de/projects/living-tree/
- you can get the alignment and the Newick tree from this site
  - the Newick file has a metadata section
- most recent tree: File:LTPs108 SSU tree.txt (Newick format)
- URI: http://www.arb-silva.de/fileadmin/silva_databases/living_tree/LTP_release_108/LTPs108_SSU_tree.newick

Tree of all Eukaryotes in Genbank from Goloboff et al. 2009
- free full text with methods
- 73,060 terminal taxa analyzed with parsimony in TNT
- file: zipped version of TNT-formatted treefiles and diagrams (File:Goloboff Trees.zip)
- here is the attempt to convert this to Newick (needs testing): File:Goloboff molecules only shortest.nwk.txt
- the github repository has scripts used to convert from TNT
- URI: http://phylotastic.org/data/Goloboff_etal_2009_molecules_only_shortest.nwk

NCBI taxonomy tree (http://www.ncbi.nlm.nih.gov/guide/taxonomy/)
- 250,000 species
- available as an SQL dump from ftp://ftp.ncbi.nih.gov/pub/taxonomy/ (see the README file)
- the converted tree in Newick format is available compressed from http://itol.embl.de/other_trees.shtml, as follows
  - complete tree, using scientific names, internal nodes with only one child are removed.
  - the uncompressed Newick tree is also uploaded here as File:Ncbi complete collapsed with names.tre
  - Note: This tree has every taxon in NCBI, including ones like "Insertion_sequence_IS2" and "Plasmid_pHV2" (though in this version not those that have only a single child). For Phylotastic, we would probably want to generate a tree that is some subset of this.
- manually curated
- FYI, NCBI provides an interactive way to get a tree phylotastically (http://www.ncbi.nlm.nih.gov/guide/howto/gen-com-tree/)
- URI: http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/normalized
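
The removal of internal nodes with only one child, mentioned above for the converted NCBI tree, can be sketched as follows. The nested `(name, children)` tuple representation and the function name are hypothetical, and keeping the descendant's name when a chain is collapsed is an assumption, not a description of how the iTOL file was produced.

```python
def collapse_single_child(node):
    """Remove internal nodes that have exactly one child.

    A node is a (name, children) pair; a tip has an empty child list.
    """
    name, children = node
    children = [collapse_single_child(child) for child in children]
    if len(children) == 1:
        # Replace this node by its only child (the child's name is kept).
        return children[0]
    return (name, children)


# Toy hierarchy, not real NCBI content.
tree = ("root", [
    ("Eukaryota", [
        ("Metazoa", [("Homo_sapiens", []), ("Mus_musculus", [])]),
    ]),
    ("Bacteria", [("Escherichia_coli", [])]),
])
print(collapse_single_child(tree))
```

The recursion is fine for a toy hierarchy, but a tree with every NCBI taxon may be deep enough that an iterative traversal is safer.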

Not annotated yet

Megaphylogeny of 800+ living and fossil families of fishes, Westneat and Lundberg unpublished
- file: NEXUS format including Mesquite extensions, family level with other higher-taxa labelled (File:Westneat Lundberg BigFishTree.nex)
- not yet annotated, because it is hard to get the metadata from just the nexus file.
- URI: http://phylotastic.org/data/Westneat_Lundberg_BigFishTree.nex

Hackathon plan

develop plan (day 1)
- revise as needed
- some work is done in parallel

main workflow

identify 10 trees for use as phylotastic source trees
annotate them in free-text form
- create web form in Google docs for input of annotations, based on MIAPA draft checklist from TDWG 2011 workshop
- Spreadsheet has pull-down menus, plus options for free-text entries under "other"
transform annotations into formal-language statements in RDF
- encoding process is iterative with ontology editing
- Hilmar is working on language support
- Joachim is working on the technology for getting this into a triplestore
- Get URI for tree from TreeStore, add annotations to that URI in Protege
Load trees into TreeStore
- Will need to have trees in the correct format
execute queries to demonstrate success
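
The step of turning free-text annotations into RDF statements attached to a tree URI might look roughly like the sketch below, which writes plain-literal N-Triples lines. The vocabulary namespace `http://example.org/miapa-sketch#` and the property names `treeFormat` and `tipCount` are placeholders, not terms from MIAPA, CDAO, or any ontology used at the hackathon. The tree URI and values are taken from the Bininda-Emonds entry above.

```python
def to_ntriples(tree_uri, annotations,
                vocab="http://example.org/miapa-sketch#"):
    """Serialize key/value annotations as N-Triples with plain literals."""
    lines = []
    for key, value in sorted(annotations.items()):
        # Escape the characters that N-Triples string literals require.
        literal = (str(value)
                   .replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
        lines.append(f'<{tree_uri}> <{vocab}{key}> "{literal}" .')
    return "\n".join(lines)


uri = "http://phylotastic.org/data/Bininda-emonds_2007_mammals.nex"
print(to_ntriples(uri, {"treeFormat": "NEXUS", "tipCount": 4510}))
```

Once the ontology terms settle, the placeholder namespace would be swapped for the real property URIs, and the resulting triples could be loaded into the triplestore alongside the tree.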

Log and accomplishments

initial plan (day 1)
initial MIAPA checklist-based input form (day 1)
revised input form
plan for (temporarily) storing trees and matrices (data) separate from metadata

Lessons learned from tree-finding and annotation

data frequently is not readily accessible online
- e.g., for the Jetz trees one must go to birdtree.org
- e.g., we obtained hymenoptera tree by pers. communication from Peters
- only one of the studies has trees in TreeBASE (GEBA)
- another study has a tree in Dryad (Smith)
authors often don't provide minimal information explicitly
- whether or not a tree is rooted is usually not explicit
  - e.g., taxonomic hierarchies (APG, ToLWeb, NCBI) imply rootedness
  - e.g., Peters, et al 2011 never invokes the term "root" but we infer rooting from the description of outgroups not present in the final tree
- the methodology is largely unexplained for curated or hand-crafted taxonomic frameworks like APGIII or NCBI
a study frequently has several trees by slightly different methods, but the checklist seems to imply one method
- e.g., Goloboff molecules only vs. molecules with morphology
- e.g., Bininda-Emonds best dates vs min dates vs max dates
sometimes a study has many trees that all represent outputs of the same method
- e.g., Jetz provide a large sample from the posterior distribution
the process of constructing a tree does not always follow the sequences -> alignment -> tree pipeline
- e.g., supertree method in Bininda-Emonds
- e.g., hand-crafted APG, Smith, trees
process of constructing tree cannot be condensed easily
- e.g., Goloboff, iterative procedure with divide-and-conquer search to find parsimony tree ("tree fuse" & "sector" search repeated)
- GEBA tree, { missing explanation }
- partitioned alignments (e.g., Smith, et al): MIAPA implies a single "model of evolution", whereas in Peters, et al there are 2 models for 2 partitions
clustering to define orthologs not included in checklist, but seems important
- Smith phlawd, alignment: no pre-orthology.
mixed data is common
- Goloboff has morpho and molecular
- multiple studies have DNA (e.g., SSU rDNA) and protein sequences
concatenated alignments are common, e.g., multiple proteins
- this means accession:OTU mapping is not 1:1 but many to one
not encountered in our inputs, but sometimes the OTU is <genus_sp> and the data are fused from multiple species (this is common in MorphoBank)
many important trees do not have branch lengths
- e.g., APGIII is a taxonomic framework
- e.g., some supertrees don't have branch lengths
do binomials count as meaningful external identifiers for OTUs?
- in some cases, the methods make clear that these come from a specific source
  - e.g., Goloboff names clearly come directly from NCBI via their bioinfo pipeline
  - e.g., the Bininda-Emonds publication declares that the naming authority is Wilson & Reeder (Mammal Species of the World)
  - e.g., NCBI taxonomy comes as a database dump with taxids and synonyms, so it represents its own authority
- usually the naming authority is not clear
was any study straightforward?
OTUs checklist question may be redundant: why have external identifier and then ask for collections information?
OTU external refs in checklist: not applicable for supertree methods, consensus methods, hand-crafted
