Phylotastic/Use Cases

From Evolutionary Interoperability and Outreach
Revision as of 17:14, 10 June 2012 by Hilmar (talk | contribs) (moved PhylotasticUseCases to Phylotastic/Use Cases)

about

this is the page for use-cases. the more clearly specified the use-case, the more it will be useful in testing. ideally, include all input data and expected outputs.

the three use-cases given below represent standard character analysis. how about some more diversity?

  • a metagenomics use-case
  • a mashup (tree + table) with species data accessible via web-services (see Rod Page's ispecies for example of a species mashup)
    • image (EoL, wikipedia, phylopic)
    • collection data, incl. geo location (GBIF)
    • nearest museum specimen (VertNet)
    • genome availability (NCBI)
    • number of sequences available (NCBI)
    • sequence for a specific marker gene or protein (rDNA, cytC, cytOx, etc) (NCBI)
    • average body size (EoL, FishNet)
    • review publications about (PubMed)

Big Trees

First, here are several large phylogenies that can be used to work out case studies.

  • 55,473 taxon angiosperm phylogeny from Smith et al. 2011, Newick format, species-level (File:Smith 2011 angiosperms.txt)
    • Nodes with IDs: 55,473
    • max ID Length: 63
    • max ID length text: Aesculus_glabra_var__arguta_x_Aesculus_sylvatica_var__pubescens
  • This is the angiosperm phylogeny group (APG) tree that underlies Phylomatic, NEXUS format, family level (File:Phylomatictree.nex)
    • Nodes with IDs: 1,827
    • max ID Length: 34
    • max ID length text: harrimanelloideae_to_vaccinioideae
  • The megatrees and component trees for angiosperms associated with phylomatic are now more accessible, via github (they were previously on a self-hosted SVN server). Includes tools for joining the component trees together.
  • Megaphylogeny of 800+ living and fossil families of fishes, Westneat and Lundberg unpublished, NEXUS format including Mesquite extensions, family level with other higher-taxa labelled (File:Westneat Lundberg BigFishTree.nex)
  • Tree of Life Web Project Structure, zipped version of proprietary XML format, spans all of life, family level and above (File:TOL.xml.zip)
    • needs conversion to Newick or NEXUS format
  • 800k node GreenGenes Tree from early 2011, newick format, includes branch lengths (File:Greengenes2011.txt)
    • Nodes with IDs: 413,004
    • max ID Length: 135
    • max ID length text: c__Thermolithobacteria; o__Thermolithobacterales; f__Thermolithobacteraceae; g__Thermolithobacter; s__Thermolithobacter ferrireducens
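The per-tree statistics listed above (number of nodes with IDs, maximum ID length, longest ID) can be recomputed with a few lines of code. The following is a minimal sketch, not the script that produced the numbers on this page: the function names `newick_labels` and `label_stats` are made up for illustration, and the parser assumes plain Newick with optional single-quoted labels and bracketed comments. Numeric support values written in the label position of an internal node would be counted as IDs.

```python
def newick_labels(newick):
    """Return every node label (tip or internal) found in a Newick string."""
    labels = []
    i, n = 0, len(newick)
    while i < n:
        ch = newick[i]
        if ch == "'":
            # Quoted label: runs to the closing quote; '' is an escaped quote.
            j, buf = i + 1, []
            while j < n:
                if newick[j] == "'":
                    if newick[j + 1 : j + 2] == "'":
                        buf.append("'")
                        j += 2
                        continue
                    break
                buf.append(newick[j])
                j += 1
            labels.append("".join(buf))
            i = j + 1
        elif ch == "[":
            # Bracketed comment: skip to the closing bracket.
            end = newick.find("]", i)
            i = n if end == -1 else end + 1
        elif ch == ":":
            # Branch length: skip up to the next structural character.
            i += 1
            while i < n and newick[i] not in "(),;":
                i += 1
        elif ch in "(),;" or ch.isspace():
            i += 1
        else:
            # Unquoted label: runs to the next structural character.
            j = i
            while j < n and newick[j] not in "(),:;[":
                j += 1
            labels.append(newick[i:j].strip())
            i = j
    return labels


def label_stats(newick):
    """Return (number of labelled nodes, max label length, longest label)."""
    labels = newick_labels(newick)
    longest = max(labels, key=len, default="")
    return len(labels), len(longest), longest
```

For one of the files above, the call would be something like `label_stats(open(path).read())`, where `path` points at a local copy of the Newick file.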

(btw, I made an R package (phyloorchard) to hold large trees; it has a few in there now, and I'll add the ones above. People should feel free to request to be added to that project if you want to do more with it. --BrianOMeara 15:56, 25 April 2012 (EDT))

use cases

Phylogenetic diversity analysis (community ecology)

Description: In "Phylogenetic habitat filtering influences forest nucleation in grasslands", Duarte (2011) uses Phylomatic to analyze the diversity of forest patches in a Brazilian grassland region. Each patch has a size that can be measured, and each patch has some set of tree species S. The purpose of the analysis is to understand the relationship between the phylogenetic diversity of a patch and its size. Calculating phylogenetic diversity for patches P1, P2, P3, etc. requires phylogenies for the corresponding sets of species S1, S2, S3, etc. (Actually, another way to satisfy this use-case is for the phylogeny service to provide a measure of phylogenetic diversity directly from S1, S2, etc.)

Steps subsequent to field work and coding of data

  • get the family/genus/binomial string for each species
  • construct the list of species S_x for each patch P_x
  • use phylomatic to get the topology for S_x
  • adjust branch lengths using BLADJ
  • compute phylogenetic diversity using NRI (net relatedness index) in phylocom
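The last step above can be sketched in code. NRI is the standardized effect size of the mean pairwise phylogenetic distance (MPD) among the species of a patch, with the sign flipped so that positive values indicate clustering. The sketch below is not phylocom itself: it assumes the patristic distances have already been computed from the branch-length-adjusted tree and are supplied as a dictionary keyed by `frozenset` pairs, and it uses one simple null model (random draws of the same number of species from the full species pool). The function names, the `runs` and `seed` parameters, and the data layout are all illustrative choices.

```python
import random
import statistics
from itertools import combinations


def mean_pairwise_distance(species, dist):
    """Mean patristic distance over all pairs of species (needs >= 2 species)."""
    pairs = list(combinations(sorted(species), 2))
    return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)


def net_relatedness_index(patch, pool, dist, runs=999, seed=0):
    """NRI = -(MPD_observed - mean(MPD_null)) / sd(MPD_null).

    The null distribution is built by drawing len(patch) species at random
    from the pool; this is an assumed null model, one of several in use.
    """
    rng = random.Random(seed)
    pool_list = sorted(pool)
    observed = mean_pairwise_distance(patch, dist)
    null = [
        mean_pairwise_distance(rng.sample(pool_list, len(patch)), dist)
        for _ in range(runs)
    ]
    sd = statistics.stdev(null)
    if sd == 0:
        return 0.0  # every null draw gave the same MPD; NRI is undefined
    return -(observed - statistics.mean(null)) / sd
```

Running `net_relatedness_index` once per patch, with the 58-species pool as `pool`, would give one value per patch to plot against patch size.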

Files that we have (we thank Leandro Duarte for supplying data files)

Note that the phylogenies above are for the complete set of 58 species observed among all patches. It is not the phylogeny of species for any individual patch. This corresponds to Fig. 1 from the paper, shown here:

[Figure: Fig. 1 from Duarte (2011), the phylogeny of all 58 species observed across the patches]

Teaching morphological evolution with reference to the community-consensus megatree of fishes.

Description: Educators often use phylogenies to illustrate and discuss the evolution of character systems. However, textbooks often present outdated phylogenies, particularly when the newest edition of a book begins to age, and printed phylogenies aren’t very computable in any case. Wouldn’t it be great if any educator could provide a list of taxa of interest and receive back a current phylogeny linking them?

In this case study, we’ll prune the current community-consensus megatree of fish families (Westneat and Lundberg, unpublished) to a list of 25 taxa, and then reconstruct some simple morphological characteristics on the pruned phylogeny. The two examples shown here are drawn directly from Brian Sidlauskas’ Ichthyology class at Oregon State University. He assembled the pruned tree manually by examining the megatree, but not every educator would be comfortable doing this.

  • Task 1: Prune the Westneat and Lundberg megatree using the list of higher taxa contained in column one of this tab-delimited text file. Note that the megatree uses primarily families at its tips, but that most of the terminal taxa desired in the output tree are orders or even higher-level taxa. All taxa specified in the text file appear in the megatree, but many occur as internal node labels, not terminal taxa. The output topology should match that in this NEXUS format treefile (which was assembled by hand) and will ideally preserve terminal taxon names and internal node labels.
  • Task 2: Perform and display a parsimony-based ancestral state reconstruction using the character data in columns 3 and 4 of the text file. The output should match that appearing in the two images. Bonus if other methods of reconstruction are available as options (e.g. stochastic character mapping or likelihood).
    • Character coding is interpreted as follows.
      • Tail Type: 0 = Protocercal; 1 = Heterocercal; 2 = Diphycercal; 3 = Homocercal; 4 = Isocercal
      • Gasbladder Type: 0 = Absent; 1 = Physostomous; 2 = Physoclistous
  • Task 3: Perform the same pruning operation as in task 1, but use the scientific names listed in column two of the text file as input and return a tree with the species-level names as the terminal taxa. Performing this task would require a taxonomic lookup (Fishbase or Encyclopedia of Life?) to locate the family-level terminal taxon in the fish megatree that includes that taxon.
  • Possible Further Extension: Can a taxonomic lookup be extended to common names? What if an educator wants a phylogeny for "shark, salmon, herring, seahorse and pufferfish"?
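The pruning in Task 1 differs from ordinary tip-based pruning because many requested taxa are internal node labels. A minimal sketch of one way to handle this follows. It is not the procedure used to build the hand-assembled tree: the `Node` class, `prune`, and `to_newick` are hypothetical names, and reading the megatree from NEXUS is left out. A requested internal node is collapsed into a single tip, unrequested tips are dropped, and internal nodes left with a single child are removed. If one requested taxon is nested inside another, the outer one wins.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    name: str = ""
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, wanted: Set[str]) -> Optional[Node]:
    """Return a pruned copy of the tree, or None if nothing below is wanted."""
    if node.name in wanted:
        # Requested taxon (tip or internal label): keep it as a terminal.
        return Node(node.name)
    kept = [p for p in (prune(child, wanted) for child in node.children) if p]
    if not kept:
        return None      # unrequested tip, or a clade with no requested taxa
    if len(kept) == 1:
        return kept[0]   # suppress a node left with a single child
    return Node(node.name, kept)  # keep the internal node and its label


def to_newick(node: Node) -> str:
    """Serialize a tree (topology and labels only) as a Newick string."""
    def fmt(n: Node) -> str:
        if not n.children:
            return n.name
        return "(" + ",".join(fmt(c) for c in n.children) + ")" + n.name
    return fmt(node) + ";"
```

Task 3 would add a lookup step before the call to `prune`, replacing each species name with the family-level terminal that contains it and relabelling the tips afterwards.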

[Figures: the two parsimony-based ancestral state reconstructions (Tail Type and Gasbladder Type) on the pruned tree]

Allometry of milk intake at peak lactation

This use-case is from Riek A: Allometry of milk intake at peak lactation. Mammalian Biology Zeitschrift fur Saugetierkunde 2011, 76(1):3-11 (Figure by Brian Sidlauskas and Arlin Stoltzfus includes sub-phylogeny and correlation Figures from Riek, 2011).

[Figure: sub-phylogeny and correlation figures from Riek, 2011]

We have the species list of Riek, 2011 as a spreadsheet (File:Riek species list.xlsx). The list was extracted from the publisher's PDF and corrected manually (caveat emptor). An original version is available on request from the author. The species (without infraclass and order information) are:

Bettongia penicillata, Macropus eugenii, Pseudocheirus peregrinus, Phyllostomus hastatus, Suricata suricata, Mustela vison, Mephitis mephitis, Felis catus, Canis lupus, Ursus americanus, Ursus arctos, Cystophora cristata, Erignathus barbatus, Halichoerus grypus, Phoca groenlandica, Callorhinus ursinus, Arctocephalus australis, Equus caballus, Lama glama, Camelus dromedarius, Sus scrofa, Ovis orientalis, Bos taurus, Capra hircus, Capra ibex, Oreamnus americanus, Ovibus moschatus, Cervus elaphus, Alces alces, Odocoileus hemionus, Rangifer tarandus, Cephalophus manticola, Gazella dorcas, Papio cynocephalus, Homo sapiens, Rattus norvegicus, Mus musculus, Cavia porcellus, Oryctolagus cuniculus, Lepus europaeus

Riek extracted a phylogeny from the Bininda-Emonds mammal tree as described here:

The phylogeny for the species used in the present study was derived from a mammalian supertree, which includes 4510 species with branch lengths derived from dated estimates of divergence times (Bininda-Emonds et al. 2007). The supertree for mammals in Newick format was transformed to a distance matrix using the Analyses in Phylogenetics and Evolution package in R (Paradis et al., 2004) and pruned to include only the species of the present study. The resulting tree had no polytomies. The program PhyloWidget (Jordan and Piel 2008) was used to construct a printable phylogenetic tree from the phylogeny in Newick format (Fig. 1)

The resulting tree is shown below (a Newick file is available on request from the author).

[Figure: Fig. 1 from Riek (2011), the pruned phylogeny of the study species]

TimeFree database

TimeTree.org has a lot of information about divergence times. It would be trivial to scrape. However, they explicitly forbid widespread reuse:

"Substantial duplication is not permitted. We encourage wide use of this resource, but until it is complete it should not be used to represent a synthesis for any taxonomic group. Currently large scale, automated, data-mining is not permitted"

despite it being six years old and NSF-funded. You're only allowed to look up pairs of species, use the poster, read the book, and use the iPhone app -- even the tree with dated nodes is unavailable for reuse in analyses. Given the number of recent large, dated trees, it would not be that difficult to make a database of chronogram patristic distances of species, ideally with confidence, and serve that in a way that actually encourages reuse.

a pseudocode example

This is an example of using the current TimeTree (or work-alike) to calibrate a Phylotastic species tree.

  • pre-requisites
    • TOPOLOGY = the input phylogenetic topology without branch lengths (e.g., from the APG tree)
    • TIMEPOINT = a service (TimeTree) that supplies divergence time estimates for species A and B (note: TimeTree's licensing will not allow this, I believe, though it is technically possible)
    • SCALE = a method for calibrating TREE based on some set of TIMEPOINTs (e.g. using BEAST)
    • UNCERTAINTY = an indicator of current uncertainty in branch lengths of TREE
    • SAMPLE = a method of sampling nodes from a tree (e.g., pick one at random, use all, etc.)
  • pseudo code
    1. read TOPOLOGY
    2. let TREE = TOPOLOGY
    3. while UNCERTAINTY exceeds user-supplied threshold
      1. get SAMPLE node from TREE
      2. add TIMEPOINT( SAMPLE ) to TIMEPOINT_LIST
      3. SCALE TREE based on TIMEPOINT_LIST
      4. get UNCERTAINTY
    4. output TREE
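The loop above translates almost line for line into code once the prerequisites are passed in as functions. The sketch below assumes nothing about how SAMPLE, TIMEPOINT, SCALE, or UNCERTAINTY are implemented: they are caller-supplied callables, and the name `calibrate` and the `max_rounds` safety limit are additions for illustration that do not appear in the pseudocode.

```python
def calibrate(topology, sample, timepoint, scale, uncertainty,
              threshold, max_rounds=100):
    """Add calibration points until the tree's uncertainty is acceptable.

    sample(tree)            -> a node (or pair of species) to calibrate
    timepoint(node)         -> a divergence time estimate for that node
    scale(tree, timepoints) -> a new tree calibrated on all timepoints so far
    uncertainty(tree)       -> a number; lower means better-resolved dates

    max_rounds is an assumed safety limit so the loop ends even if the
    uncertainty never drops below the threshold.
    """
    tree = topology
    timepoints = []
    rounds = 0
    while uncertainty(tree) > threshold and rounds < max_rounds:
        node = sample(tree)
        timepoints.append(timepoint(node))
        tree = scale(tree, timepoints)
        rounds += 1
    return tree, timepoints
```

Passing the services in as arguments keeps the loop independent of any one provider, which matters here because the licensing of the divergence-time source is the open question.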

There are actually three possible starting points. One is a topology, where one could use the above procedure to fix node ages where there are calibrations and make up node ages elsewhere (Phylocom has tools for this, as does a script by Olaf Bininda-Emonds, used to make up branch lengths on the mammal supertree). Another is a topology plus some molecular data. One could combine the topology, data, and calibrations in something like BEAST. The third is to have a tree with branch lengths in units of amount of change, and then use calibrations and a program like r8s to turn it into a chronogram.

Calculating phylogenetic diversity and distinctiveness from GIS-linked specimen data

Rutger suggested on the mailing list that his colleagues at Naturalis would like to be able to prune a megatree based on GIS data to come up with indices of phylogenetic distinctiveness for areas (perhaps administrative areas or blocks in a grid) to inform conservation priorities. The original idea was based on plants in Malaysia, but the concept would be applicable to any taxon. Basic steps would likely be:

  • Assemble a database of taxon records (species IDs and lat/long coordinates).
  • Prune that set geographically
  • Prune the megatree to include only taxa occurring within the area of interest.
  • Calculate phylogenetic diversity on the pruned tree
  • Display results

This seems like a broadly appealing situation that would be of use to many researchers. Is anyone interested in fleshing it out as a use case? --Bsidlauskas 12:25, 2 May 2012 (EDT)

Brian O'Meara has a big dataset of georeferenced localities for a variety of organisms that he could provide in a CSV file upon request. The data are aggregated from GBIF, Lifemapper and other sources and form the basis of queries in his webapp Lampyr. The GBIF locations from Lampyr are in a tab-delimited file located here (warning: large file) --Bsidlauskas 18:34, 7 May 2012 (EDT)

Possible ways to partition a geographic dataset (and thus to generate the species list for input to Phylomatic) would be:

  • Everything within a particular radius of a given point
  • Everything within a rectangular area bounded by four specific geographic coordinates.
  • Everything within each cell of a grid laid over a region (this was the specification of the original use case from Naturalis)
  • All occurrences within an irregular polygon (say a watershed, or a political boundary).
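The four partitioning schemes above can each be written as a small test on a latitude/longitude pair. The following is a rough sketch under stated assumptions: records are `(species, lat, lon)` tuples in decimal degrees, the bounding box does not cross the antimeridian, and the polygon test treats coordinates as planar, which is only reasonable for small regions. All function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(lat, lon, centre_lat, centre_lon, radius_km):
    return haversine_km(lat, lon, centre_lat, centre_lon) <= radius_km


def in_bounding_box(lat, lon, south, west, north, east):
    return south <= lat <= north and west <= lon <= east


def grid_cell(lat, lon, cell_deg):
    """Index of the grid cell (row, column) containing the point."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def in_polygon(lat, lon, polygon):
    """Ray-casting test; polygon is a list of (lat, lon) vertices."""
    inside = False
    for i in range(len(polygon)):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % len(polygon)]
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            cross = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross:
                inside = not inside
    return inside


def species_where(records, predicate):
    """Set of species with at least one record satisfying predicate(lat, lon)."""
    return {sp for sp, lat, lon in records if predicate(lat, lon)}
```

For the grid scheme, grouping records by `grid_cell(lat, lon, cell_deg)` yields one species list per cell, each of which can be passed on to Phylomatic.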


a pseudocode example

  • prerequisites
    • OCCURRENCE_SERVICE = a resource that can be queried by species or location (or metadata) for occurrence records
    • REGIONS_LIST = a list of geographic regions, either as polygons or as IDs for administrative regions
    • COLOR_MAP( MAP, REGION, VALUE ) = paint region on map with color-coded value
  • pseudo code
    1. read REGIONS_LIST
    2. for each REGION in REGIONS_LIST
      1. // convert region to polygon if needed by OCCURRENCE_SERVICE
      2. get OCCURRENCES in REGION from OCCURRENCE_SERVICE
      3. get SPECIES_LIST from OCCURRENCES
      4. get PHYLOTASTIC_TREE from SPECIES_LIST
      5. get PHYLOGENETIC_DIVERSITY( PHYLOTASTIC_TREE )
      6. add { REGION, PHYLOGENETIC_DIVERSITY } to RESULT_LIST
      7. COLOR_MAP( MAP, REGION, PHYLOGENETIC_DIVERSITY )
    3. output MAP and RESULT_LIST
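The same loop in code, as a sketch only. The occurrence service, the Phylotastic tree service, the diversity measure, and the map painter are all passed in as functions, because none of them is specified here. The name `diversity_by_region`, the `"species"` key on occurrence records, and the optional `paint` callback are assumptions made for illustration.

```python
def diversity_by_region(regions, get_occurrences, get_tree,
                        phylogenetic_diversity, paint=None):
    """Compute a phylogenetic diversity value for each region.

    get_occurrences(region)      -> iterable of records with a "species" key
    get_tree(species_list)       -> a tree for those species (Phylotastic step)
    phylogenetic_diversity(tree) -> a number
    paint(region, value)         -> optional; colors the region on a map
    """
    results = []
    for region in regions:
        occurrences = get_occurrences(region)
        # Many records can share a species, so reduce to a unique sorted list.
        species = sorted({occ["species"] for occ in occurrences})
        tree = get_tree(species)
        value = phylogenetic_diversity(tree)
        results.append((region, value))
        if paint is not None:
            paint(region, value)
    return results
```

A region with no occurrence records reaches `get_tree` with an empty list, so a real implementation would need to decide whether to skip such regions or report a diversity of zero.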

Species interaction

An example I've used before is to add a phylogenetic component to a 1929 collection of plant and pollinator interactions. There are several datasets of this kind available through NCEAS's Interaction Web database, covering not only plant/pollinator interactions but also other kinds of interactions among species.

  • Identify datasets of interest
  • Retrieve phylogenetic trees for both groups
  • Reconcile trees
  • Display

Nmatasci 19:29, 2 May 2012 (EDT)

a pseudocode example

  • prerequisites
    • a functional MAPPING (e.g., ecological mapping) between species A and B, e.g., A23 pollinates B47
    • COMPARE = a method, possibly graphical, of comparing Phylogeny_A and Phylogeny_B given MAPPING
  • pseudo code
    1. input MAPPING
    2. extract list of subjects (e.g., plants) from MAPPING as LIST_A
    3. get PHYLOGENY_A for LIST_A (i.e., phylotastic)
    4. extract list of objects (e.g., pollinators) from MAPPING as LIST_B
    5. get PHYLOGENY_B for LIST_B (i.e., phylotastic)
    6. COMPARE( PHYLOGENY_A, PHYLOGENY_B, MAPPING )
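A short sketch of these steps, assuming the MAPPING is available as a list of `(subject, object)` pairs such as `("plant_23", "pollinator_47")`. The tree service and the COMPARE method are passed in as functions since neither is defined here, and the function names are illustrative.

```python
def split_mapping(mapping):
    """Return the unique subjects (LIST_A) and objects (LIST_B) of a mapping."""
    list_a = sorted({a for a, _ in mapping})
    list_b = sorted({b for _, b in mapping})
    return list_a, list_b


def compare_interactions(mapping, get_tree, compare):
    """Fetch a phylogeny for each side of the mapping and compare them.

    get_tree(species_list)            -> a tree (the Phylotastic step)
    compare(tree_a, tree_b, mapping)  -> whatever the comparison produces,
                                         e.g. a tanglegram or a statistic
    """
    list_a, list_b = split_mapping(mapping)
    tree_a = get_tree(list_a)
    tree_b = get_tree(list_b)
    return compare(tree_a, tree_b, mapping)
```

Keeping the original pairs alongside the two trees matters because the comparison needs the links between tips, not just the two species lists.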

Metagenomics: Flagging inconsistency/discrepancies in NCBI taxonomy

Because of issues with licensing and reuse associated with other taxonomic nomenclatures (e.g. SILVA), we continue to rely on NCBI taxonomy for annotating environmental sequence data (e.g. large 454, Illumina datasets). However, NCBI taxonomy is extremely messy and not concordant with phylogenetic structure for many groups. For example, in nematodes the NCBI hierarchy is more consistent with historical morphological classifications than with the most recent knowledge about evolutionary relationships inferred from molecular phylogenies. How do we link taxonomic information from different sources (e.g. TreeBase, SILVA, greengenes, the OpenTree project) and flag inconsistencies in NCBI’s hierarchy? Because we are working with many uncultured microbial taxa in environmental datasets, portals like GBIF are not appropriate taxonomy resources because of their focus on botany and larger, well-studied animal species. NCBI remains the best option.

Metagenomic approaches have primarily relied on BLAST-assigned taxonomy, using sequence homology to determine the closest relative in databases. However, the field is now moving towards tree-based taxonomy assignments, e.g. via tools such as pplacer which places short environmental sequence reads onto guide tree topologies (e.g. representing full-length reference sequences). This is a more robust method to assign taxonomy, since BLAST-based approaches are inherently reliant on the quality/size of reference databases and the (often uninformative) user annotations for deposited sequences.

However, the move to tree-based taxonomic assignments faces a number of challenges, with one of the main issues being the reconciliation of NCBI taxonomy with molecular phylogenetic structures of guide trees. Currently we are mapping our gene tree phylogeny onto the NCBI taxonomic hierarchy, as follows:

[Diagram: mapping of the gene tree phylogeny onto the NCBI taxonomic hierarchy] (Diagram courtesy of Aaron Darling, UC Davis Genome Center)

We cannot yet map taxonomy onto gene trees, because accurately executing this converse mapping approach will require sophisticated (as yet undeveloped) mathematical algorithms. We have been putting in grant proposals to develop these in the long term, but for the time being we must work to tweak our mapping to NCBI taxonomy to be as robust as possible. --Holly.bik@gmail.com 16:49, 3 May 2012 (EDT)
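Well short of the mapping algorithms mentioned above, one simple way to flag candidate inconsistencies is to test whether the tips assigned to each taxon form a single clade on a rooted guide tree. The sketch below is only that check and is not the approach shown in the diagram. It assumes the tree is given as nested tuples of tip names and that every tip has exactly one taxon assignment at the rank being tested; both function names are made up.

```python
def clade_tip_sets(tree):
    """Return the tip set of every clade in a rooted tree of nested tuples."""
    clades = set()

    def walk(node):
        if isinstance(node, str):
            tips = frozenset([node])
        else:
            tips = frozenset().union(*(walk(child) for child in node))
        clades.add(tips)
        return tips

    walk(tree)
    return clades


def non_monophyletic_taxa(tree, taxon_of):
    """List taxa whose member tips do not form exactly one clade.

    taxon_of maps each tip name to its taxon at one rank (for example the
    family assigned by the taxonomy being checked).
    """
    clades = clade_tip_sets(tree)
    members = {}
    for tip, taxon in taxon_of.items():
        members.setdefault(taxon, set()).add(tip)
    return sorted(
        taxon for taxon, tips in members.items()
        if frozenset(tips) not in clades
    )
```

A flagged taxon is only a candidate for review: the conflict may come from the taxonomy, from the guide tree, or from where the tree was rooted.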

Auto-supply species trees to reconcile-tree software

background

The "reconciliation" problem is to infer gene duplications and deletions by combining the information from a gene tree with the corresponding species tree. The resulting tree is called a "reconciled tree" (or a reconcile tree). This is a quantitatively important use-case in bioinformatics, primarily because reconciliation-based orthology assignment is a step in pipelines for functional annotation of genomes. There are perhaps a dozen different software programs that do this.

At least one data resource, Ensembl Compara, provides pre-computed reconcile trees for its gene families. Typically the user computes the gene tree from molecular sequences, but must get the species tree from somewhere else, e.g., the scientific literature. This is a potential bottleneck that could be addressed in two steps:

  • get the list of species from the list of gene identifiers (can be done currently via NCBI web services)
  • get species tree from the list of species (Phylotastic)

Because most gene data are in GenBank, the relevant species for nearly all gene trees are represented already in the NCBI taxonomy, so it can substitute for a species tree in reconciliation protocols. Apparently this is quite commonly done: it is the basis for the Ensembl Compara pipeline; the SoftParsMap package has some kind of internal functionality to extract a species tree from a downloaded NCBI taxonomy; the manual for Notung (yet another reconciliation program) provides instructions for users on how to get a species tree interactively from the NCBI taxonomy browser.

approach

Regardless of approach, this can be automated only to the extent that we can discover species names from the user's gene tree or sequence alignment. This is a likely scenario to the extent that sequence names often reflect NCBI identifiers (gis or accessions). Let's assume that case.

There are alternative approaches to implement auto-discovery of a species tree within a reconcile-tree pipeline:

  • starting with an open-source reconciliation tool, integrate phylotastic web services into the code
    • + demonstrates auto-discovery
    • - only useful to those who use this reconciliation tool
  • write a wrapper around a reconciliation tool
    • + demonstrates auto-discovery; easier to adapt to other tools
    • - depends on possibly unstable interface to tool
  • write a standalone tool to get a species tree from a gene-tree (or alignment)
    • e.g., use BioPerl or DendroPy to supply the functionality for reading in alignments or trees

Arlin 15:24, 7 May 2012 (EDT)


a pseudocode example

This example covers the case of a stand-alone tool that will discover a species phylogeny from the user's input data.

  • prerequisites
    • user's input GENE_TREE or SEQUENCE_ALIGNMENT in a standard format
  • pseudo code
    1. if input is a tree
      1. read GENE_TREE
      2. extract NCBI identifiers from GENE_TREE as ID_LIST
    2. else
      1. read SEQUENCE_ALIGNMENT
      2. extract NCBI identifiers from SEQUENCE_ALIGNMENT as ID_LIST
    3. get SPECIES_LIST from ID_LIST // uses NCBI's existing web-services interface
    4. get PHYLOGENY for SPECIES_LIST (i.e., phylotastic)
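The identifier-extraction step is the part that can be sketched without any external service. The code below assumes the sequence names have already been read from the tree or alignment (for example with BioPerl or DendroPy) and are supplied as a list of strings. The two regular expressions are illustrative and do not cover every NCBI identifier format, and the species lookup and tree service are passed in as functions rather than tied to a specific web-service call.

```python
import re

# "gi|12345|..." style names: capture the numeric GI.
GI_PATTERN = re.compile(r"gi\|(\d+)")

# Accession-like tokens: RefSeq style (two letters, underscore, digits) or
# GenBank style (one to three letters followed by digits). Illustrative only.
ACCESSION_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])([A-Z]{2}_\d{6,9}|[A-Z]{1,3}\d{5,8})(?![0-9])"
)


def extract_ncbi_ids(labels):
    """Return one (kind, value) identifier per label that contains one."""
    ids = []
    for label in labels:
        gi = GI_PATTERN.search(label)
        if gi:
            ids.append(("gi", gi.group(1)))
            continue
        acc = ACCESSION_PATTERN.search(label)
        if acc:
            ids.append(("accession", acc.group(1)))
    return ids


def species_tree_for_labels(labels, lookup_species, get_tree):
    """ID_LIST -> SPECIES_LIST -> PHYLOGENY, with both services injected.

    lookup_species(kind, value) -> species name (e.g. via NCBI web services)
    get_tree(species_list)      -> species tree (the Phylotastic step)
    """
    ids = extract_ncbi_ids(labels)
    species = sorted({lookup_species(kind, value) for kind, value in ids})
    return get_tree(species)
```

Labels with no recognizable identifier are silently skipped here; a real tool would probably report them so the user can supply the species by hand.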


Sequence string as input approach

As an alternative to taking a user gene tree as input, we may also consider input from a step back in the process, and think of going from a single input sequence to a fully reconciled gene tree. An excellent example of going from raw-sequence to a gene tree phylogeny is provided in the DNA Subway education platform. Currently the 'Prospect Genome' track on the DNA Subway halts at the generation of a gene tree. A phylotastic service could provide the species tree needed to take this one step further to a reconciled gene tree. The DNA Subway education platform could then go from a raw genome sequence to a set of reconciled gene trees for that genomic segment, and allow for educators to discuss the interpretation of orthologs and paralogs in genome evolution. --JamesEstill@gmail.com 12:17, 16 May 2012 (EDT)

phylogenetic analysis of leaf vein patterns

This is based on a real study by Ramona Walls (now at NY Botanical Garden), but the result tree is imaginary (for the purposes of this figure), so don't look too closely (Figure by Brian Sidlauskas and Arlin Stoltzfus, using leaf patterns figure from Walls, 2011). Reference: Walls RL: Angiosperm leaf vein patterns are linked to leaf functions in a global-scale data set. American Journal of Botany 2011, 98(2):244-253.

This case is incomplete. It would be good to get the species list and a character matrix. Maybe they are available in supplementary data? The phylogeny is from the APG tree (and other trees-- see Walls, 2011 for details).

[Figure: leaf vein patterns (from Walls, 2011) shown with an illustrative phylogeny]

(why not just use the existing treedata() function in Geiger in R to prune to common leaf set? Note that there's also an interface to iPlant's taxonomic name resolution service in R already, so names can be converted (package rplant); we just got funding to have a postdoc spend a year extending this to all of iPlant's APIs (it can already handle data upload/download and batching alignments). --BrianOMeara 16:10, 25 April 2012 (EDT))