MIAPA/PhyloWays

From Evolutionary Interoperability and Outreach

PhyloWays: a list of interpreted phyloinformatics workflows

This is intended to house a reference set of pairs, where each pair consists of

  1. a publication reporting a phylogeny
  2. a more precise or formal description of the methods

This reference set is developed in the hope it will be useful for various projects:

  • developing the vocabulary support to annotate phylogenetics workflows
  • developing an annotation tool to create phylogenetic records that satisfy a MIAPA-like standard
  • testing natural-language-processing (NLP) tools to extract methods information from published papers
  • creating an archive where users share, like, comment, and link to workflow descriptions

overview

guidelines

cases

some candidate cases

Below are some cases where the data should be readily available.

a simple protein phylogeny

"Evolutionary history of the non-specific lipid transfer proteins"

This study is in TreeBASE. There is one input alignment (263 proteins, 118 aligned characters) and two trees, one NJ and one ML.

http://treebase.org/treebase-web/search/study/analyses.html?id=11155

a simple DNA phylogeny

"Plastid DNA Diversity Is Higher in the Island Endemic Guadalupe Cypress than in the Continental Tecate Cypress"

The alignment has 1589 sequence characters for 35 OTUs. This is an open-access article with data in TreeBASE:

http://www.treebase.org/treebase-web/search/study/analyses.html?id=11036

a simple DNA phylogeny

"Cercosporoid leaf pathogens from whorled milkweed and spineless safflower in California"

This is a species phylogeny available in TreeBASE:

http://www.treebase.org/treebase-web/search/study/analyses.html?id=11804

There is one input alignment of 1125 columns for 35 ITS sequences, and one tree, which is a parsimony tree.

prokaryotic phylogeny (Wu, et al., 2009)

A large phylogeny of prokaryotes based on a concatenated alignment, with data in TreeBASE:

Wu D, Hugenholtz P, Mavromatis K, Pukall R, Dalin E, Ivanova NN, Kunin V, Goodwin L, Wu M, Tindall BJ et al: A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 2009, 462(7276):1056-1060.

Caniformia phylogeny (Yu, et al., 2011)

A recent paper with a sequence alignment and data deposited in TreeBASE.

Yu L, Luan PT, Jin W, Ryder OA, Chemnick LG, Davis HA, Zhang YP: Phylogenetic utility of nuclear introns in interfamilial relationships of Caniformia (order Carnivora). Syst Biol 2011, 60(2):175-187.

phylo analysis of rodent mandible shape

A recent paper with a new morphological character matrix and data deposited in Dryad.

Alvarez et al. Ecological and phylogenetic influence on mandible shape variation of South American caviomorph rodents (Rodentia: Hystricomorpha). Biological Journal of the Linnean Society, 2011, 102, 828–837.

Angiosperm phylogeny (Soltis, et al, 2011)

Publication: Soltis DE, Smith SA, Cellinese N, Wurdack KJ, Tank DC, Brockington SF, Refulio-Rodriguez NF, Walker JB, Moore MJ, Carlsward BS, et al. 2011. Angiosperm phylogeny: 17 genes, 640 taxa. Am J Bot 2011:ajb.1000404. - http://www.amjbot.org/cgi/reprint/ajb.1000404v1

Data: concatenated alignments for a superset of 14 loci/17 genes (nucleotide sequences) sampled from 640 species. Genes included 18S rDNA (nuc), 26S rDNA (nuc), atpB (cp), atp1 (mito), matK (cp), matR (mito), nad5 (mito), ndhF (cp), psbBTNH (cp 4 gene region), rbcL (cp), rpoC2 (cp), rps16 (cp), rps3 (mito), and rps4 (cp).

Alignment method: MAFFT was used to align each of the 14 loci; "adjustments were made by eye when there were obvious alignment errors due to particularly divergent or “gappy” sequences". Sites (columns) with >50% missing data (including gaps due to indels) were removed using Phyutility (Smith and Dunn, 2008). All or subsets of the gene alignments were concatenated for phylogenetic analysis.
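The column-pruning step above can be illustrated with a short sketch. This is not Phyutility's implementation; the function name `drop_sparse_columns`, the choice of `-` and `?` as the missing-data characters, and the toy alignment are all assumptions made for illustration.

```python
# Characters counted as missing data in this sketch (an assumption;
# a real pipeline might also count ambiguity codes such as N).
MISSING = set("-?")


def drop_sparse_columns(alignment, max_missing=0.5):
    """Return a copy of the alignment without the columns whose
    fraction of missing characters is greater than max_missing."""
    if not alignment:
        return {}
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    # Keep a column when its missing fraction does not exceed the threshold.
    keep = [
        i for i in range(length)
        if sum(s[i] in MISSING for s in seqs) / len(seqs) <= max_missing
    ]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


# Hypothetical four-taxon alignment; only the third column is >50% missing.
example = {
    "taxonA": "AC-GT",
    "taxonB": "A--GT",
    "taxonC": "AC-G?",
    "taxonD": "A--GT",
}
pruned = drop_sparse_columns(example)
```

A column that is exactly 50% missing is kept here, because the description says "> 50%".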

Tree estimation: Independent MP and ML analyses were performed on the following data matrices: nuclear rDNA genes; cp genes; mito genes; nuclear+cp genes; all 17 genes.

  1. Method (1) - ML; 10 independent runs for each data matrix.
    • Program - RAxML (vers. 7.1; Stamatakis, 2006).
    • Model of sequence evolution - GTRGAMMA with parameters estimated separately (unlinked) for each gene partition.
    • Method for evaluating support - 100-300 bootstrap replicates
  2. Method (2) - MP parsimony ratchet with 50 independent replicates, each run for 500 iterations; MP tree estimated as the majority-rule consensus of the best trees from each replicate; tree shown only for the 17-gene supermatrix.
    • Program - SeqBoot (Phylip; Felsenstein, 2005), PAUPRat (Sikes and Lewis, 2001) and PAUP* 4.0b10 (Swofford, 2002).
    • Method for evaluating support - bootstrap - 500 bootstrap datasets generated using SeqBoot. A PAUPRat ratchet file was generated for each pseudoreplicate and run for a single 500-iteration search.
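Both support measures above rely on bootstrap pseudoreplicates, which are made by resampling alignment columns with replacement. The following is a minimal sketch of that resampling only. It is not SeqBoot's code, and the function name, the toy alignment, and the fixed seed are assumptions for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudoreplicate by drawing columns with replacement
    until the replicate is as long as the original alignment."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


# Hypothetical toy alignment (three taxa, five columns).
toy = {
    "taxonA": "ACGTA",
    "taxonB": "ACGTC",
    "taxonC": "TCGAA",
}

rng = random.Random(42)  # fixed seed so that the sketch is repeatable
replicates = [bootstrap_alignment(toy, rng) for _ in range(500)]
```

Each pseudoreplicate would then be passed to the tree search (for Method 2, one ratchet search per pseudoreplicate), and support is read from how often a grouping recurs across the resulting trees.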

Additional comments: Trees available in TreeBASE - http://www.treebase.org/treebase-web/search/study/analyses.html?id=11267 ; Polyosma mtDNA loci omitted from analysis as contaminant after assessing discordance with other loci; Cardiopteris atp1 suspected as a contaminant, but retained.

semi-formalized description

Don't let this freak you out. The idea here is to see how detailed a description might have to be in order to be computable.

The main descriptive statement

publication Pub1 reports PhylogenyResult1.1 and PhylogenyResult2.1

About the publication

Pub1 has_authors "Soltis DE", "Smith SA", "Cellinese N" . . . 
Pub1 has_citation "Am J Bot 2011:ajb.1000404" . . .
Pub1 has_URL . . .

About phylogeny result 1.1, which is a consensus tree? The value is either concrete (a Newick tree) or a pointer (to a TreeBASE accession or a NeXML object).

PhylogenyResult1.1 has_value . . .  < concrete or referenced_by pointer >
PhylogenyResult1.1 has_input PhylogenyResult1.0
PhylogenyResult1.1 has_method MajorityRuleConsensus # ?? not sure
PhylogenyResult1.1 has_method_details "100 to 300 bootstrap replicates"

PhylogenyResult1.0 has_value NA  # we are not showing all the bootstrap trees
PhylogenyResult1.0 has_input Alignment1.1
PhylogenyResult1.0 has_method Method1

About phylogeny result 2.1, which is a consensus tree

PhylogenyResult2.1 has_value . . .  < concrete or referenced_by pointer >
PhylogenyResult2.1 has_input PhylogenyResult2.0
PhylogenyResult2.1 has_method MajorityRuleConsensus 
PhylogenyResult2.1 has_method_details "not sure about this" 

PhylogenyResult2.0 has_value NA  # we are not showing all the bootstrap trees
PhylogenyResult2.0 has_input Alignment1.1
PhylogenyResult2.0 has_method Method2
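To suggest what "computable" could mean for statements like these, here is a minimal sketch that splits one line into a (subject, predicate, object) triple. The function name `parse_statement` is hypothetical, and the sketch assumes that `#` starts a comment and never appears inside a quoted value.

```python
def parse_statement(line):
    """Split 'Subject predicate object  # comment' into a triple.
    Returns None for blank or comment-only lines."""
    body = line.split("#", 1)[0].strip()
    if not body:
        return None
    # The object may contain spaces, so split at most twice.
    subject, predicate, obj = body.split(None, 2)
    return subject, predicate, obj.strip().strip('"')


lines = [
    "PhylogenyResult2.0 has_value NA  # we are not showing all the bootstrap trees",
    "PhylogenyResult2.0 has_input Alignment1.1",
    "PhylogenyResult2.0 has_method Method2",
]
triples = [parse_statement(line) for line in lines]
```

Elided values such as ". . ." pass through unchanged as the object, so the placeholders in the description stay visible.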

ALIGNMENTS

About Alignment1.1, which is an edit of Alignment1.0, which is a concatenation

Alignment1.1 has_value . . . < concrete or referenced_by pointer >
Alignment1.1 has_input Alignment1.0
Alignment1.1 has_method Pruning
Alignment1.1 has_method_details "delete sites with >50% missing data" 

Alignment1.0 has_value . . . < concrete or referenced_by pointer >
Alignment1.0 has_input Alignment2.1, Alignment3.1 . . . Alignment15.1
Alignment1.0 has_method Concatenate

About Alignment2.1, a component alignment edited from a MAFFT alignment

Alignment2.1 has_value . . . < concrete or referenced_by pointer >
Alignment2.1 has_input Alignment2.0
Alignment2.1 has_method EditByHand 
Alignment2.1 has_method_details "adjust by eye obvious alignment errors due to divergent or gappy sequences" 

Alignment2.0 has_value . . . < concrete or referenced_by pointer >
Alignment2.0 has_input . . . < list of GenBank accessions, ideally > 
Alignment2.0 has_method MAFFT
Alignment2.0 has_method_details NA

Alignments 3.1 to 15.1 are similar: each one is a possibly edited version of a MAFFT alignment for an individual set of sequences.
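One use of the has_input statements is to trace a result back to its sources. The sketch below copies the has_input edges given above into a dictionary and walks them breadth-first. The names `HAS_INPUT` and `upstream` are hypothetical, and Alignment3.1 to Alignment15.1 are left out for brevity.

```python
from collections import deque

# has_input edges copied from the statements above
# (Alignment3.1 to Alignment15.1 omitted for brevity).
HAS_INPUT = {
    "PhylogenyResult1.1": ["PhylogenyResult1.0"],
    "PhylogenyResult1.0": ["Alignment1.1"],
    "PhylogenyResult2.1": ["PhylogenyResult2.0"],
    "PhylogenyResult2.0": ["Alignment1.1"],
    "Alignment1.1": ["Alignment1.0"],
    "Alignment1.0": ["Alignment2.1"],
    "Alignment2.1": ["Alignment2.0"],
}


def upstream(node, edges):
    """Return every object that node depends on, nearest first."""
    seen = {node}
    queue = deque([node])
    order = []
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


lineage = upstream("PhylogenyResult1.1", HAS_INPUT)
# lineage runs from the bootstrap trees back to the MAFFT alignment:
# PhylogenyResult1.0, Alignment1.1, Alignment1.0, Alignment2.1, Alignment2.0
```

Both consensus trees lead back to the same Alignment1.1, which is what the two has_input statements for PhylogenyResult1.0 and PhylogenyResult2.0 say.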

PHYLOGENY METHODS

Method1 has_attributes
* software RAxML
* software_version 7.1
* objective_function maximum_likelihood
* sitewise_model SiteWiseModel1
* among_site_model AmongSiteModel1

SiteWiseModel1 has_attributes
* GTR

AmongSiteModel1 has_attributes
* gamma
* partitions 

Method2 has_attributes
* software PAUPRat
* software PAUP
* software_version 4.0b10
* objective_function maximum_parsimony
* search_method parsimony_ratchet
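The attribute lists above could be held as typed records, which makes gaps easy to spot. This is only a sketch: the class names `Software` and `Method` and the helper `unversioned` are hypothetical, and attaching version 4.0b10 to PAUP (rather than to PAUPRat) follows the "PAUP* 4.0b10" wording in the methods summary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Software:
    name: str
    version: Optional[str] = None  # None when no version is recorded


@dataclass
class Method:
    objective_function: str
    software: List[Software] = field(default_factory=list)
    search_method: Optional[str] = None
    sitewise_model: Optional[str] = None
    among_site_model: List[str] = field(default_factory=list)


def unversioned(method):
    """Names of the programs that have no recorded version."""
    return [s.name for s in method.software if s.version is None]


method1 = Method(
    objective_function="maximum_likelihood",
    software=[Software("RAxML", "7.1")],
    sitewise_model="GTR",
    among_site_model=["gamma", "partitions"],
)
method2 = Method(
    objective_function="maximum_parsimony",
    software=[Software("PAUPRat"), Software("PAUP", "4.0b10")],
    search_method="parsimony_ratchet",
)

missing_versions = unversioned(method2)  # PAUPRat has no version listed
```

Pairing each program with its own version removes the ambiguity of a single software_version line that follows two software lines.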

another case