MIAPA/PhyloWays
From Evoio
PhyloWays: a list of interpreted phyloinformatics workflows
This page is intended to house a reference set of pairs, where each pair consists of:
- a publication reporting a phylogeny
- a more precise or formal description of the methods
This reference set is developed in the hope it will be useful for various projects:
- developing the vocabulary support to annotate phylogenetics workflows
- developing an annotation tool to create phylogenetic records that satisfy a MIAPA-like standard
- testing natural-language-processing (NLP) tools to extract methods information from published papers
- creating an archive where users share, like, comment, and link to workflow descriptions
overview
guidelines
cases
some candidate cases
Listed below are some cases where the data should be readily available.
a simple protein phylogeny
"Evolutionary history of the non-specific lipid transfer proteins"
This is in TreeBASE. There is one input alignment (263 proteins, 118 aligned characters) and two trees, NJ and ML.
http://treebase.org/treebase-web/search/study/analyses.html?id=11155
a simple DNA phylogeny
1589 sequence characters for 35 OTUs. This is an open-access article with TreeBASE data:
http://www.treebase.org/treebase-web/search/study/analyses.html?id=11036
a simple DNA phylogeny
"Cercosporoid leaf pathogens from whorled milkweed and spineless safflower in California"
This is a species phylogeny available in TreeBASE:
http://www.treebase.org/treebase-web/search/study/analyses.html?id=11804
There is one input alignment of 1125 columns for 35 ITS sequences. There is one tree, which is a parsimony tree.
prokaryotic phylogeny (Wu, et al., 2009)
Large phylogeny of prokaryotes based on a concatenated alignment; the data are in TreeBASE:
Wu D, Hugenholtz P, Mavromatis K, Pukall R, Dalin E, Ivanova NN, Kunin V, Goodwin L, Wu M, Tindall BJ et al: A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 2009, 462(7276):1056-1060.
Caniformia phylogeny (Yu, et al., 2011)
Recent paper with a sequence alignment; the data are deposited in TreeBASE.
Yu L, Luan PT, Jin W, Ryder OA, Chemnick LG, Davis HA, Zhang YP: Phylogenetic utility of nuclear introns in interfamilial relationships of Caniformia (order Carnivora). Syst Biol 2011, 60(2):175-187.
phylogenetic analysis of rodent mandible shape
Recent paper with a new morphological character matrix; the data are deposited in Dryad.
Alvarez et al. Ecological and phylogenetic influence on mandible shape variation of South American caviomorph rodents (Rodentia: Hystricomorpha). Biological Journal of the Linnean Society, 2011, 102, 828–837.
Angiosperm phylogeny (Soltis, et al., 2011)
Publication: Soltis DE, Smith SA, Cellinese N, Wurdack KJ, Tank DC, Brockington SF, Refulio-Rodriguez NF, Walker JB, Moore MJ, Carlsward BS, et al. 2011. Angiosperm phylogeny: 17 genes, 640 taxa. Am J Bot 2011:ajb.1000404. - http://www.amjbot.org/cgi/reprint/ajb.1000404v1
Data: concatenated alignments for a superset of 14 loci/17 genes (nucleotide sequences) sampled from 640 species. Genes included 18S rDNA (nuc), 26S rDNA (nuc), atpB (cp), atp1 (mito), matK (cp), matR (mito), nad5 (mito), ndhF (cp), psbBTNH (cp 4 gene region), rbcL (cp), rpoC2 (cp), rps16 (cp), rps3 (mito), and rps4 (cp).
Alignment method: MAFFT was used to align each of the 14 loci; "adjustments were made by eye when there were obvious alignment errors due to particularly divergent or “gappy” sequences". Sites (columns) with > 50% missing data (including gaps due to indels) were removed using Phyutility (Smith and Dunn, 2008). All or subsets of the gene alignments were concatenated for phylogenetic analysis.
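The column-pruning step can be illustrated with a short Python sketch. This is not Phyutility's implementation; it only shows the rule stated above (drop columns where more than 50% of the sequences have missing data). The function name, the toy alignment, and the choice to treat "-" and "?" as missing are assumptions made for illustration.

```python
def prune_sparse_columns(alignment, max_missing=0.5, missing_chars="-?"):
    """Drop columns whose fraction of missing characters exceeds max_missing.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    missing_chars: characters counted as missing (an assumption of this sketch).
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same aligned length")

    keep = []
    for col in range(length):
        missing = sum(1 for seq in seqs if seq[col] in missing_chars)
        if missing / len(seqs) <= max_missing:
            keep.append(col)

    return {name: "".join(alignment[name][c] for c in keep) for name in names}


# Hypothetical toy alignment: the fourth column is missing in every taxon.
toy = {
    "taxonA": "ACG-T",
    "taxonB": "A-G-T",
    "taxonC": "AC--T",
    "taxonD": "ACG?T",
}
pruned = prune_sparse_columns(toy)
```

In a workflow record, the threshold (0.5 here) and the definition of "missing" are exactly the kind of details that a has_method_details string would need to capture.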
Tree estimation: Independent MP and ML analyses were performed on the following data matrices: nuclear rDNA genes; cp genes; mito genes; nuclear+cp genes; all 17 genes.
- Method (1) - ML; 10 independent runs for each data matrix.
- Program - RAxML (vers. 7.1; Stamatakis, 2006).
- Model of sequence evolution - GTRGAMMA with parameters estimated separately (unlinked) for each gene partition.
- Method for evaluating support - 100-300 bootstrap replicates
- Method (2) - MP; parsimony ratchet with 50 independent replicates, each run for 500 iterations; MP tree estimated as the majority-rule consensus of the best trees from each replicate; tree shown only for the 17-gene supermatrix.
- Program - SeqBoot (Phylip; Felsenstein, 2005), PAUPRat (Sikes and Lewis, 2001) and PAUP* 4.0b10 (Swofford, 2002).
- Method for evaluating support - bootstrap - 500 bootstrap datasets generated using SeqBoot. A PAUPRat ratchet file was generated for each pseudoreplicate and run for a single 500-iteration search.
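As a rough illustration of what "bootstrap datasets" means here, the following Python sketch builds pseudoreplicates by resampling alignment columns with replacement, which is the standard nonparametric bootstrap for sequence alignments. It is not SeqBoot's code; the function names, the seed, and the toy alignment are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudoreplicate: columns sampled with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    # Draw as many column indices as the alignment has columns.
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in cols) for name in names}


def bootstrap_datasets(alignment, n_replicates, seed=0):
    """Return a list of n_replicates pseudoreplicate alignments."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Hypothetical toy alignment; the study above generated 500 such datasets.
toy = {"t1": "ACGTAC", "t2": "ACGTTC", "t3": "AAGTTC"}
replicates = bootstrap_datasets(toy, n_replicates=5)
```

Each pseudoreplicate has the same taxa and the same number of columns as the input, so it can be passed to the same tree search as the original matrix.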
Additional comments: Trees available in TreeBASE - http://www.treebase.org/treebase-web/search/study/analyses.html?id=11267 ; Polyosma mtDNA loci omitted from analysis as contaminant after assessing discordance with other loci; Cardiopteris atp1 suspected as a contaminant, but retained.
semi-formalized description
Don't let this freak you out. The idea here is to see how detailed a description might have to be in order to be computable.
The main descriptive statement
publication Pub1 reports PhylogenyResult1.1 and PhylogenyResult2.1
About the publication
Pub1 has_authors "Soltis DE", "Smith SA", "Cellinese N" . . .
Pub1 has_citation "Am J Bot 2011:ajb.1000404" . . .
Pub1 has_URL . . .
About phylogeny result 1.1, which is a consensus tree (?). The value is either concrete (a Newick tree) or a pointer (to a TreeBASE accession or a NeXML object)
PhylogenyResult1.1 has_value . . . < concrete or referenced_by pointer >
PhylogenyResult1.1 has_input PhylogenyResult1.0
PhylogenyResult1.1 has_method MajorityRuleConsensus # ?? not sure
PhylogenyResult1.1 has_method_details "100 to 300 bootstrap replicates"
PhylogenyResult1.0 has_value NA # we are not showing all the bootstrap trees
PhylogenyResult1.0 has_input Alignment1.1
PhylogenyResult1.0 has_method Method1
About phylogeny result 2.1, which is a consensus tree
PhylogenyResult2.1 has_value . . . < concrete or referenced_by pointer >
PhylogenyResult2.1 has_input PhylogenyResult2.0
PhylogenyResult2.1 has_method MajorityRuleConsensus
PhylogenyResult2.1 has_method_details "not sure about this"
PhylogenyResult2.0 has_value NA # we are not showing all the bootstrap trees
PhylogenyResult2.0 has_input Alignment1.1
PhylogenyResult2.0 has_method Method2
ALIGNMENTS
About Alignment1.1, which is an edit of Alignment1.0, which is a concatenation
Alignment1.1 has_value . . . < concrete or referenced_by pointer >
Alignment1.1 has_input Alignment1.0
Alignment1.1 has_method Pruning
Alignment1.1 has_method_details "delete sites with >50% missing data"
Alignment1.0 has_value . . . < concrete or referenced_by pointer >
Alignment1.0 has_input Alignment2.1, Alignment3.1 . . . Alignment15.1
Alignment1.0 has_method Concatenate
About Alignment2.1, a component alignment edited from a MAFFT alignment
Alignment2.1 has_value . . . < concrete or referenced_by pointer >
Alignment2.1 has_input Alignment2.0
Alignment2.1 has_method EditByHand
Alignment2.1 has_method_details "remove divergent or gappy sequences"
Alignment2.0 has_value . . . < concrete or referenced_by pointer >
Alignment2.0 has_input . . . < list of GenBank accessions, ideally >
Alignment2.0 has_method MAFFT
Alignment2.0 has_method_details NA
Alignments 3.1 to 15.1 are similar: each one is a possibly edited version of a MAFFT alignment for an individual set of sequences.
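To show what "computable" could mean for statements like these, here is a minimal Python sketch that parses subject/predicate/value lines and walks the has_input links backwards from a phylogeny result. The parser, the function names, and the reduced statement list are assumptions for illustration: only the has_input and has_method statements are included, and only one of the component alignments (Alignment2.1) is listed as input to the concatenation.

```python
# A reduced, hypothetical subset of the statements above.
STATEMENTS = """
PhylogenyResult1.1 has_input PhylogenyResult1.0
PhylogenyResult1.1 has_method MajorityRuleConsensus  # ?? not sure
PhylogenyResult1.0 has_input Alignment1.1
PhylogenyResult1.0 has_method Method1
Alignment1.1 has_input Alignment1.0
Alignment1.1 has_method Pruning
Alignment1.0 has_input Alignment2.1
Alignment1.0 has_method Concatenate
Alignment2.1 has_input Alignment2.0
Alignment2.1 has_method EditByHand
Alignment2.0 has_method MAFFT
"""


def parse_statements(text):
    """Parse 'subject predicate value' lines into nested dicts of lists."""
    graph = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        subject, predicate, value = line.split(None, 2)
        graph.setdefault(subject, {}).setdefault(predicate, []).append(value)
    return graph


def trace_provenance(graph, start):
    """Follow has_input links breadth-first; return (object, methods) pairs."""
    steps, seen, queue = [], set(), [start]
    while queue:
        node = queue.pop(0)
        if node in seen:
            continue
        seen.add(node)
        record = graph.get(node, {})
        steps.append((node, record.get("has_method", [])))
        queue.extend(record.get("has_input", []))
    return steps


graph = parse_statements(STATEMENTS)
for obj, methods in trace_provenance(graph, "PhylogenyResult1.1"):
    print(obj, "<-", ", ".join(methods) or "unknown")
```

A traversal like this answers the question "how was this tree produced?" by listing each intermediate object together with the method that generated it.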
PHYLOGENY METHODS
Method1 has_attributes
* software RAxML
* software_version 7.1
* objective_function maximum_likelihood
* sitewise_model SiteWiseModel1
* among_site_model AmongSiteModel1
SiteWiseModel1 has_attributes
* GTR
AmongSiteModel1 has_attributes
* gamma
* partitions
Method2 has_attributes
* software PAUPRat
* software PAUP
* software_version 4.0b10
* objective_function maximum_parsimony
* search_method parsimony_ratchet
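One possible use of such attribute lists is checking a method record against a minimum-information checklist. The Python sketch below assumes a hypothetical set of required attributes (the REQUIRED tuple is not taken from any MIAPA document) and represents each method as a dict of attribute lists, with the sitewise and among-site models inlined for brevity.

```python
# Hypothetical checklist; a real MIAPA-like standard would define its own.
REQUIRED = ("software", "software_version", "objective_function")

# Method records transcribed from the attribute lists above.
method1 = {
    "software": ["RAxML"],
    "software_version": ["7.1"],
    "objective_function": ["maximum_likelihood"],
    "sitewise_model": ["GTR"],
    "among_site_model": ["gamma", "partitions"],
}
method2 = {
    "software": ["PAUPRat", "PAUP"],
    "software_version": ["4.0b10"],
    "objective_function": ["maximum_parsimony"],
    "search_method": ["parsimony_ratchet"],
}


def missing_attributes(record, required=REQUIRED):
    """Return the required attribute names that are absent or empty."""
    return [key for key in required if not record.get(key)]


for name, record in (("Method1", method1), ("Method2", method2)):
    gaps = missing_attributes(record)
    print(name, "complete" if not gaps else "missing: " + ", ".join(gaps))
```

Note that Method2 lists two programs but only one version, so a flat attribute list cannot say which program the version belongs to; a more detailed description might attach the version to each software entry.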