From Evolutionary Interoperability and Outreach
Revision as of 12:18, 13 April 2012 by Arlin (talk) (A provisional taxonomy of barriers to re-use)

This page consists of user stories from ourselves and from other scientists who were kind enough to share their comments and to give permission for this material to be used in Stoltzfus, et al., Sharing and ReUse of Phylogenetic Trees (and Associated Data) to Facilitate Synthesis.

A provisional taxonomy of barriers to re-use

This is an attempt to create a taxonomy, in hopes of clarifying ideas about barriers to re-use, to serve as a guide for more detailed attempts to characterize barriers.

Before acquiring potentially re-useable information...

  • How important are these barriers to identifying and locating relevant records?
    • resources exist but do people know of them (lack of marketing, advertising?)
    • lack of intelligent, comprehensive search interfaces
      • false positives, false negatives, search inadequacies
      • lack of indexing of much biological literature in PubMed, Scopus, WoS
    • lack of archiving of original data sets (i.e., archived electronic record does not exist)
  • How important are these barriers to accessing and downloading relevant records?
    • paywalls
    • ease of accessibility (e.g. TreeBASE 1 [bad] & TimeTree iPhone app [good] )

After acquiring re-useable information, but before re-use...

  • How important are these barriers to extracting re-useable information from accessed records?
    • extracting/obtaining the data (data is stored in inappropriate formats and/or not online)
    • interpreting the data presented (ambiguity and lack of adequate metadata)
  • How important are these other barriers to re-using information extracted from records?
    • inadequate taxon coverage
      • can't find a specific OTU of interest (e.g., new gene seq)
      • poor coverage of a group of interest
      • hard to pick and choose a set of OTUs that covers the problem without becoming unwieldy
    • lack of annotation of methods associated with the data (needed to replicate, assess quality)
    • potential for intellectual property issues (re-use rights for data are often not explicit)

When re-use happens...

  • How common are these different types of re-use?
    1. re-using statistical inferences such as trees, clades, dates of divergence
    2. re-using character data matrices
      • re-using aligned or homologized characters
      • re-using unaligned data
    3. re-using workflows or methods (e.g., as described in a paper)
  • How common are these different types of tangible outcomes?
    • re-use of older information is central to a new study
    • re-use of older information provides a counterpoint to a new analysis
    • re-use of older information provides perspective but not publishable results
  • How common are these different types of attributions?
    • authorship (on a new publication) by author of source publication that reported the re-used data
    • citation (in a new publication) of a source publication that reported the re-used data
    • acknowledgement (in a new publication) of individual sharing the data
    • noting or otherwise "liking" the source publication (or author, or archival record) in a public venue (e.g., citeulike)
    • a private letter of thanks to the authors of the re-used data
    • no public acknowledgement of any kind

A taxonomy of issues from the data producer perspective

Barriers to reaching a positive decision to archive

  • I don't know about resources (TreeBASE, Dryad, Paleobase)
  • I don't know about relevant policies of journals and funding agencies
  • I do not want to upload my data to a public archive because
    • I could get scooped
    • I want to control what it's being used for (not something stupid)
    • I want to ensure that I get credit-- so, potential users must contact me directly
    • I'll deny it if you ask me, but I am afraid of people trying to replicate my analysis
  • I am not opposed to archiving, but the benefits do not exceed the costs that I perceive, including
    • time to prepare
    • time to submit

Barriers to submitting a record to a chosen archive

  • the submission process is poorly documented; I don't know what to do
  • difficulty of gathering information needed for a complete and well annotated record
  • relevant information does not fit the data model assumed by the submission process
  • relevant information is not in the format required by the submission process
  • the submission process is buggy, it breaks for no apparent reason

Negative experiences after archiving

  • (do producers get credit?)

User stories

The user stories generally refer to specific studies, while the user comments below include generalizations and collected experiences.

Coevolution of mimicry in rift lake catfish

User: Jeremy Wright


See the cover article of Evolution for Feb, 2011 [1]

User steps before decision to re-use or not

  1. finding phylogenies
    • already had the sense that work of Day, et al. 2009[2] was the current definitive phylogenetic work
    • checked with google scholar, ISI WoS (e.g., "phylogeny synodontis" "phylogeny tanganyika")
    • would have known of other studies by word of mouth among scholars
  2. getting data
    • first author J. Day sent alignment file (cytb) and tree file (Bayesian tree and ultrametric tree)
    • not sure if these data were ever archived

Subsequent user steps

  1. validation: re-analyzed the tree using same phylo inference (as described by Day, et al., 2009[2])
    • methods description in Day, et al. was clear, listed out parameters for MrBayes
    • found topologies were the same, tiny diff in support values, assumes that tree is right
  2. tree from Day, et al., 2009[2] is used in Fig. 1, 6, 7
  3. Fig 7 age estimates (grey bars) were added graphically (Adobe illustrator) by tracing over an earlier figure, i.e., the node ages were never encoded mathematically but were transferred graphically from one figure to another
  4. character data (gross color patterns of fish)
    • mostly based on examining specimens in hand
    • but, in a few cases, based on examining image in published paper (original works were cited)
    • data matrix was not archived

Another story: work in preparation will re-use alignment from Day, et al by adding sequences (by hand, with possible re-alignment).

Molecular phylogenetics of the genus Neoconocephalus (katydids)

User: Katy Frederick


The study is reported in Molecular phylogenetics of the genus Neoconocephalus (Orthoptera, Tettigoniidae) and the evolution of temperate life histories.[3]

Reuse experience

  • Some questions and answers
    • did you access public data sets in conducting your phylogenetic studies? If so, which ones?
      • While working on this paper we did pull sequence from NCBI of a closely related species just to see where they would fall out in our tree
    • have you ever encountered difficulties in reusing data provided by other researchers in your studies? If so, could you please describe what difficulties you encountered?
      • Most of this work was with AFLPs. Although AFLPs are reproducible, they are not very comparable, which makes it difficult to share this type of data. Also, since we work with non-model organisms that are also not important for agriculture, there isn't really a lot of data out there for us to glean. I have been working on pulling sequence to compare to my mitogenomes, and it is difficult to sort through the taxonomy to find relevant sequence; there are many mislabeled and redundant sequences available for Orthoptera.
    • have you ever reused phylogenies developed by other researchers as part of your investigations?
      • There are very few phylogenies available for our study system. I have compared our phylogenies to those built from morphological characters and behavioral traits, but that is pretty much it... Most of these phylogenies were published years ago, so I just reproduced the newick format for purposes of comparing topology.
    • have you ever reused or replicated experimental protocols (in phylogenetic analysis) described by other researchers (e.g., described in scientific publications)?
      • I do this all of the time! Because we are using AFLPs, not all methods are available; having a larger data set is a limitation as well. I find trying to replicate phylogenetic methods a constant struggle. The biggest problem is that not everyone who makes a phylogeny knows what that means or what the computer was doing at the time... I am not saying I know more, but in replicating some tasks, it is obvious that the program could not have accomplished what people say. So you have to take the good with the bad. I generally read what the paper did and what program was used, and then try to use the program from the manual, not from the methods of a single paper. But in the past I have had to contact the program developers to complete my analysis.
    • have you ever been asked by other researchers to share your raw data or your phylogenies with them?
      • Here at MU there are quite a few people working with phylogenetics and genomics, we share data, and trees, and methods frequently, generally I just share my nexus files. I have not sent anyone outside of MU my datasets as of right now.
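The topology comparison described in this story (re-typing published trees in Newick format and comparing them to new trees) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `clades` and the example trees are hypothetical, the trees are treated as rooted, and quoted labels containing parentheses or commas are not handled. A real analysis would normally rely on an established phylogenetics library.

```python
import re

def clades(newick):
    """Return the set of clades (frozensets of leaf names) in a rooted Newick tree.

    Branch lengths and internal node labels are ignored.
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack = [[]]      # each entry collects the leaf sets of one open group
    found = set()
    prev = None
    for tok in tokens:
        if not tok.strip():
            continue  # skip whitespace between symbols
        if tok == "(":
            stack.append([])
        elif tok == ")":
            children = stack.pop()
            leaves = frozenset().union(*children)
            found.add(leaves)
            stack[-1].append(leaves)
        elif tok in (",", ";"):
            pass
        elif prev == ")":
            pass      # support value or branch length attached to an internal node
        else:
            name = tok.split(":")[0].strip()  # drop any branch length
            stack[-1].append(frozenset([name]))
        prev = tok
    return found

# Hypothetical trees, for illustration only.
published = "((A,B),(C,D));"
ours = "((A,C),(B,D));"
shared = clades(published) & clades(ours)
same_topology = clades(published) == clades(ours)
```

Two trees with the same rooted topology give equal clade sets regardless of the order in which sister groups are written, so `shared` lists the groupings on which the two hypotheses agree.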

Origin of new proteins in E. coli

User: Arlin Stoltzfus


User (Arlin Stoltzfus, for project of post-doc Guoqin Yu) is studying the origin of new proteins or "ORFans" in E. coli, using NCBI's clusters as a starting point. Some clusters are found only in E. coli strains, others are distributed more deeply. The design of the study is based on getting population statistics for ORFans, and comparing these to the same statistics for sets of genes that are distributed in broader clades, i.e., as determined phylogenetically. The control sets will be chosen based on reliable clades. For this, the user wants a species tree with wide coverage of proteobacterial genomes whose genes are in the clusters db and with reliability values on nodes.

User steps prior to decision to re-use or not

  1. user searches PubMed, Google for resources
  2. user disregards most of these resources, including
    • various web sites devoted to prokaryotic phylogeny that (apparently) do not provide a downloadable tree (e.g.,,
    • published species trees based on whole-genome distance methods that the user finds dubious (e.g., [4])
    • published species trees that are too old (limited in coverage) but use rigorous methods (e.g., the supertree method by [5])
  3. user identifies one tree that has the right coverage and quality [6]
    • this has a composite alignment (many genes) for 720 prokaryotes, and a phyml tree
    • but the tree does not have branch support values such as bootstraps
    • note that user has a personal connection to senior author of [6]
  4. user requests and receives alignment and phylogeny from original authors

End result of this is that the user decides to re-use the alignment, but compute a new tree for a subset of OTUs, in order to obtain bootstrap values.

Subsequent user steps

  1. user spends hours reconciling names (different naming scheme for OTUs in tree and alignment)
  2. user combines alignment and tree in one file using Mesquite
  3. user uses an interactive tool ( to prune the tree to the target group (proteobacteria)
  4. user runs a RAXML analysis on pruned alignment using the CIPRES server, including bootstrap replicates
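The name-reconciliation step above (different naming schemes for OTUs in the tree and the alignment) is the kind of task that can be partly automated. The following is a minimal sketch, assuming that the two naming schemes differ only in case, spacing, and punctuation; the function names and the labels are hypothetical, and labels that differ in substance (abbreviations, strain identifiers) would still need manual review.

```python
import re

def normalize(name):
    """Lower-case a label and collapse punctuation/whitespace to single underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")

def reconcile(tree_labels, alignment_labels):
    """Pair tree labels with alignment labels that agree after normalization.

    Returns (matched, tree_only, alignment_only).
    """
    by_key = {normalize(a): a for a in alignment_labels}
    matched = {}
    tree_only = []
    for t in tree_labels:
        key = normalize(t)
        if key in by_key:
            matched[t] = by_key[key]
        else:
            tree_only.append(t)
    used = set(matched.values())
    alignment_only = [a for a in alignment_labels if a not in used]
    return matched, tree_only, alignment_only

# Hypothetical labels, for illustration only.
tree_labels = ["Escherichia coli K12", "Salmonella_enterica", "Vibrio cholerae"]
alignment_labels = ["escherichia_coli_k12", "Salmonella enterica", "Yersinia_pestis"]
matched, tree_only, alignment_only = reconcile(tree_labels, alignment_labels)
```

The two "only" lists are the residue that a person still has to resolve by hand, which is usually a much shorter list than the full set of OTUs.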

Rodent systematics

User: Ryan W. Norris


User (Ryan W Norris) has a dual interest in rodent systematics, including classification of newly characterized species, and fossil calibration of molecular trees with respect to the KT explosion hypothesis. In most cases, he uses BEAST and PAUP* to compute trees from compound alignments, based on DNA sequences for mitochondrial and nuclear genes. [7]

User says "I use GenBank all the time", successfully, in spite of a few cases of things not in GenBank (obtained by querying author). User has not re-used trees. However, user has re-used data matrices and alignments.

User steps before decision to re-use or not

  1. searched for data from specific papers ("classic" papers well known in field)
  2. searched TB, trying to find which genes had big data sets for rodents
    • found hits to compound alignments
    • downloaded NEXUS
    • try to fill in gaps
      • tried to find additional genes in GenBank (i.e., updating data)
      • pruned out species with missing genes (MacClade)
      • added newly determined sequences to alignment (unpublished user data)
  3. also found morphological fossil data sets
    • identified and located from literature survey of field
    • data represented in published image of a NEXUS file (or, in some cases, PDF)
    • hand-entered data into electronic NEXUS file
    • wanted to coordinate with molecular tree, but it was too difficult
      • difficult to make constraints tree due to poor overlap in taxa between the molecular and morphological datasets
      • high proportion of missing data
      • unable to replicate results in original study (treatment of uncertain states possibly not as described in published paper)
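The step of pruning out species with missing genes (done interactively in MacClade in this story) can be sketched as a simple filter. This is an illustration only: the function names, taxon names, gene names, and sequences are hypothetical, and a sequence consisting only of gap or missing-data symbols is assumed to count as missing.

```python
def has_data(sequence):
    """True if the sequence contains something other than gap/missing symbols."""
    return bool(sequence) and bool(sequence.strip("-?"))

def prune_incomplete(taxon_genes, required_genes):
    """Keep only taxa that have data for every required gene.

    taxon_genes maps taxon -> {gene name -> aligned sequence}.
    Returns (kept, dropped).
    """
    kept = {}
    dropped = []
    for taxon, genes in taxon_genes.items():
        if all(has_data(genes.get(g, "")) for g in required_genes):
            kept[taxon] = genes
        else:
            dropped.append(taxon)
    return kept, dropped

# Hypothetical compound alignment, for illustration only.
taxon_genes = {
    "taxon_A": {"gene1": "ATGC", "gene2": "GGCA"},
    "taxon_B": {"gene1": "ATGA", "gene2": "????"},
    "taxon_C": {"gene1": "ATGT"},
}
kept, dropped = prune_incomplete(taxon_genes, ["gene1", "gene2"])
```

The `dropped` list doubles as a to-do list for the gap-filling step, since these are the taxa for which additional sequences would be sought in GenBank.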

Subsequent user steps

  1. after building up a data set starting from a re-used matrix
    • re-align
    • phylo analysis
    • results to go in publications in rodent systematics
  2. user re-uses his own old alignments frequently
    • user's 12S and 16S alignments (with structure notes) used as template for aligning new sequences

Molecular phylogeny of plant defense genes

User: Ian Major


Some references to published work (other phylogenies are in progress)

  • Functional analysis of the Kunitz trypsin inhibitor family in poplar [8]
  • Analysis of a wound-inducible acid phosphatase in poplar [9]

Reuse experience

  • Access to public data sets: I have never accessed any public phylogenetic data sets (I was involved in a project related to DNA barcoding, and although there were phylogenetic analyses associated with the study, the retrieved data was intended for DNA barcoding).
  • I have never asked for (nor been able to find) phylogenies developed by other researchers for my own use/reuse.
  • I have often tried to reuse or reference phylogenies reported by other researchers, and as suggested by the first question, the biggest hurdle is usually acquisition of the actual sequences used for those phylogenies. Most frustrating are phylogenies in which the sequences used are given generic names (which basically makes it impossible to replicate). Obviously, such cases make it impossible to reuse the phylogeny. In cases where I have succeeded in reusing a reported phylogeny, it has involved repeating the analysis from alignment through phylogenetic analysis.
  • I have tried to replicate phylogenetic protocols described in publications. Often this involves use of phylogeny software with which I am not familiar (and so in some cases requires a learning curve to use that software). In general I find that the actual phylogenetic methods in publications are poorly described and so are difficult to replicate (though this may not be specific to phylogenetic analyses, since methods in general are poorly described in scientific publications).
  • I have never been asked by other researchers to share my raw phylogenies, though there hasn't been much occasion for this.

Analysis of Leaf Vein Patterns

User: Ramona Walls


A successful story of re-use. User (Ramona Walls) carried out a highly integrative analysis [10] of leaf vein patterns that combined new measurements based on new images of cleared leaves from an existing collection, new characterizations of major vein patterns based on existing online image collections of leaves, an existing database of leaf characteristics (LES), and pruned versions of existing phylogenies (APG) accessed via phylomatic.

Primary & secondary vein type

  1. types were designated (as Fig 1a of [11]) by manual inspection of electronic images from online collections
    • the existence of the online collections from herbaria is common knowledge among botanists
    • however, user was not working from any comprehensive list of collections
    • some useful collections (e.g., Smithsonian Trop Res Inst collection) were discovered indirectly via Google image searches (search species name; find image; trace this back to "credible" source)
  2. sourced images were not saved
  3. vein types were stored in a spreadsheet archived as supplementary data to [12]

MVD (minor vein density) data for cleared leaves

  1. user made images of cleared leaf collection, in order to calculate MVD values
  2. existence of cleared leaf collection is not common knowledge ("my advisor told me about it")
  3. user's image collection was not archived, but has informational links to specimen vouchers
  4. MVD values were included in supplementary data to published paper, in response to a request from the reviewers

Leaf Economics Spectrum (LES)

  1. existence of LES data of Wright, et al. [13] is common knowledge among botanists; mentioned repeatedly at recent professional meetings
  2. see the note below about taxonomic identifiers

Phylogenies via phylomatic

  1. used online interface to phylomatic <> to get species trees
    • for designated sets of species (intersection of LES and leaf types, intersection of LES and MVD)
  2. user could not recall how she knew about phylomatic (perhaps existence of APG tree and Davies, et al tree [14] are common knowledge among botanists)
  3. interface requires family names in addition to genus and species
    • this required manual annotation of the list of names ("an exercise in plant taxonomy")
    • used IPNI and Tropicos, according to methods section of [15]
  4. species names must match those in phylomatic ("you have to go through and check your spellings")
    • names in LES do not always match those used for the same species by phylomatic
    • user was not aware of TNRS resource provided by iPlant (did not exist when study was performed?)
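The spelling check described above ("you have to go through and check your spellings") can be partly automated with approximate string matching. The sketch below uses Python's standard `difflib` module; the function name `check_names`, the cutoff value, and the name lists are assumptions for illustration. It does not query phylomatic, TNRS, or any other service, and it cannot resolve true synonyms, only near-identical spellings.

```python
import difflib

def check_names(query_names, reference_names, cutoff=0.85):
    """Classify each query name as an exact match, a likely misspelling, or missing.

    Returns a dict: name -> (status, suggested reference name or None).
    """
    reference = set(reference_names)
    report = {}
    for name in query_names:
        if name in reference:
            report[name] = ("exact", name)
        else:
            close = difflib.get_close_matches(name, reference_names, n=1, cutoff=cutoff)
            report[name] = ("suggest", close[0]) if close else ("missing", None)
    return report

# Hypothetical name lists, for illustration only.
query = ["Quercus rubra", "Acer sacharum", "Ginkgo biloba"]
reference = ["Quercus rubra", "Acer saccharum", "Pinus strobus"]
report = check_names(query, reference)
```

Names flagged as "suggest" still deserve a manual look, because two valid species names can differ by a single letter.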

User comments

Sudhir Kumar

  1. Incomplete data submitted to GenBank: This is not too common but it happens more often than I like. Usually the paper shows a list of accession numbers but sometimes it does not include all the reported sequences. It has happened to me twice that I asked for the data and was given some sort of excuse.
  2. Authors may not provide the alignment, and sometimes it cannot be accurately reproduced. That affects our ability to relate the molecular data and the metadata.
  3. Poor coding and lack of access: I have had this problem with clinical data. Clinical data, even anonymized data (without patient identifiers), is seldom provided beyond averages or summary tables, so it is hard to associate a sequence with a case definition. I believe this is one of the most pressing matters for those interested in associating genotypes with phenotypes of health interest. There are relatively poor standards for reporting case definitions. I believe we could invest time discussing the issue of how to code case definitions in a way that can be anonymized and made available for association analyses. You may not recall, but we tried to contact CDC and persuade them about the importance of relational databases linking clinical and molecular data.
  4. Poor specimen identification. Sometimes specimens are not properly identified. In the case of microbes this is OK because the authors explicitly state that they could not get another form of identification.
  5. I had a problem understanding the parameters used in some molecular clock studies during the survey we worked on to figure out how researchers use calibrations. Many studies (the ones I looked at were 9 studies published in 2009) did not report their full parameters; for example, they said they used a lognormal prior distribution for the calibrations without giving its mean and standard deviation (I can send you the references for the papers I looked at, if you need them).
  6. Another problem I encountered is mistakes in the published files. For example, in one study, the whole input file used in BEAST for phylogeny and timing estimations in the supplementary material did not work. I tried to use this file directly but it did not work, and I found out, after emailing the authors, that the published file had a mistake in it that would prevent it from working in the software. The problem was solved by contacting the authors and asking for the correct file.

Ross Mounce and Anne O'Connor

  1. User (Ross Mounce) doubts 'controversial' new phylogeny and wished to replicate the analysis (as a first step). User copies data matrix from the supplementary information and reformats it into a PAUP* readable file. User re-analyses data and finds a very different result (topologically). Despite some settings being specified in the supplementary materials (which the user follows), user still can't replicate the published phylogeny (not even close). User contacts first author of original paper and is told they are busy. User informs editor and writes letter to the journal. See Liu et al 2011[16] and Mounce & Wills 2011 [17]
  2. User (Ross Mounce), wanting to re-use data for re-purposing, spots a few mistakes in a recently published data matrix: the matrix is missing a few states for a few taxa, rendering it unusable, and the phylogenetic analysis unrepeatable (unless one contacts the authors to ascertain what the missing states are). User informs the authors and an erratum is published in the very next issue <>.
  3. User (Anne O'Connor, University of Bath): uses whatever kind of tree she can find; they are usually molecular. If user has a choice of trees in a paper, she usually goes with the molecular one. User is not sure that she has ever used TreeBASE. User had a quick look in it, but didn't do any 'real' searches: "There are more than enough bird trees out there in papers just doing a general search, but the part about having to generate the Newick format [by hand or using TreeSnatcher] is true though (and a pain in the arse!). My main problem is finding the fossil first and last dates for taxa. I don't really have a problem with the trees and I don't use the data matrices."
  4. User (Ross Mounce). Accessing data for repurposing: Sauropterygia. User wants to re-analyze Wu EA 2011 <>, reads paper. Only character codings for one taxon given in that paper, referred to another paper (Holmes EA 2008) for the rest. User gets Holmes EA 2008 [1] and discovers it only has part of the matrix too, referred to earlier paper for rest of matrix. Earlier paper (Rieppel EA 2002 <;2>) STILL does not contain a full matrix, paper refers user to Rieppel, 1999 <> which mercifully does contain a full matrix. However, the user notes that the pdf for this is an image, so the user cannot just copy-out the matrix and would have to manually type it out line-by-line. User notes that because of subtle changes between versions of the matrix, one can only recompose the Wu et al 2011 matrix if one has ALL the matrices in between.
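The last story illustrates why every intermediate publication is needed when each paper reports only its changes to an earlier matrix. A minimal sketch of that dependency follows, assuming each matrix can be represented as a mapping from taxon to character states; the function name `recompose` and all taxa and states are hypothetical and are not taken from the papers cited above.

```python
def recompose(base_matrix, *revisions):
    """Apply successive partial revisions to a full base matrix.

    Each matrix maps taxon -> {character number -> state}.
    Later revisions override earlier states and may add taxa or characters.
    The input matrices are left unmodified.
    """
    result = {taxon: dict(states) for taxon, states in base_matrix.items()}
    for revision in revisions:
        for taxon, states in revision.items():
            result.setdefault(taxon, {}).update(states)
    return result

# Hypothetical matrices, for illustration only.
base = {"taxon_A": {1: "0", 2: "1"}, "taxon_B": {1: "1", 2: "0"}}
revision_1 = {"taxon_B": {2: "1"}, "taxon_C": {1: "0", 2: "?"}}  # recodes B, adds C
revision_2 = {"taxon_D": {1: "1", 2: "1"}}                       # adds D only

full = recompose(base, revision_1, revision_2)
incomplete = recompose(base, revision_2)  # skipping revision_1 loses C and the recoding of B
```

Comparing `full` with `incomplete` shows that the latest paper plus the oldest full matrix is not enough: the recoded state and the taxon introduced in the intermediate revision are silently lost.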

Emily Gillespie

User 1 from Emily: Researcher re-uses various data forms (morphological matrices, sequence data, trees) frequently in teaching, and uses morphological matrices and sequence data infrequently in research.

  • wild ginger
  • ITS sequences, published early 1990's, author retired, could not locate data (alignment not published); ended up determining sequences again

User 1's concerns are attribution issues and integrity of the original data, free/open access, and 'community standards' in reporting raw data (e.g. GenBank, where standards are quite flexible and mostly not mandatory). This user feels that scientists must adopt 'best practices' first, then convince journals to require these best practices.

  • reasons users don't want to share data are unclear-- don't want data involved in project of high quality?

User 2 from Emily: Researcher 2 has re-used anatomical data sets, sequence data, and morphological data sets, but has not re-analyzed anatomical or morphological data (instead, has re-evaluated them in light of new phylogenetic hypotheses).

Researcher 2's concerns are mostly about data quality (e.g. from GenBank, concerned about sources), but also about best practices in analyses, which is why Researcher 2 does not re-use trees from databases (Researcher 2 hadn't heard of TreeBASE).


  1. <pubmed>20964683</pubmed>
  2. <pubmed>19226415</pubmed>
  3. <pubmed>19779617</pubmed>
  4. <pubmed>15166018</pubmed>
  5. <pubmed>12097345</pubmed>
  6. <pubmed>20033048</pubmed>
  7. <pubmed>15120394</pubmed>
  8. <pubmed>18024557</pubmed>
  9. <pubmed>20129630</pubmed>
  10. <pubmed>21613113</pubmed>
  11. <pubmed>21613113</pubmed>
  12. <pubmed>21613113</pubmed>
  13. <pubmed>15103368</pubmed>
  14. <pubmed>14766971</pubmed>
  15. <pubmed>21613113</pubmed>
  16. <pubmed>21350485</pubmed>
  17. <pubmed>21833044</pubmed>