From Evolutionary Interoperability and Outreach
Revision as of 14:23, 25 March 2011 by Arlin

This page consists of content generated by the MIAPA survey team.

A provisional taxonomy of barriers to re-use

This is our space to clarify our ideas about barriers to re-use, so that we can ask the right questions on the survey. If we want to use the survey to assess the importance of various potential barriers to re-use, we need to develop specific ideas about what those barriers are.

Before acquiring potentially re-useable information...

  • How important are these barriers to identifying and locating relevant records?
    • resources exist, but people do not know of them (lack of marketing or advertising?)
    • lack of intelligent, comprehensive search interfaces
      • false positives, false negatives, search inadequacies
      • lack of indexing of much biological literature in PubMed, Scopus, WoS
    • lack of archiving of original data sets (i.e., archived electronic record does not exist)
  • How important are these barriers to accessing and downloading relevant records?
    • paywalls
    • ease of accessibility (e.g. TreeBASE 1 [bad] & TimeTree iPhone app [good] )

After acquiring re-useable information, but before re-use...

  • How important are these barriers to extracting re-useable information from accessed records?
    • extracting/obtaining the data (data is stored in inappropriate formats and/or not online)
    • interpreting the data presented (ambiguity and lack of adequate metadata)
  • How important are these other barriers to re-using information extracted from records?
    • inadequate taxon coverage
      • can't find a specific OTU of interest (e.g., new gene seq)
      • poor coverage of a group of interest
      • hard to pick and choose a set of OTUs that covers the problem without becoming unwieldy
    • lack of annotation of methods associated with the data (needed to replicate, assess quality)
    • potential for intellectual property issues (re-use rights for data are often not explicit)

When re-use happens...

  • How common are these different types of re-use?
    1. re-using statistical inferences such as trees, clades, dates of divergence
    2. re-using character data matrices
      • re-using aligned or homologized characters
      • re-using unaligned data
    3. re-using workflows or methods (e.g., as described in a paper)
  • How common are these different types of tangible outcomes?
    • re-use of older information is central to a new study
    • re-use of older information provides a counterpoint to a new analysis
    • re-use of older information provides perspective but not publishable results
  • How common are these different types of attributions?
    • authorship (on a new publication) by author of source publication that reported the re-used data
    • citation (in a new publication) of a source publication that reported the re-used data
    • acknowledgement (in a new publication) of individual sharing the data
    • noting or otherwise "liking" the source publication (or author, or archival record) in a public venue (e.g., citeulike)
    • a private letter of thanks to the authors of the re-used data
    • no public acknowledgement of any kind

A taxonomy of issues from the data producer perspective

Barriers to reaching a positive decision to archive

  • I don't know about resources (TreeBASE, Dryad, Paleobase)
  • I don't know about relevant policies of journals and funding agencies
  • I do not want to upload my data to a public archive because
    • I could get scooped
    • I want to control what it's being used for (not something stupid)
    • I want to ensure that I get credit-- so, potential users must contact me directly
    • I'll deny it if you ask me, but I am afraid of people trying to replicate my analysis
  • I am not opposed to archiving, but the benefits do not exceed the costs that I perceive, including
    • time to prepare
    • time to submit

Barriers to submitting a record to a chosen archive

  • the submission process is poorly documented. I don't know what to do.
  • difficulty of gathering information needed for a complete and well annotated record
  • relevant information does not fit the data model assumed by the submission process
  • relevant information is not in the format required by the submission process
  • the submission process is buggy, it breaks for no apparent reason

Negative experiences after archiving

  • (do producers get credit?)

User stories

This is a place for us to record stories from talking directly to users who have re-used data, or have tried to re-use data.

Genome Wide Analysis of Eukaryote Thaumatin-like Proteins

Background: see the article "Genome-wide analysis of eukaryote thaumatin-like proteins (TLPs) with an emphasis on poplar" by Benjamin Petre, Ian Major, Nicolas Rouhier and Sébastien Duplessis, BMC Plant Biology 2011, 11:33.

Reuse experience

  • Access to public data sets: I have never accessed any public phylogenetic data sets (I was involved in a project related to DNA barcoding, and although there were phylogenetic analyses associated with the study, the retrieved data was intended for DNA barcoding).
  • I have never asked for (nor been able to find) phylogenies developed by other researchers for my own use/reuse.
  • I have often tried to reuse or reference phylogenies reported by other researchers, and as suggested by the first question, the biggest hurdle is usually acquisition of the actual sequences used for those phylogenies. Most frustrating are phylogenies in which the sequences used are given generic names (which basically makes it impossible to replicate). Obviously, such cases make it impossible to reuse the phylogeny. In cases where I have succeeded in reusing a reported phylogeny, it has involved repeating the analysis from alignment through phylogenetic analysis.
  • I have tried to replicate phylogenetic protocols described in publications. Often this involves use of phylogeny software with which I am not familiar (and so in some cases there is a learning curve before I can use that software). In general I find that the actual phylogenetic methods in publications are poorly described, which makes them difficult to replicate (though this may not be specific to phylogenetic analyses, since methods in general are poorly described in scientific publications).
  • I have never been asked by other researchers to share my raw phylogenies, though I admit this recent poplar TLP paper is my first publication that is primarily a phylogenetic analysis (I have a couple others in the pipeline).

Coevolution of mimicry in rift lake catfish

Background see the cover article of Evolution for Feb, 2011 by Jeremy Wright

User steps before decision to re-use or not

  • finding phylogenies
    • already had the sense that work of Day, et al. 2009 was the current definitive phylogenetic work
    • checked with google scholar, ISI WoS (e.g., "phylogeny synodontis" "phylogeny tanganyika")
    • would have known of other studies by word of mouth among scholars
  • getting data
    • first author J. Day sent alignment file (cytb) and tree file (Bayesian tree and ultrametric tree)
    • not sure if these data were ever archived

Subsequent user steps

  • validation: re-analyzed the tree using same phylo inference (as described by Day, et al., 2009)
    • methods description in Day, et al. was clear, listed out parameters for MrBayes
    • found topologies were the same, tiny diff in support values, assumes that tree is right
  • tree from Day, et al., 2009 is used in Fig. 1, 6, 7
  • Fig 7 age estimates (grey bars) were added graphically (Adobe illustrator) by tracing over an earlier figure, i.e., the node ages were never encoded mathematically but were transferred graphically from one figure to another
  • character data (gross color patterns of fish)
    • mostly based on examining specimens in hand
    • but, in a few cases, based on examining image in published paper (original works were cited)
    • data matrix was not archived

Another story: work in preparation will re-use alignment from Day, et al by adding sequences (by hand, with possible re-alignment).

User Feedback from Emily

User 1 from Emily: Researcher re-uses various data forms (morphological matrices, sequence data, trees) frequently in teaching, and uses morphological matrices and sequence data infrequently in research.

  • wild ginger
  • ITS sequences, published early 1990's, author retired, could not locate data (alignment not published); ended up determining sequences again

User 1's concerns are attribution issues and integrity of the original data, free/open access, and 'community standards' in reporting raw data (e.g., GenBank, where standards are quite flexible and mostly not mandatory). This user feels that scientists must adopt 'best practices' first, then convince journals to require these best practices.

  • reasons users don't want to share data are unclear-- don't want data involved in project of high quality?

User 2 from Emily: Researcher 2 has re-used anatomical data sets, sequence data, and morphological data sets, but has not re-analyzed anatomical or morphological data (instead, has re-evaluated them in light of new phylogenetic hypotheses).

Researcher 2's concerns are mostly about data quality (e.g., the sources of data in GenBank), but also about best practices in analyses, which is why Researcher 2 does not re-use trees from TreeBASE or other databases. (Researcher 2 hadn't heard of TreeBASE.)

Molecular phylogenetics of the genus Neoconocephalus (katydids)


Snyder RL, Frederick-Hudson KH, Schul J. PLoS One. 2009 Sep 25;4(9):e7203.

Reuse experience

  • Some questions and answers
    • did you access public data sets in conducting your phylogenetic studies? If so, which ones?
      • While working on this paper we did pull sequence from NCBI of a closely related species just to see where they would fall out in our tree
    • have you ever encountered difficulties in reusing data provided by other researchers in your studies? If so, could you please describe what difficulties you encountered?
      • Most of this work was with AFLPs. Although AFLPs are reproducible, they are not very comparable, which makes it difficult to share this type of data. Also, since we work with non-model organisms that are also not important for agriculture, there isn't really a lot of data out there for us to glean. I have been working on pulling sequence to compare to my mitogenomes, and it is difficult to sort through the taxonomy to find relevant sequence; there are many mislabeled and redundant sequences available for Orthoptera.
    • have you ever reused phylogenies developed by other researchers as part of your investigations?
      • There are very few phylogenies available for our study system. I have compared our phylogenies to those built from morphological characters and behavioral traits, but that is pretty much it... Most of these phylogenies were published years ago, so I just reproduced the newick format for purposes of comparing topology.
    • have you ever reused or replicated experimental protocols (in phylogenetic analysis) described by other researchers (e.g., described in scientific publications)?
      • I do this all of the time! Because we are using AFLPs, not all methods are available, and having a larger data set is a limitation as well. I find trying to replicate phylogenetic methods a constant struggle. The biggest problem is that not everyone who makes a phylogeny knows what that means or what the computer was doing at the time... I am not saying I know more, but in replicating some tasks, it is obvious that the program could not have accomplished what people say. So you have to take the good with the bad: I generally read what the paper did and what program was used, and then try to use the program from the manual, not from the methods of a single paper. But in the past I have had to contact the program developers to complete my analysis.
    • have you ever been asked by other researchers to share your raw data or your phylogenies with them?
      • Here at MU there are quite a few people working with phylogenetics and genomics, we share data, and trees, and methods frequently, generally I just share my nexus files. I have not sent anyone outside of MU my datasets as of right now.
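
The topology comparison mentioned in the answers above (retyping an older published tree as Newick, then comparing it with a new tree) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure the respondent used: the function names parse_newick and clades are made up here, the taxon labels A to D are placeholders, quoted labels are not handled, and both trees are treated as rooted.

```python
def parse_newick(text):
    """Parse a Newick string into nested tuples of leaf names.

    Minimal sketch: branch lengths and internal node labels are skipped,
    and quoted labels are not supported.
    """
    text = "".join(text.split()).rstrip(";")  # drop whitespace and final ";"
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:":
            pos += 1
        return text[start:pos]

    def read_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [read_node()]
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(read_node())
            pos += 1  # consume ")"
            read_label()  # skip an internal label or support value
            node = tuple(children)
        else:
            node = read_label()
        if pos < len(text) and text[pos] == ":":
            pos += 1
            read_label()  # skip the branch length
        return node

    return read_node()


def clades(node):
    """Return the set of clades in a parsed tree, each a frozenset of leaves."""
    found = set()

    def leaves(n):
        if isinstance(n, str):
            return frozenset([n])
        members = frozenset().union(*(leaves(child) for child in n))
        found.add(members)
        return members

    leaves(node)
    return found


# Placeholder trees: a hand-typed copy of a published tree and a new estimate.
retyped = clades(parse_newick("((A,B),(C,D));"))
new_tree = clades(parse_newick("((D:0.1,C:0.2),(B,A));"))
print("same rooted topology:", retyped == new_tree)
print("clades only in the retyped tree:", retyped - new_tree)
```

Comparing sets of clades ignores the order in which children were typed, which is the main source of spurious differences when a tree is re-entered by hand. For unrooted trees, bipartitions would be the appropriate unit instead.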

Sudhir's comments

  1. Incomplete data submitted to GenBank: This is not too common, but it happens more often than I would like. Usually the paper shows a list of accession numbers, but sometimes the list does not include all the reported sequences. It has happened to me twice that I asked for the data and was given some sort of excuse.
  2. Authors may not provide the alignment, and sometimes it cannot be accurately reproduced. That affects our ability to relate the molecular data and the metadata.
  3. Poor coding and lack of access: I have had this problem with clinical data. Clinical data, even when anonymized (without patient identifiers), are seldom provided beyond averages or summary tables, so it is hard to associate a sequence with a case definition. I believe this is one of the most pressing matters for those interested in associating genotypes with phenotypes of health interest. There are relatively poor standards for reporting case definitions. I believe we could invest time discussing the issue of how to code case definitions in a way that can be anonymized and made available for association analyses. You may not recall, but we tried to contact CDC and persuade them about the importance of relational databases linking clinical and molecular data.
  4. Poor specimen identification. Sometimes specimens are not properly identified. In the case of microbes this is OK, because the authors explicitly state that they could not get any other form of identification.
  5. I had problems understanding the parameters used in some molecular clock studies during the survey we worked on to figure out how researchers use calibrations. Many studies (I looked at 9 studies published in 2009) did not report their full parameters; for example, they said they used a lognormal prior distribution for the calibrations without giving its mean and standard deviation (I can send you the references for the papers I looked at, if you need them).
  6. Another problem I encountered is mistakes in the published files. For example, in one study, the whole input file used in BEAST for phylogeny and timing estimations in the supplementary material did not work. I tried to use this file directly, but it did not work, and I found out, after emailing the authors, that the published file had a mistake in it that would prevent it from working in the software. The problem was solved by contacting the authors and asking for the correct file.
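
To illustrate why the unreported calibration parameters in item 5 matter, here is a minimal sketch showing how much a lognormal prior changes with its parameters. All numbers are invented for illustration and do not come from any of the studies mentioned. The function name lognormal_summary and the offset argument (standing for a hypothetical minimum age) are assumptions, and mu and sigma are taken to be the mean and standard deviation in log space; software may instead accept the mean in real space, which is one more detail a paper would need to report.

```python
import math


def lognormal_summary(mu, sigma, offset=0.0):
    """Median and central 95% interval of offset + LogNormal(mu, sigma).

    mu and sigma are the mean and standard deviation of the log of the
    variable (an assumption; some programs parameterize differently).
    """
    z = 1.959964  # 97.5th percentile of the standard normal distribution
    median = offset + math.exp(mu)
    lower = offset + math.exp(mu - z * sigma)
    upper = offset + math.exp(mu + z * sigma)
    return median, lower, upper


# Three hypothetical priors that a paper could all describe as
# "a lognormal prior with a minimum age of 50".
for mu, sigma in [(1.0, 0.5), (1.0, 1.5), (3.0, 0.5)]:
    median, lower, upper = lognormal_summary(mu, sigma, offset=50.0)
    print(f"mu={mu} sigma={sigma}: median={median:.1f}, "
          f"95% interval=({lower:.1f}, {upper:.1f})")
```

Running the loop shows that the same verbal description covers priors with very different medians and widths, so an analysis cannot be replicated from the description alone.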

Ross's comments

  1. User disagrees with 'controversial' new phylogeny and wished to replicate the analysis (as a first step). User copies data matrix from the supplementary information and reformats it into a PAUP* readable file. User re-analyses data and finds a very different result (topologically). Despite some settings being specified in the supplementary materials (which the user follows), user still can't replicate the published phylogeny (not even close). User contacts first author of original paper and is told "Sorry I'm busy atm, can I get back to you next week?" [story still in progress, about this paper: Liu et al 2011 Nature <>]
  2. User (wanting to re-use data for re-purposing) spots a few mistakes in a recently published data matrix - the matrix is missing a few states for a few taxa, rendering it unusable, and the phylogenetic analysis unrepeatable (unless one contacts the authors to ascertain what the missing states are). User informs the authors and an erratum is published in the very next issue <>. User (who is a young grad student, whose name has never appeared in print before) feels mildly disgruntled that their small contribution "spotting the mistake" wasn't acknowledged in the erratum.
  3. User wanted to find all morphology-generated bird (Aves) phylogenetic hypotheses (trees-only) for testing stratigraphic congruence with the fossil record. User finds very few in TreeBASE. User has difficulty searching for morphology-inferred trees - lots of molecular generated hypotheses that are hard to filter out of the search strategy. When user does find appropriate studies, user re-enters the topological data into a PAUP* readable Newick format (for later use) manually. Very tedious. User tried TreeSnatcher on some of the larger trees (taxa-wise) but found it to be equally time-consuming relative to doing it completely manually.
  4. Accessing data for repurposing: Sauropterygia. User wants to re-analyze Wu EA 2011 <>, reads paper. Only character codings for one taxon are given in that paper, which refers to another paper (Holmes EA 2008) for the rest. User gets Holmes EA 2008 <;2.> and discovers it only has part of the matrix too, and refers to an earlier paper for the rest of the matrix. The earlier paper (Rieppel EA 2002 <;2>) STILL does not contain a full matrix; it refers the user to Rieppel, 1999 <>, which mercifully does contain a full matrix. However, the user notes that the PDF for this is an image, so the user cannot just copy out the matrix and would have to manually type it out line-by-line. User strongly doubts that Rieppel, if emailed, would be able to provide a NEXUS file for a study from over ten years ago. User notes that because of subtle changes between versions of the matrix, one can only recompose the Wu et al 2011 matrix if one has ALL the matrices in between. User thinks this is outrageous.
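
The dependency described in item 4, where the latest matrix can only be rebuilt from a full base matrix plus every later revision, can be sketched as follows. This is a toy illustration: the function name recompose_matrix, the taxon names (Taxon_A and so on), and all character states are invented and are not taken from the Sauropterygia papers.

```python
def recompose_matrix(base, *updates):
    """Apply successive partial matrices on top of a full base matrix.

    Each matrix maps taxon name -> {character number: state}. Later
    revisions override earlier codings and may add taxa or characters.
    """
    matrix = {taxon: dict(states) for taxon, states in base.items()}
    for update in updates:
        for taxon, states in update.items():
            matrix.setdefault(taxon, {}).update(states)
    return matrix


# Invented data: one full matrix followed by three partial revisions.
base = {"Taxon_A": {1: "0", 2: "1"}, "Taxon_B": {1: "1", 2: "1"}}
revision_1 = {"Taxon_A": {2: "0"}}            # one state recoded
revision_2 = {"Taxon_C": {1: "0", 2: "?"}}    # one taxon added
revision_3 = {"Taxon_D": {1: "1", 2: "0"}}    # one taxon added

full = recompose_matrix(base, revision_1, revision_2, revision_3)
skipped = recompose_matrix(base, revision_2, revision_3)

print("Taxon_A, character 2, all revisions:", full["Taxon_A"][2])
print("Taxon_A, character 2, revision 1 skipped:", skipped["Taxon_A"][2])
```

Skipping any intermediate revision silently yields a matrix that differs from the one actually analyzed, which is why a single archived copy of the final matrix would remove the whole chain of dependencies.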

Phylogenetic context for a protein structural novelty

I think of this as a typical naive user story.

Background User solved the 3D structure of CAP from M. tuberculosis ([1]) and found that the allosteric transition is different from that of the classic E. coli structure. User wants to put this in a phylogenetic context. CAP has been studied for many years and is a paradigm for transcriptional regulation and allosteric transition. Any new insights on this protein will be important.

User steps before decision to re-use or not

  1. user wants to see species tree to get a sense of distance between E. coli & M. tuberculosis
    • googled "bacterial phylogeny"
      • found some tree images
      • downloaded and printed tree images of rDNA-based species tree
      • mentally composed a supertree
    • user was not aware of any other resources to search for a phylogeny
  2. user wants to get multiple sequence alignment
    • user did not think of using protein family resources like pfam (assumed that he would not have enough control)
    • uses NCBI blast to find close relatives of E. coli and M. tuberculosis sequences
    • selects a sampling of sequences to cover available diversity, using species tree as a guide
    • downloads sequences in FASTA format
    • runs clustalw from a web server
      • does this in a progressive fashion, adding seqs incrementally to understand alignment better
      • prunes redundancy by eye
  3. user prints alignment, highlights it with residues implicated structurally in allosteric transition
  4. user consults phylogeny expert for further advice
    • expert identifies source of species tree in TreeBASE
    • expert provides literature on "functional shift" methods
    • expert shows Pfam resource with CAP family, interface to select taxa (hierarchically) for alignment

Subsequent user steps This project is just beginning.

Origin of new proteins in E. coli

Background User is studying the origin of new proteins or "ORFans" in E. coli, using NCBI's clusters as a starting point. Some clusters are found only in E. coli strains, others are distributed more deeply. The design of the study is based on getting population statistics for ORFans, and comparing these to the same statistics for "normal" genes. The control sets are genes that have existed for long periods of time, as determined phylogenetically. For this, the user wants a tree

  • that is a species tree
  • that has wide coverage of proteobacterial genomes whose genes are in the clusters db
  • ideally, with reliability values on nodes.

User steps prior to decision to re-use or not

  • user searches PubMed, Google for resources
  • user disregards most of these resources, including
    • various web sites devoted to prokaryotic phylogeny that (apparently) do not provide a downloadable tree (e.g.,,
    • published species trees based on whole-genome distance methods that the user finds dubious (e.g., Henz2004)
    • published species trees that are too old (limited in coverage) but use rigorous methods (e.g., the supertree method by Daubin2002)
  • user identifies one tree that has the right coverage and quality (Wu2009)
    • this has a composite alignment (many genes) for 720 prokaryotes, and a phyml tree
    • but the tree does not have branch support values such as bootstraps
    • note that user has a personal connection to senior author of Wu2009
  • user requests and receives alignment and phylogeny from original authors

End result of this is that the user decides to re-use the alignment, but compute a new tree for a subset of OTUs, in order to obtain bootstrap values.

Subsequent user steps

  • user spends hours reconciling names (different naming scheme for OTUs in tree and alignment)
  • user combines alignment and tree in one file using Mesquite
  • user uses an interactive tool ( to prune the tree to the target group (proteobacteria)
  • user runs a RAxML analysis on pruned alignment using the CIPRES server, including bootstrap replicates
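
The name reconciliation step in the list above took the user hours by hand. A rough sketch of how part of it might be automated is given below. The actual naming schemes in the tree and alignment files are not recorded here, so the labels in the example are invented, and the function names normalize and reconcile are hypothetical. The sketch only catches differences in case, spacing and punctuation; abbreviated or renamed OTUs still need manual attention.

```python
import re


def normalize(name):
    """Reduce an OTU label to a comparable key (lowercase alphanumerics)."""
    return re.sub(r"[^a-z0-9]+", "", name.lower())


def reconcile(tree_names, alignment_names):
    """Match tree labels to alignment labels by normalized key.

    Returns (matches, tree_only, alignment_only). If two alignment labels
    normalize to the same key, only the last one is kept, so such
    collisions should be checked separately.
    """
    by_key = {normalize(n): n for n in alignment_names}
    matches, tree_only = {}, []
    for name in tree_names:
        partner = by_key.pop(normalize(name), None)
        if partner is None:
            tree_only.append(name)
        else:
            matches[name] = partner
    return matches, tree_only, sorted(by_key.values())


# Invented labels showing typical mismatches between two files.
tree_names = ["Escherichia_coli_K12", "Salmonella enterica LT2", "Vibrio_cholerae"]
alignment_names = ["escherichia coli k12", "Salmonella_enterica_LT2", "V_cholerae"]

matches, tree_only, alignment_only = reconcile(tree_names, alignment_names)
print("matched:", matches)
print("only in tree:", tree_only)
print("only in alignment:", alignment_only)
```

The two leftover lists are the part that still requires a person: they show exactly which OTUs need a manual decision before the tree and alignment can be combined.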

Rodent systematics

Background User has a dual interest in rodent systematics, including classification of newly characterized African species, and fossil calibration of molecular trees with respect to the KT explosion hypothesis. In most cases, he uses BEAST to compute trees from compound alignments, based on DNA sequences for mitochondrial and nuclear genes. Norris2004

User says "I use GenBank all the time", successfully, in spite of a few cases of things not in GenBank (obtained by querying author). User has not re-used trees. However, user has re-used data matrices and alignments.

User steps before decision to re-use or not

  • searched for data from specific papers ("classic" papers well known in field)
  • searched TB, trying to find which genes had big data sets for rodents
    • found hits to compound alignments
    • downloaded NEXUS
    • try to fill in gaps
      • tried to find additional genes in GenBank (i.e., updating data)
      • pruned out species with missing genes (MacClade)
      • added newly determined sequences to alignment (unpublished user data)
  • also found morphological fossil data sets
    • identified and located from literature survey of field
    • data represented in published image of a NEXUS file (or, in some cases, PDF)
    • hand-entered data into electronic NEXUS file
    • wanted to coordinate with molecular tree, but it was too difficult
      • difficult to make constraints tree
      • high proportion of missing data
      • unable to replicate results in original study (treatment of uncertain states possibly not as described in published paper)
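
The step of pruning out species with missing genes (done in MacClade by this user) can be sketched as a simple filter over a compound alignment. This is an illustration only: the function name prune_incomplete, the max_missing parameter, and the taxa sp1 to sp3 with their sequences are made up, and a gene is counted as missing for a taxon when the taxon is absent or its sequence consists only of "-" and "?" characters.

```python
def prune_incomplete(alignments, max_missing=0):
    """Keep taxa that lack data for at most `max_missing` genes.

    `alignments` maps gene name -> {taxon: aligned sequence}.
    """
    genes = list(alignments)
    taxa = set().union(*(alignments[g] for g in genes))

    def has_data(gene, taxon):
        seq = alignments[gene].get(taxon, "")
        return any(ch not in "-?" for ch in seq)

    kept = sorted(
        t for t in taxa
        if sum(not has_data(g, t) for g in genes) <= max_missing
    )
    return {
        g: {t: alignments[g][t] for t in kept if t in alignments[g]}
        for g in genes
    }


# Invented two-gene data set: sp2 lacks 12S, sp3 has only gaps for cytb.
alignments = {
    "cytb": {"sp1": "ACGT", "sp2": "AC-T", "sp3": "----"},
    "12S": {"sp1": "GGCA", "sp3": "GGTA"},
}

strict = prune_incomplete(alignments)
lenient = prune_incomplete(alignments, max_missing=1)
print("taxa kept with complete data:", sorted(strict["cytb"]))
print("taxa kept allowing one missing gene:", sorted(lenient["cytb"]))
```

The threshold makes the trade-off explicit: a strict setting reduces the high proportion of missing data noted above, at the cost of taxon coverage.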

Subsequent user steps

  • after building up a data set starting from a re-used matrix
    • re-align
    • phylo analysis
    • results to go in publications in rodent systematics
  • user re-uses his own old alignments frequently
    • user's 12S (mito SSU RNA) and 16S alignments used as template for aligning new sequences

another user story


User steps before decision to re-use or not

Subsequent user steps

  • <pubmed>19193643</pubmed>