MIAPA Survey

From Evolutionary Interoperability and Outreach

This is the wiki for the MIAPA survey group. Our aim is to gather information from scientist-users that will be useful in identifying and removing barriers to re-use.

Links and resources

Some possible links and topics:

Sandbox for developing survey questions

If you aren't sure where to start, see the list of crude ideas at the end. Note that people aren't very good at telling us what they might do under hypothetical conditions. Generally it is better to ask specific recall questions rather than general questions or speculation or "what if" questions. But that doesn't mean we can't ask interesting questions. With a survey, we can ask "why" questions that are impossible to answer in other ways (if you gave up { looking for | archiving | re-using } a tree, why did you give up?).

questions about re-use

cross-category questions

  1. Q1) Have you ever attempted to re-use a phylogenetic data matrix (multiple aligned sequences or other characters e.g. morphological) from a phylogenetic study? Was your attempt successful? Possible Answers: Yes (1) No (2) I have tried many different times with mixed success rates (3)
    • Q1a) Only if your answer to the above question was 'Yes': Did you obtain the data matrix from...
      • emailing one of the original study authors (1)
      • a personal or institutional webpage of one of the authors (2)
      • an online database such as TreeBASE, MorphoBank or Cladestore (3)
      • directly from the paper or its supplementary materials (4)
      • Other (5)
    • Q1b1) Only if your answer to Q1 was 'No' or 'mixed success': How many times have you tried, but failed, to re-use data matrices? Possible Answers: Once (1) Twice (2) 3 to 5 (3) 6 to 10 (4) 11 to 1000 (5)
    • Q1b2) Only if your answer to Q1 was 'No' or 'mixed success': What was the major cause or causes of your inability to re-use matrices? Possible Answers:
      • emailed author asking for matrix, author did not reply (1)
      • emailed author, author explicitly declined to send the data under acceptable terms of use (2)
      • emailed author, author claims to have lost the original data file (3)
      • attempted to use matrix directly from the paper and/or supplementary materials, found these to be incomplete/insufficient/erroneous (4)
      • attempted to find study data in online databases e.g. TreeBASE, could not find it there (5)
      • tried to email author, but the email address provided was old and inactive, and could not find newer contact details for the author (6)
      • could not understand how to use the data matrix obtained, e.g. unfamiliar file format (7)
      • Other (8)
    • Discussion of Question 1: I think we should allow people to tick multiple answers, particularly for question Q1b2, as many people may have been unable to re-use data matrices on more than one occasion and for different reasons.
  2. If scope or coverage was a factor in making a decision to re-use, or not re-use, in the past year, . . . ? (Emily)
  3. Did you ever attempt to re-use a data matrix (aligned residues or other characters) from a phylogenetic study? Was it successful?
    • What about unaligned data (sequences or phenotypes)?
  4. If a perception of low or high quality was a factor in making a decision (positive or negative) about re-use in the past year, which specific indicators of quality were important? (extent of available annotations; choice of specific desirable methods; availability of provenance information; bootstrap or other measures of uncertainty)
  5. How often in the past year did you attempt to re-use methods (workflow, parameters, heuristics) from a paper?
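
The conditional follow-ups to Q1 (Q1a only after 'Yes'; Q1b1 and Q1b2 only after 'No' or 'mixed success') amount to simple skip logic. A minimal sketch of how that branching might be encoded is given here, assuming the answer codes listed above. The function and constant names are hypothetical, and the multiple-tick rule for Q1b2 is only the proposal from the discussion note, not a settled decision.

```python
# Hypothetical sketch of the Q1 skip logic; names are illustrative only.
Q1_YES, Q1_NO, Q1_MIXED = 1, 2, 3


def follow_up_questions(q1_answer):
    """Return the follow-up question IDs to show for a given Q1 answer code."""
    if q1_answer == Q1_YES:
        return ["Q1a"]
    if q1_answer in (Q1_NO, Q1_MIXED):
        return ["Q1b1", "Q1b2"]
    raise ValueError(f"unknown Q1 answer code: {q1_answer!r}")


def validate_q1b2(selected):
    """Accept one or more ticked causes for Q1b2 (codes 1 to 8, as drafted above)."""
    codes = set(selected)
    if not codes:
        raise ValueError("select at least one cause")
    invalid = codes - set(range(1, 9))
    if invalid:
        raise ValueError(f"unknown Q1b2 codes: {sorted(invalid)}")
    return sorted(codes)
```

Most survey tools provide this kind of branching natively; the sketch is only meant to make the intended flow explicit while the questions are still being drafted.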

Before acquiring potentially re-useable information...

  1. Did you ever search for a phylogenetic inference result (tree, date of divergence, inferred character state) before making a decision to compute it yourself? If so, how did you search? Google, TreeBASE, PubMed, WoS? (Ross)
    • How hard did you look? How many hours, days? Were you ultimately successful in finding the tree or other item? (Ross)
  2. How often within the last year have "pay-walls" or related access restrictions to either data or publications prevented you from making use of information you hoped would be important for your work? (Brandon)
  3. How often in the last year did you request a data set from authors that was not available in a public archive? Did you get it?
  4. Have you ever been prevented from re-using data from a study because you do not have access to the article in which it is published? (Answers: Yes or No)

After acquiring re-useable information, but before re-use...

  1. How often did formats present a barrier to data re-use of downloaded data? Were you able to circumvent the barriers? How? (Sudhir)


  1. Have you ever encountered a situation where you knew exactly where the data you wanted were (e.g. matrix or tree/cladogram) but found them too difficult to extract from the published paper? (yes or no)

(if no...) Did you try emailing the author for it? (if yes...) Did you get what you wanted from the author? (And) How long did you have to wait for a reply from the author? 24hrs or less, 2 or 3 days, 4-10 days, 11-30 days, months, never (no reply)

When re-use happens...

  1. If and only if you have tried to re-use the settings as used for the original analysis of a dataset, were you:
A1) confident that you could replicate what was done *exactly*  
A2) prevented by one or more missing details in the original paper, so that you could not be sure of exactly replicating their method
A3) confused (for reasons other than absence of detail) as to what method they used. 

Reasoning: to try and get a quantitative grip on how many times people encounter the barrier of 'vaguely specified methods used in the original analysis'.

  1. Have you ever found serious errors in published phylogenetic data? (e.g. raw sequences misidentified, seriously mis-aligned sequences, missing/extra characters in morphological data matrices...)

Were you eventually able to get corrected/fixed data for whatever the problem was?

  1. If you re-used methods, data or results in the course of a new study, how important were they? (central to the new study; included in report as a counterpoint to the new analysis; older information provides perspective but not publishable results)

questions from the perspective of the producer

  1. In the past year, how many times have data or analyses you have published been re-used?
    1. never to my knowledge
    2. fewer than 5 times
    3. 5 or more times
  2. If so, how were you credited?
    1. my data or analyses were not re-used
    2. I was (co-) author
    3. I was acknowledged
    4. I was not credited explicitly
    5. other (specify:...)

questions to characterize the respondent population

  1. What is your primary exposure to phylogenetic trees:
    • I am a primary producer of trees using molecular, morphological or other type of data
    • I am a phylogeneticist and I primarily harvest trees generated by others for my own analyses
    • I am a traditional systematist/morphologist who uses trees generated by others to inform my revisionary/classification work
    • I am an ecologist, developmental biologist, land manager, other (please specify) who harvests trees generated by others to inform/complement my own work.
  2. What is your primary exposure to raw phylogenetic data:
    • I am a primary producer of raw data
    • I am a primary user of tree topologies and do not re-use their underlying data
    • I use raw data generated by others and re-analyze them in my own context
(Nico)

Dissemination plan

Questions

  1. how long will we leave the survey open?
  2. we have the potential for a long (thousands) email list. How do we manage that without spamming?
  3. what do we need to say in our solicitation?
  4. are we going to offer a free iPod?

The survey will be disseminated electronically, with an electronic announcement and a link to the survey.

The target audience is actual or potential producers and scientific users of phylogenetic trees. We assume that the typical target is a scientist who produces phylogenies for a specific purpose linked to a scientific study. We are not attempting to cover educational uses.

To reach the target audience, we plan to use the following:

  • email list servers
    • Core-phylogenetics, systematics, and taxonomy lists
      • evoldir - thousands of evolutionary biologists
      • mol-evol <mol-evol@net.bio.net> - "Discussions about research in molecular evolution (Moderated)"
      • R-sig-phylo <r-sig-phylo@r-project.org> - discussion list for users of R with an interest in phylogenetics
      • tdwg - hundreds of members of the biodiversity information standards organization
      • lists from past NESCent phyloinformatics activities (wg-evoinfo, wg-phyloinformatics, phylo-vocamp1)
      • project lists (nexml-discuss@sf, cdao-discuss@sf, phylows@googlegroups.com)
    • Other communities
      •  ? ecolog <ECOLOG-L@LISTSERV.UMD.EDU> (Hilmar) "Ecological Society of America: grants, jobs, news"
      •  ? iPlant? (jsw)
      • Taxacom
      • DML (Dinosaur Mailing List) <DINOSAUR@usc.edu>
      • VRTPALEO <VRTPALEO@usc.edu> The Vertebrate Paleontology Community discussion List
      • Paleonet <paleonet@nhm.ac.uk> "The re-establishment of connections between paleontologists of all types is PaleoNet's primary goal"
    • Japanese academic mailing lists
      • TAXA <taxa-admin@ml.affrc.go.jp> run by Nobuhiro Minaka, for taxonomy, 938 subscribers
      • EVOLVE <evolve-admin@ml.affrc.go.jp> for evolutionary biology, 2271 subscribers
    • UK-orientated but in many cases very internationally-subscribed lists
      • LERN's mailing list <http://londonevolution.net/> London Evolutionary Research Network
      • Palaeontological Association (UK) members (~1100 email addresses)
      • Systematics Association (UK) members <SYSTASS-NEWS@JISCMAIL.AC.UK> (321 email addresses)
      • ANCIENT-DNA@JISCMAIL.AC.UK
      • ECOLOGICAL-GENETICS@JISCMAIL.AC.UK "The Ecological Genetics Group"
      • FASTMOLL@JISCMAIL.AC.UK "Cephalopod International Advisory Council (CIAC) list"
      • LINNEAN-NEWS@JISCMAIL.AC.UK "A forum for contemporary scientific debate across the life sciences..."
      • TUNICATA@JISCMAIL.AC.UK "Tunicate biology including chordate origins"
      • STATGEN@JISCMAIL.AC.UK "Statistical genetics news and discussion list"
      • conservation-research@jiscmail.ac.uk British Ecological Society mailing list
      • women-in-ecology@jiscmail.ac.uk
    • Really speculative
      • open-science <open-science@lists.okfn.org> *may not contain any phylogeny-users (?unknown)
  • syst biol twitter feed (@systbiol)
  • lists of specific email addresses
    • the tolweb curator list, via Katja Schulz (editor)
    • PIs on tree-of-life grants (Jim Leebens-Mack)
    •  ? TreeBASE submitters
    •  ? TimeTree queryers?

Development and testing

The survey was developed in November by Arlin and benefitted from feedback from members of the MIAPA-discuss email list. Bill Piel, Karen Cranston, and others were given edit permissions and added some questions. The survey has not been user-tested.

Meeting Notes

Friday, 25 Mar 2011 telecon

Members present: Arlin Stoltzfus, Nico Cellinese,

Agenda: questions

Over the past year, you may have searched for a phylogenetic inference result (tree, date of divergence, inferred character state). Characterize this experience:

  • i didn't search for any such result
  • i searched for what kind of data or inference
    • phylogeny
    • date of divergence
    • alignments
    • character data
    • other
  • i searched in what information resources
    • www (using google or other)
    • treebase
    • pubmed, wos, or other lit db
    • other
  • a typical query was like this: [ enter text ]
  • i filtered or evaluated initial hits
    • recency
    • publication venue
    • other indications that it's really the right thing
  • how often did you identify a resource that apparently (before actually inspecting the resource) matched criteria?
    • rarely
    • sometimes
    • often
    • always

In the past year, when you have located resources matching search criteria . . . (choose all that apply)

  • i downloaded the material for inspection locally
  • i encountered a paywall
  • i went through a paywall
  • i tagged the resource

The last time that you identified and downloaded a resource (for human examination, or for further computation), the main focus of your interest was

  • a phylogeny
  • a date of divergence
  • an alignment
  • homologized characters
  • unaligned characters
  • methods description (e.g., workflow plan)
  • other

If you decided to use this result, how did you use it

  • for information purposes (background, my own understanding)
  • to evaluate quality (e.g., to assess reliability of conclusions based on the result)
  • to combine with other data into a larger set for analysis
  • to replicate the study (to use data as is, for the same purpose)
  • to re-purpose the data (to use data as is, for a different purpose)

Friday, 18 Mar 2011 telecon

Members present: Brandon Chisham, Nico Cellinese, Arlin Stoltzfus

Agenda: question topics

  1. relative market size for re-use of trees vs. alignments
    • be sure to be clear about "re-use" and "data" (define these terms clearly, or avoid them in favor of unambiguous terms)
    • the last time you used information from a previously published phylo study,
      • i just used the final tree
      • i just used the aligned data (sequence alignment or character data)
      • i just used the methods (i.e., applied to different set of data)
      • i recomputed the tree (from the aligned data) to verify it, using the author's methods
      • i recomputed the tree (from the aligned data) to verify it, using my own methods
    • alternative question: the last time . . . , did you re-use a) trees, b) alignment, c) both
  2. how much of research is based on new data vs. mined data?
    • in your latest study, how much of the primary char data were new
    • in your latest study that presented new char data, how much was old?
  3. how important is the 'not in repository' barrier?
    • Did you ever search for a phylogenetic inference result (tree, date of divergence, inferred character state) before making a decision to compute it yourself? If so, how did you search?
      • No
      • Yes, i searched in some_of { Google, Treebase, PubMed, WoS, other }
    • Did you find anything? If so, was it
      • No
      • yes, and it was { sufficient | insufficient } to work with
  4. or, in what form do you combine your new data with mined data?
  5. are morpho chars re-usable? repurposable ?
  6. the extent to which conflicting taxonomic schemes create barriers to users
  7. extent to which lack of any OTU name is a barrier
  8. how often have you re-used methods without re-using any data?
  • things that need to be defined
    • 'data' - information or facts
    • 're-use' -

Friday, 4 Mar 2011 telecon

Members present: Brandon Chisham, Emily Gillespie, Ross Mounce (scribe), Enrico Pontelli, Arlin Stoltzfus, Sudhir

Agenda: user stories

Brandon & Enrico

Emily

  • user 1
    • wild ginger
    • story about ITS sequences, published early 1990s, author retired, could not locate data (alignment not published); ended up determining sequences again
  • user 2
    • reasons users don't want to share data are unclear-- don't want data involved in project of high quality?

Ross

  • replicating results
  • erratum
  • timeliness of response is an issue (email author, wait days, weeks, months)

Arlin

  • issues of attribution

Sudhir

  • MRSA studies

action items

  • all to add provisional questions to wiki
  • telecon next Friday (Mar 11) to consider questions

caption-generating software

Friday, 18 Feb 2011 telecon

Members present: Brandon Chisham, Emily Gillespie, Ross Mounce (scribe), Enrico Pontelli, Arlin Stoltzfus (chair)

Members absent: Rutger Vos (apologies, is in Tokyo), Sudhir (Europe), Nico (recruitment day)

Decisions

  • next meeting: next week, same time, Feb 25th 1:00 EST, 6:00 GMT

Action items

  • collecting 3 user stories from others for the wiki

Notes

  • methods re-use (discussion)
    • annotations are useful
    • but an executable methods description is futuristic (Enrico has prototyped this kind of thing, though)
  • have you tried to replicate a study? what barriers?
  • what % of papers re-used data
  • have you ever re-used data from GenBank?
    • if not, what barriers?
      • no sequences for my group
      • too hard, don't know how to use them
  • what if user tries to find morpho character data for re-use?
    1. no common warehouse
      • search treebase, not much
      • search on specific journals based on own list of journals
    2. can't access data even if a pub is found
  • what if user tries to find sequence alignment for re-use?
    1. search in pubmed ok
    2. search pfam
    3. want to re-use from specific paper
      • hard to find alignment (not archived)
  • but all of that is going to change ... due to changes in archiving policies
    • maybe not-- what if the policies only affect a small fraction of trees?
    • is this a question we can investigate? (see literature analysis in KumarDudley2007)
  • what if the user wants a species tree?
    1. use NCBI taxonomy

Friday, 11 Feb 2011 telecon

Members present: Nico Cellinese, Brandon Chisham, Emily Gillespie, Sudhir Kumar, Ross Mounce (scribe), Enrico Pontelli, Arlin Stoltzfus (chair)

Members absent: Rutger Vos (apologies, is in Tokyo)

Summary: After introductions, we spent about 40 minutes discussing different aspects of re-use. We decided that

  • we need to develop hypotheses about barriers to re-use in order to design the survey effectively
  • we will have a recurring teleconference at this time
    • next telecon: Friday 18th February 1pm (EST), 6pm (GMT)

Introductions:

  • NC: University of Florida, assistant professor, co-developer of TOLKIN, HERBIS & RegNum, expertise in botanical phylogenetics
  • BC: New Mexico State University, grad student, co-developer of CDAO, background in computer sciences
  • EG: Wake Forest University, post-doc, doctoral research in botanical phylogenetics
  • SK: Arizona State University, professor, co-developer of MEGA & TimeTree amongst others
  • RM: University of Bath (UK), grad student, research - ‘the importance of fossils in phylogeny’, expertise in (animal) morphological systematics
  • EP: New Mexico State University, professor, co-developer of CDAO, background in computer sciences
  • AS: NIST, IBBR (U Md), evolutionary genetics & informatics, CDAO,

Discussion:

AS: The MIAPA-survey needs to be reconfigured a little. Potential to use some elements of branching in the survey if warranted, so that respondents only answer questions that are of relevance to them.

AS: Previous discussion with Heather Piwowar (NESCent/Dryad) persuaded us to focus the survey on investigating “Barriers to Re-Use of Phylogenetic Data” making it clear and concise.

What is ‘phylogenetic data’? Discussion from all indicated this means many things to many people.

SK: the majority of phylogenetic data being re-used is tree topology (‘trees’) by non-systematists, perhaps mostly even non-scientists e.g. TimeTree iPhone app usage statistics, and educational use-cases.

RM & NC: ...but academic re-usage by systematists is most likely to focus on the re-usage of ‘underlying data’ not topological “results” (except for “Supertree” research). By underlying data, we mean character-by-taxon data matrices; inferred evidence, inclusive of some homology assumptions.

RM: For molecular data, systematists may often want to go one step further back in the chain and re-use only “raw” sequence data from GenBank as molecular data matrices are aligned and coded (gaps/indels in particular) by many different methods in many different ways. GenBank works well and is relatively less of a “barrier” IMO.

EG: ...but sometimes not even GenBank data has everything one would want e.g. issues with provenance and multi-copy sequences

Barriers

SK: TimeTree project had to scan images of tree topology and back convert into useable electronic formats (?). Clearly, not providing tree topology results from papers in recognised (e.g. Newick) electronically re-useable formats is hampering re-use.
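
To illustrate why a recognised format such as Newick matters for re-use: a tree published as a Newick string can be read back by a few lines of code, whereas a tree published only as an image cannot. The following is a minimal sketch, not a full Newick implementation. It assumes unquoted labels and no comments, discards branch lengths, and uses made-up taxon names; the function names are hypothetical.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (label, children) tuples.

    Assumptions: unquoted labels, no comments; branch lengths are discarded.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        start = pos
        while text[pos] not in ",();":
            pos += 1
        # Keep the name, drop any ":length" suffix.
        label = text[start:pos].split(":")[0].strip()
        return (label, children)

    tree = node()
    if pos != len(text) - 1:
        raise ValueError(f"unexpected character at position {pos}")
    return tree


def leaf_names(tree):
    """Return the tip labels of a parsed tree, left to right."""
    label, children = tree
    if not children:
        return [label]
    return [name for child in children for name in leaf_names(child)]


# Illustrative input only; not taken from any study discussed here.
print(leaf_names(parse_newick("((human,chimp),gorilla);")))
```

In practice one would use an established phylogenetics library rather than a hand-written parser; the point is only that a machine-readable string makes the topology recoverable without re-digitising a figure.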

RM: The same goes for all other aspects of phylogenetic data - they are often ‘buried’ in the pdf or supplementary materials, difficult to find and/or re-use. See my recent presentation [1] for more details on how this impedes my re-use of morphological data.

EG: There are cases in which people may not re-use data simply because they do not understand the complexity of the associated methods (and/or software), particularly the case for Bayesian phylogenetic inference but also Maximum Likelihood and perhaps (RM) dynamic homology with POY.

SK: wife is very experienced with survey design. Can give an informed critique on the design

NC: Museum colleagues are highly critical and would happily point out flaws in the design too

References


<pubmed> 15120394 17485425 19193643 15166018 12097345 20033048 </pubmed>