Reuse Cases: Difference between revisions

From Evolutionary Interoperability and Outreach

Revision as of 17:56, 17 March 2011

Scope

This page is for developing a list of use-cases for EvoIO- and MIAPA-relevant project planning. The use-cases should focus on re-use, which might mean replication, aggregation, re-purposing, meta-analysis, or integration (see below for a view of what these terms mean).

We want to enable and facilitate re-use of phylogenetic data and metadata, which isn't happening often enough. Because it isn't happening often enough, it might be useful for us to consider hypothetical cases of re-use. However, even for hypothetical cases, it is very important to make every effort to document user needs, e.g., as expressed in published papers. For instance, before TimeTree existed, a resource to aggregate phylogenetic dates was a hypothetical re-use case, but the user need for placing dates on nodes of trees was not hypothetical and could be documented easily.

For comparison, some other use-case lists are available:

Some of the above use-cases (or variants of them) might be relevant here.

What constitutes re-use of data?

The primary consumer of a scientific product typically is the primary producer, e.g., an ecologist collects field observations and then uses these new observations to evaluate hypotheses or clarify patterns. "Re-use" refers to the case when there is a secondary consumer.

By "data", we mean information. That is, for the present purposes, data are coded information, as distinct from the material products of research such as specimens and samples. We do not mean data in the more restricted sense of facts or observations, although this type of empirical data may be the most likely to be re-used. Sharing data (unlike sharing materials) is an informatics problem.

The general category of data re-use may cover a large number of diverse cases described with terms such as replication, aggregation, re-purposing, meta-analysis, and integration. These do not seem to be distinct non-overlapping categories, but dimensions or qualities that may interpenetrate. For instance, Yampolsky & Stoltzfus (2005) combined data from 15 studies, comprising nearly 10,000 engineered amino acid exchanges, to generate the "EX" matrix of values representing the mean exchangeability from one amino acid to another. The authors clearly were secondary consumers: each underlying study was performed by the primary producer-consumers in order to map out regions of a protein most susceptible to amino acid changes. The study was described as a meta-analysis (in the sense of combining separate studies to address an issue beyond the scope or power of any individual study), and it clearly involves re-purposing (using results for a different goal), and aggregation (in the sense of combining results from multiple studies).

To reiterate, these terms (aggregation, re-purposing, etc.) do not seem to represent non-overlapping categories, but qualities or aspects that may apply in combinations. Here is one person's (AS) interpretation of the terms (for a different view, see Fig. 1 of Sidlauskas et al., 2010):

  • study replication means verifying results or conclusions of a published study by repeating it. Although the potential for study replication is integral to the self-policing nature of science, it happens only on the rare occasions when the published results of a study are perceived to be fraudulent or artefactual (e.g., in recent memory, the "memory water" and "directed mutations" cases).
  • aggregation means gathering large numbers of results of a precisely defined type. Often the aggregator adds value in the process. The Sepkoski marine fossil data set is an example. TimeTree is an example that has more of a focus on making it easy for the user.
  • meta-analysis means combining several separate analyses to address issues beyond the scope or power of a single analysis. This sometimes means a meta-statistical analysis (statistical meta-analysis), in which conclusions are based on combining, not the raw data from each study, but summary statistics from each study (e.g., means and variances) in a way that is sensitive to study design. Supertree methods (for assembling composite trees from separate overlapping trees) are analogous, in that they combine trees rather than the underlying character data. Sidlauskas, et al. use "meta-analysis" to refer to two studies that "synthesized the results of hundreds of previous studies" to show conclusively that climate change causes shifts in species distributions (something that individual studies could not establish conclusively).
  • re-purposing means using the results of a study for a purpose other than that of the primary consumer.
  • integration seems similar to "synthesis" but may have more of an implication of bringing together things that obviously belong together but have been kept separate for arbitrary reasons, e.g., combining data from different domains or different types of studies. This kind of integration depends on integrating variables or keys by which data from separate studies are combined. The integrating variable might be an accession number, a species name, a geographic location, etc.
  • synthesis seems similar to "integration" but may have more of an implication of conceptual novelty and creativity, i.e., combining results in ways that were not imagined.

Note that aggregation and meta-analysis combine data from multiple studies of the same type. Synthesis and integration necessarily combine data from studies of different types. Study replication, by definition, deals with a single study.
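As a rough illustration of the statistical meta-analysis described above, here is a minimal sketch that pools per-study means by inverse-variance weighting. This is one common fixed-effect approach, picked only for illustration; nothing on this page prescribes it. The function name `pool_fixed_effect` and the three (mean, variance) pairs are invented for the example.

```python
def pool_fixed_effect(studies):
    """Pool (mean, variance_of_mean) pairs by inverse-variance weighting.

    Each study contributes only its summary statistics, not its raw data.
    More precise studies (smaller variance) receive larger weights.
    """
    weights = [1.0 / variance for _, variance in studies]
    total_weight = sum(weights)
    pooled_mean = sum(w * mean for w, (mean, _) in zip(weights, studies)) / total_weight
    pooled_variance = 1.0 / total_weight
    return pooled_mean, pooled_variance


# Hypothetical summary statistics from three separate studies.
example_studies = [(0.30, 0.04), (0.50, 0.01), (0.10, 0.04)]
pooled_mean, pooled_variance = pool_fixed_effect(example_studies)
print(f"pooled mean = {pooled_mean:.3f}, pooled variance = {pooled_variance:.5f}")
```

Note that the secondary consumer here needs each study's mean and variance in a usable form, which is exactly the kind of re-use requirement this page is meant to document.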

About use-cases

A "use case" is a description, from the perspective of the user (not the developer), of a set of transactions intended to satisfy a particular category of user needs. Here is a formula for a use-case:

  • Name and description - brief overview
  • Motivation - why do researchers want to do this?
  • Ideal procedure
    • Preconditions - what does the user need to start with?
    • Steps - what are the steps in a typical case?
    • Outcomes - what outcomes satisfy user needs?
  • Key challenges - what makes it difficult to do this today?
  • References - who does this, or wants to do it?
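If the use-cases were ever collected in machine-readable form, the formula above might map onto a record like the one sketched below. This is only a suggestion: the class name `UseCase`, its field names, and the `missing_sections` helper are assumptions made for illustration, not an agreed EvoIO or MIAPA schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UseCase:
    """One use-case, with one field per section of the formula."""
    name: str
    description: str
    motivation: str
    preconditions: List[str] = field(default_factory=list)
    steps: List[str] = field(default_factory=list)
    outcomes: List[str] = field(default_factory=list)
    key_challenges: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [section for section, value in vars(self).items() if not value]


# A partly filled-in draft, to show which sections still need work.
draft = UseCase(
    name="Supertree research",
    description="Estimate a phylogeny by assembling smaller source trees.",
    motivation="Test hypotheses that need very large phylogenies.",
)
print(draft.missing_sections())
```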

List of use-cases

Supertree research

Name and description

A supertree is defined as an estimate of phylogeny assembled from smaller phylogenies. These partial phylogenies (or source trees) must have some taxa in common, but not necessarily all. Modern supertrees can contain hundreds or thousands of taxa and are constructed from hundreds of source phylogenies requiring the collection of large amounts of data. These phylogenies are of great use in, for example, comparative biology, and macroevolutionary studies (quoted verbatim from Davis & Hill, 2010 [1]).

Motivation

To test hypotheses that require phylogenies of such scale and breadth that building a phylogeny with a similar number of taxa by conventional methods would be far too difficult.

Typical procedure

  • Preconditions
    • 'Source trees': previously published hypotheses of evolutionary relationships for the group of taxa of interest. Generally only topology data are required, but this can vary depending on the exact method used.
  • Steps (after Davis & Hill, 2010)
    • 1. Data collection and entry
    • 2. Standardisation of terminal taxa
    • 3. Ensure source tree independence: Remove redundancy within the [meta]dataset that would otherwise unfairly up-weight data
    • 4. Check adequate taxonomic overlap of source trees
    • 5. Matrix creation: Create a matrix suitable for analysis
  • Outcomes
    • A supertree estimate of phylogeny.
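Step 5 (matrix creation) can be sketched for one widely used coding, matrix representation with parsimony (MRP). In this coding each clade of each source tree becomes one column: taxa inside the clade score 1, other taxa in that source tree score 0, and taxa absent from that tree score "?". The sketch below is a simplified illustration, not the procedure of Davis & Hill: the function names are invented, the parser assumes rooted, topology-only Newick strings with no branch lengths, internal labels, or whitespace, and the all-zero outgroup row usually added in MRP is omitted.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    return node()


def leaves(tree):
    """Return the set of leaf names below a node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """Return the leaf set of every internal node except the root."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(frozenset(leaves(node)))
        for child in node:
            walk(child, False)

    walk(tree, True)
    return found


def mrp_matrix(newick_trees):
    """Build an MRP matrix as {taxon: string of '0', '1', '?'}."""
    trees = [parse_newick(t) for t in newick_trees]
    all_taxa = sorted(set().union(*(leaves(t) for t in trees)))
    matrix = {taxon: "" for taxon in all_taxa}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    matrix[taxon] += "?"   # taxon not sampled in this source tree
                elif taxon in clade:
                    matrix[taxon] += "1"
                else:
                    matrix[taxon] += "0"
    return matrix


# Two hypothetical source trees that overlap in taxa B and C.
source_trees = ["((A,B),(C,D));", "((B,C),E);"]
matrix = mrp_matrix(source_trees)
for taxon, row in matrix.items():
    print(taxon, row)
```

The resulting matrix would then be analysed with parsimony software, which is outside the scope of this sketch.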

Key challenges

  • The lack of digitally-available tree topology-data in a recognised/standardised format (e.g. Newick) for most phylogenetic studies that have ever been published.
    • Most tree topologies are published (in their original papers) only as figures, which is not much help for re-use and re-purposing.
    • This problem is so widespread that ingenious methods (e.g. TreeSnatcher [2]) have been developed specifically to re-extract topology-data from published papers.
  • No standardisation of taxon names between studies, hence Step 2 (above).
  • Taxon sampling. Step 3 (above) is necessary because some taxa/groups are extremely 'popular' in phylogenetic studies, whilst others are only very rarely included.
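Steps 2 and 4 interact: source trees can only be shown to overlap once their terminal names have been standardised. The sketch below illustrates this with a hand-made synonym table and a simple check that groups source trees linked, directly or indirectly, by shared taxa. The function names, the synonym table, and the three taxon lists are all made up for illustration; a real project would draw synonyms from a taxonomic authority.

```python
def standardise(taxa, synonyms):
    """Replace each name by its accepted name when a synonym is known."""
    return {synonyms.get(name, name) for name in taxa}


def overlap_groups(taxon_sets):
    """Group tree indices that are connected by at least one shared taxon."""
    unvisited = set(range(len(taxon_sets)))
    groups = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        group, frontier = [start], [start]
        while frontier:
            current = frontier.pop()
            linked = [i for i in unvisited if taxon_sets[i] & taxon_sets[current]]
            for i in linked:
                unvisited.remove(i)
            group.extend(linked)
            frontier.extend(linked)
        groups.append(sorted(group))
    return groups


# Illustrative synonym table: old name -> accepted name.
synonyms = {"Felis concolor": "Puma concolor"}

# Terminal taxa of three hypothetical source trees.
raw_taxa = [
    {"Felis concolor", "Panthera leo"},
    {"Puma concolor", "Lynx lynx"},
    {"Canis lupus", "Vulpes vulpes"},
]

before = overlap_groups(raw_taxa)
after = overlap_groups([standardise(taxa, synonyms) for taxa in raw_taxa])
print("groups before standardisation:", before)
print("groups after standardisation: ", after)
```

More than one group after standardisation means the source trees cannot all be combined into a single supertree without additional linking data.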

References

Beck, R. M. D., Bininda-Emonds, O. R. P., Cardillo, M., Liu, F.-G. R., and Purvis, A. 2006. A higher-level MRP supertree of placental mammals. BMC Evolutionary Biology 6:93. [3]

Bininda-Emonds, O. R. P., Gittleman, J. L., and Purvis, A. 1999. Building large trees by combining phylogenetic information: a complete phylogeny of the extant Carnivora (Mammalia). Biological Reviews 74:143–175. [4]

Cotton, J. and Wilkinson, M. 2009. Supertrees join the mainstream of phylogenetics. Trends in Ecology & Evolution 24:1–3. [5]

Davies, T. J., Barraclough, T. G., Chase, M. W., Soltis, P. S., Soltis, D. E., and Savolainen, V. 2004. Darwin's abominable mystery: Insights from a supertree of the angiosperms. Proceedings of the National Academy of Sciences of the United States of America 101:1904-1909. [6]

Davis, R. B., Baldauf, S. L., and Mayhew, P. J. 2010. Many hexapod groups originated earlier and withstood extinction events better than previously realized: inferences from supertrees. Proceedings of the Royal Society B: Biological Sciences 277:1597–1606. [7]

Lloyd, G. T., Davis, K. E., Pisani, D., Tarver, J. E., Ruta, M., Sakamoto, M., Hone, D. W. E., Jennings, R., and Benton, M. J. 2008. Dinosaurs and the Cretaceous Terrestrial Revolution. Proceedings of the Royal Society B: Biological Sciences 275:2483-2490. [8]

Ruta, M., Pisani, D., Lloyd, G. T., and Benton, M. J. 2007. A supertree of Temnospondyli: cladogenetic patterns in the most species-rich group of early tetrapods. Proceedings of the Royal Society B: Biological Sciences 274:3087-3095. [9]

Sanderson, M. J., Purvis, A., and Henze, C. 1998. Phylogenetic supertrees: assembling the trees of life. Trends in Ecology and Evolution 13:105-109. [10]

Thomas, G., Wills, M., and Szekely, T. 2004. A supertree approach to shorebird phylogeny. BMC Evolutionary Biology 4:28. [11]

Other meta-analyses (multiple phylodata re-use cases)

Name and description

A 'catch-all' basket group for other non-supertree meta-analyses utilizing many phylogenetic datasets.

Motivation

To test hypotheses over many independent datasets

Typical procedure

  • Preconditions
    • Published cladistic data matrices in usable electronic formats (e.g. NEXUS)
  • Steps (varies depending on exact case)
    • first-order taxon jackknifing [12]
    • partitioning datasets and performing subsequent analyses on partitions to compare signal [13] [14]
  • Outcomes
    • Interesting results
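The two example steps above can be sketched on a character matrix held as a simple mapping from taxon name to a string of character states. This is a minimal illustration under stated assumptions: the function names and the tiny matrix are invented, every character is assumed to occupy exactly one symbol (no bracketed polymorphisms), and the phylogenetic re-analysis of each replicate or partition is left to external software.

```python
def first_order_jackknife(matrix):
    """Yield (removed_taxon, reduced_matrix) for every single-taxon deletion."""
    for removed in matrix:
        reduced = {taxon: states for taxon, states in matrix.items() if taxon != removed}
        yield removed, reduced


def partition(matrix, columns):
    """Keep only the listed character columns (0-based) for every taxon."""
    return {
        taxon: "".join(states[i] for i in columns)
        for taxon, states in matrix.items()
    }


# Hypothetical matrix: three taxa scored for four characters ("?" = missing).
example_matrix = {
    "taxon_a": "0101",
    "taxon_b": "1100",
    "taxon_c": "1?10",
}

# One replicate per deleted taxon; each would be re-analysed separately.
replicates = list(first_order_jackknife(example_matrix))
for removed, reduced in replicates:
    print("without", removed, "->", sorted(reduced))

# Two partitions, e.g. standing in for two anatomical regions.
first_partition = partition(example_matrix, [0, 1])
second_partition = partition(example_matrix, [2, 3])
print(first_partition)
print(second_partition)
```

Both operations need the matrix itself in an electronic format, which is why the precondition above matters so much for this kind of study.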

Key challenges

The scarcity of morphological cladistic datasets digitally available in standardised formats (e.g. NEXUS), particularly for non-botanical, non-mycological taxonomic groups. Relevant key challenges, barriers and the scale of the problem were outlined in a recent talk (Mounce, 2010, at the 12th Young Systematists' Forum [15]). It is hard even to find relevant datasets: speaking from first-hand experience, a literature search for cladistic data returns many false positives (e.g. titles and abstracts that refer to "Systematics of..." yet contain no primary phylogenetic analysis) and false negatives (searching only for "morphological systematics" and "cladist*" will not find everything that is out there).

References

Cobbett, Andrea, Wilkinson, Mark, Wills, Matthew A. (2007) Fossils impact as hard as living taxa in parsimony analyses of morphology. Systematic Biology 56(5):753-766. doi:10.1080/10635150701627296

Prevosti, Francisco J., Chemisquy, María A. (2010) The impact of missing data on real morphological phylogenies: influence of the number and distribution of missing entries. Cladistics 26(3):326-339. doi:10.1111/j.1096-0031.2009.00289.x

Song, Hojun, Bucheli, Sibyl R. (2010) Comparison of phylogenetic signal between male genitalia and non-genital characters in insect systematics. Cladistics 26(1):23-35. doi:10.1111/j.1096-0031.2009.00273.x

Mounce, R. C. P. and M. A. Wills (2010). Congruence between cranial and postcranial characters in vertebrate systematics. Proceedings of the 29th Annual Meeting of the Willi Hennig Society. http://www.citeulike.org/user/rossmounce/article/7853861

Liow, Lee Hsiang (2007) Lineages with long durations are old and morphologically average: an analysis using multiple datasets. Evolution 61(4):885-901. doi:10.1111/j.1558-5646.2007.00077.x

Interestingly, this last study uses phylogenetic data (cladistic data matrices of morphological characters) to create a 'morphospace' representation. No actual phylogenetic analysis per se, but ample re-use of phylogenetic datasets...

Another use-case

Name and description

brief overview

Motivation

why do researchers want to do this?

Typical procedure

  • Preconditions - what does the user need to start with?
  • Steps - what are the steps in a typical case?
  • Outcomes - what outcomes satisfy user needs?

Key challenges

what makes it difficult to do this today?

References

who does this, or wants to do it?