VoCamp1/Pitches

From Evolutionary Interoperability and Outreach

This page is for submitting a "Pitch" - a call for a development task - for a particular area of interest to be covered in VoCamp.

Please put your new ideas at the top; this makes it more likely that everyone will see them.

A Biodiversity Atlas of Montpellier

Take a look at the beautiful www.sitkanature.org, which contains photographs, checklists by taxa, and profile pages for species. The site is built on Semantic MediaWiki. Matt Goff, who built and runs the site, and I have talked about determining vocabularies to publish all this information in RDF. If we do this, we could use Matt's site as a template for citizen science groups to create a "Biodiversity Atlas of X", where X can be "my backyard", "our schoolyard", "our city", etc. Joel Sachs

How much can a triplestore know?

How much about biodiversity (expressed in triples) do humans know? 10s of billions of triples? Trillions of triples? Franz will give us a license for AllegroGraph, and may be able to provide a server for the VoCamp. This would let us scale to billions, possibly 10s of billions of triples without too much trouble.

Doing so could be either

  1. a VoCamp project, i.e. one team tries to track down: occurrence records from GBIF; biodiversity inventories of the world's protected areas; shapefiles for protected areas; shapefiles for countries and regions; food webs; species profiles; conservation status; invasiveness status; range maps; genomics data; etc.
  2. a VoCamp meta-project, i.e. every team loads its data into the triplestore.
  3. a combination of both.

An advantage of co-locating all the data we can is that, for purposes of integration (via SPARQL query), we wouldn't have to worry about network issues, content negotiation, or any of the other things that can go wrong on the semantic web. In theory, vocabulary issues would be the single point of failure, which seems like a good thing for a VoCamp.

Here are some queries that would (i) require GUIDs for taxa, and (ii) possibly result in useful information:

  1. Find all observation records within protected areas, where the species name is not on the checklist for that protected area.
  2. Find all threatened species which are not on the checklist for any protected area.
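
The two queries above can be sketched over a toy in-memory triple set. This is only an illustration in plain Python, not SPARQL against AllegroGraph. All identifiers (`obs:1`, `area:A`, `taxon:lynx`) and predicate names (`observedTaxon`, `withinArea`, `checklistIncludes`, `conservationStatus`) are made up for the example, and it assumes every data source refers to a taxon by the same GUID.

```python
# Toy triples: (subject, predicate, object). All identifiers are hypothetical.
TRIPLES = {
    ("obs:1", "observedTaxon", "taxon:wolf"),
    ("obs:1", "withinArea", "area:A"),
    ("obs:2", "observedTaxon", "taxon:lynx"),
    ("obs:2", "withinArea", "area:A"),
    ("area:A", "checklistIncludes", "taxon:wolf"),
    ("area:B", "checklistIncludes", "taxon:otter"),
    ("taxon:lynx", "conservationStatus", "threatened"),
    ("taxon:otter", "conservationStatus", "threatened"),
    ("taxon:wolf", "conservationStatus", "least concern"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the store."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def observations_off_checklist():
    """Query 1: observations in a protected area whose taxon is not on that area's checklist."""
    hits = set()
    for obs, predicate, area in TRIPLES:
        if predicate != "withinArea":
            continue
        checklist = objects(area, "checklistIncludes")
        for taxon in objects(obs, "observedTaxon"):
            if taxon not in checklist:
                hits.add((obs, taxon, area))
    return hits


def threatened_off_all_checklists():
    """Query 2: threatened taxa that appear on no protected-area checklist."""
    threatened = subjects("conservationStatus", "threatened")
    listed = {o for s, p, o in TRIPLES if p == "checklistIncludes"}
    return threatened - listed
```

Both functions compare taxon identifiers by exact string equality, which is why the queries only work if the observation records, checklists and status lists share GUIDs for taxa.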

More generally, a giant triplestore, with a healthy fraction of everything we know, would be a good testbed to experiment with. Joel Sachs

Semantic publishing in phyloinformatics

Imagine if we could get a large amount of biodiversity, taxonomy, and phylogenetics data, from diverse sources, into the "semantic web". It might take a while for this to affect typical end-users. First we might start to see more integrative analyses by power-users and by computer scientists looking for semantic web projects. Eventually there would be a payoff for end-users (and society in general) when applications develop that routinely make use of this data, e.g., a bio-browser plugin that, whenever it encounters a (properly encoded) species name, will pop up with the option to view an image of the critter and a map of its range (retrieved from other online resources).

What can we do right now to make this happen sooner, rather than later? What can we do to facilitate and encourage those who publish data to make these results part of the semantic web? Could we:

  • assess microformats, RDFa, and other ways to embed information
  • identify and assemble specific methods for encoding things like literature references or taxonomy links
  • identify a few specific types of resources and develop semantic-web encodings to serve as examples
  • reflect on all of the above and develop a strategy for education and evangelizing
  • As a newbie to ontologies I think this would be very helpful. I have a lot of data in a PostgreSQL database (trees in PhyloDB and spatial information in PostGIS). Can this be 'semantically enabled' from within the database, or does it all have to be ported to RDF? - David Kidd

Protocol for ontology maintenance

TDWG has a lot of experience developing, adopting, maintaining and extending standards. As we will be meeting in parallel with TDWG we should try to tap into that experience in developing our community protocol for the maintenance and extension of our standards (e.g. CDAO and NeXML). This pitch is more about social engineering than software engineering: how do we add new terms and concepts? Who decides what new terms and concepts are included? How do we bump up version numbers? How do we test, enforce and maintain compliance? The deliverable from this pitch would be a set of flow charts and documents that describe how continuity is maintained should any one of us (or all of us) migrate to Vanuatu or be grabbed by grizzly bears. RutgerVos 20:41, 25 October 2009 (UTC)

comments and questions

  1. I'm not sure the issues involved in maintaining an ontology (e.g., CDAO) and a schema for data transfer (e.g., NeXML) are as similar as this pitch suggests. The gatekeeper/request tracker/email list approach that many OBO ontologies use works pretty well for ontologies where adding a term, particularly a leaf that doesn't require any reorganization, can be transparent for users who don't care about that term. Changes to a schema require developers to update and release changes to their tools and hope that existing files don't break in the updated schema - we've seen this with NeXML. I think the pitch group would be most useful if it focussed on schema maintenance, ideally with lots of interaction with TDWG people who are maintaining schemas. Handing off gatekeeper status may be more of a common issue - I'm not sure how much experience there is with this (though ZFIN has recently done so). Peter Midford

Reasoning about, and with, related taxonomies

A primary reason to formalize a set of taxonomies and the relationships between them is to support reasoning. This could be reasoning about the concepts in the taxonomies, reasoning about relationships between taxonomies, or reasoning about data that have been annotated to the taxonomies. I would like to see a discussion of what types of reasoning needs exist in the phylogenetic context, and an investigation into which representation systems support which needs. For example, the TDWG taxonomic concept schema (TCS) captures taxonomies and the relations between them using an XML schema. The schema itself does not support reasoning, though it does contain representations for relationships that might be translated into some logical formalism. However, it is unclear whether or not OWL, for example, can represent the relationships in TCS. As another example, the Phylocode specifies taxonomies in an interesting way. What kinds of inferences would we want to make with Phylocode taxonomies, and what representations would we need to support those inferences? Some goals of this pitch might be (i) a list of current languages for describing and reasoning with and about taxonomies (OBO, LinkedData, TCS, SKOS, etc.), (ii) a sketch of several use cases for reasoning with taxonomies in a phylogenetic context, and maybe (iii) an investigation into how the current systems might address the needs, leading to (iv) the identification of places that need further research. Dave Thau 05:14, 23 October 2009 (UTC)

comments and questions

  1. This is important and we ought to look at how languages constrain or guide the choice of representations (e.g., the choice of OBO vs OWL will affect how you look at the taxa as individuals vs. classes issue), and hence what sort of reasoning is supported. Integrating Phylocode with legacy taxon concepts is also an important issue. PeterMidford
  2. I find this pitch very interesting and I hope it will result in the formation of a subgroup, and if there is enough interest, I can run a quick overview on the phylocode and the different types of definitions. This can also be integrated with a mini bootcamp on phylogenetic principles. Nico Cellinese

MIAPA (Minimal Information for a Phylogenetic Analysis)

Domain scientists with an interest in the archiving and re-use of phylogenetic data have called for (but not yet developed) a reporting standard designated "Minimal Information for a Phylogenetic Analysis", or MIAPA (Leebens-Mack, et al. 2006). The vision of these scientists is that the research community would develop, and adhere to, a standard that imposes a minimal reporting burden yet ensures that the reported data can be interpreted and re-used.

Nothing has happened with this idea for several years, other than the development of a whitepaper at NESCent. However, it still seems useful and necessary.

Why not hack out a proof-of-concept version of a MIAPA standard, addressing how to represent character data, trees, and metadata? A variety of relevant standards exist already. The result could be presented to a larger community to stimulate a more broadly based effort.

comments and questions

  1. NeXML contains elements for trees and data, but we are lacking controlled vocab to describe steps of the analyses (i.e. alignment, phylogenetic inference, consensus trees, etc). Would an ontology of analysis methods and parameters be part of the standard?
    • actually, the latest bleeding edge version of CDAO has imported terms from the mygrid-biomoby services ontology, which covers operations (e.g., alignment), algorithms (e.g., progressive pairwise), software (e.g., ClustalW) and formats (e.g., aln format). We'll try to have this ready for the VoCamp Arlin 20:42, 2 November 2009 (UTC)
  2. I would be very interested in working on this pitch. A variety of phylogenetic reconstruction programs provide reasonably small input and output files that contain all of the information necessary to ensure repeatability of analyses, yet few journals require this information to be made available. Would a deliverable on such a pitch perhaps be: 1) a list of information necessary to incorporate when publishing phylogenetic reconstructions (and why), and 2) a proposed format or file type for said information? The latter is the more difficult option because we would need to create some sort of converter or simple GUI that converts various program input/output into a standard format. -J. Reece

Extracting legacy data (knowledge) from printed matter

This may be beyond the scope of VoCamp, but I think it should be noted that much of the data we need to create an ontology is held in a "legacy" format (print). Perhaps some brainstorming should be done on methods for data extraction or for building a bridge from extracted data to an ontology. An automated method for transferring information from the printed page to an ontology is necessary to address a task of this scale. - Anne Thessen

comments and questions

  1. Does "the data we need to create an ontology" refer to observable facts, or is this ontological knowledge of the nature of classes and relations?
  2. I was thinking along the lines of observable facts, but it could include both.

Ecologically focused ontologies

I would like to see the development of ecologically-focused ontologies. By this, I mean ontologies that can be used to describe not only relationships, but also the environmental data that provide the context for taxonomic data. This is a huge problem that is not going to be solved at this meeting, but I would like to take advantage of the opportunity that getting so many heads together provides to at least get some brainstorming done. My work on the Arctic Ocean, plankton dynamics and the Census of Marine Life has brought to my attention the need for such vocabularies. - Anne Thessen

comments and questions

  1. What about the EnvO (Environment Ontology) project?
  2. Geological ontologies must also be considered for fossil taxa. --OpenIDUser6 09:17, 28 October 2009 (UTC)

Assessing compatibility issues with regard to existing ontologies and vocabularies

I'd like to make sure that the developments here remain compatible with other bioinformatics vocabularies and ontologies. At a basic level, there is a need for a project to assess, and report publicly on, the ontologies relevant to phyloinformatics. Related work includes: CDAO, MIAPA, NCBI's taxids, Sequence Ontology (SO), Protein Ontology (PRO), Multiple Alignment Ontology (MAO), "quest for orthologs" (http://genomebiology.com/2009/10/9/403), homology ontology (http://bgee.unil.ch/download/homology_ontology.obo), OBO relations ontology, Ontology for Biomedical Investigations (OBI) and many others. I think it would be useful to identify as many related projects as possible and to see how we can interact with them. - Julie Thompson

comments and questions

  1. Since there are so many different ontologies and vocabularies out there already, I think a significant investment should be made on integration and normalization. Potential issues include accessibility and formatting. - Anne Thessen
  2. I concur with Anne and Julie that we should explore links between ontologies and controlled vocabularies across the biological and earth sciences. I personally deal with trait, ecological data (including time-series and interaction webs), phylogenetic, population genetic, land use, climate, hydrological and geological data - David Kidd.
  3. Is there a way to search ontologies for specific terms? Answer: you can search all OBO ontologies for a specific term (http://bioportal.bioontology.org/). I don't know about others, maybe in Protégé?
  4. OBO recommends that the way to address overlaps with another ontology is to join in the development team of that ontology. I think that, in practice, there is not an agreed way to reconcile disagreements over provenance of a term that appears in multiple ontologies. Ideally they would all import the term from one ontology where the term is most "at home". Arlin 13:31, 29 October 2009 (UTC)

Vocabularies related to Treatment/Assay are required

Crop Ontology (CO) [1] focuses mainly on developing the vocabularies required for sharing crop germplasm, passport information, and phenotype-related data. Phenotypes are not easily predictable because of genotype-environment association. In agriculture, researchers want to query trait data for breeding purposes and to compare data across species. However, I feel that the currently available vocabularies and ontologies are not sufficient for capturing information about methods, assays and treatments. In this meeting, I particularly wish to focus on a "Treatment ontology" (we can choose a more suitable name). In agriculture, a single trait is observed or assayed in different environments, under different conditions and treatments, and at various time intervals. For example, plant height is measured at various growth stages at different field experimental stations under various treatments; in a drought experiment, the level of water stress may be well-watered, intermediate water stress, or severe water stress after 2, 4, or 6 weeks of irrigation. Existing vocabularies and ontologies are not enough to capture such information. However, it is also difficult to standardize vocabularies and develop a generic ontology that can be applied to all crops. Therefore, I wish to open a discussion on how we can manage such information using ontologies. I think Anne, Julie and David could help, as we have similar issues to discuss and solve in the future. - Rosemary Shrestha

Metadata standard for annotating triples

Though it is very early in the project, I'd like to describe to interested participants what the goals of the Concept Web Alliance are. These are, generally speaking, two-fold. One is similar to shared-names in that concepts have a non-language-based identifier that has a predictable resolution mechanism to get to the human-readable concept in whatever language you wish; the second is to develop a standard set of metadata to annotate triples (e.g. given that a triple represents a statement of "fact", who said it, when, under what circumstance, etc.). What metadata is required, at a minimum, to annotate triples? Discussion within this group would be welcome! - Mark Wilkinson
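
As a starting point for that discussion, here is a minimal sketch of a triple carrying the three kinds of metadata named above (who, when, under what circumstance). The class and field names are hypothetical and are not the Concept Web Alliance's schema; the sketch only shows that annotations attach to the statement as a whole rather than to its subject or object.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Triple:
    """A single statement of 'fact'."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Annotation:
    """Hypothetical minimal metadata about one assertion of a triple."""
    asserted_by: str       # who said it
    asserted_at: datetime  # when it was said
    circumstance: str      # under what circumstance (free text here)


def annotate(store, triple, annotation):
    """Record that `triple` was asserted with the given metadata."""
    store.setdefault(triple, []).append(annotation)
    return store


def who_said(store, triple):
    """Sorted list of everyone who asserted `triple` (empty if nobody did)."""
    return sorted({a.asserted_by for a in store.get(triple, [])})


# Usage: the same statement asserted twice by different (made-up) sources.
store = {}
fact = Triple("taxon:lynx", "conservationStatus", "threatened")
annotate(store, fact, Annotation(
    "curator-A", datetime(2009, 11, 1, tzinfo=timezone.utc), "field survey"))
annotate(store, fact, Annotation(
    "curator-B", datetime(2009, 11, 2, tzinfo=timezone.utc), "literature review"))
```

Keeping the triple immutable and using it as the dictionary key means several independent annotations can accumulate on one statement, which is one way to represent agreement or disagreement between sources.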

Shared names initiatives

There are several "shared names" initiatives emerging in the broader bioinformatics community (i.e., predictable URIs with predictable and useful resolution behaviours to represent the entities in biology/bioinformatics). Each has some nice features that are more or less unique to that initiative. For example, the Science Commons / OBO Foundry et al. shared-names proposal has proposed a way to register new types of identifiers; the Dumontier shared-names proposal (http://code.google.com/p/semanticscience/wiki/serv and http://code.google.com/p/semanticscience/wiki/snr and http://codemonkey.dumontierlab.com/wiki/doku.php?id=sharedname) makes a clear distinction between a database record and the biological entity that the record is about (so that you don't end up saying that database records are biologicalInhibitorsOf database records!); and the LSRN is "actively being used" in the community (which doesn't mean that it "wins", but it sure helps!). Since Jonathan and I are at the VoCamp, and can speak to at least two of the three proposals between us, I'd like at a minimum to explore the various solutions and the rationale behind them, and to get community feedback on which direction people would be willing to support. -- Mark Wilkinson

comments and questions

  1. could you list the distinct features of each?
  2. Does the community need to have just one standard? Can the underlying problem be stated (in terms of functional demands from the community) so that there are possible solutions other than picking a winner?

Standard for representing classifications as Linked Data

PESI (mainly me) is working on a standard way of presenting classifications on the web as Linked Data, based on the TDWG taxon concept vocabulary. It is documented here: http://www.hyam.net/publications/LinkedTaxaTutorial (draft). This should enable multiple classifications to be imported into a triple store and consensus classifications to be developed using inference. During the DB hackathon earlier in the year we developed XSLT to convert NeXML into the same TDWG vocabulary. We should be able to combine existing classifications and phylogenies in a triple store and relate them using owl:equivalentClass declarations to test hypotheses about how phylogenetic results are related to existing traditional taxonomies. This approach could lead to a "ClassBank" - a GenBank/TreeBase-like repository for traditional/historic taxonomic classifications. The challenge would be to create a triple store with several classifications and several phylogenies in it, link them together using assertions in OWL, browse the inferred network of classes, and answer a single question: "If a specimen is identified to X, which other taxa and clades must we assume it belongs to, given the assertions made in the store?"
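
The closing question can be illustrated with a small graph search. The sketch below is plain Python rather than an OWL reasoner, and it covers only two kinds of assertion: a subclass link (followed upward) and owl:equivalentClass (followed in both directions). The class identifiers and the links between them are invented for the example and do not come from any real classification or phylogeny.

```python
from collections import deque

# Hypothetical assertions loaded from one classification ("classA:")
# and one phylogeny ("phylo:").
SUBCLASS_OF = {
    ("classA:Toxicodendron", "classA:Anacardiaceae"),
    ("classA:Anacardiaceae", "classA:Sapindales"),
    ("phylo:clade7", "phylo:clade3"),
}
# Assertions linking the two sources.
EQUIVALENT = {
    ("classA:Anacardiaceae", "phylo:clade7"),
}


def implied_classes(start):
    """Every class a member of `start` must also belong to."""
    neighbours = {}
    for child, parent in SUBCLASS_OF:
        neighbours.setdefault(child, set()).add(parent)
    for a, b in EQUIVALENT:
        # Equivalence is symmetric, so it is traversable both ways.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over the assertion graph
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen
```

With these toy assertions, a specimen identified to `classA:Toxicodendron` is inferred to belong to two classes from the classification and, through the single equivalence, to two clades from the phylogeny. A real store would delegate this to a reasoner, which would also handle disjointness and inconsistent assertions that this sketch ignores.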

comments and questions

  1. this project has a clearly defined goal that (at least, at the level of a proof-of-concept implementation) is within reach
  2. be sure to check out HICLAS, a database for storing classifications + all sorts of fancy queries. Contact Sakti Pramanik <pramanik@cse.msu.edu> for more info.

Research the status of the TDWG Ontology

Research the status of the TDWG Ontology (http://wiki.tdwg.org/twiki/bin/view/TAG/TDWGOntology) in view of the recently ratified Darwin Core standard vocabulary (http://rs.tdwg.org/dwc) for biodiversity information. Active development of the TDWG Ontology ceased before it was fully mature, and well before it was ready to be proposed as a standard. Even so, emerging TDWG standards refer to the TDWG Ontology. The problem at hand is to determine the Ontology's place among TDWG standards and how it relates to, informs, or binds other standards in the TDWG family.

comments and questions

  1. This is a really important pitch. If you accept the p4-feedback mailing list remarks about rdf:Property there may be a fairly large disconnect between OWL 2 and the DC/DwC style of ontology. IMO this should not be taken lightly, because OWL 2 addresses keys (in the sense of databases), which in OWL 1 are a big impediment to reasoning on relational databases, because InverseFunctional forces you into OWL Full. --Bob Morris 03:23, 20 October 2009 (UTC)
  2. I don't understand what this is about. Can someone please state the problem of "status" explicitly? What types of responses to this issue would represent suitable VoCamp goals? Arlin 13:46, 23 October 2009 (UTC) Done. Tuco 03:28 1 Nov 2009 (UTC)
  3. Could someone fix the link to the rdf:Property remarks? Done. --Hilmar 20:25, 29 October 2009 (UTC)

Semantics of Phyloreferencing

The analogy is with georeferencing. Once everyone agreed on ways of tagging things with latitude-longitude, mapping tools could trawl around the web and decorate themselves with georeferenced images, data, etc. What if you could phyloreference things on the web? For example, suppose you have a phytochemical database, and for each entry you want to indicate where that chemical first evolved on a phylogeny (i.e. any relevant phylogeny). e.g. "The phytochemical Urushiol (= the poison in poison ivy) first evolved at the clade MRCA(inclusion_specifiers = {'Toxicodendron pubescens', 'Rhus dentata'}, exclusion_specifiers = {'Acer negundo', 'Commiphora myrrha'} )". Once databases around the web have been marked up in this fashion, someone with a given tree can trawl around and gather up all data points that map to its nodes and edges. Challenges include the semantic heterogeneity in OTU labels and the fact that relevant trees don't necessarily have the full set of inclusion and exclusion specifiers. - Bill Piel.
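
A phyloreference of this kind can be resolved against a particular tree roughly as follows. This is a sketch under stated assumptions: the tree topology and internal node names (`root`, `n1`, `n2`) are invented for illustration and are not a real phylogeny of these species, and the reading of the definition (take the MRCA of the inclusion specifiers, then reject it if any exclusion specifier falls inside that clade) is one plausible interpretation rather than an agreed standard.

```python
# Child -> parent map for a toy tree (hypothetical topology).
PARENT = {
    "n1": "root",
    "Acer negundo": "root",
    "n2": "n1",
    "Commiphora myrrha": "n1",
    "Toxicodendron pubescens": "n2",
    "Rhus dentata": "n2",
}


def ancestors(node):
    """Path from `node` up to the root, including `node` itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def mrca(tips):
    """Most recent common ancestor of a non-empty collection of tips."""
    paths = [ancestors(t) for t in sorted(tips)]
    common = set(paths[0]).intersection(*paths[1:])
    # The first shared node on the way up from any tip is the MRCA.
    for node in paths[0]:
        if node in common:
            return node
    return None


def resolve(inclusion, exclusion):
    """Node the phyloreference maps to on this tree, or None if it does not apply."""
    if not inclusion or any(t not in PARENT for t in inclusion):
        return None  # an inclusion specifier is missing from this tree
    node = mrca(inclusion)
    for t in exclusion:
        if t in PARENT and node in ancestors(t):
            return None  # an excluded taxon sits inside the candidate clade
    return node
```

Returning `None` when an inclusion specifier is absent reflects the challenge noted above: a given tree may not contain the full set of specifiers, and matching tips by exact label will also fail wherever OTU labels are semantically heterogeneous.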

comments and questions