Taxonomic Reasoning

From Evolutionary Interoperability and Outreach

Overall Plan

  • A set of questions involving inference using phylogenies and related data
  • Details of how an ontology should look to support those inferences
  • Description of current representation systems and what inferences they support

Sunday, Nov 8


The Day's Plan

Come up with high level topics
Develop one or two very defined questions that could use ontologies for inference
Develop one use case using inference that might be implemented now

High Level Topics

Identification Use Case

I saw a black bird in Montpellier in November, what is it?
The bird I saw looks like a bird I know, does that help?

Ecosystem shift Use Case

Is there a connection between ice cover reduction and a change in fish population in the Barents Sea?
The fish population in the Barents Sea is changing, infer some reasons why that might be.

Comparative anatomy Use Case

When cancers stop responding to hormone therapy, things get worse quickly. When bears hibernate, they take a "hormone vacation".

Some bears hibernate, some, like the sun bear, don't. Compare humans, hibernating bears and non-hibernating bears to see if we can isolate cellular or genetic differences that account for the ability to hibernate.

Data Sets Useful for Each Broad Use Case


biogeography over time, including range, habitat and occurrences

Ecosystem Shift

ice cover over time
fish phylogeny
fish taxonomy
fishery data, fish landings, trawls
fish traits to possibly infer food webs

Comparative Anatomy Use Case

pathology information for humans and bears
tissue and endocrine interaction model
genomes for species
difference representation
bear phylogeny, check whether some sloth bears hibernate and others don't

Use Case Refinement

Come up with specific questions in each use case
Narrow them to something we *can* make inferences about
Outline what the ontologies would have to look like to support those inferences

Ecosystem Shift

Given our access to data, we're moving from fish to mammals. Here's the use case. Given a temperature gradient over a large transect, generate phylogenetic trees using observations of species at various points along the transect. From these trees, gather traits and also infer new species that may be of interest. Compare the resulting traits at different points along the transect to infer what sorts of change might be occurring between measurement points.

This use case uses the following data sets. A dash between two data sets signifies that a mapping from one type of data to the other will be necessary.

temperature - geography (transects) - observations (range maps) - taxonomy - phylogeny - traits

Miscellaneous Observations

Much talk about confidence values for such things as traits and identifications.

Interest in ranking results

Using phylogeny to study community composition

Ideas about what kind of phylogeny tree metadata should be returned - MarkS

Do we need alternative forms of reasoning?

non-monotonic

Next Steps

The real next step is to figure out what the next steps should be. Here are some ideas:
Map out use case more thoroughly
Identify points of connection between a subset of the ontologies
Mock something up, maybe in Protege


Established the workflow

geography (transects)

linked to

observations (species name and location)

linked to

species names

linked to

phylogeny (nodes, ancestors, descendants)

need to calculate most recent ancestor via some external process from that

descendants of most recent ancestor

linked to

traits in an inheritance hierarchy

Described tools needed to link the data types

The ontology links the data sets. Instances in the data sets are linked via properties. Many of the properties have inverses which may be used in reasoning.

Hacked out ontology

Here it is: File:Vocamp taxo reasoning.owl.txt

Discussed the kinds of data we needed. Struggled trying to get those data. Just started making things up, which resulted in progress.

Encountered a challenge trying to determine the most recent ancestor using description logic (DL). The working hypothesis is that the open world assumption is making it difficult. Further research is necessary. A bit of investigation showed that Jena has a getLCA() function that could be used to find the most recent ancestor. Alternatively, we could use a workflow system like Kepler to pass the output of the first part of our reasoning to another process that determines the most recent ancestor and passes it back to our ontology for further processing.
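To illustrate the kind of external step described here, the following is a minimal Python sketch that finds the most recent common ancestor from child-to-parent links. It is not the Jena or Kepler approach, just a plain tree walk. The node names and the `PARENT` table are invented for illustration, and the sketch assumes each node has at most one parent.

```python
# Hypothetical child -> parent links; the root ("mammals") has no entry.
PARENT = {
    "capybara": "caviomorphs",
    "guinea_pig": "caviomorphs",
    "caviomorphs": "rodents",
    "beaver": "rodents",
    "rodents": "mammals",
    "sun_bear": "bears",
    "bears": "mammals",
}


def path_to_root(node):
    """Return [node, parent, grandparent, ...] ending at the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def most_recent_common_ancestor(nodes):
    """Return the deepest node that lies on every input node's path to the root.

    A node counts as its own ancestor here, so a single input returns itself.
    """
    nodes = list(nodes)
    if not nodes:
        raise ValueError("need at least one node")
    first_path = path_to_root(nodes[0])
    shared = set(first_path)
    for other in nodes[1:]:
        shared &= set(path_to_root(other))
    # first_path runs from the node towards the root, so the first shared
    # entry is the most recent one.
    for candidate in first_path:
        if candidate in shared:
            return candidate
    return None  # only reachable if the nodes sit in different trees


print(most_recent_common_ancestor(["capybara", "beaver"]))
```

The result of such a function could then be handed back to the ontology as an ordinary asserted individual, which is the hand-off the workflow idea implies.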

The phylogeny is not a subsumption hierarchy. Instead nodes are instances of a node class. Each node can have a name. Nodes are linked via hasAncestor/hasDescendant properties.

Other data sets, such as the trait data set, do have an inheritance hierarchy, which we leverage to provide interesting inferences (see below).
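A small Python sketch of how a trait hierarchy can yield such inferences: a species tagged only with a specific trait is also returned when asking about a more general trait class. The trait names, species and table layout are assumptions made up for this example, not the contents of the workshop ontology.

```python
# Hypothetical trait hierarchy: subclass -> direct superclass.
TRAIT_SUPERCLASS = {
    "small_size": "body_size",
    "big_size": "body_size",
    "fusiform": "body_shape",
    "body_size": "morphological_trait",
    "body_shape": "morphological_trait",
}

# Hypothetical assertions: each species is tagged with specific traits only.
SPECIES_TRAITS = {
    "capybara": {"big_size"},
    "guinea_pig": {"small_size"},
    "cod": {"fusiform"},
}


def superclasses(trait):
    """Return the trait together with all of its superclasses."""
    found = {trait}
    while trait in TRAIT_SUPERCLASS:
        trait = TRAIT_SUPERCLASS[trait]
        found.add(trait)
    return found


def species_with(trait_class):
    """Species having at least one trait that falls under trait_class."""
    return sorted(
        species
        for species, traits in SPECIES_TRAITS.items()
        if any(trait_class in superclasses(t) for t in traits)
    )


# Nobody asserted "body_size" directly, yet both rodents are found.
print(species_with("body_size"))
```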

Having the nodes as instances permits us to link the different types of data directly via properties. We did not try to do this by using classes to represent phylogenetic nodes, though it would be interesting to try.


Reviewed reasoning

From location determine species present there
We can then determine the traits of those species
Have to leave system to get the most recent ancestor of that set of species
Once we have that we can determine all descendants
Then we can determine the traits of those descendants

What's nice is that we attach traits to species and the final results (traits of observed species and of the clade) are automatically determined
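The steps above can be sketched in Python as a chain of lookups. All data, names and function signatures below are invented for illustration. The most recent ancestor step is passed in as a callable (`find_mrca`, a hypothetical name) to mirror the fact that it happens outside the reasoner, and here it is only a lookup-table stub.

```python
# All data below is invented for illustration.
OBSERVATIONS = [  # (location, species name)
    ("transect_point_1", "capybara"),
    ("transect_point_1", "beaver"),
    ("transect_point_2", "beaver"),
]
CHILDREN = {  # phylogeny node -> direct descendants
    "rodents": ["caviomorphs", "beaver"],
    "caviomorphs": ["capybara", "guinea_pig"],
}
TRAITS = {  # species -> traits attached to it
    "capybara": {"big_size"},
    "guinea_pig": {"small_size"},
    "beaver": {"big_size", "flat_tail"},
}


def species_at(location):
    """Step 1: from a location, determine the species observed there."""
    return {name for place, name in OBSERVATIONS if place == location}


def traits_of(names):
    """Steps 2 and 5: collect the traits attached to a set of nodes."""
    result = set()
    for name in names:
        result |= TRAITS.get(name, set())
    return result


def descendants(node):
    """Step 4: all nodes below the given phylogeny node."""
    found = set()
    stack = list(CHILDREN.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(CHILDREN.get(current, []))
    return found


def clade_traits(location, find_mrca):
    """Run the five steps; find_mrca stands in for the external process."""
    observed = species_at(location)
    ancestor = find_mrca(observed)  # step 3 happens outside the reasoner
    return traits_of(observed), traits_of(descendants(ancestor))


def stub_mrca(species):
    """Lookup-table stand-in for whatever external tool computes the ancestor."""
    answers = {frozenset({"capybara", "beaver"}): "rodents"}
    return answers[frozenset(species)]


observed_traits, all_clade_traits = clade_traits("transect_point_1", stub_mrca)
print(sorted(observed_traits), sorted(all_clade_traits))
```

In this toy data the clade adds `small_size` (from the unobserved guinea pig) to the traits seen at the location, which is the kind of result the clade step is meant to surface.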

We're relating three different data sets:

traits of species in a subsumption hierarchy
observations of species with location and name
phylogenetic relationships amongst the species

Look at other types of reasoning

Reasoning through inverse and subsumption relations

We checked how easy it would be to do a new kind of inference by simply adding a new relation and saying it's the inverse of a relation we already used. And it was EASY!

Tell me the locations where we know we have species for which we have body shape data.

hasResident some (hasTrait some body_shape)
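For readers less familiar with class expressions, here is a rough Python analogue of that query. It is only a sketch under assumptions: the relation name `residesIn`, the locations and the trait assignments are all hypothetical, and each species' trait set is assumed to already include the superclasses a reasoner would have inferred.

```python
# Hypothetical asserted facts for a relation we call residesIn: (species, location).
RESIDES_IN = [
    ("capybara", "pantanal"),
    ("cod", "barents_sea"),
    ("sun_bear", "borneo"),
]

# Hypothetical trait classes per species, with superclasses already filled in.
HAS_TRAIT = {
    "capybara": {"body_shape", "body_size"},
    "cod": {"body_shape"},
    "sun_bear": {"body_size"},
}


def inverse(pairs):
    """Derive the inverse relation by swapping subject and object."""
    return [(obj, subj) for subj, obj in pairs]


# Declaring hasResident as the inverse of residesIn gives (location, species).
HAS_RESIDENT = inverse(RESIDES_IN)


def locations_with_trait_class(trait_class):
    """Analogue of: hasResident some (hasTrait some <trait_class>)."""
    return sorted(
        {
            location
            for location, species in HAS_RESIDENT
            if trait_class in HAS_TRAIT.get(species, set())
        }
    )


print(locations_with_trait_class("body_shape"))
```

No new facts were asserted to answer the question; the only addition is the inverse relation, which is the point made above.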

Challenge: What about specimens?

In our representation, species are represented as instances. What if we wanted to have specimens as the instances? So then, capybara could have both small_size and big_size if we saw examples of both

Is there a way to do both?
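One way to picture the specimen-level variant is sketched below in Python. This is not something the group implemented. The specimen records and names are invented, and taking the union of specimen traits is just one assumed way to derive a species-level view.

```python
from collections import defaultdict

# Hypothetical specimen records: (specimen id, species, observed traits).
SPECIMENS = [
    ("specimen_1", "capybara", {"small_size"}),
    ("specimen_2", "capybara", {"big_size"}),
    ("specimen_3", "beaver", {"big_size"}),
]


def species_traits(specimens):
    """Union the traits of all specimens belonging to each species."""
    merged = defaultdict(set)
    for _specimen_id, species, traits in specimens:
        merged[species] |= traits
    return dict(merged)


SPECIES_TRAITS = species_traits(SPECIMENS)
# capybara ends up with both small_size and big_size.
print(sorted(SPECIES_TRAITS["capybara"]))
```

Whether such a derived species view is an acceptable answer to the question above remains open.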

Lessons Learned

Computer science folks and biologists concur that working together as a group was mutually beneficial. Participants agreed that jointly working out a use case, even though not entirely practical, was extremely useful in clarifying both the data discovery and integration needs of domain scientists, and the capabilities (and limitations) of semantics-assisted approaches. Having the advantage of an experienced knowledge modeler and Protege user, and spending the time together working on our use case, had tremendous bi-directional heuristic value.

Observations of our Ontology Development Process

We didn't really have an idea how to go about developing the ontology. We did a fair amount of work determining use cases that we thought we had data for, were interested in, and that would exercise our area of interest. At some point it became clear that progress would best be made by sitting down around a keyboard and implementing a prototype.

Here is a list of the things that we needed to clarify and that would have been nice to know before we tried doing anything.

  • what is inference?
  • why are we using semantic web technologies?
  • what are some good ways to go about designing ontologies?

We probably should have STARTED with a reasoning mini-bootcamp. We got to that about halfway through the second day.

We iteratively added knowledge to the ontology and generated questions that would require additional knowledge. The cycle of building the knowledge base and then determining the queries that the base could and should answer was only possible because of the collaboration between the biologists and computer scientists in the group.

Concluding Comments

In conclusion, the biologists learned a lot about how to use informatics tools. Anne will go home and try to use them by herself on a small scale and later expand and get other biologists interested in using informatics tools. Dave T will go home and work on a grant proposal revolving around the integration of the data set types we attempted to integrate here.


The primary accomplishments of the group were:

  • A list of grand challenges within the biological domain that could benefit from applying reasoning techniques in data integration and data discovery
  • A small set of use cases within those grand challenges
  • A working ontology that integrated many types of data sets and used reasoning to answer domain-relevant queries
  • A slogan: don't romanticize, semanticize!

Open challenges

  • Data discovery: How do you find the data sets? How can you plug the data sets in?
  • Placement of the ontology within a workflow for the parts of the application that cannot be satisfied via description logic representations and reasoning.
  • How to settle on the best possible model, given that there are many ways to model anything.