Biological Classification Meets Description Logic

From Evoio


Classification systems

Let's define a "classification system" to be an entity that answers questions of the form "is x a member of class C?" based on answers to questions about x. As a consequence, it should also be able to answer questions like "what are some members of C?" and "what are some classes to which individual x belongs?". An individual that belongs to no classes is said to be unclassifiable.

The symbols C are arbitrary; they are given meaning by the classification system by virtue of the system's answers to questions about it. That is, when a system answers "yes" to "Is this finch a member of class 'reptile'?" that is OK because each classification system gets to define tokens like "reptile" however it wishes.

Classification systems can be characterized in various ways, for example:

  • A classification system might or might not be transparent. A transparent system is one whose rules can be communicated well enough that anyone can correctly make judgments for it. Transparency may also allow one to answer some categorical questions such as "do all members of C have property P?" or "what are the properties of members of C?". The opposite of transparent would be authority-based: there is no way to check what it says.
  • A classification system might or might not be a hierarchy. In a hierarchy, for any two classes that share a member, one is a subclass of the other. (By 'A is a subclass of B' I mean that whenever the system judges x to be a member of A, it will also judge x to be a member of B.)
  • A classification system might or might not be complete, in the sense of being willing to answer all "is x a member of C" questions either yes or no (never "I don't know" or "I won't say").
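The definitions so far can be sketched in Python. This is only an illustration under stated assumptions: the individuals form a finite, known set, and a judgment is a function returning True, False, or None (for "I don't know" or "I won't say"). The names `members_of`, `classes_of`, `is_subclass`, `is_complete`, and the toy table are invented for this sketch and belong to no existing library.

```python
from typing import Callable, Iterable, Optional, Set

# A judgment answers "is x a member of class C?" with True or False,
# or None when the system declines to answer.
Judgment = Callable[[str, str], Optional[bool]]

def members_of(judge: Judgment, individuals: Iterable[str], c: str) -> Set[str]:
    """Answer "what are some members of C?" over the known individuals."""
    return {x for x in individuals if judge(x, c) is True}

def classes_of(judge: Judgment, classes: Iterable[str], x: str) -> Set[str]:
    """Answer "what are some classes to which x belongs?"."""
    return {c for c in classes if judge(x, c) is True}

def is_subclass(judge: Judgment, individuals: Iterable[str], a: str, b: str) -> bool:
    """A is a subclass of B if every x judged to be in A is also judged to be in B.
    Only the supplied individuals are checked, so this is evidence, not proof."""
    return all(judge(x, b) is True for x in individuals if judge(x, a) is True)

def is_complete(judge: Judgment, individuals: Iterable[str], classes: Iterable[str]) -> bool:
    """Complete means every membership question gets a yes or a no."""
    class_list = list(classes)
    return all(judge(x, c) is not None for x in individuals for c in class_list)

# Hypothetical toy system: the tokens mean whatever this table says they mean.
TABLE = {
    ("finch1", "animal"): True, ("finch1", "bird"): True, ("finch1", "reptile"): False,
    ("gecko1", "animal"): True, ("gecko1", "bird"): False, ("gecko1", "reptile"): True,
}
INDIVIDUALS = ["finch1", "gecko1"]
CLASSES = ["animal", "bird", "reptile"]

def toy_judge(x: str, c: str) -> Optional[bool]:
    return TABLE.get((x, c))  # None for anything the table does not cover
```

An individual for which `classes_of` returns the empty set is unclassifiable in the sense used above. A table-driven judge like this one is authority-based rather than transparent, since nothing in it explains why a judgment comes out the way it does.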

Classes in a classification system can have properties, too:

  • A class C is descent-closed if whenever x is descended from y, and y is in C, then x is in C. This may require positing the existence of unobservable individuals y, so that such statements can be made.
  • A class C is a species if all of its members are conspecific with some designated type specimen. Transparency would require a reproducible rule for making judgments of conspecificity.
  • A class C is a clade if it is descent-closed and if C started out as a conspecific cohort (that is, there was/is a time up until which all its extant members were conspecific). To be transparent, such a judgment would require a common understanding of conspecificity, which could be achieved either by fiat or by persuasion.
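Descent-closure is the one property in this list that can be checked mechanically once a descent relation is written down. The following sketch assumes a finite, explicitly recorded parent-to-children mapping (which, as noted above, may include posited individuals); `descendants` and `is_descent_closed` are hypothetical helper names, and conspecificity is deliberately left out because it has no agreed test.

```python
from typing import Dict, Iterable, Set

def descendants(children: Dict[str, Iterable[str]], y: str) -> Set[str]:
    """All individuals descended from y, following child links transitively."""
    seen: Set[str] = set()
    stack = list(children.get(y, ()))
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(children.get(x, ()))
    return seen

def is_descent_closed(members: Set[str], children: Dict[str, Iterable[str]]) -> bool:
    """C is descent-closed if every descendant of a member of C is also in C."""
    return all(descendants(children, y) <= members for y in members)

# Hypothetical pedigree: a begat b and c, b begat d.
PEDIGREE = {"a": ["b", "c"], "b": ["d"]}
```

With this pedigree, the class {b, d} is descent-closed, while {a, b} is not, because c and d descend from a but are left out.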

Being phylogenetic is another property of a classification system:

  • A phylogeny is a classification system each of whose classes is a clade.

A classification might be called an alpha taxonomy if all of its classes are species.

Classification by example

Providing examples is an excellent way to promote transparency. If anyone wants to know whether a given specimen belongs to a taxon, they can always compare the specimen to examples that have been asserted as belonging to that taxon. If an adequate range of examples has been provided, then assessing membership of one of those examples is trivial, and taxon membership of a novel specimen can be assessed by similarity. This is risky, in that the classification system's authority may disagree with someone else's placement of a novel specimen in any given taxon, but at least some basis for classification has been implied in a manner that might muster support.

With respect to species, the problem constantly faced by systematists is that judgments of conspecificity may differ. Hypotheses of conspecificity take the form of objective species descriptions, which are judgments that the class of individuals possessing the diagnostic characters coincides with the class of conspecifics of the type specimen. If we have conflicting descriptions published for the name N, then there are three classes: N = conspecifics of the type, N1 = individuals meeting diagnostic criteria 1, N2 = individuals meeting diagnostic criteria 2. Whether anyone judges N = N1 or N = N2 depends on how they form determinations of conspecificity, and as this has neither consensus definition nor transparency through direct observation, confusion is completely predictable if the three classes are not given three different names.

The case of paratypes helps to clarify the distinction between classes based on conspecificity and species "concept" classes based on characters: the paratypes can be seen to have the characters, and therefore belong to the diagnostic-based class, even if conspecificity of the paratypes with the type, and therefore their membership in the same species, is difficult to establish.

Where does this get us?

The main ideas here are:

  1. All classifications are made by fiat, so when someone hands you a classification, don't quibble over what they call the classes; just try to understand on what basis they are making classification judgments.
  2. Two classification systems can be joined without contradiction as long as their respective class symbol sets are disjoint. If the same symbol is used in both, one can introduce "sensu X" and "sensu Y" to force consistency. Then you can separately evaluate hypotheses of equation or containment between classes, perhaps with the help of automated tools.
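The second idea can be illustrated with a small sketch. It assumes each system is available as a finite mapping from class symbol to the set of individuals it has judged to be members, and for simplicity it qualifies every symbol with "sensu", not only the clashing ones. The function names and the Smith/Jones data are invented for illustration.

```python
from typing import Dict, Set

def join_systems(systems: Dict[str, Dict[str, Set[str]]]) -> Dict[str, Set[str]]:
    """Combine systems given as {authority: {class_symbol: members}}.
    Each symbol becomes 'symbol sensu authority', so the joined symbol
    set is disjoint by construction and no judgment is overwritten."""
    joined: Dict[str, Set[str]] = {}
    for authority, classes in systems.items():
        for symbol, members in classes.items():
            joined[f"{symbol} sensu {authority}"] = set(members)
    return joined

def relation(a: Set[str], b: Set[str]) -> str:
    """Evaluate a hypothesis of equation or containment between two classes,
    judged only on the individuals recorded so far."""
    if a == b:
        return "equal"
    if a < b:
        return "contained in"
    if a > b:
        return "contains"
    if a & b:
        return "overlapping"
    return "disjoint"

# Hypothetical data: both authorities use the symbol "S" differently.
JOINED = join_systems({
    "Smith": {"S": {"x", "y"}},
    "Jones": {"S": {"y"}},
})
```

Here `relation(JOINED["S sensu Jones"], JOINED["S sensu Smith"])` reports containment. That is a hypothesis about the two classes, supported only by the individuals examined, and it is kept separate from the act of joining the systems.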

The profit of this exercise is that the idea of "classification system" can in many cases accurately be realized using an "OWL-DL ontology" combined with a DL reasoner. (There may be classification systems for which this is not possible, but hey.) This gives an answer to the original question "how should we publish classifications": Write them in OWL-DL, with rdf:type as the class membership function. Use the DL connectives to express characteristics of classes, when those characteristics are expressible in DL (e.g. subclass relations, disjointness, properties that hold for all members, and so on), and OWL annotation properties or comments only when this is not possible.
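As a rough sketch of what such a publication might look like, the following Python writes membership, subclass, and disjointness statements as N-Triples lines using only the standard library. The rdf:type, rdfs:subClassOf, owl:Class, and owl:disjointWith URIs are the standard ones; the `http://example.org/classification/` namespace, the function names, and the toy data are assumptions made for illustration. No reasoner is run here, and a real ontology would need more declarations than this.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
OWL_DISJOINT_WITH = "http://www.w3.org/2002/07/owl#disjointWith"
BASE = "http://example.org/classification/"  # hypothetical namespace

def triple(s: str, p: str, o: str) -> str:
    """Format one N-Triples statement whose three terms are all URIs."""
    return f"<{s}> <{p}> <{o}> ."

def publish(memberships, subclasses=(), disjoint=()) -> str:
    """memberships: (individual, class) pairs; subclasses and disjoint:
    (class, class) pairs. Returns the statements as N-Triples text."""
    class_names = {c for _, c in memberships}
    for a, b in list(subclasses) + list(disjoint):
        class_names.update((a, b))
    # Declare every class that is mentioned anywhere.
    lines = [triple(BASE + c, RDF_TYPE, OWL_CLASS) for c in sorted(class_names)]
    # Characteristics of classes, expressed with DL connectives.
    lines += [triple(BASE + a, RDFS_SUBCLASS_OF, BASE + b) for a, b in subclasses]
    lines += [triple(BASE + a, OWL_DISJOINT_WITH, BASE + b) for a, b in disjoint]
    # Class membership, with rdf:type as the membership function.
    lines += [triple(BASE + x, RDF_TYPE, BASE + c) for x, c in memberships]
    return "\n".join(lines)

DOC = publish(
    memberships=[("finch1", "bird")],
    subclasses=[("bird", "animal")],
    disjoint=[("bird", "reptile")],
)
```

Given these statements, a DL reasoner would be expected to conclude that finch1 is an animal and to flag any later assertion that finch1 is a reptile as inconsistent.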

To turn this into a practical proposal we would need:

  1. a proof of my assertion that DL can express an interesting body of knowledge about classification
  2. an ontology of classification systems, so that we have a way to write down our judgments about ontologies (hierarchical, phylogenetic, transparent, etc.)
  3. a customized OWL-DL tutorial for the taxonomic community so that people who want to do this kind of thing know how to make best use of OWL-DL, and don't dump what they know onto annotation properties when they don't have to

(Cf. Ghiselin's 2001 article, doi:10.1046/j.1467-2979.2002.00084.x; Michael Donoghue; etc.)

Note: Taxon concept

It is never clear what exactly a "concept" is and isn't, and the ontology community accordingly attempts to avoid this word. If there is nothing that is not a "concept" then the word is useless, and if a "concept" is the same as a "class" in the sense of first-order logic (as used above), then it is probably clearer to say "class" (although people not used to this abstract meaning of the word will get just as confused as they would be by the word "concept").

The idea of a "taxon concept" arose in order to make people stop fighting about species names and authority, and instead start talking about science. Whatever a "taxon concept" is, we know that it is associated with a piece of writing - that is, it aspires to be transparent and actionable in a way that "species" or "subfamily" is not. That is, each taxon concept induces a class of individuals that will be judged as being subsumed by that taxon concept. If it subsumes the same class for each scientist interpreting the piece of writing, then it is of high utility in achieving scientific efficiency.

The DL approach would be to focus on classes instead of on the taxon concepts that induce them. That is, to set up a classification system, give a name such as 'canine' to a class, and explain how one can, within the system, make judgments of the form 'x is a member of canine'. It is the explanation that is the most important thing here, but it is the class that needs a name, not the explanation itself.

Note: Interpreting species names

Species are intended to partition the space of possible specimens. We're often forced to deal with assertions of the form "specimen x belongs to species S" where one diagnostic description attached to the species name S puts x in S, while another description doesn't. If the information regarding which classification system (diagnostic criteria) was applied has been lost, then the judgment only tells us, at best, that one of the published species descriptions for S applied.

However, when there is a collection of such determinations, all made using the same classification system, it may be possible to determine which classification system (e.g. field guide or authoritative taxon revision) was consulted by using collateral information. For example, suppose that 'x is in S' is determined, and x is in the range of S sensu Smith but not in the range of S sensu Jones. Then we can reasonably guess that Smith's classification was used. This in turn tells us something about where y was found, if 'y is in S' was determined by the same authority as 'x is in S'.

That is, if it is impossible to capture information of the form 'Ed says x is in S sensu Smith', and you can only get 'Ed says x is in S', then at least try to get a statement of the form 'Ed consults Smith'. This can be done once per expert, instead of once per specimen.

If you can't even get that information, the determination should be recorded as 'x is in S sensu Ed', so that if some downstream user of the data is lucky enough to find out later which field guide Ed consults, they can make use of that information. Date can also be a useful proxy; if the community as a whole moves from one classification to another, one can say 'S sensu 1840' and 'S sensu 1987' and this will assist a downstream interpretation.
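The range-based guess and the fallback naming rule in this note can be sketched as follows. The sketch assumes that published ranges are available as simple sets of locality labels and that what each expert consults is recorded in a lookup table; `candidate_sources`, `qualify`, and all of the data are hypothetical.

```python
from typing import Dict, List, Set

def candidate_sources(locality: str, ranges: Dict[str, Set[str]]) -> List[str]:
    """Authorities whose published range for the species includes the locality
    where the specimen was found. One candidate is a reasonable guess,
    several candidates leave the question open."""
    return sorted(a for a, r in ranges.items() if locality in r)

def qualify(name: str, expert: str, consults: Dict[str, str]) -> str:
    """Record 'S sensu Smith' when we know which source the expert consults,
    and fall back to 'S sensu <expert>' when we do not."""
    source = consults.get(expert, expert)
    return f"{name} sensu {source}"

# Hypothetical ranges of S according to two authorities.
RANGES = {"Smith": {"north", "south"}, "Jones": {"south"}}
# 'Ed consults Smith' is captured once per expert, not once per specimen.
CONSULTS = {"Ed": "Smith"}
```

A specimen determined as S and found in "north" points to Smith, since that locality lies outside the range given by Jones. A determination by an expert missing from the table is stored under the expert's own name, so that it can be resolved later if the expert's source becomes known.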

Data versus Ontologies

Biodiversity information and classifications build bottom up from raw data deriving from observations of individual specimens. How specimens are grouped or classified varies widely according to the inclinations and skills of the investigators. The path from properties to classes can be circuitous and involved - consider the variety of methods in both alpha and higher taxonomy - but ultimately all classifications derive from particular properties of individual specimens (morphology, genotype, locale, date encountered, life history, behavior, and so on).

The investigators performing primary observations are usually not the same as those who care about accurate alpha or higher classification. They are forced, nonetheless, to act as classifiers by the impracticality of accurately recording all of the properties that would be needed to perform a taxon (usually species) determination. This means that data sets inevitably contain determinations, along with whatever specimen properties were of primary interest to the investigator (e.g. location and date).

We look forward to the adoption of taxon-concept-aware data curation practices, where each assignment of a specimen to a taxon links unequivocally to the objective property-based taxon description that applies to that assignment - that is, "specimen Z belongs to species S sensu authority A" as opposed to "specimen Z belongs to species S". This would create opportunities for careful and confident scholarship of a kind that is at present difficult in biodiversity informatics. However, the real data we have to work with often does not have this form; one simply has an assignment of specimen to taxon (given by simple taxon name), and in many cases the taxon name has multiple incompatible published descriptions. We must still be able to make use of the information at hand as best we can.

The first rule of data processing is to always preserve the original sources in close to their original form as best one can. If downstream processing introduces assumptions, access to sources gives one a chance of checking those assumptions and correcting them if they are wrong. A wholesale import of a data set into a database that resolves, say, a family name to a particular description or circumscription of that family can easily be wrong - a different definition of the family might exist that was used in the data set's identifications, but was unknown to the individual writing the import script. With backlinks to the original data set, there is at least some hope that the error can be repaired.

When rendering such data sets as RDF or OWL, one must choose URIs for taxa and, more importantly, give them meaning. Fortunately we have many choices and are able in many cases to capture the information we have without lying (making unjustified assumptions). Here are two options for rendering "specimen Z belongs to taxon T" as RDF:

  1. Define T = the class that is the union of all published classes (taxon concepts) that have ever been given the name "T". Testing whether a definition of T has been published may be difficult, but at least this is close to being objective, and has a good chance of being true.
  2. Define a name T local to the data set = some published taxon concept that has been assigned the name "T". This will not be a lie as long as T referred to the same taxon concept for all identifications in the data set. We can then, as a separate step, hypothesize that the local T coincides with whatever particular taxon concept T' we want to consider - for example, we may learn later that the study used a particular field guide, and that the field guide draws from some particular authority for "T", so the local T is the same class as the class T defined by the authority.
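Option 2 can be sketched as follows: mint a class URI that is local to the data set, assert memberships against it, and keep any equivalence with a published taxon concept as a separate statement that can be added or withdrawn later. The rdf:type and owl:equivalentClass URIs are the standard ones; the example.org URIs, the `/taxon/` path layout, and the function names are assumptions made for this illustration.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"

def local_taxon_uri(dataset_uri: str, name: str) -> str:
    """URI for 'taxon named T, as used within this data set' (hypothetical layout)."""
    return f"{dataset_uri.rstrip('/')}/taxon/{quote(name)}"

def determination_triples(dataset_uri: str, rows) -> list:
    """rows: (specimen_uri, taxon_name) pairs taken from the source data as-is.
    Each becomes an rdf:type statement against the data set's local class."""
    return [
        f"<{specimen}> <{RDF_TYPE}> <{local_taxon_uri(dataset_uri, name)}> ."
        for specimen, name in rows
    ]

def equivalence_hypothesis(dataset_uri: str, name: str, concept_uri: str) -> str:
    """A separate, retractable claim that the local class coincides with
    a particular published taxon concept."""
    return f"<{local_taxon_uri(dataset_uri, name)}> <{OWL_EQUIVALENT_CLASS}> <{concept_uri}> ."

# Hypothetical data set and authority URIs.
DATASET = "http://example.org/study42/"
FACTS = determination_triples(DATASET, [("http://example.org/specimen/Z", "T")])
GUESS = equivalence_hypothesis(DATASET, "T", "http://example.org/authority/smith/T")
```

Keeping `FACTS` and `GUESS` in separate files or graphs would mean that a wrong guess about the field guide can be discarded without touching the determinations themselves.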

Layer cake

[JAR: I think you (Roger) are talking about a bottom-up system of correct individual commitment followed by collaborative contribution to classification by several parties, starting in the field and going up through phylogeny: specimen -> lower taxon -> higher taxon. But maybe I've got this wrong. This seems good; we need to figure out how to articulate and justify the approach.]

[Roger says, JAR puzzles:] It may be that there can be a standard way of producing a class hierarchy for inference on top of such structures. In terms of a layer diagram it would go something like (here inverted).

Layer 0 -> CSV text files with well specified columns [JAR: not sure what's in these columns can you explain?] - for bulk transport. HTTP URIs are used as globally unique ids for taxa, with an accurate level of commitment chosen as described above (id naming taxon-within-study, id naming taxon-without-definition, id naming taxon-with-authority).

Layer 1 -> Vanilla RDF representation of [lower] taxa and nomenclature - returned from GETs of HTTP URIs. Perhaps a single RDF file per logical classification [system? or event?], although some people would prefer one RDF file per id. [JAR: I think you have in mind some expectation that certain specific information might be expected uniformly across all ids. This is a good idea and should be addressed as a protocol issue. This is orthogonal to the issue I was trying to sort out, which is getting the semantics right so that integrations are possible.]

Layer 2 -> OWL ontology or ontologies that can import [be combined with?] data [JAR: what kind of data?] (possibly scripts to run to generate class hierarchies [you mean, the ontology might be generated by a script?])

Layer 3 -> Inference and queries built around other [?] classes

From the point of view of the community this approach is low risk. If this semantic web stuff turns out to be rubbish then there is good old CSV entity-relationship type stuff at the base of it all. The RDF is really simple and can be parsed and used in non-semantic ways. Using simple content negotiation it would be easy to add JSON or other renderings at Layer 1 in the future. It gives us the opportunity to play with SPARQL as well as inference (e.g. DL) over realistic amounts of data.

See also work going on at http://code.google.com/p/tdwg-ontology/ especially http://code.google.com/p/tdwg-ontology/source/browse/trunk/docs/publishing_taxa/index.html
