Goal: How to harness the PhyloCode for phylogenetic queries

People: Nico, Bill, Arlin, Sheldon, Josh, Hilmar, Karen, Torsten

VoCamp Phyloreferencing Subgroup

Ideas and rough plan

Phyloreferencing clades and nodes in trees
Possibility to draw on PhyloCode conventions for referencing clades
PhyloCode is a subset, i.e. we may want to search for nodes / clades not defined in PhyloCode
In reality, to support a range of useful queries, we will need a taxonomy resolution service (RegNum will be one)
Incorporate into PhyloWS standard
Compliant with CQL query language
Will support queries directed at phylogenetic data providers
Also support association of metadata with phylogenies

Desired outcomes

Use-case queries
Phylogenetic expressions and query syntax for each use-case
OWL representation of components
Component vocabularies

People & skills

Tree data providers (TreeBase, PhyLoTA)
PhyloCode expert
Data aggregators (iPlant, EOL)

Scoping

We made a restrictive prior decision to focus on queries based on tree topology and tree-character patterns, and not on tree-associated metadata or non-tree information. Searching for trees based on method of inference is out of scope because this is considered a metadata search.

On this basis, we judged most of the queries of Nakhleh, et al (see below) to be out of scope. In our initial pass, we judged only Q3 and Q5 to be in scope. However, in the second pass we decided that Q3 is an OTU-based (not phylogenetic) search, while Q1 is in scope if we interpret "minimum spanning tree" to be local to the current tree and to mean obtaining the subtree or clade containing S.

Assignments

Things didn't really work this way.

Phyloreference expressions and query syntax for each use-case query
- Bill, Sheldon, Hilmar, Nico, Josh
Semantics of a phyloreference in OWL
- Arlin, Karen
Vocabularies for phyloreferences
- Hilmar, Josh, Nico

Superficial analysis of Prior Art

PhyLoTA provides three search options:

subtrees of internal node (using direct taxon name / NCBI id)
most recent common ancestor of A, B, C ...
trees containing A, B, C... (and / or)

PAUP provides a language for asking,

"how many topologies support this relationship"
"load all trees for X node"
Filter "all trees for X node" for trees with this relationship: (1,(2,3))

This would tell you not only how many trees support a relationship, but what proportion of total trees for this group support that relationship
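
A minimal sketch of that counting idea in plain Python (this is not PAUP syntax). It assumes rooted trees written as nested tuples of OTU labels; the helper names `leaves`, `clades` and `supports`, and the three example trees, are made up for illustration.

```python
def leaves(tree):
    """Set of OTU labels below a node; a bare string is a terminal."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def clades(tree):
    """Yield the leaf set of every internal node of a rooted tree."""
    if isinstance(tree, str):
        return
    yield leaves(tree)
    for child in tree:
        yield from clades(child)

def supports(tree, ingroup, outgroup):
    """True if some clade holds all of ingroup and none of outgroup."""
    ingroup, outgroup = set(ingroup), set(outgroup)
    return any(ingroup <= c and not (outgroup & c) for c in clades(tree))

# Three hypothetical trees for the same node; two of them show (1,(2,3)).
trees = [("1", ("2", "3")), (("1", "2"), "3"), ("1", ("3", "2"))]
supporting = [t for t in trees if supports(t, {"2", "3"}, {"1"})]
proportion = len(supporting) / len(trees)
```

Here `len(supporting)` is the number of trees supporting the relationship and `proportion` is their share of all trees loaded for the node.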

PhyloWS: There are use-cases detailed in the PhyloWS standard. TreeBase implements some of these. An example of an existing PhyloWS syntax for searching on any tree with "Homo" in it (please use Firefox for this):

http://treebasedb-dev.nescent.org:6666/treebase-web/phylows/tree/find?query=tb.title.taxon=Homo&format=rss1&recordSchema=tree

Phyloexplorer has a searching language (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2695458/), explained on its web site:

You can QUERY the collection to search for trees containing a group of taxa- 

Example of query: {catarrhini}>3 and {murinae} and {bos} and not {glires} i.e: get every tree 
where "murinae" and "bos" are, but not "glires" and where the number of "catarrhini" is up to 3 

Syntax of query (upper or lower case is accepted):

--> {...} {name you want to search} must be between braces e.g: {mus musculus}
--> OR {name 1} or {name 2} or {name 3} ...
--> AND {name 1} and {name 2} and {name 3} ...
--> NOT {name 1} and not {name 2} and not {name 3} ...
--> (...) not {name 1} and not ({name 2} or {name 3}) ... use brackets for priority operations
--> <, >, <=, >= {name 1} > 2 and {name 2} < 4 ... e.g: {eutheria} > 4 and {murinae} < 3
--> == {name 1} == 3 ... e.g: {laurasiatheria} == 2 

Spaces between operands are allowed: {glires}>3 is equivalent to {glires} > 3. The same is true
for ==, <=, >=...

- Selecting the queried collection: you can search for trees satisfying your queries either in your current collection or in TreeBASE
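
To make the semantics of this syntax concrete, here is a rough Python sketch of how such a query might be evaluated against one tree. It is not Phyloexplorer's implementation. It assumes that some taxonomy step has already produced `counts`, a hypothetical mapping from lower-case taxon name to the number of matching OTUs in the tree, and that a bare `{name}` means "at least one".

```python
import re

COMPARE = {
    "==": lambda a, b: a == b,
    "<=": lambda a, b: a <= b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    ">": lambda a, b: a > b,
}

def matches(query, counts):
    """Return True if a tree with these per-taxon counts satisfies the query."""
    q = query.lower()  # upper or lower case is accepted

    def compare(m):
        name, op, n = m.group(1).strip(), m.group(2), int(m.group(3))
        return str(COMPARE[op](counts.get(name, 0), n))

    def present(m):
        return str(counts.get(m.group(1).strip(), 0) > 0)

    # Replace "{name} op n" first, then any remaining bare "{name}".
    q = re.sub(r"\{([^}]*)\}\s*(==|<=|>=|<|>)\s*(\d+)", compare, q)
    q = re.sub(r"\{([^}]*)\}", present, q)

    # Only boolean literals, and/or/not and parentheses may remain.
    if not re.fullmatch(r"(?:True|False|and|or|not|[()\s])+", q):
        raise ValueError("unsupported query: " + query)
    return eval(q, {"__builtins__": {}}, {})

counts = {"catarrhini": 4, "murinae": 1, "bos": 2}
query = "{catarrhini}>3 and {murinae} and {bos} and not {glires}"
hit = matches(query, counts)
```

The use of `eval` on the reduced boolean string is a shortcut for the sketch; a real service would want a proper parser.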

Nakhleh, et al provide a list of queries that a phylogenetic db should support:

Q1: Given a set S of taxa, find a minimum spanning clade for it.
Q2: Given three sets of taxa, S1, S2 and S3, and based upon all the phylogenies in the database, find the relationships between those three sets.
Q3: Given a set S of taxa, return all phylogenies that contain S.
Q4: Given a phylogenetic method M, return all phylogenies that were reconstructed using M.
Q5: Given an integer n, representing a number of taxa, return all phylogenies on a set S of taxa, where |S| =~ f(n)
Q6: Given a date/author/tool, return all phylogenies reconstructed on/before/after the given date, by the given author, using the given tool.
Q7: Given a certain characteristic l, return all phylogenies with that characteristic.
Q8: Given a type of data, and possibly a set of taxa, return all phylogenies on the set of taxa that were built using the given type of data.
Q9: Given a phylogeny T, a measure m, and a quantity c, return all phylogenies that are c units distant from T using the measure m.
Q10: Given a model of evolution G, return all the phylogenies that were reconstructed under the G assumptions.
Q11: Given some measures, return statistics about the database that correspond to those measures.

Outcomes

Use Cases

The list of enumerated use cases:

The user is focused on a specific group of organisms; the user wants a subtree for the group of interest. This is particularly relevant when the resource is a very large tree, too large for most users to handle. E.g., the user is interested in Magnoliaceae, not all plants. The user has either an enumerated list of terminals or a taxonomic identifier.
The user is interested in the evidence for or against a specified topology such as ((A,B),C), e.g., (("chimp","human"),"gorilla"). The user may wish to compare studies with ((A,B),C) to studies that find ((A,C),B). The user may wish to extract the subtree or the submatrix including these OTUs in order to assess such things as whether the consistency index differs depending which topology is supported. In a more complex case (requiring taxonomic resolution), this may include some ambiguity where A, B, and C are sets of OTUs: the user does not care about the internal branch order in each group, but only to find trees in which the order is ((A,B),C).
The user is interested in the evidence for or against the holophyly of a specified set of OTUs S = { S1, S2 . . . }. The user may want to explore specifically whether a member of another group T = { T1, T2 . . . } is mixed into the clade with S, e.g., whether there are any old-world monkeys in my hominoids. In the more complex case requiring taxonomic resolution, S and T are indicated using identifiers for higher-level taxa.

Removing complications to focus on query primitives

We discussed two complications, one of which was using PhyloCode concepts, and the other of which was to allow for mediation by a taxonomy provider that would expand a term such as "Aves" or "Magnoliaceae" into a list of OTUs.

We decided to focus initially on a set of query primitives covering each use-case. These primitives may harness definitions of phylogenetic names in the PhyloCode, and the corresponding syntax may be based on PhyloCode syntax for expressing those names. This should not be confused with supporting PhyloCode names, which might be layered on top of that.

Also, the issue of taxonomy resolution could be deferred. Specifically, we recognize that phyloreferences may require a resolution and/or a reconciliation service, local or external to the phylogenetic data provider being queried.

A comment about rooted and unrooted trees

The query interface developed here presumes rooted trees. The notions of searching for "ancestors" and "clades" and "subtrees" do not make sense for unrooted trees. When a connected network (unrooted tree) is bisected, the result is two networks.

This issue requires further discussion at some later date because, in fact, most of the electronically accessible phylogenetic trees found "in the wild" are unrooted, even though the typical user assumes that subject trees are rooted.

Query primitives

Now we try to find a set of query primitives that will cover (either individually, or in combination) all of the anticipated use cases.

The query should return PhyloWS-compliant URIs to the matching clades or other entities (see "Task: Retrieve a clade (subtree) of a tree" in the PhyloWS spec). This will allow retrieving the subtree, URIs to nodes in the clades, and the submatrix corresponding to the clade.

Find MRCA(S) where S = { OTU1, OTU2 . . . }. Since every connected rooted tree with S has an MRCA(S), an OTU-based search for S will yield the same set of trees as a phylogenetic search for MRCA(S): the query is properly a phylogenetic query only if we return something based on the identity of MRCA such as the node id for MRCA, the subtree, or some property such as the number of descendants, etc.
- Examples. Find MRCA("chimp","human","gorilla"). Find MRCA("falcon", "ostrich").
- This could be tied to the node-based definition in PhyloCode. The PhyloCode defines this as the least inclusive clade including the specifiers defining it.
Find Descendants(N) where N is a node. If N is the root of the tree, this is the same as getting all the OTUs in the tree.
- Examples. Find Descendants(MRCA("chimp","human","gorilla")).
Find something where topology has in = S, out = T, where S = { S1, S2, . . . }, T = { T1, T2, . . .}. These are terminals identified by species name or by some other means.
1. subcase: Find MRCA(S) where in = S, out = T.
  - We also called this "Find all trees (or data) that have a common ancestor of A and B from which C is not descended." (i.e., S = {A,B}, T = {C}).
2. subcase: Find MRCA(S,T) where in = S, out = T.
  - Example: Search for all trees (or data) that support a common ancestor of human (A) and chimp (B) that does not have orangutan (C) as a descendant.
3. subcase: Find MostInclusiveClade(in = S, out = T).
  - Example: Search for all trees (or data) that have a clade that includes human (A) but not orangutan (C).
  - In the simplest case where |S| = |T| = 1, this can be mapped to the branch-based definition of a clade in the PhyloCode. According to the code, the ingroup can only have 1 specifier.
An alternative primitive to in = S & out = T would be to find ((A,B),C). I think this is dispensable if we have 3 above, which I put first because of its more direct mapping to the PhyloCode branch-based clade definition.
Find clades where the common ancestor of A and B also is an ancestor of C.
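
The primitives above can be sketched on a single rooted tree as follows. This is only an illustration under stated assumptions: the tree is stored as a hypothetical child-to-parent map, the node names are invented, and the function names `mrca`, `descendants` and `most_inclusive_clade` are ours, not part of PhyloWS or any existing library.

```python
# Hypothetical rooted tree as a child -> parent map; the root maps to None.
PARENT = {
    "root": None,
    "hominidae": "root", "gibbon": "root",
    "homininae": "hominidae", "orangutan": "hominidae",
    "hominini": "homininae", "gorilla": "homininae",
    "human": "hominini", "chimp": "hominini",
}

def path_to_root(parent, node):
    """List the node followed by its ancestors, ending at the root."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def mrca(parent, otus):
    """Most recent common ancestor of the given OTUs (node-based definition)."""
    paths = [path_to_root(parent, o) for o in otus]
    common = set(paths[0]).intersection(*paths[1:])
    # The first shared node on a leaf-to-root path is the most recent one.
    return next(n for n in paths[0] if n in common)

def descendants(parent, node):
    """All nodes below `node` (internal nodes and OTUs), excluding node itself."""
    return {n for n in parent if n != node and node in path_to_root(parent, n)}

def most_inclusive_clade(parent, ingroup, outgroup):
    """Root of the largest clade with all of ingroup and none of outgroup,
    or None if the tree has no such clade."""
    excluded = set()
    for o in outgroup:
        excluded.update(path_to_root(parent, o))
    best = None
    for n in path_to_root(parent, mrca(parent, ingroup)):
        if n in excluded:
            break
        best = n
    return best
```

With this tree, `mrca(PARENT, ["chimp", "human", "gorilla"])` gives the node called `homininae` here, and `most_inclusive_clade(PARENT, ["human"], ["orangutan"])` gives the same node, which is the |S| = |T| = 1 branch-based case. A tree that contradicts the requested topology yields `None`, which is what makes the in/out search a phylogenetic rather than an OTU-based query.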

Query Syntax

In the expressions below S_n and O_n are specifiers of nodes. The semantics of the specifiers, such as whether it is a taxon name, specimen, or sequence identifier, are defined by the vocabulary.

TODO: define how vocabulary terms get attached to the phyloreference expressions such that each specifier's meaning is disambiguated.

Find node(s) that are MRCA(S) where S = { S₁, S₂ . . . }
- PhyloCode: <S₁ & ... & S_n with n >= 2
- Gouret et al: [[S₁,...,S_n],[]] with n >= 2
- Phyloreference: <S₁&...&S_n with n >= 2
  - English expressions: The least inclusive clade including S₁, ..., and S_n. The clade originating with the most recent common ancestor of S₁, ..., and S_n.
Find Descendants(N) where N is a node. If N is the root of the tree, this is the same as getting all the OTUs in the tree.
- Note that there is a contradiction in this: the set of all descendants of a node is not the same as the set of all OTUs descending from that node. We are for now assuming that what is meant is indeed all descendants, a.k.a. the subtree originating from N.
- This is the same phyloreference as #1, but with the subtree as the return value rather than the nodeID of the MRCA node. This can be achieved as a second step using the "Retrieve a clade (subtree) of a tree" task in the PhyloWS REST specification, or using the "Retrieve a clade (subtree) of a tree defined by MRCA" task in one step.
Find node(s) where in = S, out = O, where S = { S₁, S₂, . . . }, O = { O₁, O₂, . . .}.
- PhyloCode: >S₁ ~ O₁ ∨ ... ∨ O_m with m >= 1
  - Note that PhyloCode itself doesn't allow more than one specifier for the ingroup.
- Gouret et al: not clear which one (from Figure 2) would apply here.
- Phyloreference:
  - >((S₁&...&S_n)!(O₁|...|O_m)) with n >= 1 and m >= 1.
  - >((S₁|...|S_n)!(O₁|...|O_m)) with n >= 1 and m >= 1.
  - >((S₁&...&S_n)!(O₁&...&O_m)) with n >= 1 and m >= 1.
  - >((S₁|...|S_n)!(O₁&...&O_m)) with n >= 1 and m >= 1.
  - English expressions: The most inclusive clade including S₁, ..., and S_n that does not include any of O₁, ..., or O_m. The clade originating from the earliest common ancestor of S₁, ..., and S_n that is not also an ancestor of any of O₁, ..., or O_m.
  - The ingroup specifiers may be joined by "&", in which case all must be present, or by "|", in which case any one of them defines the ingroup. The outgroup specifiers may be joined by "&", in which case all of them need to be present and excluded from the ingroup, or by "|", in which case only those present must be excluded from the ingroup.
Find clades rooted at MRCA(S_n), with S_n = { S₁, S₂ . . . }, that are also ancestors of S_m = { S_a, S_b . . . }. For example, find minimum clades where the ancestor of S₁, S₂ includes S_a, S_b as a descendant.
- This is a subset of #1. Specifically, those clades where the MRCA(S₁,S₂) includes every S ∈ S_m (clades that exclude any S ∈ S_m need to be removed from the result of #1). This suggests composing the phyloreference as a combination of #1 (the superset) and #3 (the clades to be removed).
- PhyloCode: no corresponding expression
- Gouret et al: not clear which expression would be equivalent
- Phyloreference: (<S₁&...&S_n&S_n+1&...&S_n+m)!(>((S₁&...&S_n)!(S_n+1,...,S_n+m))) with n >= 2 and m >= 1.
  - S₁, ..., S_n are the specifiers defining the MRCA, and there need to be at least 2. S_n+1, ..., S_n+m are the specifiers that also need to be descended from that MRCA, and there needs to be at least 1.
  - Example: subtrees with human, chimp and orangutan, where the most recent common ancestor of human and chimp includes orangutan. More precisely, the least inclusive clade including human, chimp, and orangutan where the ancestor of human and chimp does not exclude orangutan.
    - As a phyloreference: (<human&chimp&orangutan)!(>((human&chimp)!orangutan))
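
As a rough illustration of how a client might assemble these expressions, here is a small Python sketch. The function names are hypothetical. Two choices are assumptions on our part rather than settled syntax: a group with a single specifier is written without parentheses (as in the human/chimp/orangutan example), and several extra specifiers in #4 are joined with "|".

```python
def group(specifiers, op):
    """Join specifiers with op; parenthesise only when there is more than one."""
    if len(specifiers) == 1:
        return specifiers[0]
    return "(" + op.join(specifiers) + ")"

def least_inclusive(specifiers):
    """#1: least inclusive clade (MRCA) of at least two specifiers."""
    if len(specifiers) < 2:
        raise ValueError("need at least two specifiers")
    return "<" + "&".join(specifiers)

def most_inclusive(ingroup, outgroup, in_op="&", out_op="|"):
    """#3: most inclusive clade with the ingroup but without the outgroup."""
    if not ingroup or not outgroup:
        raise ValueError("need at least one ingroup and one outgroup specifier")
    return ">(" + group(ingroup, in_op) + "!" + group(outgroup, out_op) + ")"

def mrca_including(ingroup, also_descended):
    """#4: the clades of #1 minus the clades of #3 that exclude the extras."""
    if len(ingroup) < 2:
        raise ValueError("need at least two specifiers for the MRCA")
    superset = least_inclusive(list(ingroup) + list(also_descended))
    removed = most_inclusive(ingroup, also_descended)
    return "(" + superset + ")!(" + removed + ")"

example = mrca_including(["human", "chimp"], ["orangutan"])
```

The value of `example` reproduces the phyloreference given for the human, chimp and orangutan case.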

CQL Prefix Assignment for PhyloWS Finder URIs

"A Prefix Map may be used to assign context set names to specific identifiers in order to be sure that the server maps them in a desired fashion. It may occur at any place in the query and applies to anything below the map in the query tree. A prefix assignment is specified by: '>' shortname '=' identifier. The shortname and '=' sign may be omitted, in which case it sets a default context set for indexes."

Take this query: "Search on phylows service called pws_source for node(s) that are the MRCA of nodes labeled Homo sapiens and nodes labeled with the NCBI taxid for Pan troglodytes, and then return the results as URIs for nodes in rss 1.0 format":

http://purl.org/phylo/pws_source/phylows/node/find?query= >dwc="http://rs.tdwg.org/dwc/terms/#" 
cdao="http://www.evolutionaryontology.org/#" lsrn="http://lsrn.org/lsrn/registry.html#"  
cdao.phyloRef="<dwc.scientificName='scientificName:Homo sapiens'&lsrn.taxon='taxon:9598'"
&format=rss1&recordSchema=node

Search Term	Prefix Map	Example Query Term	Notes
A phylo-reference	cdao="http://www.evolutionaryontology.org/#"	cdao.phyloRef=' ... '	CDAO does not have a term for this yet; CDAO may be the wrong body to define this term
NCBI Taxon ID	lsrn="http://lsrn.org/lsrn/registry.html#"	lsrn.taxon='taxon:9598'	lsrn has this term but does not yet resolve this; the use of the word "taxon" is unfortunate
A scientific name string	dwc="http://rs.tdwg.org/dwc/terms/#"	dwc.scientificName='scientificName:Homo sapiens'	The examples given by dwc do not use the colon prefix
International Nucleotide Sequence Database Collaboration (GenBank, EMBL, DDBJ)	lsrn="http://lsrn.org/lsrn/registry.html#"	lsrn.insd='insd:FJ525395'	A node must map to a nucleotide sequence with this accession number or ID.
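
A sketch of how a client could build such a finder URI in Python. The base URL and prefix identifiers are copied from the example above. The percent-encoding step is our assumption: the example is shown unencoded, but the "&" inside a phyloreference would otherwise be read as a URL parameter separator. The function name `finder_url` is made up.

```python
from urllib.parse import quote

BASE = "http://purl.org/phylo/pws_source/phylows/node/find"

# Prefix map from the table above: shortname -> identifier.
PREFIXES = {
    "dwc": "http://rs.tdwg.org/dwc/terms/#",
    "cdao": "http://www.evolutionaryontology.org/#",
    "lsrn": "http://lsrn.org/lsrn/registry.html#",
}

def finder_url(phyloref, fmt="rss1", record_schema="node"):
    """Build a PhyloWS finder URI carrying a CQL prefix map and a phyloreference."""
    prefix_map = ">" + " ".join('%s="%s"' % (k, v) for k, v in PREFIXES.items())
    cql = '%s cdao.phyloRef="%s"' % (prefix_map, phyloref)
    # Encode everything so that '&', '#', spaces and quotes survive transport.
    return "%s?query=%s&format=%s&recordSchema=%s" % (
        BASE, quote(cql, safe=""), fmt, record_schema)

url = finder_url(
    "<dwc.scientificName='scientificName:Homo sapiens'&lsrn.taxon='taxon:9598'")
```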

Vocabulary / semantics

Further resources and questions

PhyloCode glossary
what prefixes already exist that contain concepts we want?
do we need a new prefix?
vocabulary for what we mean by the specifiers, MRCA(S1,S2). How do we define S1 and S2? What can those be? Presumably those can be nodes of any kind.
the vocabulary will provide the means to disambiguate those identifiers
how do we deal with internal identifiers (i.e. link to an external source like NCBI or RegNum)

Tabled complications

Taxonomic querying. A useful variant would be to search based on a taxonomic identifier, not a terminal. We can envision this as a service mediated by a taxonomy broker. The user asks for trees with members of Aves. The provider then interprets this request according to the definition of Aves provided by the taxonomy broker. For instance, the provider may identify "Aves" via RegNum as the clade defined by ("ostrich", "falcon"). Then the provider would execute this definition on a reference tree (or taxonomy) to identify all of the members of Aves. The intersection of this membership list with the members of individual trees would be the basis for responding to the user's query.
Apomorphy-based querying. Obtain all trees for which character-state S[n, s] is an apomorphy (character n, state s).
- Example: obtain all trees for which presence of a placenta is an apomorphy
- After some discussion, it was determined that systematists interpret apomorphy in a way that is not computable given the kinds of information that typically are available. Here is the explanation for that conclusion.
  - Let us consider four cases, #1 through #4 in which the presence or absence of placenta in OTUs A, B, C and D in the tree ( ( (A,B), C), D) is respectively
    - case #1: + - - -
    - case #2: + + - -
    - case #3: + - + -
    - case #4: + + + +
  - If "apomorphy" is a pattern, determinable only from the current tree and the character distribution, in which the most parsimonious reconstruction makes the state a derived character, then "+" is determinably an apomorphy in #1 and #2. In case #3, parsimony cannot distinguish whether "+" is ancestral or derived. In case #4, parsimony determines that "+" is ancestral. Therefore, if apomorphy is a pattern so defined, the query should return cases #1 and #2, but not #4, while returning #3 raises an ambiguity.
  - However, the two systematists in our group, Nico Cellinese and Bill Piel, insisted that "placental" is an apomorphy regardless of what the current tree might suggest. The placenta evolved only once, and even if it is lost in some group (not known to have happened, but possible), then it still would be an apomorphy. In this view, apomorphy is not merely a pattern of distribution of a state relative to the current tree. Instead, it is a conclusion about its singly-derived status from all available information. Therefore, if apomorphy is so defined, searching for trees that have "placental" as an apomorphy should yield either
    1. no trees if experts have determined that placental is not an apomorphy
    2. all trees, or all trees with any "+" placental state, if experts have determined that placental is an apomorphy
  - Under the second definition, "apomorphy" is not computable from a tree plus its character data alone; instead, this information must come from elsewhere.
  - Apomorphy-based queries may become computable by requiring the apomorphy character be an ontology term. This would at least make their meaning, and presumably taxonomic context, unambiguous.
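
The first, pattern-based reading can be checked mechanically. The sketch below runs the bottom-up pass of Fitch parsimony (our choice of method; the discussion only says "most parsimonious reconstruction") on the tree (((A,B),C),D) for the four cases. It assumes a strictly binary tree written as nested tuples; the names `fitch_root_states` and `classify` are made up.

```python
def fitch_root_states(tree, states):
    """Bottom-up Fitch pass; returns the set of optimal states at the root.
    `tree` is a nested tuple of OTU names, `states` maps OTU -> state."""
    if isinstance(tree, str):
        return {states[tree]}
    left, right = (fitch_root_states(child, states) for child in tree)
    shared = left & right
    return shared if shared else left | right

def classify(tree, states, state="+"):
    """Classify `state` as absent, ancestral, ambiguous or derived on this tree."""
    if state not in states.values():
        return "absent"
    root = fitch_root_states(tree, states)
    if root == {state}:
        return "ancestral"
    if state in root:
        return "ambiguous"
    return "derived"

TREE = ((("A", "B"), "C"), "D")
CASES = {1: "+---", 2: "++--", 3: "+-+-", 4: "++++"}
results = {k: classify(TREE, dict(zip("ABCD", v))) for k, v in CASES.items()}
```

`results` marks cases #1 and #2 as derived, #3 as ambiguous and #4 as ancestral, in line with the reasoning above. Nothing in this computation can express the second, expert-judgement reading of apomorphy.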

PhyloCode and PhyloReference

The PhyloCode General Requirements for Establishment of Clade Names (Article 9) can be found at www.ohio.edu/phylocode/art9.html.

From PhyloCode article 9, section 9.4:

The system of PhyloCode symbols and abbreviations used here adopts the conventions below. The use of non-ASCII characters in PhyloCode notation is problematic for integration and URI-based queries, so some alternative abbreviations are proposed.

phylocode	phyloreference	definition
>	>	the most inclusive clade containing
<	<	the least inclusive clade containing
&	&	and
∨	|	or
~	!	not
A, B, C, etc.	A, B, C, etc.	species or specimens used as internal specifiers
Z, Y, X, etc.	Z, Y, X, etc.	species or specimens used as external specifiers
M	Not used	an apomorphy
()		"of" or "synapomorphic with that in"; used in conjunction with M
>∇		the most inclusive crown clade containing
<∇		the least inclusive crown clade containing

The "crown clade" symbol (∇) resembles the representation of a crown clade on a phylogenetic tree diagram.
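
The substitutions in the table are simple enough to apply mechanically. A minimal Python sketch, assuming a PhyloCode abbreviation arrives as a plain string; the function name is hypothetical, and symbols with no proposed equivalent (∇ here) are rejected rather than guessed at.

```python
# PhyloCode symbol -> proposed ASCII phyloreference symbol (from the table).
TO_PHYLOREF = str.maketrans({"∨": "|", "~": "!"})

# Symbols for which the table proposes no phyloreference equivalent.
UNSUPPORTED = {"∇"}

def to_phyloreference(expr):
    """Rewrite a PhyloCode abbreviation using the ASCII-only operators."""
    found = UNSUPPORTED.intersection(expr)
    if found:
        raise ValueError("no phyloreference equivalent for: " + ", ".join(sorted(found)))
    return expr.translate(TO_PHYLOREF)

ascii_expr = to_phyloreference(">A ~ Z ∨ Y")
```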

Followup ideas

Develop a vocabulary ontology in CDAO-discuss
- start a separate owl file (we can merge it later)
- import and equate relevant CDAO terms
PhyloWS mini-hackathon: Reference implementation
- have a pre-meeting to make sure there is some kind of reference implementation
- bring together a few data providers (TreeBase, TOLKIN, etc), a few end-users, and a few client programmers
- maybe a dozen people-- this would be a small meeting
- possible backers: NMSU, NESCent
PhyloWS mini-hackathon: Show me the tree for my group
- goal is to support the "show me the tree(s) for my group" use case, in which the user wants to obtain (and typically to visualize) a phylogeny of a group of organisms such as "aves". We envision this as an extremely important use-case that will be used widely.
- potential client programs such as PhyloWidget and TreeViz
- taxonomy resolvers such as ITIS or (potentially) NCBI
- data providers (TreeBase, etc)
Electronic resources for phyloreferencing community
- discussion list
- term tracker
Google Summer of Code (GSOC) project ideas
1. Phyloreferencing GUI plan for PhyloWS topology (phyloreferencing) searches
  - need to put the sketches up on the web site
  - programmer needs to create JS (or whatever) front end and attach to PhyloWS grammar
Develop RDFa specification for phyloreferencing metadata markup in XML, HTML, etc.
1. Other?

Working Meeting at TDWG 2010

See documentation on the TDWG Phylogenetics Standards wiki.
