Difference between revisions of "Phylotastic/TNRS"

From Evolutionary Interoperability and Outreach

Revision as of 11:41, 5 June 2012

The Taxonomic Name Resolution Service translates scientific names to scientific names as found in a TNRS, ideally identifying them by means of a URL. The goal is to standardize the names being used in the trees in Phylotastic as well as to standardize names provided by users when generating subtrees.

People planning to work on this

Please enter any programming languages you know in your order of preference (first = highest preference).

  • Gaurav: Perl, Ruby, Java, Python.
  • Helena: Javascript, PHP, Python

Deliverables

  • A web service with the following API in any programming language (depending on voting, see above).
  • A set of tests which can be run against the TNRS through the API to test its ability to correct for:
    • Synonymies
    • Typos
  • A CSV file mapping every taxon in our big tree(s) to a TNRS taxon, which can then be evaluated for accuracy.

API

This is just a sketch (doubly true for anything marked possibly?); please edit at will!

Given
  • query (string): A general taxon name (e.g. "P. tigris", "Panthera tigris", "tiger", etc.)
  • within (string, possibly?): A taxonomic grouping that this taxon is found within. This would help us disambiguate Crucibulum (within = "Fungi" or "Basidiomycetes" or "Nidulariaceae") from Crucibulum (within = "Animalia" or "Mollusca" or "Calyptraeidae").
Returns
  • score (integer from 0 to 100): indicates how sure the TNRS is that the match is correct. '100' implies that the string provided was exactly identical to the string emitted (after nomenclatural parsing, so "Panthera tigris" is identical to "Panthera tigris (Linnaeus, 1758)"); '0' indicates that no match could be found.
  • If the score is 0:
    • errorMessage (string): a human-readable error message explaining why this error occurred.
    • errorCode (integer, possibly?): a predefined error code indicating the type of error.
    • retry (boolean): 'true' if this request should be retried (if the TNRS failed because of a problem that's likely to be temporary, such as a server error), 'false' if not (if the provided name could not be matched to any string on this TNRS).
  • If the score is not 0:
    • scientificName (string): "Panthera tigris", "Panthera tigris (Linnaeus, 1758)", "Coleoptera".
    • acceptedNameUsage (string): Where the queried name is the currently accepted name for this taxon, this should be identical to scientificName; where the queried name is not the currently accepted name, this should be the currently accepted name. For instance, if the user queries for Antilocapra anteflexa, the system should return "Antilocapra anteflexa Gray, 1855" in the scientificName field and "Antilocapra americana Ord, 1815" in the acceptedNameUsage field.
    • url (url): A URL which represents the taxon in the TNRS queried; for example, http://eol.org/pages/328674/overview or http://www.ubio.org/browser/details.php?namebankID=2478188 (bonus marks for linking to an RDF file!)
    • higherClassification (string, possibly?): for example, "Eukaryota;Animalia;Eumetazoa;Bilateria;Nephrozoa;Deuterostomia;Chordata;Chordata Craniata;Vertebrata;Gnathostomata;Tetrapoda;Mammalia;Theria;Eutheria;Epitheria;Laurasiatheria;Carnivora;Feliformia;Felidae;Pantherinae;Panthera" (or something smaller).
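
As a rough illustration of the response described above, the fields could be carried in a small Python record. This is only a sketch: the class name TnrsResponse, the validation rules, and the is_synonym helper are assumptions made for the example, not part of the agreed API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TnrsResponse:
    """One response from the sketched API (class name is hypothetical)."""
    score: int                                   # 0..100; 0 means no match
    # Expected only when score == 0:
    errorMessage: Optional[str] = None
    errorCode: Optional[int] = None
    retry: Optional[bool] = None
    # Expected only when score != 0:
    scientificName: Optional[str] = None
    acceptedNameUsage: Optional[str] = None
    url: Optional[str] = None
    higherClassification: Optional[str] = None

    def __post_init__(self):
        # Assumed consistency checks; the sketch above does not mandate them.
        if not 0 <= self.score <= 100:
            raise ValueError("score must be between 0 and 100")
        if self.score == 0 and self.errorMessage is None:
            raise ValueError("a score of 0 requires an errorMessage")
        if self.score != 0 and self.scientificName is None:
            raise ValueError("a non-zero score requires a scientificName")

    def is_synonym(self) -> bool:
        """True when the matched name is not the currently accepted name."""
        return self.score != 0 and self.scientificName != self.acceptedNameUsage
```

With this record, the Antilocapra anteflexa example would have scientificName "Antilocapra anteflexa Gray, 1855" and acceptedNameUsage "Antilocapra americana Ord, 1815", so is_synonym() would return True.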

Existing TNRSes we can plug into

Please remember that any of these TNRS might have incorrect or outdated data, cross-code synonymies, or any other problem!

  • iPlant TNRS: Only plants (via Tropicos/NCBI Taxonomy/USDA Plants/Global Compositae Checklist)
  • ITIS: All life, but focuses on North American taxa
  • EOL: All life, merges multiple taxonomic trees from different providers
  • NCBI Taxonomy: All life
  • uBio: All life
  • WoRMS: Marine species
  • Global Names Index, which contains ~17 million names and returns lexical groups of similar names and links to sources
  • Global Names Recognition service, which identifies things that look like taxon names in a document or webpage


Feature Matrix

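Name          | Animals | Plants | Fungi | Micro | Global  | Typos | Common names | Synonyms | Cross ID | Classification | Support scores | Taxonomic parsing | WS info | Notes
--------------|---------|--------|-------|-------|---------|-------|--------------|----------|----------|----------------|----------------|-------------------|---------|------
iPlant TNRS   | No      | Yes    | No    | No    | Partial | Yes   | No           | Yes      | Yes      | Yes            | Yes            | Yes               | [1]?    | Hierarchical search possible
ITIS          | Yes     | Yes    | Yes   | ?     | ?       | No    | No           | Partial  | ?        | Yes            | No             | ?                 | [2]     |
EOL           | Yes     | Yes    | Yes   | ?     | Yes     | No    | Yes          | Partial  | Yes      | Yes            | No             | ?                 | ?       |
NCBI Taxonomy | Yes     | Yes    | Yes   | Yes   | Yes     | Yes   | No           | Yes      | ?        | Yes            | No             | ?                 | [3]     | Contains many taxonomically invalid names
uBio          | Yes     | Yes    | Yes   | Yes   | ?       | ?     | ?            | ?        | ?        | ?              | ?              | ?                 | [4]     |
Required?     | Yes     | Yes    | Yes   | Yes   | Yes     | Yes   | ?            | Yes      | ?        | ?              | Yes            | ?                 | NA      | Taxonomic parsing might be required for infraspecifics and authors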


Day 1 Discussion

We came up with three alternative API designs, ranging from simple to elaborate. The choice of these strategies has to be coordinated and matched against the core architecture, especially tree storage and retrieval.

Design 1 (simple)

In the simplest scenario, the TNRS returns a list of all known possible valid names for a given (potentially invalid) name. The list of names can be annotated with attributes such as source, associated IDs, and status (i.e. whether a name is the canonical name for that species). In this scenario, the burden of figuring out what to do with each name is on the users of the API. We envision that users of the API will search all the mega-trees for every name in the returned list; if any of the names matches a name in a mega-tree, that name should be used.

In those cases where a name is associated with multiple species, this API can try to return multiple lists, each corresponding to a different species. However, it is not always possible to (easily) figure out these cases from the output of the external TNRS services we are going to use.
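
A minimal sketch of how a client might use a Design 1 response, assuming the returned list is a list of dictionaries with "name", "source" and "is_canonical" keys. Those key names and the function pick_tree_names are hypothetical; the design only says that names can be annotated with such attributes.

```python
def pick_tree_names(candidates, tree_tips):
    """Return the candidate names that occur among a mega-tree's tip labels.

    candidates: list of dicts as (hypothetically) returned by the TNRS.
    tree_tips:  iterable of tip labels from one mega-tree.
    The order of the candidate list is preserved.
    """
    tips = set(tree_tips)
    return [c["name"] for c in candidates if c["name"] in tips]


# Hypothetical response for the query "Antilocapra anteflexa".
example_candidates = [
    {"name": "Antilocapra anteflexa", "source": "ITIS", "is_canonical": False},
    {"name": "Antilocapra americana", "source": "ITIS", "is_canonical": True},
]
```

Calling pick_tree_names(example_candidates, ["Antilocapra americana", "Panthera tigris"]) would select "Antilocapra americana", because it is the only returned name present in that tree.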

Design 2 (in-between)

In this design, we keep the interface described in the first design and add a second function, which chooses one of the available names for each species as the current name. This does not have to be the correct name for the species (whatever that means), but it has to be consistent. This single name will enable users of the TNRS service to match species across different trees, and to match them to user queries. However, there is a limit to this consistency: over time, the current name can change. This complicates matters in many imaginable ways. For one thing, if mega-trees are stored, the stored taxon names could become outdated (out of sync with the current name returned by the TNRS). Possible solutions to this problem are:

  • Updating and synchronizing stored mega-trees periodically, so that they are not too much out of sync.
  • Every time a new query comes in, we query the current name for all the taxa in all the source trees, updating the names that have changed.
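
Either solution comes down to comparing stored names with the current names and recording what changed. A minimal sketch, assuming a callable current_name that stands in for the TNRS "current name" function and returns None for unknown names; both the function refresh_tree_names and that convention are assumptions.

```python
def refresh_tree_names(stored_names, current_name):
    """Return {old_name: new_name} for stored names that are now outdated.

    stored_names: iterable of taxon names stored in a mega-tree.
    current_name: callable mapping a name to its current name, or None
                  if the name is unknown (stand-in for a TNRS call).
    """
    changes = {}
    for name in stored_names:
        new_name = current_name(name)
        if new_name is not None and new_name != name:
            changes[name] = new_name
    return changes
```

The same function could be run periodically over every stored tree (first option) or at query time (second option); only the trigger differs.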

Design 3 (elaborate)

In the most elaborate design, we use IDs to formalize entities stored in mega-trees. We will assign one ID to each species stored in our system. Stored trees should label their tips with these IDs rather than with species names. The TNRS service will include two operations: returning an ID given a (potentially incorrect) name, and returning the currently accepted name for a given ID. In case two species have the same name, the two species should be assigned different IDs and the service should return both IDs. A typical usage of the API will be taking user-provided names, mapping those to IDs, finding those IDs in the stored trees, pruning and grafting, and getting a tree with tips labeled with IDs; then, IDs are turned into currently accepted names, and these are the names that are shown to the user.

The idea here is that IDs will be associated with species, and hence will be more stable through time, eliminating the need for frequent updates of the stored trees.

How exactly the IDs should be assigned to species has to be discussed. We considered using existing IDs from sources such as ITIS. An alternative is generating new IDs internal to Phylotastic (maybe not a good idea?).
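
The two operations of this design can be sketched with an in-memory registry. Everything here is an assumption made for illustration: the class TaxonRegistry, its method names, and the opaque string IDs. It also does exact lookups only, whereas the real service would need to handle potentially incorrect names.

```python
class TaxonRegistry:
    """Toy registry mapping names to IDs and IDs to accepted names."""

    def __init__(self):
        self._accepted = {}      # taxon ID -> currently accepted name
        self._ids_by_name = {}   # any known name -> set of taxon IDs

    def register(self, taxon_id, accepted_name, other_names=()):
        """Record a taxon, its accepted name, and any other known names."""
        self._accepted[taxon_id] = accepted_name
        for name in (accepted_name, *other_names):
            self._ids_by_name.setdefault(name, set()).add(taxon_id)

    def ids_for_name(self, name):
        """Operation 1: name -> all matching IDs (several for homonyms)."""
        return sorted(self._ids_by_name.get(name, ()))

    def accepted_name(self, taxon_id):
        """Operation 2: ID -> currently accepted name."""
        return self._accepted[taxon_id]


def label_tips(registry, tip_ids):
    """Turn the ID-labeled tips of a pruned tree into names for display."""
    return [registry.accepted_name(taxon_id) for taxon_id in tip_ids]
```

A homonym such as Crucibulum would be registered twice under different IDs, so ids_for_name("Crucibulum") would return both IDs and leave the choice to the caller (or to a "within" parameter).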

General Concerns

No matter which design we choose, there are two concepts that can be implemented on top of our APIs: caching and batching. Caching will permit us to improve performance, especially for the fuzzy match which can be quite slow. Batching permits the user to search for a list of names and get a list of responses in one call.
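
Caching and batching can be layered on top of any single-name lookup. The sketch below uses Python's functools.lru_cache; the FAKE_BACKEND table and the function names are stand-ins for a real (slow) TNRS call and are not part of any design above.

```python
from functools import lru_cache

# Stand-in for a slow external TNRS; the contents are illustrative only.
FAKE_BACKEND = {
    "Antilocapra anteflexa": "Antilocapra americana",
    "Panthera tigris": "Panthera tigris",
}
BACKEND_CALLS = []  # records every lookup that actually reaches the backend


@lru_cache(maxsize=1024)
def resolve(name):
    """Resolve one name; repeated queries are answered from the cache."""
    BACKEND_CALLS.append(name)
    return FAKE_BACKEND.get(name)


def resolve_batch(names):
    """Resolve a list of names in one call, preserving input order."""
    return [resolve(name) for name in names]
```

If a batch contains the same name twice, only the first occurrence reaches the backend; the second is served from the cache.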

In addition, we discussed whether our API needs to be synchronous or asynchronous. Our current thinking is that we need to provide two interfaces for each operation, one that is synchronous and does a simple and fast search (without fuzzy matching), and another one that is more thorough and is asynchronous.

Design discussion

  • Return 1 name or multiple names?
    • Scores?
  • Caching?
  • Which TNRS do fuzzy matching?