Difference between revisions of "Phylotastic/TNRS"

From Evolutionary Interoperability and Outreach

Revision as of 16:43, 6 June 2012

The Taxonomic Name Resolution Service (TNRS) maps submitted scientific names to the corresponding names found in an existing TNRS, ideally identifying each one by a URL. The goal is to standardize the names used in the trees in Phylotastic, as well as the names provided by users when generating subtrees.

People planning to work on this

Please enter any programming languages you know in your order of preference (first = highest preference).

  • Gaurav: Perl, Ruby, Java, Python.
  • Helena: Javascript, PHP, Python

Deliverables

  • A web service with the following API in any programming language (depending on voting, see above).
  • A set of tests which can be run against the TNRS through the API to test its ability to correct for:
    • Synonymies
    • Typos
  • A CSV file mapping every taxon in our big tree(s) to a TNRS taxon, which can then be evaluated for accuracy.

API

Given
  • names (string): A newline-delimited list of taxon names (e.g. "P. tigris\nPanthera tigris\ntiger", etc.)
  • within (string, not in v1): A taxonomic grouping that this taxon is found within. This would help us disambiguate Crucibulum (within = "Fungi" or "Basidiomycetes" or "Nidulariaceae") from Crucibulum (within = "Animalia" or "Mollusca" or "Calyptraeidae").
Returns
Section | Field | Meaning | Examples
metadata | jobId | The job ID that was submitted (for asynchronous requests). | 12345, "1-ABC-789"
metadata | submitDate | The date on which this job was submitted, in ISO 8601 format. | "2012-06-06T14:54Z"
metadata | sources | An array of all the sources available to our TNRS service, in the following format: |
    Field | Description | Example
    sourceId | A short string used to name this source. | "ITIS", "NCBI Taxonomy", "iPlant TNRS"
    sourceName | The full name of this source. | "iPlant Collaborative TNRS v3"
    uri | A URL used to identify this source; generally the HTTP URL of its front page. | "http://www.itis.gov/", "http://www.ncbi.nlm.nih.gov/taxonomy"
    rank | The rank we assign to this source. Multiple sources *cannot* have the same rank. | 1, 4, 5
    status | The status of this TNRS at the time of this request. Note that "offline" or "temporarily offline" TNRSes were NOT queried for the results returned in this document. | Either "online", "offline" or "temporarily offline"
    annotations | A dictionary mapping each annotation which MIGHT be produced by this TNRS to a description of that annotation. | {'nucleotide_uri': "A link to nucleotide sequences on GenBank for this taxon", 'protein_uri': "A link to protein sequences on GenBank for this taxon."}
names | submittedName | The name that was submitted for name resolution. | "Feeelis tigris"
names | matchCount | The number of successful matches. | 0, 2, 4
names | matches | An array of matches, in the following format: |
    Field | Description | Example
    sourceId | A short string used to name the TNRS source from which this name was extracted. See metadata['sources'] to look up the metadata associated with this source. | "ITIS", "NCBI Taxonomy", "iPlant TNRS"
    matchedName | The name matched in this TNRS from the name submitted. There MUST be a name entry in the TNRS for this name, although it is not necessarily valid/accepted. Unlike DarwinCore's scientificName field, we prefer that this not contain the taxonomic authority, although it may contain it if the TNRS does not provide a single uni/bi/trinomial. | "Felis tigris"
    acceptedName | The currently accepted name for individuals of the taxon identified in matchedName. If the TNRS does not contain synonymy information, or if there is no currently accepted name, this field should be blank. Unlike DarwinCore's acceptedNameUsage field, we prefer that this not contain the taxonomic authority, although it may contain it if the TNRS does not provide a single uni/bi/trinomial. | "Panthera tigris"
    uri | A URI corresponding to the acceptedName (NOT the matchedName). Ideally, this should be an HTTP URL to an RDF document, but an HTML document is also fine. TODO: We need a way of indicating whether this is an RDF document or not; either with different field names ("uri" vs "rdf") or possibly hacking it via different URI schemes: "http+rdf://" vs "http://", for instance. | "http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2478188" (RDF) or "http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=183805" (HTML)
    annotations | A dictionary of annotations specific to this TNRS. See metadata['sources'][0]['annotations'], etc. for the descriptions of these annotations. | {'nucleotide_uri': "http://www.ncbi.nlm.nih.gov/nuccore/?term=txid9694[Organism:exp]", 'protein_uri': "http://www.ncbi.nlm.nih.gov/protein/?term=txid9694[Organism:exp]"}
    score | A score (from 0 to 1) indicating how certain the TNRS is of this match. Note that in some cases (where the TNRS does not provide scores), the controller may calculate its own score (either by calculating the number of characters that differ between the matchedName and the submittedName, or by simply setting it to 1.0 where they are identical and 0.5 where they are not). | 0.5, 0.6667, 0.98989
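
The two fallback scores described for the score field could be sketched as follows. This is only an illustration: the function names are hypothetical, and normalizing the character difference by the length of the longer name is an assumption, since the text above only says the score is based on the number of differing characters.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]


def simple_score(submitted, matched):
    """1.0 where the names are identical, 0.5 where they are not."""
    return 1.0 if submitted == matched else 0.5


def distance_score(submitted, matched):
    """Score in [0, 1] based on edit distance (the normalization is an assumption)."""
    longest = max(len(submitted), len(matched))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(submitted, matched) / longest


print(simple_score("Feeelis tigris", "Felis tigris"))
print(distance_score("Magnifera indica", "Mangifera indica"))
```

A source that reports its own score would bypass both functions; they only apply where the controller has to invent one.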

Example

Request:

 query=Panthera+tigris%0AEutamias+minimus%0AMagnifera+indica%0AHumbert+humbert
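
As a rough sketch, a client could build this request string with Python's standard library. The parameter name `query` is taken from the example request (the API section calls the parameter `names`), and `build_request` is a hypothetical helper, not part of any existing client.

```python
from urllib.parse import urlencode


def build_request(names):
    """Join taxon names with newlines and URL-encode them as a single form parameter."""
    # "query" follows the example request; the final parameter name is an assumption.
    return urlencode({"query": "\n".join(names)})


request_body = build_request(
    ["Panthera tigris", "Eutamias minimus", "Magnifera indica", "Humbert humbert"]
)
print(request_body)
```

`urlencode` turns spaces into `+` and newlines into `%0A`, so the printed string matches the request line shown above.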

Result: For brevity, only a single match is shown per input name. Our TNRS will query every source TNRS with each input name and return all the matches found, ranked in order of TNRS preference; this should usually result in at least two matches. Users may take the first match as our best guess at the validated name.

 {
   "metadata": {
       "jobId": 1,
       "submitDate": "2012-06-06T14:54Z",
       "sources": [{
           "sourceId": "ITIS",
           "sourceName": "Integrated Taxonomic Information System",
           "uri": "http://www.itis.gov/",
           "rank": 1,
           "status": "online",
           "annotations": {"TSN": "Taxonomic Serial Number, ITIS' internal identifier"}
       }, {
           "sourceId": "NCBI Taxonomy",
           "sourceName": "NCBI Taxonomy",
           "uri": "http://www.ncbi.nlm.nih.gov/taxonomy",
           "rank": 2,
           "status": "online",
           "annotations": {"nucleotide_uri": "A link to nucleotide sequences on GenBank for this taxon", "protein_uri": "A link to protein sequences on GenBank for this taxon."}
       }, {
           "sourceId": "iPlant TNRS",
           "sourceName": "iPlant Collaborative Taxonomic Name Resolution Service v3.0 ",
           "uri": "http://tnrs.iplantcollaborative.org/",
           "rank": 3,
           "status": "online",
           "annotations": {"Authority": "The taxonomic authority for the species."}
       }, {
           "sourceId": "ABC TNRS",
           "sourceName": "Animals, Birds and Cattle TNRS",
           "uri": "http://www.example.com/tnrs",
           "rank": 4,
           "status": "offline",
           "annotations": {}
       }]
   },
   "names": [{
       "submittedName": "Panthera tigris",
       "matchCount": 1,
       "matches": [{
           "sourceId": "ITIS",
           "matchedName": "Panthera tigris",
           "acceptedName": "Panthera tigris",
           "uri": "http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=183805",
           "annotations": { "TSN": "183805" },
           "score": 1.0
       }]
   }, {
       "submittedName": "Eutamias minimus",
       "matchCount": 1,
       "matches": [{
           "sourceId": "ITIS",
           "matchedName": "Eutamias minimus",
           "acceptedName": "Tamias minimus",
           "uri": "http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=180195",
           "annotations": { "TSN": "180195" },
           "score": 0.5
       }]
   }, {
       "submittedName": "Magnifera indica",
       "matchCount": 1,
       "matches": [{
           "sourceId": "iPlant TNRS",
           "matchedName": "Mangifera indica",
           "acceptedName": "Mangifera indica",
           "uri": "http://www.tropicos.org/Name/1300071",
           "annotations": { "Authority": "L." },
           "score": 0.98
       }]
   }, {
       "submittedName": "Humbert humbert",
       "matchCount": 0,
       "matches": []
   }]
 }
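
A consumer of this result might pick the best guess for each input name along the following lines. This is a sketch under assumptions: it treats a lower rank number as the more preferred source, falls back to matchedName when acceptedName is blank, and uses a trimmed, partly hypothetical response (the NCBI Taxonomy match below is invented to show the ranking).

```python
def best_guesses(response):
    """Map each submitted name to the accepted name of its top-ranked match, or None."""
    # Assumption: rank 1 is the most preferred source.
    rank_of = {src["sourceId"]: src["rank"] for src in response["metadata"]["sources"]}
    guesses = {}
    for entry in response["names"]:
        ranked = sorted(entry["matches"], key=lambda m: rank_of[m["sourceId"]])
        if ranked:
            top = ranked[0]
            # acceptedName may be blank if the source has no synonymy information.
            guesses[entry["submittedName"]] = top["acceptedName"] or top["matchedName"]
        else:
            guesses[entry["submittedName"]] = None
    return guesses


# Trimmed response: fields not needed for this sketch are omitted.
response = {
    "metadata": {"sources": [
        {"sourceId": "ITIS", "rank": 1},
        {"sourceId": "NCBI Taxonomy", "rank": 2},
    ]},
    "names": [
        {"submittedName": "Eutamias minimus", "matches": [
            {"sourceId": "NCBI Taxonomy", "matchedName": "Eutamias minimus", "acceptedName": ""},
            {"sourceId": "ITIS", "matchedName": "Eutamias minimus", "acceptedName": "Tamias minimus"},
        ]},
        {"submittedName": "Humbert humbert", "matches": []},
    ],
}

guesses = best_guesses(response)
print(guesses)
```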

Existing TNRSes we can plug into

Please remember that any of these TNRSes might have incorrect or outdated data, cross-code synonymies, or other problems!

  • iPlant TNRS: Only plants (via Tropicos/NCBI Taxonomy/USDA Plants/Global Compositae Checklist)
  • ITIS: All life, but focuses on North American taxa
  • EOL: All life, merges multiple taxonomic trees from different providers
  • NCBI Taxonomy: All life
  • uBio: All life
  • WoRMS: Marine species
  • Global Names Index, which contains ~17 million names, and returns lexical groups of similar names and links to sources
  • Global Names Recognition service, which identifies things that look like taxon names in a document or webpage


Feature Matrix

Name | Animals | Plants | Fungi | Micro | Global | Typos | Common names | Synonyms | Cross ID | Classification | Support scores | Taxonomic parsing | WS info | Notes
iPlant TNRS | No | Yes | No | No | Partial | Yes | No | Yes | Yes | Yes | Yes | Yes | [1]? | Hierarchical search possible
ITIS | Yes | Yes | Yes | ? | ? | No | No | Partial | ? | Yes | No | ? | [2] |
EOL | Yes | Yes | Yes | ? | Yes | No | Yes | Partial | Yes | Yes | No | ? | ? |
NCBI Taxonomy | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | ? | Yes | No | ? | [3] | Contains many taxonomically invalid names
uBio | Yes | Yes | Yes | Yes | ? | ? | ? | ? | ? | ? | ? | ? | [4] |
Required? | Yes | Yes | Yes | Yes | Yes | Yes | ? | Yes | ? | ? | Yes | ? | NA | Taxonomic parsing might be required for infraspecifics and authors


Day 1 Discussion

We came up with three alternative API designs, ranging from simple to elaborate. The choice of these strategies has to be coordinated and matched against the core architecture, especially tree storage and retrieval.

Design 1 (simple)

In the simplest scenario, the TNRS returns a list of all known possible valid names for a given (potentially invalid) name. The list of names can be annotated with attributes such as source, associated IDs, their status (i.e. whether a name is the canonical name for that species), etc. In this scenario, the burden of figuring out what to do with each name is on the users of the API. We envision that users of the API will search all the mega-trees for all the returned names: if any of the names matches a name in a mega-tree, that name should be used.

In those cases where a name is associated with multiple species, this API can try to return multiple lists, each corresponding to a different species. However, it is not always possible to (easily) figure out these cases from the output of external TNRS services we are going to use.
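
Under Design 1, the client-side search could look roughly like this. The function name and the flat list of tip labels are assumptions made for illustration; a real mega-tree would need its own lookup.

```python
def first_name_in_tree(candidate_names, tip_labels):
    """Return the first candidate name that labels a tip of the tree, or None."""
    tips = set(tip_labels)  # set membership keeps the search fast on large trees
    for name in candidate_names:
        if name in tips:
            return name
    return None


# Names the TNRS might return for one submitted name, checked against one tree.
candidates = ["Felis tigris", "Panthera tigris"]
tree_tips = ["Panthera leo", "Panthera tigris", "Tamias minimus"]
print(first_name_in_tree(candidates, tree_tips))
```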

Design 2 (in-between)

In this design, we still provide the operation described in the first design. In addition, we return one of the available names for each species as the current name. This does not have to be the correct name for the species (whatever that means), but it has to be consistent. This single consistent name will enable users of the TNRS service to match species across different trees and against user queries. There is a limit to this consistency, however: over time, what we return as the current name can change. This complicates matters for users of the API. If mega-trees are stored, the stored taxon names could become outdated (out of sync with the current name returned by the TNRS). Possible solutions to this problem are:

  • Updating stored mega-trees periodically, so that they are synchronized with current names returned by the TNRS service.
  • Every time a new query comes in, we query the current name for all the taxa, updating the changed names in all the stored trees.
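
Either solution amounts to re-resolving stored labels against the current names. A minimal sketch, assuming a hypothetical `current_name_for` callable that returns the current name for a label (or None if the label cannot be resolved):

```python
def refresh_tip_labels(tip_labels, current_name_for):
    """Return the updated labels and a list of (old, new) renames."""
    updated, renames = [], []
    for label in tip_labels:
        current = current_name_for(label) or label  # keep the label if unresolved
        if current != label:
            renames.append((label, current))
        updated.append(current)
    return updated, renames


# A dictionary stands in for the TNRS lookup in this sketch.
current_names = {"Eutamias minimus": "Tamias minimus", "Panthera tigris": "Panthera tigris"}
labels, renames = refresh_tip_labels(
    ["Eutamias minimus", "Panthera tigris", "Humbert humbert"], current_names.get
)
print(labels)
print(renames)
```

The list of renames could also feed a warning to the user when a stored or submitted name is changed.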

Design 3 (elaborate)

In the most elaborate design, we use IDs to formalize the entities stored in mega-trees. We will assign one ID to each species stored in our system. Stored trees should label their tips with these IDs rather than with species names. The TNRS service will include two operations: returning an ID given a (potentially incorrect) name, and returning the currently accepted name for a given ID. If two species have the same name, they should be assigned different IDs and the service should return both IDs. A typical use of the API would be to take user-provided names, map them to IDs, find those IDs in the stored trees, prune and graft, and obtain a tree whose tips are labeled with IDs; the IDs are then turned into currently accepted names, which are the names shown to the user.

The idea here is that IDs will be associated with species, and hence more stable through time, eliminating the need for frequent update of the stored trees.
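
The two operations of Design 3 could be sketched with an in-memory registry. The class, its method names and the IDs ("T1", "T2", "T3") are all hypothetical, and the sketch does exact lookups only (no fuzzy matching of misspelled names).

```python
class IdRegistry:
    """Toy registry: names (including synonyms) map to IDs, IDs map to one accepted name."""

    def __init__(self):
        self._ids_by_name = {}     # name -> set of taxon IDs
        self._accepted_by_id = {}  # taxon ID -> currently accepted name

    def register(self, taxon_id, accepted_name, synonyms=()):
        self._accepted_by_id[taxon_id] = accepted_name
        for name in (accepted_name, *synonyms):
            self._ids_by_name.setdefault(name, set()).add(taxon_id)

    def ids_for_name(self, name):
        """Operation 1: return every ID a name could refer to (several for homonyms)."""
        return sorted(self._ids_by_name.get(name, ()))

    def accepted_name(self, taxon_id):
        """Operation 2: return the currently accepted name for an ID."""
        return self._accepted_by_id.get(taxon_id)


registry = IdRegistry()
registry.register("T1", "Panthera tigris", synonyms=["Felis tigris"])
registry.register("T2", "Crucibulum")  # the fungal genus
registry.register("T3", "Crucibulum")  # the mollusc genus

print(registry.ids_for_name("Felis tigris"))
print(registry.ids_for_name("Crucibulum"))
print(registry.accepted_name("T1"))
```

Because trees would store only the IDs, a change of accepted name means calling `register` again for that ID; the stored trees stay untouched.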

How exactly the IDs should be assigned to species has to be discussed. We considered using existing IDs from sources such as ITIS. This can be achieved by ranking sources, but we have to be careful about whether those IDs stay constant through time. An alternative is generating new IDs internal to Phylotastic (maybe not a good idea?).

General Concerns

No matter which design we choose, there are two concepts that can be implemented on top of our APIs: caching and batching. Caching will permit us to improve performance, especially for the fuzzy match which can be quite slow. Batching permits the user to search for a list of names and get a list of responses in one call.
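
Caching and batching can be layered over any single-name lookup. In this sketch `slow_lookup` is a stand-in for a real (possibly fuzzy) TNRS query, and all names are hypothetical; the `CALLS` list exists only to show which names actually reach the slow path.

```python
from functools import lru_cache

CALLS = []  # records every name that actually reaches the slow lookup


def slow_lookup(name):
    """Stand-in for a slow query against an external TNRS."""
    CALLS.append(name)
    known = {"Eutamias minimus": "Tamias minimus", "Panthera tigris": "Panthera tigris"}
    return known.get(name)


@lru_cache(maxsize=1024)
def cached_lookup(name):
    """Caching: repeated names are answered from memory."""
    return slow_lookup(name)


def resolve_batch(names_text):
    """Batching: resolve a newline-delimited list of names in one call."""
    names = [line.strip() for line in names_text.split("\n") if line.strip()]
    return [(name, cached_lookup(name)) for name in names]


results = resolve_batch("Panthera tigris\nEutamias minimus\nPanthera tigris")
print(results)
print(CALLS)
```

The repeated "Panthera tigris" is served from the cache, so the slow lookup runs only twice for three input names.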

In addition, we discussed whether our API needs to be synchronous or asynchronous. Our current thinking is that we need to provide two interfaces for each operation, one that is synchronous and does a simple and fast search (without fuzzy matching), and another one that is more thorough and is asynchronous.
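
One way to picture the two interfaces is a fast exact lookup alongside a job queue for the thorough search. This is a sketch only: the function names are hypothetical, a thread pool stands in for whatever job system is eventually chosen, and the "thorough" search here merely normalizes capitalization instead of doing real fuzzy matching.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

EXACT = {"Panthera tigris": "Panthera tigris"}

_executor = ThreadPoolExecutor(max_workers=2)
_jobs = {}                      # job ID -> Future
_job_ids = itertools.count(1)   # simple incrementing job IDs


def resolve_sync(name):
    """Synchronous interface: exact lookup only, answered immediately."""
    return EXACT.get(name)


def _thorough_lookup(name):
    # Placeholder for the slow, fuzzy search.
    return EXACT.get(name.strip().capitalize())


def submit_job(name):
    """Asynchronous interface: start a thorough search and return its job ID."""
    job_id = next(_job_ids)
    _jobs[job_id] = _executor.submit(_thorough_lookup, name)
    return job_id


def job_result(job_id, timeout=None):
    """Block until the job has finished (or the timeout expires) and return its result."""
    return _jobs[job_id].result(timeout=timeout)


print(resolve_sync("panthera tigris"))
job = submit_job("panthera tigris")
print(job_result(job, timeout=5))
```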

Design discussion

  • Return 1 name or multiple names?
    • Scores?
  • Caching?
  • Which TNRS do fuzzy matching?

Questions/notes

  • What if we end up renaming the name-string given to us by the user? We need to make sure to have a warning to the user ("Your query 'Panthera tigris' was renamed to 'Leonardo tigris' for this search because of ...").

Galaxy specification for PhyloTNRS

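(The following sample XML file is based on http://wiki.g2.bx.psu.edu/Admin/Training/ISMB2010%20Galaxy%20Tutorial:%20Running%20Your%20Own#Tools; see http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax for the full syntax. The command and parameters are still those of the tutorial's example tool and have not yet been adapted to the TNRS.)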

<tool id="org.nescent.phylotastic.tnrs" name="Phylotastic TNRS">
 <description>Extracts data from multiple TNRSes</description>
 <command interpreter="python">get_flanks.py $input $out_file1 $size $direction $region -o $offset -l ${input.metadata.chromCol},${input.metadata.startCol},${input.metadata.endCol},${input.metadata.strandCol}</command>
 <inputs>
   <param format="interval" name="input" type="data" label="Select data"/>
   <param name="region" type="select" label="Region">
     <option value="whole" selected="true">Whole feature</option>
     <option value="start">Around Start</option>
     <option value="end">Around End</option>
   </param>
   <param name="direction" type="select" label="Location of the flanking region/s">
     <option value="Upstream">Upstream</option>
     <option value="Downstream">Downstream</option>
     <option value="Both">Both</option>
   </param>
   <param name="offset" size="10" type="integer" value="0" label="Offset" help="Use positive values to offset co-ordinates in the direction of transcription and negative values to offset in the opposite direction."/>
   <param name="size" size="10" type="integer" value="50" label="Length of the flanking region(s)" help="Use non-negative value for length"/>
 </inputs>
 <outputs>
   <data format="interval" name="out_file1" metadata_source="input"/>
 </outputs>
  ...
 </tool>