Phylotastic/TNRS

For information on how we plan to test this service, please see our testing page.

The Taxonomic Name Resolution Service (TNRS) maps submitted scientific names to the matching names recorded in an existing TNRS, ideally identifying each name by a URL. The goal is to standardize the names used in the trees in Phylotastic and the names users provide when generating subtrees. See TNRS/Glossary for a standardized list of TNRS-related terms.

Team

API

To propose and discuss API changes, please use the API comments page.

GET | POST /submit

Submit a list of taxonomic names to be resolved.

URI

http://api.phylotastic.org/tnrs/submit

Parameters

  • query (required, string): A URL-encoded, newline-delimited list of taxon names (e.g. Panthera+tigris%0AEutamias+minimus%0AMagnifera+indica%0AHumbert+humbert)

Returns

Field | Meaning | Examples
message | Human-readable message | "Your request is being processed. You can retrieve the results at http://api.phylotastic.org/tnrs/retrieve/76ca0e9a3ab78e6bc5b4e362c8c40e15."
submit date | Date and time at which the request was submitted | "Mon Jun 11 20:25:16 2012"
token | Unique identifier assigned to the request (jobId) | "76ca0e9a3ab78e6bc5b4e362c8c40e15"
uri | Address at which the results can be retrieved | "http://api.phylotastic.org/tnrs/retrieve/76ca0e9a3ab78e6bc5b4e362c8c40e15"

Example

GET http://api.phylotastic.org/tnrs/submit?query=Panthera+tigris%0AEutamias+minimus%0AMagnifera+indica%0AHumbert+humbert


{
   "message": "Your request is being processed. You can retrieve the results at http://api.phylotastic.org/tnrs/retrieve/76ca0e9a3ab78e6bc5b4e362c8c40e15.", 
   "submit date": "Mon Jun 11 20:25:16 2012", 
   "token": "76ca0e9a3ab78e6bc5b4e362c8c40e15", 
   "uri": "http://api.phylotastic.org/tnrs/retrieve/76ca0e9a3ab78e6bc5b4e362c8c40e15"
}
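
The query string in the example above can be built with Python's standard library. This is a minimal sketch rather than part of the service: the helper name build_submit_url is hypothetical, and only the base URI and the query parameter come from the specification above.

```python
from urllib.parse import quote_plus

# Base URI taken from the specification above.
SUBMIT_URI = "http://api.phylotastic.org/tnrs/submit"


def build_submit_url(names):
    """Return a GET URL for /submit (hypothetical helper).

    The names are joined with newlines and then URL-encoded, so a space
    becomes '+' and a newline becomes '%0A'.
    """
    query = quote_plus("\n".join(names))
    return f"{SUBMIT_URI}?query={query}"


url = build_submit_url(
    ["Panthera tigris", "Eutamias minimus", "Magnifera indica", "Humbert humbert"]
)
print(url)
```

For long lists, the same encoded string could presumably be sent in a POST body instead, since the endpoint accepts both methods.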


GET /retrieve

Retrieve the resolved names.

URI

http://api.phylotastic.org/tnrs/retrieve/<token>

Parameters

  • none

Returns

Section | Field | Meaning | Examples
metadata | jobId | The job-id which was submitted (for asynchronous requests) | "76ca0e9a3ab78e6bc5b4e362c8c40e15"
metadata | submitDate | Date on which this job was submitted. | "Mon Jun 11 20:25:16 2012"
metadata | sources | An array of all the sources available to our TNRS service, in the following format:
  Field | Description | Example
  sourceId | A short string used to name this source | "ITIS", "NCBI Taxonomy", "iPlant TNRS"
  sourceName | The full name of this source | "iPlant Collaborative TNRS v3"
  uri | A URL used to identify this source; generally the HTTP URL for the frontpage | "http://www.itis.gov/", "http://www.ncbi.nlm.nih.gov/taxonomy"
  rank | The rank to which we assign this source. Multiple sources *cannot* have the same rank. | 1, 4, 5
  status | The status of this TNRS at the time of this request. Note that "offline" or "temporarily offline" TNRSes were NOT queried for the results returned in this document. | "online", "offline" or "temporarily offline"
  annotations | A dictionary containing a list of annotations which MIGHT be produced by this TNRS, mapped to descriptions of that annotation. | {'nucleotide_uri': "A link to nucleotide sequences on GenBank for this taxon", 'protein_uri': "A link to protein sequences on GenBank for this taxon."}
names | submittedName | The name that was submitted for name resolution. | "Feeelis tigris"
names | matchCount | The number of successful matches | 0, 2, 4
names | matches | An array containing a list of matches, in the following format:
  Field | Description | Example
  sourceId | A short string used to name the TNRS source from which this name was extracted. See metadata['sources'] to look up the metadata associated with this source. | "ITIS", "NCBI Taxonomy", "iPlant TNRS"
  matchedName | The name matched in this TNRS from the name submitted. There MUST be a name entry in the TNRS for this name, although it is not necessarily valid/accepted. Unlike DarwinCore's scientificName field, we prefer that this not contain the taxonomic authority, although it may contain it if the TNRS does not provide a single uni/bi/trinomial. | "Felis tigris"
  acceptedName | The currently accepted name for individuals of the taxon identified in matchedName. If the TNRS does not contain synonymy information, or if there is no currently accepted name, this field should be blank. Unlike DarwinCore's acceptedNameUsage field, we prefer that this not contain the taxonomic authority, although it may contain it if the TNRS does not provide a single uni/bi/trinomial. | "Panthera tigris"
  uri | A URI corresponding to the acceptedName (NOT the matchedName). Ideally, this should be an HTTP URL to an RDF document, but an HTML document is also fine. TODO: We need a way of indicating whether this is an RDF document or not; either with different field names ("uri" vs "rdf") or possibly hacking it via different schemas: "http+rdf://" vs "http://", for instance. | "http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2478188" (RDF) or "http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=183805" (HTML)
  annotations | A dictionary of annotations specific to this TNRS. See metadata['sources'][0]['annotations'], etc. for the descriptions of these annotations. | {'nucleotide_uri': "http://www.ncbi.nlm.nih.gov/nuccore/?term=txid9694[Organism:exp]", 'protein_uri': "http://www.ncbi.nlm.nih.gov/protein/?term=txid9694[Organism:exp]"}
  score | A score (from 0 to 1) indicating how certain the TNRS is of this match. Note that in some cases (where the TNRS does not provide scores), the controller may calculate its own score, either by counting the characters that differ between the matchedName and the submittedName, or by simply setting it to '1.0' where they are identical and '0.5' where they are not. | 0.5, 0.6667, 0.98989
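
The two fallback scoring strategies mentioned for the score field could look roughly like the sketch below. The specification only says the controller may count differing characters or fall back to 1.0/0.5; using Levenshtein edit distance and dividing by the longer name's length are assumptions made here for illustration, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def fallback_score(submitted, matched, simple=False):
    """Score in [0, 1] for sources that do not report their own score.

    simple=True gives the 1.0 / 0.5 rule from the text. Otherwise the
    edit distance is normalised by the longer name (an assumption).
    """
    if submitted == matched:
        return 1.0
    if simple:
        return 0.5
    longest = max(len(submitted), len(matched))
    return 1.0 - edit_distance(submitted, matched) / longest


print(fallback_score("Feeelis tigris", "Felis tigris"))
print(fallback_score("Feeelis tigris", "Felis tigris", simple=True))
```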

Example

GET http://api.phylotastic.org/tnrs/retrieve/76ca0e9a3ab78e6bc5b4e362c8c40e15

{
   "metadata": {
       "jobId": "76ca0e9a3ab78e6bc5b4e362c8c40e15", 
       "sources": [
           {
               "annotations": {}, 
               "description": "NCBI Taxonomy", 
               "name": "NCBI", 
               "publication": "Federhen S. The Taxonomy Project.2002 Oct 9 [Updated 2003 Aug 13]. In: McEntyre J., Ostell J., editors. The NCBI Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US);2002.", 
               "rank": 3, 
               "sourceId": "NCBI", 
               "status": "200: OK", 
               "uri": "http://www.ncbi.nlm.nih.gov/taxonomy"
           }, 
           {
               "annotations": {
                   "Authority": "Author attributed to the accepted name (where applicable)."
               }, 
               "description": "The iPlant Collaborative TNRS provides parsing and fuzzy matching for plant taxa.", 
               "name": "iPlant Collaborative TNRS v3.0", 
               "publication": "The Taxonomic Name Resolution Service; http://tnrs.iplantcollaborative.org; version 3.0.", 
               "rank": 2, 
               "sourceId": "iPlant TNRS", 
               "status": "200: OK", 
               "uri": "http://tnrs.iplantcollaborative.org/"
           }
       ], 
       "sub_date": "Mon Jun 11 20:25:16 2012"
   }, 
   "names": [
       {
           "matchCount": 1, 
           "matches": [
               {
                   "acceptedName": "Humbertia", 
                   "annotations": {
                       "Authority": "Lam."
                   }, 
                   "matchedName": "Humbertia", 
                   "score": "0.46973019780931", 
                   "sourceId": "iPlant TNRS", 
                   "uri": "http://www.tropicos.org/Name/40028244"
               }
           ], 
           "submittedName": "Humbert humbert"
       }, 
       {
           "matchCount": 2, 
           "matches": [
               {
                   "acceptedName": "Vitis vinifera", 
                   "annotations": {
                       "Authority": "L."
                   }, 
                   "matchedName": "Vitis vinifera", 
                   "score": "1", 
                   "sourceId": "iPlant TNRS", 
                   "uri": "http://www.tropicos.org/Name/34000217"
               }, 
               {
                   "acceptedName": "Vitis vinifera", 
                   "annotations": {}, 
                   "matchedName": "Vitis vinifera", 
                   "score": "1", 
                   "sourceId": "NCBI", 
                   "uri": "http://www.ncbi.nlm.nih.gov/taxonomy/29760"
               }
           ], 
           "submittedName": "Vitis vinifera"
       }, 
       {
           "matchCount": 2, 
           "matches": [
               {
                   "acceptedName": "Mangifera indica", 
                   "annotations": {
                       "Authority": "L."
                   }, 
                   "matchedName": "Mangifera indica", 
                   "score": "0.98210117101673", 
                   "sourceId": "iPlant TNRS", 
                   "uri": "http://www.tropicos.org/Name/1300071"
               }, 
               {
                   "acceptedName": "Mangifera indica", 
                   "annotations": {}, 
                   "matchedName": "Magnifera indica", 
                   "score": "1", 
                   "sourceId": "NCBI", 
                   "uri": "http://www.ncbi.nlm.nih.gov/taxonomy/29780"
               }
           ], 
           "submittedName": "Magnifera indica"
       }, 
       {
           "matchCount": 1, 
           "matches": [
               {
                   "acceptedName": "Euthamia", 
                   "annotations": {
                       "Authority": "(Nutt.) Cass."
                   }, 
                   "matchedName": "Euthamia", 
                   "score": "0.45701346754469", 
                   "sourceId": "iPlant TNRS", 
                   "uri": "http://www.tropicos.org/Name/40007649"
               }
           ], 
           "submittedName": "Eutamias minimus"
       }, 
       {
           "matchCount": 2, 
           "matches": [
               {
                   "acceptedName": "Megalachne", 
                   "annotations": {
                       "Authority": "Steud."
                   }, 
                   "matchedName": "Pantathera", 
                   "score": "0.47790686999749", 
                   "sourceId": "iPlant TNRS", 
                   "uri": "http://www.tropicos.org/Name/40015658"
               }, 
               {
                   "acceptedName": "Panthera tigris", 
                   "annotations": {}, 
                   "matchedName": "Panthera tigris", 
                   "score": "1", 
                   "sourceId": "NCBI", 
                   "uri": "http://www.ncbi.nlm.nih.gov/taxonomy/9694"
               }
           ], 
           "submittedName": "Panthera tigris"
       }
   ]
}
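
A client will usually want one answer per submitted name. The sketch below picks the match with the highest score and breaks ties by source rank. It is an illustration only: the helper best_matches is hypothetical, and treating a lower rank number as higher priority is an assumption, since the specification does not say which direction the ranking runs. Scores are converted with float() because the example response returns them as strings.

```python
def best_matches(response):
    """Pick one match per submitted name (hypothetical helper).

    Highest score wins; ties go to the source with the lowest rank
    number. Lower-rank-is-better is an assumption, not part of the spec.
    """
    rank_of = {
        source["sourceId"]: source["rank"]
        for source in response["metadata"]["sources"]
    }
    chosen = {}
    for entry in response["names"]:
        matches = entry["matches"]
        if not matches:
            chosen[entry["submittedName"]] = None
            continue
        chosen[entry["submittedName"]] = max(
            matches,
            # Unknown sources sort last on ties because their rank is infinite.
            key=lambda m: (float(m["score"]),
                           -rank_of.get(m["sourceId"], float("inf"))),
        )
    return chosen


# Trimmed copy of the example response above.
sample = {
    "metadata": {"sources": [
        {"sourceId": "NCBI", "rank": 3},
        {"sourceId": "iPlant TNRS", "rank": 2},
    ]},
    "names": [
        {"submittedName": "Panthera tigris", "matchCount": 2, "matches": [
            {"acceptedName": "Megalachne", "matchedName": "Pantathera",
             "score": "0.47790686999749", "sourceId": "iPlant TNRS"},
            {"acceptedName": "Panthera tigris", "matchedName": "Panthera tigris",
             "score": "1", "sourceId": "NCBI"},
        ]},
        {"submittedName": "Vitis vinifera", "matchCount": 2, "matches": [
            {"acceptedName": "Vitis vinifera", "matchedName": "Vitis vinifera",
             "score": "1", "sourceId": "iPlant TNRS"},
            {"acceptedName": "Vitis vinifera", "matchedName": "Vitis vinifera",
             "score": "1", "sourceId": "NCBI"},
        ]},
    ],
}

for name, match in best_matches(sample).items():
    print(name, "->", match["acceptedName"], "via", match["sourceId"])
```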

Demo

A demonstration implementation of this API (snappily named tnrastic) was developed by the team at Phylotastic. It consists of a Perl web application, built on the Dancer framework, that handles API requests. We've written adapters for NCBI Taxonomy, iPlant TNRS and ITIS, as well as a hook into NCBI Taxonomy's spelling correction feature.

Adapters

Each TNRS is represented by an adapter, which is an executable (generally either a Perl or Python script).

We use a very simple subset of our main API to communicate with adapters. Each adapter accepts a newline-delimited list of taxa on standard input. On success, it writes a JSON document to standard output in the following format:

 {"names":[{"submittedName":"Eutamias minimus","acceptedName":"Tamias minimus","score":0.5,"matchedName":"Eutamias minimus","annotations":{"TSN":"180195","originalTSN":"180144"},"uri":"http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=180195"}],"status":200,"errorMessage":""}

And the following in case of error:

 {"status": 500, "errorMessage": "Could not connect to the server"}
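
As an illustration of this protocol, here is a minimal adapter skeleton in Python. It is a sketch under stated assumptions, not the code of any existing adapter: the names build_response, run_adapter and toy_lookup are hypothetical, and silently omitting unmatched names from "names" is an assumption the protocol description does not confirm. The toy data repeats the ITIS example shown above.

```python
import io
import json


def build_response(names, lookup):
    """Build the adapter's JSON-ready response for a list of names.

    `lookup` is a hypothetical callable: name -> dict of match fields,
    or None when the source has no match.
    """
    try:
        matched = []
        for name in names:
            hit = lookup(name)
            if hit is not None:  # assumption: unmatched names are omitted
                matched.append({"submittedName": name, **hit})
        return {"names": matched, "status": 200, "errorMessage": ""}
    except Exception as error:
        # Any failure is reported in the documented error format.
        return {"status": 500, "errorMessage": str(error)}


def run_adapter(stdin, stdout, lookup):
    """Read newline-delimited names, write one JSON document."""
    names = [line.strip() for line in stdin if line.strip()]
    json.dump(build_response(names, lookup), stdout)


def toy_lookup(name):
    """Stand-in for a real TNRS query; data copied from the example above."""
    known = {
        "Eutamias minimus": {
            "matchedName": "Eutamias minimus",
            "acceptedName": "Tamias minimus",
            "score": 0.5,
            "annotations": {"TSN": "180195", "originalTSN": "180144"},
            "uri": "http://www.itis.gov/servlet/SingleRpt/SingleRpt"
                   "?search_topic=TSN&search_value=180195",
        }
    }
    return known.get(name)


# In-memory streams stand in for standard input and output here.
out = io.StringIO()
run_adapter(io.StringIO("Eutamias minimus\nHumbert humbert\n"), out, toy_lookup)
print(out.getvalue())
```

A real adapter script would pass sys.stdin and sys.stdout to run_adapter instead of the in-memory streams.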


NCBI Adapter

Uses NCBI taxonomy (http://www.ncbi.nlm.nih.gov/taxonomy) as a TNRS service.

MSW3 Adapter

This adapter is not so much an adapter as a new TNRS service, based on an authoritative taxonomy, the Mammal Species of the World database, version 3 (MSW3) [reference: Don E. Wilson & DeeAnn M. Reeder (editors). 2005. Mammal Species of the World. A Taxonomic and Geographic Reference (3rd ed).] The adapter searches for the input names in a locally stored copy of the MSW3 database and uses heuristics to find the best match for each given name. If one match is found, the accepted name is created from the genus and species of the matched record. Note that we double-checked and confirmed that we have permission to include the MSW3 database in our services.

Refer to the README file in the MSW3 directory on GitHub for more information.

Demo Client

See phylotastic-tnrs-client for an example command line application that uses the demo service.

Existing TNRSes we can plug into

Please remember that any of these TNRSes might have incorrect or outdated data, cross-code synonymies, or other problems!

  • iPlant TNRS: Only plants (via Tropicos/NCBI Taxonomy/USDA Plants/Global Compositae Checklist)
  • ITIS: All life, but focuses on North American taxa
  • EOL: All life, merges multiple taxonomic trees from different providers
  • NCBI Taxonomy: All life
  • uBio: All life
  • WoRMS: Marine species
  • Global Names Index, which contains ~17 million names, and returns lexical groups of similar names and links to sources
  • Global Names Recognition service, which identifies things that look like taxon names in a document or webpage


Feature Matrix

Name Scope Typos Common names Synonyms Cross ID Classification Support scores Taxonomic parsing WS info Notes
EOL Global No Yes Partial Yes Yes No  ?  ?
(GNA Resolver) global Yes  ? Yes  ? Yes Yes Yes [1] "alpha" experimental software; access to NCBI, Catalogue of Life, ITIS, Index Fungorum, GBIF, IPNI, EOL, Union
iPlant TNRS Plants Yes No Yes Yes Yes Yes Yes [2]? Hierarchical search possible
ITIS North American organisms No No Partial  ? Yes No  ? [3]
NCBI Taxonomy Sources of sequence data (see note) No Yes ? Yes No  ? [4] Doesn't do spell-check, but matches against DB with misspellings inherited from sources; contains many taxonomically invalid names
uBio Global No Yes Yes  ? Yes  ?  ? [5] accesses ITIS and NCBI data; license [6] (item #7) prohibits further aggregation or proxying-- access to services is only to authorized end users via personal keycode
Required? Global Yes  ? Yes  ?  ? Yes  ? NA Taxonomic parsing might be required for infraspecifics and authors

Day 1 Discussion

We came up with three alternative API designs, ranging from simple to elaborate. The choice of these strategies has to be coordinated and matched against the core architecture, especially tree storage and retrieval.

Design 1 (simple)

In the simplest scenario, the TNRS returns a list of all known possible valid names for a given (potentially invalid) name. The list of names can be annotated with attributes such as source, associated IDs, their status (i.e. whether a name is the canonical name for that species), etc. In this scenario, the burden of figuring out what to do with each name is on the users of the API. We envision that users will search all the mega-trees for every name in the returned list; if any of the names matches a name in a mega-tree, that name should be used.

In those cases where a name is associated with multiple species, this API can try to return multiple lists, each corresponding to a different species. However, it is not always possible to (easily) figure out these cases from the output of external TNRS services we are going to use.
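
The client-side lookup that Design 1 leaves to API users can be sketched as follows. The function name and the flat list of tip labels are hypothetical stand-ins; how mega-trees are actually stored is outside the scope of this page.

```python
def first_name_in_tree(candidate_names, tree_tip_labels):
    """Return the first candidate that labels a tip in the tree, else None.

    candidate_names: names returned by the TNRS for one submitted name.
    tree_tip_labels: tip labels of one mega-tree (hypothetical flat list).
    """
    tips = set(tree_tip_labels)  # set membership keeps each check O(1)
    for name in candidate_names:
        if name in tips:
            return name
    return None


tree = ["Panthera tigris", "Tamias minimus", "Vitis vinifera"]
print(first_name_in_tree(["Eutamias minimus", "Tamias minimus"], tree))
```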

Design 2 (in-between)

In this design, we keep the operation described in the first design. In addition, we return one of the available names for each species as the current name. This need not be the correct name for the species (whatever that means), but it has to be consistent. A single consistent name lets users of the TNRS service match species across different trees and against user queries. This consistency has a limit, however: what we return as the current name can change over time, which complicates matters for the users of the API. If mega-trees are stored, the stored taxon names could become outdated (out of sync with the current name returned by the TNRS). Possible solutions to this problem are:

  • Updating stored mega-trees periodically, so that they are synchronized with current names returned by the TNRS service.
  • Every time a new query comes in, we query the current name for all the taxa, updating the changed names in all the stored trees.
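
Either solution comes down to re-resolving stored tip labels against the current names. A minimal sketch, assuming a hypothetical current_name callable that returns the current name for a label (or None when the label cannot be resolved):

```python
def refresh_tip_labels(tip_labels, current_name):
    """Return (updated labels, list of (old, new) renames).

    current_name is a hypothetical callable standing in for a TNRS query.
    Labels that cannot be resolved are kept unchanged.
    """
    updated, renames = [], []
    for label in tip_labels:
        new_label = current_name(label) or label
        if new_label != label:
            renames.append((label, new_label))
        updated.append(new_label)
    return updated, renames


# Toy mapping in place of a live service.
current = {"Felis tigris": "Panthera tigris", "Vitis vinifera": "Vitis vinifera"}.get
labels, changes = refresh_tip_labels(["Felis tigris", "Vitis vinifera", "Humbert humbert"], current)
print(labels)
print(changes)
```

Returning the list of renames makes it possible to tell users which stored names were changed.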

Design 3 (elaborate)

In the most elaborate design, we use IDs to formalize the entities stored in mega-trees. We will assign one ID to each species stored in our system. Stored trees should label their tips with these IDs, not with species names. The TNRS service will include two operations: returning an ID given a (potentially incorrect) name, and returning the currently accepted name for a given ID. If two species have the same name, they should be assigned different IDs and the service should return both IDs. A typical use of the API will be to take user-provided names, map them to IDs, find those IDs in the stored trees, prune and graft, and obtain a tree with tips labeled with IDs; the IDs are then turned into currently accepted names, and these are the names shown to the user.

The idea here is that IDs will be associated with species, and hence more stable through time, eliminating the need for frequent update of the stored trees.
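
The two operations in Design 3 can be sketched with an in-memory registry. Everything here is hypothetical: the class name, the method names, and the ID format, since how IDs are assigned is still an open question.

```python
class IdRegistry:
    """In-memory stand-in for the two ID operations of Design 3."""

    def __init__(self):
        self._accepted = {}     # ID -> currently accepted name
        self._ids_by_name = {}  # any known name -> set of IDs

    def add_taxon(self, taxon_id, accepted_name, other_names=()):
        self._accepted[taxon_id] = accepted_name
        for name in (accepted_name, *other_names):
            self._ids_by_name.setdefault(name, set()).add(taxon_id)

    def ids_for_name(self, name):
        """Operation 1: name -> all matching IDs (several if homonyms exist)."""
        return sorted(self._ids_by_name.get(name, ()))

    def accepted_name(self, taxon_id):
        """Operation 2: ID -> currently accepted name."""
        return self._accepted[taxon_id]


registry = IdRegistry()
registry.add_taxon("ID:1", "Panthera tigris", other_names=["Felis tigris"])
registry.add_taxon("ID:2", "Vitis vinifera")

# User names -> IDs (used to find tips in stored trees) -> names to display.
user_names = ["Felis tigris", "Vitis vinifera"]
tip_ids = [registry.ids_for_name(name)[0] for name in user_names]
display = [registry.accepted_name(taxon_id) for taxon_id in tip_ids]
print(tip_ids, display)
```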

How exactly the IDs should be assigned to species has to be discussed. We considered using existing IDs from sources such as ITIS. This can be achieved by ranking sources, but we have to be careful about whether those IDs stay constant through time. An alternative is generating new IDs internal to Phylotastic (maybe not a good idea?).

General Concerns

No matter which design we choose, there are two concepts that can be implemented on top of our APIs: caching and batching. Caching will permit us to improve performance, especially for the fuzzy match which can be quite slow. Batching permits the user to search for a list of names and get a list of responses in one call.
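
Caching and batching can both be layered over a single-name lookup. The sketch below uses functools.lru_cache from the standard library; slow_lookup is a hypothetical placeholder for an expensive fuzzy match, and the calls list exists only to show when the cache is hit.

```python
from functools import lru_cache

calls = []  # records every uncached lookup, for demonstration only


def slow_lookup(name):
    """Placeholder for an expensive fuzzy match against a TNRS."""
    calls.append(name)
    return name.capitalize()


@lru_cache(maxsize=1024)
def cached_lookup(name):
    # Repeated names are answered from the cache without calling slow_lookup.
    return slow_lookup(name)


def resolve_batch(names):
    """Resolve a list of names in one call and return a name -> result dict."""
    return {name: cached_lookup(name) for name in names}


print(resolve_batch(["panthera tigris", "vitis vinifera"]))
print(resolve_batch(["panthera tigris"]))  # served from the cache
print(calls)
```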

In addition, we discussed whether our API needs to be synchronous or asynchronous. Our current thinking is that we need to provide two interfaces for each operation, one that is synchronous and does a simple and fast search (without fuzzy matching), and another one that is more thorough and is asynchronous.
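
The two interfaces could be sketched like this: a synchronous exact-match lookup, and an asynchronous job that returns a token immediately and does the fuzzy matching in the background. All names are hypothetical, the known-name list is a toy, and difflib.get_close_matches merely stands in for whatever fuzzy matching a real TNRS provides.

```python
import difflib
import uuid
from concurrent.futures import ThreadPoolExecutor

KNOWN_NAMES = ["Mangifera indica", "Panthera tigris"]  # toy reference list

_executor = ThreadPoolExecutor(max_workers=2)
_jobs = {}  # token -> Future


def fast_lookup(names):
    """Synchronous interface: exact matches only, no fuzzy matching."""
    return {name: (name if name in KNOWN_NAMES else None) for name in names}


def thorough_lookup(names):
    """Slower interface: fuzzy matching against the reference list."""
    results = {}
    for name in names:
        close = difflib.get_close_matches(name, KNOWN_NAMES, n=1)
        results[name] = close[0] if close else None
    return results


def submit(names):
    """Asynchronous interface: start a job and return its token at once."""
    token = uuid.uuid4().hex
    _jobs[token] = _executor.submit(thorough_lookup, names)
    return token


def retrieve(token, wait=False):
    """Return the job's results, or None if it has not finished yet."""
    future = _jobs[token]
    if not wait and not future.done():
        return None
    return future.result()


print(fast_lookup(["Magnifera indica"]))
job = submit(["Magnifera indica"])
print(retrieve(job, wait=True))
```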

Design discussion

  • Return 1 name or multiple names?
    • Scores?
  • Caching?
  • Which TNRS do fuzzy matching?

Questions/notes

  • What if we end up renaming the name-string given to us by the user? We need to make sure to have a warning to the user ("Your query 'Panthera tigris' was renamed to 'Leonardo tigris' for this search because of ...").

Galaxy specification for PhyloTNRS

(The following sample XML file is based on http://wiki.g2.bx.psu.edu/Admin/Training/ISMB2010%20Galaxy%20Tutorial:%20Running%20Your%20Own#Tools but see http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax for a full syntax)

<tool id="org.nescent.phylotastic.tnrs" name="Phylotastic TNRS">
 <description>Extracts data from multiple TNRS </description>
 <command interpreter="python">get_flanks.py $input $out_file1 $size $direction $region -o $offset -l ${input.metadata.chromCol},${input.metadata.startCol},${input.metadata.endCol},${input.metadata.strandCol}</command>
 <inputs>
   <param format="interval" name="input" type="data" label="Select data"/>
   <param name="region" type="select" label="Region">
     <option value="whole" selected="true">Whole feature</option>
     <option value="start">Around Start</option>
     <option value="end">Around End</option>
   </param>
   <param name="direction" type="select" label="Location of the flanking region/s">
     <option value="Upstream">Upstream</option>
     <option value="Downstream">Downstream</option>
     <option value="Both">Both</option>
   </param>
   <param name="offset" size="10" type="integer" value="0" label="Offset" help="Use positive values to offset co-ordinates in the direction of transcription and negative values to offset in the opposite direction."/>
   <param name="size" size="10" type="integer" value="50" label="Length of the flanking region(s)" help="Use non-negative value for length"/>
 </inputs>
 <outputs>
   <data format="interval" name="out_file1" metadata_source="input"/>
 </outputs>
  ...
 </tool>

Agreements with Data Providers

MSW3

From: "Olaf R.P. Bininda-Emonds" <olaf.bininda@uni-oldenburg.de>
Date: June 22, 2012 6:17:53 AM EDT
To: Arlin Stoltzfus <arlin@umd.edu>
Cc: Kate Jones <kate.e.jones@ucl.ac.uk>
Subject: Re: naming conventions in mammalian supertree

Hi Arlin,

I've now heard back from the primary author who compiled the list, Kate Jones, and she has given permission for you to use it.

An updated, citable version of the list can be found in conjunction with our Ecology paper describing PanTHERIA. Look under the metadata link here: http://esapubs.org/archive/ecol/E090/184/metadata.htm . You'll see that the mappings to MSW05 are also there.

In using the list, please be sure to cite the Ecology paper as well as giving particular credit to Kate Jones and Susanne Fritz, who were the main people who put it together.

Cheers,

Olaf

---

On 20.06.2012, at 22:55, Arlin Stoltzfus wrote:

Olaf--

Do we have your permission to use this file? We would like to use it to provide automated name-mappings that will make it easier to integrate the 4500-species mammal tree with other data. Ultimately we would like to implement a taxonomic name resolution service based on Mammal Species of the World.

I also am curious as to how your table of mappings was derived. One member of our team looked at a database for the current edition of Mammal Species of the World, and the list of synonyms apparently is not as extensive, though additional names can be found embedded in "comments". Does your list represent a value-added version of the information in MSW93?

Arlin
