Difference between revisions of "Phylotastic/TNRS"
− | |||
==Day 1 Discussion== | ==Day 1 Discussion== | ||
Line 366: | Line 370: | ||
... | ... | ||
</tool> | </tool> | ||
+ | |||
+ | == Agreements with Data Providers == | ||
+ | === MSW3 === | ||
+ | From: "Olaf R.P. Bininda-Emonds" <olaf.bininda@uni-oldenburg.de> | ||
+ | Date: June 22, 2012 6:17:53 AM EDT | ||
+ | To: Arlin Stoltzfus <arlin@umd.edu> | ||
+ | Cc: Kate Jones <kate.e.jones@ucl.ac.uk> | ||
+ | Subject: Re: naming conventions in mammalian supertree | ||
+ | |||
+ | Hi Arlin, | ||
+ | |||
+ | I've now heard back from the primary author who compiled the list, Kate Jones, and she has given permission for you to use it. | ||
+ | |||
+ | An updated, citable version of the list can be found in conjunction with our Ecology paper describing PanTHERIA. Look under the metadata link here: http://esapubs.org/archive/ecol/E090/184/metadata.htm . You'll see that the mappings to MSW05 are also there. | ||
+ | |||
+ | In using the list, please be sure to cite the Ecology paper as well as giving particular credit to Kate Jones and Susanne Fritz, who were the main people who put it together. | ||
+ | |||
+ | Cheers, | ||
+ | |||
+ | Olaf | ||
+ | |||
+ | --- | ||
+ | |||
+ | On 20.06.2012, at 22:55, Arlin Stoltzfus wrote: | ||
+ | |||
+ | Olaf-- | ||
+ | |||
+ | Do we have your permission to use this file? We would like to use it to provide automated name-mappings that will make it easier to integrate the 4500-species mammal tree with other data. Ultimately we would like to implement a taxonomic name resolution service based on Mammal Species of the World. | ||
+ | |||
+ | I also am curious as to how your table of mappings was derived. One member of our team looked at a database for the current edition of Mammal Species of the World, and the list of synonyms apparently is not as extensive, though additional names can be found embedded in "comments". Does your list represent a value-added version of the information in MSW93? | ||
+ | |||
+ | Arlin | ||
+ | |||
+ | [[Category:TNRS]] |
Latest revision as of 17:55, 31 January 2013
- For information on how we plan to test this service, please see our testing page.
The Taxonomic Name Resolution Service translates scientific names to scientific names as found in a TNRS, ideally identifying them by means of a URL. The goal is to standardize the names being used in the trees in Phylotastic as well as to standardize names provided by users when generating subtrees. See TNRS/Glossary for a standardized list of TNRS-related terms.
Team
- Naim Matasci
- Siavash Mirarab
- Gaurav Vaidya
API
To propose and discuss API changes please use the API comments page
GET | POST /submit
Submit a list of taxonomic names to be resolved.
URI
http://api.phylotastic.org/tnrs/submit
Parameters
- query (required, string): A URL-encoded, newline-delimited list of taxon names (e.g. Panthera+tigris%0AEutamias+minimus%0AMagnifera+indica%0AHumbert+humbert)
Returns
Field | Meaning | Examples |
---|---|---|
message | Human readable message | "Your request is being processed. You can retrieve the results at http://api.phylotastic.org/tnrs/retrieve/76ca0e9a3ab78e6bc5b4e362c8c40e15." |
submit date | Date and time at which the request was submitted | "Mon Jun 11 20:25:16 2012" |
token | Unique identifier assigned to the request (jobId) | "76ca0e9a3ab78e6bc5b4e362c8c40e15" |
uri | Address at which the results can be retrieved | "http://api.phylotastic.org/tnrs/retrieve/76ca0e9a3ab78e6bc5b4e362c8c40e15" |
Example
{
  "message": "Your request is being processed. You can retrieve the results at http://api.phylotastic.org/tnrs/retrieve/76ca0e9a3ab78e6bc5b4e362c8c40e15.",
  "submit date": "Mon Jun 11 20:25:16 2012",
  "token": "76ca0e9a3ab78e6bc5b4e362c8c40e15",
  "uri": "http://api.phylotastic.org/tnrs/retrieve/76ca0e9a3ab78e6bc5b4e362c8c40e15"
}
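As a minimal sketch of how a client might build the GET form of this request, the snippet below URL-encodes a newline-delimited list of names into the `query` parameter. The function name `build_submit_url` is hypothetical and not part of the API; only the URI and the `query` parameter come from the description above, and the service may no longer be reachable at that address.

```python
from urllib.parse import quote_plus

# URI taken from the /submit description above.
SUBMIT_URI = "http://api.phylotastic.org/tnrs/submit"


def build_submit_url(names):
    """Return a GET URL for /submit (hypothetical helper, not part of the API).

    The names are joined with newlines and URL-encoded, so spaces become '+'
    and newlines become '%0A', matching the example query shown above.
    """
    query = quote_plus("\n".join(names))
    return f"{SUBMIT_URI}?query={query}"


print(build_submit_url(["Panthera tigris", "Eutamias minimus"]))
```

The returned `token` (or `uri`) from the response would then be used to fetch the results from /retrieve.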
GET /retrieve
Retrieve the resolved names
URI
http://api.phylotastic.org/tnrs/retrieve/<token>
Parameters
- none
Returns
Section | Field | Meaning | Examples |
---|---|---|---|
metadata | jobId | The job ID that was submitted (for asynchronous requests) | "76ca0e9a3ab78e6bc5b4e362c8c40e15" |
metadata | submitDate | Date on which this job was submitted | "Mon Jun 11 20:25:16 2012" |
metadata | sources | An array of all the sources available to our TNRS service (format as shown in the example below) | |
names | submittedName | The name that was submitted for name resolution | "Feeelis tigris" |
names | matchCount | The number of successful matches | 0, 2, 4 |
names | matches | An array containing a list of matches (format as shown in the example below) | |
Example
GET http://api.phylotastic.org/tnrs/retrieve/76ca0e9a3ab78e6bc5b4e362c8c40e15
{
  "metadata": {
    "jobId": "76ca0e9a3ab78e6bc5b4e362c8c40e15",
    "sources": [
      {
        "annotations": {},
        "description": "NCBI Taxonomy",
        "name": "NCBI",
        "publication": "Federhen S. The Taxonomy Project.2002 Oct 9 [Updated 2003 Aug 13]. In: McEntyre J., Ostell J., editors. The NCBI Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US);2002.",
        "rank": 3,
        "sourceId": "NCBI",
        "status": "200: OK",
        "uri": "http://www.ncbi.nlm.nih.gov/taxonomy"
      },
      {
        "annotations": { "Authority": "Author attributed to the accepted name (where applicable)." },
        "description": "The iPlant Collaborative TNRS provides parsing and fuzzy matching for plant taxa.",
        "name": "iPlant Collaborative TNRS v3.0",
        "publication": "The Taxonomic Name Resolution Service; http://tnrs.iplantcollaborative.org; version 3.0.",
        "rank": 2,
        "sourceId": "iPlant TNRS",
        "status": "200: OK",
        "uri": "http://tnrs.iplantcollaborative.org/"
      }
    ],
    "sub_date": "Mon Jun 11 20:25:16 2012"
  },
  "names": [
    {
      "matchCount": 1,
      "matches": [
        { "acceptedName": "Humbertia", "annotations": { "Authority": "Lam." }, "matchedName": "Humbertia", "score": "0.46973019780931", "sourceId": "iPlant TNRS", "uri": "http://www.tropicos.org/Name/40028244" }
      ],
      "submittedName": "Humbert humbert"
    },
    {
      "matchCount": 2,
      "matches": [
        { "acceptedName": "Vitis vinifera", "annotations": { "Authority": "L." }, "matchedName": "Vitis vinifera", "score": "1", "sourceId": "iPlant TNRS", "uri": "http://www.tropicos.org/Name/34000217" },
        { "acceptedName": "Vitis vinifera", "annotations": {}, "matchedName": "Vitis vinifera", "score": "1", "sourceId": "NCBI", "uri": "http://www.ncbi.nlm.nih.gov/taxonomy/29760" }
      ],
      "submittedName": "Vitis vinifera"
    },
    {
      "matchCount": 2,
      "matches": [
        { "acceptedName": "Mangifera indica", "annotations": { "Authority": "L." }, "matchedName": "Mangifera indica", "score": "0.98210117101673", "sourceId": "iPlant TNRS", "uri": "http://www.tropicos.org/Name/1300071" },
        { "acceptedName": "Mangifera indica", "annotations": {}, "matchedName": "Magnifera indica", "score": "1", "sourceId": "NCBI", "uri": "http://www.ncbi.nlm.nih.gov/taxonomy/29780" }
      ],
      "submittedName": "Magnifera indica"
    },
    {
      "matchCount": 1,
      "matches": [
        { "acceptedName": "Euthamia", "annotations": { "Authority": "(Nutt.) Cass." }, "matchedName": "Euthamia", "score": "0.45701346754469", "sourceId": "iPlant TNRS", "uri": "http://www.tropicos.org/Name/40007649" }
      ],
      "submittedName": "Eutamias minimus"
    },
    {
      "matchCount": 2,
      "matches": [
        { "acceptedName": "Megalachne", "annotations": { "Authority": "Steud." }, "matchedName": "Pantathera", "score": "0.47790686999749", "sourceId": "iPlant TNRS", "uri": "http://www.tropicos.org/Name/40015658" },
        { "acceptedName": "Panthera tigris", "annotations": {}, "matchedName": "Panthera tigris", "score": "1", "sourceId": "NCBI", "uri": "http://www.ncbi.nlm.nih.gov/taxonomy/9694" }
      ],
      "submittedName": "Panthera tigris"
    }
  ]
}
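One way a client might consume this response is to keep, for each submitted name, the match with the highest score. The sketch below assumes only the fields visible in the example (`names`, `matches`, `score`, `submittedName`); the function name `best_matches` is hypothetical, and treating the highest score as the "best" match is an illustrative policy, not something the API prescribes. Note that `score` is a string in the example, so it is converted before comparison.

```python
def best_matches(response):
    """Map each submitted name to its highest-scoring match (or None).

    Hypothetical helper: ties are resolved in favour of the match listed first.
    """
    best = {}
    for entry in response.get("names", []):
        matches = entry.get("matches", [])
        if matches:
            # Scores arrive as strings such as "0.98210117101673" or "1".
            best[entry["submittedName"]] = max(matches, key=lambda m: float(m["score"]))
        else:
            best[entry["submittedName"]] = None
    return best


# Trimmed-down response using values from the example above.
sample = {
    "names": [
        {
            "submittedName": "Magnifera indica",
            "matchCount": 2,
            "matches": [
                {"acceptedName": "Mangifera indica", "score": "0.98210117101673", "sourceId": "iPlant TNRS"},
                {"acceptedName": "Mangifera indica", "score": "1", "sourceId": "NCBI"},
            ],
        }
    ]
}

for submitted, match in best_matches(sample).items():
    print(submitted, "->", match["acceptedName"], "via", match["sourceId"])
```

A real client would probably also want to apply a minimum score threshold, since low-scoring fuzzy matches such as "Humbert humbert" to "Humbertia" are returned as well.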
Demo
A demonstration implementation of this API (snappily named tnrastic) was developed by the team at Phylotastic. It consists of a Perl web application, built on the Dancer framework, which handles API requests. We've written adapters for NCBI Taxonomy, iPlant TNRS and ITIS, as well as a hook into NCBI Taxonomy's spelling correction feature.
Adapters
Each TNRS is represented by an adapter, which is an executable (generally either a Perl or Python script).
We use a very simple subset of our main API to communicate with adapters. Each adapter accepts a newline-delimited list of taxa on standard input; on success, it writes JSON to standard output in the following format:
{"names":[{"submittedName":"Eutamias minimus","acceptedName":"Tamias minimus","score":0.5,"matchedName":"Eutamias minimus","annotations":{"TSN":"180195","originalTSN":"180144"},"uri":"http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=180195"}],"status":200,"errorMessage":""}
And the following in case of error:
{"status": 500, "errorMessage": "Could not connect to the server"}
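The protocol above can be sketched as a small Python adapter. This is only an illustration under stated assumptions: the `KNOWN_NAMES` table is a hypothetical stand-in for a real taxonomy source, the fixed score of 1.0 is arbitrary, and the function names are invented for this example. Only the input convention (newline-delimited names on standard input) and the two JSON shapes come from the description above.

```python
import json
import sys

# Hypothetical lookup table standing in for a real taxonomy source.
KNOWN_NAMES = {"Eutamias minimus": "Tamias minimus"}


def resolve(names, known=KNOWN_NAMES):
    """Build the success payload for the names that can be matched."""
    results = []
    for name in names:
        accepted = known.get(name)
        if accepted is None:
            continue  # unmatched names are simply left out in this sketch
        results.append({
            "submittedName": name,
            "matchedName": name,
            "acceptedName": accepted,
            "score": 1.0,  # arbitrary: exact matches only
            "annotations": {},
            "uri": "",
        })
    return {"names": results, "status": 200, "errorMessage": ""}


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read newline-delimited names and write one JSON document."""
    names = [line.strip() for line in stdin if line.strip()]
    try:
        payload = resolve(names)
    except Exception as exc:  # report any failure in the error format
        payload = {"status": 500, "errorMessage": str(exc)}
    json.dump(payload, stdout)

# When used as a script, finish the file with a call to main().
```

Keeping `resolve` separate from the I/O in `main` makes the matching logic easy to exercise without piping data through standard input.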
NCBI Adapter
Uses NCBI taxonomy (http://www.ncbi.nlm.nih.gov/taxonomy) as a TNRS service.
MSW3 Adapter
This adapter is not so much an adapter as a new TNRS service, based on an authoritative taxonomy, the Mammal Species of the World database, version 3 (MSW3) [reference: Don E. Wilson & DeeAnn M. Reeder (editors). 2005. Mammal Species of the World. A Taxonomic and Geographic Reference (3rd ed).] The adapter searches for the input names in a locally stored version of the MSW3 DB and uses heuristics to find the best match for each given name. If a match is found, the accepted name is built from the genus and species of the matched record. Note that we double-checked and confirmed that we have permission to include the MSW3 DB in our services.
Refer to the README file in the MSW3 directory on GitHub for more info.
Demo Client
See phylotastic-tnrs-client for an example command line application that uses the demo service.
Existing TNRSes we can plug into
Please remember that any of these TNRS might have incorrect or outdated data, cross-code synonymies, or any other problem!
- iPlant TNRS: Only plants (via Tropicos/NCBI Taxonomy/USDA Plants/Global Compositae Checklist)
- ITIS: All life, but focuses on North American taxa
- EOL: All life, merges multiple taxonomic trees from different providers
- NCBI Taxonomy: All life
- uBio: All life
- WoRMS: Marine species
- Global Names Index, which contains ~17 million names, and returns lexical groups of similar names and links to sources
- Global Names Recognition service, which identifies things that look like taxon names in a document or webpage
Feature Matrix
Name | Scope | Typos | Common names | Synonyms | Cross ID | Classification | Support scores | Taxonomic parsing | WS info | Notes |
---|---|---|---|---|---|---|---|---|---|---|
EOL | Global | No | Yes | Partial | Yes | Yes | No | ? | ? | |
GNA Resolver (http://resolver.globalnames.org/) | Global | Yes | ? | Yes | ? | Yes | Yes | Yes | http://resolver.globalnames.org/api | "alpha" experimental software; access to NCBI, Catalogue of Life, ITIS, Index Fungorum, GBIF, IPNI, EOL, Union |
iPlant TNRS | Plants | Yes | No | Yes | Yes | Yes | Yes | Yes | http://www.silverbiology.com/products/taxamatch/ ? | Hierarchical search possible |
ITIS | North American organisms | No | No | Partial | ? | Yes | No | ? | http://www.itis.gov/ws_description.html | |
NCBI Taxonomy | Sources of sequence data | (see note) | No | Yes | ? | Yes | No | ? | http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html | Doesn't do spell-check, but matches against a DB with misspellings inherited from sources; contains many taxonomically invalid names |
uBio | Global | No | Yes | Yes | ? | Yes | ? | ? | http://www.ubio.org/index.php?pagename=soap_tools | Accesses ITIS and NCBI data; license (http://www.ubio.org/index.php?pagename=usageTerms, item #7) prohibits further aggregation or proxying; access to services is only for authorized end users via personal keycode |
Required? | Global | Yes | ? | Yes | ? | ? | Yes | ? | NA | Taxonomic parsing might be required for infraspecifics and authors |
Day 1 Discussion
We came up with three alternative API designs, ranging from simple to elaborate. The choice of these strategies has to be coordinated and matched against the core architecture, especially tree storage and retrieval.
Design 1 (simple)
In the simplest scenario, the TNRS returns a list of all known possible valid names for a given (potentially invalid) name. The list of names can be annotated with attributes such as source, associated IDs, status (i.e. whether a name is the canonical name for that species), etc. In this scenario, the burden of figuring out what to do with each name is on the users of the API. We envision that users of the API will search all the mega-trees for each of the returned names; if any of the names matches a name in a mega-tree, that name should be used.
In those cases where a name is associated with multiple species, this API can try to return multiple lists, each corresponding to a different species. However, it is not always possible to (easily) figure out these cases from the output of external TNRS services we are going to use.
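To illustrate how a client could use the list returned under Design 1, here is a minimal sketch that looks for the first candidate name present among a mega-tree's tip labels. The function name and the representation of a tree as a plain set of labels are assumptions made for this example; a real client would work with an actual tree structure.

```python
def first_name_in_tree(candidate_names, tree_tip_labels):
    """Return the first candidate name that labels a tip in the tree, or None.

    `candidate_names` is the list a Design 1 TNRS would return for one query;
    `tree_tip_labels` is a set of tip labels from a stored mega-tree.
    """
    for name in candidate_names:
        if name in tree_tip_labels:
            return name
    return None


# Names taken from the adapter example elsewhere on this page.
tips = {"Tamias minimus", "Panthera tigris"}
print(first_name_in_tree(["Eutamias minimus", "Tamias minimus"], tips))
```

The order of `candidate_names` matters here, so a client that cares about preferring canonical names would need to sort the list by the status attribute first.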
Design 2 (in-between)
In this design, we still have the operation described in the first design. In addition, we return one of the available names for each species as the current name. This does not have to be the correct name for the species (whatever that means), but it has to be consistent. This single consistent name will enable users of the TNRS service to match species across different trees and against user queries. There is a limit to this consistency, however: over time, what we return as the current name can change. This complicates matters for the users of the API. If mega-trees are stored, the stored taxon names could become outdated (out of sync with the current name returned by the TNRS). Possible solutions to this problem are:
- Updating stored mega-trees periodically, so that they are synchronized with current names returned by the TNRS service.
- Every time a new query comes in, we query the current name for all the taxa, updating the changed names in all the stored trees.
Design 3 (elaborate)
In the most elaborate design, we use IDs to formalize entities stored in mega-trees. We will assign one ID to each species stored in our system. Stored trees should label their tips with these IDs (not species names). The TNRS service will include two operations: returning an ID given a (potentially incorrect) name, and returning a currently accepted name for a given ID. In case two species have the same name, the two species should be assigned different IDs and the service should return both IDs. A typical usage of the API will be taking user-provided names, mapping those to IDs, finding those IDs in the stored trees, pruning and grafting, and getting a tree with tips labeled with IDs; the IDs are then turned into currently accepted names, and these are the names shown to the user.
The idea here is that IDs will be associated with species, and hence more stable through time, eliminating the need for frequent update of the stored trees.
How exactly the IDs should be assigned to species still has to be discussed. We considered using existing IDs from sources such as ITIS. This can be achieved by ranking sources, but we have to be careful about whether those IDs stay constant through time. An alternative is generating new IDs internal to Phylotastic (maybe not a good idea?).
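The two operations of Design 3 can be sketched with a small in-memory registry. Everything here is hypothetical: the class name, the method names, and the `itis:` ID scheme are illustrative choices, since the discussion above leaves open how IDs would be assigned. The names and the TSN value are borrowed from the adapter example earlier on this page.

```python
class IdRegistry:
    """Toy registry mapping names to species IDs and IDs to current names."""

    def __init__(self):
        self._ids_by_name = {}   # any known name -> set of species IDs
        self._current_name = {}  # species ID -> currently accepted name

    def register(self, species_id, current_name, other_names=()):
        """Record a species with its accepted name and any synonyms."""
        self._current_name[species_id] = current_name
        for name in (current_name, *other_names):
            self._ids_by_name.setdefault(name, set()).add(species_id)

    def ids_for_name(self, name):
        """Operation 1: return all IDs for a name (several if it is ambiguous)."""
        return sorted(self._ids_by_name.get(name, ()))

    def name_for_id(self, species_id):
        """Operation 2: return the currently accepted name for an ID."""
        return self._current_name[species_id]


registry = IdRegistry()
registry.register("itis:180195", "Tamias minimus", ["Eutamias minimus"])

# User-provided name -> IDs -> (tree operations on IDs) -> names for display.
tip_ids = registry.ids_for_name("Eutamias minimus")
print([registry.name_for_id(i) for i in tip_ids])
```

Because stored trees would hold only IDs, a change of accepted name requires updating the registry alone, which is the stability argument made above.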
General Concerns
No matter which design we choose, there are two concepts that can be implemented on top of our APIs: caching and batching. Caching will permit us to improve performance, especially for the fuzzy match which can be quite slow. Batching permits the user to search for a list of names and get a list of responses in one call.
In addition, we discussed whether our API needs to be synchronous or asynchronous. Our current thinking is that we need to provide two interfaces for each operation, one that is synchronous and does a simple and fast search (without fuzzy matching), and another one that is more thorough and is asynchronous.
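As a rough sketch of the caching and batching ideas, the snippet below wraps a placeholder lookup in an in-memory cache and exposes a batch call on top of it. The function `slow_lookup` is a hypothetical stand-in for an expensive query such as a fuzzy match; its body merely normalizes capitalization so that the example runs on its own. A real service would likely need a persistent or shared cache rather than a per-process one.

```python
from functools import lru_cache


def slow_lookup(name):
    """Placeholder for an expensive lookup (e.g. a fuzzy match against a TNRS)."""
    return name.strip().capitalize()


@lru_cache(maxsize=10_000)
def cached_lookup(name):
    """Caching: repeated queries for the same name skip the slow lookup."""
    return slow_lookup(name)


def resolve_batch(names):
    """Batching: resolve a list of names and return all responses in one call."""
    return {name: cached_lookup(name) for name in names}
```

Calling `resolve_batch(["panthera tigris", "panthera tigris"])` performs the slow lookup once; the second occurrence is served from the cache, which `cached_lookup.cache_info()` can confirm.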
Design discussion
- Return 1 name or multiple names?
- Scores?
- Caching?
- Which TNRS do fuzzy matching?
Questions/notes
- What if we end up renaming the name-string given to us by the user? We need to make sure to have a warning to the user ("Your query 'Panthera tigris' was renamed to 'Leonardo tigris' for this search because of ...").
Galaxy specification for PhyloTNRS
(The following sample XML file is based on http://wiki.g2.bx.psu.edu/Admin/Training/ISMB2010%20Galaxy%20Tutorial:%20Running%20Your%20Own#Tools but see http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax for a full syntax)
<tool id="org.nescent.phylotastic.tnrs" name="Phylotastic TNRS">
  <description>Extracts data from multiple TNRS </description>
  <command interpreter="python">get_flanks.py $input $out_file1 $size $direction $region -o $offset -l ${input.metadata.chromCol},${input.metadata.startCol},${input.metadata.endCol},${input.metadata.strandCol}</command>
  <inputs>
    <param format="interval" name="input" type="data" label="Select data"/>
    <param name="region" type="select" label="Region">
      <option value="whole" selected="true">Whole feature</option>
      <option value="start">Around Start</option>
      <option value="end">Around End</option>
    </param>
    <param name="direction" type="select" label="Location of the flanking region/s">
      <option value="Upstream">Upstream</option>
      <option value="Downstream">Downstream</option>
      <option value="Both">Both</option>
    </param>
    <param name="offset" size="10" type="integer" value="0" label="Offset" help="Use positive values to offset co-ordinates in the direction of transcription and negative values to offset in the opposite direction."/>
    <param name="size" size="10" type="integer" value="50" label="Length of the flanking region(s)" help="Use non-negative value for length"/>
  </inputs>
  <outputs>
    <data format="interval" name="out_file1" metadata_source="input"/>
  </outputs>
  ...
</tool>
Agreements with Data Providers
MSW3
From: "Olaf R.P. Bininda-Emonds" <olaf.bininda@uni-oldenburg.de>
Date: June 22, 2012 6:17:53 AM EDT
To: Arlin Stoltzfus <arlin@umd.edu>
Cc: Kate Jones <kate.e.jones@ucl.ac.uk>
Subject: Re: naming conventions in mammalian supertree
Hi Arlin,
I've now heard back from the primary author who compiled the list, Kate Jones, and she has given permission for you to use it.
An updated, citable version of the list can be found in conjunction with our Ecology paper describing PanTHERIA. Look under the metadata link here: http://esapubs.org/archive/ecol/E090/184/metadata.htm . You'll see that the mappings to MSW05 are also there.
In using the list, please be sure to cite the Ecology paper as well as giving particular credit to Kate Jones and Susanne Fritz, who were the main people who put it together.
Cheers,
Olaf
---
On 20.06.2012, at 22:55, Arlin Stoltzfus wrote:
Olaf--
Do we have your permission to use this file? We would like to use it to provide automated name-mappings that will make it easier to integrate the 4500-species mammal tree with other data. Ultimately we would like to implement a taxonomic name resolution service based on Mammal Species of the World.
I also am curious as to how your table of mappings was derived. One member of our team looked at a database for the current edition of Mammal Species of the World, and the list of synonyms apparently is not as extensive, though additional names can be found embedded in "comments". Does your list represent a value-added version of the information in MSW93?
Arlin