TNRS - Name Cleaner

From Evolutionary Interoperability and Outreach

Synopsis: A two-fold effort: provide a simple "name cleaner" to aid the Tree Annotation group, and put together a spellchecker JavaScript widget.

Name Cleaner

Background

Arlin pitched the idea of a simple tool/app that the Tree Annotation group could use to overcome a problem they ran into while collecting trees and trying to input them.

Overview

Name Cleaner (mr-naims) is a tool for generating a CSV report on the validity of species names in a document. It is written in Python and designed to be omnivorous about the types of files on which it can operate. This is made possible in large part by the [Global Names Discovery Service] API, which accepts PDFs, Office documents, images, or plain text. DendroPy is used to read trees in Newick and NeXML formats.

Source Code

Source code and tool documentation are on GitHub: [mr-naims]

Usage

usage:
 simple.py [options] file-input
 or
 simple.py [options] --file file-input

Options:
  -h, --help            show this help message and exit
  -f FILE, --file=FILE  the file, FILE, read from...
  -s, --skip-gnrd       Do not lookup names at GNRD.  Only valid for a text
                        file or newick tree
  -n, --newick          The file is a newick tree
  -x, --nexml           The file is NeXML
  --source=LIMIT_SOURCE
                        Limit taxosaurus to a single source:
                        [MSW3|iPlant|NCBI]
  --match-threshold=MATCH_SCORE_THRESHOLD
                        the matching score threshold to use, defined as a
                        decimal, all matches equal to or greater will be
                        replaced. The default is 0.9

Progress

Milestones from Day 1 (Tue):

  • Read txt file as list of names, call Taxosaurus for cleaning [milestone].

Milestones from Day 2 (Wed):

  • Accept minimum score, only replace if match exceeds minimum score [milestone]
  • Reading PDF Input and extracting names using GNRD API
  • Initial reporting output (CSV)

Milestones from Day 3 (Thu):

  • Fix UTF-8 issues
  • Catch additional stats from GNRD: occurrence count and location in document
  • Allow limiting to a specific source [milestone]
  • Read Newick tree files [milestone]
  • Investigational NeXML reading via DendroPy [milestone]

Milestones from Day 4 (Fri):

  • Integrated NeXML reading/writing into simple.py [milestone]
  • Allow skipping of GNRD name lookups [milestone]
  • Reporting output (CSV)
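The CSV reporting milestone might look roughly like the sketch below, which uses Python's standard csv module. The column names, the write_report function, and the example rows are assumptions for illustration; the actual report layout produced by mr-naims is not described on this page.

```python
import csv
import io

# Hypothetical column layout; the real report may use different fields.
FIELDS = ["submitted_name", "matched_name", "score", "replaced"]


def write_report(rows, handle, threshold=0.9):
    """Write one CSV row per (submitted, matched, score) tuple."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for submitted, matched, score in rows:
        writer.writerow({
            "submitted_name": submitted,
            "matched_name": matched,
            "score": score,
            # Mirror the threshold rule: replaced when score >= threshold.
            "replaced": "yes" if score >= threshold else "no",
        })


# Illustrative rows only; the scores are invented for the example.
buffer = io.StringIO()
write_report(
    [
        ("Homo sapien", "Homo sapiens", 0.95),
        ("Pan trogladytes", "Pan troglodytes", 0.80),
    ],
    buffer,
)
report = buffer.getvalue()
print(report)
```

Writing to a file-like handle keeps the sketch usable both with a real file opened for writing and with an in-memory buffer, as shown here.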

TNRS Widget

Background: Nirav Merchant mentioned a student project at iPlant trying to create a widget that would help suggest scientific names or provide name resolution within a rich user interface.

User:Gaurav has code in progress at http://github.com/gaurav/species-autocomplete. During the hackathon, you might be able to see a demo at http://128.196.142.68/phylotastic/.

Updates