MIAPA/Demonstration Project

From Evolutionary Interoperability and Outreach

Demonstration Project, Spring 2011

Quick overview

Build a tool that will allow users to create a NeXML file with minimal information to document a phylogenetic analysis.

  1. start by populating CDAO with a rich set of terms from various sources
  2. work out NeXML representation of methods concepts using CDAO terms (also OBI?)
  3. develop a web form that allows users to create annotations, output NeXML
  4. use natural-language workflow descriptions from papers to guide development and testing
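As a rough illustration of steps 2 and 3, the sketch below writes a handful of methods annotations as literal `meta` elements on a NeXML root. The `miapa` prefix, its `example.org` namespace, and the property names are placeholders invented for this example; real term identifiers would come from CDAO (or OBI). The output has not been validated against the NeXML schema, and a complete record would also carry the usual OTU and tree blocks.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
# Hypothetical namespace for methods terms (assumption, not a real vocabulary).
MIAPA = "http://example.org/miapa-terms#"


def build_record(annotations):
    """Return a NeXML string whose root carries one <meta> per annotation."""
    ET.register_namespace("nex", NEX)
    ET.register_namespace("xsi", XSI)
    root = ET.Element(f"{{{NEX}}}nexml", {"version": "0.9"})
    # Declare the prefix used inside the CURIE-style property values.
    root.set("xmlns:miapa", MIAPA)
    for n, (term, value) in enumerate(sorted(annotations.items()), start=1):
        ET.SubElement(root, f"{{{NEX}}}meta", {
            "id": f"meta{n}",
            f"{{{XSI}}}type": "nex:LiteralMeta",
            "property": f"miapa:{term}",  # hypothetical term name
            "content": value,
        })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Illustrative values only, as a web form might collect them.
    print(build_record({
        "alignment_method": "MUSCLE",
        "inference_method": "maximum likelihood",
    }))
```

A web form would only need to collect the term/value pairs and pass them to a function like `build_record`.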

A big-picture strategy that includes this project

under construction

Here is my vision (AS) for the long-term project to develop MIAPA while building support for data re-use. There are two stages focused on support for archiving. The work on re-use depends on the archiving work, to some extent.

  1. Archiving, first stage - Demonstration. We build submission and query tools to show what is possible. The resulting tools may not be very useful to users, but they provide a platform for further work.
    • proof-of-concept based on phenote (Arlin, 2008)
    • a web-based demonstration tool to create an annotated record (Maryam, spring of 2011)
      • loads vocabulary terms for sources and methods
      • provides term-completion based on the loaded vocabularies
      • provides slots for specific types of MIAPA annotations
      • provides support for term requests
      • outputs in NeXML? (ok for Dryad, but not currently supported for input by TreeBASE)
    • a demonstration architecture for annotation and submission (GSOC projects, summer of 2011?)
      • web services protocol for submitting a phylogenetic record to an archive
      • annotation tool with client capacity to submit record
      • modify TreeBASE or Dryad to supply server capacity to receive record
  2. Archiving, second stage - Build up user base. By responding to user needs and adding intelligence, we create a submission tool that is useful both to archives and to users. Meanwhile, we are using this same tool to harvest information on user needs.
    • respond to user needs
      • use MIAPA survey to identify key use-cases and annotation needs (MIAPA survey team, spring 2011)
      • work with users to build annotation support for key use cases
    • lay the foundation for applying intelligent methods (Enrico and Arlin, CREST proposal, 2011)
      • build out a formal ontology for methods annotation
        • include a high-level concept of workflow
      • harvest annotations from submitted records
      • apply NLP methods to harvest methods annotations from publications
    • incorporate intelligence into submission tool
      • extract candidate annotations from Methods text
      • use planning concepts to detect errors and gaps, suggest corrections
  3. Technology to support re-use (Enrico and Arlin, CREST proposal, 2011). The aim of this stage is to develop a system that can compile vague workflow descriptions into executable plans, allowing the user to apply the plan to a custom set of data.
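The term-completion feature listed for the stage 1 demonstration tool could work along the following lines. This is a minimal sketch, assuming the loaded vocabulary is just a flat list of term labels; the class name `TermCompleter` and the sample terms are made up for illustration.

```python
from bisect import bisect_left


class TermCompleter:
    """Case-insensitive prefix completion over a loaded vocabulary."""

    def __init__(self, terms):
        # Keep (lowercased key, original label) pairs sorted for binary search.
        self._entries = sorted((t.lower(), t) for t in set(terms))

    def complete(self, prefix, limit=10):
        key = prefix.lower()
        # A 1-tuple sorts before any 2-tuple sharing its first item, so this
        # finds the first entry whose key is >= the prefix.
        start = bisect_left(self._entries, (key,))
        matches = []
        for low, label in self._entries[start:]:
            if len(matches) == limit or not low.startswith(key):
                break
            matches.append(label)
        return matches


if __name__ == "__main__":
    completer = TermCompleter([
        "maximum likelihood",
        "maximum parsimony",
        "neighbor joining",
        "Bayesian inference",
    ])
    print(completer.complete("max"))
```

Because matching entries are contiguous in the sorted list, each lookup costs one binary search plus the number of suggestions returned, which should be adequate for vocabularies of a few thousand terms.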

Resources

Notes from meetings

August 12, 2011

present: Jim, Maryam, Arlin

Maryam's progress report

  • literature review 
  • ontology hierarchy 
    • didn't work on primitive classes or properties, etc. 
    • want to go back and work on the ontology to make it clearer and more useful
    • need to think about relationships 
  • JLM: importance of phyloways list
  • discussion of project plans and priorities
    • MP: not sufficient (for CS degree) to design a data submission form; want to follow established practices for ontology development
    • AS: problems in CDAO; how important is it to resolve upper ontology?
    • main focus of project is not to develop ontology, but to support users to create meaningful (re-usable) records
    • this involves knowledge discovery and engineering, not just applications programming

May 20, 2011

present: Jim, Maryam, Enrico, Rutger, Arlin

discussed case, and how to handle other cases.

had to use Skype; experienced major problems with it.

April 15, 2011

present: Jim, Maryam, Enrico, Brandon, Arlin

March 18, 2011

present: Arlin, Maryam, Jim, Eric, Rutger

1. review of demo project

  • inference methods
    • term list from TreeBASE
    • Joe Felsenstein's "Inferring Phylogenies"
      • get an e-copy from Joe? Arlin will do this
    • search for papers. Arlin will do this
  • review project plan -- see 4 March notes

2. iEvoBio presentations

  • possible lightning talk on Maryam's demo project
  • full talk deadline is next week
    • one talk on MIAPA & related projects (Jim leads)
    • another talk on tree-publishing practices (Arlin leads)

March 4, 2011

present: Arlin, Jim, Maryam, Enrico, Vivek

Agenda: sort out project ideas for spring (Maryam) and summer (GSOC)

  1. Maryam's project
    • start by populating CDAO with terms
    • work out NeXML representation of methods concepts using CDAO terms (also OBI?)
    • work on submission form to make NeXML file
    • use papers from prior literature to harvest natural-language workflow descriptions
  2. possible GSOC projects
    • graphical UI for constructing workflow descriptions
      • see http://exon.niaid.nih.gov/mobyleWorkflow/
      • Vivek is willing to mentor
      • successful applicant knows Java, ideally GWT (Google Web Toolkit) and Jena
      • phylogeny experience not necessary
      • use library of papers from previous project
      • feedback via informal user testing, comment box
      • open issue: integrate with existing codebase (Mesquite? TreeBASE?)
    • implement NeXML submission in TreeBASE
      • Rutger agreed to be co-mentor
    • develop web services protocol for phylo record submission
      • maybe preconditions will not be met by this summer
    • NLP analysis of methods sections of papers
      • ratio of analysis to programming is too high for a GSOC

Action items:

  • Vivek, Enrico & Arlin write GSOC description by Mar 11
  • Arlin to try to find a co-mentor for the NeXML submission project by Mar 11
  • put library of papers issue on agenda for next meeting
  • Jim to let Eric know what's happening

February 25, 2011

present: Arlin, Jim, Maryam (10:15)

  • deliverables
    • search interface for TreeBASE
    • submission interface (annotation) for TreeBASE

sub-searches based on methods annotation

  • hierarchy, term-completion
  • distribution of trees by method
  • download linked pubs and collect matching terms to test completion?
  • developer access to TreeBASE code
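A sub-search that uses the term hierarchy, together with the distribution of trees by method, might look like the sketch below. The parent links, tree identifiers, and method labels are all hypothetical; an actual implementation would read the hierarchy from the ontology and the annotations from TreeBASE records.

```python
from collections import Counter

# Hypothetical child -> parent links in a methods hierarchy.
PARENT = {
    "maximum likelihood": "character-based method",
    "maximum parsimony": "character-based method",
    "neighbor joining": "distance-based method",
    "character-based method": "inference method",
    "distance-based method": "inference method",
}


def self_and_ancestors(term):
    """Return the term followed by each of its ancestors up to the root."""
    chain = [term]
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def search_by_method(trees, query):
    """Return ids of trees annotated with the query term or any descendant."""
    return [tid for tid, method in trees if query in self_and_ancestors(method)]


def distribution(trees):
    """Count trees per annotated method."""
    return Counter(method for _, method in trees)


if __name__ == "__main__":
    trees = [  # (tree id, method annotation), made up for the example
        ("T1", "maximum likelihood"),
        ("T2", "neighbor joining"),
        ("T3", "maximum parsimony"),
        ("T4", "maximum likelihood"),
    ]
    print(search_by_method(trees, "character-based method"))
    print(distribution(trees).most_common())
```

Searching on a broad term returns every tree annotated with a narrower one, which is the main payoff of recording methods with hierarchical terms rather than free text.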

standalone search tool (GSOC)?

  • web services API

Submission tool

  • making it easy for user
    • recognize source data
    • paste methods section, match terms, supply to user
    • start with templates from existing TreeBASE entries
    • following methods from a previous publication
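The "paste methods section, match terms, supply to user" idea can be sketched with plain pattern matching, as below. This is only a first approximation under the assumption that vocabulary terms appear more or less verbatim in the text; the function name `match_terms` and the sample sentence are invented for illustration, and real Methods text would likely need synonym handling.

```python
import re


def match_terms(methods_text, vocabulary):
    """Return vocabulary terms found in the text, in order of first appearance."""
    hits = []
    for term in vocabulary:
        # Whole-word, case-insensitive match; the words of a multi-word term
        # may be separated by any run of whitespace or hyphens.
        words = [re.escape(word) for word in term.split()]
        pattern = r"\b" + r"[\s-]+".join(words) + r"\b"
        found = re.search(pattern, methods_text, flags=re.IGNORECASE)
        if found:
            hits.append((found.start(), term))
    return [term for _, term in sorted(hits)]


if __name__ == "__main__":
    text = (
        "Sequences were aligned with ClustalW and trees were inferred by Maximum\n"
        "Likelihood with 100 bootstrap replicates."
    )
    vocabulary = ["neighbor joining", "maximum likelihood", "bootstrap", "ClustalW"]
    # Candidate annotations that the form could offer back to the user.
    print(match_terms(text, vocabulary))
```

The matched terms would be offered as suggestions for the user to confirm, not written into the record automatically.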

Standalone tool (GSOC?)

  • create NeXML file
  • TB NeXML upload (basic)
  • TB NeXML process methods annotations into text statement

February 18, 2011

present: Arlin, Jim L-M, Maryam (10:20?)

Context:

  • Maryam available until mid-May
  • project outcome could support ABI proposal in July
  • could coordinate with possible GSOC proposal

discussion about ontology development. 2 mistaken presumptions

  • encoding domain knowledge of experts is enough (wrong: experts literally don't know what they are talking about when it comes to key philosophical distinctions)
  • proper ontology has only context-independent universals (wrong in practice; just creates an elaborate system of pseudo-universals)

driving biological problem or use-case

  • pre-condition: all those trees out there
  • 1. estimate species tree by combining gene trees (systematics use case)
  • 2. identify orthologs or duplication histories using gene tree (mol biology use case)

so, let's imagine a user scenario

  • pre-requisite: list of 8 species, user wants species tree with these, possibly some others
  • user searches resources with list, gets hits
    • subcase1: finds a species tree with all 8 species
      • user may wish to prune if there are too many other species
      • user is done
    • subcase2: finds gene trees with all 8 species
      • user may wish to select "best" tree
      • user needs to run reconciliation software
    • subcase3: finds a set of trees with overlapping sub-sets of species (e.g., ABCDE, CDEFG, EFGHI)
      • this case calls for supertree construction
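The three subcases can be told apart mechanically once each hit carries its taxon set and a species-tree or gene-tree label. The sketch below assumes a hypothetical hit format of `(tree_id, kind, taxa)`; it checks only that the partial trees jointly cover the query list, not that they overlap enough for a supertree method to succeed.

```python
def classify_hits(query, hits):
    """Classify search hits for a species list.

    query: iterable of species names.
    hits: list of (tree_id, kind, taxa), where kind is "species" or "gene".
    Returns (subcase label, ids of the relevant trees).
    """
    query = set(query)

    # Subcase 1: a species tree already contains every requested species.
    full_species = [tid for tid, kind, taxa in hits
                    if kind == "species" and query <= set(taxa)]
    if full_species:
        return "subcase1", full_species

    # Subcase 2: gene trees contain every requested species (needs reconciliation).
    full_gene = [tid for tid, kind, taxa in hits
                 if kind == "gene" and query <= set(taxa)]
    if full_gene:
        return "subcase2", full_gene

    # Subcase 3: several trees with partial overlap jointly cover the list.
    partial, covered = [], set()
    for tid, _, taxa in hits:
        overlap = query & set(taxa)
        if overlap:
            partial.append(tid)
            covered |= overlap
    if covered == query:
        return "subcase3", partial
    return "no match", []


if __name__ == "__main__":
    wanted = set("ABCDEFGH")  # eight species, written as single letters
    hits = [
        ("t1", "species", set("ABCDE")),
        ("t2", "species", set("CDEFG")),
        ("t3", "species", set("EFGHI")),
    ]
    print(classify_hits(wanted, hits))
```

Pruning extra species (subcase 1) and choosing a "best" gene tree (subcase 2) are left out; they would be follow-up steps on the returned tree ids.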

but (Jim says), we don't want to get bogged down in reconciliation

but (Arlin says), it may be sufficient (for demonstration purposes) to offer the user

  • the right input trees for reconciliation
  • a canned workflow description for reconciliation
  • a third-party service that will execute the workflow description on the input trees

ok, we decide to pursue a simpler scenario

  • pre-requisite: list of 1 gene, E. coli CAP
  • user searches for tree with target gene, gets hits
    • user chooses by criteria (method, bootstraps, etc.)
    • user is done

The above could be done on a resource that aggregates from other resources (TreeBASE, Pandit, TreeFam, etc.). However, an even simpler use-case would be just to provide an interface to whatever useful information is in TreeBASE.

That's where we ran out of time. Next meeting: Friday, Feb 25, 10:00 am EST.