HIPTargetsAndIdeasList
From Evoio
PhyloSOC ideas
Toward BabelPhysh 1.0: universal tree-handling
See the BabelPhysh idea.
This project aims to build up the back-end functionality for BabelPhysh, focusing only on trees (and ignoring any other kind of data for now). The goal is to develop a set of operations that collectively allow any tree to be read, to be displayed, and to be translated to any other tree format (to the extent that this is possible without loss of information).
The operations might be distributed among several toolboxes or environments (DendroPy, R, Bio::NEXUS, etc) or concentrated into a single toolbox. Multiple implementations are desirable but are not necessary at this stage.
The project is simplified by the fact that the majority of trees out there are Newick trees. Nevertheless, not all Newick trees are the same, and not all Newick parsers are the same. In addition, there are NEXUS files, NHX files, and other types of trees.
To define the project more fully, we need to identify a set of current barriers to overcome, or operations to support, such as the following:
- display. We would like to be able to pipe any tree to a viewer that can show it. Some tree-viewers cannot display some trees, because
- renaming. Safely change all the names in the tree without corrupting it.
- simplification. Remove grammatical features that cause incompatibilities with some implementations of Newick parsers, e.g., NHX comments.
- format translation. Translate from Newick to NEXUS and back. Translate from NHX to simple Newick.
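The renaming, simplification and translation operations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a proposed implementation: the function names (simplify, rename, to_nexus) are hypothetical, bracketed comments are assumed not to be nested, and only the Newick-to-NEXUS direction is shown.

```python
import re

# One token per match: a bracketed comment, a branch length, a label
# (quoted or bare), or any other single character such as "(" or ",".
# Assumption: comments are never nested.
TOKEN = re.compile(
    r"(?P<comment>\[[^\]]*\])"
    r"|(?P<length>:[^,()\[\];]*)"
    r"|(?P<label>'(?:[^']|'')*'|[^\s()\[\],:;']+)"
    r"|(?P<other>.)",
    re.DOTALL,
)


def simplify(newick):
    """Drop every bracketed comment (NHX blocks included)."""
    return "".join(
        m.group(0) for m in TOKEN.finditer(newick) if m.lastgroup != "comment"
    )


def rename(newick, mapping):
    """Replace whole labels found in mapping.

    Comments and branch lengths are copied through untouched, and a label
    is only replaced when it matches a key exactly, so renaming "A" does
    not alter "AB". New names must themselves be valid Newick labels.
    """
    parts = []
    for m in TOKEN.finditer(newick):
        text = m.group(0)
        if m.lastgroup == "label":
            text = mapping.get(text, text)
        parts.append(text)
    return "".join(parts)


def to_nexus(newick, tree_name="tree1"):
    """Wrap a single Newick string in a minimal NEXUS TREES block."""
    return f"#NEXUS\nBEGIN TREES;\n    TREE {tree_name} = {newick.strip()}\nEND;\n"


nhx = "(A:0.1[&&NHX:S=human],AB:0.2,(C,A1)90:0.3);"
plain = simplify(nhx)
renamed = rename(nhx, {"A": "Homo_sapiens"})
```

Working on tokens rather than doing a blind search-and-replace is what keeps renaming "safe" in this sketch: a name that is a substring of another name, or that appears inside a comment, is left alone.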
another phylosoc idea
quick-and-dirty suggestions
phylogeny annotation: visual metaphors and language support
This idea has been discussed by the HIP LT to some extent. It has two goals.
- Visualization goal Extending the visual language of phylogeny representation, i.e., how to represent phylogeny concepts graphically
- Annotation support goal Develop a common language for marking up serializations of phylogenies.
background. Currently we all know how to "read" a phylogeny visually in terms of recognizing the nodes and edges. Some additional conventions for marking up phylogenies graphically are common. For instance, many tree viewers will read square-bracketed numbers next to nodes in a Newick string, and will render these numbers next to the node. The user will then interpret this visualization in terms of a phylogeny concept, which is the degree of support (e.g., bootstrap support, Bayesian posterior) for the node.
But visualizations of phylogenies could say much more, if we had a visual language, and a way to support that in portable annotations. Given common predicates and objects for sharing annotations across APIs and serialization formats, R (for instance) would be able to know what to set tip.color and edge.color to based on what it reads from a file, and it would be able to write that out in a way that FigTree understands.
Here is an example that Rutger suggests for specifying a visual motif and its representation as data:
- predicate: isextinct
- values: 0/1
- meaning: indicates whether a taxon is extinct
- icon: http://tolweb.org/tree/icons/extinct.gif (or similar)
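As a rough sketch of how a tool might consume such a predicate, the example above can be held as a small data record and mapped to tip colors. The class name VisualPredicate, the function tip_colors and the two colors are assumptions made for illustration; only the four fields and their values come from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualPredicate:
    """One visual motif: a predicate, its allowed values, and its icon."""
    name: str
    values: tuple
    meaning: str
    icon: str


IS_EXTINCT = VisualPredicate(
    name="isextinct",
    values=(0, 1),
    meaning="indicates whether a taxon is extinct",
    icon="http://tolweb.org/tree/icons/extinct.gif",  # "or similar", per the example
)


def tip_colors(annotations, extinct="grey", extant="black"):
    """Map each taxon's isextinct value to a tip color (colors are arbitrary)."""
    colors = {}
    for taxon, value in annotations.items():
        if value not in IS_EXTINCT.values:
            raise ValueError(f"{taxon}: {value!r} is not a valid isextinct value")
        colors[taxon] = extinct if value == 1 else extant
    return colors


colors = tip_colors({"Tyrannosaurus_rex": 1, "Gallus_gallus": 0})
```

A viewer that understands the same predicate could equally choose to draw the icon instead of changing the color; the point is that the data carries the meaning and the rendering is left to the tool.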
NeXML round-tripping
Name reconciliation
the reconcile-tree problem
one hackathon, half a dozen parts, they all fit together to build a new more interoperable framework
- Phylomatic-style capacity to supply species tree
- PhyloWS delivery of species tree to client app
- Reconcile-tree client app implements phyloWS query
- Service to read input gene tree, get list of species
- Reconcile-tree app implements NeXML output
phylo-aware interfaces to collections
social bookmarking of trees
- On November 9, 2011, Arlin suggested integrating a Phylomatic-like interface with TreeBASE and Tree-of-Life Web (ToLWeb), with the goal of eventually providing the entire Tree-of-Life through such an interface. In Arlin's words, Phylomatic is a "heavily used package that allows plant researchers to use grafting and pruning operations to create custom species trees from a mega-tree of plants." Thus, the goal would be to provide a way for researchers to prune the entire tree-of-life down to a set of terminal taxa of interest, say for the purposes of running a phylogenetic comparative analysis. Heady stuff!
- Brian raised a concern about which tree(s) would be included in the interface, since we're far from having consensus on the relationships among all living organisms. One occasional failing of the current Tree-of-Life pages on the web is that they're updated infrequently and don't always reflect controversy in relationships. Where disagreement occurs, sometimes only one potential arrangement is presented, perhaps the one preferred by the experts who have been tapped to curate that particular page. The general discussion that followed favored an approach in which users can select from many offered trees. One trouble potentially results from that choice: many users of the interface will not be experts in phylogenetics, and thus will be ill-equipped to determine which tree is the most accurate or reliable.
- Social bookmarking of trees may offer the best solution to this problem, because it would allow the community to update the consensus tree as ideas change and new results become available. If users can vote for (like?) trees that are well supported and reliable, and annotate the reasons behind their votes, then the overall community can come to a consensus without placing that onus upon any small group of people, who might exhibit bias or lack the expertise to evaluate all trees. After all, millions of organisms exist, and no one is expert in all.
- Arlin indicated that there are already several large trees available (Bininda-Emonds or Venditti, 2011 for mammals). "A hackathon-scale project would be to collect a dozen of those trees and get them into the back end of phylomatic. Making those trees available, phylomatically, would be a big plus"
- Karen pointed out that "In a recently-submitted AVAToL grant, we are proposing comprehensive phylogenetic synthesis using both merge-and-graft operations and also analytical approaches (supertrees, species trees, etc). If we get funded, there will be a draft tree of life with Phylomatic-type subtree extraction within the next 18 months." If that gets funded, there might be no need for HIP to work on that issue. That said, Karen's comment doesn't make it clear whether community-sourcing would be a portion of their vision.
- Hilmar pointed out that data quality assessment is actually in-scope for more than just phylogenies; one could imagine social commentary tools being integrated with Dryad or other data repositories. This is apparently part of the proposal for the TreeBASE / ToLWeb grant.
- It may be easier to get buy-in to social bookmarking of trees if we integrate with an extant platform, such as Google+. People are a lot more likely to interact with something that appears as part of their daily routine, rather than something that requires them to go to a special website and login with yet another password.
- Enrico pointed out that "There is a nice community-based ranking mechanism proposed for gene annotations through pubmed ([1])" and asked whether something similar could be done through Google+, pointing to phylogenies in one or more repositories.
- Rutger mentioned that "if we can get the rating platform into an RSS feed it can be integrated into friendfeed, facebook and twitter". His "Phylofoundation" ([2]) tweets new studies in TreeBASE. If that can be wrapped into some sort of simple presentation allowing people to rate the study that would be something that might work.
- Ultimately it would be nice to integrate social rankings with some sort of search engine for trees. Brian mentioned that he'd love to see a database search that could provide the most recent, highest-rated and/or most popular phylogeny for taxon X (where X is above the species level), or one that includes both species Y and Z, in a computable, transferable, free and easily downloadable format.
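The Phylomatic-style pruning operation mentioned at the top of this list (cutting a mega-tree down to a set of terminal taxa of interest) can be illustrated with a toy example. This is a minimal sketch, not how Phylomatic itself works: the tree is assumed to be held as nested Python tuples with strings as tips, branch lengths are ignored, and the function name prune is hypothetical.

```python
def prune(tree, keep):
    """Return the subtree induced by the taxa in keep, or None if none remain."""
    if isinstance(tree, str):  # a tip
        return tree if tree in keep else None
    kept = []
    for child in tree:
        pruned = prune(child, keep)
        if pruned is not None:
            kept.append(pruned)
    if not kept:
        return None
    if len(kept) == 1:  # collapse an internal node left with a single child
        return kept[0]
    return tuple(kept)


mega_tree = (("human", "chimp"), ("mouse", "rat"), "platypus")
subtree = prune(mega_tree, {"human", "mouse", "rat"})
```

Here subtree keeps the grouping of mouse with rat and attaches human as their sister, which is the relationship the mega-tree implies for those three taxa.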
roughed-out ideas for hackathon targets
PhyloHow
PhyloHow - Developing and disseminating technical know-how in informatics, analysis, and data management using participant-generated instructional videos (and other how-to artefacts). A vision of success would be that an end-user searches online for guidance on phyloinformatics, and finds hundreds of useful instructional presentations on YouTube and SlideShare.
problem The problem is to develop and disseminate useful technical guidance on how to solve basic informatics and data-management problems faced by our user community, in a context in which most users don't know about most tools, and technology changes on a yearly basis.
background Users in the phylogenetics community frequently face challenges in data management and analysis. Of interest is the subset of challenges for which there is a freely available technical solution-- involving some combination of installed software, online services, and manual steps-- that is not widely used due to lack of awareness and training. Examples might include:
- making a really nice figure with an annotated phylogeny when no tree-drawing tool seems to do it all
- re-naming all the OTUs in one or more files
- maintaining provenance information through a typical phylogenetics workflow
- submitting the record of a phylogeny report to TreeBASE
- combining an alignment and tree from two separate sources into a single NEXUS or NeXML file
- translating any alignment or character data into NEXUS or NeXML
Users facing such challenges, without any outside source of assistance or training, may muddle through the process on their own, or simply give up. In phyloinformatics there is an enormous gap between what expert users can do using available tools, and what an ordinary user is likely to learn when she goes out in search of practical ways to address a pending challenge. The result is not just inefficiency, but many ad hoc manual solutions to the same problem-- solutions that typically do not make use of available standards, and thus contribute to a tower-of-babel effect.
In the past, we and others habitually looked on this as a problem of developing more convenient interfaces (especially GUIs) to specific operations, or wrapping a workflow of operations in a single tool to make things more convenient for users. There are factors that tend to make these strategies unlikely to succeed. We have not given enough attention to a third approach, which is simply to do a better job providing instruction to users, raising their general level of competence. In a recent analysis of best practices for sharing phylogenetic data (Stoltzfus, et al, in prep), the lack of "How-To" documents and other instructional resources was identified as a major impediment to achieving interoperability.
approach Let us imagine a project with 3 parts (not sure whether forum or docu-hackathon-- or something else-- comes first):
- a docu-hackathon
- the goal is to work out technical solutions, then document them with instructional videos and other how-to artefacts
- participants include end-users, techno-geeks, and technical writers
- organizers will provide instruction on screen-recording and other technologies
- participants may work alone or in small teams
- while we want teams to target technical problems that a team member has solved already using available tools, we can't ignore the voice of experience, which tells us that attempting to generalize on one's own limited experience usually reveals bugs and limitations that need to be worked through and understood
- an online discussion forum (phylohow.org is available) in which
- user queries are welcomed
- gurus are lurking
- how-to artefacts are reviewed and discussed
- requests for new how-to artefacts are discussed, and a target list is maintained
- note: we don't need a special forum for sharing artefacts-- this can be done on YouTube, SlideShare, etc.,
- but we do need a system that makes it easy for naive users to discover resources via Google (e.g., everything gets tagged with 'phylohow'); it's important that users can discover educational resources without being part of an ingroup.
- Crowd-sourcing (grad-sourcing?) documentation through open contests
- tangible prizes
- can come from sponsors, e.g., maybe SSE can offer travel to the next conference
- target can be open, or chosen from a list of problems such as that given above (background)
- rules
- all material is used with permission
- only open-source or free-for-academic-use tools
- sign over rights or adopt CC license
- Entries would be judged for
- relevance (how important is the problem, how many users are implicated)
- effectiveness (will users learn effective skills from the artefact).
a scoping comment Because we are interested in informatics and interoperability here, we are not mainly interested in instructional materials on how to run an analysis within a particular existing software package, e.g., how to do a standard likelihood analysis of your CytC sequences in PAUP*. However, imagine that this project has succeeded, and we have shown the phylogenetics community that it is possible to use open contests to stimulate the production of effective instructional materials for a targeted problem. In effect, this greatly brings down the cost of documentation for software developers. All they need to do is to take some of their grant money and offer it as prize money for instructional videos, within the context of one of our sponsored contests.
links
- http://www.emergingedtech.com/2010/01/creating-brief-instructional-videos-and-more-with-jing/
- http://www.debugmode.com/wink/
- http://www.webresourcesdepot.com/10-free-screen-recording-softwares-for-creating-attractive-screencasts/
BabelPhysh
problem: incompatible file formats.
- apps allow limited set of input file types; user's data may be in different format
- apps produce limited output, including aberrant formats
approach: develop BabelPhysh v 0.1, code-named 'The Federal', a universal translator for alignment and phylogeny files, with a web-services API. Doing this right requires a serious investment of time from logic programmers, and we can't do that in the space of a hackathon. Therefore, although we are going to be disciplined about the interfaces, the back-end guts of version 0.1 are going to be hacked from whatever capacities we can leverage from BioPerl, Bio::Phylo, DendroPy, Mesquite, etc. In other words, the engine of BabelPhysh 0.1 will be a black box full of programming junk. Once it is done (i.e., after the hackathon, as a separate out-of-scope problem), we can challenge computer scientists to re-engineer the black box according to the interface spec.
- phyloWS interface
- vocabulary of file types and sub-types A, B1, B2, B3 (e.g., Newick flavors), C . . . .
- 'whatever works' approach to translation sub-services (i.e., binary translation problems A --> C, B1 --> B2, etc)
- broker to send different types of requests to different services
- note that the broker is a layer of abstraction that allows flexibility in back-end services
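The broker idea above might look something like the following. This is a rough sketch under stated assumptions rather than a design: the Broker class, its method names and the format names ("nhx", "newick", "nexus") are all hypothetical, the two toy sub-services stand in for real back ends, and chaining several binary translations together is an assumption that the list above does not spell out.

```python
from collections import deque
import re


class Broker:
    """Routes a translation request to one or more registered sub-services."""

    def __init__(self):
        # (src, dest) -> function that takes text and returns translated text
        self._services = {}

    def register(self, src, dest, service):
        self._services[(src, dest)] = service

    def route(self, src, dest):
        """Breadth-first search for the shortest chain of sub-services."""
        queue = deque([(src, [])])
        seen = {src}
        while queue:
            fmt, path = queue.popleft()
            if fmt == dest:
                return path
            for (a, b) in self._services:
                if a == fmt and b not in seen:
                    seen.add(b)
                    queue.append((b, path + [(a, b)]))
        return None

    def translate(self, text, src, dest):
        path = self.route(src, dest)
        if path is None:
            raise LookupError(f"error: no translation from {src} to {dest}")
        for step in path:  # an empty path means the translation is moot
            text = self._services[step](text)
        return text


def nhx_to_newick(text):
    """Toy sub-service: strip bracketed comments (assumes no nesting)."""
    return re.sub(r"\[[^\]]*\]", "", text)


def newick_to_nexus(text):
    """Toy sub-service: wrap one Newick string in a NEXUS TREES block."""
    return "#NEXUS\nBEGIN TREES;\n    TREE tree1 = " + text.strip() + "\nEND;\n"


broker = Broker()
broker.register("nhx", "newick", nhx_to_newick)
broker.register("newick", "nexus", newick_to_nexus)
result = broker.translate("(A[&&NHX:S=x],B);", "nhx", "nexus")
```

Because clients only ever talk to the broker, a 'whatever works' sub-service can later be swapped for a better one by registering a different function under the same (source, destination) pair.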
pre-requisites completed in advance for a successful hackathon
- preliminary controlled vocabulary of supported file types and sub-types (30 min effort)
- relatively complete phyloWS spec for the translation problem (> a few hours effort)
- syntax to specify source and destination formats
- verbosity level for messages
- standard messages (error: unclear src or dest, malformed src; warning: translation necessitates data loss, translation is moot)
- token implementation of test platform, e.g., based only on Bio::Phylo-supported translations (? depends on programmer)
- preliminary set of test cases representing translation sub-problems (a few hours)
sub-projects
- live test platform with web forms test interface
- documentation integrated into test interface
- broker - handles direct interaction with client, negotiates with translation sub-services
- prognosticator - guesses unspecified source format
- translation sub-services, divided by backend
- BioPerl for all major alignment formats
- Bio::Phylo for NEXUS to NeXML
- XSLT for NeXML to CDAO
- ?? for eNewick to Newick
- and so on
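For the prognosticator sub-project, a first cut could simply inspect the start of the input. The sketch below is a heuristic under stated assumptions: the function name guess_format and the returned format names are hypothetical stand-ins for whatever the controlled vocabulary ends up using, and only the tree formats named on this page are covered.

```python
from typing import Optional


def guess_format(text: str) -> Optional[str]:
    """Guess the format of a tree file from its content, or return None."""
    body = text.strip()
    if body[:6].upper() == "#NEXUS":  # NEXUS files begin with this token
        return "nexus"
    if body.startswith("<"):  # XML of some kind; look for a NeXML marker
        return "nexml" if "nexml" in body[:2000].lower() else None
    if body.startswith("(") and body.endswith(";"):
        # Newick-like; an NHX comment block marks the extended flavor
        return "nhx" if "[&&NHX" in body else "newick"
    return None  # unclear source format: let the broker report an error


guesses = [
    guess_format("#NEXUS\nBEGIN TREES;\nEND;"),
    guess_format("(A,B);"),
    guess_format("(A[&&NHX:S=x],B);"),
    guess_format('<nex:nexml version="0.9"/>'),
    guess_format("hello"),
]
```

Returning None rather than a best guess keeps the prognosticator honest: the broker can then issue the "unclear src" error instead of handing malformed input to a sub-service.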
players to recruit
- user testers with capacity to create file-type test cases as needed, interpret errors and problems
- including test-master to aggregate test conditions and results in a central resource
- documentation writers
- test platform architect - works with interface developer and sub-service implementers
- web interface developer for test platform (implements pull-downs, help, message-display, etc)
- back-end developers to develop translation sub-services
- phyloWS developers
- controlled vocabulary developers (format types, sub-types, other?)
- sub-service implementers who install back-end capacities on the test platform and code PhyloWS interfaces to them
Generating MIAPA-checklist-compliant reports from end-user phylogeny apps
problem: users generate alignments and phylogenies using software, then store or archive these results in a form that is not very re-usable, due to lack of metadata about sources and methods. To solve this problem we need a reporting standard as well as tools to generate and consume compliant records. Submission could be negotiated directly with TB, but a more modular way to develop would be to focus on producing NeXML output that could be consumed secondarily by any resource.
approach
- We don't have a MIAPA yet, but at least we have a preliminary checklist.
- First step is to build language support for the checklist into MIAPA.
- possible NeXML schema additions so that the annotations go into the right slots in NeXML classes.
- build up language support, e.g., in CDAO, so that NeXML can reference it
- We have discussed the possibility of creating well-annotated archivable records directly from a phylogeny inference app such as MEGA, PAUP*, MrBayes, RAxML and so on. This makes it easier on the typical user. Sudhir Kumar (MEGA) has resources to devote to this. We would have to carve this off from the main MEGA codebase, which includes proprietary code (not OS, thus not supported by NESCent).
interactive naming and reconciliation tool
problem: many users have files with matching entities that have non-matching names, e.g., names in the tree do not match names in the alignment.
- these inconsistent names are a barrier to archiving in TreeBASE
- they are a barrier to re-use (someone else has to figure out the mapping of names)
approach a problem description and a crude interface design are presented here: http://dl.dropbox.com/u/7727158/name_matching.pptx
- but the above description doesn't say anything about architecture
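As one possible starting point for the matching step, the sketch below proposes a mapping between tree names and alignment names using approximate string matching from Python's standard difflib module. It is an illustration only and is not taken from the linked design: the function names, the normalization rule (underscores treated as spaces, case ignored) and the 0.8 similarity cutoff are all assumptions, and an interactive tool would still ask the user to confirm each proposed match.

```python
import difflib


def normalize(name):
    """Treat underscores as spaces and ignore case and outer whitespace."""
    return name.replace("_", " ").strip().lower()


def match_names(tree_names, alignment_names, cutoff=0.8):
    """Propose, for each tree name, the closest alignment name (or None)."""
    lookup = {normalize(n): n for n in alignment_names}
    proposals = {}
    for name in tree_names:
        close = difflib.get_close_matches(
            normalize(name), list(lookup), n=1, cutoff=cutoff
        )
        proposals[name] = lookup[close[0]] if close else None
    return proposals


proposals = match_names(
    ["Homo_sapiens", "Pan troglodytes", "Mus_musculus"],
    ["homo sapiens", "Pan_troglodites", "Rattus norvegicus"],
)
```

In this example the first two names are paired despite differences in case, underscores and spelling, while Mus_musculus has no counterpart and is left unmatched for the user to resolve.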
Phylotastic
This idea has been moved to its own page.