From Evolutionary Interoperability and Outreach

Ideas from Phylotastic2

Phylotastic reference guide

Professional incentive (read: ability to cite, to measure impact) is important; otherwise documentation is rarely maintained and becomes stale. Alternatives being considered:

  • Consider the model of "Topic Pages" in PLOS Comp Biol, which upon publication became Wikipedia articles and are further maintained there. Example: Approximate Bayesian Computation, and corresponding PLOS CB article.
  • Collaborative authoring of an eBook, for example on github using Markdown format (or its extensions implemented in Pandoc).

community science strategy

There are many ways that individual scientists can get involved in making phylotastic better:

  • submitting trees to a treestore
  • submitting calibrated trees to DateLife
  • bookmarking or reviewing a tree for quality
  • providing feedback on a service (speed, quality, convenience)

How is that going to happen? When do we start engaging people? Who are our partners? Do we leave the tree submission part to the OToL project?

Phylotastic Alpha (integration challenge)

One proposal we came up with on Wednesday was to push on to the features we wanted on Phylotastic Alpha -- a well-documented, end-to-end system. End-to-end means that we start with a query that the end-user can construct, and end with a result that the end-user can employ, without requiring the user to have any special tools other than what phylotastic provides.

Minimal steps (?) in response to user submitting query consisting of list of species names

  1. system cleans up names
    • sends list to TNRS
    • parses result from TNRS
    • imposes rule to choose matches
    • records metadata on the matches that are used
  2. system finds tree with best coverage (or other desired features)
  3. system executes pruning and grafting with tree
  4. system scales tree using DateLife, if possible (animals only?)
  5. system returns scaled tree with metadata to user
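
The steps above can be sketched as a small pipeline. This is only an illustration under loud assumptions: `FAKE_TNRS`, `SOURCE_TREES`, `clean_names`, `best_coverage` and `run` are hypothetical names, the TNRS response and source trees are in-memory stand-ins rather than real services, and the pruning/grafting and DateLife scaling steps are left as placeholders.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a TNRS response:
# submitted name -> list of (matched name, score).
FAKE_TNRS = {
    "Homo sapiens": [("Homo sapiens", 1.0)],
    "Mus musculu": [("Mus musculus", 0.95), ("Mus musculoides", 0.70)],
    "Pan paniscus": [("Pan paniscus", 1.0)],
}

# Hypothetical source trees, reduced to their tip sets for the coverage step.
SOURCE_TREES = {
    "mammals": {"Homo sapiens", "Mus musculus", "Pan paniscus", "Bos taurus"},
    "birds": {"Gallus gallus", "Corvus corax"},
}


@dataclass
class Result:
    names: list
    source_tree: str
    metadata: dict = field(default_factory=dict)


def clean_names(query, min_score=0.9):
    """Step 1: keep the best-scoring match per name and record what was done."""
    accepted, log = [], {}
    for submitted in query:
        matches = FAKE_TNRS.get(submitted, [])
        best = max(matches, key=lambda m: m[1], default=None)
        if best is not None and best[1] >= min_score:
            accepted.append(best[0])
            log[submitted] = {"matched": best[0], "score": best[1]}
        else:
            log[submitted] = {"matched": None, "score": None}
    return accepted, log


def best_coverage(names):
    """Step 2: choose the source tree whose tips cover the most query names."""
    return max(SOURCE_TREES, key=lambda t: len(SOURCE_TREES[t] & set(names)))


def run(query):
    names, name_log = clean_names(query)
    tree = best_coverage(names)
    covered = sorted(SOURCE_TREES[tree] & set(names))
    # Steps 3-4 (pruning/grafting, scaling) are placeholders in this sketch.
    meta = {"name_matches": name_log, "topology_source": tree,
            "scaling_method": "none"}
    return Result(names=covered, source_tree=tree, metadata=meta)


result = run(["Homo sapiens", "Mus musculu", "Pan paniscus", "Unknownus"])
print(result.source_tree, result.names)
```

The point of returning `metadata` alongside the names is that every decision made in step 1 stays attached to the final result, including names that were dropped.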

R wrappers to all phylotastic components

A lot of non-technical users already use R. Creating a set of R packages to hook into Phylotastic APIs wouldn't be hard. TNRS could be hooked into the taxize package I started here. Could include tree store, pruner, and phylomatic in another package to do tree acquisition/manipulation (perhaps wrap treebase into this package too by porting over the treebase package).

Phylotastic Lite

At the first hackathon, we treated Rutger's MapReduce pruner as a stripped-down version of a delivery system for phylogenetic knowledge, and we were able to make cool but highly limited demos including Mesquite-o-tastic and Reconcili-o-Tastic.

The idea here is to make another big jump forward in increasing capacity to handle real-life use-cases, without working out the larger problems associated with a multi-component phylotastic API. Let's take the shortest path to getting something that people actually can use for a wide range of queries. We could start with either phylomatic, or Rutger's MapReduce pruner. We'll debug the current system, load up the back end with 20 big trees that look really useful-- like the 5K-species bird tree that just came out last month, and including the NCBI taxonomy-- and we'll integrate quick-n-dirty fuzzy matching so that folks don't have to get the names perfect. We'll come up with an ad hoc system for annotating output.
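
As a rough illustration of what quick-n-dirty fuzzy matching could look like, here is a minimal sketch using `difflib` from the Python standard library. The function name `fuzzy_match`, the cutoff of 0.8 and the tip list are assumptions made for the example; a real service would match against the names in the loaded source trees.

```python
import difflib


def fuzzy_match(query_names, tree_names, cutoff=0.8):
    """Map each submitted name to its closest tree name, or None if nothing is close."""
    mapping = {}
    for name in query_names:
        hits = difflib.get_close_matches(name, tree_names, n=1, cutoff=cutoff)
        mapping[name] = hits[0] if hits else None
    return mapping


# Illustrative tip labels, not taken from any particular source tree.
tips = ["Homo sapiens", "Pan paniscus", "Mus musculus", "Gallus gallus"]
mapping = fuzzy_match(["Homo sapien", "Pan paniscuss", "Felis catus"], tips)
print(mapping)
```

Names with no sufficiently close match map to `None` rather than to a bad guess, which keeps the failure visible in the annotated output.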

This will give us something that is considerably more than a proof of concept:

  • a service to invoke in even cooler demos
  • a testbed to assess phylogenetic coverage
  • a testbed to generate challenges for annotation
  • a testbed to integrate phylotastic functionality for TNRS (name reconciliation) or tree-finding (choosing a source tree when the user doesn't specify it)
  • a source of a wide range of phylogenies for developing reconciliotastic applications


Note: Gaurav has some additions to this based on recent work

An important step in data integration is matching on a key-- species names in our case. In our version of the problem, we have a resource A (the user's input list of species, a PDF, a character matrix file, etc), and we want to match that with some external resources B (a set of source trees or core TNRSs) so as to integrate the two.

But this is part of a much bigger problem that has two parts: visualizing an auto-generated mapping between A and B, and interactive choosing of a mapping by the user. In the case of phylotastic, the auto-generated mapping comes from a TNRS. I think we are going to need at least the first part in order to develop and evaluate the TNRS as a practically useful tool.

If we think about this graphically as a table of pairwise matches, we get an idea code-named MatchMaker, which could be a killer app with uses not just in phylotastic, not just in bioinformatics, but throughout informatics and beyond. How often do people need to reconcile or integrate two resources A and B by finding the best mapping of A names to B names? I have some mock-ups and GUI ideas from a slightly different version of the problem (the "marriage problem" of finding the optimal mapping between 2 sets of N names) here:


But we also might want to think about the TNRS meta-service, which creates a one-to-many mapping between submitted names and matched names, i.e., an input name in A might match names B1, B2, B3 . . . in different namebanks served by different core TNRSs (aggregated by the meta-TNRS).

In this case, the mapping is naturally a disconnected graph consisting of connected sub-graphs, each with a single A node linked to 0 or more B nodes, with the A-B links representing a matching event (exact, fuzzy, deprecated, etc). So, one way to visualize this mapping is as a graph. Hilmar has constructed an ontology for TNRS language (http://phylotastic.org/terms/tnrs.rdf). A JSON parser can be used to take the JSON output of a TNRS and turn it into RDF: visualize the RDF graph with a tool such as Protege, or parse it out of the RDF and visualize it with GraphViz. The representation could be improved with some way of graphically encoding information about the match, e.g., blue node = NCBI taxon, thick solid edge = exact match, dotted edge = fuzzy match.
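
To make the graph encoding concrete, here is a minimal sketch that writes such a mapping as GraphViz DOT text. The match records, the `to_dot` function and the colour and style tables are assumptions for illustration; the records are not the actual JSON output of any TNRS, and the RDF step is skipped entirely.

```python
# Hypothetical match records: (submitted name, matched name, namebank, match type).
MATCHES = [
    ("Homo sapien", "Homo sapiens", "NCBI", "fuzzy"),
    ("Pan paniscus", "Pan paniscus", "NCBI", "exact"),
    ("Pan paniscus", "Pan paniscus", "ITIS", "exact"),
]

# Visual encoding suggested in the text: thick solid = exact, dotted = fuzzy.
EDGE_STYLE = {"exact": "style=solid, penwidth=3", "fuzzy": "style=dotted"}
NODE_COLOR = {"NCBI": "blue", "ITIS": "darkgreen"}


def to_dot(matches, unmatched=()):
    """Return DOT text with one A node per submitted name and one B node per match."""
    lines = ["graph tnrs {"]
    for name in unmatched:
        # An A node with zero B nodes is still drawn, as an isolated node.
        lines.append(f'  "A:{name}";')
    for submitted, matched, bank, kind in matches:
        b_node = f"{bank}:{matched}"
        lines.append(f'  "{b_node}" [color={NODE_COLOR.get(bank, "black")}];')
        lines.append(
            f'  "A:{submitted}" -- "{b_node}" [{EDGE_STYLE.get(kind, "style=dashed")}];'
        )
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(MATCHES, unmatched=["Felis catuss"])
print(dot)
```

The printed text can be saved to a file and rendered with the GraphViz `dot` command.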

A tree in the hand

Quick idea: as you roam a museum, zoo, collection or preserve, you scan QR codes (e.g., on signage) for various species, then press the "get tree" button, and you have a tree for all the critters you have seen. Great for class field trips.

In the more advanced version of this idea, there is some kind of automated or semi-automated species identification where the inputs are not encoded species names, but the user's feature descriptions, sequence samples, or photos.


I'm still rather keen to see an attractive handheld app where people can grow their own ToL the way they "check in" on locations, such as on foursquare.
The Netherlands now has a funding mechanism to take proof-of-concept technologies (i.e. phylotastic) to market (i.e. the app store) in public/private partnerships.
Once upon a time I worked at a web development company that specializes in life sciences (biomedia.nl). I think they would be excellent partners to write a proposal with to see if phylotastic technology can be applied to such an app.

From Michael

I'm friendly with the group behind iNaturalist.com, which is a website for recording species observations and citizen science, and once floated the idea of a phylotastic interface. I could approach them again about it.

From Brian S

I think this is a really cool idea, and I would probably use an app like that in my classes if it were available. In particular, I can think of integrating it with one of the field trips in my Systematics of Fishes class. Right now they do a scavenger hunt at the Oregon Coast Aquarium and then assemble a tree of life linking the species that they found by hand. This would be way cooler if they could do it on their iPhones or Androids.

From Dan L

I think this is a great idea. I've been building iPhone apps since 2009, so I have a good handle on what's available in the native SDK. I read about a project with some similar ideas at play: What The Feuille ? - a web app that lets you find out what tree or plant a leaf comes from. Wrapping something like that in an app to upload pictures and talk to a web API would go a long way.


Possible integration with OneZoom?
A front-end for the general public: a fun way to accumulate a species list (e.g. QR codes in a zoo or a museum) and get an attractive tree with extra information, as a web app that displays nicely in a mobile environment (e.g. could then be wrapped into a thin iOS and/or Android app).


Roll Eastman congruifier code into DateLife, so input topologies, rather than just lists of species, can be dated (Brian O'M)

Phylotastic metadata

Problem statement To ensure that phylotastic trees are useful to scientists, develop vocabularies to annotate phylotastic trees, and formats to embed the annotations with phylotastic trees delivered to users.

Background Phylotastic trees need metadata, for several reasons that include debugging the system and ensuring that results are reproducible. Importantly, in the absence of an external standard for truth, scientists judge the quality of trees by assessing the quality of methods used to produce them. Therefore, if a phylotastic system is to be useful to research scientists, the metadata for trees produced by the system must include sources and methods.

A very simple example At the first hackathon, we often demonstrated phylotastic by sending a query like "Homo sapiens, Mus musculus, Pan paniscus" to the MapReduce pruner and getting back a tree of the form "(( Homo sapiens, Pan paniscus ), Mus musculus )" based on pruning from the Bininda-Emonds tree. How would we want to annotate that result so that users can understand how the phylogeny was obtained, so that they can judge how much to trust it? What will we need to know about the source tree, the Bininda-Emonds tree? Here is a quick example of a simple report, in loosely structured text:

date = 9 Jan 2013, 10:27 am
service = MapReduce pruner at http://phylotastic-wg.nescent.org/script/phylotastic.cgi
query = { taxa="Homo sapiens, Mus musculus, Pan paniscus"&tree=mammals }
topology_method = pruning only based on exact match to species binomials
topology_source = { dc:title="The delayed rise of present-day mammals" dc:creator="Bininda-Emonds, O." dcterms:bibliographiccitation="Nature 2007, 446(7135):507-512" and so on }
topology_log = (no errors or warnings)
scaling_method = none
scaling_source = none
scaling_log = none
result = (( Homo sapiens, Pan paniscus ), Mus musculus )
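
The same report could also be carried as structured data. The sketch below is one possible encoding, not an agreed format: the choice of JSON, the `REQUIRED` list, the `missing_fields` helper, the ISO-style date and the splitting of the taxa into a list are all assumptions; the field names and values come from the report above.

```python
import json

# Fields taken from the loosely structured report; treating them all as
# required is an assumption of this sketch.
REQUIRED = ["date", "service", "query", "topology_method", "topology_source",
            "topology_log", "scaling_method", "scaling_source", "scaling_log",
            "result"]

report = {
    "date": "2013-01-09T10:27",
    "service": "MapReduce pruner at "
               "http://phylotastic-wg.nescent.org/script/phylotastic.cgi",
    "query": {"taxa": ["Homo sapiens", "Mus musculus", "Pan paniscus"],
              "tree": "mammals"},
    "topology_method": "pruning only based on exact match to species binomials",
    "topology_source": {
        "dc:title": "The delayed rise of present-day mammals",
        "dc:creator": "Bininda-Emonds, O.",
        "dcterms:bibliographiccitation": "Nature 2007, 446(7135):507-512",
    },
    "topology_log": "(no errors or warnings)",
    "scaling_method": "none",
    "scaling_source": "none",
    "scaling_log": "none",
    "result": "(( Homo sapiens, Pan paniscus ), Mus musculus )",
}


def missing_fields(rep):
    """List the required fields that a report does not provide."""
    return [key for key in REQUIRED if key not in rep]


print(json.dumps(report, indent=2))
print("missing:", missing_fields(report))
```

A check like `missing_fields` is the kind of thing a client could run before trusting a tree, since a report without sources and methods is hard to evaluate.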


Synopsis Annotate a small set of large trees used as sources.

This is described on a separate TreeAnnotation page.

comment: This potentially links to the social bookmarking / crowd-annotation of trees idea, which will become more critical as the number of available large trees expands.

Tree reconciliation, non-tree-topology presentation

From an email sent to the list: Is it necessary to resolve conflict outright, or is it possible to figure out a way to present and incorporate uncertainty into the interface and presentation? As a systematist, I often *want* to see the conflict and degree of conflict amongst trees, be they from different genes or from different authorities, etc. It seems to me that systematics still doesn't have a good way of presenting uncertainty in trees or in graphs, and that maybe Phylotastic, with its emphasis on good presentation and visuals, would be a great place to hash out a better way of displaying incongruity amongst several trees.

I'm imagining a possible use case like this: I am working on the genus Populus and I want to find out what's been published, tree-wise. I know that I will get back several trees that have pretty significant conflicts, grouping species together in different monophyletic groups. Instead of that resolving to one single "authoritative" tree, it might be more informative to me to see something like a Splitstree network, where the incongruities are visualized in some way, and maybe with branch weightings indicating support amongst published trees for that branch. Then perhaps I could click on the branch and see the breakdown of references that support that branch.

If we could work out a good way to visualize these sorts of incongruities, I imagine it would scale well into other sorts of tree conflicts, like incomplete lineage sorting, hybridization, etc.
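
As a starting point for branch weightings, one could tabulate which source trees support each clade. The sketch below does this for rooted trees written as nested tuples. The three topologies and the "study" labels are invented purely for illustration and are not published Populus results; the function names are hypothetical, and a real tool would compare unrooted splits read from Newick or NeXML files.

```python
from collections import defaultdict


def clades(tree, found=None):
    """Return (tip set of tree, list of every clade in it) for a nested-tuple tree."""
    if found is None:
        found = []
    if isinstance(tree, str):
        return frozenset([tree]), found
    tips = frozenset()
    for child in tree:
        child_tips, _ = clades(child, found)
        tips |= child_tips
    found.append(tips)
    return tips, found


def support_by_clade(trees):
    """Map each non-root clade to the list of references whose tree contains it."""
    support = defaultdict(list)
    for ref, tree in trees.items():
        all_tips, found = clades(tree)
        for clade in set(found):
            if clade != all_tips:  # the root clade is trivially shared
                support[clade].append(ref)
    return support


# Invented topologies, used only to exercise the code.
trees = {
    "study A": (("P. alba", "P. tremula"), ("P. nigra", "P. deltoides")),
    "study B": ((("P. alba", "P. tremula"), "P. nigra"), "P. deltoides"),
    "study C": (("P. alba", "P. nigra"), ("P. tremula", "P. deltoides")),
}

support = support_by_clade(trees)
for clade, refs in sorted(support.items(), key=lambda kv: -len(kv[1])):
    print(sorted(clade), "supported by", refs)
```

The list of references per clade is what a viewer would show when the user clicks on a branch, and its length could drive the branch weighting.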


Splitstree-like network visualization
Densitree-like presentation, with different levels of support indicated by thicknesses
Network presentations
Reticulate branches in a normal tree presentation

NeXML -> R converter

From Brian O. There are two main formats in R for phylogenies: ape's phylo format, and phylobase's phylo4d format. Greg Jordan in his ggphylo package also keeps phylogeny data in a separate data.frame that references nodes.

PhyloSOC ideas

Toward BabelPhysh 1.0: universal tree-handling

See the BabelPhysh idea.

This project aims to build up the back-end functionality for BabelPhysh, focusing only on trees (and ignoring any other kind of data for now). The goal is to develop a set of operations that collectively allow any tree to be read, to be displayed, and to be translated to any other tree format (to the extent that this is possible without loss of information).

The operations might be distributed among several toolboxes or environments (DendroPy, R, Bio::NEXUS, etc) or concentrated into a single toolbox. Multiple implementations are desirable but are not necessary at this stage.

The project is simplified by the fact that the majority of trees out there are Newick trees. Nevertheless, not all Newick trees are the same, and not all Newick parsers are the same. In addition, there are NEXUS files, NHX files, and other types of trees.

To define the project more fully, we need to identify a set of current barriers to overcome, or operations to support, such as the following:

  1. display We would like to be able to pipe any tree to a viewer that can show it. Some tree-viewers cannot display some trees, because
  2. renaming. safely change all the names in the tree, without corrupting it
  3. simplification. remove grammatical features that cause incompatibilities with some implementations of Newick parsers, e.g., NHX comments
  4. format translation. translate from Newick to NEXUS and back. translate from NHX to simple Newick.
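
The renaming and simplification operations can be sketched with a small tokenizer, so that labels are replaced as whole tokens rather than by blind string substitution. This is a minimal illustration, not a full Newick parser: the token grammar, the function names and the example tree are assumptions, and unquoted labels containing spaces are not handled.

```python
import re

# One alternative per kind of Newick token; anything else is a parse error.
TOKEN = re.compile(r"""
    (?P<comment>\[[^\]]*\])        # square-bracketed comment, e.g. NHX
  | (?P<quoted>'(?:[^']|'')*')     # single-quoted label
  | (?P<punct>[(),;])              # structural characters
  | (?P<length>:[^(),;\[\]]+)      # branch length after a colon
  | (?P<label>[^(),:;\[\]'\s]+)    # unquoted label
  | (?P<space>\s+)                 # whitespace
""", re.VERBOSE)


def tokens(newick):
    """Yield (kind, text) pairs covering the whole string."""
    pos = 0
    while pos < len(newick):
        m = TOKEN.match(newick, pos)
        if m is None:
            raise ValueError(f"cannot parse Newick at position {pos}")
        yield m.lastgroup, m.group()
        pos = m.end()


def simplify(newick):
    """Drop bracketed comments, including NHX annotations."""
    return "".join(text for kind, text in tokens(newick) if kind != "comment")


def rename(newick, mapping):
    """Replace whole unquoted labels only; comments and longer labels are untouched."""
    out = []
    for kind, text in tokens(newick):
        if kind == "label" and text in mapping:
            text = mapping[text]
        out.append(text)
    return "".join(out)


nhx = "((Hsap:0.1[&&NHX:S=human],Ppan:0.1[&&NHX:S=bonobo])95:0.2,Mmus:0.3);"
plain = simplify(nhx)
renamed = rename(plain, {"Hsap": "Homo_sapiens", "Mmus": "Mus_musculus"})
print(plain)
print(renamed)
```

Working on tokens is what makes the renaming safe: a mapping entry for `Pan` would not touch a label such as `Pan_troglodytes`, and text inside comments is never rewritten.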

another phylosoc idea

quick-and-dirty suggestions

phylogeny annotation: visual metaphors and language support

This idea has been discussed by the HIP LT to some extent. It has two goals.

  1. Visualization goal Extending the visual language of phylogeny representation, i.e., how to represent phylogeny concepts graphically
  2. Annotation support goal Develop a common language for marking up serializations of phylogenies.

background. Currently we all know how to "read" a phylogeny visually in terms of recognizing the nodes and edges. Some additional conventions for marking up phylogenies graphically are common. For instance, many tree viewers will read square-bracketed numbers next to nodes in a Newick string, and will render these numbers next to the node. The user will then interpret this visualization in terms of a phylogeny concept, which is the degree of support (e.g., bootstrap support, Bayesian posterior) for the node.

But visualizations of phylogenies could say much more, if we had a visual language, and a way to support that in portable annotations. Given common predicates and objects for sharing annotations across APIs and serialization formats, R (for instance) would be able to know what to set tip.color and edge.color to based on what it reads from a file, and it would be able to write that out in a way that figtree understands.

Here is an example that Rutger suggests for specifying a visual motif and its representation as data:

NeXML round-tripping

Name reconciliation

the reconcile-tree problem

one hackathon, half a dozen parts, they all fit together to build a new more interoperable framework

  • Phylomatic-style capacity to supply species tree
  • PhyloWS delivery of species tree to client app
  • Reconcile-tree client app implements phyloWS query
  • Service to read input gene tree, get list of species
  • Reconcile-tree app implements NeXML output

phylo-aware interfaces to collections

social bookmarking of trees

  • On November 9, 2011, Arlin suggested integrating a Phylomatic-like interface with TreeBASE and Tree-of-Life Web (ToLWeb), with the goal of eventually providing the entire Tree-of-Life through such an interface. In Arlin's words, Phylomatic is a "heavily used package that allows plant researchers to use grafting and pruning operations to create custom species trees from a mega-tree of plants." Thus, the goal would be to provide a way for researchers to prune the entire tree-of-life down to a set of terminal taxa of interest, say for the purposes of running a phylogenetic comparative analysis. Heady stuff!
  • Brian raised a concern about which tree(s) would be included in the interface, since we're far from having consensus on the relationships among all living organisms. One occasional failing of the current Tree-of-Life pages on the web is that they're updated infrequently and don't always reflect controversy in relationships. Where disagreement occurs, sometimes only one potential arrangement is presented, perhaps that preferred by the experts that have been tapped to curate that particular page. The general discussion that followed favored an approach in which users can select from many offered trees. One trouble potentially results from that choice: many users of the interface will not be experts in phylogenetics, and thus will be ill-equipped to determine which tree is the most accurate or reliable.
  • Social bookmarking of trees may offer the best solution to this problem, because it would allow the community to update the consensus tree as ideas change and new results become available. If users can vote (like?) for trees that are well-supported and reliable, and annotate the reasons behind their votes, then the overall community can come to a consensus without placing that onus upon any small group of people, who might exhibit bias or lack the expertise to evaluate all trees. After all, millions of organisms exist, and no one is expert in all.
  • Arlin indicated that there are already several large trees available (Bininda-Emonds or Venditti, 2011 for mammals). "A hackathon-scale project would be to collect a dozen of those trees and get them into the back end of phylomatic. Making those trees available, phylomatically, would be a big plus"
  • Karen pointed out that "In a recently-submitted AVAToL grant, we are proposing comprehensive phylogenetic synthesis using both merge-and-graft operations and also analytical approaches (supertrees, species trees, etc). If we get funded, there will be a draft tree of life with Phylomatic-type subtree extraction within the next 18 months." If that gets funded, there might be no need for HIP to work on that issue. That said, Karen's comment doesn't make it clear whether community-sourcing would be a portion of their vision.
  • Hilmar pointed out that data quality assessment is actually in-scope for more than just phylogenies; one could imagine social commentary tools being integrated with Dryad or other data repositories. This is apparently part of the proposal for the TreeBASE / ToLWeb grant.
  • It may be easier to get buy-in to social bookmarking of trees if we integrate with an extant platform, such as Google+. People are a lot more likely to interact with something that appears as part of their daily routine, rather than something that requires them to go to a special website and login with yet another password.
  • Enrico pointed out that "There is a nice community-based ranking mechanism proposed for gene annotations through pubmed ([1])" and asked whether something similar could be done through google+ pointing to phylogenies in one or more repositories?
  • Rutger mentioned that "if we can get the rating platform into an RSS feed it can be integrated into friendfeed, facebook and twitter". His "Phylofoundation" ([2]) tweets new studies in TreeBASE. If that can be wrapped into some sort of simple presentation allowing people to rate the study that would be something that might work.
  • Ultimately it would be nice to integrate social rankings with some sort of search engine for trees. Brian mentioned that he'd love to see a database search that could provide the most recent, highest-rated and/or most popular phylogeny for taxon X (where X is above the species level), or that includes both species Y and Z in a computable, transferable, free and easily downloadable format.

roughed-out ideas for hackathon targets

Before deciding to target Phylotastic, the Leadership Team discussed these 5 ideas at its January, 2012 meeting in Durham.


PhyloHow - Developing and disseminating technical know-how in informatics, analysis, and data management using participant-generated instructional videos (and other how-to artefacts). A vision of success would be that an end-user searches online for guidance on phyloinformatics, and finds 100's of useful instructional presentations on YouTube and SlideShare.

problem The problem is to develop and disseminate useful technical guidance on how to solve basic informatics and data-management problems faced by our user community, in a context in which most users don't know about most tools, and technology changes on a yearly basis.

background Users in the phylogenetics community frequently face challenges in data management and analysis. Of interest is the subset of challenges for which there is a freely available technical solution-- involving some combination of installed software, online services, and manual steps-- that is not widely used due to lack of awareness and training. Examples might include:

  • making a really nice figure with an annotated phylogeny when no tree-drawing tool seems to do it all
  • re-naming all the OTUs in one or more files
  • maintaining provenance information through a typical phylogenetics workflow
  • submitting the record of a phylogeny report to TreeBASE
  • combining an alignment and tree from two separate sources into a single NEXUS or NeXML file
  • translating any alignment or character data into NEXUS or NeXML

Users facing such challenges, without any outside source of assistance or training, may muddle through the process on their own, or simply give up. In phyloinformatics there is an enormous gap between what expert users can do using available tools, and what an ordinary user is likely to learn when she goes out in search of practical ways to address a pending challenge. The result is not just inefficiency, but many ad hoc manual solutions to the same problem-- solutions that typically do not make use of available standards, and thus contribute to a tower-of-babel effect.

In the past, we and others habitually looked on this as a problem of developing more convenient interfaces (especially GUIs) to specific operations, or wrapping a workflow of operations in a single tool to make things more convenient for users. There are factors that tend to make these strategies unlikely to succeed. We have not given enough attention to a third approach, which is simply to do a better job providing instruction to users, raising their general level of competence. In a recent analysis of best practices for sharing phylogenetic data (Stoltzfus, et al, in prep), the lack of "How-To" documents and other instructional resources was identified as a major impediment to achieving interoperability.

approach Let us imagine a project with 3 parts (not sure whether forum or docu-hackathon-- or something else-- comes first):

  1. a docu-hackathon
    • the goal is to work out technical solutions, then document them with instructional videos and other how-to artefacts
    • participants include end-users, techno-geeks, and technical writers
    • organizers will provide instruction on screen-recording and other technologies
    • participants may work alone or in small teams
    • while we want teams to target technical problems that a team member has solved already using available tools, we can't ignore the voice of experience, which tells us that attempting to generalize on one's own limited experience usually reveals bugs and limitations that need to be worked through and understood
  2. an online discussion forum (phylohow.org is available) in which
    • user queries are welcomed
    • gurus are lurking
    • how-to artefacts are reviewed and discussed
    • requests for new how-to artefacts are discussed, and a target list is maintained
    • note: we don't need a special forum for sharing artefacts-- this can be done on YouTube, SlideShare, etc,
    • but we do need a system that makes it easy for naive users to discover resources via google (e.g., everything gets tagged with 'phylohow'); it's important that users can discover educational resources without being part of an ingroup.
  3. Crowd-sourcing (grad-sourcing?) documentation through open contests
    • tangible prizes
      • can come from sponsors, e.g., maybe SSE can offer travel to the next conference
    • target can be open, or chosen from a list of problems such as that given above (background)
    • rules
      • all material is used with permission
      • only open-source or free-for-academic-use tools
      • sign over rights or adopt CC license
    • Entries would be judged for
      • relevance (how important is the problem, how many users are implicated)
      • effectiveness (will users learn effective skills from the artefact).

a scoping comment Because we are interested in informatics and interoperability here, we are not mainly interested in instructional materials on how to run an analysis within a particular existing software package, e.g., how to do a standard likelihood analysis of your CytC sequences in PAUP*. However, imagine that this project has succeeded, and we have shown the phylogenetics community that it is possible to use open contests to stimulate the production of effective instructional materials for a targeted problem. In effect, this greatly brings down the cost of documentation for software developers. All they need to do is to take some of their grant money and offer it as prize money for instructional videos, within the context of one of our sponsored contests.



problem: incompatible file formats.

  • apps allow limited set of input file types; user's data may be in different format
  • apps produce limited output, including aberrant formats

approach develop BabelPhysh v 0.1, code-named 'The Federal', a universal translator for alignment and phylogeny files, with a web-services API. To do this right is a project that requires a serious investment of time from logic programmers. But we can't do that in the space of a hackathon. Therefore, although we are going to be disciplined about the interfaces, the back-end guts of version 0.1 are going to be hacked from whatever capacities we can leverage from BioPerl, Bio::Phylo, DendroPy, Mesquite, etc. In other words, the engine of BabelPhysh 0.1 will be a black box full of programming junk. Once it is done (i.e., after the hackathon, as a separate out-of-scope problem), we can challenge computer scientists to re-engineer the black box according to the interface spec.

  • phyloWS interface
  • vocabulary of file types and sub-types A, B1, B2, B3 (e.g., Newick flavors), C . . . .
  • 'whatever works' approach to translation sub-services (i.e., binary translation problems A --> C, B1 --> B2, etc)
  • broker to send different types of requests to different services
    • note that the broker is a layer of abstraction that allows flexibility in back-end services
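
The broker idea can be sketched as a registry of binary translators keyed by (source, destination) format. Everything in this example is an assumption made for illustration: the format names, the `translator` decorator, the two toy translators and the `LOSSY` table. The "no service" error is also an addition of the sketch, while the two warnings follow the standard messages proposed for the spec.

```python
import re

# Registry of translation sub-services keyed by (source, destination) format.
TRANSLATORS = {}
# Translations known to discard information.
LOSSY = {("nhx", "newick")}


def translator(src, dest):
    """Decorator that registers a function as the src -> dest sub-service."""
    def register(func):
        TRANSLATORS[(src, dest)] = func
        return func
    return register


@translator("newick", "nexus")
def newick_to_nexus(text):
    # Wrap a single Newick string in a minimal NEXUS TREES block.
    return "#NEXUS\nBEGIN TREES;\n  TREE tree1 = " + text.strip() + "\nEND;\n"


@translator("nhx", "newick")
def nhx_to_newick(text):
    # Crude: drop every square-bracketed comment, losing the NHX annotations.
    return re.sub(r"\[[^\]]*\]", "", text)


def broker(text, src, dest):
    """Route a request to a sub-service and return (result, messages)."""
    if src == dest:
        return text, ["warning: translation is moot"]
    func = TRANSLATORS.get((src, dest))
    if func is None:
        return None, [f"error: no service for {src} -> {dest}"]
    messages = []
    if (src, dest) in LOSSY:
        messages.append("warning: translation necessitates data loss")
    return func(text), messages


out, msgs = broker("(A[&&NHX:S=x],B);", "nhx", "newick")
print(out, msgs)
```

Because clients only ever call `broker`, a sub-service can be swapped for a better implementation by re-registering the same (source, destination) pair, which is the flexibility the abstraction layer is meant to provide.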

pre-requisites completed in advance for a successful hackathon

  • preliminary controlled vocabulary of supported file types and sub-types (30 min effort)
  • relatively complete phyloWS spec for the translation problem (> a few hours effort)
    • syntax to specify source and destination formats
    • verbosity level for messages
    • standard messages (error: unclear src or dest, malformed src; warning: translation necessitates data loss, translation is moot)
  • token implementation of test platform, e.g., based only on Bio::Phylo-supported translations (? depends on programmer)
  • preliminary set of test cases representing translation sub-problems (a few hours)


  • live test platform with web forms test interface
  • documentation integrated into test interface
  • broker - handles direct interaction with client, negotiates with translation sub-services
  • prognosticator - guesses unspecified source format
  • translation sub-services, divided by backend
    • BioPerl for all major alignment formats
    • Bio::Phylo for NEXUS to NeXML
    • XSLT for NeXML to CDAO
    •  ?? for eNewick to Newick
    • and so on
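
A prognosticator could start from a few cheap checks on the beginning of the input. The sketch below is a heuristic under stated assumptions: the function name `guess_format`, the format labels it returns and the particular checks are illustrative, and a real service would need many more cases and a way to report its confidence.

```python
import re


def guess_format(text):
    """Guess a format label from the leading content; return None if unsure."""
    head = text.lstrip()
    if head.upper().startswith("#NEXUS"):
        return "nexus"
    # Heuristic: XML whose first part mentions nexml (element or namespace).
    if head.startswith("<") and "nexml" in head[:2000].lower():
        return "nexml"
    if head.startswith(">"):
        return "fasta"
    if head.startswith("(") and head.rstrip().endswith(";"):
        # NHX is Newick with [&&NHX:...] comments embedded.
        return "nhx" if re.search(r"\[&&NHX", head) else "newick"
    return None


for sample in ["#NEXUS\nBEGIN TREES;", "((A,B),C);", "(A[&&NHX:S=x],B);", "hello"]:
    print(repr(sample[:12]), "->", guess_format(sample))
```

Returning `None` instead of a default lets the broker answer with an "unclear src" error rather than attempt a translation from the wrong format.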

players to recruit

  • user testers with capacity to create file-type test cases as needed, interpret errors and problems
    • including test-master to aggregate test conditions and results in a central resource
  • documentation writers
  • test platform architect - works with interface developer and sub-service implementers
  • web interface developer for test platform (implements pull-downs, help, message-display, etc)
  • back-end developers to develop translation sub-services
  • phyloWS developers
  • controlled vocabulary developers (format types, sub-types, other?)
  • sub-service implementers who install back-end capacities on the test platform and code PhyloWS interfaces to them

Generating MIAPA-checklist-compliant reports from end-user phylogeny apps

problem: users generate alignments and phylogenies using software, then store or archive these results in a form that is not very re-usable, due to lack of metadata about sources and methods. To solve this problem we need a reporting standard as well as tools to generate and consume compliant records. Submission could be negotiated directly with TB, but a more modular way to develop would be to focus on producing NeXML output that could be consumed secondarily by any resource.

Approach

  • We don't have a MIAPA yet, but at least we have a preliminary checklist.
  • First step is to build language support for the checklist into MIAPA.
    • possible NeXML schema additions so that the annotations go into the right slots in NeXML classes.
    • build up language support, e.g., in CDAO, so that NeXML can reference it
  • we have discussed the possibility of creating well-annotated archivable records directly from a phylogeny inference app such as MEGA, PAUP*, MrBayes, RAxML and so on. This makes it easier on the typical user. Sudhir Kumar (MEGA) has resources to devote to this. We would have to carve this off from the main MEGA codebase, which includes proprietary code (not OS, thus not supported by NESCent)

interactive naming and reconciliation tool

problem: many users have files with matching entities that have non-matching names, e.g., names in the tree do not match names in the alignment.

  • these inconsistent names are a barrier to archiving in TreeBASE
  • they are a barrier to re-use (someone else has to figure out the mapping of names)

approach a problem description and a crude interface design are presented here: http://dl.dropbox.com/u/7727158/name_matching.pptx

  • but the above description doesn't say anything about architecture


This idea has been moved to its own page.