EvoIO Working Group Proposal

From Evolutionary Interoperability and Outreach

Revision as of 20:39, 1 September 2011

Hackathons, Interoperability, Phylogenetics (HIP), a NESCent working group, envisions the future as a virtual phyloinformatics bazaar in which comparative data and phylogenies are saved, shared, annotated, liked, re-used, aggregated, mashed up, and linked in. In pursuit of this vision, the working group will stage a series of hackathons (intensive participant-driven code-development meetings) that empower early-career scientists to build the links of an emerging network of interoperable evolutionary resources.

Title: HIP: Hackathons for Interoperability in Phylogenetics

This proposal was submitted to NESCent in December 2010, and subsequently accepted. A PDF version of the proposal is also available.

Short Title: HIP

Name and contact information for Project Leader, and any Co-Leaders

Project Summary

The potential for synthetic research based on aggregating, integrating, and re-using data is enormous, yet most resources are not interoperable. To realize this potential, software and databases that handle evolutionary trees (and their associated annotations) must be interoperable. Interoperability, in turn, requires tools based on common standards. In the past few years, evolutionary informaticists, with help from NESCent, have been building a software toolbox for solving interoperability problems, based on the EvoIO “stack” of NeXML, CDAO and PhyloWS. This toolbox makes it possible to begin building a worldwide network of interoperable evolutionary resources. The HIP (Hackathons, Interoperability, Phylogenies) working group aims to use the hackathon mechanism (which we have helped to develop at NESCent) to grow this network directly, by adding links to it, and indirectly, by creating examples for others to follow. To support this project within a working-group budget, we leverage support from strategic partners. Each hackathon in the planned series of 3 will bring together scientific programmers with related challenges. The hackathons target early-career scientists, who often have the most technical expertise and the most potential to pass along their skills and enthusiasm.

Public Summary

The Internet increases the potential for scientists to share information, find new patterns, and test ideas. Yet, this potential often is not realized, due to software components’ inability to work together (commonly referred to as a lack of “interoperability”). To realize the potential of synthetic evolutionary science, software and databases that handle evolutionary trees (and their associated annotations) must work together, i.e., they must be “interoperable”. Interoperability, in turn, requires tools based on common standards. In the past few years, evolutionary researchers, with help from NESCent, have been building a software toolbox for solving interoperability problems. This toolbox makes it possible to begin building a worldwide network of interoperable evolutionary resources. Growing this network is the goal of the HIP (Hackathons, Interoperability, Phylogenies) working group. We aim to grow the network directly, by adding links to it, and indirectly, by creating examples for others to follow. To achieve this, we will stage “hackathons”: small, intense meetings that bring together scientific programmers to bring down interoperability barriers. The hackathons target early-career scientists, who often have the most technical expertise and the most potential to pass along their skills and enthusiasm. The hackathons will have different themes, and will focus significantly on the needs of strategic partners.

Introduction and Goals

Sidlauskas, et al. (2010) make a strong case for the value of synthetic research, defined as “the extraction of otherwise unobtainable insight from a combination of disparate elements”. They distinguish several modes of synthesis: data aggregation, methodological integration, conceptual synthesis and reuse of results. Synthesis as a problem in knowledge engineering requires semantic integrity across boundaries (e.g., disciplinary or methodological). That is, in order to re-use, aggregate, or integrate data (in the sense of “information”), the meaning of data must be clear to an information consumer in a context other than its original. This requires agreements about data representation. Furthermore, to support large-scale integration, data must not require “hands-on” interpretation, but must be accessible to automatic processing by software.

Thus, synthetic research depends on development and adoption of standards for knowledge representation, and the technology to support these standards, so as to make data accessible, searchable and combinable. Examples of this were shown at the 2010 iEvoBio meeting, where the adoption of this approach by the TreeBASE project made its data combinable with the Tree of Life and the UniProt project, searchable in new ways, and accessible in new visualizations such as superimposition of taxa on Google maps (Vos, et al., 2010). The innovations that make TreeBASE interoperable were enabled by NESCent’s forward-looking support for behind-the-scenes technology development. From 2006 to 2009, NESCent supported an “evolutionary informatics” working group that spawned a trio of projects:

NeXML, an XML format for phylogenetic data and metadata (Vos, et al., in review; http://www.nexml.org). Data expressed in NeXML can be annotated with terms from the CDAO and other knowledge representations, thereby allowing metadata stored by community resources to be “unlocked” and made available for clients, strategic partners and third-party projects to integrate in novel ways.

CDAO, the Comparative Data Analysis Ontology (Prosdocimi, et al., 2009; http://www.evolutionaryontology.org). The CDAO provides a framework for the explicit, computable representation of core concepts in comparative evolutionary analysis. It allows for the import of other vocabularies for data and metadata: this is essential to keep pace with synthetic use of data from other domains of knowledge.

PhyloWS, a web services standard (http://evoinfo.nescent.org/PhyloWS) that provides a common application programming interface for phylogenetic resources. This allows phylogenetic data to be queried and identified in a way that is agnostic of the underlying implementation of the resource.
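To illustrate what "agnostic of the underlying implementation" can mean in practice, here is a minimal Python sketch. It is not the PhyloWS API: the `TreeResource` interface, the two backend classes, and `search_all` are hypothetical names invented for this illustration. The point is only that client code written against one shared interface can query resources that store their trees in entirely different ways.

```python
from __future__ import annotations

from typing import Protocol


class TreeResource(Protocol):
    """Hypothetical common interface: look up tree IDs by taxon name."""

    def find_trees(self, taxon: str) -> list[str]: ...


class NewickStore:
    """Backend A (hypothetical): scans simple Newick strings held in memory."""

    def __init__(self, trees: dict[str, str]) -> None:
        self._trees = trees

    def find_trees(self, taxon: str) -> list[str]:
        hits = []
        for tree_id, newick in self._trees.items():
            # Treat Newick punctuation as separators to recover bare labels.
            # This assumes simple trees without branch lengths or quoting.
            cleaned = newick
            for ch in "();,":
                cleaned = cleaned.replace(ch, " ")
            if taxon in cleaned.split():
                hits.append(tree_id)
        return sorted(hits)


class TaxonIndex:
    """Backend B (hypothetical): a prebuilt taxon -> tree IDs index."""

    def __init__(self, index: dict[str, set[str]]) -> None:
        self._index = index

    def find_trees(self, taxon: str) -> list[str]:
        return sorted(self._index.get(taxon, set()))


def search_all(resources: dict[str, TreeResource], taxon: str) -> dict[str, list[str]]:
    """Client code: depends only on the shared interface, not on any backend."""
    return {name: res.find_trees(taxon) for name, res in resources.items()}


resources = {
    "store": NewickStore({"T1": "((Homo,Pan),Gorilla);", "T2": "(Mus,Rattus);"}),
    "index": TaxonIndex({"Homo": {"T9", "T3"}}),
}
print(search_all(resources, "Homo"))  # {'store': ['T1'], 'index': ['T3', 'T9']}
```

In a real deployment the shared interface would be a web API rather than a Python protocol, but the design choice is the same: the client never needs to know how a given resource stores or indexes its trees.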

Together, these form the EvoIO stack, designed to enable a global network of interoperable resources. Using this stack, TreeBASE can now represent its contents using CDAO and serialize them using NeXML, and data can be searched using the PhyloWS standard. Thus, NESCent’s past efforts have increased the potential for a global network of interoperable phylogenetic resources to emerge. However, the EvoIO stack is in its early stages, and it is not easy for novices to deploy. Most researchers do not know that it exists or are unaware of its benefits. The sociology and infrastructure of science (its system of hiring, promotion and funding) discourage forward-looking technology changes that would benefit the entire community without providing direct and tangible benefits to a resource-provider.
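As a rough illustration of why an XML serialization lowers the barrier for software clients, the following Python sketch reads a small NeXML-style document with the standard library and prints the tree in Newick notation. The embedded document is a simplified fragment written for this example and has not been validated against the NeXML schema; the namespace URI, the element and attribute names, and the helper `to_newick` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for NeXML-style elements in this sketch.
NS = {"nex": "http://www.nexml.org/2009"}

# Simplified, illustrative fragment (not schema-validated).
DOC = """<nex:nexml xmlns:nex="http://www.nexml.org/2009"
           xmlns="http://www.nexml.org/2009" version="0.9">
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
    <otu id="t3" label="Gorilla gorilla"/>
  </otus>
  <trees id="trees1" otus="taxa1">
    <tree id="tree1">
      <node id="n1" root="true"/>
      <node id="n2"/>
      <node id="n3" otu="t1"/>
      <node id="n4" otu="t2"/>
      <node id="n5" otu="t3"/>
      <edge id="e1" source="n1" target="n2"/>
      <edge id="e2" source="n2" target="n3"/>
      <edge id="e3" source="n2" target="n4"/>
      <edge id="e4" source="n1" target="n5"/>
    </tree>
  </trees>
</nex:nexml>"""


def to_newick(xml_text: str) -> str:
    """Convert the first tree in a NeXML-style document to a Newick string."""
    root = ET.fromstring(xml_text)

    # Map OTU ids to their human-readable labels.
    labels = {o.get("id"): o.get("label")
              for o in root.findall("nex:otus/nex:otu", NS)}

    tree = root.find("nex:trees/nex:tree", NS)
    nodes = {n.get("id"): n for n in tree.findall("nex:node", NS)}

    # Build a parent -> children adjacency list from the edges.
    children = {}
    for edge in tree.findall("nex:edge", NS):
        children.setdefault(edge.get("source"), []).append(edge.get("target"))

    root_id = next(i for i, n in nodes.items() if n.get("root") == "true")

    def render(node_id: str) -> str:
        kids = children.get(node_id, [])
        if not kids:
            # Tip: emit the label of the OTU that the node refers to.
            return labels[nodes[node_id].get("otu")].replace(" ", "_")
        return "(" + ",".join(render(k) for k in kids) + ")"

    return render(root_id) + ";"


print(to_newick(DOC))  # ((Homo_sapiens,Pan_troglodytes),Gorilla_gorilla);
```

Because nodes, edges and taxa are explicit elements with identifiers, a client can attach or read annotations per element without writing a custom parser, which is the property the stack relies on.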

We propose to grow the network of interoperable evolutionary resources by forming a working group to organize a series of hackathons. Such meetings have been successfully organized for evolutionary informatics at NESCent before (Lapp, et al., 2007, Lapp, et al., 2009). The proposed hackathons promote synthesis by the development of standards-compliant software. Achieving this in parallel projects will increase connectivity of a network of interoperable community resources, which in turn will increase the capacity for end-users to conduct synthetic research based on aggregating data, integrating resources, and automating analyses on a large scale.

The working group promotes the adoption of common standards in phylogenetics by raising awareness of their existence and utility and by sharing knowledge on how to deploy them. Young scientists will be trained in the application of best practices to implement standards and stack technologies. This will go beyond the hackathons in initiatives such as Google Summer of Code projects or graduate fellowships. Leaders of community resources will learn to leverage software engineering concepts, programming techniques and communication tools to manage software development teams.

Proposed Activities

The proposed activities will be carried out by the working group and the hackathon participants. The working group remains relatively constant in membership and provides planning, follow-up and evaluation for all aspects of the project, while hackathons will have different sets of participants.

Planning and partnering by the working group

We will recruit additional working group members (up to a total of 8 to 10), from key projects, NESCent staff, and under-represented projects. The working group will meet at NESCent in mid-2011 to refine its strategic vision, and to begin planning the first hackathon. Subsequently, the working group will hold quarterly teleconferences and communicate on-line. From past experience, the proposers are adept at this mode of collaboration. The working group is responsible for engaging strategic partners and pursuing further financial support.

Hackathons

Organization of hackathons will follow the scheme established by previous NESCent hackathons, which typically host 20 participants for 5 days. The first day will begin with talks on best practices and common resources, and will end with participants self-organizing into project groups (3 to 7 members), based on OpenSpace principles. Subsequent days will be spent on projects. Participants will be instructed to develop links that directly build the network of interoperable resources available to end-users, or to work on reference implementations and proofs-of-concept that will inspire such links.

Participants are roughly a 2:1 mixture of invitees and applicants responding to an advertised call for participation. By inviting participants, we take advantage of the fact that real networks (social, biological, computer) have critical nodes with a high degree of connectivity. Key nodes in an interoperable network of tree-related resources might include EoL, iPlant, NCBI, TreeBASE, ToLWeb and others. By issuing an open call, we expand our network of connections, encouraging participation from under-represented groups and developers whose resources are not well known or widely used. Applicants are selected on technical ability, collaborativeness, strategic opportunities, and diversity.

The proposers do not envision the development of novel resources or databases; rather, the scope of the hackathons is to improve interoperability between existing resources. For three hackathons, we identify areas of interest and projected key participants. The hackathons are ordered such that one builds on the previous, and so, in addition to key participants and interoperability experts, there are return visitors to enable the transition from one hackathon to the next.

  • Hackathon 1: Data resources. NESCent (Durham, NC) Winter 2011 - will focus on key data providers. Hackathon projects will be in the area of supporting import, querying and export of richly annotated data. External projects targeted for hackathon participation will include TreeBASE, Dryad, the Tree of Life web project, PhenoScape, MorphBank, MorphoBank, TimeTree and PhylomeDB.
  • Hackathon 2: Data integration environments. Field Museum (Chicago, IL) Summer 2012 - will focus on data integration and exploration environments. Hackathon projects will take advantage of the accomplishments of the first hackathon and will include aggregating data via phyloreferencing and taxon referencing, and integrating data exploration environments with image repositories. External projects targeted for hackathon participation will include BioSynC, TOLKIN, PhyLoTa, pPOD and the iPlant discovery environment.
  • Hackathon 3: Visualization tools. University of Arizona (Tucson, AZ) Winter 2012 - will focus on visualization of rich phylogenetic data as is available from data providers and data integration environments. Hackathon projects will take advantage of preceding hackathons to enable visualization of semantically annotated phylogenies (e.g., ones that incorporate character state changes and other biological events such as speciations, extinctions, gene duplications and metadata). External projects targeted for hackathon participation will include iPlant, Mesquite, PhyloBox, jsPhyloSVG, PhyloWidget and Archeopteryx.

Follow-up

Hackathons produce tangible outcomes such as proof-of-concept software and code revisions, and intangible outcomes such as agreement on best practices, awareness of available resources, and opportunities for collaboration. From experience, we know that these intangible outcomes bear fruit after hackathons end. However, tangible outcomes (though publicly available) often do not bear fruit for scientific end-users. To address this, we plan to put more emphasis on project follow-ups. First, we will stress that each team aim to produce: a stable software deliverable; a proof-of-concept used to gain further support; a technical publication; or a plan for a Google Summer of Code project, graduate fellowship or visiting scientist stay. Second, we will keep a list of projects and their current status on the working group web site. Each project will be assigned a working group member, who tracks the status of the project and advises participants on how to bring the project to fruition. Working group teleconferences will include, as a fixed agenda item, reports on project status.

Participating Fields and Partial List of Proposed Participants

The working group will include the 3 authors of the present proposal, Rutger A. Vos (post-doc, phyloinformatics), Arlin Stoltzfus (senior researcher, bioinformatics & evolution) and Enrico Pontelli (professor, knowledge representation & reasoning), as well as Dr. Mark Westneat, and a set of 4 to 6 other individuals chosen from key projects, NESCent staff, and under-represented projects. Hackathon participants are not known in advance. We will be careful in extending invitations and in reviewing applicants to an open call. Experience indicates that it is not sufficient merely to pick a representative of a targeted resource: the applicant’s ability to collaborate and their technical expertise are crucial. These events naturally attract early-career scientists.


Rationale for NESCent support

Many potential hackathon participants belong to NESCent’s in-house community, and several of the projects developed by them make excellent targets for hackathon projects (see collaborations with other NESCent activities, below). The proposers note the excellent informatics support provided by NESCent staff, whose help in meeting the IT needs for the hackathons will be invaluable. NESCent has world-class IT and logistic resources and the know-how to host hackathons. The proposers recognize NESCent’s unique culture of institutional support for initiatives such as ours. In contrast, other granting agencies usually do not support sustainable, ongoing development of infrastructure that serves community needs. Lastly, the proposers have been instrumental in developing the hackathon strategy deployed at NESCent, whereas other agencies might underestimate the value of this approach. Indeed, NSF declined a proposal for a phylogenetic Data Interoperability Network (Stoltzfus, et al., 2009) that included many of the ideas proposed here. However, NESCent understands the strengths of this approach, as well as its weaknesses (which we aim to address in our project).

Note on budget considerations

We do not expect NESCent to commit more funds to this project than it would to a typical working group. We estimate the cost of a typical hackathon at 2 to 2.5 times that of a working group meeting (allowing that hackathons outside of NESCent may entail extra costs). Therefore, the potential of our proposal depends on our ability to secure external funds, which so far include two major commitments and the following additional commitments to fund the travel of personnel:

  • $15K from iPlant to co-sponsor a hackathon at their Arizona location (letter, Dr. S. Goff)
  • $10K from the EoL BioSynC to co-sponsor a hackathon (letter, Dr. M. Westneat)
  • 3 person-trips for TOLKIN project personnel (letter, Dr. N. Cellinese)
  • 3 person-trips for working group leader (Dr. A. Stoltzfus)
  • 2 person-trips for Biodiversity Synthesis Center of EoL personnel (letter, Dr. M. Westneat)
  • 1 person-trip for working group leader (Enrico Pontelli)

We estimate that these commitments are sufficient to stretch a normal working group budget to cover 1 working group meeting and 2 hackathons. We intend to secure further support to allow a third hackathon as our plans mature over the next year. Options to obtain further support include:

  • Applying to organizations such as NCBO and the Phenotype RCN for meeting support
  • Submitting an NSF workshop proposal to fund a full hackathon
  • Requesting travel support from individual grant-funded projects

To support hackathon follow-ups, we have some of the same options, in addition to applying for NESCent short-term visiting scientist funds.

Collaborations with other NESCent Activities

There are many researchers at NESCent whom we would like to invite to the hackathons. We note especially the following individuals and the areas of their expertise relevant to the proposed hackathons: Jim Balhoff’s implementation of NeXML support with EQ annotation of character states in Phenex/PhenoScape; Vladimir Gapeyev’s contributions to TreeBASE; Ryan Scherle’s expertise with Dryad; Hilmar Lapp’s development of the PhyloWS standard; Jeet Sukumaran’s contributions to the design of the NeXML standard and of DendroPy.

Anticipated IT Needs

The working group does not expect long-term maintenance by NESCent of a public resource. In addition to communication tools we will supply ourselves (mailing list, wiki at http://www.evoio.org, live channels such as FriendFeed or Twitter), we envision the following IT needs:

  • Conferencing facilities for conference calls when organizing the hackathons, and video-conferencing during the events. We would like to use NESCent’s infrastructure for this.
  • LCD projectors for group programming and wiki review, ideally for each hackathon team.
  • WiFi access for all participants at the hackathons.

Proposed Timetable

  • Throughout - The working group will have quarterly teleconferences throughout the period of its mandate.
  • Summer, 2011 - Leaders have filled out the working-group roster. The group meets at NESCent for 3 days to develop cohesion on a strategic vision, and to develop themes for the first hackathon. Planning for the hackathon begins immediately after the meeting.
  • Winter 2011 or Spring, 2012 - Over the past 3 months, the working group has selected applicants. First hackathon takes place, probably at NESCent.
  • Summer or Fall, 2012 - Over the past 3 months, the working group has selected applicants. Second hackathon takes place, probably at the Field Museum in Chicago.
  • Winter 2012 or Spring, 2013 - Over the past 3 months, the working group has selected applicants. Third hackathon takes place, probably at the University of Arizona at Tucson.

Anticipated Outcomes

The working group will develop a wiki publicizing its strategic vision for a network of interoperable resources. It will maintain a publicly accessible spreadsheet with the current status of hackathon projects. By mid-2012, it will produce a report for publication on progress in achieving its strategic vision. In addition to describing tangible outcomes, this report will serve as a guide for others wishing to organize scientific hackathons. Hackathons produce intangible outcomes on an individual level, such as awareness of resources, training, and connections. On a community level, hackathons increase appreciation of the benefits of interoperability, and its connection to standards. This includes appreciation for emerging standards (NeXML, PhyloWS, CDAO) and undeveloped or under-supported standards (e.g. MIAPA, LSIDs). Computer code produced from hackathon projects is open-source and publicly available by the end of the hackathon. The specific nature of these outcomes cannot be predicted reliably. However, the following likely outcomes indicate what we mean by “growing the network of interoperable resources”:

  • Increased utilization of next-generation data formats - Participants working on tree visualization tools and data resources (listed earlier) will increase their support for NeXML as a format for phylogenetic data exchange. For end-users, this means increased interoperability of software and resources that exchange phylogeny data. Ultimately, users will be able to choose software for strictly scientific reasons, instead of limiting themselves to those compatible with their existing workflow.
  • Increased use of web services to import or export phylogenies - For a tree visualization tool that uses NeXML it is a short step to implement a PhyloWS search interface to access trees directly from TreeBASE or other resources. If cutting-edge tree viewers provide access to several such resources, this will stimulate other projects to export phylogenies via PhyloWS to leverage their cutting-edge visualization capabilities. Ultimately, the end-user will not be limited to locally saved trees; users with special visualization needs will choose a preferred tool, rather than the one chosen by the data provider.
  • Scientific use-cases driving expanded vocabulary support - Participating projects will present use-cases that drive improvements in language support for representing data and metadata, expanding the scope of artefacts such as CDAO and NeXML. For end-users, this means that interoperable resources will cover more of the kinds of information important for their research. In our discussions with stakeholders, it is clear that there is an urgent need for language to annotate methods and phenotypes.

References

  • Lapp H, Bala S, Balhoff J, Bouck A, Goto N, Holder M, Holland R, Holloway A, Katayama T, Lewis P, et al. 2007. The 2006 NESCent Phyloinformatics Hackathon: A Field Report. Evolutionary Bioinformatics, 3:287-296.
  • Lapp H, Stoltzfus A, Vision T, Vos R. 2009. Evolutionary Data Leaping to Web 3.0: Some Highlights From NESCent’s Third Hackathon. ASN/SSB/SSE meeting.
  • Prosdocimi F, Chisham B, Pontelli E, Thompson J, Stoltzfus A. 2009. Initial Implementation of a Comparative Data Analysis Ontology. Evolutionary Bioinformatics:47-66.
  • Sidlauskas B, Ganapathy G, Hazkani-Covo E, Jenkins K, Lapp H, McCall L, Price S, Scherle R, Spaeth P, Kidd D. 2010. Linking Big: The Continuing Promise of Evolutionary Synthesis. Evolution, 64:871-880.
  • Vos R, Lapp H, Piel W, Tannen V. 2010. TreeBASE2: Rise of the Machines. Nature Precedings doi:10.1038/npre.2010.4600.1.
  • Vos R, Balhoff JP, Caravas JA, Holder MT, Lapp H, Midford PE, Priyam A, Sukumaran J, Xia X, Stoltzfus A. In review. NeXML: rich, extensible, and verifiable representation of comparative data and metadata. Systematic Biology.

Appendices

Leadership Team

After acceptance of the proposal by NESCent, a leadership team was assembled with the following members:

  • Rutger Vos, University of Reading (co-PI)
  • Enrico Pontelli, New Mexico State University, Dept. of Computer Science (co-PI)
  • Arlin Stoltzfus, University of Maryland/NIST (co-PI)
  • Karen Cranston, NESCent
  • Sergei Kosakovsky Pond, UC San Diego
  • Hilmar Lapp, NESCent
  • Mark Westneat, Field Museum of Natural History / BioSynC
  • Mark Wilkinson, University of British Columbia

An additional leadership team member with expertise in ecological research applications of phylogenetic and comparative data is currently being recruited.