EvoIO Working Group Proposal

From Evolutionary Interoperability and Outreach
Revision as of 21:32, 17 March 2011 by Hilmar (talk | contribs) (moved EvoIOWorkingGroupProposal to EvoIO Working Group Proposal)

A New Working Group Proposal

Proposal deadline: December 1, 2010 (see guidelines)

Proposal details: "Proposals for working groups are short, not to exceed 5 single-spaced (12-pt type) pages (not including budgets or CVs)." (see guidelines)

To do

  • (?) polish up text of each section
  • (?) cut this down to <= 5 single-space pages
  • (Enrico) upload 2-page cv to wiki
  • (Rutger) upload 2-page cv to wiki
  • (Arlin) upload 2-page cv to wiki
  • final check to be sure that commitments are correct and documented with letters
  • wrap up everything into a single PDF for submission to NESCent by 5 pm

Title: HIP: Hackathons for Interoperability in Phylogenetics

Short Title: HIP

Name and contact information for Project Leader, and any Co-Leaders

Project Summary

We propose a working group to make evolutionary data accessible, searchable, and combinable, thereby promoting synthesis at a practical, technology-oriented level. The working group will organize "hackathons" at NESCent and elsewhere. The normal budget for a working group will be stretched to cover these events by securing commitments of funding support from participating stakeholders.

We will bring developers and users together to improve integration of phylogenetic data resources, visualization tools and data exploration environments. This will be done by organizing hackathons: meetings for collaborative software development that address tangible issues by leveraging the interests and expertise of the participants. Hackathons facilitate intense face-to-face interactions. The collaborative nature of these meetings gives every participant a direct sense of ownership of the produced software.

Members of the working group build on successful past experiences organizing hackathons at NESCent. These events include the PhyloInformatics Hackathon and the Database Interoperability Hackathon, which were invaluable in fostering and promoting the community standards NeXML, CDAO and PhyloWS. Hackathons have sometimes generated less follow-through than anticipated. We will remedy this by raising expectations among participants, encouraging projects to build on established tools, tracking and reporting follow-up results, and working with participants to secure short-term funds needed to complete projects.

The hackathons will generate an on-going stream of software showcasing the added value of interoperable data to research and dissemination. Targets for outreach and training, opportunities for reference implementations, and assessments of utility and gaps in community standards will be created.

Public Summary

The internet greatly increases the potential for scientists to share information, find new patterns, and test ideas. Yet, this potential often is not realized, due to "interoperability" barriers-- computer programs that can't work together. Evolutionary synthesis requires interoperability of those computational resources (e.g., software, databases) that handle "phylogenies" (evolutionary trees) and their associated annotations ("metadata"). Interoperability, in turn, requires tools based on common standards.

In the past few years, evolutionary researchers-- often with help from NESCent-- have been building a software toolbox for solving interoperability problems. This toolbox makes it possible to begin building a worldwide network of interoperable evolutionary resources.

Growing this network is the goal of the HIP (Hackathons for Interoperability in Phylogenetics) working group. We aim to grow the network directly, by adding links to it, and indirectly, by creating examples for others to follow. To achieve this goal, we will stage a series of "hackathons": small, intense meetings that bring together scientific programmers (the "hackers") to bring down interoperability barriers.

The hackathons especially target the early-career scientists who often have the most technical expertise, and the most potential to pass along their skills and enthusiasm. Each hackathon will have a different theme, and will focus on the needs of different strategic partners who are helping to provide support, including the Encyclopedia of Life, iPlant, TOLKIN, and other projects.

Introduction and Goals

Sidlauskas, et al. (2010) make a strong case for the value of synthetic research, which they define as “the extraction of otherwise unobtainable insight from a combination of disparate elements”. They distinguish several modes of synthesis, including data aggregation, methodological integration, conceptual synthesis and reuse of results.

Synthesis, considered as a problem in knowledge engineering, requires semantic integrity across boundaries (e.g., disciplinary or methodological). That is, in order to re-use, aggregate, or integrate data, it must be clear what the data would mean to a different information consumer, in a context other than its original context. This depends on common agreements about how to represent data. In particular, to support integration on a large scale, individual data points must not require "hands-on" expert interpretation, but must be accessible to computer software.

Thus, the potential for synthetic research depends on development, adoption and deployment of standards for knowledge representation, and the programmatic interfaces to make these accessible, searchable and combinable. Examples of this were presented at the iEvoBio satellite meeting at the 2010 ASN/SSB/SSE Evolution Conference, where the adoption of this approach by the TreeBASE project made its data combinable with that of the Tree of Life Web Project (Maddison, et al., 2007) and the UniProt project (Apweiler, et al., 2004), searchable in new ways such as custom news feeds of search results, and accessible in new visualizations such as superimposition of taxon samplings on Google maps (Vos, et al., 2010).

The innovations that make TreeBASE interoperable were made possible through NESCent's forward-looking support for behind-the-scenes technology development. From 2006 to 2009, NESCent supported an "evolutionary informatics" working group that spawned a trio of projects:

NeXML, an XML format for phylogenetic data and metadata (Vos, et al., in prep.; http://www.nexml.org). The NeXML standard allows phylogenetic data to be expressed in a predictable and validatable way. Data expressed in this way can be annotated with terms from the CDAO and other knowledge representations, thereby allowing metadata stored by community resources to be “unlocked” and available for clients, strategic partners and third party projects to be integrated in novel ways. Software libraries to process NeXML data now exist for a variety of programming languages and platforms. We seek to deploy these.

CDAO, the Comparative Data Analysis Ontology (Prosdocimi, et al., 2009; http://www.evolutionaryontology.org). The CDAO provides a framework within which the core concepts in phylogenetics can be expressed. Additional terms for data and metadata from other knowledge representations (such as DarwinCore, DublinCore, SKOS, as well as resource-specific subclasses such as the terms and concepts that TreeBASE uses) can be attached to core CDAO concepts (phylogenetic trees, character state matrices, OTUs) to make the meaning and context within which phylogenetic data is provided by community resources explicit, computable and combinable.

PhyloWS, a web services standard (http://evoinfo.nescent.org/PhyloWS). The PhyloWS standard provides a common application programming interface for phylogenetic web services and resources. Using PhyloWS, phylogenetic data can be queried and uniquely identified over the web in a way that is agnostic of the underlying implementation (e.g. database schema) of the resource. Community adoption of PhyloWS will facilitate the development of environments within which novel combinations of community resources can be integrated easily.

Together, these represent an interoperability "stack", which we call the "EvoIO stack", designed to support a global network of interoperable resources. TreeBASE can now represent its contents using CDAO and serialize it using NeXML. These data can be searched using the PhyloWS standard for programmatic access to phylogenetic data resources.
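As a rough illustration of what "predictable and validatable" serialization offers a client, the sketch below reads taxon labels and tree edges from a small XML fragment using only the Python standard library. The fragment is modeled loosely on NeXML and has not been validated against the schema; the identifiers, labels and the helper name `read_tree` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace prefix mapping used for lookups below.
NS = {"nex": "http://www.nexml.org/2009"}

# Simplified fragment modeled loosely on NeXML (assumption: not schema-validated).
DOC = """<nexml xmlns="http://www.nexml.org/2009" version="0.9">
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
  <trees id="trees1" otus="taxa1">
    <tree id="tree1">
      <node id="n0" root="true"/>
      <node id="n1" otu="t1"/>
      <node id="n2" otu="t2"/>
      <edge id="e1" source="n0" target="n1" length="1.0"/>
      <edge id="e2" source="n0" target="n2" length="1.0"/>
    </tree>
  </trees>
</nexml>"""


def read_tree(xml_text):
    """Return (tip node id -> taxon label, list of (source, target, length))."""
    root = ET.fromstring(xml_text)
    # Map each OTU id to its human-readable label.
    labels = {o.get("id"): o.get("label")
              for o in root.iterfind("nex:otus/nex:otu", NS)}
    tree = root.find("nex:trees/nex:tree", NS)
    # Only nodes that reference an OTU are tips with a taxon label.
    tips = {n.get("id"): labels[n.get("otu")]
            for n in tree.iterfind("nex:node", NS) if n.get("otu")}
    edges = [(e.get("source"), e.get("target"), float(e.get("length")))
             for e in tree.iterfind("nex:edge", NS)]
    return tips, edges


tips, edges = read_tree(DOC)
```

Because taxa are declared once and referenced by identifier, a client can attach further annotations to a taxon or node without re-parsing labels embedded in a tree string.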

Thus, NESCent's past efforts have increased the potential for a global network of interoperable phylogenetic resources to emerge. Yet, it is not clear how such a network can emerge. The EvoIO stack is still in its early stages, and it is not easy for novices to learn and apply. Most researchers do not know that it exists, nor are they aware of the benefits it might bring. The sociology and infrastructure of science (its system of hiring, promotion and funding) does not encourage the kind of forward-looking technology changes that would benefit the entire community without providing direct and tangible benefits to a resource-provider. Many significant phylogenetic resources remain "side projects" that are not directly supported.

Here we propose to remedy this situation by forming a working group which will organize a series of hackathons (intensive, face-to-face meetings with a focus on collaborative software development). Such meetings have been successfully organized for evolutionary informatics (Lapp, et al., 2007; Lapp, et al., 2009) and related fields (Katayama, et al., 2010). The hackathons we propose will promote synthesis at a technology-oriented level, first of all by the implementation of small-scale success stories where hackathon participants, self-organized in small projects, produce a deliverable within the scope of the working group. For example, a data resource with a non-standard way of identifying its data objects implements identification of these same objects using PhyloWS identifiers; a software tool or library implements reading and writing of NeXML. As multiple such deliverables are produced in parallel, bridges can be built between them. For example, if projects can process the same types of data serialized in the same format and addressed in the same way, they can exchange data with one another to present an integrated view; or, a third project can generate such views by "mashing up" the data from any number of standards-compliant resources. Self-organization to achieve this in a number of parallel projects will increase connectivity of a network of interoperable community resources, which in turn will increase the capacity to provide end-users with seamless integration of resources and improve the capacity for integrative and high-throughput projects. The repeated deployment of the standards and technologies in the EvoIO stack will improve their ease of installation and robustness and will create reference implementations that serve as examples and testbeds.
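The "mash-up" idea can be sketched in a few lines. The example below assumes, purely for illustration, that two resources already expose records keyed by the same taxon names; the record layouts, identifiers, URLs and the function name `mash_up` are hypothetical and do not describe any existing resource.

```python
def mash_up(tree_records, image_records):
    """Join tree records and image records on a shared taxon name."""
    by_taxon = {}
    # Index every tree under each taxon it samples.
    for rec in tree_records:
        for taxon in rec["taxa"]:
            entry = by_taxon.setdefault(taxon, {"trees": [], "images": []})
            entry["trees"].append(rec["id"])
    # Attach images only for taxa that also appear in some tree.
    for rec in image_records:
        entry = by_taxon.get(rec["taxon"])
        if entry is not None:
            entry["images"].append(rec["url"])
    return by_taxon


# Hypothetical records from a tree resource and an image repository.
trees = [
    {"id": "tree1", "taxa": ["Homo sapiens", "Pan troglodytes"]},
    {"id": "tree2", "taxa": ["Pan troglodytes"]},
]
images = [
    {"taxon": "Pan troglodytes", "url": "http://example.org/img/1"},
    {"taxon": "Gorilla gorilla", "url": "http://example.org/img/2"},
]

view = mash_up(trees, images)
```

The join is only this simple when both resources identify taxa in the same way, which is the kind of shared addressing the paragraph above argues for.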

The working group seeks to promote the adoption of common standards and stack technologies in evolutionary informatics. This will be done by raising awareness of their existence and utility in the wider network of community resources, and by sharing knowledge on how to implement and extend them. New programmer/scientists, such as undergraduate students, graduate students and postdocs, will be trained in the application of software development best practices (agile programming, test-driven development, effective use of revision control and communication tools) to implement standards and stack technologies. We envision this to go beyond the hackathons in initiatives such as Google Summer of Code projects or working group graduate fellowships. Key leaders of community resources will learn to leverage the tools to manage software development teams. Effective use of software engineering concepts (formulating use cases, gathering requirements, testing), programming techniques (agile programming, test-driven development) and communication tools (wikis, twitter, revision control, bug trackers) can greatly help teams to get things done.

Proposed Activities

The working group will remain relatively constant in membership and will provide ongoing planning, follow-up and evaluation for all aspects of the project, while each hackathon will have a different set of participants.

Planning and partnering by the working group

Upon success of this proposal, we will recruit additional working group members (up to a total of 8 to 10), including some NESCent staff and resident scientists. The working group will meet at NESCent in mid-2011 to refine its strategic vision, and to begin planning the first hackathon. Subsequently, the working group will hold quarterly teleconferences, and will communicate more frequently using an email list, supplemented with electronic file-sharing. From past experience, the proposers are adept at this mode of collaboration. The working group is responsible for engaging strategic partners and pursuing further financial support as needed to satisfy its aims.

Hackathons

The organization of hackathons will follow the general scheme established by previous NESCent hackathons. A typical hackathon will have around 20 participants for 5 days. The first day will begin with talks on interoperability technology, best practices and common resources, and will end with a process of self-organizing into a small set of task groups (each with 3 to 7 members), based on OpenSpace principles. The subsequent days will be spent on projects. Participants will be instructed to aim for tangible outcomes, either in the form of links or bridges that directly build the network of interoperable resources, or examples (reference implementations, proof-of-concept applications) that serve as paradigms and inspire further work.

Participants will be roughly a 2:1 mixture of invited participants and applicants responding to a widely advertised call for participation. By inviting participants, we take advantage of the fact that real networks (social, biological, computer) have critical nodes with an unusually high degree of connectivity. Key nodes in an interoperable network of tree-related resources might include EoL, iPlant, NCBI, TreeBASE, ToLWeb and others. By issuing an open call for applicants, we expand our network of connections, encouraging participation from under-represented groups and from developers whose resources are not well known or widely used. Applicants will be selected based on technical ability, collaborativeness, strategic opportunities, and diversity.

The proposers do not envision the development of novel resources or databases; rather, the scope of the hackathons is primarily to improve interoperability between existing resources. For three hackathons, we identify the respective areas of interest and projected key participants. The hackathons are ordered such that each builds on the previous one, and so, in addition to key participants and interoperability technology experts, there will be return visitors who will enable the transition from one hackathon to the next.

  • Hackathon 1: Data resources. NESCent (Durham, NC) Winter 2011 - the focus of the first hackathon will be on key data providers. Targets of opportunity for this hackathon will include increased support for uploading, querying and downloading richly annotated data to and from such resources as TreeBASE, Dryad, the Tree of Life web project, PhenoScape, MorphBank, MorphoBank, TimeTree, PhylomeDB and PaleoDB.
  • Hackathon 2: Data integration environments. Field Museum (Chicago, IL) Summer 2012 - the second hackathon will focus on data integration and exploration environments. Targets for this hackathon will take advantage of the accomplishments of the first hackathon and will include aggregating data via phyloreferencing and taxon referencing, and integrating data exploration environments with image repositories. Key roles will be played by BioSync, TOLKIN, PhyLoTa, DendroPy, pPOD and the iPlant discovery environment.
  • Hackathon 3: Visualization tools. University of Arizona (Tucson, AZ) Winter 2012 - the third hackathon will focus on visualization of rich phylogenetic data as is available from data providers and data integration environments. Participants of this hackathon will build on the results of the preceding hackathons to enable visualization of semantically annotated phylogenies that incorporate character state changes and other biological events such as speciations, extinctions, gene duplications and metadata such as images and links to external resources. Key participants include iPlant, Mesquite, PhyloBox, jsPhyloSVG, PhyloWidget and Archeopteryx.

Follow-up

Hackathons produce tangible outcomes such as proof-of-concept software and code revisions, as well as intangible human outcomes such as cohesion around standards and best practices, enthusiasm to face technical challenges, awareness of available resources, and opportunities for collaboration. From experience, we are confident that these intangible outcomes continue to bear fruit long after the hackathon; and we are equally confident that the same cannot be said for the tangible outcomes. To address this problem we have a specific plan for follow-ups. Firstly, we will make clear that we expect each project to lead to one of the following outcomes: i) a sustainable software product, stand-alone or incorporated into an existing code base; ii) a demo or proof-of-concept that is used to gain further support; iii) a technical publication of some kind; or iv) a plan for a Google Summer of Code project, a graduate fellowship or a visiting scientist visit. Secondly, we will keep a list of projects, participants, and current status and put this at the top level of the working group web site. Each hackathon project will be assigned to a working group member, who will track the status of the project for 9 months. Working group teleconferences will include, as a fixed agenda item, reports on project status. Working group members will be aware that early-career hackathon participants may need guidance and encouragement to turn promising initial results into an outcome of la

Participating Fields and Partial List of Proposed Participants

  • Working group leaders
    • Enrico Pontelli, PhD; New Mexico State University/Computer Science; email
    • Rutger A. Vos, PhD; University of Reading/NeXML; email
    • Arlin Stoltzfus, PhD; University of Maryland/NIST; email
  • Additional working group members
    • Mark Westneat
    • Mike Sanderson
    • To be named, NESCent
    • To be named, individuals from under-represented projects

Hackathon participants are not known in advance. Based on past experience, we intend to be careful in extending invitations and in reviewing applicants who have responded to an open call. Experience indicates that it is not sufficient merely to pick a representative of a targeted project: the individual's ability to collaborate and technical expertise are crucial. These events naturally attract early-career scientists.


Rationale for NESCent support

Many potential hackathon participants belong to NESCent's in-house community, and several of the projects they have developed would make excellent targets for deploying or extending support for the foundational technologies which this proposal seeks to leverage. Examples include Vladimir Gapeyev and his expertise with the TreeBASE project, Ryan Scherle and the Dryad project, Jim Balhoff and the PhenoScape/Phenex project, Karen Cranston and the PhyLoTA Browser, and David Swofford and his many contributions to the field, including PAUP*.

The proposal authors also note the excellent informatics support provided by Hilmar Lapp, Jon Auman and others for a variety of projects; their help in meeting the IT needs for the hackathons will be invaluable. NESCent has world-class IT and logistic resources, and the know-how to take full advantage of these kinds of meetings. In addition to the people and resources noted here, the proposers also recognize NESCent's unique culture of institutional support for initiatives such as ours. In contrast, other granting agencies usually do not support sustainable, ongoing development of infrastructure that serves community needs.

Lastly, the proposers have been instrumental in developing the successful hackathon strategy deployed at NESCent. We also have leveraged support from other organizations, specifically TDWG. Other agencies might easily underestimate the value of the hackathon approach. Indeed, NSF declined a proposal for a phylogenetic Data Interoperability Network (Stoltzfus, et al., 2009) that included many of the same ideas we are proposing here. However, NESCent understands the strengths of this approach, as well as its weaknesses (which we hope to address with this proposal).

Note on budget considerations

We do not expect NESCent to commit more funds to this project than it would to a typical working group. We estimate that the cost of a typical hackathon is 2 to 2.5 times that of a working group meeting (allowing that hackathons outside of NESCent may entail extra costs). Therefore, the potential of our proposal depends on our ability to secure external funds, which so far include the following major commitments:

  • $ 15 K from iPlant to co-sponsor a hackathon at their Arizona location (letter, Dr. S. Goff)
  • $ 10 K from the Biodiversity Synthesis Center of EoL to co-sponsor a hackathon (letter, Dr. M. Westneat)

and the following additional commitments to fund the travel of personnel:

  • 3 person-trips for TOLKIN project personnel (letter, Dr. N Cellinese)
  • 3 person-trips for working group leader (Dr. A Stoltzfus)
  • 2 person-trips for Biodiversity Synthesis Center of EoL project personnel (letter, Dr. M. Westneat)
  • 2 person-trips for Mesquite project personnel (Dr. W. Maddison)
  • 2 person-trips for PhyLoTa project personnel (letter, Dr. M. Sanderson)
  • 1 person-trip for working group leader (Enrico Pontelli)

We estimate that these commitments are sufficient to stretch a normal working group budget to cover 1 working group meeting and 2 hackathons. We intend to secure further support to allow a third hackathon, as our plans mature over the next year. Options to obtain further support include:

  • applying to organizations such as NCBO and the Phenotype RCN for meeting support
  • submitting an NSF workshop proposal to fund a full hackathon
  • requesting travel support from individual grant-funded projects

To support hackathon followups, we have some of the same options, in addition to applying for NESCent short-term visiting scientist funds.

Collaborations with other NESCent Activities

There are many researchers at NESCent whom we would like to invite to the hackathons. For example, Jim Balhoff has implemented NeXML support in the Phenex application. His experience in deploying the data standard to annotate character states with EQ statements will be invaluable for PhenoScape-related projects. Vladimir Gapeyev is the leading developer at NESCent for the TreeBASE project; any hackathon targets that involve additions to TreeBASE will benefit from his contributions. Ryan Scherle's expertise with Dryad will be essential for any hackathon projects that target this data resource. The PhyloWS standard has been developed to a large extent by Hilmar Lapp; in addition, he is a key developer for BioSQL. His participation will be vital. Finally, Jeet Sukumaran will commence a postdoc at NESCent next year. He is the lead developer of DendroPy and one of the core designers of the NeXML standard, and we are very eager to invite him to the hackathons.

Anticipated IT Needs

The working group does not expect long-term maintenance by NESCent of a public resource outside of wikis and mailing lists. We envision the following overall IT needs for the working group:

  • A mailing list for hackathon participants in order to prepare self-organization during the hackathons and follow-up afterwards. This can be a Google group or similar; there is no need for NESCent to host it.
  • A live channel during the hackathons, so that participants in separate groups or rooms stay up to date. We will use freely usable technologies such as FriendFeed, Twitter hash tags and IRC channels; there is no need for NESCent to provide any.
  • A wiki for self-organized projects to report their plans and progress, and to document their deliverables. This will be the wiki at http://www.evoio.org.
  • Facilities for conference calls when organizing the hackathons, and video-conferencing during the hackathons. We would like to use NESCent's infrastructure for this.
  • At the hackathons, ideally each project can use an LCD projector for pair or group programming and wiki review.
  • At the hackathons, WiFi access for all participants.

Proposed Timetable

  • Throughout - The working group will have quarterly teleconferences throughout the period of its mandate.
  • Summer, 2011 - Leaders have filled out the working-group roster. The group meets at NESCent for 3 days to develop cohesion on a strategic vision, and to develop themes for the first hackathon. Planning for the hackathon begins immediately after the meeting.
  • Winter 2011 or Spring, 2012 - Over the past 3 months, the working group has selected applicants. First hackathon takes place, probably at NESCent.
  • Summer or Fall, 2012 - Over the past 3 months, the working group has selected applicants. Second hackathon takes place, probably at the Field Museum in Chicago.
  • Winter 2012 or Spring, 2013 - Over the past 3 months, the working group has selected applicants. Third hackathon takes place at a location to be determined, probably at the University of Arizona at Tucson.

Anticipated Outcomes

I'd like this to have some examples that are even more concrete. Can someone help?

Working group outcomes. The working group will develop a wiki document publicizing its strategic vision for growing the network of interoperable phylogenetic resources. It will maintain a publicly accessible spreadsheet with the current status of hackathon projects. By mid-2012, it will produce a report for publication on progress in achieving its strategic vision. In addition to describing tangible outcomes, this report will serve as a guide for others who may wish to organize scientific programming hackathons.

Hackathon intangible outcomes. Hackathons produce intangible outcomes that operate on an individual level, such as awareness of resources, specific skills training, enthusiasm, and connections. On a community-wide level, we anticipate that hackathons will increase appreciation of the benefits of interoperability, and its connection to standards. This includes appreciation for emerging standards (NeXML, PhyloWS, CDAO, DarwinCore) as well as the need for undeveloped or under-supported standards such as MIAPA and LSIDs.

Hackathon tangible outcomes. Computer code produced from hackathon projects will be open-source and publicly available by the end of the hackathon. The specific nature of these outcomes cannot be predicted reliably. However, we have indicated above that we aim to "grow the network of interoperable phylogenetic resources", and it may be difficult to understand the significance of this proposal without giving some concrete examples of what that might mean.

Increased utilization of next-generation data formats. We anticipate that participants working on tree visualization tools and data resources (listed earlier) will increase their support for NeXML and PhyloXML as preferred formats for phylogenetic data exchange. For end-users, this will mean increased interoperability of software and resources that exchange phylogeny data. Ultimately users will be able to choose software tools for strictly scientific reasons, rather than limiting themselves to a few tools compatible with their existing workflow.

Increased use of web services to import or export phylogenies. For a tree visualization tool that already uses NeXML (above), it is a short step to implement a search interface to access trees directly from TreeBASE or other resources with a PhyloWS web-services interface. Ultimately, every tree viewer will provide access to several data resources, and this will stimulate other data resources to export phylogenies via web services in order to leverage these interface capabilities. This means the end-user will not be limited to trees saved on the user's hard drive; users with special visualization needs can use their preferred tool, rather than the one chosen by the data provider.
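To suggest how short that step could be, the sketch below builds a search URL that a tree viewer might request from a web-services endpoint. It is only an illustration: the base URL, the `/find` path, and the `query` and `format` parameter names are assumptions made for this example and are not taken from the PhyloWS specification. No network request is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; a real resource would publish its own base URL.
BASE_URL = "http://example.org/phylows"


def build_find_url(resource, query, fmt="nexml", base_url=BASE_URL):
    """Build a search URL for one resource type (e.g. 'tree' or 'study').

    The path layout and parameter names are illustrative assumptions.
    """
    params = urlencode({"query": query, "format": fmt})
    return f"{base_url}/{resource}/find?{params}"


# A viewer's search box could turn user input into a request like this.
url = build_find_url("tree", 'taxon="Homo sapiens"')
parts = urlsplit(url)
decoded = parse_qs(parts.query)
```

A viewer that already parses NeXML would then only need to fetch this URL and hand the response to its existing reader.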

Scientific use-cases driving expanded vocabulary support. We consider it highly likely that participating projects will present concrete scientific use-cases that drive specific improvements in artefacts to support representation, such as CDAO and NeXML. For end-users, this means that interoperable resources will cover more of the kinds of information important for their research. In our discussions with stakeholders, it is clear that there is an urgent need for language to annotate methods and to annotate phenotypes.

References

  • Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, et al. 2004. UniProt: the Universal Protein knowledgebase. Nucl. Acids Res., 32:D115-119.
  • Katayama T, Arakawa K, Nakao M, Ono K, Aoki-Kinoshita K, Yamamoto Y, Yamaguchi A, Kawashima S, Chun H-W, Aerts J, et al. 2010. The DBCLS BioHackathon: standardization and interoperability for bioinformatics web services and workflows. The DBCLS BioHackathon Consortium*. Journal of biomedical semantics, 1:8.
  • Lapp H, Bala S, Balhoff J, Bouck A, Goto N, Holder M, Holland R, Holloway A, Katayama T, Lewis P, et al. 2007. The 2006 NESCent Phyloinformatics Hackathon: A Field Report. Evolutionary Bioinformatics, 3:287-296.
  • Lapp H, Stoltzfus A, Vision T, Vos R. 2009. Evolutionary Data Leaping to Web 3.0: Some Highlights From NESCent's Third Hackathon. ASN/SSB/SSE meeting.
  • Maddison D, Schulz K-S, Maddison W. 2007. The Tree of Life Web Project. Zootaxa:19-40.
  • Prosdocimi F, Chisham B, Pontelli E, Thompson J, Stoltzfus A. 2009. Initial Implementation of a Comparative Data Analysis Ontology. Evolutionary Bioinformatics:47-66.
  • Sidlauskas B, Ganapathy G, Hazkani-Covo E, Jenkins K, Lapp H, McCall L, Price S, Scherle R, Spaeth P, Kidd D. 2010. Linking Big: The Continuing Promise of Evolutionary Synthesis. Evolution, 64:871-880.
  • Vos R, Lapp H, Piel W, Tannen V. 2010. TreeBASE2: Rise of the Machines. Nature Precedings doi:10.1038/npre.2010.4600.1.

Short CV of Project Leaders (2 pages for each)

("Do not include talks, society memberships, nor papers in preparation.")

Appendices

I nagged Greg, Mike and Val again for letters. Let's hope we get them tomorrow! RutgerVos 00:18, 1 December 2010 (UTC)