PhylotasticDataPolicy

From Evolutionary Interoperability and Outreach

discussion

Arlin:

I'd like to ask for your comments on an issue that will come up at a future LT meeting, namely the issue of a data policy.
We have created a system that exposes (to the general public) a variety of data on phylogenies, naming systems, fossil dates, etc., from a variety of sources. Almost none of this is covered by a clear licensing agreement (an exception would be the ToLWeb tree). In the absence of a clear licensing agreement or condition of publication, scientific data may have other strings attached: data providers may have customary expectations that represent implicit obligations (things that we want to do in order not to aggravate partners), such as an obligation to ask the authors nicely whether you can use "their" data for a particular purpose, an obligation not to pass the data along to a third party, or an obligation to incorporate updates or revisions in a timely manner.
Ideally we will come up with a policy that makes clear, both to our partners who provide data, and to users, the conditions under which we redistribute data.
If you have comments on this issue, please share them. Your comments are particularly valuable if you have experience with a project that redistributes data from many different authors (e.g., Dryad, TreeBASE).

Rutger's comments (email, August 6, 2012)

Thank you Arlin for raising this issue. With TreeBASE the general thinking is that data are facts, which consequently can be shared, and that the abstracts we post fall under 'fair use' policies. This, however, is not ideal - in some parts of the world, databases are considered works of art, not repositories of facts, so automated harvesting (as Phylotastic clients would do) cannot necessarily be assumed to be OK.
I would argue for a model that is as permissive as possible but still keeps our "upstream" data providers (i.e. the people and places that we get the megatrees from) happy.

comments from Hilmar (August 10, 2012 email)

I think you're raising the good point that we're reusing data from others by aggregating and processing it, and that our users will by definition be reusing the data we aggregated, and therefore also the data from our sources. Part of our data policy should therefore be to conspicuously state what our data sources are, to give attribution to them in the way they ask to be attributed, and to ask our users to do so as well. The policy should give users guidance on what we expect from them in terms of citing and crediting our sources, as well as ourselves.
(This may be less straightforward for some questions than one might think. For example, if a TNRS queries 5 providers, but only 1 gives a good hit, should a user still credit all 5 sources? If there are rules, they need to be very simple, or they are too hard to follow. The TNRS could also return the citation(s) as part of the results - some data aggregators are doing that.)
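To make the "return the citation(s) as part of the results" idea concrete, here is a minimal Python sketch. It assumes one possible simple rule (credit only the providers that actually returned a hit), which is an illustration and not a decision made in this discussion. The names `ProviderResult` and `resolve_with_citations` and the shape of the returned dictionary are hypothetical and do not describe any existing TNRS API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProviderResult:
    """One provider's answer to a name query (hypothetical structure)."""
    provider: str                 # label of the queried source
    citation: str                 # citation text the source asks users to give
    matched_name: Optional[str]   # None when the provider had no good hit


def resolve_with_citations(query: str, results: List[ProviderResult]) -> Dict:
    """Bundle matches with the citations of the providers that produced them.

    Assumed rule: a provider is credited only if it contributed a hit.
    """
    hits = [r for r in results if r.matched_name is not None]
    return {
        "query": query,
        "matches": [
            {"provider": r.provider, "name": r.matched_name} for r in hits
        ],
        # De-duplicated and sorted so the citation list is stable.
        "citations": sorted({r.citation for r in hits}),
    }
```

With five providers queried and one good hit, the response would carry a single citation, so the user does not have to work out which sources to credit.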
When it comes to redistributing data, this can be a tricky question depending on which encumbrances you accept for source data. The API will obviously redistribute data, or otherwise it can't function, and so a data provider who allows use of their data will be content with that, or using their data is a non-starter by definition. Also, if the source data is in the public domain or available for download, reuse, and redistribution without restrictions (e.g., ITIS, NCBI), there are no problems no matter how the data is redistributed. It becomes tricky for non-open [1] source data and an aggregator app with API calls that would allow one to download much or all of the data through the API. This is one reason, for example, that GBIF has result limits on their API calls - most museum collections providers loathe the possibility that someone might download all of their holdings through GBIF's powerful API. Therefore, I'd argue it's best to stay away from non-open data. Open data as a source requires no odd engineering of artificial limits on APIs, and we can simply point to the sources for bulk download.
Another aspect in which redistribution becomes relevant is if we modify source data on a large scale, for example when we compute and add branch lengths to trees. Again, source data that has encumbrances is problematic here: because we've modified the data, for reproducibility reasons we really do want to provide bulk downloads of it, and we can't just point back to the sources because the data has been modified. With open data as sources, this is not a problem, but with encumbered data it is. Specifically for the case of trees in publications, at least in the US trees are not included in the copyright a publisher holds for the article if the article is non-OA (which most papers still are). At least this is how we (Dryad, Phenoscape, TreeBASE) have been treating them. So one can extract trees (and small or large makes no difference here) from articles, including their SOMs, put them all together, and redistribute the collection freely - *provided* attribution is given.
So overall, I think that, short of a full-blown policy having been written, what can be done right now to do right by our data sources and to promote the culture of data sharing is to give correct and full citations for all sources, taxonomies as well as large trees, in all tools we have created. The "mammal tree by Bininda-Emonds" is not a citation.