Phylotastic/Datastore

From Evolutionary Interoperability and Outreach
Revision as of 17:38, 7 June 2012 by Hilmar (talk | contribs) (moved PhylotasticDatastore to Phylotastic/Datastore)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Datastore subgroup

The datastore subgroup is keeping notes on a google doc. The rest of this page is pre-hackathon notes about potential approaches.

About

Information on various options that could be used as datastores to host the trees underlying the phylotastic service that prunes trees.

Requirements

A place to document requirements for a phylotastic data store.

Ontology

Should the datastore include ontology support? This would allow for flexible markup of node attributes, as well as flexible attribution of source trees.

Nodes and edges

Will the use of the data store require nodes and edges store in separate tables (if SQL used)?

Datastores

RDMBS

Options for trees/hierarchies in RDMBS

This documentation copied from options for storing hierarchical data in a relational database

  • Adjacency List:
    • Columns: ID, ParentID
    • Easy to implement.
    • Cheap node moves, inserts, and deletes.
    • Expensive to find level (can store as a computed column), ancestry & descendants (Bridge Hierarchy combined with level column can solve), path (Lineage Column can solve).
    • Use Common Table Expressions in those databases that support them to traverse.
  • Nested Set (a.k.a Modified Preorder Tree Traversal)
    • First described by Joe Celko - covered in depth in his book Trees and Hierarchies in SQL for Smarties
    • Columns: Left, Right
    • Cheap level, ancestry, descendants
    • Compared to Adjacency List, moves, inserts, deletes more expensive.
    • Requires a specific sort order (e.g. created). So sorting all descendants in a different order requires additional work.
  • Nested Intervals
    • Combination of Nested Sets and Materialized Path where left/right columns are floating point decimals instead of integers and encode the path information. In the later development of this idea nested intervals gave rise to matrix encoding.
  • Bridge Table (a.k.a. Closure Table: some good ideas about how to use triggers for maintaining this approach)
    • Columns: ancestor, descendant
    • Stands apart from table it describes.
    • Can include some nodes in more than one hierarchy.
    • Cheap ancestry and descendants (albeit not in what order)
    • For complete knowledge of a hierarchy needs to be combined with another option.
  • Flat Table
    • A modification of the Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
    • Expensive move and delete
    • Cheap ancestry and descendants
    • Good Use: threaded discussion - forums / blog comments
  • Lineage Column (a.k.a. Materialized Path, Path Enumeration)
    • Column: lineage (e.g. /parent/child/grandchild/etc...)
    • Limit to how deep the hierarchy can be.
    • Descendants cheap (e.g. LEFT(lineage, #) = '/enumerated/path')
    • Ancestry tricky (database specific queries)

RDBMS

Feature MySQL PostgreSQL
Max allowed packet (a limit on the list of species for trimming the tree) By default this is 1 MB but can be increased
Forced Referential integreity Yes (with InnoDB tables but these generally are slower than MyISAM tables) Yes
Stored procedures Yes (> MySQL 5.0) However does not accept arrays (ie List of leaf nodes) as variables in stored procedures. Yes

Schema

The following schema support storing trees.

Schema Supported RDBMS Ontology Support
BioSQL::Phylo PostgreSQL Yes
EnsembleCompara PostgreSQL, MySQL No, but it is possible with schema extensions that add Chado controlled vocabulary tables.
Chado::Phylogeny PostgreSQL, MySQL Yes (OBO, OWL with conversion to OBO)

Additional Information

NOSQL

Examples of storing trees in MongoDB.

Hadoop

HBASE

HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Tolomatic

Rutger's documentation on using MapReduce

Additional Information