From Evoio
Jump to: navigation, search

Datastore subgroup

The datastore subgroup is keeping notes on a google doc. The rest of this page is pre-hackathon notes about potential approaches.

Rdf model.png


Information on various options that could be used as datastores to host the trees underlying the phylotastic service that prunes trees.


A place to document requirements for a phylotastic data store.


Should the datastore include ontology support? This would allow for flexible markup of node attributes, as well as flexible attribution of source trees.

Nodes and edges

Will the use of the data store require nodes and edges store in separate tables (if SQL used)?



Options for trees/hierarchies in RDMBS

This documentation copied from options for storing hierarchical data in a relational database

  • Adjacency List:
    • Columns: ID, ParentID
    • Easy to implement.
    • Cheap node moves, inserts, and deletes.
    • Expensive to find level (can store as a computed column), ancestry & descendants (Bridge Hierarchy combined with level column can solve), path (Lineage Column can solve).
    • Use Common Table Expressions in those databases that support them to traverse.
  • Nested Set (a.k.a Modified Preorder Tree Traversal)
    • First described by Joe Celko - covered in depth in his book Trees and Hierarchies in SQL for Smarties
    • Columns: Left, Right
    • Cheap level, ancestry, descendants
    • Compared to Adjacency List, moves, inserts, deletes more expensive.
    • Requires a specific sort order (e.g. created). So sorting all descendants in a different order requires additional work.
  • Nested Intervals
    • Combination of Nested Sets and Materialized Path where left/right columns are floating point decimals instead of integers and encode the path information. In the later development of this idea nested intervals gave rise to matrix encoding.
  • Bridge Table (a.k.a. Closure Table: some good ideas about how to use triggers for maintaining this approach)
    • Columns: ancestor, descendant
    • Stands apart from table it describes.
    • Can include some nodes in more than one hierarchy.
    • Cheap ancestry and descendants (albeit not in what order)
    • For complete knowledge of a hierarchy needs to be combined with another option.
  • Flat Table
    • A modification of the Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
    • Expensive move and delete
    • Cheap ancestry and descendants
    • Good Use: threaded discussion - forums / blog comments
  • Lineage Column (a.k.a. Materialized Path, Path Enumeration)
    • Column: lineage (e.g. /parent/child/grandchild/etc...)
    • Limit to how deep the hierarchy can be.
    • Descendants cheap (e.g. LEFT(lineage, #) = '/enumerated/path')
    • Ancestry tricky (database specific queries)


Feature MySQL PostgreSQL
Max allowed packet (a limit on the list of species for trimming the tree) By default this is 1 MB but can be increased
Forced Referential integreity Yes (with InnoDB tables but these generally are slower than MyISAM tables) Yes
Stored procedures Yes (> MySQL 5.0) However does not accept arrays (ie List of leaf nodes) as variables in stored procedures. Yes


The following schema support storing trees.

Schema Supported RDBMS Ontology Support
BioSQL::Phylo PostgreSQL Yes
EnsembleCompara PostgreSQL, MySQL No, but it is possible with schema extensions that add Chado controlled vocabulary tables.
Chado::Phylogeny PostgreSQL, MySQL Yes (OBO, OWL with conversion to OBO)

Additional Information


Examples of storing trees in MongoDB.



HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.


Rutger's documentation on using MapReduce

Additional Information