The datastore subgroup is keeping notes on a google doc. The rest of this page is pre-hackathon notes about potential approaches.
Information on various options that could be used as datastores to host the trees underlying the phylotastic service that prunes trees.
A place to document requirements for a phylotastic data store.
Should the datastore include ontology support? This would allow for flexible markup of node attributes, as well as flexible attribution of source trees.
Nodes and edges
Will the use of the data store require nodes and edges store in separate tables (if SQL used)?
Options for trees/hierarchies in RDMBS
This documentation copied from options for storing hierarchical data in a relational database
- Adjacency List:
- Columns: ID, ParentID
- Easy to implement.
- Cheap node moves, inserts, and deletes.
- Expensive to find level (can store as a computed column), ancestry & descendants (Bridge Hierarchy combined with level column can solve), path (Lineage Column can solve).
- Use Common Table Expressions in those databases that support them to traverse.
- Nested Set (a.k.a Modified Preorder Tree Traversal)
- First described by Joe Celko - covered in depth in his book Trees and Hierarchies in SQL for Smarties
- Columns: Left, Right
- Cheap level, ancestry, descendants
- Compared to Adjacency List, moves, inserts, deletes more expensive.
- Requires a specific sort order (e.g. created). So sorting all descendants in a different order requires additional work.
- Nested Intervals
- Combination of Nested Sets and Materialized Path where left/right columns are floating point decimals instead of integers and encode the path information. In the later development of this idea nested intervals gave rise to matrix encoding.
- Bridge Table (a.k.a. Closure Table: some good ideas about how to use triggers for maintaining this approach)
- Columns: ancestor, descendant
- Stands apart from table it describes.
- Can include some nodes in more than one hierarchy.
- Cheap ancestry and descendants (albeit not in what order)
- For complete knowledge of a hierarchy needs to be combined with another option.
- Flat Table
- A modification of the Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
- Expensive move and delete
- Cheap ancestry and descendants
- Good Use: threaded discussion - forums / blog comments
- Lineage Column (a.k.a. Materialized Path, Path Enumeration)
- Column: lineage (e.g. /parent/child/grandchild/etc...)
- Limit to how deep the hierarchy can be.
- Descendants cheap (e.g. LEFT(lineage, #) = '/enumerated/path')
- Ancestry tricky (database specific queries)
|Max allowed packet (a limit on the list of species for trimming the tree)||By default this is 1 MB but can be increased|
|Forced Referential integreity||Yes (with InnoDB tables but these generally are slower than MyISAM tables)||Yes|
|Stored procedures||Yes (> MySQL 5.0) However does not accept arrays (ie List of leaf nodes) as variables in stored procedures.||Yes|
The following schema support storing trees.
|Schema||Supported RDBMS||Ontology Support|
|EnsembleCompara||PostgreSQL, MySQL||No, but it is possible with schema extensions that add Chado controlled vocabulary tables.|
|Chado::Phylogeny||PostgreSQL, MySQL||Yes (OBO, OWL with conversion to OBO)|
- Good overview table of adjacency lists vs. Nested intervals
- Managing Hierarchical Data in MySQL
- Storing Hierarchical Data in a Database
- Slideshare .. Models for Hierarchical data
- More links on hierarchical data in SQL
Examples of storing trees in MongoDB.
HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.