Datastore subgroup

The datastore subgroup is keeping notes on a google doc. The rest of this page is pre-hackathon notes about potential approaches.

About

Information on various options that could be used as datastores to host the trees underlying the phylotastic service that prunes trees.

Requirements

A place to document requirements for a phylotastic data store.

Ontology

Should the datastore include ontology support? This would allow for flexible markup of node attributes, as well as flexible attribution of source trees.

Nodes and edges

Will the use of the data store require nodes and edges store in separate tables (if SQL used)?

Datastores

RDMBS

Options for trees/hierarchies in RDMBS

This documentation copied from options for storing hierarchical data in a relational database

Adjacency List:
- Columns: ID, ParentID
- Easy to implement.
- Cheap node moves, inserts, and deletes.
- Expensive to find level (can store as a computed column), ancestry & descendants (Bridge Hierarchy combined with level column can solve), path (Lineage Column can solve).
- Use Common Table Expressions in those databases that support them to traverse.
Nested Set (a.k.a Modified Preorder Tree Traversal)
- First described by Joe Celko - covered in depth in his book Trees and Hierarchies in SQL for Smarties
- Columns: Left, Right
- Cheap level, ancestry, descendants
- Compared to Adjacency List, moves, inserts, deletes more expensive.
- Requires a specific sort order (e.g. created). So sorting all descendants in a different order requires additional work.
Nested Intervals
- Combination of Nested Sets and Materialized Path where left/right columns are floating point decimals instead of integers and encode the path information. In the later development of this idea nested intervals gave rise to matrix encoding.
Bridge Table (a.k.a. Closure Table: some good ideas about how to use triggers for maintaining this approach)
- Columns: ancestor, descendant
- Stands apart from table it describes.
- Can include some nodes in more than one hierarchy.
- Cheap ancestry and descendants (albeit not in what order)
- For complete knowledge of a hierarchy needs to be combined with another option.
Flat Table
- A modification of the Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
- Expensive move and delete
- Cheap ancestry and descendants
- Good Use: threaded discussion - forums / blog comments
Lineage Column (a.k.a. Materialized Path, Path Enumeration)
- Column: lineage (e.g. /parent/child/grandchild/etc...)
- Limit to how deep the hierarchy can be.
- Descendants cheap (e.g. LEFT(lineage, #) = '/enumerated/path')
- Ancestry tricky (database specific queries)

RDBMS

Feature	MySQL	PostgreSQL
Max allowed packet (a limit on the list of species for trimming the tree)	By default this is 1 MB but can be increased
Forced Referential integreity	Yes (with InnoDB tables but these generally are slower than MyISAM tables)	Yes
Stored procedures	Yes (> MySQL 5.0) However does not accept arrays (ie List of leaf nodes) as variables in stored procedures.	Yes

Schema

The following schema support storing trees.

Schema	Supported RDBMS	Ontology Support
BioSQL::Phylo	PostgreSQL	Yes
EnsembleCompara	PostgreSQL, MySQL	No, but it is possible with schema extensions that add Chado controlled vocabulary tables.
Chado::Phylogeny	PostgreSQL, MySQL	Yes (OBO, OWL with conversion to OBO)

Additional Information

NOSQL

Examples of storing trees in MongoDB.

Hadoop

HBASE

HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Apache HBase Home

Tolomatic

Rutger's documentation on using MapReduce

Additional Information

Meet the Big Data Equivalent of the LAMP Stack

Phylotastic/Datastore

Contents

Datastore subgroup

About

Requirements

Ontology

Nodes and edges

Datastores

RDMBS

Options for trees/hierarchies in RDMBS

RDBMS

Schema

Additional Information

NOSQL

Hadoop

HBASE

Tolomatic

Additional Information

Navigation menu

Phylotastic/Datastore

Datastore subgroup

About

Requirements

Ontology

Nodes and edges

Datastores

RDMBS

Options for trees/hierarchies in RDMBS

RDBMS

Schema

Additional Information

NOSQL

Hadoop

HBASE

Tolomatic

Additional Information

Navigation menu

Search