A week or so ago the topic of NoSQL databases came up in a discussion at work, so I decided it was time I learnt about this topic.
A few days ago Julie Lerman (@julielerman) tweeted that she had published an article in MSDN Magazine on Document Databases (one of the major types of NoSQL databases) and reading that article started me on my research.
I purchased the Wrox Press book “Professional NoSQL” by Shashank Tiwari for my iPad Kindle App and much of this post is gleaned from the first few chapters of that book (as well as from Julie’s article).
In a previous post I introduced the concept of NoSQL databases, and in this post I will discuss the three most common types of NoSQL databases.
Sorted Ordered Column-Oriented Stores
Google, one of the main pioneers of the trend towards NoSQL databases, stores such a sheer volume of data that it uses a system (Bigtable) that stores data in a column-oriented way. This contrasts with typical Relational systems, which are row-oriented.
Each unit of data can be thought of as a set of key-value pairs, with the unit being identified by a “primary key”. Bigtable calls this the “row key”. The units of data are sorted and ordered on the basis of this row key.
So far this isn’t really much different than a Table in a Relational model that has a primary key field and a clustered index.
What makes the store “column-oriented” is that the various pieces of information that define the “record” of data can be divided into groups of columns, or column families. For example, if we are saving information about a person, we may define first_name and last_name fields, which can be grouped in a name column family. Likewise we could define street_address, city and zip_code, which can be grouped in an address column family, and sex and age, which can be grouped in a profile column family.
We now have three column families, or buckets of information. In a column-oriented store, column families are typically defined at configuration or startup, but the individual columns need not be pre-defined.
Within each bucket, only key/value pairs are defined. The row key identifies the unit of data, the column family name identifies the bucket to use, and the column name identifies the individual value within that bucket.
As with many NoSQL databases, there is not really a concept of NULL data: a column that has no value for a given row is simply not stored. New columns can be added at any time, as each one is just another key/value pair in the bucket.
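The column-family layout described above can be sketched as nested dictionaries. This is only a minimal in-memory illustration, not how Bigtable or HBase are implemented; the class name `ColumnFamilyStore`, its methods, the `nickname` column and the sample row keys are all made up for this example.

```python
from collections import defaultdict

# Column families are fixed up front; the columns inside them are not.
COLUMN_FAMILIES = ("name", "address", "profile")


class ColumnFamilyStore:
    """Toy model (hypothetical): row key -> column family -> {column: value}."""

    def __init__(self, families):
        self.families = set(families)
        # Every new row starts with one empty bucket per column family.
        self.rows = defaultdict(lambda: {f: {} for f in self.families})

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # A new column is just one more key/value pair in the bucket.
        self.rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        # A missing column is simply absent; nothing is stored as NULL.
        return self.rows.get(row_key, {}).get(family, {}).get(column, default)

    def scan(self):
        # Yield rows ordered by row key, as a sorted ordered store would.
        for row_key in sorted(self.rows):
            yield row_key, self.rows[row_key]


store = ColumnFamilyStore(COLUMN_FAMILIES)
store.put("person:2", "name", "first_name", "Jane")
store.put("person:1", "name", "first_name", "John")
store.put("person:1", "address", "zip_code", "94301")
store.put("person:1", "profile", "nickname", "JD")  # column never declared

for row_key, families in store.scan():
    print(row_key, families["name"])
```

Note that only the three column families had to be known in advance. The `nickname` column was added on the fly, and `person:2` simply has no address columns at all.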
While data that relates to the same row key will often be stored contiguously, this setup allows the data to be partitioned across multiple computer nodes.
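Because rows are sorted by row key, one simple way to picture the partitioning is to give each node a contiguous range of keys. The sketch below is an assumption about how such a scheme could look, not a description of what any particular product does; the split points and node names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical split points: each node owns a contiguous range of row keys.
SPLIT_POINTS = ["h", "p"]                 # boundaries between key ranges
NODES = ["node-a", "node-b", "node-c"]    # one more node than split points


def node_for(row_key):
    """Return the node whose key range contains row_key."""
    # bisect_right counts how many split points are <= row_key,
    # which is exactly the index of the range the key falls into.
    return NODES[bisect_right(SPLIT_POINTS, row_key)]


for key in ["adams", "jones", "smith"]:
    print(key, "->", node_for(key))
```

Keys that sort next to each other land on the same node, which keeps range scans over neighbouring row keys cheap.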
Examples of Sorted Ordered Column-Oriented Stores include:
- HBase (http://hbase.apache.org) – part of the Hadoop family – used by Facebook, StumbleUpon, Hulu, Yahoo and others
- HyperTable (www.hypertable.org) – used by Baidu, China’s biggest search engine, and Rediff, India’s biggest portal
Key/Value Stores
A Hash Map or Hash Table is the simplest data structure that can hold a set of key/value pairs. Such structures are popular because they are very efficient, with lookups typically taking O(1) time on average. Each key in the set must be unique, so a value can be looked up easily by its key.
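To make the idea concrete, here is a minimal sketch of a key/value store built on Python's built-in `dict`, which is itself a hash table. The class `KeyValueStore` and the sample keys are invented for illustration; real key/value stores add persistence, networking and much more.

```python
class KeyValueStore:
    """Toy key/value store (hypothetical) backed by a hash table."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Keys are unique: setting an existing key replaces its value.
        self._data[key] = value

    def get(self, key, default=None):
        # Average-case O(1) lookup by key.
        return self._data.get(key, default)

    def delete(self, key):
        # Return True if the key existed and was removed.
        if key in self._data:
            del self._data[key]
            return True
        return False


cache = KeyValueStore()
cache.set("session:42", "julie")
cache.set("session:42", "shashank")   # same key, so the value is replaced
print(cache.get("session:42"))
print(cache.get("session:99", "no such key"))
```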
Examples of Key/Value Stores include:
- MemBase (www.membase.org) – built on the popular MemCached Cache – used by Zynga
- Redis (http://redis.io) – used by CraigsList
- Cassandra (http://cassandra.apache.org/) – modelled on Amazon Dynamo’s distributed design with a Bigtable-style column-family data model, developed by Facebook and open sourced – used by Facebook, Digg, Reddit, Twitter and others
Document Databases
One core benefit for object-oriented developers is that we can think of a document as mapping to an object, including any contained collections/objects, although strictly speaking the objects we mean here are those considered to be “aggregate roots”.
Document Databases treat a document as a whole – rather than splitting it into its constituent key/value pairs.
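As a rough illustration of treating a document as a whole, the sketch below stores an order together with its customer details and order lines as one JSON document and reads it back in one piece. Everything here is an assumption made for the example: the `_id` field name, the `save` and `load` helpers and the sample order are hypothetical, and a plain dictionary stands in for the database.

```python
import json

# An "aggregate root": the order owns its customer details and its lines.
order = {
    "_id": "order-1001",
    "customer": {"first_name": "Jane", "last_name": "Doe"},
    "lines": [
        {"sku": "BOOK-1", "quantity": 1, "price": 30.0},
        {"sku": "PEN-7", "quantity": 3, "price": 1.5},
    ],
}

documents = {}  # stand-in for the database: document id -> serialized document


def save(doc):
    # The whole document is written as one unit, nested collections included.
    documents[doc["_id"]] = json.dumps(doc)


def load(doc_id):
    # The whole document comes back as one unit; no joins are needed.
    return json.loads(documents[doc_id])


save(order)
loaded = load("order-1001")
total = sum(line["quantity"] * line["price"] for line in loaded["lines"])
print(loaded["customer"]["last_name"], total)
```

In a Relational model the same order would typically be spread over an orders table and an order lines table and stitched back together with a join.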
This post is getting quite long so I won’t get into further detail on this class of NoSQL databases, as I will be diving into Document Databases in more detail in future posts.
Examples of Document Databases include: