I’ve been researching data persistence recently, and wrote a few notes to try and organise my thoughts. Here’s a summary that might be useful, along with a list of useful sources at the end. I’ve missed out a huge amount (CAP theorem, map-reduce, etc), so this just scratches the surface.
Firstly, what is NoSQL?
NoSQL as a term was coined for a conference of next generation databases in 2009. However it’s not a particularly useful name. My favourite definition comes from the Fowler/Sadalage book ‘NoSQL Distilled’ where they define it as:
“an ill defined set of mostly open-source databases, mostly developed in the early 21st century, and mostly not using SQL”. Hmm…
Some typical characteristics of NoSQL databases are:
- Most of them run well on clusters (except for graph databases) and are designed for very high scale
- They don’t use SQL, though some are getting close
- They don’t have a schema, so you can add new fields without having to define schema changes. This doesn’t mean that there is no schema. There is always a schema, it’s just now in code instead of the database. So engineering best practice around schema migrations still applies…
- They do have transactions but not in the SQL sense. For example a single write to a document database is a transaction which could persist one or many entities, depending upon the design of the document.
There are four main types of NoSQL database:
Key Value Store
I think of these as distributed Dictionaries or HashMaps. You store and retrieve values by key. The values are opaque to the database and joins across multiple values are typically not allowed. As a result they are very fast.
A classic use case is storing session data. They are not recommended when you have relationships between values.
Examples: MongoDB, CouchDB
A shorthand way to think of these is a bit like key value stores except that the values are not opaque. This opens up a world of possibilities. You can store complex structures and query them by something other than the primary key. So for example if you store addresses, keyed by ID, you could also query them by town or zip/post code.
Joins across documents are either not allowed or expensive. However a document can be arbitrarily complex, containing many nested elements, for example, a customer and their recent orders could be stored as a single document. By careful grouping of related information into the same document, lookups can be fast lookups in typical business use cases, as all related information can be returned by a single query.
Document databases are being used for object persistence. Both MongoDB and CouchDB store documents as json. They are also great for logging, where different log messages may have different contents.
Examples: Cassandra, HBase
In a column database a column consists of a key value pair. Collections of columns that are accessed together are called Column Families. Rows within a column family may have many columns, accessed together using a row key. Different rows may have different columns.
If that didn’t make much sense, another way to think of a Column family is as a Table, each of whose rows can have different columns.
It’s also possible to have super-columns, where the value of the key value pair is a collection of columns.
Logging is a great use case for Column databases and they are used heavily in content management systems.
Column databases are sensitive to the query used. Changing the queries may require changing the column family design.
Examples: NeoDB, OrientDB
These are useful when you need to traverse relationships between entities. They allow both entities and relationships to be stored as first class elements, and both entities and relationships can have properties. Relationships can also have directionality. So for example given two people ‘Bob’ and ‘Sue’ with a ‘Likes’ relationship between them, it could be one way, so that ‘Bob likes Sue’ is true but ‘Sue likes Bob’ is not.
Graph databases make it easy to construct complex queries involving many entity types. In SQL those queries would involve many joins across different tables.
Should you change?
As always, it depends. I attended an interesting talk by Lisa Phillips from Twitter recently where she said they have loads of cool NoSQL stuff, some of which they invented themselves, but still use MySQL very extensively, to the extent that the default position for all new projects is ‘use MySQL unless it can’t do what you want’.
It also looks like the two worlds are converging a little. For example recent additions to MySQL include a handler socket interface that bypasses the SQL layer, and MariaDB has also added dynamic columns. It’s also possible to use MySQL as a front end to InfiniDB, and the MariaDB team has experimented with using Cassandra as a storage engine.
An excellent introduction to the subject.
Seven Databases in seven weeks
Practical dives into seven real databases, illustrating the strengths of each. Covers: Postgres, Riak, HBase, MongoDB, CouchDB, NeoJ, Redis
Software Engineering Radio:
I also attended a surprisingly good conference in Oxford recently on Database technologies for developers, called All your Base, which I hope they run again next year.