The hype and disinformation that grudgingly prevails in the data warehouse world today brings me to raise the debate to a rational level. Let’s set aside 3NF and STAR schemas for a moment and the many flavors of analytics along with all their technologies. Let’s temporarily ignore e-commerce, database migrations, business intelligence, and data collection and processing systems. Instead let’s look at three different data storage methodologies. These are:
- NoSQL – very new, lots of hype, and which really means ‘NOT ONLY SQL’
- ROW – your traditional record database, well known and loved
- COLUMN – still relatively new, widely misunderstood, yet still feels like normal SQL
To look at these three together I think we must first look at them separately. So here goes…
The ROW based database storage methodology is one most of us are already familiar with. Depending upon your vendor of choice (like Oracle, Microsoft, MySQL, DB2, etc…) DDL and DML syntax creates tables that stores and retrieves records. Largely based upon some form of key, be it natural or surrogate (let’s debate the many issues of schema design another time). The relational data model thrives upon the ROW based database and is widely used for many OLTP and OLAP system and/or applications. Highly efficient in complex schema designs and SQL queries, ROW based database engines offer a tried and true way to build solid solutions. We should not throw this away, I won’t!
The COLUMN based database storage methodology has been around for a while as an alternative to ROW based databases from various new vendors (like InfoBright, Vertica, Sybase IQ, etc…). Generally the DDL and DML syntax is similar to ROW based databases, yet under the hood things are usually radically different, and much more efficient for processing aggregations. This is the main thing that sets it apart from ROW based engines. Some of these column based technologies also provide high data storage compression which allows for a much smaller disk footprint. In some cases as much as 10/1 over their row based counterpart. We should adopt this where appropriate, I am!
The NoSQL based storage methodology (notice I don’t call it a database) is the new kid on the block which many vendors vying for your attention (like Hadoop, Cassandra, MongoDB, etc…). Many people view the NoSQL technology as the replacement to ROW or COLUMN based databases, but let me say right off, this is the wrong way to think of NoSQL. Instead, as a highly optimized, highly scalable, high performance Distributed File System. Yet the NoSQL storage capabilities offer striking features simply not practical with ROW or COLUMN databases.
Let’s however be very clear about what NoSQL is.
While there are three main variants (which I will cover shortly), NoSQL technologies address narrow yet important business needs. Most NoSQL vendors support structured, semi-structured, or non-structured data which can be very useful indeed. The real value, I believe, comes in the fact that NoSQL can ingest HUGE amounts of data, very fast. Forget Gigabytes, and even Terabytes, we are talking Petabytes! Gobs and gobs of data! With clustering support and multi-threaded inner-workings, scaling to the future expected explosion of data will seem a no-brainer with a NoSQL environment in play. Let’s get excited, but temper it with the understanding that NoSQL is COMPLIMENTARY and not COMPETITIVE to ROW and COLUMN based databases. And also note that NoSQL is NOT A DATABASE but a high performance distributed file system and really great at dealing with lots and lots of data; did I say BIG DATA!
I mentioned that there are three main variations of NoSQL. These include:
- Key Value – which support fast transaction inserts (like an internet shopping cart); Generally stores data in memory and great for web. applications that need considerable in/out operations
- Document Store – which stores highly unstructured data as named value pairs; great for web traffic analysis, detailed information, and applications that look at user behavior, actions, and logs in real time.
- Column Store – which is focused upon massive amounts of unstructured data across distributed systems (think Facebook & Google); great for shallow but wide based data relationships yet fails miserably at ad-hoc queries
So to bring these three very different database storage technologies into a conjoined perspective, I think it behooves us all to consider that essentially we need all three. Regardless of what type of system being built, I’ve always subscribed to the notion that one should use the right tool for the job. You just gotta know what those are!
In summary let me say this: I believe, generally speaking, that each of these three data storage technologies offer specific features and therefore should be used in specific ways.
- ROW based databases should prevail when you want a complex, but not too-huge data set that requires efficient storage and retrieval for OLTP and even some OLTP usage;
- COLUMN based database are clearly aimed at analytics; optimized for aggregations coupled with huge data compression and should be adopted for most business intelligence usage;
- NoSQL based data solutions step in when you need to ingest BIG DATA, fast, Fast, FAST… and when you only really need to make simple correlations across the data quickly;
Well, there you have it… most of it, in my humble opinion anyway!