Sunday, September 18, 2011

To SQL or NoSQL - that's not the question


We receive a lot of requests from our clients about the applicability of NoSQL technologies on their enterprise systems.
If you haven’t heard about NoSQL by now, we contend that you’ve been in hiding for the past three years. The term NoSQL itself refers to a broad class of data management that is not based on the traditional relational tables in typical RDBMS. Examples include document-based data storage such as MongoDB, key-value pair such as Riak, or column-based storage such as HBase.
While assessing usage of NoSQL, we are often presented with the CAP model which is great from the theoretical perspective, but grasping its implications on real world enterprise systems is difficult.
In this post, we’ll explain some of the design factors and the associated implications of using NoSQL in the context of three data-centric enterprise applications: Portal, Reporting and Analytics. Enterprise transactional systems are not included for obvious reasons.
1. Portal: Anything but the simplest static HTML page has some sort of data access module at the back-end that formulates a query to fetch relevant records or composite views from one or more data stores. Generally, the overall fetching operations on many NoSQL technologies could be several orders faster than RDBMS as one can fetch the data from the shards in a scale-out parallel manner.
But here is where the complication arises: not all NoSQL stores are equally capable. Out of the box, some do not support range queries, some support only simple predicates with constant values, some do not support composite indices or queries that are not based on keys. Thus, while the data fetch may be blazingly fast, the complexity is pushed up to the application layer (string-based joins, anyone?).
Now, if your portal requires data update functionality, the write operation in NoSQL is generally smooth even in high volume because many of them exist largely in memory. In contrast, traditional RDBMS deals with locks, latches, re-do logs and so forth. However, bear in mind that NoSQL does not generally support atomicity, consistency and in most cases proper durability. Thus for applications that require absolute transactional integrity and serialisation, SQL databases are still the top choice.
2. Reporting: Reporting applications inherently employ more complex queries than portal. A typical read query may have multi-table joins, nested joins, filters etc. Any write or update operation may cascade to multiple tables and will have to maintain referential integrity constraints. In general, NoSQL is a poor choice to support these applications (but products like HBase or GFS supports time interval-based and condition-based queries). Further, if you need to prepare your data for reporting by running elaborate batch applications to transform data with many-to-many mappings, there is no easy alternative to trusted SQL-based loading and transformation engines.
3. Analytics: The sweet spot for many NoSQL data stores is on our third set of applications: those that heavily use analytics, especially when it comes to processing to large amount of unstructured data (aka big data). Used in conjunction with MapReduce, NoSQL can parallelise and quickly perform complex operations such as statistical aggregates, filtering, grouping and sorting. The scale-out capabilities of NoSQL databases - e.g., Cloudera and Aster Data Systems -- allow handling of large volumes of unstructured data suitable for analytics. In terms of price/performance, NoSQL quickly gains ground over traditional data warehouses and BI appliances.
So, where does that leave us? From the perspective of application requirements, would you go for SQL or for NoSQL?
Let’s learn from those who have traversed this path.
Take LinkedIn for example. It started by using a RDBMS, but went on to develop a NoSQL data store instead. The primary reason: performance. According to Jay Krep’s blogs, LinkedIn started with a single big database and a cluster of front-end servers, but soon they split the database for a variety of functions like serving up profiles, performing searches, group interactions, networking updates etc. Each function had different application profiles and workload.
It became apparent that LinkedIn may require NoSQL storage for supporting specific high-volume OLTP workloads. This was the genesis of the Voldemort project. Facebook similarly has a mixed architecture with Memcached and MySQL in combination for high performance OLTP (such as message posting to the Wall), and Cassandra for the inbox search mechanism, each serving a different workload and purpose.
From the application perspective, the question is therefore, not of “SQL or NoSQL”. Rather we should embrace a world with SQL and NoSQL. The majority of NoSQL data stores are not general purpose, but application specific. If you are running a SQL shop with 3-tier architecture with pretty heavy business logic and even heavier stored procedures, then there is absolutely no reason to change that. But if you require very fast simple OLTP operations, if you are adding a chat capability to drive social commerce, or if you need to constantly tweak your customer attributes and constraints, then NoSQL should be considered. For your complex queries and data-driven integrity rules and for consistency and durability, RDBMS still rules.

No comments:

Post a Comment