Blog: Correlation, Causation and Big Data

Decision making based on BI and analytics is a well-established industry practice. Numerous BI implementations around the world are used on a regular basis to run operational reports, KPI-based dashboards and analytics, and a lot of decisions are made based on those BI systems.

However, have you ever paused and wondered whether a decision is based on ‘correlation’ or ‘causation’? Let me explain. Two events can be correlated without one causing the other. In other words, just because two events occur together does not necessarily mean that one caused the other or was directly responsible for it. Take, for instance, the hiring of a new sales person followed by an increase in revenue, or the arrival of a new coach followed by a team’s victory in a game. How do we know, for sure, that the new sales person caused the increase in sales, and that it was not some other, purely coincidental factor such as a favorable market trend?
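To make the distinction concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the monthly figures are made up, and the names `market`, `sales_staff` and `revenue` are hypothetical. It compares the plain correlation between sales headcount and revenue with the partial correlation once the market trend is held constant.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Made-up monthly figures (assumption): both headcount and revenue
# were constructed to follow the market trend, plus unrelated noise.
market      = [1, 2, 3, 4, 5, 6, 7, 8]                  # market index
sales_staff = [12, 11, 12, 15, 16, 15, 16, 19]          # sales headcount
revenue     = [115, 115, 135, 135, 145, 165, 165, 185]  # revenue

raw = pearson(sales_staff, revenue)
controlled = partial_correlation(sales_staff, revenue, market)

print(f"staff vs revenue: {raw:.2f}")
print(f"staff vs revenue, controlling for market: {controlled:.2f}")
```

In this toy data the raw correlation comes out at roughly 0.9, while the partial correlation is essentially zero: the apparent link between headcount and revenue is fully explained by the market trend. A near-zero partial correlation does not settle the causal question on its own, but it shows how a third factor can produce a strong correlation between two otherwise unrelated measures.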

And this is where I would like to introduce the concept of a ‘Modern Data Warehouse’. We all know what a DW is; a modern DW is simply a traditional relational DW augmented with a Hadoop implementation for Big Data / unstructured analysis. As soon as you integrate relevant unstructured data into your existing DW implementation, you have added a force multiplier. You can integrate a Hadoop cluster with your existing DW and set up a load process to add that unstructured dimension.

For example, in an ecommerce DW, when a customer transacts on the web site, the relational OLTP system captures the sales record along with the customer and product information. This is stored in the relational DW and can be queried and reported on. However, a critical piece of information is missing that could be important for business decision making. Before buying a particular product, the customer could have clicked on other products that they did not buy. The relational OLTP system and the DW did not capture what the customer did not purchase. The web logs, however, most likely did capture the click stream and could help identify it.

Now imagine that, by some process, we could extract that information from the unstructured web logs and correlate it in the relational DW, so that business users could simply use their existing tools to analyze both ‘what happened’ and ‘what did not happen’. This can be very powerful for creating a marketing campaign, setting prices, or addressing numerous other business use cases.
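As a rough illustration of that ‘some process’, here is a small Python sketch that pulls product views out of web log lines and compares them with purchase records to find what each customer looked at but did not buy. The log format, the `/product/<id>` URL pattern, the function names and the sample data are all hypothetical assumptions. In a real deployment this logic would more likely run as a job on the Hadoop cluster, with the result loaded into the DW.

```python
import re
from collections import defaultdict

# Hypothetical log format (assumption):
#   "<customer_id> <timestamp> GET /product/<product_id>"
LOG_LINE = re.compile(r"^(?P<customer>\S+) \S+ GET /product/(?P<product>\w+)$")

def viewed_products(log_lines):
    """Map each customer to the set of product pages they clicked on."""
    views = defaultdict(set)
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match:  # skip lines that are not product page views
            views[match.group("customer")].add(match.group("product"))
    return views

def viewed_not_purchased(views, purchases):
    """Products each customer viewed but did not buy."""
    result = {}
    for customer, products in views.items():
        missed = products - purchases.get(customer, set())
        if missed:
            result[customer] = sorted(missed)
    return result

# Made-up click stream and sales records, for illustration only.
sample_logs = [
    "c1 2015-01-01T10:00:00 GET /product/p10",
    "c1 2015-01-01T10:01:00 GET /product/p11",
    "c1 2015-01-01T10:02:00 GET /product/p12",
    "c2 2015-01-01T11:00:00 GET /product/p10",
    "c2 2015-01-01T11:05:00 GET /cart",
]
sample_purchases = {"c1": {"p12"}, "c2": {"p10"}}

abandoned = viewed_not_purchased(viewed_products(sample_logs), sample_purchases)
print(abandoned)  # {'c1': ['p10', 'p11']}
```

Here customer `c1` viewed three products and bought only `p12`, so `p10` and `p11` show up as ‘what did not happen’. Customer `c2` bought the only product viewed and is left out of the result.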

Simply put, you can now integrate a Hadoop cluster into an existing RDBMS-based DW. You are not redesigning or rewriting; you are simply augmenting. This gives you tremendous new capability and opens up new business dimensions that were not possible before. Let me know if you have any questions or comments on using Hadoop / Big Data capabilities to augment your existing DW and make it a force multiplier.
