As many organisations are discovering, big data does not start with skilled data scientists or sophisticated algorithms. The journey to implement real-time customer applications or advanced business intelligence begins with the optimal integration of data, regardless of its source, format or volume.
For the past 30 years, the most popular method of integrating data has been a common data storage mechanism. This means that data is physically replicated and stored independently of its point of origin, typically in a Data Warehouse (DWH). By means of an ETL (Extract-Transform-Load) process, this data is pulled from source systems, normalised, stored in a single repository, and made queryable via a common interface. In this way, DWHs make it possible for organisations to have a unified view of their data across heterogeneous sources and formats.
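The ETL flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular DWH product works: the two source systems (`CRM_CSV`, `BILLING_ROWS`), their field names, and the `customers` table are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CRM export in CSV form (assumed field names).
CRM_CSV = "customer_id,full_name,country\n1,Ada Lovelace,uk\n2,Grace Hopper,us\n"

# Hypothetical source 2: billing records with a different schema.
BILLING_ROWS = [
    {"id": "3", "name": "Alan Turing", "country_code": "UK"},
]


def extract():
    """Pull raw records from each source system in its native shape."""
    crm_rows = list(csv.DictReader(io.StringIO(CRM_CSV)))
    return crm_rows, BILLING_ROWS


def transform(crm_rows, billing_rows):
    """Normalise both sources into one schema: (id, name, country, source)."""
    unified = []
    for row in crm_rows:
        unified.append(
            (int(row["customer_id"]), row["full_name"], row["country"].upper(), "crm")
        )
    for row in billing_rows:
        unified.append(
            (int(row["id"]), row["name"], row["country_code"].upper(), "billing")
        )
    return unified


def load(conn, rows):
    """Store the normalised rows in a single repository (the stand-in warehouse)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, name TEXT, country TEXT, source TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run extract, transform and load, returning the warehouse connection."""
    conn = sqlite3.connect(":memory:")
    crm_rows, billing_rows = extract()
    load(conn, transform(crm_rows, billing_rows))
    return conn


if __name__ == "__main__":
    warehouse = run_etl()
    # The common query interface: one SQL statement spans both original sources.
    query = "SELECT name, source FROM customers WHERE country = 'UK' ORDER BY id"
    for name, source in warehouse.execute(query):
        print(name, source)
```

Note that every record now exists twice, once in its source system and once in the `customers` table, which is the physical replication the article refers to.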
However, this physical integration of data has several drawbacks. Deploying a separate system to house data that is continually expanding costs organisations anywhere from hundreds of thousands to several million dollars to maintain. This is because DWHs require organisations to ‘scale up’ for every block increase of data by adding CPUs, hard disks, network cards, and of course, license fees.
These sky-high costs are further exacerbated by data management processes that often require data to be copied many times. For example, disparate data from multiple systems, once extracted, is copied to a staging area where it is transformed and aligned before being copied again and stored in the DWH. Besides requiring additional disk space, this replication process increases latency and complexity, and raises the risk of failures and security breaches.
More recently, instead of a physical data warehouse, many organisations have turned to cloud storage services. By leveraging a distributed and virtualised cloud computing infrastructure, these services enable organisations to pay only for the storage they actually consume. Although not necessarily cheaper than traditional DWHs, cloud services are sometimes preferred because they represent an operating, rather than capital, expense. Importantly, cloud services result in data copies being stored in even more locations, potentially worsening the problem of unauthorised access.
While high data storage costs have always been a problem, the extreme proliferation of data over the past decade has caused this to reach crisis proportions. Due to ongoing advancements in digitisation, data is now estimated.
Author: Ravi Shankar Nair, Chief Technology Officer