By Michael Blaha, March 2015
We define agile analytics as the iterative and rapid processing of data for decision support. There is minimal upfront investment. Developers instead process source data and build queries incrementally as business needs emerge.
Big Data and data warehouses are both technologies for realizing agile analytics. They differ in several ways.
- Big Data is architected for fast writing and slow reading. Big Data software stores data as-is without regard for structure or quality. It defers processing until the data is read, when business needs make clear how much structure and cleansing are required. Big Data supports different data representation paradigms, such as key-value stores and graph databases, that ease the phrasing of some queries.
- Data warehouses are architected for slow writing and fast reading. Data warehouses take more time to store data because the data must be structured, cleansed, and integrated before storage. The ETL (extract, transform, load) code for this processing is typically the most costly and time-consuming aspect of building a data warehouse. Queries can run fast, but they must conform to the database structure.
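The write/read tradeoff in the two bullets above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample records, the `clean` rule, and the function names (`lake_write`, `lake_total`, `warehouse_write`, `warehouse_total`) are hypothetical, a plain list stands in for a Big Data store, and an in-memory SQLite database stands in for a warehouse.

```python
import json
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_EVENTS = [
    '{"customer": " Alice ", "amount": "19.50"}',
    '{"customer": "bob", "amount": "5"}',
    '{"customer": "alice", "amount": "oops"}',  # unusable amount
]


def clean(raw):
    """Parse one raw record; return (customer, amount) or None if unusable."""
    try:
        rec = json.loads(raw)
        return rec["customer"].strip().lower(), float(rec["amount"])
    except (ValueError, KeyError, TypeError, AttributeError):
        return None


# Fast write, slow read: store the record as-is, cleanse when querying.
def lake_write(store, raw):
    store.append(raw)


def lake_total(store, customer):
    total = 0.0
    for raw in store:
        row = clean(raw)  # structure and cleansing applied at read time
        if row is not None and row[0] == customer:
            total += row[1]
    return total


# Slow write, fast read: cleanse before storing, then query a fixed schema.
def make_warehouse():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sale (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    return conn


def warehouse_write(conn, raw):
    row = clean(raw)  # structure and cleansing applied at write time
    if row is not None:
        conn.execute("INSERT INTO sale (customer, amount) VALUES (?, ?)", row)


def warehouse_total(conn, customer):
    cur = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sale WHERE customer = ?",
        (customer,),
    )
    return cur.fetchone()[0]
```

After loading the same three records into both stores, `lake_total(store, "alice")` and `warehouse_total(conn, "alice")` return the same total. The difference is where the cleansing cost is paid: on every read in the first case, once per write in the second.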
We agree that Big Data has much promise. Furthermore, we do not see the technology as a fad but rather as an advance that will have staying power. But data warehouses are still important and will remain important. We see Big Data and data warehouses as coexisting into the future. Because of their different tradeoffs, a mix of the technologies can yield the greatest business benefits.
Big Data by its very nature supports agile analytics. You can store data with minimal setup and defer most development effort until there are specific business questions to answer.
Conventional data warehouses carry the overhead of upfront processing in advance of business need. In our projects we have made data warehousing more agile by using a data staging tool and making extensive use of SQL.
- We use a tool to load staging data. See a2bdata.com. The staging tool saves developers much rote ETL coding. Tool processing is more robust and uniform than hand-written ETL code. There are fewer loading errors from interruptions and restarts.
- Many of the tables in a warehouse are straightforward, with one data source and simple processing. For these simple tables, we replace ETL code with SQL. In our experience, SQL is an order of magnitude faster to write than ETL code. A sequence of SQL queries or a SQL view can often suffice.
These two improvements reduce the quantity of ETL code to write and narrow the focus to where ETL is most valuable: large data sets with multiple sources to integrate and data to cleanse.
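As a rough illustration of the second improvement, the sketch below uses Python's built-in `sqlite3` module to define a simple warehouse table as a SQL view over a staging table. The table and column names (`stg_customer`, `dim_customer`) and the sample rows are hypothetical, and the `INSERT` merely stands in for whatever a staging tool would load. It is not taken from the projects described here.

```python
import sqlite3


def build_demo():
    """Create a staging table with sample rows and a view that cleans it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Staging table: raw text columns, as a loading tool might leave them.
        CREATE TABLE stg_customer (cust_id TEXT, cust_name TEXT, country TEXT);
        INSERT INTO stg_customer VALUES
            ('1', '  Alice Smith ', 'us'),
            ('2', 'Bob Jones', 'DE'),
            ('2', 'Bob Jones', 'DE'),   -- duplicate row
            (NULL, 'No Key', 'FR');     -- missing key

        -- One source, simple processing: a view replaces hand-written ETL.
        CREATE VIEW dim_customer AS
        SELECT DISTINCT
            CAST(cust_id AS INTEGER) AS customer_id,
            TRIM(cust_name)          AS customer_name,
            UPPER(country)           AS country_code
        FROM stg_customer
        WHERE cust_id IS NOT NULL;
        """
    )
    return conn


def read_dim_customer(conn):
    """Return the cleaned customer rows in key order."""
    return conn.execute(
        "SELECT customer_id, customer_name, country_code "
        "FROM dim_customer ORDER BY customer_id"
    ).fetchall()
```

Here the view handles type conversion, trimming, case normalization, de-duplication, and rejection of rows without a key. A table that needs several sources integrated or heavier cleansing would still be a candidate for ETL code.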
So when you think about agile analytics, keep Big Data in mind. Also consider data warehousing with the two improvements we have described.