Archiving Everything with Hadoop
By Mark Cusack, Chief Architect, RainStor. December 2014.
Perhaps the most widespread use of Hadoop today is as an archive or ‘data lake.’ In this use case, data from a wide variety of sources and formats lands on Hadoop, where it is stored, transformed, and mined for new business insights.
Enterprise Data Warehouses (EDWs) are a primary source of data for the Hadoop archive. The cheap, scalable and resilient storage Hadoop provides makes it an ideal centralized location for combining data sets from EDWs across the business.
It also provides a great opportunity to offload older data from the EDW to ‘buy back’ capacity in terms of space and workload.
However, there’s more to offloading EDW data to Hadoop than simply dumping it into HDFS. What if you want to move it back into the data warehouse at some point in the future? What if the offloaded data is sensitive in some way? What if you want to keep the archived data online and accessible?
An EDW archiving solution for Hadoop must provide three key features:
• Schema preservation. The archive must accurately mirror the schema of the source EDW. It is important to ensure that data values will be archived without loss of precision. Changes to the source schema, for example adding new columns or changing data types, should also be captured by the archive. This allows the archive to grow organically over a long period of time while maintaining a continuous historical record of the changes to the schema and the data in the source EDW.
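One way to picture this requirement is an archive that records every schema version it has seen, rather than only the latest one. The sketch below is purely illustrative (the class, table, and column names are invented, not from any real product): each change in the source EDW appends a new timestamped schema version, so loads made under an older schema remain interpretable.

```python
from datetime import datetime

class ArchivedTable:
    """Toy model of a schema-preserving archive table (illustrative only)."""

    def __init__(self, name, columns):
        self.name = name
        # History of (effective_time, schema) pairs, newest last.
        self.schema_history = [(datetime.now(), dict(columns))]

    @property
    def current_schema(self):
        return self.schema_history[-1][1]

    def evolve(self, new_columns):
        """Record a new schema version when the source EDW schema changes."""
        merged = dict(self.current_schema)
        merged.update(new_columns)  # add new columns or widen existing types
        self.schema_history.append((datetime.now(), merged))

# A 'sales' table gains a 'region' column in the source EDW; the archive
# keeps both versions, so data loaded earlier is still queryable as-is.
sales = ArchivedTable("sales", {"id": "BIGINT", "amount": "DECIMAL(18,2)"})
sales.evolve({"region": "VARCHAR(32)"})
```

A real implementation would also have to map each stored record batch to the schema version in force when it was loaded; the history list here only hints at that bookkeeping.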
• Governance and security. Archived data generally inherits the same governance requirements as the EDW. If the source records are regulated by PCI-DSS for example, then the archived records are also subject to these controls. The archive must provide access to data on a ‘need to know’ basis; it must guarantee that sensitive data is encrypted or masked, and that access is audited. An archive must also integrate with the same enterprise security infrastructure as the EDW.
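To make the masking requirement concrete, here is a minimal sketch of two common techniques for PCI-DSS-style data: truncating a card number (PAN) so that only the last four digits remain visible, and one-way tokenizing a field that must stay joinable but unreadable. The helper names and the salt handling are assumptions for illustration, not any product's API.

```python
import hashlib

def mask_pan(pan: str) -> str:
    """Replace all but the last four digits of a card number with '*'."""
    digits = pan.replace(" ", "")
    return "*" * (len(digits) - 4) + digits[-4:]

def pseudonymize(value: str, salt: str) -> str:
    """One-way token for fields that must remain joinable but unreadable.

    Hypothetical scheme: salted SHA-256, truncated to 16 hex characters.
    """
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

masked = mask_pan("4111 1111 1111 1111")   # -> "************1111"
token = pseudonymize("alice@example.com", salt="archive-secret")
```

In practice these transformations would be applied by the archive's access layer per user or role, with every access audited, rather than baked into the stored data.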
• SQL support. Support for SQL access to the archived data is a must. Users want their EDW reports to run against the archive out of the box. They want to be able to run interactive queries against the data, as well as batch reporting. Also, the same business intelligence tools that they use to analyze data on the EDW must be supported by the archive. However, because the data in the archive is typically older, the strict SLAs and mixed workloads that EDWs are designed to address are often not required for an archive.
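The point about reports running ‘out of the box’ is that the SQL itself should not change when it is pointed at the archive. The sketch below uses Python's built-in sqlite3 purely as a stand-in for a SQL-on-Hadoop engine reached over JDBC/ODBC; the table and the monthly report are invented for illustration.

```python
import sqlite3

# In-memory database standing in for the archive's SQL endpoint.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_archive (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_archive VALUES (?, ?)",
    [("2013-01-15", 120.0), ("2013-02-10", 80.0), ("2013-02-20", 50.0)],
)

# The same monthly report that ran on the EDW runs unchanged on the archive.
report = conn.execute(
    """SELECT substr(sale_date, 1, 7) AS month, SUM(amount) AS total
       FROM sales_archive
       GROUP BY month
       ORDER BY month"""
).fetchall()
```

Because archive queries rarely face the EDW's strict SLAs, an engine tuned for scan-heavy historical queries can serve these reports economically.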
Archiving on Hadoop provides a cost-effective way of storing records for longer. When combined with state-of-the-art compression technologies, the idea of storing data indefinitely with Hadoop becomes a real possibility. The winners are the analysts and data scientists: offloading data and reporting workloads to Hadoop keeps EDWs lean and mean, while at the same time ensuring that huge volumes of historical data remain available for interactive analysis.
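The compression claim is easy to demonstrate in miniature: cold archive columns tend to be highly repetitive, and even a general-purpose codec collapses them dramatically. The figures below come from running zlib on synthetic data, not from any specific product.

```python
import zlib

# A long, repetitive status-code column, typical of cold archive data.
column = ("OK," * 9 + "FAIL,") * 10_000
raw = column.encode()

packed = zlib.compress(raw, level=9)
ratio = len(raw) / len(packed)  # well over 20x on data this repetitive
```

Purpose-built archive compression (columnar encoding, deduplication) typically does considerably better still, which is what makes indefinite retention economically plausible.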