How to Regroup After Hadoop
And you can support that rationale because, on the one hand, there is a need for something like a Data Lake to deal with the data variety and data silos that have resulted from RDBMS proliferation.
(My colleague Damon Feldman explores three different approaches to unifying content, with a great piece on Data Lakes, Virtual Databases and Data Hubs. SPOILER: Movement, Harmonization and Indexing are the key differentiators.)
On the other hand, Data Lake has become synonymous with the Hadoop ecosystem. And since the universe was making assumptions, it was assumed that Hadoop was all you would ever need, if by “all you would ever need” you mean Hadoop and the ever-changing flavor of the “now.” Once, MapReduce was the way to go, only to be usurped by Apache Spark, which now may be giving way to the Concord framework (by the time this blog is posted, there may be yet another technology that is “all the rage”). In addition to this churn, there is also the cost of integrating open source “fit for purpose” technologies.
And if you did all this then, unfortunately, you could join the legions of organizations that adopted Hadoop for their Data Lakes and found security, governance and the solution itself lacking.
Hadoop Hasn’t Eliminated Silos
While the concept of a Data Lake was conceived as a way to minimize data silos, in reality it hasn’t delivered on that promise, for a number of reasons including:
- A focus on only the analytical (i.e. observe-the-business) side of things instead of also considering the operational (i.e. run-the-business) side of things
- A dependency on the complex and changing Hadoop ecosystem, resulting in higher-than-expected costs to integrate
- A “hand-wavy” approach to enterprise features such as security and operational maintenance
What has resulted for many first adopters is that, in addition to still having their data silos, they’ve compounded some of their problems by introducing technical silos requiring expensive “care and feeding.” On top of that, it is difficult to maintain the provenance of your data.
So while the Hadoop ecosystem is certainly a solution, the “Hadoop-first-and-foremost” mentality is misguided or at least incomplete.
Oh sure, Hadoop is great with analytics (so you can observe the business), but what if you actually want to run the business? My colleague Ken Krupa, Enterprise CTO, told me this is exactly the scenario at many of the big banks. “Investment banks will have a new regulation related to post-trade processing that will affect trades from multiple source systems,” he explained. “If the regulation requires an operational workflow around all of those trades, you can’t very well request that each of the downstream systems – perhaps dozens – make changes to accommodate the workflow. If you don’t have the operational capability at the point of enterprise integration, this may be your only choice — a choice most banks don’t want to embrace.”
As an architecture, Data Lakes only address part of today’s data integration challenges and on their own are insufficient to integrate data in silos. To fully address the operational challenges as well as the analytical challenges of data integration, something more than Hadoop is needed.
That “something more” is an Operational Data Hub (ODH), which offers the “three V’s” (volume, velocity and variety) capability of “big data” technologies but with the mature operational capabilities of a database.
With the MarkLogic ODH as part of the data strategy, security, data governance and operational maturity are covered, without sacrificing the agility expected of modern data strategies. When you approach it from this perspective, you’ll find that you’re not stretching any Hadoop investments you might have made beyond what they’re capable of handling.
So if you already have a Hadoop investment, not all is lost. An ODH will enhance the value of the Hadoop ecosystem by providing operational capabilities on top of all the core enterprise features, such as security and maintainability, needed for a mature data strategy.