Data Lakes Transformation – Leading with a business-driven, technology-led approach
By Ritesh Ramesh, Data and Analytics Leader, Consumer Markets Industries, PricewaterhouseCoopers U.S.
As more companies start to use data lakes, the right approach is essential; otherwise, transformative insights sink like stones beneath the murk of a data swamp. Too many companies take a technology-centric approach and fail to capitalize on the opportunity. These six leading practices can help you get the most from your data lake transformation:
- Take a business-driven approach: Many businesses dive in without a strategic perspective, which leads to mistakes and mismatched priorities. Start with a business-centric and, importantly, value-based iterative approach to drive the lake architecture, design, and implementation. Ask these questions to help ensure your data lake has a business-centric operating model:
- How will the data lake create and capture value?
- What skills do we need to manage the lake and its technologies, and to derive information, insight, and value?
- What incentive model will foster innovation and new insights?
- How will we prioritize new demands?
- What services will the lake provide?
- Do we need data discovery and ideation services via self-service?
- How will we divide responsibility between business and IT?
- Will we allow external access from strategic partners and vendors?
- How will the data lake co-exist with other ecosystems (e.g., data warehouse, analytical applications)?
- Should it be on premises or in the cloud?
- What is our operations and support strategy?
- Will the platform extend to multiple geographies?
- Talk to stakeholders early: Before building a data lake, talk to relevant stakeholders to understand their priorities, and bake the findings into the strategy. One of our clients initially wanted to use a data lake to store massive volumes of digital data, perform complex computations, and deliver information to downstream applications. IT set to work without first connecting with business stakeholders to confirm their needs, which included self-service data discovery and other analytics capabilities. Because this feedback arrived only after work had begun, it delayed implementation by six months.
- Identify the right technologies: Identify the desired native technology components, primarily open source and other third-party commercial technologies, with the right deployment model (cloud vs. on premises). Many capabilities, such as metadata management, semantic analytics, and visualization tools, are evolving in the open source space. Too often, businesses try to design a modern data lake based on traditional approaches, e.g., building cumbersome data models and selecting a data integration engine that doesn’t leverage the native capabilities of Hadoop. Consider ‘adaptive data preparation’ tools, which ingest a variety of data formats (e.g., clickstream data from websites and mobile devices, legacy transaction data, and third-party demographics) and leverage scalable data processing and in-memory engines like Spark, together with machine learning techniques, to automatically tag and catalog data and identify data relationships. Consider analytical tools that bring logic to the data stored in the lake and not vice versa. Spark is emerging as the foundation for modern data processing engines; you should question your architecture if you are not leveraging it.
One of our largest services clients increased operational efficiency and performance tenfold by storing large quantities of financial transactions (more than 100 million rows) in memory, using Spark on the AWS cloud to enable on-demand analytics. The age of denormalized datasets and on-demand analytics is here. Explore tapping into the ecosystem of such tools and techniques.
- Build the right organizational structure to connect the dots across initiatives: Empower a data lake leader to build a team that can smooth implementation, balance priorities, and map to the vision and core capabilities. This leader should also be able to communicate with an unconventional mix of stakeholders, ranging from the Chief Marketing Officer to the Chief Financial Officer to the Chief Data Officer and Chief Analytics Officer. The data lake leader should be knowledgeable and open-minded about emerging technologies in order to drive decisions that can make or break a data lake implementation. As the lake will eventually connect ecosystems across the enterprise, such as analytics, digital, AI, IoT, cloud, and other emerging technologies, the right organizational structure is critical.
- Address risk and governance: Who will access the data lake? How will access be provided (e.g., SQL, visualization tools, APIs)? Address the compliance and regulatory requirements around security and access, and identify the tools and processes to catalog and link data sets, enable data provenance, and provide access to the right people. Many technologies are available to keep your lake from becoming a swamp.
- Focus on the last mile of the data lake: Last but not least, focus on the last mile of your data lake: the semantic layer, where value is created for the business. You may spend millions of dollars on emerging technologies and processes for ingesting, standardizing, and organizing the data in the lake, but if your semantic data access layer is not architected to be efficient and flexible, the data lake is unlikely to gain user adoption and those expensive investments will turn into sunk costs. It is critical to leverage in-memory technologies to speed up data access and to develop self-service capabilities for end users. Companies are increasingly developing microservices architectures to abstract technical complexity and build user-centric applications on top of the data lake, and this trend is bound to increase.
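To make the cataloging and governance ideas in the practices above more concrete, here is a minimal Python sketch of a data catalog that tags columns automatically, records provenance, links data sets that share columns, and checks access by role. It is an illustration only: the `Catalog` class, the `infer_tag` heuristic, the role names, and the sample records are all hypothetical and do not represent any particular product, nor the approach used with the clients mentioned in this article. Real adaptive data preparation tools apply far richer techniques, often on engines such as Spark.

```python
from dataclasses import dataclass, field


def infer_tag(values):
    """Guess a coarse tag for a column from sample values (toy heuristic)."""
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return "empty"
    try:
        for v in non_empty:
            float(v)
        return "numeric"
    except (TypeError, ValueError):
        pass
    if all("@" in str(v) for v in non_empty):
        return "email"
    return "text"


@dataclass
class CatalogEntry:
    name: str
    source: str            # provenance: where the data set came from
    column_tags: dict      # column name -> inferred tag
    allowed_roles: set = field(default_factory=set)


class Catalog:
    """Hypothetical in-memory catalog; a real one would be persistent."""

    def __init__(self):
        self._entries = {}

    def register(self, name, source, rows, allowed_roles):
        # Assumption: the first row lists every column in the data set.
        columns = list(rows[0].keys()) if rows else []
        tags = {c: infer_tag([r.get(c) for r in rows]) for c in columns}
        entry = CatalogEntry(name, source, tags, set(allowed_roles))
        self._entries[name] = entry
        return entry

    def can_access(self, name, role):
        """Return True only if the role was granted access to the data set."""
        return role in self._entries[name].allowed_roles

    def related(self, name):
        """Names of other data sets sharing at least one column name."""
        cols = set(self._entries[name].column_tags)
        return sorted(
            other
            for other, entry in self._entries.items()
            if other != name and cols & set(entry.column_tags)
        )


catalog = Catalog()
catalog.register(
    "clickstream",
    "web logs",
    [{"customer_id": "17", "page": "/home"}],
    allowed_roles={"analyst", "marketing"},
)
transactions = catalog.register(
    "transactions",
    "legacy ERP",
    [{"customer_id": "17", "amount": "42.50", "email": "a@example.com"}],
    allowed_roles={"finance"},
)

print(transactions.column_tags)
print(catalog.related("clickstream"))               # ['transactions']
print(catalog.can_access("transactions", "analyst"))  # False
```

Matching on shared column names is a deliberately simple stand-in for relationship discovery; the point is that tags, provenance, links, and access rules live in one place that both business users and IT can inspect.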
Data lakes present a tremendous opportunity to drive insights and make better decisions, but if they are not implemented properly, their potential can be missed. We consistently see these six practices within companies that are ahead of the data and analytics curve. If you want to capitalize on the data lake, look beyond technology alone and adopt a holistic, business-driven, technology-led approach.
Ritesh is the Data and Analytics leader for the U.S. Consumer Markets Industries at PricewaterhouseCoopers (PwC). He has 15+ years of professional consulting experience working with several Fortune 500 companies across multiple industries on strategic Data & Analytics initiatives. Areas of experience include Emerging Data & Analytics technologies, Cloud Platforms, Analytics Innovation and Next Generation Information Architecture.