For the series Q&A with Data Scientists: Ritesh Ramesh
Ritesh Ramesh is the Chief Technology Officer for Global Data and Analytics at PricewaterhouseCoopers (PwC). Ritesh has 15 years of professional consulting experience working with several Fortune 500 companies across multiple industries on strategic Data & Analytics initiatives. His areas of experience include emerging Data & Analytics technologies, cloud platforms, analytics innovation and next-generation information architecture.
Q1. What are the most significant challenges and opportunities in predictive analytics?
Predictive analytics is no longer an optional choice but a strategic growth opportunity for companies to differentiate their products and services in the market and drive competitive advantage. In PwC’s recent Global Data and Analytics: Big Decisions survey, we heard from 2,100+ global executives across 50+ countries and 15+ industries that the majority of companies (61%) are not highly ‘data driven’ and still rely on descriptive and diagnostic analytics. It does not surprise me that many of them are struggling to navigate the predictive and prescriptive analytics frontier, because it requires a resilient foundational infrastructure, progressive leadership, agile delivery processes and a talent strategy in the areas of data management, emerging technologies and advanced analytics. Too much innovation, too fast, over the last 2-3 years has left many organizations playing catch-up with current trends, delaying the decisions they need to make to move forward and execute. These are some of the major challenges.
In terms of opportunities, PwC’s Big Decisions survey also indicated that many companies are keen to move swiftly to launch new products and services in their industries this year. This could be an opportune time for these firms to think strategically about how predictive analytics capabilities can be embedded in their organizational processes, products and services to drive their growth agenda. Rapid technology innovation around scalable, cloud-based, cost-effective and efficient analytics technologies has levelled the playing field in such a way that technology costs and access are no longer barriers to predictive analytics enablement. Talent development continues to be an issue for many organizations, but the evolution of crowdsourcing analytics platforms augmented with human talent will gradually help address it. Companies should focus on developing a forward-looking analytics vision, an incremental, outcomes-driven execution strategy and an open-minded culture to nurture and sustain predictive analytics.
Q2. What should every data scientist know about machine learning?
Machine learning is a subset of artificial intelligence. It has been a constant feature of the academic and research world for many years but has now emerged as a mainstream strategic enabler and value driver for enterprises. Machine learning algorithms do not have static logic; they learn from analysis of massive amounts of data by optimizing their parameters and progressively improving their performance. The availability of large volumes of structured and unstructured data, high-performance processing power and open source software has unlocked the barriers for companies to embed machine learning as a core engine in their products, services and processes. Every single industry can capitalize on this huge potential by looking for opportunities in its value chain to complement human judgment with machine algorithms to create value, improve total factor productivity and mitigate risks proactively. Broadly, machine learning techniques can be categorized into 3 buckets – supervised learning (e.g. decision trees for classification), unsupervised learning (e.g. clustering) and reinforcement learning (e.g. complex control-based logic).
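The “learning” described here – parameters optimized progressively from data rather than static logic – can be sketched in a few lines. This is an illustrative toy (the data, learning rate and epoch count are invented for the example), fitting a line y = w·x + b by gradient descent on mean squared error:

```python
# Toy sketch of parameter learning: no static rule for w and b is coded;
# they are improved step by step from the data (supervised learning).
def fit_linear(xs, ys, lr=0.01, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data generated from y = 3x + 1; the fit recovers roughly w=3, b=1
xs = [0, 1, 2, 3, 4]
ys = [1, 4, 7, 10, 13]
w, b = fit_linear(xs, ys)
```

The same progressive-improvement loop, at a vastly larger scale, is what lets production systems refine their predictions as more data arrives.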
A couple of examples of machine learning that anyone can relate to –
1) Connected cars these days can monitor fuel tanks, send an alert when the fuel range is low and route the driver to a nearby fuel station. I see this feature extending to predictive maintenance scenarios in automobiles, which can contribute productively to safety and quality.
2) Recommendation engines implemented in many digital products and services, based on historical behavior patterns at both the individual user and cohort level. The option of rejecting the recommendation is still in the user’s control, but it’s definitely a feature of convenience that can influence consumer behavior, drive commerce and capture value.
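A minimal sketch of the co-occurrence idea behind such engines (the purchase histories below are invented for illustration): score items by how often they appear alongside the user’s own items in other users’ histories, and recommend the top scorers.

```python
from collections import Counter

# Hypothetical purchase histories for a handful of users
histories = [
    {"laptop", "mouse", "keyboard"},
    {"laptop", "mouse", "monitor"},
    {"phone", "charger"},
    {"laptop", "keyboard", "monitor"},
]

def recommend(user_items, histories, k=2):
    """Recommend up to k items that co-occur with the user's items."""
    scores = Counter()
    for h in histories:
        if user_items & h:                 # another user with overlapping taste
            for item in h - user_items:    # count items this user doesn't have
                scores[item] += 1
    return [item for item, _ in scores.most_common(k)]

suggestions = recommend({"laptop", "mouse"}, histories)
```

Real engines add weighting, recency and cohort signals, but the core pattern – learning from historical behavior rather than hand-coded rules – is the same.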
Q3. Is domain knowledge necessary for a data scientist?
Absolutely. Domain and industry expertise and business acumen are core competencies any data scientist needs to be effective beyond their base analytical and technical skills.
Good data scientists often focus on the business problems and outcomes before they dive deep into the data for analysis. As they analyze data and unearth insights, they need strong communication skills to interact with the business stakeholders to tell the ‘big picture’ story and further collaborate to turn the insights into action.
If I were to advise data scientists in the early stages of their careers, one thing I would tell them is to develop a deep understanding of their industry and then augment that domain knowledge with their analytical and technical skills for problem solving. I know many data scientists who have mastered multiple domains over their careers, and this is critical because insights gained in one domain can be applied to another, and ideas can cross-pollinate even across industries. For example, the demand signals you pick up through your consumer insights analytics had better be well connected and correlated to your supply-side analytics to deliver value across the firm.
Q4. What is your experience with data blending?
Data blending is a critical capability in the data and analytics toolbox. We live in a global, interconnected world where the velocity of external information (like macroeconomic variables, weather, supply chain risks etc.) is volatile and supersonic. The more real-time and contextual the information, the better the chances of success for your analytical models. Companies are increasingly augmenting their internal data with external data to help them drive towards actionable insights. The majority of external data management processes I observe are still managed very manually; most of them are batch-based processes with tangible information latencies that can render analytical results invalid.
It’s easy to confuse data blending with traditional data integration. IT organizations must ensure that their business partners have an on-demand ability to mash up external data feeds in their analytics workflows by designing self-service processes in the data access layer that are less cumbersome and more agile. IT can also play a productive role in the selection of smart data blending tools and in training business users on those tools.
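At its simplest, the mashup is a lightweight join of an internal dataset with an external feed on a shared key – nothing like a traditional batch integration project. A toy sketch (the sales rows and weather feed are hypothetical stand-ins for an internal table and an external API response):

```python
# Internal data: daily store sales (hypothetical)
sales = [
    {"date": "2017-06-01", "store": "A", "units": 120},
    {"date": "2017-06-02", "store": "A", "units": 95},
]

# External feed keyed on date, e.g. pulled from a weather API (hypothetical)
weather = {
    "2017-06-01": {"temp_f": 78},
    "2017-06-02": {"temp_f": 64},
}

# Blend: enrich each internal row with the matching external attributes
blended = [{**row, **weather.get(row["date"], {})} for row in sales]
```

Self-service blending tools wrap this same pattern in a visual workflow, so a business analyst can attach a new external feed without waiting on an IT integration cycle.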
Q5. Can data ingestion be automated?
Data ingestion processes must be automated to eliminate operational inefficiencies in the end-to-end analytics lifecycle. Many companies are now building large, shared landing zones in scalable data environments like Hadoop and Spark for analytics. Their end goal is to automate data ingestion processes at various frequencies (real time, near real time or batch) on both a predictable and an ad hoc basis. This eliminates the situation where each individual analyst spends a lot of time developing redundant “one-off” data ingestion processes rather than on productive analysis. This is where IT organizations can enable their business stakeholders by developing automated data ingestion processes that support agile analytics. Many recent open source technologies (Apache NiFi, Sqoop, Kafka etc.) and self-service data preparation tools make this endeavor easy and seamless.
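The shared-landing-zone idea can be sketched as a single reusable ingestion step that any schedule (or an ad hoc trigger) can call, instead of each analyst writing their own copy script. This is an illustrative sketch – the directory layout and partition naming are assumptions, not a prescribed standard:

```python
import os
import shutil
from datetime import datetime

def ingest(source_dir, landing_zone, run_date=None):
    """Move raw files from a source drop folder into a date-partitioned
    landing zone directory; returns the target path and the files moved."""
    run_date = run_date or datetime.now().strftime("%Y-%m-%d")
    target = os.path.join(landing_zone, f"ingest_date={run_date}")
    os.makedirs(target, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(source_dir)):
        shutil.move(os.path.join(source_dir, name), os.path.join(target, name))
        moved.append(name)
    return target, moved
```

A scheduler (cron, NiFi, Airflow or similar) would invoke the same function on a predictable cadence, while an analyst could call it ad hoc for a one-time feed – either way the files land in one shared, partitioned zone.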
Q6. How do you ensure data quality?
Data Quality is critical. We hear often from many of our clients that ensuring trust in the quality of information used for analysis is a priority. The thresholds and tolerance of data quality can vary across problem domains and industries but nevertheless data quality and validation processes should be tightly integrated into the data preparation steps.
Data scientists should have full transparency into the profile and quality of the datasets they are working with, and have tools at their disposal to remediate issues with proper governance and procedures as necessary. Emerging data quality technologies are embedding machine learning features to proactively detect data errors, making data quality a more business-user-friendly and intelligent function than it has ever been.
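Tightly integrating quality checks into data preparation can be as simple as running each record through a set of validation rules before it reaches the analysis pipeline. A minimal sketch – the fields and rules below are made up for illustration:

```python
# Hypothetical per-field quality rules: each maps a field name to a check
RULES = {
    "customer_id": lambda v: isinstance(v, str) and v != "",
    "order_total": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate(records, rules=RULES):
    """Split records into clean rows and (row, errors) pairs for remediation."""
    good, bad = [], []
    for rec in records:
        errors = [field for field, ok in rules.items() if not ok(rec.get(field))]
        if errors:
            bad.append((rec, errors))   # route to remediation with governance
        else:
            good.append(rec)            # safe to feed into analysis
    return good, bad

good, bad = validate([
    {"customer_id": "C1", "order_total": 25.0},
    {"customer_id": "", "order_total": -5},
])
```

The point is the placement, not the rules: validation runs inside data preparation, so the analyst sees the profile of what passed and what failed before any model consumes the data.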
Q7. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
Many people view analytics and data science as a magic crystal ball into future events and don’t realize that a model is just one of many probable indicators of successful outcomes – if the model predicts an 80% chance of success, you also need to read it as a remaining 20% chance of failure. To really assess the ‘quality’ of insights from the model, you may start with the areas below –
1) Assess whether the model makes reasonable assumptions about the problem domain and takes into account all the relevant input variables and business context – I was recently reading an article about a U.S.-based insurer that implemented an analytics model that looked at the number of unfavorable traffic incidents to assess risk on a vehicle driver, but missed assigning weights to the severity of each incident. If your model makes wrong contextual assumptions, the outcomes can backfire.
2) Assess whether the model is run on a sufficient sample of data. Modern scalable technologies have made executing analytical models on massive amounts of data possible. The more data the better, although not every problem needs large datasets of the same kind.
3) Assess whether extraneous events like macroeconomic events, weather, consumer trends etc. are considered in the model constraints. Use of external datasets with real-time, API-based integrations is highly encouraged, since it adds more context to the model.
4) Assess the quality of the data used as input to the model. Feeding bad data to a good analytics model and expecting it to produce the right outcomes is unreasonable. The stakes are higher in heavily regulated environments, where a minimal error in the model might mean millions of dollars in lost revenues or penalties.
Even successful organizations that execute seamlessly in generating insights struggle to “close the loop” and translate those insights into field action that drives shareholder value.
It’s always good practice to pilot the model on a small population, link its insights and actions to key operational and financial metrics, measure the outcomes, and then decide whether to improve or discontinue the model.