Where is the Science in “Data Science”?
By Vikas Agrawal, Senior Principal Data Scientist, Oracle Corporation
July 14, 2016
William Edwards Deming said: “In God we trust; all others must bring data.” In the following discussion, we explore how to “bring” data scientifically into decision-making.
It is surprising how often practitioners apply the latest tools and technologies to large, complex datasets, only to find their results discarded by decision-makers because the science of the domain was never addressed. As a result, even large decisions get made on gut feeling, without the benefit of a data-driven process. Even when decision-makers do go by data, they typically start from a theory or hypothesis rooted in a gut feeling, and allocate a budget to test it. Managers thus often define which experiments are run and what data is collected, again without a data-driven process.
What if we could work from the business problems, start with domain knowledge, define experiments and hypotheses based on available data, and thus help the decision-makers make data-driven decisions?
The Process of Science: Making sense of the world and manipulating it has long been an aspiration of mankind, often using quantitative approaches. Today, our data collection systems seem to “know” a lot, yet we know so little of value. If we know how our world works, we can make reasonable predictions about it, and manipulate it effectively for economic benefits. The world could be the entire universe or it could be our enterprise business. One technique for discovering how the world works is empirical Science.
Science requires a systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about the world with steps that must necessarily include:
1. Analysis through Analytics: We collect initial data using different sensing mechanisms, define a specific problem to be solved, make hypotheses about how the world works in relation to the problem, design experiments to test the hypotheses, collect data systematically at scale, test our hypotheses using statistical or machine-learning techniques, and iteratively refine or replace our hypotheses through models and experimentation.
The data analysis component here requires us to import data of various types (schema on read, schema on write, high-velocity streams, error-prone, high-volume), extract the relevant portions (impose schema, clean data, join/link/cross-reference data), transform (normalization), and load. Then we need to identify and create features from the data that are relevant to the problem, define a sampling strategy to produce training and test sets, train specific machine-learning or statistical formulations, run Monte Carlo analyses of model parameter sensitivity, cross-validate the models, select the ones with the best trade-off between false positives and false negatives, and deploy the models.
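The last few steps (sampling into training and test sets, cross-validating, and selecting on false positives versus false negatives) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data is synthetic, the "model" is a single threshold on one feature, and the cost weights `fp_cost` and `fn_cost` are hypothetical values a business would have to supply.

```python
import random

def make_data(n, seed=0):
    """Synthetic (feature, label) pairs: label 1 tends to have larger feature values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        x = rng.gauss(2.0 if label else 0.0, 1.0)
        rows.append((x, label))
    return rows

def confusion(rows, threshold):
    """False positives and false negatives for the rule 'predict 1 if x > threshold'."""
    fp = sum(1 for x, y in rows if x > threshold and y == 0)
    fn = sum(1 for x, y in rows if x <= threshold and y == 1)
    return fp, fn

def cost(rows, threshold, fp_cost, fn_cost):
    """Business cost of the errors made at this threshold (weights are assumptions)."""
    fp, fn = confusion(rows, threshold)
    return fp_cost * fp + fn_cost * fn

def fit_threshold(train, candidates, fp_cost, fn_cost):
    """'Training' here is just picking the candidate threshold with the lowest cost."""
    return min(candidates, key=lambda t: cost(train, t, fp_cost, fn_cost))

def cross_validate(rows, k, candidates, fp_cost, fn_cost):
    """k-fold cross-validation: fit on k-1 folds, count errors on the held-out fold."""
    folds = [rows[i::k] for i in range(k)]
    total_fp = total_fn = 0
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        threshold = fit_threshold(train, candidates, fp_cost, fn_cost)
        fp, fn = confusion(folds[i], threshold)
        total_fp += fp
        total_fn += fn
    return total_fp, total_fn

data = make_data(500, seed=42)
candidates = [i / 10 for i in range(-10, 31)]  # thresholds from -1.0 to 3.0
# Assumption: a missed positive is five times as costly as a false alarm.
fp, fn = cross_validate(data, k=5, candidates=candidates, fp_cost=1.0, fn_cost=5.0)
print("held-out false positives:", fp, "false negatives:", fn)
```

Changing the two cost weights moves the selected threshold, which is the point: the "optimal" balance of false positives and false negatives is a domain decision, not a property of the algorithm.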
2. Synthesis through Modeling: We use the knowledge from our tested hypotheses to create a theory of the world, which could include a machine-learning-based model, perhaps in an ensemble with expert systems (if-then-else), dynamic mathematical models (partial differential equations) and statistical techniques. We then cross-validate our models, determine their parameter sensitivity, and deploy them to make predictions with new data. The better we understand the “physics” of the problem, the better our models will be at making predictions on unseen data, within the boundary conditions and model assumptions.
3. Prescription through What-if Simulations and Optimization: We can create predictions using simulations (what if) of the model, and find what actions to take (what to do) through optimizations (what is better) on top of Monte Carlo simulations, within the boundaries of the independent variables we have the discretion to change. Finally, we use the results for the economic benefit of the world. We repeat this cycle with greater quality and quantity, feeding back what went well and what did not, so that the models better reflect the world and improve over time.
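As a rough illustration of prescription, the sketch below runs a what-if Monte Carlo simulation at each candidate value of one controllable variable, then picks the value with the best expected outcome. Everything specific here is an assumption made for the example: the demand model, the unit cost, the price range and the function names are hypothetical.

```python
import random

def simulate_profit(price, rng, unit_cost=4.0):
    """One what-if draw: demand falls with price and carries random noise (hypothetical model)."""
    demand = max(0.0, 100.0 - 8.0 * price + rng.gauss(0.0, 10.0))
    return (price - unit_cost) * demand

def expected_profit(price, n_draws=2000, seed=0):
    """Monte Carlo estimate of the mean profit at a given price."""
    # Re-seeding per price reuses the same noise draws for every candidate
    # (common random numbers), so the comparison between prices is less noisy.
    rng = random.Random(seed)
    return sum(simulate_profit(price, rng) for _ in range(n_draws)) / n_draws

def best_price(low, high, step):
    """Grid search over the range of prices we have the discretion to set."""
    n_steps = int(round((high - low) / step))
    grid = [low + i * step for i in range(n_steps + 1)]
    return max(grid, key=expected_profit)

price = best_price(low=5.0, high=12.0, step=0.25)
print("recommended price:", price)
print("expected profit at that price:", round(expected_profit(price), 1))
```

A grid search is the simplest possible optimizer; with several controllable variables one would swap in a proper optimization routine, but the structure (simulate, score, choose within bounds) stays the same.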
Data Science Means Science First: We have all seen this Prescription translate into major advances in technology for electronics, genomics, chemistry, mechanics, aeronautics, and other fields, as this scientific process was applied in the Physical Sciences and Life Sciences.
Now we are using the same process of Science to create new fields of Banking Science, Investment Management Science, Human Capital Management Science, Customer Relationship Management Science, Supply Chain Management Science, Manufacturing Management Science, Asset Management Science, Financial Fraud Science, etc. This application of the scientific method to a broader set of domains has come to be known, as an industry shorthand, as Data Science. The scientists in each of these domains have to know their domains very well, understand the peculiarities of the datasets, and apply this knowledge to derive novel prescriptions for the domain.
Here is a proposed blueprint for this process of Science applied to multiple domains, one that works remarkably well:
1. Define Key Business Problems First through Design Thinking [Desirability of Testing a Hypothesis for the Business, Feasibility of Doing it and Business Sustainability]
- What is Desirable for our internal/external customers?
- What is the Differentiation? What can we sell? What is valuable?
- What is Feasible from a technology perspective?
- Using Existing Internal and External Technology
- What is Viable/Sustainable for us as a business?
- Balancing Time/Money and Complexity of Solution
- High quality and reliability vs. POC level efforts.
- Investment in Exploration and Pathfinding to create IP
- Can we do Development and Deployment in an Agile mode?
- Fail fast, find the best with customers, then deploy to production.
2. Data Analytics is Teamwork, with at least three key roles that need specialized training:
- Data Stewards focus on Security and Privacy, Extract-Transform-Load of data, and Efficient Data Transformations. The data must maintain algorithmically reproducible provenance (not just documented, metadata-based or linked provenance) through all its movement and transformations, to give all stakeholders confidence in the reliability of the conclusions drawn from such transformations.
- Data Engineers focus on tools and architecture to create and scale data “pipelines”
- Data Scientists are scientists of the domain who focus on the science, i.e. problem solving. They know how the tools and architecture scale and where their limits lie, create domain-specific hypotheses and visualizations, and evangelize the analytics to the business decision-makers.
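One way to read “algorithmically reproducible provenance” for the Data Steward role is that anyone holding the raw data and the transformation log can replay every step and check that the results match. The sketch below is a minimal illustration of that idea using a hash-chained log; the record layout and the helper names (`fingerprint`, `apply_step`, `verify`) are assumptions made for this example, not an established standard.

```python
import hashlib
import json

def fingerprint(obj):
    """Stable SHA-256 digest of any JSON-serializable object."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def apply_step(data, log, name, func, params):
    """Run one transformation and append a provenance record chained to the previous one."""
    result = func(data, **params)
    log.append({
        "step": name,
        "params": params,
        "input_hash": fingerprint(data),
        "output_hash": fingerprint(result),
        "previous_record": fingerprint(log[-1]) if log else None,
    })
    return result

def verify(raw, log, steps):
    """Replay the logged steps from the raw data and check every recorded hash."""
    data, previous = raw, None
    for record in log:
        if record["previous_record"] != previous:
            return False  # the log itself was reordered or edited
        if record["input_hash"] != fingerprint(data):
            return False  # the data entering this step is not what was logged
        data = steps[record["step"]](data, **record["params"])
        if record["output_hash"] != fingerprint(data):
            return False  # the transformation no longer reproduces its output
        previous = fingerprint(record)
    return True

# Hypothetical transformations used only for this illustration.
def drop_missing(rows, field):
    return [r for r in rows if r.get(field) is not None]

def scale(rows, field, factor):
    return [{**r, field: r[field] * factor} for r in rows]

STEPS = {"drop_missing": drop_missing, "scale": scale}

raw = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}, {"id": 3, "amount": 2.5}]
log = []
clean = apply_step(raw, log, "drop_missing", drop_missing, {"field": "amount"})
scaled = apply_step(clean, log, "scale", scale, {"field": "amount", "factor": 100})
print("provenance verified:", verify(raw, log, STEPS))
```

The design choice is that the log stores enough (step name, parameters, hashes) to re-execute the pipeline, rather than only describing it, which is what distinguishes this from documentation-style provenance.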
3. Data Integration and Cleaning Takes the Most Time:
- Data is usually found in multiple sources, with schemas that need to be remapped
- Data arrives at varying levels of quality, with inconsistencies, errors and missing values.
- Data needs to be reconciled by the data scientist, engineer and steward working together
- Sometimes key data critical to the problem is not included.
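The points above can be made concrete with a small sketch that remaps two source schemas onto one target schema, normalizes values, and reports conflicts and gaps for the team to resolve. The source systems, field names and sample records are all hypothetical.

```python
# Hypothetical field mappings from two source systems to one target schema.
CRM_MAP = {"cust_id": "customer_id", "mail": "email", "rev": "revenue"}
ERP_MAP = {"CustomerID": "customer_id", "Email": "email", "AnnualRevenue": "revenue"}
TARGET_FIELDS = ["customer_id", "email", "revenue"]

def remap(row, mapping):
    """Rename source fields to target fields; absent fields become None."""
    return {target: row.get(source) for source, target in mapping.items()}

def normalize(row):
    """Bring values to one representation so that equal things compare equal."""
    out = dict(row)
    out["customer_id"] = str(out["customer_id"]).strip()
    email = out.get("email")
    out["email"] = email.strip().lower() if email else None
    revenue = out.get("revenue")
    out["revenue"] = float(revenue) if revenue not in (None, "") else None
    return out

def reconcile(sources):
    """Merge rows by customer_id, fill gaps across sources, and record conflicts."""
    merged, conflicts = {}, []
    for rows, mapping in sources:
        for raw_row in rows:
            row = normalize(remap(raw_row, mapping))
            current = merged.setdefault(row["customer_id"], {})
            for field, value in row.items():
                if value is None:
                    continue
                if current.get(field) is None:
                    current[field] = value
                elif current[field] != value:
                    conflicts.append((row["customer_id"], field, current[field], value))
    missing = {
        cid: [f for f in TARGET_FIELDS if rec.get(f) is None]
        for cid, rec in merged.items()
        if any(rec.get(f) is None for f in TARGET_FIELDS)
    }
    return merged, conflicts, missing

crm = [{"cust_id": 101, "mail": " Ann@Example.com ", "rev": "1200"},
       {"cust_id": 102, "mail": None, "rev": ""}]
erp = [{"CustomerID": "101", "Email": "ann@example.com", "AnnualRevenue": 1250.0},
       {"CustomerID": "102", "Email": "bob@example.com"}]

merged, conflicts, missing = reconcile([(crm, CRM_MAP), (erp, ERP_MAP)])
print("conflicts to resolve:", conflicts)
print("fields still missing:", missing)
```

The code deliberately does not decide which conflicting value wins. That decision is exactly where the data scientist, engineer and steward need to work together.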
4. Prediction is Hard: Predictions are usually hard to make, and producing high-quality models takes hard work. A hybrid of Knowledge-driven (Expert Systems, Differential Equations) and Data-driven (Statistical, Machine-Learning) models is required to create truly predictive models. Purely data-driven models simply project a summary of past data onto the future, and fail where an understanding of the “physics” of the problem would have solved it.
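A toy example may help show why the hybrid matters. In the sketch below, the knowledge-driven part is the textbook free-fall formula, and the data-driven part is a straight line fitted to whatever the formula fails to explain. The "true" process, the unmodeled linear effect and the noise level are all invented for illustration; the comparison is against a purely data-driven straight-line fit asked to extrapolate beyond its training range.

```python
import random

G = 9.81  # gravitational acceleration in m/s^2

def physics(t):
    """Knowledge-driven part: distance fallen after t seconds, ignoring other effects."""
    return 0.5 * G * t * t

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def observe(t, rng):
    """Hypothetical ground truth: physics, an unmodeled linear effect, and sensor noise."""
    return physics(t) - 0.3 * t + rng.gauss(0.0, 0.05)

rng = random.Random(1)
train_t = [i / 10 for i in range(1, 21)]  # observations from 0.1 s to 2.0 s
train_d = [observe(t, rng) for t in train_t]

# Purely data-driven model: a line fitted straight to the observations.
a, b = fit_line(train_t, train_d)
# Hybrid model: physics plus a line fitted to the residuals physics leaves behind.
residuals = [d - physics(t) for t, d in zip(train_t, train_d)]
c, e = fit_line(train_t, residuals)

def data_only(t):
    return a + b * t

def hybrid(t):
    return physics(t) + c + e * t

test_t = [3.0, 3.5, 4.0]  # outside the training range
truth = [physics(t) - 0.3 * t for t in test_t]

def mean_abs_error(model):
    return sum(abs(model(t) - y) for t, y in zip(test_t, truth)) / len(test_t)

data_error = mean_abs_error(data_only)
hybrid_error = mean_abs_error(hybrid)
print("data-only extrapolation error:", round(data_error, 2))
print("hybrid extrapolation error:", round(hybrid_error, 2))
```

Within the training range both models look acceptable; the difference shows up only on unseen conditions, where the data-only fit has no way to know the underlying relationship is quadratic.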
5. Making sense (inducing a gut-feeling) through Visualization: If we cannot produce a clear gut-feeling in the decision-makers based on the visualizations, simulations, and reliability of the data provenance, then our exercise of analysis, synthesis and prescription is in vain. Therefore, carefully crafted visualizations showing the highest information content (maximal entropy) relationships relevant to the problem must be surfaced and presented.
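One possible way to decide which relationships carry the most information, offered here as an interpretation rather than a prescribed method, is to rank candidate variables by their mutual information with the outcome of interest and visualize the top ones first. The sketch below uses Shannon entropy on categorical columns; the column names and values are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y): how much knowing X tells us about Y."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Hypothetical categorical data: which attribute says more about churn?
columns = {
    "region":  ["N", "N", "S", "S", "N", "S", "N", "S"],
    "channel": ["web", "store", "web", "store", "web", "store", "web", "store"],
    "churned": ["no", "no", "yes", "yes", "no", "yes", "no", "yes"],
}
target = "churned"

scores = {
    name: mutual_information(values, columns[target])
    for name, values in columns.items()
    if name != target
}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, "shares", round(scores[name], 3), "bits with", target)
```

In this made-up data, region determines churn completely while channel is only weakly related, so a chart of churn by region would be the one to put in front of a decision-maker first.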
6. Multiple Tools and Techniques Needed: In the rapidly growing technology landscape, we need to be careful to use stable and scalable tools for enterprise-grade problems. The Toolbox includes tools for Extract-Transform-Load (Informatica, ODI, Ab Initio); for Scalability, such as Distributed/Parallel Computing and Map-Reduce (Hadoop); Streaming Data Processing (Apache Spark, Apache Kafka, Apache Storm); Machine Learning, Statistics, Mathematical Modeling and Simulation (SAS, R, MATLAB, Mahout, MLlib, SAS JMP, Minitab, SPSS, Mathematica); Artificial Intelligence, including Speech Recognition (Google’s Speech API, Microsoft’s Speech API, Nuance’s ASR), Intelligent Context-aware Natural Interfaces, and Natural Language Processing (GATE, NLTK, Apache OpenNLP, Stanford Parser, SyntaxNet with TensorFlow); Operations Research (IBM ILOG CPLEX for Optimization); and Visualization (Tableau, Oracle Visual Analyzer and Data Visualization Cloud Service and Desktop, SAS Visual Analytics), along with columnar stores for in-memory OLAP (Oracle 12c, SAP HANA, Amazon Redshift) on Cloud infrastructure.
7. Experimentation Required: While we know which algorithms are best suited to which data types and data sizes, and their relative levels of precision/recall with small and large datasets, experimentation is still required.