Q&A with Data Scientists: Raquel Seville
Raquel Seville is a business intelligence professional with over a decade of experience working for fast-paced, complex and demanding multinational telecommunications companies designing and implementing data warehouses, management reports, predictive analytics and executive dashboards. She is the recipient of the 2015 SAP Mentor award for influence and contribution to the SAP BI ecosystem and she is founder of popular business intelligence blog exportBI.com.
Raquel is currently leading reporting and analytics at Cable and Wireless Communications and has lectured undergraduate students in database management, data mining and knowledge discovery. She is keen on BI user adoption and having not just the right tools but the right talent and processes to ensure maximum return on your BI investment.
She is a foodie, loves to travel, and enjoys living in the Caribbean with her family. For more info, visit her LinkedIn or follow her on Twitter.
Q1. What is the difference between Business Intelligence and Data Science?
This is a very interesting question that has been debated many times over in the community.
In the traditional sense, Business Intelligence has served to answer questions dealing with the past and historical trends, for example “Why did sales drop 10% yesterday?”, while the field of Data Science has been more focused on the future and on predicting “When will sales drop 10% again? And how do we prevent that?”. The conversation shifts from trying to explain events that occurred in the past to trying to anticipate events before they happen again.
In essence, one could say Business Intelligence is more reactive and Data Science is more proactive, although this is not always the case. Some organisations have managed to achieve optimised value from their BI analytics, with predictive and prescriptive analytics being a reality, so they are able to determine when something will happen and how to make it happen or prevent it if necessary. It is interesting to note that these organisations also invest in Data Science teams, who serve to complement their BI resources.
Business Intelligence and Data Science do have overlaps and similarities. Professionals in both fields will work with large datasets for modelling and analysis, held in a data warehouse or data lake for example. They will also do programming and are expected to have expertise in SQL and a solid understanding of different database technologies. They are both expected to have a strong knowledge of the business, industry and market trends, and they will also do data visualizations and use similar tools for presentation of their findings. The main difference will be the additional skillsets and responsibilities that a Data Scientist would have, such as statistics, machine learning and mathematics, that he/she will use to build and optimise models to gain additional insight into the data. Also noteworthy is that a BI professional could possess these skillsets, but while BI would stop at presenting the data to users, Data Science continues to experiment and dig deeper into the data and models to answer questions no one has asked.
Q2. What should every data scientist know about machine learning?
We have all heard that machine learning is the future and that it is the new exciting skillset that every Data Scientist should have, but maybe what is not said enough is that machine learning is a type of artificial intelligence (AI) and that predictive analytics is an area or subset of machine learning. While predictive analytics uses the data mining approach with statistical models, machine learning encompasses that and much more with supervised and unsupervised learning.
I believe that machine learning will be fundamental in how we do business, interact on social platforms and go about our daily lives in general. It is already happening as programs we use regularly are learning without being programmed, from Siri to Alexa and everything in between such as Facebook’s personalised news feed and Amazon’s recommendation system.
What we will see is greater investment in and development of these programs, where the possibilities are somewhat limitless, requiring little to no human input. For some, the prospects may be daunting and scary, but I think it gives us professionals within the industry, more so data scientists and data engineers, the thrilling opportunity to design and automate a simulation of the human brain with the ability to learn and improve. Most importantly, we should all be mindful of what a monumental responsibility that is and how it can impact the world as we know it.
Q3. Is domain knowledge necessary for a data scientist?
This question is quite curious and has seen dicey debates and numerous opinions. Off the top of my head, I am inclined to say yes, because it can be awfully challenging to accurately understand and interpret data if you do not have the domain knowledge and context. However, I also believe that there should be a multi-faceted approach. A data scientist is called a unicorn because of the laundry list of required skillsets and expertise, which includes domain knowledge, but in teams and group settings he/she can rely on the subject matter expert (SME) for guidance and knowledge. The implications can be detrimental where a lack of domain knowledge results in inaccurate modelling or interpretation of models. To circumvent this, data scientists can benefit from the domain knowledge of experts around them in cases where they lack such proficiency.
Q4. What is your experience with data blending? (*)
I lightly touched on data blending as a SAS Programmer working with incomplete and imperfect datasets, but I got in a little deeper in my current role, where I led a project to analyse and correlate semi-structured data across two distinct yet related datasets of customers to determine matches and similarities towards achieving a single view of a customer. It was a challenge working with incomplete and dirty data, and the bulk of the time was spent doing data cleansing and some amount of data integration.
A match probability matrix was created with thirteen different scenarios across five attributes, and this helped to create clusters used to differentiate households from customers. The clusters were then used to do segmentation to show overlapping and non-overlapping service subscriptions across the two datasets. R was used to find the top three association rules, and a product distribution was created to determine the strongest overall product association and also product popularity across datasets. The results of the findings were rather interesting and served as a strong foundation for decision-making, planning and predictions.
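To make the association rule step more concrete, here is a minimal sketch of how rules between pairs of products can be ranked. It is written in plain Python rather than R, and it is not the method used in the project described above: the product names, the sample subscriptions, the `pairwise_rules` function and the `min_support` threshold are all assumptions made up for illustration. Each rule "A -> B" gets a support (share of customers with both products), a confidence (share of A's customers who also have B) and a lift (how much more often the pair occurs than if the products were independent).

```python
from itertools import permutations


def pairwise_rules(baskets, min_support=0.2):
    """Rank 'antecedent -> consequent' rules between single products.

    baskets: list of sets, one set of products per customer (hypothetical data).
    Returns tuples (antecedent, consequent, support, confidence, lift),
    strongest first.
    """
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket})
    count = {i: sum(1 for b in baskets if i in b) for i in items}

    rules = []
    for a, c in permutations(items, 2):
        both = sum(1 for b in baskets if a in b and c in b)
        support = both / n
        if support < min_support:
            continue  # too rare to be worth reporting
        confidence = both / count[a]
        lift = (both * n) / (count[a] * count[c])
        rules.append((a, c, support, confidence, lift))

    # Highest lift first, then highest confidence, then alphabetical.
    rules.sort(key=lambda r: (-r[4], -r[3], r[0], r[1]))
    return rules


# Hypothetical service subscriptions, one set per customer.
baskets = [
    {"broadband", "tv"},
    {"broadband", "tv", "landline"},
    {"broadband", "mobile"},
    {"tv", "landline"},
    {"broadband", "tv"},
    {"mobile"},
]

top_three = pairwise_rules(baskets)[:3]
for a, c, support, confidence, lift in top_three:
    print(f"{a} -> {c}: support={support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

With this toy data the strongest rule is "landline -> tv", since every landline customer in the sample also subscribes to tv. A real analysis would normally use a dedicated library and consider rules with more than one product on each side.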
Q5. What are the barriers and challenges for Mobile Business Intelligence and the role of HTML5?
There are considerable barriers and challenges for mobile business intelligence, and its relatively lukewarm adoption speaks volumes. A major barrier is the number of different mobile platforms, which makes it hard for developers to create a seamless experience across all of them. Tagged to that is the issue of security. While security may not be top of mind when it comes to the simple mobile apps we use, the risks are much higher when business and competitive data is accessible on a mobile device.
Other challenges include memory, storage capacity, and varying screen sizes.
There are vendors offering native mobile solutions for their cloud and on-prem applications, but these may be limited to specific platforms and operating systems, sometimes even devices. HTML5 can help to address the shortcomings of native apps by providing a robust, dynamic mobile experience across any device, platform, screen size or operating system. It eliminates the need to build for multiple platforms and allows you to build a single app to deploy once.
Q6. How do you ensure data quality?
This is a little tricky to answer because even with the best methods and principles, you cannot always ensure one hundred percent quality data. There will always be areas outside of your control, as the saying GIGO (Garbage In, Garbage Out) reminds us. If the source data is flawed, inaccurate or inconsistent, then the output will be the same regardless of the data quality techniques applied.
That being said, a starting requirement when working with data is that the data is useful and relevant, although this may be fuzzy or unclear at the onset. Applying data cleansing methodologies to strip irrelevant characters and values is an important step combined with data conformity to ensure that related datasets are consistent and homogeneous. Data validation strategies should also be applied to check for errors and useless data.
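The three steps mentioned here (cleansing, conformity and validation) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `name` and `phone` fields, the function names and the rule that a phone number must have exactly ten digits are hypothetical and would differ for real datasets.

```python
import re


def cleanse(value):
    """Cleansing: replace control characters and collapse extra whitespace."""
    value = re.sub(r"[\x00-\x1f\x7f]", " ", value)
    return re.sub(r"\s+", " ", value).strip()


def conform_phone(value):
    """Conformity: keep digits only so related datasets share one format."""
    return re.sub(r"\D", "", value)


def validate(record):
    """Validation: return a list of problems found in a prepared record."""
    errors = []
    if not record["name"]:
        errors.append("missing name")
    if len(record["phone"]) != 10:  # hypothetical rule for this sketch
        errors.append("phone must have 10 digits")
    return errors


def prepare(raw):
    """Run cleansing and conformity, then report validation errors."""
    record = {
        "name": cleanse(raw.get("name", "")),
        "phone": conform_phone(raw.get("phone", "")),
    }
    return record, validate(record)


# Hypothetical raw rows from a source system.
good, good_errors = prepare({"name": "  Ann\t Lee \n", "phone": "(876) 555-0100"})
bad, bad_errors = prepare({"name": " ", "phone": "555-0100"})
print(good, good_errors)
print(bad, bad_errors)
```

Records that fail validation are reported rather than silently dropped, so they can be traced back to the source system, which is where garbage-in problems ultimately have to be fixed.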
Q7. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
To determine if insights are correct, they have to be proven with some degree of acceptance. We can prove the correctness of our insights by first establishing base models with projected outcomes, or we can prove it by testing the outcome with different parameters or scenarios. For good insights, we want to check the reliability, quality and usefulness of data, while relevant insights require evaluation from the initial stages. Therefore, we must examine the data available and whether it is applicable to the situation or problem at hand, in an effort to solve problems or contribute to their solution.
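One simple way to picture the base model idea is to compare a model's predictions against a naive baseline on the same actual outcomes. The sketch below uses mean absolute error for that comparison; the sales figures, the function names and the choice of error measure are assumptions for illustration only.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def beats_baseline(actual, model_pred, baseline_pred):
    """True when the model's error is lower than the base model's error."""
    model_error = mean_absolute_error(actual, model_pred)
    baseline_error = mean_absolute_error(actual, baseline_pred)
    return model_error < baseline_error


# Hypothetical daily sales and two sets of projections.
actual = [100, 90, 110, 95]
baseline = [100, 100, 100, 100]  # base model: always project 100
model = [98, 92, 108, 96]        # candidate model's projections

print(mean_absolute_error(actual, baseline))
print(mean_absolute_error(actual, model))
print(beats_baseline(actual, model, baseline))
```

Repeating the comparison on different periods or scenarios gives a rough sense of whether the insight holds up or only fits one slice of the data.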