Big Data Analytics – Interview with Duncan Ross
“The biggest technical challenge is actually the separation of the technology from the business use! Too often people are making the assumption that big data is synonymous with Hadoop, and any time that technology leads business things become difficult.” –Duncan Ross.
I asked Duncan Ross (Director Data Science, EMEA, Teradata) for his opinion on the current status of the Big Data Analytics industry.
Q1. What is, in your opinion, the current status of the Big Data Analytics industry?
Duncan Ross: The industry is still in an immature state, dominated by a single technology whilst at the same time experiencing an explosion of different technological solutions. Many of the technologies are far from robust or enterprise ready, often requiring significant technical skills to support the software even before analysis is attempted.
At the same time there is a clear shortage of analytical experience to take advantage of the new data. Nevertheless the potential value is becoming increasingly clear.
Q2. What are the main technical challenges for Big Data analytics?
Duncan Ross: The biggest technical challenge is actually the separation of the technology from the business use! Too often people are making the assumption that big data is synonymous with Hadoop, and any time that technology leads business things become difficult. Part of this is the difficulty of use that comes with the technology. It’s reminiscent of the command-line technologies of the 70s – it wasn’t until the GUI became popular that computing could take off.
Q3. Is BI really evolving into Consumer Intelligence? Could you give us some examples of existing use cases?
Duncan Ross: BI and big data analytics are far more than just Consumer Intelligence. Already more than 50% of IP traffic is non-human, and M2M will become increasingly important. But out of the connected vehicle we’re already seeing behaviour-based insurance pricing and condition-based maintenance. Individual movement patterns are being used to detect the early onset of illness.
New measures of voice of the customer are allowing companies to reach out beyond their internal data to understand the motivations and influence of their customers. We’re also seeing the growth of data philanthropy, with these approaches being used to benefit charities and not-for-profits.
Q4. How do you handle large volumes of data? When dealing with petabytes of data, how do you ensure scalability and performance?
Duncan Ross: Teradata has years of experience dealing with petabyte-scale data. The core of both our traditional platform and the Teradata Aster big data platform is a shared-nothing MPP system with a track record of proven linear scalability. For low information density data we provide a direct link to HDFS and work with partners such as Hortonworks.
Q5. How do you analyze structured data, semi-structured data, and unstructured data at scale?
Duncan Ross: The Teradata Aster technology combines the functionality of MapReduce within the well understood framework of ANSI SQL, allowing complex programmatic analysis to sit alongside more traditional data mining techniques. Many MapReduce functions have been simplified (from the users’ perspective) and can be called easily and directly – but more advanced users are free to write their own functions. By parallelizing the analysis within the database you get extremely high scalability and performance.
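As a rough illustration of the general pattern (this is not Teradata Aster's API), the Python sketch below shows a MapReduce-style analysis: rows are partitioned on a key, each partition is processed independently, and the partial results are merged. The sample rows, the helper names, and the page-view counting task are all hypothetical, and threads stand in for the database nodes that would do this work in an MPP system.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical clickstream rows: (user_id, page)
ROWS = [
    ("u1", "home"), ("u2", "home"), ("u1", "cart"),
    ("u3", "home"), ("u2", "checkout"), ("u1", "checkout"),
]


def partition_by_key(rows, n_partitions):
    """Hash-partition rows on user_id so each user lands in one partition."""
    partitions = [[] for _ in range(n_partitions)]
    for user_id, page in rows:
        partitions[hash(user_id) % n_partitions].append((user_id, page))
    return partitions


def map_partition(partition):
    """Map step: count page views per user within a single partition."""
    counts = defaultdict(int)
    for user_id, _page in partition:
        counts[user_id] += 1
    return dict(counts)


def reduce_counts(partials):
    """Reduce step: merge the per-partition counts into one result."""
    total = defaultdict(int)
    for partial in partials:
        for user_id, n in partial.items():
            total[user_id] += n
    return dict(total)


def page_views_per_user(rows, n_partitions=3):
    """Run the map step on every partition concurrently, then reduce."""
    partitions = partition_by_key(rows, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce_counts(partials)
```

Because each partition is handled independently, adding partitions (and workers) is what lets this style of analysis scale out.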
Q6. How do you reduce the I/O bandwidth requirements for big data analytics workloads?
Duncan Ross: Two methods: firstly by matching analytical approach to technology – set-based analysis using traditional SQL-based approaches, and programmatic and iterative analysis using MapReduce-style approaches.
Secondly by matching data ‘temperature’ to different storage media: hot data on SSD, cool data on fast disk drives, and cold data on cheap large (slow) drives.
The skill is to automatically rebalance without impacting users!
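The temperature idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how Teradata implements it: the `Block` type, the access-rate thresholds, and the tier names are hypothetical, and a real system would also have to schedule the moves so that users are not affected.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    reads_per_day: float  # observed access frequency (hypothetical metric)


# Hypothetical thresholds separating hot, cool, and cold data.
HOT_THRESHOLD = 100.0
COOL_THRESHOLD = 1.0


def tier_for(block):
    """Pick a storage tier from the block's access frequency."""
    if block.reads_per_day >= HOT_THRESHOLD:
        return "ssd"
    if block.reads_per_day >= COOL_THRESHOLD:
        return "fast_disk"
    return "slow_disk"


def plan_moves(blocks, current_placement):
    """Return {block name: (current tier, target tier)} for misplaced blocks."""
    moves = {}
    for block in blocks:
        target = tier_for(block)
        current = current_placement.get(block.name)
        if current != target:
            moves[block.name] = (current, target)
    return moves
```

Calling `plan_moves` periodically with fresh access statistics yields the list of blocks to migrate; blocks already on the right tier are left alone.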
Q7. What is the tradeoff between Accuracy and Speed that you usually need to make with Big Data?
Duncan Ross: In the world of data mining this isn’t really a problem as our approaches are based around sampling anyway. A more important distinction is between speed of analysis and business demand. We’re entering a world where data requires us to work with far more agility than we have in the past.
Q8. Brewer’s CAP theorem says that for really big distributed database systems you can have only 2 out of 3 from Consistency (“atomic”), Availability and (network) Partition Tolerance. Do you have practical evidence of this? And if yes, how?
Duncan Ross: No. Although it may be true for an arbitrarily big system, in most real world cases this isn’t too much of a problem.
Q9. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?
Duncan Ross: Well we aren’t using Hadoop, and as I commented earlier, equating Hadoop with Big Data is a dangerous assumption. Many analyses do not require anything approaching real time, but as closeness to an event becomes more important then we can look to scoring within an EDW environment or even embedding code within an operational system.
To do this requires you to understand the eventual use of your analysis when starting out, of course.
A great example of this approach is eBay’s Hadoop-Singularity-EDW configuration.
Q10. What is the typical in-database support for analytics operations?
Duncan Ross: It’s clear that moving the analysis to the data is more beneficial than moving data to the analysis. Teradata has great experience in this area.
We have examples of fast Fourier transforms, predictive modelling, and parameterised modelling all happening in highly parallel ways within the database. I once built and estimated 64 000 models in parallel for a project.
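To give a feel for what estimating many models in parallel looks like, here is a minimal Python sketch that fits one least-squares line per group of data concurrently. It is an assumption-laden toy, not the project described above: the model type (a straight line), the `fit_line` and `fit_all` helpers, and the use of threads in place of database nodes are all chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x.

    `points` is a list of (x, y) pairs with at least two distinct x values.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_all(groups, max_workers=8):
    """Fit an independent model for every group; returns {key: (intercept, slope)}."""
    keys = list(groups)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        fits = pool.map(fit_line, (groups[k] for k in keys))
        return dict(zip(keys, fits))
```

The key property is that no model depends on any other, so the work divides cleanly across however many workers are available.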
Q11. Cloud computing: What role does it play with Analytics? What are the main differences between Ground vs Cloud analytics?
Duncan Ross: The cloud is a useful approach for evaluating and testing new approaches, but has some significant drawbacks in terms of data security. Of course there is a huge difference between public and private cloud solutions.
Duncan Ross, Director Data Science, EMEA, Teradata.
Duncan has been a data miner since the mid 1990s. He was Director of Advanced Analytics at Teradata until 2010, leaving to become Data Director of Experian UK. He recently rejoined Teradata to lead their European Data Science team.
At Teradata he has been responsible for developing analytical solutions across a number of industries, including warranty and root cause analysis in manufacturing, and social network analysis in telecommunications.
These solutions have been developed directly with customers and have been deployed against some of the largest consumer bases in Europe.
In his spare time Duncan has been a city councillor and chair of a national charity, has founded an award-winning farmers’ market, and is one of the founding Directors of the Society of Data Miners.