ODBMS Industry Watch – Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation.

On Vertica and the new combined Micro Focus company. Interview with Colin Mahony (25 October 2017)

” There has been no uncertainty with respect to the Micro Focus leadership’s commitment to building on the great brand and product we have developed at Vertica.”– Colin Mahony

I have interviewed Colin Mahony, SVP & General Manager, Vertica Product Group, Micro Focus.
In this interview we covered the recent spin-off of HPE software into a new, combined Micro Focus company, and how this is affecting Vertica. We also covered the new release of Vertica 9, and the importance of Big Data analytics.

RVZ

Q1. With the recent spin-off of HPE software into a new, combined Micro Focus company, do you see things changing for Vertica?

Colin Mahony: From a product development, sales and customer support perspective – it’s been business as usual at Vertica leading up to and since the spin-merge with Micro Focus. Our focus, as always, is to build the best possible product and deliver world-class support for our growing customer base. That won’t change any time soon.

The biggest change I see post spin-merge is that Vertica is now part of a pure-play software company, rather than a business where the majority of revenue comes from hardware. Running a software company is a lot different than running a hardware business. Under HPE, the software assets sometimes struggled in establishing their own identity as part of a much larger hardware business. Micro Focus, on the other hand, is designed from the ground up to build, sell and support software for our customers; that's all we do. The new, combined Micro Focus is the 7th largest pure-play software company in the world, and we have the global scale to be an industry shaper.
But maybe even more exciting is the level of support and GTM independence that we are already seeing from Micro Focus in support of Vertica. You have likely seen Vertica’s logo and you’ll continue to see more of that, especially on the Vertica.com website that we launched in February and that already has almost 1 million page views! We have been structured uniquely in the new Micro Focus and this gives me complete confidence in our future. I’m genuinely excited about the opportunity to be in a business that is dedicated and focused purely on software – especially software with analytics built in, the new Micro Focus company mission – and the business value of that software for customers.

Q2. There are concerns that Micro Focus may end up managing mature software assets of HPE and extending their shelf life, rather than actively investing in feature developments. What is your take on this?

Colin Mahony: I fundamentally disagree with that. Micro Focus helps companies bridge their existing technologies with new infrastructure and applications. It helps them maximize their ROI while embracing innovation to address the opportunities of the new Hybrid IT and analytics-driven environment. It’s frankly wrong to expect customers to make investments in core technologies without working hard to maximize the investment in those technologies. Over the years, Micro Focus has taken core assets and made them modern, delivering significant value to the company and our customers.

It’s also important to note that the new, combined Micro Focus has an incredible depth and breadth of software assets in its portfolio – covering DevOps, IT Operations, Cloud, Security, Big Data and more – not all of which are mature products.
Take SUSE for instance, a Micro Focus product and the fastest growing open Linux platform. I’m very impressed with the approach that Micro Focus has on supporting growth businesses like this. I have the very same expectations for our Vertica business, especially because this is a massive new opportunity for Micro Focus, which prior to the spin-merge did not have a Big Data offering.
This means no confusion, no duplication of resources, and a lot of potential because we know that every company in virtually every industry is thinking about how to leverage analytics at the core of everything they do, and again, why “analytics built in” is at the core of the new company’s mission.

Q3. Will Micro Focus continue to develop Vertica?

Colin Mahony: There has been no uncertainty with respect to the Micro Focus leadership’s commitment to building on the great brand and product we have developed at Vertica. Since the spin-merge with Micro Focus was first announced in 2016, we have actually been reinvigorating the Vertica brand name, all based on the recognition that Micro Focus has a tremendous market opportunity in front of it with the advent of Big Data and the growing importance all companies are placing on the value of analytics. You can see this commitment with the build-out of our new website, www.vertica.com, our presence at industry trade shows and conferences, and more.

In a recent interview, Chris Hsu, CEO of the new, combined Micro Focus, expressed his commitment to big data analytics – and specifically Vertica – as the number one area he is most excited to focus on and grow within the portfolio. It’s an exciting time to be part of Vertica. We have an incredible opportunity in front of us.

Q4. Micro Focus now has a number of software assets covering Hybrid IT, DevOps, Security and more, where analytics is critical. Does or will Vertica play a role in those products?

Colin Mahony: Absolutely. Not only is there a strong commitment in continuing to develop Vertica as a product and brand, there’s wide recognition within Micro Focus that predictive analytics is critical for the success of data-centric enterprises, and therefore a critical component to the breadth of assets in our own portfolio.

Vertica is an ideal solution for embedded analytics. Businesses that embed Vertica stand out from the competition and deliver higher value to customers. Specifically designed for analytic workloads, Vertica’s speed and performance, advanced analytics, ease of deployment, and support for data scientists make it tailor-made for embedding. We now have an opportunity to embed these great analytical features in a range of Micro Focus software assets, something we’ve already begun to do in application delivery management, IT operations and security. As I’ve said, a core part of our company’s core mission moving forward is to provide customers with enterprise-grade scalable software with analytics built in. I see this as a large and growing opportunity for innovation here at Micro Focus.

Q5. You recently released Vertica Version 9, with major enhancements in cloud deployments and separation of compute and storage. Are these common themes for Vertica moving forward?

Colin Mahony: They are. Vertica has always been 100% committed to helping our customers deploy advanced analytics free from underlying infrastructure and hardware lock-in. We’ve seen that legacy data warehouse solutions have forced many enterprises into rigid and high-cost proprietary hardware and analytics solutions supporting only limited data formats and deployment options. As data formats and storage locations continuously evolve, organizations require a powerful and unified solution to analyze data in the right place at the right time, with the performance and economics that the business requires. Our continued commitment to this principle – and our support for any major cloud platform, whether AWS, Azure or GCP – is foundational to Vertica’s core.

Separation of compute and storage is a logical extension of this product development ethos. Vertica’s beta release of its new Eon Mode architecture, offering separation of compute and storage, provides rapid elastic scaling up and down of the Vertica cluster, with just-in-time workload-based provisioning.
An intelligent, new caching mechanism on the nodes enables organizations to benefit from Vertica's industry-leading query performance. Companies in the AWS ecosystem will be able to leverage AWS S3 for storage and Vertica's query-optimized analytics engine for processing speed to capitalize on cloud economics.

You can expect continued product development and investment in these areas.

Q6. With the explosion of data lakes and other external data storage (including Hadoop, AWS S3, etc.), does this complicate the analytical database market or change the dynamics of how and where you analyze data?

Colin Mahony: It certainly changes the big data landscape. Hadoop has been a boon to companies and organizations that want to store vast new volumes of unstructured data cheaply in the form of a data lake. AWS S3 has extended that cheap storage to the cloud. Although Hadoop stores massive volumes of unstructured data, performing analytics on Hadoop proved challenging. Despite this challenge, companies did not want to move large amounts of data in and out of their Hadoop data lakes. As a result, more and more companies were looking to build out enterprise-grade SQL analytics on top of their Hadoop investments. This created a tremendous opportunity for Vertica, and Vertica for SQL on Hadoop was born. Vertica SQL on Hadoop is the same binary, the same core engine, with the ability to deploy natively on Hadoop nodes. Since then, we’ve continued to innovate on how Vertica integrates with the various Hadoop distributions and file formats. We’ve leveraged our years of experience in the Big Data analytics marketplace to enable organizations to analyze their data not only in place, but in the right place – without data movement – while supporting any major cloud deployment for fast and reliable read and write for multiple data formats.

Starting with the release of Vertica 8, users could derive more value from their Hadoop data lakes with Vertica’s high-performance Parquet and ORC Readers that enable users to securely access and analyze data that resides in Hadoop data lakes without copying or moving the data. And now with our latest Vertica 9 release, we’ve introduced a new HDFS Parquet writer – built on Vertica’s fast and reliable ability to not only read, but now write data and results on HDFS – to derive and contribute immediate insights on growing data lakes. Organizations can use Vertica 9’s flexible and expanded deployment options across on-premise, private, and public clouds, and on Hadoop and AWS S3 data lakes, to adopt a best-fit analytical solution.
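As a rough sketch of the Parquet reader and the new Vertica 9 Parquet writer described above, the statements below read Parquet files in place from an HDFS data lake and write aggregated results back. Issuing them from Python with the vertica_python client, plus the connection details, table names, and paths, are illustrative assumptions rather than anything specified in the interview.

```python
import vertica_python

# Connection details are placeholders.
conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "analytics"}

conn = vertica_python.connect(**conn_info)
cur = conn.cursor()

# Read Parquet files sitting in an HDFS data lake without copying or moving them.
cur.execute("""
    CREATE EXTERNAL TABLE web_clicks_ext (
        user_id   INT,
        click_ts  TIMESTAMP,
        url       VARCHAR(2048)
    ) AS COPY FROM 'hdfs:///data/clicks/*.parquet' PARQUET
""")

# Write aggregated results back to the data lake with the Vertica 9 Parquet writer.
cur.execute("""
    EXPORT TO PARQUET (directory = 'hdfs:///data/clicks_by_day')
    AS SELECT click_ts::DATE AS day, COUNT(*) AS clicks
       FROM web_clicks_ext
       GROUP BY 1
""")

conn.close()
```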

The days of having to move data in and out of various databases and data lakes are coming to an end. In the future, more and more companies will bring analytics to the data, analyzing it in place. We believe Vertica is working at the forefront of this market transformation.

Q7. Over the last few releases, Vertica has made significant advancements in the area of in-database machine learning. How do you see this set of capabilities contributing to Vertica’s strategy and the success of your customers?

Colin Mahony: There’s no doubt that machine learning and predictive analytics are, and will continue to be a core differentiator for organizations. In today’s data-driven world, creating a competitive advantage depends on your ability to transform massive volumes of data into meaningful insights. Vertica has always supported the world’s leading data-driven organizations with the fastest SQL and extended SQL analytics. And now, by building machine learning functions directly into Vertica’s core — with no need to download and install separate packages — we are transforming the way data scientists and analysts across industries interact with data; removing barriers and accelerating time to value on predictive analytics projects. And it’s not just about developing the right algorithms and models. Our goal at Vertica is to support the entire machine learning and predictive analytics process, from data preparation to model evaluation and deployment – all using Vertica’s industry-leading scalability and performance. I’m incredibly excited to see these features transform data science and predictive analytics projects within our customer base, and for this reason, in-database machine learning will play a major role in Vertica’s future, and the future of our customers.

Our commitment to this area can be seen in the latest Vertica 9 release, which provides a comprehensive set of new Machine Learning algorithms for categorization, prediction, and avoiding overfitting, enhancing processing speed by eliminating the need for down-sampling and data movement. There's also support for new data-preparation functions for deriving greater meaning from the data, while improving the quality of analysis, and a streamlined end-to-end workflow that simplifies production deployment of Machine Learning models – particularly for customers that embed Vertica and require the ability to replicate models across clusters.
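To make the in-database workflow concrete, here is a hedged sketch of training and scoring a logistic regression model entirely in SQL, with no separate package to install. The vertica_python client, the table and column names, and the model name are assumptions; the function names follow Vertica's documented in-database machine learning interface.

```python
import vertica_python

conn = vertica_python.connect(host="vertica.example.com", port=5433,
                              user="dbadmin", password="secret",
                              database="analytics")
cur = conn.cursor()

# Train a logistic regression model inside the database.
cur.execute("""
    SELECT LOGISTIC_REG('churn_model', 'customer_train', 'churned',
                        'tenure_months, monthly_spend, support_calls')
""")

# Inspect the fitted model.
cur.execute("SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='churn_model')")
print(cur.fetchall())

# Score new customers in place, with no data movement or down-sampling.
cur.execute("""
    SELECT customer_id,
           PREDICT_LOGISTIC_REG(tenure_months, monthly_spend, support_calls
                                USING PARAMETERS model_name='churn_model') AS churn_flag
    FROM customer_current
""")
for customer_id, churn_flag in cur.fetchall():
    print(customer_id, churn_flag)

conn.close()
```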

————————

Colin Mahony, SVP & General Manager, Vertica Product Group, Micro Focus

Colin Mahony leads the Vertica Product Group for Micro Focus, helping the world's most data-driven organizations to leverage and monetize their business data. Vertica was founded in 2005 and is one of the industry's fastest growing advanced analytics platforms, with in-database machine learning, the ability to analyze data in the right place, and freedom from underlying infrastructure. Micro Focus also leverages Vertica to deliver embedded analytics across a very broad portfolio of enterprise-grade software.

In 2011, Colin joined Hewlett Packard as part of the highly successful acquisition of Vertica, and took on the responsibility of VP and General Manager for HP Vertica, where he guided the business to remarkable annual growth and recognized industry leadership. Colin brings a unique combination of technical knowledge, market intelligence, customer relationships, and strategic partnerships to one of the fastest growing and most exciting segments of HP Software.

Prior to Vertica, Colin was a Vice President at Bessemer Venture Partners focused on investments primarily in enterprise software, telecommunications, and digital media. He established a great network and reputation for assisting in the creation and ongoing operations of companies through his knowledge of technology, markets and general management in both small startups and larger companies. Prior to Bessemer, Colin worked at Lazard Technology Partners in a similar investor capacity.

Prior to his venture capital experience, Colin was a Senior Analyst at the Yankee Group serving as an industry analyst and consultant covering databases, BI, middleware, application servers and ERP systems. Colin helped build the ERP and Internet Computing Strategies practice at Yankee in the late nineties.

Colin earned an M.B.A. from Harvard Business School and a bachelor's degree in Economics with a minor in Computer Science from Georgetown University. He is an active volunteer with Big Brothers Big Sisters of Massachusetts Bay and the Joey Fund for Cystic Fibrosis as well as a mentor and board member of Year Up Boston.

————–

Resources

– What’s New in Vertica 9.0?, ODBMS.org, 22 Oct, 2017

– What’s New in Vertica 9.0: Eon Mode Beta, ODBMS.org, 22 Oct, 2017

– Vertica Version 9.0, ODBMS.org, 22 Oct, 2017

– Micro Focus Introduces Vertica 9, ODBMS.org, Sept. 27, 2017

Follow us on Twitter: @odbmsorg

##

On Apache Ignite, Apache Spark and MySQL. Interview with Nikita Ivanov (30 June 2017)

“Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.”–Nikita Ivanov.

I have interviewed Nikita Ivanov, CTO of GridGain.
The main topics of the interview are Apache Ignite, Apache Spark and MySQL, and how they perform for big data analytics.

RVZ

Q1. What are the main technical challenges of SaaS development projects?

Nikita Ivanov: SaaS requires that the applications be highly responsive, reliable and web-scale. SaaS development projects face many of the same challenges as software development projects including a need for stability, reliability, security, scalability, and speed. Speed is especially critical for modern businesses undergoing the digital transformation to deliver real-time services to their end users. These challenges are amplified for SaaS solutions which may have hundreds, thousands, or tens of thousands of concurrent users, far more than an on-premise deployment of enterprise software.
Fortunately, in-memory computing offers SaaS developers solutions to the challenges of speed, scale and reliability.

Q2. In your opinion, what are the limitations of MySQL® when it comes to big data analytics?

Nikita Ivanov: MySQL was originally designed as a single-node system and not with the modern data center concept in mind. A MySQL installation cannot scale to accommodate big data on a single node. Instead, MySQL must rely on sharding, or splitting a data set over multiple nodes or instances, to manage large data sets. However, most companies shard their database manually, making the creation and maintenance of their application much more complex. Manually creating an application that can then perform cross-node SQL queries on the sharded data multiplies the level of complexity and cost.

MySQL was also not designed to run complicated queries against massive data sets. The MySQL optimizer is quite limited, executing a single query at a time using a single thread. A MySQL query can neither scale across multiple CPU cores in a single system nor execute distributed queries across multiple nodes.

Q3. What solutions exist to enhance MySQL’s capabilities for big data analytics?

Nikita Ivanov: Companies which require real-time analytics may attempt to manually shard their database. Tools such as Vitess, a framework YouTube released for MySQL sharding, or ProxySQL are often used to help implement sharding.
To speed up queries, caching solutions such as Memcached and Redis are often deployed.

Many companies turn to data warehousing technologies. These solutions require ETL processes and a separate technology stack which must be deployed and managed. There are many external solutions, such as Hadoop and Apache Spark, which are quite popular. Vertica and ClickHouse have also emerged as analytics solutions for MySQL.

Apache Ignite offers speed, scale and reliability because it was built from the ground up as a highly performant and highly scalable distributed in-memory computing platform.
In contrast to the MySQL single-node design, Apache Ignite automatically distributes data across nodes in a cluster, eliminating the need for manual sharding. The cluster can be deployed on-premise, in the cloud, or in a hybrid environment. Apache Ignite easily integrates with Hadoop and Spark, using in-memory technology to complement these technologies and achieve significantly better performance and scale. The Apache Ignite In-Memory SQL Grid is highly optimized and easily tuned to execute high-performance ANSI-99 SQL queries. The In-Memory SQL Grid offers access via JDBC/ODBC and the Ignite SQL API for external SQL commands or integration with analytics visualization software such as Tableau.
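A minimal sketch of what the data grid and the In-Memory SQL Grid look like from an application, here via Ignite's Python thin client (pyignite). The client choice, host/port, and cache/table names are assumptions, not taken from the interview; JDBC/ODBC access works analogously.

```python
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)  # default thin-client port

# Data grid: a distributed, queryable key-value cache.
cache = client.get_or_create_cache("session_cache")
cache.put("user:42", "Alice")
print(cache.get("user:42"))

# SQL grid: ANSI-99 SQL over the same cluster.
client.sql("CREATE TABLE IF NOT EXISTS city (id INT PRIMARY KEY, name VARCHAR)")
client.sql("INSERT INTO city (id, name) VALUES (?, ?)", query_args=[1, "Amsterdam"])
for row in client.sql("SELECT id, name FROM city"):
    print(row)

client.close()
```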

Q4. What is exactly Apache® Ignite™?

Nikita Ivanov: Apache Ignite is a high-performance, distributed in-memory platform for computing and transacting on large-scale data sets in real-time. It is 1,000x faster than systems built using traditional database technologies that are based on disk or flash technologies. It can also scale out to manage petabytes of data in memory.

Apache Ignite includes the following functionality:

· Data grid – An in-memory key value data cache that can be queried

· SQL grid – Provides the ability to interact with data in-memory using ANSI SQL-99 via JDBC or ODBC APIs

· Compute grid – A stateless grid that provides high-performance computation in memory using clusters of computers and massive parallel processing

· Service grid – A service grid in which grid service instances are deployed across the distributed data and compute grids

· Streaming analytics – The ability to consume an endless stream of information and process it in real-time

· Advanced clustering – The ability to automatically discover nodes, eliminating the need to restart the entire cluster when adding new nodes

Q5. How does Apache Ignite differ from other in-memory data platforms?

Nikita Ivanov: Most in-memory computing solutions fall into one of three types: in-memory data grids, in-memory databases, or a streaming analytics engine.
Apache Ignite is a full-featured in-memory computing platform which includes an in-memory data grid, in-memory database capabilities, and a streaming analytics engine. Furthermore, Apache Ignite supports distributed ACID compliant transactions and ANSI SQL-99 including support for DML and DDL via JDBC/ODBC.

Q6. Can you use Apache® Ignite™ for Real-Time Processing of IoT-Generated Streaming Data?

Nikita Ivanov: Yes, Apache Ignite can ingest and analyze streaming data using its streaming analytics engine which is built on a high-performance and scalable distributed architecture. Because Apache Ignite natively integrates with Apache Spark, it is also possible to deploy Spark for machine learning at in-memory computing speeds.
Apache Ignite supports both high volume OLTP and OLAP use cases, supporting Hybrid Transactional Analytical Processing (HTAP) use cases, while achieving performance gains of 1000x or greater over systems which are built on disk-based databases.

Q7. How do you stream data to an Apache Ignite cluster from embedded devices?

Nikita Ivanov: It is very easy to stream data to an Apache Ignite cluster from embedded devices.
The Apache Ignite streaming functionality allows for processing never-ending streams of data from embedded devices in a scalable and fault-tolerant manner. Apache Ignite can handle millions of events per second on a moderately sized cluster for embedded devices generating massive amounts of data.

Q8. Is this different than using Apache Kafka?

Nikita Ivanov: Apache Kafka is a distributed streaming platform that lets you publish and subscribe to data streams. Kafka is most commonly used to build a real-time streaming data pipeline that reliably transfers data between applications. This is very different from Apache Ignite, which is designed to ingest, process, analyze and store streaming data.

Q9. How do you conduct real-time data processing on this stream using Apache Ignite?

Nikita Ivanov: Apache Ignite includes a connector for Apache Kafka so it is easy to connect Apache Kafka and Apache Ignite. Developers can either push data from Kafka directly into Ignite’s in-memory data cache or present the streaming data to Ignite’s streaming module where it can be analyzed and processed before being stored in memory.
This versatility makes the combination of Apache Kafka and Apache Ignite very powerful for real-time processing of streaming data.
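A hedged sketch of the first pattern described above, pushing events from Kafka directly into an Ignite cache, using the kafka-python consumer and the pyignite thin client. The libraries, topic, broker address, and cache name are all assumptions for illustration; Ignite's own Kafka connector or its Java streaming API could be used instead.

```python
import json
from kafka import KafkaConsumer
from pyignite import Client

# Ignite thin client (host/port are placeholders).
ignite = Client()
ignite.connect("127.0.0.1", 10800)
readings = ignite.get_or_create_cache("sensor_readings")

# Kafka consumer on an assumed device-events topic.
consumer = KafkaConsumer(
    "device-events",
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Key each event by device id and timestamp before it lands in memory.
    readings.put(f"{event['device_id']}:{event['ts']}", json.dumps(event))
```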

Q10. Is this different than using Spark Streaming?

Nikita Ivanov: Spark Streaming enables processing of live data streams. This is merely one of the capabilities that Apache Ignite supports. Although Apache Spark and Apache Ignite utilize the power of in-memory computing, they address different use cases. Spark processes but doesn’t store data. It loads the data, processes it, then discards it. Ignite, on the other hand, can be used to process data and it also provides a distributed in-memory key-value store with ACID compliant transactions and SQL support.
Spark is also for non-transactional, read-only data while Ignite supports non-transactional and transactional workloads. Finally, Apache Ignite also supports purely computational payloads for HPC and MPP use cases while Spark works only on data-driven payloads.

Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.

Qx. Is there anything else you wish to add?

Nikita Ivanov: The world is undergoing a digital transformation which is driving companies to get closer to their customers. This transformation requires that companies move from big data to fast data, the ability to gain real-time insights from massive amounts of incoming data. Whether that data is generated by the Internet of Things (IoT), web-scale applications, or other streaming data sources, companies must put architectures in place to make sense of this river of data. As companies make this transition, they will be moving to memory-first architectures which ingest and process data in-memory before offloading to disk-based datastores, and they will increasingly be applying machine learning and deep learning to better understand the data. Apache Ignite continues to evolve in directions that will support and extend the abilities of memory-first architectures and machine learning/deep learning systems.

——–
Nikita Ivanov, Founder & CTO, GridGain
Nikita Ivanov is founder of the Apache Ignite project and CTO of GridGain Systems, started in 2007. Nikita has led GridGain to develop advanced and distributed in-memory data processing technologies – the top Java in-memory data fabric, started every 10 seconds around the world today. Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, and contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. He is an active member of the Java middleware community and a contributor to the Java specification. He is also a frequent international speaker with over two dozen talks at developer conferences globally.

Resources

Apache Ignite Community Resources

apache/ignite on GitHub

Yardstick Apache Ignite Benchmarks

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite

Misys Uses GridGain to Enable High Performance, Real-Time Data Processing

The Spark Python API (PySpark)

Related Posts

Supporting the Fast Data Paradigm with Apache Spark. BY Stephen Dillon, Data Architect, Schneider Electric

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah. ODBMS Industry Watch, March 13, 2017

Follow ODBMS.org on Twitter: @odbmsorg

##

Identity Graph Analysis at Scale. Interview with Niels Meersschaert (9 May 2017)

“I’ve found the best engineers actually have art backgrounds or interests. The key capability is being able to see problems from multiple perspectives, and realizing there are multiple solutions to a problem. Music, photography and other arts encourage that.”–Niels Meersschaert.

I have interviewed Niels Meersschaert, Chief Technology Officer at Qualia. The Qualia team relies on over one terabyte of graph data in Neo4j, combined with larger amounts of non-graph data to provide major companies with consumer insights for targeted marketing and advertising opportunities.

RVZ

Q1. Your background is in Television & Film Production. How does it relate to your current job?

Niels Meersschaert: Engineering is a lot like producing. You have to understand what you are trying to achieve, understand what parts and roles you'll need to accomplish it, all while doing it within a budget. I've found the best engineers actually have art backgrounds or interests. The key capability is being able to see problems from multiple perspectives, and realizing there are multiple solutions to a problem. Music, photography and other arts encourage that. Engineering is both art and science, and creativity is a critical skill for the best engineers. I also believe that a breadth of languages is critical for engineers.

Q2. Your company collects data on more than 90% of American households. What kind of data do you collect and how do you use such data?

Niels Meersschaert: We focus on high quality data that is indicative of commercial intent. Some examples include wishlist interaction, content consumption, and location data. While we have coverage of a huge swath of the American population, a key feature is that we have no personally identifiable information. We use anonymous unique identifiers.
So, we know this ID did actions indicative of interest in a new SUV, but we don’t know their name, email address, phone number or any other personally identifiable information about a consumer. We feel this is a good balance of commercial need and individual privacy.

Q3. If you had to operate with data from Europe, what would be the impact of the new EU General Data Protection Regulation (GDPR) on your work?

Niels Meersschaert: Europe is a very different market than the U.S. and many of the regulations you mentioned do require a different approach to understanding consumer behaviors. Given that we avoid personal IDs, our approach is already better situated than that of many peers that rely on PII.

Q4. Why did you choose a graph database to implement your consumer behavior tracking system?

Niels Meersschaert: Our graph database is used for ID management. We don’t use it for understanding the intent data, but rather recognizing IDs. Conceptually, describing the various IDs involved is a natural fit for a graph.
As an example, a conceptual consumer could be thought of as the top of the graph. That consumer uses many devices and each device could have 1 or more anonymous IDs associated with it, such as cookie IDs. Each node can represent an associated device or ID and the relationships between each node allow us to see the path. A key element we have in our system is something we call the Borg filter. It’s a bit of a reference to Star Trek, but essentially when we find a consumer is too connected, i.e. has dozens or hundreds of devices, we remove all those IDs from the graph as clearly something has gone wrong. A graph database makes it much easier to determine how many connected nodes are at each level.

Q5. Why did you choose Neo4j?

Niels Meersschaert: Neo4J had a rich query language and very fast performance, especially if your hot set was in RAM.

Q6. You manage one terabyte of graph data in Neo4j. How do you combine them with larger amounts of non-graph data?

Niels Meersschaert: You can think of the graph as a compression system for us. While consumer actions occur on multiple devices and anonymous IDs, they represent the actions of a single consumer. This actually simplifies things for us, since the number of unique grouping IDs is much smaller than the number of unique source IDs. It also allows us to eliminate non-human IDs from the graph. This does mean we see the world in different ways than many peers. As an example, if you focus only on cookie IDs, you tend to have a much larger number of unique IDs than actual consumers those represent. Sadly, the same thing happens with website monthly uniques: many are highly inflated, both in the number of unique people they represent and because many of the IDs are non-human. Ultimately, the entire goal of advertising is to influence consumers, so we feel that having a better representation of actual consumers allows us to be more effective.

Q7. What are the technical challenges you face when blending data with different structure?

Niels Meersschaert: A key challenge is finding some unifying element between different systems or structures that links the data. What we did with Neo4j is create a unique property on the nodes that we use for interchange. The internal node IDs that are part of Neo4j aren't something we use except internally within the graph DB.
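For illustration, such an interchange property can be backed by a uniqueness constraint and used as the MERGE key, so external systems never depend on Neo4j's internal node IDs. The label and property names below are again assumptions, and the constraint syntax is the Neo4j 3.x form.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Enforce uniqueness of the interchange property.
    session.run("CREATE CONSTRAINT ON (d:Device) ASSERT d.external_id IS UNIQUE")

    # Upsert by the interchange key rather than by the internal node id.
    session.run(
        "MERGE (d:Device {external_id: $ext_id}) SET d.last_seen = timestamp()",
        ext_id="cookie-8f3a2c",
    )

driver.close()
```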

Q8. If your data is sharded manually, how do you handle scalability?

Niels Meersschaert: We don’t shard the data manually, but scalability is one of the biggest challenges. We’ve spent a lot of time tuning queries and grouping operations to take advantage of some of the capabilities of Neo4J and to work around some limitations it has. The vast majority of graph customers wouldn’t have the volume nor the volatility of data that we do, so our challenges are unique.

Q9. What other technologies do you use and how they interact with Neo4j?

Niels Meersschaert: We use the classic big data tools like Hadoop and Spark. We also use MongoDB and Google’s Big Query. If you look at the graph as the truth set of device IDs, we interact with it on ingestion and export only. Everything in the middle can operate on the consumer ID, which is far more efficient.

Q10. How do you measure the ROI of your solution?

Niels Meersschaert: There are a few factors we consider. First is how much does the infrastructure cost us to process the data and output? How fast is it in terms of execution time? How much development effort does it take relative to other solutions? How flexible is it for us to extend it? This is an ever evolving situation and one we always look at how to improve, especially as a smaller business.

———————————-

Niels Meersschaert
I’ve been coding since I was 7 years old on an Apple II. I’d built radio control model cars and aircraft as a child and built several custom chassis using controlled flex as suspension to keep weight & parts count down. So, I’d had an early interest in both software and physical engineering.

My father was from the Netherlands and my maternal grandfather was a linguist fluent in 43 languages. When I was a kid, my father worked for the airlines, so we traveled often to Europe to see family, and I grew up multilingual. Computer languages are just different ways to describe something; the basic concepts are similar, just as they are in spoken languages, albeit with different grammatical and syntax structures. Whether you're speaking French, or writing a program in Python or C, the key is that you are trying to get your communication across to the target of your message, whether it is another person or a computer.

I originally started university in aeronautical engineering, but in my sophomore year, Grumman let go about 3000 engineers, so I didn’t think the career opportunities would be as great. I’d always viewed problem solutions as a combination of art & science, so I switched majors to one in which I could combine the two.

After school I worked producing and editing commercials and industrials, often with special effects. I got into web video early on & spent a lot of time on compression and distribution systems. That led to working on search, and bringing the linguistics back front and center again. I then combined the two and came full circle back to advertising, but from the technical angle at Magnetic, where we built search retargeting. At Qualia, we kicked this into high gear, where we understand consumer intent by analyzing sentiment, content and actions across multiple devices and environments and the interaction and timing between them to understand the point in the intent path of a consumer.

Resources

EU General Data Protection Regulation (GDPR):

Reform of EU data protection rules

European Commission – Fact Sheet Questions and Answers – Data protection reform

General Data Protection Regulation (Wikipedia)

Neo4j Sandbox: The Neo4j Sandbox enables you to get started with Neo4j, with built-in guides and sample datasets for popular use cases.

Related Posts

LDBC Developer Community: Benchmarking Graph Data Management Systems. ODBMS.org, 6 APR, 2017

Graphalytics benchmark. ODBMS.org, 6 APR, 2017
The Graphalytics benchmark is an industrial-grade benchmark for graph analysis platforms such as Giraph. It consists of six core algorithms, standard datasets, synthetic dataset generators, and reference outputs, enabling the objective comparison of graph analysis platforms.

Collaborative Filtering: Creating the Best Teams Ever. By Maurits van der Goes, Graduate Intern | February 16, 2017

Follow us on Twitter: @odbmsorg

##

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah (13 March 2017)

“What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).”–Amr Awadallah

I have interviewed Amr Awadallah, Chief Technology Officer at Cloudera.
The main topics of the interview are: the new developments in Apache Spark 2.0 Beta and the Hadoop 3.0.0-alpha1 release; the lessons learned from Amr's experience of using Hadoop at Yahoo!; and the business problems that the world's leading organisations have.

RVZ

Q1. Before Cloudera, you served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organisations to use Hadoop for data analysis and business intelligence. What are the main lessons you learned in that period?

Amr Awadallah: Couple of things. First, I learned that Hadoop is capable of solving all the business intelligence problems that I had at Yahoo.
Namely:
(1) our systems weren’t scaling fast enough (we needed to cut down transformation times from hours to minutes),
(2) our systems weren’t economical on a $/TB basis thus making it hard to retain valuable data for longer time periods, and (3) we needed new methods to be able to store and analyze semi-structured (e.g. logs) and unstructured data (e.g. social media).
By implementing Hadoop in our team we saw first hand how it can address all these problems. The second lesson that I learned was that Hadoop, back then, was very rough to deploy and program against (it took us many months to deploy it and reprogram our transformations to run on it). It was these lessons that made it clear that there was room for a startup to focus on Hadoop since (1) it was solving very real data problems that many organizations would face, and (2) it needed a lot of polish to make it work smoothly, securely, and reliably within the enterprise.

Q2. In 2008 you founded Cloudera together with Mike Olson (Oracle), Jeff Hammerbacher (Facebook) and Christophe Bisciglia (Google). What was your main motivation at that time?

Amr Awadallah: Pretty much to do what I describe above, we wanted to make the Hadoop technology easy to use for organizations. That included: (1) creating a distribution for Hadoop that bundles all the necessary open-source projects that make it work (we call that CDH, short for Cloudera Distribution for Apache Hadoop). (2) We also created a number of proprietary system management, security, and meta-data management tools around CDH to make it easier for organizations to deploy and operate Hadoop in production.

Q3. What are the typical challenging business problems that world’s leading organisations have?

Amr Awadallah: The technology we provide is very powerful and can be used to solve many problems across many industries, but we see four common themes: The first is simply using Hadoop as a faster, bigger, cheaper system for business intelligence and data analytics. i.e. a lot of organizations just use us to do things they have been doing already, just doing these things in a more economically scalable way.
The second use case is around deeper understanding of customers, i.e. moving away from segmenting all customers into a number of predefined buckets, but rather creating a dynamic micro-segment addressing each customer in a more precise way (thus reducing false positives).
The third use case is about using data to build better products and services, and this use-case is catalyzed by the internet-of-things. Thanks to smart sensors we are able to measure the real world better than ever before; so this use-case is about taking all that data and leveraging it to either enhance our current product/service offerings, or build entirely new ones.
The fourth use case is about reducing business risk, and it manifests itself in a number of different sub-cases depending on the industry. For example, cyber-security is one of the key ways to reduce risk, and we have an open source project co-developed with Intel, called Apache Spot, which organizations can use to collect all their network flow data then use Spark machine learning algorithms to detect the anomalies in that data. Anti-money laundering and fraud detection is another way that our banking customers employ our platform to reduce risk within their businesses. Similarly, our insurance industry customers use our system to detect fraudulent claims, etc.

Q4. Can they be solved by analysing data? Can you give us some examples of how the use of advanced analytics drive business decisions?

Amr Awadallah: Yes, all the problems mentioned above can be solved with data. I want to highlight though that this isn’t necessarily about business decisions, which is what the Business Intelligence movement was about (we just help make that cheaper and faster). What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).
One of my favorite examples is a solution that one of our customers built to give voice to premature babies in neonatal intensive care units. They analyze the signals coming from the baby (sounds, blood pressure, heart rate, temperature, few brain signals), and based on that a message appears on the monitor above the infant showing the nurse if they are hungry, distressed from too much noise or light, etc.
That is really what we mean by using data to create new products and services that weren’t possible before (and not just reports/dashboard).

Q4. Graphs are important. Is it possible to do scalable graph analytics? If yes, how?

Amr Awadallah: Graphs are indeed important, and a lot of our customer use-cases trace back to that (not just social media analytics; for example, anti-money laundering requires analyzing relationships between many financial accounts to detect bad behaviors, and similarly for cyber security applications). I think scalability depends a fair bit on what's being analyzed and what we mean by scalable. But for most practical purposes I would say Spark's GraphX is good enough. For example, you can compute PageRank fairly efficiently and scalably on a cluster using GraphX.
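GraphX itself is a Scala API; the sketch below shows the same PageRank idea from PySpark using the separate GraphFrames package (an assumed substitute, not something mentioned in the interview) on a toy graph of accounts and transfers.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # requires launching Spark with the graphframes package

spark = SparkSession.builder.appName("pagerank-sketch").getOrCreate()

# Toy account graph; in the anti-money-laundering case these would be
# financial accounts and the transfers between them.
vertices = spark.createDataFrame(
    [("a", "Account A"), ("b", "Account B"), ("c", "Account C")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "transfer"), ("b", "c", "transfer"), ("c", "a", "transfer")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()
```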

Q5. Data security is increasing important. The risk is due to the growing number of device endpoints. What solutions do exist to minimise such risk?

Amr Awadallah: A comprehensive enterprise data security strategy seeks to mitigate the risks presented by a growing number of potentially compromised endpoints connecting to corporate networks. Endpoint security will enable one or all of the following preventative controls:
The first is policy-based enforcement of endpoint security configuration prior to granting an endpoint access to network-based corporate assets. This ensures that any endpoint connected to corporate networks meets minimum requirements for endpoint security configuration.
The second measure is endpoint based anti-malware software (the existence of which may be a policy requirement to connect to the network per the first measure). Anti-malware prevents malicious code from infecting endpoints by monitoring for changes to system configuration and unusual activity or processes.
The third measure is endpoint encryption of corporate data on hard drives, folders and even removable media.
As mentioned above we also collaborate with Intel on Apache Spot, which tracks network flow patterns to detect anomalous communication behavior between different devices (including end point devices). Apache Spot just recently won InfoWorld 2017 Tech of the Year Award. Other advanced analytics security partners we closely work with are: CounterTack, Securonix, Niara, and Jask.

Q6. You recently announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform. How does it work? And how does it differ from the Hadoop-based data platform?

Amr Awadallah: First, at a meta-level, Hadoop (MapReduce specifically) was very good at achieving scalable computation by spreading jobs across many CPU cores and hard disk spindles. That said, MapReduce wasn’t very efficient in how it leveraged memory to optimize the performance of data processing pipelines that have many stages or iterations.
The main power of Spark, that made it take over from MapReduce, was how it truly leveraged memory to achieve better performance in deep or iterative data pipelines. That coupled with a simpler developer API made Spark take over very quickly from MapReduce.
Most of our new customer implementations for data processing or data science tend to be in Spark these days, versus MapReduce.
I should clarify however that this doesn’t mean that Hadoop is dead as some say. Apache Hadoop is comprised of three key subsystems: (1) MapReduce for computation, (2) YARN for resource scheduling, and (3) HDFS for storage. Spark only replaces MapReduce, we still rely heavily on both YARN and HDFS.

That said, the most notable features in Apache Spark 2.0 are:

1) Dataset API: It is a new API that represents the distributed collections of objects processed by Spark’s execution engine. It is an extension of Spark’s Dataframe API. It improves upon the Dataframe API by providing type-safe, object oriented programming interfaces. Users can now write User-Defined Functions and Lambda functions that provide compile time type safety. With the Dataset API, users benefit from optimized operations (like sort, join, hash, etc) in the SparkSQL engine, while also getting compile time type safety for user defined functions.

2) Model & Pipeline Persistence in Spark’s ML library: Machine learning Pipelines built with Spark’s ML library can now be serialized to a file and read back in.
The ability to save and reload these pipelines makes it easy for users to perform version control on the pipelines and safely distribute the pipelines. This helps in operationalizing them in production systems.

3) Structured Streaming: New stream processing API and engine that provides SQL like abstractions for authoring operations on data streams, and also improves performance by using the SparkSQL engine for processing the data streams. However, this is still an experimental API and not ready for production usage yet.
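Two of the three features above lend themselves to short PySpark sketches. First, model and pipeline persistence (feature 2): a fitted pipeline is written to a path and reloaded, for example on a production cluster. The tiny training table and the path are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-persistence").getOrCreate()

train = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (3.0, 4.0, 0.0), (4.0, 3.0, 1.0)],
    ["f1", "f2", "label"])

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train)

# New in Spark 2.0: the fitted pipeline can be saved and reloaded elsewhere.
model.write().overwrite().save("/tmp/example_pipeline")
reloaded = PipelineModel.load("/tmp/example_pipeline")
reloaded.transform(train).select("prediction").show()
```

Second, a minimal Structured Streaming job (feature 3): the classic word count over a socket source, written against the same DataFrame-style API. The socket source and console sink are placeholders, and as noted above this API was still experimental in Spark 2.0.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Treat lines arriving on a socket as an unbounded table.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the running counts to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```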

Besides the above 3 notable enhancements, there are a bunch of performance and scalability improvements across the board.

Q7. Apache Impala vs. Amazon Redshift: How Does Redshift Compare to Impala?

Amr Awadallah: Apache Impala is an analytic database engine architecturally designed to perform high-performance highly-concurrent SQL analytics on scalable, open data platforms like Hadoop’s HDFS and Amazon S3.
Impala decouples data storage from compute and lets users query data without having to move/load data specifically into an Impala storage-engine (it doesn’t have one). This architectural difference uniquely enables Impala to deliver a more flexible Business Intelligence experience than traditional database architectures like Redshift (which requires pre-loading the data).

Some of the key benefits of the Impala approach include:

* On-demand resources that are immediately ready to query existing S3 data without loading to a different data silo
* Ability to elastically grow/shrink clusters as needed due to decoupled storage and compute
* More predictable, multi-tenant isolation due to the ability to have multiple Impala clusters sharing a common S3 data repository
* Ability to share common data not only amongst Impala clusters, but also any application that runs on cloud-native S3 storage (for example, you can have both Apache Impala and Apache Spark run against the same data asset in S3, while it isn’t possible to have Apache Spark easily access the data stored in Redshift, it has to go through SQL first).
* Greater flexibility to explore new use cases, analytics, and data by directly querying S3 without rigid traditional data models and ETL
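As a concrete illustration of querying existing S3 data in place without a load step, the sketch below issues Impala SQL from Python via the impyla client. The client library, coordinator host, bucket, and table schema are illustrative assumptions rather than anything from the interview; any JDBC/ODBC BI tool could issue the same statements.

```python
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# Point an external table at Parquet files already sitting in S3, with no load step.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS clicks (
        user_id STRING,
        ts      TIMESTAMP,
        url     STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://my-bucket/clickstream/'
""")

# Query the S3 data directly with high-concurrency SQL.
cur.execute("""
    SELECT url, COUNT(*) AS hits
    FROM clicks
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
for url, hits in cur.fetchall():
    print(url, hits)

conn.close()
```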

Not only does Impala deliver this additional flexibility, it does so at greater cost-performance and scalability compared to Redshift. See the following benchmark for data on that.

That said, Redshift's sweet spot is a different target, as a smaller data mart: most Redshift installations are in the dozens-of-nodes range, where Redshift's limitations in scalability, elasticity, flexibility, and its requirement to maintain separate copies of data are less critical.

Q8. What is Apache Kudu, and why is it relevant for Impala Users?

Amr Awadallah: Historically we had two storage engines in our distribution: (1) HDFS which is optimized for high-throughput analytics, but doesn’t support updates/inserts and (2) HBase which is optimized for low-latency updates/inserts but isn’t good for doing high-throughput queries. To build a proper data warehouse or time-series analytics system, you typically still need to make updates/inserts and that was why we created Apache Kudu.

Kudu is a new storage system that combines the benefits of both HDFS and HBase into one: it allows for low-latency updates/inserts, but also supports high-throughput analytical queries (i.e. fast analytics on fast moving data).
Unlike HDFS, Kudu is not a file-system, it is a record-based system, so the unit of storage is a record as opposed to a file. This allows Kudu to unlock Impala for real-time streaming applications that were not possible with HDFS.
In HDFS the data would only be visible to Impala after we finish closing the file, which typically happens after a large number of records are accumulated (that adds latency between when records are written to when they become visible to the analytical engine). With Kudu as soon as a record is written it is immediately visible to the Impala analytical engine. Finally, just like HDFS and HBase, the Kudu storage engine is fully integrated with our entire stack, not just Impala.
For example, you can also use Apache Spark for machine-learning jobs directly against Kudu.
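A short sketch of the Impala-on-Kudu pattern just described: a record-based table with a primary key that supports low-latency upserts, where newly written rows are immediately visible to analytical queries. The table, columns, and the use of the impyla client are assumptions for illustration.

```python
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# A Kudu-backed table: record-oriented storage with a primary key.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        device_id STRING,
        ts        TIMESTAMP,
        reading   DOUBLE,
        PRIMARY KEY (device_id, ts)
    )
    PARTITION BY HASH (device_id) PARTITIONS 16
    STORED AS KUDU
""")

# Low-latency upsert; the row is visible to analytics as soon as it is written.
cur.execute("UPSERT INTO sensor_readings VALUES ('dev-001', now(), 21.7)")

cur.execute("SELECT COUNT(*) FROM sensor_readings WHERE device_id = 'dev-001'")
print(cur.fetchone())

conn.close()
```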

Q9. The Apache Hadoop project recently announced its 3.0.0-alpha1 release. What is it?

Amr Awadallah: HDFS Erasure Encoding is really the main exciting new feature in Hadoop 3. Traditionally HDFS required three replicas, by default, for every data block to achieve durability, concurrent performance, and availability. Using erasure encoding techniques, HDFS in Hadoop 3 allows us to significantly reduce the storage overhead from 3x (i.e. 200%) to just 20% extra bits for parity. This will allow us to achieve the same durability benefits of 3x replication, but it comes at the cost of potentially lower concurrent performance (when more than one job is trying to access the same block at the same time) and lower availability resilience in the face of top-of-rack switch failures (less of an issue these days).

Other cool additions are ATS v2 and classpath isolation which you can read more about here

Q10. What is the roadmap ahead for Cloudera Enterprise?

Amr Awadallah: We don’t discuss details of our product roadmap publicly, but there are three guiding themes for us in 2017: The first theme is fast-analytics on fast-moving data (which I covered above in regards to Kudu).
The second theme is cloud, which is making Cloudera Enterprise work better in cloud environments, and make it easier to move workloads (and skill sets) from on-premise clusters to transient cloud clusters in AWS, Azure, and/or Google Cloud.
The third theme is simplifying data-science and machine learning development, especially reducing the time from when a new algorithm is developed to how it can be deployed into production (stay tuned for more on that front).
——————————
Amr Awadallah, Ph.D. Chief Technology Officer, Cloudera
Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo after they acquired his first startup, VivaSmart, in July of 2000. Amr holds Bachelor's and Master's degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.

Resources

Download Page for Apache Spark™

Apache Impala supported by Cloudera Enterprise

DATA-X: Videobook- 8 short videos introduce query analytics for Apache Hadoop

A package that allows R developers to use Hadoop HBase

Book: Big Data Analytics with Spark

Related Posts

Streaming Analytics for Chain Monitoring. By Natalino Busa, Head of Data Science at Teradata. ODBMS.org, Thursday, January 12, 2017

Five Challenges to IoT Analytics Success. By Dr. Srinath Perera. ODBMS.org SEPTEMBER 23, 2016

Next-Generation Genomics Analysis with Apache Spark. by Jason Bailey. ODBMS.org Thursday, June 30th, 2016

Supporting the Fast Data Paradigm with Apache Spark. BY Stephen Dillon, Data Architect, Schneider Electric. ODBMS.org, 23 APR, 2016

– The new series of Q&A with Leading Data Scientists– ODBMS.org:
Part II
Part I

Follow us on Twitter: @odbmsorg

##

On Data Analysis. Interview with Rob Winters (9 January 2017)

“I’ve managed several employees who have successfully transitioned from an operations role to an analytics role. In fact, some of them have become my best analysts because they have brought a deeper domain knowledge to their analyses than someone approaching from the outside may have done. “–Rob Winters

I have interviewed Rob Winters, Head of Business Intelligence at TravelBird. The interview covers Rob's project experience with data analytics and HPE Vertica.

RVZ

Q1. What is the business of TravelBird?

Rob Winters: TravelBird builds and provides a daily selection of inspirational holiday offerings in twelve markets across Europe. Our goal is to create packages which excite the imagination and bring simplicity and joy to the act of travelling. These packages are then shared with our travellers via email, our website, and our iOS and Android applications.

Q2. What are the current data projects at TravelBird?

Rob Winters: TravelBird’s journey with being data driven is relatively short, beginning our initial Business Intelligence buildout in mid-2015. Currently our BI team is engaged in a number of projects, both more traditional BI and advanced analytics, including:
– Building data sources and training an organization in self-service BI
– Replacing our generic daily selections with personalized content selection models
– Optimizing pricing of packages based on product price volatility and customer demand
– Adjusting email frequency and timing to improve customer engagement and lifetime value

Q3. What is your experience in using predictive analytics?

Rob Winters: I have been working in the predictive analytics field for six years now across a variety of problem areas – customer service, retail, gaming, and now travel. From a technology standpoint I originally worked heavily with commercial solutions (Teradata, SAS) but for the last four years have used almost exclusively open source software including Hadoop, Spark, R, and Python.

Q4. How do you evaluate whether the insights you discover are “good”?

Rob Winters: During the initial development of our algorithms we typically follow a basic version of CRISP-DM to build an initial working model for our problem. To test models, we always use an A/B test and typically follow a two-phase process: first the model is split-tested against the current operational process/human selection; then, once the model consistently outperforms the status quo, we test future model iterations against the control.
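
As a generic illustration of that split-testing step (not TravelBird’s actual tooling; the counts and the scipy-based test are assumptions for the example), a minimal Python sketch might look like this:

# Illustrative only: compare conversion rates of a model variant against the
# human-curated control using a chi-square test of independence.
# The counts below are invented, not real TravelBird figures.
from scipy.stats import chi2_contingency

control = {"conversions": 1180, "non_conversions": 48820}   # human selection
variant = {"conversions": 1315, "non_conversions": 48685}   # model selection

table = [
    [control["conversions"], control["non_conversions"]],
    [variant["conversions"], variant["non_conversions"]],
]
chi2, p_value, dof, expected = chi2_contingency(table)

control_rate = control["conversions"] / sum(control.values())
variant_rate = variant["conversions"] / sum(variant.values())
print(f"control={control_rate:.4f} variant={variant_rate:.4f} p-value={p_value:.4f}")
# Only once the variant consistently outperforms does it become the new control,
# mirroring the two-phase process described above.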

Q5. Can you tell us a bit about the work you did in designing and implementing a fully automated, machine learning based content selection platform?

Rob Winters: To provide context, every day our planning team creates six unique product offerings for their target market of 50-500k customers to be shared via web, iOS/Android app, and email. Our goal was to replace that model with one that selects six unique products for each recipient based on past browsing and travel behavior. To do so, we designed an ensemble model consisting of several components:
– A customer preference model (user-item recommendation model)
– A product similarity model (item-item similarity)
– A “hotness” model to promote destinations which are trending/outperforming/expected to do well
– A portfolio model to select the right diversity for each recipient based on recommendation confidence, lifecycle state, and yield optimization of cannibalization vs product fit for a recipient

The data to feed these models is based on observing dozens of events per recipient per day, positive and negative feedback events of the recipient, all observable product features, and human expert input. The models are also able to improve themselves by continuously tuning the input parameters of each model based on recommendation split testing.
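
As a purely hypothetical sketch of how such an ensemble might blend component scores into a top-six selection (the component functions, weights and product names below are placeholders, and the portfolio/diversity step is simplified to a single weighted term):

# Hypothetical ensemble blend: each component scores a (user, product) pair;
# a weighted sum ranks the candidates and the top six are selected.
from typing import Callable, Dict, List, Tuple

def select_top_products(
    user: str,
    candidates: List[str],
    components: Dict[str, Tuple[Callable[[str, str], float], float]],
    top_n: int = 6,
) -> List[str]:
    """Rank candidate products for a user by a weighted sum of component scores."""
    scored = []
    for product in candidates:
        total = sum(weight * score_fn(user, product)
                    for score_fn, weight in components.values())
        scored.append((total, product))
    scored.sort(reverse=True)
    return [product for _, product in scored[:top_n]]

# Placeholder components standing in for the preference, similarity,
# "hotness" and portfolio models described above.
components = {
    "preference": (lambda u, p: 0.5, 0.4),
    "similarity": (lambda u, p: 0.3, 0.2),
    "hotness":    (lambda u, p: 0.7, 0.3),
    "portfolio":  (lambda u, p: 0.1, 0.1),
}
print(select_top_products("user_42",
                          ["rome", "lisbon", "crete", "ibiza", "paris", "oslo", "bali"],
                          components))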

Q6. What are the primary technologies you are using?

Rob Winters: Our technology stack consists of the following:
-BI: Tableau
-Data warehousing: HPE Vertica
-Operations DBs: MySQL (web services) + Postgres (internal services)
-Recommendations serving: Redis
-Modeling/Analysis: Python, Spark via PySpark

Q7. What is your experience in using HPE Vertica?

Rob Winters: I have been using Vertica for five years in a number of organizations and facilitated the first rollout in the Netherlands. During that time I have been primarily an end user/data analyst but have also been the DBA for my deployments for the last two years.

Q8: Can you give us some more technical details of what was this first rollout in the Netherlands? What challenges did you solve in using HPE Vertica? What business benefits did you obtain?

Rob Winters: The objective of our rollout was to implement a centralized company data warehouse to unify several production databases plus external API data.
The existing platform was Postgres (a row-based solution) and relatively limited in performance. The primary gains were significantly faster analytics, the ability to add in several terabytes of event data (which was not possible on the prior platform), and new insights into the email database regarding churn, conversion, and customer value.

Q9: What were the main criteria for you to choose HPE Vertica? Did you do any performance test for HPE Vertica?

Rob Winters: We considered a number of alternatives including Microsoft PDW, Greenplum, and Infobright.
The primary considerations were price/performance, scalability, and analytical functionality. We found Vertica to be the best option across those aspects. Regarding performance testing, we did compare Infobright and Vertica and found the latter to be both more performant and easier to work with.

Q10. What specific functionalities of HPE Vertica do you find particularly useful in your job?

Rob Winters: There are a number of aspects which I find extremely beneficial, including:
-Ease of administration
-Very good performance tunability, much better than with (for example) Redshift
-Analytical function extensions enable extremely powerful analyses directly via SQL
-The ability to load JSON data allows very rapid data integration from new sources (see the sketch below)
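
As a rough sketch of the JSON-loading point (assuming the open-source vertica_python client; the connection details, table and file names, and the event_type key are illustrative, not the actual TravelBird setup):

# Sketch: land raw JSON in a Vertica flex table so a new source can be queried
# without defining a schema first. All names and credentials are placeholders.
import vertica_python

conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "dwh"}

conn = vertica_python.connect(**conn_info)
cur = conn.cursor()
cur.execute("CREATE FLEX TABLE raw_events()")
# Server-side file path; keys in the JSON become queryable virtual columns.
cur.execute("COPY raw_events FROM '/data/events.json' PARSER fjsonparser()")
cur.execute("SELECT event_type, COUNT(*) FROM raw_events GROUP BY 1 LIMIT 10")
for row in cur.fetchall():
    print(row)
conn.close()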

Q11. Do you think is it possible to turn an employee into a data analyst?

Rob Winters: Absolutely, I’ve managed several employees who have successfully transitioned from an operations role to an analytics role. In fact, some of them have become my best analysts because they have brought a deeper domain knowledge to their analyses than someone approaching from the outside may have done. The biggest drivers for success in the transition have been:
– Attitude/eagerness to learn
– Close collaboration with a more experienced analyst, either their supervisor or a more senior peer
– Making their initial projects in areas where they are unable to fall back on domain knowledge

——
Rob Winters, Head of Business Intelligence at TravelBird.
Rob has been working with and leading analytics teams since 2006 across a number of industries including telco, gaming, retail, and travel. His primary focus since 2011 has been green-field implementations of technology and team creation for both traditional business intelligence and predictive analytics; full details are listed on his LinkedIn profile. He holds a bachelor’s degree in economics and an MBA with an IT concentration.

Resources

Data-X: a project to produce a collection of video lectures on very practical and applied data analytics.

HPE Vertica 8 “Frontloader” BY Jeff Healey. ODBMS.org SEPTEMBER 12, 2016

Benchmarking HPE Vertica and Amazon Redshift. (Webinar)

HPE Vertica Analytics Platform on Microsoft Azure. By Chris_Daly. ODBMS.org SEPTEMBER 12, 2016

Hewlett Packard Enterprise Introduces HPE Vertica 8. ODBMS.org SEPTEMBER 7, 2016

Related Posts

On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, May 24, 2016

On data analytics for finance. Interview with Jason S. Cornez. ODBMS Industry Watch, May 17, 2016

A/B Testing is not art, it is science. By Ramkumar Ravichandran, Director, Analytics, Visa Inc. ODBMS.org, MAY 22, 2015

Follow us on Twitter: @odbmsorg

##

Democratizing the use of massive data sets. Interview with Dave Thomas. http://www.odbms.org/blog/2016/09/democratizing-the-use-of-massive-data-sets-interview-with-dave-thomas/ http://www.odbms.org/blog/2016/09/democratizing-the-use-of-massive-data-sets-interview-with-dave-thomas/#comments Mon, 12 Sep 2016 19:04:14 +0000 http://www.odbms.org/blog/?p=4234

“Any important data driving a business decision needs to be sanity checked, just as it would if one was using a spreadsheet.”–Dave Thomas.

I have interviewed Dave Thomas, Chief Scientist at Kx Labs.

RVZ

Q1. For many years business users have had their data locked up in databases and data warehouses. What is wrong with that?

Dave Thomas: It isn’t so much an issue of where the data resides, whether it is in files, databases, data warehouses or a modern data lake. The challenge is that modern businesses need access to the raw data, as well as the ability to rapidly aggregate and analyze their data.

Q2. Typical business intelligence (BI) tool users have never seen their actual data. Why?

Dave Thomas: For large corporations hardware and software both used to be prohibitively expensive, hence much of their data was aggregated prior to making it available to users. Even today when machines are very inexpensive most corporate IT infrastructures are impoverished relative to what one can buy on the street or in the Cloud.
Compounding the problem, IT charge-back mechanisms are biased to reduce IT spending rather than to maximize the value of data delivered to the business.
Traditional technologies are not sufficiently performant to allow processing of large volumes of data.
Many companies have inexpensive data lakes and have realized after the fact that using commodity storage systems, such as HDFS, has severely constrained their performance and limited their utility. Hence more corporations are moving data away from HDFS into high-performance storage or memory.

Q3. What are the limitations of the existing BI and extract, transform and load (ETL) data tools?

Dave Thomas: Traditional BI tools assume that it is possible for DBAs and BI experts to a priori define the best way to structure and query the data. This reduces the whole power of BI to mere reporting. In an attempt to deal with huge BI backlogs, generic query and reporting tools have become popular to shift reporting to self-serve. However, they are often designed for sophisticated BI users rather than for normal business users. They are often not performant because they depend on the implementation of the underlying data stores.
For the most part, existing ETL tools are constrained by having to move the data to the ETL process and then on to the end user. Many ETL tools only work against one kind of data source. ETL can’t be written by normal users, and due to the cost of an incorrect ETL run, such tools are not made available to the data analyst. One of the major topics of discussion in Big Data shops is the complexity and performance of their Big Data pipeline. ETL, or data blending, shouldn’t be a separate process or product. It should be something one can do with queries in a single efficient data language.

Q4. What are the typical technical challenges in finance, IoT and other time-series applications?

Dave Thomas:
1. Speed, as data volumes and variety are always increasing.
2. Ability to deal with both real-time events and historical events efficiently. Ideally in a single technology.
3. To handle time-series one needs to be able to deal with simultaneous arrival of events. Time with nanosecond precision is our solution. Other solutions are constrained by using milliseconds and event counters that are much less efficient.
4. High-performance operations on time, over days, months and years are essential for time-series. This is why time is a native type in Kx.
5. The essence of time-series is processing sliding time windows of data for both joins and aggregations (see the illustrative sketch after this list).
6. In IoT, data is always dirty. Kx’s native support for missing data and out-of-band data due to failing sensors allows one to deal with the realities of sensor data.
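
To make the sliding-window point concrete, here is a small illustration in pandas rather than q/kdb+ (the timestamps and prices are invented):

# Illustration only (pandas, not q/kdb+): a time-based sliding-window
# aggregation over trades with sub-second timestamps.
import pandas as pd

trades = pd.DataFrame({
    "ts": pd.to_datetime([
        "2016-09-12 09:30:00.000001",
        "2016-09-12 09:30:00.250000",
        "2016-09-12 09:30:01.100000",
        "2016-09-12 09:30:02.500000",
        "2016-09-12 09:30:02.750000",
    ]),
    "price": [100.0, 100.2, 99.9, 100.5, 100.4],
}).set_index("ts")

# Rolling 1-second window ending at each event: mean price and trade count.
window = trades["price"].rolling("1s")
print(pd.DataFrame({"mean_px": window.mean(), "n_trades": window.count()}))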

Q5. Kx offers analysts a language called q. Why not extend standard SQL?

Dave Thomas: I think there is a misunderstanding about q. Q is a full functional data language that both includes and extends SQL. Selects are easier than in SQL because they provide implicit joins and group-bys, which makes queries roughly half the code of the equivalent SQL. Unlike many flavors of SQL, q lets one put a functional expression in any position in an SQL statement. One can easily extend the aggregation operations available to the end user.

Q6. Can you show the difference between a query written in q and in standard SQL?

Dave Thomas: Here’s an example of retrieving parts from an orders table with a foreign key join to a parts table, summing by quantity and then sorting by color:

q:
select sum qty by p.color from sp

SQL:
select p.color, sum(sp.qty) from sp, p
where sp.p=p.p group by p.color order by color

Q7. How do queries execute inside the database?

Dave Thomas: Q is native to the database engine. Hence queries and analytics execute in the columns of the Kx database. There is no data shipping between the client and database server.

Q8. Shawn Rogers of Dell said: “A ‘citizen data scientist’ is an everyday, non-technical user that lacks the statistical and analytical prowess of a traditional data scientist, but is equally eager to leverage data in order to uncover insights, and importantly, do so at the speed of business.” What is your take on this?

Dave Thomas: High-performance data technologies, such as Kx, running on modern large-memory hardware, can support data-analyst as well as data-scientist queries. In the product Analyst for Kx, for example, users can work interactively on a sample of data, using visual tools to import, clean, query, transform, analyze and visualize data with minimal, if any, programming or even SQL. Once the operations are correct on one or more samples, they can then be run against trillions of rows of data. Data analysts today can truly live in their data.
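
A generic sketch of that “work on a sample, then run against everything” workflow, in plain pandas rather than Analyst for Kx (the column names and the cleaning step are placeholders):

# Generic "develop on a sample, then apply to the full data" pattern.
# Column names and the cleaning/aggregation logic are placeholders.
import pandas as pd

def clean_and_aggregate(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with bad prices, then total volume per symbol."""
    valid = df[df["price"] > 0]
    return valid.groupby("symbol", as_index=False)["volume"].sum()

full = pd.DataFrame({
    "symbol": ["AAA", "BBB", "AAA", "CCC"] * 1000,
    "price":  [10.0, -1.0, 10.5, 7.2] * 1000,
    "volume": [100, 200, 150, 50] * 1000,
})

sample = full.sample(frac=0.05, random_state=7)    # iterate interactively here
assert not clean_and_aggregate(sample).empty       # sanity-check the operations
print(clean_and_aggregate(full))                   # then run against all the data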

Q9. What are the risks of bringing the power of analytics to users who are non-expert programmers?

Dave Thomas: Clearly any important analysis needs to be validated and cross-checked. Hence any important data driving a business decision needs to be sanity checked, just as it would if one was using a spreadsheet.
In our experience users do make initial mistakes, but as they live in their data they quickly learn.
Visualization really helps, as does the provision of metadata about the data sources. Reducing the cycle time provides increased understanding, and allows one to make mistakes.
Runaway query performance has been a concern of DBAs, but for many years frameworks have been in place such as our smart query router that will ensure that ad hoc queries against massive datasets are throttled so they don’t run away. Fortunately, recent cost reductions in non-volatile memory make it possible to have high-performance query-only replicas of data that can be made available to different parts of the organization based on its needs.

Q10. How can non-expert programmers understand if the information expressed in visual analytics such as heat maps or in operational dashboard charts, is of good quality or not?

Dave Thomas: In our experience users spot visual anomalies much faster than inconsistencies in a spreadsheet.

Q11. What are the opportunities arising in “democratizing” the use of massive data sets?

Dave Thomas: We are finally living in a world where for many companies it is possible to run a real-time business where everyone can have fast, efficient access to the data they need. Rather than being held hostage to aggregations, spreadsheets and all sorts of variants of the truth, the organization can expediently see new opportunities to improve results in sales, marketing, production and other business operations.

Q12. How important is data query and data semantics?

Dave Thomas: Unfortunately we are not educated in how to express data semantics and data queries.
Even computer scientists often study more about how to execute queries efficiently than about how to write them.
We need to educate students and employees on how to live in their data. It may well be that the future of programming for most people will be writing queries. Given powerful data languages, even compiler optimizations can be expressed as queries.
We need to invest much more in data governance and the use of standard terminology in order to share data within and across companies.

——————-
Dave Thomas, Kx Labs.
As Chief Scientist Dave envisions the future roadmap for Kx tools. Dave has had a long and storied career in computer software development and is perhaps best known as the founder and past CEO of Object Technology International, formerly OTI, now IBM OTI Labs, a pioneer in Agile Product Development. He was the principal visionary and architect for IBM VisualAge Smalltalk and Java tools and virtual machines including the popular open-source, multi-language Eclipse.org IDE. As the cofounder of Bedarra Research Labs he led the creation of the Ivy visual analytics workbench. Dave is a renowned speaker, university lecturer and Chairman of the Australian developer YOW! conferences.

Resources

New Kx release includes encryption, enhanced compression and Tableau integration. ODBMS.org JULY 4, 2016.

Resources for learning more about kdb+ and q benchmarking results.

Kdb+ and the Internet of Things/Big Data. InDetail paper by Philip Howard, Bloor Research. ODBMS.org, JANUARY 28, 2015

Related Posts

Democratizing fast access to Big Data. By Dave Thomas, Chief Scientist at Kx Labs. ODBMS.org, April 26, 2016

On Data Governance. Interview with David Saul. ODBMS Industry Watch, Published on 2016-07-23

On the Challenges and Opportunities of IoT. Interview with Steve Graves. ODBMS Industry Watch, Published on 2016-07-06

On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, Published on 2016-05-24

Follow us on Twitter: @odbmsorg

##

On the Challenges and Opportunities of IoT. Interview with Steve Graves http://www.odbms.org/blog/2016/07/on-the-challenges-and-opportunities-of-iot-interview-with-steve-graves/ http://www.odbms.org/blog/2016/07/on-the-challenges-and-opportunities-of-iot-interview-with-steve-graves/#comments Wed, 06 Jul 2016 09:00:29 +0000 http://www.odbms.org/blog/?p=4172

“Assembling a team with the wide range of skills needed for a successful IoT project presents an entirely different set of challenges. The skills needed to build a ‘thing’ are markedly different than the skills needed to implement the data analytics in the cloud.”–Steve Graves.

I have interviewed Steve Graves, co-founder and CEO of McObject. The main topic of the interview is the Internet of Things and how it relates to databases.

RVZ

Q1. What are in your opinion the main Challenges and Opportunities of the Internet of Things (IoT) seen from the perspective of a database vendor?

Steve Graves: Let’s start with the opportunities.

When we started McObject in 2001, we chose “eXtremeDB, the embedded database for intelligent, connected devices” as our tagline. eXtremeDB was designed from the get-go to live in the “things” comprising what the industry now calls the Internet of Things. The popularization of this term has created a lot of visibility and, more importantly, excitement and buzz for what was previously viewed as the relatively boring “embedded systems.” And that creates a lot of opportunities.

A lot of really smart, creative people are thinking of innovative ways to improve our health, our workplace, our environment, our infrastructure, and more. That means new opportunities for vendors of every component of the technology stack.
The challenges are manifold, and I can’t begin to address all of them. The media is largely fixated on security, which itself is multi-dimensional.
We can talk about protecting IoT-enabled devices (e.g. your car) from being hacked. We can talk about protecting the privacy of your data at rest. And we can talk about protecting the privacy of data in motion.
Every vendor needs to recognize the importance of security. But it isn’t enough for a vendor, like McObject, to provide the features to secure the target system; the developer that assembles the stack along with their own proprietary technology to create an IoT solution needs to use the available security features, and use them correctly.

After security, scaling IoT systems is the next big challenge. It’s easy enough to prototype something.
But careful planning is needed to leap from prototype to full-blown deployment. Obvious decisions have to be made about connectivity and necessary bandwidth, how many things per gateway, one tier of gateways or more, and how much compute capacity is needed in the cloud. Beyond that, there are less obvious decisions to be made that will affect scalability, like making sure the DBMS used on devices and/or gateways is able to handle the workload (e.g. that the gateway DBMS can scale from 10 input streams to 100 input streams); determining how to divide the analytics workload between gateways and the cloud; and ensuring that the gateway, its DBMS and its communication stack can stream data to the cloud while simultaneously processing its own input streams and analytics.
Assembling a team with the wide range of skills needed for a successful IoT project presents an entirely different set of challenges. The skills needed to build a ‘thing’ are markedly different than the skills needed to implement the data analytics in the cloud. In fact, ‘things’ are usually very much like good ol’ embedded systems, and system engineers that know their way around real-time/embedded operating systems, JTAG debuggers, and so on, have always been at a premium.

Q2. Data management for the IoT: What are the main differences between data management in field-deployed devices and at aggregation points?

Steve Graves: Quite simply: scale. A field-deployed device (or a gateway to field-deployed devices that do not, themselves, have any data management need or capability) has to manage a modest amount of data. But an aggregation point (the cloud being the most obvious example) has to manage many times more data – possibly orders of magnitude more.
At the same time, I have to say that they might not be all that different. Some IoT systems are going to be closed, meaning the nature of the things making up the system is known, and these won’t require much scaling. For example, a building automation system for a small- to mid-size building would have perhaps 100s of sensors and 10s of gateways, and may (or may not) push data up to a central aggregation point. If there are just 10s of gateways, we can create a UI that connects to the database on each gateway where each database is one shard of a single logical database, and execute analytics against that logical database without any need of a central aggregation point. We can extend this hypothetical case to a campus of buildings, or to a landlord with many buildings in a metropolitan area, and then a central aggregation point makes sense.

But the database system would not necessarily be different, only the organization of the physical and logical databases.
The gateways of each building would stream to a database server in the cloud. In the case of 10 buildings, we could have 10 database servers in the cloud that represent 10 shards of that logical database in the cloud. This architecture allows for great scalability. The landlord acquires another building? Great, stand up another database server and the UI connects to 11 shards instead of 10. In this scenario, database servers are software, not hardware. For the numbers we’re talking about (10 or 11 buildings), it could easily be handled by a single hardware server of modest ability.
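
A minimal sketch of the “one logical database over N shards” idea, using SQLite connections purely as stand-ins for the per-building databases (the table, columns and readings are invented):

# Sketch: fan the same aggregate query out to each per-building shard and
# combine the partial results centrally. SQLite is only a stand-in here.
import sqlite3

def make_shard(readings):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sensor_readings (sensor TEXT, temp REAL)")
    conn.executemany("INSERT INTO sensor_readings VALUES (?, ?)", readings)
    return conn

shards = [
    make_shard([("b1-hvac-1", 21.5), ("b1-hvac-2", 22.0)]),
    make_shard([("b2-hvac-1", 19.8)]),
    make_shard([("b3-hvac-1", 23.1), ("b3-hvac-2", 22.7)]),
]

partials = [s.execute("SELECT COUNT(*), SUM(temp) FROM sensor_readings").fetchone()
            for s in shards]
count = sum(n for n, _ in partials)
avg_temp = sum(total for _, total in partials) / count
print(f"{len(shards)} shards, {count} readings, average temp {avg_temp:.1f}")
# Adding an 11th building is just appending another connection to `shards`.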

At the other end of the scale (pun intended) are IoT systems that are wide open. By that, I mean the creators are not able to anticipate the universe of “things” that could be connected, or their quantity. In this case, the database system should be able to ingest data that was heretofore unknown. This argues for a NoSQL database system, i.e. a database system that is schema-less. In this scenario, the database system on field-deployed devices is probably radically different from the database system in the cloud. Field-deployed devices are purpose-specific, so A) they don’t need and wouldn’t benefit from a NoSQL database system, and B) most NoSQL database systems are too resource-hungry to reside on embedded device nodes.

Q3. If we look at the characteristics of a database system for managing device-based data in the IoT, how do they differ from the characteristics of a database system (typically deployed on a server) for analyzing the “big data” generated by myriad devices?

Steve Graves: Again, let’s recognize that field-deployed devices in the IoT are classic embedded systems. In practical terms, that means relatively modest hardware like an ARM, MIPS, PowerPC or Atom processor running at 100s of megahertz, or perhaps 1 GHz if we’re lucky, and with only enough memory to perform its function. Further, it may require a real-time operating system, or at least an embedded operating system that is less resource-hungry than a full-on Linux distro. So, for a database system to run in this environment, it will need to have been designed to run in this environment. It isn’t practical to try to shoehorn in a database system that was written on the assumption that CPU cycles and memory are abundant. It may also be the case that the device has little-to-no persistent storage, which mandates an in-memory database.

So a database system for a field-deployed device is going to:
1. have a small code size
2. use little stack
3. preferably, allocate no heap memory
4. have no, or minimal, external dependencies (e.g. not link in an extra 1 MB of code from the C run-time library)
5. have a built-in ability to replicate data (to a gateway or directly to the cloud)
a. replication should be “open”, meaning able to replicate to a different database system
6. have built-in security features

7. nice to have:
a. built-in analytics to aggregate data prior to replicating it
b. the ability to define the schema
c. the ability to operate entirely in memory

A database system for the cloud might benefit from being schema-less, as described previously. It should certainly have pretty elastic scalability. Servers in the cloud are going to have ample resources and robust operating systems. So a database system for the cloud doesn’t need to have a small code size, use a small amount of stack memory, or worry about external dependencies such as the C run-time library. On the contrary, a database system for the cloud is expected to do much more (handle data at scale, execute analytics, etc.) and will, therefore, need ample resources. In fact, this database system should be able to take maximum advantage of the resources available, including being able to scale horizontally (across cores, CPUs, and servers).
In summary, the edge (device-based) DBMS needs to operate in a constrained environment. A cloud DBMS needs to be able to effectively and efficiently utilize the ample resources available to it.

Q4. Why is the ability to define a database schema important (versus a schema-less DBMS, aka NoSQL) for field-deployed devices?

Steve Graves: Field-deployed devices will normally perform a few specific functions (sometimes, just one function). For example, a building automation system manages HVAC, lighting, etc. A livestock management system manages feed, output, and so on. In such systems, the data requirements are well known. The hallmark NoSQL advantage of being able to store data without predefining its structure is unwarranted. The other purported hallmark of NoSQL is horizontal scalability, but this is not a need for field-deployed devices.
Walking away from the relational database model (and its implicit use of a database schema) has serious implications.
A great deal of scientific knowledge has been amassed around the relational database model over the last few decades, and without it developers are completely on their own with respect to enforcing sound data management practices.

In the NoSQL sphere, there is nothing comparable to the relational model (e.g. E.F. Codd’s work) and the mathematical foundation (relational calculus) underpinning it.
There should be overwhelming justification for a decision to not use relational.
In my experience, that justification is absent for data management of field-deployed devices.
A database system that “knows” the data design (via a schema) can more intelligently manage the data. For example, it can manage constraints, domain dependencies, events and much more. And some of the purported inflexibility imposed by a schema can be eliminated if the DBMS supports dynamic DDL (see more details on this in the answer to question Q6, below).

Q5. In your opinion, do IoT aggregation points resemble data lakes?

Steve Graves: The term data lake was originally conceived in the context of Hadoop and map-reduce functionality. In more recent times, the meaning of the term has morphed to become synonymous with big data, and that is how I use the term. Insofar as a gateway can also be an aggregation point, I would not say ‘aggregation points resemble data lakes’ because gateway aggregation points, in all likelihood, will not manage Big Data.

Q6. What are the main technical challenges for database systems used to accommodate new and unforeseen data, for example when a new type of device begins streaming data?

Steve Graves: The obvious challenges are
1. The ability to ingest new data that has a previously unknown structure
2. The ability to execute analytics on #1
3. The ability to integrate analytics on #1 with analytics on previously known data

#1 is handled well by NoSQL DBMSs. But it might also be handled well by an RDBMS via “dynamic DDL” (dynamic data definition language), e.g. the ability to execute CREATE TABLE, ALTER TABLE, and/or CREATE INDEX statements against an existing database.
To efficiently execute analytics against any data, the structure of the data must eventually be understood.
RDBMSs handle this through the database dictionary (the binary equivalent of the data definition language).
But some NoSQL DBMSs handle this through other metadata. For example, the MarkLogic DBMS uses JSON metadata to understand the structure of documents in its document store.
NoSQL DBMSs with no metadata whatsoever put the entire burden on the developers. In other words, since the data is opaque to the DBMS, the application code must read and interpret the content.
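
As a small illustration of the dynamic DDL idea, using SQLite as a stand-in: when a new device type streams a previously unknown field, the schema is extended at runtime (the device and field names are invented, and a real system would validate field names before interpolating them into DDL):

# Dynamic DDL sketch: extend an existing table when a record arrives with a
# field the schema has never seen. Names are invented; SQLite is a stand-in.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device TEXT, ts TEXT, temperature REAL)")

def ingest(record: dict) -> None:
    """Add any missing columns on the fly, then insert the record."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(readings)")}
    for field in record:
        if field not in existing:
            conn.execute(f"ALTER TABLE readings ADD COLUMN {field} REAL")
    cols = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    conn.execute(f"INSERT INTO readings ({cols}) VALUES ({placeholders})",
                 list(record.values()))

ingest({"device": "thermostat-1", "ts": "2016-07-06T09:00:00", "temperature": 21.5})
# A new device type appears, streaming a field the schema has never seen:
ingest({"device": "co2-sensor-7", "ts": "2016-07-06T09:00:05", "co2_ppm": 412.0})
print(conn.execute("SELECT * FROM readings").fetchall())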

Q7. Client/server DBMS architecture vs. in-process DBMSs: which one is more suitable for IoT?

Steve Graves: For edge DBMSs (on constrained devices), an in-process architecture will be more suitable. It requires fewer resources than a client/server architecture, and imposes less latency through the elimination of inter-process communication. For cloud DBMSs, a client/server architecture will be more suitable. In the cloud environment, resources are not scarce, and the advantage of being able to scale horizontally will outweigh the added latency associated with client/server.

Qx Anything else you wish to add?

Steve Graves: We feel that eXtremeDB is uniquely positioned for the Internet of Things. Not only have devices and gateways been in eXtremeDB’s wheelhouse for 15 years, with over 25 million real-world deployments, but the scalability, time series data management, and analytics built into the eXtremeDB server (big data) offering make it an attractive cloud database solution as well. Being able to leverage a single DBMS across devices, gateways and the cloud has obvious synergistic advantages.

———————
Steve Graves is co-founder and CEO of McObject, a company specializing in embedded Database Management System (DBMS) software. Prior to McObject, Steve was president and chairman of Centura Solutions Corporation and vice president of worldwide consulting for Centura Software Corporation.

Resources

Big Data, Analytics, and the Internet of Things. By Mohak Shah, analytics leader and research scientist at Bosch Research, USA. ODBMS.org, APRIL 6, 2015

Privacy considerations & responsibilities in the era of Big Data & Internet of Things. By Ramkumar Ravichandran, Director, Analytics, Visa Inc. ODBMS.org, January 8, 2015

Securing Your Largest USB-Connected Device: Your Car. By Shomit Ghose, General Partner, ONSET Ventures. ODBMS.org, MARCH 31, 2016

eXtremeDB Financial Edition DBMS Sweeps Records in Big Data Benchmark. ODBMS.org, JULY 2, 2016

 eXtremeDB in-memory database

 User Experience Design for the Internet of Things

Related Posts

On the Internet of Things. Interview with Colin Mahony. ODBMS Industry Watch, Published on 2016-03-14

A Grand Tour of Big Data. Interview with Alan Morrison. ODBMS Industry Watch, Published on 2016-02-25

On the Industrial Internet of Things. Interview with Leon Guzenda. ODBMS Industry Watch, January 28, 2016

Follow us on Twitter: @odbmsorg

##

On data analytics for finance. Interview with Jason S.Cornez. http://www.odbms.org/blog/2016/05/on-data-analytics-for-finance-interview-with-jason-s-cornez/ http://www.odbms.org/blog/2016/05/on-data-analytics-for-finance-interview-with-jason-s-cornez/#comments Tue, 17 May 2016 06:14:44 +0000 http://www.odbms.org/blog/?p=4132

“Understanding human language remains a difficult problem. The challenges here are not only technical, but there is also a perception from popular culture that computers today perform at the level we see in science fiction. So there is a gap between what is expected and what is possible.”– Jason S. Cornez

I have interviewed Jason S. Cornez, Chief Technology Officer, RavenPack. The main topic of the interview is unstructured data analytics for finance.

RVZ

Q1. What is the business of RavenPack?

Jason S. Cornez: We specialize in the systematic analysis of unstructured data for finance. RavenPack Analytics transforms unstructured big data sets, such as traditional news and social media, into structured granular data and indicators to help financial services firms improve their performance. RavenPack addresses the challenges posed by the characteristics of Big Data – volume, variety, veracity and velocity – by converting unstructured content into a format that can be more effectively analyzed, manipulated and deployed in financial applications.

Q2. How is Deutsche Bank using RavenPack News Analytics as an overlay to a pairs trading strategy?

Jason S.Cornez: The profits and risks from trading stock pairs are very much related to the type of information event which creates divergence. If divergence is caused by a piece of news related specifically to one constituent of the pair, there is a good chance that prices will diverge further. On the other hand, if divergence is caused by random price movements or a differential reaction to common information, convergence is more likely to follow after the initial divergence. To test the effects of news on a pairs trading strategy, Deutsche Bank used two aggregated indicators based on RavenPack’s Big Data analytics derived from news and social media data measuring sentiment and media attention.
Specifically, using the two indicators, Deutsche Bank created a filter that would ignore trades where divergence was supported by negative sentiment and abnormal news volume.
Overall, Deutsche Bank finds that applying a news analytics overlay can help differentiate between “good” price divergence (which is likely to converge) and “bad” divergence. More importantly, such ability provides significant improvements to the performance of a traditional pairs trading strategy, especially by reducing divergence risk.
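
A schematic pandas sketch of that kind of overlay follows; every column name, threshold and figure is invented for illustration and none of it reflects Deutsche Bank’s or RavenPack’s actual definitions:

# Schematic overlay: skip divergence signals that coincide with negative
# sentiment and abnormal news volume. All values below are invented.
import pandas as pd

signals = pd.DataFrame({
    "pair":            ["A/B", "C/D", "E/F"],
    "divergence":      [True, True, True],
    "sentiment_score": [-0.6, 0.2, -0.1],   # aggregated news/social sentiment
    "news_volume_z":   [2.5, 0.4, 1.1],     # z-score of news volume vs. normal
})

SENTIMENT_FLOOR = -0.3   # "negative sentiment" threshold (illustrative)
VOLUME_Z_CAP = 2.0       # "abnormal news volume" threshold (illustrative)

news_driven = ((signals["sentiment_score"] < SENTIMENT_FLOOR)
               & (signals["news_volume_z"] > VOLUME_Z_CAP))
tradeable = signals[signals["divergence"] & ~news_driven]
print(tradeable["pair"].tolist())   # pairs whose divergence is not news-driven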

Q3. Who needs sentiment analytics in finance and why?

Jason S. Cornez: Sentiment analytics can help improve the performance of trading strategies, reduce risk, and monitor compliance. Quantitative investors often subscribe to RavenPack Analytics granular data. This provides them with the ability to detect relevant, novel and unexpected events – be they corporate, macroeconomic or geopolitical – so they can enter new positions, or protect existing ones. These events, and the sentiment associated with them, help drive alpha generation as a novel factor in automated trading models.

Traditional Asset Managers, such as those managing hedge funds, mutual funds, pension funds and family offices may subscribe to RavenPack Indicators to help run portfolio optimization. The Indicators provide snapshots of sentiment and information density for an entity or instrument that can be used alongside fundamental or technical indicators to build portfolios with better risk/return profiles.

Brokerage and Market Makers can leverage RavenPack sentiment data to manage risk and generate trade ideas. They rely on RavenPack’s detection of relevant, novel and unexpected events – be they corporate, macroeconomic or geopolitical – to create circuit breakers protecting them from event risk.

Risk and Compliance Managers use RavenPack data to monitor accumulation of adverse sentiment or detect headline risk. The data help risk managers locate accumulations of risk and volatility, or changes in liquidity – either by aggregating sentiment, identifying event-driven regime shifts, or by creating alerts for when sentiment indicators reach extremes. As well, RavenPack event data also aids surveillance analysts to receive fewer false positives from market abuse alerts.

Finally, Professional and Academic researchers use RavenPack data to better understand how news and social media affect markets. They want to inform their clients how to find new sources of value and, hence, research and write about how quantitative investment managers find value in the data. RavenPack’s granular data is a great source of unique data for academics to enhance their published research – be it presenting a new way to use the data or controlling for news and social media in their work.

Q4. What are the main challenges and opportunities for Big Data analytics for financial markets?

Jason S. Cornez: Much of the work so far in Big Data analytics has been confined to structured data. These are sets of labeled and elementized values, such as what you might find in a traditional database table. Tools like Hadoop and Spark have helped to make structured big data analytics approachable.

RavenPack has always focused on unstructured data, primarily English-language text. Doing analytics here isn’t just about data mining, it requires more sophisticated processing for each document. Understanding human language remains a difficult problem. The challenges here are not only technical, but there is also a perception from popular culture that computers today perform at the level we see in science fiction. So there is a gap between what is expected and what is possible. One of our goals here is certainly to help make computers a little smarter.

Things start to get really interesting when you produce analytics by marrying structured data with unstructured data. A simple example could be a news story where an analyst expects mortgage rates to hit 4% by summer. It is certainly great if a computer understands that this is a story about interest rate guidance, but so much better if the computer is able to combine this with historical mortgage rates to know that the rates are currently rising, but still far below historical norms. As an industry, I don’t think too much has been done here yet, but that we’ll be seeing more activity here in the coming years.

Financial markets rely on information in order to be efficient. Big Data analytics promises to provide more information, and more types of information, faster than was previously possible. A more efficient market could help to level the playing field, as it were. And even if markets never become truly efficient, the financial industry sees that Big Data analytics can certainly help them. Several of these opportunities were addressed in the answer to the previous question.

Q5. What is your practical experience in building an infrastructure for Big Data analytics of mostly unstructured text content, in realtime?

Jason S.Cornez: RavenPack has been processing Big Data since before Cloud Computing was a practical reality. We noticed that most competitors in the news analytics space were offering software solutions, whereas RavenPack has always been a service provider. We sell data, not software. As such we invested in our own infrastructure maintained at trusted hosting facilities. This was perhaps not the easiest or cheapest route, but it leads to compelling products that are relatively easy for a customer to adopt.

From the beginning, we’ve built a distributed system where collection, storage, classification, analytics, publication, and monitoring all run on distinct machines connected by a high-speed network. We learned virtualization technologies so that we could leverage our hardware investments more efficiently. We’ve been rigorous about maintaining a separation of concerns and establishing well-defined interfaces between our components. This not only makes our system robust, but it also allows us to choose the best technologies for each task.

In recent years, we’ve migrated to Cloud Computing and our early investments in distributed systems are really paying off. Most of our components work directly in the cloud and also scale without additional engineering work.

Q6. How do you manage to have a very low latency?

Jason S.Cornez: Low latency has always been a requirement of the system. Starting with low-latency, realtime processing in mind led to many of the architectural decisions that I mentioned above – especially about being distributed and being able to leverage big hardware. It’s painful to think about re-engineering an existing system that wasn’t designed with low latency in mind.

A specific observation is that storage, especially magnetic storage, is far slower than CPU and also far slower than networking. So we have a heavily multi-threaded system where all storage tasks are delegated to background threads, and the flow of data in the realtime system never needs to wait on a database.
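
A generic Python sketch of that pattern (not RavenPack’s Common Lisp implementation; the store() function is a placeholder for the slow database write):

# Generic pattern: the realtime path hands storage work to a background thread
# via a queue, so document processing never waits on the database.
import queue
import threading

write_queue = queue.Queue()

def store(record: dict) -> None:
    pass  # placeholder for the actual (slow) database write

def writer_loop() -> None:
    while True:
        record = write_queue.get()
        if record is None:          # sentinel used to shut down cleanly
            break
        store(record)
        write_queue.task_done()

threading.Thread(target=writer_loop, daemon=True).start()

def process_document(doc: dict) -> dict:
    analytics = {"id": doc["id"], "entities": [], "sentiment": 0.0}  # classification stub
    write_queue.put(analytics)      # non-blocking hand-off to storage
    return analytics                # realtime publication continues immediately

process_document({"id": 1, "text": "example headline"})
write_queue.put(None)               # stop the background writer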

Speaking of multi-threading, RavenPack performs various types of classification on each document. Many of these are independent and can be performed in parallel. As well, within a single document and single type of classification, many aspects work only on local information, such as a paragraph. This work can also be done in parallel. As more powerful, multi-core machines continue to appear, our system can continue to improve.
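
Again as a generic sketch rather than RavenPack’s actual code, the structure might look like the following, with stub classifiers (in CPython, CPU-bound work would typically go to a process pool rather than threads):

# Run independent classifiers, and per-paragraph work within one classifier,
# in parallel. The classifier functions are stubs.
from concurrent.futures import ThreadPoolExecutor

def classify_entities(text: str) -> list:
    return []     # stub: whole-document entity detection

def classify_sentiment(text: str) -> float:
    return 0.0    # stub: whole-document sentiment

def classify_events(paragraph: str) -> list:
    return []     # stub: needs only local (paragraph-level) information

def analyze(document: str) -> dict:
    paragraphs = document.split("\n\n")
    with ThreadPoolExecutor() as pool:
        entities = pool.submit(classify_entities, document)
        sentiment = pool.submit(classify_sentiment, document)
        events = list(pool.map(classify_events, paragraphs))  # paragraph-level parallelism
    return {"entities": entities.result(),
            "sentiment": sentiment.result(),
            "events": [e for chunk in events for e in chunk]}

print(analyze("First paragraph.\n\nSecond paragraph."))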

Of course, low latency really begins with good algorithms and good tools. We measure the system as a whole on a daily basis and we profile our code for both speed and space on a regular basis. At times, there is a trade-off between a feature and doing it feasibly. We often sacrifice a new feature until we can solve how to implement it without negatively impacting the performance of our system.

Q7. What are the main technological challenges you are currently facing?

Jason S.Cornez: There are many challenges ahead. Some of the obvious ones are about branching out from English into other languages, or from plain text to other media formats.

On the purely technical side, we see that cloud computing and big data are still very young fields. Cloud resources are much more ephemeral than those in a controlled, hosted environment. We must adapt software to work well in the face of disappearing machines and inaccessible resources. One example is startup time of a system. Traditionally, startup is a rare event and our servers run for a long time. But now that changes, and system startup is much more frequent and hence must be made more efficient. We are evolving rapidly in these areas right now.

Perhaps the biggest challenge remains the perception gap that I mentioned earlier. I’m very proud of the system we’ve built, but it remains possible for a human to find an entity or an event in a document that our system misses. I don’t think this problem will ever go away, but I’m confident RavenPack is making great strides here.

Q8. Why and how do you use Allegro Common Lisp?

Jason S.Cornez: RavenPack has been using Franz Allegro Common Lisp since we began. It is the primary language we use for analysis and classification of unstructured text. Common Lisp is an excellent language for both exploratory programming and high performance computing.

Common Lisp is a multi-paradigm language, or even a paradigm-neutral language. So the engineer has the flexibility to map from concept to code in the most natural way possible. Some concepts map naturally to an object-oriented design, others to a functional design, and others to an imperative design. The language naturally supports all of these so you never need to map from your concept into the philosophy of the language. And further, Lisp is a programmable programming language, so as new paradigms come along, they can be added to the language by any developer. This is so easy and natural in Common Lisp that you often do it even when there is only a single use case in mind.

Common Lisp also shines for deploying and maintaining production software. Of course, it supports native OS threads, native machine compilation, and high performance garbage collection. But as well, you can attach to, inspect, modify and patch live systems.

Q9. What are the main lessons you learned so far?

Jason S.Cornez: It’s been a long and interesting journey, and nearly everything we know now has been learned along the way. One way I like to think about the main lessons learned is to consider what I believe to be the barriers that might make it difficult for a competitor or potential client to replicate what we’ve done.

A significant selling-point of our product that provides lots of value to our clients is our extensive historical archive of analytics. This of course is derived from our archive of content. The curation of such an archive is much harder than most people imagine. There is the minor issue of implementing the spec that the provider supplies. But the fun begins as you realize that the archive is incomplete and in multiple incompatible formats, some of them not documented at all. There are multiple timestamps, many with no timezone. The realtime feed looks different from the historical archive. The list goes on.

None of this is meant as a complaint about our content partners – this is the nature of things. And even having learned this lesson, there isn’t much we could have done differently. Of course, we now have a checklist of questions we give to any new content provider – and they often improve their offering as a result of working with us. But if we hear that incorporating someone’s content will be easy, we now know to take this with a grain of salt.

Qx Anything else you wish to add?

Jason S.Cornez: Thanks for this opportunity. I hope it has been helpful.

———————————
Jason S. Cornez, Chief Technology Officer, RavenPack.
Jason joined RavenPack in 2003 and is responsible for the design and implementation of the RavenPack software platform. He is a hands-on technology leader, with a consistent record of delivering breakthrough products.
A Silicon Valley start-up veteran with 20 years of professional experience, Jason combines technical know-how with an understanding of business needs to turn vision into reality. Jason holds a Master’s Degree in Computer Science, along with undergraduate degrees in Mathematics and EECS, from the Massachusetts Institute of Technology.

——————————

Resources

–  Common Lisp Educational Resources:  list of books, Lisp-oriented web sites and tutorials.

–  Basic Lisp Techniques: The PDF file provides an introduction to the Common Lisp language.

–  Mean Reversion II: Pairs Trading Strategies (link to PDF, registration required), Deutsche Bank, Feb. 16, 2016. In this paper, Deutsche Bank shows how to use RavenPack News Analytics as an overlay to a pairs trading strategy.

Related Posts

– Enterprise Information Extraction. By Yunyao Li, Research Manager, Scalable Natural Language Processing (SNaP) Group, IBM Research–Almaden. ODBMS.org, MAY 9, 2016.

– Big Data: Content and Technology. BY Gio Wiederhold, ODBMS.org, May 2016

– Above the Clouds: What Modern IT Portends. BY Filippo Balestrieri and Bernardo A. Huberman, Hewlett Packard Labs, ODBMS.org, MARCH 30, 2016.

– “Civility in the Age of Artificial Intelligence”. By Steve Lohr, technology reporter for The New York Times and the author of “Data-ism”. ODBMS.org, FEBRUARY 6, 2016.

Follow ODBMS.org on Twitter: @odbmsorg
##

On Big Data and Data Science. Interview with James Kobielus http://www.odbms.org/blog/2016/04/on-big-data-and-data-science-interview-with-james-kobielus/ http://www.odbms.org/blog/2016/04/on-big-data-and-data-science-interview-with-james-kobielus/#comments Tue, 19 Apr 2016 08:34:09 +0000 http://www.odbms.org/blog/?p=4119

“One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.”– James Kobielus

On the topics of Big Data, and Data Science, I have interviewed James Kobielus, IBM Big Data Evangelist.

RVZ

Q1. What kind of companies generate Big Data, besides the Internet giants?

James Kobielus: Big data isn’t something you “generate.” Rather, the term refers to the ability to achieve differentiated value from advanced analytics on trustworthy data at any scale. In other words, it’s a best practice, not a specific type of data or even a specific scale of data (measured in volume, velocity, and/or variety).

When considered in this light, you can identify big data analytic applications in every industry. Every C-level executive has strategic applications of big data. Here is just a smattering:

  • Chief Marketing Officers have been the prime movers on many big data initiatives that involve Hadoop, NoSQL, and other approaches. Their primary applications consist of marketing campaign optimization, customer churn and loyalty, upsell and cross-sell analysis, targeted offers, behavioral targeting, social media monitoring, sentiment analysis, brand monitoring, influencer analysis, customer experience optimization, content optimization, and placement optimization.
  • Chief Information Officers use big data platforms for data discovery, data integration, business analytics, advanced analytics, exploratory data science.
  • Chief Operations Officers rely on big data for supply chain optimization, defect tracking, sensor monitoring, and smart grid, among other applications.
  • Chief Information Security Officers run security incident and event management, anti-fraud detection, and other sensitive applications on big data.
  • Chief Technology Officers do IT log analysis, event analytics, network analytics, and other systems monitoring, troubleshooting, and optimization applications on big data.
  • Chief Financial Officers run complex financial risk analysis and mitigation modeling exercises on big data platforms.

Q2. What are the most challenging problems you are facing when analysing Big Data?

James Kobielus: Searching for actionable intelligence in big data involves building and testing advanced-analytics models against large volumes of complex data that may be flowing in at high velocities.

At these scales, it’s easy to get overwhelmed in your analysis unless you automate the end-to-end processes of extracting intelligence at scale. Automation can also help control the cost of managing a growing volume of algorithmic models against ever expanding big-data collections. The key processes that need automating are data discovery, profiling, sampling, and preparation, as well as model building, scoring, and deployment.
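
As one generic illustration of automating the model building and scoring loop (a scikit-learn pipeline with a cross-validated parameter search over synthetic data; this is not any particular vendor’s tooling):

# Generic automation of model building and scoring: a pipeline plus a
# cross-validated parameter search, run end to end without manual tuning.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()),
                     ("model", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipeline, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)                                # automated build + selection
print("held-out accuracy:", search.score(X_test, y_test))   # automated scoring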

Q3. How do you typically handle them?

James Kobielus: Automating the modeling process will boost data scientist productivity by an order of magnitude, freeing them from drudgery so that they can focus on the sorts of exploration, modeling, and visualization challenges that demand expert human judgment. Data scientists can accelerate their modeling automation initiatives by following these steps:

  • Virtualize access to data, metadata, rules, and predictive models, as well as to data integration, data warehousing, and advanced analytic applications through a BI semantic virtualization layer;
  • Unify access, governance, orchestration, automation, and administration across these resources within a service-oriented architecture;
  • Explore commercial tools that support maximum automation of model development, scoring, deployment, and execution;
  • Consolidate, accelerate, and deepen predictive analytics through integration into big-data platforms with scalable in-database execution; and
  • Migrate existing analytical data marts into multidomain big-data platforms with unified data, metadata, and model governance within a service-oriented virtualization framework.

Q4. What are in your experience the typical mistakes made in large scale data projects?

James Kobielus: One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.

Even if you accept that a data scientist’s integrity is rock-solid, intentions pure, skills stellar, and discipline rigorous, there’s no denying that bias may creep inadvertently into their work with big data. The biases may be minor or major, episodic or systematic, tangential or material to their findings and recommendations. Whatever their nature, the biases must be understood and corrected as fully as possible.

Here are some of the key sources of bias that may crop up in a data scientist’s work with big data:

  • Cognitive bias: This is the tendency to make skewed decisions based on pre-existing cognitive and heuristic factors–such as a misunderstanding of probabilities–rather than on the data and other hard evidence. You might say that the educated intuition that drives data science is rife with cognitive bias, but that’s not always a bad thing.
  • Selection bias: This is the tendency to skew your choice of data sources to those that may be most available, convenient, and cost-effective for your purposes, as opposed to being necessarily the most valid and relevant for your study. Clearly, data scientists do not have unlimited budgets, may operate under tight deadlines, and don’t use data for which they lack authorization. These constraints may introduce an unconscious bias in the big-data collections they are able to assemble.
  • Sampling bias: This is the tendency to skew the sampling of data sets toward subgroups of the population most relevant to the initial scope of a data-science project, thereby making it unlikely that you will uncover any meaningful correlations that may apply to other segments. Another source of sampling bias is “data dredging,” in which the data scientist uses regression techniques that may find correlations in samples but that may not be statistically significant in the wider population. Consequently, you’re likely to spuriously confirm your initial model for the segments that happen to make the sampling cut.
  • Modeling bias: Beyond the biases just discussed, this is the tendency to skew data-science models by starting with a biased set of project assumptions that drive selection of the wrong variables, the wrong data, the wrong algorithms, and the wrong metrics of fitness. In addition, overfitting of models to past data without regard for predictive lift is a common bias. Likewise, failure to score and iterate models in a timely fashion with fresh observational data also introduces model decay, hence bias.
  • Funding bias: This may be the most silent but pernicious bias in data-scientific studies of all sorts. It’s the unconscious tendency to skew all modeling assumptions, interpretations, data, and applications to favor the interests of the party–employer, customer, sponsor, etc.–that employs or otherwise financially supports the data-science initiative. Funding bias makes it highly unlikely that data scientists will uncover disruptive insights that will “break the rice bowl” in which they make their living.

Q5. How do you measure “success” when analysing data?

James Kobielus: You measure success in your ability to distill useful insights in a timely fashion from the data at your disposal.

Q6. What skills are required to be an effective Data Scientist?

James Kobielus: Data science’s learning curve is formidable. To a great degree, you will need a degree, or something substantially like it, to prove you’re committed to this career. You will need to submit yourself to a structured curriculum to certify you’ve spent the time, money and midnight oil necessary for mastering this demanding discipline.

Sure, there are run-of-the-mill degrees in data-science-related fields, and then there are uppercase, boldface, bragging-rights “DEGREES.” To some extent, it matters whether you get that old data-science sheepskin from a traditional university vs. an online school vs. a vendor-sponsored learning program. And it matters whether you only logged a year in the classroom vs. sacrificed a considerable portion of your life reaching for the golden ring of a Ph.D. And it certainly matters whether you simply skimmed the surface of old-school data science vs. pursued a deep specialization in a leading-edge advanced analytic discipline.

But what matters most to modern business isn’t that every data scientist has a big honking doctorate. What matters most is that a substantial body of personnel has a common grounding in core curriculum of skills, tools and approaches. Ideally, you want to build a team where diverse specialists with a shared foundation can collaborate productively.

Big data initiatives thrive if all data scientists have been trained and certified on a curriculum with the following foundation:

  • Paradigms and practices: Every data scientist should acquire a grounding in core concepts of data science, analytics and data management. They should gain a common understanding of the data science lifecycle, as well as the typical roles and responsibilities of data scientists in every phase. They should be instructed on the various roles data scientists play and how they work in teams and in conjunction with business domain experts and stakeholders. And they should learn a standard approach for establishing, managing and operationalizing data science projects in the business.
  • Algorithms and modeling: Every data scientist should obtain a core understanding of linear algebra, basic statistics, linear and logistic regression, data mining, predictive modeling, cluster analysis, association rules, market basket analysis, decision trees, time-series analysis, forecasting, machine learning, Bayesian and Monte Carlo statistics, matrix operations, sampling, text analytics, summarization, classification, principal components analysis, experimental design, unsupervised learning, and constrained optimization (a small worked sketch follows this list).
  • Tools and platforms: Every data scientist should master a core group of modeling, development and visualization tools used on your data science projects, as well as the platforms used for storage, execution, integration and governance of big data in your organization. Depending on your environment, and the extent to which data scientists work with both structured and unstructured data, this may involve some combination of data warehousing, Hadoop, stream computing, NoSQL and other platforms. It will probably also entail providing instruction in MapReduce, R and other new open-source development languages, in addition to SPSS, SAS and any other established tools.
  • Applications and outcomes: Every data scientist should learn the chief business applications of data science in your organization, as well as how to work best with subject-domain experts. In many companies, data science focuses on marketing, customer service, next best offer, and other customer-centric applications. Often, these applications require that data scientists understand how to leverage customer data acquired from structured survey tools, sentiment analysis software, social media monitoring tools and other sources. It is also essential that every data scientist gain an understanding of the key business outcomes, such as maximizing customer lifetime value, that should focus their modeling initiatives.
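
As a small worked companion to the “Algorithms and modeling” item, the following Python sketch strings together several of the listed topics: sampling into train/test sets, principal components analysis, logistic regression, and a simple fitness metric. The library (scikit-learn) and the bundled dataset are illustrative assumptions, not a prescribed toolchain:

    # A toy end-to-end modeling exercise touching several curriculum items at once.
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Standardize, reduce to 10 principal components, then fit a logistic regression.
    model = make_pipeline(StandardScaler(), PCA(n_components=10), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    # Score on held-out data -- the habit that guards against the overfitting bias
    # discussed earlier in this interview.
    print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")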

Classroom instruction is important, but a curriculum that is 100 percent devoted to reading books, taking tests and sitting through lectures is insufficient. Hands-on laboratory work is paramount for a truly well-rounded data scientist. Make sure that your data scientists acquire certifications and degrees that reflect hands-on experience developing statistical models that use real data and address substantive business issues.

A business-oriented data-science curriculum should produce expert developers of statistical and predictive models. It should not degenerate into a program that produces analytics geeks with heads stuffed with theory but whose diplomas are only fit for hanging on the wall.

Q7. Hadoop vs. Spark: what are the pros and cons?

James Kobielus: Big data analytics infrastructures are growing more hybridized than ever. Every new technology—such as Hadoop, in-memory databases, and graph databases—finds its specific niche in terms of use cases, deployment modes, and applications for which it is best suited.

Even as Apache Spark pushes more deeply into big-data environments, it won’t substantially change this trend. Yes, of course Spark is on the fast track to ubiquity in big-data analytics. This is especially true for the next generation of machine-learning applications that feed on growing in-memory pools and require low-latency distributed computations for streaming and graph analytics. But those use cases aren’t the sum total of big-data analytics and never will be.

As we all grow more infatuated with Spark, it’s important to continually remind ourselves of what it’s not suitable for. If, for example, one considers all the critical data management, integration, and preparation tasks that must be performed prior to modeling in Spark, it’s clear that these will not be executed in any of the Spark engines (Spark SQL, Spark Streaming, GraphX). Instead, they’ll be carried out in the data platforms and elastic clusters (HDFS, Cassandra, HBase, Mesos, cloud services, etc.) upon which those engines run. Likewise, you’d be hard-pressed to find anyone who’s seriously considering Spark in isolation for data warehousing, data governance, master data management, or operational business intelligence.

Above all else, Spark is the new power tool for data scientists who are pushing boundaries in the emerging era of in-memory big data analytics in low-latency scenarios of all types. Spark is proving its value as a development tool for the new generation of data scientists building the in-memory statistical models upon which it all will depend.
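
A hedged sketch of that division of labor, in PySpark: the prepared data is assumed to already live in an underlying platform (the HDFS path and column names below are placeholders, not real datasets), while Spark supplies the in-memory, iterative model-building step:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("in-memory-modeling-sketch").getOrCreate()

    # Cleansed, prepared data is assumed to already sit in the underlying platform
    # (HDFS, Cassandra, HBase, ...); Spark reads it rather than managing it.
    df = spark.read.parquet("hdfs:///data/prepared/churn.parquet")  # hypothetical path

    # Assemble feature columns (column names are assumptions) and cache in memory
    # so the iterative fitting step stays low-latency.
    assembler = VectorAssembler(inputCols=["tenure", "usage", "support_calls"], outputCol="features")
    train = assembler.transform(df).select("features", "label").cache()

    # Fit an in-memory logistic regression with Spark MLlib.
    model = LogisticRegression(maxIter=50).fit(train)
    print(model.coefficients)

    spark.stop()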

Let’s not fall into the delusion that everything is converging toward Spark, as if it were the ravenous maw that will devour every other big-data analytics tool and platform. Spark is just another approach that’s being fitted to and optimized for specific purposes.

And let’s resist the hype that treats Spark as Hadoop’s “successor.” This implies that Hadoop and other big-data approaches are “legacy,” rather than what they are, which is foundational. For example, no one is seriously considering doing “data lakes,” “data reservoirs,” or “data refineries” on anything but Hadoop or NoSQL.

——————–

James Kobielus is an industry veteran and serves as IBM Big Data Evangelist; Senior Program Director for Product Marketing in Big Data Analytics; and Team Lead, Technical Marketing, IBM Big Data & Analytics Hub. He spearheads thought leadership activities across the IBM Analytics solution portfolio. He has spoken at such leading industry events as IBM Insight, Hadoop Summit, and Strata. He has published several business technology books and is a popular provider of original commentary on blogs and across social media.

Resources

– Master of Information and Data Science, UC Berkeley School of Information.

– MS in Data Science, NYU Center for Data Science.

– Free data science curriculum, kdnuggets.com

– Data Science | Coursera

– Master of Science in Data Science, Data Science Institute

– Data Mining and Applications Graduate Certificate, Stanford

– The European Data Science Academy (EDSA) designs curricula for data science training and data science education across the European Union (EU).

– The EDISON project will focus on activities to establish the new profession of ‘Data Scientist’, following the emergence of Data Science technologies (also referred to as Data Intensive or Big Data technologies), which change the way research is done, how scientists think and how research data are used and shared. This includes definition of the required skills, a competence framework/profile, a corresponding Body of Knowledge and a model curriculum. It will develop a sustainability/business model to ensure a sustainable increase in Data Scientists graduating from universities and trained by other professional education and training institutions in Europe.
EDISON will facilitate the establishment of a Data Science education and training infrastructure at major European universities by promoting the experience of ‘champion’ universities, involving them in coordinated development and implementation of the model curriculum and in the creation of a cooperative educational and training infrastructure.

Related Posts

– RIP Big Data, by Carl Olofson, Research Vice President, Data Management Software Research, IDC. ODBMS.org, January 2016

– Open Source Software and IBM’s Big Data platform, by Cynthia M. Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory. ODBMS.org, April 2016

– Looking back at Big Data in 2015, by Cynthia M. Saracco, IBM Senior Solution Architect. ODBMS.org, November 2015

– Heuristics for a Data Scientist: A common sense approach, by Silvia Dassiè, Data Scientist at Ryanair. ODBMS.org, December 2015

– The rise of immutable data stores, by Alan Morrison, Senior Manager, PwC Center for Technology and Innovation. ODBMS.org, October 2015

Follow us on Twitter: @odbmsorg

##

On the Internet of Things. Interview with Colin Mahony http://www.odbms.org/blog/2016/03/on-the-internet-of-things-interview-with-colin-mahony/ http://www.odbms.org/blog/2016/03/on-the-internet-of-things-interview-with-colin-mahony/#comments Mon, 14 Mar 2016 08:45:56 +0000 http://www.odbms.org/blog/?p=4101

“Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data.”– Colin Mahony

I have interviewed Colin Mahony, SVP & General Manager, HPE Big Data Platform. Topics of the interview are: The challenges of the Internet of Things, the opportunities for Data Analytics, the positioning of HPE Vertica and HPE Cloud Strategy.

RVZ

Q1. Gartner says 6.4 billion connected “things” will be in use in 2016, up 30 percent from 2015.  How do you see the global Internet of Things (IoT) market developing in the next years?

Colin Mahony: As manufacturers connect more of their “things,” they have an increased need for analytics to derive insight from massive volumes of sensor or machine data. I see these manufacturers, particularly manufacturers of commodity equipment, with a need to provide more value-added services based on their ability to deliver higher levels of service and overall customer satisfaction. Data analytics platforms are key to making that happen. Also, we could see entirely new analytical applications emerge, driven by what consumers want to know about their devices, combining that data with, say, their exercise regimens, health vitals, social activities, and even driving behavior, for full personal insight.
Ultimately, the Internet of Things will drive a need for the Analyzer of Things, and that is our mission.

Q2. What Challenges and Opportunities bring the Internet of Things (IoT)? 

Colin Mahony: Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data. The reason? Traditional data warehouses were designed well before the Internet of Things, or at least before OT (operational technology) like medical devices, industrial equipment, cars, and more were connected to the Internet. So having an analytical platform that provides the scale and performance required to handle these volumes is important, but customers are taking more of a two- or three-tier approach that involves some sort of analytical processing at the edge before data is sent to an analytical data store. Apache Kafka is also becoming an important tier in this architecture, serving as a message bus that collects and pushes data from the edge in streams to the appropriate database, CRM system, or analytical platform, where, for example, fault data can be correlated over months or even years to predict and prevent part failure and optimize inventory levels.
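
A minimal Python sketch of that edge-plus-Kafka tier (using the kafka-python client; the broker address, topic name, device fields, and threshold rule are all illustrative assumptions): readings are filtered at the edge, and only the interesting events are pushed onto a Kafka topic for the downstream analytical platform to consume:

    import json
    import random
    import time
    from kafka import KafkaProducer  # kafka-python package

    producer = KafkaProducer(
        bootstrap_servers="broker.example.com:9092",           # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for _ in range(100):
        reading = {"device_id": "pump-17", "temp_c": random.gauss(70, 5), "ts": time.time()}
        # Edge-side processing: forward only readings that look anomalous, so the
        # data center is not flooded with raw sensor volume.
        if reading["temp_c"] > 80:
            producer.send("sensor-anomalies", value=reading)

    producer.flush()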

Q3. Big Data: In your opinion, what are the current main demands/needs in the market?

Colin Mahony: All organizations want – and need – to become data-driven organizations. I mean, who wants to make such critical decisions based on half answers and anecdotal data? That said, traditional companies with data stores and systems going back 30-40 years aren’t on the same level playing field as the next market disruptor that just received their series B funding and knows only that analytics is the lifeblood of their business and all their critical decisions.
The good news is that whether you are a 100-year old insurance company or the next Uber or Facebook, you can become a data-driven organization by taking an open platform approach that uses the best tool for the job and can incorporate emerging technologies like Kafka and Spark without having to bolt on or buy all of that technology from a single vendor and get locked in.  Understanding the difference between an open platform with a rich ecosystem and open source software as one very important part of that ecosystem has been a differentiator for our customers.

Beyond technology, we have customers that establish analytical centers of excellence that actually work with the data consumers – often business analysts – who run ad-hoc queries using their preferred data visualization tool to get the insight they need for their business unit or department. If the data analysts struggle, then this center of excellence, which happens to report up through IT, collaborates with them to understand and help them get to the analytical insight – rather than simply halting the queries with no guidance on how to improve.

Q4. How do you embed analytics and why is it useful? 

Colin Mahony: OEM software vendors, particularly, see the value of embedding analytics in their commercial software products or software as a service (SaaS) offerings.  They profit by creating analytic data management features or entirely new applications that put customers on a faster path to better, data-driven decision making. Offering such analytics capabilities enables them to not only keep a larger share of their customer’s budget, but at the same time greatly improve customer satisfaction. To offer such capabilities, many embedded software providers are attempting unorthodox fixes with row-oriented OLTP databases, document stores, and Hadoop variations that were never designed for heavy analytic workloads at the volume, velocity, and variety of today’s enterprise. Alternatively, some companies are attempting to build their own big data management systems. But such custom database solutions can take thousands of hours of research and development, require specialized support and training, and may not be as adaptable to continuous enhancement as a pure-play analytics platform. Both approaches are costly and often outside the core competency of businesses that are looking to bring solutions to market quickly.

Because it’s specifically designed for analytic workloads, HPE Vertica is quite different from other commercial alternatives. Vertica differs from OLTP DBMS and proprietary appliances (which typically embed row-store DBMSs) by grouping data together on disk by column rather than by row (that is, so that the next piece of data read off disk is the next attribute in a column, not the next attribute in a row). This enables Vertica to read only the columns referenced by the query, instead of scanning the whole table as row-oriented databases must do. This speeds up query processing dramatically by reducing disk I/O.
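
A toy, self-contained Python illustration of that column-versus-row trade-off (a conceptual sketch, not Vertica’s internals): answering an analytic question such as “average price” only needs to touch one column in a column-oriented layout, while a row-oriented layout forces a scan of every attribute of every row:

    import numpy as np

    n_rows = 1_000_000
    # Row-oriented layout: each record carries all attributes together.
    rows = np.zeros(n_rows, dtype=[("order_id", "i8"), ("customer", "i8"),
                                   ("price", "f8"), ("quantity", "i4")])

    # Column-oriented layout: each attribute stored contiguously on its own.
    columns = {name: rows[name].copy() for name in rows.dtype.names}

    bytes_scanned_row_store = rows.nbytes                  # whole table must be read
    bytes_scanned_col_store = columns["price"].nbytes      # only the referenced column

    print(f"row store scans    {bytes_scanned_row_store / 1e6:.1f} MB")
    print(f"column store scans {bytes_scanned_col_store / 1e6:.1f} MB")
    # The query itself is identical either way:
    print("avg price:", columns["price"].mean())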

You’ll find Vertica as the core analytical engine behind some popular products, including Lancope, Empirix, Good Data, and others as well as many HPE offerings like HPE Operations Analytics, HPE Application Defender, and HPE App Pulse Mobile, and more.

Q5. How do you make a decision when it is more appropriate to “consume and deploy” Big Data on premise, in the cloud, on demand and on Hadoop?

Colin Mahony: The best part is that you don’t need to choose with HPE. Unlike most emerging data-warehouse-as-a-service offerings, where your data is trapped in the provider’s database when your priorities or IT policies change, HPE offers the most complete range of deployment and consumption models. If you want to spin up your analytical initiative in the cloud for a proof of concept, or during the holiday shopping season for e-retailers, you can do that easily with HPE Vertica OnDemand.
If your organization finds that due to security, confidentiality, or privacy concerns you need to bring your analytical initiative back in house, then you can use HPE Vertica Enterprise on-premises without losing any customizations or disrupting your business. Have petabyte volumes of largely unstructured data where the value is unknown? Use HPE Vertica for SQL on Hadoop, deployed natively on your Hadoop cluster, regardless of the distribution you have chosen. Each consumption model, whether in the cloud, on-premises, on-demand, or using reference architectures for HPE servers, comes with that same trusted underlying core.

Q6. What are the new class of infrastructures called “composable”? Are they relevant for Big Data?

Colin Mahony: HPE believes that a new architecture is needed for Big Data – one that is designed to power innovation and value creation for the new breed of applications while running traditional workloads more efficiently.
We call this new architectural approach Composable Infrastructure. HPE has a well-established track record of infrastructure innovation and success. HPE Converged Infrastructure, software-defined management, and hyper-converged systems have consistently proven to reduce costs and increase operational efficiency by eliminating silos and freeing available compute, storage, and networking resources. Building on our converged infrastructure knowledge and experience, we have designed a new architecture that can meet the growing demands for a faster, more open, and continuous infrastructure.

Q7. What is HPE Cloud Strategy? 

Colin Mahony: Hybrid cloud adoption is continuing to grow at a rapid rate and a majority of our customers recognize that they simply can’t achieve the full measure of their business goals by consuming only one kind of cloud.
HPE Helion not only offers private cloud deployments and managed private cloud services, but we have created the HPE Helion Network, a global ecosystem of service providers, ISVs, and VARs dedicated to delivering open standards-based hybrid cloud services to enterprise customers. Through our ecosystem, our customers gain access to an expanded set of cloud services and improve their abilities to meet country-specific data regulations.

In addition to the private cloud offerings, we have a strategic and close alliance with Microsoft Azure, which enables many of our offerings, including Haven OnDemand, in the public cloud. We also work closely with Amazon because our strategy is not to limit our customers, but to ensure that they have the choices they need and the services and support they can depend upon.

Q8. What are the advantages of an offering like Vertica in this space?

Colin Mahony: More and more companies are exploring the possibility of moving their data analytics operations to the cloud. We offer HPE Vertica OnDemand, our data warehouse as a service, for organizations that need high-performance, enterprise-class data analytics for all of their data to make better business decisions now. Designed to drastically improve query performance over traditional relational database systems, HPE Vertica OnDemand is engineered from the same technology that powers the HPE Vertica Analytics Platform. For organizations that want to select Amazon hardware and still maintain control over the installation, configuration, and overall maintenance of Vertica for ultimate performance and control, we offer the Vertica AMI (Amazon Machine Image). The Vertica AMI is a bring-your-own-license model that is ideal for organizations that want the same experience as on-premises installations, only without procuring and setting up hardware. Regardless of which deployment model you choose, we have you covered for “on demand” or “enterprise cloud” options.

Q9. What is HPE Vertica Community Edition?

Colin Mahony: We have had tens of thousands of downloads of the HPE Vertica Community Edition, a freemium edition of HPE Vertica with all of the core features and functionality that you experience with our core enterprise offering. It’s completely free for up to 1 TB of data storage across three nodes. Companies of all sizes use the Community Edition to download, install, set up, and configure Vertica very quickly on x86 hardware, or use our Amazon Machine Image (AMI) for a bring-your-own-license approach to the cloud.

Q10. Can you tell us how Kiva.org, a non-profit organization, uses on-demand cloud analytics to leverage the internet and a worldwide network of microfinance institutions to help fight poverty? 

Colin Mahony: HPE is a major supporter of Kiva.org, a non-profit organization with a mission to connect people through lending to alleviate poverty. Kiva.org uses the internet and a worldwide network of microfinance institutions to enable individuals to lend as little as $25 to help create opportunity around the world. When the opportunity arose to help support Kiva.org with an analytical platform to further the cause, we jumped at it. Kiva.org relies on Vertica OnDemand to reduce capital costs, leverage the SaaS delivery model to adapt more quickly to changing business requirements, and work with over a million lenders and hundreds of field partners and volunteers across the world. To see a recorded Webinar with HPE and Kiva.org, see here.

Qx Anything else you wish to add?

Colin Mahony: We appreciate the opportunity to share the features and benefits of HPE Vertica as well as the bright market outlook for data-driven organizations. However, I always recommend that any organization struggling with how to get started on its analytics initiative speak and meet with peers to learn best practices and avoid potential pitfalls. The best way to do that, in my opinion, is to visit with the more than 1,000 Big Data experts in Boston from August 29 to September 1 at the HPE Big Data Conference. Click here to learn more and join us for 40+ technical deep-dive sessions.

————-

Colin Mahony, SVP & General Manager, HPE Big Data Platform

Colin Mahony leads the Hewlett Packard Enterprise Big Data Platform business group, which is responsible for the industry leading Vertica Advanced Analytics portfolio, the IDOL Enterprise software that provides context and analysis of unstructured data, and Haven OnDemand, a platform for developers to leverage APIs and on demand services for their applications.
In 2011, Colin joined Hewlett Packard as part of the highly successful acquisition of Vertica, and took on the responsibility of VP and General Manager for HP Vertica, where he guided the business to remarkable annual growth and recognized industry leadership. Colin brings a unique combination of technical knowledge, market intelligence, customer relationships, and strategic partnerships to one of the fastest growing and most exciting segments of HP Software.

Prior to Vertica, Colin was a Vice President at Bessemer Venture Partners focused on investments primarily in enterprise software, telecommunications, and digital media. He established a great network and reputation for assisting in the creation and ongoing operations of companies through his knowledge of technology, markets and general management in both small startups and larger companies. Prior to Bessemer, Colin worked at Lazard Technology Partners in a similar investor capacity.

Prior to his venture capital experience, Colin was a Senior Analyst at the Yankee Group serving as an industry analyst and consultant covering databases, BI, middleware, application servers and ERP systems. Colin helped build the ERP and Internet Computing Strategies practice at Yankee in the late nineties.

Colin earned an M.B.A. from Harvard Business School and a bachelor’s degree in Economics with a minor in Computer Science from Georgetown University. He is an active volunteer with Big Brothers Big Sisters of Massachusetts Bay and the Joey Fund for Cystic Fibrosis.

Resources

What’s in store for Big Data analytics in 2016, Steve Sarsfield, Hewlett Packard Enterprise. ODBMS.org, 3 FEB, 2016

What’s New in Vertica 7.2?: Apache Kafka Integration!, HPE, last edited February 2, 2016

Gartner Says 6.4 Billion Connected “Things” Will Be in Use in 2016, Up 30 Percent From 2015, Press release, November 10, 2015

The Benefits of HP Vertica for SQL on Hadoop, HPE, July 13, 2015

Uplevel Big Data Analytics with Graph in Vertica – Part 5: Putting graph to work for your business , Walter Maguire, Chief Field Technologist, HP Big Data Group, ODBMS.org, 2 Nov, 2015

HP Distributed R, ODBMS.org, 19 FEB, 2015.

Understanding ROS and WOS: A Hybrid Data Storage Model, HPE, October 7, 2015

Related Posts

On Big Data Analytics. Interview with Shilpa Lawande. Source: ODBMS Industry Watch, published on December 10, 2015

On HP Distributed R. Interview with Walter Maguire and Indrajit Roy. Source: ODBMS Industry Watch, published on April 9, 2015

Follow us on Twitter: @odbmsorg

##
