ODBMS Industry Watch: Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation. http://www.odbms.org/blog

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah
Published 13 March 2017
http://www.odbms.org/blog/2017/03/on-the-new-developments-in-apache-spark-and-hadoop-interview-with-amr-awadallah/

“What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).”–Amr Awadallah

I have interviewed Amr Awadallah, Chief Technology Officer at Cloudera.
Main topics of the interview are: the new developments in Apache Spark 2.0 Beta and the Hadoop 3.0.0-alpha1 release; the lessons learned from Amr's experience of using Hadoop at Yahoo!; and the business problems that the world's leading organisations face.

RVZ

Q1. Before Cloudera, you served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organisations to use Hadoop for data analysis and business intelligence. What are the main lessons you learned in that period?

Amr Awadallah: Couple of things. First, I learned that Hadoop is capable of solving all the business intelligence problems that I had at Yahoo.
Namely:
(1) our systems weren’t scaling fast enough (we needed to cut down transformation times from hours to minutes),
(2) our systems weren't economical on a $/TB basis, thus making it hard to retain valuable data for longer time periods, and (3) we needed new methods to be able to store and analyze semi-structured (e.g. logs) and unstructured data (e.g. social media).
By implementing Hadoop in our team we saw first-hand how it can address all these problems. The second lesson I learned was that Hadoop, back then, was very rough to deploy and program against (it took us many months to deploy it and reprogram our transformations to run on it). Those lessons made it clear that there was room for a startup to focus on Hadoop, since (1) it was solving very real data problems that many organizations would face, and (2) it needed a lot of polish to make it work smoothly, securely, and reliably within the enterprise.

Q2. In 2008 you founded Cloudera together with Mike Olson (Oracle), Jeff Hammerbacher (Facebook) and Christophe Bisciglia (Google). What was your main motivation at that time?

Amr Awadallah: Pretty much to do what I describe above, we wanted to make the Hadoop technology easy to use for organizations. That included: (1) creating a distribution for Hadoop that bundles all the necessary open-source projects that make it work (we call that CDH, short for Cloudera Distribution for Apache Hadoop). (2) We also created a number of proprietary system management, security, and meta-data management tools around CDH to make it easier for organizations to deploy and operate Hadoop in production.

Q3. What are the typical challenging business problems that the world's leading organisations face?

Amr Awadallah: The technology we provide is very powerful and can be used to solve many problems across many industries, but we see four common themes: The first is simply using Hadoop as a faster, bigger, cheaper system for business intelligence and data analytics, i.e. a lot of organizations just use us to do things they were already doing, but in a more economically scalable way.
The second use case is around deeper understanding of customers, i.e. moving away from segmenting all customers into a number of predefined buckets and instead creating dynamic micro-segments that address each customer in a more precise way (thus reducing false positives).
The third use case is about using data to build better products and services, and this use case is catalyzed by the Internet of Things. Thanks to smart sensors we are able to measure the real world better than ever before; so this use case is about taking all that data and leveraging it to either enhance our current product/service offerings, or build entirely new ones.
The fourth use case is about reducing business risk, and it manifests itself in a number of different sub-cases depending on the industry. For example, cyber-security is one of the key ways to reduce risk, and we have an open source project co-developed with Intel, called Apache Spot, which organizations can use to collect all their network flow data then use Spark machine learning algorithms to detect the anomalies in that data. Anti-money laundering and fraud detection is another way that our banking customers employ our platform to reduce risk within their businesses. Similarly, our insurance industry customers use our system to detect fraudulent claims, etc.

Q4. Can they be solved by analysing data? Can you give us some examples of how the use of advanced analytics drive business decisions?

Amr Awadallah: Yes, all the problems mentioned above can be solved with data. I want to highlight though that this isn’t necessarily about business decisions, which is what the Business Intelligence movement was about (we just help make that cheaper and faster). What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).
One of my favorite examples is a solution that one of our customers built to give voice to premature babies in neonatal intensive care units. They analyze the signals coming from the baby (sounds, blood pressure, heart rate, temperature, a few brain signals), and based on that a message appears on the monitor above the infant showing the nurse whether the baby is hungry, distressed by too much noise or light, etc.
That is really what we mean by using data to create new products and services that weren't possible before (and not just reports/dashboards).

Q5. Graphs are important. Is it possible to do scalable graph analytics? If yes, how?

Amr Awadallah: Graphs are indeed important, and a lot of our customer use cases trace back to that (not just social media analytics; anti-money laundering, for example, requires analyzing relationships between many financial accounts to detect bad behavior, and the same is true for cyber security applications). I think scalability depends a fair bit on what's being analyzed and what we mean by scalable, but for most practical purposes I would say Spark's GraphX is good enough. For example, you can compute PageRank fairly efficiently and scalably on a cluster using GraphX.
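
As a minimal illustration of that last point, here is a sketch of PageRank on GraphX in Scala; the input path and the existing SparkContext `sc` are placeholders, not details from the interview:

```scala
import org.apache.spark.graphx.GraphLoader

// Load an edge list ("srcId dstId" per line) into a distributed graph.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/followers.txt")

// Run PageRank until the ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices

// Print the ten highest-ranked vertices.
ranks.sortBy(_._2, ascending = false).take(10).foreach(println)
```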

Q6. Data security is increasingly important. The risk is due to the growing number of device endpoints. What solutions exist to minimise such risk?

Amr Awadallah: A comprehensive enterprise data security strategy seeks to mitigate the risks presented by a growing number of potentially compromised endpoints connecting to corporate networks. Endpoint security involves one or more of the following preventative controls:
The first is policy-based enforcement of endpoint security configuration prior to granting an endpoint access to network-based corporate assets. This ensures that any endpoint connected to corporate networks meets minimum requirements for endpoint security configuration.
The second measure is endpoint based anti-malware software (the existence of which may be a policy requirement to connect to the network per the first measure). Anti-malware prevents malicious code from infecting endpoints by monitoring for changes to system configuration and unusual activity or processes.
The third measure is endpoint encryption of corporate data on hard drives, folders and even removable media.
As mentioned above, we also collaborate with Intel on Apache Spot, which tracks network flow patterns to detect anomalous communication behavior between different devices (including endpoint devices). Apache Spot recently won the InfoWorld 2017 Tech of the Year Award. Other advanced analytics security partners we work closely with are CounterTack, Securonix, Niara, and Jask.

Q7. You recently announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform. How does it work? And how does it differ from the Hadoop-based data platform?

Amr Awadallah: First, at a meta-level, Hadoop (MapReduce specifically) was very good at achieving scalable computation by spreading jobs across many CPU cores and hard disk spindles. That said, MapReduce wasn’t very efficient in how it leveraged memory to optimize the performance of data processing pipelines that have many stages or iterations.
The main power of Spark, that made it take over from MapReduce, was how it truly leveraged memory to achieve better performance in deep or iterative data pipelines. That coupled with a simpler developer API made Spark take over very quickly from MapReduce.
Most of our new customer implementations for data processing or data science tend to be in Spark these days, versus MapReduce.
I should clarify, however, that this doesn't mean that Hadoop is dead, as some say. Apache Hadoop comprises three key subsystems: (1) MapReduce for computation, (2) YARN for resource scheduling, and (3) HDFS for storage. Spark only replaces MapReduce; we still rely heavily on both YARN and HDFS.

That said, the most notable features in Apache Spark 2.0 are:

1) Dataset API: a new API that represents the distributed collections of objects processed by Spark's execution engine. It is an extension of Spark's DataFrame API and improves upon it by providing type-safe, object-oriented programming interfaces. Users can now write user-defined functions and lambda functions with compile-time type safety. With the Dataset API, users benefit from optimized operations (like sort, join, hash, etc.) in the SparkSQL engine, while also getting compile-time type safety for user-defined functions.
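
A minimal Scala sketch of that type safety, assuming an existing SparkSession named `spark` and a JSON file whose fields match the case class (both are placeholders):

```scala
import org.apache.spark.sql.Dataset

case class SensorReading(deviceId: String, temperature: Double, ts: Long)

import spark.implicits._

// Read into a typed Dataset: referencing a misspelled field or the wrong type
// now fails at compile time instead of at run time.
val readings: Dataset[SensorReading] =
  spark.read.json("hdfs:///data/readings.json").as[SensorReading]

// The lambda is type-checked, while the query still runs through the SparkSQL engine.
val hot = readings.filter(r => r.temperature > 40.0)
hot.groupByKey(_.deviceId).count().show()
```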

2) Model & Pipeline Persistence in Spark’s ML library: Machine learning Pipelines built with Spark’s ML library can now be serialized to a file and read back in.
The ability to save and reload these pipelines makes it easy for users to perform version control on the pipelines and safely distribute the pipelines. This helps in operationalizing them in production systems.
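
For illustration, a small Scala sketch of saving and reloading a fitted pipeline; the stages, column names, paths and the `trainingDf` DataFrame are assumptions for the example, not details from the interview:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A simple text-classification pipeline: tokenize, hash features, fit a model.
val pipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("text").setOutputCol("words"),
  new HashingTF().setInputCol("words").setOutputCol("features"),
  new LogisticRegression().setMaxIter(10)))

val model = pipeline.fit(trainingDf)  // trainingDf: a DataFrame with "text" and "label" columns

// Persist the fitted pipeline, then reload it elsewhere (e.g. a production scorer).
model.write.overwrite().save("hdfs:///models/spam-v1")
val restored = PipelineModel.load("hdfs:///models/spam-v1")
```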

3) Structured Streaming: a new stream processing API and engine that provides SQL-like abstractions for authoring operations on data streams, and also improves performance by using the SparkSQL engine for processing the data streams. However, this is still an experimental API and not ready for production usage yet.
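
A minimal Scala sketch of the API, using a toy socket source for brevity (a real deployment would more likely read from Kafka or files):

```scala
// Treat a stream of lines as an unbounded table and keep a running count per value.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines.groupBy("value").count()

// Continuously emit the updated counts; the "console" sink is just for demonstration.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```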

Besides the above 3 notable enhancements, there are a bunch of performance and scalability improvements across the board.

Q8. Apache Impala vs. Amazon Redshift: How Does Redshift Compare to Impala?

Amr Awadallah: Apache Impala is an analytic database engine architecturally designed to perform high-performance highly-concurrent SQL analytics on scalable, open data platforms like Hadoop’s HDFS and Amazon S3.
Impala decouples data storage from compute and lets users query data without having to move/load data specifically into an Impala storage-engine (it doesn’t have one). This architectural difference uniquely enables Impala to deliver a more flexible Business Intelligence experience than traditional database architectures like Redshift (which requires pre-loading the data).

Some of the key benefits of the Impala approach include:

* On-demand resources that are immediately ready to query existing S3 data without loading to a different data silo
* Ability to elastically grow/shrink clusters as needed due to decoupled storage and compute
* More predictable, multi-tenant isolation due to the ability to have multiple Impala clusters sharing a common S3 data repository
* Ability to share common data not only amongst Impala clusters, but also with any application that runs on cloud-native S3 storage. For example, you can have both Apache Impala and Apache Spark run against the same data asset in S3, whereas Apache Spark cannot easily access data stored in Redshift; it has to go through SQL first (see the sketch after this list).
* Greater flexibility to explore new use cases, analytics, and data by directly querying S3 without rigid traditional data models and ETL
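
As a hedged illustration of that shared-data point, here is a short Scala/Spark sketch reading the same S3 objects an Impala cluster could query in place; the bucket, prefix and column names are invented for the example:

```scala
// Spark queries the S3 data directly: no load step and no second copy of the data.
val events = spark.read.parquet("s3a://analytics-bucket/events/")
events.createOrReplaceTempView("events")

spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type").show()
```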

Not only does Impala deliver this additional flexibility, it does so at greater cost-performance and scalability compared to Redshift. See the following benchmark for data on that.

That said, Redshift's sweet spot is a different target: smaller data marts. Most Redshift installations are in the dozens-of-nodes range, where Redshift's limitations in scalability, elasticity, and flexibility, and its requirement to maintain separate copies of data, are less critical.

Q9. What is Apache Kudu, and why is it relevant for Impala Users?

Amr Awadallah: Historically we had two storage engines in our distribution: (1) HDFS which is optimized for high-throughput analytics, but doesn’t support updates/inserts and (2) HBase which is optimized for low-latency updates/inserts but isn’t good for doing high-throughput queries. To build a proper data warehouse or time-series analytics system, you typically still need to make updates/inserts and that was why we created Apache Kudu.

Kudu is a new storage system that combines the benefits of both HDFS and HBase into one: it allows for low-latency updates/inserts, but also supports high-throughput analytical queries (i.e. fast analytics on fast moving data).
Unlike HDFS, Kudu is not a file system; it is a record-based system, so the unit of storage is a record as opposed to a file. This allows Kudu to unlock Impala for real-time streaming applications that were not possible with HDFS.
In HDFS the data only becomes visible to Impala after the file is closed, which typically happens after a large number of records have accumulated (that adds latency between when records are written and when they become visible to the analytical engine). With Kudu, as soon as a record is written it is immediately visible to the Impala analytical engine. Finally, just like HDFS and HBase, the Kudu storage engine is fully integrated with our entire stack, not just Impala.
For example, you can also use Apache Spark for machine-learning jobs directly against Kudu.
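
As one hedged example of that integration, a Scala sketch reading a Kudu table from Spark with the kudu-spark connector; the master address and table name are placeholders, and the exact read syntax varies a bit across kudu-spark versions:

```scala
import org.apache.kudu.spark.kudu._

// Load a Kudu table as a DataFrame; rows inserted moments ago are already visible.
val metrics = spark.read
  .options(Map(
    "kudu.master" -> "kudu-master:7051",
    "kudu.table"  -> "impala::default.sensor_metrics"))
  .format("org.apache.kudu.spark.kudu")
  .load()

metrics.filter("temperature > 40.0").show()
```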

Q9. The Apache Hadoop project recently announced its 3.0.0-alpha1 release. What is it?

Amr Awadallah: HDFS Erasure Coding is really the main exciting new feature in Hadoop 3. Traditionally HDFS required three replicas, by default, for every data block to achieve durability, concurrent performance, and availability. Using erasure coding techniques, HDFS in Hadoop 3 allows us to significantly reduce the storage overhead from 3x (i.e. 200%) to just 20% extra bits for parity. This gives us the same durability benefits as 3x replication, but it comes at the cost of potentially lower concurrent performance (when more than one job is trying to access the same block at the same time) and lower availability resilience in the face of top-of-rack switch failures (less of an issue these days).
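
As a back-of-the-envelope way to read those percentages (the exact parity overhead depends on which Reed-Solomon policy is configured, so treat the formula rather than any single number as the takeaway):

```latex
\text{overhead}_{\text{3-way replication}} = \frac{2\ \text{extra copies}}{1\ \text{copy of data}} = 200\%,
\qquad
\text{overhead}_{\mathrm{RS}(k,m)} = \frac{m\ \text{parity blocks}}{k\ \text{data blocks}},
\quad\text{e.g. } \mathrm{RS}(10,4): \tfrac{4}{10} = 40\%,\ \ \mathrm{RS}(6,3): \tfrac{3}{6} = 50\%.
```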

Other cool additions are ATS v2 and classpath isolation, which you can read more about here.

Q11. What is the roadmap ahead for Cloudera Enterprise?

Amr Awadallah: We don’t discuss details of our product roadmap publicly, but there are three guiding themes for us in 2017: The first theme is fast-analytics on fast-moving data (which I covered above in regards to Kudu).
The second theme is cloud: making Cloudera Enterprise work better in cloud environments, and making it easier to move workloads (and skill sets) from on-premise clusters to transient cloud clusters in AWS, Azure, and/or Google Cloud.
The third theme is simplifying data-science and machine learning development, especially reducing the time from when a new algorithm is developed to how it can be deployed into production (stay tuned for more on that front).
——————————
Amr Awadallah, Ph.D., Chief Technology Officer, Cloudera
Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo after they acquired his first startup, VivaSmart, in July of 2000. Amr holds Bachelor's and Master's degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.

Resources

Download Page for Apache Spark™

Apache Impala supported by Cloudera Enterprise

DATA-X: Videobook- 8 short videos introduce query analytics for Apache Hadoop

A package that allows R developers to use Hadoop HBase

Book: Big Data Analytics with Spark

Related Posts

Streaming Analytics for Chain Monitoring. By Natalino Busa, Head of Data Science at Teradata. ODBMS.org, January 12, 2017

Five Challenges to IoT Analytics Success. By Dr. Srinath Perera. ODBMS.org, September 23, 2016

Next-Generation Genomics Analysis with Apache Spark. By Jason Bailey. ODBMS.org, June 30, 2016

Supporting the Fast Data Paradigm with Apache Spark. By Stephen Dillon, Data Architect, Schneider Electric. ODBMS.org, April 23, 2016

– The new series of Q&A with Leading Data Scientists– ODBMS.org:
Part II
Part I

Follow us on Twitter: @odbmsorg

##

On in-memory, key-value data stores. Ofer Bengal and Yiftach Shoolman
Published 13 February 2017
http://www.odbms.org/blog/2017/02/on-in-memory-key-value-data-stores-ofer-bengal-and-yiftach-shoolman/

“While modernizing legacy applications used to be a key reason for deploying in-memory, key-value data stores, we see that this is changing. New applications, particularly those that are highly interactive, need to bring a user experience that is very responsive under all conditions. For such new applications, an in-memory datastore, particularly one that can simplify run-time analytics like counting, scoring, managing lists and sets, is becoming a key ingredient for low latency responses and high throughput.”  –Ofer Bengal.

I have interviewed Ofer Bengal, Co-Founder and CEO of Redis Labs, and Yiftach Shoolman, Co-Founder and CTO of Redis Labs.
Main topics of the interview are: how the database market is evolving, proprietary vs. open source software, in-memory key-value data stores, and the new features of Redis.

RVZ

Q1. How do you see the database market evolving?

Ofer Bengal, Yiftach Shoolman: The main trends we identify today and believe will continue in upcoming years are:
1) Non-relational databases will continue to see growing adoption, because the schema framework is ineffective when it comes to unstructured data, change in data patterns, growing data volumes, more stringent performance requirements and the way modern apps are built.
2) Multiple database models, as opposed to the absolute dominance of RDBMS in the past few decades, with each model solving the requirements of certain use cases.
Moreover, certain modern databases can run several database models (document, graph, etc.)
3) Multiple databases (different types or the same type) serving the same app. Modern applications are based on micro service architecture, in which each micro service works with the best database for its use case.
This creates new challenges for modern databases: (a) Instant provisioning – sometimes hundreds or thousands of databases are provisioned within a second, and (b) Multi-tenancy, otherwise the cost associated with managing database infrastructure becomes extremely high.
4) Database-as-a-service is growing vs. self-deployed and operated databases. With enterprises gradually moving to the cloud and having to deal with multiple types of databases, it makes a lot of sense to outsource deployment and ongoing operations rather than building an in-house practice of DBAs and DevOps.
5) Hybrid transactional and analytical processing (HTAP). Driven by the need for application analytics to drive business decision making in real time, certain modern databases can handle those two different workloads simultaneously, eliminating the need for exporting transactional data to a separate dedicated analytical database.

Q2. Proprietary vs. open source software: what are the pros and cons?

Ofer Bengal, Yiftach Shoolman: From the community perspective, open source is great. If there is a vibrant community, it pushes innovation, problem solving and compatibility issues with different environments.
From the users' perspective, open source is “open”, accessible, can be used by anyone, transparent, and free of charge.
It often comes with less danger of vendor lock-in. It is very suitable for independent developers and startups. However, enterprises using open source products may have certain challenges:
1. The product is not always suitable for enterprise workloads, especially when it comes to databases. Capabilities like infinite seamless scaling, high-availability with instant failover and stable performance at scale are not always the open source developer’s top priority.
2. Commercial support must be obtained, and this typically comes with a price tag not much different from acquiring a commercial database product.
3. Commercial support is typically provided by a single company (most probably founded by the open source creators), which creates “vendor lock-in” by itself.
4. In the case of databases, using database-as-a-service may turn out to be lower in cost compared to provisioning cloud instances and running zero-cost open source software on them, because commercial services can be based on an efficient multi-tenant architecture.

Q3. What is the current market for in-memory, key-value data stores?

Ofer Bengal: In-memory key-value data stores (sometimes called in-memory data grids, or IMDGs) have been around for more than a decade and have proven capable of supporting digital business needs for a responsive, always-on user experience; real-time, actionable insights; and dynamic scaling. They are widely employed when you want to scale/modernize legacy applications without spending additional money on extremely expensive RDBMS licenses and hardware. This is achieved by providing a scalable and reliable in-memory datastore that enables low-latency transactional and analytical processing.
While modernizing legacy applications used to be a key reason for deploying in-memory, key-value data stores, we see that this is changing. New applications, particularly those that are highly interactive, need to bring a user experience that is very responsive under all conditions. For such new applications, an in-memory datastore, particularly one that can simplify run-time analytics like counting, scoring, managing lists and sets, is becoming a key ingredient for low latency responses and high throughput.

From a Redis perspective, our innovation in data structures brings about the ability to simplify development to the extent that now most Redis users use it as a first responder and primary datastore for substantial pieces of their data. Furthermore with Redis’ data-structures, users can run operational and analytical use cases on the same database.
In addition, acceleration of other in-memory platforms like Spark is possible with Redis.

Gartner estimates that, in 2015, the stand-alone IMDG market was worth approximately $600 million, having grown by about 30% from the previous year. Gartner expects the market to continue to grow in the double-digit range through 2020 and to exceed $1 billion by 2018. Redis, one of the leaders in this space, grew in just a few years to be one of the most popular databases used by developers and enterprises.

Q4. Amazon ElastiCache supports two open-source in-memory engines: Redis and Memcached. What does it mean in practice?

Yiftach Shoolman: In practice, Amazon ElastiCache is a simple caching service that simplifies the developer experience by providing these two open source in-memory engines. Legacy applications that use a simple cache can use ElastiCache seamlessly.
However, ElastiCache is single-tenant, limited to caching use cases and cannot be used as a database, as it lacks enterprise-grade functionality such as infinite seamless scalability, instant failover and predictable performance.
The Redis Labs equivalent service, called Redis Cloud, provides all the benefits of an enterprise-class Redis.

Q5. What are the pros and cons of Memcached and Redis?

Yiftach Shoolman: Redis can be thought of as a modern database, while memcached is an older technology designed specifically for ephemeral caching.
The most important difference is in persistence and HA – memcached is neither persistent nor highly available, while Redis can operate as a full-fledged in-memory database, highly available through both in-memory replication and data persistence. This reflects the fact that caches in older architectures were not required to be highly available, but in modern architectures, built for scale and volume, cache outages can significantly impact the business and user experience.
Redis, the newer and more versatile technology, allows individual data elements to be manipulated, while memcached often incurs serialization/deserialization overheads that make the entire application processing much slower. This is because memcached can handle only simple key-value use cases, whereas Redis offers many more data structures (hashes, sets, sorted sets, lists, HyperLogLog...) that simplify complex data processing, analysis and operational use cases with ease.
Even when used as a cache, Redis has more sophisticated eviction policies, which can be either active or passive, while memcached has only a simple LRU and lazy eviction.
Redis and memcached are both very popular open source projects, but given its richer functionality, more advanced design, many potential uses, and greater cost efficiency at scale, Redis should be your first choice in nearly every case.
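
To make the "manipulate individual data elements" point concrete, here is a small sketch using the Jedis client from Scala; the host, key and field names are invented for the example:

```scala
import redis.clients.jedis.Jedis

val redis = new Jedis("localhost", 6379)

// With a plain key-value cache you would fetch the whole serialized profile,
// mutate it client-side and write it back. With a Redis hash you touch one field.
redis.hset("user:1001", "plan", "premium")
redis.hincrBy("user:1001", "logins", 1)
val plan = redis.hget("user:1001", "plan")

redis.close()
```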

Q6. For very large data sets or analytics workloads, running everything in-memory might not be cost effective. What is your take on this?

Ofer Bengal, Yiftach Shoolman: For very large data sets or analytics workloads, it is advantageous to utilize alternative memory technologies (such as Flash memory, which is a tenth of the cost) as extensions of memory rather than impose a disk access penalty. We have extended enterprise Redis in this manner to take advantage of Flash memory, using a tiered approach (keys and hot values are still in the fastest memory, while cold values are in “slower” Flash memory) to ensure that you still see sub-millisecond latencies with millions of ops/sec throughput.

Q7. Redis was created by Salvatore Sanfilippo in 2009. What is his role today?

Ofer Bengal: Salvatore is leading the development of open source Redis within Redis Labs. He works with a group of experienced developers on extending the capabilities of Redis. A good example of this collaborative work is the recent introduction of Redis Modules, which extend Redis to a variety of new modern use cases. Salvatore wrote the API, and in a very short time the other team members created and tested a few modules, such as RediSearch (a full-text search engine) and Redis-ML (enhancing the performance of Spark machine learning capabilities). Salvatore's role is to continue the community innovation around the Redis core, together with his team of Redis Labs developers.

Q8. What are the differences between Redis Labs' version of Redis and the original one developed in 2009?

Yiftach Shoolman: Redis Labs fully supports the open source Redis versions, but enhances them with a container-like layer that adds a proxy, cluster management and a shared-nothing architecture. Taken together, Redis Labs provides a solid enterprise foundation to Redis, allowing it to scale seamlessly in memory across many hundreds of servers, with high availability through persistence, in-memory cross-rack/zone/region/datacenter replication and instant automatic failover. No retooling or re-architecting is required to move from open source Redis to enterprise Redis; the process is basically effortless and immediate. Redis Labs also offers various database modules, like RediSearch, multiple probabilistic modules like Bloom Filter, TopK and CMS, Redis-ML for machine learning, Redis-TS for time series processing, and JSON and Graph support.

Q9. What are the possible scenarios of using Redis for data analytics?

Ofer Bengal, Yiftach Shoolman: Redis data structures come with built-in simple analytic operations like counting, ranking, scoring, ranges and more. Over time, probabilistic data structures have added the ability to analytically estimate millions and trillions of events, without requiring memory to store all of the events.
Set operations have made it possible to simplify comparisons, intersections and unions of sets – analytics that are usually complicated with data stores. RQL (Redis SQL) and secondary indexing allow executing complex SQL queries on an existing Redis database. And finally, recent modules like RediSearch, Neural Redis and Redis-ML have added advanced search and machine learning capabilities not naturally occurring in other databases.
With all of these possibilities, and with the move to automated decision making, we see increasing usage of Redis for data analytics scenarios.
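
A hedged Scala/Jedis sketch of two of those built-in analytics, ranking via a sorted set and approximate distinct counting via HyperLogLog; key names and values are made up for the example:

```scala
import redis.clients.jedis.Jedis

val redis = new Jedis("localhost", 6379)

// Sorted set: bump a score and read the current top 10, one round trip each.
redis.zincrby("leaderboard:daily", 25.0, "player:42")
val top10 = redis.zrevrangeWithScores("leaderboard:daily", 0, 9)

// HyperLogLog: estimate distinct visitors in a few KB, however many events arrive.
redis.pfadd("uniques:2017-02-13", "visitor:9f3a", "visitor:77c1")
val approxUniques = redis.pfcount("uniques:2017-02-13")

redis.close()
```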

Q10. How safe is a Redis server?

Yiftach Shoolman: The Redis enterprise server comes with client-based SSL authentication, built-in cloud firewall support (when running on public clouds), password authentication and role-based authorization that enables customizing security levels.

Qx. Anything else you wish to add?

Ofer Bengal: Redis is a game-changer when it comes to databases, and its progression over the last seven years has demonstrated that the industry and market are demanding performance and increasing flexibility to deal with all types of data processing, storage and analytic scenarios. Redis' core values have always included high performance, high throughput and very low latencies. With the visionary addition of modules, the community has turned it into an all-purpose datastore – suitable for any scenario that needs a database.

____________________________________

Ofer Bengal, Co-Founder and CEO of Redis Labs
Ofer is a serial entrepreneur who has founded and led several companies in the areas of data communications, telecommunications, Internet, homeland security and medical devices. Ofer was founder & CEO of RIT Technologies (NASDAQ: RITT), a provider of sophisticated telecommunications and data communications systems to major world carriers. He began his career as an aerospace engineer in the Israeli Air Force and then built his own aerospace engineering consulting firm. As a hobby, he has also invented, developed and licensed toy concepts to companies such as Milton Bradley, Hasbro and Tomy. Ofer holds a Bachelor of Science (cum laude) in aerospace engineering from the Technion, Israel Institute of Technology.

Yiftach Shoolman, Co-Founder and CTO of Redis Labs
Yiftach is an experienced technologist, having held leadership engineering and product roles in diverse fields from application acceleration, cloud computing and software-as-a-service (SaaS), to broadband networks and metro networks. He was the founder, president and CTO of Crescendo Networks (acquired by F5, NASDAQ:FFIV), the vice president of software development at Native Networks (acquired by Alcatel, NASDAQ: ALU) and part of the founding team at ECI Telecom broadband division, where he served as vice president of software engineering. Yiftach holds a Bachelor of Science in Mathematics and Computer Science and has completed studies for Master of Science in Computer Science at Tel-Aviv University.

Resources
Redis Cloud Now Available with Integrated Billing through AWS Marketplace- News Release- January 10, 2017.

AWS SaaS Marketplace.

Redis Documentation

eBook – Redis in Action. This book covers the use of Redis, an in-memory database/data structure server.

Related Posts

New Gartner Magic Quadrant for Operational Database Management Systems. Interview with Nick Heudecker, ODBMS Industry Watch, November 30, 2016

Follow us on Twitter: @odbmsorg

##

High-performance Compliance Capture and Analytics Solution for Financial Institutions. Interview with Michael Hay and Oskar Mencer.
Published 26 January 2017
http://www.odbms.org/blog/2017/01/high-performance-compliance-capture-and-analytics-solution-for-financial-institutions-interview-with-michael-hay-and-oskar-mencer/

“New regulations such as MIFID II indeed aim at increasing transparency, which in turn requires more precise reporting. These reports require a lot of data to be stored and data capture to be ultra accurate.”– Michael Hay and Oskar Mencer.

Hitachi Data Systems and Maxeler Technologies announced a cooperation around a High-performance Compliance Capture and Analytics Solution for Financial Institutions. I have interviewed Michael Hay, VP & Chief Engineer at Hitachi Data Systems, and Oskar Mencer, CEO and CTO of Maxeler Technologies Inc.

RVZ

Q1. What is Multi-scale Dataflow Computing?

O. Mencer: Generally, Multiscale Dataflow Computing is a computing paradigm aimed at optimizing the operational efficiency of computing by processing data as it moves through a system. We use Dataflow to minimize the sum of all distances that the data has to travel, and we overlay Dataflow with a Multiscale approach of vertically optimizing the algorithm, the architecture and the arithmetic.

Q2. There is an emerging EU Financial Services directive called MIFID II. This EU directive, and its associated regulation, was designed to help regulators better handle High Frequency Trading (HFT) and so-called Dark Pools; in other words, to increase transparency in the markets. What are the technological demands posed by this new financial legislation and the associated compliance regulations?

M. Hay, O. Mencer: New regulations such as MIFID II indeed aim at increasing transparency, which in turn requires more precise reporting. These reports require a lot of data to be stored and data capture to be ultra accurate. It is an ideal environment for Hitachi data solutions to be combined with Maxeler’s low latency capability.

Q3. To address these challenges, Maxeler Technologies Inc. announced a collaboration with Hitachi Data Systems to offer a high-performance compliance capture and analytics solution. Can you please explain what this solution is about?

M. Hay, O. Mencer: We are combining programmable low latency compute with high capacity “Dataflow-like storage” and modern analytics software. This allows us to attack even the toughest customer challenges and provide competitive advantage within modest development time.

Q4. How can this solution help financial institutions achieve high-frequency, transaction-related record keeping mandated in European Union MiFID II and US Dodd-Frank regulations?

M. Hay: Hitachi's Data Lake solutions can help to unify the wide range of regulatory data challenges faced by today's financial institutions. With high-end filtering and analytics capability added to the system, we can address regulation as well as integration and security issues, all within a single system.

Q5. In this cooperation, you have accomplished an operational prototype through the use of Maxeler’s DFE (Data Flow Engine) network cards, Dataflow based capture/decode capability executing on Dataflow hardware, a hardware accelerated NFS client, Hitachi’s CB500, Pentaho, and Hitachi Unified Storage (HUS). Can you explain how this architecture works?

M. Hay, O. Mencer: Our architecture accomplishes tight integration between realtime on-the-wire compute and storage. The realtime computing ability and reliability of the storage ensure that no data is lost and reports can be generated on time and on budget.

Q6. With your Multiscale Dataflow technology data is streamed from memory onto a chip where the data moves directly from one functional unit to another, without being written to off-chip memory until the entire process is complete. What is the advantage of this solution with respect to a classical ETL process?

O. Mencer: In a classical ETL process the database is in the critical loop. With the Multiscale Dataflow approach we remove the database from the critical loop and utilize an in-memory copy of the data for ultrafast access and in-memory analytics.

Q7. The overall system from packet capture to NFS write does not use a single server side CPU cycle. What does it mean in practice?

O. Mencer: We use a special substrate to create a dataflow computer by connecting vast numbers of arithmetic units, and we implement networking state machines right down at the hardware level. This means that the packet flow through the system is in a tight hardware loop and only metadata travels through conventional CPUs. Additionally, on the storage side, Hitachi's Unified Storage also uses Dataflow-like structures to implement a full set of network file serving, a filesystem and smart object caching for file system object I/O. In this way usage of general CPU cycles is further minimized.
The impact to customers is decreased space needed for the solution, coupled with significant performance improvements.

Q8. You claim that dataflow computing can accelerate and run different applications orders of magnitude faster than conventional CPUs. Do you have any benchmarking results to share?

O. Mencer: Benchmarks are not applications and there is no claim that we can accelerate tiny benchmarks.
Our technology enables complete applications with a purpose in the real world to run orders of magnitude faster. For example, in 2011 a Tier 1 investment bank won the American Finance Technology Award for their installation of a machine from Maxeler, which reduced the time to calculate risk from 8 hours down to 2 minutes.

Q9. The Maxeler-Hitachi Data Systems solution leverages the new Amazon AWS F1 instance. Why? Can you please elaborate on this?

M. Hay, O. Mencer: Our joint hardware solution complements the F1 instance for on-premise activities in a hybrid cloud setting. It helps that the latest Maxeler generation (MAX5) is fully compatible with F1 and it is therefore easy to build a hybrid cloud solution with a single code base. If the reader would like to learn more we’re open and able to entertain discussions about finding relevant problems to engage on.

——————————————-

MICHAEL HAY | マイケル ヘイ
VP & CHIEF ENGINEER – HITACHI DATA SYSTEMS. GENERAL MGR, DIGITAL SOLUTIONS BUSINESS DEVELOPMENT – HITACHI, SPBD
As Vice President and Chief Engineer at Hitachi Data Systems and a General Manager of the Service Business Platform Division in Japan, Michael leads a global team that contemplates and enacts the future of Hitachi's expanding ICT and Social Innovation portfolios. Michael engages a variety of R&D teams, using a clear understanding of market requirements, to guide direction and inspire innovation. Michael joined HDS in 2001 after serving as CEO and owner of a consultancy company focused on complex Enterprise and Systems management design and deployments. His professional background spans over 20 years and includes stints at IBM, IBM partners, and other IT start-up companies. These roles have helped Michael develop a capacity to define solutions for tomorrow's problems. Michael holds a Master's in Industrial Engineering with a focus in Human Factors from San Jose State and a Bachelor's degree in Electrical Engineering from the University of New Mexico, in Albuquerque, NM.

Oskar Mencer. Prior to founding Maxeler, Oskar was Member of Technical Staff at the Computing Sciences Center at Bell Labs in Murray Hill, leading the effort in “Stream Computing”. He joined Bell Labs after receiving a PhD from Stanford University. Besides driving Maximum Performance Computing (MPC) at Maxeler, Oskar was Consulting Professor in Geophysics at Stanford University and he is also affiliated with the Computing Department at Imperial College London, having received two Best Paper Awards, an Imperial College Research Excellence Award in 2007 and a Special Award from Com.sult in 2012 for “revolutionising the world of computers”.

————————–
Resources

Video: Maxeler Dataflow Engine attached to a Hitachi Data Systems HNAS. Dr. Itay Greenspon, Maxeler.

Maxeler Technologies Inc Collaborates with Hitachi Data Systems Around High-performance Compliance Capture and Analytics Solution for Financial Institutions. 23 Dec 2016

Maxeler is an official AWS F1 Instance Partner. 05 Dec 2016

Video: What is OpenSPL? Professor Michael J Flynn, Stanford University
OpenSPL is an open standard for a novel Spatial Programming Language. It is based on the core concept that a program executes in space, rather than in time sequence. All operations are assumed to be parallel unless specified to be sequential. This is similar to a factory floor where all operations execute in parallel, but each operation executes a different part of the overall process. Temporal Programming is a recipe for the execution of actions, whereas Spatial Programming builds a factory to execute the recipe.

Multiscale Dataflow Computing AppGallery

HPC Matters to our Quality of Life and Prosperity. by Don Johnston, Lawrence Livermore National Laboratory

Related Posts

Hitachi Data Systems Works with Maxeler Technologies. Posted by Michael Hay, Jan 3, 2017

The Many Core Phenomena. Blog Post created by Michael Hay on Dec 13, 2016

Follow us on Twitter: @odbmsorg

##

Big Data and The Great A.I. Awakening. Interview with Steve Lohr
Published 19 December 2016
http://www.odbms.org/blog/2016/12/big-data-and-the-great-a-i-awakening-interview-with-steve-lohr/

“I think we’re just beginning to grapple with implications of data as an economic asset” –Steve Lohr.

My last interview for this year is with Steve Lohr. Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting. We discussed Big Data and how it influences the new Artificial Intelligence awakening.

Wishing you all the best for the Holiday Season and a healthy and prosperous New Year!

RVZ

Q1. Why do you think Google (TensorFlow) and Microsoft (Computational Network Toolkit) are open-sourcing their AI software?

Steve Lohr: Both Google and Microsoft are contributing their tools to expand and enlarge the AI community, which is good for the world and good for their businesses. But I also think the move is a recognition that algorithms are not where their long-term advantage lies. Data is.

Q2. What are the implications of that for both business and policy?

Steve Lohr: The companies with big data pools can have great economic power. Today, that shortlist would include Google, Microsoft, Facebook, Amazon, Apple and Baidu.
I think we’re just beginning to grapple with implications of data as an economic asset. For example, you’re seeing that now with Microsoft’s plan to buy LinkedIn, with its personal profiles and professional connections for more than 400 million people. In the evolving data economy, is that an antitrust issue of concern?

Q3. In this competing world of AI, what is more important, vast data pools, sophisticated algorithms or deep pockets?

Steve Lohr: The best answer to that question, I think, came from a recent conversation with Andrew Ng, a Stanford professor who worked at GoogleX, is co-founder of Coursera and is now chief scientist at Baidu. I asked him why Baidu, and he replied there were only a few places to go to be a leader in A.I. Superior software algorithms, he explained, may give you an advantage for months, but probably no more. Instead, Ng said, you look for companies with two things — lots of capital and lots of data. “No one can replicate your data,” he said. “It’s the defensible barrier, not algorithms.”

Q4. What is the interplay and implications of big data and artificial intelligence?

Steve Lohr: The data revolution has made the recent AI advances possible. We’ve seen big improvements in the last few years, for example, in AI tasks like speech recognition and image recognition, using neural network and deep learning techniques. Those technologies have been around for decades, but they are getting a huge boost from the abundance of training data because of all the web image and voice data that can be tapped now.

Q5. Is data science really only a here-and-now version of AI?

Steve Lohr: No, certainly not only. But I do find that phrase a useful way to explain to most of my readers — intelligent people, but not computer scientists — the interplay between data science and AI. To convey that rudiments of data-driven AI are already all around us. It’s not — surely not yet — robot armies and self-driving cars as fixtures of everyday life. But it is internet search, product recommendations, targeted advertising and elements of personalized medicine, to cite a few examples.

Q6. Technology is moving beyond increasing the odds of making a sale, to being used in higher-stakes decisions like medical diagnosis, loan approvals, hiring and crime prevention. What are the societal implications of this?

Steve Lohr: The new, higher-stakes decisions that data science and AI tools are increasingly being used to make — or assist in making — are fundamentally different than marketing and advertising. In marketing and advertising, a decision that is better on average is plenty good enough. You’ve increased sales and made more money. You don’t really have to know why.
But the other decisions you mentioned are practically and ethically very different. These are crucial decisions about individual people’s lives. Better on average isn’t good enough. For these kinds of decisions, issues of accuracy, fairness and discrimination come into play.
That, I think, argues for two things. First, some sort of auditing tool; the technology has to be able to explain itself, to explain how a data-driven algorithm came to the decision or recommendation that it did.
Second, I think it argues for having a “human in the loop” for most of these kinds of decisions for the foreseeable future.

Q7. Will data analytics move into the mainstream of the economy (far beyond the well known, born-on-the-internet success stories like Google, Facebook and Amazon)?

Steve Lohr: Yes, and I think we’re seeing that now in nearly every field — health care, agriculture, transportation, energy and others. That said, it is still very early. It is a phenomenon that will play out for years, and decades.
Recently, I talked to Jeffrey Immelt, the chief executive of General Electric, America’s largest industrial company. GE is investing heavily to put data-generating sensors on its jet engines, power turbines, medical equipment and other machines — and to hire software engineers and data scientists.
Immelt said if you go back more than a century to the origins of the company, dating back to Thomas Edison‘s days, GE’s technical foundation has been materials science and physics. Data analytics, he said, will be the third fundamental technology for GE in the future.
I think that’s a pretty telling sign of where things are headed.

—————————–
Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years and writes for the Times’ Bits blog. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting.
He was a foreign correspondent for a decade and served as an editor, and has written for national publications such as the New York Times Magazine, the Atlantic, and the Washington Monthly. He is the author of Go To: The Story of the Math Majors, Bridge Players, Engineers, Chess Wizards, Maverick Scientists, Iconoclasts—the Programmers Who Created the Software Revolution and Data-ism The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else.
He lives in New York City.

————————–

Resources

Google (TensorFlow): TensorFlow™ is an open source software library for numerical computation using data flow graphs.

Microsoft (Computational Network Toolkit): A free, easy-to-use, open-source, commercial-grade toolkit that trains deep learning algorithms to learn like the human brain.

Data-ism The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else. by Steve Lohr. 2016 HarperCollins Publishers

Related Posts

Don’t Fear the Robots. By Steve Lohr. The New York Times, SundayReview | News Analysis, October 24, 2015

G.E., the 124-Year-Old Software Start-Up. By Steve Lohr. The New York Times, Technology, August 27, 2016

Machines of Loving Grace. Interview with John Markoff. ODBMS Industry Watch, Published on 2016-08-11

Recruit Institute of Technology. Interview with Alon Halevy. ODBMS Industry Watch, Published on 2016-04-02

Civility in the Age of Artificial Intelligence, by STEVE LOHR, technology reporter for The New York Times, ODBMS.org

On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.

Follow us on Twitter:@odbmsorg

##

On the Internet of Things. Interview with Colin Mahony
Published 14 March 2016
http://www.odbms.org/blog/2016/03/on-the-internet-of-things-interview-with-colin-mahony/

“Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data.”– Colin Mahony

I have interviewed Colin Mahony, SVP & General Manager, HPE Big Data Platform. Topics of the interview are: the challenges of the Internet of Things, the opportunities for data analytics, the positioning of HPE Vertica, and the HPE cloud strategy.

RVZ

Q1. Gartner says 6.4 billion connected “things” will be in use in 2016, up 30 percent from 2015.  How do you see the global Internet of Things (IoT) market developing in the next years?

Colin Mahony: As manufacturers connect more of their “things,” they have an increased need for analytics to derive insight from massive volumes of sensor or machine data. I see these manufacturers, particularly manufacturers of commodity equipment, needing to offer more value-added services based on their ability to deliver higher levels of service and overall customer satisfaction. Data analytics platforms are key to making that happen. Also, we could see entirely new analytical applications emerge, driven by what consumers want to know about their devices, combining that data with, say, their exercise regimens, health vitals, social activities, and even driving behavior, for full personal insight.
Ultimately, the Internet of Things will drive a need for the Analyzer of Things, and that is our mission.

Q2. What Challenges and Opportunities bring the Internet of Things (IoT)? 

Colin Mahony: Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data. The reason? Traditional data warehouses were designed well before the Internet of Things or, at least, before OT (operational technology) like medical devices, industrial equipment, cars, and more were connected to the Internet. So, having an analytical platform to provide the scale and performance required to handle these volumes is important, but customers are taking more of a two- or three-tier approach that involves some sort of analytical processing at the edge before data is sent to an analytical data store. Apache Kafka is also becoming an important tier in this architecture, serving as a message bus to collect that data at the edge and push it in streams to the appropriate database, CRM system, or analytical platform for, as an example, correlation of fault data over months or even years to predict and prevent part failure and optimize inventory levels.
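
A minimal Scala sketch of that edge-to-bus tier using the standard Kafka producer API; the broker address, topic name and payload are placeholders, and the downstream consumer (Vertica, Spark, a CRM system) is out of scope here:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Configure a producer that an edge gateway uses to push sensor readings.
val props = new Properties()
props.put("bootstrap.servers", "edge-gateway:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// Key by device id so all readings from one device land in the same partition.
producer.send(new ProducerRecord("sensor-readings", "device-17", """{"temp_c":41.2,"ts":1457942400}"""))
producer.close()
```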

Q3. Big Data: In your opinion, what are the current main demands/needs in the market?

Colin Mahony: All organizations want – and need – to become data-driven organizations. I mean, who wants to make such critical decisions based on half answers and anecdotal data? That said, traditional companies with data stores and systems going back 30-40 years don't have the same level playing field as the next market disruptor that just received its series B funding and only knows that analytics is the lifeblood of its business and all its critical decisions.
The good news is that whether you are a 100-year old insurance company or the next Uber or Facebook, you can become a data-driven organization by taking an open platform approach that uses the best tool for the job and can incorporate emerging technologies like Kafka and Spark without having to bolt on or buy all of that technology from a single vendor and get locked in.  Understanding the difference between an open platform with a rich ecosystem and open source software as one very important part of that ecosystem has been a differentiator for our customers.

Beyond technology, we have customers that establish analytical centers of excellence that actually work with the data consumers – often business analysts – that run ad-hoc queries using their preferred data visualization tool to get the insight need for their business unit or department. If the data analysts struggle, then this center of excellence, which happens to report up through IT, collaborates with them to understand and help them get to the analytical insight – rather than simply halting the queries with no guidance on how to improve.

Q4. How do you embed analytics and why is it useful? 

Colin Mahony: OEM software vendors, particularly, see the value of embedding analytics in their commercial software products or software as a service (SaaS) offerings.  They profit by creating analytic data management features or entirely new applications that put customers on a faster path to better, data-driven decision making. Offering such analytics capabilities enables them to not only keep a larger share of their customer’s budget, but at the same time greatly improve customer satisfaction. To offer such capabilities, many embedded software providers are attempting unorthodox fixes with row-oriented OLTP databases, document stores, and Hadoop variations that were never designed for heavy analytic workloads at the volume, velocity, and variety of today’s enterprise. Alternatively, some companies are attempting to build their own big data management systems. But such custom database solutions can take thousands of hours of research and development, require specialized support and training, and may not be as adaptable to continuous enhancement as a pure-play analytics platform. Both approaches are costly and often outside the core competency of businesses that are looking to bring solutions to market quickly.

Because it’s specifically designed for analytic workloads, HPE Vertica is quite different from other commercial alternatives. Vertica differs from OLTP DBMS and proprietary appliances (which typically embed row-store DBMSs) by grouping data together on disk by column rather than by row (that is, so that the next piece of data read off disk is the next attribute in a column, not the next attribute in a row). This enables Vertica to read only the columns referenced by the query, instead of scanning the whole table as row-oriented databases must do. This speeds up query processing dramatically by reducing disk I/O.

You’ll find Vertica as the core analytical engine behind some popular products, including Lancope, Empirix, Good Data, and others as well as many HPE offerings like HPE Operations Analytics, HPE Application Defender, and HPE App Pulse Mobile, and more.

Q5. How do you make a decision when it is more appropriate to “consume and deploy” Big Data on premise, in the cloud, on demand and on Hadoop?

Colin Mahony: The best part is that with HPE you don’t need to choose. Unlike most emerging data-warehouse-as-a-service offerings, where your data is trapped in the provider’s database when your priorities or IT policies change, HPE offers the most complete range of deployment and consumption models. If you want to spin up your analytical initiative on the cloud for a proof-of-concept or during the holiday shopping season for e-retailers, you can do that easily with HPE Vertica OnDemand.
If your organization finds that, due to security, confidentiality or privacy concerns, you need to bring your analytical initiative back in house, then you can use HPE Vertica Enterprise on-premises without losing any customizations or disrupting your business. Have petabyte volumes of largely unstructured data where the value is unknown? Use HPE Vertica for SQL on Hadoop, deployed natively on your Hadoop cluster, regardless of the distribution you have chosen. Each consumption model – in the cloud, on-premises, on demand, or using reference architectures for HPE servers – is available to you with that same trusted underlying core.

Q6. What are the new class of infrastructures called “composable”? Are they relevant for Big Data?

Colin Mahony: HPE believes that a new architecture is needed for Big Data – one that is designed to power innovation and value creation for the new breed of applications while running traditional workloads more efficiently.
We call this new architectural approach Composable Infrastructure. HPE has a well-established track record of infrastructure innovation and success. HPE Converged Infrastructure, software-defined management, and hyper-converged systems have consistently proven to reduce costs and increase operational efficiency by eliminating silos and freeing available compute, storage, and networking resources. Building on our converged infrastructure knowledge and experience, we have designed a new architecture that can meet the growing demands for a faster, more open, and continuous infrastructure.

Q7. What is HPE Cloud Strategy? 

Colin Mahony: Hybrid cloud adoption is continuing to grow at a rapid rate and a majority of our customers recognize that they simply can’t achieve the full measure of their business goals by consuming only one kind of cloud.
HPE Helion not only offers private cloud deployments and managed private cloud services, but we have created the HPE Helion Network, a global ecosystem of service providers, ISVs, and VARs dedicated to delivering open standards-based hybrid cloud services to enterprise customers. Through our ecosystem, our customers gain access to an expanded set of cloud services and improve their abilities to meet country-specific data regulations.

In addition to the private cloud offerings, we have a strategic and close alliance with Microsoft Azure, which enables many of our offerings, including Haven OnDemand, in the public cloud. We also work closely with Amazon because our strategy is not to limit our customers, but to ensure that they have the choices they need and the services and support they can depend upon.

Q8. What are the advantages of an offering like Vertica in this space?

Colin Mahony: More and more companies are exploring the possibility of moving their data analytics operations to the cloud. We offer HPE Vertica OnDemand, our data warehouse as a service, for organizations that need high-performance, enterprise-class data analytics for all of their data to make better business decisions now. Built by design to drastically improve query performance over traditional relational database systems, HPE Vertica OnDemand is engineered from the same technology that powers the HPE Vertica Analytics Platform. For organizations that want to select Amazon hardware and still maintain control over the installation, configuration, and overall maintenance of Vertica for ultimate performance and control, we offer Vertica AMI (Amazon Machine Image). The Vertica AMI is a bring-your-own-license model that is ideal for organizations that want the same experience as on-premises installations, only without procuring and setting up hardware. Regardless of which deployment model you choose, we have you covered for “on demand” or “enterprise cloud” options.

Q9. What is HPE Vertica Community Edition?

Colin Mahony: We have had tens of thousands of downloads of the HPE Vertica Community Edition, a freemium edition of HPE Vertica with all of the core features and functionality that you experience with our core enterprise offering. It’s completely free for up to 1 TB of data storage across three nodes. Companies of all sizes use the Community Edition to download, install, set up, and configure Vertica very quickly on x86 hardware, or use our Amazon Machine Image (AMI) for a bring-your-own-license approach to the cloud.

Q10. Can you tell us how Kiva.org, a non-profit organization, uses on-demand cloud analytics to leverage the internet and a worldwide network of microfinance institutions to help fight poverty? 

Colin Mahony: HPE is a major supporter of Kiva.org, a non-profit organization with a mission to connect people through lending to alleviate poverty. Kiva.org uses the internet and a worldwide network of microfinance institutions to enable individuals to lend as little as $25 to help create opportunity around the world. When the chance arose to help support Kiva.org with an analytical platform to further the cause, we jumped at the opportunity. Kiva.org relies on Vertica OnDemand to reduce capital costs, leverage the SaaS delivery model to adapt more quickly to changing business requirements, and work with over a million lenders, hundreds of field partners and volunteers, across the world. To see a recorded Webinar with HPE and Kiva.org, see here.

Qx Anything else you wish to add?

Colin Mahony: We appreciate the opportunity to share the features and benefits of HPE Vertica as well as the bright market outlook for data-driven organizations. However, I always recommend that any organization struggling with how to get started with its analytics initiative speak and meet with peers to learn best practices and avoid potential pitfalls. The best way to do that, in my opinion, is to visit with the more than 1,000 Big Data experts in Boston from August 29 – September 1st at the HPE Big Data Conference. Click here to learn more and join us for 40+ technical deep-dive sessions.

————-

Colin Mahony, SVP & General Manager, HPE Big Data Platform

Colin Mahony leads the Hewlett Packard Enterprise Big Data Platform business group, which is responsible for the industry leading Vertica Advanced Analytics portfolio, the IDOL Enterprise software that provides context and analysis of unstructured data, and Haven OnDemand, a platform for developers to leverage APIs and on demand services for their applications.
In 2011, Colin joined Hewlett Packard as part of the highly successful acquisition of Vertica, and took on the responsibility of VP and General Manager for HP Vertica, where he guided the business to remarkable annual growth and recognized industry leadership. Colin brings a unique combination of technical knowledge, market intelligence, customer relationships, and strategic partnerships to one of the fastest growing and most exciting segments of HP Software.

Prior to Vertica, Colin was a Vice President at Bessemer Venture Partners focused on investments primarily in enterprise software, telecommunications, and digital media. He established a great network and reputation for assisting in the creation and ongoing operations of companies through his knowledge of technology, markets and general management in both small startups and larger companies. Prior to Bessemer, Colin worked at Lazard Technology Partners in a similar investor capacity.

Prior to his venture capital experience, Colin was a Senior Analyst at the Yankee Group serving as an industry analyst and consultant covering databases, BI, middleware, application servers and ERP systems. Colin helped build the ERP and Internet Computing Strategies practice at Yankee in the late nineties.

Colin earned an M.B.A. from Harvard Business School and a bachelor’s degree in Economics with a minor in Computer Science from Georgetown University. He is an active volunteer with Big Brothers Big Sisters of Massachusetts Bay and the Joey Fund for Cystic Fibrosis.

Resources

What’s in store for Big Data analytics in 2016, Steve Sarsfield, Hewlett Packard Enterprise. ODBMS.org, 3 FEB, 2016

What’s New in Vertica 7.2?: Apache Kafka Integration!, HPE, last edited February 2, 2016

Gartner Says 6.4 Billion Connected “Things” Will Be in Use in 2016, Up 30 Percent From 2015, Press release, November 10, 2015

The Benefits of HP Vertica for SQL on Hadoop, HPE, July 13, 2015

Uplevel Big Data Analytics with Graph in Vertica – Part 5: Putting graph to work for your business , Walter Maguire, Chief Field Technologist, HP Big Data Group, ODBMS.org, 2 Nov, 2015

HP Distributed R, ODBMS.org, 19 FEB, 2015.

Understanding ROS and WOS: A Hybrid Data Storage Model, HPE, October 7, 2015

Related Posts

On Big Data Analytics. Interview with Shilpa Lawande. Source: ODBMS Industry Watch, Published on December 10, 2015

On HP Distributed R. Interview with Walter Maguire and Indrajit Roy. Source: ODBMS Industry Watch, Published on April 9, 2015

Follow us on Twitter: @odbmsorg

##

A Grand Tour of Big Data. Interview with Alan Morrison http://www.odbms.org/blog/2016/02/a-grand-tour-of-big-data-interview-with-alan-morrison/ Thu, 25 Feb 2016 15:52:44 +0000

“Leading enterprises have a firm grasp of the technology edge that’s relevant to them. Better data analysis and disambiguation through semantics is central to how they gain competitive advantage today.”–Alan Morrison.

I have interviewed Alan Morrison, senior research fellow at PwC, Center for Technology and Innovation.
Main topic of the interview is how the Big Data market is evolving.

RVZ

Q1. How do you see the Big Data market evolving? 

Alan Morrison: We should note first of all how true Big Data and analytics methods emerged and what has been disruptive. Over the course of a decade, web companies have donated IP and millions of lines of code that serve as the foundation for what’s being built on top. In the process, they’ve built an open source culture that is currently driving most big data-related innovation. As you mentioned to me last year, Roberto, a lot of database innovation was the result of people outside the world of databases changing what they thought needed to be fixed, people who really weren’t versed in the database technologies to begin with.

Enterprises and the database and analytics systems vendors who serve them have to constantly adjust to the innovation that’s being pushed into the open source big data analytics pipeline. Open source machine learning is becoming the icing on top of that layer cake.

Q2. In your opinion what are the challenges of using Big Data technologies in the enterprise?

Alan Morrison: Traditional enterprise developers were thrown for a loop back in the late 2000s when it came to open source software, and they’re still adjusting. The severity of the problem differs depending on the age of the enterprise. In our 2012 issue of the Forecast on DevOps, we made clear distinctions between three age classes of companies: legacy mainstream enterprises, pre-cloud enterprises and cloud natives. Legacy enterprises could have systems that are 50 years old or more still in place and have simply added to those. Pre-cloud enterprises are fighting with legacy that’s up to 20 years old. Cloud natives don’t have to fight legacy and can start from scratch with current tech.

DevOps (dev + ops) is an evolution of agile development that focuses on closer collaboration between developers and operations personnel. It’s a successor to agile development, a methodology that enables multiple daily updates to operational codebases and feedback-response loop tuning by making small code changes and seeing how those changes affect user experience and behaviour. The linked article makes a distinction between legacy, pre-cloud and cloud native enterprises in terms of their inherent level of agility:

Fig1
Most enterprises are in the legacy mainstream group, and the technology adoption challenges they face are the same regardless of the technology. Building feedback-response loops for a data-driven enterprise is more complicated in these older, legacy environments. But you can create guerilla teams to kickstart the innovation process.

Q3. Is the Hadoop ecosystem now ready for enterprise deployment at large scale? 

Alan Morrison: Hadoop is ten years old at this point, and Yahoo, a very large mature enterprise, has been running Hadoop on 10,000 nodes for years now. Back in 2010, we profiled a legacy mainstream media company who was doing logfile analysis from all of its numerous web properties on a Hadoop cluster quite effectively. Hadoop is to the point where people in their dens and garages are putting it on Raspberry Pi systems. Lots of companies are storing data in or staging it from HDFS. HDFS is a given. MapReduce, on the other hand, has given way to Spark.

HDFS preserves files in their original format immutably, and that’s important. That innovation was crucial to data-driven application development a decade ago. But Hadoop isn’t the end state for distributed storage, and NoSQL databases aren’t either. It’s best to keep in mind that alternatives to Hadoop and its ecosystem are emerging.

I find it fascinating what folks like LinkedIn and Metamarkets are doing, data-architecture-wise, with the Kappa architecture – essentially a stream processing architecture that also works for batch analytics, a system where operational and analytical data are one and the same. That’s appropriate for fully online, all-digital businesses. You can use HDFS, S3, GlusterFS or some other file system along with a database such as Druid. On the transactional side of things, the nascent IPFS (the InterPlanetary File System) anticipates both peer-to-peer and the use of blockchains in environments that are more and more distributed. Here’s a diagram we published last year that describes this evolution to date:
Fig2

From PWC Technology Forecast 2015
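As a rough illustration of the Kappa idea mentioned above, the following Python sketch applies one and the same processing function to a replay of an append-only log (the batch view) and to newly arriving events (the streaming view); the event fields and the in-memory log are invented stand-ins for a Kafka topic:

```python
from collections import defaultdict

# Stand-in for a Kafka topic or HDFS-backed append-only log (illustrative data).
event_log = [
    {"user": "a", "clicks": 1},
    {"user": "b", "clicks": 3},
    {"user": "a", "clicks": 2},
]

def process(events, state=None):
    """One processing function, used unchanged for batch replay and live events."""
    state = state if state is not None else defaultdict(int)
    for event in events:
        state[event["user"]] += event["clicks"]
    return state

# "Batch" view: replay the whole log from offset zero.
view = process(event_log)

# "Streaming" view: keep feeding newly arriving events into the same function.
view = process([{"user": "b", "clicks": 1}], state=view)

print(dict(view))  # {'a': 3, 'b': 4}
```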

People shouldn’t be focused on Hadoop, but what Hadoop has cleared a path for that comes next.

Q4. What are in your opinion the most innovative Big Data technologies?

Alan Morrison: The rise of immutable data stores (HDFS, Datomic, Couchbase and other comparable databases, as well as blockchains) was significant because it was an acknowledgement that data history and permanence matter, and that the technology is mature enough and the cost low enough to eliminate the need to overwrite. These data stores also established that eliminating overwrites removes a cause of contention. We’re moving toward native cloud and eventually the P2P fog (localized, more truly distributed computing) that will extend the footprint of the cloud for the Internet of Things.
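A minimal Python sketch of the append-only idea: every change is written as a new immutable fact rather than an overwrite, so history is preserved and there is no contention over in-place updates (the key/value shape here is purely illustrative):

```python
import time

# Append-only "store": every change is a new immutable fact; nothing is overwritten.
log = []

def put(key, value):
    log.append({"key": key, "value": value, "ts": time.time()})  # never mutate old entries

def get(key):
    # The current value is simply the latest fact for the key; history stays queryable.
    for entry in reversed(log):
        if entry["key"] == key:
            return entry["value"]
    return None

def history(key):
    return [e for e in log if e["key"] == key]

put("customer:42:email", "old@example.com")
put("customer:42:email", "new@example.com")   # an "update" is just another append
print(get("customer:42:email"))               # new@example.com
print(len(history("customer:42:email")))      # 2 -- the old value is still there
```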

Unsupervised machine learning has made significant strides in the past year or two, and it has become possible to extract facts from unstructured data, building on the success of entity and relationship extraction. What this advance implies is the ability to put humans in feedback loops with machines, where they let machines discover the data models and facts and then tune or verify those data models and facts.

In other words, large enterprises now have the capability to build their own industry- and organization-specific knowledge graphs and begin to develop cognitive or intelligent apps on top of those knowledge graphs, along the lines of what Cirrus Shakeri of Inventurist envisions.

Fig3

From Cirrus Shakeri, “From Big Data to Intelligent Applications,” post, January 2015

At the core of computable semantic graphs (Shakeri’s term for knowledge graphs or computable knowledge bases) is logically consistent semantic metadata. A machine-assisted process can help with entity and relationship extraction and then also ontology generation.

Computability = machine readability. Semantic metadata–the kind of metadata cognitive computing apps use–can be generated with the help of a well-designed and updated ontology. More and more, these ontologies are uncovered in text rather than hand built, but again, there’s no substitute for humans in the loop. Think of the process of cognitive app development as a continual feedback-response loop process. The use of agents can facilitate the construction of these feedback loops.
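As a toy illustration of a computable knowledge graph, the sketch below (in Python, with invented entities and a two-entry “ontology”) stores facts as subject-predicate-object triples and answers a simple pattern query over them:

```python
# Toy knowledge graph: facts as (subject, predicate, object) triples,
# plus a tiny hand-built "ontology" mapping relations to broader ones.
triples = {
    ("Acme Corp", "acquired", "DataCo"),
    ("DataCo", "develops", "StreamDB"),
    ("StreamDB", "is_a", "database"),
}

ontology = {"acquired": "related_to", "develops": "related_to"}

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern (None acts as a wildcard)."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# "What does Acme Corp relate to, directly or via a broader relation?"
for s, p, o in query(subject="Acme Corp"):
    print(s, ontology.get(p, p), o)
```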

Q5. In a recent note Carl Olofson, Research Vice President, Data Management Software Research, IDC, predicted the RIP of “Big Data” as a concept. What is your view on this?

Alan Morrison: I agree the term is nebulous and can be misleading, and we’ve had our fill of it. But that doesn’t mean it won’t continue to be used. Here’s how we defined it back in 2009:

Big Data is not a precise term; rather, it is a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used. (See https://www.pwc.com/us/en/technology-forecast/assets/pwc-tech-forecast-issue3-2010.pdf, pg. 6.)

For that issue of the Forecast, we focused on how Hadoop was being piloted in enterprises and the ecosystem that was developing around it. Hadoop was the primary disruptive technology, as well as NoSQL databases. It helps to consider the data challenge of the 2000s and how relational databases and enterprise data warehousing techniques were falling short at that point.  Hadoop has reduced the cost of analyzing data by an order of magnitude and allows processing of very large unstructured datasets. NoSQL has made it possible to move away from rigid data models and standard ETL.

“Big Data” can continue to be shorthand for petabytes of unruly, less structured data. But why not talk about the system instead of just the data? I like the term that George Gilbert of Wikibon latched on to last year. I don’t know if he originated it, but he refers to the System of Intelligence. That term gets us beyond the legacy, pre-web “business intelligence” term, more into actionable knowledge outputs that go beyond traditional reporting and into the realm of big data, machine learning and more distributed systems. The Hadoop ecosystem, other distributed file systems, NoSQL databases and the new analytics capabilities that rely on them are really at the heart of a System of Intelligence.

Q6. How many enterprise IT systems do you think we will need to interoperate in the future? 

Alan Morrison: I like Geoffrey Moore‘s observations about a System of Engagement that emerged after the System of Record, and just last year George Gilbert was adding to that taxonomy with a System of Intelligence. But you could add further to that with a System of Collection that we still need to build. Just to be consistent, the System of Collection articulates how the Internet of Things at scale would function on the input side. The System of Engagement would allow distribution of the outputs. For the outputs of the System of Collection to be useful, that system will need to interoperate in various ways with the other systems.

To summarize, there will actually be four enterprise IT systems that will need to interoperate, ultimately. Three of these exist, and one still needs to be created.

The fuller picture will only emerge when this interoperation becomes possible.

Q7. What are the  requirements, heritage and legacy of such systems?

Alan Morrison: The System of Record (RDBMSes) still relies on databases and tech with their roots in the pre-web era. I’m not saying these systems haven’t been substantially evolved and refined, but they do still reflect a centralized, pre-web mentality. Bitcoin and blockchain make it clear that the future of Systems of Record won’t always be centralized. In fact, microtransaction flows in the Internet of Things at scale will depend on decentralized approaches, algorithmic transaction validation, and the immutable audit trail creation which blockchain inspires.
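The blockchain-inspired audit-trail idea can be sketched in a few lines of Python: each entry’s hash covers both its record and the previous entry’s hash, so tampering with history is detectable. This is a teaching sketch only, not a consensus protocol or a real blockchain:

```python
import hashlib
import json

def add_entry(chain, record):
    """Append a record whose hash covers both the record and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps(record, sort_keys=True) + prev_hash
    entry = {"record": record, "prev_hash": prev_hash,
             "hash": hashlib.sha256(payload.encode()).hexdigest()}
    chain.append(entry)
    return chain

def verify(chain):
    """Recompute every hash; any in-place edit of history breaks the chain."""
    prev_hash = "0" * 64
    for entry in chain:
        payload = json.dumps(entry["record"], sort_keys=True) + prev_hash
        if entry["prev_hash"] != prev_hash or \
           entry["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev_hash = entry["hash"]
    return True

chain = []
add_entry(chain, {"from": "meter-17", "to": "utility", "amount": 0.03})
add_entry(chain, {"from": "meter-17", "to": "utility", "amount": 0.05})
print(verify(chain))                  # True
chain[0]["record"]["amount"] = 99     # tamper with history
print(verify(chain))                  # False
```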

The Web is only an interim step in the distributed system evolution. P2P systems will eventually complement the web, but they’ll take a long time to kick in fully – well into the next decade. There’s always the S-curve of adoption that starts flat for years. P2P has ten years of an installed base of cloud tech, twenty years of web tech and fifty-plus years of centralized computing to fight with. The bitcoin blockchain seems to have kicked P2P into gear finally, but progress will be slow through 2020.

The System of Engagement (requiring Web DBs) primarily relies on Web technology (MySQL and NoSQL) in conjunction with traditional CRM and other customer-related structured databases.

The System of Intelligence (requiring Web file systems and less structured DBs) primarily relies on NoSQL, Hadoop, the Hadoop ecosystem and its successors, but is built around a core DW/DM RDBMS analytics environment with ETLed structured data from the System of Record and System of Engagement. The System of Intelligence will have to scale and evolve to accommodate input from the System of Collection.

The System of Collection (requiring distributed file systems and DBs) will rely on distributed file system successors to Hadoop and HTTP such as IPFS and the more distributed successors to MySQL+ NoSQL. Over the very long term, a peer-to-peer architecture will emerge that will become necessary to extend the footprint of the internet of things and allow it to scale.

Q8. Do you already have the piece parts to begin to build out a 2020+ intersystem vision now?

Alan Morrison: Contextual, ubiquitous computing is the vision of the 2020s, but to get to that, we need an intersystem approach. Without interoperation of the four systems I’ve alluded to, enterprises won’t be able to deliver the context required for competitive advantage. Without sufficient entity and relationship disambiguation via machine learning in machine/human feedback loops, enterprises won’t be able to deliver the relevance for competitive advantage.

We do have the piece parts to begin to build out an intersystem vision now. For example, interoperation is a primary stumbling block that can be overcome now. Middleware has been overly complex and inadequate to the current-day task, but middleware platforms such as EnterpriseWeb are emerging that can reach out as an integration fabric for all systems, up and down the stack. Here’s how the integration fabric becomes an essential enabler for the intersystem approach:

Fig4
PwC, 2015

A lot of what EnterpriseWeb (full disclosure: a JBR partner of PwC) does hinges on the creation and use of agents and semantic metadata that enable the data/logic virtualization. That’s what makes the desiloing possible. One of the things about the EnterpriseWeb platform is that it’s a full stack virtual integration and application platform, using methods that have data layer granularity, but process layer impact. Enterprise architects can tune their models and update operational processes at the same time. The result: every change is model-driven and near real-time. Stacks can all be simplified down to uniform, virtualized composable entities using enabling technologies that work at the data layer. Here’s how they work:

Fig5
PwC, 2015

So basically you can do process refinement across these systems, and intersystem analytics views thus also become possible.

Qx anything else you wish to add? 

Alan Morrison: We always quote science fiction writer William Gibson, who said,

“The future is already here — it’s just not very evenly distributed.”

Enterprises would do best to remind themselves what’s possible now and start working with it. You’ve got to grab onto that technology edge and let it pull you forward. If you don’t understand what’s possible, what’s most relevant to your future business success, and how to use it, you’ll never make progress and you’ll always be reacting to crises. Leading enterprises have a firm grasp of the technology edge that’s relevant to them. Better data analysis and disambiguation through semantics is central to how they gain competitive advantage today.

We do a ton of research to get to the big picture and find the real edge, where tech could actually have a major business impact. And we try to think about what the business impact will be, rather than just thinking about the tech. Most folks who are down in the trenches are dismissive of the big picture, but the fact is they aren’t seeing enough of the horizon to make an informed judgement. They are trying to use tools they’re familiar with to address problems the tools weren’t designed for. Alongside them should be some informed contrarians and innovators to provide balance and get to a happy medium.

That’s how you counter groupthink in an enterprise. Executives need to clear a path for innovation and foster a healthy, forward-looking, positive and tolerant mentality. If the workforce is cynical, that’s an indication that they lack a sense of purpose or are facing systemic or organizational problems they can’t overcome on their own.

—————–
Alan Morrison (@AlanMorrison) is a senior research fellow at PwC, a longtime technology trends analyst and an issue editor of the firm’s Technology Forecast.

Resources

Data-driven payments. How financial institutions can win in a networked economy, by Mark Flamme, Partner; Kevin Grieve, Partner; Mike Horvath, Principal, Strategy&. FEBRUARY 4, 2016, ODBMS.org

The rise of immutable data stores, By Alan Morrison, Senior Manager, PwC Center for technology and innovation (CTI), OCTOBER 9, 2015, ODBMS.org

The enterprise data lake: Better integration and deeper analytics, By Brian Stein and Alan Morrison, PwC, AUGUST 20, 2014 ODBMS.org

Related Posts

On the Industrial Internet of Things. Interview with Leon Guzenda , ODBMS Industry Watch, January 28, 2016

On Big Data and Society. Interview with Viktor Mayer-Schönberger , ODBMS Industry Watch, January 8, 2016

On Big Data Analytics. Interview with Shilpa Lawande , ODBMS Industry Watch, December 10, 2015

On Dark Data. Interview with Gideon Goldin , ODBMS Industry Watch, November 16, 2015

Follow us on Twitter: @odbmsorg

##

Challenges and Opportunities for Big Data. Interview with Mike Hoskins http://www.odbms.org/blog/2013/12/challenges-and-opportunities-for-big-data-interview-with-mike-hoskins/ Tue, 03 Dec 2013 07:52:01 +0000

“We are facing an imminent torrent of machine generated data, creating volumes that will break the back of conventional hardware and software architectures. It is no longer feasible to move the data to the compute process – the compute process has to be moved to the data” –Mike Hoskins.

On the topic, Challenges and Opportunities for Big Data, I have interviewed Mike Hoskins, Actian Chief Technology Officer.

RVZ

Q1. What are in your opinion the most interesting opportunities in Big Data?

Mike Hoskins: Until recently, most data projects were solely focused on preparation. Seminal developments in the big data landscape, including Hortonworks Data Platform (HDP) 2.0 and the arrival of YARN (Yet Another Resource Negotiator) – which takes Hadoop’s capabilities in data processing beyond the limitations of the highly regimented and restrictive MapReduce programming model – provide an opportunity to move beyond the initial hype of big data and instead toward the more high-value work of predictive analytics.
As more big data applications are built on the Hadoop platform customized by industry and business needs, we’ll really begin to see organizations leveraging predictive analytics across the enterprise – not just in a sandbox or in the domain of the data scientists, but in the hands of the business users. At that point, more immediate action can be taken on insights.
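As one hedged illustration of what moving beyond hand-written MapReduce looks like in practice, here is a small PySpark sketch (it assumes a Spark installation and, on a cluster, submission via YARN; the input path and column names are invented) that expresses a grouped aggregation in a few lines rather than as separate map and reduce jobs:

```python
# Minimal PySpark sketch; the HDFS path and column names are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("beyond-mapreduce").getOrCreate()

clicks = spark.read.csv("hdfs:///data/clickstream.csv", header=True, inferSchema=True)

# A grouped aggregation that would take noticeably more code as raw MapReduce.
per_customer = (clicks
                .groupBy("customer_id")
                .agg(F.count("*").alias("events"),
                     F.avg("session_seconds").alias("avg_session")))

per_customer.show(10)
spark.stop()
```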

Q2. What are the most interesting challenges in Big Data?

Mike Hoskins: We are facing an imminent torrent of machine generated data, creating volumes that will break the back of conventional hardware and software architectures. It is no longer feasible to move the data to the compute process – the compute process has to be moved to the data. Companies need to rethink their static and rigid business intelligence and analytic software architectures in order to continue working at the speed of business. It’s clear that time has become the new gold standard – you can’t produce more of it; you can only increase the speed at which things happen.
Software vendors with the capacity to survive and thrive in this environment will keep pace with the competition by offering a unified platform, underpinned by engineering innovation, completeness of solution and the service integrity and customer support that is essential to market staying power.

Q3. Steve Shine, CEO and President, Actian Corporation, said in a recent interview (*) that “the synergies in data management come not from how the systems connect but how the data is used to derive business value”. Actian has completed a number of acquisitions this year. So, what is your strategy for Big Data at Actian?

Mike Hoskins: Actian has placed its bets on a completely modern unified platform that is designed to deliver on the opportunities presented by the Age of Data. Our technology assets bring a level of maturity and innovation to the space that no other technology vendor can provide – with 30+ years of expertise in ‘all things data’ and over $1M investment in innovation. Our mission is to arm organizations with solutions that irreversibly shift the price/performance curve beyond the reach of traditional legacy stack players, allowing them to get a leg up on the competition, retain customers, detect fraud, predict business trends and effectively use data as their most important asset.

Q4. What are the products synergies related to such a strategy?

Mike Hoskins: Through the acquisition of Pervasive Software (a provider of big data analytics and cloud-based and on-premises data management and integration), Versant (an industry leader in specialized data management), and ParAccel (a leader in high-performance analytics), Actian has compiled a unified end-to-end platform with capabilities to connect, prep, optimize and analyze data natively on Hadoop, and then offer it to the necessary reporting and analytics environments to meet virtually any business need. All the while, operating on commodity hardware at a much lower cost than legacy software can ever evolve to.

Q5. What else still need to be done at Actian to fully deploy this strategy?

Mike Hoskins: There are definitely opportunities to continue integrating the platform experience and improve the user experience overall. Our world-class database technology can be brought closer to Hadoop, and we will continue innovating on analytic techniques to grow our stack upward.
Our development team is working diligently to create a common user interface across all of our platforms, as we bring our technology together. We have the opportunity to create a true first-class SQL engine running natively on Hadoop, and to more fully exploit market-leading cooperative computing with our On-Demand Integration (ODI) capabilities. I would also like to raise the awareness of the power and speed of our offerings as a general paradigm for analytic applications.

We don’t know what new challenges the Age of Data will bring, but we will continue to look to the future and build out a technology infrastructure to help organizations deal with the only constant – change.

Q6. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?

Mike Hoskins: Elastic cloud computing is a convulsive game changer in the marketplace. It’s positive; if not where you do full production, at the very least it allows people to test, adopt and experiment with their data in a way that they couldn’t before. For cases where data is born in the cloud, using a 100% cloud model makes sense. However, much data is highly distributed across cloud and on-premises systems and applications, so it’s vital to have technology that can run in, and connect to, either environment via a hybrid model.

We will soon see more organizations utilizing cloud platforms to run analytic processes, if that is where their data is born and lives.

Q7. How is your Cloud technology helping Amazon`s Redshift?

Mike Hoskins: Amazon Redshift leverages our high-performance analytics database technology to help users get the most out of their cloud investment. Amazon selected our technology over all other database and data warehouse technologies available in the marketplace because of the incredible performance, extreme scalability, and flexibility.

Q8. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey.
When you speak with your customers what are the typical use cases and requirements they have?

Mike Hoskins: A recent survey of data architects and CIOs by Sand Hill Group revealed that the top challenge of Hadoop adoption was knowledge and experience with the Hadoop platform, followed by the availability of Hadoop and big data skills, and finally the amount of technology development required to implement a Hadoop-based solution. This just goes to show how little we have actually begun to fully leverage the capabilities of Hadoop. Businesses are really only just starting to dip their toe in the analytic water. Although it’s still very early, the majority of use cases that we have seen are centered around data prep and ETL.
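Since data prep and ETL dominate these early use cases, here is a small, hedged Python sketch of a typical prep step – parse raw log lines, drop malformed records, and write out a cleaned structure – using an invented log format:

```python
import json

# Illustrative raw log lines; the pipe-delimited format is invented.
raw_lines = [
    "2013-11-30T12:00:01|user=alice|action=view|item=1234",
    "corrupted line without the expected fields",
    "2013-11-30T12:00:05|user=bob|action=buy|item=5678",
]

def parse(line):
    """Turn one log line into a dict, or return None if it is malformed."""
    parts = line.split("|")
    if len(parts) != 4:
        return None
    record = {"ts": parts[0]}
    for field in parts[1:]:
        key, _, value = field.partition("=")
        record[key] = value
    return record

# Transform: keep only well-formed records.
cleaned = [r for r in (parse(line) for line in raw_lines) if r is not None]

# Load: write the cleaned records out as JSON lines for downstream analytics.
with open("clean_events.jsonl", "w") as out:
    for record in cleaned:
        out.write(json.dumps(record) + "\n")

print(f"kept {len(cleaned)} of {len(raw_lines)} lines")
```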

Q9. What do you think is still needed for big data analytics to be really useful for the enterprise?

Mike Hoskins: If we look at the complete end-to-end data pipeline, there are several things that are still needed for enterprises to take advantage of the opportunities. These include high-productivity, performant integration layers, and analytics that move beyond the sphere of data science and into mainstream business usage, with discovery analytics through a simple UI studio or an analytics-as-a-service offering. Analytics need to be made more available in the critical discovery phase, to bring out the outcomes, patterns, models, discoveries, etc., and begin applying them to business processes.

Qx. Anything else you wish to add?

Mike Hoskins: These kinds of highly disruptive periods are, frankly, unnerving for the marketplace and businesses. Organizations cannot rely on traditional big stack vendors, who are unprepared for the tectonic shift caused by big data, and therefore are not agile enough to rapidly adjust their platforms to deliver on the opportunities. Organizations are forced to embark on new paths and become their own System Integrators (SIs).

On the other hand, organizations cannot tie their future to the vast number of startups, throwing darts to find the one vendor that will prevail. Instead, they need a technology partner somewhere in the middle that understands data in-and-out, and has invested completely and wholly as a dedicated stack to help solve the challenge.

Although it’s uncomfortable, it is urgent that organizations look at modern architectures, next-generation vendors and innovative technology that will allow them to succeed and stay competitive in the Age of Data.

—————————–
Mike Hoskins, Actian Chief Technology Officer
Actian CTO Michael Hoskins directs Actian’s technology innovation strategies and evangelizes accelerating trends in big data, and cloud-based and on-premises data management and integration. Mike, a Distinguished and Centennial Alumnus of Ohio’s Bowling Green State University, is a respected technology thought leader who has been featured in TechCrunch, Forbes.com, Datanami, The Register and Scobleizer. Mike has been a featured speaker at events worldwide, including Strata NY + Hadoop World 2013, the keynoter at DeployCon 2012, the “Open Standards and Cloud Computing” panel at the Annual Conference on Knowledge Discovery and Data Mining, the “Scaling the Database in the Cloud” panel at Structure 2010, and the “Many Faces of Map Reduce – Hadoop and Beyond” panel at Structure Big Data 2011. Mike received the AITP Austin chapter’s 2007 Information Technologist of the Year Award for his leadership in developing Actian DataRush, a highly parallelized framework to leverage multicore. Follow Mike on Twitter: @MikeHSays.

Related Posts

Big Data Analytics at Thomson Reuters. Interview with Jochen L. Leidner. November 15, 2013

On Big Data. Interview with Adam Kocoloski. November 5, 2013

Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. September 23, 2013

(*) Acquiring Versant –Interview with Steve Shine. March 6, 2013

Resources

“Do You Hadoop? A Survey of Big Data Practitioners”, Bradley Graham M. R. Rangaswami, SandHill Group, October 29, 2013 (.PDF)

Actian Vectorwise 3.0: Fast Analytics and Answers from Hadoop. Actian Corporation. Technical white paper (PDF), May 2013.

Big Data Analytics at Thomson Reuters. Interview with Jochen L. Leidner http://www.odbms.org/blog/2013/11/big-data-analytics-at-thomson-reuters-interview-with-jochen-l-leidner/ Fri, 15 Nov 2013 09:21:06 +0000

“My experience overall with almost all open-source tools has been very positive: open source tools are very high quality, well documented, and if you get stuck there is a helpful and responsive community on mailing lists or Stack Exchange and similar sites.” —Dr. Jochen L. Leidner.

I wanted to know how Thomson Reuters uses Big Data. I have interviewed Dr. Jochen L. Leidner, Lead Scientist, of the London R&D at Thomson Reuters.

RVZ

Q1. What is your current activity at Thomson Reuters?
Jochen L. Leidner: For the most part, I carry out applied research in information access, and that’s what I have been doing for quite a while. After five years with the company – I joined from the University of Edinburgh, where I had been a postdoctoral Royal Society of Edinburgh Enterprise Fellow half a decade ago – I am currently a Lead Scientist with Thomson Reuters, where I am building up a newly-established London site that is part of our Corporate Research & Development group (by the way: we are hiring!). Before that, I held research and innovation-related roles in the USA and in Switzerland.
Let me say a few words about Thomson Reuters before I go more into my own activities, just for background. Thomson Reuters has around 50,000 employees in over 100 countries and sells information to professionals in many verticals, including finance & risk, legal, intellectual property & scientific, tax & accounting.
Our headquarters are located at 3 Time Square in the city of New York, NY, USA.
Most people know our REUTERS brand from reading their newspapers (thanks to our highly regarded 3,000+ journalists at news desks in about 100 countries, often putting their lives at risk to give us reliable reports of the world’s events) or receiving share price information on the radio or TV, but as a company, we are also involved in as diverse areas as weather prediction (as the weather influences commodity prices) and determining citation impact of academic journals (which helps publishers sell their academic journals to librarians), or predicting Nobel prize winners.
My research colleagues and I study information access and especially means to improve it, using natural language processing, information extraction, machine learning, search engine ranking, recommendation systems and similar areas of investigation.
We carry out a lot of contract research for internal business units (especially if external vendors do not offer what we need, or if we believe we can build something internally that is lower cost and/or better suited to our needs), feasibility studies to de-risk potential future products that are considered, and also more strategic, blue-sky research that anticipates future needs. As you would expect, we protect our findings and publish them in the usual scientific venues.

Q2. Do you use Data Analytics at Thomson Reuters and for what?
Jochen L. Leidner: Note that terms like “analytics” are rather too broad to be useful in many instances; but the basic answer is “yes”, we develop, apply internally, and sell as products to our customers what can reasonably be called solutions that incorporate “data analytics” functions.
One example capability that we developed is the recommendation engine CaRE (Al-Kofahi et al., 2007), which was built by our group, Corporate Research & Development, led by our VP of Research & Development, Khalid al-Kofahi. This is a bit similar in spirit to Amazon’s well-known book recommendations. We used it to recommend legal documents to attorneys, as a service to supplement the user’s active search on our legal search engine with “see also…”-type information. This is an example of a capability developed in-house that also made it into a product, and it is very popular.
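To give a rough sense of what a document recommender does (this is a generic sketch, not CaRE itself), the Python example below suggests “see also” documents by cosine similarity over simple term counts; the documents and terms are invented:

```python
import math
from collections import Counter

# Toy corpus (invented). A real recommender uses far richer signals than term counts.
docs = {
    "case-101": "negligence duty of care damages",
    "case-102": "contract breach damages remedy",
    "case-103": "negligence standard of care liability",
}

def vectorize(text):
    return Counter(text.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def see_also(doc_id, k=2):
    query_vec = vectorize(docs[doc_id])
    scored = [(other, cosine(query_vec, vectorize(text)))
              for other, text in docs.items() if other != doc_id]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

print(see_also("case-101"))   # case-103 ranks above case-102
```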
Thomson Reuters is selling information services, often under a subscription model, and for that it is important to have metrics available that indicate usage, in order to inform our strategy. So another example of data analytics is that we study how document usage can inform personalization and ranking, and from where documents are accessed, and we use this to plan network bandwidth and to determine caching server locations.
A completely different example is citation information: Since 1969, when our esteemed colleague Eugene Garfield (he is now officially retired, but is still active) came up with the idea of citation impact, our Scientific business division is selling the journal citation impact factor – an analytic that can be used as a proxy for the importance of a journal (and, by implication, as an argument to a librarian to purchase a subscription of that journal for his or her university library).
Or, to give another example from the financial markets area, we are selling predictive models (Starmine) that estimate how likely it is whether a given company goes bankrupt within the next six months.

Q3. Do you have Big Data at Thomson Reuters? Could you please give us some examples of Big Data Use Cases at your company?
Jochen L. Leidner: For most definitions of “big”, yes we do. Consider that we operate a news organization, which daily generates tens of thousands of news reports (if we count all languages together). Then we have photo journalists who create large numbers of high-quality, professional photographs to document current events visually, and videos comprising audio-visual storytelling and interviews. We further collect all major laws, statutes, regulations and legal cases in major jurisdictions around the world, and enrich the data with our own meta-data using both manual expertise and automatic classification and tagging tools to enhance findability. We hold collections of scientific articles and patents in full text and abstracts.
We gather, consolidate and distribute price information for financial instruments from hundreds of exchanges around the world. We sell real-time live feeds as well as access to decades of these time series for the purpose of back-testing trading strategies.

Q4. What “value” can be derived by analyzing Big Data at Thomson Reuters?
Jochen L. Leidner: This is the killer question: we take the “value” very much without the double quotes – big data analytics lead to cost savings as well as generate new revenues in a very real, monetary sense of the word “value”. Because our solutions provide what we call “knowledge to act” to our customers, i.e., information that lets them make better decisions, we provide them with value as well: we literally help our customers save the cost of making a wrong decision.

Q5. What are the main challenges for big data analytics at Thomson Reuters ?
Jochen L. Leidner: I’d say absolute volume, growth, data management/integration, rights management, and privacy are some of the main challenges.
One obvious challenge is the size of the data. It’s not enough to have enough persistent storage space to keep it, we also need backup space, space to process it, caches and so on – it all adds up. Another is the growth and speed of that growth of the data volume. You can plan for any size, but it’s not easy to adjust your plans if unexpected growth rates come along.
Another challenge that we must look into is the integration between our internal data holdings, external public data (like the World Wide Web, or commercial third-party sources – like Twitter – which play an important role in the modern news ecosystem), and customer data (customers would like to see their own internal, proprietary data be inter-operable with our data). We need to respect the rights associated with each data set, as we deal with our own data, third party data and our customers’ data. We must be very careful regarding privacy when brainstorming about the next “big data analytics” idea – we take privacy very seriously and our CPO joined us from a government body that is in charge of regulating privacy.

Q6. How do you handle the Big Data Analytics “process” challenges with deriving insight?
Jochen L. Leidner: Usually analytics projects happen as an afterthought to leverage existing data created in a legacy process, which means not a lot of change of process is needed at the beginning. This situation changes once there is resulting analytics output, and then the analytics-generating process needs to be integrated with the previous processes.
Even with new projects, product managers still don’t think of analytics as the first thing to build into a product for a first bare-bones version, and we need to change that; instrumentation is key for data gathering so that analytics functionality can build on it later on.
In general, analytics projects follow a process of (1) capturing data, (2) aligning data from different sources (e.g., resolving when two objects are the same), (3) pre-processing or transforming the data into a form suitable for analysis, (4) building some model and (5) understanding the output (e.g. visualizing and sharing the results). This five-step process is followed by an integration phase into the production process to make the analytics repeatable.
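The five-step process maps naturally onto a small pipeline skeleton; the Python sketch below uses placeholder implementations for each step, just to show the shape:

```python
# Skeleton of the five-step analytics process described above; every step body
# is a placeholder standing in for real capture/alignment/modeling logic.

def capture():
    return [{"source": "web", "user": "u1", "value": 3},
            {"source": "crm", "user": "U1", "value": 5}]

def align(records):
    # e.g. resolve that "u1" and "U1" refer to the same entity
    for r in records:
        r["user"] = r["user"].lower()
    return records

def transform(records):
    return [(r["user"], r["value"]) for r in records]

def build_model(pairs):
    # stand-in "model": total value per user
    model = {}
    for user, value in pairs:
        model[user] = model.get(user, 0) + value
    return model

def understand(model):
    for user, total in sorted(model.items()):
        print(f"{user}: {total}")   # sharing/visualization would go here

understand(build_model(transform(align(capture()))))
```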

Q7. What kind of data management technologies do you use? What is your experience in using them?
Jochen L. Leidner: For storage and querying, we use relational database management systems to store valuable data assets and also use NoSQL databases such as CouchDB and MongoDB in projects where applicable. We use homegrown indexing and retrieval engines as well as open source libraries and components like Apache Lucene, Solr and ElasticSearch.
We use parallel, distributed computing platforms such as Hadoop/Pig and Spark to process data, and virtualization to manage isolation of environments.
We sometimes push out computations to Amazon’s EC2 and storage to S3 (but this can only be done if the data is not sensitive). And of course in any large organization there is a lot of homegrown software around. My experience overall with almost all open-source tools has been very positive: open source tools are very high quality, well documented, and if you get stuck there is a helpful and responsive community on mailing lists or Stack Exchange and similar sites.
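As one concrete, hedged illustration of the NoSQL side of such a stack, this is how a document store such as MongoDB is typically used from Python via pymongo; the database and collection names, the fields, and the assumption of a locally running server are mine, not Thomson Reuters specifics:

```python
# Hedged example: storing and querying semi-structured documents with pymongo.
# Assumes a MongoDB server on localhost; database/collection names are invented.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["research"]

db.articles.insert_one({
    "title": "Market wrap",
    "topic": "finance",
    "entities": ["Thomson Reuters", "NYSE"],
})

db.articles.create_index("topic")

for doc in db.articles.find({"topic": "finance"}).limit(5):
    print(doc["title"], doc["entities"])
```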

Q8. Do you handle un-structured data? If yes, how?
Jochen L. Leidner: We curate lots of our own unstructured data from scratch, including the thousands of news stories that our over three thousand REUTERS journalists write every day, and we also enrich the unstructured content produced by others: for instance, in the U.S. we manually enrich legal cases with human-written summaries (written by highly-qualified attorneys) and classify them based on a proprietary taxonomy, which informs our market-leading WestlawNext product (see this CIKM 2011 talk), our search engine for legal professionals, who need to find exactly the right cases. Over the years, we have developed proprietary content repositories to manage content storage, meta-data storage, indexing and retrieval. One of our challenges is to unite our data holdings, which often come from various acquisitions that use their own technology.

Q9. Do you use Hadoop? If yes, what is your experience with Hadoop so far?
Jochen L. Leidner: We have two clusters featuring Hadoop, Spark and GraphLab, and we are using them intensively. Hadoop is mature as an implementation of the MapReduce computing paradigm, but has its shortcomings because it is not a true operating system (but probably it should be) – for instance regarding the stability of HDFS, its distributed file system. People have started to realize there are shortcomings, and have started to build other systems around Hadoop to fill gaps, but these are still early stage, so I expect them to become more mature first and then there might be a wave of consolidation and integration. We have definitely come a long way since the early days.
Typically, on our clusters we run batch information extraction tasks, data transformation tasks and large-scale machine learning processes to train classifiers and taggers for these. We are also inducing language models and training recommender systems on them. Since many training algorithms are iterative, Spark can win over Hadoop for these, as it keeps models in RAM.
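The point about Spark keeping data in memory for iterative algorithms can be illustrated with a short PySpark sketch: the training data is cached once and then re-scanned on every iteration without going back to disk. The path, columns and the trivial “gradient” update are illustrative only:

```python
# Hedged PySpark sketch: cache a dataset once, then iterate over it in memory,
# which is where Spark tends to beat plain MapReduce for iterative training.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iterative-sketch").getOrCreate()

train = spark.read.parquet("hdfs:///data/features.parquet").cache()  # kept in RAM

weight = 0.0
for step in range(10):
    # Stand-in for a real gradient computation over the cached data.
    grad = train.agg(F.avg(F.col("feature") * F.col("label"))).first()[0]
    weight += 0.1 * grad

print(weight)
spark.stop()
```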

Q10. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?
Jochen L. Leidner: The dichotomy is perhaps between batch processing and dialog processing, whereas real-time (“hard”, deterministic time guarantee for a system response) goes hand-in-hand with its opposite non-real time, but I think what you are after here is that dialog systems have to be responsive. There is no one-size-fits-all method for meeting (near) real-time requirements or for making dialog systems more responsive; a lot of analytics functions require the analysis of more than just a few recent data-points, so if that’s what is needed it may take its time. But it is important that critical functions, such as financial data feeds, are delivered as fast as possible – micro-seconds matter here. The more commoditized analytics functions become, the faster they need to be available to retain at least speed as a differentiator.

Q11 Cloud computing and open source: Do you they play a role at Thomson Reuters? If yes, how?
Jochen L. Leidner: Regarding cloud computing, we use cloud services internally and as part of some of our product offerings. However, there are also reservations – a lot of our applications contain information that is too sensitive to entrust a third party, especially as many cloud vendors cannot give you a commitment with respect to hosting (or not hosting) in particular jurisdictions. Therefore, we operate our own set of data centers, and some part of these operates as what has become known as “private clouds”, retaining the benefit of the management outsourcing abstraction, but within our large organization rather than pushing it out to a third party. Of course the notion of private clouds is leading the cloud idea ad absurdum quite a bit, because it sacrifices the economy of scale, but having more control is an advantage.
Open source plays a huge role at Thomson Reuters – we rely on many open source components, libraries and systems, especially under the MIT, BSD, LGPL and Apache licenses. For example, some of our tagging pipelines rely on Apache UIMA, which is a contribution originally developed at IBM, and which has seen contributions by researchers from all around the world (from Darmstadt to Boulder). To date, we have not been very good about opening up our own services in the form of source code, but we are trying to change that now, and we have just launched a new corporation-wide process for open-sourcing software. We also have an internal sharing repository, “Corporate Source”, but in my personal view the audience in any single company is too small – open source (like other recent waves such as clouds or crowdsourcing) needs Internet-scale to work, and Web search engines for the various projects to be discovered.

Q12 What are the main research challenges ahead? And what are the main business challenges ahead?
Jochen L. Leidner: Some of the main business challenges are the cost pressure that some of our customers face, and the increasing availability of low-cost or free-of-charge information sources, i.e. the commoditization of information. I would caution here that whereas the amount of information available for free is large, this in itself does not help you if you have a particular problem and cannot find the information that helps you solve it, either because the solution is not there despite the size, or because it is there but findability is low. Further challenges include information integration, making systems ever more adaptive (but only to the extent it is useful), and supporting better personalization. Having said this, sometimes systems need to be run in a non-personalized mode (e.g. in the field of e-discovery, you need a certain consistency, namely that the same legal search system retrieves the same things today and tomorrow, and for different parties).

Q13 Anything else you wish to add?
Jochen L. Leidner: I would encourage decision makers of global companies not to get misled by fast-changing “hype” language: words like “cloud”, “analytics” and “big data” are too general to inform a professional discussion. Putting my linguist’s hat on, I can only caution about the lack of precision inherent in marketing language’s use in technological discussions: for instance, cluster computing is not the same as grid computing. And what looks like “big data” today we will almost certainly carry around with us on a mobile computing device tomorrow. Also, buzz words like “Big Data” do not by themselves solve any problem – they are not magic bullets. To solve any problem, look at the input data, specify the desired output data, and think hard about whether and how you can compute the desired result – nothing but “good old” computer science.

Dr. Jochen L. Leidner is a Lead Scientist with Thomson Reuters, where he is building up the corporation’s London site of its Research & Development group.
He holds Master’s degrees in computational linguistics, English language and literature and computer science from Friedrich-Alexander University Erlangen-Nuremberg and in computer speech, text and internet technologies from the University of Cambridge, as well as a Ph.D. in information extraction from the University of Edinburgh (“Toponym resolution in Text”). He is recipient of the first ACM SIGIR Doctoral Consortium Award, a Royal Society of Edinburgh Enterprise Fellowship in Electronic Markets, and two DAAD scholarships.
Prior to his research career, he has worked as a software developer, including for SAP AG (basic technology and knowledge management) as well as for startups.
He led the development teams of multiple question answering systems, including the systems QED at Edinburgh and Alyssa at Saarland University, the latter of which ranked third at the last DARPA/NIST TREC open-domain question answering factoid track evaluation.
His main research interests include information extraction, question answering and search, geo-spatial grounding, applied machine learning with a focus on methodology behind research & development in the area of information access.
In 2013, Dr. Leidner also taught an invited lecture course, “Language Technology and Big Data”, at the University of Zurich, Switzerland.

—————————-
Related Posts

Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. September 23, 2013

On Linked Data. Interview with John Goodwin. September 1, 2013

Big Data Analytics at Netflix. Interview with Christos Kalantzis and Jason Brown. February 18, 2013

Resources

ODBMS.org free resources on Big Data and Analytical Data Platforms:
Blog Posts | Free Software| Articles | Lecture Notes | PhD and Master Thesis|

Follow us on Twitter: @odbmsorg

On geo-distributed data management — Interview with Adam Abrevaya. http://www.odbms.org/blog/2013/10/on-geo-distributed-data-management-interview-with-adam-abrevaya/ http://www.odbms.org/blog/2013/10/on-geo-distributed-data-management-interview-with-adam-abrevaya/#comments Sat, 19 Oct 2013 15:00:17 +0000 http://www.odbms.org/blog/?p=2712

“Geo-distribution is the ability to distribute a single, logical SQL/ACID database that delivers transactional consistency across multiple datacenters, cloud provider regions, or a hybrid” — Adam Abrevaya.

I have interviewed Adam Abrevaya, Vice President of Engineering, NuoDB.

RVZ

Q1. You just launched NuoDB 2.0, what is special about it?

Adam Abrevaya: NuoDB Blackbirds Release 2.0 demonstrates a strong implementation of the NuoDB vision. It includes over 200 new features and improvements, making it even more stable and reliable than previous versions.
We have improved the migration tools, added Java stored procedures, introduced powerful automated administration, enhanced the core geo-distribution functionality, and more.

Q2. You offer a feature called geo-distribution. What is it and why is it useful?

Adam Abrevaya: Geo-distribution is the ability to distribute a single, logical SQL/ACID database that delivers transactional consistency across multiple datacenters, cloud provider regions, or a hybrid.

NuoDB’s geo-distributed data management lets customers build an active/active, highly-responsive database for high availability and low latency. By bringing the database closer to the end user, we can enable faster responses while simultaneously eliminating the time spent on complex tasks like replication, backup and recovery schemes.

One of the most exciting aspects of the Release 2.0 launch was the discussion of a major deployment of NuoDB Geo-Distribution by a customer. We were very excited to include Cameron Weeks, CEO and Co-Founder of Fathom Voice, talking about the challenges his company was facing – both managing his existing business and expanding globally in a cost-effective way. After a lengthy evaluation of alternative technologies, he found that NuoDB’s distributed database was the only one that met his needs.

Q3. NuoDB falls broadly into the category of NewSQL databases, but you say that you are also a distributed database and that your architecture is fundamentally different than other databases out there. What’s different about it?

Adam Abrevaya: Yes, we are a NewSQL database and we offer the scale-out performance typically associated with NoSQL solutions, while still maintaining the safety and familiarity of SQL and ACID guarantees.

Our architecture, envisioned by renowned data scientist, Jim Starkey, is based on what we call “On-demand Replication”. We have an architecture whitepaper (registration required) which provides all the technical differentiators of our approach.

Q4. NuoDB is SQL compliant, and you claim that it scales elastically. But how do you handle complex join operations on data sets that are geographically distributed while at the same time scaling in and out?

Adam Abrevaya: NuoDB can have transactions that work against completely different on-demand caches.
For example, you can have OLTP transactions running in 9 Amazon AWS regions, each working on a subset of the overall database. Separately, there can be on-demand caches dedicated to queries across the entire data set. NuoDB manages these on-demand, ACID-compliant caches – which serve very different use cases – automatically, without impact on the critical end-user OLTP operations.

Q5. What is special about NuoDB with respect to availability? Several other NoSQL data stores are also resilient to infrastructure and partition failures.

Adam Abrevaya: First off, NuoDB offers a distributed SQL database system that provides all the ACID guarantees you expect from a relational database. We scale out like NoSQL databases, and we handle independent failures at each level of our architecture. Redundant processes take over for failed processes (due to machine or other failures), and we make it easy for new machines and processes to be brought online and added to the overall database dynamically. Applications built with the typical facilities used in enterprise applications will automatically reconnect to surviving processes in our system. We can detect network partition failures and allow the application to take appropriate measures.

Q6 How are some of your customers using NuoDB?

Adam Abrevaya: We are seeing a number of common uses of NuoDB among our customers. These range from startups building new web-facing solutions, to geo-distributed SaaS applications, to ISVs moving existing apps to the cloud, to all sorts of other apps that hit the performance wall with MySQL and other traditional DBMSs. Ultimately, with lots of replication, sharding, new server hardware, etc., customers can use traditional databases to scale out or up, but at a very high cost in time and money, and usually by giving up transactional guarantees. One customer said he decided to look at alternatives to MySQL simply because he was spending so much time in meetings talking about how to get it to do what they needed it to do. He added up the cost of the man-hours and said “migrate.”

As I mentioned already, Fathom Voice, a SaaS provider offering VoIP, conference bridging, receptionist services and some innovative communications apps, had a global deployment challenge: how to get the database near their globe-trotting customers, reduce latency, and ensure redundancy. They are one of many customers and prospects tackling these issues.

———————-
Adam Abrevaya, Vice President of Engineering, NuoDB
Adam has been building and managing world-class engineering teams and products for almost two decades. His passion is around building and delivering high-performance core infrastructure products that companies depend on to build their businesses.

Adam started his career at MIT Lincoln Laboratory where he developed a distributed platform and image processing algorithms for detecting dangerous weather patterns in radar images. The system was deployed at several airports around the country.

From there, Adam joined Object Design, where he held various senior management positions and was responsible for several major releases of ObjectStore (an object database), along with spearheading the development team building XML products, including Stylus Studio, an XML database, and a Business Process Manager.

Adam joined Pantero Corporation as VP of Development where he developed a revolutionary Semantic Data Integration product. Pantero was eventually sold to Progress Software.

From Pantero, Adam joined m-Qube to manage and build the team creating its Mobile Messaging Gateway platform. The m-Qube platform is a carrier-grade product that has become the leading Mobile Messaging Gateway in North America and generated billions of dollars in revenue. Adam continued managing the m-Qube platform, along with expanded roles, after acquisitions of the technology by VeriSign and Mobile Messenger.

———

Related Posts

On Big Data and NoSQL. Interview with Renat Khasanshyn. October 7, 2013

On NoSQL. Interview with Rick Cattell. August 19, 2013

Resources

Download NuoDB Pro Edition (Registration required) (NuoDB Blackbirds Release 2.0)

ODBMS.org free resources on
Relational Databases, NewSQL, XML Databases, RDF Data Stores:
Blog Posts |Free Software | Articles and Presentations| Lecture Notes | Tutorials| Journals |

Follow ODBMS.org on Twitter: @odbmsorg

##

Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. http://www.odbms.org/blog/2013/09/data-analytics-at-nbcuniversal-interview-with-matthew-eric-bassett/ http://www.odbms.org/blog/2013/09/data-analytics-at-nbcuniversal-interview-with-matthew-eric-bassett/#comments Mon, 23 Sep 2013 14:48:10 +0000 http://www.odbms.org/blog/?p=2639

“The most valuable thing I’ve learned in this role is that judicious use of a little bit of knowledge can go a long way. I’ve seen colleagues and other companies get caught up in the “Big Data” craze by spending hundreds of thousands of pounds sterling on a Hadoop cluster that sees a few megabytes a month. But the most successful initiatives I’ve seen treat it as another tool and keep an eye out for valuable problems that they can solve.” –Matthew Eric Bassett.

I have interviewed Matthew Eric Bassett, Director of Data Science for NBCUniversal International.
NBCUniversal is one of the world’s leading media and entertainment companies in the development, production, and marketing of entertainment, news, and information to a global audience.
RVZ

Q1. What is your current activity at Universal?

Bassett: I’m the Director of Data Science for NBCUniversal International. I lead a small but highly effective predictive analytics team. I’m also a “data evangelist”; I spend quite a bit of my time helping other business units realize they can find business value from sharing and analyzing their data sources.

Q2. Do you use Data Analytics at Universal and for what?

Bassett: We predict key metrics for the different businesses – everything from television ratings, to how an audience will respond to marketing campaigns, to the value of a particular opening weekend for the box office. To do this, we use machine learning regression and classification algorithms, semantic analysis, Monte Carlo methods, and simulations.
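
To make the Monte Carlo part of that toolkit concrete, here is a minimal sketch of the general idea: simulate many possible opening weekends under uncertain attendance and ticket-price assumptions and look at the resulting distribution of gross revenue. Every number and variable name below is invented for illustration; this is not NBCUniversal's actual model.

    import numpy as np

    # Minimal Monte Carlo sketch: estimate the distribution of an opening-weekend
    # gross under uncertain assumptions. All figures are made up for illustration.
    rng = np.random.default_rng(seed=42)
    n_sims = 100_000

    # Hypothetical inputs: admissions follow a log-normal, ticket price a normal.
    admissions = rng.lognormal(mean=np.log(800_000), sigma=0.35, size=n_sims)
    ticket_price = rng.normal(loc=9.50, scale=0.75, size=n_sims)

    gross = admissions * ticket_price

    print(f"median opening weekend: ${np.median(gross):,.0f}")
    print(f"5th-95th percentile: ${np.percentile(gross, 5):,.0f} to ${np.percentile(gross, 95):,.0f}")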

Q3. Do you have Big Data at Universal? Could you pls give us some examples of Big Data Use Cases at Universal?

Bassett: We’re not working with terabyte-scale data sources. “Big data” for us often means messy or incomplete data.
For instance, our cinema distribution company operates in dozens of countries. For each day in each one, we need to know how much money was spent and by whom – and feed this information into our machine-learning simulations for future predictions.
Each country might have dozens more cinema operators, all sending data in different formats and at different qualities. One territory may neglect demographics, another might misreport gross revenue. In order for us to use the data, we have to find missing or incorrect values and set the appropriate flags in our models and reports for later.

Automating this process is the bulk of our Big Data operation.
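
A hedged sketch of what one small piece of that automated validation might look like, using pandas (the column names, thresholds and file layout are invented; the interview does not describe the actual pipeline):

    import pandas as pd

    # Hypothetical daily feed from one cinema operator; columns are invented:
    # date, territory, gross, admissions, demo_share
    df = pd.read_csv("operator_feed.csv")

    # Flag rows we cannot trust rather than silently dropping them.
    df["flag_missing_demo"] = df["demo_share"].isna()
    df["flag_bad_gross"] = df["gross"].isna() | (df["gross"] <= 0)

    # Crude sanity check: implied ticket price far outside a plausible range.
    implied_price = df["gross"] / df["admissions"]
    df["flag_implausible_price"] = ~implied_price.between(2.0, 30.0)

    usable = ~(df["flag_missing_demo"] | df["flag_bad_gross"] | df["flag_implausible_price"])
    print(f"{usable.sum()} of {len(df)} rows usable without manual review")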

Q4. What “value” can be derived by analyzing Big Data at Universal?

Bassett: “Big data” helps everything from marketing, to distribution, to planning.
“In marketing, we know we’re wasting half our money. The problem is that we don’t know which half.” Big data is helping us solve that age-old marketing problem.
We’re able to track how the market is responding to our advertising campaigns over time, compare it to past campaigns and products, and use that information to reach our audience more precisely (a bit like how the Obama campaign used big data to optimize its strategy).

In cinema alone, the opening weekend of a film can affect gross revenue by seven figures (or more), so any insight we can provide into the optimal timing can directly generate thousands or millions of dollars in revenue.

Being able to distill “big data” from historical information, audience responses in social media, data from commercial operators, et cetera, into a usable and interactive simulation completely changes how we plan our strategy for the next 6-15 months.

Q5. What are the main challenges for big data analytics at Universal ?

Bassett: Internationalization, adoption, and speed.
We’re an international operation, so we need to extend our results from one country to another.
Some territories have a high correlation between our data mining operation and the metrics we want to predict. But when we extend to other territories we have several issues.
For instance: 1) it’s not as easy for us to do data mining on unstructured linguistic data (like audience comments on a YouTube preview), and 2) user-generated and web analytics data is harder to find (and in some cases nonexistent!) in some of our markets, even if we did have a multi-language data mining capability. Less reliable regions send us incoming data or historicals that are erroneous, incomplete, or simply not there – see my comment about “messy data”.

Reliability with internationalization feeds into another issue – we’re in an industry that historically uses qualitative rather than quantitative processes. It takes quite a bit of “evangelism” to convince people of what is possible with a bit of statistics and programming, and even after we’ve created a tool for a business, it takes some time for all the key players to trust it and use it consistently.

A big part of accomplishing that is ensuring that our simulations and predictions happen fast.
Naturally, our systems need to be able to respond to market changes (a competing film studio changes a release date, an event in the news changes television ratings, et cetera) and inform people of what is happening.
But we need to give researchers and industry analysts feedback instantly – even while the underlying market is static – to keep them engaged. We’re often asking ourselves questions like “how can we make this report faster?” or “how can we speed up this script that pulls audience info from a PDF?”
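
On the PDF point specifically, one common way to speed such a script up is to let a native tool do the text extraction and keep Python for the parsing. The sketch below assumes poppler's pdftotext command is installed; the file name and the field being scraped are hypothetical.

    import re
    import subprocess

    # pdftotext (from poppler) writes the extracted text to stdout when "-" is given.
    result = subprocess.run(
        ["pdftotext", "-layout", "audience_report.pdf", "-"],
        capture_output=True, text=True, check=True,
    )

    # Illustrative pattern only; the real report layout would dictate the regex.
    admissions = [int(m.replace(",", "")) for m in re.findall(r"Admissions:\s+([\d,]+)", result.stdout)]
    print(admissions)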

Q6. How do you handle the Big Data Analytics “process” challenges with deriving insight?
For example when:

  • capturing data
  • aligning data from different sources (e.g., resolving when two objects are the same)
  • transforming the data into a form suitable for analysis
  • modeling it, whether mathematically, or through some form of simulation
  • understanding the output
  • visualizing and sharing the results

Bassett: We start with the insight in mind: what blind spots do our businesses have, what questions are they trying to answer, and how should that answer be presented? Our process begins with the key business leaders and with figuring out what problems they have – often when they don’t yet know there’s a problem.

Then we start our feature selection, and identify which sources of data will help achieve our end goal – sometimes a different business unit has it sitting in a silo and we need to convince them to share, sometimes we have to build a system to crawl the web to find and collect it.
Once we have some idea of what we want, we start brainstorming about the right methods and algorithms to use to reveal useful information: Should we cluster across a multi-variate time series of market response per demographic and use that as an input for a regression model? Can we reliably get a quantitative measure of a demographic’s engagement from sentiment analysis on comments? This is an iterative process, and we spend quite a bit of time in the “capturing data/transforming the data” step.
But it’s where all the fun is, and it’s not as hard as it sounds: typically, the most basic scientific methods are sufficient to capture 90% of the business value, so long as you can figure out when and where to apply them and where the edge cases lie.
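
As a toy illustration of the clustering-then-regression idea mentioned above, here is a minimal sketch using scikit-learn (a library the interview does not mention, so treat it as an assumption) on entirely synthetic data: weekly market-response curves are clustered into archetypes, and the cluster label becomes one feature in a regression.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Entirely synthetic stand-in data: 200 past campaigns, each with a 12-week
    # market-response curve, a spend figure, and an outcome we want to predict.
    response_curves = rng.normal(size=(200, 12))
    spend = rng.uniform(1.0, 10.0, size=(200, 1))
    outcome = response_curves.sum(axis=1) + 3.0 * spend.ravel() + rng.normal(scale=0.5, size=200)

    # Step 1: cluster the response curves into a few archetypal shapes.
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(response_curves)

    # Step 2: one-hot encode the cluster label and use it alongside spend in a regression.
    cluster_onehot = np.eye(4)[kmeans.labels_]
    X = np.hstack([cluster_onehot, spend])
    model = LinearRegression().fit(X, outcome)
    print("R^2 on the synthetic training data:", round(model.score(X, outcome), 3))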

Finally, there is another exciting stage: finding surprising insights in the results.
You might start by trying to get a metric for risk in cinema, and in the process you might find a metric for how the risk changes for releases that target a specific audience – and this new method might work for a different business.

Q7. What kind of data management technologies do you use? What is your experience in using them? Do you handle un-structured data? If yes, how?

Bassett: For our structured, relational data, we make heavy use of MySQL. Despite collecting and analyzing a great deal of un-structured data, we haven’t invested much in a NoSQL or related infrastructure. Rather, we store and organize such data as raw files on Amazon’s S3 – it might be dirty, but we can easily mount and inspect file systems, use our Bash kung-fu, and pass S3 buckets to Hadoop/Elastic MapReduce.
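
For a sense of how lightweight that S3-as-raw-storage approach can be, here is a hedged sketch using the modern boto3 client (bucket, prefix and file names are invented; the interview does not say which client library is used):

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket/prefix layout: raw operator files dropped in by date.
    resp = s3.list_objects_v2(Bucket="example-raw-data", Prefix="cinema/2013-09-01/")
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])

    # Pull a single raw file down for inspection before deciding how to parse it.
    s3.download_file("example-raw-data", "cinema/2013-09-01/operator_42.csv", "/tmp/operator_42.csv")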

Q8. Do you use Hadoop? If yes, what is your experience with Hadoop so far?

Bassett: Yes, we sometimes use Hadoop for that “learning step” I described earlier, as well as for batch data-mining jobs on collected information. However, our experience is limited to Amazon’s Elastic MapReduce, which makes the whole process quite simple – we literally write our map and reduce procedures (in whatever language we choose), tell Amazon where to find the code and the data, and grab some coffee while we wait for the results.
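
That “map and reduce procedures in whatever language we choose” workflow is Hadoop Streaming: any executable that reads lines from stdin and writes tab-separated key/value pairs to stdout can act as a mapper or reducer. A minimal, generic Python pair might look like the following word-count stand-in (not one of NBCUniversal's actual jobs):

    # mapper.py - emit one (token, 1) pair per token read from stdin
    import sys

    for line in sys.stdin:
        for token in line.strip().split():
            print(f"{token}\t1")

    # reducer.py - Hadoop Streaming delivers mapper output sorted by key,
    # so a running total per key is enough to aggregate the counts
    import sys

    current_key, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")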

Q9. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?

Bassett: We don’t do any real-time analytics… yet. Thus far, we’ve created a lot of value from simulations that respond to changing marketing information.

Q10 Cloud computing and open source: Do you they play a role at Universal? If yes, how?

Bassett: Yes, cloud computing and open source play a major role in all our projects: our whole operation makes extensive use of Amazon’s EC2 and Elastic MapReduce for simulation and data mining, and S3 for data storage.

We’re big believers in functional programming – many projects start with “experimental programming” in Racket (a dialect of the Lisp programming language) and often stay there into production.

Additionally, we take advantage of the thriving Python community for computational statistics: IPython notebook, NumPy, SciPy, NLTK, et cetera.
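
As a small example of how that stack could be combined for the comment-sentiment work mentioned earlier, here is a hedged sketch using NLTK's VADER scorer together with NumPy (VADER is just one option; the interview does not specify which NLTK components are actually used):

    import numpy as np
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-off download of the VADER lexicon

    # Hypothetical audience comments scraped for a trailer.
    comments = [
        "Best trailer I have seen all year, cannot wait!",
        "Looks boring, another sequel nobody asked for.",
        "Not sure about the casting but the visuals are stunning.",
    ]

    sia = SentimentIntensityAnalyzer()
    scores = np.array([sia.polarity_scores(c)["compound"] for c in comments])

    # The mean compound score could then be one feature in a ratings/box-office model.
    print("sentiment per comment:", scores.round(2), "-> mean:", scores.mean().round(2))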

Q11 What are the main research challenges ahead? And what are the main business challenges ahead?

Bassett: I alluded to some already previously: collecting and analyzing multi-lingual data, promoting the use of predictive analytics, and making things fast.

Recruiting top talent is frequently a discussion among my colleagues, but we’ve been quite fortunate in this regard. (And we devote a great deal of time to training for machine learning and big data methods.)

Qx Anything else you wish to add?

Bassett: The most valuable thing I’ve learned in this role is that judicious use of a little bit of knowledge can go a long way. I’ve seen colleagues and other companies get caught up in the “Big Data” craze by spending hundreds of thousands of pounds sterling on a Hadoop cluster that sees a few megabytes a month. But the most successful initiatives I’ve seen treat it as another tool and keep an eye out for valuable problems that they can solve.

Thanks!

—–

Matthew Eric Bassett -Director of Data Science, NBCUniversal International
Matthew Eric Bassett is a programmer and mathematician from Colorado and started his career there building web and database applications for public and non-profit clients. He moved to London in 2007 and worked as a consultant for startups and small businesses. In 2011, he joined Universal Pictures to work on a system to quantify risk in the international box office market, which led to his current position leading a predictive analytics “restructuring” of NBCUniversal International.
Matthew holds an MSci in Mathematics and Theoretical Physics from UCL and is currently pursuing a PhD in Noncommutative Geometry at Queen Mary, University of London, where he is discovering interesting, if useless, applications of his field to number theory and machine learning.

Resources

How Did Big Data Help Obama Campaign? (Video Bloomberg TV)

Google’s Eric Schmidt Invests in Obama’s Big Data Brains (Bloomberg Businessweek Technology)

Cloud Data Stores – Lecture Notes: “Data Management in the Cloud”. Michael Grossniklaus, David Maier, Portland State University.
Lecture Notes | Intermediate/Advanced | English | DOWNLOAD ~280 slides (PDF)| 2011-12|

Related Posts

Big Data from Space: the “Herschel” telescope. August 2, 2013

Cloud based hotel management– Interview with Keith Gruen July 25, 2013

On Big Data and Hadoop. Interview with Paul C. Zikopoulos. June 10, 2013

Follow ODBMS.org on Twitter: @odbmsorg

##
