ODBMS Industry Watch – Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation.

On Vertica and the new combined Micro Focus company. Interview with Colin Mahony (25 October 2017)

” There has been no uncertainty with respect to the Micro Focus leadership’s commitment to building on the great brand and product we have developed at Vertica.”– Colin Mahony

I have interviewed Colin Mahony, SVP & General Manager, Vertica Product Group, Micro Focus.
In this interview we covered the recent spin-off of HPE software into a new, combined Micro Focus company, and how this is affecting Vertica. We also covered the new release of Vertica 9, and the importance of Big Data analytics.

RVZ

Q1. With the recent spin-off of HPE software into a new, combined Micro Focus company, do you see things changing for Vertica?

Colin Mahony: From a product development, sales and customer support perspective – it’s been business as usual at Vertica leading up to and since the spin-merge with Micro Focus. Our focus, as always, is to build the best possible product and deliver world-class support for our growing customer base. That won’t change any time soon.

The biggest change I see post spin-merge is that Vertica is now part of a pure-play software company, rather than a business where the majority of revenue comes from hardware. Running a software company is a lot different than running a hardware business. Under HPE, the software assets sometimes struggled in establishing their own identity as part of a much larger hardware business. Micro Focus, on the other hand, is designed from the ground up to build, sell and support software for our customers; that's all we do. The new, combined Micro Focus is the 7th largest pure-play software company in the world, and we have the global scale to be an industry shaper.
But maybe even more exciting is the level of support and GTM independence that we are already seeing from Micro Focus in support of Vertica. You have likely seen Vertica’s logo and you’ll continue to see more of that, especially on the Vertica.com website that we launched in February and that already has almost 1 million page views! We have been structured uniquely in the new Micro Focus and this gives me complete confidence in our future. I’m genuinely excited about the opportunity to be in a business that is dedicated and focused purely on software – especially software with analytics built in, the new Micro Focus company mission – and the business value of that software for customers.

Q2. There are concerns that Micro Focus may end up managing mature software assets of HPE and extending their shelf life, rather than actively investing in feature developments. What is your take on this?

Colin Mahony: I fundamentally disagree with that. Micro Focus helps companies bridge their existing technologies with new infrastructure and applications. It helps them maximize their ROI while embracing innovation to address the opportunities of the new Hybrid IT and analytics-driven environment. It’s frankly wrong to expect customers to make investments in core technologies without working hard to maximize the investment in those technologies. Over the years, Micro Focus has taken core assets and made them modern, delivering significant value to the company and our customers.

It’s also important to note that the new, combined Micro Focus has an incredible depth and breadth of software assets in its portfolio – covering DevOps, IT Operations, Cloud, Security, Big Data and more – not all of which are mature products.
Take SUSE for instance, a Micro Focus product and the fastest growing open Linux platform. I’m very impressed with the approach that Micro Focus has on supporting growth businesses like this. I have the very same expectations for our Vertica business, especially because this is a massive new opportunity for Micro Focus, which prior to the spin-merge did not have a Big Data offering.
This means no confusion, no duplication of resources, and a lot of potential because we know that every company in virtually every industry is thinking about how to leverage analytics at the core of everything they do, and again, why “analytics built in” is at the core of the new company’s mission.

Q3. Will Micro Focus continue to develop Vertica?

Colin Mahony: There has been no uncertainty with respect to the Micro Focus leadership’s commitment to building on the great brand and product we have developed at Vertica. Since the spin-merge with Micro Focus was first announced in 2016, we have actually been reinvigorating the Vertica brand name, all based on the recognition that Micro Focus has a tremendous market opportunity in front of it with the advent of Big Data and the growing importance all companies are placing on the value of analytics. You can see this commitment with the build-out of our new website, www.vertica.com, our presence at industry trade shows and conferences, and more.

In a recent interview, Chris Hsu, CEO of the new, combined Micro Focus, expressed his commitment to big data analytics – and specifically Vertica – as the number one area he is most excited to focus on and grow within the portfolio. It’s an exciting time to be part of Vertica. We have an incredible opportunity in front of us.

Q4. Micro Focus now has a number of software assets covering Hybrid IT, DevOps, Security and more, where analytics is critical. Does or will Vertica play a role in those products?

Colin Mahony: Absolutely. Not only is there a strong commitment in continuing to develop Vertica as a product and brand, there’s wide recognition within Micro Focus that predictive analytics is critical for the success of data-centric enterprises, and therefore a critical component to the breadth of assets in our own portfolio.

Vertica is an ideal solution for embedded analytics. Businesses that embed Vertica stand out from the competition and deliver higher value to customers. Specifically designed for analytic workloads, Vertica’s speed and performance, advanced analytics, ease of deployment, and support for data scientists make it tailor-made for embedding. We now have an opportunity to embed these great analytical features in a range of Micro Focus software assets, something we’ve already begun to do in application delivery management, IT operations and security. As I’ve said, a core part of our company’s core mission moving forward is to provide customers with enterprise-grade scalable software with analytics built in. I see this as a large and growing opportunity for innovation here at Micro Focus.

Q5. You recently released Vertica Version 9, with major enhancements in cloud deployments and separation of compute and storage. Are these common themes for Vertica moving forward?

Colin Mahony: They are. Vertica has always been 100% committed to helping our customers deploy advanced analytics free from underlying infrastructure and hardware lock-in. We’ve seen that legacy data warehouse solutions have forced many enterprises into rigid and high-cost proprietary hardware and analytics solutions supporting only limited data formats and deployment options. As data formats and storage locations continuously evolve, organizations require a powerful and unified solution to analyze data in the right place at the right time, with the performance and economics that the business requires. Our continued commitment to this principle – and our support for any major cloud platform, whether AWS, Azure or GCP – is foundational to Vertica’s core.

Separation of compute and storage is a logical extension of this product development ethos. Vertica’s beta release of its new Eon Mode architecture, offering separation of compute and storage, provides rapid elastic scaling up and down of the Vertica cluster, with just-in-time workload-based provisioning.
An intelligent, new caching mechanism on the nodes enables organizations to benefit from Vertica's industry-leading query performance. Companies in the AWS ecosystem will be able to leverage AWS S3 for storage and Vertica's query-optimized analytics engine for processing speed to capitalize on cloud economics.

You can expect continued product development and investment in these areas.

Q6. With the explosion of data lakes and other external data storage (including Hadoop, AWS S3, etc.), does this complicate the analytical database market or change the dynamics of how and where you analyze data?

Colin Mahony: It certainly changes the big data landscape. Hadoop has been a boon to companies and organizations that want to store vast new volumes of unstructured data cheaply in the form of a data lake. AWS S3 has extended that cheap storage to the cloud. Although Hadoop stores massive volumes of unstructured data, performing analytics on Hadoop proved challenging. Despite this challenge, companies did not want to move large amounts of data in and out of their Hadoop data lakes. As a result, more and more companies were looking to build out enterprise-grade SQL analytics on top of their Hadoop investments. This created a tremendous opportunity for Vertica, and Vertica for SQL on Hadoop was born. Vertica SQL on Hadoop is the same binary, the same core engine, with the ability to deploy natively on Hadoop nodes. Since then, we’ve continued to innovate on how Vertica integrates with the various Hadoop distributions and file formats. We’ve leveraged our years of experience in the Big Data analytics marketplace to enable organizations to analyze their data not only in place, but in the right place – without data movement – while supporting any major cloud deployment for fast and reliable read and write for multiple data formats.

Starting with the release of Vertica 8, users could derive more value from their Hadoop data lakes with Vertica’s high-performance Parquet and ORC Readers that enable users to securely access and analyze data that resides in Hadoop data lakes without copying or moving the data. And now with our latest Vertica 9 release, we’ve introduced a new HDFS Parquet writer – built on Vertica’s fast and reliable ability to not only read, but now write data and results on HDFS – to derive and contribute immediate insights on growing data lakes. Organizations can use Vertica 9’s flexible and expanded deployment options across on-premise, private, and public clouds, and on Hadoop and AWS S3 data lakes, to adopt a best-fit analytical solution.
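As a rough sketch of the Parquet reader and the new Vertica 9 Parquet writer described above, the statements below read Parquet files in place from an HDFS data lake and write aggregated results back. Issuing them from Python with the vertica_python client, plus the connection details, table names, and paths, are illustrative assumptions rather than anything specified in the interview.

```python
import vertica_python

# Connection details are placeholders.
conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "analytics"}

conn = vertica_python.connect(**conn_info)
cur = conn.cursor()

# Read Parquet files sitting in an HDFS data lake without copying or moving them.
cur.execute("""
    CREATE EXTERNAL TABLE web_clicks_ext (
        user_id   INT,
        click_ts  TIMESTAMP,
        url       VARCHAR(2048)
    ) AS COPY FROM 'hdfs:///data/clicks/*.parquet' PARQUET
""")

# Write aggregated results back to the data lake with the Vertica 9 Parquet writer.
cur.execute("""
    EXPORT TO PARQUET (directory = 'hdfs:///data/clicks_by_day')
    AS SELECT click_ts::DATE AS day, COUNT(*) AS clicks
       FROM web_clicks_ext
       GROUP BY 1
""")

conn.close()
```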

The days of having to move data in and out of various databases and data lakes are coming to an end. In the future, more and more companies will bring analytics to the data, analyzing it in place. We believe Vertica is working at the forefront of this market transformation.

Q7. Over the last few releases, Vertica has made significant advancements in the area of in-database machine learning. How do you see this set of capabilities contributing to Vertica’s strategy and the success of your customers?

Colin Mahony: There’s no doubt that machine learning and predictive analytics are, and will continue to be a core differentiator for organizations. In today’s data-driven world, creating a competitive advantage depends on your ability to transform massive volumes of data into meaningful insights. Vertica has always supported the world’s leading data-driven organizations with the fastest SQL and extended SQL analytics. And now, by building machine learning functions directly into Vertica’s core — with no need to download and install separate packages — we are transforming the way data scientists and analysts across industries interact with data; removing barriers and accelerating time to value on predictive analytics projects. And it’s not just about developing the right algorithms and models. Our goal at Vertica is to support the entire machine learning and predictive analytics process, from data preparation to model evaluation and deployment – all using Vertica’s industry-leading scalability and performance. I’m incredibly excited to see these features transform data science and predictive analytics projects within our customer base, and for this reason, in-database machine learning will play a major role in Vertica’s future, and the future of our customers.

Our commitment to this area can be seen in the latest Vertica 9 release, which provides a comprehensive set of new Machine Learning algorithms for categorization, prediction, and avoiding overfitting, enhancing processing speed by eliminating the need for down-sampling and data movement. There's also support for new data-preparation functions for deriving greater meaning from the data, while improving the quality of analysis, and a streamlined end-to-end workflow that simplifies production deployment of Machine Learning models – particularly for customers that embed Vertica and require the ability to replicate models across clusters.
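To make the in-database workflow concrete, here is a hedged sketch of training and scoring a logistic regression model entirely in SQL, with no separate package to install. The vertica_python client, the table and column names, and the model name are assumptions; the function names follow Vertica's documented in-database machine learning interface.

```python
import vertica_python

conn = vertica_python.connect(host="vertica.example.com", port=5433,
                              user="dbadmin", password="secret",
                              database="analytics")
cur = conn.cursor()

# Train a logistic regression model inside the database.
cur.execute("""
    SELECT LOGISTIC_REG('churn_model', 'customer_train', 'churned',
                        'tenure_months, monthly_spend, support_calls')
""")

# Inspect the fitted model.
cur.execute("SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='churn_model')")
print(cur.fetchall())

# Score new customers in place, with no data movement or down-sampling.
cur.execute("""
    SELECT customer_id,
           PREDICT_LOGISTIC_REG(tenure_months, monthly_spend, support_calls
                                USING PARAMETERS model_name='churn_model') AS churn_flag
    FROM customer_current
""")
for customer_id, churn_flag in cur.fetchall():
    print(customer_id, churn_flag)

conn.close()
```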

————————

Colin Mahony, SVP & General Manager, Vertica Product Group, Micro Focus

Colin Mahony leads the Vertica Product Group for Micro Focus, helping the world's most data-driven organizations to leverage and monetize their business data. Vertica was founded in 2005 and is one of the industry's fastest growing advanced analytics platforms, with in-database machine learning, the ability to analyze data in the right place, and freedom from underlying infrastructure. Micro Focus also leverages Vertica to deliver embedded analytics across a very broad portfolio of enterprise-grade software.

In 2011, Colin joined Hewlett Packard as part of the highly successful acquisition of Vertica, and took on the responsibility of VP and General Manager for HP Vertica, where he guided the business to remarkable annual growth and recognized industry leadership. Colin brings a unique combination of technical knowledge, market intelligence, customer relationships, and strategic partnerships to one of the fastest growing and most exciting segments of HP Software.

Prior to Vertica, Colin was a Vice President at Bessemer Venture Partners focused on investments primarily in enterprise software, telecommunications, and digital media. He established a great network and reputation for assisting in the creation and ongoing operations of companies through his knowledge of technology, markets and general management in both small startups and larger companies. Prior to Bessemer, Colin worked at Lazard Technology Partners in a similar investor capacity.

Prior to his venture capital experience, Colin was a Senior Analyst at the Yankee Group serving as an industry analyst and consultant covering databases, BI, middleware, application servers and ERP systems. Colin helped build the ERP and Internet Computing Strategies practice at Yankee in the late nineties.

Colin earned an M.B.A. from Harvard Business School and a bachelor's degree in Economics with a minor in Computer Science from Georgetown University. He is an active volunteer with Big Brothers Big Sisters of Massachusetts Bay and the Joey Fund for Cystic Fibrosis as well as a mentor and board member of Year Up Boston.

————–

Resources

– What’s New in Vertica 9.0?, ODBMS.org, 22 Oct, 2017

– What’s New in Vertica 9.0: Eon Mode Beta, ODBMS.org, 22 Oct, 2017

– Vertica Version 9.0, ODBMS.org, 22 Oct, 2017

– Micro Focus Introduces Vertica 9, ODBMS.org, Sept. 27, 2017

Follow us on Twitter: @odbmsorg

##

On Apache Ignite, Apache Spark and MySQL. Interview with Nikita Ivanov (30 June 2017)

“Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.”–Nikita Ivanov.

I have interviewed Nikita Ivanov, CTO of GridGain.
The main topics of the interview are Apache Ignite, Apache Spark and MySQL, and how they perform for big data analytics.

RVZ

Q1. What are the main technical challenges of SaaS development projects?

Nikita Ivanov: SaaS requires that the applications be highly responsive, reliable and web-scale. SaaS development projects face many of the same challenges as software development projects including a need for stability, reliability, security, scalability, and speed. Speed is especially critical for modern businesses undergoing the digital transformation to deliver real-time services to their end users. These challenges are amplified for SaaS solutions which may have hundreds, thousands, or tens of thousands of concurrent users, far more than an on-premise deployment of enterprise software.
Fortunately, in-memory computing offers SaaS developers solutions to the challenges of speed, scale and reliability.

Q2. In your opinion, what are the limitations of MySQL® when it comes to big data analytics?

Nikita Ivanov: MySQL was originally designed as a single-node system and not with the modern data center concept in mind. A MySQL installation cannot scale to accommodate big data on a single node. Instead, MySQL must rely on sharding, or splitting a data set over multiple nodes or instances, to manage large data sets. However, most companies shard their database manually, making the creation and maintenance of their application much more complex. Manually creating an application that can then perform cross-node SQL queries on the sharded data multiplies the level of complexity and cost.

MySQL was also not designed to run complicated queries against massive data sets. The MySQL optimizer is quite limited, executing a single query at a time using a single thread. A MySQL query can neither scale across multiple CPU cores in a single system nor execute distributed queries across multiple nodes.

Q3. What solutions exist to enhance MySQL’s capabilities for big data analytics?

Nikita Ivanov: Companies which require real-time analytics may attempt to manually shard their database. Tools such as Vitess, a framework YouTube released for MySQL sharding, or ProxySQL are often used to help implement sharding.
To speed up queries, caching solutions such as Memcached and Redis are often deployed.

Many companies turn to data warehousing technologies. These solutions require ETL processes and a separate technology stack which must be deployed and managed. There are many external solutions, such as Hadoop and Apache Spark, which are quite popular. Vertica and ClickHouse have also emerged as analytics solutions for MySQL.

Apache Ignite offers speed, scale and reliability because it was built from the ground up as a highly performant and highly scalable distributed in-memory computing platform.
In contrast to the MySQL single-node design, Apache Ignite automatically distributes data across nodes in a cluster, eliminating the need for manual sharding. The cluster can be deployed on-premise, in the cloud, or in a hybrid environment. Apache Ignite easily integrates with Hadoop and Spark, using in-memory technology to complement these technologies and achieve significantly better performance and scale. The Apache Ignite In-Memory SQL Grid is highly optimized and easily tuned to execute high-performance ANSI-99 SQL queries. The In-Memory SQL Grid offers access via JDBC/ODBC and the Ignite SQL API for external SQL commands or integration with analytics visualization software such as Tableau.
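A minimal sketch of what the data grid and the In-Memory SQL Grid look like from an application, here via Ignite's Python thin client (pyignite). The client choice, host/port, and cache/table names are assumptions, not taken from the interview; JDBC/ODBC access works analogously.

```python
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)  # default thin-client port

# Data grid: a distributed, queryable key-value cache.
cache = client.get_or_create_cache("session_cache")
cache.put("user:42", "Alice")
print(cache.get("user:42"))

# SQL grid: ANSI-99 SQL over the same cluster.
client.sql("CREATE TABLE IF NOT EXISTS city (id INT PRIMARY KEY, name VARCHAR)")
client.sql("INSERT INTO city (id, name) VALUES (?, ?)", query_args=[1, "Amsterdam"])
for row in client.sql("SELECT id, name FROM city"):
    print(row)

client.close()
```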

Q4. What is exactly Apache® Ignite™?

Nikita Ivanov: Apache Ignite is a high-performance, distributed in-memory platform for computing and transacting on large-scale data sets in real-time. It is 1,000x faster than systems built using traditional database technologies that are based on disk or flash technologies. It can also scale out to manage petabytes of data in memory.

Apache Ignite includes the following functionality:

· Data grid – An in-memory key value data cache that can be queried

· SQL grid – Provides the ability to interact with data in-memory using ANSI SQL-99 via JDBC or ODBC APIs

· Compute grid – A stateless grid that provides high-performance computation in memory using clusters of computers and massive parallel processing

· Service grid – A service grid in which grid service instances are deployed across the distributed data and compute grids

· Streaming analytics – The ability to consume an endless stream of information and process it in real-time

· Advanced clustering – The ability to automatically discover nodes, eliminating the need to restart the entire cluster when adding new nodes

Q5. How does Apache Ignite differ from other in-memory data platforms?

Nikita Ivanov: Most in-memory computing solutions fall into one of three types: in-memory data grids, in-memory databases, or a streaming analytics engine.
Apache Ignite is a full-featured in-memory computing platform which includes an in-memory data grid, in-memory database capabilities, and a streaming analytics engine. Furthermore, Apache Ignite supports distributed ACID compliant transactions and ANSI SQL-99 including support for DML and DDL via JDBC/ODBC.

Q6. Can you use Apache® Ignite™ for Real-Time Processing of IoT-Generated Streaming Data?

Nikita Ivanov: Yes, Apache Ignite can ingest and analyze streaming data using its streaming analytics engine which is built on a high-performance and scalable distributed architecture. Because Apache Ignite natively integrates with Apache Spark, it is also possible to deploy Spark for machine learning at in-memory computing speeds.
Apache Ignite supports both high volume OLTP and OLAP use cases, supporting Hybrid Transactional Analytical Processing (HTAP) use cases, while achieving performance gains of 1000x or greater over systems which are built on disk-based databases.

Q7. How do you stream data to an Apache Ignite cluster from embedded devices?

Nikita Ivanov: It is very easy to stream data to an Apache Ignite cluster from embedded devices.
The Apache Ignite streaming functionality allows for processing never-ending streams of data from embedded devices in a scalable and fault-tolerant manner. Apache Ignite can handle millions of events per second on a moderately sized cluster for embedded devices generating massive amounts of data.

Q8. Is this different than using Apache Kafka?

Nikita Ivanov: Apache Kafka is a distributed streaming platform that lets you publish and subscribe to data streams. Kafka is most commonly used to build a real-time streaming data pipeline that reliably transfers data between applications. This is very different from Apache Ignite, which is designed to ingest, process, analyze and store streaming data.

Q9. How do you conduct real-time data processing on this stream using Apache Ignite?

Nikita Ivanov: Apache Ignite includes a connector for Apache Kafka so it is easy to connect Apache Kafka and Apache Ignite. Developers can either push data from Kafka directly into Ignite’s in-memory data cache or present the streaming data to Ignite’s streaming module where it can be analyzed and processed before being stored in memory.
This versatility makes the combination of Apache Kafka and Apache Ignite very powerful for real-time processing of streaming data.
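A hedged sketch of the first pattern described above, pushing events from Kafka directly into an Ignite cache, using the kafka-python consumer and the pyignite thin client. The libraries, topic, broker address, and cache name are all assumptions for illustration; Ignite's own Kafka connector or its Java streaming API could be used instead.

```python
import json
from kafka import KafkaConsumer
from pyignite import Client

# Ignite thin client (host/port are placeholders).
ignite = Client()
ignite.connect("127.0.0.1", 10800)
readings = ignite.get_or_create_cache("sensor_readings")

# Kafka consumer on an assumed device-events topic.
consumer = KafkaConsumer(
    "device-events",
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Key each event by device id and timestamp before it lands in memory.
    readings.put(f"{event['device_id']}:{event['ts']}", json.dumps(event))
```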

Q10. Is this different than using Spark Streaming?

Nikita Ivanov: Spark Streaming enables processing of live data streams. This is merely one of the capabilities that Apache Ignite supports. Although Apache Spark and Apache Ignite utilize the power of in-memory computing, they address different use cases. Spark processes but doesn’t store data. It loads the data, processes it, then discards it. Ignite, on the other hand, can be used to process data and it also provides a distributed in-memory key-value store with ACID compliant transactions and SQL support.
Spark is also for non-transactional, read-only data while Ignite supports non-transactional and transactional workloads. Finally, Apache Ignite also supports purely computational payloads for HPC and MPP use cases while Spark works only on data-driven payloads.

Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.

Qx. Is there anything else you wish to add?

Nikita Ivanov: The world is undergoing a digital transformation which is driving companies to get closer to their customers. This transformation requires that companies move from big data to fast data, the ability to gain real-time insights from massive amounts of incoming data. Whether that data is generated by the Internet of Things (IoT), web-scale applications, or other streaming data sources, companies must put architectures in place to make sense of this river of data. As companies make this transition, they will be moving to memory-first architectures which ingest and process data in-memory before offloading to disk-based datastores, and they will increasingly be applying machine learning and deep learning to better understand the data. Apache Ignite continues to evolve in directions that will support and extend the abilities of memory-first architectures and machine learning/deep learning systems.

——–
Nikita Ivanov, Founder & CTO, GridGain
Nikita Ivanov is founder of the Apache Ignite project and CTO of GridGain Systems, started in 2007. Nikita has led GridGain to develop advanced and distributed in-memory data processing technologies – the top Java in-memory data fabric, started every 10 seconds around the world today. Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, and contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. He is an active member of the Java middleware community and a contributor to the Java specification. He is also a frequent international speaker with over two dozen talks at developer conferences globally.

Resources

Apache Ignite Community Resources

apache/ignite on GitHub

Yardstick Apache Ignite Benchmarks

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite

Misys Uses GridGain to Enable High Performance, Real-Time Data Processing

The Spark Python API (PySpark)

Related Posts

Supporting the Fast Data Paradigm with Apache Spark. BY Stephen Dillon, Data Architect, Schneider Electric

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah. ODBMS Industry Watch, March 13, 2017

Follow ODBMS.org on Twitter: @odbmsorg

##

Identity Graph Analysis at Scale. Interview with Niels Meersschaert (9 May 2017)

“I’ve found the best engineers actually have art backgrounds or interests. The key capability is being able to see problems from multiple perspectives, and realizing there are multiple solutions to a problem. Music, photography and other arts encourage that.”–Niels Meersschaert.

I have interviewed Niels Meersschaert, Chief Technology Officer at Qualia. The Qualia team relies on over one terabyte of graph data in Neo4j, combined with larger amounts of non-graph data to provide major companies with consumer insights for targeted marketing and advertising opportunities.

RVZ

Q1. Your background is in Television & Film Production. How does it relate to your current job?

Niels Meersschaert: Engineering is a lot like producing. You have to understand what you are trying to achieve, understand what parts and roles you'll need to accomplish it, all while doing it within a budget. I've found the best engineers actually have art backgrounds or interests. The key capability is being able to see problems from multiple perspectives, and realizing there are multiple solutions to a problem. Music, photography and other arts encourage that. Engineering is both art and science, and creativity is a critical skill for the best engineers. I also believe that a breadth of languages is critical for engineers.

Q2. Your company collects data on more than 90% of American households. What kind of data do you collect and how do you use such data?

Niels Meersschaert: We focus on high quality data that is indicative of commercial intent. Some examples include wishlist interaction, content consumption, and location data. While we have coverage of a huge swath of the American population, a key feature is that we have no personally identifiable information. We use anonymous unique identifiers.
So, we know this ID did actions indicative of interest in a new SUV, but we don’t know their name, email address, phone number or any other personally identifiable information about a consumer. We feel this is a good balance of commercial need and individual privacy.

Q3. If you had to operate with data from Europe, what would be the impact of the new EU General Data Protection Regulation (GDPR) on your work?

Niels Meersschaert: Europe is a very different market than the U.S. and many of the regulations you mentioned do require a different approach to understanding consumer behaviors. Given that we avoid personal IDs, our approach is already better situated than that of many peers that rely on PII.

Q4. Why did you choose a graph database to implement your consumer behavior tracking system?

Niels Meersschaert: Our graph database is used for ID management. We don’t use it for understanding the intent data, but rather recognizing IDs. Conceptually, describing the various IDs involved is a natural fit for a graph.
As an example, a conceptual consumer could be thought of as the top of the graph. That consumer uses many devices and each device could have 1 or more anonymous IDs associated with it, such as cookie IDs. Each node can represent an associated device or ID and the relationships between each node allow us to see the path. A key element we have in our system is something we call the Borg filter. It’s a bit of a reference to Star Trek, but essentially when we find a consumer is too connected, i.e. has dozens or hundreds of devices, we remove all those IDs from the graph as clearly something has gone wrong. A graph database makes it much easier to determine how many connected nodes are at each level.

Q5. Why did you choose Neo4j?

Niels Meersschaert: Neo4J had a rich query language and very fast performance, especially if your hot set was in RAM.

Q6. You manage one terabyte of graph data in Neo4j. How do you combine them with larger amounts of non-graph data?

Niels Meersschaert: You can think of the graph as a compression system for us. While consumer actions occur on multiple devices and anonymous IDs, they represent the actions of a single consumer. This actually simplifies things for us, since the number of unique grouping IDs is much smaller than the number of unique source IDs. It also allows us to eliminate non-human IDs from the graph. This does mean we see the world in different ways than many peers. As an example, if you focus only on cookie IDs, you tend to have a much larger number of unique IDs than actual consumers those represent. Sadly, the same thing happens with website monthly uniques: many are highly inflated, both in the number of unique people they represent and because many of the IDs are non-human. Ultimately, the entire goal of advertising is to influence consumers, so we feel that having a better representation of actual consumers allows us to be more effective.

Q7. What are the technical challenges you face when blending data with different structure?

Niels Meersschaert: A key challenge is finding some unifying element between different systems or structures that links the data. What we did with Neo4j is create a unique property on the nodes that we use for interchange. The internal node IDs that are part of Neo4j aren't something we use except internally within the graph DB.
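For illustration, such an interchange property can be backed by a uniqueness constraint and used as the MERGE key, so external systems never depend on Neo4j's internal node IDs. The label and property names below are again assumptions, and the constraint syntax is the Neo4j 3.x form.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Enforce uniqueness of the interchange property.
    session.run("CREATE CONSTRAINT ON (d:Device) ASSERT d.external_id IS UNIQUE")

    # Upsert by the interchange key rather than by the internal node id.
    session.run(
        "MERGE (d:Device {external_id: $ext_id}) SET d.last_seen = timestamp()",
        ext_id="cookie-8f3a2c",
    )

driver.close()
```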

Q8. If your data is sharded manually, how do you handle scalability?

Niels Meersschaert: We don’t shard the data manually, but scalability is one of the biggest challenges. We’ve spent a lot of time tuning queries and grouping operations to take advantage of some of the capabilities of Neo4J and to work around some limitations it has. The vast majority of graph customers wouldn’t have the volume nor the volatility of data that we do, so our challenges are unique.

Q9. What other technologies do you use and how they interact with Neo4j?

Niels Meersschaert: We use the classic big data tools like Hadoop and Spark. We also use MongoDB and Google’s Big Query. If you look at the graph as the truth set of device IDs, we interact with it on ingestion and export only. Everything in the middle can operate on the consumer ID, which is far more efficient.

Q10. How do you measure the ROI of your solution?

Niels Meersschaert: There are a few factors we consider. First is how much does the infrastructure cost us to process the data and output? How fast is it in terms of execution time? How much development effort does it take relative to other solutions? How flexible is it for us to extend it? This is an ever evolving situation and one we always look at how to improve, especially as a smaller business.

———————————-

Niels Meersschaert
I’ve been coding since I was 7 years old on an Apple II. I’d built radio control model cars and aircraft as a child and built several custom chassis using controlled flex as suspension to keep weight & parts count down. So, I’d had an early interest in both software and physical engineering.

My father was from the Netherlands and my maternal grandfather was a linguist fluent in 43 languages. When I was a kid, my father worked for the airlines, so we traveled often to Europe to see family, and I grew up multilingual. Computer languages are just different ways to describe something; the basic concepts are similar, just as they are in spoken languages, albeit with different grammatical and syntax structures. Whether you're speaking French, or writing a program in Python or C, the key is that you are trying to get your communication across to the target of your message, whether it is another person or a computer.

I originally started university in aeronautical engineering, but in my sophomore year, Grumman let go about 3000 engineers, so I didn’t think the career opportunities would be as great. I’d always viewed problem solutions as a combination of art & science, so I switched majors to one in which I could combine the two.

After school I worked producing and editing commercials and industrials, often with special effects. I got into web video early on & spent a lot of time on compression and distribution systems. That led to working on search, and bringing the linguistics back front and center again. I then combined the two and came full circle back to advertising, but from the technical angle at Magnetic, where we built search retargeting. At Qualia, we kicked this into high gear, where we understand consumer intent by analyzing sentiment, content and actions across multiple devices and environments and the interaction and timing between them to understand the point in the intent path of a consumer.

Resources

EU General Data Protection Regulation (GDPR):

Reform of EU data protection rules

European Commission – Fact Sheet Questions and Answers – Data protection reform

General Data Protection Regulation (Wikipedia)

Neo4j Sandbox: The Neo4j Sandbox enables you to get started with Neo4j, with built-in guides and sample datasets for popular use cases.

Related Posts

LDBC Developer Community: Benchmarking Graph Data Management Systems. ODBMS.org, 6 APR, 2017

Graphalytics benchmark. ODBMS.org, 6 APR, 2017
The Graphalytics benchmark is an industrial-grade benchmark for graph analysis platforms such as Giraph. It consists of six core algorithms, standard datasets, synthetic dataset generators, and reference outputs, enabling the objective comparison of graph analysis platforms.

Collaborative Filtering: Creating the Best Teams Ever. By Maurits van der Goes, Graduate Intern | February 16, 2017

Follow us on Twitter: @odbmsorg

##

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah (13 March 2017)

“What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).”–Amr Awadallah

I have interviewed Amr Awadallah, Chief Technology Officer at Cloudera.
The main topics of the interview are: the new developments in Apache Spark 2.0 Beta and the Hadoop 3.0.0-alpha1 release; the lessons learned from Amr's experience of using Hadoop at Yahoo!; and the business problems that the world's leading organisations have.

RVZ

Q1. Before Cloudera, you served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organisations to use Hadoop for data analysis and business intelligence. What are the main lessons you learned in that period?

Amr Awadallah: Couple of things. First, I learned that Hadoop is capable of solving all the business intelligence problems that I had at Yahoo.
Namely:
(1) our systems weren’t scaling fast enough (we needed to cut down transformation times from hours to minutes),
(2) our systems weren’t economical on a $/TB basis thus making it hard to retain valuable data for longer time periods, and (3) we needed new methods to be able to store and analyze semi-structured (e.g. logs) and unstructured data (e.g. social media).
By implementing Hadoop in our team we saw first hand how it can address all these problems. The second lesson that I learned was that Hadoop, back then, was very rough to deploy and program against (it took us many months to deploy it and reprogram our transformations to run on it). It was these lessons that made it clear that there was room for a startup to focus on Hadoop since (1) it was solving very real data problems that many organizations would face, and (2) it needed a lot of polish to make it work smoothly, securely, and reliably within the enterprise.

Q2. In 2008 you founded Cloudera together with Mike Olson (Oracle), Jeff Hammerbacher (Facebook) and Christophe Bisciglia (Google). What was your main motivation at that time?

Amr Awadallah: Pretty much to do what I describe above, we wanted to make the Hadoop technology easy to use for organizations. That included: (1) creating a distribution for Hadoop that bundles all the necessary open-source projects that make it work (we call that CDH, short for Cloudera Distribution for Apache Hadoop). (2) We also created a number of proprietary system management, security, and meta-data management tools around CDH to make it easier for organizations to deploy and operate Hadoop in production.

Q3. What are the typical challenging business problems that world’s leading organisations have?

Amr Awadallah: The technology we provide is very powerful and can be used to solve many problems across many industries, but we see four common themes: The first is simply using Hadoop as a faster, bigger, cheaper system for business intelligence and data analytics. i.e. a lot of organizations just use us to do things they have been doing already, just doing these things in a more economically scalable way.
The second use case is around deeper understanding of customers, i.e. moving away from segmenting all customers into a number of predefined buckets, but rather creating a dynamic micro-segment addressing each customer in a more precise way (thus reducing false positives).
The third use case is about using data to build better products and services, and this use-case is catalyzed by the internet-of-things. Thanks to smart sensors we are able to measure the real world better than ever before; so this use-case is about taking all that data and leveraging it to either enhance our current product/service offerings, or build entirely new ones.
The fourth use case is about reducing business risk, and it manifests itself in a number of different sub-cases depending on the industry. For example, cyber-security is one of the key ways to reduce risk, and we have an open source project co-developed with Intel, called Apache Spot, which organizations can use to collect all their network flow data then use Spark machine learning algorithms to detect the anomalies in that data. Anti-money laundering and fraud detection is another way that our banking customers employ our platform to reduce risk within their businesses. Similarly, our insurance industry customers use our system to detect fraudulent claims, etc.

Q4. Can they be solved by analysing data? Can you give us some examples of how the use of advanced analytics drive business decisions?

Amr Awadallah: Yes, all the problems mentioned above can be solved with data. I want to highlight though that this isn’t necessarily about business decisions, which is what the Business Intelligence movement was about (we just help make that cheaper and faster). What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).
One of my favorite examples is a solution that one of our customers built to give voice to premature babies in neonatal intensive care units. They analyze the signals coming from the baby (sounds, blood pressure, heart rate, temperature, few brain signals), and based on that a message appears on the monitor above the infant showing the nurse if they are hungry, distressed from too much noise or light, etc.
That is really what we mean by using data to create new products and services that weren’t possible before (and not just reports/dashboard).

Q4. Graphs are important. Is it possible to do scalable graph analytics? If yes, how?

Amr Awadallah: Graphs are indeed important, and a lot of our customer use-cases trace back to that (not just social media analytics; for example, anti-money laundering requires analyzing relationships between many financial accounts to detect bad behaviors, and similarly for cyber security applications). I think scalability depends a fair bit on what's being analyzed and what we mean by scalable. But for most practical purposes I would say Spark's GraphX is good enough. For example, you can compute PageRank fairly efficiently and scalably on a cluster using GraphX.
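GraphX itself is a Scala API; the sketch below shows the same PageRank idea from PySpark using the separate GraphFrames package (an assumed substitute, not something mentioned in the interview) on a toy graph of accounts and transfers.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # requires launching Spark with the graphframes package

spark = SparkSession.builder.appName("pagerank-sketch").getOrCreate()

# Toy account graph; in the anti-money-laundering case these would be
# financial accounts and the transfers between them.
vertices = spark.createDataFrame(
    [("a", "Account A"), ("b", "Account B"), ("c", "Account C")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "transfer"), ("b", "c", "transfer"), ("c", "a", "transfer")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()
```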

Q5. Data security is increasing important. The risk is due to the growing number of device endpoints. What solutions do exist to minimise such risk?

Amr Awadallah: A comprehensive enterprise data security strategy seeks to mitigate the risks presented by a growing number of potentially compromised endpoints connecting to corporate networks. Endpoint security will enable one or all of the following preventative controls:
The first is policy-based enforcement of endpoint security configuration prior to granting an endpoint access to network-based corporate assets. This ensures that any endpoint connected to corporate networks meets minimum requirements for endpoint security configuration.
The second measure is endpoint based anti-malware software (the existence of which may be a policy requirement to connect to the network per the first measure). Anti-malware prevents malicious code from infecting endpoints by monitoring for changes to system configuration and unusual activity or processes.
The third measure is endpoint encryption of corporate data on hard drives, folders and even removable media.
As mentioned above we also collaborate with Intel on Apache Spot, which tracks network flow patterns to detect anomalous communication behavior between different devices (including end point devices). Apache Spot just recently won InfoWorld 2017 Tech of the Year Award. Other advanced analytics security partners we closely work with are: CounterTack, Securonix, Niara, and Jask.

Q6. You recently announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform. How does it work? And how does it differ from the Hadoop-based data platform?

Amr Awadallah: First, at a meta-level, Hadoop (MapReduce specifically) was very good at achieving scalable computation by spreading jobs across many CPU cores and hard disk spindles. That said, MapReduce wasn’t very efficient in how it leveraged memory to optimize the performance of data processing pipelines that have many stages or iterations.
The main power of Spark, that made it take over from MapReduce, was how it truly leveraged memory to achieve better performance in deep or iterative data pipelines. That coupled with a simpler developer API made Spark take over very quickly from MapReduce.
Most of our new customer implementations for data processing or data science tend to be in Spark these days, versus MapReduce.
I should clarify however that this doesn’t mean that Hadoop is dead as some say. Apache Hadoop is comprised of three key subsystems: (1) MapReduce for computation, (2) YARN for resource scheduling, and (3) HDFS for storage. Spark only replaces MapReduce, we still rely heavily on both YARN and HDFS.

That said, the most notable features in Apache Spark 2.0 are:

1) Dataset API: It is a new API that represents the distributed collections of objects processed by Spark’s execution engine. It is an extension of Spark’s Dataframe API. It improves upon the Dataframe API by providing type-safe, object oriented programming interfaces. Users can now write User-Defined Functions and Lambda functions that provide compile time type safety. With the Dataset API, users benefit from optimized operations (like sort, join, hash, etc) in the SparkSQL engine, while also getting compile time type safety for user defined functions.

2) Model & Pipeline Persistence in Spark’s ML library: Machine learning Pipelines built with Spark’s ML library can now be serialized to a file and read back in.
The ability to save and reload these pipelines makes it easy for users to perform version control on the pipelines and safely distribute the pipelines. This helps in operationalizing them in production systems.

3) Structured Streaming: New stream processing API and engine that provides SQL like abstractions for authoring operations on data streams, and also improves performance by using the SparkSQL engine for processing the data streams. However, this is still an experimental API and not ready for production usage yet.
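Two of the three features above lend themselves to short PySpark sketches. First, model and pipeline persistence (feature 2): a fitted pipeline is written to a path and reloaded, for example on a production cluster. The tiny training table and the path are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-persistence").getOrCreate()

train = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (3.0, 4.0, 0.0), (4.0, 3.0, 1.0)],
    ["f1", "f2", "label"])

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train)

# New in Spark 2.0: the fitted pipeline can be saved and reloaded elsewhere.
model.write().overwrite().save("/tmp/example_pipeline")
reloaded = PipelineModel.load("/tmp/example_pipeline")
reloaded.transform(train).select("prediction").show()
```

Second, a minimal Structured Streaming job (feature 3): the classic word count over a socket source, written against the same DataFrame-style API. The socket source and console sink are placeholders, and as noted above this API was still experimental in Spark 2.0.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Treat lines arriving on a socket as an unbounded table.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the running counts to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```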

Besides the above 3 notable enhancements, there are a bunch of performance and scalability improvements across the board.

Q7. Apache Impala vs. Amazon Redshift: How Does Redshift Compare to Impala?

Amr Awadallah: Apache Impala is an analytic database engine architecturally designed to perform high-performance highly-concurrent SQL analytics on scalable, open data platforms like Hadoop’s HDFS and Amazon S3.
Impala decouples data storage from compute and lets users query data without having to move/load data specifically into an Impala storage-engine (it doesn’t have one). This architectural difference uniquely enables Impala to deliver a more flexible Business Intelligence experience than traditional database architectures like Redshift (which requires pre-loading the data).

Some of the key benefits of the Impala approach include:

* On-demand resources that are immediately ready to query existing S3 data without loading to a different data silo
* Ability to elastically grow/shrink clusters as needed due to decoupled storage and compute
* More predictable, multi-tenant isolation due to the ability to have multiple Impala clusters sharing a common S3 data repository
* Ability to share common data not only amongst Impala clusters, but also any application that runs on cloud-native S3 storage (for example, you can have both Apache Impala and Apache Spark run against the same data asset in S3, while it isn’t possible to have Apache Spark easily access the data stored in Redshift, it has to go through SQL first).
* Greater flexibility to explore new use cases, analytics, and data by directly querying S3 without rigid traditional data models and ETL
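As a concrete illustration of querying existing S3 data in place without a load step, the sketch below issues Impala SQL from Python via the impyla client. The client library, coordinator host, bucket, and table schema are illustrative assumptions rather than anything from the interview; any JDBC/ODBC BI tool could issue the same statements.

```python
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# Point an external table at Parquet files already sitting in S3, with no load step.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS clicks (
        user_id STRING,
        ts      TIMESTAMP,
        url     STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://my-bucket/clickstream/'
""")

# Query the S3 data directly with high-concurrency SQL.
cur.execute("""
    SELECT url, COUNT(*) AS hits
    FROM clicks
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
for url, hits in cur.fetchall():
    print(url, hits)

conn.close()
```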

Not only does Impala deliver this additional flexibility, it does so at greater cost-performance and scalability compared to Redshift. See the following benchmark for data on that.

That said, Redshift's sweet spot is a different target, as a smaller data mart: most Redshift installations are in the dozens-of-nodes range, where Redshift's limitations in scalability, elasticity, flexibility, and its requirement to maintain separate copies of data are less critical.

Q8. What is Apache Kudu, and why is it relevant for Impala Users?

Amr Awadallah: Historically we had two storage engines in our distribution: (1) HDFS which is optimized for high-throughput analytics, but doesn’t support updates/inserts and (2) HBase which is optimized for low-latency updates/inserts but isn’t good for doing high-throughput queries. To build a proper data warehouse or time-series analytics system, you typically still need to make updates/inserts and that was why we created Apache Kudu.

Kudu is a new storage system that combines the benefits of both HDFS and HBase into one: it allows for low-latency updates/inserts, but also supports high-throughput analytical queries (i.e. fast analytics on fast moving data).
Unlike HDFS, Kudu is not a file-system, it is a record-based system, so the unit of storage is a record as opposed to a file. This allows Kudu to unlock Impala for real-time streaming applications that were not possible with HDFS.
In HDFS the data would only be visible to Impala after we finish closing the file, which typically happens after a large number of records are accumulated (that adds latency between when records are written to when they become visible to the analytical engine). With Kudu as soon as a record is written it is immediately visible to the Impala analytical engine. Finally, just like HDFS and HBase, the Kudu storage engine is fully integrated with our entire stack, not just Impala.
For example, you can also use Apache Spark for machine-learning jobs directly against Kudu.
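A short sketch of the Impala-on-Kudu pattern just described: a record-based table with a primary key that supports low-latency upserts, where newly written rows are immediately visible to analytical queries. The table, columns, and the use of the impyla client are assumptions for illustration.

```python
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# A Kudu-backed table: record-oriented storage with a primary key.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        device_id STRING,
        ts        TIMESTAMP,
        reading   DOUBLE,
        PRIMARY KEY (device_id, ts)
    )
    PARTITION BY HASH (device_id) PARTITIONS 16
    STORED AS KUDU
""")

# Low-latency upsert; the row is visible to analytics as soon as it is written.
cur.execute("UPSERT INTO sensor_readings VALUES ('dev-001', now(), 21.7)")

cur.execute("SELECT COUNT(*) FROM sensor_readings WHERE device_id = 'dev-001'")
print(cur.fetchone())

conn.close()
```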

Q9. The Apache Hadoop project recently announced its 3.0.0-alpha1 release. What is it?

Amr Awadallah: HDFS Erasure Encoding is really the main exciting new feature in Hadoop 3. Traditionally HDFS required three replicas, by default, for every data block to achieve durability, concurrent performance, and availability. Using erasure encoding techniques, HDFS in Hadoop 3 allows us to significantly reduce the storage overhead from 3x (i.e. 200%) to just 20% extra bits for parity. This will allow us to achieve the same durability benefits of 3x replication, but it comes at the cost of potentially lower concurrent performance (when more than one job is trying to access the same block at the same time) and lower availability resilience in the face of top-of-rack switch failures (less of an issue these days).

Other cool additions are ATS v2 and classpath isolation which you can read more about here

Q10. What is the roadmap ahead for Cloudera Enterprise?

Amr Awadallah: We don’t discuss details of our product roadmap publicly, but there are three guiding themes for us in 2017: The first theme is fast-analytics on fast-moving data (which I covered above in regards to Kudu).
The second theme is cloud, which is making Cloudera Enterprise work better in cloud environments, and make it easier to move workloads (and skill sets) from on-premise clusters to transient cloud clusters in AWS, Azure, and/or Google Cloud.
The third theme is simplifying data-science and machine learning development, especially reducing the time from when a new algorithm is developed to how it can be deployed into production (stay tuned for more on that front).
——————————
Amr Awadallah, Ph.D. Chief Technology Officer, Cloudera
Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo after they acquired his first startup, VivaSmart, in July of 2000. Amr holds Bachelor's and Master's degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.

Resources

Download Page for Apache Spark™

Apache Impala supported by Cloudera Enterprise

DATA-X: Videobook- 8 short videos introduce query analytics for Apache Hadoop

A package that allows R developers to use Hadoop HBase

Book: Big Data Analytics with Spark

Related Posts

Streaming Analytics for Chain Monitoring. By Natalino Busa, Head of Data Science at Teradata. ODBMS.org, Thursday, January 12, 2017

Five Challenges to IoT Analytics Success. By Dr. Srinath Perera. ODBMS.org SEPTEMBER 23, 2016

Next-Generation Genomics Analysis with Apache Spark. by Jason Bailey. ODBMS.org Thursday, June 30th, 2016

Supporting the Fast Data Paradigm with Apache Spark. BY Stephen Dillon, Data Architect, Schneider Electric. ODBMS.org, 23 APR, 2016

– The new series of Q&A with Leading Data Scientists– ODBMS.org:
Part II
Part I

Follow us on Twitter: @odbmsorg

##

On Data Analysis. Interview with Rob Winters (9 January 2017)

“I’ve managed several employees who have successfully transitioned from an operations role to an analytics role. In fact, some of them have become my best analysts because they have brought a deeper domain knowledge to their analyses than someone approaching from the outside may have done. “–Rob Winters

I have interviewed Rob Winters, Head of Business Intelligence at TravelBird. The interview covers Rob's project experience with data analytics and HPE Vertica.

RVZ

Q1. What is the business of TravelBird?

Rob Winters: TravelBird builds and provides a daily selection of inspirational holiday offerings in twelve markets across Europe. Our goal is to create packages which excite the imagination and bring simplicity and joy to the act of travelling. These packages are then shared with our travellers via email, our website, and our iOS and Android applications.

Q2. What are the current data projects at TravelBird?

Rob Winters: TravelBird’s journey with being data driven is relatively short, beginning our initial Business Intelligence buildout in mid-2015. Currently our BI team is engaged in a number of projects, both more traditional BI and advanced analytics, including:
– Building data sources and training an organization in self-service BI
– Replacing our generic daily selections with personalized content selection models
– Optimizing pricing of packages based on product price volatility and customer demand
– Adjusting email frequency and timing to improve customer engagement and lifetime value

Q3. What is your experience in using predictive analytics?

Rob Winters: I have been working in the predictive analytics field for six years now across a variety of problem areas – customer service, retail, gaming, and now travel. From a technology standpoint I originally worked heavily with commercial solutions (Teradata, SAS) but for the last four years have used almost exclusively open source software including Hadoop, Spark, R, and Python.

Q4. How do you evaluate whether the insights you discover are “good”?

Rob Winters: During the initial development of our algorithms we typically follow a basic version of CRISP-DM to build an initial working model for our problem. To test models, we always use an A/B test and typically follow a two-phase process: first the model is split-tested against the current operational process/human selection; then, once the model consistently outperforms the status quo, we test future model iterations against the control.
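
As a generic illustration of that split-testing step (not TravelBird’s actual tooling; the counts and the scipy-based test are assumptions for the example), a minimal Python sketch might look like this:

# Illustrative only: compare conversion rates of a model variant against the
# human-curated control using a chi-square test of independence.
# The counts below are invented, not real TravelBird figures.
from scipy.stats import chi2_contingency

control = {"conversions": 1180, "non_conversions": 48820}   # human selection
variant = {"conversions": 1315, "non_conversions": 48685}   # model selection

table = [
    [control["conversions"], control["non_conversions"]],
    [variant["conversions"], variant["non_conversions"]],
]
chi2, p_value, dof, expected = chi2_contingency(table)

control_rate = control["conversions"] / sum(control.values())
variant_rate = variant["conversions"] / sum(variant.values())
print(f"control={control_rate:.4f} variant={variant_rate:.4f} p-value={p_value:.4f}")
# Only once the variant consistently outperforms does it become the new control,
# mirroring the two-phase process described above.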

Q5. Can you tell us a bit about the work you did in designing and implementing a fully automated, machine learning based content selection platform?

Rob Winters: To provide context, every day our planning team creates six unique product offerings for their target market of 50-500k customers to be shared via web, iOS/Android app, and email. Our goal was to replace that model with one that selects six unique products for each recipient based on past browsing and travel behavior. To do so, we designed an ensemble model consisting of several components:
– A customer preference model (user-item recommendation model)
– A product similarity model (item-item similarity)
– A “hotness” model to promote destinations which are trending/outperforming/expected to do well
– A portfolio model to select the right diversity for each recipient based on recommendation confidence, lifecycle state, and yield optimization of cannibalization vs product fit for a recipient

The data to feed these models is based on observing dozens of events per recipient per day, positive and negative feedback events of the recipient, all observable product features, and human expert input. The models are also able to improve themselves by continuously tuning the input parameters of each model based on recommendation split testing.
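
As a purely hypothetical sketch of how such an ensemble might blend component scores into a top-six selection (the component functions, weights and product names below are placeholders, and the portfolio/diversity step is simplified to a single weighted term):

# Hypothetical ensemble blend: each component scores a (user, product) pair;
# a weighted sum ranks the candidates and the top six are selected.
from typing import Callable, Dict, List, Tuple

def select_top_products(
    user: str,
    candidates: List[str],
    components: Dict[str, Tuple[Callable[[str, str], float], float]],
    top_n: int = 6,
) -> List[str]:
    """Rank candidate products for a user by a weighted sum of component scores."""
    scored = []
    for product in candidates:
        total = sum(weight * score_fn(user, product)
                    for score_fn, weight in components.values())
        scored.append((total, product))
    scored.sort(reverse=True)
    return [product for _, product in scored[:top_n]]

# Placeholder components standing in for the preference, similarity,
# "hotness" and portfolio models described above.
components = {
    "preference": (lambda u, p: 0.5, 0.4),
    "similarity": (lambda u, p: 0.3, 0.2),
    "hotness":    (lambda u, p: 0.7, 0.3),
    "portfolio":  (lambda u, p: 0.1, 0.1),
}
print(select_top_products("user_42",
                          ["rome", "lisbon", "crete", "ibiza", "paris", "oslo", "bali"],
                          components))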

Q6. What are the primary technologies you are using?

Rob Winters: Our technology stack consists of the following:
-BI: Tableau
-Data warehousing: HPE Vertica
-Operations DBs: MySQL (web services) + Postgres (internal services)
-Recommendations serving: Redis
-Modeling/Analysis: Python, Spark via PySpark

Q7. What is your experience in using HPE Vertica?

Rob Winters: I have been using Vertica for five years in a number of organizations and facilitated the first rollout in the Netherlands. During that time I have been primarily an end user/data analyst but have also been the DBA for my deployments for the last two years.

Q8: Can you give us some more technical details of what was this first rollout in the Netherlands? What challenges did you solve in using HPE Vertica? What business benefits did you obtain?

Rob Winters: The objective of our rollout was to implement a centralized company data warehouse to unify several production databases plus external API data.
The existing platform was Postgres (a row-based solution) and relatively limited in performance. The primary gains were significantly faster analytics, the ability to add in several terabytes of event data (which was not possible on the prior platform), and new insights into the email database regarding churn, conversion, and customer value.

Q9: What were the main criteria for you to choose HPE Vertica? Did you do any performance test for HPE Vertica?

Rob Winters: We considered a number of alternatives including Microsoft PDW, Greenplum, and Infobright.
The primary considerations were price/performance, scalability, and analytical functionality. We found Vertica to be the best option across those aspects. Regarding performance testing, we did compare Infobright and Vertica and found the latter to be both more performant and easier to work with.

Q10. What specific functionalities of HPE Vertica do you find particularly useful in your job?

Rob Winters: There are a number of aspects which I find extremely beneficial, including:
-Ease of administration
-Very good performance tunability, much better than with (for example) Redshift
-Analytical function extensions enable extremely powerful analyses directly via SQL
-The ability to load JSON data allows very rapid data integration from new sources (see the sketch below)
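
As a rough sketch of the JSON-loading point (assuming the open-source vertica_python client; the connection details, table and file names, and the event_type key are illustrative, not the actual TravelBird setup):

# Sketch: land raw JSON in a Vertica flex table so a new source can be queried
# without defining a schema first. All names and credentials are placeholders.
import vertica_python

conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "secret", "database": "dwh"}

conn = vertica_python.connect(**conn_info)
cur = conn.cursor()
cur.execute("CREATE FLEX TABLE raw_events()")
# Server-side file path; keys in the JSON become queryable virtual columns.
cur.execute("COPY raw_events FROM '/data/events.json' PARSER fjsonparser()")
cur.execute("SELECT event_type, COUNT(*) FROM raw_events GROUP BY 1 LIMIT 10")
for row in cur.fetchall():
    print(row)
conn.close()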

Q11. Do you think is it possible to turn an employee into a data analyst?

Rob Winters: Absolutely, I’ve managed several employees who have successfully transitioned from an operations role to an analytics role. In fact, some of them have become my best analysts because they have brought a deeper domain knowledge to their analyses than someone approaching from the outside may have done. The biggest drivers for success in the transition have been:
– Attitude/eagerness to learn
– Close collaboration with a more experienced analyst, either their supervisor or a more senior peer
– Making their initial projects in areas where they are unable to fall back on domain knowledge

——
Rob Winters, Head of Business Intelligence at TravelBird.
Rob has been working with and leading analytics teams since 2006 across a number of industries including telco, gaming, retail, and travel. His primary focus since 2011 has been green-field implementations of technology and team creation for both traditional business intelligence and predictive analytics; full details are listed on his LinkedIn profile. He holds a bachelor’s degree in economics and an MBA with an IT concentration.

Resources

Data-X: a project to produce a collection of video lectures on very practical and applied data analytics.

HPE Vertica 8 “Frontloader” BY Jeff Healey. ODBMS.org SEPTEMBER 12, 2016

Benchmarking HPE Vertica and Amazon Redshift. (Webinar)

HPE Vertica Analytics Platform on Microsoft Azure. By Chris_Daly. ODBMS.org SEPTEMBER 12, 2016

Hewlett Packard Enterprise Introduces HPE Vertica 8. ODBMS.org SEPTEMBER 7, 2016

Related Posts

On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, May 24, 2016

On data analytics for finance. Interview with Jason S. Cornez. ODBMS Industry Watch, May 17, 2016

A/B Testing is not art, it is science. By Ramkumar Ravichandran, Director, Analytics, Visa Inc. ODBMS.org, MAY 22, 2015

Follow us on Twitter: @odbmsorg

##

Democratizing the use of massive data sets. Interview with Dave Thomas. http://www.odbms.org/blog/2016/09/democratizing-the-use-of-massive-data-sets-interview-with-dave-thomas/ http://www.odbms.org/blog/2016/09/democratizing-the-use-of-massive-data-sets-interview-with-dave-thomas/#comments Mon, 12 Sep 2016 19:04:14 +0000 http://www.odbms.org/blog/?p=4234

“Any important data driving a business decision needs to be sanity checked, just as it would if one was using a spreadsheet.”–Dave Thomas.

I have interviewed Dave Thomas, Chief Scientist at Kx Labs.

RVZ

Q1. For many years business users have had their data locked up in databases and data warehouses. What is wrong with that?

Dave Thomas: It isn’t so much an issue of where the data resides, whether it is in files, databases, data warehouses or a modern data lake. The challenge is that modern businesses need access to the raw data, as well as the ability to rapidly aggregate and analyze their data.

Q2. Typical business intelligence (BI) tool users have never seen their actual data. Why?

Dave Thomas: For large corporations hardware and software both used to be prohibitively expensive, hence much of their data was aggregated prior to making it available to users. Even today when machines are very inexpensive most corporate IT infrastructures are impoverished relative to what one can buy on the street or in the Cloud.
Compounding the problem, IT charge-back mechanisms are biased to reduce IT spending rather than to maximize the value of data delivered to the business.
Traditional technologies are not sufficiently performant to allow processing of large volumes of data.
Many companies have inexpensive data lakes and have realized after the fact that using commodity storage systems, such as HDFS, has severely constrained their performance and limited their utility. Hence more corporations are moving data away from HDFS into high-performance storage or memory.

Q3. What are the limitations of the existing BI and extract, transform and load (ETL) data tools?

Dave Thomas: Traditional BI tools assume that it is possible for DBAs and BI experts to a priori define the best way to structure and query the data. This reduces the whole power of BI to mere reporting. In an attempt to deal with huge BI backlogs, generic query and reporting tools have become popular to shift reporting to self-serve. However, they are often designed for sophisticated BI users rather than for normal business users. They are often not performant because they depend on the implementation of the underlying data stores.
For the most part, existing ETL tools are constrained by having to move the data to the ETL process and then on to the end user. Many ETL tools only work against one kind of data source. ETL can’t be written by normal users, and due to the cost of an incorrect ETL run, such tools are not made available to the data analyst. One of the major topics of discussion in Big Data shops is the complexity and performance of their Big Data pipeline. ETL, or data blending, shouldn’t be a separate process or product. It should be something one can do with queries in a single efficient data language.

Q4. What are the typical technical challenges in finance, IoT and other time-series applications?

Dave Thomas:
1. Speed, as data volumes and variety are always increasing.
2. Ability to deal with both real-time events and historical events efficiently. Ideally in a single technology.
3. To handle time-series one needs to be able to deal with simultaneous arrival of events. Time with nanosecond precision is our solution. Other solutions are constrained by using milliseconds and event counters that are much less efficient.
4. High-performance operations on time, over days, months and years are essential for time-series. This is why time is a native type in Kx.
5. The essence of time-series is processing sliding time windows of data for both joins and aggregations (see the illustrative sketch after this list).
6. In IoT, data is always dirty. Kx’s native support for missing data and out-of-band data due to failing sensors allows one to deal with the realities of sensor data.
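
To make the sliding-window point concrete, here is a small illustration in pandas rather than q/kdb+ (the timestamps and prices are invented):

# Illustration only (pandas, not q/kdb+): a time-based sliding-window
# aggregation over trades with sub-second timestamps.
import pandas as pd

trades = pd.DataFrame({
    "ts": pd.to_datetime([
        "2016-09-12 09:30:00.000001",
        "2016-09-12 09:30:00.250000",
        "2016-09-12 09:30:01.100000",
        "2016-09-12 09:30:02.500000",
        "2016-09-12 09:30:02.750000",
    ]),
    "price": [100.0, 100.2, 99.9, 100.5, 100.4],
}).set_index("ts")

# Rolling 1-second window ending at each event: mean price and trade count.
window = trades["price"].rolling("1s")
print(pd.DataFrame({"mean_px": window.mean(), "n_trades": window.count()}))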

Q5. Kx offers analysts a language called q. Why not extend standard SQL?

Dave Thomas: I think there is a misunderstanding about q. Q is a full functional data language that both includes and extends SQL. Selects are easier than in SQL because they provide implicit joins and group-bys, which makes queries roughly half the code of the equivalent SQL. Unlike many flavors of SQL, q lets one put a functional expression in any position in an SQL statement. One can easily extend the aggregation operations available to the end user.

Q6. Can you show the difference between a query written in q and in standard SQL?

Dave Thomas: Here’s an example of retrieving parts from an orders table with a foreign key join to a parts table, summing by quantity and then sorting by color:

q:
select sum qty by p.color from sp

SQL:
select p.color, sum(sp.qty) from sp, p
where sp.p=p.p group by p.color order by color

Q7. How do queries execute inside the database?

Dave Thomas: Q is native to the database engine. Hence queries and analytics execute in the columns of the Kx database. There is no data shipping between the client and database server.

Q8. Shawn Rogers of Dell said: “A ‘citizen data scientist’ is an everyday, non-technical user that lacks the statistical and analytical prowess of a traditional data scientist, but is equally eager to leverage data in order to uncover insights, and importantly, do so at the speed of business.” What is your take on this?

Dave Thomas: High-performance data technologies, such as Kx, running on modern large-memory hardware, can support data-analyst as well as data-scientist queries. In the product Analyst for Kx, for example, users can work interactively on a sample of data, using visual tools to import, clean, query, transform, analyze and visualize data with minimal, if any, programming or even SQL. Once the operations are correct on one or more samples, they can then be run against trillions of rows of data. Data analysts today can truly live in their data.
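
A generic sketch of that “work on a sample, then run against everything” workflow, in plain pandas rather than Analyst for Kx (the column names and the cleaning step are placeholders):

# Generic "develop on a sample, then apply to the full data" pattern.
# Column names and the cleaning/aggregation logic are placeholders.
import pandas as pd

def clean_and_aggregate(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with bad prices, then total volume per symbol."""
    valid = df[df["price"] > 0]
    return valid.groupby("symbol", as_index=False)["volume"].sum()

full = pd.DataFrame({
    "symbol": ["AAA", "BBB", "AAA", "CCC"] * 1000,
    "price":  [10.0, -1.0, 10.5, 7.2] * 1000,
    "volume": [100, 200, 150, 50] * 1000,
})

sample = full.sample(frac=0.05, random_state=7)    # iterate interactively here
assert not clean_and_aggregate(sample).empty       # sanity-check the operations
print(clean_and_aggregate(full))                   # then run against all the data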

Q9. What are the risks of bringing the power of analytics to users who are non-expert programmers?

Dave Thomas: Clearly any important analysis needs to be validated and cross-checked. Hence any important data driving a business decision needs to be sanity checked, just as it would if one was using a spreadsheet.
In our experience users do make initial mistakes, but as they live in their data they quickly learn.
Visualization really helps, as does the provision of metadata about the data sources. Reducing the cycle time provides increased understanding, and allows one to make mistakes.
Runaway query performance has been a concern of DBAs, but for many years frameworks have been in place such as our smart query router that will ensure that ad hoc queries against massive datasets are throttled so they don’t run away. Fortunately, recent cost reductions in non-volatile memory make it possible to have high-performance query-only replicas of data that can be made available to different parts of the organization based on its needs.

Q10. How can non-expert programmers understand if the information expressed in visual analytics such as heat maps or in operational dashboard charts, is of good quality or not?

Dave Thomas: In our experience users spot visual anomalies much faster than inconsistencies in a spreadsheet.

Q11. What are the opportunities arising in “democratizing” the use of massive data sets?

Dave Thomas: We are finally living in a world where for many companies it is possible to run a real-time business where everyone can have fast, efficient access to the data they need. Rather than being held hostage to aggregations, spreadsheets and all sorts of variants of the truth, the organization can expediently see new opportunities to improve results in sales, marketing, production and other business operations.

Q12. How important is data query and data semantics?

Dave Thomas: Unfortunately we are not educated in how to express data semantics and data queries.
Even computer scientists often study more about how to execute queries efficiently than about how to write them.
We need to educate students and employees on how to live in their data. It may well be that the future of programming for most people will be writing queries. Given powerful data languages, even compiler optimizations can be expressed as queries.
We need to invest much more in data governance and the use of standard terminology in order to share data within and across companies.

——————-
Dave Thomas, Kx Labs.
As Chief Scientist Dave envisions the future roadmap for Kx tools. Dave has had a long and storied career in computer software development and is perhaps best known as the founder and past CEO of Object Technology International, formerly OTI, now IBM OTI Labs, a pioneer in Agile Product Development. He was the principal visionary and architect for IBM VisualAge Smalltalk and Java tools and virtual machines including the popular open-source, multi-language Eclipse.org IDE. As the cofounder of Bedarra Research Labs he led the creation of the Ivy visual analytics workbench. Dave is a renowned speaker, university lecturer and Chairman of the Australian developer YOW! conferences.

Resources

New Kx release includes encryption, enhanced compression and Tableau integration. ODBMS.org JULY 4, 2016.

Resources for learning more about kdb+ and q benchmarking results.

Kdb+ and the Internet of Things/Big Data. InDetail paper by Philip Howard, Bloor Research. ODBMS.org, JANUARY 28, 2015

Related Posts

Democratizing fast access to Big Data. By Dave Thomas, Chief Scientist at Kx Labs. ODBMS.org, April 26, 2016

On Data Governance. Interview with David Saul. ODBMS Industry Watch, Published on 2016-07-23

On the Challenges and Opportunities of IoT. Interview with Steve Graves. ODBMS Industry Watch, Published on 2016-07-06

On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, Published on 2016-05-24

Follow us on Twitter: @odbmsorg

##

On the Challenges and Opportunities of IoT. Interview with Steve Graves http://www.odbms.org/blog/2016/07/on-the-challenges-and-opportunities-of-iot-interview-with-steve-graves/ http://www.odbms.org/blog/2016/07/on-the-challenges-and-opportunities-of-iot-interview-with-steve-graves/#comments Wed, 06 Jul 2016 09:00:29 +0000 http://www.odbms.org/blog/?p=4172

“Assembling a team with the wide range of skills needed for a successful IoT project presents an entirely different set of challenges. The skills needed to build a ‘thing’ are markedly different than the skills needed to implement the data analytics in the cloud.”–Steve Graves.

I have interviewed Steve Graves, co-founder and CEO of McObject. The main topic of the interview is the Internet of Things and how it relates to databases.

RVZ

Q1. What are in your opinion the main Challenges and Opportunities of the Internet of Things (IoT) seen from the perspective of a database vendor?

Steve Graves: Let’s start with the opportunities.

When we started McObject in 2001, we chose “eXtremeDB, the embedded database for intelligent, connected devices” as our tagline. eXtremeDB was designed from the get-go to live in the “things” comprising what the industry now calls the Internet of Things. The popularization of this term has created a lot of visibility and, more importantly, excitement and buzz for what was previously viewed as the relatively boring “embedded systems.” And that creates a lot of opportunities.

A lot of really smart, creative people are thinking of innovative ways to improve our health, our workplace, our environment, our infrastructure, and more. That means new opportunities for vendors of every component of the technology stack.
The challenges are manifold, and I can’t begin to address all of them. The media is largely fixated on security, which itself is multi-dimensional.
We can talk about protecting IoT-enabled devices (e.g. your car) from being hacked. We can talk about protecting the privacy of your data at rest. And we can talk about protecting the privacy of data in motion.
Every vendor needs to recognize the importance of security. But it isn’t enough for a vendor, like McObject, to provide the features to secure the target system; the developer that assembles the stack along with their own proprietary technology to create an IoT solution needs to use the available security features, and use them correctly.

After security, scaling IoT systems is the next big challenge. It’s easy enough to prototype something.
But careful planning is needed to leap from prototype to full-blown deployment. Obvious decisions have to be made about connectivity and necessary bandwidth, how many things per gateway, one tier of gateways or more, and how much compute capacity is needed in the cloud. Beyond that, there are less obvious decisions to be made that will affect scalability, like making sure the DBMS used on devices and/or gateways is able to handle the workload (e.g. that the gateway DBMS can scale from 10 input streams to 100 input streams); determining how to divide the analytics workload between gateways and the cloud; and ensuring that the gateway, its DBMS and its communication stack can stream data to the cloud while simultaneously processing its own input streams and analytics.
Assembling a team with the wide range of skills needed for a successful IoT project presents an entirely different set of challenges. The skills needed to build a ‘thing’ are markedly different than the skills needed to implement the data analytics in the cloud. In fact, ‘things’ are usually very much like good ol’ embedded systems, and system engineers that know their way around real-time/embedded operating systems, JTAG debuggers, and so on, have always been at a premium.

Q2. Data management for the IoT: What are the main differences between data management in field-deployed devices and at aggregation points?

Steve Graves: Quite simply: scale. A field-deployed device (or a gateway to field-deployed devices that do not, themselves, have any data management need or capability) has to manage a modest amount of data. But an aggregation point (the cloud being the most obvious example) has to manage many times more data – possibly orders of magnitude more.
At the same time, I have to say that they might not be all that different. Some IoT systems are going to be closed, meaning the nature of the things making up the system is known, and these won’t require much scaling. For example, a building automation system for a small- to mid-size building would have perhaps 100s of sensors and 10s of gateways, and may (or may not) push data up to a central aggregation point. If there are just 10s of gateways, we can create a UI that connects to the database on each gateway where each database is one shard of a single logical database, and execute analytics against that logical database without any need of a central aggregation point. We can extend this hypothetical case to a campus of buildings, or to a landlord with many buildings in a metropolitan area, and then a central aggregation point makes sense.

But the database system would not necessarily be different, only the organization of the physical and logical databases.
The gateways of each building would stream to a database server in the cloud. In the case of 10 buildings, we could have 10 database servers in the cloud that represent 10 shards of that logical database in the cloud. This architecture allows for great scalability. The landlord acquires another building? Great, stand up another database server and the UI connects to 11 shards instead of 10. In this scenario, database servers are software, not hardware. For the numbers we’re talking about (10 or 11 buildings), it could easily be handled by a single hardware server of modest ability.
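
A minimal sketch of the “one logical database over N shards” idea, using SQLite connections purely as stand-ins for the per-building databases (the table, columns and readings are invented):

# Sketch: fan the same aggregate query out to each per-building shard and
# combine the partial results centrally. SQLite is only a stand-in here.
import sqlite3

def make_shard(readings):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sensor_readings (sensor TEXT, temp REAL)")
    conn.executemany("INSERT INTO sensor_readings VALUES (?, ?)", readings)
    return conn

shards = [
    make_shard([("b1-hvac-1", 21.5), ("b1-hvac-2", 22.0)]),
    make_shard([("b2-hvac-1", 19.8)]),
    make_shard([("b3-hvac-1", 23.1), ("b3-hvac-2", 22.7)]),
]

partials = [s.execute("SELECT COUNT(*), SUM(temp) FROM sensor_readings").fetchone()
            for s in shards]
count = sum(n for n, _ in partials)
avg_temp = sum(total for _, total in partials) / count
print(f"{len(shards)} shards, {count} readings, average temp {avg_temp:.1f}")
# Adding an 11th building is just appending another connection to `shards`.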

At the other end of the scale (pun intended) are IoT systems that are wide open. By that, I mean the creators are not able to anticipate the universe of “things” that could be connected, or their quantity. In this case, the database system should be able to ingest data that was heretofore unknown. This argues for a NoSQL database system, i.e. a database system that is schema-less. In this scenario, the database system on field-deployed devices is probably radically different from the database system in the cloud. Field-deployed devices are purpose-specific, so A) they don’t need and wouldn’t benefit from a NoSQL database system, and B) most NoSQL database systems are too resource-hungry to reside on embedded device nodes.

Q3. If we look at the characteristics of a database system for managing device-based data in the IoT, how do they differ from the characteristics of a database system (typically deployed on a server) for analyzing the “big data” generated by myriad devices?

Steve Graves: Again, let’s recognize that field-deployed devices in the IoT are classic embedded systems. In practical terms, that means relatively modest hardware like an ARM, MIPS, PowerPC or Atom processor running at 100s of megahertz, or perhaps 1 GHz if we’re lucky, and with only enough memory to perform its function. Further, it may require a real-time operating system, or at least an embedded operating system that is less resource-hungry than a full-on Linux distro. So, for a database system to run in this environment, it will need to have been designed to run in this environment. It isn’t practical to try to shoehorn in a database system that was written on the assumption that CPU cycles and memory are abundant. It may also be the case that the device has little-to-no persistent storage, which mandates an in-memory database.

So a database system for a field-deployed device is going to:
1. have a small code size
2. use little stack
3. preferably, allocate no heap memory
4. have no, or minimal, external dependencies (e.g. not link in an extra 1 MB of code from the C run-time library)
5. have a built-in ability to replicate data (to a gateway or directly to the cloud)
a. replication should be “open”, meaning able to replicate to a different database system
6. have built-in security features

7. nice to have:
a. built-in analytics to aggregate data prior to replicating it
b. the ability to define the schema
c. the ability to operate entirely in memory

A database system for the cloud might benefit from being schema-less, as described previously. It should certainly have pretty elastic scalability. Servers in the cloud are going to have ample resources and robust operating systems. So a database system for the cloud doesn’t need to have a small code size, use a small amount of stack memory, or worry about external dependencies such as the C run-time library. On the contrary, a database system for the cloud is expected to do much more (handle data at scale, execute analytics, etc.) and will, therefore, need ample resources. In fact, this database system should be able to take maximum advantage of the resources available, including being able to scale horizontally (across cores, CPUs, and servers).
In summary, the edge (device-based) DBMS needs to operate in a constrained environment. A cloud DBMS needs to be able to effectively and efficiently utilize the ample resources available to it.

Q4. Why is the ability to define a database schema important (versus a schema-less DBMS, aka NoSQL) for field-deployed devices?

Steve Graves: Field-deployed devices will normally perform a few specific functions (sometimes, just one function). For example, a building automation system manages HVAC, lighting, etc. A livestock management system manages feed, output, and so on. In such systems, the data requirements are well known. The hallmark NoSQL advantage of being able to store data without predefining its structure is unwarranted. The other purported hallmark of NoSQL is horizontal scalability, but this is not a need for field-deployed devices.
Walking away from the relational database model (and its implicit use of a database schema) has serious implications.
A great deal of scientific knowledge has been amassed around the relational database model over the last few decades, and without it developers are completely on their own with respect to enforcing sound data management practices.

In the NoSQL sphere, there is nothing comparable to the relational model (e.g. E.F. Codd’s work) and the mathematical foundation (relational calculus) underpinning it.
There should be overwhelming justification for a decision to not use relational.
In my experience, that justification is absent for data management of field-deployed devices.
A database system that “knows” the data design (via a schema) can more intelligently manage the data. For example, it can manage constraints, domain dependencies, events and much more. And some of the purported inflexibility imposed by a schema can be eliminated if the DBMS supports dynamic DDL (see more details on this in the answer to question Q6, below).

Q5. In your opinion, do IoT aggregation points resemble data lakes?

Steve Graves: The term data lake was originally conceived in the context of Hadoop and map-reduce functionality. In more recent times, the meaning of the term has morphed to become synonymous with big data, and that is how I use the term. Insofar as a gateway can also be an aggregation point, I would not say ‘aggregation points resemble data lakes’ because gateway aggregation points, in all likelihood, will not manage Big Data.

Q6. What are the main technical challenges for database systems used to accommodate new and unforeseen data, for example when a new type of device begins streaming data?

Steve Graves: The obvious challenges are
1. The ability to ingest new data that has a previously unknown structure
2. The ability to execute analytics on #1
3. The ability to integrate analytics on #1 with analytics on previously known data

#1 is handled well by NoSQL DBMSs. But it might also be handled well by an RDBMS via “dynamic DDL” (dynamic data definition language), e.g. the ability to execute CREATE TABLE, ALTER TABLE, and/or CREATE INDEX statements against an existing database.
To efficiently execute analytics against any data, the structure of the data must eventually be understood.
RDBMSs handle this through the database dictionary (the binary equivalent of the data definition language).
But some NoSQL DBMSs handle this through other metadata. For example, the MarkLogic DBMS uses JSON metadata to understand the structure of documents in its document store.
NoSQL DBMSs with no metadata whatsoever put the entire burden on the developers. In other words, since the data is opaque to the DBMS, the application code must read and interpret the content.
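
As a small illustration of the dynamic DDL idea, using SQLite as a stand-in: when a new device type streams a previously unknown field, the schema is extended at runtime (the device and field names are invented, and a real system would validate field names before interpolating them into DDL):

# Dynamic DDL sketch: extend an existing table when a record arrives with a
# field the schema has never seen. Names are invented; SQLite is a stand-in.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device TEXT, ts TEXT, temperature REAL)")

def ingest(record: dict) -> None:
    """Add any missing columns on the fly, then insert the record."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(readings)")}
    for field in record:
        if field not in existing:
            conn.execute(f"ALTER TABLE readings ADD COLUMN {field} REAL")
    cols = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    conn.execute(f"INSERT INTO readings ({cols}) VALUES ({placeholders})",
                 list(record.values()))

ingest({"device": "thermostat-1", "ts": "2016-07-06T09:00:00", "temperature": 21.5})
# A new device type appears, streaming a field the schema has never seen:
ingest({"device": "co2-sensor-7", "ts": "2016-07-06T09:00:05", "co2_ppm": 412.0})
print(conn.execute("SELECT * FROM readings").fetchall())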

Q7. Client/server DBMS architecture vs. in-process DBMSs: which one is more suitable for IoT?

Steve Graves: For edge DBMSs (on constrained devices), an in-process architecture will be more suitable. It requires fewer resources than a client/server architecture, and imposes less latency through the elimination of inter-process communication. For cloud DBMSs, a client/server architecture will be more suitable. In the cloud environment, resources are not scarce, and the advantage of being able to scale horizontally will outweigh the added latency associated with client/server.

Qx Anything else you wish to add?

Steve Graves: We feel that eXtremeDB is uniquely positioned for the Internet of Things. Not only have devices and gateways been in eXtremeDB’s wheelhouse for 15 years, with over 25 million real-world deployments, but the scalability, time series data management, and analytics built into the eXtremeDB server (big data) offering make it an attractive cloud database solution as well. Being able to leverage a single DBMS across devices, gateways and the cloud has obvious synergistic advantages.

———————
Steve Graves is co-founder and CEO of McObject, a company specializing in embedded Database Management System (DBMS) software. Prior to McObject, Steve was president and chairman of Centura Solutions Corporation and vice president of worldwide consulting for Centura Software Corporation.

Resources

Big Data, Analytics, and the Internet of Things. By Mohak Shah, analytics leader and research scientist at Bosch Research, USA. ODBMS.org, APRIL 6, 2015

Privacy considerations & responsibilities in the era of Big Data & Internet of Things. By Ramkumar Ravichandran, Director, Analytics, Visa Inc. ODBMS.org, January 8, 2015

Securing Your Largest USB-Connected Device: Your Car. By Shomit Ghose, General Partner, ONSET Ventures. ODBMS.org, MARCH 31, 2016

eXtremeDB Financial Edition DBMS Sweeps Records in Big Data Benchmark. ODBMS.org, JULY 2, 2016

 eXtremeDB in-memory database

 User Experience Design for the Internet of Things

Related Posts

On the Internet of Things. Interview with Colin Mahony. ODBMS Industry Watch, Published on 2016-03-14

A Grand Tour of Big Data. Interview with Alan Morrison. ODBMS Industry Watch, Published on 2016-02-25

On the Industrial Internet of Things. Interview with Leon Guzenda. ODBMS Industry Watch, January 28, 2016

Follow us on Twitter: @odbmsorg

##

On data analytics for finance. Interview with Jason S.Cornez. http://www.odbms.org/blog/2016/05/on-data-analytics-for-finance-interview-with-jason-s-cornez/ http://www.odbms.org/blog/2016/05/on-data-analytics-for-finance-interview-with-jason-s-cornez/#comments Tue, 17 May 2016 06:14:44 +0000 http://www.odbms.org/blog/?p=4132

“Understanding human language remains a difficult problem. The challenges here are not only technical, but there is also a perception from popular culture that computers today perform at the level we see in science fiction. So there is a gap between what is expected and what is possible.”– Jason S. Cornez

I have interviewed Jason S. Cornez, Chief Technology Officer, RavenPack. The main topic of the interview is unstructured data analytics for finance.

RVZ

Q1. What is the business of RavenPack?

Jason S. Cornez: We specialize in the systematic analysis of unstructured data for finance. RavenPack Analytics transforms unstructured big data sets, such as traditional news and social media, into structured granular data and indicators to help financial services firms improve their performance. RavenPack addresses the challenges posed by the characteristics of Big Data – volume, variety, veracity and velocity – by converting unstructured content into a format that can be more effectively analyzed, manipulated and deployed in financial applications.

Q2. How is Deutsche Bank using RavenPack News Analytics as an overlay to a pairs trading strategy?

Jason S.Cornez: The profits and risks from trading stock pairs are very much related to the type of information event which creates divergence. If divergence is caused by a piece of news related specifically to one constituent of the pair, there is a good chance that prices will diverge further. On the other hand, if divergence is caused by random price movements or a differential reaction to common information, convergence is more likely to follow after the initial divergence. To test the effects of news on a pairs trading strategy, Deutsche Bank used two aggregated indicators based on RavenPack’s Big Data analytics derived from news and social media data measuring sentiment and media attention.
Specifically, using the two indicators, Deutsche Bank created a filter that would ignore trades where divergence was supported by negative sentiment and abnormal news volume.
Overall, Deutsche Bank finds that applying a news analytics overlay can help differentiate between “good” price divergence (which is likely to converge) and “bad” divergence. More importantly, such ability provides significant improvements to the performance of a traditional pairs trading strategy, especially by reducing divergence risk.
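
A schematic pandas sketch of that kind of overlay follows; every column name, threshold and figure is invented for illustration and none of it reflects Deutsche Bank’s or RavenPack’s actual definitions:

# Schematic overlay: skip divergence signals that coincide with negative
# sentiment and abnormal news volume. All values below are invented.
import pandas as pd

signals = pd.DataFrame({
    "pair":            ["A/B", "C/D", "E/F"],
    "divergence":      [True, True, True],
    "sentiment_score": [-0.6, 0.2, -0.1],   # aggregated news/social sentiment
    "news_volume_z":   [2.5, 0.4, 1.1],     # z-score of news volume vs. normal
})

SENTIMENT_FLOOR = -0.3   # "negative sentiment" threshold (illustrative)
VOLUME_Z_CAP = 2.0       # "abnormal news volume" threshold (illustrative)

news_driven = ((signals["sentiment_score"] < SENTIMENT_FLOOR)
               & (signals["news_volume_z"] > VOLUME_Z_CAP))
tradeable = signals[signals["divergence"] & ~news_driven]
print(tradeable["pair"].tolist())   # pairs whose divergence is not news-driven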

Q3. Who needs sentiment analytics in finance and why?

Jason S. Cornez: Sentiment analytics can help improve the performance of trading strategies, reduce risk, and monitor compliance. Quantitative investors often subscribe to RavenPack Analytics granular data. This provides them with the ability to detect relevant, novel and unexpected events – be they corporate, macroeconomic or geopolitical – so they can enter new positions, or protect existing ones. These events, and the sentiment associated with them, help drive alpha generation as a novel factor in automated trading models.

Traditional Asset Managers, such as those managing hedge funds, mutual funds, pension funds and family offices may subscribe to RavenPack Indicators to help run portfolio optimization. The Indicators provide snapshots of sentiment and information density for an entity or instrument that can be used alongside fundamental or technical indicators to build portfolios with better risk/return profiles.

Brokerage and Market Makers can leverage RavenPack sentiment data to manage risk and generate trade ideas. They rely on RavenPack’s detection of relevant, novel and unexpected events – be they corporate, macroeconomic or geopolitical – to create circuit breakers protecting them from event risk.

Risk and Compliance Managers use RavenPack data to monitor accumulation of adverse sentiment or detect headline risk. The data help risk managers locate accumulations of risk and volatility, or changes in liquidity – either by aggregating sentiment, identifying event-driven regime shifts, or by creating alerts for when sentiment indicators reach extremes. As well, RavenPack event data also aids surveillance analysts to receive fewer false positives from market abuse alerts.

Finally, Professional and Academic researchers use RavenPack data to better understand how news and social media affect markets. They want to inform their clients how to find new sources of value and, hence, research and write about how quantitative investment managers find value in the data. RavenPack’s granular data is a great source of unique data for academics to enhance their published research – be it presenting a new way to use the data or controlling for news and social media in their work.

Q4. What are the main challenges and opportunities for Big Data analytics for financial markets?

Jason S. Cornez: Much of the work so far in Big Data analytics has been confined to structured data. These are sets of labeled and elementized values, such as what you might find in a traditional database table. Tools like Hadoop and Spark have helped to make structured big data analytics approachable.

RavenPack has always focused on unstructured data, primarily English-language text. Doing analytics here isn’t just about data mining, it requires more sophisticated processing for each document. Understanding human language remains a difficult problem. The challenges here are not only technical, but there is also a perception from popular culture that computers today perform at the level we see in science fiction. So there is a gap between what is expected and what is possible. One of our goals here is certainly to help make computers a little smarter.

Things start to get really interesting when you produce analytics by marrying structured data with unstructured data. A simple example could be a news story where an analyst expects mortgage rates to hit 4% by summer. It is certainly great if a computer understands that this is a story about interest rate guidance, but so much better if the computer is able to combine this with historical mortgage rates to know that the rates are currently rising, but still far below historical norms. As an industry, I don’t think too much has been done here yet, but that we’ll be seeing more activity here in the coming years.

Financial markets rely on information in order to be efficient. Big Data analytics promises to provide more information, and more types of information, faster than was previously possible. A more efficient market could help to level the playing field, as it were. And even if markets never become truly efficient, the financial industry sees that Big Data analytics can certainly help them. Several of these opportunities were addressed in the answer to the previous question.

Q5. What is your practical experience in building an infrastructure for Big Data analytics of mostly unstructured text content, in realtime?

Jason S.Cornez: RavenPack has been processing Big Data since before Cloud Computing was a practical reality. We noticed that most competitors in the news analytics space were offering software solutions, whereas RavenPack has always been a service provider. We sell data, not software. As such we invested in our own infrastructure maintained at trusted hosting facilities. This was perhaps not the easiest or cheapest route, but it leads to compelling products that are relatively easy for a customer to adopt.

From the beginning, we’ve built a distributed system where collection, storage, classification, analytics, publication, and monitoring all run on distinct machines connected by a high-speed network. We learned virtualization technologies so that we could leverage our hardware investments more efficiently. We’ve been rigorous about maintaining a separation of concerns and establishing well-defined interfaces between our components. This not only makes our system robust, but it also allows us to choose the best technologies for each task.

In recent years, we’ve migrated to Cloud Computing and our early investments in distributed systems are really paying off. Most of our components work directly in the cloud and also scale without additional engineering work.

Q6. How do you manage to have a very low latency?

Jason S.Cornez: Low latency has always been a requirement of the system. Starting with low-latency, realtime processing in mind led to many of the architectural decisions that I mentioned above – especially about being distributed and being able to leverage big hardware. It’s painful to think about re-engineering an existing system that wasn’t designed with low latency in mind.

A specific observation is that storage, especially magnetic storage, is far slower than CPU and also far slower than networking. So we have a heavily multi-threaded system where all storage tasks are delegated to background threads, and the flow of data in the realtime system never needs to wait on a database.
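
A generic Python sketch of that pattern (not RavenPack’s Common Lisp implementation; the store() function is a placeholder for the slow database write):

# Generic pattern: the realtime path hands storage work to a background thread
# via a queue, so document processing never waits on the database.
import queue
import threading

write_queue = queue.Queue()

def store(record: dict) -> None:
    pass  # placeholder for the actual (slow) database write

def writer_loop() -> None:
    while True:
        record = write_queue.get()
        if record is None:          # sentinel used to shut down cleanly
            break
        store(record)
        write_queue.task_done()

threading.Thread(target=writer_loop, daemon=True).start()

def process_document(doc: dict) -> dict:
    analytics = {"id": doc["id"], "entities": [], "sentiment": 0.0}  # classification stub
    write_queue.put(analytics)      # non-blocking hand-off to storage
    return analytics                # realtime publication continues immediately

process_document({"id": 1, "text": "example headline"})
write_queue.put(None)               # stop the background writer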

Speaking of multi-threading, RavenPack performs various types of classification on each document. Many of these are independent and can be performed in parallel. As well, within a single document and single type of classification, many aspects work only on local information, such as a paragraph. This work can also be done in parallel. As more powerful, multi-core machines continue to appear, our system can continue to improve.
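
Again as a generic sketch rather than RavenPack’s actual code, the structure might look like the following, with stub classifiers (in CPython, CPU-bound work would typically go to a process pool rather than threads):

# Run independent classifiers, and per-paragraph work within one classifier,
# in parallel. The classifier functions are stubs.
from concurrent.futures import ThreadPoolExecutor

def classify_entities(text: str) -> list:
    return []     # stub: whole-document entity detection

def classify_sentiment(text: str) -> float:
    return 0.0    # stub: whole-document sentiment

def classify_events(paragraph: str) -> list:
    return []     # stub: needs only local (paragraph-level) information

def analyze(document: str) -> dict:
    paragraphs = document.split("\n\n")
    with ThreadPoolExecutor() as pool:
        entities = pool.submit(classify_entities, document)
        sentiment = pool.submit(classify_sentiment, document)
        events = list(pool.map(classify_events, paragraphs))  # paragraph-level parallelism
    return {"entities": entities.result(),
            "sentiment": sentiment.result(),
            "events": [e for chunk in events for e in chunk]}

print(analyze("First paragraph.\n\nSecond paragraph."))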

Of course, low latency really begins with good algorithms and good tools. We measure the system as a whole on a daily basis and we profile our code for both speed and space on a regular basis. At times, there is a trade-off between a feature and doing it feasibly. We often sacrifice a new feature until we can solve how to implement it without negatively impacting the performance of our system.

Q7. What are the main technological challenges you are currently facing?

Jason S.Cornez: There are many challenges ahead. Some of the obvious ones are about branching out from English into other languages, or from plain text to other media formats.

On the purely technical side, we see that cloud computing and big data are still very young fields. Cloud resources are much more ephemeral than those in a controlled, hosted environment. We must adapt software to work well in the face of disappearing machines and inaccessible resources. One example is startup time of a system. Traditionally, startup is a rare event and our servers run for a long time. But now that changes, and system startup is much more frequent and hence must be made more efficient. We are evolving rapidly in these areas right now.

Perhaps the biggest challenge remains the perception gap that I mentioned earlier. I’m very proud of the system we’ve built, but it remains possible for a human to find an entity or an event in a document that our system misses. I don’t think this problem will ever go away, but I’m confident RavenPack is making great strides here.

Q8. Why and how do you use Allegro Common Lisp?

Jason S.Cornez: RavenPack has been using Franz Allegro Common Lisp since we began. It is the primary language we use for analysis and classification of unstructured text. Common Lisp is an excellent language for both exploratory programming and high performance computing.

Common Lisp is a multi-paradigm language, or even a paradigm-neutral language. So the engineer has the flexibility to map from concept to code in the most natural way possible. Some concepts map naturally to an object-oriented design, others to a functional design, and others to an imperative design. The language naturally supports all of these so you never need to map from your concept into the philosophy of the language. And further, Lisp is a programmable programming language, so as new paradigms come along, they can be added to the language by any developer. This is so easy and natural in Common Lisp that you often do it even when there is only a single use case in mind.

Common Lisp also shines for deploying and maintaining production software. Of course, it supports native OS threads, native machine compilation, and high performance garbage collection. But as well, you can attach to, inspect, modify and patch live systems.

Q9. What are the main lessons you learned so far?

Jason S.Cornez: It’s been a long and interesting journey, and nearly everything we know now has been learned along the way. One way I like to think about the main lessons learned is to consider what I believe to be the barriers that might make it difficult for a competitor or potential client to replicate what we’ve done.

A significant selling-point of our product that provides lots of value to our clients is our extensive historical archive of analytics. This of course is derived from our archive of content. The curation of such an archive is much harder than most people imagine. There is the minor issue of implementing the spec that the provider supplies. But the fun begins as you realize that the archive is incomplete and in multiple incompatible formats, some of them not documented at all. There are multiple timestamps, many with no timezone. The realtime feed looks different from the historical archive. The list goes on.

None of this is meant as a complaint about our content partners – this is the nature of things. And even having learned this lesson, there isn’t much we could have done differently. Of course, we now have a checklist of questions we give to any new content provider – and they often improve their offering as a result of working with us. But if we hear that incorporating someone’s content will be easy, we now know to take this with a grain of salt.

Qx Anything else you wish to add?

Jason S.Cornez: Thanks for this opportunity. I hope it has been helpful.

———————————
Jason S. Cornez, Chief Technology Officer, RavenPack.
Jason joined RavenPack in 2003 and is responsible for the design and implementation of the RavenPack software platform. He is a hands-on technology leader, with a consistent record of delivering breakthrough products.
A Silicon Valley start-up veteran with 20 years of professional experience, Jason combines technical know-how with an understanding of business needs to turn vision into reality. Jason holds a Master’s Degree in Computer Science, along with undergraduate degrees in Mathematics and EECS, from the Massachusetts Institute of Technology.

——————————

Resources

–  Common Lisp Educational Resources:  list of books, Lisp-oriented web sites and tutorials.

–  Basic Lisp Techniques: The PDF file provides an introduction to the Common Lisp language.

–  Mean Reversion II: Pairs Trading Strategies (link to PDF, registration required), Deutsche Bank, Feb. 16, 2016. In this paper, Deutsche Bank shows how to use RavenPack News Analytics as an overlay to a pairs trading strategy.

Related Posts

– Enterprise Information Extraction. By Yunyao Li, Research Manager, Scalable Natural Language Processing (SNaP) Group, IBM Research–Almaden. ODBMS.org, MAY 9, 2016.

– Big Data: Content and Technology. BY Gio Wiederhold, ODBMS.org, May 2016

– Above the Clouds: What Modern IT Portends. BY Filippo Balestrieri and Bernardo A. Huberman, Hewlett Packard Labs, ODBMS.org, MARCH 30, 2016.

– “Civility in the Age of Artificial Intelligence”. By Steve Lohr, technology reporter for The New York Times and the author of “Data-ism”. ODBMS.org, FEBRUARY 6, 2016.

Follow ODBMS.org on Twitter: @odbmsorg
##

On Big Data and Data Science. Interview with James Kobielus http://www.odbms.org/blog/2016/04/on-big-data-and-data-science-interview-with-james-kobielus/ http://www.odbms.org/blog/2016/04/on-big-data-and-data-science-interview-with-james-kobielus/#comments Tue, 19 Apr 2016 08:34:09 +0000 http://www.odbms.org/blog/?p=4119

“One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.”– James Kobielus

On the topics of Big Data, and Data Science, I have interviewed James Kobielus, IBM Big Data Evangelist.

RVZ

Q1. What kind of companies generate Big Data, besides the Internet giants?

James Kobielus: Big data isn’t something you “generate.” Rather, the term refers to the ability to achieve differentiated value from advanced analytics on trustworthy data at any scale. In other words, it’s a best practice, not a specific type of data or even a specific scale of data (measured in volume, velocity, and/or variety).

When considered in this light, you can identify big data analytic applications in every industry. Every C-level executive has strategic applications of big data. Here is just a smattering:

  • Chief Marketing Officers have been the prime movers on many big data initiatives that involve Hadoop, NoSQL, and other approaches. Their primary applications consist of marketing campaign optimization, customer churn and loyalty, upsell and cross-sell analysis, targeted offers, behavioral targeting, social media monitoring, sentiment analysis, brand monitoring, influencer analysis, customer experience optimization, content optimization, and placement optimization.
  • Chief Information Officers use big data platforms for data discovery, data integration, business analytics, advanced analytics, exploratory data science.
  • Chief Operations Officers rely on big data for supply chain optimization, defect tracking, sensor monitoring, and smart grid, among other applications.
  • Chief Information Security Officers run security incident and event management, anti-fraud detection, and other sensitive applications on big data.
  • Chief Technology Officers do IT log analysis, event analytics, network analytics, and other systems monitoring, troubleshooting, and optimization applications on big data.
  • Chief Financial Officers run complex financial risk analysis and mitigation modeling exercises on big data platforms.

Q2. What are the most challenging problems you are facing when analysing Big Data?

James Kobielus: Searching for actionable intelligence in big data involves building and testing advanced-analytics models against large volumes of complex data that may be flowing in at high velocities.

At these scales, it’s easy to get overwhelmed in your analysis unless you automate the end-to-end processes of extracting intelligence at scale. Automation can also help control the cost of managing a growing volume of algorithmic models against ever expanding big-data collections. The key processes that need automating are data discovery, profiling, sampling, and preparation, as well as model building, scoring, and deployment.
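
As one generic illustration of automating the model building and scoring loop (a scikit-learn pipeline with a cross-validated parameter search over synthetic data; this is not any particular vendor’s tooling):

# Generic automation of model building and scoring: a pipeline plus a
# cross-validated parameter search, run end to end without manual tuning.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()),
                     ("model", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipeline, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)                                # automated build + selection
print("held-out accuracy:", search.score(X_test, y_test))   # automated scoring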

Q3. How do you typically handle them?

James Kobielus: Automating the modeling process will boost data scientist productivity by an order of magnitude, freeing them from drudgery so that they can focus on the sorts of exploration, modeling, and visualization challenges that demand expert human judgment. Data scientists can accelerate their modeling automation initiatives by following these steps:

  • Virtualize access to data, metadata, rules, and predictive models, as well as to data integration, data warehousing, and advanced analytic applications through a BI semantic virtualization layer;
  • Unify access, governance, orchestration, automation, and administration across these resources within a service-oriented architecture;
  • Explore commercial tools that support maximum automation of model development, scoring, deployment, and execution;
  • Consolidate, accelerate, and deepen predictive analytics through integration into big-data platforms with scalable in-database execution; and
  • Migrate existing analytical data marts into multidomain big-data platforms with unified data, metadata, and model governance within a service-oriented virtualization framework.

Q4. What are in your experience the typical mistakes made in large scale data projects?

James Kobielus: One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.

Even if you accept that a data scientist’s integrity is rock-solid, intentions pure, skills stellar, and discipline rigorous, there’s no denying that bias may creep inadvertently into their work with big data. The biases may be minor or major, episodic or systematic, tangential or material to their findings and recommendations. Whatever their nature, the biases must be understood and corrected as fully as possible.

Here are some of the key sources of bias that may crop up in a data scientist’s work with big data:

  • Cognitive bias: This is the tendency to make skewed decisions based on pre-existing cognitive and heuristic factors–such as a misunderstanding of probabilities–rather than on the data and other hard evidence. You might say that the educated intuition that drives data science is rife with cognitive bias, but that’s not always a bad thing.
  • Selection bias: This is the tendency to skew your choice of data sources to those that may be most available, convenient, and cost-effective for your purposes, as opposed to being necessarily the most valid and relevant for your study. Clearly, data scientists do not have unlimited budgets, may operate under tight deadlines, and don’t use data for which they lack authorization. These constraints may introduce an unconscious bias in the big-data collections they are able to assemble.
  • Sampling bias: This is the tendency to skew the sampling of data sets toward subgroups of the population most relevant to the initial scope of a data-science project, thereby making it unlikely that you will uncover any meaningful correlations that may apply to other segments. Another source of sampling bias is “data dredging,” in which the data scientist uses regression techniques that may find correlations in samples but that may not be statistically significant in the wider population. Consequently, you’re likely to spuriously confirm your initial model for the segments that happen to make the sampling cut.
  • Modeling bias: Beyond the biases just discussed, this is the tendency to skew data-science models by starting with a biased set of project assumptions that drive selection of the wrong variables, the wrong data, the wrong algorithms, and the wrong metrics of fitness. In addition, overfitting of models to past data without regard for predictive lift is a common bias. Likewise, failure to score and iterate models in a timely fashion with fresh observational data also introduces model decay, hence bias.
  • Funding bias: This may be the most silent but pernicious bias in data-scientific studies of all sorts. It’s the unconscious tendency to skew all modeling assumptions, interpretations, data, and applications to favor the interests of the party–employer, customer, sponsor, etc.–that employs or otherwise financially supports the data-science initiative. Funding bias makes it highly unlikely that data scientists will uncover disruptive insights that will “break the rice bowl” in which they make their living.

Q5. How do you measure “success” when analysing data?

James Kobielus: You measure success in your ability to distill useful insights in a timely fashion from the data at your disposal.

Q6. What skills are required to be an effective Data Scientist?

James Kobielus: Data science’s learning curve is formidable. To a great degree, you will need a degree, or something substantially like it, to prove you’re committed to this career. You will need to submit yourself to a structured curriculum to certify you’ve spent the time, money and midnight oil necessary for mastering this demanding discipline.

Sure, there are run-of-the-mill degrees in data-science-related fields, and then there are uppercase, boldface, bragging-rights “DEGREES.” To some extent, it matters whether you get that old data-science sheepskin from a traditional university vs. an online school vs. a vendor-sponsored learning program. And it matters whether you only logged a year in the classroom vs. sacrificed a considerable portion of your life reaching for the golden ring of a Ph.D. And it certainly matters whether you simply skimmed the surface of old-school data science vs. pursued a deep specialization in a leading-edge advanced analytic discipline.

But what matters most to modern business isn’t that every data scientist has a big honking doctorate. What matters most is that a substantial body of personnel has a common grounding in core curriculum of skills, tools and approaches. Ideally, you want to build a team where diverse specialists with a shared foundation can collaborate productively.

Big data initiatives thrive if all data scientists have been trained and certified on a curriculum with the following foundation:

  • Paradigms and practices: Every data scientist should acquire a grounding in core concepts of data science, analytics and data management. They should gain a common understanding of the data science lifecycle, as well as the typical roles and responsibilities of data scientists in every phase. They should be instructed on the various roles data scientists play and how they work in teams and in conjunction with business domain experts and stakeholders. And they should learn a standard approach for establishing, managing and operationalizing data science projects in the business.
  • Algorithms and modeling: Every data scientist should obtain a core understanding of linear algebra, basic statistics, linear and logistic regression, data mining, predictive modeling, cluster analysis, association rules, market basket analysis, decision trees, time-series analysis, forecasting, machine learning, Bayesian and Monte Carlo statistics, matrix operations, sampling, text analytics, summarization, classification, principal components analysis, experimental design, unsupervised learning, and constrained optimization (a small worked sketch follows this list).
  • Tools and platforms: Every data scientist should master a core group of modeling, development and visualization tools used on your data science projects, as well as the platforms used for storage, execution, integration and governance of big data in your organization. Depending on your environment, and the extent to which data scientists work with both structured and unstructured data, this may involve some combination of data warehousing, Hadoop, stream computing, NoSQL and other platforms. It will probably also entail providing instruction in MapReduce, R and other new open-source development languages, in addition to SPSS, SAS and any other established tools.
  • Applications and outcomes: Every data scientist should learn the chief business applications of data science in your organization, as well as how to work best with subject-domain experts. In many companies, data science focuses on marketing, customer service, next best offer, and other customer-centric applications. Often, these applications require that data scientists understand how to leverage customer data acquired from structured survey tools, sentiment analysis software, social media monitoring tools and other sources. It is also essential that every data scientist gain an understanding of the key business outcomes, such as maximizing customer lifetime value, that should focus their modeling initiatives.
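
As a small worked companion to the “Algorithms and modeling” item, the following Python sketch strings together several of the listed topics: sampling into train/test sets, principal components analysis, logistic regression, and a simple fitness metric. The library (scikit-learn) and the bundled dataset are illustrative assumptions, not a prescribed toolchain:

    # A toy end-to-end modeling exercise touching several curriculum items at once.
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Standardize, reduce to 10 principal components, then fit a logistic regression.
    model = make_pipeline(StandardScaler(), PCA(n_components=10), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    # Score on held-out data -- the habit that guards against the overfitting bias
    # discussed earlier in this interview.
    print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")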

Classroom instruction is important, but a curriculum that is 100 percent devoted to reading books, taking tests and sitting through lectures is insufficient. Hands-on laboratory work is paramount for a truly well-rounded data scientist. Make sure that your data scientists acquire certifications and degrees that reflect hands-on experience developing statistical models that use real data and address substantive business issues.

A business-oriented data-science curriculum should produce expert developers of statistical and predictive models. It should not degenerate into a program that produces analytics geeks with heads stuffed with theory but whose diplomas are only fit for hanging on the wall.

Q7. Hadoop vs. Spark: what are the pros and cons?

James Kobielus: Big data analytics infrastructures are growing more hybridized than ever. Every new technology—such as Hadoop, in-memory databases, and graph databases—finds its specific niche in terms of use cases, deployment modes, and applications for which it is best suited.

Even as Apache Spark pushes more deeply into big-data environments, it won’t substantially change this trend. Yes, of course Spark is on the fast track to ubiquity in big-data analytics. This is especially true for the next generation of machine-learning applications that feed on growing in-memory pools and require low-latency distributed computations for streaming and graph analytics. But those use cases aren’t the sum total of big-data analytics and never will be.

As we all grow more infatuated with Spark, it’s important to continually remind ourselves of what it’s not suitable for. If, for example, one considers all the critical data management, integration, and preparation tasks that must be performed prior to modeling in Spark, it’s clear that these will not be executed in any of the Spark engines (Spark SQL, Spark Streaming, GraphX). Instead, they’ll be carried out in the data platforms and elastic clusters (HDFS, Cassandra, HBase, Mesos, cloud services, etc.) upon which those engines run. Likewise, you’d be hard-pressed to find anyone who’s seriously considering Spark in isolation for data warehousing, data governance, master data management, or operational business intelligence.

Above all else, Spark is the new power tool for data scientists who are pushing boundaries in the emerging era of in-memory big data analytics in low-latency scenarios of all types. Spark is proving its value as a development tool for the new generation of data scientists building the in-memory statistical models upon which it all will depend.
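
A hedged sketch of that division of labor, in PySpark: the prepared data is assumed to already live in an underlying platform (the HDFS path and column names below are placeholders, not real datasets), while Spark supplies the in-memory, iterative model-building step:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("in-memory-modeling-sketch").getOrCreate()

    # Cleansed, prepared data is assumed to already sit in the underlying platform
    # (HDFS, Cassandra, HBase, ...); Spark reads it rather than managing it.
    df = spark.read.parquet("hdfs:///data/prepared/churn.parquet")  # hypothetical path

    # Assemble feature columns (column names are assumptions) and cache in memory
    # so the iterative fitting step stays low-latency.
    assembler = VectorAssembler(inputCols=["tenure", "usage", "support_calls"], outputCol="features")
    train = assembler.transform(df).select("features", "label").cache()

    # Fit an in-memory logistic regression with Spark MLlib.
    model = LogisticRegression(maxIter=50).fit(train)
    print(model.coefficients)

    spark.stop()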

Let’s not fall into the delusion that everything is converging toward Spark, as if it were the ravenous maw that will devour every other big-data analytics tool and platform. Spark is just another approach that’s being fitted to and optimized for specific purposes.

And let’s resist the hype that treats Spark as Hadoop’s “successor.” This implies that Hadoop and other big-data approaches are “legacy,” rather than what they are, which is foundational. For example, no one is seriously considering doing “data lakes,” “data reservoirs,” or “data refineries” on anything but Hadoop or NoSQL.

——————–

James Kobielus is an industry veteran and serves as IBM Big Data Evangelist; Senior Program Director for Product Marketing in Big Data Analytics; and Team Lead, Technical Marketing, IBM Big Data & Analytics Hub. He spearheads thought leadership activities across the IBM Analytics solution portfolio. He has spoken at such leading industry events as IBM Insight, Hadoop Summit, and Strata. He has published several business technology books and is a popular provider of original commentary on blogs and across social media.

Resources

– Master of Information and Data Science, UC Berkeley School of Information.

– MS in Data Science, NYU Center for Data Science.

– Free data science curriculum, kdnuggets.com

– Data Science | Coursera

– Master of Science in Data Science, Data Science Institute

– Data Mining and Applications Graduate Certificate, Stanford

– The European Data Science Academy (EDSA) designs curricula for data science training and data science education across the European Union (EU).

– The EDISON project will focus on activities to establish the new profession of ‘Data Scientist’, following the emergence of Data Science technologies (also referred to as Data Intensive or Big Data technologies), which change the way research is done, how scientists think and how research data are used and shared. This includes definition of the required skills, a competence framework/profile, a corresponding Body of Knowledge and a model curriculum. It will develop a sustainability/business model to ensure a sustainable increase in Data Scientists graduating from universities and trained by other professional education and training institutions in Europe.
EDISON will facilitate the establishment of a Data Science education and training infrastructure at major European universities by promoting the experience of ‘champion’ universities, involving them in coordinated development and implementation of the model curriculum and in the creation of a cooperative educational and training infrastructure.

Related Posts

– RIP Big Data, by Carl Olofson, Research Vice President, Data Management Software Research, IDC. ODBMS.org, January 2016

– Open Source Software and IBM’s Big Data platform, by Cynthia M. Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory. ODBMS.org, April 2016

– Looking back at Big Data in 2015, by Cynthia M. Saracco, IBM Senior Solution Architect. ODBMS.org, November 2015

– Heuristics for a Data Scientist: A common sense approach, by Silvia Dassiè, Data Scientist at Ryanair. ODBMS.org, December 2015

– The rise of immutable data stores, by Alan Morrison, Senior Manager, PwC Center for Technology and Innovation. ODBMS.org, October 2015

Follow us on Twitter: @odbmsorg

##

On the Internet of Things. Interview with Colin Mahony http://www.odbms.org/blog/2016/03/on-the-internet-of-things-interview-with-colin-mahony/ http://www.odbms.org/blog/2016/03/on-the-internet-of-things-interview-with-colin-mahony/#comments Mon, 14 Mar 2016 08:45:56 +0000 http://www.odbms.org/blog/?p=4101

“Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data.”– Colin Mahony

I have interviewed Colin Mahony, SVP & General Manager, HPE Big Data Platform. Topics of the interview are: The challenges of the Internet of Things, the opportunities for Data Analytics, the positioning of HPE Vertica and HPE Cloud Strategy.

RVZ

Q1. Gartner says 6.4 billion connected “things” will be in use in 2016, up 30 percent from 2015.  How do you see the global Internet of Things (IoT) market developing in the next years?

Colin Mahony: As manufacturers connect more of their “things,” they have an increased need for analytics to derive insight from massive volumes of sensor or machine data. I see these manufacturers, particularly manufacturers of commodity equipment, with a need to provide more value-added services based on their ability to deliver higher levels of service and overall customer satisfaction. Data analytics platforms are key to making that happen. Also, we could see entirely new analytical applications emerge, driven by what consumers want to know about their devices, combining that data with, say, their exercise regimens, health vitals, social activities, and even driving behavior, for full personal insight.
Ultimately, the Internet of Things will drive a need for the Analyzer of Things, and that is our mission.

Q2. What Challenges and Opportunities bring the Internet of Things (IoT)? 

Colin Mahony: Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data. The reason? Traditional data warehouses were designed well before the Internet of Things, or at least before OT (operational technology) like medical devices, industrial equipment, cars, and more were connected to the Internet. So having an analytical platform that provides the scale and performance required to handle these volumes is important, but customers are taking more of a two- or three-tier approach that involves some sort of analytical processing at the edge before data is sent to an analytical data store. Apache Kafka is also becoming an important tier in this architecture, serving as a message bus that collects and pushes data from the edge in streams to the appropriate database, CRM system, or analytical platform, where, for example, fault data can be correlated over months or even years to predict and prevent part failure and optimize inventory levels.
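
A minimal Python sketch of that edge-plus-Kafka tier (using the kafka-python client; the broker address, topic name, device fields, and threshold rule are all illustrative assumptions): readings are filtered at the edge, and only the interesting events are pushed onto a Kafka topic for the downstream analytical platform to consume:

    import json
    import random
    import time
    from kafka import KafkaProducer  # kafka-python package

    producer = KafkaProducer(
        bootstrap_servers="broker.example.com:9092",           # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for _ in range(100):
        reading = {"device_id": "pump-17", "temp_c": random.gauss(70, 5), "ts": time.time()}
        # Edge-side processing: forward only readings that look anomalous, so the
        # data center is not flooded with raw sensor volume.
        if reading["temp_c"] > 80:
            producer.send("sensor-anomalies", value=reading)

    producer.flush()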

Q3. Big Data: In your opinion, what are the current main demands/needs in the market?

Colin Mahony: All organizations want – and need – to become data-driven organizations. I mean, who wants to make such critical decisions based on half answers and anecdotal data? That said, traditional companies with data stores and systems going back 30-40 years aren’t on the same level playing field as the next market disruptor that just received their series B funding and knows only that analytics is the lifeblood of their business and all their critical decisions.
The good news is that whether you are a 100-year old insurance company or the next Uber or Facebook, you can become a data-driven organization by taking an open platform approach that uses the best tool for the job and can incorporate emerging technologies like Kafka and Spark without having to bolt on or buy all of that technology from a single vendor and get locked in.  Understanding the difference between an open platform with a rich ecosystem and open source software as one very important part of that ecosystem has been a differentiator for our customers.

Beyond technology, we have customers that establish analytical centers of excellence that actually work with the data consumers – often business analysts – who run ad-hoc queries using their preferred data visualization tool to get the insight they need for their business unit or department. If the data analysts struggle, then this center of excellence, which happens to report up through IT, collaborates with them to understand and help them get to the analytical insight – rather than simply halting the queries with no guidance on how to improve.

Q4. How do you embed analytics and why is it useful? 

Colin Mahony: OEM software vendors, particularly, see the value of embedding analytics in their commercial software products or software as a service (SaaS) offerings.  They profit by creating analytic data management features or entirely new applications that put customers on a faster path to better, data-driven decision making. Offering such analytics capabilities enables them to not only keep a larger share of their customer’s budget, but at the same time greatly improve customer satisfaction. To offer such capabilities, many embedded software providers are attempting unorthodox fixes with row-oriented OLTP databases, document stores, and Hadoop variations that were never designed for heavy analytic workloads at the volume, velocity, and variety of today’s enterprise. Alternatively, some companies are attempting to build their own big data management systems. But such custom database solutions can take thousands of hours of research and development, require specialized support and training, and may not be as adaptable to continuous enhancement as a pure-play analytics platform. Both approaches are costly and often outside the core competency of businesses that are looking to bring solutions to market quickly.

Because it’s specifically designed for analytic workloads, HPE Vertica is quite different from other commercial alternatives. Vertica differs from OLTP DBMS and proprietary appliances (which typically embed row-store DBMSs) by grouping data together on disk by column rather than by row (that is, so that the next piece of data read off disk is the next attribute in a column, not the next attribute in a row). This enables Vertica to read only the columns referenced by the query, instead of scanning the whole table as row-oriented databases must do. This speeds up query processing dramatically by reducing disk I/O.
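
A toy, self-contained Python illustration of that column-versus-row trade-off (a conceptual sketch, not Vertica’s internals): answering an analytic question such as “average price” only needs to touch one column in a column-oriented layout, while a row-oriented layout forces a scan of every attribute of every row:

    import numpy as np

    n_rows = 1_000_000
    # Row-oriented layout: each record carries all attributes together.
    rows = np.zeros(n_rows, dtype=[("order_id", "i8"), ("customer", "i8"),
                                   ("price", "f8"), ("quantity", "i4")])

    # Column-oriented layout: each attribute stored contiguously on its own.
    columns = {name: rows[name].copy() for name in rows.dtype.names}

    bytes_scanned_row_store = rows.nbytes                  # whole table must be read
    bytes_scanned_col_store = columns["price"].nbytes      # only the referenced column

    print(f"row store scans    {bytes_scanned_row_store / 1e6:.1f} MB")
    print(f"column store scans {bytes_scanned_col_store / 1e6:.1f} MB")
    # The query itself is identical either way:
    print("avg price:", columns["price"].mean())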

You’ll find Vertica as the core analytical engine behind some popular products, including Lancope, Empirix, Good Data, and others as well as many HPE offerings like HPE Operations Analytics, HPE Application Defender, and HPE App Pulse Mobile, and more.

Q5. How do you make a decision when it is more appropriate to “consume and deploy” Big Data on premise, in the cloud, on demand and on Hadoop?

Colin Mahony: The best part is that you don’t need to choose with HPE. Unlike most emerging data-warehouse-as-a-service offerings, where your data is trapped in the provider’s database when your priorities or IT policies change, HPE offers the most complete range of deployment and consumption models. If you want to spin up your analytical initiative in the cloud for a proof of concept, or during the holiday shopping season for e-retailers, you can do that easily with HPE Vertica OnDemand.
If your organization finds that due to security, confidentiality, or privacy concerns you need to bring your analytical initiative back in house, then you can use HPE Vertica Enterprise on-premises without losing any customizations or disrupting your business. Have petabyte volumes of largely unstructured data where the value is unknown? Use HPE Vertica for SQL on Hadoop, deployed natively on your Hadoop cluster, regardless of the distribution you have chosen. Each consumption model, whether in the cloud, on-premises, on-demand, or using reference architectures for HPE servers, comes with that same trusted underlying core.

Q6. What are the new class of infrastructures called “composable”? Are they relevant for Big Data?

Colin Mahony: HPE believes that a new architecture is needed for Big Data – one that is designed to power innovation and value creation for the new breed of applications while running traditional workloads more efficiently.
We call this new architectural approach Composable Infrastructure. HPE has a well-established track record of infrastructure innovation and success. HPE Converged Infrastructure, software-defined management, and hyper-converged systems have consistently proven to reduce costs and increase operational efficiency by eliminating silos and freeing available compute, storage, and networking resources. Building on our converged infrastructure knowledge and experience, we have designed a new architecture that can meet the growing demands for a faster, more open, and continuous infrastructure.

Q7. What is HPE Cloud Strategy? 

Colin Mahony: Hybrid cloud adoption is continuing to grow at a rapid rate and a majority of our customers recognize that they simply can’t achieve the full measure of their business goals by consuming only one kind of cloud.
HPE Helion not only offers private cloud deployments and managed private cloud services, but we have created the HPE Helion Network, a global ecosystem of service providers, ISVs, and VARs dedicated to delivering open standards-based hybrid cloud services to enterprise customers. Through our ecosystem, our customers gain access to an expanded set of cloud services and improve their abilities to meet country-specific data regulations.

In addition to the private cloud offerings, we have a strategic and close alliance with Microsoft Azure, which enables many of our offerings, including Haven OnDemand, in the public cloud. We also work closely with Amazon because our strategy is not to limit our customers, but to ensure that they have the choices they need and the services and support they can depend upon.

Q8. What are the advantages of an offering like Vertica in this space?

Colin Mahony: More and more companies are exploring the possibility of moving their data analytics operations to the cloud. We offer HPE Vertica OnDemand, our data warehouse as a service, for organizations that need high-performance, enterprise-class data analytics for all of their data to make better business decisions now. Designed to drastically improve query performance over traditional relational database systems, HPE Vertica OnDemand is engineered from the same technology that powers the HPE Vertica Analytics Platform. For organizations that want to select Amazon hardware and still maintain control over the installation, configuration, and overall maintenance of Vertica for ultimate performance and control, we offer the Vertica AMI (Amazon Machine Image). The Vertica AMI is a bring-your-own-license model that is ideal for organizations that want the same experience as on-premises installations, only without procuring and setting up hardware. Regardless of which deployment model you choose, we have you covered for “on demand” or “enterprise cloud” options.

Q9. What is HPE Vertica Community Edition?

Colin Mahony: We have had tens of thousands of downloads of the HPE Vertica Community Edition, a freemium edition of HPE Vertica with all of the core features and functionality that you experience with our core enterprise offering. It’s completely free for up to 1 TB of data storage across three nodes. Companies of all sizes use the Community Edition to download, install, set up, and configure Vertica very quickly on x86 hardware, or use our Amazon Machine Image (AMI) for a bring-your-own-license approach to the cloud.

Q10. Can you tell us how Kiva.org, a non-profit organization, uses on-demand cloud analytics to leverage the internet and a worldwide network of microfinance institutions to help fight poverty? 

Colin Mahony: HPE is a major supporter of Kiva.org, a non-profit organization with a mission to connect people through lending to alleviate poverty. Kiva.org uses the internet and a worldwide network of microfinance institutions to enable individuals to lend as little as $25 to help create opportunity around the world. When the opportunity arose to help support Kiva.org with an analytical platform to further the cause, we jumped at it. Kiva.org relies on Vertica OnDemand to reduce capital costs, leverage the SaaS delivery model to adapt more quickly to changing business requirements, and work with over a million lenders and hundreds of field partners and volunteers across the world. To see a recorded Webinar with HPE and Kiva.org, see here.

Qx Anything else you wish to add?

Colin Mahony: We appreciate the opportunity to share the features and benefits of HPE Vertica as well as the bright market outlook for data-driven organizations. However, I always recommend that any organization struggling with how to get started on its analytics initiative speak and meet with peers to learn best practices and avoid potential pitfalls. The best way to do that, in my opinion, is to visit with the more than 1,000 Big Data experts in Boston from August 29 to September 1 at the HPE Big Data Conference. Click here to learn more and join us for 40+ technical deep-dive sessions.

————-

Colin Mahony, SVP & General Manager, HPE Big Data Platform

Colin Mahony leads the Hewlett Packard Enterprise Big Data Platform business group, which is responsible for the industry leading Vertica Advanced Analytics portfolio, the IDOL Enterprise software that provides context and analysis of unstructured data, and Haven OnDemand, a platform for developers to leverage APIs and on demand services for their applications.
In 2011, Colin joined Hewlett Packard as part of the highly successful acquisition of Vertica, and took on the responsibility of VP and General Manager for HP Vertica, where he guided the business to remarkable annual growth and recognized industry leadership. Colin brings a unique combination of technical knowledge, market intelligence, customer relationships, and strategic partnerships to one of the fastest growing and most exciting segments of HP Software.

Prior to Vertica, Colin was a Vice President at Bessemer Venture Partners focused on investments primarily in enterprise software, telecommunications, and digital media. He established a great network and reputation for assisting in the creation and ongoing operations of companies through his knowledge of technology, markets and general management in both small startups and larger companies. Prior to Bessemer, Colin worked at Lazard Technology Partners in a similar investor capacity.

Prior to his venture capital experience, Colin was a Senior Analyst at the Yankee Group serving as an industry analyst and consultant covering databases, BI, middleware, application servers and ERP systems. Colin helped build the ERP and Internet Computing Strategies practice at Yankee in the late nineties.

Colin earned an M.B.A. from Harvard Business School and a bachelor’s degree in Economics with a minor in Computer Science from Georgetown University. He is an active volunteer with Big Brothers Big Sisters of Massachusetts Bay and the Joey Fund for Cystic Fibrosis.

Resources

What’s in store for Big Data analytics in 2016, Steve Sarsfield, Hewlett Packard Enterprise. ODBMS.org, 3 FEB, 2016

What’s New in Vertica 7.2?: Apache Kafka Integration!, HPE, last edited February 2, 2016

Gartner Says 6.4 Billion Connected “Things” Will Be in Use in 2016, Up 30 Percent From 2015, Press release, November 10, 2015

The Benefits of HP Vertica for SQL on Hadoop, HPE, July 13, 2015

Uplevel Big Data Analytics with Graph in Vertica – Part 5: Putting graph to work for your business , Walter Maguire, Chief Field Technologist, HP Big Data Group, ODBMS.org, 2 Nov, 2015

HP Distributed R, ODBMS.org, 19 FEB, 2015.

Understanding ROS and WOS: A Hybrid Data Storage Model, HPE, October 7, 2015

Related Posts

On Big Data Analytics. Interview with Shilpa Lawande. Source: ODBMS Industry Watch, published on December 10, 2015

On HP Distributed R. Interview with Walter Maguire and Indrajit Roy. Source: ODBMS Industry Watch, published on April 9, 2015

Follow us on Twitter: @odbmsorg

##
