“The real problem is not collecting the data, even at insanely high speeds; the real problem is acting on it in time. This is where we have things like automatic stock trading systems. The database is integrated rather than separated from the application.” –Joe Celko.
I have interviewed Joe Celko, a well know database expert, on the challenges of Big Data and when it makes sense using Non-Relational Databases.
Q1. Three areas make today’s new data different from the data of the past: Velocity, Volume and Variety. Why?
Joe Celko: I did a keynote at a PostgreSQL conference in Prague with the title “Our Enemy, the Punch Card” on the theme that we had been mimicking the old data models with the new technology. This is no surprise; the first motion pictures were done with a single camera that never moved to mimic a seat at a theater.
Eventually, “moving picture shows” evolved into modern cinema. This is the same pattern in data. It is physically impossible to make a punch card and magnetic tape data move as fast as fiber optics, or hold as many bits. More importantly, the cost per bit dropped by orders of magnitude. Now it was practical computerize everything! And since we can do, and do it cheap, we will do it.
But what we found out that this new, computerizable (is that a word?) data is not always traditionally structured data.
Q2. What about data Veracity? Is this a problem as well?
Q3. When information is changing faster than you can collect and query it,it simply cannot be treated the same as static data. What are the solutions available to solve this problem?
Joe Celko: I have to do a disclaimer here: I have done videos for Streambase and Kx Systems.
There is an old joke about two morons trying to fix a car. Q: “Is my signal light working?” A: “Yes. No. Yes. No. Yes. No. ..” but it summaries the basic problem with streaming data. That is streaming data or “complex events” in the literature.
The model is that tables are replaced by streams of data, but the query language in Streambase is an extended SQL dialect.
The Victory of SELECT-FROM-WHERE!
The Kx products are more like C or other low level languages.
The real problem is not collecting the data, even at insanely high speeds; the real problem is acting on it in time. This is where we have things like automatic stock trading systems. The database is integrated rather than separated from the application.
Q4. Old storage and access models do not work for big data. Why?
Joe Celko: First of all, the old stuff does not hold enough data. How would you put even a day’s worth of Wal-Mart sales on punch cards? Sequential access will not work; we need parallelism. We do not have time to index the data; the traditional tree indexing requires extra time, usually O(lg2(n)). Our best bets are perfect hashing functions and special hardware.
Q5. What are different ways available to store and access data such as petabytes and exabytes?
Joe Celko: Today, we are still stuck with moving disk. Optical storage is still too expensive and slow to write.
Solid State Disk is still too expensive, but dropping fast. My dream is really cheap solid state drives that have lots of processors in the drive which monitor a small subset of the data. We send out a command “Hey, minions, find red widgets and send me your results!” and it happens all at once. The ultimate Map-Reduce model in the hardware!
Q6. Not all data can fit into a relational model, including genetic data, semantic data, and data generated by social networks. How do you handle data variety?
Joe Celko: We have graph databases for social networks. I was a math major, so I love them. Graph theory has a lot of good problems and algorithms we can steal, just like SQL uses set theory and logic. But genetic data and semantics do not have a mature theory behind them. The real way to handle the diversity is new tools, starting at the conceptual level. How many times have you seen someone write 1960′s COBOL file systems in SQL?
Q7 What are the alternative storage, query, and management frameworks needed by certain kinds of Big Data?
Joe Celko: As best you can, do not scare your existing staff with a totally new environment.
Q8. Columnar-data stores, graph-databases, streaming databases, analytic data bases. How do classify and evaluate all of these NewSQL/ NoSQL solutions available?
Joe Celko: First decide what the problem is, then pick the tool. One of my war stories was consulting at a large California company that wanted to put their labor relations law library on their new DB2 database. It was all text, and used by lawyers. Lawyers do not know SQL. Lawyers do not want to learn SQL. But they do know Lexis and WestLaw text query tools. They know labor law and the special lingo. Programmers do not know labor law. Programmers do not want to learn labor law. But the programmers can set up a textbase for the lawyers.
Q9. If you were a user, how would you select the “right” data management tools and technology for the job?
Joe Celko: There is no generic answer. Oh, there will be a better answer by the time you get into production. Welcome to IT!
Joe Celko served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards. Mr. Celko is author a series of books on SQL and RDBMS for Morgan-Kaufmann. He is an independent consultant based in Austin, Texas. He has written over 1300 columns in the computer trade and academic press, mostly dealing with data and databases.
“Joe Celko’s Complete Guide to NoSQL: What Every SQL Professional Needs to Know about Non-Relational Databases“- Paperback: 244 pages, Morgan Kaufmann; 1 edition (October 31, 2013), ISBN-10: 0124071929
“Big Data: Challenges and Opportunities” (.PDF), Roberto V. Zicari, Goethe University Frankfurt, ODBMS.org, October 5, 2012
“In a nutshell, pipelining is a programming technique that combines functions from the database system’s library of vector-based functions into an assembly line of processing for market data, with the output of one function becoming input for the next.”–Steven T. Graves.
The fourth interview in the “Big Data: three questions to “ series of interviews, is with Steven T. Graves, President and CEO McObject
Q1. What is your current product offering?
Steven T. Graves: McObject has two product lines. One is the eXtremeDB product family. eXtremeDB is a real-time embedded database system built on a core in-memory database system (IMDS) architecture, with the eXtremeDB IMDS edition representing the “standard” product. Other eXtremeDB editions offer special features and capabilities such as an optional SQL API, high availability, clustering, 64-bit support, optional and selective persistent storage, transaction logging and more.
In addition, our eXtremeDB Financial Edition database system targets real-time capital markets systems such as algorithmic trading and risk management (and has its own Web site). eXtremeDB Financial Edition comprises a super-set of the individual eXtremeDB editions (bundling together all specialized libraries such as clustering, 64-bit support, etc.) and offers features including columnar data handling and vector-based statistical processing for managing market data (or any other type of time series data).
Features shared across the eXtremeDB product family include: ACID-compliant transactions; multiple application programming interfaces (a native and type-safe C/C++ API; SQL/ODBC/JDBC; native Java, C# and Python interfaces); multi-user concurrency with an optional multi-version concurrency control (MVCC) transaction manager; event notifications; cache prioritization; and support for multiple database indexes (b-tree, r-tree, kd-tree, hash, Patricia trie, etc.). eXtremeDB’s footprint is small, with an approximately 150K code size. eXtremeDB is available for a wide range of server, real-time operating system (RTOS) and desktop operating systems, and McObject provides eXtremeDB source code for porting.
McObject’s second product offering is the Perst open source, object-oriented embedded database system, available in all-Java and all-C# (.NET) versions. Perst is small (code size typically less than 500K) and very fast, with features including ACID-compliant transactions; specialized collection classes (such as a classic b-tree implementation; r-tree indexes for spatial data; database containers optimized for memory-only access, etc.); garbage collection; full-text search; schema evolution; a “wrapper” that provides a SQL-like interface (SubSQL); XML import/export; database replication, and more.
Perst also operates in specialized environments. Perst for .NET includes support for .NET Compact Framework, Windows Phone 8 (WP8) and Silverlight (check out our browser-based Silverlight CRM demo, which showcases Perst’s support for storage on users’ local file systems). The Java edition supports the Android smartphone platform, and includes the Perst Lite embedded database for Java ME.
Q2. Who are your current customers and how do they typically use your products?
Steven T. Graves: eXtremeDB initially targeted real-time embedded systems, often residing in non-PC devices such as set-top boxes, telecom switches or industrial controllers.
There are literally millions of eXtremeDB -based devices deployed by our customers; a few examples are set-top boxes from DIRECTV (eXtremeDB is the basis of an electronic programming guide); F5 Networks’ BIG-IP network infrastructure (eXtremeDB is built into the devices’ proprietary embedded operating system); and BAE Systems (avionics in the Panavia Tornado GR4 combat jet). A recent new customer in telecom/networking is Compass-EOS, which has released the first photonics-based core IP router, using eXtremeDB High Availability to manage the device’s control plane database.
Addition of “enterprise-friendly” features (support for SQL, Java, 64-bit, MVCC, etc.) drove eXtremeDB’s adoption for non-embedded systems that demand fast performance. Examples include software-as-a-service provider hetras Gmbh (eXtremeDB handles the most performance-intensive queries in its Cloud-based hotel management system); Transaction Network Services (eXtremeDB is used in a highly scalable system for real-time phone number lookups/ routing); and MeetMe.com (formerly MyYearbook.com – eXtremeDB manages data in social networking applications).
In the financial industry, eXtremeDB is used by a variety of trading organizations and technology providers. Examples include the broker-dealer TradeStation (McObject’s database technology is part of its next-generation order execution system); Financial Technologies of India, Ltd. (FTIL), which has deployed eXtremeDB in the order-matching application used across its network of financial exchanges in Asia and the Middle East; and NSE.IT (eXtremeDB supports risk management in algorithmic trading).
Users of Perst are many and varied, too. You can find Perst in many commercial software applications such as enterprise application management solutions from the Wily Division of CA. Perst has also been adopted for community-based open source projects, including the Frost client for the Freenet global peer-to-peer network. Some of the most interesting Perst-based applications are mobile. For example, 7City Learning, which provides training for financial professionals, gives students an Android tablet with study materials that are accessed using Perst. Several other McObject customers use Perst in mobile medical apps.
Q3. What are the main new technical features you are currently working on and why?
Steven T. Graves: One feature we’re very excited about is the ability to pipeline vector-based statistical functions in eXtremeDB Financial Edition – we’ve even released a short video and a 10-page white paper describing this capability. In a nutshell, pipelining is a programming technique that combines functions from the database system’s library of vector-based functions into an assembly line of processing for market data, with the output of one function becoming input for the next.
This may not sound unusual, since almost any algorithm or program can be viewed as a chain of operations acting on data.
But this pipelining has a unique purpose and a powerful result: it keeps market data inside CPU cache as the data is being worked.
Without pipelining, the results of each function would typically be materialized outside cache, in temporary tables residing in main memory. Handing interim results back and forth “across the transom” between CPU cache and main memory imposes significant latency, which is eliminated by pipelining. We’ve been improving this capability by adding new statistical functions to the library. (For an explanation of pipelining that’s more in-depth than the video but shorter than the white paper, check out this article on the financial technology site Low-Latency.com.)
We are also adding to the capabilities of eXtremeDB Cluster edition to make clustering faster and more flexible, and further simplify cluster administration. Improvements include a local tables option, in which database tables can be made exempt from replication, but shareable through a scatter/gather mechanism. Dynamic clustering, added in our recent v. 5.0 upgrade, enables nodes to join and leave clusters without interrupting processing. This further simplifies administration for a clustering database technology that counts minimal run-time maintenance as a key benefit. On selected platforms, clustering now supports the Infiniband switched fabric interconnect and Message Passing Interface (MPI) standard. In our tests, these high performance networking options accelerated performance more than 7.5x compared to “plain vanilla” gigabit networking (TCP/IP and Ethernet).
“Some of our current priorities include: augmenting capabilities in the area of real-time analytics – especially around online operations, SQL functionality, integrations with messaging applications, statistics and monitoring procedures, and enhanced developer features.”– Ryan Betts.
The third interview in the “Big Data: three questions to “ series of interviews, is with Ryan Betts, CTO of VoltDB.
Q1. What are your current product offerings?
Ryan Betts: VoltDB is a high-velocity database platform that enables developers to build next generation real-time operational applications. VoltDB converges all of the following:
• A dynamically scalable in-memory relational database delivering high-velocity, ACID-compliant OLTP
• High-velocity data ingestion, with millions of writes per second
• Real-time analytics, to enable instant operational visibility at the individual event level
• Real-time decisioning, to enable applications to act on data when it is most valuable—the moment it arrives
Version 4.0 delivers enhanced in-memory analytics capabilities and expanded integrations. VoltDB 4.0 is the only high performance operational database that combines in-memory analytics with real-time transactional decision-making in a single system.
It gives organizations an unprecedented ability to extract actionable intelligence about customer and market behavior, website interactions, service performance and much more by performing real-time analytics on data moving at breakneck speed.
Specifically, VoltDB 4.0 features a tenfold throughput improvement of analytic queries and is capable of writes and reads on millions of data events per second. It provides large-scale concurrent, multiuser access to data, the ability to factor current incoming data into analytics, and enhanced SQL support. VoltDB 4.0 also delivers expanded integrations with an organization’s existing data infrastructure such as message queue systems, improved JDBC driver and monitoring utilities such as New Relic.
Q2. Who are your current customers and how do they typically use your products?
Ryan Betts: Customers use VoltDB for a wide variety of data-management functions, including data caching, stream processing and “on the fly” ETL.
Current VoltDB customers represent industries ranging from telecommunications to e-commerce, power & energy, financial services, online gaming, retail and more.
Following are common use cases:
• Optimized, real-time information delivery
• Personalized audience targeting
• Real-time analytics dashboards
• Caching server replacements
• Session / user management
• Network analysis & monitoring
• Ingestion and on-the-fly-ETL
Below are the customers that have been publicly announced thus far:
Q3. What are the main new technical features you are currently working on and why?
Ryan Betts: Our customers are reaping the benefits of VoltDB in the areas of transactional decision-making and generating real-time analytics on that data—right at the moment it’s coming in.
Therefore, some of our current priorities include: augmenting capabilities in the area of real-time analytics – especially around online operations, SQL functionality, integrations with messaging applications, statistics and monitoring procedures, and enhanced developer features.
Although VoltDB has proven to be the industry’s “easiest to use” database, we are also continuing to invest quite heavily in making the process of building and deploying real-time operational applications with VoltDB even easier. Among other things, we are extending the power and simplicity that we offer developers in building high throughput applications to building modest sized throughput applications.
“Begin with a clear definition of the project’s business objectives and timeline, and be sure that you have appropriate executive sponsorship. The key stakeholders need to agree on a minimal set of compelling results that will impact your business; furthermore, technical leaders need to buy into the overall feasibility of the project and bring design and implementation ideas to the table.”–Cynthia M. Saracco.
How easy is to set up a Big Data project? On this topic I have interviewed Cynthia M. Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory. Cynthia is an expert in Big Data, analytics, and emerging technologies. She has more than 25 years of software industry experience.
Q1. How best is to get started with a Big Data project?
Cynthia M. Saracco: Begin with a clear definition of the project’s business objectives and timeline, and be sure that you have appropriate executive sponsorship.
The key stakeholders need to agree on a minimal set of compelling results that will impact your business; furthermore, technical leaders need to buy into the overall feasibility of the project and bring design and implementation ideas to the table. At that point, you can evaluate your technical options for the best fit. Those options might include Hadoop, a relational DBMS, a stream processing engine, analytic tools, visualization tools, and other types of software. Often, a combination of several types of software is needed for a single Big Data project. Keep in mind that every technology has its strengths and weaknesses, so be sure you understand enough about the technologies you’re inclined to use before moving forward.
If you decide that Hadoop should be part of your project, give serious consideration to using a distribution that packages commonly needed components into a single bundle so you can minimize the time required to install and configure your environment. It’s also helpful to keep in mind the existing skills of your staff and seek out offerings that enable them to be productive quickly.
Tools, applications, and support for common scripting and query languages all contribute to improved productivity. If your business application needs to integrate with existing analytical tools, DBMSs, or other software, look for offerings that have some built-in support for that as well.
Finally, because Big Data projects can get pretty complex, I often find it helpful to segment the work into broad categories and then drill down into each to create a solid plan. Examples of common technical tasks include collecting data (perhaps from various sources), preparing the data for analysis (which can range from simple format conversions to more sophisticated data cleansing and enrichment operations), analyzing the data, and rendering or sharing the results of that analysis with business users or downstream applications. Consider scalability and performance needs in addition to your functional requirements.
Q2. What are the most common problems and challenges encountered in Big Data projects?
Cynthia M. Saracco: Lack of appropriately scoped objectives and lack of required skills are two common problems. Regarding objectives, you need to find an appropriate use case that will impact your business and tailor your project’s technical work to meet the business goals of that project efficiently. Big Data is an exciting, rapidly evolving technology area, and it’s easy to get side tracked experimenting with technical features that may not be essential to solving your business problem. While such experimentation can be fun and educational, it can also result in project delays as well as deliverables that are off target. In addition, without well-scoped business objectives, the technical staff may end up chasing a moving target.
Regarding skills, there’s high demand for data scientists, architects, and developers experienced with Big Data projects. So you may need to decide if you want to engage a service provider to supplement in-house skills or if you want to focus on growing (or acquiring) new in-house skills. Fortunately, there are a number of Big Data training options available today that didn’t exist several years ago. Online courses, conferences, workshops, MeetUps, and self-study tutorials can help motivated technical professionals expand their skill set. However, from a project management point of view, organizations need to be realistic about the time required for staff to learn new Big Data technologies. Giving someone a few days or weeks to master Hadoop and its complementary offerings isn’t very realistic. But really, I see the skills challenge as a point-in-time issue. Many people recognize the demand for Big Data skills and are actively expanding their skills, so supply will grow.
Q3. Do you have any metrics to define how good is the “value” that can be derived by analyzing Big Data?
Cynthia M. Saracco: Most organizations want to focus on their return on investment (ROI). Even if your Big Data solution uses open source software, there are still expenses involved for designing, developing, deploying, and maintaining your solution. So what did your business gain from that investment?
The answer to that question is going to be specific to your application and your business. For example, if a telecommunications firm is able to reduce customer churn by 10% as a result of a Big Data project, what’s that worth? If an organization can improve the effectiveness of an email marketing campaign by 20%, what’s that worth? If an organization can respond to business requests twice as quickly, what’s that worth? Many clients have these kinds of metrics in mind as they seek to quantify the value they have derived — or hope to derive — from their investment in a Big Data project.
Q4. Is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?
Cynthia M. Saracco: More often, I’ve seen Hadoop used to augment or extend traditional forms of analytical processing, such as OLAP, rather than completely replace them. For example, Hadoop is often deployed to bring large volumes of new types of information into the analytical mix — information that might have traditionally been ignored or discarded. Log data, sensor data, and social data are just a few examples of that. And yes, preparing that data for analysis is certainly one of the tasks for which Hadoop is used.
Q4. IBM is offering BigInsights and Big SQL? What is it?
Cynthia M. Saracco: InfoSphere BigInsights is IBM’s Hadoop-based platform for analyzing and managing Big Data. It includes Hadoop, a number of complementary open source projects (such as HBase, Hive, ZooKeeper, Flume, Pig, and others) and a number of IBM-specific technologies designed to add value.
Big SQL is part of BigInsights. It’s IBM’s SQL interface to data stored in BigInsights. Users can create tables, query data, load data from various sources, and perform other functions. For a quick introduction to Big SQL, read this article.
Q5. How does it compare to RDBMS technology? When’s it most useful?
Cynthia M. Saracco: Big SQL provides standard SQL-based query access to data managed by BigInsights. Query support includes joins, unions, sub-queries, windowed aggregates, and other popular capabilities. Because Big SQL is designed to exploit the Hadoop ecosystem, it introduces Hadoop-specific language extensions for certain SQL statements.
For example, Big SQL supports Hive and HBase for storage management, so a Big SQL CREATE TABLE statement might include clauses related to data formats, field delimiters, SerDes (serializers/deserializers), column mappings, column families, etc. The article I mentioned earlier has some examples of these, and the product InfoCenter has further details.
In many ways, Big SQL can serve as an easy on-ramp to Hadoop for technical professionals who have a relational DBMS background. Big SQL is good for organizations that want to exploit in-house SQL skills to work with data managed by BigInsights. Because Big SQL supports JDBC and ODBC, many traditional SQL-based tools can work readily with Big SQL tables, which can also make Big Data easier to use by a broader user community.
However, Big SQL doesn’t turn Hadoop — or BigInsights — into a relational DBMS. Commercial relational DBMSs come with built-in, ACID-based transaction management services and model data largely in tabular formats. They support granular levels of security via SQL GRANT and REVOKE statements. In addition, some RDBMSs support 3GL applications developed in “legacy” programming languages such as COBOL. These are some examples of capabilities aren’t part of Big SQL.
Q6. What are some of its current limitations?
Cynthia M. Saracco: The current level of Big SQL included in BigInsights V184.108.40.206 enables users to create tables but not views.
Date/time data is supported through a full TIMESTAMP data type, and some common SQL operations supported by relational DBMSs aren’t available or have specific restrictions.
Examples include INSERT, UPDATE, DELETE, GRANT, and REVOKE statements. For more details on what’s currently supported in Big SQL, skim through the InfoCenter.
Q7. How BigInsights differs from / adds value to open source Hadoop?
Cynthia M. Saracco: As I mentioned earlier, BigInsights includes a number of IBM-specific technologies designed to add value to the open source technologies included with the product. Very briefly, these include:
- A Web console with administrative facilities, a Web application catalog, customizable dashboards, and other features.
- A text analytic engine and library that extracts phone numbers, names, URLs, addresses, and other popular business artifacts from messages, documents, and other forms of textual data.
- Big SQL, which I mentioned earlier.
- BigSheets, a spreadsheet-style tool for business analysts.
- Web-accessible sample applications for importing and exporting data, collecting data from social media sites, executing ad hoc queries, and monitoring the cluster. In addition, application accelerators (tool kits with dozens of pre-built software articles) are available for those working with social data and machine data.
- Eclipse tooling to speed development and testing of BigInsights applications, new text extractors, BigSheets functions, SQL-based applications, Java applications, and more.
- An integrated installation tool that installs and configures all selected components across the cluster and performs a system-wide health check.
- Connectivity to popular enterprise software offerings, including IBM and non-IBM RDBMSs.
- Platform enhancements focusing on performance, security, and availability. These include options to use with an alternative, POSIX-compliant distributed file system (GPFS-FPO) and an alternative MapReduce layer (Adaptive MapReduce) that features Platform Symphony’s advanced job scheduler, workload manager, and other capabilities.
You might wonder what practical benefits these kinds of capabilities bring. While that varies according to each organization’s usage patterns, one industry analyst study concluded that BigInsights lowers total cost of ownership (TCO) by an average of 28% over a three-year period compared with an open source-only implementation.
Finally, a number of IBM and partner offerings support BigInsights, which is something that’s important to organizations that want to integrate a Hadoop-based environment into their broader IT infrastructure. Some examples of IBM products that support BigInsights include DataStage, Cognos Business Intelligence, Data Explorer, and InfoSphere Streams.
Q8. Could you give some examples of successful Big Data projects?
Cynthia M. Saracco: I’ll summarize a few that have been publicly discussed so you can follow links I provide for more details. An energy firm launched a Big Data project to analyze large volumes of data that could help it improve the placement of new wind turbines and significantly reduce response time to business user requests.
A financial services firm is using Big Data to process large volumes of text data in minutes and offer its clients more comprehensive information based on both in-house and Internet-based data.
An online marketing firm is using Big Data to improve the performance of its clients email campaigns.
And other firms are using Big Data to detect fraud, assess risk, cross-sell products and services, prevent or minimize network outages, and so on. You can find a collection of videos about Big Data projects undertaken by various organizations; many of these videos feature users speaking directly about their Big Data experiences and the results of their projects.
And a recent report on Analytics: The real-world use of big data contains further examples. based the results of a survey of more than 1100 businesses that the Said Business School at the University of Oxford conducted with IBM’s Institute for Business Value.
Qx Anything else to add?
Cynthia M. Saracco: Hadoop isn’t the only technology relevant to managing and analyzing Big Data, and IBM’s Big Data software portfolio certainly includes more than BigInsights (its Hadoop-based offering). But if you’re a technologist who wants to learn more about Hadoop, your best bet is to work with the software. You’ll find a number of free online courses in the public domain, such as those at Big Data University. And IBM offers a free copy of its Quick Start Edition of BigInsights as a VMWare image or an installable image to help you get started with minimal effort.
Cynthia M. Saracco is a senior solutions architect at IBM’s Silicon Valley Laboratory, specializing in Big Data, analytics, and emerging technologies. She has more than 25 years of software industry experience, has written three books and more than 70 technical papers, and holds six patents.
- What’s the big deal about Big SQL? by Cynthia M. Saracco , Senior Software Engineer, IBM, and Uttam Jain, Software Architect, IBM.
- ODBMS.org: Free resources on Big Data and Analytical Data Platforms:
| Blog Posts | Free Software| Articles| Lecture Notes | PhD and Master Thesis|
“We are investing heavily in bringing SQL as the standard interface for accessing in real time (GemFire XD) and interactive (HAWQ) response times enabling enterprises to leverage their existing workforce for Hadoop processing.”–Susheel Kaushik.
I start this new year with a new series of short interviews to leading vendors of Big Data technologies. I call them “Big Data: three questions to“. The second of such interviews is with Susheel Kaushik, Senior Director, Product Management at Pivotal.
Q1. What is your current products offering?
Susheel Kaushik: Pivotal suite of products converge Apps, Data and Analytics for the enterprise customers.
Industry leading application frameworks and runtimes focused on enterprise needs. Pivotal App frameworks provide a rich set of product components that enables rapid application development including support for messaging, database services and robust analytic and visualization instrumentation.
• Pivotal tc Server: Lean, Powerful Apache Tomcat compatible application server that maximizes performance, scales easily, and minimizes cost and overhead.
• Pivotal Web Server: High Performance, Scalable and Secure HTTP server.
• Pivotal Rabbit MQ: Fast and dependable message server that supports a wide range of use cases including reliable integration, content based routing and global data delivery, and high volume monitoring and data ingestion.
• Spring: Takes the complexity out of Enterprise Java.
• vFabric: Provides a proven runtime platform for your Spring applications.
Disruptive Big Data products – MPP & Column store database, in memory data processing and Hadoop
• Pivotal Greenplum Database: A massively parallel platform for large-scale data analytics warehouse to manage, store and analyze petabytes of data.
• Pivotal GemFire: A real time distributed data store that with linear scalability and continuous uptime capabilities.
• Pivotal HD with HAWQ and GemFire XD: Commercially supported Apache Hadoop. HAWQ brings Enterprise class SQL capabilities and GemFireXD brings real time data access to Hadoop.
• Pivotal CF: Next generation enterprise PaaS – Pivotal CF makes applications the new unit of deployment and control (not VMs or middleware), radically improving developer productivity and operator agility.
Accelerate and help enterprise extract insights from their data assets. Pivotal analytic products offer advanced query and visualization capabilities to business analysts.
Q2. Who are your current customers and how do they typically use your products?
Susheel Kaushik: We have customers in all business verticals – Finance, Telco, Manufacturing, Energy, Medical, Retail to name a few.
Some of the typical uses of the products are:
• Big Data Store: Today, we find enterprises are NOT saving all of the data – cost efficiency is one of the reasons. Hadoop brings the price of the storage tier to a point where storing large amounts of data is not cost prohibitive. Enterprises now have mandates to not throw away any data in the hope that they can later unlock the potential insights from the data.
• Extend life of Existing EDW systems: Today most of the EDW system are challenged on the storage and processing aspects to provide a cost effective solution internally. Most of the data stored in EDW is not analyzed and the Pivotal Big Data products provide a platform for the customers to offload some of the data storage and the analytics processing. This offloaded processing, typically ETL like workloads, is ideal for the Big Data platforms. As a result, the processing times are reduced and the ETL relieved EDW now has excess capacity to satisfy the needs for some more years – thereby extending the life.
• Data Driven Applications: Some of the advanced enterprises already have peta-byte of varying formats data and are looking to derive insights from the data in real time/interactive time. These customers are building scalable applications leveraging the insights to assist business in decisioning (automated/manual).
In addition, Customers prefer the deployment choice provided from the Pivotal products, some prefer the bare metal infrastructure whereas some prefer the cloud deployment (on premise or the public clounds).
Q3. What are the main new technical features you are currently working on and why?
Susheel Kaushik: Here are some of the key technical features we are working on.
1. Better integration with HDFS
a. HDFS is becoming a cost effective storage interface for the enterprise customers. Pivotal is investing making the integration with HDFS even better. Enterprise customers demand security and performance from HDFS and we are actively investing in these capabilities.
In addition, storing the data in a single platform reduces the data duplication costs along with the data management costs to manage the multiple copies.
2. Integration with other Open Source projects
a. We are investing in Spring and Cloud foundry to integrate better with Hadoop. Spring and Cloud Foundry have a healthy eco system already. Making Hadoop easier to use for these users increases the talent pool available to build next generation data applications for Hadoop data.
3. SQL as a standard interface
a. SQL is the most expressive language for data analysis and enterprise customers have already made massive investments in training their workforce on SQL. We are investing heavily in bringing SQL as the standard interface for accessing in real time (GemFire XD) and interactive (HAWQ) response times enabling enterprises to leverage their existing workforce for Hadoop processing.
4. Improved Manageability and Operability
a. Managing and operating clusters is not easy for Hadoop and some of our enterprises do not have in-house capabilities to build/manage these large scale clusters. We are innovating to provide a simplified interface to manage and operate these clusters.
5. Improved Quality of Service
a. Resource contention is a challenge in any multi-tenant environment. We are actively working to make resource sharing in a multi-tenant environment easier. We already have products in the portfolio (MoreVRP) that allow customers to do fine grain control at CPU and IO level. We are making active investments to bring this capability across multiple processing paradigms.
ODBMS.org: Free resources on Big Data Analytics, NewSQL, NoSQL, Object Database Vendors: Blog Posts| Commercial | Open Source|
“The absence of a schema has some flexibility advantages, although for querying the data, the absence of a schema presents some challenges to people accustomed to a classic RDBMS. “–Iran Hutchinson.
I start this new year with a new series of short interviews to leading vendors of Big Data technologies. I call them “Big Data: three questions to“. The first of such interviews is with Iran Hutchinson, Big Data Specialist at InterSystems.
Q1. What is your current “Big Data” products offering?
Iran Hutchinson: InterSystems has actually been in the Big Data business for some time, since 1978, long before anyone called it that. We currently offer an integrated database, integration and analytics platform based on InterSystems Caché®, our flagship product, to enable Big Data breakthroughs in a variety of industries.
Launched in 1997, Caché is an advanced object database that provides in-memory speed with persistence, and the ability to ingest huge volumes of transactional data at insanely high velocity. It is massively scalable, because of its very lean design. Its efficient multidimensional data structures require less disk space and provide faster SQL performance than relational databases. Caché also provides sophisticated analytics, enabling real-time queries against transactional data with minimal maintenance and hardware requirements.
InterSystems Ensemble® is our seamless platform for integrating and developing connected applications. Ensemble can be used as a central processing hub or even as backbone for nationwide networks. By integrating this connectivity with our high-performance Caché database, as well as with new technologies for analytics, high-availability, security, and mobile solutions, we can deliver a rock-solid and unified Big Data platform, not a patchwork of disparate solutions.
We also offer additional technologies built on our integrated platform, such as InterSystems HealthShare®, a health informatics platform that enables strategic interoperability and analytics for action. Our TrakCare unified health information system is likewise built upon this same integrated framework.
Q2. Who are your current customers and how do they typically use your products?
Iran Hutchinson: We continually update our technology to enable customers to better manage, ingest and analyze Big Data. Our clients are in healthcare, financial services, aerospace, utilities – industries that have extremely demanding requirements for performance and speed. For example, Caché is the world’s most widely used database in healthcare. Entire countries, such as Sweden and Scotland, run their national health systems on Caché, as well as top hospitals and health systems around the world. One client alone runs 15 percent of the world’s equity trades through InterSystems software, and all of the top 10 banks use our products.
It is also being used by the European Space Agency to map a billion stars – the largest data processing task in astronomy to date. (See The Gaia Mission One Year Later.)
Our configurable ACID (Atomicity Consistency Isolation Durability) capabilities and ECP-based approach enable us to handle these kinds of very large-scale, very high-performance, transactional Big Data applications.
Q3. What are the main new technical features you are currently working on and why?
Iran Hutchinson: There are several new paradigms we are focusing on, but let’s focus on analytics. Once you absorb all that Big Data, you want to run analytics. And that’s where the three V’s of Big Data – volume, velocity and variety – are critically important.
Let’s talk about the variety of data. Most popular Big Data analytics solutions start with the assumption of structured data – rows and columns – when the most interesting data is unstructured, or text-based data. A lot of our competitors still struggle with unstructured data, but we solved this problem with Caché in 1997, and we keep getting better at it. InterSystems Caché offers both vertical and horizontal scaling, enabling schema-less and schema-based (SQL) querying options for both structured and unstructured data.
As a result, our clients today are running analytics on all their data – and we mean real-time, operational data, not the data that is aggregated a week later or a month later for boardroom presentations.
A lot of development has been done in the area of schema-less data stores or so-called document stores, which are mainly key-value stores. The absence of a schema has some flexibility advantages, although for querying the data, the absence of a schema presents some challenges to people accustomed to a classic RDBMS. Some companies now offer SQL querying on schema-less data stores as an add-on or plugin. InterSystems Caché provides a high-performance key-value store with native SQL support.
The commonly available SQL-based solutions also require a predefinition of what the user is interested in. But if you don’t know the data, how do you know what’s interesting? Embedded within Caché is a unique and powerful text analysis technology, called iKnow, that analyzes unstructured data out of the box, without requiring any predefinition through ontologies or dictionaries. Whether it’s English, German, or French, iKnow can automatically identify concepts and understand their significance – and do that in real-time, at transaction speeds.
iKnow enables not only lightning-fast analysis of unstructured data, but also equally efficient Google-like keyword searching via SQL with a technology called iFind.
And because we married that iKnow technology with another real-time OLAP-type technology we call DeepSee, we make it possible to embed this analytic capability into your applications. You can extract complex concepts and build cubes on both structured AND unstructured data. We blend keyword search and concept discovery, so you can express a SQL query and pull out both concepts and keywords on unstructured data.
Much of our current development activity is focused on enhancing our iKnow technology for a more distributed environment.
This will allow people to upload a data set, structured and/or unstructured, and organize it in a flexible and dynamic way by just stepping through a brief series of graphical representation of the most relevant content in the data set. By selecting, in the graphs, the elements you want to use, you can immediately jump into the micro-context of these elements and their related structured and unstructured information objects. Alternately, you can further segment your data into subsets that fit the use you had in mind. In this second case, the set can be optimized by a number of classic NLP strategies such as similarity extension, typicality pattern parallelism, etc. The data can also be wrapped into existing cubes or into new ones, or fed into advanced predictive models.
So our goal is to offer our customers a stable solution that really uses both structured and unstructured data in a distributed and scalable way. We will demonstrate the results of our efforts in a live system at our next annual customer conference, Global Summit 2014.
We also have a software partner that has built a very exciting social media application, using our analytics technology. It’s called Social Knowledge, and it lets you monitor what people are saying on Twitter and Facebook – in real-time. Mind you, this is not keyword search, but concept analysis – a very big difference. So you can see if there’s a groundswell of consumer feedback on your new product, or your latest advertising campaign. Social Knowledge can give you that live feedback – so you can act on it right away.
In summary, today InterSystems provides SQL and DeepSee over our shared data architecture to do structured data analysis.
And for unstructured data, we offer iKnow semantic analysis technology and iFind, our iKnow-powered search mechanism, to enable information discovery in text. These features will be enabled for text analytics in future versions of our shared-nothing data architectures.
ODBMS.org: Big Data Analytics, NewSQL, NoSQL, Object Database Vendors –Free Resources.
Follow ODBMS.org on Twitter: @odbmsorg
“Going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time.”–Nick Heudecker.
Gartner recently published a new report on “Operational Database Management Systems”. I have interviewed one of the co-authors of the report, Nick Heudecker, Research Director – Information Management at Gartner, Inc.
Happy Holidays and Happy New Year to you and yours!
Q1. You co-authored Gartner’s new report, “Magic Quadrant for Operational Database Management Systems”. How do you define “Operational Database Management Systems” (ODBMS)?
Nick Heudecker: Prior to operational DBMS, the common label for databases was OLTP. However, OLTP no longer accurately describes the range of activities an operational DBMS is called on to support. Additionally, mobile and social, elements of Gartner’s Nexus of Forces, have created new activity types which we broadly classify as interactions and observations. Supporting these new activity types has resulted in new vendors entering the market to compete with established vendors. Also, OLTP is no longer valid as all transactions are on-line.
Q2. What were the main evaluation criteria you used for the “Magic Quadrant for Operational Database Management Systems” report?
Nick Heudecker: The primary evaluation criteria for any Magic Quadrant consists of customer reference surveys. Vendors are also evaluated on market understanding, strategy, offerings, business model, execution, and overall viability.
Q3. To be included in the Magic Quadrant, what were the criteria that vendors and products had to meet?
Nick Heudecker: To be included in the Magic Quadrant, vendors had to have at least ten customer references, meet a minimum revenue number and meet our definition
of the market.
Q4. What is new in the last year in the Operational Database Management Systems area, in your view? What is changing?
Nick Heudecker: Innovations in the operational DBMS area have developed around flash memory, DRAM improvements, new processor technology, networking and appliance form factors. Flash memory devices have become faster, larger, more reliable and cheaper. DRAM has become far less costly and grown in size to greater than 1TB available on a server.
This has not only enabled larger disk caching, but also led to the development and wider use of in-memory DBMSs. New processor technology not only enables better DBMS performance in physically smaller servers, but also allows virtualization to be used for multiple applications and the DBMS on the same server. With new methods of interconnect such as 10-gigabit Ethernet and Infiniband, the connection between the storage systems and the DBMS software on the server is far faster. This has also increased performance and allowed for larger storage in a smaller space and faster interconnect for distributed data in a scale-out architecture. Finally, DBMS appliances are beginning to gain acceptance.
Q5. You also co-authored Gartner’s “Who’s Who in NoSQL Databases” report back in August. What is the current status of the NoSQL market in your opinion?
Nick Heudecker: There is a substantial amount of interest in NoSQL offerings, but also a great deal of confusion related to use cases and how vendor offerings are differentiated.
One question we get frequently is if NoSQL DBMSs are viable candidates to replace RDBMSs. To date, NoSQL deployments have been overwhelmingly supplemental to traditional relational DBMS deployments, not destructive.
Q6. How does the NoSQL market relate to the Operational Database Management Systems market?
Nick Heudecker: First, it’s difficult to define a NoSQL market. There are four distinct categories of NoSQL DBMS (document, key-value, table-style and graph), each with different capabilities and addressable use cases. That said, the various types of NoSQL DBMSs are included in the operational DBMS market based on capabilities around interactions and observations.
Q8. What do you see happening with Operational Database Management Systems, going forward?
Nick Heudecker: Going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time.
Nick Heudecker is a Research Director for Gartner Inc, covering information management topics and specializing in Big Data and NoSQL.
Prior to Gartner, Mr. Heudecker worked with several Bay Area startups and developed an enterprise software development consulting practice. He resides in Silicon Valley.
Follow ODBMS.org on Twitter: @odbmsorg
“We are facing an imminent torrent of machine generated data, creating volumes that will break the back of conventional hardware and software architectures. It is no longer be feasible to move the data to the compute process – the compute process has to be moved to the data” –Mike Hoskins.
On the topic, Challenges and Opportunities for Big Data, I have interviewed Mike Hoskins, Actian Chief Technology Officer.
Q1. What are in your opinion the most interesting opportunities in Big Data?
Mike Hoskins: Until recently, most data projects were solely focused on preparation. Seminal developments in the big data landscape, including Hortonworks Data Platform (HDP) 2.0 and the arrival of YARN (Yet Another Resource Negotiator) – which takes Hadoop’s capabilities in data processing beyond the limitations of the highly regimented and restrictive MapReduce programming model – provides an opportunity to move beyond the initial hype of big data and instead towards the more high-value work of predictive analytics.
As more big data applications are built on the Hadoop platform customized by industry and business needs, we’ll really begin to see organizations leveraging predictive analytics across the enterprise – not just in a sandbox or in the domain of the data scientists, but in the hands of the business users. At that point, more immediate action can be taken on insights.
Q2. What are the most interesting challenges in Big Data?
Mike Hoskins: We are facing an imminent torrent of machine generated data, creating volumes that will break the back of conventional hardware and software architectures. It is no longer be feasible to move the data to the compute process – the compute process has to be moved to the data. Companies need to rethink their static and rigid business intelligence and analytic software architectures in order to continue working at the speed of business. It’s clear that time has become the new gold standard – you can’t produce more of it; you can only increase the speed at which things happen.
Software vendors with the capacity to survive and thrive in this environment will keep pace with the competition by offering a unified platform, underpinned by engineering innovation, completeness of solution and the service integrity and customer support that is essential to market staying power.
Q3. Steve Shine, CEO and President, Actian Corporation, said in a recent interview (*) that “the synergies in data management come not from how the systems connect but how the data is used to derive business value”. Actian has completed a number of acquisitions this year. So, what is your strategy for Big Data at Actian?
Mike Hoskins: Actian has placed its bets on a completely modern unified platform that is designed to deliver on the opportunities presented by the Age of Data. Our technology assets bring a level of maturity and innovation to the space that no other technology vendor can provide – with 30+ years of expertise in ‘all things data’ and over $1M investment in innovation. Our mission is to arm organizations with solutions that irreversibly shift the price/performance curve beyond the reach of traditional legacy stack players, allowing them to get a leg up on the competition, retain customers, detect fraud, predict business trends and effectively use data as their most important asset.
Q4. What are the products synergies related to such a strategy?
Mike Hoskins: Through the acquisition of Pervasive Software (a provider of big data analytics and cloud-based and on-premises data management and integration), Versant (an industry leader in specialized data management), and ParAccel (a leader in high-performance analytics), Actian has compiled a unified end-to-end platform with capabilities to connect, prep, optimize and analyze data natively on Hadoop, and then offer it to the necessary reporting and analytics environments to meet virtually any business need. All the while, operating on commodity hardware at a much lower cost than legacy software can ever evolve to.
Q5. What else still need to be done at Actian to fully deploy this strategy?
Mike Hoskins: There are definitely opportunities to continue integrating the platform experience and improve the user experience overall. Our world-class database technology can be brought closer to Hadoop, and we will continue innovating on analytic techniques to grow our stack upward.
Our development team is working diligently to create a common user interface across all of our platforms, as we bring out technology together. We have the opportunity to create a true first-class SQL engine running natively Hadoop, and to more fully exploit market leading cooperative computing with our On-Demand Integration (ODI) capabilities. I would also like to raise the awareness of the power and speed of our offerings as a general paradigm for analytic applications.
We don’t know what new challenges the Age of Data will bring, but we will continue to look to the future and build out a technology infrastructure to help organizations deal with the only constant – change.
Q6. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?
Mike Hoskins: Elastic cloud computing is a convulsive game changer in the marketplace. It’s positive; if not where you do full production, at the very least it allows people to test, adopt and experiment with their data in a way that they couldn’t before. For cases where data is born in the cloud, using a 100% cloud model makes sense. However, much data is highly distributed in cloud and on-premises systems and applications, so it’s vital to have technology that can run and connect to either environments via a hybrid model.
We will soon see more organizations utilizing cloud platforms to run analytic processes, if that is where their data is born and lives.
Q7. How is your Cloud technology helping Amazon`s Redshift?
Mike Hoskins: Amazon Redshift leverages our high-performance analytics database technology to help users get the most out of their cloud investment. Amazon selected our technology over all other database and data warehouse technologies available in the marketplace because of the incredible performance, extreme scalability, and flexibility.
Q8. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey.
When you speak with your customers what are the typical use cases and requirements they have?
Mike Hoskins: A recent survey of data architects and CIOs by Sand Hill Group revealed that the top challenge of Hadoop adoption was knowledge and experience with the Hadoop platform, followed by the availability of Hadoop and big data skills, and finally the amount of technology development required to implement a Hadoop-based solution. This just goes to show how little we have actually begun to fully leverage the capabilities of Hadoop. Businesses are really only just starting to dip their toe in the analytic water. Although it’s still very early, the majority of use cases that we have seen are centered around data prep and ETL.
Q9. What do you think is still needed for big data analytics to be really useful for the enterprise?
Mike Hoskins: If we look at the complete end-to-end data pipeline, there are several things that are still needed for enterprises to take advantage of the opportunities. This includes high productivity, performant integration layers, and analytics that move beyond the sphere of data science and into mainstream business usage, with discovery analytics through a simple UI studio or an analytics-as-a-service offering. Analytics need to be made more available in the critical discovery phase, to bring out the outcomes, patterns, models, discoveries, etc. and begin applying them to business processes.
Qx. Anything else you wish to add?
Mike Hoskins: These kinds of highly disruptive periods are, frankly, unnerving for the marketplace and businesses. Organizations cannot rely on traditional big stack vendors, who are unprepared for the tectonic shift caused by big data, and therefore are not agile enough to rapidly adjust their platforms to deliver on the opportunities. Organizations are forced to embark on new paths and become their own System Integrators (SIs).
On the other hand, organizations cannot tie their future to the vast number of startups, throwing darts to find the one vendor that will prevail. Instead, they need a technology partner somewhere in the middle that understands data in-and-out, and has invested completely and wholly as a dedicated stack to help solve the challenge.
Although it’s uncomfortable, it is urgent that organizations look at modern architectures, next-generation vendors and innovative technology that will allow them to succeed and stay competitive in the Age of Data.
Mike Hoskins, Actian Chief Technology Officer
Actian CTO Michael Hoskins directs Actian’s technology innovation strategies and evangelizes accelerating trends in big data, and cloud-based and on-premises data management and integration. Mike, a Distinguished and Centennial Alumnus of Ohio’s Bowling Green State University, is a respected technology thought leader who has been featured in TechCrunch, Forbes.com, Datanami, The Register and Scobleizer. Mike has been a featured speaker at events worldwide, including Strata NY + Hadoop World 2013, the keynoter at DeployCon 2012, the “Open Standards and Cloud Computing” panel at the Annual Conference on Knowledge Discovery and Data Mining, the “Scaling the Database in the Cloud” panel at Structure 2010, and the “Many Faces of Map Reduce – Hadoop and Beyond” panel at Structure Big Data 2011. Mike received the AITP Austin chapter’s 2007 Information Technologist of the Year Award for his leadership in developing Actian DataRush, a highly parallelized framework to leverage multicore. Follow Mike on Twitter: @MikeHSays.
-ActianVectorwise 3.0: Fast Analytics and Answers from Hadoop. Actian Corporation
Paper | Technical | English | DOWNLOAD(PDF)| May 2013|
“My experience overall with almost all open-source tools has been very positive: open source tools are very high quality, well documented, and if you get stuck there is a helpful and responsive community on mailing lists or Stack Exchange and similar sites.” —Dr. Jochen L. Leidner.
Q1. What is your current activity at Thomson Reuters?
Jochen L. Leidner: For the most part, I carry out applied research in information access, and that’s what I have been doing for quite a while. After five years with the company – I joined from the University of Edinburgh, where I had been a postdoctoral Royal Society of Edinburgh Enterprise Fellow half a decade ago – I am currently a Lead Scientist with Thomson Reuters, where I am building up a newly-established London site part of our Corporate Research & Development group (by the way: we are hiring!). Before that, I held research and innovation-related roles in the USA and in Switzerland.
Let me say a few words about Thomson Reuters before I go more into my own activities, just for background. Thomson Reuters has around 50,000 employees in over 100 countries and sells information to professionals in many verticals, including finance & risk, legal, intellectual property & scientific, tax & accounting.
Our headquarters are located at 3 Time Square in the city of New York, NY, USA.
Most people know our REUTERS brand from reading their newspapers (thanks to our highly regarded 3,000+ journalists at news desks in about 100 countries, often putting their lives at risk to give us reliable reports of the world’s events) or receiving share price information on the radio or TV, but as a company, we are also involved in as diverse areas as weather prediction (as the weather influences commodity prices) and determining citation impact of academic journals (which helps publishers sell their academic journals to librarians), or predicting Nobel prize winners.
My research colleagues and I study information access and especially means to improve it, using including natural language processing, information extraction, machine learning, search engine ranking, recommendation system and similar areas of investigations.
We carry out a lot of contract research for internal business units (especially if external vendors do not offer what we need, or if we believe we can build something internally that is lower cost and/or better suited to our needs), feasibility studies to de-risk potential future products that are considered, and also more strategic, blue-sky research that anticipates future needs. As you would expect, we protect our findings and publish them in the usual scientific venues.
Q2. Do you use Data Analytics at Thomson Reuters and for what?
Jochen L. Leidner: Note that terms like “analytics” are rather too broad to be useful in many instances; but the basic answer is “yes”, we develop, apply internally, and sell as products to our customers what can reasonably be called solutions that incorporate “data analytics” functions.
One example capability that we developed is the recommendation engine CaRE (Al-Kofahi et al., 2007), which was developed by our group, Corporate Research & Development, which is led by our VP of Research & Development, Khalid al-Kofahi. This is a bit similar in spirit to Amazon’s well-known book recommendations. We used it to recommend legal documents to attorneys, as a service to supplement the user’s active search on our legal search engine with “see also..”-type information. This is an example of a capability developed in-house that also something that made it into a product, and is very popular.
Thomson Reuters is selling information services, often under a subscription model, and for that it is important to have metrics available that indicate usage, in order to inform our strategy. So another example for data analytics is that we study how document usage can inform personalization and ranking, of from where documents are accessed, and we use this to plan network bandwidth and to determine caching server locations.
A completely different example is citation information: Since 1969, when our esteemed colleague Eugene Garfield (he is now officially retired, but is still active) came up with the idea of citation impact, our Scientific business division is selling the journal citation impact factor – an analytic that can be used as a proxy for the importance of a journal (and, by implication, as an argument to a librarian to purchase a subscription of that journal for his or her university library).
Or, to give another example from the financial markets area, we are selling predictive models (Starmine) that estimate how likely it is whether a given company goes bankrupt within the next six months.
Q3. Do you have Big Data at Thomson Reuters? Could you please give us some examples of Big Data Use Cases at your company?
Jochen L. Leidner: For most definitions of “big”, yes we do. Consider that we operate a news organization, which daily generates in the tens of thousands of news reports (if we count all languages together). Then we have photo journalists who create large numbers of high-quality, professional photographs to document current events visually, and videos comprising audio-visual storytelling and interviews. We further collect all major laws, statutes, regulations and legal cases around in major jurisdictions around the world, enrich the data with our own meta-data using both manual expertise and automatic classification and tagging tools to enhance findability. We hold collections of scientific articles and patents in full text and abstracts.
We gather, consolidate and distribute price information for financial instruments from hundreds of exchanges around the world. We sell real-time live feeds as well as access to decades of these time series for the purpose of back-testing trading strategies.
Q4. What “value” can be derived by analyzing Big Data at Thomson Reuters?
Jochen L. Leidner: This is the killer question: we take the “value” very much without the double quotes – big data analytics lead to cost savings as well as generate new revenues in a very real, monetary sense of the word “value”. Because our solutions provide what we call “knowledge to act” to our customers, i.e., information that lets them make better decisions, we provide them with value as well: we literally help our customers save the cost of making a wrong decision.
Q5. What are the main challenges for big data analytics at Thomson Reuters ?
Jochen L. Leidner: I’d say these are absolute volume, growth, data management/integration, rights management, and privacy are some of the main challenges.
One obvious challenge is the size of the data. It’s not enough to have enough persistent storage space to keep it, we also need backup space, space to process it, caches and so on – it all adds up. Another is the growth and speed of that growth of the data volume. You can plan for any size, but it’s not easy to adjust your plans if unexpected growth rates come along.
Another challenge that we must look into is the integration between our internal data holdings, external public data (like the World Wide Web, or commercial third-party sources – like Twitter – which play an important role in the modern news ecosystem), and customer data (customers would like to see their own internal, proprietary data be inter-operable with our data). We need to respect the rights associated with each data set, as we deal with our own data, third party data and our customers’ data. We must be very careful regarding privacy when brainstorming about the next “big data analytics” idea – we take privacy very seriously and our CPO joined us from a government body that is in charge of regulating privacy.
Q6. How do you handle the Big Data Analytics “process” challenges with deriving insight?
Jochen L. Leidner: Usually analytics projects happen as an afterthought to leverage existing data created in a legacy process, which means not a lot of change of process is needed at the beginning. This situation changes once there is resulting analytics output, and then the analytics-generating process needs to be integrated with the previous processes.
Even with new projects, product managers still don’t think of analytics as the first thing to build into a product for a first bare-bones version, and we need to change that; instrumentation is key for data gathering so that analytics functionality can build on it later on.
In general, analytics projects follow a process of (1) capturing data, (2) aligning data from different sources (e.g., resolving when two objects are the same), (3) pre-processing or transforming the data into a form suitable for analysis, (4) building some model and (5) understanding the output (e.g. visualizing and sharing the results). This five-step process is followed by an integration phase into the production process to make the analytics repeatable.
Q7. What kind of data management technologies do you use? What is your experience in using them?
Jochen L. Leidner: For storage and querying, we use relational database management systems to store valuable data assets and also use NoSQL databases such as CouchDB and MongoDB in projects where applicable. We use homegrown indexing and retrieval engines as well as open source libraries and components like Apache Lucene, Solr and ElasticSearch.
We use parallel, distributed computing platforms such as Hadoop/Pig and Spark to process data, and virtualization to manage isolation of environments.
We sometimes push out computations to Amazon’s EC2 and storage to S3 (but this can only be done if the data is not sensitive). And of course in any large organization there is a lot of homegrown software around. My experience overall with almost all open-source tools has been very positive: open source tools are very high quality, well documented, and if you get stuck there is a helpful and responsive community on mailing lists or Stack Exchange and similar sites.
Q8. Do you handle un-structured data? If yes, how?
Jochen L. Leidner: We curate lots of our own unstructured data from scratch, including the thousands of news stories that our over three thousand REUTERS journalists write every day, and we also enrich the unstructured content produced by others: for instance, in the U.S. we manually enrich legal cases with human-written summaries (written by highly-qualified attorneys) and classify them based on a proprietary taxonomy, which informs our market-leading WestlawNext product (see this CIKM 2011 talk), our search engine for the legal profession, who need to find exactly the right cases. Over the years, we have developed proprietary content repositories to manage content storage, meta-data storage, indexing and retrieval. One of our challenges is to unite our data holdings, which often come from various acquisitions that use their own technology.
Q9. Do you use Hadoop? If yes, what is your experience with Hadoop so far?
Jochen L. Leidner: We have two clusters featuring Hadoop, Spark and GraphLab, and we are using them intensively. Hadoop is mature as an implementation of the MapReduce computing paradigm, but has its shortcomings because it is not a true operating system (but probably it should be) – for instance regarding the stability of HDFS, its distributed file system. People have started to realize there are shortcomings, and have started to build other systems around Hadoop to fill gaps, but these are still early stage, so I expect them to become more mature first and then there might be a wave of consolidation and integration. We have definitely come a long way since the early days.
Typically, on our clusters we run batch information extraction tasks, data transformation tasks and large-scale machine learning processes to train classifiers and taggers for these. We are also inducing language models and training recommender systems on them. Since many training algorithms are iterative, Spark can win over Hadoop for these, as it keeps models in RAM.
Q10. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?
Jochen L. Leidner: The dichotomy is perhaps between batch processing and dialog processing, whereas real-time (“hard”, deterministic time guarantee for a system response) goes hand-in-hand with its opposite non-real time, but I think what you are after here is that dialog systems have to be responsive. There is no one-size-fits-all method for meeting (near) real-time requirements or for making dialog systems more responsive; a lot of analytics functions require the analysis of more than just a few recent data-points, so if that’s what is needed it may take its time. But it is important that critical functions, such as financial data feeds, are delivered as fast as possible – micro-seconds matter here. The more commoditized analytics functions become, the faster they need to be available to retain at least speed as a differentiator.
Q11 Cloud computing and open source: Do you they play a role at Thomson Reuters? If yes, how?
Jochen L. Leidner: Regarding cloud computing, we use cloud services internally and as part of some of our product offerings. However, there are also reservations – a lot of our applications contain information that is too sensitive to entrust a third party, especially as many cloud vendors cannot give you a commitment with respect to hosting (or not hosting) in particular jurisdictions. Therefore, we operate our own set of data centers, and some part of these operates as what has become known as “private clouds”, retaining the benefit of the management outsourcing abstraction, but within our large organization rather than pushing it out to a third party. Of course the notion of private clouds is leading the cloud idea ad absurdum quite a bit, because it sacrifices the economy of scale, but having more control is an advantage.
Open source plays a huge role at Thomson Reuters – we rely on many open source components, libraries and systems, especially under the MIT, BSD, LGPL and Apache licenses. For example, some of our tagging pipelines rely on Apache UIMA, which is a contribution originally developed at IBM, and which has seen contributions by researchers from all around the world (from Darmstadt to Boulder). To date, we have not been very good about opening up our own services in the form of source code, but we are trying to change that now, and we have just launched a new corporation-wide process for open-sourcing software. We also have an internal sharing repository, “Corporate Source”, but in my personal view the audience in any single company is too small – open source (like other recent waves such as clouds or crowdsourcing) needs Internet-scale to work, and Web search engines for the various projects to be discovered).
Q12 What are the main research challenges ahead? And what are the main business challenges ahead?
Jochen L. Leidner: Some of the main business challenges are the cost pressure that some of our customers face, and the increasing availability of low-cost or free-of-charge information sources, i.e. the commoditization of information. I would caution here that whereas the amount of information available for free is large, this in itself does not help you if you have a particular problem and cannot find the information that helps you solve it, either because the solution is not there despite the size, or because it is there but findability is low. Further challenges include information integration, making systems ever more adaptive, but only to the extent it is useful, or supporting better personalization. Having said this sometimes systems need to be run in a non-personalized mode (e.g. in the field of e-discovery, you need to have a certain consistency, namely that the same legal search systems retrieves the same things today and tomorrow, and to different parties.
Q13 Anything else you wish to add?
Jochen L. Leidner: I would encourage decision makers of global companies not to get misled by fast-changing “hype” language: words like “cloud”, “analytics” and “big data” are too general to inform a professional discussion. Putting my linguist’s hat on, I can only caution about the lack of precision inherent in marketing language’s use in technological discussions: for instance, cluster computing is not the same as grid computing. And what looks like “big data” today we will almost certainly carry around with us on a mobile computing device tomorrow. Also, buzz words like “Big Data” do not by themselves solve any problem – they are not magic bullets. To solve any problem, look at the input data, specify the desired output data, and think hard about whether and how you can compute the desired result – nothing but “good old” computer science.
Dr. Jochen L. Leidner is a Lead Scientist with Thomson Reuters, where he is building up the corporation’s London site of its Research & Development group.
He holds Master’s degrees in computational linguistics, English language and literature and computer science from Friedrich-Alexander University Erlangen-Nuremberg and in computer speech, text and internet technologies from the University of Cambridge, as well as a Ph.D. in information extraction from the University of Edinburgh (“Toponym resolution in Text”). He is recipient of the first ACM SIGIR Doctoral Consortium Award, a Royal Society of Edinburgh Enterprise Fellowship in Electronic Markets, and two DAAD scholarships.
Prior to his research career, he has worked as a software developer, including for SAP AG (basic technology and knowledge management) as well as for startups.
He led the development teams of multiple question Answering systems, including the systems QED at Edinburgh and Alyssa at Saarland University, the latter of which ranked third at the last DARPA/NIST TREC open-domain question answering factoid track evaluation.
His main research interests include information extraction, question answering and search, geo-spatial grounding, applied machine learning with a focus on methodology behind research & development in the area of information access.
In 2013, Dr. Leidner has also been teaching an invited lecture course “Language Technology and Big Data” at the University of Zurich, Switzerland.
- ODBMS.org free resources on Big Data and Analytical Data Platforms:
Blog Posts | Free Software| Articles | Lecture Notes | PhD and Master Thesis|
Follow us on Twitter: @odbmsorg
” The pace that we can generate data will outstrip our ability to store it.
I think you’ll soon see data scientists emphasizing the ability to make decisions on data before storing it ” –Adam Kocoloski.
I have interviewed Adam Kocoloski, Founder & CTO of Cloudant.
Q1. What can we learn from physics when managing and analyzing big data for the enterprise?
Adam Kocoloski: The growing body of data collected in today’s Web applications and sensor networks is a potential goldmine for businesses. But modeling transactions between people and causality between events becomes challenging at large scale, and traditional enterprise systems like data warehousing and business intelligence are too cumbersome to extract value fast enough.
Physicists are natural problem solvers, equipped to think through what tools will work for particular data challenges. In the era of big data, these challenges are growing increasingly relevant, especially to the enterprise.
In a way, physicists have it easier. Analyzing isolated particle collisions translated well to distributed university research systems and parallel models of computing. In other ways, we have shared the challenge of filtering big data to find useful information. In my physics work, we addressed this problem with blind analysis and machine learning. I think you’ll soon see those practices emerge in the field of enterprise data analysis.
Q2. How do you see data science evolving in the near future?
Adam Kocoloski: The pace that we can generate data will outstrip our ability to store it. I think you’ll soon see data scientists emphasizing the ability to make decisions on data before storing it.
The sheer volume of data we’re storing is a factor, but what’s more interesting is the shift toward the distributed generation of data — data from mobile devices, sensor networks, and the coming “Internet of Things.” It’s easy for an enterprise to stand up Hadoop in its own data center and start dumping data into it, especially if it plans to sort out the valuable parts later. It’s not so easy when it’s large volumes of operational data generated in a distributed system. Machine learning algorithms that can recognize and store only the useful patterns can help us better deal with the deluge.
As physicists, we learned that the way big data is headed, there’s no way we’ll be able to keep writing it all down. That’s the tradeoff today’s data scientists must learn: right when you collect the data, you need to make decisions on throwing it away.
Q3. In your opinion, given the current available Big Data technologies, what is the most difficult challenge in filtering big data to find useful information?
Adam Kocoloski: Cloudant is an operational data store and not a big data or offline analytics platform like Hadoop. That means we deal with mutable data that applications are accessing and changing as they run.
From my physics experience, the most difficult big data challenge I’ve seen is the lack of accurate simulations for machine learning. For me, that meant simulations of the STAR particle detector at Brookhaven National Lab’s Relativistic Heavy Ion Collider (RHIC).
People use machine learning algorithms in many fields, and they don’t always understand the caveats of building in an appropriate training data set. It’s easy to apply training data without fully understanding how the process works. If they do that, they won’t realize when they’ve trained their machine learning algorithms inappropriately.
Slicing data from big data sets is great, but at a certain point it becomes a black box that makes it hard to understand what is and what isn’t working well in your analysis. The bigger the data, the more it’s possible for one variable to be related to others in nonlinear ways. This problem makes it harder to reason about data, placing more demands on data scientists to build training data sets using a balanced combination of linear and nonlinear techniques.
Q4. Could you please explain why blind analyses is important for Big Data?
Adam Kocoloski: Humans are naturally predisposed to find signals. It’s an evolutionary trait of ours. It’s better if we recognize the tiger in the jungle, even if there really isn’t one there. If we see a bump in a distribution of data, we do what we can to tease it out. We bias ourselves that way.
So when you do a blind analysis, you hopefully immunize yourself against that bias.
Data scientists are people too, and with big data, they can’t become overly reliant on data visualization. It’s too easy for us to see things that aren’t really there. Instead of seeking out the signals within all that data, we need to work on recognizing the noise — the data we don’t want — so we can inversely select the data we want to keep.
Q5. Is machine learning the right way to analyze Big Data?
Adam Kocoloski: Machine learning offers the possibility to improve the signal-to-noise ratio beyond what any manually constructed analysis can do.
The potential is there, but you have to balance it with the need to understand the training data set. It’s not a panacea. Algorithms have weak points. They have places where they fail. When you’re applying various machine-learning analyses, it’s important that you understand where those weak points are.
Q6. The past year has seen a renaissance in NewSQL. Will transactions ultimately spell the end of NoSQL databases?
Adam Kocoloski: No — 1) because there’s a wide, growing class of problems that don’t require transactional semantics and 2) mobile computing makes transactions at large scale technically infeasible.
Applications like address books, blogs, or content management systems can store a wide variety of data, and, largely, do not require a high degree of transactional integrity. Using systems that inherently enforce schemas and row-level locking — like an relational database management system (RDBMS) — unnecessarily over-complicate these applications.
It’s widely thought that the popularity of NoSQL databases was due to the inability of relational databases to scale horizontally. If NewSQL databases can provide transactional integrity for large, distributed databases and cloud services, does this undercut the momentum of the NoSQL movement? I argue that no, it doesn’t, because mobile computing introduces new challenges (e.g. offline application data and database sync) that fundamentally cannot be addressed in transactional systems.
It’s unrealistic to lock a row in an RDBMS when a mobile device that’s only occasionally connected could introduce painful amounts of latency over unreliable networks. Add that to the fact that many NoSQL systems are introducing new behaviors (strong-consistency, multi-document transactions) and strategies for approximating ACID transactions (event sourcing) — mobile is showing us that we need to rethink the information theory behind it.
Q7. What is the technical role that CouchDB clustering plays for Cloudant’s distributed data hosting platform?
Adam Kocoloski: At Cloudant, clustering allows us to take one logical database and partition that database for large scale and high availability.
We also store redundant copies of the partitions that make up that cluster, and to our customers, it all looks and operates like one logical database. CouchDB’s interface naturally lends itself to this underlying clustering implementation, and it is one of the many technologies we have used to build Cloudant’s managed database service.
Cloudant is built to be more than just hosted CouchDB. Along with CouchDB, open source software projects like HAProxy, Lucene, Chef, and Graphite play a crucial role in running our service and managing the experience for customers. Cloudant is also working with organizations like the Open Geospatial Consortium (OGC) to develop new standards for working with geospatial data sets.
That said, the semantics of CouchDB replication — if not the actual implementation itself — are critical to Cloudant’s ability to synchronize individual JSON documents or entire database partitions between shard copies within a single cluster, between clusters in the same data center, and between data centers across the globe. We’ve been able to horizontally scale CouchDB and apply its unique replication abilities on a much larger scale.
Q8. Cloudant recently announced the merging of its distributed database platform into the Apache CouchDB project. Why? What are the benefits of such integration?
Adam Kocoloski: We merged the horizontal scaling and fault-tolerance framework we built in BigCouch into Apache CouchDB™. The same way Cloudant has applied CouchDB replication in new ways to adapt the database for large distributed systems, Apache CouchDB will now share those capabilities.
Previously, the biggest knock on CouchDB was that it couldn’t scale horizontally to distribute portions of a database across multiple machines. People saw it as a monolithic piece of software, only fit to run on a single server. That is no longer the case.
Obviously new scalability features are good for the Apache project, and a healthy Apache CouchDB is good for Cloudant. The open source community is an excellent resource for engineering talent and sales leads. Our contribution will also improve the quality of our code. Having more of it out there in live deployment will only increase the velocity of our development teams. Many of our engineers wear multiple hats — as Cloudant employees and Apache CouchDB project committers. With the code merger complete, they’ll no longer have to maintain multiple forks of the codebase.
Q9. Will there be two offerings of the same Apache CouchDB: one from Couchbase and one from Cloudant?
Adam Kocoloski: No. Couchbase has distanced itself from the Apache project. Their product, Couchbase Server, is no longer interface-compatible with Apache CouchDB and has no plans to become so.
Adam Kocoloski, Founder & CTO of Cloudant.
Adam is an Apache CouchDB developer and one of the founders of Cloudant. He is the lead architect of BigCouch, a Dynamo-flavored clustering solution for CouchDB that serves as the core of Cloudant’s distributed data hosting platform. Adam received his Ph.D. in Physics from MIT in 2010, where he studied the gluon’s contribution to the spin structure of the proton using a motley mix of server farms running Platform LSF, SGE, and Condor. He and his wife Hillary are the proud parents of two beautiful girls.
- “NoSQL Failover Characteristics: Aerospike, Cassandra, Couchbase, MongoDB” (.pdf), by Denis Nelubin, Ben Engber, Thumbtack Technology, 2013
- “Ultra-High Performance NoSQL Benchmarking- Analyzing Durability and Performance Tradeoffs” (.pdf) by Denis Nelubin,, Ben Engber, Thumbtack Technology, 2013
Follow us on Twitter: @odbmsorg