“What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging” — Charu Aggarwal.
On Data Mining, Data Science and Big Data, I have interviewed Charu Aggarwal, Research Scientist at the IBM T. J. Watson Research Center, an expert in this area.
Q1. You recently edited two books: Data Classification: Algorithms and Applications and Data Clustering: Algorithms and Applications.
What are the main lessons learned in data classification and data clustering that you can share with us?
Charu Aggarwal: The most important lesson, which is perhaps true for all of data mining applications, is that feature extraction, selection and representation are extremely important. It is all too often that we ignore these important aspects of the data mining process.
Q2. How Data Classification and Data Clustering relate to each other?
Charu Aggarwal: Data classification is the supervised version of data clustering. Data clustering is about dividing the data into groups of similar points. In data classification, examples of groups of points are made available to you. Then, for a given test instance, you are supposed to predict which group this point might belong to.
In the latter case, the groups often have a semantic interpretation. For example, the groups might correspond to fraud/not fraud labels in a credit-card application. In many cases, it is natural for the groups in classification to be clustered as well. However, this is not always the case.
Some methods such as semi-supervised clustering/classification leverage the natural connections between these problems to provide better quality results.
Q3. Can data classification and data clustering be useful also for large data sets and data streams? If yes, how?
Charu Aggarwal: Data clustering is definately useful for large data sets, because clusters can be viewed as summaries of the data. In fact, a particular form of fine-grained clustering, referred to as micro-clustering, is commonly used for summarizing high-volume streaming data in real time. These summaries are then used for many different applications, such as first-story detection, novelty detection, prediction, and so on.
In this sense, clustering plays an intermediate role in enabling other applications for large data sets.
Classification can also be used to generate different types of summary information, although it is a little less common. The reason is that classification is often used as the end-user application, rather than as an intermediate application
like clustering. Therefore, big-data serves as a challenge and as an opportunity for classification.
It serves as a challenge because of obvious computational reasons. It serves as an opportunity because you can build more complex and accurate models with larger data sets without creating a situation, where the model inadvertently overfits to the random noise in the data.
Q4. How do you typically extract “information” from Big Data?
Charu Aggarwal: This is a highly application-specific question, and it really depends on what you are looking for. For example, for the same stream of health-care data, you might be looking for different types of information, depending on whether you are trying to detect fraud, or whether you are trying to discover clinical anomalies. At the end of the day, the role of the domain expert can never be discounted.
However, the common theme in all these cases is to create a more compressed, concise, and clean representation into one of the data types we all recognize and know how to process. Of course, this step is required in all data mining applications, and not just big data applications. What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging.
For example, if you look at Google’s original MapReduce framework, it was motivated by a need to efficiently perform operations that are almost trivial for smaller data sets, but suddenly become very expensive in the big-data setting.
Q5. What are the typical problems and scenarios when you cluster multimedia, text, biological, categorical, network, streams, and uncertain data?
Charu Aggarwal: The heterogeneity of the data types causes significant challenges.
One problem is that the different data types may often be mixed, as a result of which the existing methods can sometimes not be used directly. Some common scenarios in which such data types arise are photo/music/video-sharing (multimedia), healthcare (time-series streams and biological), and social networks. Among these different data types, the probabilistic (uncertain) data types does not seem to have graduated from academia into industry very well. Of course, it is a new area and there is a lot of active research going on. The picture will become clearer in a few years.
Q6. How effective are today ́s clustering algorithms?
Charu Aggarwal: Clustering problems have become increasingly effective in recent years because of advances in high-dimensional methods. In the past, when the data was very high-dimensional most existing methods work poorly because of locally irrelevant attributes and concentration effects. These are collectively referred to as the curse of dimensionality. Techniques such as subspace and projected clustering have been introduced to discover clusters in lower dimensional views of the data. One nice aspect of this approach is that some variations of it are highly interpretable.
Q7. What is in common between pattern recognition, database analytics, data mining, and machine learning?
Charu Aggarwal: They really do the same thing, which is that of analyzing and gleaning insights from data. It is just that the styles and emphases are different in various communities. Database folks are more concerned
about scalability. Pattern recognition and machine learning folks are somewhat more theoretical. The statistical folks tend to use their statistical models. The data mining community is the most recent one, and it was formed to create a common meeting ground for these diverse communities.
The first KDD conference was held in 1995, and we have come a long way since then towards integration. I believe that the KDD conference has played a very major role in the amalgamation of these communities. Today, it is actually possible for the folks from database and machine learning communities to be aware of each other’s work. This was not quite true 20 years ago.
Q8. What are the most “precise” methods in data classification?
Charu Aggarwal: I am sure that you will find experts who are willing to swear by a particular model. However, each model comes with a different set of advantages over different data sets. Furthermore, some models, such as univariate decision trees and rule-based methods, have the advantage of being interpretable even when they are outperformed by other methods. After all, analysts love to know about the “why” aside from the “what.”
While I cannot say which models are the most accurate (highly data specific), I can certainly point to the most “popular” ones today from a research point of view. I would say that SVMs, and neural networks (deep learning) are the most popular classification methods. However, my personal experience has been mixed.
While I have found SVMs to work quite well across a wide variety of settings, neural networks are generally less robust. They can easily over fit to noise or show unstable performance over small ranges of parameters. I am watching the debate over deep learning with some interest to see how it plays out.
Q9. When to use Mahout for classification? and What is the advantage of using Mahout for classification?
Charu Aggarwal: Apache Mahout is a scalable machine learning environment for data mining applications. One distinguishing feature of Apache Mahout is that it builds on top of distributed infrastructures like MapReduce, and enables easy building of machine learning applications. It includes libraries of various operations and applications.
Therefore, it reduces the effort of the end user beyond the basic MapReduce framework. It should be used in cases, where the data is large enough to require the use of such distributed infrastructures.
Q10. What are your favourite success stories in Data Classifications and/or Data Clustering?
Charu Aggarwal: One of my favorite success stores is in the field of high dimensional data, where I explored the effect of locally irrelevant dimensions and concentration effects on various data mining algorithms.
I designed a suite of algorithms for such high-dimensional tasks as clustering, similarity search, and outlier detection.
The algorithms continue to be relevant even today, and we have even generalized some of these results to big-data (streaming) scenarios and other application domains, such as the graph and text domains.
Qx Anything else you wish to add?
Charu Aggarwal: Data mining and data sciences are at exciting cross-roads today. I have been working in this field since 1995, and I have never seen as much excitement about data science in my first 15 years, as I have seen
in the last 5. This is truly quite amazing!
Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson Research Center in Yorktown Heights, New York.
He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from Massachusetts Institute of Technology in 1996.
His research interest during his Ph.D. years was in combinatorial optimization (network flow algorithms), and his thesis advisor was Professor James B. Orlin.
He has since worked in the field of performance analysis, databases, and data mining. He has published over 200 papers in refereed conferences and journals, and has applied for or been granted over 80 patents. He is author or editor of nine books.
Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research.
He has served on the program committees of most major database/data mining conferences, and served as program vice-chairs of the SIAM Conference on Data Mining, 2007, the IEEE ICDM Conference, 2007, the WWW Conference 2009, and the IEEE ICDM Conference, 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering Journal from 2004 to 2008. He is an associate editor of the ACM TKDD Journal, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal.
He is a fellow of the IEEE for “contributions to knowledge discovery and data mining techniques”, and a life-member of the ACM.
– MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
Appeared in:OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. Download: PDF Version
Follow ODBMS.org on Twitter: @odbmsorg
“The IoT will have many, many trillions of connections, particularly considering it’s not just the devices that are connected, but people, organizations, applications, and the underlying network” –Emil Eifrem.
I have interviewed Emil Eifrem, CEO of Neo Technology. Among the topics we discussed: Graph Databases, the new release of Neo4j, and how graphs relate to the Internet of Things.
Q1. Michael Blaha said in an interview: “The key is the distinction between being occurrence-oriented and schema-oriented. For traditional business applications, the schema is known in advance, so there is no need to use a graph database which has weaker enforcement of integrity. If instead, you’re dealing with at best a generic model to which it conforms, then a schema-oriented approach does not provide much. Instead a graph-oriented approach is more natural and easier to develop against.”
What is your take on this?
Emil Eifrem: While graphs do excel where requirements and/or data have an element of uncertainty or unpredictability, many of the gains that companies experience from using graph databases don’t require the schema to be dynamic. What make graph databases suitable is problems where relationships between the data, and not just the data, matter.
That said, I agree that a graph-oriented approach is incredibly natural and easier to develop against. We see this again and again, with development cycle times reduced by 90% in some cases. Performance is also a common driver.
Q2. You recently released Neo4j 2.2. What are the main enhancements to the internal architecture you have done in Neo4j 2.2 and why?
Emil Eifrem: Neo4j 2.2 makes huge strides in performance and scalability. Performance of Cypher queries is up to 100 times faster than before, thanks to a Cost-Based Optimizer, that includes Visual Query Plans as a tuning aid.
Read scaling for highly concurrent workloads can be as much as 10 times higher with the new In-Memory Page Cache, which helps users take better advantage of modern hardware. Write scaling is also significantly higher for highly concurrent transactional workloads. Our engineering team found some clever ways of increasing throughput, by buffering writes to a single transaction log, rather than blocking transactions one at a time where each transaction committed to two transaction logs (graph and index). The last internal architecture change was to integrate a bulk loader into the product. It’s blindingly fast. We use Neo4j internally and a load that took many hours transactionally runs in four minutes with the bulk loader. It operates at throughputs of a million records per second, even for extremely large graphs.
Besides all of the internal improvements, this release also includes a lot of the top-requested developer features from the community in the developer tooling, such as built-in learning and visualization improvements.
Q3. With Neo4j 2.2 you introduce a new page cache. Could you please explain what is new with this page cache?
Emil Eifrem: Neo4j has two levels of caching. In earlier versions, Neo4j delegated the lower level cache to the operating system, by memory mapping files. OS memory mapping is optimized for a wide range of workloads.
As users have continued to push into bigger & bigger workloads, with more and more data, we decided it was time to build a specialized cache, built specially for Neo4j workloads. The page cache uses an LRU-K algorithm, and is auto-configured and statistically optimized, to deliver vastly improved scalability in highly concurrent workloads. The result is much better read scaling in multi-core environments that maintains the ultra-fast performance that’s been the hallmark of Neo4j.
Q4. How is this new page cache helping overcoming some of the limitations imposed by current IO systems? Do you have any performance measurements to share with us?
Emil Eifrem: The benefits kick in progressively as you add cores and threads. In the labs we’ve seen up to 10 times higher read throughput compared to previous versions of Neo4j, in large simulations. We also have some very positive reports from the field indicating similar gains.
Q5. What enhancements did you introduce in Neo4j 2.2 to improve both transactional and batch write performance?
Emil Eifrem: Write throughput has gone up because of two improvements. One is the fast-write buffering architecture. This lets multiple transactions flush to disk at the same time, in a way that improves throughput without sacrificing latency. Secondly, there is a change to the structure of the transaction logs. Prior to 2.2, writes used to be committed one at a time with two-phase commit for both the graph and its index. With the unified transaction log, multiple writes can be committed together, using a more efficient approach than before, for ensuring ACIDity between the graph and indexes.
For bulk initial loading, there’s something entirely different, a utility called “neo4j-import” that’s designed to load data at extremely high rates. We’ve seen complex graphs with tens of billions of nodes and relationships loading at rates of 1M records per second.
Q6. You introduced a cost-based query planner, Cypher, which uses statistics about data sets. How does it work? What statistics do you use about data sets?
Emil Eifrem: In 2.2 we introduced both a cost-based optimizer and a visual query planner for Cypher.
The cost-based optimizer gathers statistics such as the total number of nodes by label and calculates the most efficient query path based not just on information about the question being asked, but information about the data patterns in the graph. While some Cypher read queries perform just as fast as they did before, others can be 100 times faster.
The visual query planner provides insight into how the Neo4j optimizer will execute a query, helping users write better and faster queries because Cypher is more transparent.
Q7. Gartner recently said that “Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
Graph analysis does not necessarily imply the need a dedicated graph database. Do you have any comment?
Emil Eifrem: Gartner is making a business statement, not a technology statement. Graph analysis refers to a business activity. The best tool we know for carrying out relevant and valuable graph analysis problems is graph databases.
The real value of using a dedicated graph database is in the power to use data relationships easily. Because data and its relationships are stored and processed as they naturally occur in a graph database, elements such as index-free adjacency lead to ultra-accurate and speedy responses to even the most complex queries.
Businesses that build operational applications on the right graph database experience measurable benefits: better performance overall, more competitive applications that incorporate previously impossible-to-include real-time features, easier development cycles that lead to faster time-to-market, and higher revenues thanks to speedier innovation and sharper fraud detection.
Q8. How do you position Neo4j with respect to RDBMS which handles XML and RDF data, and to NoSQL databases which handle graph-based data?
Emil Eifrem: While it is possible to use Neo4j to model an RDF-style graph, our observation is that most people who have tried doing this have found RDF much more difficult to learn and use than the property graph model and associated query methods. This is unsurprising given that RDF is a web standard created by an organization chartered with world wide web standards (the W3C), which has a very different set of requirements than organizations do for their enterprise databases. We saw the need to invent a model suited for persistent data inside of an enterprise, for use as a database data model.
As for XML, that again is a great data transport and exchange mechanism, and is conceptually similar to what’s done in document databases. But it’s not really a suitable model for database storage: if what you care about is relating things across your network. XML databases experienced some hype early on but never caught on.
While we’re talking about document databases … there’s another point here worth drilling into, which is not the data model, but the consistency model. If you’re dealing with isolated documents, then it’s okay for the scope of the transaction to be limited to one object. This means eventual consistency is okay. With graphs, because things relate to one another, if you don’t ensure that related things get written to the database on an “all or nothing” basis, then you can very quickly corrupt your graph. This is why BASE is sufficient for other forms of NoSQL, but not for graphs.
Q9. Could you please give use some examples on how graph databases could help supporting the Internet of Things (IoT)?
Emil Eifrem: We love Neo4j for the Internet of Things, but we’d like to see it renamed to Internet of Connected Things! After all, the value is in the connections between all of the things, that is, the connections and interactions between the devices.
Two points are worth remembering:
- Devices in isolation bring little value to the IoT; rather, it’s the connections between devices that truly bring forth the latent possibilities.
- We’re not just speaking about tracking billions of connections; the IoT will have many, many trillions of connections, particularly considering it’s not just the devices that are connected, but people, organizations, applications, and the underlying network.
Understanding and managing these connections will be at least as important for businesses as understanding and managing the devices themselves. Imagination is key to unlocking the value of connected things. For example, in a telecommunications or aviation network, the questions, “What cell tower is experiencing problems?” and “Which plane will arrive late?” can be answered much more accurately by understanding how the individual components are connected and impact one another. Understanding connections is also key to understanding dependencies and uncovering cascading impacts.
Q10. What are your top 3 favourite case studies for Neo4j?
Emil Eifrem: My top three use cases are:
- Real-time Recommendations – Personalize product, content and service offers by leveraging data relationships. (dynamic pricing, financial services products, online retail, routing & networks)
- Fraud Detection – Improve existing fraud detection resulting methods by uncovering hidden relationships to discover fraud rings and indirections. (financial services, health care, government, gaming)
- Master Data Management – Improve business outcome through storage and retrieval of complex and ‘hierarchical’ master data. Top MDM data sets across our customer base include: customer (360 degree view), organizational hierarchy, employee (HR), product / product line management, metadata management / data governance, and CMDB, as well as digital assets. (financial services, telecommunications, insurance, agribusiness)
Q11. Anything else you wish to add?
Emil Eifrem: Yes, one thing. As much as we’re a product company, we are very passionate educators and evangelists. Our mission is to help the world make sense of data. We discovered that graphs are an amazing way of doing that, and we’re working hard to share that with the world.
For anyone interested in learning more, our web site offers a lot of great learning resources: talks, examples, free training… we’ve even worked with O’Reilly to offer up their Graph Databases e-book. By the time this article is published, the second edition should be up on http://graphdatabases.com. Any of your readers who are interested, are welcome to come and learn, and become part of the amazing & rapidly growing worldwide graph database community.
It’s been great to speak with you!
Emil is the founder of the Neo4j open source graph database project, the most widely deployed graph database in the world. As the CEO of Neo4j’s commercial sponsor Neo Technology Emil spreads the word about the powers of graphs everywhere. Emil is a co-author of the O’Reilly Media book “Graph Databases” and presents regularly at conferences around the world such as JAOO, JavaOne, QCon, and OSCON.
Follow ODBMS.org on Twitter: @odbmsorg
“Today, we’re storing and processing tens of petabytes of data on a daily basis, which poses the big challenge in building a highly reliable and scalable data infrastructure.”–Krishna Gade.
I have interviewed Krishna Gade, Engineering Manager on the Data team at Pinterest.
Q1. What are the main challenges you are currently facing when dealing with data at Pinterest?
Krishna Gade: Pinterest is a data product and a data-driven company. Most of our Pinner-facing features like recommendations, search and Related Pins are created by processing large amounts of data every day. Added to this, we use data to derive insights and make decisions on products and features to build and ship. As Pinterest usage grows, the number of Pinners, Pins and the related metadata are growing rapidly. Today, we’re storing and processing tens of petabytes of data on a daily basis, which poses the big challenge in building a highly reliable and scalable data infrastructure.
On the product side, we’re curating a unique dataset we call the ‘interest graph’ which captures the relationships between Pinners, Pins, boards (collections of Pins) and topic categories. As Pins are visual bookmarks of web pages saved by our Pinners, we can have the same web page Pinned many different times. One of the problems we try to solve is to collate all the Pins that belong to the same web page and aggregate all the metadata associated with them.
Visual discovery is an important feature in our product. When you click on a Pin we need to show you visually related Pins. In order to do this we extract features from the Pin image and apply sophisticated deep learning techniques to suggest Pins related to the original. There is a need to build scalable infrastructure and algorithms to mine and extract value from this data and apply to our features like search, recommendations etc.
Q2. You wrote in one of your blog posts that “data-driven decision making is in your company DNA”. Could please elaborate and explain what do you mean with that?
Krishna Gade: It starts from the top. Our senior leadership is constantly looking for insights from data to make critical decisions. Every day, we look at the various product metrics computed by our daily pipelines to measure how the numerous product features are doing. Every change to our product is first tested with a small fraction of Pinners as an A/B experiment, and at any given time we’re running hundreds of these A/B experiments. Over time data-driven decision making has become an integral part of our culture.
Q3. Specifically, what do you use Real-time analytics for at Pinterest?
Krishna Gade: We build batch pipelines extensively throughout the company to process billions of Pins and the activity on them. These pipelines allow us to process vast amounts of historic data very efficiently and tune and personalize features like search, recommendations, home feed etc. However these pipelines don’t capture the activity happening currently – new users signing up, millions of repins, clicks and searches. If we only rely on batch pipelines, we won’t know much about a new user, Pin or trend for a day or two. We use real-time analytics to bridge this gap.
Our real-time data pipelines process user activity stream that includes various actions taken by the Pinner (repins, searches, clicks, etc.) as they happen on the site, compute signals for Pinners and Pins in near real-time and make these available back to our applications to customize and personalize our products.
Q4 Could you pls give us an overview of the data platforms you use at Pinterest?
Krishna Gade: We’ve used existing open-source technologies and also built custom data infrastructure to collect, process and store our data. We built a logging agent Singer, deployed on all of our web servers that’s constantly pumping log data into Kafka, which we use as a log transport system. After the logs reach Kafka, they’re copied into Amazon S3 by our custom log persistence service called Secor. We built Secor to ensure 0-data loss and overcome the weak eventual consistency model of S3.
After this point, our self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing. All our large scale batch pipelines run on Hadoop, which is the core data infrastructure we depend on for improving and observing our product. Our engineers use either Hive or Cascading to build the data pipelines, which are managed by Pinball – a flexible workflow management system we built. More recently, we’ve started using Spark to support our machine learning use-cases.
Q5. You have built a real-time data pipeline to ingest data into MemSQL using Spark Streaming. Why?
Krishna Gade: As of today, most of our analytics happens in the batch processing world. All the business metrics we compute are powered by the nightly workflows running on Hadoop. In the future our goal is to be able to consume real-time insights to move quickly and make product and business decisions faster. A key piece of infrastructure missing for us to achieve this goal was a real-time analytics database that can support SQL.
We wanted to experiment with a real-time analytics database like MemSQL to see how it works for our needs. As part of this experiment, we built a demo pipeline to ingest all our repin activity stream into MemSQL and built a visualization to show the repins coming from the various cities in the U.S.
Q6. Could you pls give us some detail how is it implemented?
Krishna Gade: As Pinners interact with the product, Singer agents hosted on our web servers are constantly writing the activity data to Kafka. The data in Kafka is consumed by a Spark streaming job. In this job, each Pin is filtered and then enriched by adding geolocation and Pin category information. The enriched data is then persisted to MemSQL using MemSQL’s spark connector and is made available for query serving. The goal of this prototype was to test if MemSQL could enable our analysts to use familiar SQL to explore the real-time data and derive interesting insights.
Q7. Why did you choose MemSQL and Spark for this? What were the alternatives?
Krishna Gade: I led the Storm engineering team at Twitter, and we were able to scale the technology for hundreds of applications there. During that time I was able to experience both good and bad aspects of Storm.
When I came to Pinterest, I saw that we were beginning to use Storm but mostly for use-cases like computing the success rate and latency stats for the site. More recently we built an event counting service using Storm and HBase for all of our Pin and user activity. In the long run, we think it would be great to consolidate our data infrastructure to a fewer set of technologies. Since we’re already using Spark for machine learning, we thought of exploring its streaming capabilities. This was the main motivation behind using Spark for this project.
As for MemSQL, we were looking for a relational database that can run SQL queries on streaming data that would not only simplify our pipeline code but would give our analysts a familiar interface (SQL) to ask questions on this new data source. Another attractive feature about MemSQL is that it can also be used for the OLTP use case, so we can potentially have the same pipeline enabling both product insights and user-facing features. Apart from MemSQL, we’re also looking at alternatives like VoltDB and Apache Phoenix. Since we already use HBase as a distributed key-value store for a number of use-cases, Apache Phoenix which is nothing but a SQL layer on top of HBase is interesting to us.
Q8. What are the lessons learned so far in using such real-time data pipeline?
Krishna Gade: It’s early days for the Spark + MemSQL real-time data pipeline, so we’re still learning about the pipeline and ingesting more and more data. Our hope is that in the next few weeks we can scale this pipeline to handle hundreds of thousands of events per second and have our analysts query them in real-time using SQL.
Q9. What are your plans and goals for this year?
Krishna Gade: On the platform side, our plan to is to scale real-time analytics in a big way in Pinterest. We want to be able to refresh our internal company metrics, signals into product features at the granularity of seconds instead of hours. We’re also working on scaling our Hadoop infrastructure especially looking into preventing S3 eventual consistency from disrupting the stability of our pipelines. This year should also see more open-sourcing from us. We started the year by open-sourcing Pinball, our workflow manager for Hadoop jobs. We plan to open-source Singer our logging agent sometime soon.
One the product side, one of our big goals is to scale our self-serve ads product and grow our international user-base. We’re focusing especially on markets like Japan and Europe to grow our user-base and get more local content into our index.
Qx. Anything else you wish to add?
Krishna Gade: For those who are interested in more information, we share latest from the engineering team on our Engineering blog. You can follow along with the blog, as well as updates on our Facebook Page. Thanks a lot for the opportunity to talk about Pinterest engineering and some of the data infrastructure challenges.
Krishna Gade is the engineering manager for the data team at Pinterest. His team builds core data infrastructure to enable data driven products and insights for Pinterest. They work on some of the cutting edge big data technologies like Kafka, Hadoop, Spark, Redshift etc. Before Pinterest, Krishna was at Twitter and Microsoft building large scale search and data platforms.
–Singer, Pinterest’s Logging Infrastructure (LINK to SlideShares)
–Introducing Pinterest Secor (LINK to Pinterest engineering blog)
–MemSQL’s spark connector (memsql/memsql-spark-connector GitHub)
Follow ODBMS.org on Twitter: @odbmsorg
“Predictive analytics is a market which has been lagging the growth of big data – full of tools developed twenty or more years ago which simply weren’t built with today’s challenges in mind.”–Walter Maguire and Indrajit Roy
HP announced HP Distributed R. I wanted to learn more about it, and I have interviewed Walter Maguire, Chief Field Technologist with the HP Big Data Group,and Indrajit Roy, principal researcher at HP, who provided the answers with the assistance of Malu G. Castellanos, manager and technical contributor in the Vertica group of Hewlett Packard.
Q1. HP announced HP Distributed R. What is the difference with the standard R?
Maguire, Roy: R is a very popular statistical analysis tool. But it was conceived before the era of Big Data. It is single threaded and cannot analyze massive datasets. HP Distributed R brings scalability and high performance to R users. Distributed R is not a competing version of R. Rather, it is an open source package that can be installed on vanilla R. Once installed, R users can leverage the pre-built distributed algorithms and the Distributed R API to benefit from cluster computing and dramatically expand the scale of the data they are able to analyze.
Q2. How does HP Distributed R work?
Maguire, Roy: HP Distributed R has three components:
(1) an open source distributed runtime that executes R functions,
(2) a fast, parallel data loader to ingest data from different sources such as the Vertica database, and
(3) a mechanism to deploy the model in the Vertica database.
The distributed runtime is the core of HP Distributed R.
It starts multiple R workers on the cluster, breaks the user’s program into multiple independent tasks, and executes them in parallel on cluster. The runtime hides much of the internal data communication. For example, the user does not need to know how many machines make up the cluster and where data resides in the cluster. In essence, it allows any R algorithm which has been ported to use distributed R to act like a massively parallel system.
Q3. Could you tell us some details on how users write Distributed R programs to benefit from scalability and high-performance?
Maguire, Roy: A programmer can use HP Distributed R’s API to write distributed applications. The API consists of two types of language constructs. First, the API provides distributed data-structures. These are really distributed versions of R’s common data structures such as array, data.frame, and list. As an example, distributed arrays can store 100s of gigabytes of data in-memory and across a cluster. Second, the API also provides a way for users to express parallel tasks on distributed data structures. While R users can write their own custom distributed applications using this API, we expect most R users to be interested in built-in algorithms. Just like R has built-in packages such as kmeans for clustering and glm for regression, HP Distributed R provides distributed versions of common clustering, classification, and graph algorithms.
Q4. R has already many packages that provide parallelism constructs. How do they fit into Distributed R?
Maguire, Roy: Yes, R has a number of open source parallel packages. Unfortunately, none of the packages can handle hundreds or thousands of gigabytes of data or has built-in distributed data structures and computational algorithms. HP Distributed R fills that functionality gap, along with enterprise support – which is critical for customers before they deploy R in production systems.
Also, it’s worth noting that using distributed R doesn’t prevent an R programmer from using their current libraries in their current environment. Those libraries just won’t gain the scale and performance benefits of distributed R.
Q5. Why is there a need to streamline language constructs in order to move R forward in the era of Big Data?
Maguire, Roy: The open source community has done a tremendous job of advancing R—different algorithms, thousands of packages, and a great user community.
However, in the case of parallelism and Big Data there is a confusing mix of R extensions. These packages have overlapping functionality, in many cases completely different syntax, and none of them solve all the issues users face with Big Data. We need to ensure that future R contributors can use a standard set of interfaces and write applications that are portable across different backend packages. This is not just our concern, but something that members of R-core and other companies are interested in as well. Our goal is to help the open source community streamline some of the language constructs so they can spend more time answering analytic questions and less time trying to make sense of the different R extensions.
Q6. What are in your opinion the strengths and weaknesses of the current R parallelism constructs?
Maguire, Roy: Some packages such as “parallel” are very useful. In the case of “parallel”, it is accessible to most R users, already ships with R, and it is easy to express embarrassingly parallel applications (those in which individual tasks don’t need to coordinate with each other). Still, parallel and other packages lack concepts such as distributed data-structures which can provide the much needed performance on massive data. Additionally, it is not clear if the infrastructure implementing existing parallel constructs have been tested on large, multi-gigabyte data.
Q7. When MPI and R wrappers around MPI are a good option?
Maguire, Roy: MPI is a powerful tool. It is widely used in the scientific and high performance computing domain.
If you have an existing MPI application and want to expose it to R users, the right thing is to make it available thought R wrappers. It does not make sense to rewrite these optimized scientific applications in R or any other language.
Q8. Why for in-memory processing, adding some form of distributed objects in R can potentially improve performance?
Maguire, Roy: In-memory processing represents a big change moving forward. The key idea is to remove bottlenecks such as the disk which slows down applications. In HP Distributed R, distributed objects provide a way to store and manipulate data in-memory. Without these distributed objects, data on worker nodes will be ephemeral and users will not be able to reference remote data. Worse, there will be performance issues. For example, many machine learning applications are iterative and need to execute tasks for multiple rounds. Without the concept of distributed objects, applications would end up re-broadcasting data to remote servers in each round. This results in a lot of data movement and very poor performance. Incidentally, this is a good example of why we undertook Distributed R in the first place. Implementing the bare bones of a parallel application is relatively straightforward, but there are thousands or tens of thousands of edge cases which arise once said application is in use due to the nature of distributed processing.
This is when the value of a cohesive parallel framework like Distributed R becomes very apparent.
Q9. Do you think that by using simple parallelism constructs, such as lapply, that operate on distributed data structures, may make it easier to program in R?
Maguire, Roy: Yes, we need to ensure that R users have a simple API to express parallelism. Implementing machine learning algorithms requires deep knowledge. Couple it with parallelism, and you are left with a very small set of people who can really write such applications. To ensure that R users continue to contribute, we need an API which is familiar to current R users. Constructs from the apply() family are a good choice. In fact we are exploring these kind of APIs with members of R-core.
Q10. R is an open-source software project. What about HP Distributed R?
Maguire, Roy: Just like R, HP Distributed R is a GPL licensed open source project. Our code is available on GitHub and we try to release a new version every few months. We provide enterprise support for customers who need it. If you have HP Vertica enterprise edition you will see additional benefits of integrating Vertica with Distributed R.
For example, you can build a machine learning model in Distributed R, and then deploy it in Vertica to score data real time in an analytic application – something many of our customers need.
Qx Anything you with to add?
Maguire, Roy: Predictive analytics is a market which has been lagging the growth of big data – full of tools developed twenty or more years ago which simply weren’t built with today’s challenges in mind.
With HP Distributed R we are not only providing users with scalable and high performance solutions, but also making a difference in the open source community. We look forward to nurturing contributors who can straddle the world of data science and distributed systems.
A core tenet of our big data strategy is to create a positive developer experience, and we are very focused on technology development and fulfillment choices which support that goal.
Walter Maguire has twenty-eight years of experience in analytics and data technologies.
He practiced data science before it had a name, worked with big data when “big” meant a megabyte, and has been part of the movement which has brought data management and analytic technologies from back-office, skunk works operations to core competencies for the largest companies in the world. He has worked as a practitioner as well as a vendor, working with analytics technologies ranging from SAS and R to data technologies such as Hadoop, RDBMS and MPP databases. Today, as Chief Field Technologist with the HP Big Data Group, Walt has the unique pleasure of addressing strategic customer needs with Haven, the HP big data platform.
Indrajit Roy is a principal researcher at HP. His research focusses on next generation distributed systems that solve the challenges of Big Data. Indrajit’s pet project is HP Distributed R, a new open source product that helps data scientists. Indrajit has multiple patents, publications and a best paper award. In the past he worked on computer security and parallel programming. Indrajit received his PhD in computer science from the University of Texas at Austin.
- Download HP Distributed R V1.0
- HP Distributed R Data Sheet
- Online Documentation
- HP Distributed R Source on GitHub
- Big Data Predictive Analytics Survey Infographic
- Distributed Machine Learning and Graph Processing with Sparse Matrices
- Using R for Iterative and Incremental Processing
Follow ODBMS.org on Twitter: @odbmsorg
“Some believe that the Gaia data will revolutionize astronomy! Only time will tell if that is true, but it is clear that it will be a treasure trove for astronomers for decades to come.”–Dr. Uwe Lammers.
“The Gaia mission is considered to be the largest data processing challenge in astronomy.”–Vik Nagjee
In December of 2013, the European Space Agency (ESA) launched a satellite called Gaia on a five-year mission to map the galaxy and learn about its past.
The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”.
I recall here the Objectives of the Gaia Project (source ESA Web site):
“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”
I have been following the GAIA mission since 2011, and I have reported it in two interviews until now. This is the third interview of the series, the first one after the launch.
The interview is with Dr. Uwe Lammers, Gaia Science Operations Manager at the European Space Agency, and Vik Nagjee, Product Manager for Data Platforms at InterSystems.
Q1. Could you please elaborate in some detail what is the goal and what are the expected results of the Gaia mission?
Uwe Lammers: We are trying to construct the most consistent, most complete and most accurate astronomical catalog ever done. Completeness means to observe all objects in the sky that are brighter than a so-called magnitude limit of 20. These are mostly stars in our Milky Way up to 1.5 billion in number. In addition, we expect to observe as many as 10 million other galaxies, hundreds of thousands of celestial bodies in our solar system (mostly asteroids), tens of thousands of new exo-planets, and more. Some believe that the Gaia data will revolutionize astronomy! Only time will tell if that is true, but it is clear that it will be a treasure trove for astronomers for decades to come.
Vik Nagjee: The data collected from Gaia will ultimately result in a three-dimensional map of the Milky Way, plotting over a billion celestial objects at a distance of up to 30,000 light years. This will reveal the composition, formation and evolution of the Galaxy, and will enable the testing of Albert Einstein’s Theory of Relativity, the space-time continuum, and gravitational waves, among other things. As such, the Gaia mission is considered to be the largest data processing challenge in astronomy.
Orbiting the Lagrange 2 (L2) point, a fixed spot 1.5 million kilometers from Earth, Gaia will measure the position, movement, and brightness of more than a billion celestial objects, looking at each one an average of 70 times over the course of five years. Gaia’s measurements will be much more complete, powerful, and accurate than anything that has been done before. ESA scientists estimate that Gaia will find hundreds of thousands of new celestial objects, including extra-solar planets, and the failed stars known as brown dwarfs. In addition, because Gaia can so accurately measure the position and movement of the stars, it will provide valuable information about the galaxy’s past – and future – evolution.
Read more about the Gaia mission here.
Q2. What is the size and structure of the information you analysed so far?
Uwe Lammers: From the start of the nominal mission on 25 July until today, we have received about 13 terabytes of compressed binary telemetry from the satellite. The daily pipeline running here at the Science Operations Centre (SOC) has processed all this and generated about 48 TB of higher-level data products for downstream systems.
At the end of the mission, the Main Database (MDB) is expected to hold more than 1 petabyte of data. The structure of the data is complex and this is one of the main challenges of the project. Our data model contains about 1,500 tables with thousands of fields in total, and many inter-dependencies. The final catalog to be released sometime around 2020 will have a simpler structure, and there will be ways to access and work with it in a convenient form, of course.
Q3. Since the launch of Gaia in December 2013, what intermediate results did you obtain by analysing the data received so far?
Uwe Lammers: Last year we found our first supernova (exploding star) with the prototype of the so-called Science Alert pipeline. When this system is fully operational, we expect to find several of these per day. The recent detection of a micro-lensing event was another nice demonstration of Gaia’s capabilities.
Q4. Did you find out any unexpected information and/or confirmation of theories by analysing the data generated by Gaia so far?
Uwe Lammers: It is still too early in the mission to prove or disprove established astronomical theories. For that we need to collect more data and do much more processing. The daily SOC pipeline is only one, the first part, of a large distributed system that involves five other Data Processing Centres (DPCs), each running complex scientific algorithms on the data. The whole system is designed to improve the results iteratively, step by step, until the final accuracy has been reached. However, there will certainly be intermediate results. One simple example of an unexpected early finding is that Gaia gets hit by micro-meteoroids much more often than pre-launch estimates predicted.
Q5. Could you please explain at some high level the Gaia’s data pipeline?
Uwe Lammers: Hmmm, that’s not easy to do in a few words. The daily pipeline at the SOC converts compact binary telemetry of the satellite into higher level products for the downstream systems at the SOC and the other processing centres. This sounds simple, but it is not – mainly because of the complex dependencies and the fact that data does not arrive from the satellite in strict time order. The output of the daily pipeline is only the start as mentioned above.
From the SOC, data gets sent out daily to the other DPCs, which perform more specialized processing. After a number of months we declare the current data segment as closed, receive the outputs from the other DPCs back at the SOC, and integrate all into a coherent next version of the MDB. The creation of it marks the end of the current iteration and the start of a new one. This cyclic processing will go on for as many iterations as needed to converge to a final result.
An important key process is the Astrometric Global Iterative Solution (AGIS), which will give us the astrometric part of the catalog. As the name suggests, it is in itself an iterative process and we run it likewise here at the SOC.
Vik Nagjee: To add on to what Dr. Lammers describes, Gaia data processing is handled by a pan-European collaboration, the Gaia Data Processing and Analysis Consortium (DPAC), and consists of about 450 scientists and engineers from across Europe. The DPAC is organized into nine Coordination Units (CUs); each CU is responsible for a specific portion of the Gaia data processing challenge.
One of the CUs – CU3: Core Processing – is responsible for unpacking, decompressing, and processing the science data retrieved from the satellite to provide rapid monitoring and feedback of the spacecraft and payload performances at the ultra-precise accuracy levels targeted by the mission. In other words, CU3 is responsible for ensuring the accuracy of the data collected by Gaia, as it is being collected, to ensure the accuracy of the eventual 3-D catalog of the Milky Way.
Over its lifetime, Gaia will generate somewhere between 500,000 to 1 million GB of data. On an average day, approximately 50 million objects will “transit” Gaia’s field of view, resulting in about 285 GB of data. When Gaia is surveying a densely populated portion of the galaxy, the daily amount could be 7 to 10 times as much, climbing to over 2,000 GB of data in a day.
There is an eight-hour window of time each day when raw data from Gaia is downloaded to one of three ground stations.
The telemetry is sent to the European Space Astronomy Centre (ESAC) in Spain – the home of CU3: Core Processing – where the data is ingested and staged.
The initial data treatment converts the data into the complex astrometric data models required for further computation. These astrometric objects are then sent to various other Computational Units, each of which is responsible for looking at different aspects of the data. Eventually the processed data will be combined into a comprehensive catalog that will be made available to astronomers around the world.
In addition to performing the initial data treatment, ESAC also processes the resulting astrometric data with some complex algorithms to take a “first-look” at the data, making sure that Gaia is operating correctly and sending back good information. This processing occurs on the Initial Data Treatment / First Look (IDT/FL) Database; the data platform for the IDT/FL database is InterSystems Caché.
Q6. Observations made and conclusions drawn are only as good as the data that supports them. How do you evaluate the “quality” of the data you receive? and how do you discard the “noise” from the valuable information?
Uwe Lammers: A very good question! If you refer to the final catalog, this is a non-trivial problem and a whole dedicated group of people is working on it. The main issue is, of course, that we do not know the “true” values as in simulations. We work with models, e.g., models of the stars’ positions and the satellite orientation. With those we can predict the observations, and the difference between the predicted and the observed values tells us how well our models represent reality. We can also do consistency checks. For instance, we do two runs of AGIS, one with only the observations from odd months and another one from even months, and both must give similar results. But we will also make use of external astronomical knowledge to validate results, e.g., known distances to particular stars. For distinguishing “noise” from “signal,” we have implemented robust outlier rejection schemes. The quality of the data coming directly from the satellite and from the daily pipeline is assessed with a special system called First Look running also at the SOC.
Vik Nagjee: The CU3: Core Processing Unit is responsible for ensuring the accuracy of the data being collected by Gaia, as it is being collected, so as to ensure the accuracy of the eventual 3-D catalog of the Milky Way.
InterSystems Caché is the data platform used by CU3 to quickly determine that Gaia is working properly and that the data being downloaded is trustworthy. Caché was chosen for this task because of its proven ability to rapidly ingest large amounts of data, populate extremely complex astrometric data models, and instantly make the data available for just-in-time analytics using SQL, NoSQL, and object paradigms.
One million GB of data easily qualifies as Big Data. What makes InterSystems Caché unique is not so much its ability to handle very large quantities of data, but its abilities to provide just-in-time analytics on just the right data.
We call this “Big Slice” — which is where analytics is performed just-in-time for a focused result.
A good analogy is how customer service benefits from occasional Big Data analytics. Breakthrough customer service comes from improving service at the point of service, one customer at a time, based on just-in-time processing of a Big Slice – the data relevant to the customer and her interactions. Back to the Gaia mission: at the conclusion of five years of data collection, a true Big Data exercise will plot the solar map. Yet, frequently ensuring data accuracy is an example of the increasing strategic need for our “Big Slice” concept.
Q7. What kind of databases and analytics tools do you use for the Gaia`s data pipeline?
Uwe Lammers: At the SOC all systems use InterSystems’ Caché database. Despite some initial hiccups, Cache´ has proved to be a good choice for us. For analytics we use a few popular generic astronomical tools (e.g., topcat), but most are custom-made and specific to Gaia data. All DPCs had originally used relational databases, but some have migrated to Apache’s Hadoop.
Q8. Specifically for the Initial Data Treatment/First Look (IDT/FL) database, what are the main data management challenges you have?
Uwe Lammers: The biggest challenge is clearly the data volumes and the steady incoming stream that will not stop for the next five years. The satellite sends us 40-100 GB of compressed raw data every day, which the daily pipeline needs to process and store the output in near real time, as otherwise we quickly accumulate backlogs.
This means all components, the hardware, databases, and software, have to run and work robustly more or less around the clock. The IDTFL database grows daily by a few hundred gigabytes, but not all data has to be kept forever. There is an automatic cleanup process running that deletes data that falls out of chosen retention periods. Keeping all this machinery running around the clock is tough!
Vik Nagjee: Gaia’s data pipeline imposes some rather stringent requirements on the data platform used for the Initial Data Treatment/First Look (IDT/FL) database. The technology must be capable of ingesting a large amount of data and converting it into complex objects very quickly. In addition, the data needs to be immediately accessible for just-in-time analytics using SQL.
ESAC initially attempted to use traditional relational technology for the IDT/FL database, but soon discovered that a traditional RDBMS couldn’t ingest discrete objects quickly enough. To achieve the required insert rate, the data would have to be ingested as large BLOBs of approximately 50,000 objects, which would make further analysis extremely difficult. In particular, the first look process, which requires rapid, just-in-time analytics of the discrete astrometric data, would be untenable. Another drawback to using traditional relational technology, in addition to the typical performance and scalability challenges, was the high cost of the hardware that would be needed.
Since traditional RDBMS technology couldn’t meet the stringent demands imposed by CU3, ESAC decided to use InterSystems Caché.
Q9. How did you solve such challenges and what lessons did you learn until now?
Uwe Lammers: I have a good team of talented and very motivated people and this is certainly one aspect.
In case of problems we are also totally dependent on quick response times from the hardware vendors, the software developers and InterSystems. This has worked well in the past, and InterSystems’ excellent support in all cases where the database was involved is much appreciated. As far as the software is concerned, the clear lesson is that rigorous validation testing is essential – the more the better. There can never be too much. As a general lesson, one of my favorite quotes from Einstein captures it well: “Everything should be made as simple as possible, but no simpler.”
Q10. What is the usefulness of the CU3’s IDT/FL database for the Gaia’s mission so far?
Uwe Lammers: It is indispensable. It is the central working repository of all input/output data for the daily pipeline including the important health monitoring of the satellite.
Vik Nagjee: The usefulness of CU3’s IDT/FL database was proven early in Gaia’s mission. During the commissioning period for the satellite, an initial look at the data it was generating showed that extraneous light was being gathered. If the situation couldn’t be corrected, the extra light could significantly degrade Gaia’s ability to see and measure faint objects.
It was hypothesized that water vapor from the satellite outgassed in the vacuum of space, and refroze on Gaia’s mirrors, refracting light into its focal plane. Although this phenomenon was anticipated (and the mirrors equipped with heaters for that very reason), the amount of ice deposited was more than expected. Heating the mirrors melted the ice and solved the problem.
Scientists continue to rely on the IDT/FL database to provide just-in-time feedback about the efficacy and reliability of the data they receive from Gaia.
Qx Anything else you wish to add?
Uwe Lammers: Gaia is by far the most interesting and challenging project I have every worked on.
It is fascinating to see science, technology, and a large diverse group of people working together trying to create something truly great and lasting. Please all stay tuned for exciting results from Gaia to come!
Vik Nagjee: As Dr. Lammers said, Gaia is truly one of the most interesting and challenging computing projects of all time. I’m honored to have been a contributor to this project, and cannot wait to see the results from the Gaia catalog. Here’s to unraveling the chemical and dynamical history of our Galaxy!
Dr. Uwe Lammers, Gaia Science Operations Manager at the European Space Agency.
Uwe Lammers has a PhD in Physics and a degree in Computer Science and has been working for the European Space Agency on a number of space science mission for the past 20 years. After being involved in the X-ray missions
EXOSAT, BeppoSAX, and XMM-Newton, Gaia caught his attention in 2004.
As of late 2005, together with William O’Mullane, he built up the Gaia Science Operations Centre (SOC) at ESAC near Madrid. From early 2006 to mid-2014 he was in charge of the development of AGIS and is now leading the SOC as Gaia Science Operations Manager.
Vik Nagjee is a Product Manager for Data Platforms at InterSystems.
He’s responsible for Performance and Scalability of InterSystems Caché, and spends the rest of his time helping people (prospects, application partners, end users, etc.) find perfect solutions for their data, processing, and system architecture needs.
Follow ODBMS.org on Twitter: @odbmsorg
“In normal English usage the word resilience is taken to mean the power to resume original shape after compression; in the context of data base management the term data base resilience is defined as the ability to return to a previous state after the occurrence of some event or action which may have changed that state.
from “P. A Dearnley, School of Computing Studies, University of East Anglia, Norwich NR4 7TJ , 1975 “
On the topic database resilience, I have interviewed Seth Proctor, Chief Technology Officer at NuoDB.
Q1. When is a database truly resilient?
Seth Proctor: That is a great question, and the quotation above is a good place to start. In general, resiliency is about flexibility. It’s kind of the view that you should bend but not break. Individual failures (server crashes, disk wear, tripping over power cables) are inevitable but don’t have to result in systemic outages.
In some cases that means reacting to failure in a way that’s non-disruptive to the overall system.
The redundant networks in modern airplanes are a great example of this model. Other systems take a deeper view, watching global state to proactively re-route activity or replace components that may be failing. This is the model that keeps modern telecom networks running reliably. There are many views applied in the database world, but to me a resilient database is one that can react automatically to likely or active failures so that applications continue operating with full access to their data even as failures occur.
Q2. Is database resilience the same as disaster recovery?
Seth Proctor: I don’t believe it is. In traditional systems there is a primary site where the database is “active” and updates are replicated from there to other sites. In the case of failure to the primary site, one of the replicas can take over. Maintaining that replica (or replicas) is usually the key part of Disaster Recovery.
Sometimes that replica is missing the latest changes, and usually the act of “failing over” to a replica involves some window where the database is unavailable. This leads to operational terms like “hot stand-by” where failing over is faster but still delayed, complicated and failure-prone.
True resiliency, in my opinion, comes from systems that are designed to always be available even as some components fail. Reacting to failure efficiently is a key requirement, as is survival in case of complete site loss, so replicating data to multiple locations is critical to resiliency. At a minimum, however, a resilient data management solution cannot lose data (sort of “primum non nocere” for servers) and must be able to provide access to all data even as servers fail. Typical Disaster Recovery solutions on their own are not sufficient. A resilient solution should also be able to continue operations in the face of expected failures: hardware and software upgrades, network updates and service migration.
This is especially true as we push out to hybrid cloud deployments.
Q3. What are the database resilience requirements and challenges, especially in this era of Big Data?
Seth Proctor: There is no one set of requirements since each application has different goals with different resiliency needs. Big Data is often more about speeds and volume while in the operational space correctness, latency and availability are key. For instance, if you’re handling high-value bank transactions you have different needs than something doing weekly trend-analysis on Internet memes. The great thing about “the cloud” is the democratization of features and the new systems that have evolved around scale-out architectures. Things like transactional consistency were originally designed to make failures simpler and systems more resilient; as consistent data solutions scale out in cloud models it’s simpler to make any application resilient without sacrificing performance or increasing complexity.
That said, I look for a couple of key criteria when designing with resiliency in mind. The first is a distributed architecture, the foundation for any system to survive individual failure but remain globally available.
Ideally this provides a model where an application can continue operating even as arbitrary components fail. Second is the need for simple provisioning & monitoring. Without this, it’s hard to react to failures in an automatic or efficient fashion, and it’s almost impossible to orchestrate normal upgrade processes without down-time. Finally, a database needs to have a clear model for how the same data is kept in multiple locations and what the failure modes are that could result in any loss. These requirements also highlight a key challenge: what I’ve just described are what we expect from cloud infrastructure, but are pushing the limits of what most shared-nothing, sharded or vertically-scaled data architectures offer.
Q4. What is the real risk if the database goes offline?
Seth Proctor: Obviously one risk is the ripple effect it has to other services or applications.
When a database fails it can take with it core services, applications or even customers. That can mean lost revenue or opportunity and it almost certainly means disruption across an organization. Depending on how a database goes offline, the risk may also extend to data loss, corruption, or both. Most databases have to trade-off certain elements of latency against guaranteed durability, and it’s on failure that you pay for that choice. Sometimes you can’t even sort out what information was lost. Perhaps most dangerous, modern deployments typically create the illusion of a data management service by using multiple databases for DR, scale-out etc. When a single database goes offline you’re left with a global service in an unknown state with gaps in its capabilities. Orchestrating recovery is often expensive, time-consuming and disruptive to applications.
Q5. How are customers solving the continuous availability problem today?
Seth Proctor: Broadly, database availability is tackled in one of two fashions. The first is by running with many redundant, individual, replicated servers so that any given server can fail or be taken offline for maintenance as needed. Putting aside the complexity of operating so many independent services and the high infrastructure costs, there is no single view of the system. Data is constantly moving between services that weren’t designed with this kind of coordination in mind so you have to pay close attention to latencies, backup strategies and visibility rules for your applications. The other approach is to use a database that has forgone consistency, making a lot of these pieces appear simpler but placing the burden that might be handled by the database on the application instead. In this model each application needs to be written to understand the specifics of the availability model and in exchange has a service designed with redundancy.
Q6. Now that we are in the Cloud era, is there a better way?
Seth Proctor: For many pieces of the stack cloud architectures result in much easier availability models. For the database specifically, however, there are still some challenges. That said, I think there are a few great things we get from the cloud design mentality that are rapidly improving database availability models. The first is an assumption about on-demand resources and simplicity of spinning up servers or storage as needed. That makes reacting to failure so much easier, and much more cost-effective, as long as the database can take advantage of it. Next is the move towards commodity infrastructure. The economics certainly make it easier to run redundantly, but commodity components are likely to fail more frequently. This is pushing systems design, making failure tests critical and generally putting more people into the defensive mind-set that’s needed to build for availability. Finally, of course, cloud architectures have forced all of us to step back and re-think how we build core services, and that’s leading to new tools designed from the start with this point of view. Obviously that’s one of the most basic elements that drives us at NuoDB towards building a new kind of database architecture.
Q7. Can you share methodologies for avoiding single points of failure?
Seth Proctor: For sure! The first thing I’d say is to focus on layering & abstraction.
Failures will happen all the time, at every level, and in ways you never expect. Assume that you won’t test all of them ahead of time and focus on making each little piece of your system clear about how it can fail and what it needs from its peers to be successful. Maybe it’s obvious, but to avoid single points of failure you need components that are independent and able to stand-in for each other. Often that means replicating data at lower-levels and using DNS or load-balancers at a higher-level to have one name or endpoint map to those independent components. Oh, also, decouple your application logic as much as possible from your operational model. I know that goes against some trends, but really, if your application has deep knowledge of how some service is deployed and running it makes it really hard to roll with failures or changes to that service.
Q8. What’s new at NuoDB?
Seth Proctor: There are too many new things to capture it all here!
For anyone who hasn’t looked at us, NuoDB is a relational database build against a fundamentally new, distributed architecture. The result is ACID semantics, support for standard SQL (joins, indexes, etc.) and a logical view of a single database (no sharding or active/passive models) designed for resiliency from the start.
Rather than talk about the details here I’d point people at a white paper (Note of the Editor: registration required) we’ve just published on the topic.
Right now we’re heavily focused on a few key challenges that our enterprise customers need to solve: migrating from vertical scale to cloud architectures, retaining consistency and availability and designing for on-demand scale and hybrid deployments. Really important is the need for global scale, where a database scales to multiple data centers and multiple geographies. That brings with it all kinds of important requirements around latencies, failure, throughput, security and residency. It’s really neat stuff.
Q9- How does it differentiate with respect to other NoSQL and NewSQL databases?
Seth Proctor: The obvious difference to most NoSQL solutions is that NuoDB supports standard SQL, transactional consistency and all the other things you’d associate with an RDBMS.
Also, given our focus on enterprise use-cases, another key difference is the strong baseline with security, backup, analysis etc. In the NewSQL space there are several databases that run in-memory, scale-out and provide some kind of SQL support. Running in-memory often means placing all data in-memory, however, which is expensive and can lead to single points of failure and delays on recovery. Also, there are few that really support the arbitrary SQL that enterprises need. For instance, we have customers running 12-way joins or transactions that last hours and run thousands of statements.
These kinds of general-purpose capabilities are very hard to scale on-demand but they are the requirement for getting client-server architectures into the cloud, which is why we’ve spent so long focused on a new architectural view.
One other key difference is our focus on global operations. There are very few people trying to take a single, logical database and distribute it to multiple geographies without impacting consistency, latency or security.
Qx Anything else you wish to add?
Seth Proctor: Only that this was a great set of questions, and exactly the direction I encourage everyone to think about right now. We’re in a really exciting time between public clouds, new software and amazing capacity from commodity infrastructure. The hard part is stepping back and sorting out all the ways that systems can fail.
Architecting with resiliency as a goal is going to get more commonplace as the right foundational services mature.
Asking yourself what that means, what failures you can tolerate and whether you’re building systems that can grow alongside those core services is the right place to be today. What I love about working in this space today is that concepts like resilient design, until recently a rarefied approach, are accessible to everyone.
Anyone trying to build even the simplest application today should be asking these questions and designing from the start with concepts like resiliency front and center.
Seth Proctor, Chief Technology Officer, NuoDB
Seth has 15+ years of experience in the research, design and implementation of scalable systems. That experience includes work on distributed computing, networks, security, languages, operating systems and databases all of which are integral to NuoDB. His particular focus is on how to make technology scale and how to make users scale effectively with their systems.
Prior to NuoDB Seth worked at Nokia on their private cloud architecture. Before that he was at Sun Microsystems Laboratories and collaborated with several product groups and universities. His previous work includes contributions to the Java security framework, the Solaris operating system and several open source projects, in addition to the design of new distributed security and resource management systems. Seth developed new ways of looking at distributed transactions, caching, resource management and profiling in his contributions to Project Darkstar. Darkstar was a key effort at Sun which provided greater insights into how to distribute databases.
Seth holds eight patents for his cutting edge work across several technical disciplines. He has several additional patents awaiting approval related to achieving greater database efficiency and end-user agility.
- On Big Data and the Internet of Things. Interview with Bill Franks
Source: ODBMS.org | Published on 2015-03-09
- On MarkLogic 8. Interview with Stephen Buxton
Source: ODBMS.org | Published on 2015-02-13
- On Data Curation. Interview with Andy Palmer
Source: ODBMS.org | Published on 2015-01-14
- On Solr and Mahout. Interview with Grant Ingersoll
Source: ODBMS.org | Published on 2015-01-06
Follow ODBMS.org on Twitter: @odbmsorg
“Perhaps the biggest challenge is that the IoT has the potential to generate orders of magnitude more data than any other source in existence today. So, in the world of the IoT we will test the limits of ‘big.’”–Bill Franks
On topics of Data Warehouses, Hadoop, the Internet of Things, and Teradata`s perspective on the world of Big Data, I have interviewed Bill Franks, Chief Analytics Officer for Teradata.
Q1. What is Teradata`s perspective on the world of Big Data?
Bill Franks: Our perspective has not really changed with regard to ‘big data:’ the primary mission of Teradata for decades has been helping organizations utilize and analyze large volumes of data to produce insight for business value. Note that our Teradata database was originally designed exclusively for analytics, then called ‘decision support’ – unlike most other platforms, which were designed for general computing – then later adapted for analytic uses. As a result, the Teradata analytic engine is – and has always been – uniquely architected for large – ‘big data’ – volume and complexity aimed at producing actionable intelligence.
Of course, the amount of data that’s considered ‘big’ and thus a challenge – has changed, and we have a lot of novel data sources in recent times. However, we believe that companies which have always focused on analyzing and acting upon data intelligently can adapt to the new world of big data. After all, big data is just more data and the analysis of big data is still analysis. There are as many similarities as differences from the past.
Teradata has engineered further analytic enhancements over the years to create a diverse portfolio of products, partnerships, and services to allow our customers to continue to get the most from their data assets. The pace of change is very rapid today and we expect that to continue. We believe our strength is in our experience, expertise and our ability to help organizations navigate the changing landscape and continue to derive new, useful insights from their increasingly large and diverse data sources.
Q2. Most data warehousing projects consolidate data from different source systems. What is different in the world of Big Data?
Bill Franks: By definition, if you want to look at two different data sources together, you must either move one set of data to the other or move them both to a 3rd location. If data is truly disparate, you can’t use it effectively. That is what drove data warehousing to prominence. One huge difference between data warehousing practices years back and then today is that previously, all data that was captured in the business world met three criteria almost 100% of the time.
1) It was immensely important; given the cost to capture and store it,
2) The data was well structured, and
3) The data was generated by an organization’s internal business processes.
— Therefore, it was mostly placed in relational databases or on a mainframe since those technologies easily handle that type of data. Data warehousing solved the problem of many structured data platforms being spread out – by consolidating the sources for analytic purposes into a single structured platform.
What is different with big data is that today, the data often violates all of the rules.
1) Much of it is not important, or has not yet been proven to be important,
2) The data is not structured in the classic fashion at the outset (though most can and must be structured for analytical purposes), and
3) The data is often from sources external to an organization.
— As a result, we now have disparate data platforms that each serve different functions. Some focus on one type of data, while others focus on flexibility. However, the downside is that these platforms don’t integrate well and it isn’t as easy to tie everything together. That’s a problem Teradata is working diligently to solve with our Unified Data Architecture – our pioneering version of the visionary Gartner Logical Data Warehouse.
Q3. Will data warehouses become obsolete soon and be replaced by Hadoop?
Bill Franks: Absolutely not. A few years ago, that was a common claim. That claim is rarely heard today. In fact, all of the big Hadoop vendors partner with Teradata.
This is because our data warehousing platforms provide some important things Hadoop does not — just as Hadoop provides some things a data warehouse does not. Each platform has its strengths and weaknesses, but when positioned together, additional value is added. Part of the issue is that people mistake policy decisions for technology limitations.
There is no reason you can’t place untested, raw, unclean data of unknown value on a data warehousing platform; it’s the corporate policies that often forbid it. It is true that once data is critical and is leveraged by many applications and business users, you have to keep some control and consistency over it. This is what a data warehouse does for an organization.
But, that doesn’t mean you can’t experiment with new sources freely using the technology that supports formal data warehouses.
A colleague of mine mentioned a conversation he had with a Hadoop user. That user was boasting about how he could with a single command change the data type of information on Hadoop, for instance, if it would help him more easily solve his next problem. My friend then asked him what would happen to the prior dozen or two processes that were built expecting the data to be in the original data type format. Wouldn’t they all then break? The user had a blank stare for a moment and then realized his error. As you develop more processes, you must implement security, consistency, and controls on the underlying data. This is why data warehousing – as Gartner defines it, is going to be around for a long time.
Q4. With the increased need of tools for combining data together, are we going to see a “federated”- Big Data architecture?
Bill Franks: A form of that is exactly what we are pursuing with Teradata’s Unified Data Architecture. Again – we refer to Gartner’s vision of the “Logical Data Warehouse.” What we are doing is putting in place a layer of architecture that connects multiple disparate data stores. This architecture includes – and connects – relational databases like Teradata and Oracle, discovery platforms like our Teradata Aster offering, Hadoop, and other platforms such as MongoDB. The idea is that we make information available to users about data throughout the ecosystem, not just the data on the platform they are operating from. So, I see a data dictionary that includes a “table” called “Sensor Feed.”
I can see the data elements available and write analytic logic against those elements. However, I don’t need to be aware of whether the data is a database table, or a Hadoop file, or is in MongoDB. Users can simply build analytics instead of worrying about where data resides, how to log on to various systems, and how to move data. We’ll handle that for them.
We are also beginning to push processing across the various platforms to optimize performance. Just like with a ‘table’ versus a ‘view’ in a database, making a process enterprise-ready might require moving data around the architecture permanently. But now, users are free to discover where that is required. And, the technical team behind the curtain can worry about the details just as they do with traditional data warehousing. We are very bullish on our approach and think we are well positioned to maintain our leadership position in the analytics space.
Q5. Teradata made several acquisitions lately. How do the tools that Teradata acquired fit the current Teradata Data Architectural Framework?
Bill Franks: I believe this in general was addressed. However, in addition I would point out that we acquired Revelytix in 2014 to obtain Loom: an open platform for discovering, profiling, preparing, and tracking data lineage for data in Hadoop. Likewise, we acquired Hadapt, which created a big data analytic platform natively integrating SQL with Apache Hadoop. Plus our recent RainStor acquisition strengthens Teradata’s enterprise-grade Hadoop solutions and enables organizations to add archival data store capabilities for their entire enterprise, including data stored in OLTP, data warehouses, and applications.
Q6. What are the key differentiators of the Teradata Database core architecture?
Bill Franks: As I said, the Teradata DW was differentiated from the start – uniquely architected for analytics from day one. However, I would add that Teradata continues to broaden our differentiation: we’ve built the best data orchestration software in the industry (Teradata Unity and QueryGrid). The orchestration software is key – because it enables our customers to choose a file system that they use to store the data in – and the analytics that they apply to that data independently — and marry them together with software.
It helps reduce the complexity of connecting to, accessing, understanding interfaces and getting value from multiple analytical systems. Another differentiator is Teradata Intelligent Memory, introduced two years ago. TIM is the world’s first extended memory technology beyond cache to increase query performance. Users can configure the exact amount of in-memory capability needed for critical workloads – based on temperature – hot or cold data. The list goes on. I would say that our data technology really does focus on how data is best used – and what proficient users need most.
Q7. Is SQL really the right language to handle Big Data Analytics?
Bill Franks: In some cases yes and in some cases no. We want users to be able to utilize whatever language or platform is best for any given task. There are many big data requirements that perfectly fit SQL and many that don’t. The key is enabling scalable access to the data and flexibility in approach. Most people are aware that there is a big effort to add a SQL interface to Hadoop. What most haven’t realized is how far we’ve also come the other direction. For some time, Teradata has allowed C and Java processing directly against our database platforms via User Defined Functions and other similar extensions. We are now also enabling other languages such as R and Python to be executed within a Teradata context. What is possible today is so far beyond what was possible even 5 or 10 years ago.
Q8. How do you see the adoption of Cloud for Analytics?
Bill Franks: We are aggressively rolling out our own cloud offerings across our product suites. Many of our enterprise customers also configure our products as a private cloud behind their firewall. Adoption will be mixed based on the type of data and nature of work being done. Anything involving sensitive data is still typically not allowed outside a firewall. If you think back to the issue raised in a prior question of having to be able to combine data for analytics, you can’t really have some data locked behind a firewall and some data locked outside it. The real driver behind the cloud is that people want flexible, pay on demand access to analysis platforms. We have multiple ways to provide that to our clients, of which our cloud offerings are only one option. We have some other novel pricing and licensing options the help customers get access to the resources they require for analytics.
Q9. What are the most important data challenges posed by the Internet of Things (IoT)?
Bill Franks: Perhaps the biggest challenge is that the IOT has the potential to generate orders of magnitude more data than any other source in existence today.
So, in the world of the IOT we will test the limits of ‘big.’ At the same time, much of the data generated by the IOT will have low value in the short term and no value in the long term. One of the biggest challenges will be determining which pieces of the information generated by a given sensor actually matters to your business and for how long. In the long run, it is likely that only a small fraction of the raw data produced by the IOT will be stored beyond a few moments of immediate usage. For example, why keep the sensor readings that help navigate my car into a tight parking spot? Once I’m safely in the spot, I really don’t ever need to revisit that data again. If I hit a car in front of me, I might make an exception and keep the data so that the cause can be identified.
Q10. Could you mention some successful Big Data projects you have recently completed with customers?
Bill Franks: We are seeing a lot of very interesting analytics come about. We’ve helped health organizations discover genetic patterns associated with disease, we’ve helped manufacturers reduce cost and increase customer satisfaction by building predictive maintenance algorithms, we’ve helped cable providers identify valuable consumer viewing habits.
I could go on and on. A great place to see some of the examples, and even hear from some of the companies and people behind it, is at our website.
Bill Franks is the Chief Analytics Officer for Teradata, where he provides insight on trends in the analytics and big data space and helps clients understand how Teradata and its analytic partners can support their efforts. His focus is to translate complex analytics into terms that business users can understand and work with organizations to implement their analytics effectively. His work has spanned many industries for companies ranging from Fortune 100 companies to small non-profits. Franks also helps determine Teradata’s strategies in the areas of analytics and big data.
Franks is the author of the book Taming The Big Data Tidal Wave (John Wiley & Sons, Inc., April, 2012). In the book, he applies his two decades of experience working with clients on large-scale analytics initiatives to outline what it takes to succeed in today’s world of big data and analytics. The book made Tom Peter’s list of 2014 “Must Read” books and also the Top 10 Most Influential Translated Technology Books list from CSDN in China.
Franks’ second book The Analytics Revolution (John Wiley & Sons, Inc., September, 2014) lays out how to move beyond using analytics to find important insights in data (both big and small) and into operationalizing those insights at scale to truly impact a business.
He is a faculty member of the International Institute for Analytics, founded by leading analytics expert Tom Davenport, and an active speaker who has presented at dozens of events in recent years. His blog, Analytics Matters, addresses the transformation required to make analytics a core component of business decisions.
Franks earned a Bachelor’s degree in Applied Statistics from Virginia Tech and a Master’s degree in Applied Statistics from North Carolina State University. More information is available here: http://www.bill-franks.com.
Follow ODBMS.org on Twitter: @odbmsorg
“Apache Ignite is an incubating Apache project, which provides a high-performance, distributed in-memory data management software layer between various data sources and applications.”–Nikita Ivanov
I have interviewed Nikita Ivanov, founder and CTO of GridGain Systems. Main topic of the interview is the new release of Apache Ignite.
Q1. In your opinion, what are the main differences between an In-Memory Database, an In-Memory Data Grid and an In-Memory Data Fabric?
Nikita Ivanov: The main difference between in-memory databases (IMDB) and in-memory data grids is that IMDBs support only SQL (or some proprietary NoSQL dialect) while most Data Grids (IMDG) support multiple ways to access and process data. In IMDB the only way to access and process data is SQL and SQL store-procedures, while IMDGs typically support at least the following paradigms: SQL, Key/Value, MapReduce, MPP, and MPI-based processing.
Compared with IMDGs, an In-Memory Data Fabric represents the latest generation of in-memory technologies, integrated into a single platform, which eliminates the need for point solutions such as IMDB’s or IMDGs. It is a software layer that sits between applications and data stores, and it allows for high-performance data access and processing across different types of data, such as SQL, NoSQL and Hadoop. All without any rip and replace of existing applications or databases.
Q2. How is it possible to accelerate Hadoop-based deployments with in-memory technology?
Nikita Ivanov: To accelerate Hadoop in a meaningful way one needs to find a way to accelerate two core technologies that define Hadoop: HDFS, a distributed file system where data is stored, and MapReduce, a framework that allows parallel processing of the data stored in HDFS.
At GridGain, we’ve developed a highly optimized in-memory file system that is 100% compatible with HDFS that allows to store data directly in DRAM of computers in a Hadoop cluster. We’ve also developed a specifically optimized YARN-based MapReduce implementation that takes full advantage of the data stored directly in DRAM instead of disks.
The combination of these two innovations allows GridGain to speed up any Hadoop payloads – including Pig, Hive, or hand-written MapReduce jobs in any language – up to 10x without any code change. GridGain provides the first Hadoop accelerator that provide a true plug-and-play acceleration to the existing Hadoop jobs.
Q3. Why did you decide to open source your product?
Nikita Ivanov: Even before October of last year, GridGain already had an open core model: We offered an in-memory data fabric under the Apache 2.0 license, and we also offered a commercial edition with a number of enterprise-grade features, such as enhanced security, data center replication, rolling updates, cross-language portability, and others.
The drivers for our decision to contribute our core open source code base to the Apache Software Foundation (ASF) were of course to ensure continued, broad adoption of in-memory technologies and the long-term viability of the code base. But equally importantly, we also want to build a thriving community that adopts and adapts this code base, and hence will be key in finding new use cases for in-memory computing.
Q4. What is Apache Ignite?
Nikita Ivanov: Apache Ignite (incubating) is an open source, distributed framework for a unified In-Memory Data Fabric, originally developed by GridGain Systems. Apache Ignite is an incubating Apache project, which provides a high-performance, distributed in-memory data management software layer between various data sources and applications. Its code is written mostly in Java and Scala with small amount of C++ code, and it will initially combine an in-memory data grid, in-memory compute grid and in-memory streaming processing in one framework.
Apache Ignite’s large scale, in-memory framework offers transactional and real-time analytics applications performance gains of 100-1,000 times faster throughput and/or lower latencies. It is also a key open source foundation to enable the emerging class of so-called hybrid transactional-analytical workloads.
Q5. What is special about v1.0?
Nikita Ivanov: In October of 2014, the GridGain In-Memory Data Fabric core code base was accepted by the Apache Software Foundation (ASF) into the Incubator program under the name “Apache Ignite”.
Since then, GridGain engineers as well as other contributors have been busy working on migrating the existing code base, documentation, and refactoring of the existing internal build, test & release processes to the “Apache Way”.
Version 1.0 represents the first release that meets these goals, and will include additional enhancements above and beyond the most recent open source In-Memory Data Fabric from GridGain. In fact, Apache Ignite has a large set of features, and one of its coolest new features is its ability to automatically integrate with different RDBMS systems, such as Oracle, MySql, Postgres, DB2, Microsoft SQL, etc. This feature automatically generates the application domain model based on the schema definition of the underlying database, and then loads the data.
Despite the breadth of its feature set, however, Ignite is actually very easy to use: For example, there are no custom installers. The product comes as one ZIP file, which is ready to go once you unzip it. And it has only 1 mandatory dependency – ignite-core.jar. All other dependencies, like integration with Spring for configuration, or with the H2 database for SQL, can be added to the process a la carte. Also, the project is fully mavenized, and is composed of over a dozen of maven artifacts that can be imported and used in any combination. Apache Ignite is based on standard Java APIs, and for distributed caches and data grid functionality Ignite implements the JCache (JSR 107) standard.
The new Apache Ignite v1.0 bits are available for download now from the Apache Ignite web site.
Q6. Who will be using the Apache Ignite In-Memory Data Fabric, and for what?
Nikita Ivanov: We expect developers and software architects of high-performance, hyper-scale on-premise and SaaS applications to take advantage of the following capabilities when building or performance-tuning their new or existing applications: compute grid, data grid, service grid, streaming, clustering, distributed data structures, distributed messaging, distributed events and in-memory file system.
Use cases can be found in software designed for financial services, telecommunications, retail, transportation, social media, online advertising, utilities, biosciences and many other industries.
Q7. What is positioning of the Apache Ignite project?
Nikita Ivanov: As we explained in our blog from last November, we believe Apache Ignite has all the right ingredients to become for the world of Fast Data what Hadoop is for Big Data today. This means that unlike Hadoop, which is a batch process focused on enabling the storage of large amounts of data economically, Ignite will enable extremely fast and ultra-low latency processing of data, allowing its users to derive actionable insights from their data much faster. Unlike Spark, a popular sister project of Ignite in the ASF, which is mainly focused on enhancing analytics and machine-learning for the Hadoop world, Ignite is a data source agnostic processing layer, which can be used for both Hadoop-like computation and many other computing paradigms like MPP, MPI, streaming processing.
In addition to real-time analytics, Ignite’s in-memory framework also offers support for full ACID transactions.
Q8. You have previously posted that Oracle and SAP are missing the point of In-Memory Computing. Could you please elaborate on this?
Nikita Ivanov: We continue to believe that Oracle and SAP are missing the point of in-memory computing for the following reasons: By offering a well-integrated platform of a compute grid, data grid, streaming/CEP and Hadoop acceleration, Apache Ignite (incubating) and the GridGain In-Memory Data Fabric offer a strategic approach to in-memory computing, across both transactional and analytical workloads, that delivers performance, scale and comprehensive capabilities far above and beyond what traditional in-memory databases, data grids or other in-memory-based point solutions can offer by themselves.
Both Apache Ignite and GridGain’s enterprise offering built on Apache Ignite will greatly benefit from a thriving community adapting the code base to new and emerging use cases; therefore, we believe this code base is extremely well positioned to drive superior innovation to the world of Fast Data, just as the Hadoop community has been doing for Big Data.
In addition, unlike Oracle or SAP Hana, Apache Ignite is more affordable, easier-to-access and more transparent open source software running on commodity hardware, which typically increases developers’ and architects’ motivation to explore the potential of in-memory computing. That said, if all the customer is looking for from in-memory technology is faster processing of their (SQL) data, then they may still choose to deploy proprietary software from Oracle or SAP.
Qx Anything else you wish to add?
Nikita Ivanov: I guess I should mention that even though Apache Ignite has been in incubation for less than 4 months only, we are excited to see that the project already has a very vibrant and growing community.
But we always welcome community contributions, so if there are readers that would like to contribute, please send an email to the Apache Ignite dev list, and we will get you started. And even if you are not ready to contribute immediately, we would like to invite everyone to join our dev list. Most of the discussions happen there, and you can find out a lot about where the project is going and also provide your own ideas. Another great way, of course, for people to familiarize themselves with Apache Ignite, is to take a look at the code and see what it can do for thier project. The Ignite bits can be downloaded on the Apache Ignite homepage.
Nikita Ivanov is founder and CTO of GridGain Systems, started in 2007 and funded by RTP Ventures and Almaz Capital. Nikita has led GridGain to develop advanced and distributed in-memory data processing technologies – the top Java in-memory data fabric starting every 10 seconds around the world today.
Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. Nikita was one of the pioneers in using Java technology for server side middleware development while working for one of Europe’s largest system integrators in 1996.
He is an active member of Java middleware community, contributor to the Java specification, and holds a Master’s degree in Electro Mechanics from Baltic State Technical University, Saint Petersburg, Russia.
– On Solr and Mahout. Interview with Grant Ingersoll. ODBMS Industry Watch, 2015-01-06
– Big Data: Three questions to McObject. ODBMS Industry Watch, February 14, 2014
Follow ODBMS.org on Twitter: @odbmsorg
“When trades are reconciled with counterparties and then closed, updates can and do occur. Bitemporal helps ensure investment banks can always go back and see when updates occurred for specific trades. This is critical to managing risk and handling increased concerns about regulatory compliance and future audits. “– Stephen Buxton.
MarkLogic recently released MarkLogic 8. I wanted to know more about this release. For that, I have interviewed Stephen Buxton, Senior Director, Product Management at MarkLogic.
Q1. You have recently launched MarkLogic® 8 software release. How is it positioned in the Big Data market? How does it differentiate from other products from NoSQL vendors?
Stephen Buxton: MarkLogic 8 is our biggest release ever, further solidifying MarkLogic’s position in the market as the only Enterprise NoSQL database.
With MarkLogic 8, you can now store, manage and search JSON, XML, and RDF all in one unified platform—without sacrificing enterprise features such as transactional consistency, security, or backup and recovery.
While other database companies are still figuring out how to strengthen their platform and add features like transactional consistency, we’ve moved far ahead of them by working on new innovative features such as Bitemporal and Semantics. It’s for these reasons that over 500 enterprise organizations have chosen MarkLogic to run their mission-critical applications.
MarkLogic 8 is more powerful, agile, and trusted than ever before, and is an ideal platform for doing two things: making heterogeneous data integration simpler and faster; and for doing dynamic content delivery at massive scale.
Relational databases do not offer enough flexibility—integration projects can take multiple years, cost millions of dollars, and struggle at scale. But, the newer NoSQL databases that do have agility still lack the enterprise features required to run in the data centers at large organizations. MarkLogic is the only NoSQL database that is able to solve today’s challenge, having the flexibility to serve as an operational and analytical database for all of an organization’s data.
Within MarkLogic, the JSON structure is mapped directly to the internal structure already used by the XML document format, so it has the same speed and scalability as with XML. This also means that all of the production-proven indexing, data management, and security capabilities that MarkLogic is known for are fully maintained.
Q3. In MarkLogic 8 you have been adding full SPARQL 1.1 support and Inferencing capability. Could you please explain what kind of Inferencing capability did you add and what are they useful for?
Stephen Buxton: We made a big leap forward on the semantics foundation that was laid in our previous release, adding full SPARQL 1.1 support, which includes support for property paths, aggregates, and SPARQL Update. Support for automatic inferencing was also added, which is a powerful capability that allows the database to combine existing data and apply pre-defined rules to infer new data. SPARQL 1.1 is a standard defined by the W3C that is supported by many RDF triple stores. But, MarkLogic differentiates itself among triple stores as you can store your documents and data right alongside your triples, and you can query across all three data models easily and efficiently.
Automatic inferencing is a really powerful feature that is part of an overall strategy to provide a more intelligent data layer so that you can build smarter apps.
With inferencing, for example, if you had two pieces of data stored as RDF triples, such as “John lives in Virginia” and “Virginia is in the United States”, then MarkLogic 8 could infer the new fact, “John lives in the United States.”
This can make search results richer and also show you new relationships in your data.
In MarkLogic 8, rules for inferencing are applied at query time. This approach is referred to as backward-chaining inference, a very flexible approach in which only the required rules are applied for each query, so the server does the minimum work necessary to get the correct results; and when your data or ontology or rules sets change, that change is available immediately – it takes effect with the very next query. And, of course, inference queries are transactional, distributed, and obey MarkLogic’s rule-based security, just like any other query. MarkLogic 8 has supplied rule sets for RDFS, RDFS-Plus, OWL-Horst, and their subsets; and you can create your own. With MarkLogic 8 you can further restrict any SPARQL query (with or without inference) by any document attribute, including timestamp, provenance, or even a bitemporal constraint.
More details and examples can be found at developer.marklogic.com.
Q4. The additions to SPARQL include Property Paths, Aggregates, and SPARQL Update. Could you please explain briefly each of them?
Stephen Buxton: SPARQL 1.1 brings support for property paths, aggregates, and SPARQL Update. These capabilities make working with RDF data simpler and more powerful, which means increased context for your data—all using the SPARQL 1.1 industry standard query language.
SPARQL 1.1’s property paths let you traverse an RDF graph – bouncing from point-to-point across a graph. This graph traversal allows you to do powerful, complex queries such as, “Show me all the people who are connected to John” by finding people that know John, and people that know people that know John, and so on.
With aggregate SPARQL functions you can do analytic queries over hundreds of billions of triples. MarkLogic 8 supports all the SPARQL 1.1 Aggregate functions – COUNT, SUM, MIN, MAX, and AVG – as well as the grouping operations GROUP BY, GROUP BY .. HAVING, GROUP_CONCAT and SAMPLE.
SPARQL 1.1 also includes SPARQL Update. With these capabilities, you can delete, insert, and update (delete/insert) individual triples, and manipulate RDF graphs, all using SPARQL 1.1.
Q5. The addition of SPARQL Update capabilities could have the potential to influence the capability you offer of a RDF triple store that scales horizontally and manages billions of triples. Any comment on that?
Stephen Buxton: The enhancements in MarkLogic 8 make it able to function as a full-featured, stand-alone triple store– this means you can now get a triple store that is horizontally scalable as part of a shared-nothing cluster, and still get all of the enterprise features MarkLogic is known for such as such as High Availability, Disaster Recovery, and certified security. Beyond that, anyone looking for “just a triple store” will find they can also store, manage, and query documents and data in the same database, a unique capability that only MarkLogic has.
Q6. You have been adding a so called Bitemporal Data Management. What is it and why is it useful?
Stephen Buxton: Bitemporal is a new feature that allows you to ask, “What did you know and when did you know it?” The MarkLogic Bitemporal feature answers this critical question by tracking what happened, when it happened, and when we found out. A bitemporal database is much more powerful than a temporal database that can only track when something happened. The difference between when something happened and when you found out about it can be incredibly significant, particularly when it comes to audits and regulation.
A bitemporal database tracks time across two different axes, the system and valid time axes. This allows you to go back in time and explore data, manage historical data across systems, ensure data integrity, and do complex bitemporal analysis. You can answer complex questions such as:
• Where did John Thomas live on August 20th as we knew it on September 1st?
• Where was the Blue Van on October 12th as we knew it on October 23rd?
Bitemporal is important for a wide variety of use cases across industries. Getting a more accurate picture of a business at different points-in-time used to be impossible, or very challenging at best. Bitemporal helps ensure that you always have a full and accurate picture of your data at every point-in-time, which is particularly useful in regulated industries.
• Regulatory requirements – Avoid the increasingly harsh downside consequences from not adhering to government and industry regulations, particularly in financial services and insurance
• Audits – Preserve the history of all your data, including the changes made to it, so that clear audits can be conducted without having to worry about lost data, data integrity, or cumbersome ETL processes with archived data
• Investigations and Intelligence – No more lost emails and no more missing information. Bitemporal databases never erase data, so it is possible to see exactly how data was updated based on what was known at the time
• Business Analytics – Run complex queries that were not previously possible in order to better understand your business and answer new questions about how different decisions and changes in the past could have led to different results
• Cost reduction – Manage data with a smaller footprint as the shape of the data changes, avoiding the need to set up additional databases for historical data.
Bitemporal is enhanced by MarkLogic’s Tiered Storage, which allows you to more easily archive your data to cheaper storage tiers with little administrative overhead. This keeps Bitemporal simple, and obviates the high cost imposed by the few relational databases that do have Bitemporal. MarkLogic also eliminates the schema roadblocks that relational databases that have Bitemporal struggle with. MarkLogic is schema-agnostic and can adjust to the shape of data as that data changes over time.
Q7. How is bitemporal different from versioning?
Stephen Buxton: Bitemporal works by ingesting bitemporal documents that are managed as a series of documents with range indexes for valid and system time axes. Documents are stored in a temporal collection protected by security permissions. The initial document inserted into the database is kept and never changes, allowing you to track the provenance of information with full governance and immutability.
Q8. Could you give us some examples of how Bitemporal Data Management could be useful applications for the financial services industry?
Stephen Buxton: One example of Bitemporal is trade reconciliation in financial services. When trades are reconciled with counterparties and then closed, updates can and do occur. Bitemporal helps ensure investment banks can always go back and see when updates occurred for specific trades. This is critical to managing risk and handling increased concerns about regulatory compliance and future audits.
Imagine the Head of IT Architecture at a major bank working on mining information and looking for changes in risk profiles. The risk profiles cannot be accurately calculated without having an accurate picture of the reference and trade data, and how it changed over time. This task becomes simple and fast using Bitemporal.
Qx Anything else you wish to add?
Stephen Buxton: In addition to innovative features such as Bitemporal and Semantics, and features that make MarkLogic more widely accessible in the developer community, there are other updates in Marklogic 8 that make it easier to administer and manage. For example, Incremental Backup, another feature added in MarkLogic 8, allows DBAs to perform backups faster while using less storage.
With MarkLogic 8, you can have multiple daily incremental backups with only a minimal impact on database performance. This feature is one worth highlighting because it will help make DBAs live much easier, and will save an organization time and money.
It’s just another example of MarkLogic’s continuing dedication to being an enterprise NoSQL database that is more powerful, agile, and trusted than anything else.
Stephen Buxton is Senior Director of Product Management for Search and Semantics at MarkLogic, where he has been a member of the Products team for 8 years. Stephen focuses on bringing a rich semantic search experience to users of the MarkLogic NoSQL database, document store, and triple store. Before joining MarkLogic, Stephen was Director of Product Management for Text and XML at Oracle Corporation.
–On making information accessible. Interview with David Leeming. ODBMS Industry Watch, July 30, 2014
Follow ODBMS.org on Twitter: @odbmsorg
“We were looking for solutions which provided the data integrity guarantees we needed, provided clustering tools to ease operational complexity, and were able to handle our data size and the read/write throughput we required.”–John Allison
I have interviewed John Allison, CTO and founder of Customer.io, a start up company in Portland, Oregon.
Q1. What is the business of Customer.io ?
John Allison: We help our customers send timely, targeted messages based on user activity on their website or mobile app. We achieve this by collecting analytical data, providing real-time segmentation, and allowing our customers to define rules to trigger messages at different points in their interactions with a user.
Q2. How large are the data sets you analyze?
John Allison: We’ve collected 6 terabytes of analytical event data for over 55 million unique users across our platform. Due to it’s nature, this data continues to grow and grows faster as we collect data for more and more users.
Q3. What are the main business and technical challenges you are currently facing?
John Allison: As we continue to grow our business, we need to ensure the technical side of our service can easily scale out to support new customers who want to use our product.
Q4. Why did you replace your existing underlying database architecture supporting your “MVP” product ? What were the main technical problems you encountered?
John Allison: As our data set grew in size to the point where we couldn’t realistically manage it all on a small number of servers, we began looking for alternatives which would allow us to continue providing our service in a larger, more distributed way.
Q5. How did you evaluate the alternatives?
John Allison: We evaluated many options and found that most didn’t live up to the availability or consistency guarantees they promised when run over a cluster of servers. We were looking for solutions which provided the data integrity guarantees we needed, provided clustering tools to ease operational complexity, and were able to handle our data size and the read/write throughput we required.
Q6. How is the new solution looking like?
John Allison: We’ve taken more of a polyglot approach to storing our data. We are consolidating on three main clustered databases:
1) FoundationDB – Data where distributed transactions and consistency guarantees are most important.
2) Riak – Large amounts of immutable data where availability is more important.
3) ElasticSearch – Indexing data for ad-hoc querying.
All three have built in tools for expanding and administrating a cluster, provide fault-tolerance and increased reliability in the face of server faults, and each provides us with unique ways to access our data.
Q7. What experience do you have with this new database architecture until now? Do you have any measurable results you can share with us?
John Allison: Embracing a distributed architecture and storing data in the right database for a given use-case has led to less time worrying about operations, increased reliability of our service as a whole, and the ability to scale out all parts of our infrastructure to increase our platform’s capacity.
Q8. Moving forward, what are your plans for the next implementation of your product?
John Allison: Continuing to improve our product in order to provide the most value we can for our customers.
John Allison is the CTO and founder of Customer.io, a startup focused on making it easy to build, manage, and measure automatic customer retention emails. Prior to that he was the head of engineering at Challengepost.com. He is a world traveler, Golfer, and an Arkansas Razorback fan.
We have published several new experts articles on Big Data and Analytics in ODBMS.org.
– On Mobile Data Management. Interview with Bob Wiederhold. ODBMS Industry Watch, 2014-11-18.
–Big Data Management at American Express. Interview with Sastry Durvasula and Kevin Murray. ODBMS Industry Watch, 2014-10-12
Follow ODBMS.org on Twitter: @odbmsorg