“Today, we’re storing and processing tens of petabytes of data on a daily basis, which poses the big challenge in building a highly reliable and scalable data infrastructure.”–Krishna Gade.
I have interviewed Krishna Gade, Engineering Manager on the Data team at Pinterest.
Q1. What are the main challenges you are currently facing when dealing with data at Pinterest?
Krishna Gade: Pinterest is a data product and a data-driven company. Most of our Pinner-facing features like recommendations, search and Related Pins are created by processing large amounts of data every day. Added to this, we use data to derive insights and make decisions on products and features to build and ship. As Pinterest usage grows, the number of Pinners, Pins and the related metadata are growing rapidly. Today, we’re storing and processing tens of petabytes of data on a daily basis, which poses the big challenge in building a highly reliable and scalable data infrastructure.
On the product side, we’re curating a unique dataset we call the ‘interest graph’ which captures the relationships between Pinners, Pins, boards (collections of Pins) and topic categories. As Pins are visual bookmarks of web pages saved by our Pinners, we can have the same web page Pinned many different times. One of the problems we try to solve is to collate all the Pins that belong to the same web page and aggregate all the metadata associated with them.
Visual discovery is an important feature in our product. When you click on a Pin we need to show you visually related Pins. In order to do this we extract features from the Pin image and apply sophisticated deep learning techniques to suggest Pins related to the original. There is a need to build scalable infrastructure and algorithms to mine and extract value from this data and apply to our features like search, recommendations etc.
Q2. You wrote in one of your blog posts that “data-driven decision making is in your company DNA”. Could please elaborate and explain what do you mean with that?
Krishna Gade: It starts from the top. Our senior leadership is constantly looking for insights from data to make critical decisions. Every day, we look at the various product metrics computed by our daily pipelines to measure how the numerous product features are doing. Every change to our product is first tested with a small fraction of Pinners as an A/B experiment, and at any given time we’re running hundreds of these A/B experiments. Over time data-driven decision making has become an integral part of our culture.
Q3. Specifically, what do you use Real-time analytics for at Pinterest?
Krishna Gade: We build batch pipelines extensively throughout the company to process billions of Pins and the activity on them. These pipelines allow us to process vast amounts of historic data very efficiently and tune and personalize features like search, recommendations, home feed etc. However these pipelines don’t capture the activity happening currently – new users signing up, millions of repins, clicks and searches. If we only rely on batch pipelines, we won’t know much about a new user, Pin or trend for a day or two. We use real-time analytics to bridge this gap.
Our real-time data pipelines process user activity stream that includes various actions taken by the Pinner (repins, searches, clicks, etc.) as they happen on the site, compute signals for Pinners and Pins in near real-time and make these available back to our applications to customize and personalize our products.
Q4 Could you pls give us an overview of the data platforms you use at Pinterest?
Krishna Gade: We’ve used existing open-source technologies and also built custom data infrastructure to collect, process and store our data. We built a logging agent Singer, deployed on all of our web servers that’s constantly pumping log data into Kafka, which we use as a log transport system. After the logs reach Kafka, they’re copied into Amazon S3 by our custom log persistence service called Secor. We built Secor to ensure 0-data loss and overcome the weak eventual consistency model of S3.
After this point, our self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing. All our large scale batch pipelines run on Hadoop, which is the core data infrastructure we depend on for improving and observing our product. Our engineers use either Hive or Cascading to build the data pipelines, which are managed by Pinball – a flexible workflow management system we built. More recently, we’ve started using Spark to support our machine learning use-cases.
Q5. You have built a real-time data pipeline to ingest data into MemSQL using Spark Streaming. Why?
Krishna Gade: As of today, most of our analytics happens in the batch processing world. All the business metrics we compute are powered by the nightly workflows running on Hadoop. In the future our goal is to be able to consume real-time insights to move quickly and make product and business decisions faster. A key piece of infrastructure missing for us to achieve this goal was a real-time analytics database that can support SQL.
We wanted to experiment with a real-time analytics database like MemSQL to see how it works for our needs. As part of this experiment, we built a demo pipeline to ingest all our repin activity stream into MemSQL and built a visualization to show the repins coming from the various cities in the U.S.
Q6. Could you pls give us some detail how is it implemented?
Krishna Gade: As Pinners interact with the product, Singer agents hosted on our web servers are constantly writing the activity data to Kafka. The data in Kafka is consumed by a Spark streaming job. In this job, each Pin is filtered and then enriched by adding geolocation and Pin category information. The enriched data is then persisted to MemSQL using MemSQL’s spark connector and is made available for query serving. The goal of this prototype was to test if MemSQL could enable our analysts to use familiar SQL to explore the real-time data and derive interesting insights.
Q7. Why did you choose MemSQL and Spark for this? What were the alternatives?
Krishna Gade: I led the Storm engineering team at Twitter, and we were able to scale the technology for hundreds of applications there. During that time I was able to experience both good and bad aspects of Storm.
When I came to Pinterest, I saw that we were beginning to use Storm but mostly for use-cases like computing the success rate and latency stats for the site. More recently we built an event counting service using Storm and HBase for all of our Pin and user activity. In the long run, we think it would be great to consolidate our data infrastructure to a fewer set of technologies. Since we’re already using Spark for machine learning, we thought of exploring its streaming capabilities. This was the main motivation behind using Spark for this project.
As for MemSQL, we were looking for a relational database that can run SQL queries on streaming data that would not only simplify our pipeline code but would give our analysts a familiar interface (SQL) to ask questions on this new data source. Another attractive feature about MemSQL is that it can also be used for the OLTP use case, so we can potentially have the same pipeline enabling both product insights and user-facing features. Apart from MemSQL, we’re also looking at alternatives like VoltDB and Apache Phoenix. Since we already use HBase as a distributed key-value store for a number of use-cases, Apache Phoenix which is nothing but a SQL layer on top of HBase is interesting to us.
Q8. What are the lessons learned so far in using such real-time data pipeline?
Krishna Gade: It’s early days for the Spark + MemSQL real-time data pipeline, so we’re still learning about the pipeline and ingesting more and more data. Our hope is that in the next few weeks we can scale this pipeline to handle hundreds of thousands of events per second and have our analysts query them in real-time using SQL.
Q9. What are your plans and goals for this year?
Krishna Gade: On the platform side, our plan to is to scale real-time analytics in a big way in Pinterest. We want to be able to refresh our internal company metrics, signals into product features at the granularity of seconds instead of hours. We’re also working on scaling our Hadoop infrastructure especially looking into preventing S3 eventual consistency from disrupting the stability of our pipelines. This year should also see more open-sourcing from us. We started the year by open-sourcing Pinball, our workflow manager for Hadoop jobs. We plan to open-source Singer our logging agent sometime soon.
One the product side, one of our big goals is to scale our self-serve ads product and grow our international user-base. We’re focusing especially on markets like Japan and Europe to grow our user-base and get more local content into our index.
Qx. Anything else you wish to add?
Krishna Gade: For those who are interested in more information, we share latest from the engineering team on our Engineering blog. You can follow along with the blog, as well as updates on our Facebook Page. Thanks a lot for the opportunity to talk about Pinterest engineering and some of the data infrastructure challenges.
Krishna Gade is the engineering manager for the data team at Pinterest. His team builds core data infrastructure to enable data driven products and insights for Pinterest. They work on some of the cutting edge big data technologies like Kafka, Hadoop, Spark, Redshift etc. Before Pinterest, Krishna was at Twitter and Microsoft building large scale search and data platforms.
-Singer, Pinterest’s Logging Infrastructure (LINK to SlideShares)
-Introducing Pinterest Secor (LINK to Pinterest engineering blog)
-MemSQL’s spark connector (memsql/memsql-spark-connector GitHub)
Follow ODBMS.org on Twitter: @odbmsorg
“Predictive analytics is a market which has been lagging the growth of big data – full of tools developed twenty or more years ago which simply weren’t built with today’s challenges in mind.”–Walter Maguire and Indrajit Roy
HP announced HP Distributed R. I wanted to learn more about it, and I have interviewed Walter Maguire, Chief Field Technologist with the HP Big Data Group,and Indrajit Roy, principal researcher at HP, who provided the answers with the assistance of Malu G. Castellanos, manager and technical contributor in the Vertica group of Hewlett Packard.
Q1. HP announced HP Distributed R. What is the difference with the standard R?
Maguire, Roy: R is a very popular statistical analysis tool. But it was conceived before the era of Big Data. It is single threaded and cannot analyze massive datasets. HP Distributed R brings scalability and high performance to R users. Distributed R is not a competing version of R. Rather, it is an open source package that can be installed on vanilla R. Once installed, R users can leverage the pre-built distributed algorithms and the Distributed R API to benefit from cluster computing and dramatically expand the scale of the data they are able to analyze.
Q2. How does HP Distributed R work?
Maguire, Roy: HP Distributed R has three components:
(1) an open source distributed runtime that executes R functions,
(2) a fast, parallel data loader to ingest data from different sources such as the Vertica database, and
(3) a mechanism to deploy the model in the Vertica database.
The distributed runtime is the core of HP Distributed R.
It starts multiple R workers on the cluster, breaks the user’s program into multiple independent tasks, and executes them in parallel on cluster. The runtime hides much of the internal data communication. For example, the user does not need to know how many machines make up the cluster and where data resides in the cluster. In essence, it allows any R algorithm which has been ported to use distributed R to act like a massively parallel system.
Q3. Could you tell us some details on how users write Distributed R programs to benefit from scalability and high-performance?
Maguire, Roy: A programmer can use HP Distributed R’s API to write distributed applications. The API consists of two types of language constructs. First, the API provides distributed data-structures. These are really distributed versions of R’s common data structures such as array, data.frame, and list. As an example, distributed arrays can store 100s of gigabytes of data in-memory and across a cluster. Second, the API also provides a way for users to express parallel tasks on distributed data structures. While R users can write their own custom distributed applications using this API, we expect most R users to be interested in built-in algorithms. Just like R has built-in packages such as kmeans for clustering and glm for regression, HP Distributed R provides distributed versions of common clustering, classification, and graph algorithms.
Q4. R has already many packages that provide parallelism constructs. How do they fit into Distributed R?
Maguire, Roy: Yes, R has a number of open source parallel packages. Unfortunately, none of the packages can handle hundreds or thousands of gigabytes of data or has built-in distributed data structures and computational algorithms. HP Distributed R fills that functionality gap, along with enterprise support – which is critical for customers before they deploy R in production systems.
Also, it’s worth noting that using distributed R doesn’t prevent an R programmer from using their current libraries in their current environment. Those libraries just won’t gain the scale and performance benefits of distributed R.
Q5. Why is there a need to streamline language constructs in order to move R forward in the era of Big Data?
Maguire, Roy: The open source community has done a tremendous job of advancing R—different algorithms, thousands of packages, and a great user community.
However, in the case of parallelism and Big Data there is a confusing mix of R extensions. These packages have overlapping functionality, in many cases completely different syntax, and none of them solve all the issues users face with Big Data. We need to ensure that future R contributors can use a standard set of interfaces and write applications that are portable across different backend packages. This is not just our concern, but something that members of R-core and other companies are interested in as well. Our goal is to help the open source community streamline some of the language constructs so they can spend more time answering analytic questions and less time trying to make sense of the different R extensions.
Q6. What are in your opinion the strengths and weaknesses of the current R parallelism constructs?
Maguire, Roy: Some packages such as “parallel” are very useful. In the case of “parallel”, it is accessible to most R users, already ships with R, and it is easy to express embarrassingly parallel applications (those in which individual tasks don’t need to coordinate with each other). Still, parallel and other packages lack concepts such as distributed data-structures which can provide the much needed performance on massive data. Additionally, it is not clear if the infrastructure implementing existing parallel constructs have been tested on large, multi-gigabyte data.
Q7. When MPI and R wrappers around MPI are a good option?
Maguire, Roy: MPI is a powerful tool. It is widely used in the scientific and high performance computing domain.
If you have an existing MPI application and want to expose it to R users, the right thing is to make it available thought R wrappers. It does not make sense to rewrite these optimized scientific applications in R or any other language.
Q8. Why for in-memory processing, adding some form of distributed objects in R can potentially improve performance?
Maguire, Roy: In-memory processing represents a big change moving forward. The key idea is to remove bottlenecks such as the disk which slows down applications. In HP Distributed R, distributed objects provide a way to store and manipulate data in-memory. Without these distributed objects, data on worker nodes will be ephemeral and users will not be able to reference remote data. Worse, there will be performance issues. For example, many machine learning applications are iterative and need to execute tasks for multiple rounds. Without the concept of distributed objects, applications would end up re-broadcasting data to remote servers in each round. This results in a lot of data movement and very poor performance. Incidentally, this is a good example of why we undertook Distributed R in the first place. Implementing the bare bones of a parallel application is relatively straightforward, but there are thousands or tens of thousands of edge cases which arise once said application is in use due to the nature of distributed processing.
This is when the value of a cohesive parallel framework like Distributed R becomes very apparent.
Q9. Do you think that by using simple parallelism constructs, such as lapply, that operate on distributed data structures, may make it easier to program in R?
Maguire, Roy: Yes, we need to ensure that R users have a simple API to express parallelism. Implementing machine learning algorithms requires deep knowledge. Couple it with parallelism, and you are left with a very small set of people who can really write such applications. To ensure that R users continue to contribute, we need an API which is familiar to current R users. Constructs from the apply() family are a good choice. In fact we are exploring these kind of APIs with members of R-core.
Q10. R is an open-source software project. What about HP Distributed R?
Maguire, Roy: Just like R, HP Distributed R is a GPL licensed open source project. Our code is available on GitHub and we try to release a new version every few months. We provide enterprise support for customers who need it. If you have HP Vertica enterprise edition you will see additional benefits of integrating Vertica with Distributed R.
For example, you can build a machine learning model in Distributed R, and then deploy it in Vertica to score data real time in an analytic application – something many of our customers need.
Qx Anything you with to add?
Maguire, Roy: Predictive analytics is a market which has been lagging the growth of big data – full of tools developed twenty or more years ago which simply weren’t built with today’s challenges in mind.
With HP Distributed R we are not only providing users with scalable and high performance solutions, but also making a difference in the open source community. We look forward to nurturing contributors who can straddle the world of data science and distributed systems.
A core tenet of our big data strategy is to create a positive developer experience, and we are very focused on technology development and fulfillment choices which support that goal.
Walter Maguire has twenty-eight years of experience in analytics and data technologies.
He practiced data science before it had a name, worked with big data when “big” meant a megabyte, and has been part of the movement which has brought data management and analytic technologies from back-office, skunk works operations to core competencies for the largest companies in the world. He has worked as a practitioner as well as a vendor, working with analytics technologies ranging from SAS and R to data technologies such as Hadoop, RDBMS and MPP databases. Today, as Chief Field Technologist with the HP Big Data Group, Walt has the unique pleasure of addressing strategic customer needs with Haven, the HP big data platform.
Indrajit Roy is a principal researcher at HP. His research focusses on next generation distributed systems that solve the challenges of Big Data. Indrajit’s pet project is HP Distributed R, a new open source product that helps data scientists. Indrajit has multiple patents, publications and a best paper award. In the past he worked on computer security and parallel programming. Indrajit received his PhD in computer science from the University of Texas at Austin.
- Download HP Distributed R V1.0
- HP Distributed R Data Sheet
- Online Documentation
- HP Distributed R Source on GitHub
- Big Data Predictive Analytics Survey Infographic
- Distributed Machine Learning and Graph Processing with Sparse Matrices
- Using R for Iterative and Incremental Processing
Follow ODBMS.org on Twitter: @odbmsorg
“Some believe that the Gaia data will revolutionize astronomy! Only time will tell if that is true, but it is clear that it will be a treasure trove for astronomers for decades to come.”–Dr. Uwe Lammers.
“The Gaia mission is considered to be the largest data processing challenge in astronomy.”–Vik Nagjee
In December of 2013, the European Space Agency (ESA) launched a satellite called Gaia on a five-year mission to map the galaxy and learn about its past.
The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”.
I recall here the Objectives of the Gaia Project (source ESA Web site):
“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”
I have been following the GAIA mission since 2011, and I have reported it in two interviews until now. This is the third interview of the series, the first one after the launch.
The interview is with Dr. Uwe Lammers, Gaia Science Operations Manager at the European Space Agency, and Vik Nagjee, Product Manager for Data Platforms at InterSystems.
Q1. Could you please elaborate in some detail what is the goal and what are the expected results of the Gaia mission?
Uwe Lammers: We are trying to construct the most consistent, most complete and most accurate astronomical catalog ever done. Completeness means to observe all objects in the sky that are brighter than a so-called magnitude limit of 20. These are mostly stars in our Milky Way up to 1.5 billion in number. In addition, we expect to observe as many as 10 million other galaxies, hundreds of thousands of celestial bodies in our solar system (mostly asteroids), tens of thousands of new exo-planets, and more. Some believe that the Gaia data will revolutionize astronomy! Only time will tell if that is true, but it is clear that it will be a treasure trove for astronomers for decades to come.
Vik Nagjee: The data collected from Gaia will ultimately result in a three-dimensional map of the Milky Way, plotting over a billion celestial objects at a distance of up to 30,000 light years. This will reveal the composition, formation and evolution of the Galaxy, and will enable the testing of Albert Einstein’s Theory of Relativity, the space-time continuum, and gravitational waves, among other things. As such, the Gaia mission is considered to be the largest data processing challenge in astronomy.
Orbiting the Lagrange 2 (L2) point, a fixed spot 1.5 million kilometers from Earth, Gaia will measure the position, movement, and brightness of more than a billion celestial objects, looking at each one an average of 70 times over the course of five years. Gaia’s measurements will be much more complete, powerful, and accurate than anything that has been done before. ESA scientists estimate that Gaia will find hundreds of thousands of new celestial objects, including extra-solar planets, and the failed stars known as brown dwarfs. In addition, because Gaia can so accurately measure the position and movement of the stars, it will provide valuable information about the galaxy’s past – and future – evolution.
Read more about the Gaia mission here.
Q2. What is the size and structure of the information you analysed so far?
Uwe Lammers: From the start of the nominal mission on 25 July until today, we have received about 13 terabytes of compressed binary telemetry from the satellite. The daily pipeline running here at the Science Operations Centre (SOC) has processed all this and generated about 48 TB of higher-level data products for downstream systems.
At the end of the mission, the Main Database (MDB) is expected to hold more than 1 petabyte of data. The structure of the data is complex and this is one of the main challenges of the project. Our data model contains about 1,500 tables with thousands of fields in total, and many inter-dependencies. The final catalog to be released sometime around 2020 will have a simpler structure, and there will be ways to access and work with it in a convenient form, of course.
Q3. Since the launch of Gaia in December 2013, what intermediate results did you obtain by analysing the data received so far?
Uwe Lammers: Last year we found our first supernova (exploding star) with the prototype of the so-called Science Alert pipeline. When this system is fully operational, we expect to find several of these per day. The recent detection of a micro-lensing event was another nice demonstration of Gaia’s capabilities.
Q4. Did you find out any unexpected information and/or confirmation of theories by analysing the data generated by Gaia so far?
Uwe Lammers: It is still too early in the mission to prove or disprove established astronomical theories. For that we need to collect more data and do much more processing. The daily SOC pipeline is only one, the first part, of a large distributed system that involves five other Data Processing Centres (DPCs), each running complex scientific algorithms on the data. The whole system is designed to improve the results iteratively, step by step, until the final accuracy has been reached. However, there will certainly be intermediate results. One simple example of an unexpected early finding is that Gaia gets hit by micro-meteoroids much more often than pre-launch estimates predicted.
Q5. Could you please explain at some high level the Gaia’s data pipeline?
Uwe Lammers: Hmmm, that’s not easy to do in a few words. The daily pipeline at the SOC converts compact binary telemetry of the satellite into higher level products for the downstream systems at the SOC and the other processing centres. This sounds simple, but it is not – mainly because of the complex dependencies and the fact that data does not arrive from the satellite in strict time order. The output of the daily pipeline is only the start as mentioned above.
From the SOC, data gets sent out daily to the other DPCs, which perform more specialized processing. After a number of months we declare the current data segment as closed, receive the outputs from the other DPCs back at the SOC, and integrate all into a coherent next version of the MDB. The creation of it marks the end of the current iteration and the start of a new one. This cyclic processing will go on for as many iterations as needed to converge to a final result.
An important key process is the Astrometric Global Iterative Solution (AGIS), which will give us the astrometric part of the catalog. As the name suggests, it is in itself an iterative process and we run it likewise here at the SOC.
Vik Nagjee: To add on to what Dr. Lammers describes, Gaia data processing is handled by a pan-European collaboration, the Gaia Data Processing and Analysis Consortium (DPAC), and consists of about 450 scientists and engineers from across Europe. The DPAC is organized into nine Coordination Units (CUs); each CU is responsible for a specific portion of the Gaia data processing challenge.
One of the CUs – CU3: Core Processing – is responsible for unpacking, decompressing, and processing the science data retrieved from the satellite to provide rapid monitoring and feedback of the spacecraft and payload performances at the ultra-precise accuracy levels targeted by the mission. In other words, CU3 is responsible for ensuring the accuracy of the data collected by Gaia, as it is being collected, to ensure the accuracy of the eventual 3-D catalog of the Milky Way.
Over its lifetime, Gaia will generate somewhere between 500,000 to 1 million GB of data. On an average day, approximately 50 million objects will “transit” Gaia’s field of view, resulting in about 285 GB of data. When Gaia is surveying a densely populated portion of the galaxy, the daily amount could be 7 to 10 times as much, climbing to over 2,000 GB of data in a day.
There is an eight-hour window of time each day when raw data from Gaia is downloaded to one of three ground stations.
The telemetry is sent to the European Space Astronomy Centre (ESAC) in Spain – the home of CU3: Core Processing – where the data is ingested and staged.
The initial data treatment converts the data into the complex astrometric data models required for further computation. These astrometric objects are then sent to various other Computational Units, each of which is responsible for looking at different aspects of the data. Eventually the processed data will be combined into a comprehensive catalog that will be made available to astronomers around the world.
In addition to performing the initial data treatment, ESAC also processes the resulting astrometric data with some complex algorithms to take a “first-look” at the data, making sure that Gaia is operating correctly and sending back good information. This processing occurs on the Initial Data Treatment / First Look (IDT/FL) Database; the data platform for the IDT/FL database is InterSystems Caché.
Q6. Observations made and conclusions drawn are only as good as the data that supports them. How do you evaluate the “quality” of the data you receive? and how do you discard the “noise” from the valuable information?
Uwe Lammers: A very good question! If you refer to the final catalog, this is a non-trivial problem and a whole dedicated group of people is working on it. The main issue is, of course, that we do not know the “true” values as in simulations. We work with models, e.g., models of the stars’ positions and the satellite orientation. With those we can predict the observations, and the difference between the predicted and the observed values tells us how well our models represent reality. We can also do consistency checks. For instance, we do two runs of AGIS, one with only the observations from odd months and another one from even months, and both must give similar results. But we will also make use of external astronomical knowledge to validate results, e.g., known distances to particular stars. For distinguishing “noise” from “signal,” we have implemented robust outlier rejection schemes. The quality of the data coming directly from the satellite and from the daily pipeline is assessed with a special system called First Look running also at the SOC.
Vik Nagjee: The CU3: Core Processing Unit is responsible for ensuring the accuracy of the data being collected by Gaia, as it is being collected, so as to ensure the accuracy of the eventual 3-D catalog of the Milky Way.
InterSystems Caché is the data platform used by CU3 to quickly determine that Gaia is working properly and that the data being downloaded is trustworthy. Caché was chosen for this task because of its proven ability to rapidly ingest large amounts of data, populate extremely complex astrometric data models, and instantly make the data available for just-in-time analytics using SQL, NoSQL, and object paradigms.
One million GB of data easily qualifies as Big Data. What makes InterSystems Caché unique is not so much its ability to handle very large quantities of data, but its abilities to provide just-in-time analytics on just the right data.
We call this “Big Slice” — which is where analytics is performed just-in-time for a focused result.
A good analogy is how customer service benefits from occasional Big Data analytics. Breakthrough customer service comes from improving service at the point of service, one customer at a time, based on just-in-time processing of a Big Slice – the data relevant to the customer and her interactions. Back to the Gaia mission: at the conclusion of five years of data collection, a true Big Data exercise will plot the solar map. Yet, frequently ensuring data accuracy is an example of the increasing strategic need for our “Big Slice” concept.
Q7. What kind of databases and analytics tools do you use for the Gaia`s data pipeline?
Uwe Lammers: At the SOC all systems use InterSystems’ Caché database. Despite some initial hiccups, Cache´ has proved to be a good choice for us. For analytics we use a few popular generic astronomical tools (e.g., topcat), but most are custom-made and specific to Gaia data. All DPCs had originally used relational databases, but some have migrated to Apache’s Hadoop.
Q8. Specifically for the Initial Data Treatment/First Look (IDT/FL) database, what are the main data management challenges you have?
Uwe Lammers: The biggest challenge is clearly the data volumes and the steady incoming stream that will not stop for the next five years. The satellite sends us 40-100 GB of compressed raw data every day, which the daily pipeline needs to process and store the output in near real time, as otherwise we quickly accumulate backlogs.
This means all components, the hardware, databases, and software, have to run and work robustly more or less around the clock. The IDTFL database grows daily by a few hundred gigabytes, but not all data has to be kept forever. There is an automatic cleanup process running that deletes data that falls out of chosen retention periods. Keeping all this machinery running around the clock is tough!
Vik Nagjee: Gaia’s data pipeline imposes some rather stringent requirements on the data platform used for the Initial Data Treatment/First Look (IDT/FL) database. The technology must be capable of ingesting a large amount of data and converting it into complex objects very quickly. In addition, the data needs to be immediately accessible for just-in-time analytics using SQL.
ESAC initially attempted to use traditional relational technology for the IDT/FL database, but soon discovered that a traditional RDBMS couldn’t ingest discrete objects quickly enough. To achieve the required insert rate, the data would have to be ingested as large BLOBs of approximately 50,000 objects, which would make further analysis extremely difficult. In particular, the first look process, which requires rapid, just-in-time analytics of the discrete astrometric data, would be untenable. Another drawback to using traditional relational technology, in addition to the typical performance and scalability challenges, was the high cost of the hardware that would be needed.
Since traditional RDBMS technology couldn’t meet the stringent demands imposed by CU3, ESAC decided to use InterSystems Caché.
Q9. How did you solve such challenges and what lessons did you learn until now?
Uwe Lammers: I have a good team of talented and very motivated people and this is certainly one aspect.
In case of problems we are also totally dependent on quick response times from the hardware vendors, the software developers and InterSystems. This has worked well in the past, and InterSystems’ excellent support in all cases where the database was involved is much appreciated. As far as the software is concerned, the clear lesson is that rigorous validation testing is essential – the more the better. There can never be too much. As a general lesson, one of my favorite quotes from Einstein captures it well: “Everything should be made as simple as possible, but no simpler.”
Q10. What is the usefulness of the CU3’s IDT/FL database for the Gaia’s mission so far?
Uwe Lammers: It is indispensable. It is the central working repository of all input/output data for the daily pipeline including the important health monitoring of the satellite.
Vik Nagjee: The usefulness of CU3’s IDT/FL database was proven early in Gaia’s mission. During the commissioning period for the satellite, an initial look at the data it was generating showed that extraneous light was being gathered. If the situation couldn’t be corrected, the extra light could significantly degrade Gaia’s ability to see and measure faint objects.
It was hypothesized that water vapor from the satellite outgassed in the vacuum of space, and refroze on Gaia’s mirrors, refracting light into its focal plane. Although this phenomenon was anticipated (and the mirrors equipped with heaters for that very reason), the amount of ice deposited was more than expected. Heating the mirrors melted the ice and solved the problem.
Scientists continue to rely on the IDT/FL database to provide just-in-time feedback about the efficacy and reliability of the data they receive from Gaia.
Qx Anything else you wish to add?
Uwe Lammers: Gaia is by far the most interesting and challenging project I have every worked on.
It is fascinating to see science, technology, and a large diverse group of people working together trying to create something truly great and lasting. Please all stay tuned for exciting results from Gaia to come!
Vik Nagjee: As Dr. Lammers said, Gaia is truly one of the most interesting and challenging computing projects of all time. I’m honored to have been a contributor to this project, and cannot wait to see the results from the Gaia catalog. Here’s to unraveling the chemical and dynamical history of our Galaxy!
Dr. Uwe Lammers, Gaia Science Operations Manager at the European Space Agency.
Uwe Lammers has a PhD in Physics and a degree in Computer Science and has been working for the European Space Agency on a number of space science mission for the past 20 years. After being involved in the X-ray missions
EXOSAT, BeppoSAX, and XMM-Newton, Gaia caught his attention in 2004.
As of late 2005, together with William O’Mullane, he built up the Gaia Science Operations Centre (SOC) at ESAC near Madrid. From early 2006 to mid-2014 he was in charge of the development of AGIS and is now leading the SOC as Gaia Science Operations Manager.
Vik Nagjee is a Product Manager for Data Platforms at InterSystems.
He’s responsible for Performance and Scalability of InterSystems Caché, and spends the rest of his time helping people (prospects, application partners, end users, etc.) find perfect solutions for their data, processing, and system architecture needs.
Follow ODBMS.org on Twitter: @odbmsorg
“In normal English usage the word resilience is taken to mean the power to resume original shape after compression; in the context of data base management the term data base resilience is defined as the ability to return to a previous state after the occurrence of some event or action which may have changed that state.
from “P. A Dearnley, School of Computing Studies, University of East Anglia, Norwich NR4 7TJ , 1975 “
On the topic database resilience, I have interviewed Seth Proctor, Chief Technology Officer at NuoDB.
Q1. When is a database truly resilient?
Seth Proctor: That is a great question, and the quotation above is a good place to start. In general, resiliency is about flexibility. It’s kind of the view that you should bend but not break. Individual failures (server crashes, disk wear, tripping over power cables) are inevitable but don’t have to result in systemic outages.
In some cases that means reacting to failure in a way that’s non-disruptive to the overall system.
The redundant networks in modern airplanes are a great example of this model. Other systems take a deeper view, watching global state to proactively re-route activity or replace components that may be failing. This is the model that keeps modern telecom networks running reliably. There are many views applied in the database world, but to me a resilient database is one that can react automatically to likely or active failures so that applications continue operating with full access to their data even as failures occur.
Q2. Is database resilience the same as disaster recovery?
Seth Proctor: I don’t believe it is. In traditional systems there is a primary site where the database is “active” and updates are replicated from there to other sites. In the case of failure to the primary site, one of the replicas can take over. Maintaining that replica (or replicas) is usually the key part of Disaster Recovery.
Sometimes that replica is missing the latest changes, and usually the act of “failing over” to a replica involves some window where the database is unavailable. This leads to operational terms like “hot stand-by” where failing over is faster but still delayed, complicated and failure-prone.
True resiliency, in my opinion, comes from systems that are designed to always be available even as some components fail. Reacting to failure efficiently is a key requirement, as is survival in case of complete site loss, so replicating data to multiple locations is critical to resiliency. At a minimum, however, a resilient data management solution cannot lose data (sort of “primum non nocere” for servers) and must be able to provide access to all data even as servers fail. Typical Disaster Recovery solutions on their own are not sufficient. A resilient solution should also be able to continue operations in the face of expected failures: hardware and software upgrades, network updates and service migration.
This is especially true as we push out to hybrid cloud deployments.
Q3. What are the database resilience requirements and challenges, especially in this era of Big Data?
Seth Proctor: There is no one set of requirements since each application has different goals with different resiliency needs. Big Data is often more about speeds and volume while in the operational space correctness, latency and availability are key. For instance, if you’re handling high-value bank transactions you have different needs than something doing weekly trend-analysis on Internet memes. The great thing about “the cloud” is the democratization of features and the new systems that have evolved around scale-out architectures. Things like transactional consistency were originally designed to make failures simpler and systems more resilient; as consistent data solutions scale out in cloud models it’s simpler to make any application resilient without sacrificing performance or increasing complexity.
That said, I look for a couple of key criteria when designing with resiliency in mind. The first is a distributed architecture, the foundation for any system to survive individual failure but remain globally available.
Ideally this provides a model where an application can continue operating even as arbitrary components fail. Second is the need for simple provisioning & monitoring. Without this, it’s hard to react to failures in an automatic or efficient fashion, and it’s almost impossible to orchestrate normal upgrade processes without down-time. Finally, a database needs to have a clear model for how the same data is kept in multiple locations and what the failure modes are that could result in any loss. These requirements also highlight a key challenge: what I’ve just described are what we expect from cloud infrastructure, but are pushing the limits of what most shared-nothing, sharded or vertically-scaled data architectures offer.
Q4. What is the real risk if the database goes offline?
Seth Proctor: Obviously one risk is the ripple effect it has to other services or applications.
When a database fails it can take with it core services, applications or even customers. That can mean lost revenue or opportunity and it almost certainly means disruption across an organization. Depending on how a database goes offline, the risk may also extend to data loss, corruption, or both. Most databases have to trade-off certain elements of latency against guaranteed durability, and it’s on failure that you pay for that choice. Sometimes you can’t even sort out what information was lost. Perhaps most dangerous, modern deployments typically create the illusion of a data management service by using multiple databases for DR, scale-out etc. When a single database goes offline you’re left with a global service in an unknown state with gaps in its capabilities. Orchestrating recovery is often expensive, time-consuming and disruptive to applications.
Q5. How are customers solving the continuous availability problem today?
Seth Proctor: Broadly, database availability is tackled in one of two fashions. The first is by running with many redundant, individual, replicated servers so that any given server can fail or be taken offline for maintenance as needed. Putting aside the complexity of operating so many independent services and the high infrastructure costs, there is no single view of the system. Data is constantly moving between services that weren’t designed with this kind of coordination in mind so you have to pay close attention to latencies, backup strategies and visibility rules for your applications. The other approach is to use a database that has forgone consistency, making a lot of these pieces appear simpler but placing the burden that might be handled by the database on the application instead. In this model each application needs to be written to understand the specifics of the availability model and in exchange has a service designed with redundancy.
Q6. Now that we are in the Cloud era, is there a better way?
Seth Proctor: For many pieces of the stack cloud architectures result in much easier availability models. For the database specifically, however, there are still some challenges. That said, I think there are a few great things we get from the cloud design mentality that are rapidly improving database availability models. The first is an assumption about on-demand resources and simplicity of spinning up servers or storage as needed. That makes reacting to failure so much easier, and much more cost-effective, as long as the database can take advantage of it. Next is the move towards commodity infrastructure. The economics certainly make it easier to run redundantly, but commodity components are likely to fail more frequently. This is pushing systems design, making failure tests critical and generally putting more people into the defensive mind-set that’s needed to build for availability. Finally, of course, cloud architectures have forced all of us to step back and re-think how we build core services, and that’s leading to new tools designed from the start with this point of view. Obviously that’s one of the most basic elements that drives us at NuoDB towards building a new kind of database architecture.
Q7. Can you share methodologies for avoiding single points of failure?
Seth Proctor: For sure! The first thing I’d say is to focus on layering & abstraction.
Failures will happen all the time, at every level, and in ways you never expect. Assume that you won’t test all of them ahead of time and focus on making each little piece of your system clear about how it can fail and what it needs from its peers to be successful. Maybe it’s obvious, but to avoid single points of failure you need components that are independent and able to stand-in for each other. Often that means replicating data at lower-levels and using DNS or load-balancers at a higher-level to have one name or endpoint map to those independent components. Oh, also, decouple your application logic as much as possible from your operational model. I know that goes against some trends, but really, if your application has deep knowledge of how some service is deployed and running it makes it really hard to roll with failures or changes to that service.
Q8. What’s new at NuoDB?
Seth Proctor: There are too many new things to capture it all here!
For anyone who hasn’t looked at us, NuoDB is a relational database build against a fundamentally new, distributed architecture. The result is ACID semantics, support for standard SQL (joins, indexes, etc.) and a logical view of a single database (no sharding or active/passive models) designed for resiliency from the start.
Rather than talk about the details here I’d point people at a white paper (Note of the Editor: registration required) we’ve just published on the topic.
Right now we’re heavily focused on a few key challenges that our enterprise customers need to solve: migrating from vertical scale to cloud architectures, retaining consistency and availability and designing for on-demand scale and hybrid deployments. Really important is the need for global scale, where a database scales to multiple data centers and multiple geographies. That brings with it all kinds of important requirements around latencies, failure, throughput, security and residency. It’s really neat stuff.
Q9- How does it differentiate with respect to other NoSQL and NewSQL databases?
Seth Proctor: The obvious difference to most NoSQL solutions is that NuoDB supports standard SQL, transactional consistency and all the other things you’d associate with an RDBMS.
Also, given our focus on enterprise use-cases, another key difference is the strong baseline with security, backup, analysis etc. In the NewSQL space there are several databases that run in-memory, scale-out and provide some kind of SQL support. Running in-memory often means placing all data in-memory, however, which is expensive and can lead to single points of failure and delays on recovery. Also, there are few that really support the arbitrary SQL that enterprises need. For instance, we have customers running 12-way joins or transactions that last hours and run thousands of statements.
These kinds of general-purpose capabilities are very hard to scale on-demand but they are the requirement for getting client-server architectures into the cloud, which is why we’ve spent so long focused on a new architectural view.
One other key difference is our focus on global operations. There are very few people trying to take a single, logical database and distribute it to multiple geographies without impacting consistency, latency or security.
Qx Anything else you wish to add?
Seth Proctor: Only that this was a great set of questions, and exactly the direction I encourage everyone to think about right now. We’re in a really exciting time between public clouds, new software and amazing capacity from commodity infrastructure. The hard part is stepping back and sorting out all the ways that systems can fail.
Architecting with resiliency as a goal is going to get more commonplace as the right foundational services mature.
Asking yourself what that means, what failures you can tolerate and whether you’re building systems that can grow alongside those core services is the right place to be today. What I love about working in this space today is that concepts like resilient design, until recently a rarefied approach, are accessible to everyone.
Anyone trying to build even the simplest application today should be asking these questions and designing from the start with concepts like resiliency front and center.
Seth Proctor, Chief Technology Officer, NuoDB
Seth has 15+ years of experience in the research, design and implementation of scalable systems. That experience includes work on distributed computing, networks, security, languages, operating systems and databases all of which are integral to NuoDB. His particular focus is on how to make technology scale and how to make users scale effectively with their systems.
Prior to NuoDB Seth worked at Nokia on their private cloud architecture. Before that he was at Sun Microsystems Laboratories and collaborated with several product groups and universities. His previous work includes contributions to the Java security framework, the Solaris operating system and several open source projects, in addition to the design of new distributed security and resource management systems. Seth developed new ways of looking at distributed transactions, caching, resource management and profiling in his contributions to Project Darkstar. Darkstar was a key effort at Sun which provided greater insights into how to distribute databases.
Seth holds eight patents for his cutting edge work across several technical disciplines. He has several additional patents awaiting approval related to achieving greater database efficiency and end-user agility.
- On Big Data and the Internet of Things. Interview with Bill Franks
Source: ODBMS.org | Published on 2015-03-09
- On MarkLogic 8. Interview with Stephen Buxton
Source: ODBMS.org | Published on 2015-02-13
- On Data Curation. Interview with Andy Palmer
Source: ODBMS.org | Published on 2015-01-14
- On Solr and Mahout. Interview with Grant Ingersoll
Source: ODBMS.org | Published on 2015-01-06
Follow ODBMS.org on Twitter: @odbmsorg
“Perhaps the biggest challenge is that the IoT has the potential to generate orders of magnitude more data than any other source in existence today. So, in the world of the IoT we will test the limits of ‘big.’”–Bill Franks
On topics of Data Warehouses, Hadoop, the Internet of Things, and Teradata`s perspective on the world of Big Data, I have interviewed Bill Franks, Chief Analytics Officer for Teradata.
Q1. What is Teradata`s perspective on the world of Big Data?
Bill Franks: Our perspective has not really changed with regard to ‘big data:’ the primary mission of Teradata for decades has been helping organizations utilize and analyze large volumes of data to produce insight for business value. Note that our Teradata database was originally designed exclusively for analytics, then called ‘decision support’ – unlike most other platforms, which were designed for general computing – then later adapted for analytic uses. As a result, the Teradata analytic engine is – and has always been – uniquely architected for large – ‘big data’ – volume and complexity aimed at producing actionable intelligence.
Of course, the amount of data that’s considered ‘big’ and thus a challenge – has changed, and we have a lot of novel data sources in recent times. However, we believe that companies which have always focused on analyzing and acting upon data intelligently can adapt to the new world of big data. After all, big data is just more data and the analysis of big data is still analysis. There are as many similarities as differences from the past.
Teradata has engineered further analytic enhancements over the years to create a diverse portfolio of products, partnerships, and services to allow our customers to continue to get the most from their data assets. The pace of change is very rapid today and we expect that to continue. We believe our strength is in our experience, expertise and our ability to help organizations navigate the changing landscape and continue to derive new, useful insights from their increasingly large and diverse data sources.
Q2. Most data warehousing projects consolidate data from different source systems. What is different in the world of Big Data?
Bill Franks: By definition, if you want to look at two different data sources together, you must either move one set of data to the other or move them both to a 3rd location. If data is truly disparate, you can’t use it effectively. That is what drove data warehousing to prominence. One huge difference between data warehousing practices years back and then today is that previously, all data that was captured in the business world met three criteria almost 100% of the time.
1) It was immensely important; given the cost to capture and store it,
2) The data was well structured, and
3) The data was generated by an organization’s internal business processes.
— Therefore, it was mostly placed in relational databases or on a mainframe since those technologies easily handle that type of data. Data warehousing solved the problem of many structured data platforms being spread out – by consolidating the sources for analytic purposes into a single structured platform.
What is different with big data is that today, the data often violates all of the rules.
1) Much of it is not important, or has not yet been proven to be important,
2) The data is not structured in the classic fashion at the outset (though most can and must be structured for analytical purposes), and
3) The data is often from sources external to an organization.
— As a result, we now have disparate data platforms that each serve different functions. Some focus on one type of data, while others focus on flexibility. However, the downside is that these platforms don’t integrate well and it isn’t as easy to tie everything together. That’s a problem Teradata is working diligently to solve with our Unified Data Architecture – our pioneering version of the visionary Gartner Logical Data Warehouse.
Q3. Will data warehouses become obsolete soon and be replaced by Hadoop?
Bill Franks: Absolutely not. A few years ago, that was a common claim. That claim is rarely heard today. In fact, all of the big Hadoop vendors partner with Teradata.
This is because our data warehousing platforms provide some important things Hadoop does not — just as Hadoop provides some things a data warehouse does not. Each platform has its strengths and weaknesses, but when positioned together, additional value is added. Part of the issue is that people mistake policy decisions for technology limitations.
There is no reason you can’t place untested, raw, unclean data of unknown value on a data warehousing platform; it’s the corporate policies that often forbid it. It is true that once data is critical and is leveraged by many applications and business users, you have to keep some control and consistency over it. This is what a data warehouse does for an organization.
But, that doesn’t mean you can’t experiment with new sources freely using the technology that supports formal data warehouses.
A colleague of mine mentioned a conversation he had with a Hadoop user. That user was boasting about how he could with a single command change the data type of information on Hadoop, for instance, if it would help him more easily solve his next problem. My friend then asked him what would happen to the prior dozen or two processes that were built expecting the data to be in the original data type format. Wouldn’t they all then break? The user had a blank stare for a moment and then realized his error. As you develop more processes, you must implement security, consistency, and controls on the underlying data. This is why data warehousing – as Gartner defines it, is going to be around for a long time.
Q4. With the increased need of tools for combining data together, are we going to see a “federated”- Big Data architecture?
Bill Franks: A form of that is exactly what we are pursuing with Teradata’s Unified Data Architecture. Again – we refer to Gartner’s vision of the “Logical Data Warehouse.” What we are doing is putting in place a layer of architecture that connects multiple disparate data stores. This architecture includes – and connects – relational databases like Teradata and Oracle, discovery platforms like our Teradata Aster offering, Hadoop, and other platforms such as MongoDB. The idea is that we make information available to users about data throughout the ecosystem, not just the data on the platform they are operating from. So, I see a data dictionary that includes a “table” called “Sensor Feed.”
I can see the data elements available and write analytic logic against those elements. However, I don’t need to be aware of whether the data is a database table, or a Hadoop file, or is in MongoDB. Users can simply build analytics instead of worrying about where data resides, how to log on to various systems, and how to move data. We’ll handle that for them.
We are also beginning to push processing across the various platforms to optimize performance. Just like with a ‘table’ versus a ‘view’ in a database, making a process enterprise-ready might require moving data around the architecture permanently. But now, users are free to discover where that is required. And, the technical team behind the curtain can worry about the details just as they do with traditional data warehousing. We are very bullish on our approach and think we are well positioned to maintain our leadership position in the analytics space.
Q5. Teradata made several acquisitions lately. How do the tools that Teradata acquired fit the current Teradata Data Architectural Framework?
Bill Franks: I believe this in general was addressed. However, in addition I would point out that we acquired Revelytix in 2014 to obtain Loom: an open platform for discovering, profiling, preparing, and tracking data lineage for data in Hadoop. Likewise, we acquired Hadapt, which created a big data analytic platform natively integrating SQL with Apache Hadoop. Plus our recent RainStor acquisition strengthens Teradata’s enterprise-grade Hadoop solutions and enables organizations to add archival data store capabilities for their entire enterprise, including data stored in OLTP, data warehouses, and applications.
Q6. What are the key differentiators of the Teradata Database core architecture?
Bill Franks: As I said, the Teradata DW was differentiated from the start – uniquely architected for analytics from day one. However, I would add that Teradata continues to broaden our differentiation: we’ve built the best data orchestration software in the industry (Teradata Unity and QueryGrid). The orchestration software is key – because it enables our customers to choose a file system that they use to store the data in – and the analytics that they apply to that data independently — and marry them together with software.
It helps reduce the complexity of connecting to, accessing, understanding interfaces and getting value from multiple analytical systems. Another differentiator is Teradata Intelligent Memory, introduced two years ago. TIM is the world’s first extended memory technology beyond cache to increase query performance. Users can configure the exact amount of in-memory capability needed for critical workloads – based on temperature – hot or cold data. The list goes on. I would say that our data technology really does focus on how data is best used – and what proficient users need most.
Q7. Is SQL really the right language to handle Big Data Analytics?
Bill Franks: In some cases yes and in some cases no. We want users to be able to utilize whatever language or platform is best for any given task. There are many big data requirements that perfectly fit SQL and many that don’t. The key is enabling scalable access to the data and flexibility in approach. Most people are aware that there is a big effort to add a SQL interface to Hadoop. What most haven’t realized is how far we’ve also come the other direction. For some time, Teradata has allowed C and Java processing directly against our database platforms via User Defined Functions and other similar extensions. We are now also enabling other languages such as R and Python to be executed within a Teradata context. What is possible today is so far beyond what was possible even 5 or 10 years ago.
Q8. How do you see the adoption of Cloud for Analytics?
Bill Franks: We are aggressively rolling out our own cloud offerings across our product suites. Many of our enterprise customers also configure our products as a private cloud behind their firewall. Adoption will be mixed based on the type of data and nature of work being done. Anything involving sensitive data is still typically not allowed outside a firewall. If you think back to the issue raised in a prior question of having to be able to combine data for analytics, you can’t really have some data locked behind a firewall and some data locked outside it. The real driver behind the cloud is that people want flexible, pay on demand access to analysis platforms. We have multiple ways to provide that to our clients, of which our cloud offerings are only one option. We have some other novel pricing and licensing options the help customers get access to the resources they require for analytics.
Q9. What are the most important data challenges posed by the Internet of Things (IoT)?
Bill Franks: Perhaps the biggest challenge is that the IOT has the potential to generate orders of magnitude more data than any other source in existence today.
So, in the world of the IOT we will test the limits of ‘big.’ At the same time, much of the data generated by the IOT will have low value in the short term and no value in the long term. One of the biggest challenges will be determining which pieces of the information generated by a given sensor actually matters to your business and for how long. In the long run, it is likely that only a small fraction of the raw data produced by the IOT will be stored beyond a few moments of immediate usage. For example, why keep the sensor readings that help navigate my car into a tight parking spot? Once I’m safely in the spot, I really don’t ever need to revisit that data again. If I hit a car in front of me, I might make an exception and keep the data so that the cause can be identified.
Q10. Could you mention some successful Big Data projects you have recently completed with customers?
Bill Franks: We are seeing a lot of very interesting analytics come about. We’ve helped health organizations discover genetic patterns associated with disease, we’ve helped manufacturers reduce cost and increase customer satisfaction by building predictive maintenance algorithms, we’ve helped cable providers identify valuable consumer viewing habits.
I could go on and on. A great place to see some of the examples, and even hear from some of the companies and people behind it, is at our website.
Bill Franks is the Chief Analytics Officer for Teradata, where he provides insight on trends in the analytics and big data space and helps clients understand how Teradata and its analytic partners can support their efforts. His focus is to translate complex analytics into terms that business users can understand and work with organizations to implement their analytics effectively. His work has spanned many industries for companies ranging from Fortune 100 companies to small non-profits. Franks also helps determine Teradata’s strategies in the areas of analytics and big data.
Franks is the author of the book Taming The Big Data Tidal Wave (John Wiley & Sons, Inc., April, 2012). In the book, he applies his two decades of experience working with clients on large-scale analytics initiatives to outline what it takes to succeed in today’s world of big data and analytics. The book made Tom Peter’s list of 2014 “Must Read” books and also the Top 10 Most Influential Translated Technology Books list from CSDN in China.
Franks’ second book The Analytics Revolution (John Wiley & Sons, Inc., September, 2014) lays out how to move beyond using analytics to find important insights in data (both big and small) and into operationalizing those insights at scale to truly impact a business.
He is a faculty member of the International Institute for Analytics, founded by leading analytics expert Tom Davenport, and an active speaker who has presented at dozens of events in recent years. His blog, Analytics Matters, addresses the transformation required to make analytics a core component of business decisions.
Franks earned a Bachelor’s degree in Applied Statistics from Virginia Tech and a Master’s degree in Applied Statistics from North Carolina State University. More information is available here: http://www.bill-franks.com.
Follow ODBMS.org on Twitter: @odbmsorg
“Apache Ignite is an incubating Apache project, which provides a high-performance, distributed in-memory data management software layer between various data sources and applications.”–Nikita Ivanov
I have interviewed Nikita Ivanov, founder and CTO of GridGain Systems. Main topic of the interview is the new release of Apache Ignite.
Q1. In your opinion, what are the main differences between an In-Memory Database, an In-Memory Data Grid and an In-Memory Data Fabric?
Nikita Ivanov: The main difference between in-memory databases (IMDB) and in-memory data grids is that IMDBs support only SQL (or some proprietary NoSQL dialect) while most Data Grids (IMDG) support multiple ways to access and process data. In IMDB the only way to access and process data is SQL and SQL store-procedures, while IMDGs typically support at least the following paradigms: SQL, Key/Value, MapReduce, MPP, and MPI-based processing.
Compared with IMDGs, an In-Memory Data Fabric represents the latest generation of in-memory technologies, integrated into a single platform, which eliminates the need for point solutions such as IMDB’s or IMDGs. It is a software layer that sits between applications and data stores, and it allows for high-performance data access and processing across different types of data, such as SQL, NoSQL and Hadoop. All without any rip and replace of existing applications or databases.
Q2. How is it possible to accelerate Hadoop-based deployments with in-memory technology?
Nikita Ivanov: To accelerate Hadoop in a meaningful way one needs to find a way to accelerate two core technologies that define Hadoop: HDFS, a distributed file system where data is stored, and MapReduce, a framework that allows parallel processing of the data stored in HDFS.
At GridGain, we’ve developed a highly optimized in-memory file system that is 100% compatible with HDFS that allows to store data directly in DRAM of computers in a Hadoop cluster. We’ve also developed a specifically optimized YARN-based MapReduce implementation that takes full advantage of the data stored directly in DRAM instead of disks.
The combination of these two innovations allows GridGain to speed up any Hadoop payloads – including Pig, Hive, or hand-written MapReduce jobs in any language – up to 10x without any code change. GridGain provides the first Hadoop accelerator that provide a true plug-and-play acceleration to the existing Hadoop jobs.
Q3. Why did you decide to open source your product?
Nikita Ivanov: Even before October of last year, GridGain already had an open core model: We offered an in-memory data fabric under the Apache 2.0 license, and we also offered a commercial edition with a number of enterprise-grade features, such as enhanced security, data center replication, rolling updates, cross-language portability, and others.
The drivers for our decision to contribute our core open source code base to the Apache Software Foundation (ASF) were of course to ensure continued, broad adoption of in-memory technologies and the long-term viability of the code base. But equally importantly, we also want to build a thriving community that adopts and adapts this code base, and hence will be key in finding new use cases for in-memory computing.
Q4. What is Apache Ignite?
Nikita Ivanov: Apache Ignite (incubating) is an open source, distributed framework for a unified In-Memory Data Fabric, originally developed by GridGain Systems. Apache Ignite is an incubating Apache project, which provides a high-performance, distributed in-memory data management software layer between various data sources and applications. Its code is written mostly in Java and Scala with small amount of C++ code, and it will initially combine an in-memory data grid, in-memory compute grid and in-memory streaming processing in one framework.
Apache Ignite’s large scale, in-memory framework offers transactional and real-time analytics applications performance gains of 100-1,000 times faster throughput and/or lower latencies. It is also a key open source foundation to enable the emerging class of so-called hybrid transactional-analytical workloads.
Q5. What is special about v1.0?
Nikita Ivanov: In October of 2014, the GridGain In-Memory Data Fabric core code base was accepted by the Apache Software Foundation (ASF) into the Incubator program under the name “Apache Ignite”.
Since then, GridGain engineers as well as other contributors have been busy working on migrating the existing code base, documentation, and refactoring of the existing internal build, test & release processes to the “Apache Way”.
Version 1.0 represents the first release that meets these goals, and will include additional enhancements above and beyond the most recent open source In-Memory Data Fabric from GridGain. In fact, Apache Ignite has a large set of features, and one of its coolest new features is its ability to automatically integrate with different RDBMS systems, such as Oracle, MySql, Postgres, DB2, Microsoft SQL, etc. This feature automatically generates the application domain model based on the schema definition of the underlying database, and then loads the data.
Despite the breadth of its feature set, however, Ignite is actually very easy to use: For example, there are no custom installers. The product comes as one ZIP file, which is ready to go once you unzip it. And it has only 1 mandatory dependency – ignite-core.jar. All other dependencies, like integration with Spring for configuration, or with the H2 database for SQL, can be added to the process a la carte. Also, the project is fully mavenized, and is composed of over a dozen of maven artifacts that can be imported and used in any combination. Apache Ignite is based on standard Java APIs, and for distributed caches and data grid functionality Ignite implements the JCache (JSR 107) standard.
The new Apache Ignite v1.0 bits are available for download now from the Apache Ignite web site.
Q6. Who will be using the Apache Ignite In-Memory Data Fabric, and for what?
Nikita Ivanov: We expect developers and software architects of high-performance, hyper-scale on-premise and SaaS applications to take advantage of the following capabilities when building or performance-tuning their new or existing applications: compute grid, data grid, service grid, streaming, clustering, distributed data structures, distributed messaging, distributed events and in-memory file system.
Use cases can be found in software designed for financial services, telecommunications, retail, transportation, social media, online advertising, utilities, biosciences and many other industries.
Q7. What is positioning of the Apache Ignite project?
Nikita Ivanov: As we explained in our blog from last November, we believe Apache Ignite has all the right ingredients to become for the world of Fast Data what Hadoop is for Big Data today. This means that unlike Hadoop, which is a batch process focused on enabling the storage of large amounts of data economically, Ignite will enable extremely fast and ultra-low latency processing of data, allowing its users to derive actionable insights from their data much faster. Unlike Spark, a popular sister project of Ignite in the ASF, which is mainly focused on enhancing analytics and machine-learning for the Hadoop world, Ignite is a data source agnostic processing layer, which can be used for both Hadoop-like computation and many other computing paradigms like MPP, MPI, streaming processing.
In addition to real-time analytics, Ignite’s in-memory framework also offers support for full ACID transactions.
Q8. You have previously posted that Oracle and SAP are missing the point of In-Memory Computing. Could you please elaborate on this?
Nikita Ivanov: We continue to believe that Oracle and SAP are missing the point of in-memory computing for the following reasons: By offering a well-integrated platform of a compute grid, data grid, streaming/CEP and Hadoop acceleration, Apache Ignite (incubating) and the GridGain In-Memory Data Fabric offer a strategic approach to in-memory computing, across both transactional and analytical workloads, that delivers performance, scale and comprehensive capabilities far above and beyond what traditional in-memory databases, data grids or other in-memory-based point solutions can offer by themselves.
Both Apache Ignite and GridGain’s enterprise offering built on Apache Ignite will greatly benefit from a thriving community adapting the code base to new and emerging use cases; therefore, we believe this code base is extremely well positioned to drive superior innovation to the world of Fast Data, just as the Hadoop community has been doing for Big Data.
In addition, unlike Oracle or SAP Hana, Apache Ignite is more affordable, easier-to-access and more transparent open source software running on commodity hardware, which typically increases developers’ and architects’ motivation to explore the potential of in-memory computing. That said, if all the customer is looking for from in-memory technology is faster processing of their (SQL) data, then they may still choose to deploy proprietary software from Oracle or SAP.
Qx Anything else you wish to add?
Nikita Ivanov: I guess I should mention that even though Apache Ignite has been in incubation for less than 4 months only, we are excited to see that the project already has a very vibrant and growing community.
But we always welcome community contributions, so if there are readers that would like to contribute, please send an email to the Apache Ignite dev list, and we will get you started. And even if you are not ready to contribute immediately, we would like to invite everyone to join our dev list. Most of the discussions happen there, and you can find out a lot about where the project is going and also provide your own ideas. Another great way, of course, for people to familiarize themselves with Apache Ignite, is to take a look at the code and see what it can do for thier project. The Ignite bits can be downloaded on the Apache Ignite homepage.
Nikita Ivanov is founder and CTO of GridGain Systems, started in 2007 and funded by RTP Ventures and Almaz Capital. Nikita has led GridGain to develop advanced and distributed in-memory data processing technologies – the top Java in-memory data fabric starting every 10 seconds around the world today.
Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. Nikita was one of the pioneers in using Java technology for server side middleware development while working for one of Europe’s largest system integrators in 1996.
He is an active member of Java middleware community, contributor to the Java specification, and holds a Master’s degree in Electro Mechanics from Baltic State Technical University, Saint Petersburg, Russia.
– On Solr and Mahout. Interview with Grant Ingersoll. ODBMS Industry Watch, 2015-01-06
– Big Data: Three questions to McObject. ODBMS Industry Watch, February 14, 2014
Follow ODBMS.org on Twitter: @odbmsorg
“When trades are reconciled with counterparties and then closed, updates can and do occur. Bitemporal helps ensure investment banks can always go back and see when updates occurred for specific trades. This is critical to managing risk and handling increased concerns about regulatory compliance and future audits. “– Stephen Buxton.
MarkLogic recently released MarkLogic 8. I wanted to know more about this release. For that, I have interviewed Stephen Buxton, Senior Director, Product Management at MarkLogic.
Q1. You have recently launched MarkLogic® 8 software release. How is it positioned in the Big Data market? How does it differentiate from other products from NoSQL vendors?
Stephen Buxton: MarkLogic 8 is our biggest release ever, further solidifying MarkLogic’s position in the market as the only Enterprise NoSQL database.
With MarkLogic 8, you can now store, manage and search JSON, XML, and RDF all in one unified platform—without sacrificing enterprise features such as transactional consistency, security, or backup and recovery.
While other database companies are still figuring out how to strengthen their platform and add features like transactional consistency, we’ve moved far ahead of them by working on new innovative features such as Bitemporal and Semantics. It’s for these reasons that over 500 enterprise organizations have chosen MarkLogic to run their mission-critical applications.
MarkLogic 8 is more powerful, agile, and trusted than ever before, and is an ideal platform for doing two things: making heterogeneous data integration simpler and faster; and for doing dynamic content delivery at massive scale.
Relational databases do not offer enough flexibility—integration projects can take multiple years, cost millions of dollars, and struggle at scale. But, the newer NoSQL databases that do have agility still lack the enterprise features required to run in the data centers at large organizations. MarkLogic is the only NoSQL database that is able to solve today’s challenge, having the flexibility to serve as an operational and analytical database for all of an organization’s data.
Within MarkLogic, the JSON structure is mapped directly to the internal structure already used by the XML document format, so it has the same speed and scalability as with XML. This also means that all of the production-proven indexing, data management, and security capabilities that MarkLogic is known for are fully maintained.
Q3. In MarkLogic 8 you have been adding full SPARQL 1.1 support and Inferencing capability. Could you please explain what kind of Inferencing capability did you add and what are they useful for?
Stephen Buxton: We made a big leap forward on the semantics foundation that was laid in our previous release, adding full SPARQL 1.1 support, which includes support for property paths, aggregates, and SPARQL Update. Support for automatic inferencing was also added, which is a powerful capability that allows the database to combine existing data and apply pre-defined rules to infer new data. SPARQL 1.1 is a standard defined by the W3C that is supported by many RDF triple stores. But, MarkLogic differentiates itself among triple stores as you can store your documents and data right alongside your triples, and you can query across all three data models easily and efficiently.
Automatic inferencing is a really powerful feature that is part of an overall strategy to provide a more intelligent data layer so that you can build smarter apps.
With inferencing, for example, if you had two pieces of data stored as RDF triples, such as “John lives in Virginia” and “Virginia is in the United States”, then MarkLogic 8 could infer the new fact, “John lives in the United States.”
This can make search results richer and also show you new relationships in your data.
In MarkLogic 8, rules for inferencing are applied at query time. This approach is referred to as backward-chaining inference, a very flexible approach in which only the required rules are applied for each query, so the server does the minimum work necessary to get the correct results; and when your data or ontology or rules sets change, that change is available immediately – it takes effect with the very next query. And, of course, inference queries are transactional, distributed, and obey MarkLogic’s rule-based security, just like any other query. MarkLogic 8 has supplied rule sets for RDFS, RDFS-Plus, OWL-Horst, and their subsets; and you can create your own. With MarkLogic 8 you can further restrict any SPARQL query (with or without inference) by any document attribute, including timestamp, provenance, or even a bitemporal constraint.
More details and examples can be found at developer.marklogic.com.
Q4. The additions to SPARQL include Property Paths, Aggregates, and SPARQL Update. Could you please explain briefly each of them?
Stephen Buxton: SPARQL 1.1 brings support for property paths, aggregates, and SPARQL Update. These capabilities make working with RDF data simpler and more powerful, which means increased context for your data—all using the SPARQL 1.1 industry standard query language.
SPARQL 1.1’s property paths let you traverse an RDF graph – bouncing from point-to-point across a graph. This graph traversal allows you to do powerful, complex queries such as, “Show me all the people who are connected to John” by finding people that know John, and people that know people that know John, and so on.
With aggregate SPARQL functions you can do analytic queries over hundreds of billions of triples. MarkLogic 8 supports all the SPARQL 1.1 Aggregate functions – COUNT, SUM, MIN, MAX, and AVG – as well as the grouping operations GROUP BY, GROUP BY .. HAVING, GROUP_CONCAT and SAMPLE.
SPARQL 1.1 also includes SPARQL Update. With these capabilities, you can delete, insert, and update (delete/insert) individual triples, and manipulate RDF graphs, all using SPARQL 1.1.
Q5. The addition of SPARQL Update capabilities could have the potential to influence the capability you offer of a RDF triple store that scales horizontally and manages billions of triples. Any comment on that?
Stephen Buxton: The enhancements in MarkLogic 8 make it able to function as a full-featured, stand-alone triple store– this means you can now get a triple store that is horizontally scalable as part of a shared-nothing cluster, and still get all of the enterprise features MarkLogic is known for such as such as High Availability, Disaster Recovery, and certified security. Beyond that, anyone looking for “just a triple store” will find they can also store, manage, and query documents and data in the same database, a unique capability that only MarkLogic has.
Q6. You have been adding a so called Bitemporal Data Management. What is it and why is it useful?
Stephen Buxton: Bitemporal is a new feature that allows you to ask, “What did you know and when did you know it?” The MarkLogic Bitemporal feature answers this critical question by tracking what happened, when it happened, and when we found out. A bitemporal database is much more powerful than a temporal database that can only track when something happened. The difference between when something happened and when you found out about it can be incredibly significant, particularly when it comes to audits and regulation.
A bitemporal database tracks time across two different axes, the system and valid time axes. This allows you to go back in time and explore data, manage historical data across systems, ensure data integrity, and do complex bitemporal analysis. You can answer complex questions such as:
• Where did John Thomas live on August 20th as we knew it on September 1st?
• Where was the Blue Van on October 12th as we knew it on October 23rd?
Bitemporal is important for a wide variety of use cases across industries. Getting a more accurate picture of a business at different points-in-time used to be impossible, or very challenging at best. Bitemporal helps ensure that you always have a full and accurate picture of your data at every point-in-time, which is particularly useful in regulated industries.
• Regulatory requirements – Avoid the increasingly harsh downside consequences from not adhering to government and industry regulations, particularly in financial services and insurance
• Audits – Preserve the history of all your data, including the changes made to it, so that clear audits can be conducted without having to worry about lost data, data integrity, or cumbersome ETL processes with archived data
• Investigations and Intelligence – No more lost emails and no more missing information. Bitemporal databases never erase data, so it is possible to see exactly how data was updated based on what was known at the time
• Business Analytics – Run complex queries that were not previously possible in order to better understand your business and answer new questions about how different decisions and changes in the past could have led to different results
• Cost reduction – Manage data with a smaller footprint as the shape of the data changes, avoiding the need to set up additional databases for historical data.
Bitemporal is enhanced by MarkLogic’s Tiered Storage, which allows you to more easily archive your data to cheaper storage tiers with little administrative overhead. This keeps Bitemporal simple, and obviates the high cost imposed by the few relational databases that do have Bitemporal. MarkLogic also eliminates the schema roadblocks that relational databases that have Bitemporal struggle with. MarkLogic is schema-agnostic and can adjust to the shape of data as that data changes over time.
Q7. How is bitemporal different from versioning?
Stephen Buxton: Bitemporal works by ingesting bitemporal documents that are managed as a series of documents with range indexes for valid and system time axes. Documents are stored in a temporal collection protected by security permissions. The initial document inserted into the database is kept and never changes, allowing you to track the provenance of information with full governance and immutability.
Q8. Could you give us some examples of how Bitemporal Data Management could be useful applications for the financial services industry?
Stephen Buxton: One example of Bitemporal is trade reconciliation in financial services. When trades are reconciled with counterparties and then closed, updates can and do occur. Bitemporal helps ensure investment banks can always go back and see when updates occurred for specific trades. This is critical to managing risk and handling increased concerns about regulatory compliance and future audits.
Imagine the Head of IT Architecture at a major bank working on mining information and looking for changes in risk profiles. The risk profiles cannot be accurately calculated without having an accurate picture of the reference and trade data, and how it changed over time. This task becomes simple and fast using Bitemporal.
Qx Anything else you wish to add?
Stephen Buxton: In addition to innovative features such as Bitemporal and Semantics, and features that make MarkLogic more widely accessible in the developer community, there are other updates in Marklogic 8 that make it easier to administer and manage. For example, Incremental Backup, another feature added in MarkLogic 8, allows DBAs to perform backups faster while using less storage.
With MarkLogic 8, you can have multiple daily incremental backups with only a minimal impact on database performance. This feature is one worth highlighting because it will help make DBAs live much easier, and will save an organization time and money.
It’s just another example of MarkLogic’s continuing dedication to being an enterprise NoSQL database that is more powerful, agile, and trusted than anything else.
Stephen Buxton is Senior Director of Product Management for Search and Semantics at MarkLogic, where he has been a member of the Products team for 8 years. Stephen focuses on bringing a rich semantic search experience to users of the MarkLogic NoSQL database, document store, and triple store. Before joining MarkLogic, Stephen was Director of Product Management for Text and XML at Oracle Corporation.
-On making information accessible. Interview with David Leeming. ODBMS Industry Watch, July 30, 2014
Follow ODBMS.org on Twitter: @odbmsorg
“We were looking for solutions which provided the data integrity guarantees we needed, provided clustering tools to ease operational complexity, and were able to handle our data size and the read/write throughput we required.”–John Allison
I have interviewed John Allison, CTO and founder of Customer.io, a start up company in Portland, Oregon.
Q1. What is the business of Customer.io ?
John Allison: We help our customers send timely, targeted messages based on user activity on their website or mobile app. We achieve this by collecting analytical data, providing real-time segmentation, and allowing our customers to define rules to trigger messages at different points in their interactions with a user.
Q2. How large are the data sets you analyze?
John Allison: We’ve collected 6 terabytes of analytical event data for over 55 million unique users across our platform. Due to it’s nature, this data continues to grow and grows faster as we collect data for more and more users.
Q3. What are the main business and technical challenges you are currently facing?
John Allison: As we continue to grow our business, we need to ensure the technical side of our service can easily scale out to support new customers who want to use our product.
Q4. Why did you replace your existing underlying database architecture supporting your “MVP” product ? What were the main technical problems you encountered?
John Allison: As our data set grew in size to the point where we couldn’t realistically manage it all on a small number of servers, we began looking for alternatives which would allow us to continue providing our service in a larger, more distributed way.
Q5. How did you evaluate the alternatives?
John Allison: We evaluated many options and found that most didn’t live up to the availability or consistency guarantees they promised when run over a cluster of servers. We were looking for solutions which provided the data integrity guarantees we needed, provided clustering tools to ease operational complexity, and were able to handle our data size and the read/write throughput we required.
Q6. How is the new solution looking like?
John Allison: We’ve taken more of a polyglot approach to storing our data. We are consolidating on three main clustered databases:
1) FoundationDB – Data where distributed transactions and consistency guarantees are most important.
2) Riak – Large amounts of immutable data where availability is more important.
3) ElasticSearch – Indexing data for ad-hoc querying.
All three have built in tools for expanding and administrating a cluster, provide fault-tolerance and increased reliability in the face of server faults, and each provides us with unique ways to access our data.
Q7. What experience do you have with this new database architecture until now? Do you have any measurable results you can share with us?
John Allison: Embracing a distributed architecture and storing data in the right database for a given use-case has led to less time worrying about operations, increased reliability of our service as a whole, and the ability to scale out all parts of our infrastructure to increase our platform’s capacity.
Q8. Moving forward, what are your plans for the next implementation of your product?
John Allison: Continuing to improve our product in order to provide the most value we can for our customers.
John Allison is the CTO and founder of Customer.io, a startup focused on making it easy to build, manage, and measure automatic customer retention emails. Prior to that he was the head of engineering at Challengepost.com. He is a world traveler, Golfer, and an Arkansas Razorback fan.
We have published several new experts articles on Big Data and Analytics in ODBMS.org.
– On Mobile Data Management. Interview with Bob Wiederhold. ODBMS Industry Watch, 2014-11-18.
-Big Data Management at American Express. Interview with Sastry Durvasula and Kevin Murray. ODBMS Industry Watch, 2014-10-12
Follow ODBMS.org on Twitter: @odbmsorg
“When does it get practical for most people, not just the Google’s and the Facebook’s of the world? I’ve seen some cool usages of big data over the years, but I also see a lot of people with a solution looking for a problem.”–Grant Ingersoll.
I have interviewed Grant Ingersoll, CTO and co-founder of LucidWorks. Grant is an active member of the Lucene community, and co-founder of the Apache Mahout machine learning project.
I wish you a Happy and a Peaceful 2015!
Q1. Why LucidWorks Search? What kind of value-add capabilities does it provide with respect to the Apache Lucene/Solr open source search?
Grant Ingersoll: I like to think of LucidWorks Search (LWS) as Solr++, that is, we give you all of the goodness of Solr and then some more. Our primary focus in building LWS is in 4 key areas:
1. IT integration — Make it easy to consume Solr within an IT organization via things like monitoring, APIs, installation and so on.
2. Enterprise readiness — Large enterprises have 1 of everything and they all have a multitude of security requirements, so we focus on making it easier to operate in these environments via things like connectors for data acquisition, security and the like
3. Tools for Subject Matter Experts — These are aimed at technical non developers like Business Analysts, Merchandisers, etc. who are responsible for understanding who asked for what, when and why. These tools are primarily aimed at understanding relevancy of search results and then taking action based on business needs.
4. Deliver a supported version of the open source so that companies can reliably deploy it knowing they have us to back them up.
Q2. At LucidWorkd you have integrated Apache open source projects to deliver a Big Data application development and deployment platform. What does the emerging big data stack look like?
Grant Ingersoll: We use capabilities from the Hadoop ecosystem for a number of activities that we routinely see customers struggling with when they try to better understand their data. In many cases, this boils down to large scale log analysis to power things like recommendation systems or Mahout for machine learning, but it also can be more subtle like doing large scale content extraction from Office documents or natural language processing approaches for identifying interesting phrases. We also rely on Zookeeper quite heavily to make sure that our cluster stays in a happy state and doesn’t suffer from split brain issues and cause failures.
Q3. How does it different with respect to other Big Data Hadoop-based distributions such as Cloudera, Hortonworks, and Greenplum Pivotal HD?
Grant Ingersoll: I can’t speak to their integrations in great detail, but we integrate with all of them (as well as partner with most of them), so I guess you would say we try to work at a layer above the core Hadoop infrastructure and focus on how the Hadoop ecosystem can solve specific problems as opposed to being a general purpose tool. For instance, we ship with a number of out of the box workflows designed to solve common problems in search like click-through log analysis and whole collection document clustering so you don’t have to write them yourself.
Q4. How does it work to build a framework for big data with open source technologies that are “pre-integrated”?
Grant Ingersoll: Well, you quickly realize what a version soup there is out there, trying to support all the different “flavors” of Hadoop. Other than, it is a lot of fun to leverage the technologies to solve real problems that help people better understand their data. Naturally, there are challenges in making sure all the processes work together at scale, so a lot of effort goes into those areas.
Q5. What happens when big data plus search meets the cloud?
Grant Ingersoll: You get cost effective access and insight into your data instead of a big science experiment. In many ways, the benefits are the same as search and ranking in on-prem situations plus the added benefits the cloud brings you in terms of costs, scaling and flexibility. Of course, the well-documented challenge in the cloud is how to get your data there. So, for users who already have their data in the cloud, it’s an especially easy win, for those who don’t, we provide connectors that help.
Q6. Solr Query includes simple join capability between two document types. How do such queries scale with Big Data?
Grant Ingersoll: Solr scales quite well (billions of documents and very large query volumes).
In fact, we’ve seen it routinely scale linearly to quite large cluster sizes.
As with databases, joins require you to pay attention to how you do the join or whether there are better ways of asking your question, but I have seen them used quite successfully in the appropriate situation. At the end of the day, I try to remain pragmatic and use the appropriate tool for the job. A search engine can handle some types of joins, but that doesn’t always mean you should do it in a search engine. I like to think of a search engine as a very fast ranking engine. If the problem requires me to rank something, than search engine technology is going to be hard to beat. If you need it to do all different kinds of joins across a large number of document types or constant large table scans, it may be appropriate to do in a search engine and it may not. It’s a classic “it depends” situation. That being said, over the past few years, these kinds of problems have become much more efficient to do in a search engine thanks to a multitude of improvements the community has made to Lucene and Solr.
Q7. The Apache Mahout Machine Learning Project’s goal is to build scalable machine learning libraries. What is current status of the project?
Grant Ingersoll: We released 0.9 and are working towards a 1.0. The main focus lately has been on preparing for a 1.0 release by culling old, unused code and tightly focusing on a core set of algorithms which are tried and true that we want to support going forward.
Q8. What kind of algorithms is Apache Mahout currently supporting?
Grant Ingersoll: I tend to think of Mahout as being focused on the three “C’s”: clustering, classification and collaborative filtering (recommenders). These algorithms help people better understand and organize their data. Mahout also has various other algorithms like singular value decomposition, collocations and a bunch of libraries for Java primitives.
Q9. How does Mahout relies on the Apache Hadoop framework?
Grant Ingersoll: Many of the algorithms are written for Hadoop specifically, but not all. We try to be prudent about where it makes sense to use Hadoop and where it doesn’t, as not all machine learning algorithms are best suited for Map-Reduce style programming. We are also looking at how to leverage other frameworks like Spark or custom distributed code.
Q10. Who is using Apache Mahout and for what?
Grant Ingersoll: It really spans a lot of interesting companies, ranging from those using it to power recommendations to others classifying users to show them ads. At LucidWorks, we use Mahout for identifying statistically interesting phrases, clustering and classification of user’s query intent and more.
Q11. How scalable is Apache Mahout? What are the limits?
Grant Ingersoll: That will depend on the algorithm. I haven’t personally run an exhaustive benchmark, but I’ve seen many of the clustering and classification algorithms scale linearly.
Q12. How do you take into account user feedback when performing Recommendation mining with Apache Mahout?
Grant Ingersoll: Mahout’s recommenders are primarily of the “collaborative filtering” type, where user feedback equates to a vote for a particular item. All of those votes are, to simplify things a bit, added up to produce a recommendation for the user. Mahout supports a number of different ways of calculating those recommendations, since it is a library for producing recommendations and not just a one size fits all product.
Q13. Looking at three elements: Data, Platform, Analysis, what are the main challenges ahead?
Grant Ingersoll: I’d add a fourth element: the user. Lots of interesting challenges here:
When do we get past the hype cycle of big data and into the nitty gritty of making it real? That is, when does it get practical for most people, not just the Google’s and the Facebook’s of the world? I’ve seen some cool usages of big data over the years, but I also see a lot of people with a solution looking for a problem.
How do we leverage the data, the platform and the analysis to make us smarter/better off instead of just better marketing targets? How do we use these tools to personalize without offending or destroying privacy?
How do we continue to meet scale requirements without breaking the bank on hardware purchases, etc?
Qx. Anything you wish to add?
Grant Ingersoll: Thanks for the great questions!
Grant Ingersoll, CTO and co-founder of LucidWorks, is an active member of the Lucene community – a Lucene and Solr committer, co-founder of the Apache Mahout machine learning project and a long-standing member of the Apache Software Foundation. He is co-author of “Taming Text” from Manning Publications, and his experience includes work at the Center for Natural Language Processing at Syracuse University in natural language processing and information retrieval.
Ingersoll has a Bachelor of Science degree in Math and Computer Science from Amherst College and a Master of Science degree in Computer Science from Syracuse University.
– Taming Text How to Find, Organize, and Manipulate It
Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris
Softbound print: September 2012 (est.) | 350 pages, Manning, ISBN: 193398838X
Follow ODBMS.org on Twitter: @odbmsorg