“Apache Ignite is an incubating Apache project, which provides a high-performance, distributed in-memory data management software layer between various data sources and applications.”–Nikita Ivanov
I have interviewed Nikita Ivanov, founder and CTO of GridGain Systems. The main topic of the interview is the new release of Apache Ignite.
Q1. In your opinion, what are the main differences between an In-Memory Database, an In-Memory Data Grid and an In-Memory Data Fabric?
Nikita Ivanov: The main difference between in-memory databases (IMDBs) and in-memory data grids (IMDGs) is that IMDBs support only SQL (or some proprietary NoSQL dialect), while most IMDGs support multiple ways to access and process data. In an IMDB the only way to access and process data is SQL and SQL stored procedures, while IMDGs typically support at least the following paradigms: SQL, Key/Value, MapReduce, MPP, and MPI-based processing.
Compared with IMDGs, an In-Memory Data Fabric represents the latest generation of in-memory technologies, integrated into a single platform, which eliminates the need for point solutions such as IMDBs or IMDGs. It is a software layer that sits between applications and data stores, and it allows for high-performance data access and processing across different types of data, such as SQL, NoSQL and Hadoop, all without any rip and replace of existing applications or databases.
Q2. How is it possible to accelerate Hadoop-based deployments with in-memory technology?
Nikita Ivanov: To accelerate Hadoop in a meaningful way one needs to find a way to accelerate two core technologies that define Hadoop: HDFS, a distributed file system where data is stored, and MapReduce, a framework that allows parallel processing of the data stored in HDFS.
At GridGain, we’ve developed a highly optimized in-memory file system that is 100% compatible with HDFS and allows data to be stored directly in the DRAM of the computers in a Hadoop cluster. We’ve also developed a specifically optimized YARN-based MapReduce implementation that takes full advantage of the data stored directly in DRAM instead of on disks.
The combination of these two innovations allows GridGain to speed up any Hadoop payloads – including Pig, Hive, or hand-written MapReduce jobs in any language – up to 10x without any code change. GridGain offers the first Hadoop accelerator that provides true plug-and-play acceleration for existing Hadoop jobs.
Q3. Why did you decide to open source your product?
Nikita Ivanov: Even before October of last year, GridGain already had an open core model: We offered an in-memory data fabric under the Apache 2.0 license, and we also offered a commercial edition with a number of enterprise-grade features, such as enhanced security, data center replication, rolling updates, cross-language portability, and others.
The drivers for our decision to contribute our core open source code base to the Apache Software Foundation (ASF) were of course to ensure continued, broad adoption of in-memory technologies and the long-term viability of the code base. But equally importantly, we also want to build a thriving community that adopts and adapts this code base, and hence will be key in finding new use cases for in-memory computing.
Q4. What is Apache Ignite?
Nikita Ivanov: Apache Ignite (incubating) is an open source, distributed framework for a unified In-Memory Data Fabric, originally developed by GridGain Systems. Apache Ignite is an incubating Apache project, which provides a high-performance, distributed in-memory data management software layer between various data sources and applications. Its code is written mostly in Java and Scala with a small amount of C++ code, and it will initially combine an in-memory data grid, in-memory compute grid and in-memory stream processing in one framework.
Apache Ignite’s large scale, in-memory framework offers transactional and real-time analytics applications performance gains of 100-1,000 times faster throughput and/or lower latencies. It is also a key open source foundation to enable the emerging class of so-called hybrid transactional-analytical workloads.
Q5. What is special about v1.0?
Nikita Ivanov: In October of 2014, the GridGain In-Memory Data Fabric core code base was accepted by the Apache Software Foundation (ASF) into the Incubator program under the name “Apache Ignite”.
Since then, GridGain engineers as well as other contributors have been busy migrating the existing code base and documentation, and refactoring the existing internal build, test & release processes to the “Apache Way”.
Version 1.0 represents the first release that meets these goals, and will include additional enhancements above and beyond the most recent open source In-Memory Data Fabric from GridGain. In fact, Apache Ignite has a large set of features, and one of its coolest new features is its ability to automatically integrate with different RDBMSs, such as Oracle, MySQL, Postgres, DB2, Microsoft SQL Server, etc. This feature automatically generates the application domain model based on the schema definition of the underlying database, and then loads the data.
Despite the breadth of its feature set, however, Ignite is actually very easy to use. For example, there are no custom installers. The product comes as one ZIP file, which is ready to go once you unzip it. And it has only one mandatory dependency – ignite-core.jar. All other dependencies, like integration with Spring for configuration, or with the H2 database for SQL, can be added to the process a la carte. Also, the project is fully mavenized, and is composed of over a dozen Maven artifacts that can be imported and used in any combination. Apache Ignite is based on standard Java APIs, and for distributed caches and data grid functionality Ignite implements the JCache (JSR 107) standard.
The new Apache Ignite v1.0 bits are available for download now from the Apache Ignite web site.
Q6. Who will be using the Apache Ignite In-Memory Data Fabric, and for what?
Nikita Ivanov: We expect developers and software architects of high-performance, hyper-scale on-premise and SaaS applications to take advantage of the following capabilities when building or performance-tuning their new or existing applications: compute grid, data grid, service grid, streaming, clustering, distributed data structures, distributed messaging, distributed events and in-memory file system.
Use cases can be found in software designed for financial services, telecommunications, retail, transportation, social media, online advertising, utilities, biosciences and many other industries.
Q7. What is the positioning of the Apache Ignite project?
Nikita Ivanov: As we explained in our blog from last November, we believe Apache Ignite has all the right ingredients to become for the world of Fast Data what Hadoop is for Big Data today. This means that unlike Hadoop, which is a batch process focused on enabling the storage of large amounts of data economically, Ignite will enable extremely fast and ultra-low latency processing of data, allowing its users to derive actionable insights from their data much faster. Unlike Spark, a popular sister project of Ignite in the ASF, which is mainly focused on enhancing analytics and machine-learning for the Hadoop world, Ignite is a data source agnostic processing layer, which can be used for both Hadoop-like computation and many other computing paradigms like MPP, MPI, and stream processing.
In addition to real-time analytics, Ignite’s in-memory framework also offers support for full ACID transactions.
Q8. You have previously posted that Oracle and SAP are missing the point of In-Memory Computing. Could you please elaborate on this?
Nikita Ivanov: We continue to believe that Oracle and SAP are missing the point of in-memory computing for the following reasons: By offering a well-integrated platform of a compute grid, data grid, streaming/CEP and Hadoop acceleration, Apache Ignite (incubating) and the GridGain In-Memory Data Fabric offer a strategic approach to in-memory computing, across both transactional and analytical workloads, that delivers performance, scale and comprehensive capabilities far above and beyond what traditional in-memory databases, data grids or other in-memory-based point solutions can offer by themselves.
Both Apache Ignite and GridGain’s enterprise offering built on Apache Ignite will greatly benefit from a thriving community adapting the code base to new and emerging use cases; therefore, we believe this code base is extremely well positioned to drive superior innovation to the world of Fast Data, just as the Hadoop community has been doing for Big Data.
In addition, unlike Oracle or SAP Hana, Apache Ignite is more affordable, easier-to-access and more transparent open source software running on commodity hardware, which typically increases developers’ and architects’ motivation to explore the potential of in-memory computing. That said, if all the customer is looking for from in-memory technology is faster processing of their (SQL) data, then they may still choose to deploy proprietary software from Oracle or SAP.
Qx Anything else you wish to add?
Nikita Ivanov: I guess I should mention that even though Apache Ignite has been in incubation for less than four months, we are excited to see that the project already has a very vibrant and growing community.
But we always welcome community contributions, so if there are readers who would like to contribute, please send an email to the Apache Ignite dev list, and we will get you started. And even if you are not ready to contribute immediately, we would like to invite everyone to join our dev list. Most of the discussions happen there, and you can find out a lot about where the project is going and also provide your own ideas. Another great way, of course, for people to familiarize themselves with Apache Ignite, is to take a look at the code and see what it can do for their project. The Ignite bits can be downloaded from the Apache Ignite homepage.
Nikita Ivanov is founder and CTO of GridGain Systems, started in 2007 and funded by RTP Ventures and Almaz Capital. Nikita has led GridGain to develop advanced and distributed in-memory data processing technologies – the top Java in-memory data fabric starting every 10 seconds around the world today.
Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. Nikita was one of the pioneers in using Java technology for server side middleware development while working for one of Europe’s largest system integrators in 1996.
He is an active member of the Java middleware community, a contributor to the Java specification, and holds a Master’s degree in Electro Mechanics from Baltic State Technical University, Saint Petersburg, Russia.
– On Solr and Mahout. Interview with Grant Ingersoll. ODBMS Industry Watch, 2015-01-06
– Big Data: Three questions to McObject. ODBMS Industry Watch, February 14, 2014
Follow ODBMS.org on Twitter: @odbmsorg
“When trades are reconciled with counterparties and then closed, updates can and do occur. Bitemporal helps ensure investment banks can always go back and see when updates occurred for specific trades. This is critical to managing risk and handling increased concerns about regulatory compliance and future audits.”–Stephen Buxton.
MarkLogic recently released MarkLogic 8. I wanted to know more about this release. For that, I have interviewed Stephen Buxton, Senior Director, Product Management at MarkLogic.
Q1. You have recently launched the MarkLogic® 8 software release. How is it positioned in the Big Data market? How does it differ from other products from NoSQL vendors?
Stephen Buxton: MarkLogic 8 is our biggest release ever, further solidifying MarkLogic’s position in the market as the only Enterprise NoSQL database.
With MarkLogic 8, you can now store, manage and search JSON, XML, and RDF all in one unified platform—without sacrificing enterprise features such as transactional consistency, security, or backup and recovery.
While other database companies are still figuring out how to strengthen their platform and add features like transactional consistency, we’ve moved far ahead of them by working on new innovative features such as Bitemporal and Semantics. It’s for these reasons that over 500 enterprise organizations have chosen MarkLogic to run their mission-critical applications.
MarkLogic 8 is more powerful, agile, and trusted than ever before, and is an ideal platform for doing two things: making heterogeneous data integration simpler and faster; and for doing dynamic content delivery at massive scale.
Relational databases do not offer enough flexibility—integration projects can take multiple years, cost millions of dollars, and struggle at scale. But, the newer NoSQL databases that do have agility still lack the enterprise features required to run in the data centers at large organizations. MarkLogic is the only NoSQL database that is able to solve today’s challenge, having the flexibility to serve as an operational and analytical database for all of an organization’s data.
Within MarkLogic, the JSON structure is mapped directly to the internal structure already used by the XML document format, so it has the same speed and scalability as with XML. This also means that all of the production-proven indexing, data management, and security capabilities that MarkLogic is known for are fully maintained.
Q3. In MarkLogic 8 you have added full SPARQL 1.1 support and inferencing capability. Could you please explain what kind of inferencing capability you added and what it is useful for?
Stephen Buxton: We made a big leap forward on the semantics foundation that was laid in our previous release, adding full SPARQL 1.1 support, which includes support for property paths, aggregates, and SPARQL Update. Support for automatic inferencing was also added, which is a powerful capability that allows the database to combine existing data and apply pre-defined rules to infer new data. SPARQL 1.1 is a standard defined by the W3C that is supported by many RDF triple stores. But, MarkLogic differentiates itself among triple stores as you can store your documents and data right alongside your triples, and you can query across all three data models easily and efficiently.
Automatic inferencing is a really powerful feature that is part of an overall strategy to provide a more intelligent data layer so that you can build smarter apps.
With inferencing, for example, if you had two pieces of data stored as RDF triples, such as “John lives in Virginia” and “Virginia is in the United States”, then MarkLogic 8 could infer the new fact, “John lives in the United States.”
This can make search results richer and also show you new relationships in your data.
In MarkLogic 8, rules for inferencing are applied at query time. This approach is referred to as backward-chaining inference, a very flexible approach in which only the required rules are applied for each query, so the server does the minimum work necessary to get the correct results; and when your data or ontology or rule sets change, that change is available immediately – it takes effect with the very next query. And, of course, inference queries are transactional, distributed, and obey MarkLogic’s role-based security, just like any other query. MarkLogic 8 supplies rule sets for RDFS, RDFS-Plus, OWL-Horst, and their subsets; and you can create your own. With MarkLogic 8 you can further restrict any SPARQL query (with or without inference) by any document attribute, including timestamp, provenance, or even a bitemporal constraint.
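To make the idea of query-time (backward-chaining) inference concrete, here is a minimal Python sketch built on the "John lives in Virginia" example. It is not MarkLogic's API: the in-memory `TRIPLES` set, the predicate names `livesIn` and `isIn`, and the single rule are all assumptions made for illustration.

```python
# Hypothetical in-memory triple store; this is NOT MarkLogic's API.
TRIPLES = {
    ("John", "livesIn", "Virginia"),
    ("Virginia", "isIn", "United States"),
}


def is_in(inner, outer, triples, seen=None):
    """Return True if `inner` is contained in `outer`, directly or transitively."""
    if seen is None:
        seen = set()
    if (inner, "isIn", outer) in triples:
        return True
    seen.add(inner)
    for subject, predicate, obj in triples:
        if subject == inner and predicate == "isIn" and obj not in seen:
            if is_in(obj, outer, triples, seen):
                return True
    return False


def lives_in(person, place, triples=TRIPLES):
    """Backward chaining: try to prove the goal (person livesIn place) on demand.

    Assumed rule for this sketch:
        X livesIn Z  if  X livesIn Y  and  Y isIn Z
    Nothing is precomputed or stored; the rule is applied only when asked.
    """
    # Base case: the fact is stored explicitly.
    if (person, "livesIn", place) in triples:
        return True
    # Rule case: start from where the person is known to live and work outward.
    for subject, predicate, obj in triples:
        if subject == person and predicate == "livesIn":
            if is_in(obj, place, triples):
                return True
    return False


print(lives_in("John", "United States"))
```

Because nothing is materialized ahead of time, adding a triple such as `("United States", "isIn", "North America")` is reflected by the very next call, which mirrors the "takes effect with the very next query" behavior described above.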
More details and examples can be found at developer.marklogic.com.
Q4. The additions to SPARQL include Property Paths, Aggregates, and SPARQL Update. Could you please explain briefly each of them?
Stephen Buxton: SPARQL 1.1 brings support for property paths, aggregates, and SPARQL Update. These capabilities make working with RDF data simpler and more powerful, which means increased context for your data—all using the SPARQL 1.1 industry standard query language.
SPARQL 1.1’s property paths let you traverse an RDF graph – bouncing from point-to-point across a graph. This graph traversal allows you to do powerful, complex queries such as, “Show me all the people who are connected to John” by finding people that know John, and people that know people that know John, and so on.
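As a rough illustration of what such a traversal does, the Python sketch below follows a "knows" relationship one or more times, which is what a SPARQL 1.1 property path with the `+` operator expresses. The `KNOWS` data and the function name are hypothetical, and a real triple store would evaluate this inside the query engine rather than in application code.

```python
from collections import deque

# Hypothetical (subject, object) pairs meaning "subject knows object".
KNOWS = {
    ("Alice", "John"),
    ("Bob", "Alice"),
    ("Carol", "Bob"),
    ("Dave", "Eve"),
}


def connected_to(target, knows=KNOWS):
    """Return everyone who reaches `target` by following 'knows' one or more times."""
    found = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for subject, obj in knows:
            # Walk the edge backwards: who knows the person we are looking at?
            if obj == current and subject != target and subject not in found:
                found.add(subject)
                queue.append(subject)
    return found


print(sorted(connected_to("John")))
```

Here Alice knows John directly, Bob knows Alice, and Carol knows Bob, so all three are returned; Dave is not, because his only link is to Eve.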
With aggregate SPARQL functions you can do analytic queries over hundreds of billions of triples. MarkLogic 8 supports all the SPARQL 1.1 aggregate functions – COUNT, SUM, MIN, MAX, AVG, GROUP_CONCAT and SAMPLE – as well as the grouping operations GROUP BY and GROUP BY .. HAVING.
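The sketch below shows, in plain Python, the kind of result a COUNT with GROUP BY and HAVING produces over triples. The data, the `livesIn` predicate and the function name are assumptions for illustration only; they do not come from MarkLogic.

```python
from collections import defaultdict

# Hypothetical triples; the predicate name "livesIn" is an assumption.
RESIDENCE_TRIPLES = [
    ("John", "livesIn", "Virginia"),
    ("Mary", "livesIn", "Virginia"),
    ("Ann", "livesIn", "Texas"),
]


def count_residents(triples, min_count=1):
    """COUNT residents GROUP BY place, keeping groups HAVING count >= min_count."""
    counts = defaultdict(int)
    for _subject, predicate, place in triples:
        if predicate == "livesIn":
            counts[place] += 1
    return {place: n for place, n in counts.items() if n >= min_count}


print(count_residents(RESIDENCE_TRIPLES))
print(count_residents(RESIDENCE_TRIPLES, min_count=2))
```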
SPARQL 1.1 also includes SPARQL Update. With these capabilities, you can delete, insert, and update (delete/insert) individual triples, and manipulate RDF graphs, all using SPARQL 1.1.
Q5. The addition of SPARQL Update capabilities could have the potential to influence the capability you offer of an RDF triple store that scales horizontally and manages billions of triples. Any comment on that?
Stephen Buxton: The enhancements in MarkLogic 8 make it able to function as a full-featured, stand-alone triple store. This means you can now get a triple store that is horizontally scalable as part of a shared-nothing cluster, and still get all of the enterprise features MarkLogic is known for, such as High Availability, Disaster Recovery, and certified security. Beyond that, anyone looking for “just a triple store” will find they can also store, manage, and query documents and data in the same database, a unique capability that only MarkLogic has.
Q6. You have added so-called Bitemporal Data Management. What is it and why is it useful?
Stephen Buxton: Bitemporal is a new feature that allows you to ask, “What did you know and when did you know it?” The MarkLogic Bitemporal feature answers this critical question by tracking what happened, when it happened, and when we found out. A bitemporal database is much more powerful than a temporal database that can only track when something happened. The difference between when something happened and when you found out about it can be incredibly significant, particularly when it comes to audits and regulation.
A bitemporal database tracks time across two different axes, the system and valid time axes. This allows you to go back in time and explore data, manage historical data across systems, ensure data integrity, and do complex bitemporal analysis. You can answer complex questions such as:
• Where did John Thomas live on August 20th as we knew it on September 1st?
• Where was the Blue Van on October 12th as we knew it on October 23rd?
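A question like the first one above can be sketched in a few lines of Python. This is only an illustration of the two time axes, not how MarkLogic stores or queries temporal documents: the `Version` record, the cities, the year and all dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

FOREVER = date.max  # open-ended upper bound


@dataclass(frozen=True)
class Version:
    name: str
    city: str
    valid_from: date   # valid time: when the fact was true in the real world
    valid_to: date
    system_from: date  # system time: when the database held this belief
    system_to: date


# Hypothetical history: John moved to Denver on Aug 15, but the move was
# only recorded on Sep 10. Old versions are closed off, never deleted.
HISTORY = [
    Version("John Thomas", "Boston", date(2014, 1, 1), FOREVER,
            date(2014, 1, 5), date(2014, 9, 10)),
    Version("John Thomas", "Boston", date(2014, 1, 1), date(2014, 8, 15),
            date(2014, 9, 10), FOREVER),
    Version("John Thomas", "Denver", date(2014, 8, 15), FOREVER,
            date(2014, 9, 10), FOREVER),
]


def where_lived(name, valid_on, known_on, history=HISTORY):
    """Where did `name` live on `valid_on`, as the database knew it on `known_on`?"""
    for v in history:
        if (v.name == name
                and v.valid_from <= valid_on < v.valid_to
                and v.system_from <= known_on < v.system_to):
            return v.city
    return None


# As known on Sep 1 (before the correction) versus Sep 15 (after it).
print(where_lived("John Thomas", date(2014, 8, 20), date(2014, 9, 1)))
print(where_lived("John Thomas", date(2014, 8, 20), date(2014, 9, 15)))
```

The same valid-time question gives "Boston" or "Denver" depending on the system-time point it is asked from, which is the distinction a single time axis cannot capture.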
Bitemporal is important for a wide variety of use cases across industries. Getting a more accurate picture of a business at different points-in-time used to be impossible, or very challenging at best. Bitemporal helps ensure that you always have a full and accurate picture of your data at every point-in-time, which is particularly useful in regulated industries.
• Regulatory requirements – Avoid the increasingly harsh downside consequences from not adhering to government and industry regulations, particularly in financial services and insurance
• Audits – Preserve the history of all your data, including the changes made to it, so that clear audits can be conducted without having to worry about lost data, data integrity, or cumbersome ETL processes with archived data
• Investigations and Intelligence – No more lost emails and no more missing information. Bitemporal databases never erase data, so it is possible to see exactly how data was updated based on what was known at the time
• Business Analytics – Run complex queries that were not previously possible in order to better understand your business and answer new questions about how different decisions and changes in the past could have led to different results
• Cost reduction – Manage data with a smaller footprint as the shape of the data changes, avoiding the need to set up additional databases for historical data.
Bitemporal is enhanced by MarkLogic’s Tiered Storage, which allows you to more easily archive your data to cheaper storage tiers with little administrative overhead. This keeps Bitemporal simple, and obviates the high cost imposed by the few relational databases that do have Bitemporal. MarkLogic also eliminates the schema roadblocks that relational databases that have Bitemporal struggle with. MarkLogic is schema-agnostic and can adjust to the shape of data as that data changes over time.
Q7. How is bitemporal different from versioning?
Stephen Buxton: Bitemporal works by ingesting bitemporal documents that are managed as a series of documents with range indexes for valid and system time axes. Documents are stored in a temporal collection protected by security permissions. The initial document inserted into the database is kept and never changes, allowing you to track the provenance of information with full governance and immutability.
Q8. Could you give us some examples of how Bitemporal Data Management could be useful in applications for the financial services industry?
Stephen Buxton: One example of Bitemporal is trade reconciliation in financial services. When trades are reconciled with counterparties and then closed, updates can and do occur. Bitemporal helps ensure investment banks can always go back and see when updates occurred for specific trades. This is critical to managing risk and handling increased concerns about regulatory compliance and future audits.
Imagine the Head of IT Architecture at a major bank working on mining information and looking for changes in risk profiles. The risk profiles cannot be accurately calculated without having an accurate picture of the reference and trade data, and how it changed over time. This task becomes simple and fast using Bitemporal.
Qx Anything else you wish to add?
Stephen Buxton: In addition to innovative features such as Bitemporal and Semantics, and features that make MarkLogic more widely accessible in the developer community, there are other updates in MarkLogic 8 that make it easier to administer and manage. For example, Incremental Backup, another feature added in MarkLogic 8, allows DBAs to perform backups faster while using less storage.
With MarkLogic 8, you can have multiple daily incremental backups with only a minimal impact on database performance. This feature is one worth highlighting because it will help make DBAs’ lives much easier, and will save an organization time and money.
It’s just another example of MarkLogic’s continuing dedication to being an enterprise NoSQL database that is more powerful, agile, and trusted than anything else.
Stephen Buxton is Senior Director of Product Management for Search and Semantics at MarkLogic, where he has been a member of the Products team for 8 years. Stephen focuses on bringing a rich semantic search experience to users of the MarkLogic NoSQL database, document store, and triple store. Before joining MarkLogic, Stephen was Director of Product Management for Text and XML at Oracle Corporation.
– On making information accessible. Interview with David Leeming. ODBMS Industry Watch, July 30, 2014
“We were looking for solutions which provided the data integrity guarantees we needed, provided clustering tools to ease operational complexity, and were able to handle our data size and the read/write throughput we required.”–John Allison
I have interviewed John Allison, CTO and founder of Customer.io, a startup company in Portland, Oregon.
Q1. What is the business of Customer.io?
John Allison: We help our customers send timely, targeted messages based on user activity on their website or mobile app. We achieve this by collecting analytical data, providing real-time segmentation, and allowing our customers to define rules to trigger messages at different points in their interactions with a user.
Q2. How large are the data sets you analyze?
John Allison: We’ve collected 6 terabytes of analytical event data for over 55 million unique users across our platform. Due to its nature, this data continues to grow, and it grows faster as we collect data for more and more users.
Q3. What are the main business and technical challenges you are currently facing?
John Allison: As we continue to grow our business, we need to ensure the technical side of our service can easily scale out to support new customers who want to use our product.
Q4. Why did you replace your existing underlying database architecture supporting your “MVP” product? What were the main technical problems you encountered?
John Allison: As our data set grew in size to the point where we couldn’t realistically manage it all on a small number of servers, we began looking for alternatives which would allow us to continue providing our service in a larger, more distributed way.
Q5. How did you evaluate the alternatives?
John Allison: We evaluated many options and found that most didn’t live up to the availability or consistency guarantees they promised when run over a cluster of servers. We were looking for solutions which provided the data integrity guarantees we needed, provided clustering tools to ease operational complexity, and were able to handle our data size and the read/write throughput we required.
Q6. What does the new solution look like?
John Allison: We’ve taken more of a polyglot approach to storing our data. We are consolidating on three main clustered databases:
1) FoundationDB – Data where distributed transactions and consistency guarantees are most important.
2) Riak – Large amounts of immutable data where availability is more important.
3) ElasticSearch – Indexing data for ad-hoc querying.
All three have built-in tools for expanding and administering a cluster, provide fault tolerance and increased reliability in the face of server faults, and give us unique ways to access our data.
Q7. What experience do you have with this new database architecture so far? Do you have any measurable results you can share with us?
John Allison: Embracing a distributed architecture and storing data in the right database for a given use-case has led to less time worrying about operations, increased reliability of our service as a whole, and the ability to scale out all parts of our infrastructure to increase our platform’s capacity.
Q8. Moving forward, what are your plans for the next implementation of your product?
John Allison: Continuing to improve our product in order to provide the most value we can for our customers.
John Allison is the CTO and founder of Customer.io, a startup focused on making it easy to build, manage, and measure automatic customer retention emails. Prior to that he was the head of engineering at Challengepost.com. He is a world traveler, golfer, and an Arkansas Razorback fan.
We have published several new expert articles on Big Data and Analytics in ODBMS.org.
– On Mobile Data Management. Interview with Bob Wiederhold. ODBMS Industry Watch, 2014-11-18.
– Big Data Management at American Express. Interview with Sastry Durvasula and Kevin Murray. ODBMS Industry Watch, 2014-10-12
“When does it get practical for most people, not just the Googles and the Facebooks of the world? I’ve seen some cool usages of big data over the years, but I also see a lot of people with a solution looking for a problem.”–Grant Ingersoll.
I have interviewed Grant Ingersoll, CTO and co-founder of LucidWorks. Grant is an active member of the Lucene community, and co-founder of the Apache Mahout machine learning project.
I wish you a Happy and a Peaceful 2015!
Q1. Why LucidWorks Search? What kind of value-add capabilities does it provide with respect to the Apache Lucene/Solr open source search?
Grant Ingersoll: I like to think of LucidWorks Search (LWS) as Solr++, that is, we give you all of the goodness of Solr and then some more. Our primary focus in building LWS is in 4 key areas:
1. IT integration — Make it easy to consume Solr within an IT organization via things like monitoring, APIs, installation and so on.
2. Enterprise readiness — Large enterprises have 1 of everything and they all have a multitude of security requirements, so we focus on making it easier to operate in these environments via things like connectors for data acquisition, security and the like
3. Tools for Subject Matter Experts — These are aimed at technical non developers like Business Analysts, Merchandisers, etc. who are responsible for understanding who asked for what, when and why. These tools are primarily aimed at understanding relevancy of search results and then taking action based on business needs.
4. Deliver a supported version of the open source so that companies can reliably deploy it knowing they have us to back them up.
Q2. At LucidWorks you have integrated Apache open source projects to deliver a Big Data application development and deployment platform. What does the emerging big data stack look like?
Grant Ingersoll: We use capabilities from the Hadoop ecosystem for a number of activities that we routinely see customers struggling with when they try to better understand their data. In many cases, this boils down to large scale log analysis to power things like recommendation systems or Mahout for machine learning, but it also can be more subtle like doing large scale content extraction from Office documents or natural language processing approaches for identifying interesting phrases. We also rely on Zookeeper quite heavily to make sure that our cluster stays in a happy state and doesn’t suffer from split brain issues and cause failures.
Q3. How does it differ from other Big Data Hadoop-based distributions such as Cloudera, Hortonworks, and Greenplum Pivotal HD?
Grant Ingersoll: I can’t speak to their integrations in great detail, but we integrate with all of them (as well as partner with most of them), so I guess you would say we try to work at a layer above the core Hadoop infrastructure and focus on how the Hadoop ecosystem can solve specific problems as opposed to being a general purpose tool. For instance, we ship with a number of out of the box workflows designed to solve common problems in search like click-through log analysis and whole collection document clustering so you don’t have to write them yourself.
Q4. How does it work to build a framework for big data with open source technologies that are “pre-integrated”?
Grant Ingersoll: Well, you quickly realize what a version soup there is out there, trying to support all the different “flavors” of Hadoop. Other than that, it is a lot of fun to leverage the technologies to solve real problems that help people better understand their data. Naturally, there are challenges in making sure all the processes work together at scale, so a lot of effort goes into those areas.
Q5. What happens when big data plus search meets the cloud?
Grant Ingersoll: You get cost-effective access and insight into your data instead of a big science experiment. In many ways, the benefits are the same as search and ranking in on-prem situations plus the added benefits the cloud brings you in terms of costs, scaling and flexibility. Of course, the well-documented challenge in the cloud is how to get your data there. So, for users who already have their data in the cloud, it’s an especially easy win; for those who don’t, we provide connectors that help.
Q6. Solr Query includes simple join capability between two document types. How do such queries scale with Big Data?
Grant Ingersoll: Solr scales quite well (billions of documents and very large query volumes).
In fact, we’ve seen it routinely scale linearly to quite large cluster sizes.
As with databases, joins require you to pay attention to how you do the join or whether there are better ways of asking your question, but I have seen them used quite successfully in the appropriate situation. At the end of the day, I try to remain pragmatic and use the appropriate tool for the job. A search engine can handle some types of joins, but that doesn’t always mean you should do it in a search engine. I like to think of a search engine as a very fast ranking engine. If the problem requires me to rank something, then search engine technology is going to be hard to beat. If you need it to do all different kinds of joins across a large number of document types or constant large table scans, it may be appropriate to do in a search engine and it may not. It’s a classic “it depends” situation. That being said, over the past few years, these kinds of problems have become much more efficient to do in a search engine thanks to a multitude of improvements the community has made to Lucene and Solr.
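To make the kind of join under discussion concrete: Solr's join query parser is written as `{!join from=... to=...}` in front of an inner query. The short Python sketch below only builds the request parameters for such a query. The field names (`manu_id_s`, `id`) and the search term are assumptions chosen for illustration; they do not come from the interview.

```python
from urllib.parse import urlencode

def build_join_query(from_field, to_field, inner_query, rows=10):
    """Return URL-encoded parameters for a Solr join query.

    The join parser runs `inner_query`, collects the values of `from_field`
    from the matching documents, and returns the documents whose `to_field`
    holds one of those values.
    """
    q = "{!join from=%s to=%s}%s" % (from_field, to_field, inner_query)
    return urlencode({"q": q, "rows": rows})

# Hypothetical fields: products carry manu_id_s, manufacturers are keyed by id.
params = build_join_query("manu_id_s", "id", "ipod")
print(params)
```

The join itself runs inside the search engine, which is why the advice above about choosing the right tool still applies: the client only decides whether to ask the question this way.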
Q7. The Apache Mahout Machine Learning Project’s goal is to build scalable machine learning libraries. What is the current status of the project?
Grant Ingersoll: We released 0.9 and are working towards a 1.0. The main focus lately has been on preparing for a 1.0 release by culling old, unused code and tightly focusing on a core set of algorithms which are tried and true that we want to support going forward.
Q8. What kind of algorithms is Apache Mahout currently supporting?
Grant Ingersoll: I tend to think of Mahout as being focused on the three “C’s”: clustering, classification and collaborative filtering (recommenders). These algorithms help people better understand and organize their data. Mahout also has various other algorithms like singular value decomposition, collocations and a bunch of libraries for Java primitives.
Q9. How does Mahout rely on the Apache Hadoop framework?
Grant Ingersoll: Many of the algorithms are written for Hadoop specifically, but not all. We try to be prudent about where it makes sense to use Hadoop and where it doesn’t, as not all machine learning algorithms are best suited for Map-Reduce style programming. We are also looking at how to leverage other frameworks like Spark or custom distributed code.
Q10. Who is using Apache Mahout and for what?
Grant Ingersoll: It really spans a lot of interesting companies, ranging from those using it to power recommendations to others classifying users to show them ads. At LucidWorks, we use Mahout for identifying statistically interesting phrases, clustering and classification of users’ query intent and more.
Q11. How scalable is Apache Mahout? What are the limits?
Grant Ingersoll: That will depend on the algorithm. I haven’t personally run an exhaustive benchmark, but I’ve seen many of the clustering and classification algorithms scale linearly.
Q12. How do you take into account user feedback when performing Recommendation mining with Apache Mahout?
Grant Ingersoll: Mahout’s recommenders are primarily of the “collaborative filtering” type, where user feedback equates to a vote for a particular item. All of those votes are, to simplify things a bit, added up to produce a recommendation for the user. Mahout supports a number of different ways of calculating those recommendations, since it is a library for producing recommendations and not just a one size fits all product.
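The "votes added up" idea can be sketched in a few lines of Python. This is a deliberately simplified illustration and not Mahout's API: the function name `recommend`, the overlap-based weighting and the sample data are all assumptions made for the example.

```python
from collections import defaultdict

def recommend(votes, user, top_n=3):
    """Recommend items by summing votes from users who share a liked item.

    `votes` maps each user to the set of items that user voted for.
    A neighbour's vote is weighted by the number of items the neighbour
    has in common with `user` (an assumed, very simple similarity).
    """
    seen = votes[user]
    scores = defaultdict(int)
    for other, items in votes.items():
        if other == user:
            continue
        overlap = len(seen & items)
        if overlap == 0:
            continue  # users with nothing in common contribute no votes
        for item in items - seen:
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

votes = {
    "ann": {"solr", "lucene"},
    "bob": {"solr", "lucene", "mahout"},
    "cat": {"solr", "hadoop"},
    "dan": {"spark"},
}
print(recommend(votes, "ann"))  # ['mahout', 'hadoop']
```

Swapping the overlap count for another similarity measure changes the ranking without changing the overall shape, which is one way to read the remark that there are several ways of calculating recommendations.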
Q13. Looking at three elements: Data, Platform, Analysis, what are the main challenges ahead?
Grant Ingersoll: I’d add a fourth element: the user. Lots of interesting challenges here:
When do we get past the hype cycle of big data and into the nitty gritty of making it real? That is, when does it get practical for most people, not just the Googles and the Facebooks of the world? I’ve seen some cool usages of big data over the years, but I also see a lot of people with a solution looking for a problem.
How do we leverage the data, the platform and the analysis to make us smarter/better off instead of just better marketing targets? How do we use these tools to personalize without offending or destroying privacy?
How do we continue to meet scale requirements without breaking the bank on hardware purchases, etc?
Qx. Anything you wish to add?
Grant Ingersoll: Thanks for the great questions!
Grant Ingersoll, CTO and co-founder of LucidWorks, is an active member of the Lucene community – a Lucene and Solr committer, co-founder of the Apache Mahout machine learning project and a long-standing member of the Apache Software Foundation. He is co-author of “Taming Text” from Manning Publications, and his experience includes work at the Center for Natural Language Processing at Syracuse University in natural language processing and information retrieval.
Ingersoll has a Bachelor of Science degree in Math and Computer Science from Amherst College and a Master of Science degree in Computer Science from Syracuse University.
– Taming Text How to Find, Organize, and Manipulate It
Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris
Softbound print: September 2012 (est.) | 350 pages, Manning, ISBN: 193398838X
Follow ODBMS.org on Twitter: @odbmsorg
“The biggest challenge facing data analytics is how to turn complex data into actionable information. One way to think about complexity is that there are many stories happening simultaneously in the data – some relevant to the problem being solved but most irrelevant. The goal of Big Data Analytics is to find the relevant story, reducing complexity to actionable information.”–Anthony Bak
On Big Data Analytics, I have interviewed Anthony Bak, Data Scientist and Mathematician at Ayasdi.
Q1. What are the most important challenges for Big Data Analytics?
Anthony Bak: The biggest challenge facing data analytics is how to turn complex data into actionable information. One way to think about complexity is that there are many stories happening simultaneously in the data – some relevant to the problem being solved but most irrelevant. The goal of Big Data Analytics is to find the relevant story, reducing complexity to actionable information. How do we sort through all the stories in an efficient manner?
Historically, organizations extracted value from data by building data infrastructure and employing large teams of highly trained Data Scientists who spend months, and sometimes years, asking questions of data to find breakthrough insights. The probability of discovering these insights is low because there are too many questions to ask and not enough data scientists to ask them.
Ayasdi’s platform uses Topological Data Analysis (TDA) to automatically find the relevant stories in complex data and operationalize them to solve difficult and expensive problems. We combine machine learning and statistics with topology, allowing for ground-breaking automation of the discovery process.
Q2. How can you “measure” the value you extract from Big Data in practice?
Anthony Bak: We work closely with our clients to find valuable problems to solve. Before we tackle a problem we quantify both its value to the customer and the outcome delivering that value.
Q3. You use so-called Topological Data Analysis. What is it?
Anthony Bak: Topology is the branch of pure mathematics that studies the notion of shape.
We use topology as a framework combining statistics and machine learning to form geometric summaries of Big Data spaces. These summaries allow us to understand the important and relevant features of the data. We like to say that “Data has shape and shape has meaning”. Our goal is to extract shapes from the data and then understand their meaning.
While there is no complete taxonomy of all geometric features and their meaning there are a few simple patterns that we see in many data sets: clusters, flares and loops.
Clusters are the most basic property of shape a data set can have. They represent natural segmentations of the data into distinct pieces, groups or classes. An example might find two clusters of doctors committing insurance fraud.
Having two groups suggests that there may be two types of fraud represented in the data. From the shape we extract meaning or insight about the problem.
That said, many problems don’t naturally split into clusters and we have to use other geometric features of the data to get insight. We often see that there’s a core of data points that are all very similar representing “normal” behavior and coming off of the core we see flares of points. Flares represent ways and degrees of deviation from the norm.
An example might be gene expression levels for cancer patients where people in various flares have different survival rates.
Loops can represent periodic behavior in the data set. An example might be patient disease profiles (clinical and genetic information) where they go from being healthy, through various stages of illness and then finally back to healthy.
The loop in the data is formed not by a single patient but by sampling many patients in various stages of disease. Understanding and characterizing the disease path potentially allows doctors to give better, more targeted treatment.
Finally, a given data set can exhibit all of these geometric features simultaneously as well as more complicated ones that we haven’t described here. Topological Data Analysis is the systematic discovery of geometric features.
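One way to see how a "loop" can be recovered from data is a Mapper-style construction (Mapper is the algorithm named in the next question). The Python sketch below is a toy version written for this article, not Ayasdi's implementation: the helper names, the threshold-based clustering and all parameter values are assumptions. It covers the range of a filter function with overlapping intervals, clusters the points inside each interval, and links clusters that share points.

```python
import math
from itertools import combinations

def single_linkage(points, idxs, eps):
    """Group indices into clusters; points closer than eps are linked."""
    clusters, unvisited = [], set(idxs)
    while unvisited:
        stack = [unvisited.pop()]
        cluster = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in unvisited
                    if math.dist(points[i], points[j]) < eps}
            unvisited -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters

def mapper(points, filter_fn, n_intervals, overlap, eps):
    """Return (nodes, edges) of a Mapper-style summary graph."""
    values = [filter_fn(p) for p in points]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    nodes = []
    for k in range(n_intervals):
        # Each interval is widened on both sides so neighbours overlap.
        a = lo + k * width - overlap * width
        b = lo + (k + 1) * width + overlap * width
        idxs = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(single_linkage(points, idxs, eps))
    # Two clusters are connected when they share at least one data point.
    edges = {(s, t) for s, t in combinations(range(len(nodes)), 2)
             if nodes[s] & nodes[t]}
    return nodes, edges

# 40 points sampled evenly on a circle; the filter is the x-coordinate.
points = [(math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40))
          for k in range(40)]
nodes, edges = mapper(points, filter_fn=lambda p: p[0],
                      n_intervals=4, overlap=0.25, eps=0.3)
degree = [sum(1 for e in edges if n in e) for n in range(len(nodes))]
print(len(nodes), len(edges), degree)
```

For this input the summary graph has six nodes and six edges, and every node has exactly two neighbours. The graph is therefore a single cycle, which is the loop in the data showing up in the summary.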
Q4. The core algorithm you use is called “Mapper“, developed at Stanford in the Computational Topology group by Gunnar Carlsson and Gurjeet Singh. How has your company, Ayasdi, turned this idea into a product?
Anthony Bak: Gunnar Carlsson, co-founder and Stanford University mathematics professor, is one of the leaders in a branch of mathematics called topology. While topology has been studied for the last 300 years, it’s in just the last 15 years that Gunnar has pioneered the application of topology to understand large and complex sets of data.
Between 2001 and 2005, DARPA and the National Science Foundation sponsored Gunnar’s research into what he called Topological Data Analysis (TDA). Tony Tether, the director of DARPA at the time, has said that TDA was one of the most important projects DARPA was involved in during his eight years at the agency.
Tony told the New York Times, “The discovery techniques of topological data analysis are going to have a huge impact, and Gunnar Carlsson is at the forefront of this research.”
That led to Gunnar teaming up with a group of others to develop a commercial product that could aid the efforts of life sciences, national security, oil and gas and financial services organizations. Today, Ayasdi already has customers in a broad range of industries, including at least 3 of the top global pharmaceutical companies, at least 3 of the top oil and gas companies and several agencies and departments inside the U.S. Government.
Q5. Do you have some use cases to share where Topological Data Analysis is implemented?
Anthony Bak: There is a well known, 11-year-old data set representing a breast cancer research project conducted by the Netherlands Cancer Institute-Antoni van Leeuwenhoek Hospital. The research looked at 272 cancer patients covering 25,000 different genetic markers. Scientists around the world have analyzed this data over and over again. In essence, everyone believed that anything that could be discovered from this data had been discovered.
Within a matter of minutes, Ayasdi was able to identify new, previously undiscovered populations of breast cancer survivors. Ayasdi’s discovery was recently published in Nature.
Using connections and visualizations generated from the breast cancer study, oncologists can map their own patients’ data onto the existing data set to custom-tailor triage plans. In a separate study, Ayasdi helped discover previously unknown biomarkers for leukaemia.
You can find additional case studies here.
Q6. Query-Based Approach vs. Query-Free Approach: could you please elaborate on this and explain the trade off?
Anthony Bak: Since the creation of SQL in the 1980s, data analysts have tried to find insights by asking questions and writing queries. This approach has two fundamental flaws. First, all queries are based on human assumptions and bias. Secondly, query results only reveal slices of data and do not show relationships between similar groups of data. While this method can uncover clues about how to solve problems, it is a game of chance that usually results in weeks, months, and years of iterative guesswork.
Ayasdi’s insight is that the shape of the data – its flares, clusters, loops – tells you about natural segmentations, groupings and relationships in the data. This information forms the basis of a hypothesis to query and investigate further. The analytical process no longer starts with coming up with a hypothesis and then testing it; instead, we let the data, through its geometry, tell us where to look and what questions to ask.
Q7. Anything else you wish to add?
Anthony Bak: Topological data analysis represents a fundamental new framework for thinking about, analyzing and solving complex data problems. While I have emphasized its geometric and topological properties it’s important to point out that TDA does not replace existing statistical and machine learning methods.
Instead, it forms a framework that utilizes existing tools while gaining additional insight from the geometry.
I like to say that statistics and geometry form orthogonal toolsets for analyzing data; to get the best understanding of your data you need to leverage both. TDA is the framework for doing just that.
Anthony Bak is currently a Data Scientist and mathematician at Ayasdi. Prior to Ayasdi, Anthony was at Stanford University where he worked with Ayasdi co-founder Gunnar Carlsson on new methods and applications of Topological Data Analysis. He did his Ph.D. work in algebraic geometry with applications to string theory.
– Extracting insights from the shape of complex data using topology
P. Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson & G. Carlsson
Nature, Scientific Reports 3, Article number: 1236 doi:10.1038/srep01236, 07 February 2013
“We see mobile rapidly emerging as a core requirement for data management. Any vendor who is serious about being a leader in the next generation database market has to have a mobile strategy.”–Bob Wiederhold
I have interviewed Bob Wiederhold, President and Chief Executive Officer of Couchbase.
Q1. On June 26, you announced a $60 Million series E round of financing. What are Couchbase’s chances of becoming a major player in the database market (and not only in the NoSQL market)? And what is your strategy for achieving this?
Bob Wiederhold: Enterprises are moving from early NoSQL validation projects to mission critical implementations.
As NoSQL deployments evolve to support the core business, requirements for performance at scale and completeness increase. Couchbase Server is the most complete offering on the market today, delivering the performance, scalability and reliability that enterprises require.
Additionally, we see mobile rapidly emerging as a core requirement for data management. Any vendor who is serious about being a leader in the next generation database market has to have a mobile strategy.
At this point, we are the only NoSQL vendor offering an embedded mobile database and the sync needed to manage data between the cloud, the device and other devices. We believe that having the most complete, best performing operational NoSQL database along with a comprehensive mobile offering uniquely positions us for leadership in the NoSQL market.
Q2. Why is Couchbase Lite so strategically important for you?
Bob Wiederhold: First, because the world is going mobile. That is indisputable. Mobile initiatives top the list of every IT department. As I said above, if you don’t have a mobile data management offering, you are not looking at the complete needs of the developer or the enterprise.
Second, let’s level set on Couchbase Lite. Couchbase Lite is our offering for an embedded mobile JSON database.
Our complete mobile offering, Couchbase Mobile, includes Couchbase Server – for data management in the cloud, and Sync Gateway for synchronization of data stored on the device with other devices, or the database in the cloud.
Today, because connectivity is unknown, data synchronization challenges force developers to choose either a fully online (data stored in the cloud) or a fully offline (data stored on the device) data management strategy.
This approach limits functionality: when the network is unavailable, online apps may freeze and not work at all. People want access to their applications (travel, expense reports, multi-user collaboration and so on) whether they’re online or not.
Couchbase Mobile is the only NoSQL offering available that allows developers to build JSON applications that work whether an application is online or off, and manages the synchronization of the data between those applications and the cloud, or other devices. This is revolutionary for the mobile world and we are seeing tremendous interest from the mobile developer community.
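The online-or-offline behaviour described here can be pictured with a small Python sketch. It is not the Couchbase Lite or Sync Gateway API: the class `OfflineFirstStore`, its methods and the sample documents are hypothetical, and conflict resolution is left out entirely. The point is only that writes go to the device first and are pushed when a connection exists.

```python
class OfflineFirstStore:
    """A local document store that queues writes until a network is available."""

    def __init__(self):
        self.local = {}     # documents stored on the device
        self.pending = []   # ids of documents changed since the last sync

    def save(self, doc_id, doc):
        """Write locally first, so the app keeps working without a network."""
        self.local[doc_id] = dict(doc)
        if doc_id not in self.pending:
            self.pending.append(doc_id)

    def sync(self, remote, online):
        """Push queued changes, then pull documents the device lacks.

        Returns the number of documents pushed. No conflict handling:
        a pushed document simply overwrites the remote copy.
        """
        if not online:
            return 0
        pushed = len(self.pending)
        for doc_id in self.pending:
            remote[doc_id] = dict(self.local[doc_id])
        self.pending.clear()
        for doc_id, doc in remote.items():
            self.local.setdefault(doc_id, dict(doc))
        return pushed

cloud = {"report-1": {"status": "approved"}}   # stands in for the cloud database
device = OfflineFirstStore()
device.save("expense-7", {"amount": 42})       # works with no connectivity
offline_result = device.sync(cloud, online=False)
online_result = device.sync(cloud, online=True)
print(offline_result, online_result, sorted(device.local))
```

A real synchronization layer also has to deal with concurrent edits on several devices, which is where most of the engineering effort mentioned in the interview goes.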
Q3. What can enterprises do with a NoSQL mobile database that they would not be able to do with a non-mobile database?
Bob Wiederhold: Offline access and syncing have been too time- and resource-intensive for mobile app developers. With Couchbase Mobile, developers don’t have to spend months, or years, building a solution that can store unstructured data on the device and sync that data with external sources – whether that is the cloud or another device. With Couchbase Mobile, developers can easily create mobile applications that are not tied to connectivity or limited by sync considerations. This empowers developers to build an entirely new class of enterprise applications that go far beyond what is available today.
Q4. What kind of businesses and applications will benefit when people use NoSQL databases on their mobile devices? Can you give us some examples?
Bob Wiederhold: Nearly every business can benefit from the use of a complete mobile solution to build always-available apps that work offline or online. One business example is our customer Infinite Campus.
Focused on educational transformation through the use of information technology, Infinite Campus is looking at Couchbase Lite as a solution that will enable students to complete their homework modules even when they don’t have access to a network outside of school. Instructional videos and homework assignments can be selectively pushed to students’ mobile devices when they are online at school.
Using Couchbase Lite, students can work online at school and then complete their homework assignments anywhere – on or offline. And the data seamlessly syncs across devices and between users, so teachers and students can participate in real-time Q&A chat sessions during lectures.
Q5. Do you have some customers who have gone into production with that?
Bob Wiederhold: The product is new, but we already have several customers that are live.
In addition to Microsoft, we have several companies around the world. You can check out one iOS app by Spraed, who is using Couchbase Server – running on AWS, Sync Gateway and Couchbase Lite.
Q6. Couchbase Server is a JSON document-based database. Why this design choice?
Bob Wiederhold: The world is changing. Businesses need to be agile and responsive.
Relational databases, with rigid schema design, don’t allow for fast change. JSON is the next generation architecture that businesses are increasingly using for mission critical applications because the technology allows them to manage and react to all aspects of big data (volume, variety and velocity of data) as well as big users, and to do that in a cloud-based landscape.
Q7. Do you have any plans to work with cloud providers?
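The contrast with a rigid schema can be shown with two JSON documents that do not share the same fields. The Python sketch below uses only the standard `json` module; the documents and the `device_count` helper are invented for illustration and say nothing about how Couchbase Server stores data internally.

```python
import json

# Two documents of the same kind: the second was written after the
# application started recording devices. No migration of the first is needed.
profiles = [
    json.loads('{"id": 1, "name": "Ada"}'),
    json.loads('{"id": 2, "name": "Lin", "devices": ["phone", "tablet"]}'),
]

def device_count(doc):
    """Treat a missing field as empty instead of failing on older documents."""
    return len(doc.get("devices", []))

counts = [device_count(p) for p in profiles]
print(counts)  # [0, 2]
```

The trade-off is that the application code, rather than the database, now has to cope with fields that may be absent.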
Bob Wiederhold: We already work with many cloud providers. We have a great relationship with Amazon Web Services and many of our customers, including WebMD and Viber, run on AWS.
We also have partnerships and customers running on Windows Azure, GoGrid, and others. More and more organizations are moving infrastructure to the cloud and we will continue expanding our ecosystem to give our customers the flexibility to choose the best deployment options for their businesses.
Q8. Do you see any convergence happening between operational data management and analytical data processing? And if yes, how?
Bob Wiederhold: Yes. Analytics can happen in real time or near real time in operational stores, and in batch mode. We have several customers who are deploying and have deployed complete solutions to integrate operational big data with real time analytical processing. LivePerson has done some incredibly innovative work here. They have been very open about the work they are doing and you can hear them tell their story here.
Q9. Do you have any plans to integrate your system with platforms for use in big data analytics?
Bob Wiederhold: Absolutely, and we are integrated today into many platforms, including Hadoop via our Couchbase Hadoop connector, and we have many customers using Couchbase Server with both realtime and batch mode analytics platforms. See Avira and LivePerson presentations for examples. We continue to work with big data ISVs to ensure our customers can easily integrate their systems with the analytics system of their choosing.
Bob has more than 25 years of high technology experience. Until an acquisition by IBM in 2008, Bob served as chairman, CEO, and president of Transitive Corporation, the worldwide leader in cross-platform virtualization with over 20 million users. Previously, he was president and CEO of Tality Corporation, the worldwide leader in electronic design services, whose revenues grew to almost $200 million and which had 1,500 employees worldwide.
Bob held several executive general management positions at Cadence Design Systems, Inc., an electronic design automation company, which he joined in 1985 as an early stage start-up and helped to grow to more than $1.5 billion during his 13 years at the company. Bob also headed High Level Design Systems, a successful electronic design automation start-up that was acquired by Cadence in 1996. Bob has extensive board experience having served on both public (Certicom, HLDS) and private company boards (Snaketech, Tality, Transitive, FanfareGroup).
–Magic Quadrant for Operational Database Management Systems. 16 October 2014. Analyst(s): Donald Feinberg, Merv Adrian, Nick Heudecker, Gartner.
“HBase and Hadoop are the only technologies proven to scale to dozens of petabytes on commodity servers, currently being used by companies such as Facebook, Twitter, Adobe and Salesforce.com.”–Monte Zweben.
Is it possible to turn Hadoop into a RDBMS? On this topic, I have interviewed Monte Zweben, Co-Founder and Chief Executive Officer of Splice Machine.
Q1. What are the main challenges of applications and operational analytics that support real-time, interactive queries on data updated in real-time for Big Data?
Monte Zweben: Let’s break down “real-time, interactive queries on data updated in real-time for Big Data”. “Real-time, interactive queries” means that results need to be returned in milliseconds to a few seconds.
For “Data updated in real-time” to happen, changes in data should be reflected in milliseconds. “Big Data” is often defined as dramatically increased volume, velocity, and variety of data. Of these three attributes, data volume typically dominates, because unlike the other attributes, its growth is virtually unbounded.
Traditional RDBMSs like MySQL or Oracle can support real-time, interactive queries on data updated in real-time, but they struggle to handle Big Data. They can only scale up on larger servers that can cost hundreds of thousands, if not millions, of dollars per server.
Big Data technologies such as Hadoop can easily handle Big Data volumes with their ability to scale out on commodity hardware. However, with their batch analytics heritage, they often struggle to provide real-time, interactive queries. They also lack ACID transactions to support data updated in real time.
So, real-time applications and operational analytics had to choose between real-time interactive queries on data updated in real-time, or Big Data volumes. With Splice Machine, these applications can have the best of both worlds: real-time interactive queries, the reliability of real-time updates on ACID transactions, and the ability to handle Big Data volumes with a 10x price/performance improvement over traditional RDBMSs.
Q2. You suggested that companies should replace their traditional RDBMS systems. Why and when? Do you really think this is always possible? What about legacy systems?
Monte Zweben: Companies should consider replacing their traditional RDBMSs when they experience significant cost or scaling issues. Our informal surveys of customers indicate that up to half of traditional RDBMSs experience cost or scaling issues. The biggest barrier to migrating from a traditional RDBMS to a new database like Splice Machine is converting custom stored procedures (e.g., PL/SQL code). Operational analytics often have limited custom stored procedure code, so the migration process is generally straightforward.
Operational applications typically have thousands of lines of custom stored procedure code, but in extreme cases it can run into hundreds of thousands to millions of lines of code. There are actually commercially-supported tools that will convert from PL/SQL to the Java needed for Splice Machine. We have typically seen them convert 70-95% of the code accurately, but it will obviously depend on the complexity of the original code. Financially, migration makes sense for many companies to get an ongoing 10x price/performance improvement, but there are cases when it does not make sense because converting custom code is too expensive.
Q3. Is scale-out the solution to Big Data at scale? Why?
Monte Zweben: Scale-out is definitely the critical technology to making Big Data work at scale. Scale-out leverages inexpensive, commodity hardware to parallelize queries to easily achieve a 10x price/performance improvement over existing database technologies.
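The idea of parallelizing a query across many inexpensive machines can be sketched as a scatter-gather aggregation. In the Python example below, worker threads stand in for commodity servers and in-memory lists stand in for shards; the function names and data are assumptions for illustration and are unrelated to Splice Machine's actual execution engine.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_count(shard):
    """Work done on one node: scan only the local shard."""
    return sum(shard), len(shard)

def scale_out_average(shards):
    """Scatter the query to every shard, then gather the small partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(partial_sum_count, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Three shards of different sizes, as if spread over three servers.
shards = [[10, 20, 30], [40, 50], [60]]
average = scale_out_average(shards)
print(average)  # 35.0
```

Each node returns only a (sum, count) pair, so adding servers increases scan capacity without increasing the amount of data that has to travel to the coordinator.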
Q4. You have announced your real-time relational database management system. What is special about Splice Machine’s Hadoop RDBMS?
Monte Zweben: We are the only Hadoop RDBMS. There are obviously many RDBMSs, but we are the only one with scale-out technology from Hadoop. Hadoop is the only scale-out technology proven to scale to dozens of petabytes on commodity hardware at companies like Facebook. There are other SQL-on-Hadoop technologies, but none of them can support real-time ACID transactions.
Q5. Hadoop-connected SQL databases do not eliminate “silos”. How do you handle this?
Q6. How did you manage to move Hadoop beyond its batch analytics heritage to power operational applications and real-time analytics?
Monte Zweben: At its core, Hadoop is a distributed file system (HDFS) where data cannot be updated or deleted. If you want to update or delete anything, you have to reload all the data (i.e., batch load). As a file system, it has very limited ability to seek specific data; instead, you use Java MapReduce programs to scan all of the data to find the data you need. It can easily take hours or even days for queries to return data (i.e., batch analytics). There is no way you could support a real-time application on top of HDFS and MapReduce.
By using HBase (a real-time key value store on top of HDFS), Splice Machine provides a full RDBMS on top of Hadoop.
You can now get real-time, interactive queries on real-time updated data on Hadoop, necessary to support operational applications and analytics.
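The difference between scanning a file and looking up a key can be made concrete with a toy comparison. The Python sketch below is not how HDFS or HBase are implemented: the two classes are hypothetical stand-ins, one for an append-only file that must be scanned end to end, the other for a store that keeps keys sorted and supports in-place updates.

```python
import bisect

class AppendOnlyFile:
    """Records can only be appended; finding one means scanning all of them."""

    def __init__(self):
        self.records = []

    def append(self, key, value):
        self.records.append((key, value))

    def find(self, key):
        result, scanned = None, 0
        for k, v in self.records:
            scanned += 1
            if k == key:
                result = v  # a later record would win, so keep scanning
        return result, scanned

class SortedKeyValueStore:
    """Keys are kept sorted, so a lookup is a binary search, not a scan."""

    def __init__(self):
        self.keys, self.values = [], []

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value          # update in place
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

log, kv = AppendOnlyFile(), SortedKeyValueStore()
for n in range(1000):
    log.append(f"row{n:04d}", n)
    kv.put(f"row{n:04d}", n)

kv.put("row0500", -1)                        # a single-row update
value, scanned = log.find("row0500")
print(value, scanned, kv.get("row0500"))
```

The file-based lookup touches all 1,000 records to answer one question, while the keyed store reaches the updated row in a handful of comparisons. That gap is what separates batch analytics from interactive queries.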
Q7. How do you use Apache Derby™ and Apache HBase™/Hadoop?
Monte Zweben: Splice Machine marries two proven technology stacks: Apache Derby for ANSI SQL and HBase/Hadoop for proven scale-out technology. With over 15 years of development, Apache Derby is a Java-based SQL database. Splice Machine chose Derby because it is a full-featured ANSI SQL database, lightweight (<3 MB), and easy to embed into the HBase/Hadoop stack.
HBase and Hadoop are the only technologies proven to scale to dozens of petabytes on commodity servers, currently being used by companies such as Facebook, Twitter, Adobe and Salesforce.com. Splice Machine chose HBase and Hadoop because of their proven auto-sharding, replication, and failover technology.
Q8. Why did you replace the storage engine in Apache Derby with HBase?
Q9. Why did you redesign the planner, optimizer, and executor of Apache Derby?
Monte Zweben: We redesigned the planner, optimizer, and executor of Derby because Splice Machine has a distributed computing infrastructure instead of its old shared-disk storage. Distributed computing requires a functional re-architecting because computation must be distributed to where the data is, instead of moving the data to the computation.
Q10. What are the main benefits for developers and database architects who build applications?
Monte Zweben: There are two main benefits to Splice Machine for developers and database architects. First, no longer is data scaling a barrier to using massive amounts of data in an application; you no longer need to prune data or rewrite applications to do unnatural acts like manual sharding. Second, you can enjoy the scaling with all the critical features of an RDBMS – strong consistency, joins, secondary indexes for fast lookups, and reliable updates with transactions. Without those features, developers have to implement those functions for each application, a costly, time-consuming, and error-prone process.
Monte Zweben, Co-Founder and Chief Executive Officer, Splice Machine
A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Branch Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.
In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of Red Prairie. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company, and is the chairman of Clio Music, an advanced music research and development company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.
Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie Mellon’s School of Computer Science. Monte’s involvement with CMU, which has been a long-time leader in distributed computing and Big Data research, helped inspire the original concept behind Splice Machine.
ODBMS.org: Several Free Resources on Hadoop.
“To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”)”–Mike Carey.
The AsterixDB Big Data Management System (BDMS) is the result of approximately four years of R&D involving researchers at UC Irvine, UC Riverside, and Oracle Labs. The AsterixDB code base currently consists of over 250K lines of Java code that has been co-developed by project staff and students at UCI and UCR.
The AsterixDB project has been supported by the U.S. National Science Foundation as well as by several generous industrial gifts.
Q1. Why build a new Big Data Management System?
Mike Carey: When we started this project in 2009, we were looking at a “split universe” – there were your traditional parallel data warehouses, based on expensive proprietary relational DBMSs, and then there was the emerging Hadoop platform, which was free but low-function in comparison and wasn’t based on the many lessons known to the database community about how to build platforms to efficiently query large volumes of data. We wanted to bridge those worlds, and handle “modern data” while we were at it, by taking into account the key lessons from both sides.
To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”).
We felt that the Big Data world, once the initial Hadoop furor started to fade a little, would benefit from having a platform that could offer things like:
- a flexible data model that could handle data scenarios ranging from “schema first” to “schema never”;
- a full query language with at least the expressive power of SQL;
- support for data storage, data management, and automatic indexing;
- support for a wide range of query sizes, with query processing cost being proportional to the given query;
- support for continuous data ingestion, hence the accumulation of Big Data;
- the ability to scale up gracefully to manage and query very large volumes of data using commodity clusters; and,
- built-in support for today’s common “Big Data data types”, such as textual, temporal, and simple spatial data.
So that’s what we set out to do.
Q2. What was wrong with the current Open Source Big Data Stack?
Mike Carey: First, we should mention that some reviewers back in 2009 thought we were crazy or stupid (or both) to not just be jumping on the Hadoop bandwagon – but we felt it was important, as academic researchers, to look beyond Hadoop and be asking the question “okay, but after Hadoop, then what?”
We recognized that MapReduce was great for enabling developers to write massively parallel jobs against large volumes of data without having to “think parallel” – just focusing on one piece of data (map) or one key-sharing group of data (reduce) at a time. As a platform for “parallel programming for dummies”, it was (and still is) very enabling! It also made sense, for expedience, that people were starting to offer declarative languages like Pig and Hive, compiling them down into Hadoop MapReduce jobs to improve programmer productivity – raising the level much like what the database community did in moving to the relational model and query languages like SQL in the 70’s and 80’s.
One thing that we felt was wrong for sure in 2009 was that higher-level languages were being compiled into an assembly language with just two instructions, map and reduce. We knew from Ted Codd and relational history that more instructions – like the relational algebra’s operators – were important – and recognized that the data sorting that Hadoop always does between map and reduce wasn’t always needed.
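Editor's illustration: the two-instruction model described above can be sketched in a few lines of Python. This is a toy, single-process imitation and not Hadoop's API; the names `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical. It makes the unconditional sort between the map and reduce steps visible.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Toy single-process imitation of a map -> sort -> reduce pipeline."""
    # Map: each input record yields zero or more (key, value) pairs.
    pairs = [kv for rec in records for kv in mapper(rec)]
    # Shuffle: sort by key so that equal keys become adjacent.
    # This step always runs, whether or not the job needs sorted data.
    pairs.sort(key=itemgetter(0))
    # Reduce: one call per group of pairs sharing a key.
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

def word_mapper(line):
    return [(word, 1) for word in line.split()]

def count_reducer(word, ones):
    return (word, sum(ones))

counts = map_reduce(["a b a", "b c"], word_mapper, count_reducer)
print(counts)
```

Any higher-level operation (a join, a filter, a group-by) has to be expressed as one or more such passes, which is the limitation the answer points at.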
Trying to simulate everything with just map and reduce on Hadoop made “get something better working fast” sense, but not longer-term technical sense. As for HDFS, what seemed “wrong” about it under Pig and Hive was its being based on giant byte stream files and not on “data objects”, which basically meant file scans for all queries and lack of indexing. We decided to ask “okay, suppose we’d known that Big Data analysts were going to mostly want higher-level languages – what would a Big Data platform look like if it were built ‘on purpose’ for such use, instead of having incrementally evolved from HDFS and Hadoop?”
Again, our idea was to try and bring together the best ideas from both the database world and the distributed systems world. (I guess you could say that we wanted to build a Big Data Reese’s Cup… ☺)
Q3. AsterixDB has been designed to manage vast quantities of semi-structured data. How do you define semi-structured data?
Mike Carey: In the late 90’s and early 2000’s there was a bunch of work on that – on relaxing both the rigid/flat nature of the relational model as well as the requirement to have a separate, a priori specification of the schema (structure) of your data. We felt that this flexibility was one of the things – aside from its “free” price point – drawing people to the Hadoop ecosystem (and the key-value world) instead of the parallel data warehouse ecosystem.
In the Hadoop world you can start using your data right away, without spending 3 months in committee meetings to decide on your schema and indexes and getting DBA buy-in. To us, semi-structured means schema flexibility, so in AsterixDB, we let you decide how much of your schema you have to know and/or choose to reveal up front, and how much you want to leave to be self-describing and thus allow it to vary later. And it also means not requiring the world to be flat – so we allow nesting of records, sets, and lists. And it also means dealing with textual data “out of the box”, because there’s so much of that now in the Big Data world.
Q4. The motto of your project is “One Size Fits a Bunch”. You claim that AsterixDB can offer better functionality, manageability, and performance than gluing together multiple point solutions (e.g., Hadoop + Hive + MongoDB). Could you please elaborate on this?
Mike Carey: Sure. If you look at current Big Data IT infrastructures, you’ll see a lot of different tools and systems being tied together to meet an organization’s end-to-end data processing requirements. In between systems and steps you have the glue – scripts, workflows, and ETL-like data transformations – and if some of the data needs to be accessible faster than a file scan, it’s stored not just in HDFS, but also in a document store or a key-value store.
This just seems like too many moving parts. We felt we could build a system that could meet more (not all!) of today’s requirements, like the ones I listed in my answer to the first question.
If your data is in fewer places or can take a flight with fewer hops to get the answers, that’s going to be more manageable – you’ll have fewer copies to keep track of and fewer processes that might have hiccups to watch over. If you can get more done in one system, obviously that’s more functional. And in terms of performance, we’re not trying to out-perform the specialty systems – we’re just trying to match them on what each does well. If we can do that, you can use our new system without needing as many puzzle pieces and can do so without making a performance sacrifice.
We’ve recently finished up a first comparison of how we perform on tasks that systems like parallel relational systems, MongoDB, and Hive can do – and things look pretty good so far for AsterixDB in that regard.
Q5. AsterixDB has been combining ideas from three distinct areas — semi-structured data management, parallel databases, and data-intensive computing. Could you please elaborate on that?
Mike Carey: Our feeling was that each of these areas has some ideas that are really important for Big Data. Borrowing from semi-structured data ideas, but also more traditional databases, leads you to a place where you have flexibility that parallel databases by themselves do not. Borrowing from parallel databases leads to scale-out that semi-structured data work didn’t provide (since scaling is orthogonal to data model) and with query processing efficiencies that parallel databases offer through techniques like hash joins and indexing – which MapReduce-based data-intensive computing platforms like Hadoop and its language layers don’t give you. Borrowing from the MapReduce world leads to the open-source “pricing” and flexibility of Hadoop-based tools, and argues for the ability to process some of your queries directly over HDFS data (which we call “external data” in AsterixDB, and do also support in addition to managed data).
Q6. How does the AsterixDB Data Model compare with the data models of NoSQL data stores, such as document databases like MongoDB and Couchbase, simple key/value stores like Riak and Redis, and column-based stores like HBase and Cassandra?
Mike Carey: AsterixDB’s data model is flexible – we have a notion of “open” versus “closed” data types – it’s a simple idea but it’s unique as far as we know. When you define a data type for records to be stored in an AsterixDB dataset, you can choose to pre-define any or all of the fields and types that objects to be stored in it will have – and if you mark a given type as being “open” (or let the system default it to “open”), you can store objects there that have those fields (and types) as well as any/all other fields that your data instances happen to have at insertion time.
Or, if you prefer, you can mark a type used by a dataset as “closed”, in which case AsterixDB will make sure that all inserted objects will have exactly the structure that your type definition specifies – nothing more and nothing less.
(We do allow fields to be marked as optional, i.e., nullable, if you want to say something about their type without mandating their presence.)
What this gives you is a choice! If you want to have the total, last-minute flexibility of MongoDB or Couchbase, with your data being self-describing, we support that – you don’t have to predefine your schema if you use data types that are totally open. (The only thing we insist on, at the moment, is that every type must have a key field or fields – we use keys when sharding datasets across a cluster.)
Structurally, our data model was JSON-inspired – it’s essentially a schema language for a JSON superset – so we’re very synergistic with MongoDB or Couchbase data in that regard.
On the other end of the spectrum, if you’re still a relational bigot, you’re welcome to make all of your data types be flat – don’t use features like nested records, lists, or bags in your record definitions – and mark them all as “closed” so that your data matches your schema. With AsterixDB, we can go all the way from traditional relational to “don’t ask, don’t tell”. As for systems with BigTable-like “data models” – I’d personally shy away from calling those “data models”.
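Editor's illustration: the "open" versus "closed" distinction described in this answer can be mimicked with a small Python check. This is a sketch of the semantics as stated in the interview, not AsterixDB code; `RecordType`, `conforms`, and the example field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RecordType:
    """Hypothetical stand-in for a record type declaration."""
    fields: dict                       # declared field name -> Python type
    optional: frozenset = frozenset()  # declared fields that may be absent
    is_open: bool = True               # open types accept undeclared fields

def conforms(record, rtype):
    """Return True if `record` may be stored under `rtype`."""
    for name, ftype in rtype.fields.items():
        if record.get(name) is None:
            # A missing (or null) field is fine only if declared optional.
            if name in rtype.optional:
                continue
            return False
        if not isinstance(record[name], ftype):
            return False
    extras = set(record) - set(rtype.fields)
    # Closed types require exactly the declared structure.
    return rtype.is_open or not extras

user_fields = {"id": int, "name": str, "email": str}
open_user = RecordType(user_fields, optional=frozenset({"email"}), is_open=True)
closed_user = RecordType(user_fields, optional=frozenset({"email"}), is_open=False)

rec = {"id": 1, "name": "Ann", "nickname": "A"}
print(conforms(rec, open_user))    # extra field accepted by the open type
print(conforms(rec, closed_user))  # extra field rejected by the closed type
```

A type with no declared fields beyond the key and `is_open=True` corresponds to the fully self-describing end of the spectrum; a flat, closed type corresponds to the relational end.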
Q7. How do you handle horizontal scaling? And vertical scaling?
Mike Carey: We scale out horizontally using the same sort of divide-and-conquer techniques that have been used in commercial parallel relational DBMSs for years now, and more recently in Hadoop as well. That is, we horizontally partition both data (for storage) and queries (when processed) across the nodes of commodity clusters. Basically, our innards look very like those of systems such as Teradata or Parallel DB2 or PDW from Microsoft – we use join methods like parallel hybrid hash joins, and we pay attention to how data is currently partitioned to avoid unnecessary repartitioning – but have a data model that’s way more flexible. And we’re open source and free….
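Editor's illustration: a minimal sketch of the divide-and-conquer idea described above, assuming hash partitioning on the join key. It uses a plain in-memory hash join per partition, which is simpler than the parallel hybrid hash join the answer mentions, and it runs the partitions in a loop rather than on separate nodes. All function names are hypothetical.

```python
from collections import defaultdict

def partition(records, key, num_nodes):
    """Hash-partition records on `key` into `num_nodes` lists."""
    parts = [[] for _ in range(num_nodes)]
    for rec in records:
        parts[hash(rec[key]) % num_nodes].append(rec)
    return parts

def local_hash_join(left, right, key):
    """Join two co-located partitions using an in-memory hash table."""
    table = defaultdict(list)
    for l in left:
        table[l[key]].append(l)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]

def parallel_join(left, right, key, num_nodes=4):
    """Partition both inputs the same way, then join partition by partition."""
    left_parts = partition(left, key, num_nodes)
    right_parts = partition(right, key, num_nodes)
    out = []
    # Matching keys always land in the same partition index, so each
    # pair of partitions can be joined independently of the others.
    for lp, rp in zip(left_parts, right_parts):
        out.extend(local_hash_join(lp, rp, key))
    return out

users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
orders = [{"uid": 1, "item": "x"}, {"uid": 1, "item": "y"}, {"uid": 3, "item": "z"}]
joined = parallel_join(users, orders, "uid")
print(sorted(row["item"] for row in joined))
```

If an input is already stored partitioned on the join key, the `partition` call for that side can be skipped, which is the kind of repartitioning avoidance the answer refers to.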
We scale vertically (within one node) in two ways. First of all, we aren’t memory-dependent in the way that many of the current Big Data Analytics solutions are; it’s not the case that you have to buy a big enough cluster so that your data, or at least your intermediate results, can be memory-resident.
Instead, our physical operators (for joins, sorting, aggregation, etc.) all spill to disk if needed – so you can operate on Big Data partitions without getting “out of memory” errors. The other way is that we allow nodes to hold multiple partitions of data; that way, one can also use multi-core nodes effectively.
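Editor's illustration: spilling to disk can be sketched with a classic external merge sort. This is a generic textbook technique shown under a simplified memory model (a cap on the number of buffered items); it is not taken from the AsterixDB code base, and the names `external_sort` and `memory_budget` are hypothetical.

```python
import heapq
import pickle
import tempfile

def _spill(sorted_run):
    """Write one sorted run to a temporary file and rewind it."""
    f = tempfile.TemporaryFile()
    for item in sorted_run:
        pickle.dump(item, f)
    f.seek(0)
    return f

def _read_run(f):
    """Stream items back from a spilled run, one at a time."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return

def external_sort(items, memory_budget=1000):
    """Yield `items` in sorted order, buffering at most `memory_budget` items."""
    runs, buffer = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) >= memory_budget:
            # Buffer is full: sort it and spill it to disk as one run.
            runs.append(_spill(sorted(buffer)))
            buffer = []
    streams = [_read_run(f) for f in runs] + [iter(sorted(buffer))]
    try:
        # The merge keeps only one pending item per run in memory.
        yield from heapq.merge(*streams)
    finally:
        for f in runs:
            f.close()

print(list(external_sort([5, 3, 9, 1, 7, 2, 8], memory_budget=3)))
```

The input may be far larger than the budget; only the current buffer and one item per spilled run are held in memory at any time.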
Q8. What performance figures do you have for AsterixDB?
Mike Carey: As I mentioned earlier, we’ve completed a set of initial performance tests on a small cluster at UCI with 40 cores and 40 disks, and the results of those tests can be found in a recently published AsterixDB overview paper that’s hanging on our project web site’s publication page (http://asterixdb.ics.uci.edu/publications.html).
We have a couple of other performance studies in flight now as well, and we’ll be hanging more information about those studies in the same place on our web site when they’re ready for human consumption. There’s also a deeper dive paper on the AsterixDB storage manager that has some performance results regarding the details of scaling, indexing, and so on; that’s available on our web site too. The quick answer to “how does AsterixDB perform” is that we’re already quite competitive with other systems that have narrower feature sets – which we’re pretty proud of.
Q9. You mentioned support for continuous data ingestion. How does that work?
Mike Carey: We have a special feature for that in AsterixDB – we have a built-in notion of Data Feeds that are designed to simplify the lives of users who want to use our system for warehousing of continuously arriving data.
We provide Data Feed adaptors to enable outside data sources to be defined and plugged in to AsterixDB, and then one can “connect” a Data Feed to an AsterixDB data set and the data will start to flow in. As the data comes in, we can optionally dispatch a user-defined function on each item to do any initial information extraction/annotation that you want. Internally, this creates a long-running job that our system monitors – if data starts coming too fast, we offer various policies to cope with it, ranging from discarding data to sampling data to adding more UDF computation tasks (if that’s the bottleneck). More information about this is available in the Data Feeds tech report on our web site, and we’ll soon be documenting this feature in the downloadable version of AsterixDB. (Right now it’s there but “hidden”, as we have been testing it first on a set of willing UCI student guinea pigs.)
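Editor's illustration: a toy model of two of the congestion policies mentioned above, discarding and sampling, with an optional per-item function applied on arrival. This is a sketch of the idea only and does not reflect AsterixDB's Data Feed interfaces; `FeedIntake`, `high_water`, and `keep_every` are hypothetical names, and the third policy (adding more UDF computation tasks) is not modeled.

```python
from collections import deque

class FeedIntake:
    """Hypothetical intake queue between a feed adaptor and storage."""

    def __init__(self, high_water, policy="discard", keep_every=2, udf=None):
        self.queue = deque()
        self.high_water = high_water   # queue length that counts as congested
        self.policy = policy           # "discard" or "sample"
        self.keep_every = keep_every   # under "sample", keep 1 in N arrivals
        self.udf = udf or (lambda item: item)
        self.dropped = 0
        self._seen_while_congested = 0

    def offer(self, item):
        """Accept or drop one arriving item; return True if it was queued."""
        if len(self.queue) >= self.high_water:
            if self.policy == "discard":
                self.dropped += 1
                return False
            if self.policy == "sample":
                self._seen_while_congested += 1
                if self._seen_while_congested % self.keep_every != 0:
                    self.dropped += 1
                    return False
        # Apply the user-defined function before the item is queued.
        self.queue.append(self.udf(item))
        return True

    def drain(self, n):
        """Simulate storage consuming up to `n` queued items."""
        return [self.queue.popleft() for _ in range(min(n, len(self.queue)))]

discarding = FeedIntake(high_water=2, policy="discard")
sampling = FeedIntake(high_water=2, policy="sample", keep_every=2)
for i in range(5):
    discarding.offer(i)
    sampling.offer(i)
print(list(discarding.queue), discarding.dropped)
print(list(sampling.queue), sampling.dropped)
```

Deterministic "keep every Nth item" sampling is used here so the behavior is easy to follow; a random sample would serve the same purpose.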
Q10. What is special about the AsterixDB Query Language? Why not use SQL?
Mike Carey: When we set out to define the query language for AsterixDB, we decided to define our own new language – since it seemed like everybody else was doing that at the time (witness Pig, Jaql, HiveQL, etc.) – one aimed at our data model.
SQL doesn’t handle nested or open data very well, so extending ANSI/ISO SQL seemed like a non-starter – that was also based on some experience working on SQL3 in the late 90’s. (Take a look at Oracle’s nested tables, for example.) Based on our team’s backgrounds in XML querying, we actually started with XQuery, which was developed by a team of really smart people from the SQL world (including Don Chamberlin, father of SQL) as well as from the XML world and the functional programming world. We took XQuery and then started throwing the stuff overboard that wasn’t needed for JSON or that seemed like a poor feature that had been added for XPath compatibility.
What remained was AQL, and we think it’s a pretty nice language for semistructured data handling. We periodically do toy with the notion of adding a SQL-like re-skinning of AQL to make SQL users feel more at home – and we may well do that in the future – but that would be different than “real SQL”. (The N1QL effort at Couchbase is doing something along those lines, language-wise, as an example. The SQL++ design from UCSD is another good example there.)
Q11. What level of concurrency and recovery guarantees does AsterixDB offer?
Mike Carey: We offer transaction support that’s akin to that of current NoSQL stores. That is, we promise record-level ACIDity – so inserting or deleting a given record will happen as an atomic, durable action. However, we don’t offer general-purpose distributed transactions. We support an arbitrary number of secondary indexes on data sets, and we’ll keep all the indexes on a data set transactionally consistent – that we can do because secondary index entries for a given record live in the same data partition as the record itself, so those transactions are purely local.
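Editor's illustration: the reason co-located secondary indexes keep transactions local can be sketched as follows. Each partition owns both the records and the secondary index entries for those records, so one per-partition lock is enough to change them together. This models atomicity only (there is no log, so no durability), and `Partition`, `Dataset`, and the field names are hypothetical.

```python
import threading
from collections import defaultdict

class Partition:
    """One data partition holding records plus their secondary index entries."""

    def __init__(self, indexed_fields):
        self.primary = {}
        self.secondary = {f: defaultdict(set) for f in indexed_fields}
        self._lock = threading.Lock()

    def _unindex(self, key, record):
        for field, index in self.secondary.items():
            if field in record:
                index[record[field]].discard(key)

    def insert(self, key, record):
        # Record and index entries change under one local lock.
        with self._lock:
            old = self.primary.get(key)
            if old is not None:
                self._unindex(key, old)
            self.primary[key] = record
            for field, index in self.secondary.items():
                if field in record:
                    index[record[field]].add(key)

    def delete(self, key):
        with self._lock:
            old = self.primary.pop(key, None)
            if old is not None:
                self._unindex(key, old)

    def lookup(self, field, value):
        with self._lock:
            keys = self.secondary[field].get(value, ())
            return [self.primary[k] for k in sorted(keys)]

class Dataset:
    """Records are sharded by primary key; indexes follow the records."""

    def __init__(self, num_partitions, indexed_fields):
        self.parts = [Partition(indexed_fields) for _ in range(num_partitions)]

    def _part(self, key):
        return self.parts[hash(key) % len(self.parts)]

    def insert(self, key, record):
        self._part(key).insert(key, record)

    def delete(self, key):
        self._part(key).delete(key)

    def lookup(self, field, value):
        # A secondary lookup has to ask every partition.
        return [r for p in self.parts for r in p.lookup(field, value)]

ds = Dataset(3, ["city"])
ds.insert(1, {"id": 1, "city": "Irvine"})
ds.insert(2, {"id": 2, "city": "Irvine"})
ds.insert(3, {"id": 3, "city": "Riverside"})
print(sorted(r["id"] for r in ds.lookup("city", "Irvine")))
```

The trade-off in this layout is that writes touch a single partition while secondary-index reads fan out to all of them.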
Q12. How does AsterixDB compare with Hadoop? What about Hadoop Map/Reduce compatibility?
Mike Carey: I think we’ve already covered most of that – Hadoop MapReduce is an answer to low-level “parallel programming for dummies”, and it’s great for that – and languages on top like Pig Latin and HiveQL are better programming abstractions for “data tasks” but have runtimes that could be much better. We started over, much as the recent flurry of Big Data analytics platforms are now doing (e.g., Impala, Spark, and friends), but with a focus on scaling to memory-challenging data sizes. We do have a MapReduce compatibility layer that goes along with our Hyracks runtime layer – Hyracks is the name of our internal dataflow runtime layer – but our MapReduce compatibility layer is not related to (or connected to) the AsterixDB system.
Q13. How does AsterixDB relate to Hadapt?
Mike Carey: I’m not familiar with Hadapt, per se, but I read the HadoopDB work that fed into it.
We’re architecturally very different – we’re not Hadoop-based at all – I’d say that HadoopDB was more of an expedient hybrid coupling of Hadoop and databases, to get some of the indexing and local query efficiency of an existing database engine quickly in the Hadoop world. We were thinking longer term, starting from first principles, about what a next-generation BDMS might look like. AsterixDB is what we came up with.
Q14. How does AsterixDB relate to Spark?
Mike Carey: Spark is aimed at fast Big Data analytics – its data is coming from HDFS, and the task at hand is to scan and slice and dice and process that data really fast. Things like Shark and SparkSQL give users SQL query power over the scanned data, but Spark in general is really catching fire, it appears, due to its applicability to Big Machine Learning tasks. In contrast, we’re doing Big Data Management – we store and index and query Big Data. It would be a very interesting/useful exercise for us to explore how to make AsterixDB another source where Spark computations can get input data from and send their results to, as we’re not targeting the more complex, in-memory computations that Spark aims to support.
Q15. How can others contribute to the project?
Mike Carey: We would love to see this start happening – and we’re finally feeling more ready for that, and even have some NSF funding to make AsterixDB something that others in the Big Data community can utilize and share.
(Note that our system is Apache-style open source licensed, so there are no “gotchas” lurking there.)
Some possibilities are:
(1) Others can start to use AsterixDB to do real exploratory Big Data projects, or to teach about Big Data (or even just semistructured data) management. Each time we’ve worked with trial users we’ve gained some insights into our feature set, our query optimizations, and so on – so this would help contribute by driving us to become better and better over time.
(2) Folks who are studying specific techniques for dealing with modern data – e.g., new structures for indexing spatiotemporaltextual (☺) data – might consider using AsterixDB as a place to try out their new ideas.
(This is not for the meek, of course, as right now effective contributors need to be good at reading and understanding open source software without the benefit of a plethora of internal design documents or other hints.) We also have some internal wish lists of features we wish we had time to work on – some of which are even doable from “outside”, e.g., we’d like to have a much nicer browser-based workbench for users to use when interacting with and managing an AsterixDB cluster.
(3) Students or other open source software enthusiasts who download and try our software and get excited about it – who then might want to become an extension of our team – should contact us and ask about doing so. (Try it first, though!) We would love to have more skilled hands helping with fixing bugs, polishing features, and making the system better – it’s tough to build robust software in a university setting, and we would especially welcome contributors from companies.
Thanks very much for this opportunity to share what we’ve been doing!
Michael J. Carey is a Bren Professor of Information and Computer Sciences at UC Irvine.
Before joining UCI in 2008, Carey worked at BEA Systems for seven years and led the development of BEA’s AquaLogic Data Services Platform product for virtual data integration. He also spent a dozen years teaching at the University of Wisconsin-Madison, five years at the IBM Almaden Research Center working on object-relational databases, and a year and a half at e-commerce platform startup Propel Software during the infamous 2000-2001 Internet bubble. Carey is an ACM Fellow, a member of the National Academy of Engineering, and a recipient of the ACM SIGMOD E.F. Codd Innovations Award. His current interests all center around data-intensive computing and scalable data management (a.k.a. Big Data).
– AsterixDB Big Data Management System (BDMS): Downloads, Documentation, Asterix Publications.
“The Hadoop platform indeed provides the ability to efficiently process large-scale data at a price point we haven’t been able to justify with traditional technology. That said, not every technology process requires Hadoop; therefore, we have to be smart about which processes we deploy on Hadoop and which are a better fit for traditional technology (for example, RDBMS).”–Kevin Murray.
I wanted to learn how American Express is taking advantage of analysing big data.
I have interviewed Sastry Durvasula, Vice President – Technology, American Express, and Kevin Murray, Vice President – Technology, American Express.
Q1. With the increasing demand for mobile and digital capabilities, how are American Express’ customer expectations changing?
SASTRY DURVASULA: American Express customers expect us to know them, to understand and anticipate their preferences and personalize our offerings to meet their specific needs. As the world becomes increasingly mobile, our Card Members expect to be able to engage with us whenever, wherever and using whatever device or channel they prefer.
In addition, merchants, small businesses and corporations also want increased value, insights and relevance from our global network.
Q2. Could you explain what American Express’ big data strategy is?
SD: American Express seeks to leverage big data to deliver innovative products in the payments and commerce space that provide value to our customers. This is underpinned by best-in-class engineering and decision science.
From a technical perspective, we are advancing an enterprise-wide big data platform that leverages open source technologies like Hadoop, integrating it with our analytical and operational capabilities across the various business lines. This platform also powers strategic partnerships and real-time experiences through emerging digital channels. Examples include Amex Offers, which connects our Card Members and merchants through relevant and personalized digital offers; an innovative partnership with TripAdvisor to unlock exclusive benefits; insights and tools for our B2B partners and small businesses; and advanced credit and fraud risk management.
Additionally, as always, we seek to leverage data responsibly and in a privacy-controlled environment. Trust and security are hallmarks of our brand. As we leverage big data to create new products and services, these two values remain at the forefront.
Q3. What is the “value” you derive by analysing big data for American Express?
SD: Within American Express, our Technology and Risk & Information Management organizations partner with our lines of business to create new opportunities to drive commerce and serve customers across geographies with the help of big data. Big data is one of our most important tools in being the company we want to be – one that identifies solutions to customers’ needs and helps us deliver what customers want today and what they may want in the future.
Q4. What metrics do you use to monitor big data analytics at American Express?
SD: Big data investments are no different than any other investments in terms of the requirement for quantitative and qualitative ROI metrics with pre- and post-measurements that assess the projects’ value for revenue generation, cost avoidance and customer satisfaction. There is also the recognition that some of the investments, especially in the big data arena, are strategic and longer term in nature, and the value generated should be looked at from that perspective.
Q5. Could you explain how you implemented your big data infrastructure platform at Amex?
KEVIN MURRAY: We started small and expanded as our use cases grew over time, about once or twice a year.
We make it a practice to reassess the hardware and software state within the industry before each major expansion to determine whether any external changes should alter the deployment path we have chosen.
Q6. How did you select the components for your big data infrastructure platform, choosing among the various competing compute and storage solutions available today?
KM: Our research told us low-cost commodity servers with local storage were the common deployment stack across the industry. We made an assessment of industry offerings and evaluated them against our objectives to determine a good balance of cost, capabilities and time to market.
Q7. How did you unleash big data across your enterprise and put it to work in a sustainable and agile environment?
SD: We engineered our enterprise-wide big data platform to foster R&D and rapid development of use cases, while delivering highly available production applications. This allows us to be adaptable and agile, scaling up or redeploying, as needed, to meet market and business demands. With the Risk and Information Management team, we established Big Data Labs comprising top-notch decision scientists and engineers to help democratize big data, leveraging self-service tools, APIs and common libraries of algorithms.
Q8. What are the most significant challenges you have encountered so far?
SD: An ongoing challenge is balancing our big data investment between immediate needs and research or innovations that will drive the next generation of capabilities. You can’t focus solely on one or the other but have to find a balance.
Another key challenge is ensuring we are focused on driving outcomes that are meaningful to customers – that are responsive to their current and anticipated needs.
Q9. What did you learn along the way?
KM: The Hadoop platform indeed provides the ability to efficiently process large-scale data at a price point we haven’t been able to justify with traditional technology. That said, not every technology process requires Hadoop; therefore, we have to be smart about which processes we deploy on Hadoop and which are a better fit for traditional technology (for example, RDBMS). Some components of the ecosystem are mature and work well, and others require some engineering to get to an enterprise-ready state. In the end, it’s an exciting journey to offer new innovation to our business.
Q10. Anything else you wish to add?
KM: The big data industry is evolving at lightning speed with new products and services coming to market every day. I think this is being driven by the enterprise’s appetite for something new and innovative that leverages the power of compute, network and storage advancements in the marketplace, combined with a groundswell of talent in the data science domain, pushing academic ideas into practical business use cases. The result is a wealth of new offerings in the marketplace – from ideas and early startups to large-scale mission-critical solutions. This is providing choice to enterprises like we’ve never seen before, and we are focused on maximizing this advantage to bring groundbreaking products and opportunities to life.
Sastry Durvasula, Vice President – Technology, American Express
Sastry Durvasula is Vice President and Global Technology Head of Information Management and Digital Capabilities within the Technology organization at American Express. In this role, Sastry leads IT strategy and transformational development to power the company’s data-driven capabilities and digital products globally. His team also delivers enterprise-wide analytics and business intelligence platforms, and supports critical risk, fraud and regulatory demands. Most recently, Sastry and his team led the launch of the company’s big data platform and transformation of its enterprise data warehouse, which are powering the next generation of information, analytics and digital capabilities. His team also led the development of the company’s API strategy, as well as the Sync platform to deliver innovative products, drive social commerce and launch external partnerships.
Kevin Murray, Vice President – Technology, American Express
Kevin Murray is Vice President of Information Management Infrastructure & Integration within the Technology organization at American Express. Throughout his 25+ year career, he has brought emerging technologies into large enterprises, and most recently launched the big data infrastructure platform at American Express. His team architects and implements a wide range of information management capabilities to leverage the power of increasing compute and storage solutions available today.
Presenting at Strata/Hadoop World NY
Big Data: A Journey of Innovation
Thursday, October 16, 2014, at 1:45-2:25 p.m. Eastern
Room: 1 CO3/1 CO4
The power of big data has become the catalyst for American Express to accelerate transformation for the digital age, drive innovative products, and create new commerce opportunities in a meaningful and responsible way. With the increasing demand for mobile and digital capabilities, the customer expectation for real-time information and differentiated experiences is rapidly changing. Big data offers a solution that enables this organization to use their proprietary closed-loop network to bring together consumers and merchants around the world, adding value to each in a way that is individualized and unique.
During their presentation, Sastry Durvasula and Kevin Murray will discuss American Express’ ongoing big data journey of transformation and innovation. How did the company unleash big data across its global network and put it to work in a sustainable and agile environment? How is it delivering offers using digital channels relevant to their Card Members and partners? What have they learned along the way? Sastry and Kevin will address these questions and share their experiences and insights on the company’s big data strategy in the digital ecosystem.