“In normal English usage the word resilience is taken to mean the power to resume original shape after compression; in the context of data base management the term data base resilience is defined as the ability to return to a previous state after the occurrence of some event or action which may have changed that state.
from “P. A Dearnley, School of Computing Studies, University of East Anglia, Norwich NR4 7TJ , 1975 “
On the topic database resilience, I have interviewed Seth Proctor, Chief Technology Officer at NuoDB.
Q1. When is a database truly resilient?
Seth Proctor: That is a great question, and the quotation above is a good place to start. In general, resiliency is about flexibility. It’s kind of the view that you should bend but not break. Individual failures (server crashes, disk wear, tripping over power cables) are inevitable but don’t have to result in systemic outages.
In some cases that means reacting to failure in a way that’s non-disruptive to the overall system.
The redundant networks in modern airplanes are a great example of this model. Other systems take a deeper view, watching global state to proactively re-route activity or replace components that may be failing. This is the model that keeps modern telecom networks running reliably. There are many views applied in the database world, but to me a resilient database is one that can react automatically to likely or active failures so that applications continue operating with full access to their data even as failures occur.
Q2. Is database resilience the same as disaster recovery?
Seth Proctor: I don’t believe it is. In traditional systems there is a primary site where the database is “active” and updates are replicated from there to other sites. In the case of failure to the primary site, one of the replicas can take over. Maintaining that replica (or replicas) is usually the key part of Disaster Recovery.
Sometimes that replica is missing the latest changes, and usually the act of “failing over” to a replica involves some window where the database is unavailable. This leads to operational terms like “hot stand-by” where failing over is faster but still delayed, complicated and failure-prone.
True resiliency, in my opinion, comes from systems that are designed to always be available even as some components fail. Reacting to failure efficiently is a key requirement, as is survival in case of complete site loss, so replicating data to multiple locations is critical to resiliency. At a minimum, however, a resilient data management solution cannot lose data (sort of “primum non nocere” for servers) and must be able to provide access to all data even as servers fail. Typical Disaster Recovery solutions on their own are not sufficient. A resilient solution should also be able to continue operations in the face of expected failures: hardware and software upgrades, network updates and service migration.
This is especially true as we push out to hybrid cloud deployments.
Q3. What are the database resilience requirements and challenges, especially in this era of Big Data?
Seth Proctor: There is no one set of requirements since each application has different goals with different resiliency needs. Big Data is often more about speeds and volume while in the operational space correctness, latency and availability are key. For instance, if you’re handling high-value bank transactions you have different needs than something doing weekly trend-analysis on Internet memes. The great thing about “the cloud” is the democratization of features and the new systems that have evolved around scale-out architectures. Things like transactional consistency were originally designed to make failures simpler and systems more resilient; as consistent data solutions scale out in cloud models it’s simpler to make any application resilient without sacrificing performance or increasing complexity.
That said, I look for a couple of key criteria when designing with resiliency in mind. The first is a distributed architecture, the foundation for any system to survive individual failure but remain globally available.
Ideally this provides a model where an application can continue operating even as arbitrary components fail. Second is the need for simple provisioning & monitoring. Without this, it’s hard to react to failures in an automatic or efficient fashion, and it’s almost impossible to orchestrate normal upgrade processes without down-time. Finally, a database needs to have a clear model for how the same data is kept in multiple locations and what the failure modes are that could result in any loss. These requirements also highlight a key challenge: what I’ve just described are what we expect from cloud infrastructure, but are pushing the limits of what most shared-nothing, sharded or vertically-scaled data architectures offer.
Q4. What is the real risk if the database goes offline?
Seth Proctor: Obviously one risk is the ripple effect it has to other services or applications.
When a database fails it can take with it core services, applications or even customers. That can mean lost revenue or opportunity and it almost certainly means disruption across an organization. Depending on how a database goes offline, the risk may also extend to data loss, corruption, or both. Most databases have to trade-off certain elements of latency against guaranteed durability, and it’s on failure that you pay for that choice. Sometimes you can’t even sort out what information was lost. Perhaps most dangerous, modern deployments typically create the illusion of a data management service by using multiple databases for DR, scale-out etc. When a single database goes offline you’re left with a global service in an unknown state with gaps in its capabilities. Orchestrating recovery is often expensive, time-consuming and disruptive to applications.
Q5. How are customers solving the continuous availability problem today?
Seth Proctor: Broadly, database availability is tackled in one of two fashions. The first is by running with many redundant, individual, replicated servers so that any given server can fail or be taken offline for maintenance as needed. Putting aside the complexity of operating so many independent services and the high infrastructure costs, there is no single view of the system. Data is constantly moving between services that weren’t designed with this kind of coordination in mind so you have to pay close attention to latencies, backup strategies and visibility rules for your applications. The other approach is to use a database that has forgone consistency, making a lot of these pieces appear simpler but placing the burden that might be handled by the database on the application instead. In this model each application needs to be written to understand the specifics of the availability model and in exchange has a service designed with redundancy.
Q6. Now that we are in the Cloud era, is there a better way?
Seth Proctor: For many pieces of the stack cloud architectures result in much easier availability models. For the database specifically, however, there are still some challenges. That said, I think there are a few great things we get from the cloud design mentality that are rapidly improving database availability models. The first is an assumption about on-demand resources and simplicity of spinning up servers or storage as needed. That makes reacting to failure so much easier, and much more cost-effective, as long as the database can take advantage of it. Next is the move towards commodity infrastructure. The economics certainly make it easier to run redundantly, but commodity components are likely to fail more frequently. This is pushing systems design, making failure tests critical and generally putting more people into the defensive mind-set that’s needed to build for availability. Finally, of course, cloud architectures have forced all of us to step back and re-think how we build core services, and that’s leading to new tools designed from the start with this point of view. Obviously that’s one of the most basic elements that drives us at NuoDB towards building a new kind of database architecture.
Q7. Can you share methodologies for avoiding single points of failure?
Seth Proctor: For sure! The first thing I’d say is to focus on layering & abstraction.
Failures will happen all the time, at every level, and in ways you never expect. Assume that you won’t test all of them ahead of time and focus on making each little piece of your system clear about how it can fail and what it needs from its peers to be successful. Maybe it’s obvious, but to avoid single points of failure you need components that are independent and able to stand-in for each other. Often that means replicating data at lower-levels and using DNS or load-balancers at a higher-level to have one name or endpoint map to those independent components. Oh, also, decouple your application logic as much as possible from your operational model. I know that goes against some trends, but really, if your application has deep knowledge of how some service is deployed and running it makes it really hard to roll with failures or changes to that service.
Q8. What’s new at NuoDB?
Seth Proctor: There are too many new things to capture it all here!
For anyone who hasn’t looked at us, NuoDB is a relational database build against a fundamentally new, distributed architecture. The result is ACID semantics, support for standard SQL (joins, indexes, etc.) and a logical view of a single database (no sharding or active/passive models) designed for resiliency from the start.
Rather than talk about the details here I’d point people at a white paper (Note of the Editor: registration required) we’ve just published on the topic.
Right now we’re heavily focused on a few key challenges that our enterprise customers need to solve: migrating from vertical scale to cloud architectures, retaining consistency and availability and designing for on-demand scale and hybrid deployments. Really important is the need for global scale, where a database scales to multiple data centers and multiple geographies. That brings with it all kinds of important requirements around latencies, failure, throughput, security and residency. It’s really neat stuff.
Q9- How does it differentiate with respect to other NoSQL and NewSQL databases?
Seth Proctor: The obvious difference to most NoSQL solutions is that NuoDB supports standard SQL, transactional consistency and all the other things you’d associate with an RDBMS.
Also, given our focus on enterprise use-cases, another key difference is the strong baseline with security, backup, analysis etc. In the NewSQL space there are several databases that run in-memory, scale-out and provide some kind of SQL support. Running in-memory often means placing all data in-memory, however, which is expensive and can lead to single points of failure and delays on recovery. Also, there are few that really support the arbitrary SQL that enterprises need. For instance, we have customers running 12-way joins or transactions that last hours and run thousands of statements.
These kinds of general-purpose capabilities are very hard to scale on-demand but they are the requirement for getting client-server architectures into the cloud, which is why we’ve spent so long focused on a new architectural view.
One other key difference is our focus on global operations. There are very few people trying to take a single, logical database and distribute it to multiple geographies without impacting consistency, latency or security.
Qx Anything else you wish to add?
Seth Proctor: Only that this was a great set of questions, and exactly the direction I encourage everyone to think about right now. We’re in a really exciting time between public clouds, new software and amazing capacity from commodity infrastructure. The hard part is stepping back and sorting out all the ways that systems can fail.
Architecting with resiliency as a goal is going to get more commonplace as the right foundational services mature.
Asking yourself what that means, what failures you can tolerate and whether you’re building systems that can grow alongside those core services is the right place to be today. What I love about working in this space today is that concepts like resilient design, until recently a rarefied approach, are accessible to everyone.
Anyone trying to build even the simplest application today should be asking these questions and designing from the start with concepts like resiliency front and center.
Seth Proctor, Chief Technology Officer, NuoDB
Seth has 15+ years of experience in the research, design and implementation of scalable systems. That experience includes work on distributed computing, networks, security, languages, operating systems and databases all of which are integral to NuoDB. His particular focus is on how to make technology scale and how to make users scale effectively with their systems.
Prior to NuoDB Seth worked at Nokia on their private cloud architecture. Before that he was at Sun Microsystems Laboratories and collaborated with several product groups and universities. His previous work includes contributions to the Java security framework, the Solaris operating system and several open source projects, in addition to the design of new distributed security and resource management systems. Seth developed new ways of looking at distributed transactions, caching, resource management and profiling in his contributions to Project Darkstar. Darkstar was a key effort at Sun which provided greater insights into how to distribute databases.
Seth holds eight patents for his cutting edge work across several technical disciplines. He has several additional patents awaiting approval related to achieving greater database efficiency and end-user agility.
- On Big Data and the Internet of Things. Interview with Bill Franks
Source: ODBMS.org | Published on 2015-03-09
- On MarkLogic 8. Interview with Stephen Buxton
Source: ODBMS.org | Published on 2015-02-13
- On Data Curation. Interview with Andy Palmer
Source: ODBMS.org | Published on 2015-01-14
- On Solr and Mahout. Interview with Grant Ingersoll
Source: ODBMS.org | Published on 2015-01-06
Follow ODBMS.org on Twitter: @odbmsorg
“Perhaps the biggest challenge is that the IoT has the potential to generate orders of magnitude more data than any other source in existence today. So, in the world of the IoT we will test the limits of ‘big.’”–Bill Franks
On topics of Data Warehouses, Hadoop, the Internet of Things, and Teradata`s perspective on the world of Big Data, I have interviewed Bill Franks, Chief Analytics Officer for Teradata.
Q1. What is Teradata`s perspective on the world of Big Data?
Bill Franks: Our perspective has not really changed with regard to ‘big data:’ the primary mission of Teradata for decades has been helping organizations utilize and analyze large volumes of data to produce insight for business value. Note that our Teradata database was originally designed exclusively for analytics, then called ‘decision support’ – unlike most other platforms, which were designed for general computing – then later adapted for analytic uses. As a result, the Teradata analytic engine is – and has always been – uniquely architected for large – ‘big data’ – volume and complexity aimed at producing actionable intelligence.
Of course, the amount of data that’s considered ‘big’ and thus a challenge – has changed, and we have a lot of novel data sources in recent times. However, we believe that companies which have always focused on analyzing and acting upon data intelligently can adapt to the new world of big data. After all, big data is just more data and the analysis of big data is still analysis. There are as many similarities as differences from the past.
Teradata has engineered further analytic enhancements over the years to create a diverse portfolio of products, partnerships, and services to allow our customers to continue to get the most from their data assets. The pace of change is very rapid today and we expect that to continue. We believe our strength is in our experience, expertise and our ability to help organizations navigate the changing landscape and continue to derive new, useful insights from their increasingly large and diverse data sources.
Q2. Most data warehousing projects consolidate data from different source systems. What is different in the world of Big Data?
Bill Franks: By definition, if you want to look at two different data sources together, you must either move one set of data to the other or move them both to a 3rd location. If data is truly disparate, you can’t use it effectively. That is what drove data warehousing to prominence. One huge difference between data warehousing practices years back and then today is that previously, all data that was captured in the business world met three criteria almost 100% of the time.
1) It was immensely important; given the cost to capture and store it,
2) The data was well structured, and
3) The data was generated by an organization’s internal business processes.
— Therefore, it was mostly placed in relational databases or on a mainframe since those technologies easily handle that type of data. Data warehousing solved the problem of many structured data platforms being spread out – by consolidating the sources for analytic purposes into a single structured platform.
What is different with big data is that today, the data often violates all of the rules.
1) Much of it is not important, or has not yet been proven to be important,
2) The data is not structured in the classic fashion at the outset (though most can and must be structured for analytical purposes), and
3) The data is often from sources external to an organization.
— As a result, we now have disparate data platforms that each serve different functions. Some focus on one type of data, while others focus on flexibility. However, the downside is that these platforms don’t integrate well and it isn’t as easy to tie everything together. That’s a problem Teradata is working diligently to solve with our Unified Data Architecture – our pioneering version of the visionary Gartner Logical Data Warehouse.
Q3. Will data warehouses become obsolete soon and be replaced by Hadoop?
Bill Franks: Absolutely not. A few years ago, that was a common claim. That claim is rarely heard today. In fact, all of the big Hadoop vendors partner with Teradata.
This is because our data warehousing platforms provide some important things Hadoop does not — just as Hadoop provides some things a data warehouse does not. Each platform has its strengths and weaknesses, but when positioned together, additional value is added. Part of the issue is that people mistake policy decisions for technology limitations.
There is no reason you can’t place untested, raw, unclean data of unknown value on a data warehousing platform; it’s the corporate policies that often forbid it. It is true that once data is critical and is leveraged by many applications and business users, you have to keep some control and consistency over it. This is what a data warehouse does for an organization.
But, that doesn’t mean you can’t experiment with new sources freely using the technology that supports formal data warehouses.
A colleague of mine mentioned a conversation he had with a Hadoop user. That user was boasting about how he could with a single command change the data type of information on Hadoop, for instance, if it would help him more easily solve his next problem. My friend then asked him what would happen to the prior dozen or two processes that were built expecting the data to be in the original data type format. Wouldn’t they all then break? The user had a blank stare for a moment and then realized his error. As you develop more processes, you must implement security, consistency, and controls on the underlying data. This is why data warehousing – as Gartner defines it, is going to be around for a long time.
Q4. With the increased need of tools for combining data together, are we going to see a “federated”- Big Data architecture?
Bill Franks: A form of that is exactly what we are pursuing with Teradata’s Unified Data Architecture. Again – we refer to Gartner’s vision of the “Logical Data Warehouse.” What we are doing is putting in place a layer of architecture that connects multiple disparate data stores. This architecture includes – and connects – relational databases like Teradata and Oracle, discovery platforms like our Teradata Aster offering, Hadoop, and other platforms such as MongoDB. The idea is that we make information available to users about data throughout the ecosystem, not just the data on the platform they are operating from. So, I see a data dictionary that includes a “table” called “Sensor Feed.”
I can see the data elements available and write analytic logic against those elements. However, I don’t need to be aware of whether the data is a database table, or a Hadoop file, or is in MongoDB. Users can simply build analytics instead of worrying about where data resides, how to log on to various systems, and how to move data. We’ll handle that for them.
We are also beginning to push processing across the various platforms to optimize performance. Just like with a ‘table’ versus a ‘view’ in a database, making a process enterprise-ready might require moving data around the architecture permanently. But now, users are free to discover where that is required. And, the technical team behind the curtain can worry about the details just as they do with traditional data warehousing. We are very bullish on our approach and think we are well positioned to maintain our leadership position in the analytics space.
Q5. Teradata made several acquisitions lately. How do the tools that Teradata acquired fit the current Teradata Data Architectural Framework?
Bill Franks: I believe this in general was addressed. However, in addition I would point out that we acquired Revelytix in 2014 to obtain Loom: an open platform for discovering, profiling, preparing, and tracking data lineage for data in Hadoop. Likewise, we acquired Hadapt, which created a big data analytic platform natively integrating SQL with Apache Hadoop. Plus our recent RainStor acquisition strengthens Teradata’s enterprise-grade Hadoop solutions and enables organizations to add archival data store capabilities for their entire enterprise, including data stored in OLTP, data warehouses, and applications.
Q6. What are the key differentiators of the Teradata Database core architecture?
Bill Franks: As I said, the Teradata DW was differentiated from the start – uniquely architected for analytics from day one. However, I would add that Teradata continues to broaden our differentiation: we’ve built the best data orchestration software in the industry (Teradata Unity and QueryGrid). The orchestration software is key – because it enables our customers to choose a file system that they use to store the data in – and the analytics that they apply to that data independently — and marry them together with software.
It helps reduce the complexity of connecting to, accessing, understanding interfaces and getting value from multiple analytical systems. Another differentiator is Teradata Intelligent Memory, introduced two years ago. TIM is the world’s first extended memory technology beyond cache to increase query performance. Users can configure the exact amount of in-memory capability needed for critical workloads – based on temperature – hot or cold data. The list goes on. I would say that our data technology really does focus on how data is best used – and what proficient users need most.
Q7. Is SQL really the right language to handle Big Data Analytics?
Bill Franks: In some cases yes and in some cases no. We want users to be able to utilize whatever language or platform is best for any given task. There are many big data requirements that perfectly fit SQL and many that don’t. The key is enabling scalable access to the data and flexibility in approach. Most people are aware that there is a big effort to add a SQL interface to Hadoop. What most haven’t realized is how far we’ve also come the other direction. For some time, Teradata has allowed C and Java processing directly against our database platforms via User Defined Functions and other similar extensions. We are now also enabling other languages such as R and Python to be executed within a Teradata context. What is possible today is so far beyond what was possible even 5 or 10 years ago.
Q8. How do you see the adoption of Cloud for Analytics?
Bill Franks: We are aggressively rolling out our own cloud offerings across our product suites. Many of our enterprise customers also configure our products as a private cloud behind their firewall. Adoption will be mixed based on the type of data and nature of work being done. Anything involving sensitive data is still typically not allowed outside a firewall. If you think back to the issue raised in a prior question of having to be able to combine data for analytics, you can’t really have some data locked behind a firewall and some data locked outside it. The real driver behind the cloud is that people want flexible, pay on demand access to analysis platforms. We have multiple ways to provide that to our clients, of which our cloud offerings are only one option. We have some other novel pricing and licensing options the help customers get access to the resources they require for analytics.
Q9. What are the most important data challenges posed by the Internet of Things (IoT)?
Bill Franks: Perhaps the biggest challenge is that the IOT has the potential to generate orders of magnitude more data than any other source in existence today.
So, in the world of the IOT we will test the limits of ‘big.’ At the same time, much of the data generated by the IOT will have low value in the short term and no value in the long term. One of the biggest challenges will be determining which pieces of the information generated by a given sensor actually matters to your business and for how long. In the long run, it is likely that only a small fraction of the raw data produced by the IOT will be stored beyond a few moments of immediate usage. For example, why keep the sensor readings that help navigate my car into a tight parking spot? Once I’m safely in the spot, I really don’t ever need to revisit that data again. If I hit a car in front of me, I might make an exception and keep the data so that the cause can be identified.
Q10. Could you mention some successful Big Data projects you have recently completed with customers?
Bill Franks: We are seeing a lot of very interesting analytics come about. We’ve helped health organizations discover genetic patterns associated with disease, we’ve helped manufacturers reduce cost and increase customer satisfaction by building predictive maintenance algorithms, we’ve helped cable providers identify valuable consumer viewing habits.
I could go on and on. A great place to see some of the examples, and even hear from some of the companies and people behind it, is at our website.
Bill Franks is the Chief Analytics Officer for Teradata, where he provides insight on trends in the analytics and big data space and helps clients understand how Teradata and its analytic partners can support their efforts. His focus is to translate complex analytics into terms that business users can understand and work with organizations to implement their analytics effectively. His work has spanned many industries for companies ranging from Fortune 100 companies to small non-profits. Franks also helps determine Teradata’s strategies in the areas of analytics and big data.
Franks is the author of the book Taming The Big Data Tidal Wave (John Wiley & Sons, Inc., April, 2012). In the book, he applies his two decades of experience working with clients on large-scale analytics initiatives to outline what it takes to succeed in today’s world of big data and analytics. The book made Tom Peter’s list of 2014 “Must Read” books and also the Top 10 Most Influential Translated Technology Books list from CSDN in China.
Franks’ second book The Analytics Revolution (John Wiley & Sons, Inc., September, 2014) lays out how to move beyond using analytics to find important insights in data (both big and small) and into operationalizing those insights at scale to truly impact a business.
He is a faculty member of the International Institute for Analytics, founded by leading analytics expert Tom Davenport, and an active speaker who has presented at dozens of events in recent years. His blog, Analytics Matters, addresses the transformation required to make analytics a core component of business decisions.
Franks earned a Bachelor’s degree in Applied Statistics from Virginia Tech and a Master’s degree in Applied Statistics from North Carolina State University. More information is available here: http://www.bill-franks.com.
Follow ODBMS.org on Twitter: @odbmsorg
“Apache Ignite is an incubating Apache project, which provides a high-performance, distributed in-memory data management software layer between various data sources and applications.”–Nikita Ivanov
I have interviewed Nikita Ivanov, founder and CTO of GridGain Systems. Main topic of the interview is the new release of Apache Ignite.
Q1. In your opinion, what are the main differences between an In-Memory Database, an In-Memory Data Grid and an In-Memory Data Fabric?
Nikita Ivanov: The main difference between in-memory databases (IMDB) and in-memory data grids is that IMDBs support only SQL (or some proprietary NoSQL dialect) while most Data Grids (IMDG) support multiple ways to access and process data. In IMDB the only way to access and process data is SQL and SQL store-procedures, while IMDGs typically support at least the following paradigms: SQL, Key/Value, MapReduce, MPP, and MPI-based processing.
Compared with IMDGs, an In-Memory Data Fabric represents the latest generation of in-memory technologies, integrated into a single platform, which eliminates the need for point solutions such as IMDB’s or IMDGs. It is a software layer that sits between applications and data stores, and it allows for high-performance data access and processing across different types of data, such as SQL, NoSQL and Hadoop. All without any rip and replace of existing applications or databases.
Q2. How is it possible to accelerate Hadoop-based deployments with in-memory technology?
Nikita Ivanov: To accelerate Hadoop in a meaningful way one needs to find a way to accelerate two core technologies that define Hadoop: HDFS, a distributed file system where data is stored, and MapReduce, a framework that allows parallel processing of the data stored in HDFS.
At GridGain, we’ve developed a highly optimized in-memory file system that is 100% compatible with HDFS that allows to store data directly in DRAM of computers in a Hadoop cluster. We’ve also developed a specifically optimized YARN-based MapReduce implementation that takes full advantage of the data stored directly in DRAM instead of disks.
The combination of these two innovations allows GridGain to speed up any Hadoop payloads – including Pig, Hive, or hand-written MapReduce jobs in any language – up to 10x without any code change. GridGain provides the first Hadoop accelerator that provide a true plug-and-play acceleration to the existing Hadoop jobs.
Q3. Why did you decide to open source your product?
Nikita Ivanov: Even before October of last year, GridGain already had an open core model: We offered an in-memory data fabric under the Apache 2.0 license, and we also offered a commercial edition with a number of enterprise-grade features, such as enhanced security, data center replication, rolling updates, cross-language portability, and others.
The drivers for our decision to contribute our core open source code base to the Apache Software Foundation (ASF) were of course to ensure continued, broad adoption of in-memory technologies and the long-term viability of the code base. But equally importantly, we also want to build a thriving community that adopts and adapts this code base, and hence will be key in finding new use cases for in-memory computing.
Q4. What is Apache Ignite?
Nikita Ivanov: Apache Ignite (incubating) is an open source, distributed framework for a unified In-Memory Data Fabric, originally developed by GridGain Systems. Apache Ignite is an incubating Apache project, which provides a high-performance, distributed in-memory data management software layer between various data sources and applications. Its code is written mostly in Java and Scala with small amount of C++ code, and it will initially combine an in-memory data grid, in-memory compute grid and in-memory streaming processing in one framework.
Apache Ignite’s large scale, in-memory framework offers transactional and real-time analytics applications performance gains of 100-1,000 times faster throughput and/or lower latencies. It is also a key open source foundation to enable the emerging class of so-called hybrid transactional-analytical workloads.
Q5. What is special about v1.0?
Nikita Ivanov: In October of 2014, the GridGain In-Memory Data Fabric core code base was accepted by the Apache Software Foundation (ASF) into the Incubator program under the name “Apache Ignite”.
Since then, GridGain engineers as well as other contributors have been busy working on migrating the existing code base, documentation, and refactoring of the existing internal build, test & release processes to the “Apache Way”.
Version 1.0 represents the first release that meets these goals, and will include additional enhancements above and beyond the most recent open source In-Memory Data Fabric from GridGain. In fact, Apache Ignite has a large set of features, and one of its coolest new features is its ability to automatically integrate with different RDBMS systems, such as Oracle, MySql, Postgres, DB2, Microsoft SQL, etc. This feature automatically generates the application domain model based on the schema definition of the underlying database, and then loads the data.
Despite the breadth of its feature set, however, Ignite is actually very easy to use: For example, there are no custom installers. The product comes as one ZIP file, which is ready to go once you unzip it. And it has only 1 mandatory dependency – ignite-core.jar. All other dependencies, like integration with Spring for configuration, or with the H2 database for SQL, can be added to the process a la carte. Also, the project is fully mavenized, and is composed of over a dozen of maven artifacts that can be imported and used in any combination. Apache Ignite is based on standard Java APIs, and for distributed caches and data grid functionality Ignite implements the JCache (JSR 107) standard.
The new Apache Ignite v1.0 bits are available for download now from the Apache Ignite web site.
Q6. Who will be using the Apache Ignite In-Memory Data Fabric, and for what?
Nikita Ivanov: We expect developers and software architects of high-performance, hyper-scale on-premise and SaaS applications to take advantage of the following capabilities when building or performance-tuning their new or existing applications: compute grid, data grid, service grid, streaming, clustering, distributed data structures, distributed messaging, distributed events and in-memory file system.
Use cases can be found in software designed for financial services, telecommunications, retail, transportation, social media, online advertising, utilities, biosciences and many other industries.
Q7. What is positioning of the Apache Ignite project?
Nikita Ivanov: As we explained in our blog from last November, we believe Apache Ignite has all the right ingredients to become for the world of Fast Data what Hadoop is for Big Data today. This means that unlike Hadoop, which is a batch process focused on enabling the storage of large amounts of data economically, Ignite will enable extremely fast and ultra-low latency processing of data, allowing its users to derive actionable insights from their data much faster. Unlike Spark, a popular sister project of Ignite in the ASF, which is mainly focused on enhancing analytics and machine-learning for the Hadoop world, Ignite is a data source agnostic processing layer, which can be used for both Hadoop-like computation and many other computing paradigms like MPP, MPI, streaming processing.
In addition to real-time analytics, Ignite’s in-memory framework also offers support for full ACID transactions.
Q8. You have previously posted that Oracle and SAP are missing the point of In-Memory Computing. Could you please elaborate on this?
Nikita Ivanov: We continue to believe that Oracle and SAP are missing the point of in-memory computing for the following reasons: By offering a well-integrated platform of a compute grid, data grid, streaming/CEP and Hadoop acceleration, Apache Ignite (incubating) and the GridGain In-Memory Data Fabric offer a strategic approach to in-memory computing, across both transactional and analytical workloads, that delivers performance, scale and comprehensive capabilities far above and beyond what traditional in-memory databases, data grids or other in-memory-based point solutions can offer by themselves.
Both Apache Ignite and GridGain’s enterprise offering built on Apache Ignite will greatly benefit from a thriving community adapting the code base to new and emerging use cases; therefore, we believe this code base is extremely well positioned to drive superior innovation to the world of Fast Data, just as the Hadoop community has been doing for Big Data.
In addition, unlike Oracle or SAP Hana, Apache Ignite is more affordable, easier-to-access and more transparent open source software running on commodity hardware, which typically increases developers’ and architects’ motivation to explore the potential of in-memory computing. That said, if all the customer is looking for from in-memory technology is faster processing of their (SQL) data, then they may still choose to deploy proprietary software from Oracle or SAP.
Qx Anything else you wish to add?
Nikita Ivanov: I guess I should mention that even though Apache Ignite has been in incubation for less than 4 months only, we are excited to see that the project already has a very vibrant and growing community.
But we always welcome community contributions, so if there are readers that would like to contribute, please send an email to the Apache Ignite dev list, and we will get you started. And even if you are not ready to contribute immediately, we would like to invite everyone to join our dev list. Most of the discussions happen there, and you can find out a lot about where the project is going and also provide your own ideas. Another great way, of course, for people to familiarize themselves with Apache Ignite, is to take a look at the code and see what it can do for thier project. The Ignite bits can be downloaded on the Apache Ignite homepage.
Nikita Ivanov is founder and CTO of GridGain Systems, started in 2007 and funded by RTP Ventures and Almaz Capital. Nikita has led GridGain to develop advanced and distributed in-memory data processing technologies – the top Java in-memory data fabric starting every 10 seconds around the world today.
Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. Nikita was one of the pioneers in using Java technology for server side middleware development while working for one of Europe’s largest system integrators in 1996.
He is an active member of Java middleware community, contributor to the Java specification, and holds a Master’s degree in Electro Mechanics from Baltic State Technical University, Saint Petersburg, Russia.
– On Solr and Mahout. Interview with Grant Ingersoll. ODBMS Industry Watch, 2015-01-06
– Big Data: Three questions to McObject. ODBMS Industry Watch, February 14, 2014
Follow ODBMS.org on Twitter: @odbmsorg
“When trades are reconciled with counterparties and then closed, updates can and do occur. Bitemporal helps ensure investment banks can always go back and see when updates occurred for specific trades. This is critical to managing risk and handling increased concerns about regulatory compliance and future audits. “– Stephen Buxton.
MarkLogic recently released MarkLogic 8. I wanted to know more about this release. For that, I have interviewed Stephen Buxton, Senior Director, Product Management at MarkLogic.
Q1. You have recently launched MarkLogic® 8 software release. How is it positioned in the Big Data market? How does it differentiate from other products from NoSQL vendors?
Stephen Buxton: MarkLogic 8 is our biggest release ever, further solidifying MarkLogic’s position in the market as the only Enterprise NoSQL database.
With MarkLogic 8, you can now store, manage and search JSON, XML, and RDF all in one unified platform—without sacrificing enterprise features such as transactional consistency, security, or backup and recovery.
While other database companies are still figuring out how to strengthen their platform and add features like transactional consistency, we’ve moved far ahead of them by working on new innovative features such as Bitemporal and Semantics. It’s for these reasons that over 500 enterprise organizations have chosen MarkLogic to run their mission-critical applications.
MarkLogic 8 is more powerful, agile, and trusted than ever before, and is an ideal platform for doing two things: making heterogeneous data integration simpler and faster; and for doing dynamic content delivery at massive scale.
Relational databases do not offer enough flexibility—integration projects can take multiple years, cost millions of dollars, and struggle at scale. But, the newer NoSQL databases that do have agility still lack the enterprise features required to run in the data centers at large organizations. MarkLogic is the only NoSQL database that is able to solve today’s challenge, having the flexibility to serve as an operational and analytical database for all of an organization’s data.
Within MarkLogic, the JSON structure is mapped directly to the internal structure already used by the XML document format, so it has the same speed and scalability as with XML. This also means that all of the production-proven indexing, data management, and security capabilities that MarkLogic is known for are fully maintained.
Q3. In MarkLogic 8 you have been adding full SPARQL 1.1 support and Inferencing capability. Could you please explain what kind of Inferencing capability did you add and what are they useful for?
Stephen Buxton: We made a big leap forward on the semantics foundation that was laid in our previous release, adding full SPARQL 1.1 support, which includes support for property paths, aggregates, and SPARQL Update. Support for automatic inferencing was also added, which is a powerful capability that allows the database to combine existing data and apply pre-defined rules to infer new data. SPARQL 1.1 is a standard defined by the W3C that is supported by many RDF triple stores. But, MarkLogic differentiates itself among triple stores as you can store your documents and data right alongside your triples, and you can query across all three data models easily and efficiently.
Automatic inferencing is a really powerful feature that is part of an overall strategy to provide a more intelligent data layer so that you can build smarter apps.
With inferencing, for example, if you had two pieces of data stored as RDF triples, such as “John lives in Virginia” and “Virginia is in the United States”, then MarkLogic 8 could infer the new fact, “John lives in the United States.”
This can make search results richer and also show you new relationships in your data.
In MarkLogic 8, rules for inferencing are applied at query time. This approach is referred to as backward-chaining inference, a very flexible approach in which only the required rules are applied for each query, so the server does the minimum work necessary to get the correct results; and when your data or ontology or rules sets change, that change is available immediately – it takes effect with the very next query. And, of course, inference queries are transactional, distributed, and obey MarkLogic’s rule-based security, just like any other query. MarkLogic 8 has supplied rule sets for RDFS, RDFS-Plus, OWL-Horst, and their subsets; and you can create your own. With MarkLogic 8 you can further restrict any SPARQL query (with or without inference) by any document attribute, including timestamp, provenance, or even a bitemporal constraint.
More details and examples can be found at developer.marklogic.com.
Q4. The additions to SPARQL include Property Paths, Aggregates, and SPARQL Update. Could you please explain briefly each of them?
Stephen Buxton: SPARQL 1.1 brings support for property paths, aggregates, and SPARQL Update. These capabilities make working with RDF data simpler and more powerful, which means increased context for your data—all using the SPARQL 1.1 industry standard query language.
SPARQL 1.1’s property paths let you traverse an RDF graph – bouncing from point-to-point across a graph. This graph traversal allows you to do powerful, complex queries such as, “Show me all the people who are connected to John” by finding people that know John, and people that know people that know John, and so on.
With aggregate SPARQL functions you can do analytic queries over hundreds of billions of triples. MarkLogic 8 supports all the SPARQL 1.1 Aggregate functions – COUNT, SUM, MIN, MAX, and AVG – as well as the grouping operations GROUP BY, GROUP BY .. HAVING, GROUP_CONCAT and SAMPLE.
SPARQL 1.1 also includes SPARQL Update. With these capabilities, you can delete, insert, and update (delete/insert) individual triples, and manipulate RDF graphs, all using SPARQL 1.1.
Q5. The addition of SPARQL Update capabilities could have the potential to influence the capability you offer of a RDF triple store that scales horizontally and manages billions of triples. Any comment on that?
Stephen Buxton: The enhancements in MarkLogic 8 make it able to function as a full-featured, stand-alone triple store– this means you can now get a triple store that is horizontally scalable as part of a shared-nothing cluster, and still get all of the enterprise features MarkLogic is known for such as such as High Availability, Disaster Recovery, and certified security. Beyond that, anyone looking for “just a triple store” will find they can also store, manage, and query documents and data in the same database, a unique capability that only MarkLogic has.
Q6. You have been adding a so called Bitemporal Data Management. What is it and why is it useful?
Stephen Buxton: Bitemporal is a new feature that allows you to ask, “What did you know and when did you know it?” The MarkLogic Bitemporal feature answers this critical question by tracking what happened, when it happened, and when we found out. A bitemporal database is much more powerful than a temporal database that can only track when something happened. The difference between when something happened and when you found out about it can be incredibly significant, particularly when it comes to audits and regulation.
A bitemporal database tracks time across two different axes, the system and valid time axes. This allows you to go back in time and explore data, manage historical data across systems, ensure data integrity, and do complex bitemporal analysis. You can answer complex questions such as:
• Where did John Thomas live on August 20th as we knew it on September 1st?
• Where was the Blue Van on October 12th as we knew it on October 23rd?
Bitemporal is important for a wide variety of use cases across industries. Getting a more accurate picture of a business at different points-in-time used to be impossible, or very challenging at best. Bitemporal helps ensure that you always have a full and accurate picture of your data at every point-in-time, which is particularly useful in regulated industries.
• Regulatory requirements – Avoid the increasingly harsh downside consequences from not adhering to government and industry regulations, particularly in financial services and insurance
• Audits – Preserve the history of all your data, including the changes made to it, so that clear audits can be conducted without having to worry about lost data, data integrity, or cumbersome ETL processes with archived data
• Investigations and Intelligence – No more lost emails and no more missing information. Bitemporal databases never erase data, so it is possible to see exactly how data was updated based on what was known at the time
• Business Analytics – Run complex queries that were not previously possible in order to better understand your business and answer new questions about how different decisions and changes in the past could have led to different results
• Cost reduction – Manage data with a smaller footprint as the shape of the data changes, avoiding the need to set up additional databases for historical data.
Bitemporal is enhanced by MarkLogic’s Tiered Storage, which allows you to more easily archive your data to cheaper storage tiers with little administrative overhead. This keeps Bitemporal simple, and obviates the high cost imposed by the few relational databases that do have Bitemporal. MarkLogic also eliminates the schema roadblocks that relational databases that have Bitemporal struggle with. MarkLogic is schema-agnostic and can adjust to the shape of data as that data changes over time.
Q7. How is bitemporal different from versioning?
Stephen Buxton: Bitemporal works by ingesting bitemporal documents that are managed as a series of documents with range indexes for valid and system time axes. Documents are stored in a temporal collection protected by security permissions. The initial document inserted into the database is kept and never changes, allowing you to track the provenance of information with full governance and immutability.
Q8. Could you give us some examples of how Bitemporal Data Management could be useful applications for the financial services industry?
Stephen Buxton: One example of Bitemporal is trade reconciliation in financial services. When trades are reconciled with counterparties and then closed, updates can and do occur. Bitemporal helps ensure investment banks can always go back and see when updates occurred for specific trades. This is critical to managing risk and handling increased concerns about regulatory compliance and future audits.
Imagine the Head of IT Architecture at a major bank working on mining information and looking for changes in risk profiles. The risk profiles cannot be accurately calculated without having an accurate picture of the reference and trade data, and how it changed over time. This task becomes simple and fast using Bitemporal.
Qx Anything else you wish to add?
Stephen Buxton: In addition to innovative features such as Bitemporal and Semantics, and features that make MarkLogic more widely accessible in the developer community, there are other updates in Marklogic 8 that make it easier to administer and manage. For example, Incremental Backup, another feature added in MarkLogic 8, allows DBAs to perform backups faster while using less storage.
With MarkLogic 8, you can have multiple daily incremental backups with only a minimal impact on database performance. This feature is one worth highlighting because it will help make DBAs live much easier, and will save an organization time and money.
It’s just another example of MarkLogic’s continuing dedication to being an enterprise NoSQL database that is more powerful, agile, and trusted than anything else.
Stephen Buxton is Senior Director of Product Management for Search and Semantics at MarkLogic, where he has been a member of the Products team for 8 years. Stephen focuses on bringing a rich semantic search experience to users of the MarkLogic NoSQL database, document store, and triple store. Before joining MarkLogic, Stephen was Director of Product Management for Text and XML at Oracle Corporation.
–On making information accessible. Interview with David Leeming. ODBMS Industry Watch, July 30, 2014
Follow ODBMS.org on Twitter: @odbmsorg
“We were looking for solutions which provided the data integrity guarantees we needed, provided clustering tools to ease operational complexity, and were able to handle our data size and the read/write throughput we required.”–John Allison
I have interviewed John Allison, CTO and founder of Customer.io, a start up company in Portland, Oregon.
Q1. What is the business of Customer.io ?
John Allison: We help our customers send timely, targeted messages based on user activity on their website or mobile app. We achieve this by collecting analytical data, providing real-time segmentation, and allowing our customers to define rules to trigger messages at different points in their interactions with a user.
Q2. How large are the data sets you analyze?
John Allison: We’ve collected 6 terabytes of analytical event data for over 55 million unique users across our platform. Due to it’s nature, this data continues to grow and grows faster as we collect data for more and more users.
Q3. What are the main business and technical challenges you are currently facing?
John Allison: As we continue to grow our business, we need to ensure the technical side of our service can easily scale out to support new customers who want to use our product.
Q4. Why did you replace your existing underlying database architecture supporting your “MVP” product ? What were the main technical problems you encountered?
John Allison: As our data set grew in size to the point where we couldn’t realistically manage it all on a small number of servers, we began looking for alternatives which would allow us to continue providing our service in a larger, more distributed way.
Q5. How did you evaluate the alternatives?
John Allison: We evaluated many options and found that most didn’t live up to the availability or consistency guarantees they promised when run over a cluster of servers. We were looking for solutions which provided the data integrity guarantees we needed, provided clustering tools to ease operational complexity, and were able to handle our data size and the read/write throughput we required.
Q6. How is the new solution looking like?
John Allison: We’ve taken more of a polyglot approach to storing our data. We are consolidating on three main clustered databases:
1) FoundationDB – Data where distributed transactions and consistency guarantees are most important.
2) Riak – Large amounts of immutable data where availability is more important.
3) ElasticSearch – Indexing data for ad-hoc querying.
All three have built in tools for expanding and administrating a cluster, provide fault-tolerance and increased reliability in the face of server faults, and each provides us with unique ways to access our data.
Q7. What experience do you have with this new database architecture until now? Do you have any measurable results you can share with us?
John Allison: Embracing a distributed architecture and storing data in the right database for a given use-case has led to less time worrying about operations, increased reliability of our service as a whole, and the ability to scale out all parts of our infrastructure to increase our platform’s capacity.
Q8. Moving forward, what are your plans for the next implementation of your product?
John Allison: Continuing to improve our product in order to provide the most value we can for our customers.
John Allison is the CTO and founder of Customer.io, a startup focused on making it easy to build, manage, and measure automatic customer retention emails. Prior to that he was the head of engineering at Challengepost.com. He is a world traveler, Golfer, and an Arkansas Razorback fan.
We have published several new experts articles on Big Data and Analytics in ODBMS.org.
– On Mobile Data Management. Interview with Bob Wiederhold. ODBMS Industry Watch, 2014-11-18.
–Big Data Management at American Express. Interview with Sastry Durvasula and Kevin Murray. ODBMS Industry Watch, 2014-10-12
Follow ODBMS.org on Twitter: @odbmsorg
“When does it get practical for most people, not just the Google’s and the Facebook’s of the world? I’ve seen some cool usages of big data over the years, but I also see a lot of people with a solution looking for a problem.”–Grant Ingersoll.
I have interviewed Grant Ingersoll, CTO and co-founder of LucidWorks. Grant is an active member of the Lucene community, and co-founder of the Apache Mahout machine learning project.
I wish you a Happy and a Peaceful 2015!
Q1. Why LucidWorks Search? What kind of value-add capabilities does it provide with respect to the Apache Lucene/Solr open source search?
Grant Ingersoll: I like to think of LucidWorks Search (LWS) as Solr++, that is, we give you all of the goodness of Solr and then some more. Our primary focus in building LWS is in 4 key areas:
1. IT integration — Make it easy to consume Solr within an IT organization via things like monitoring, APIs, installation and so on.
2. Enterprise readiness — Large enterprises have 1 of everything and they all have a multitude of security requirements, so we focus on making it easier to operate in these environments via things like connectors for data acquisition, security and the like
3. Tools for Subject Matter Experts — These are aimed at technical non developers like Business Analysts, Merchandisers, etc. who are responsible for understanding who asked for what, when and why. These tools are primarily aimed at understanding relevancy of search results and then taking action based on business needs.
4. Deliver a supported version of the open source so that companies can reliably deploy it knowing they have us to back them up.
Q2. At LucidWorkd you have integrated Apache open source projects to deliver a Big Data application development and deployment platform. What does the emerging big data stack look like?
Grant Ingersoll: We use capabilities from the Hadoop ecosystem for a number of activities that we routinely see customers struggling with when they try to better understand their data. In many cases, this boils down to large scale log analysis to power things like recommendation systems or Mahout for machine learning, but it also can be more subtle like doing large scale content extraction from Office documents or natural language processing approaches for identifying interesting phrases. We also rely on Zookeeper quite heavily to make sure that our cluster stays in a happy state and doesn’t suffer from split brain issues and cause failures.
Q3. How does it different with respect to other Big Data Hadoop-based distributions such as Cloudera, Hortonworks, and Greenplum Pivotal HD?
Grant Ingersoll: I can’t speak to their integrations in great detail, but we integrate with all of them (as well as partner with most of them), so I guess you would say we try to work at a layer above the core Hadoop infrastructure and focus on how the Hadoop ecosystem can solve specific problems as opposed to being a general purpose tool. For instance, we ship with a number of out of the box workflows designed to solve common problems in search like click-through log analysis and whole collection document clustering so you don’t have to write them yourself.
Q4. How does it work to build a framework for big data with open source technologies that are “pre-integrated”?
Grant Ingersoll: Well, you quickly realize what a version soup there is out there, trying to support all the different “flavors” of Hadoop. Other than, it is a lot of fun to leverage the technologies to solve real problems that help people better understand their data. Naturally, there are challenges in making sure all the processes work together at scale, so a lot of effort goes into those areas.
Q5. What happens when big data plus search meets the cloud?
Grant Ingersoll: You get cost effective access and insight into your data instead of a big science experiment. In many ways, the benefits are the same as search and ranking in on-prem situations plus the added benefits the cloud brings you in terms of costs, scaling and flexibility. Of course, the well-documented challenge in the cloud is how to get your data there. So, for users who already have their data in the cloud, it’s an especially easy win, for those who don’t, we provide connectors that help.
Q6. Solr Query includes simple join capability between two document types. How do such queries scale with Big Data?
Grant Ingersoll: Solr scales quite well (billions of documents and very large query volumes).
In fact, we’ve seen it routinely scale linearly to quite large cluster sizes.
As with databases, joins require you to pay attention to how you do the join or whether there are better ways of asking your question, but I have seen them used quite successfully in the appropriate situation. At the end of the day, I try to remain pragmatic and use the appropriate tool for the job. A search engine can handle some types of joins, but that doesn’t always mean you should do it in a search engine. I like to think of a search engine as a very fast ranking engine. If the problem requires me to rank something, than search engine technology is going to be hard to beat. If you need it to do all different kinds of joins across a large number of document types or constant large table scans, it may be appropriate to do in a search engine and it may not. It’s a classic “it depends” situation. That being said, over the past few years, these kinds of problems have become much more efficient to do in a search engine thanks to a multitude of improvements the community has made to Lucene and Solr.
Q7. The Apache Mahout Machine Learning Project’s goal is to build scalable machine learning libraries. What is current status of the project?
Grant Ingersoll: We released 0.9 and are working towards a 1.0. The main focus lately has been on preparing for a 1.0 release by culling old, unused code and tightly focusing on a core set of algorithms which are tried and true that we want to support going forward.
Q8. What kind of algorithms is Apache Mahout currently supporting?
Grant Ingersoll: I tend to think of Mahout as being focused on the three “C’s”: clustering, classification and collaborative filtering (recommenders). These algorithms help people better understand and organize their data. Mahout also has various other algorithms like singular value decomposition, collocations and a bunch of libraries for Java primitives.
Q9. How does Mahout relies on the Apache Hadoop framework?
Grant Ingersoll: Many of the algorithms are written for Hadoop specifically, but not all. We try to be prudent about where it makes sense to use Hadoop and where it doesn’t, as not all machine learning algorithms are best suited for Map-Reduce style programming. We are also looking at how to leverage other frameworks like Spark or custom distributed code.
Q10. Who is using Apache Mahout and for what?
Grant Ingersoll: It really spans a lot of interesting companies, ranging from those using it to power recommendations to others classifying users to show them ads. At LucidWorks, we use Mahout for identifying statistically interesting phrases, clustering and classification of user’s query intent and more.
Q11. How scalable is Apache Mahout? What are the limits?
Grant Ingersoll: That will depend on the algorithm. I haven’t personally run an exhaustive benchmark, but I’ve seen many of the clustering and classification algorithms scale linearly.
Q12. How do you take into account user feedback when performing Recommendation mining with Apache Mahout?
Grant Ingersoll: Mahout’s recommenders are primarily of the “collaborative filtering” type, where user feedback equates to a vote for a particular item. All of those votes are, to simplify things a bit, added up to produce a recommendation for the user. Mahout supports a number of different ways of calculating those recommendations, since it is a library for producing recommendations and not just a one size fits all product.
Q13. Looking at three elements: Data, Platform, Analysis, what are the main challenges ahead?
Grant Ingersoll: I’d add a fourth element: the user. Lots of interesting challenges here:
When do we get past the hype cycle of big data and into the nitty gritty of making it real? That is, when does it get practical for most people, not just the Google’s and the Facebook’s of the world? I’ve seen some cool usages of big data over the years, but I also see a lot of people with a solution looking for a problem.
How do we leverage the data, the platform and the analysis to make us smarter/better off instead of just better marketing targets? How do we use these tools to personalize without offending or destroying privacy?
How do we continue to meet scale requirements without breaking the bank on hardware purchases, etc?
Qx. Anything you wish to add?
Grant Ingersoll: Thanks for the great questions!
Grant Ingersoll, CTO and co-founder of LucidWorks, is an active member of the Lucene community – a Lucene and Solr committer, co-founder of the Apache Mahout machine learning project and a long-standing member of the Apache Software Foundation. He is co-author of “Taming Text” from Manning Publications, and his experience includes work at the Center for Natural Language Processing at Syracuse University in natural language processing and information retrieval.
Ingersoll has a Bachelor of Science degree in Math and Computer Science from Amherst College and a Master of Science degree in Computer Science from Syracuse University.
– Taming Text How to Find, Organize, and Manipulate It
Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris
Softbound print: September 2012 (est.) | 350 pages, Manning, ISBN: 193398838X
Follow ODBMS.org on Twitter: @odbmsorg
“The biggest challenge facing data analytics is how to turn complex data into actionable information. One way to think about complexity is that there are many stories happening simultaneously in the data – some relevant to the problem being solved but most irrelevant. The goal of Big Data Analytics is to find the relevant story, reducing complexity to actionable information.”–Anthony Bak
On Big Data Analytics, I have interviewed Anthony Bak, Data Scientist and Mathematician at Ayasdi.
Q1. What are the most important challenges for Big Data Analytics?
Anthony Bak: The biggest challenge facing data analytics is how to turn complex data into actionable information. One way to think about complexity is that there are many stories happening simultaneously in the data – some relevant to the problem being solved but most irrelevant. The goal of Big Data Analytics is to find the relevant story, reducing complexity to actionable information. How do we sort through all the stories in an efficient manner?
Historically, organizations extracted value from data by building data infrastructure and employing large teams of highly trained Data Scientists who spend months, and sometimes years, asking questions of data to find breakthrough insights. The probability of discovering these insights is low because there are too many questions to ask and not enough data scientists to ask them.
Ayasdi’s platform uses Topological Data Analysis (TDA) to automatically find the relevant stories in complex data and operationalize them to solve difficult and expensive problems. We combine machine learning and statistics with topology, allowing for ground-breaking automation of the discovery process.
Q2. How can you “measure” the value you extract from Big Data in practice?
Anthony Bak: We work closely with our clients to find valuable problems to solve. Before we tackle a problem we quantify both its value to the customer and the outcome delivering that value.
Q3. You use a so called Topological Data Analysis. What is it?
Anthony Bak: Topology is the branch of pure mathematics that studies the notion of shape.
We use topology as a framework combining statistics and machine learning to form geometric summaries of Big Data spaces. These summaries allow us to understand the important and relevant features of the data. We like to say that “Data has shape and shape has meaning”. Our goal is to extract shapes from the data and then understand their meaning.
While there is no complete taxonomy of all geometric features and their meaning there are a few simple patterns that we see in many data sets: clusters, flares and loops.
Clusters are the most basic property of shape a data set can have. They represent natural segmentations of the data into distinct pieces, groups or classes. An example might find two clusters of doctors committing insurance fraud.
Having two groups suggests that there may be two types of fraud represented in the data. From the shape we extract meaning or insight about the problem.
That said, many problems don’t naturally split into clusters and we have to use other geometric features of the data to get insight. We often see that there’s a core of data points that are all very similar representing “normal” behavior and coming off of the core we see flares of points. Flares represent ways and degrees of deviation from the norm.
An example might be gene expression levels for cancer patients where people in various flares have different survival rates.
Loops can represent periodic behavior in the data set. An example might be patient disease profiles (clinical and genetic information) where they go from being healthy, through various stages of illness and then finally back to healthy.
The loop in the data is formed not by a single patient but by sampling many patients in various stages of disease. Understanding and characterizing the disease path potentially allows doctors to give better more targeted treatment.
Finally, a given data set can exhibit all of these geometric features simultaneously as well as more complicated ones that we haven’t described here. Topological Data Analysis is the systematic discovery of geometric features.
Q4. The core algorithm you use is called “Mapper“, developed at Stanford in the Computational Topology group by Gunnar Carlsson and Gurjeet Singh. How has your company, Ayasdi, turned this idea into a product?
Anthony Bak: Gunnar Carlsson, co-founder and Stanford University mathematics professor, is one of the leaders in a branch of mathematics called topology. While topology has been studied for the last 300 years, it’s in just the last 15 years that Gunnar has pioneered the application of topology to understand large and complex sets of data.
Between 2001 and 2005, DARPA and the National Science Foundation sponsored Gunnar’s research into what he called Topological Data Analysis (TDA). Tony Tether, the director of DARPA at the time, has said that TDA was one of the most important projects DARPA was involved in during his eight years at the agency.
Tony told the New York Times, “The discovery techniques of topological data analysis are going to have a huge impact, and Gunnar Carlsson is at the forefront of this research.”
That led to Gunnar teaming up with a group of others to develop a commercial product that could aid the efforts of life sciences, national security, oil and gas and financial services organizations. Today, Ayasdi already has customers in a broad range of industries, including at least 3 of the top global pharmaceutical companies, at least 3 of the top oil and gas companies and several agencies and departments inside the U.S. Government.
Q5. Do you have some uses cases where Topological Data Analysis is implemented to share?
Anthony Bak: There is a well known, 11-year old data set representing a breast cancer research project conducted by the Netherlands Cancer Institute-Antoni van Leeuwenhoek Hospital. The research looked at 272 cancer patients covering 25,000 different genetic markers. Scientists around the world have analyzed this data over and over again. In essence, everyone believed that anything that could be discovered from this data had been discovered.
Within a matter of minutes, Ayasdi was able to identify new, previously undiscovered populations of breast cancer survivors. Ayasdi’s discovery was recently published in Nature.
Using connections and visualizations generated from the breast cancer study, oncologists can map their own patients data onto the existing data set to custom-tailor triage plans. In a separate study, Ayasdi helped discover previously unknown biomarkers for leukaemia.
You can find additional case studies here.
Q6. Query-Based Approach vs. Query-Free Approach: could you please elaborate on this and explain the trade off?
Anthony Bak: Since the creation of SQL in the 1980s, data analysts have tried to find insights by asking questions and writing queries. This approach has two fundamental flaws. First, all queries are based on human assumptions and bias. Secondly, query results only reveal slices of data and do not show relationships between similar groups of data. While this method can uncover clues about how to solve problems, it is a game of chance that usually results in weeks, months, and years of iterative guesswork.
Ayasdi’s insight is that the shape of the data – its flares, cluster, loops – tells you about natural segmentations, groupings and relationships in the data. This information forms the basis of a hypothesis to query and investigate further. The analytical process no longer starts with coming up with a hypothesis and then testing it, instead we let the data, through its geometry, tell us where to look and what questions to ask.
Q7 Anything else you wish to add?
Anthony Bak: Topological data analysis represents a fundamental new framework for thinking about, analyzing and solving complex data problems. While I have emphasized its geometric and topological properties it’s important to point out that TDA does not replace existing statistical and machine learning methods.
Instead, it forms a framework that utilizes existing tools while gaining additional insight from the geometry.
I like to say that statistics and geometry form orthogonal toolsets for analyzing data, to get the best understanding of your data you need to leverage both. TDA is the framework for doing just that.
Anthony Bak is currently a Data Scientist and mathematician at Ayasdi. Prior to Ayasdi, Anthony was at Stanford University where he worked with Ayasdi co-founder Gunnar Carlsson on new methods and applications of Topological Data Analysis. He did his Ph.D. work in algebraic geometry with applications to string theory.
– Extracting insights from the shape of complex data using topology
P. Y. Lum,G. Singh,A. Lehman,T. Ishkanov,M. Vejdemo-Johansson,M. Alagappan,J. Carlsson & G. Carlsson
Nature, Scientific Reports 3, Article number: 1236 doi:10.1038/srep01236, 07 February 2013
Follow ODBMS.org on Twitter: @odbmsorg
“We see mobile rapidly emerging as a core requirement for data management. Any vendor who is serious about being a leader in the next generation database market, has to have a mobile strategy.”
I have interviewed Bob Wiederhold, President and Chief Executive Officer of Couchbase.
Q1. On June 26, you have announced a $60 Million series E round of financing. What are Couchbase’s chances of becoming a major player in the database market (and not only in the NoSQL market)? And what is your strategy for achieving this?
Bob Wiederhold: Enterprises are moving from early NoSQL validation projects to mission critical implementations.
As NoSQL deployments evolve to support the core business, requirements for performance at scale and completeness increase. Couchbase Server is the most complete offering on the market today, delivering the performance, scalability and reliability that enterprises require.
Additionally, we see mobile rapidly emerging as a core requirement for data management. Any vendor who is serious about being a leader in the next generation database market, has to have a mobile strategy.
At this point, we are the only NoSQL vendor offering an embedded mobile database and the sync needed to manage data between the cloud, the device and other devices. We believe that having the most complete, best performing operational NoSQL database along with a comprehensive mobile offering, uniquely positions us for leadership in the NoSQL market.
Q2. Why Couchbase Lite is so strategically important for you?
Bob Wiederhold: First, because the world is going mobile. That is indisputable. Mobile initiatives top the list of every IT department. As I said above, if you don’t have a mobile data management offering, you are not looking at the complete needs of the developer or the enterprise.
Second, let’s level set on Couchbase Lite. Couchbase Lite is our offering for an embedded mobile JSON database.
Our complete mobile offering, Couchbase Mobile, includes Couchbase Server – for data management in the cloud, and Sync Gateway for synchronization of data stored on the device with other devices, or the database in the cloud.
Today, because connectivity is unknown, data synchronization challenges force developers to either choose a total online (data stored in the cloud), or total offline (data stored on the device) data management strategy.
This approach limits functionality, as when the network is unavailable, online apps may freeze and not work at all. People want access to their applications, travel, expense report, or multi-user collaboration etc., whether they’re online or not.
Couchbase Mobile is the only NoSQL offering available that allows developers to build JSON applications that work whether an application is online or off, and manages the synchronization of the data between those applications and the cloud, or other devices. This is revolutionary for the mobile world and we are seeing tremendous interest from the mobile developer community.
Q3 What can enterprise do with a NoSQL mobile database, that they would not be able to do with a non-mobile database?
Bob Wiederhold: Offline access and syncing has been too time and resource intensive for mobile app developers. With Couchbase Mobile, developers don’t have to spend months, or years, building a solution that can store unstructured data on the device and sync that data with external sources – whether that is the cloud or another device. With Couchbase Mobile, developers can easily create mobile applications that are not tied to connectivity or limited by sync considerations. This empowers developers to build an entirely new class of enterprise applications that go far beyond what is available today.
Q4 What kind of businesses and applications will benefit when people use a NoSQL databases on their mobile devices? Can you give us some examples?
Bob Wiederhold: Nearly every business can benefit from the use of a complete mobile solution to build always available apps that work offline or online. One business example is our customer Infinite Campus.
Focused on educational transformation through the use of information technology, Infinite Campus is looking at Couchbase Lite as a solution that will enable students to complete their homework modules even when they don’t have access to a network outside of school. Instructional videos and homework assignments can be selectively pushed to students’ mobile devices when they are online at school.
Using Couchbase Lite, students can work online at school and then complete their homework assignments anywhere – on or offline. And the data seamlessly syncs across devices and between users, so teachers and students can participate in real-time Q&A chat sessions during lectures.
Q5. Do you have some customers who have gone into production with that?
Bob Wiederhold: The product is new, but we already have several customers that are live.
In addition to Microsoft, we have several companies around the world. You can check out one iOS app by Spraed, who is using Couchbase Server – running on AWS, Sync Gateway and Couchbase Lite.
Q6. Couchbase Server is a JSON document-based database. Why this design choice?
Bob Wiederhold: The world is changing. Businesses need to be agile and responsive.
Relational databases, with rigid schema design, don’t allow for fast change. JSON is the next generation architecture that businesses are increasingly using for mission critical applications because the technology allows them to manage and react to all aspects of big data: volume, variety and velocity of data, as well as big users and do that in a cloud based landscape.
Q7. Do you have any plan to work with Cloud providers?
Bob Wiederhold: We already work with many cloud providers. We have a great relationship with Amazon Web Services and many of our customers, including WebMD and Viber, run on AWS.
We also have partnerships and customers running on Windows Azure, GoGrid, and others. More and more organizations are moving infrastructure to the cloud and we will continue expanding our eco system to give our customers the flexibility to choose the best deployment options for their businesses.
Q8. Do you see happening any convergence between operational data management and analytical data processing? And if yes, how?
Bob Wiederhold: Yes, Analytics can happen at real time, near real time in operational stores and in batch modes. We have several customers who are deploying and have deployed complete solutions to integrated operational big data with real time analytical processing. LivePerson has done some incredibly innovative work here. They have been very open about the work they are doing and you can hear them tell their story here.
Q9 Do you have any plan to integrate your system with platforms for use in big data analytics?
Bob Wiederhold: Absolutely and we are integrated today into many platforms, including Hadoop via our Couchbase Hadoop connector and have many customers using Couchbase Server with both realtime and batch mode analytics platforms. See Avira and LivePerson presentations for examples. We continue to work with big data ISVs to ensure our customers can easily integrate their systems with the analytics system of their choosing.
Bob has more than 25 years of high technology experience. Until an acquisition by IBM in 2008, Bob served as chairman, CEO, and president of Transitive Corporation, the worldwide leader in cross-platform virtualization with over 20 million users. Previously, he was president and CEO of Tality Corporation, the worldwide leader in electronic design services, whose revenues and size grew to almost $200 million and had 1,500 worldwide employees.
Bob held several executive general management positions at Cadence Design Systems, Inc., an electronic design automation company, which he joined in 1985 as an early stage start-up and helped to grow to more than $1.5 billion during his 13 years at the company. Bob also headed High Level Design Systems, a successful electronic design automation start-up that was acquired by Cadence in 1996. Bob has extensive board experience having served on both public (Certicom, HLDS) and private company boards (Snaketech, Tality, Transitive, FanfareGroup).
–Magic Quadrant for Operational Database Management Systems. 16 October 2014. Analyst(s): Donald Feinberg, Merv Adrian, Nick Heudecker, Gartner.
Follow ODBMS.org on Twitter: @odbmsorg
“HBase and Hadoop are the only technologies proven to scale to dozens of petabytes on commodity servers, currently being used by companies such as Facebook, Twitter, Adobe and Salesforce.com.”–Monte Zweben.
Is it possible to turn Hadoop into a RDBMS? On this topic, I have interviewed Monte Zweben, Co-Founder and Chief Executive Officer of Splice Machine.
Q1. What are the main challenges of applications and operational analytics that support real-time, interactive queries on data updated in real-time for Big Data?
Monte Zweben: Let’s break down “real-time, interactive queries on data updated in real-time for Big Data”. “Real-time, interactive queries” means that results need to be returned in milliseconds to a few seconds.
For “Data updated in real-time” to happen, changes in data should be reflected in milliseconds. “Big Data” is often defined as dramatically increased volume, velocity, and variety of data. Of these three attributes, data volume typically dominates, because unlike the other attributes, its growth is virtually unbounded.
Traditional RDBMSs like MySQL or Oracle can support real-time, interactive queries on data updated in real-time, but they struggle on handling Big Data. They can only scale up on larger servers that can cost hundreds of thousands, if not millions of dollars per server.
Big Data technologies such as Hadoop can easily handle Big Data data volumes with their ability to scale-out on commodity hardware. However, with their batch analytics heritage, they often struggle to provide real-time, interactive queries. They also lack ACID transactions to support data updated in real time.
So, real-time applications and operational analytics had to choose between real-time interactive queries on data updated in real-time, or Big Data volumes. With Splice Machine, these applications can have the best of both worlds: real-time interactive queries, the reliability of real-time updates on ACID transactions, and the ability to handle Big Data volumes with a 10x price/performance improvement over traditional RDBMSs.
Q2. You suggested that companies should replace their traditional RDBMS systems. Why and when? Do you really think this is always possible? What about legacy systems?
Monte Zweben: Companies should consider replacing their traditional RDBMSs when they experience significant cost or scaling issues. Our informal surveys of customers indicate that up to half of traditional RDBMSs experience cost or scaling issues. The biggest barrier to migrating from a traditional RDBMS to a new database like Splice Machine is converting custom stored procedure (e.g., PL/SQL code). Operational analytics often have limited custom stored procedure code, so the migration process is generally straightforward.
Operational applications typically have thousands of lines of custom stored procedure code, but in extreme cases it can run into hundreds of thousands to millions of lines of code. There are actually commercially-supported tools that will convert from PL/SQL to the Java needed for Splice Machine. We have typically seen them convert from 70-95% accurately, but it will obviously depend on the complexity of the original code. Financially, migration makes sense for many companies to get an ongoing 10x price/performance, but there are cases when it does not make sense because converting custom code is too expensive.
Q3. Is scale-out the solution to Big Data at scale? Why?
Monte Zweben: Scale-out is definitely the critical technology to making Big Data work at scale. Scale-out leverages inexpensive, commodity hardware to parallelize queries to easily achieve a 10x price/performance improvement over existing database technologies.
Q4. You have announced your real-time relational database management system. What is special about Splice Machine`s Hadoop RDBMS?
Monte Zweben: We are the only Hadoop RDBMS. There are obviously many RDBMSs, but we are the only one with scale-out technology from Hadoop. Hadoop is the only scale-out technology proven to scales into dozens of petabytes on commodity hardware at companies like Facebook. There are other SQL-on-Hadoop technologies, but none of them can support real-time ACID transactions.
Q5 Hadoop-connected SQL databases do not eliminate “silos”. How do you handle this?
Q6. How did you manage to move Hadoop beyond its batch analytics heritage to power operational applications and real-time analytics?
Monte Zweben: At its core, Hadoop is a distributed file system (HDFS) where data cannot be updated or deleted. If you want to update or delete anything, you have to reload all the data (i.e., batch load). As a file system, it has very limited ability to seek specific data; instead, you use Java MapReduce programs to scan all of the data to find the data you need. It can easily take hours or even days for queries to return data (i.e., batch analytics). There is no way you could support a real-time application on top of HDFS and MapReduce.
By using HBase (a real-time key value store on top of HDFS), Splice Machine provides a full RDBMS on top of Hadoop.
You can now get real-time, interactive queries on real-time updated data on Hadoop, necessary to support operational applications and analytics.
Q7. How do you use Apache Derby™ and Apache HBase™/Hadoop?
Monte Zweben: Splice Machine marries two proven technology stacks: Apache Derby for ANSI SQL and HBase/Hadoop for proven scale out technology. With over 15 years of development, Apache Derby is a Java-based SQL database. Splice Machine chose Derby because it is a full-featured ANSI SQL database, lightweight (<3 MB), and easy to embed into the HBase/Hadoop stack.
HBase and Hadoop are the only technologies proven to scale to dozens of petabytes on commodity servers, currently being used by companies such as Facebook, Twitter, Adobe and Salesforce.com. Splice Machine chose HBase and Hadoop because of their proven auto-sharding, replication, and failover technology.
Q8. Why did you replace the storage engine in Apache Derby with HBase?
Q9. Why did you redesign the planner, optimizer, and executor of Apache Derby?
Monte Zweben: We redesigned the planner, optimizer, and executor of Derby because Splice Machine has a distributed computing infrastructure instead of its old shared-disk storage. Distributed computing requires a functional re-architecting because computation must be distributed to where the data is, instead of moving the data to the computation.
Q10. What are the main benefits for developers and database architects who build applications?
Monte Zweben: There are two main benefits to Splice Machine for developers and database architects. First, no longer is data scaling a barrier to using massive amounts of data in an application; you no longer need to prune data or rewrite applications to do unnatural acts like manual sharding. Second, you can enjoy the scaling with all the critical features of an RDBMS – strong consistency, joins, secondary indexes for fast lookups, and reliable updates with transactions. Without those features, developers have to implement those functions for each application, a costly, time-consuming, and error-prone process.
Monte Zweben, Co-Founder and Chief Executive Officer, Splice Machine
A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Branch Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.
In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of Red Prairie. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company, and is the chairman of Clio Music, an advanced music research and development company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.
Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science. Monte’s involvement with CMU, which has been a long-time leader in distributed computing and Big Data research, helped inspire the original concept behind Splice Machine.
ODBMS.org: Several Free Resources on Hadoop.
–> FOLLOW ODBMS.ORG ON TWITTER: @odbmsorg