Managing Big Data. An interview with David Gorbet
“Executives and industry leaders are looking at the Big Data issue from a volume perspective, which is certainly an issue – but the increase in data complexity is the biggest challenge that every IT department and CIO must address, and address now.” - David Gorbet.
Managing unstructured Big Data is a challenge and an opportunity at the same time. I have interviewed David Gorbet, vice president of product strategy at MarkLogic.
Q1. You have been quoted saying that “more than 80 percent of today’s information is unstructured and it’s typically too big to manage effectively.” What do you mean by that?
David Gorbet: It used to be the case that all the data an organization needed to run its operations effectively was structured data that was generated within the organization. Things like customer transaction data, ERP data, etc.
Today, companies are looking to leverage a lot more data from a wider variety of sources both inside and outside the organization. Things like documents, contracts, machine data, sensor data, social media, health records, emails, etc. The list is endless really. A lot of this data is unstructured, or has a complex structure that’s hard to represent in rows and columns.
And organizations want to be able to combine all this data and analyze it together in new ways. For example, we have more than one customer in different industries whose applications combine geospatial vessel location data with weather and news data to make real-time mission-critical decisions.
MarkLogic was early in recognizing the need for data management solutions that can handle a huge volume of complex data in real time. We started the company a decade ago to solve this problem, and we now have over 300 customers who have been able to build mission-critical real-time Big Data Applications to run their operations on this complex unstructured data.
This trend is accelerating as businesses all over the world are realizing that their old relational technology simply can’t handle this data effectively.
Q2. In your opinion, how is the Big Data movement affecting the market demand for data management software?
David Gorbet: Executives and industry leaders are looking at the Big Data issue from a volume perspective, which is certainly an issue – but the increase in data complexity is the biggest challenge that every IT department and CIO must address, and address now.
Businesses across industries have to not only store the data but also be able to leverage it quickly and effectively to derive business value.
We allow companies to do this better than traditional solutions, and that’s why our customer base doubled last year and continues to grow rapidly. Big Data is a major driver for the acquisition of new technology, and companies are taking action and choosing us.
Q3. Why do you see MarkLogic as a replacement of traditional database systems, and not simply as a complementary solution?
David Gorbet: First of all, we don’t advocate ripping out all your infrastructure and replacing it with something new. We recognize that there are many applications where traditional relational database technology works just fine. That said, when it came time to build applications to process large volumes of complex data, or a wide variety of data with different schemas, most of our customers had struggled with relational technology before coming to us for help.
Traditional relational database systems just don’t have the ability to handle complex unstructured data like we do. Relational databases are very good solutions for managing information that fits in rows and columns; however, businesses are finding that getting value from unstructured information requires a totally new approach.
That approach is to use a database built from the ground up to store and manage unstructured information, and allow users to easily access the data, iterate on the data, and to build applications on top of it that utilize the data in new and exciting ways. As data evolves, the database must evolve with it, and MarkLogic plays a unique role as the only technology currently in the market that can fulfill that need.
Q4. How do you store and manage unstructured information in MarkLogic?
David Gorbet: MarkLogic uses documents as its native data type, which is a new way of storing information that better fits how information is already “shaped.”
To query, MarkLogic has developed an indexing system using techniques from search engines to perform database-style queries. These indexes are maintained as part of the insert or update transaction, so they’re available in real-time with no crawl delay.
For Big Data, search is an important component of the solution, and MarkLogic is the only technology that combines real-time search with database-style queries.
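Editor's illustration: the idea of maintaining a search index inside the same write operation that stores the document can be sketched in a few lines of Python. This is a toy model under stated assumptions, not MarkLogic's implementation: the class `TinyDocumentStore`, its methods, and the sample documents are all hypothetical, and a real system would add transactions, persistence, and far richer indexes.

```python
from collections import defaultdict


class TinyDocumentStore:
    """Toy document store whose term index is updated inside each write.

    Hypothetical illustration only; it does not reflect MarkLogic internals.
    """

    def __init__(self):
        self.docs = {}                  # doc_id -> document text
        self.index = defaultdict(set)   # term -> ids of documents containing it

    @staticmethod
    def _terms(text):
        # Naive tokenizer: lowercase and split on whitespace.
        return set(text.lower().split())

    def upsert(self, doc_id, text):
        # Remove postings for the previous version, if there was one.
        old_text = self.docs.get(doc_id)
        if old_text is not None:
            for term in self._terms(old_text):
                self.index[term].discard(doc_id)
        # Store the document and index it in the same step, so a search
        # issued right after the write already sees the new content.
        self.docs[doc_id] = text
        for term in self._terms(text):
            self.index[term].add(doc_id)

    def search(self, *terms):
        """Return the ids of documents that contain every given term."""
        postings = [self.index.get(term.lower(), set()) for term in terms]
        if not postings:
            return set()
        return set.intersection(*postings)


store = TinyDocumentStore()
store.upsert("d1", "vessel location near storm")
store.upsert("d2", "storm warning issued")
before_update = store.search("storm")

store.upsert("d1", "vessel docked")     # the update re-indexes d1 immediately
after_update = store.search("storm")
print(sorted(before_update), sorted(after_update))
```

Because the index is changed in the same call that stores the document, there is no separate crawl step between a write and the moment it becomes searchable, which is the property the answer above describes.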
Q5. Would you define MarkLogic as an XML Database? A NoSQL database? Or other?
David Gorbet: MarkLogic is a Big Data database, optimized for large volumes of complex structured or unstructured data.
We’re non-relational, so in that sense we’re part of the NoSQL movement. However, we built our database with all the traditional robust database functionality you’d expect and require for mission-critical applications, including failover for high availability, database replication for disaster recovery, journal archiving, and of course ACID transactions, which are critical to maintain data integrity.
If you think of what a next-generation database for today’s data should be, that’s MarkLogic.
Q6. MarkLogic has been working on techniques for storing and searching semantic information inside MarkLogic, and you have been running the Billion Triple Challenge, and the Lehigh University Benchmark. What were the main results of these tests?
David Gorbet: The testing showed that we could load 1 billion triples in less than 24 hours using approximately 750 gigabytes of disk and 150 gigabytes of RAM. Our LUBM query performance was extremely good, and in many cases superior, when compared to the performance from existing relational systems and dedicated triple stores.
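Editor's illustration: the figures quoted above imply some rough per-triple numbers, which the short calculation below works out. It assumes decimal gigabytes (10^9 bytes) and treats the quoted "less than 24 hours" as exactly 24 hours, so the load rate is a lower bound; the variable names are ours.

```python
# Figures quoted in the interview (both sizes are stated as approximate).
TRIPLES = 1_000_000_000
LOAD_SECONDS = 24 * 60 * 60      # upper bound on load time: 24 hours
DISK_BYTES = 750 * 10**9         # assumption: decimal gigabytes
RAM_BYTES = 150 * 10**9          # assumption: decimal gigabytes

# The load took less than 24 hours, so this is a minimum sustained rate.
min_triples_per_second = TRIPLES / LOAD_SECONDS
disk_bytes_per_triple = DISK_BYTES / TRIPLES
ram_bytes_per_triple = RAM_BYTES / TRIPLES

print(f"at least {min_triples_per_second:,.0f} triples/second")
print(f"about {disk_bytes_per_triple:.0f} bytes of disk per triple")
print(f"about {ram_bytes_per_triple:.0f} bytes of RAM per triple")
```

Under those assumptions the load sustained at least roughly 11,500 triples per second, at about 750 bytes of disk and 150 bytes of RAM per triple.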
Q7. Do you plan in the future to offer an open source API for your products?
David Gorbet: We have a thriving community of developers at community.marklogic.com where we make many of the tools, libraries, connectors, etc. that sit on top of our core server available for free, and in some cases as open source projects living on the social coding site GitHub.
For example, we publish the source for XCC, our connector for Java or .NET applications, and we have an open-source REST API there as well.
Q8. James Phillips from Couchbase said in an interview last year: “It is possible we will see standards begin to emerge, both in on-the-wire protocols and perhaps in query languages, allowing interoperability between NoSQL database technologies similar to the kind of interoperability we’ve seen with SQL and relational database technology.” What is your opinion on that?
David Gorbet: MarkLogic certainly sees the value of standards, and for years we’ve worked with the World Wide Web Consortium (W3C) standards groups in developing the XQuery and XSLT languages, which are used by MarkLogic for query and transformation. Interoperability helps drive collaboration and new ideas, and supporting standards will allow us to continue to be at the forefront of innovation.
Q9. MarkLogic and Hortonworks last March announced a partnership to enhance Real-Time Big Data Applications with Apache Hadoop. Can you explain how technically the combination of MarkLogic and Hadoop will work?
David Gorbet: Hadoop is a key technology for Big Data, but doesn’t provide the real-time capabilities that are vital for the mission-critical nature of so many organizations. MarkLogic brings that power to Hadoop, and is executing its Hadoop vision in stages.
Last November, MarkLogic introduced its Connector for Hadoop, and in March 2012, announced a partnership with leading Hadoop vendor Hortonworks. The partnership enables organizations in both the commercial and public sectors to seamlessly combine the power of MapReduce with MarkLogic’s real-time, interactive analysis and indexing on a single, unified platform.
With MarkLogic and Hortonworks, organizations have a fully supported big data application platform that enables real-time data access and full-text search together with batch processing and massive archival storage.
MarkLogic will certify its connector for Hadoop against the Hortonworks Data Platform, and the two companies will also develop reference-architectures for MarkLogic-Hadoop solutions.
Q10. How do you identify new insights and opportunities in Big Data without having to write more code and wait for the batch process to complete?
David Gorbet: The most impactful Big Data Applications will be industry- or even organization-specific, leveraging the data that the organization consumes and generates in the course of doing business. There is no single set formula for extracting value from this data; it will depend on the application.
That said, there are many applications where simply being able to comb through large volumes of complex data from multiple sources via interactive queries can give organizations new insights about their products, customers, services, etc.
Being able to combine these interactive data explorations with some analytics and visualization can produce new insights that would otherwise be hidden.
We call this Big Data Search.
For example, we recently demonstrated an application at MarkLogic World that shows through real-time co-occurrence analysis new insights about how products are being used. In our example, it was analysis of social media that revealed that Gatorade is closely associated with flu and fever, and our ability to drill seamlessly from high-level aggregate data into the actual source social media posts shows that many people actually take Gatorade to treat flu symptoms. Geographic visualization shows that this phenomenon may be regional. Our ability to sift through all this data in real-time, using fresh data gathered from multiple sources, both internal and external to the organization, helps our customers identify new actionable insights.
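Editor's illustration: co-occurrence analysis with drill-down to the source posts can be sketched as follows. This is a minimal sketch, not the application shown at MarkLogic World: the function `cooccurrence`, the tracked vocabulary, and the four sample posts are invented for illustration and are not real data.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(posts, vocabulary):
    """Count how often pairs of tracked terms appear in the same post.

    Returns (pair_counts, pair_sources): the aggregate counts, plus the ids
    of the posts behind each pair so a user can drill down to the sources.
    """
    pair_counts = Counter()
    pair_sources = defaultdict(list)
    for post_id, text in posts.items():
        words = set(text.lower().split())
        present = sorted(words & vocabulary)    # tracked terms in this post
        for pair in combinations(present, 2):   # every unordered pair
            pair_counts[pair] += 1
            pair_sources[pair].append(post_id)
    return pair_counts, pair_sources


# Invented sample posts, for illustration only.
posts = {
    "p1": "home sick with flu drinking gatorade all day",
    "p2": "gatorade and soup for this fever",
    "p3": "flu season again stocking up on gatorade",
    "p4": "great run today then gatorade",
}
vocabulary = {"gatorade", "flu", "fever", "run"}

counts, sources = cooccurrence(posts, vocabulary)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)       # aggregate view
print(sources[top_pair])         # drill-down to the underlying posts
```

Keeping the post ids next to each count is what allows the move from the aggregate view to the individual posts; a production system would compute the same thing from indexes rather than by scanning every post.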
David Gorbet is the vice president of product strategy for MarkLogic.
Gorbet brings almost two decades of experience delivering some of the highest-volume applications and enterprise software in the world. Prior to MarkLogic, Gorbet helped pioneer Microsoft’s business online services strategy by founding and leading the SharePoint Online team.
– Lecture Notes on “Data Management in the Cloud”.
by Michael Grossniklaus, and David Maier, Portland State University.
The topics covered in the course range from novel data processing paradigms (MapReduce, Scope, DryadLINQ), to commercial cloud data management platforms (Google BigTable, Microsoft Azure, Amazon S3 and Dynamo, Yahoo PNUTS) and open-source NoSQL databases (Cassandra, MongoDB, Neo4J).
Lecture Notes | Intermediate | English | ~280 slides (PDF) | 2011-12