On MarkLogic 8. Interview with Stephen Buxton
“When trades are reconciled with counterparties and then closed, updates can and do occur. Bitemporal helps ensure investment banks can always go back and see when updates occurred for specific trades. This is critical to managing risk and handling increased concerns about regulatory compliance and future audits. “– Stephen Buxton.
MarkLogic recently released MarkLogic 8. I wanted to know more about this release. For that, I have interviewed Stephen Buxton, Senior Director, Product Management at MarkLogic.
Q1. You have recently launched MarkLogic® 8 software release. How is it positioned in the Big Data market? How does it differentiate from other products from NoSQL vendors?
Stephen Buxton: MarkLogic 8 is our biggest release ever, further solidifying MarkLogic’s position in the market as the only Enterprise NoSQL database.
With MarkLogic 8, you can now store, manage and search JSON, XML, and RDF all in one unified platform—without sacrificing enterprise features such as transactional consistency, security, or backup and recovery.
While other database companies are still figuring out how to strengthen their platform and add features like transactional consistency, we’ve moved far ahead of them by working on new innovative features such as Bitemporal and Semantics. It’s for these reasons that over 500 enterprise organizations have chosen MarkLogic to run their mission-critical applications.
MarkLogic 8 is more powerful, agile, and trusted than ever before, and is an ideal platform for doing two things: making heterogeneous data integration simpler and faster; and for doing dynamic content delivery at massive scale.
Relational databases do not offer enough flexibility—integration projects can take multiple years, cost millions of dollars, and struggle at scale. But, the newer NoSQL databases that do have agility still lack the enterprise features required to run in the data centers at large organizations. MarkLogic is the only NoSQL database that is able to solve today’s challenge, having the flexibility to serve as an operational and analytical database for all of an organization’s data.
Within MarkLogic, the JSON structure is mapped directly to the internal structure already used by the XML document format, so it has the same speed and scalability as with XML. This also means that all of the production-proven indexing, data management, and security capabilities that MarkLogic is known for are fully maintained.
Q3. In MarkLogic 8 you have been adding full SPARQL 1.1 support and Inferencing capability. Could you please explain what kind of Inferencing capability did you add and what are they useful for?
Stephen Buxton: We made a big leap forward on the semantics foundation that was laid in our previous release, adding full SPARQL 1.1 support, which includes support for property paths, aggregates, and SPARQL Update. Support for automatic inferencing was also added, which is a powerful capability that allows the database to combine existing data and apply pre-defined rules to infer new data. SPARQL 1.1 is a standard defined by the W3C that is supported by many RDF triple stores. But, MarkLogic differentiates itself among triple stores as you can store your documents and data right alongside your triples, and you can query across all three data models easily and efficiently.
Automatic inferencing is a really powerful feature that is part of an overall strategy to provide a more intelligent data layer so that you can build smarter apps.
With inferencing, for example, if you had two pieces of data stored as RDF triples, such as “John lives in Virginia” and “Virginia is in the United States”, then MarkLogic 8 could infer the new fact, “John lives in the United States.”
This can make search results richer and also show you new relationships in your data.
In MarkLogic 8, rules for inferencing are applied at query time. This approach is referred to as backward-chaining inference, a very flexible approach in which only the required rules are applied for each query, so the server does the minimum work necessary to get the correct results; and when your data or ontology or rules sets change, that change is available immediately – it takes effect with the very next query. And, of course, inference queries are transactional, distributed, and obey MarkLogic’s rule-based security, just like any other query. MarkLogic 8 has supplied rule sets for RDFS, RDFS-Plus, OWL-Horst, and their subsets; and you can create your own. With MarkLogic 8 you can further restrict any SPARQL query (with or without inference) by any document attribute, including timestamp, provenance, or even a bitemporal constraint.
More details and examples can be found at developer.marklogic.com.
Q4. The additions to SPARQL include Property Paths, Aggregates, and SPARQL Update. Could you please explain briefly each of them?
Stephen Buxton: SPARQL 1.1 brings support for property paths, aggregates, and SPARQL Update. These capabilities make working with RDF data simpler and more powerful, which means increased context for your data—all using the SPARQL 1.1 industry standard query language.
SPARQL 1.1’s property paths let you traverse an RDF graph – bouncing from point-to-point across a graph. This graph traversal allows you to do powerful, complex queries such as, “Show me all the people who are connected to John” by finding people that know John, and people that know people that know John, and so on.
With aggregate SPARQL functions you can do analytic queries over hundreds of billions of triples. MarkLogic 8 supports all the SPARQL 1.1 Aggregate functions – COUNT, SUM, MIN, MAX, and AVG – as well as the grouping operations GROUP BY, GROUP BY .. HAVING, GROUP_CONCAT and SAMPLE.
SPARQL 1.1 also includes SPARQL Update. With these capabilities, you can delete, insert, and update (delete/insert) individual triples, and manipulate RDF graphs, all using SPARQL 1.1.
Q5. The addition of SPARQL Update capabilities could have the potential to influence the capability you offer of a RDF triple store that scales horizontally and manages billions of triples. Any comment on that?
Stephen Buxton: The enhancements in MarkLogic 8 make it able to function as a full-featured, stand-alone triple store– this means you can now get a triple store that is horizontally scalable as part of a shared-nothing cluster, and still get all of the enterprise features MarkLogic is known for such as such as High Availability, Disaster Recovery, and certified security. Beyond that, anyone looking for “just a triple store” will find they can also store, manage, and query documents and data in the same database, a unique capability that only MarkLogic has.
Q6. You have been adding a so called Bitemporal Data Management. What is it and why is it useful?
Stephen Buxton: Bitemporal is a new feature that allows you to ask, “What did you know and when did you know it?” The MarkLogic Bitemporal feature answers this critical question by tracking what happened, when it happened, and when we found out. A bitemporal database is much more powerful than a temporal database that can only track when something happened. The difference between when something happened and when you found out about it can be incredibly significant, particularly when it comes to audits and regulation.
A bitemporal database tracks time across two different axes, the system and valid time axes. This allows you to go back in time and explore data, manage historical data across systems, ensure data integrity, and do complex bitemporal analysis. You can answer complex questions such as:
• Where did John Thomas live on August 20th as we knew it on September 1st?
• Where was the Blue Van on October 12th as we knew it on October 23rd?
Bitemporal is important for a wide variety of use cases across industries. Getting a more accurate picture of a business at different points-in-time used to be impossible, or very challenging at best. Bitemporal helps ensure that you always have a full and accurate picture of your data at every point-in-time, which is particularly useful in regulated industries.
• Regulatory requirements – Avoid the increasingly harsh downside consequences from not adhering to government and industry regulations, particularly in financial services and insurance
• Audits – Preserve the history of all your data, including the changes made to it, so that clear audits can be conducted without having to worry about lost data, data integrity, or cumbersome ETL processes with archived data
• Investigations and Intelligence – No more lost emails and no more missing information. Bitemporal databases never erase data, so it is possible to see exactly how data was updated based on what was known at the time
• Business Analytics – Run complex queries that were not previously possible in order to better understand your business and answer new questions about how different decisions and changes in the past could have led to different results
• Cost reduction – Manage data with a smaller footprint as the shape of the data changes, avoiding the need to set up additional databases for historical data.
Bitemporal is enhanced by MarkLogic’s Tiered Storage, which allows you to more easily archive your data to cheaper storage tiers with little administrative overhead. This keeps Bitemporal simple, and obviates the high cost imposed by the few relational databases that do have Bitemporal. MarkLogic also eliminates the schema roadblocks that relational databases that have Bitemporal struggle with. MarkLogic is schema-agnostic and can adjust to the shape of data as that data changes over time.
Q7. How is bitemporal different from versioning?
Stephen Buxton: Bitemporal works by ingesting bitemporal documents that are managed as a series of documents with range indexes for valid and system time axes. Documents are stored in a temporal collection protected by security permissions. The initial document inserted into the database is kept and never changes, allowing you to track the provenance of information with full governance and immutability.
Q8. Could you give us some examples of how Bitemporal Data Management could be useful applications for the financial services industry?
Stephen Buxton: One example of Bitemporal is trade reconciliation in financial services. When trades are reconciled with counterparties and then closed, updates can and do occur. Bitemporal helps ensure investment banks can always go back and see when updates occurred for specific trades. This is critical to managing risk and handling increased concerns about regulatory compliance and future audits.
Imagine the Head of IT Architecture at a major bank working on mining information and looking for changes in risk profiles. The risk profiles cannot be accurately calculated without having an accurate picture of the reference and trade data, and how it changed over time. This task becomes simple and fast using Bitemporal.
Qx Anything else you wish to add?
Stephen Buxton: In addition to innovative features such as Bitemporal and Semantics, and features that make MarkLogic more widely accessible in the developer community, there are other updates in Marklogic 8 that make it easier to administer and manage. For example, Incremental Backup, another feature added in MarkLogic 8, allows DBAs to perform backups faster while using less storage.
With MarkLogic 8, you can have multiple daily incremental backups with only a minimal impact on database performance. This feature is one worth highlighting because it will help make DBAs live much easier, and will save an organization time and money.
It’s just another example of MarkLogic’s continuing dedication to being an enterprise NoSQL database that is more powerful, agile, and trusted than anything else.
Stephen Buxton is Senior Director of Product Management for Search and Semantics at MarkLogic, where he has been a member of the Products team for 8 years. Stephen focuses on bringing a rich semantic search experience to users of the MarkLogic NoSQL database, document store, and triple store. Before joining MarkLogic, Stephen was Director of Product Management for Text and XML at Oracle Corporation.
–On making information accessible. Interview with David Leeming. ODBMS Industry Watch, July 30, 2014
Follow ODBMS.org on Twitter: @odbmsorg