On Silos, Data Integration and Data Security. Interview with David Gorbet
“Data integration isn’t just about moving data from one place to another. It’s about building an actionable, operational view on data that comes from multiple sources so you can integrate the combined data into your operations rather than just looking at it later as you would in a typical warehouse project.” — David Gorbet.
I have interviewed David Gorbet, Senior Vice President, Engineering at MarkLogic. We cover several topics in the interview: silos, data integration, data quality, security and the new features of MarkLogic 9.
Q1. Data integration is the number one challenge for many organisations. Why?
David Gorbet: There are three ways to look at that question. First, why do organizations have so many data silos? Second, what’s the motivation to integrate these silos, and third, why is this so hard?
Our Product EVP, Joe Pasqua, did an excellent presentation on the first question at this year’s MarkLogic World. The spoiler is that silos are a natural and inevitable result of an organization’s success. As companies become more successful, they start to grow. As they grow, they need to partition in order to scale. To function, these partitions need to run somewhat autonomously, which inevitably creates silos.
Another way silos enter the picture is what I call “application accretion” or less charitably, “crusty application buildup.” Companies merge, and now they have two HR systems. Divisions acquire special-purpose applications and now they have data that exists only in those applications. IT projects are successful and now need to add capabilities, but it’s easier to bolt them on and move data back and forth than to design them into an existing IT system.
Two years ago I proposed a data-centric view of the world versus an application-centric view. If you think about it, most organizations have a relatively small number of “things” that they care deeply about, but a very large number of “activities” they do with these “things.”
For example, most organizations have customers, but customer-related activities happen all across the organization.
Sales is selling to them. Marketing is messaging to them. Support is helping solve their problems. Finance is billing them. And so on… All these activities are designed to be independent because they take place in organizational silos, and the data silos just reflect that. But the data is all about customers, and each of these activities would benefit greatly from information generated by and maintained in the other silos. Imagine if Marketing could know what customers use the product for to tailor the message, or if Sales knew that the customer was having an issue with the product and was engaged with Support? Sometimes dealing with large organizations feels like dealing with a crazy person with multiple personalities. Organizations that can integrate this data can give their customers a much better, saner experience.
And it’s not just customers. Maybe it’s trades for a financial institution, or chemical compounds for a pharmaceutical company, or adverse events for a life sciences company, or “entities of interest” for an intelligence or police organization. Getting a true, 360-degree view of these things can make a huge difference for these organizations.
In some cases, like with one customer I spoke about in my most recent MarkLogic World keynote who looks at the environment of potentially at-risk children, it can literally mean the difference between life and death.
So why is this so hard? Because most technologies require you to create data models that can accommodate everything you need to know about all of your data in advance, before you can even start the data integration project. They also require you to know the types of queries you’re going to do on that data so you can design efficient schemas and indexing schemes.
This is true even of some NoSQL technologies that require you to figure out sharding and compound indexing schemes in advance of loading your data. As I demonstrated in that keynote I mentioned, even if you have a relatively small set of entities that are quite simple, this is incredibly hard to do.
Usually it’s so hard that instead organizations decide to do a subset of the integration to solve a specific need or answer a specific question. Sadly, this tends to create yet another silo.
Q2. Integrate data from silos: how is it possible?
David Gorbet: Data integration isn’t just about moving data from one place to another. It’s about building an actionable, operational view on data that comes from multiple sources so you can integrate the combined data into your operations rather than just looking at it later as you would in a typical warehouse project.
How do you do that? You build an operational data hub that can consume data from multiple sources and expose APIs on that data so that downstream consumers, either applications or other systems, can consume it in real time. To do this you need an infrastructure that can accommodate the variability across silos naturally, without a lot of up-front data modeling, and without each silo having a ripple effect on all the others.
For the engineers out there (like me), think of this as trying to turn an O(n²) problem into an O(n) problem.
As the number of silos increases, most projects get exponentially more complex, since you can only have one schema and every new silo impacts that schema, which is shared by all data across all existing silos. You want a technology where adding a new data silo does not require re-doing all the work you’ve already done. In addition, you need a flexible technology that allows a flexible data model that can adapt to change. Change in both what data is used and in how it’s used. A system that can evolve with the evolving needs of the business.
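One rough way to picture the O(n²) versus O(n) contrast is to count mapping work. The sketch below assumes a simplified cost model that is not from the interview: with a single shared schema, adding the k-th silo forces a review of the k-1 mappings already built, while with a schema-flexible hub each silo is mapped once. The function names are hypothetical.

```python
def shared_schema_work(n_silos: int) -> int:
    """Total mapping effort when every new silo changes the one shared schema."""
    total = 0
    for k in range(1, n_silos + 1):
        total += 1        # map the new silo
        total += k - 1    # revisit every silo that was mapped before it
    return total


def flexible_hub_work(n_silos: int) -> int:
    """Total mapping effort when each silo is loaded independently of the others."""
    return n_silos


for n in (5, 10, 20):
    print(n, shared_schema_work(n), flexible_hub_work(n))
```

Under this model the shared-schema total grows as n(n+1)/2, while the hub total grows linearly with the number of silos.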
MarkLogic can do this because it can ingest data with multiple different schemas and index and query it together.
You don’t have to create one schema that can accommodate all your data. Our built-in application services allow our customers to build APIs that expose the data directly from their data hub, and with ACID transactions, these APIs can be used to build real operational applications.
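To make the idea of indexing and querying differently shaped data together more concrete, here is a minimal sketch in plain Python. It is not MarkLogic code and says nothing about how MarkLogic is implemented; the `DataHub` class, the record shapes, and the identifiers are all made up for illustration. It stores documents as they arrive and indexes every leaf value, so one lookup can return records from two silos that share no schema.

```python
from collections import defaultdict

# Two hypothetical silos describing the same customer with different shapes.
crm_record = {"customer": {"id": "C-17", "name": "Acme Corp", "region": "EMEA"}}
support_ticket = {"ticket": {"customerId": "C-17", "severity": "high",
                             "summary": "Login failures after upgrade"}}


def leaf_values(doc, path=()):
    """Yield (path, value) pairs for every leaf in a nested dict."""
    for key, value in doc.items():
        if isinstance(value, dict):
            yield from leaf_values(value, path + (key,))
        else:
            yield path + (key,), value


class DataHub:
    """Stores documents as-is and indexes every leaf value, whatever the shape."""

    def __init__(self):
        self.docs = {}
        self.value_index = defaultdict(set)   # value -> ids of docs containing it

    def ingest(self, doc_id, doc):
        self.docs[doc_id] = doc
        for _path, value in leaf_values(doc):
            self.value_index[value].add(doc_id)

    def find_by_value(self, value):
        return [self.docs[i] for i in sorted(self.value_index.get(value, ()))]


hub = DataHub()
hub.ingest("crm/1", crm_record)
hub.ingest("support/1", support_ticket)

# One query returns documents from both silos without a merged schema.
matches = hub.find_by_value("C-17")
print(len(matches))
```

Adding a third silo here means one more `ingest` call; nothing about the first two has to be remodeled.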
Q3. What is the problem with traditional solutions like relational databases and Extract, Transform and Load (ETL) tools?
David Gorbet: To use a metaphor, most technology used for this type of project is like concrete. Now concrete is incredibly versatile. You can make anything you want out of concrete: a bench, a statue, a building, a bridge… But once you’ve made it, you’d better like it because if you want to change it you have to get out the jackhammer.
Many projects that use these tools start out with lofty goals, and they spend a lot of time upfront modeling data and designing schemas. Very quickly they realize that they are not going to be able to make that magical data model that can accommodate everything and be efficiently queried. They start to cut corners to make their problem more tractable, or they design flexible but overly generic models like tall thin tables that are inefficient to query. Every corner they cut limits the types of applications they can then build on the resulting integrated data, and inevitably they end up needing some data they left behind, or needing to execute a query they hadn’t planned (and built an index) for.
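The "tall thin table" trade-off can be illustrated with a small, self-contained example using Python's built-in `sqlite3` module. The table name, attributes, and data are hypothetical. The model accepts any attribute without a schema change, but every attribute used in a filter or in the output costs another self-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (entity_id TEXT, attribute TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("c1", "name", "Acme Corp"), ("c1", "region", "EMEA"), ("c1", "tier", "gold"),
        ("c2", "name", "Globex"), ("c2", "region", "EMEA"), ("c2", "tier", "silver"),
    ],
)

# "Names of gold-tier customers in EMEA" needs one join per attribute involved.
rows = conn.execute(
    """
    SELECT n.value
    FROM facts AS n
    JOIN facts AS r ON r.entity_id = n.entity_id
    JOIN facts AS t ON t.entity_id = n.entity_id
    WHERE n.attribute = 'name'
      AND r.attribute = 'region' AND r.value = 'EMEA'
      AND t.attribute = 'tier'   AND t.value = 'gold'
    """
).fetchall()
print(rows)
```

With three attributes this is still readable; with dozens of attributes and millions of rows, the join count is what makes such generic models slow to query.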
Usually at some point they decide to change the model from a hub-and-spoke data integration model to a point-to-point model, because point-to-point integrations are much easier. That, or it evolves as new requirements emerge, and it becomes impossible to keep up by jackhammering the system and starting over. But this just pushes the complexity out of these now point-to-point flows and into the overall system architecture. It also causes huge governance problems, since data now flows in lots of directions and is transformed in many ways that are generally pretty opaque and hard to trace. The inability to capture and query metadata about these data flows causes master-data problems and governance problems, to the point where some organizations genuinely have no idea where potentially sensitive data is being used. The overall system complexity also makes it hard to scale and expensive to operate.
Q4. What are the typical challenges of handling both structured and unstructured data?
David Gorbet: It’s hard enough to integrate structured data from multiple silos. Everything I’ve already talked about applies even if you have purely structured data. But when some of your data is unstructured, or has a complex, variable structure, it’s much harder. A lot of data has a mix of structured data and unstructured text: medical records, journal articles, contracts, emails, tweets, specifications, product catalogs, etc. The traditional solution to textual data in a relational world is to put it in an opaque BLOB or CLOB, and then surface its content via a search technology that can crawl the data and build indexes on it. This approach suffers from several problems.
First, it involves stitching together multiple different technologies, each of which has its own operational and governance characteristics. They don’t scale the same way. They don’t have the same security model (unless they have no security model, which is actually pretty common). They don’t have the same availability characteristics or disaster recovery model.
They don’t back up consistently with each other. The indexes are separate, so they can’t be queried together, and keeping them in sync so that they’re consistent is difficult or impossible.
Second, more and more text is being mined for structure. There are technologies that can identify people, places, things, events, etc. in freeform text and structure it. Sentiment analysis is being done to add metadata to text. So it’s no longer accurate to think of text as islands of unstructured data inside a structured record. It’s more like text and structure are inter-mixed at all levels of granularity. The resulting structure is by its nature fluid, and therefore incompatible with the up-front modeling required by relational technology.
Third, search engines don’t index structure unless you tell them to, which essentially involves explaining the “schema” of the text to them so that they can build facets and provide structured search capabilities. So even in your “unstructured” technology, you’re often dealing with schema design.
Finally, as powerful as it is, search technology doesn’t know anything about the semantics of the data. Semantic search enables a much richer search and discovery experience. Look for example at the info box to the right of your Google results. This is provided by Google’s knowledge graph, a graph of data using Semantic Web technologies. If you want to provide this kind of experience, where the system can understand concepts and expand or narrow the context of the search accordingly, you need yet another technology to manage the knowledge graph.
Two years ago at my MarkLogic World keynote I said that search is the query language for unstructured data, so if you have a mix of structured and unstructured data, you need to be able to search and query together. MarkLogic lets you mix structured and unstructured search, as well as semantic search, all in one query, resolved in one technology.
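As a rough illustration of searching and querying together, the sketch below keeps a word index and a field index side by side in one place and resolves a text term plus structured constraints in a single call. It is a toy in plain Python, not MarkLogic's query API; the records, the `search` function, and the field names are assumptions made for the example, and semantic search is left out.

```python
import re
from collections import defaultdict

# Hypothetical records mixing structured fields with free text.
records = {
    1: {"type": "contract", "year": 2015, "text": "Supplier shall indemnify the buyer."},
    2: {"type": "email", "year": 2016, "text": "Please indemnify us for the delay."},
    3: {"type": "contract", "year": 2016, "text": "Payment is due within 30 days."},
}

word_index = defaultdict(set)    # word -> record ids
field_index = defaultdict(set)   # (field, value) -> record ids

for rec_id, rec in records.items():
    for word in re.findall(r"[a-z0-9]+", rec["text"].lower()):
        word_index[word].add(rec_id)
    for field in ("type", "year"):
        field_index[(field, rec[field])].add(rec_id)


def search(word, **fields):
    """Return ids matching a text term AND every structured field constraint."""
    hits = set(word_index.get(word.lower(), set()))
    for field, value in fields.items():
        hits &= field_index.get((field, value), set())
    return sorted(hits)


print(search("indemnify", type="contract"))
```

Because both indexes are built from the same records in the same pass, there is no second system to keep in sync, which is the problem described above with a database plus a separate search engine.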
Q5. An important aspect when analysing data is Data Quality. How do you evaluate if the data is of good or of bad quality?
David Gorbet: Data quality is tough, particularly when you’re bringing data together from multiple silos. Traditional technologies require you to transform the data from one schema into another in order to move it from place to place. Every transformation leaves some data behind, and every one has the potential to be a point of data loss or data corruption if the transformation isn’t perfect. In addition, the lineage of the data is often lost. Where did this attribute of this entity come from? When was it extracted? What was the transform that was run on it? What did it look like before?
All of this is lost in the ETL process. The best way to ensure data quality is to always bring along with each record the original, untransformed data, as well as metadata tracing its provenance, lineage and context.
MarkLogic lets you do this, because our flexible schema accommodates source data, canonicalized (transformed) data, and metadata all in the same record, and all of it is queryable together. So if you find a bug in your transform, it’s easy to query for all impacted records, and because you have the source data there, you can easily fix it as well.
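The pattern of keeping source data, canonical data, and metadata in the same record can be sketched as follows. This is plain Python with hypothetical field names (`source`, `canonical`, `meta`) and made-up transforms; it only illustrates why a buggy transform is easy to repair when the original data travels with the record.

```python
def transform_v1(source):
    """Buggy hypothetical transform: keeps the raw name, stray spaces included."""
    return {"name": source["NAME"]}


def transform_v2(source):
    """Fixed transform: trims surrounding whitespace."""
    return {"name": source["NAME"].strip()}


def make_envelope(source, transform, version, origin):
    """Keep source, canonical form, and provenance together in one record."""
    return {
        "source": source,
        "canonical": transform(source),
        "meta": {"origin": origin, "transform_version": version},
    }


hub_records = [
    make_envelope({"NAME": " Acme Corp "}, transform_v1, "v1", "crm"),
    make_envelope({"NAME": "Globex"}, transform_v2, "v2", "billing"),
]

# Find every record produced by the buggy transform and rebuild it from source.
for record in hub_records:
    if record["meta"]["transform_version"] == "v1":
        record["canonical"] = transform_v2(record["source"])
        record["meta"]["transform_version"] = "v2"

print([r["canonical"]["name"] for r in hub_records])
```

The untransformed `source` is never modified, so the same repair could be repeated if a later transform version turned out to be wrong as well.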
In addition, our Bitemporal feature can trace changes to a record over time, and let you query your data as it is, as it was, or as you thought it was at any given point in time or over any historical (or in some cases future) time range. So you have traceability when your data changes, and you can understand how and why it has changed.
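The "as it is, as it was, or as you thought it was" distinction comes from tracking two time axes per record version: when a fact was true, and when the database knew about it. The following is a minimal sketch of that general bitemporal idea in plain Python. The `Version` class, the dates, and the `as_of` helper are hypothetical and do not reflect MarkLogic's actual Bitemporal API.

```python
from dataclasses import dataclass
from datetime import date

FOREVER = date.max


@dataclass(frozen=True)
class Version:
    value: str
    valid_from: date     # when the fact became true in the real world
    valid_to: date
    system_from: date    # when the database recorded this version
    system_to: date


# Hypothetical history of one customer's city.
history = [
    # Recorded on Jan 5: the customer lives in Boston from Jan 1 onward.
    Version("Boston", date(2016, 1, 1), FOREVER, date(2016, 1, 5), date(2016, 3, 10)),
    # Correction recorded on Mar 10: the customer actually moved to Austin on Mar 1.
    Version("Boston", date(2016, 1, 1), date(2016, 3, 1), date(2016, 3, 10), FOREVER),
    Version("Austin", date(2016, 3, 1), FOREVER, date(2016, 3, 10), FOREVER),
]


def as_of(valid_on, known_on):
    """What did we believe on `known_on` about the state on `valid_on`?"""
    for v in history:
        if (v.valid_from <= valid_on < v.valid_to
                and v.system_from <= known_on < v.system_to):
            return v.value
    return None


print(as_of(date(2016, 3, 5), date(2016, 3, 6)))    # belief before the correction
print(as_of(date(2016, 3, 5), date(2016, 3, 20)))   # belief after the correction
```

Nothing is overwritten when the correction arrives: the old version is closed off on the system axis, so both answers remain queryable.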
Q6. Data leakage is another problem for many corporations that have experienced high-profile security incidents. What can be done to solve this problem?
David Gorbet: Security is another important aspect of data governance. And security isn’t just about locking all your data in a vault and only letting some people look at it. Security is more granular than that. There are some data that can be seen by just about anyone in your organization. Some that should only be seen by people who need it, and some that should be hidden from all but people with specific roles. In some cases, even users with a particular role should not see data unless they have a provable need in addition to the role required. This is called “compartment security,” meaning you have to be in a certain compartment to see data, regardless of your role or clearance overall.
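A minimal sketch of combining role checks with compartment checks might look like this. The semantics chosen here are an assumption made for illustration: any one matching role is enough, but the user must belong to every compartment on the document. The `User`, `Document`, and `can_read` names are hypothetical, not a real product API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset = field(default_factory=frozenset)
    compartments: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class Document:
    title: str
    read_roles: frozenset                                        # ANY one role suffices
    compartments: frozenset = field(default_factory=frozenset)   # ALL are required


def can_read(user: User, doc: Document) -> bool:
    has_role = bool(user.roles & doc.read_roles)
    in_compartments = doc.compartments <= user.compartments
    return has_role and in_compartments


analyst = User("ana", frozenset({"analyst"}))
cleared = User("cy", frozenset({"analyst"}), frozenset({"case-42"}))
memo = Document("Case memo", frozenset({"analyst"}), frozenset({"case-42"}))

print(can_read(analyst, memo), can_read(cleared, memo))
```

Both users hold the same role, yet only the one placed in the `case-42` compartment can read the memo, which is the "role plus provable need" situation described above.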
There is a principle in security called “defense in depth.” Basically it means pushing the security to the lowest layer possible in the stack. That’s why it’s critically important that your DBMS have strong and granular security features.
This is especially true if you’re integrating data from silos, each of which may have its own security rules.
You need your integrated data hub to be able to observe and enforce those rules, regardless of how complex they are.
Increasingly the concern is over the so-called “insider threat.” This is the employee, contractor, vendor, managed service provider, or cloud provider who has access to your infrastructure. This is another good reason not to implement security in your application: if you do, any DBA will be able to circumvent it. Today, with the move to cloud and other outsourced infrastructure, organizations are also concerned about what’s on the file system. Even if you secure your data at the DBMS layer, a system administrator with file system access can still get at it. To counter this, more organizations are requiring “at rest” encryption of data, which means that the data is encrypted on the file system. A good implementation will require a separate role to manage encryption keys, different from the DBA or SA roles, along with a separate key management technology. In our implementation, MarkLogic never even sees the database encryption keys, relying instead on a separate key management system (KMS) to unlock data for us. This separation of concerns is a lot more secure, because it would require insiders to collude across functions and organizations to steal data. You can even keep your data in the cloud and your keys on-premises, or with another managed service provider.
Q8. What is new in the MarkLogic® 9 database?
David Gorbet: There’s so much in MarkLogic 9 it’s hard to cover all of it. That presentation I referenced earlier from Joe does a pretty good job of summarizing the features. Many of the features in MarkLogic 9 are designed to make data integration even easier. MarkLogic 9 has new ways of modeling data that can keep it in its flexible document form, but project it into tabular form for more traditional analysis (aggregates, group-bys, joins, etc.) using either SQL or a NoSQL API we call the Optic API. This allows you to define the structured parts of your data and let MarkLogic index it in a way that makes it most efficient to query and aggregate.
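The idea of keeping data in document form while projecting it into tabular form can be sketched in plain Python. This is not the Optic API or MarkLogic's SQL support, whose actual syntax is not shown in this interview; the `view` mapping, the `project` helper, and the order documents are made up to show the concept only.

```python
from collections import defaultdict

# Hypothetical order documents; the second carries an extra nested field.
orders = [
    {"order": {"id": 1, "customer": {"region": "EMEA"}, "total": 120.0}},
    {"order": {"id": 2, "customer": {"region": "APAC", "vip": True}, "total": 80.0}},
    {"order": {"id": 3, "customer": {"region": "EMEA"}, "total": 50.0}},
]

# A made-up "view" definition: column name -> path into the document.
view = {
    "order_id": ("order", "id"),
    "region": ("order", "customer", "region"),
    "total": ("order", "total"),
}


def get_path(doc, path):
    """Follow a sequence of keys into a nested dict."""
    for key in path:
        doc = doc[key]
    return doc


def project(doc, view):
    """Turn one document into one flat row; fields outside the view are ignored."""
    return {column: get_path(doc, path) for column, path in view.items()}


rows = [project(doc, view) for doc in orders]

# Equivalent in spirit to: SELECT region, SUM(total) ... GROUP BY region
totals = defaultdict(float)
for row in rows:
    totals[row["region"]] += row["total"]

print(dict(totals))
```

The documents themselves are untouched, so the extra `vip` field stays available to document-style queries even though the tabular view does not mention it.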
You can also use this technique to extract RDF triples from your data, giving you easy access to the full power of Semantics technologies.
We’re doing more to make it easier to get data into MarkLogic via a new data movement SDK that you can hook directly up to your data pipeline. This SDK can help orchestrate transformations and parallel loads of data no matter where it comes from.
We’re also doubling down on security. Earlier I mentioned encryption at rest. That’s a new feature for MarkLogic 9.
We’re also doing sub-record-level role- and compartment-based access control. This means that if you have a record (like a customer record) that you want to make broadly available, but there is some data in that record (like a SSN) that you want to restrict access to, you can easily do that. You can also obfuscate and transform data within a record to redact it for export or for use in a context that is less secure than MarkLogic.
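The two ideas in this answer, hiding part of a record from users without the right role and masking a value for export, can be sketched like this. The policy format, the role names, and both helper functions are hypothetical, and the SSN is an obviously fake example value.

```python
import copy

# Hypothetical policy: field path -> roles allowed to see the real value.
protected_fields = {("customer", "ssn"): {"compliance"}}

record = {"customer": {"name": "Jane Doe", "ssn": "123-45-6789", "city": "Austin"}}


def view_for(record, user_roles):
    """Return a copy of the record without the fields this user may not see."""
    visible = copy.deepcopy(record)
    for path, allowed_roles in protected_fields.items():
        if not (user_roles & allowed_roles):
            parent = visible
            for key in path[:-1]:
                parent = parent[key]
            parent.pop(path[-1], None)    # field is simply absent for this user
    return visible


def redact_ssn(record):
    """Mask all but the last four digits, e.g. before exporting the record."""
    redacted = copy.deepcopy(record)
    ssn = redacted["customer"]["ssn"]
    redacted["customer"]["ssn"] = "***-**-" + ssn[-4:]
    return redacted


print(view_for(record, {"sales"}))
print(view_for(record, {"compliance"}))
print(redact_ssn(record))
```

Both helpers work on copies, so the stored record is never altered; only what a given user or export target receives changes.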
Security is a governance feature, and we’re improving other governance features as well, with policy-based tiering for lifecycle management, and improvements to our Bitemporal feature that make it a full-fledged compliance feature.
We’re introducing new tools to help monitor and manage multiple clusters at a time. And we’re making many other improvements in many other areas, like our new geospatial region index that makes region-region queries much faster, improvements to tools like Query Console and MLCP, and many, many more.
One exciting feature that is a bit hard to understand at first is our new Entity Services feature. You can think of this as a catalog of entities. You can put whatever you want in this catalog: entity attributes, relationships, etc., but also policies, governance rules, and other entity class metadata. This is a queryable semantic model, so you can query your catalog at runtime in your application. We’ll also be providing tools that use this catalog to help build the right set of indexes, indexing templates, APIs, etc. for your specific data. Over time, Entity Services will become the foundation of our vision of the “smart database.” You’ll hear us start talking a lot more about that soon.
David Gorbet, Senior Vice President, Engineering, MarkLogic.
David Gorbet has the best job in the world. As SVP of Engineering, David manages the team that delivers the MarkLogic product and supports our customers as they use it to power their amazing applications. Working with all those smart, talented engineers as they pour their passion into our product is a humbling experience, and seeing the creativity and vision of our customers and how they’re using our product to change their industry is simply awesome.
Prior to MarkLogic, David helped pioneer Microsoft’s business online services strategy by founding and leading the SharePoint Online team. In addition to SharePoint Online, David has held a number of positions at Microsoft and elsewhere with a number of enterprise server products and applications, and numerous incubation products.
David holds a Bachelor of Applied Science Degree in Systems Design Engineering with an additional major in Psychology from the University of Waterloo, and an MBA from the University of Washington Foster School of Business.
– Join the Early Access program for a MarkLogic 9 introduction by visiting: ea.marklogic.com
– The MarkLogic Developer License is free to all who sign up and join the MarkLogic developer community.