“Data integration isn’t just about moving data from one place to another. It’s about building an actionable, operational view on data that comes from multiple sources so you can integrate the combined data into your operations rather than just looking at it later as you would in a typical warehouse project.” — David Gorbet.
I have interviewed David Gorbet, Senior Vice President,Engineering at MarkLogic. We cover several topics in the interview: Silos, Data integration, data quality, security and the new features of MarkLogic 9.
Q1. Data integration is the number one challenge for many organisations. Why?
David Gorbet: There are three ways to look at that question. First, why do organizations have so many data silos? Second, what’s the motivation to integrate these silos, and third, why is this so hard?
Our Product EVP, Joe Pasqua, did an excellent presentation on the first question at this year’s MarkLogic World. The spoiler is that silos are a natural and inevitable result of an organization’s success. As companies become more successful, they start to grow. As they grow, they need to partition in order to scale. To function, these partitions need to run somewhat autonomously, which inevitably creates silos.
Another way silos enter the picture is what I call “application accretion” or less charitably, “crusty application buildup.” Companies merge, and now they have two HR systems. Divisions acquire special-purpose applications and now they have data that exists only in those applications. IT projects are successful and now need to add capabilities, but it’s easier to bolt them on and move data back and forth than to design them into an existing IT system.
Two years ago I proposed a data-centric view of the world versus an application-centric view. If you think about it, most organizations have a relatively small number of “things” that they care deeply about, but a very large number of “activities” they do with these “things.”
For example, most organizations have customers, but customer-related activities happen all across the organization.
Sales is selling to them. Marketing is messaging to them. Support is helping solve their problems. Finance is billing them. And so on… All these activities are designed to be independent because they take place in organizational silos, and the data silos just reflect that. But the data is all about customers, and each of these activities would benefit greatly from information generated by and maintained in the other silos. Imagine if Marketing could know what customers use the product for to tailor the message, or if Sales knew that the customer was having an issue with the product and was engaged with Support? Sometimes dealing with large organizations feels like dealing with a crazy person with multiple personalities. Organizations that can integrate this data can give their customers a much better, saner experience.
And it’s not just customers. Maybe it’s trades for a financial institution, or chemical compounds for a pharmaceutical company, or adverse events for a life sciences company, or “entities of interest” for an intelligence or police organization. Getting a true, 360-degree view of these things can make a huge difference for these organizations.
In some cases, like with one customer I spoke about in my most recent MarkLogic World keynote who looks at the environment of potentially at-risk children, it can literally mean the difference between life and death.
So why is this so hard? Because most technologies require you to create data models that can accommodate everything you need to know about all of your data in advance, before you can even start the data integration project. They also require you to know the types of queries you’re going to do on that data so you can design efficient schemas and indexing schemes.
This is true even of some NoSQL technologies that require you to figure out sharding and compound indexing schemes in advance of loading your data. As I demonstrated in that keynote I mentioned, even if you have a relatively small set of entities that are quite simple, this is incredibly hard to do.
Usually it’s so hard that instead organizations decide to do a subset of the integration to solve a specific need or answer a specific question. Sadly, this tends to create yet another silo.
Q2. Integrate data from silos: how is it possible?
David Gorbet: Data integration isn’t just about moving data from one place to another. It’s about building an actionable, operational view on data that comes from multiple sources so you can integrate the combined data into your operations rather than just looking at it later as you would in a typical warehouse project.
How do you do that? You build an operational data hub that can consume data from multiple sources and expose APIs on that data so that downstream consumers, either applications or other systems, can consume it in real time. To do this you need an infrastructure that can accommodate the variability across silos naturally, without a lot of up-front data modeling, and without each silo having a ripple effect on all the others.
For the engineers out there (like me), think of this as trying to turn an O(n2) problem into an O(n) problem.
As the number of silos increases, most projects get exponentially more complex, since you can only have one schema and every new silo impacts that schema, which is shared by all data across all existing silos. You want a technology where adding a new data silo does not require re-doing all the work you’ve already done. In addition, you need a flexible technology that allows a flexible data model that can adapt to change. Change in both what data is used and in how it’s used. A system that can evolve with the evolving needs of the business.
MarkLogic can do this because it can ingest data with multiple different schemas and index and query it together.
You don’t have to create one schema that can accommodate all your data. Our built-in application services allows our customers to build APIs that expose the data directly from their data hub and with ACID transactions, these APIs can be used to build real operational applications.
Q3. What is the problem with traditional solutions like relational databases, Extract Transform and Load (ETL) tools?
David Gorbet: To use a metaphor, most technology used for this type of project is like concrete. Now concrete is incredibly versatile. You can make anything you want out of concrete: a bench, a statue, a building, a bridge… But once you’ve made it, you’d better like it because if you want to change it you have to get out the jackhammer.
Many projects that use these tools start out with lofty goals, and they spend a lot of time upfront modeling data and designing schemas. Very quickly they realize that they are not going to be able to make that magical data model that can accommodate everything and be efficiently queried. They start to cut corners to make their problem more tractable, or they design flexible but overly generic models like tall thin tables that are inefficient to query. Every corner they cut limits the types of applications they can then build on the resulting integrated data, and inevitably they end up needing some data they left behind, or needing to execute a query they hadn’t planned (and built an index) for.
Usually at some point they decide to change the model from a hub-and-spoke data integration model to a point-to-point model, because point-to-point integrations are much easier. That, or it evolves as new requirements emerge, and it becomes impossible to keep up by jackhammering the system and starting over. But this just pushes the complexity out of these now point-to-point flows and into the overall system architecture. It also causes huge governance problems, since data now flows in lots of directions and is transformed in many ways that are generally pretty opaque and hard to trace. The inability to capture and query metadata about these data flows causes master-data problems and governance problems, to the point where some organizations genuinely have no idea where potentially sensitive data is being used. The overall system complexity also makes it hard to scale and expensive to operate.
Q4. What are the typical challenges of handling both structured, and unstructured data?
David Gorbet: It’s hard enough to integrate structured data from multiple silos. Everything I’ve already talked about applies even if you have purely structured data. But when some of your data is unstructured, or has a complex, variable structure, it’s much harder. A lot of data has a mix of structured data and unstructured text. Medical records, journal articles, contracts, emails, tweets, specifications, product catalogs, etc. The traditional solution to textual data in a relational world is to put it in an opaque BLOB or CLOB, and then surface its content via a search technology that can crawl the data and build indexes on it. This approach suffers from several problems.
First, it involves stitching together multiple different technologies, each of which has its own operational and governance characteristics. They don’t scale the same way. They don’t have the same security model (unless they have no security model, which is actually pretty common). They don’t have the same availability characteristics or disaster recovery model.
They don’t backup consistently with each other. The indexes are separate, so they can’t be queried together, and keeping them in sync so that they’re consistent is difficult or impossible.
Second, more and more text is being mined for structure. There are technologies that can identify people, places, things, events, etc. in freeform text and structure it. Sentiment analysis is being done to add metadata to text. So it’s no longer accurate to think of text as islands of unstructured data inside a structured record. It’s more like text and structure are inter-mixed at all levels of granularity. The resulting structure is by its nature fluid, and therefore incompatible with the up-front modeling required by relational technology.
Third, search engines don’t index structure unless you tell them to, which essentially involves explaining the “schema” of the text to them so that they can build facets and provide structured search capabilities. So even in your “unstructured” technology, you’re often dealing with schema design.
Finally, as powerful as it is, search technology doesn’t know anything about the semantics of the data. Semantic search enables a much richer search and discovery experience. Look for example at the info box to the right of your Google results. This is provided by Google’s knowledge graph, a graph of data using Semantic Web technologies. If you want to provide this kind of experience, where the system can understand concepts and expand or narrow the context of the search accordingly, you need yet another technology to manage the knowledge graph.
Two years ago at my MarkLogic World keynote I said that search is the query language for unstructured data, so if you have a mix of structured and unstructured data, you need to be able to search and query together. MarkLogic lets you mix structured and unstructured search, as well as semantic search, all in one query, resolved in one technology.
Q5. An important aspect when analysing data is Data Quality. How do you evaluate if the data is of good or of bad quality?
David Gorbet: Data quality is tough, particularly when you’re bringing data together from multiple silos. Traditional technologies require you to transform the data from one schema into another in order to move it from place to place. Every transformation leaves some data behind, and every one has the potential to be a point of data loss or data corruption if the transformation isn’t perfect. In addition, the lineage of the data is often lost. Where did this attribute of this entity come from? When was it extracted? What was the transform that was run on it? What did it look like before?
All of this is lost in the ETL process. The best way to ensure data quality is to always bring along with each record the original, untransformed data, as well as metadata tracing its provenance, lineage and context.
MarkLogic lets you do this, because our flexible schema accommodates source data, canonicalized (transformed) data, and metadata all in the same record, and all of it is queryable together. So if you find a bug in your transform, it’s easy to query for all impacted records, and because you have the source data there, you can easily fix it as well.
In addition, our Bitemporal feature can trace changes to a record over time, and let you query your data as it is, as it was, or as you thought it was at any given point in time or over any historical (or in some cases future) time range. So you have traceability when your data changes, and you can understand how and why it has changed.
Q6. Data leakage is another problem for many corporations that experienced high profile security incidents. What can be done to solve this problem?
David Gorbet: Security is another important aspect of data governance. And security isn’t just about locking all your data in a vault and only letting some people look at it. Security is more granular than that. There are some data that can be seen by just about anyone in your organization. Some that should only be seen by people who need it, and some that should be hidden from all but people with specific roles. In some cases, even users with a particular role should not see data unless they have a provable need in addition to the role required. This is called “compartment security,” meaning you have to be in a certain compartment to see data, regardless of your role or clearance overall.
There is a principle in security called “defense in depth.” Basically it means pushing the security to the lowest layer possible in the stack. That’s why it’s critically important that your DBMS have strong and granular security features.
This is especially true if you’re integrating data from silos, each of which may have its own security rules.
You need your integrated data hub to be able to observe and enforce those rules, regardless of how complex they are.
Increasingly the concern is over the so-called “insider threat.” This is the employee, contractor, vendor, managed service provider, or cloud provider who has access to your infrastructure. Another good reason not to implement security in your application, because if you do, any DBA will be able to circumvent it. Today, with the move to cloud and other outsourced infrastructure, organizations are also concerned about what’s on the file system. Even if you secure your data at the DBMS layer, a system administrator with file system access can still get at it. To counter this, more organizations are requiring “at rest” encryption of data, which means that the data is encrypted on the file system. A good implementation will require a separate role to manage encryption keys, different from the DBA or SA roles, along with a separate key management technology. In our implementation, MarkLogic never even sees the database encryption keys, relying instead on a separate key management system (KMS) to unlock data for us. This separation of concerns is a lot more secure, because it would require insiders to collude across functions and organizations to steal data. You can even keep your data in the cloud and your keys on-premises, or with another managed service provider.
Q8. What is new in MarkLogic® 9 database? ?
David Gorbet: There’s so much in MarkLogic 9 it’s hard to cover all of it. That presentation I referenced earlier from Joe does a pretty good job of summarizing the features. Many of the features in MarkLogic 9 are designed to make data integration even easier. MarkLogic 9 has new ways of modeling data that can keep it in its flexible document form, but project it into tabular form for more traditional analysis (aggregates, group-bys, joins, etc.) using either SQL or a NoSQL API we call the Optic API. This allows you to define the structured parts of your data and let MarkLogic index it in a way that makes it most efficient to query and aggregate.
You can also use this technique to extract RDF triples from your data, giving you easy access to the full power of Semantics technologies.
We’re doing more to make it easier to get data into MarkLogic via a new data movement SDK that you can hook directly up to your data pipeline. This SDK can help orchestrate transformations and parallel loads of data no matter where it comes from.
We’re also doubling down on security. Earlier I mentioned encryption at rest. That’s a new feature for MarkLogic 9.
We’re also doing sub-record-level role- and compartment-based access control. This means that if you have a record (like a customer record) that you want to make broadly available, but there is some data in that record (like a SSN) that you want to restrict access to, you can easily do that. You can also obfuscate and transform data within a record to redact it for export or for use in a context that is less secure than MarkLogic.
Security is a governance feature, and we’re improving other governance features as well, with policy-based tiering for lifecycle management, and improvements to our Bitemporal feature that make it a full-fledged compliance feature.
We’re introducing new tools to help monitor and manage multiple clusters at a time. And we’re making many other improvements in many other areas, like our new geospatial region index that makes region-region queries much faster, improvements to tools like Query Console and MLCP, and many, many more.
One exciting feature that is a bit hard to understand at first is our new Entity Services feature. You can think of this as a catalog of entities. You can put whatever you want in this catalog. Entity attributes, relationships, etc. but also policies, governance rules, and other entity class metadata. This is a queryable semantic model, so you can query your catalog at runtime in your application. We’ll also be providing tools that use this catalog to help build the right set of indexes, indexing templates, APIs, etc. for your specific data. Over time, Entity Services will become the foundation of our vision of the “smart database.” You’ll hear us start talking a lot more about that soon.
David Gorbet, Senior Vice President, Engineering, MarkLogic.
David Gorbet has the best job in the world. As SVP of Engineering, David manages the team that delivers the MarkLogic product and supports our customers as they use it to power their amazing applications. Working with all those smart, talented engineers as they pour their passion into our product is a humbling experience, and seeing the creativity and vision of our customers and how they’re using our product to change their industry is simply awesome.
Prior to MarkLogic, David helped pioneer Microsoft’s business online services strategy by founding and leading the SharePoint Online team. In addition to SharePoint Online, David has held a number of positions at Microsoft and elsewhere with a number of enterprise server products and applications, and numerous incubation products.
David holds a Bachelor of Applied Science Degree in Systems Design Engineering with an additional major in Psychology from the University of Waterloo, and an MBA from the University of Washington Foster School of Business.
–Join the Early Access program for a MarkLogic 9 introduction by visiting: ea.marklogic.com
-The MarkLogic Developer License is free to all who sign up and join the MarkLogic developer community.
– On Data Governance. Interview with David Saul. ODBMS Industry Watch, 2016-07-23
– On Data Interoperability. Interview with Julie Lockner. ODBMS Industry Watch, 2016-06-07
– On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, 2016-05-24
Follow us on Twitter: @odbmsorg
“Any important data driving a business decision needs to be sanity checked, just as it would if one was using a spreadsheet.”–Dave Thomas.
I have interviewed Dave Thomas,Chief Scientist at Kx Labs.
Q1. For many years business users have had their data locked up in databases and data warehouses. What is wrong with that?
Dave Thomas: It isn’t so much an issue of where the data resides, whether it is in files, databases, data warehouses or a modern data lake. The challenge is that modern businesses need access to the raw data, as well as the ability to rapidly aggregate and analyze their data.
Q2. Typical business intelligence (BI) tool users have never seen their actual data. Why?
Dave Thomas: For large corporations hardware and software both used to be prohibitively expensive, hence much of their data was aggregated prior to making it available to users. Even today when machines are very inexpensive most corporate IT infrastructures are impoverished relative to what one can buy on the street or in the Cloud.
Compounding the problem, IT charge-back mechanisms are biased to reduce IT spending rather than to maximize the value of data delivered to the business.
Traditional technologies are not sufficiently performant to allow processing of large volumes of data.
Many companies have inexpensive data lakes and have realized after the fact that using a commodity storage systems, such as HDFS, has severely constrained their performance and limited their utility. Hence more corporations are moving data away from HDFS into high-performance storage or memory.
Q3. What are the limitations of the existing BI and extract, transform and load (ETL) data tools?
Dave Thomas: Traditional BI tools assume that it is possible for DBAs and BI experts to a priori define the best way to structure and query the data. This reduces the whole power of BI to mere reporting. In an attempt to deal with huge BI backlogs, generic query and reporting tools have become popular to shift reporting to self-serve. However, they are often designed for sophisticated BI users rather than for normal business users. They are often not performant because they depend on the implementation of the underlying data stores.
For the most part, existing ETL tools are constrained by having to move the data to the ETL process and then on to the end user. Many ETL tools only work against one kind of data source. ETL can’t be written by normal users and due to the cost of an incorrect ETL run, such tools are not available to the data analyst. One of the major topics of discussion in Big Data shops is the complexity and performance of their Big Data pipeline. ETL, data blending, shouldn’t be a separate process or product. It should be something one can do with queries in a single efficient data language.
Q4. What are the typical technical challenges in finance, IoT and other time-series applications?
1. Speed, as data volumes and variety are always increasing.
2. Ability to deal with both real-time events and historical events efficiently. Ideally in a single technology.
3. To handle time-series one needs to be able to deal with simultaneous arrival of events. Time with nanosecond precision is our solution. Other solutions are constrained by using milliseconds and event counters that are much less efficient.
4. High-performance operations on time, over days, months and years are essential for time-series. This is why time is a native type in Kx.
5. The essence of time-series is processing sliding time windows of data for both joins and aggregations.
6. In IOT, data is always dirty. Kx’s native support for missing data and out of band data due to failing sensors, allows one to deal with the realities of sensor data.
Q5. Kx offers analysts a language called q. Why not extend standard SQL?
Dave Thomas: I think there is a misunderstanding about q. Q is a full functional data language that both includes and extends SQL. Selects are easier than SQL because they provide implicit joins and group-bys. This makes queries roughly 50% of the code of SQL. Unlike many flavors of SQL, q lets one put a functional expression in any position in an SQL statement. One can easily extend the aggregation operations available to the end-user.
Q6. Can you show the difference between a query written in q and in standard SQL?
Dave Thomas: Here’s an example of retrieving parts from an orders table with a foreign key join to a parts table, summing by quantity and then sorting by color:
select sum qty by p.color from sp
select p.color, sum(sp.qty) from sp, p
where sp.p=p.p group by p.color order by color
Q7. How do queries execute inside the database?
Dave Thomas: Q is native to the database engine. Hence queries and analytics execute in the columns of the Kx database. There is no data shipping between the client and database server.
Q8. Shawn Rogers of Dell said: “A ‘citizen data scientist’ is an everyday, non-technical user that lacks the statistical and analytical prowess of a traditional data scientist, but is equally eager to leverage data in order to uncover insights, and importantly, do so at the speed of business.” What is your take on this?
Dave Thomas: High-performance data technologies, such as Kx, using modern large-memory hardware, can support data analysts versus data scientist queries. In the product Analyst for Kx, for example, users can work interactively on a sample of data using visual tools to import, clean, query, transform, analyze and visualize data with minimal, if any programming or even SQL. Given correct operations on one or more samples they then can be run against trillions of rows of data. Data analysts today can truly live in their data.
Q9. What are the risks of bringing the power of analytics to users who are non-expert programmers?
Dave Thomas: Clearly any important analysis needs to be validated and cross-checked. Hence any important data driving a business decision needs to be sanity checked, just as it would if one was using a spreadsheet.
In our experience users do make initial mistakes, but as they live in their data they quickly learn.
Visualization really helps, as does the provision of metadata about the data sources. Reducing the cycle time provides increased understanding, and allows one to make mistakes.
Runaway query performance has been a concern of DBAs, but for many years frameworks have been in place such as our smart query router that will ensure that ad hoc queries against massive datasets are throttled so they don’t run away. Fortunately, recent cost reductions in non-volatile memory make it possible to have high-performance query-only replicas of data that can be made available to different parts of the organization based on its needs.
Q10. How can non-expert programmers understand if the information expressed in visual analytics such as heat maps or in operational dashboard charts, is of good quality or not?
Dave Thomas: In our experience users spot visual anomalies much faster than inconsistencies in a spreadsheet.
Q11. What are the opportunities arising in “democratizing” the use of massive data sets?
Dave Thomas: We are finally living in a world where for many companies it is possible to run a real-time business where everyone can have fast, efficient access to the data they need. Rather than being held hostage to aggregations, spreadsheets and all sorts of variants of the truth, the organization can expediently see new opportunities to improve results in sales, marketing, production and other business operations.
Q12. How important is data query and data semantics?
Dave Thomas: Unfortunately we are not educated on how to express data semantics and data query.
Even computer scientists often study less about writing queries than how to execute them efficiently.
We need to educate students and employees on how to live in their data. It may well be that the future of programming for most will be writing queries. Given powerful data languages even compiler optimizations can be expressed by queries.
We need to invest much more in data governance and the use of standard terminology in order to share data within and across companies.
Dave Thomas, Kx Labs.
As Chief Scientist Dave envisions the future roadmap for Kx tools. Dave has had a long and storied career in computer software development and is perhaps best known as the founder and past CEO of Object Technology International, formerly OTI, now IBM OTI Labs, a pioneer in Agile Product Development. He was the principal visionary and architect for IBM VisualAge Smalltalk and Java tools and virtual machines including the popular open-source, multi-language Eclipse.org IDE. As the cofounder of Bedarra Research Labs he led the creation of the Ivy visual analytics workbench. Dave is a renowned speaker, university lecturer and Chairman of the Australian developer YOW! conferences.
– New Kx release includes encryption, enhanced compression and Tableau integration. ODBMS.org JULY 4, 2016.
–Kdb+ and the Internet of Things/Big Data. InDetail Paper by Bloor Research Author: Philip Howard. ODBMS.org- JANUARY 28, 2015
– Democratizing fast access to Big Data. By Dave Thomas, chief scientist at Kx Labs. ODBMS.org-April 26, 2016
–On Data Governance. Interview with David Saul. ODBMS Industry Watch, Published on 2016-07-23
–On the Challenges and Opportunities of IoT. Interview with Steve Graves. ODBMS Industry Watch, Published on 2016-07-06
–On Data Analytics and the Enterprise. Interview with Narendra Mulani. ODBMS Industry Watch, Published on 2016-05-24
Follow us on Twitter: @odbmsorg
“Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size?.–Jim Starkey.
I have interviewed Jim Starkey. A database legend, Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history.
Q1. In your opinion, what are the most significant advances in databases in the last few years?
Jim Starkey: I’d have to say the “atom programming model” where a database is layered on a substrate of peer-to-peer replicating distributed objects rather than disk files. The atom programming model enables scalability, redundancy, high availability, and distribution not available in traditional, disk-based database architectures.
Q2. What was your original motivation to invent the NuoDB Emergent Architecture?
Jim Starkey: It all grew out of a long Sunday morning shower. I knew that the performance limits of single-computer database systems were in sight, so distributing the load was the only possible solution, but existing distributed systems required that a new node copy a complete database or partition before it could do useful work. I started thinking of ways to attack this problem and came up with the idea of peer to peer replicating distributed objects that could be serialized for network delivery and persisted to disk. It was a pretty neat idea. I came out much later with the core architecture nearly complete and very wrinkled (we have an awesome domestic hot water system).
Q3. In your career as an entrepreneur and architect what was the most significant innovation you did?
Jim Starkey: Oh, clearly multi-generational concurrency control (MVCC). The problem I was trying to solve was allowing ad hoc access to a production database for a 4GL product I was working on at the time, but the ramifications go far beyond that. MVCC is the core technology that makes true distributed database systems possible. Transaction serialization is like Newtonian physics – all observers share a single universal reference frame. MVCC is like special relativity, where each observer views the universe from his or her reference frame. The views appear different but are, in fact, consistent.
Q4. Proprietary vs. open source software: what are the pros and cons?
Jim Starkey: It’s complicated. I’ve had feet in both camps for 15 years. But let’s draw a distinction between open source and open development. Open development – where anyone can contribute – is pretty good at delivering implementations of established technologies, but it’s very difficult to push the state of the art in that environment. Innovation, in my experience, requires focus, vision, and consistency that are hard to maintain in open development. If you have a controlled development environment, the question of open source versus propriety is tactics, not philosophy. Yes, there’s an argument that having the source available gives users guarantees they don’t get from proprietary software, but with something as complicated as a database, most users aren’t going to try to master the sources. But having source available lowers the perceived risk of new technologies, which is a big plus.
Q5. You led the Falcon project – a transactional storage engine for the MySQL server- through the acquisition of MySQL by Sun Microsystems. What impact did it have this project in the database space?
Jim Starkey: In all honesty, I’d have to say that Falcon’s most important contribution was its competition with InnoDB. In the end, that competition made InnoDB three times faster. Falcon, multi-version in memory using the disk for backfill, was interesting, but no matter how we cut it, it was limited by the performance of the machine it ran on. It was fast, but no single node database can be fast enough.
Q6. What are the most challenging issues in databases right now?
Jim Starkey: I think it’s time to step back and reexamine the assumptions that have accreted around database technology – data model, API, access language, data semantics, and implementation architectures. The “relational model”, for example, is based on what Codd called relations and we call tables, but otherwise have nothing to do with his mathematic model. That model, based on set theory, requires automatic duplicate elimination. To the best of my knowledge, nobody ever implemented Codd’s model, but we still have tables which bear a scary resemblance to decks of punch cards. Are they necessary? Or do they just get in the way?
Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size?. SQL has no provision for flexible text search, no provision for multi-column, multi-table search, and no mechanics in the APIs to handle the results if it could do them. And this is just one a dozen problems that SQL databases can’t handle. It was a really good technical fit for computers, memory, and disks of the 1980’s, but is it right answer now?
Q7. How do you see the database market evolving?
Jim Starkey: I’m afraid my crystal ball isn’t that good. Blobs, another of my creations, spread throughout the industry in two years. MVCC took 25 years to become ubiquitous. I have a good idea of where I think it should go, but little expectation of how or when it will.
Qx. Anything else you wish to add?
Jim Starkey: Let me say a few things about my current project, AmorphousDB, an implementation of the Amorphous Data Model (meaning, no data model at all). AmorphousDB is my modest effort to question everything database.
The best way to think about Amorphous is to envision a relational database and mentally erase the boxes around the tables so all records free float in the same space – including data and metadata. Then, if you’re uncomfortable, add back a “record type” attribute and associated syntactic sugar, so table-type semantics are available, but optional. Then abandon punch card data semantics and view all data as abstract and subject to search. Eliminate the fourteen different types of numbers and strings, leaving simply numbers and strings, but add useful types like URL’s, email addresses, and money. Index everything unless told not to. Finally, imagine an API that fits on a single sheet of paper (OK, 9 point font, both sides) and an implementation that can span hundreds of nodes. That’s AmorphousDB.
Jim Starkey invented the NuoDB Emergent Architecture, and developed the initial implementation of the product. He founded NuoDB [formerly NimbusDB] in 2008, and retired at the end of 2012, shortly before the NuoDB product launch.
Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history from the Datacomputer project on the fledgling ARPAnet to his most recent startup, NuoDB, Inc. Through the period, he has been
responsible for many database innovations from the date data type to the BLOB to multi-version concurrency control (MVCC). Starkey has extensive experience in proprietary and open source software.
Starkey joined Digital Equipment Corporation in 1975, where he created the Datatrieve family of products, the DEC Standard Relational Interface architecture, and the first of the Rdb products, Rdb/ELN. Starkey was also software architect for DEC’s database machine group.
Leaving DEC in 1984, Starkey founded Interbase Software to develop relational database software for the engineering workstation market. Interbase was a technical leader in the database industry producing the first commercial implementations of heterogeneous networking, blobs, triggers, two phase commit, database events, etc. Ashton-Tate acquired Interbase Software in 1991, and was, in turn, acquired by Borland International a few months later. The Interbase database engine was released open source by Borland in 2000 and became the basis for the Firebird open source database project.
In 2000, Starkey founded Netfrastructure, Inc., to build a unified platform for distributable, high quality Web applications. The Netfrastructure platform included a relational database engine, an integrated search engine, an integrated Java virtual machine, and a high performance page generator.
MySQL, AB, acquired Netfrastructure, Inc. in 2006 to be the kernel of a wholly owned transactional storage engine for the MySQL server, later known as Falcon. Starkey led the Falcon project through the acquisition of MySQL by Sun Microsystems.
Jim has a degree in Mathematics from the University of Wisconsin.
For amusement, Jim codes on weekends, while sailing, but not while flying his plane.
Follow us on Twitter: @odbmsorg
“Intelligent system designers do have ethical responsibilities.”
I have interviewed John Markoff, technology writer at The New York Times.
In 2013 he was awarded a Pulitzer Prize.
The interview is related to his recent book “Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots, published in August of 2015 by HarperCollins Ecco.
Q1. Do you share the concerns of prominent technology leaders such as Tesla’s chief executive, Elon Musk, who suggested we might need to regulate the development of artificial intelligence?
John Markoff: I share their concerns, but not their assertions that we may be on the cusp of some kind of singularity or rapid advance to artificial general intelligence. I do think that machine autonomy raises specific ethical and safety concerns and regulation is an obvious response.
Q2. How difficult is it to reconcile the different interests of the people who are involved in a direct or indirect way in developing and deploying new technology?
John Markoff: This is why we have governments and governmental regulation. I think AI, in that respect is no different than any other technology. It should and can be regulated when human safety is at stake.
Q3. In your book Machines of Loving Grace you argued that “we must decide to design ourselves into our future, or risk being excluded from it altogether”. What do you mean by that?
John Markoff: You can use AI technologies either to automate or to augment humans. The problem is minimized when you take an approach that is based on human centric design principles.
Q4. How is it possible in practice? Isn’t the technology space dominated by giants such as IBM, Apple,Google who dictate the direction of new technology?
John Markoff: This is a very interesting time with “giant” technology companies realizing that there are consequences in the deployment of these technologies. Google, IBM and Microsoft have all recently made public commitments to the safe use of AI.
Q5. What are the most significant new developments in the humans-computers area, that are likely to have a significant influence in our daily life in the near future?
John Markoff: One of the best things about being a reporter is that you don’t have to predict the future. You only have to note what the various visionaries say, so you can call that to their attention when their predictions prove inaccurate. With that caveat, if I am forced to bet on any particular information technology it would be augmented reality. This is because I believe that multi-touch interfaces for mobile devices simply can’t be the last step in user interface.
Q6. Do you believe that robots will really transform modern life?
John Markoff: I struggle with the definition of what is a “robot.” If something is tele-operated, for example, is it a robot? That said I think that we will increasingly be surrounded by machines that perform tasks.
The question is will they come as quickly as Silicon Valley seems to believe. My friend Paul Saffo has said, “Never mistake a clear view for a short distance.” And I think that is the case with all kinds of mobile robots, including self driving cars.
Q7. For the designers of Intelligent Systems, how difficult is to draw a line between what is human and what is machine?
John Markoff: I feel strongly that the possibility of designing cyborgs, particularly with respect to intellectual prosthesis is a boundary we should cross with great caution. Remember the Borg from StarTrek. “Resistance is futile, you will be assimilated.” I think the challenge is to use these systems to enhance human thought, not for social control.
Q8. What are the ethical responsibilities of designers of intelligent systems?
John Markoff: I think the most important aspect of that question is the simple acknowledgement that intelligent system designers do have ethical responsibilities. That has not always been the case, but it seems to be a growing force within the community of AI and robotics designers in the past five years, so I’m not entirely pessimistic.
Q9. If humans delegate decisions to machines, who will be responsible for the consequences?
John Markoff: Ben Shneiderman, the University of Maryland computer scientist and user interface designer has written eloquently on this point. Indeed he argues against autonomous systems for precisely this reason. His point is that it is essential to keep a human in the loop. If not you run the risk of abdicating ethical responsibility for system design.
Q10. Assuming there is a real potential in using data–driven methods to both help charities develop better services and products, and understand civil society activity. In your opinion, what are the key lessons and recommendations for future work in this space?
John Markoff: I’m afraid I’m not an expert in the IT needs of either charities or NGOs. That said a wide range of AI advances are already being delivered at nominal cost via smart phones. As cheap sensors proliferate virtually all everyday objects will gain intelligence that will be widely accessible.
Qx. Anything else you wish to add?
John Markoff: Only that I think it is interesting that the augmentation vs automation dichotomy is increasingly seen as a path through which to navigate the impact of these technologies. Computer system designers are the ones who will decide what the impact of these technologies are and whether to replace or augment humans in society.
JOHN GREGORY MARKOFF
John Markoff joined The New York Times in March 1988 as a reporter for the business section. He is now a technology writer based in San Francisco bureau of the paper. Prior to joining the Times, he worked for The San Francisco Examiner from 1985 to 1988. He reported for the New York Times Science Section from 2010 to 2015.
Markoff has written about technology and science since 1977. He covered technology and the defense industry for The Pacific News Service in San Francisco from 1977 to 1981; he was a reporter at Infoworld from 1981 to 1983; he was the West Coast editor for Byte Magazine from 1984 to 1985 and wrote a column on personal computers for The San Jose Mercury from 1983 to 1985.
He has also been a lecturer at the University of California at Berkeley School of Journalism and an adjunct faculty member of the Stanford Graduate Program on Journalism.
The Times nominated him for a Pulitzer Prize in 1995, 1998 and 2000. The San Francisco Examiner nominated him for a Pulitzer in 1987. In 2005, with a group of Times reporters, he received the Loeb Award for business journalism. In 2007 he shared the Society of American Business Editors and Writers Breaking News award. In 2013 he was awarded a Pulitzer Prize in explanatory reporting as part of a New York Times project on labor and automation.
In 2007 he became a member of the International Media Council at the World Economic Forum. Also in 2007, he was named a fellow of the Society of Professional Journalists, the organization’s highest honor.
In June of 2010 the New York Times presented him with the Nathaniel Nash Award, which is given annually for foreign and business reporting.
Born in Oakland, California on October 29, 1949, Markoff grew up in Palo Alto, California and graduated from Whitman College, Walla Walla, Washington, in 1971. He attended graduate school at the University of Oregon and received a masters degree in sociology in 1976.
Markoff is the co-author of “The High Cost of High Tech,” published in 1985 by Harper & Row. He wrote “Cyberpunk: Outlaws and Hackers on the Computer Frontier” with Katie Hafner, which was published in 1991 by Simon & Schuster.
In January of 1996 Hyperion published “Takedown: The Pursuit and Capture of America’s Most Wanted Computer Outlaw,” which he co-authored with Tsutomu Shimomura. “What the Dormouse Said: How the Sixties Counterculture shaped the Personal Computer Industry,” was published in 2005 by Viking Books. “Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots,” was published in August of 2015 by HarperCollins Ecco.
He is currently researching a biography of Stewart Brand.
He is married to Leslie Terzian Markoff and they live in San Francisco, Calif.
MACHINES OF LOVING GRACE – The Quest for Common Ground Between Humans and Robots By John Markoff, Illustrated. 378 pp. Ecco/HarperCollins Publishers.
–Shneiderman’s “Eight Golden Rules of Interface Design”. These rules were obtained from the text Designing the User Interface by Ben Shneiderman.
– “Designing the User Interface”, 6th Edition. This is a revised edition of the highly successful textbook on Human Computer Interaction originally developed by Ben Shneiderman and Catherine Plaisant at the University of Maryland.
– Recruit Institute of Technology. Interview with Alon Halevy. ODBMS Industry Watch, Published on 2016-04-02
– Civility in the Age of Artificial Intelligence, by STEVE LOHR, technology reporter for The New York Times, ODBMS.org
– On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.
– On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.
Follow us on Twitter: @odbmsorg
“The increasing complexity and pace of global regulations is making it more difficult and expensive for financial services organizations to comply. At the same time, firms want to derive value from their data assets. How do they create synergy between these two seemingly divergent goals? The maturation of semantic technologies, when combined with increased acceptance of industry standards, holds out the promise of resolving those issues.” –David Saul.
I have interviewed David Saul, Senior Vice President and Chief Scientist at State Street Corporation. Main topics of the interview are the governance and management of data, and semantic technologies.
Q1. What is your role at State Street Bank?
David Saul: State Street has a long history as an innovator in financial services and my objective is to help maintain that leadership position. I work with our clients, internal developers, vendors, regulators and academics to identify and introduce appropriate innovations into our business. For the last several years I have focused on the development and adoption of semantic data standards.
The concept of the semantic web was first proposed over ten years ago by Sir Tim Berners-Lee, the creator of the World Wide Web, and has since been realized in multiple implementations. Semantics is a natural evolution of earlier work on metadata, language dialects and taxonomies for regulatory compliance. Examples include the SEC’s XBRL mandate and OFR’s Legal Entity Identifier (LEI) as part of the Dodd-Frank legislation.
Q2. What is Data Governance?
David Saul: State Street’s most important asset is the data that we ingest, process, store and distribute on behalf of our clients. Data Governance encompasses the management and controls needed to maintain stewardship of that data while in our custody.
Effective data governance can be measured by the ability to answer the following four questions:
- Do you know where your data is? Are you able to identify the critical business data in the firm, who owns it and, most importantly, and what it means?
- Do you maintain a catalog and monitor current and future regulatory requirements?
- Do you understand the existing products/services solutions used and can you identify any gaps?
- Do you participate in and influence relevant industry data standards?
Q3. What makes a good Data Governance Program?
David Saul: A mature Data Governance program provides a balanced framework to monetize data while also complying with regulatory requirements. The application of semantic data standards allows synergy between data analytics and risk management.
One example is the Financial Industry Business Ontology (FIBO) from the Enterprise Data Management (EDM) Council and the Object Management Group (OMG). Recent publications from regulators in the US and elsewhere have endorsed the use of data standards as the only way to deal with the increase in the scope and complexity of their responsibilities. For example, in its 2014 Annual Report the US Treasury Office of Financial Research (OFR) devotes its entire section 5 to “Advancing Data Standards”.
Semantics provides additional advantages over traditional technologies in its speed and flexibility. Developing Extract, Transform and Load (ETL) processes and data warehouses cannot keep pace with changes in business models and relevant regulations. The ability to easily create and change semantic maps of data ecosystems is being offered today by a number of vendors. The open nature of data standards like FIBO not only provides transparency but also provides assurance that these standards will be long lasting. Current academic research is showing our semantics can be a path into more leading edge technologies like machine learning and natural language.
Q4. How do you handle possible organizational conflicts from overlapping functions when dealing with Data?
David Saul: Effective governance and management of data requires a balance between distributed ownership and centralized control. The organizational role of the chief data officer at State Street has evolved to provide centralized policies, procedures and controls for data stewardship while maintaining operational management within the business processing units.
Beyond individual institutions, the application of data standards provides benefits to multiple constituencies:
- Financial services firms gain additional revenue from their clients while keeping risks at an acceptable level.
- Product and services companies have clearer requirements to innovate, develop and sell.
- Regulators and supervisors receive the information they need to meet statutory mandates and ensure that laws are complied with.
- Standards organizations follow their mission to enable simple and effective communication among the parties.
Q5. What are the main challenges in corporate, financial services, and regulatory sectors, especially on issues of Big Data, Analytics, and Risk Management?
David Saul: The increasing complexity and pace of global regulations is making it more difficult and expensive for financial services organizations to comply. At the same time, firms want to derive value from their data assets. How do they create synergy between these two seemingly divergent goals? The maturation of semantic technologies, when combined with increased acceptance of industry standards, holds out the promise of resolving those issues. Semantics and ontologies provide greater transparency and interoperability, thereby enhancing the overall trust in the financial system. Enhanced trust benefits all constituencies who have a direct interest.
Q6. You previously contributed to the Financial Stability Board Data Gaps Implementation Group. What are the main contributions of such group?
David Saul: State Street is an advocate for global data harmonization in multiple forums. Contributing expertise to industry associations and standards bodies benefits both the firm and the industry as a whole. Just one example is the International Organization of Securities Commissions (IOSCO) work on the financial industry Unique Product Identifier (UPI).
Q7. You also contributed to the White House Task Force on Smart Disclosure. What are the main results obtained?
David Saul: On May 9, 2014, President Barack Obama signed the Digital Accountability and Transparency Act, or the DATA Act, which had been passed unanimously by both the House of Representatives and the Senate. It requires the Department of the Treasury and the White House Office of Management and Budget to transform U.S. federal spending from disconnected documents into open, standardized data, and to publish that data online. State Street was among stakeholders from the tech industry, nonprofit sector, and executive and legislative branches of government who convened in May 2016 at the DATA Act Summit to build a shared vision for making the DATA Act a success.
David Saul, Senior Vice President and Chief Scientist, State Street Corporation.
David Saul is a senior vice president and chief scientist at State Street Corporation, reporting to the chief information officer. In this role, he proposes and assesses new advanced technologies for the organization, and also evaluates existing technologies and their likely evolution to reinforce the organization’s leadership position in financial services.
Mr. Saul previously was chief information security officer, where he oversaw State Street’s corporate information security program, controls and technology. Prior to that, he managed State Street’s Office of Architecture, where he was responsible for the overall enterprise technology, data and security architecture of the corporation.
Mr. Saul joined State Street in 1992 after 15 years with IBM’s Cambridge Scientific Center, where he managed innovations in operating systems virtualization, multiprocessing, networking and personal computers.
Mr. Saul serves as a trustee of the Massachusetts Eye and Ear Infirmary. In 2007, he was honored with a Computerworld Premier 100 IT Leader Award. He holds his bachelor’s and master’s degrees from the Massachusetts Institute of Technology.
– On data analytics for finance. Interview with Jason S.Cornez. ODBMS Industry Watch, Published on 2016-05-17
– Using NoSQL for Ireland’s Online Tax Research Database. ODBMS Industry Watch, Published on 2016-05-02
– Opportunity Now: Europe’s Mission to Innovate, By Robert Madelin, Senior Adviser for Innovation to the President of the European Commission. ODBMS.org
– Big Data in Financial Markets Regulation – Friend or Foe? By Morgan Deane, member of the Board and International Head of Legal & Compliance for the Helvea-Baader Bank Group. ODBMS.org, January 18, 2015
– The need for a data centric regulatory risk assessment framework. By Ramendra K. Sahoo, Director in PwC’s Advanced Risk Analytics. ODBMS.org
– Big Data Strategy – From Customer Targeting to Customer Centric. By Patrick Maes, CTO and GM Strategy & Planning, Global Technology Services and Operations, Australia & New Zealand Banking Group
Follow us on Twitter: @odbmsorg
“Assembling a team with the wide range of skills needed for a successful IoT project presents an entirely different set of challenges. The skills needed to build a ‘thing’ are markedly different than the skills needed to implement the data analytics in the cloud.”–Steve Graves.
Q1. What are in your opinion the main Challenges and Opportunities of the Internet of Things (IoT) seen from the perspective of a database vendor?
Steve Graves: Let’s start with the opportunities.
When we started McObject in 2001, we chose “eXtremeDB, the embedded database for intelligent, connected devices” as our tagline. eXtremeDB was designed from the get-go to live in the “things” comprising what the industry now calls the Internet of Things. The popularization of this term has created a lot of visibility and, more importantly, excitement and buzz for what was previously viewed as the relatively boring “embedded systems.” And that creates a lot of opportunities.
A lot of really smart, creative people are thinking of innovative ways to improve our health, our workplace, our environment, our infrastructure, and more. That means new opportunities for vendors of every component of the technology stack.
The challenges are manifold, and I can’t begin to address all of them. The media is largely fixated on security, which itself is multi-dimensional.
We can talk about protecting IoT-enabled devices (e.g. your car) from being hacked. We can talk about protecting the privacy of your data at rest. And we can talk about protecting the privacy of data in motion.
Every vendor needs recognize the importance of security. But, it isn’t enough for a vendor, like McObject, to provide the features to secure the target system; the developer that assembles the stack along with their own proprietary technology to create an IoT solution needs to use available security features, and use them correctly.
After security, scaling IoT systems is the next big challenge. It’s easy enough to prototype something.
But careful planning is needed to leap from prototype to full-blown deployment. Obvious decisions have to be made about connectivity and necessary bandwidth, how many things per gateway, one tier of gateways or more, and how much compute capacity is needed in the cloud. Beyond that, there are less obvious decisions to be made that will affect scalability, like making sure the DBMS used on devices and/or gateways is able to handle the workload (e.g. that the gateway DBMS can scale from 10 input streams to 100 input streams); determining how to divide the analytics workload between gateways and the cloud; and ensuring that the gateway, its DBMS and its communication stack can stream data to the cloud while simultaneously processing its own input streams and analytics.
Assembling a team with the wide range of skills needed for a successful IoT project presents an entirely different set of challenges. The skills needed to build a ‘thing’ are markedly different than the skills needed to implement the data analytics in the cloud. In fact, ‘things’ are usually very much like good ol’ embedded systems, and system engineers that know their way around real-time/embedded operating systems, JTAG debuggers, and so on, have always been at a premium.
Q2. Data management for the IoT: What are the main differences between data management in field-deployed devices and at aggregation points?
Steve Graves: Quite simply: scale. A field-deployed device (or a gateway to field-deployed devices that do not, themselves, have any data management need or capability) has to manage a modest amount of data. But an aggregation point (the cloud being the most obvious example) has to manage many times more data – possibly orders of magnitude more.
At the same time, I have to say that they might not be all that different. Some IoT systems are going to be closed, meaning the nature of the things making up the system is known, and these won’t require much scaling. For example, a building automation system for a small- to mid-size building would have perhaps 100s of sensors and 10s of gateways, and may (or may not) push data up to a central aggregation point. If there are just 10s of gateways, we can create a UI that connects to the database on each gateway where each database is one shard of a single logical database, and execute analytics against that logical database without any need of a central aggregation point. We can extend this hypothetical case to a campus of buildings, or to a landlord with many buildings in a metropolitan area, and then a central aggregation point makes sense.
But the database system would not necessarily be different, only the organization of the physical and logical databases.
The gateways of each building would stream to a database server in the cloud. In the case of 10 buildings, we could have 10 database servers in the cloud that represent 10 shards of that logical database in the cloud. This architecture allows for great scalability. The landlord acquires another building? Great, stand up another database server and the UI connects to 11 shards instead of 10. In this scenario, database servers are software, not hardware. For the numbers we’re talking about (10 or 11 buildings), it could easily be handled by a single hardware server of modest ability.
At the other end of the scale (pun intended) are IoT systems that are wide open. By that, I mean the creators are not able to anticipate the universe of “things” that could be connected, or their quantity. In the first case, the database system should be able to ingest data that was heretofore unknown. This argues for a NoSQL database system, i.e. a database system that is schema-less. In this scenario, the database system on field-deployed devices is probably radically different from the database system in the cloud. Field-deployed devices are purpose-specific, so A) they don’t need and wouldn’t benefit from a NoSQL database system, and B) most NoSQL database systems are too resource-hungry to reside on embedded device nodes.
Q3. If we look at the characteristics of a database system for managing device-based data in the IoT, how do they differ from the characteristics of a database system (typically deployed on a server) for analyzing the “big data” generated by myriad devices?
Steve Graves: Again, let’s recognize that field-deployed devices in the IoT are classic embedded systems. In practical terms, that means relatively modest hardware like an ARM, MIPS, PowerPC or Atom processor running at 100s of megahertz, or perhaps 1 ghz if we’re lucky, and with only enough memory to perform its function. Further, it may require a real-time operating system, or at least an embedded operating system that is less resource hungry than a full-on Linux distro. So, for a database system to run in this environment, it will need to have been designed to run in this environment. It isn’t practical to try to shoehorn in a database system that was written on the assumption that CPU cycles and memory are abundant. It may also be the case that the device has little-to-no persistent storage, which mandates an in-memory database.
So a database system for a field-deployed device is going to
1. have a small code size
2. use little stack
3. preferably, allocate no heap memory
4. have no, or minimal, external dependencies (e.g. not link in an extra 1 MB of code from the C run-time library)
5. have built-in ability to replicate data (to a gateway or directly to the cloud)
a. Replication should be “open”, meaning be able to replicate to a different database system
6. Have built-in security features
7. Nice to have:
a. built-in analytics to aggregate data prior to replicating it
b. ability to define the schema
c. ability to operate entirely in memory
A database system for the cloud might benefit from being schema-less, as described previously. It should certainly have pretty elastic scalability. Servers in the cloud are going to have ample resources and robust operating systems. So a database system for the cloud doesn’t need to have a small code size, use a small amount of stack memory, or worry about external dependencies such as the C run-time library. On the contrary, a database system for the cloud is expected to do much more (handle data at scale, execute analytics, etc.) and will, therefore, need ample resources. In fact, this database system should be able to take maximum advantage of the resources available, including being able to scale horizontally (across cores, CPUs, and servers).
In summary, the edge (device-based) DBMS needs to operate in a constrained environment. A cloud DBMS needs to be able to effectively and efficiently utilize the ample resources available to it.
Q4. Why is the ability to define a database schema important (versus a schema-less DBMS, aka NoSQL) for field-deployed devices?
Steve Graves: Field-deployed devices will normally perform a few specific functions (sometimes, just one function). For example, a building automation system manages HVAC, lighting, etc. A livestock management system manages feed, output, and so on. In such systems, the data requirements are well known. The hallmark NoSQL advantage of being able to store data without predefining its structure is unwarranted. The other purported hallmark of NoSQL is horizontal scalability, but this is not a need for field-deployed devices.
Walking away from the relational database model (and its implicit use of a database schema) has serious implications.
A great deal of scientific knowledge has been amassed around the relational database model over the last few decades, and without it developers are completely on their own with respect to enforcing sound data management practices.
In the NoSQL sphere, there is nothing comparable to the relational model (e.g. E.F. Codd’s work) and the mathematical foundation (relational calculus) underpinning it.
There should be overwhelming justification for a decision to not use relational.
In my experience, that justification is absent for data management of field-deployed devices.
A database system that “knows” the data design (via a schema) can more intelligently manage the data. For example, it can manage constraints, domain dependencies, events and much more. And some of the purported inflexibility imposed by a schema can be eliminated if the DBMS supports dynamic DDL (see more details on this in the answer to question Q6, below).
Q5. In your opinion, do IoT aggregation points resemble data lakes?
Steve Graves: The term data lake was originally conceived in the context of Hadoop and map-reduce functionality. In more recent times, the meaning of the term has morphed to become synonymous with big data, and that is how I use the term. Insofar as a gateway can also be an aggregation point, I would not say ‘aggregation points resemble data lakes’ because gateway aggregation points, in all likelihood, will not manage Big Data.
Q6. What are the main technical challenges for database systems used to accommodate new and unforeseen data, for example when a new type of device begins streaming data?
Steve Graves: The obvious challenges are
1. The ability to ingest new data that has a previously unknown structure
2. The ability to execute analytics on #1
3. The ability to integrate analytics on #1 with analytics on previously known data
#1 is handled well by NoSQL DBMSs. But, it might also be handled well by an RDBMS via “dynamic DDL” (dynamic data definition language), e.g. the ability to execute CREATE TABLE, ALTER TABLE, and/or CREATE INDEX statements against an existing database.
To efficiently execute analytics against any data, the structure of the data must eventually be understood.
RDBMS handle this through the database dictionary (the binary equivalent of the data definition language).
But some NoSQL DBMSs handle this through different meta data. For example, the MarkLogic DBMS uses JSON metadata to understand the structure of documents in its document store.
NoSQL DBMSs with no meta data whatsoever put the entire burden on the developers. In other words, since the data is opaque to the DBMS, the application code must read and interpret the content.
Q7. Client/server DBMS architecture vs. in-process DBMSs: which one is more suitable for IoT?
Steve Graves: For edge DBMSs (on constrained devices), an in-process architecture will be more suitable. It requires fewer resources than client/server architecture, and imposes less latency through elimination of inter-process communication. For cloud DBMSs, a client/server architecture will be more suitable. In the cloud environment, resources are not scarce, and the the advantage of being able to scale horizontally will outweigh the added latency associated with client/server.
Qx Anything else you wish to add?
Steve Graves: We feel that eXtremeDB is uniquely positioned for the Internet of Things. Not only have devices and gateways been in eXtremeDB’s wheelhouse for 15 years with over 25 million real world deployments, but the scalability, time series data management, and analytics built into the eXtremeDB server (big data) offering make it an attractive cloud database solution as well. Being able to leverage a single DBMS across devices, gateways and the cloud has obvious synergistic advantages.
Steve Graves is co-founder and CEO of McObject, a company specializing in embedded Database Management System (DBMS) software. Prior to McObject, Steve was president and chairman of Centura Solutions Corporation and vice president of worldwide consulting for Centura Software Corporation.
Big Data, Analytics, and the Internet of Things, by Mohak Shah, analytics leader and research scientist at Bosch Research, USA.ODBMS.org APRIL 6, 2015
Privacy considerations & responsibilities in the era of Big Data & Internet of Things, by Ramkumar Ravichandran, Director, Analytics, Visa Inc. ODBMS.org January 8, 2015.
Securing Your Largest USB-Connected Device: Your Car,BY Shomit Ghose, General Partner, ONSET Ventures, ODBMs.org MARCH 31, 2016.
eXtremeDB Financial Edition DBMS Sweeps Records in Big Data Benchmark,ODBMS.org JULY 2, 2016
On the Internet of Things. Interview with Colin Mahony, ODBMS Industry Watch, Published on 2016-03-14
A Grand Tour of Big Data. Interview with Alan Morrison, ODBMS Industry Watch, Published on 2016-02-25
On the Industrial Internet of Things. Interview with Leon Guzenda, ODBMS Industry Watch, January 28, 2016
Follow us on Twitter: @odbmsorg
“From a healthcare perspective, how can we aggregate all the medical data, in all forms from multiple sources, such as wearables, home medical devices, MRI images, pharmacies and so on, and also blend in intelligence or new data sources, such as genomic data, so that doctors can make better decisions at the point of care?”– Julie Lockner.
I have interviewed Julie Lockner. Julie leads data platform product marketing for InterSystems. Main topics of the interview are Data Interoperability and InterSystems` data platform strategy.
Q1. Everybody is talking about Big Data — is the term obsolete?
Julie Lockner: Well, there is no doubt that the sheer volume of data is exploding, especially with the proliferation of smart devices and the Internet of Things (IoT). An overlooked aspect of IoT is the enormous volume of data generated by a variety devices, and how to connect, integrate and manage it all.
The real challenge, though, is not just processing all that data, but extracting useful insights from the variety of device types. Put another way, not all data is created using a common standard. You want to know how to interpret data from each device, know which data from what type of device is important, and which trends are noteworthy. Better information can create better results when it can be aggregated and analyzed consistently, and that’s what we really care about. Better, higher quality outcomes, not bigger data.
Q2. If not Big Data, where do we go from here?
Julie Lockner: We always want to be focusing on helping our customers build smarter applications to solve real business challenges, such as helping them to better compete on service, roll out high-quality products quicker, simplify processes – not build solutions in search of a problem. A canonical example is in retail. Our customers want to leverage insight from every transaction they process to create a better buying experience online or at the point of sale. This means being able to aggregate information about a customer, analyze what the customer is doing while on the website, and make an offer at transaction time that would delight them. That’s the goal – a better experience – because that is what online consumers expect.
From a healthcare perspective, how can we aggregate all the medical data, in all forms from multiple sources, such as wearables, home medical devices, MRI images, pharmacies and so on, and also blend in intelligence or new data sources, such as genomic data, so that doctors can make better decisions at the point of care? That implies we are analyzing not just more data, but better data that comes in all shapes and sizes, and that changes more frequently. It really points to the need for data interoperability.
Q3. What are the challenges software developers are telling you they have in today’s data-intensive world?
Julie Lockner: That they have too many database technologies to choose from and prefer to have a simple data platform architecture that can support multiple data models and multiple workloads within a single development environment.
We understand that our customers need to build applications that can handle a vast increase in data volume, but also a vast array of data types – static, non-static, local, remote, structured and non-structured. It must be a platform that coalesces all these things, brings services to data, offers a range of data models, and deals with data at any volume to create a more stable, long-term foundation. They want all of these capabilities in one platform – not a platform for each data type.
For software developers today, it’s not enough to pick elements that solve some aspect of a problem and build enterprise solutions around them; not all components scale equally. You need a common platform without sacrificing scalability, security, resilience, rapid response. Meeting all these demands with the right data platform will create a successful application.
And the development experience is significantly improved and productivity drastically increased when they can use a single platform that meets all these needs. This is why they work with InterSystems.
Q4. Traditionally, analytics is used with structured data, “slicing and dicing” numbers. But the traditional approach also involves creating and maintaining a data warehouse, which can only provide a historical view of data. Does this work also in the new world of Internet of Things?
Julie Lockner: I don’t think so. It is generally possible to take amorphous data and build it into a structured data model, but to respond effectively to rapidly changing events, you need to be able to take data in the form in which it comes to you.
If your data platform lacks certain fields, if you lack schema definition, you need to be able to capitalize on all these forms without generating a static model or a refinement process. With a data warehouse approach, it can take days or weeks to create fully cleansed, normalized data.
That’s just not fast enough in today’s always-on world – especially as machine-generated data is not conforming to a common format any time soon. It comes back to the need for a data platform that supports interoperability.
Q5. How hard is it to make decisions based on real-time analysis of structured and unstructured data?
Julie Lockner: It doesn’t have to be hard. You need to generate rules that feed rules engines that, in turn, drive decisions, and then constantly update those rules. That is a radical enhancement of the concept of analytics in the service of improving outcomes, as more real-time feedback loops become available.
The collection of changes we describe as Big Data will profoundly transform enterprise applications of the future. Today we can see the potential to drive business in new ways and take advantage of a convergence of trends, but it is not happening yet. Where progress has been made is the intelligence of devices and first-level data aggregation, but not in the area of services that are needed. We’re not there yet.
Q6. What’s next on the horizon for InterSystems in meeting the data platform requirements of this new world?
Julie Lockner: We continually work on our data platform, developing the most innovative ways we can think of to integrate with new technologies and new modes of thinking. Interoperability is a hugely important component. It may seem a simple task to get to the single most pertinent fact, but the means to get there may be quite complex. You need to be able to make the right data available – easily – to construct the right questions.
Data is in all forms and at varying levels of completeness, cleanliness, and accuracy. For data to be consumed as we describe, you need measures of how well you can use it. You need to curate data so it gets cleansed and you can cull what is important. You need flexibility in how you view data, too. Gathering data without imposing an orthodoxy or structure allows you to gain access to more data. Not all data will conform to a schema a priori.
Q7. Recently you conducted a benchmark test of an application based on InterSystems Caché®. Could you please summarize the main results you have obtained?
Julie Lockner: One of our largest customers is Epic Systems, one of the world’s top healthcare software companies.
Epic relies on Caché as the data platform for electronic medical record solutions serving more than half the U.S. patient population and millions of patients worldwide.
Epic tested the scalability and performance improvements of Caché version 2015.1. Almost doubling the scalability of prior versions, Caché delivers what Epic President Cark Dvorak has described as “a key strategic advantage for our user organizations that are pursuing large-scale medical informatics programs as well as aggressive growth strategies in preparation for the volume-to-value transformation in healthcare.”
Qx Anything else you wish to add?
Julie Lockner: The reason why InterSystems has succeeded in the market for so many years is a commitment to the success of those who depend on our technology. A recent Gartner Magic Quadrant report found we had the highest number of customers surveyed – 85% – who would buy from us again. That is the highest number of any vendor participating in that study.
The foundation of the company’s culture is all about helping our customers succeed. When our customers come to us with a challenge, we all pitch in to solve it. Many times our solutions may address an unusual problem that could benefit others – which then becomes the source of many of our innovations. It is one of the ways we are using problem-solving skills as a winning strategy to benefit others. When our customers are successful at using our engine to solve the world’s most important challenges, we all win.
Julie Lockner leads data platform product marketing for InterSystems. She has more than 20 years of experience in IT product marketing management and technology strategy, including roles at analyst firm ESG as well as Informatica and EMC.
– White Paper: Big Data Healthcare: Data Scalability with InterSystems Caché® and Intel® Processors (LINK to .PDF)
Follow us on Twitter: @odbmsorg
“A hybrid technology infrastructure that combines existing analytics architecture with new big data technologies can help companies to achieve superior outcomes.”–Narendra Mulani
I have interviewed Narendra Mulani, Chief Analytics Officer, Accenture Analytics. Main topics of our interview are: Data Analytics, Big Data, the Internet of Things, and their repercussion for the enterprise.
Q1. What is your role at Accenture?
Narendra Mulani: I’m the Chief Analytics Officer at Accenture Analytics and I am responsible for building and inspiring a culture of analytics and driving Accenture’s strategic agenda for growth across the business. I lead a team of analytics professionals around the globe that are dedicated to helping clients transform into insight-driven enterprises and focused on creating value through innovative solutions that combine industry and functional knowledge with analytics and technology.
With the constantly increasing amount of data and new technologies becoming available, it truly is an exciting time for Accenture and our clients alike. I’m thrilled to be collaborating with my team and clients and taking part, first-hand, in the power of analytics and the positive disruption it is creating for businesses around globe.
Q2. What are the main drivers you see in the market for Big Data Analytics?
Narendra Mulani: Companies across industries are fighting to secure or keep their lead in the marketplace.
To excel in this competitive environment, they are looking to exploit one of their growing assets: Data.
Organizations see big data as a catalyst for their transformation into digital enterprises and as a way to secure an insight-driven competitive advantage. In particular, big data technologies are enabling companies with greater agility as it helps them to analyze data comprehensively and take more informed actions at a swifter pace. We’ve already passed the transition point with big data – instead of discussing the possibilities with big data, many are already experiencing the actual insight-driven benefits from it, including increased revenues, a larger base of loyal customers, and more efficient operations. In fact, we see our clients looking for granular solutions that leverage big data, advanced analytics and the cloud to address industry specific problems.
Q3. Analytics and Mobility: how do they correlate?
Narendra Mulani: Analytics and mobility are two digital areas that work hand-in-hand on many levels.
As an example, mobile devices and the increasingly connected world through the Internet of Things (IoT) have become two key drivers for big data analytics. As mobile devices, sensors, and the IoT are constantly creating new data sources and data types, big data analytics is being applied to transform the increasing amount of data into important and actionable insight that can create new business opportunities and outcomes. Also, this view can be reversed, where analytics feeds insight into mobile devices such as tablets to workers in offices or out in the field to enable them to make real-time decisions that could benefit their business.
Q4. Data explosion: What does it create ? Risks, Value or both?
Narendra Mulani: The data explosion that’s happening today and will continue to happen due to the Internet of Things creates a lot of opportunity for businesses. While organizations recognize the value that the data can generate, the sheer amount of data – internal data, external data, big data, small data, etc – can be overwhelming and create an obstacle for analytics adoption, project completion, and innovation. To overcome this challenge and pursue actionable insights and outcomes, organizations shouldn’t look to analyze all of the data that’s available, but identify the right data needed to solve the current project or challenge at hand to create value.
It’s also important for companies to manage the potential risk associated with the influx of data and take the steps needed to optimize and protect it. They can do this by aligning IT and business leads to jointly develop and maintain data governance and security strategies. At a high level, the strategies would govern who uses the data and how the data is analyzed and leveraged, define the technologies that would manage and analyze the data, and ensure the data is secured with the necessary standards. Suitable governance and security strategies should be requirements for insight-driven businesses. Without them, organizations could experience adverse and counter-productive results.
Q5. You introduced the concept of the “Modern Data Supply Chain”? How does it differ from the traditional Supply Chain?
Narendra Mulani: As companies’ data ecosystems are usually very complex with many data silos, a modern data supply chain helps them to simplify their data environment and generate the most value from their data. In brief, when data is treated as a supply chain, it can flow swiftly, easily and usefully through the entire organization— and also through its ecosystem of partners, including customers and suppliers.
To establish an effective modern data supply chain, companies should create a hybrid technology environment that enables a data service platform with emerging big data technologies. As a result, businesses will be able to access, manage, move, mobilize and interact with broader and deeper data sets across the organization at a much quicker pace than previously possible and place action on the attained analytics insights that could help it to more effectively deliver to its consumers, develop new innovative solutions, and differentiate in its market.
Q6. You talked about “Retooling the Enterprise”. What do you mean by this?
Narendra Mulani: Some businesses today are no longer just using analytics, they are taking the next step by transforming into insight-driven enterprises. To achieve “insight-driven enterprise” status, organizations need to retool themselves for optimization. They can pursue an insight-driven transformation by:
· Establishing a center of gravity for analytics – a center of gravity for analytics often takes the shape of a Center of Excellence or a similar concentration of talent and resources.
· Employing agile governance – build horizontal governance structures that are focused on outcomes and speed to value, and take a “test and learn” approach to rolling out new capabilities. A secure governance foundation could also improve the democratization of data throughout a business.
· Creating an inter-disciplinary high performing analytics team — field teams with diverse skills, organize talent effectively, and create innovative programs to keep the best talent engaged.
· Deploying new capabilities faster – deploy new, modern and agile technologies, as well as hybrid architectures and specifically designed toolsets, to help revolutionize how data has been traditionally managed, curated and consumed, to achieve speed to capability and desired outcomes. When appropriate, cloud technologies should be integrated into the IT mix to benefit from cloud-based usage models.
· Raising the company’s analytics IQ – have a vision of what would be your “intelligent enterprise” and implement an Analytics Academy that provides analytics training for functional business resources in addition to the core management training programs.
Q7. What are the risks from the Internet of Things? And how is it possible to handle such risks?
Narendra Mulani: The IoT is prompting an even greater focus on data security and privacy. As a company’s machines, employees and ecosystems of partners, providers, and customers become connected through the IoT, securing the data that is flowing across the IoT grid can be increasingly complex. Today’s sophisticated cyber attackers are also amplifying this complexity as they are constantly evolving and leveraging data technology to challenge a company’s security efforts.
To establish strong, effective real-time cyber defense strategy, security teams will need to employ innovative technologies to identify threat behavioral patterns — including artificial intelligence, automation, visualisation, and big data analytics – and an agile and fluid workforce to leverage the opportunities presented by technology innovations. They should also establish policies to address privacy issues that arise out of all the personal data that are being collected. Through this combination of efforts, companies will be able to strengthen its approach to cyber defense in today’s highly connected IoT world and empower cyber defenders to help their companies better anticipate and respond to cyber attacks.
Q8. What are the main lessons you have learned in implementing Big Data Analytic projects?
Narendra Mulani: Organizations should explore the entire big data technology ecosystem, take an outcome-focused approach to addressing specific business problems, and establish precise success metrics before an analytics project even begins. The big data landscape is in a constant state of change with new data sources and emerging big data technologies appearing every day that could offer a company a new value-generating opportunity. A hybrid technology infrastructure that combines existing analytics architecture with new big data technologies can help companies to achieve superior outcomes.
An outcome-focused strategy that embraces analytics experimentation and explores the possible data and technology that can help a company meet its goals and has checkpoints for measuring performance will be very valuable, as this strategy will help the analytics team to know if they should continue on course or need to make a course correction to attain the desired outcome.
Q9. Is Data Analytics only good for businesses? What about using (Big) Data for Societal issues?
Narendra Mulani: Analytics is helping businesses across industries and governments as well to make more informed decisions for effective outcomes, whether it might be to improve customer experience, healthcare or public safety.
As an example, we’re working with a utility company in the UK to help them leverage analytics insights to anticipate equipment failures and respond in near real-time to critical situations, such as leaks or adverse weather events. We are also working with a government agency to analyze its video monitoring feeds to identify potential public safety risks.
Qx Anything else you wish to add?
Narendra Mulani: Another area that’s on the rise is Artificial Intelligence – we define it as a collection of multiple technologies that enable machines to sense, comprehend, act and learn, either on their own or to augment human activities. The new technologies include machine learning, deep learning, natural language processing, video analytics and more. AI is disrupting how businesses operate and compete and we believe it will also fundamentally transform and improve how we work and live. When an organization is pursuing an AI project, it’s our belief that it should be business-oriented, people-focused, and technology rich for it to be most effective.
As Chief Analytics Officer and Head Geek – Accenture Analytics, Narendra Mulani is responsible for creating a culture of analytics and driving Accenture’s strategic agenda for growth across the business. He leads a dedicated team of 17,000 Analytic professionals that serve clients around the globe, focusing on value creation through innovative solutions that combine industry and functional knowledge with analytics and technology.
Narendra has held a number of leadership roles within Accenture since joining in 1997. Most recently, he was the managing director – Products North America, where he was responsible for creating value for our clients across a number of industries. Prior to that, he was managing director – Supply Chain, Accenture Management Consulting, leading a global practice responsible for defining and implementing supply chain capabilities at a diverse set of Fortune 500 clients.
Narendra graduated from Bombay University in 1978 with a Bachelor of Commerce, and received an MBA in Finance in 1982 as well as a PhD in 1985 focused on Multivariate Statistics, both from the University of Massachusetts.
Outside of work, Narendra is involved with various activities that support education and the arts. He lives in Connecticut with his wife Nita and two children, Ravi and Nikhil.
– Accenture Analytics. Launching an insights-driven transformation. Download the point of view on analytics operating models to better understand how high performing companies are organizing their capabilities.
On Big Data and Data Science. Interview with James Kobielus, Source: ODBMS Industry Watch, 2016-04-19
On the Internet of Things. Interview with Colin Mahony Source: ODBMS Industry Watch, 2016-03-14
A Grand Tour of Big Data. Interview with Alan Morrison, Source: ODBMS Industry Watch, 2016-02-25
On the Industrial Internet of Things. Interview with Leon Guzenda, Source: ODBMS Industry Watch, 2016-01-28
On Artificial Intelligence and Society. Interview with Oren Etzioni, Source: ODBMS Industry Watch, 2016-01-15
Follow us on Twitter: @odbmsorg
“Understanding human language remains a difficult problem. The challenges here are not only technical, but there is also a perception from popular culture that computers today perform at the level we see in science fiction. So there is a gap between what is expected and what is possible.”–Jason S.Cornez.
I have interviewed Jason S.Cornez, Chief Technology Officer, RavenPack. Main topic of the interview is unstructured data analytics for finance.
Q1. What is the business of RavenPack?
Jason S.Cornez: We specialize in the systematic analysis of unstructured data for finance. RavenPack Analytics transforms unstructured big data sets,such as traditional news and social media, into structured granular data and indicators to help financial services firms improve their performance. RavenPack addresses the challenges posed by the characteristics of Big Data – volume, variety, veracity and velocity – by converting unstructured content into a format that can be more effectively analyzed, manipulated and deployed in financial applications.
Q2. How is Deutsche Bank using RavenPack News Analytics as an overlay to a pairs trading strategy?
Jason S.Cornez: The profits and risks from trading stock pairs are very much related to the type of information event which creates divergence. If divergence is caused by a piece of news related specifically to one constituent of the pair, there is a good chance that prices will diverge further. On the other hand, if divergence is caused by random price movements or a differential reaction to common information, convergence is more likely to follow after the initial divergence. To test the effects of news on a pairs trading strategy, Deutsche Bank used two aggregated indicators based on RavenPack’s Big Data analytics derived from news and social media data measuring sentiment and media attention.
Specifically, using the two indicators, Deutsche Bank created a filter that would ignore trades where divergence was supported by negative sentiment and abnormal news volume.
Overall, Deutsche Bank finds that applying a news analytics overlay can help differentiate between “good” price divergence (which is likely to converge) and “bad” divergence. More importantly, such ability provides significant improvements to the performance of a traditional pairs trading strategy, especially by reducing divergence risk.
Q3. Who needs sentiment analytics in finance and why?
Jason S.Cornez: Sentiment analytics can help improve performance of trading strategies,reduce risk, and monitor compliance. Quantitative investors often subscribe to RavenPack Analytics granular data. This provides them with the ability to detect relevant, novel and unexpected events – be they corporate, macroeconomic or geopolitical -so they can enter new positions, or protect existing ones. These events, and the sentiment associated with them, help drive alpha generation as a novel factor in automated trading models.
Traditional Asset Managers, such as those managing hedge funds, mutual funds, pension funds and family offices may subscribe to RavenPack Indicators to help run portfolio optimization. The Indicators provide snapshots of sentiment and information density for an entity or instrument that can be used alongside fundamental or technical indicators to build portfolios with better risk/return profiles.
Brokerage and Market Makers can leverage RavenPack sentiment data to manage risk and generate trade ideas. They rely on RavenPack’s detection of relevant, novel and unexpected events – be they corporate, macroeconomic or geopolitical – to create circuit breakers protecting them from event risk.
Risk and Compliance Managers use RavenPack data to monitor accumulation of adverse sentiment or detect headline risk. The data help risk managers locate accumulations of risk and volatility, or changes in liquidity – either by aggregating sentiment, identifying event-driven regime shifts, or by creating alerts for when sentiment indicators reach extremes. As well, RavenPack event data also aids surveillance analysts to receive fewer false positives from market abuse alerts.
Finally, Professional and Academic researchers use RavenPack data to better understand how news and social media affect markets. They want to inform their clients how to find new sources of value and, hence, research and write about how quantitative investment managers find value in the data. RavenPack’s granular data is a great source of unique data for academics to enhance their published research – be it presenting a new way to use the data or controlling for news and social media in their work.
Q4. What are the main challenges and opportunities for Big Data analytics for financial markets?
Jason S.Cornez: Much of the work so far in Big Data analytics has been confined to structured data. These are sets of labeled and elementized values, such as what you might find in a traditional database table. Tools like Hadoop and Spark have helped to make structured big data analyitcs approachable.
RavenPack has always focused on unstructured data, primarily English-language text. Doing analytics here isn’t just about data mining, it requires more sophisticated processing for each document. Understanding human language remains a difficult problem. The challenges here are not only technical, but there is also a perception from popular culture that computers today perform at the level we see in science fiction. So there is a gap between what is expected and what is possible. One of our goals here is certainly to help make computers a little smarter.
Things start to get really interesting when you produce analytics by marrying structured data with unstructured data. A simple example could be a news story where an analyst expects mortgage rates to hit 4% by summer. It is certainly great if a computer understands that this is a story about interest rate guidance, but so much better if the computer is able to combine this with historical mortgage rates to know that the rates are currently rising, but still far below historical norms. As an industry, I don’t think too much has been done here yet, but that we’ll be seeing more activity here in the coming years.
Financial markets rely on information in order to be efficient. Big Data analytics promises to provide more information, and more types of information, faster than was previously possible. A more efficient market could help to level the playing field, as it were. And even if markets never become truly efficient, the financial industry sees that Big Data analytics can certainly help them. Several of these opportunities were addressed in the answer to the previous question.
Q5. What are your practical experience in building an infrastructure for Big Data Analytics of mostly unstructured text content, in realtime?
Jason S.Cornez: RavenPack has been processing Big Data since before Cloud Computing was a practical reality. We noticed that most competitors in the news analytics space were offering software solutions, whereas RavenPack has always been a service provider. We sell data, not software. As such we invested in our own infrastructure maintained at trusted hosting facilities. This was perhaps not the easiest or cheapest route, but it leads to compelling products that are relatively easy for a customer to adopt.
From the beginning, we’ve built a distributed system where collection, storage, classification, analytics, publication, and monitoring all run on distinct machines connected by a high-speed network. We learned virtualization technologies so that we could leverage our hardware investments more efficiently. We’ve been rigorous about maintaining a separation of concerns and establishing well-defined interfaces between our components. This not only makes our system robust, but it also allows us to choose the best technologies for each task.
In recent years, we’ve migrated to Cloud Computing and our early investments in distributed systems are really paying off. Most of our components work directly in the cloud and also scale without additional engineering work.
Q6. How do you manage to have a very low latency?
Jason S.Cornez: Low latency has always been a requirement of the system. Starting with low-latency, realtime processing in mind led to many of the architectural decisions that I mentioned above – especially about being distributed and being able to leverage big hardware. It’s painful to think about re-engineering an existing system that wasn’t designed with low latency in mind.
A specific observation is that storage, especially magnetic based storage, is far slower than CPU and also far slower than networking. So we have a heavily multi-threaded system where all storage tasks are delegated to background threads and the flow of data in the realtime system never needs to wait on a database.
Speaking of multi-threading, RavenPack performs various types of classification on each document. Many of these are independent and can be performed in parallel. As well, within a single document and single type of classification, many aspects work only on local information, such as a paragraph. This work can also be done in parallel. As more powerful, multi-core machines continue to appear, our system can continue to improve.
Of course, low latency really begins with good algorithms and good tools. We measure the system as a whole on a daily basis and we profile our code for both speed and space on a regular basis. At times, there is a trade-off between a feature and doing it feasibly. We often sacrifice a new feature until we can solve how to implement it without negatively impacting the performance of our system.
Q7. What are the main technological challenges you are currently facing?
Jason S.Cornez: There are many challenges ahead. Some of the obvious ones are about branching out from English into other languages, or from plain text to other media formats.
On the purely technical side, we see that cloud computing and big data are still very young fields. Cloud resources are much more ephemeral than those in a controlled, hosted environment. We must adapt software to work well in the face of disappearing machines and inaccessible resources. One example is startup time of a system. Traditionally, startup is a rare event and our servers run for a long time. But now that changes, and system startup is much more frequent and hence must be made more efficient. We are evolving rapidly in these areas right now.
Perhaps the biggest challenge remains the perception gap that I mentioned earlier. I’m very proud of the system we’ve built, but it remains possible for a human to find an entity or an event in a document that our system misses. I don’t think this problem will ever go away, but I’m confident RavenPack is making great strides here.
Q8. Why and how do you use Allegro Common Lisp?
Jason S.Cornez: RavenPack has been using Franz Allegro Common Lisp since we began. It is the primary language we use for analysis and classification of unstructured text. Common Lisp is an excellent language for both exploratory programming and high performance computing.
Common Lisp is a multi-paradigm language, or even a paradigm-neutral language. So the engineer has the flexibility to map from concept to code in the most natural way possible. Some concepts map naturally to an object-oriented design, others to a functional design, and other to an imperative design. The language naturally supports all of these so you never need to map from your concept into the philosophy of the language. And further, lisp is a programmable programming language, so as new paradigms come along, they can be added to the language by any developer. This is so easy and natural in Common Lisp that you often do it even when there is only a single use case in mind.
Common Lisp also shines for deploying and maintaining production software. Of course, it supports native OS threads, native machine compilation, and high performance garbage collection. But as well, you can attach to, inspect, modify and patch live systems.
Q9. What are the main lessons you learned so far?
Jason S.Cornez: It’s been a long and interesting journey, and nearly everything we know now has been learned along the way. One way I like to think about the main lessons learned is to consider what I believe to be the barriers that might make it difficult for a competitor or potential client to replicate what we’ve done.
A significant selling-point of our product that provides lots of value to our clients is our extensive historical archive of analytics. This of course is derived from our archive of content. The curation of such an archive is much harder than most people imagine. There is the minor issue of implementing the spec that the provider supplies. But the fun begins as you realize that the archive is incomplete and in multiple incompatible formats, some of them not documented at all. There are multiple timestamps, many with no timezone. The realtime feed looks different from the historical archive. The list goes on.
None of this is meant as a complaint about our content partners – this is the nature of things. And even having learned this lesson, there isn’t much we could have done differently. Of course, we now have a checklist of questions we give to any new content provider – and they often improve their offering as a result of working with us. But if we hear that incorporating someone’s content will be easy, we now know to take this with a grain of salt.
Qx Anything else you wish to add?
Jason S.Cornez: Thanks for this opportunity. I hope it has been helpful.
Jason S.Cornez, Chief Technology Officer, RavenPack.
Jason joined RavenPack in 2003 and is responsible for the design and implementation of the RavenPack software platform. He is a hands-on technology leader, with a consistent record of delivering break-through products. A Silicon Valley start-up veteran with 20 years of professional experience, Jason combines technical know-how with an understanding of business needs to turn vision into reality. Jason holds a Master’s Degree in Computer Science, along with undergraduate degrees in Mathematics and EECS, from the Massachusetts Institute of Technology.
– Common Lisp Educational Resources: list of books, Lisp-oriented web sites and tutorials.
– Basic Lisp Techniques: The PDF file provides an introduction to the Common Lisp language.
– Mean Reversion II: Pairs Trading Strategies (LINK to .PDF) – Registration required-, Deutsche Bank, Feb. 16, 2016. In this paper, Deutsche Bank shows how to use RavenPack News Analytics as an overlay to a pairs trading strategy.
Follow ODBMS.org on Twitter: @odbmsorg