“To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”)”–Mike Carey.
The AsterixDB Big Data Management System (BDMS) is the result of approximately four years of R&D involving researchers at UC Irvine, UC Riverside, and Oracle Labs. The AsterixDB code base currently consists of over 250K lines of Java code that has been co-developed by project staff and students at UCI and UCR.
The AsterixDB project has been supported by the U.S. National Science Foundation as well as by several generous industrial gifts.
Q1. Why build a new Big Data Management System?
Mike Carey: When we started this project in 2009, we were looking at a “split universe” – there were your traditional parallel data warehouses, based on expensive proprietary relational DBMSs, and then there was the emerging Hadoop platform, which was free but low-function in comparison and wasn’t based on the many lessons known to the database community about how to build platforms to efficiently query large volumes of data. We wanted to bridge those worlds, and handle “modern data” while we were at it, by taking into account the key lessons from both sides.
To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”).
We felt that the Big Data world, once the initial Hadoop furor started to fade a little, would benefit from having a platform that could offer things like:
- a flexible data model that could handle data scenarios ranging from “schema first” to “schema never”;
- a full query language with at least the expressive power of SQL;
- support for data storage, data management, and automatic indexing;
- support for a wide range of query sizes, with query processing cost being proportional to the given query;
- support for continuous data ingestion, hence the accumulation of Big Data;
- the ability to scale up gracefully to manage and query very large volumes of data using commodity clusters; and,
- built-in support for today’s common “Big Data data types”, such as textual, temporal, and simple spatial data.
So that’s what we set out to do.
Q2. What was wrong with the current Open Source Big Data Stack?
Mike Carey: First, we should mention that some reviewers back in 2009 thought we were crazy or stupid (or both) to not just be jumping on the Hadoop bandwagon – but we felt it was important, as academic researchers, to look beyond Hadoop and be asking the question “okay, but after Hadoop, then what?”
We recognized that MapReduce was great for enabling developers to write massively parallel jobs against large volumes of data without having to “think parallel” – just focusing on one piece of data (map) or one key-sharing group of data (reduce) at a time. As a platform for “parallel programming for dummies”, it was (and still is) very enabling! It also made sense, for expedience, that people were starting to offer declarative languages like Pig and Hive, compiling them down into Hadoop MapReduce jobs to improve programmer productivity – raising the level much like what the database community did in moving to the relational model and query languages like SQL in the 70’s and 80’s.
One thing that we felt was wrong for sure in 2009 was that higher-level languages were being compiled into an assembly language with just two instructions, map and reduce. We knew from Tedd Codd and relational history that more instructions – like the relational algebra’s operators – were important – and recognized that the data sorting that Hadoop always does between map and reduce wasn’t always needed.
Trying to simulate everything with just map and reduce on Hadoop made “get something better working fast” sense, but not longer-term technical sense. As for HDFS, what seemed “wrong” about it under Pig and Hive was its being based on giant byte stream files and not on “data objects”, which basically meant file scans for all queries and lack of indexing. We decided to ask “okay, suppose we’d known that Big Data analysts were going to mostly want higher-level languages – what would a Big Data platform look like if it were built ‘on purpose’ for such use, instead of having incrementally evolved from HDFS and Hadoop?”
Again, our idea was to try and bring together the best ideas from both the database world and the distributed systems world. (I guess you could say that we wanted to build a Big Data Reese’s Cup… J)
Q3. AsterixDB has been designed to manage vast quantities of semi-structured data. How do you define semi-structured data?
Mike Carey: In the late 90’s and early 2000’s there was a bunch of work on that – on relaxing both the rigid/flat nature of the relational model as well as the requirement to have a separate, a priori specification of the schema (structure) of your data. We felt that this flexibility was one of the things – aside from its “free” price point – drawing people to the Hadoop ecosystem (and the key-value world) instead of the parallel data warehouse ecosystem.
In the Hadoop world you can start using your data right away, without spending 3 months in committee meetings to decide on your schema and indexes and getting DBA buy-in. To us, semi-structured means schema flexibility, so in AsterixDB, we let you decide how much of your schema you have to know and/or choose to reveal up front, and how much you want to leave to be self-describing and thus allow it to vary later. And it also means not requiring the world to be flat – so we allow nesting of records, sets, and lists. And it also means dealing with textual data “out of the box”, because there’s so much of that now in the Big Data world.
Q4. The motto of your project is “One Size Fits a Bunch”. You claim that AsterixDB can offer better functionality, managability, and performance than gluing together multiple point solutions (e.g., Hadoop + Hive + MongoDB). Could you please elaborate on this?
Mike Carey: Sure. If you look at current Big Data IT infrastructures, you’ll see a lot of different tools and systems being tied together to meet an organization’s end-to-end data processing requirements. In between systems and steps you have the glue – scripts, workflows, and ETL-like data transformations – and if some of the data needs to be accessible faster than a file scan, it’s stored not just in HDFS, but also in a document store or a key-value store.
This just seems like too many moving parts. We felt we could build a system that could meet more (not all!) of today’s requirements, like the ones I listed in my answer to the first question.
If your data is in fewer places or can take a flight with fewer hops to get the answers, that’s going to be more manageable – you’ll have fewer copies to keep track of and fewer processes that might have hiccups to watch over. If you can get more done in one system, obviously that’s more functional. And in terms of performance, we’re not trying to out-perform the specialty systems – we’re just trying to match them on what each does well. If we can do that, you can use our new system without needing as many puzzle pieces and can do so without making a performance sacrifice.
We’ve recently finished up a first comparison of how we perform on tasks that systems like parallel relational systems, MongoDB, and Hive can do – and things look pretty good so far for AsterixDB in that regard.
Q5. AsterixDB has been combining ideas from three distinct areas — semi-structured data management, parallel databases, and data-intensive computing. Could you please elaborate on that?
Mike Carey: Our feeling was that each of these areas has some ideas that are really important for Big Data. Borrowing from semi-structured data ideas, but also more traditional databases, leads you to a place where you have flexibility that parallel databases by themselves do not. Borrowing from parallel databases leads to scale-out that semi-structured data work didn’t provide (since scaling is orthogonal to data model) and with query processing efficiencies that parallel databases offer through techniques like hash joins and indexing – which MapReduce-based data-intensive computing platforms like Hadoop and its language layers don’t give you. Borrowing from the MapReduce world leads to the open-source “pricing” and flexibility of Hadoop-based tools, and argues for the ability to process some of your queries directly over HDFS data (which we call “external data” in AsterixDB, and do also support in addition to managed data).
Q6. How does the AsterixDB Data Model compare with the data models of NoSQL data stores, such as document databases like MongoDB and CouchBase, simple key/value stores like Riak and Redis, and column-based stores like HBase and Cassandra?
Mike Carey: AsterixDB’s data model is flexible – we have a notion of “open” versus “closed” data types – it’s a simple idea but it’s unique as far as we know. When you define a data type for records to be stored in an AsterixDB dataset, you can choose to pre-define any or all of the fields and types that objects to be stored in it will have – and if you mark a given type as being “open” (or let the system default it to “open”), you can store objects there that have those fields (and types) as well as any/all other fields that your data instances happen to have at insertion time.
Or, if you prefer, you can mark a type used by a dataset as “closed”, in which case AsterixDB will make sure that all inserted objects will have exactly the structure that your type definition specifies – nothing more and nothing less.
(We do allow fields to be marked as optional, i.e., nullable, if you want to say something about their type without mandating their presence.)
What this gives you is a choice! If you want to have the total, last-minute flexibility of MongoDB or Couchbase, with your data being self-describing, we support that – you don’t have to predefine your schema if you use data types that are totally open. (The only thing we insist on, at the moment, is that every type must have a key field or fields – we use keys when sharding datasets across a cluster.)
Structurally, our data model was JSON-inspired – it’s essentially a schema language for a JSON superset – so we’re very synergistic with MongoDB or Couchbase data in that regard.
On the other end of the spectrum, if you’re still a relational bigot, you’re welcome to make all of your data types be flat – don’t use features like nested records, lists, or bags in your record definitions – and mark them all as “closed” so that your data matches your schema. With AsterixDB, we can go all the way from traditional relational to “don’t ask, don’t tell”. As for systems with BigTable-like “data models” – I’d personally shy away from calling those “data models”.
Q7. How do you handle horizontal scaling? And vertical scaling?
Mike Carey: We scale out horizontally using the same sort of divide-and-conquer techniques that have been used in commercial parallel relational DBMSs for years now, and more recently in Hadoop as well. That is, we horizontally partition both data (for storage) and queries (when processed) across the nodes of commodity clusters. Basically, our innards look very like those of systems such as Teradata or Parallel DB2 or PDW from Microsoft – we use join methods like parallel hybrid hash joins, and we pay attention to how data is currently partitioned to avoid unnecessary repartitioning – but have a data model that’s way more flexible. And we’re open source and free….
We scale vertically (within one node) in two ways. First of all, we aren’t memory-dependent in the way that many of the current Big Data Analytics solutions are; it’s not that case that you have to buy a big enough cluster so that your data, or at least your intermediate results, can be memory-resident.
Instead, our physical operators (for joins, sorting, aggregation, etc.) all spill to disk if needed – so you can operate on Big Data partitions without getting “out of memory” errors. The other way is that we allow nodes to hold multiple partitions of data; that way, one can also use multi-core nodes effectively.
Q8. What performance figures do you have for AsterixDB?
Mike Carey: As I mentioned earlier, we’ve completed a set of initial performance tests on a small cluster at UCI with 40 cores and 40 disks, and the results of those tests can be found in a recently published AsterixDB overview paper that’s hanging on our project web site’s publication page (http://asterixdb.ics.uci.edu/publications.html).
We have a couple of other performance studies in flight now as well, and we’ll be hanging more information about those studies in the same place on our web site when they’re ready for human consumption. There’s also a deeper dive paper on the AsterixDB storage manager that has some performance results regarding the details of scaling, indexing, and so on; that’s available on our web site too. The quick answer to “how does AsterixDB perform” is that we’re already quite competitive with other systems that have narrower feature sets – which we’re pretty proud of.
Q9. You mentioned support for continuous data ingestion. How does that work?
Mike Carey: We have a special feature for that in AsterixDB – we have a built-in notion of Data Feeds that are designed to simplify the lives of users who want to use our system for warehousing of continuously arriving data.
We provide Data Feed adaptors to enable outside data sources to be defined and plugged in to AsterixDB, and then one can “connect” a Data Feed to an AsterixDB data set and the data will start to flow in. As the data comes in, we can optionally dispatch a user-defined function on each item to do any initial information extraction/annotation that you want. Internally, this creates a long-running job that our system monitors – if data starts coming too fast, we offer various policies to cope with it, ranging from discarding data to sampling data to adding more UDF computation tasks (if that’s the bottleneck). More information about this is available in the Data Feeds tech report on our web site, and we’ll soon be documenting this feature in the downloadable version of AsterixDB. (Right now it’s there but “hidden”, as we have been testing it first on a set of willing UCI student guinea pigs.)
Q10. What is special about the AsterixDB Query Language? Why not use SQL?
Mike Carey: When we set out to define the query language for AsterixDB, we decided to define our own new language – since it seemed like everybody else was doing that at the time (witness Pig, Jaql, HiveQL, etc.) – one aimed at our data model.
SQL doesn’t handle nested or open data very well, so extending ANSI/ISO SQL seemed like a non-starter – that was also based on some experience working on SQL3 in the late 90’s. (Take a look at Oracle’s nested tables, for example.). Based on our team’s backgrounds in XML querying, we actually started there – XQuery was developed by a team of really smart people from the SQL world (including Don Chamberlin, father of SQL) as well as from the XML world and the functional programming world – so we started there. We took XQuery and then started throwing the stuff overboard that wasn’t needed for JSON or that seemed like a poor feature that had been added for XPath compatibility.
What remained was AQL, and we think it’s a pretty nice language for semistructured data handling. We periodically do toy with the notion of adding a SQL-like re-skinning of AQL to make SQL users feel more at home – and we may well do that in the future – but that would be different than “real SQL”. (The N1QL effort at Couchbase is doing something along those lines, language-wise, as an example. The SQL++ design from UCSD is another good example there.)
Q11. What level of concurrency and recovery guarantees does AsterixDB offer?
Mike Carey: We offer transaction support that’s akin to that of current NoSQL stores. That is, we promise record-level ACIDity – so inserting or deleting a given record will happen as an atomic, durable action. However, we don’t offer general-purpose distributed transactions. We support an arbitrary number of secondary indexes on data sets, and we’ll keep all the indexes on a data set transactionally consistent – that we can do because secondary index entries for a given record live in the same data partition as the record itself, so those transactions are purely local.
Q12. How does AsterixDB compare with Hadoop? What about Hadoop Map/Reduce compatibility?
Mike Carey: I think we’ve already covered most of that – Hadoop MapReduce is an answer to low-level “parallel programming for dummies”, and it’s great for that – and languages on top like Pig Latin and HiveQL are better programming abstractions for “data tasks” but have runtimes that could be much better. We started over, much as the recent flurry of Big Data analytics platforms are now doing (e.g., Impala, Spark, and friends), but with a focus on scaling to memory-challenging data sizes. We do have a MapReduce compatibility layer that goes along with our Hyracks runtime layer – Hyracks is name of our internal dataflow runtime layer – but our MapReduce compatibility layer is not related to (or connected to) the AsterixDB system.
Q13. How does AsterixDB relate to Hadapt?
Mike Carey: I’m not familiar with Hadapt, per se, but I read the HadoopDB work that fed into it.
We’re architecturally very different – we’re not Hadoop-based at all – I’d say that HadoopDB was more of an expedient hybrid coupling of Hadoop and databases, to get some of the indexing and local query efficiency of an existing database engine quickly in the Hadoop world. We were thinking longer term, starting from first principles, about what a next-generation BDMS might look like. AsterixDB is what we came up.
Q14. How does AsterixDB relate to Spark?
Mike Carey: Spark is aimed at fast Big Data analytics – its data is coming from HDFS, and the task at hand is to scan and slice and dice and process that data really fast. Things like Shark and SparkSQL give users SQL query power over the scanned data, but Spark in general is really catching fire, it appears, due to its applicability to Big Machine Learning tasks. In contrast, we’re doing Big Data Management – we store and index and query Big Data. It would be a very interesting/useful exercise for us to explore how to make AsterixDB another source where Spark computations can get input data from and send their results to, as we’re not targeting the more complex, in-memory computations that Spark aims to support.
Q15. How can others contribute to the project?
Mike Carey: We would love to see this start happening – and we’re finally feeling more ready for that, and even have some NSF funding to make AsterixDB something that others in the Big Data community can utilize and share.
(Note that our system is Apache-style open source licensed, so there are no “gotchas” lurking there.)
Some possibilities are:
(1) Others can start to use AsterixDB to do real exploratory Big Data projects, or to teach about Big Data (or even just semistructured data) management. Each time we’ve worked with trial users we’ve gained some insights into our feature set, our query optimizations, and so on – so this would help contribute by driving us to become better and better over time.
(2) Folks who are studying specific techniques for dealing with modern data – e.g., new structures for indexing spatiotemporaltextual (J) data – might consider using AsterixDB as a place to try out their new ideas.
(This is not for the meek, of course, as right now effective contributors need to be good at reading and understanding open source software without the benefit of a plethora of internal design documents or other hints.) We also have some internal wish lists of features we wish we had time to work on – some of which are even doable from “outside”, e.g., we’d like to have a much nicer browser-based workbench for users to use when interacting with and managing an AsterixDB cluster.
(3) Students or other open source software enthusiasts who download and try our software and get excited about it – who then might want to become an extension of our team – should contact us and ask about doing so. (Try it first, though!) We would love to have more skilled hands helping with fixing bugs, polishing features, and making the system better – it’s tough to build robust software in a university setting, and we would especially welcome contributors from companies.
Thanks very much for this opportunity to share what we’ve being doing!
Michael J. Carey is a Bren Professor of Information and Computer Sciences at UC Irvine.
Before joining UCI in 2008, Carey worked at BEA Systems for seven years and led the development of BEA’s AquaLogic Data Services Platform product for virtual data integration. He also spent a dozen years teaching at the University of Wisconsin-Madison, five years at the IBM Almaden Research Center working on object-relational databases, and a year and a half at e-commerce platform startup Propel Software during the infamous 2000-2001 Internet bubble. Carey is an ACM Fellow, a member of the National Academy of Engineering, and a recipient of the ACM SIGMOD E.F. Codd Innovations Award. His current interests all center around data-intensive computing and scalable data management (a.k.a. Big Data).
- AsterixDB Big Data Management System (BDMS): Downloads, Documentation, Asterix Publications.
Follow ODBMS.org on Twitter: @odbmsorg
“The Hadoop platform indeed provides the ability to efficiently process large-scale data at a price point we haven’t been able to justify with traditional technology. That said, not every technology process requires Hadoop; therefore, we have to be smart about which processes we deploy on Hadoop and which are a better fit for traditional technology (for example, RDBMS).”–Kevin Murray.
I wanted to learn how American Express is taking advantage of analysing big data.
I have interviewed Sastry Durvasula, Vice President – Technology, American Express, and Kevin Murray, Vice President – Technology, American Express.
Q1. With the increasing demand for mobile and digital capabilities, how are American Express’ customer expectations changing?
SASTRY DURVASULA: American Express customers expect us to know them, to understand and anticipate their preferences and personalize our offerings to meet their specific needs. As the world becomes increasingly mobile, our Card Members expect to be able to engage with us whenever, wherever and using whatever device or channel they prefer.
In addition, merchants, small businesses and corporations also want increased value, insights and relevance from our global network.
Q2. Could you explain what is American Express’ big data strategy?
SD: American Express seeks to leverage big data to deliver innovative products in the payments and commerce space that provide value to our customers. This is underpinned by best-in-class engineering and decision science.
From a technical perspective, we are advancing an enterprise-wide big data platform that leverages open source technologies like Hadoop, integrating it with our analytical and operational capabilities across the various business lines. This platform also powers strategic partnerships and real-time experiences through emerging digital channels. Examples include Amex Offers, which connects our Card Members and merchants through relevant and personalized digital offers; an innovative partnership with Trip Advisor to unlock exclusive benefits; insights and tools for our B2B partners and small businesses; and advanced credit and fraud risk management.
Additionally, as always, we seek to leverage data responsibly and in a privacy-controlled environment. Trust and security are hallmarks of our brand. As we leverage big data to create new products and services, these two values remain at the forefront.
Q3. What is the “value” you derive by analysing big data for American Express?
SD: Within American Express, our Technology and Risk & Information Management organizations partner with our lines of business to create new opportunities to drive commerce and serve customers across geographies with the help of big data. Big data is one of our most important tools in being the company we want to be – one that identifies solutions to customers’ needs and helps us deliver what customers want today and what they may want in the future.
Q4. What metrics do you use to monitor big data analytics at American Express?
SD: Big data investments are no different than any other investments in terms of the requirement for quantitative and qualitative ROI metrics with pre- and post-measurements that assess the projects’ value for revenue generation, cost avoidance and customer satisfaction. There is also the recognition that some of the investments, especially in the big data arena, are strategic and longer term in nature, and the value generated should be looked at from that perspective.
Q5. Could you explain how did you implement your big data infrastructure platform at Amex?
KEVIN MURRAY: We started small and expanded as our use cases grew over time, about once or twice a year.
We make it a practice to reassess the hardware and software state within the industry before each major expansion to determine whether any external changes should alter the deployment path we have chosen.
Q6. How did you select the components for your big data infrastructure platform, choosing among the various competing compute and storage solutions available today?
KM: Our research told us low-cost commodity servers with local storage was the common deployment stack across the industry. We made an assessment of industry offerings and evaluated against our objectives to determine a good balance of cost, capabilities and time to market.
Q7. How did you unleash big data across your enterprise and put it to work in a sustainable and agile environment?
SD: We engineered our enterprise-wide big data platform to foster R&D and rapid development of use cases, while delivering highly available production applications. This allows us to be adaptable and agile, scaling up or redeploying, as needed, to meet market and business demands. With the Risk and Information Management team, we established Big Data Labs comprising top-notch decision scientists and engineers to help democratize big data, leveraging self-service tools, APIs and common libraries of algorithms.
Q8. What are the most significant challenges you have encountered so far?
SD: An ongoing challenge is balancing our big data investment between immediate needs and research or innovations that will drive the next generation of capabilities. You can’t focus solely on one or the other but has to find a balance.
Another key challenge is ensuring we are focused on driving outcomes that are meaningful to customers – that are responsive to their current and anticipated needs.
Q9. What did you learn along the way?
KM: The Hadoop platform indeed provides the ability to efficiently process large-scale data at a price point we haven’t been able to justify with traditional technology. That said, not every technology process requires Hadoop; therefore, we have to be smart about which processes we deploy on Hadoop and which are a better fit for traditional technology (for example, RDBMS). Some components of the ecosystem are mature and work well, and others require some engineering to get to an enterprise-ready state. In the end, it’s an exciting journey to offer new innovation to our business.
Q10. Anything else you wish to add?
KM: The big data industry is evolving at lightning speed with new products and services coming to market every day. I think this is being driven by the enterprise’s appetite for something new and innovative that leverages the power of compute, network and storage advancements in the marketplace, combined with a groundswell of talent in the data science domain, pushing academic ideas into practical business use cases. The result is a wealth of new offerings in the marketplace – from ideas and early startups to large-scale mission-critical solutions. This is providing choice to enterprises like we’ve never seen before, and we are focused on maximizing this advantage to bring groundbreaking products and opportunities to life.
Sastry Durvasula, Vice President – Technology, American Express
Sastry Durvasula is Vice President and Global Technology Head of Information Management and Digital Capabilities within the Technology organization at American Express. In this role, Sastry leads IT strategy and transformational development to power the company’s data-driven capabilities and digital products globally. His team also delivers enterprise-wide analytics and business intelligence platforms, and supports critical risk, fraud and regulatory demands. Most recently, Sastry and his team led the launch of the company’s big data platform and transformation of its enterprise data warehouse, which are powering the next generation of information, analytics and digital capabilities. His team also led the development of the company’s API strategy, as well as the Sync platform to deliver innovative products, drive social commerce and launch external partnerships.
Kevin Murray, Vice President – Technology, American Express
Kevin Murray is Vice President of Information Management Infrastructure & Integration within the Technology organization at American Express. Throughout his 25+ year career, he has brought emerging technologies into large enterprises, and most recently launched the big data infrastructure platform at American Express. His team architects and implements a wide range of information management capabilities to leverage the power of increasing compute and storage solutions available today.
Presenting at Strata/Hadoop World NY
Big Data: A Journey of Innovation
Thursday, October 16, 2014, at 1:45-2:25 p.m. Eastern
Room: 1 CO3/1 CO4
The power of big data has become the catalyst for American Express to accelerate transformation for the digital age, drive innovative products, and create new commerce opportunities in a meaningful and responsible way. With the increasing demand for mobile and digital capabilities, the customer expectation for real-time information and differentiated experiences is rapidly changing. Big data offers a solution that enables this organization to use their proprietary closed-loop network to bring together consumers and merchants around the world, adding value to each in a way that is individualized and unique.
During their presentation, Sastry Durvasula and Kevin Murray will discuss American Express’ ongoing big data journey of transformation and innovation. How did the company unleash big data across its global network and put it to work in a sustainable and agile environment? How is it delivering offers using digital channels relevant to their Card Members and partners? What have they learned along the way? Sastry and Kevin will address these questions and share their experiences and insights on the company’s big data strategy in the digital ecosystem.
Follow ODBMS.org and ODBMS Industry Watch on Twitter: @odbmsorg
“We need high performance databases for a wide range of challenges and analyses that arise from a variety of different systems and processes.”–Jutta Bremm, BMW
Q1. What is your role, and for what IT projects are you responsible for at BMW?
Jutta Bremm: I am IT Project Leader for IT projects at BMW with a volume of more than 10 million Euro per year.
Q2. What are the main technical challenges you have at BWM?
Jutta Bremm: We need high performance databases for a wide range of challenges and analyses that arise from a variety of different systems and processes.
These don’t only include recursive, parameterized explosions for bills of materials, but also the provision of standardized tools to the business departments. That way, they can run their own queries more often and are not so dependent on IT to do it for them.
Q3. You define CortexDB as a schema-less multi-model database. What does it mean in practice? What kind of applications is it useful for?
Peter Palm: In CortexDB, datasets are stored as independent entities (cf. objects). To achieve this, the system transforms all content into a new type of index structure. This ensures that every item of content and every field “knows” the context in which it is being used. As a result, the database isn’t searched. Instead, queries are run on information that is already known and the results are combined using simple procedures based on set theory.
This is why there’s no predefined schema for the datasets – only for the index of all fields and the content.
This is what differentiates CortexDB from all other databases, which require the configuration of at least one index even though the datasets themselves are stored in schema-less mode.
The innovative index structure means that no administrative adaptation or optimization of the index is necessary.
Nor is there any requirement for an index for a specific applications – and that enables users to query all the content whenever they want and combine queries with each other too. That makes it very flexible for them to query any field and easily make any necessary development changes to in-house applications.
From the server’s perspective, the fields and content, as well as the interpretation of dataset structure and utilization, are not that important. The application working with the data creates a data structure that can be changed at any time (this is known as schema-less). For CortexDB, all that’s relevant is the content-based structure, which can be used in a generalized way and modified any time. This design gives customers a significant advantage when working with recursive data structures.
This is why CortexDB is particularly well suited to tasks whose definitive structure cannot be fixed at the beginning of the project, as well as for systems that change dynamically. The content-based architecture and the innovative index also deliver significant benefits for BI systems, as ad hoc analyses can be run and adapted whenever required.
In addition, users can add a validity period (“valid from…”) to any item of content. This enables them to view the evolution of particular data over time (known as historization). This evolutionary information is ideal for storing data that change frequently, such as smart metering and insurance information. For each field in a dataset, users don’t only see the information that was valid at the time of the transaction, but also the validity date after which the information was/is/will be valid. This is what we call a temporal database.
These benefits are complemented by the fact that individual fields can be used alone or in combination with others and repeated within a dataset. This – together with the use of validity dates – is what we call a “multi-value” database.
The terms “multi-model”, “multi-value” and “schema-less” also explain the fact that benefits of the database functions mentioned above apply to other NoSQL databases too, but users can extend these with new functions. In principle, any other database can be seen as a subset of CortexDB:
Database type: Key/Value Store
Function: One dataset = one key with one value (a value or value list) => a single, large index of keys
How it works in CortexDB: Every value and every field is indexed automatically and can be freely combined with others by using an occurrence list
Database type: Document Store
Function: One dataset combines several fields using a common ID (often json objects)
How it works in CortexDB: One ID combines fields that belong together in a dataset. Datasets can be output as json objects via an API.
Database type: GraphDB
Function: Links to other datasets are saved as meta information and can be used via proprietary graph queries.
How it works in CortexDB: Links are stored as actual data in a dataset and can be edited using additional fields. Fields can be repeated as often as required.
Database type: Big Table
Function: Multi-dimensional tables that use timestamps to define the validity of information. Its datasets can have a variety of attributes.
How it works in CortexDB: The use of a validity date in addition to a transaction date delivers a temporal database. Additional content can be added despite the dataset description.
Database type: Object oriented
Function: A class model defines the objects that need to be monitored persistently.
How it works in CortexDB: With the Cortex UniPlex application, users can define dataset types. Compared with classes, these define the maximum attributes of a dataset. Nevertheless, users can add more fields at any time, even if they have not been defined for UniPlex.
Q4. Can you please describe the use cases where you use CortexDB at BMW?
Jutta Bremm: The current use case for which we’re working with CortexDB is the explosion of bills of material for the configuration of test vehicles.
The construction of test vehicles must be planned and timed just as carefully as with mass production. To make the process smoother, we conduct reviews before starting construction to ensure that the bills of materials include the right parts and are therefore complete and free of any errors and conflicts.
One thing I’d like to point out here is that every vehicle comprises 15,000 parts, so there are between 10 to the power of 30 and 10 to the power of 60 configuration possibilities! It’s easy to understand why this isn’t an easy task. This high variance is due to the number of different models, engine types, displacements, optional extras, interior fittings and colors. As a result, a development BOM can only be stored in a highly compressed format.
To obtain an individual car from all this, the BOM must be “exploded” recursively. Multiple parameters have an effect on this, including validities (deadlines for parts, products, optional extras, markets etc.), construction stipulations (“this part can only be installed together with a navigation device and a 3-liter engine”) and structures (“this part is comprised of several smaller parts”).
Unlike conventional solutions, for which an explosion function is complex and expensive, the interpretation of the compressed BOM is very easy for CortexDB due to its bidirectional linking technology.
Q5. Why did you select CortexDB and not a classical relational database system? Did you compare CortexDB with other database management systems?
Jutta Bremm: We were looking for a product that would be easy to use, as well as simple and flexible to configure, for our users in product data management. We also wanted the highest possible level of functionality included as standard.
We looked at 4 products that appeared to be suitable for use by the departments for analysis and evaluation. The essential functions for product data management – explosion and the documentation of components used – were only available as standard with CORTEX. For all other products, we were looking at customer-specific extensions that would have cost several hundred thousand euros.
Q6. How do you store complex data structures (such as for example graphs) in CortexDB?
Peter Palm: CortexDB sees graphs as a derivative of certain database functions.
Firstly, it uses the “internal reference” field type (link). This is a data field in which the UUID of a target dataset is stored. That alone enables the use of simple links.
Second, users can choose to define fields as “repeating fields”. That means that the same field can also be used within a dataset. This is useful when a contact has more than one email address or phone number, and for links to individual parts in a BOM.
Repeating fields defined in this way can be grouped together to produce “repeating field groups”. Content items that belong together are thus stored as an information block. An example of this is bank account details that comprise the bank’s name, the sort code and the account number.
The use of repeating field groups, in which validity values are added to linked fields, enables complex data structures within a single dataset.
In addition, every dataset “knows” which other dataset is pointing to it. This bidirectional information using a simple link means that data administration is only required for one dataset. It is only necessary in both datasets if there are two conflicting points of view on a graph (e.g. “my friend considers me as an enemy”).
In addition, result sets can be combined with partial sets resulting from links when running queries and making selections. This limits the results to those that include certain details about their link structures.
Q7. How do you perform data analytics with CortexDB?
Peter Palm: The content in every field “knows” the field context it is being used in and how often (“occurrence list” or “field index”). By combining partial sets (as in set theory), result sets are determined extremely fast, eliminating the need for read access to individual datasets.
CortexDB comes with an application that lets users freely configure queries, reports and graphical output. There is also an application API (data service) that enables these elements to be used within in-house applications or interfaces.
The solution also identifies correlations itself using algorithms, even if they are connected via graphs. Unlike data warehouse systems, this lets users do more than just test estimates or ideas – it determines a result on its own and delivers it to the user for further analysis or for modification of the algorithm.
Q8. Do you some performance metrics for the analysis of recursive structured BOMs (bill of material) for your vehicles?
Jutta Bremm: Internal tests on BOM explosion with conventional relational databases showed that it took up to 120 seconds. Compare that with CortexDB, which delivers the result of the same explosion in 50 milliseconds.
Q9. How do you handle data quality control?
Jutta Bremm: We require 100% data quality (consistency at all times) and CortexDB delivers that.
Q10. What are the main business benefits of using CortexDB for these use cases?
Jutta Bremm: The agile modeling, the flexible adaptation options and the level of functionality delivered as standard shortens the duration of a project and reduces the costs compared to the other products we tested (see Q5).
Qx. Anything else you wish to add?
Jutta Bremm, Peter Palm: By using the temporal capabilities (time of transaction and time of validity), users can easily see which individual value in a dataset was/is/will be valid and from when.
Jutta Bremm, IT Project Manager, BMW.
Jutta is a IT Project Leader at BMW in product data management since 1987.
She was involved in IT projects at Siemens, Wacker Chemie, Sparkassenverband since 1978.
Peter Palm, Chief Visionary Officer (CVO) at Cortex.
Started CortexDB development in 1997.
Holds a Master in electronic engineering.
Area of expertise: Computer hardware development, Chip design, Independent Design Center for Chip Design (Std-Cell, Gate Array), Operating system development, CRM development since 1986.
Follow ODBMS.org on Twitter: @odbmsorg
“The main challenge when working with “big data” in Yahoo has always been our definition of “big”. :] There are several thousands of feeds on Yahoo’s Hadoop clusters, with daily, hourly and up-to-the-minute data frequencies, spanning Petabytes of data”.–Mithun Radhakrishnan
Q1. You work on Apache Hive, in the Yahoo Hadoop team. What are the most current projects you are working on?
Mithun Radhakrishnan: I work on the Hive team at Yahoo. Currently, we are migrating our Hadoop clusters from Hadoop 0.23 (initial release of YARN) to Hadoop 2.5. My team has been focusing on making sure that Hive 0.12 is performant on Hadoop 2.5, as well as rolling out Hive 0.13 to Yahoo’s Grid infrastructure. We have also been busy trying to enhance the performance of Hive queries, as well as of the Hive metastore, to work effectively at Yahoo’s large scale.
Q2. What are the most important challenges you are facing for the deployment, scaling and performance of Hive-related services at Yahoo?
Mithun Radhakrishnan:The main challenge when working with “big data” in Yahoo has always been our definition of “big”. :] There are several thousands of feeds on Yahoo’s Hadoop clusters, with daily, hourly and up-to-the-minute data frequencies, spanning Petabytes of data. Each feed would correspond to a Hive table, with the timestamp (date, hour, minute) being just one of several levels of partition-keys. Some of our more popular feeds add hundreds of thousands of partitions daily, and span millions overall. We’re working on optimizations in Hive’s metadata-storage, to scale to these high levels.
Another recent challenge has been the increased adoption of Business Intelligence and Data visualization tools (such as Tableau and MicroStrategy), connected directly to Grid data over HiveServer2. Such use imposes expectations not only on Hive query performance, but also on data transport as well as the metastore.
And finally, the hardware on which Hadoop runs at Yahoo is heterogeneous, accumulated over many years of usage at Yahoo. While our newer clusters use bleeding-edge hardware with gobs of memory, some of our clusters are several years old.
At our scale, we don’t have the luxury of completely replacing our hardware every year. We need our Grid software (Hadoop, Hive, Pig, etc.) to be performant on a variety of processor/memory/disk configurations.
Q3. What kind of Hive-related services did you implement at Yahoo?
Mithun Radhakrishnan: Yahoo has traditionally been an Apache Pig shop, but recently, we’ve seen an increase in the number of Hive jobs. This may be attributed to increased SQL-based analytics, proliferation of Business Intelligence tools, and some use of Hive for data transformations.
At Yahoo, we use HCatalog (i.e. Hive’s metadata server) for interoperability between Pig, Hive and MapReduce. An HCatalog Server runs as a separate service, serving metadata about various datasets.
Users consume this data using Hive directly, or using Pig and MapReduce (via HCatalog wrappers).
The data lifecycle (ingestion, replication and retirement) is managed via the Grid Data Management (GDM) suite, which was a pre-cursor to the Apache Falcon project. GDM is tightly integrated with HCatalog, and deals with data-registrations and discovery with HCatalog.
To enable data analysis and visualization tools for analysts, we deploy HiveServer2 instances. This allows direct JDBC/ODBC based connections to Grid data, to drastically cut down analysis and decision time, as well as unnecessary intermediate copies in a separate data warehouse.
A large number of users employ Oozie jobs to produce/consume Hive data, using Oozie’s “Hive Actions“. The Yahoo Hive and Oozie teams have integrated the two systems to reduce latencies in data processing pipelines.
Q4, What is Y!Grid ? And what is it useful for?
Mithun Radhakrishnan: Y!Grid is Yahoo’s Grid of Hadoop Clusters that’s used for all the “big data” processing that happens in Yahoo today. It currently consists of 16 clusters in multiple datacenters, spanning 32,500 nodes, and accounts for almost a million Hadoop jobs every day. Around 10% of those are Hive jobs.
No one else makes as much use out of Hadoop every single day as Yahoo does. Some of the notable use cases of Hadoop at Yahoo include:
Content Personalization for increasing engagement by presenting personalized content to users based on their profile and current activity
Ad Targeting and Optimization for serving the right ad to the right customer by targeting billions of impressions everyday based on recent user activities
New Revenue Streams from native ads and mobile search monetization through better serving, budgeting, reporting and analytics
Data Processing Pipelines for aggregating various dimensions of event level traffic data (page, ad, link views, link clicks, etc.) across billions of audience, search, and advertising events everyday
Mail Anti-spam and Membership Anti-abuse for blocking billions of spam emails and hundreds of thousands of abusive accounts per day through machine learning algorithms
Search Assist and Analytics for improving the Yahoo Search experience by processing billions of web pages
Q5. What are the strengths and weakness of Hive in your experience?
Mithun Radhakrishnan: Hive combines the immense computing power of Hadoop with the accessibility and expressiveness of SQL. Its main strengths include:
Scale: Hive scales easily to multi-terabyte datasets, and isn’t shackled by memory constraints.
SQL: Hive allows business logic to be expressed in SQL. This lowers the bar of entry for usage, allowing data-analysts with little Hadoop experience to use their expertise with Hadoop data. (Performance tuning is, admittedly, a different kettle of fish. ;])
Standard: Apache Hive supports analytics through Tableau, Microstrategy and Microsoft Excel, and has supported this for the longest time.
Strong community: The dev community in Hive is brilliant, vibrant and active (as a glance at the Git log would reveal. ;]) We’ve recently seen the introduction of an Apache Tez backend, vectorization support, optimized file-formats like ORC, as well as the promise of very interesting things to come (such as the new Cost Based Optimizer, and an Apache Spark-based back-end).
Which is not to say that everything’s perfect:
M/R: Until recently, Hive’s physical plans could only target MapReduce, which caused multi-stage queries to run quite slowly. Hive 0.13 now supports the expression of physical plans as arbitrary DAGs, using Apache Tez. This dramatically boosts performance, as our benchmarks have shown.
Standard SQL: HiveQL isn’t quite SQL92-compliant yet, although it’s tending in the right direction. Industry-standard benchmarks like TPC-h and TPC-ds typically need rewriting to run on Hive. To borrow a simile from Rowan Atkinson: it is sort of like Andrew Lloyd Webber rearranging the score of Evita to suit the vocal range of Britney Spears. :]
Metastore performance and data throughput in HiveServer2 still have room for improvement.
Q6. You are an Apache HCatalog committer. What are your most important contributions? Who is currently using HCatalog and for what?
Mithun Radhakrishnan: The Apache HCatalog project has been merged with the main Apache Hive project now.
My work with HCatalog has primarily revolved around integration with other projects. Specifically:
I worked on the HCatalog notification system, to send JMS compliant notifications in response to changes to a dataset’s metadata. In Yahoo, we use this specifically with Oozie, to kick off Oozie workflows as soon as their dependency dataset-partitions are published in HCatalog. This reduces workflow launch latencies, end-to-end pipeline execution times, while also reducing NameNode pressure caused by polling.
I’ve worked (and am still working) on integration with data ingestion services like GDM. My focus at the moment is on metastore performance, and replication of tables/partitions across HCatalog instances.
HCatalog is an integral part of data processing pipelines at Yahoo, given its integration with GDM and Oozie. Outside of Yahoo, HCatalog is also used at Twitter and LinkedIn, as far as I’m aware. I’m sure there are other firms as well.
HCatalog is also used externally by several projects such as Apache Falcon, Apache Oozie, etc.
Q7. You have been benchmarking various versions of Hive. What are the main results you have obtained?
Mithun Radhakrishnan: I’ve had the opportunity to benchmark Apache Hive 0.10 through Hive 0.13, across various scales of input data, multiple data formats and tuning parameters. We’ve observed that the query performance has improved steadily, with each major release. But the jump in Hive 0.13 has been quite phenomenal. The switch to a more expressive physical execution engine in Apache Tez, coupled with vectorization, ORC files and table/column statistics has really paid dividends.
For the Yahoo Hadoop team, the main result from the benchmark was that Apache Hive 0.13 supports a “high dynamic range” of data scale: it is performant enough at the 100GB scale to approach interactivity, while simultaneously also scaling to 10+ TB of data. Given that the system scales over such a wide range, and that Yahoo already deploys Hive in production, we find little reason to deploy any other frameworks for SQL-based analytics on Y!Grid.
Q8. How did you define the workloads for your benchmark of Hive?
Mithun Radhakrishnan: When we started off, we considered creating a Yahoo-specific benchmark: a set of Hive scripts and accompanying datasets to represent the Yahoo workload. The problem was that there was a variety of datasets, and several Hive users, running different kinds of workloads.
In the end, we opted to use the TPC-h benchmarks instead. These are industry standard, more or less representative of the jobs we run at Yahoo. Hortonworks was already running a large subset of TPC-ds benchmarks on Hive. We decided that TPC-h would allow for complementary coverage. We did partition the data and transform it in the way that we would have with production data.
At the time, the comparisons most people were trying to make were between Shark (on Apache Spark) and Hive.
Shark engineers had posted results from running a port of TPC-h, transliterated to Hive’s SQL dialect. We figured we’d get an apples-to-apples comparison by running those scripts against Hive.
Q9. Did you compare Hive with other Big Data software platforms?
Mithun Radhakrishnan: The objective of the benchmark was primarily to track Apache Hive’s progressive performance gains viz. prior versions. However, I did compare Hive 0.12 and 0.13′s performance against Shark 0.7.1 and Shark 0.8 (which was trunk at the time.)
The results were mixed. At the 100GB scale, I did see Shark perform admirably. But I ran into problems with Shark at scale. A large majority of queries simply didn’t complete on Shark at the 10+TB scale. It appeared that a lot of time was lost in shuffling data between consecutive stages. Coupled with the fact that Shark was only compatible with Hive Metastore v0.9, that the number of reducers wasn’t deduced per job, the lack of support for security or interoperability with our existing production systems, Apache Hive looked the better fit for Y!Grid.
I haven’t had the opportunity to compare Hive against other systems yet. I do hope to, as soon as I can find the time, but my day-job keeps me pretty busy. :]
Qx Anything else you wish to add?
Mithun Radhakrishnan: Lots of people fret over query performance. Performance is important, but one must think holistically about the data and workload that needs to be processed, hardware choices available at various price points, holistic long-term TCOs of operating the system, current and future use cases, support, etc. Everyone’s situation would be a bit different and something to take into account when thinking about a SQL-on-Hadoop solution.
My name is Mithun Radhakrishnan. I work on Apache Hive, in the Yahoo Hadoop team. My team is responsible for the deployment, scaling and performance of Hive-related services (including HCatalog and HiveServer2) on the Y!Grid, the largest production Hadoop Clusters in existence today.
I’ve been working on Hadoop-related projects in Yahoo since 2009, including the Grid Data Management System (pre-cursor to Apache Falcon), HCatalog and Hive. I’m an Apache HCatalog committer and Hive contributor. Prior to working at Yahoo, I was a firmware developer at Hewlett-Packard, writing hardware self-diagnostic and healing firmware for HP’s big-iron boxen (Integrity Servers, running Intel Itaniums).
I’m currently working broadly on getting the Hive Metastore to perform at Yahoo-scale.
I’ve recently had the pleasure of benchmarking various versions of Hive (0.10-13), with different settings, file-formats, etc., to gauge progressive performance gains. I’ll be presenting my findings at Strata 2014.
-Hive on Apache Tez: Benchmarked at Yahoo! Scale. Mithun Radhakrishnan (Yahoo! Inc.). Talk at Strata+Hadoop Conference. 2:35pm Thursday, 10/16/2014
Follow ODBMS.org on Twitter: @odbmsorg
“A main challenge facing clinical and genomic data sharing efforts is the lack of harmonized methods and interoperable approaches that would enable such sharing. This barrier is one of the main motives for the formation of the Global Alliance for Genomics and Health.”
I have interviewed David Haussler, director of the Center for Biomolecular Science & Engineering at the University of California, Santa Cruz. David is one of eight organizing committee members of the Global Alliance for Genomics and Health.
Q1. What is the Global Alliance for Genomics and Health?
David Haussler: The Global Alliance for Genomics and Health is a partnership of more than 180 of the world’s leading stakeholders working together to create a common framework of harmonized approaches to enable the responsible and effective sharing of genomic and clinical data. The Global Alliance is made of up of a diverse, international group of organizations working in healthcare, biomedical research, disease and patient advocacy, life science, and information technology, who come together with the goal of accelerating progress in medicine and human health.
Q2. What are the main objectives of the Data Working group?
David Haussler: The Data Working Group is focused on the interoperability and scalability of formats and interfaces for genomic information. The main near-term objective of the Data Working Group is to establish a role as the international coordinating body and frontrunner for organizing, developing and aligning the computer formats and application programming interfaces (APIs) used to represent and exchange genomic data on individuals.
This includes stewardship of existing file formats used to store genomic information (BAM and VCF files) and engaging the community in devising forward-looking data models and APIs for representing, submitting, exchanging, and querying genomic data.
Q3. What are the main challenges (technical and non-technical) in representing and exchanging genomic data on individuals?
David Haussler: A main challenge facing clinical and genomic data sharing efforts is the lack of harmonized methods and interoperable approaches that would enable such sharing. This barrier is one of the main motives for the formation of the Global Alliance for Genomics and Health.
Currently, the ad hoc use of different data formats and technologies in different systems, lack of alignment between approaches to ethics and national legislation across jurisdictions, and the challenges of devising secure systems for controlled sharing of data puts the world on track to create Balkanized data sets and not be able to learn from aggregated information.
It is the hope of the Global Alliance that by addressing these technical, regulatory and other barriers at the outset, we will reverse the current course and enable medical progress through large-scale data aggregation and analysis.
Q4. What do you mean with “responsible” data sharing?
David Haussler: The meaning of responsible data sharing comes down to respect for the privacy and the data sharing preferences of participants. One of the core missions of the Global Alliance is to promote the highest standards for ethics and ensure that participants have a choice to securely share their genomic and clinical data as much as they want to, including not at all.
Aligning with this mission, two of the four initial Working Groups are focused on aspects of this responsible sharing: the Security Working Group and the Regulatory and Ethics Working Group.
The Regulatory and Ethics Working Group is in the process of drafting an International Code of Conduct, which will support the establishment of a set of ethical principles and practices for research seeking to share genomic and clinical data. The Security Working Group aims to support a technology environment that provides assurance to patients, researchers, clinicians, and other stakeholders that data are shared, annotated, and interpreted only by those with appropriate authorization to do so. All work done by the Global Alliance, including in the Data Working Group, is closely tied to ensure that any data sharing is done in a manner that respects privacy and security, while still retaining essential attributes to enable effective analysis.
Q5. What are the plans of the Data Working group to overcome such challenges?
David Haussler: Initially, the Data Working Group will take a role in overseeing the current BAM, CRAM, and VCF format standards to provide a governance and support structure for these efforts.
In the near-term, we will work with the international community to develop formal data models, APIs, and reference implementations of those APIs for representing, submitting, exchanging, querying, and analyzing genomic data in a scalable and potentially distributed fashion. This work will be consistent with the security model developed by the Alliance’s Security Working Ground, the clinical data framework developed by the Alliance’s Clinical Working Group, and the International Code of Conduct developed by the Alliance’s Regulatory and Ethics Working Group.
The Data Working Group in conjunction with partner organisations has also contributed to the startup of a project known as “Beacon”, which fosters the development of ‘beacons’: any institution or site that provides a simple yes or no in response to query regarding the presence of a specific human genetic variant in their genetic data. This open web service is designed both to be technically simple, so that it is easy to implement, and to not return information that could be construed as violating anyone’s privacy, so that it is available as a public, unrestricted web resource.
Q6. Why new APIs are needed? and what are the key areas in which these new APIs will be used?
David Haussler: We need to switch from file formats to APIs so that new architectures can be employed for storage and access to genomics data as we scale to thousands and eventually millions of genomes. APIs allow third parties to write code with standardized methods for utilization of genomic data that do not require download or parsing of large files and that are broadly compatible across many institutional systems.
Specifically, APIs are needed for and will be used in these four key areas:
Reference variants. This API represents a reference genome structure consisting of typical human chromosomal DNA sequences and well-established human polymorphisms including larger structural variations. It defines mechanisms for mapping other information to the reference, including individual genomes, RNA data, and annotation. It should support mapping of DNA or RNA reads from a BAM file or equivalent, individual genome variants as described in a VCF file or equivalent, and various types of reference genome annotation as found in a genome browser or in one of the existing human genetic variation databases.
Read data. This API represents collections of primary data collected from sequencing machines, covering functions currently supported by FASTQ and BAM file formats, and including a query interface over groups of samples. It addresses issues of efficient interaction with large databases, the relationship of reads to a reference genome, lossy or loss-free data compression, and error correction.
Expression, methylation, and other epigenetic data. It is also necessary to have APIs that represent gene expression and the epigenetic state of the DNA or chromatin in a tissue sample. We plan to build an API specifically for gene expression, and establish a framework in which other external groups can create APIs for other types of epigenetic and functional data. These APIs will interact with the reference variant and read data APIs.
Metadata. Metadata is general information about a sample, such as tissue type, including how, when, and where information was extracted from that sample, such as the name of the sequencing center. We intend that there be a single sample metadata schema that is shared by all data models, used universally across expert working groups so that there is maximum compatibility. Optional fields will allow customization as necessary so that it does not force the specification of too much information for any given API or project.
Q7. How do you plan to create a “shared” data representation, storage, and analysis of genomic data?
David Haussler: The Global Alliance intends to enable the sharing of genomic and clinical data, but will not itself store, analyze, or interpret data. By undertaking work such as the development of APIs for representing, submitting, exchanging, and querying genomic data, we seek to create a common framework of interoperable approaches, lifting up best practices and creating new methods where none exist, that will enable more effective, responsible sharing of genomic and clinical data and facilitate large-scale research by entities throughout the world. The Global Alliance also seeks to catalyze data sharing projects that drive and demonstrate the value of data sharing, and to convene stakeholders from different sectors and localities to share information, establish best practices, and enable interoperability across the broadest possible group.
Q8. Why InterSystems joined the Global Alliance for Genomics and Health and what will be their contribution?
David Haussler: To answer this, I point you towards the statement of Paul Grabscheid, Vice President of Strategy comments at Intersystems, when the company joined the Global Alliance: http://www.intersystems.com/who-we-are/newsroom/news-item/intersystems-joins-global-alliance-for-genomics-and-health/.
On membership generally, since the Global Alliance’s initial formation with 70 partners in June of 2013, the group has brought on many more highly esteemed research and health institutions with broadened international representation, including partners from over 40 leading life science and information technology companies, world leaders in cloud computing, biotechnology, and healthcare generally, and additional respected disease and patient advocacy groups.
Q9. What are the progress and deliverables so far?
David Haussler: The Data Working Group has formed its first four task teams:
(1) File Formats Task Team,
(2) Reference Variation Task Team,
(3) Read Store Task Team, and
(4) Metadata Task Team.
Work from each Task Team is addressed below and is available at https://github.com/ga4gh unless otherwise noted:
File Formats. The developers of the current VCF, BAM, and CRAM file formats have been engaged in a File Formats Task Team led by Ewan Birney of the EBI to govern, maintain and extend these formats. A pre-existing official specification and software development site has been endorsed at https://github.com/samtools/hts-specs and will be used by the Task Team to address suggestions from the developer community for file format modifications.
Reference Variation. The Reference Variation Task Team, co-led by Gil McVean of Oxford University and Benedict Paten of UC Santa Cruz, held its organizing meeting in Hinxton, UK on March 3, 2014.
The team aims to compare existing reference structures such as the GRC reference genome and the dbSNP database alongside newer graph-based approaches, with the near-term goal of delivering one or more new or enhanced reference structures with pilot implementations.
Read Store. The Read Task Team, led by Dave Patterson of UC Berkeley, involves members from various companies, government, and academic institutions. It has compared in detail APIs from NCBI, EBI, Google, SMART/HL7 FHIR, and UC Berkeley, has established a publicly readable mailing list with discussions, designed and released an initial v0.1 API, and is currently working on the v0.5 API. All work is open and issues/comments may be raised by any members of the public through mechanisms provided by the GitHub open source software development environment.
Metadata. The Metadata Task Team, is led by Helen Parkinson of EBI and Tanya Barrett of NCBI, which held its organizing meeting April 17, 2014. In the near term, the team aims to create a single sample metadata schema that is shared by all data models, used universally across expert working groups so that there is maximum compatibility.
In addition to these task teams, to develop APIs in the context of major ongoing research projects, the Data Working Group currently interacts with three projects from outside the Global Alliance: the ICGC/TCGA Pan-Cancer Whole Genome Analysis project, Matchmaker Exchange, and Beacon.
Q10. What is the Beacon project, and how does it relate to your work at the Global Alliance for Genomics and Health?
David Haussler: In order to root the activities of the Global Alliance in real-world problems and to demonstrate the value of interoperable approaches to data sharing, the Alliance supports specific projects, of which the Beacon project is one. Ongoing engagement between these projects and Working Groups is intended to encourage a focus on the needs of projects currently advancing science and medicine, and crosscutting engagement of the Working Groups with one another and with stakeholders in the community.
The Beacon project, led by Jim Ostell of the NCBI, is a project that was created in order to test the willingness of international sites to share genetic data in the simplest of all technical contexts. It is defined as a simple public web service that any institution can implement as a service.
A site offering this service is called a “beacon.” This open web service is designed both to be technically simple (so that it is easy to implement) and to not return information that could be construed as violating anyone’s privacy (so that there is no good excuse for not implementing it as a public, unrestricted web resource).
A goal of the Data Working Group is to foster the development of more than a dozen independent “beacons” in the near-term, and in collaboration with the other Alliance Working Groups, to gain initial direct experience with the barriers to international genetic data sharing through this Beacon project.
There are currently 4 beacons running at the following locations: UC Berkeley (http://beacon.eecs.berkeley.edu), NCBI (http://www.ncbi.nlm.nih.gov/projects/genome/beacon/), UC Santa Cruz (http://hgwdev-max.cse.ucsc.edu/cgi-bin/beacon/query), and EMBL-EBI (http://www.ebi.ac.uk/eva/beacon).
This is still very much early days. Both the interface and the rules for engagement with beacons are rapidly evolving.
David Haussler’s research lies at the interface of mathematics, computer science, and molecular biology. He develops new statistical and algorithmic methods to explore the molecular function and evolution of the human genome, integrating cross-species comparative and high-throughput genomics data to study gene structure, function, and regulation. He is credited with pioneering the use of hidden Markov models (HMMs), stochastic context-free grammars, and the discriminative kernel method for analyzing DNA, RNA, and protein sequences. He was the first to apply the latter methods to the genome-wide search for gene expression biomarkers in cancer, now a major effort of his laboratory.
As a collaborator on the international Human Genome Project, his team posted the first publicly available computational assembly of the human genome sequence on the Internet on July 7, 2000. Following this, his team developed the UCSC Genome Browser, a web-based tool that is used extensively in biomedical research and serves as the platform for several large-scale genomics projects, including NHGRI’s ENCODE project to use omics methods to explore the function of every base in the human genome, NIH’s Mammalian Gene Collection, NHGRI’s 1000 genomes project to explore human genetic variation, and NCI’s Cancer Genome Atlas (TCGA) project to explore the genomic changes in cancer.
His group’s informatics work on cancer genomics, including the UCSC Cancer Genomics Browser, provides a complete analysis pipeline from raw DNA reads through the detection and interpretation of mutations and altered gene expression in tumor samples. His group collaborates with researchers at medical centers nationally, including members of the Stand Up To Cancer “Dream Teams” and the Cancer Genome Atlas, to discover molecular causes of cancer and pioneer a new personalized, genomics-based approach to cancer treatment.
The UCSC Cancer Genomics Hub (CGHub), a product of the Haussler lab, is a secure repository for storing, cataloging, and accessing cancer genome sequences, alignments, and mutation information for 25 cancer types from TCGA, the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) project, and other related projects. The current planned capacity of this data center is five petabytes. The CGHub will serve as a platform to aggregate other large-scale cancer genomics information, growing to provide the statistical power to attack the complexity of cancer.
He co-founded the Genome 10K Project to assemble a genomic zoo—a collection of DNA sequences representing the genomes of 10,000 vertebrate species—to capture genetic diversity as a resource for the life sciences and for worldwide conservation efforts.
Haussler is an organizing member of the Global Alliance for Genomics and Health, through which research, health care, and disease advocacy organizations that have taken the first steps to standardize and enable secure sharing of genomic and clinical data.
Haussler received his PhD in computer science from the University of Colorado at Boulder. He is a member of the National Academy of Sciences and the American Academy of Arts and Sciences and a fellow of AAAS and AAAI. He has won a number of awards, including the 2011 Weldon Memorial Prize from University of Oxford, the 2009 ASHG Curt Stern Award in Human Genetics, the 2008 Senior Scientist Accomplishment Award from the International Society for Computational Biology, the 2005 Dickson Prize for Science from Carnegie Mellon University, and the 2003 ACM/AAAI Allen Newell Award in Artificial Intelligence.
Follow ODBMS.org on Twitter: @odbmsorg
“Analysis of big data can identify the subtle differences that explain why similar-seeming patients have different outcomes, and predictive decision support can help physicians guide more patients on the path to recovery”–Steve Nathan.
Why using predictive analytics in healthcare? On this topic, I have interviewed Steve Nathan, CEO at Amara Health Analytics.
Q1. Amara provides real-time predictive analytics to support clinicians in the early detection of critical disease states. Can you tell us a bit more of what are these predictive analytics and which data sets do you use?
Steve Nathan: Our system runs on hospital data, including labs, pharmacy, real-time vitals, ADT, EMR, and clinical narrative text. We have developed domain-specific natural language processing (NLP) and machine learning to find predictive signal that is beyond the reach of traditional approaches to clinical analytics.
Q2. How large are the data sets you analyse?
Steve Nathan: For an average 500 bed hospital we analyse about 100 million data items in real-time annually. We expect this volume of data to increase dramatically as more centers deploy hospital-wide automated real-time streaming of data from patient monitors into the EMR.
Q3. Why early detection is important?
Steve Nathan: Let’s look at the example of sepsis, which is the body’s systemic toxic response to an infection. Clinicians know how to treat sepsis once identified, but it’s often difficult to identify early because it can mimic other conditions and there are a multitude of clinical variables involved, some of which are subjective in nature.
At the later stages of sepsis, mortality increases almost 8% with every hour of delayed treatment. So technology that can assist clinicians in early identification is important.
Q4. What is your back-end platform for real-time data processing and analytics?
Steve Nathan: The major components of the backend are Mirth (open source HL7 parsing/transform), a data aggregator, a real-time data streaming engine, Jess rules engine, the core analytics engine (including NLP pipeline), and Cassandra NoSQL database from DataStax.
Q5 What are the main technical challenges you encounter when you analyse big data sets that are composed of previous patient histories and medical records?
Steve Nathan: Input data comes from a wide variety of centers and is wildly heterogeneous. Issues range from proprietary health IT systems and incompatible usage of available standards like HL7, to wide variations in clinical language used by physicians and nurses in their notes, to varying patterns of diagnostic testing by different providers. And of course every patient is unique. So much of our system is dedicated to producing a standardized timeline for each patient in which every variable has a consistent interpretation, no matter where it originally came from. We then use these patient timelines as input for machine learning, and there are some unique challenges there in developing predictive models that are appropriate for use in real-time decision support.
Q6. Why did you choose a NoSQL database for this task?
Steve Nathan: The primary motivation was flexibility of data representation. We wanted to be able to change schemas dynamically. Also, DataStax’s integration of Cassandra with Solr was very important for us because we do a lot text mining as part of our overall approach.
Q7. What are the lessons learned so far?
Steve Nathan: Processing huge amounts of historical patient data in simulations to refine predictive models is very challenging to do with good performance. We must be able to run many years worth of archived data through the system in a small number of hours, so the system architecture must be designed with this processing mode in mind.
It is important to consider early on how to upgrade the live system with no down time while avoiding the possibility of losing any data.
Q8. What about data protection issues?
Steve Nathan: We comply with HIPAA regulations and go to great lengths to protect data. This includes work processes and policies for all employees, contractors, and business associates. And, importantly, it includes the architecture and deployment of our systems – for example all data is transferred via VPN, and every VPN connection is terminated in a VM that contains a single protected dataset, rather than terminating at a DC router.
Q9. Which relationships exist between Amara and UC San Diego?
Steve Nathan: There is no formal relationship between Amara and UCSD. Amara’s co-founder and chairman, Dr. Ramamohan Paturi, is a professor of Computer Science at UCSD.
Qx. Anything else you wish to add?
Steve Nathan: Analysis of big data can identify the subtle differences that explain why similar-seeming patients have different outcomes, and predictive decision support can help physicians guide more patients on the path to recovery.
Steve Nathan has over 25 years in the enterprise software industry. As CEO of Parity Computing beginning in 2008, Steve led the company’s product expansion with a pioneering analytics platform for enhancing biomedical research productivity. He then led the formation of Amara Health Analytics, the concept and strategy for the Clinical Vigilance™ product line, and the spinout of Amara as an independent company.
Steve has held key leadership positions at recognized technology innovators Sun Microsystems and Cray Research, as well as at start-ups Celerity Computing, Alignent Software, and Exist Global. As General Manager at Sun Microsystems, he had P&L responsibility for messaging, portal, and web infrastructure in the iPlanet business unit, where he grew the annual revenue of this product line from $15M to $150M in three years.
Steve holds B.S. degrees from the University of California at Riverside in Computer Science, Mathematics, and Psychology; and he is a 1999 graduate of the Stanford University Business Executive Program.
Follow ODBMS.org and ODBMS Industry Watch on Twitter:
“It’s unlikely that a big data benchmark will gain wide recognition until a clear “playing field” has emerged and focused the competitive pressure.” –Francois Raab
On the topic of constructing big data benchmarks I have interviewed Francois Raab and Yanpei Chen. Francois is the original author of the TPC-C Benchmark. He is currently the President of InfoSizing, Inc.
Yanpei is a member of the Performance Engineering Team at Cloudera.
Q1.There have been a number of attempts at constructing big data benchmarks. None of them has yet gained wide recognition and usage. Why?
Yanpei: Many big data benchmarks are just like big data systems – new, and with room to improve and grow.
In more detail big data systems:
- rapidly evolve, so it’s important to define performance in ways that matter for end customers.
- consist of many interdependent components, so it’s difficult to measure performance in a reliable fashion.
- service diverse business needs using diverse implementations, so benchmarks need to accommodate different system implementations.
Francois: It’s unlikely that a big data benchmark will gain wide recognition until a clear “playing field” has emerged and focused the competitive pressure. There are 3 phases in the evolution of a new technology. First, the technology is introduced and applied to a wide array of solutions without a proven return on investment. Next, a “killer app” emerges from the early adopters and its rapid growth draws all the vendors into competing on a common playing field. Lastly, some technologies emerge as clear winners in the race and the market start to consolidate around a few dominant vendors. Big data has not entered the second phase yet.
Q2. Is it possible to build a truly representative big data benchmark?
To me, the rise of “big data” in part comes from our increased ability to instrument, measure, and ultimately derive value from large scale systems – technology systems, financial systems, medical systems, or physical systems touching day-to-day life. Big data systems, as a special case of technology systems, also deal with ever increasing instrumentation and measurement. Over time, I am absolutely confident that we will increase our understanding of big data systems, and with it, improve the quality of our big data benchmarks.
Cloudera’s broad customers base gives us visibility into big data deployments across telecom, banking, retail, manufacturing, media, government, healthcare, and many other industry sectors. We’re in a great position to identify representative use cases.
Francois: A benchmark is a somewhat abstract (i.e. simplified) model of a real life scenario. The question we face today is to identify a scenario that Fortune 500 companies would widely recognize as relevant to their operations and vital to their competitive survival. Once that critical mass has been reached it will quickly spread to the entire commercial data processing landscape and a successful big data benchmark will be built based on that scenario.
Q3. How would you define a Big Data Benchmark ?
Yanpei: The key properties of good big data benchmarks are a re-cast of the same properties for benchmarks of more established systems.
A good big data benchmark should be representative of real-life use cases; it should generate performance insights immediately relevant to diverse and evolving big data use cases. The benchmark should also be scalable; it should stress big data systems today, as well as the vastly improved systems in the future. The benchmark should be portable, meaning it should accommodate systems with different implementations that achieve the same end-goal. The benchmark should also be verifiable, in that the results can be checked by independent auditors if needed, and end-users can reproduce on their own systems the winning configurations and result.
Q4. Can you give some examples of Successful Benchmarks ?
Francois: The success of a benchmark can be measured by its number of published results and by its longevity over shifts in the underlying technologies. By that measure TPC-C and TPC-H are leading the field. While it can be argued that they have lost relevance over their two decades lifetime, they still encapsulate critical elements at the core of the application domains they represent (transaction processing and decision support).
Q5. One of the main purposes of a benchmark is to evaluate and contrast the merits of various implementations of the same set of requirements. How do you do this with Big Data?
Yanpei: You construct benchmarks that are portable. In other words, you specify implementation-independent requirements.
Best illustrated by example – TPC-C. TPC-C specifies five operations – New Order, Payment, Delivery, Order-Status, and Stock-Level. It also describes the interdependencies between these operations. For example, every New Order will be accompanied by Payment, but only one in ten New Orders will trigger an Order Status. TPC-C describes the load that the system under test should handle – many concurrent operations arriving in randomized order with randomized inter-arrival time, but at controlled relative frequencies. TPC-C also specifies the initial content of all the datasets, as well as how the content grows over the execution of the benchmark. This is an implementation-independent set of requirements – “handle these operations on these data sets.” The underlying system could be a relational database, or a key-value store like HBase.
Francois: Benchmarks can be defined one of two way: by creating a kit to be deployed on technology specific platforms or by specifying a set of technology agnostic requirements to be implemented at will. Because big data has first emerged from the MapReduce paradigm, we have seen a number of technology centric benchmarks (also called component benchmarks) that put a narrow focus on one or more components of a predefined solution. But we should soon expect to see a big data application emerge as the new must-have in commercial data centers.
Q6. In a recent position paper you argued for building future big data benchmarks using what you call a “functional workload model”. What is it?
Francois: We introduced a couple of terms in that position paper to highlight the core concepts underlying representative, scalable, portable, and verifiable big data benchmarks.
The “functional workload model” is a way to specify such benchmarks. It contains three things – the “functions of abstraction”, the load pattern serviced by the system, and the data sets being acted upon.
“Functions of abstraction” describes “what is being computed” without specifying “how the computation should be done.”
The intent is an abstract, functional description that allows the benchmark to be portable across systems of different compute paradigms. “What is being computed” should be justified by empirical evidence, either system traces or industry-wide surveys, with emphasis on identifying the common computation goals.
The load pattern describes “what is the serviced load” without specifying “how it is serviced.” It outlines the execution frequency, distribution, arrival rate, bursts and averages over time of each individual function of abstraction.
The data sets describe “what is the data and the relationships within the data” without specifying “how it is represented.” It is in terms of the structure and interdependence between data elements, initial size and contents, how it evolves over the course of the workload execution, and how it is expected to scale with the system size and load volume.
These concepts help us routinely identify shortcomings in haphazardly specified benchmarks. For example, some of the most often-cited big data benchmarks contain artificial functions of abstraction that do not match any common use cases.
Or, a multi-job, multi-query load pattern is missing altogether, or the data sets are represented in unrealistic formats that inflate performance advantages.
Q7. Why did you select TPC-C as a starting point for your work?
Yanpei: Because TPC-C already has a functional workload model within its specification. And because Francois wrote TPC-C.
Francois: The functional workload model is the underlying structure on which TPC-C was built. Subsequent TPC benchmarks, like TPC-H and TPC-E, were also built based on a functional workload model.
Q8. How does your functional workload model compares with TPC-C ?
Yanpei: TPC-C already uses the functional workload concept.
Q9. For your functions of abstractions concept to be useful, it must be applicable to different types of big data systems. Two important examples are relational databases and MapReduce. How do you do that? How does your work compare with other MapReduce-Specific Benchmarks ?
Yanpei: Best illustrated by example.
Suppose we discover that sorting data is a common operation in real-life production use cases. We would then define “sort” as a function of abstraction. We would define it in the same fashion as the official Sort Benchmark – the input data is of size X, format Y, and the system is asked to produce output sorted by order Z.
A relational database implementation could do, say, “insert into TABLE … ” followed by “select * from TABLE ordered by COLUMN”. A MapReduce implementation would use the IdentityMapper and IdentityReducer, and rely on the implicit shuffle-sort in MapReduce.
This is obvious for sort, because the sort operation has traditionally been defined in a system-independent way.
In contrast, many of the existing MapReduce and relational database performance measurement tools are specified in ways that do not translate across different types of systems. The many SQL-on-Hadoop systems are fast removing the boundary.
The functions of abstraction concept allows us to understand use-case at a level above than any SQL-only or Hadoop-only specifications.
Q10. What are in your opinion the Emerging Big Data Application Domains?
Francois: Everyone wants to figure out which application domain will become the big data killer app. Today, no commercial data center can live without on-line transaction processing or without decision support systems.
Which big data application will become indispensable tomorrow? That is the million dollar question! Once we know that, a standard big data benchmark will soon follow.
Yanpei: The maturation of the Hadoop platform has been relentless. Its role has changed as the platform has gotten more secure, more reliable, more powerful, and (especially) more real-time. It’s no longer a system used for just big batch jobs. Instead, it has become the first place that data lands. It scales and it can store anything – no data need be discarded. It’s used to pre-process data before delivering it to an enterprise data warehouse, a document repository, an analytic engine, a CRM or ERP application, or other specialized system. Most significantly, it has begun to take over some of the work previously done by those traditional platforms, because it can do real-time search and analysis on the data directly, in place, and without further Extract-Transform-Load (ETL).
This leads to the emergence of the enterprise data hub (EDH), a new architecture to complement existing investments and help put data at the center of an organisation’s business. An enterprise data hub allows storage of any amount and type of data, for as long as is needed, and accessible in any way needed.
Additional necessary attributes of EDHs include: It’s Secure and Compliant, offering perimeter security and encryption, plus fine-grained (row and column-level), role-based access controls over data, just like a data warehouse. It’s Governed, enabling users to do data discovery, data auditing, and data lineage, thus understanding what data is in their EDH and how the data are used.
It’s Unified and Manageable, providing native high-availability, fault-tolerance, self-healing storage, automated replication, and disaster recovery, as well as advanced workload management capabilities to enable multiple speciacialist systems to analyze the same data set. And it’s Open, ensuring that customers are not locked in to any particular vendor’s license agreement, that you can choose what tools to use with your EDH, and nobody can hold your data or applications hostage.
The emergence of EDHs pose both challenges and opportunities for defining big data benchmarks. As Francois alluded to, the representative scenarios typically involve application domains whose performance has traditionally been measured separately, such as the case for on-line transaction processing and decision support systems. How to define and measure performance for such concurrent application domains present both a challenge and an opportunity.
Further, to compare different EDHs, it becomes necessary to quantify characteristics that are previously yes/no checks – which is the more secure EDH? the better governed? the more unified and more manageable? the more open? How to quantify such characteristics will stretch our performance thinking and measurement methodology into new territory.
Q11. Future Work ?
Yanpei: We have a strong Performance Engineering Team at Cloudera. We insist on systematic, fair, and repeatable tests both for our internal performance assessment and competitive studies. We are also engaged with community efforts to define big data benchmarks. Look for our future posts on the Cloudera Developer Blog!
Francois Raab is a recognized, award winning expert in the field of performance engineering, benchmark design and system testing. He is the original author of the TPC-C Benchmark, the most successful industry standard measure of OLTP performance. He was also co-author of “The Benchmark Handbook” (pub. Morgan Kaufmann). Francois is accredited as a Certified Benchmark Auditor by the Transaction Processing Performance Council. His consulting services are retained by most major system vendors as well as Fortune-500 IT organizations. With over 30 years of experience in the field of databases and commercial data processing, Francois is a leading member of the performance measurement, system sizing and technology evaluation community. He is currently the President of InfoSizing, Inc.
Yanpei Chen is a member of the Performance Engineering Team at Cloudera, where he works on internal and competitive performance measurement and optimization. His work touches upon multiple interconnected computation frameworks, including Cloudera Search, Cloudera Impala, Apache Hadoop, Apache HBase, and Apache Hive. He is the lead author of the Statistical Workload Injector for MapReduce (SWIM), an open source tool that allows someone to synthesize and replay MapReduce production workloads. SWIM has become a standard MapReduce performance measurement tool used to certify many Cloudera partners. He received his doctorate at the UC Berkeley AMP Lab, where he worked on performance-driven, large-scale system design and evaluation.
-New (August 18, 2014): TPCx-HS: First Vendor-Neutral, Industry Standard Big Data Benchmark.
The Transaction Processing Performance Council (TPC) announced the immediate availability of TPCx-HS, developed to provide verifiable performance, price/performance, availability, and optional energy consumption metrics of big data systems.
- The Fifth Workshop on Big Data Benchmarking (5th WBDB) August 5-6, 2014, Potsdam, Germany: Program and Videos of all talks.
Follow ODBMS.org and ODBMS Industry Watch on Twitter: @odbmsorg
“The problem we had was that, if we tried to store the XML in traditional relational databases, we were losing the structure of our articles or having to redesign them.”–David Leeming
On making information accessible, I have interviewed David Leeming, Solutions Manager at the Royal Society of Chemistry. The Royal Society of Chemistry is the world’s leading chemistry community, advancing excellence in the chemical sciences, with 49,000 members worldwide.
Q1. What is the main business of The Royal Society of Chemistry?
David Leeming: The Royal Society of Chemistry is the world’s leading chemistry community, advancing excellence in the chemical sciences. With over 49,000 members and a knowledge business that spans the globe, we are the UK’s professional body for chemical scientists; a not-for-profit organisation with 170 years of history and an international vision of the future. We promote, support and celebrate chemistry. We work to shape the future of the chemical sciences – for the benefit of science and humanity.
Q2. What is your role at The Royal Society of Chemistry? And what are your current projects?
David Leeming: I manage the solutions team within the Royal Society of Chemistry’s recently formed Strategic Innovation Group, whose goal is to develop and deliver digital products and platforms that support and enhance our strategy and goals. As well as delivering platforms such as our Publishing Platform and our platform for teaching and learning resources, Learn Chemistry, my team also trials new technologies to scope what the organisation could be doing next in digital delivery to connect the chemistry community with high-quality research and chemical data.
Q3. Why did you digitize and convert over 1 million pages of written word and chemical formulae into XML?
David Leeming: Our aim is to connect the world with the chemical sciences, so we need to be able to make the chemistry research that we publish available and accessible to researchers all over the world. With the age of digital technology and demands from researchers to get their hands on data in new formats, rather than the traditional hard copy journals, we undertook a project to digitise our vast library of scientific data, news and literature and make it available online. By getting the XML for our articles and data, we are able to improve the discoverability and functionality of the platforms we use to deliver our content, by using the content structure. We are also able to hold only one copy of an article and transform the XML into other format layouts without having to hold multiple copies of the article.
Q4. What kind of new products and services were you expecting to develop when you converted the data in XML?
David Leeming: We can enhance and semantically enrich our content and analyse it to discover niche domains or areas of chemistry research that might benefit researchers who may not normally think to look in our journals.
For example, since launch in 2010 we have launched a number of new journals in environmental science.
On Learn Chemistry we have been able to produce mini-sites for specific initiatives such as a collection of Chemistry of Sport education resources that coincided with the 2012 Olympic Games.
Q5. What technical challenges did you encounter when trying to store and manage these XML files?
David Leeming: Before we discovered MarkLogic the problem we had was that, if we tried to store the XML in traditional relational databases, we were losing the structure of our articles or having to redesign them.
The XML was basically stored in flat files, and we needed to build indexes from this. Keeping this up to date and versioned was difficult. Since we’ve been using MarkLogic, we have been able to store our content in its original structure within the database, giving us full version control and adding search capabilities. The initial technical difficulties we faced when switching over to MarkLogic were finding experienced developers and getting traditional database administrators to think a bit differently about the way developers use the database and not to be concerned about traditional data modelling.
Q6. Specifically, how do you manage the logical associations between different types of XML content? And how do you query(update) them?
David Leeming: We have two ways of separating content. One is by product – this was our original method. We stored the XML documents for each product in their own database. We have now changed this method to store by content type. So, for instance, a journal article is stored together with other journal articles, book chapters are stored together, and so on. We keep the original XML schema and different documents have different schemas, but on loading the data we add a ‘meta’ record that is common across all data objects. This holds identifiers and describes what each data object is. This aids in the searching and querying we perform on the data to ensure the correct content is accessed and updated.
Q7. What results did you obtain in using a NoSQL database for this task?
David Leeming: Using the NoSQL database meant that we didn’t need to spend 6 months or so building a data model that doesn’t really fit the structure of the documents we currently have or may have in the future. Each time we get a different XML document, there’s no need to change the underlying data model, so it is very easy to implement.
Q8. What are your plans ahead?
David Leeming: Going forward we are starting to add greater semantic enrichment to data sets to provide better discoverability across chemistry data that will enable increased collaboration among researchers.
David Leeming is the Royal Society of Chemistry’s Solutions Manager. He is an experienced leader, project manager and business analyst in the digital publishing business, with expertise in building innovative platforms and solutions. He works with a number of technologies, with specific interest in online publishing, XML, sprint methodologies and MarkLogic. He manages teams of highly skilled engineers, analysts and an outsourced development unit based in India.
Follow ODBMS.org and ODBMS Industry Watch on Twitter:
“My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.”–Jules J. Berman.
I have interviewed Jules J. Berman, former President of the Association for Pathology Informatics. The focus of the interview is on how to manage Big Data.
Q1. In your experience what are the common mistakes that endanger most Big Data projects?
Jules J. Berman: Overconfidence is the biggest culprit. The creators of Big Data resources like to believe that they have collected all the data relevant to their domain, that all of the data is accurate, and that the data is organized in a manner that supports meaningful data searches. The Big Data analysts like to believe that their results and conclusions are correct. Hah!
Q2. How do you organize large volumes of complex data? Any insights you could give us on this?
Jules J. Berman: Large volumes of Big Data are organized the same way that humans organize the large volumes of complex data held in their brains: through classification. We could not cope with all the sensory input we receive each day if we did not bin visual objects into categories.
There is a science to constructing classifications, and if the science is misapplied, then the complex data objects held in a Big Data resource cannot be sensibly retrieved, or collected with objects to which they are logically related. Novices to the field make two common errors: confusing properties with classes (e.g., creating red-colored objects as a new class), or assigning a part of an object as a subclass (e.g., making “legs” a subclass of “person”). Just like any other science, the science of classification must be studied, practiced, and mastered.
Q3. You have been working on data permanence: what does it mean in practice? How can it be achieved when the content of the data is constantly changing?
Jules J. Berman: Everyone knows the slogan from Orwell’s masterpiece, 1984: “Big Brother is watching you”. If you’ve read the book, you’ll remember that there was another major theme; one that involved data mutability. The minions of Big Brother were constantly fiddling with collected data to distort reality. Because Big Brother held all the data, Big Brother could create perceptions of reality that suited the totalitarian state.
I see the problem of data mutability (i.e., the ability to modify, delete, or fabricate data) as being much more important than issues related to over-surveillance. In hospitals, the regrettable act of “retro-noting” (i.e., inserting patient notes out of sequence to cover omissions, or to justify billing, or to eradicate errors), is an example of data mutability.
The solution involves employing time stamps and metadata, and procedures that block data erasures. Data mutability, and the related topic of missing legacy data, are two of my favorite issues, and they are both covered in my book, “Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.”
Q4. Data verification: what are the challenges?
Jules J. Berman: The biggest challenge involves getting data analysts to take the topic of data verification seriously. I personally know data scientists who have the attitude that data verification is “not my job.” They believe that they have no control over the data; their job is to do the best they can with the data they receive.
I think that we really must get everybody on board with the idea that data needs to be verified. The task of creating verified sets of data is child’s play compared with the professional issues instigated by recalcitrant data scientists.
Q5. Data validation: what are the challenges?
Jules J. Berman: There are many ways of thinking about validation, but my perception is that most people in the field are approaching validation as a post-analytic process, wherein old conclusions are tested on new data, or tested on alternate data sources, or are re-calculated on a regular basis. The validation process is aimed at determining whether what seems true for me today will be true for you and me, today and tomorrow.
Like anything else in Big Data, it requires work and vigilance, and a delay in gratification.
Q6. Are there any general methods for data verification and validation that can be specifically applied to Big Data resources?
Jules J. Berman: There’s a large literature out there on this subject. In my opinion, the methods are not as important as the documentation. Protocols must be written, actions must be recommended, and steps must be taken to implement corrections. If you’re serious about Big Data, you must be serious about documenting everything: how you found errors, what you did to correct the errors, what you did to make sure that future errors of the same kind will not occur, what you did to monitor the occurrence of future errors of the same type. It never seems to end, but it’s just part of the job.
Q7. How would you find relationships among data objects held in disparate Big Data resources: Could you give us some examples?
Jules J. Berman: In my book Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. published in May, I give real-world examples for all of the points raised in this interview, but my favorite “reach” into disparate data involves some inventive research into the sinking of the Titanic.
Here’s an excerpt from my book: “A recent headline story explains how century old tidal data plausibly explained the appearance the iceberg that sank the titanic, on April 15, 1912. Records show that several months earlier, in 1912, the moon earth and sun aligned to produce a strong tidal pull, and this happened when the moon was the closest to the earth in 1,400 years. The resulting tidal surge was sufficient to break the January Labrador ice sheet, sending an unusual number of icebergs towards the open North Atlantic waters.
The Labrador icebergs arrived in the commercial shipping lanes four months later, in time for a fateful rendezvous with the Titanic. Back in January 1912, when tidal measurements were being collected, nobody foresaw that the data would be examined a century later.”
Of course, the finest tool for finding relationships among data objects held in disparate Big Data resources is the human brain. Good data analysts spend lots of time surveying the data held in various resources. When you spend the time, the inspirational moments will come, and you will begin to synthesize new relationships among data from different knowledge domains. Typically, analysis follows inspiration; not vice versa.
Q8. Data integration: how can data be extracted and integrated with data from other resources?
Jules J. Berman: Of course, standards, specifications, and metadata play an important role.
The Holy Grail in the Big Data field involves finding and implementing standard methods for organizing and tagging data, so that every piece of data held on any computer, can be linked and combined into a virtual Super-Big Data resource.
On a less grand scale, it’s always nice when workers in a common field collect their data in a standard form.
In most cases, I’ve been favoring specifications over standards. Data standards seldom, if ever “fit” your data correctly, are prone to re-versioning, often cost money, and usually come with a fine-print license that restricts how the standards are used and how your annotated data are distributed. Specifications are recommendations for describing data; RDF is a good example. Specifications provide the flexibility required for complex data, but the structure required for data integration.
A smart data manager can do a lot more with a specification than with a standard.
Q9. What about Big Data sharing?
Jules J. Berman: Data sharing is absolutely essential to the field of data science. If the data upon which your assertions are based is unavailable to the public, then why would anyone believe your results and conclusions?
In the Big Data realm, there are lots of things that can go wrong with a data analysis project. The chances that any new analysis is correct, on first pass, is slim-to-none. Everything must be repeated over and over, critiqued, and validated on fresh data.
My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.
Jules Berman received two baccalaureate degrees from MIT; in Mathematics, and in Earth and Planetary Sciences. He received the Ph.D. from Temple University, and the M.D. from the U. of Miami.
He received post-doctoral training at NIH and residency training at Geo. Washington U Med Ctr. He is board certified in anatomic pathology and in cytopathology. He served as Chief of Anatomic Pathology, Surgical Pathology and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and the Johns Hopkins Medical Institutions. In 1998, he became a Medical Officer at the U.S. National Cancer Institute and served as the Program Director for Pathology Informatics in the Institute’s Cancer Diagnosis Program. In 2006, Jules Berman was President of the Association for Pathology Informatics. In 2011 he received the Lifetime Achievement Award from the Association for Pathology Informatics. Today, Jules Berman is a free-lance writer. He has first-authored more than 100 articles and 11 book titles in science and medicine.
- Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, Jules J Berman, Ph.D., M.D. Paperback: 288 pages, Morgan Kaufmann; 1 edition (June 13, 2013), ISBN-10: 0124045766
Follow ODBMS.org on Twitter: @odbmsorg
“A true columnar store is not only about the way you store data, but the engine and the optimizations that are enabled by the column store”–Shilpa Lawande.
On the subject of column stores, and what are the important features in the new release of the HP Vertica Analytics Platform, I have interviewed Shilpa Lawande, VP Engineering & Customer Experience at HP Vertica.
Q1. Back in 2011 I did an interview with you  at the time Vertica was just acquired by HP. What is new in the current version of Vertica?
Shilpa Lawande: We’ve come a long way since 2011 and our innovation engine is going strong!
From “Bulldozer” to “Crane” and now “Dragline,” we’ve built on our columnar-compressed, MPP share-nothing core, expanded security and manageability, dramatically expanded data ingestion capabilities, and what’s most exciting is that we’ve added a host of advanced analytics functions and extensibility APIs to the HP Vertica Analytics Platform itself. One key innovation is our ability to ingest and auto-schematize semi-structured data using HP Vertica Flex Zone, which takes away much of the friction in the analytic life-cycle from exploration to production.
We’ve also grown a vibrant community of practitioners and an ecosystem of complementary tools, including Hadoop.
Dragline, our next release of the HP Vertica Analytics Platform addresses the needs of the most demanding, analytic-driven organizations by providing many new features, including:
- Project Maverick’s Live Aggregate Projections to speed up queries that rely on resource- intensive aggregate functions like SUM, MIN/MAX, and COUNT.
- Dynamic mixed workload management, which identifies and adapts to varying query complexities — simple and ad-hoc queries as well as long-running advanced queries — and dynamically assigns the appropriate amount of resources to ensure the needs of all data consumers
- HP Vertica Pulse, which helps organizations leverage an in-database sentiment analysis tool that scores short data posts, including social data, such as Twitter feeds or product reviews, to gauge the most popular topics of interest, analyze how sentiment changes over time and identify advocates and detractors.
- HP Vertica Place, which stores and analyzes geospatial data in real time, including locations, networks and regions.
- An expanded SQL-on-Hadoop offering that gives users freedom to pick their data formats and where to store it, including HDFS, but still benefit from the power of the Vertica analytic engine. OF course, there’s a lot more to the “Dragline” release, but these are the highlights.
Q2. Vertica is referred to as an analytics platform. How does it differentiate with respect to conventional relational database systems (RDBMSes)?
Shilpa Lawande: Good question. First, let me clear the misconception that column stores are not relational – Vertica is a relational database, an RDBMS – it speaks tables and columns and standard SQL and ODBC, and like your favorite RDBMS, talks to a variety of BI tools. Now, there are many variations in the database market from low-cost solutions that lack advanced analytics to high-end solutions that can’t handle big data.
HP Vertica is the only one purpose-built for big data analytics – most conventional RDBMS were purpose- built for OLTP and then retrofitted for analytics. Vertica’s core architecture with columnar storage, a columnar engine, aggressive use of data compression, our scale-out architecture, and, most importantly, our unique hybrid load architecture enables what we call real-time analytics, which gives us the edge over the competition.
You can keep loading your data throughout the day — not in batch at night — and you can query the data as it comes in, without any specialized indexes, materialized views, or other pre- processing. And we have a huge and ever-growing library of features and functions to explore and perform analytics on big data–both structured and semi-structured. All of these core capabilities add up to a powerful analytics platform–far beyond a conventional relational database.
Q3. Vertica is column-based. Could you please explain what are the main technological differences with respect to a conventional relational database system?
Shilpa Lawande: It’s about performance. A conventional RDBMS is bottlenecked with disk I/O.
The reason for this is that with a traditional database, data is stored on disks in a row-wise manner, so even if the query needs only a few columns, the entire row must be retrieved from disk. In analytic workloads, often there are hundreds of columns in the data and only a few are used in the query, so row-oriented databases simply don’t scale as the data sets get large.
Vendors who offer this type of database often require that you create indexes and materialized views to retrieve the relevant data in a reasonable about of time. With columnar storage, you store data for each column separately, so that you can grab just the columns you need to answer the query. This can speed query times immensely, where hour-long queries can happen in minutes or seconds. Furthermore, Vertica stores and processes the data sorted, which enables us to do all manner of interesting optimizations to queries that further boost performance.
Some of the traditional database vendors out there claim they now have columnar store, but a true columnar store is not only about the way you store data, but the engine and the optimizations that are enabled by the column store.
For instance, an optimization called late materialized allows Vertica to delay retrieval of columns as late as possible in query processing so that minimal I/O and data movement is done until absolutely necessary. Vertica is the only engine that is true columnar; everything else out there is a retrofit of a general purpose engine that can read some kind of a columnar format.
Q4. What is so special of Vertica data compression?
Shilpa Lawande: The capability of Vertica to store data in columns allows us to take advantage of the similar traits in data. This gives us not only a footprint reduction in the disk needed to store data, but also an I/O performance boost — compressed data takes a shorter time to load. But, even more importantly, we use various encoding techniques on the data itself that enable us to process the data without expanding it first.
We have over a dozen schemes for how we store the data to optimize its storage, retrieval, and processing.
Q5. Vertica is designed for massively parallel processing (MPP). What is it?
Shilpa Lawande: Vertica is a database designed to run on a cluster of industry-standard hardware.
There are no special- purpose hardware components. The database is based on a shared-nothing architecture, where many nodes each store part of the database and do part of the work in processing queries. We optimize the processing so much as to minimize data traffic over the network. We have built-in high availability to handle node failures. We also have a sophisticated elasticity mechanism that allows us to efficiently add and remove nodes from the cluster. This enables us to scale-out to very large data sizes and handle very large data problems. In other words, it is massively parallel processing!
Q6. In the past, columnar databases were said to be slow to load. Is it still true now?
Shilpa Lawande: This may have been true with older unsophisticated columnar databases. We have customers loading over 35 TB data / hour into Vertica, so I think we’ve put that one squarely to rest.
Q7. Who are the users ready to try column-based “data slicers”? And for what kind of use cases?
Shilpa Lawande: Vertica is a technology broadly applicable in many industries and in many business situations. Here are just a few of them.
Data Warehouse Modernization – the customer has some underperforming solution for data warehouse in place and they want to replace or augment their current analytics with a solution that will scale and deliver faster analytics at an overall lower TCO that requires substantially less hardware resources.
Hadoop Acceleration – the customer has bought into Hadoop for a data lake solution and would like a more expressive and faster SQL-on-Hadoop solution or an analytic platform that can offer real-time analytics for production use.
Predictive analytics – the customer has some kind of machine data, clickstream logs, call detail records, security event data, network performance data, etc. over long periods of time and they would like to get value out of this data via predictive analytics. Use-cases include website personalization, network performance optimization, security thread forensics, quality control, predictive maintenance, etc.
Q8. What are the typical indicators which are used to measure how well systems are running and analyzing data in the enterprise? In other words, how “good” is the value derived from analyzing (Big) Data?
Shilpa Lawande: There are many, many advantages and places to derive value from big data.
First, just having the ability to answer your daily analytics faster can be a huge boost for the organization. For example, we had one brick-and-mortar retailer who wanted to brief sales associates and managers daily on what the hottest selling products were, who had inventory and other store trends. With their legacy analytics system, they could not deliver analytics fast enough to have these analytics on hand. With Vertica, they now provide very detailed (and I might add graphically pleasing) analytics across all of their stores, right in the hands of the store manager via a tablet device. The analytics has boosted sales performance and efficiency across the chain. The user experience they get wouldn’t be possible without the speed of Vertica.
But what is most exciting to me is when Vertica is used to save lives and the environment. We have a client in the medical field who has used Vertica analytics to better detect infections in newborn infants by leveraging the data they have from the NICU. It’s difficult to detect infections in newborns because they don’t often run a fever, nor can they explain how they feel. The estimate is that this big data analytics has saved the lives of hundreds of newborn babies in the first year of use. Another example is the HP Earth Insights project, which used Vertica to create an early warning system to identify species threatened by destruction of tropical forests around the world.
This project done in cooperation with Conservation International is making an amazing difference to scientists and helping inform and influence policy decisions around our environment.
There are a LOT of great use cases like these coming out of the Vertica community.
Q9. What are the main technical challenges when analyzing data at speed?
Shilpa Lawande: In an analytics system, you tend to have a lot going on at the same time. There are data loads, both in batch and trickle loads. There is daily and regular analytics for generating daily reports. There may be data discovery where users are trying to find value in data. Of course, there are dashboards that executives rely upon to stay up to date. Finally, you may have ad-hoc queries that come in and try to take away resources. So perhaps the biggest challenge is dealing with all of these workloads and coming up with the most efficient way to manage it all.
We’ve invested a lot of resources in this area and the fruit of that labor is very much evident in the “Dragline” release.
Q10. Do you have some concrete example of use cases where HP Vertica is used to analyze data at speed?
Shilpa Lawande: Yes, we have many, see here.
Q11. How HP Vertica differs with respect to other analytical platforms offered by competitors such as IBM, Teradata, to in-memory databases such as SAP HANA?
Shilpa Lawande: Vertica offers everything that’s good about legacy data warehouse technologies like the ability to use your favorite visualization tools, standard SQL, and advanced analytic functionality.
In general, the legacy databases you mentioned are pretty good at handling analysis of business data, but they are still playing catch-up when it comes to big data – the volume, variety, and velocity. A row store simply cannot deliver the analytical performance and scale of an MPP columnar platform like Vertica.
In-memory databases are a good acceleration solution for some classes of business analytics, but, again, when it comes to very large data problems, the economics of putting all the data in memory simply do not work. That said, Vertica itself has an in-memory component which is at the core of our high-speed loading architecture, so I believe we have the best of both worlds – ability to use memory where it matters and still support petabyte scales!
Shilpa Lawande has been an integral part of the Vertica engineering team from its inception to its acquisition by HP in 2011. Shilpa brings over 15 years of experience in databases, data warehousing and grid computing to HP/Vertica.
Besides being responsible for Vertica’s Engineering team, Shilpa also manages the Customer Experience organization for Vertica including Customer Support, Training and Professional Services. Prior to Vertica, she was a key member of the Oracle Server Technologies group where she worked directly on several data warehousing and self-managing features in the Oracle Database.
Shilpa is a co-inventor on several patents on query optimization, materialized views and automatic index tuning for databases. She has also co-authored two books on data warehousing using the Oracle database as well as a book on Enterprise Grid Computing. She has been named to the 2012 Women to Watch list by Mass High Tech and awarded HP Software Business Unit Leader of the year in 2012.
Shilpa has a Masters in Computer Science from the University of Wisconsin-Madison and a Bachelors in Computer Science and Engineering from the Indian Institute of Technology, Mumbai.
Follow ODBMS.org on Twitter: @odbmsorg