“With terabytes, things are actually pretty simple — most conventional databases scale to terabytes these days. However, try to scale to petabytes and it’s a whole different ball game.” –Florian Waas.
On the subject of Big Data Analytics, I interviewed Florian Waas (flw). Florian is the Director of Software Engineering at EMC/Greenplum and heads up the Query Processing team.
Q1. What are the main technical challenges for big data analytics?
Florian Waas: Put simply, in the Big Data era the old paradigm of shipping data to the application isn’t working any more. Rather, the application logic must “come” to the data or else things will break: this is counter to conventional wisdom and the established notion of strata within the database stack.
Instead of stand-alone products for ETL, BI/reporting and analytics we have to think about seamless integration: in what ways can we open up a data processing platform to enable applications to get closer?
What language interfaces, but also what resource management facilities can we offer? And so on.
At Greenplum, we’ve pioneered a couple of ways to make this integration reality: a few years ago with a Map-Reduce interface for the database and more recently with MADlib, an open source in-database analytics package. In fact, both rely on a powerful query processor under the covers that automates shipping application logic directly to the data.
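The "ship the logic to the data" idea can be sketched in a few lines (illustrative Python only; the partition layout and function names are hypothetical, not Greenplum's actual interface): instead of pulling every row back to the application, a function travels to each data partition and only the small per-partition results travel back.

```python
# Illustrative sketch: ship the function to the data, not the data to the app.
# Partition layout and function names here are hypothetical.

def run_on_partitions(partitions, map_fn, reduce_fn):
    """Apply map_fn where each partition 'lives', then combine the
    small per-partition results centrally with reduce_fn."""
    partial_results = [map_fn(p) for p in partitions]  # runs "at" the data
    return reduce_fn(partial_results)                  # only summaries move

# Example: total revenue across three partitions of an orders table.
partitions = [
    [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 5.0}],
    [{"order_id": 3, "amount": 7.5}],
    [{"order_id": 4, "amount": 2.5}, {"order_id": 5, "amount": 1.0}],
]

total = run_on_partitions(
    partitions,
    map_fn=lambda rows: sum(r["amount"] for r in rows),
    reduce_fn=sum,
)
print(total)  # 26.0
```

The point of the sketch is the data-movement asymmetry: the rows never leave their partitions; only one number per partition crosses the wire.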
Q2. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?
Florian Waas: With terabytes, things are actually pretty simple — most conventional databases scale to terabytes these days. However, try to scale to petabytes and it’s a whole different ball game.
Scale and performance requirements strain conventional databases. Almost always, the problems are a matter of the underlying architecture. If not built for scale from the ground up, a database will ultimately hit the wall; this is what makes it so difficult for the established vendors to play in this space, because you cannot simply retrofit a 20+ year-old architecture into a distributed MPP database overnight.
Having said that, over the past few years a whole crop of new MPP database companies has demonstrated that multiple PBs don't pose a terribly big challenge if you approach the problem with the right architecture in mind.
Q3. How do you handle structured and unstructured data?
Florian Waas: As a rule of thumb, we suggest to our customers to use Greenplum Database for structured data and to consider Greenplum HD—Greenplum’s enterprise Hadoop edition—for unstructured data. We’ve equipped both systems with high-performance connectors to import and export data to each other, which makes for a smooth transition when using one system to pre-process data for the other, querying HD from Greenplum Database, or whatever combination the application scenario might call for.
Having said this, we have seen a growing number of customers loading highly unstructured data directly into Greenplum Database and converting it into structured data on the fly through in-database logic for data cleansing, etc.
Q4. Cloud computing and open source: do they play a role at Greenplum? If yes, how?
Florian Waas: Cloud computing is an important direction for our business and hardly any vendor is better positioned than EMC in this space. Suffice it to say, we’re working on some exciting projects.
So, stay tuned!
As you know, Greenplum has historically been very close to the open source movement. Besides our ties with the Postgres and Hadoop communities, we released our own open source distribution, MADlib, for in-database analytics (see also madlib.net).
Q5. In your blog you write that classical database benchmarks “aren’t any good at assessing the query optimizer”. Can you please elaborate on this?
Florian Waas: Unlike customer workloads, standard benchmarks pose few challenges for a query optimizer – the emphasis in these benchmarks is on query execution and storage structures. Recently, several systems that have no query optimizer to speak of have scored top results in the TPC-H benchmark.
And while impressive on these benchmarks, these systems usually do not perform well in customer accounts when faced with ad-hoc queries: that's where a good optimizer makes all the difference.
Q6. Why do we need specialized benchmarks for a subcomponent of a database?
Florian Waas: On the one hand, an optimizer benchmark will be a great tool for consumers.
A significant portion of the total cost of ownership of a database system comes from the cost of query tuning and manual query rewriting, in other words, the shortcomings of the query optimizer. Without an optimizer benchmark it’s impossible for consumers to compare the maintenance cost. That’s like buying a car without knowing its fuel consumption!
On the other hand, an optimizer benchmark will be extremely useful for engineering teams in optimizer development. It’s somewhat ironic that vendors haven’t invested in a methodology to show off that part of the system where most of their engine development cost goes.
Q7. Are you aware of any work in this area (Benchmarking query optimizers)?
Florian Waas: Funny you should ask. Over the past months I've been working with coworkers and colleagues in the industry on some techniques. We're still far away from a complete benchmark, but we've made some inroads.
Q8. You have done some work on “dealing with plan regressions caused by changes to the query optimizer”. Could you please explain what the problem is and what kinds of solutions you developed?
Florian Waas: A plan regression is a regression of a query due to changes to the optimizer from one release to the next. For the customer this could mean that, after an upgrade or patch release, one or more of their truly critical queries might run slower, maybe even so slow that it starts impacting their daily business operations.
With current test technology, plan regressions are very hard to guard against, simply because the size of the input space makes it impossible to achieve perfect test coverage. This dilemma has made a number of vendors increasingly risk-averse and has turned into the biggest obstacle to innovation in this space. Some vendors came up with rather reactionary safety measures. To use another car analogy: many of these are akin to driving with defective brakes but wearing a helmet in the hope that this will prevent the worst in a crash.
I firmly believe in fixing the defective brakes, so to speak, and developing better test and analysis tools. We've made good progress on this front and are starting to see some payback already. This is an exciting and largely under-developed area of research!
Q9. Glenn Paulley of Sybase, in a keynote at SIGMOD 2011, asked the question ‘How much more complexity can database systems deal with?’ What is your take on this?
Florian Waas: Unnecessary complexity is bad. I think everybody will agree with that. Some complexity is inevitable though, and the question becomes: How are we dealing with it?
Database vendors have all too often fallen into the trap of implementing questionable features quickly without looking at the bigger picture. This has led to tons of internal complexity and special casing, not to mention the resulting spaghetti code.
When abstracted correctly and broken down into sound building blocks, a lot of complexity can actually be handled quite well. Again, query optimization is a great example here: modern optimizers can be a joy to work with.
They are built and maintained by small, surgical teams that innovate effectively, whereas older models require literally dozens of engineers just to maintain the code base and fix bugs.
In short, I view dealing with complexity primarily as an exciting architecture and design challenge and I’m proud we assembled a team here at Greenplum that’s equally excited to take on this challenge!
Q10. I published an interview with Marko Rodriguez and Peter Neubauer, leaders of the Tinkerpop2 project. What is your opinion on Graph Analysis and Manipulation for databases?
Florian Waas: Great stuff these guys are building –- I’m interested to see how we can combine Big Data with graph analysis!
Q11. Anything else you wish to add?
Florian Waas: It’s been fun!
Florian Waas (flw) is Director of Software Engineering at EMC/Greenplum and heads up the Query Processing team. His day job is to bring theory and practice together in the form of scalable and robust database technology.
“Use cases are rote work. The developer listens to business experts and slavishly writes what they hear. There is little interpretation and no abstraction. There is little reconciliation of conflicting use cases. For a database project, the conceptual data model is a much more important software engineering contribution than use cases.” – Dr. Michael Blaha.
First of all let me wish you a Happy, Healthy and Successful 2012!
Now we look at use cases and discuss how good they are for database modeling. I hope you'll find the interview interesting. I encourage the community to post comments.
Q1. How are requirements taken into account when performing database modeling in daily practice? What are the common problems and pitfalls?
Michael Blaha: Software development approaches vary widely. I’ve seen organizations use the following techniques for capturing requirements (listed in random order).
– Preparation of use cases.
– Preparation of requirements documents.
– Representation and explanation via a conceptual data model.
– Representation and explanation via prototyping.
– Haphazard approach. Just start writing code.
General issues include:
– the amount of time required to capture requirements,
– missing requirements (requirements that are never mentioned),
– forgotten requirements (requirements that are mentioned but then forgotten),
– bogus requirements (requirements that are not germane to the business needs or that needlessly reach into design),
– incomplete understanding (requirements that are contradictory or misunderstood).
Q2. What is a use case?
Michael Blaha: A use case is a piece of functionality that a system provides to its users. A use case describes how a system interacts with outside actors.
Q3. What are the advantages of use cases?
– Use cases lead to written documentation of requirements.
– They are intuitive to business specialists.
– Use cases are easy for developers to understand.
– They enable aspects of system functionality to be enumerated and managed.
– They include error cases.
– They let consulting shops bill many hours for low-skilled personnel.
(This is a cynical view, but I believe this is a major reason for some of the current practice.)
Q4. What are the disadvantages of use cases?
– They are very time consuming. It takes much time to write them down. It takes much time to interview business experts (time that is often unavailable).
– Use cases are just one aspect of requirements. Other aspects should also be considered, such as existing documentation and artifacts from related software. Many developers obsess over use cases and forget to look for other requirement sources.
– Use cases are rote work. The developer listens to business experts and slavishly writes what they hear. There is little interpretation and no abstraction. There is little reconciliation of conflicting use cases.
– I have yet to see benefit from use case diagramming. I have yet to see significant benefit from use case structuring.
– In my opinion, use cases have been overhyped by marketeers.
Q5. How are use cases typically used in practice for database projects?
Michael Blaha: To capture requirements. It is OK to capture detailed requirements with use cases, but they should be subservient to the class model. The class model defines the domain of discourse that use cases can then reference.
For database applications it is far inferior to start with use cases and only afterwards construct a class model. Database applications, in particular, need a data approach, not a process approach.
It is ironic that use cases have arisen from the object-oriented community. Note that OO programming languages define a class structure to which logic is attached. So it is odd that use cases put process first and defer attention to data structure.
Q6. A possible alternative approach to data modeling is to write use cases first, then identifying the subsystems and components, and finally identifying the database schema. Do you agree with this?
Michael Blaha: This is a popular approach. No, I do not agree with it; I strongly disagree.
For a database project, the conceptual data model is a much more important software engineering contribution than use cases.
Only when the conceptual model is well understood can use cases be fully understood and reconciled. Only then can developers integrate use cases and abstract their content into a form suitable for building a quality software product.
Q7. Many requirements and the design to satisfy those requirements are normally done with programming, not just schema. Do you agree with this? How do you handle this with use cases?
Michael Blaha: Databases provide a powerful language, but most do not provide a complete language.
The SQL language of relational databases is far from complete and some other language must be used to express full functionality.
OO databases are better in this regard. Since OO databases integrate a programming language with a persistence mechanism they inherently offer a full language for expressing functionality.
Use cases target functionality and functionality alone. Use cases, by their nature, do not pay attention to data structure.
Q8. Do you need to use UML for use cases?
Michael Blaha: No. The idea of use cases is valuable if used properly (in conjunction with data and normally subservient to data).
In my opinion, UML use case diagrams are a waste of time. They don’t add clarity. They add bulk and consume time.
Q9. Are there any suitable tools around to help the process of creating use cases for database design? If yes, how good are they?
Michael Blaha: Well, it’s clear by now that I don’t think much of use case diagrams. I think a textual approach is OK and there are probably requirement tools to manage such text, but I am unfamiliar with the product space.
Q10. Use case methods of design are usually applied to object-oriented models. Do you use use cases when working with an object database?
Michael Blaha: I would argue not. Most object-oriented languages put data first. First develop the data structure and then attach methods to the structure. Use cases are the opposite of this. They put functionality first.
Q11. Can you use use cases as a design method for relational databases, NoSQL databases, graph databases as well? And if yes how?
Michael Blaha: Not reasonably. I guess developers can force fit any technique and try to claim success.
To be realistic, traditional database developers (relational databases) are already resistant (for cultural reasons) to object-oriented jargon/style and the UML. When I show them that the UML class model is really just an ER model and fits in nicely with database conceptualization, they acknowledge my point, but it is still a foreign culture.
I don’t see how use cases have much to offer for NoSQL and graph databases.
Q12. So if you don’t have use cases, how do you address functionality when building database applications?
Michael Blaha: I strongly favor the technique of interactive conceptual data modeling. I get the various business and technical experts together and build the class model interactively with them.
However, I fully consider the use cases by playing them against the evolving model. Of course as I consider use cases relative to the class model, I am reconciling the use cases. I am also considering abstraction as I construct the class model and consequently causing the business experts to do more abstraction in formulating their use case business requirements.
I have built class models this way many times before and it works great. Some developers are shocked at how well it can work.
Michael Blaha is a partner at Modelsoft Consulting Corporation.
Dr. Blaha is recognized as one of the world’s leading authorities on databases and data modeling. He has more than 25 years of experience as a consultant and trainer in conceiving, architecting, modeling, designing, and tuning databases for dozens of major organizations around the world. He has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New York.
The landscape for data management products is rapidly evolving. NuoDB is a new entrant in the market. I asked a few questions to Barry Morris, Founder & Chief Executive Officer NuoDB.
Q1. What is NuoDB?
Barry Morris: NuoDB combines the power of standard SQL with cloud elasticity. People want SQL and ACID transactions but these have been expensive to scale-up or down. NuoDB changes that by introducing a breakthrough “emergent” internal architecture.
NuoDB scales elastically as you add machines and is self-healing if you take machines away. It also allows you to share machines between databases, vastly increases business continuity guarantees, and runs in geo-distributed active/active configurations. It does all this with very little database administration.
Q2. When did you start the company?
Barry Morris: The technology has been developed over a period of several years but the company was funded in 2010 and has been operating out of Cambridge MA since then.
Q3. Big data, Web-scale concurrency and Cloud: how can NuoDB help here?
Barry Morris: These are some of the main themes of modern system software.
The 30-year-old database architecture that is universally used by traditional SQL database systems is great for traditional applications and in traditional datacenters, but the old architecture is a liability in web-facing applications running on private, public or hybrid clouds.
NuoDB is a general purpose SQL database designed specifically for these modern application requirements.
Massive concurrency with NuoDB is easy to understand: if the system is overloaded with users, you can just add as many diskless “transaction managers” as you need. NuoDB handles big data by using redundant Key-value stores for its storage architecture.
Cloud support boils down to the five key cloud requirements: elastic scalability, sharing machines between databases, extreme availability, geo-distribution, and “zero” DBA costs. You get these for free from a database with an emergent architecture, such as NuoDB.
Q4. What kinds of data (structured, unstructured) and what volumes of data is NuoDB able to handle?
Barry Morris: NuoDB can handle any kind of data, in a set-based relational model.
Naturally we support all standard SQL types. Additionally we have rich BLOB support that allows us to store anything in an opaque fashion. A forthcoming release of the product will also support user-defined types.
NuoDB extends the traditional SQL type model in some interesting ways. We store arbitrary-length strings and numbers because we store everything by value not by type. Schemas can change easily and dynamically because they are not tightly coupled to the data.
Additionally NuoDB supports table inheritance, a powerful feature for applications that traditionally have wide, sparse tables.
Q5. How do you ensure data availability?
Barry Morris: There is no single point of failure in NuoDB. In fact it is quite hard to stop a NuoDB database from running, or to lose data.
There are three tiers of processes in a NuoDB solution (Brokers, Transaction Managers, and Archive Managers) and each tier can be arbitrarily redundant. All NuoDB Brokers, Transaction Managers and Archive Managers run as true peers of others in their respective tiers. To stop a database you have to stop all processes on at least one tier.
Data is stored redundantly, in as many Archive Managers as you want to deploy. All Archive Managers are peers and if you lose one the system just keeps going with the remaining Archive Managers.
Additionally the system can run across multiple datacenters. In the case of losing a datacenter a NuoDB database will keep running in the other datacenters.
Q6. How do you orchestrate and guarantee ACID transactions in a distributed environment?
Barry Morris: We do not use a network lock manager because we need to be asynchronous in order to scale elastically.
Instead concurrency and consistency are managed using an advanced form of MVCC (multi-version concurrency control). We never actually delete or update a record in the system directly. Instead we always create new versions of records pointing at the old versions. It is our job to do all the bookkeeping about who sees which versions of which records and at what time.
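The version-chain idea can be sketched as follows (a minimal illustrative Python sketch, not NuoDB's implementation): each write appends a new version stamped with its commit time, and a reader with snapshot time t sees the newest version committed at or before t.

```python
# Illustrative MVCC sketch: records are never updated in place;
# each write appends a new version "pointing back" at the old ones.

class VersionedRecord:
    def __init__(self):
        self.versions = []  # append-only list of (commit_time, value)

    def write(self, commit_time, value):
        self.versions.append((commit_time, value))

    def read(self, snapshot_time):
        """Return the newest value committed at or before snapshot_time."""
        visible = [(t, v) for (t, v) in self.versions if t <= snapshot_time]
        return max(visible)[1] if visible else None

rec = VersionedRecord()
rec.write(commit_time=1, value="alice")
rec.write(commit_time=3, value="bob")

print(rec.read(snapshot_time=2))  # alice: the write at t=3 is invisible
print(rec.read(snapshot_time=5))  # bob
```

Because nothing is ever overwritten, readers never block writers: a transaction that started before t=3 simply keeps seeing "alice", which is the bookkeeping the answer describes.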
The durability model is based on redundant storage in Key-value stores we call Archive Managers.
Q7. You say you use a distributed non-blocking atomic commit protocol. What is it? What is it useful for?
Barry Morris: Until changes to the state of a transactional database are committed they are not part of the state of the database. This is true for any ACID database system.
The Commit Protocol in NuoDB is complex because at any time we can have thousands of concurrent applications reading, writing, updating and deleting data in an asynchronous distributed system with multiple live storage nodes. A naïve design would “lock” the system in order to commit a change to the durable state, but that is a good way of ensuring the system does not perform or scale. Instead we have a distributed, asynchronous commit protocol that allows a transaction to commit without requiring network locks.
Q8. What is special about NuoDB Architecture?
Barry Morris: NuoDB has an emergent architecture. It is like a flock of birds that can fly in an organized formation without having a central “brain”. Each bird follows some simple rules and the overall effect is to organize the group.
In NuoDB, no one is in charge. There is no supervisor, no master, and no central authority on anything. Everything that would normally be centralized is distributed. Any Transaction Manager or Archive Manager with the right security credentials can dynamically opt-in or opt-out of a particular database. All properties of the system emerge from the interactions of peers participating on a discretionary basis rather than from a monolithic central coordinator.
Q9. Do you support SQL?
Barry Morris: Yes. NuoDB is a relational database, and SQL is the primary query language. We have very broad and very standard SQL support.
Q10. How do you reduce the overhead when storing data to a disk? Do you use in-memory cache?
Barry Morris: Our storage model is that we have multiple redundant Archive Nodes for any given database. Archive Nodes are Key-value stores that know how to create and retrieve blobs of data. Archive Nodes can be implemented on top of any Key-value store (we currently have a file-system implementation and an Amazon S3 implementation). Any database can use multiple types of Archive Nodes at the same time (e.g. SSD-based alongside disk-based).
Archive Nodes allow trade-offs to be made on disk write latency. For an ultra-conservative strategy you can run the system with fully journaled, flushed disk writes on multiple Archive Nodes in multiple datacenters; on the other end of the spectrum you can rely on redundant memory copies as your committed data, with disk writes happening as a background activity. And there are options in between. In all cases there are caching strategies in place that are internal to the Archive Nodes.
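The pluggable storage design described above can be sketched like this (illustrative Python; the class and method names are hypothetical, not NuoDB's API): the database only asks an archive to put and get opaque blobs by key, so an in-memory store and a file-system store are interchangeable, and writes can go redundantly to every archive.

```python
import os
import tempfile

class MemoryArchive:
    """Key-value 'archive' kept in memory (stands in for any KV store)."""
    def __init__(self):
        self._data = {}
    def put(self, key, blob):
        self._data[key] = blob
    def get(self, key):
        return self._data.get(key)

class FileArchive:
    """Key-value 'archive' backed by the file system: one file per key."""
    def __init__(self, directory):
        self.directory = directory
    def put(self, key, blob):
        with open(os.path.join(self.directory, key), "wb") as f:
            f.write(blob)
    def get(self, key):
        path = os.path.join(self.directory, key)
        if not os.path.exists(path):
            return None
        with open(path, "rb") as f:
            return f.read()

# The caller is agnostic about which backend holds the blob.
archives = [MemoryArchive(), FileArchive(tempfile.mkdtemp())]
for a in archives:                      # write redundantly to every archive
    a.put("row-42", b"opaque blob")
print([a.get("row-42") for a in archives])
```

The durability trade-off in the answer maps onto this interface: a conservative deployment flushes the file-backed archives synchronously, while a latency-sensitive one can treat the in-memory copies as the committed state and write to disk in the background.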
Q11. How do you achieve that query processing scales with the number of available nodes?
Barry Morris: Queries running on multiple nodes get the obvious benefit of node parallelism at the query-processing level. They additionally get parallelism at the disk-read level because there are multiple archive nodes. In most applications the most significant scaling benefit is that the Transaction Nodes (which do the query processing) can load data from each other's memory if it is available there. This is orders of magnitude faster than loading data from disk.
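The peer-memory effect can be sketched as follows (illustrative Python with hypothetical names): a node serves a read from its own cache if possible, then from a peer's memory, and only falls back to the slow archive when no peer holds the data.

```python
class TransactionNode:
    """Illustrative node that prefers peer memory over durable storage."""
    def __init__(self, archive):
        self.cache = {}          # this node's in-memory data
        self.peers = []          # other transaction nodes
        self.archive = archive   # slow, durable key-value store

    def get(self, key):
        if key in self.cache:                    # 1. local memory
            return self.cache[key], "local"
        for peer in self.peers:                  # 2. a peer's memory
            if key in peer.cache:
                value = peer.cache[key]
                self.cache[key] = value          # keep a local copy
                return value, "peer"
        value = self.archive.get(key)            # 3. fall back to storage
        self.cache[key] = value
        return value, "archive"

archive = {"k": "v"}  # stand-in for an Archive Node
a, b = TransactionNode(archive), TransactionNode(archive)
a.peers, b.peers = [b], [a]

b.cache["k"] = "v"    # a peer already holds the data in memory
print(a.get("k"))     # ('v', 'peer'): served without touching storage
```

With a stable working set, step 3 becomes rare and reads are served entirely from the distributed memory of the peers, which is the in-memory effect described in Q12.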
Q12. What about storage? How does it scale?
Barry Morris: You can have arbitrarily large databases, bounded only by the storage capacity of your underlying Key-value store. The NuoDB system itself is agnostic about data set size.
Working set size is very important for performance. If you have a fairly stable working set that fits in the distributed memory of your Transaction Nodes then you effectively have a distributed in-memory database, and you will see extremely high performance numbers. With the low and decreasing cost of memory this is not an unusual circumstance.
Scalability and Elasticity
Q13. How do you obtain scalability and elasticity?
Barry Morris: These are simple consequences of the emergent architecture.
Transaction Managers are diskless nodes. When a Transaction Manager is added to a database it starts getting allocated work and will consequently increase the throughput of the database system. If you remove a transaction manager you might have some failed transactions, which the application is necessarily designed to handle, but you won’t lose any data and the system will continue regardless.
Q14. How complex is for a developer to write applications using NuoDB?
Barry Morris: It’s no different from traditional relational databases. NuoDB offers standard SQL with JDBC, ODBC, Hibernate, Ruby Active Records etc.
Q15. NuoDB is in a restricted Beta program. When will it be available in production? Is there a way to try the system now?
Barry Morris: NuoDB is in its Beta 4 release. We expect it to ship by the end of the year or very early in 2012. You can try the system by going to our download page.
“In the last years, our main development focus in the data management area has been on technologies that help customers turn insight into action.” – Kristof Kloeckner, IBM Corporation.
I wanted to learn about IBM's current strategy in data management, so I interviewed Kristof Kloeckner, General Manager of IBM Rational Software, IBM Corporation.
Q1. In your opinion, what are the most important technologies being developed and deployed in the last years that are affecting data management?
Kloeckner: In the last years, our main development focus in the data management area has been on technologies that help customers turn insight into action. Three fundamental shifts are being experienced across industries where these technology advances have been critical: i) information is exploding; ii) business change is outpacing the collective ability to keep up; and iii) the performance gap between leaders and laggards/followers is widening. Successful organizations are taking a structured approach to turning insight into action through business analytics and optimization. Specifically, innovative technologies that have helped businesses optimize to address these shifts include:
1 – Competing on speed: enabling organizations to speed up decision making and processes to act in a time frame that matters, in order to retain and attract clients and grow revenues.
2 – Exceeding customer and employee expectations: consumer IT innovation has created high employee expectations of IT: ease of use, collaboration, and mobile access.
3 – Exploiting information from everywhere: access to data wherever it resides, in any variety, at any velocity, and at any volume. Advancements in Big Data technology include new compression capabilities to transparently compress data on disk to reduce disk space and storage infrastructure requirements.
4 – Realizing new possibilities from analytics: developments within analytics processing that combine and analyze information through multiple analytics services, including content analytics and predictive analytics, are transforming healthcare.
5 – Governing information: addressing the pressure of regulatory compliance. Technology advancements with Hadoop allow multiple users to access unstructured data for governance purposes.
• Content-development workflow advancements for enterprise business glossaries enable a governed approach to build, publish, and manage enterprise vocabularies.
6 – OLTP scale-out and availability.
Q2. In a recent keynote you introduced the concept of “Software-driven innovation”. What is it?
Kloeckner: Innovation is increasingly achieved and delivered via software. The need for innovation is key for companies trying to gain competitive advantage in today’s business environment.
In a recent IBM survey of more than 1,500 CEOs – nearly 50% of respondents called out the increased need for product and service innovation as a top concern.
Software connects people, products and information. The more interconnected, instrumented, and intelligent a product or service is, the greater its potential value is to its users. Challenges to achieving software-driven innovation derive from the complexity of interconnected systems, but also from the complexity of software delivery due to the increasingly global distribution of development teams, new models of software supply chains including outsourcing and crowdsourcing, and the reality of constantly shifting market requirements. Effective software-driven innovation requires application and systems lifecycle management with a focus on integration of processes, data and tools, collaboration across roles and organization and optimization of business outcomes through measurements and simplified governance.
Q3. You defined three elements for realizing “software-driven innovation”: Integration, Collaboration, and Optimization. Could you please elaborate on this? In particular, how is the software design and delivery lifecycle managed?
Kloeckner: Collaborative Lifecycle Management integrates tools and processes end-to-end, from requirements management to design and construction to test and verification, and ultimately, deployment.
It enables traceability of artifacts across the lifecycle, real-time planning, in-context collaboration, development analytics or ‘intelligence’ and continuous improvement. It eliminates manual hand-offs, and can lead to increased project velocity, adaptability to change and reduced project risk.
Collaborative design management brings cross-team collaboration on software and systems design and creates deep integrations across the lifecycle. Effective modeling, design and architecture are critical to developing smarter products and services.
Collaborative DevOps bridges the gap between development and deployment by sharing information between the teams, improving deployment planning and automating many deployment steps. This leads to faster delivery of solutions into production.
Collaborative Lifecycle Management, Design Management and DevOps can considerably increase business agility by accelerating software and systems delivery. They can be augmented by capabilities that further increase alignment of development with business goals, for instance through application portfolio management.
Q4. IBM recently announced a new application development collaboration software built on its Jazz development platform. What is special about Jazz?
Kloeckner: Jazz is IBM’s initiative for improving collaboration across the software and systems lifecycle. Inspired by the artists who transformed musical expression, Jazz is an initiative to transform software and systems delivery by making it more collaborative, productive and transparent, through integration of information and tasks across the phases of the lifecycle. The Jazz initiative consists of three elements: Platform, Products and Community.
Jazz is built on architectural principles that represent a key departure from approaches taken in the past.
Unlike monolithic, closed platforms of the past, Jazz has an innovative approach to integration based on open, flexible services and Internet architecture (linked data). It is complemented by Open Services for Lifecycle Collaboration (OSLC), an industry initiative of more than 30 companies to define the interfaces for tools integration based on the principles of linked data.
Q5. What about other approaches such as social networks ala Facebook? Aren’t they also a sort of collaboration platforms?
Kloeckner: Without a doubt, software delivery is a team sport, and the process of delivery becomes an instance of ‘social business’. Collaboration is vital for organizational cohesion on a global scale, but also for the sharing of best practices and artifacts. Increasingly, developers expect to use mechanisms for interacting and sharing similar to those they use in their personal lives.
At IBM we combine development tools such as Rational Team Concert and Rational Asset Manager with our social business platform, IBM Connections, to provide an infrastructure for code reuse supporting more than 30,000 members in more than 1,700 projects. It has a governance structure derived from Open Source, which we call Community Source. The platform uses wikis, forums, blogs and other feedback and communication mechanisms.
Q6. What is IBM’s strategy in data management? DB2, Apache Hadoop, CloudBurst, SmartCloud Enterprise, to name a few. How do they relate to each other, if at all?
Kloeckner: Our strategy is to enable the placement of data and the workloads that use it in cost-effective patterns that encompass public and private clouds as well as traditional deployment topologies.
IBM has provided enablement on both public and private clouds, giving evolving businesses easy access to IBM’s rich enterprise data management portfolio. Our focus is on deeper cloud capabilities such as the Platform-as-a-Service and Database-as-a-Service style features in the IBM Workload Deployer (the evolution of the WebSphere CloudBurst Appliance). The adoption of these technologies leads to the need to integrate with existing IT infrastructures as well as the evolving world of NoSQL.
This deeper cloud capability additionally focuses on the need to expose analytical capability and insight into raw business data that is still difficult to model or understand. This leads to the Big Data topic. The run-time infrastructure used for this deeper insight is built on Apache Hadoop and is the basis for IBM’s Big Data product, which is available as either a cloud offering or a product offering. Big Data is allowing businesses to expand the range of information they can analyze. Moreover, for this data to be useful, it often needs to tie into the more traditional Information Management world: existing operational and warehouse systems that provide the context needed both to support the broader range of analytics that Big Data enables and to operationalize the results.
Q7. What are the main business and technical issues related to data management when using the Cloud?
Kloeckner: There are four main issues, each of which relates to the data itself: data movement, data security, data ownership/stewardship, and the analysis of enormous datasets. The challenge with data movement manifests itself primarily in two ways: availability and the obvious time that it takes to transfer data between sites. The relation to availability is not immediately obvious, but high-speed data movement is used for replication-based HA models as well as recovery scenarios that need to pull data from a remote site. The higher the bandwidth and the lower the latency, the easier it is to create a highly available data management offering on the cloud.
Data security is where most of the concerns surface, since many enterprise companies have very sensitive data. Questions such as “Who is the group of people working behind the cloud facade, and can I trust them?” are probably the most common. Fortunately, IBM has been handling sensitive enterprise data for most of the last century. Having a reputable company to partner with to either host or protect your data in the cloud is absolutely key, and it is also why many companies are reaching out to IBM as a cloud provider. Dealing with sensitive data is business as usual for us.
Related to security is the issue of ownership and stewardship of data in the cloud. If data is believed to be out of the control of its stakeholders, unless you can demonstrate otherwise, they may begin to lose confidence in the trustworthiness of that information. This leads to the need for governance of data in the cloud, tied closely to its security.
Lastly, what do you do with petabytes of structured and unstructured data? Cloud and Hadoop sometimes get used interchangeably in some circles, but IBM’s direction on Hadoop is illustrated above with BigInsights, which includes the BigData and Streams products. Cloud is just as critical but serves a different role. One area where information management is focused around clouds, as an example, is making it simpler for business users to build sandboxes and smaller systems in the cloud, and to integrate their data capture and governance with the technical IT groups that manage the business data. We need to make it simple for business users, but allow IT some control over what data goes into the cloud and how long it lives.
Q8. Private cloud vs. public cloud? What is the experience of IBM on this?
Kloeckner: Clients have choices for deploying enterprise applications, and the key factors they take into consideration are not just cost or convenience, but also security, reliability, scalability, control and tooling, and a trusted vendor relationship. We find that for enterprise-critical applications, the majority of customers favor private (including privately managed) clouds. We also find increasing adoption of IBM SmartCloud Enterprise.
IBM has a proven approach for building and managing cloud solutions, providing an integrated platform that uses the same standards and processes across the entire portfolio of products and services. IBM’s expertise and experience in designing, building and implementing cloud solutions—beginning with its own—offers clients the confidence of knowing that they are engaging not just a provider, but a trusted partner in the transformation of their IT service delivery. The IBM Cloud Computing reference architecture builds on IBM’s industry-leading experience and success in implementing SOA solutions. We have also invested in technologies that integrate cloud environments and allow clients to combine services deployed in their private cloud environments with services delivered through public clouds. We believe that such hybrid environments are becoming increasingly common.
Q9. In your opinion, will we ever have a “Trusted” Cloud?
Kloeckner: Concerns about security are the most prominent reasons that organizations cite for not adopting public cloud services. Therefore, creating more comprehensive security capabilities is a prerequisite for getting organizations to adopt public cloud-based services for more complex, business-sensitive, and demanding purposes. A recent Forrester Research report fully expects the emergence of highly secure and trusted cloud services over the next five years, during which time cloud security will grow into a $1.5 billion market and will shift from being an inhibitor to an enabler of cloud services adoption. The advent of secure cloud services will be a disruptive force in the security solutions market, challenging traditional security solution providers to revamp their architectures, partner ecosystems, and service offerings, while creating an opportunity for emerging vendors, integrators, and consultants to establish themselves.
Q10. Where do you see the data management industry heading in the next years?
Kloeckner: Across industries, technology providers and technology consumers are moving to address the exponential growth in quantity, complexity, variety, and velocity of data to enable better business insights and drive informed actions. Add to this the immediacy of the web and data in motion, the new control and influence of the individual consumer, and the fact that most companies have yet to fully exploit the raw data they have in house within the context of structured models, and you can easily discern the driving forces behind strategic moves, acquisitions and recent product announcements. The potential for business risk continues to increase, as do requirements for regulatory compliance. As a result, information and analytics software spending is estimated to reach $74B by 2015.
The consumability of core technologies such as High Performance Computing and embarrassingly parallel workloads, real-time analysis of streaming data and process flows, ease of access to analytics as services via the cloud and embedded within business processes, as well as advances in natural language interfaces, will take analytics to the mainstream and to main street. IBM’s Watson is just a hint of what is to come.
Solution delivery trends in mobile, cloud, and software as a service are driving information management technology investments to meet these demands. Evolving data access trends from traditional business applications and existing OLTP environments are extending their reach into social media and content, for example, driving technology investments in processing Big Data as well as content and text analytics.
As systems processing power, in-memory computing capabilities, extreme low-energy servers, and storage technologies such as solid-state drives continue to progress, data management technologies advance in parallel, leveraging them to enable optimized workloads and consolidated platforms for cloud environments.
Both companies and consumers look for better, faster, easier access to a single best view of the truth and the ability to predict the next best action. Whether it is process optimization within a business, risk and fraud management, dramatically enhanced personalization of each customer experience, or improving the speed and accuracy of diagnosis based on the full compendium of the latest research, historical data and best practices, fields such as health care, financial services, customer service, precision marketing and even auto repair will be impacted.
Kristof Kloeckner, Ph.D., is General Manager of IBM Rational Software, IBM Corporation.
Dr. Kloeckner is responsible for all facets of the Rational Software business including strategy, marketing, sales, operations, technology development and overall financial performance. He and his global team are focused on providing the premier development and delivery solutions for transforming business through software-driven innovation. Dr. Kloeckner is based in Somers, New York.
“It is expected there will be 6 billion mobile phones by the end of 2011, and there are currently over 300 Twitter accounts and 500K Facebook status updates created every minute. And, there is now a $2 billion a year market for virtual goods!” — Shilpa Lawande
I wanted to know more about the Vertica Analytics Platform for Big Data, so I interviewed Shilpa Lawande, VP of Engineering at Vertica. Vertica was acquired by HP earlier this year.
Q1. What are the main technical challenges for big data analytics?
Shilpa Lawande: Big data problems have several characteristics that make them technically challenging. First is the volume of data, especially machine-generated data, and how fast that data is growing every year, with new sources of data that are emerging. It is expected there will be 6 billion mobile phones by the end of 2011, and there are currently over 300 Twitter accounts and 500K Facebook status updates created every minute. And, there is now a $2 billion a year market for virtual goods!
A lot of insights are contained in unstructured or semi-structured data from these types of applications, and the problem is analyzing this data at scale. Equally challenging is the problem of ‘how to analyze.’ It can take significant exploration to find the right model for analysis, and the ability to iterate very quickly and “fail fast” through many (possibly throwaway) models – at scale – is critical.
Second, as businesses get more value out of analytics, it creates a success problem – they want the data available faster, or in other words, want real-time analytics. And they want more people to have access to it, or in other words, high user volumes.
One of Vertica’s early customers is a Telco that started using Vertica as a ‘data mart’ because they couldn’t get resources from their enterprise data warehouse. Today, they have over a petabyte of data in Vertica, several orders of magnitude bigger than their enterprise data warehouse.
Techniques like social graph analysis, for instance leveraging the influencers in a social network to create a better user experience, are hard problems to solve at scale. All of these problems combined create a perfect storm of challenges, and opportunities to create faster, cheaper and better solutions for big data analytics than traditional approaches can provide.
Q2. How does Vertica help solve these challenges?
Shilpa Lawande: Vertica was designed from the ground up for analytics. We did not try to retrofit 30-year old RDBMS technology to build the Vertica Analytics Platform. Instead, Vertica built a true columnar database engine including sorted columnar storage, a query optimizer and an execution engine.
With sorted columnar storage, two mechanisms drastically reduce the I/O bandwidth requirements of big data analytics workloads. The first is that Vertica reads only the columns that queries need.
Second, Vertica compresses the data significantly better than anyone else.
Vertica’s execution engine is optimized for modern multi-core processors, and we ensure that data stays compressed as much as possible throughout query execution, thereby reducing the CPU cycles needed to process the query. Additionally, we have a scale-out MPP architecture, which means you can add more nodes to Vertica.
All of these elements are extremely critical to handle the data volume challenge.
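The column-pruning and compression effects described above can be sketched in a few lines. This is an illustrative toy model only, not Vertica's actual storage format: a query touching two of three columns never reads the third, and a sorted column run-length encodes into very few (value, count) pairs.

```python
# Toy illustration (not Vertica internals) of sorted columnar storage:
# only the columns a query needs are scanned, and a sorted column
# run-length encodes into a handful of (value, count) pairs.
from itertools import groupby

# Row-oriented table: every scan touches every column of every row.
rows = [("east", "2011-06-01", 100),
        ("east", "2011-06-01", 150),
        ("east", "2011-06-02", 120),
        ("west", "2011-06-01", 200)]

# Column-oriented layout: each column stored (and compressed) separately.
columns = {
    "region": [r[0] for r in rows],
    "day":    [r[1] for r in rows],
    "sales":  [r[2] for r in rows],
}

def rle_encode(column):
    """Run-length encode a sorted column into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]

# A query like SELECT SUM(sales) WHERE region = 'east' reads only two
# of the three columns, and the sorted 'region' column compresses well.
encoded_region = rle_encode(columns["region"])
print(encoded_region)   # [('east', 3), ('west', 1)]

total = sum(s for reg, s in zip(columns["region"], columns["sales"])
            if reg == "east")
print(total)            # 370
```

On real data with millions of repeated values per run, the same encoding is what makes compression ratios like the 80:1 figure mentioned later in the interview plausible for favorable data.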
With Vertica, customers can load several terabytes of data per hour and query it within minutes of it being loaded – that is real-time analytics on big data for you.
There is a myth that columnar databases are slow to load. This may have been true of older-generation column stores, but Vertica has a hybrid in-memory/disk load architecture that rapidly ingests incoming data into a write-optimized row store and then converts it to read-optimized sorted columnar storage in the background. This is entirely transparent to the user because queries can access data in both locations seamlessly. We have a very lightweight transaction implementation with snapshot isolation, so queries can always run without any locks.
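The hybrid load path just described can be sketched as follows. The class and method names here are invented for illustration and do not reflect Vertica's implementation: loads append to an unsorted write-optimized buffer, a background step merges it into sorted read-optimized storage, and queries transparently scan both tiers.

```python
# Hypothetical sketch (assumed names, not Vertica internals) of a hybrid
# load architecture: fast appends into a write-optimized buffer, a
# background merge into sorted storage, and queries that see both.
import bisect

class HybridStore:
    def __init__(self):
        self.wos = []   # write-optimized store: unsorted append buffer
        self.ros = []   # read-optimized store: kept sorted

    def load(self, value):
        self.wos.append(value)   # loads never pay sorting cost up front

    def moveout(self):
        """Background conversion of buffered rows into sorted storage."""
        for value in self.wos:
            bisect.insort(self.ros, value)
        self.wos = []

    def query_count(self, lo, hi):
        """Count values in [lo, hi) across both tiers, transparently."""
        in_ros = (bisect.bisect_left(self.ros, hi)
                  - bisect.bisect_left(self.ros, lo))
        in_wos = sum(lo <= v < hi for v in self.wos)
        return in_ros + in_wos

store = HybridStore()
for v in [5, 1, 9, 3]:
    store.load(v)
print(store.query_count(1, 6))   # 3  (data is queryable before moveout)
store.moveout()
print(store.query_count(1, 6))   # 3  (same answer after conversion)
```

The point of the design is visible even in the toy: queries return the same answer before and after the background conversion, so loading stays fast without making freshly loaded data invisible.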
And we have no auxiliary data structures, like indexes or materialized views, which need to be maintained post-load.
Last, but not least, we designed the system for “always on,” with built-in high availability features.
Operations that translate into downtime in traditional databases are online in Vertica, including adding or upgrading nodes, adding or modifying database objects, etc.
With Vertica, we’ve removed many of the barriers to monetizing big data and hope to continue to do so.
Q3. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?
Shilpa Lawande: The short answer to the performance and scale question is that we make the most efficient use of all the resources, and we parallelize at every opportunity. When we compress and encode the data, we sometimes see 80:1 compression, depending on the data. Vertica’s fully peer-to-peer MPP architecture allows you to issue loads and queries from any node and run multiple load streams. Within each node, operations make full use of the multi-core processors to parallelize their work. A great measure of our raw performance is that we have several OEM customers who run Vertica on a single 1U node, embedded within applications like security and event log management, with very low data latency requirements.
Scalability has three aspects – data volume, hardware size, and concurrency. Vertica’s performance scales linearly (and often super linearly due to compression and other factors) when you double the data volume or run the same data volume on twice the number of nodes. We have customers who have grown their databases from scratch to over a petabyte, with clusters from tens to hundreds of nodes. As far as concurrency goes, running queries 50-200x faster ensures that we can get a lot more queries done in a unit of time. To efficiently handle a highly concurrent mix of short and long queries, we have built-in workload management that controls how resources are allocated to different classes of queries. Some of our customers run with thousands of concurrent users running sub-second queries.
Q4. What is new in Vertica 5.0?
Shilpa Lawande: In Vertica 5.0, we focused on extensibility and elasticity of the platform. Vertica 5.0 introduced a Software Development Kit (SDK) that allows users to write custom user-defined functions so they can do more with the Vertica platform, such as analyze Apache logs or build custom statistical extensions. We also added several more built-in analytic extensions, including event-series joins, event-series pattern matching, and statistical and geospatial functions.
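To make the event-series pattern matching mentioned above concrete, here is a minimal sketch of the idea. The event names and function are purely illustrative and do not use Vertica's SQL syntax: scan each user's time-ordered events for a funnel that occurs as an in-order subsequence.

```python
# Illustrative sketch of event-series pattern matching: find users whose
# time-ordered events contain the funnel view -> cart -> purchase.
# (Names are hypothetical; Vertica exposes this through SQL extensions.)
def matches_funnel(events, pattern=("view", "cart", "purchase")):
    """True if `pattern` occurs as an in-order subsequence of `events`."""
    i = 0
    for event in events:
        if event == pattern[i]:
            i += 1
            if i == len(pattern):
                return True
    return False

sessions = {
    "u1": ["view", "view", "cart", "purchase"],
    "u2": ["view", "purchase", "view"],   # purchase before cart: no match
}
converted = sorted(u for u, ev in sessions.items() if matches_funnel(ev))
print(converted)   # ['u1']
```

Doing this inside the database matters because the scan is a single ordered pass per user, which parallelizes naturally across an MPP cluster instead of requiring the data to be exported for analysis.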
With our new Elastic Cluster features, we allow very fast expansion and contraction of the cluster to handle changing workloads – in a recent POC we showed expansion from 8 nodes to 16 nodes with 11TB of data in an hour. Another feature we’ve added is called Import/Export, which automates fast export of data from one Vertica cluster to another, a very useful feature for creating sandboxes from a production system for exploratory analysis.
Besides these two major features, there are a number of improvements to the manageability of the database, most notably the Data Collector, that captures and maintains a history of system performance data, and the Workload Analyzer, that analyzes this data to point out suboptimal performance and how to fix it.
Q5. How is Vertica currently being used? Could you give examples of applications that use Vertica?
Shilpa Lawande: Vertica has customers in most major verticals, including Telco, Financial Services, Retail, Healthcare, Media and Advertising, and Online Gaming. We have eight out of the top 10 US Telcos using Vertica for Call-Detail-Record analysis, and in Financial Services, a common use-case is a tickstore, where Vertica is used to store and analyze many years of financial trades and quotes data to build models.
A growing segment of Vertica customers are Online Gaming and Web 2.0 companies that use Vertica to build models for in-game personalization.
Q6. Vertica vs. Apache Hadoop: what are the similarities and what are the differences?
Shilpa Lawande: Vertica and Hadoop are both systems that can store and analyze large amounts of data on commodity hardware. The main differences are how the data gets in and out, how fast the system can perform, and what transaction guarantees are provided. Also, from the standpoint of data access, Vertica’s interface is SQL and data must be designed and loaded into a SQL schema for analysis.
With Hadoop, data is loaded as-is into a distributed file system and accessed programmatically by writing Map-Reduce programs. By not requiring a schema first, Hadoop provides a great tool for exploratory analysis of the data, as long as you have the software development expertise to write Map-Reduce programs. Hadoop assumes that the workloads it runs will be long-running, so it makes heavy use of checkpointing at intermediate stages.
This means parts of a job can fail, be restarted and eventually complete successfully. There are no transactional guarantees.
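The programming model being contrasted here can be sketched in miniature. This in-memory word count shows only the map/shuffle/reduce contract that Hadoop exposes, with none of the distribution or checkpointing the interview describes:

```python
# Minimal in-memory sketch of the Map-Reduce programming model
# (no distribution, checkpointing, or fault tolerance, just the
# map -> shuffle -> reduce contract).
from collections import defaultdict

def map_phase(record):
    """Map: emit (key, value) pairs; here, (word, 1) for each word."""
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key; here, a sum."""
    return (key, sum(values))

records = ["big data", "big analytics"]
mapped = [pair for rec in records for pair in map_phase(rec)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(counts)   # {'big': 2, 'data': 1, 'analytics': 1}
```

In real Hadoop, each phase runs across many machines and the intermediate output is checkpointed to disk, which is exactly the durability-versus-latency trade-off the next paragraphs contrast with Vertica's pipelined execution.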
Vertica, on the other hand, is optimized for performance by careful layout of data and pipelined operations that minimize saving intermediate state. Vertica gets queries to run sub-second and if a query fails, you just run it again. Vertica provides standard ACID transaction semantics on loads and queries.
We recently did a comparison between Hadoop, Pig, and Vertica for a graph problem (see the post on our blog), and when it comes to performance, the choice is clearly in favor of Vertica. But we believe in using the right tool for the job and have over 30 customers using both systems together. Hadoop is a great tool for the early exploration phase, where you need to determine what value there is in the data, what the best schema is, or to transform the source data before loading it into Vertica. Once the data models have been identified, use Vertica to get fast responses to queries over the data.
Other customers keep all their data in Vertica and leverage Hadoop’s scheduling capabilities to retrieve the data for different kinds of analysis. To facilitate this, we provide a Hadoop connector that allows efficient bi-directional transfer of data between Vertica and Hadoop. We plan to continue to enhance Vertica’s analytic platform as well as be a partner in the Hadoop ecosystem.
Q7. Cloud computing: Does it play a role at Vertica? If yes, how?
Shilpa Lawande: In 2007, when ‘Cloud’ first started to gather steam, we realized that Vertica had the perfect architecture for the cloud. It is no surprise because the very trends in commodity multi-core servers, storage and interconnects that resulted in our design choices also enable the cloud. We were the first analytic database to run on Amazon EC2 and today have several customers using EC2 to run their analytics.
We consider cloud an important deployment configuration for Vertica, and as IaaS offerings improve and we gain real-world experience from our customers, you can expect Vertica’s product to be further optimized for cloud deployments. Cloud is also a big focus area at HP, and we now have the unique opportunity to create the best cloud platform to run Vertica and to provide value-added solutions for big data analytics based on Vertica.
Q8. How does Vertica fit into HP’s data management strategy?
Shilpa Lawande: Vertica provides HP with a proven platform for big data analytics that can be deployed as software, an appliance, and on the cloud, all of which are key focus areas for HP’s business.
Big data is a growing market and the majority of the data is unstructured. The combination of Vertica and Autonomy gives HP a comprehensive “Information Platform” to manage, analyze, and derive value from the explosion in both structured and unstructured data.
Shilpa Lawande, VP of Engineering, Vertica.
Shilpa Lawande has been an integral part of the Vertica engineering team since its inception, bringing over 10 years of experience in databases, data warehousing and grid computing to Vertica. Prior to Vertica, she was a key member of the Oracle Server Technologies group where she worked directly on several data warehousing and self-managing features in the Oracle 9i and 10g databases.
Lawande is a co-inventor on several patents on query optimization, materialized views and automatic index tuning for databases. She has also co-authored two books on data warehousing using the Oracle database as well as a book on Enterprise Grid Computing.
Lawande has a Master’s in Computer Science from the University of Wisconsin-Madison and a Bachelor’s in Computer Science and Engineering from the Indian Institute of Technology, Mumbai.
“One of the core concepts of Big Data is being able to evolve analytics over time. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources.” — Werner Vogels, Amazon.com
I wanted to know more about what is going on at Amazon.com in the area of Big Data and Analytics. For that, I interviewed Dr. Werner Vogels, Chief Technology Officer and Vice President of Amazon.com.
Q1. In your keynote at the Strata Making Data Work Conference held this February in Santa Clara, California, you said that “Data and Storage should be unconstrained”. What did you mean by that?
Vogels: In the old world of data analysis you knew exactly which questions you wanted to ask, which drove a very predictable collection and storage model. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources.
Q2. You also claimed that “Big Data requires NO LIMIT”. However, Prof. Alex Szalay, talking about astronomy, warns us that “Data is everywhere, never be at a single location. Not scalable, not maintainable.” Will Big Data Analysis become a new Astronomy?
Vogels: Big Data is the hot topic for this year. With the rise of the internet, and the increasing number of consumers, researchers and businesses of all sizes getting online, the amount of data now available to collect, store, manage, analyze and share is growing. When companies come across such large amounts of data it can lead to data paralysis where they don’t have the resources to make effective use of the information.
To Alex’s point, it’s challenging to get the relevant data at the place where you want it to do your analysis. This is why we see many organizations putting their data in the cloud where it’s easily accessible for everyone.
Q3. You also quoted Jim Gray and mentioned the “Fourth Paradigm: Data Intensive Scientific Discovery”. What is it? Is Business Intelligence becoming more like Science for profit?
Vogels: The book I referenced was The Fourth Paradigm: Data-Intensive Scientific Discovery. This is a collection of essays that discusses the vision for data-intensive scientific discovery, which is the concept of shifting computational science to a data-intensive model where we analyze observations.
For Business Intelligence this means that analysis goes beyond finance and accountancy and helps companies continuously improve the service they provide to their customers.
Q4. Michael Olson of Cloudera, in a recent interview said talking about Analytical Data Platforms that “Cloud is a deployment detail, not fundamental. Where you run your software and what software you run are two different decisions, and you need to make the right choice in both cases.”
What is your opinion on this? What is in your opinion the relationships between Big Data Analysis and Cloud Computing?
Vogels: Big Data holds the promise of helping companies create a competitive advantage, as through data analysis they learn how to better serve their customers. This is an approach we have already applied for 15 years at Amazon.com, and we have a solid understanding of all the challenges around managing and processing Big Data.
One of the core concepts of Big Data is being able to evolve analytics over time. For that, a company cannot be constrained by any resource. As such, Cloud Computing and Big Data are closely linked because for a company to be able to collect, store, organize, analyze and share data, they need access to infinite resources.
AWS customers are doing some really innovative things around dealing with Big Data. For example, the digital advertising and marketing firm Razorfish targets online adverts based on data from browsing sessions. A common issue Razorfish found was the need to process gigantic data sets. These large data sets are often the result of holiday shopping traffic on a retail website, or sudden dramatic growth on a media or social networking site.
Normally crunching these numbers would take them two days or more. By leveraging on-demand services such as Amazon Elastic MapReduce, Razorfish is able to drop their processing time to eight hours. There was no upfront investment in computing hardware, no procurement delay, and no additional operations staff hired. All this means Razorfish can offer multi-million dollar client service programs on a small business budget, helping them to increase their return on ad spend by 500%.
Q5. How has Amazon’s technology evolved over the past three years?
Vogels: Every day, Amazon Web Services adds enough new server capacity to support all of Amazon’s global infrastructure in the company’s fifth full year of operation, when it was a $2.76B annual revenue enterprise. Today we have hundreds of thousands of customers in over 190 countries—both startups and large companies. To give you an idea of the scale we’re talking about, Amazon S3 holds over 260 billion objects and regularly peaks at 200k requests per second.
Our pace of innovation has been rapid because of our relentless customer focus. Our process is to release a service into beta that is useful to a lot of people, get customer feedback and rapidly begin adding the bells and whistles based in large part on what customers want and need from the services. There’s really no substitute for the accelerated learning we’ve had from working with hundreds of thousands of customers with every imaginable use case.
We are also relentless about driving efficiencies and passing along the cost savings to our customers.
We’ve lowered our prices 12 times in the past 5 years with no competitive pressure to do so. We’re very comfortable running high-volume, low-margin businesses, which is very different from traditional IT vendors.
Q6. Amazon’s Dynamo is proprietary; however, the publication of your seminal 2007 Dynamo paper was used as input for open source projects (e.g., Cassandra, which began as a fusion of Google’s Bigtable and Amazon’s Dynamo concepts). Why did Amazon allow this?
Vogels: Dynamo is internal technology developed at Amazon to address the need for an incrementally scalable, highly available key-value storage system. The technology is designed to give its users the ability to trade off cost, consistency, durability and performance, while maintaining high availability.
We found, though, that there had been some struggles with applying the concepts so we published the paper as feedback to the academic community about what one needed to do to build realistic production systems.
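One of the core techniques the Dynamo paper describes is consistent hashing, which can be sketched briefly. This is an illustrative reduction, not Amazon's implementation: nodes and keys hash onto a ring, each key is owned by the next node clockwise, and the following distinct nodes hold its replicas. Real Dynamo adds virtual nodes, vector clocks and quorum reads/writes on top.

```python
# Illustrative sketch of consistent hashing as described in the Dynamo
# paper: keys and nodes hash to positions on a ring, and a key's
# preference list is the next `replicas` distinct nodes clockwise.
# (Simplified: no virtual nodes, vector clocks, or quorums.)
import bisect
import hashlib

def ring_position(name):
    """Hash a node or key name to an integer position on the ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def preference_list(self, key):
        """Return the `replicas` distinct nodes responsible for `key`."""
        start = bisect.bisect(self.ring, (ring_position(key), ""))
        nodes = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == self.replicas:
                break
        return nodes

ring = HashRing(["node-a", "node-b", "node-c"], replicas=2)
owners = ring.preference_list("cart:user42")
print(len(owners), len(set(owners)))   # 2 2  (two distinct replica holders)
```

The appeal of the scheme, and a reason the paper resonated with open source projects like Cassandra, is that adding or removing a node only remaps the keys adjacent to it on the ring rather than rehashing everything.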
Q7. What is Amazon’s position with respect to Open Source projects? Why didn’t you develop open source data platforms from the start, as, for example, Facebook and LinkedIn did?
Vogels: I believe that you need to pour your resources into areas where you can make important contributions, where you can provide the customer with the best possible experience. Our mission is to provide the one-man app developer or the 20,000 person enterprise with a platform of web services they can use to build sophisticated, scalable applications. I believe anything we can do to make AWS lower-cost and widely available will help the community tremendously.
Q8. Amazon Elastic MapReduce utilizes a hosted Hadoop framework running on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Why choosing Hadoop? Why not using already existing BI products?
Vogels: We chose Hadoop for several reasons. First, it is the only available framework that could scale to process hundreds or even thousands of terabytes of data and scale to installations of up to 4,000 nodes. Second, Hadoop is open source and we can innovate on top of the framework and inside it to help our customers develop more performant applications more quickly. Third, we recognized that Hadoop was gaining substantial popularity in the industry, with multiple customers using Hadoop and many vendors innovating on top of it.
Three years later we believe we made the right choice. We also see that existing BI vendors such as Microstrategy are willing to work with us and integrate their solutions on top of Elastic MapReduce.
Q9. Looking at three elements: Data, Platform, Analysis, what are the main research challenges ahead? And what are the main business challenges ahead?
Vogels: I think that sharing is another important aspect to the mix. Collaborating during the whole process of collecting data, storing it, organizing it and analyzing it is essential. Whether it’s scientists in a research field or doctors at different hospitals collaborating on drug trials, they can use the cloud to easily share results and work on common datasets.
Dr. Vogels is Vice President & Chief Technology Officer at Amazon.com where he is responsible for driving the company’s technology vision, which is to continuously enhance the innovation on behalf of Amazon’s customers at a global scale.
Prior to joining Amazon, he worked as a researcher at Cornell University where he was a principal investigator in several research projects that target the scalability and robustness of mission-critical enterprise computing systems. He has held positions of VP of Technology and CTO in companies that handled the transition of academic technology into industry.
Vogels holds a Ph.D. from the Vrije Universiteit in Amsterdam and has authored many articles for journals and conferences, most of them on distributed systems technologies for enterprise computing.
He was named the 2008 CTO of the Year by Information Week for his contributions to making Cloud Computing a reality. For his unique style in engaging customers, media and the general public, Dr. Vogels received the 2009 Media Momentum Personality of Year Award.
“ We believe the ability to do transactions will remain important to app developers.” — Blake Connell, VMware.
vFabric SQLFire beta was recently announced by VMware as part of their Cloud Application Platform, called vFabric 5. I wanted to know more about it. I have therefore interviewed Blake Connell, Sr. Product Marketing Manager, vFabric at VMware.
Q1. What is vFabric SQLFire beta?
Blake Connell: SQLFire is a memory-optimized distributed database. The beta is available for anyone to download and try while we complete the initial release.
Q2. What are the main technical challenges that vFabric SQLFire is supposed to solve?
Blake Connell: SQLFire addresses speed, scalability and availability. SQLFire is memory-optimized for maximum speed, which matters because users are becoming more and more demanding about the responsiveness of web sites and applications.
SQLFire is designed for horizontal scalability: if you need more performance or need to store more data, you simply add more members to the SQLFire distributed system. SQLFire extends SQL with new keywords that describe how tables should be partitioned and replicated, so as you add nodes the system intelligently rebalances itself for high performance, and it’s all completely transparent to the application. This is much more compelling than having to shard your database manually and build shard-awareness into your application.
Finally, SQLFire uses a shared-nothing architecture to ensure availability. If a node in the distributed system dies, applications can simply talk to another node. In fact, the database drivers shipped with SQLFire will transparently reconnect to a working node without the application needing to do anything; the app doesn’t even have to retry the request. A key compelling characteristic of SQLFire is that it enables developers to work with the well-known SQL programming interface, making adoption much more seamless and far less disruptive than alternatives.
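The transparent partitioning and rebalancing described above can be illustrated with consistent hashing, a standard technique for this kind of problem (SQLFire’s actual placement algorithm is not documented here, so this is only a sketch of the general idea): when a node joins, only a small fraction of the keys change owner.

```python
import bisect
import hashlib

def h(s):
    """Stable hash of a string onto a large integer ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with virtual nodes for smoother balance."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))

    def owner(self, key):
        points = [p for p, _ in self.ring]
        idx = bisect.bisect(points, h(key)) % len(self.ring)
        return self.ring[idx][1]

keys = [f"order:{i}" for i in range(10_000)]
before = Ring(["node1", "node2", "node3"])
after = Ring(["node1", "node2", "node3", "node4"])
moved = sum(before.owner(k) != after.owner(k) for k in keys)
# Only roughly 1/4 of the keys change owner when a fourth node joins,
# instead of nearly all of them under naive modulo hashing
assert 0 < moved < len(keys) // 2
```

Node and key names are illustrative; the point is that incremental rebalancing is what makes “just add more members” practical.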
Q3. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?
Blake Connell: A big part of the way we address scalability is through horizontal scalability and automatic partitioning / replication of data, as I mentioned. But there are some other very important design choices we’ve made to build a scalable database.
First, we believe the ability to do transactions will remain important to app developers. However we believe that the vast majority of transactions are small both in time required and amount of data affected. We’ve built a transaction system based on these assumptions that completely avoids the need for a centralized coordinator or lock manager. In fact each node can act as its own transaction coordinator, even when the data in the transaction lives in multiple nodes. In this way you get transactions with linear scalability. If you take this along with our partitioning scheme the end result is a relational database that is much more scalable than a traditional RDBMS.
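One generic way to get coordinator-free commits for small, short transactions is optimistic concurrency with per-row versions: the node owning the row validates and applies the change itself, and a conflicting writer simply retries. This is only an illustration of the idea, not SQLFire’s actual protocol:

```python
import threading

class Row:
    """A row with a version counter; each owning node can validate
    commits locally, with no centralized lock manager."""
    def __init__(self, value):
        self.value, self.version = value, 0
        self._lock = threading.Lock()  # per-row latch, held only briefly

    def read(self):
        return self.value, self.version

    def commit(self, new_value, expected_version):
        """Optimistic commit: succeeds only if no other writer
        committed since our read; otherwise the caller retries."""
        with self._lock:
            if self.version != expected_version:
                return False
            self.value, self.version = new_value, self.version + 1
            return True

balance = Row(100)
v, ver = balance.read()
assert balance.commit(v - 30, ver)      # first writer wins
assert not balance.commit(v - 50, ver)  # stale version is rejected
assert balance.read() == (70, 1)
```

Because validation is local to the data owner, throughput scales with the number of nodes rather than being throttled by one global coordinator.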
Q4. What are the main results of your performance benchmark for SQLFire?
Blake Connell: In our preliminary performance testing of vFabric SQLFire, the results show the product is able to achieve near-linear scalability as the cluster expands, while CPU utilization remains steady. This indicates SQLFire can accommodate even more load without exhausting CPU resources.
In this performance test, two thirds of all queries complete in under 1 ms, roughly 88% complete in under 2 ms, and 97% take under 5 ms, which is extremely fast.
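Percentile figures like these are straightforward to recompute from raw latency samples. The sketch below uses a synthetic sample shaped to match the quoted numbers (the sample itself is illustrative, not VMware’s benchmark data):

```python
def pct_under(latencies_ms, threshold_ms):
    """Fraction of queries completing under the given threshold."""
    return sum(t < threshold_ms for t in latencies_ms) / len(latencies_ms)

# 100 synthetic latencies: 66 fast, 22 medium, 9 slower, 3 slow
sample = [0.5] * 66 + [1.5] * 22 + [3.0] * 9 + [8.0] * 3

assert pct_under(sample, 1) == 0.66  # two thirds under 1 ms
assert pct_under(sample, 2) == 0.88  # 88% under 2 ms
assert pct_under(sample, 5) == 0.97  # 97% under 5 ms
```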
Q5. How do you handle structured and unstructured data?
Blake Connell: vFabric SQLFire handles structured data. A related offering, vFabric GemFire handles unstructured data.
Q6. How do you use vFabric SQLFire from Java applications and/ or from ADO.NET?
Blake Connell: We provide JDBC and ADO.NET drivers for you to embed in your application. After that you just use mostly standard SQL. We have DDL extensions, so for example when you create a table you can specify how it should be partitioned, replicated and so forth, but the DML is just standard SQL.
The DDL extensions are all optional and don’t interfere with standard DDL.
Q7. Could you give examples of applications that use SQLFire beta?
Blake Connell: We have a financial services trading application exploring vFabric SQLFire.
The application dramatically increases the speed at which it can take a customer order and send it to a stock exchange while maintaining reliability and security.
Q8. Why use vFabric SQLFire instead of one of the existing NoSQL or relational databases?
Blake Connell: vFabric SQLFire offers capabilities not found in traditional RDBMS or NoSQL solutions.
Comparing vFabric SQLFire to the traditional RDBMS:
- Data can be partitioned and/or replicated in memory across a cluster of machines to deliver scale-out benefits and high availability (HA)
- Not constrained by strict adherence to ACID properties. Developers uphold various degrees of ACID properties based on desired levels of data availability, consistency and performance.
- Optimized key-based access
Comparing vFabric SQLFire to NoSQL:
- vFabric SQLFire is focused on data consistency with scale-out and HA; NoSQL offerings often provide eventually consistent data models.
- vFabric SQLFire supports queries and joins with standard SQL syntax.
Q9. How does vFabric SQLFire fit into VMware’s product strategy?
Blake Connell: At VMware, our approach to data is really two-fold:
1. Leverage virtualization to automate database deployment and operations.
2. Provide new approaches to data beyond the traditional one-size fits all database model.
vFabric SQLFire and vFabric GemFire are key data offerings from VMware that provide alternative data approaches for modern applications. Innovation is occurring at all layers of IT, including the database layer. vFabric GemFire is a memory-oriented data fabric. Java developers using the Spring Framework or .NET developers can leverage an API to write applications to vFabric GemFire. vFabric SQLFire enables teams to leverage their skills and investment in SQL knowledge and gain the benefits of distributed, in-memory, shared-nothing architectures.
Our recently released vFabric Data Director leverages vSphere virtualization to provide Database-as-a-Service with centralized back-up, management and governance. This is a huge step forward for both developers looking for fast and simple access to the database resources and for IT teams who need a manageable approach to providing data services.
Q10. What is next in vFabric SQLFire?
Blake Connell: In the first half of 2012 we anticipate enhancing SQLFire to include asynchronous WAN replication. This will let you run a single SQLFire database in multiple datacenters with high performance and low latency.
“How much more complexity can human developers and organizations deal with?”– Tom Fastner, eBay.
Much has already been written about analytics at eBay. But what is the current status? Which data platforms and data management technologies do they currently use? I asked a few questions to Tom Fastner, a Senior Member of Technical Staff, and Architect with eBay.
Q1. What are the main technical challenges for big data analytics at eBay?
Tom Fastner: The primary challenges are:
I/O bandwidth: limited due to configuration of the nodes.
Concurrency/workload management: workload management tools usually manage the limited resource. For many years EDW systems bottlenecked on the CPU; big systems are configured with ample CPU, making I/O the bottleneck. Vendors are starting to put mechanisms in place to manage I/O, but it will take some time to get to the same level of sophistication.
Data movement (loads, initial loads, backup/restores): as new platforms emerge you need to make data available on more systems, which challenges networks, movement tools and support in ensuring scalable operations that maintain data consistency.
Q2. What are the current metrics of eBay’s main data warehouses?
Tom Fastner: We have 3 different platforms for Analytics:
A) EDW: Dual systems for transactional (structured) data; Teradata 3.5PB and 2.5 PB spinning disk; 10+ years experience; very high concurrency; good accessibility; hundreds of applications.
B) Singularity: deep Teradata system for semi-structured data; 36 PB spinning disk; lower concurrency than EDW, but can store more data; biggest use case is User Behavior Analysis; largest table is 1.2 PB with ~1.9 trillion rows.
C) Hadoop: for unstructured/complex data; ~40 PB spinning disk; text analytics, machine learning; has the User Behavior data and selected EDW tables; lower concurrency and utilization.
Q3. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?
Tom Fastner: EDW: We model for the unknown (close to 3rd NF) to provide a solid physical data model suitable for many applications, which limits the number of physical copies needed to satisfy specific application requirements. A lot of scalability and performance is built into the database, but as with any shared resource, it does require an excellent operations team to fully leverage the capabilities of the platform.
Singularity: The platform is identical to EDW; the only exceptions are limitations in workload management due to configuration choices. But since we are leveraging the latest database release, we are exploring ways to adopt new storage and processing patterns. Some new data sources are stored in a denormalized form, significantly simplifying data modeling and ETL. On top of that, we developed functions to support the analysis of the semi-structured data. This also enables more sophisticated algorithms that would be very hard, inefficient or impossible to implement with pure SQL. One example is the pathing of user sessions. However, the size of the data requires us to focus more on best practices (develop on small subsets, use a 1% sample, process by day).
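Sessionization, i.e. turning a flat click log into per-user session paths, is a good example of logic that is awkward in pure SQL. A minimal sketch (field names, the 30-minute gap rule, and the sample data are all illustrative, not eBay’s actual implementation):

```python
from collections import defaultdict

def session_paths(events, gap_s=1800):
    """Group (user, timestamp, page) events into per-user session paths,
    starting a new session after a 30-minute gap of inactivity."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user].append((ts, page))
    paths = []
    for user, evts in by_user.items():
        path, last_ts = [], None
        for ts, page in evts:
            if last_ts is not None and ts - last_ts > gap_s:
                paths.append((user, path))  # close the previous session
                path = []
            path.append(page)
            last_ts = ts
        paths.append((user, path))
    return paths

events = [("u1", 0, "home"), ("u1", 60, "search"), ("u1", 5000, "item")]
assert session_paths(events) == [("u1", ["home", "search"]), ("u1", ["item"])]
```

Expressing this row-by-row state machine declaratively is what makes pathing painful in SQL, and why in-database procedural functions help.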
Hadoop: The emphasis on Hadoop is on optimizing for access. The reusability of data structures (besides “raw” data) is very low.
Q4: How do you handle un-structured data?
Tom Fastner: Un-structured data is handled on Hadoop only. The data is copied from the source systems into HDFS for further processing. We do not store any of that on the Singularity (Teradata) system.
Q5. What kind of data management technologies do you use? What is your experience in using them?
Tom Fastner: ETL: AbInitio, home grown parallel Ingest system.
Repositories: Teradata EDW; Teradata Deep system; Hadoop.
BI: Microstrategy, SAS, Tableau, Excel.
Data modeling: Power Designer.
Ad hoc: Teradata SQL Assistant; Hadoop Pig and Hive.
Content Management: Joomla based.
In regards to tools capabilities, there is always something you might wish for. In some cases the tool of choice might even have an important capability, but we have not implemented it (due to resources, complexity, priorities, bugs). The way we try to address required features is through partnerships with some of those vendors.
The most mature partnership is with Teradata; other good examples are Tableau and Microstrategy, where we have regular meetings discussing future enhancements. However, I do not feel comfortable rating all those tools and pointing out shortcomings.
Q6. Cloud computing and open source: do they play a role at eBay? If yes, how?
Tom Fastner: We do leverage internal cloud functions for Hadoop; no cloud for Teradata.
Open source: committers for Hadoop and Joomla; strong commitment to improve those technologies.
Q7. Glenn Paulley of Sybase in a recent keynote asked the question of how much more complexity can database systems deal with? What is your take on this?
Tom Fastner: I am not familiar with Glenn Paulley’s presentation, so I can’t respond in that context. But in my role at eBay I like to take into account the full picture: how much more complexity can human developers and organizations deal with?
As we learn to deal with more complex data structures and algorithms using specialized solutions, we will look for ways to make them repeatable and available for a wider audience as they mature and demand grows in the market. Specialized solutions do come at a high price (TCO), requiring separate hardware and/or software stack, data replication, data management issues, computing specialists.
Q8. Anything else you wish to add?
Tom Fastner: eBay is rapidly changing, and analytics is driving many key initiatives like buyer experience, search optimization, buyer protection or mobile commerce. We are investing heavily in new technologies and approaches to leverage new data sources to drive innovation.
Tom Fastner is a Senior Member of Technical Staff, Architect with eBay. In this role Tom works on the architecture of the analytical platforms and related tools. Currently he spends most of his time driving
innovation to process Big Data, the Singularity© system. Before Tom joined eBay he was with NCR/Teradata for 11 years. He was involved in the Dual Active program from the beginning in 2003 as the technical lead of the first global implementation of Dual Active. Prior to NCR/Teradata Tom worked for debis Systemhaus with several
clients in the aerospace sector. He holds a Masters in Computer Science from the Technical University of Munich, Germany, and is a Teradata Certified Master.
“I want to ensure that the MySQL code base (under the name of MariaDB) will survive as open source, in spite of what Oracle may do.” -- Michael “Monty” Widenius.
Michael “Monty” Widenius is the main author of the original version of the open-source MySQL database and a founding member of the MySQL AB company. Since 2009, Monty has been working on a branch of the MySQL code base, called MariaDB.
I wanted to know what’s new with MariaDB. You can read the interview with Monty below.
Q1. MariaDB is an open-source database server that offers drop-in replacement functionality for MySQL. Why did you decide to develop a new MySQL database?
Monty: Two reasons:
1) I want to ensure that the MySQL code base (under the name of MariaDB) will survive as open source, in spite of what Oracle may do. As Oracle is now moving away from Open Source to Open Core (see my blog) it was in hindsight the right thing to do!
2) I want to ensure that the MySQL developers have a good home where they can continue to develop MySQL in an open source manner. This is important because if the MySQL ecosystem were to lose the original core developers, there would be no way for the product to survive. This has in hindsight also proven to be important, as Oracle has lost almost all of the original core developers; fortunately, most of them have joined the MariaDB project.
Q2. What is new in MariaDB 5.3.1 beta?
Monty: 5.1 and 5.2 were about fixing some outstanding issues and getting in patches for MySQL that had been available in the community for a long time but were never accepted into the MySQL code base for political reasons.
5.3 is where we have put most of our development efforts, especially in the optimizer area and replication. Replication is now an order of magnitude faster than before (when running with many concurrent updates) and many queries involving subqueries or big joins are now 2x – 100x faster. All the changes are listed in the MariaDB knowledge base.
Q3. How is MariaDB currently being used? Could you give examples of applications that use MariaDB?
Monty: We see a lot of old MySQL users switching to MariaDB. These are especially big sites with a lot of queries that need even more performance. We have some case studies in the knowledgebase.
What we do expect with 5.3 is that people who need more flexible SQL will also start using MariaDB, thanks to the new ‘NoSQL’ features we have added, like HandlerSocket and dynamic columns.
Dynamic columns allow you to have a different set of columns for every row in a table.
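MariaDB implements this server-side with functions such as COLUMN_CREATE and COLUMN_GET; the idea can be emulated client-side by packing each row’s dynamic columns into one serialized value. A rough sketch using SQLite and JSON (not MariaDB’s actual binary format):

```python
import json
import sqlite3

# Each row carries its own set of "dynamic" columns in a single blob
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, dyn TEXT)")
rows = [
    (1, {"color": "red", "size": "L"}),      # a shirt
    (2, {"author": "Monty", "pages": 300}),  # a book: different columns
]
con.executemany("INSERT INTO items VALUES (?, ?)",
                [(i, json.dumps(d)) for i, d in rows])

def column_get(row_id, name):
    """Rough client-side analogue of MariaDB's COLUMN_GET."""
    (blob,) = con.execute("SELECT dyn FROM items WHERE id=?",
                          (row_id,)).fetchone()
    return json.loads(blob).get(name)

assert column_get(1, "color") == "red"
assert column_get(2, "pages") == 300
assert column_get(1, "pages") is None  # column absent on this row
```

The server-side version has the advantage that dynamic columns can be indexed and filtered inside the database rather than in the application.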
Q4. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?
Monty: A lot of the new features, like batched key access, hash joins and subquery caching, are targeted at allowing the handling of larger data sets.
We have also started to continuously do benchmarks on data in the terabyte range to be able to improve MariaDB even more in the area. We hope to have some very interesting announcements in this area shortly.
Q5. How do you handle structured and unstructured data?
Monty: The dynamic columns feature is especially designed to do that. We now have one developer working on using the dynamic columns feature to implement a storage engine for HBase.
Q6. Who else is using MariaDB and why?
Monty: MariaDB has several hundred thousand downloads and is part of many Linux distributions, but as we don’t track users we don’t know who they are. (This was exactly the problem we had with MySQL in the early days.) That is why we have started to collect success stories about MariaDB installations, to spread awareness of who is using it.
The problem is that big companies usually never want to tell what they are using, which makes this a bit difficult.
The main reasons people are switching to MariaDB are:
1. Faster, more features and fewer bugs.
The fewer bugs come from the fact that we have fixed a lot of bugs that MySQL has, and keeps introducing, in each release.
2. It’s a drop-in replacement for MySQL, so it’s trivial to switch.
(All old connectors work unchanged and you don’t have to dump/reload your data.)
3. Lots of new critical features that people have wanted for years:
– Microsecond support
– Virtual columns.
– Faster sub queries, big data queries & replication
– Segmented key cache (speeds up MyISAM tables a LOT)
– Progress reporting for alter table.
– SphinxSE search engine.
– FederatedX storage engine (a better and supported federated engine)
See the full list here.
4. The upgrade process to a new release is easier than in MySQL; you don’t have to dump & restore your data to upgrade to a new version.
5. It’s guaranteed to be open source now and forever (no commercial extensions from Monty Program Ab).
6. We actively work with the community and add patches and storage engines they create to the MariaDB code base.
Q7. What’s next in MariaDB?
Monty: We are just now working on the last cleanups so that we can release MariaDB 5.5.
This will include everything in MariaDB 5.3 and MySQL 5.5, and also most of the closed-source features that are in MySQL 5.5 Enterprise.
In parallel we are working on 5.6. The plans for it can be found here.
The most important features in this are probably:
– True parallel replication (not only per database)
– Multi source slaves.
– A better MEMORY engine that can handle BLOB and VARCHAR gracefully (this is already in development).
We are also looking at introducing a new clustered storage engine that is optimized for the cloud.
What will be in 5.6 is still a bit up in the air; we are working with a lot of different companies and the community to define the new features. The final feature set will be decided upon at our next MariaDB developer meeting in Greece in November.
Q8. How can the open source community contribute to the project?
Monty: Anyone can contribute patches to MariaDB and if you are active and have proven that you are not going to break the code you can get commit access to the MariaDB tree.
The following link tells you how you can get involved.
Another way is of course to contract Monty Program Ab to implement features in MariaDB. This is a good option for those who have more money than time and want to get something done quickly.
Q9. Cloud computing: Does it play a role for MariaDB? If yes, how?
Monty: Yes, cloud computing is very important for us. MySQL & MariaDB are already among the most popular databases in the cloud; Rackspace, Amazon and Microsoft all provide MySQL instances.
The popularity of MySQL in the cloud is mostly thanks to the fact that MySQL is quite easy to configure in different setups, from using very little memory to using all resources on the box. This combined with replication has made MySQL/MariaDB the database of choice in the cloud.
The one thing MariaDB is missing to be an even better choice for the cloud is a good engine that allows one, at the flip of a switch, to add/drop nodes and dynamically change how many cloud entities one is using. We hope to have a solution for this very soon.
Michael (Monty) Widenius, MariaDB
State of MariaDB, Dynamic Column in MariaDB
Describes MariaDB 5.1, a branch of MySQL 5.1. It also introduces the new features in versions 5.2 (beta) and 5.3 (alpha). Copy of the presentation given at ICOODB Frankfurt 2010.
Presentation | Intermediate | English | DOWNLOAD (.PDF) | September 2010
“A key value is to provide strong data points that demonstrate and quantify how XML database processing can be done with very high performance.” — Agustin Gonzalez, Intel Corporation.
“We wanted to show that DB2′s shared-nothing architecture scales horizontally for XML warehousing just as it does for traditional relational warehousing workloads.” — Dr. Matthias Nicola, IBM Corporation.
TPoX stands for “Transaction Processing over XML” and is an XML database benchmark that Intel and IBM developed several years ago and then released as open source.
A couple of months ago, the project published some new results.
To learn more about this I have interviewed the main leaders of the TPoX project, Dr. Matthias Nicola, Senior engineer for DB2 at IBM Corporation and Agustin Gonzalez, Senior Staff Software Engineer at Intel Corporation.
Q1. What is exactly TPoX?
Matthias: TPoX is an XML database benchmark that focuses on XML transaction processing. TPoX simulates a simple financial application that issues XQuery or SQL/XML transactions to stress the XML storage, XML indexing, XML Schema support, XML updates, logging, concurrency and other components of an XML database system. The TPoX package comes with an XML data generator, an extensible Workload Driver, three XML Schemas that define the XML structures, and a set of predefined transactions. TPoX is free, open source, and available at http://tpox.sourceforge.net/ where detailed information can be found. Although TPoX comes with a predefined workload, it’s very easy to change this workload to adjust the benchmark to whatever your goals might be. The TPoX Workload driver is very flexible; it can even run plain old SQL against a relational database and simulate hundreds of concurrent database users. So, when you ask “What is TPoX”, the complete answer is that it is an XML database benchmark but also a very flexible and extensible framework for database performance testing in general.
Q2. When did you start with this project? What was the original motivation for TPoX? What is the motivation now?
Matthias: We started with this project approximately in 2003/2004. At that time we were working on the native XML support in DB2 that was later released in DB2 version 9.1 in 2006. We needed an XML workload -a benchmark- that was representative of an important class of real-world XML applications and that would stress all critical parts of a database system.
We needed a tool to put a heavy load on the new XML database functionality that we were developing. Some XML benchmarks had been proposed by the research community, such as XMark, MBench, XMach-1, XBench, X007, and a few others. They were all useful in their respective scopes, such as evaluating XQuery processors, but we felt that none of them truly aimed at evaluating a database system in its entirety. We found that they did not represent all relevant characteristics of real-world XML applications.
For example, many of them only defined a read-only and single-user workload on a single XML document. However, real applications typically have many concurrent users, a mix of read and write operations, and millions or even billions of XML documents.
That’s what we wanted to capture in the TPoX benchmark.
Agustin: And the motivation today is the same as when TPoX became freely available as open source: database and hardware vendors, database researchers, and even database practitioners in the IT departments of large corporations need a tool to evaluate system performance, compare products, or compare different design and configuration options.
At Intel, the main motivation behind TPoX is to benchmark and improve our platforms for the increasingly relevant intersection of XML and databases. So far, the joint results with IBM have exceeded our expectations.
Q3. TPoX is an application-level benchmark. What does it mean? Why did you choose to develop an application-level benchmark?
Matthias: We typically distinguish between micro-benchmarks and application-level benchmarks, both of which are very useful but have different goals. A micro-benchmark typically defines a range of tests such that each test exercises a narrow and well-defined piece of functionality. For example, if your focus is an XQuery processor you can define tests to evaluate XPath with parent steps, other tests to evaluate XPath with descendant-or-self axis, other tests to evaluate XQuery “let” clauses, and so on.
This is very useful for micro-optimization of important features and functions. In contrast, an application-level benchmark tries to evaluate the end-to-end performance of a realistic application scenario and to exercise the performance of a complete system as a whole, instead of just parts of it.
Agustin: As an application-level benchmark, TPoX has proven much more useful and believable than “synthetic” micro-benchmarks. As a result, TPoX can even be used to predict how similar real-world applications will perform, or where they will encounter a bottleneck. You cannot make such predictions with a micro-benchmark. Another important feature is that TPoX is very scalable – you can run TPoX on a laptop but also scale it up and run it on large enterprise-grade servers, such as multi-processor Intel Xeon platforms.
Q4. How do you exactly evaluate the performance of XML databases?
Agustin: Well, one way is to use TPoX on a given platform and then compare to existing results on different combinations of hardware and software. I know that this is a simplistic answer but we really learn a lot from this approach. Keeping a precise history of the test configurations and the results obtained is always critical.
Matthias: This is actually a very broad question! We use a wide range of approaches. We use micro-benchmarks, we use application-level benchmarks such as TPoX, we use real-world workloads that we get from some of our DB2 customers, and we continuously develop new performance tests. When we use TPoX, we often choose a certain database and hardware configuration that we want to test and then we gradually “turn up the heat”. For example, we perform repeated TPoX benchmark runs and increase the number of concurrent users until we hit a bottleneck, either in the hardware or the software. Then we analyze the bottleneck, try to fix it, and repeat the process. The goal is to always push the available hardware and software to the limit, in order to continuously improve both.
Q5. What is the difference of TPoX with respect to classical database benchmarks such as TPC-C and TPC-H?
Matthias: One of the obvious differences is that TPC-C and TPC-H focus on very traditional and mature relational database scenarios. In contrast, TPoX aims at the comparatively young field of XML in databases. Another difference is that the TPC benchmarks have been standardized and “approved” by the TPC committee, while TPoX was developed by Intel and IBM, and extended by various students and Universities as an open source project.
Agustin: But, TPoX also has some important commonalities with the TPC benchmarks. TPC-C, TPC-H, and TPoX are all application-level benchmarks. Also, TPC-C, TPC-H, and TPoX have each chosen to focus on a specific type of database workload. This is important because no benchmark can (or should try to) exercise all possible types of workloads. TPC-C is a relational transaction processing benchmark, TPC-H is a relational decision support benchmark, and TPoX is an XML transaction processing benchmark. Some people have called TPoX the “XML-equivalent of TPC-C”. Another similarity between TPC-C, TPC-E, and TPoX is that all three are throughput oriented “steady state benchmarks”, which makes it straightforward to communicate results and perform comparisons.
Q6. Do you evaluate both XML-enabled and Native XML databases? Which XML Databases did you evaluate?
Matthias: TPoX can be used to evaluate pretty much any database that offers XML support. The TPoX workload driver is architected such that only a thin layer (a single Java class) deals with the specific interaction with the database system under test. Personally, I have used TPoX only on DB2. I know that other companies as well as students at various universities have also run TPoX against other well-known database systems.
Q7. How did you define the TPoX Application Scenario? How did you ensure that the TPoX Application Scenario you defined is representative of a broader class of applications?
Matthias: Over the years we have been working with a broad range of companies that have XML applications and require XML database support. Many of them are in the financial sector. We have worked closely with them to understand their XML processing needs. We have examined their XML documents, their XML Schemas, their XML operations, their data volumes, their transaction rates, and so on. All of that experience has flown into the design of TPoX. One very basic but very critical observation is that there are practically no real-world XML applications that use only a single large XML document. Instead, the majority of XML applications use very large numbers of small documents.
Agustin: TPoX is also very realistic because it uses a real-world XML Schema called FIXML, which standardizes trade-related messages in the financial industry. It is a very complex schema that defines thousands of optional elements and attributes and allows for immense document variability. It is extremely hard to map the FIXML schema to a traditional normalized relational schema. In the past, many XML processing systems were not able to handle the FIXML schema. But, since this type of XML is used in real-world applications, it is a great fit for a benchmark.
Q8. How did you define the workload?
Matthias: Again, by experience with real XML transaction processing applications.
Q9. In your documentation you write that TPoX uses a “stateless” workload? What does it mean in practice? Why did you make this choice?
Matthias: It means that every transaction is submitted to the database independently from any previous transactions. As a result, the TPoX workload driver doesn’t need to remember anything about previous transactions. This makes it easier to design and implement a benchmark that scales to billions of XML documents and hundreds of millions of transactions in a single benchmark run.
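The stateless idea above can be sketched in a few lines of Java: each worker thread draws fresh random parameters for its transaction and submits it with no memory of earlier transactions. The class, table, and column names below are illustrative stand-ins, not the actual TPoX driver code.

```java
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Minimal sketch of a "stateless" workload driver (illustrative, not the
// real TPoX code): every transaction is parameterized from a random
// distribution and depends on nothing that happened before it.
public class StatelessDriver {

    // Stand-in for building/submitting one transaction to the database
    // under test; only a fresh random parameter is needed, no history.
    static String runTransaction(Random rng) {
        int custId = 1 + rng.nextInt(1_000_000); // random transaction parameter
        return "SELECT doc FROM CUSTACC WHERE id = " + custId;
    }

    public static void main(String[] args) throws Exception {
        // Because transactions share no state, they can be fired from many
        // threads concurrently without any coordination between them.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> System.out.println(runTransaction(ThreadLocalRandom.current())));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

Because no transaction reads or writes driver-side state, adding more threads (or more driver machines) scales the submitted load without any synchronization, which is what makes runs of hundreds of millions of transactions practical.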
Q10. Why not define a workload also for complex analytical queries?
Matthias: We did! And we ran it on a 10TB XML data warehouse with more than 5.5 Billion XML documents.
That was a very exciting project and you can find more details on my blog.
Although the initial wave of XML database adoption was more focused on transactional and operational systems, companies soon realized that they were accumulating very large volumes of XML documents that contained a goldmine of information. Hence, the need for XML warehousing and complex analytical XML queries was pressing. We wanted to show that DB2's shared-nothing architecture scales horizontally for XML warehousing just as it does for traditional relational warehousing workloads.
Agustin: Admittedly, we have not yet formally included this workload of complex XML queries into the TPoX benchmark. Just like TPC-C and TPC-H are separate for transaction processing vs. decision support, we would also need to define two flavors of TPoX, even if the underlying XML data remains the same. A TPoX workload with complex queries is definitely very meaningful and desirable.
Q11. What are the main new results you obtained so far? What are the main values of the results obtained so far?
Agustin: We have produced many results using TPoX over the years, with ever larger numbers of transactions per second and continuous scalability of the benchmark on increasingly larger platforms. A key value is to provide strong data points that demonstrate and quantify how XML database processing can be done with very high performance. In particular, the first public 1TB XML benchmark that we did a few years ago has helped establish the notion that efficient XML transaction processing is a reality today. Such results give the hardware and the software a lot of credibility in the industry. And of course we learn a lot with every benchmark, which allows us to continuously improve our products.
Q12. You write in your Blog “For 5 years now Intel has a strong history of testing and showcasing many of their latest processors with the Transaction Processing over XML (TPoX) benchmark.” Why has Intel been using the TPoX benchmark? What results did they obtain?
Matthias: I'll let Agustin answer this one.
Agustin: Intel uses the TPoX benchmark because it helps us demonstrate the power of Intel platforms and generate insights on how to improve them. TPoX also enables us to work with IBM to improve the performance of DB2 on Intel platforms, which is good for both IBM and Intel. This collaboration of Intel and IBM around TPoX is an example of an extensive effort at Intel to make sure that enterprise software has excellent performance on Intel. You can see our most important results on the TPoX web page.
Q13. Can you use TPoX to evaluate other kinds of databases (e.g. Relational, NoSQL, Object Oriented, Cloud stores)? How does TPoX compare with the Yahoo! YCSB benchmark for Cloud Serving Systems?
Matthias: Yes, the TPoX workload driver can be used to run traditional SQL workloads against relational databases. Assuming you have a populated relational database, you can define a SQL workload and use the TPoX driver to parameterize, execute, and measure it. TPoX and YCSB have been designed for different systems under test. However, parts of the TPoX framework can be reused to quickly develop other types of benchmarks, especially since TPoX offers various extension points.
Agustin: Some open source relational databases have started to offer at least partial support for the SQL/XML functions and the XML data type. Given the level of parameterization and the extensible nature of the TPoX workload driver, it would be very easy to develop custom workloads for the emerging support of the XML data type on open source databases. At the same time, the powerful XML document generator included in the kit can be used to generate the required data. Using TPoX to test the performance of XML in open source databases is an intriguing possibility.
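As a rough illustration of what such a custom transaction might look like, the sketch below assembles a standard SQL/XML query (using the XMLEXISTS predicate from the SQL/XML standard) that a driver could hand to a database over JDBC. The table, column, and element names are invented for the example and are not the TPoX schema.

```java
// Illustrative sketch only: builds a SQL/XML query string of the kind a
// custom workload could submit. XMLEXISTS is a standard SQL/XML predicate;
// the table (custacc), XML column (cadoc), and XPath are assumptions.
public class SqlXmlExample {

    // Returns a query selecting XML documents whose Customer element
    // has the given Nationality value.
    static String customerQuery(String nationality) {
        return "SELECT cadoc FROM custacc " +
               "WHERE XMLEXISTS('$d/Customer[Nationality=\"" + nationality + "\"]' " +
               "PASSING cadoc AS \"d\")";
    }

    public static void main(String[] args) {
        // In a real workload this string would be executed via JDBC
        // against the database under test.
        System.out.println(customerQuery("USA"));
    }
}
```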
Q14. Is it possible to extend TPoX? If yes, how?
Matthias: Yes, TPoX can be extended in several ways. First, you can change the TPoX workload in any way you want. You can modify, add, or remove transactions from the workload, you can change their relative weight, and you can change the random value distributions that are used for the transaction parameters. We have used the TPoX workload driver to run many different XML workloads, including on XML data other than the TPoX documents. We have also used the workload driver for relational SQL performance tests, simply because it is so easy to set up concurrent workloads.
Second, the database specific interface of the TPoX workload driver is encapsulated in a single Java class, so it is relatively easy to port the driver to another database system. And third, the new version TPoX 2.1 allows transactions to be coded not only in SQL, SQL/XML, and XQuery, but also in Java. TPoX 2.1 supports “Java-Plugin transactions” that allow you to implement whatever activities you want to run and measure in a concurrent manner. For example, you can run transactions that call a web service, send or receive data from a message queue, access a content management system, or perform any other operations – only limited by what you can code in Java!
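To give a flavor of what a Java-Plugin transaction involves, here is a minimal sketch. The interface name and method signature are hypothetical assumptions for illustration, not the actual TPoX 2.1 plugin API; the point is simply that the driver invokes arbitrary Java code as one measured transaction.

```java
// Hypothetical sketch of a "Java-Plugin transaction" (interface and names
// are assumptions, NOT the real TPoX 2.1 API): the driver calls execute()
// once per transaction and measures its elapsed time.
public class PluginSketch {

    // Assumed plugin contract.
    interface JavaPluginTransaction {
        void execute() throws Exception;
    }

    // Example plugin: any Java activity can be timed this way, e.g. a
    // web-service call or a message-queue operation. Here a short sleep
    // stands in for the external work.
    static class SleepTransaction implements JavaPluginTransaction {
        public void execute() throws Exception {
            Thread.sleep(5);
        }
    }

    // Minimal driver-side loop body: run one plugin transaction and
    // return the elapsed time in nanoseconds.
    static long timeOne(JavaPluginTransaction txn) throws Exception {
        long start = System.nanoTime();
        txn.execute();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("elapsed ns: " + timeOne(new SleepTransaction()));
    }
}
```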
Agustin: At Intel we have been using TPoX internally for various other projects. Since the TPoX workload driver is open source, it is straightforward to modify it to support other types of workloads, not necessarily steady state, which makes it amenable to testing other aspects of computer systems such as power management, storage, and so on.
Q15 What are the current limitations of TPoX?
Matthias: Out of the box, the TPoX workload driver only works with databases that offer a JDBC interface. If a particular database system has specific requirements for its API or query syntax, then some modifications may be necessary. Some database systems might require their own JDBC driver to be compiled into the workload driver.
Q16. Who else is using TPoX?
Matthias: You can see some examples of other TPoX usage on the TPoX web site. We know that other database vendors are using TPoX internally, even if they haven't decided to publish results yet. I also know a company in the data security space that uses TPoX to evaluate the performance of different data encryption algorithms. And TPoX also continues to be used at various universities in Europe, the US, and Asia for a variety of research and student projects. For example, the University of Kaiserslautern in Germany has used TPoX to evaluate the benefit of solid-state disks for XML databases. Other universities have used TPoX to evaluate and compare the performance of several XML-only databases.
Q17. TPoX is an open source project. How can the community contribute?
Matthias: A good starting point is to use TPoX. From there, contributing to the TPoX project is easy. For example, you can report problems and bugs, or you can submit new feature requests. Or even better, you can implement bug fixes and enhancements yourself and submit them to the SVN code repository on sourceforge.net.
If you design other workloads for the TPoX data set, you can upload them to the TPoX project site and have your results posted on the TPoX web site.
Agustin: As is customary for an open source project on sourceforge, anybody can download all TPoX files and source code freely.
If you want to upload any changed or new files or modify the TPoX web page, you only need to become a member of the TPoX sourceforge project, which is quick and easy.
Everybody is welcome, without exceptions.
XML Database Benchmark: “Transaction Processing over XML (TPoX)”:
TPoX is an application-level XML database benchmark based on a financial application scenario. It is used to evaluate the performance of XML database systems, focusing on XQuery, SQL/XML, XML Storage, XML Indexing, XML Schema support, XML updates, logging, concurrency and other database aspects.
Download TPoX (LINK), July 2009. | TPoX Results (LINK), April 2011.
“Taming a Terabyte of XML Data”.
Agustin Gonzalez, Matthias Nicola, IBM Silicon Valley Lab.
Paper | Advanced | English | LINK DOWNLOAD (PDF) | 2009
“An XML Transaction Processing Benchmark”.
Matthias Nicola, Irina Kogan, Berni Schiefer, IBM Silicon Valley Lab.
Paper | Advanced | English | LINK DOWNLOAD (PDF) | 2007
“A Performance Comparison of DB2 9 pureXML with CLOB and Shredded XML Storage”.
Matthias Nicola et al., IBM Silicon Valley Lab.
Paper | Advanced | English | LINK DOWNLOAD (PDF) | 2006