
"Trends and Information on AI, Big Data, Data Science, New Data Management Technologies, and Innovation."

This is the Industry Watch blog.

Nov 30 11

On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software

by Roberto V. Zicari

“In recent years, our main development focus in the data management area has been on technologies that help customers turn insight into action.” — Kristof Kloeckner, IBM Corporation.

I wanted to learn about IBM’s current strategy in data management. I have interviewed Kristof Kloeckner, General Manager of IBM Rational Software, IBM Corporation.

RVZ

Q1. In your opinion, what are the most important technologies being developed and deployed in the last years that are affecting data management?

Kloeckner: In recent years, our main development focus in the data management area has been on technologies that help customers turn insight into action. Three fundamental shifts are being experienced across industries where these technology advances have been critical: i) information is exploding, ii) business change is outpacing the collective ability to keep up, and iii) the performance gap between leaders and laggards is widening. Successful organizations are taking a structured approach to turning insight into action through business analytics and optimization. Specifically, innovative technologies that have helped businesses optimize to address these shifts include:

1. Competing on speed: enabling organizations to speed up decision making and processes to act in a time frame that matters, to retain and attract clients and grow revenues.

2. Exceeding customer and employee expectations: consumer IT innovation has created high employee expectations of IT: ease of use, collaboration, and mobile access.

3. Exploiting information from everywhere: access to data wherever it resides, in any variety, at any velocity, and at any volume. Advancements in Big Data technology include new compression capabilities that transparently compress data on disk to reduce disk space and storage infrastructure requirements.

4. Realizing new possibilities from analytics: developments in analytics processing that combine and analyze information through multiple analytics services, including content analytics and predictive analytics, are transforming healthcare.

5. Governing information: addressing the pressure of regulatory compliance. Technology advancements with Hadoop allow multiple users to access unstructured data for governance purposes.
• Content-development workflow advancements for enterprise business glossaries enable a governed approach to building, publishing, and managing enterprise vocabularies.

6. OLTP scale-out and availability.

Q2. In a recent keynote you introduced the concept of “Software-driven innovation”. What is it?

Kloeckner: Innovation is increasingly achieved and delivered via software. The need for innovation is key for companies trying to gain competitive advantage in today’s business environment.
In a recent IBM survey of more than 1,500 CEOs, nearly 50% of respondents called out the increased need for product and service innovation as a top concern.

Software connects people, products and information. The more interconnected, instrumented, and intelligent a product or service is, the greater its potential value is to its users. Challenges to achieving software-driven innovation derive from the complexity of interconnected systems, but also from the complexity of software delivery due to the increasingly global distribution of development teams, new models of software supply chains including outsourcing and crowdsourcing, and the reality of constantly shifting market requirements. Effective software-driven innovation requires application and systems lifecycle management with a focus on integration of processes, data and tools, collaboration across roles and organization and optimization of business outcomes through measurements and simplified governance.

Q3. You defined three elements for realizing “software-driven innovation”: Integration, Collaboration, and Optimization. Could you please elaborate on this? In particular, how is the software design and delivery lifecycle managed?

Kloeckner: Collaborative Lifecycle Management integrates tools and processes end-to-end, from requirements management to design and construction to test and verification, and ultimately, deployment.
It enables traceability of artifacts across the lifecycle, real-time planning, in-context collaboration, development analytics or ‘intelligence’ and continuous improvement. It eliminates manual hand-offs, and can lead to increased project velocity, adaptability to change and reduced project risk.

Collaborative design management brings cross-team collaboration on software and systems design and creates deep integrations across the lifecycle. Effective modeling, design and architecture are critical to developing smarter products and services.

Collaborative DevOps bridges the gap between development and deployment by sharing information between the teams, improving deployment planning and automating many deployment steps. This leads to faster delivery of solutions into production.

Collaborative Lifecycle Management, Design Management and DevOps can considerably increase business agility by accelerating software and systems delivery. They can be augmented by capabilities that further increase alignment of development with business goals, for instance through application portfolio management.

Q4. IBM recently announced a new application development collaboration software built on its Jazz development platform. What is special about Jazz?

Kloeckner: Jazz is IBM’s initiative for improving collaboration across the software and systems lifecycle. Inspired by the artists who transformed musical expression, it aims to transform software and systems delivery by making it more collaborative, productive and transparent, through integration of information and tasks across the phases of the lifecycle. The Jazz initiative consists of three elements: Platform, Products and Community.

Jazz is built on architectural principles that represent a key departure from approaches taken in the past.
Unlike monolithic, closed platforms of the past, Jazz has an innovative approach to integration based on open, flexible services and Internet architecture (linked data). It is complemented by Open Services for Lifecycle Collaboration (OSLC), an industry initiative of more than 30 companies to define the interfaces for tools integration based on the principles of linked data.

Jazz is supported by a community web site (Jazz.net), as well as a cloud environment supporting university projects (JazzHub).

Q5. What about other approaches such as social networks à la Facebook? Aren’t they also a sort of collaboration platform?

Kloeckner: Without a doubt, software delivery is a team sport, and the process of delivery becomes an instance of ‘social business’. Collaboration is vital for organizational cohesion on a global scale, but also for the sharing of best practices and artifacts. Increasingly, developers expect to use mechanisms for interacting and sharing similar to those they use in their personal lives.
At IBM we combine development tools such as Rational Team Concert and Rational Asset Manager with our social business platform, IBM Connections, to provide an infrastructure for code reuse supporting more than 30,000 members in more than 1,700 projects. It has a governance structure derived from open source, which we call Community Source. The platform uses wikis, forums, blogs and other feedback and communication mechanisms.

Q6. What is IBM’s strategy in data management? DB2, Apache Hadoop, CloudBurst, SmartCloud Enterprise, to name a few. How do they relate to each other, if at all?

Kloeckner: Our strategy is to enable the placement of data and the workloads that use it in cost-effective patterns that encompass public and private clouds as well as traditional deployment topologies.

IBM has provided enablement on both public and private clouds, giving evolving businesses easy access to IBM’s rich enterprise data management portfolio. Our focus is on deeper cloud capabilities, such as the Platform-as-a-Service and Database-as-a-Service features in the IBM Workload Deployer (the evolution of the WebSphere CloudBurst Appliance). The adoption of these technologies leads to the need to integrate with existing IT infrastructures as well as with the evolving world of NoSQL.

This deeper cloud capability additionally focuses on exposing analytical capability and insight into raw business data that is still difficult to model or understand, which leads to the Big Data topic. The run-time infrastructure used for this deeper insight is built on Apache Hadoop and is the basis for IBM’s Big Data product, available as either a cloud offering or a product offering. Big Data allows businesses to expand the range of information they can analyze. For this data to be useful, however, it often needs to tie into the more traditional Information Management world: existing operational and warehouse systems that provide the context needed both to support the broader range of analytics that Big Data enables and to operationalize the results.

Q7. What are the main business and technical issues related to data management when using the Cloud?

Kloeckner: There are four main issues, each of which relates to the data itself: data movement, data security, data ownership and stewardship, and the analysis of enormous datasets. The challenge with data movement manifests itself primarily in two ways: availability and the obvious time that it takes to transfer data between sites. The relation to availability is not immediately obvious, but high-speed data movement is used for replication-based HA models as well as for recovery scenarios that need to pull data from a remote site. The higher the bandwidth and the lower the latency, the easier it is to create a highly available data management offering on the cloud.

Data security is where most of the concerns surface, and many enterprise companies have very sensitive data. Questions such as “Who is the group of people working behind the cloud facade, and can I trust them?” are probably the most common. Fortunately, IBM has been handling sensitive enterprise data for most of the last century. Having a reputable company to partner with to either host or protect your data in the cloud is absolutely key, and it is also why many companies are reaching out to IBM as a cloud provider. Dealing with sensitive data is business as usual for us.

Related to security is the issue of ownership and stewardship of such data in the cloud. If data is believed to be out of the control of its stakeholders, and you cannot demonstrate otherwise, they may begin to lose confidence in the trustworthiness of that information. This leads to the need for governance of such data in the cloud, tied closely to its security.

Lastly, what do you do with petabytes of structured and unstructured data? Cloud and Hadoop sometimes get interchanged in some circles, but IBM’s direction on Hadoop is illustrated above with BigInsights, which includes the BigData and Streams products. Cloud is just as critical but serves a different role. One area where Information Management is focused around clouds, as an example, is making it simpler for business users to build sandboxes and smaller systems in the cloud and to integrate their data capture and governance with the technical IT groups that manage the business data. We need to make it simple for business users, but allow IT some control over what data goes into the cloud and how long it lives.

Q8. Private cloud vs. public cloud? What is the experience of IBM on this?

Kloeckner: Clients have choices for deploying enterprise applications, and the key factors they take into consideration are not just cost or convenience, but also security, reliability, scalability, control and tooling, and a trusted vendor relationship. We find that for enterprise-critical applications, the majority of customers favor private (including privately managed) clouds. We also find increasing adoption of IBM SmartCloud Enterprise.

IBM has a proven approach for building and managing cloud solutions, providing an integrated platform that uses the same standards and processes across the entire portfolio of products and services. IBM’s expertise and experience in designing, building and implementing cloud solutions—beginning with its own—offers clients the confidence of knowing that they are engaging not just a provider, but a trusted partner in the transformation of their IT service delivery. The IBM Cloud Computing reference architecture builds on IBM’s industry-leading experience and success in implementing SOA solutions. We have also invested in technologies that integrate cloud environments and allow clients to combine services deployed in their private cloud environments with services delivered through public clouds. We believe that such hybrid environments are becoming increasingly common.

Q9. In your opinion, will we ever have a “Trusted” Cloud?

Kloeckner: Concerns about security are the most prominent reasons that organizations cite for not adopting public cloud services. Therefore, creating more comprehensive security capabilities is a prerequisite for getting organizations to adopt public cloud-based services for more complex, business-sensitive, and demanding purposes. A recent Forrester Research report fully expects the emergence of highly secure and trusted cloud services over the next five years, during which time cloud security will grow into a $1.5 billion market and shift from being an inhibitor to an enabler of cloud services adoption. The advent of secure cloud services will be a disruptive force in the security solutions market, challenging traditional security solution providers to revamp their architectures, partner ecosystems, and service offerings, while creating an opportunity for emerging vendors, integrators, and consultants to establish themselves.

Q10. Where do you see the data management industry heading in the next years?

Kloeckner: Across industries, technology providers and technology consumers are moving to address the exponential growth in quantity, complexity, variety, and velocity of data to enable better business insights and drive informed actions. Add to this the immediacy of the web and data in motion, the new control and influence of the individual consumer, and the fact that most companies have yet to fully exploit the raw data they have in house within the context of structured models, and you can easily discern the driving forces behind strategic moves, acquisitions and recent product announcements. The potential for business risk continues to increase, as do requirements for regulatory compliance. By 2015, information and analytics software spending is estimated to reach $74 billion as a result.

The consumability of core technologies such as High Performance Computing and embarrassingly parallel workloads, real-time analysis of streaming data and process flows, ease of access to analytics as services via the cloud and embedded within business processes, as well as advances in natural language interfaces, will take analytics to the mainstream and to Main Street. IBM’s Watson is just a hint of what is to come.

Solution delivery trends in mobile, cloud, and software-as-a-service are driving information management technology investments to meet these demands. Data access trends are evolving beyond traditional business applications and existing OLTP environments, extending their reach into social media and content, for example, and driving technology investments in processing Big Data as well as in content and text analytics.

As systems processing power, in-memory computing capabilities, extreme low-energy servers, and storage technologies such as solid-state drives continue to advance, data management technologies advance in parallel, leveraging them to enable optimized workloads and consolidated platforms for cloud environments.

Both companies and consumers look for better, faster, easier access to a single best view of the truth and the ability to predict the next best action. Whether through process optimization within a business, risk and fraud management, dramatically enhanced personalization of each customer experience, or improving the speed and accuracy of diagnosis based on the full compendium of the latest research, historical data and best practices, fields from health care, financial services, customer service and precision marketing to auto repair will be affected.

_________________

Kristof Kloeckner, Ph.D., is General Manager of IBM Rational Software, IBM Corporation.
Dr. Kloeckner is responsible for all facets of the Rational Software business including strategy, marketing, sales, operations, technology development and overall financial performance. He and his global team are focused on providing the premier development and delivery solutions for transforming business through software-driven innovation. Dr. Kloeckner is based in Somers, New York.

Resources

ODBMS.org: Free Downloads and Links on various data management technologies and analytics.

Related Posts

On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica.

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com

Analytics at eBay. An interview with Tom Fastner.

##

Nov 16 11

On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica.

by Roberto V. Zicari

“It is expected there will be 6 billion mobile phones by the end of 2011, and there are currently over 300 Twitter accounts and 500K Facebook status updates created every minute. And, there is now a $2 billion a year market for virtual goods!” — Shilpa Lawande

I wanted to know more about the Vertica Analytics Platform for Big Data. I have interviewed Shilpa Lawande, VP of Engineering at Vertica. Vertica was acquired by HP earlier this year.

RVZ

Q1. What are the main technical challenges for big data analytics?

Shilpa Lawande: Big data problems have several characteristics that make them technically challenging. First is the volume of data, especially machine-generated data, and how fast that data is growing every year, with new sources of data that are emerging. It is expected there will be 6 billion mobile phones by the end of 2011, and there are currently over 300 Twitter accounts and 500K Facebook status updates created every minute. And, there is now a $2 billion a year market for virtual goods!

A lot of insights are contained in unstructured or semi-structured data from these types of applications, and the problem is analyzing this data at scale. Equally challenging is the problem of ‘how to analyze.’ It can take significant exploration to find the right model for analysis, and the ability to iterate very quickly and “fail fast” through many (possibly throwaway) models – at scale – is critical.

Second, as businesses get more value out of analytics, it creates a success problem – they want the data available faster, or in other words, want real-time analytics. And they want more people to have access to it, or in other words, high user volumes.

One of Vertica’s early customers is a Telco that started using Vertica as a ‘data mart’ because they couldn’t get resources from their enterprise data warehouse. Today, they have over a petabyte of data in Vertica, several orders of magnitude bigger than their enterprise data warehouse.

Techniques like social graph analysis – for instance, leveraging the influencers in a social network to create a better user experience – are hard problems to solve at scale. All of these problems combined create a perfect storm of challenges and opportunities to create faster, cheaper and better solutions for big data analytics than traditional approaches can deliver.

Q2. How does Vertica help solve these challenges?

Shilpa Lawande: Vertica was designed from the ground up for analytics. We did not try to retrofit 30-year old RDBMS technology to build the Vertica Analytics Platform. Instead, Vertica built a true columnar database engine including sorted columnar storage, a query optimizer and an execution engine.

With sorted columnar storage, there are two methods that drastically reduce the I/O bandwidth requirements for such big data analytics workloads. The first is that Vertica only reads the columns that queries need.
Second, Vertica compresses the data significantly better than anyone else.
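A toy sketch can illustrate why the combination of column pruning and sorted storage pays off: a query touching one column reads only that column's bytes, and sorting creates long runs of repeated values that compress extremely well (run-length encoding is one of several encodings used in column stores). This is a simplified illustration, not Vertica's actual storage format:

```python
# Toy illustration of column pruning + run-length encoding (RLE) on
# sorted columnar data. Not Vertica's actual on-disk format.

def rle_encode(values):
    """Run-length encode a list into [(value, count), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

# A table of 1,000 rows with three columns; a query that filters only
# on `region` never needs to read the other two columns at all.
rows = [("2011-11-01", "EMEA" if i % 2 else "AMER", i) for i in range(1000)]

# Sorted columnar storage turns 1,000 values into long runs...
region_column = sorted(r[1] for r in rows)
encoded = rle_encode(region_column)

print(encoded)       # [('AMER', 500), ('EMEA', 500)]
print(len(encoded))  # 2 runs instead of 1,000 stored values
```

Because the column is sorted, a thousand values collapse into two runs; on real data with many distinct values the ratio is smaller, but the same principle drives the large compression ratios mentioned below.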

Vertica’s execution engine is optimized for modern multi-core processors and we ensure that data stays compressed as much as possible through the query execution, thereby reducing the CPU cycles to process the query. Additionally, we have a scale-out MPP architecture, which means you can add more nodes to Vertica.
All of these elements are extremely critical to handle the data volume challenge.

With Vertica, customers can load several terabytes of data per hour and query their data within minutes of it being loaded – that is real-time analytics on big data for you.
There is a myth that columnar databases are slow to load. This may have been true with older generation column stores, but in Vertica, we have a hybrid in-memory/disk load architecture that rapidly ingests incoming data into a write-optimized row store and then converts that to read-optimized sorted columnar storage in the background. This is entirely transparent to the user because queries can access data in both locations seamlessly. We have a very lightweight transaction implementation with snapshot isolation, so queries can always run without any locks.
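The hybrid load path described above can be sketched in a few lines: new rows land in an in-memory, write-optimized buffer and are periodically sorted and merged into read-optimized columnar storage, while queries scan both so freshly loaded data is visible immediately. This is a toy model under simplified assumptions (single node, integer keys, synchronous flush), not Vertica's implementation:

```python
# Toy model of a hybrid load architecture: a write-optimized buffer for
# fast ingest, a sorted read-optimized store for fast scans, and queries
# that see both seamlessly. Not Vertica's actual implementation.

class HybridStore:
    def __init__(self, flush_threshold=4):
        self.write_buffer = []    # row-oriented, unsorted, in-memory
        self.columnar = []        # sorted, read-optimized storage
        self.flush_threshold = flush_threshold

    def load(self, row):
        self.write_buffer.append(row)
        if len(self.write_buffer) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Background conversion: sort the buffer and merge it into
        # the read-optimized store, then clear the buffer.
        self.columnar = sorted(self.columnar + self.write_buffer)
        self.write_buffer = []

    def scan(self):
        # Queries read both stores, so unflushed rows are visible too.
        return sorted(self.columnar + self.write_buffer)

store = HybridStore()
for key in [5, 3, 9, 1, 7]:
    store.load(key)

print(store.scan())  # [1, 3, 5, 7, 9] — includes the unflushed row 7
```

The design choice the sketch captures is that loads never block on the expensive sort-and-merge step, and queries never wait for it either.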
And we have no auxiliary data structures, like indexes or materialized views, which need to be maintained post-load.

Last, but not least, we designed the system for “always on,” with built-in high availability features.
Operations that translate into downtime in traditional databases are online in Vertica, including adding or upgrading nodes, adding or modifying database objects, etc.

With Vertica, we’ve removed many of the barriers to monetizing big data and hope to continue to do so.

Q3. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?

Shilpa Lawande: The short answer to the performance and scale question is that we make the most efficient use of all the resources, and we parallelize at every opportunity. When we compress and encode the data, we sometimes see 80:1 compression (depending on the data). Vertica’s fully peer-to-peer MPP architecture allows you to issue loads and queries from any node and run multiple load streams. Within each node, operations make full use of the multi-core processors to parallelize their work. A great measure of our raw performance is that we have several OEM customers who run Vertica on a single 1U node, embedded within applications like security and event log management, with very low data latency requirements.

Scalability has three aspects – data volume, hardware size, and concurrency. Vertica’s performance scales linearly (and often super linearly due to compression and other factors) when you double the data volume or run the same data volume on twice the number of nodes. We have customers who have grown their databases from scratch to over a petabyte, with clusters from tens to hundreds of nodes. As far as concurrency goes, running queries 50-200x faster ensures that we can get a lot more queries done in a unit of time. To efficiently handle a highly concurrent mix of short and long queries, we have built-in workload management that controls how resources are allocated to different classes of queries. Some of our customers run with thousands of concurrent users running sub-second queries.

Q4. What is new in Vertica 5.0?

Shilpa Lawande: In Vertica 5.0, we focused on extensibility and elasticity of the platform. Vertica 5.0 introduced a Software Development Kit (SDK) that allows users to write custom user-defined functions so they can do more with the Vertica platform, such as analyze Apache logs or build custom statistical extensions. We also added several more built-in analytic extensions, including event-series joins, event-series pattern matching, and statistical and geospatial functions.
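To give a feel for what an event-series join does, the sketch below aligns two time series whose events do not share timestamps by carrying the most recent value from one series forward onto the other (the classic trades-and-quotes case). This is illustrative Python only; in Vertica the feature is a SQL extension, and the data values here are made up:

```python
# Illustrative sketch of an event-series join: for each event in `left`,
# attach the most recent `right` value at or before its timestamp.

def event_series_join(left, right):
    """left/right: lists of (timestamp, value), sorted by timestamp."""
    joined, j, last_right = [], 0, None
    for ts, lv in left:
        # Advance through right-side events up to this timestamp,
        # remembering the latest value seen (carry-forward).
        while j < len(right) and right[j][0] <= ts:
            last_right = right[j][1]
            j += 1
        joined.append((ts, lv, last_right))
    return joined

trades = [(1, 100), (3, 101), (6, 99)]     # (time, trade price)
quotes = [(0, 99.5), (2, 100.5), (5, 100.0)]  # (time, quote)

print(event_series_join(trades, quotes))
# [(1, 100, 99.5), (3, 101, 100.5), (6, 99, 100.0)]
```

Each trade picks up the quote in effect at its timestamp, even though no timestamps match exactly; that gap-filling behavior is what distinguishes an event-series join from an ordinary equi-join.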

With our new Elastic Cluster features, we allow very fast expansion and contraction of the cluster to handle changing workloads – in a recent POC we showed expansion from 8 nodes to 16 nodes with 11TB of data in an hour. Another feature we’ve added is called Import/Export, which automates fast export of data from one Vertica cluster to another, a very useful feature for creating sandboxes from a production system for exploratory analysis.

Besides these two major features, there are a number of improvements to the manageability of the database, most notably the Data Collector, which captures and maintains a history of system performance data, and the Workload Analyzer, which analyzes this data to point out suboptimal performance and how to fix it.

Q5. How is Vertica currently being used? Could you give examples of applications that use Vertica?

Shilpa Lawande: Vertica has customers in most major verticals, including Telco, Financial Services, Retail, Healthcare, Media and Advertising, and Online Gaming. We have eight out of the top 10 US Telcos using Vertica for Call-Detail-Record analysis, and in Financial Services, a common use-case is a tickstore, where Vertica is used to store and analyze many years of financial trades and quotes data to build models.
A growing segment of Vertica customers are Online Gaming and Web 2.0 companies that use Vertica to build models for in-game personalization.

Q6. Vertica vs. Apache Hadoop: what are the similarities and what are the differences?

Shilpa Lawande: Vertica and Hadoop are both systems that can store and analyze large amounts of data on commodity hardware. The main differences are how the data gets in and out, how fast the system can perform, and what transaction guarantees are provided. Also, from the standpoint of data access, Vertica’s interface is SQL and data must be designed and loaded into a SQL schema for analysis.

With Hadoop, data is loaded as-is into a distributed file system and accessed programmatically by writing MapReduce programs. By not requiring a schema first, Hadoop provides a great tool for exploratory analysis of the data, as long as you have the software development expertise to write MapReduce programs. Hadoop assumes that the workloads it runs will be long running, so it makes heavy use of checkpointing at intermediate stages.
This means parts of a job can fail, be restarted and eventually complete successfully. There are no transactional guarantees.
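The programming-model contrast can be made concrete with a minimal map/shuffle/reduce pipeline in plain Python: the analyst writes a map function and a reduce function over schemaless records, rather than declaring a SQL query over a schema. This is a single-process illustration of the model only; real Hadoop jobs are written against the MapReduce API and run distributed, with the checkpointing described above:

```python
# Minimal single-process sketch of the MapReduce programming model:
# map over raw records, shuffle (group by key), then reduce each group.

from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Emit (word, 1) for every word in a raw log line — no schema needed.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    yield (key, sum(values))

def run_job(records):
    # Shuffle phase: collect and sort intermediate pairs by key,
    # as the framework would do between the map and reduce stages.
    pairs = sorted(kv for r in records for kv in map_fn(r))
    out = {}
    for key, group in groupby(pairs, key=itemgetter(0)):
        for k, total in reduce_fn(key, (v for _, v in group)):
            out[k] = total
    return out

print(run_job(["error disk full", "error timeout", "ok"]))
# {'disk': 1, 'error': 2, 'full': 1, 'ok': 1, 'timeout': 1}
```

Nothing about the input had to be declared up front, which is exactly the exploratory-analysis strength described above; the cost is that every new question means writing and running new code rather than a new query.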

Vertica, on the other hand, is optimized for performance by careful layout of data and pipelined operations that minimize saving intermediate state. Vertica gets queries to run sub-second and if a query fails, you just run it again. Vertica provides standard ACID transaction semantics on loads and queries.

We recently did a comparison between Hadoop, Pig, and Vertica for a graph problem (see the post on our blog), and when it comes to performance, the choice is clearly in favor of Vertica. But we believe in using the right tool for the job and have over 30 customers using both systems together. Hadoop is a great tool for the early exploration phase, where you need to determine what value there is in the data and what the best schema is, or to transform the source data before loading it into Vertica. Once the data models have been identified, use Vertica to get fast responses to queries over the data.

Other customers keep all their data in Vertica and leverage Hadoop’s scheduling capabilities to retrieve the data for different kinds of analysis. To facilitate this, we provide a Hadoop connector that allows efficient bi-directional transfer of data between Vertica and Hadoop. We plan to continue to enhance Vertica’s analytic platform as well as be a partner in the Hadoop ecosystem.

Q7. Cloud computing: Does it play a role at Vertica? If yes, how?

Shilpa Lawande: In 2007, when ‘Cloud’ first started to gather steam, we realized that Vertica had the perfect architecture for the cloud. It is no surprise because the very trends in commodity multi-core servers, storage and interconnects that resulted in our design choices also enable the cloud. We were the first analytic database to run on Amazon EC2 and today have several customers using EC2 to run their analytics.

We consider cloud an important deployment configuration for Vertica, and as IaaS offerings improve and we gain real-world experience from our customers, you can expect Vertica’s product to be further optimized for cloud deployments. Cloud is also a big focus area at HP, and we now have the unique opportunity to create the best cloud platform to run Vertica and to provide value-added solutions for big data analytics based on Vertica.

Q8. How does Vertica fit into HP’s data management strategy?

Shilpa Lawande: Vertica provides HP with a proven platform for big data analytics that can be deployed as software, an appliance, and on the cloud, all of which are key focus areas for HP’s business.
Big data is a growing market and the majority of the data is unstructured. The combination of Vertica and Autonomy gives HP a comprehensive “Information Platform” to manage, analyze, and derive value from the explosion in both structured and unstructured data.

________________________
Shilpa Lawande, VP of Engineering, Vertica.
Shilpa Lawande has been an integral part of the Vertica engineering team since its inception, bringing over 10 years of experience in databases, data warehousing and grid computing to Vertica. Prior to Vertica, she was a key member of the Oracle Server Technologies group where she worked directly on several data warehousing and self-managing features in the Oracle 9i and 10g databases.

Lawande is a co-inventor on several patents on query optimization, materialized views and automatic index tuning for databases. She has also co-authored two books on data warehousing using the Oracle database as well as a book on Enterprise Grid Computing.

Lawande has a Masters in Computer Science from the University of Wisconsin-Madison and a Bachelors in Computer Science and Engineering from the Indian Institute of Technology, Mumbai.
—————

ODBMS.org Resources on Analytical Data Platforms:
1. Blog Posts
2. Free Software
3. Articles

Related Posts

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com

Analytics at eBay. An interview with Tom Fastner.

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

##

Nov 2 11

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com

by Roberto V. Zicari

“One of the core concepts of Big Data is being able to evolve analytics over time. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources.” — Werner Vogels, Amazon.com

I wanted to know more on what is going on at Amazon.com in the area of Big Data and Analytics. For that, I have interviewed Dr. Werner Vogels, Chief Technology Officer and Vice President of Amazon.com

RVZ

Q1. In your keynote at the Strata Making Data Work Conference held this February in Santa Clara, California, you said that “Data and Storage should be unconstrained“. What did you mean with that?

Vogels: In the old world of data analysis you knew exactly which questions you wanted to ask, which drove a very predictable collection and storage model. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources.

Q2. You also claimed that “Big Data requires NO LIMIT”. However, Prof. Alex Szalay, talking about astronomy, warns us that “Data is everywhere, never be at a single location. Not scalable, not maintainable.” Will Big Data Analysis become a new Astronomy?

Vogels: Big Data is the hot topic for this year. With the rise of the internet, and the increasing number of consumers, researchers and businesses of all sizes getting online, the amount of data now available to collect, store, manage, analyze and share is growing. When companies come across such large amounts of data it can lead to data paralysis where they don’t have the resources to make effective use of the information.
To Alex’s point, it’s challenging to get the relevant data at the place where you want it to do your analysis. This is why we see many organizations putting their data in the cloud where it’s easily accessible for everyone.

Q3. You also quoted Jim Gray and mentioned the “Fourth Paradigm: Data Intensive Scientific Discovery”. What is it? Is Business Intelligence becoming more like Science for profit?

Vogels: The book I referenced was The Fourth Paradigm: Data-Intensive Scientific Discovery. This is a collection of essays that discusses the vision for data-intensive scientific discovery, which is the concept of shifting computational science to a data-intensive model where we analyze observations.
For Business Intelligence this means that analysis goes beyond finance and accountancy and helps companies continuously improve the service to their customers.

Q4. Michael Olson of Cloudera, in a recent interview said talking about Analytical Data Platforms that “Cloud is a deployment detail, not fundamental. Where you run your software and what software you run are two different decisions, and you need to make the right choice in both cases.”
What is your opinion on this? What is in your opinion the relationships between Big Data Analysis and Cloud Computing?

Vogels: Big Data holds the promise of helping companies create a competitive advantage, as through data analysis they learn how to better serve their customers. This is an approach we have already applied at Amazon.com for 15 years, and we have a solid understanding of all the challenges around managing and processing Big Data.

One of the core concepts of Big Data is being able to evolve analytics over time. For that, a company cannot be constrained by any resource. As such, Cloud Computing and Big Data are closely linked because for a company to be able to collect, store, organize, analyze and share data, they need access to infinite resources.
AWS customers are doing some really innovative things around dealing with Big Data. One example is the digital advertising and marketing firm Razorfish, which targets online adverts based on data from browsing sessions. A common issue Razorfish found was the need to process gigantic data sets. These large data sets are often the result of holiday shopping traffic on a retail website, or sudden dramatic growth on a media or social networking site.

Normally crunching these numbers would take them two days or more. By leveraging on-demand services such as Amazon Elastic MapReduce, Razorfish is able to drop their processing time to eight hours. There was no upfront investment in computing hardware, no procurement delay, and no additional operations staff hired. All this means Razorfish can offer multi-million dollar client service programs on a small business budget, helping them to increase their return on ad spend by 500%.

Q5. How has Amazon’s technology evolved over the past three years?

Vogels: Every day, Amazon Web Services adds enough new server capacity to support all of Amazon’s global infrastructure back when it was a $2.76B annual revenue enterprise, in the company’s fifth full year of operation. Today we have hundreds of thousands of customers in over 190 countries—both startups and large companies. To give you an idea of the scale we’re talking about, Amazon S3 holds over 260 billion objects and regularly peaks at 200k requests per second.

Our pace of innovation has been rapid because of our relentless customer focus. Our process is to release a service into beta that is useful to a lot of people, get customer feedback and rapidly begin adding the bells and whistles based in large part on what customers want and need from the services. There’s really no substitute for the accelerated learning we’ve had from working with hundreds of thousands of customers with every imaginable use case.

We are also relentless about driving efficiencies and passing along the cost savings to our customers.
We’ve lowered our prices 12 times in the past 5 years with no competitive pressure to do so. We’re very comfortable running high-volume, low-margin businesses, which is very different from traditional IT vendors.

Q6. Amazon’s Dynamo is proprietary; however, the publication of your seminal 2007 Dynamo paper was used as input for open source projects (e.g. Cassandra, which began as a fusion of Google’s Bigtable and Amazon’s Dynamo concepts). Why did Amazon allow this?

Vogels: Dynamo is internal technology developed at Amazon to address the need for an incrementally scalable, highly available key-value storage system. The technology is designed to give its users the ability to trade off cost, consistency, durability and performance, while maintaining high availability.
We found, though, that there had been some struggles with applying the concepts so we published the paper as feedback to the academic community about what one needed to do to build realistic production systems.
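The Dynamo paper describes these trade-offs with quorum parameters: N replicas per key, W acknowledgements required for a write, and R replicas consulted on a read. Choosing R + W > N guarantees that every read quorum overlaps the latest write quorum. A minimal sketch of that idea follows (illustrative only, not Amazon's implementation; all names are made up):

```python
# Quorum read/write sketch: N replicas, write to W of them, read from R.
# With R + W > N, every read quorum overlaps every write quorum, so a
# read always sees at least one copy of the latest write.

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        assert r + w > n, "R + W must exceed N for quorum overlap"
        self.n, self.w, self.r = n, w, r
        self.replicas = [dict() for _ in range(n)]  # key -> (version, value)
        self.clock = 0

    def put(self, key, value):
        self.clock += 1
        # A real system picks W healthy nodes; here we just take the first W.
        for replica in self.replicas[: self.w]:
            replica[key] = (self.clock, value)

    def get(self, key):
        # Read from R replicas and return the highest-versioned value seen.
        answers = [rep[key] for rep in self.replicas[-self.r:] if key in rep]
        return max(answers)[1] if answers else None

store = QuorumStore(n=3, w=2, r=2)
store.put("cart:42", ["book"])
store.put("cart:42", ["book", "pen"])
print(store.get("cart:42"))
# → ['book', 'pen']
```

Lowering R (or W) buys latency and availability at the cost of possibly reading stale data, which is exactly the tunable trade-off Vogels describes.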

Q7. What is the positioning of Amazon with respect to Open Source projects? Why didn’t you develop Open Source data platforms from the start, as, for example, Facebook and LinkedIn did?

Vogels: I believe that you need to pour your resources into areas where you can make important contributions, where you can provide the customer with the best possible experience. Our mission is to provide the one-man app developer or the 20,000 person enterprise with a platform of web services they can use to build sophisticated, scalable applications. I believe anything we can do to make AWS lower-cost and widely available will help the community tremendously.

Q8. Amazon Elastic MapReduce utilizes a hosted Hadoop framework running on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Why choose Hadoop? Why not use already existing BI products?

Vogels: We chose Hadoop for several reasons. First, it is the only available framework that could scale to process 100s or even 1000s of terabytes of data and scale to installations of up to 4000 nodes. Second, Hadoop is open source and we can innovate on top of the framework and inside it to help our customers develop more performant applications more quickly. Third, we recognized that Hadoop was gaining substantial popularity in the industry with multiple customers using Hadoop and many vendors innovating on top of Hadoop.

Three years later we believe we made the right choice. We also see that existing BI vendors such as MicroStrategy are willing to work with us and integrate their solutions on top of Elastic MapReduce.
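MapReduce, the programming model behind Hadoop, is easiest to see in the classic word-count example. Below is a self-contained Python sketch of the map, shuffle/sort and reduce phases; in a real cluster Hadoop runs these over HDFS blocks across thousands of nodes, so this is only an illustration of the model:

```python
from itertools import groupby

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all counts for one key.
    return word, sum(counts)

def mapreduce(lines):
    # Shuffle/sort: group intermediate pairs by key, as Hadoop does
    # between the map and reduce phases.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(k, (c for _, c in grp))
                for k, grp in groupby(pairs, key=lambda kv: kv[0]))

print(mapreduce(["big data is big", "data scales"]))
# → {'big': 2, 'data': 2, 'is': 1, 'scales': 1}
```

The appeal Vogels describes comes from the fact that mapper and reducer are pure functions over key/value pairs, so the framework can parallelize and re-run them freely across nodes.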

Q9. Looking at three elements: Data, Platform, Analysis, what are the main research challenges ahead? And what are the main business challenges ahead?

Vogels: I think that sharing is another important aspect to the mix. Collaborating during the whole process of collecting data, storing it, organizing it and analyzing it is essential. Whether it’s scientists in a research field or doctors at different hospitals collaborating on drug trials, they can use the cloud to easily share results and work on common datasets.

——–
Dr. Vogels is Vice President & Chief Technology Officer at Amazon.com where he is responsible for driving the company’s technology vision, which is to continuously enhance the innovation on behalf of Amazon’s customers at a global scale.

Prior to joining Amazon, he worked as a researcher at Cornell University where he was a principal investigator in several research projects that target the scalability and robustness of mission-critical enterprise computing systems. He has held positions of VP of Technology and CTO in companies that handled the transition of academic technology into industry.

Vogels holds a Ph.D. from the Vrije Universiteit in Amsterdam and has authored many articles for journals and conferences, most of them on distributed systems technologies for enterprise computing.

He was named the 2008 CTO of the Year by Information Week for his contributions to making Cloud Computing a reality. For his unique style in engaging customers, media and the general public, Dr. Vogels received the 2009 Media Momentum Personality of Year Award.

Related Posts

Analytics at eBay. An interview with Tom Fastner.

Google Fusion Tables. Interview with Alon Y. Halevy.

Interview with Jonathan Ellis, project chair of Apache Cassandra.

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

The evolving market for NoSQL Databases: Interview with James Phillips.

Objects in Space vs. Friends in Facebook.


Oct 24 11

vFabric SQLFire: Better than RDBMS and NoSQL?

by Roberto V. Zicari

“We believe the ability to do transactions will remain important to app developers.” — Blake Connell, VMware.

vFabric SQLFire beta was recently announced by VMware as part of their Cloud Application Platform, called vFabric 5. I wanted to know more about it. I have therefore interviewed Blake Connell, Sr. Product Marketing Manager, vFabric at VMware.

RVZ

Q1. What is vFabric SQLFire beta?

Blake Connell: SQLFire is a memory-optimized distributed database. The beta is available for anyone to download and try while we complete the initial release.

Q2. What are the main technical challenges that vFabric SQLFire is supposed to solve?

Blake Connell: SQLFire addresses speed, scalability and availability. SQLFire is memory-optimized for maximum speed; this matters because users are becoming more and more demanding about the responsiveness of web sites and applications.

SQLFire is designed for horizontal scalability: if you need more performance or need to store more data, you simply add more members to the SQLFire distributed system. SQLFire extends SQL with new keywords that describe how tables should be partitioned and replicated, so as you add nodes the system intelligently rebalances itself for high performance, and it’s all completely transparent to the application. This is much more compelling than having to shard your database manually and build shard-awareness into your application.
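The rebalancing idea described here can be sketched generically. In the sketch below (illustrative only; SQLFire's actual bucket assignment is internal to the product), rows hash into a fixed set of buckets and buckets are spread over the member nodes, so adding a member moves whole buckets rather than rehashing every row:

```python
# Bucket-based hash partitioning sketch (a generic illustration, not
# SQLFire's internal algorithm): rows hash into a fixed set of buckets,
# and buckets are assigned round-robin over the member nodes.  Adding a
# member only moves whole buckets, and the application never notices.

NUM_BUCKETS = 8

def bucket_of(key):
    # Which bucket a row's partitioning-column value falls into.
    return hash(key) % NUM_BUCKETS

def bucket_map(nodes):
    # Assign buckets to nodes round-robin.
    return {b: nodes[b % len(nodes)] for b in range(NUM_BUCKETS)}

def node_of(key, nodes):
    return bucket_map(nodes)[bucket_of(key)]

nodes = ["node1", "node2"]
placement_before = {k: node_of(k, nodes) for k in ["ord-1", "ord-2", "ord-3"]}

# Add a member: the bucket map is recomputed and some buckets migrate,
# but a given key still resolves through the same bucket.
nodes.append("node3")
placement_after = {k: node_of(k, nodes) for k in ["ord-1", "ord-2", "ord-3"]}
```

In SQLFire itself the partitioning column and redundancy are declared in DDL and the cluster handles bucket placement; the sketch only shows why adding a node need not touch most of the data.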

Finally, SQLFire uses a shared-nothing architecture to ensure availability. If a node in the distributed system dies, applications can simply talk to another node. In fact, the database drivers shipped with SQLFire will transparently reconnect to a working node without the application needing to do anything; the app doesn’t even have to retry the request. A key compelling characteristic of SQLFire is that it lets developers work with the well-known SQL programming interface, making adoption much more seamless and far less disruptive than alternatives.

Q3. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?

Blake Connell: A big part of the way we address scalability is through horizontal scalability and automatic partitioning/replication of data, as I mentioned. But there are some other very important design choices we’ve made to build a scalable database.
First, we believe the ability to do transactions will remain important to app developers. However we believe that the vast majority of transactions are small both in time required and amount of data affected. We’ve built a transaction system based on these assumptions that completely avoids the need for a centralized coordinator or lock manager. In fact each node can act as its own transaction coordinator, even when the data in the transaction lives in multiple nodes. In this way you get transactions with linear scalability. If you take this along with our partitioning scheme the end result is a relational database that is much more scalable than a traditional RDBMS.

Q4. What are the main results of your performance benchmark for SQLFire?

Blake Connell: In our preliminary performance testing of vFabric SQLFire, the results show the product is able to achieve near-linear scalability as the cluster expands, while CPU utilization remains steady. This indicates SQLFire can accommodate even more load without exhausting CPU resources.
In this test, two thirds of all queries complete in under 1 ms, roughly 88% complete in under 2 ms, and 97% take under 5 ms, which is extremely fast.

Q5. How do you handle structured and unstructured data?

Blake Connell: vFabric SQLFire handles structured data. A related offering, vFabric GemFire handles unstructured data.

Q6. How do you use vFabric SQLFire from Java applications and/ or from ADO.NET?

Blake Connell: We provide JDBC and ADO.NET drivers for you to embed in your application. After that you just use mostly standard SQL. We have DDL extensions, so for example when you create a table you can specify how it should be partitioned, replicated and so forth, but the DML is just standard SQL.
The DDL extensions are all optional and don’t interfere with standard DDL.

Q7. Could you give examples of applications that use SQLFire beta?

Blake Connell: We have a financial services trading application exploring vFabric SQLFire.
The application dramatically increases the speed at which it can take a customer order and send it to a stock exchange while maintaining reliability and security.

Q8. Why use vFabric SQLFire and not one of the existing NoSQL or relational databases?

Blake Connell: vFabric SQLFire offers capabilities not found in traditional RDBMS or NoSQL solutions.
Comparing vFabric SQLFire to the traditional RDBMS:
– Data can be partitioned and/or replicated in memory across a cluster of machines to deliver scale-out benefits and high availability (HA)
– Not constrained by strict adherence to ACID properties. Developers uphold various degrees of ACID properties based on desired levels of data availability, consistency and performance.
– Optimized key-based access

Comparing vFabric SQLFire to NoSQL:
– vFabric SQLFire is focused on data consistency with scale-out and HA; NoSQL offerings often provide eventually consistent data models.
– vFabric SQLFire supports queries and joins with standard SQL syntax.

Q9. How does vFabric SQLFire fit into VMware’s product strategy?

Blake Connell: At VMware, our approach to data is really two-fold:
1. Leverage virtualization to automate database deployment and operations.
2. Provide new approaches to data beyond the traditional one-size-fits-all database model.

vFabric SQLFire and vFabric GemFire are key data offerings from VMware that provide alternative data approaches for modern applications. Innovation is occurring at all layers of IT, including the database layer. vFabric GemFire is a memory-oriented data fabric. Java developers using the Spring Framework, as well as .NET developers, can leverage an API to write applications against vFabric GemFire. vFabric SQLFire enables teams to leverage their skills and investment in SQL knowledge and gain the benefits of distributed, in-memory, shared-nothing architectures.

Our recently released vFabric Data Director leverages vSphere virtualization to provide Database-as-a-Service with centralized back-up, management and governance. This is a huge step forward for both developers looking for fast and simple access to the database resources and for IT teams who need a manageable approach to providing data services.

Q10. What is next in vFabric SQLFire?

Blake Connell: In the first half of 2012 we anticipate enhancing SQLFire to include asynchronous WAN replication. This will let you run a single SQLFire database in multiple datacenters with high performance and low latency.

Related Posts

MariaDB: the new MySQL? Interview with Michael Monty Widenius.

On Versant`s technology. Interview with Vishal Bagga.

Measuring the scalability of SQL and NoSQL systems.

The evolving market for NoSQL Databases: Interview with James Phillips.


Oct 6 11

Analytics at eBay. An interview with Tom Fastner.

by Roberto V. Zicari

“How much more complexity can human developers and organizations deal with?”— Tom Fastner, eBay.

Much has already been written about analytics at eBay. But what is the current status? Which data platforms and data management technologies do they currently use? I asked a few questions to Tom Fastner, a Senior Member of Technical Staff and Architect at eBay.

RVZ

Q1. What are the main technical challenges for big data analytics at eBay?

Tom Fastner: The primary challenges are:

I/O bandwidth: limited due to configuration of the nodes.

Concurrency/workload management: Workload management tools usually manage the limited resource. For many years EDW systems bottlenecked on the CPU; big systems are configured with ample CPU, making I/O the bottleneck. Vendors are starting to put mechanisms in place to manage I/O, but it will take some time to get to the same level of sophistication.

Data movement (loads, initial loads, backup/restores): As new platforms emerge, you need to make data available on more systems, challenging networks, movement tools and support to ensure scalable operations that maintain data consistency.

Q2. What are the current metrics of eBay’s main data warehouses?

Tom Fastner: We have 3 different platforms for Analytics:

A) EDW: Dual systems for transactional (structured) data; Teradata 3.5PB and 2.5 PB spinning disk; 10+ years experience; very high concurrency; good accessibility; hundreds of applications.

B) Singularity: deep Teradata system for semi-structured data; 36 PB spinning disk; lower concurrency than EDW, but can store more data; biggest use case is User Behavior Analysis; largest table is 1.2 PB with ~1.9 trillion rows.

C) Hadoop: for unstructured/complex data; ~40 PB spinning disk; text analytics, machine learning; has the User Behavior data and selected EDW tables; lower concurrency and utilization.

Q3. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?

Tom Fastner: EDW: We model for the unknown (close to 3rd NF) to provide a solid physical data model suitable for many applications, which limits the number of physical copies needed to satisfy specific application requirements. A lot of scalability and performance is built into the database, but like any shared resource it does require an excellent operations team to fully leverage the capabilities of the platform.

Singularity: The platform is identical to EDW; the only exceptions are limitations in workload management due to configuration choices. But since we are leveraging the latest database release, we are exploring ways to adopt new storage and processing patterns. Some new data sources are stored in a denormalized form, significantly simplifying data modeling and ETL. On top of that, we developed functions to support the analysis of the semi-structured data. It also enables more sophisticated algorithms that would be very hard, inefficient or impossible to implement in pure SQL. One example is the pathing of user sessions. However, the size of the data requires us to focus more on best practices (develop on small subsets, use a 1% sample, process by day).
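Pathing of user sessions, mentioned as an example above, is the kind of computation that is awkward in pure SQL but straightforward procedurally: group click events by user, order them by time, and emit each user's page path. A toy Python sketch of the idea (illustrative only, not eBay's implementation; the event data is invented):

```python
from collections import defaultdict

# Toy sessionization/pathing sketch: turn a bag of (user, timestamp,
# page) click events into the ordered page path each user followed.

events = [
    ("u1", 3, "item"), ("u2", 1, "home"), ("u1", 1, "home"),
    ("u1", 2, "search"), ("u2", 2, "item"),
]

def user_paths(events):
    by_user = defaultdict(list)
    for user, ts, page in events:
        by_user[user].append((ts, page))
    # Sort each user's events by timestamp, then keep only the pages.
    return {u: [p for _, p in sorted(evts)] for u, evts in by_user.items()}

print(user_paths(events))
# → {'u1': ['home', 'search', 'item'], 'u2': ['home', 'item']}
```

At eBay's scale this grouping runs over trillions of rows, which is why Fastner stresses developing on small subsets and 1% samples first.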

Hadoop: The emphasis on Hadoop is on optimizing for access. The reusability of data structures (besides “raw” data) is very low.

Q4: How do you handle unstructured data?

Tom Fastner: Unstructured data is handled on Hadoop only. The data is copied from the source systems into HDFS for further processing. We do not store any of it on the Singularity (Teradata) system.

Q5. What kind of data management technologies do you use? What is your experience in using them?

Tom Fastner: ETL: AbInitio, home grown parallel Ingest system.
Scheduling: UC4.
Repositories: Teradata EDW; Teradata Deep system; Hadoop.
BI: MicroStrategy, SAS, Tableau, Excel.
Data modeling: Power Designer.
Adhoc: Teradata SQL Assistant; Hadoop Pig and Hive.
Content Management: Joomla based.

With regard to tool capabilities, there is always something you might wish for. In some cases the tool of choice might even have an important capability, but we have not implemented it (due to resources, complexity, priorities, bugs). The way we try to address required features is through partnerships with some of those vendors. The most mature partnership is with Teradata; other good examples are Tableau and MicroStrategy, where we have regular meetings to discuss future enhancements. However, I do not feel comfortable rating all those tools and pointing out shortcomings.

Q6. Cloud computing and open source: do they play a role at eBay? If yes, how?

Tom Fastner: We do leverage internal cloud functions for Hadoop; no cloud for Teradata.
Open source: committers for Hadoop and Joomla; strong commitment to improve those technologies.

Q7. Glenn Paulley of Sybase in a recent keynote asked the question of how much more complexity can database systems deal with? What is your take on this?

Tom Fastner: I am not familiar with Glenn Paulley’s presentation, so I can’t respond in that context. But in my role at eBay I like to take into account the full picture: how much more complexity can human developers and organizations deal with?

As we learn to deal with more complex data structures and algorithms using specialized solutions, we will look for ways to make them repeatable and available for a wider audience as they mature and demand grows in the market. Specialized solutions do come at a high price (TCO), requiring separate hardware and/or software stack, data replication, data management issues, computing specialists.

Q8. Anything else you wish to add?

Tom Fastner: eBay is rapidly changing, and analytics is driving many key initiatives like buyer experience, search optimization, buyer protection and mobile commerce. We are investing heavily in new technologies and approaches to leverage new data sources to drive innovation.

———–
Tom Fastner is a Senior Member of Technical Staff, Architect with eBay. In this role Tom works on the architecture of the analytical platforms and related tools. Currently he spends most of his time driving innovation to process Big Data on the Singularity© system. Before Tom joined eBay he was with NCR/Teradata for 11 years. He was involved in the Dual Active program from its beginning in 2003 as the technical lead of the first global implementation of Dual Active. Prior to NCR/Teradata, Tom worked for debis Systemhaus with several clients in the aerospace sector. He holds a Master’s in Computer Science from the Technical University of Munich, Germany, and is a Teradata Certified Master.
————-

Related Posts

Measuring the scalability of SQL and NoSQL systems.

Interview with Jonathan Ellis, project chair of Apache Cassandra.

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

The evolving market for NoSQL Databases: Interview with James Phillips.

Sep 29 11

MariaDB: the new MySQL? Interview with Michael Monty Widenius.

by Roberto V. Zicari

“I want to ensure that the MySQL code base (under the name of MariaDB) will survive as open source, in spite of what Oracle may do.”– Michael “Monty” Widenius.

Michael “Monty” Widenius is the main author of the original version of the open-source MySQL database and a founding member of the MySQL AB company. Since 2009, Monty has been working on a branch of the MySQL code base, called MariaDB.
I wanted to know what’s new with MariaDB. You can read the interview with Monty below.

RVZ.

Q1. MariaDB is an open-source database server that offers drop-in replacement functionality for MySQL. Why did you decide to develop a new MySQL database?

Monty: Two reasons:
1) I want to ensure that the MySQL code base (under the name of MariaDB) will survive as open source, in spite of what Oracle may do. As Oracle is now moving away from Open Source to Open Core (see my blog), this was in hindsight the right thing to do!

2) I want to ensure that the MySQL developers have a good home where they can continue to develop MySQL in an open source manner. This is important because if the MySQL ecosystem were to lose the original core developers, there would be no way for the product to survive. This has also in hindsight proven to be important, as Oracle has lost almost all of the original core developers; fortunately most of them have joined the MariaDB project.

Q2. What is new in MariaDB 5.3.1 beta?

Monty: 5.1 and 5.2 were about fixing some outstanding issues and getting in patches for MySQL that had been available in the community for a long time but were never accepted into the MySQL code base for political reasons.

5.3 is where we have put most of our development efforts, especially in the optimizer area and replication. Replication is now an order of magnitude faster than before (when running with many concurrent updates) and many queries involving subqueries or big joins are now 2x to 100x faster. All the changes are listed in the MariaDB knowledge base.

Q3. How is MariaDB currently being used? Could you give examples of applications that use MariaDB?

Monty: We see a lot of old MySQL users switching to MariaDB. These are especially big sites with a lot of queries that need even more performance. We have some case studies in the knowledgebase.

What we do expect with 5.3 is that people who need more flexible SQL will also start using MariaDB, thanks to the new ‘no-sql’ features we have added, like HandlerSocket and dynamic columns.
Dynamic columns allow you to have a different set of columns for every row in a table.
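Conceptually, a dynamic column packs a per-row set of name/value pairs into a single packed column, so rows in one table need not share a schema. A minimal Python sketch of the idea follows; note that MariaDB's real feature is used from SQL through functions such as COLUMN_CREATE and COLUMN_GET, and stores a binary format rather than the JSON used here for readability:

```python
import json

# Each row stores its own set of attributes in one packed column,
# so rows in the same table need not share a schema.  JSON stands in
# for MariaDB's binary dynamic-column format.

def column_create(**attrs):
    return json.dumps(attrs)

def column_get(blob, name, default=None):
    return json.loads(blob).get(name, default)

rows = [
    (1, column_create(color="red", size="XL")),       # a t-shirt
    (2, column_create(author="Lawande", pages=320)),  # a book
]

# Same table, different "columns" per row:
print(column_get(rows[0][1], "color"))   # → red
print(column_get(rows[1][1], "pages"))   # → 320
```

This is the same schema-per-row flexibility that document-oriented NoSQL stores offer, which is why Monty groups it with the ‘no-sql’ features.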

Q4. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?

Monty: A lot of the new features, like batched key access, hash joins, subquery caching, etc., are targeted at allowing the handling of larger data sets.

We have also started to continuously run benchmarks on data in the terabyte range so we can improve MariaDB even more in this area. We hope to have some very interesting announcements shortly.
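Of the features named above, the hash join is easy to illustrate: build an in-memory hash table on the smaller input's join key, then probe it once per row of the larger input, instead of doing an index lookup or scan per row. A generic sketch (not MariaDB's implementation; the tables are invented):

```python
from collections import defaultdict

def hash_join(small, large, key_small, key_large):
    # Build phase: hash the smaller input on its join key.
    table = defaultdict(list)
    for row in small:
        table[row[key_small]].append(row)
    # Probe phase: one hash lookup per row of the larger input.
    for row in large:
        for match in table.get(row[key_large], ()):
            yield {**match, **row}

customers = [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bo"}]
orders = [{"cid": 1, "item": "book"}, {"cid": 1, "item": "pen"},
          {"cid": 3, "item": "mug"}]

for joined in hash_join(customers, orders, "cid", "cid"):
    print(joined["name"], joined["item"])
# → Ann book
# → Ann pen
```

The build/probe structure is why hash joins shine on big joins: each input is read exactly once, regardless of how large the probe side is.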

Q5. How do you handle structured and unstructured data?

Monty: Yes. The dynamic columns feature is especially designed to do that. We now have one developer working on using the dynamic columns feature to implement a storage engine for HBase.

Q6. Who else is using MariaDB and why?

Monty: MariaDB has several hundred thousand downloads and is part of many Linux distributions, but as we don’t track users we don’t know who they are. (This was exactly the problem we had with MySQL in the early days.) That is why we have started to collect success stories about MariaDB installations, to spread awareness of who is using it.
The problem is that big companies usually never want to tell what they are using, which makes this a bit difficult 🙁

The main reasons people are switching to MariaDB are:
1. Faster, more features and fewer bugs.
The fewer bugs come from the fact that we have fixed a lot of bugs that MySQL has and keeps introducing in each release.
2. It’s a drop-in replacement for MySQL, so it’s trivial to switch.
(All old connectors work unchanged and you don’t have to dump/reload your data.)
3. Lots of critical new features that people have wanted for years:
– Microsecond support
– Virtual columns.
– Faster subqueries, big data queries & replication
– Segmented key cache (speeds up MyISAM tables a LOT)
– Progress reporting for alter table.
– SphinxSE search engine.
– FederatedX storage engine (a better and supported federated engine)
See here for a full list.
4. The upgrade process to a new release is easier than in MySQL; you don’t have to dump & restore your data to upgrade to a new version.
5. It’s guaranteed to be open source now and forever (no commercial extensions from Monty Program Ab).
6. We actively work with the community and add patches and storage engines they create to the MariaDB code base.

Q7. What’s next in MariaDB?

Monty: We are just now working on the last cleanups so that we can release MariaDB 5.5.
This will include everything in MariaDB 5.3 and MySQL 5.5, and also most of the closed source features that are in MySQL 5.5 Enterprise.

In parallel we are working on 5.6. The plans for it can be found here.
The most important features in this are probably:
– True parallel replication (not only per database)
– Multi source slaves.
– A better MEMORY engine that can handle BLOB and VARCHAR gracefully (this is already in development).

We are also looking at introducing a new clustered storage engine that is optimized for the cloud.

What will be in 5.6 is still a bit up in the air; we are working with a lot of different companies and the community to define the new features. The final feature set will be decided upon at our next MariaDB developer meeting in Greece in November.

Q8. How can the open source community contribute to the project?

Monty: Anyone can contribute patches to MariaDB, and if you are active and have proven that you are not going to break the code, you can get commit access to the MariaDB tree.
The following link tells you how you can get involved.

Another way is of course to contract Monty Program Ab to implement features in MariaDB. This is a good option for those who have more money than time and want to get something done quickly.

Q9. Cloud computing: Does it play a role for MariaDB? If yes, how?

Monty: Yes, cloud computing is very important for us. MySQL and MariaDB are already among the most popular databases in the cloud; Rackspace, Amazon and Microsoft all provide MySQL instances.
The popularity of MySQL in the cloud is mostly thanks to the fact that MySQL is quite easy to configure in different setups, from using very little memory to using all resources on the box. This combined with replication has made MySQL/MariaDB the database of choice in the cloud.

The one thing MariaDB is missing to be an even better choice for the cloud is a good engine that allows one to, at the flip of a switch, add or drop nodes to dynamically change how many cloud entities one is using. We hope to have a solution for this very soon.

Resources

Posts tagged ‘MySQL’.

Michael (Monty) Widenius, MariaDB
State of MariaDB, Dynamic Column in MariaDB
Describes MariaDB 5.1, a branch of MySQL 5.1. It also introduces the new features in versions 5.2 (beta) and 5.3 (alpha). Copy of the presentation given at ICOODB Frankfurt 2010.
Presentation | Intermediate | English | DOWNLOAD (.PDF) | September 2010|

Databases Related Resources:
Blog Posts | Free Software | Articles and Presentations| Lecture Notes | Journals|

—————–

Sep 19 11

Benchmarking XML Databases: New TPoX Benchmark Results Available.

by Roberto V. Zicari

“A key value is to provide strong data points that demonstrate and quantify how XML database processing can be done with very high performance.” — Agustin Gonzalez, Intel Corporation.

“We wanted to show that DB2’s shared-nothing architecture scales horizontally for XML warehousing just as it does for traditional relational warehousing workloads.” — Dr. Matthias Nicola, IBM Corporation.

TPoX stands for “Transaction Processing over XML”. It is an XML database benchmark that Intel and IBM developed several years ago and then released as open source.
A couple of months ago, the project published some new results.

To learn more about this I have interviewed the main leaders of the TPoX project, Dr. Matthias Nicola, Senior engineer for DB2 at IBM Corporation and Agustin Gonzalez, Senior Staff Software Engineer at Intel Corporation.

RVZ

Q1. What is exactly TPoX?

Matthias: TPoX is an XML database benchmark that focuses on XML transaction processing. TPoX simulates a simple financial application that issues XQuery or SQL/XML transactions to stress the XML storage, XML indexing, XML Schema support, XML updates, logging, concurrency and other components of an XML database system. The TPoX package comes with an XML data generator, an extensible Workload Driver, three XML Schemas that define the XML structures, and a set of predefined transactions. TPoX is free, open source, and available at http://tpox.sourceforge.net/ where detailed information can be found. Although TPoX comes with a predefined workload, it’s very easy to change this workload to adjust the benchmark to whatever your goals might be. The TPoX workload driver is very flexible; it can even run plain old SQL against a relational database and simulate hundreds of concurrent database users. So, when you ask “What is TPoX”, the complete answer is that it is an XML database benchmark but also a very flexible and extensible framework for database performance testing in general.
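
The weighted-transaction idea behind such a workload can be sketched in a few lines. This is a hypothetical Python analogue of a driver's inner loop, not the actual TPoX Java driver; the transaction names, SQL templates, and weights are invented for illustration:

```python
import random

# Transaction templates with relative weights; names and SQL are illustrative.
WORKLOAD = [
    (70, "get_order",    "SELECT odoc FROM orders WHERE oid = ?"),
    (20, "insert_order", "INSERT INTO orders (oid, odoc) VALUES (?, ?)"),
    (10, "update_cust",  "UPDATE custacc SET cdoc = ? WHERE cid = ?"),
]

def pick_transaction(rng):
    """Choose a transaction template according to its weight."""
    total = sum(w for w, _, _ in WORKLOAD)
    r = rng.uniform(0, total)
    upto = 0.0
    for weight, name, sql in WORKLOAD:
        upto += weight
        if r <= upto:
            return name, sql
    return WORKLOAD[-1][1], WORKLOAD[-1][2]

# Simulate 10,000 driver iterations and count what the mix looks like.
rng = random.Random(42)
counts = {}
for _ in range(10_000):
    name, _ = pick_transaction(rng)
    counts[name] = counts.get(name, 0) + 1

# The read-heavy transaction should dominate, roughly 70/20/10.
print(sorted(counts, key=counts.get, reverse=True)[0])
```

A real driver would additionally fill in the `?` parameters from configured value distributions and time each execution; the weighted choice above is the part that makes the workload mix adjustable.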

Q2. When did you start with this project? What was the original motivation for TPoX? What is the motivation now?

Matthias: We started with this project approximately in 2003/2004. At that time we were working on the native XML support in DB2 that was later released in DB2 version 9.1 in 2006. We needed an XML workload -a benchmark- that was representative of an important class of real-world XML applications and that would stress all critical parts of a database system.
We needed a tool to put a heavy load on the new XML database functionality that we were developing. Some XML benchmarks had been proposed by the research community, such as XMark, MBench, XMach-1, XBench, X007, and a few others. They were all useful in their respective scope, such as evaluating XQuery processors, but we felt that none of them truly aimed at evaluating a database system in its entirety. We found that they did not represent all relevant characteristics of real-world XML applications.
For example, many of them only defined a read-only and single-user workload on a single XML document. However, real applications typically have many concurrent users, a mix of read and write operations, and millions or even billions of XML documents.
That’s what we wanted to capture in the TPoX benchmark.

Agustin: And the motivation today is the same as when TPoX became freely available as open source: database and hardware vendors, database researchers, and even database practitioners in the IT departments of large corporations need a tool to evaluate system performance, compare products, or compare different design and configuration options.
At Intel, the main motivation behind TPoX is to benchmark and improve our platforms for the increasingly relevant intersection of XML and databases. So far, the joint results with IBM have exceeded our expectations.

Q3. TPoX is an application-level benchmark. What does it mean? Why did you choose to develop an application-level benchmark?

Matthias: We typically distinguish between micro-benchmarks and application-level benchmarks, both of which are very useful but have different goals. A micro-benchmark typically defines a range of tests such that each test exercises a narrow and well-defined piece of functionality. For example, if your focus is an XQuery processor you can define tests to evaluate XPath with parent steps, other tests to evaluate XPath with descendant-or-self axis, other tests to evaluate XQuery “let” clauses, and so on.
This is very useful for micro-optimization of important features and functions. In contrast, an application-level benchmark tries to evaluate the end-to-end performance of a realistic application scenario and to exercise the performance of a complete system as a whole, instead of just parts of it.
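
The distinction can be made concrete with a toy micro-benchmark: time one narrow operation in isolation, here a descendant search over a small generated XML document. The document shape and query are illustrative only, not drawn from TPoX or any real benchmark:

```python
import timeit
import xml.etree.ElementTree as ET

# Build a small synthetic XML document with 1,000 <order> elements.
doc = "<orders>" + "".join(
    f"<order id='{i}'><sym>S{i % 50}</sym></order>" for i in range(1000)
) + "</orders>"
root = ET.fromstring(doc)

def query():
    # The one narrow operation under test: a descendant search for
    # all <sym> elements with a given text value.
    return [e for e in root.iter("sym") if e.text == "S7"]

# Time just this operation, repeated many times, nothing else.
elapsed = timeit.timeit(query, number=200)
per_call_ms = elapsed / 200 * 1000
print(f"{len(query())} matches, ~{per_call_ms:.3f} ms per query")
```

An application-level benchmark would instead drive many such operations concurrently, mixed with writes, against a full database system, and report end-to-end throughput.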

Agustin: As an application-level benchmark, TPoX has proven much more useful and believable than “synthetic” micro-benchmarks. As a result, TPoX can even be used to predict how similar real-world applications will perform, or where they will encounter a bottleneck. You cannot make such predictions with a micro-benchmark. Another important feature is that TPoX is very scalable – you can run TPoX on a laptop but also scale it up and run on large enterprise-grade servers, such as multi-processor Intel Xeon platforms.

Q4. How do you exactly evaluate the performance of XML databases?

Agustin: Well, one way is to use TPoX on a given platform and then compare to existing results on different combinations of hardware and software. I know that this is a simplistic answer but we really learn a lot from this approach. Keeping a precise history of the test configurations and the results obtained is always critical.

Matthias: This is actually a very broad question! We use a wide range of approaches. We use micro-benchmarks, we use application-level benchmarks such as TPoX, we use real-world workloads that we get from some of our DB2 customers, and we continuously develop new performance tests. When we use TPoX, we often choose a certain database and hardware configuration that we want to test and then we gradually “turn up the heat”. For example, we perform repeated TPoX benchmark runs and increase the number of concurrent users until we hit a bottleneck, either in the hardware or the software. Then we analyze the bottleneck, try to fix it, and repeat the process. The goal is to always push the available hardware and software to the limit, in order to continuously improve both.
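
The “turn up the heat” loop can be sketched as follows. The throughput numbers come from a synthetic model (a capped linear ramp) standing in for real benchmark runs, and the capacity and per-user figures are invented:

```python
CAPACITY_TPS = 5000.0   # assumed saturation point of the system under test
PER_USER_TPS = 250.0    # assumed throughput contributed per added user

def measured_throughput(users):
    """Synthetic stand-in for one full benchmark run at a given user count."""
    return min(users * PER_USER_TPS, CAPACITY_TPS)

users, prev_tps = 1, 0.0
saturation_users = None
while users <= 1024:
    tps = measured_throughput(users)
    # Stop once doubling the users no longer improves throughput by >5%;
    # that plateau marks the bottleneck one would then analyze and fix.
    if prev_tps and tps < 1.05 * prev_tps:
        saturation_users = users
        break
    prev_tps = tps
    users *= 2

print(saturation_users, prev_tps)
```

In a real campaign each `measured_throughput` call is a full benchmark run, and the analysis step (profiling the plateau, fixing it, re-running) is where the actual engineering happens.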

Q5. What is the difference of TPoX with respect to classical database benchmarks such as TPC-C and TPC-H?

Matthias: One of the obvious differences is that TPC-C and TPC-H focus on very traditional and mature relational database scenarios. In contrast, TPoX aims at the comparatively young field of XML in databases. Another difference is that the TPC benchmarks have been standardized and “approved” by the TPC committee, while TPoX was developed by Intel and IBM, and extended by various students and Universities as an open source project.

Agustin: But, TPoX also has some important commonalities with the TPC benchmarks. TPC-C, TPC-H, and TPoX are all application-level benchmarks. Also, TPC-C, TPC-H, and TPoX have each chosen to focus on a specific type of database workload. This is important because no benchmark can (or should try to) exercise all possible types of workloads. TPC-C is a relational transaction processing benchmark, TPC-H is a relational decision support benchmark, and TPoX is an XML transaction processing benchmark. Some people have called TPoX the “XML-equivalent of TPC-C”. Another similarity between TPC-C, TPC-E, and TPoX is that all three are throughput oriented “steady state benchmarks”, which makes it straightforward to communicate results and perform comparisons.

Q6. Do you evaluate both XML-enabled and Native XML databases? Which XML Databases did you evaluate?

Matthias: TPoX can be used to evaluate pretty much any database that offers XML support. The TPoX workload driver is architected such that only a thin layer (a single Java class) deals with the specific interaction with the database system under test. Personally, I have used TPoX only on DB2. I know that other companies as well as students at various universities have also run TPoX against other well-known database systems.

Q7. How did you define the TPoX Application Scenario? How did you ensure that the TPoX Application Scenario you defined is representative of a broader class of applications?

Matthias: Over the years we have been working with a broad range of companies that have XML applications and require XML database support. Many of them are in the financial sector. We have worked closely with them to understand their XML processing needs. We have examined their XML documents, their XML Schemas, their XML operations, their data volumes, their transaction rates, and so on. All of that experience has flowed into the design of TPoX. One very basic but very critical observation is that there are practically no real-world XML applications that use only a single large XML document. Instead, the majority of XML applications use very large numbers of small documents.

Agustin: TPoX is also very realistic because it uses a real-world XML Schema called FIXML, which standardizes trade-related messages in the financial industry. It is a very complex schema that defines thousands of optional elements and attributes and allows for immense document variability. It is extremely hard to map the FIXML schema to a traditional normalized relational schema. In the past, many XML processing systems were not able to handle the FIXML schema. But since this type of XML is used in real-world applications, it is a great fit for a benchmark.

Q8. How did you define the workload?

Matthias: Again, by experience with real XML transaction processing applications.

Q9. In your documentation you write that TPoX uses a “stateless” workload? What does it mean in practice? Why did you make this choice?

Matthias: It means that every transaction is submitted to the database independently from any previous transactions. As a result, the TPoX workload driver doesn’t need to remember anything about previous transactions. This makes it easier to design and implement a benchmark that scales to billions of XML documents and hundreds of millions of transactions in a single benchmark run.
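
One practical payoff of statelessness can be sketched like this: if transaction i is a pure function of a seed and i, any worker can generate any slice of the transaction stream without knowing what earlier transactions did or returned. The document count and transaction name below are made up:

```python
import random

DOCUMENTS = 1_000_000  # assumed document count at some scale factor

def transaction(seed, i):
    """Generate the i-th transaction independently of all others."""
    rng = random.Random(seed * 1_000_003 + i)  # per-transaction RNG
    return ("get_security", rng.randrange(DOCUMENTS))

# Two workers split the stream [0, 10) with no shared session state...
worker_a = [transaction(7, i) for i in range(0, 5)]
worker_b = [transaction(7, i) for i in range(5, 10)]

# ...and a single sequential run produces exactly the same transactions.
sequential = [transaction(7, i) for i in range(10)]
print(worker_a + worker_b == sequential)
```

A stateful workload (say, “cancel the order created two transactions ago”) would break this partitioning, because each worker would need the results of earlier transactions.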

Q10. Why not define a workload also for complex analytical queries?

Matthias: We did! And we ran it on a 10TB XML data warehouse with more than 5.5 billion XML documents.
That was a very exciting project and you can find more details on my blog.
Although the initial wave of XML database adoption was more focused on transactional and operational systems, companies soon realized that they were accumulating very large volumes of XML documents that contained a goldmine of information. Hence, the need for XML warehousing and complex analytical XML queries was pressing. We wanted to show that DB2’s shared-nothing architecture scales horizontally for XML warehousing just as it does for traditional relational warehousing workloads.

Agustin: Admittedly, we have not yet formally included this workload of complex XML queries into the TPoX benchmark. Just like TPC-C and TPC-H are separate for transaction processing vs. decision support, we would also need to define two flavors of TPoX, even if the underlying XML data remains the same. A TPoX workload with complex queries is definitely very meaningful and desirable.

Q11. What are the main new results you obtained so far? What are the main values of the results obtained so far?

Agustin: We have produced many results using TPoX over the years, with ever larger numbers of transactions per second and continuous scalability of the benchmark on increasingly larger platforms. A key value is to provide strong data points that demonstrate and quantify how XML database processing can be done with very high performance. In particular, the first public 1TB XML benchmark that we did a few years ago has helped establish the notion that efficient XML transaction processing is a reality today. Such results give the hardware and the software a lot of credibility in the industry. And of course we learn a lot with every benchmark, which allows us to continuously improve our products.

Q12. You write in your Blog “For 5 years now Intel has a strong history of testing and showcasing many of their latest processors with the Transaction Processing over XML (TPoX) benchmark.” Why has Intel been using the TPoX benchmark? What results did they obtain?

Matthias: I let Agustin answer this one.

Agustin: Intel uses the TPoX benchmark because it helps us demonstrate the power of Intel platforms and generate insights on how to improve them. TPoX also enables us to work with IBM to improve the performance of DB2 on Intel platforms, which is good for both IBM and Intel. This collaboration of Intel and IBM around TPoX is an example of an extensive effort at Intel to make sure that enterprise software has excellent performance on Intel. You can see our most important results on the TPoX web page.

Q13. Can you use TPoX to evaluate other kinds of databases (e.g. Relational, NoSQL, Object Oriented, Cloud stores)? How does TPoX compare with the Yahoo! YCSB benchmark for Cloud Serving Systems?

Matthias: Yes, the TPoX workload driver can be used to run traditional SQL workloads against relational databases. Assuming you have a populated relational database, you can define a SQL workload and use the TPoX driver to parameterize, execute, and measure it. TPoX and YCSB have been designed for different systems under test. However, parts of the TPoX framework can be reused to quickly develop other types of benchmarks, especially since TPoX offers various extension points.
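
The idea of reusing such a driver for plain SQL can be illustrated with Python's built-in sqlite3 in place of JDBC: a parameterized SQL workload needs only a connection, a statement template, and parameter values. The schema and data here are invented:

```python
import sqlite3

# A populated relational database (in-memory, for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany(
    "INSERT INTO account (id, balance) VALUES (?, ?)",
    [(i, 100.0 * i) for i in range(1, 101)],
)

# A tiny "workload": one parameterized read transaction, executed with
# different parameter values, the way a workload driver would.
template = "SELECT balance FROM account WHERE id = ?"
results = [conn.execute(template, (i,)).fetchone()[0] for i in (3, 50, 99)]
print(results)
```

A driver framework adds what this sketch omits: concurrent sessions, parameter-value distributions, warm-up phases, and per-transaction response-time measurement.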

Agustin: Some open source relational databases have started to offer at least partial support for the SQL/XML functions and the XML data type. Given the level of parameterization and the extensible nature of the TPoX workload driver it would be very easy to develop custom workloads for the emerging support of the XML data type on open source databases. At the same time, the powerful XML document generator included in the kit can be used to generate the required data. Using TPoX to test the performance of XML in open source databases is an intriguing possibility.

Q14. Is it possible to extend TPoX? If yes, how?

Matthias: Yes, TPoX can be extended in several ways. First, you can change the TPoX workload in any way you want. You can modify, add, or remove transactions from the workload, you can change their relative weight, and you can change the random value distributions that are used for the transaction parameters. We have used the TPoX workload driver to run many different XML workloads, including on XML data other than the TPoX documents. We have also used the workload driver for relational SQL performance tests, simply because it is so easy to set up concurrent workloads.
Second, the database-specific interface of the TPoX workload driver is encapsulated in a single Java class, so it is relatively easy to port the driver to another database system. And third, the new version, TPoX 2.1, allows transactions to be coded not only in SQL, SQL/XML, and XQuery, but also in Java. TPoX 2.1 supports “Java-Plugin transactions” that allow you to implement whatever activities you want to run and measure in a concurrent manner. For example, you can run transactions that call a web service, send or receive data from a message queue, access a content management system, or perform any other operations – only limited by what you can code in Java!
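
The plugin mechanism can be sketched in Python terms (the real TPoX 2.1 plugins are Java classes; this is only an analogy): the driver calls a uniform execute() method and never needs to know what the transaction actually does.

```python
class PluginTransaction:
    """Anything measurable: a query, a web-service call, a queue send."""
    def execute(self):
        raise NotImplementedError

class CountingTransaction(PluginTransaction):
    """Trivial stand-in that just counts its own invocations."""
    def __init__(self):
        self.calls = 0
    def execute(self):
        self.calls += 1
        return self.calls

def run_workload(txn, iterations):
    """Driver loop: execute (and, in a real driver, time) each call."""
    for _ in range(iterations):
        txn.execute()
    return txn.calls

txn = CountingTransaction()
print(run_workload(txn, 500))
```

Because the driver depends only on the `execute()` interface, swapping a database query for a web-service call changes the plugin, not the driver.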

Agustin: At Intel we have been using TPoX internally for various other projects. Since the TPoX workload driver is open source, it is straightforward to modify it to support other types of workloads, not necessarily steady state, which makes it amenable to testing other aspects of computer systems such as power management, storage, and so on.

Q15 What are the current limitations of TPoX?

Matthias: Out of the box, the TPoX workload driver only works with databases that offer a JDBC interface. If a particular database system has specific requirements for its API or query syntax, then some modifications may be necessary. Some database systems might require their own JDBC driver to be compiled into the workload driver.

Q16. Who else is using TPoX?

Matthias: You can see some examples of other TPoX usage on the TPoX web site. We know that other database vendors are using TPoX internally, even if they haven’t decided to publish results yet. I also know a company in the Data Security space that uses TPoX to evaluate the performance of different data encryption algorithms. And TPoX also continues to be used at various universities in Europe, the US, and Asia for a variety of research and student projects. For example, the University of Kaiserslautern in Germany has used TPoX to evaluate the benefit of solid-state disks for XML databases. Other universities have used TPoX to evaluate and compare the performance of several XML-only databases.

Q17. TPoX is an open source project. How can the community contribute?

Matthias: A good starting point is to use TPoX. From there, contributing to the TPoX project is easy. For example, you can report problems and bugs, or you can submit new feature requests. Or, even better, you can implement bug fixes and enhancements yourself and submit them to the SVN code repository on sourceforge.net.
If you design other workloads for the TPoX data set, you can upload them to the TPoX project site and have your results posted on the TPoX web site.

Agustin: As is customary for an open source project on sourceforge, anybody can download all TPoX files and source code freely.
If you want to upload any changed or new files or modify the TPoX web page, you only need to become a member of the TPoX sourceforge project, which is quick and easy.
Everybody is welcome, without exceptions.

Resources

TPoX software:

XML Database Benchmark: “Transaction Processing over XML (TPoX)”:
TPoX is an application-level XML database benchmark based on a financial application scenario. It is used to evaluate the performance of XML database systems, focusing on XQuery, SQL/XML, XML Storage, XML Indexing, XML Schema support, XML updates, logging, concurrency and other database aspects.
Download TPoX (LINK), July 2009. | TPoX Results (LINK), April 2011.

Articles:

“Taming a Terabyte of XML Data”.
Agustin Gonzalez, Matthias Nicola, IBM Silicon Valley Lab.
Paper | Advanced | English | LINK DOWNLOAD (PDF)| 2009|

“An XML Transaction Processing Benchmark”.
Matthias Nicola, Irina Kogan, Berni Schiefer, IBM Silicon Valley Lab.
Paper | Advanced | English | LINK DOWNLOAD (PDF)| 2007|

“A Performance Comparison of DB2 9 pureXML with CLOB and Shredded XML Storage”.
Matthias Nicola et al., IBM Silicon Valley Lab.
Paper | Advanced | English | LINK DOWNLOAD (PDF)| 2006|

Related Posts

Measuring the scalability of SQL and NoSQL systems.

Benchmarking ORM tools and Object Databases.

Sep 7 11

Big Data and the European Commission: How to get involved.

by Roberto V. Zicari

I reported in a previous post that the European Commission has a budget for funding projects in the area of “Intelligent Information Management”.
The European Commission is looking to support projects that develop and test new technologies to manage and exploit extremely large volumes of data, with real time capabilities.

Here is some interesting new information on how you can get involved.

RVZ

1. Call 8 of the Seventh Framework Programme (FP7) was published on 20 July 2011 (Call FP7-ICT-2011-8).

2. A new document, “Technical background notes for Framework Programme 7, Strategic Objective ICT-2011.4.4 ‘Intelligent Information Management’”, is now available.
The document provides background information and technical commentary on Strategic Objective ICT-2011.4.4, as well as checklists. It aims to facilitate the preparation of proposals for this objective and the preparation of the Information and Networking Day.

3. An Information and Networking Day will take place on 26 September 2011 in Luxembourg, at the Jean Monnet Conference Centre.
Registration for the Information and Networking Day is open until 19 September 2011, 18h00, on the EU Event’s page.
You can also visit the home page of the EU Unit Technologies for Information Management for more information.

The deadline for submitting Presentations and Posters, and for the Pre-proposal service registration is 19 September 2011, 18h00 CET.

Useful information on the Information and Networking Day, 26 September 2011 in Luxembourg:

Matchmaking Sessions: These sessions are dedicated to “matchmaking/partner finding”, where you can present your organisation, your skills and ideas (morning), or state what sort of partners and skills you are looking for (afternoon sessions).
If you wish to take the podium for three minutes (maximum three slides), please tick the option in your online registration form and upload the presentation in the designated box in your participant’s profile.
Please choose the relevant session and indicate the headline of your presentation no later than 12 September 2011, and finalise your slides by 19 September 2011.

The slots will be allocated on a first-come-first-served basis, and the confirmation and exact timing will be communicated to you by e-mail. The presentations will be published after the event, if you have ticked the box “I do not object to the publication of my presentation on the meeting’s web site”.

Pre-proposal service: The Pre-proposal service will run in parallel to the Event’s programme. European Commission Project Officers will be available to discuss project ideas and offer preliminary feedback with respect to the Work programme and Call 8. If you wish to sign up for a bilateral meeting, please tick this option in the online registration form. The slots will be allocated on a first-come-first-served basis, and the exact schedule will be communicated to you at the registration desk.

Poster Session: The European Commission facilitates a poster session. It will give the participants the chance to showcase their work and to advertise their skills. No commercial advertising is allowed. Please send the posters electronically via your profile as registered participant and bring the paper version to the event (no print-out service is provided, maximum size A1 portrait). Please note that you will receive the confirmation by e-mail. The posters will be displayed in the Foyer of the Jean Monnet Conference Centre.

Registration and terms of participation

1. Participation is free of charge but subject to prior registration and confirmation (only registered participants can access the conference centre).

2. Travel and accommodation expenses must be borne by the delegates and will not be reimbursed by the European Commission.

Aug 29 11

The future of data management: “Disk-less” databases? Interview with Goetz Graefe

by Roberto V. Zicari

“With no disks and thus no seek delays, assembly of complex objects will have different performance tradeoffs. I think a lot of options in physical database design will change, from indexing to compression and clustering and replication.” — Goetz Graefe.

Are “disk-less” databases the future of data management? What about the issue of energy consumption and databases? Will we have “Green Databases”?
I discussed these issues with Dr. Goetz Graefe, HP Fellow (*), one of the most accomplished and influential technologists in the areas of database query optimization and query processing.

Hope you’ll enjoy this interview.

RVZ.

Q1 The world of data management is changing. Service platforms, scalable cloud platforms, analytical data platforms, NoSQL databases and new approaches to concurrency control are all becoming hot topics both in academia and industry. What is your take on this?

Goetz Graefe: I am wondering whether new and imminent hardware, in particular very large RAM memory as well as inexpensive non-volatile memory (NV-RAM, PCM, etc.), will bring significant shifts that affect software architecture, functionality, scalability, total cost of ownership, etc.
Hasso Plattner in his keynote at BTW (“SanssouciDB: An In-Memory Database for Processing Enterprise Workloads”) definitely seemed to think so. Whether or not one agrees with everything he said, he traced many of the changes back to not having disk drives in their new system (except for backup and restore).

For what it’s worth, I suspect that, based on the performance advantage of semiconductor storage, the vendors will ask for a price premium for semiconductor storage for a long time to come. That enables disk vendors to sell less expensive storage space, i.e., there’ll continue to be a role for traditional disk space for colder data, such as long-ago history analyzed only once in a while.

I think there’ll also be a differentiation by RAID level. For example, warm data with updates might be on RAID-1 (mirroring) whereas cold historical data might be on RAID-5 or RAID-6 (dual redundancy, no data loss in dual failure). In the end, we might end up with more rather than fewer levels in the memory hierarchy.
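
One way to picture such a tiering scheme is as a small policy function from data “temperature” to RAID level; the thresholds below are invented for illustration, but the mapping follows the reasoning above:

```python
def raid_level(days_since_last_update, days_since_last_read):
    """Map data 'temperature' to a RAID level (hypothetical thresholds)."""
    if days_since_last_update <= 7:
        return "RAID-1"   # warm, update-heavy data: fast mirrored writes
    if days_since_last_read <= 90:
        return "RAID-5"   # cooler, read-mostly data: single redundancy
    return "RAID-6"       # cold history: capacity plus dual redundancy

# Current shopping carts, last quarter's reports, five-year-old click streams:
print(raid_level(1, 1), raid_level(60, 30), raid_level(400, 400))
```

The point is that placement becomes a policy decision over access recency, which is exactly what adds levels to the memory hierarchy rather than removing them.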

Q2. What is expected impact of large RAM (volatile) memory on database technology?

Goetz Graefe: I believe that large RAM (volatile) memory has already made a significant difference. SAP’s HANA/Sanssouci/NewDB project is one example; C-Store/VoltDB is another. Other database management system vendors are sure to follow.

It might be that NoSQL databases and key-value stores will “win” over traditional databases simply because they adapt faster to the new hardware, if only because they are currently simpler than database management systems and contain less code that needs adapting.

Non-volatile memory such as phase-change memory and memristors will change a lot of requirements for concurrency control and recovery code. With storage in RAM, including non-volatile RAM, compression will increase in economic value, sophistication, and compression factors. Vertica, for example, already uses multiple compression techniques, some of them pretty clever.
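
A toy comparison may help illustrate why compression pays off on in-memory column data. The run-length encoder below and the sample “column” are illustrative only; real systems such as the one mentioned combine several far more sophisticated codecs:

```python
import zlib

# A "column" of 10,000 values with long runs, as sorted columns often have:
# 100 runs of 100 equal values each.
column = [i // 100 for i in range(10_000)]

def run_length_encode(values):
    """Encode a sequence as (value, run_length) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1      # extend the current run
        else:
            out.append([v, 1])   # start a new run
    return out

rle = run_length_encode(column)          # 100 pairs replace 10,000 values
raw = ",".join(map(str, column)).encode()
deflated = zlib.compress(raw)            # general-purpose codec for contrast

print(len(rle), len(raw), len(deflated))
```

Run-length encoding collapses sorted, repetitive data almost for free and can even be queried without decompressing, while a general-purpose codec like zlib handles arbitrary data at higher CPU cost; in-memory stores typically pick per column.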

Q3. Will we end up having “disk less” databases then?

Goetz Graefe: With no disks and thus no seek delays, assembly of complex objects will have different performance tradeoffs. I think a lot of options in physical database design will change, from indexing to compression and clustering and replication.
I suspect we’ll see disk-less databases where the database contains only application state, e.g., current account balances, currently active logins, current shopping carts, etc. Disks will continue to have a role and economic value where the database also contains history, including cold history such as transactions that affected the account balances, login & logout events, click streams eventually leading to shopping carts, etc.

Q4. Where will the data go if we have no disks? In the Cloud?

Goetz Graefe: Public clouds in some cases, private clouds in many cases. If “we” don’t have disks, someone else will, and many of us will use them whether we are aware of it or not.

Q5. As new developments in memory (also flash) occur, it will result in possibly less energy consumption when using a database. Are we going to see “Green Databases” in the near future?

Goetz Graefe: I think energy efficiency is a terrific question to pursue. I know of several efforts, e.g., by Jignesh Patel et al. and Stavros Harizopoulos et al. Your own students at DBIS Goethe Universität just did a very nice study, too.

It seems to me there are many avenues to pursue.

For example, some people just look at the most appropriate hardware, e.g., high-performance CPUs such as Xeon versus high-efficiency CPUs such as Centrino (back then). Similar thoughts apply to storage, e.g., (capacity-optimized) 7200 rpm SATA drives versus (performance-optimized) 15K rpm fiber channel drives.
Others look at storage placement, e.g., RAID-1 versus RAID-5/6, and at storage formats, e.g., columnar storage & compression.
Others look at workload management, e.g., deferred index maintenance (& view maintenance) during peak load (perhaps the database equivalent to load shedding in streams) or ‘pause and resume’ functionality in utilities such as index creation.
Yet others look at caching, e.g., memcached. Etc.

Q6. What about Workload management?

Goetz Graefe: Workload management really has two aspects: the policy engine including its monitoring component that provides input into policies, and the engine mechanisms that implement the policies. It seems that most people focus on the first aspect above. Typical mechanisms are then quite crude, e.g., admission control.

I have long been wondering about more sophisticated and graceful mechanisms. For example, should workload management control memory allocation among operations? Should memory-intensive operations such as sorting grow and shrink their memory allocation during their execution (i.e., not only when they start)? Should utilities such as index creation participate in resource management? Should index creation (etc.) support ‘pause and resume’ functionality?

It seems to me that I’d want to say ‘yes’ to all those questions. Some of us at Hewlett-Packard Laboratories have been looking into engine mechanisms in that direction.

Q7. What are the main research questions for data management and energy efficiency?

Goetz Graefe: I’ve recently attended a workshop by NSF on data management and energy efficiency.
The topic was split into data management for energy efficiency (e.g., sensors & history & analytics in smart buildings) and energy efficiency in data management (e.g., efficiency of flash storage versus traditional disk storage).
One big issue in the latter set of topics was the difference between traditional performance & scalability improvements versus improvements in energy efficiency, and we had a hard time coming up with good examples where the two goals (performance, efficiency) differed. I suspect that we’ll need cases with different resources, e.g., trading 1 second of CPU time (50 Joule) against 3 seconds of disk time (20 Joule).
NSF (the US National Science Foundation) seems to be very keen on supporting good research in these directions. I think that’s very timely and very laudable. I hope they’ll receive great proposals and can fund many of them.
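
The CPU-versus-disk tradeoff mentioned above can be worked through in a few lines, using exactly those hypothetical figures (not measurements): the disk-bound plan is three times slower in wall-clock time yet uses less than half the energy.

```python
# Two hypothetical execution plans for the same query, with the
# time and energy figures from the example in the text.
plan_cpu = {"seconds": 1.0, "joules": 50.0}   # 1 s of CPU time, 50 J
plan_disk = {"seconds": 3.0, "joules": 20.0}  # 3 s of disk time, 20 J

# The two optimization goals pick different winners.
faster = "cpu" if plan_cpu["seconds"] < plan_disk["seconds"] else "disk"
greener = "cpu" if plan_cpu["joules"] < plan_disk["joules"] else "disk"
print(faster, greener)
```

This is the shape of case where performance-oriented and energy-oriented query optimization genuinely diverge: a cost model minimizing seconds and one minimizing Joules would choose different plans.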

————————–
Dr. Goetz Graefe is an HP Fellow, and a member of the Intelligent Information Management Lab within Hewlett-Packard Laboratories. His experience and expertise are focused on relational database management systems, gained in academic research, industrial consulting, and industrial product development.

His current research efforts focus on new hardware technologies in database management as well as robustness in database request processing in order to reduce total cost of ownership. Prior to joining Hewlett-Packard Laboratories in 2006, Goetz spent 12 years as software architect in product development at Microsoft, mostly in database management. Both query optimization and query execution of Microsoft’s re-implementation of SQL Server are based on his designs.

Goetz’s areas of expertise within database management systems include compile-time query optimization including extensible query optimization, run-time query execution including parallel query execution, indexing, and transactions. He has also worked on transactional memory, specifically techniques for software implementations of transactional memory.

Goetz studied Computer Science at TU Braunschweig from 1980 to 1983.

(*) HP Fellows are “pioneers in their fields, setting the standards for technical excellence and driving the direction of research in their respective disciplines”.

Resources:

NoSQL Data Stores (Free Downloads and Links).

Cloud Data Stores (Free Downloads and Links).

Graphs and Data Stores (Free Downloads and Links).

Databases in General (Free Downloads and Links).

##

Aug 17 11

On Versant`s technology. Interview with Vishal Bagga.

by Roberto V. Zicari

“We believe that data only becomes useful once it becomes structured.” — Vishal Bagga

There is a lot of discussion about NoSQL databases nowadays. But what about object databases?
I asked a few questions to Vishal Bagga, Senior Product Manager at Versant.

RVZ

Q1. How has Versant’s technology evolved over the past three years?

Vishal Bagga: Versant is a customer driven company. We work closely with our customers trying to understand how we can evolve our technology to meet their challenges – whether it’s regarding complexity, data size or demanding workloads.

In the last three years we have seen two very clear trends in our interactions with new and existing customers: growing data sizes and increasingly parallel workloads. This is very much in line with what the general database market is seeing. In addition, there were requests for simplified database management and monitoring.

Our state-of-the-art Versant Object Database 8, released last year, was designed for exactly these scenarios. We added increased scalability and performance on multi-core architectures, faster and better defragmentation tools, and Eclipse-based management and monitoring tools, to name a few. We are also re-architecting our database server technology to scale automatically where possible, without manual DBA intervention, and to allow online tuning (reconfiguring a database instance online without impacting applications).

Q2. On December 1, 2008 Versant acquired the assets of the database software business of Servo Software, Inc. (formerly db4objects, Inc.). What happened to db4objects since then? How does db4objects fit into Versant technology strategy?

Vishal Bagga: The db4o community is doing well and is an integral part of Versant. In fact, when we first acquired db4o at the end of 2008, there were just short of 50,000 registered members.
Today, the db4o community boasts nearly 110,000 members, having more than doubled in size in the last two-plus years.
In addition, db4o has had two major releases with significant advances in enterprise-grade features, such as online defragmentation support. In our latest major release, we announced a new data-replication capability between db4o and the large-scale, enterprise-class Versant database.
Versant sees a great need in the mobile markets for technology like db4o which can play well in the lightweight handheld, mobile computing and machine-to-machine space while leveraging big data aggregation servers like Versant which can handle the huge number of events coming off of these intelligent edge devices.
In the coming year, even greater synergies are being developed and our communities are merging into one single group dedicated to next generation NoSQL 2.0 technology development.

Q3. Versant database and NoSQL databases: what are the similarities and what are the differences?

Vishal Bagga: The "Not Only SQL" databases are essentially systems that evolved out of a specific business need: horizontally scalable systems running on commodity hardware with a simple "soft-schema" model, for use cases such as social networking, offline data crunching, distributed logging, and event-processing systems.

Relational databases were considered too slow, too expensive, difficult to manage and administer, and difficult to adapt to quickly changing models.

If I look at similarities between Versant and NoSQL, I would say that:

Both systems are designed around the inefficiency of JOINs. This is the biggest problem with relational databases. If you think about it, in most operational systems relations don't change, e.g. Blog:Article or Order:OrderItem, so why recalculate those relations each time they are accessed, using a methodology that gets slower and slower as the amount of data grows? JOINs have a use case, but for perhaps 20% of use cases, not 100%.

Both systems leverage an architectural shift to a "soft-schema" that allows scale-out capability: the ability to partition information across many physical nodes and treat those nodes as one logical database.
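The scale-out partitioning idea above can be sketched in a few lines. This is purely an illustrative model (the node names and routing function are hypothetical, not Versant's actual placement logic): hash an object's key to pick its home node, so every client routes the same key to the same physical node.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical physical nodes

def home_node(object_key: str, nodes=NODES) -> str:
    """Map an object key to one node, so the cluster behaves as one logical database."""
    digest = hashlib.sha256(object_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]

# The same key always routes to the same node, whichever client asks.
assert home_node("order:1042") == home_node("order:1042")
```

A simple modulo scheme like this illustrates the principle only; production systems typically use consistent hashing or directory-based placement so that adding a node does not remap most keys.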

When it comes to differences:

The biggest, in my opinion, is the complexity of the data. Versant allows you to model very complicated data models seamlessly, whereas doing so with a NoSQL solution would take much more effort, and you would need to write a lot of application code to represent the data model.
In this respect, Versant prefers the term "soft-schema" versus the term "schemaless", terms which are often used interchangeably in discussion.
We believe that data only becomes useful once it becomes structured, in fact that is the whole point of technologies like Hadoop, to churn unstructured data looking for a way to structure it into something useful.
NoSQL technologies that bill themselves as "schemaless" are in denial of the fact that they leave the application developer the burden of defining the structure and mapping the data into that structure in the language of the application space. In many ways, it is the mapping problem all over again. Moreover, that kind of data management is very hard to change over time, leading to a brittle solution that is difficult to optimize for more than one use case. A "soft-schema" lends itself to a more manageable and extensible enterprise system, in which the database retains important elements of structure while still being able to store and manipulate unstructured types.

Another is the difference in the consistency model. Versant is ACID-centric, and Versant's customers depend on this for their mission-critical systems; it would be nearly impossible for these systems to use NoSQL given the relaxed constraints. Versant can operate in a relaxed-consistency (CAP trade-off) mode, but that is not our only mode of operation. You use it where it is really needed; you are not forced into using it unilaterally.

NoSQL systems make you store your data in a way that can be looked up efficiently by a key. But what if you want to look something up differently? That is likely to be terribly inefficient. This may be acceptable by design, but a lot of people do not realize it is a big change in mindset. Versant offers a more balanced approach in which you can navigate between related objects using references; you can, for example, define a root object and then navigate your tree from that object. At the same time, you can run ad-hoc queries whenever you want to.
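The navigation model described above can be illustrated with a small sketch. This is a hypothetical Python model, not Versant's API: relations are stored as direct object references, so traversing from a root object follows pointers instead of recomputing a join, and ad-hoc queries over the same graph remain possible.

```python
class OrderItem:
    def __init__(self, sku: str, qty: int):
        self.sku = sku
        self.qty = qty

class Order:
    def __init__(self, order_id: str):
        self.order_id = order_id
        self.items = []  # direct references: the relation is stored, not recomputed

# Build a small object graph with Order as the root object.
root = Order("1042")
root.items.append(OrderItem("widget", 3))
root.items.append(OrderItem("gadget", 1))

# Navigate from the root by following references (no join needed)...
total = sum(item.qty for item in root.items)

# ...while ad-hoc filtering over the same graph is still available.
widgets = [item for item in root.items if item.sku == "widget"]
```

In an ODBMS, persistence and indexing would sit underneath this object graph; the sketch only shows the access pattern the interview contrasts with key-only lookup.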

Q4. Big Data: Can Versant database be useful when dealing with petabytes of user data? How?

Vishal Bagga: I don't see why not. Versant was designed to work on a network of databases from the very start. Dealing with a petabyte is really about designing a system with the right architecture, and Versant has that architecture just as much as anyone in the database space who claims to handle a petabyte. Make no mistake: no matter how you do it, it is a non-trivial task. Today, our largest customer databases are in the hundreds-of-terabytes range, so getting to a petabyte is really a matter of needing that much data.

Q5. Hadoop is designed to process large batches of data quickly. Do you plan to use Hadoop and leverage components of the Hadoop ecosystem like HBase, Pig, and Hive?

Vishal Bagga: Yes, and some of our customers already do that today. A question for you: why do those layers exist? I would say the answer is that most of these early NoSQL 1.0 technologies do not handle real-world complexity in information models, so these layers are built to compensate for that. That is exactly where Versant's NoSQL 2.0 technology fits into the picture: we help people deal with the complexity of information models, something that first-generation NoSQL has not managed to accomplish.

Q6. Do you think that projects such as JSON (JavaScript Object Notation) and MessagePack (an efficient binary object-serialization format) play a role in the ODBMS market?

Vishal Bagga: Absolutely. We believe in open standards. Fortunately, you can store any type in an ODBMS. These serialization formats are particularly important for today's most popular client frameworks, such as those built on Ajax. Finding ways to deliver a soft-schema in a client-friendly format is essential to ease the development burden.
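A minimal sketch of delivering a soft-schema object in a client-friendly format (illustrative Python only; the `Article` class and helper are hypothetical, not a Versant or db4o API): the object's attributes are flattened to JSON, and unset fields are simply omitted rather than forced into a fixed schema.

```python
import json

class Article:
    """A 'soft-schema' object: a set of attributes, not a fixed relational row."""
    def __init__(self, title, tags, author=None):
        self.title = title
        self.tags = tags
        self.author = author  # optional: absent fields stay None

def to_client_json(obj) -> str:
    # Serialize the object's state for an Ajax-style client, skipping unset fields.
    payload = {k: v for k, v in vars(obj).items() if v is not None}
    return json.dumps(payload, sort_keys=True)

doc = to_client_json(Article("NoSQL 2.0", ["odbms", "nosql"]))
```

MessagePack would replace `json.dumps` with a binary encoding of the same payload; the soft-schema idea, carrying structure in the data rather than in a rigid table definition, is identical.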

Q7. Looking at three elements, Data, Platform, and Analysis, where is Versant heading?

Vishal Bagga: It is a difficult question, as database and data management are increasingly cross-cutting concerns. It used to be perfectly fine to keep your analysis in offline OLAP systems, but these days there is an increasing push to bring analytics into the real-time business.
So you play with Data and you play with Analytics, whether you do it directly or in concert with other technologies through partnerships. Certainly, as Versant embraces Platform as a Service, we will do so through ecosystem partners who are paving the way with new development and deployment methodologies.

Related Posts

Objects in Space: “Herschel” the largest telescope ever flown. (March 18, 2011)

Benchmarking ORM tools and Object Databases. (March 14, 2011)

Robert Greene on "New and Old Data stores". (December 2, 2010)

Object Database Technologies and Data Management in the Cloud. (September 27, 2010)

##