Do we still have an impedance mismatch problem? – Interview with José A. Blakeley and Rowan Miller.
“The impedance mismatch problem has been significantly reduced, but not entirely eliminated” — José A. Blakeley.
“Performance and overhead of ORMs have been and will continue to be a concern. However, in the last few years there have been significant performance improvements” — José A. Blakeley.
Do we still have an impedance mismatch problem in 2012?
Not an easy question to answer. To get a sense of where we are now, I have interviewed José A. Blakeley and Rowan Miller. José is a Partner Architect in the SQL Server Division at Microsoft, and Rowan is the Program Manager for the ADO.NET Entity Framework team at Microsoft.
The focus of the interview is on ORM (object-relational mapping) technology and the new release of Entity Framework (EF 5.0). Entity Framework is an object-relational mapper developed by Microsoft that enables .NET developers to work with relational data using domain-specific objects.
RVZ
Q1. Do we still have an impedance mismatch problem in 2012?
Blakeley: The impedance mismatch problem has been significantly reduced, but not entirely eliminated.
Q2. In the past there have been many attempts to remove the impedance mismatch. Is ORM (object-relational mapping) technology really the right solution for that in your opinion? Why? What other alternative solutions are feasible?
Blakeley: There have been several attempts to remove the impedance mismatch. In the late ’80s and early ’90s, object databases and persistent programming languages made significant progress in persisting data structures built in languages like C++, Smalltalk, and Lisp almost seamlessly. For instance, persistent C++ languages could persist structures containing untyped pointers (e.g., void*). However, to succeed over relational databases, persistent languages needed to also support declarative, set-oriented queries and transactions.
Object database systems failed because they didn’t have strong support for queries, query optimization, and execution, and they didn’t have strong, well-engineered support for transactions. At the same time, relational databases grew by adding extended relational capabilities, which reduced the need for persistent languages, and so the world continued to gravitate around relational database systems. Object-relational mapping (ORM) systems, introduced in the last decade together with programming languages like C# which added built-in query capabilities (i.e., Language Integrated Query – LINQ) to the language, are the latest attempt to eliminate the impedance mismatch.
Object-relational mapping technology, like the Entity Framework, aims at providing a general solution to the problem of mapping database tables to programming language data structures. ORM technology is the right layer for bridging the complex mapping problem between tables and programming constructs. For instance, the queries needed to map a set of tables to a class inheritance hierarchy can be quite complex. Similarly, propagating updates from the programming language structures to the tables in the database is a complex problem. Applications can build these mappings by hand, but the process is time-consuming and error-prone. Automated ORMs can do this job correctly and faster.
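To see the kind of hand-written mapping an ORM automates, consider a single table holding a class hierarchy with a discriminator column. The sketch below is in Java rather than .NET, and all class and column names are illustrative assumptions, not any ORM's actual output; it shows the materialization code you would otherwise write and maintain by hand for every hierarchy.

```java
import java.util.Map;

// A class hierarchy stored in one table with a "person_type" discriminator
// column. Hand-mapping each row to the right subclass is the tedious,
// error-prone code that an ORM generates and keeps in sync for you.
abstract class Person {
    String name;
}

class Student extends Person {
    String major;
}

class Teacher extends Person {
    String department;
}

class HandMapper {
    // Map one row (column name -> value) to the right subclass instance.
    static Person mapRow(Map<String, String> row) {
        switch (row.get("person_type")) {
            case "STUDENT": {
                Student s = new Student();
                s.name = row.get("name");
                s.major = row.get("major");
                return s;
            }
            case "TEACHER": {
                Teacher t = new Teacher();
                t.name = row.get("name");
                t.department = row.get("department");
                return t;
            }
            default:
                throw new IllegalArgumentException("unknown discriminator");
        }
    }

    public static void main(String[] args) {
        Person p = mapRow(Map.of("person_type", "STUDENT",
                                 "name", "Ada", "major", "CS"));
        System.out.println(p.getClass().getSimpleName() + " " + p.name);
    }
}
```

Multiply this by every class, column rename, and schema change in a real application and the appeal of automating it becomes clear.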
Q3. What are the current main issues with O/R mappings?
Blakeley, Miller: In the area of functionality, enabling a complete ORM covering all programming constructs is a challenge. For example, up until its latest release EF lacked support for enum types. It’s also hard for an ORM to support the full range of concepts supported by the database. For example, support for spatial data types has been available in SQL Server since 2008 but native support has only just been added to EF. This challenge only gets harder when you consider most ORMs, including EF, support multiple database engines, each with different capabilities.
Another challenge is performance. Anytime you add a layer of abstraction there is a performance overhead, and this is certainly true for ORMs. One critical area of performance is the time taken to translate a query into SQL that can be run against the database. In EF this involves taking the LINQ query that a user has written and translating it to SQL. In EF5 we made some significant improvements in this area by automatically caching and re-using these translations. The quality of the SQL that is generated is also key to performance; there are many different ways to write the same query, and the performance difference can be huge. Things like unnecessary casts can cause the database not to use an index. With every release of EF we improve the SQL that is generated.
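The translation-caching idea is independent of EF's internals and can be sketched generically: pay the expensive query-to-SQL translation once per distinct query shape, and reuse the result afterwards. This is an illustrative Java sketch under that assumption, not EF code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Generic sketch of translation caching: the costly query-to-SQL translation
// runs once per distinct query shape and the result is reused thereafter.
// Illustrative only; not how EF is implemented internally.
class QueryTranslationCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private int translations = 0; // how often the expensive path actually ran

    // 'translate' stands in for the expensive LINQ-to-SQL compilation step.
    String sqlFor(String queryShape, Function<String, String> translate) {
        return cache.computeIfAbsent(queryShape, shape -> {
            translations++;
            return translate.apply(shape);
        });
    }

    int translationCount() { return translations; }

    public static void main(String[] args) {
        QueryTranslationCache cache = new QueryTranslationCache();
        Function<String, String> translate =
            shape -> "SELECT * FROM Customers WHERE Country = @p0"; // fake SQL
        cache.sqlFor("Customers.Where(c => c.Country == p0)", translate);
        cache.sqlFor("Customers.Where(c => c.Country == p0)", translate);
        System.out.println("translations: " + cache.translationCount());
    }
}
```

Running the same query shape twice triggers only one translation; in a real ORM the cache key also has to account for parameter types and context configuration.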
Adding a layer of abstraction also introduces another challenge: ORMs make it easy to map a relational database schema and have queries constructed for you, but because this translation is handled internally by the ORM, it can be difficult to debug when things don’t behave or perform as expected. There are a number of great tools, such as LINQPad and Entity Framework Profiler, which can help debug such scenarios.
Q4. What is special about Microsoft’s ORM (object-relational mapping) technology?
Miller: Arguably the biggest differentiator of EF isn’t a single technical feature but how deeply it integrates with the other tools and technologies that developers use, such as Visual Studio, LINQ, MVC and many others. EF also provides powerful mapping capabilities that allow you to solve some big impedance differences between your database schema and the shape of the objects you want to write code against. EF also gives you the flexibility of working in a designer (Model & Database First) or purely in code (Code First). There is also the benefit of Microsoft’s agreement to support and service the software that it ships.
Q5. Back in 2008 LINQ was a brand-new development in programming languages. What is the current status of LINQ now? What is LINQ used for in practice?
Miller: LINQ is a really solid feature, and while there probably won’t be a lot of new advancements in LINQ itself, we should see new products continuing to take advantage of it. I think that is one of the great things about LINQ: it lends itself to so many different scenarios. For example, there are LINQ providers today that allow you to query in-memory objects, relational databases, and XML files, just to name a few.
Q6. The original design of the Entity Framework dates back to 2006. Now, EF version 5.0 is currently available in Beta. What’s in EF 5.0?
Miller: Before we answer that question let’s take a minute to talk about EF versioning. The first two releases of EF were included as part of Visual Studio and the .NET Framework and were referred to using the version of the .NET Framework that they were included in. The first version (EF or EF3.5) was included in .NET 3.5 SP1 and the second version (EF4) was included in .NET 4. At that point we really wanted to release more often than Visual Studio and the .NET Framework, so we started to ship ‘out-of-band’ using NuGet. Once we started shipping out-of-band we adopted semantic versioning (as defined at http://semver.org). Since then we’ve released EF 4.1, 4.2, and 4.3, and EF 5.0 is currently available in Beta.
EF has come a long way since it was first released in Visual Studio 2008 and .NET 3.5. As with most v1 products there were a number of important scenarios that weren’t supported in the first release of EF.
EF4 was all about filling in these gaps and included features such as Model First development, support for POCO classes, customizable code generation, the ability to expose foreign key properties in your objects, improved support for unit testing applications built with EF and many other features.
In EF 4.1 we added the DbContext API and Code First development. The DbContext API was introduced as a cleaner and simpler API surface over EF that simplifies the code you write and allows you to be more productive. Code First gives you an alternative to the designer and allows you to define your model using just code. Code First can be used to map to an existing database or to generate a new database based on your code. EF 4.2 was mainly about bug fixes and adding some components to make it easier for tooling to interact with EF. The EF4.3 release introduced the new Code First Migrations feature that allows you to incrementally change your database schema as your Code First model evolves over time.
EF 5.0 is currently available in Beta and introduces some long awaited features including enum support, spatial data types, table valued function support and some significant performance improvements. In Visual Studio 11 we’ve also updated the EF designer to support these new features as well as multiple diagrams within a model and allowing you to apply coloring to your model.
Q7. What features did not make it into EF 5.0 that you consider important to add in a future release?
Miller: There are a number of things that our customers are asking for that are at the top of our list for the upcoming versions of EF. These include asynchronous query support, improved support for SQL Azure (automatic connection retries and built-in federation support), the ability to use Code First to map to stored procedures and functions, pluggable conventions for Code First, and better performance for the designer and at runtime. If we get a significant number of those done in EF6 I think it will be a really great release. Keep in mind that because we now also ship in between Visual Studio releases, you’re not looking at years between EF releases any more.
Q8. If your data is made of Java objects, would Entity Framework be useful? And if yes, how?
Blakeley: Unfortunately not. The EF ORM is written in C# and runs on the .NET Common Language Runtime (CLR). To support Java objects, we would need to have a .NET implementation of Java like Sun’s Java.Net.
Q9. EF offers different Entity Data Model design approaches: Database First, Model First, Code First. Why do you need three different design approaches? When would you recommend using each of these approaches?
Miller: This is a great question and something that confuses a lot of people. Whichever approach you choose, the decision only affects the way in which you design and maintain the model; once you start coding against the model there is no difference. Which one to use boils down to two fundamental questions. First, do you want to model using boxes and lines in a designer, or would you rather just write code? Second, are you working with an existing database or are you creating a new database?
If you want to work with boxes and lines in a designer then you will be using the EF Designer that is included in Visual Studio. If you’re targeting an existing database then the Database First workflow allows you to reverse engineer a model from the database; you can then tweak that model using the designer. If you’re going to be creating a new database then the Model First workflow allows you to start with an empty model and build it up using the designer. You can then generate a database based on the model you have created. Whether you choose Model First or Database First, the classes that you will code against are generated for you. This generation is customizable, though, so if the generated code doesn’t suit your needs there is plenty of opportunity to customize it.
If you would rather forgo the designer and do all your modeling in code then Code First is the approach you want. If you are targeting an existing database you can either hand-code the classes and mapping or use the EF Power Tools (available on the Visual Studio Gallery) to reverse engineer some starting-point code for you. If you are creating a new database then Code First can also generate the database for you, and Code First Migrations allows you to control how that database is modified as your model changes over time. The idea of generating a database often scares people, but Code First gives you a lot of control over the shape of your schema. Ultimately, if there are things that you can’t control in the database using the Code First API then you have the opportunity to apply them using raw SQL in Code First Migrations.
Q10. There are concerns about the performance and the overhead generated by ORM technology. What is your opinion on that?
Blakeley: Performance and overhead of ORMs have been and will continue to be a concern. However, in the last few years there have been significant performance improvements: the code path for the mapping implementations has been reduced, relational query optimizers continue to get better at handling extremely complex queries, and processor technology continues to improve, with abundant RAM allowing for larger object caches that speed up the mapping.
————–
José Blakeley is a Partner Architect in the SQL Server Division at Microsoft where he works on server programmability, database engine extensibility, query processing, object-relational functionality, scale-out database management, and scientific database applications. José joined Microsoft in 1994. Some of his contributions include the design of the OLE DB data access interfaces, the integration of the .NET runtime inside the SQL Server 2005 products, the development of many extensibility features in SQL Server, and the development of the ADO.NET Entity Framework in Visual Studio 2008. Since 2009 José has been building the SQL Server Parallel Data Warehouse, a scale-out MPP SQL Server appliance. José has authored many conference papers, book chapters and journal articles on design aspects of relational and object database management systems, and data access. Before joining Microsoft, José was a member of the technical staff with Texas Instruments where he was a principal investigator in the development of the DARPA-funded Open-OODB object database management system. José became an ACM Fellow in 2009. He received a Ph.D. in computer science from the University of Waterloo, Canada, on materialized views, a feature implemented in all main commercial relational database products.
Rowan Miller works as a Program Manager for the ADO.NET Entity Framework team at Microsoft. He speaks at technical conferences and blogs. Rowan lives in Seattle, Washington with his wife Athalie. Prior to moving to the US he resided in the small state of Tasmania in Australia.
Outside of technology Rowan’s passions include snowboarding, mountain biking, horse riding, rock climbing and pretty much anything else that involves being active. The primary focus of his life, however, is to follow Jesus.
For further reading
– Entity Framework (EF) Resources: Software and Tools | Articles and Presentations
– ORM Technology: Blog Posts | Articles and Presentations
– Object-Relational Impedance Mismatch: Blog Posts | Articles and Presentations
Interview with Mike Stonebraker.
“I believe that “one size does not fit all”. I.e. in every vertical market I can think of, there is a way to beat legacy relational DBMSs by 1-2 orders of magnitude.” — Mike Stonebraker.
I have interviewed Mike Stonebraker, serial entrepreneur and professor at MIT. In particular, I wanted to know more about his latest endeavor, VoltDB.
RVZ
Q1. In your career you developed several data management systems, namely: the Ingres relational DBMS, the object-relational DBMS PostgreSQL, the Aurora/Borealis stream processing engine (commercialized as StreamBase), the C-Store column-oriented DBMS (commercialized as Vertica), and the H-Store transaction processing engine (commercialized as VoltDB). In retrospect, what are, in a nutshell, the main differences and similarities between all these systems? What are their respective strengths and weaknesses?
Stonebraker: In addition, I am building SciDB, a DBMS oriented toward complex analytics.
I believe that “one size does not fit all”. I.e. in every vertical market I can think of, there is a way to beat legacy relational DBMSs by 1-2 orders of magnitude.
The techniques used vary from market to market. Hence, StreamBase, Vertica, VoltDB and SciDB are all specialized to different markets. At this point Postgres and Ingres are legacy code bases.
Q2. In 2009 you co-founded VoltDB, a commercial start-up based on ideas from the H-Store project. H-Store is a distributed in-memory OLTP system. What is special about VoltDB? How does it compare with other in-memory databases, for example SAP HANA, or Oracle TimesTen?
Stonebraker: A bunch of us wrote a paper “Through the OLTP Looking Glass and What We Found There” (SIGMOD 2008). In it, we identified 4 sources of significant OLTP overhead (concurrency control, write-ahead logging, latching and buffer pool management).
Unless you make a big dent in ALL FOUR of these sources, you will not run dramatically faster than current disk-based RDBMSs. To the best of my knowledge, VoltDB is the only system that eliminates or drastically reduces all four of these overhead components. For example, TimesTen uses conventional record-level locking, an ARIES-style write-ahead log, and conventional multi-threading, leading to a substantial need for latching. Hence, they eliminate only one of the four sources.
Q3. VoltDB is designed for what you call “high velocity” applications. What do you mean by that? What are the main technical challenges for such systems?
Stonebraker: Consider an application that maintains the “state” of a multi-player internet game. This state is subject to a collection of perhaps thousands of streams of player actions. Hence, there is a collective “firehose” that the DBMS must keep up with.
In a variety of OLTP applications, the input is a high velocity stream of some sort. These include electronic trading, wireless telephony, digital advertising, and network monitoring.
In addition to drinking from the firehose, such applications require ACID transactions and light real-time analytics, exactly the requirements of traditional OLTP.
In effect, the definition of transaction processing has been expanded to include non-traditional applications.
Q4. Goetz Graefe (HP Fellow) said in an interview that “disk-less databases are appropriate where the database contains only application state, e.g., current account balances, currently active logins, current shopping carts, etc. Disks will continue to have a role and economic value where the database also contains history (e.g. cold history such as transactions that affected the account balances, login & logout events, click streams eventually leading to shopping carts, etc.)” What is your take on this?
Stonebraker: In my opinion the best way to organize data management is to run a specialized OLTP engine on current data. Then, send transaction history data, perhaps through an ETL component, to a companion data warehouse. VoltDB is a factor of 50 or so faster than legacy RDBMSs on the transaction piece, while column stores, such as Vertica, are a similar amount faster on historical analytics. In other words, specialization allows each component to run dramatically faster than a “one size fits all” solution.
A “two system” solution also avoids resource management issues and lock contention, and is very widely used as a DBMS architecture.
Q5. Where will the (historical) data go if we have no disks? In the Cloud?
Stonebraker: Into a companion data warehouse. The major DW players are all disk-based.
Q6. How does VoltDB ensure durability?
Stonebraker: VoltDB automatically replicates all tables. On a failure, it performs “Tandem-style” failover and eventual failback. Hence, it totally masks most errors. To protect against cluster-wide failures (such as power issues), it supports snapshotting of data and an innovative “command logging” capability. Command logging has been shown to be wildly faster than data logging, and supports the same durability as data logging.
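The contrast with data logging can be shown in toy form: a command log records only the transaction invocations (procedure name plus arguments), and recovery re-executes them in order, which is safe precisely when execution is deterministic and ordered. All names below are illustrative assumptions; this is not VoltDB code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of command logging: log the invocation, not the data changes,
// and recover by deterministically replaying the log in its original order.
class CommandLogDemo {
    static class Command {
        final String procedure;
        final String account;
        final long amount;
        Command(String procedure, String account, long amount) {
            this.procedure = procedure;
            this.account = account;
            this.amount = amount;
        }
    }

    // Recovery: re-execute the logged commands in order to rebuild state.
    static Map<String, Long> replay(List<Command> log) {
        Map<String, Long> balances = new HashMap<>();
        for (Command c : log) {
            if (c.procedure.equals("deposit")) {
                balances.merge(c.account, c.amount, Long::sum);
            }
        }
        return balances;
    }

    public static void main(String[] args) {
        List<Command> log = new ArrayList<>();
        log.add(new Command("deposit", "acct-1", 100));
        log.add(new Command("deposit", "acct-1", 50));
        // If the node crashed here, in-memory state is lost, but the small
        // command log survives; replaying it rebuilds the same balances.
        System.out.println("acct-1 = " + replay(log).get("acct-1"));
    }
}
```

The log entry is a few dozen bytes per transaction regardless of how many rows the transaction touches, which is the intuition behind command logging being so much cheaper than data logging.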
Q7. How does VoltDB support atomicity, consistency and isolation?
Stonebraker: All transactions are executed (logically) in timestamp order. Hence, the net outcome of a stream of transactions on a VoltDB database is equivalent to their serial execution in timestamp order.
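That guarantee can be illustrated with a toy example: however transactions happen to arrive, applying them one at a time in timestamp order yields the single outcome of the serial schedule. The operations and names here are illustrative assumptions, not VoltDB's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy sketch: serialize transactions by timestamp, then apply them one at a
// time. The result is independent of arrival order -- it is the outcome of
// the serial schedule in timestamp order.
class TimestampOrderDemo {
    static class Txn {
        final long timestamp;
        final String op;   // "set" or "double" applied to one balance
        final long value;  // used only by "set"
        Txn(long timestamp, String op, long value) {
            this.timestamp = timestamp;
            this.op = op;
            this.value = value;
        }
    }

    static long execute(List<Txn> arrived) {
        List<Txn> schedule = new ArrayList<>(arrived);
        schedule.sort(Comparator.comparingLong(t -> t.timestamp)); // serialize
        long balance = 0;
        for (Txn t : schedule) {
            balance = t.op.equals("set") ? t.value : balance * 2;
        }
        return balance;
    }

    public static void main(String[] args) {
        // The "double" (ts=2) arrives before the "set" (ts=1); naive arrival-
        // order execution would give 10, but timestamp order gives 20.
        List<Txn> arrived = List.of(new Txn(2, "double", 0), new Txn(1, "set", 10));
        System.out.println(execute(arrived));
    }
}
```

Because "set" and "double" do not commute, the example makes visible why a fixed serial order is needed to get one well-defined answer.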
Q8. Would you call VoltDB a relational database system? Does it support standard SQL? How do you handle scalability problems for complex joins of large amounts of data?
Stonebraker: VoltDB supports standard SQL.
Complex joins should be run on a companion data warehouse. After all, the only way to interleave “big reads” with “small writes” in a legacy RDBMS is to use snapshot isolation or run with a reduced level of consistency.
You either get an out-of-date but consistent answer, or an up-to-date but inconsistent answer. Directing big reads to a companion DW gives you the same result as snapshot isolation. Hence, I don’t see any disadvantage to doing big reads on a companion system.
Concerning larger amounts of data, our experience is that OLTP problems with more than a few terabytes of data are quite rare. Hence, these can easily fit in main memory, using a VoltDB architecture.
In addition, we are planning extensions of the VoltDB architecture to handle larger-than-main-memory data sets. Watch for product announcements in this area.
Q9. Does VoltDB handle disaster recovery? If yes, how?
Stonebraker: VoltDB just announced support for replication over a wide area network. This capability supports failover to a remote site if a disaster occurs. Check out the VoltDB web site for details.
Q10. VoltDB’s mission statement is “to deliver the fastest, most scalable in-memory database products on the planet”. What performance measurements do you have so far to sustain this claim?
Stonebraker: We have run TPC-C at about 50 times the performance of a popular legacy RDBMS. In addition, we have shown linear TPC-C scalability to 384 cores (more than 3 million transactions per second). That was the biggest cluster we could get access to; there is no reason why VoltDB would not continue to scale.
Q11. Can In-Memory Data Management play a significant role also for Big Data Analytics (up to several PB of data)? If yes, how? What are the largest data sets that VoltDB can handle?
Stonebraker: VoltDB is not focused on analytics. We believe they should be run on a companion data warehouse.
Most of the warehouse customers I talk to want to keep increasingly large amounts of increasingly diverse history to run their analytics over. The major data warehouse players are routinely being asked to manage petabyte-sized data warehouses. It is not clear how important main memory will be in this vertical market.
Q12. You were very critical of Apache Hadoop, but VoltDB offers an integration with Hadoop. Why? How does it work technically? What are the main business benefits of such an integration?
Stonebraker: Consider the “two system” solution mentioned above. VoltDB is intended for the OLTP portion, and some customers wish to run Hadoop as a data warehouse platform. To facilitate this architecture, VoltDB offers a Hadoop connector.
Q13. How “green” is VoltDB? What are the tradeoffs between total power consumption and performance? Do you have any benchmarking results for that?
Stonebraker: We have no official benchmarking numbers. However, on a large variety of applications VoltDB is a factor of 50 or more faster than traditional RDBMSs. Put differently, if legacy folks need 100 nodes, then we need 2!
In effect, if you can offer vastly superior performance (say times 50) on the same hardware, compared to another system, then you can offer the same performance on 1/50th of the hardware. By definition, you are 50 times “greener” than they are.
Q14. You are currently working on science-oriented DBMSs and search engines for accessing the deep web. Could you please give us some details. What kind of results did you obtain so far?
Stonebraker: We are building SciDB, which is oriented toward complex analytics (regression, clustering, machine learning, …). It is my belief that such analytics will become much more important in the future. Such analytics are invariably defined on arrays, not tables. Hence, SciDB is an array DBMS, supporting a dialect of SQL for array data. We expect it to be wildly faster than legacy RDBMSs on this kind of application. See SciDB.org for more information.
Q15. You are a co-founder of several venture-capital-backed start-ups. In which areas?
Stonebraker: The recent ones are: StreamBase (stream processing), Vertica (data warehouse market), VoltDB (OLTP), Goby.com (data aggregation of web sources), and Paradigm4 (SciDB and complex analytics).
Check the company web sites for more details.
——————————–
Mike Stonebraker
Dr. Stonebraker has been a pioneer of database research and technology for more than a quarter of a century. He was the main architect of the INGRES relational DBMS, and the object-relational DBMS, POSTGRES. These prototypes were developed at the University of California at Berkeley, where Stonebraker was a Professor of Computer Science for twenty-five years. More recently at M.I.T. he was a co-architect of the Aurora/Borealis stream processing engine, the C-Store column-oriented DBMS, and the H-Store transaction processing engine. Currently, he is working on science-oriented DBMSs, OLTP DBMSs, and search engines for accessing the deep web. He is the founder of five venture-capital-backed startups, which commercialized his prototypes. Presently he serves as Chief Technology Officer of VoltDB, Paradigm4, Inc. and Goby.com.
Professor Stonebraker is the author of scores of research papers on database technology, operating systems, and the architecture of system software services. He was awarded the ACM System Software Award in 1992 for his work on INGRES. Additionally, he was awarded the first annual Innovation Award by the ACM SIGMOD special interest group in 1994, and was elected to the National Academy of Engineering in 1997. He was awarded the IEEE John von Neumann Medal in 2005, and is presently an Adjunct Professor of Computer Science at M.I.T.
Related Posts
– In-memory database systems. Interview with Steve Graves, McObject. (March 16, 2012)
– On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum. (February 1, 2012)
– A super-set of MySQL for Big Data. Interview with John Busch, Schooner. (February 20, 2012)
– On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica. (November 16, 2011)
– vFabric SQLFire: Better than RDBMS and NoSQL? (October 24, 2011)
————–
“Enterprise Search implies being able to search multiple types of data generated by an enterprise. DSE2 takes this to the next level by integrating this with a real time database AND powerful analytics.” — Jonathan Ellis.
I wanted to learn more about DataStax Enterprise 2.0, the new release of the commercial version of Cassandra. I interviewed Jonathan Ellis, CTO and co-founder of DataStax and project chair of Apache Cassandra.
RVZ
Q1. What are the new main features of DataStax Enterprise 2.0 (DSE 2.0)?
Jonathan Ellis: The one I’m most excited about is the integration of Solr support. This is not a side-by-side system like Solandra, where Solr has to maintain a separate copy of the data to be indexed, but full integration with Cassandra: you insert your data once, and access it via Cassandra, Solr, or Hadoop.
Search is an increasingly important ingredient for modern applications, and DSE2 is the first to offer fully integrated, scalable search to developers.
DSE2 also includes Apache Sqoop for easy migration of relational data into Cassandra, and plug-and-play log indexing for your application.
Q2. How does the integration of Cassandra with Apache Hadoop and Apache Solr work technically?
Jonathan Ellis: Basically, Cassandra offers a pluggable index API, and we created a Solr-backed index implementation with it.
We wrote about the technical details here.
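In outline, the plug-in pattern works like the toy sketch below: the store notifies a pluggable index on every write, so a single insert keeps both the primary data and the search index in sync. The interface and class names are illustrative assumptions, not Cassandra's actual API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of a pluggable index: the store calls the index on each write.
interface RowIndex {
    void onWrite(String key, String value);
    List<String> search(String term);
}

// Stands in for a search-backed (e.g. Solr-style) index implementation.
class InvertedIndex implements RowIndex {
    private final Map<String, List<String>> terms = new HashMap<>();

    public void onWrite(String key, String value) {
        for (String term : value.split("\\s+")) {
            terms.computeIfAbsent(term.toLowerCase(), t -> new ArrayList<>()).add(key);
        }
    }

    public List<String> search(String term) {
        return terms.getOrDefault(term.toLowerCase(), List.of());
    }
}

class IndexedStore {
    private final Map<String, String> rows = new HashMap<>();
    private final RowIndex index;

    IndexedStore(RowIndex index) { this.index = index; }

    // One insert updates both the primary store and the index, so no second
    // copy of the data has to be loaded into the search system separately.
    void put(String key, String value) {
        rows.put(key, value);
        index.onWrite(key, value);
    }

    String get(String key) { return rows.get(key); }

    public static void main(String[] args) {
        IndexedStore store = new IndexedStore(new InvertedIndex());
        store.put("row1", "enterprise search with cassandra");
        System.out.println(store.get("row1"));
    }
}
```

The point of the pattern is the single write path: the "Solandra" alternative the answer contrasts with would instead require feeding the data to the store and to the search system as two separate operations.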
Q3. What exactly is Enterprise Search, and why choose Apache Solr?
Jonathan Ellis: Enterprise Search implies being able to search multiple types of data generated by an enterprise. DSE2 takes this to the next level by integrating this with a real time database AND powerful analytics.
Solr is the gold standard for search, much the same way that Hadoop is the gold standard for big data map/reduce and analytics. There’s an ecosystem of tools that build on Solr, so offering true Solr support is much more powerful than implementing a proprietary full-text search engine.
Q4. What are the main business benefits of such integration?
Jonathan Ellis: First, developers and administrators have one database and vendor to concern themselves with instead of multiple databases and many software suppliers. Second, the built-in technical benefits of running both Solr and Hadoop on top of Cassandra yield continuous uptime for critical applications, as well as future-proofing those same apps where growing data volumes and increased user traffic are concerned.
Finally, customers save anywhere from 80-90% over traditional RDBMS vendors by going with DSE. For example, Constant Contact estimated that a new project they had in the works would take $2.5 million and 9 months on traditional relational technology, but with Cassandra, they delivered it in 3 months for $250,000. That’s one third the time and one tenth the cost; not bad!
Q5. It looks like you are attempting to compete with Google. Is this correct?
Jonathan Ellis: DSE2 is about providing search as a building block for applications, not necessarily delivering an off-the-shelf search appliance.
Compared to Google’s App Engine product, it’s fair to say that DSE 2.0 provides a similar, scalable platform to build applications on. DSE 2.0 is actually ahead of the game there: Google has announced but not yet delivered full-text search for App Engine.
Another useful comparison is to Amazon Web Services: DSE 2.0 gives you the equivalent of Amazon’s DynamoDB, S3, Elastic MapReduce, and CloudSearch in a single, integrated stack. So instead of having to insert documents once in S3 and again in CloudSearch, you just add them to DSE (with any of the Solr, Cassandra, or Hadoop APIs) without having to write code to keep multiple copies in sync when updates happen.
Q6. How do you manage to run real-time, analytics and search operations in the same database cluster, without performance or resource contention problems?
Jonathan Ellis: DSE offers elastic workload partitioning: your analytics jobs run against their own copies of the data, kept in sync by Cassandra replication, so they don’t interfere with your real time queries. When your workload changes, you can re-provision existing nodes from the real time side to the analytical, or vice versa.
Q7. You do not require ETL software to move data between systems. How does it work instead?
Jonathan Ellis: All DSE nodes are part of a single logical Cassandra cluster. DSE tells Cassandra how many copies to keep for which workload partitions, and Cassandra keeps them in sync with its battle-tested replication.
So your real time nodes will have access to new analytical output almost instantly, and you never have to write ETL code to move real time data into your analytical cluster.
Q8. Could you give us some examples of Big Data applications that are currently powered by DSE 2.0?
Jonathan Ellis:
A recent example is Healthx, which develops and manages online portals and applications for the healthcare market. They handle things such as enrollment, reporting, claims management, and business intelligence.
They have to manage countless health groups, individual members, doctors, diagnoses, and a lot more. Data comes in very fast, from all over, changes constantly, and is accessed all the time.
Healthx especially likes the new search capabilities in DSE 2.0. In addition to being able to handle real-time and analytic work, their users can now easily perform lightning-fast searches for things like, ‘find me a podiatrist who is female, speaks German, and has an office close to where I live.’
Q9. What about Big Data applications which also need to use Relational Data? Is it possible to integrate DSE 2.0 with a Relational Database? If yes, how? How do you handle query of data from various sources?
Jonathan Ellis: Most customers start by migrating their highest-volume, most-frequently-accessed data to DSE (e.g. with the Sqoop tool I mentioned), and leave the rest in a relational system. So RDBMS interoperability is very common at that level.
It’s also possible to perform analytical queries that mix data from DSE and relational sources, or even a legacy HDFS cluster.
Q10. How can developers use DSE 2.0 for storing, indexing and searching web logs?
Jonathan Ellis: We ship a log4j appender with DSE 2.0, so if your log data is coming from Java, it’s trivial to start streaming and indexing it into DSE. For non-Java systems, we’re looking at supporting ingestion through tools like Flume.
Q11. How do you adjust performance and capacity for various workloads depending on the application needs?
Jonathan Ellis: Currently reprovisioning nodes for different workloads is a manual, operator-driven procedure, made easy with our OpsCenter management tool. We’re looking at delivering automatic adaptation to changing workloads in a future release.
Q12. How is DSE 2.0 influenced by DataStax’s partnership with Pentaho Corporation (February 28, 2012) and their Pentaho Kettle tool?
Jonathan Ellis: A question we get frequently is, “I’m sold on Cassandra and DSE, but I need to not only move data from my existing RDBMSs to you, but transform the data so that it fits into my new Cassandra data model. How can I do that?” With Sqoop, we can extract and load, but nothing else. The free Pentaho solution provides very powerful transformation capabilities to massage the incoming data in nearly every way under the sun before it’s inserted into Cassandra. It does this very fast, too, and with a visual user interface.
Q13. Anything else to add?
Jonathan Ellis: DSE 2.0 is available for download now and is free to use, without any restrictions, for development purposes. Once you move to production, we do require a subscription, but I think you’ll find that the cost associated with DSE is much less than any RDBMS vendor.
_____________
Jonathan Ellis.
Jonathan Ellis is CTO and co-founder of DataStax (formerly Riptano), the commercial leader in products and support for Apache Cassandra. Prior to DataStax, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy. Jonathan is project chair of Apache Cassandra.
Related Posts
– Interview with Jonathan Ellis, project chair of Apache Cassandra (May 16, 2011).
– Analytics at eBay. An interview with Tom Fastner (October 6, 2011).
– On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com (November 2, 2011).
Related Resources
– Big Data and Analytical Data Platforms – Articles.
– NoSQL Data Stores – Articles, Papers, Presentations.
##
In order to help disseminate the work of young students and researchers in the area of databases, I started publishing Master’s and PhD theses in ODBMS.ORG.
Published Master’s and PhD theses are available for free download (as .pdf) to all visitors of ODBMS.ORG (50,000+ visitors/month).
Copyright of the theses remains with the authors.
The process of submission is quite simple. Please send (any time) by email to: editor AT odbms.org
1) a .pdf of your work
2) the filled in template below:
___________________________
Title of the work:
Language (English preferable):
Author:
Affiliation:
Short Abstract (max 2-3 sentences of text):
Type of work (PhD, Master):
Area (see classification below):
No of Pages:
Year of completion:
Name of supervisor/affiliation:
________________________________
To qualify for publication in ODBMS.ORG, the thesis should have been completed and accepted by the respective University/Research Center in 2011 or later, and it should address one or more of the following areas:
> Big Data: Analytics, Storage Platforms
> Cloud Data Stores
> Entity Framework (EF)
> Graphs and Data Stores
> In-Memory Databases
> Object Databases
> NewSQL Data Stores
> NoSQL Data Stores
> Object-Relational Technology
> Relational Databases: Benchmarking, Data Modeling
For any questions, please do not hesitate to contact me.
Hope this helps.
Best Regards
Roberto V. Zicari
Editor
ODBMS.ORG
ODBMS Industry Watch Blog
##
“Application types that benefit from an in-memory database system are those for which eliminating latency is a key design goal, and those that run on systems that simply have no persistent storage, like network routers and low-end set-top boxes” — Steve Graves.
On the topic of in-memory database systems, I interviewed one of our experts, Steve Graves, co-founder and CEO of McObject.
RVZ
Q1. What is an in-memory database system (IMDS)?
Steve Graves: An in-memory database system (IMDS) is a database management system (DBMS) that uses main memory as its primary storage medium.
A “pure” in-memory database system is one that requires no disk or file I/O, whatsoever.
In contrast, a conventional DBMS is designed around the assumption that records will ultimately be written to persistent storage (usually hard disk or flash memory).
Disk or flash I/O is expensive in performance terms; retrieving data from RAM is much faster than fetching it from disk or flash, so IMDSs are very fast.
An IMDS also offers a more streamlined design. Because it is not built around the assumption of storage on hard disk or flash memory, the IMDS can eliminate the various DBMS sub-systems required for persistent storage, including cache management, file management and others. For this reason, an in-memory database is also faster than a conventional database that is either fully-cached or stored on a RAM-disk.
In other areas (not related to persistent storage) an IMDS can offer the same features as a traditional DBMS. These include SQL and/or native language (C/C++, Java, C#, etc.) programming interfaces; formal data definition language (DDL) and database schemas; support for relational, object-oriented, network or combination data designs; transaction logging; database indexes; client/server or in-process system architectures; security features, etc. The list could go on and on. In-memory database systems are a sub-category of DBMSs, and should be able to do everything that entails.
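As a rough, illustrative sketch (not from the interview), the toy below uses Python’s sqlite3 to show just the file-I/O half of the difference Graves describes. Note that SQLite’s ':memory:' mode is still a disk-oriented engine held in RAM, so it keeps the cache-management overhead he mentions; a true IMDS would shed that too.

```python
import os
import sqlite3
import tempfile
import time

def time_inserts(conn, n=5000):
    """Create a table and time a bulk insert of n rows."""
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
    start = time.perf_counter()
    with conn:  # one transaction
        conn.executemany("INSERT INTO t VALUES (?, ?)",
                         ((i, "x" * 32) for i in range(n)))
    return time.perf_counter() - start

# Database held entirely in RAM: no file I/O at all.
mem = sqlite3.connect(":memory:")
t_mem = time_inserts(mem)

# Same engine, but pages are written through to a file on disk.
path = os.path.join(tempfile.mkdtemp(), "disk.db")
disk = sqlite3.connect(path)
t_disk = time_inserts(disk)

print(f"in-memory: {t_mem:.4f}s, on-disk: {t_disk:.4f}s")
```

The measured gap here reflects only I/O; per the interview, a purpose-built IMDS gains further by removing caching and data-transfer layers entirely.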
Q2. What are significant differences between an in-memory database versus a database that happens to be in memory (e.g. deployed on a RAM-disk).
Steve Graves: We use the comparison to illustrate IMDSs’ contribution to performance beyond the obvious elimination of disk I/O. If IMDSs’ sole benefit stemmed from getting rid of physical I/O, then we could get the same performance by deploying a traditional DBMS entirely in memory – for example, using a RAM-disk in place of a hard drive.
We tested an application performing the same tasks with three storage scenarios: using an on-disk DBMS with a hard drive; the same on-disk DBMS with a RAM-disk; and an IMDS (McObject’s eXtremeDB). Moving the on-disk database to a RAM drive resulted in nearly 4x improvement in database reads, and more than 3x improvement in writes. But the IMDS (using main memory for storage) outperformed the RAM-disk database by 4x for reads and 420x for writes.
Clearly, factors other than eliminating disk I/O contribute to the IMDS’s performance – otherwise, the DBMS-on-RAM-disk would have matched it. The explanation is that even when using a RAM-disk, the traditional DBMS is still performing many persistent storage-related tasks.
For example, it is still managing a database cache – even though the cache is now entirely redundant, because the data is already in RAM. And the DBMS on a RAM-disk is transferring data to and from various locations, such as a file system, the file system cache, the database cache and the client application, compared to an IMDS, which stores data in main memory and transfers it only to the application. These sources of processing overhead are hard-wired into on-disk DBMS design, and persist even when the DBMS uses a RAM-disk.
An in-memory database system also uses the storage space (memory) more efficiently.
A conventional DBMS can use extra storage space in a trade-off to minimize disk I/O (the assumption being that disk I/O is expensive, and storage space is abundant, so it’s a reasonable trade-off). Conversely, an IMDS needs to maximize storage efficiency because memory is not abundant in the way that disk space is. So a 10 gigabyte traditional database might only be 2 gigabytes when stored in an in-memory database.
Q3. What is in your opinion the current status of the in-memory database technology market?
Steve Graves: The best word for the IMDS market right now is “confusing.” “In-memory database” has become a hot buzzword, with seemingly every DBMS vendor now claiming to have one. Often these purported IMDSs are simply the providers’ existing disk-based DBMS products, which have been tweaked to keep all records in memory – and they more closely resemble a 100% cached database (or a DBMS that is using a RAM-disk for storage) than a true IMDS. The underlying design of these products has not changed, and they are still burdened with DBMS overhead such as caching, data transfer, etc. (McObject has published a white paper, Will the Real IMDS Please Stand Up?, about this proliferation of claims to IMDS status.)
Only a handful of vendors offer IMDSs that are built from scratch as in-memory databases. If you consider these to comprise the in-memory database technology market, then the status of the market is mature. The products are stable, have existed for a decade or more and are deployed in a variety of real-time software applications, ranging from embedded systems to real-time enterprise systems.
Q4. What are the application types that benefit the use of an in-memory database system?
Steve Graves: Application types that benefit from an IMDS are those for which eliminating latency is a key design goal, and those that run on systems that simply have no persistent storage, like network routers and low-end set-top boxes. Sometimes these types overlap, as in the case of a network router that needs to be fast, and has no persistent storage. Embedded systems often fall into the latter category, in fields such as telco and networking gear, avionics, industrial control, consumer electronics, and medical technology. What we call the real-time enterprise sector is represented in the first category, encompassing uses such as analytics, capital markets (algorithmic trading, order matching engines, etc.), real-time cache for e-commerce and other Web-based systems, and more.
Software that must run with minimal hardware resources (RAM and CPU) can also benefit.
As discussed above, IMDSs eliminate sub-systems that are part-and-parcel of on-disk DBMS processing. This streamlined design results in a smaller database system code size and reduced demand for CPU cycles. When it comes to hardware, IMDSs can “do more with less.” This means that the manufacturer of, say, a set-top box that requires a database system for its electronic programming guide, may be able to use a less powerful CPU and/or less memory in each box when it opts for an IMDS instead of an on-disk DBMS. These manufacturing cost savings are particularly desirable in embedded systems products targeting the mass market.
Q5. McObject offers an in-memory database system called eXtremeDB, and an open source embedded DBMS, called Perst. What is the difference between the two? Is there any synergy between the two products?
Steve Graves: Perst is an object-oriented embedded database system.
It is open source and available in Java (including Java ME) and C# (.NET) editions. The design goal for Perst is to provide persistence for Java and C# objects that is as nearly transparent as practically possible within the normal Java and .NET frameworks. In other words, no special tools, byte codes, or virtual machine are needed. Perst should provide persistence to Java and C# objects while changing the way a programmer uses those objects as little as possible.
eXtremeDB is not an object-oriented database system, though it does have attributes that give it an object-oriented “flavor.” The design goals of eXtremeDB were to provide a full-featured, in-memory DBMS that could be used right across the computing spectrum: from resource-constrained embedded systems to high-end servers used in systems that strive to squeeze out every possible microsecond of latency. McObject’s eXtremeDB in-memory database system product family has features including support for multiple APIs (SQL ODBC/JDBC & native C/C++, Java and C#), varied database indexes (hash, B-tree, R-tree, KD-tree, and Patricia Trie), ACID transactions, multi-user concurrency (via both locking and “optimistic” transaction managers), and more. The core technology is embodied in the eXtremeDB IMDS edition. The product family includes specialized editions, built on this core IMDS, with capabilities including clustering, high availability, transaction logging, hybrid (in-memory and on-disk) storage, 64-bit support, and even kernel mode deployment. eXtremeDB is not open source, although McObject does license the source code.
The two products do not overlap. There is no shared code, and there is no mechanism for them to share or exchange data. Perst for Java is written in Java, Perst for .NET is written in C#, and eXtremeDB is written in C, with optional APIs for Java and .NET. Perst is a candidate for Java and .NET developers that want an object-oriented embedded database system, have no need for the more advanced features of eXtremeDB, do not need to access their database from C/C++ or from multiple programming languages (a Perst database is compatible with Java or C#), and/or prefer the open source model. Perst has been popular for smartphone apps, thanks to its small footprint and smart engineering that enables Perst to run on mobile platforms such as Windows Phone 7 and Java ME.
eXtremeDB will be a candidate when eliminating latency is a key concern (Perst is quite fast, but not positioned for real-time applications), when the target system doesn’t have a JVM (or sufficient resources for one), when the system needs to support multiple programming languages, and/or when any of eXtremeDB’s advanced features are required.
Q6. What are the current main technological developments for in-memory database systems?
Steve Graves: At McObject, we’re excited about the potential of IMDS technology to scale horizontally, across multiple hardware nodes, to deliver greater scalability and fault-tolerance while enabling more cost-effective system expansion through the use of low-cost (i.e. “commodity”) servers. This enthusiasm is embodied in our new eXtremeDB Cluster edition, which manages data stores across distributed nodes. Among eXtremeDB Cluster’s advantages is that it eliminates any performance ceiling from being CPU-bound on a single server.
Scaling across multiple hardware nodes is receiving a lot of attention these days with the emergence of NoSQL solutions. But database system clustering actually has much deeper roots. One of the application areas where it is used most widely is in telecommunications and networking infrastructure, where eXtremeDB has always been a strong player. And many emerging application categories – ranging from software-as-a-service (SaaS) platforms to e-commerce and social networking applications – can benefit from a technology that marries IMDSs’ performance and “real” DBMS features with a distributed system model.
Q7. What are the similarities and differences between current various database clustering solutions? In particular, let’s look at dimensions such as scalability, ACID vs. CAP, intended/applicable problem domains, structured vs. unstructured, and complexity of implementation.
Steve Graves: ACID support vs. “eventual consistency” is a good place to start looking at the differences between clustering database solutions (including some cluster-like NoSQL products). ACID-compliant transactions will be Atomic, Consistent, Isolated and Durable; consistency implies the transaction will bring the database from one valid state to another and that every process will have a consistent view of the database. ACID-compliance enables an on-line bookstore to ensure that a purchase transaction updates the Customers, Orders and Inventory tables of its DBMS. All other things being equal, this is desirable: updating Customers and Orders while failing to change Inventory could potentially result in other orders being taken for items that are no longer available.
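To make the bookstore example concrete, here is a minimal, hypothetical sketch (using Python’s sqlite3; table names invented for illustration) of an atomic purchase: either the order row and the inventory decrement both commit, or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (item TEXT PRIMARY KEY, qty INTEGER);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER);
    INSERT INTO inventory VALUES ('book', 1);
""")

def place_order(conn, item, qty):
    # The whole purchase is one transaction: "with conn" commits on
    # success and rolls back if any statement inside raises.
    try:
        with conn:
            cur = conn.execute(
                "UPDATE inventory SET qty = qty - ? "
                "WHERE item = ? AND qty >= ?",
                (qty, item, qty))
            if cur.rowcount == 0:
                raise ValueError("insufficient inventory")
            conn.execute("INSERT INTO orders (item, qty) VALUES (?, ?)",
                         (item, qty))
        return True
    except ValueError:
        return False

print(place_order(conn, "book", 1))  # True: order and decrement commit together
print(place_order(conn, "book", 1))  # False: rolled back, no dangling order
```

The second call fails atomically, which is exactly the guarantee that prevents taking orders for items no longer in stock.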
However, enforcing the ACID properties becomes more of a challenge with distributed solutions, such as database clusters, because the node initiating a transaction has to wait for acknowledgement from the other nodes that the transaction can be successfully committed (i.e. there are no conflicts with concurrent transactions on other nodes). To speed up transactions, some solutions have relaxed their enforcement of these rules in favor of an “eventual consistency” that allows portions of the database (typically on different nodes) to become temporarily out-of-synch (inconsistent).
Systems embracing eventual consistency will be able to scale horizontally better than ACID solutions – it boils down to their asynchronous rather than synchronous nature.
Eventual consistency is, obviously, a weaker consistency model, and implies some process for resolving consistency problems that will arise when multiple asynchronous transactions give rise to conflicts. Resolving such conflicts increases complexity.
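As a hedged illustration of eventual consistency (a generic sketch, not any particular product’s algorithm), the code below models two replicas that accept writes independently and reconcile later with a simple last-write-wins rule:

```python
class Replica:
    """A node that applies writes locally and syncs asynchronously."""
    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (timestamp, node, value)

    def write(self, key, value, ts):
        self._merge(key, (ts, self.name, value))

    def _merge(self, key, versioned):
        current = self.store.get(key)
        # Last-write-wins: highest (timestamp, node) tuple prevails.
        if current is None or versioned > current:
            self.store[key] = versioned

    def sync_from(self, other):
        for key, versioned in other.store.items():
            self._merge(key, versioned)

a, b = Replica("a"), Replica("b")
a.write("stock", 5, ts=1)
b.write("stock", 3, ts=2)   # concurrent write on another node
# Before syncing, readers on a and b see different values: that is
# the temporary inconsistency the text describes.
a.sync_from(b)
b.sync_from(a)              # anti-entropy pass
assert a.store["stock"] == b.store["stock"]  # now eventually consistent
```

Last-write-wins is only one resolution policy; richer schemes (version vectors, application-level merges) add the complexity noted above.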
Another area where clustering solutions differ is along the lines of shared-nothing vs. shared-everything approaches. In a shared-nothing cluster, each node has its own set of data.
In a shared-everything cluster, each node works on a common copy of database tables and rows, usually stored in a fast storage area network (SAN). Shared-nothing architecture is naturally more complex: if the data in such a system is partitioned (each node has only a subset of the data) and a query requests data that “lives” on another node, there must be code to locate and fetch it. If the data is not partitioned (each node has its own copy) then there must be code to replicate changes to all nodes when any node commits a transaction that modifies data.
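The partitioned shared-nothing case — code that locates which node a key "lives" on — can be sketched as follows (illustrative only; node names and the routing scheme are invented):

```python
import hashlib

NODES = ["node0", "node1", "node2"]

def owner(key):
    # Hash the key to pick exactly one owning node.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# Each node holds only its own partition of the data.
partitions = {n: {} for n in NODES}

def put(key, value):
    partitions[owner(key)][key] = value

def get(key):
    # The routing layer must locate and fetch from the owning node;
    # in a real cluster this is a network hop, not a dict lookup.
    return partitions[owner(key)].get(key)

put("customer:42", {"name": "Ada"})
assert get("customer:42") == {"name": "Ada"}
```

In the replicated (non-partitioned) variant, `put` would instead propagate the write to every node, which is where the commit-coordination complexity comes from.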
NoSQL solutions emerged in the past several years to address challenges that occur when scaling the traditional RDBMS. To achieve scale, these solutions generally embrace eventual consistency (thus validating the CAP Theorem, which holds that a system cannot simultaneously provide Consistency, Availability and Partition tolerance). And this choice defines the intended/applicable problem domains. Specifically, it eliminates systems that must have consistency. However, many systems don’t have this strict consistency requirement – an on-line retailer such as the bookstore mentioned above may accept the occasional order for a non-existent inventory item as a small price to pay for being able to meet its scalability goals. Conversely, transaction processing systems typically demand absolute consistency.
NoSQL is often described as a better choice for so-called unstructured data. Whereas RDBMSs have a data definition language that describes a database schema and becomes recorded in a database dictionary, NoSQL databases are often schema-less, storing opaque “documents” that are keyed by one or more attributes for subsequent retrieval. Proponents argue that schema-less solutions free us from the rigidity imposed by the relational model and make it easier to adapt to real-world changes. Opponents argue that schema-less systems are for lazy programmers, create a maintenance nightmare, and that there is no equivalent to relational calculus or the ANSI standard for SQL. But the entire structured or unstructured discussion is tangential to database cluster solutions.
Q8. Are in-memory database systems an alternative to classical disk-based relational database systems?
Steve Graves: In-memory database systems are an ideal alternative to disk-based DBMSs when performance and efficiency are priorities. However, this explanation is a bit fuzzy, because what programmer would not claim speed and efficiency as goals? To nail down the answer, it’s useful to ask, “When is an IMDS not an alternative to a disk-based database system?”
Volatility is pointed to as a weak point for IMDSs. If someone pulls the plug on a system, all the data in memory can be lost. In some cases, this is not a terrible outcome. For example, if a set-top box programming guide database goes down, it will be re-provisioned from the satellite transponder or cable head-end. In cases where volatility is more of a problem, IMDSs can mitigate the risk. For example, an IMDS can incorporate transaction logging to provide recoverability. In fact, transaction logging is unavoidable with some products, such as Oracle’s TimesTen (it is optional in eXtremeDB). Database clustering and other distributed approaches (such as master/slave replication) contribute to database durability, as does use of non-volatile RAM (NVRAM, or battery-backed RAM) as storage instead of standard DRAM. Hybrid IMDS technology enables the developer to specify persistent storage for selected record types (presumably those for which the “pain” of loss is highest) while all other records are managed in memory.
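A toy sketch of the transaction-logging strategy (illustrative only; real IMDS logging such as eXtremeDB’s or TimesTen’s is far more sophisticated): an in-memory store that appends each committed write to a durable log and replays it on restart.

```python
import json
import os
import tempfile

class LoggedStore:
    """In-memory dict with an append-only log for crash recovery."""
    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        if os.path.exists(log_path):  # replay the log on restart
            with open(log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value

    def put(self, key, value):
        # Log first, then apply: a committed write survives a crash.
        with open(self.log_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value

path = os.path.join(tempfile.mkdtemp(), "tx.log")
store = LoggedStore(path)
store.put("k", "v")

recovered = LoggedStore(path)  # simulate restart after power loss
assert recovered.data == {"k": "v"}
```

The `fsync` per write is what trades back some latency for durability, which is why such logging is optional in some IMDS products.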
However, all of these strategies require some effort to plan and implement. The easiest way to reduce volatility is to use a database system that implements persistent storage for all records by default – and that’s a traditional DBMS. So, the IMDS use-case occurs when the need to eliminate latency outweighs the risk of data loss or the cost of the effort to mitigate volatility.
It is also the case that flash and, especially, spinning memory are much less expensive than DRAM, which puts an economic lid on very large in-memory databases for all but the richest users. And, riches notwithstanding, it is not yet possible to build a system with hundreds of terabytes, let alone petabytes or exabytes, of memory, whereas spinning memory has no such limitation.
By continuing to use traditional databases for most applications, developers and end-users are signaling that DBMSs’ built-in persistence is worth its cost in latency. But the growing role of IMDSs in real-time technology ranging from financial trading to e-commerce, avionics, telecom/Netcom, analytics, industrial control and more shows that the need for speed and efficiency often outweighs the convenience of a traditional DBMS.
———–
Steve Graves is co-founder and CEO of McObject, a company specializing in embedded Database Management System (DBMS) software. Prior to McObject, Steve was president and chairman of Centura Solutions Corporation and vice president of worldwide consulting for Centura Software Corporation.
Related Posts
– A super-set of MySQL for Big Data. Interview with John Busch, Schooner.
– Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB.
– On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software.
– vFabric SQLFire: Better than RDBMS and NoSQL?
Related Resources
ODBMS.ORG: Free Downloads and Links:
Object Databases
NoSQL Data Stores
Graphs and Data Stores
Cloud Data Stores
Object-Oriented Programming
Entity Framework (EF) Resources
ORM Technology
Object-Relational Impedance Mismatch
Databases in general
Big Data and Analytical Data Platforms
#
“Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled as well as logged for audits” — Michael Blaha.
This is the third interview with our expert Dr. Michael Blaha on the topic of database modeling. This time we look at the issue of data design for analytical data warehouses.
In previous interviews we looked at how good UML is for database design, and how good Use Cases are for database modeling.
I hope you’ll find this interview interesting. I encourage the community to post comments.
RVZ
Q1: What is the difference between data warehouses and day-to-day business applications?
Michael Blaha: Operational (day-to-day business) applications serve the routine needs of a business: handling orders, scheduling manufacturing runs, servicing patients, and generating financial statements.
Operational applications have many short transactions that must process quickly. The transactions both read and write.
Well-written applications pay attention to data quality, striving to ensure correct data and avoid errors.
In contrast, analytical (data warehouse) applications step back from the business routine and analyze data that accumulates over time. The idea is to gain insight into business patterns that are overlooked when responding to routine needs. Data warehouse queries can have a lengthy execution time as they process reams of data, searching for underlying patterns.
End users read from a data warehouse, but they don’t write to it. Rather, writing occurs as the operational applications supply new data that is added to the data warehouse.
Q2: How do you approach data modeling for data warehouse problems?
Michael Blaha: For operational applications, I use the UML class model for conceptual data modeling. (I often use Enterprise Architect.) The notation is more succinct than conventional database notations and promotes abstract thinking.
In addition, the UML class model is understandable for business customers as it defers database design details. And, of course, the UML reaches out to the programming side of development.
In contrast, for analytical applications, I go straight to a database notation. (I often use ERwin.) Data warehouses revolve around facts and dimensions. The structure of a data warehouse model is so straightforward (unlike the model of an operational application) that a database notation alone suffices.
For a business user, the UML model and the conventional data model look much the same for a data warehouse.
The programmers of a data warehouse (the ETL developers) are accustomed to database notations (unlike the developers in day-to-day applications).
As an aside, I note that in a past book (A Manager’s Guide to Database Technology) I used a UML class model for analytical modeling. In retrospect I now realize that was a forced fit. The class model does not deliver any benefits for data warehouses and it’s an unfamiliar technology for data warehouse developers, so there’s no point in using it there.
Q3: Is there any synergy between non-relational databases (NoSQL, Object Databases) and data warehouses?
Michael Blaha: Not for conventional data warehouses that are set-oriented. Mass quantities of data must be processed in bulk. Set-oriented data processing is a strength of relational databases and the SQL language. Furthermore, tables are a good metaphor for facts and dimensions and the data is intrinsically strongly typed.
NoSQL (Hadoop) is being used for mining Web data. Web data is by its nature unstructured and much different from conventional data warehouses.
Q4: How do data warehouses achieve fast performance?
Michael Blaha: The primary technique is pre-computation, by anticipating the need for aggregate data and computing it in advance. Indexing is also important for data warehouses, but less important than with operational applications.
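The pre-computation idea can be sketched in a few lines (invented schema, using Python’s sqlite3): the ETL load maintains a small summary table so analytical queries avoid scanning the large fact table at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (day TEXT, region TEXT, amount REAL);
    INSERT INTO sales_fact VALUES
        ('2012-05-01', 'east', 100.0),
        ('2012-05-01', 'east', 50.0),
        ('2012-05-01', 'west', 75.0);
    -- Pre-computed aggregate: built by the ETL load, in advance of
    -- any user query, anticipating the need for per-region totals.
    CREATE TABLE sales_by_region AS
        SELECT region, SUM(amount) AS total
        FROM sales_fact
        GROUP BY region;
""")

# Analysts hit the small summary table instead of the fact table.
rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 75.0)]
```

With billions of fact rows, reading a few pre-aggregated rows instead of scanning the fact table is where the warehouse’s query speed comes from.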
Q5: What are some difficult issues with data warehouses?
Michael Blaha:
–Abstraction. Abstraction is needed to devise the proper facts and dimensions. It is always difficult to perform abstraction.
–Conformed dimensions. A large data warehouse schema must be flexible for mining. This can only be achieved if data is on the same basis. Therefore there is a need for conformed dimensions.
For example, there must be a single definition of Customer that is used throughout the warehouse.
–Size. The sheer size of schema and data is a challenge.
–Data cleansing. Many operational applications are old legacy code. Often their data is flawed and may need to be corrected for a data warehouse.
– Data integration. Many data warehouses combine data from multiple applications. The application data overlaps and must be reconciled.
– Security. Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled as well as logged for audits.
Q6: What kind of metadata is associated with a data warehouse and is there a role for Object Databases with this?
Michael Blaha: Maybe. Data warehouse metadata includes source-to-target mappings, definitions (of facts, dimensions, and attributes), as well as the organization of the data warehouse into subject areas. The metadata for a data warehouse is just like that of operational applications: it has to be custom modeled and doesn’t have a standard metaphor for structure like the facts and dimensions of the warehouse data itself. Relational databases, OO databases, and possibly other kinds of databases are all reasonable candidates.
Q7. In a recent interview, Florian Waas, EMC/Greenplum, said “in the Big Data era the old paradigm of shipping data to the application isn’t working any more. Rather, the application logic must come to the data or else things will break: this is counter to conventional wisdom and the established notion of strata within the database stack. Instead of stand-alone products for ETL, BI/reporting and analytics we have to think about seamless integration: in what ways can we open up a data processing platform to enable applications to get closer? What language interfaces, but also what resource management facilities can we offer? And so on.”
What is your view on this?
Michael Blaha: It’s well known that stored procedures must be used to get good performance from relational database applications. Stored procedures are logic that runs inside the database kernel. They circumvent much of the overhead incurred by shuttling back and forth between an application process and the database process. So the stored procedure experience is certainly consistent with this comment.
What I try to do in practice is think in terms of objects. Relational database tables and stored procedures are analogous to objects with methods. I put core functionality that is likely to be reusable and computation intensive into stored procedures. I put lightweight functionality and functionality that is peculiar to an application outside the database kernel.
Q8. Hadoop is the system of choice for Big Data and Analytics. How do you approach data modeling in this case?
Michael Blaha: I have no experience with Hadoop. My projects have involved structured data. In contrast Hadoop is architected for unstructured data as is often found on the Web.
Q9. A lot of insights are contained in unstructured or semi-structured data from Big Data applications. Does it make any sense to do data modeling in this case?
Michael Blaha: I have no experience with unstructured data. I have some experience with semi-structured data (XML / XSD). I routinely practice data modeling for XSD files. I published a paper in 2010 lamenting the fact that so many SOA projects concern storage and retrieval of data and completely lack a data model. I’ve been working on modeling approaches for XSD files, but have not yet devised a solution to my satisfaction.
————————————————–
Michael Blaha is a partner at Modelsoft Consulting Corporation.
Dr. Blaha is recognized as one of the world’s leading authorities on databases and data modeling. He has more than 25 years of experience as a consultant and trainer in conceiving, architecting, modeling, designing, and tuning databases for dozens of major organizations around the world. He has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New York.
Related Posts
– Use Cases and Database Modeling — An interview with Michael Blaha.
– How good is UML for Database Design? Interview with Michael Blaha.
Resources
ODBMS.org: Free Downloads and Links on various data management technologies:
Object Databases
NoSQL Data Stores
Graphs and Data Stores
Cloud Data Stores
Object-Oriented Programming
Entity Framework (EF) Resources
ORM Technology
Object-Relational Impedance Mismatch
Databases in general
Big Data and Analytical Data Platforms
##
“Legacy MySQL does not scale well on a single node, which forces granular sharding and explicit application code changes to make them sharding-aware and results in low utilization of servers”— Dr. John Busch, Schooner Information Technology
A super-set of MySQL suitable for Big Data? On this subject, I have interviewed Dr. John Busch, Founder, Chairman, and CTO of Schooner Information Technology.
RVZ
Q1. What are the limitations of MySQL when handling Big Data?
John Busch: Legacy MySQL does not scale well and uses single-threaded asynchronous replication. Its poor scaling forces granular sharding across many servers and explicit application code changes to make them sharding-aware. Its single-threaded asynchronous replication results in slave lag and data inconsistency, and it requires complex manual fail-over with downtime and data loss. The net result is low utilization of servers, server sprawl, limited service availability, limited data integrity, and complex programming and administration. These are all serious problems when handling Big Data.
Q2. How is SchoonerSQL different with respect to Oracle/MySQL (5.1, 5.5, and the future 5.6)?
John Busch: Schooner licensed the source for MySQL and InnoDB directly from Oracle, with the right to enhance it in a compatible manner. Schooner made fundamental and extensive architectural and resource management advances to MySQL/InnoDB in order to make it enterprise class. SchoonerSQL fully exploits today’s commodity multi-core servers, flash memory, and high-performance networking while dramatically improving performance, availability, scalability, and cost of ownership relative to MySQL 5.X. SchoonerSQL advances include:
– very high thread-level parallelism with granular concurrency control and highly parallel DRAM <-> flash memory hierarchy management, enabling linear vertical scaling as a function of processor cores;
– tightly integrated (DRAM-to-DRAM) synchronous replication, coupled with fully parallel asynchronous replication, with automated fail-over within and between data centers, enabling the highest levels of availability with no data loss; and
– transparent, workload-aware, relational sharding with DBShards, enabling unlimited high-performance horizontal scaling.
SchoonerSQL is a super-set of Oracle MySQL/InnoDB 5.1/5.5/5.6, providing 100% compatibility for applications and data, while delivering order-of-magnitude improvements in availability, scalability, performance, and cost of ownership.
Q3. How can SchoonerSQL achieve high availability (HA) with performance, scalability and at a reasonable cost?
John Busch: In the past, major trade-offs were required between performance, availability and TCO (total cost of ownership). Today’s commodity multi-core server, flash memory, and high speed networking, coupled with new database architectures and resource management algorithms, enable concurrently achieving radical improvements in performance, scalability, availability, data integrity, and cost of ownership. SchoonerSQL innovations incorporate fundamental database architecture and resource management advances, including:
· Linear vertical scaling, which fully utilizes modern commodity multi-core servers, providing 10:1 consolidation and capital and operating expense reduction;
· Unlimited horizontal scaling, which allows support of very large databases with high performance, high availability, and low cost of ownership using commodity hardware and standard SQL; and
· High-performance synchronous and parallel asynchronous replication with automated fail-over, which provides 99.999% HA with full data integrity and no loss of performance.
Q4. What is your relationship with Oracle/MySQL?
John Busch: Schooner is an Oracle gold partner, and an OEM and go-to-market partner of Oracle. Schooner licensed the source for MySQL and InnoDB directly from Oracle and developed SchoonerSQL, which is completely compatible with Oracle’s MySQL. SchoonerSQL is an enterprise-class database, and is targeted at customers requiring a mission-critical database. SchoonerSQL provides an order of magnitude improvement in performance, availability, scalability, and cost of ownership relative to Oracle’s MySQL 5.X.
Q5. What is special about SchoonerSQL’s transparent sharding?
John Busch: Beyond SchoonerSQL’s linear vertical scaling and clustering, which enables high-performance and high-availability support of multi-terabyte databases, SchoonerSQL offers optional transparent relational sharding with DBShards to enable horizontal scaling across nodes for unlimited-size databases and unlimited scaling. SchoonerSQL’s DBShards relational transparent sharding is application-aware, based on analysis and optimization of the query and data access behavior of the specific customer workload.
Based on observed workload behavior, it optimally partitions the data across nodes and transparently replicates supporting data structures to eliminate cross-node communication during query execution.
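The basic partitioning idea behind any sharding scheme can be sketched minimally in Python. The node names and simple hash-based routing below are purely illustrative — DBShards' actual workload-aware placement is proprietary and considerably more sophisticated:

```python
import hashlib

class ShardRouter:
    """Toy key-based sharding: hash the shard key to pick a node, so any
    single-key lookup touches exactly one node."""
    def __init__(self, nodes):
        self.nodes = nodes  # hypothetical node names, not a Schooner API

    def node_for(self, shard_key):
        # md5 gives a stable hash across processes (unlike Python's hash())
        digest = hashlib.md5(str(shard_key).encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

router = ShardRouter(["node-a", "node-b", "node-c"])
# The same customer id always routes to the same node
assert router.node_for("customer-42") == router.node_for("customer-42")
```

The "transparent" part is that this routing lives in the database layer rather than in application code, which is precisely what legacy sharded MySQL deployments lack.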
Q6. How can you obtain scalability and high-performance with Big Data and at the same time offer SQL joins?
John Busch: SchoonerSQL’s DBShards relational transparent sharding optimally replicates supporting data structures used in dynamic queries. This is done on a workload-specific basis, based on dynamic query and data access patterns. As a result, no cross-node communication is required to execute queries; in particular for SQL joins, the data is fully coalesced at the client, with Schooner libraries transparently invoking the involved nodes.
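A toy model of the approach: each shard holds a partition of a large fact table plus a full replica of a small reference table, so every shard can join locally and the client merely concatenates the partial results. The table contents and shard layout here are invented for illustration:

```python
# Each "node" holds a partition of `orders` plus a full replica of `products`
shards = [
    {"orders": [(1, "p1", 2)], "products": {"p1": 9.99, "p2": 4.99}},
    {"orders": [(2, "p2", 5)], "products": {"p1": 9.99, "p2": 4.99}},
]

def join_orders_with_prices():
    # Each shard performs the join locally (no cross-node traffic);
    # the client only coalesces the small partial results.
    results = []
    for shard in shards:
        for order_id, product_id, qty in shard["orders"]:
            results.append((order_id, qty * shard["products"][product_id]))
    return sorted(results)
```

Replicating the reference data trades a little storage on every node for the elimination of cross-node join traffic, which is the expensive part at scale.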
Q7. What is your take on MariaDB?
John Busch: MariaDB is trying to offer an alternative to MySQL and to Oracle.
SchoonerSQL is focused on offering a superior MySQL for mission-critical applications and services, with 100% MySQL compatibility, vastly superior performance, availability, scalability and TCO, all in partnership with the Oracle corporation.
Q8. You also offer a Memcached-based product (Membrain). Why Membrain?
John Busch: Membrain is a very high-performance and very high availability scalable key-value store supporting the memcached protocol. Schooner’s experience in the market is that SchoonerSQL and Membrain are both required and very complementary.
Schooner Membrain provides high performance, scalability, and high availability with low TCO for unstructured data, based on fully exploiting flash memory and multi-core servers with synchronous replication and transparent fail-over. SchoonerSQL provides high performance, scalability, and high availability with low TCO for structured data.
Q9. Talking about scalability and performance what are the main differences if the database is stored on hard drives, SAN, flash memory (Flashcache)? What happens when data does not fit in DRAM?
John Busch: SchoonerSQL and Schooner Membrain are designed to fully exploit the high IOPS of flash memory and SANs and the cores of today’s commodity servers. With SchoonerSQL, the performance when executing out of flash or SAN is almost the same as if everything were executing from DRAM memory, enabling the full utilization of today’s commodity multi-core servers with very high vertical scaling and consolidation at low cost.
SchoonerSQL also provides significant performance improvements with disk storage and flash cache. Measurements based on standard benchmarks show that Schooner offers much higher performance, consolidation, and scalability than any other MySQL or NoSQL product, and SchoonerSQL does this with 99.999% availability and much lower cost of ownership relative to legacy MySQL 5.X or other NoSQL offerings.
Q10. How do you differentiate yourselves from other NoSQL vendors (Key/Value stores, document-based databases and similar NoSQL databases)?
John Busch: SchoonerSQL and Membrain are unique in the industry. Schooner has over 20 filed patents on its advances in database/data store architecture and resource management.
SchoonerSQL and Membrain deliver order of magnitude improvements in performance, scalability, availability, and cost of ownership relative to any other MySQL or NoSQL, while maintaining 100% SQL and memcached compatibility.
Q11. What is your take on a database such as VoltDB?
John Busch: VoltDB is a large DRAM-only database. DRAM is expensive and volatile.
Schooner effectively utilizes parallel flash memory in a tightly integrated architecture, effectively exploiting flash, DRAM, multi-core, and multi-node scalability and availability. This results in superior cost of ownership and availability relative to VoltDB while providing high performance and unlimited scalability, all with 100% SQL compatibility.
Proliferation of Analytics
Q12. A/B testing, sessionization, bot detection, and pathing analysis all require powerful analytics on many petabytes of semi-structured Web data. How do you handle big semi-structured data?
John Busch: As we discussed above, SchoonerSQL provides optimized vertical scaling and clustering coupled with DBShards transparent relational horizontal scaling. This enables queries to be performed on unlimited semi-structured datasets, with 99.999% high availability, full data integrity, and the minimal number of commodity servers.
Q13. How do you see converging data from multiple data sources, both structured and unstructured?
John Busch: Today, SchoonerSQL and Schooner Membrain are often used in conjunction to provide support for both unstructured and structured data. There are also emerging standards for interfacing heterogeneous structured and unstructured data stores. Schooner believes these are very important, and will contribute to and support these standards in our products.
Q14. Does it make sense to use Apache Hadoop, MapReduce and MySQL together?
John Busch: Hadoop/MapReduce provide a new distributed computational model which is very appropriate for certain application classes, and which is receiving industry acceptance and traction.
Schooner intends to enhance our products to provide exceptional interoperability with Hadoop so that customers can use these products in conjunction in delivering their services.
———–
Dr. John Busch, is the Founder, Chairman, and CTO of Schooner Information Technology.
Prior to Schooner, John was director of computer system architecture at Sun Microsystems Laboratories from 1999 through 2006. In this role, John led research in multi-core processors, multi-tier scale-out architectures, and advanced high-performance computer systems. John received the President’s Award for Innovation at Sun.
Prior to Sun, John was VP of engineering and business partnerships with Diba, Inc., co-founder, CTO and VP of engineering of Clarity Software, and director of computer systems R&D at Hewlett Packard.
John holds a Ph.D. in Computer Science from UCLA, an M.A. in Mathematics from UCLA, an M.S. in Computer Science from Stanford University, and attended the Sloan Program at Stanford.
Related Posts
– Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB.
– vFabric SQLFire: Better than RDBMS and NoSQL?
– MariaDB: the new MySQL? Interview with Michael Monty Widenius.
Related Resources
– ODBMS.org: Free Downloads and Links on:
*Analytical Data Platforms
*Cloud Data Stores
*Graphs and Data Stores
*NoSQL Data Stores
##
“With terabytes, things are actually pretty simple — most conventional databases scale to terabytes these days. However, try to scale to petabytes and it’s a whole different ball game.” –Florian Waas.
On the subject of Big Data Analytics, I interviewed Florian Waas (flw). Florian is the Director of Software Engineering at EMC/Greenplum and heads up the Query Processing team.
RVZ
Q1. What are the main technical challenges for big data analytics?
Florian Waas: Put simply, in the Big Data era the old paradigm of shipping data to the application isn’t working any more. Rather, the application logic must “come” to the data or else things will break: this is counter to conventional wisdom and the established notion of strata within the database stack.
Instead of stand-alone products for ETL, BI/reporting and analytics we have to think about seamless integration: in what ways can we open up a data processing platform to enable applications to get closer?
What language interfaces, but also what resource management facilities can we offer? And so on.
At Greenplum, we’ve pioneered a couple of ways to make this integration reality: a few years ago with a Map-Reduce interface for the database and more recently with MADlib, an open source in-database analytics package. In fact, both rely on a powerful query processor under the covers that automates shipping application logic directly to the data.
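The "ship the logic to the data" idea behind the Map-Reduce interface can be shown in a minimal single-process sketch, where the partitions stand in for data resident on separate worker nodes:

```python
from collections import Counter
from functools import reduce

# Partitions stand in for data living on separate worker nodes
partitions = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]

def map_phase(partition):
    # The counting function is "shipped" to where the data lives
    return Counter(partition)

def reduce_phase(partials):
    # Only small partial results cross the network, not the raw data
    return reduce(lambda x, y: x + y, partials, Counter())

totals = reduce_phase(map_phase(p) for p in partitions)
print(totals["a"], totals["b"], totals["c"])  # 3 2 3
```

The payoff is proportional to data size: the mappers' outputs are tiny compared with the partitions they summarize, which is why moving computation beats moving data at petabyte scale.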
Q2. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?
Florian Waas: With terabytes, things are actually pretty simple — most conventional databases scale to terabytes these days. However, try to scale to petabytes and it’s a whole different ball game.
Scale and performance requirements strain conventional databases. Almost always, the problems are a matter of the underlying architecture. If not built for scale from the ground up, a database will ultimately hit the wall — this is what makes it so difficult for the established vendors to play in this space, because you cannot simply retrofit a 20+ year-old architecture to become a distributed MPP database overnight.
Having said that, over the past few years, a whole crop of new MPP database companies has demonstrated that multiple PB’s don’t pose a terribly big challenge if you approach it with the right architecture in mind.
Q3. How do you handle structured and unstructured data?
Florian Waas: As a rule of thumb, we suggest to our customers to use Greenplum Database for structured data and to consider Greenplum HD—Greenplum’s enterprise Hadoop edition—for unstructured data. We’ve equipped both systems with high-performance connectors to import and export data to each other, which makes for a smooth transition when using one for pre-processing for the other, querying HD from Greenplum Database, or whatever combination the application scenario might call for.
Having said this, we have seen a growing number of customers loading highly unstructured data directly into Greenplum Database and convert it into structured data on the fly through in-database logic for data cleansing, etc.
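The on-the-fly structuring Waas mentions might look like the following sketch. The log format, field names, and regular expression are hypothetical, and in Greenplum such cleansing logic would run in-database rather than in a client script:

```python
import re

# Hypothetical raw clickstream lines (semi-structured text)
raw_lines = [
    "2012-03-01T10:02:11 user=alice action=view page=/home",
    "2012-03-01T10:02:15 user=bob action=click page=/buy",
]

pattern = re.compile(r"(\S+) user=(\S+) action=(\S+) page=(\S+)")

def structure(line):
    """Cleanse one raw line into a (timestamp, user, action, page) row."""
    m = pattern.match(line)
    return m.groups() if m else None  # drop lines that fail to parse

rows = [r for r in (structure(l) for l in raw_lines) if r]
print(rows[0])  # ('2012-03-01T10:02:11', 'alice', 'view', '/home')
```

Once the rows are materialized, they can be queried, indexed, and joined like any other relational data.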
Q4. Cloud computing and open source: Do you they play a role at Greenplum? If yes, how?
Florian Waas: Cloud computing is an important direction for our business and hardly any vendor is better positioned than EMC in this space. Suffice it to say, we’re working on some exciting projects.
So, stay tuned!
As you know, Greenplum has been very close to the open source movement, historically. Besides our ties with the Postgres and Hadoop communities, we released our own open source distribution of MADlib for in-database analytics (see also madlib.net).
Q5. In your blog you write that classical database benchmarks “aren’t any good at assessing the query optimizer”. Can you please elaborate on this?
Florian Waas: Unlike customer workloads, standard benchmarks pose few challenges for a query optimizer – the emphasis in these benchmarks is on query execution and storage structures. Recently, several systems that have no query optimizer to speak of have scored top results in the TPC-H benchmark.
And, while impressive at these benchmarks, these systems usually do not perform well in customer accounts when faced with ad-hoc queries — that’s where a good optimizer makes all the difference.
Q6. Why do we need specialized benchmarks for a subcomponent of a database?
Florian Waas: On the one hand, an optimizer benchmark will be a great tool for consumers.
A significant portion of the total cost of ownership of a database system comes from the cost of query tuning and manual query rewriting, in other words, the shortcomings of the query optimizer. Without an optimizer benchmark it’s impossible for consumers to compare the maintenance cost. That’s like buying a car without knowing its fuel consumption!
On the other hand, an optimizer benchmark will be extremely useful for engineering teams in optimizer development. It’s somewhat ironic that vendors haven’t invested in a methodology to show off that part of the system where most of their engine development cost goes.
Q7. Are you aware of any work in this area (Benchmarking query optimizers)?
Florian Waas: Funny you should ask. Over the past months I’ve been working with coworkers and colleagues in the industry on some techniques – we’re still far away from a complete benchmark but we’ve made some inroads.
Q8. You had done some work with “dealing with plan regressions caused by changes to the query optimizer”. Could you please explain what the problem is and what kind of solutions did you develop?
Florian Waas: A plan regression is a regression of a query due to changes to the optimizer from one release to the next. For the customer this could mean that, after an upgrade or patch release, one or more of their truly critical queries might run slower — maybe even so slowly that it starts impacting their daily business operations.
With the current test technology, plan regressions are very hard to guard against, simply because the size of the input space makes it impossible to achieve perfect test coverage. This dilemma has made a number of vendors increasingly risk-averse and has turned into the biggest obstacle for innovation in this space. Some vendors came up with rather reactionary safety measures. To use another car analogy: many of these are akin to driving with defective brakes but wearing a helmet in the hope that this will help prevent the worst in a crash.
I firmly believe in fixing the defective brakes, so to speak, and developing better test and analysis tools. We’ve made some good progress on this front and are starting to see some payback already. This is an exciting and largely under-developed area of research!
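A minimal version of the plan-comparison tooling Waas alludes to can be sketched with Python's built-in sqlite3. Here EXPLAIN QUERY PLAN stands in for a commercial optimizer's plan output, and creating an index simulates an optimizer change between releases:

```python
import sqlite3

def plan_of(conn, sql):
    """Return the optimizer's plan as a list of step descriptions."""
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
query = "SELECT * FROM t WHERE a = 1"

baseline = plan_of(conn, query)              # full table scan
conn.execute("CREATE INDEX idx_a ON t(a)")   # stand-in for an optimizer change
after = plan_of(conn, query)                 # now an index search

# A regression harness would diff the plans for every tracked query
# and flag any change for review before shipping the new release
if baseline != after:
    print("plan changed:", baseline, "->", after)
```

A real harness would also time the queries, since not every plan change is a regression; the plan diff is what makes the change visible before customers hit it.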
Q9. Glenn Paulley of Sybase, in a keynote at SIGMOD 2011, asked the question ‘how much more complexity can database systems deal with?’ What is your take on this?
Florian Waas: Unnecessary complexity is bad. I think everybody will agree with that. Some complexity is inevitable though, and the question becomes: How are we dealing with it?
Database vendors have all too often fallen into the trap of implementing questionable features quickly without looking at the bigger picture. This has led to tons of internal complexity and special casing, not to mention the resulting spaghetti code.
When abstracted correctly and broken down into sound building blocks, a lot of complexity can actually be handled quite well. Again, query optimization is a great example here: modern optimizers can be a joy to work with.
They are built and maintained by small surgical teams that innovate effectively, whereas older models require literally dozens of engineers just to maintain the code base and fix bugs.
In short, I view dealing with complexity primarily as an exciting architecture and design challenge and I’m proud we assembled a team here at Greenplum that’s equally excited to take on this challenge!
Q10. I published an interview with Marko Rodriguez and Peter Neubauer, leaders of the Tinkerpop2 project. What is your opinion on Graph Analysis and Manipulation for databases?
Florian Waas: Great stuff these guys are building — I’m interested to see how we can combine Big Data with graph analysis!
Q11. Anything else you wish to add?
Florian Waas: It’s been fun!
____________________
Florian Waas (flw) is Director of Software Engineering at EMC/Greenplum and heads up the Query Processing team. His day job is to bring theory and practice together in the form of scalable and robust database technology.
_______________________________
Related Posts
– On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software.
– On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica.
– On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com
Related Resources
ODBMS.ORG: Resources on Analytical Data Platforms: Blog Posts | Free Software| Articles|
##
“Use cases are rote work. The developer listens to business experts and slavishly writes what they hear. There is little interpretation and no abstraction. There is little reconciliation of conflicting use cases. For a database project, the conceptual data model is a much more important software engineering contribution than use cases.” — Dr. Michael Blaha.
First of all let me wish you a Happy, Healthy and Successful 2012!
I am coming back to discuss with our expert Dr. Michael Blaha, the topic of Database Modeling. In a previous interview we looked at the issue of “How good is UML for Database Design”?
Now we look at use cases and discuss how good they are for database modeling. I hope you’ll find the interview interesting. I encourage the community to post comments.
RVZ
Q1. How are requirements taken into account when performing database modeling in daily practice? What are the common problems and pitfalls?
Michael Blaha: Software development approaches vary widely. I’ve seen organizations use the following techniques for capturing requirements (listed in random order).
— Preparation of use cases.
— Preparation of requirements documents.
— Representation and explanation via a conceptual data model.
— Representation and explanation via prototyping.
— Haphazard approach. Just start writing code.
General issues include:
— the amount of time required to capture requirements,
— missing requirements (requirements that are never mentioned),
— forgotten requirements (requirements that are mentioned but then forgotten),
— bogus requirements (requirements that are not germane to the business needs or that needlessly reach into design), and
— incomplete understanding (requirements that are contradictory or misunderstood).
Q2. What is a use case?
Michael Blaha: A use case is a piece of functionality that a system provides to its users. A use case describes how a system interacts with outside actors.
Q3. What are the advantages of use cases?
Michael Blaha:
— Use cases lead to written documentation of requirements.
— They are intuitive to business specialists.
— Use cases are easy for developers to understand.
— They enable aspects of system functionality to be enumerated and managed.
— They include error cases.
— They let consulting shops bill many hours for low-skilled personnel.
(This is a cynical view, but I believe this is a major reason for some of the current practice.)
Q4. What are the disadvantages of use cases?
Michael Blaha:
— They are very time consuming. It takes much time to write them down. It takes much time to interview business experts (time that is often unavailable).
— Use cases are just one aspect of requirements. Other aspects should also be considered, such as existing documentation and artifacts from related software. Many developers obsess on use cases and forget to look for other requirement sources.
— Use cases are rote work. The developer listens to business experts and slavishly writes what they hear. There is little interpretation and no abstraction. There is little reconciliation of conflicting use cases.
— I have yet to see benefit from use case diagramming. I have yet to see significant benefit from use case structuring.
— In my opinion, use cases have been overhyped by marketeers.
Q5. How are use cases typically used in practice for database projects?
Michael Blaha: To capture requirements. It is OK to capture detailed requirements with use cases, but they should be subservient to the class model. The class model defines the domain of discourse that use cases can then reference.
For database applications it is far inferior to start with use cases and only afterwards construct a class model. Database applications, in particular, need a data approach, not a process approach.
It is ironic that use cases have arisen from the object-oriented community. Note that OO programming languages define a class structure to which logic is attached. So it is odd that use cases put process first and defer attention to data structure.
Q6. A possible alternative approach to data modeling is to write use cases first, then identifying the subsystems and components, and finally identifying the database schema. Do you agree with this?
Michael Blaha: This is a popular approach. No I do not agree with it. I strongly disagree.
For a database project, the conceptual data model is a much more important software engineering contribution than use cases.
Only when the conceptual model is well understood can use cases be fully understood and reconciled. Only then can developers integrate use cases and abstract their content into a form suitable for building a quality software product.
Q7. Many requirements and the design to satisfy those requirements are normally done with programming, not just schema. Do you agree with this? How do you handle this with use cases?
Michael Blaha: Databases provide a powerful language, but most do not provide a complete language.
The SQL language of relational databases is far from complete and some other language must be used to express full functionality.
OO databases are better in this regard. Since OO databases integrate a programming language with a persistence mechanism they inherently offer a full language for expressing functionality.
Use cases target functionality and functionality alone. Use cases, by their nature, do not pay attention to data structure.
Q8. Do you need to use UML for use cases?
Michael Blaha: No. The idea of use cases is valuable if used properly (in conjunction with data and normally subservient to data).
In my opinion, UML use case diagrams are a waste of time. They don’t add clarity. They add bulk and consume time.
Q9. Are there any suitable tools around to help the process of creating use cases for database design? If yes, how good are they?
Michael Blaha: Well, it’s clear by now that I don’t think much of use case diagrams. I think a textual approach is OK and there are probably requirement tools to manage such text, but I am unfamiliar with the product space.
Q10. Use case methods of design are usually applied to object-oriented models. Do you use use cases when working with an object database?
Michael Blaha: I would argue not. Most object-oriented languages put data first. First develop the data structure and then attach methods to the structure. Use cases are the opposite of this. They put functionality first.
Q11. Can you use use cases as a design method for relational databases, NoSQL databases, graph databases as well? And if yes how?
Michael Blaha: Not reasonably. I guess developers can force fit any technique and try to claim success.
To be realistic, traditional database developers (relational databases) are already resistant (for cultural reasons) to object-oriented jargon/style and the UML. When I show them that the UML class model is really just an ER model and fits in nicely with database conceptualization, they acknowledge my point, but it is still a foreign culture.
I don’t see how use cases have much to offer for NoSQL and graph databases.
Q12. So if you don’t have use cases, how do you address functionality when building database applications?
Michael Blaha: I strongly favor the technique of interactive conceptual data modeling. I get the various business and technical constituencies in the same room and construct a UML class model live in front of them as we explore their business needs and scope. Of course, the business people articulate their needs in terms of use cases. But the use cases are grounded by the evolving UML class model defining the domain of discourse. Normally I have my hands full with managing the meeting and constructing a class model in front of them. I don’t have time to explicitly capture the use cases (though I am appreciative if someone else volunteers for that task).
However, I fully consider the use cases by playing them against the evolving model. Of course as I consider use cases relative to the class model, I am reconciling the use cases. I am also considering abstraction as I construct the class model and consequently causing the business experts to do more abstraction in formulating their use case business requirements.
I have built class models this way many times before and it works great. Some developers are shocked at how well it can work.
————————————————–
Michael Blaha is a partner at Modelsoft Consulting Corporation.
Dr. Blaha is recognized as one of the world’s leading authorities on databases and data modeling. He has more than 25 years of experience as a consultant and trainer in conceiving, architecting, modeling, designing, and tuning databases for dozens of major organizations around the world. He has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New York.
Related Posts
– How good is UML for Database Design? Interview with Michael Blaha.
– Agile data modeling and databases.
– Why Patterns of Data Modeling?
Related Resources
-ODBMS.org: Databases in General: Blog Posts | Free Software | Articles and Presentations| Lecture Notes | Journals |
##

