ODBMS Industry Watch (http://www.odbms.org/blog) – Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation.

On SQL and NoSQL. Interview with Dave Rosenthal (18 March 2014)
http://www.odbms.org/blog/2014/03/dave-rosenthal/

“Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications.”–Dave Rosenthal.

On SQL and NoSQL, I have interviewed Dave Rosenthal CEO of FoundationDB.

RVZ

Q1. What are the suggested criteria for users when they need to choose between durability for lower latency, higher throughput and write availability?

Dave Rosenthal: There is a tradeoff between commit latency and durability, especially in distributed databases. At one extreme a database client can just report success immediately (without even talking to the database server) and buffer the writes in the background. Obviously, that hides latency well, but you could lose a suffix of transactions. At the other extreme, you can replicate writes across multiple machines, fsync them on each of the machines, and only then report success to the client.

FoundationDB is optimized to provide good performance in its default setting, which is the safest end of that tradeoff.

Usually, if you want some reasonable durability guarantee, you are talking about a commit latency of a small constant factor times the network latency. So, the real latency issues come with databases spanning multiple data centers. In that case FoundationDB users are able to choose whether they want durability guarantees in all data centers before commit (increasing commit latencies), which is our default setting, or whether they would like to relax durability guarantees by returning a commit when the data is fsync'd to disk in just one datacenter.

All that said, in general, we think that the application is usually a more appropriate place to try to hide latency than the database.

Q2. Justin Sheehy of Basho in an interview said [1] “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”. What is your opinion on this?

Dave Rosenthal: Yes, we totally agree with Justin. Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications. In fact, true ACID transactions are way more broadly important than that. They give you the ability to build abstractions and systems that you can provide guarantees about.
As Michael Cahill says in his thesis which became the SIGMOD paper of the year: “Serializable isolation enables the development of a complex system by composing modules, after verifying that each module maintains consistency in isolation.” It’s this incredibly important ability to compose that makes a system with transactions special.
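To make the composability point concrete, here is a minimal sketch using FoundationDB's Python binding: two independently written helpers accept the same transaction object, so composing them yields a single atomic, serializable operation. The key naming scheme, account values and API version number are illustrative, not taken from the interview.

```python
import fdb

fdb.api_version(200)   # illustrative; use the version matching your installed client
db = fdb.open()        # connects via the default cluster file

@fdb.transactional
def get_balance(tr, account):
    v = tr[b'acct:' + account]
    return int(v) if v.present() else 0

@fdb.transactional
def credit(tr, account, amount):
    tr[b'acct:' + account] = str(get_balance(tr, account) + amount).encode()

@fdb.transactional
def transfer(tr, src, dst, amount):
    # Both helpers receive the same transaction object, so the two updates
    # commit atomically: either both happen or neither does.
    credit(tr, src, -amount)
    credit(tr, dst, amount)

transfer(db, b'alice', b'bob', 10)   # retried automatically on conflicts
```

When a `@fdb.transactional` function is called with a database it opens and retries its own transaction; when called with an existing transaction it simply joins it, which is what makes modules composable.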

Q3. FoundationDB claim to provide full ACID transactions. How do you do that?

Dave Rosenthal: In the same basic way as many other transactional databases do. We use a few strategies that tend to work well in distributed systems, such as optimistic concurrency and MVCC. We also, of course, have had to solve some of the fundamental challenges associated with distributed systems and all of the crazy things that can happen in them. Honestly, it's not very hard to build a distributed transactional database. The hard part is making it work gracefully through failure scenarios and run fast.
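For readers unfamiliar with the terms, the toy sketch below illustrates the general idea of optimistic concurrency control over multi-versioned data: a transaction reads against a snapshot version and buffers its writes, and the store rejects the commit if any key in its read set changed after the snapshot. This is a single-process conceptual illustration only, not a description of FoundationDB's internals.

```python
class VersionedStore:
    """Toy MVCC store with optimistic conflict checking at commit time."""

    def __init__(self):
        self.current_version = 0
        self.history = {}  # key -> list of (version, value)

    def read(self, key, version):
        # Return the newest value written at or before the snapshot version.
        for v, value in reversed(self.history.get(key, [])):
            if v <= version:
                return value
        return None

    def commit(self, read_set, writes, read_version):
        # Conflict check: has any key in the read set changed since the snapshot?
        for key in read_set:
            for v, _ in self.history.get(key, []):
                if v > read_version:
                    return False  # conflict -> caller retries the transaction
        self.current_version += 1
        for key, value in writes.items():
            self.history.setdefault(key, []).append((self.current_version, value))
        return True


store = VersionedStore()
store.commit(set(), {"x": 1}, store.current_version)

# A transaction: take a snapshot, read, buffer a write, then try to commit.
rv = store.current_version
x = store.read("x", rv)
ok = store.commit({"x"}, {"x": x + 1}, rv)   # True unless "x" changed meanwhile
```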

Q4. Is this similar to Oracle NoSQL?

Dave Rosenthal: Not really. Both Oracle NoSQL and FoundationDB provide an automatically-partitioned key-value store with fault tolerance. Both also have a concept of ordering keys (for efficient range operations) though Oracle NoSQL only provides ordering “within a Major Key set”. So, those are the similarities, but there are a bunch of other NoSQL systems with all those properties. The huge difference is that FoundationDB provides for ACID transactions over arbitrary keys and ranges, while Oracle NoSQL does not.

Q5. How would you compare your product offering with respect to NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak, and NewSQL such as NuoDB and VoltDB?

Dave Rosenthal: The most obvious response for the NoSQL data stores would be “we have ACID transactions, they don’t”, but the more important difference is in philosophy and strategy.

Each of those products exposes a single data model and interface, maybe two. We are pursuing a fundamentally different strategy.
We are building a storage substrate that can be adapted, via layers, to provide a variety of data models, APIs, and true flexibility.
We can do that because of our transactional capabilities. CouchDB, MongoDB, Cassandra and Riak all have different APIs and we talk to companies that run all of those products side-by-side. The NewSQL database players are also offering a single data model, albeit a very popular one, SQL. FoundationDB is offering an ever increasing number of data models through its “layers”, currently including several popular NoSQL data models and with SQL being the next big one to hit. Our philosophy is that you shouldn’t have to increase the complexity of your architecture by adopting a new NoSQL database each time your engineers need access to a new data model.
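As an illustration of what a "layer" can look like, here is a toy document layer sketched with the FoundationDB Python binding: each field of a document is stored under an ordered tuple key, so one transactional range read reconstructs the whole document. The encoding, names and API version are invented for the example; real layers, including the SQL layer mentioned above, are considerably more sophisticated.

```python
import fdb

fdb.api_version(200)   # illustrative; match your installed client
db = fdb.open()

@fdb.transactional
def save_doc(tr, collection, doc_id, doc):
    # Each field becomes one key: (collection, doc_id, field) -> value.
    for field, value in doc.items():
        tr[fdb.tuple.pack((collection, doc_id, field))] = str(value).encode()

@fdb.transactional
def load_doc(tr, collection, doc_id):
    # A single range read over the tuple prefix rebuilds the document.
    prefix = fdb.tuple.pack((collection, doc_id))
    doc = {}
    for k, v in tr.get_range_startswith(prefix):
        _, _, field = fdb.tuple.unpack(k)
        doc[field] = v.decode()
    return doc

save_doc(db, 'users', 'u1', {'name': 'Ada', 'city': 'London'})
print(load_doc(db, 'users', 'u1'))
```

Because the layer runs inside ordinary transactions, several such layers can safely share one cluster, which is the point being made above.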

Q6. Cloud computing and open source: How does it relate to FoundationDB?

Dave Rosenthal: Cloud computing: FoundationDB has been designed from the beginning to run well in cloud environments that make use of large numbers of commodity machines connected through a network. Probably the most important aspect of a distributed database designed for cloud deployment is exceptional fault tolerance under very harsh and strange failure conditions – the kind of exceptionally unlikely things that can only happen when you have many machines working together with components failing unpredictably. We have put a huge amount of effort into testing FoundationDB in these grueling scenarios, and feel very confident in our ability to perform well in these types of environments. In particular, we have users running FoundationDB successfully on many different cloud providers, and we’ve seen the system keep its guarantees under real-world hardware and network failure conditions experienced by our users.

Open source: Although FoundationDB's core data storage engine is closed source, our layer ecosystem is open source. The core engine has a very simple feature set and is very difficult to modify properly while maintaining correctness, whereas layers are feature-rich and, because they are stateless, much easier to create and modify, which makes them well suited to third-party contributions.

Q7. Please give some examples of use cases where FoundationDB is currently in use. Is FoundationDB in use for analyzing Big Data as well?

Dave Rosenthal: Some examples: user data, metadata, user social graphs, geo data, SQL-layer access via ORMs, metrics collection, etc.

We’ve mostly focused on operational systems, but a few of our customers have built what I would call “big data” applications, which I think of as analytics-focused. The most common use case has been for collecting and analyzing time-series data. FoundationDB is strongest in big data applications that call for lots of random reads and writes, not just big table scans—which many systems can do well.

Q8. Rick Cattell said in a recent interview [2] "there aren't enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves". What is your opinion on this?

Dave Rosenthal: People have great ideas for databases all the time. New data models, new query languages, etc.
If nothing else, this NoSQL experiment that we've all been a part of for the past few years has shown us the appetite for data models suited to specific problems. People with those ideas would love to be able to build these tools, open source them, etc.
The problem is that the checklist of practical considerations for a database is huge: Fault tolerance, scalability, a backup solution, management and monitoring, ACID transactions, etc. Add those together and even the simplest concept sounds like a huge project.

Our vision at FoundationDB is that we have done the hard work to build a storage substrate that simultaneously solves all those tricky practical problems. Our engine can be used to quickly build a database layer for any particular application that inherits all of those solutions and their benefits, like scalability, fault tolerance and ACID compliance.

Q9. Nick Heudecker of Gartner, predicts that [3] “going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time” . What is your take on this?

Dave Rosenthal: I do think that the lines between SQL and NoSQL will start to blur, and I believe that we are leading that charge. We acquired another database startup last year called Akiban that builds an amazing SQL database engine.
In 2014 we’ll be bringing that engine to market as a layer running on top of FoundationDB. That will be a true ANSI SQL database operating as a module directly on top of a transactional “NoSQL” engine, inheriting the operational benefits of our core storage engine – scalability, fault tolerance, ease of operation.

When you run multiple SQL layer modules, you can point many of them at the same key-space in FoundationDB and it's as if they are all part of the same database, with ACID transactions enforced across the separate SQL layer processes.
It’s very cool. Of course, you can even run the SQL layer on a FoundationDB cluster that’s also supporting other data models, like graph or document. That’s about as blurry as it gets.

———–
Dave Rosenthal is CEO of FoundationDB. Dave started his career in games, building a 3D real-time strategy game with a team of high-school friends that won the 1st annual Independent Games Festival. Previously, Dave was CTO at Visual Sciences, a pioneering web-analytics company that is now part of Adobe. Dave has a degree in theoretical computer science from MIT.

Related Posts
Operational Database Management Systems. Interview with Nick Heudecker, ODBMS Industry Watch December 16, 2013

Follow ODBMS.org on Twitter: @odbmsorg

 

Big Data: Three questions to Aerospike. (2 March 2014)
http://www.odbms.org/blog/2014/03/big-data-three-questions-to-aerospike/

“Many tools now exist to run database software without installing software. From vagrant boxes, to one click cloud install, to a cloud service that doesn’t require any installation, developer ease of use has always been a path to storage platform success.”–Brian Bulkowski.

The fifth interview in the “Big Data: three questions to “ series of interviews, is with Brian Bulkowski, Aerospike co-founder and CTO.

RVZ

Q1. What is your current product offering?

Brian Bulkowski: Aerospike is the first in-memory NoSQL database optimized for flash or solid state drives (SSDs).
In-memory for speed and NoSQL for scale. Our approach to memory is unique – we have built our own file system to access flash, we store indexes in DRAM and you can configure data sets to be in a combination of DRAM or flash. This gives you close to DRAM speeds, the persistence of rotational drives and the price performance of flash.
As next gen apps scale up beyond enterprise scale to “global scale”, managing billions of rows, terabytes of data and processing from 20k to 2 million read/write transactions per second, scaling costs are an important consideration. Servers, DRAM, power and operations – the costs add up, so even developers with small initial deployments must architect their systems with the bottom line in mind and take advantage of flash.
Aerospike is an operational database, a fast key-value store with ACID properties – immediate consistency for single row reads and writes, plus secondary indexes and user defined functions. Values can be simple strings, ints, blobs as well as lists and maps.
Queries are distributed and processed in parallel across the cluster and results on each node can be filtered, transformed, aggregated via user defined functions. This enables developers to enhance key value workloads with a few queries and some in-database processing.
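For orientation, here is a minimal sketch of the key-value model described above, using the Aerospike Python client; the host address, namespace, set and bin names are placeholders. A record is addressed by a (namespace, set, key) tuple, and bins can hold strings, integers, lists and maps, with single-record reads and writes being immediately consistent as described above.

```python
import aerospike

config = {'hosts': [('127.0.0.1', 3000)]}          # placeholder cluster address
client = aerospike.client(config).connect()

key = ('test', 'profiles', 'user-42')              # (namespace, set, primary key)
client.put(key, {'name': 'Ada',
                 'segments': ['sports', 'travel'],  # list and map values are supported
                 'last_seen': {'web': 1393718400}})

_, meta, bins = client.get(key)                    # single-record read
print(bins['segments'])

client.close()
```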

Q2. Who are your current customers and how do they typically use your products?

Brian Bulkowski: We see two use cases – one as an edge database or real-time context store (user profile store, cookie store) and another as a very cost-effective and reliable cache in front of a relational database like mySQL or DB2.

Our customers are some of the biggest names in real-time bidding, cross channel (display, mobile, video, social, gaming) advertising and digital marketing, including AppNexus, BlueKai, TheTradeDesk and [X+1]. These companies use Aerospike to store real-time user profile information like cookies, device-ids, IP addresses, clickstreams, combined with behavioral segment data calculated using analytics platforms and models run in Hadoop or data warehouses. They choose Aerospike for predictable high performance, with reads and writes consistently (99% of the time) completing within 2-3 milliseconds.

The second set of customers use us in front of an existing database for more cost-effective and reliable caching. In addition to predictable high performance they don’t want to shard Redis, and they need persistence, high availability and reliability. Some need rack-awareness and cross data center support and they all want to take advantage of Aerospike deployments that are both simpler to manage and more cost-effective than alternative NoSQL databases, in-memory databases and caching technologies.
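The "reliable cache in front of a relational database" use case is essentially a cache-aside pattern. Below is a hedged sketch of the read path with the Aerospike Python client; the relational lookup is a placeholder function, and the namespace, set and TTL values are invented for the example.

```python
import aerospike
from aerospike import exception as ex

config = {'hosts': [('127.0.0.1', 3000)]}
client = aerospike.client(config).connect()

CACHE_TTL_SECONDS = 300                       # example expiry for cached records

def load_user_from_mysql(user_id):
    # Placeholder for the existing relational lookup (MySQL, DB2, ...).
    return {'id': user_id, 'name': 'Ada'}

def get_user(user_id):
    key = ('cache', 'users', user_id)
    try:
        _, _, bins = client.get(key)          # cache hit: served from Aerospike
        return bins
    except ex.RecordNotFound:
        user = load_user_from_mysql(user_id)  # cache miss: fall back to the RDBMS
        client.put(key, user, meta={'ttl': CACHE_TTL_SECONDS})
        return user
```

Because the cache itself is persistent and replicated, a cold restart does not translate into a thundering herd against the backing relational database, which is the reliability point being made above.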

Q3. What are the main new technical features you are currently working on and why?

Brian Bulkowski: We are focused on ease of use, making development easier – quickly writing powerful, scalable applications – with developer tools and connectors. In our Aerospike 3 offering, we launched indexes and distributed queries, user defined functions for in-database processing, expressive API support, and aggregation queries. Performance continues to improve, with support for today’s highly parallel CPUs, higher density flash arrays, and improved allocators for RAM based in-memory use cases.
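As a sketch of the secondary index and distributed query features mentioned above (UDF-based aggregation is omitted), using the Aerospike Python client; the namespace, set, bin and index names are invented for the example.

```python
import aerospike
from aerospike import predicates as p

config = {'hosts': [('127.0.0.1', 3000)]}
client = aerospike.client(config).connect()

# One-time setup: a secondary index on the string bin used for filtering.
client.index_string_create('test', 'profiles', 'segment', 'segment_idx')

query = client.query('test', 'profiles')     # executed in parallel across the cluster
query.select('name', 'visits')
query.where(p.equals('segment', 'sports'))

for _, _, bins in query.results():
    print(bins)

client.close()
```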

Developers love Aerospike because it's easy to run a service operationally. That scale comes after the developer builds their original application, so developers want samples and connectors that are tested and work easily. Whether it's a parallel, scalable ETL loader for CSV and JSON, a Hadoop connector that pours insights directly into Aerospike to drive hot interface changes, an improved Mac OS X client, or HTTP/REST interfaces, developers need the ability to write their core application code to use Aerospike easily.

Many tools now exist to run database software without installing software. From vagrant boxes, to one click cloud install, to a cloud service that doesn’t require any installation, developer ease of use has always been a path to storage platform success.

Related Posts

Big Data: Three questions to McObject, ODBMS Industry Watch, February 14, 2014

Big Data: Three questions to VoltDB. ODBMS Industry Watch, February 6, 2014.

Big Data: Three questions to Pivotal. ODBMS Industry Watch, January 20, 2014.

Big Data: Three questions to InterSystems. ODBMS Industry Watch, January 13, 2014.

Operational Database Management Systems. Interview with Nick Heudecker, ODBMS Industry Watch, December 16, 2013.

Resources

Gartner – Magic Quadrant for Operational Database Management Systems (Access the report via registration). Authors: Donald Feinberg, Merv Adrian, Nick Heudecker, Date Published: 21 October 2013.

ODBMS.org free resources on NoSQL Data Stores
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis.

Follow ODBMS.org on Twitter: @odbmsorg

Big Data: Three questions to McObject. (14 February 2014)
http://www.odbms.org/blog/2014/02/big-data-three-questions-to-mcobject/

    “In a nutshell, pipelining is a programming technique that combines functions from the database system’s library of vector-based functions into an assembly line of processing for market data, with the output of one function becoming input for the next.”–Steven T. Graves.

The fourth interview in the "Big Data: three questions to " series of interviews is with Steven T. Graves, President and CEO of McObject.

    RVZ

    Q1. What is your current product offering?

    Steven T. Graves: McObject has two product lines. One is the eXtremeDB product family. eXtremeDB is a real-time embedded database system built on a core in-memory database system (IMDS) architecture, with the eXtremeDB IMDS edition representing the “standard” product. Other eXtremeDB editions offer special features and capabilities such as an optional SQL API, high availability, clustering, 64-bit support, optional and selective persistent storage, transaction logging and more.

    In addition, our eXtremeDB Financial Edition database system targets real-time capital markets systems such as algorithmic trading and risk management (and has its own Web site). eXtremeDB Financial Edition comprises a super-set of the individual eXtremeDB editions (bundling together all specialized libraries such as clustering, 64-bit support, etc.) and offers features including columnar data handling and vector-based statistical processing for managing market data (or any other type of time series data).

    Features shared across the eXtremeDB product family include: ACID-compliant transactions; multiple application programming interfaces (a native and type-safe C/C++ API; SQL/ODBC/JDBC; native Java, C# and Python interfaces); multi-user concurrency with an optional multi-version concurrency control (MVCC) transaction manager; event notifications; cache prioritization; and support for multiple database indexes (b-tree, r-tree, kd-tree, hash, Patricia trie, etc.). eXtremeDB’s footprint is small, with an approximately 150K code size. eXtremeDB is available for a wide range of server, real-time operating system (RTOS) and desktop operating systems, and McObject provides eXtremeDB source code for porting.

    McObject’s second product offering is the Perst open source, object-oriented embedded database system, available in all-Java and all-C# (.NET) versions. Perst is small (code size typically less than 500K) and very fast, with features including ACID-compliant transactions; specialized collection classes (such as a classic b-tree implementation; r-tree indexes for spatial data; database containers optimized for memory-only access, etc.); garbage collection; full-text search; schema evolution; a “wrapper” that provides a SQL-like interface (SubSQL); XML import/export; database replication, and more.

    Perst also operates in specialized environments. Perst for .NET includes support for .NET Compact Framework, Windows Phone 8 (WP8) and Silverlight (check out our browser-based Silverlight CRM demo, which showcases Perst’s support for storage on users’ local file systems). The Java edition supports the Android smartphone platform, and includes the Perst Lite embedded database for Java ME.

    Q2. Who are your current customers and how do they typically use your products?

    Steven T. Graves: eXtremeDB initially targeted real-time embedded systems, often residing in non-PC devices such as set-top boxes, telecom switches or industrial controllers.
There are literally millions of eXtremeDB-based devices deployed by our customers; a few examples are set-top boxes from DIRECTV (eXtremeDB is the basis of an electronic programming guide); F5 Networks' BIG-IP network infrastructure (eXtremeDB is built into the devices' proprietary embedded operating system); and BAE Systems (avionics in the Panavia Tornado GR4 combat jet). A recent new customer in telecom/networking is Compass-EOS, which has released the first photonics-based core IP router, using eXtremeDB High Availability to manage the device's control plane database.

    Addition of “enterprise-friendly” features (support for SQL, Java, 64-bit, MVCC, etc.) drove eXtremeDB’s adoption for non-embedded systems that demand fast performance. Examples include software-as-a-service provider hetras Gmbh (eXtremeDB handles the most performance-intensive queries in its Cloud-based hotel management system); Transaction Network Services (eXtremeDB is used in a highly scalable system for real-time phone number lookups/ routing); and MeetMe.com (formerly MyYearbook.com – eXtremeDB manages data in social networking applications).

    In the financial industry, eXtremeDB is used by a variety of trading organizations and technology providers. Examples include the broker-dealer TradeStation (McObject’s database technology is part of its next-generation order execution system); Financial Technologies of India, Ltd. (FTIL), which has deployed eXtremeDB in the order-matching application used across its network of financial exchanges in Asia and the Middle East; and NSE.IT (eXtremeDB supports risk management in algorithmic trading).

    Users of Perst are many and varied, too. You can find Perst in many commercial software applications such as enterprise application management solutions from the Wily Division of CA. Perst has also been adopted for community-based open source projects, including the Frost client for the Freenet global peer-to-peer network. Some of the most interesting Perst-based applications are mobile. For example, 7City Learning, which provides training for financial professionals, gives students an Android tablet with study materials that are accessed using Perst. Several other McObject customers use Perst in mobile medical apps.

    Q3. What are the main new technical features you are currently working on and why?

    Steven T. Graves: One feature we’re very excited about is the ability to pipeline vector-based statistical functions in eXtremeDB Financial Edition – we’ve even released a short video and a 10-page white paper describing this capability. In a nutshell, pipelining is a programming technique that combines functions from the database system’s library of vector-based functions into an assembly line of processing for market data, with the output of one function becoming input for the next.

    This may not sound unusual, since almost any algorithm or program can be viewed as a chain of operations acting on data.
    But this pipelining has a unique purpose and a powerful result: it keeps market data inside CPU cache as the data is being worked.
    Without pipelining, the results of each function would typically be materialized outside cache, in temporary tables residing in main memory. Handing interim results back and forth “across the transom” between CPU cache and main memory imposes significant latency, which is eliminated by pipelining. We’ve been improving this capability by adding new statistical functions to the library. (For an explanation of pipelining that’s more in-depth than the video but shorter than the white paper, check out this article on the financial technology site Low-Latency.com.)
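To make the idea concrete in a language-neutral way, here is a tiny Python analogue of the pattern: each stage consumes the previous stage's output incrementally, so no interim result is ever materialized as a whole. This is only a conceptual illustration; eXtremeDB's pipelining composes its library of vector-based functions natively so that the working set stays in CPU cache.

```python
# Conceptual analogue of pipelining: generators pass values from stage to stage
# without materializing intermediate tables.

def percent_change(values):
    prev = None
    for v in values:
        if prev is not None:
            yield 100.0 * (v - prev) / prev
        prev = v

def moving_average(values, window):
    buf = []
    for v in values:
        buf.append(v)
        if len(buf) > window:
            buf.pop(0)
        yield sum(buf) / len(buf)

closes = [10.0, 10.5, 10.2, 10.8, 11.1, 10.9]      # toy closing prices
pipeline = moving_average(percent_change(iter(closes)), window=3)
for smoothed in pipeline:
    print(round(smoothed, 3))
```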

    We are also adding to the capabilities of eXtremeDB Cluster edition to make clustering faster and more flexible, and further simplify cluster administration. Improvements include a local tables option, in which database tables can be made exempt from replication, but shareable through a scatter/gather mechanism. Dynamic clustering, added in our recent v. 5.0 upgrade, enables nodes to join and leave clusters without interrupting processing. This further simplifies administration for a clustering database technology that counts minimal run-time maintenance as a key benefit. On selected platforms, clustering now supports the Infiniband switched fabric interconnect and Message Passing Interface (MPI) standard. In our tests, these high performance networking options accelerated performance more than 7.5x compared to “plain vanilla” gigabit networking (TCP/IP and Ethernet).

    Related Posts

    Big Data: Three questions to VoltDB.
ODBMS Industry Watch, February 6, 2014.

    Big Data: Three questions to Pivotal.
    ODBMS Industry Watch, January 20, 2014.

    Big Data: Three questions to InterSystems.
    ODBMS Industry Watch, January 13, 2014.

    Cloud based hotel management– Interview with Keith Gruen.
    ODBMS Industry Watch, July 25, 2013

    In-memory database systems. Interview with Steve Graves, McObject.
    ODBMS Industry Watch, March 16, 2012

    Resources

    ODBMS.org: Free resources on Big Data, Analytics, Cloud Data Stores, Graphs Databases, NewSQL, NoSQL, Object Databases.

Follow ODBMS.org on Twitter: @odbmsorg

Operational Database Management Systems. Interview with Nick Heudecker (16 December 2013)
http://www.odbms.org/blog/2013/12/operational-database-management-systems-interview-with-nick-heudecker/

    “Going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time.”–Nick Heudecker.

    Gartner recently published a new report on “Operational Database Management Systems”. I have interviewed one of the co-authors of the report, Nick Heudecker, Research Director – Information Management at Gartner, Inc.

    Happy Holidays and Happy New Year to you and yours!


    RVZ

    Q1. You co-authored Gartner’s new report, “Magic Quadrant for Operational Database Management Systems”. How do you define “Operational Database Management Systems” (ODBMS)?

    Nick Heudecker: Prior to operational DBMS, the common label for databases was OLTP. However, OLTP no longer accurately describes the range of activities an operational DBMS is called on to support. Additionally, mobile and social, elements of Gartner’s Nexus of Forces, have created new activity types which we broadly classify as interactions and observations. Supporting these new activity types has resulted in new vendors entering the market to compete with established vendors. Also, OLTP is no longer valid as all transactions are on-line.

    Q2. What were the main evaluation criteria you used for the “Magic Quadrant for Operational Database Management Systems” report?

    Nick Heudecker: The primary evaluation criteria for any Magic Quadrant consists of customer reference surveys. Vendors are also evaluated on market understanding, strategy, offerings, business model, execution, and overall viability.

    Q3. To be included in the Magic Quadrant, what were the criteria that vendors and products had to meet?

Nick Heudecker: To be included in the Magic Quadrant, vendors had to have at least ten customer references, meet a minimum revenue number and meet our definition of the market.

    Q4. What is new in the last year in the Operational Database Management Systems area, in your view? What is changing?

    Nick Heudecker: Innovations in the operational DBMS area have developed around flash memory, DRAM improvements, new processor technology, networking and appliance form factors. Flash memory devices have become faster, larger, more reliable and cheaper. DRAM has become far less costly and grown in size to greater than 1TB available on a server.
    This has not only enabled larger disk caching, but also led to the development and wider use of in-memory DBMSs. New processor technology not only enables better DBMS performance in physically smaller servers, but also allows virtualization to be used for multiple applications and the DBMS on the same server. With new methods of interconnect such as 10-gigabit Ethernet and Infiniband, the connection between the storage systems and the DBMS software on the server is far faster. This has also increased performance and allowed for larger storage in a smaller space and faster interconnect for distributed data in a scale-out architecture. Finally, DBMS appliances are beginning to gain acceptance.

    Q5. You also co-authored Gartner’s “Who’s Who in NoSQL Databases” report back in August. What is the current status of the NoSQL market in your opinion?

    Nick Heudecker: There is a substantial amount of interest in NoSQL offerings, but also a great deal of confusion related to use cases and how vendor offerings are differentiated.
    One question we get frequently is if NoSQL DBMSs are viable candidates to replace RDBMSs. To date, NoSQL deployments have been overwhelmingly supplemental to traditional relational DBMS deployments, not destructive.

    Q6. How does the NoSQL market relate to the Operational Database Management Systems market?

    Nick Heudecker: First, it’s difficult to define a NoSQL market. There are four distinct categories of NoSQL DBMS (document, key-value, table-style and graph), each with different capabilities and addressable use cases. That said, the various types of NoSQL DBMSs are included in the operational DBMS market based on capabilities around interactions and observations.

    Q8. What do you see happening with Operational Database Management Systems, going forward?

    Nick Heudecker: Going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time.

    ——————

    Nick Heudecker is a Research Director for Gartner Inc, covering information management topics and specializing in Big Data and NoSQL.
    Prior to Gartner, Mr. Heudecker worked with several Bay Area startups and developed an enterprise software development consulting practice. He resides in Silicon Valley.

    —————————-
    Resources

    Gartner, “Magic Quadrant for Operational Database Management Systems,” by Donald Feinberg, Merv Adrian, and Nick Heudecker, October 21, 2013

    Access the full Gartner Magic Quadrant report (via MarkLogic web site- registration required).

    Access the full Gartner Magic Quadrant report (via Aerospike web site- registration required)

    Related Posts

    On Big Data and NoSQL. Interview with Renat Khasanshyn. October 7, 2013

    On NoSQL. Interview with Rick Cattell. August 19, 2013

    Follow ODBMS.org on Twitter: @odbmsorg

Challenges and Opportunities for Big Data. Interview with Mike Hoskins (3 December 2013)
http://www.odbms.org/blog/2013/12/challenges-and-opportunities-for-big-data-interview-with-mike-hoskins/

“We are facing an imminent torrent of machine generated data, creating volumes that will break the back of conventional hardware and software architectures. It is no longer feasible to move the data to the compute process – the compute process has to be moved to the data” –Mike Hoskins.

    On the topic, Challenges and Opportunities for Big Data, I have interviewed Mike Hoskins, Actian Chief Technology Officer.

    RVZ

    Q1. What are in your opinion the most interesting opportunities in Big Data?

Mike Hoskins: Until recently, most data projects were solely focused on preparation. Seminal developments in the big data landscape, including Hortonworks Data Platform (HDP) 2.0 and the arrival of YARN (Yet Another Resource Negotiator) – which takes Hadoop’s capabilities in data processing beyond the limitations of the highly regimented and restrictive MapReduce programming model – provide an opportunity to move beyond the initial hype of big data and instead towards the more high-value work of predictive analytics.
    As more big data applications are built on the Hadoop platform customized by industry and business needs, we’ll really begin to see organizations leveraging predictive analytics across the enterprise – not just in a sandbox or in the domain of the data scientists, but in the hands of the business users. At that point, more immediate action can be taken on insights.

    Q2. What are the most interesting challenges in Big Data?

Mike Hoskins: We are facing an imminent torrent of machine generated data, creating volumes that will break the back of conventional hardware and software architectures. It is no longer feasible to move the data to the compute process – the compute process has to be moved to the data. Companies need to rethink their static and rigid business intelligence and analytic software architectures in order to continue working at the speed of business. It’s clear that time has become the new gold standard – you can’t produce more of it; you can only increase the speed at which things happen.
    Software vendors with the capacity to survive and thrive in this environment will keep pace with the competition by offering a unified platform, underpinned by engineering innovation, completeness of solution and the service integrity and customer support that is essential to market staying power.

    Q3. Steve Shine, CEO and President, Actian Corporation, said in a recent interview (*) that “the synergies in data management come not from how the systems connect but how the data is used to derive business value”. Actian has completed a number of acquisitions this year. So, what is your strategy for Big Data at Actian?

    Mike Hoskins: Actian has placed its bets on a completely modern unified platform that is designed to deliver on the opportunities presented by the Age of Data. Our technology assets bring a level of maturity and innovation to the space that no other technology vendor can provide – with 30+ years of expertise in ‘all things data’ and over $1M investment in innovation. Our mission is to arm organizations with solutions that irreversibly shift the price/performance curve beyond the reach of traditional legacy stack players, allowing them to get a leg up on the competition, retain customers, detect fraud, predict business trends and effectively use data as their most important asset.

    Q4. What are the products synergies related to such a strategy?

    Mike Hoskins: Through the acquisition of Pervasive Software (a provider of big data analytics and cloud-based and on-premises data management and integration), Versant (an industry leader in specialized data management), and ParAccel (a leader in high-performance analytics), Actian has compiled a unified end-to-end platform with capabilities to connect, prep, optimize and analyze data natively on Hadoop, and then offer it to the necessary reporting and analytics environments to meet virtually any business need. All the while, operating on commodity hardware at a much lower cost than legacy software can ever evolve to.

    Q5. What else still need to be done at Actian to fully deploy this strategy?

    Mike Hoskins: There are definitely opportunities to continue integrating the platform experience and improve the user experience overall. Our world-class database technology can be brought closer to Hadoop, and we will continue innovating on analytic techniques to grow our stack upward.
Our development team is working diligently to create a common user interface across all of our platforms as we bring our technology together. We have the opportunity to create a true first-class SQL engine running natively on Hadoop, and to more fully exploit market-leading cooperative computing with our On-Demand Integration (ODI) capabilities. I would also like to raise the awareness of the power and speed of our offerings as a general paradigm for analytic applications.

    We don’t know what new challenges the Age of Data will bring, but we will continue to look to the future and build out a technology infrastructure to help organizations deal with the only constant – change.

    Q6. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?

Mike Hoskins: Elastic cloud computing is a convulsive game changer in the marketplace. It’s positive: even if it’s not where you do full production, at the very least it allows people to test, adopt and experiment with their data in a way that they couldn’t before. For cases where data is born in the cloud, using a 100% cloud model makes sense. However, much data is highly distributed across cloud and on-premises systems and applications, so it’s vital to have technology that can run in and connect to either environment via a hybrid model.

    We will soon see more organizations utilizing cloud platforms to run analytic processes, if that is where their data is born and lives.

    Q7. How is your Cloud technology helping Amazon`s Redshift?

    Mike Hoskins: Amazon Redshift leverages our high-performance analytics database technology to help users get the most out of their cloud investment. Amazon selected our technology over all other database and data warehouse technologies available in the marketplace because of the incredible performance, extreme scalability, and flexibility.

    Q8. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey.
    When you speak with your customers what are the typical use cases and requirements they have?

    Mike Hoskins: A recent survey of data architects and CIOs by Sand Hill Group revealed that the top challenge of Hadoop adoption was knowledge and experience with the Hadoop platform, followed by the availability of Hadoop and big data skills, and finally the amount of technology development required to implement a Hadoop-based solution. This just goes to show how little we have actually begun to fully leverage the capabilities of Hadoop. Businesses are really only just starting to dip their toe in the analytic water. Although it’s still very early, the majority of use cases that we have seen are centered around data prep and ETL.

    Q9. What do you think is still needed for big data analytics to be really useful for the enterprise?

    Mike Hoskins: If we look at the complete end-to-end data pipeline, there are several things that are still needed for enterprises to take advantage of the opportunities. This includes high productivity, performant integration layers, and analytics that move beyond the sphere of data science and into mainstream business usage, with discovery analytics through a simple UI studio or an analytics-as-a-service offering. Analytics need to be made more available in the critical discovery phase, to bring out the outcomes, patterns, models, discoveries, etc. and begin applying them to business processes.

    Qx. Anything else you wish to add?

    Mike Hoskins: These kinds of highly disruptive periods are, frankly, unnerving for the marketplace and businesses. Organizations cannot rely on traditional big stack vendors, who are unprepared for the tectonic shift caused by big data, and therefore are not agile enough to rapidly adjust their platforms to deliver on the opportunities. Organizations are forced to embark on new paths and become their own System Integrators (SIs).

    On the other hand, organizations cannot tie their future to the vast number of startups, throwing darts to find the one vendor that will prevail. Instead, they need a technology partner somewhere in the middle that understands data in-and-out, and has invested completely and wholly as a dedicated stack to help solve the challenge.

    Although it’s uncomfortable, it is urgent that organizations look at modern architectures, next-generation vendors and innovative technology that will allow them to succeed and stay competitive in the Age of Data.

    —————————–
    Mike Hoskins, Actian Chief Technology Officer
    Actian CTO Michael Hoskins directs Actian’s technology innovation strategies and evangelizes accelerating trends in big data, and cloud-based and on-premises data management and integration. Mike, a Distinguished and Centennial Alumnus of Ohio’s Bowling Green State University, is a respected technology thought leader who has been featured in TechCrunch, Forbes.com, Datanami, The Register and Scobleizer. Mike has been a featured speaker at events worldwide, including Strata NY + Hadoop World 2013, the keynoter at DeployCon 2012, the “Open Standards and Cloud Computing” panel at the Annual Conference on Knowledge Discovery and Data Mining, the “Scaling the Database in the Cloud” panel at Structure 2010, and the “Many Faces of Map Reduce – Hadoop and Beyond” panel at Structure Big Data 2011. Mike received the AITP Austin chapter’s 2007 Information Technologist of the Year Award for his leadership in developing Actian DataRush, a highly parallelized framework to leverage multicore. Follow Mike on Twitter: @MikeHSays.

    Related Posts

    Big Data Analytics at Thomson Reuters. Interview with Jochen L. Leidner. November 15, 2013

    On Big Data. Interview with Adam Kocoloski. November 5, 2013

    Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. September 23, 2013

    (*) Acquiring Versant –Interview with Steve Shine. March 6, 2013

    Resources

“Do You Hadoop? A Survey of Big Data Practitioners”, Bradley Graham and M. R. Rangaswami, SandHill Group, October 29, 2013 (.PDF)

Actian Vectorwise 3.0: Fast Analytics and Answers from Hadoop. Actian Corporation, technical white paper (PDF), May 2013.

Big Data from Space: the “Herschel” telescope. (2 August 2013)
http://www.odbms.org/blog/2013/08/big-data-from-space-the-herschel-telescope/

    ” One of the biggest challenges with any project of such a long duration is coping with change. There are many aspects to coping with change, including changes in requirements, changes in technology, vendor stability, changes in staffing and so on”–Jon Brumfitt.

On May 14, 2009, the European Space Agency launched an Ariane 5 rocket carrying the largest telescope ever flown: the “Herschel” telescope, 3.5 meters in diameter.

    I first did an interview with Dr. Jon Brumfitt, System Architect & System Engineer of Herschel Scientific Ground Segment, at the European Space Agency in March 2011. You can read that interview here.

    Two years later, I wanted to know the status of the project. This is a follow up interview.

    RVZ

    Q1. What is the status of the mission?

    Jon Brumfitt: The operational phase of the Herschel mission came to an end on 29th April 2013, when the super-fluid helium used to cool the instruments was finally exhausted. By operating in the far infra-red, Herschel has been able to see cold objects that are invisible to normal telescopes.
    However, this requires that the detectors are cooled to an even lower temperature. The helium cools the instruments down to 1.7K (about -271 Celsius). Individual detectors are then cooled down further to about 0.3K. This is very close to absolute zero, which is the coldest possible temperature. The exhaustion of the helium marks the end of new observations, but it is by no means the end of the mission.
    We still have a lot of work to do in getting the best results from the data processing to give astronomers a final legacy archive of high-quality data to work with for years to come.

The spacecraft has been in orbit around a point known as the second Lagrangian point “L2”, which is about 1.5 million kilometres from Earth (around four times as far away as the Moon). This location provided a good thermal environment and a relatively unrestricted view of the sky. The spacecraft cannot be left in this orbit because regular correction manoeuvres would be needed. Consequently, it is being transferred into a “parking” orbit around the Sun.

    Q2. What are the main results obtained so far by using the “Herschel” telescope?

    Jon Brumfitt: That is a difficult one to answer in a few sentences. Just to take a few examples, Herschel has given us new insights into the way that stars form and the history of star formation and galaxy evolution since the big-bang.
    It has discovered large quantities of cold water vapour in the dusty disk surrounding a young star, which suggests the possibility of other water covered planets. It has also given us new evidence for the origins of water on Earth.
    The following are some links giving more detailed highlights from the mission:

    – Press
    – Results
    – Press Releases
    – Latest news

    With its 3.5 metre diameter mirror, Herschel is the largest space telescope ever launched. The large mirror not only gives it a high sensitivity but also allows us to observe the sky with a high spatial resolution. So in a sense every observation we make is showing us something we have never seen before. We have performed around 35,000 science observations, which have already resulted in over 600 papers being published in scientific journals. There are many years of work ahead for astronomers in interpreting the results, which will undoubtedly lead to many new discoveries.

    Q3. How much data did you receive and process so far? Could you give us some up to date information?

    Jon Brumfitt: We have about 3 TB of data in the Versant database, most of which is raw data from the spacecraft. The data received each day is processed by our data processing pipeline and the resulting data products, such as images and spectra, are placed in an archive for access by astronomers.
    Each time we make a major new release of the software (roughly every six months at this stage), with improvements to the data processing, we reprocess everything.
The data processing runs on a grid with around 35 nodes, each with typically 8 cores and between 16 and 256 GB of memory. This is able to process around 40 days’ worth of data per day, so it is possible to reprocess everything in a few weeks. The data in the archive is stored as FITS files (a standard format for astronomical data).
    The archive uses a relational (PostgreSQL) database to catalogue the data and allow queries to find relevant data. This relational database is only about 60 GB, whereas the product files account for about 60 TB.
    This may reduce somewhat for the final archive, once we have cleaned it up by removing the results of earlier processing runs.
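As an illustration of the catalogue-plus-files split described above, here is a hypothetical sketch of how a client might query such a relational catalogue to locate FITS product files. The table and column names are invented purely for the example, since the actual Herschel archive schema is not described in the interview.

```python
import psycopg2

# Hypothetical schema: a 'products' table whose rows describe FITS files on disk.
conn = psycopg2.connect("dbname=archive user=archive_reader")
cur = conn.cursor()

cur.execute(
    """
    SELECT product_id, file_path
    FROM products
    WHERE instrument = %s
      AND observation_start >= %s
      AND processing_version = (SELECT MAX(processing_version) FROM products)
    """,
    ("PACS", "2012-01-01"),
)
for product_id, file_path in cur.fetchall():
    print(product_id, file_path)   # each path points at a FITS product file

cur.close()
conn.close()
```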

    Q4. What are the main technical challenges in the data management part of this mission and how did you solve them?

    Jon Brumfitt: One of the biggest challenges with any project of such a long duration is coping with change. There are many aspects to coping with change, including changes in requirements, changes in technology, vendor stability, changes in staffing and so on.

    The lifetime of Herschel will have been 18 years from the start of software development to the end of the post-operations phase.
    We designed a single system to meet the needs of all mission phases, from early instrument development, through routine in-flight operations to the end of the post-operations phase. Although the spacecraft was not launched until 2009, the database was in regular use from 2002 for developing and testing the instruments in the laboratory. By using the same software to control the instruments in the laboratory as we used to control them in flight, we ended up with a very robust and well-tested system. We call this approach “smooth transition”.

    The development approach we adopted is probably best classified as an Agile iterative and incremental one. Object orientation helps a lot because changes in the problem domain, resulting from changing requirements, tend to result in localised changes in the data model.
    Other important factors in managing change are separation of concerns and minimization of dependencies, for example using component-based architectures.

    When we decided to use an object database, it was a new technology and it would have been unwise to rely on any database vendor or product surviving for such a long time. Although work was under way on the ODMG and JDO standards, these were quite immature and the only suitable object databases used proprietary interfaces.
    We therefore chose to implement our own abstraction layer around the database. This was similar in concept to JDO, with a factory providing a pluggable implementation of a persistence manager. This abstraction provided a route to change to a different object database, or even a relational database with an object-relational mapping layer, should it have proved necessary.

    One aspect that is difficult to abstract is the use of queries, because query languages differ. In principle, an object database could be used without any queries, by navigating to everything from a global root object. However, in practice navigation and queries both have their role. For example, to find all the observation requests that have not yet been scheduled, it is much faster to perform a query than to iterate by navigation to find them. However, once an observation request is in memory it is much easier and faster to navigate to all the associated objects needed to process it. We have used a variety of techniques for encapsulating queries. One is to implement them as methods of an extent class that acts as a query factory.
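The following sketch (in Python for brevity; the Herschel software itself is written in Java) illustrates the two patterns described above: a factory that hides which persistence-manager implementation is plugged in, and an extent class that encapsulates a query such as "all observation requests not yet scheduled". All class and method names are invented for the illustration.

```python
class PersistenceManager(object):
    """Abstraction layer: clients never see the underlying database API."""
    def make_persistent(self, obj): raise NotImplementedError
    def query(self, cls, predicate): raise NotImplementedError

class InMemoryPersistenceManager(PersistenceManager):
    """Stand-in implementation; a Versant- or JDO-backed one would plug in here."""
    def __init__(self):
        self._store = []
    def make_persistent(self, obj):
        self._store.append(obj)
    def query(self, cls, predicate):
        return [o for o in self._store if isinstance(o, cls) and predicate(o)]

class PersistenceFactory(object):
    _impl = None
    @classmethod
    def configure(cls, impl):
        cls._impl = impl          # swap implementations without touching clients
    @classmethod
    def manager(cls):
        return cls._impl

class ObservationRequest(object):
    def __init__(self, ident):
        self.ident = ident
        self.scheduled = False

class ObservationRequestExtent(object):
    """Query factory: callers ask for 'unscheduled requests', not for OQL/SQL."""
    def __init__(self, pm):
        self._pm = pm
    def unscheduled(self):
        return self._pm.query(ObservationRequest, lambda r: not r.scheduled)

PersistenceFactory.configure(InMemoryPersistenceManager())
pm = PersistenceFactory.manager()
pm.make_persistent(ObservationRequest("obs-001"))
print([r.ident for r in ObservationRequestExtent(pm).unscheduled()])
```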

    Another challenge was designing a robust data model that would serve all phases of the mission from instrument development in the laboratory, through pre-flight tests and routine operations to the end of post-operations. We approached this by starting with a model of the problem domain and then analysing use-cases to see what data needed to be persistent and where we needed associations. It was important to avoid the temptation to store too much just because transitive persistence made it so easy.

    One criticism that is sometimes raised against object databases is that the associations tend to encode business logic in the object schema, whereas relational databases just store data in a neutral form that can outlive the software that created it; if you subsequently decide that you need a new use-case, such as report generation, the associations may not be there to support it. This is true to some extent, but consideration of use cases for the entire project lifetime helped a lot. It is of course possible to use queries to work-around missing associations.

    Examples are sometimes given of how easy an object database is to use by directly persisting your business objects. This may be fine for a simple application with an embedded database, but for a complex system you still need to cleanly decouple your business logic from the data storage. This is true whether you are using a relational or an object database. With an object database, the persistent classes should only be responsible for persistence and referential integrity and so typically just have getter and setter methods.
    We have encapsulated our persistent classes in a package called the Core Class Model (CCM) that has a factory to create instances. This complements the pluggable persistence manager. Hence, the application sees the persistence manager and CCM factories and interfaces, but the implementations are hidden.
    Applications define their own business classes which can work like decorators for the persistent classes.

    Q5. What is your experience in having two separate database systems for Herschel? A relational database for storing and managing processed data products and an object database for storing and managing proposal data, mission planning data, telecommands and raw (unprocessed) telemetry?

    Jon Brumfitt: There are essentially two parts to the ground segment for a space observatory.
    One is the “uplink” which is used for controlling the spacecraft and instruments. This includes submission of observing proposals, observation planning, scheduling, flight dynamics and commanding.
    The other is the “downlink”, which involves ingesting and processing the data received from the spacecraft.

    On some missions the data processing is carried out by a data centre, which is separate from spacecraft operations. In that case there is a very clear separation.
    On Herschel, the original concept was to build a completely integrated system around an object database that would hold all uplink and downlink data, including processed data products. However, after further analysis it became clear that it was better to integrate our product archive with those from other missions. This also means that the Herschel data will remain available long after the project has finished. The role of the object database is essentially for operating the spacecraft and storing the raw data.

    The Herschel archive is part of a common infrastructure shared by many of our ESA science projects. This provides a uniform way of accessing data from multiple missions.
A nice example is the way data from Herschel and our XMM-Newton X-ray telescope have been combined to make a multi-spectral image of the Andromeda Galaxy.

    Our archive, in turn, forms part of a larger international archive known as the “Virtual Observatory” (VO), which includes both space and ground-based observatories from all over the world.

    I think that using separate databases for operations and product archiving has worked well. In fact, it is more the norm rather than the exception. The two databases serve very different roles.
    The uplink database manages the day-to-day operations of the spacecraft and is constantly being updated. The uplink data forms a complex object graph which is accessed by navigation, so an object database is well suited.
    The product archive is essentially a write-once-read-many repository. The data is not modified, but new versions of products may be added as a result of reprocessing. There are a large number of clients accessing it via the Internet. The archive database is a catalogue containing the product meta-data, which can be queried to find the relevant product files. This is better suited to a relational database.

    The motivation for the original idea of using a single object database for everything was that it allowed direct association between uplink and downlink data. For example, processed products could be associated with their observation requests. However, using separate databases does not prevent one database being queried with an observation identifier obtained from the other.
    One complication is that processing an observation requires both downlink data and the associated uplink data.
    We solved this by creating “uplink products” from the relevant uplink data and placing them in the archive. This has the advantage that external users, who do not have access to the Versant database, have everything they need to process the data themselves.

    Q6. What are the main lessons learned so far in using Versant object database for managing telemetry data and information on steering and calibrating scientific on-board instruments?

    Jon Brumfitt: Object databases can be very effective for certain kinds of application, but may have less benefit for others. A complex system typically has a mixture of application types, so the advantages are not always clear cut. Object databases can give a high performance for applications that need to navigate through a complex object graph, particularly if used with fairly long transactions where a significant part of the object graph remains in memory. Web (JavaEE) applications lose some of the benefit because they typically perform many short transactions with each one performing a query. They also use additional access layers that result in a system which loses the simplicity of the transparent persistence of an object database.

    In our case, the object database was best suited for the uplink. It simplified the uplink development by avoiding object-relational mapping and the complexity of a design based on JDBC or EJB 2. Nowadays with JPA, relational databases are much easier to use for object persistence, so the rationale for using an object database is largely determined by whether the application can benefit from fast navigational access and how much effort is saved in mapping. There are now at least two object database vendors that support both JDO and JPA, so the distinction is becoming somewhat blurred.
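
To make the JPA point concrete, here is a minimal, hypothetical sketch in Java (the entity and the query are invented for illustration and are not taken from the Herschel system): annotations map the class to a table, and data is retrieved declaratively rather than through hand-written object-relational mapping code.

    import javax.persistence.Entity;
    import javax.persistence.EntityManager;
    import javax.persistence.Id;
    import java.util.List;

    @Entity
    public class Observation {                  // hypothetical entity; not the real Herschel schema
        @Id public Long id;                     // primary key and columns mapped by JPA annotations
        public String instrument;
        public long startTime;

        public Observation() {}                 // JPA requires a no-argument constructor

        // Retrieve observations by criteria with a declarative JPQL query,
        // instead of navigating an object graph or writing JDBC code by hand.
        public static List<Observation> since(EntityManager em, String inst, long t0) {
            return em.createQuery(
                    "SELECT o FROM Observation o WHERE o.instrument = :inst AND o.startTime >= :t0",
                    Observation.class)
                .setParameter("inst", inst)
                .setParameter("t0", t0)
                .getResultList();
        }
    }

With JDO and an object database, much the same class could be persisted directly; with JPA the relational mapping is declared once in annotations, which is the reduction in mapping effort referred to above.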

For telemetry access we query the database instead of using navigation, as the packets don’t fit neatly into a single containment hierarchy. Queries allow packets to be accessed by many different criteria, such as time, instrument, type, source and so on.
    Processing calibration observations does not introduce any special considerations as far as the database is concerned.

    Q7. Did you have any scalability and or availability issues during the project? If yes, how did you solve them?

    Jon Brumfitt: Scalability would have been an important issue if we had kept to the original concept of storing everything including products in a single database. However, using the object database for just uplink and telemetry meant that this was not a big issue.

    The data processing grid retrieves the raw telemetry data from the object database server, which is a 16-core Linux machine with 64 GB of memory. The average load on the server is quite low, but occasionally there have been high peak loads from the grid that have saturated the server disk I/O and slowed down other users of the database. Interactive applications such as mission planning need a rapid response, whereas batch data processing is less critical. We solved this by implementing a mechanism to spread out the grid load by treating the database as a resource.

    Once a year, we have made an “Announcement of Opportunity” for astronomers to propose observations that they would like to perform with Herschel. It is only human nature that many people leave it until the last minute and we get a very high peak load on the server in the last hour or two before the deadline! We have used a separate server for this purpose, rather than ingesting proposals directly into our operational database. This has avoided any risk of interfering with routine operations. After the deadline, we have copied the objects into the operational database.

    Q8. What about the overall performance of the two databases? What are the lessons learned?

    Jon Brumfitt: The databases are good at different things.
    As mentioned before, an object database can give a high performance for applications involving a complex object graph which you navigate around. An example is our mission planning system. Object persistence makes application design very simple, although in a real system you still need to introduce layers to decouple the business logic from the persistence.

    For the archive, on the other hand, a relational database is more appropriate. We are querying the archive to find data that matches a set of criteria. The data is stored in files rather than as objects in the database.

    Q9. What are the next steps planned for the project and the main technical challenges ahead?

Jon Brumfitt: As I mentioned earlier, the coming post-operations phase will concentrate on further improving the data processing software to generate a top-quality legacy archive, and on provision of high-quality support documentation and continued interactive support for the community of astronomers that forms our “customer base”. The system was designed from the outset to support all phases of the mission, from early instrument development tests in the laboratory, through routine operations to the end of the post-operations phase of the mission. The main difference moving into post-operations is that we will stop uplink activities and ingesting new telemetry. We will continue to reprocess all the data regularly as improvements are made to the data processing software.

    We are currently in the process of upgrading from Versant 7 to Versant 8.
    We have been using Versant 7 since launch and the system has been running well, so there has been little urgency to upgrade.
    However, with routine operations coming to an end, we are doing some “technology refresh”, including upgrading to Java 7 and Versant 8.

    Q10. Anything else you wish to add?

    Jon Brumfitt: These are just some personal thoughts on the way the database market has evolved over the lifetime of Herschel. Thirteen years ago, when we started development of our system, there were expectations that object databases would really take off in line with the growing use of object orientation, but this did not happen. Object databases still represent rather a niche market. It is a pity there is no open-source object-database equivalent of MySQL. This would have encouraged more people to try object databases.

    JDO has developed into a mature standard over the years. One of its key features is that it is “architecture neutral”, but in fact there are very few implementations for relational databases. However, it seems to be finding a new role for some NoSQL databases, such as the Google AppEngine datastore.
    NoSQL appears to be taking off far quicker than object databases did, although it is an umbrella term that covers quite a few kinds of datastore. Horizontal scaling is likely to be an important feature for many systems in the future. The relational model is still dominant, but there is a growing appreciation of alternatives. There is even talk of “Polyglot Persistence” using different kinds of databases within a system; in a sense we are doing this with our object database and relational archive.

    More recently, JPA has created considerable interest in object persistence for relational databases and appears to be rapidly overtaking JDO.
    This is partly because it is being adopted by developers of enterprise applications who previously used EJB 2.
    If you look at the APIs of JDO and JPA they are actually quite similar apart from the locking modes. However, there is an enormous difference in the way they are typically used in practice. This is more to do with the fact that JPA is often used for enterprise applications. The distinction is getting blurred by some object database vendors who now support JPA with an object database. This could expand the market for object databases by attracting some traditional relational type applications.

    So, I wonder what the next 13 years will bring! I am certainly watching developments with interest.
    ——

    Dr Jon Brumfitt, System Architect & System Engineer of Herschel Scientific Ground Segment, European Space Agency.

    Jon Brumfitt has a background in Electronics with Physics and Mathematics and has worked on several of ESA’s astrophysics missions, including IUE, Hipparcos, ISO, XMM and currently Herschel. After completing his PhD and a post-doctoral fellowship in image processing, Jon worked on data reduction for the IUE satellite before joining Logica Space and Defence in 1980. In 1984 he moved to Logica’s research centre in Cambridge and then in 1993 to ESTEC in the Netherlands to work on the scientific ground segments for ISO and XMM. In January 2000, he joined the newly formed Herschel team as science ground segment System Architect. As Herschel approached launch, he moved down to the European Space Astronomy Centre in Madrid to become part of the Herschel Science Operations Team, where he is currently System Engineer and System Architect.

    Related Posts

    The Gaia mission, one year later. Interview with William O’Mullane. January 16, 2013

    Objects in Space: “Herschel” the largest telescope ever flown. March 18, 2011

    Resources

    Introduction to ODBMS By Rick Grehan

    ODBMS.org Resources on Object Database Vendors.

    —————————————
    You can follow ODBMS.org on Twitter : @odbmsorg

    ##

    Isis2: A new Open Platform for Data Replication in the Cloud.–Interview with Ken Birman. http://www.odbms.org/blog/2013/07/isis2-a-new-open-platform-for-data-replication-in-the-cloud-interview-with-ken-birman/ http://www.odbms.org/blog/2013/07/isis2-a-new-open-platform-for-data-replication-in-the-cloud-interview-with-ken-birman/#comments Thu, 11 Jul 2013 06:31:18 +0000 http://www.odbms.org/blog/?p=2359

    “My target is to be the MapReduce solution for the world’s realtime problems.”— Ken Birman.

    I had the pleasure to interview Ken Birman, Professor of Computer Science at Cornell University. Ken is a pioneer and one of the leading researchers in the field of distributed systems. Ken has been working recently on a new system called Isis2 (isis2.codeplex.com). I asked him a few questions on Isis2.

    RVZ

    Q1. Recently, you have been working on the Isis2 project. What is it?

    Ken Birman: Isis2 started as a bit of a hobby for me, and the story actually dates back almost 25 years.
    Early in my career, I created a system called the Isis Toolkit, using a model we called virtual synchrony to support strongly consistent replication for services running on clusters of various kinds. We had quite a success with that version of Isis, and it was the basis of the applications I mentioned above – I started a company and we did very well with it. In fact, for more than a decade there was never a disruptive failure at the New York Stock Exchange: components crashed now and then, obviously, but Isis would help guide the solution to a clean recovery and the traders were never aware of any issues. The same is true for the French ATC system or the US Navy AEGIS.

Now this model, virtual synchrony, has deep connections both to the ACID models used in database settings, and to the state machine replication model supported by protocols like Lamport’s Paxos. Indeed, recently we were able to show a bisimulation between the virtually synchronous protocol called gbcast (dating to around 1985) and a version of Paxos for dynamically reconfigurable systems. In some sense, gbcast was the first Paxos – Leslie Lamport says we should have named the protocol “virtually synchronous Paxos”.
    (Of course if we had, I suspect that he would have named his own protocol something else!). I could certainly do the same with respect to database serializability – basically, virtual synchrony is like the ACID model, but aimed at “groups of processes” that might use purely in-memory replication. In effect my protocols were optimized ones aimed at supporting ACI but without the D.

    Anyhow, over the years, the Isis Toolkit matured and people moved on. I moved on too, and worked on topics involving gossip communication, and security. But eventually this came to frustrate me: I felt that my own best work was being lost to disuse. And so starting four years ago, I decided to create a new and more modern Isis for the cloud. I’m calling it Isis2, and it uses the model Leslie and I talked about (virtually synchronous state machine replication). I’ve implemented the whole thing from scratch, using modern object-oriented languages and tools.

    So Isis2 is a new open platform for data replication in the cloud, aiming at cases where a developer might be building some sort of service that would end up running on a cloud platform like Eucalyptus, Amazon EC2, etc. My original idea was to create the ultimate library for distributed computing, but also to make it as easy to use as the GUI builders that our students use in their first course on object oriented programming: powerful technologies but entirely accessed via very simple ideas, such as attaching an event handler to a suitable object. A first version of the Isis2 system became available about two years ago, and I’ve been working on it as much as possible since then, with new releases every couple of months.

    But I’ve also slightly shifted my focus, in ways that are bringing Isis2 closer and closer to the big-data and ODBMS community.
    I realized that unlike the situation 20 years ago, today the people who most need tools for replicating data with strong consistency are mostly working to create in-memory big-data platforms for the cloud, and then running large machine-learning algorithms on them.
    For example, one important target for Isis2 is to support the smart grid here in the United States. So one imagines an application capturing all sorts of data in real-time, from what might be a vast number of sensing points, pulling this into the cloud, and then running a machine learning algorithm on the resulting in-memory data set. You replicate such a service for high availability and to gain faster read-only performance, and then run a distributed algorithm that learns the parameters to some model: perhaps a convex optimization, perhaps a support vector, etc.

    I’ve run into other applications with a similar structure. For example, self-driving cars and other AUVs need to query a rapidly evolving situational database, asking questions such as “what road hazards should I be watching for the next 250 meters?” The community doing large-scale simulations for problems in particle physics needs to solve the “nearby particle” problem: “now that I’m at location such-and-such, what particles are close enough to interact with me?” One can make quite a long list. All of these demand rapid updates, in place, to a database that is living in-memory and being queried frequently.

    With this in mind, I’ve been extending the Isis2 system to enlarge its support for key-value data, spread within a service and replicated, but with each item residing at just small subsets of the total set of members (“sharded”). My angle has been to offer strong consistency (not full transactions, because those involve locking and become expensive, but a still-powerful “one-shot atomic actions” model). Because Isis2 can offer these guarantees for a workload that includes a mix of updates and queries, the community working on graphical learning and other key-value learning problems where data might evolve even as the system is tracking various properties has shown strong interest in the platform.

Q2. You claim that Isis2 is a new option for cloud computing that can enable reliable, secure replication of data even in the highly elastic first tier of the cloud. How does it differ from existing commercial products such as NoSQL databases, Amazon Web Services, etc.?

Ken Birman: Well, there are really a few differences. One is that Isis2 is a library. So if you compare with other technologies, the closest matches are probably GraphLab from CMU and Spark, Berkeley’s in-memory storage system for accelerating Hadoop computations. A second big difference is that Isis2 has a very strong consistency model, putting me at the other end of the spectrum relative to NoSQL. As for Web Services, well, the thinking is that these services you use Isis2 to create would often be web services, accessed over the web via some representative, which then turns around and interacts with other members using the Isis2 primitives.

Q3. How do you handle Big Data in Isis2?

    Ken Birman: I’m adding more and more features, but there are two that should be especially interesting to your readers. One is the Isis2 DHT: a key-value store spread over a group of nodes, such that any given key maps to some small subset of the members (a shard). So keys might, for example, be node-ids for a graph, and the values could represent data associated with that node: perhaps a weight, in and out edges to other nodes, etc. The model is incredibly flexible: a key can be any object you like, and the values can also be arbitrary objects. You even can control the mapping of keys to group members, by implementing a specialized version of GetHashCode().

    What I do is to allow atomic actions on sets of key-value pairs. So you can insert a collection of key-value pairs, or query a set of them, or even do an ordered multicast to just the nodes that host some set of keys. These actions occur on a consistent cut, giving a form of action by action atomicity. And of course, the solution is also fault-tolerant and even offers a strong form of security via AES-256 encryption, if desired.

    What this enables is a powerful new form of distributed aggregation, in which one can guarantee that a query result reflects exactly-once contributions from each matching key-value pair. You can also insert new key-value tuples as part of one of these actions (although those occur as new atomic actions, ordered after the one that initiated them), or even reshuffle the mapping of keys to nodes, by “reconfiguring” the group to use a new GetHashCode() method.

    You can also use LINQ in these groups, for example to efficiently query the “slice” of a key-value set that mapped to a particular set of nodes.

    As I mentioned, I also have a second big-data feature that should become available later this summer: a tool for moving huge objects around within a cluster of cloud nodes. So suppose that an application is working with gigabyte memory-mapped files.
    Sending these around inside messages could be very costly, and will cause the system to stutter because anything ordered after that send or multicast might have to wait for the gigabyte to transfer. With this new feature, I do out-of-band movement of the big objects in a very efficient way, using IP multicast if available (and a form of fat-tree TCP mesh if not), and then ship just a reference to the big object in my messages. This way the in-band communication won’t stutter, and the gigabyte object gets replicated using very efficient out-of-band protocols. By the end of the summer, I’m hoping to have the world’s fastest tool for replicating big memory-mapped objects in the cloud. (Of course, there can be quite a gap between “hoping” and “will have”, but I’m optimistic).

    Q4. Isis2 uses a distributed sharding scheme. What is it?

    Ken Birman: Again, there are a few answers – if Isis2 has a flaw, I suspect that it comes down to trying to offer every possible cool mechanism to every imaginable developer. Just like Microsoft .NET this can mean that you end up with too many stories. (You won’t be surprised to learn that Isis2 is actually built in .NET, using C#, and hence usable from any .NET language. We use Mono to compile it for use on Linux. And it does have a bit of that .NET flavor).

    Anyhow, one scheme is the sharding approach mentioned above. The user takes some group – perhaps it has 10,000 members in it and runs on 10,000 virtual machines on Amazon EC2. And they ask Isis2 to shard the group into shards of some size, maybe 5 or 10 – you pick a factor aimed at soaking up the query load but not imposing too high an update cost.

    So in that case I might end up with 1000 or 2000 subgroups: those would be my shards. The mapping is very explicit: given a key-value pair with key K, I compute K.GetHashCode()%NS where NS is the number of shards obtained from the target group size (call it N) and the target shard size (call this S): hashcode%(N/S). And that tells me which shard will hold your key-value pair.
    Given the group membership, which is consistent in Isis2, my protocols just count off from left to right to find the members that host this shard. The atomic update or query reaches out to those members: I involve all of them if the action is an update, and I load-balance over the shard for queries.
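
A small sketch of that mapping arithmetic in generic Java (not the Isis2 API; the member list and key type are assumptions): the shard count NS comes from the group size N divided by the target shard size S, the key’s hash code selects a shard, and the shard index picks a contiguous run of members counted off from the left.

    import java.util.List;

    public class ShardMapping {
        // Returns the members hosting the shard for the given key.
        // N = members.size(), S = shardSize, NS = N / S shards in total.
        static <K> List<String> shardMembers(List<String> members, K key, int shardSize) {
            int numShards = members.size() / shardSize;            // NS = N / S
            int shard = Math.floorMod(key.hashCode(), numShards);  // hashcode % NS picks the shard
            int first = shard * shardSize;                         // count off members left to right
            return members.subList(first, first + shardSize);
        }

        public static void main(String[] args) {
            // Six members with a target shard size of 2 gives three shards of two members each.
            List<String> members = List.of("m0", "m1", "m2", "m3", "m4", "m5");
            System.out.println(shardMembers(members, "someKey", 2));
        }
    }

An update would then go to every member of the returned shard, while a query could be load-balanced across them, as described above.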

    The other mechanism is coarser-grained: one Isis2 application can actually support multiple process groups. And each of those groups can be a DHT, sharded separately. This is convenient if an application has one long-lived in-memory group for the database, but wants to create temporary data structures too. What you can do is to use a separate temporary group for those temporary key-value pairs or structures. Then when your computation is finished, you have an easy way to clean up: you just tell Isis2 to discard the temporary group and all its contents will evaporate, like magic.

    Q5. Isis2 sounds like MapReduce. Is this right? What are the differences and what are the similarities?

    Ken Birman: Roberto, you are exactly right about this. While you can treat the Isis2 DHT as a key-value store, a natural style of computing would be to use the kinds of code one sees with iterative MapReduce applications. Isis2 is in-memory, of course (although nothing stops you from pairing it with a persistent database or log – in fact I provide tools for doing that). But you could do this, and if you did, Isis2 would be acting a lot like Spark.

    I think the main point is that whereas a system like Spark just speeds MapReduce up by caching partial results, and really works only for immutable or append-only data sets, Isis2 can support dynamic updates while the queries are running. My target is to be the MapReduce solution for the world’s realtime problems. Moreover, I’m doing this for people who need strong consistency.
Get the answer wrong in the power grid and you cause a power outage or damage a transformer (does anyone remember how Christchurch took itself off the New Zealand power grid for a month?) Get the answer wrong in a self-driving car, and it might have an accident. So for situations in which data is rapidly evolving, Isis2 offers a way to do highly scalable queries even as you support updates in a highly scalable way too.

    I actually see the similarity to MapReduce as a good thing. To me this model has been incredibly popular, because people find it easy to use. I can leverage that popularity.

The bigger contrast is with true transactional databases. Isis2 does support locking, but not full-scale transactions. It would be dishonest to say that I don’t think transactions can run at massive scale – in fact I’m working with a post-doc, Ittay Eyal, on an extension of Isis2 that we call ACID-RAIN that has a full transactional model. And I know of others who are doing competing systems – Marcos Aguilera at Microsoft, for example, who is hard at work on his follow-on to the famous Sinfonia system.

    But I don’t think we need the full costs of ACID transactions for in-memory key-value stores, and this has pushed me towards a lock-free computing model for this part of Isis2, and towards the kind of weaker atomicity I offer: guarantees for individual actions on sets of key-value pairs, but with each step in your computation treated as a separate one-shot transaction.

    Q6. You claim to be able to run consistent queries even on rapidly changing data, and yet scales as well as any sharding scheme. Please explain how?

Ken Birman: The question centers on semantics: what do I mean by a consistent query? For me, a query that ran on a consistent cut (like with snapshot isolation) and gives a result that reflects exactly one contribution from each matching key-value pair is a “consistent” query. This is definitely what you want for some purposes. But if you want to define consistency in terms of full transactions, I’m not taking that last step. I offer the building blocks – locking, atomic multicast, etc. You could easily run a transaction and do a 2-phase commit at the end. But I just doubt that this would perform well and so I haven’t picked it as my main consistency model.

Q7. Why use LINQ as a query language and not SQL?

    Ken Birman: I’m using LINQ, but maybe not in the way you are thinking. There are some projects that do use LINQ as their main user API. In the Isis2 space, Naiad and the Dryad system that preceded it would be famous examples. But I find it hard to work with systems that “map” my LINQ query to a distributed key-value structure. I’ve always favored simpler building blocks.

So the way I’m using LINQ is more limited. For me, a user might issue some sort of query, and at the first step this looks like a multicast: you specify the target group or the target shards (depending on whether you want every member or just the ones associated with certain keys), and the information you offer as arguments to the query or multicast is automatically translated to an efficient external form and transmitted to the relevant members. On the members an upcall event occurs to a handler the developer coded, perhaps in C# or perhaps in some other .NET language like C++/CLI, Visual Basic, Java, etc. I think .NET has something like 40 languages you can use – I myself work mostly in C#.

    So this C# handler is invoked with the specified arguments, in parallel, on the set of group members that matched the target you specified. Each one of them has a slice of the full DHT: the subset of key-value pairs that mapped to that particular member, in the way we talked about earlier – a mapping you as the user can control, and even modify dynamically (if you do, Isis2 reshuffles the key-value pairs appropriately).

    What I do with LINQ is to let your code access this slice of the DHT as a collection on which you could run a LINQ query. And in fact, as you may know, LINQ has an SQL front end, so you could even use SQL on these collections. So the C# handler has this potentially big set of key-value tuples, the full set for that slice (remember that I guarantee consistency!), and with LINQ each of those members now can compute its share of the answer.

    What happens next is up to you. You could use this answer to insert new key-value tuples: that would be like the shuffle and reduce step in MapReduce. You could send the answers to the query initiator, as a reply – Isis2 supports 1-K queries that get lists of K results back, and the initiator could then aggregate the results (perhaps using LINQ at that step too, or SQL). And finally I have a tree-structured aggregation option, where I build a binary tree spanning the participants and combine the subresults using user-specified code, so that just one aggregated result comes out, after log(N) delay. That last option would be sensible if you might end up with a very large number of answers, or really big ones.
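
As a loose analogy in Java (Java streams standing in for LINQ; the slice contents and the combine step are invented for illustration and are not the Isis2 API): each member evaluates the query over its own slice of key-value pairs, and partial results are then combined pairwise, as in the tree-structured aggregation just described.

    import java.util.Map;

    public class SliceAggregation {
        // Each member runs this over its own slice of the key-value store.
        static long partialCount(Map<String, Double> slice, double threshold) {
            return slice.values().stream()
                        .filter(v -> v > threshold)   // declarative query over the local slice
                        .count();
        }

        // User-specified combine step, applied pairwise up a binary tree of members,
        // so a single aggregated result emerges after about log(N) combining rounds.
        static long combine(long a, long b) {
            return a + b;
        }

        public static void main(String[] args) {
            Map<String, Double> sliceA = Map.of("k1", 0.9, "k2", 0.1);
            Map<String, Double> sliceB = Map.of("k3", 0.7);
            long total = combine(partialCount(sliceA, 0.5), partialCount(sliceB, 0.5));
            System.out.println("values above threshold: " + total);   // prints 2
        }
    }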

    Q8. How fault-tolerant is Isis2?

    Ken Birman: Virtual synchrony is very powerful as a tool for building pretty much any fault-tolerance approach one likes (people who prefer Paxos can reread that sentence as “the dynamic state machine model…” and people who think in terms of ACID transactions can see this as “ACID-based SQL systems…”). The model I favor is one in which updates are always applied in an all-or-nothing way, but queries might be partially completed and then abandoned (“aborted”) if a failure occurs.

    So with Isis2, the basic guarantee is that an atomic multicast or query reaches all the operational group members that it is supposed to reach, and in a total order with respect to conflicting updates. Queries either reflect exactly once contributions from each matching key-value pair, or they abort (if a failure disrupts them), and you need to reissue the request in the new group membership – the new process group “view”.
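
A minimal sketch of the usage pattern this implies (the group interface and the exception type below are hypothetical placeholders, not the real Isis2 classes): updates are all-or-nothing, while a query disrupted by a failure is simply reissued once the new membership view is in place.

    public class QueryRetry {
        // Hypothetical handle on a replicated group; not an actual Isis2 type.
        interface Group<Q, R> {
            R query(Q request) throws ViewChangedException;
        }

        // Hypothetical signal that a failure aborted the query and a new view was installed.
        static class ViewChangedException extends Exception {}

        static <Q, R> R queryWithRetry(Group<Q, R> group, Q request) {
            while (true) {
                try {
                    return group.query(request);   // runs against the current membership view
                } catch (ViewChangedException e) {
                    // Partial results are discarded; simply reissue in the new view.
                }
            }
        }
    }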

    Q9. What about Hadoop? Do you plan to have an interface to Hadoop?

    Ken Birman: I’ve suggested this topic to a few of my PhD students but up to now, none has “bit”. I think my group tends to attract students who have more of a focus on lower levels of the infrastructure. But with our work on the smart power grid, this could change; I’m starting to interact much more with students who come from the machine learning community and who might find that step very appealing. It wouldn’t really be all that hard to do.

    Q10. Isis2 is open source. How can developers contribute?

    Ken Birman: This is a great question. The basic system is open source under a free 3-clause BSD license. You can access it here: isis2.codeplex.com. – I have a big user manual there, and one of those compiled HTML documentation pages for each of my APIs, and I’m going to be doing some form of instructional MOOC soon, so there should end up being ten or so short videos showing how to program with the system.

    Initially, I think that people who would want to play with Isis2 might be best off limiting themselves to working with the system but not changing my code directly. My worry is that the code really is quite subtle and while I would love to have help, my experience here at Cornell has been that even well-intentioned students have made a lot of mistakes by not really understanding my code and then trying to change it. I’m sorry to say that as of now, not a line of third party code has survived – not because I don’t want help (I would love help), but because so far, all the third-party code has died horribly when I really tested it carefully!

    But over time, I’m hoping that Isis2 could become more and more of a community tool. In fact complex or not, I do think others could definitely master it and help me debug and extend it. They just need to move no faster than the speed of light, which is kind of slow where large, complex tools are concerned. Building things that work “over” Isis2 but don’t change it is an especially good path: I’ll be happy to fix bugs other people identify, and then your add-ons can become third party extensions without being somehow key parts of the system. Then as things mature (keep in mind that Isis2 itself is nearly four years old right now), things could gradually migrate into the core release.

    I think this is how other big projects tend to evolve towards an open development model: nobody trusts a contributor who shows up one day and announces that he wants to rewrite half the system. But if that person hangs around for a while and proves his or her talents over time, they end up in the inner circle. So I’m not trying to be overprotective, but I do want the system to be incredibly robust.

    This is how we’re building the ACID-RAIN system I mentioned. Ittay Eyal owns the architecture of that system, but he has no interest at all in replicating things already available in Isis2, which after all is quite a big and powerful system. So he’s using it but rather than building ACID-RAIN by modifying Isis2, his system will be more of an application that uses Isis2. To the extent that he needs things Isis2 is lacking, I can build them for him. But later, if ACID-RAIN becomes the world’s ultimate SQL solution for the cloud, maybe Isis2 and ACID-RAIN would merge in some way. Over time I have no doubt at all that a talented developer like Ittay could become as expert as he needs to be even with my most obscure stuff.

And the fact is that I do need help. Tuning Isis2 to work really well in these massive settings is a hard challenge for me; more hands would definitely help. Right now I’m in the middle of porting it to work well with Infiniband connections, for example. You might think such a step would be trivial, and in a way it is: I just need to adapt the way that I talk to my network sockets. But in fact this has all sorts of implications for flow control, and timing in many protocols. A small step becomes a big task. (I’m kind of shuddering at the thought of needing to move to IPv6!) I can support the system now, with 3000 downloads to date but relatively few really active users. If someday I have lots of users and a StackOverflow.com tag of my very own, I’ll need a hand!

    Qx Anything else you wish to add?

Ken Birman: Roberto, I just want to thank you for the great job you do with the ODBMS.org blog. I think it has become a great resource, and I hope this little interview gets a few of your readers interested in fooling around with Isis2! Tell them to feel free to contact me at Cornell, or at ken@cs.cornell.edu.

    —————–
Ken Birman is the N. Rama Rao Professor of Computer Science at Cornell. An ACM Fellow and the winner of the IEEE Tsutomu Kanai award, Ken has written 3 textbooks and published more than 150 papers in prestigious journals and conferences. Software he developed operated the New York Stock Exchange for more than a decade without trading disruptions, and played central roles in the French Air Traffic Control System (now expanding into much of Europe) and the US Navy AEGIS warship. Other technologies from his group found their way into IBM’s WebSphere product, Amazon’s EC2 and S3 systems, and Microsoft’s cluster management solutions. His latest system, Isis2 (isis2.codeplex.com), helps developers create secure, strongly consistent and scalable cloud computing solutions.

    Resources

Ken Birman’s Publication Listings.

    ODBMS.org Cloud Data Stores – Lecture Notes: “Data Management in the Cloud” by Michael Grossniklaus, David Maier, Portland State University.
    Lecture Notes | Intermediate/Advanced | English | DOWNLOAD ~280 slides (PDF)| 2011-12|

    Related Posts

    On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com. November 2, 2011

    Follow ODBMS.org on Twitter: @odbmsorg
    ##

Acquiring Versant –Interview with Steve Shine. http://www.odbms.org/blog/2013/03/acquiring-versant-interview-with-steve-shine/ http://www.odbms.org/blog/2013/03/acquiring-versant-interview-with-steve-shine/#comments Wed, 06 Mar 2013 17:26:21 +0000 http://www.odbms.org/blog/?p=2096

“So the synergies in data management come not from how the systems connect but how the data is used to derive business value” –Steve Shine.

    On Dec. 21, 2012, Actian Corp. announced the completion of the transaction to buy Versant Corporation. I have interviewed Steve Shine, CEO and President, Actian Corporation.

    RVZ

    Q1. Why acquiring an object-oriented database company such as Versant?

Steve Shine: Versant Corporation, like us, has a long pedigree in solving complex data management problems in some of the world’s largest organisations. We see many synergies in bringing the two companies together. The most important of these is that together we are able to invest more resources in helping our customers extract even more value from their data. Our direct clients will have a larger product portfolio to choose from, our partners will be able to expand in adjacent solution segments, and strategically we arm ourselves with the skills and technology to fulfil our plans to deliver innovative solutions in the emerging Big Data Market.

Q2. For the enterprise market, Actian offers its legacy Ingres relational database. Versant, on the other hand, offers an object oriented database, especially suited for complex science/engineering applications. How does this fit? Do you have a strategy on how to offer a set of support processes and related tools for the enterprise? If yes, how?

Steve Shine: While the two databases may not have a direct logical connection at client installations, we recognise that most clients use these two products as part of larger, more holistic solutions to support their operations. The data they manage is the same and interacts to solve business issues – for example, object stores to manage the relationships between entities, transactional systems to manage clients and the supply chain, and analytic systems to monitor and tune operational performance. Different systems using the same underlying data to drive a complex business.

We plan to announce a vision of an integrated platform designed to help our clients manage all their data and their complex interactions, both internal and external, so they can not only focus on running their business, but better exploit the incremental opportunity promised by Big Data.

    Q3. Bernhard Woebker, president and chief executive officer of Versant stated, “the combination of Actian and Versant provides numerous synergies for data management”. Could you give us some specific examples of such synergies for data management?

    Steve Shine: Here is a specific example of what I mean by helping clients extracting more value from data in the Telco space. These type of incremental opportunities exist in every vertical we have looked at.

    An OSS system in a Telco today may use an Object store to manage the complex relationships between the data, the same data is used in a relational store to monitor, control and manage the telephone network.

    Another relational store using variants of the same data manages the provisioning, billing and support for the users of the network. The whole data set in Analytical stores is used to monitor and optimise performance and usage of the network.

Fast forwarding to today, the same data used in more sophisticated ways has allowed voice and data networks to converge to provide a seamless interface to mobile users. As a result, Telcos have tremendous incremental revenue opportunities BUT only if they can exploit the data they already have in their networks. For example: the data on their networks has allowed for a huge increase in location-based services; knowledge and analysis of the data content has allowed providers to push targeted advertising and other revenue-earning services at their users; and turning the phone into a common billing device captures an even greater share of the service providers’ revenue… You get the picture.

    Now imagine other corporations being able to exploit their information in similar ways: Would a retailer benefit from knowing the preferences of who’s in their stores? Would a hedge fund benefit from detecting a sentiment shift for a stock as it happens? Even knowledge of simple events can help organisations become more efficient.
A salesman knowing immediately when a key client raises a support ticket; a product manager knowing what’s being asked on discussion forums; a marketing manager knowing a perfect prospect is on the website.

So the synergies in data management come not from how the systems connect but how the data is used to derive business value. We want to help manage all the data in our customers’ organisations and help them drive incremental value from it. That is what we mean by numerous synergies from data management and we have a vision to deliver it to our customers.

Q4. Actian claims to have more than 10,000 customers worldwide. What is the value proposition of Versant’s acquisition for existing Actian customers?

    Steve Shine: I have covered this in the answers above. They get access to a larger portfolio of products and services and we together drive a vision to help them extract greater value from their data.

    Q5. Versant claims to have more than 150,000 installations worldwide. How do you intend to support them?

Steve Shine: Actian already runs a 24/7 global support organisation that prides itself in delivering one of the industry’s best client satisfaction scores. As far as numbers are concerned, Versant’s large user count is in essence driven by only 250 or so very sophisticated large installations whereas Actian already deals with over 10,000 discrete mission-critical installations worldwide. So we are confident of maintaining our very high support levels, and the Versant support infrastructure is being integrated into Actian’s as we speak.

    Q6. Actian is active in the market for big data analytics. How does Versant’s database technology fit into Actian’s big data analytics offerings and capabilities?

Steve Shine: Using the example above, imagine using OSS data to analyse network utilisation, CDRs and billing information to identify pay plans for your most profitable clients.

Now give these clients the ability to take business action on real-time changes in their data. Now imagine being able to do that with an integrated product set from one vendor. We will be announcing the vision behind this strategy this quarter. In addition, the Versant technology gives us additional options for Big Data solutions, for example visualisation and managing metadata.

    Q7. Do you intend to combine or integrate your analytics database Vectorwise with Versant’s database technology (such as Versant JPA)? If yes, how?

Steve Shine: Specific plans for integrating products within the overall architecture have not been formulated. We have a strong philosophy that you should use the best tool for the job, e.g. an OODB for some things, an OLTP RDBMS for others, etc. But the real value comes from being able to perform sophisticated analysis and management across the different data stores. That is part of the work our platform integration efforts are focused on.

    Q8. What are the plans for future software developments. Will you have a joint development team or else?

Steve Shine: We will be merging the engineering teams to focus on providing innovative solutions for Big Data under single leadership.

Q9. You have recently announced two partnerships for Vectorwise, with Inferenda and BiBoard. Will you also pursue this indirect channel path for Versant’s database technology?

Steve Shine: The beauty of the vision we speak of is that our joint partners have a real opportunity to expand their solutions using Actian’s broader product set and, for those that are innovative, the opportunity to address new emerging markets.

    Q10. Versant recently developed Versant JPA. Is the Java market important for Actian?

Steve Shine: Yes!

    Q11. It is currently a crowded database market: several new database vendors (NoSQL and NewSQL) offering innovative database technology (NuoDB, VoltDB, MongoDB, Cassandra, Couchbase, Riak to name a few), and large companies such as IBM and Oracle, are all chasing the big data market. What is your plan to stand out of the crowd?

Steve Shine: We are very excited about the upcoming announcement on our plans for the Big Data market. We will be happy to brief you on the details closer to the time, but I will say that early feedback from analyst houses like Gartner has confirmed that our solution is very effective and differentiated in helping corporations extract business value from Big Data. On a higher scale, many of the start-ups are going to get a very rude awakening when they find that delivering a database for mission-critical use is much more than speed and scale of technology. Enterprises want world-class 24×7 support service with failsafe resilience and security. Real industry-grade databases take years and many millions of dollars to reach scalable maturity. Most of the start-ups will not make it. Actian is uniquely positioned in being profitable and having delivered industry-grade database innovation, while also being singularly focused around data management, unlike the broad, cumbersome and expensive bigger players. We believe value-conscious enterprises will see our maturity and agility as a great strength.

    Qx Anything else you wish to add?

    Steve Shine: DATA! – What a great thing to be involved in! Endless value, endless opportunities for innovation and no end in sight as far as growth is concerned. I look forward to the next 5 years.

    ———————–

    Steve Shine, CEO and President, Actian Corporation.
    Steve comes to Actian from Sybase where he was senior vice president and general manager for EMEA, overseeing all operational, sales, financial and human resources in the region for the past three years. While at Sybase, he achieved more than 200 million in revenue and managed 500 employees, charting over 50 percent growth in the Business Intelligence market for Sybase. Prior to Sybase, Steve was at Canadian-based Geac Computer Corporation for ten successful years, helping to successfully turn around two major global divisions for the ERP firm.

    Related Posts

    Managing Internet Protocol Television Data. — An interview with Stefan Arbanowski. June 25, 2012

    On Versant`s technology. Interview with Vishal Bagga. August 17, 2011

    Resources

Big Data: Principles and best practices of scalable realtime data systems. Nathan Marz (Twitter) and James Warren, MEAP began January 2012, Manning Publications.

    Analyzing Big Data With Twitter. A special UC Berkeley iSchool course.

    -A write-performance improvement of ZABBIX with NoSQL databases and HistoryGluon. MIRACLE LINUX CORPORATION, February 13, 2013

    Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs. Ben Engber, CEO, Thumbtack Technology, JANUARY 2013.

    Follow ODBMS.org on Twitter: @odbmsorg
    ##

In-memory database systems. Interview with Steve Graves, McObject. http://www.odbms.org/blog/2012/03/in-memory-database-systems-interview-with-steve-graves-mcobject/ http://www.odbms.org/blog/2012/03/in-memory-database-systems-interview-with-steve-graves-mcobject/#comments Fri, 16 Mar 2012 07:43:44 +0000 http://www.odbms.org/blog/?p=1371

“Application types that benefit from an in-memory database system are those for which eliminating latency is a key design goal, and those that run on systems that simply have no persistent storage, like network routers and low-end set-top boxes” — Steve Graves.

On the topic of in-memory database systems, I interviewed one of our experts, Steve Graves, co-founder and CEO of McObject.

    RVZ

    Q1. What is an in-memory database system (IMDS)?

    Steve Graves: An in-memory database system (IMDS) is a database management system (DBMS) that uses main memory as its primary storage medium.
    A “pure” in-memory database system is one that requires no disk or file I/O, whatsoever.
    In contrast, a conventional DBMS is designed around the assumption that records will ultimately be written to persistent storage (usually hard disk or flash memory).
    Obviously, disk or flash I/O is expensive, in performance terms, and therefore retrieving data from RAM is faster than fetching it from disk or flash, so IMDSs are very fast.
    An IMDS also offers a more streamlined design. Because it is not built around the assumption of storage on hard disk or flash memory, the IMDS can eliminate the various DBMS sub-systems required for persistent storage, including cache management, file management and others. For this reason, an in-memory database is also faster than a conventional database that is either fully-cached or stored on a RAM-disk.

    In other areas (not related to persistent storage) an IMDS can offer the same features as a traditional DBMS. These include SQL and/or native language (C/C++, Java, C#, etc.) programming interfaces; formal data definition language (DDL) and database schemas; support for relational, object-oriented, network or combination data designs; transaction logging; database indexes; client/server or in-process system architectures; security features, etc. The list could go on and on. In-memory database systems are a sub-category of DBMSs, and should be able to do everything that entails.

Q2. What are the significant differences between an in-memory database and a database that happens to be in memory (e.g. deployed on a RAM-disk)?

    Steve Graves: We use the comparison to illustrate IMDSs’ contribution to performance beyond the obvious elimination of disk I/O. If IMDSs’ sole benefit stemmed from getting rid of physical I/O, then we could get the same performance by deploying a traditional DBMS entirely in memory – for example, using a RAM-disk in place of a hard drive.

    We tested an application performing the same tasks with three storage scenarios: using an on-disk DBMS with a hard drive; the same on-disk DBMS with a RAM-disk; and an IMDS (McObject’s eXtremeDB). Moving the on-disk database to a RAM drive resulted in nearly 4x improvement in database reads, and more than 3x improvement in writes. But the IMDS (using main memory for storage) outperformed the RAM-disk database by 4x for reads and 420x for writes.

    Clearly, factors other than eliminating disk I/O contribute to the IMDS’s performance – otherwise, the DBMS-on-RAM-disk would have matched it. The explanation is that even when using a RAM-disk, the traditional DBMS is still performing many persistent storage-related tasks.
    For example, it is still managing a database cache – even though the cache is now entirely redundant, because the data is already in RAM. And the DBMS on a RAM-disk is transferring data to and from various locations, such as a file system, the file system cache, the database cache and the client application, compared to an IMDS, which stores data in main memory and transfers it only to the application. These sources of processing overhead are hard-wired into on-disk DBMS design, and persist even when the DBMS uses a RAM-disk.

    An in-memory database system also uses the storage space (memory) more efficiently.
    A conventional DBMS can use extra storage space in a trade-off to minimize disk I/O (the assumption being that disk I/O is expensive, and storage space is abundant, so it’s a reasonable trade-off). Conversely, an IMDS needs to maximize storage efficiency because memory is not abundant in the way that disk space is. So a 10 gigabyte traditional database might only be 2 gigabytes when stored in an in-memory database.

    Q3. What is in your opinion the current status of the in-memory database technology market?

    Steve Graves: The best word for the IMDS market right now is “confusing.” “In-memory database” has become a hot buzzword, with seemingly every DBMS vendor now claiming to have one. Often these purported IMDSs are simply the providers’ existing disk-based DBMS products, which have been tweaked to keep all records in memory – and they more closely resemble a 100% cached database (or a DBMS that is using a RAM-disk for storage) than a true IMDS. The underlying design of these products has not changed, and they are still burdened with DBMS overhead such as caching, data transfer, etc. (McObject has published a white paper, Will the Real IMDS Please Stand Up?, about this proliferation of claims to IMDS status.)

    Only a handful of vendors offer IMDSs that are built from scratch as in-memory databases. If you consider these to comprise the in-memory database technology market, then the status of the market is mature. The products are stable, have existed for a decade or more and are deployed in a variety of real-time software applications, ranging from embedded systems to real-time enterprise systems.

    Q4. What are the application types that benefit the use of an in-memory database system?

    Steve Graves: Application types that benefit from an IMDS are those for which eliminating latency is a key design goal, and those that run on systems that simply have no persistent storage, like network routers and low-end set-top boxes. Sometimes these types overlap, as in the case of a network router that needs to be fast, and has no persistent storage. Embedded systems often fall into the latter category, in fields such as telco and networking gear, avionics, industrial control, consumer electronics, and medical technology. What we call the real-time enterprise sector is represented in the first category, encompassing uses such as analytics, capital markets (algorithmic trading, order matching engines, etc.), real-time cache for e-commerce and other Web-based systems, and more.

    Software that must run with minimal hardware resources (RAM and CPU) can also benefit.
    As discussed above, IMDSs eliminate sub-systems that are part-and-parcel of on-disk DBMS processing. This streamlined design results in a smaller database system code size and reduced demand for CPU cycles. When it comes to hardware, IMDSs can “do more with less.” This means that the manufacturer of, say, a set-top box that requires a database system for its electronic programming guide, may be able to use a less powerful CPU and/or less memory in each box when it opts for an IMDS instead of an on-disk DBMS. These manufacturing cost savings are particularly desirable in embedded systems products targeting the mass market.

    Q5. McObject offers an in-memory database system called eXtremeDB, and an open source embedded DBMS, called Perst. What is the difference between the two? Is there any synergy between the two products?

    Steve Graves: Perst is an object-oriented embedded database system.
It is open source and available in Java (including Java ME) and C# (.NET) editions. The design goal for Perst is to provide as nearly transparent persistence for Java and C# objects as is practically possible within the normal Java and .NET frameworks. In other words, no special tools, byte codes, or virtual machine are needed. Perst should provide persistence to Java and C# objects while changing the way a programmer uses those objects as little as possible.

    eXtremeDB is not an object-oriented database system, though it does have attributes that give it an object-oriented “flavor.” The design goals of eXtremeDB were to provide a full-featured, in-memory DBMS that could be used right across the computing spectrum: from resource-constrained embedded systems to high-end servers used in systems that strive to squeeze out every possible microsecond of latency. McObject’s eXtremeDB in-memory database system product family has features including support for multiple APIs (SQL ODBC/JDBC & native C/C++, Java and C#), varied database indexes (hash, B-tree, R-tree, KD-tree, and Patricia Trie), ACID transactions, multi-user concurrency (via both locking and “optimistic” transaction managers), and more. The core technology is embodied in the eXtremeDB IMDS edition. The product family includes specialized editions, built on this core IMDS, with capabilities including clustering, high availability, transaction logging, hybrid (in-memory and on-disk) storage, 64-bit support, and even kernel mode deployment. eXtremeDB is not open source, although McObject does license the source code.

    The two products do not overlap. There is no shared code, and there is no mechanism for them to share or exchange data. Perst for Java is written in Java, Perst for .NET is written in C#, and eXtremeDB is written in C, with optional APIs for Java and .NET. Perst is a candidate for Java and .NET developers that want an object-oriented embedded database system, have no need for the more advanced features of eXtremeDB, do not need to access their database from C/C++ or from multiple programming languages (a Perst database is compatible with Java or C#), and/or prefer the open source model. Perst has been popular for smartphone apps, thanks to its small footprint and smart engineering that enables Perst to run on mobile platforms such as Windows Phone 7 and Java ME.
    eXtremeDB will be a candidate when eliminating latency is a key concern (Perst is quite fast, but not positioned for real-time applications), when the target system doesn’t have a JVM (or sufficient resources for one), when the system needs to support multiple programming languages, and/or when any of eXtremeDB’s advanced features are required.

    Q6. What are the current main technological developments for in-memory database systems?

    Steve Graves: At McObject, we’re excited about the potential of IMDS technology to scale horizontally, across multiple hardware nodes, to deliver greater scalability and fault-tolerance while enabling more cost-effective system expansion through the use of low-cost (i.e. “commodity”) servers. This enthusiasm is embodied in our new eXtremeDB Cluster edition, which manages data stores across distributed nodes. Among eXtremeDB Cluster’s advantages is that it eliminates any performance ceiling from being CPU-bound on a single server.

Scaling across multiple hardware nodes is receiving a lot of attention these days with the emergence of NoSQL solutions. But database system clustering actually has much deeper roots. One of the application areas where it is used most widely is in telecommunications and networking infrastructure, where eXtremeDB has always been a strong player. And many emerging application categories – ranging from software-as-a-service (SaaS) platforms to e-commerce and social networking applications – can benefit from a technology that marries IMDSs’ performance and “real” DBMS features with a distributed system model.

    Q7. What are the similarities and differences between current various database clustering solutions? In particular, let’s look at dimensions such as scalability, ACID vs. CAP, intended/applicable problem domains, structured vs. unstructured, and complexity of implementation.

    Steve Graves: ACID support vs. “eventual consistency” is a good place to start looking at the differences between clustering database solutions (including some cluster-like NoSQL products). ACID-compliant transactions will be Atomic, Consistent, Isolated and Durable; consistency implies the transaction will bring the database from one valid state to another and that every process will have a consistent view of the database. ACID-compliance enables an on-line bookstore to ensure that a purchase transaction updates the Customers, Orders and Inventory tables of its DBMS. All other things being equal, this is desirable: updating Customers and Orders while failing to change Inventory could potentially result in other orders being taken for items that are no longer available.
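
For example, here is a minimal JDBC sketch of such a purchase transaction (table and column names are illustrative): the three changes either become visible together at commit or are all rolled back.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class BookstoreOrder {
        public static void placeOrder(String url, long customerId, long bookId) throws SQLException {
            try (Connection con = DriverManager.getConnection(url)) {
                con.setAutoCommit(false);                        // start one atomic unit of work
                try {
                    try (PreparedStatement ps = con.prepareStatement(
                            "UPDATE customers SET order_count = order_count + 1 WHERE id = ?")) {
                        ps.setLong(1, customerId);
                        ps.executeUpdate();
                    }
                    try (PreparedStatement ps = con.prepareStatement(
                            "INSERT INTO orders (customer_id, book_id) VALUES (?, ?)")) {
                        ps.setLong(1, customerId);
                        ps.setLong(2, bookId);
                        ps.executeUpdate();
                    }
                    try (PreparedStatement ps = con.prepareStatement(
                            "UPDATE inventory SET stock = stock - 1 WHERE book_id = ?")) {
                        ps.setLong(1, bookId);
                        ps.executeUpdate();
                    }
                    con.commit();                                // all three changes become visible together
                } catch (SQLException e) {
                    con.rollback();                              // none of the changes survive a failure
                    throw e;
                }
            }
        }
    }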

    However, enforcing the ACID properties becomes more of a challenge with distributed solutions, such as database clusters, because the node initiating a transaction has to wait for acknowledgement from the other nodes that the transaction can be successfully committed (i.e. there are no conflicts with concurrent transactions on other nodes). To speed up transactions, some solutions have relaxed their enforcement of these rules in favor of an “eventual consistency” that allows portions of the database (typically on different nodes) to become temporarily out of sync (inconsistent).
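
    To make “wait for acknowledgement from the other nodes” concrete, here is a minimal sketch of the voting step a coordinator might run before declaring a distributed transaction committed. The Node interface and method names are invented; real protocols such as two-phase commit also log state and handle node failures and timeouts.

```java
import java.util.List;

/** Toy synchronous commit: every node must vote yes before anyone commits. */
public class CommitCoordinator {

    /** What a participating node must be able to do; this interface is hypothetical. */
    public interface Node {
        boolean prepare(String txnId);   // "can you commit this transaction?"
        void commit(String txnId);
        void abort(String txnId);
    }

    private final List<Node> nodes;

    public CommitCoordinator(List<Node> nodes) {
        this.nodes = List.copyOf(nodes);
    }

    /** Returns true only if all nodes acknowledged; otherwise the transaction is aborted everywhere. */
    public boolean commit(String txnId) {
        for (Node node : nodes) {
            if (!node.prepare(txnId)) {          // a single "no" (e.g. a conflict) aborts the whole transaction
                nodes.forEach(n -> n.abort(txnId));
                return false;
            }
        }
        nodes.forEach(n -> n.commit(txnId));     // every node acknowledged, so the commit is global
        return true;
    }
}
```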

    Systems embracing eventual consistency scale horizontally better than ACID solutions; it boils down to their asynchronous rather than synchronous nature.

    Eventual consistency is, obviously, a weaker consistency model, and implies some process for resolving consistency problems that will arise when multiple asynchronous transactions give rise to conflicts. Resolving such conflicts increases complexity.
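
    As one illustration of that extra complexity, the toy replica below applies a last-write-wins policy on per-key timestamps when it merges local writes with updates that arrive asynchronously from other replicas. This is only a sketch of one possible reconciliation rule; production systems frequently use vector clocks or application-defined merge functions instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy replica illustrating last-write-wins reconciliation; not a production design. */
public class LwwReplica {

    /** A value tagged with the (logical or wall-clock) timestamp of the write that produced it. */
    public record Versioned(String value, long timestamp) {}

    private final Map<String, Versioned> store = new ConcurrentHashMap<>();

    /** Local write: record the value with its timestamp. */
    public void put(String key, String value, long timestamp) {
        merge(key, new Versioned(value, timestamp));
    }

    /** Anti-entropy: apply an update received asynchronously from another replica. */
    public void applyRemote(String key, Versioned remote) {
        merge(key, remote);
    }

    private void merge(String key, Versioned incoming) {
        // Keep whichever version carries the newer timestamp; older concurrent writes are silently dropped.
        store.merge(key, incoming,
            (current, candidate) ->
                candidate.timestamp() > current.timestamp() ? candidate : current);
    }

    public String get(String key) {
        Versioned v = store.get(key);
        return v == null ? null : v.value();
    }
}
```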

    Another area where clustering solutions differ is along the lines of shared-nothing vs. shared-everything approaches. In a shared-nothing cluster, each node has its own set of data. In a shared-everything cluster, each node works on a common copy of database tables and rows, usually stored in a fast storage area network (SAN). Shared-nothing architecture is naturally more complex: if the data in such a system is partitioned (each node has only a subset of the data) and a query requests data that “lives” on another node, there must be code to locate and fetch it. If the data is not partitioned (each node has its own copy) then there must be code to replicate changes to all nodes when any node commits a transaction that modifies data.
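
    The “code to locate and fetch it” can start out as something as simple as deterministic key-to-node routing. The sketch below (the class and node addresses are invented for illustration) shows the hash-partitioning idea behind a shared-nothing cluster: every key has exactly one owning node, and a query for a non-local key must be forwarded there.

```java
import java.util.List;

/** Minimal hash-based router for a shared-nothing cluster; node lookup only, no replication. */
public class PartitionRouter {

    private final List<String> nodes;   // e.g. ["node-a:7000", "node-b:7000", "node-c:7000"]

    public PartitionRouter(List<String> nodes) {
        this.nodes = List.copyOf(nodes);
    }

    /** Every key deterministically maps to exactly one owning node. */
    public String ownerOf(String key) {
        int bucket = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(bucket);
    }

    /** A query for a key that lives elsewhere must be forwarded to its owner. */
    public boolean isLocal(String key, String thisNode) {
        return ownerOf(key).equals(thisNode);
    }
}
```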

    NoSQL solutions emerged in the past several years to address challenges that occur when scaling the traditional RDBMS. To achieve scale, these solutions generally embrace eventual consistency (thus validating the CAP Theorem, which holds that a distributed system cannot simultaneously guarantee all three of Consistency, Availability and Partition tolerance). And this choice defines the intended/applicable problem domains: it rules out systems that require strict consistency. However, many systems don’t have this strict consistency requirement – an on-line retailer such as the bookstore mentioned above may accept the occasional order for a non-existent inventory item as a small price to pay for being able to meet its scalability goals. Conversely, transaction processing systems typically demand absolute consistency.

    NoSQL is often described as a better choice for so-called unstructured data. Whereas RDBMSs have a data definition language that describes a database schema, which is recorded in a database dictionary, NoSQL databases are often schema-less, storing opaque “documents” that are keyed by one or more attributes for subsequent retrieval. Proponents argue that schema-less solutions free us from the rigidity imposed by the relational model and make it easier to adapt to real-world changes. Opponents argue that schema-less systems cater to lazy programmers, create a maintenance nightmare, and offer no equivalent of the relational calculus or the ANSI SQL standard. But the whole structured-vs-unstructured discussion is tangential to database cluster solutions.
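
    To illustrate the contrast, here is a toy, in-memory version of a schema-less store: documents are opaque strings keyed by an id plus one secondary attribute, and the store never inspects or validates their contents. The class and its index choice are invented for illustration and do not correspond to any particular NoSQL product.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

/** Toy schema-less store: opaque JSON-like documents keyed by id, plus one secondary index. */
public class DocumentStore {

    private final Map<String, String> docsById = new ConcurrentHashMap<>();
    private final Map<String, Set<String>> idsByTag = new ConcurrentHashMap<>();

    /** The store never parses or validates the document body; there is no schema. */
    public void put(String id, String jsonDocument, String tag) {
        docsById.put(id, jsonDocument);
        idsByTag.computeIfAbsent(tag, t -> new CopyOnWriteArraySet<>()).add(id);
    }

    public String getById(String id) {
        return docsById.get(id);
    }

    /** Retrieval by the secondary attribute the document was keyed with. */
    public Set<String> idsWithTag(String tag) {
        return idsByTag.getOrDefault(tag, Set.of());
    }
}
```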

    Q8. Are in-memory database systems an alternative to classical disk-based relational database systems?

    Steve Graves: In-memory database systems are an ideal alternative to disk-based DBMSs when performance and efficiency are priorities. However, this explanation is a bit fuzzy, because what programmer would not claim speed and efficiency as goals? To nail down the answer, it’s useful to ask, “When is an IMDS not an alternative to a disk-based database system?”

    Volatility is often pointed to as a weak point of IMDSs. If someone pulls the plug on a system, all the data in memory can be lost. In some cases, this is not a terrible outcome. For example, if a set-top box programming guide database goes down, it will be re-provisioned from the satellite transponder or cable head-end. In cases where volatility is more of a problem, IMDSs can mitigate the risk. For example, an IMDS can incorporate transaction logging to provide recoverability. In fact, transaction logging is unavoidable with some products, such as Oracle’s TimesTen (it is optional in eXtremeDB). Database clustering and other distributed approaches (such as master/slave replication) contribute to database durability, as does the use of non-volatile RAM (NVRAM, or battery-backed RAM) as storage instead of standard DRAM. Hybrid IMDS technology enables the developer to specify persistent storage for selected record types (presumably those for which the “pain” of loss is highest) while all other records are managed in memory.
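
    For readers unfamiliar with how transaction logging restores recoverability to an in-memory store, the sketch below shows the basic write-ahead idea: every change is appended and synced to persistent media before it is applied to the in-memory image, so that image can be rebuilt after a failure by replaying the log. The class and its file format are invented for illustration and are unrelated to the actual logging implementations in eXtremeDB or TimesTen.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

/** Toy write-ahead log for an in-memory key/value store; illustrative only. */
public class LoggedStore {

    private final Map<String, String> memory = new HashMap<>();
    private final Path logFile;

    public LoggedStore(Path logFile) throws IOException {
        this.logFile = logFile;
        recover();
    }

    public synchronized void put(String key, String value) throws IOException {
        // 1. Append the change to the log and force it to stable storage (the durability step).
        Files.writeString(logFile, key + "\t" + value + System.lineSeparator(),
                StandardCharsets.UTF_8, StandardOpenOption.CREATE, StandardOpenOption.APPEND,
                StandardOpenOption.SYNC);
        // 2. Only then apply it to the in-memory image.
        memory.put(key, value);
    }

    public synchronized String get(String key) {
        return memory.get(key);
    }

    /** Rebuild the in-memory image after a restart by replaying the log. */
    private void recover() throws IOException {
        if (!Files.exists(logFile)) return;
        for (String line : Files.readAllLines(logFile, StandardCharsets.UTF_8)) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) memory.put(parts[0], parts[1]);
        }
    }
}
```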

    However, all of these strategies require some effort to plan and implement. The easiest way to reduce volatility is to use a database system that implements persistent storage for all records by default – and that’s a traditional DBMS. So, the IMDS use-case occurs when the need to eliminate latency outweighs the risk of data loss or the cost of the effort to mitigate volatility.

    It is also the case that flash and, especially, spinning disks are much less expensive than DRAM, which puts an economic lid on very large in-memory databases for all but the richest users. And, riches notwithstanding, it is not yet possible to build a system with hundreds of terabytes, let alone petabytes or exabytes, of memory, whereas disk-based storage has no such limitation.

    By continuing to use traditional databases for most applications, developers and end-users are signaling that DBMSs’ built-in persistence is worth its cost in latency. But the growing role of IMDSs in real-time technology ranging from financial trading to e-commerce, avionics, telecom/Netcom, analytics, industrial control and more shows that the need for speed and efficiency often outweighs the convenience of a traditional DBMS.

    ———–
    Steve Graves is co-founder and CEO of McObject, a company specializing in embedded Database Management System (DBMS) software. Prior to McObject, Steve was president and chairman of Centura Solutions Corporation and vice president of worldwide consulting for Centura Software Corporation.

    Related Posts

    A super-set of MySQL for Big Data. Interview with John Busch, Schooner.

    Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB.

    On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software.

    vFabric SQLFire: Better than RDBMS and NoSQL?

    Related Resources

    ODBMS.ORG: Free Downloads and Links:
    Object Databases
    NoSQL Data Stores
    Graphs and Data Stores
    Cloud Data Stores
    Object-Oriented Programming
    Entity Framework (EF) Resources
    ORM Technology
    Object-Relational Impedance Mismatch
    Databases in general
    Big Data and Analytical Data Platforms

    #

    Use Cases and Database Modeling — An interview with Michael Blaha. http://www.odbms.org/blog/2012/01/use-cases-and-database-modeling-an-interview-with-michael-blaha/ http://www.odbms.org/blog/2012/01/use-cases-and-database-modeling-an-interview-with-michael-blaha/#comments Mon, 09 Jan 2012 08:58:13 +0000 http://www.odbms.org/blog/?p=1307 “Use cases are rote work. The developer listens to business experts and slavishly writes what they hear. There is little interpretation and no abstraction. There is little reconciliation of conflicting use cases. For a database project, the conceptual data model is a much more important software engineering contribution than use cases.” — Dr. Michael Blaha.

    First of all let me wish you a Happy, Healthy and Successful 2012!

    I am coming back to discuss with our expert Dr. Michael Blaha, the topic of Database Modeling. In a previous interview we looked at the issue of “How good is UML for Database Design”?

    Now we look at use cases and discuss how good they are for database modeling. I hope you’ll find the interview interesting. I encourage the community to post comments.

    RVZ

    Q1. How are requirements taken into account when performing database modeling in daily practice? What are the common problems and pitfalls?

    Michael Blaha: Software development approaches vary widely. I’ve seen organizations use the following techniques for capturing requirements (listed in random order).

    — Preparation of use cases.
    — Preparation of requirements documents.
    — Representation and explanation via a conceptual data model.
    — Representation and explanation via prototyping.
    — Haphazard approach. Just start writing code.

    General issues include:
    — the amount of time required to capture requirements;
    — missing requirements (requirements that are never mentioned);
    — forgotten requirements (requirements that are mentioned but then forgotten);
    — bogus requirements (requirements that are not germane to the business needs or that needlessly reach into design);
    — incomplete understanding (requirements that are contradictory or misunderstood).

    Q2. What is a use case?

    Michael Blaha: A use case is a piece of functionality that a system provides to its users. A use case describes how a system interacts with outside actors.

    Q3. What are the advantages of use cases?

    Michael Blaha:
    — Use cases lead to written documentation of requirements.
    — They are intuitive to business specialists.
    — Use cases are easy for developers to understand.
    — They enable aspects of system functionality to be enumerated and managed.
    — They include error cases.
    — They let consulting shops bill many hours for low-skilled personnel.
    (This is a cynical view, but I believe this is a major reason for some of the current practice.)

    Q4. What are the disadvantages of use cases?

    Michael Blaha:
    — They are very time-consuming. It takes much time to write them down. It takes much time to interview business experts (time that is often unavailable).

    — Use cases are just one aspect of requirements. Other aspects should also be considered, such as existing documentation and artifacts from related software. Many developers obsess on use cases and forget to look for other requirement sources.

    — Use cases are rote work. The developer listens to business experts and slavishly writes what they hear. There is little interpretation and no abstraction. There is little reconciliation of conflicting use cases.

    — I have yet to see benefit from use case diagramming. I have yet to see significant benefit from use case structuring.

    — In my opinion, use cases have been overhyped by marketeers.

    Q5. How are use cases typically used in practice for database projects?

    Michael Blaha: To capture requirements. It is OK to capture detailed requirements with use cases, but they should be subservient to the class model. The class model defines the domain of discourse that use cases can then reference.

    For database applications it is much inferior to start with use cases and afterwards construct a class model. Database applications, in particular, need a data approach and not a process approach.

    It is ironic that use cases have arisen from the object-oriented community. Note that OO programming languages define a class structure to which logic is attached. So it is odd that use cases put process first and defer attention to data structure.
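
    A one-class Java sketch (the class and fields are invented) of the data-first style Blaha describes: the structure is defined first and the behavior is attached to it, which is the opposite of starting from a process-oriented use case.

```java
/** Data structure first, logic attached to it; the opposite of a process-first use case. */
public class InventoryItem {

    private final String sku;   // the structure...
    private int onHand;

    public InventoryItem(String sku, int onHand) {
        this.sku = sku;
        this.onHand = onHand;
    }

    // ...and the behavior hangs off the structure it operates on.
    public boolean reserve(int quantity) {
        if (quantity <= 0 || quantity > onHand) return false;
        onHand -= quantity;
        return true;
    }

    public String sku()  { return sku; }
    public int onHand()  { return onHand; }
}
```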

    Q6. A possible alternative approach to data modeling is to write use cases first, then identify the subsystems and components, and finally identify the database schema. Do you agree with this?

    Michael Blaha: This is a popular approach. No, I do not agree with it. I strongly disagree.
    For a database project, the conceptual data model is a much more important software engineering contribution than use cases.

    Only when the conceptual model is well understood can use cases be fully understood and reconciled. Only then can developers integrate use cases and abstract their content into a form suitable for building a quality software product.

    Q7. Many requirements and the design to satisfy those requirements are normally done with programming, not just schema. Do you agree with this? How do you handle this with use cases?

    Michael Blaha: Databases provide a powerful language, but most do not provide a complete language.

    The SQL language of relational databases is far from complete and some other language must be used to express full functionality.

    OO databases are better in this regard. Since OO databases integrate a programming language with a persistence mechanism, they inherently offer a full language for expressing functionality.

    Use cases target functionality and functionality alone. Use cases, by their nature, do not pay attention to data structure.

    Q8. Do you need to use UML for use cases?

    Michael Blaha: No. The idea of use cases is valuable if used properly (in conjunction with data and normally subservient to data).

    In my opinion, UML use case diagrams are a waste of time. They don’t add clarity. They add bulk and consume time.

    Q9. Are there any suitable tools around to help the process of creating use cases for database design? If yes, how good are they?

    Michael Blaha: Well, it’s clear by now that I don’t think much of use case diagrams. I think a textual approach is OK and there are probably requirement tools to manage such text, but I am unfamiliar with the product space.

    Q10. Use case methods of design are usually applied to object-oriented models. Do you use use cases when working with an object database?

    Michael Blaha: I would argue not. Most object-oriented languages put data first. First develop the data structure and then attach methods to the structure. Use cases are the opposite of this. They put functionality first.

    Q11. Can you use use cases as a design method for relational databases, NoSQL databases, graph databases as well? And if yes how?

    Michael Blaha: Not reasonably. I guess developers can force fit any technique and try to claim success.

    To be realistic, traditional database developers (relational databases) are already resistant (for cultural reasons) to object-oriented jargon/style and the UML. When I show them that the UML class model is really just an ER model and fits in nicely with database conceptualization, they acknowledge my point, but it is still a foreign culture.

    I don’t see how use cases have much to offer for NoSQL and graph databases.

    Q12. So if you don’t have use cases, how do you address functionality when building database applications?

    Michael Blaha: I strongly favor the technique of interactive conceptual data modeling. I get the various business and technical constituencies in the same room and construct a UML class model live in front of them as we explore their business needs and scope. Of course, the business people articulate their needs in terms of use cases. But the use cases are grounded by the evolving UML class model defining the domain of discourse. Normally I have my hands full with managing the meeting and constructing a class model in front of them. I don’t have time to explicitly capture the use cases (though I am appreciative if someone else volunteers for that task).

    However, I fully consider the use cases by playing them against the evolving model. Of course, as I consider use cases relative to the class model, I am reconciling them. I am also considering abstraction as I construct the class model, which in turn causes the business experts to do more abstraction in formulating their use case business requirements.

    I have built class models this way many times before and it works great. Some developers are shocked at how well it can work.

    ————————————————–
    Michael Blaha is a partner at Modelsoft Consulting Corporation.
    Dr. Blaha is recognized as one of the world’s leading authorities on databases and data modeling. He has more than 25 years of experience as a consultant and trainer in conceiving, architecting, modeling, designing, and tuning databases for dozens of major organizations around the world. He has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New York.

    Related Posts

    – How good is UML for Database Design? Interview with Michael Blaha.

    – Agile data modeling and databases.

    – Why Patterns of Data Modeling?

    Related Resources

    – ODBMS.org: Databases in General: Blog Posts | Free Software | Articles and Presentations | Lecture Notes | Journals

    ##
