ODBMS Industry Watch: Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation.

Big Data: Three questions to McObject.
http://www.odbms.org/blog/2014/02/big-data-three-questions-to-mcobject/
Fri, 14 Feb 2014

“In a nutshell, pipelining is a programming technique that combines functions from the database system’s library of vector-based functions into an assembly line of processing for market data, with the output of one function becoming input for the next.”–Steven T. Graves.

The fourth interview in the “Big Data: three questions to…” series is with Steven T. Graves, President and CEO of McObject.

RVZ

Q1. What is your current product offering?

Steven T. Graves: McObject has two product lines. One is the eXtremeDB product family. eXtremeDB is a real-time embedded database system built on a core in-memory database system (IMDS) architecture, with the eXtremeDB IMDS edition representing the “standard” product. Other eXtremeDB editions offer special features and capabilities such as an optional SQL API, high availability, clustering, 64-bit support, optional and selective persistent storage, transaction logging and more.

In addition, our eXtremeDB Financial Edition database system targets real-time capital markets systems such as algorithmic trading and risk management (and has its own Web site). eXtremeDB Financial Edition comprises a super-set of the individual eXtremeDB editions (bundling together all specialized libraries such as clustering, 64-bit support, etc.) and offers features including columnar data handling and vector-based statistical processing for managing market data (or any other type of time series data).

Features shared across the eXtremeDB product family include: ACID-compliant transactions; multiple application programming interfaces (a native and type-safe C/C++ API; SQL/ODBC/JDBC; native Java, C# and Python interfaces); multi-user concurrency with an optional multi-version concurrency control (MVCC) transaction manager; event notifications; cache prioritization; and support for multiple database indexes (b-tree, r-tree, kd-tree, hash, Patricia trie, etc.). eXtremeDB’s footprint is small, with a code size of approximately 150K. eXtremeDB is available for a wide range of server, real-time operating system (RTOS) and desktop operating systems, and McObject provides eXtremeDB source code for porting.

McObject’s second product offering is the Perst open source, object-oriented embedded database system, available in all-Java and all-C# (.NET) versions. Perst is small (code size typically less than 500K) and very fast, with features including ACID-compliant transactions; specialized collection classes (such as a classic b-tree implementation; r-tree indexes for spatial data; database containers optimized for memory-only access, etc.); garbage collection; full-text search; schema evolution; a “wrapper” that provides a SQL-like interface (SubSQL); XML import/export; database replication, and more.

Perst also operates in specialized environments. Perst for .NET includes support for .NET Compact Framework, Windows Phone 8 (WP8) and Silverlight (check out our browser-based Silverlight CRM demo, which showcases Perst’s support for storage on users’ local file systems). The Java edition supports the Android smartphone platform, and includes the Perst Lite embedded database for Java ME.

Q2. Who are your current customers and how do they typically use your products?

Steven T. Graves: eXtremeDB initially targeted real-time embedded systems, often residing in non-PC devices such as set-top boxes, telecom switches or industrial controllers.
There are literally millions of eXtremeDB-based devices deployed by our customers; a few examples are set-top boxes from DIRECTV (eXtremeDB is the basis of an electronic programming guide); F5 Networks’ BIG-IP network infrastructure (eXtremeDB is built into the devices’ proprietary embedded operating system); and BAE Systems (avionics in the Panavia Tornado GR4 combat jet). A recent new customer in telecom/networking is Compass-EOS, which has released the first photonics-based core IP router, using eXtremeDB High Availability to manage the device’s control plane database.

The addition of “enterprise-friendly” features (support for SQL, Java, 64-bit, MVCC, etc.) drove eXtremeDB’s adoption for non-embedded systems that demand fast performance. Examples include software-as-a-service provider hetras GmbH (eXtremeDB handles the most performance-intensive queries in its Cloud-based hotel management system); Transaction Network Services (eXtremeDB is used in a highly scalable system for real-time phone number lookups/routing); and MeetMe.com (formerly MyYearbook.com – eXtremeDB manages data in social networking applications).

In the financial industry, eXtremeDB is used by a variety of trading organizations and technology providers. Examples include the broker-dealer TradeStation (McObject’s database technology is part of its next-generation order execution system); Financial Technologies of India, Ltd. (FTIL), which has deployed eXtremeDB in the order-matching application used across its network of financial exchanges in Asia and the Middle East; and NSE.IT (eXtremeDB supports risk management in algorithmic trading).

Users of Perst are many and varied, too. You can find Perst in many commercial software applications such as enterprise application management solutions from the Wily Division of CA. Perst has also been adopted for community-based open source projects, including the Frost client for the Freenet global peer-to-peer network. Some of the most interesting Perst-based applications are mobile. For example, 7City Learning, which provides training for financial professionals, gives students an Android tablet with study materials that are accessed using Perst. Several other McObject customers use Perst in mobile medical apps.

Q3. What are the main new technical features you are currently working on and why?

Steven T. Graves: One feature we’re very excited about is the ability to pipeline vector-based statistical functions in eXtremeDB Financial Edition – we’ve even released a short video and a 10-page white paper describing this capability. In a nutshell, pipelining is a programming technique that combines functions from the database system’s library of vector-based functions into an assembly line of processing for market data, with the output of one function becoming input for the next.

This may not sound unusual, since almost any algorithm or program can be viewed as a chain of operations acting on data.
But this pipelining has a unique purpose and a powerful result: it keeps market data inside CPU cache as the data is being worked.
Without pipelining, the results of each function would typically be materialized outside cache, in temporary tables residing in main memory. Handing interim results back and forth “across the transom” between CPU cache and main memory imposes significant latency, which is eliminated by pipelining. We’ve been improving this capability by adding new statistical functions to the library. (For an explanation of pipelining that’s more in-depth than the video but shorter than the white paper, check out this article on the financial technology site Low-Latency.com.)
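For illustration only, here is a minimal Java sketch of the general idea described above. It is not eXtremeDB’s API, and the moving-average and scaling stages are invented for the example: a chain of vector functions is applied to one cache-sized chunk at a time, so intermediate results never round-trip through full-length temporary arrays in main memory.

```java
import java.util.Arrays;

/** Minimal sketch of pipelining vector functions over cache-sized chunks.
 *  Hypothetical code for illustration; it is not eXtremeDB's API. */
public class PipelineSketch {

    /** Stage 1: 5-point trailing moving average (chunk-boundary carry-over omitted for brevity). */
    static void movingAverage5(double[] in, double[] out, int len) {
        for (int i = 0; i < len; i++) {
            double sum = 0;
            int n = 0;
            for (int j = Math.max(0, i - 4); j <= i; j++) { sum += in[j]; n++; }
            out[i] = sum / n;
        }
    }

    /** Stage 2: element-wise scaling. */
    static void scale(double[] in, double[] out, int len, double factor) {
        for (int i = 0; i < len; i++) out[i] = in[i] * factor;
    }

    /** Run the two stages as an assembly line over chunks that fit in CPU cache,
     *  instead of materializing a full-length temporary array between stages. */
    static double[] pipelined(double[] series, int chunkSize) {
        double[] result = new double[series.length];
        double[] a = new double[chunkSize];
        double[] b = new double[chunkSize];
        for (int off = 0; off < series.length; off += chunkSize) {
            int len = Math.min(chunkSize, series.length - off);
            System.arraycopy(series, off, a, 0, len);
            movingAverage5(a, b, len);   // stage 1: output stays in cache
            scale(b, a, len, 100.0);     // stage 2: consumes stage 1's output immediately
            System.arraycopy(a, 0, result, off, len);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] prices = { 10, 10.5, 10.2, 10.8, 11.0, 10.9, 11.3, 11.1 };
        System.out.println(Arrays.toString(pipelined(prices, 4)));
    }
}
```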

We are also adding to the capabilities of eXtremeDB Cluster edition to make clustering faster and more flexible, and further simplify cluster administration. Improvements include a local tables option, in which database tables can be made exempt from replication, but shareable through a scatter/gather mechanism. Dynamic clustering, added in our recent v. 5.0 upgrade, enables nodes to join and leave clusters without interrupting processing. This further simplifies administration for a clustering database technology that counts minimal run-time maintenance as a key benefit. On selected platforms, clustering now supports the Infiniband switched fabric interconnect and Message Passing Interface (MPI) standard. In our tests, these high performance networking options accelerated performance more than 7.5x compared to “plain vanilla” gigabit networking (TCP/IP and Ethernet).

Related Posts

Big Data: Three questions to VoltDB.
ODBMS Industry Watch, February 6, 2014

Big Data: Three questions to Pivotal.
ODBMS Industry Watch, January 20, 2014.

Big Data: Three questions to InterSystems.
ODBMS Industry Watch, January 13, 2014.

Cloud based hotel management– Interview with Keith Gruen.
ODBMS Industry Watch, July 25, 2013

In-memory database systems. Interview with Steve Graves, McObject.
ODBMS Industry Watch, March 16, 2012

Resources

ODBMS.org: Free resources on Big Data, Analytics, Cloud Data Stores, Graph Databases, NewSQL, NoSQL, Object Databases.

Follow ODBMS.org on Twitter: @odbmsorg

##

Big Data: Three questions to InterSystems.
http://www.odbms.org/blog/2014/01/big-data-three-questions-to-intersystems/
Mon, 13 Jan 2014

“The absence of a schema has some flexibility advantages, although for querying the data, the absence of a schema presents some challenges to people accustomed to a classic RDBMS.” –Iran Hutchinson.

I start this new year with a new series of short interviews with leading vendors of Big Data technologies. I call the series “Big Data: three questions to…”. The first such interview is with Iran Hutchinson, Big Data Specialist at InterSystems.

    RVZ

    Q1. What is your current “Big Data” products offering?

    Iran Hutchinson: InterSystems has actually been in the Big Data business for some time, since 1978, long before anyone called it that. We currently offer an integrated database, integration and analytics platform based on InterSystems Caché®, our flagship product, to enable Big Data breakthroughs in a variety of industries.

    Launched in 1997, Caché is an advanced object database that provides in-memory speed with persistence, and the ability to ingest huge volumes of transactional data at insanely high velocity. It is massively scalable, because of its very lean design. Its efficient multidimensional data structures require less disk space and provide faster SQL performance than relational databases. Caché also provides sophisticated analytics, enabling real-time queries against transactional data with minimal maintenance and hardware requirements.

InterSystems Ensemble® is our seamless platform for integrating and developing connected applications. Ensemble can be used as a central processing hub or even as a backbone for nationwide networks. By integrating this connectivity with our high-performance Caché database, as well as with new technologies for analytics, high availability, security, and mobile solutions, we can deliver a rock-solid and unified Big Data platform, not a patchwork of disparate solutions.

    We also offer additional technologies built on our integrated platform, such as InterSystems HealthShare®, a health informatics platform that enables strategic interoperability and analytics for action. Our TrakCare unified health information system is likewise built upon this same integrated framework.

    Q2. Who are your current customers and how do they typically use your products?

    Iran Hutchinson: We continually update our technology to enable customers to better manage, ingest and analyze Big Data. Our clients are in healthcare, financial services, aerospace, utilities – industries that have extremely demanding requirements for performance and speed. For example, Caché is the world’s most widely used database in healthcare. Entire countries, such as Sweden and Scotland, run their national health systems on Caché, as well as top hospitals and health systems around the world. One client alone runs 15 percent of the world’s equity trades through InterSystems software, and all of the top 10 banks use our products.

    It is also being used by the European Space Agency to map a billion stars – the largest data processing task in astronomy to date. (See The Gaia Mission One Year Later.)

    Our configurable ACID (Atomicity Consistency Isolation Durability) capabilities and ECP-based approach enable us to handle these kinds of very large-scale, very high-performance, transactional Big Data applications.

    Q3. What are the main new technical features you are currently working on and why?

    Iran Hutchinson: There are several new paradigms we are focusing on, but let’s focus on analytics. Once you absorb all that Big Data, you want to run analytics. And that’s where the three V’s of Big Data – volume, velocity and variety – are critically important.

    Let’s talk about the variety of data. Most popular Big Data analytics solutions start with the assumption of structured data – rows and columns – when the most interesting data is unstructured, or text-based data. A lot of our competitors still struggle with unstructured data, but we solved this problem with Caché in 1997, and we keep getting better at it. InterSystems Caché offers both vertical and horizontal scaling, enabling schema-less and schema-based (SQL) querying options for both structured and unstructured data.
    As a result, our clients today are running analytics on all their data – and we mean real-time, operational data, not the data that is aggregated a week later or a month later for boardroom presentations.

    A lot of development has been done in the area of schema-less data stores or so-called document stores, which are mainly key-value stores. The absence of a schema has some flexibility advantages, although for querying the data, the absence of a schema presents some challenges to people accustomed to a classic RDBMS. Some companies now offer SQL querying on schema-less data stores as an add-on or plugin. InterSystems Caché provides a high-performance key-value store with native SQL support.
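As a generic illustration of why querying schema-less data can be awkward (this sketch is not Caché’s API or any particular product’s), consider a toy key-value store in Java: every “query” has to scan free-form documents and cope with fields that may simply be absent, which is exactly what a SQL layer over such a store has to hide.

```java
import java.util.*;
import java.util.stream.*;

/** Illustrative sketch of the schema-less querying problem (not Caché's API):
 *  documents are free-form maps, so a "query" is a scan with field checks. */
public class KeyValueQuerySketch {
    // A key-value store: document id -> arbitrary field map (no fixed schema).
    static final Map<String, Map<String, Object>> store = new HashMap<>();

    public static void main(String[] args) {
        store.put("p1", Map.<String, Object>of("name", "Alice", "city", "Boston"));
        store.put("p2", Map.<String, Object>of("name", "Bob"));   // no "city" field at all

        // Equivalent of: SELECT id FROM people WHERE city = 'Boston'
        // Without a schema, every document must be inspected and missing fields handled.
        List<String> hits = store.entrySet().stream()
                .filter(e -> "Boston".equals(e.getValue().get("city")))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());

        System.out.println(hits);   // [p1]
    }
}
```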

    The commonly available SQL-based solutions also require a predefinition of what the user is interested in. But if you don’t know the data, how do you know what’s interesting? Embedded within Caché is a unique and powerful text analysis technology, called iKnow, that analyzes unstructured data out of the box, without requiring any predefinition through ontologies or dictionaries. Whether it’s English, German, or French, iKnow can automatically identify concepts and understand their significance – and do that in real-time, at transaction speeds.

    iKnow enables not only lightning-fast analysis of unstructured data, but also equally efficient Google-like keyword searching via SQL with a technology called iFind.
    And because we married that iKnow technology with another real-time OLAP-type technology we call DeepSee, we make it possible to embed this analytic capability into your applications. You can extract complex concepts and build cubes on both structured AND unstructured data. We blend keyword search and concept discovery, so you can express a SQL query and pull out both concepts and keywords on unstructured data.

    Much of our current development activity is focused on enhancing our iKnow technology for a more distributed environment.
This will allow people to upload a data set, structured and/or unstructured, and organize it in a flexible and dynamic way by just stepping through a brief series of graphical representations of the most relevant content in the data set. By selecting, in the graphs, the elements you want to use, you can immediately jump into the micro-context of these elements and their related structured and unstructured information objects. Alternatively, you can further segment your data into subsets that fit the use you had in mind. In this second case, the set can be optimized by a number of classic NLP strategies such as similarity extension, typicality pattern parallelism, etc. The data can also be wrapped into existing cubes or into new ones, or fed into advanced predictive models.

    So our goal is to offer our customers a stable solution that really uses both structured and unstructured data in a distributed and scalable way. We will demonstrate the results of our efforts in a live system at our next annual customer conference, Global Summit 2014.

    We also have a software partner that has built a very exciting social media application, using our analytics technology. It’s called Social Knowledge, and it lets you monitor what people are saying on Twitter and Facebook – in real-time. Mind you, this is not keyword search, but concept analysis – a very big difference. So you can see if there’s a groundswell of consumer feedback on your new product, or your latest advertising campaign. Social Knowledge can give you that live feedback – so you can act on it right away.

    In summary, today InterSystems provides SQL and DeepSee over our shared data architecture to do structured data analysis.
    And for unstructured data, we offer iKnow semantic analysis technology and iFind, our iKnow-powered search mechanism, to enable information discovery in text. These features will be enabled for text analytics in future versions of our shared-nothing data architectures.

    Related Posts

    The Gaia mission, one year later. Interview with William O’Mullane.
    ODBMS Industry Watch, January 16, 2013

    Operational Database Management Systems. Interview with Nick Heudecker. ODBMS Industry Watch, December 16, 2013.

    Challenges and Opportunities for Big Data. Interview with Mike Hoskins. ODBMS Industry Watch, December 3, 2013.

    On Analyzing Unstructured Data. — Interview with Michael Brands.
    ODBMS Industry Watch, July 11, 2012.

    Resources

    ODBMS.org: Big Data Analytics, NewSQL, NoSQL, Object Database Vendors –Free Resources.

    ODBMS.org: Big Data and Analytical Data Platforms, NewSQL, NoSQL, Object Databases– Free Downloads and Links.

    ODBMS.org: Expert Articles.

    Follow ODBMS.org on Twitter: @odbmsorg

    ##

On NoSQL. Interview with Rick Cattell.
http://www.odbms.org/blog/2013/08/on-nosql-interview-with-rick-cattell/
Mon, 19 Aug 2013

    ” There aren’t enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves. Likewise, companies with closed source will have trouble finding customers willing to risk a closed source (or limited open source) solution. It will be interesting to see what happens. But I don’t see NoSQL going away, there is a well-established following.” –Rick Cattell.

    I have asked Rick Cattell, one of the leading independent consultants in database systems, a few questions on NoSQL.

    RVZ

    Q1. For years, you have been studying the NoSQL area and writing articles about scalable databases. What is new in the last year, in your view? What is changing?

    Rick Cattell: It seems like there’s a new NoSQL player every month or two, now!
    It’s hard to keep track of them all. However, a few players have become much more popular than the others.

    Q2. Which players are those?

    Rick Cattell: Among the open source players, I hear the most about MongoDB, Cassandra, and Riak now, and often HBase and Redis. However, don’t forget that the proprietary players like Amazon, Oracle, and Google have NoSQL systems as well.

    Q3. How do you define “NoSQL”?

Rick Cattell: I use the term to mean systems that provide simple operations like key/value storage or simple records and indexes, and that focus on horizontal scalability for those simple operations. Some people categorize horizontally scaled graph databases and object databases as “NoSQL” as well. However, those systems have very different characteristics. Graph databases and object databases have to efficiently break connections up over distributed servers, and have to provide operations that somehow span servers as you traverse the graph. Distributed graph/object databases have been around for a while, but efficient distribution is a hard problem. The NoSQL databases simply distribute (or shard) each data type based on a primary key; that’s easier to do efficiently.
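A hedged sketch of what “distribute (or shard) each data type based on a primary key” means in practice (a generic illustration, not any particular NoSQL product’s code): hash the key, pick a node, and every simple get/put touches exactly one server. A graph traversal, by contrast, keeps crossing these node boundaries, which is what makes distributed graph and object databases harder.

```java
/** Minimal sketch of primary-key sharding (illustrative only, not vendor code):
 *  each record's key is hashed to pick one of N servers, so simple get/put
 *  operations touch exactly one node and scale out horizontally. */
public class ShardingSketch {
    private final String[] nodes;          // e.g. {"db-0", "db-1", "db-2"}

    public ShardingSketch(String... nodes) { this.nodes = nodes; }

    /** Route a primary key to a shard. */
    public String shardFor(String primaryKey) {
        int bucket = Math.floorMod(primaryKey.hashCode(), nodes.length);
        return nodes[bucket];
    }

    public static void main(String[] args) {
        ShardingSketch ring = new ShardingSketch("db-0", "db-1", "db-2");
        System.out.println(ring.shardFor("user:42"));   // always the same node for this key
    }
}
```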

    Q4. What other categories of systems do you see?

    Rick Cattell: Well, there are systems that focus on horizontal scaling for full SQL with joins, which are generally called “NewSQL“, and systems optimized for “Big Data” analytics, typically based on Hadoop map/reduce. And of course, you can also sub-categorize the NoSQL systems based on their data model and distribution model.

    Q5. What subcategories would those be?

    Rick Cattell: On data model, I separate them into document databases like MongoDB and CouchBase, simple key/value stores like Riak and Redis, and grouped-column stores like HBase and Cassandra. However, a categorization by data model is deceptive, because they also differ quite a bit in their performance and concurrency guarantees.

    Q6: Which systems perform best?

    Rick Cattell: That’s hard to answer. Performance is not a scale from “good” to “bad”… the different systems have better performance for different kinds of applications. MongoDB performs incredibly well if all your data fits in distributed memory, for example, and Cassandra does a pretty good job of using disk, because of its scheme of writing new data to the end of disk files and consolidating later.

    Q7: What about their concurrency guarantees?

    Rick Cattell: They are all over the place on concurrency. The simplest provide no guarantees, only “eventual consistency“. You don’t know which version of data you’ll get with Cassandra. MongoDB can keep a “primary” replica consistent if you can live with their rudimentary locking mechanism.
    Some of the new systems try to provide full ACID transactions. FoundationDB and Oracle NoSQL claim to do that, but I haven’t yet verified that. I have studied Google’s Spanner paper, and they do provide true ACID consistency in a distributed world, for most practical purposes. Many people think the CAP theorem makes that impossible, but I believe their interpretation of the theorem is too narrow: most real applications can have their cake and eat it too, given the right distribution model. By the way, graph/object databases also provide ACID consistency, as does VoltDB, but as I mentioned I consider them a different category.

    Q8: I notice you have an unpublished paper on your website, called 2x2x2 Requirements for Scalability. Can you explain what the 2x2x2 means?

    Rick Cattell: Well, the first 2x means that there are two different kinds of scalability: horizontal scaling over multiple servers, and vertical scaling for performance on a single server. The remaining 2×2 means that there are two key features needed to achieve the horizontal and vertical scaling, and for each of those, there are two additional things you have to do to make the features practical.

    Q9: What are those key features?

    Rick Cattell: For horizontal scaling, you need to partition and replicate your data. But you also need automatic failure recovery and database evolution with no downtime, because when your database runs on 200 nodes, you can’t afford to take the database offline and you can’t afford operator intervention on every failure.
    To achieve vertical scaling, you need to take advantage of RAM and you need to avoid random disk I/O. You also need to minimize the overhead for locking and latching, and you need to minimize network calls between servers. There are various ways to do that. The best systems have all eight of these key features. These eight features represent my summary of scalable databases in a nutshell.

    Q10: What do you see happening with NoSQL, going forward?

    Rick Cattell: Good question. I see a lot of consolidation happening… there are too many players! There aren’t enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves. Likewise, companies with closed source will have trouble finding customers willing to risk a closed source (or limited open source) solution.
    It will be interesting to see what happens. But I don’t see NoSQL going away, there is a well-established following.

    ————–
    R. G. G. “Rick” Cattell is an independent consultant in database systems.
    He previously worked as a Distinguished Engineer at Sun Microsystems, most recently on open source database systems and distributed database scaling. Dr. Cattell served for 20+ years at Sun Microsystems in management and senior technical roles, and for 10 years in research at Xerox PARC and at Carnegie-Mellon University. Dr. Cattell is best known for his contributions in database systems and middleware, including database scalability, enterprise Java, object/relational mapping, object-oriented databases, and database interfaces. He is the author of several dozen papers and five books, and a co-inventor of six U.S. patents.
    At Sun he instigated the Enterprise Java, Java DB, and Java Blend projects, and was a contributor to a number of Java APIs and products. He previously developed the Cedar DBMS at Xerox PARC, the Sun Simplify database GUI, and SunSoft’s CORBA-database integration.
    He is a co-founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), the co-creator of JDBC, the author of the world’s first monograph on object/relational and object databases, a recipient of the ACM Outstanding PhD Dissertation Award, and an ACM Fellow.

    Related Posts

    On Oracle NoSQL Database –Interview with Dave Segleau. July 2, 2013

    On Real Time NoSQL. Interview with Brian Bulkowski. May 21, 2013

    Resources

    Rick Cattell home page.

    ODBMS.org Free Downloads and Links
    In this section you can download free resources covering the following topics:
    Big Data and Analytical Data Platforms
    Cloud Data Stores
    Object Databases
    NoSQL Data Stores
    Graphs and Data Stores
    Object-Oriented Programming
    Entity Framework (EF) Resources
    ORM Technology
    Object-Relational Impedance Mismatch
    NewSQL, XML, RDF Data Stores, RDBMS

    Follow ODBMS.org on Twitter: @odbmsorg

    ##

Big Data from Space: the “Herschel” telescope.
http://www.odbms.org/blog/2013/08/big-data-from-space-the-herschel-telescope/
Fri, 02 Aug 2013

    ” One of the biggest challenges with any project of such a long duration is coping with change. There are many aspects to coping with change, including changes in requirements, changes in technology, vendor stability, changes in staffing and so on”–Jon Brumfitt.

On May 14, 2009, the European Space Agency launched an Ariane 5 rocket carrying the largest telescope ever flown: the “Herschel” telescope, 3.5 meters in diameter.

    I first did an interview with Dr. Jon Brumfitt, System Architect & System Engineer of Herschel Scientific Ground Segment, at the European Space Agency in March 2011. You can read that interview here.

    Two years later, I wanted to know the status of the project. This is a follow up interview.

    RVZ

    Q1. What is the status of the mission?

    Jon Brumfitt: The operational phase of the Herschel mission came to an end on 29th April 2013, when the super-fluid helium used to cool the instruments was finally exhausted. By operating in the far infra-red, Herschel has been able to see cold objects that are invisible to normal telescopes.
    However, this requires that the detectors are cooled to an even lower temperature. The helium cools the instruments down to 1.7K (about -271 Celsius). Individual detectors are then cooled down further to about 0.3K. This is very close to absolute zero, which is the coldest possible temperature. The exhaustion of the helium marks the end of new observations, but it is by no means the end of the mission.
    We still have a lot of work to do in getting the best results from the data processing to give astronomers a final legacy archive of high-quality data to work with for years to come.

The spacecraft has been in orbit around a point known as the second Lagrangian point “L2”, which is about 1.5 million kilometres from Earth (around four times as far away as the Moon). This location provided a good thermal environment and a relatively unrestricted view of the sky. The spacecraft cannot be left in this orbit because regular correction manoeuvres would be needed. Consequently, it is being transferred into a “parking” orbit around the Sun.

    Q2. What are the main results obtained so far by using the “Herschel” telescope?

    Jon Brumfitt: That is a difficult one to answer in a few sentences. Just to take a few examples, Herschel has given us new insights into the way that stars form and the history of star formation and galaxy evolution since the big-bang.
    It has discovered large quantities of cold water vapour in the dusty disk surrounding a young star, which suggests the possibility of other water covered planets. It has also given us new evidence for the origins of water on Earth.
    The following are some links giving more detailed highlights from the mission:

    – Press
    – Results
    – Press Releases
    – Latest news

    With its 3.5 metre diameter mirror, Herschel is the largest space telescope ever launched. The large mirror not only gives it a high sensitivity but also allows us to observe the sky with a high spatial resolution. So in a sense every observation we make is showing us something we have never seen before. We have performed around 35,000 science observations, which have already resulted in over 600 papers being published in scientific journals. There are many years of work ahead for astronomers in interpreting the results, which will undoubtedly lead to many new discoveries.

    Q3. How much data did you receive and process so far? Could you give us some up to date information?

    Jon Brumfitt: We have about 3 TB of data in the Versant database, most of which is raw data from the spacecraft. The data received each day is processed by our data processing pipeline and the resulting data products, such as images and spectra, are placed in an archive for access by astronomers.
    Each time we make a major new release of the software (roughly every six months at this stage), with improvements to the data processing, we reprocess everything.
    The data processing runs on a grid with around 35 nodes, each with typically 8 cores and between 16 and 256 GB of memory. This is able to process around 40 days worth of data per day, so it is possible to reprocess everything in a few weeks. The data in the archive is stored as FITS files (a standard format for astronomical data).
    The archive uses a relational (PostgreSQL) database to catalogue the data and allow queries to find relevant data. This relational database is only about 60 GB, whereas the product files account for about 60 TB.
    This may reduce somewhat for the final archive, once we have cleaned it up by removing the results of earlier processing runs.

    Q4. What are the main technical challenges in the data management part of this mission and how did you solve them?

    Jon Brumfitt: One of the biggest challenges with any project of such a long duration is coping with change. There are many aspects to coping with change, including changes in requirements, changes in technology, vendor stability, changes in staffing and so on.

    The lifetime of Herschel will have been 18 years from the start of software development to the end of the post-operations phase.
    We designed a single system to meet the needs of all mission phases, from early instrument development, through routine in-flight operations to the end of the post-operations phase. Although the spacecraft was not launched until 2009, the database was in regular use from 2002 for developing and testing the instruments in the laboratory. By using the same software to control the instruments in the laboratory as we used to control them in flight, we ended up with a very robust and well-tested system. We call this approach “smooth transition”.

    The development approach we adopted is probably best classified as an Agile iterative and incremental one. Object orientation helps a lot because changes in the problem domain, resulting from changing requirements, tend to result in localised changes in the data model.
    Other important factors in managing change are separation of concerns and minimization of dependencies, for example using component-based architectures.

    When we decided to use an object database, it was a new technology and it would have been unwise to rely on any database vendor or product surviving for such a long time. Although work was under way on the ODMG and JDO standards, these were quite immature and the only suitable object databases used proprietary interfaces.
    We therefore chose to implement our own abstraction layer around the database. This was similar in concept to JDO, with a factory providing a pluggable implementation of a persistence manager. This abstraction provided a route to change to a different object database, or even a relational database with an object-relational mapping layer, should it have proved necessary.

    One aspect that is difficult to abstract is the use of queries, because query languages differ. In principle, an object database could be used without any queries, by navigating to everything from a global root object. However, in practice navigation and queries both have their role. For example, to find all the observation requests that have not yet been scheduled, it is much faster to perform a query than to iterate by navigation to find them. However, once an observation request is in memory it is much easier and faster to navigate to all the associated objects needed to process it. We have used a variety of techniques for encapsulating queries. One is to implement them as methods of an extent class that acts as a query factory.
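Here is a hedged Java sketch of the two patterns just described, using hypothetical class names rather than Herschel’s actual code: a persistence-manager abstraction that could be backed by different database products, and an extent class acting as a query factory so that the “find unscheduled observation requests” query lives in exactly one place.

```java
import java.util.*;
import java.util.stream.*;

/** Hedged sketch (names hypothetical, not Herschel's code) of a pluggable
 *  persistence abstraction plus an "extent" class acting as a query factory. */
interface PersistenceManager {
    void persist(Object obj);
    <T> List<T> extentOf(Class<T> type);   // in a real system this wraps a vendor query/extent API
}

/** Trivial in-memory implementation standing in for an object-database binding. */
class InMemoryPersistenceManager implements PersistenceManager {
    private final List<Object> objects = new ArrayList<>();
    public void persist(Object obj) { objects.add(obj); }
    public <T> List<T> extentOf(Class<T> type) {
        return objects.stream().filter(type::isInstance).map(type::cast).collect(Collectors.toList());
    }
}

/** Extent class acting as a query factory for one persistent type. */
class ObservationRequestExtent {
    private final PersistenceManager pm;
    ObservationRequestExtent(PersistenceManager pm) { this.pm = pm; }

    /** "Find all observation requests not yet scheduled" lives here, not in application code. */
    List<ObservationRequest> unscheduled() {
        return pm.extentOf(ObservationRequest.class).stream()
                 .filter(r -> r.getScheduledTime() == null)
                 .collect(Collectors.toList());
    }
}

class ObservationRequest {
    private Date scheduledTime;               // persistence-only class: state plus accessors
    public Date getScheduledTime() { return scheduledTime; }
    public void setScheduledTime(Date t) { scheduledTime = t; }
}
```

In a real deployment the extent implementation would issue a native database query rather than an in-memory scan, but application code calling unscheduled() would not need to change.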

    Another challenge was designing a robust data model that would serve all phases of the mission from instrument development in the laboratory, through pre-flight tests and routine operations to the end of post-operations. We approached this by starting with a model of the problem domain and then analysing use-cases to see what data needed to be persistent and where we needed associations. It was important to avoid the temptation to store too much just because transitive persistence made it so easy.

One criticism that is sometimes raised against object databases is that the associations tend to encode business logic in the object schema, whereas relational databases just store data in a neutral form that can outlive the software that created it; if you subsequently decide that you need a new use-case, such as report generation, the associations may not be there to support it. This is true to some extent, but consideration of use cases for the entire project lifetime helped a lot. It is of course possible to use queries to work around missing associations.

    Examples are sometimes given of how easy an object database is to use by directly persisting your business objects. This may be fine for a simple application with an embedded database, but for a complex system you still need to cleanly decouple your business logic from the data storage. This is true whether you are using a relational or an object database. With an object database, the persistent classes should only be responsible for persistence and referential integrity and so typically just have getter and setter methods.
    We have encapsulated our persistent classes in a package called the Core Class Model (CCM) that has a factory to create instances. This complements the pluggable persistence manager. Hence, the application sees the persistence manager and CCM factories and interfaces, but the implementations are hidden.
    Applications define their own business classes which can work like decorators for the persistent classes.
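A compact, hypothetical illustration of that layering (again, not the actual CCM classes): the persistent class only holds state, while an application-level business class wraps it, decorator-style, and carries the logic.

```java
/** Hypothetical sketch of the layering described above (not the actual CCM code). */

// Persistence layer: state, getters and setters only.
class ProposalData {
    private String id;
    private double requestedHours;
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public double getRequestedHours() { return requestedHours; }
    public void setRequestedHours(double h) { requestedHours = h; }
}

// Application layer: a business class that decorates the persistent object with behaviour.
class Proposal {
    private final ProposalData data;          // the wrapped persistent instance
    Proposal(ProposalData data) { this.data = data; }

    /** Business rule lives here, not in the persistent class. */
    boolean exceedsAllocation(double availableHours) {
        return data.getRequestedHours() > availableHours;
    }
}
```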

    Q5. What is your experience in having two separate database systems for Herschel? A relational database for storing and managing processed data products and an object database for storing and managing proposal data, mission planning data, telecommands and raw (unprocessed) telemetry?

    Jon Brumfitt: There are essentially two parts to the ground segment for a space observatory.
    One is the “uplink” which is used for controlling the spacecraft and instruments. This includes submission of observing proposals, observation planning, scheduling, flight dynamics and commanding.
    The other is the “downlink”, which involves ingesting and processing the data received from the spacecraft.

    On some missions the data processing is carried out by a data centre, which is separate from spacecraft operations. In that case there is a very clear separation.
    On Herschel, the original concept was to build a completely integrated system around an object database that would hold all uplink and downlink data, including processed data products. However, after further analysis it became clear that it was better to integrate our product archive with those from other missions. This also means that the Herschel data will remain available long after the project has finished. The role of the object database is essentially for operating the spacecraft and storing the raw data.

    The Herschel archive is part of a common infrastructure shared by many of our ESA science projects. This provides a uniform way of accessing data from multiple missions.
    The following is a nice example of how data from Herschel and our XMM-Newton X-ray telescope have been combined to make a multi-spectral image of the Andromeda Galaxy.

    Our archive, in turn, forms part of a larger international archive known as the “Virtual Observatory” (VO), which includes both space and ground-based observatories from all over the world.

    I think that using separate databases for operations and product archiving has worked well. In fact, it is more the norm rather than the exception. The two databases serve very different roles.
    The uplink database manages the day-to-day operations of the spacecraft and is constantly being updated. The uplink data forms a complex object graph which is accessed by navigation, so an object database is well suited.
    The product archive is essentially a write-once-read-many repository. The data is not modified, but new versions of products may be added as a result of reprocessing. There are a large number of clients accessing it via the Internet. The archive database is a catalogue containing the product meta-data, which can be queried to find the relevant product files. This is better suited to a relational database.
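As a hedged illustration of that catalogue role (the table and column names below are invented, not the actual Herschel archive schema), a metadata lookup in plain JDBC might look like this, returning the paths of FITS product files that live outside the database:

```java
import java.sql.*;
import java.util.*;

/** Hedged sketch of a metadata catalogue lookup. Table and column names are
 *  hypothetical: the relational database stores product metadata, and the
 *  query result points at FITS files stored outside the database. */
public class ArchiveLookup {
    public static List<String> productFiles(Connection conn, String observationId) throws SQLException {
        String sql = "SELECT file_path FROM products WHERE observation_id = ? ORDER BY version DESC";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, observationId);
            try (ResultSet rs = ps.executeQuery()) {
                List<String> paths = new ArrayList<>();
                while (rs.next()) paths.add(rs.getString("file_path"));
                return paths;   // e.g. paths of FITS files to fetch from the file store
            }
        }
    }
}
```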

    The motivation for the original idea of using a single object database for everything was that it allowed direct association between uplink and downlink data. For example, processed products could be associated with their observation requests. However, using separate databases does not prevent one database being queried with an observation identifier obtained from the other.
    One complication is that processing an observation requires both downlink data and the associated uplink data.
    We solved this by creating “uplink products” from the relevant uplink data and placing them in the archive. This has the advantage that external users, who do not have access to the Versant database, have everything they need to process the data themselves.

    Q6. What are the main lessons learned so far in using Versant object database for managing telemetry data and information on steering and calibrating scientific on-board instruments?

    Jon Brumfitt: Object databases can be very effective for certain kinds of application, but may have less benefit for others. A complex system typically has a mixture of application types, so the advantages are not always clear cut. Object databases can give a high performance for applications that need to navigate through a complex object graph, particularly if used with fairly long transactions where a significant part of the object graph remains in memory. Web (JavaEE) applications lose some of the benefit because they typically perform many short transactions with each one performing a query. They also use additional access layers that result in a system which loses the simplicity of the transparent persistence of an object database.

    In our case, the object database was best suited for the uplink. It simplified the uplink development by avoiding object-relational mapping and the complexity of a design based on JDBC or EJB 2. Nowadays with JPA, relational databases are much easier to use for object persistence, so the rationale for using an object database is largely determined by whether the application can benefit from fast navigational access and how much effort is saved in mapping. There are now at least two object database vendors that support both JDO and JPA, so the distinction is becoming somewhat blurred.

For telemetry access we query the database instead of using navigation, as the packets don’t fit neatly into a single containment hierarchy. Queries allow packets to be accessed by many different criteria, such as time, instrument, type, source and so on.
    Processing calibration observations does not introduce any special considerations as far as the database is concerned.

    Q7. Did you have any scalability and or availability issues during the project? If yes, how did you solve them?

    Jon Brumfitt: Scalability would have been an important issue if we had kept to the original concept of storing everything including products in a single database. However, using the object database for just uplink and telemetry meant that this was not a big issue.

    The data processing grid retrieves the raw telemetry data from the object database server, which is a 16-core Linux machine with 64 GB of memory. The average load on the server is quite low, but occasionally there have been high peak loads from the grid that have saturated the server disk I/O and slowed down other users of the database. Interactive applications such as mission planning need a rapid response, whereas batch data processing is less critical. We solved this by implementing a mechanism to spread out the grid load by treating the database as a resource.

    Once a year, we have made an “Announcement of Opportunity” for astronomers to propose observations that they would like to perform with Herschel. It is only human nature that many people leave it until the last minute and we get a very high peak load on the server in the last hour or two before the deadline! We have used a separate server for this purpose, rather than ingesting proposals directly into our operational database. This has avoided any risk of interfering with routine operations. After the deadline, we have copied the objects into the operational database.

    Q8. What about the overall performance of the two databases? What are the lessons learned?

    Jon Brumfitt: The databases are good at different things.
    As mentioned before, an object database can give a high performance for applications involving a complex object graph which you navigate around. An example is our mission planning system. Object persistence makes application design very simple, although in a real system you still need to introduce layers to decouple the business logic from the persistence.

    For the archive, on the other hand, a relational database is more appropriate. We are querying the archive to find data that matches a set of criteria. The data is stored in files rather than as objects in the database.

    Q9. What are the next steps planned for the project and the main technical challenges ahead?

Jon Brumfitt: As I mentioned earlier, the coming post-operations phase will concentrate on further improving the data processing software to generate a top-quality legacy archive, and on provision of high-quality support documentation and continued interactive support for the community of astronomers that forms our “customer base”. The system was designed from the outset to support all phases of the mission, from early instrument development tests in the laboratory, through routine operations to the end of the post-operations phase of the mission. The main difference moving into post-operations is that we will stop uplink activities and ingesting new telemetry. We will continue to reprocess all the data regularly as improvements are made to the data processing software.

    We are currently in the process of upgrading from Versant 7 to Versant 8.
    We have been using Versant 7 since launch and the system has been running well, so there has been little urgency to upgrade.
    However, with routine operations coming to an end, we are doing some “technology refresh”, including upgrading to Java 7 and Versant 8.

    Q10. Anything else you wish to add?

    Jon Brumfitt: These are just some personal thoughts on the way the database market has evolved over the lifetime of Herschel. Thirteen years ago, when we started development of our system, there were expectations that object databases would really take off in line with the growing use of object orientation, but this did not happen. Object databases still represent rather a niche market. It is a pity there is no open-source object-database equivalent of MySQL. This would have encouraged more people to try object databases.

    JDO has developed into a mature standard over the years. One of its key features is that it is “architecture neutral”, but in fact there are very few implementations for relational databases. However, it seems to be finding a new role for some NoSQL databases, such as the Google AppEngine datastore.
    NoSQL appears to be taking off far quicker than object databases did, although it is an umbrella term that covers quite a few kinds of datastore. Horizontal scaling is likely to be an important feature for many systems in the future. The relational model is still dominant, but there is a growing appreciation of alternatives. There is even talk of “Polyglot Persistence” using different kinds of databases within a system; in a sense we are doing this with our object database and relational archive.

    More recently, JPA has created considerable interest in object persistence for relational databases and appears to be rapidly overtaking JDO.
    This is partly because it is being adopted by developers of enterprise applications who previously used EJB 2.
    If you look at the APIs of JDO and JPA they are actually quite similar apart from the locking modes. However, there is an enormous difference in the way they are typically used in practice. This is more to do with the fact that JPA is often used for enterprise applications. The distinction is getting blurred by some object database vendors who now support JPA with an object database. This could expand the market for object databases by attracting some traditional relational type applications.

    So, I wonder what the next 13 years will bring! I am certainly watching developments with interest.
    ——

    Dr Jon Brumfitt, System Architect & System Engineer of Herschel Scientific Ground Segment, European Space Agency.

    Jon Brumfitt has a background in Electronics with Physics and Mathematics and has worked on several of ESA’s astrophysics missions, including IUE, Hipparcos, ISO, XMM and currently Herschel. After completing his PhD and a post-doctoral fellowship in image processing, Jon worked on data reduction for the IUE satellite before joining Logica Space and Defence in 1980. In 1984 he moved to Logica’s research centre in Cambridge and then in 1993 to ESTEC in the Netherlands to work on the scientific ground segments for ISO and XMM. In January 2000, he joined the newly formed Herschel team as science ground segment System Architect. As Herschel approached launch, he moved down to the European Space Astronomy Centre in Madrid to become part of the Herschel Science Operations Team, where he is currently System Engineer and System Architect.

    Related Posts

    The Gaia mission, one year later. Interview with William O’Mullane. January 16, 2013

    Objects in Space: “Herschel” the largest telescope ever flown. March 18, 2011

    Resources

    Introduction to ODBMS By Rick Grehan

    ODBMS.org Resources on Object Database Vendors.

    —————————————
You can follow ODBMS.org on Twitter: @odbmsorg

    ##

Acquiring Versant – Interview with Steve Shine.
http://www.odbms.org/blog/2013/03/acquiring-versant-interview-with-steve-shine/
Wed, 06 Mar 2013

“So the synergies in data management come not from how the systems connect but how the data is used to derive business value.” –Steve Shine.

    On Dec. 21, 2012, Actian Corp. announced the completion of the transaction to buy Versant Corporation. I have interviewed Steve Shine, CEO and President, Actian Corporation.

    RVZ

    Q1. Why acquiring an object-oriented database company such as Versant?

Steve Shine: Versant Corporation, like us, has a long pedigree in solving complex data management problems in some of the world’s largest organisations. We see many synergies in bringing the two companies together. The most important of these is that, together, we are able to invest more resources in helping our customers extract even more value from their data. Our direct clients will have a larger product portfolio to choose from, our partners will be able to expand in adjacent solution segments, and strategically we arm ourselves with the skills and technology to fulfil our plans to deliver innovative solutions in the emerging Big Data Market.

    Q2. For the enterprise market, Actian offers its legacy Ingres relational database. Versant on the other hand offers an object oriented database, especially suited for complex science/engineering applications. How does this fit? Do you have a strategy on how to offer a set of support processes and related tools for the enterprise? if yes, how?

Steve Shine: While the two databases may not have a direct logical connection at client installations, we recognise that most clients use these two products as part of larger, more holistic solutions to support their operations. The data they manage is the same and interacts to solve business issues – for example, object stores to manage the relationships between entities; transactional systems to manage clients and the supply chain; and analytic systems to monitor and tune operational performance – different systems using the same underlying data to drive a complex business.

We plan to announce a vision of an integrated platform designed to help our clients manage all their data and their complex interactions, both internal and external, so they can not only focus on running their business, but also better exploit the incremental opportunity promised by Big Data.

    Q3. Bernhard Woebker, president and chief executive officer of Versant stated, “the combination of Actian and Versant provides numerous synergies for data management”. Could you give us some specific examples of such synergies for data management?

Steve Shine: Here is a specific example of what I mean by helping clients extract more value from data in the Telco space. These types of incremental opportunities exist in every vertical we have looked at.

    An OSS system in a Telco today may use an Object store to manage the complex relationships between the data, the same data is used in a relational store to monitor, control and manage the telephone network.

    Another relational store using variants of the same data manages the provisioning, billing and support for the users of the network. The whole data set in Analytical stores is used to monitor and optimise performance and usage of the network.

    Fast forwarding to today, the same data used in more sophisticated ways has allowed voice and data networks to converge to provide a seamless interface to mobile users. As a result, Telcos have tremendous incremental revenue opportunities BUT only if they can exploit the data they already have in their networks. For example: The data on their networks has allowed for a huge increase in location based services, knowledge and analysis of the data content has allowed providers to push targeted advertising and other revenue earning services at their users; then turning the phone into a common billing device to get even a greater share of the service providers revenue… You get the picture.

    Now imagine other corporations being able to exploit their information in similar ways: Would a retailer benefit from knowing the preferences of who’s in their stores? Would a hedge fund benefit from detecting a sentiment shift for a stock as it happens? Even knowledge of simple events can help organisations become more efficient.
A salesman knowing immediately when a key client raises a support ticket; a product manager knowing what’s being asked on discussion forums; a marketing manager knowing a perfect prospect is on the website.

So the synergies in data management come not from how the systems connect but how the data is used to derive business value. We want to help manage all the data in our customers’ organisations and help them drive incremental value from it. That is what we mean by numerous synergies from data management, and we have a vision to deliver it to our customers.

    Q4. Actian claims to have more than 10,000 customers worldwide. What is the value proposition of Versant’s acquisition for the existing Actian`s customers?

    Steve Shine: I have covered this in the answers above. They get access to a larger portfolio of products and services and we together drive a vision to help them extract greater value from their data.

    Q5. Versant claims to have more than 150,000 installations worldwide. How do you intend to support them?

Steve Shine: Actian already runs a 24/7 global support organisation that prides itself on delivering one of the industry’s best client satisfaction scores. As far as numbers are concerned, Versant’s large user count is in essence driven by only 250 or so very sophisticated large installations, whereas Actian already deals with over 10,000 discrete mission-critical installations worldwide. So we are confident of maintaining our very high support levels, and the Versant support infrastructure is being integrated into Actian’s as we speak.

    Q6. Actian is active in the market for big data analytics. How does Versant’s database technology fit into Actian’s big data analytics offerings and capabilities?

Steve Shine: Using the example above, imagine using OSS data to analyse network utilisation, CDRs and billing information to identify pay plans for your most profitable clients.

Now give these clients the ability to take business action on real-time changes in their data. Now imagine being able to do that from an integrated product set from one vendor. We will be announcing the vision behind this strategy this quarter. In addition, the Versant technology gives us additional options for solutions for big data, for example visualisation and managing metadata.

    Q7. Do you intend to combine or integrate your analytics database Vectorwise with Versant’s database technology (such as Versant JPA)? If yes, how?

Steve Shine: Specific plans for integrating products within the overall architecture have not been formulated. We have a strong philosophy that you should use the best tool for the job, e.g. an OODB for some things, an OLTP RDBMS for others, etc. But the real value comes from being able to perform sophisticated analysis and management across the different data stores. That is part of the work our platform integration efforts are focused on.

    Q8. What are the plans for future software development? Will you have a joint development team or else?

    Steve Shine: We will be merging the engineering teams to focus on providing innovative solutions for Big Data under single leadership.

    Q9. You have recently announced two partnerships for Vectorwise, with Inferenda and BiBoard. Will you also pursue this indirect channel path for Versant’s database technology?

    Steve Shine: The beauty of the vision we speak of is that our joint partners have a real opportunity to expand their solutions using Actian’s broader product set and, for those that are innovative, an opportunity to address new emerging markets.

    Q10. Versant recently developed Versant JPA. Is the Java market important for Actian?

    Steve Shine: Yes!

    Q11. It is currently a crowded database market: several new database vendors (NoSQL and NewSQL) offering innovative database technology (NuoDB, VoltDB, MongoDB, Cassandra, Couchbase, Riak to name a few), and large companies such as IBM and Oracle, are all chasing the big data market. What is your plan to stand out of the crowd?

    Steve Shine: We are very excited about the upcoming announcement on our plans for the Big Data market. We will be happy to brief you on the details closer to the time, but I will say that early feedback from analyst houses like Gartner has confirmed that our solution is very effective and differentiated in helping corporations extract business value from Big Data. On a higher scale, many of the start-ups are going to get a very rude awakening when they find that delivering a database for mission-critical use is much more than speed and scale of technology. Enterprises want world-class 24×7 support service with failsafe resilience and security. Real industry-grade databases take years and many millions of dollars to reach scalable maturity. Most of the start-ups will not make it. Actian is uniquely positioned in being profitable and having delivered industry-grade database innovation, while also being singularly focused on data management, unlike the broad, cumbersome and expensive bigger players. We believe value-conscious enterprises will see our maturity and agility as a great strength.

    Qx Anything else you wish to add?

    Steve Shine: DATA! – What a great thing to be involved in! Endless value, endless opportunities for innovation and no end in sight as far as growth is concerned. I look forward to the next 5 years.

    ———————–

    Steve Shine, CEO and President, Actian Corporation.
    Steve comes to Actian from Sybase, where he was senior vice president and general manager for EMEA, overseeing all operational, sales, financial and human resources in the region for the past three years. While at Sybase, he achieved more than 200 million in revenue and managed 500 employees, charting over 50 percent growth in the Business Intelligence market for Sybase. Prior to Sybase, Steve spent ten successful years at Canadian-based Geac Computer Corporation, helping to turn around two major global divisions for the ERP firm.

    Related Posts

    Managing Internet Protocol Television Data. — An interview with Stefan Arbanowski. June 25, 2012

    On Versant’s technology. Interview with Vishal Bagga. August 17, 2011

    Resources

    Big Data: Principles and best practices of scalable realtime data systems. Nathan Marz (Twitter) and James Warren, MEAP began January 2012, Manning Publications.

    Analyzing Big Data With Twitter. A special UC Berkeley iSchool course.

    -A write-performance improvement of ZABBIX with NoSQL databases and HistoryGluon. MIRACLE LINUX CORPORATION, February 13, 2013

    Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs. Ben Engber, CEO, Thumbtack Technology, JANUARY 2013.

    Follow ODBMS.org on Twitter: @odbmsorg
    ##

    The Spring Data project. Interview with David Turanski. http://www.odbms.org/blog/2013/01/the-spring-data-project-interview-with-david-turanski/ http://www.odbms.org/blog/2013/01/the-spring-data-project-interview-with-david-turanski/#comments Thu, 03 Jan 2013 09:56:23 +0000 http://www.odbms.org/blog/?p=1867 “Given the recent explosion of NoSQL data stores, we saw the need for a common data access abstraction to simplify development with NoSQL stores. Hence the Spring Data team was created.” –David Turanski.

    I wanted to know more about the Spring Data project. I have interviewed David Turanski, Senior Software Engineer with SpringSource, a division of VMWare.

    RVZ

    Q1. What is the Spring Framework?

    David Turanski: Spring is a widely adopted open source application development framework for enterprise Java, used by millions of developers. Version 1.0 was released in 2004 as a lightweight alternative to Enterprise Java Beans (EJB). Since then, Spring has expanded into many other areas of enterprise development, such as enterprise integration (Spring Integration), batch processing (Spring Batch), web development (Spring MVC, Spring Webflow), and security (Spring Security). Spring continues to push the envelope for mobile applications (Spring Mobile), social media (Spring Social), rich web applications (Spring MVC, s2js Javascript libraries), and NoSQL data access (Spring Data).

    Q2. In how many open source Spring projects is VMware actively contributing?

    David Turanski: It’s difficult to give an exact number. Spring is very modular by design, so if you look at the SpringSource page on github, there are literally dozens of projects. I would estimate there are about 20 Spring projects actively supported by VMware.

    Q3. What is the Spring Data project?

    David Turanski: The Spring Data project started in 2010, when Rod Johnson (Spring Framework’s inventor), and Emil Eifrem (founder of Neo Technologies) were trying to integrate Spring with the Neo4j graph database. Spring has always provided excellent support for working with RDBMS and ORM frameworks such as Hibernate. However, given the recent explosion of NoSQL data stores, we saw the need for a common data access abstraction to simplify development with NoSQL stores. Hence the Spring Data team was created with the mission to:

    “…provide a familiar and consistent Spring-based programming model for NoSQL and relational stores while retaining store-specific features and capabilities.”

    The last bit is significant. It means we don’t take a least common denominator approach. We want to expose a full set of capabilities whether it’s JPA/Hibernate, MongoDB, Neo4j, Redis, Hadoop, GemFire, etc.

    Q4. Could you give us an example of how you build Spring-powered applications that use NoSQL data stores (e.g. Redis, MongoDB, Neo4j, HBase)?

    David Turanski: Spring Data provides an abstraction for the Repository pattern for data access. A Repository is akin to a Data Access Object and provides an interface for managing persistent objects. This includes the standard CRUD operations, but also includes domain specific query operations. For example, if you have a Person object:

    public class Person {
        private int id;
        private int age;
        private String firstName;
        private String lastName;
        // getters and setters omitted
    }
    

    You may want to perform queries such as findByFirstNameAndLastName, findByLastNameStartsWith, findByFirstNameContains, findByAgeLessThan, etc. Traditionally, you would have to write code to implement each of these methods. With Spring Data, you simply declare a Java interface to define the operations you need. Using method naming conventions, as illustrated above, Spring Data generates a dynamic proxy to implement the interface on top of whatever data store is configured for the application. The Repository interface in this case looks like:

    	
    public interface PersonRepository extends CrudRepository<Person, Integer> {
        Person findByFirstNameAndLastName(String firstName, String lastName);
        List<Person> findByLastNameStartsWith(String lastName);
        List<Person> findByAgeLessThan(int age);
        // ...
    }
    

    In addition, Spring Data Repositories provide declarative support for pagination and sorting.
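
    As a concrete illustration (this sketch is not from the interview, and the repository name is made up), paging and sorting are requested simply by adding a Pageable or Sort parameter to a derived query method:

    import java.util.List;
    import org.springframework.data.domain.Page;
    import org.springframework.data.domain.Pageable;
    import org.springframework.data.domain.Sort;
    import org.springframework.data.repository.PagingAndSortingRepository;

    // Hypothetical repository showing declarative paging and sorting.
    public interface PagedPersonRepository extends PagingAndSortingRepository<Person, Integer> {

        // Spring Data applies the Pageable to the query and returns a single page of results.
        Page<Person> findByLastName(String lastName, Pageable pageable);

        // A Sort parameter alone orders the complete result set.
        List<Person> findByAgeLessThan(int age, Sort sort);
    }

    The caller passes a PageRequest (page number, page size and an optional Sort) and gets back a Page, which also reports the total number of matching records.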

    Then, using Spring’s dependency injection capabilities, you simply wire the repository into your application. For example:

    public class PersonApp {

        @Autowired
        private PersonRepository personRepository;

        public Person findPerson(String lastName, String firstName) {
            return personRepository.findByFirstNameAndLastName(firstName, lastName);
        }
    }
    

    Essentially, you don’t have to write any data access code! However, you must provide Java annotations on your domain class to configure entity mapping to the data store. For example, if using MongoDB you would associate the domain class with a document:

    @Document
    public class Person {
        private int id;
        private int age;
        private String firstName;
        private String lastName;
    }
    

    Note that the entity mapping annotations are store-specific. Also, you need to provide some Spring configuration to tell your application how to connect to the data store, in which package(s) to search for Repository interfaces and the like.
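
    As an illustrative sketch (not part of the original interview; the package name, host, port and database name are assumptions, and it presumes the legacy Mongo Java driver), a Java-based configuration for the MongoDB case might look like this:

    import com.mongodb.MongoClient;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.data.mongodb.core.MongoTemplate;
    import org.springframework.data.mongodb.repository.config.EnableMongoRepositories;

    @Configuration
    // Tells Spring Data where to scan for repository interfaces such as PersonRepository.
    @EnableMongoRepositories(basePackages = "com.example.repositories")
    public class MongoConfig {

        @Bean
        public MongoClient mongoClient() {
            // Connection details are placeholders.
            return new MongoClient("localhost", 27017);
        }

        @Bean
        public MongoTemplate mongoTemplate() {
            // "demo" is a placeholder database name.
            return new MongoTemplate(mongoClient(), "demo");
        }
    }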

    The Spring Data team has written an excellent book, including lots of code examples: Spring Data: Modern Data Access for Enterprise Java, recently published by O’Reilly. Also, the project web site includes many resources to help you get started using Spring Data.

    Q5. And for map-reduce frameworks?

    David Turanski: Spring Data provides excellent support for developing applications with Apache Hadoop along with Pig and/or Hive. However, Hadoop applications typically involve a complex data pipeline which may include loading data from multiple sources, pre-processing and real-time analysis while loading data into HDFS, data cleansing, implementing a workflow to coordinate several data analysis steps, and finally publishing data from HDFS to one or more relational or NoSQL application data stores.

    The complete pipeline can be implemented using Spring for Apache Hadoop along with Spring Integration and Spring Batch. However, Hadoop has its own set of challenges which the Spring for Apache Hadoop project is designed to address. Like all Spring projects, it leverages the Spring Framework to provide a consistent structure and simplify writing Hadoop applications. For example, Hadoop applications rely heavily on command shell tools, so applications end up being a hodge-podge of Perl, Python, Ruby, and bash scripts. Spring for Apache Hadoop provides a dedicated XML namespace for configuring Hadoop jobs with embedded scripting features and support for Hive and Pig. In addition, Spring for Apache Hadoop allows you to take advantage of core Spring Framework features such as task scheduling, Quartz integration, and property placeholders to reduce lines of code, improve testability and maintainability, and simplify the development process.

    Q6. What about cloud-based data services? And support for relational database technologies or object-relational mappers?

    David Turanski: While there are currently no plans to support cloud-based services such as Amazon S3, Spring Data provides a flexible architecture upon which these may be implemented. Relational technologies and ORM are supported via Spring Data JPA. Spring has always provided first-class support for relational databases via the JdbcTemplate using a vendor-provided JDBC driver. For ORM, Spring supports Hibernate, any JPA provider, and iBATIS. Additionally, Spring provides excellent support for declarative transactions.

    With Spring Data, things get even easier. In a traditional Spring application backed by JDBC, you are required to hand code the Repositories or Data Access Objects. With Spring Data JPA, the data access layer is generated by the framework while persistent objects use standard JPA annotations.
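
    For example (an illustrative sketch, not from the interview), the same Person class mapped with standard JPA annotations instead of the MongoDB-specific ones shown earlier:

    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.Id;

    @Entity
    public class Person {

        @Id
        @GeneratedValue
        private int id;

        private int age;
        private String firstName;
        private String lastName;
    }

    The PersonRepository interface itself is unchanged; only the mapping annotations and the store configuration differ.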

    Q7. How can you use Spring to perform:
    – Data ingestion from various data sources into Hadoop,
    – Orchestrating Hadoop based analysis workflow,
    – Exporting data out of Hadoop into relational and non-relational databases

    David Turanski: As previously mentioned, a complete big data processing pipeline involving all of these steps will require Spring for Apache Hadoop in conjunction with Spring Integration and Spring Batch.

    Spring Integration greatly simplifies enterprise integration tasks by providing a lightweight messaging framework, based on the well-known Enterprise Integration Patterns by Hohpe and Woolf. Sometimes referred to as the “anti ESB”, Spring Integration requires no runtime component other than a Spring container and is embedded in your application process to handle data ingestion from various distributed sources, mediation, transformation, and data distribution.

    Spring Batch provides a robust framework for any type of batch processing and is used to configure and execute scheduled jobs composed of coarse-grained processing steps. Individual steps may be implemented as Spring Integration message flows or Hadoop jobs.
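
    As a rough sketch of how the pieces fit together (not from the interview; the class, job and step names are hypothetical, and Java 8 is assumed for the lambda), a Spring Batch Java configuration can wire such coarse-grained steps into a job:

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
    import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.repeat.RepeatStatus;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    @EnableBatchProcessing
    public class AnalysisJobConfig {

        @Autowired
        private JobBuilderFactory jobs;

        @Autowired
        private StepBuilderFactory steps;

        // Each step stands in for one coarse-grained unit of work, e.g. kicking off a
        // Hadoop job or a Spring Integration flow.
        @Bean
        public Step ingestStep() {
            return steps.get("ingest")
                    .tasklet((contribution, chunkContext) -> {
                        // load raw data into HDFS (placeholder)
                        return RepeatStatus.FINISHED;
                    })
                    .build();
        }

        @Bean
        public Job analysisJob() {
            return jobs.get("analysisJob").start(ingestStep()).build();
        }
    }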

    Q8. What is the Spring Data GemFire project?

    David Turanski: Spring Data GemFire began life as a separate project from Spring Data following VMWare’s acquisition of GemStone and its commercial GemFire distributed data grid.
    Initially, its aim was to simplify the development of GemFire applications and the configuration of GemFire caches, data regions, and related components. While this was, and still is, developed independently as an open source Spring project, the GemFire product team recognized the value to its customers of developing with Spring and has increased its commitment to Spring Data GemFire. As of the recent GemFire 7.0 release, Spring Data GemFire is being promoted as the recommended way to develop GemFire applications for Java. At the same time, the project was moved under the Spring Data umbrella. We implemented a GemFire Repository and will continue to provide first-class support for GemFire.

    Q9. Could you give a technical example on how do you simplify the development of building highly scalable applications?

    David Turanski: GemFire is a fairly mature distributed, memory-oriented data grid used to build highly scalable applications. As a consequence, there is inherent complexity involved in configuration of cache members and data stores known as regions (a region is roughly analogous to a table in a relational database). GemFire supports peer-to-peer and client-server topologies, and regions may be local, replicated, or partitioned. In addition, GemFire provides a number of advanced features for event processing, remote function execution, and so on.

    Prior to Spring Data GemFire, GemFire configuration was done predominantly via its native XML support. This works well but is relatively limited in terms of flexibility. Today, configuration of core components can be done entirely in Spring, making simple things simple and complex things possible.

    In a client-server scenario, an application developer may only be concerned with data access. In GemFire, a client application accesses data via a client cache and a client region, which act as proxies to provide access to the grid. Such components are easily configured with Spring, and the application code is the same whether data is distributed across one hundred servers or cached locally. This allows developers to take advantage of Spring’s environment profiles to easily switch to a local cache and region suitable for unit integration tests which are self-contained and may run anywhere, including automated build environments. The cache resources are configured in Spring XML:

    <beans>
        <beans profile="test">
            <gfe:cache/>
            <gfe:local-region name="Person"/>
        </beans>

        <beans profile="default">
            <context:property-placeholder location="cache.properties"/>
            <gfe:client-cache/>
            <gfe:client-region name="Person"/>
            <gfe:pool>
                <gfe:locator host="${locator.host}" port="${locator.port}"/>
            </gfe:pool>
        </beans>
    </beans>
    

    Here we see that the deployed application (default profile) depends on a remote GemFire locator process. The client region does not store data locally by default but is connected to an available cache server via the locator. The region is distributed among the cache server and its peers and may be partitioned or replicated. The test profile sets up a self-contained region in local memory, suitable for unit integration testing.
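
    To show how the profiles are selected (an illustrative sketch, not from the interview; the test class and context file name are made up), an integration test can activate the "test" profile so the self-contained local region is used:

    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.springframework.test.context.ActiveProfiles;
    import org.springframework.test.context.ContextConfiguration;
    import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

    @RunWith(SpringJUnit4ClassRunner.class)
    @ContextConfiguration("classpath:cache-context.xml") // the XML shown above; the file name is assumed
    @ActiveProfiles("test")                              // picks the local cache and region
    public class PersonRegionTest {

        @Test
        public void worksAgainstLocalRegion() {
            // exercise data access against the self-contained local region here
        }
    }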

    Additionally, applications may be further simplified by using a GemFire-backed Spring Data Repository. The key difference from the example above is that the entity mapping annotations are replaced with GemFire-specific annotations:

    @Region
    public class Person {
        private int id;
        private int age;
        private String firstName;
        private String lastName;
    }
    

    The @Region annotation maps the Person type to an existing region of the same name. The annotation also provides an attribute to specify a different region name if necessary.
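
    To complete the picture (an illustrative sketch, not from the interview), the repository declared against the GemFire-mapped entity looks just like the earlier examples; Spring Data GemFire resolves the region from the @Region mapping:

    import org.springframework.data.repository.CrudRepository;

    // Hypothetical GemFire-backed repository; derived queries are translated to OQL.
    public interface PersonRepository extends CrudRepository<Person, Integer> {
        Person findByFirstNameAndLastName(String firstName, String lastName);
    }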

    Q10. The project uses GemFire as a distributed data management platform. Why using an In-Memory Data Management platform, and not a NoSQL or NewSQL data store?

    David Turanski: Customers choose GemFire primarily for performance. As an in-memory grid, data access can be an order of magnitude faster than with disk-based stores. Many disk-based systems also cache data in memory to gain performance, but your mileage may vary depending on the specific operation and when disk I/O is needed. In contrast, GemFire’s performance is very consistent. This is a major advantage for a certain class of high-volume, low-latency, distributed systems. Additionally, GemFire is extremely reliable, providing disk-based backup and recovery.

    GemFire also builds in advanced features not commonly found in the NoSQL space. This includes a number of advanced tuning parameters to balance performance and reliability, synchronous or asynchronous replication, advanced object serialization features, flexible data partitioning with configurable data colocation, WAN gateway support, continuous queries, .Net interoperability, and remote function execution.

    Q11. Is GemFire a full-fledged distributed database management system, or something else?

    David Turanski: Given all its capabilities and proven track record supporting many mission critical systems, I would certainly characterize GemFire as such.
    ———————————-

    David Turanski is a Senior Software Engineer with SpringSource, a division of VMWare. David is a member of the Spring Data team and lead of the Spring Data GemFire project. He is also a committer on the Spring Integration project. David has extensive experience as a developer, architect and consultant serving a variety of industries. In addition he has trained hundreds of developers how to use the Spring Framework effectively.

    Related Posts

    On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. December 5, 2012

    Two cons against NoSQL. Part II. November 21, 2012

    Two Cons against NoSQL. Part I. October 30, 2012

    Interview with Mike Stonebraker. May 2, 2012

    Resources

    ODBMS.org Lecture Notes: Data Management in the Cloud.
    Michael Grossniklaus, David Maier, Portland State University.
    Course Description: “Cloud computing has recently seen a lot of attention from research and industry for applications that can be parallelized on shared-nothing architectures and have a need for elastic scalability. As a consequence, new data management requirements have emerged with multiple solutions to address them. This course will look at the principles behind data management in the cloud as well as discuss actual cloud data management systems that are currently in use or being developed. The topics covered in the course range from novel data processing paradigms (MapReduce, Scope, DryadLINQ), to commercial cloud data management platforms (Google BigTable, Microsoft Azure, Amazon S3 and Dynamo, Yahoo PNUTS) and open-source NoSQL databases (Cassandra, MongoDB, Neo4J). The world of cloud data management is currently very diverse and heterogeneous. Therefore, our course will also report on efforts to classify, compare and benchmark the various approaches and systems. Students in this course will gain broad knowledge about the current state of the art in cloud data management and, through a course project, practical experience with a specific system.”
    Lecture Notes | Intermediate/Advanced | English | DOWNLOAD ~280 slides (PDF)| 2011-12|

    ##

    Publishing Master and PhD Thesis on Big Data. http://www.odbms.org/blog/2012/03/publishing-master-and-phd-thesis-on-big-data/ http://www.odbms.org/blog/2012/03/publishing-master-and-phd-thesis-on-big-data/#comments Wed, 28 Mar 2012 14:16:35 +0000 http://www.odbms.org/blog/?p=1385 In order to help disseminate the work of young students and researchers in the area of databases, I have started publishing Master and PhD theses on ODBMS.ORG.

    Published Master and PhD theses are available for free download (as .pdf) to all visitors of ODBMS.ORG (50,000+ visitors/month).

    Copyright of the Master and PhD theses remains with the authors.

    The process of submission is quite simple. Please send (any time) by email to: editor AT odbms.org

    1) a .pdf of your work

    2) the filled in template below:
    ___________________________
    Title of the work:
    Language (English preferable):
    Author:
    Affiliation:
    Short Abstract (max 2-3 sentences of text):
    Type of work (PhD, Master):
    Area (see classification below):
    No of Pages:
    Year of completion:
    Name of supervisor/affiliation:

    ________________________________

    To qualify for publication in ODBMS.ORG, the thesis should have been completed and accepted by the respective University/Research Center in 2011 or later, and it should be addressing one or more of the following areas:

    > Big Data: Analytics, Storage Platforms
    > Cloud Data Stores
    > Entity Framework (EF)
    > Graphs and Data Stores
    > In-Memory Databases
    > Object Databases
    > NewSQL Data Stores
    > NoSQL Data Stores
    > Object-Relational Technology
    > Relational Databases: Benchmarking, Data Modeling

    For any questions, please do not hesitate to contact me.

    Hope this helps.

    Best Regards

    Roberto V. Zicari
    Editor
    ODBMS.ORG
    ODBMS Industry Watch Blog

    ##

    In-memory database systems. Interview with Steve Graves, McObject. http://www.odbms.org/blog/2012/03/in-memory-database-systems-interview-with-steve-graves-mcobject/ http://www.odbms.org/blog/2012/03/in-memory-database-systems-interview-with-steve-graves-mcobject/#comments Fri, 16 Mar 2012 07:43:44 +0000 http://www.odbms.org/blog/?p=1371 “Application types that benefit from an in-memory database system are those for which eliminating latency is a key design goal, and those that run on systems that simply have no persistent storage, like network routers and low-end set-top boxes” — Steve Graves.

    On the topic of in-memory database systems, I interviewed one of our experts, Steve Graves, co-founder and CEO of McObject.

    RVZ

    Q1. What is an in-memory database system (IMDS)?

    Steve Graves: An in-memory database system (IMDS) is a database management system (DBMS) that uses main memory as its primary storage medium.
    A “pure” in-memory database system is one that requires no disk or file I/O, whatsoever.
    In contrast, a conventional DBMS is designed around the assumption that records will ultimately be written to persistent storage (usually hard disk or flash memory).
    Obviously, disk or flash I/O is expensive in performance terms; retrieving data from RAM is much faster than fetching it from disk or flash, so IMDSs are very fast.
    An IMDS also offers a more streamlined design. Because it is not built around the assumption of storage on hard disk or flash memory, the IMDS can eliminate the various DBMS sub-systems required for persistent storage, including cache management, file management and others. For this reason, an in-memory database is also faster than a conventional database that is either fully-cached or stored on a RAM-disk.

    In other areas (not related to persistent storage) an IMDS can offer the same features as a traditional DBMS. These include SQL and/or native language (C/C++, Java, C#, etc.) programming interfaces; formal data definition language (DDL) and database schemas; support for relational, object-oriented, network or combination data designs; transaction logging; database indexes; client/server or in-process system architectures; security features, etc. The list could go on and on. In-memory database systems are a sub-category of DBMSs, and should be able to do everything that entails.

    Q2. What are the significant differences between an in-memory database and a database that happens to be in memory (e.g. deployed on a RAM-disk)?

    Steve Graves: We use the comparison to illustrate IMDSs’ contribution to performance beyond the obvious elimination of disk I/O. If IMDSs’ sole benefit stemmed from getting rid of physical I/O, then we could get the same performance by deploying a traditional DBMS entirely in memory – for example, using a RAM-disk in place of a hard drive.

    We tested an application performing the same tasks with three storage scenarios: using an on-disk DBMS with a hard drive; the same on-disk DBMS with a RAM-disk; and an IMDS (McObject’s eXtremeDB). Moving the on-disk database to a RAM drive resulted in nearly 4x improvement in database reads, and more than 3x improvement in writes. But the IMDS (using main memory for storage) outperformed the RAM-disk database by 4x for reads and 420x for writes.

    Clearly, factors other than eliminating disk I/O contribute to the IMDS’s performance – otherwise, the DBMS-on-RAM-disk would have matched it. The explanation is that even when using a RAM-disk, the traditional DBMS is still performing many persistent storage-related tasks.
    For example, it is still managing a database cache – even though the cache is now entirely redundant, because the data is already in RAM. And the DBMS on a RAM-disk is transferring data to and from various locations, such as a file system, the file system cache, the database cache and the client application, compared to an IMDS, which stores data in main memory and transfers it only to the application. These sources of processing overhead are hard-wired into on-disk DBMS design, and persist even when the DBMS uses a RAM-disk.

    An in-memory database system also uses the storage space (memory) more efficiently.
    A conventional DBMS can use extra storage space in a trade-off to minimize disk I/O (the assumption being that disk I/O is expensive, and storage space is abundant, so it’s a reasonable trade-off). Conversely, an IMDS needs to maximize storage efficiency because memory is not abundant in the way that disk space is. So a 10 gigabyte traditional database might only be 2 gigabytes when stored in an in-memory database.

    Q3. What is in your opinion the current status of the in-memory database technology market?

    Steve Graves: The best word for the IMDS market right now is “confusing.” “In-memory database” has become a hot buzzword, with seemingly every DBMS vendor now claiming to have one. Often these purported IMDSs are simply the providers’ existing disk-based DBMS products, which have been tweaked to keep all records in memory – and they more closely resemble a 100% cached database (or a DBMS that is using a RAM-disk for storage) than a true IMDS. The underlying design of these products has not changed, and they are still burdened with DBMS overhead such as caching, data transfer, etc. (McObject has published a white paper, Will the Real IMDS Please Stand Up?, about this proliferation of claims to IMDS status.)

    Only a handful of vendors offer IMDSs that are built from scratch as in-memory databases. If you consider these to comprise the in-memory database technology market, then the status of the market is mature. The products are stable, have existed for a decade or more and are deployed in a variety of real-time software applications, ranging from embedded systems to real-time enterprise systems.

    Q4. What are the application types that benefit the use of an in-memory database system?

    Steve Graves: Application types that benefit from an IMDS are those for which eliminating latency is a key design goal, and those that run on systems that simply have no persistent storage, like network routers and low-end set-top boxes. Sometimes these types overlap, as in the case of a network router that needs to be fast, and has no persistent storage. Embedded systems often fall into the latter category, in fields such as telco and networking gear, avionics, industrial control, consumer electronics, and medical technology. What we call the real-time enterprise sector is represented in the first category, encompassing uses such as analytics, capital markets (algorithmic trading, order matching engines, etc.), real-time cache for e-commerce and other Web-based systems, and more.

    Software that must run with minimal hardware resources (RAM and CPU) can also benefit.
    As discussed above, IMDSs eliminate sub-systems that are part-and-parcel of on-disk DBMS processing. This streamlined design results in a smaller database system code size and reduced demand for CPU cycles. When it comes to hardware, IMDSs can “do more with less.” This means that the manufacturer of, say, a set-top box that requires a database system for its electronic programming guide, may be able to use a less powerful CPU and/or less memory in each box when it opts for an IMDS instead of an on-disk DBMS. These manufacturing cost savings are particularly desirable in embedded systems products targeting the mass market.

    Q5. McObject offers an in-memory database system called eXtremeDB, and an open source embedded DBMS, called Perst. What is the difference between the two? Is there any synergy between the two products?

    Steve Graves: Perst is an object-oriented embedded database system.
    It is open source and available in Java (including Java ME) and C# (.NET) editions. The design goal for Perst is to provide as nearly transparent persistence for Java and C# objects as practically possible within the normal Java and .NET frameworks. In other words, no special tools, byte codes, or virtual machine are needed. Perst should provide persistence to Java and C# objects while changing the way a programmer uses those objects as little as possible.
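
    As a rough illustration of that goal (a sketch, not from the interview; the file name, page pool size and Account class are made up), a minimal Perst program persists an ordinary Java object by making it reachable from the storage root:

    import org.garret.perst.Persistent;
    import org.garret.perst.Storage;
    import org.garret.perst.StorageFactory;

    public class PerstExample {

        // Extending Persistent is essentially the only intrusion into the object model.
        static class Account extends Persistent {
            String owner;
            long balance;
        }

        public static void main(String[] args) {
            Storage storage = StorageFactory.getInstance().createStorage();
            storage.open("accounts.dbs", 32 * 1024 * 1024); // file name and page pool size are illustrative

            Account root = (Account) storage.getRoot();
            if (root == null) {
                root = new Account();
                root.owner = "demo";
                storage.setRoot(root); // objects reachable from the root become persistent
            }
            root.balance += 100;
            root.store();    // mark the modified object so the change is written
            storage.commit();
            storage.close();
        }
    }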

    eXtremeDB is not an object-oriented database system, though it does have attributes that give it an object-oriented “flavor.” The design goals of eXtremeDB were to provide a full-featured, in-memory DBMS that could be used right across the computing spectrum: from resource-constrained embedded systems to high-end servers used in systems that strive to squeeze out every possible microsecond of latency. McObject’s eXtremeDB in-memory database system product family has features including support for multiple APIs (SQL ODBC/JDBC & native C/C++, Java and C#), varied database indexes (hash, B-tree, R-tree, KD-tree, and Patricia Trie), ACID transactions, multi-user concurrency (via both locking and “optimistic” transaction managers), and more. The core technology is embodied in the eXtremeDB IMDS edition. The product family includes specialized editions, built on this core IMDS, with capabilities including clustering, high availability, transaction logging, hybrid (in-memory and on-disk) storage, 64-bit support, and even kernel mode deployment. eXtremeDB is not open source, although McObject does license the source code.

    The two products do not overlap. There is no shared code, and there is no mechanism for them to share or exchange data. Perst for Java is written in Java, Perst for .NET is written in C#, and eXtremeDB is written in C, with optional APIs for Java and .NET. Perst is a candidate for Java and .NET developers that want an object-oriented embedded database system, have no need for the more advanced features of eXtremeDB, do not need to access their database from C/C++ or from multiple programming languages (a Perst database is compatible with Java or C#), and/or prefer the open source model. Perst has been popular for smartphone apps, thanks to its small footprint and smart engineering that enables Perst to run on mobile platforms such as Windows Phone 7 and Java ME.
    eXtremeDB will be a candidate when eliminating latency is a key concern (Perst is quite fast, but not positioned for real-time applications), when the target system doesn’t have a JVM (or sufficient resources for one), when the system needs to support multiple programming languages, and/or when any of eXtremeDB’s advanced features are required.

    Q6. What are the current main technological developments for in-memory database systems?

    Steve Graves: At McObject, we’re excited about the potential of IMDS technology to scale horizontally, across multiple hardware nodes, to deliver greater scalability and fault-tolerance while enabling more cost-effective system expansion through the use of low-cost (i.e. “commodity”) servers. This enthusiasm is embodied in our new eXtremeDB Cluster edition, which manages data stores across distributed nodes. Among eXtremeDB Cluster’s advantages is that it eliminates any performance ceiling from being CPU-bound on a single server.

    Scaling across multiple hardware nodes is receiving a lot of attention these days with the emergence of NoSQL solutions. But database system clustering actually has much deeper roots. One of the application areas where it is used most widely is in telecommunications and networking infrastructure, where eXtremeDB has always been a strong player. And many emerging application categories – ranging from software-as-a-service (SaaS) platforms to e-commerce and social networking applications – can benefit from a technology that marries IMDSs’ performance and “real” DBMS features with a distributed system model.

    Q7. What are the similarities and differences between current various database clustering solutions? In particular, let’s look at dimensions such as scalability, ACID vs. CAP, intended/applicable problem domains, structured vs. unstructured, and complexity of implementation.

    Steve Graves: ACID support vs. “eventual consistency” is a good place to start looking at the differences between clustering database solutions (including some cluster-like NoSQL products). ACID-compliant transactions will be Atomic, Consistent, Isolated and Durable; consistency implies the transaction will bring the database from one valid state to another and that every process will have a consistent view of the database. ACID-compliance enables an on-line bookstore to ensure that a purchase transaction updates the Customers, Orders and Inventory tables of its DBMS. All other things being equal, this is desirable: updating Customers and Orders while failing to change Inventory could potentially result in other orders being taken for items that are no longer available.

    However, enforcing the ACID properties becomes more of a challenge with distributed solutions, such as database clusters, because the node initiating a transaction has to wait for acknowledgement from the other nodes that the transaction can be successfully committed (i.e. there are no conflicts with concurrent transactions on other nodes). To speed up transactions, some solutions have relaxed their enforcement of these rules in favor of an “eventual consistency” that allows portions of the database (typically on different nodes) to become temporarily out-of-synch (inconsistent).

    Systems embracing eventual consistency will be able to scale horizontally better than ACID solutions – it boils down to their asynchronous rather than synchronous nature.

    Eventual consistency is, obviously, a weaker consistency model, and implies some process for resolving consistency problems that will arise when multiple asynchronous transactions give rise to conflicts. Resolving such conflicts increases complexity.

    Another area where clustering solutions differ is along the lines of shared-nothing vs. shared-everything approaches. In a shared-nothing cluster, each node has its own set of data.
    In a shared-everything cluster, each node works on a common copy of database tables and rows, usually stored in a fast storage area network (SAN). Shared-nothing architecture is naturally more complex: if the data in such a system is partitioned (each node has only a subset of the data) and a query requests data that “lives” on another node, there must be code to locate and fetch it. If the data is not partitioned (each node has its own copy) then there must be code to replicate changes to all nodes when any node commits a transaction that modifies data.

    NoSQL solutions emerged in the past several years to address challenges that occur when scaling the traditional RDBMS. To achieve scale, these solutions generally embrace eventual consistency (thus validating the CAP Theorem, which holds that a system cannot simultaneously provide Consistency, Availability and Partition tolerance). And this choice defines the intended/applicable problem domains. Specifically, it eliminates systems that must have consistency. However, many systems don’t have this strict consistency requirement – an on-line retailer such as the bookstore mentioned above may accept the occasional order for a non-existent inventory item as a small price to pay for being able to meet its scalability goals. Conversely, transaction processing systems typically demand absolute consistency.

    NoSQL is often described as a better choice for so-called unstructured data. Whereas RDBMSs have a data definition language that describes a database schema and becomes recorded in a database dictionary, NoSQL databases are often schema-less, storing opaque “documents” that are keyed by one or more attributes for subsequent retrieval. Proponents argue that schema-less solutions free us from the rigidity imposed by the relational model and make it easier to adapt to real-world changes. Opponents argue that schema-less systems are for lazy programmers, create a maintenance nightmare, and that there is no equivalent to relational calculus or the ANSI standard for SQL. But the entire structured or unstructured discussion is tangential to database cluster solutions.

    Q8. Are in-memory database systems an alternative to classical disk-based relational database systems?

    Steve Graves: In-memory database systems are an ideal alternative to disk-based DBMSs when performance and efficiency are priorities. However, this explanation is a bit fuzzy, because what programmer would not claim speed and efficiency as goals? To nail down the answer, it’s useful to ask, “When is an IMDS not an alternative to a disk-based database system?”

    Volatility is pointed to as a weak point for IMDSs. If someone pulls the plug on a system, all the data in memory can be lost. In some cases, this is not a terrible outcome. For example, if a set-top box programming guide database goes down, it will be re-provisioned from the satellite transponder or cable head-end. In cases where volatility is more of a problem, IMDSs can mitigate the risk. For example, an IMDS can incorporate transaction logging to provide recoverability. In fact, transaction logging is unavoidable with some products, such as Oracle’s TimesTen (it is optional in eXtremeDB). Database clustering and other distributed approaches (such as master/slave replication) contribute to database durability, as does use of non-volatile RAM (NVRAM, or battery-backed RAM) as storage instead of standard DRAM. Hybrid IMDS technology enables the developer to specify persistent storage for selected record types (presumably those for which the “pain” of loss is highest) while all other records are managed in memory.

    However, all of these strategies require some effort to plan and implement. The easiest way to reduce volatility is to use a database system that implements persistent storage for all records by default – and that’s a traditional DBMS. So, the IMDS use-case occurs when the need to eliminate latency outweighs the risk of data loss or the cost of the effort to mitigate volatility.

    It is also the case that flash and, especially, spinning memory are much less expensive than DRAM, which puts an economic lid on very large in-memory databases for all but the richest users. And, riches notwithstanding, it is not yet possible to build a system with hundreds of terabytes, let alone petabytes or exabytes, of memory, whereas spinning memory has no such limitation.

    By continuing to use traditional databases for most applications, developers and end-users are signaling that DBMSs’ built-in persistence is worth its cost in latency. But the growing role of IMDSs in real-time technology ranging from financial trading to e-commerce, avionics, telecom/Netcom, analytics, industrial control and more shows that the need for speed and efficiency often outweighs the convenience of a traditional DBMS.

    ———–
    Steve Graves is co-founder and CEO of McObject, a company specializing in embedded Database Management System (DBMS) software. Prior to McObject, Steve was president and chairman of Centura Solutions Corporation and vice president of worldwide consulting for Centura Software Corporation.

    Related Posts

    A super-set of MySQL for Big Data. Interview with John Busch, Schooner.

    Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB.

    On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software.

    vFabric SQLFire: Better then RDBMS and NoSQL?

    Related Resources

    ODBMS.ORG: Free Downloads and Links:
    Object Databases
    NoSQL Data Stores
    Graphs and Data Stores
    Cloud Data Stores
    Object-Oriented Programming
    Entity Framework (EF) Resources
    ORM Technology
    Object-Relational Impedance Mismatch
    Databases in general
    Big Data and Analytical Data Platforms

    ##

    Data Modeling for Analytical Data Warehouses. Interview with Michael Blaha. http://www.odbms.org/blog/2012/03/data-modeling-for-analytical-data-warehouses-interview-with-michael-blaha/ http://www.odbms.org/blog/2012/03/data-modeling-for-analytical-data-warehouses-interview-with-michael-blaha/#comments Sat, 03 Mar 2012 11:43:20 +0000 http://www.odbms.org/blog/?p=1359 “Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled as well as logged for audits” — Michael Blaha.

    This is the third interview with our expert Dr. Michael Blaha on the topic of database modeling. This time we look at the issue of data design for Analytical Data Warehouses.

    In previous interviews we looked at how good UML is for database design, and how good Use Cases are for database modeling.

    Hope you’ll find this interview interesting. I encourage the community to post comments.

    RVZ

    Q1: What is the difference between data warehouses and day-to-day business applications?

    Michael Blaha: Operational (day-to-day business) applications serve the routine needs of a business handling orders, scheduling manufacturing runs, servicing patients, and generating financial statements.

    Operational applications have many short transactions that must process quickly. The transactions both read and write.
    Well-written applications pay attention to data quality, striving to ensure correct data and avoid errors.

    In contrast analytical (data warehouse) applications step back from the business routine and analyze data that accumulates over time. The idea is to gain insight into business patterns that are overlooked when responding to routine needs. Data warehouse queries can have a lengthy execution time as they process reams of data, searching for underlying patterns.

    End users read from a data warehouse, but they don’t write to it. Rather, writing occurs as the operational applications supply new data that is added to the data warehouse.

    Q2: How do you approach data modeling for data warehouse problems?

    Michael Blaha: For operational applications, I use the UML class model for conceptual data modeling. (I often use Enterprise Architect.) The notation is more succinct than conventional database notations and promotes abstract thinking.
    In addition, the UML class model is understandable for business customers as it defers database design details. And, of course, the UML reaches out to the programming side of development.

    In contrast, for analytical applications, I go straight to a database notation. (I often use ERwin.) Data warehouses revolve around facts and dimensions. The structure of a data warehouse model is so straightforward (unlike the model of an operational application) that a database notation alone suffices.

    For a business user, the UML model and the conventional data model look much the same for a data warehouse.
    The programmers of a data warehouse (the ETL developers) are accustomed to database notations (unlike the developers in day-to-day applications).

    As an aside, I note that in a past book (A Manager’s Guide to Database Technology) I used a UML class model for analytical modeling. In retrospect I now realize that was a forced fit. The class model does not deliver any benefits for data warehouses and it’s an unfamiliar technology for data warehouse developers, so there’s no point in using it there.

    Q3: Is there any synergy between non-relational databases (NoSQL, Object Databases) and data warehouses?

    Michael Blaha: Not for conventional data warehouses that are set-oriented. Mass quantities of data must be processed in bulk. Set-oriented data processing is a strength of relational databases and the SQL language. Furthermore, tables are a good metaphor for facts and dimensions and the data is intrinsically strongly typed.

    NoSQL (Hadoop) is being used for mining Web data. Web data is by its nature unstructured and much different from conventional data warehouses.

    Q4: How do data warehouses achieve fast performance?

    Michael Blaha: The primary technique is pre-computation, by anticipating the need for aggregate data and computing it in advance. Indexing is also important for data warehouses, but less important than with operational applications.

    Q5: What are some difficult issues with data warehouses?

    Michael Blaha:
    Abstraction. Abstraction is needed to devise the proper facts and dimensions. It is always difficult to perform abstraction.

    Conformed dimensions. A large data warehouse schema must be flexible for mining. This can only be achieved if data is on the same basis. Therefore there is a need for conformed dimensions.
    For example, there must be a single definition of Customer that is used throughout the warehouse.

    Size. The sheer size of schema and data is a challenge.

    Data cleansing. Many operational applications are old legacy code. Often their data is flawed and may need to be corrected for a data warehouse.

    Data integration. Many data warehouses combine data from multiple applications. The application data overlaps and must be reconciled.

    Security. Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled as well as logged for audits.

    Q6: What kind of metadata is associated with a data warehouse and is there a role for Object Databases with this?

    Michael Blaha: Maybe. Data warehouse metadata includes source-to-target mappings, definitions (of facts, dimensions, and attributes), as well as the organization of the data warehouse into subject areas. The metadata for a data warehouse is just like that of operational applications. The metadata has to be custom modeled and doesn’t have a standard metaphor for structure like the facts and dimensions of a data warehouse. Relational databases, OO databases, and possibly other kinds of databases are all reasonable candidates.

    Q7. In a recent interview, Florian Waas, EMC/Greenplum, said “in the Big Data era the old paradigm of shipping data to the application isn’t working any more. Rather, the application logic must come to the data or else things will break: this is counter to conventional wisdom and the established notion of strata within the database stack. Instead of stand-alone products for ETL, BI/reporting and analytics we have to think about seamless integration: in what ways can we open up a data processing platform to enable applications to get closer? What language interfaces, but also what resource management facilities can we offer? And so on.”
    What is your view on this?

    Michael Blaha: It’s well known that to get good performance from relational database applications, stored procedures must be used. Stored procedures are logic that runs inside the database kernel. Stored procedures circumvent much of the overhead that is incurred by shuttling back and forth between an application process and the database process. So the stored procedure experience is certainly consistent with this comment.

    What I try to do in practice is think in terms of objects. Relational database tables and stored procedures are analogous to objects with methods. I put core functionality that is likely to be reusable and computation intensive into stored procedures. I put lightweight functionality and functionality that is peculiar to an application outside the database kernel.

    Q8. Hadoop is the system of choice for Big Data and Analytics. How do you approach data modeling in this case?

    Michael Blaha: I have no experience with Hadoop. My projects have involved structured data. In contrast Hadoop is architected for unstructured data as is often found on the Web.

    Q9. A lot of insights are contained in unstructured or semi-structured data from Big Data applications. Does it make any sense to do data modeling in this case?

    Michael Blaha: I have no experience with unstructured data. I have some experience with semi-structured data (XML / XSD). I routinely practice data modeling for XSD files. I published a paper in 2010 lamenting the fact that so
    many SOA projects concern storage and retrieval of data and completely lack a data model. I’ve been working on modeling approaches for XSD files, but have not yet devised a solution to my satisfaction.

    ————————————————–
    Michael Blaha is a partner at Modelsoft Consulting Corporation.
    Dr. Blaha is recognized as one of the world’s leading authorities on databases and data modeling. He has more than 25 years of experience as a consultant and trainer in conceiving, architecting, modeling, designing, and tuning databases for dozens of major organizations around the world. He has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New York.

    Related Posts

    Use Cases and Database Modeling — An interview with Michael Blaha.

    – How good is UML for Database Design? Interview with Michael Blaha.

    Resources

    ODBMS.org: Free Downloads and Links on various data management technologies:
    Object Databases
    NoSQL Data Stores
    Graphs and Data Stores
    Cloud Data Stores
    Object-Oriented Programming
    Entity Framework (EF) Resources
    ORM Technology
    Object-Relational Impedance Mismatch
    Databases in general
    Big Data and Analytical Data Platforms

    ##

    Use Cases and Database Modeling — An interview with Michael Blaha. http://www.odbms.org/blog/2012/01/use-cases-and-database-modeling-an-interview-with-michael-blaha/ http://www.odbms.org/blog/2012/01/use-cases-and-database-modeling-an-interview-with-michael-blaha/#comments Mon, 09 Jan 2012 08:58:13 +0000 http://www.odbms.org/blog/?p=1307 “Use cases are rote work. The developer listens to business experts and slavishly writes what they hear. There is little interpretation and no abstraction. There is little reconciliation of conflicting use cases. For a database project, the conceptual data model is a much more important software engineering contribution than use cases.” — Dr. Michael Blaha.

    First of all let me wish you a Happy, Healthy and Successful 2012!

    I am coming back to discuss with our expert Dr. Michael Blaha, the topic of Database Modeling. In a previous interview we looked at the issue of “How good is UML for Database Design”?

    Now, we look at Use Cases and discuss how good they are for Database Modeling. Hope you’ll find the interview interesting. I encourage the community to post comments.

    RVZ

    Q1. How are requirements taken into account when performing database modeling in daily practice? What are the common problems and pitfalls?

    Michael Blaha: Software development approaches vary widely. I’ve seen organizations use the following techniques for capturing requirements (listed in random order).

    — Preparation of use cases.
    — Preparation of requirements documents.
    — Representation and explanation via a conceptual data model.
    — Representation and explanation via prototyping.
    — Haphazard approach. Just start writing code.

    General issues include
    — the amount of time required to capture requirements,
    — missing requirements (requirements that are never mentioned)
    — forgotten requirements (requirements that are mentioned but then forgotten)
    — bogus requirements (requirements that are not germane to the business needs or that needlessly reach into design)
    — incomplete understanding (requirements that are contradictory or misunderstood)

    Q2. What is a use case?

    Michael Blaha: A use case is a piece of functionality that a system provides to its users. A use case describes how a system interacts with outside actors.

    Q3. What are the advantages of use cases?

    Michael Blaha:
    — Use cases lead to written documentation of requirements.
    — They are intuitive to business specialists.
    — Use cases are easy for developers to understand.
    — They enable aspects of system functionality to be enumerated and managed.
    — They include error cases.
    — They let consulting shops bill many hours for low-skilled personnel.
    (This is a cynical view, but I believe this is a major reason for some of the current practice.)

    Q4. What are the disadvantages of use cases?

    Michael Blaha:
    — They are very time consuming. It takes much time to write them down. It takes much time to interview business experts (time that is often unavailable).

    — Use cases are just one aspect of requirements. Other aspects should also be considered, such as existing documentation and
    artifacts from related software. Many developers obsess on use cases and forget to look for other requirement sources.

    — Use cases are rote work. The developer listens to business experts and slavishly writes what they hear. There is little interpretation and no abstraction. There is little reconciliation of conflicting use cases.

    — I have yet to see benefit from use case diagramming. I have yet to see significant benefit from use case structuring.

    — In my opinion, use cases have been overhyped by marketeers.

    Q5. How are use cases typically used in practice for database projects?

    Michael Blaha: To capture requirements. It is OK to capture detailed requirements with use cases, but they should be
    subservient to the class model. The class model defines the domain of discourse that use cases can then reference.

    For database applications it is much inferior to start with use cases and afterwards construct a class model. Database applications, in particular, need a data approach and not a process approach.

    It is ironic that use cases have arisen from the object-oriented community. Note that OO programming languages define a class structure to which logic is attached. So it is odd that use cases put process first and defer attention to data structure.

Q6. A possible alternative approach to data modeling is to write use cases first, then identify the subsystems and components, and finally identify the database schema. Do you agree with this?

Michael Blaha: This is a popular approach, but no, I do not agree with it. I strongly disagree. For a database project, the conceptual data model is a much more important software engineering contribution than use cases.

    Only when the conceptual model is well understood can use cases be fully understood and reconciled. Only then can developers integrate use cases and abstract their content into a form suitable for building a quality software product.

Q7. Many requirements, and the design to satisfy those requirements, are normally implemented with programming, not just the schema. Do you agree with this? How do you handle this with use cases?

    Michael Blaha: Databases provide a powerful language, but most do not provide a complete language.

    The SQL language of relational databases is far from complete and some other language must be used to express full functionality.

OO databases are better in this regard. Since OO databases integrate a programming language with a persistence mechanism, they inherently offer a full language for expressing functionality.
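As a rough sketch of what this integration looks like (the Database interface and store() call below are hypothetical placeholders, not the API of eXtremeDB, Perst, or any other product), one language carries both the persistent structure and the logic attached to it:

// Hypothetical persistence interface; real OO databases expose their own APIs.
interface Database {
    void store(Object obj);   // assumed call that persists an object
}

class Account {
    double balance;

    // Behavior is attached directly to the persistent data structure,
    // so no separate host language is needed to express the functionality.
    void withdraw(double amount, Database db) {
        if (amount > balance) {
            throw new IllegalStateException("insufficient funds");
        }
        balance -= amount;
        db.store(this);   // persist the updated object in the same language
    }
}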

    Use cases target functionality and functionality alone. Use cases, by their nature, do not pay attention to data structure.

    Q8. Do you need to use UML for use cases?

Michael Blaha: No. The idea of use cases is valuable if used properly (in conjunction with data and normally subservient to data).

    In my opinion, UML use case diagrams are a waste of time. They don’t add clarity. They add bulk and consume time.

    Q9. Are there any suitable tools around to help the process of creating use cases for database design? If yes, how good are they?

    Michael Blaha: Well, it’s clear by now that I don’t think much of use case diagrams. I think a textual approach is OK and there are probably requirement tools to manage such text, but I am unfamiliar with the product space.

    Q10. Use case methods of design are usually applied to object-oriented models. Do you use use cases when working with an object database?

    Michael Blaha: I would argue not. Most object-oriented languages put data first. First develop the data structure and then attach methods to the structure. Use cases are the opposite of this. They put functionality first.

    Q11. Can you use use cases as a design method for relational databases, NoSQL databases, graph databases as well? And if yes how?

Michael Blaha: Not reasonably. I guess developers can force-fit any technique and try to claim success.

    To be realistic, traditional database developers (relational databases) are already resistant (for cultural reasons) to object-oriented jargon/style and the UML. When I show them that the UML class model is really just an ER model and fits in nicely with database conceptualization, they acknowledge my point, but it is still a foreign culture.
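To illustrate the correspondence (the Employee/Department example below is mine, not Dr. Blaha’s): a UML class maps to an entity type (a table), its attributes map to columns, and an association maps to a foreign key.

// A UML class model fragment; the comments show the ER/relational equivalent.
class Department {
    long id;             // DEPARTMENT(id, name)
    String name;
}

class Employee {
    long id;             // EMPLOYEE(id, name, department_id)
    String name;
    Department worksIn;  // UML association; becomes the DEPARTMENT_ID foreign key
}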

    I don’t see how use cases have much to offer for NoSQL and graph databases.

    Q12. So if you don’t have use cases, how do you address functionality when building database applications?

Michael Blaha: I strongly favor the technique of interactive conceptual data modeling. I get the various business and technical constituencies in the same room and construct a UML class model live in front of them as we explore their business needs and scope. Of course, the business people articulate their needs in terms of use cases. But the use cases are grounded by the evolving UML class model defining the domain of discourse. Normally I have my hands full managing the meeting and constructing a class model in front of them. I don’t have time to explicitly capture the use cases (though I am appreciative if someone else volunteers for that task).

However, I fully consider the use cases by playing them against the evolving model. Of course, as I consider use cases relative to the class model, I am reconciling them. I am also considering abstraction as I construct the class model, which in turn causes the business experts to do more abstraction in formulating their use case business requirements.

    I have built class models this way many times before and it works great. Some developers are shocked at how well it can work.

    ————————————————–
    Michael Blaha is a partner at Modelsoft Consulting Corporation.
    Dr. Blaha is recognized as one of the world’s leading authorities on databases and data modeling. He has more than 25 years of experience as a consultant and trainer in conceiving, architecting, modeling, designing, and tuning databases for dozens of major organizations around the world. He has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New York.

    Related Posts

    – How good is UML for Database Design? Interview with Michael Blaha.

– Agile data modeling and databases.

    – Why Patterns of Data Modeling?

    Related Resources

– ODBMS.org: Databases in General: Blog Posts | Free Software | Articles and Presentations | Lecture Notes | Journals

    ##
