ODBMS Industry Watch

Sep 19 12

Hadoop and NoSQL: Interview with J. Chris Anderson

by Roberto V. Zicari

“The missing piece of the Hadoop puzzle is accounting for real time changes. Hadoop can give powerful analysis, but it is fundamentally a batch-oriented paradigm.” — J. Chris Anderson.

How is Hadoop related to NoSQL databases? What are the main performance bottlenecks of NoSQL data stores? On these topics I did interview, J. Chris Anderson co-founder of Couchbase.

RVZ

Q1. In order to analyze Big Data, the current state of the art is a parallel database or NoSQL data store, with a Hadoop connector.
What about performance issues arising with the transfer of large amounts of data between the two systems? Can the use of connectors introduce delays, data silos, increase TCO?

Chris Anderson : The missing piece of the Hadoop puzzle is accounting for real time changes. Hadoop can give powerful analysis, but it is fundamentally a batch-oriented paradigm. Couchbase is designed for real time applications (with all the different trade-offs that implies) yet also provides query-ability, so you can see inside the data as it changes.

We are seeing interesting applications where Couchbase is used to enhance the batch-based Hadoop analysis with real time information, giving the effect of a continuous process.
So hot data lives in Couchbase, in RAM (even replicas in RAM for HA fast-failover). You wouldn’t want to keep 3 copies of your Hadoop data in RAM, that’d be crazy.
But it makes sense for your working set.

And this solves the data transfer costs issue you mention, because you essentially move the data out of Couchbase into Hadoop when it cools off.
That’s much easier than maintaining parallel stores, because you only have to copy data from Couchbase to Hadoop as it passes out of the working set.

For folks working on problems like this, we have a Sqoop connector and we’ll be talking about it with Cloudera at our CouchConf in San Francisco on September 21.

Q2. Wouldn’t a united/integrated platform (data store + Hadoop) be a better solution instead?

Chris Anderson : It would be nice to have a unified query language and developer experience (not to mention goodies like automatically pulling data back
out of Hadoop into Couchbase when it comes back into the working set). I think everyone recognizes this.

We’ll get there, but in my opinion the primary interface will be via the real time store, and the Hadoop layer will become a commodity. That is why there is so much competition for the NoSQL brass ring right now.

Q3. Could you please explain in your opinion the tradeoff between scaling out and scaling up? What does it mean in practice for an application
domain?

Chris Anderson : Scaling up is easier from a software perspective. It’s essentially the Moore’s Law approach to scaling — buy a bigger box. Well, eventually you run out of bigger boxes to buy, and then you’ve run off the edge of a cliff. You’ve got to pray Moore keeps up.

Scaling out means being able to add independent nodes to a system. This is the real business case for NoSQL. Instead of being hostage to Moore’s Law, you can grow as fast as your data. Another advantage to adding independent nodes is you have more options when it comes to matching your workload. You have more flexibility when you are running on commodity hardware — you can run on SSDs or high-compute instances, in the cloud, or inside your firewall.

Q4. James Phillips a year ago said that “it is possible we will see standards begin to emerge, both in on-the-wire protocols and perhaps in query languages, allowing interoperability between NoSQL database technologies similar to the kind of interoperability we’ve seen with SQL and relational database technology.” What is your take now?

Chris Anderson : That hasn’t changed but the industry is still young and everyone is heads-down on things like reliability and operability. Once these products become more mature there will be time to think about standardization.

Q5. There is a scarcity of benchmarks to substantiate the many claims made of scalabilty of NoSQL vendors. NoSQL data stores do not qualify for the TPC-C benchmark, since they relax ACID transaction properties. How can you then measure and compare the performance of the various NoSQL data stores instead?

Chris Anderson : I agree. Vendors are making a lot of claims about latency, throughput and scalability without much proof. There are a few benchmarks starting
to trickle out from various third parties. Cisco and SolarFlare published one on Couchbase (See here ) and are putting other vendors through the same tests. I know there will be other third party benchmarks comparing Couchbase, MongoDB, and Cassandra that will be coming out soon. I think the Yahoo YCSB benchmarks will be another source of good comparisons.
There are bigger differences between vendors than people are aware of and we think many people will be surprised by the results.

Q6. What are in your opinion the main performance bottlenecks for NoSQL data stores?

Chris Anderson : The three classes of bottleneck correspond to the major areas of hardware: network, disk, and memory. Couchbase has historically been very
fast at the network layer – it’s based on Memcached which has had a ton of optimizations for interacting with network hardware and protocols.
We’re essentially as fast as one can get in the network layer, and I believe most NoSQL databases that use persistent socket connections are also free of significant network bottlenecks. So the network is only a bottleneck for REST or other stateless connection models.

Disk is always going to be the slowest component as far as the inherent latencies of non-volatile storage, but any high-performance database will paper over this by using some form of memory caching or memory-based storage. Couchbase has been designed specifically to decouple the disk from the rest of the system. In the extreme, we’ve seen customers survive prolonged disk outages while maintaining availability, as our memory layer keeps on trucking, even when disks become unresponsive. (This was during the big Amazon EBS outage that left a lot of high-profile sites down due to database issues.)

Memory may be the most interesting bottleneck, because it is the source of non-determinism in performance. So if you are choosing a database for performance reasons you’ll want to take a look at how it’s memory layer is architected.
Is it decoupled from the disk? Is it free of large locks that can pause unrelated queries as the engine modifies in-memory data structures? Over time does it continue to perform, or does the memory layout become fragmented? These are all problems we’ve been working on for a long time at Couchbase, and we are pretty happy with where we stand.

Q7. Couchbase is the result of the merger (more then one year ago) of CouchOne(document store) and Membase (key-value store).How has your product offering changed since the merge happened?

Chris Anderson : Our product offering hasn’t changed a bit since the merger. The current GA product, Couchbase Server 1.8.1, is essentially a continuation of the Membase Server 1.7 product line. It is a key value database using the
binary-memcached interface. It’s in use all around the web, like Orbitz, LinkedIn, AOL, Zynga, and lots of other companies.

With our upcoming 2.0 release, we are expanding from a key value database to a document database. This means adding query and index support. We’re even including an Elastic Search adapter and experimental geographic indices. In addition, 2.0 adds cross-datacenter replication support so you can provide high-performance access to the data at multiple points-of-presence.

Q8. How do you implement concurrency in Couchbase?

Chris Anderson : Each node is inherently independent, and there are no special nodes, proxies, or gatekeepers. The client drivers running inside your application server connect directly to the data-node for a particular item, which gives low-latency but also greater concurrency, a given application server will be talking to multiple backend database nodes at any given time.

For the memory layer, we are based on memcached, which has a long history of concurrency optimizations. We support the full memcached feature set, so operations like CAS write, INCR and DECR are available, which is great for concurrent workloads. We’ve also added some extensions for locking, which facilitates reading and updating an object-graph that is spread across multiple keys.

At the disk layer, for Couchbase Server 2.0, we are moving away from SQLite, toward our own highly concurrent storage format. We’re append-only, so once data is written to disk, there’s no chance of corruption. The other advantage of tail-append writes is that you can do all kinds of concurrent reads of the file, even while writes are happening. For instance a backup can be as easy as `cp` or `rsync` (although we provide tools to manage backing up an entire cluster).

Q9. Couchbase does not support SQL queries, how do you query your database instead? What are the lessons learned so far from your users?

Chris Anderson : Our incremental indexing system is designed to be native to our JSON storage format. So the user writes JavaScript code to inspect the document, and pick out which data to use as the index key. It’s essentially putting the developer directly in touch with the index data structure, so while it sounds primitive, there is a ton of flexibility to be had there. We’ve got a killer optimization for aggregation operations, so if you’ve ever been burned by a slow GROUP BY query, you might want to take a look.

Despite all the power, we know users are also looking for more traditional query approaches. We’re working on a few things in this area.
The first one you will see is our Elastic Search integration, which will simplify querying your Couchbase cluster. Elastic Search provides a JSON-style query API, and we’ve already seen many of our users integrate it with Couchbase, so we are building an official adapter to better support this use case.

Q10 How do you handle both structured and unstructured data at scale?

Chris Anderson : At scale, all data is messy. I’ve seen databases in the 4th normal form accumulate messy errors, so a schema isn’t always a protection. At scale, it’s all about discovering the schema, not imposing it.

If you fill a Couchbase cluster with enough tweets, wall posts, and Instagram photos, you’ll start to see that even though these APIs all have different JSON formats, it’s not hard to pick out some things in common. With our flexible indexing system, we see users normalize data after the fact, so they can query heterogeneous data, and have it show up as a result set that is easy for the application to work with.

This fits with the overall model of a document database: rather than try to “model the world” with a relational schema, your aim becomes to “capture the user’s intent” and make sense of it later. When your goal is to scale up to tens of millions of users (in maybe only a few days), the priority becomes to capture the data and provide a compelling high-performance user experience.

Q11 Couchbase is sponsor of the Couchbase Server NoSQL database open source project. How do you ensure participation of the open source developers community and avoid incompatible versions of the system? Are you able to with an open source project to produce a highly performant system?

Chris Anderson : All our code is available under the Apache 2.0 license, and I’d wager that we’ve never heard of the majority of our users, much less asked them
for money. When someone’s business depends on Couchbase, that’s when they come to us, so I’m comfortable with the open source business model. It has some distinct advantages over the Oracle style license model.

The engineering side of me admits that we haven’t always been the best at engendering participation in the Couchbase open source project. A few months ago, if you tried to compile your own copy of Couchbase, e.g. to provide a patch or bug-fix, you’d be on a wild goose chase for days before you got to the fun part. We’ve fixed that now, but it’s worth noting that open-source friction hurts in more ways than one, as smoothing the path for new external contributors also means new engineering hires can get productive on our tool-chain faster.

So I’ve taken a personal interest in our approach to external contributions. The first step is cleaning up the on-ramps, as we already have decent docs for contributing, we just need to make them more prominent. The goal is to have a world-class open-source contributor experience, and we don’t take it lightly.

We do have a history of *very open development* which I am proud of. Not only can you see all the patches that make it into the code base, you can see all the patches that don’t make it through code review, and the comments on them as they are happening. Check out here for an example of how to do open development right.

Q12 How do you compare NewSQL databases, claming to offer both ACID-transaction properties and scalability, with Couchbase?

Chris Anderson : The CAP theorem is well-known these days, so I don’t need to go into the
reasons why NewSQL is an uphill battle. I see NewSQL technologies as primarily aimed at developers who don’t want to learn the lessons about data that the first decade of the web taught us. So there will likely be a significant revenue stream based on applications that need to be scaled up, without being re-written.

But we’ve asked our users what they like about NoSQL, and one of the biggest answers, as important to most of them as scalability and performance, was that they felt more productive with a JSON document model, than with rigid schemas. Requirements change quickly and being able to change your application without working through a DBA is a real
advantage. So I think NewSQL is really for legacy applications, and the future is with NoSQL document databases.

Q13 No single data store is best for all uses. What are the applications that are best suited for Couchbase, and which ones are not?

Chris Anderson : I got into Couch in the first place, because I see document databases as the 80% solution. It might take a few years, but I fully expect the
schema-free document model to be ascendant over the relational model, for most applications, most of the time. Of course there will still be uses for relational databases, but the majority of applications being written these days could work just fine on a document database, and as developer preferences change, we’ll see more and more serious applications running on NoSQL.

So from that angle, you can see that I think Couchbase is a broad solution.
We’ve seen great success in everything from content management systems to ad-targeting platforms, as well as simpler things like providing a high-performance data store for CRUD operations, or more enterprise-focused use cases like offloading query volume from mainframe applications.

A geekier way to answer your question, is to talk about the use cases where you pretty much don’t have a choice, it’s Couchbase or bust. So those use cases are anything where you need consistent sub-millisecond access latency, for instance maybe you have a time-based service level agreement.

If you need consistent high-performance while you are scaling your database cluster, that’s when you need Couchbase.
So for instance social-gaming has been a great vertical for us, since the hallmark of a successful game is that it may go from obscurity to a household name in just a few weeks. Too many of those games crash and burn when they are running on previous-generation technology.
It’s always been possible to build a backend that can handle internet-scale applications, what’s new with Couchbase is the ability to scale from a relatively small backend, to a backend handling hundreds of thousands of operations per second, at the drop of a hat.

Again, the use-cases where it’s Couchbase or bust are a subset of what our users are doing, but they are a great way to illustrate our priorities. We have plenty of users who don’t have high-performance requirements, but they still enjoy the flexibility of a document database.

If you need transactions across multiple data items, with atomic rollback, then you should be using a relational database. If you need foreign-key constraints, you should be using a relational database.

However, before you make that decision, you may want to ask if the performance tradeoffs are worth it for your application. Often there is a way to redesign something to make it less dependent on schemas, and the business benefits from the increased scale and performance you can get from NoSQL may make it a worthwhile tradeoff.

_________________________

Chris Anderson is a co-founder of Couchbase, focussed on the overall developer experience. In the early days of the company he served as CFO and President, but never strayed too far from writing code. His background includes open-source contributions to the Apache Software Foundation as well as many other projects. Before he wrote database engine code, he cut his teeth building applications and spidering the web. These days he gets a kick out of Node.js, playing bass guitar, and enjoying family time.

======

– Next generation Hadoop — interview with John Schroeder. by Roberto V. Zicari on September 7, 2012

–Integrating Enterprise Search with Analytics. Interview with Jonathan Ellis. by Roberto V. Zicari on April 16, 2012

Resources

ODBMS.org resources on:
– Big Data and Analytical Data Platforms: Blog Posts | Free Software | Articles | PhD and Master Thesis|

– NoSQL Data Stores: Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis

Sep 7 12

Next generation Hadoop — interview with John Schroeder.

by Roberto V. Zicari

“There are only a few Facebook-sized IT organizations that can have 60 Stanford PhDs on staff to run their Hadoop infrastructure. The others need it to be easier to develop Hadoop applications, deploy them and run them in a production environment.”— John Schroeder.

How easy is to use Hadoop? What are the next generation Hadoop distributions? On these topics, I did Interview John Schroeder, Cofounder and CEO of MapR.

RVZ

Q1. What is the value that Apache Hadoop provides as a Big Data analytics platform?

John Schroeder: Apache Hadoop is a software framework that supports data-intensive distributed applications. Apache Hadoop provides a new platform to analyze and process Big Data. With data growth exploding and new unstructured sources of data expanding a new approach is required to handle the volume, variety and velocity of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

Q2. Is scalability the only benefits of Apache Hadoop then?

John Schroeder: No, you can build applications that aren’t feasible using traditional data warehouse platforms.
The combination of scale, ability to process unstructured data along with the availability of machine learning algorithms and recommendation engines creates the opportunity to build new game changing applications.

Q3. What are the typical requirements of advanced Hadoop users as well as those new to Hadoop?

John Schroeder Advanced users of Hadoop are looking to go beyond batch uses of Hadoop to support real-time streaming of content. Advanced users also need multi-tenancy to balance production and decision support workloads on their shared Hadoop cloud.
New users need Hadoop to become easier. There are only a few Facebook-sized IT organizations that can have 60 Stanford PhDs on staff to run their Hadoop infrastructure. The others need it to be easier to develop Hadoop applications, deploy them and run them in a production environment.

Q4. Why this? Please give us some practical examples of applications.

John Schroeder: Product recommendations, ad placements, customer churn, patient outcome predictions, fraud detection and sentiment analysis are just a few examples that improve with real time information.
Organizations are also looking to expand Hadoop use cases to include business critical, secure applications that easily integrate with file-based applications and products. Requirements for data protection include snapshots to provide point-in-time recovery and mirroring for business continuity.
With mainstream adoption comes the need for tools that don’t require specialized skills and programmers. New Hadoop developments must be simple for users to operate and to get data in and out. This includes direct access with standard protocols using existing tools and applications.

Q5. What are in your opinion the core limitations that limit the adoption of Hadoop in the enterprise? How do you contribute in taking Big Data into mainstream?

John Schroeder: MapR has and continues to knock down the barriers to Hadoop adoption. Hadoop needed five 9’s availability and the ability to run in a ‘lights-out” datacenter so we transformed Hadoop into a reliable compute and dependable storage platform. Hadoop use cases were too narrow so we expanded access to Hadoop data for industry standard file-based processing.
The MapR Control Center makes it easy to administer, monitor and provision Hadoop applications.
We improved Hadoop economics by dramatically improving performance. We just released our multi-tenancy features that were key to our recently announced Amazon and Google partnership announcements.
Next on our roadmap is to continue expanding the use cases and additional progress moving Hadoop from batch to real-time.

Q6. What are the benefits of having an automated stateful failover?

John Schroeder: Automated stateful failover provides high availability and continuous operations for organizations.
Even with multiple hardware or software outages and errors applications will continue running without any administrator actions required.

Q7. What is special about MapR architecture? Please give us some technical detail.

John Schroeder: MapR rethought and rebuilt the entire internal infrastructure while maintaining complete compatibility with Hadoop APIs. MapR opened up Hadoop to programming, products and data access by providing a POSIX storage layer, NFS access, ODBC/JDBC and REST. The MapR strategy is to expand use cases appropriate for Hadoop while avoiding proprietary features that would result in vendor lock-in. We rebuilt the underlying storage services layer and eliminated the “append only” limitation of the Hadoop Distributed File System (HDFS). MapR writes directly to block devices eliminating the inefficiencies caused by layering HDFS on top of a Linux local file system. In other Hadoop implementations, these continually rob the entire system of performance efficiencies.
The storage layer rearchitecture also enabled us to implement snapshot and mirroring capabilities. The MapR Distributed NameNode eliminates the well-known file scalability limitation allowing us to optimize the Hadoop shuffle algorithm.
MapR provides transparent compression on the client making it easy to reduce data transmission over the network or on disk.
Finally MapR eliminated the impact of periodic Java garbage collection.
This kind of increase is simply impossible with current implementations because it is limited by the architecture itself.

Q8. There are several commercial Hadoop distribution companies on the market (e.g. Cloudera, Datameer, Greenplum, Hortonworks, Platform Computing to name a few). What is special about MapR?

John Schroeder: Datameer is not a Hadoop distribution provider they provide an analytic application and are a partner of MapR. Platform Computing is also not a Hadoop distribution provider. MapR is the only company to provide an enterprise-grade Hadoop distribution.

Q9. What do you mean with an enterprise-grade Hadoop distribution? Isnt` for example Cloudera (to name one) also an enterprise Hadoop distribution?

John Schroeder: There is no other alternative in the market for full HA, business continuity, real-time streaming, standard file-based access through NFS, full database access through ODBC, and support for mission-critical SLAs.

Q10. How do you support this claim? Did you do a market analysis? What other systems did you look at?

John Schroeder: We performed a complete review of available Hadoop distributions. The recent selections of MapR by Amazon as an integrated offering into their Amazon Elastic MapReduce service and by Google for their Google Compute Engine are further validations of MapR’s differentiated Hadoop offering.
With MapR users can mount a cluster using the standard NFS protocol. Applications can write directly into the cluster with no special clients or applications. Users can use every file-based application that has been developed over the last 20 years, ranging from BI applications to file browsers and command-line utilities such as grep or sort directly on data in the cluster. This dramatically simplifies development. Existing applications and workflows can be used and just the specific steps requiring parallel processing need to be converted to take advantage of the MapReduce framework.

MapR also delivers ease of data management. With clusters expanding well into the petabyte range simplifying how data is managed is critical. MapR uniquely supports volumes to make it easy to apply policies across directories and file contents without managing individual files.
These policies include data protection, retention, snapshots, and access privileges.

Additionally, MapR delivers business critical reliability. This includes full HA, business continuity and data protection features. In addition to replication, MapR includes snapshots and mirroring. Snapshots provide for point-in-time recovery. Mirroring provides backup to an alternate cluster, data center or between on-premise and private Hadoop clouds. These features provide a level of protection that is necessary for business critical uses.

Q11. What are the next generations of Hadoop distributions?

John Schroeder: The first generation of Hadoop surrounded the open source Apache project with services and management utilities. MapR is the next generation of Hadoop that combines standards-based innovations developed by MapR with open source components resulting in the industry’s only differentiated distribution that meets the needs of both the largest and emerging Hadoop installations.
MapR’s extensive engineering efforts have resulted in the first software distribution for Hadoop that provides extreme high performance, unprecedented scale, business continuity and is easy to deploy and manage.

Q12. Do you have any results to show us to support these claims “high performance, unprecedented scale”?

John Schroeder: We have many examples of high performance and scale. Google recently unveiled the Google Compute Engine with MapR on stage at the Google IO conference. We demonstrated a 1256 node cluster perform a Terasort in 1 minute and 20 seconds. One of our customers, comScore presented a session at the Hadoop Summit and showed how they process 30B internet events a day using MapR. As for scale differences we have a customer with 18 billion files in a single MapR cluster. By comparison, the largest clusters of other distributions max out around 200 million files.

Q13. What functionalities still need to be added to Hadoop to serve new business critical and real-time applications?

John Schroeder: Other Hadoop distributions present customers with several challenges including:
• Getting data in and out of Hadoop. Other Hadoop distributions are limited by the append-only nature of the Hadoop Distributed File System (HDFS) that requires programs to batch load and unload data into a cluster.
• Deploying Hadoop into mission critical business projects. The lack of reliability of current Hadoop software platforms is a major impediment for expansion.
• Protecting data against application and user errors. Hadoop has no backup and restore capabilities. Users have to contend with data loss or resort to very expensive solutions that reside outside the actual Hadoop cluster.
According to industry research firm, ESG, half of the companies they surveyed plan to leverage commercial distributions of Hadoop as opposed to the open source version. This trend indicates organizations are moving from experimental and pilot projects to mainstream applications with mission-critical requirements that include high availability, better performance, data protection, security, and ease of use.

Q14. There is work to be done training developers in learning advanced statistics and software (such as Hadoop) to ensure adoption in the Enterprise. Do you agree with this? What is your role here?

John Schroeder: Simply put the limitations of the Hadoop Distributed File System require whole scale changes to existing applications and extensive development of new ones. MapR’s next generation storage services layer provides full random/read support and provides direct access with NFS. This dramatically simplifies development. Existing applications and workflows can be used and just the specific steps requiring parallel processing need to be converted to take advantage of the MapReduce framework.

Q15. Are customers willing to share their private data?

John Schroeder: In general customers are concerned with the protection and security of their data. That said, we see growing adoption of Hadoop in the cloud. Amazon has a significant web-services business around Hadoop and recently added MapR as part of their Elastic MapReduce offering. Google has also announced the Google Compute Engine and integration with MapR.

Q16. Data quality from different sources is a problem. How do you handle this?

John Schroeder: Data quality issues can be similar to those in a traditional data warehouse. Scrubbing can be built into the Hadoop applications using algorithms similar to those used during ELT.
ETL and ELT can both accomplish data scrubbing. The storage/compute resources and ability to combine unlike datasets provide significant advantages to Hadoop-based ELT.

There are different views with respect to this issue. IT personnel that are used to traditional data warehouses are typically concerned with data quality and ETL processes. The advantage of Hadoop is that you can have disparate data from many different sources and different data types in the same cluster. Some advanced users have pointed out that “quality” issues are actually valuable information that can provide insight into issues, anomalies and opportunities. With Hadoop users have the flexibility to process and analyze. Analytics are not dependent on having a pre-defined schema.

Q17. Moving older data online. Is this a business opportunity for you?

John Schroeder: The advantage of Hadoop is performing compute on data. It makes much more sense to perform analytics directly on large data stores so you send only results over the network instead of dragging the entire data set over the network for processing. For this use case to be viable requires a highly reliable cluster with full data protection and business continuity features.

Q18. Yes, but what about big data that is not digitalized yet? This is what I meant with moving older data online.

John Schroeder: Most organizations are looking for a solution to help them cope with fast growing digital sources of machine generated content such as log files, sensor data, etc. Images, video and audio are also a fast growing data source that can provide valuation analytics.

_______________________
John Schroeder, Cofounder and CEO, MapR.

John has led companies creating innovative and disruptive business intelligence, database management, storage and virtualization technologies at early stage ventures through success as large public companies. John founded MapR to produce the next generation Hadoop distribution to expand the use cases beyond batch Hadoop to include real-time, business critical, secure applications that easily integrate with file-based applications and products.

– On Big Graph Data. on August 6, 2012

– Managing Big Data. An interview with David Gorbet on July 2, 2012

– Interview with Mike Stonebraker. on May 2, 2012

– Integrating Enterprise Search with Analytics. Interview with Jonathan Ellis. on April 16, 2012

– A super-set of MySQL for Big Data. Interview with John Busch, Schooner. on February 20, 2012

– On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum. on February 1, 2012

Resources

ODBMS.org Free Downloads and Links
In this section you can download free resources on Big Data and Analytical Data Platforms (Blog Posts | Free Software| Articles| PhD and Master Thesis)

Aug 27 12

On Impedance Mismatch. Interview with Reinhold Thurner

by Roberto V. Zicari

“Many enterprises sidestep applications with “Shadow IT” to solve planning, reporting and analysis problems” — Reinhold Thurner.

I am coming back to the topic of “Impedance Mismatch”.
I have interviewed one of our experts, Dr. Reinhold Thurner founder of Metasafe GmbH in Switzerland.

RVZ

Q1. In a recent interview José A. Blakeley and Rowan Miller of Microsoft, said that “the impedance mismatch problem has been significantly reduced, but not entirely eliminated”? Do you agree?

Thurner: Yes I agree, with some reservations and only for the special case for the impedance mismatch between a conceptual model, a relational database and an oo-program. However even an advanced ORM is not really a solution for the more general case of complex data which affects any (also non oo) programmer and especially also an end user.

Q2. Could you please explain better what you mean here?

Thurner: My reservations concern the tools and the development process: Several standalone tools (model-designer, mapper, code generator, schema-loader) are connected by intermediate files. Is is difficult if not impossible to develop a transparent model transformation which relieves the developer from the necessity to “think” on both levels – the original model and the transformed model – at the same time. The conceptual models can be “painted” easily, but they cannot be “executed” and tested with test data.
They are practically disjoint from the instance data. It takes a lot of discipline to avoid that changes in the data structures are directly applied to the final database with the consequence that the conceptual model is lost.
I rephrase from a document about ADO.net: “Most significant applications involve a conceptual design phase early in the application development lifecycle. Unfortunately, however, the conceptual data model is captured inside a database design tool that has little or no connection with the code and the relational schema used to implement the application. The database design diagrams created in the early phases of the application life cycle usually stay “pinned to a wall” growing increasingly disjoint from the reality of the application implementation with time.”

Q3. You are criticizing the process and the tools – what is the alternative?

Thurner: I compare this tool-architecture with the idea of an “integrated view of conceptual modeling, databases and CASE” (actually the title of one of your books). The basic ideas did exist already in the early 90es but were not realized because the means to implement a “CASE database” were missing: modeling concepts (OMG), languages (java, c#), frameworks (Eclipse), big cheap memory, powerful cpus, big screens etc. Today we are in a much better position and it is now feasible to create a data platform (i.e. a database for CASE) for tool integration. As José A. Blakeley argues, ‘(…) modern applications and data services need to target a higher-level conceptual model based on entities and relationships (…)’. A modern data platform is a prerequisite to supports such a concept?

Q4. Could you give us some examples of typical (impedance mismatch) problems still existing in the enterprise? How are they practically handled in the enterprise?

Thurner: As a consequence of the problems with the impedance mismatch some applications don’t use database technology at all or develop a thick layer of proprietary services which are in fact a sort of private DBMS.
Many enterprises sidestep applications with “Shadow
IT” to solve planning, reporting and analysis problems– i.e. Spreadsheets instead of databases, mail for multi-user access and data exchange, security by obscurity and a lot of manual copy and paste.
Another important area is development tools: Development tools deal with a large number of highly interconnected artifacts which must be managed in configurations and versions. These artifacts are still stored in files, libraries and some in relational databases with a thick layer on top. A proper repository would provide better services for a tool developer and helps to create products which are more flexible and easier to use.
Data management and information dictionary: IT-Governance (COBIT) stipulates that a company should maintain an “information dictionary” which contains all “information assets”, their logical structure, the physical implementations and the responsibilities (RACI-Charts, data steward). The common warehouse model (OMG) describes the model of the most common types of data stores – which is a good start: but companies with several DMBSs, hundreds of databases, servers and programs accessing thousands of tables and IMS-segments need a proper database to store the instances to make the “information model real”. Users of such a dictionary (designers, programmers, testers, integration services, operations, problem management etc.) need an easy to use query-language to access these data in an explorative manner.

Q5. If ORM technology cannot solve this kind of problem? What are the alternatives?

Thurner: The essence of ORM-technology is to create a bridge between the “existing eco-system of databases based on the relational model and the conceptual model”. The “relational model” is not the one and only possible approach to persist data. Data storage technology has moved up the ladder from sequential files to index-sequential, to multi-index, to codasyl, to hierarchical (IMS) and today’s market leader RDBMS. This is certainly not the end and the future seems to become very colorful. As Michael Stonebraker explains “In summary, there may be a substantial number of domain-specific database engines with differing capabilities off into the future. See his paper “One Size fits all – an Idea whose time has come and gone“.
ADO.net has been described as “a part of the broader Microsoft Data Access vision” and covers a specific subset of applications. Is the “other part” – the “executable conceptual model” which was mentioned by Peter Chen in a discussion with José Blakely about “The future of Database Systems”?
I am convinced that an executable conceptual model will play an important role for the aforementioned problems: A DMBS with an entity-relationship model implements the conceptual model without an impedance mismatch. To succeed it needs however all the properties José mentioned like queries, transactions, access-rights and integrated tools.

Q6. You started a company which developed a system called Metasafe-Repository. What is it?

Thurner: It started long ago with a first version developed in C, which was e.g. used in a reengineering project to manage a few hundred types, one million instances and about five million bidirectional relationships. In 2006 we decided to redevelop the system from scratch in java and the necessary tools with the Eclipse framework. We started with the basic elements – multi-level architecture based on the entity-relationship-model, integration of models and instance-data, ACID transactions, versioning and user access rights. During development the system morphed from the initial idea of a repository-service to a complete DBMS. We developed a set of entirely model driven tools – modeler, browser, import-/export utilities, Excel-interface, ADO-Driver for BIRT etc.
Metasafe has a multilevel structure: an internal metamodel, the global data model, external models as subsets (views) of the global data model and the instance data – in OMG-Terms it stretches from M3 to M0. All types (M2, M1) are described by meta-attributes as defined in the metamodel. User access rights to models and instance data are tied to the external models. Entity instances (M0) may exist in several manifestations (Catalog, Version, Variant). An extension of the data model e.g. by additional data types, entity types, relationship types or submodels can be defined using the metaModeler tool (or via the API by a program). From the moment the model changes are committed, the database is ready to accept instance data for the new types without unload/reload of the database.

Q7. Is the Metasafe repository the solution to the impedance mismatch problem?

Thurner: It is certainly a substantial step forward because we made the conceptual model and the database definition one and the same. We take the conceptual model literally by its word: If an ‘Order has Orderdetails’, we tell the database to create two entity types ‘Order’ and ‘Orderdetails’ and the relation ‘has’ between them. This way Metasafe implements an executable conceptual model with all the required properties of a real database management system: an open API, versioning, “declarative, model related queries and transactions” etc. Our own set of tools and especially the integration of BIRT (the eclipse Business Intelligence and Reporting Tool) demonstrate how it can be done. Our graphical erSQL query builder is even integrated into the BIRT designer. The erSQL queries are translated on the fly and BIRT accesses our database without any intermediate files.

Q8: What is the API of the Metasafe repository?

Thurner: Metasafe provides an object-oriented Java-API for all methods required to search, create, read, update, delete the elements of the database – i.e. schemas, user groups /users, entities, relationships and their attributes – both on type- and on instance-level. All the tools of Metasafe (modeler, browser, import/export, query builder etc.) are built with this public API. This approach has led to an elaborate set of methods to support an application programmer. The erSQL query-builder and also the query-translator and processor were implemented with this API. An erSQL query can be embedded in a java-program to retrieve a result-set (including its metadata) or to export the result-set.
In early versions we had a C#-version in parallel but we discontinued this branch when we started with the development of the tools based on Eclipse RCP. The reimplementation of the core in C# would be relatively easy. I think that also the tools could be reimplemented because they are entirely model-driven.

Q9 How does Metasafeˈs query language differ from the Microsoft Entity Framework built-in query capabilities (i.e. Language Integrated Query (LINQ)?

Thurner: It is difficult to compare because Metasafe’s ersql query-language was designed with respect to the special nature of an Entity-Relationship model with heavily cross linked information. So the erSQL query language maps directly to the conceptual model. Also “end users” can create queries with the graphical query builder with point and click on the graphical representation of the conceptual model to identify the path through the model and to collect the information of interest.

The queries are translated on the fly and processed by the query processor. The validation and translation of a query into a command structure of the query processor is a matter or milliseconds. The query processor returns result sets of metadata and typed instance data. The query result can also be exported as Excel-Table or as XML-file. In “read-mode” the result of each retrieval step (instance objects and their attributes) is returned to the invoking program instead of building the complete result set. A query represents a sort of “user” model and is also documented graphically. “End users” can easily create queries and retrieve data from the database. erSQL and the graphical query builder is fully integrated in BIRT to create reports on the fly.
The present version supports only information retrieval. We plan to extend it by a ” … for update” feature which locks all selected entity instances for further operation.
E.g. an update query for {an order and all its order items and products} would lock “the order” until backout or commit.

Q10. There are concerns about the performance and the overhead generated by ORM technology. Is performance an issue for Metasafe?

Thurner: Performance is always an issue when the number of concurrent users and the size and complexity of the data grow. The system works quite well for medium size systems with a few hundred types, a few million instances and a few GBs. The performance depends on the translation of the logical requests into physical access commands and on the execution of the physical access to the persistence. Metasafe uses a very limited functionality of an RDBMS (currently SQLServer, Derby, Oracle) for persistence. Locking, transactions, multi-user management is handled by Metasafe; the locking tables are kept in memory. After a commit it writes all changes in one burst to the database. We could of course use an in-memory DBMS to gain performance. E.g. VoltDB with the direct transaction access could be integrated easily and would certainly lead to superior performance.
We have also another kind of performance in mind – the user performance. For many applications the number of milliseconds to execute a transaction are less important than the ability to quickly create or change a database and to create and launch queries in a matter of minutes. Metasafe is especially helpful for this kind of application.

Q11. What problems is Metasafe designed to solve?

Thurner: Metasafe is designed as a generic data platform for medium sized (XX GB) model-driven applications. The primary purpose is the support for applications with large, complex and volatile data structures as tools, models, catalogs or process managers etc. Metasafe could be used to replace some legacy repositories.
Metasafe is certainly the best data platform (repository) for the construction of an integrated development environment. Metasafe can also serve as DBMS for business applications.
We evaluate also the possibilities to use that Metasafe DBMS as data platform for portable devices as phones and tablet computers: This could be a real killer application for application developers.

Q12. How do you position Metasafe in the market?

Thurner: I had the vision of an entity relationship base database system as future data platform and decided to develop Metasafe to a really useful level without the pressure of the market (namely the first time users). Now we have our product on the necessary level of quality and we are planning the next steps. It could be the “open source approach” for a limited version or the integration into a larger organization.
We have a number of applications and POCs but we have no substantial customer base yet, which would require an adequate support and sales organization. But we have not the intension to convert a successful development setup into a mediocre service and sales organization. We are not under time pressure and are looking at a number of possibilities.

Q13. How can the developers community test your system?

Thurner: We provide an evaluation version upon request.

—————————-
Related Posts

– Do we still have an impedance mismatch problem? – Interview with José A. Blakeley and Rowan Miller. by Roberto V. Zicari on May 21, 2012

Resources

– “Implementing the Executable Conceptual Model (ECM)” (download as .pdf),
by Dr. Reinhold Thurner, Metasafe.

ODBMS.org Free Resources on:
– Entity Framework (EF) Resources
– ORM Technology
– Object-Relational Impedance Mismatch

Aug 15 12

On Eventual Consistency– An interview with Justin Sheehy.

by Roberto V. Zicari

“I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense” –Justin Sheehy.

On the subject of new data models and eventual consistency I did interview Justin Sheehy Chief Technology Officer, Basho Technologies.

RVZ

Q1. What are in your opinion the main differences and similarities of a key-value store (ala Dynamo), a document store (ala MongoDB), and an “extensible record” store (ala Big Table) when using them in practice?

Justin Sheehy: Describing the kv-store, doc store, and column family data models in general is not the same as describing specific systems like Dynamo, MongoDB, and BigTable. I’ll do the former here as I am guessing that is the intention of the question. Since the following couple of questions ask for differences, I’ll emphasize the similarity here.

All three of these data models have two major things in common: values stored in them are not rigidly structured, and are organized mainly by primary key. The details beyond those similarities, and how given systems expose those details, certainly vary. But the flexibility of semi-structured data and the efficiency of primary-key access generally apply to most such systems.

Q2. When is a key-value store particularly well suited and when is a a document store instead preferable? For which kind of applications and for what kind of data management requirements?

Justin Sheehy: The interesting issue with this question is that “document store” is not well-established as having a specific meaning. Certainly it seems to apply to both MongoDB and CouchDB, but those two systems have very different data access semantics. The closest definition I can come up with quickly that covers the prominent systems known as doc stores might be something like “a key-value store which also has queryability of some kind based on the content of the values.”

If we accept that definition then you can happily use a document store anywhere that a key-value store would work, but would find it most worthwhile when your querying needs are richer than simply primary key direct access.

Q3. What is Riak? A key-value store or a document store? What are the main features of the current version of Riak?

Justin Sheehy: Riak began as being called a key-value store before the current popularity of the term “document store” term, but it is certainly a document store by any reasonable definition that I know — such as the one I gave above. In addition to access by primary key, values in Riak can be queried by secondary key, range query, link walking, full text search, or map/reduce.

Riak has many features, but the core reasons that people come to Riak over other systems are Availability, Scalability, and Predictability. For people whose business demands extremely high availability, easy and linear scalability, or predictable performance over time, Riak is worth a look.

Q4. How do you achieve horizontal scalability? Do you use a “shared nothing” horizontal scaling – replicating and partitioning data over many servers?
What performance metrics do you have for that?

Justin Sheehy: We use a number of techniques to achieve horizontal scalability. Among them is consistent hashing, an approach invented at Akamai and successfully used by many distributed systems since then. This allows for constant time routing to replicas of data based on the hash of the data’s primary key.
Data is partitioned to servers in the cluster based on consistent hashing, and replicated to a configurable number of of those servers. By partitioning the data to many “virtual nodes” per host, growth is relatively easy as new hosts simply (and automatically) take over some of the virtual nodes that has previously owned by existing cluster hosts.
Yes, in terms of data location Riak is a “shared nothing” system.
One (of many) demonstrations of this scalability was performed by Joyent here.
That benchmark is approximately 2 years old, so various specific numbers are quite outdated, but the important lesson in it remains and is summed up by this graph late in this post.
It shows that as servers were added, the throughput (as well as the capacity) of the overall system increased linearly.

Q5. How do you handle updates if you do not support ACID transactions? For which applications this is sufficient, and when this is not?

Justin Sheehy: Riak takes more of the “BASE” approach, which has become accepted over the past several years as a sensible tradeoff for high-availability data systems. By allowing consistency guarantees to be a bit flexible during failure conditions, a Riak cluster is able to provide much more extreme availability guarantees than a strictly ACID system.

Q6. You said that Riak takes more of the “BASE” approach. Did you use the definition of eventual consistency by Werner Vogels?
Reproduced here: “Eventual consistency: The storage system guarantees that if no new updates are made to the object, eventually (after the inconsistency window closes) all accesses will return the last updated value”. You would not wish to have an “eventual consistency” update to your bank account. For which class of applications is eventual consistency a good system design choice?

Justin Sheehy: That definition of Eventual Consistency certainly does apply to Riak, yes.

I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense. Traditional accounting is done in an eventually-consistent way and if you send me a payment from your bank to mine then that transaction will be resolved in an eventually consistent way. That is, your bank account and mine will not have a jointly-atomic change in value, but instead yours will have a debit and mine will have a credit, each of which will be applied to our respective accounts.

This question contains a very commonly held misconception. The use of eventual consistency in well-designed systems does not lead to inconsistency. Instead, such systems may allow brief (but shortly resolved) discrepancies at precisely the moments when the other alternative would be to simply fail.

To rephrase your statement, you would not wish your bank to fail to accept a deposit due to an insistence on strict global consistency.

It is precisely the cases where you care about very high availability of a distributed system that eventual consistency might be a worthwhile tradeoff.

Q7. Why is Riak written in Erlang? What are the implications for the application developers of this choice?

Justin Sheehy: Erlang’s original design goals included making it easy to build systems with soft real-time guarantees and very robust fault-tolerance properties. That is perfectly aligned with our central goals with Riak, and so Erlang was a natural fit for us. Over the past few years, that choice has proven many times to have been a great choice with a huge payoff for Riak’s developers. Application developers using Riak are not required to care about this choice any more than they need to care what language PostgreSQL is written in. The implications for those developers are simply that the database they are using has very predictable performance and excellent resilience.

Q8. Riak is open source. How do you engage the open source community and how do you make sure that no inconsistent versions are generated?

Justin Sheehy: We engage the open source community everywhere that it exists. We do our development in the open on github, and have lively conversations with a wider community via email lists, IRC, Twitter, many in-person venues, and more.
Mark Phillips and others at Basho are dedicated full-time to ensuring that we continue to engage honestly and openly with developer communities, but all of us consider it an essential part of what we do. We do not try to prevent forks. Instead, we are part of the community in such a way that people generally want to contribute their changes back to the central repository. The only barrier we have to merging such code is about maintaining a standard of quality.

Q9. How do you optimize access to non-key attributes?

Justin Sheehy: Riak stores index content in addition to values, encoded by type and in sorted order on disk. A query by index certainly is more expensive than simply accessing a single value directly by key, as the indices are distributed around the cluster — but this also means that the size of the index is not constrained by a single host.

Q10. How do you optimize access to non-key attributes if you do not support indexes in Riak?

Justin Sheehy: We do support indexes in Riak.

Q11 How does Riak compare with a new generation of scalable relational systems (NewSQL)?

Justin Sheehy: The “NewSQL” term is, much like “NoSQL”, a marketing term that doesn’t usefully define a technical category. The primary argument made by NewSQL proponents is that some NoSQL systems have made unnecessary tradeoffs. I personally consider these NewSQL systems to be a part of the greater movement generally dubbed NoSQL despite the seemingly contradictory names, as the core of that movement has nothing to do with SQL — it is about escaping the architectural monoculture that has gripped the commercial database market for the past few decades. In terms of technical comparison, some systems placing themselves under the NewSQL banner are excellent at scalability and performance, but I know of none whose availability and predictability can rival Riak.

Q12 Pls give some examples of use cases where Riak is currently in use. Is Riak in use for analyzing Big Data as well?

Justin Sheehy: A few examples of companies relying on Riak in their business can be found here.
While Riak is primarily about highly-available systems with predictable low-latency performance, it does have analytical capabilities as well and many users make use of map/reduce and other such programming models in Riak. By most definitions of “Big Data”, many of Riak’s users certainly fall into that category.

Q Anything you wish to add?

Justin Sheehy: Thank you for your interest. We’re not done making Riak great!

_____________

Justin Sheehy
Chief Technology Officer, Basho Technologies

As Chief Technology Officer, Justin Sheehy directs Basho’s technical strategy, roadmap, and new research into storage and distributed systems.
Justin came to Basho from the MITRE Corporation, where as a principal scientist he managed large research projects for the U.S. Intelligence Community including such efforts as high assurance platforms, automated defensive cyber response, and cryptographic protocol analysis.
He was central to MITRE’s development of research for mission assurance against sophisticated threats, the flagship program of which successfully proposed and created methods for building resilient networks of web services.
Before working for MITRE, Justin worked at a series of technology companies including five years at Akamai Technologies, where he was a senior architect for systems infrastructure giving Justin a broad as well as deep background in distributed systems.
Justin was a key contributor to the technology that enabled fast growth of Akamai’s networks and services while allowing support costs to stay low. Justin performed both undergraduate and postgraduate studies in Computer Science at Northeastern University.

______________________
Related Posts

– On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software.

– On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com

Resources
ODBMS.org — Free Downloads and Links
In this section you can download resources covering the following topics:
Big Data and Analytical Data Platforms
Cloud Data Stores
Object Databases
NewSQL,
NoSQL Data Stores
Graphs and Data Stores
Object-Oriented Programming
Entity Framework (EF) Resources
ORM Technology
Object-Relational Impedance Mismatch
XML, RDF Data Stores,
RDBMS

Aug 6 12

On Big Graph Data.

by Roberto V. Zicari

” The ultimate goal is to ensure that the graph community is not hindered by vendor lock-in” –Marko A. Rodriguez.
“There are three components to scaling OLTP graph databases: effective edge compression, efficient vertex centric query support, and intelligent graph partitioning” — Matthias Broecheler.

Titan is a new distributed graph database available in alpha release. It is an open source Apache project maintained and funded by Aurelius. To learn more about it, I have interviewed Dr. Marko A. Rodriguez and Dr. Matthias Broecheler cofounders of Aurelius.

RVZ

Q1. What is Titan?

MATTHIAS: Titan is a highly scalable OLTP graph database system optimized for thousands of users concurrently accessing and updating one huge graph.

Q2. Who needs to handle graph-data and why?

MARKO: Much of today’s data is composed of a heterogeneous set of “things” (vertices) connected by a heterogeneous set of relationships (edges) — people, events, items, etc. related by knowing, attending, purchasing, etc. The property graph model leveraged by Titan espouses this world view. This world view is not new as the object-oriented community has a similar perspective on data.
However, graph-centric data aligns well with the numerous algorithms and statistical techniques developed in both the network science and graph theory communities.

Q3. What are the main technical challenges when storing and processing graphs?

MATTHIAS: At the interface level, Titan strives to strike a balance between simplicity, so that developers can think in terms of graphs and traversals without having to worry about the persistence and efficiency details. This is achieved by both using the Blueprint’s API and by extending it with methods that allow developers to give Titan “hints” about the graph data. Titan can then exploit these “hints” to ensure performance at scale.

Q4. Graphs are hard to scale. What are the key ideas that make it so that Titan scales? Do you have any performance metrics available?

MATTHIAS: There are three components to scaling OLTP graph databases: effective edge compression, efficient vertex centric query support, and intelligent graph partitioning.
Edge compression in Titan comprises various techniques for keeping the memory footprint of each edge as small as possible and storing all edge information in one consecutive block of memory for fast retrieval.
Vertex centric queries allow users to query for a specific set of edges by leveraging vertex centric indices and a query optimizer.
Graph data partitioning refers to distributing the graph across multiple machines such that frequently co accessed data is co-located. Graph partitioning is a (NP-) hard problem and this is an aspect of Titan where we will see most improvement in future releases.
The current alpha release focuses on balanced partitioning and multi-threaded parallel traversals for scale.

MARKO: To your question about performance metrics, Matthias and his colleague Dan LaRocque are currently working on a benchmark that will demonstrate Titan’s performance when tens of thousands of transactions are concurrently interacting with Titan. We plan to release this benchmark via the Aurelius blog.
[Edit: The benchmark is now available here. ]

Q5. What is the relationships of Titan with other open source projects you were previously involved with, such as TinkerPop? Is Titan open source?

MARKO: Titan is a free, open source Apache2 project maintained and funded by Aurelius . Aurelius (our graph consulting firm) developed Titan in order to meet the scalability requirements of a number of our clients.
In fact, Pearson is a primary supporter and early adopter of Titan. TinkerPop, on the other hand, is not directly funded by any company and as such, is an open source group developing graph-based tools that any graph database vendor can leverage.
With that said, Titan natively implements the Blueprint 2 API and is able to leverage the TinkerPop suite of technologies: Pipes, Gremlin, Frames, and Rexster.
We believe this demonstrates the power of the TinkerPop stack — if you are developing a graph persistence store, implement Blueprints and your store automatically gets a traversal language, an OGM (object-to-graph mapper) framework, and a RESTful server.

Q6. How is Titan addressing the problem of analyzing Big Data at scale?

MATTHIAS: Titan is an OLTP database that is optimized for many concurrent users running short transactions, e.g. graph updates or short traversals against one huge graph. Titan significantly simplifies the development of scalable graph applications such as Facebook, Twitter, and the like.
Interestingly enough, most of these large companies have built their own internal graph databases.
We hope Titan will allow organizations to not reinvent the wheel. In this way, companies can focus on the value their data adds, not on the “plumbing” needed to process that data.

MARKO: In order to support the type of global OLAP operations typified by the Big Data community, Aurelius will be providing a suite of technologies that will allow developers to make use of global graph algorithms. Faunus is a Hadoop-connector that implements a multi-relational path algebra developed by myself and Joshua Shinavier. This algebra allows users to derive smaller, “semantically rich” graphs that can then be effectively computed on within the memory confines of a single machine. Fulgora will be the in-memory processing engine. Currently, as Matthias has shown in prototype, Fulgora can store ~90 billion edges on a 64-Gig RAM machine for graphs with a natural, real-world topology. Titan, Faunus, and Fulgora form Aurelius’ OLAP story

Q7. How do you handle updates?

MATTHIAS: Updates are bundled in transactions which are executed against the underlying storage backend. Titan can be operated on multiple storage backends and currently supports Apache Cassandra, Apache HBase and Oracle BerkeleyDB.
The degree of transactional support and isolation depends on the chosen storage backend. For non-transactional storage backends Titan provides its own locking system and fine grained locking support to achieve consistency while maintaining scalability.

Q8. Do you offer support for declarative queries?

MARKO: Titan implements the Blueprints 2 API and as such, supports Gremlin as its query/traversal language. Gremlin is a data flow language for graphs whereby traversals are prescriptively described using path expressions.

MATTHIAS: With respects to a declarative query language, the TinkerPop teams is currently in the design process of a graph-centric language called “Troll.” We invite anybody interested in graph algorithms and graph processing to help in this effort.
We want to identify the key graph use cases and then build a language that addresses those most effectively. Note that this is happening in TinkerPop and any Blueprints-enabled graph database will ultimately be able to add “Troll” to their supported languages.

Q9. How does Titan compare with other commercial graph databases and RDF triple stores?

MARKO: As Matthias has articulated previously, Titan is optimized for thousands of concurrent users reading and writing to a single massive graph. Most popular graph databases on the market today are single machine databases and simply can’t handle the scale of data and number of concurrent users that Titan can support. However, because Titan is a Blueprints-enabled graph database, it provides that same perspective on graph data as other graph databases.
In terms of RDF quad/triple stores, the biggest obvious difference is the data model. RDF stores make use of a collection of triples composed of a subject, predicate, and object. There is no notion of key/value pairs associated with vertices and edges like Blueprints-based databases. When one wants to model edge weights, timestamps, etc., RDF becomes cumbersome. However, the RDF community has a rich collection of tools and standards that make working with RDF data easy and compatible across all RDF vendors.
For example, I have a deep appreciation for OpenRDF.
Similar to OpenRDF, TinkerPop hopes to make it easy for developers to migrate between various graph solutions whether they be graph databases, in-memory graph frameworks, Hadoop-based graph processing solutions, etc.
The ultimate goal is to ensure that the graph community is not hindered by vendor lock-in.

Q10. How does Titan compare with respect to NoSQL data stores and NewSQL databases?

MATTHIAS: Titan builds on top of the innovation at the persistence layer that we have seen in recent years in the NoSQL movement. At the lowest level, a graph database needs to store bits and bytes and therefore has to address the same issues around persistence, fault tolerance, replication, synchronization, etc. that NoSQL solutions are tackling.
Rather than reinventing the wheel, Titan is standing on the shoulders of giants by being able to utilize different NoSQL solutions for storage through an abstract storage interface. This allows Titan to cover all three sides of the CAP theorem triangle — please see here.

Q11. Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

MATTHIAS: We absolutely agree with Mike on this. The relational model is a way of looking at your data through tables and SQL is the language you use when you adopt this tabular view. There is nothing intrinsically inefficient about tables or relational algebra. But its important to note that the relational model is simply one way of looking at your data. We promote the graph data model which is the natural data representation for many applications where entities are highly connected with one another. Using a graph database for such applications will make developers significantly more productive and change the way one can derive value from their data.
——–

Dr. Marko A. Rodriguez is the founder of the graph consulting firm Aurelius. He has focused his academic and commercial career on the theoretical and applied aspects of graphs. Marko is a cofounder of TinkerPop and the primary developer of the Gremlin graph traversal language.

Dr. Matthias Broecheler has been researching and developing large-scale graph database systems for many years in both academia and in his role as a cofounder of the Aurelius graph consulting firm. He is the primary developer of the distributed graph database Titan.
Matthias focuses most of his time and effort on novel OLTP and OLAP graph processing solutions.

————
Related Posts

– “Applying Graph Analysis and Manipulation to Data Stores.” (June 22, 2011)

– “Marrying objects with graphs”: Interview with Darren Wood. (March 5, 2011)

Resources on Graphs and Data Stores
Blog Posts | Free Software | Articles, Papers, Presentations| Tutorials, Lecture Notes

Jul 11 12

On Analyzing Unstructured Data. — Interview with Michael Brands.

by Roberto V. Zicari

“The real difference will be made by those companies that will be able to fully exploit and integrate their structured and unstructured data into so called active analytics. With Active Analytics enterprises will be able to use both quantitative and qualitative data and drive action based on a plain understanding of 100% of their data”– Michael Brands.

It is reported that 80% of all data in an enterprise is unstructured information. How do we manage unstructured data? I have interviewed Michael Brands, an expert on analyzing unstructured data and currently a senior product manager for the i.Know technology at InterSystems.

RVZ

Q1. It is commonly said that more than 80% of all data in an enterprise is unstructured information. Examples are telephone conversations, voicemails, emails, electronic documents, paper documents, images, web pages, video and hundreds of other formats. Why is unstructured data important for an enterprise?

Michael Brands: Well unstructured data is important for organizations in general in at least 3 ways.  First of all 90% of what people do in a business day is unstructured and the results of most of these activities can only be captured in unstructured data. Second it is generally acknowledged in modern economy that knowledge is the biggest of asset of companies and most of this knowledge, since itʼs developed by people, is recorded in unstructured formats. 
The last and maybe most unexpected argument to underpin the importance of unstructured data is the fact large research organizations such as Gardner and IDC state that: “80% of business is conducted on unstructured data”
If we take these tree elements together is even surprising to see most organizations invest heavily in business intelligence applications to improve their business but these applications only cover a very small portion of the data (20% in the most optimistic estimation) that are actually important for their business.
If we look at this from a different prospective we think enterprises that really want to be leading and make a difference will heavily invest in technologies that help them to understand and exploit their unstructured data because if we only look at the numbers (and thatʼs the small portion of data most enterprises already understand very well) the area of unstructured data will be the one where the difference will be made over the next couple of years.
However the real difference will be made by those companies that will be able to fully exploit and integrate their structured and unstructured data into so called active analytics. With Active Analytics enterprises will be able to use both quantitative and qualitative data and drive action based on a plain understanding of 100% of their data.
As InterSystems we have a unique technology offering that was especially designed to help our customers and partners in doing exactly that and weʼre proud our partners that actually deploy the technology to fully exploit a 100% of their data make a real difference in their market and grow way faster than their competitors.

Q2. What is the main difference between semi-structured and unstructured information?

Michael Brands: The very short and bold answer to this question would be to say semi-structured is just a euphemism for unstructured.  However a more in-depth answer is that unstructured data is a combination of structured and unstructured data in the same data channel.
Typically semi-structured data comes out of forms that foresee specific free text areaʼs to describe specific parts of the required information. This way a “structured” (meta)-data field describes with a fair degree of abstraction the contents of the associated text field.
A typical example will help to clarify this: In an electronic medical record system the notes section in which a doctor can record his observations about a specific patient in free text is typically semi-structured which means the doctor doesnʼt have to write all observations in one text but he can typically “categorize” his observations under different headers such as: “Patient History”, “Family History”, “Clinical Findings”, “Diagnose” and more.
Subdividing such text entry environments into a series of different fields with a fixed header is a very common example of semi-structured data.  Another very popular example of semi-structured data is e-mail, mp3 or video-data. These data-types contain mainly unstructured data but these unstructured data is always attached to some more structured data such as: Author, Subject or Title, Summary etc.

Q3. The most common example of unstructured data is text. Several applications store portions of their data as unstructured text that is typically implemented as plain text, in rich text format (RTF), as XML, or as a BLOB (Binary Large Object). It is very hard to extract meaning from this content. How iKnow can help here?

Michael Brands: iKnow can help here in a very specific and unique way because it is able to structure these texts into chains of concepts and relations.
What this means is that iKnow will be able to tell you without prior knowledge what the most important concepts in these texts are and how they are related to each other.
This is why, when we talk about iKnow, we say the technology is proactive.
Any other technology that analyses text will need a domain specific model (statistical, ontological or syntactical) containing a lot of domain specific knowledge in order to make some sense out of the texts it is supposed to analyze. iKnow, thanks to its unique way of splitting sentences into concepts and relations doesnʼt need this.
It will fully automatically perform the analysis and highlighting tasks students usually perform as a first step in understanding and memorizing a course text book.

Q4. How do you exactly make conceptual meaning out of unstructured data? Which text analytical methods do you use for that?

Michael Brands: The process we use to extract meaning out of texts is unique because of the following: we do not split sentences into individual words and then try to recombine these words by means of a syntactic parser, an ontology (which essentially is a dictionary combined with a hierarchical model that describes a specific domain), or a statistical model. What iKnow does instead is we split sentences by identifying relational word(group)s in a sentence.
This approach is based on a couple of long known facts about language and communication.
 First of all analytical semantics already discovered years ago every sentence is nothing else than a chain of conceptual word groups (often called Noun Phrases or Prepositional Phrases in formal linguistics) tied together by relations (often called Verb Phrases in formal linguistics). So a sentence will semantically always be built as a chain of a concept followed by a relation followed by another concept again followed by another relation and another concept etc.
This basic conception of a binary sentence structure consisting of Noun-headed phrases (concepts) and Verb-headed phrases (relations) is at the heart of almost all major approaches to automated syntactic sentence analysis. However this knowledge is only used by state-of-the-art analysis algorithms to construct second order syntactic dependency structure representations of a sentence rather than to effectively analyze the meaning of a sentence. 
A second important discovery underpinning the iKnow approach is the fact, discovered by behavioral psychology and neuro-psychiatry, humans only understand and operate a very small set of different relations to express links between facts, events, or thoughts. Not only the set of different relations people use and understand is very limited but it is also a universal set. In other words people only use a limited number of different relations and these relations are the same for everybody no matter his language, education, cultural background or whatsoever.
This discovery can learn us a lot of how basic mechanisms for learning like derivation and inference work. But more important for our purposes is that we can derive from this that, in sharp contrast with the set of concepts that is infinite and has different subsets for each specific domain, the set of relations is limited and universal.
The combination of these two elements namely the basic binary concept-relation structure of language and the universality and limitedness of the set of relations led to the development of the iKnow approach after a thorough analysis of a lot of state-of-the-art techniques.
Our conclusion of this analysis is the main problem of all classical approaches to text analysis is they all focus essentially on the endless and domain specific set of concepts because they mostly were created to serve the specific needs of a specific domain.
Thanks to this domain specific focus the number of elements a system needs to know upfront can be controlled. Nevertheless a “serious” application quickly integrates several millions of different concepts. This need for large collections of predefined concepts to describe the application domain, commonly called dictionaries, taxonomies or ontologies, leads to a couple of serious problems.
First off all, the time needed to set up and tune such applications is substantial and expensive because domain experts are needed to come up with the most appropriate concepts. Second the foot print of these systems is rather big and their maintenance costly and time-consuming because specialists need to follow whatʼs going on in the domain and adapt the knowledge of the application.
Third, itʼs very difficult to open up a domain specific application for other domains because in these other domains concepts might have different meanings or even contradict each other which can create serious problems at the level of the parsing logic.
Therefore iKnow was built to perform a completely different kind of analysis because by focussing on the relations we can build systems with a very small footprint (an average language model only contains several 10.000s relations and a very small number of context based disambiguation rules).
Moreover our system is not domain specific but it can work with data from very different domains at the same time and doesnʼt need expert input. Splitting up sentences by means of relations and solving the ambiguous cases (this means the cases in which a word or word group can express both a concept and a relation e.g. walk: is a concept in this sentence: Brussels Paris would be quite a walk. and a relation in this sentence: Pete and Mary walk to school) by means of rules that use the function (concept or relation) of the surrounding words (or word groups) to decide whether the ambiguous word is a concept or a relation is a computationally very efficient and fast process and ensures a system that learns as it analyses more data because it kind of “learns” the concepts from the texts because it identifies them as “the groups of words between the relations, before the first relation and between the last relation and the end of the sentence.

Q5. How “precise” is the meaning you extract from unstructured data? Do you have a way to validate it?

Michael Brands: This is a very interesting question because it raises two very difficult topics in the area of semantic data analysis namely : How do you define precision and How to evaluate results generated by semantic technologies ?  If we use the classical definition of precision in this area, it describes what percentage of the documents given back by a system in response to a query asking for documents containing information about certain concepts actually contains useful information about these concepts.
Based on this definition of precision we can say iKnow scores very close to a 100% because it outperforms competing technologies in itʼs efficiency to detect what words in a sentence belong together and form meaningful groups and how the relate to each other.
Even if weʼd use other more challenging definitions of precision like: the syntactic or formal correctness of the word groups identified by iKnow we score very high percentages, but itʼs evident weʼre dependent of the quality of input. If the input doesnʼt accurately uses punctuation marks or contains a lot of non-letter characters that will affect our precision. Moreover how precision is perceived and defined varies a lot from one use case to another.
Evaluation is a very complex and subjective operation in this area because whatʼs considered to be good or bad heavily depends on what people want to do with the technology and what their background is. So far we let our customers and partners decide after an evaluation period whether the technology does what they expect from it and we didnʼt have “no goes” yet.

Q6. How do you process very large scale archives of data?

Michael Brands: The architecture of the system has been set up to be as flexible as possible and to make sure processes can be executed in parallel where possible and desirable. Moreover the system provides different modes to load data: A batch-load of data which has been especially designed to pump large amounts of existing data such as document archives into an system as fast as possible, a single source load thatʼs especially designed to add individual documents to a system at transactional speed, and a small-batch mode to add limited sets of documents to a system in one process.
On top of that the loading architecture foresees different steps in the loading process: data to be loaded needs to be listed or staged, the data can be converted (this means the data that has to be indexed can be adapted to get better indexing and analysis results), and, off course the data will be loaded into the system.
These different steps can partially be done in parallel and in multiple processes to ensure the best possible performance and flexibility.

Q.7 On one hand we have mining text data, and on the other hand we have database transactions on structured data: how do you relate them to each other?

Michael Brands: Well there are two different perspectives in this question:  On the one hand itʼs important to underline that all textual data indexed with iKnow can be used as if it was structured data, because the API foresees appropriate methods that allow you to query the textual data the same you would query traditional row-column data. These methods come in 3 different flavors: they can be called as native caché object script methods, they can be called from within a SQL-environment as stored procedures and they are also available as web services. 
On the other hand thereʼs the fact all structured data that has a link with the indexed texts can be used as metadata within iKnow. Based on these structured metadata filters can be created and used within the iKnow API to make sure the API returns exactly the results you need.

_______________________
Michael Brands previously founded i.Know NV a company specialized in analyzing unstructured data. In 2010 InterSystems acquired i.Know and since then he is serving as a senior product manager for the i.Know technology at InterSystems.
i.Know’s technology is embedded in the InterSystems technology platform.

– Big Data: Smart Meters — Interview with Markus Gerdes (June 18, 2012)

– Big Data for Good (June 4, 2012)

– On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum (February 1, 2012)

– On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica (November 16, 2011)

– On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com (November 2, 2011)

– Analytics at eBay. An interview with Tom Fastner (October 6, 2011)

Jul 2 12

Managing Big Data. An interview with David Gorbet

by Roberto V. Zicari

“Executives and industry leaders are looking at the Big Data issue from a volume perspective, which is certainly an issue – but the increase in data complexity is the biggest challenge that every IT department and CIO must address, and address now. “— David Gorbet.

Managing unstructured Big Data is a challenge and an opportunity at the same time. I have interviewed David Gorbet, vice president of product strategy at MarkLogic.

RVZ

Q1. You have been quoted saying that “more than 80 percent of today’s information is unstructured and it’s typically too big to manage effectively.” What do you mean by that?

David Gorbet: It used to be the case that all the data an organization needed to run its operations effectively was structured data that was generated within the organization. Things like customer transaction data, ERP data, etc.
Today, companies are looking to leverage a lot more data from a wider variety of sources both inside and outside the organization. Things like documents, contracts, machine data, sensor data, social media, health records, emails, etc. The list is endless really. A lot of this data is unstructured, or has a complex structure that’s hard to represent in rows and columns.
And organizations want to be able to combine all this data and analyze it together in new ways. For example, we have more than one customer in different industries whose applications combine geospatial vessel location data with weather and news data to make real-time mission-critical decisions.

MarkLogic was early in recognizing the need for data management solutions that can handle a huge volume of complex data in real time. We started the company a decade ago to solve this problem, and we now have over 300 customers who have been able to build mission-critical real-time Big Data Applications to run their operations on this complex unstructured data.
This trend is accelerating as businesses all over the world are realizing that their old relational technology simply can’t handle this data effectively.

Q2. In your opinion, how is the Big Data movement affecting the market demand for data management software?

David Gorbet: Executives and industry leaders are looking at the Big Data issue from a volume perspective, which is certainly an issue – but the increase in data complexity is the biggest challenge that every IT department and CIO must address, and address now.
Businesses across industries have to not only store the data but also be able to leverage it quickly and effectively to derive business value.
We allow companies to do this better than traditional solutions, and that’s why our customer base doubled last year and continues to grow rapidly. Big Data is a major driver for the acquisition of new technology, and companies are taking action and choosing us.

Q3. Why do you see MarkLogic as a replacement of traditional database systems, and not simply as a complementary solution?

David Gorbet: First of all, we don’t advocate ripping out all your infrastructure and replacing it with something new. We recognize that there are many applications where traditional relational database technology works just fine. That said, when it came time to build applications to process large volumes of complex data, or a wide variety of data with different schemas, most of our customers had struggled with relational technology before coming to us for help.
Traditional relational database systems just don’t have the ability to handle complex unstructured data like we do. Relational databases are very good solutions for managing information that fits in rows and columns, however businesses are finding that getting value from unstructured information requires a totally new approach.
That approach is to use a database built from the ground up to store and manage unstructured information, and allow users to easily access the data, iterate on the data, and to build applications on top of it that utilize the data in new and exciting ways. As data evolves, the database must evolve with it, and MarkLogic plays a unique role as the only technology currently in the market that can fulfill that need.

Q4. How do you store and manage unstructured information in MarkLogic?

David Gorbet: MarkLogic uses documents as its native data type, which is a new way of storing information that better fits how information is already “shaped.”
To query, MarkLogic has developed an indexing system using techniques from search engines to perform database-style queries. These indexes are maintained as part of the insert or update transaction, so they’re available in real-time with no crawl delay.
For Big Data, search is an important component of the solution, and MarkLogic is the only technology that combines real-time search with database-style queries.

Q5. Would you define MarkLogic as an XML Database? A NoSQL database? Or other?

David Gorbet: MarkLogic is a Big Data database, optimized for large volumes of complex structured or unstructured data.
We’re non-relational, so in that sense we’re part of the NoSQL movement, however we built our database with all the traditional robust database functionality you’d expect and require for mission-critical applications, including failover for high availability, database replication for disaster recovery, journal archiving, and of course ACID transactions, which are critical to maintain data integrity.
If you think of what a next-generation database for today’s data should be, that’s MarkLogic.

Q6. MarkLogic has been working on techniques for storing and searching semantic information inside MarkLogic, and you have been running the Billion Triple Challenge, and the Lehigh University Benchmark. What were the main results of these tests?

David Gorbet: The testing showed that we could load 1 billion triples in less than 24 hours using approximately 750 gigabytes of disk and 150 gigabytes of RAM. Our LUBM query performance was extremely good, and in many cases superior, when compared to the performance from existing relational systems and dedicated triple stores.

Q7. Do you plan in the future to offer an open source API for your products?

David Gorbet: We have a thriving community of developers at community.marklogic.com where we make many of the tools, libraries, connectors, etc. that sit on top of our core server available for free, and in some cases as open source projects living on the social coding site github.
For example, we publish the source for XCC, our connector for Java or .NET applications, and we have an open-source REST API there as well.

Q8. James Phillips from Couchbase said in an interview last year : “It is possible we will see standards begin to emerge, both in on-the-wire protocols and perhaps in query languages, allowing interoperability between NoSQL database technologies similar to the kind of interoperability we’ve seen with SQL and relational database technology.” What is your opinion on that?

David Gorbet: MarkLogic certainly sees the value of standards, and for years we’ve worked with the World Wide Web Consortia (W3C) standards groups in developing the XQuery and XSLT languages, which are used by MarkLogic for query and transformation. Interoperability helps drive collaboration and new ideas, and supporting standards will allow us continue to be at the forefront of innovation.

Q9. MarkLogic and Hortonworks last March announced a partnerships to enhance Real-Time Big Data Applications with Apache Hadoop. Can you explain how technically the combination of MarkLogic and Hadoop will work?

David Gorbet: Hadoop is a key technology for Big Data, but doesn`t provide the real-time capabilities that are vital for the mission-critical nature of so many
organizations. MarkLogic brings that power to Hadoop, and is executing its Hadoop vision in stages.
Last November, MarkLogic introduced its Connector for Hadoop, and in March 2012, announced a partnership with leading Hadoop vendor Hortonworks. The partnership enables organizations in both the commercial and public sectors to seamlessly combine the power of MapReduce with MarkLogic’s real-time, interactive analysis and indexing on a single, unified platform.

With MarkLogic and Hortonworks, organizations have a fully supported big data application platform that enables real-time data access and full-text search together with batch processing and massive archival storage.
MarkLogic will certify its connector for Hadoop against the Hortonworks Data Platform, and the two companies will also develop reference-architectures for MarkLogic-Hadoop solutions.

Q10. How do you identify new insights and opportunities in Big Data without having to write more code and wait for the batch process to complete?

David Gorbet: The most impactful Big Data Applications will be industry- or even organization-specific, leveraging the data that the organization
consumes and generates in the course of doing business. There is no single set formula for extracting value from this data; it will depend on the application.
That said, there are many applications where simply being able to comb through large volumes of complex data from multiple sources via interactive queries can give organizations new insights about their products,customers, services, etc.
Being able to combine these interactive data explorations with some analytics and visualization can produce new insights that would otherwise be hidden.
We call this Big Data Search.
For example, we recently demonstrated an application at MarkLogic World that shows through real-time co-occurrence analysis new insights about how products are being used. In our example, it was analysis of social media that revealed that Gatorade is closely associated with flu and fever, and our ability to drill seamlessly from high-level aggregate data into the actual source social media posts shows that many people actually take Gatorade to treat flu symptoms. Geographic visualization shows that this phenomenon may be regional. Our ability to sift through all this data in real-time, using fresh data gathered from multiple sources, both internal and external to the organization helps our customers identify new actionable insights.

———–
David Gorbet is the vice president of product strategy for MarkLogic.
Gorbet brings almost two decades of experience delivering some of the highest-volume applications and enterprise software in the world. Prior to MarkLogic, Gorbet helped pioneer Microsoft`s business online services strategy by founding and leading the SharePoint Online team.
__________

Related Resources

– Lecture Notes on “Data Management in the Cloud”.
by Michael Grossniklaus, and David Maier, Portland State University.
The topics covered in the course range from novel data processing
paradigms (MapReduce, Scope, DryadLINQ), to commercial cloud data
management platforms (Google BigTable, Microsoft Azure, Amazon S3
and Dynamo, Yahoo PNUTS) and open-source NoSQL databases
(Cassandra, MongoDB, Neo4J).
Lecture Notes|Intermediate|English| DOWNLOAD ~280 slides (PDF)| 2011-12|

– NoSQL Data Stores Resources (free downloads) :
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis

Related Posts

– Big Data for Good (June 4, 2012)

–Interview with Mike Stonebraker (May 2, 2012)

– Integrating Enterprise Search with Analytics. Interview with Jonathan Ellis ( April 16, 2012)

– A super-set of MySQL for Big Data. Interview with John Busch, Schooner (February 20, 2012)

– Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB ( December 14, 2011)

– vFabric SQLFire: Better than RDBMS and NoSQL? October 24, 2011) (

– MariaDB: the new MySQL? Interview with Michael Monty Widenius (September 29, 2011)

– On Versant`s technology. Interview with Vishal Bagga ( August 17, 2011)

Jun 25 12

Managing Internet Protocol Television Data. — An interview with Stefan Arbanowski.

by Roberto V. Zicari

” There is a variety of services possible via IPTV. Starting with live/linear TV and Video on Demand (VoD) over interactive broadcast related apps, like shopping or advertisement, up to social TV apps where communities of users have shared TV experience”— Stefan Arbanowski.

The research center Fraunhofer FOKUS (Fraunhofer Institute for Open Communication Systems) in Berlin, has established a “SmartTV Lab” to build an independent development and testing environment for HybridTV technologies and solutions. They did some work on benchmarking databases for Internet Protocol Television Data. I have interviewed Stefan Arbanowski, who leads the Lab.

RVZ

Q1.What are the main research areas at the Fraunhofer Fokus research center?

Stefan Arbanowski: Be it on your mobile device, TV set or car – the FOKUS Competence Center for Future Media and Applications (FAME) develops future web technologies to offer intelligent services and applications. Our team of visionaries combines creativity and innovation with their technical expertise for the creation of interactive media. These technologies enable smart personalization and support future web functionalities on various platform from diverse domains.

The experts rigorously focus on web-based technologies and strategically use open standards. In the FOKUS Hybrid TV Lab our experts develop future IPTV technologies compliant to current standards with emphasis on advanced functionality, convergence and interoperability. The FOKUS Open IPTV Ecosystem offers one of the first solutions for standardized media services and core components of the various standards.

Q2. At Fraunhofer Fokus, you have experience in using a database for managing and controlling IPTV (Internet Protocol Television) content. What is IPTV? What kind of internet television services can be delivered using IPTV?

Stefan Arbanowski: There is a variety of services possible via IPTV.
Starting with live/linear TV and Video on Demand (VoD) over interactive broadcast related apps, like shopping or advertisement, up to social TV apps where communities of users have shared TV experience.

Q3. What is IPTV data? Could you give a short description of the structure of a typical IPTV data?

Stefan Arbanowski: This is complex: start with page 14 of this doc. 😉

Q4. What are the main requirements for a database to manage such data?

Stefan Arbanowski: There are different challenges. One is the management of different sessions of streams that is used by viewers following a particular service including for instance electronic program guide (EPG) data. Another one is the pure usage data for billing purpose. Requirements are concurrent read/write ops on large (>=1GB) DBs ensuring fast response times.

Q5. How did you evaluate the feasibility of a database technology for managing IPTV data?

Stefan Arbanowski: We did compare Versant ODB (JDO interface) with MySQL Server 5.0 and handling data in RAM. For this we did 3 implementations trying to get most out of the individual technologies.

Q6. Your IPTV benchmark is based on use cases. Why? Could you briefly explain them?

Stefan Arbanowski: It has to be a real world scenario to judge whether a particular technology really helps. We did identify the bottlenecks in current IPTV systems and used them as basis for our use cases.

The objective of the first test case was to handle a demanding number of simultaneous read/write operations and queries with small data objects, typically found in an IPTV Session Control environment.
V/OD performed more than 50% better compared to MySQL in a server side, 3-tier application server architecture. Our results for a traditional client/server architecture showed smaller advantages for the Versant Object Database, performing only approximately 25% better than MySQL, probably because of the network latency of the test environment.

The second test case managed larger sized Broadband Content Guide = Electronic Program Guide (BCG) data in one large transaction. V/OD was more than 8 times
faster compared to MySQL. In our analysis, the test case demonstrated V/OD’s significant advantages when managing complex data structures.

We wrote a white paper for more details.

Q7. What are the main lessons learned in running your benchmark?

Stefan Arbanowski: Comparing databases is never an easy task. Many specific requirements influence the decision making process, for example, the application specific data model and application specific data management tasks. Instead of using a standard database benchmark, such as TPC-C, we chose to develop a couple of benchmarks that are based on our existing IPTV Ecosystem data model and data management requirements, which allowed us to analyze results that are more relevant to the real world requirements found in such systems.

Q8. Anything else you wish to add?

Stefan Arbanowski: Considering these results, we would recommend a V/OD database implementation where performance is mandatory and in particular when the application must manage complex data structures.

_____________________
Dr. Stefan Arbanowski is head of the Competence Centre Future Applications and Media (FAME) at Fraunhofer Institute for Open Communication Systems FOKUS in Berlin, Germany.
Currently, he is coordinating Fraunhofer FOKUS’ IPTV activities, bundling expertise in the areas of interactive applications, media handling, mobile telecommunications, and next generation networks. FOKUS channels those activities towards networked media environments featuring live, on demand, context-aware, and personalized interactive media.
Beside telecommunications and distributed service platforms, he has published more than 70 papers in respective journals and conferences in the area of personalized service provisioning. He is member of various program committees of international conferences.

Jun 18 12

Big Data: Smart Meters — Interview with Markus Gerdes.

by Roberto V. Zicari

“For a large to medium sized German utility, which has about 240,000 conventional meters, quarter-hour meter readings would produce 960,000 sets of meter data to be processed and stored each hour once replaced by smart meters. And every hour another 960,000 sets of meter data have to be processed.” — Markus Gerdes.

80 percent of all households in Germany will have to be equipped with smart meters by 2020, according to a EU single market directive.
Why smart meters? A smart meter, as described by e.On, is “a digital device which can be read remotely and allows customers to check their own energy consumption at any time. This helps them to control their usage better and to identify concrete ways to save energy. Every customer can access their own consumption data online in graphic form displayed in quarter-hour intervals. There is also a great deal of additional information, such as energy saving tips. Similarly, measurements can be made using a digital display in the home in real time and the current usage viewed.” This means Big Data. How do we store, and use all these machine-generated data?
To better understand this, I have interviewed Dr. Markus Gerdes, Product Manager at BTC , a company specialized in the energy sector.

RVZ

Q1. What are the main business activities of BTC ?

Markus Gerdes: BTC provides various IT-services: besides the basics of system management, e.g. hosting services, security services or the new field of mobile security services, BTC primarily delivers IT- and process consulting and system integration services for different industries, especially for utilities.
This means, BTC plans and rolls IT-architectures out, integrates and customizes IT-applications and migrates data for ERP, CRM and more applications. BTC also delivers its IT-applications if desired: In particular, BTC’s Smart Metering solution BTC Advanced Meter Management (BTC AMM) is increasingly known in the smart meter market and has drawn customers` interest at this stage of the market, not only in Germany, but e.g. in Turkey and other European countries as well.

Q2. According to a EU single market directive and German Federal Government, 80 percent of all households in Germany will have to be equipped with smart meters by 2020, How many smart meters will have to be installed? What will the government do with all these data generated?

Markus Gerdes: Currently, 42 million electricity meters are installed in Germany. Thus, about 34 million meters need to be exchanged according to the EU directive in Germany until 2020. In order to achieve this aim, in 2011 the Germany EnWG (law on the energy industry) adds some new aspects: smart meters have to be installed where customers` electricity consumption is more than 6.000 kWh per year, at decentralized feed-in with more than 7 kW and in considerably refurbished or newly constructed buildings, if this is technically feasible.
In this context technical feasible means, that the installed smart meters are certified (as a precondition they have to use the protection profiles) and must be commercially available in the meter market. An amendment to the EnWG is due in September 2012 and it is generally expected that this threshold of 6000 kWh will be lowered. The government will actually not be in charge of the data collected by the Smart Meters. It is metering companies who have to provide the data to the distribution net operators and utility companies. The data is then used for billing and as an input to customer feedback systems for example and potentially grid analyses under the use of pseudonyms.

Q3. Smart Metering: Could you please give us some detail on what Smart Metering means in the Energy sector?

Markus Gerdes: Smart Metering means opportunities. The technology itself does no more or less than deliver data, foremost a timestamp plus a measured value, from a metering system via a communication network to an IT-system, where it is prepared and provided to other systems. If necessary this may even be done in real time. This data can be relevant to different market players in different resolutions and aggregations as a basis for other services.
Furthermore, smart meter offer new features like complex tariffs, load limitations etc. The data and the new features will lead to optimized processes with respect to quality, speed and costs. The type of processing will finally lead to new services, products and solutions – some of which we do not even know today. In combination with other technologies and information types the smart metering infrastructure will be the backbone of smart home applications and the so-called smart grid.
For instance, BTC develops scenarios to combine the BTC AMM with the control of virtual power plants or even with the BTC grid management and control application BTC PRINS. This means: smart markets become reality.

Q4. BTC AG has developed an advanced meter management system for the energy industry. What is it?

Markus Gerdes:The BTC AMM is an innovative software system, which allows meter service providers to manage, control and readout smart meters and provide these meter readings and other possibly relevant information, e.g. status information, information on meter corruption and manipulation to authorized market partners.
Also data and control signals for the smart meter can be provided by the system.
The BTC AMM is developed as a new solution BTC has been able to particularly focus on mass data management and smart meter mass process optimized workflows. In combination with a clear and easy to use frontend we bring our customers a high performance solution for their most important requirements.
In addition, our modular concept and the use of open standards makes our vendor-independent solution not only fit into utilities IT-architecture easily but makes it future-proof.

Q5. What kind of data management requirements do you have for this application? What kind of data is a smart meter producing and at what speed? How do you plan to store and process all the data generated by these smart meters?

Markus Gerdes: Let me address the issue of the data volume and frequency of data sent first. The BTC AMM is designed to collect the data of several millions of
smart meters. In a standard scenario each of the smart meters sends a load profile with a resolution of 15 minutes to the BTC AMM. This means that at least 96 data points have to be stored by the BTC AMM per day and meter. This implies both, a huge amount of data to be stored and a high frequency data traffic.
Hence, the data management system needs to be highly performant in both dimensions. In order to process time series BTC has developed a specific, highly efficient time series management which runs with different data base providers. This enables the BTC AMM to cope even with data sent in a higher frequency. For certain smart grid use cases the BTC AMM processes metering data sent from the meters on the scale of seconds.

Q6. The system you are developing is based on InterSystems Caché® database system. How do you use Cache`?

Markus Gerdes: BTC uses InterSystems Caché as Meter Data Management solution. This means the incoming data from the smart meters is saved into the database and the information provided e.g. to backend-systems via webservices or to other interfaces so that the data can be used for market partner communication or customer feedback systems. And all this means the BTC AMM has to handle thousands of read- and write-operations per second.

Q7. You said that one of the critical challenge you are facing is to “master up the mass data efficiency in communicating with smart meters and the storage and processing of measured time series” Can you please elaborate on this? What is the volume of the data sets involved?

Markus Gerdes: For a large to medium sized German utility, which has about 240,000 conventional meters, quarter-hour meter readings would produce 960,000
sets of meter data to be processed and stored each hour once replaced by smart meters. And every hour another 960,000 sets of meter data have to be processed.
In addition calculations, aggregations and plausibility checks are necessary. Moreover incoming tasks have to be processed and the relevant data has to be delivered to backend applications. This means that the underlying database as well as the AMM-processes may have to process the incoming data every 15 minutes while reading thousands of time series per minute simultaneously.

Q8. How did you test the performance of the underlying database system when handling data streams? What results did you obtain so far?

Markus Gerdes: We designed a load profile generator and used it to simulate the meter readings of more than 1 million smart meters. The tests included the
writing of quarter-hour meter readings. Actually the problem with this test was the speed of the generator to provide the data, not the speed of the AMM. In fact we are able to write more than 12.000 time-series per second. This is far enough to cope even with full meter rollouts.

Q9. What is the current status of this project? What are the lessons learned so far? And the plans ahead? Are there any similar systems implemented in Europe?

Markus Gerdes: At the moment we think that our BTC AMM- and database-performance is able to handle the upcoming mass data during the next years including a full smart meter rollout in Germany. Nevertheless, in terms of smart grid and smart home appliances and an increasing amount of real time event processings, both read and write, it is necessary to get a clear view of future technologies to speed up processing of mass data (e.g. in-memory).
In addition we still have to keep an eye on usability. Although we hope that smart metering in the end will lead to complete machine-to-machine-communication we always have to expect errors and disturbances from technology, communication or even the human factor. As event driven processes are time critical we still have to work on solutions for fast and efficient handling, analyses and processing of mass errors.
_________________

Dr. Markus Gerdes, Product Manager BTC AMM / BTC Smarter Metering Suite, BTC Business Technology Consulting AG.
Since 2009 Mr. Gerdes worked in several research, development and consulting projects in the area of smart metering. He was involved in research and consulting in the sectors Utilities, Industry and Public, regarding IT-architecture and solutions and IT-Security.
He is experienced in the development of energy management solutions.

Jun 4 12

Big Data for Good.

by Roberto V. Zicari

A distinguished panel of experts discuss how Big Data can be used to create Social Capital.

Every day, 2.5 quintillion bytes of data are created. This data comes from digital pictures, videos, posts to social media sites, intelligent sensors, purchase transaction records, cell phone GPS signals to name a few. This is Big Data.

There is a great interest both in the commercial and in the research communities around Big Data. It has been predicted that “analyzing Big Data will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus”, according to research by MGI and McKinsey’s Business Technology Office.

But very few people seem to look at how Big Data can be used for solving social problems. Most of the work in fact is not in this direction. Why this?
What can be done in the international research community to make sure that some of the most brilliant ideas do have an impact also for social issues?

I have invited a panel of distinguished well known researchers and professionals to discuss this issue. The list of panelists include:

– Roger Barga, Microsoft Research, group lead eXtreme Computing Group, USA

– Laura Haas, IBM Fellow and Director Institute for Massive Data, Analytics and Modeling IBM Research, USA

– Alon Halevy, Google Research, Head of the Structured Data Group, USA

– Paul Miller, Consultant, Cloud of Data, UK

This Q&A panel focuses exactly at this question: is it possible to conduct research for a corporation and/or a research lab, and at the same time make sure that the potential output of our research has also a social impact?

We take Big Data as a key example. Big Data is clearly of interest to marketers and enterprises a like who wish to offer their customers better services and better quality products. Ultimately their goal is to sell their products/services.

This is good, but how about digging into Big Data to help people in need? Preventing / predicting natural catastrophes, helping offering services “targeting” to people and structures in social need?

Hope you`ll find this interview interesting, as well as eye-opening.

RVZ

Q1. In your opinion, would it be possible to exploit some of the current and future research and developments efforts on Big Data for achieving social capital?

Alon: Yes, Big data is not just the size of an individual data set, but rather the collection of data that is available to us online (e.g., government data, NGOs,
local governments, journalists, etc). By putting these data together we help tell stories about the data and make them of interest and of value to the wider public. As one simple example, a recent Danish Journalism Award was given to a nice visualization of data about which doctors are being sponsored by the medical industry. The ability to communicate this data with the public is certainly part of the Big Data agenda.

Laura: Absolutely. In fact, many of the efforts that we are engaged in today are exactly in this direction.
Much of our “Smarter Planet” related research is around utilizing more intelligently the large amounts of data coming from instrumenting, observing, and capturing the information about phenomena on planet earth, both natural and man-made.

Paul: First, it’s important to recognise that technological advances, new techniques, and new ways of working often deliver both tangible and intangible social benefit as a by-product of something else. Robert Owen and his peers in the late 18th and early 19th centuries might have had genuine motives for the social welfare and educational programmes they delivered for workers in their factories, but it was the commercial success of the factories themselves that paid for the philanthropy. And better educated children became better integrated factory workers, so it wasn’t completely altruistic.

That said, there is clearly scope for Big Data to deliver direct benefits in areas that aid society. Google Flu Trends is perhaps the best-known example – analysis of many millions of searches for flu-related terms (symptoms, medicines, etc) enabling Google’s non-profit Foundation to provide early visibility of illness in ways that could/should assist local healthcare systems. Google’s search engine isn’t about flu, and its indices aren’t for flu detection or prediction; this piece of societal value simply emerges from the ‘data exhaust’ of all those people searching a single site. Flu Trends isn’t alone; Harvard researchers found that Twitter data could be analysed to track the spread of Cholera on Haiti in a way that proved “substantially faster” than traditional techniques. According to Mathew Ingram’s write-up of the research, “What the Harvard and HealthMap study shows is that analyzing the data from large sets like the tweets around Haiti isn’t just good at tracking patterns or seeing connections after an event has occurred, but can actually be of use to researchers on the ground while those events are underway” (my emphasis).

Roger: Absolutely, we have already seen several such examples.
One such example in science is Jim Gray and Alex Szalay’s collaboration to build a virtual observatory for astronomy, which leveraged relational database technology.
The SDSS Sky Server has since supported hundreds of researchers and resulted in thousands of publications over the years.
Another, more recent example, is the language translation system researchers in Microsoft Research built for the aid relief worker in Haiti after the 2010 earthquake.
They leveraged the same technology we leverage in our search operations to build a statistical machine translation engine to translate Haitian Creole to English from scratch in 4 days, 17 hours, and 30 minutes and delivered to aid workers in Haiti.

Q2. If yes, what are the areas where in your opinion Big Data could have a real impact on social capital?

Alon: Bringing data that is otherwise hidden from view to the eyes of the interested public. Data activists and journalists world-wide need to be able to easily
discover data sets, merge them in a sensible fashion and tell stories about them that will grab people’s attention.
As another example, helping people in crisis response situations has huge potential. As two examples, people have used Google Fusion Tables to create maps with critical information for people after the Japan earthquake in 2011 and before the hurricane in NYC later that year.

Laura: Healthcare is an obvious one, where leveraging the vast amounts of genomic information now being produced together with patient records, and the medical literature could help us provide the best known treatments to an individual patient — or discover new therapies that may be more effective than those currently in use.
We have worked already on leveraging big data and machine learning to predict the best therapeutic regimens for AIDS patients, for example. Or, when it comes to natural resources, we are leveraging big data to optimize the placement of turbines in a wind farm so that we get the most power for the least environmental impact. We can also look at man-made phenomena — for example, understanding traffic patterns and using the insight to do better planning or provide incentives that can reduce traffic during crunch hours. Many other examples can be given of how Big Data is being used to improve the planet!

Paul: The opportunities must – surely – be enormous? Any of the big issues affecting society, from environmental change, to population growth, to the need for clean water, food, and healthcare; all of these affect large groups of people and all of them have aspects of policy formulation or delivery that are (or should be, if anyone collected it) data-rich. The Volume, Velocity and Variety of data in many of these areas should offer challenging research opportunities for practitioners… and tangible benefits to society when they’re successful.

Roger: Top of mind is to advance scientific research, what has been referred to as eScience which covers both the traditional hard sciences from astronomy, oceanography, to the social sciences and economics.
Our ability to acquire and analyze unprecedented amounts of data has the potential to have a profound impact on science. It is a leap from the application of computing to support scientists to ‘do’ science (i.e. ‘computational science’) to the integration of computer science and ability to analyze volumes of data to extract insights into the very fabric of science. While on the face of it, this change may seem subtle, we believe it to be fundamental to science and the way science is practiced. Indeed, we believe this development represents the foundations of a new revolution in science.
We captured stories from many different scientific investigations in the book “The Fourth Paradigm: Data-Intensive Scientific Discovery.”

Q3. What are the main challenges in such areas?

Alon: Data discovery is a huge challenge (how to find high-quality data from the vast collections of data that are out there on the Web).
Determining the quality of data sets and relevance to particular issues (i.e., is the data set making some underlying assumption that renders it biased or not informative for a particular question). Combining multiple data sets by people who have little knowledge of database techniques is a constant challenge.

Laura: With any big data project, many of the same issues exist. I’ll mention three major categories of issues: those related to the data, itself, those related to the process of deriving insight and benefit from the data, and finally, those related to management issues such as data privacy, security, and governance in general. In the data space, we talk about the 4 V’s of data — Volume (just dealing with the sheer size of it), Variety (handling the multiplicity of types and sources and formats), Velocity (reacting to the flood of information in the time required by the application), and, last and perhaps least understood, Veracity (how can we cope with uncertainty, imprecision, missing values, and yes, occasionally, mis-statements or untruths?).
The challenges with deriving insight include capturing data, aligning data from different sources (e.g., resolving when two objects are the same), transforming the data into a form suitable for analysis, modeling it, whether mathematically, or through some form of simulation, etc, and then understanding the output — visualizing and sharing the results, for example. And governance includes ensuring that data is used correctly (abiding by its intended uses and relevant laws), tracking how the data is used, transformed, derived, etc, and managing its lifecycle. There are research topics in ALL of these areas!

Paul: Data availability – is there data available, at all? Increasingly, there is. But coverage and comprehensiveness often remain patchy, and the rigour with which datasets are compiled may still raise concerns. A good process will, typically, make bad decisions if based upon bad data.
Data quality – how good is the data? How broad is the coverage? How fine is the sampling resolution? How timely are the readings? How well understood are the sampling biases? What are the implications in, for example, a Tsunami that affects several Pacific Rim countries? If data is of high quality in one country, and poorer in another, does the Aid response skew ‘unfairly’ toward the well-surveyed country or toward the educated guesses being made for the poorly surveyed one?
Data comprehensiveness – are there areas without coverage? What are the implications?
Personally Identifiable Information – much of this information is about people. Can we extract enough information to help people without extracting so much as to compromise their privacy? Partly, this calls for effective industrial practices. Partly, it calls for effective oversight by Government. Partly – perhaps mostly – it requires a realistic reconsideration of what privacy really means… and an informed grown up debate about the real trade-off between aspects of privacy ‘lost’ and benefits gained.
Rather than offering blanket privacy policies, perhaps customers, regulators and software companies should be moving closer to some form of explicit data agreement; if you give me access to X, Y, and Z about yourself, I will use it for purposes A, B, and C… and you will gain benefits/services D, E, and F. The first two parts are increasingly in place, albeit informally. The final part – the benefits – is far less well expressed.
Data dogmatism – analysis of big data can offer quite remarkable insights, but we must be wary of becoming too beholden to the numbers. Domain experts – and common sense – must continue to play a role. It would be worrying, indeed, if the healthcare sector only responded to flu outbreaks when Google Flu Trends told them to! See, for example, a recent blog post of mine –

Roger: The first important step is to embrace a data-centric view. The goal is not merely to store data for a specific community but to improve data quality and to deliver as a service accurate, consistent data to operational systems. It isn’t simply a matter of connecting the plumbing between many different data sources, there’s a quality function that has to be applied, to clean, and reconcile all of this information.
Researchers don’t simply need data, they need services-based information over this data to support their work.

Q4. What are the main difficulties, barriers hindering our community to work on social capital projects?

Alon: I don’t think there are particular barriers from a technical perspective. Perhaps the main barrier is ideas of how to actually take this technology and make social impact. These ideas typically don’t come from the technical community, so we need more inspiration from activists.

Laura: Funding and availability of data are two big issues here. Much funding for social capital projects comes from governments — and as we know, are but a small fraction of the overall budget. Further, the market for new tools and so on that might be created in these spaces is relatively limited, so it is not always attractive to private companies to invest. While there is a lot of publicly available data today, often key pieces are missing, or privately held, or cannot be obtained for legal reasons, such as the privacy of individuals, or a country’s national interests. While this is clearly an issue for most medical investigations, it crops up as well even with such apparently innocent topics as disaster management (some data about, e.g., coastal structures, may be classified as part of the national defense).

Paul: Perceived lack of easy access to data that’s unencumbered by legal and privacy issues? The large-scale and long term nature of most of the problems?
It’s not as ‘cool’ as something else? A perception (whether real or otherwise) that academic funding opportunities push researchers in other directions?
Honestly, I’m not sure that there are significant insurmountable difficulties or barriers, if people want to do it enough.
As Tim O’Reilly said in 2009 (and many times since), developers should “Work on stuff that matters.” The same is true of researchers.

Roger: The greatest barrier may be social. Such projects require community awareness to bring people to take action and often a champion to frame the technical challenges in a way that is approachable by the community.
These projects will likely require close collaboration between the technical community and those familiar with the problem.

Q5. What could we do to help supporting initiatives for Big Data for Good?

Alon: Building a collection of high quality data that is widely available and can serve as the backbone for many specific data projects. For example, data sets that
include boundaries of countries/counties and other administrative regions, data sets with up-to-date demographic data. It’s very common that when a particular data story arises, these data sets serve to enrich it.

Laura: Increasingly, we see consortiums of institutions banding together to work on some of these problems. These Centers may provide data and platforms for
data-intensive work, alleviating some of the challenges mentioned above by acquiring and managing data, setting up an environment and tools, bringing in expertise in a given topic, or in data, or in analytics, providing tools for governance, etc. My own group is creating just such a platform, with the goal of facilitating such collaborative ventures. Of course, lobbying our governments for support of such initiatives wouldn’t hurt!

Paul: Match domains with a need to researchers/companies with a skill/product. Activities such as the recent Big Data Week Hackathons might be one route to follow – encourage the organisers (and companies like Kaggle, which do this every day) to run Hackathons and competitions that are explicitly targeted at a ‘social’ problem of some sort. Continue to encourage the Open Data release of key public data sets. Talk to the agencies that are working in areas of interest, and understand the problems that they face. Find ways to help them do what they already want to do, and build trust and rapport that way.

Roger: Provide tools and resources to empower the long tail of research.
Today, only a fraction of scientists and engineers enjoy regular access to high performance and data-intensive computing resources to process and analyze massive amounts of data and run models and simulations quickly. The reality for most of the scientific community is that speed to discovery is often hampered as they have to either queue up for access to limited resources or pare down the scope of research to accommodate available processing power. This problem is particularly acute at the smaller research institutes which represent the long tail of the research community. Tier 1 and some tier 2 universities have sufficient funding and infrastructure to secure and support computing resources while the smaller research programs struggle. Our funding agencies and corporations must provide resources to support researchers, in particular those who do not have access to sufficient resources.

Q6. Are you aware of existing projects/initiatives for Big Data for Good?

Laura: Yes, many! See above for some examples. IBM Research alone has efforts in each of the areas mentioned — and many more. For example, we’ve been working with the city of Rio, in Brazil, to do detailed flood modeling, meter by meter; with the Toronto Children’s Hospital to monitor premature babies in the neonatal ward,
allowing detection of life-threatening infections up to 24 hours earlier; and with the Rizzoli Institute in Italy to find the best cancer treatments for particular groups of patients.

Roger: Yes, the United Nations Global Pulse initiative is one example. Earlier this year at the 2012 Annual Meeting in Davos, the World Economic Forum
published a white paper entitled “Big Data, Big Impact: New Possibilities for International Development“. The WEF paper lays out several of the ideas which fundamentally drive the Global Pulse initiative and presents in concrete terms the opportunity presented by the explosion of data in our world today, and how researchers and policymakers are beginning to realize the potential for leveraging Big Data to extract insights that can be used for Good, in particular for the benefit of low-income populations. What I find intriguing about this project from a technical perspective is how to extract insight from ambient data, from GPS devices, cell phones and medical devices, combined with crowd sourced data from health and aid workers in the field, then analyzed with machine learning and analytics to predict a potential social need or crisis in advance while remediation is still viable.

Q7. Anything else you wish to add?

Alon: Google Fusion Tables has been used in many cases for social good, either though journalists, crisis response or data activists making a compelling
visualization that caught people’s attention. This has been one of the most gratifying aspects of working on Fusion Tables and has served as a main driver for prioritizing our work: make it easy for people with passion for the data (rather than database expertise) to get their work done; make it easier for them to find relevant data and combine it with their own. We look very carefully at the workflow of these professionals and try to make it as efficient as possible.

Laura: I think our community has the ability to do a lot of good by leveraging the tools we are developing, and our expertise, to attack some of the critical problems facing our world. We may even create economic value (not a bad thing, either!) while doing so.
_____________________________

Dr. Roger Barga has been with the Microsoft Corporation since 1997, first working as a researcher in the database research group of Microsoft Research, then as architect of the Technical Computing Initiative, followed by architect and engineering group lead in the eXtreme Computing Group of Microsoft Research.
He currently leads a product group developing an advanced analytics service on Windows Azure. Roger holds a PhD in Computer Science (database systems), MS in Computer Science (machine learning), and a BS in Mathematics.

Dr. Alon Halevy heads the Structured Data Group at Google Research.
Prior to that, he was a Professor of Computer Science at the University of Washington, where he founded the Database Research Group. From 1993 to 1997 he was a Principal Member of Technical Staff at AT&T Bell Laboratories (later AT&T Laboratories). He received his Ph.D in Computer Science from Stanford University in 1993, and his Bachelors degree in Computer Science and Mathematics from the Hebrew University in Jerusalem in 1988. Dr. Halevy was elected Fellow of the Association of Computing Machinery in 2006.

Dr. Laura Haas is an IBM Fellow, and Director of IBM Research’s new Institute for Massive Data, Analytics and Modeling; she also serves as a “catalyst” for ambitious research across IBM’s worldwide research labs. She was the Director of Computer Science at IBM’s Almaden Research Center from 2005 to 2011.
From 2001-2005, she led the Information Integration Solutions architecture and development teams in IBM’s Software Group. Previously, Dr. Haas was a research staff member and manager at Almaden.
She is best known for her work on the Starburst query processor, from which DB2 LUW was developed, on Garlic, a system which allowed integration of heterogeneous data sources, and on Clio, the first semi-automatic tool for heterogeneous schema mapping.
She has received several IBM awards for Outstanding Innovation and Technical Achievement, an IBM Corporate Award for her work on information integration technology, and the Anita Borg Institute Technical Leadership Award. Dr. Haas was Vice President of the VLDB Endowment Board of Trustees from 2004-2009, and is a member of the National Academy of Engineering and the IBM Academy of Technology, an ACM Fellow, and Vice Chair of the board of the Computing Research Association.

Dr. Paul Miller is Founder of the Cloud of Data, a UK-based consultancy primarily concerned with Cloud Computing, Big Data, and Semantic Technologies.
He works with public and private sector clients in Europe and North America, and has a Ph.D in Archaeology (Geographic Information Systems) from the University of York.

________________________

Acknowledgement: I would like to thank Michael J. Carey with whom I have brainstormed about this project at EDBT in Berlin. RVZ

Note: you can DOWNLOAD THIS INTERVIEW AS .PDF

ODBMS Industry Watch

On Impedance Mismatch. Interview with Reinhold Thurner

On Eventual Consistency– An interview with Justin Sheehy.

On Big Graph Data.

On Analyzing Unstructured Data. — Interview with Michael Brands.

Managing Big Data. An interview with David Gorbet

Managing Internet Protocol Television Data. — An interview with Stefan Arbanowski.

Big Data: Smart Meters — Interview with Markus Gerdes.

About the author

Archives

Meta

About

Flickr

Search

About the author

Tags

Archives

Meta

About

Flickr

Search