
Jul 2 13

On Oracle NoSQL Database. Interview with Dave Segleau.

by Roberto V. Zicari

“We went down the path of building Oracle NoSQL Database because of explicit requests from some of our largest Oracle Berkeley DB installations that wanted to move away from maintaining home-grown sharding implementations and very much wanted an out-of-the-box technology that could replicate the robustness of what they had built.” –Dave Segleau.

On October 3, 2011 Oracle announced the Oracle NoSQL Database, and on December 17, 2012, Oracle shipped Oracle NoSQL Database R2. I wanted to know more about the status of the Oracle NoSQL Database. I have interviewed Dave Segleau, Director of Product Management, Oracle NoSQL Database.

RVZ

Q1. Who is currently using Oracle NoSQL Database, and for what kind of domain specific applications? Please give us some examples.

Dave Segleau: There is a range of users from segments such as Web-scale Transaction Processing, to Web-scale Personalization and Real-time Event Processing. The area where I would say we see the largest adoption is the Real-time Event Processing category. This is basically the use case that covers things like Fraud Detection, Telecom Services Billing, Online Gaming and Mobile Device Management.

Q2. What is new in Oracle NoSQL Database R2?

Dave Segleau: We added significant enhancements to NoSQL Database in the areas of Configuration Management/Monitoring (CM/M), APIs and Application Developer Usability, as well as Integration with the Oracle technology stack.
In the area of CM/M, we added “Smart Topology” (automated capacity- and reliability-aware data storage allocation with intelligent request routing), configuration elasticity and rebalancing, and JMX/SNMP support. In the area of APIs and Application Developer Usability we added a C API, support for values as JSON objects (with AVRO serialization), JSON schema definitions, and a Large Object API (including a highly efficient streaming interface). In the area of Integration we added support for accessing NoSQL Database data via Oracle External Tables (using SQL in the Oracle Database), RDF Graph support in NoSQL Database, and integration with Oracle Coherence as well as with Oracle Event Processing.

Q3. How would you compare Oracle NoSQL with respect to other NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak?

Dave Segleau: The Oracle NoSQL Database is a key-value store, although it also supports JSON as a value type, similar to a document store. Architecturally it is closer to Riak, Cassandra and the Amazon Dynamo-based implementations than to the other technologies, at least at the highest level of abstraction. With regards to features, Oracle NoSQL Database shares a lot of commonality with Riak. Our performance and scalability characteristics are among the best results in YCSB benchmarks.

Q4. What is the implication of having Oracle Berkeley DB Java Edition as the core engine for the Oracle NoSQL database?

Dave Segleau: It means that Oracle NoSQL Database provides a mission-critical, proven database technology at the heart of the implementation. Many of the other NoSQL databases use relatively new implementations for data storage and replication. Databases in general, and especially distributed parallel databases, are hard to implement well and to bring to a high level of product quality and reliability. So we see the use of Oracle Berkeley DB, a database engine pervasively deployed in thousands of mission-critical applications, as a big differentiator. Plus, many of the early NoSQL technologies are based on Oracle Berkeley DB, for example LinkedIn’s Voldemort and Amazon’s Dynamo, as are other popular commercial and enterprise social media products like Yammer.
The bottom line is that we went down the path of building Oracle NoSQL Database because of explicit requests from some of our largest Oracle Berkeley DB installations that wanted to move away from maintaining home-grown sharding implementations and very much wanted an out-of-the-box technology that could replicate the robustness of what they had built.

Q5. What is the relationship between the underlying “cleaning” mechanism used to free up unused space in Oracle Berkeley DB, and the predictability and throughput of Oracle NoSQL Database?

Dave Segleau: As mentioned in the previous section, Oracle NoSQL Database uses Oracle Berkeley DB Java Edition as the key-value storage mechanism within the Storage Nodes. Oracle Berkeley DB Java Edition uses a no-overwrite log file system to store the data and a configurable, multi-threaded background log cleaner task to compact and clean log files and free up unused disk space. The Oracle Berkeley DB log cleaner has undergone many years of in-house and real-world high-volume validation and tuning. Oracle NoSQL Database pre-defines the BDB cleaner parameters for optimal configuration for this particular use case. The cleaner enhances system throughput and predictability by a) running as a low-level background task, and b) being preconfigured to minimize impact on the running system. The combination of these two characteristics leads to more predictable system throughput.

Several other NoSQL database products have implemented heavyweight tasks to compact, compress and free up disk space. Running them definitely impacts system throughput and predictability. From our point of view, not only do you want a NoSQL database that has excellent performance, but you also need predictable performance. Routine tasks like Java GCs and disk space management should not cause major impacts to operational throughput in a production system.
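For readers who want to see what “cleaner parameters” look like in practice, here is a minimal sketch of tuning the log cleaner when using Berkeley DB Java Edition directly through its com.sleepycat.je API; the specific values and the environment path are illustrative assumptions, not the presets that Oracle NoSQL Database ships with.

```java
// Sketch: configuring the BDB JE log cleaner directly.
// The parameter values below are examples only.
import java.io.File;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class CleanerTuningExample {
    public static void main(String[] args) {
        EnvironmentConfig cfg = new EnvironmentConfig();
        cfg.setAllowCreate(true);
        // Reclaim log files once their live-data utilization drops below 50%.
        cfg.setConfigParam(EnvironmentConfig.CLEANER_MIN_UTILIZATION, "50");
        // Run two background cleaner threads.
        cfg.setConfigParam(EnvironmentConfig.CLEANER_THREADS, "2");

        File envHome = new File("/tmp/je-env");
        envHome.mkdirs();
        Environment env = new Environment(envHome, cfg);
        // ... use the environment ...
        env.close();
    }
}
```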

Q7. The Oracle NoSQL Database data model uses the concepts of “major” and “minor” key paths. Why?

Dave Segleau: We heard from customers that they wanted both even distribution of data and co-location of certain sets of records. The Major/Minor key paradigm provides the best of both worlds. The Major key is the basis for the hash function, which causes Major key values to be evenly distributed throughout the key-value data store. The Minor key allows us to cluster multiple records for a given Major key together in the same storage location. In addition to being very flexible, it also provides additional benefits:
a) A scalable two-tier indexing structure. A hash map of Major Keys to partitions that contain the data, and then a B-tree within each partition to quickly locate Minor key values.
b) Minor keys allow us to perform efficient lookups and range scans within a Major key. For example, for userID 1234 (Major key), fetch all of the products that they browsed from January 1st to January 15th (Minor key).
c) Because all of the Minor key records for a given Major key are co-located on the same storage location, this becomes our basic unit of ACID transactions, allowing applications to have a transaction that spans a single record, multiple records or even multiple write operations on multiple records for a given major key.

This degree of flexibility is often lacking in other NoSQL database offerings.
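To make the Major/Minor key model concrete, here is a minimal sketch using the oracle.kv Java client; the store name, helper host, and key components are illustrative assumptions rather than values taken from the interview.

```java
// Sketch of the Major/Minor key model with the Oracle NoSQL Database client.
import java.util.Arrays;
import java.util.SortedMap;
import oracle.kv.Depth;
import oracle.kv.Key;
import oracle.kv.KeyRange;
import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

public class MajorMinorKeyExample {
    public static void main(String[] args) {
        KVStore store = KVStoreFactory.getStore(
                new KVStoreConfig("kvstore", "localhost:5000"));

        // The Major path "users/1234" hashes to a partition; the Minor path
        // clusters each browsed product under that same partition.
        Key k = Key.createKey(Arrays.asList("users", "1234"),
                              Arrays.asList("browsed", "2013-01-07", "prod-42"));
        store.put(k, Value.createValue("view".getBytes()));

        // Range scan of Minor keys for one Major key: all products browsed
        // between January 1st and January 15th.
        Key parent = Key.createKey(Arrays.asList("users", "1234"),
                                   Arrays.asList("browsed"));
        KeyRange range = new KeyRange("2013-01-01", true, "2013-01-15", true);
        SortedMap<Key, ValueVersion> hits =
                store.multiGet(parent, range, Depth.PARENT_AND_DESCENDANTS);
        hits.forEach((key, vv) -> System.out.println(key));

        store.close();
    }
}
```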

Q8. Oracle NoSQL database is a distributed, replicated key-value store using a shared-nothing master-slave architecture. Why did you choose to have a master node architecture? How does it compare with other systems which have no master?

Dave Segleau: First of all, let’s clarify that each shard has a master (so the system is multi-master across shards) and that masters are elected. The Oracle NoSQL Database topology is deployed with a user-specified replication factor (how many copies of the data the system should maintain), and then a master is elected using a PAXOS-based mechanism. It is quite possible that a new master is elected under certain operating conditions. Plus, if you throw more hardware resources at the system, those “masters” will shift the data for which they are responsible, again to achieve the optimal latency profile. We are leveraging the enterprise-grade replication technology that is widely deployed via Oracle Berkeley DB Java Edition. Also, by using an elected-master implementation, we can provide fully ACID transactions on an operation-by-operation basis.

Q9. It is known that when the master node for a particular key-value pair fails (or is unreachable because of a network failure), some writes may get lost. What is the implication from an application point of view?

Dave Segleau: This is a complex question, in that it depends largely on the type of durability requested for the operation, and that is controlled by the developer. In general though, committed transactions acknowledged by a simple majority of nodes (our default durability) are not lost when a master fails. In the case of less aggressive durability policies, in-flight transactions that have been subject to network, disk or server failures are handled similarly to process failures in other database implementations: the transactions are rolled back. However, a new master will quickly be elected and future requests will go through without a hitch. Applications can guard against such situations by handling exceptions and performing a retry.
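The exception-and-retry pattern described above might look roughly like the sketch below, written against the oracle.kv Java client; the choice of FaultException as the error to retry, the attempt count, and the backoff are assumptions for illustration, and a real application should consult the client documentation for the exact failure classes it needs to handle.

```java
// Hedged retry sketch: if a write fails while a new master is being elected,
// catch the exception, back off briefly, and retry.
import oracle.kv.FaultException;
import oracle.kv.Key;
import oracle.kv.KVStore;
import oracle.kv.Value;

public class RetryingWriter {
    private static final int MAX_ATTEMPTS = 3;

    static void putWithRetry(KVStore store, Key key, Value value)
            throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                store.put(key, value);
                return;                        // committed, we are done
            } catch (FaultException e) {
                if (attempt == MAX_ATTEMPTS) {
                    throw e;                   // give up after a few tries
                }
                Thread.sleep(100L * attempt);  // crude backoff while a new
                                               // master is elected
            }
        }
    }
}
```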

Q10. Justin Sheehy of Basho in an interview said (1): “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense.” Would you recommend to your clients to use Oracle NoSQL Database for banking applications?

Dave Segleau: Absolutely. The Oracle NoSQL Database offers a range of transaction durability and consistency options on a per operation basis. The choice of eventual consistency is best made on a case by case basis, because while using it can open up new levels of scalability and performance, it does come with some risk and/or alternate processes which have a cost. Some NoSQL vendors don’t provide the options to leverage ACID transactions where they make sense, but the Oracle NoSQL Database does.
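As a hedged illustration of those per-operation knobs, the sketch below shows how durability and consistency can be chosen call by call with the oracle.kv Java client; the particular policies and timeouts are examples only, not recommendations.

```java
// Sketch of per-operation durability and consistency choices.
import java.util.concurrent.TimeUnit;
import oracle.kv.Consistency;
import oracle.kv.Durability;
import oracle.kv.Key;
import oracle.kv.KVStore;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

public class DurabilityOptionsExample {
    static void writeThenRead(KVStore store, Key key, Value value) {
        // Strict write: synchronous commit on the master, acknowledged by a
        // simple majority of replicas.
        store.put(key, value, null, Durability.COMMIT_SYNC, 5, TimeUnit.SECONDS);

        // Relaxed read: any replica may answer, trading freshness for latency.
        ValueVersion replicaRead =
                store.get(key, Consistency.NONE_REQUIRED, 5, TimeUnit.SECONDS);

        // Strong read: served by the shard master, sees the latest commit.
        ValueVersion masterRead =
                store.get(key, Consistency.ABSOLUTE, 5, TimeUnit.SECONDS);

        System.out.println(replicaRead + " / " + masterRead);
    }
}
```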

Q11. Could you detail how Elasticity is provided in R2?

Dave Segleau: The Oracle NoSQL Database slices data up into partitions within highly available replication groups. Each replication group contains an elected master and a number of replicas, based on user configuration. The exact configuration will vary depending on the read latency/write throughput requirements of the application. The processes associated with those replication groups run on hardware (Storage Nodes) declared to the Oracle NoSQL Database. For elasticity purposes, additional Storage Nodes can be declaratively added to a running system, in which case some of the data partitions will be re-allocated onto the new hardware, thereby increasing the number of shards and the write throughput. Additionally, the number of replicas can be increased to improve read latency and increase reliability. The process of rebalancing data partitions, spawning new replicas, and forming new Replication Groups will cause those internal data partitions to automatically move around the Storage Nodes to take advantage of the new storage capacity.

Q12. What is the implication, from a developer perspective, of having Avro schema support?

Dave Segleau: For the developer, it means better support for seamless JSON storage. There are other downstream implications, like compatibility and integration with Hadoop processing, where AVRO is quickly becoming a standard not only for efficient wireline serialization protocols, but also for HDFS storage. Also, AVRO is a very efficient serialization format for JSON, unlike other serialization options like BSON, which tend to be much less efficient. In the future, Oracle NoSQL Database will leverage this flexible AVRO schema definition in order to provide features like table abstractions, projections and secondary index support.
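For readers unfamiliar with Avro, the sketch below shows what a record schema for a JSON-style value might look like, parsed with the standard Apache Avro Java API; the schema and field names are invented for the example, and the step of registering the schema with Oracle NoSQL Database is left to the product documentation.

```java
// Sketch: a hypothetical Avro record schema for a user-profile value.
import org.apache.avro.Schema;

public class AvroSchemaExample {
    static final String USER_PROFILE_SCHEMA =
        "{\"type\": \"record\", \"name\": \"UserProfile\","
      + " \"fields\": ["
      + "   {\"name\": \"userId\",   \"type\": \"string\"},"
      + "   {\"name\": \"segments\", \"type\": {\"type\": \"array\", \"items\": \"string\"}},"
      + "   {\"name\": \"lastSeen\", \"type\": \"long\"}"
      + " ]}";

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(USER_PROFILE_SCHEMA);
        System.out.println(schema.toString(true)); // pretty-print the schema
    }
}
```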

Q13. How do you handle Large Object Support?

Dave Segleau: Oracle NoSQL Database provides a streaming interface for Large Objects. Internally, we break a Large Object up into chunks and use parallel operations to read and write those chunks to/from the database. We do it in an ordered fashion so that you can begin consuming the data stream before all of the contents are returned to the application. This is useful when implementing functionality like scrolling partial results, streaming video, etc. Large Object operations are restartable and recoverable. Let’s say that you start to write a 1 GB Large Object and sometime during the write operation a failure occurs and the write is only partially completed. The application will get an exception. When the application re-issues the Large Object operation, NoSQL resumes where it left off, skipping chunks that were already successfully written.
The Large Object chunking implementation also ensures that partially written Large Objects are not readable until they are completely written.
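A hedged sketch of the streaming Large Object interface is shown below, based on the oracle.kv.lob API (putLOB/getLOB); the key path, the “.lob” key suffix convention, the timeouts and the durability setting are assumptions for illustration, so check the documentation of your release for the exact requirements.

```java
// Sketch: streaming a large object in and back out in chunks.
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import oracle.kv.Consistency;
import oracle.kv.Durability;
import oracle.kv.Key;
import oracle.kv.KVStore;
import oracle.kv.lob.InputStreamVersion;

public class LargeObjectExample {
    static void storeAndStream(KVStore store) throws Exception {
        // Assumed convention: LOB keys end with a ".lob" suffix.
        Key lobKey = Key.createKey(Arrays.asList("videos", "cat-demo"),
                                   Arrays.asList("source.lob"));

        // Stream a large file in; the store chunks it internally, and the
        // operation can be re-issued after a failure to resume the upload.
        try (InputStream in = new FileInputStream("/tmp/cat-demo.mp4")) {
            store.putLOB(lobKey, in, Durability.COMMIT_WRITE_NO_SYNC,
                         60, TimeUnit.SECONDS);
        }

        // Stream it back out; consumption can begin before the last chunk
        // has been fetched.
        InputStreamVersion isv =
                store.getLOB(lobKey, Consistency.NONE_REQUIRED, 60, TimeUnit.SECONDS);
        try (InputStream out = isv.getInputStream()) {
            byte[] buf = new byte[8192];
            while (out.read(buf) != -1) {
                // hand the bytes to the consumer as they arrive
            }
        }
    }
}
```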

Q14. A NoSQL Database can act as an Oracle Database External Table. What does it mean in practice?

Dave Segleau: What we have achieved here is the ability to treat the Oracle NoSQL Database as a resource that can participate in SQL queries originating from an Oracle Database via standard SQL query facilities. Of course, the developer has to define a template that maps the “value” into a table representation. In release 2 we provide sample templates and configuration files that the application developer can use in order to define the required components. In the future, Oracle NoSQL Database will automate template definitions for JSON values. External Table capabilities give seamless access to both structured relational and unstructured NoSQL data using familiar SQL tools.

Q15. Why is Atomic Batching important?

Dave Segleau: If by Atomic Batching you mean the ability to perform more than one data manipulation in a single transaction, then atomic batching is the only real way to ensure logical consistency in multi-data update transactions. The Oracle NoSQL Database provides this capability for data beneath a given major key.
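A minimal sketch of such a multi-record transaction beneath a single Major key, using the oracle.kv OperationFactory and execute APIs, is shown below; the key and value contents are invented for the example.

```java
// Sketch: all-or-nothing updates to several records under one Major key.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import oracle.kv.Key;
import oracle.kv.KVStore;
import oracle.kv.Operation;
import oracle.kv.OperationFactory;
import oracle.kv.Value;

public class AtomicBatchExample {
    static void updateOrderAtomically(KVStore store) throws Exception {
        OperationFactory f = store.getOperationFactory();
        List<Operation> ops = new ArrayList<>();

        // Both records share the Major key ["orders", "987"], so they live in
        // the same partition and can be changed in one ACID transaction.
        ops.add(f.createPut(
                Key.createKey(Arrays.asList("orders", "987"),
                              Arrays.asList("status")),
                Value.createValue("SHIPPED".getBytes())));
        ops.add(f.createPut(
                Key.createKey(Arrays.asList("orders", "987"),
                              Arrays.asList("shippedAt")),
                Value.createValue("2013-07-02".getBytes())));

        store.execute(ops);   // all-or-nothing across the batch
    }
}
```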

Q16. What are the suggested criteria for users when they need to trade durability for lower latency, higher throughput and write availability?

Dave Segleau: That’s a tough one to answer, since it is so case by case dependent. As discussed above in the banking question, in general if you can achieve your application latency goals while specifying high durability, then that’s your best course of action. However, if you have more aggressive low-latency/high-throughput requirements, you may have to assess the impact of relaxing your durability constraints and of the rare case where a write operation may fail. It’s useful to keep in mind that a write failure is a rare event because of the inherent reliability built into the technology.

Q17. Tomas Ulin mentioned in an interview (2) that “with MySQL 5.6, developers can now commingle the “best of both worlds” with fast key-value look up operations and complex SQL queries to meet user and application specific requirements”. Isn’t MySQL 5.6 in fact competing with Oracle NoSQL database?

Dave Segleau: MySQL is an SQL database with a KV API on top. We are a KV database. If you have an SQL application with occasional need for fast KV access, MySQL is your best option. If you need pure KV access with unlimited scalability, then NoSQL DB is your best option.

———
David Segleau is the Director of Product Management for the Oracle NoSQL Database, Oracle Berkeley DB and Oracle Database Mobile Server. He joined Oracle as the VP of Engineering for Sleepycat Software (makers of Berkeley DB). He has more than 30 years of industry experience, leading and managing technical product teams and working extensively with database technology as both a customer and a vendor.

Related Posts

(1) On Eventual Consistency– An interview with Justin Sheehy. August 15, 2012.

(2) MySQL-State of the Union. Interview with Tomas Ulin. February 11, 2013.

Resources

- Charles Lamb’s Blog

- ODBMS.org: Resources on NoSQL Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis.

Follow ODBMS.org on Twitter: @odbmsorg
##

Jun 10 13

On Big Data and Hadoop. Interview with Paul C. Zikopoulos.

by Roberto V. Zicari

“We’re not all LinkedIns and Facebooks; we don’t have budgets to hire 1000s of new hires with these skills, and what’s more we’ve invested in existing skills and people today. So to democratize Big Data, you need it to be consumable and integrated. These will flatten the time to value for Hadoop” — Paul C. Zikopoulos.

I have interviewed Paul C. Zikopoulos, Director of Technical Professionals for IBM Software Group’s Information Management division. The topic: Apache Hadoop and Big Data, State of the Union in 2013 and Vision for the future.

RVZ

Q1. What do you think is still needed for big data analytics to be really useful for the enterprise?

Paul C. Zikopoulos: Integration and Consumability. We’re not all LinkedIns and Facebooks; we don’t have budgets to hire 1000s of new hires with these skills, and what’s more we’ve invested in existing skills and people today.
So to democratize Big Data, you need it to be consumable and integrated.
These will flatten the time to value for Hadoop. IBM is working really hard in these areas. I could go into other areas, but this is key.

Q2. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey.
When you speak with your customers what are the typical use cases and requirements they have?

Paul C. Zikopoulos: No matter what industry I’m working with, 90% of the Big Data use cases always have two common denominators: Whole Population Analytics, to break free of traditional capacity-constrained samples, and moving analytics from data at rest to data in motion.
So if you think about churn prediction, next best action, next best offer, fraud prediction, condition monitoring, out-of-tolerance quality predictors, and more – it’s all going to rely on using more data (could be volume, could be variety, and often both) to build better models.
If you’re looking for specific use cases by industry, here’s a bunch of them that we’ve worked with clients on at IBM.

Q3. How do you categorize the various stages of the Hadoop usage in the enterprises?

Paul C. Zikopoulos: The IBM Institute for Business Value did a joint study with the Saïd Business School (University of Oxford). They talked to a lot of Big Data folks and found that 28% were in the pilot phase, 24% hadn’t started anything, and 47% were planning. After going through their research, they broke the answers into four stages: Educate / Explore / Engage / Execute.
So I’ll detail those four stages, but you can get the entire study here.

Educate: Building a base of knowledge (24 percent of respondents).
In the Educate stage, the primary focus is on awareness and knowledge development.
Almost 25 percent of respondents indicated they are not yet using big data within their organizations. While some remain relatively unaware of the topic of big data, our interviews suggest that most organizations in this stage are studying the potential benefits of big data technologies and analytics, and trying to better understand how big data can help address important business opportunities in their own industries or markets.
Within these organizations, it is mainly individuals doing the knowledge gathering as opposed to formal work groups, and their learnings are not yet being used by the organization. As a result, the potential for big data has not yet been fully understood and embraced by the business executives.

Explore: Defining the business case and roadmap (47 percent).
The focus of the Explore stage is to develop an organization’s roadmap for big data development.
Almost half of respondents reported formal, ongoing discussions within their organizations about how to use big data to solve important business challenges.
Key objectives of these organizations include developing a quantifiable business case and creating a big data blueprint.
This strategy and roadmap takes into consideration existing data, technology and skills, and then outlines where to start and how to develop a plan aligned with the organization’s business strategy.

Engage: Embracing big data (22 percent).
In the Engage stage, organizations begin to prove the business value of big data, as well as perform an assessment of their technologies and skills.
More than one in five respondent organizations is currently developing POCs to validate the requirements associated with implementing big data initiatives, as well as to articulate the expected returns. Organizations in this group are working – within a defined, limited scope – to understand and test the technologies and skills required to capitalize on new sources of data.

Execute: Implementing big data at scale (6 percent).
In the Execute stage, big data and analytics capabilities are more widely operationalized and implemented within the organization. However, only 6 percent of respondents reported that their organizations have implemented two or more big data solutions at scale – the threshold for advancing to this stage. The small number of organizations in the Execute stage is consistent with the implementations we see in the marketplace. Importantly, these leading organizations are leveraging big data to transform their businesses and thus are deriving the greatest value from their information assets.
With the rate of enterprise big data adoption accelerating rapidly – as evidenced by 22 percent of respondents in the Engage stage, with either POCs or active pilots underway – we expect the percentage of organizations at this stage to more than double over the next year. So while only 6% are executing today, about 25% of respondents in this study are ‘piloting’ initiatives.

Q4. Could you give us some examples of how you get (Big) Data Insights?

Paul C. Zikopoulos: IBM has a non-forked version of Hadoop called BigInsights.
When it comes to open source, it’s really hard to look past IBM’s achievements. Lucene, Apache Derby, Apache Jakarta, Apache Geronimo, Eclipse and so much more – so it shouldn’t surprise anyone that IBM is squarely in Hadoop’s corner.
Our strategy here is Embrace and Extend. We will embrace the open source Hadoop community. We are a vibrant part of it (in the latest Hadoop patch as of the time of this interview, the most fixes came from IBM; we have a number of contributions to HBase, and more). IBM has a long history of understanding enterprise concerns; that’s the extend part.
Some of the extensions work just fine with open source. For example, we provide a rich management tool and a quick installer, and we concentrate open ports into a single one to make it easier for your Hadoop cluster to pass audit.
Some of our extensions overlay Hadoop. For example, our Adaptive MapReduce can deliver a 30% performance boost by using its algorithms to optimize the overhead of MapReduce task startup.
We have enhanced schedulers, announced the option to use GPFS as the file system which provides a lot of benefits, and more. But these are optional. If you use BigInsights you are using a non-forked Hadoop distro.
Some of our extensions are ‘round-trip-able’: if you use them, you can walk back to pure open source Hadoop at any time; some aren’t. If you want to get our fast-to-install, non-extended version of Hadoop for free, you can download InfoSphere BigInsights Basic Edition here.

Q5. What are the main technical challenges for big data analytics when data is in motion rather than at rest?

Paul C. Zikopoulos: Well, the challenge is to ask yourself how to take the analytic artifacts that you learn at rest, either in Hadoop or the EDW, and get them to real time; I call this Nowcasting instead of Forecasting.
In order to do that, with agility and speed, you’re going to want a platform that’s designed for both in-motion and at-rest analytics.
I’m not seeing that in the marketplace today. In fact, I’m not seeing a focus on in-motion analytics.
When I refer to in-motion, I refer to the Velocity attribute of Big Data (people often talk about the Big Vs of Big Data, and that’s the one for in-motion). Velocity IS the game changer.
It’s not just how fast data is produced or changes, BUT the speed at which it must be understood, acted upon, and turned into something useful. So to me the main technical challenge in getting from at-rest to in-motion is that I’m not really seeing that kind of true integration elsewhere, and it’s something we squarely hit on in the IBM Big Data platform.
Let me share an example: if you were to build some text analytical function at rest in Hadoop, perhaps an email phrase that’s highly correlated with a customer churn event, you can SEAMLESSLY take that artifact and deploy it on InfoSphere Streams (our Big Data velocity engine) without any work at all; you just deploy the compiled AOG file. That’s what I mean by a platform.
The other challenge is just the volume and speed in which you have to process events. IBM invented our streaming products with the US government – and it can scale. For example, one of our clients analyzes and correlates over 5M market messages a second to execute algorithmic option trades with average latency of 50 microseconds.
The point is that this is not CEP; this is not 1 or 2 servers with 10-20,000 events a second. CEP can be a style or a technology.
You need to be able to do the style, but you need a technology platform too. If you asked me what is one of the biggest things IBM has done in the Big Data space, it is flattening the technical challenge to perform Big Data analytics on data in motion.

Q6. In your opinion, is there a technology which is best suited to build a Big Data Analytics Data Platform? If yes, which one?

Paul C. Zikopoulos: Well you say the word platform, and that’s going to imply a number of technologies. Right?
When I get asked this question, I refer to my Big Data Platform Manifesto; this is what you’re going to need to form a Big Data platform. Many people think big data is about Hadoop technology. It is and it isn’t. It’s about a lot more than Hadoop.
One of the key requirements is to understand and navigate federated sources of big data – to discover data in place.
New technology has emerged that discovers, indexes, searches, and navigates diverse sources of big data. Of course big data is also about Hadoop. Hadoop is a collection of open source capabilities.
Two of the most prominent ones are Hadoop Distributed File System (HDFS) for storing a variety of information, and MapReduce – a parallel processing engine.
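To make the HDFS-plus-MapReduce pairing concrete, here is the canonical word-count job written against the org.apache.hadoop.mapreduce API; the input and output paths are illustrative assumptions, and the sketch omits a combiner and other production details.

```java
// Sketch: a mapper and reducer running in parallel over blocks stored in HDFS.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);            // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));   // total per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```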
Data warehouses also manage big data – the volume of structured data is growing quickly. The ability to run deep analytic queries on huge volumes of structured data is a big data problem. It requires massively parallel processing data warehouses and purpose-built appliances for deep analytics.
Big data isn’t just at rest – it’s also in motion. Streaming data represents an entirely different big data problem – the ability to quickly analyze and act upon data while it’s still moving. This new technology opens a world of possibilities – from processing volumes of data that were just not practical to store, to detecting insight and responding quickly.
As much of the world’s big data is unstructured and in textual content, text analytics is a critical component for analyzing and deriving meaning from text.
And finally, integration and governance technology – ETL, data quality, security, MDM, and lifecycle management. Integration and governance technology establishes the veracity of big data, and is critical in determining whether information is trusted.
Last, there is consumability; characteristics here include such items as being able to declare what you want done, not how to do it, expert integrated systems, deployment patterns, and so on.

So if you want a short answer: a Big Data platform needs to be consumable and governable, give you the opportunity for analytics in motion and at rest (in an EDW AND things like Hadoop), discover and index Big Data, and finally provide the ability to analyze unstructured data.

Notice I didn’t mention one IBM product above; you can piece together a platform from a mash of vendors if you want. If you start to look into what IBM is doing, and although I’m biased because I work there, I think you will find we have a true Big Data platform.

Q6. Does it make sense in your opinion to virtualize Hadoop?

Paul C. Zikopoulos: It can. It’s going to depend on the use case, right? I see a lot of efforts by EMC in that area and that’s cool. Of course the Cloud and Hadoop kind of go hand in hand. I think this space is growing by leaps and bounds…fun to watch.

Q7. What is your opinion on the evolution of Hadoop?

Paul C. Zikopoulos: It’s just that – an evolution. I think that innovation is going to deliver more and more of what enterprises need from a ‘hardening’ aspect as time goes on. Hadoop 2.0 is a big step forward for availability. It’s out there now, but not ready for production in my humble opinion (although some vendors are shipping it, their documentation tells you it’s not ready for production). The next version of MapReduce (YARN) and making Hive really fast (Tez) are also part of the evolution; stay close here, it’s changing fast!
That’s the best part of community. Now if you look at most of the vendors in this space, many are getting distracted and working on non-Hadoop’ish things to help Hadoop, and that’s fine too. We’re on a good path here.
There are a lot of vendors here, and more popping up all the time (like Intel, which just announced its own distribution). At some point I think there will be a consolidation of distros out there, but with the hype around it right now, it will continue to evolve.
For example, it’s becoming more than just a MapReduce processing area. Right? Lots of technologies are storing data in Hadoop’s HDFS but bypassing MapReduce. So I find the file system key to the evolution.

Q8. Can In-Memory Data Management play a significant role for Big Data Analytics? If yes, how?

Paul C. Zikopoulos: I think it’s essential, but in a Big Data world, it would seem that the amount of data we are storing – at least right now – is proportionally bigger than the amount we can get into memory at a cost effective rate.
So in-memory needs to harmoniously live with the database. If you look at what we did with BLU Acceleration and DB2, we did just that.
In-memory columnar and typical relational tables live side by side in the same database kernel.
You can work with both structures together, in the same memory structures, queries, and so on.

When you can’t fit all the columns into memory, performance either falls off a cliff or, worse, the system can crash.

From an analytics side, BLU Acceleration allows you to run queries faster, amazingly faster. That’s going to drive more iterations of queries, analytics and whatnot. It’s not for everything, but if you can make my reports run faster, that’s cool. So imagine you find, in a Discovery Zone powered by a Hadoop engine, some interesting pieces of information; pulling that out, packing it into an in-memory structure and surfacing it to the enterprise is going to be pretty cool.

Q9. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?

Paul C. Zikopoulos: This is pretty important because I need the utility-like nature of a Hadoop cluster, without the capital investment. Time to analytics is the benefit here. After all, if you’re a start-up analytics firm seeking venture capital funding, do you really walk in to your investor and ask for millions to set up a cluster? You’ll get kicked out the door.
No, you go to Rackspace or Amazon, swipe a card, and get going. IBM is there with its Hadoop clusters (private and public), and you’re looking at clusters that cost as little as $0.60 US an hour.
I think at one time I priced out a 100-node Hadoop cluster for an hour and it was about $34 US, and the price has likely gone down since. What’s more, your cluster will be up and running in 30 minutes. So on-premise or off-premise, Cloud is key for these environments.

___________________________
Paul C. Zikopoulos, B.A., M.B.A., is the Director of Technical Professionals for IBM Software Group’s Information Management division and additionally leads the World Wide Competitive Database and Big Data Technical Sales Acceleration teams.
Paul is an award winning writer and speaker with more than 19 years of experience in Information Management.
Paul is seen as a global expert in Big Data and databases. He was picked by SAP as one of its “Top 50 Big Data Twitter Influencers”, named by BigData Republic to its “Top 100 Most Influential” list, listed by Technopedia as “A Big Data Expert to Follow”, and consulted on Big Data by the popular TV show “60 Minutes”.
Paul has written more than 350 magazine articles and 16 books, some of which include “Harness the Power of Big Data”, “Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data”, “Warp Speed, Time Travel, Big Data, and More: DB2 10 New Features”, “DB2 pureScale: Risk Free Agile Scaling”, “DB2 Certification for Dummies”, “DB2 for Dummies”, and more.
In his spare time, he enjoys all sorts of sporting activities, including running with his dog Chachi, avoiding punches in his MMA training, and trying to figure out the world according to Chloë—his daughter.

Related Posts

-On Virtualize Hadoop. Interview with Joe Russell. April 29, 2013

-On Pivotal HD. Interview with Scott Yara and Florian Waas. April 22, 2013

-On Big Data Velocity. Interview with Scott Jarr. January 28, 2013

Resources

- Harness the Power of Big Data The IBM Big Data Platform.
Paul C. Zikopoulos, Dirk deRoos, Krishnan Parasuraman, Thomas Deutsch, David Corrigan, James Giles, Chris Eaton.
Book, Copyright © 2013 by The McGraw-Hill Companies.
Download Book (.PDF 250 pages)

- Warp Speed, Time Travel, Big Data, and More. DB2 10 for Linux, UNIX, and Windows New Features.
Paul Zikopoulos, George Baklarz, Matt Huras, Walid Rjaibi, Dale McInnis, Matthias Nicola, Leon Katsnelson.
Book, Copyright © 2012 by The McGraw-Hill Companies.
Download book (.PDF 217 pages)

- Understanding Big Data Analytics for Enterprise Class Hadoop and Streaming Data.
Paul C. Zikopoulos, Chris Eaton, Dirk deRoos, Thomas Deutsch, George Lapis.
Book, Copyright © 2012 by The McGraw-Hill Companies.
Download book (.PDF 142 pages)

- ODBMS.org Resources on Big Data and Analytical Data Platforms:
Blog Posts | Free Software | Articles | Lecture Notes | PhD and Master Thesis|

- Follow ODBMS.org on Twitter: @odbmsorg

##

May 30 13

On PostgreSQL. Interview with Tom Kincaid.

by Roberto V. Zicari

“Application designers need to start by thinking about what level of data integrity they need, rather than what they want, and then design their technology stack around that reality. Everyone would like a database that guarantees perfect availability, perfect consistency, instantaneous response times, and infinite throughput, but it’s not possible to create a product with all of those properties.” –Tom Kincaid.

What is new with PostgreSQL? I have interviewed Tom Kincaid, head of Products and Engineering at EnterpriseDB.

RVZ

(Tom prepared the following responses with contributions from the EnterpriseDB development team)

Q1. EnterpriseDB products are based upon PostgreSQL. What is special about your product offering?

Tom Kincaid: EnterpriseDB has integrated many enterprise features and performance enhancements into the core PostgreSQL code to create a database with the lowest possible TCO and provide the “last mile” of service needed by enterprise database users.

EnterpriseDB’s Postgres Plus software provides the performance, security and Oracle compatibility needed to address a range of enterprise business applications. EnterpriseDB’s Oracle compatibility, also integrated into the PostgreSQL code base, allows many Oracle shops to realize a much lower database TCO while utilizing their Oracle skills and applications designed to work against Oracle databases.

EnterpriseDB also creates enterprise-grade tools around PostgreSQL and Postgres Plus Advanced Server for use in large-scale deployments. They are Postgres Enterprise Manager, a powerful management console for managing, monitoring and tuning databases en masse, whether they’re the PostgreSQL community version or EnterpriseDB’s enhanced Postgres Plus Advanced Server; xDB Replication Server, with multi-master replication and replication between Postgres, Oracle and SQL Server databases; and SQL/Protect, for guarding against SQL Injection attacks.

Q2. How does PostgreSQL compare with MariaDB and MySQL 5.6?

Tom Kincaid: There are several areas of difference. PostgreSQL has traditionally had a stronger focus on data integrity and compliance with the SQL standard.
MySQL has traditionally been focused on raw performance for simple queries, and a typical benchmark is the number of read queries per second that the database engine can carry out, while PostgreSQL tends to focus more on having a sophisticated query optimizer that can efficiently handle more complex queries, sometimes at the expense of speed on simpler queries. And, for a long time, MySQL had a big lead over PostgreSQL in the area of replication technologies, which discouraged many users from choosing PostgreSQL.

Over time, these differences have diminished. PostgreSQL’s replication options have expanded dramatically in the last three releases, and its performance on simple queries has greatly improved in the most recent release (9.2). On the other hand, MySQL and MariaDB have both done significant recent work on their query optimizers. So each product is learning from the strengths of the other.

Of course, there’s one other big difference, which is that PostgreSQL is an independent open source project that is not, and cannot be, controlled by any single company, while MySQL is now owned and controlled by Oracle.
MariaDB is primarily developed by the Monty Program and shows signs of growing community support, but it does not yet have the kind of independent community that PostgreSQL has long enjoyed.

Q3. Tomas Ulin mentioned in an interview that “with MySQL 5.6, developers can now commingle the “best of both worlds” with fast key-value look up operations and complex SQL queries to meet user and application specific requirements”. What is your take on this?

Tom Kincaid: I think anyone who is developing an RDBMS today has to be aware that there are some users who are looking for the features of a key-value store or document database.
On the other hand, many NoSQL vendors are looking to add the sorts of features that have traditionally been associated with an enterprise-grade RDBMS. So I think that theme of convergence is going to come up over and over again in different contexts.
That’s why, for example, PostgreSQL added a native JSON datatype as part of the 9.2 release, which is being further enhanced for the forthcoming 9.3 release.
Will we see a RESTful or memcached-like interface to PostgreSQL in the future? Perhaps.
Right now our customers are much more focused on improving and expanding the traditional RDBMS functionality, so that´s where our focus is as well.

Q4. How would you compare your product offering with respect to NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak, and NewSQL such as NuoDB and VoltDB?

Tom Kincaid: It is a matter of the right tools for the right problem. Many of our customers use our products together with the NoSQL solutions you mention. If you need ACID transaction properties for your data, with savepoints and rollback capabilities, along with the ability to access data in a standardized way and a large third party tool set for doing it, a time tested relational database is the answer.
The SQL standard provides the benefit of always being able to switch products and having a host of tools for reporting and administration. PostgreSQL, like Linux, provides the benefit of being able to switch service partners.

If your use case does not mandate the benefits mentioned above and you have data sets in the Petabyte range and require the ability to ingest Terabytes of data every 3-4 hours, a NoSQL solution is likely the right answer. As I said earlier many of our customers use our database products together with NoSQL solutions quite successfully. We expect to be working with many of the NoSQL vendors in the coming year to offer a more integrated solution to our joint customers.

Since it is still pretty new, I haven’t had a chance to evaluate NuoDB, so I can’t comment on how it compares with PostgreSQL or Postgres Plus Advanced Server.

As far as VoltDB is concerned there is a blog by Dave Page, our Chief Architect for tools and installers, that describes the differences between PostgreSQL and VoltDB. It can be found here.

There is also some terrific insight, on this topic, in an article by my colleague Bruce Momjian, who is one of the most active contributors to PostgreSQL, that can be found here.

Q5. Justin Sheehy of Basho in an interview said “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”. What is your opinion on this?

Tom Kincaid: It’s overly simplistic. There is certainly room for asynchronous multi-master replication in applications such as banking, but it has to be done very, very carefully to avoid losing track of the money.
It’s not clear that the NoSQL products which provide eventual consistency today make the right trade-offs or provide enough control for serious enterprise applications – or that the products overall are sufficiently stable. Relational databases remain the most mature, time-tested, and stable solution for storing enterprise data.
NoSQL may be appealing for Internet-focused applications that must accommodate truly staggering volumes of requests, but we anticipate that the RDBMS will remain the technology of choice for most of the mission-critical applications it has served so well over the last 40 years.

Q6. What are the suggested criteria for users when they need to trade durability for lower latency, higher throughput and write availability?

Tom Kincaid: Application designers need to start by thinking about what level of data integrity they need, rather than what they want, and then design their technology stack around that reality.
Everyone would like a database that guarantees perfect availability, perfect consistency, instantaneous response times, and infinite throughput, but it’s not possible to create a product with all of those properties.

If you have an application that has a large write throughput and you assume that you can store all of that data using a single database server, which has to scale vertically to meet the load, you’re going to be unhappy eventually. With a traditional RDBMS, you’re going to be unhappy when you can’t scale far enough vertically. With a distributed key-value store, you can avoid that problem, but then you have all the challenges of maintaining a distributed system, which can sometimes involve correlated failures, and it may also turn out that your application makes assumptions about data consistency that are difficult to guarantee in a distributed environment.

By making your assumptions explicit at the beginning of the project, you can consider alternative designs that might meet your needs better, such as incorporating mechanisms for dealing with data consistency issues, or even building application-level sharding into the application itself (see the sketch below).
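As a toy illustration of what application-level sharding can mean, the sketch below maps a key deterministically onto one of several database connection URLs; the URLs are invented, and a production router would also have to deal with resharding, replicas, failover, and cross-shard queries.

```java
// Sketch: hash-based routing of a customer id to one of several shards.
import java.util.Arrays;
import java.util.List;

public class ShardRouter {
    private final List<String> shardUrls;

    public ShardRouter(List<String> shardUrls) {
        this.shardUrls = shardUrls;
    }

    /** Pick the shard that owns a given customer id. */
    public String shardFor(String customerId) {
        int bucket = Math.floorMod(customerId.hashCode(), shardUrls.size());
        return shardUrls.get(bucket);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(Arrays.asList(
                "jdbc:postgresql://shard0.example.com/app",
                "jdbc:postgresql://shard1.example.com/app"));
        System.out.println(router.shardFor("customer-42"));
    }
}
```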

Q7. How do you handle Large Objects Support?

Tom Kincaid: PostgreSQL supports storing objects up to 1GB in size in an ordinary database column.
For larger objects, there’s a separate large object API. In current releases, those objects are limited to 2GB, but the next release of PostgreSQL (9.3) will increase that limit to 4TB. We don’t necessarily recommend storing objects that large in the database, though; in many cases, it’s more efficient to store enormous objects on a file server rather than as database objects. But the capabilities are there for those who need them.
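The “ordinary column” approach might look like the following JDBC sketch, which streams a file into a bytea column; the connection details, table and file names are invented, and the separate large object API mentioned above has its own driver-specific interface that is not shown here.

```java
// Sketch: storing a file in a bytea column over JDBC.
import java.io.File;
import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class ByteaExample {
    public static void main(String[] args) throws Exception {
        File file = new File("/tmp/report.pdf");
        try (Connection c = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "secret")) {
            try (Statement s = c.createStatement()) {
                s.execute("CREATE TABLE IF NOT EXISTS attachments ("
                        + " id serial PRIMARY KEY, name text, data bytea)");
            }
            try (PreparedStatement ps = c.prepareStatement(
                    "INSERT INTO attachments (name, data) VALUES (?, ?)");
                 FileInputStream in = new FileInputStream(file)) {
                ps.setString(1, file.getName());
                ps.setBinaryStream(2, in, file.length());   // stored as bytea
                ps.executeUpdate();
            }
        }
    }
}
```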

Q8. Do you use Data Analytics at EnterpriseDB and for what?

Tom Kincaid: Most companies today use some form of data analytics to understand their customers and their marketplace, and we’re no exception. However, how we use data is rapidly changing given our rapid growth and deepening penetration into key markets.

Q9. Do you have customers who have Big Data problem? Could you please give us some examples of Big Data Use Cases?

Tom Kincaid: We have found that most customers with big data problems are using specialized appliances and in fact we partnered with Netezza to assist in creating such an appliance – The Netezza TwinFin Data Warehousing appliance.
See here.

Q10. How do you handle the Big Data Analytics “process” challenges with deriving insight?

Tom Kincaid: EnterpriseDB does not specialize in solutions for the Big Data market and will refer prospects to specialists like Netezza.

Q11. Do you handle unstructured data? If yes, how?

Tom Kincaid: PostgreSQL has an integrated full-text search capability that can be used for document processing, and there are also XML and JSON data types that can be used for data of those types. We also have a PostgreSQL-specific data type called hstore that can be used to store groups of key-value pairs.
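A small JDBC sketch of those three capabilities, a json column validated on insert, an hstore column of key-value pairs, and built-in full-text search, is shown below; the table, column and connection names are invented, and it assumes the hstore extension has been installed (CREATE EXTENSION hstore) and the PostgreSQL JDBC driver is on the classpath.

```java
// Sketch: json storage, hstore key-value pairs, and full-text search via JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class UnstructuredDataExample {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "secret")) {

            try (Statement s = c.createStatement()) {
                s.execute("CREATE TABLE IF NOT EXISTS docs ("
                        + " id serial PRIMARY KEY,"
                        + " body json,"                 // JSON document, validated on insert
                        + " attrs hstore,"              // schema-free key/value pairs
                        + " content text)");            // free text for full-text search
            }

            try (PreparedStatement ps = c.prepareStatement(
                    "INSERT INTO docs (body, attrs, content) "
                  + "VALUES (?::json, ?::hstore, ?)")) {
                ps.setString(1, "{\"title\": \"PostgreSQL 9.2\", \"tags\": [\"json\"]}");
                ps.setString(2, "\"author\"=>\"tom\", \"format\"=>\"interview\"");
                ps.setString(3, "PostgreSQL has integrated full-text search");
                ps.executeUpdate();
            }

            // hstore lookup and full-text search in one query.
            try (PreparedStatement ps = c.prepareStatement(
                    "SELECT id, attrs -> 'author' AS author FROM docs "
                  + "WHERE to_tsvector('english', content) @@ to_tsquery('search')");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("author"));
                }
            }
        }
    }
}
```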

Q12. Do you use Hadoop? If yes, what is your experience with Hadoop so far?

Tom Kincaid: We developed, and released in late 2011, our Postgres Plus Connector for Hadoop, which allows massive amounts of data from a Postgres Plus Advanced Server (PPAS) or PostgreSQL database to be accessed, processed and analyzed in a Hadoop cluster. The Postgres Plus Connector for Hadoop allows programmers to process large amounts of SQL-based data using their familiar MapReduce constructs. Hadoop combined with PPAS or PostgreSQL enables users to perform real-time queries with Postgres and non-real-time, CPU-intensive analysis in Hadoop; with our connector, users can load SQL data into Hadoop, process it, and even push the results back to Postgres.

Q13. Cloud computing and open source: how do they relate to PostgreSQL?

Tom Kincaid: In 2012, EnterpriseDB released its Postgres Plus Cloud Database. We’re seeing a wide-scale migration to cloud computing across the enterprise. With that growth has come greater clarity in what developers need in a cloudified database. The solutions are expected to deliver lower costs and management ease with even greater functionality because they are taking advantage of the cloud.

______________________
Tom Kincaid. As head of Products and Engineering, Tom leads the company’s product development and directs the company’s world-class software engineers. Tom has nearly 25 years of experience in the Enterprise Software Industry.
Prior to EnterpriseDB, he was VP of software development for Oracle’s GlassFish and Web Tier products.
He integrated Sun’s Application Server Product line into Oracle’s Fusion middleware offerings. At Sun Microsystems, he was part of the original Java EE architecture and management teams and played a critical role in defining and delivering the Java Platform.
Tom is a veteran of the Object Database industry and helped build Object Design’s customer service department holding management and senior technical contributor roles. Other positions in Tom’s past include Director of Quality Engineering at Red Hat and Director of Software Engineering at Unica.

Related Posts

- MySQL-State of the Union. Interview with Tomas Ulin. February 11, 2013

- On Eventual Consistency– Interview with Monty Widenius. October 23, 2012

Resources

ODBMS.org: Relational Databases, NewSQL, XML Databases, RDF Data Stores
Blog Posts |Free Software | Articles and Presentations| Lecture Notes | Tutorials| Journals |

Follow ODBMS.org on Twitter: @odbmsorg

##

May 21 13

On Real Time NoSQL. Interview with Brian Bulkowski.

by Roberto V. Zicari

“The benefits of hybrid storage are lower cost, and more varied price/performance options. We are seeing SSD pricing range down to $3/GB for enterprise drives, and less than $1/GB for consumer drives, and we also support using standard rotational drives for persistence. Our goal is in-memory performance with the price of ‘storage’ – which makes multi-terabyte analytics caches possible” –Brian Bulkowski.

I have interviewed Brian Bulkowski, Founder & CTO of Aerospike.

RVZ

Q1. What is Aerospike?
Brian Bulkowski: Aerospike is a high speed and highly available NoSQL database. We have proven levels of performance and reliability on commodity hardware that go far beyond any other database – and have the years of production experience to back up our claims.

Q2. How does it compare with other NoSQL or NewSQL databases?
Brian Bulkowski: Aerospike has specialized in areas where downtime is not an option – and at the highest levels of performance. Another area of specialization is applications where low latencies dramatically simplify application development.

As a row-based store with very high levels of read and write concurrency, we perform well where behavior and interactions are concerned – whether those interactions are user-oriented, such as advertising and gaming; or machine to machine, such as capital markets or the emerging Internet of Things.

Q3. What are the typical applications for which Aerospike is currently in use?
Brian Bulkowski: A number of our customers use Aerospike for real-time audience and behavior analysis. A common requirement is to hold profile information for several billion cookies, with a rich schema-free ability to add new data and types. In this use case, a cost-effective multi-terabyte store changes what a business can achieve.

Other use cases include massive multi-player gaming, where a shared game state (at scale) is key to a great user experience, and retaining peak capacity is critical.

Q4. Could you give us some detail on Aerospike`s hybrid memory architecture?
Brian Bulkowski: One of the greatest challenges with databases is keeping indexes in sync with data, and inserting and deleting index entries while under read load. In this area, in-memory techniques shine. An in-memory index can handle a high level of parallelism, and support simultaneous read and write loads. Indexes can be made very small – Aerospike has optimized index entries to one CPU cache line.

Aerospike also uses the technique of allowing a process to restart and retain its index, by using the shared memory facility of Linux. An index needs to be recalculated only when a machine is restarting, and in those cases it can be best to simply use the promoted replica during the short time the index is being re-created.

Q5. What is the main advantage of using a hybrid memory architecture?
Brian Bulkowski: The benefits of hybrid storage are lower cost, and more varied price/performance options. We are seeing SSD pricing range down to $3/GB for enterprise drives, and less than $1/GB for consumer drives, and we also support using standard rotational drives for persistence. Our goal is in-memory performance with the price of ‘storage’ – which makes multi-terabyte analytics caches possible.

Q6. How do you ensure scale up and scale out?
Brian Bulkowski: Scale up is maintained through short individual code paths, and single-server performance benchmarks. Our architecture focuses on enabling multiple cores, deployment configurations with appropriate interrupt routing rules, and classic database techniques like asynchronous IO.

Scale out is ensured through a number of techniques, such as enforced randomness through provably random storage distribution, a minimal and deterministic number of nodes participating in a transaction, and client techniques which limit the effect of a single slow node on other nodes.

Q7. Do you have any performance measures to share with us?
Brian Bulkowski: Please see this performance benchmark report from Thumbtack Technology, demonstrating how Aerospike is 10 times faster than MongoDB or Cassandra.

Please also see Aerospike Principal Architect Russ Sullivan’s High Scalability blog post on achieving 1M TPS on commodity hardware using Aerospike. This use case involves more system tuning than the Thumbtack tests.

Q8. How do you handle real-time big data analytics?
Brian Bulkowski: Many customers use Aerospike as a persistent part of their real-time data analytics processing chain. Their application logic needs to know counts of recent visits, recent searches, preferences, preferences of friends, and also store batch analytics results. For example, you might calculate that a certain user is part of an audience, then keep up-to-date per-audience counters.

By placing the calculations in the hands of the developer – with extreme low latency – we find developers and architects are able to scale increasingly powerful real-time computations.
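A hedged sketch of that counter pattern with the Aerospike Java client (com.aerospike.client) might look like the following; the namespace, set, bin names and host are invented for the example, and default policies are used throughout.

```java
// Sketch: per-profile visit counters and audience attributes in Aerospike.
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class AudienceCounterExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        try {
            Key profile = new Key("profiles", "cookies", "cookie-abc123");

            // Record a visit and tag the profile with an audience segment.
            client.add(null, profile, new Bin("visits", 1));          // atomic increment
            client.put(null, profile, new Bin("segment", "sports"));  // upsert attribute

            // Read the profile back with low latency for ad-serving decisions.
            Record r = client.get(null, profile);
            if (r != null) {
                System.out.println("visits=" + r.getLong("visits")
                                 + " segment=" + r.getString("segment"));
            }
        } finally {
            client.close();
        }
    }
}
```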

Q9. You claim to provide 17x better TCO than other NoSQL databases. Could you please provide some evidence for this claim?
Brian Bulkowski: A great benefit of Aerospike is the ability to store data in Flash, and indexes in DRAM. This plays to the strengths of both SSD and DRAM, as memory accelerates the indexes and finding of the data, and then only one device data read is required to fetch (or write) the data.

Most NoSQL databases require all the data in memory in order to achieve high performance. A more detailed implementation is outlined in David Floyer’s Dec 12, 2012 Wikibon article.

Q10. How do you manage replication in Aerospike?
Brian Bulkowski: We commit each write to memory, and reserve space for the data to be written to disk. The writes are aggregated into large blocks to allow the streaming writes that are optimal for both SSDs and standard rotational disks, which passes through a short disk queue (often committing in less than a few milliseconds).

Each row has a current master server within the cluster. This mapping changes as servers are added and removed, and one or more replicas are chosen randomly from the remaining nodes.

Q11. What about data consistency when updating the data?
Brian Bulkowski: By applying the change to all relevant machines in the cluster, we achieve the important application requirement of a write taking effect immediately. By applying the write to multiple machines in-memory, we achieve very high reliability, because multiple machines must fail in the same millisecond to have uncommitted transactions. When the application code reads, they will always see the updated data by default – with no performance impact.

Q12. Could you please explain how cross data center synchronization works for Aerospike?
Brian Bulkowski: Our cross data center synchronization is a no-single-point of failure asynchronous replication mechanism. Asynchronous replication is important because a network outage or bandwidth issues impedes long haul links, but a data center can still serve requests based on its local cluster. The system can be operated both with a ‘star topology’, where a single master cluster is replicated to other more local data centers, or in a ‘ring topology’, where every cluster takes writes and those writes are forwarded around the ring. Conflicts can happen in a ring deployment if a row is modified in different locations at the same time.
This requires a conflict resolution system – like Aerospike’s versions – that retains conflicts for later resolution, or a heuristic – retaining a row with the most writes and a later timestamp.

Q13. What is the contribution of Don Haderle to Aerospike?
Brian Bulkowski: Don first exposed us to DB2‘s historical strategy of building very fast and reliable single-row operations, then adding higher level primitives. Primary key database operations are similar to storage operations in a file store, and have extraordinary business value. By deploying with many paying customers, we’ve proven our reliable and fast primary key layer – unlike other new, clustered databases which are still seeking large production proof.

———————
Brian Bulkowski, Co-Founder & CTO Aerospike
Brian has 20+ years of experience designing, developing and tuning networking systems and high-performance, high-scale infrastructures. He founded Aerospike after learning first-hand the scaling limitations of sharded MySQL systems at Aggregate Knowledge. As Director of Performance at this Media Intelligence SaaS company, Brian’s team built and operated a clustered recommendation engine. Prior to Aggregate Knowledge, Brian was a founding member of the digital TV team at Navio Communications and Chief Architect of Cable Solutions at Liberate Technologies, where he built the high-performance embedded networking stack and the internet-scale broadcast server infrastructure. Before Liberate, Brian was a Lead Engineer at Novell, where he was responsible for the AppleTalk stack for Netware 3 and 4. Brian holds a B.S. in Mathematics/Computer Science from Brown University.

Resources

NoSQL Failover Characteristics: Aerospike, Cassandra, Couchbase, and MongoDB.
Ben Engber, CEO, Thumbtack Technology
Abstract:
Thumbtack’s latest whitepaper focuses on how NoSQL databases perform during hardware failures. This paper follows our previous study of NoSQL performance and explores how different consistency and durability models affect hardware planning and application design.
Paper | Advanced| English |DOWNLOAD (PDF)| MARCH 2013

Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs.
Ben Engber, CEO, Thumbtack Technology, JANUARY 2013
Abstract:
One question that comes up a lot in our discussions with customers is “Tell me which NoSQL database is best.” Of course, the answer to this depends on the use cases and the transactional needs of the specific customer. We regularly go through this analysis with customers, but one thing we find as we start is often the basic questions of how much consistency is needed and what are the tradeoffs involved that have not been answered. Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs is a benchmark report created to help organizations answer exactly those questions pertaining to NoSQL databases. Our goal was to create a benchmark that addressed a specific class of real-world problems with big data, unlike other published reports that are out there. We focused on how the durability and consistency needs affected raw performance.
Paper | Advanced| English | DOWNLOAD (PDF)| 2013|

- Don Haderle, Father of IBM DB2, Talks about the Information Explosion:
Watch Video.

Related Posts

-On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen. May 13, 2013

Follow ODBMS.org on Twitter: @odbmsorg

##

May 13 13

On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen

by Roberto V. Zicari

“The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point”
–Kingsley Uyi Idehen.

I have interviewed Kingsley Idehen founder and CEO of OpenLink Software. The main topics of this interview are: the Semantic Web, and the Virtuoso Hybrid Data Server.

RVZ

Q1. The vision of the Semantic Web is the one where web pages contain self-describing data that machines will be able to navigate as easily as humans do now. What are the main benefits? Who could profit most from the Semantic Web?

K​ingsley Uyi Idehen: The vision of a Semantic Web is actually the vision of the Web. Unbeknownst to most, they are one and the same. The goal was always to have HTTP URIs denote things, and by implication, said URIs basically resolve to their meaning [1] [2].
Paradoxically, the Web bootstrapped on the back of URIs that denoted HTML documents (due to Mosaic’s ingenious exploitation of the “view source” pattern [3]) thereby accentuating its Web of hyper-linked Documents (i.e., Information Space) aspect while leaving its Web of hyper-linked Data aspect somewhat nascent.
The nascence of the Web of hyper-linked Data (aka Web of Data, Web of Linked Data, etc.) laid the foundation for the “Semantic Web Project”, which naturally evolved into “The Semantic Web” meme. Unfortunately, “The Semantic Web” meme hit a raft of issues (many self-inflicted) that basically disconnected it from its original Web vision and architectural reality.
The Semantic Web is really about the use of hypermedia to enhance the long understood entity relationship model [4] via the incorporation of _explicit_ machine- and human-comprehensible entity relationship semantics via the RDF data model. Basically, RDF is just about an enhancement to the entity relationship model that leverages URIs for denoting entities and relations that are described using subject->predicate->object based proposition statements.
For the rest of this interview, I would encourage readers to view “The Semantic Web” phrase as meaning: a Web-scale entity relationship model driven by hypermedia resources that bear entity relationship model description graphs that describe entities and their relations (associations).

To answer your question, the benefits of the Semantic Web are as follows: fine-grained access to relevant data on the Web (or private Web-like networks) with increasing degrees of serendipity [5].
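
As a concrete illustration of the subject->predicate->object propositions described above, here is a minimal sketch using the Python rdflib library; the example URIs are made up, and the exact return type of serialize() varies by rdflib version.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/id/")  # illustrative namespace

g = Graph()
# One proposition: the entity denoted by <.../kingsley> knows <.../roberto>.
g.add((EX["kingsley"], FOAF.knows, EX["roberto"]))
# Another: a literal-valued attribute of the same entity.
g.add((EX["kingsley"], FOAF.name, Literal("Kingsley Uyi Idehen")))

print(g.serialize(format="turtle"))
```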

Q2. Who is currently using Semantic Web technologies and how? Could you please give us some examples of current commercial projects?

Kingsley Uyi Idehen: I wouldn't use "project" to describe endeavors that exploit Semantic Web oriented solutions. Basically, you have entire sectors being redefined by this technology. Examples range from "Open Government" (US, UK, Italy, Spain, Portugal, Brazil, etc.) all the way to publishing (BBC, Globo, Elsevier, New York Times, Universal, etc.) and then across to pharmaceuticals (OpenPHACTs, St. Judes, Mayo, etc.) and automobiles (Daimler Benz, Volkswagen, etc.). The Semantic Web isn't an embryonic endeavor deficient in use cases and case studies; far from it.

Q3. Virtuoso is a Hybrid RDBMS/Graph Column store. How does it differ from relational databases and from XML databases?

Kingsley Uyi Idehen: First off, we really need to get the definitions of databases clear. As you know, the database management technology realm is vast. For instance, there is no such thing as a non-relational database.
Such a system would be utterly useless beyond a comprehensible definition to a marginally engaged audience. A relational database management system is typically implemented with support for a relational-model-oriented query language, e.g., SQL, QUEL, OQL (from the Object DBMS era), and more recently SPARQL (for RDF-oriented databases and stores). Virtuoso comprises a relational database management system that supports SQL, SPARQL, and XQuery. It is optimized to handle data organized as relational tables and/or relational property graphs (aka entity relationship graphs). Thus, Virtuoso is about providing you with the ability to exploit the intensional (open world propositions or claims) and extensional (closed world statements of fact) aspects of relational database management without imposing either on its users.

Q4. Is there any difference with Graph Data stores such as Neo4j?

Kingsley Uyi Idehen: Yes, as per my earlier answer, it is a hybrid relational database server that supports relational tables and entity-relationship-oriented property graphs. Its support for RDF's data model enables the use of URIs as native types. Thus, every entity in a Virtuoso DBMS is endowed with a URI as its _super key_. You can de-reference the description of a Virtuoso entity from anywhere on a network, subject to data access policies and resource access control lists.
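
That de-referencing is easy to try from code as well as from a browser. A hedged sketch with Python's requests library, using a public DBpedia URI purely as an illustration of resolving an entity's _super key_ via content negotiation:

```python
import requests

# De-reference an HTTP URI that denotes an entity; the Accept header asks the
# server for an RDF (Turtle) description rather than an HTML page.
uri = "http://dbpedia.org/resource/Virtuoso_Universal_Server"
response = requests.get(uri, headers={"Accept": "text/turtle"}, allow_redirects=True)

print(response.status_code)
print(response.text[:500])  # the first part of the entity description
```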

Q5. How do you position Virtuoso with respect to NoSQL (e.g Cassandra, Riak, MongoDB, Couchbase) and to NewSQL (e.g.NuoDB, VoltDB)?

Kingsley Uyi Idehen: Virtuoso is a SQL, NoSQL, and NewSQL offering. Its URI based _super keys_ capability differentiates it from other SQL, NewSQL, and NoSQL relational database offerings, in the most basic sense. Virtuoso isn't a data silo, because its keys are URI based. This is a "deceptively simple" claim that is very easy to verify and understand. All you need is a Web browser to prove the point, i.e., a Virtuoso _super key_ can be placed in the address bar of any browser en route to exposing a hypermedia-based entity relationship graph that is navigable using the Web's standard follow-your-nose pattern.

Q6. RDF can be encoded in various formats. How do you handle that in Virtuoso?

K​ingsley Uyi Idehen: Virtuoso supports all the major syntax notations and data serialization formats associated with the RDF data model. This implies support for N-Triples, Turtle, N3, JSON-LD, RDF/JSON, HTML5+Microdata, (X)HTML+RDFa, CSV, OData+Atom, OData+JSON.
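
For example, with rdflib the same graph can be parsed once and re-serialized in several of the notations listed above (a sketch; JSON-LD output may depend on the rdflib version and its json-ld plugin):

```python
from rdflib import Graph

g = Graph()
# Parse a one-triple Turtle document, then emit the same triple in other notations.
g.parse(data='<http://example.org/a> <http://example.org/p> "hello" .', format="turtle")

print(g.serialize(format="nt"))       # N-Triples
print(g.serialize(format="turtle"))   # Turtle
print(g.serialize(format="json-ld"))  # JSON-LD (plugin-dependent)
```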

Q7. Does Virtuoso restrict the contents to triples?

Kingsley Uyi Idehen: Assuming you mean: how does it enforce integrity constraints on triple values?
It doesn't enforce anything per se, since the principle here is "schema last", whereby you don't have a restrictive schema acting as an inflexible view over the data (as is the case with conventional SQL relational databases). Of course, an application can apply reasoning to OWL (Web Ontology Language) based relation semantics (i.e., in the so-called RBox) as an option for constraining the entity types that constitute triples. In addition, we will soon be releasing a SPARQL Views mechanism that provides a middle ground for this matter, whereby the aforementioned view can be used in a loosely coupled manner at the application, middleware, or DBMS layer for applying constraints to the entity types that constitute relations expressed by RDF triples.

Q8. RDF can be represented as a direct graph. Graphs, as data structure do not scale well. How do you handle scalability in Virtuoso? How do you handle scale-out and scale-up?

Kingsley Uyi Idehen: The fundamental mission statement of Virtuoso has always been to destroy any notion of performance and scalability as impediments to entity relationship graph model oriented database management. The crux of the matter with regard to Virtuoso is that it is massively scalable for the following reasons:
• fine-grained multi-threading scoped to CPU cores
• vectorized (array) execution of query commands across fine-grained threads
• column-store based physical storage which provides storage layout and data compaction optimizations (e.g., key compression)
• share-nothing clustering that scales from multiple instances (leveraging the items above) on a single machine all the way up to a cluster comprised of multiple machines.
The scalability prowess of Virtuoso is clearly showcased via live Web instances such as DBpedia and the LOD Cloud Cache (50+ billion triples). You also have no shortage of independent benchmark reports to complement the live instances:
50 – 150 Billion scale Berlin SPARQL Benchmark (BSBM) report (.pdf)
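
Those live instances can be queried directly over SPARQL. A minimal sketch with the SPARQLWrapper Python library against the public DBpedia endpoint (the endpoint URL and query are illustrative):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# DBpedia is one of the public Virtuoso-backed instances mentioned above.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Berlin> rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["label"]["value"])
```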

Q9. Could you give us some commercial examples where Virtuoso is in use?

K​ingsley Uyi Idehen: Elsevier, Globo, St. Judes Medical, U.S. Govt., EU, are a tiny snapshot of entities using Virtuoso on a commercial basis.

Q10. Do you plan in the near future to develop integration interfaces to other NoSQL data stores?

K​ingsley Uyi Idehen: If a NewSQL or NoSQL store supports any of the following, their integration with Virtuoso is implicit: HTTP based RESTful interaction patterns, SPARQL, ODBC, JDBC, ADO.NET, OLE-DB. In the very worst of cases, we have to convert the structured data returned into 5-Star Linked Data using Virtuoso’s in-built Linked Data middleware layer for heterogeneous data virtualization.

Q11. Virtuoso supports SPARQL. SPARQL is not SQL; how do you handle querying relational data then?

Kingsley Uyi Idehen: Virtuoso supports SPARQL, SQL, SQL inside SPARQL, and SPARQL inside SQL (we call this SPASQL). Virtuoso has always had its own native SQL engine, and that's integral to the entire product. Virtuoso provides an extremely powerful and scalable SQL engine, as exemplified by the fact that the RDF data management services are basically driven by the SQL engine subsystem.
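
A hedged sketch of what calling SPARQL through Virtuoso's SQL channel can look like from Python over ODBC; the DSN, credentials, and the exact SPASQL invocation are assumptions for illustration rather than a verified recipe for any particular Virtuoso release.

```python
import pyodbc

# "VirtuosoDSN" is a placeholder ODBC data source name pointing at a Virtuoso server.
conn = pyodbc.connect("DSN=VirtuosoDSN;UID=dba;PWD=dba")
cursor = conn.cursor()

# SPARQL issued through the SQL channel: the statement is handled by the same
# engine that serves plain SQL queries.
cursor.execute("SPARQL SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")
for subject, predicate, obj in cursor.fetchall():
    print(subject, predicate, obj)

conn.close()
```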

Q12. How do you support Linked Open Data? What, in your opinion, are the main benefits of Linked Open Data?

Kingsley Uyi Idehen: Virtuoso enables you to expose data from the following sources, courtesy of its in-built 5-star Linked Data Deployment functionality:
• RDF based triples loaded from Turtle, N-Triples, RDF/XML, CSV etc. documents
• SQL relational databases via ODBC or JDBC connections
• SOAP based Web Services
• Web Services that provide RESTful interaction patterns for data access.
• HTTP accessible document types e.g., vCard, iCalendar, RSS, Atom, CSV, and many others.

Q13. What are the most promising application domains where you can apply triple store technology such as Virtuoso?

K​ingsley Uyi Idehen: Any application that benefits from high-performance and scalable access to heterogeneously shaped data across disparate data sources. Healthcare, Pharmaceuticals, Open Government, Privacy enhanced Social Web and Media, Enterprise Master Data Management, Big Data Analytics etc..

Q14. Big Data Analysis: could you connect Virtuoso with Hadoop? How does Virtuoso relate to commercial data analytics platforms, e.g., Hadapt and Vertica?

Kingsley Uyi Idehen: You can integrate data managed by Hadoop-based ETL workflows via ODBC, or via Web Services driven by Hadoop clusters that expose RESTful interaction patterns for data access. As for how Virtuoso relates to the likes of Vertica regarding analytics, this is about Virtuoso being the equivalent of Vertica plus the added capability of RDF based data management, Linked Data Deployment, and share-nothing clustering. There is no job that Vertica performs that Virtuoso can't perform.
There are several jobs that Virtuoso can perform that Vertica, VoltDB, Hadapt, and many other NoSQL and NewSQL offerings simply cannot perform with regard to scalable, high-performance RDF data management and Linked Data deployment. Remember, RDF based Linked Data is all about data management and data access without any kind of platform lock-in. Virtuoso locks you into a value proposition (performance and scale), not the platform itself.

Q15. Do you also benchmark loading trillions of RDF triples? Do you have current benchmark results? How long does it take to query them?

K​ingsley Uyi Idehen: As per my earlier responses, there is no shortage of benchmark material for Virtuoso.
The benchmarks are also based on realistic platform configurations unlike the RDBMS patterns of the past which compromised the utility of TPC benchmarks.

Q16. In your opinion, what are the main current obstacles for the adoption of Semantic Web technologies in the Enterprise?

Kingsley Uyi Idehen: The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point.

Links:

[1]. — 5-Star Linked Data URIs and Semiotic Triangle
[2]. — what do HTTP URIs Identify?
[3]. — View Source Pattern & Web Bootstrap
[4]. — Unified View of Data using the Entity Relationship Model (Peter Chen’s 1976 dissertation)
[5]. — Serendipitous Discovery Quotient (SDQ).

——————–
Kingsley Idehen is the Founder and CEO of OpenLink Software. He is an industry acclaimed technology innovator and entrepreneur in relation to technology and solutions associated with data management systems, integration middleware, open (linked) data, and the semantic web.

Kingsley has been at the helm of OpenLink Software for over 20 years during which he has actively provided dual contributions to OpenLink Software and the industry at large, exemplified by contributions and product deliverables associated with: Open Database Connectivity (ODBC), Java Database Connectivity (JDBC), Object Linking and Embedding (OLE-DB), Active Data Objects based Entity Frameworks (ADO.NET), Object-Relational DBMS technology (exemplified by Virtuoso), Linked (Open) Data (where DBpedia and the LOD cloud are live showcases), and the Semantic Web vision in general.
————-

Resources

50 – 150 Billion scale Berlin SPARQL Benchmark (BSBM) report (.pdf)

- History of Virtuoso

-ODBMS.org free resources on : Relational Databases, NewSQL, XML Databases, RDF Data Stores

Related Posts

- Graphs vs. SQL. Interview with Michael Blaha April 11, 2013

- MySQL-State of the Union. Interview with Tomas Ulin. February 11, 2013

Follow ODBMS Industry Watch on Twitter: @odbmsorg

##

Apr 29 13

On Virtualize Hadoop. Interview with Joe Russell.

by Roberto V. Zicari

“A common misconception when virtualizing Hadoop clusters is that we decouple the data nodes from the physical infrastructure. This is not necessarily true. When users virtualize a Hadoop cluster using Project Serengeti, they separate data from compute while preserving data locality. By preserving data locality, we ensure that performance isn’t negatively impacted, or essentially making the infrastructure appear as static.” – Joe Russell.

VMware announced in June last year an open source project called Serengeti.
The main idea of Project Serengeti is to enable users and companies to quickly deploy, manage and scale Apache Hadoop on virtual infrastructure.
I have interviewed Joe Russell, VMware, Product Line Marketing Manager, Big Data.

RVZ

Q1. Why Virtualize Hadoop?

Joe Russell: Hadoop is a technology that Enterprises are increasingly using to process large amounts of information. While the technology is generally pretty early in its lifecycle, we are starting to see more enterprise-grade use cases. In its current form, Hadoop is difficult to use and lacks the toolsets to efficiently deploy, run and manage Hadoop clusters in an Enterprise context. Virtualizing Hadoop not only brings enterprise-tested High Availability and Fault Tolerance to Hadoop, but it also allows for much more agile and automated management of Hadoop clusters. Additionally, virtualization allows for separation of data and compute, which allows users to preserve data locality and paves the way towards more advanced use cases such as mixed workload deployments and Hadoop-as-a-service.

Q2. You claim to be able to deploy an Apache Hadoop cluster (HDFS, MapReduce, Pig, Hive) in minutes on an existing vSphere cluster using Serengeti. How do you do this? Could you give us an example on how you customize a Hadoop Cluster?

Joe Russell: There is a downloadable virtual appliance OVA file that is available here. There is a complete user guide that can be found here (.pdf).

You will be able to customize Hadoop clusters easily from within the Serengeti tool by specifying node and resource allocations through an easy to use user interface.

Q3. There are concerns about the approach of decoupling Apache Hadoop nodes from the underlying physical infrastructure. Quoting Steve Loughran (HP Research): “Hadoop contains lots of assumptions about running in a static infrastructure; its scheduling and recovery algorithms assume this.” What is your take on this?

Joe Russell: A common misconception when virtualizing Hadoop clusters is that we decouple the data nodes from the physical infrastructure. This is not necessarily true. When users virtualize a Hadoop cluster using Project Serengeti, they separate data from compute while preserving data locality. By preserving data locality, we ensure that performance isn’t negatively impacted, or essentially making the infrastructure appear as static. Additionally, it creates true multi-tenancy within more layers of the Hadoop stack, not just the name node.

I think there is some confusion when we say “in the cloud”. Here, Steve is talking about running it on a public cloud like Amazon. Steve is largely introducing the concept of data locality, or the notion that large amounts of data are hard to move. In this scenario, it makes sense to bring compute resources to the data to ensure performance isn’t negatively impacted by networking limitations. VMware advocates that Hadoop should be virtualized, as it introduces a level of flexibility and management that allows companies to easily deploy, manage, and scale internal Hadoop clusters.

Q4. How do ensure High Availability (HA)? How do you protect against host and VM failures?

Joe Russell: We ensure High Availability (HA) by leveraging vSphere’s tested solution via Project Serengeti’s integration with vCenter (management console of vSphere).

In the event of physical server failure, affected virtual machines are automatically restarted on other production servers with spare capacity. In the case of operating system failure, vSphere HA restarts the affected virtual machine on the same physical server.
In Hadoop nomenclature, this means that there is HA on more than just the name node. vSphere’s solution also allows for HA on the jobtracker node, metastores, and on the management server, which are critical pieces of any Hadoop system that require high availability.

More importantly, as Hadoop is a batch-oriented process, when a physical host does fail you need to be able to pause and then restart the job from the point in time at which it went down. VMware's vSphere solution allows for this and has been tested amongst the biggest Enterprises for the better part of the past decade.

Q5. How do you get Data Insights? Do you already have examples how such Virtualize Hadoop is currently used in the Enterprises? If yes, which ones?

Joe Russell: Data Insights occur farther up the stack with analytics vendors.

Project Serengeti is a tool that allows you to run Hadoop on top of vSphere and is a solution designed to allow users to consolidate Hadoop clusters on a single underlying virtual infrastructure. The tool allows users to run different types of Hadoop distributions on a hypervisor to gain the benefits of virtualization, which include efficiency, elasticity, and agility.

Q6. Does Serengeti only work with the VMware vSphere® platform?

Joe Russell: Project Serengeti today only works with the vSphere hypervisor.

However, VMware made the decision to open source Project Serengeti to make the code available to anyone who wishes to use it.
By making it open source vs. just offering a free closed source product, VMware allows users to take the Serengeti code and alter it for their own purposes. For example, any user could download the Project Serengeti code and alter it to make it work with other hypervisors other than vSphere. While it isn’t in VMware’s interest to dedicate resources to make Project Serengeti run with other hypervisors, it doesn’t prevent users from doing so. This is an important point.

Q7. VMware is working with the Apache Hadoop community to contribute changes to the Hadoop Distributed File System (HDFS) and Hadoop MapReduce projects to make them “virtualization-aware”. What does it mean? What are these changes?

Joe Russell: Hadoop Virtual Extensions (“HVE”) is one example of this. VMware contributed HVE back to the Apache community to make Hadoop distributions virtualization aware. This means inserting a node group layer between the rack and host to make Hadoop distributions topology aware for virtualized platforms. In its simplest terms, this allows for VMware to preserve data locality and increase reliability through the separation of data and compute.

A link to a whitepaper with further detail can be found here (.pdf).
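
As a rough illustration of what topology awareness means in practice: Hadoop can resolve node locations through a user-supplied topology script, and the sketch below (a generic example, not VMware's HVE code) maps worker IP addresses to a /rack/nodegroup path so that virtual machines on the same physical host are treated as local to each other. The IP-to-location table is invented.

```python
#!/usr/bin/env python
"""Toy Hadoop topology script: print one /rack/nodegroup location per input IP.

Hadoop passes host addresses as command-line arguments and reads one location
per line from stdout; the mapping below is purely illustrative.
"""
import sys

# Virtual machines that share a physical host share a node group.
LOCATIONS = {
    "10.0.0.11": "/rack1/host-a",
    "10.0.0.12": "/rack1/host-a",
    "10.0.0.21": "/rack1/host-b",
    "10.0.0.31": "/rack2/host-c",
}

for ip in sys.argv[1:]:
    print(LOCATIONS.get(ip, "/default-rack/default-group"))
```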

Q8. What about the performance of such “Virtualize” Hadoop? Do you have performance measures to share?

Joe Russell: Please see whitepaper referenced above.

Q9. What is the value of Hadoop-in-cloud? How does it relate to the virtualization of Hadoop?

Joe Russell: I don’t necessarily understand the question and it would be particularly helpful to define what you mean by “Hadoop-in-Cloud”.

I think you may be referring to Hadoop-as-a-Service, which is valuable in that users are able to deliver Hadoop resources to internal users based on need. Centralized control through Hadoop-as-a-Service ensures high cluster utilization, lower TCO, and an agile framework to adjust to ever-changing business needs. As Enterprises increasingly look to service internal customers, I expect Hadoop-as-a-Service to become more popular as the Hadoop technology emerges within the enterprise. Please keep in mind that this relates both to private and public clouds. Virtualizing Hadoop is the first step toward being able to provision Hadoop in the cloud.

Q10. VMware also announced updates to Spring for Apache Hadoop. Could you tell us what are these updates?

Joe Russell: Spring for Apache Hadoop is being led by the Pivotal Initiative and Scott Yara. I would point you to him with this question, as he would have more insight than us at VMware.

Q11 VMware is working with a number of Apache Hadoop distribution vendors (Cloudera, Greenplum, Hortonworks, IBM and MapR ) to support a wide range of distributions. Why? Could you tell us exactly what is VMware contribution?

Joe Russell: VMware is focused on providing a common underlying virtual infrastructure so each of these vendors can run their software better on vSphere. Project Serengeti is a toolset that pre-configures, tunes and makes it easier to deploy and run Hadoop with increased reliability on vSphere. These efforts make it easier for enterprises to make architectural decisions around how to setup Hadoop within their companies. Deciding to virtualize Hadoop can have dramatic effects not only on companies just beginning to use Hadoop, but also on more advanced users of the technology. VMware’s contributions through Project Serengeti allow each of the vendor’s software to run better on virtualized infrastructure. As you know, these contributions are available for anyone to use.

Q12 Serengeti, “Virtualize” Hadoop, Hadoop in the Cloud, Spring for Apache Hadoop: what is the global picture here? How all of these efforts relate to each other? What are the main benefits for developers and users of Apache Hadoop?

Joe Russell: All of these efforts improve the technology and make it easier for developers and users of Hadoop to actually use Hadoop. Additionally, these efforts focus on virtualizing Hadoop to make the technology more elastic, reliable, and performant.
VMware is focused on bringing the benefits of virtualization to Hadoop, both from a community standpoint and a customer standpoint. It has been open in its approach to contributing back technology that makes it easier for users / developers to utilize virtualization for their Hadoop clusters. Conversely, it is investing in bringing Hadoop to its existing customers by making the technology more reliable and building easy to use tools around the technology to make it easier to deploy and administrate in an Enterprise setting with SLAs and business critical workloads.

Joe Russell is responsible for product strategy, GTM, evangelism and product marketing of Big Data at VMware.
He has over a decade of experience in a blend of product marketing, finance, operations, and M&A roles.
Previously he worked for Yahoo!, and as an Investment Banking – Technology M&A Analyst for GCA Savvian, Credit Suisse and Societe Generale.
He holds an MSc in Accounting & Finance from the London School of Economics and Political Science, a BS in Economics with Honors from the University of Washington, and an MBA from the Wharton School, University of Pennsylvania.

—————
Related Posts

- On Pivotal HD. Interview with Scott Yara and Florian Waas. April 22, 2013

-The Spring Data project. Interview with David Turanski. January 3, 2013

Resources

ODBMS.org- Lecture Notes: Data Management in the Cloud.
by Michael Grossniklaus, David Maier, Portland State University.
Course Description: “Cloud computing has recently seen a lot of attention from research and industry for applications that can be parallelized on shared-nothing architectures and have a need for elastic scalability. As a consequence, new data management requirements have emerged with multiple solutions to address them. This course will look at the principles behind data management in the cloud as well as discuss actual cloud data management systems that are currently in use or being developed. The topics covered in the course range from novel data processing paradigms (MapReduce, Scope, DryadLINQ), to commercial cloud data management platforms (Google BigTable, Microsoft Azure, Amazon S3 and Dynamo, Yahoo PNUTS) and open-source NoSQL databases (Cassandra, MongoDB, Neo4J). The world of cloud data management is currently very diverse and heterogeneous. Therefore, our course will also report on efforts to classify, compare and benchmark the various approaches and systems. Students in this course will gain broad knowledge about the current state of the art in cloud data management and, through a course project, practical experience with a specific system.”
Lecture Notes | Intermediate/Advanced | English | LINK TO DOWNLOAD ~280 slides (PDF)| 2011-12|

Follow ODBMS.org on Twitter: @odbmsorg

##

Apr 22 13

On Pivotal HD. Interview with Scott Yara and Florian Waas.

by Roberto V. Zicari

“A distribution is not–or not necessarily–a fork of the code and we have no intention to fork Hadoop. At this point, the value-add that we bring to the table is strictly layered on top of Apache HD and interacts cleanly with the vanilla Hadoop stack” –Scott Yara and Florian Waas.

Greenplum announced on Monday, February 25th a new Hadoop distribution: Pivotal HD. I asked a few questions on Pivotal HD to Scott Yara, Senior Vice President, Products and Co-Founder Greenplum/EMC, and Florian Waas, Senior Director of Advanced Research and Development at Greenplum/EMC.

RVZ

Q1. What is in your opinion the status of adoption of, and investment in, open source projects such as Hadoop within the Enterprise?

Scott Yara, Florian Waas: We have seen a massive shift in perception when it comes to open source.

In the past, innovation was primarily driven by commercial R&D departments and open source was merely trying to catch up to them. And even though a number of open source projects from that era have become household names they weren’t necessarily viewed as leaders in innovation.

This has fundamentally changed in recent years: open source has become a hotbed of innovation, in particular in infrastructure technology. Hadoop and a variety of other data management and database products are testament to this change. Enterprise customers do realize this trend and have started adopting open source at large scale. It allows them to get their hands on new technology much faster than was the case before, and as an additional perk this technology comes without the dreaded vendor lock-in.

By now, even the most conservative enterprises have developed open source strategies that ensure they have their finger on the pulse and that adoption cycles are short and effective.

So, in short, the prospects for open source have never been better!

Q2. In your opinion is the future of Hadoop made of hybrid products?

Scott Yara, Florian Waas: Hadoop is a collection of products or tools and, apart from the relatively mature HDFS interfaces, is still evolving. Its original value proposition has changed quite dramatically. Remember, initially it was all about MapReduce the cool programming paradigm that lets you whip up large-scale distributed programs in no time requiring only rudimentary programming skills.

Yet, that’s not the reason Hadoop has attracted the attention of enterprises lately. Frankly, the MapReduce programming paradigm was a non-starter for most enterprise customers: it’s at too low a level of abstraction and curating and auditing MapReduce programs is prohibitively expensive for customers unless they have a serious software development shop dedicated to it. What has caught on, however, is the idea of ‘cheap scalable storage’!

In our view the future of Hadoop is really this: a solid abstraction of storage in the form of HDFS with any number of different processing stacks on top, including higher-level query languages. Naturally this will be a collection of different products, hybrids where necessary. I think we’ve only seen the tip of the iceberg yet.

Q3. Why introducing a new Hadoop distribution?

Scott Yara, Florian Waas: Let’s be clear about one thing first: to us a distribution is simply a bundle of software components that comes with the assurance that the bundled products have been integration-tested and certified. To enterprise customers this assurance is vital as it gives them the single point of contact when things go wrong. And exactly this is the objective of Pivotal HD.

A distribution is not–or not necessarily–a fork of the code and we have no intention to fork Hadoop. At this point, the value-add that we bring to the table is strictly layered on top of Apache HD and interacts cleanly with the vanilla Hadoop stack.

As long as no vendor actively subverts the Hadoop project, we don’t see any need to fork. That being said, if a single vendor sweeps up a significant number of contributors or even committers of any individual project it always raises a couple of red flags and customers will be concerned whether somebody is going to hijack the project. At this point, we’re not aware of any such threat to the open-source nature of Hadoop.

Q4. How did you expand Hadoop capabilities as a data platform with Pivotal HD?

Scott Yara, Florian Waas: Pivotal HD is a full Apache HD distribution plus some Pivotal add-ons. As we said before, the HDFS abstraction is a pretty good one—but the standard stack on top of it lacks severely in performance and expressiveness, so we give customers better alternatives. For enterprise customers this means: you can use Pivotal HD like regular Hadoop where applicable, but if you need more, you get it in the same bundle.

Q5. What is the rationale beyond introducing HAWQ, a relational database that runs atop of HDFS?

Scott Yara, Florian Waas: Not quite. We’ve transplanted a modern distributed query engine onto HDFS. We stripped out a lot of “incidental” database technology that databases are notorious for. HAWQ gives enterprises the best of both worlds: high-performance query processing for a query language they already know on the one hand, and scalable open storage on the other hand. And, unlike with a database, data isn’t locked away in a proprietary format: in HAWQ you can access all stored data with any number of tools when you need to.

Q6. How does Pivotal HD differ from Hadapt in this respect?

Scott Yara, Florian Waas: Hadapt is still in its infancy with what looks like a long way to go, mainly because they couldn't tap into an MPP database product.

Folks sometimes forget how much work goes into building a commercially viable query processor. In the case of Greenplum, it’s been about 10 years of engineering.

Q7. How does HAWQ work?

Scott Yara, Florian Waas: HAWQ is a modern distributed and parallel query processor atop HDFS–with all the features you truly need, but without the bloat of a complete RDBMS.

Obviously there are a number of rather technical details about how exactly the two worlds integrate, and interested readers can find specific technical descriptions on our website.

Q8 You write in the Greenplum Blog that “HAWQ draws from the 10 years of development on the Greenplum Database product”. Can you be more specific?

Scott Yara, Florian Waas: Building a distributed query engine that is general and powerful enough to support deep analytics is a very tough job. There are no shortcuts. Hive and all of these SQL-ish interfaces we've recently seen are an attempt at it and work well for simple queries, but they have basically failed to deliver solid performance when it comes to advanced analytics.

Having spent a long time working on DB internals, we sometimes forget how steep a development curve this technology has undergone. Folks new to this space constantly “discover” some of the problems query processing has dealt with for a long time already, like join ordering—this learning-by-doing approach is kind of cute, but not necessarily effective.

Q9. Why does HAWQ have its own execution engine separate from MapReduce? Why does it manage its own data?

Scott Yara, Florian Waas: Looks like we're answering the questions in the wrong order :-)

MapReduce is a great tool to teach parallelism and distributed processing and makes for an intuitive hands-on experience. But unless your problem is as simple as a distributed word-count, MapReduce quickly becomes a major development and maintenance headache; and even then the resulting performance is sub-standard.

In short, MapReduce, while maybe great for software shops with deep expertise in distributed programming and a do-it-yourself attitude, is not enterprise-ready.

Q10. HAWQ supports Columnar or row-oriented storage. Why this design choice?

Scott Yara, Florian Waas: Columnar vs. row-orientation really is a smoke screen; it always has been. We've long advocated viewing columnar storage for what it is: a feature, not an architectural principle. If your query processor follows even the most basic software engineering principles, supporting column-orientation is really easy.

Plenty of white papers have been written on the differences and discussed the application scenarios where one out-performs the other and vice versa. As so often, there is no one-size-fits-all. HAWQ lets customers use what they feel is the right format for the job. We want customers to be successful, not blindly follow an ideology.

Just as different requirements in the workload demand different orientations, HAWQ can ingest data formats well beyond column or row orientation—optimized for query processing, or optimized for 3rd-party applications, etc.—which rounds out the picture.
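
In code terms, the difference really is just how the same rows are laid out; a toy sketch:

```python
rows = [
    ("alice", 34, "NY"),
    ("bob", 29, "SF"),
    ("carol", 41, "NY"),
]

# Row orientation: each record stored contiguously; good for fetching whole rows.
row_store = list(rows)

# Column orientation: each attribute stored contiguously; good for scanning and
# compressing one column (e.g. all ages) without touching the others.
column_store = {
    "name": [r[0] for r in rows],
    "age":  [r[1] for r in rows],
    "city": [r[2] for r in rows],
}

print(sum(column_store["age"]) / len(column_store["age"]))  # scans a single column
```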

Q11. Could you give us some technical detail on how the SQL parallel query optimizer and planner works?

Scott Yara, Florian Waas: What you see in HAWQ today is the tried and true Greenplum Database MPP optimizer with a couple of modifications, but largely the same battle-tested technology. That's what allowed us to move ahead of the competition so quickly while everybody else is still trying to catch up to basic MPP functionality.

Having said that, we’re constantly striving for improvement and pushing the limits. Over the past years, we have invested in what we believe is a ground-breaking optimizer infrastructure which we’ll unveil later this summer. So, stay tuned!

Q12. Could you give us some details on the partitioning strategy you use and what kind of benchmark results do you have?

Scott Yara, Florian Waas: Benchmarks are a funny thing: hardly any competitor can run even the most basic database benchmarks yet, so we're comparing on the simple, almost trivial, queries only. Anyway, here's what we've been seeing so far: if the query is completely trivial, the nearest competitor is slower by at least a factor of two. For anything even slightly more complex, the difference widens quickly to one to two orders of magnitude.

Q13. Apache Hadoop is open-source, do you have plans to open up HAWQ and the other technologies layered atop it?

Scott Yara, Florian Waas: We’ve been debating this but haven’t really made a decision, as of yet.

Q14. The Hadoop market is crowded: e.g Cloudera (Impala), Hortonworks’ Data Platform for Windows, Intel’s Hadoop distribution, NewSQL data store Hadapt. How do you stand out of this crowd of competitors?

Scott Yara, Florian Waas: We clearly captured a position of leadership with HAWQ and enterprise customers do recognize that. We’ve also received a lot of attention from competitors which shows that we clearly hit a nerve and deliver a piece of the puzzle enterprises have long been waiting for.

Q15. With Pivotal HD are you competing in the same market space as Teradata Aster?

Scott Yara, Florian Waas: Aster has traditionally targeted a few select verticals. For all we can tell, it looks like we’re seeing the continuation of that strategy with a highly specialized offering going forward.

In contrast to that, Pivotal HD strives to be a general purpose solution for as broad a customer spectrum as you can imagine.

Q16. Jeff Hammerbacher in 2011 said “The best minds of my generation are thinking about how to make people click ads… That sucks. If instead of pointing their incredible infrastructure at making people click on ads, they pointed it at great unsolved problems in science, how would the world be different today?” What is your take on this?

Scott Yara, Florian Waas: Jeff garnered a lot of attention with this quote, but let's face it, this type of criticism isn't exactly novel, nor is it very productive. For decades, Joseph Weizenbaum, one of the pioneers of AI, famously lamented the genius and technology wasted on TV satellites. Along the same lines, other MIT faculty have decried the fact that their most successful engineering students become quants on Wall Street. The list is probably long.

Instead of scolding people for what they didn't do, I'd say, let's empower people and give them tools to do great things and solve truly important problems. It's no coincidence that Big Data problems are at the heart of the most pressing challenges humanity faces today. So, let's get moving!

Scott Yara
Senior Vice President, Products and Co-Founder Greenplum/EMC.
In his role as SVP, Products, Scott is responsible for the division's overall product development and go-to-market efforts, including engineering, product management, and marketing. Scott is a co-founder of Greenplum and was President of the company. Prior to Greenplum, Scott served as vice president for Digital Island, a publicly traded Internet infrastructure services company that was acquired by Cable & Wireless in 2001. Prior to Digital Island, Scott served as vice president for Sandpiper Networks, an Internet content delivery services company that merged with Digital Island in 1999. At Sandpiper, Scott helped to create the industry's first content delivery network (CDN), a globally distributed computing infrastructure comprised of several thousand servers, and used by many of the industry's largest Internet services including Microsoft and Disney.

Florian Waas
As Senior Director of Advanced Research and Development at Greenplum/EMC, Florian Waas heads up the division's department of Impossible Ideas. That is to say, his day job is to look into ideas that are far from ready to be undertaken as engineering efforts, and then look at what it would take to turn theory into practice.
He obtained his MSc in Computer Science from Passau University, Germany and a PhD in database research from the University of Amsterdam. Florian Waas has worked as a researcher for several European research consortia and universities in Germany, Italy, and The Netherlands. Before joining Greenplum, Florian Waas held positions at Microsoft and Amazon.com.

Related Posts

- Big Data: Improving Hadoop for Petascale Processing at Quantcast. March 13, 2013

- On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. December 5, 2012

-On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum. February 1, 2012

Resources

- ODBMS.org free resources on Big Data and Analytical Data Platforms
Blog Posts | Free Software | Articles | Lecture Notes | PhD and Master Thesis |

Follow ODBMS.org on Twitter: @odbmsorg

##

Apr 11 13

Graphs vs. SQL. Interview with Michael Blaha

by Roberto V. Zicari

“For traditional business applications, the schema is known in advance, so there is no need to use a graph database, which has weaker enforcement of integrity. If instead you’re dealing with data that, at best, conforms to a generic model, then a schema-oriented approach does not provide much. Instead, a graph-oriented approach is more natural and easier to develop against.” – Michael Blaha

Graphs, SQL and Databases. On this topic I have interviewed our expert Michael Blaha.

RVZ

Q1. A lot of today’s data can be modeled as a heterogeneous set of “vertices” connected by a heterogeneous set of “edges”, people, events, items, etc. related by knowing, attending, purchasing, etc. This world view is not new as the object-oriented community has a similar perspective on data. What is in your opinion the main difference with respect to a graph-centric data world?

Michael Blaha: This world view is also not new because this is the approach Charlie Bachman took with network databases many years ago. I can think of at least two major distinguishing aspects of graph-centric databases relative to relational databases.
(1) Graph-centric databases are occurrence-oriented while relational databases are schema-oriented. If you know the schema in advance and must ensure that data conforms to it, then a schema-oriented approach is best. Examples include traditional business applications, such as flight reservations, payroll, and order processing.
(2) Graph-centric databases emphasize navigation. You start with a root object and pull together a meaningful group of related objects. Relational databases permit navigation via joins, but such navigation is more cumbersome and less natural. Many relational database developers are not adept at performing such navigation.

Q2. The development of scalable graph applications such for example for Facebook, and Twitter require different kind of databases than SQL. Most of these large Web companies have built their own internal graph databases. But what about other enterprise applications?

Michael Blaha: The key is the distinction between being occurrence-oriented and schema-oriented. For traditional business applications, the schema is known in advance, so there is no need to use a graph database, which has weaker enforcement of integrity. If instead you’re dealing with data that, at best, conforms to a generic model, then a schema-oriented approach does not provide much. Instead, a graph-oriented approach is more natural and easier to develop against.

Q3: Marko Rodriguez and Peter Neubauer in an interview say that “the benefit of the graph comes from being able to rapidly traverse structures to an arbitrary depth (e.g., tree structures, cyclic structures) and with an arbitrary path description (e.g. friends that work together, roads below a certain congestion threshold). We call this data processing pattern, the graph traversal pattern. This mind set is much different from the set theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk’s traversal”. What is your take on this?

Michael Blaha: That's a great point, and one that I should have mentioned in my answer to Q1. Relational databases have poor handling of recursion. I will note that the vendor products have extensions for this, but they aren't natural and are an awkward graft onto SQL. Graph databases, in contrast, are great at handling recursion. This is a big advantage of graph databases for applications where recursion arises.
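
For readers who have not run into it, the sketch below shows what SQL recursion looks like with a recursive common table expression, here via Python's built-in sqlite3 module. It works, but as Blaha notes, it is an awkward graft compared with a native graph traversal. The table and data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports_to (employee TEXT, manager TEXT)")
conn.executemany(
    "INSERT INTO reports_to VALUES (?, ?)",
    [("bob", "alice"), ("carol", "bob"), ("dave", "carol")],
)

# Recursive CTE: walk the management chain upward from 'dave' to the root.
chain = conn.execute("""
    WITH RECURSIVE chain(name) AS (
        SELECT 'dave'
        UNION ALL
        SELECT r.manager FROM reports_to r JOIN chain c ON r.employee = c.name
    )
    SELECT name FROM chain
""").fetchall()

print([row[0] for row in chain])  # ['dave', 'carol', 'bob', 'alice']
```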

Q4. Is there any synergy between graphs and conventional relational databases?

Michael Blaha: Graphs are also important for relational databases, and more so than some people may realize…
– Graphs are clearly relevant for data modeling. An Entity-Relationship data model portrays the database structure as a graph.
– Graphs are also important for expressing database constraints. The OMG's Object Constraint Language (OCL) expresses database constraints using graph traversal. The OCL is a textual language so it can be tedious to use, but it is powerful. The Common Warehouse Metamodel (CWM) specifies many fine constraints with the OCL and is a superb example of proper OCL usage.
– Even though the standard does not emphasize it, the OCL is also an excellent language for database traversal as a starting point for database queries. Bill Premerlani and I explained this in a past book (Object-Oriented Modeling and Design for Database Applications).
– Graphs are also helpful for characterizing the complexity of a relational database design. Robert Hilliard presents an excellent technique for doing this in his book (Information-Driven Business).

Q5. You say that graphs are important for data modeling, but in the end you do not store graphs in a relational database, you store tables, and you need joins to link them together… Graph databases, in contrast, cache what is on disk into memory, and vendors claim that this makes for a highly reusable in-memory cache. What is your take on this?

Michael Blaha: Relational databases play many optimization games behind the covers. So in general, possible performance differences are often not obvious. I would say that the difference in expressiveness is what determines suitable applications for graph and relational databases and performance is a secondary issue, except for very specialized applications.

Q6: What are advantages of SQL relative to graph databases?

Michael Blaha: Here are some advantages of SQL:

– SQL has a widely-accepted standard.
– SQL is a set-oriented language. This is good for mass-processing of set-oriented data.
– SQL databases have powerful query optimizers for handling set-oriented queries, such as for data warehouses.
– The transaction processing behavior of relational databases (the ACID properties) is robust, powerful, and sound.
– SQL has extensive support for controlling data access.

Q7: What are disadvantages of SQL relative to graph databases?

Michael Blaha: Here are some disadvantages of SQL:

– SQL is awkward for processing the explosion of data that can result from starting with an object and traversing a graph.
– SQL, at best, awkwardly handles recursion.
– SQL has lots of overhead for multi-user locking that can make it difficult to access individual objects and their data.
– Advanced and specialty applications often require less rigorous transaction processing with reduced overhead and higher throughput.

Q8: For which applications is SQL best? For which applications are graph databases best?

Michael Blaha:
– SQL is schema based. Define the structure in advance and then store the data. This is a good approach for conventional data processing such as many business and financial systems.
– Graph databases are occurrence based. Store data and relationships as they are encountered. Do not presume that there is an encompassing structure. This is a good approach for some scientific and engineering applications as well as data that is acquired from Web browsers and search engines.

Q9. What about RDF quad/triple stores?

Michael Blaha: I have not paid much attention to this. RDF is an entity-attribute-value approach. From what I can tell, it seems occurrence based and not schema based and my earlier comments apply.

_____________
Michael Blaha is a partner at Modelsoft Consulting Corporation.
Blaha received his doctorate from Washington University in St. Louis, Missouri in Chemical Engineering with his dissertation being about databases. Both his academic background and working experience involve engineering and computer science. He is an alumnus of the GE R&D Center in Schenectady, New York, working there for eight years. Since 1993, Blaha has been a consultant and trainer in the areas of modeling, software architecture, database design, and reverse engineering. Blaha has authored six U.S. patents, four books, and many papers. Blaha is an editor for IEEE Computer as well as a member of the IEEE-CS publications board. He has also been active in the IEEE Working Conferences on Reverse Engineering.

Related Posts

- On Big Graph Data. August 6, 2012

- Applying Graph Analysis and Manipulation to Data Stores. June 22, 2011

Resources

ODBMS.org Resources on Graphs and Data Stores
Blog Posts | Free Software | Articles, Papers, Presentations| Tutorials, Lecture Notes
Follow ODBMS.org on Twitter: @odbmsorg

##

Apr 4 13

On Innovation– Interview with Nathan Marz.

by Roberto V. Zicari

” I think that it’s incredibly important for all programmers to have a public presence by being involved in open source or having side projects that are publicly available. The industry is quickly changing and more and more people are realizing how ineffective the standard techniques of programmer evaluation are. This includes things like resumes and coding questions “ –Nathan Marz.

Nathan Marz open-sourced Cascalog, ElephantDB, and Storm.
I caught up with him just after he left Twitter to do his own startup, and asked him a few questions.

RVZ

Q1. You just left Twitter to start your own company. How much were you influenced by Jeff Bezos's concept of a "Regret Minimization Framework"? (watch YouTube video) What if he had failed?

Nathan Marz: You only live once, so it's important to make the most of the time you have. I find Bezos's "regret minimization framework" a great way to make decisions with a long term perspective. Too often people make decisions only thinking about marginal, short-term gains, and this can lead you down a path you never intended to go. And failure, if it happens, is not as bad as it seems. Worst comes to worst, I'll have learned an enormous amount, had a unique and interesting experience, and will just try something else.

Q2. Do you want to disclose in general terms what you’ll be working on?

Nathan Marz: Sorry, not at the moment. [Edit: as of publication he has not disclosed it.]

Q3. You open-sourced Cascalog, ElephantDB, and Storm. Which of the three is in your opinion the most rewarding?

Nathan Marz: Storm has been very rewarding because of the sheer number of people using it and the diversity of industries it has penetrated, from healthcare to analytics to social networking to financial services and more.

Q4. What are in your opinion the current main challenges for Big Data analytics?

Nathan Marz: I think the biggest challenge is an educational one. There's an overwhelming number of tools in the Big Data ecosystem, all very much different from the relational databases people are used to, and none is a one-size-fits-all solution. This is why I'm writing my book "Big Data" – to show people a structured, principled approach to architecting data systems and how to use those principles to choose the right tool for your particular use case.

Q5. In January 2013, version 0.8.2 of Storm was released. What is new?

Nathan Marz: There was a lot of work done in 0.8.2 on making it easier to use a shared cluster for both production and in-development applications. This included improved monitoring support, which helps with detecting when you’ll need to scale with more resources, and a brand new scheduler that isolates production and development topologies from each other. And of course, the usual bug fixes and small improvements.

Q6. How do you expect Storm evolving?

Nathan Marz: There’s a lot of work happening right now on making Storm enterprise-ready. These include security features such as authentication and authorization, enhanced monitoring capabilities, and high availability for the Storm master. Long term, we want to continue with the theme of having Storm seamlessly integrate with your other realtime backend systems, such as databases, queues, and other services.

Q7. Daniel Abadi of Hadapt, said in a recent interview: “the prevalent architecture that people use to analyze structured and unstructured data is a two-system configuration, where Hadoop is used for processing the unstructured data and a relational database system is used for the structured data. However, this is a highly undesirable architecture, since now you have two systems to maintain, two systems where data may be stored, and if you want to do analysis involving data in both systems, you end up having to send data over the network which can be a major bottleneck.”
What is your opinion on this?

Nathan Marz: I think that “structured” vs. “unstructured” is a false dichotomy. It’s easy to store both unstructured and structured data in a distributed filesystem: just use a tool like Thrift to make your structured schema and serialize those records into files. A common objection to this is: “What if I need to delete or modify one of those records? You can’t cheaply do that when the data is stored in files on a DFS.” The answer is to move beyond the era of CRUD and embrace immutable data models where you only ever create or read data. In the architecture I’ve developed, which I call the Lambda Architecture, you then build views off of that data using tools like Hadoop and Storm, and it’s the views that are indexed and go on to feed the low latency requests in your system.
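
A highly simplified sketch of that idea in Python; it is illustrative only, since the real Lambda Architecture keeps the master dataset on a distributed filesystem with Thrift-defined schemas and recomputes views with Hadoop and Storm rather than with in-process lists and dicts.

```python
from collections import defaultdict

# Master dataset: an immutable, append-only log of raw facts. Records are only
# ever created or read, never updated or deleted in place.
master_dataset = []


def record_pageview(user, url, timestamp):
    master_dataset.append({"user": user, "url": url, "ts": timestamp})


def recompute_batch_view():
    """Batch layer: recompute an indexed view (pageviews per URL) from scratch.

    In a real system this recomputation would be a Hadoop job over the full
    master dataset, and a serving layer would index the result for low-latency reads.
    """
    view = defaultdict(int)
    for event in master_dataset:
        view[event["url"]] += 1
    return dict(view)


record_pageview("alice", "/home", 1364000000)
record_pageview("bob", "/home", 1364000050)
print(recompute_batch_view())  # {'/home': 2}
```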

Q8. What are the main lessons learned in the last three years of your professional career?

Nathan Marz: I think that it’s incredibly important for all programmers to have a public presence by being involved in open source or having side projects that are publicly available. The industry is quickly changing and more and more people are realizing how ineffective the standard techniques of programmer evaluation are. This includes things like resumes and coding questions. These techniques frequently label strong people as weak or weak people as strong. Having good work out in the open makes it much easier to evaluate you as a strong programmer. This gives you many more job options and will likely drive up your salary as well because of the increased competition for your services.

For this reason, programmers should strongly prefer to work at companies that are very permissive about contributing to open source or releasing internal projects as open source. Ironically, a company having this policy is assisting in driving up the value (and price) of the employee, but as time goes on I think this policy will be necessary to even have access to the strongest programmers in the first place.

————–
Nathan Marz was the Lead Engineer at BackType before BackType was acquired by Twitter in July 2011. At Twitter he started the streaming compute team, which provides infrastructure that supports many critical applications throughout the company. He left Twitter in March 2013 to start his own company (currently in stealth).

Related Posts

- On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. December 5, 2012

Resources

- Video: Jeff Bezos – Regret Minimization Framework

- github Storm

- github Cascalog

- github ElephantDB

- Big Data: Principles and best practices of scalable realtime data systems.
Nathan Marz (Twitter) and James Warren
MEAP Began: January 2012
Softbound print: Fall 2013 | 425 pages
Manning Publications
ISBN: 9781617290343
Download Chapter 1: A new paradigm for Big Data (.PDF)

##
follow ODBMS.org on Twitter: @odbmsorg

Mar 25 13

Big Data for Genomic Sequencing. Interview with Thibault de Malliard.

by Roberto V. Zicari

“Working with empirical genomic data and modern computational models, the laboratory addresses questions relevant to how genetics and the environment influence the frequency and severity of diseases in human populations” –Thibault de Malliard.

Big Data for Genomic Sequencing. On this subject, I have interviewed Thibault de Malliard, researcher at the University of Montreal’s Philip Awadalla Laboratory, who is working on bioinformatics solutions for next-generation genomic sequencing.

RVZ

Q1. What are the main research activities of the University of Montreal’s Philip Awadalla Laboratory?

Thibault de Malliard: The Philip Awadalla Laboratory is the Medical and Population Genomics Laboratory at the University of Montreal. Working with empirical genomic data and modern computational models, the laboratory addresses questions relevant to how genetics and the environment influence the frequency and severity of diseases in human populations. Its research includes work relevant to all types of human diseases: genetic, immunological, infectious, chronic and cancer.
Using genomic data from single-nucleotide polymorphisms (SNP), next-generation re-sequencing, and gene expression, along with modern statistical tools, the lab is able to locate genome regions that are associated with disease pathology and virulence as well as study the mechanisms that cause the mutations.

Q2. What is the lab’s medical and population genomics research database?

Thibault de Malliard: The lab’s database brings together all the mutations (SNPs) found by DNA genotyping, DNA sequencing and RNA sequencing for each sample, along with annotation data from public databases.

Q3. Why is data management important for the genomic research lab?

Thibault de Malliard: All the data we have is in text CSV files. This is what our software takes as input, and it outputs other text CSV files. So we use a lot of Bash and Perl to extract the information we need and to compute some statistics. As time goes on, the number of files multiplies by sample and by experiment, and the statistics computed over the whole dataset (mutation frequency, mutations per gene, etc.) need to be recalculated every time we perform a new sequencing or genotyping run.
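
As a hypothetical illustration of that workflow, a whole-dataset statistic such as mutations per gene has to be recomputed by re-scanning every flat file after each new experiment – roughly what the following Python sketch does (file names and column names are invented):

import csv
from collections import Counter

# One row per observed SNP, with a "gene" column (hypothetical layout).
mutations_per_gene = Counter()
for path in ["project1_snps.csv", "project2_snps.csv"]:   # this list grows with every experiment
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            mutations_per_gene[row["gene"]] += 1

for gene, count in mutations_per_gene.most_common(10):
    print(gene, count)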

With this database, we are also preparing for the lab’s future:
• As the amount of data increases, one day an associative array will no longer fit in memory.
• Searching a 200 GB file for one specific mutation will not be a good option.
• Adding new data to the current files will take more and more time and space.
• We need to be able to select the data according to every parameter we have, i.e., grouping by type of mutation and/or by chromosome, and/or by sample information such as gender, ethnicity, age, or pathology.
• We then need to export the result to a file, or count / sum / average it.

Q4. Could you describe what kind of data the lab’s genomic research database is storing and processing? And for what applications?

Thibault de Malliard: We are storing single nucleotide polymorphisms (SNPs), which are the most common form of genetic mutations among people, from sequencing and genotyping. When an SNP is found for a sample, we also look at what we have at the same position for the other samples:
• There is no SNP but data for the sample, so we know this sample does not have the SNP.
OR
• There is no data for the sample, so we cannot assess whether or not there is an SNP for this sample at this position.

We gather between 1.8 and 2.5 million nucleotides per sample (positions that at least one sample has), depending on the experimental technique. We store them in the database along with some information:
• how damaging the SNP can be to the function of the gene
• its frequency in different populations (African, European, French Canadian…).

The database also contains information about each sample, such as gender, ethnicity and pathology, and this will keep growing with our needs. So, basically, we have a sample table, a mutations table with their associated information, an experiment table, and a big table linking the three previous tables with one-to-many relations.

Here is a very slightly simplified example of a single record in our database:

Single record in our database (grouped by table):

Mutation information table
• SNP: T
• Chromosome: 1
• Position: 100099771
• Gene: NZT
• Damaging for gene function? synonymous
• Present in known database? yes

Linking table (connects the other tables; holds information about one mutation, for one sample, from one sequencing)
• Sequencing quality: 26
• Sequencing coverage: 15
• Validated by another experiment? no

Sample table
• Sample: 345
• Research project: Project_1
• Gender: Male
• Ethnicity: French
• Family: 10

Sequencing table
• Sequencing information: Illumina HiSeq 2500
• Sequencing type (DNA, RNA…): RNAseq
• Analysis pipeline info: no PCR duplicates; only properly paired
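
Based on the description and the example record above, the four tables might look roughly like the following MySQL DDL, issued here from Python via MySQL Connector/Python. This is only a sketch with invented names and types, not the lab’s actual schema; the TokuDB storage engine is selected explicitly, as discussed below.

import mysql.connector   # MySQL Connector/Python

ddl = [
    """CREATE TABLE sample (
         sample_id INT PRIMARY KEY,
         project   VARCHAR(32),
         gender    CHAR(1),
         ethnicity VARCHAR(32),
         family_id INT
       ) ENGINE=TokuDB""",
    """CREATE TABLE mutation (
         mutation_id INT PRIMARY KEY,
         chromosome  TINYINT,          -- X/Y/MT encoded numerically to avoid strings
         position    INT UNSIGNED,
         allele      CHAR(1),          -- A, C, G or T
         gene        VARCHAR(16),
         effect      VARCHAR(16),      -- e.g. 'synonymous'
         in_known_db BOOL
       ) ENGINE=TokuDB""",
    """CREATE TABLE experiment (
         experiment_id INT PRIMARY KEY,
         platform      VARCHAR(32),    -- e.g. 'Illumina HiSeq 2500'
         seq_type      VARCHAR(16),    -- e.g. 'RNAseq'
         pipeline      VARCHAR(64)
       ) ENGINE=TokuDB""",
    """CREATE TABLE observation (      -- the big linking table
         sample_id     INT,
         mutation_id   INT,
         experiment_id INT,
         quality       SMALLINT,
         coverage      SMALLINT,
         validated     BOOL,
         PRIMARY KEY (sample_id, mutation_id, experiment_id)
       ) ENGINE=TokuDB"""
]

conn = mysql.connector.connect(host="localhost", user="lab",
                               password="...", database="genomics")   # placeholder credentials
cur = conn.cursor()
for statement in ddl:
    cur.execute(statement)
conn.commit()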

The applications are multiple, but here are some that come to mind:
• extract subsets of data to use with our tools
• do statistics and counts
• find specific data
• annotate our data with public databases
Q5. Why did you decide to deploy TokuDB database storage engine to optimize the lab’s medical and population genomics research database?

Thibault de Malliard: We knew that the data could not be managed with MySQL using MyISAM. One big issue is the insert rate, and TokuDB offered a solution up to 50 times faster. Furthermore, TokuDB allows us to change the structure of the database without blocking access to it. As a research team, we always have new information to add, which means adding columns.
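
For example, adding a column to a TokuDB table is a quick online operation. A hedged sketch, reusing the connection from the schema example above (“pathology” is a hypothetical column):

# TokuDB's hot column addition: the statement returns quickly and the table
# remains accessible while the change is applied in the background.
cur.execute("ALTER TABLE sample ADD COLUMN pathology VARCHAR(32)")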

Q6. Did you look/consider other vendor alternatives? If yes, which ones?

Thibault de Malliard: None. This is much too time consuming.

Q7. What are you specifically using TokuDB for?

Thibault de Malliard: We only store genetic data, along with information related to it.

Q8. How many databases do you use? What are the data requirements?

Thibault de Malliard: I had planned to use three databases:
1. A database for RNA/DNA sequencing and DNA genotyping data (described above);
2. A database for data from well-known reference databases (dbSNP, 1000 Genomes);
3. A last one to store analyzed data from databases 1 and 2.

The data stored is mainly the nucleotide (a character: A, C, G or T), plus integer information such as quality and position, and Boolean flags. I avoid using strings to keep the tables as small as possible.

Q9. In particular, what are the requirements for ingesting records and retrieving data?

Thibault de Malliard: As a research team, we do not have demanding requirements like real-time insertion from logs. But I would say that, at most, an import should take less than a night. Updating database 1 is critical when a new sequencing or genotyping experiment is added: a batch of 50M records (it can be more than 3 times that!) has to be inserted. This has been happening monthly, but it should become more frequent this year.

We have a huge amount of data, and we need to get query results as fast as possible. We were used to query times of one or two days (a weekend) – ten seconds is much preferable!

Q10. Could you give some examples of typical research requests that require data ingestion and retrieval?

Thibault de Malliard: We have a table with all the SNPs for 1,000 samples. This is currently a 100 GB table.
A typical query could be to find the sample that carries a mutation the other 999 do not. We also have some samples that form families: a child with its parents. We want to find the SNPs present in the child but not present in the other family members.
We may also want to find mutations common to one group of samples given gender, disease state or ethnicity.
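
Written against the hypothetical schema sketched earlier, the family query might look something like this (sample IDs are invented):

# SNPs observed in a child (sample 345) but in neither parent (346, 347).
cur.execute("""
    SELECT DISTINCT o.mutation_id
    FROM observation o
    WHERE o.sample_id = 345
      AND o.mutation_id NOT IN (
          SELECT mutation_id FROM observation WHERE sample_id IN (346, 347)
      )
""")
child_only_snps = [row[0] for row in cur.fetchall()]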

Q11. What kind of scalability problems did you have?

Thibault de Malliard: The problem is managing this huge amount of data. We have very few concurrent connections; most of the time, there is only one user. So I had to choose the data types, and the relationships between the tables, carefully. Recently I ran into a very slow join over a range, so I decided to split the position-based tables by chromosome. Now there are 26 tables and some procedures to run queries across the chromosomes, as in the sketch below. The time saved is hard to quantify.
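
The lab drives this through stored procedures; purely as an illustration, client code iterating the same query over per-chromosome tables could look like this (table names are hypothetical):

# Run one aggregate per chromosome table; the table suffixes come from a fixed
# list, so building the table name with an f-string is safe here.
CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]

def count_snps_per_chromosome(sample_id):
    counts = {}
    for chrom in CHROMOSOMES:
        cur.execute(
            f"SELECT COUNT(*) FROM observation_chr{chrom} WHERE sample_id = %s",
            (sample_id,),
        )
        counts[chrom] = cur.fetchone()[0]
    return counts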

Q12. Do you have any benchmarking measures to sustain the claim that Tokutek’s TokuDB has improved scalability of your system?

Thibault de Malliard: I populated the database with two billion records in the main table and then ran queries. While I did not see improvements for queries with my particular workload, I did see significant insertion performance gains. When I tried to add an extra 1M records (with LOAD DATA INFILE), it took 51 minutes for MyISAM to load the data, but less than one minute with TokuDB. Extrapolating to an RNA sequencing experiment: it should take about 2.5 days with MyISAM but one hour with TokuDB.
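
The kind of bulk load being compared is, roughly, a LOAD DATA INFILE into the TokuDB-backed linking table; a sketch with an invented file path and column order:

cur.execute("""
    LOAD DATA INFILE '/data/runs/2013_03_rnaseq.csv'
    INTO TABLE observation
    FIELDS TERMINATED BY ','
    (sample_id, mutation_id, experiment_id, quality, coverage, validated)
""")
conn.commit()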

Q13. What are the lessons learned so far in using TokuDB database storage engine in your application domain?

Thibault de Malliard: We are still developing it and adding data. But inserting data into the two main tables (0.9G records and 2.3G records) was done in a fairly good time, less than one day. Adding columns to meet the needs of the team is also very easy: it takes one second to create a column. Updating it is another story, but the table remains accessible during the process.
Another great feature, one I use with every query, is being able to follow the state of the query:
in the process list you can see the number of rows processed so far. So if you have a good estimate of the number of records expected, you know roughly how long the query will take. I cannot count the number of processes I have killed because the expected query time was not acceptable.
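
As an illustration, the query state can be polled from a second connection; the exact wording TokuDB puts in the State column varies by version, so treat this as a sketch:

import time

monitor = mysql.connector.connect(host="localhost", user="lab", password="...")  # placeholder credentials
mcur = monitor.cursor()
while True:
    mcur.execute("SHOW FULL PROCESSLIST")
    for row in mcur.fetchall():
        pid, user, host, db, command, elapsed, state, info = row[:8]
        if command == "Query" and info and "observation" in info:
            print(f"query {pid}: {elapsed}s elapsed, state: {state}")
    time.sleep(30)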

Qx. Anything you wish to add?

Thibault de Malliard: The sequencing/genotyping technologies evolve very fast, and evolving means more data from the machines. I expect our data to grow at least threefold each year. We are glad to have TokuDB in place to handle the challenge.

————-
Since 2010, Thibault de Malliard has worked in the University of Montreal’s Philip Awadalla Laboratory, where he provides bioinformatics support to the lab crew and develops bioinformatics solutions for next-generation genomic sequencing. Previously, he worked for the French National Institute for Agricultural Research (INRA) in the MIG laboratory (Mathematics, Informatics and Genomics) where, as part of the European Nanomubiop project, he was tasked with developing software to produce probes for an HPV chip. He holds a master’s degree in bioinformatics (France).

Related Posts

- Big Data: Improving Hadoop for Petascale Processing at Quantcast. March 13, 2013

- On Big Data Analytics –Interview with David Smith. February 27, 2013

- Big Data Analytics at Netflix. Interview with Christos Kalantzis and Jason Brown. February 18, 2013

- MySQL-State of the Union. Interview with Tomas Ulin. February 11, 2013

- Scaling MySQL and MariaDB to TBs: Interview with Martín Farach-Colton. October 8, 2012

Related Resources

- Big Data for Good. by Roger Barca, Laura Haas, Alon Halevy, Paul Miller, Roberto V. Zicari. June 5, 2012:
A distinguished panel of experts discuss how Big Data can be used to create Social Capital.
Blog Panel | Intermediate | English | DOWNLOAD (PDF)| June 2012|

- ODBMS.org resources on Relational Databases, NewSQL, XML Databases, RDF Data Stores.

Follow ODBMS.org on Twitter: @odbmsorg
##