“We were an early user of Hadoop and found ourselves pushing its scalability bounds and of necessity innovating our own solutions. For example we wrote our own API on top of it, rebuilt its sorter, and developed an alternative file system to achieve better performance and cost-effectiveness” — Jim Kelly and Sriram Rao.
Last October, Quantcast released the Quantcast File System (QFS) as open source. I have interviewed Jim Kelly, head of R&D at Quantcast, and Sriram Rao, Principal Scientist in Microsoft’s Cloud and Information Services Lab (CISL). Both worked closely on the development of the QFS platform at Quantcast.
RVZ
Q1. What is the main business of Quantcast?
Jim Kelly: Quantcast measures digital audiences for millions of web destinations and helps advertisers reach their audiences more effectively. We observe billions of anonymous digital media consumption events every month and use large scale machine learning techniques to characterize audiences and to identify and reach relevant ones for advertisers.
We’re not in the distributed file system business; we’re in the same category as Google and Yahoo: an advertising company doing pioneering work in big data and sharing it with the rest of the community.
Q2. What are the main big data processing challenges you have experienced so far?
Jim Kelly: We’ve been doing big data processing since the launch of our measurement product in 2006, which put us in the big data space before it had become a space. We were an early user of Hadoop and found ourselves pushing its scalability bounds and of necessity innovating our own solutions. For example we wrote our own API on top of it, rebuilt its sorter, and developed an alternative file system to achieve better performance and cost-effectiveness. Our business and data volume have grown steadily since then. We currently receive over 50 TB of data per day, continually challenging us to operate at a scale beyond most organizations’ experience and most technologies’ comfort zone.
Sriram: In terms of big data, Hadoop is targeted to the sweet spot where the volume of data being processed in a computation is roughly the size of the amount of RAM in the cluster. At Quantcast, what we found was that we were frequently running jobs where the amount of data being processed per day (even with compression) was substantial. For instance, when I started at Quantcast in 2008, we were doing 100TB per day. Scaling the data processing volume required us to rethink components of Hadoop and build novel solutions. As we developed these solutions and deployed them in our cluster, we were able to increase data processing volumes to as much as 500TB per day over a 2-year period. This increase occurred purely due to software improvements on the same hardware.
Q3. What lessons did you learn so far in using Hadoop for Big Data Analytics?
Jim Kelly: Hadoop has been a fantastic boon for us and the rest of the community. By providing a framework for developers (and increasingly, non-developers) to do parallel computing without a lot of systems knowledge, it has democratized an arcane and difficult problem. It has been a catalyst for the emergence of the big data community: a whole ecosystem of people, companies and technologies focused on different aspects of the problem. We’ve been a big beneficiary, both in terms of technology we’ve been able to use directly, and team members we’ve been able to attract because they want to do cutting-edge work in the dynamic space Hadoop has created. The lesson, I think, is how much value a technology can provide above and beyond what it does when you run it.
Sriram: The Apache Hadoop distribution is the same one that powers Yahoo!’s large clusters. Having open source software tested at such scale has provided a great starting point for companies looking to process big data. While the data processing needs at Quantcast are extreme (tens of PB per day), the main lesson, I think, is that for the vast majority of companies interested in data analytics, Hadoop has become a stable platform which works “out of the box”.
Q4. What are the scalability limits of current implementation of Hadoop?
Jim Kelly: It’s difficult to generalize, because there are so many different environments using Hadoop in different ways. What becomes an issue for one set of use cases on one set of hardware is not necessarily relevant for another, and I think it’s safe to say that the great majority of environments working with Hadoop needn’t worry about its scalability.
In our environment we’re processing a steady diet of many-terabyte jobs on many hundreds of servers, and we’ve invested a great deal in making the “sort” phase of MapReduce more efficient. In between the “map” and the “reduce,” Hadoop has to sort and group the mapper output and distribute it to the appropriate reducers. In general every reducer will have to read some of every mapper’s output, and with M mappers and N reducers that causes MxN disk seeks. When your jobs are large enough to require M and N in the thousands, this becomes a big performance bottleneck. We wrote a blog post describing the problem in more detail.
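To make the combinatorics concrete, here is a rough back-of-the-envelope calculation; the job size, task counts and seek time below are illustrative assumptions, not Quantcast’s production figures:

```python
# Why an M x N shuffle becomes seek-bound at scale (illustrative numbers only).
mappers = 5_000           # M map tasks in a large job
reducers = 5_000          # N reduce tasks
shuffle_bytes = 20e12     # 20 TB of intermediate map output
seek_s = 0.01             # ~10 ms per seek on a rotational disk

reads = mappers * reducers                 # each reducer fetches a piece of every map output
bytes_per_read = shuffle_bytes / reads     # average size of each piece

print(f"disk reads:         {reads:,}")                                # 25,000,000
print(f"avg bytes per read: {bytes_per_read / 1e6:.1f} MB")            # 0.8 MB
print(f"pure seek time:     {reads * seek_s / 3600:,.0f} disk-hours")  # ~69 disk-hours
# As M and N grow, the per-read size shrinks while the seek count grows
# quadratically, so the disks spend more time seeking than transferring data.
```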
In our own environment we’ve achieved more efficiency at this scale by developing the Quantcast File System (QFS) and Quantsort. Sriram joined us in 2008 and led the effort to adapt the Kosmos File System (KFS) to the job and to build a new sort architecture on top of it. Using the two systems, we’ve sorted data sets of nearly a petabyte, and we regularly process over 20 PB per day.
QFS also addressed the one dimension where Hadoop was scaling too readily for us: cost. By using a more efficient data encoding, it lets us do more data processing using only half the hardware we would need with Hadoop’s HDFS, and therefore half the power, half the cooling and half the maintenance.
Sriram: The fundamental issue with Hadoop is how the “shuffle” phase is handled as the data volume scales.
In a Map-Reduce computation, the output generated by map tasks during the “map” phase is partitioned and a reducer obtains its input partition from each of the map tasks. In particular, for large jobs, the size of the map output may exceed the amount of RAM in the cluster. When that happens, data has to be written to disk and then read back and transported over the network.
The “shuffle” is an “external” distributed merge-sort and is known to be seek intensive. Hadoop uses traditional mechanisms for implementing the shuffle and this is known to have scalability issues when the volume of data to be shuffled exceeds the amount of RAM in the cluster.
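The phrase “external distributed merge-sort” can be illustrated on a single node: once the data no longer fits in the memory budget, sorted runs are spilled to disk and later merged in a streaming fashion. The sketch below is a generic Python illustration of that pattern, not Hadoop’s actual spill/merge code:

```python
import heapq
import os
import random
import tempfile

def external_sort(records, run_size, workdir):
    """Sort an input too big for memory by spilling sorted runs to disk and
    streaming a k-way merge over them (the 'external' part of the sort)."""
    runs, buf = [], []
    for rec in records:
        buf.append(rec)
        if len(buf) >= run_size:              # memory budget exhausted: spill a run
            runs.append(_spill(sorted(buf), workdir))
            buf = []
    if buf:
        runs.append(_spill(sorted(buf), workdir))
    files = [open(path) for path in runs]
    try:
        # heapq.merge streams the merge; each run is read sequentially, but with
        # thousands of runs per disk these reads become seek-bound.
        for line in heapq.merge(*files):
            yield line.rstrip("\n")
    finally:
        for f in files:
            f.close()

def _spill(sorted_buf, workdir):
    fd, path = tempfile.mkstemp(dir=workdir)
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{rec}\n" for rec in sorted_buf)
    return path

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        data = (f"{random.randint(0, 10**6):07d}" for _ in range(50_000))
        out = list(external_sort(data, run_size=8_000, workdir=d))
        assert out == sorted(out)
```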
Q5. Sriram what was your contribution to the QFS project?
Sriram: QFS is an evolution of the KFS system that I started at Kosmix. The bulk of the KFS code base is in QFS. At Quantcast, we continued to evolve KFS by improving performance, by adding features, etc. One key feature we added at Quantcast was support for multi-writer append (i.e., atomic append). It turns out that this feature can be leveraged to build high-performance sort architectures, which are fundamental to doing Map-Reduce computations at scale. In general, I worked on all aspects of the KFS system.
Q6. How does the Kosmos File System (KFS) relate to the QFS?
Jim Kelly: QFS evolved from KFS and has a couple years’ improvements on it. The most significant is Reed-Solomon erasure coding, which doubles effective storage and improves performance.
Sriram: The QFS project has its roots in the KFS system that I built. The bulk of the KFS code base is in QFS.
I started KFS originally at Kosmix Corp (now @WalmartLabs) in 2006. At that point, Kosmix was trying to build a search engine and KFS was intended to be the storage substrate. The KFS architecture is based on the Google filesystem paper. The main idea in KFS is that blocks of a file are striped across nodes and replicated for fault-tolerance. We eventually released KFS as open source in 2007, after which I joined Quantcast. At Quantcast, we continued to evolve KFS by improving performance, by adding features, etc. KFS 0.5 is a snapshot of the code base that was run in the production cluster at Quantcast in 2010. So, if you will, think of KFS as QFS 0.5. The significant difference between the two systems is erasure coding.
Q7. Sriram you also worked on another open source project Sailfish. Could you tell us a bit more about it?
Sriram: Data analytics computations are run on machines in a datacenter. The frameworks for doing these computations were designed in the 2004-2006 timeframe, when bandwidth within the datacenter was a scarce resource. However, that design point is changing.
Over the next few years, technological trends suggest that bandwidth within a datacenter will substantially increase. Inter-node connectivity is expected to go up from 1Gbps between any pair of nodes to 10Gbps (and possibly higher). Given such plentiful bandwidth, how would we build analytical engines (such as Map-Reduce computation engines)? Sailfish is an attempt at re-designing how the shuffle data is transported in Map-Reduce frameworks by taking advantage of better connectivity.
Sailfish design is based on two key ideas:
1. Use network-wide data aggregation to improve disk subsystem performance.
2. Gather statistics on the aggregated data to plan for reduce phase of execution.
We instantiated these two ideas in the context of Hadoop-0.20.2, where we re-worked the entire shuffle pipeline and how the reduce phase of execution is done. Sailfish leverages the concurrent append capabilities of KFS to move data from mappers to reducers. When a Hadoop Map-Reduce job is run with Sailfish, (1) the user does not tune sort parameters and (2) the number of reducers in a job and their task assignment is determined at run-time in a data-dependent manner. Essentially, Sailfish simplifies running large-scale Hadoop-based Map-Reduce computations. We have also added support for preemption in Sailfish, where the preemption is implemented via a checkpoint/restart mechanism. This allows us to seamlessly handle skew.
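A minimal sketch of those two ideas in Python: all map output is appended into a small, fixed set of intermediate files (the Sailfish papers call these I-files), and the reduce phase is planned afterwards from the observed sizes. The file layout, record format and 64 MB target below are illustrative assumptions, not Sailfish’s actual formats or defaults:

```python
import os
from collections import defaultdict

NUM_IFILES = 16            # fixed number of aggregated intermediate files (assumption)
TARGET_BYTES = 64 << 20    # desired input size per reduce task (assumption)

def aggregate(map_outputs, workdir):
    """Idea 1: instead of every mapper writing a spill file per reducer (M x N files),
    all mappers append their records into a small, fixed set of shared files."""
    sizes = defaultdict(int)
    files = [open(os.path.join(workdir, f"ifile-{i}"), "ab") for i in range(NUM_IFILES)]
    try:
        for key, value in map_outputs:          # map_outputs: iterable of (key, value)
            i = hash(key) % NUM_IFILES
            record = f"{key}\t{value}\n".encode()
            files[i].write(record)              # stands in for a concurrent (atomic) append
            sizes[i] += len(record)
    finally:
        for f in files:
            f.close()
    return dict(sizes)

def plan_reduce_phase(sizes):
    """Idea 2: pick the number of reduce tasks per intermediate file at run time,
    from the sizes actually observed, instead of a user-tuned constant."""
    return {i: max(1, -(-n // TARGET_BYTES)) for i, n in sizes.items()}  # ceiling division

if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        stats = aggregate(((f"key{i % 1000}", i) for i in range(100_000)), d)
        print(plan_reduce_phase(stats))
```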
Here is the Sailfish website which has links to the papers we have published and the software we have released.
Q8. Who is currently using Sailfish?
Sriram: Sailfish is a research prototype which has been released to open-source so that other researchers/users can benefit from our work. We are currently working on migrating many of the ideas from Sailfish to Hadoop version 2.0 (aka YARN). We have filed some JIRAs and intend to contribute the code to Hadoop.
Q9. Last October, Quantcast released the Quantcast File System (QFS) to open source (link here). Why?
Jim Kelly: We’ve always relied heavily on open source software internally, and it’s good to be able to give back. It’s a chance for developers to have more of an impact than they could working on a proprietary system. And we believe file systems particularly, as foundational infrastructure components, benefit from the open source model where they can get a lot of scrutiny and be challenged to prove themselves in many different environments.
Q10. Is QFS supposed to be an alternative to Apache Hadoop’s HDFS or just a complementary solution?
Jim Kelly: Hadoop has a wide variety of customers with diverse priorities. Some are just getting started and put a lot of value on ease of use. Others care most about performance at small scale, or high availability. HDFS offers broad functionality to meet needs for all of them.
Quantcast’s priorities are cost efficiency and high performance at large scale, so in developing QFS we went deeper around that use case. QFS has significantly improved the performance and economics of our data processing and can do the same for other organizations running their own clusters at large scale. By contrast organizations just getting started or needing specific HDFS features will probably find HDFS is a better fit.
Q11. What are the key technical innovations of QFS?
Jim Kelly: Architecturally QFS places a couple of different bets than HDFS. It’s implemented in C++ rather than Java, which allows better control and optimization in a number of areas. It also leans more heavily on modern networks, which have become much faster since HDFS launched and allow different optimization choices.
Those differences in approach make possible QFS’s Reed-Solomon erasure coding, which doubles storage efficiency and improves performance. The concurrent append feature allows many processes to write simultaneously to a single file, enabling applications like Quantsort. QFS’s use of direct I/O also improves both performance and manageability, by ensuring processes live within a predictable memory footprint.
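The storage-efficiency claim follows directly from the encoding parameters Quantcast documents for QFS: six data stripes plus three parity stripes (Reed-Solomon 6+3), versus three full copies with HDFS’s default replication. A quick sanity check of the arithmetic, with an illustrative data volume:

```python
logical_tb = 100                              # user data to store (illustrative figure)

hdfs_replication = 3                          # HDFS default: three full copies
rs_data, rs_parity = 6, 3                     # QFS: Reed-Solomon 6+3 striping

hdfs_raw = logical_tb * hdfs_replication                  # 300 TB on disk (3.0x)
qfs_raw = logical_tb * (rs_data + rs_parity) / rs_data    # 150 TB on disk (1.5x)

print(f"HDFS raw storage: {hdfs_raw:.0f} TB")
print(f"QFS raw storage:  {qfs_raw:.0f} TB")
# Fault tolerance is comparable: 3x replication survives the loss of two copies of a
# block, while 6+3 striping can rebuild a stripe from any six of its nine chunks,
# i.e. it survives the loss of any three chunks.
```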
Sriram: One of the key innovations we added to KFS at Quantcast was support for “atomic append”.
This is the ability to allow multiple writers to append data to a file. Basically, think of the file as a “bag” into which data is dropped, with no guarantees on the order in which the data ends up in the file. It turned out that this feature can be leveraged to build high-performance sort engines.
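To see what those semantics mean for callers, here is a small, generic Python simulation of a “bag” file (a queue drained by a single appender); it is not the QFS client API, just an illustration of records from many writers landing in arbitrary, interleaved order:

```python
import multiprocessing as mp

def writer(worker_id, n, bag):
    # Each writer drops records into the shared bag with no coordination.
    for i in range(n):
        bag.put(f"worker={worker_id} record={i}")
    bag.put(None)                      # sentinel: this writer is done

def collector(bag, path, n_writers):
    # A single appender drains the bag; records land in arrival order,
    # i.e. interleaved across writers with no ordering guarantee.
    done = 0
    with open(path, "a") as f:
        while done < n_writers:
            rec = bag.get()
            if rec is None:
                done += 1
            else:
                f.write(rec + "\n")

if __name__ == "__main__":
    q = mp.Queue()
    writers = [mp.Process(target=writer, args=(w, 1000, q)) for w in range(8)]
    sink = mp.Process(target=collector, args=(q, "bag.log", len(writers)))
    for p in writers + [sink]:
        p.start()
    for p in writers + [sink]:
        p.join()
```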
Q12. What kind of benchmark results do you have comparing QFS with Hadoop’s HDFS? What benchmark did you use for that?
Jim Kelly: Both QFS and HDFS rely on a central server to manage the file system, and its performance is critical as every cluster operation is dispatched through it. We’ve tested the two head-to-head for various operations and found QFS’s leaner C++ implementation pays off. Directory listings were 22% faster, and directory creation was nearly three times as fast.
These were our tests, and we encourage other people to run their own. We’ve put scripts and instructions on GitHub to help.
Our whole-cluster tests will be harder for others to replicate at the same scale we did on our production cluster. We attempted to get some sense of how the file system affects real-world job run time. We measured end-to-end run time for simple Hadoop jobs that wrote or read 20TB of user data on an otherwise idle cluster, using QFS or HDFS as the underlying file system. The average throughput of the write job was about 75% higher using QFS, due to having to write less physical data. The read job throughput was 47% higher, primarily due to better parallelism. More information about these tests is available on the QFS wiki.
Q13. How reliable is QFS? What is your experience in using QFS in the Quantcast environment?
Jim Kelly: We started running KFS internally in 2008 on secondary storage and over years evolved it into today’s QFS, in the process adding features to make it more manageable at large scale and hardening it for production use. By 2011 it was running reliably enough that we were comfortable going all in. We copied all our data over from HDFS and have been running QFS exclusively since then for all our storage and MapReduce work. QFS has handled over six exabytes of I/O since then, so we’re confident it’s ready for other organizations’ production workloads.
Q14. You mentioned Hadoop’s architectural limitations for large jobs (terabytes and beyond): the more tasks a job uses, the less efficient its disk I/O becomes. To address that, Quantcast developed an improved Hadoop sort, called Quantsort. What is special about Quantsort?
Jim Kelly: Quantsort avoids the MxN combinatorics Hadoop faces shuffling data between mappers and reducers. It leverages QFS’s concurrent append feature to shuffle data through a fixed number of intermediate files. Reducers read their data in larger chunks, keeping disk I/O efficient. We wrote a blog post describing Quantsort more fully.
Q15. How is the performance of Quantsort?
Jim Kelly: Fast. Sriram did Quantcast’s first large sort back in 2009, of 140 TB in 6.2 hours on much less hardware than we have today. Quantsort’s record on a production job is 971 TB in 6.4 hours.
Sriram: For large jobs (think tens to hundreds of TB of data and beyond) where the map output exceeds the size of RAM, Quantsort can be up to an order of magnitude faster than Hadoop’s shuffle. Quantsort enables you to process more data quickly using the same hardware. Since jobs finish faster, it has the multiplier effect of allowing users to run more jobs on the cluster.
Q17. Is Quantsort part of QFS and therefore, also open source?
Jim Kelly: No, Quantsort is a separate code base.
Q18. In 2011, Quantcast migrated all production data to QFS and turned off HDFS. How did you manage to migrate the data stored in HDFS to QFS?
Jim Kelly: The data migration was very straightforward, a matter of running a trivial MapReduce job that copies data from one place to another. We avoided any outage for our production jobs by phasing them over, making them read from either file system but write to QFS during the transition.
Q19. How easy is it to plug QFS into an existing Hadoop cluster that already has data stored in HDFS?
Jim Kelly: QFS is plug-in compatible with Hadoop, using KFS bindings that are already included in standard Hadoop distributions, so it’s very easy to integrate. The data storage format is different, requiring either migrating data to QFS or phasing it in gradually. The QFS wiki has a migration guide.
Q20. KFS has also been released as an open source project. How does it relate to the Quantcast File System (QFS)?
Jim Kelly: Quantcast adopted KFS in 2008, became the main sponsor of contributions to the KFS project for the next two years and subsequently evolved it into QFS. You’ll find the concurrent append feature in the KFS repository, for example, but Reed-Solomon encoding is only in QFS.
Qx Anything you wish to add?
Jim Kelly: An invitation to other organizations doing distributed batch processing and interested in doubling their cluster capacity to give QFS a try. We’re happy to field questions or receive feedback on the QFS mailing list.
———-
Jim Kelly leads Quantcast’s R&D team, which works on both adding computing capacity through cluster software innovations and using it up through new analytic and modeling products. Having been at Quantcast for six years, he has seen its data volumes and processing challenges grow from zero to petabytes and led technical and organizational changes that have kept Quantcast a step ahead. Previously he held engineering leadership roles at Oracle, Kana, and Scopus Technology (acquired by Siebel). Jim holds a PhD in physics from Princeton University.
Sriram Rao is currently a Principal Scientist in Microsoft’s Cloud and Information Services Lab (CISL), and formerly led Quantcast’s Cluster Team. Sriram was the creator of the Kosmos File System (KFS), which would become the underlying architecture for the Quantcast File System (QFS) that was released to open source last year. Sriram is also the creator of Sailfish, another open source project that is geared towards processing massive amounts of data. Previously he has worked at Quantcast, Yahoo, and Kosmix (now @WalmartLabs). Sriram holds a PhD in Computer Sciences from UT-Austin.
Related Posts
–On Big Data Analytics. Interview with David Smith. February 27, 2013
–Big Data Analytics at Netflix. Interview with Christos Kalantzis and Jason Brown. February 18, 2013
–Lufthansa and Data Analytics. Interview with James Dixon. February 4, 2013
Resources
–ODBMS.org: Big Data and Analytical Data Platforms – Free Software
–ODBMS.org: Big Data and Analytical Data Platforms – Articles
Follow ODBMS.org on Twitter: @odbmsorg
“So the synergies in data management come not from how the systems connect but how the data is used to derive business value” –Steve Shine.
On Dec. 21, 2012, Actian Corp. announced the completion of the transaction to buy Versant Corporation. I have interviewed Steve Shine, CEO and President, Actian Corporation.
RVZ
Q1. Why acquiring an object-oriented database company such as Versant?
Steve Shine: Versant Corporation, like us, has a long pedigree in solving complex data management in some of the world’s largest organisations. We see many synergies in bringing the two companies together. The most important of these is together we are able to invest more resources in helping our customers extract even more value from their data. Our direct clients will have a larger product portfolio to choose from, our partners will be able to expand in adjacent solution segments, and strategically we arm ourselves with the skills and technology to fulfil our plans to deliver innovative solutions in the emerging Big Data Market.
Q2. For the enterprise market, Actian offers its legacy Ingres relational database. Versant on the other hand offers an object oriented database, especially suited for complex science/engineering applications. How does this fit? Do you have a strategy on how to offer a set of support processes and related tools for the enterprise? If yes, how?
Steve Shine: While the two databases may not have a direct logical connection at client installations, we recognise that most clients use these two products as part of larger, more holistic solutions to support their operations. The data they manage is the same and interacts to solve business issues – for example, object stores to manage the relationships between entities; transactional systems to manage clients and the supply chain; and analytic systems to monitor and tune operational performance – different systems using the same underlying data to drive a complex business.
We plan to announce a vision of an integrated platform designed to help our clients manage all their data and their complex interactions, both internal and external, so they can not only focus on running their business, but also better exploit the incremental opportunity promised by Big Data.
Q3. Bernhard Woebker, president and chief executive officer of Versant stated, “the combination of Actian and Versant provides numerous synergies for data management”. Could you give us some specific examples of such synergies for data management?
Steve Shine: Here is a specific example of what I mean by helping clients extract more value from data, in the Telco space. These types of incremental opportunities exist in every vertical we have looked at.
An OSS system in a Telco today may use an Object store to manage the complex relationships between the data, the same data is used in a relational store to monitor, control and manage the telephone network.
Another relational store using variants of the same data manages the provisioning, billing and support for the users of the network. The whole data set in Analytical stores is used to monitor and optimise performance and usage of the network.
Fast forwarding to today, the same data used in more sophisticated ways has allowed voice and data networks to converge to provide a seamless interface to mobile users. As a result, Telcos have tremendous incremental revenue opportunities BUT only if they can exploit the data they already have in their networks. For example: the data on their networks has allowed for a huge increase in location-based services; knowledge and analysis of the data content has allowed providers to push targeted advertising and other revenue-earning services at their users; then turning the phone into a common billing device to get an even greater share of the service providers’ revenue… You get the picture.
Now imagine other corporations being able to exploit their information in similar ways: Would a retailer benefit from knowing the preferences of who’s in their stores? Would a hedge fund benefit from detecting a sentiment shift for a stock as it happens? Even knowledge of simple events can help organisations become more efficient.
A salesman knowing immediately when a key client raises a support ticket; a product manager knowing what’s being asked on discussion forums; a marketing manager knowing a perfect prospect is on the website.
So the synergies in data management come not from how the systems connect but how the data is used to derive business value. We want to help manage all the data in our customers’ organisations and help them drive incremental value from it. That is what we mean by numerous synergies from data management, and we have a vision to deliver it to our customers.
Q4. Actian claims to have more than 10,000 customers worldwide. What is the value proposition of Versant’s acquisition for existing Actian customers?
Steve Shine: I have covered this in the answers above. They get access to a larger portfolio of products and services and we together drive a vision to help them extract greater value from their data.
Q5. Versant claims to have more than 150,000 installations worldwide. How do you intend to support them?
Steve Shine: Actian already runs a 24/7 global support organisation that prides itself on delivering one of the industry’s best client satisfaction scores. As far as numbers are concerned, Versant’s large user count is in essence driven by only 250 or so very sophisticated large installations, whereas Actian already deals with over 10,000 discrete mission-critical installations worldwide. So we are confident of maintaining our very high support levels, and the Versant support infrastructure is being integrated into Actian’s as we speak.
Q6. Actian is active in the market for big data analytics. How does Versant’s database technology fit into Actian’s big data analytics offerings and capabilities?
Steve Shine: Using the example above imagine using OSS data to analyse network utilisation, CDR’s and billing information to identify pay plans for your most profitable clients.
Now give these clients the ability to take business action on real-time changes in their data. Now imagine being able to do that with an integrated product set from one vendor. We will be announcing the vision behind this strategy this quarter. In addition, the Versant technology gives us additional options for big data solutions, for example visualisation and metadata management.
Q7. Do you intend to combine or integrate your analytics database Vectorwise with Versant’s database technology (such as Versant JPA)? If yes, how?
Steve Shine: Specific plans for integrating products within the overall architecture have not been formulated. We have a strong philosophy that you should use the best tool for the job, e.g. an OODB for some things, an OLTP RDBMS for others, etc. But the real value comes from being able to perform sophisticated analysis and management across the different data stores. That is part of the work our platform integration efforts are focused on.
Q8. What are the plans for future software developments. Will you have a joint development team or else?
Steve Shine: We will be merging the engineering teams to focus on providing innovative solutions for big Data under single leadership.
Q9. You have recently announced two partnerships for Vectorwise, with Inferenda and BiBoard. Will you also pursue this indirect channel path for Versant’s database technology?
Steve Shine: The beauty of the vision we speak of is that our joint partners have a real opportunity to expand their solutions using Actian’s broader product set and, for those that are innovative, the opportunity to address new emerging markets.
Q10. Versant recently developed Versant JPA. Is the Java market important for Actian?
Steve Shine: Yes !
Q11. It is currently a crowded database market: several new database vendors (NoSQL and NewSQL) offering innovative database technology (NuoDB, VoltDB, MongoDB, Cassandra, Couchbase, Riak to name a few), and large companies such as IBM and Oracle, are all chasing the big data market. What is your plan to stand out of the crowd?
Steve Shine: We are very excited about the upcoming announcement on our plans for the Big Data market. We will be happy to brief you on the details closer to the time, but I will say that early feedback from analyst houses like Gartner has confirmed that our solution is very effective and differentiated in helping corporations extract business value from Big Data. On a higher scale, many of the start-ups are going to get a very rude awakening when they find that delivering a database for mission-critical use is much more than speed and scale of technology. Enterprises want world-class 24×7 support service with failsafe resilience and security. Real industry-grade databases take years and many millions of dollars to reach scalable maturity. Most of the start-ups will not make it. Actian is uniquely positioned in being profitable and having delivered industry-grade database innovation, but also being singularly focused around data management, unlike the broad, cumbersome and expensive bigger players. We believe value-conscious enterprises will see our maturity and agility as a great strength.
Qx Anything else you wish to add?
Steve Shine: DATA! – What a great thing to be involved in! Endless value, endless opportunities for innovation and no end in sight as far as growth is concerned. I look forward to the next 5 years.
———————–
Steve Shine, CEO and President, Actian Corporation.
Steve comes to Actian from Sybase, where he was senior vice president and general manager for EMEA, overseeing all operational, sales, financial and human resources in the region for the past three years. While at Sybase, he achieved more than 200 million in revenue and managed 500 employees, charting over 50 percent growth in the Business Intelligence market for Sybase. Prior to Sybase, Steve was at Canadian-based Geac Computer Corporation for ten years, helping to successfully turn around two major global divisions for the ERP firm.
Related Posts
–Managing Internet Protocol Television Data. — An interview with Stefan Arbanowski. June 25, 2012
–On Versant`s technology. Interview with Vishal Bagga. August 17, 2011
Resources
–Analyzing Big Data With Twitter. A special UC Berkeley iSchool course.
Follow ODBMS.org on Twitter: @odbmsorg
##
“The data you’re likely to need for any real-world predictive model today is unlikely to be sitting in any one data management system. A data scientist will often combine transactional data from a NoSQL system, demographic data from an RDBMS, unstructured data from Hadoop, and social data from a streaming API” –David Smith.
On the subject of Big Data Analytics I have interviewed David Smith, Vice President of Marketing and Community at Revolution Analytics.
RVZ
Q1. How would you define the job of a data scientist?
David Smith: A data scientist is someone charged with analyzing and communicating insight from data.
It’s someone with a combination of skills: computer science, to be able to access and manipulate the data; statistical modeling, to be able to make predictions from the data; and domain expertise, to be able to understand and answer the question being asked.
Q2. What are the main technical challenges for Big Data predictive analytics?
David Smith: For a skilled data scientist, the main challenge is time. Big Data takes a long time just to move (so don’t do that, if you don’t have to!), not to mention the time required to apply complex statistical algorithms. That’s why it’s important to have software that can make use of modern data architectures to fit predictive models to Big Data in the shortest time possible. The more iterations a data scientist can make to improve the model, the more robust and accurate it will be.
Q3. R is an open source programming language for statistical analysis. Is R useful for Big Data as well? Can you analyze petabytes of data with R, and at the same time ensure scalability and performance?
David Smith: Petabytes? That’s a heck of a lot of data: even Facebook has “only” 70 PB of data, total. The important thing to remember is that “Big Data” means different things in different contexts: while raw data in Hadoop may be measured in the petabytes, by the time a data scientist selects, filters and processes it you’re more likely to be in the terabyte or even gigabyte range when the data’s ready to be applied to predictive models.
Open Source R, with its in-memory, single-threaded engine, will still struggle even at this scale, though. That’s why Revolution Analytics added scalable, parallelized algorithms to R, making predictive modeling on terabytes of data possible. With Revolution R Enterprise, you can use SMP servers or MPP grids to fit powerful predictive models to hundreds of millions of rows of data in just minutes.
Q4. Could you give us some information on how Google, and Bank of America use R for their statistical analysis?
David Smith: Google has more than 500 R users, where R is used to study the effectiveness of ads, for forecasting, and for statistical modeling with Big Data.
In the financial sector, R is used by banks like Bank of America and Northern Trust and insurance companies like Allstate for a variety of applications, including data visualization, simulation, portfolio optimization, and time series forecasting.
Q5. How do you handle the Big Data Analytics “process” challenges with deriving insight?
– capturing data
– aligning data from different sources (e.g., resolving when two objects are the same)
– transforming the data into a form suitable for analysis
– modeling it, whether mathematically, or through some form of simulation
– understanding the output
– visualizing and sharing the results
David Smith: These steps reflect the fact that data science is an iterative process: long gone are the days where we would simply pump data through a black-box algorithm and hope for the best. Data transformation, evaluation of multiple model options, and visualizing the results are essential to creating a powerful and reliable statistical model. That’s why the R language has proven so popular: its interactive language encourages exploration, refinement and presentation, and Revolution R Enterprise provides the speed and big-data support to allow the data scientist to iterate through this process quickly.
Q6. What is the tradeoff between Accuracy and Speed that you usually need to make with Big Data?
David Smith: Real-time predictive analytics with Big Data are certainly possible. (See here for a detailed explanation.) Accuracy comes with real-time scoring of the model, which is dependent on a data scientist building the predictive model on Big Data. To maintain accuracy, that model will need to be refreshed on a regular basis using the latest data available.
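The answer describes a common split: fit the model offline against the full data, score individual events in real time, and refresh the fitted model on a schedule. The interview is about R, but the pattern is language-agnostic; here is a minimal sketch in Python with scikit-learn, where the data, feature layout and file path are made up:

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

MODEL_PATH = "model.pkl"   # hypothetical location for the persisted model

def train_offline(X, y):
    """Batch job: run periodically against the full (big) training set, then persist.
    Rerunning this on fresh data is the 'refresh' that keeps scoring accurate."""
    model = LogisticRegression(max_iter=1000).fit(X, y)
    with open(MODEL_PATH, "wb") as f:
        pickle.dump(model, f)

def load_scorer():
    """Online path: load the current model once and score single events quickly."""
    with open(MODEL_PATH, "rb") as f:
        model = pickle.load(f)
    return lambda event: float(model.predict_proba(np.asarray(event).reshape(1, -1))[0, 1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 4))            # stand-in for the prepared training data
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    train_offline(X, y)
    score = load_scorer()
    print(score([0.5, 1.2, -0.3, 0.0]))          # real-time scoring of one event
```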
Q7. In your opinion, is there a technology which is best suited to build Analytics Platform? RDBMS, or instead non relational database technology, such as for example columnar database engine? Else?
David Smith: The data you’re likely to need for any real-world predictive model today is unlikely to be sitting in any one data management system. A data scientist will often combine transactional data from a NoSQL system, demographic data from an RDBMS, unstructured data from Hadoop, and social data from a streaming API.
That’s one of the reasons the R language is so powerful: it provides interfaces to a variety of data storage and processing systems, instead of being dependent on any one technology.
Q8. Cloud computing: What role does it play with Analytics? What are the main differences between Ground vs Cloud analytics?
David Smith: Cloud computing can be a cost-effective platform for the Big-Data computations inherent in predictive modeling: if you occasionally need a 40-node grid to fit a big predictive model, it’s convenient to be able to spin one up at will. The big catch is in the data: if your data is already in the cloud you’re golden, but if it lives in a ground-based data center it’s going to be expensive (in time *and* money) to move it to the cloud.
———
David Smith, Vice President, Marketing & Community, Revolution Analytics
David Smith has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment.
He continued his association with S-PLUS at Insightful (now TIBCO Spotfire), overseeing the product management of S-PLUS and other statistical and data mining products. David Smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project.
Today, David leads marketing for Revolution R, supports R communities worldwide, and is responsible for the Revolutions blog.
Prior to joining Revolution Analytics, David served as vice president of product management at Zynchros, Inc.
—
Related Posts
–Big Data Analytics at Netflix. Interview with Christos Kalantzis and Jason Brown. February 18, 2013
– Lufthansa and Data Analytics. Interview with James Dixon. February 4, 2013
–On Big Data Velocity. Interview with Scott Jarr. January 28, 2013
Resources
– Big Data and Analytical Data Platforms
Blog Posts | Free Software | Articles | PhD and Master Thesis |
– Cloud Data Stores
Blog Posts | Lecture Notes| Articles and Presentations| PhD and Master Thesis|
– NoSQL Data Stores
Blog Posts | Free Software | Articles, Papers, Presentations|Documentations, Tutorials, Lecture Notes | PhD and Master Thesis
Follow ODBMS.org on Twitter: @odbmsorg
##
“Our experience with MongoDB is that its architecture requires nodes to be declared as Master, and others as Slaves, and the configuration is complex and unintuitive. C* architecture is much simpler. Every node is a peer in a ring and replication is handled internally by Cassandra based on your desired redundancy level. There is much less manual intervention, which allows us to then easily automate many tasks when managing a C* cluster.” — Christos Kalantzis and Jason Brown.
Netflix, Inc. (NASDAQ: NFLX) is an online DVD and Blu-Ray movie retailer offering streaming movies through video game consoles, Apple TV, TiVo and more.
Last year, Netflix had a total of 29.4 million subscribers worldwide for its streaming service (Source).
I have interviewed Christos Kalantzis, Engineering Manager – Cloud Persistence Engineering, and Jason Brown, Senior Software Engineer, both at Netflix. They were involved in deploying Cassandra in a production EC2 environment at Netflix.
RVZ
Q1. What are the main technical challenges that Big data analytics pose to modern Data Centers?
Kalantzis, Brown: As companies are learning how to extract value from the data they already have, they are also identifying all the value they could be getting from data they are currently not collecting.
This is creating an appetite for more data collection. This appetite for more data is pushing the boundaries of traditional RDBMS systems and forcing companies to research alternative data stores.
This new data size also requires companies to think about the extra costs involved in storing this new data (space/power/hardware/redundant copies).
Q2. How do you handle large volume of data? (terabytes to petabytes of data)?
Kalantzis, Brown: Netflix does not have its own datacenter. We store all of our data and applications on Amazon’s AWS. This allows us to focus on creating really good applications without the overhead of thinking about how we are going to architect the Datacenter to hold all of this information.
Economy of Scale also allows us to negotiate a good price with Amazon.
Q3. Why did you choose Apache Cassandra (C*)?
Kalantzis, Brown: There are several reasons we selected Cassandra. First, as Netflix is growing internationally, a solid multi-datacenter story is important to us. Configurable replication and consistency, as well as resiliency in the face of failure, is an absolute requirement, and we have tested those capabilities more than once in production! Other compelling qualities include being an open source, Apache project and having an active and vibrant user community.
Q4: What do you exactly mean with “a solid multi-datacenter story is important to us”? Please explain.
Kalantzis, Brown: As we expand internationally we are moving and standing up new datacenters close to our new customers. In many cases we need a copy of the full dataset of an application. It is important that our Database Product be able to replicate across multiple datacenters, reliably, efficiently and with very little lag.
Q5. Tell us a little bit about the application powered by C*
Kalantzis, Brown: Almost everything we run in the cloud (which is almost the entire Netflix infrastructure) uses C* as a database. From customer/subscriber information, to movie metadata, to monitoring stats, it’s all hosted in Cassandra.
In most of our uses, Cassandra is the source-of-truth database. There are a few legacy datasets in other solutions, but they are actively being migrated.
Q6: What are the typical data insights you obtained by analyzing all of these data? Please give some examples. How do you technically analyze the data? And by the way, how large are your data sets?
Kalantzis, Brown: All the data Netflix gathers goes towards improving the customer experience. We analyze our data to understand viewing preferences, give great recommendations and make appropriate choices when buying new content.
Our BI team has done a great job with the Hadoop platform and has been able to extract the information we need from the terabytes of data we capture and store.
Q7. What Availability expectations do you have from customers?
Kalantzis, Brown: Our internal teams are incredibly demanding on every part of the infrastructure, and databases are certainly no exception. Thus a database solution must have low latency, high throughput, strict uptime/availability requirements, and be scalable to massive amounts of data. The solution must, of course, be able to withstand failure and not fall over.
Q8: Be scalable up to?
Kalantzis, Brown: We don’t have an upper boundary on how much data an application can store. That being said, we also expect the application designers to be intelligent about how much data they need readily available in their OLTP system. Our applications store anywhere from 10 GB to 100 terabytes [Edit: corrected a typo] of data in their respective C* clusters. C* architecture is such that the cluster’s capacity grows linearly with every node added to the cluster. So in theory we can scale “infinitely”.
Q9. What other methods did you consider for continuous availability?
Kalantzis, Brown: We considered and experimented with MongoDB, yet the operational overhead and complexity made it unmanageable so we quickly backed away from it. One team even built a sharded RDBMS cluster, with every node in the cluster being replicated twice. This solution is also very complex to manage. We are currently working to migrate to C* for that application.
Q10. Could you please explain in a bit more detail, what kind of complexity made it unmanageable using MongoDB for you?
Kalantzis, Brown: Netflix strives to choose architectures that are simple and require very little in the way of manual intervention to manage and scale. Our experience with MongoDB is that its architecture requires nodes to be declared as Master, and others as Slaves, and the configuration is complex and unintuitive. C* architecture is much simpler. Every node is a peer in a ring and replication is handled internally by Cassandra based on your desired redundancy level. There is much less manual intervention, which allows us to then easily automate many tasks when managing a C* cluster.
Q11. How many data centers are you replicating among?
Kalantzis, Brown: For most workloads, we use two Amazon EC2 regions. For very specific workloads, up to four EC2 regions.
To slice that further, each region has two to three availability zones (you can think of an availability zone as the closest equivalent to a traditional data center). We shard and replicate the data within a region across its availability zones, and replicate between regions.
Q12: When you shard data, don’t you have a possible data consistency problem when updating the data?
Kalantzis, Brown: Yes, when writing across multiple nodes there is always the issue of consistency. C* “solves” this by allowing the client (Application) to choose the consistency level it desires when writing. You can choose to write to only 1 node, all nodes or a quorum of nodes. Each choice offers a different level of consistency and the application will only return from its write statement when the desired consistency is reached.
The same applies to reading: you can choose the level of consistency you desire with every statement. The timestamps of the records returned by the replicas are compared, and only the latest record is returned.
In the end the application developer needs to understand the trade offs of each setting (speed vs consistency) and make the appropriate decision that best fits their use case.
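In code, the per-statement choice looks like the sketch below, written with the DataStax Python driver for brevity (Netflix’s own client, Astyanax, mentioned later in this interview, is Java); the contact point, keyspace, table and columns are hypothetical:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])       # assumption: a local test node
session = cluster.connect("demo")      # assumption: keyspace "demo" already exists

# Fast, weakly consistent write: the acknowledgement of a single replica is enough.
fast_write = SimpleStatement(
    "INSERT INTO bookmarks (user_id, movie_id, position) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE)
session.execute(fast_write, (42, 1234, 3600))

# Stronger read: a quorum of replicas must answer; the record with the latest
# timestamp wins, which is the reconciliation described above.
quorum_read = SimpleStatement(
    "SELECT position FROM bookmarks WHERE user_id = %s AND movie_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(quorum_read, (42, 1234)).one()
print(row.position if row else None)

cluster.shutdown()
```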
Q13. How do you reduce the I/O bandwidth requirements for big data analytics workloads?
Kalantzis, Brown: What is great about C* is that it allows you to scale linearly with commodity hardware. Furthermore, adding more hardware is very simple and the cluster will rebalance itself. To solve the I/O issue we simply add more nodes. This reduces the amount of data stored in each node, allowing us to get as close as possible to the ideal data:memory ratio of 1.
Q14. What is the tradeoff between Accuracy and Speed that you usually need to make with Big Data?
Kalantzis, Brown: When we chose C* we made a conscious decision to accept eventually consistent data.
We decided write/read speed and high availability were more important than consistency. Our application is such that this tradeoff, although it might rarely provide inaccurate data (starting a movie at the wrong location), does not negatively impact a user. When an application does require accuracy, then we increase the consistency to quorum.
Q15. How do you ensure that your system does not becomes unavailable?
Kalantzis, Brown: Minimally, we run our C* clusters (over 50 in production now) across multiple availability zones within each region of EC2. If a cluster is multi-region (multi-datacenter), then one region is naturally isolated (for reads) from the other; all writes will eventually make it to all regions due to Cassandra’s replication.
Q16. How do you handle information security?
Kalantzis, Brown: We currently rely on Amazon’s security features to limit who can read/write to our C* clusters. As we move forward with our cloud strategy and want to store Financial data in C* and in the cloud, we are hoping that new security features in DSE 3.0 will provide a roadmap for us. This will enable us to move even more sensitive information to the Cloud & C*.
Q17. How is your experience with virtualization technologies?
Kalantzis, Brown: Netflix runs all of its infrastructure in the cloud (AWS). We have extensive experience with Virtualization, and fully appreciate all the Pros and Cons of virtualization.
Q18 How operationally complex is it to manage a multi-datacenter environment?
Kalantzis, Brown: Netflix invested heavily in creating tools to manage multi-datacenter and multi-region computing environments. Tools such as Asgard and Priam have made that management easy and scalable.
Q19. How do you handle modularity and flexibility?
Kalantzis, Brown: At the data level, each service typically has its own Cassandra cluster so it can scale independently of other services. We are developing internal tools and processes for automatically consolidating and splitting clusters to optimize efficiency and cost. At the code level, we have built and open sourced our Java Cassandra client, Astyanax, and added many recipes for extending the use patterns of Cassandra.
Q20. How do you reduce (if any) energy use?
Kalantzis, Brown: Currently, we do not. However, C* is introducing a new feature in version 1.2 called ‘virtual nodes’. The short explanation of virtual nodes is that they make scaling the size of a C* cluster up and down much easier to manage. Thus, we are planning on using the virtual nodes concept with EC2 auto-scaling, so we would scale up the number of nodes in a C* cluster during peak times, and reduce the node count during troughs. We’ll save energy (and money) by simply using fewer resources when demand is not present.
Q21. What advice would you give to someone needing to ensure continuous availability?
Kalantzis, Brown: Availability is just one of the 3 dimensions of CAP (Consistency, Availability and “Partitionability”). They should evaluate which of the other 2 dimensions are important to them and choose their technology accordingly. C* solves for A&P (and some of C).
Q22. What are the main lessons learned at Netflix in using and deploying Cassandra in a production EC2 environment?
Kalantzis, Brown: We learned that deploying C* (or any database product) in a virtualized environment has trade-offs. You trade the fact that you have easy and quick access to many virtual machines, yet each virtual machine has MUCH lower IOPS than a traditional server. Netflix has learned how to deal with those limitations and trade-offs by sizing clusters appropriately (number of nodes to use).
It has also forced us to reimagine the DBA role as a DB Engineer role, which means we work closely with application developers to make sure that their schema and application design is as efficient as possible and does not access the data store unnecessarily.
————-
Christos Kalantzis
(@chriskalan) Engineering Manager – Cloud Persistence Engineering Netflix
Previously Engineering-Platform Manager YouSendIt
A Tech enthusiast at heart, I try to focus my efforts in creating technology that enhances our lives.
I have built and led teams at YouSendIt and Netflix, which has led to the scaling out of persistence layers, the creation of a cloud file system and the adoption of Apache Cassandra as a scalable and highly available data solution.
I’ve worked as a DB2, SQL Server and MySQL DBA for over 10 years and, through sometimes painful trial and error, I have learned the advantages and limitations of RDBMS and when the modern NoSQL solutions make sense.
I believe in sharing knowledge, that is why I am a huge advocate of Open Source software. I share my software experience through blogging, pod-casting and mentoring new start-ups. I sit on the tech advisory board of the OpenFund Project which is an Angel VC for European start-ups.
Jason Brown
(@jasobrown), Senior Software Engineer, Netflix
Jason Brown is a Senior Software Engineer at Netflix where he led the pilot project for using and deploying Cassandra in a production EC2 environment. Lately he’s been contributing to various open source projects around the Cassandra ecosystem, including the Apache Cassandra project itself. Holding a Master’s Degree in Music Composition, Jason longs to write a second string quartet.
———-
Related Posts
– Lufthansa and Data Analytics. Interview with James Dixon. February 4, 2013
–On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. December 5, 2012
–Big Data Analytics. Interview with Duncan Ross. November 12, 2012
Follow ODBMS.org on Twitter: @odbmsorg
##
“With MySQL 5.6, developers can now commingle the “best of both worlds” with fast key-value look up operations and complex SQL queries to meet user and application specific requirements” –Tomas Ulin.
On February 5, 2013, Oracle announced the general availability of MySQL 5.6.
I have interviewed Tomas Ulin, Vice President for the MySQL Engineering team at Oracle. I asked him several questions on the state of the union for MySQL.
RVZ
Q1. You support several different versions of the MySQL database. Why? How do they differ with each other?
Tomas Ulin: Oracle provides technical support for several versions of the MySQL database to allow our users to maximize their investments in MySQL. Additional details about Oracle’s Lifetime Support policy can be found here.
Each new version of MySQL has added new functionality and improved the user experience. Oracle just made available MySQL 5.6, delivering enhanced linear scalability, simplified query development, better transactional throughput and application availability, flexible NoSQL access, improved replication and enhanced instrumentation.
Q2. Could you please explain in some more details how MySQL can offer a NoSQL access?
Tomas Ulin: MySQL 5.6 provides simple, key-value interaction with InnoDB data via the familiar Memcached API. Implemented via a new Memcached daemon plug-in to mysqld, the new Memcached protocol is mapped directly to the native InnoDB API and enables developers to use existing Memcached clients to bypass the expense of query parsing and go directly to InnoDB data for lookups and transactionally compliant updates. With MySQL 5.6, developers can now commingle the “best of both worlds” with fast key-value look up operations and complex SQL queries to meet user and application specific requirements. More information is available here.
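From a client’s point of view, this key-value path is just the memcached protocol. Below is a minimal sketch using the python-memcached library; it assumes a MySQL 5.6 server with the daemon_memcached plugin installed and listening on the default memcached port, the key and value are made up, and which InnoDB table and columns they map to is configured on the server side:

```python
import memcache   # pip install python-memcached

# The InnoDB memcached plugin speaks the standard memcached protocol, so any
# stock memcached client works; 11211 is the plugin's default port.
mc = memcache.Client(["127.0.0.1:11211"])

# set/get bypass the SQL parser and touch rows in the mapped InnoDB table directly.
mc.set("session:42", "logged_in")     # hypothetical key and value
print(mc.get("session:42"))

# The same row stays visible to ordinary SQL on the same server, so complex
# queries can run over data that was written through the key-value path.
```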
MySQL Cluster presents multiple interfaces to the database, also providing the option to bypass the SQL layer entirely for native, blazing fast access to the tables. Each of the SQL and NoSQL APIs can be used simultaneously, across the same data set. NoSQL APIs for MySQL Cluster include memcached as well as the native C++ NDB API, Java (ClusterJ and ClusterJPA) and HTTP/REST. Additionally, during our MySQL Connect Conference last fall, we announced a new Node.js NoSQL API to MySQL Cluster as an early access feature. More information is available here.
Q3. No single data store is best for all uses. What are the applications that are best suited for MySQL, and which ones are not?
Tomas Ulin: MySQL is the leading open source database for web, mobile and social applications, delivered either on-premise or in the cloud. MySQL is increasingly used for Software-as-a-Service applications as well.
It is also a popular choice as an embedded database with over 3,000 ISVs and OEMs using it.
MySQL is also widely deployed for custom IT and departmental enterprise applications, where it is often complementary to Oracle Database deployments. It also represents a compelling alternative to Microsoft SQL Server with the ability to reduce database TCO by up to 90 percent.
From a developer’s perspective, there is a need to address growing data volumes, and very high data ingestion and query speeds, while also allowing for flexibility in what data is captured. For this reason, the MySQL team works to deliver the best of the SQL and Non-SQL worlds to our users, including native, NoSQL access to MySQL storage engines, with benchmarks showing 9x higher INSERT rates than using SQL, while also supporting online DDL.
At the same time, we do not sacrifice data integrity by trading away ACID compliance, and we do not trade away the ability to run complex SQL-based queries across the same data sets. This approach enables developers of new services to get the best out of MySQL database technologies.
Q4. Playful Play, a Mexico-based company, is using MySQL Cluster Carrier Grade Edition (CGE) to support three million subscribers on Facebook in Latin America. What are the technical challenges they are facing for such project? How do they solve them?
Tomas Ulin: As a start-up business, Playful Play’s leading priority was fast time to market at the lowest possible cost. As a result, they developed the first release of the game on the LAMP stack.
To meet both the scalability and availability requirements of the game, Playful Play initially deployed MySQL in a replicated, multi-master configuration.
As Playful Play’s game, La Vecindad de El Chavo, spread virally across Facebook, subscriptions rapidly exceeded one million users, leading Playful Play to consider how best to architect their gaming platforms for long-term growth.
The database is core to the game, responsible for managing:
• User profiles and avatars;
• Gaming session data;
• In-app (application) purchases;
• Advertising and digital marketing event data.
In addition to growing user volumes, the El Chavo game also added new features that changed the profile of the database. Operations became more write-intensive, with INSERTs and UPDATEs accounting for up to 70 percent of the database load.
The game’s popularity also attracted advertisers, who demanded strict SLAs for both performance (predictable throughput with low latency) as well as uptime.
After their evaluation, Playful Play decided MySQL Cluster was best suited to meet their needs for scale and HA.
After their initial deployment, they engaged MySQL consulting services from Oracle to help optimize query performance for their environment, and started to use MySQL Cluster Manager to manage their installation, including automating the scaling of their infrastructure to support growth of 30,000 new users every day. With subscriptions to MySQL Cluster CGE, which includes Oracle Premier Support and MySQL Cluster Manager in an integrated offering, Playful Play has access to qualified technical and consultative support, which is also very important to them.
Playful Play currently supports more than four million subscribers with MySQL.
More details on their use of MySQL Cluster are here.
Q5. What are the current new commercial extensions for MySQL Enterprise Edition? How do commercial extensions differ from standard open source features of MySQL?
Tomas Ulin: MySQL Community Edition is available to all at no cost under the GPL. MySQL Enterprise Edition includes advanced features, management tools and technical support to help customers improve productivity and reduce the cost, risk and time to develop, deploy and manage MySQL applications. It also helps customers improve the performance, security and uptime of their MySQL-based applications.
This link is to a short demo that illustrates the added value MySQL Enterprise Edition offers.
Further details about all the commercial extensions can be found in this white paper. (Edit: You must be logged in to access this content.)
The newest additions to MySQL Enterprise Edition, released last fall during MySQL Connect, include:
* MySQL Enterprise Audit, to quickly and seamlessly add policy-based auditing compliance to new and existing applications.
* Additional MySQL Enterprise High Availability options, including Distributed Replicated Block Device (DRBD) and Oracle Solaris Clustering, increasing the range of certified and supported HA options for MySQL.
Q6. In September 2012, Oracle announced the first development milestone release of MySQL Cluster 7.3. What is new?
Tomas Ulin: The release comprises:
• Development Release 1: MySQL Cluster 7.3 with Foreign Keys. This has been one of the most requested enhancements to MySQL Cluster – enabling users to simplify their data models and application logic – while extending the range of use-cases.
• Early Access “Labs” Preview: MySQL Cluster NoSQL API for Node.js. Implemented as a module for the V8 engine, the new API provides Node.js with a native, asynchronous JavaScript interface that can be used to both query and receive results sets directly from MySQL Cluster, without transformations to SQL. This gives lower latency for simple queries, while also allowing developers to build “end-to-end” JavaScript based services – from the browser, to the web/application layer through to the database, for less complexity.
• Early Access “Labs” Preview: MySQL Cluster Auto-Installer. Implemented with a standard HTML GUI and Python-based web server back-end, the Auto-Installer intelligently configures MySQL Cluster based on application requirements and available hardware resources. This makes it simple for DevOps teams to quickly configure and provision highly optimized MySQL Cluster deployments – whether on-premise or in the cloud.
Additional details can be found here.
Q7. John Busch, previously CTO of former Schooner Information Technology commented in a recent interview (1): “legacy MySQL does not scale well on a single node, which forces granular sharding and explicit application code changes to make them sharding-aware and results in low utilization of servers”. What is your take on this?
Tomas Ulin: Improving scalability on single nodes has been a significant area of development; for example, MySQL 5.6 introduced a series of enhancements in the server, optimizer and InnoDB storage engine that build further on earlier releases. Benchmarks show close to linear scalability on systems with 48 cores/threads, and 230% higher performance than MySQL 5.5. Details are here.
Q8. Could you please explain in your opinion the trade-off between scaling out and scaling up? What does it mean in practice when using MySQL?
Tomas Ulin: On the scalability front, MySQL has come a long way in the last five years, and MySQL 5.6 has made huge improvements here – depending on your workload, you may scale well up to 32 or 48 cores. The old and proven techniques also still work well: you can use master-slave replication and split your load by sending writes to the master only and reads to the slave(s), or use “sharding” to partition data across multiple servers (a simple read/write-splitting sketch follows the list below). The question remains the same: do you hit any bottlenecks on a single (big) server or not? If you do, then you can start to think about which kind of “distributed” solution is most appropriate for you. And this is not specific to the MySQL server – similar problems and solutions apply to any application targeting high-load activity today.
Some things to consider…
Scale-up pros:
– easier management
– easier to achieve consistency of your data
Scale-up cons:
– you need to dimension your server up-front, or take into account the cost of throwing away your old hardware when you scale
– cost/performance is typically better on smaller servers
– in the end higher cost
– at some point you reach the limit, i.e. you can only scale-up so much
Scale-out pros:
– you can start small with limited investment, and invest incrementally as you grow, reusing existing servers
– you can choose hardware with the optimal cost/performance
Scale-out cons:
– more complicated management
– you need to manage data and consistency across multiple servers, typically by making the application/middle tier aware of the data distribution and server roles in your scale-out (choosing MySQL Cluster for scale-out does not incur this con, only if you choose a more traditional master-slave MySQL setup)
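As referenced above, here is a minimal sketch of master-slave read/write splitting over plain JDBC; the hostnames, schema and credentials are placeholders, and a production setup would typically add a connection pool or a routing/proxy layer on top:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ReadWriteSplit {
    // Placeholder endpoints: one master for writes, one replica for reads (MySQL Connector/J assumed).
    private static final String MASTER_URL  = "jdbc:mysql://master-host:3306/app";
    private static final String REPLICA_URL = "jdbc:mysql://replica-host:3306/app";

    public void recordLogin(int userId) throws SQLException {
        // All writes go to the master.
        try (Connection master = DriverManager.getConnection(MASTER_URL, "app", "secret");
             PreparedStatement stmt = master.prepareStatement(
                 "INSERT INTO logins(user_id) VALUES (?)")) {
            stmt.setInt(1, userId);
            stmt.executeUpdate();
        }
    }

    public int countLogins(int userId) throws SQLException {
        // Reads are served by a replica; with asynchronous replication the result may lag the master slightly.
        try (Connection replica = DriverManager.getConnection(REPLICA_URL, "app", "secret");
             PreparedStatement stmt = replica.prepareStatement(
                 "SELECT COUNT(*) FROM logins WHERE user_id = ?")) {
            stmt.setInt(1, userId);
            try (ResultSet rs = stmt.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }
}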
Q9. How can you obtain scalability and high-performance with Big Data and at the same time offer SQL joins?
Tomas Ulin: Based on estimates from leading Hadoop vendors, around 80 percent of their deployments are integrated with MySQL.
As discussed above, there has been significant development in NoSQL APIs to the InnoDB and MySQL Cluster storage engines, which allow high-speed ingestion of high-velocity Key/Value data, but which also allow complex queries, including JOIN operations, to run across that same data set using SQL.
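For instance, data ingested through a key/value API can still be joined relationally. The sketch below (hypothetical tables "users" and "events", placeholder connection details) runs an ordinary SQL JOIN over the same data set via JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JoinOverIngestedData {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: "events" is populated at high speed through a NoSQL API
        // (memcached plugin or ClusterJ), "users" is an ordinary relational table.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/app", "app", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT u.name, COUNT(*) AS events " +
                 "FROM users u JOIN events e ON e.user_id = u.id " +
                 "GROUP BY u.name")) {
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + ": " + rs.getLong("events"));
                }
            }
        }
    }
}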
Technologies like Apache Sqoop are commonly used to load data to and from MySQL and Hadoop, so many users will JOIN structured, relational data from MySQL with unstructured data such as clickstreams in Map/Reduce jobs running within Hadoop. We are also working on our Binlog API to enable real-time change data capture (CDC) with Hadoop. More on the Binlog API is here.
Q10. Talking about scalability and performance what are the main differences if the database is stored on hard drives, SAN, flash memory (Flashcache)? What happens when data does not fit in DRAM?
Tomas Ulin: The answer depends on your workload.
The most important question is: how long does your active data set remain cached? If it is not cached for long, your workload will remain I/O-bound and faster storage will help. But if the active data set is small enough to be cached most of the time, the impact of faster storage will be much less noticeable. MySQL 5.6 delivers many changes that improve performance on heavily I/O-bound workloads.
————–
Mr. Tomas Ulin has been working with the MySQL Database team since 2003. He is Vice President for the MySQL Engineering team, responsible for the development and maintenance of the MySQL related software products within Oracle, such as the MySQL Server, MySQL Cluster, MySQL Connectors, MySQL Workbench, MySQL Enterprise Backup, and MySQL Enterprise Monitor. Prior to working with MySQL his background was in the telecom industry, working for the Swedish telecom operator Telia and Telecom vendor Ericsson. He has a Masters degree in Computer Science and Applied Physics from Case Western Reserve University and a PhD in Computer Science from the Royal Institute of Technology.
Related Posts
(1): A super-set of MySQL for Big Data. Interview with John Busch, Schooner. on February 20, 2012
(2): Scaling MySQL and MariaDB to TBs: Interview with Martín Farach-Colton. on October 8, 2012.
–On Eventual Consistency– Interview with Monty Widenius. on October 23, 2012
Resources
ODBMS.org: Relational Databases, NewSQL, XML Databases, RDF Data Stores:
Blog Posts | Free Software | Articles and Presentations| Lecture Notes | Tutorials| Journals |
Follow us on Twitter: @odbmsorg
##
“Lufthansa is now able to aggregate and feed data into a management cockpit to analyze collected data for key decision-making purposes in the future. Users get instantly notified of transmission errors, enabling the company to detect patterns on large amounts of data at a rapid speed. Automatic alarm messages are also sent to IT product management, and partner airlines are informed right away when transmission errors occur between the different IT systems for passenger data. Lufthansa is now able to comprehensively monitor one of its most important core processes in real-time for quality management: the handover of passenger data between different airlines” — James Dixon.
On the state of the market for Big Data Analytics I have Interviewed James Dixon, co-founder and Chief Geek / CTO, Pentaho Corporation.
RVZ
Q1. What is, in your opinion, the realistic expected market demand for Big Data analytics?
James Dixon: Big. Until recently it has not been possible to perform analysis of sub-transactional and detailed operational data for a reasonable price-tag. Systems such as Hadoop and the NoSQL repositories such as MongoDB and Cassandra make it possible to economically store and process large amounts of data. The first use of this data is often to answer operational and tactical questions. Shortly after that comes the desire to answer managerial and strategic questions, and this is where Big Data Analytics comes in. I estimate that 90% of all Big Data repositories will have some form of reporting/visualization/analysis requirement applied to them.
Q2. Aren’t we too early with respect to the maturity of the Big Data Analytics technology and the market acceptance?
James Dixon: We are early, but not too early. There is significant market acceptance in certain domains already – financial services, SaaS application providers, and media companies to name a few. As these initial markets mature we will see common use cases emerge and public endorsements of these technologies; this will help to increase acceptance in other markets. We’ve seen a significant uptake in commercial deals over the last few quarters, whereas 2011 was more tire-kicking and exploratory.
Q3. Pentaho has worked with Lufthansa to improve their passenger handling. Could you please tell us more about this? In particular what requirement and technical challenges did you have for this project? And how did you solve them?
James Dixon: Lufthansa needed a solution that would make the core processes of Inter Airline Through Check In (IATCI) accessible, measurable and available for real-time operational monitoring. They also wanted to deliver consolidated management reporting dashboards based on this information to inform decision making. This was implemented by Pentaho’s services organization with onsite training and consulting. Our Pentaho Business Analytics suite was used as the front-end for real-time data analysis and report generation. In the back-end, Pentaho Data Integration (aka Kettle) retrieves, transforms and loads the message data streams into the data warehouse on a continuous basis.
Q4. And what results did you obtain so far?
James Dixon: Lufthansa is now able to aggregate and feed data into a management cockpit to analyze collected data for key decision-making purposes in the future. Users get instantly notified of transmission errors, enabling the company to detect patterns on large amounts of data at a rapid speed. Automatic alarm messages are also sent to IT product management, and partner airlines are informed right away when transmission errors occur between the different IT systems for passenger data. Lufthansa is now able to comprehensively monitor one of its most important core processes in real-time for quality management: the handover of passenger data between different airlines. With Pentaho, Lufthansa is now instantly aware if they are dealing with a single occurrence of an error or if there is a pattern. They can immediately take action in order to minimize the impact on their passengers.
Q5. What is special about Pentaho’s big data analytic platform? How does it differ with respect to other vendors?
James Dixon: We have an end-to-end offering that encompasses data integration/orchestration across Big Data and regular data stores/sources, data transformation, desktop and web-based reporting, slice-and-dice analysis tools, dashboarding, and predictive analytics. Very few vendors have the breadth of technology that we do, and those that do are mainly pushing hardware and services. We enable the creation of hybrid solutions that allow companies to use the most appropriate data storage technology for every part of their system – we don’t force you to load all your data into Hadoop, for example. From an architecture perspective, our ability to run our data integration engine inside of MapReduce tasks on the data nodes is a unique capability. And we provide analytics directly on top of big data technologies that give users instant results via our schema-on-read approach – you don’t have to predefine ETL or schemas or data marts – we do it on the fly.
Q6. What are the technical challenges in creating and viewing Analytics on the iPad?
James Dixon: The navigation concepts are different on mobile devices, so the overall user experience of the analysis software needs to be adapted for the iPad. Vendors need to be sensitive to the interaction techniques that the touch screens provide. We have changed the way that all of our end-user web-based interfaces work so that experience on the iPad is similar. It is possible to allow ad-hoc analysis and content authoring on the iPad, and Pentaho provides that with our recent V4.8 release.
Q7. Big Data and Mobile: what are the challenges and opportunities?
James Dixon: There are some use cases that are easy to identify. Report bursting to mobile and non-mobile devices is a technique that is easy to do today. Real-time analysis of Big Data combined with the alerting and notification capability of mobile devices is an interesting combination.
Q8. How are you supposed to view complex analytics with the limited display of a mobile phone?
James Dixon: Even with a desktop computer and a large monitor, analysis of Big Data requires lots of aggregation and/or lots of filtering. If you could display all the raw data from a Big Data repository, you would not be able to interpret it. As the display gets smaller the amount of aggregation and filtering has to go up, and the complexity has to come down. It is possible to do reasonably complex analysis on a tablet, but it is certainly a challenge on the smaller devices.
Q9. What are the main technical and business challenges that customers face when they want to use Cloud analytics deployments?
James Dixon: Moving large amounts of data around is a hurdle for some organizations. For this reason cloud analytics is not very appealing to companies with established data centers. However, young companies that exclusively use hosted applications do not have their data on-premise. As these companies grow and mature we will see the market for cloud analytics increase.
Q10. Pentaho has announced in July this year a technical integration of their analytics platform with Cloudera. What is the technical and business meaning of this? What are results obtained so far?
James Dixon: We are working closely with Cloudera on a technical and business level. For example, we worked with Cloudera to test their new Impala database with Pentaho’s analytics, so that we could demo the integration on the day that Impala was announced. We also have joint marketing campaigns and sales field engagement, as customers of Cloudera find tremendous benefit in engaging with Pentaho and vice-versa. Our tech makes it much easier and 20x faster to get Hadoop productive, so their customers gravitate to us naturally.
—–
James Dixon, Founder and Chief Geek / CTO, Pentaho Corporation
As “Chief Geek” (CTO) at Pentaho, James Dixon is responsible for Pentaho’s architecture and technology roadmap. James has over 15 years of professional experience in software architecture, development and systems consulting. Prior to Pentaho, James held key technical roles at AppSource Corporation (acquired by Arbor Software which later merged into Hyperion Solutions) and Keyola (acquired by Lawson Software). Earlier in his career, James was a technology consultant working with large and small firms to deliver the benefits of innovative technology in real-world environments.
Related Posts
– On Big Data Velocity. Interview with Scott Jarr. on January 28, 2013
– The Gaia mission, one year later. Interview with William O’Mullane. on January 16, 2013
– Big Data Analytics– Interview with Duncan Ross on November 12, 2012
– On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. on December 5, 2012
– Managing Big Data. An interview with David Gorbet on July 2, 2012
– Analytics at eBay. An interview with Tom Fastner. on October 6, 2011
Resources
– Big Data: Challenges and Opportunities.
Roberto V. Zicari, October 5, 2012.
Abstract: In this presentation I review three current aspects related to Big Data:
1. The business perspective, 2. The Technology perspective, and 3. Big Data for social good.
Presentation (89 pages) | Intermediate| English | DOWNLOAD (PDF)| October 2012|
– ODBMS.org: Big Data and Analytical Data Platforms.
Blog Posts | Free Software | Articles | PhD and Master Thesis |
You can follow ODBMS.org on Twitter : @odbmsorg.
##
“There is only so much static data in the world as of today. The vast majority of new data, the data that is said to explode in volume over the next 5 years, is arriving from a high velocity source. It’s funny how obvious it is when you think about it. The only way to get Big Data in the future is to have it arrive at a high velocity rate” — Scott Jarr.
One of the key technical challenges of Big Data is (Data) Velocity. On that, I have interviewed Scott Jarr, Co-founder and Chief Strategy Officer of VoltDB.
RVZ
Q1. Marc Geall, past Head of European Technology Research at Deutsche Bank AG/London, writes about the “Big Data myth”, claiming that there is:
1) limited need of petabyte-scale data today,
2) very low proportion of databases in corporate deployment which requires more than tens of TB of data to be handled, and
3) lack of availability and high cost of highly skilled operators (often post-doctoral) to operate highly scalable NoSQL clusters.
What is your take on this?
Scott Jarr: Interestingly I agree with a lot of this for today. However, I also believe we are in the midst of a massive shift in business to what I call data-as-a-priority.
We are just beginning, but you can already see the signs. People are loath to get rid of anything, sensors are capturing finer resolutions, and people want to make far more data-informed decisions.
I also believe that the value that corporate IT teams were able to extract from data with the advent of data warehouses really whetted the appetite for what could be done with data. We are now seeing people ask questions like “why can’t I see this faster,” or “how do we use this incoming data to better serve customers,” or “how can we beat the other guys with our data.”
Data is becoming viewed as a corporate weapon. Combine inbound data rates (velocity) with the desire to use data for better decisions and you have data sizes that will dwarf what is considered typical today. And almost no industry is excluded. The cost ceiling has collapsed.
Q2: What are the typical problems that are currently holding back many Big Data projects?
Scott Jarr:
1) Spending too much time trying to figure out what solution to use for what problem. We were seeing this so often that we created a graphic and presentation that addresses this topic. We called it the Data Continuum.
2) Putting out fires that the current data environment is causing. Most infrastructures aren’t ready for the volume or velocity of data that is already starting to arrive at their doorsteps. Teams are spending a ton of time applying band-aids to small-data infrastructure and are unable to shake free to focus on the Big Data infrastructure that will be a longer-term fix.
3) Being able to clearly articulate the business value the company expects to achieve from a Big Data project has a way of slowing things down in a radical way.
4) Most of the successes in Big Data projects today are in situations where the company has done a very good job maintaining a reasonable scope to the project.
Q3: Why is it important to solve the Velocity problem when dealing with Big Data projects?
Scott Jarr: There is only so much static data in the world as of today. The vast majority of new data, the data that is said to explode in volume over the next 5 years, is arriving from a high velocity source. It’s funny how obvious it is when you think about it. The only way to get Big Data in the future is to have it arrive at a high velocity rate.
Companies are recognizing the business value they can get by acting on that data as it arrives rather than depositing it in a file to be batch processed at some later date. Much of the context around that data is lost when it is not acted on quickly.
Q4: What exactly is Big Data Velocity? Is Big Data Velocity the same as stream computing?
Scott Jarr: We think of Big Data Velocity as data that is coming into the organization at a rate that can’t be managed in the traditional database. However, companies want to extract the most value they can from that data as it arrives. We see them doing three specific things:
1) Ingesting relentless feed(s) of data;
2) Making a decision on each piece of data as it arrives; and
3) Using real-time analytics to derive immediate insights into what is happening with this velocity data.
Making the best possible decision each time data is touched is what velocity is all about. These decisions used to be called transactions in the OLTP world. They involve using other data stored in the database to make the decision – approve a transaction, serve the ad, authorize the access, etc. These decisions, and the real-time analytics that support them, all require the context of other data. In other words, the database used to perform these decisions must hold some amount of previously processed data – it must hold state. Streaming systems are good at a different set of problems.
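To make the idea of decisioning with state concrete, here is a minimal sketch of what such a transaction could look like as a VoltDB Java stored procedure; the tables, columns and approval rule are hypothetical, not taken from the interview:

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Hypothetical procedure: decide on each incoming purchase event as it arrives,
// using state already held in the database (the account's current balance).
public class ApprovePurchase extends VoltProcedure {

    private final SQLStmt getBalance = new SQLStmt(
        "SELECT balance FROM accounts WHERE account_id = ?;");
    private final SQLStmt debit = new SQLStmt(
        "UPDATE accounts SET balance = balance - ? WHERE account_id = ?;");
    private final SQLStmt logDecision = new SQLStmt(
        "INSERT INTO decisions (account_id, amount, approved) VALUES (?, ?, ?);");

    public long run(long accountId, double amount) {
        // Read the stored context for this event.
        voltQueueSQL(getBalance, accountId);
        VoltTable[] result = voltExecuteSQL();

        boolean approved = result[0].advanceRow()
                && result[0].getDouble("balance") >= amount;

        // Act on the decision and record it, all within one ACID transaction.
        if (approved) {
            voltQueueSQL(debit, amount, accountId);
        }
        voltQueueSQL(logDecision, accountId, amount, approved ? 1 : 0);
        voltExecuteSQL(true);  // final batch

        return approved ? 1 : 0;
    }
}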
Q5: Two other critical factors often mentioned for Big Data projects are: 1) Data discovery: How to find high-quality data from the Web? and 2) Data Veracity: How can we cope with uncertainty, imprecision, missing values, mis-statements or untruths? Any thoughts on these?
Scott Jarr: We have a number of customers who are using VoltDB in ways that improve data quality within their organization. We have one customer who is examining incoming financial events and looking for gaps in sequence numbers to detect lost or mis-ordered information. Likewise, a popular use case is to filter out bad data as it comes in by checking it, in its high velocity state, against a known set of bad or good characteristics. This keeps much of the bad data from ever entering the data pipeline.
Q6: Scalability has three aspects: data volume, hardware size, and concurrency. Scale and performance requirements for Big Data strain conventional databases. Which database technology is best to scale to petabytes?
Scott Jarr: VoltDB is focused on a very different problem, which is how to process that data prior to it landing in the long-term petabyte system. We see customers deploying VoltDB in front of both MPP OLAP and Hadoop, in roughly the same numbers. It really all depends on what the customer is ultimately trying to do with the data once it settles into its resting state in the petabyte store.
Q7: A/B testing, sessionization, bot detection, and pathing analysis all require powerful analytics on many petabytes of semi-structured Web data. Do you have some customers examples in this area?
Scott Jarr: Absolutely. Taken broadly, this is one of the most common uses of VoltDB. Micro-segmentation and on-the-fly ad content optimization are examples that we see regularly. The ability to design an ad, in real-time, based on five sets of audience meta-data can have a radical impact on performance.
Q8: When would you recommend to store Big Data in a traditional Data Warehouse and when in Hadoop?
Scott Jarr: My experience here is limited. As I said, our customers are using VoltDB in front of both types of stores to do decisioning and real-time analytics before the data moves into the long term store. Often, when the data is highly structured, it goes into a data warehouse and when it is less structured, it goes into Hadoop.
Q9: Instead of stand-alone products for ETL, BI/reporting and analytics wouldn’t it be better to have a seamless integration? In what ways can we open up a data processing platform to enable applications to get closer?
Scott Jarr: This is very much in line with our vision of the world. As Mike (Stonebraker, VoltDB founder) has stated for years, in high performance data systems you need to have specialized databases. So we see the new world having far more data pipelines than stand-alone databases. A data pipeline will have seamless integrations between velocity stores, warehouses, BI tools and exploratory analytics. Standards go a long way toward making these integrations easier.
Q10: Anything you wish to add?
Scott Jarr: Thank you, Roberto. Very interesting discussion.
——————–
VoltDB Co-founder and Chief Strategy Officer Scott Jarr. Scott brings more than 20 years of experience building, launching and growing technology companies from inception to market leadership in highly competitive environments.
Prior to joining VoltDB, Scott was VP Product Management and Marketing at on-line backup SaaS leader LiveVault Corporation. While at LiveVault, Scott was key in growing the recurring revenue business to 2,000 customers strong, leading to an acquisition by Iron Mountain. Scott has also served as board member and advisor to other early-stage companies in the search, mobile, security, storage and virtualization markets. Scott has an undergraduate degree in mathematical programming from the University of Tampa and an MBA from the University of South Florida.
Related Posts
– On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. on December 5, 2012
– Two cons against NoSQL. Part II. on November 21, 2012
– Two Cons against NoSQL. Part I. on October 30, 2012
– Interview with Mike Stonebraker. on May 2, 2012
Resources
– ODBMS.org: NewSQL.
Blog Posts | Free Software | Articles and Presentations | Lecture Notes | Tutorials| Journals |
– Big Data: Challenges and Opportunities.
Roberto V. Zicari, October 5, 2012.
Abstract: In this presentation I review three current aspects related to Big Data:
1. The business perspective, 2. The Technology perspective, and 3. Big Data for social good.
Presentation (89 pages) | Intermediate| English | DOWNLOAD (PDF)| October 2012|
##
You can follow ODBMS.org on Twitter : @odbmsorg.
——————————-
“We will observe at LEAST 1,000,000,000 celestial objects. If we launched today we would cope with difficulty – but we are on track to be ready by September when we actually launch. This is a game changer for astronomy thus very challenging for us, but we have done many large scale tests to gain confidence in our ability to process the complex and voluminous data arriving on ground and turn it into catalogues. I still feel the galaxy has plenty of scope to throw us an unexpected curve ball though and really challenge us in the data processing.” — William O’Mullane.
The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”. I recall here the Objectives and the Mission of the Gaia Project (source ESA Web site):
OBJECTIVES:
“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”
THE MISSION:
“Gaia is an ambitious mission to chart a three-dimensional map of our Galaxy, the Milky Way, in the process revealing the composition, formation and evolution of the Galaxy. Gaia will provide unprecedented positional and radial velocity measurements with the accuracies needed to produce a stereoscopic and kinematic census of about one billion stars in our Galaxy and throughout the Local Group. This amounts to about 1 per cent of the Galactic stellar population. Combined with astrophysical information for each star, provided by on-board multi-colour photometry, these data will have the precision necessary to quantify the early formation, and subsequent dynamical, chemical and star formation evolution of the Milky Way Galaxy.
Additional scientific products include detection and orbital classification of tens of thousands of extra-solar planetary systems, a comprehensive survey of objects ranging from huge numbers of minor bodies in our Solar System, through galaxies in the nearby Universe, to some 500 000 distant quasars. It will also provide a number of stringent new tests of general relativity and cosmology.”
Last year in February, I have interviewed William O’Mullane, Science Operations Development Manager, at the European Space Agency, and Vik Nagjee, Product Manager, Core Technologies, at InterSystems Corporation, both deeply involved with the initial Proof-of-Concept of the data management part of the project.
A year later, I have asked William O’Mullane (European Space Agency) and Jose Ruperez (InterSystems Spain) some follow-up questions.
RVZ
Q1. The original goal of the Gaia mission was to “observe around 1,000,000,000 celestial objects”. Is this still true? Are you ready for that?
William O’Mullane: YES ! We will have a Ground Segment Readiness Review next Spring and a Flight Acceptance Review before summer. We will observe at LEAST 1,000,000,000 celestial objects. If we launched today we would cope with difficulty – but we are on track to be ready by September when we actually launch. This is a game changer for astronomy thus very challenging for us, but we have done many large scale tests to gain confidence in our ability to process the complex and voluminous data arriving on ground and turn it into catalogues. I still feel the galaxy has plenty of scope to throw us an unexpected curve ball though and really challenge us in the data processing.
Q2. The plan was to launch the Gaia satellite in early 2013. Is this plan confirmed?
William O’Mullane: Currently September 2013, rather than Q1, is the official launch date.
Q3. Did the data requirements for the project change in the last year? If yes, how?
William O’Mullane: The downlink rate has not changed, so we know how much comes into the system: still only about 100TB over 5 years. Data processing volumes depend on how many intermediate steps we keep in different locations. Not much change there since last year.
Q4. The sheer volume of data that is expected to be captured by the Gaia satellite poses a technical challenge. What work has been done in the last year to prepare for such a challenge? What did you learn from the Proof-of-Concept of the data management part of this project?
William O’Mullane: I suppose we learned the same lessons as other projects. We have multiple processing centres with different needs met by different systems. We did not try for a single unified approach across these centers.
The CNES have gone completely to Hadoop for their processing. At ESAC we are going to InterSystems Caché. Last year only AGIS was on Caché – now the main daily processing chain is in Caché also [Edit: see also Q.9 ]. There was a significant boost in performance here, but it must be said some of this was probably internal to the system; in moving it we looked at some bottlenecks more closely.
Jose Ruperez: We are very pleased to know that last year was only AGIS and now they have several other databases in Caché.
William O’Mullane: The second operations rehearsal is just drawing to a close. This was run completely on Caché (the first rehearsal used Oracle). There were of course some minor problems (also with our software) but in general, from a Caché perspective, it went well.
Q5. Could you please give us some numbers related to performance? Could you also tell us what bottlenecks you looked at, and how you avoided them?
William O’Mullane: It would take me time to dig out numbers, but we got a factor of 10 in some places with a combination of better queries and the removal of some code bottlenecks. We seem to regularly see a factor of 10 on “non-optimized” systems.
Q6. Is it technically possible to interchange data between Hadoop and Caché ? Does it make sense for the project?
Jose Ruperez: The raw data downloaded from the satellite link every day can be loaded in any database in general. ESAC has chosen InterSystems Caché for performance reasons, but not only. William also explains how cost-effectiveness as well as the support from InterSystems were key. Other centers can try and use other products.
William O’Mullane:
This is a valid point – support is a major reason for our using Caché. InterSystems work with us very well and respond to needs quickly. InterSystems certainly have a very developer oriented culture which matches our team well.
Hadoop is one thing, HDFS is another, but of course they go together. In many ways our DataTrain Whiteboard does “map reduce” with some improvements for our specific problem. There are Hadoop database interfaces, so it could work with Caché.
Q7. Could you tell us a bit more about what you have learned so far with this project? In particular, what is the implication for Caché, now that the main daily processing chain is also stored in Caché?
Jose Ruperez: A relevant number regarding InterSystems Caché performance: it is able to insert over 100,000 records per second, sustained over several days. This also means that InterSystems Caché, in turn, has to write hundreds of megabytes per second to disk. To me, this is still mind-boggling.
William O’Mullane:
And done with fairly standard NetApp Storage. Caché and NetApp engineers sat together here at ESAC to align the configuration of both systems to get the max IO for Java through Caché to NetApp. There were several low level page size settings etc. which were modified for this.
Q8. What else still need to be done?
William O’Mullane: Well we have most of the parts but it is not a well oiled machine yet. We need more robustness and a little more automation across the board.
Q9. Your high-level architecture a year ago consisted of two databases, a so-called Main Database and an AGIS Database.
The Main Database was supposed to hold all data from Gaia and the products of processing. (This was expected to grow from a few TBs to a few hundred TBs during the mission.) AGIS only required a subset of this data for analytics purposes. Could you tell us how the architecture has evolved in the last year?
William O’Mullane: This remains precisely the same.
Q10. For the AGIS database, were you able to generate realistic data and load on the system?
William O’Mullane: We have run large scale AGIS tests with 50,000,000 sources or about 4,500,000,000 simulated observations. This worked rather nicely and well within requirements. We confirmed going from 2 to 10 to 50 million sources that the problem scales as expected. The final (end of mission 2018) requirement is for 100,000,000 sources, so for now we are quite confident with the load characteristics. The simulation had a realistic source distribution in magnitude and coordinates (i.e. real sky-like inhomogeneities are seen).
Q11. What results did you obtain in tuning and configuring the AGIS system in order to meet the strict insert requirements, while still optimizing sufficiently for down-stream querying of the data?
William O’Mullane: We still have bottlenecks in the update servers but the 50 million test still ran inside one month on a small in house cluster. So the 100 million in 3 months (system requirement) will be easily met especially with new hardware.
Q12. What are the next steps planned for the Gaia mission and what are the main technical challenges ahead?
William O’Mullane: AGIS is the critical piece of Gaia software for astrometry, but before that the daily data treatment must be run. This so-called Initial Data Treatment (IDT) is our main focus right now. It must be robust and smoothly operating for the mission, and able to cope with non-nominal situations occurring while commissioning the instrument. So some months of consolidation, bug fixing and operational rehearsals lie ahead for us. I expect the future challenge to be not technical, but rather what happens when we see the real data and it is not exactly as we expect or hope it will be.
I may be pleasantly surprised of course. Ask me next year …
———–
William O’Mullane, Science Operations Development Manager, European Space Agency.
William O’Mullane has a PhD in Physics and a background in Computer Science and has worked on space science projects since 1996 when he assisted with the production of the Hipparcos CDROMS. During this period he was also involved with the Planck and Integral science ground segments as well as contemplating the Gaia data processing problem. From 2000-2005 Wil worked on developing the US National Virtual Observatory (NVO) and on the Sloan Digital Sky Survey (SDSS) in Baltimore, USA. In August 2005 he rejoined the European Space Agency as Gaia Science Operations Development Manager to lead the ESAC development effort for the Gaia Data Processing and Analysis Consortium.
José Rupérez, Senior Engineer at InterSystems.
He has been providing technical advice to customers and partners in Spain and Portugal for the last 10 years. In particular, he has been working with the European Space Agency since December 2008. Before InterSystems, José developed his career at eSkye Solutions and A.T. Kearney in the United States, always in software. He started his career working for Alcatel in 1994 as a Software Engineer. José holds a Bachelor of Science in Physics from Universidad Complutense (Madrid, Spain) and a Master of Science in Computer Science from Ball State University (Indiana, USA). He has also attended courses at the MIT Sloan School of Business.
—————-
Related Posts
– Objects in Space -The biggest data processing challenge to date in astronomy: The Gaia mission. February 14, 2011
– Objects in Space: “Herschel” the largest telescope ever flown. March 18, 2011
– Objects in Space vs. Friends in Facebook. April 13, 2011
Resources
– Gaia Web page at ESA Spacecraft Operations.
– ESA’s web site for the Gaia scientific community.
– “Implementing the Gaia Astrometric Solution”, William O’Mullane, PhD Thesis, 2012
##
You can follow ODBMS.org on Twitter : @odbmsorg.
——————————-
“Given the recent explosion of NoSQL data stores, we saw the need for a common data access abstraction to simplify development with NoSQL stores. Hence the Spring Data team was created.” –David Turanski.
I wanted to know more about the Spring Data project. I have interviewed David Turanski, Senior Software Engineer with SpringSource, a division of VMWare.
RVZ
Q1. What is the Spring Framework?
David Turanski: Spring is a widely adopted open source application development framework for enterprise Java, used by millions of developers. Version 1.0 was released in 2004 as a lightweight alternative to Enterprise Java Beans (EJB). Since then, Spring has expanded into many other areas of enterprise development, such as enterprise integration (Spring Integration), batch processing (Spring Batch), web development (Spring MVC, Spring Webflow), and security (Spring Security). Spring continues to push the envelope for mobile applications (Spring Mobile), social media (Spring Social), rich web applications (Spring MVC, s2js Javascript libraries), and NoSQL data access (Spring Data).
Q2. In how many open source Spring projects is VMware actively contributing?
David Turanski: It’s difficult to give an exact number. Spring is very modular by design, so if you look at the SpringSource page on github, there are literally dozens of projects. I would estimate there are about 20 Spring projects actively supported by VMware.
Q3. What is the Spring Data project?
David Turanski: The Spring Data project started in 2010, when Rod Johnson (Spring Framework’s inventor), and Emil Eifrem (founder of Neo Technologies) were trying to integrate Spring with the Neo4j graph database. Spring has always provided excellent support for working with RDBMS and ORM frameworks such as Hibernate. However, given the recent explosion of NoSQL data stores, we saw the need for a common data access abstraction to simplify development with NoSQL stores. Hence the Spring Data team was created with the mission to:
“…provide a familiar and consistent Spring-based programming model for NoSQL and relational stores while retaining store-specific features and capabilities.”
The last bit is significant. It means we don’t take a least common denominator approach. We want to expose a full set of capabilities whether it’s JPA/Hibernate, MongoDB, Neo4j, Redis, Hadoop, GemFire, etc.
Q4. Could you give us an example of how you build Spring-powered applications that use NoSQL data stores (e.g. Redis, MongoDB, Neo4j, HBase)?
David Turanski: Spring Data provides an abstraction for the Repository pattern for data access. A Repository is akin to a Data Access Object and provides an interface for managing persistent objects. This includes the standard CRUD operations, but also includes domain specific query operations. For example, if you have a Person object:
public class Person {
    int id;
    int age;
    String firstName;
    String lastName;
}
You may want to perform queries such as findByFirstNameAndLastName, findByLastNameStartsWith, findByFirstNameContains, findByAgeLessThan, etc. Traditionally, you would have to write code to implement each of these methods. With Spring Data, you simply declare a Java interface to define the operations you need. Using method naming conventions, as illustrated above, Spring Data generates a dynamic proxy to implement the interface on top of whatever data store is configured for the application. The Repository interface in this case looks like:
public interface PersonRepository extends CrudRepository<Person, Integer> {
    Person findByFirstNameAndLastName(String firstName, String lastName);
    Person findByLastNameStartsWith(String lastName);
    Person findByAgeLessThan(int age);
    ...
}
In addition, Spring Data Repositories provide declarative support for pagination and sorting.
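As a small illustration of that pagination and sorting support, a repository for the same Person type might be declared as sketched below; the repository name and page size are made up, and PageRequest.of and Sort.by come from the current Spring Data API:

import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Pageable;
import org.springframework.data.domain.Sort;
import org.springframework.data.repository.PagingAndSortingRepository;

// Hypothetical repository extending the paging/sorting base interface.
public interface PagedPersonRepository extends PagingAndSortingRepository<Person, Integer> {

    // The Pageable parameter adds paging and sorting to a derived query.
    Page<Person> findByLastName(String lastName, Pageable pageable);
}

// Usage, e.g. inside a service method:
//   Pageable firstTwenty = PageRequest.of(0, 20, Sort.by("age"));
//   Page<Person> page = repository.findByLastName("Smith", firstTwenty);
//   long total = page.getTotalElements();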
Then, using Spring’s dependency injection capabilities, you simply wire the repository into your application. For example:
public class PersonApp {

    @Autowired
    PersonRepository personRepository;

    public Person findPerson(String lastName, String firstName) {
        return personRepository.findByFirstNameAndLastName(firstName, lastName);
    }
}
Essentially, you don’t have to write any data access code! However, you must provide Java annotations on your domain class to configure entity mapping to the data store. For example, if using MongoDB you would associate the domain class to a document:
@Document
public class Person {
    int id;
    int age;
    String firstName;
    String lastName;
}
Note that the entity mapping annotations are store-specific. Also, you need to provide some Spring configuration to tell your application how to connect to the data store, in which package(s) to search for Repository interfaces and the like.
The Spring Data team has written an excellent book, Spring Data: Modern Data Access for Enterprise Java, recently published by O’Reilly, which includes lots of code examples. Also, the project web site includes many resources to help you get started using Spring Data.
Q5. And for map-reduce frameworks?
David Turanski: Spring Data provides excellent support for developing applications with Apache Hadoop along with Pig and/or Hive. However, Hadoop applications typically involve a complex data pipeline which may include loading data from multiple sources, pre-processing and real-time analysis while loading data into HDFS, data cleansing, implementing a workflow to coordinate several data analysis steps, and finally publishing data from HDFS to one or more relational or NoSQL application data stores.
The complete pipeline can be implemented using Spring for Apache Hadoop along with Spring Integration and Spring Batch. However, Hadoop has its own set of challenges which the Spring for Apache Hadoop project is designed to address. Like all Spring projects, it leverages the Spring Framework to provide a consistent structure and simplify writing Hadoop applications. For example, Hadoop applications rely heavily on command shell tools, so applications end up being a hodge-podge of Perl, Python, Ruby, and bash scripts. Spring for Apache Hadoop provides a dedicated XML namespace for configuring Hadoop jobs, with embedded scripting features and support for Hive and Pig. In addition, Spring for Apache Hadoop allows you to take advantage of core Spring Framework features such as task scheduling, Quartz integration, and property placeholders to reduce lines of code, improve testability and maintainability, and simplify the development process.
Q6. What about cloud-based data services? And support for relational database technologies or object-relational mappers?
David Turanski: While there are currently no plans to support cloud-based services such as Amazon S3, Spring Data provides a flexible architecture upon which these may be implemented. Relational technologies and ORM are supported via Spring Data JPA. Spring has always provided first-class support for relational databases via the JdbcTemplate, using a vendor-provided JDBC driver. For ORM, Spring supports Hibernate, any JPA provider, and iBATIS. Additionally, Spring provides excellent support for declarative transactions.
With Spring Data, things get even easier. In a traditional Spring application backed by JDBC, you are required to hand code the Repositories or Data Access Objects. With Spring Data JPA, the data access layer is generated by the framework while persistent objects use standard JPA annotations.
Q7. How can you use Spring to perform:
– Data ingestion from various data sources into Hadoop,
– Orchestrating Hadoop based analysis workflow,
– Exporting data out of Hadoop into relational and non-relational databases
David Turanski: As previously mentioned, a complete big data processing pipeline involving all of these steps will require Spring for Apache Hadoop in conjunction with Spring Integration and Spring Batch.
Spring Integration greatly simplifies enterprise integration tasks by providing a lightweight messaging framework, based on the well-known Enterprise Integration Patterns by Hohpe and Woolf. Sometimes referred to as the “anti ESB”, Spring Integration requires no runtime component other than a Spring container and is embedded in your application process to handle data ingestion from various distributed sources, mediation, transformation, and data distribution.
Spring Batch provides a robust framework for any type of batch processing and is used to configure and execute scheduled jobs composed of coarse-grained processing steps. Individual steps may be implemented as Spring Integration message flows or Hadoop jobs.
Q8. What is the Spring Data GemFire project?
David Turanski: Spring Data GemFire began life as a separate project from Spring Data following VMWare’s acquisition of GemStone and its commercial GemFire distributed data grid.
Initially, its aim was to simplify the development of GemFire applications and the configuration of GemFire caches, data regions, and related components. While this was, and still is, developed independently as an open source Spring project, the GemFire product team recognized the value to its customers of developing with Spring and has increased its commitment to Spring Data GemFire. As of the recent GemFire 7.0 release, Spring Data GemFire is being promoted as the recommended way to develop GemFire applications in Java. At the same time, the project was moved under the Spring Data umbrella. We implemented a GemFire Repository and will continue to provide first class support for GemFire.
Q9. Could you give a technical example on how do you simplify the development of building highly scalable applications?
David Turanski: GemFire is a fairly mature distributed, memory-oriented data grid used to build highly scalable applications. As a consequence, there is inherent complexity involved in the configuration of cache members and data stores known as regions (a region is roughly analogous to a table in a relational database). GemFire supports peer-to-peer and client-server topologies, and regions may be local, replicated, or partitioned. In addition, GemFire provides a number of advanced features for event processing, remote function execution, and so on.
Prior to Spring Data GemFire, GemFire configuration was done predominantly via its native XML support. This works well but is relatively limited in terms of flexibility. Today, configuration of core components can be done entirely in Spring, making simple things simple and complex things possible.
In a client-server scenario, an application developer may only be concerned with data access. In GemFire, a client application accesses data via a client cache and a client region which act as proxies to provide access to the grid. Such components are easily configured with Spring, and the application code is the same whether data is distributed across one hundred servers or cached locally. This fortunately allows developers to take advantage of Spring’s environment profiles to easily switch to a local cache and region suitable for unit integration tests, which are self-contained and may run anywhere, including automated build environments. The cache resources are configured in Spring XML:
<beans>
    <beans profile="test">
        <gfe:cache/>
        <gfe:local-region name="Person"/>
    </beans>
    <beans profile="default">
        <context:property-placeholder location="cache.properties"/>
        <gfe:client-cache/>
        <gfe:client-region name="Person"/>
        <gfe:pool>
            <gfe:locator host="${locator.host}" port="${locator.port}"/>
        </gfe:pool>
    </beans>
</beans>
Here we see the deployed application (default profile) depends on a remote GemFire locator process. The client region does not store data locally by default but is connected to an available cache server via the locator. The region is distributed among the cache server and its peers and may be partitioned or replicated. The test profile sets up a self contained region in local memory, suitable for unit integration testing.
Additionally, applications may by further simplified by using a GemFire backed Spring Data Repository. The key difference from the example above is that the entity mapping annotations are replaced with GemFire specific annotations:
@Region
public class Person {
    int id;
    int age;
    String firstName;
    String lastName;
}
The @Region annotation maps the Person type to an existing region of the same name. The Region annotation provides an attribute to specify the name of the region if necessary.
Q10. The project uses GemFire as a distributed data management platform. Why using an In-Memory Data Management platform, and not a NoSQL or NewSQL data store?
David Turanski: Customers choose GemFire primarily for performance. As an in-memory grid, data access can be an order of magnitude faster than disk-based stores. Many disk-based systems also cache data in memory to gain performance, but your mileage may vary depending on the specific operation and on when disk I/O is needed. In contrast, GemFire’s performance is very consistent. This is a major advantage for a certain class of high volume, low latency, distributed systems. Additionally, GemFire is extremely reliable, providing disk-based backup and recovery.
GemFire also builds in advanced features not commonly found in the NoSQL space. This includes a number of advanced tuning parameters to balance performance and reliability, synchronous or asynchronous replication, advanced object serialization features, flexible data partitioning with configurable data colocation, WAN gateway support, continuous queries, .Net interoperability, and remote function execution.
Q11. Is GemFire a full fledged distributed database management system? or else?
David Turanski: Given all its capabilities and proven track record supporting many mission critical systems, I would certainly characterize GemFire as such.
———————————-
David Turanski is a Senior Software Engineer with SpringSource, a division of VMWare. David is a member of the Spring Data team and lead of the Spring Data GemFire project. He is also a committer on the Spring Integration project. David has extensive experience as a developer, architect and consultant serving a variety of industries. In addition he has trained hundreds of developers how to use the Spring Framework effectively.
Related Posts
– On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. December 5, 2012
– Two cons against NoSQL. Part II. November 21, 2012
– Two Cons against NoSQL. Part I. October 30, 2012
– Interview with Mike Stonebraker. May 2, 2012
Resources
– ODBMS.org Lecture Notes: Data Management in the Cloud.
Michael Grossniklaus, David Maier, Portland State University.
Course Description: “Cloud computing has recently seen a lot of attention from research and industry for applications that can be parallelized on shared-nothing architectures and have a need for elastic scalability. As a consequence, new data management requirements have emerged with multiple solutions to address them. This course will look at the principles behind data management in the cloud as well as discuss actual cloud data management systems that are currently in use or being developed. The topics covered in the course range from novel data processing paradigms (MapReduce, Scope, DryadLINQ), to commercial cloud data management platforms (Google BigTable, Microsoft Azure, Amazon S3 and Dynamo, Yahoo PNUTS) and open-source NoSQL databases (Cassandra, MongoDB, Neo4J). The world of cloud data management is currently very diverse and heterogeneous. Therefore, our course will also report on efforts to classify, compare and benchmark the various approaches and systems. Students in this course will gain broad knowledge about the current state of the art in cloud data management and, through a course project, practical experience with a specific system.”
Lecture Notes | Intermediate/Advanced | English | DOWNLOAD ~280 slides (PDF)| 2011-12|
##
“Those who claim they can give you both ACID transactions and linear scalability at the same time are not telling the truth, because it is theoretically proven impossible” –Asa Holmstrom.
I heard about a start-up called Starcounter. I wanted to know more. I have interviewed the CEO of the company, Asa Holmstrom.
RVZ
Q1. You just launched Starcounter 2.0 public beta. What is it? and who can already use it?
Asa Holmstrom: Starcounter is a high performance in-memory OLTP database. We have partners who have built applications on top of Starcounter, e.g. an AdServe application and a retail application. Today Starcounter has 60+ customers using Starcounter in production.
Q2. You define Starcounter as “memory centric”, using a technique you call “VMDBMS”. What is special about VMDBMS?
Asa Holmstrom: VMDBMS integrates the application runtime virtual machine (VM) with the database management system (DBMS). Data resides in one single place all the time, in RAM; no data is transferred back and forth between the database memory and the temporary storage (object heap) of the application. The VMDBMS makes Starcounter significantly faster than other in-memory databases.
Q3. When you say “the VMDBMS makes Starcounter significantly faster than other in-memory databases”, could you please give some specific benchmarking numbers? Which other in-memory databases did you compare with your benchmark?
Asa Holmstrom: In general we are 100 times faster than any other RDBMS: 10 times comes from being an IMDBMS, and another 10 times comes from the VMDBMS.
Q4. How do you handle situations when data in RAM is no more available due to hardware failures?
Asa Holmstrom: In Starcounter the data is just as secure as in any disk-centric database. Image files and the transaction log are stored on disk, and before a transaction is regarded as committed it has been written to the transaction log.
When restarting Starcounter after a crash, a recovery of the database is done automatically. To guarantee high availability we recommend our customers have a hot stand-by machine which subscribes to the transaction log.
Q5. Goetz Graefe, HP fellow, commented in an interview (ref.1) that “disks will continue to have a role and economic value where the database also contains history, including cold history such as transactions that affected the account balances, login & logout events, click streams eventually leading to shopping carts, etc.” What is your take on this?
Asa Holmstrom: RAM databases have hardware limitations, in practice about 2 TB, so there will still be a need for database storage on disk.
Q6. You claim to achieve high performance and consistent data. Do you have any benchmark results to sustain such a claim?
Asa Holmstrom: Yes, we have run internal benchmark tests to compare performance while keeping data consistent.
Q7: Do you have some results of your benchmark tests publicly available? If yes, could you please summarize the main results here?
Asa Holmstrom: As TPC tests are not applicable to us, we have done some internal tests. We can’t share them with you.
Q8. What kind of consistency do you implement?
Asa Holmstrom: We support true ACID consistency, implemented using snapshot isolation and fake writes, in a similar way to Oracle.
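For readers unfamiliar with snapshot isolation, the toy multi-version store below sketches the general mechanism: each transaction reads the versions committed before it started, and a write-write conflict with a later committer causes an abort. It illustrates the isolation level named above, not Starcounter's internals; all names are illustrative.

```python
# Hypothetical sketch of snapshot isolation over a multi-version store.

class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value)
        self.ts = 0          # logical commit timestamp

    def begin(self):
        return {"start_ts": self.ts, "writes": {}}

    def read(self, txn, key):
        # Read the newest version committed at or before the snapshot.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= txn["start_ts"]:
                return value
        return None

    def write(self, txn, key, value):
        txn["writes"][key] = value

    def commit(self, txn):
        # Abort on write-write conflict: someone committed the key after we began.
        for key in txn["writes"]:
            for commit_ts, _ in self.versions.get(key, []):
                if commit_ts > txn["start_ts"]:
                    raise RuntimeError("write-write conflict, transaction aborted")
        self.ts += 1
        for key, value in txn["writes"].items():
            self.versions.setdefault(key, []).append((self.ts, value))

if __name__ == "__main__":
    store = MVCCStore()
    t1 = store.begin()
    store.write(t1, "x", 1)
    store.commit(t1)

    t2 = store.begin()             # t2's snapshot sees x == 1
    t3 = store.begin()
    store.write(t3, "x", 2)
    store.commit(t3)
    print(store.read(t2, "x"))     # still 1: t2 reads from its snapshot
```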
Q9. How do you achieve scalability?
Asa Holmstrom: ACID transactions are not scalable. All parallel ACID transactions need to be synchronized, and the closer together the transactions are executed in space, the faster the synchronization becomes. Therefore you get the best performance by executing all ACID transactions on one machine; we call this scaling in. When it comes to storage, you scale up a Starcounter database by adding more RAM.
Q10: For which class of applications is it realistic to expect to execute all ACID transactions on one machine?
Asa Holmstrom: For all applications where you want high transactional throughput. When you have parallel ACID transactions you need to synchronize them, and this synchronization becomes harder when you scale out to several machines. The benefit of scaling out grows linearly with the number of machines, but the cost of synchronization grows quadratically. Consequently you do not gain anything by scaling out. In fact, you get better total transactional throughput by running all transactions in RAM on one machine, which we call “scaling in”. No other database can give you the same total ACID transactional throughput as Starcounter. Those who claim they can give you both ACID transactions and linear scalability at the same time are not telling the truth, because it is theoretically proven impossible. A database can give you ACID transactions or linear scalability, but not both at the same time.
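To see the shape of this argument, here is a toy model with purely illustrative constants: if the benefit of adding machines grows linearly while cross-machine synchronization cost grows roughly with the number of machine pairs, net throughput eventually falls as the cluster grows. The constants are assumptions for illustration, not measurements of any real system.

```python
# Toy model of the "scale in" argument: linear benefit vs. roughly quadratic
# synchronization cost across machine pairs. Constants are illustrative only.

PER_MACHINE_TPS = 100_000    # hypothetical single-machine throughput
SYNC_COST_TPS = 30_000       # hypothetical cost per pair-wise coordination

def net_throughput(machines):
    benefit = PER_MACHINE_TPS * machines
    sync_cost = SYNC_COST_TPS * machines * (machines - 1) / 2   # ~ n^2 pairs
    return max(benefit - sync_cost, 0)

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, int(net_throughput(n)))   # throughput rises, then collapses
```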
Q11. How do you define queries and updates?
Asa Holmstrom: We distinguish between read-only transactions and read-write transactions. You can only write (insert/update) database data using a read-write transaction.
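A minimal, hypothetical API sketch of that distinction (names are illustrative, not Starcounter's interface): reads work in any transaction, while inserts and updates are rejected unless the transaction was opened as read-write.

```python
# Hypothetical sketch of the read-only / read-write transaction distinction.

class ReadOnlyError(Exception):
    pass

class Transaction:
    def __init__(self, store, read_write=False):
        self.store, self.read_write = store, read_write

    def read(self, key):
        return self.store.get(key)

    def write(self, key, value):
        if not self.read_write:
            raise ReadOnlyError("writes require a read-write transaction")
        self.store[key] = value

if __name__ == "__main__":
    store = {"x": 1}
    ro = Transaction(store)                  # read-only transaction
    rw = Transaction(store, read_write=True)
    print(ro.read("x"))                      # allowed
    rw.write("x", 2)                         # allowed
    try:
        ro.write("x", 3)                     # rejected
    except ReadOnlyError as err:
        print(err)
```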
Q12. Are you able to handle Big Data analytics with your system?
Asa Holmstrom: Starcounter is optimized for transactional processing, not for analytical processing.
Q13. How does Starcounter differ from other in-memory databases, such as SAP HANA and McObject?
Asa Holmstrom: In general the primary differentiator between Starcounter and any other in-memory database is the VMDBMS. SAP HANA has primarily an OLAP focus.
Q14. From a user perspective, what is the value proposition of having a VMDBMS as a database engine?
Asa Holmstrom: Unmatched ACID transactional performance.
Q15. How do you differentiate with respect to VoltDB?
Asa Holmstrom: Better ACID transactional performance. VoltDB gives you either ACID transactions (on one machine) or the possibility to scale out without any guarantees of global database consistency (no ACID). Starcounter has a native .Net object interface which makes it easy to use from any .Net language.
Q16. Is Starcounter 2.0 open source? If not, do you have any plan to make it open source?
Asa Holmstrom: We do not have any current plans of making Starcounter open source.
——————
CEO Asa Holmstrom brings to her role at Starcounter more than 20 years of executive leadership in the IT industry. Previously, she served as the President of Columbitech, where she successfully established its operations in the U.S. Prior to Columbitech, Asa was CEO of Kvadrat, a technology consultancy firm. Asa also spent time as a management consultant, focusing on sales, business development and leadership within global technology companies such as Ericsson and Siemens. She holds a bachelor’s degree in mathematics and computer science from Stockholm University.
Related Posts
– Interview with Mike Stonebraker, by Roberto V. Zicari, May 2, 2012
Resources
– Cloud Data Stores – Lecture Notes: Data Management in the Cloud.
by Michael Grossniklaus, David Maier, Portland State University.
Course Description: “Cloud computing has recently seen a lot of attention from research and industry for applications that can be parallelized on shared-nothing architectures and have a need for elastic scalability. As a consequence, new data management requirements have emerged with multiple solutions to address them. This course will look at the principles behind data management in the cloud as well as discuss actual cloud data management systems that are currently in use or being developed.
The topics covered in the course range from novel data processing paradigms (MapReduce, Scope, DryadLINQ), to commercial cloud data management platforms (Google BigTable, Microsoft Azure, Amazon S3 and Dynamo, Yahoo PNUTS) and open-source NoSQL databases (Cassandra, MongoDB, Neo4J).
The world of cloud data management is currently very diverse and heterogeneous. Therefore, our course will also report on efforts to classify, compare and benchmark the various approaches and systems.
Students in this course will gain broad knowledge about the current state of the art in cloud data management and, through a course project, practical experience with a specific system.”
Lecture Notes | Intermediate/Advanced | English | DOWNLOAD ~280 slides (PDF)| 2011-12|
##