On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com
“One of the core concepts of Big Data is being able to evolve analytics over time. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources.” — Werner Vogels, Amazon.com
I wanted to know more about what is going on at Amazon.com in the area of Big Data and Analytics. To find out, I interviewed Dr. Werner Vogels, Chief Technology Officer and Vice President of Amazon.com.
Q1. In your keynote at the Strata Making Data Work Conference held this February in Santa Clara, California, you said that “Data and Storage should be unconstrained”. What did you mean by that?
Vogels: In the old world of data analysis you knew exactly which questions you wanted to ask, which drove a very predictable collection and storage model. In the new world of data analysis your questions are going to evolve and change over time, and as such you need to be able to collect, store and analyze data without being constrained by resources.
Q2. You also claimed that “Big Data requires NO LIMIT“. However, Prof. Alex Szalay talking about astronomy warns us that “Data is everywhere, never be at a single location. Not scalable, not maintainable.” Will Big Data Analysis become a new Astronomy?
Vogels: Big Data is the hot topic for this year. With the rise of the internet, and the increasing number of consumers, researchers and businesses of all sizes getting online, the amount of data now available to collect, store, manage, analyze and share is growing. When companies come across such large amounts of data it can lead to data paralysis where they don’t have the resources to make effective use of the information.
To Alex’s point, it’s challenging to get the relevant data at the place where you want it to do your analysis. This is why we see many organizations putting their data in the cloud where it’s easily accessible for everyone.
Q3. You also quoted Jim Gray and mentioned the “Fourth Paradigm: Data Intensive Scientific Discovery”. What is it? Is Business Intelligence becoming more like Science for profit?
Vogels: The book I referenced was The Fourth Paradigm: Data-Intensive Scientific Discovery. This is a collection of essays that discusses the vision for data-intensive scientific discovery, which is the concept of shifting computational science to a data-intensive model where we analyze observations.
For Business Intelligence this means that analysis goes beyond finance and accountancy and helps companies continuously improve the service to their customers.
Q4. Michael Olson of Cloudera, in a recent interview, said of Analytical Data Platforms that “Cloud is a deployment detail, not fundamental. Where you run your software and what software you run are two different decisions, and you need to make the right choice in both cases.”
What is your opinion on this? What, in your opinion, is the relationship between Big Data Analysis and Cloud Computing?
Vogels: Big Data holds the promise of helping companies create a competitive advantage, because through data analysis they learn how to better serve their customers. This is an approach that we have applied at Amazon.com for 15 years, and we have a solid understanding of all the challenges around managing and processing Big Data.
One of the core concepts of Big Data is being able to evolve analytics over time. For that, a company cannot be constrained by any resource. As such, Cloud Computing and Big Data are closely linked because for a company to be able to collect, store, organize, analyze and share data, they need access to infinite resources.
AWS customers are doing some really innovative things around dealing with Big Data. Take, for example, the digital advertising and marketing firm Razorfish. Razorfish targets online adverts based on data from browsing sessions. A common issue Razorfish found was the need to process gigantic data sets. These large data sets are often the result of holiday shopping traffic on a retail website, or sudden dramatic growth on a media or social networking site.
Normally crunching these numbers would take them two days or more. By leveraging on-demand services such as Amazon Elastic MapReduce, Razorfish is able to drop their processing time to eight hours. There was no upfront investment in computing hardware, no procurement delay, and no additional operations staff hired. All this means Razorfish can offer multi-million dollar client service programs on a small business budget, helping them to increase their return on ad spend by 500%.
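Elastic MapReduce runs Hadoop jobs, which follow the classic map-then-reduce pattern. The sketch below is a hypothetical Hadoop-streaming-style word count in Python, simulated locally, to illustrate the programming model; it is not Razorfish's actual pipeline.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word.
    Hadoop sorts map output by key during the shuffle, so we
    sort here to simulate that step."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data big cloud", "cloud data"]
    print(dict(reducer(mapper(sample))))  # {'big': 2, 'cloud': 2, 'data': 2}
```

On a real cluster the mapper and reducer run in parallel across many nodes, which is what turns a two-day batch into an eight-hour one.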
Q5. How has Amazon’s technology evolved over the past three years?
Vogels: Every day, Amazon Web Services adds enough new server capacity to support all of Amazon.com's global infrastructure as it stood in the company's fifth full year of operation, when Amazon was a $2.76B annual-revenue enterprise. Today we have hundreds of thousands of customers in over 190 countries—both startups and large companies. To give you an idea of the scale we’re talking about, Amazon S3 holds over 260 billion objects and regularly peaks at 200k requests per second.
Our pace of innovation has been rapid because of our relentless customer focus. Our process is to release a service into beta that is useful to a lot of people, get customer feedback and rapidly begin adding the bells and whistles based in large part on what customers want and need from the services. There’s really no substitute for the accelerated learning we’ve had from working with hundreds of thousands of customers with every imaginable use case.
We are also relentless about driving efficiencies and passing along the cost savings to our customers.
We’ve lowered our prices 12 times in the past 5 years with no competitive pressure to do so. We’re very comfortable with running high volume, low margin businesses which is very different than traditional IT vendors.
Q6. Amazon’s Dynamo is proprietary; however, the publication of your seminal 2007 Dynamo paper was used as input for Open Source projects (e.g. Cassandra, which began as a fusion of Google’s Bigtable and Amazon’s Dynamo concepts). Why did Amazon allow this?
Vogels: Dynamo is internal technology developed at Amazon to address the need for an incrementally scalable, highly available key-value storage system. The technology is designed to give its users the ability to trade off cost, consistency, durability and performance, while maintaining high availability.
We found, though, that there had been some struggles with applying the concepts so we published the paper as feedback to the academic community about what one needed to do to build realistic production systems.
Q7. What is the positioning of Amazon with respect to Open Source projects? Why didn’t you develop Open Source data platforms from the start, as for example Facebook and LinkedIn did?
Vogels: I believe that you need to pour your resources into areas where you can make important contributions, where you can provide the customer with the best possible experience. Our mission is to provide the one-man app developer or the 20,000 person enterprise with a platform of web services they can use to build sophisticated, scalable applications. I believe anything we can do to make AWS lower-cost and widely available will help the community tremendously.
Q8. Amazon Elastic MapReduce utilizes a hosted Hadoop framework running on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Why choose Hadoop? Why not use existing BI products?
Vogels: We chose Hadoop for several reasons. First, it is the only available framework that could scale to process hundreds or even thousands of terabytes of data, on installations of up to 4,000 nodes. Second, Hadoop is open source, so we can innovate on top of the framework and inside it to help our customers develop more performant applications more quickly. Third, we recognized that Hadoop was gaining substantial popularity in the industry, with multiple customers using Hadoop and many vendors innovating on top of it.
Three years later we believe we made the right choice. We also see that existing BI vendors such as MicroStrategy are willing to work with us and integrate their solutions on top of Elastic MapReduce.
Q9. Looking at three elements: Data, Platform, Analysis, what are the main research challenges ahead? And what are the main business challenges ahead?
Vogels: I think that sharing is another important aspect to the mix. Collaborating during the whole process of collecting data, storing it, organizing it and analyzing it is essential. Whether it’s scientists in a research field or doctors at different hospitals collaborating on drug trials, they can use the cloud to easily share results and work on common datasets.
Dr. Vogels is Vice President & Chief Technology Officer at Amazon.com, where he is responsible for driving the company’s technology vision: to continuously enhance innovation on behalf of Amazon’s customers at a global scale.
Prior to joining Amazon, he worked as a researcher at Cornell University, where he was a principal investigator in several research projects targeting the scalability and robustness of mission-critical enterprise computing systems. He has held positions of VP of Technology and CTO in companies that handled the transition of academic technology into industry.
Vogels holds a Ph.D. from the Vrije Universiteit in Amsterdam and has authored many articles for journals and conferences, most of them on distributed systems technologies for enterprise computing.
He was named the 2008 CTO of the Year by InformationWeek for his contributions to making Cloud Computing a reality. For his unique style in engaging customers, media and the general public, Dr. Vogels received the 2009 Media Momentum Personality of the Year Award.