On Big Data. Interview with Adam Kocoloski.
“The pace at which we can generate data will outstrip our ability to store it. I think you’ll soon see data scientists emphasizing the ability to make decisions on data before storing it.” –Adam Kocoloski
I have interviewed Adam Kocoloski, Founder & CTO of Cloudant.
RVZ
Q1. What can we learn from physics when managing and analyzing big data for the enterprise?
Adam Kocoloski: The growing body of data collected in today’s Web applications and sensor networks is a potential goldmine for businesses. But modeling transactions between people and causality between events becomes challenging at large scale, and traditional enterprise systems like data warehousing and business intelligence are too cumbersome to extract value fast enough.
Physicists are natural problem solvers, equipped to think through what tools will work for particular data challenges. In the era of big data, these challenges are growing increasingly relevant, especially to the enterprise.
In a way, physicists have it easier. Analyzing isolated particle collisions translated well to distributed university research systems and parallel models of computing. In other ways, we have shared the challenge of filtering big data to find useful information. In my physics work, we addressed this problem with blind analysis and machine learning. I think you’ll soon see those practices emerge in the field of enterprise data analysis.
Q2. How do you see data science evolving in the near future?
Adam Kocoloski: The pace at which we can generate data will outstrip our ability to store it. I think you’ll soon see data scientists emphasizing the ability to make decisions on data before storing it.
The sheer volume of data we’re storing is a factor, but what’s more interesting is the shift toward the distributed generation of data — data from mobile devices, sensor networks, and the coming “Internet of Things.” It’s easy for an enterprise to stand up Hadoop in its own data center and start dumping data into it, especially if it plans to sort out the valuable parts later. It’s not so easy when it’s large volumes of operational data generated in a distributed system. Machine learning algorithms that can recognize and store only the useful patterns can help us better deal with the deluge.
As physicists, we learned that, the way big data is headed, there’s no way we’ll be able to keep writing it all down. That’s the tradeoff today’s data scientists must learn: right at the point of collection, you need to decide what data to throw away.
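As a minimal sketch of deciding on data before storing it (the sensor feed and threshold here are hypothetical), an edge process can keep only readings that look anomalous against running statistics and discard the rest:

```python
import random  # stands in for a real sensor feed

def is_interesting(value, mean, std, threshold=3.0):
    """Decide at collection time whether a reading is worth storing:
    keep only values more than `threshold` standard deviations from
    the running mean, and drop everything else."""
    return std > 0 and abs(value - mean) > threshold * std

# Maintain running statistics with Welford's algorithm and store
# only the readings flagged as interesting.
count, mean, m2 = 0, 0.0, 0.0
stored = []
for _ in range(10_000):
    value = random.gauss(20.0, 2.0)  # hypothetical sensor reading
    count += 1
    delta = value - mean
    mean += delta / count
    m2 += delta * (value - mean)
    std = (m2 / count) ** 0.5
    if count > 100 and is_interesting(value, mean, std):
        stored.append(value)  # the rest is never written to disk

print(f"kept {len(stored)} of {count} readings")
```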
Q3. In your opinion, given the current available Big Data technologies, what is the most difficult challenge in filtering big data to find useful information?
Adam Kocoloski: Cloudant is an operational data store and not a big data or offline analytics platform like Hadoop. That means we deal with mutable data that applications are accessing and changing as they run.
From my physics experience, the most difficult big data challenge I’ve seen is the lack of accurate simulations for machine learning. For me, that meant simulations of the STAR particle detector at Brookhaven National Lab’s Relativistic Heavy Ion Collider (RHIC).
People use machine learning algorithms in many fields, and they don’t always understand the caveats of building an appropriate training data set. It’s easy to apply training data without fully understanding how the process works, and people who do that won’t realize when they’ve trained their machine learning algorithms inappropriately.
Slicing data from big data sets is great, but at a certain point it becomes a black box that makes it hard to understand what is and what isn’t working well in your analysis. The bigger the data, the more it’s possible for one variable to be related to others in nonlinear ways. This problem makes it harder to reason about data, placing more demands on data scientists to build training data sets using a balanced combination of linear and nonlinear techniques.
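As a small illustration of that nonlinearity trap (synthetic data, NumPy only): a straight-line fit on a genuinely quadratic relationship leaves large residuals that shrink dramatically once the feature set captures the nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 500)
y = x**2 + rng.normal(0, 0.5, 500)   # the true relationship is nonlinear

# Linear-only design matrix: [1, x]
X_lin = np.column_stack([np.ones_like(x), x])
coef_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
resid_lin = y - X_lin @ coef_lin

# Add a nonlinear feature: [1, x, x^2]
X_quad = np.column_stack([np.ones_like(x), x, x**2])
coef_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)
resid_quad = y - X_quad @ coef_quad

print(f"linear fit RMSE:    {np.sqrt(np.mean(resid_lin**2)):.2f}")   # roughly 2.7
print(f"quadratic fit RMSE: {np.sqrt(np.mean(resid_quad**2)):.2f}")  # roughly 0.5
```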
Q4. Could you please explain why blind analysis is important for Big Data?
Adam Kocoloski: Humans are naturally predisposed to find signals. It’s an evolutionary trait of ours. It’s better if we recognize the tiger in the jungle, even if there really isn’t one there. If we see a bump in a distribution of data, we do what we can to tease it out. We bias ourselves that way.
So when you do a blind analysis, you hopefully immunize yourself against that bias.
Data scientists are people too, and with big data, they can’t become overly reliant on data visualization. It’s too easy for us to see things that aren’t really there. Instead of seeking out the signals within all that data, we need to work on recognizing the noise — the data we don’t want — so we can inversely select the data we want to keep.
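A minimal sketch of one standard blinding technique from physics, masking the “signal region” of a dataset until the selection criteria are frozen (the dataset and region boundaries here are made up):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(100.0, 15.0, 5000)  # hypothetical measurements

SIGNAL_REGION = (120.0, 130.0)        # where we expect (and fear) a bump

def blinded(dataset):
    """Mask the signal region so cuts are tuned on the sidebands only."""
    lo, hi = SIGNAL_REGION
    return dataset[(dataset < lo) | (dataset > hi)]

# Step 1: develop and freeze the analysis using only blinded data.
sidebands = blinded(data)
cut = np.percentile(sidebands, 95)    # example selection, tuned blind

# Step 2: only after the selection is frozen, "open the box" and apply
# the pre-committed cut to the signal region.
lo, hi = SIGNAL_REGION
signal_region = data[(data >= lo) & (data <= hi)]
print(f"events in signal region passing the frozen cut: {(signal_region > cut).sum()}")
```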
Q5. Is machine learning the right way to analyze Big Data?
Adam Kocoloski: Machine learning offers the possibility to improve the signal-to-noise ratio beyond what any manually constructed analysis can do.
The potential is there, but you have to balance it with the need to understand the training data set. It’s not a panacea. Algorithms have weak points. They have places where they fail. When you’re applying various machine-learning analyses, it’s important that you understand where those weak points are.
Q6. The past year has seen a renaissance in NewSQL. Will transactions ultimately spell the end of NoSQL databases?
Adam Kocoloski: No, for two reasons: 1) there is a wide and growing class of problems that don’t require transactional semantics, and 2) mobile computing makes transactions at large scale technically infeasible.
Applications like address books, blogs, or content management systems can store a wide variety of data and, for the most part, do not require a high degree of transactional integrity. Using systems that inherently enforce schemas and row-level locking, like a relational database management system (RDBMS), unnecessarily over-complicates these applications.
It’s widely thought that the popularity of NoSQL databases was due to the inability of relational databases to scale horizontally. If NewSQL databases can provide transactional integrity for large, distributed databases and cloud services, does this undercut the momentum of the NoSQL movement? I argue that no, it doesn’t, because mobile computing introduces new challenges (e.g. offline application data and database sync) that fundamentally cannot be addressed in transactional systems.
It’s unrealistic to lock a row in an RDBMS when a mobile device that’s only occasionally connected could introduce painful amounts of latency over unreliable networks. Add to that the fact that many NoSQL systems are introducing new behaviors (strong consistency, multi-document transactions) and strategies for approximating ACID transactions (event sourcing), and it becomes clear that mobile is showing us we need to rethink the information theory behind transactions.
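To make the contrast concrete, here is a minimal sketch of the lock-free, revision-based model CouchDB uses instead of row locks, exercised through its HTTP API (the server URL and database name are assumptions, and the example requires the Python `requests` package):

```python
import requests

COUCH = "http://localhost:5984"   # hypothetical local CouchDB instance
DB = f"{COUCH}/contacts"          # hypothetical database name

requests.delete(DB)               # start fresh so the example is repeatable
requests.put(DB)                  # create the database

# Create a document; CouchDB hands back a revision token instead of a lock.
resp = requests.put(f"{DB}/alice", json={"email": "alice@example.com"})
rev = resp.json()["rev"]

# An update must cite the revision it was based on...
ok = requests.put(f"{DB}/alice",
                  json={"_rev": rev, "email": "alice@work.example.com"})
print(ok.status_code)             # 201: update accepted

# ...so a stale writer, such as a device that last synced long ago, is
# rejected with 409 Conflict instead of blocking anyone, and the conflict
# can be resolved the next time the device replicates.
stale = requests.put(f"{DB}/alice",
                     json={"_rev": rev, "email": "alice@old.example.com"})
print(stale.status_code)          # 409: conflict, resolve at sync time
```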
Q7. What is the technical role that CouchDB clustering plays for Cloudant’s distributed data hosting platform?
Adam Kocoloski: At Cloudant, clustering allows us to take one logical database and partition that database for large scale and high availability.
We also store redundant copies of the partitions that make up that cluster, and to our customers, it all looks and operates like one logical database. CouchDB’s interface naturally lends itself to this underlying clustering implementation, and it is one of the many technologies we have used to build Cloudant’s managed database service.
Cloudant is built to be more than just hosted CouchDB. Along with CouchDB, open source software projects like HAProxy, Lucene, Chef, and Graphite play a crucial role in running our service and managing the experience for customers. Cloudant is also working with organizations like the Open Geospatial Consortium (OGC) to develop new standards for working with geospatial data sets.
That said, the semantics of CouchDB replication — if not the actual implementation itself — are critical to Cloudant’s ability to synchronize individual JSON documents or entire database partitions between shard copies within a single cluster, between clusters in the same data center, and between data centers across the globe. We’ve been able to horizontally scale CouchDB and apply its unique replication abilities on a much larger scale.
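As a toy illustration of the partitioning idea (a sketch only, not Cloudant’s actual implementation), one can hash each document ID to a shard and place N redundant copies of that shard on distinct nodes:

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster members
Q = 8  # number of shards the logical database is split into
N = 3  # redundant copies kept of each shard

def shard_for(doc_id: str) -> int:
    """Map a document ID to one of Q shards with a stable hash."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % Q

def replicas_for(shard: int) -> list:
    """Place N copies of a shard on consecutive, distinct nodes."""
    return [NODES[(shard + i) % len(NODES)] for i in range(N)]

doc_id = "user:adam"
shard = shard_for(doc_id)
print(f"{doc_id} -> shard {shard}, stored on {replicas_for(shard)}")
```

Because the hash is stable, every node can answer “where does this document live?” without consulting a central coordinator, which is what lets the cluster present itself as one logical database.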
Q8. Cloudant recently announced the merging of its distributed database platform into the Apache CouchDB project. Why? What are the benefits of such integration?
Adam Kocoloski: We merged the horizontal scaling and fault-tolerance framework we built in BigCouch into Apache CouchDB™. The same way Cloudant has applied CouchDB replication in new ways to adapt the database for large distributed systems, Apache CouchDB will now share those capabilities.
Previously, the biggest knock on CouchDB was that it couldn’t scale horizontally to distribute portions of a database across multiple machines. People saw it as a monolithic piece of software, only fit to run on a single server. That is no longer the case.
Obviously new scalability features are good for the Apache project, and a healthy Apache CouchDB is good for Cloudant. The open source community is an excellent resource for engineering talent and sales leads. Our contribution will also improve the quality of our code. Having more of it out there in live deployment will only increase the velocity of our development teams. Many of our engineers wear multiple hats — as Cloudant employees and Apache CouchDB project committers. With the code merger complete, they’ll no longer have to maintain multiple forks of the codebase.
Q9. Will there be two offerings of the same Apache CouchDB: one from Couchbase and one from Cloudant?
Adam Kocoloski: No. Couchbase has distanced itself from the Apache project. Their product, Couchbase Server, is no longer interface-compatible with Apache CouchDB and has no plans to become so.
———
Adam Kocoloski, Founder & CTO of Cloudant.
Adam is an Apache CouchDB developer and one of the founders of Cloudant. He is the lead architect of BigCouch, a Dynamo-flavored clustering solution for CouchDB that serves as the core of Cloudant’s distributed data hosting platform. Adam received his Ph.D. in Physics from MIT in 2010, where he studied the gluon’s contribution to the spin structure of the proton using a motley mix of server farms running Platform LSF, SGE, and Condor. He and his wife Hillary are the proud parents of two beautiful girls.
Related Posts
– Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. September 23, 2013
– On NoSQL. Interview with Rick Cattell. August 19, 2013
– On Big Data Analytics. Interview with David Smith. February 27, 2013
Resources
– “NoSQL Failover Characteristics: Aerospike, Cassandra, Couchbase, MongoDB” (.pdf), by Denis Nelubin, Ben Engber, Thumbtack Technology, 2013
– “Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs” (.pdf), by Denis Nelubin, Ben Engber, Thumbtack Technology, 2013
Follow us on Twitter: @odbmsorg