Big Data and NoSQL: Interview with Joe Celko
“The real problem is not collecting the data, even at insanely high speeds; the real problem is acting on it in time. This is where we have things like automatic stock trading systems. The database is integrated rather than separated from the application.” –Joe Celko.
I have interviewed Joe Celko, a well-known database expert, on the challenges of Big Data and when it makes sense to use non-relational databases.
Q1. Three areas make today’s new data different from the data of the past: Velocity, Volume and Variety. Why?
Joe Celko: I did a keynote at a PostgreSQL conference in Prague with the title “Our Enemy, the Punch Card” on the theme that we had been mimicking the old data models with the new technology. This is no surprise; the first motion pictures were done with a single camera that never moved, to mimic a seat at a theater.
Eventually, “moving picture shows” evolved into modern cinema. This is the same pattern in data. It is physically impossible to make punch card and magnetic tape data move as fast as fiber optics, or hold as many bits. More importantly, the cost per bit dropped by orders of magnitude. Now it was practical to computerize everything! And since we can do it, and do it cheap, we will do it.
But what we found out is that this new, computerizable (is that a word?) data is not always traditionally structured data.
Q2. What about data Veracity? Is this a problem as well?
Q3. When information is changing faster than you can collect and query it, it simply cannot be treated the same as static data. What are the solutions available to solve this problem?
Joe Celko: I have to do a disclaimer here: I have done videos for Streambase and Kx Systems.
There is an old joke about two morons trying to fix a car. Q: “Is my signal light working?” A: “Yes. No. Yes. No. Yes. No. ...” It is a joke, but it summarizes the basic problem with streaming data. That is streaming data or “complex events” in the literature.
The model is that tables are replaced by streams of data, but the query language in Streambase is an extended SQL dialect.
The Victory of SELECT-FROM-WHERE!
The Kx products are more like C or other low level languages.
The real problem is not collecting the data, even at insanely high speeds; the real problem is acting on it in time. This is where we have things like automatic stock trading systems. The database is integrated rather than separated from the application.
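To make the point about acting on data in time concrete, here is a minimal Python sketch of a stream consumer that decides on each tick as it arrives instead of storing everything and querying later. It is not Streambase or Kx code; the function name, the window size, the threshold, and the BUY/SELL rule are all assumptions chosen for illustration.

```python
from collections import deque

def act_on_stream(ticks, window=3, threshold=0.02):
    """Yield (index, price, action) as each tick arrives.

    Only the last `window` prices are kept, so the decision is made
    inside the stream rather than by querying a stored table afterwards.
    The trading rule itself is hypothetical.
    """
    recent = deque(maxlen=window)
    for i, price in enumerate(ticks):
        if len(recent) == window:
            avg = sum(recent) / window
            if price > avg * (1 + threshold):
                yield i, price, "SELL"
            elif price < avg * (1 - threshold):
                yield i, price, "BUY"
        recent.append(price)

# Hypothetical price ticks; a real feed would be an unbounded iterator.
signals = list(act_on_stream([100.0, 100.5, 99.5, 103.0, 100.0, 97.0]))
```

Because the function is a generator, it works the same way on an endless feed: each action is emitted as soon as the tick that triggers it is seen.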
Q4. Old storage and access models do not work for big data. Why?
Joe Celko: First of all, the old stuff does not hold enough data. How would you put even a day’s worth of Wal-Mart sales on punch cards? Sequential access will not work; we need parallelism. We do not have time to index the data; the traditional tree indexing requires extra time, usually O(log2(n)). Our best bets are perfect hashing functions and special hardware.
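The contrast between tree-style indexing and hashing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales rows are made up, a binary search over sorted keys stands in for a tree index, and Python's built-in dict is a general hash table, not a perfect hashing function.

```python
import bisect

# Hypothetical (sku, quantity) rows, already sorted by sku.
sales = [(sku, sku * 3 % 17) for sku in range(0, 1000, 7)]

# Tree-style index: sorted keys plus binary search,
# roughly log2(n) comparisons per lookup.
sorted_keys = [sku for sku, _ in sales]

def tree_lookup(sku):
    pos = bisect.bisect_left(sorted_keys, sku)
    if pos < len(sorted_keys) and sorted_keys[pos] == sku:
        return sales[pos][1]
    return None

# Hash index: one hash computation, expected constant time per lookup.
hash_index = dict(sales)

def hash_lookup(sku):
    return hash_index.get(sku)
```

Both functions return the same answers; the difference is the amount of work per lookup as the number of rows grows.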
Q5. What are different ways available to store and access data such as petabytes and exabytes?
Joe Celko: Today, we are still stuck with moving disk. Optical storage is still too expensive and slow to write.
Solid State Disk is still too expensive, but dropping fast. My dream is really cheap solid state drives that have lots of processors in the drive which monitor a small subset of the data. We send out a command “Hey, minions, find red widgets and send me your results!” and it happens all at once. The ultimate Map-Reduce model in the hardware!
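The “minions” idea can be imitated in software as a scatter and gather over shards. The Python sketch below is an assumption-laden toy: the shard layout, the item fields, and the function names are invented, and threads stand in for the processors that would live inside the drive.

```python
from concurrent.futures import ThreadPoolExecutor

def find_red(shard):
    """Map step: one 'minion' scans only its own small subset of the data."""
    return [item for item in shard if item["color"] == "red"]

def query_all(shards):
    """Send the same query to every shard at once, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(find_red, shards))
    # Reduce step: flatten the per-shard answers into one result list.
    return [item for part in partials for item in part]

# Hypothetical widget inventory split across three shards.
shards = [
    [{"id": 1, "color": "red"}, {"id": 2, "color": "blue"}],
    [{"id": 3, "color": "red"}],
    [{"id": 4, "color": "green"}],
]
red_widgets = query_all(shards)
```

Each worker touches only its own shard, so the scan time depends on the shard size rather than on the total volume of data.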
Q6. Not all data can fit into a relational model, including genetic data, semantic data, and data generated by social networks. How do you handle data variety?
Joe Celko: We have graph databases for social networks. I was a math major, so I love them. Graph theory has a lot of good problems and algorithms we can steal, just like SQL uses set theory and logic. But genetic data and semantics do not have a mature theory behind them. The real way to handle the diversity is new tools, starting at the conceptual level. How many times have you seen someone write 1960’s COBOL file systems in SQL?
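As an example of the kind of graph algorithm that can be borrowed for social network data, here is a minimal Python sketch of breadth-first search computing degrees of separation. The friendship data and the function name are hypothetical, and an in-memory adjacency dictionary stands in for a graph database.

```python
from collections import deque

def degrees_of_separation(graph, start, goal):
    """Return the length of the shortest friendship path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        for friend in graph.get(person, ()):
            if friend == goal:
                return dist + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None

# Hypothetical social network as an adjacency list.
friends = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
    "eve": [],
}
```

Here `degrees_of_separation(friends, "ann", "dan")` walks ann, bob, cat, dan, while a person with no connections, such as eve, is reported as unreachable.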
Q7. What are the alternative storage, query, and management frameworks needed by certain kinds of Big Data?
Joe Celko: As best you can, do not scare your existing staff with a totally new environment.
Q8. Columnar data stores, graph databases, streaming databases, analytic databases. How do you classify and evaluate all of these NewSQL/NoSQL solutions available?
Joe Celko: First decide what the problem is, then pick the tool. One of my war stories was consulting at a large California company that wanted to put their labor relations law library on their new DB2 database. It was all text, and used by lawyers. Lawyers do not know SQL. Lawyers do not want to learn SQL. But they do know Lexis and WestLaw text query tools. They know labor law and the special lingo. Programmers do not know labor law. Programmers do not want to learn labor law. But the programmers can set up a textbase for the lawyers.
Q9. If you were a user, how would you select the “right” data management tools and technology for the job?
Joe Celko: There is no generic answer. Oh, there will be a better answer by the time you get into production. Welcome to IT!
Joe Celko served 10 years on the ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards. Mr. Celko is the author of a series of books on SQL and RDBMS for Morgan Kaufmann. He is an independent consultant based in Austin, Texas. He has written over 1300 columns in the computer trade and academic press, mostly dealing with data and databases.
“Joe Celko’s Complete Guide to NoSQL: What Every SQL Professional Needs to Know about Non-Relational Databases”, Paperback: 244 pages, Morgan Kaufmann; 1 edition (October 31, 2013), ISBN-10: 0124071929
“Big Data: Challenges and Opportunities” (.PDF), Roberto V. Zicari, Goethe University Frankfurt, ODBMS.org, October 5, 2012