Q&A with Data Engineers: Jim Webber
Dr. Jim Webber is Chief Scientist with Neo Technology, the company behind the popular open source graph database Neo4j, where he works on R&D for highly scalable graph databases and writes open source software. Jim has written two books on integration and distributed systems: “Developing Enterprise Web Services” on XML Web Services and “REST in Practice” on using the Web for building large-scale systems. His latest book is “Graph Databases”, which focuses on the Neo4j database. His blog is located at http://jimwebber.org and he tweets often @jimwebber.
Q1. What are the main technical challenges you typically face in this era of Big Data Analytics?
Most data technology deals with records discretely. It’s hard to make sense of that kind of data as a human. To make that data valuable we’ve come up with all kinds of techniques to process it and to join it together so that we can gain insight from it. Our biggest challenge is the mind shift away from legacy discrete data technology, with all its clever workarounds and batch processing engines, towards a future where we natively store, query, and process connected data. The technology exists; our challenge is to defeat our own sunk cost fallacies and embrace the new wave.
Q2. How do you gain the knowledge and skills across the plethora of technologies and architectures available?
I’m a research scientist so my primary source of information is my R&D team. In turn we gather our collective knowledge from the academic literature (whose value is truly underappreciated in software engineering) and from our peers in other companies.
Q3. What lessons did you learn from using shared-nothing scale-out architectures, which claim to allow for support of large volumes of data?
That they’re not terribly useful, for two reasons. First, at some point you need to query between the silos, which goes against the grain architecturally. Second, there’s always some point of contention (network, SAN, whatever) where the model falters in surprising ways. Often it’s more honest to call this architecture mostly-shared-nothing.
Q4. What are in your opinion the differences between capacity, scale, and performance?
Capacity I think is easy: what’s a reasonable amount of data you can store in a system.
But scale and performance are subtle and tremendously abused. In particular you hear “scalable” as a synonym for “excellent.” That’s not often the case: we’ve all heard of laptop-scale scripts beating out large batch processing clusters.
I tend to think of it in terms of trade-offs: scale is the trade-off of latency for throughput. Performance is the trade-off of efficiency for scale. Any system will pick its characteristics along these axes. The better systems are the ones that have good designs and implementations to go far along even the unfavourable axis. But no system can reasonably expect to excel across all of throughput, latency, and efficiency. And all too often we pick throughput at the expense of everything else.
Q5. There are several types of new data stores including NoSQL, XML-based, Document-based, Key-value, Column-family, Graph data stores and Array stores. What is your experience with those?
If you’ve read this far you’ll have ascertained I’m a champion of graph data. Graph stores are the only ones on the list which deal with connected data, and as such are the only ones designed for handling the unpredictability of the real-world systems we have to build.
Q6. How do you convince/ explain to your managers the results you derived from your work?
I work for a database vendor – Neo4j. As such, explaining my work to my boss is straightforward since we’re all computer scientists. Our informal reporting lines extend far past our company boundary into a community of database researchers from a wide range of academic and industrial organisations, with whom we share research and development experiences and collectively work towards better future systems.
Q7. Do you think Hadoop is suitable for the operational side of data?
No. It’s a batch processing framework at heart.
Q8. How do you typically handle scalability problems and storage limitations?
From a database engineering point of view each of these is a challenge. We handle them by consulting the literature and carefully teasing out what we think are the most promising scientific approaches. We then plan to include these techniques into future releases of Neo4j.
Q9. ACID, BASE, CAP. Are these concepts still important?
To me yes, though I think they’ve run out of steam a little amongst the user community who have CAP fatigue.
But the reason these remain important to me is that BASE semantics corrupt graph data, whereas ACID semantics don’t.
The issue at hand is that for graphs we need to keep two replica sets in sync: a relationship connects two nodes, and the replicas holding each end have to agree on it.
With eventual consistency (BASE) semantics we have race conditions which can cause data corruption under normal operation. Where one replica has been updated and the other hasn’t, a non-deterministic update can then occur based on that intermediate state. This isn’t recoverable in the general case and corrupts the graph. That’s unacceptable: a database should not lose or corrupt data.
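The race described above can be made concrete with a small Python sketch. It is a toy model for illustration only, not Neo4j code: the `Shard` class, the `run` helper, and the node names are all hypothetical, and it assumes each end of a relationship is recorded on the shard that owns that node. One writer adds a relationship between `alice` and `bob` while another deletes it; when the two shards apply those writes in different orders, one end of the relationship survives and the other does not.

```python
# Toy model (all names hypothetical): each shard records, per node,
# the neighbours that node is believed to be related to.
class Shard:
    def __init__(self, name):
        self.name = name
        self.adjacency = {}  # node id -> set of neighbour ids

    def add_half_edge(self, node, neighbour):
        self.adjacency.setdefault(node, set()).add(neighbour)

    def remove_half_edge(self, node, neighbour):
        self.adjacency.get(node, set()).discard(neighbour)


def dangling_half_edges(shards, placement):
    """Return (node, neighbour) pairs recorded on one end only."""
    dangling = []
    for shard in shards:
        for node, neighbours in shard.adjacency.items():
            for neighbour in neighbours:
                other = placement[neighbour]
                if node not in other.adjacency.get(neighbour, set()):
                    dangling.append((node, neighbour))
    return sorted(dangling)


def run(steps):
    """Apply per-shard writes in the given order, then check the graph."""
    a, b = Shard("A"), Shard("B")
    placement = {"alice": a, "bob": b}  # which shard owns which node
    ops = {
        "add@A": lambda: a.add_half_edge("alice", "bob"),
        "add@B": lambda: b.add_half_edge("bob", "alice"),
        "del@A": lambda: a.remove_half_edge("alice", "bob"),
        "del@B": lambda: b.remove_half_edge("bob", "alice"),
    }
    for step in steps:
        ops[step]()
    return dangling_half_edges([a, b], placement)


# Each logical write applied as a unit (ACID-like): the graph stays consistent.
atomic = ["add@A", "add@B", "del@A", "del@B"]

# No ordering across shards (BASE-like): shard A sees add then delete,
# shard B sees delete then add.
interleaved = ["add@A", "del@A", "del@B", "add@B"]

print(run(atomic))       # []
print(run(interleaved))  # [('bob', 'alice')]
```

In the interleaved run `bob` still points at `alice` while `alice` has no record of `bob`, and nothing in the stored data says which side is right. That is the sense in which the intermediate state is not recoverable in the general case.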
Q10. What is your take of the various extensions of SQL to support data analytics?
I’m not convinced by them. On the one hand I appreciate that leveraging SQL means we can capitalise on existing learning.
On the other hand SQL isn’t a great fit for every analytics job.
We took the decision early on in Neo4j to create a non-SQL query language called “Cypher” which is native for graph queries. We heard the feedback that it forced developers to learn something new, but overwhelmingly the feedback was that developers are always learning something new and that Cypher is simple enough and powerful enough that it more than makes up for the couple of days’ worth of learning it requires.
Q11. What were your most successful large scale data projects? And why?
We have an Internet-scale ad-tech user in the USA. They touch around 90% of the American public: hundreds of millions of users and around a billion transactions per day. I think of this as a success because this company has taken graph tech into a domain ordinarily thought of as Hadoop/Spark/NoSQL territory because of the scale aspect, and outperformed with Neo4j.
I think it’s also a salient point: graphs are coming and those organisations that adopt them will outperform the competition.
Q12. What are the typical mistakes done for a large scale data project? How can they be avoided in practice?
The sunk cost fallacy is the worst mistake technologists can make. It’s the idea that because I know one tool, such as a particular database or processing framework, I’ll use it again and again because learning something new is too costly.
To be successful in deploying complex systems we need to recognise those times where we can repeat past patterns successfully and those times when we have to strap in and learn something new. The risk/reward profile for that decision is probably on a per-project basis, but the number of times I’ve seen teams beating a nail into the wall with their heads because they’re too busy to pick up a hammer is astonishing.