Q&A with Data Engineers: Leon Guzenda
Leon Guzenda, Chief Technical Marketing Officer and Founder, Objectivity.
Leon Guzenda was one of the founding members of Objectivity in 1988 and one of the original architects of Objectivity/DB.
Leon has more than five decades of experience in the software industry. He started in real-time control and intelligence analytics. He was Principal Project Director for International Computers Ltd. in the United Kingdom, delivering major projects for NATO and leading multinationals.
He currently works with Objectivity’s major customers to help them effectively develop and deploy complex applications and systems that use the industry’s highest-performing, most reliable DBMS technology, Objectivity/DB. He also liaises with technology partners and industry groups to help ensure that Objectivity/DB remains at the forefront of database and distributed computing technology.
Q1. What are the main technical challenges you typically face in this era of Big Data Analytics?
Finding the right tools to extract value from data.
Q2. How do you gain the knowledge and skills across the plethora of technologies and architectures available?
It’s mainly a matter of online research and talking with peers and a few experts in specialized areas.
Q3. What lessons did you learn from using shared-nothing scale-out architectures, which claim to allow for support of large volumes of data?
The hardest part was keeping the whole system up when we were using a bare metal provider. We’ve since moved most of our work to Amazon EC2 and are looking at IBM BlueMix and Microsoft Azure for other markets.
Q4. What are in your opinion the differences between capacity, scale, and performance?
For me, capacity is a measure of the maximum throughput that a system can sustain. Scale can be a matter of total data volume, GFlops or data flow rates. Some systems require high numbers for all three. Performance is mainly a matter of response times, sometimes coupled with latency. For example, it may be OK to take an hour preparing a set of data if it can be queried in milliseconds by thousands of users. In other cases the time between data being received and being able to react to it is more important.
Q5. There are several types of new data stores including NoSQL, XML-based, Document-based, Key-value, Column-family, Graph data stores, In-Memory Databases, and Array stores. What is your experience with those?
Objectivity has been NoSQL since 1990, so I can claim experience there. Most ODBMSs are both NoSQL and good at K-V, graph and array handling and many have in-memory capabilities. I’ve had limited experience with column stores, but many of our customers store large volumes of documents, generally supported by massive indices and ontologies.
Q6. How do you manage to communicate (if at all) with Data Scientists in common data projects?
We start with a set of goals and constraints then proceed to an object model before selecting the right tools for the job. In one recent Proof of Concept we used ThingSpan as the scalable graph analytics platform, running on Spark with Kafka and Samza handling input streams, plus YARN for workflow management. We also used Tableau and D3.js for visualization. We met online several times a week throughout the project.
Q7. How do you convince/explain to your managers the results you derived from your work?
Show them the pictures. 🙂 It’s actually more a matter of getting them to think more about the key metrics and data inputs that really matter to their business. They generally need help in defining supplementary data types.
Q8. What kind of data infrastructure do you use to support applications?
We used to be high performance cluster based, but now most things are run in the Cloud. Luckily, we don’t have to store the data for long, as storage costs are still the most expensive parameter for us. Sometimes it takes us longer to acquire and load the data than it does to analyze it. Most of our customers are still on dedicated clusters.
Q9. How do you evaluate if the various new emerging data pipelines created for acquiring, storing, processing and analyzing data streaming in real-time, are suitable for the application needs?
There are always domain experts who can identify that part of the equation for us at the outset. The harder part is figuring out what other kinds of data could help. In one case the system had massive amounts of medical and drug or treatment data, but they had no way of coupling it with environmental factors. Once that was done some very clear patterns emerged.
Q10. What are your experiences of using Hadoop in production?
I helped mount a 15 Petabyte system a couple of years ago. It was really hard to keep it running. We had problems with “zombie” processes and, worse still, “revenants” that came back from nowhere and started processing the wrong data. That’s not good in a secure environment. We were eventually able to move everything to Spark, which solved those problems.
Q11. What are your experiences of using Spark in production?
I’ve been involved in a half dozen systems, two of which are in production. It’s getting easier all the time.
Q12. Do you think Hadoop is suitable for the operational side of data?
It’s OK for batch processing. HDFS is horrible at supporting random access I/O, so it’s not usually the best choice for our customers.
Q13. How do you manage a Hadoop installation?
We use Cloudera’s tools, mainly.
Q14. How do you typically handle scalability problems and storage limitations?
Put everything in the cloud.
Q15. ACID, BASE, CAP. Are these concepts still important?
To me, CAP and BASE have always been a copout for people who can’t get the performance, latency or throughput that they need with ACID transactions. Most of our customers need ACID or things can go badly wrong.
Q16. What is your take of the various extensions of SQL to support data analytics?
[Sigh!] It’s a move in the right direction, but I’d rather stick with the right tool for the job. I’ve never liked those gadgets that have a dozen appendages. It’s no fun cutting your finger on a knife blade while trying to hammer a nail in. I still curl up in a ball when I remember SQL running on IDMS, a network database.
Q17. Can you share your experience of using the Lambda architecture?
Our products have been handling Lambda-like problems since before the term was invented. Lambda has three layers for batch, speed and serving. Most of our large scale deployments have used an ingest and event processing stage, a correlation and analysis stage, which can be huge, and a query processing stage that generally reaches into the results from the middle stage but can also reach back to the raw data, or something close to it. One of our oil and gas customers is having a lot of success with their Lambda architecture, based on ThingSpan and Spark.
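The three Lambda layers named above (batch, speed and serving) can be illustrated with a minimal Python sketch. It is not taken from ThingSpan or from any customer system; the event layout, the sensor names and the function names are all hypothetical, chosen only to show how a serving query merges a batch view with a speed view.

```python
from collections import Counter

def batch_view(master_events):
    """Batch layer: recompute counts from the full, immutable event log."""
    return Counter(event["sensor"] for event in master_events)

def speed_view(recent_events):
    """Speed layer: count only events not yet absorbed by a batch run."""
    return Counter(event["sensor"] for event in recent_events)

def serve(batch, speed, sensor):
    """Serving layer: answer a query by merging the two views."""
    return batch.get(sensor, 0) + speed.get(sensor, 0)

# Hypothetical data: the master log plus one event that arrived afterwards.
master = [{"sensor": "pump-1"}, {"sensor": "pump-1"}, {"sensor": "valve-7"}]
recent = [{"sensor": "pump-1"}]

total = serve(batch_view(master), speed_view(recent), "pump-1")
print(total)
```

In a real deployment the batch view would be rebuilt periodically over the raw data and the speed view discarded once the batch run catches up; the sketch keeps both in memory purely for readability.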
Q18. What were your most successful large scale data projects? And why?
The public ones, like SLAC BaBar (over a petabyte of data in 1999/2000), were very informative and impressive in their own right. I can’t talk about the bigger ones, apart from mentioning that one of them processes tens of trillions of objects in a graph structure per day for thousands of analysts.
Q19. What are the typical mistakes made in a large scale data project? How can they be avoided in practice?
a. Overselling the business value of a project.
b. Underestimating the complexity of systems.
c. Trying to move too fast. Take small, inexpensive steps that show value quickly.
I can’t overemphasize the value of doing a thorough requirements analysis, proof of concept and performance/scale testing before embarking on the main push.
Q20. Learning a new technology and developing the knowledge is different than developing skills. Skills are perishable. How do you handle this?
Install good knowledge bases. We use tools like JIRA and have a forum that we call ObjyShare. I use Slack extensively to document projects with partners, prospects and customers.
Q21. What are RDDs (Resilient Distributed Datasets) and transformation functions useful for?
We started with RDDs but moved to DataFrames. We’re looking at GraphFrames too. We find Spark SQL and MLlib to be very useful in identifying potential targets for graph analytics, or for summarizing or filtering the output from graph queries.
Q22. Did you use Spark Streaming for processing live data streams? And if yes, what is your experience?
Yes, but we’ve used Kafka and Samza on recent projects. Tuning the data flow is key to getting good throughput and performance.
Q23. Did you use Spark MLlib (Machine Learning Algorithms)? And if yes, what is your experience?
Yes, mainly K-Means, but we’re looking at other algorithms too. We’re also looking at proprietary deep learning tools.
Q24. In your experience, how good is Spark supporting stream processing?
It’s getting faster. It can’t compete with IBM’s streaming technology yet, but it may be able to in the near future.
Q25. SQL Spark (known as Shark) is a module in Spark to work with structured data and perform structured data processing. What is your experience with Shark?
We’ve used it with DataFrames produced automatically by ThingSpan. It’s adequate, but there are faster tools out there.
Q26. Is there any benefit of learning MapReduce?
Yes, for some applications, but it’s not a panacea. Divide and conquer has been around for half a century.
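For readers new to the pattern, the divide-and-conquer idea behind MapReduce can be sketched in plain Python as a word count with explicit map, shuffle and reduce phases. This is only an illustration of the technique on a single machine, not Hadoop's API; the function names and sample documents are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

# Hypothetical input: each document could be mapped on a different worker.
documents = ["divide and conquer", "divide the data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across machines, which is the part a framework such as Hadoop automates.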
Q27. How do data management, data science, and machine learning/AI relate to each other?
This needs a picture. Data management is about the storage, safeguarding and efficient access to data. Data science deals with more advanced algorithms. Machine and deep learning (which is now preferred to AI) uses novel architectures, such as the many kinds of neural network, to tackle problems where the system can be trained or can evolve itself to find patterns or insights.
Q28. What are the unsolved data management challenges that arise from the increased interest in AI?
Near-memory access to widely distributed, voluminous data.
Q29. Is Storm better than Kafka?
That’s like comparing apples and pears. “Kafka is a distributed, persistent message broker. Storm is a realtime computation system.” – Quora. They can be used in conjunction with one another.
Q30. What are the barriers to applying machine learning at scale?
Algorithms, overcoming I/O barriers and scalable visual analytics.
Q31. Do you have experience with In-database machine learning?
Yes, but I can’t disclose anything about those systems. I can say that a content addressed file system would help and that neural networks are coming back into fashion in a big way.
Q32. How do you ensure data quality?
There are lots of tools for doing basic cleansing. Many problems don’t emerge until you put the data into the database, especially a graph. Rejecting bad data before letting it in is preferable to pruning bad data. It’s like not allowing the insertion of a connection in a graph that would cause queries to loop endlessly.
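The point about rejecting a bad connection at insertion time, rather than pruning it later, can be sketched as follows. This is a minimal illustration for a directed graph that is meant to stay acyclic; it does not describe how Objectivity/DB or ThingSpan validate data, and the class and method names are hypothetical.

```python
class AcyclicGraph:
    """Directed graph that refuses any edge that would close a cycle."""

    def __init__(self):
        self.edges = {}  # node -> set of successor nodes

    def _reaches(self, start, goal):
        """Depth-first search: can goal be reached from start?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges.get(node, ()))
        return False

    def add_edge(self, src, dst):
        # src -> dst closes a cycle exactly when dst can already reach src.
        if self._reaches(dst, src):
            raise ValueError(f"edge {src} -> {dst} would create a cycle")
        self.edges.setdefault(src, set()).add(dst)
        self.edges.setdefault(dst, set())

graph = AcyclicGraph()
graph.add_edge("a", "b")
graph.add_edge("b", "c")
try:
    graph.add_edge("c", "a")  # would let a traversal loop a -> b -> c -> a
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Many real graphs legitimately contain cycles, so a production system would apply a check like this only to relationships that are required to be hierarchical or ordered.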
Qx. Anything else you wish to add?
This is a rapidly emerging field with lots of failures and successes to learn from. I encourage people to get as much training as they can before spending money on building things. I also think that the whole field will be rebooted once quantum and optical computers become generally available. Also, [shameless plug] we’ll soon have a convenient way to try ThingSpan graph analytics at scale in the cloud.