Q&A with Data Engineers: Carlos Ordonez
Carlos Ordonez studied at UNAM, Mexico, earning a B.Sc. in applied mathematics and an M.S. in computer science. He continued with PhD studies at the Georgia Institute of Technology, advised by Edward Omiecinski, focusing on extending database systems with data mining algorithms, and received his PhD in 2000. Carlos worked at Teradata (formerly part of NCR) from 1998 to 2006, collaborating on the optimization of machine learning and cube algorithms to run on the Teradata parallel DBMS. In 2006 Carlos joined the Department of Computer Science at the University of Houston, where he currently leads the DBMS lab. From 2013 to 2015 Carlos collaborated with Michael Stonebraker, regularly visiting the Database Group at MIT. From July 2014 to July 2015 Carlos was a visiting researcher at AT&T Labs Research (formerly AT&T Bell Labs), where he worked on stream analytics and data warehousing with Divesh Srivastava. His research has been funded by NSF.
Q1. How do you gain the knowledge and skills across the plethora of technologies and architectures available?
Read papers from conferences and journals.
Q2. What lessons did you learn from using shared-nothing scale-out architectures, which claim to allow for support of large volumes of data?
Beyond tens of nodes, they are hard to manage.
Q3. What are in your opinion the differences between capacity, scale, and performance?
Parallel DBMSs offer better performance but less scale-out; Hadoop-style big data systems scale out better.
Q4. There are several types of new data stores including NoSQL, XML-based, Document-based, Key-value, Column-family and Graph data stores. What is your experience with those?
Graph engines are good. Column stores are awesome for complex queries.
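A minimal sketch (plain Python, with a hypothetical tiny dataset) of why a columnar layout favors complex analytic queries: a query that touches one attribute scans only that attribute's contiguous array, instead of materializing every record as a row store must.

```python
# Hypothetical dataset: 1000 records with three attributes.
rows = [{"id": i, "amount": i * 1.5, "region": "US"} for i in range(1000)]

# Row store: reading one attribute still visits every full record.
total_row = sum(r["amount"] for r in rows)

# Column store: each attribute is a contiguous array; summing one
# column touches only that array (better cache and I/O behavior at scale).
columns = {
    "id": [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
    "region": [r["region"] for r in rows],
}
total_col = sum(columns["amount"])

assert total_row == total_col
```

The same idea is why column stores also compress well: values within one column are homogeneous.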
Q5. What are your experiences of using Hadoop in production?
Q6. ACID, BASE, CAP. Are these concepts still important?
Less important in an analytic system, but not irrelevant.
Q7. What is your take of the various extensions of SQL to support data analytics?
UDFs are great; SQL itself lags behind R or Matlab.
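To illustrate the kind of analytics extension UDFs enable, here is a sketch, in plain Python with a hypothetical interface, of an aggregate UDF that accumulates the sufficient statistics for simple linear regression in one table scan; the init/accumulate/merge/finalize contract mirrors what many DBMSs use for user-defined aggregates, but the class and method names here are illustrative, not any specific engine's API.

```python
class RegressionUDF:
    """Illustrative aggregate UDF: collects sufficient statistics
    (n, sum x, sum y, sum x^2, sum xy) for y = a + b*x."""

    def __init__(self):                  # init: zero the state
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def accumulate(self, x, y):          # called once per row
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def merge(self, other):              # combine partial states from a parallel scan
        self.n += other.n
        self.sx += other.sx
        self.sy += other.sy
        self.sxx += other.sxx
        self.sxy += other.sxy

    def finalize(self):                  # solve for intercept a and slope b
        b = (self.n * self.sxy - self.sx * self.sy) / (self.n * self.sxx - self.sx ** 2)
        a = (self.sy - b * self.sx) / self.n
        return a, b

udf = RegressionUDF()
for x in range(10):
    udf.accumulate(float(x), 2.0 * x + 1.0)  # points on the exact line y = 1 + 2x
a, b = udf.finalize()                        # recovers a = 1.0, b = 2.0
```

Because the state is small and mergeable, such an aggregate parallelizes across partitions of a table, which is what makes in-DBMS analytics practical.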
Q8. What are the typical mistakes done for a large scale data project? How can they be avoided in practice?
Q9. Learning a new technology and developing the knowledge is different than developing skills. Skills are perishable. How do you handle this?
Learn basic algorithms, C++ and SQL.
Q10. What are RDDs (Resilient Distributed Datasets) and transformation functions useful for?
Partitioning data sets with a new key.
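A conceptual sketch (plain Python, not Spark, with made-up sample data) of what a key-based repartition such as Spark's `partitionBy` does: route each (key, value) record to a partition chosen by hashing the new key, so all records sharing a key land in the same partition.

```python
def repartition_by_key(records, num_partitions):
    """records: iterable of (key, value) pairs.
    Returns a list of partitions, each a list of pairs."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # Hash partitioning: same key always maps to the same partition.
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

data = [("us", 1), ("mx", 2), ("us", 3), ("fr", 4)]
parts = repartition_by_key(data, 2)
```

After such a shuffle, a per-key aggregation can run locally within each partition with no further data movement, which is the main reason to repartition by a new key.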
Q11. Did you use Spark Streaming for processing live data streams? And if yes, what is your experience?
No; Storm is better.
Q12. Did you use Spark MLlib (Machine Learning Algorithms)? And if yes, what is your experience?
Comprehensive, but slow.
Q13. In your experience, how good is Spark supporting stream processing?
Q14. Spark SQL (formerly known as Shark) is a module in Spark for working with structured data and performing structured data processing. What is your experience with it?
Impala is better.
Q15. Is there any benefit of learning MapReduce?
Not any more.