On Innovation – Interview with Nathan Marz.
“I think that it’s incredibly important for all programmers to have a public presence by being involved in open source or having side projects that are publicly available. The industry is quickly changing and more and more people are realizing how ineffective the standard techniques of programmer evaluation are. This includes things like resumes and coding questions.” – Nathan Marz.
Nathan Marz: You only live once, so it’s important to make the most of the time you have. I find Bezos’s “regret minimization framework” a great way to make decisions with a long-term perspective. Too often people make decisions only thinking about marginal, short-term gains, and this can lead you down a path you never intended to go. And failure, if it happens, is not as bad as it seems. If worst comes to worst, I’ll have learned an enormous amount, had a unique and interesting experience, and will just try something else.
Q2. Do you want to disclose in general terms what you’ll be working on?
Nathan Marz: Sorry, not at the moment. [Edit: as of now, he has not disclosed it.]
Q3. You open-sourced Cascalog, ElephantDB, and Storm. Which of the three is, in your opinion, the most rewarding?
Nathan Marz: Storm has been very rewarding because of the sheer number of people using it and the diversity of industries it has penetrated, from healthcare to analytics to social networking to financial services and more.
Q4. What are, in your opinion, the current main challenges for Big Data analytics?
Nathan Marz: I think the biggest challenge is an educational one. There’s an overwhelming number of tools in the Big Data ecosystem, all very different from the relational databases people are used to, and none is a one-size-fits-all solution. This is why I’m writing my book “Big Data” – to show people a structured, principled approach to architecting data systems and how to use those principles to choose the right tool for your particular use case.
Q5. In January 2013, version 0.8.2 of Storm was released. What is new?
Nathan Marz: There was a lot of work done in 0.8.2 on making it easier to use a shared cluster for both production and in-development applications. This included improved monitoring support, which helps with detecting when you’ll need to scale with more resources, and a brand new scheduler that isolates production and development topologies from each other. And of course, the usual bug fixes and small improvements.
Q6. How do you expect Storm to evolve?
Nathan Marz: There’s a lot of work happening right now on making Storm enterprise-ready. This includes security features such as authentication and authorization, enhanced monitoring capabilities, and high availability for the Storm master. Long term, we want to continue with the theme of having Storm seamlessly integrate with your other realtime backend systems, such as databases, queues, and other services.
Q7. Daniel Abadi of Hadapt said in a recent interview: “the prevalent architecture that people use to analyze structured and unstructured data is a two-system configuration, where Hadoop is used for processing the unstructured data and a relational database system is used for the structured data. However, this is a highly undesirable architecture, since now you have two systems to maintain, two systems where data may be stored, and if you want to do analysis involving data in both systems, you end up having to send data over the network which can be a major bottleneck.”
What is your opinion on this?
Nathan Marz: I think that “structured” vs. “unstructured” is a false dichotomy. It’s easy to store both unstructured and structured data in a distributed filesystem: just use a tool like Thrift to make your structured schema and serialize those records into files. A common objection to this is: “What if I need to delete or modify one of those records? You can’t cheaply do that when the data is stored in files on a DFS.” The answer is to move beyond the era of CRUD and embrace immutable data models where you only ever create or read data. In the architecture I’ve developed, which I call the “Lambda Architecture”, you then build views off of that data using tools like Hadoop and Storm, and it’s the views that are indexed and go on to feed the low latency requests in your system.
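[Editor's illustration: the following is a minimal sketch of the idea described above, not code from the interview. It assumes a toy in-memory store standing in for files on a DFS, and a plain function standing in for a Hadoop batch job; the names PageView, MasterDataset, and build_batch_view are hypothetical. It shows only the batch side: records are created and read, never updated or deleted, and the queryable view is recomputed from the full record set.]

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """An immutable fact: a user viewed a URL at a given time (hypothetical schema)."""
    user_id: str
    url: str
    timestamp: int


class MasterDataset:
    """Append-only store: supports create and read only, no update or delete."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # New facts are only ever added to the end of the dataset.
        self._records.append(record)

    def read_all(self):
        # Return a tuple so callers cannot modify the stored records in place.
        return tuple(self._records)


def build_batch_view(records):
    """Recompute the view (page views per URL) from scratch over all records."""
    return Counter(record.url for record in records)


master = MasterDataset()
master.append(PageView("alice", "/home", 1))
master.append(PageView("bob", "/home", 2))
master.append(PageView("alice", "/about", 3))

# Low-latency queries read the precomputed view, not the raw records.
view = build_batch_view(master.read_all())
print(view["/home"])
```

[In this sketch, a mistake in the view logic is handled by fixing the function and recomputing over the unchanged records, rather than by editing stored data.]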
Q8. What are the main lessons learned in the last three years of your professional career?
Nathan Marz: I think that it’s incredibly important for all programmers to have a public presence by being involved in open source or having side projects that are publicly available. The industry is quickly changing and more and more people are realizing how ineffective the standard techniques of programmer evaluation are. This includes things like resumes and coding questions. These techniques frequently label strong people as weak or weak people as strong. Having good work out in the open makes it much easier to evaluate you as a strong programmer. This gives you many more job options and will likely drive up your salary as well because of the increased competition for your services.
For this reason, programmers should strongly prefer to work at companies that are very permissive about contributing to open source or releasing internal projects as open source. Ironically, a company having this policy is assisting in driving up the value (and price) of the employee, but as time goes on I think this policy will be necessary to even have access to the strongest programmers in the first place.
Nathan Marz was the Lead Engineer at BackType before BackType was acquired by Twitter in July of 2011. At Twitter he started the streaming compute team, which provides infrastructure that supports many critical applications throughout the company. He left Twitter in March of 2013 to start his own company (currently in stealth).
– Big Data: Principles and best practices of scalable realtime data systems.
Nathan Marz (Twitter) and James Warren
MEAP Began: January 2012
Softbound print: Fall 2013 | 425 pages
Download Chapter 1: A new paradigm for Big Data (.PDF)