On the Presto Open Source Project. Q&A with Tim Meehan
Q1. What is special about the Presto open source distributed query engine?
I’ve been working on this project for over 6 years, and I’m still very engaged and excited by it. Ultimately, I think it boils down to the community. Presto is an open community, but it’s also used in some of the largest data lakes in the world: think Meta, Uber, and the like. The solutions to data lake problems are still very much being designed and implemented, so there are still many new and interesting features being developed. At the same time, because Presto is used by some of the largest data lakes in the world alongside smaller lakes, there’s a very interesting cross-pollination of needs and ideas that I think converges into a very exciting project and great features.
Q2. What kind of use cases are well suited to be supported by Presto? Why?
Presto was designed for ad hoc analytics, so it works extremely well for that use case. For these types of queries, you can have a mix of workloads, with some queries completing quickly and others scanning vast amounts of data and using huge amounts of computation. Presto has very good support for multitenancy, which means it can perform very well with these types of diversified workloads.
That said, Presto has evolved over the years to take on more workloads, and this trend is accelerating. Presto has robust and mature support for batch, with multiple execution runtime options to suit various needs. For example, Presto can be used in the Spark execution runtime if fault tolerance and flexible resource management are important, and it can also be configured to efficiently and reliably execute batch queries without Spark if cost savings and superior performance are more important.
Additionally, people have had good success using Presto to create ad hoc dashboards. And with our move to a native vectorized C++ engine and deeper integration with lakehouse table formats such as Iceberg, I expect Presto to continue to excel at dashboarding and take on more OLAP workloads in the future.
Q3. Which use cases are not well suited for Presto? Why?
Presto has, for its entire history, been a disaggregated query engine. That means it has never had built-in storage (with one small exception), and it has always been designed for environments where storage is expected to be separate from compute. This brings many advantages: cost savings, improved scalability, and more flexible deployment and integration options. However, it also brings one critical challenge: your reliability depends on your storage and your metastore. For mission-critical applications that can’t tolerate any downtime, Presto might not be the best choice, because once you maintain several dependencies, any of which can cause an outage, it becomes very hard to ensure zero downtime at scale.
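To make the dependency point concrete, here is a minimal sketch in Python. It assumes that a query succeeds only when the engine, the storage layer and the metastore are all up, and that their failures are independent; under those assumptions the combined availability is the product of the individual ones. The component names and the 99.9% figures are hypothetical, chosen only for illustration, and are not measurements of any real Presto deployment.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def composite_availability(availabilities):
    """Availability of a path that needs every dependency up at once.

    Assumes failures are independent, so the probabilities multiply.
    """
    return prod(availabilities)


# Hypothetical availability figures, for illustration only.
deps = {
    "query engine": 0.999,
    "object storage": 0.999,
    "metastore": 0.999,
}

total = composite_availability(deps.values())
downtime_minutes = (1 - total) * MINUTES_PER_YEAR

print(f"combined availability: {total:.6f}")
print(f"expected downtime per year: {downtime_minutes:.0f} minutes")
```

With these illustrative numbers the combined availability comes out lower than that of any single component, which is why each added dependency makes a zero-downtime target harder to hit.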
Q4. Presto was originally launched at Meta in 2013 and donated to the Linux Foundation in 2019. What has changed in Presto in the meantime?
Soon after we started the Presto Foundation, we moved towards making Presto the premier interface into a data lake. This meant expanding Presto in several directions, such as taking on more OLAP workloads, and integrating with Spark for consistent Presto SQL semantics over Spark’s mature runtime environment.
By far the greatest and most wide-ranging change we embarked on was splitting apart Presto’s execution engine and rewriting it in C++. This has been a monumental effort, but with many users in production with it, we’ve definitely come a long way.
Q5. Who are the current Presto maintainers?
Presto has an open governance model. Maintainers consist of module maintainers, who maintain specific areas of the project, and project maintainers, who can approve pull requests across the whole project. We have several dozen project and module maintainers from many different companies around the world.
Q6. You mentioned in a recent paper that “A top priority has been ensuring query reliability does not regress with the shift towards smaller, more elastic container allocation, which requires queries to run with substantially smaller memory headroom and can be preempted at any time”. Why is this so important?
This was mentioned in “Presto: A Decade of SQL Analytics at Meta”, and the challenge there was interesting. It’s a bit like the joke: which would you rather fight, a horse-sized duck or 100 duck-sized horses? Many large companies have experimented with which mix of hardware is most efficient, and at Meta it was found at the time that many small servers yielded greater cost savings than fewer large servers. So there was an initiative to get Presto to work in very large-scale deployments with very small servers. Now, keep in mind that Presto does its analytics in memory, and you can imagine how challenging this would be for an engine like Presto. While this prompted lots of very hard work, such as implementing robust spilling and increasing efficiency, it also partly drove some of our best improvements: Presto on Spark and our C++ rewrite. Flexibility in the form factor of your Presto deployment is important because supporting the hardest environment to deploy Presto in has driven very useful features and improved reliability for everyone else.
Q7. How did new demands from machine learning, privacy, and graph analytics drive Presto maintainers to think beyond traditional data analytics?
What we found with machine learning and graph analytics is that they involve a lot of the same underlying infrastructure needed for analytics. At the end of the day, you need an efficient execution engine to perform the computations and transformations needed for those use cases. That’s why we’ve invested so much energy in the C++ execution engine: it will power not only Presto but also other computation-heavy use cases with fast-evolving requirements, without reinventing the wheel to support them.
Q8. Why did Uber choose Presto? What benefits do they have in using Presto?
Uber is a hyperscaler, and I would expect the reasons it chooses Presto are most likely the reasons why most people would want to move onto a data lake. Those happen to be: cost savings, because you can disaggregate your storage from your compute, support a truly massive amount of data, and use only as much compute as you need; reliability, because you need your query engine to be reliable so it can power the dashboards, business intelligence tools and ad hoc queries that drive your business decisions; and flexibility of deployments and integrations, so you can easily join together data from various parts of your data lake. I really expect these to be the reasons why data lakehouses become the dominant place where people run analytical queries in the future.
Q9. What is the Presto native C++ project? What is it useful for?
We’ve been working for over three years on rewriting the Presto code that powers its queries in C++. There are many motivations for doing this, but two of the most powerful are shared foundations and cost savings. Shared foundations means that, by moving the C++ execution engine into a separate library called Velox, it can now be shared across many different query engines. As a result, maintenance is more centralized and performance improvements benefit all query engines, not just Presto. Cost savings come from the superior performance of Velox. We are finding that you can power the same workload with as little as ¼ of the hardware previously required to run a Presto cluster. Not only does this save money, but it also opens the door to using Presto in a lot of areas where it hasn’t traditionally been a strong contender. The possibilities of this project are so immense that it makes me very excited to be a part of it!
Q10. What is the road ahead for Presto?
The major areas revolve around improved table format integration and C++. We’ve made a lot of progress with our integration with Iceberg, and I would now consider it quite mature, including support for the C++ evaluation engine. We’ll be working on bringing that same maturity to the Delta and Hudi formats. Additionally, we’ll focus on making C++ extremely reliable and improving the user experience for C++, including support for C++ UDFs, support for arbitrary connectors in any language, remote UDFs, and usability improvements such as better error messages.
Qx. Anything else you wish to add?
I’ll end by saying that Presto is an open project, governed by The Linux Foundation. We believe in a neutral and open governance model, meaning no one person or company can control or dictate the roadmap and direction of the project.
We welcome anyone to join and contribute to the project!
…………………………………………………………
Tim Meehan is a Software Engineer at IBM. He is also the Chairperson of the Technical Steering Committee of the Presto Foundation, which hosts Presto under the Linux Foundation. As the chair and a Presto committer, he works with other foundation members to drive the technical direction and roadmap of Presto. His interests are in Presto reliability and scalability. Prior to IBM, Tim was a software engineer at Meta, where he also worked on Presto, focusing on resource management and reliability. He’s spent a lot of his career wrangling data, and chooses to work on Presto because of its versatility, extensibility and performance.