On StarRocks. Q&A with Andy Ye 

Q1. What is StarRocks? 

StarRocks is the only open-source SQL query engine able to run the most demanding data warehouse workloads directly on the data lakehouse, without first ingesting data into a proprietary data warehouse for query acceleration. A Linux Foundation project, StarRocks has been adopted by some of the world’s largest data-driven companies including Airbnb, Pinterest, Demandbase, MiHoYo, Tencent, and Fanatics.

With StarRocks, enterprises can ditch costly data pipelines and data warehouses while getting extreme performance that meets their latency and concurrency requirements with at least 3x fewer computing resources.

Q2. What exactly is an open-source lakehouse SQL engine? How does it differ from a traditional SQL engine?

StarRocks is unique in its ability to offer users an SQL engine to process queries directly on the open data lake without the need for a proprietary data warehouse that a traditional SQL engine would require for adequate performance. This is thanks to StarRocks’ set of features and architectural choices optimized for data warehouse workloads, allowing it to deliver query performance that traditionally has only been possible with a proprietary data warehouse. This difference is why StarRocks is referred to as a lakehouse SQL engine and not just a traditional SQL engine.

Q3. How does it compare to Trino, the popular distributed query engine that runs analytical queries over big volumes of data with interactive latencies?

StarRocks is natively built for data warehouse workloads, with a C++, SIMD-optimized execution layer purpose-built for sub-second performance and an intelligent caching framework to accelerate data lakehouse queries.

Trino is a good product, but it is designed to connect to many different data sources. It is written in Java, a higher-level language, and its latency and concurrency are satisfactory for traditional data lake use cases but not enough for many of the demanding workloads that are being migrated from proprietary data warehouses to data lakehouses.

Q4. What about Apache Druid® and ClickHouse?

While StarRocks provides enterprises with a lakehouse SQL engine for high-performance data lake analytics, users of Apache Druid and ClickHouse, who may not be working with a data lake, still find StarRocks to be a superior choice as a solution to their data warehouse workloads. 

Because of their compute architecture, Apache Druid and ClickHouse struggle with multi-table JOIN queries. This forces their users to develop denormalization pipelines, often in a real-time setting, which are expensive in terms of both hardware and labor, and also very complex and error-prone.

StarRocks is designed to run multi-table JOIN queries at scale, so its users can keep their tables normalized, which not only simplifies the data pipeline, but also saves precious storage space.
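The trade-off between normalized tables and a denormalization pipeline can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for any SQL engine and is not StarRocks-specific; the table names (customers, orders, orders_flat) and helper functions are hypothetical and chosen only for illustration. It shows that a JOIN over normalized tables picks up a dimension change after a single-row update, while a pre-joined copy goes stale until every affected row is rewritten.

```python
import sqlite3


def build_demo():
    """Create normalized tables plus a pre-joined (denormalized) copy."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "EU"), (2, "US")]
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 5.0), (11, 1, 7.0), (12, 2, 3.0)],
    )
    # What a denormalization pipeline would maintain: region copied onto each order.
    conn.execute(
        "CREATE TABLE orders_flat AS "
        "SELECT o.id, o.customer_id, o.amount, c.region "
        "FROM orders o JOIN customers c ON c.id = o.customer_id"
    )
    return conn


def revenue_by_region_join(conn):
    """Aggregate at query time with a JOIN over the normalized tables."""
    return conn.execute(
        "SELECT c.region, SUM(o.amount) "
        "FROM orders o JOIN customers c ON c.id = o.customer_id "
        "GROUP BY c.region ORDER BY c.region"
    ).fetchall()


def revenue_by_region_flat(conn):
    """Aggregate over the denormalized copy, with no JOIN needed."""
    return conn.execute(
        "SELECT region, SUM(amount) FROM orders_flat "
        "GROUP BY region ORDER BY region"
    ).fetchall()


def move_customer(conn, customer_id, region):
    """Change a customer's region in the normalized schema; returns rows touched."""
    cur = conn.execute(
        "UPDATE customers SET region = ? WHERE id = ?", (region, customer_id)
    )
    return cur.rowcount


if __name__ == "__main__":
    conn = build_demo()
    print(revenue_by_region_join(conn))
    print(move_customer(conn, 1, "US"))
    print(revenue_by_region_join(conn))   # reflects the change immediately
    print(revenue_by_region_flat(conn))   # still shows the old region
```

In this sketch the normalized schema needs one row updated, whereas the flat table would need one rewrite per order belonging to that customer, which is the kind of pipeline work the answer above describes.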

StarRocks’ ability to perform real-time data upserts and its extremely high performance, often at least 3x greater than that of Apache Druid and ClickHouse, make it the more favorable choice for high-volume real-time and customer-facing analytics scenarios where Apache Druid and ClickHouse struggle with costs, performance, and complexity.

Q5. What is the relationship between CelerData and StarRocks?

CelerData is the company and original team behind StarRocks. CelerData was the maintainer and primary developer of the open-source StarRocks project until last year, when it donated the project to the Linux Foundation. Since then, CelerData has continued to work closely with the global StarRocks developer community to improve the project with new features and quality-of-life improvements.

Many of the world’s largest enterprises are using StarRocks in production today. Having worked closely with these users, CelerData is in a unique position to develop commercial offerings built on StarRocks that deliver the performance StarRocks is famous for, alongside business-critical features, such as security, that are not available through the open-source project.

Q6. How does Tencent Games leverage StarRocks?

Tencent Games, the gaming division of Tencent, overcame their challenge of siloed and scattered data across their portfolio of studios by adopting StarRocks and Apache Iceberg as their unified data lakehouse platform. Previously they were hindered by a complex architecture that required extensive pre-processing and suffered from high storage costs. With their new lakehouse architecture, Tencent Games simplified their data pipeline by eliminating the need for pre-aggregations. Storage costs were also reduced by 15x since this solution let them keep a single source of truth for their data in cloud object storage. By integrating all their data into a single system, Tencent Games achieved 50% better efficiency when developing new data pipelines and benefited from improved data freshness, which has greatly aided decision-making through deeper insights into user behavior. Ultimately, this transition not only saved costs, but also enhanced the stability and flexibility of their data infrastructure.

Qx. Anything else you wish to add?

I’d encourage those interested in StarRocks and CelerData to check out the project here. For CelerData’s commercial solutions, click here.

If StarRocks sounds like what you’ve been looking for, please join our community.


Andy Ye is the Co-Founder and Chief Operating Officer at CelerData, and the Co-Founder of StarRocks.
