On Amundsen. Q&A with Li Gao, tech lead at Lyft
Q1. What is your role at Lyft?
My role at Lyft is to provide technical leadership across multiple initiatives and teams, with a focus on the data domain. (I am not a people manager.)
Q2. What are the main technical challenges in building a unified analytics data infrastructure at multi-petabyte scale?
The main challenges are the diversity of the tools our teams use and the diversity of customer expectations and skill sets. Another major challenge is the fast-changing nature of the market, combined with the constraints on the engineering resources available to keep up with it.
Q3. How do you investigate the time it takes for a rider to make a second trip after their first trip on your platform?
Several analytical tools are involved in this process, and our customers can choose among them. They range from near-real-time tools that work on data sets with a limited time scope to tools that work on longer-range data sets.
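As a rough illustration of the metric in the question, the time to a second trip can be computed from a plain list of trip records. This is only a sketch under stated assumptions: the `(rider_id, pickup_time)` record shape, the function name, and the sample data are hypothetical and do not reflect Lyft's actual schema or tooling.

```python
from collections import defaultdict
from datetime import datetime


def time_to_second_trip(trips):
    """Map each rider with at least two trips to the gap between trips one and two.

    `trips` is an iterable of (rider_id, pickup_time) pairs (a hypothetical shape).
    """
    by_rider = defaultdict(list)
    for rider_id, pickup_time in trips:
        by_rider[rider_id].append(pickup_time)

    gaps = {}
    for rider_id, times in by_rider.items():
        if len(times) >= 2:
            first, second = sorted(times)[:2]  # two earliest pickups
            gaps[rider_id] = second - first
    return gaps


# Made-up sample data, for illustration only.
trips = [
    ("rider_a", datetime(2019, 1, 1, 9, 0)),
    ("rider_b", datetime(2019, 1, 2, 18, 30)),
    ("rider_a", datetime(2019, 1, 4, 9, 0)),
    ("rider_a", datetime(2019, 1, 9, 9, 0)),
]
gaps = time_to_second_trip(trips)
print(gaps)
```

Riders with a single trip are left out of the result rather than given a placeholder, so a downstream average is not skewed by riders who have not yet returned.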
Q4. What is Peloton and how does it help power your compute clusters?
Peloton is a multi-model resource pool scheduler that originated at Uber. We investigated the scheduler for some of our initiatives. Our data compute needs are complex and diverse across tech stacks, ranging from gang-based scheduling requirements to batch-focused ETL to interactive queries and commands, so a comprehensive resource scheduler is key to efficient multi-tenancy support and automation. We use a set of schedulers across our native and Kubernetes clusters to meet these diverse data compute needs efficiently.
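For readers unfamiliar with the term, gang-based scheduling means that a group of tasks is placed all together or not at all. The sketch below illustrates only that all-or-nothing idea; it is not Peloton's algorithm or API, and the function name and the node-to-free-slots dictionary are assumptions made for this example.

```python
def schedule_gang(free_slots, gang_size):
    """Place `gang_size` tasks at once, or place none of them.

    `free_slots` maps a node name to its free slot count (hypothetical model).
    Returns {node: tasks_placed}, or None if the whole gang cannot fit.
    `free_slots` is updated only when the placement succeeds.
    """
    if sum(free_slots.values()) < gang_size:
        return None  # all-or-nothing: do not start a partial gang

    placement = {}
    remaining = gang_size
    for node, free in free_slots.items():
        take = min(free, remaining)
        if take:
            placement[node] = take
            remaining -= take
        if remaining == 0:
            break

    # Commit the reservation only after the full gang has been placed.
    for node, take in placement.items():
        free_slots[node] -= take
    return placement


slots = {"node1": 2, "node2": 1}
print(schedule_gang(slots, 4))  # too large for the cluster, nothing is reserved
print(schedule_gang(slots, 3))  # fits exactly
```

A batch ETL job, by contrast, can usually start with whatever slots are free, which is one reason a single cluster may need more than one scheduling model.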
Q5. Amundsen is supposed to help improve the productivity of data analysts, data scientists and engineers when interacting with data. How? Can you give us some examples?
Amundsen is the key component that brings data folks together to “publish” datasets and make them searchable and discoverable. To enable this discovery process, Amundsen has multiple plugins that extract metadata from the compute grid (Hadoop, Druid, K8s clusters), the ETL orchestrator (Apache Airflow), and notebooks (e.g. Jupyter notebooks), and it aggregates this metadata to build a graph of relationships between users, datasets, and columns. Amundsen also provides a rich web UI so that our data analysts, engineers, and machine learning scientists can search and explore datasets that they can leverage for their reports or models.
You can refer to the excellent writeup by Mark on this blog. As one concrete example, say you have a mobile app that generates telemetry data and sends it to a remote gateway for processing. After all the processing, the data lands in an analytical data lake somewhere via some ETL process. How would a data scientist or data analyst find, or be made aware of, this new telemetry dataset in the lake?
Amundsen is the tool that the data scientist or analyst uses to explore, subscribe to, and find the datasets that interest them, so they can start building cool features or models on top.
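To make the idea of a graph of relationships between users, datasets, and columns more concrete, here is a minimal in-memory sketch. It is not Amundsen's data model or API: the `MetadataGraph` class, the `kind:name` node ids, the relation names, and the sample telemetry dataset are all hypothetical.

```python
from collections import defaultdict


class MetadataGraph:
    """Tiny metadata graph; node ids look like 'kind:name' (an assumed convention)."""

    def __init__(self):
        self.edges = defaultdict(set)  # src -> {(relation, dst), ...}

    def add_edge(self, src, relation, dst):
        self.edges[src].add((relation, dst))

    def neighbors(self, src, relation):
        """Nodes reachable from `src` over edges labelled `relation`."""
        return sorted(dst for rel, dst in self.edges.get(src, ()) if rel == relation)

    def find_datasets(self, term):
        """Datasets whose name, or one of whose column names, contains `term`.

        Only datasets with at least one outgoing edge are visible in this sketch.
        """
        hits = []
        for src, out in self.edges.items():
            kind, _, name = src.partition(":")
            if kind != "dataset":
                continue
            columns = [dst.partition(":")[2] for rel, dst in out if rel == "has_column"]
            if term in name or any(term in col for col in columns):
                hits.append(src)
        return sorted(hits)


# Hypothetical metadata, as if extracted from an ETL job and query logs.
graph = MetadataGraph()
graph.add_edge("dataset:lake.app_telemetry", "has_column", "column:device_id")
graph.add_edge("dataset:lake.app_telemetry", "has_column", "column:event_ts")
graph.add_edge("user:alice", "owns", "dataset:lake.app_telemetry")
graph.add_edge("user:bob", "reads", "dataset:lake.app_telemetry")

print(graph.find_datasets("telemetry"))
print(graph.neighbors("user:alice", "owns"))
```

In this toy version, a data scientist searching for "telemetry" or for a column such as "device_id" would land on the same dataset node and could then follow the edges to its owner and frequent readers.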
Q6. Who is using Amundsen at Lyft and for what?
Most of the company uses Amundsen to discover, search for, and locate the people, datasets, and columns they are interested in before interacting with a dataset. This is especially true for new datasets that other orgs or teams have heard of but have no in-depth knowledge of. Amundsen provides valuable information about these new datasets from multiple perspectives, such as data owner and frequent user information, dataset schema, freshness, cardinality estimation, and upstream or downstream data lineage dependencies. Amundsen even provides instant previews of these datasets, redacted for PII reasons, so users can get a sense of the data before pulling it into their compute. Amundsen also enables wider collaboration across many teams, both data producers and data consumers, which was nearly impossible before due to a lack of tooling.
Q7. What is your vision of what lies ahead?
Beyond data discovery and unified analytics, I see a bright future for more coherent and automated data lifecycles, as we have seen in the past with automated software lifecycles. We need to treat data as a product, just like software, so that its full lifecycle is managed and automated as much as possible. That lifecycle should have integral parts for data integrity (data trust and data quality), data privacy, and data agility (how data can move from one shape or form to another to meet new business demands while maintaining cost efficiency).
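A redacted preview of the kind mentioned above can be sketched as masking a known set of sensitive columns before showing the first few rows. This is an illustration only, not how Amundsen implements previews; the function name, the `<redacted>` marker, and the column names are assumptions.

```python
def redacted_preview(rows, pii_columns, limit=3):
    """Return up to `limit` rows with values in `pii_columns` masked.

    `rows` is a list of dicts; the input rows are not modified.
    """
    return [
        {col: ("<redacted>" if col in pii_columns else value) for col, value in row.items()}
        for row in rows[:limit]
    ]


# Made-up rows, for illustration only.
rows = [
    {"ride_id": 1, "email": "a@example.com", "distance_km": 4.2},
    {"ride_id": 2, "email": "b@example.com", "distance_km": 1.7},
]
preview = redacted_preview(rows, pii_columns={"email"}, limit=1)
print(preview)
```

Masking by column name keeps the preview's shape intact, so users still see the schema and representative non-sensitive values.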
Qx. Anything else you wish to add?
Li Gao is a tech lead in the data platform org at Lyft, currently leading multiple data initiatives, including analytical data compute, fast data, Spark on K8s, and data integrity and compliance.