On Data Infrastructure at LinkedIn. Q&A with Kartik Paramasivam

This is all real.  Trillions of events are processed every day by stream processing applications at LinkedIn.

Q1. You are VP – Data Infrastructure at LinkedIn. What are your responsibilities at LinkedIn? And what are your current projects? 

LinkedIn is a data company, and the data infrastructure services that my teams develop and operate power all the applications and platforms in the company. This includes our databases (OLTP/source-of-truth stores, OLAP stores for real-time analytic queries, Search infrastructure optimized for inverted-index and embeddings-based retrieval, and a Graph database); the data processing stack (stream and batch processing); the data lake; data pipelines; and foundational primitives like scalable blob storage, pub-sub, distributed coordination, shard management, metadata storage, etc. At LinkedIn, given our scale and unique requirements, we build a lot of homegrown infrastructure, and wherever possible we open source it. We also leverage externally built open source software and improve it to meet our scalability, operability, and security needs.

Q2. You were responsible for LinkedIn’s migration to the Azure Cloud. This was among the largest cloud migrations in the world. Could you tell us what are the main lessons you have learned for such a migration project? 

There are several considerations and challenges for such a large-scale migration to any public cloud. Here are some of the top lessons I can think of:

The first consideration is to decide which parts of your existing tech stack you should simply lift and shift to the cloud without any modifications, versus which parts you should 'adapt or transform' to operate well and get the benefits of the cloud platform. The answer is almost always: it depends. In the end, you need to reflect on why you are moving to the cloud and how quickly you need to go through the journey. Adapting and transforming your stack to leverage more cloud constructs takes longer, but it allows your company to derive more value from the cloud migration.

The second thing to be cognizant of is that the IaaS, PaaS and SaaS services in the cloud come at different levels of maturity. The more mature your dependency, the less likely you are to find gaps in the offering, so leveraging the more mature offerings is a safer bet. On the flip side, most of the nascent cloud offerings evolve at breakneck speed, so a decision you make today based on a cloud offering's current capabilities may well not be optimal in two years. For example, when we started looking into the Azure migration in 2018, the Kubernetes and container offerings were not as mature as the virtual machine offerings, so we started down the route of using virtual machines. But over the course of our migration the container-based offerings rapidly evolved, and we eventually changed plans and decided to run all our stateless workloads on Azure Kubernetes Service.

Another thing to observe is that in the cloud it is important to be able to stand up all your services and resources using automation, in a repeatable fashion. The tech stack for a company like LinkedIn comprises thousands of microservices and systems. Oftentimes, although companies have good automation for deploying individual services, they are not as good at bringing up thousands of services together and serving traffic. Treating infrastructure as code and embracing automation stacks like Terraform, Airflow, and Azure Resource Manager is a key muscle to build for a cloud migration.

Lastly, to state the obvious, you have to build a governance (security and cost control) strategy from the get-go.

Although we paused the full migration of the LinkedIn stack to Azure in the first half of 2022 to meet LinkedIn's and Microsoft's business priorities and constraints, we continue to invest in and leverage Azure for several critical services that power LinkedIn. These range from Azure Front Door and Kusto to Azure's AI, security, and media offerings, and a whole lot more.

Q3. Looking at the present data infrastructure at LinkedIn. What are the main components? And what are they useful for? 

Data Infrastructure at LinkedIn can be described in terms of the following areas.

  1. Databases that are used for serving traffic on the LinkedIn site:
    • Our OLTP source-of-truth document database – Espresso. Most of LinkedIn's primary data is stored in this homegrown distributed document store. Espresso strives to support strong read after write consistency over a distributed store that is partitioned on a primary key. In addition to Espresso we use traditional relational databases like MySQL, and we are now also looking into a distributed SQL offering.
    • Venice (recently open sourced) is used as a serving store for data derived from offline and nearline computations, e.g. features. Venice has a rich client-side caching layer and other optimizations to support low latency access. All writes to Venice go over Kafka, which allows for very high write rates without affecting query latency. Naturally, the tradeoff is that Venice doesn't support read after write consistency.
    • Low latency analytic queries are powered by Apache Pinot – our homegrown columnar store, which serves a wide range of end user/member facing and internal facing analytics queries.
    • Our Search infrastructure is a very large scale, fully managed offering that powers a wide range of search and recommendation use cases at LinkedIn.
    • Our Graph database – Liquid – is another homegrown offering built from the ground up to support queries over the LinkedIn social network.
  2. Streaming data and data lake: Our homegrown system based on Apache Kafka is the backbone of our pub-sub offering. It is used both for capturing changes from our databases and as a pipeline to move data between our online, nearline and offline applications. Most of this data eventually lands in our exabyte-scale data lake, which is based on HDFS. We are now working on a next generation blob store that will power our data lake in the future.
  3. Data processing stack: We have a managed platform for stream processing applications that can be written in Java or Python. For the API, we have standardized on Apache Beam and SQL, which execute on the Apache Flink and Apache Samza (homegrown) stream processing engines (a minimal Beam sketch follows this list). Our batch processing stack is predominantly based on Apache Spark, and we are now pushing into Flink batch as well. All of this runs on a very large scale YARN based compute platform, and we are now pushing towards K8s.
  4. Base primitives: In the end, these and other infrastructure systems at LinkedIn are built on top of basic primitives for metadata storage, shard management (with Apache Helix), orchestration, quota management, etc. We also have a common portal and provider model to create and manage resources across all our offerings.
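To make the "write once in Beam, run on the engine of your choice" point concrete, here is a minimal sketch using the Beam Python SDK. The field names, windowing choice, and sample data are illustrative assumptions, not LinkedIn's actual pipelines; in practice the runner (e.g. Flink or Samza) is selected at deployment time via pipeline options, while the transform itself stays the same.

```python
# Minimal Beam sketch: the same transform can run on different engines by
# changing the runner in PipelineOptions (e.g. --runner=FlinkRunner).
# Field names and sample data are illustrative, not LinkedIn's.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def count_events_per_member(events):
    """Count events per member key in one-minute windows."""
    return (
        events
        | "KeyByMember" >> beam.Map(lambda e: (e["memberId"], 1))
        | "Window" >> beam.WindowInto(FixedWindows(60))
        | "CountPerKey" >> beam.CombinePerKey(sum)
    )


if __name__ == "__main__":
    with beam.Pipeline(options=PipelineOptions()) as p:
        sample = p | "Sample" >> beam.Create(
            [{"memberId": "m1"}, {"memberId": "m2"}, {"memberId": "m1"}]
        )
        count_events_per_member(sample) | "Print" >> beam.Map(print)
```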

Q4. At LinkedIn, you capture trillions of events (PBs of data) per day into Apache Kafka. Updates happening in your databases are captured using Brooklin and made available via Kafka. You process this deluge of events in real time using Apache Samza and Flink. Please briefly explain how this architecture works in practice and whether this is still the case.

This is all real. Trillions of events are processed every day by stream processing applications at LinkedIn. These applications are either built in Java/Python using the Apache Beam API or in SQL (based on the Apache Flink engine). Most of the large scale stateful stream processing happens on the Samza engine, but we are also starting to leverage Flink. The managed platform that we provide for our applications handles things like auto-scaling, framework upgrades, etc., so that application developers can focus on their business logic.

LinkedIn members want an interactive experience when they are on the LinkedIn app, and they want LinkedIn to show recommendations by understanding what the member cares about. This means that we want to update the features that feed into our recommendation models in near real time.
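As a rough illustration of that nearline feature-update pattern, the sketch below consumes raw activity events from one Kafka topic, updates a simple per-member counter feature, and publishes it to a feature topic. The topic names, event schema, in-memory state, and use of the kafka-python client are assumptions for illustration; the real pipelines run as stateful jobs on our managed Beam/Samza/Flink platform.

```python
# Toy nearline feature update: consume activity events, maintain a per-member
# counter, and publish the updated feature value to a downstream topic.
# Topics, schema and the kafka-python client are illustrative assumptions.
import json
from collections import defaultdict

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "member-activity",                      # hypothetical input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

activity_counts = defaultdict(int)          # in-memory stand-in; real jobs keep state in RocksDB

for event in consumer:
    member_id = event.value["memberId"]
    activity_counts[member_id] += 1
    producer.send(
        "member-features",                  # hypothetical feature topic read by the serving store
        {"memberId": member_id, "activityCount": activity_counts[member_id]},
    )
```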

Q5. What are the hard problems in ingesting and processing data reliably and efficiently at internet scale? 

To ingest trillions of events every day (and millions of events per second), we have built a pub-sub platform based on Apache Kafka (LinkedIn's greatest open source contribution). We have been working on a new model where Kafka clients are able to seamlessly send data over multiple Kafka clusters, which allows us to scale beyond the limits of a single Kafka cluster. Similarly, when this data lands in HDFS for batch processing we hit the scale limits of a single HDFS cluster; here again we have a system that federates the data seamlessly across multiple HDFS clusters.
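A toy sketch of the multi-cluster idea: the client picks one of several Kafka clusters by hashing each record's key, so producer traffic is not bound by the limits of a single cluster. The cluster endpoints, routing scheme, and kafka-python client below are purely illustrative; LinkedIn's actual federation logic is considerably more sophisticated.

```python
# Illustrative only: spread producer traffic across several Kafka clusters by
# hashing the record key. Cluster endpoints and routing are assumptions.
import json
import zlib

from kafka import KafkaProducer

CLUSTERS = ["kafka-a:9092", "kafka-b:9092", "kafka-c:9092"]   # hypothetical cluster endpoints
producers = [
    KafkaProducer(bootstrap_servers=c,
                  value_serializer=lambda v: json.dumps(v).encode("utf-8"))
    for c in CLUSTERS
]

def send(topic: str, key: str, value: dict) -> None:
    """Pick a cluster with a stable hash of the key, then produce to it."""
    cluster_index = zlib.crc32(key.encode("utf-8")) % len(producers)
    producers[cluster_index].send(topic, key=key.encode("utf-8"), value=value)

send("page-views", key="member-123", value={"page": "/feed"})
```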

Similarly, to scale the compute layer (i.e., our Spark jobs), we have a scheduling system that seamlessly schedules batch jobs over multiple physical YARN clusters, such that from the standpoint of the developer submitting a Spark job, it appears to be a single YARN cluster.

Although the above investments in federating across multiple clusters help us meet existing scale needs, for the long run we have been looking at fundamentally addressing the scaling problems in HDFS with a new blob store.

One other hard problem worth mentioning is scaling stateful stream processing. Here we use local RocksDB-based state, which is then backed up to a blob store for fault tolerance.
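The sketch below is only a conceptual illustration of that pattern: keyed state lives in a fast local store and is periodically snapshotted to durable storage so a restarted task can recover. A plain dict and the local filesystem stand in for RocksDB and the blob store; this is not how Samza or Flink actually implement checkpointing.

```python
# Conceptual sketch: local keyed state plus periodic durable checkpoints.
# A dict and the local filesystem stand in for RocksDB and the blob store.
import json
import pathlib

CHECKPOINT = pathlib.Path("/tmp/state-checkpoint.json")   # stand-in for a blob store object

def restore() -> dict:
    """Recover keyed state from the last checkpoint, if one exists."""
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}

def checkpoint(state: dict) -> None:
    """Persist the current keyed state so a restarted task can resume."""
    CHECKPOINT.write_text(json.dumps(state))

state = restore()
for i, event in enumerate([{"key": "m1"}, {"key": "m2"}, {"key": "m1"}]):
    state[event["key"]] = state.get(event["key"], 0) + 1
    if i % 2 == 1:                      # real engines checkpoint on a timer or barrier, not per event
        checkpoint(state)
```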

Another key aspect is that we need a very highly scalable database to store the results of all the processing done both in near real time (Samza/Flink) and offline (Spark). We have built a database called Venice (recently open sourced) that accepts all of its writes via Kafka, which allows us to store all of the data derived from the computation without affecting the read traffic that is serving online queries.

One key problem is that in most scenarios we need to apply processing logic on historical data (over many days, or a full data snapshot) stored in our data lake, and also apply the same logic to all new updates to the data or new events in a continuous streaming fashion. We have been pushing towards Apache Beam as the common API across our batch and streaming engines. We have also been building a new abstraction that allows this converged processing logic to read datasets in a consistent way whether they sit behind HDFS or Kafka.
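A rough sketch of what such a converged-read abstraction could look like from the application's point of view: the job asks for a dataset by name and gets back a Beam PCollection, regardless of whether the platform resolves it to HDFS files or a Kafka topic. The `read_dataset` helper, the catalog entries, and the dataset names are hypothetical, not LinkedIn's actual API.

```python
# Hypothetical converged-read helper: the same business logic consumes a
# dataset whether it resolves to files on HDFS or a Kafka topic.
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io.kafka import ReadFromKafka

def read_dataset(pipeline, name: str, catalog: dict):
    """Resolve a logical dataset name to a bounded or unbounded Beam source."""
    entry = catalog[name]
    if entry["kind"] == "hdfs":
        return pipeline | f"Read:{name}" >> ReadFromText(entry["path"])
    return pipeline | f"Read:{name}" >> ReadFromKafka(
        consumer_config={"bootstrap.servers": entry["brokers"]},
        topics=[entry["topic"]],
    )
    # In practice the helper would also normalize the element format so the
    # downstream transforms see the same record shape from either source.
```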

Another problem worth mentioning is that our data scientists start exploring datasets by running SQL queries on Trino. However, Trino queries usually return in seconds to minutes, and for larger datasets our data scientists have to switch to Spark. We are starting to build a layer such that our data scientists can submit a SQL query to run on the data lake and we internally run it on Trino, Spark, or Flink, depending on the scale requirements.
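As a toy illustration of that routing layer, the sketch below picks an engine for a query based on a rough estimate of how much data it will scan. The size threshold and the engine choices are hypothetical; a real router would draw on table metadata and the query plan.

```python
# Hypothetical query router: small interactive scans go to Trino,
# heavy scans go to Spark (or Flink). Threshold is an assumption.
def pick_engine(estimated_scan_bytes: int, small_scan_limit: int = 100 * 2**30) -> str:
    """Return which engine should execute the query."""
    return "trino" if estimated_scan_bytes <= small_scan_limit else "spark"

print(pick_engine(5 * 2**30))     # -> trino
print(pick_engine(900 * 2**30))   # -> spark
```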

Q6. What kind of databases do you use and for what? What are the recommendations when choosing a database? 

We have the following databases at LinkedIn.

  • Espresso is a highly scalable distributed document database, which is used as the source of truth store for most of the high scale datasets at LinkedIn.  It supports read after write semantics and a rich API to query and store documents.  It allows users to evolve the schema of the documents, but enforces backward compatibility.
  • Venice is a low latency serving store for derived data (it doesn't offer read after write consistency). However, it supports very high throughput writes from our nearline and offline processing engines without compromising serving latency. It also has a thick client that supports key caching optimizations for low latency serving.
  • Our OLAP store is based on Apache Pinot (built at LinkedIn). It supports low latency slice and dice queries that power a wide variety of member facing experiences and internal dashboards.
  • Search :  We have built a highly scalable search system (called Galene) which is based on Apache Lucene.  It also integrates with our machine learning infrastructure to run models to rank query results.
  • We have built a highly scalable Graph database called Liquid which supports low latency graph queries over LinkedIn’s economic graph (social network, people, companies, and other entities).
  • SQL database: We leverage MySQL as a standard relational database and are moving towards distributed SQL offerings.

Q7. Let's look at search. How do you manage to support search for about 930 million LinkedIn members worldwide?

We built a highly scalable distributed Search infrastructure on top of Apache Lucene that powers most of the search capabilities used by our members and customers. This infrastructure is also used for some aspects of the recommendations that you see in the LinkedIn app. At the heart of it, the data is sharded and indexed. Queries land on a broker (think of it as a query gateway), which then forwards them to at least one replica of each shard. The system scales by adding replicas. To operate at our scale, the hosted search platform has fairly sophisticated automation that allows developers to push new indexes into production safely.
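The scatter-gather flow can be sketched roughly as follows: the broker fans the query out to one replica of every shard, gathers the partial hit lists, and merges them by score. The shard count, scoring shape, and thread-pool fan-out here are illustrative assumptions, not the actual Galene implementation.

```python
# Toy scatter-gather broker: query one replica of each shard, merge by score.
from concurrent.futures import ThreadPoolExecutor

def query_shard(shard_id: int, query: str) -> list[tuple[float, str]]:
    """Stand-in for a network call to one replica of a shard; returns (score, doc_id)."""
    return [(1.0 / (shard_id + 1), f"doc-{shard_id}-{i}") for i in range(3)]

def broker_search(query: str, num_shards: int = 4, top_k: int = 5):
    """Fan the query out to all shards in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        partials = pool.map(lambda s: query_shard(s, query), range(num_shards))
    hits = [hit for partial in partials for hit in partial]
    return sorted(hits, key=lambda h: h[0], reverse=True)[:top_k]

print(broker_search("distributed systems engineer"))
```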

Search is not a new field, but the recent push towards semantic search with embeddings has revolutionized the information retrieval space. As you know, semantic search provides far more relevant results than symbolic (keyword) search. Also, with the advent of large language models, you can fully expect that most search applications will support free-form text-based search in the near future. The set of algorithms that we leverage for embedding-based retrieval is evolving at breakneck speed. Although we initially started with IVFPQ, we are now pushing towards HNSW, DiskANN, and brute force (for smaller datasets).
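For smaller datasets, brute-force retrieval is simply an exact nearest-neighbour scan over the embedding matrix, which the sketch below shows with NumPy and cosine similarity; approaches like IVFPQ, HNSW and DiskANN trade some exactness for speed on much larger corpora. The dimensions and random vectors are placeholders.

```python
# Brute-force embedding retrieval: exact cosine-similarity scan over all
# document vectors. Dimensions and data here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(10_000, 128)).astype(np.float32)     # document embeddings
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)      # normalize once up front

def search(query_vec: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k most similar documents by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ q
    return np.argsort(-scores)[:k]

print(search(rng.normal(size=128).astype(np.float32)))
```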

Over time I also expect that traditional databases will add vector indexes as part of their offering, rather than requiring data to be piped into a separate vector database for querying.

Q8. Do you use so-called data lakes? If yes, what is your experience? 

Yes, we ingest all our data into a multi-exabyte scale data lake. Changes from our databases are captured into Kafka, and another system, Gobblin, then ingests this data into HDFS. The data lands in a columnar format (ORC) and in some cases Avro. We organize our datasets as tables; here we are moving from Hive tables to OpenHouse, a new homegrown system based on Apache Iceberg.

The data in our data lake is processed using Apache Spark, Trino, TensorFlow, PyTorch, etc.
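A minimal PySpark sketch of that offline access path is below. It assumes a Spark session already configured with an Iceberg catalog (named `openhouse` here purely for illustration) and a made-up table and partition column; it is not a LinkedIn dataset.

```python
# Minimal PySpark sketch: query a data lake table. Assumes the session is
# configured with an Iceberg catalog; catalog, table and column names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datalake-example").getOrCreate()

daily_views = (
    spark.table("openhouse.tracking.page_views")        # hypothetical Iceberg table
         .where("ds = '2023-06-01'")                     # hypothetical date partition
         .groupBy("pageKey")
         .count()
)
daily_views.show(10)
```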

We capture metadata for all our datasets and organize it in a systematic way with a system called DataHub. These metadata annotations are also leveraged by several maintenance flows and periodic jobs that ensure all data in our data lake complies with various regulatory requirements.

One key change in how we use our data lake has to do with compute-storage disaggregation, which is already common practice in the cloud. One of the age-old cornerstones of the Hadoop architecture has been scheduling compute containers on the same nodes as the data for maximum efficiency. However, this optimization no longer really makes sense: we need compute nodes that scale independently and are fungible across workloads, while storage nodes for the data lake workload are specialized and need to be managed separately. Disaggregation also improves the overall security posture, as it allows you to have tighter controls over the hosts that run your data lake.

Qx. Anything you wish to add?

Across the industry (and at LinkedIn), we have seen the emergence of purpose-built data systems that are great at certain types of queries or data processing. It is, however, non-trivial for an application developer to connect all the various data systems and make them work together, and developers also have to interact with each system through its own API. I believe that although we will continue to see a proliferation of purpose-built data systems, we will also start seeing a lot of consolidation and integration of these systems.

At LinkedIn we are working towards a future where users can trivially connect our data systems. In the ideal end state, online applications would query a single SQL endpoint/gateway that can pick the appropriate backend database for different types of queries and hide the complexity from end users. Similarly, for offline and nearline data processing, users will express their logic on a common API (Beam for imperative code, SQL for declarative code) and it will just work. Data scientists will just express their logic in SQL (or plain ol' text in the future) and not have to worry about which query engine (Spark/Trino) will execute it.

……………………………………………….

Kartik Paramasivam is Vice President of Engineering at LinkedIn, and his teams focus on the Data Infrastructure at LinkedIn. Prior to this role, he was a Distinguished Engineer responsible for leading the initiative to migrate LinkedIn to the Azure cloud. Prior to LinkedIn, Kartik worked on the first and second generations of messaging services in Microsoft Azure and was responsible for Azure Service Bus and Event Hubs.
