Powering Big Data at Pinterest. Interview with Krishna Gade.
“Today, we’re storing and processing tens of petabytes of data on a daily basis, which poses the big challenge in building a highly reliable and scalable data infrastructure.”–Krishna Gade.
I have interviewed Krishna Gade, Engineering Manager on the Data team at Pinterest.
Q1. What are the main challenges you are currently facing when dealing with data at Pinterest?
Krishna Gade: Pinterest is a data product and a data-driven company. Most of our Pinner-facing features like recommendations, search and Related Pins are created by processing large amounts of data every day. Added to this, we use data to derive insights and make decisions on products and features to build and ship. As Pinterest usage grows, the number of Pinners, Pins and the related metadata are growing rapidly. Today, we’re storing and processing tens of petabytes of data on a daily basis, which poses the big challenge in building a highly reliable and scalable data infrastructure.
On the product side, we’re curating a unique dataset we call the ‘interest graph’ which captures the relationships between Pinners, Pins, boards (collections of Pins) and topic categories. As Pins are visual bookmarks of web pages saved by our Pinners, we can have the same web page Pinned many different times. One of the problems we try to solve is to collate all the Pins that belong to the same web page and aggregate all the metadata associated with them.
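As a rough illustration of the collation problem described above (not Pinterest's actual implementation), the sketch below groups Pins by a normalized URL and aggregates their metadata. The normalization rules, the `canonical_url` and `collate_pins` helpers, and the Pin field names (`id`, `url`, `board`, `description`) are all assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Normalize a URL so trivially different links share one key.

    Illustrative rules only: lowercase scheme and host, strip a leading
    "www.", drop a trailing slash, and discard query string and fragment.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, "", ""))


def collate_pins(pins):
    """Group Pins by canonical URL and aggregate their metadata."""
    pages = defaultdict(
        lambda: {"pin_ids": [], "boards": set(), "descriptions": []}
    )
    for pin in pins:
        page = pages[canonical_url(pin["url"])]
        page["pin_ids"].append(pin["id"])
        page["boards"].add(pin["board"])
        page["descriptions"].append(pin["description"])
    return dict(pages)
```

Discarding the query string is a deliberate simplification; on sites where the query identifies the page, a real system would need finer-grained rules.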
Visual discovery is an important feature in our product. When you click on a Pin we need to show you visually related Pins. In order to do this we extract features from the Pin image and apply sophisticated deep learning techniques to suggest Pins related to the original. There is a need to build scalable infrastructure and algorithms to mine and extract value from this data and apply it to features such as search and recommendations.
Q2. You wrote in one of your blog posts that “data-driven decision making is in your company DNA”. Could you please elaborate and explain what you mean by that?
Krishna Gade: It starts from the top. Our senior leadership is constantly looking for insights from data to make critical decisions. Every day, we look at the various product metrics computed by our daily pipelines to measure how the numerous product features are doing. Every change to our product is first tested with a small fraction of Pinners as an A/B experiment, and at any given time we’re running hundreds of these A/B experiments. Over time data-driven decision making has become an integral part of our culture.
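One common way to expose a change to only a small fraction of users is deterministic hash-based bucketing. The sketch below shows that general technique; it is not a description of Pinterest's experiment framework, and the function name, experiment name, and even split between control and treatment are assumptions.

```python
import hashlib
from typing import Optional


def assign_variant(user_id: str, experiment: str, fraction: float) -> Optional[str]:
    """Deterministically place a user in an experiment.

    Returns "treatment" or "control" for roughly `fraction` of users
    (split evenly between the two), and None for everyone else.
    """
    # Hash the experiment name together with the user ID so that
    # different experiments get independent assignments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < fraction / 2:
        return "treatment"
    if bucket < fraction:
        return "control"
    return None
```

Because the assignment depends only on the user ID and the experiment name, a user sees the same variant on every request without any stored state.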
Q3. Specifically, what do you use Real-time analytics for at Pinterest?
Krishna Gade: We build batch pipelines extensively throughout the company to process billions of Pins and the activity on them. These pipelines allow us to process vast amounts of historic data very efficiently and tune and personalize features like search, recommendations, home feed etc. However these pipelines don’t capture the activity happening currently – new users signing up, millions of repins, clicks and searches. If we only rely on batch pipelines, we won’t know much about a new user, Pin or trend for a day or two. We use real-time analytics to bridge this gap.
Our real-time data pipelines process the user activity stream, which includes the various actions taken by Pinners (repins, searches, clicks, etc.) as they happen on the site, compute signals for Pinners and Pins in near real-time, and make these available back to our applications to customize and personalize our products.
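To make the idea of a near real-time signal concrete, here is a minimal single-process sketch of a sliding-window counter, for example "repins of this Pin in the last 60 seconds". The class name and window size are hypothetical, and the sketch assumes events arrive in non-decreasing timestamp order; a production pipeline would be distributed and would have to handle late or out-of-order events.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Count events per key over the most recent `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = deque()    # (timestamp, key), oldest first
        self.counts = Counter()  # key -> events currently in the window

    def _expire(self, now: float) -> None:
        # Drop events that have fallen out of the window ending at `now`.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def add(self, timestamp: float, key: str) -> None:
        """Record one event; timestamps are assumed non-decreasing."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def count(self, key: str, now: float) -> int:
        """Return how many events for `key` fall inside the window at `now`."""
        self._expire(now)
        return self.counts.get(key, 0)
```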
Q4. Could you please give us an overview of the data platforms you use at Pinterest?
Krishna Gade: We’ve used existing open-source technologies and also built custom data infrastructure to collect, process and store our data. We built a logging agent, Singer, which is deployed on all of our web servers and constantly pumps log data into Kafka, which we use as a log transport system. After the logs reach Kafka, they’re copied into Amazon S3 by our custom log persistence service called Secor. We built Secor to ensure zero data loss and to overcome the weak eventual consistency model of S3.
After this point, our self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing. All our large scale batch pipelines run on Hadoop, which is the core data infrastructure we depend on for improving and observing our product. Our engineers use either Hive or Cascading to build the data pipelines, which are managed by Pinball – a flexible workflow management system we built. More recently, we’ve started using Spark to support our machine learning use-cases.
Q5. You have built a real-time data pipeline to ingest data into MemSQL using Spark Streaming. Why?
Krishna Gade: As of today, most of our analytics happens in the batch processing world. All the business metrics we compute are powered by the nightly workflows running on Hadoop. In the future our goal is to be able to consume real-time insights to move quickly and make product and business decisions faster. A key piece of infrastructure missing for us to achieve this goal was a real-time analytics database that can support SQL.
We wanted to experiment with a real-time analytics database like MemSQL to see how it works for our needs. As part of this experiment, we built a demo pipeline to ingest all our repin activity stream into MemSQL and built a visualization to show the repins coming from the various cities in the U.S.
Q6. Could you please give us some detail on how it is implemented?
Krishna Gade: As Pinners interact with the product, Singer agents hosted on our web servers are constantly writing the activity data to Kafka. The data in Kafka is consumed by a Spark Streaming job. In this job, each Pin is filtered and then enriched by adding geolocation and Pin category information. The enriched data is then persisted to MemSQL using MemSQL’s Spark connector and is made available for query serving. The goal of this prototype was to test if MemSQL could enable our analysts to use familiar SQL to explore the real-time data and derive interesting insights.
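The filter, enrich, persist, and query steps described above can be sketched in plain Python. This is only an illustration of the data flow, not the actual Spark Streaming job: an in-memory list stands in for the Kafka stream, the standard-library `sqlite3` module stands in for MemSQL, and the lookup tables, field names, table schema, and filter rule (keep only repins with a resolvable location) are all assumptions.

```python
import sqlite3

# Hypothetical lookup tables standing in for geolocation and category services.
IP_TO_CITY = {"203.0.113.7": "San Francisco", "198.51.100.4": "New York"}
PIN_CATEGORY = {"p1": "food", "p2": "travel"}


def enrich(events):
    """Keep repin events with a known location and attach city and category."""
    for event in events:
        if event["action"] != "repin":
            continue  # assumed filter: only repins are of interest
        city = IP_TO_CITY.get(event["ip"])
        if city is None:
            continue  # assumed filter: skip events we cannot geolocate
        category = PIN_CATEGORY.get(event["pin_id"], "unknown")
        yield (event["pin_id"], city, category, event["ts"])


def persist(conn, rows):
    """Write enriched rows to a SQL table (SQLite here as a stand-in)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS repins "
        "(pin_id TEXT, city TEXT, category TEXT, ts INTEGER)"
    )
    conn.executemany("INSERT INTO repins VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def repins_by_city(conn):
    """The kind of query an analyst might run over the ingested data."""
    return conn.execute(
        "SELECT city, COUNT(*) AS n FROM repins "
        "GROUP BY city ORDER BY n DESC, city"
    ).fetchall()
```

The point of the final function is that, once the enriched events sit in a SQL store, exploring them requires nothing beyond ordinary SQL.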
Q7. Why did you choose MemSQL and Spark for this? What were the alternatives?
Krishna Gade: I led the Storm engineering team at Twitter, and we were able to scale the technology for hundreds of applications there. During that time I was able to experience both good and bad aspects of Storm.
When I came to Pinterest, I saw that we were beginning to use Storm but mostly for use-cases like computing the success rate and latency stats for the site. More recently we built an event counting service using Storm and HBase for all of our Pin and user activity. In the long run, we think it would be great to consolidate our data infrastructure around a smaller set of technologies. Since we’re already using Spark for machine learning, we thought of exploring its streaming capabilities. This was the main motivation behind using Spark for this project.
As for MemSQL, we were looking for a relational database that can run SQL queries on streaming data, one that would not only simplify our pipeline code but also give our analysts a familiar interface (SQL) to ask questions on this new data source. Another attractive feature of MemSQL is that it can also be used for the OLTP use case, so we can potentially have the same pipeline enabling both product insights and user-facing features. Apart from MemSQL, we’re also looking at alternatives like VoltDB and Apache Phoenix. Since we already use HBase as a distributed key-value store for a number of use-cases, Apache Phoenix, which is nothing but a SQL layer on top of HBase, is interesting to us.
Q8. What are the lessons learned so far in using such a real-time data pipeline?
Krishna Gade: It’s early days for the Spark + MemSQL real-time data pipeline, so we’re still learning about the pipeline and ingesting more and more data. Our hope is that in the next few weeks we can scale this pipeline to handle hundreds of thousands of events per second and have our analysts query them in real-time using SQL.
Q9. What are your plans and goals for this year?
Krishna Gade: On the platform side, our plan is to scale real-time analytics in a big way at Pinterest. We want to be able to refresh our internal company metrics and the signals that feed product features at a granularity of seconds instead of hours. We’re also working on scaling our Hadoop infrastructure, especially looking into preventing S3 eventual consistency from disrupting the stability of our pipelines. This year should also see more open-sourcing from us. We started the year by open-sourcing Pinball, our workflow manager for Hadoop jobs. We plan to open-source Singer, our logging agent, sometime soon.
On the product side, one of our big goals is to scale our self-serve ads product and grow our international user base. We’re focusing especially on markets like Japan and Europe to grow our user base and get more local content into our index.
Qx. Anything else you wish to add?
Krishna Gade: For those who are interested in more information, we share the latest from the engineering team on our Engineering blog. You can follow along with the blog, as well as updates on our Facebook Page. Thanks a lot for the opportunity to talk about Pinterest engineering and some of the data infrastructure challenges.
Krishna Gade is the engineering manager for the data team at Pinterest. His team builds core data infrastructure to enable data-driven products and insights for Pinterest. They work on some of the cutting-edge big data technologies like Kafka, Hadoop, Spark, Redshift etc. Before Pinterest, Krishna was at Twitter and Microsoft building large-scale search and data platforms.
–Singer, Pinterest’s Logging Infrastructure (LINK to SlideShares)
–Introducing Pinterest Secor (LINK to Pinterest engineering blog)
–MemSQL’s spark connector (memsql/memsql-spark-connector GitHub)