On Using Graph Database technology at Behance. Interview with David Fox
“We’ve corrected a major human resource burden that has led to an exponential decrease in the amount of developer-operations staff hours required each month to keep our activity feed running. This means they can focus on other areas of our infrastructure that need attention, versus spending time frustratingly micro-managing the activity infrastructure.” –David Fox
I have interviewed David Fox. David is a Software Engineer at Adobe, responsible for backend infrastructure and performance of Behance – a social network for creatives, serving over 10 million members. Behance is owned by Adobe.
Q1. Who is using Behance and for what? How many users do you have, and what data volume and data types do you handle?
David Fox: Behance is the leading online platform to showcase and discover creative work. Every single day, hundreds of thousands of creative individuals update and publish new projects to the Behance platform, allowing for advanced collaboration and an efficient means to solicit helpful feedback on both final work and ‘works in progress.’
Not only does Behance offer an outlet for peer-to-peer sharing among creatives, it’s a top destination for companies looking to hire creative talent on a global scale. With over 10 million members, our ‘activity feed’ infrastructure serves millions of feed loads daily.
Q2. What specific data requirements do you have from your users?
David Fox: Our users expect to be able to access current and past projects within Behance that are relevant to their core interest areas, along with tools for discovering new work. The ‘Activity Feed Feature’ allows users to follow their favorite creatives and curate galleries based on preferences. When a user follows a creative, they receive alerts and an updated feed whenever that creative logs an activity within Behance, e.g. when they ‘appreciate’ a project, publish a project, or comment on a project. Users can also follow pre-curated galleries, selected by the Behance curation team, which highlight a creative theme, e.g. graphic design, photography, or illustration.
Q3. You have implemented an activity feed for many years. What challenges did you have with the previous implementation?
David Fox: Our previous implementation of ‘Activity Feed’ was built with Cassandra. It was designed for optimized ‘reads’ (when we showed a user their activity feed), but consequently, it was very storage/write-heavy.
We needed 48 large Cassandra instances to power the feature, and even with those, there were still many limitations and bottlenecks on the system. We had to devote a large amount of app resources to populating Cassandra with activity feed data. For every action a user took on our app, we would use a “fanout” method to populate the activity feed of every user who followed them with the new activity item. For users who were followed by thousands of people, resource utilization skyrocketed every time they took an action. The application worker processes that handled those items would fall behind because of all the work required to populate the activity feeds in Cassandra.
We also had a lot of challenges maintaining our Cassandra cluster, which forced us to devote a significant amount of ops/developer time to supporting it. With our schema format, we didn’t have an efficient way to delete feed items from our database, so disk usage added up quickly on each node. When disk usage became high, we would have to perform maintenance tasks to stabilize the cluster. The rigid schema structure also meant we couldn’t easily make any improvements to our activity feed feature, and we were just working to keep it running with little hope of improving it.
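To make the cost of the fan-out approach described in this answer concrete, here is a minimal Python sketch of fan-out-on-write. It is an illustration only, not Behance's code: the `FanoutFeedStore` class, its method names, and the in-memory dictionaries standing in for Cassandra rows are all assumptions.

```python
from collections import defaultdict


class FanoutFeedStore:
    """Hypothetical in-memory stand-in for a pre-written (fan-out-on-write) feed store."""

    def __init__(self):
        self.followers = defaultdict(set)  # actor -> users who follow that actor
        self.feeds = defaultdict(list)     # user -> pre-written feed rows

    def follow(self, follower, actor):
        self.followers[actor].add(follower)

    def record_action(self, actor, verb, project):
        """Copy the activity item into every follower's feed; return rows written."""
        item = (actor, verb, project)
        for follower in self.followers[actor]:
            self.feeds[follower].append(item)
        return len(self.followers[actor])


store = FanoutFeedStore()
for n in range(5000):
    store.follow(f"user{n}", "popular_creative")

rows_written = store.record_action("popular_creative", "published", "project-1")
print(rows_written)  # one row per follower: 5000
```

In this sketch a single action by a widely followed user turns into thousands of writes, and every one of those copies would later have to be found again to delete it.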
Q4. What are the goals you set for the new implementation of the activity feed? How does the activity feed relate to the rest of the data platform?
David Fox: We had several goals for our new activity feed implementation.
- Ensure the new infrastructure significantly reduced human-maintenance costs and required minimal effort and resources to keep running.
- Reduce the complexity of the system as a whole.
- Significantly improve the performance of writes while keeping reads fast.
- Increase the flexibility of the system, in turn making it easier to add new features.
- Reduce data storage size.
Q5. Can you share with us some details of the new implementation?
David Fox: Our new activity feed implementation, built on top of Neo4j, uses a simple graph model where we store relationships between users and the entities they follow. We then store simple “action” relationships that represent a user or curated gallery taking an action on a project (e.g., commenting, publishing, or appreciating). This data model produces very little repeated data, so data maintenance is simple for our application layer, doesn’t use a lot of resources, and gives us good flexibility in how we can query the data.
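The model described here can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the Neo4j schema Behance uses: the `ActivityGraph` class, its field names, and the integer timestamps are hypothetical, and a plain in-memory scan stands in for a graph traversal.

```python
from dataclasses import dataclass, field


@dataclass
class ActivityGraph:
    """Hypothetical in-memory stand-in for the follows/action graph."""
    follows: set = field(default_factory=set)    # (follower, followed) pairs
    actions: list = field(default_factory=list)  # (timestamp, actor, verb, project)

    def follow(self, follower, followed):
        self.follows.add((follower, followed))

    def record_action(self, timestamp, actor, verb, project):
        # One relationship per action, regardless of who follows the actor.
        self.actions.append((timestamp, actor, verb, project))

    def feed(self, user, limit=10):
        """Derive the feed at read time: followed entities -> their actions, newest first."""
        followed = {target for source, target in self.follows if source == user}
        items = [a for a in self.actions if a[1] in followed]
        return sorted(items, reverse=True)[:limit]


graph = ActivityGraph()
graph.follow("alice", "bob")
graph.follow("alice", "gallery:illustration")
graph.record_action(1, "bob", "appreciated", "project-7")
graph.record_action(2, "gallery:illustration", "featured", "project-9")
graph.record_action(3, "carol", "published", "project-3")  # alice does not follow carol

alice_feed = graph.feed("alice")
print(alice_feed)
# [(2, 'gallery:illustration', 'featured', 'project-9'), (1, 'bob', 'appreciated', 'project-7')]
```

Nothing is stored per feed in this sketch; the feed exists only as the result of a query over the two kinds of relationships.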
Q6. Why was removing activity items when you unfollowed a user “impossible” with the previous implementation? How is it done now? What about scalability?
David Fox: In our previous implementation, our Cassandra schema didn’t allow us to delete data efficiently at any scale, since it was highly optimized for reads and there was significant data repetition. So, when a user unfollowed another user, they would still see items from the user they had unfollowed, because their activity feed was pre-written to Cassandra in full. Now, we simply delete the “follows” relationship from the user to the person they unfollowed, and the change is instantly reflected in their feed. This instant change is possible because we don’t store any data about individual activity feeds; instead, we only store data modeling who users follow and which actions were taken on projects.
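The difference between the two behaviours can be shown with a small Python sketch. All names and data below are hypothetical; a copied list stands in for a pre-written feed and a set stands in for the stored “follows” relationships.

```python
# Shared action log: (actor, verb, project). All names here are hypothetical.
actions = [
    ("bob", "published", "project-1"),
    ("carol", "appreciated", "project-2"),
]

# Old style: the feed is copied out in full ahead of time.
prewritten_feed = list(actions)

# New style: only the "follows" relationships are stored for the user.
follows = {"bob", "carol"}


def derived_feed(follows, actions):
    """Build the feed at read time from whoever is followed right now."""
    return [a for a in actions if a[0] in follows]


follows.discard("bob")  # the user unfollows bob

stale = prewritten_feed
fresh = derived_feed(follows, actions)
print(stale)  # still contains bob's item unless every copied row is found and deleted
print(fresh)  # [('carol', 'appreciated', 'project-2')]
```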
Q7. In the previous implementation you could only “backfill” 30 items per category during onboarding. Why was this a problem? How is it done now?
David Fox: This limitation was a significant struggle for us. When a user creates an account with Behance, they select one or more curated categories they want to follow. At that point, with our old implementation, we would have to backfill their feed with activity items, since all the items we could show them needed to be stored in Cassandra – one row for each project in their feed. So, when a user signed up, they would need to wait for as long as it took us to fill Cassandra with activity items for the curated categories they selected. Because of that delay, we minimized the number of projects we backfilled to make sure there wasn’t too long a delay (only a few seconds) between finishing sign-up and seeing the initial Behance experience. Unfortunately, limiting the backfill to 30 items meant that after sign-up, a user could only scroll through 30 x (number of categories selected) items on their feed, and then they’d reach the end.
Now, with our new activity feed implementation, there is no backfill required at all. When a user creates an account and selects the curated categories they want to follow, all we have to do is create the “follows” relationships in Neo4j between the user and the categories, and their feed is instantly available. Since we store roughly the 1,000 latest items for each curated category in Neo4j, users can now almost instantly see 1,000 x (number of categories selected) items in their feed. That’s a much better initial experience on our application!
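As a worked example of the arithmetic in this answer, the short Python sketch below compares the two limits. The figure of three selected categories is a hypothetical example value, and the result is an upper bound that assumes no project appears in more than one selected category.

```python
OLD_BACKFILL_PER_CATEGORY = 30   # items backfilled per category at sign-up (old system)
NEW_ITEMS_PER_CATEGORY = 1_000   # approximate latest items kept per category (new system)


def max_initial_feed_items(categories_selected, items_per_category):
    """Upper bound on scrollable items right after sign-up."""
    return categories_selected * items_per_category


categories_selected = 3  # hypothetical example value
old_limit = max_initial_feed_items(categories_selected, OLD_BACKFILL_PER_CATEGORY)
new_limit = max_initial_feed_items(categories_selected, NEW_ITEMS_PER_CATEGORY)
print(old_limit, new_limit)  # 90 3000
```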
Q8. How did you implement the ability to add new features like “newest projects from your network”?
David Fox: The flexibility of the graph data model allowed us to easily add this new activity view to the activity feed feature. Instead of having to load a user’s activity based on a pre-existing stored structure, we were able to simply query by action type in Neo4j to create a specific view of the activity feed, analyzing only “published project” actions instead of all our usual activity actions.
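Querying by action type can be sketched as a filter applied at read time. The Python below is an assumption-laden illustration, not the actual Neo4j query: the `feed` function, the `action_types` parameter, and the sample data are all hypothetical.

```python
# Hypothetical data: who a user follows, and a shared log of actions.
follows = {"alice": {"bob", "carol"}}
actions = [
    # (timestamp, actor, action_type, project)
    (1, "bob", "appreciated", "project-4"),
    (2, "carol", "published", "project-5"),
    (3, "bob", "commented", "project-5"),
    (4, "bob", "published", "project-6"),
]


def feed(user, action_types=None):
    """Return followed users' actions, newest first, optionally limited to some types."""
    followed = follows.get(user, set())
    items = [
        a for a in actions
        if a[1] in followed and (action_types is None or a[2] in action_types)
    ]
    return sorted(items, reverse=True)


newest_projects = feed("alice", action_types={"published"})
print(newest_projects)
# [(4, 'bob', 'published', 'project-6'), (2, 'carol', 'published', 'project-5')]
```

Because the view is only a different filter over the same stored relationships, adding it in this sketch requires no new stored data and no backfill.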
Q9. Why is selectively writing for users with lots of followers no longer needed?
David Fox: Previously, if a user was followed by a significant number of people, we were unable to share an activity update with the user’s full group of followers because of the ‘fan out’ write required for each follower. Now, the number of people following a user has no impact on the time or complexity of creating the action, since the same action relationship is shared by all followers of that user – meaning very little repeated data.
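The claim that write cost no longer depends on follower count can be illustrated with a small Python sketch. The function names and the tuple-based data are hypothetical stand-ins for graph relationships.

```python
def record_shared_action(actions, actor, verb, project):
    """Append one action relationship; return how many records were written."""
    actions.append((actor, verb, project))
    return 1


def writes_for(follower_count):
    """Return (records written, followers who can reach the action)."""
    follows = {(f"user{n}", "creative") for n in range(follower_count)}
    actions = []
    written = record_shared_action(actions, "creative", "published", "project-1")
    # Every follower reaches the same single action through their "follows" edge.
    reachable_by = sum(1 for _, followed in follows if followed == actions[0][0])
    return written, reachable_by


print(writes_for(10))      # (1, 10)
print(writes_for(10_000))  # (1, 10000)
```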
Q10. Can you share any numbers around the performance of the new implementation?
David Fox: We’ve seen some significant performance improvements with the new implementation. We’ve been able to cut the time from sign-up to initial activity experience from 1.4 seconds to 400 milliseconds on average. In addition to the speed improvement, users’ initial curated feed is now much more complete.
We’ve seen the most significant improvements in writing activity data. For example, one of our write job processes (when a project was featured in a curated gallery) used to take 12 minutes on average to run and consumed significant application resources. Now, on average, that write operation takes 106 milliseconds.
Q11. What about COGS, Storage Costs and Complexity?
David Fox: Our Neo4j activity implementation has led to a great decrease in complexity, storage, and infrastructure costs. Our full dataset size is now around 40 GB, down from 50 TB of data that we had stored in Cassandra. We’re able to power our entire activity feed infrastructure using a cluster of 3 Neo4j instances, down from 48 Cassandra instances of pretty much equal specs. That has also led to reduced infrastructure costs. Most importantly, it’s been a breeze for our operations staff to manage since the architecture is simple and lean.
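For a sense of scale, the figures quoted in the answers to Q10 and Q11 work out to the ratios computed below. The only assumption added here is decimal units (1 TB = 1,000 GB); the variable names are illustrative.

```python
# Figures quoted in the interview.
signup_before_ms, signup_after_ms = 1_400, 400          # 1.4 s -> 400 ms
write_before_ms, write_after_ms = 12 * 60 * 1_000, 106  # 12 min -> 106 ms
data_before_gb, data_after_gb = 50 * 1_000, 40          # 50 TB -> 40 GB (decimal units assumed)
nodes_before, nodes_after = 48, 3

signup_speedup = signup_before_ms / signup_after_ms
write_speedup = write_before_ms / write_after_ms
dataset_reduction = data_before_gb / data_after_gb
instance_reduction = nodes_before / nodes_after

print(f"sign-up to first feed: {signup_speedup:.1f}x faster")    # 3.5x
print(f"featured-project write: {write_speedup:,.0f}x faster")   # 6,792x
print(f"dataset size: {dataset_reduction:,.0f}x smaller")        # 1,250x
print(f"instances: {instance_reduction:.0f}x fewer")             # 16x
```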
Q12. What is the road map ahead for data projects at Behance?
David Fox: We want to continue to take advantage of our new, flexible graph infrastructure to improve our activity feed feature and make it more useful. We recently built a “newest projects from your network” view for activity, and we will be launching it soon. This feature is a great example of our ability to move quickly and nimbly, and we’ll continue to add more compelling features to our activity experience moving forward.
Just as important as new features, we’ve corrected a major human resource burden that has led to an exponential decrease in the amount of developer-operations staff hours required each month to keep our activity feed running. This means they can focus on other areas of our infrastructure that need attention, versus spending time frustratingly micro-managing the activity infrastructure.
David Fox is an application developer/data engineer specializing in the development of high-performance backend systems and working with a large variety of databases alongside massive datasets. He is a Software Engineer at Adobe, responsible for backend infrastructure and performance of Behance – the premier social network for creatives, serving over 10 million members.