Uplevel Big Data Analytics with Graph in Vertica – Part 4: It’s not your dad’s graph engine
by Walter Maguire, Chief Field Technologist with the HP Big Data Group.
It’s been a while since my last post! My apologies! Summer is full of things like vacations (which I’ve been known to take once in a while), as well as the annual HP Big Data Conference. This year the BDC was held in Boston, Massachusetts in the second week of August. We had over 1100 people attend the Big Data-focused event this year. I was blown away by the quality of the presentations and how engaged all the attendees were. It was a great week full of customer perspectives, engineering coolness, and more. One of the hot topics was Big Data analytics. Many companies are now at the point where they’ve built technology infrastructures to cope with big data, but they are just starting to evaluate how the heck to use it.
More than a few people at the event were interested in talking about graph analytics. Graph has come a long way since the seven bridges problem, and it is starting to pop up in industries from advertising to communications to logistics and more. And once again I found that HP Vertica is the best kept secret in graph analytics!
But, for those who may have missed the earlier entries, let’s start with a recap of this blog series so far:
In part 1—I demystified graph analysis and busted the myth that you can’t perform graph analysis in a database…provided that you’re using Vertica.
In part 2—I discussed how to implement one of the more common graph analytics – single-source shortest path – in Vertica. I also discussed performance benchmarks comparing Vertica to two commonly used graph engines – Giraph and GraphLab.
In part 3—I discussed methods to optimize performance of graph analytics in Vertica with a variety of techniques and optimizations. I also discussed possibilities for future enhancements.
We’ve covered a lot of ground so far in this blog series! But there are more insights to be had. The team that performed the benchmarking between Vertica and the graph engines also looked at how to implement several types of graph analytics which push the boundaries of vertex-centric processing. And the team uncovered some very interesting insights!
Before continuing, let’s take a moment and clarify what it means to push the boundaries of vertex-centric processing. This distinction highlights one of the most fundamental differences between Vertica and the other graph engine in this comparison, Giraph.
It’s a beautiful day in the neighborhood
Giraph, originally conceived to enable large scale vertex-centric graph processing, is implemented on the Hadoop Map/Reduce framework. For those unfamiliar with it, Map/Reduce (MR) essentially divides a single, large task into a set of smaller tasks which can be executed in parallel across a number of connected systems. But at some point in this processing, information has to be passed between the connected systems. Sometimes this has to happen more than once during the execution of a particular MR process. Remember my discussion of parallelism in part 3 of this series? The same basics apply to MR as well. While we can set up a Hadoop cluster with thousands of systems, we still are limited by the network connecting them. The more we have to rely on moving data around, the slower and more resource intensive the job becomes.
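As a toy illustration of that shape (plain Python, not Hadoop; the word-count task and all names here are made up), here’s a map/shuffle/reduce pass where the shuffle step stands in for the data movement between systems:

```python
from collections import defaultdict

# Toy map/reduce: count word occurrences across "partitions" held by
# different workers. The shuffle step is where data would cross the
# network in a real cluster -- the cost described above.
partitions = [["graph", "vertex", "graph"], ["edge", "vertex"]]

# Map: each worker emits (key, 1) pairs from its own partition.
mapped = [(word, 1) for part in partitions for word in part]

# Shuffle: group pairs by key. In a real MR job, every pair may have to
# move to whichever machine owns its key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: each worker aggregates the values for the keys it owns.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'graph': 2, 'vertex': 2, 'edge': 1}
```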
In the case of the shortest-path computation on Giraph, each individual process (many of which may run on any given system):
- Picks up a node, its edges and any messages regarding previous steps
- Applies the logic to search for children
- Passes the output of that step as a message—which is then read by another process in subsequent steps
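Those three steps can be sketched as a toy superstep loop. This is a simplified model of the message passing, assuming a small unweighted graph stored as an adjacency dict; it is not Giraph’s actual API, and the graph and vertex names are made up:

```python
# Toy vertex-centric processing: each superstep, every vertex reads its
# inbox, updates its distance, and sends messages to its children. In
# Giraph, the inboxes live on different machines, so every message may
# have to cross the network.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

INF = float("inf")
dist = {v: INF for v in graph}
inbox = {v: [] for v in graph}
inbox["a"].append(0)  # seed the source vertex

active = True
while active:
    active = False
    outbox = {v: [] for v in graph}
    for v, msgs in inbox.items():
        # Pick up a node and any messages from previous steps.
        if msgs and min(msgs) < dist[v]:
            dist[v] = min(msgs)
            # Search for children and pass the output along as messages.
            for child in graph[v]:
                outbox[child].append(dist[v] + 1)
            active = True
    inbox = outbox  # messages are read in the next superstep

print(dist)  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```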
I know that this sounds efficient enough – until one realizes that nodes often relate to more than one node. In fact, one node could relate to many nodes. This means we have to duplicate nodes and edges across many of these processes – increasing the amount of processing as well as resource needs. Furthermore, it also results in more messages needing to be passed back and forth. As a result, there’ll be more work at the end of the job to re-organize everything into a single result set. After all, we divided things to run the algorithm in parallel, but we can’t simply return a thousand parts and expect the user to sort it all out—that would defeat the purpose of parallelism.
For an algorithm such as shortest path, this may not be a showstopper. This is because the algorithm is a relatively simple form of vertex-centric processing, in which the amount of information passed between steps is relatively small. The amount of information is small because we’re asking a relatively simple question – does this node have any children? But sometimes we want to ask questions of a graph which require the passing of information that pertains to more than a single node – a neighborhood of nodes. It is at this point that we have a problem. An MR-based platform such as Giraph would need to create far more duplicate data, and many more messages, to process the algorithm. This is where we see the limits of a vertex-centric graph processing architecture.
If we try this, we’d expect to see adverse performance impacts in a benchmark because we’re now asking a vertex-centric architecture to run a neighborhood-centric analysis. And so the team asked itself a question:
Is it possible for vertex-centric architecture to run neighborhood-centric analytics?
The team tested this hypothesis. They took two graph analytics which require the processing of neighborhoods – strong overlap and weak ties – and tested them with several datasets in Giraph and Vertica. For strong overlap, they looked for pairs of nodes with a large number of common neighbors. A friend recommender is a great example of how to use this. A simple friend finder on a platform like Facebook or LinkedIn might look for people who have one friend in common but are not connected, and recommend that those two connect. But this program could easily make a poor recommendation. After all, the two people could each have met the common friend at a trade show and connected, and have nothing in common with each other. But if we apply strong overlap, we can now look for two people who are not connected but who have a large number of common friends. This might be a far better indicator of people who would really like to connect. While one common connection could be random, a large number of common connections probably isn’t.
On the other hand, weak ties looks for nodes which bridge otherwise disconnected pairs of nodes. I don’t know about you all, but I could’ve used this in Boston to find my way around when the Big Dig was underway! Ever heard the old joke “you can’t get there from here?” For a while in Boston, it was true! If we’d taken a street map of the city with current information and represented each intersection as a node and each street as an edge, we could’ve asked exactly this question – “How the heck do I get from Mike’s Pastry in the North End to Faneuil Hall before my cannoli melts?”
In Vertica, each of these is a short query. For example, here’s a strong overlap query:
SELECT e1.from_node AS n1, e2.from_node AS n2, count(*)
FROM edge e1
JOIN edge e2 ON e1.to_node = e2.to_node
WHERE e1.from_node < e2.from_node
GROUP BY e1.from_node, e2.from_node
HAVING count(*) > THRESHOLD
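As a quick sanity check of this pattern without a Vertica cluster, a version of the query (with a predicate to skip self-pairs and mirrored duplicates, and THRESHOLD replaced by the literal 2) runs on Python’s built-in sqlite3 module against a made-up edge table:

```python
import sqlite3

# Sanity-check the strong-overlap query on a toy graph: nodes 1 and 2
# share neighbors 3, 4, and 5; node 6 shares only neighbor 3.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge (from_node INT, to_node INT)")
edges = [(1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (6, 3)]
con.executemany("INSERT INTO edge VALUES (?, ?)", edges)

rows = con.execute("""
    SELECT e1.from_node AS n1, e2.from_node AS n2, count(*) AS overlap
    FROM edge e1
    JOIN edge e2 ON e1.to_node = e2.to_node
    WHERE e1.from_node < e2.from_node   -- skip self-pairs and mirrors
    GROUP BY e1.from_node, e2.from_node
    HAVING count(*) > 2                 -- THRESHOLD = 2
""").fetchall()
print(rows)  # [(1, 2, 3)] -- nodes 1 and 2 overlap on 3 neighbors
```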
And here’s a weak ties query:
SELECT e1.to_node AS Id,
sum(CASE WHEN e3.to_node IS NULL THEN 1 ELSE 0 END)/2 AS C
FROM edge e1
JOIN edge e2 ON e1.to_node=e2.from_node AND e1.from_node<>e2.to_node
LEFT JOIN edge e3 ON e2.to_node=e3.from_node AND e3.to_node=e1.from_node
GROUP BY e1.to_node
HAVING C > THRESHOLD
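Likewise, a version of the weak-ties query, with the disconnection test spelled out in the join predicates and THRESHOLD set to 0, can be sanity-checked with sqlite3 on a made-up graph where each undirected edge is stored in both directions:

```python
import sqlite3

# Toy undirected graph: node 2 bridges nodes 1 and 3, which are not
# otherwise connected; nodes 4, 5, 6 form a triangle, so no bridges there.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge (from_node INT, to_node INT)")
pairs = [(1, 2), (2, 3), (4, 5), (5, 6), (4, 6)]
con.executemany("INSERT INTO edge VALUES (?, ?)",
                [e for a, b in pairs for e in ((a, b), (b, a))])

rows = con.execute("""
    SELECT e1.to_node AS id,
           sum(CASE WHEN e3.to_node IS NULL THEN 1 ELSE 0 END) / 2 AS c
    FROM edge e1
    JOIN edge e2 ON e1.to_node = e2.from_node
                AND e1.from_node <> e2.to_node
    LEFT JOIN edge e3 ON e2.to_node = e3.from_node
                     AND e3.to_node = e1.from_node  -- disconnection test
    GROUP BY e1.to_node
    HAVING c > 0                                    -- THRESHOLD = 0
""").fetchall()
print(rows)  # [(2, 1)] -- node 2 bridges one otherwise-disconnected pair
```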
The weak ties query is a bit more complex because it has to test for disconnection. However, it is very straightforward to express with SQL. But aside from the expressiveness of SQL, there are other concerns we need to address: performance and scalability.
The team tested these two analytics using both Giraph and Vertica, on the same hardware (a four-node cluster). The datasets they used were a YouTube undirected graph with 1.1 million nodes and 3 million edges, and a LiveJournal undirected graph with 4 million nodes and 35 million edges. Vertica was slightly slower than Giraph when running the strong overlap analytic on the YouTube data. However, Giraph was unable to complete strong overlap on the LiveJournal data – it ran out of memory. The same issue occurred when running weak ties with Giraph on both the YouTube and LiveJournal data – Giraph ran out of memory.
So the team’s expectations about resource consumption of the neighborhood-centric graph analytics in Giraph were correct. And while Vertica was about 13 percent slower than Giraph on the one analytic which ran in Giraph, the techniques I discussed previously to optimize Vertica graph performance could certainly be applied to speed things up.
So let’s net this out:
- There are graph analytics which require the processing of neighborhoods of nodes rather than just a node. These can be difficult to express in vertex-centric engines like Giraph. They are very straightforward to implement in SQL.
- The amount of resources required to run neighborhood-centric analytics in Giraph is significantly higher than when we run the same analytic in Vertica.
- There are numerous opportunities to improve the performance and scalability of Vertica for graph analytics.
- Using Vertica for graph analytics in place of a dedicated graph engine simplifies both the skill needs and architecture required for an enterprise seeking to leverage graph analytics.
In my next blog, I’ll complete the series by talking about putting graph analytics to work with all the other data a business might want to use. After all, graph is useful in itself, but if we want to change the game for business, we need to tie it to the rest of our data!
Next up: Rubber, meet Road – putting graph to work for your organization
You can read my previous blogs in this series here:
-Uplevel Big Data analytics with HP Vertica – Part 1: Graph in a relational database? Seriously?