My Data Analytics Lifecycle
by Mario Vinasco, Data Scientist, Facebook.
At Facebook, we routinely query the largest user database in the world; we have tables that record every single write to the social graph; examples of that are likes, comments, posts; in general anything that changes the graph and is potentially viewable by others.
For context, these are the actions performed by over 850 M on any given day on the platform!
The marketing group’s mission is to improve and strength our brand, how customers feel about us and in general, sentiment, trust and favorability.
And to support this mission, we launch new products, and start campaigns to raise awareness, adoption or usage.
Hard metrics allow us to evaluate and rank our initiatives, quantify activity and ultimately attribute any lift in sentiment to them; there are some metrics that are pretty standard; for instance in the recent launch of SayThanks videos, we measured reach (senders, receivers), feedback received, satisfaction, but also we calculated metrics more unique to the product such as virality and reciprocation as measures of how facebook can enhance relationships.
Where do I start
Typically, I use the new product a lot, or in the case of a campaign, click on it, follow the “call to action” and perform every action possible myself, carefully recording them in a spreadsheet.
I include screen shots, notes and any other contextual information that may be relevant.
This gives me the comfort to understand the product well and most importantly how the data is created and logged; this obviously happen during the development phase, sometimes before the product is open to employees and allows for an open dialog with the engineering teams, about how we are going to measure and what we will measure.
I then query my own records and verify that I can read the raw data and that it matches my off line logs.
The Data Collection
This step seems redundant, as by default the data has been logged; however, one thing is to record writes to the graph and another to have it ready (and joined) for consumption with analytical tools.
Queries that are trivial in the relational database world, are trickier in Hive as it is not easy to navigate across date partitions due to the volume of the daily data.
We use scripts to programmatically access Hive; these scripts can iterate over partitions with simpler queries; python comes very handy to perform these tasks and it can store the relevant and filtrated data into more manageable tables.
Having the data grouped meaningfully for analysis and enhanced with other pieces of data is critical; during data collection, there is usually a pre aggregation phase, where counts at the user level happens; this step however, goes beyond, and it is what I call “the count of counts”.
In code, this step could and in some cases is performed at the same time as the previous step; there are advanced constructs such as conditional counts — Count (Distinct CASE When Product= ‘SayThanks’ And l7 >0 Then userid Else Null End) — that in effect are multi pass statements; however, I prefer to keep it logically and physically separated for clarity.
My purpose is to optimize for analysis and not necessarily for performance.
At this point, the data is reduced (hopefully) to hundreds of thousands, or perhaps a few million rows.
The reduction occurs because of aggregations, sampling or both.
Examples of reduction include user feedback analysis, since is not necessary to have every single “like” event on a piece of content, but rather counts and counts of users by # of likes for analysis.
Still, a few million rows can be tricky to manipulate on a laptop.
Fortunately, there are a few ways to work with this data outside the batch world of Hive.
One of the most practical is to move the data (usually 1 big deformalized table) to a columnar/relational database that is suitable for interactive analysis.
Moving these sets of data can be done with simple web interfaces, as it’s a straight copy and not a full ETL process.
Then I connect to the data with tools like Tableau, QlickView or MS Excel to get visuals, further aggregations, comparisons, etc.
The advantage to have the data in a columnar database (data that is not completely aggregated) is that further drill down is possible and it can be done fast and in front of my stakeholders.
I start by telling a story with my own observations and this leads to questions and interpretations..
Many questions can be answered with the data, live and in real time, while others may need further (an offline) work.
The presentation usually revolves around descriptive stats for context, observations and insights and validity of the results (statistical significance and confidence intervals), then recommendations.
Some of the new analytical tools include a “story” module, where you can add contextual information to charts, and behave similar to power point.
Unfortunately, these modules are still crude and the default presentation and distribution mechanism is still MS Power Point.