The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this paper, we discuss the evolution of our infrastructure and the development of capabilities for data mining on “big data”. One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life “in the trenches” is occupied by much preparatory work that precedes the application of data mining algorithms and by the substantial effort that follows to turn preliminary models into robust solutions. In this context, we discuss two topics. First, schemas play an important role in helping data scientists understand petabyte-scale data stores, but they are insufficient to provide an overall “big picture” of the data available to generate insights. Second, we observe that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated into production workflows; we refer to this as “plumbing”. This paper has two goals. For practitioners, we hope to share our experiences to flatten bumps in the road for those who come after us. For academic researchers, we hope to provide a broader context for data mining in production environments, pointing out opportunities for future work.
SIGKDD Explorations Volume 14, Issue 2