Analytics at eBay. An interview with Tom Fastner.
“How much more complexity can human developers and organizations deal with?”— Tom Fastner, eBay.
Much has already been written about analytics at eBay. But what is the current status? Which data platforms and data management technologies do they currently use? I put a few questions to Tom Fastner, a Senior Member of Technical Staff and Architect at eBay.
RVZ
Q1. What are the main technical challenges for big data analytics at eBay?
Tom Fastner: The primary challenges are:
I/O bandwidth: limited due to configuration of the nodes.
Concurrency/workload management: Workload management tools usually manage the limited resource. For many years EDW systems were bottlenecked on the CPU; big systems are now configured with ample CPU, making I/O the bottleneck. Vendors are starting to put mechanisms in place to manage I/O, but it will take some time to get to the same level of sophistication.
Data movement (loads, initial loads, backup/restores): As new platforms emerge, you need to make data available on more systems, which challenges networks, movement tools and support to ensure scalable operations that maintain data consistency.
Q2. What are the current metrics of eBay's main data warehouses?
Tom Fastner: We have 3 different platforms for Analytics:
A) EDW: Dual systems for transactional (structured) data; Teradata 3.5PB and 2.5 PB spinning disk; 10+ years experience; very high concurrency; good accessibility; hundreds of applications.
B) Singularity: deep Teradata system for semi-structured data; 36 PB spinning disk; lower concurrency than EDW, but can store more data; biggest use case is User Behavior Analysis; largest table is 1.2 PB with ~1.9 trillion rows.
C) Hadoop: for unstructured/complex data; ~40 PB spinning disk; text analytics, machine learning; has the User Behavior data and selected EDW tables; lower concurrency and utilization.
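To put these figures in proportion, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes decimal petabytes (10^15 bytes) and takes the quoted sizes at face value, without knowing whether they are before or after compression. The dictionary labels are illustrative, not eBay's names for the systems.

```python
PB = 10**15  # assumption: decimal petabytes; the interview does not say which unit is meant

# Largest Singularity table: 1.2 PB with roughly 1.9 trillion rows.
table_bytes = 1.2 * PB
table_rows = 1.9e12
avg_row_bytes = table_bytes / table_rows

# Spinning disk per platform in PB, as quoted above (the Hadoop figure is approximate).
platform_pb = {
    "EDW system 1": 3.5,
    "EDW system 2": 2.5,
    "Singularity": 36,
    "Hadoop": 40,
}
total_pb = sum(platform_pb.values())

print(f"average row size: about {avg_row_bytes:.0f} bytes")
print(f"total spinning disk: about {total_pb:.0f} PB")
```

Under these assumptions the largest table averages roughly 632 bytes per row, and the three platforms together hold about 82 PB of spinning disk.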
Q3. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?
Tom Fastner: EDW: We model for the unknown (close to 3rd NF) to provide a solid physical data model suitable for many applications, which limits the number of physical copies needed to satisfy specific application requirements. A lot of scalability and performance is built into the database, but like any shared resource it does require an excellent operations team to fully leverage the capabilities of the platform.
Singularity: The platform is identical to EDW; the only exceptions are limitations in the workload management due to configuration choices. But since we are leveraging the latest database release, we are exploring ways to adopt new storage and processing patterns. Some new data sources are stored in a denormalized form, significantly simplifying data modeling and ETL. On top of that, we developed functions to support the analysis of the semi-structured data. This also enables more sophisticated algorithms that would be very hard, inefficient or impossible to implement with pure SQL. One example is the pathing of user sessions. However, the size of the data requires us to focus more on best practices (develop on small subsets, use a 1% sample, process by day).
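As an illustration of the pathing and best-practice ideas mentioned for Singularity, here is a minimal Python sketch. It is not eBay's implementation, which runs as functions inside Teradata. The event layout `(user_id, timestamp, page)`, the 30-minute inactivity cutoff, the hash-based sampling rule, and all function names are assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff, not from the interview


def in_sample(user_id, percent=1):
    """Deterministically keep roughly `percent`% of users, based on a hash of the id."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def session_paths(events):
    """Turn (user_id, timestamp, page) events into per-session page paths.

    A new session starts when a user has been idle for longer than SESSION_GAP.
    """
    by_user = defaultdict(list)
    for user_id, ts, page in events:
        by_user[user_id].append((ts, page))

    paths = []
    for user_id, hits in by_user.items():
        hits.sort()  # order each user's hits by timestamp
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > SESSION_GAP:
                paths.append((user_id, current))
                current = []
            current.append(page)
        paths.append((user_id, current))
    return paths


def paths_by_day(events, percent=1):
    """Process one calendar day at a time, restricted to a sample of users."""
    by_day = defaultdict(list)
    for event in events:
        if in_sample(event[0], percent):
            by_day[event[1].date()].append(event)
    return {day: session_paths(day_events) for day, day_events in sorted(by_day.items())}


events = [
    ("u1", datetime(2012, 1, 1, 10, 0), "home"),
    ("u1", datetime(2012, 1, 1, 10, 5), "search"),
    ("u1", datetime(2012, 1, 1, 11, 0), "item"),  # 55 minutes idle: new session
    ("u1", datetime(2012, 1, 2, 9, 0), "home"),
]
# percent=100 keeps every user, so the tiny example is not filtered away.
demo = paths_by_day(events, percent=100)
```

Sampling by user rather than by event keeps each sampled user's sessions complete. Processing by day keeps every batch small, at the cost of splitting sessions that cross midnight.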
Hadoop: The emphasis on Hadoop is on optimizing for access. The reusability of data structures (besides “raw” data) is very low.
Q4. How do you handle unstructured data?
Tom Fastner: Unstructured data is handled on Hadoop only. The data is copied from the source systems into HDFS for further processing. We do not store any of that on the Singularity (Teradata) system.
Q5. What kind of data management technologies do you use? What is your experience in using them?
Tom Fastner: ETL: AbInitio, home-grown parallel ingest system.
Scheduling: UC4.
Repositories: Teradata EDW; Teradata Deep system; Hadoop.
BI: Microstrategy, SAS, Tableau, Excel.
Data modeling: Power Designer.
Ad hoc: Teradata SQL Assistant; Hadoop Pig and Hive.
Content Management: Joomla based.
With regard to tool capabilities, there is always something you might wish for. In some cases the tool of choice might even have an important capability, but we have not implemented it (due to resources, complexity, priorities, bugs). The way we try to address required features is through partnerships with some of those vendors.
The most mature partnership is with Teradata; other good examples are Tableau and Microstrategy, where we have regular meetings discussing future enhancements. However, I do not feel comfortable rating all those tools and pointing out shortcomings.
Q6. Cloud computing and open source: Do they play a role at eBay? If yes, how?
Tom Fastner: We do leverage internal cloud functions for Hadoop; no cloud for Teradata.
Open source: committers for Hadoop and Joomla; strong commitment to improve those technologies.
Q7. Glenn Paulley of Sybase in a recent keynote asked the question of how much more complexity can database systems deal with? What is your take on this?
Tom Fastner: I am not familiar with Glenn Paulley’s presentation, so I can’t respond in that context. But in my role at eBay I like to take into account the full picture: How much more complexity can human developers and organizations deal with?
As we learn to deal with more complex data structures and algorithms using specialized solutions, we will look for ways to make them repeatable and available for a wider audience as they mature and demand grows in the market. Specialized solutions do come at a high price (TCO): they require separate hardware and/or software stacks, data replication and computing specialists, and they raise data management issues.
Q8. Anything else you wish to add?
Tom Fastner: eBay is rapidly changing, and analytics is driving many key initiatives like buyer experience, search optimization, buyer protection or mobile commerce. We are investing heavily in new technologies and approaches to leverage new data sources to drive innovation.
———–
Tom Fastner is a Senior Member of Technical Staff, Architect with eBay. In this role Tom works on the architecture of the analytical platforms and related tools. Currently he spends most of his time driving innovation to process Big Data, the Singularity© system. Before Tom joined eBay he was with NCR/Teradata for 11 years. He was involved in the Dual Active program from the beginning in 2003 as the technical lead of the first global implementation of Dual Active. Prior to NCR/Teradata Tom worked for debis Systemhaus with several clients in the aerospace sector. He holds a Masters in Computer Science from the Technical University of Munich, Germany, and is a Teradata Certified Master.
————-
Related Posts
– Measuring the scalability of SQL and NoSQL systems.
– Interview with Jonathan Ellis, project chair of Apache Cassandra.
– Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.
– The evolving market for NoSQL Databases: Interview with James Phillips.