Google Fusion Tables. Interview with Alon Y. Halevy
“The main challenge is that it’s hard for people who have data, but not database expertise, to manage their data.” –Alon Y. Halevy.
Google Fusion Tables was launched on June 9th, 2009. I wanted to know what happened since then. I have therefore interviewed Dr. Alon Y. Halevy, who heads the Structured Data Group at Google Research.
Q1. On your web page you write that your job at Google is “to make data management tools collaborative and much easier to use, and to leverage the incredible collections of structured data on the Web.” What are the main problems and challenges you are currently facing?
Halevy: The main challenge is that it’s hard for people who have data, but not database expertise, to manage their data, share it and create visualizations. Data management requires too much up-front effort, and that forms a significant impediment to data sharing on a large scale.
Q2. Your group is responsible for Google Fusion Tables, “a service for managing data in the cloud that focuses on ease of use, collaboration and data integration.” What exactly is Google Fusion Tables? What data management challenges does it solve, and how?
Halevy: Fusion Tables enables you to easily upload data sets (e.g., spreadsheets and CSV files) to the cloud and manage them. Fusion Tables makes it easy to create insightful visualizations (e.g., maps, timelines and other charts) and to share these with collaborators or with the public at large.
In addition, Fusion Tables enables merging data sets that belong to different owners. The true power of data is realized when we can combine data from multiple sources and draw conclusions that were impossible earlier.
As an example, this visualization combines data about earthquakes and data about the location of nuclear power reactors, showing what areas are prone to disasters similar to the one experienced in Japan in March 2011.
Q3. Fusion Tables enables users to upload tabular data files. Is there a limit on the size of users’ data files? If so, what is it?
Halevy: Right now we allow 100MB per table and 250MB per user, but that’s not a technical limitation, just a limitation of our free offering.
Q4. Google Fusion Tables was launched on June 9th, 2009. What has happened since then?
Halevy: We’ve continually improved our service, based largely on needs expressed by our users. In particular, we’ve made our map visualizations much more powerful, and we’ve developed APIs for programmers.
Q5. What data sources do you consider and how do you integrate them together?
Halevy: We do not prescribe any data sources. All the sources you can obtain in Fusion Tables come from users who’ve explicitly marked their data sets as public. Data sources are combined with a merge operation that’s similar to a ‘join’ in SQL.
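The merge operation Halevy describes can be illustrated with a short, self-contained Python sketch. The table contents and the `merge` helper below are hypothetical, purely for illustration; this is not the Fusion Tables API, just the equi-join idea behind it:

```python
# Minimal sketch of the kind of merge Fusion Tables performs: an
# equi-join of two tables on a shared key column. The table contents
# here are illustrative, not real Fusion Tables data.

earthquakes = [
    {"region": "Tohoku", "magnitude": 9.0},
    {"region": "Chile", "magnitude": 8.8},
]
reactors = [
    {"region": "Tohoku", "reactor": "Fukushima Daiichi"},
]

def merge(left, right, key):
    """Equi-join two lists of row dicts on the given key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    merged = []
    for row in left:
        for match in index.get(row[key], []):
            merged.append({**row, **match})
    return merged

result = merge(earthquakes, reactors, "region")
# Only rows whose key value appears in both tables survive the merge,
# which is how the earthquake/reactor visualization mentioned above
# could be assembled from two independently owned tables.
```

As in SQL, rows with no matching key in the other table simply drop out of the merged view.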
Q6. How do you ensure good performance in the presence of millions of user tables? In fact, do you have any benchmark results for that?
Halevy: Given that we’re the first ones to pursue storing such a large number of tables in a single repository, there isn’t an established benchmark. The technical description of how we do it appears in two short papers that we published in SIGMOD 2010 and SoCC 2010.
Q7. One of the main features of Fusion Tables is that it “allows multiple users to merge their tables into one, even if they do not belong to the same organization or were not aware of each other when they created the tables. A table constructed by merging (by means of equi-joins) multiple base tables is a view.” What about data consistency, duplicates and updates? Do you handle such cases?
Halevy: The data belongs to the users, not to us, so they have to ensure that it is up to date and does not contain duplicates. Of course, when you combine data from multiple sources you may get inconsistencies. Hopefully, the visualizations we provide will enable you to discover them quickly and resolve them.
Q8. How does Google Fusion Tables relate to Google Maps? What are the main challenges with respect to data management that you face when dealing with Big Data coming from Google Maps?
Halevy: Fusion Tables relies on a lot of the Google Maps infrastructure to display maps. The challenge when displaying maps from large data sets is that you need to do much of the computation on the server side, so that the client is not overwhelmed with points to render, while at the same time keeping the user experience snappy and interactive.
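One common way to keep the client from being overwhelmed is to thin points on the server before sending them. The grid-based scheme below is an illustrative Python sketch under that assumption, not Google's actual server-side algorithm:

```python
# Illustrative sketch (not Google's real implementation) of server-side
# point thinning: bucket (lat, lng) points into a coarse grid and send
# the client one representative per cell plus a count, instead of
# millions of individual markers.

from collections import defaultdict

def thin_points(points, cell_size=1.0):
    """Group points into grid cells of cell_size degrees; return one
    representative point and a count per occupied cell."""
    cells = defaultdict(list)
    for lat, lng in points:
        cell = (int(lat // cell_size), int(lng // cell_size))
        cells[cell].append((lat, lng))
    # One representative (the first point) per cell, with the number of
    # points it stands for.
    return [(pts[0], len(pts)) for pts in cells.values()]

points = [(35.1, 139.2), (35.3, 139.4), (48.8, 2.3)]
thinned = thin_points(points)
# The two nearby points fall into the same cell; the third stays alone.
```

At higher zoom levels a real server would shrink `cell_size` so that more detail appears as the user zooms in.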
Q9. Why and how do you manage data in the Cloud?
Halevy: Managing data in the cloud is easier for many data owners because they do not have to maintain their own database system (which requires hiring database experts). Putting data in the cloud is also a key facilitator in order to share data with others, including people outside your organization.
We manage the data using some of the Google infrastructure such as BigTable and some layers built over it.
Q10. When you started Google Fusion Tables you did not support complex SQL queries or high throughput transactions. Why? How is the situation now?
Halevy: We still don’t support all forms of SQL queries, and we’re not in the race to become the database system supporting the highest transaction throughput. There are plenty of products on the market that serve those needs. Our goal from the start was to help under-served users with data management tasks that typically do not require complex SQL queries or high-throughput transactions, but rather emphasize data sharing and visualization.
Q11. Fusion Tables aims to “make it easier for people to create, manage and share structured data on the Web.” What about handling unstructured data on the Web?
Halevy: There are plenty of other tools for that, including blogs, site creation tools and cloud-based word processors.
Q12. You write “…to facilitate collaboration, users can conduct fine-grained discussions on the data.” What does it mean?
Halevy: This means that users can attach comments to individual rows in a table, individual columns and even individual cells. If you are collaborating on a large data set, it is not enough to put all the comments in one big blob around the table. You really need to attach them to specific pieces of data.
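The idea of attaching comments to specific pieces of data can be sketched with a simple keyed structure. This is a hypothetical illustration of the concept, not Fusion Tables' actual storage format; the `comment_on` helper and target encoding are assumptions:

```python
# Hypothetical sketch of fine-grained discussions: each comment is keyed
# by the exact piece of data it targets -- a whole row, a whole column,
# or a single cell -- so discussion threads stay attached to the data.

from collections import defaultdict

comments = defaultdict(list)

def comment_on(target, author, text):
    """target is ('row', i), ('col', name), or ('cell', i, name)."""
    comments[target].append((author, text))

comment_on(("row", 7), "alice", "This earthquake is a duplicate entry.")
comment_on(("cell", 7, "magnitude"), "bob", "Source reports 8.9, not 9.0.")

# Looking up comments for a cell retrieves only the thread about that
# cell, instead of one big blob of comments about the whole table.
cell_thread = comments[("cell", 7, "magnitude")]
```

Keying discussions this finely is what lets collaborators resolve disputes about a single value without wading through comments about the rest of the table.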
Q13. Is Google Fusion Tables open source? Do you have plans to open up the API to developers?
Halevy: We have had an API for almost two years now. Fusion Tables is not open source; it’s a Google service built on top of Google’s infrastructure.
Q14. Fusion Tables API provides developers with a way of extending the functionality of the platform. What are the main extensions being implemented by the community so far?
Halevy: There have been tools developed for importing data from different formats into Fusion Tables (e.g., Shape to Fusion). There is also a tool for importing Fusion Tables into the R statistical package and writing the results back into Fusion Tables.
The API is used mostly to tailor applications using Fusion Tables to specific needs.
Q15. How does Fusion Tables differ from Amazon’s SimpleDB?
Halevy: Fusion Tables is designed primarily as a user-facing tool, not so much for developers. I should emphasize that Fusion Tables is not part of Google Labs anymore — it “graduated” more than a year ago.
Q16. In 2005 you co-authored a paper introducing the concept of “Dataspace Systems,” that is, systems that provide “pay-as-you-go data management based on best-effort services.” Is this still relevant? What exactly are these “Dataspace Systems”?
Halevy: Yes, it is. In fact, Fusion Tables is one example of a dataspace system. Fusion Tables does not require you to create a schema before entering the data, and it tries to infer the data types of the columns in order to offer relevant visualizations.
The collaborative aspects of Fusion Tables make it easier for a group of collaborators to improve the quality of the data and combine it with others.
Dataspace systems are still in their infancy, and we have a long way to go to realize the full vision.
Q17. You have worked on the Deep Web, the Surface Web, and now Fusion Tables: how do these three areas relate to each other?
What is your next project?
Halevy: All of these projects have the same overall goal: to make structured data on the Web more discoverable, so users can enhance it, combine it with data from other sources, and create and publish interesting new data sets.
The Deep Web project had the goal of extracting data sets from behind HTML forms and making the data discoverable in search.
The Surface Web (WebTables) project’s goal was to identify interesting data sets that are on the Web but are not being treated in the most optimal way. Fusion Tables provides a tool for users to upload their own data and publish data sets that can be crawled by search engines.
These three projects — and filling in the gaps between them — will keep me busy for a while to come!
Dr. Alon Halevy heads the Structured Data Group at Google Research. Prior to that, he was a Professor of Computer Science at the University of Washington, where he founded the Database Research Group. From 1993 to 1997 he was a Principal Member of Technical Staff at AT&T Bell Laboratories (later AT&T Laboratories). He received his Ph.D. in Computer Science from Stanford University in 1993, and his Bachelor’s degree in Computer Science and Mathematics from the Hebrew University of Jerusalem in 1988. Dr. Halevy was elected a Fellow of the Association for Computing Machinery in 2006.
Google Fusion Tables: Data Management, Integration and Collaboration in the Cloud (link .pdf).
Hector Gonzalez, Alon Halevy, Christian S. Jensen, Anno Langen, Jayant Madhavan, Rebecca Shapley, Warren Shen, Google Inc. In SoCC ’10, June 10-11, 2010, Indianapolis, Indiana, USA.
Megastore: A Scalable Data System for User Facing Applications.
J. Furman, J. S. Karlsson, J.-M. Leon, A. Lloyd, S. Newman, and P. Zeyliger. In SIGMOD, 2008.
Megastore: Providing Scalable, Highly Available Storage for Interactive Services (link .pdf).
Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh, Google, Inc. In 5th Biennial Conference on Innovative Data Systems Research (CIDR ’11), January 9-12, 2011, Asilomar, California, USA.