On Data Mining and Data Science. Interview with Charu Aggarwal
“What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging” — Charu Aggarwal.
On Data Mining, Data Science and Big Data, I have interviewed Charu Aggarwal, Research Scientist at the IBM T. J. Watson Research Center, an expert in this area.
Q1. You recently edited two books: Data Classification: Algorithms and Applications and Data Clustering: Algorithms and Applications.
What are the main lessons learned in data classification and data clustering that you can share with us?
Charu Aggarwal: The most important lesson, which is perhaps true for all of data mining applications, is that feature extraction, selection and representation are extremely important. It is all too often that we ignore these important aspects of the data mining process.
Q2. How Data Classification and Data Clustering relate to each other?
Charu Aggarwal: Data classification is the supervised version of data clustering. Data clustering is about dividing the data into groups of similar points. In data classification, examples of groups of points are made available to you. Then, for a given test instance, you are supposed to predict which group this point might belong to.
In the latter case, the groups often have a semantic interpretation. For example, the groups might correspond to fraud/not fraud labels in a credit-card application. In many cases, it is natural for the groups in classification to be clustered as well. However, this is not always the case.
Some methods such as semi-supervised clustering/classification leverage the natural connections between these problems to provide better quality results.
Q3. Can data classification and data clustering be useful also for large data sets and data streams? If yes, how?
Charu Aggarwal: Data clustering is definately useful for large data sets, because clusters can be viewed as summaries of the data. In fact, a particular form of fine-grained clustering, referred to as micro-clustering, is commonly used for summarizing high-volume streaming data in real time. These summaries are then used for many different applications, such as first-story detection, novelty detection, prediction, and so on.
In this sense, clustering plays an intermediate role in enabling other applications for large data sets.
Classification can also be used to generate different types of summary information, although it is a little less common. The reason is that classification is often used as the end-user application, rather than as an intermediate application
like clustering. Therefore, big-data serves as a challenge and as an opportunity for classification.
It serves as a challenge because of obvious computational reasons. It serves as an opportunity because you can build more complex and accurate models with larger data sets without creating a situation, where the model inadvertently overfits to the random noise in the data.
Q4. How do you typically extract “information” from Big Data?
Charu Aggarwal: This is a highly application-specific question, and it really depends on what you are looking for. For example, for the same stream of health-care data, you might be looking for different types of information, depending on whether you are trying to detect fraud, or whether you are trying to discover clinical anomalies. At the end of the day, the role of the domain expert can never be discounted.
However, the common theme in all these cases is to create a more compressed, concise, and clean representation into one of the data types we all recognize and know how to process. Of course, this step is required in all data mining applications, and not just big data applications. What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging.
For example, if you look at Google’s original MapReduce framework, it was motivated by a need to efficiently perform operations that are almost trivial for smaller data sets, but suddenly become very expensive in the big-data setting.
Q5. What are the typical problems and scenarios when you cluster multimedia, text, biological, categorical, network, streams, and uncertain data?
Charu Aggarwal: The heterogeneity of the data types causes significant challenges.
One problem is that the different data types may often be mixed, as a result of which the existing methods can sometimes not be used directly. Some common scenarios in which such data types arise are photo/music/video-sharing (multimedia), healthcare (time-series streams and biological), and social networks. Among these different data types, the probabilistic (uncertain) data types does not seem to have graduated from academia into industry very well. Of course, it is a new area and there is a lot of active research going on. The picture will become clearer in a few years.
Q6. How effective are today ́s clustering algorithms?
Charu Aggarwal: Clustering problems have become increasingly effective in recent years because of advances in high-dimensional methods. In the past, when the data was very high-dimensional most existing methods work poorly because of locally irrelevant attributes and concentration effects. These are collectively referred to as the curse of dimensionality. Techniques such as subspace and projected clustering have been introduced to discover clusters in lower dimensional views of the data. One nice aspect of this approach is that some variations of it are highly interpretable.
Q7. What is in common between pattern recognition, database analytics, data mining, and machine learning?
Charu Aggarwal: They really do the same thing, which is that of analyzing and gleaning insights from data. It is just that the styles and emphases are different in various communities. Database folks are more concerned
about scalability. Pattern recognition and machine learning folks are somewhat more theoretical. The statistical folks tend to use their statistical models. The data mining community is the most recent one, and it was formed to create a common meeting ground for these diverse communities.
The first KDD conference was held in 1995, and we have come a long way since then towards integration. I believe that the KDD conference has played a very major role in the amalgamation of these communities. Today, it is actually possible for the folks from database and machine learning communities to be aware of each other’s work. This was not quite true 20 years ago.
Q8. What are the most “precise” methods in data classification?
Charu Aggarwal: I am sure that you will find experts who are willing to swear by a particular model. However, each model comes with a different set of advantages over different data sets. Furthermore, some models, such as univariate decision trees and rule-based methods, have the advantage of being interpretable even when they are outperformed by other methods. After all, analysts love to know about the “why” aside from the “what.”
While I cannot say which models are the most accurate (highly data specific), I can certainly point to the most “popular” ones today from a research point of view. I would say that SVMs, and neural networks (deep learning) are the most popular classification methods. However, my personal experience has been mixed.
While I have found SVMs to work quite well across a wide variety of settings, neural networks are generally less robust. They can easily over fit to noise or show unstable performance over small ranges of parameters. I am watching the debate over deep learning with some interest to see how it plays out.
Q9. When to use Mahout for classification? and What is the advantage of using Mahout for classification?
Charu Aggarwal: Apache Mahout is a scalable machine learning environment for data mining applications. One distinguishing feature of Apache Mahout is that it builds on top of distributed infrastructures like MapReduce, and enables easy building of machine learning applications. It includes libraries of various operations and applications.
Therefore, it reduces the effort of the end user beyond the basic MapReduce framework. It should be used in cases, where the data is large enough to require the use of such distributed infrastructures.
Q10. What are your favourite success stories in Data Classifications and/or Data Clustering?
Charu Aggarwal: One of my favorite success stores is in the field of high dimensional data, where I explored the effect of locally irrelevant dimensions and concentration effects on various data mining algorithms.
I designed a suite of algorithms for such high-dimensional tasks as clustering, similarity search, and outlier detection.
The algorithms continue to be relevant even today, and we have even generalized some of these results to big-data (streaming) scenarios and other application domains, such as the graph and text domains.
Qx Anything else you wish to add?
Charu Aggarwal: Data mining and data sciences are at exciting cross-roads today. I have been working in this field since 1995, and I have never seen as much excitement about data science in my first 15 years, as I have seen
in the last 5. This is truly quite amazing!
Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson Research Center in Yorktown Heights, New York.
He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from Massachusetts Institute of Technology in 1996.
His research interest during his Ph.D. years was in combinatorial optimization (network flow algorithms), and his thesis advisor was Professor James B. Orlin.
He has since worked in the field of performance analysis, databases, and data mining. He has published over 200 papers in refereed conferences and journals, and has applied for or been granted over 80 patents. He is author or editor of nine books.
Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research.
He has served on the program committees of most major database/data mining conferences, and served as program vice-chairs of the SIAM Conference on Data Mining, 2007, the IEEE ICDM Conference, 2007, the WWW Conference 2009, and the IEEE ICDM Conference, 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering Journal from 2004 to 2008. He is an associate editor of the ACM TKDD Journal, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal.
He is a fellow of the IEEE for “contributions to knowledge discovery and data mining techniques”, and a life-member of the ACM.
– MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
Appeared in:OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. Download: PDF Version
Follow ODBMS.org on Twitter: @odbmsorg