Large-Scale Image Classification using High Performance Clustering
Bingjing Zhang, Judy Qiu, Stefan Lee, David Crandall
Department of Computer Science and Informatics, Indiana University, Bloomington
Many areas of computer science, including machine learning, artificial intelligence, and computer vision, are being revolutionized by the enormous volume of data available on the Internet. Unfortunately, scaling up algorithms in these fields is difficult because they require iterative computation at unprecedented scale. Often an individual iteration can be specified as a MapReduce computation, which leads to the iterative MapReduce programming model for efficient execution of data-intensive iterative computations. We propose the Map-Collective model as a generalization of our earlier Twister system that is interoperable between HPC and cloud environments. In this paper, we study the problem of large-scale clustering and apply it to features extracted from a collection of 7 million social images, with each feature represented as a point in a high-dimensional vector space, grouping the features into 1 million clusters. This K-means application needs five stages in each iteration: Broadcast, Map, Shuffle, Reduce, and Combine. We present new collective communication approaches optimized for large data transfers. Furthermore, the application needs communication patterns beyond those familiar from MapReduce, so we develop collectives that integrate capabilities developed by the MPI and MapReduce communities. We demonstrate that a topology-aware, pipeline-based broadcasting method gives better performance than both MPI and other (iterative) MapReduce systems. We present early results of an end-to-end computer vision application and evaluate the quality of the resulting image classifications, showing that increasing the number of feature clusters leads to improved classifier accuracy.
Keywords. Social Images; Data Intensive; High Dimension; Iterative MapReduce; Collective Communication
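The five stages named in the abstract (Broadcast, Map, Shuffle, Reduce, Combine) can be illustrated with a minimal single-process sketch of one K-means iteration. This is not the Twister or Map-Collective API: the function names `nearest`, `map_task`, and `kmeans_iteration` are hypothetical, the data partitions are plain Python lists, and the "communication" between stages is ordinary function calls. In the system the abstract describes, these stages would run across many nodes.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def map_task(partition, centroids):
    """Map: emit one (cluster_id, (vector_sum, count)) pair per non-empty cluster."""
    partial = {}
    for point in partition:
        k = nearest(point, centroids)
        vec_sum, count = partial.get(k, ([0.0] * len(point), 0))
        partial[k] = ([s + p for s, p in zip(vec_sum, point)], count + 1)
    return list(partial.items())


def kmeans_iteration(partitions, centroids):
    """One K-means iteration over a list of data partitions (hypothetical helper)."""
    # Broadcast: every map task receives the full centroid list.
    map_outputs = [map_task(part, centroids) for part in partitions]

    # Shuffle: group the partial results by cluster id.
    grouped = defaultdict(list)
    for output in map_outputs:
        for k, partial in output:
            grouped[k].append(partial)

    # Reduce: merge partial sums and counts for each cluster, then average.
    reduced = {}
    for k, partials in grouped.items():
        total = [sum(col) for col in zip(*(vec for vec, _ in partials))]
        n = sum(count for _, count in partials)
        reduced[k] = [t / n for t in total]

    # Combine: assemble the new centroid list; an empty cluster keeps its old centroid.
    return [reduced.get(k, centroids[k]) for k in range(len(centroids))]
```

In this sketch the centroid list is the only data sent to every map task, which suggests why broadcasting matters at the scale the abstract mentions: with 1 million clusters in a high-dimensional space, that list is large and must reach every worker in every iteration.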