Taming Text: How to Find, Organize, and Manipulate It
Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris
Softbound print: September 2012 (est.) | 350 pages
Manning, ISBN: 193398838X
Chapter Title: Clustering Document Collections with Apache Mahout (Pages: 7)
Clustering is an unsupervised task (it requires no human intervention, such as annotating training text) that automatically puts related content into buckets, helping you organize your content or reduce the amount of content that you must process manually. This article, based on chapter 6 of Taming Text, looks at how Apache Mahout can be used to cluster large collections of documents into buckets.
Download Clustering Document Collections with Apache Mahout (.PDF)