Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining

ChengXiang Zhai, Sean Massung
ISBN: 9781970001167 | PDF ISBN: 9781970001174
Hardcover ISBN: 9781970001198
Copyright © 2016 | 471 Pages | Publication Date: July, 2016


This book provides a systematic introduction to a wide range of statistical and heuristic approaches to the management and analysis of text data. It emphasizes the most useful knowledge and skills required to build a variety of practically useful text information systems. Because humans can understand natural languages far better than computers can, effective involvement of humans in a text information system is generally needed, and text information systems often serve as intelligent assistants for humans.

Depending on how a text information system collaborates with humans, we distinguish two kinds of text information systems. The first kind is information retrieval systems, which include search engines and recommender systems; they help users find, in a large collection of text data, the text most relevant to a specific application problem, thus turning big raw text data into much smaller relevant text data that humans can process more easily. The second kind is text mining application systems; they help users analyze patterns in text data to extract and discover actionable knowledge that is directly useful for task completion or decision making, thus providing more direct task support for users.

The book covers the major concepts, techniques, and ideas in information retrieval and text data mining from a practical viewpoint, and includes many hands-on exercises designed with a companion software toolkit (i.e., MeTA) to help readers learn how to apply techniques of information retrieval and text mining to real-world text data and how to experiment with and improve some of the algorithms for interesting application tasks. The book can be used as a textbook for computer science undergraduates and graduates and for library and information scientists, or as a reference book for practitioners working on relevant problems in managing and analyzing text data.

Table of Contents

Part I: Overview and Background
1. Introduction
2. Background
3. Text Data Understanding
4. MeTA: A Unified Toolkit for Text Data Management and Analysis

Part II: Text Data Access
5. Overview of Text Data Access
6. Retrieval Models
7. Feedback
8. Search Engine Implementation
9. Search Engine Evaluation
10. Web Search
11. Recommender Systems

Part III: Text Data Analysis
12. Overview of Text Data Analysis
13. Word Association Mining
14. Text Clustering
15. Text Categorization
16. Text Summarization
17. Topic Analysis
18. Opinion Mining and Sentiment Analysis
19. Joint Analysis of Text and Structured Data

Part IV: Unified Text Data Management Analysis System
20. Toward a Unified System for Text Management and Analysis

About the Author(s)

ChengXiang Zhai, University of Illinois at Urbana-Champaign
ChengXiang Zhai is a Professor of Computer Science and Willett Faculty Scholar at the University of Illinois at Urbana-Champaign, where he is also affiliated with the Graduate School of Library and Information Science, Institute for Genomic Biology, and Department of Statistics. He received a Ph.D. in Computer Science from Nanjing University in 1990, and a Ph.D. in Language and Information Technologies from Carnegie Mellon University in 2002. He worked at Clairvoyance Corp. as a Research Scientist and then Senior Research Scientist from 1997 to 2000.

His research interests include information retrieval, text mining, natural language processing, machine learning, biomedical and health informatics, and intelligent education information systems. He has published over 200 research papers in major conferences and journals. He is an Associate Editor for Information Processing and Management and previously served as an Associate Editor of ACM Transactions on Information Systems, and on the editorial board of Information Retrieval Journal.
He has served as a conference program co-chair of ACM CIKM 2004, NAACL HLT 2007, ACM SIGIR 2009, ECIR 2014, ICTIR 2015, and WWW 2015, and as conference general co-chair for ACM CIKM 2016. He is an ACM Distinguished Scientist and a recipient of multiple awards, including the ACM SIGIR 2004 Best Paper Award, the ACM SIGIR 2014 Test of Time Paper Award, Alfred P. Sloan Research Fellowship, IBM Faculty Award, HP Innovation Research Program Award, Microsoft Beyond Search Research Award, and the Presidential Early Career Award for Scientists and Engineers (PECASE).

Sean Massung, University of Illinois at Urbana-Champaign
Sean Massung is a Ph.D. candidate in computer science at the University of Illinois at Urbana-Champaign, where he also received both his B.S. and M.S. degrees. He is a co-founder of MeTA and uses it in all of his research. He has been an instructor for CS 225: Data Structures and Programming Principles, CS 410: Text Information Systems, and CS 591txt: Text Mining Seminar. He is included in the 2014 List of Teachers Ranked as Excellent at the University of Illinois and has received an Outstanding Teaching Assistant Award and a CS@Illinois Outstanding Research Project Award. He has given talks at Jump Labs Champaign and at UIUC for the Data and Information Systems Seminar, Intro to Big Data, and the Teaching Assistant Seminar. His research interests include text mining applications in information retrieval, natural language processing, and education.


In general terms, the authors typically provide verbose descriptions of the reasons behind the design of specific techniques, with numerical examples and illustrative figures from the slides of two massive open online courses (MOOCs) offered by the first author on Coursera. They also provide specific sections that describe in detail the proper way to evaluate every different kind of technique, a key factor to be taken into account when applying the discussed techniques in practice.

The book, however, is not always self-contained, since its broad scope in a limited number of pages entails an unavoidable depth/breadth tradeoff. Most basic techniques can be implemented just by following the instructions and guidelines in the text, although interested readers might need to resort to the bibliographic references if they want to gain a thorough understanding of the many advanced techniques. Fortunately, the authors include some bibliographic notes and very selective suggestions for further reading at the end of each chapter, instead of the encyclopedic collection of references common in many other textbooks. Although readers will not find detailed coverage of NLP techniques and some chapters might seem lacking in depth, advanced undergraduate students might find this book to be a valuable reference for getting acquainted with both information retrieval and text mining in a single volume, a worthwhile achievement for a 500-page textbook.
Fernando Berzal – In “Computing Reviews”
