Books 4

The Text Mining Handbook

by Ronen Feldman and James Sanger

Cambridge University Press, 2007

Reviewed by L. Venkata Subramaniam

Text mining today covers a broad range of topics. This handbook gives a high-level perspective of text mining by covering many of the important topics. The handbook is aimed at a wide spectrum of audiences comprising students, academic researchers and professional practitioners.

The first two chapters of the book provide an introduction to text mining and the operations involved in doing text mining. Chapter I presents text mining definitions. It also gives the general architecture of a text mining system. Chapter II presents core text mining operations. This chapter covers various pattern-discovery algorithms.

The next six chapters present basic preprocessing techniques in text mining. Chapter III presents an extremely brief introduction to linguistic preprocessing techniques in text mining. Chapter IV covers text categorization. Chapter V looks at text clustering. Chapter VI covers information extraction (IE). These chapters cover the main definitions and techniques. Chapter VII covers probabilistic models for information extraction, and Chapter VIII applies those models to different IE tasks. In particular, these two chapters treat hidden Markov models, stochastic context-free grammars, and maximal entropy from the mathematical perspective and then show their application to IE.

The next two chapters cover the user interface part of text mining systems. Chapter IX looks at aspects related to browsing large text collections. Chapter X covers visualization approaches to view the text document collections and the results obtained from various text mining operations on document collections.

Chapter XI covers link analysis, presenting techniques for analyzing large networks of entities. The first eight chapters discuss how entities can be extracted from text; in this chapter the focus is on finding specific patterns within the network of entities.

Finally, Chapter XII presents real-world applications: text mining systems in the areas of corporate finance, patent research, and life sciences.

The Appendix explains DIAL (declarative information analysis language). This is a dedicated information extraction language.

There are notes at the end of each chapter that discuss related work. These are very helpful for placing the work of the chapter in context and for looking up related work to gain a better understanding. There is a common bibliography at the end of the book.

One topic that I think the authors should have covered, but did not cover at all, is text mining in the presence of noise. Real-world user-generated text data is noisy, and today it is important to deal with it. Blogs, newsgroup postings, emails, and other such spontaneously written text, found in abundance, are very noisy. Further, there is also deliberately added noise in the form of spam and splogs. From my perspective as a text mining practitioner, I would have liked to see some coverage of this. But that is something for the authors to add in the next edition.

In their preface, the authors say that they have tried to blend theory and practice by providing many real-life scenarios that show how the different techniques are used in practice. I think they have largely succeeded in doing that. They have addressed the needs of both developers and users of text mining systems.

My recommendation to readers is to buy the book. It is definitely worth having on your bookshelf as a handy reference.

Book Reviews Published in

the IAPR Newsletter

Dynamic Vision for Perception and Control of Motion

by Dickmanns

(see review in this issue)

Bioinformatics

by Polanski and Kimmel

(see review in this issue)

Introduction to clustering large and high-dimensional data

by Kogan

(see review in this issue)

Information Theory, Inference,