Text Mining: Finding Nuggets in Mountains of Textual Data

Jochen Dörre, Peter Gerstl, and Roland Seiffert

Presentation Transcript


  1. Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dörre, Peter Gerstl, and Roland Seiffert

  2. Overview • Introduction to Mining Text • How Text Mining differs from data mining • Mining Within a Document: Feature Extraction • Mining in Collections of Documents: Clustering and Categorization • Text Mining Applications • Exam Questions/Answers

  3. Introduction to Mining Text

  4. Reasons for Text Mining

  5. Corporate Knowledge “Ore” • Email • Insurance claims • News articles • Web pages • Patent portfolios • Customer complaint letters • Contracts • Transcripts of phone calls with customers • Technical documents

  6. Challenges in Text Mining • Information is in unstructured textual form. • Not readily accessible to be used by computers. • Dealing with huge collections of documents

  7. Two Mining Phases • Knowledge Discovery: Extraction of codified information (features) • Information Distillation: Analysis of the feature distribution

  8. How Text Mining Differs from Data Mining

  9. Comparison of Procedures • Data Mining: Identify data sets → Select features → Prepare data → Analyze distribution • Text Mining: Identify documents → Extract features → Select features by algorithm → Prepare data → Analyze distribution

  10. IBM Intelligent Miner for Text • SDK: Software Development Kit • Contains necessary components for “real text mining” • Also contains more traditional components: • IBM Text Search Engine • IBM Web Crawler • drop-in Intranet search solutions

  11. Mining Within a Document: Feature Extraction

  12. Feature Extraction • To recognize and classify significant vocabulary items in unrestricted natural language texts. • Let’s see an example…

  13. Example of Vocabulary Found • Certificate of deposit • CMOs • Commercial bank • Commercial paper • Commercial Union Assurance • Commodity Futures Trading Commission • Consul Restaurant • Convertible bond • Credit facility • Credit line • Debt security • Debtor country • Detroit Edison • Digital Equipment • Dollars of debt • End-March • Enserch • Equity warrant • Eurodollar • …

  14. Implementation of Feature Extraction relies on • Linguistically motivated heuristics • Pattern matching • Limited amounts of lexical information, such as part-of-speech information. • Not used: huge amounts of lexicalized information • Not used: in-depth syntactic and semantic analyses of texts
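The pattern-matching idea on this slide can be illustrated with a small sketch. The regular expression below is a hypothetical heuristic (a run of two or more capitalized words counts as a candidate name or term); it is an assumption made for illustration and not the rule set used by Intelligent Miner for Text.

```python
import re

# Hypothetical heuristic: a candidate name or multiword term is a run of
# two or more capitalized words. Illustrative only, not the IBM rule set.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)+\b")


def extract_candidates(text):
    """Return capitalized multiword sequences in order of first appearance."""
    seen = []
    for match in CAPITALIZED_RUN.finditer(text):
        # Collapse any internal whitespace so variants compare equal.
        phrase = " ".join(match.group(0).split())
        if phrase not in seen:
            seen.append(phrase)
    return seen


sample = (
    "Detroit Edison and Digital Equipment were mentioned by the "
    "Commodity Futures Trading Commission."
)
candidates = extract_candidates(sample)
print(candidates)
```

A shallow pattern like this needs no syntactic or semantic analysis, which is what keeps processing fast, but it will also miss lowercase terms such as "credit line" that the real tool reports.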

  15. Goals of Feature Extraction • Very fast processing to be able to deal with mass data • Domain-independence for general applicability

  16. Extracted information categories • Names of persons, organizations and places • Multiword terms • Abbreviations • Relations • Other useful stuff

  17. Canonical Forms • Normalized forms of dates, numbers, … • Allows applications to use information very easily • Abstracts from different morphological variants of a single term
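As a rough sketch of what normalizing dates and numbers can look like, the functions below map a few surface forms to one canonical form. The accepted date formats and the function names are assumptions chosen for illustration; the slides do not specify which forms the tool recognizes.

```python
from datetime import datetime

# Assumed input formats for illustration; a real system would cover many more.
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y", "%m/%d/%Y")


def canonical_date(text):
    """Return an ISO 8601 date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def canonical_number(text):
    """Strip thousands separators and return an int or a float."""
    cleaned = text.replace(",", "").strip()
    return float(cleaned) if "." in cleaned else int(cleaned)


print(canonical_date("March 3, 1999"), canonical_date("3 March 1999"))
print(canonical_number("1,250,000"))
```

Once every variant maps to the same canonical value, an application can compare or aggregate the values directly instead of handling each spelling separately.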

  18. Canonical Names • The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document • Reduces ambiguity of variants • Variants: President Bush, Mr. Bush, George Bush • Canonical name: George Bush
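One simple way to approximate the "most explicit variant" idea is to drop titles, group variants by their last word, and keep the longest remaining form. The title list and the grouping rule below are assumptions for illustration only; the actual tool applies richer naming heuristics.

```python
# Hypothetical list of titles to ignore when comparing name variants.
TITLES = {"mr.", "mrs.", "ms.", "dr.", "president"}


def strip_titles(name):
    """Return the words of a name with known titles removed."""
    return [word for word in name.split() if word.lower() not in TITLES]


def canonical_names(variants):
    """Map each variant to the longest title-free variant sharing its last word."""
    best = {}
    for variant in variants:
        words = strip_titles(variant)
        if not words:
            continue
        key = words[-1].lower()
        if key not in best or len(words) > len(best[key]):
            best[key] = words
    result = {}
    for variant in variants:
        words = strip_titles(variant)
        if words:
            result[variant] = " ".join(best[words[-1].lower()])
    return result


mapping = canonical_names(["President Bush", "Mr. Bush", "George Bush"])
print(mapping)
```

Grouping by last word alone would wrongly merge two different people with the same surname, which is one reason the unit of context for extraction is a single document.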

  19. Disambiguating Proper Names: Nominator Program

  20. Principles of Nominator Design • Apply heuristics to strings, instead of interpreting semantics. • The unit of context for extraction is a document. • The unit of context for aggregation is a corpus. • The heuristics represent English naming conventions.

  21. Mining in Collections of Documents: Clustering and Categorization

  22. 1. Clustering • Partitions a given collection into groups of documents similar in contents, i.e., in their feature vectors. • Two clustering engines • Hierarchical Clustering tool • Binary Relational Clustering tool • Both tools help to identify the topic of a group by listing terms or words that are common in the documents in the group. • Together, this provides an overview of the contents of a collection of documents.
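The sketch below shows the general idea of grouping documents by the similarity of their feature vectors and then listing the terms shared by each group. It uses bag-of-words counts, cosine similarity, and single-linkage agglomerative merging with an arbitrary threshold; all of these are assumptions for illustration and are not the algorithms of the two IBM clustering tools.

```python
import math
from collections import Counter


def feature_vector(text):
    """Bag-of-words term counts; stands in for extracted features."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Merge the two most similar clusters until none reach the threshold."""
    vectors = [feature_vector(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        # Single linkage: similarity of the closest pair of members.
        return max(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        score, x, y = max(
            (link(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if score < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


def common_terms(docs, members):
    """Terms that occur in every document of a cluster."""
    term_sets = [set(feature_vector(docs[i])) for i in members]
    return sorted(set.intersection(*term_sets))


docs = [
    "credit line approved by commercial bank",
    "commercial bank extends credit line",
    "printer driver crashes on startup",
    "driver update fixes printer crashes",
]
groups = cluster(docs)
for members in groups:
    print(members, common_terms(docs, members))
```

Listing the shared terms per group mirrors how the tools help a user identify the topic of each cluster.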

  23. Groups documents similar in their feature vectors

  24. 2. Categorization • Topic Categorization Tool • Assigns documents to preexisting categories (“topics” or “themes”) • Categories are chosen to match the intended use of the collection • Categories are defined by providing a set of sample documents for each category

  25. 2. Categorization (cont.) • This “training” phase produces a special index, called the categorization schema • The categorization tool returns a list of category names and confidence levels for each document • If the confidence level is low, the document is put aside for a human categorizer
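The training and categorization steps described on these slides can be sketched with a simple centroid classifier. Here the "schema" is just one summed term vector per category, the confidence level is a cosine similarity, and the `min_confidence` cutoff is an arbitrary value; all three are assumptions for illustration, since the slides do not describe the internals of the Topic Categorization tool.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build a toy 'schema': one summed term vector per category."""
    schema = {}
    for category, sample_docs in samples.items():
        total = Counter()
        for doc in sample_docs:
            total.update(vectorize(doc))
        schema[category] = total
    return schema


def categorize(schema, text, min_confidence=0.2):
    """Return (ranked [(category, confidence)], needs_human_review)."""
    vec = vectorize(text)
    ranked = sorted(
        ((cosine(vec, centroid), name) for name, centroid in schema.items()),
        reverse=True,
    )
    results = [(name, round(score, 2)) for score, name in ranked]
    needs_review = results[0][1] < min_confidence
    return results, needs_review


samples = {
    "billing": ["invoice charged twice refund", "refund for wrong invoice amount"],
    "technical": ["printer driver crashes", "software crashes on startup"],
}
schema = train(samples)
print(categorize(schema, "refund my invoice please"))
print(categorize(schema, "hello world"))
```

The second call matches no category, so it is flagged for a human categorizer, as the last bullet of the slide describes.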

  26. 2. Categorization (cont.) • Effectiveness: Tests have shown that the Topic Categorization tool agrees with human categorizers to the same degree as human categorizers agree with one another.

  27. Set of sample documents • Training phase • Special index used to categorize new documents • Returns list of category names and confidence levels for each document

  28. Text Mining Applications

  29. Main Advantages of mining technology over traditional ‘information broker’ business • Ability to quickly process large amounts of textual data • “Objectivity” and customizability • Automation

  30. Applications used to: • Gain insights about trends, relations between people/places/organizations • Classify and organize documents according to their content • Organize repositories of document-related meta-information for search and retrieval • Retrieve documents

  31. Main Applications • Knowledge Discovery • Information Distillation

  32. CRI: Customer Relationship Intelligence • Appropriate documents selected • Converted to common format • Feature extraction and clustering tools are used to create a database • User may select parameters for preprocessing and clustering step • Clustering produces groups of feedback that share important linguistic elements • Categorization tool used to assign new incoming feedback to identified categories.

  33. CRI (continued) • Knowledge Discovery • Clustering used to create a structure that can be interpreted • Information Distillation • Refinement and extension of the clustering results • Interpreting the results • Tuning of the clustering process • Selecting meaningful clusters

  34. Exam Question #1 • Name an example of each of the two main classes of applications of text mining. • Knowledge Discovery: Discovering a common customer complaint among much feedback. • Information Distillation: Filtering future comments into pre-defined categories

  35. Exam Question #2 • How does the procedure for text mining differ from the procedure for data mining? • Adds feature extraction function • Not feasible to have humans select features • Highly dimensional, sparsely populated feature vectors
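The phrase "highly dimensional, sparsely populated feature vectors" can be made concrete with a toy example. The documents below are invented for illustration; the sketch only counts how many of the possible (document, term) cells are actually non-zero.

```python
from collections import Counter

# Invented toy documents; each word is treated as one feature.
docs = [
    "credit line approved by commercial bank",
    "printer driver crashes on startup",
    "patent portfolio review for new contracts",
]

# One dimension per distinct term in the whole collection.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})

# Sparse representation: store only the terms that occur in each document.
sparse_vectors = [Counter(doc.lower().split()) for doc in docs]

nonzero = sum(len(vec) for vec in sparse_vectors)
density = nonzero / (len(docs) * len(vocabulary))
print(len(vocabulary), nonzero, round(density, 2))
```

Even with three short documents most cells are empty, and the vocabulary grows with every new document, which is why selecting features by hand is not feasible.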

  36. Exam Question #3 • In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text? • Does not perform in-depth syntactic or semantic analyses of texts

  37. THE END http://www-3.ibm.com/software/data/iminer/fortext/
