
Text Analysis and Knowledge Mining System

Text Analysis and Knowledge Mining System. Nasukawa, N. & Nagano, T. 2001. IBM Systems Journal, 40: 967-983. Textual Data. Increase in accessible textual data has caused an information flood


Presentation Transcript


  1. Text Analysis and Knowledge Mining System Nasukawa, N. & Nagano, T. 2001. IBM Systems Journal, 40: 967-983

  2. Textual Data • The increase in accessible textual data has caused an information flood • TAKMI (Text Analysis and Knowledge MIning), a text mining system, was developed to acquire useful knowledge from such data • TAKMI lets users acquire useful knowledge from large amounts of textual data such as internal reports, technical documents, and messages from various individuals.

  3. Textual Data • An important issue for text mining technology is how to represent textual data so that statistical analysis can be applied • Since the content of text varies greatly, it is essential to visualize the results and allow interactive analysis to meet the requirements of analysts working on the data.

  4. Document Handling Technologies • Document handling technologies

  5. Document Handling Technologies • Document organization technology can give us an overview of a document archive by -Classifying documents into predefined classes -Clustering documents with similar contents

  6. Text Mining • “Text mining” is used here in the sense of knowledge discovery • It analyzes more detailed information in the content of each document • It extracts information that can be provided only by multiple documents viewed as a whole (for example, trends or significant features in a large number of documents related to customer calls)

  7. Text Mining • Information extraction typically focuses on finding a specified class of events • This technology is almost the inverse of text mining, since text mining aims to find novel patterns rather than predefined patterns in a specified class.

  8. Text Mining • Text mining is a text version of generalized data mining; it uses NLP to extract concepts from each piece of text • It also applies statistical analysis to find interesting patterns among concepts, and visualization to allow interactive analysis

  9. Framework of Text Mining • Concept extraction for text mining -A “concept” is a representation of textual content, as distinguished from a simple keyword -For example, “Washington” could be a person, a place, or something else -Also, different expressions may refer to the same meaning (for example “car”, “automobile”, “H/w”, etc.)

  10. Framework of Text Mining • Concept extraction for text mining -Inserting a single word such as “not”, or altering word positions, may change the entire meaning of a sentence, as in the following examples:

  11. Concept Extraction in TAKMI • The following are some approaches to representing the content of textual data in the prototype text mining system TAKMI

  12. Concept Extraction in TAKMI

  13. Concept Extraction in TAKMI • To generate these representations, the authors used the following steps -Creation of a semantic dictionary • Make a list of words extracted from the textual database, sorted by frequency, and ask domain experts to assign semantic categories to the words and phrases they consider important • Assign appropriate canonical forms to handle synonymous expressions and variations
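
The first dictionary-building step, a frequency-sorted word list handed to domain experts, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `frequency_sorted_words`, the simple tokenizer, and the sample records are all assumptions made for the example.

```python
import re
from collections import Counter


def frequency_sorted_words(texts):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter()
    for text in texts:
        # Assumption: a crude lowercase alphanumeric tokenizer stands in for real NLP.
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented sample records, not data from the paper.
records = [
    "Windows will not boot after install",
    "Cannot install printer driver on Windows",
    "Printer driver not found",
]
word_list = frequency_sorted_words(records)
for word, count in word_list[:5]:
    print(word, count)
```

Experts would then review the top of such a list and attach a semantic category and a canonical form to the entries they consider important.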

  14. Concept Extraction in TAKMI • This dictionary consists of entries with surface representations, parts of speech, canonical representations, and semantic categories such as:
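
A dictionary with those four fields could be modelled as below. The entries, the class name `DictionaryEntry`, and the helper `to_concept` are hypothetical and are not taken from the paper; the sketch only shows how different surface forms can map to one canonical concept with a semantic category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    surface: str          # expression as it appears in the text
    part_of_speech: str
    canonical: str        # shared form for synonymous expressions
    category: str         # semantic category, e.g. "software"


# Invented entries for illustration only.
ENTRIES = [
    DictionaryEntry("win98", "noun", "Windows 98", "software"),
    DictionaryEntry("windows 98", "noun", "Windows 98", "software"),
    DictionaryEntry("hdd", "noun", "hard disk", "hardware"),
    DictionaryEntry("hard disk", "noun", "hard disk", "hardware"),
]
LOOKUP = {entry.surface: entry for entry in ENTRIES}


def to_concept(surface):
    """Map a surface expression to (canonical form, category), or None if unknown."""
    entry = LOOKUP.get(surface.lower())
    if entry is None:
        return None
    return (entry.canonical, entry.category)


print(to_concept("Win98"))
print(to_concept("HDD"))
```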

  15. Concept Extraction in TAKMI • Intention analysis: this is done by matching patterns of grammatical forms or certain expressions, and by searching a semantic dictionary, in the following manner
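
As a rough illustration of matching "certain expressions", the sketch below tags a sentence with intention labels using regular expressions. The labels and patterns are assumptions chosen for the example; the paper's actual pattern set and its dictionary lookup are not reproduced here.

```python
import re

# Hypothetical intention labels and patterns, for illustration only.
INTENTION_PATTERNS = [
    ("question", re.compile(r"\bhow (?:do|can|to)\b|\?\s*$", re.IGNORECASE)),
    ("request", re.compile(r"\b(?:please|would like to|wants? to)\b", re.IGNORECASE)),
    ("negation", re.compile(r"\b(?:not|cannot|can't|won't|unable to)\b", re.IGNORECASE)),
]


def intentions(sentence):
    """Return the labels of all patterns that match the sentence, in list order."""
    return [label for label, pattern in INTENTION_PATTERNS if pattern.search(sentence)]


print(intentions("Customer cannot print"))
print(intentions("How do I install the driver?"))
```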

  16. Concept Extraction in TAKMI • Dependency analysis: this is done by grammatical analysis, but with the aim of focusing on the issues that matter to the text mining application • Example: to facilitate analysis of problems in software, extracting dependency pairs of a predicate indicative of problems and a noun with the semantic category [software] is a generally robust method • Dependency analysis makes it possible to check grammatical dependencies among the verb groups and noun groups clustered in the intention analysis
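
The idea of pairing a [software] noun with a problem predicate can be sketched as follows. This is a deliberately simplified stand-in: instead of real grammatical analysis it pairs each problem predicate with the nearest preceding software noun, and the word sets `SOFTWARE_NOUNS` and `PROBLEM_PREDICATES` are invented for the example.

```python
# Hypothetical category word lists; a real system would use the semantic dictionary.
SOFTWARE_NOUNS = {"windows", "driver", "browser"}
PROBLEM_PREDICATES = {"crash", "crashes", "freeze", "freezes", "fail", "fails"}


def dependency_pairs(sentence):
    """Pair each problem predicate with the closest preceding [software] noun.

    Assumption: word order approximates the grammatical dependency; a parser
    would be needed to recover true dependencies.
    """
    pairs = []
    last_software = None
    for token in sentence.lower().replace(".", " ").split():
        if token in SOFTWARE_NOUNS:
            last_software = token
        elif token in PROBLEM_PREDICATES and last_software is not None:
            pairs.append((last_software, token))
    return pairs


print(dependency_pairs("The browser crashes after the driver fails."))
```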

  17. TAKMI • Data mining functions for text mining • Once the appropriate concepts have been extracted from each piece of text, various statistical analysis methods can be applied • Example: highlighting significant associations between [liquid] concepts and [problem] concepts in the data of a PC help center
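
One common way to score such associations is lift, the ratio of observed co-occurrence to what independence would predict. The sketch below uses lift as an assumed measure; the slides do not say which statistic TAKMI applies, and the records are invented.

```python
from collections import Counter
from itertools import product


def association_lift(records, left_items, right_items):
    """Return {(left, right): lift} for every pair that co-occurs in some record."""
    n = len(records)
    single = Counter()
    joint = Counter()
    for concepts in records:
        present = set(concepts)
        single.update(present)
        for pair in product(present & left_items, present & right_items):
            joint[pair] += 1
    return {
        (a, b): (count / n) / ((single[a] / n) * (single[b] / n))
        for (a, b), count in joint.items()
    }


# Invented concept sets per call record: one [liquid] and one [problem] each.
records = [
    {"coffee", "keyboard failure"},
    {"coffee", "keyboard failure"},
    {"water", "boot failure"},
    {"coffee", "boot failure"},
]
lift = association_lift(
    records,
    left_items={"coffee", "water"},
    right_items={"keyboard failure", "boot failure"},
)
for pair, value in sorted(lift.items(), key=lambda item: -item[1]):
    print(pair, round(value, 2))
```

A lift above 1 marks a pair that co-occurs more often than chance would suggest, which is the kind of association an analyst would want highlighted.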

  18. TAKMI

  19. TAKMI • Visualization and interactive analysis for text mining -It is important to provide functions that visualize the results for intuitive understanding and that allow interactive analysis -The authors use Information Technology Outlining for this purpose (information outlining enables a user to obtain an overview of the features and trends of a set of data by showing the distributions of items associated with the target set of data from various viewpoints indicated by categories)

  20. TAKMI • Visualization and interactive analysis for text mining -GUI of TAKMI

  21. TAKMI -The GUI of TAKMI consists of four frames -Frame A shows the number of records in the current analysis that match the search criteria used to focus on a set of texts -Frame B shows the titles of the texts; clicking a title in this frame, such as “CD-ROM” or “PLANR”, displays the actual content of that text.

  22. TAKMI • The distribution of concepts associated with the current set of texts is shown in Frame C, organized by the categories of the semantic features or intentions. In this frame, concepts can be sorted by their absolute frequency, by their relative frequency within the data, or in alphabetical order. • The results of the mining functions are shown in Frame D.
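
The three sort orders for Frame C can be sketched as below. The definition of relative frequency used here (count in the current subset divided by count in the whole data set) is an assumption, as are the function name `sort_concepts` and the counts.

```python
def sort_concepts(subset_counts, total_counts, order):
    """Sort concept names by 'absolute', 'relative', or 'alphabetical' order."""
    if order == "absolute":
        key = lambda concept: (-subset_counts[concept], concept)
    elif order == "relative":
        # Assumption: relative frequency = share of the concept's total occurrences
        # that fall inside the current subset of texts.
        key = lambda concept: (-subset_counts[concept] / total_counts[concept], concept)
    elif order == "alphabetical":
        key = lambda concept: concept
    else:
        raise ValueError(f"unknown order: {order}")
    return sorted(subset_counts, key=key)


# Invented counts for illustration.
subset = {"windows 98": 40, "printer driver": 10, "modem": 25}
total = {"windows 98": 400, "printer driver": 20, "modem": 100}

print(sort_concepts(subset, total, "absolute"))
print(sort_concepts(subset, total, "relative"))
print(sort_concepts(subset, total, "alphabetical"))
```

Sorting by relative frequency surfaces concepts that are characteristic of the current subset even when their raw counts are small.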

  23. TAKMI • Thus TAKMI allows -Detection of trends -Analysis of the details of trends or significant associations by browsing the relevant concepts in Frame C

  24. Use of TAKMI to analyze records in Customer Centers • Fixed-field structured data might contain -Customer information (ID [identifier], name, phone number, etc.) -Related products or services -Contact type (question, request, query, etc.) • A free-text field captures the details of each contact in natural language

  25. Use of TAKMI to analyze records in Customer Centers • While analyzing records of customer contact information in a PC help center to apply technology (TAKMI), the authors noticed the following:

  26. Use of TAKMI to analyze records in Customer Centers

  27. Use of TAKMI to analyze records in Customer Centers

  28. Use of TAKMI to analyze records in Customer Centers • TAKMI was installed at a PC help center in the US • The authors made a list of words extracted from the texts in the records, sorted by frequency, and asked call analysts in the help center to assign semantic categories such as [software], [hardware], and [problem] to the words they considered important for call analysis

  29. Use of TAKMI to analyze records in Customer Centers • A semantic dictionary was created which had six thousand words. • This included 80% of the content words in the call records for these data

  30. Use of TAKMI to analyze records in Customer Centers • Results-

  31. Use of TAKMI to analyze records in Customer Centers • Results -The analysis shows that Windows 98 was the most rapidly increasing topic in [software] from the middle of June to the beginning of July 1998
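
Detecting the most rapidly increasing topic between two periods can be sketched with a simple growth ratio. The measure, the function name `fastest_growing`, and the counts below are all assumptions for illustration; they are not figures or methods from the paper.

```python
def fastest_growing(earlier_counts, later_counts):
    """Return the topic whose count grew by the largest factor between two periods."""
    def growth(topic):
        # Assumption: treat a topic unseen in the earlier period as having count 1
        # to avoid division by zero.
        return later_counts[topic] / max(earlier_counts.get(topic, 0), 1)

    return max(later_counts, key=growth)


# Invented weekly mention counts within the [software] category.
mid_june = {"Windows 98": 20, "Windows 95": 300, "browser": 80}
early_july = {"Windows 98": 180, "Windows 95": 310, "browser": 90}

print(fastest_growing(mid_june, early_july))
```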

  32. Use of TAKMI to analyze records in Customer Centers

  33. Use of TAKMI to analyze records in Customer Centers • Evaluation of concept extraction in TAKMI -

  34. Use of TAKMI to analyze records in Customer Centers • Complex concept representation based on intention analysis and dependency analysis was effective • For example, in one month of data from the Japanese PC help center, 55 cases contained “file…not found” in the [software… problem] category

  35. Application of TAKMI for other data • Japanese patent documents are written in markup language and basic items such as titles, dates of issue, authors' names, and organization names are explicitly indicated by tags and can easily be extracted by applying simple pattern analysis
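
Extracting explicitly tagged items by simple pattern analysis could look like the sketch below. The tag names and the sample document are hypothetical; the actual markup of Japanese patent documents is not given in the slides.

```python
import re


def extract_fields(document, tags):
    """Return {tag: text} for the first occurrence of each tag found in the document."""
    fields = {}
    for tag in tags:
        name = re.escape(tag)
        match = re.search(rf"<{name}>(.*?)</{name}>", document, re.DOTALL)
        if match:
            fields[tag] = match.group(1).strip()
    return fields


# Hypothetical markup and tag names, for illustration only.
sample = """
<title>Disk drive cooling device</title>
<date>1998-07-01</date>
<organization>Example Corp.</organization>
"""
patent_fields = extract_fields(sample, ["title", "date", "author", "organization"])
print(patent_fields)
```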

  36. Application of TAKMI for other data • 15000 patent documents were analyzed to generate indexes by extracting terms and categories shown below

  37. Application of TAKMI for other data

  38. Conclusions • A framework for text mining was developed • The focus was on NLP to extract concepts from each piece of text • Intention analysis allowed predicates to be classified by analyzing function words • Dependency analysis made it possible to capture higher-level sequential information

  39. Conclusions • TAKMI produced its results by applying the output of concept extraction to statistical analysis functions that use semantic features • Categorizing terms by semantic features is important both for organizing the output knowledge and for facilitating interactive analysis from multiple viewpoints.
