Text Analysis and Knowledge Mining System Nasukawa, N. & Nagano, T. 2001. IBM Systems Journal, 40: 967-983
Textual Data • The increase in accessible textual data has caused an information flood • TAKMI (Text Analysis and Knowledge MIning), a text mining system, was developed to acquire useful knowledge from it • TAKMI lets users acquire useful knowledge from large amounts of textual data such as internal reports, technical documents, and messages from individuals.
Textual Data • An important issue for text mining technology is how to represent textual data so that statistical analysis can be applied • Since the content of the text varies greatly, it is essential to visualize the results and allow interactive analysis to meet the requirements of the analysts working on the data.
Document Handling Technologies • Document handling technologies
Document Handling Technologies • Document organization technology can give us an overview of a document archive by -Classifying documents into predefined classes -Clustering documents with similar contents
Text Mining • “Text mining” is used here in the sense of knowledge discovery • It analyzes more detailed information in the content of each document • It extracts information that can be provided only by multiple documents viewed as a whole (for example, trends or significant features in a large number of documents related to customer calls)
Text Mining • Information extraction typically focuses on finding a specified class of events • This technology is almost the inverse of text mining, since text mining aims to find novel patterns rather than predefined patterns in a specified class.
Text Mining • Text mining is a text version of generalized data mining • It consists of NLP to extract concepts from each piece of text, statistical analysis to find interesting patterns among the concepts, and visualization to allow interactive analysis
Framework of Text Mining • Concept extraction for text mining -A “concept” is a representation of textual content, as distinct from a simple keyword -For example, “Washington” could be a person, a place, or something else -Also, different expressions may refer to the same meaning (for example “car”, “automobile”, “H/w” etc)
Framework of Text Mining • Concept extraction for text mining -Inserting a single word such as “not”, or altering word positions, may change the entire meaning of a sentence, as in the following examples:
Concept Extraction in TAKMI • The following are some approaches to representing content in textual data for the prototype text mining system TAKMI
Concept Extraction in TAKMI • To generate these representations, the authors used the following steps -Creation of a semantic dictionary • Make a list of words extracted from the textual database, sorted by frequency, and ask domain experts to assign semantic categories to the words and phrases they consider important • Assign appropriate canonical forms to handle synonymous expressions and variations
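The first dictionary-building step, a frequency-sorted word list for experts to review, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the tokenizer regex, the function name `frequency_list`, and the sample records are all assumptions made for the example.

```python
import re
from collections import Counter

def frequency_list(texts):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter()
    for text in texts:
        # Hypothetical tokenizer: lowercase words, keeping forms like "cd-rom" together.
        counts.update(re.findall(r"[a-z0-9]+(?:[-/][a-z0-9]+)*", text.lower()))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Made-up call records, for illustration only.
records = [
    "Customer spilled coffee on keyboard",
    "Keyboard not working after coffee spill",
    "CD-ROM drive not found",
]
candidates = frequency_list(records)
```

A domain expert would then walk down `candidates` from the top and assign categories and canonical forms to the entries that matter for the analysis.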
Concept Extraction in TAKMI • This dictionary consists of entries with surface representations, parts of speech, canonical representations, and semantic categories such as:
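The original figure with sample entries is not reproduced here. As a rough sketch of an entry with those four fields, one possible layout is shown below. The class name `Entry`, the helper `to_concept`, and the sample rows are hypothetical: "car"/"automobile" follows the synonym example given earlier, and the "win98" surface form is invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    surface: str     # expression as it appears in the text
    pos: str         # part of speech
    canonical: str   # canonical representation shared by synonyms and variants
    category: str    # semantic category assigned by a domain expert

# Hypothetical entries; not taken from the actual TAKMI dictionary.
ENTRIES = [
    Entry("car", "noun", "car", "vehicle"),
    Entry("automobile", "noun", "car", "vehicle"),
    Entry("win98", "noun", "Windows 98", "software"),
    Entry("windows 98", "noun", "Windows 98", "software"),
]
LOOKUP = {entry.surface: entry for entry in ENTRIES}

def to_concept(surface):
    """Map a surface expression to (canonical form, category), or None if unknown."""
    entry = LOOKUP.get(surface.lower())
    return (entry.canonical, entry.category) if entry else None
```

Mapping every surface form to a canonical form first means later statistics count "car" and "automobile" as one concept.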
Concept Extraction in TAKMI • Intention analysis: This is done by matching patterns of grammatical forms or certain expressions, and by searching in the semantic dictionary, in the following manner
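Pattern matching of this kind can be sketched with a few regular expressions. The intention labels and patterns below are assumptions chosen for illustration; the slides do not list the actual patterns TAKMI uses.

```python
import re

# Hypothetical (label, pattern) pairs; a real system would use many more.
INTENTION_PATTERNS = [
    ("question", re.compile(r"\b(?:how (?:do|can) i|is it possible)\b|\?\s*$", re.I)),
    ("request", re.compile(r"\b(?:please|would like to|wants? to)\b", re.I)),
    ("negation", re.compile(r"\b(?:not|cannot|can't|unable to)\b", re.I)),
]

def intentions(sentence):
    """Return the labels of all patterns that match the sentence, in table order."""
    return [label for label, pattern in INTENTION_PATTERNS if pattern.search(sentence)]
```

For instance, `intentions("File not found")` returns `["negation"]`, so a record containing "found" and one containing "not found" can be kept apart in later counts.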
Concept Extraction in TAKMI • Dependency analysis: This is done by grammatical analysis, but with the aim of focusing on the issues that matter to the text mining application • Example: to facilitate analysis of problems in software, extracting dependency pairs of a predicate indicative of problems and a noun with the semantic category [software] is a generally robust method • Dependency analysis makes it possible to check grammatical dependencies among the verb groups and noun groups clustered in the intention analysis
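The sketch below pairs [software] nouns with nearby problem predicates. Note that it swaps in a simple token-window proximity heuristic in place of real grammatical dependency analysis, which would need a parser. The category table, the predicate list, and the window size are all assumptions for illustration.

```python
# Hypothetical semantic categories and problem predicates.
CATEGORY = {"word": "software", "excel": "software", "printer": "hardware"}
PROBLEM_PREDICATES = {"crash", "crashes", "freeze", "freezes", "fail", "fails"}

def problem_pairs(sentence, window=3):
    """Pair each [software] noun with problem predicates within `window` tokens.

    This is a proximity approximation, not a true dependency parse.
    """
    tokens = sentence.lower().replace(".", " ").replace(",", " ").split()
    pairs = []
    for i, token in enumerate(tokens):
        if CATEGORY.get(token) != "software":
            continue
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if tokens[j] in PROBLEM_PREDICATES:
                pairs.append((token, tokens[j]))
    return pairs
```

Restricting the noun side to one semantic category is what keeps the output focused: in "Excel crashes when the printer fails", only the pair for the software noun is reported.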
TAKMI • Data mining functions for text mining • Once the appropriate concepts have been extracted from each piece of text, various statistical analysis methods can be applied • For example, highlighting significant associations between [liquid] concepts and [problem] concepts in the data of a PC help center
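One simple way to score associations between two categories is lift, the ratio of observed co-occurrence to what independence would predict. The slides do not say which statistic TAKMI uses, so lift is an assumed stand-in here, and the records are invented for illustration.

```python
from collections import Counter
from itertools import product

def association_lift(records, cat_a, cat_b):
    """Return {(a, b): lift} for concept pairs that co-occur in at least one record."""
    n = len(records)
    count_a, count_b, count_ab = Counter(), Counter(), Counter()
    for record in records:
        a_items = set(record.get(cat_a, ()))
        b_items = set(record.get(cat_b, ()))
        count_a.update(a_items)
        count_b.update(b_items)
        count_ab.update(product(a_items, b_items))
    # lift = P(a, b) / (P(a) * P(b)) = count_ab * n / (count_a * count_b)
    return {
        (a, b): (c * n) / (count_a[a] * count_b[b])
        for (a, b), c in count_ab.items()
    }

# Made-up records: each maps a semantic category to the concepts found in one call.
records = [
    {"liquid": ["coffee"], "problem": ["keyboard failure"]},
    {"liquid": ["coffee"], "problem": ["keyboard failure"]},
    {"liquid": ["water"], "problem": ["no power"]},
    {"liquid": [], "problem": ["no power"]},
]
lift = association_lift(records, "liquid", "problem")
```

A lift above 1 means the pair appears together more often than chance would suggest; pairs with a high lift and enough supporting records are the ones worth highlighting.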
TAKMI • Visualization and interactive analysis for text mining -Visualizing the results supports intuitive understanding and allows interactive analysis -The authors use information outlining for this purpose: it gives the user an overview of the features and trends of a set of data by showing the distributions of items associated with the target data set from various viewpoints, indicated by categories
TAKMI • Visualization and interactive analysis for text mining -GUI of TAKMI
TAKMI -The GUI of TAKMI consists of four frames -Frame A shows the number of records in the current analysis, together with the search criteria used to focus on that set of texts -Frame B shows the titles of the texts, such as “CD-ROM” and “PLANR”; clicking a title displays the actual content of the text.
TAKMI • Frame C shows the distribution of concepts associated with the current set of texts, organized by the categories of semantic features or intentions. In this frame, concepts can be sorted by absolute frequency, by relative frequency within the data, or alphabetically. • The results of the mining functions are shown in Frame D.
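The three sort orders offered in Frame C can be sketched as below. The interpretation of "relative frequency" as the count in the current subset divided by the count in the whole collection is an assumption, as are the function name and the sample counts.

```python
def sort_concepts(subset_counts, total_counts, order="absolute"):
    """Sort concept names by the chosen order; ties fall back to alphabetical."""
    if order == "absolute":
        key = lambda c: (-subset_counts[c], c)
    elif order == "relative":
        # Assumed definition: share of a concept's occurrences that fall in the subset.
        key = lambda c: (-subset_counts[c] / total_counts[c], c)
    elif order == "alphabetical":
        key = lambda c: c
    else:
        raise ValueError(f"unknown order: {order}")
    return sorted(subset_counts, key=key)

# Made-up counts: occurrences in the current subset and in the whole collection.
subset = {"Windows 98": 30, "Lotus Notes": 5, "Excel": 10}
total = {"Windows 98": 300, "Lotus Notes": 6, "Excel": 40}
```

Relative ordering surfaces concepts that are concentrated in the current subset even when their raw counts are small, which is why it can rank a rare concept above a common one.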
TAKMI • Thus TAKMI allows -Detection of trends -Analysis of the details of trends or significant associations by browsing the relevant concepts in Frame C
Use of TAKMI to analyze records in Customer Centers • Fixed-field structured data might contain -Customer information (ID [identifier], name, phone number etc) -Related products or services -Contact type (question, request, query, etc) • A free-text field captures the details of each contact in natural language
Use of TAKMI to analyze records in Customer Centers • While applying TAKMI to records of customer contact information in a PC help center, the authors noticed the following:
Use of TAKMI to analyze records in Customer Centers • TAKMI was installed at a PC help center in the US • A list of words extracted from the texts in the records was sorted by frequency, and call analysts at the help center were asked to assign semantic categories such as [software], [hardware], and [problem] to the words they considered important for call analysis
Use of TAKMI to analyze records in Customer Centers • A semantic dictionary of six thousand words was created. • It covered 80% of the content words in the call records in these data
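A coverage figure of this kind can be computed as the fraction of content-word tokens that appear in the dictionary. The sketch below assumes token-level coverage and a caller-supplied stopword list to approximate "content words"; the slides do not state how the 80% was measured.

```python
def coverage(tokens, dictionary, stopwords=frozenset()):
    """Fraction of non-stopword tokens that are found in the dictionary."""
    content = [t.lower() for t in tokens if t.lower() not in stopwords]
    if not content:
        return 0.0
    return sum(t in dictionary for t in content) / len(content)

# Made-up example: three of the four content words are in the dictionary.
sample_tokens = ["The", "printer", "driver", "crashed", "again"]
sample_dictionary = {"printer", "driver", "crashed"}
sample_coverage = coverage(sample_tokens, sample_dictionary, stopwords={"the"})
```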
Use of TAKMI to analyze records in Customer Centers • Results-
Use of TAKMI to analyze records in Customer Centers • Results -Windows 98 was the most rapidly increasing topic in [software] from the middle of June to the beginning of July 1998
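A "most rapidly increasing topic" can be found by comparing each topic's share of mentions between two periods. This is a minimal sketch under that assumption; the slides do not describe the measure TAKMI uses, and the mention lists below are invented, not the help center's data.

```python
from collections import Counter

def fastest_rising(earlier, later):
    """Return the topic whose share of mentions grew most, plus all share changes."""
    e, l = Counter(earlier), Counter(later)
    e_total = sum(e.values()) or 1  # avoid division by zero on empty periods
    l_total = sum(l.values()) or 1
    change = {t: l[t] / l_total - e[t] / e_total for t in set(e) | set(l)}
    return max(change, key=change.get), change

# Made-up [software] mentions for two consecutive periods.
mid_june = ["Windows 95"] * 6 + ["Windows 98"] * 2 + ["Word"] * 2
early_july = ["Windows 95"] * 3 + ["Windows 98"] * 6 + ["Word"] * 1
top, change = fastest_rising(mid_june, early_july)
```

Using shares rather than raw counts keeps a topic from looking like a trend merely because overall call volume went up.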
Use of TAKMI to analyze records in Customer Centers • Evaluation of concept extraction in TAKMI -
Use of TAKMI to analyze records in Customer Centers • Complex concept representation based on intention analysis and dependency analysis was effective • For example, in one month of data from the Japanese PC help center, 55 cases contained “file…not found” in the [software… problem] category
Application of TAKMI for other data • Japanese patent documents are written in a markup language, and basic items such as titles, dates of issue, authors' names, and organization names are explicitly indicated by tags, so they can easily be extracted by simple pattern analysis
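Simple pattern analysis over tagged fields can be sketched with one regular expression per field. The tag names below are hypothetical placeholders; the actual tags used in Japanese patent documents are not given in the slides, and the sample document is invented.

```python
import re

# Hypothetical tag names standing in for the real patent markup.
FIELDS = ("title", "date", "author", "organization")

def extract_fields(document):
    """Return {field: text} for each tagged field found in the document."""
    fields = {}
    for name in FIELDS:
        match = re.search(rf"<{name}>(.*?)</{name}>", document, re.S | re.I)
        if match:
            fields[name] = match.group(1).strip()
    return fields

sample = """
<title> Text mining apparatus </title>
<date>1999-04-01</date>
<organization>Example Corp.</organization>
<body>Free text describing the invention...</body>
"""
extracted = extract_fields(sample)
```

The tagged fields then play the same role as the fixed fields in the call records, while the free-text body goes through concept extraction.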
Application of TAKMI for other data • 15,000 patent documents were analyzed to generate indexes by extracting the terms and categories shown below
Conclusions • A framework for text mining was developed • The focus was on NLP to extract concepts from each piece of text • Intention analysis allowed predicates to be classified by analyzing functional words • Dependency analysis made it possible to capture higher-level sequential information
Conclusions • TAKMI provided results by applying the output of concept extraction to statistical analysis functions that use semantic features • Categorizing terms by semantic features is important both for organizing the output knowledge and for facilitating interactive analysis from multiple viewpoints.