IBM Intelligent Miner for Text

IBM Intelligent Miner for Text. John Tullis DePaul Instructor john.d.tullis@us.arthurandersen.com. Text Analysis Tools. Text Search Engine. Web Crawler Package. NetQuestion Solution. A Knowledge-discovery software development toolkit

Presentation Transcript


  1. IBM Intelligent Miner for Text John Tullis DePaul Instructor john.d.tullis@us.arthurandersen.com

  2. IBM Intelligent Miner for Text: Text Analysis Tools, Text Search Engine, Web Crawler Package, NetQuestion Solution • A knowledge-discovery software development toolkit • to build advanced text-mining and text-search applications • A NetQuestion Solution • to construct Internet/intranet text-search solutions

  3. Intelligent Miner for Text • For companies of any size and for different industries: Petroleum, Media, Banking, Education, Government, Insurance

  4. Potential Applications (Intelligent Miner for Text) • Customer complaints analysis • Newswire analysis • Opinion survey classification • Intelligent Website • Corporate image analysis • Competitive intelligence

  5. Intelligent Miner for Text: Platforms supported
     Components: (1) Text Analysis Tools, (2) Text Search Engine Server, (3) Text Search Engine Client, (4) Text Search Engine Java GUI JavaBeans, (5) Web Crawler Package, (6) NetQ Solution
     Platform                    (1) (2) (3) (4) (5) (6)
     AIX 4.3                      Y   Y   Y   Y   Y   Y
     Solaris 2.5.1                Y   Y   Y   Y   Y   Y
     Win NT 4.0 SP3               Y   Y   Y   Y   Y   Y
     OS/390 V2R4, V2R5, V2R6      Y   Y   Y   Y   Y   Y

  6. Reference customers & Success stories • Reference Customers • FinanceWise (Search engine for financial content on the Internet) • www.financewise.com • IBM web sites (incl. 2000 IBM intranet sites) • www.ibm.com • Sueddeutsche Zeitung (classified ads on Sueddeutsche Zeitung Web site) • www.sueddeutsche.de • SearchCafe (Business Partner) • www.search-cafe.com • Success stories available at • www.software.ibm.com/iminer/fortext

  7. Component: Text Analysis Tools

  8. Functionality • Language Identification • Clustering of document collection • hierarchical clustering • relational clustering • Categorization/Classification of document collection • Feature Extraction • Summarization

  9. To automate tasks previously done manually • automatically identifies the language of a document • automatically groups related documents based on their content, without requiring predefined classes • automatically assigns documents to one or more user-defined categories • automatically recognizes significant items in text, such as names, technical terms, and abbreviations • automatically extracts sentences from a document to create a document summary Text Analysis Tools

  10. Text Analysis • Text analysis tools are available in a command line format structured to function like common UNIX or DOS command line formats. • Text analysis tools can be used individually or in a combined mode depending on the required task. • Configuration files allow document format flexibility and performance tuning for text searches. Command line switches provide additional flexibility by permitting the user to set runtime parameters. • The documents need to be provided in plain text format. For other formats, conversion tools can be obtained from third parties such as KEYpak (http://www.keypak.com/)

  11. Clustering (2) Feature Extraction Classification Language Identification Summarization

  12. To recognize significant vocabulary items • To recognize all names referring to a single entity • To provide the location of all person names, places, and organizations in a text • To find multi-word terms that have a meaning of their own • To find abbreviations introduced in a text and link them with their full forms • To recognize named relationships Text Analysis Tools: Feature Extraction

  13. Produces statistics for each vocabulary item. • Associates terms with canonical forms (e.g. "related" is associated with the term "relate") • Feature extraction can be used as a preprocessor for the Clustering utility to bias (or control) clustering activities. • Feature extraction can be run in two modes: • 1) Lookup mode, which refers to a schema generated from a training set and produces statistics for vocabulary items as they relate to the rest of the schema as well as within the document • 2) Exploration mode, which requires no training and yields textual data statistics for vocabulary items as they relate within the scope of the document(s) specified Text Analysis Tools: Feature Extraction

  14. Significant concepts are detected automatically • Automatic keywording: the most significant terminology in the document • Names are categorized • Several classes of significant vocabulary can be recognized

  15. The application shown here illustrates how one can use the statistics and analysis produced by feature extraction. • Selected items can be highlighted within a document by using the location information in the feature extraction output (all vocabulary terms carry location information for this purpose). • Results can be filtered on selected categories. • Feature extraction produces a significance measure for each vocabulary item, which allows keywords to be prioritized within individual documents or entire collections. • This is a sample application which is not included in the software installation. Feature Extraction - statistics & analysis

  16. Multi-word phrases are the vocabulary in which concepts are expressed • "Terms" include multi-word phrases whose meaning is much more than that of the individual words

  17. Recognizes multi-word phrases by pattern recognition: if a two-word pattern appears with an acceptable frequency, it is included as an extracted vocabulary item in the output. • More heuristics are applied than mentioned here, but this is generally the textual processing that occurs. • Concepts can be FORMULATED from the multi-word terms. The feature extraction utility assists in emphasizing prevalent multi-word terms. Feature Extraction - statistics & analysis
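
The two-word pattern heuristic described on this slide can be illustrated with a minimal sketch. This is not the product's algorithm (the slide notes that more heuristics are applied); the function name `frequent_bigrams` and the `min_count` threshold are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def frequent_bigrams(text, min_count=2):
    """Return two-word patterns that occur at least min_count times.

    Simplification: words are lowercased ASCII runs, and pairs may span
    sentence boundaries. A real extractor would apply more heuristics.
    """
    words = re.findall(r"[a-z]+", text.lower())
    pairs = Counter(zip(words, words[1:]))
    return {" ".join(pair): n for pair, n in pairs.items() if n >= min_count}


sample = (
    "Text mining finds patterns. Text mining needs clean input. "
    "Feature extraction supports text mining."
)
terms = frequent_bigrams(sample)
print(terms)  # {'text mining': 3}
```

Raising `min_count` corresponds to demanding a higher "acceptable frequency" before a pair is reported as a vocabulary item.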

  18. Clustering (2) Feature Extraction Classification Language Identification Summarization

  19. Language Identification • given a document, discover automatically the language(s) in which the document is written • It can be used to • restrict search results by languages • organize the crawls by languages • route documents to language translators

  20. A 16-language dictionary is shipped with Intelligent Miner for Text for use by the Language Identification utility. • The Language Identification utility also comes with a tool for adding to the shipped dictionary file to extend language identification. (You can even invent your own language and add it to the dictionary!) • Documents can be analyzed for language content: with a command line option, Language Identification can report multiple degrees of language content in one pass (e.g. Document ABC is 75% English, 20% German, etc.). • Allows further document organization by language and adds a degree of internationalization to applications. Language Identification
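
A rough sketch of dictionary-based language identification with per-language percentages follows. The tiny word lists, the name `language_shares`, and the scoring rule (share of dictionary hits) are all assumptions made for illustration; the format and contents of the shipped dictionary are not described in the slides.

```python
import re
from collections import Counter

# Hypothetical miniature dictionary; the shipped one covers 16 languages.
LANGUAGE_WORDS = {
    "English": {"the", "and", "is", "of", "in"},
    "German": {"der", "und", "ist", "das", "nicht"},
}


def language_shares(text):
    """Return the percentage of dictionary hits attributed to each language."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = Counter()
    for word in words:
        for language, vocabulary in LANGUAGE_WORDS.items():
            if word in vocabulary:
                hits[language] += 1
    total = sum(hits.values())
    if total == 0:
        return {}
    return {language: round(100 * n / total) for language, n in hits.items()}


shares = language_shares(
    "The cat is in the house and the garden. Das ist nicht der Hund."
)
print(shares)  # {'English': 60, 'German': 40}
```

Adding an entry to `LANGUAGE_WORDS` plays the role of extending the dictionary with a further (or invented) language.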

  21. Clustering (2) Feature Extraction Classification Language Identification Summarization

  22. given a defined taxonomy, it can assign documents to preexisting categories • utilizes feature extraction capacities to do document comparisons efficiently • two stages • training using sample documents • category assignment Categorization/Classification

  23. Users determine the taxonomy for organizing the documents into topics. • Users create training sets to define categories and use the supplied training utility. • Each document is analyzed and a rank value assigned as it relates to each category. • A command line switch allows the user to display varying numbers of categories with the document's associated rank value. • REMEMBER: The categories are predefined by the user. Categorization/Classification
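
The two stages (training on sample documents, then category assignment with a rank value per category) can be sketched as below. Cosine similarity over raw term counts is a stand-in chosen for this example, not the product's ranking method, and `train`, `rank`, and `top_n` are hypothetical names; `top_n` mimics the switch that controls how many categories are displayed.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def train(training_sets):
    """Stage 1: build one term-count profile per user-defined category."""
    return {
        category: Counter(t for doc in docs for t in tokens(doc))
        for category, docs in training_sets.items()
    }


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(document, profiles, top_n=2):
    """Stage 2: score the document against every category, best first."""
    doc = Counter(tokens(document))
    scored = sorted(
        ((cosine(doc, profile), category) for category, profile in profiles.items()),
        reverse=True,
    )
    return [(category, round(score, 2)) for score, category in scored[:top_n]]


profiles = train({
    "banking": ["loan interest account", "account balance loan"],
    "insurance": ["policy claim premium", "claim damage policy"],
})
ranking = rank("loan account claim", profiles)
print(ranking)  # banking ranks above insurance
```

Note that the categories exist only because the training sets define them, which is the point of the REMEMBER bullet above.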

  24. Categorization: Solution Example

  25. Clustering (2) Feature Extraction Classification Language Identification Summarization

  26. Clustering • Functions • to automatically group related documents based on their content, without requiring predefined classes • objects within a group are more similar to each other than to members of any other group • two approaches - Hierarchical clustering and binary relational clustering

  27. Preprocessing steps • Analyze data input stream and divide it into individual textual components to be used for clustering • Extract portions of individual textual components to be used for clustering (uses Feature Extraction as a preprocessor) • Customize stop word list • Hierarchical clustering • Structure document collection using lexical affinity based on similarity function • Build clustering tree showing relationships between clusters of documents of varying granularity Clustering - Details

  28. Slicing • Customize tree by applying adjustable thresholds to reduce complexity and zoom-in on concepts of interest • Use default threshold values for specific document collection • Note - slicing allows merging similar clusters into a single cluster. • Clustering Output Formats • HTML file viewable by browser • Textual description to be parsed (in the format of a tree) Clustering - Details
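
Hierarchical clustering with a slicing threshold can be sketched as follows. Assumptions: Jaccard overlap of word sets stands in for the lexical-affinity similarity function, single-link merging stands in for the tree construction, and similarity runs from 0 to 1 here; none of these choices are taken from the product.

```python
import re
from itertools import combinations


def word_set(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(documents, threshold):
    """Merge the most similar clusters until none reach the threshold.

    Returns lists of document indices. Lowering the threshold merges
    similar clusters into one, which is the effect of slicing.
    """
    features = [word_set(d) for d in documents]
    clusters = [[i] for i in range(len(documents))]

    def link(c1, c2):
        # Single link: similarity of the closest pair of members.
        return max(jaccard(features[i], features[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        score, a, b = max(
            (link(clusters[a], clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if score < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


docs = [
    "bank loan interest rate",
    "bank loan interest payment",
    "oil drilling platform",
    "oil drilling rig",
]
print(cluster(docs, 0.4))   # [[0, 1], [2, 3]]
print(cluster(docs, 0.55))  # [[0, 1], [2], [3]]
```

The two calls show the same collection sliced at two thresholds: the stricter one keeps the oil documents apart, the looser one merges them.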

  29. Hierarchical Clustering - Visualization Example

  30. This is a sample application which shows the use of the clustering results in an HTML format. This application is not shipped with the software. • The HTML output can be configured to place actual document paths in the display on the browser so users may easily view the documents which were clustered RIGHT FROM THE BROWSER. • Each cluster has a label generated from three two-word pairings, which are its most common lexical affinities. • Similarity values in the application are shown as percentages; these are normalized, as the underlying similarity values range from 0 to 1000. Clustering - Details

  31. Categorization: Comparison to Clustering • In categorization, document collections are processed and grouped into predetermined groupings based on a taxonomy generated with training sets.... • In clustering, document collections are processed and grouped into dynamically generated clusters .... • [Diagram: a Document Collection plus the Category1, Category2, and Category3 Training Collections feed the Categorizer Trainer, producing Cat1, Cat2, Cat3, Cat4; a Document Collection feeds the Clustering Utility, producing Cluster1, Cluster2, Cluster3, Cluster4]

  32. Clustering (2) Feature Extraction Classification Language Identification Summarization

  33. Summarization • Extracts sentences from a document to create a document summary • Sentence selection is based on document structure and ranking of extracted features
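
A minimal sketch of extractive summarization in the spirit of this slide: sentences are scored by the frequency of the terms they contain (a crude stand-in for "ranking of extracted features") plus a bonus for the opening sentence (a crude stand-in for "document structure"). The stop-word list, the bonus value, and the name `summarize` are assumptions for illustration only.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on", "are"}


def content_words(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    weight = Counter(content_words(text))

    def score(index):
        terms = content_words(sentences[index])
        feature_score = sum(weight[w] for w in terms) / (len(terms) or 1)
        structure_bonus = 1.0 if index == 0 else 0.0  # favor the opening sentence
        return feature_score + structure_bonus

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


text = (
    "Text mining extracts features from documents. "
    "The weather was pleasant. "
    "Features guide text mining of documents."
)
summary = summarize(text)
print(summary)
```

On the sample text, the off-topic middle sentence is dropped because its words occur only once in the document.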

  34. Component: Text Search Engine

  35. Boolean queries Hybrid queries Fuzzy search Free-text queries Synonyms search Text Search Engine

  36. Search Engine • offers multiple search paradigms - boolean, free text, fuzzy, hybrid, etc. • supports linguistic analysis for documents in 21 languages including Arabic and Hebrew • features Boolean queries, precise term search and fuzzy search for 4 DBCS languages • Mining Functions • to extract key features in text • to cluster result list • to refine queries • Integrated in IBM DB2 Digital Library and IBM DB2 UDB Text Extender Text Search Engine

  37. A user can refine searches meaning that they can reuse previous search result sets to perform additional searches. • Multilingual linguistic analysis performed: • - basic text analysis (recognizing terms, normalizing terms, recognizing sentence boundaries) • - reducing terms to their base form • - stop word filtering • - decomposition (splitting compound terms) Text Search Engine

  38. Included as part of the basic functional set in the Text Search Engine • Precise index • ngram index • linguistic index • 21 SBCS languages • 4 DBCS languages • relevance ranking • boolean queries • free text queries • fuzzy and phonetical searches • thesaurus support Basic Text Search Engine functions

  39. Document support for single byte character set languages • Document support for double byte character set languages • Linguistic search: • Dictionaries and synonym lists for SBCS languages • Terms are reduced to their base form, terms are decomposed, terms are normalized to standard form • Boolean query: Operators: AND, NOT, OR • Natural language query/free text query: To formulate a query in natural language • Hybrid query: • To combine a natural language query with a Boolean search term Text Search Engine: Details
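
The Boolean operators and the hybrid query can be illustrated on a toy inverted index. This is a sketch under stated assumptions: the index layout and the helpers `build_index`, `postings`, and `hybrid` are hypothetical and do not reflect the Text Search Engine API, and the hybrid ranking simply counts matching free-text terms.

```python
import re
from collections import Counter, defaultdict


def build_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index


def postings(index, term):
    return index.get(term.lower(), set())


def hybrid(index, free_text, required):
    """Rank documents by matching free-text terms, restricted to documents
    that contain the required Boolean term."""
    candidates = postings(index, required)
    scores = Counter()
    for term in re.findall(r"[a-z]+", free_text.lower()):
        for doc_id in postings(index, term) & candidates:
            scores[doc_id] += 1
    return [doc_id for doc_id, _ in scores.most_common()]


docs = {1: "IBM database search", 2: "IBM text mining", 3: "text search engine"}
index = build_index(docs)

both = postings(index, "text") & postings(index, "search")      # text AND search
either = postings(index, "mining") | postings(index, "engine")  # mining OR engine
without = postings(index, "ibm") - postings(index, "database")  # ibm NOT database
ranked = hybrid(index, "database text search", required="ibm")
print(both, either, without, ranked)  # {3} {2, 3} {2} [1, 2]
```

Set intersection, union, and difference map directly onto AND, OR, and NOT, which is why posting sets are a common way to picture Boolean retrieval.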

  40. Fuzzy query: • To find misspelled words: TOYOTA/TOYOTTA, DATABASE/DATABSAE • Phonetical query: • Technique: remove the vowel(s) from the search term and replace them with masking characters, eliminate duplicate consonants • To search for similar-sounding words: COLOR/COLOUR, SMITH/SMYTH, JANET/JEANNETTE ... • Wildcard support for Boolean queries: Front, middle and end masking for word and character masking Text Search Engine: Details
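
The slide's example pairs can be reproduced with two small sketches. For fuzzy matching, an edit distance that also counts adjacent transpositions is used here; the slides do not say how the engine measures closeness, so this is a stand-in. For phonetic matching, the sketch drops vowels outright instead of inserting masking characters (equivalent to a mask that matches any run of vowels, including none) and treats Y as a vowel, which is an assumption needed for SMITH/SMYTH.

```python
def edit_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and swaps of adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


VOWELS = set("AEIOUY")  # assumption: Y is treated as a vowel


def phonetic_key(word):
    """Collapse adjacent duplicate letters, then drop the vowels."""
    key = []
    previous = ""
    for ch in word.upper():
        if ch == previous:
            continue  # eliminate duplicate letters such as NN or TT
        previous = ch
        if ch not in VOWELS:
            key.append(ch)
    return "".join(key)


print(edit_distance("TOYOTA", "TOYOTTA"))     # 1
print(edit_distance("DATABASE", "DATABSAE"))  # 1
print(phonetic_key("JANET"), phonetic_key("JEANNETTE"))  # JNT JNT
```

Two words would then be treated as fuzzy matches when their distance is small, and as phonetic matches when their keys are equal.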

  41. Text Search Engine: Even more details! • Section support • Able to define a section of a document • Restrict the search to given sections • Example : define a section called Summary • Limit search scope within the Summary section • Thesaurus support • for all index types and many languages • ngram index thesaurus (workstation only) • Synonyms and broader/narrower terms • DBCS language synonym support • Not supported for BiDi languages or Russian

  42. Text Search Engine: Text Mining Functions • Provides text mining functions for English documents • Feature extractions • Organize result list • Supports query refinement method for English documents • User assigns value to single documents

  43. Text Search Engine: Query refinement example

  44. This is a snapshot of the Java GUI which is shipped with Intelligent Miner for Text. The source code and instructions are shipped, and the GUI must be compiled by the end user to be operational. • Interacts with the TextMiner Java server. • Composed of JavaBeans which are shipped with Intelligent Miner for Text. The Beans can also be built and integrated into other applications to interact with Intelligent Miner for Text. • The Java GUI provides a "ready-to-go" search GUI to interact with the Advanced Search engine. Users can perform various levels of queries and even browse the documents themselves by double-clicking in the window. • Users must use a fully Java-enabled browser to run this pure Java applet. Query Refinement Example

  45. Where to find the Text Search Engine functions • Basic functions • S/390 Text Search Download for OS/390 V2.4 - V2.6 • IM4T V2.3 workstations • Extended functions (result list clustering, relevance feedback/query refinement, feature index) • IM4T V2.3 for OS/390 • IM4T V2.3 for Workstations

  46. Component: Java & JavaBeans

  47. Java Components • Java Search GUI - fully operational, NLS enabled • JavaBeans for Rapid Application Development • Search • Administration • Source is available and intended to be used as a 'starter kit' • Works with the Text Search Engine

  48. Java Components - Details • GUI Enhancements - • Enhanced error recovery, help • Use with NetScape and MS Internet Explorer • Internet Explorer 3.02 and 4.0 for NT • Internet Explorer 4.0 for Win95/98 • NetScape Navigator 3.0/4.0 for Win95/98/NT • NetScape Navigator 3.0/4.0 Solaris/SPARC • NetScape Navigator 3.0 for Solaris/x86 • Supported via plugin found at • http://java.sun.com/products/plugin/1.1.1/index.html • Sun's HotJava Browser

  49. Component: WebCrawler

  50. Is a Robot used to collect HTML pages for indexing • Customizable as to which HTML links are to be crawled (include and exclude patterns ...) • Results are stored • Data objects on AIX/NT file systems • Metadata in DB2 • Parallel crawling, results combined • HTML page change frequency used as revisiting factor • External subsystems can be notified of web changes detected by the crawler • Create individual crawler using crawler toolkit Web Crawler
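
The include and exclude patterns mentioned above can be pictured with a small link filter. The pattern syntax (shell-style wildcards), the example.com URLs, and the names `should_crawl` and `filter_links` are hypothetical; the crawler's actual configuration format is not given in the slides.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns: crawl one site, but skip private pages and images.
INCLUDE = ["http://www.example.com/*"]
EXCLUDE = ["*/private/*", "*.gif"]


def should_crawl(url, include=INCLUDE, exclude=EXCLUDE):
    """A link is followed if it matches an include pattern and no exclude pattern."""
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(url, pattern) for pattern in include)


def filter_links(links):
    return [url for url in links if should_crawl(url)]


links = [
    "http://www.example.com/docs/index.html",
    "http://www.example.com/private/plan.html",
    "http://www.example.com/logo.gif",
    "http://other.example.org/index.html",
]
to_visit = filter_links(links)
print(to_visit)  # ['http://www.example.com/docs/index.html']
```

Checking exclusions first means an exclude pattern always wins when both kinds of pattern match the same link.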
