IBM Intelligent Miner for Text

John Tullis, DePaul Instructor, john.d.tullis@us.arthurandersen.com

IBM Intelligent Miner for Text • A Knowledge-discovery software development toolkit to build advanced Text-Mining and Text-Search applications • A NetQuestion Solution to construct Internet/intranet text-search solutions • Components: Text Analysis Tools, Text Search Engine, Web Crawler Package, NetQuestion Solution

Intelligent Miner for Text • For companies of any size and for different industries: Petroleum, Media, Banking, Education, Government, Insurance

Intelligent Miner for Text: Potential Applications • Customer complaints analysis • Newswire analysis • Opinion survey classification • Intelligent Website • Corporate image analysis • Competitive intelligence

Intelligent Miner for Text: Platforms supported
Platform | Text Analysis Tools | Text Search Engine Server | Text Search Engine Client | Text Search Engine Java GUI / JavaBeans | Web Crawler Package | NetQ Solution
AIX 4.3 | Y | Y | Y | Y | Y | Y
Solaris 2.5.1 | Y | Y | Y | Y | Y | Y
Win NT 4.0 SP3 | Y | Y | Y | Y | Y | Y
OS/390 V2R4, V2R5, V2R6 | Y | Y | Y | Y | Y | Y

Reference customers & Success stories • Reference Customers • FinanceWise (Search engine for financial content on the Internet) • www.financewise.com • IBM web sites (incl. 2000 IBM intranet sites) • www.ibm.com • Sueddeutsche Zeitung (classified ads on Sueddeutsche Zeitung Web site) • www.sueddeutsche.de • SearchCafe (Business Partner) • www.search-cafe.com • Success stories available at • www.software.ibm.com/iminer/fortext

Component: Text Analysis Tools

Functionality • Language Identification • Clustering of document collection • hierarchical clustering • relational clustering • Categorization/Classification of document collection • Feature Extraction • Summarization

Text Analysis Tools: to automate tasks previously done manually • automatically identifies the language of a document • automatically groups related documents based on their content, without requiring predefined classes • automatically assigns documents to one or more user-defined categories • automatically recognizes significant items in text, such as names, technical terms, and abbreviations • automatically extracts sentences from a document to create a document summary

Text Analysis • The text analysis tools are command-line programs whose syntax follows common UNIX and DOS conventions. • The tools can be used individually or in combination, depending on the task. • Configuration files allow document format flexibility and performance tuning for text searches. Command line switches provide additional flexibility by permitting the user to set runtime parameters. • Documents must be provided in plain text format. For other formats, conversion tools can be obtained from third parties such as KEYpak (http://www.keypak.com/).

Clustering (2) Feature Extraction Classification Language Identification Summarization

Text Analysis Tools: Feature Extraction • To recognize significant vocabulary items • To recognize all names referring to a single entity • To provide the location of all names of persons, places, and organizations in a text • To find multi-word terms that have a meaning of their own • To find abbreviations introduced in a text and link them with their full forms • To recognize named relationships

Text Analysis Tools: Feature Extraction • Produces statistics for each vocabulary item. • Associates terms with their canonical forms (e.g. "related" is associated with the term "relate"). • Feature extraction can be used as a preprocessor for the Clustering utility to bias (or control) clustering activities. • Feature extraction can be run in two modes: • 1) Lookup mode, which refers to a schema generated from a training set and produces statistics for vocabulary items as they relate to the rest of the schema as well as within the document • 2) Exploration mode, which requires no training and yields statistics for vocabulary items within the scope of the specified document(s)

Significant concepts are detected automatically • Automatic keywording: the most significant terminology in the document • Names are categorized • Several classes of significant vocabulary can be recognized

Feature Extraction - statistics & analysis • This sample application shows how the statistics and analysis produced by feature extraction can be used. • Selected items can be highlighted within a document using the location information in the feature extraction output (every vocabulary term carries location information). • Results can be filtered by selected categories. • Feature extraction produces a significance measure for each vocabulary item, which allows keywords to be prioritized within individual documents or entire collections. • The sample application is not included in the software installation.
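The highlighting idea on this slide can be illustrated with a short Python sketch. It is not the sample application and not the product's output format: the helper names `locate` and `highlight`, the bracket markers, and the sample sentence are assumptions made for illustration only.

```python
import re

def locate(text, vocabulary):
    """Return {term: [start offsets]} for every vocabulary term found in text."""
    return {
        term: [m.start() for m in re.finditer(re.escape(term), text)]
        for term in vocabulary
    }

def highlight(text, term, offsets):
    """Mark each occurrence of term, working right to left so offsets stay valid."""
    for start in sorted(offsets, reverse=True):
        end = start + len(term)
        text = text[:start] + "[" + text[start:end] + "]" + text[end:]
    return text

sample_text = "IBM ships Intelligent Miner. IBM also ships DB2."
locations = locate(sample_text, ["IBM", "DB2"])
print(highlight(sample_text, "IBM", locations["IBM"]))
```

Storing offsets once and reusing them is what makes highlighting cheap: the viewer does not need to re-analyze the document.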

Multi-word phrases are the vocabulary in which concepts are expressed • "Terms" include multi-word phrases whose meaning is much more than that of the individual words

Feature Extraction - statistics & analysis • Multi-word phrases are recognized by pattern recognition: if a two-word pattern appears with an acceptable frequency, it is included as an extracted vocabulary item in the output. • More heuristics are applied than are mentioned here, but this is generally the textual processing that occurs. • Concepts can be formulated from the multi-word terms. The feature extraction utility helps by emphasizing prevalent multi-word terms.
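The frequency test for two-word patterns can be sketched in a few lines of Python. This is a rough illustration, not the product's algorithm: the function name `frequent_bigrams`, the small stop-word list, and the `min_count` threshold are assumptions, and the real utility applies further heuristics that the slides do not describe.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the product's own list is not shown in the slides.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def frequent_bigrams(text, min_count=2):
    """Return two-word patterns that occur at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    pairs = Counter(
        (first, second)
        for first, second in zip(words, words[1:])
        if first not in STOP_WORDS and second not in STOP_WORDS
    )
    return {" ".join(pair): n for pair, n in pairs.items() if n >= min_count}

sample = (
    "Text mining finds patterns in text. "
    "A text mining toolkit supports search. "
    "Text mining and text search work together."
)
print(frequent_bigrams(sample))  # {'text mining': 3}
```

Raising `min_count` makes the extractor more conservative; lowering it admits more candidate terms, and more noise with them.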

Language Identification • given a document, discover automatically the language(s) in which the document is written • It can be used to • restrict search results by languages • organize the crawls by languages • route documents to language translators

Language Identification • A 16-language dictionary is shipped with Intelligent Miner for Text for use by the Language Identification utility. • A companion utility can add entries to the shipped dictionary file to extend language identification. (You can even invent your own language and add it to the dictionary!) • With a command line option, Language Identification can report the share of each language in a document in one pass (e.g. Document ABC is 75% English, 20% German, etc.). • This allows further document organization by language and adds a degree of internationalization to applications.
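A dictionary-based identifier that reports language shares could look roughly like the Python sketch below. The word lists, the function name `language_shares`, and the scoring rule (share of recognized words) are assumptions made for illustration; the slides do not describe the dictionary format or the scoring used by the product.

```python
import re
from collections import Counter

# Tiny hypothetical word lists; the shipped dictionary covers 16 languages.
DICTIONARY = {
    "English": {"the", "and", "is", "of", "to", "this"},
    "German": {"der", "die", "und", "ist", "das", "nicht"},
}

def language_shares(text):
    """Return the percentage of recognized words per language."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = Counter()
    for word in words:
        for language, vocabulary in DICTIONARY.items():
            if word in vocabulary:
                hits[language] += 1
    total = sum(hits.values())
    if total == 0:
        return {}
    return {language: round(100 * n / total) for language, n in hits.items()}

doc = "This is the summary of the report and the plan. Das ist nicht der Plan."
print(language_shares(doc))  # {'English': 64, 'German': 36}
```

In this sketch, extending identification to another language (even an invented one) means adding one more entry to `DICTIONARY`.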

Categorization/Classification • Given a defined taxonomy, it can assign documents to preexisting categories • Utilizes feature extraction capabilities to do document comparisons efficiently • Two stages: • training using sample documents • category assignment

Categorization/Classification • Users determine the taxonomy for organizing the documents into topics. • Users create training sets to define categories and use the supplied training utility. • Each document is analyzed and assigned a rank value for each category. • A command line switch lets the user choose how many categories are displayed with the document's associated rank values. • REMEMBER: The categories are predefined by the user.
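The two stages (training, then category assignment with rank values) can be sketched in Python as follows. This is a stand-in technique, plain term-frequency profiles compared by cosine similarity, and not the product's categorizer; the names `train`, `categorize`, and `top_n`, and the sample categories, are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def train(training_sets):
    """Build one term-frequency profile per user-defined category."""
    return {
        category: Counter(t for doc in docs for t in tokens(doc))
        for category, docs in training_sets.items()
    }

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def categorize(document, profiles, top_n=2):
    """Return the top_n categories, each with a rank value for the document."""
    vector = Counter(tokens(document))
    ranked = sorted(
        ((cosine(vector, profile), category) for category, profile in profiles.items()),
        reverse=True,
    )
    return [(category, round(score, 2)) for score, category in ranked[:top_n]]

# Hypothetical taxonomy with two categories and two training documents each.
training_sets = {
    "banking": ["loan interest account", "bank account deposit"],
    "insurance": ["policy claim premium", "insurance claim damage"],
}
profiles = train(training_sets)
print(categorize("claim for storm damage under my policy", profiles))
```

The `top_n` parameter plays the role of the command line switch mentioned above: it controls how many categories are reported with their rank values.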

Categorization: Solution Example

Clustering • Functions • to automatically group related documents based on their content, without requiring predefined classes • objects within a group are more similar to each other than to members of any other group • two approaches - Hierarchical clustering and binary relational clustering

Clustering - Details • Preprocessing steps • Analyze the input data stream and divide it into individual textual components to be used for clustering • Extract the portions of the individual textual components to be used for clustering (uses Feature Extraction as a preprocessor) • Customize the stop word list • Hierarchical clustering • Structure the document collection using lexical affinity, based on a similarity function • Build a clustering tree showing relationships between clusters of documents of varying granularity

Clustering - Details • Slicing • Customize the tree by applying adjustable thresholds to reduce complexity and zoom in on concepts of interest • Use default threshold values for a specific document collection • Note: slicing allows similar clusters to be merged into a single cluster. • Clustering output formats • HTML file viewable in a browser • Textual description to be parsed (in the format of a tree)
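Hierarchical clustering with a slicing threshold can be sketched in Python as below. Several simplifications are assumed: Jaccard overlap of word sets stands in for the product's lexical-affinity similarity function, the function name `cluster` and the `threshold` parameter are invented for the example, and merging simply stops at the threshold instead of building the full tree and slicing it afterwards.

```python
import re
from itertools import combinations

def terms(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(documents, threshold=0.2):
    """Merge the two most similar clusters until none reach the threshold."""
    # Each cluster is (member indices, union of member terms).
    clusters = [([i], terms(doc)) for i, doc in enumerate(documents)]
    while len(clusters) > 1:
        (i, j), best = max(
            (
                ((a, b), jaccard(clusters[a][1], clusters[b][1]))
                for a, b in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # "slice" the tree at this level
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(members) for members, _ in clusters)

docs = [
    "oil price rises on crude supply",
    "crude oil supply falls",
    "bank raises interest rates",
    "interest rates and bank loans",
]
print(cluster(docs))  # [[0, 1], [2, 3]]
```

A lower threshold merges more aggressively and yields fewer, coarser clusters; a higher one leaves more documents on their own.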

Hierarchical Clustering - Visualization Example

Clustering - Details • This sample application shows the clustering results in HTML format. It is not shipped with the software. • The HTML output can be configured to include the actual document paths, so users can open the clustered documents directly from the browser. • Each cluster has a label generated from the three two-word pairings that are its most common lexical affinities. • Similarity values are shown in the application as percentages. They are normalized, because the underlying similarity values range from 0 to 1000.

Categorization: Comparison to Clustering • In categorization, document collections are processed and grouped into predetermined groupings based on a taxonomy generated with training sets.... • In clustering, document collections are processed and grouped into dynamically generated clusters .... • [Diagram labels: Document Collection; Category1, Category2, Category3 Training Collection; Categorizer Trainer; Cat1, Cat2, Cat3, Cat4; Document Collection; Clustering Utility; Cluster1, Cluster2, Cluster3, Cluster4]

Summarization • Extracts sentences from a document to create a document summary • Sentence selection is based on document structure and ranking of extracted features
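Sentence extraction of this kind can be sketched in Python as follows. The scoring is an assumption made for illustration: word frequency stands in for "ranking of extracted features", and a bonus for the first sentence stands in for "document structure". The function name `summarize` and the `max_sentences` parameter are hypothetical, not part of the product.

```python
import re
from collections import Counter

def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Words of four or more letters act as crude "features".
    frequency = Counter(re.findall(r"[a-z]{4,}", text.lower()))

    def score(index, sentence):
        words = re.findall(r"[a-z]{4,}", sentence.lower())
        feature_score = sum(frequency[w] for w in words) / (len(words) or 1)
        position_bonus = 1.0 if index == 0 else 0.0  # crude document-structure cue
        return feature_score + position_bonus

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(i, sentences[i]), reverse=True
    )
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)

article = (
    "Text mining extracts features from text. "
    "The weather was nice. "
    "Mining tools rank features by significance."
)
print(summarize(article))
```

Because the summary reuses whole sentences in their original order, it stays readable without any text generation.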

Component: Text Search Engine

Text Search Engine • Boolean queries • Hybrid queries • Fuzzy search • Free-text queries • Synonyms search

Text Search Engine • Search Engine • offers multiple search paradigms: Boolean, free text, fuzzy, hybrid, etc. • supports linguistic analysis for documents in 21 languages, including Arabic and Hebrew • features Boolean queries, precise term search, and fuzzy search for 4 DBCS languages • Mining Functions • to extract key features in text • to cluster the result list • to refine queries • Integrated in IBM DB2 Digital Library and IBM DB2 UDB Text Extender

Text Search Engine • Users can refine searches, that is, reuse previous search result sets to perform additional searches. • Multilingual linguistic analysis performed: • basic text analysis (recognizing terms, normalizing terms, recognizing sentence boundaries) • reducing terms to their base form • stop word filtering • decomposition (splitting compound terms)
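The linguistic analysis steps listed here can be lined up as a small Python pipeline. Everything in it is a toy assumption: the stop-word list, the base-form table, and the compound table are hypothetical stand-ins for the language dictionaries the engine ships with, and `analyze` is not a product function.

```python
import re

# Hypothetical lookup tables standing in for the engine's dictionaries.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are"}
BASE_FORMS = {"related": "relate", "databases": "database", "indexed": "index"}
COMPOUNDS = {"database": ["data", "base"]}

def analyze(text):
    """Recognize, normalize, filter, reduce, and decompose the terms of a text."""
    terms = re.findall(r"[a-z]+", text.lower())         # recognize and normalize terms
    terms = [t for t in terms if t not in STOP_WORDS]   # stop word filtering
    result = []
    for term in terms:
        base = BASE_FORMS.get(term, term)                # reduce to base form
        result.extend(COMPOUNDS.get(base, [base]))       # split compound terms
    return result

print(analyze("The databases are indexed and related"))
# ['data', 'base', 'index', 'relate']
```

Applying the same pipeline to documents at indexing time and to queries at search time is what lets a query for "database" match a document that says "databases".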

Basic Text Search Engine functions • Included as part of the basic functional set in the Text Search Engine: • Precise index • ngram index • linguistic index • 21 SBCS languages • 4 DBCS languages • relevance ranking • Boolean queries • free text queries • fuzzy and phonetic searches • thesaurus support

Text Search Engine: Details • Document support for single-byte character set (SBCS) languages • Document support for double-byte character set (DBCS) languages • Linguistic search: • Dictionaries and synonym lists for SBCS languages • Terms are reduced to their base form, decomposed, and normalized to a standard form • Boolean query: operators AND, NOT, OR • Natural language query/free text query: to formulate a query in natural language • Hybrid query: to combine a natural language query with a Boolean search term
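The Boolean operators can be illustrated with a small inverted index in Python. This is a generic sketch of Boolean retrieval, not the engine's query syntax or API: `build_index`, `boolean_search`, and the `must` / `any_of` / `must_not` parameters are names invented for the example.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index

def boolean_search(index, all_ids, must=(), any_of=(), must_not=()):
    """AND over `must`, OR over `any_of`, NOT over `must_not`."""
    result = set(all_ids)
    for term in must:
        result &= index.get(term, set())
    if any_of:
        result &= set().union(*(index.get(t, set()) for t in any_of))
    for term in must_not:
        result -= index.get(term, set())
    return result

documents = {
    1: "oil prices and banking",
    2: "banking regulation news",
    3: "oil exploration news",
}
index = build_index(documents)
# news AND (oil OR banking) AND NOT regulation
print(boolean_search(index, documents, must=["news"],
                     any_of=["oil", "banking"], must_not=["regulation"]))  # {3}
```

A hybrid query, in this picture, would first narrow the candidate set with a Boolean term and then rank the survivors against the free-text part.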

Text Search Engine: Details • Fuzzy query: • To find misspelled words: TOYOTA/TOYOTTA, DATABASE/DATABSAE • Phonetic query: • Technique: remove the vowel(s) from the search term, replace them with masking characters, and eliminate duplicate consonants • To search for similar-sounding words: COLOR/COLOUR, SMITH/SMYTH, JANET/JEANNETTE ... • Wildcard support for Boolean queries: front, middle, and end masking, for both word and character masking
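The phonetic technique (drop vowels, mask their positions, collapse duplicate consonants) can be approximated in Python with a regular expression. This is an interpretation of the slide rather than the engine's implementation: treating "y" as a vowel, letting each mask match any run of vowels, and the function names `squeeze`, `phonetic_pattern`, and `sounds_like` are all assumptions.

```python
import re

VOWELS = "aeiouy"  # treating "y" as a vowel is an assumption of this sketch

def squeeze(word):
    """Lowercase and collapse doubled consonants: 'Jeannette' -> 'jeanete'."""
    return re.sub(r"([^aeiouy])\1+", r"\1", word.lower())

def phonetic_pattern(term):
    """Replace each vowel position with a mask that matches any run of vowels."""
    consonants = re.sub(f"[{VOWELS}]+", "", squeeze(term))
    mask = f"[{VOWELS}]*"
    return re.compile(mask + mask.join(map(re.escape, consonants)) + mask)

def sounds_like(term, candidate):
    """True if candidate has the same consonant skeleton as the search term."""
    return phonetic_pattern(term).fullmatch(squeeze(candidate)) is not None

for term, candidate in [("color", "colour"), ("Smith", "Smyth"),
                        ("Janet", "Jeannette"), ("Smith", "Smart")]:
    print(term, candidate, sounds_like(term, candidate))
```

The match is deliberately loose, so it will also accept some words that merely share consonants; a real engine would typically rank such hits below exact matches.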

Text Search Engine: Even more details! • Section support • Able to define a section of a document • Restrict the search to given sections • Example : define a section called Summary • Limit search scope within the Summary section • Thesaurus support • for all index types and many languages • ngram index thesaurus (workstation only) • Synonyms and broader/narrower terms • DBCS language synonym support • Not supported for BiDi languages or Russian

Text Search Engine: Text Mining Functions • Provides text mining functions for English documents • Feature extractions • Organize result list • Supports query refinement method for English documents • User assigns value to single documents

Text Search Engine: Query refinement example

Query Refinement Example • This is a snapshot of the Java GUI that is shipped with Intelligent Miner for Text. The source code and instructions are shipped, and the end user must compile the code to make the GUI operational. • It interacts with the TextMiner Java server. • It is composed of JavaBeans that are shipped with Intelligent Miner for Text. The Beans can also be built and integrated into other applications to interact with Intelligent Miner for Text. • The Java GUI provides a "ready-to-go" search GUI to interact with the Advanced Search engine. Users can perform various levels of queries and even browse the documents themselves by double-clicking in the window. • Users must use a fully Java-enabled browser to run this pure Java applet.

Where to find the Text Search Engine functions • Basic functions • S/390 Text Search Download for OS/390 V2.4 - V2.6 • IM4T V2.3 workstations • Extended functions (result list clustering, relevance feedback/query refinement, feature index) • IM4T V2.3 for OS/390 • IM4T V2.3 for Workstations

Component: Java & JavaBeans

Java Components • Java Search GUI - fully operational, NLS enabled • JavaBeans for Rapid Application Development • Search • Administration • Source is available and intended to be used as a 'starter kit' • Works with the Text Search Engine

Java Components - Details • GUI Enhancements - • Enhanced error recovery, help • Use with NetScape and MS Internet Explorer • Internet Explorer 3.02 and 4.0 for NT • Internet Explorer 4.0 for Win95/98 • NetScape Navigator 3.0/4.0 for Win95/98/NT • NetScape Navigator 3.0/4.0 Solaris/SPARC • NetScape Navigator 3.0 for Solaris/x86 • Supported via plugin found at • http://java.sun.com/products/plugin/1.1.1/index.html • Sun's HotJava Browser

Component: WebCrawler

Web Crawler • A robot used to collect HTML pages for indexing • Customizable as to which HTML links are crawled (include and exclude patterns ...) • Results are stored: • Data objects on AIX/NT file systems • Metadata in DB2 • Parallel crawling, results combined • HTML page change frequency used as revisiting factor • External subsystems can be notified of web changes detected by the crawler • Create individual crawlers using the crawler toolkit
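Link filtering with include and exclude patterns can be sketched in Python as below. The slides do not show the crawler's actual pattern syntax, so shell-style wildcards are assumed here; the function name `should_crawl`, the rule that exclude patterns win, and the example.com URLs are likewise illustrative assumptions.

```python
from fnmatch import fnmatchcase

def should_crawl(url, include, exclude):
    """Crawl a link only if it matches an include pattern and no exclude pattern."""
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(url, pattern) for pattern in include)

# Hypothetical patterns for a single-site crawl.
include = ["http://www.example.com/*"]
exclude = ["*/private/*", "*.gif"]

for url in [
    "http://www.example.com/docs/a.html",
    "http://www.example.com/private/x.html",
    "http://other.example.org/index.html",
]:
    print(url, should_crawl(url, include, exclude))
```

Checking exclude patterns first keeps the rule easy to reason about: an exclusion can never be overridden by a broader include pattern.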
