Explore advanced tools for text analysis and categorization of emails, web pages, and documents. Enhance text mining with clustering, feature extraction, language identification, summarization, and web crawling. Combine text search engines with categorization tools to manage information overload efficiently.
Platforms • AIX workstation • OS/390 • Sun Solaris • Windows NT
Tools to Use • Topic categorization tool • Categorizing emails • Categorizing Web Pages
Text Analysis Tool • Topic Categorization Tool
Text Analysis Tool • Topic Categorization Tool • Category 1 (AI Schedule)
Text Analysis Tool • Category 2 (Database Schedule)
Text Analysis Tool • Target Category (Data Mining Schedule)
Text Analysis Tool • Result - Category 2 (Databases)
Tools to Use • Clustering Tool (Finding Similar Information) • Dividing Documents into Groups • Identifying hidden similarities in documents • Identifying duplicate documents from a collection • Finding Documents that are out of place
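One of the clustering uses listed above, identifying duplicate documents in a collection, can be illustrated with a minimal Python sketch. It is not the Clustering Tool's algorithm: it assumes a plain Jaccard similarity over word sets, and the names `tokens`, `jaccard`, `near_duplicates` and the 0.8 threshold are hypothetical choices made for illustration.

```python
import re
from itertools import combinations


def tokens(text):
    """Return the set of lowercase words in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets, between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8):
    """Return pairs of document names whose word sets overlap heavily.

    The threshold is an assumed value, not one taken from the tool.
    """
    sets = {name: tokens(text) for name, text in docs.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(sets), 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]


# Made-up sample collection.
docs = {
    "a.txt": "Data mining course schedule for fall",
    "b.txt": "Data mining course schedule for fall term",
    "c.txt": "Faculty directory and office hours",
}
pairs = near_duplicates(docs)
print(pairs)
```

The same pairwise similarity could also flag documents that are out of place: a document with low similarity to every other member of its group is a candidate outlier.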
Text Analysis Tool • Hierarchical Clustering - imzhclst
Text Analysis Tool • Binary Clustering - imzcrlst
Text Analysis Tool • Results
Tools to Use • Feature Extraction Tool • Name Extraction • Abbreviation Extraction • Relation Extraction
Text Analysis Tool • Using Feature Extraction tool to extract names • imzxrun -b 2 -f C -x n -o faculty.out faculty.htm
Tools to Use • Language Identification Tool • Organize collection of documents by language • Restrict Search Results to documents in a particular language
Text Analysis Tool • Using Language Identification tool • imzlgini -b 2 -v < mydoc.htm
Text Analysis Tool • Language Identification Tool Results • Supports 13 languages; new languages can be trained
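To make the idea of language identification concrete, here is a small Python sketch. It is an assumption-laden illustration, not how the Language Identification Tool works: it scores a text by counting hits against tiny hand-written stopword lists. The `STOPWORDS` table and the `identify_language` function are hypothetical.

```python
import re

# Hypothetical, deliberately tiny stopword lists; a real identifier
# would be trained on far more data per language.
STOPWORDS = {
    "English": {"the", "and", "of", "is", "to", "in"},
    "German": {"der", "die", "und", "ist", "das", "nicht"},
    "French": {"le", "la", "et", "est", "les", "des"},
}


def identify_language(text):
    """Return the language whose stopwords occur most often, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(identify_language("The schedule of the course is in the catalog"))
print(identify_language("Der Kurs ist nicht in der Liste und das ist gut"))
print(identify_language("Le cours est dans la liste et les notes"))
```

In this sketch, adding a language amounts to adding an entry to the table, which loosely mirrors the idea that new languages can be trained.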
Text Analysis Tool • Using Summarizer tool • imzsum -l 4 project.html
Text Analysis Tool • Summarizer tool - Results
Tools to Use • Web Crawler • Follows the link topology for a fast search • Produces a Web site map • Used to recognize authoritative pages • Provides a filtered collection of pages
Web Crawler • imyclean - to define a web space • Creates include.re, exclude.re, types.re • imycrawl - to crawl a defined web space • imycrawl url webspace • imystat - to track what happens during a crawl
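The slide does not describe the contents of include.re and exclude.re. As a rough sketch only, assume each file holds one regular expression per line, and that a URL belongs to the web space when it matches at least one include pattern and no exclude pattern. The helper names and the example patterns below are hypothetical, and types.re is not modelled.

```python
import re


def compile_patterns(lines):
    """Compile one regular expression per non-empty line (assumed layout)."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def in_web_space(url, include, exclude):
    """True if the URL matches an include pattern and no exclude pattern."""
    if any(pattern.search(url) for pattern in exclude):
        return False
    return any(pattern.search(url) for pattern in include)


# Hypothetical patterns standing in for include.re and exclude.re.
include = compile_patterns([r"^http://www\.example\.edu/"])
exclude = compile_patterns([r"/cgi-bin/", r"\.(gif|jpg)$"])

for url in (
    "http://www.example.edu/courses/index.html",
    "http://www.example.edu/cgi-bin/search",
    "http://other.example.com/",
):
    print(url, in_web_space(url, include, exclude))
```

Checking the exclude patterns first keeps the rule simple: an exclusion always wins, whatever the include list says.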
Tools to Use • Text Search Engine • Complicated Text Search • Powerful Linguistic Capabilities • Fuzzy searches • Query based on structure of document
Text Search Engine • Operates on a previously built index
Text Search Engine • Types of Index • Linguistic Index (bought as buy) • Feature Index (Linguistics + Names) • Precise Index (bought as bought) • Normalized Precise Index (Case Insensitive) • Ngram Index
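The difference between these index types comes down to how a word is transformed before it is stored. The Python sketch below illustrates that for the slide's own example word, under loud assumptions: the `LEMMAS` table is a hypothetical stand-in for real linguistic analysis, character trigrams are an assumed n-gram size, and none of the function names come from the Text Search Engine. The Feature Index is not modelled.

```python
# Hypothetical toy lemma table; a real linguistic index would rely on
# dictionary-based morphological analysis instead.
LEMMAS = {"bought": "buy", "buys": "buy", "buying": "buy"}


def precise_term(word):
    """Precise index: store the word exactly as written."""
    return word


def normalized_term(word):
    """Normalized precise index: fold case, keep the word form."""
    return word.lower()


def linguistic_term(word):
    """Linguistic index: reduce the word to its base form."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)


def ngram_terms(word, n=3):
    """N-gram index: store overlapping character n-grams (n=3 assumed)."""
    lowered = word.lower()
    grams = [lowered[i:i + n] for i in range(len(lowered) - n + 1)]
    return grams or [lowered]


word = "Bought"
print(precise_term(word))
print(normalized_term(word))
print(linguistic_term(word))
print(ngram_terms(word))
```

A query term has to go through the same transformation as the indexed text, which is why the choice of index type determines which searches can match.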
Combining Tools for Solutions • Searching with Categories • combining Text Search Engine and Topic Categorization Tool • Surviving a flood of email • by using Topic Categorization Tools • Selectively indexing Web Pages • by combining Web Crawler, Topic Categorization Tool & Text Search Engine
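The "Searching with Categories" combination can be pictured as a search index whose hits are filtered by a category label assigned at indexing time. The Python sketch below is a toy model under stated assumptions, not the products' API: the keyword-count categorizer, the `CATEGORY_KEYWORDS` table, and every function name are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical keyword lists standing in for a trained categorizer.
CATEGORY_KEYWORDS = {
    "Databases": {"sql", "database", "query"},
    "AI": {"neural", "learning", "agents"},
}


def words(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def categorize(text):
    """Return the category with the most keyword hits, or None."""
    counts = {
        category: sum(word in keywords for word in words(text))
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def build_index(docs):
    """Build a word-to-documents index plus a category per document."""
    index = defaultdict(set)
    categories = {}
    for doc_id, text in docs.items():
        categories[doc_id] = categorize(text)
        for word in words(text):
            index[word].add(doc_id)
    return index, categories


def search(index, categories, term, category=None):
    """Look up a term, optionally keeping only one category's documents."""
    hits = index.get(term.lower(), set())
    if category is not None:
        hits = {doc for doc in hits if categories[doc] == category}
    return sorted(hits)


# Made-up sample documents.
docs = {
    "d1": "Schedule for the database course: SQL and query tuning",
    "d2": "Schedule for the machine learning course: neural networks",
    "d3": "Parking schedule",
}
index, categories = build_index(docs)
print(search(index, categories, "schedule"))
print(search(index, categories, "schedule", category="Databases"))
```

The same filter step suggests how the other two combinations could work: categorize each incoming email or crawled page first, then index or route only those in the categories of interest.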
Views of the Tool • Command Line (Good for Unix) • Not very useful on Windows NT • Not a good stand-alone Tool • Should be viewed as a Library