NITISH MANOCHA
Platforms • AIX workstation • OS/390 • Sun Solaris • Windows NT
Tools to Use • Topic categorization tool • Categorizing emails • Categorizing Web Pages
Text Analysis Tool • Topic Categorization Tool
Text Analysis Tool • Topic Categorization Tool • Category 1 (AI Schedule)
Text Analysis Tool • Category 2 (Database Schedule)
Text Analysis Tool • Target Category (Data Mining Schedule)
Text Analysis Tool • Result - Category 2 (Databases)
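The categorization walk-through above (two trained categories, one target document, one result) can be illustrated with a minimal sketch. This is not how the Topic Categorization Tool works internally, which the slides do not describe; it assumes a simple bag-of-words cosine similarity, and the training and target texts are hypothetical stand-ins for the schedule pages.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def categorize(target, training):
    """Return (best_category, scores) for the target text."""
    vec = Counter(tokens(target))
    scores = {name: cosine(vec, Counter(tokens(text)))
              for name, text in training.items()}
    return max(scores, key=scores.get), scores

# Hypothetical training text, one sample per category.
training = {
    "Category 1 (AI)": "search heuristics planning neural networks reasoning agents",
    "Category 2 (Databases)": "relational query sql transactions indexing storage tables",
}
# Hypothetical target document (a data mining schedule).
target = "association rules over large tables, query support, indexing and storage"

best, scores = categorize(target, training)
print(best)  # Category 2 (Databases)
```

The target shares vocabulary only with the database sample, so it lands in Category 2, mirroring the result slide.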
Tools to Use • Clustering Tool (Finding Similar Information) • Dividing Documents into Groups • Identifying hidden similarities in documents • Identifying duplicate documents from a collection • Finding Documents that are out of place
Text Analysis Tool • Hierarchical Clustering - imzhclst
Text Analysis Tool • Binary Clustering - imzcrlst
Text Analysis Tool • Results
Text Analysis Tool • Results
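To make the clustering goals above concrete (grouping documents, spotting near-duplicates, finding documents that are out of place), here is a minimal sketch. It assumes Jaccard similarity over word sets and a greedy threshold grouping; it is not the algorithm behind the hierarchical or binary clustering commands, and the document names and texts are hypothetical.

```python
import re

def word_set(text):
    """Set of lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster(docs, threshold=0.5):
    """Greedy grouping: a document joins the first cluster that has a member
    at least `threshold` similar to it; otherwise it starts a new cluster."""
    sets = {name: word_set(text) for name, text in docs.items()}
    clusters = []
    for name in docs:
        for group in clusters:
            if any(jaccard(sets[name], sets[m]) >= threshold for m in group):
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical documents.
docs = {
    "a.htm": "database systems course schedule fall",
    "b.htm": "database systems course schedule spring",
    "c.htm": "faculty hiking club newsletter",
}
groups = cluster(docs)
print(groups)  # [['a.htm', 'b.htm'], ['c.htm']]
```

A singleton group such as `c.htm` is a candidate for a document that is out of place; a similarity close to 1.0 within a group suggests a duplicate.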
Tools to Use • Feature Extraction Tool • Name Extraction • Abbreviation Extraction • Relation Extraction
Text Analysis Tool • Using Feature Extraction tool to extract names • imzxrun -b 2 -f C -x n -o faculty.out faculty.htm
Tools to Use • Language Identification Tool • Organize collection of documents by language • Restrict Search Results to documents in a particular language
Text Analysis Tool • Using Language Identification tool • imzlgini -b 2 -v < mydoc.htm
Text Analysis Tool • Language Identification Tool Results • Supports 13 languages • New languages can be trained
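The idea that a language identifier can be trained on new languages can be sketched as follows. This is a toy, assuming each language is represented by a word profile built from sample text and that a document is assigned to the profile covering most of its words; the real tool's method is not described in the slides, and `train`, `identify`, and the sample texts are hypothetical.

```python
import re
from collections import Counter

profiles = {}

def train(language, sample):
    """Build a word-frequency profile for `language` from sample text."""
    profiles[language] = Counter(re.findall(r"\w+", sample.lower()))

def identify(text):
    """Return (language, scores): the profile that covers the most words wins."""
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(1 for w in words if w in prof)
              for lang, prof in profiles.items()}
    return max(scores, key=scores.get), scores

# Hypothetical training samples (common function words).
train("English", "the and of to in is that it for with this are on")
train("German", "der die das und ist nicht ein zu mit für auf den von")

lang, scores = identify("Das ist ein Dokument mit Text für die Suche")
print(lang)  # German
```

Adding a language is then just another `train` call, and the per-document label can be used to organize a collection or to filter search results by language.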
Text Analysis Tool • Using Summarizer tool • imzsum -l 4 project.html
Text Analysis Tool • Summarizer tool - Results
Tools to Use • Web Crawler • Follows the link topology for a fast search • Produces a Web site map • Used to recognize authoritative pages • Provides a filtered collection of pages
Web Crawler • imyclean - to define a web space • Created include.re, exclude.re, types.re • imycrawl - to crawl a defined web space • imycrawl url webspace • imystat - to track what happens during a crawl
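The slides do not show the contents of include.re and exclude.re. As a sketch only, assuming each file holds regular expressions and that a URL belongs to the web space when it matches an include pattern and no exclude pattern, the filtering step could look like this (the patterns and URLs are hypothetical):

```python
import re

def in_web_space(url, include, exclude):
    """True if the URL matches some include pattern and no exclude pattern."""
    if any(re.search(p, url) for p in exclude):
        return False
    return any(re.search(p, url) for p in include)

# Hypothetical patterns, standing in for include.re and exclude.re.
include = [r"^http://www\.example\.edu/"]
exclude = [r"/cgi-bin/", r"\.(gif|jpg)$"]

urls = [
    "http://www.example.edu/cs/faculty.htm",
    "http://www.example.edu/cgi-bin/search",
    "http://www.example.edu/logo.gif",
    "http://other.example.org/index.htm",
]
kept = [u for u in urls if in_web_space(u, include, exclude)]
print(kept)  # ['http://www.example.edu/cs/faculty.htm']
```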
Tools to Use • Text Search Engine • Complicated Text Search • Powerful Linguistic Capabilities • Fuzzy searches • Query based on structure of document
Text Search Engine • Operates on a previously built index
Text Search Engine • Types of Index • Linguistic Index (bought as buy) • Feature Index (Linguistics + Names) • Precise Index (bought as bought) • Normalized Precise Index (Case Insensitive) • Ngram Index
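The difference between the index types above comes down to how a word is turned into an index key. The sketch below illustrates that idea only; the tiny lemma table, the function names, the choice of trigrams, and the lowercasing in the linguistic and n-gram cases are all assumptions, not the search engine's actual behavior.

```python
# Hypothetical mini-lexicon mapping inflected forms to a base form.
LEMMAS = {"bought": "buy", "buys": "buy", "buying": "buy"}

def linguistic_key(word):
    """Linguistic index: reduce the word to its base form (bought as buy)."""
    w = word.lower()
    return LEMMAS.get(w, w)

def precise_key(word):
    """Precise index: store the word exactly as written (bought as bought)."""
    return word

def normalized_precise_key(word):
    """Normalized precise index: exact form, but case-insensitive."""
    return word.lower()

def ngram_keys(word, n=3):
    """Ngram index: overlapping character n-grams, useful for fuzzy matching."""
    w = word.lower()
    return [w[i:i + n] for i in range(len(w) - n + 1)]

word = "Bought"
print(linguistic_key(word))          # buy
print(precise_key(word))             # Bought
print(normalized_precise_key(word))  # bought
print(ngram_keys(word))              # ['bou', 'oug', 'ugh', 'ght']
```

A misspelling such as "bougth" still shares the trigrams "bou" and "oug" with "bought", which is why an n-gram index suits fuzzy searches.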
Combining Tools for Solutions • Searching with Categories • combining Text Search Engine and Topic Categorization Tool • Surviving a flood of email • by using Topic Categorization Tools • Selectively indexing Web Pages • by combining Web Crawler, Topic Categorization Tool & Text Search Engine
Views of the Tool • Command Line (Good for Unix) • Not very useful on Windows NT • Not a good stand-alone Tool • Should be viewed as a Library