LINGO. Search Results Clustering. Sandra Gama.
The Internet: an endless document collection.
Search engines: NO question answering. FAST access to Web content. SENSITIVE to query quality.
We NEED meaningful RESULTS: CLUSTERING!
GROUPING by similarity. Semantic structure. Groups.
LINGO Search Results Clustering Sandra Gama
Luxury Car Feline, panther family
Pipeline: user query → Pre-processing → (filtered docs) → Phrase extraction → (frequent phrases) → Cluster-label induction → (cluster labels) → Cluster-content allocation → clustered documents
Stage 1/4: Preprocessing
• 1. Text segmentation
• 2. Stemming
• 3. Ignore stop words
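The three preprocessing steps can be sketched roughly as follows. This is a minimal illustration, not the deck's implementation: the stop-word list is a tiny made-up sample, and `naive_stem` is a hypothetical crude suffix stripper standing in for a real stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used in the slides).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ations", "ation", "ings", "ing", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(snippet):
    # 1. Text segmentation: lowercase and split into word tokens,
    #    keeping hyphenated words such as "large-scale" together.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", snippet.lower())
    # 3. Ignore stop words (checked before stemming so the list
    #    can hold plain surface forms).
    kept = [t for t in tokens if t not in STOP_WORDS]
    # 2. Stemming.
    return [naive_stem(t) for t in kept]

print(preprocess("Large-scale singular value computations"))
print(preprocess("Software for the sparse singular value decomposition"))
```

The stop-word filter runs before stemming here purely for convenience; the slide lists the steps in the other order.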
Stage 2/4: Phrase extraction
How many non-empty suffixes? 11 suffixes
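A sequence of length n has exactly n non-empty suffixes, and every phrase occurrence is a prefix of one of them. The sketch below uses that idea to count repeated phrases naively. The string `"mississippi"` is only a hypothetical 11-character example (the string on the original slide is not preserved), and `frequent_phrases`, `min_count` and `max_len` are illustrative names, not part of the deck.

```python
from collections import Counter

def suffixes(seq):
    """All non-empty suffixes of a sequence (string or list of words)."""
    return [seq[i:] for i in range(len(seq))]

def frequent_phrases(words, min_count=2, max_len=4):
    """Count every prefix (up to max_len words) of every suffix,
    then keep the phrases seen at least min_count times."""
    counts = Counter()
    for suf in suffixes(words):
        for n in range(1, min(max_len, len(suf)) + 1):
            counts[tuple(suf[:n])] += 1
    return {" ".join(p): c for p, c in counts.items() if c >= min_count}

text = "mississippi"  # hypothetical 11-character example
print(len(suffixes(text)))  # one suffix per character

words = "singular value computations and singular value decomposition".split()
print(frequent_phrases(words))
```

This brute-force count is quadratic in the input length; a sorted suffix array gives the same phrases more efficiently.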
Stage 3/4: Cluster-label induction
A: term × document matrix. Find matrices U, Σ, V such that A = U Σ V^T.
Documents:
D1: Large-scale singular value computations
D2: Software for the sparse singular value decomposition
D3: Introduction to modern information retrieval
D4: Linear algebra for intelligent information retrieval
D5: Matrix computations
D6: Singular value cryptogram analysis
D7: Automatic information organization
Terms:
T1: Information
T2: Singular
T3: Value
T4: Computations
T5: Retrieval
Phrases:
P1: Singular value
P2: Information retrieval
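To make the term × document matrix A concrete, here is a small sketch that builds it from the example documents D1 to D7 and terms T1 to T5. The weighting is an assumption: raw term counts with each document column scaled to unit length. The slides do not say which weighting was used.

```python
import math

docs = [
    "Large-scale singular value computations",               # D1
    "Software for the sparse singular value decomposition",  # D2
    "Introduction to modern information retrieval",          # D3
    "Linear algebra for intelligent information retrieval",  # D4
    "Matrix computations",                                   # D5
    "Singular value cryptogram analysis",                    # D6
    "Automatic information organization",                    # D7
]
terms = ["information", "singular", "value", "computations", "retrieval"]

def term_document_matrix(docs, terms):
    """Rows are terms, columns are documents."""
    tokenized = [d.lower().split() for d in docs]
    # matrix[i][j] = number of occurrences of term i in document j
    matrix = [[float(toks.count(t)) for toks in tokenized] for t in terms]
    # Scale every document column to unit Euclidean length (assumed weighting).
    for j in range(len(docs)):
        norm = math.sqrt(sum(matrix[i][j] ** 2 for i in range(len(terms))))
        if norm > 0:
            for i in range(len(terms)):
                matrix[i][j] /= norm
    return matrix

A = term_document_matrix(docs, terms)
for term, row in zip(terms, A):
    print(f"{term:<13}", " ".join(f"{x:.2f}" for x in row))
```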
P = matrix of candidate labels expressed in term space. Rows: T1: Information, T2: Singular, T3: Value, T4: Computations, T5: Retrieval. Columns: P1: Singular value, P2: Information retrieval, T1: Information, T2: Singular, T3: Value, T4: Computations, T5: Retrieval.
M = U_k^T P. Rows: abstract concepts. Columns: phrases/single words (P1: Singular value, P2: Information retrieval, T1: Information, T2: Singular, T3: Value, T4: Computations, T5: Retrieval).
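The label-induction step M = U_k^T P can be sketched with NumPy on the slide example. Several things here are assumptions rather than details from the deck: the weighting (raw counts, unit-length columns), the choice k = 2, and taking absolute values of M because the sign of a singular vector is arbitrary.

```python
import numpy as np

# Raw term counts: rows T1..T5, columns D1..D7.
A = np.array([
    [0, 0, 1, 1, 0, 0, 1],  # T1 information
    [1, 1, 0, 0, 0, 1, 0],  # T2 singular
    [1, 1, 0, 0, 0, 1, 0],  # T3 value
    [1, 0, 0, 0, 1, 0, 0],  # T4 computations
    [0, 0, 1, 1, 0, 0, 0],  # T5 retrieval
], dtype=float)
A /= np.linalg.norm(A, axis=0)  # unit-length document columns (assumed weighting)

labels = ["Singular value", "Information retrieval",
          "Information", "Singular", "Value", "Computations", "Retrieval"]
# P: one column per candidate label, expressed over the terms T1..T5.
P = np.array([
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1],
], dtype=float)
P /= np.linalg.norm(P, axis=0)  # unit-length label columns

k = 2  # number of abstract concepts kept (picked by hand for this sketch)
U, S, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]

# M[i, j]: how well candidate label j matches abstract concept i.
M = np.abs(U_k.T @ P)
cluster_labels = [labels[j] for j in M.argmax(axis=1)]
print(cluster_labels)
```

On this toy input each abstract concept should select one of the two phrases, "Singular value" and "Information retrieval", ahead of the single words.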
Stage 4/4: Cluster-content allocation
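The slide gives only the stage name. One way to sketch this stage, assuming a plain vector-space assignment, is to compare each document with each cluster label by cosine similarity and assign it wherever a threshold is reached. The function name `allocate`, the threshold value 0.3 and the "others" bucket are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def allocate(doc_vectors, label_vectors, threshold=0.3):
    """Assign each document to every label it matches; unmatched ones go to 'others'."""
    clusters = {label: [] for label in label_vectors}
    others = []
    for doc_id, dvec in doc_vectors.items():
        hits = [lab for lab, lvec in label_vectors.items()
                if cosine(dvec, lvec) >= threshold]
        for lab in hits:
            clusters[lab].append(doc_id)
        if not hits:
            others.append(doc_id)
    return clusters, others

# Vectors over the terms T1..T5 (information, singular, value, computations, retrieval).
doc_vectors = {
    "D1": [0, 1, 1, 1, 0], "D2": [0, 1, 1, 0, 0], "D3": [1, 0, 0, 0, 1],
    "D4": [1, 0, 0, 0, 1], "D5": [0, 0, 0, 1, 0], "D6": [0, 1, 1, 0, 0],
    "D7": [1, 0, 0, 0, 0],
}
label_vectors = {
    "Singular value": [0, 1, 1, 0, 0],
    "Information retrieval": [1, 0, 0, 0, 1],
}

clusters, others = allocate(doc_vectors, label_vectors)
print(clusters)
print(others)
```

Because assignment is per label, a document can land in more than one cluster when it is similar enough to several labels.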
Test Data: 10 categories, 4 subjects