LINGO

LINGO: Search Results Clustering. Sandra Gama. Internet: endless document collection. Search engines: no question answering, fast access to Web content, sensitive to query quality. We need meaningful results. Clustering: grouping by similarity, semantic structure, groups.


Presentation Transcript


  1. LINGO Search Results Clustering Sandra Gama

  2. Internet: endless document collection

  3. Search Engines

  4. NO question answering

  5. FAST access to Web content

  6. SENSITIVE to query quality

  7. we NEED meaningful RESULTS

  8. CLUSTERING!

  9. GROUPING by Similarity

  10. Semantic structure

  11. Groups

  12. Description

  13. Luxury Car Feline, panther family

  14. Description QUALITY

  15. How to cluster?

  16. LINGO: a new approach

  17. user query → Pre-processing → filtered docs → Phrase extraction → frequent phrases → Cluster-label induction → cluster labels → Cluster-content allocation → clustered documents

  18. Stage 1/4: Preprocessing

  19. Stage 1/4: Preprocessing
      1. Text segmentation
      2. Stemming
      3. Ignore stop words
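
The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the implementation used by LINGO: the stop-word list, the `toy_stem` suffix stripper, and all function names are assumptions made for this example, and a real system would use a proper stemmer. Stop words are marked rather than deleted so that later stages can still see where they occur.

```python
import re

# Assumed, tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "of", "the", "to", "in"}

def split_sentences(text):
    """Text segmentation: split after ., !, ? or ; (followed by space or end)."""
    parts = re.split(r"[.!?;]+(?:\s+|$)", text)
    return [p.strip() for p in parts if p.strip()]

def toy_stem(word):
    """Very crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ations", "ation", "ings", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Return one list of (stem, is_stop_word) pairs per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())
        result.append([(toy_stem(t), t in STOP_WORDS) for t in tokens])
    return result

if __name__ == "__main__":
    text = "Software for the sparse singular value decomposition. Matrix computations"
    for sentence in preprocess(text):
        print(sentence)
```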

  20. Stage 2/4: Phrase extraction

  21. Goal

  22. 1/4 More than N occurrences

  23. 2/4 No more than 1 sentence

  24. 3/4 Complete phrase

  25. 4/4 Stop words
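
A rough sketch of how the four conditions above might be checked on a small collection. It is a brute-force n-gram count, not the suffix-array method the talk goes on to describe. Assumptions made for this example: the stop-word list, the parameter names `n` and `max_len`, reading "Stop words" as "no stop word at either end of a phrase", and a simplified completeness test that drops a phrase when it only ever occurs inside a longer kept phrase.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "and", "for", "of", "the", "to", "in"}  # assumed list

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))

def frequent_phrases(sentences, n=1, max_len=4):
    """sentences: list of word lists. Returns {phrase tuple: occurrence count}."""
    counts = Counter()
    # Condition 2: n-grams are taken inside one sentence only.
    for words in sentences:
        for size in range(2, max_len + 1):
            for i in range(len(words) - size + 1):
                counts[tuple(words[i:i + size])] += 1
    # Condition 1: keep phrases seen more than n times.
    # Condition 4 (as read here): no stop word at either end.
    kept = {p: c for p, c in counts.items()
            if c > n and p[0] not in STOP_WORDS and p[-1] not in STOP_WORDS}
    # Condition 3 (simplified): drop a phrase that has the same count as a
    # longer kept phrase containing it, i.e. it never occurs on its own.
    return {p: c for p, c in kept.items()
            if not any(len(q) > len(p) and kept[q] == c and contains(q, p)
                       for q in kept)}

if __name__ == "__main__":
    titles = [
        "large-scale singular value computations",
        "software for the sparse singular value decomposition",
        "introduction to modern information retrieval",
        "linear algebra for intelligent information retrieval",
        "matrix computations",
        "singular value cryptogram analysis",
        "automatic information organization",
    ]
    print(frequent_phrases([t.split() for t in titles]))
```

Run on the seven example titles used later in the talk, this keeps only "singular value" and "information retrieval".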

  26. How it works

  27. How many non-empty suffixes? 11 suffixes

  28. Suffix array:
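
To make the suffix-array idea concrete, here is a small sketch. The example string `"mississippi"` is an assumption chosen because it has 11 characters and therefore 11 non-empty suffixes; the string used on the original slide is not preserved in this transcript. The sketch works on characters for brevity, and the same functions also accept a list of words, which is closer to what phrase extraction needs. The sort is the naive one and is meant for illustration only.

```python
def suffix_array(text):
    """Start positions of all non-empty suffixes, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def common_prefix_length(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def repeated_substrings(text, min_len=2):
    """Substrings shared by neighbouring suffixes in sorted order.

    Neighbours in a suffix array share the longest prefixes, so every
    reported substring occurs at least twice in `text`.
    """
    sa = suffix_array(text)
    found = set()
    for prev, cur in zip(sa, sa[1:]):
        k = common_prefix_length(text[prev:], text[cur:])
        if k >= min_len:
            found.add(text[cur:cur + k])
    return found

if __name__ == "__main__":
    word = "mississippi"          # assumed example, 11 non-empty suffixes
    for start in suffix_array(word):
        print(start, word[start:])
    print(repeated_substrings(word))
```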

  29. Stage 3/4: Cluster-label induction

  30. Singular Value Decomposition

  31. A: term × document matrix. Find matrices U, Σ, V such that $A = U \Sigma V^{T}$

  32. D1: Large-scale singular value computations
      D2: Software for the sparse singular value decomposition
      D3: Introduction to modern information retrieval
      D4: Linear algebra for intelligent information retrieval
      D5: Matrix computations
      D6: Singular value cryptogram analysis
      D7: Automatic information organization
      T1: Information
      T2: Singular
      T3: Value
      T4: Computations
      T5: Retrieval
      P1: Singular value
      P2: Information retrieval

  33. D1: Large-scale singular value computations
      D2: Software for the sparse singular value decomposition
      D3: Introduction to modern information retrieval
      D4: Linear algebra for intelligent information retrieval
      D5: Matrix computations
      D6: Singular value cryptogram analysis
      D7: Automatic information organization
      T1: Information
      T2: Singular
      T3: Value
      T4: Computations
      T5: Retrieval
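
As an illustration, the term × document matrix A for the seven documents and five terms listed above could be built like this. Raw occurrence counts are an assumption made for simplicity; the slides do not state the weighting, and a weighted and normalized variant (for example tf-idf) would be a natural alternative. The names `DOCS`, `TERMS`, and `term_document_matrix` are made up for this sketch.

```python
DOCS = [
    "Large-scale singular value computations",
    "Software for the sparse singular value decomposition",
    "Introduction to modern information retrieval",
    "Linear algebra for intelligent information retrieval",
    "Matrix computations",
    "Singular value cryptogram analysis",
    "Automatic information organization",
]
TERMS = ["information", "singular", "value", "computations", "retrieval"]

def term_document_matrix(docs, terms):
    """Rows are terms (T1..T5), columns are documents (D1..D7).

    Each entry is how often the term occurs in the document (assumed
    weighting: raw counts, no stemming).
    """
    tokenized = [doc.lower().split() for doc in docs]
    return [[words.count(term) for words in tokenized] for term in terms]

if __name__ == "__main__":
    for term, row in zip(TERMS, term_document_matrix(DOCS, TERMS)):
        print(f"{term:>12}", row)
```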

  34. Abstract concept matrix (SVD) U =

  35. P = matrix of candidate labels expressed in the term space. Rows: T1: Information, T2: Singular, T3: Value, T4: Computations, T5: Retrieval. Columns: T1 to T5 (single words), P1: Singular value, P2: Information retrieval

  36. $M = U_k^{T} P$. Rows: abstract concepts. Columns: phrases/single words (T1: Information, T2: Singular, T3: Value, T4: Computations, T5: Retrieval, P1: Singular value, P2: Information retrieval)
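
A minimal NumPy sketch of the label-induction step, under stated assumptions: A holds raw term counts for D1..D7 (the slides do not give the weighting), k = 2 abstract concepts is picked by hand, phrase columns of P are unit-length sums of their term vectors, and each concept takes the candidate with the largest absolute entry in its row of M. The function name `induce_labels` is made up for this example.

```python
import numpy as np

TERMS = ["Information", "Singular", "Value", "Computations", "Retrieval"]
PHRASES = {
    "Singular value": ["Singular", "Value"],
    "Information retrieval": ["Information", "Retrieval"],
}

# Term x document matrix for D1..D7 (assumed weighting: raw counts).
A = np.array([
    [0, 0, 1, 1, 0, 0, 1],   # Information
    [1, 1, 0, 0, 0, 1, 0],   # Singular
    [1, 1, 0, 0, 0, 1, 0],   # Value
    [1, 0, 0, 0, 1, 0, 0],   # Computations
    [0, 0, 1, 1, 0, 0, 0],   # Retrieval
], dtype=float)

def induce_labels(A, terms, phrases, k):
    """Return (one label per abstract concept, the matrix M = U_k^T P)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
    U_k = U[:, :k]                                     # first k abstract concepts
    # P: one column per candidate label, written in the term space.
    names = list(terms)
    columns = [np.eye(len(terms))[i] for i in range(len(terms))]
    for name, words in phrases.items():
        v = np.zeros(len(terms))
        for w in words:
            v[terms.index(w)] = 1.0
        columns.append(v / np.linalg.norm(v))          # unit length
        names.append(name)
    P = np.column_stack(columns)
    M = U_k.T @ P                                      # concepts x candidates
    best = np.abs(M).argmax(axis=1)                    # sign of U is arbitrary
    return [names[j] for j in best], M

if __name__ == "__main__":
    labels, M = induce_labels(A, TERMS, PHRASES, k=2)
    print(labels)
    print(np.round(M, 2))
```

With these assumptions the two concepts are labelled "Singular value" and "Information retrieval", in that order.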

  37. Last step

  38. Prune overlapping label descriptions: $Z^{T} Z$
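
One way to read the pruning step: if Z holds the candidate labels as unit-length columns in the term space, then the entries of Z^T Z are the pairwise cosine similarities between labels, and a label that is too similar to a better-scoring one can be dropped. The sketch below follows that reading with a greedy pass. The threshold value, the label scores, and the function names are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity, i.e. one entry of Z^T Z for unit-length columns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def prune_labels(labels, threshold=0.3):
    """labels: list of (name, term_vector, score).

    Visit labels from best to worst score and keep a label only if it is
    not too similar to any label already kept (greedy simplification).
    """
    kept = []
    for name, vec, score in sorted(labels, key=lambda l: l[2], reverse=True):
        if all(cosine(vec, other_vec) <= threshold for _, other_vec, _ in kept):
            kept.append((name, vec, score))
    return [name for name, _, _ in kept]

if __name__ == "__main__":
    # Term order: Information, Singular, Value, Computations, Retrieval.
    candidates = [
        ("Singular value",        [0, 1, 1, 0, 0], 0.95),  # assumed scores
        ("Singular",              [0, 1, 0, 0, 0], 0.67),
        ("Information retrieval", [1, 0, 0, 0, 1], 0.99),
    ]
    print(prune_labels(candidates))
```

Here "Singular" overlaps with the higher-scoring "Singular value" and is removed.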

  39. Stage 4/4: Cluster-content allocation

  40. Similarity

  41. Cluster Score
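
The slides give no formulas for these two steps, so the following is only a sketch under assumptions: each cluster label is treated as a small query vector, a document joins a cluster when its cosine similarity to the label exceeds a threshold, and the cluster score is the label score multiplied by the number of documents in the cluster. The threshold, the label scores, and the function names are illustrative values, not taken from the talk.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def allocate(labels, docs, threshold=0.6):
    """labels: {name: (term_vector, label_score)}; docs: list of term vectors.

    Returns (clusters, other). `clusters` is a list of
    (name, document indices, cluster score) sorted by descending score;
    `other` lists the documents that matched no label. A document may
    belong to more than one cluster.
    """
    clusters, assigned = [], set()
    for name, (vec, label_score) in labels.items():
        members = [j for j, d in enumerate(docs) if cosine(vec, d) > threshold]
        assigned.update(members)
        clusters.append((name, members, label_score * len(members)))
    clusters.sort(key=lambda c: c[2], reverse=True)
    other = [j for j in range(len(docs)) if j not in assigned]
    return clusters, other

if __name__ == "__main__":
    # Term order: Information, Singular, Value, Computations, Retrieval.
    docs = [
        [0, 1, 1, 1, 0],  # D1
        [0, 1, 1, 0, 0],  # D2
        [1, 0, 0, 0, 1],  # D3
        [1, 0, 0, 0, 1],  # D4
        [0, 0, 0, 1, 0],  # D5
        [0, 1, 1, 0, 0],  # D6
        [1, 0, 0, 0, 0],  # D7
    ]
    labels = {
        "Singular value": ([0, 1, 1, 0, 0], 0.95),          # assumed scores
        "Information retrieval": ([1, 0, 0, 0, 1], 0.99),
    }
    clusters, other = allocate(labels, docs)
    for name, members, score in clusters:
        print(name, ["D%d" % (j + 1) for j in members], round(score, 2))
    print("Other:", ["D%d" % (j + 1) for j in other])
```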

  42. Evaluation and Results

  43. Test Data 10 categories 4 subjects
