
WEB BAR 2004 Advanced Retrieval and Web Mining


Presentation Transcript


  1. WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 14

  2. Today’s Topics • Latent Semantic Indexing / Dimension reduction • Interactive information retrieval / User interfaces • Evaluation of interactive retrieval

  3. How LSI is used for Text Search • LSI is a technique for dimension reduction • Similar to Principal Component Analysis (PCA) • Addresses (near-)synonymy: car/automobile • Attempts to enable concept-based retrieval • Pre-process docs using a technique from linear algebra called Singular Value Decomposition (SVD) • The choice of reduced dimensionality is a trade-off: • Fewer dimensions: more “collapsing of axes”, better recall, worse precision • More dimensions: less collapsing, worse recall, better precision • Queries are handled in this new (reduced) vector space.
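A rough end-to-end sketch of this pipeline (the lecture prescribes no library; scikit-learn, the toy corpus, and all variable names below are illustrative assumptions, not course code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the car is fast",              # toy corpus, purely illustrative
    "the automobile was quick",
    "stock prices fell today",
]

vectorizer = TfidfVectorizer()
A = vectorizer.fit_transform(docs)           # weighted doc x term matrix
lsi = TruncatedSVD(n_components=2)           # s = 2 latent dimensions
docs_s = lsi.fit_transform(A)                # docs in the reduced space

q_s = lsi.transform(vectorizer.transform(["fast car"]))
print(cosine_similarity(q_s, docs_s))        # query-to-doc similarity in LSI space
```

Note that scikit-learn stores the matrix as docs × terms, the transpose of the term-document convention used on the next slides.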

  4. Input: Term-Document Matrix [Diagram: m × n matrix A with rows ti (terms) and columns dj (documents)] • wi,j = (normalized) weighted count of (ti, dj) • Key idea: factorize this matrix
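A toy construction of such a matrix, assuming raw counts for wi,j (the corpus and names are made up for illustration):

```python
import numpy as np

docs = ["car car fast", "automobile quick", "car automobile"]   # n = 3 toy docs
terms = sorted({t for d in docs for t in d.split()})            # m terms
A = np.zeros((len(terms), len(docs)))                           # m x n, terms x docs

for j, d in enumerate(docs):
    for t in d.split():
        A[terms.index(t), j] += 1        # w_ij = raw count of term i in doc j

print(terms)   # ['automobile', 'car', 'fast', 'quick']
print(A)
```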

  5. Matrix Factorization A = W × H [Diagram: A (m × n) = W (m × k basis) × H (k × n representation)] • hj (column j of H) is the representation of dj in terms of the basis W • If rank(W) ≥ rank(A), then we can always find H so that A = WH • Notice the duality of the problem • More “semantic” dimensions -> LSI (latent semantic indexing)

  6. Minimization Problem • Minimize ||A − W S VT|| • Minimize information loss • Given: • a norm • for SVD, the 2-norm • constraints on W, S, V • for SVD, W and V are orthonormal, and S is diagonal

  7. Matrix Factorizations: SVD A = W × S × VT [Diagram: A (m × n) factored as W (m × k basis) × S (k × k singular values) × VT (k × n representation)] • Restrictions on the factorization: W, V orthonormal; S diagonal

  8. Dimension Reduction • For some s << r = rank(A), zero out all but the s biggest singular values in S. • Denote by Ss this new version of S. • Typically s is in the hundreds, while r could be in the (tens of) thousands. • Before: A = W S Vt • Let As = W Ss Vt = Ws Ss Vst • As is a good approximation to A. • Best rank-s approximation according to the 2-norm
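A small numpy sketch of this truncation; it also checks the 2-norm optimality claim (Eckart-Young), under which the error equals the first discarded singular value. The random matrix and all names are illustrative:

```python
import numpy as np

m, n, s = 6, 5, 2
A = np.random.rand(m, n)                       # toy term-document matrix

W, sing, Vt = np.linalg.svd(A, full_matrices=False)     # A = W @ diag(sing) @ Vt
sing_s = np.where(np.arange(len(sing)) < s, sing, 0.0)  # keep s biggest, zero the rest
A_s = W @ np.diag(sing_s) @ Vt                 # rank-s approximation A_s

# Eckart-Young: the 2-norm error of the best rank-s approximation
# equals the first discarded singular value.
print(np.linalg.norm(A - A_s, 2), sing[s])
```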

  9. Dimension Reduction As = W × Ss × VT [Diagram: Ss is S with all but the s largest singular values zeroed; As = W × Ss × VT] • The columns of As represent the docs, but in s << m dimensions • Best rank-s approximation according to the 2-norm

  10. More on W and V • Recall the m × n matrix of terms × docs, A. • Define the term-term correlation matrix T = AAt • At denotes the matrix transpose of A. • T is a square, symmetric m × m matrix. • Doc-doc correlation matrix D = AtA. • D is a square, symmetric n × n matrix. Why?

  11. Eigenvectors • Denote by W the m × r matrix of eigenvectors of T. • Denote by V the n × r matrix of eigenvectors of D. • Denote by S the diagonal matrix of the square roots of the eigenvalues of T = AAt, in sorted order. • It turns out that A = WSVt is the SVD of A. • Semi-precise intuition: the new dimensions are the principal components of term correlation space.
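A quick numerical check that the singular values of A are the square roots of the eigenvalues of T = AAt (numpy sketch; the random matrix is an illustrative assumption):

```python
import numpy as np

A = np.random.rand(6, 4)
T = A @ A.T                                   # term-term correlation matrix

eigvals, eigvecs = np.linalg.eigh(T)          # eigenvalues in ascending order
W, sing, Vt = np.linalg.svd(A, full_matrices=False)

# The nonzero eigenvalues of T, taken largest first, match sing**2
# (the eigenvectors likewise match the columns of W, up to sign).
print(np.allclose(sing**2, eigvals[::-1][:len(sing)]))
```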

  12. Query processing • Exercise: How do you map the query into the reduced space?
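One standard answer, sketched here rather than asserted as the course's: the LSI literature “folds in” a query q by projecting it with the truncated factors, qs = Ss⁻¹ WsT q, and then compares qs to the document rows of Vs (numpy; all names and the random data are illustrative):

```python
import numpy as np

A = np.random.rand(6, 5)                     # toy terms x docs matrix
s = 2
W, sing, Vt = np.linalg.svd(A, full_matrices=False)
W_s, S_s, V_s = W[:, :s], np.diag(sing[:s]), Vt[:s, :].T

q = np.random.rand(6)                        # query as a term vector
q_s = np.linalg.inv(S_s) @ W_s.T @ q         # query folded into the reduced space

# Docs live in the same space as the rows of V_s; rank by cosine similarity.
sims = (V_s @ q_s) / (np.linalg.norm(V_s, axis=1) * np.linalg.norm(q_s))
print(np.argsort(-sims))                     # doc indices, best match first
```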

  13. Take Away • LSI is optimal: the optimal solution for a given dimensionality • Caveat: mathematically optimal is not necessarily “semantically” optimal • LSI is unique • Except for signs and for singular values with the same value • Key benefits of LSI • Enhances recall, addresses the synonymy problem • But can decrease precision • Maintenance challenges • Changing collections • Recompute at intervals? • Performance challenges • Cheaper alternatives for recall enhancement • E.g., pseudo-feedback • Use of LSI in deployed systems: Why?

  14. Resources: LSI • Random projection theorem: http://citeseer.nj.nec.com/dasgupta99elementary.html • Faster random projection: http://citeseer.nj.nec.com/frieze98fast.html • Latent semantic indexing: http://citeseer.nj.nec.com/deerwester90indexing.html http://cs276a.stanford.edu/handouts/fsnlp-svd.pdf • Books: FSNLP 15.4, MG 4.6, MIR 2.7.2.

  15. Interactive Information Retrieval: User Interfaces

  16. The User in Information Access [Flowchart: the user starts with an information need, formulates a query, finds a starting point, sends the query to the system, receives results, and explores them; if not done, the user reformulates and repeats, otherwise stops]

  17. Main Focus of Information Retrieval [Same flowchart as the previous slide, with the annotation “Focus of most IR!” marking the send-to-system / receive-results portion of the loop]

  18. Information Access in Context [Flowchart: a high-level goal leads the user to analyze the task, perform information access, and synthesize results; if the goal is not met, the cycle repeats, otherwise it stops]

  19. The User in Information Access [Flowchart repeated from slide 16: the user’s information-access loop]

  20. Queries on the Web: Most Frequent on 2002/10/26

  21. Queries on the Web (2000) Why only 9% sex?

  22. Intranet Queries (Aug 2000) • 3351 bearfacts • 3349 telebears • 1909 extension • 1874 schedule+of+classes • 1780 bearlink • 1737 bear+facts • 1468 decal • 1443 infobears • 1227 calendar • 989 career+center • 974 campus+map • 920 academic+calendar • 840 map • 773 bookstore • 741 class+pass • 738 housing • 721 tele-bears • 716 directory • 667 schedule • 627 recipes • 602 transcripts • 582 tuition • 577 seti • 563 registrar • 550 info+bears • 543 class+schedule • 470 financial+aid Source: Ray Larson

  23. Intranet Queries • Summary of sample data from 3 weeks of UCB queries • 13.2% Telebears/BearFacts/InfoBears/BearLink (12297) • 6.7% Schedule of classes or final exams (6222) • 5.4% Summer Session (5041) • 3.2% Extension (2932) • 3.1% Academic Calendar (2846) • 2.4% Directories (2202) • 1.7% Career Center (1588) • 1.7% Housing (1583) • 1.5% Map (1393) Source: Ray Larson

  24. Types of Information Needs • Need an answer to a question (Who won the Super Bowl?) • Re-find a particular document • Find a good recipe for tonight’s dinner • Exploration of a new area (browse sites about Mexico City) • Authoritative summary of information (HIV review) • In most cases, only one interface! • Cell phone / PDA / camera / MP3 analogy

  25. The User in Information Access [Flowchart repeated from slide 16: the user’s information-access loop]

  26. Find Starting Point By Browsing [Diagram: an entry point leads through a space of linked items to a starting point for search (or the answer?)]

  27. Hierarchical browsing [Diagram: a tree browsed top-down through Level 0, Level 1, Level 2]

  28. Visual Browsing: Hyperbolic Tree

  29. Visual Browsing: Hyperbolic Tree

  30. Visual Browsing: Themescape

  31. Scatter/Gather • Scatter/gather allows the user to find a set of documents of interest through browsing. • It iterates: • Scatter • Take the collection and scatter it into n clusters. • Gather • Pick the clusters of interest and merge them.
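A minimal sketch of one scatter/gather iteration, assuming k-means as the clustering step (the original Scatter/Gather system used its own fast clustering algorithms; k-means, the corpus, and all names here are stand-ins):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def scatter(docs, n_clusters=3):
    """Scatter: cluster the current document set into n clusters."""
    X = TfidfVectorizer().fit_transform(docs)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    return [[d for d, l in zip(docs, labels) if l == c] for c in range(n_clusters)]

def gather(clusters, picked):
    """Gather: merge the clusters the user picked into one document set."""
    return [d for c in picked for d in clusters[c]]

docs = ["cheap flights", "hotel deals", "python tutorial", "java tutorial",
        "flight booking", "learn programming"]
clusters = scatter(docs)                 # scatter into 3 clusters
docs = gather(clusters, picked=[0, 2])   # user picks clusters 0 and 2; repeat
```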

  32. Scatter/Gather

  33. Browsing vs. Searching • Browsing and searching are often interleaved. • Information need dependent • Open-ended (find information about Mexico City) -> browsing • Specific (Who won the Super Bowl?) -> searching • User dependent • Some users prefer searching, others browsing (confirmed in many studies: some hate to type) • Advantage of browsing: you don’t need to know the vocabulary of the collection • Compare to the physical world • Browsing vs. searching in a grocery store

  34. Browsers vs. Searchers • 1/3 of users do not search at all • 1/3 rarely search • Or type URLs only • Only 1/3 understand the concept of search • (ISP data from 2000) Why?

  35. Starting Points • Methods for finding a starting point • Select collections from a list • Highwire Press • Google! • Hierarchical browsing, directories • Visual browsing • Hyperbolic tree • Themescape, Kohonen maps • Browsing vs. searching

  36. The User in Information Access [Flowchart repeated from slide 16: the user’s information-access loop]

  37. Form-based Query Specification (Infoseek) Credit: Marti Hearst

  38. Boolean Queries • Boolean logic is difficult for the average user. • Some interfaces for average users support formulation of Boolean queries. • The current view is that non-expert users are best served with non-Boolean or simple +/- Boolean queries (pioneered by AltaVista). • But Boolean queries are the standard for certain groups of expert users (e.g., lawyers).
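A sketch of the simple +/- style, assuming the usual reading that a +term must occur and a -term must not (an illustration, not AltaVista’s actual implementation):

```python
def matches(doc_text, query):
    """Return True if doc_text satisfies every +/- constraint in the query."""
    words = set(doc_text.lower().split())
    for token in query.split():
        if token.startswith("+") and token[1:].lower() not in words:
            return False
        if token.startswith("-") and token[1:].lower() in words:
            return False
    return True

print(matches("cheap flights to lisbon", "+flights -hotel"))  # True
```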

  39. Direct Manipulation Spec.: VQUERY (Jones 98) Credit: Marti Hearst

  40. One Problem With Boolean Queries: Feast or Famine • Specifying a well-targeted query is hard; the problem is bigger for Boolean queries. • Google: 1860 hits for “standard user dlink 650”; 0 hits after adding “no card found” [Diagram: axis “How general is the query?”, running from feast (too many results) to famine (none)]

  41. Boolean Queries • Summary • Complex Boolean queries are difficult for the average user • Feast or famine problem • Prior to Google, many IR researchers thought Boolean queries were a bad idea. • Google queries are strict conjunctions. • Why does this work well?

  42. Parametric search example Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.

  43. Parametric search example We can add text search.

  44. Parametric search • Each document has, in addition to text, some “meta-data”, e.g., Make, Model, City, Color • A parametric search interface allows the user to combine a full-text query with selections on these parameters
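A minimal sketch of the idea, with the metadata fields taken from the slide’s example and everything else (records, function name) assumed for illustration:

```python
cars = [
    {"make": "Honda", "city": "Palo Alto", "color": "blue",
     "text": "one owner, well maintained"},
    {"make": "Honda", "city": "Santa Clara", "color": "red",
     "text": "needs minor bodywork"},
]

def parametric_search(records, text_query, **params):
    """Return records matching all metadata selections and the text query."""
    return [r for r in records
            if all(r.get(k) == v for k, v in params.items())
            and text_query.lower() in r["text"].lower()]

print(parametric_search(cars, "owner", make="Honda", color="blue"))
```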

  45. Interfaces for term browsing

  46. Re/Formulate Query • Single text box (Google, Stanford intranet) • Command-based (Socrates) • Boolean queries • Parametric search • Term browsing • Other methods • Relevance feedback • Query expansion • Spelling correction • Natural language, question answering

  47. The User in Information Access [Flowchart repeated from slide 16: the user’s information-access loop]

  48. Category Labels to Support Exploration • Example: • ODP categories on Google • Advantages: • Interpretable • Capture summary information • Describe multiple facets of content • Domain dependent, and so descriptive • Disadvantages: • Domain dependent, so costly to acquire • May mismatch users’ interests Credit: Marti Hearst
