WEB BAR 2004 Advanced Retrieval and Web Mining. Lecture 14.
Today’s Topics • Latent Semantic Indexing / Dimension reduction • Interactive information retrieval / User interfaces • Evaluation of interactive retrieval
How LSI is used for Text Search • LSI is a technique for dimension reduction • Similar to Principal Component Analysis (PCA) • Addresses (near-)synonymy: car/automobile • Attempts to enable concept-based retrieval • Pre-process docs using a technique from linear algebra called Singular Value Decomposition. • Reduce dimensionality to: • Fewer dimensions, more “collapsing of axes”, better recall, worse precision • More dimensions, less collapsing, worse recall, better precision • Queries handled in this new (reduced) vector space.
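A minimal sketch of this pipeline, assuming scikit-learn is available; the toy documents, the query, and the choice of s = 2 dimensions are made up for illustration, not taken from the lecture.

```python
# LSI sketch: TF-IDF term weights, truncated SVD, then querying in the reduced space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "car dealership sells used automobiles",      # hypothetical documents
    "automobile repair and car maintenance",
    "recipe for chocolate cake",
]

vectorizer = TfidfVectorizer()
A = vectorizer.fit_transform(docs)            # term-document weights (docs are rows here)

svd = TruncatedSVD(n_components=2)            # s = 2 latent "concept" dimensions
docs_s = svd.fit_transform(A)                 # documents mapped into the reduced space

q_s = svd.transform(vectorizer.transform(["automobile"]))
print(cosine_similarity(q_s, docs_s))         # car-related docs should outrank the recipe
```

Lowering n_components collapses more axes (better recall, worse precision); raising it does the opposite, as noted above.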
Input: Term-Document Matrix • A is an m × n matrix of terms × documents • w_i,j = (normalized) weighted count of term t_i in document d_j • Key idea: Factorize this matrix
Matrix Factorization • A = W × H, where A is m × n, W is m × k (basis), and H is k × n (representation) • h_j is the representation of d_j in terms of the basis W • If rank(W) ≥ rank(A) then we can always find H so that A = WH • Notice the duality of the problem • More "semantic" dimensions -> LSI (latent semantic indexing)
Minimization Problem • Minimize ||A − WSVᵀ||, i.e., minimize the information loss • Given: • a norm (for SVD, the 2-norm) • constraints on W, S, V (for SVD, W and V are orthonormal, and S is diagonal)
Matrix Factorizations: SVD • A = W × S × Vᵀ, where A is m × n, W is m × k (basis), S is k × k (singular values), and Vᵀ is k × n (representation) • Restrictions on the representation: W, V orthonormal; S diagonal
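A small numpy sketch of the factorization, using an arbitrary 4 × 3 toy matrix (the weights are made up) to show the shapes and the orthonormality restrictions:

```python
import numpy as np

A = np.array([[2., 0., 1.],                  # toy term-document matrix: m=4 terms, n=3 docs
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])

W, sv, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(sv)                              # singular values, in decreasing order

print(np.allclose(A, W @ S @ Vt))            # A = W S V^T
print(np.allclose(W.T @ W, np.eye(3)))       # columns of W are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(3)))     # columns of V are orthonormal
```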
Dimension Reduction • For some s << Rank, zero out all but the s biggest singular values in S. • Denote by S_s this new version of S. • Typically s is in the hundreds while r (Rank) could be in the (tens of) thousands. • Before: A = W S Vᵀ • Let A_s = W S_s Vᵀ = W_s S_s V_sᵀ • A_s is a good approximation to A: the best rank-s approximation according to the 2-norm
Dimension Reduction • A_s = W × S_s × Vᵀ, with all but the s largest entries of S zeroed out • The columns of A_s represent the docs, but in s << m dimensions • Best rank-s approximation according to the 2-norm
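Continuing the toy example, a sketch of the truncation step; by the Eckart-Young theorem, the 2-norm error of A_s equals the largest discarded singular value:

```python
import numpy as np

A = np.array([[2., 0., 1.],                  # same toy matrix as above
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])
W, sv, Vt = np.linalg.svd(A, full_matrices=False)

s_keep = 2                                   # keep only the s largest singular values
Ws, Ss, Vst = W[:, :s_keep], np.diag(sv[:s_keep]), Vt[:s_keep, :]
A_s = Ws @ Ss @ Vst                          # best rank-s approximation of A

# The 2-norm error equals the first singular value that was zeroed out.
print(np.linalg.norm(A - A_s, ord=2), sv[s_keep])
```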
More on W and V • Recall the m × n matrix of terms × docs, A. • Define the term-term correlation matrix T = AAᵀ • Aᵀ denotes the matrix transpose of A. • T is a square, symmetric m × m matrix. • Doc-doc correlation matrix D = AᵀA. • D is a square, symmetric n × n matrix. Why?
Eigenvectors • Denote by W the m × r matrix of eigenvectors of T. • Denote by V the n × r matrix of eigenvectors of D. • Denote by S the diagonal matrix of the square roots of the eigenvalues of T = AAᵀ, in sorted order. • It turns out that A = WSVᵀ is the SVD of A • Semi-precise intuition: The new dimensions are the principal components of term correlation space.
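A quick numerical check of this relationship on the same toy matrix: the columns of W are eigenvectors of T = AAᵀ, and the squared singular values are T's eigenvalues.

```python
import numpy as np

A = np.array([[2., 0., 1.],                  # same toy matrix as above
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])
W, sv, Vt = np.linalg.svd(A, full_matrices=False)

T = A @ A.T                                  # term-term correlation matrix (m x m)
print(np.allclose(T @ W, W @ np.diag(sv**2)))                        # W's columns are eigenvectors of T
print(np.allclose(sv**2, np.sort(np.linalg.eigvalsh(T))[::-1][:3]))  # eigenvalues of T = squared singular values
```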
Query processing • Exercise: How do you map the query into the reduced space?
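One standard answer, sketched on the toy matrix from above: fold the query vector q (over the m terms) into the reduced space via q_s = S_s⁻¹ W_sᵀ q, then rank documents by cosine similarity against the columns of V_sᵀ. The query vector here is made up.

```python
import numpy as np

A = np.array([[2., 0., 1.],                  # same toy matrix: m=4 terms, n=3 docs
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])
W, sv, Vt = np.linalg.svd(A, full_matrices=False)

s_keep = 2
Ws, Ss, Vst = W[:, :s_keep], np.diag(sv[:s_keep]), Vt[:s_keep, :]

q = np.array([1., 0., 1., 0.])               # query as a weight vector over the m terms
q_s = np.linalg.inv(Ss) @ Ws.T @ q           # query "folded in" to the s-dimensional space

docs_s = Vst                                 # each column is a document in s dimensions
cos = (q_s @ docs_s) / (np.linalg.norm(q_s) * np.linalg.norm(docs_s, axis=0))
print(cos)                                   # rank docs by cosine similarity in the reduced space
```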
Take Away • LSI is optimal: the optimal solution for a given dimensionality • Caveat: Mathematically optimal is not necessarily "semantically" optimal. • LSI is unique • Unique up to signs and ties among equal singular values • Key benefits of LSI • Enhances recall, addresses the synonymy problem • But can decrease precision • Maintenance challenges • Changing collections • Recompute at intervals? • Performance challenges • Cheaper alternatives for recall enhancement • E.g. pseudo-feedback • Use of LSI in deployed systems? Why?
Resources: LSI • Random projection theorem: http://citeseer.nj.nec.com/dasgupta99elementary.html • Faster random projection: http://citeseer.nj.nec.com/frieze98fast.html • Latent semantic indexing: http://citeseer.nj.nec.com/deerwester90indexing.html http://cs276a.stanford.edu/handouts/fsnlp-svd.pdf • Books: FSNLP 15.4, MG 4.6, MIR 2.7.2.
The User in Information Access • [flow diagram: information need → find starting point → formulate/reformulate query → send to system → receive results → explore results → done? (no: reformulate; yes: stop)]
Main Focus of Information Retrieval • [same loop as above; the "send to system → receive results" step is the focus of most IR!]
Information Access in Context • [diagram: a high-level goal drives a loop of analyze → information access → synthesize → done? (no: repeat; yes: stop)]
The User in Information Access • [flow diagram: information need → find starting point → formulate/reformulate query → send to system → receive results → explore results → done? (no: reformulate; yes: stop)]
Queries on the Web (2000) • [chart of query topic distribution omitted] • Why only 9% sex?
Intranet Queries (Aug 2000) • Top queries by count: 3351 bearfacts, 3349 telebears, 1909 extension, 1874 schedule+of+classes, 1780 bearlink, 1737 bear+facts, 1468 decal, 1443 infobears, 1227 calendar, 989 career+center, 974 campus+map, 920 academic+calendar, 840 map, 773 bookstore, 741 class+pass, 738 housing, 721 tele-bears, 716 directory, 667 schedule, 627 recipes, 602 transcripts, 582 tuition, 577 seti, 563 registrar, 550 info+bears, 543 class+schedule, 470 financial+aid • Source: Ray Larson
Intranet Queries • Summary of sample data from 3 weeks of UCB queries • 13.2% Telebears/BearFacts/InfoBears/BearLink (12297) • 6.7% Schedule of classes or final exams (6222) • 5.4% Summer Session (5041) • 3.2% Extension (2932) • 3.1% Academic Calendar (2846) • 2.4% Directories (2202) • 1.7% Career Center (1588) • 1.7% Housing (1583) • 1.5% Map (1393) Source: Ray Larson
Types of Information Needs • Need an answer to a question (who won the Super Bowl?) • Re-find a particular document • Find a good recipe for tonight's dinner • Exploration of a new area (browse sites about Mexico City) • Authoritative summary of information (HIV review) • In most cases, only one interface! • Cell phone / PDA / camera / mp3 player analogy
The User in Information Access • [flow diagram: information need → find starting point → formulate/reformulate query → send to system → receive results → explore results → done? (no: reformulate; yes: stop)]
Find Starting Point By Browsing • [diagram: from an entry point, the user navigates across linked nodes to reach a starting point for search (or the answer?)]
Hierarchical browsing • [diagram: a tree of categories at Level 0, Level 1, Level 2]
Scatter/Gather • Scatter/Gather allows the user to find a set of documents of interest through browsing. • It iterates: • Scatter: take the collection and scatter it into n clusters. • Gather: pick the clusters of interest and merge them.
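A minimal sketch of one scatter/gather iteration, assuming scikit-learn and using plain k-means as the clustering step (the original system used faster, specialized clustering algorithms); the documents and the user's cluster choice are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "mexico city travel guide", "cheap flights to mexico city",
    "chocolate cake recipe", "easy dinner recipes",
    "super bowl results and scores", "nba playoff schedule",
]

def scatter(documents, n_clusters):
    """Scatter: cluster the current collection into n clusters."""
    X = TfidfVectorizer().fit_transform(documents)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    clusters = {}
    for doc, label in zip(documents, labels):
        clusters.setdefault(label, []).append(doc)
    return clusters

def gather(clusters, chosen):
    """Gather: merge the clusters the user picked into a new sub-collection."""
    return [doc for label in chosen for doc in clusters[label]]

clusters = scatter(docs, n_clusters=3)       # system scatters the collection
subset = gather(clusters, chosen=[0])        # user picks the cluster(s) of interest
# The loop then repeats: scatter(subset, ...) until a manageable set remains.
```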
Browsing vs. Searching • Browsing and searching are often interleaved. • Information need dependent • Open-ended (find information about Mexico City) -> browsing • Specific (who won the Super Bowl?) -> searching • User dependent • Some users prefer searching, others browsing (confirmed in many studies: some hate to type) • Advantage of browsing: You don't need to know the vocabulary of the collection • Compare to the physical world • Browsing vs. searching in a grocery store
Browsers vs. Searchers • 1/3 of users do not search at all • 1/3 rarely search, or type URLs only • Only 1/3 understand the concept of search • (ISP data from 2000) Why?
Starting Points • Methods for finding a starting point • Select collections from a list • Highwire press • Google! • Hierarchical browsing, directories • Visual browsing • Hyperbolic tree • Themescape, Kohonen maps • Browsing vs searching
The User in Information Access • [flow diagram: information need → find starting point → formulate/reformulate query → send to system → receive results → explore results → done? (no: reformulate; yes: stop)]
Form-based Query Specification (Infoseek) Credit: Marti Hearst
Boolean Queries • Boolean logic is difficult for the average user. • Some interfaces for average users support formulation of Boolean queries • The current view is that non-expert users are best served with non-Boolean or simple +/- Boolean (pioneered by AltaVista). • But Boolean queries are the standard for certain groups of expert users (e.g., lawyers).
Direct Manipulation Spec. VQUERY (Jones 98) Credit: Marti Hearst
One Problem With Boolean Queries: Feast or Famine • Specifying a well-targeted query is hard; this is a bigger problem for Boolean queries. • Google: 1860 hits for "standard user dlink 650", but 0 hits after adding "no card found" • [diagram: result count swings from feast to famine depending on how general the query is]
Boolean Queries • Summary • Complex Boolean queries are difficult for the average user • Feast-or-famine problem • Prior to Google, many IR researchers thought Boolean queries were a bad idea. • Google queries are strict conjunctions. • Why is this working well?
Parametric search example Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.
Parametric search example We can add text search.
Parametric search • Each document has, in addition to text, some “meta-data” e.g., Make, Model, City, Color • A parametric search interface allows the user to combine a full-text query with selections on these parameters
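A minimal sketch of the idea, with made-up car listings: filter on the metadata fields first, then rank the survivors with a trivial full-text match.

```python
cars = [
    {"make": "Honda", "model": "Civic", "city": "Palo Alto", "color": "blue",
     "text": "well maintained, one owner, low mileage"},
    {"make": "Toyota", "model": "Corolla", "city": "Berkeley", "color": "red",
     "text": "new tires, recently serviced, low mileage"},
    {"make": "Honda", "model": "Accord", "city": "Berkeley", "color": "blue",
     "text": "minor scratches, high mileage"},
]

def parametric_search(docs, query_terms, **filters):
    # Keep only documents whose metadata matches every selected parameter.
    hits = [d for d in docs if all(d.get(k) == v for k, v in filters.items())]
    # Rank the survivors by how many query terms appear in their free text.
    return sorted(hits, key=lambda d: -sum(t in d["text"] for t in query_terms))

for doc in parametric_search(cars, ["low", "mileage"], make="Honda", color="blue"):
    print(doc["model"], doc["city"])
```

A real system would index the metadata fields and use a proper text index, but the combination of parameter selections with a free-text query is the same.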
Re/Formulate Query • Single text box (Google, Stanford intranet) • Command-based (Socrates) • Boolean queries • Parametric search • Term browsing • Other methods • Relevance feedback • Query expansion • Spelling correction • Natural language, question answering
The User in Information Access • [flow diagram: information need → find starting point → formulate/reformulate query → send to system → receive results → explore results → done? (no: reformulate; yes: stop)]
Category Labels to Support Exploration • Example: • ODP categories on Google • Advantages: • Interpretable • Capture summary information • Describe multiple facets of content • Domain dependent, and so descriptive • Disadvantages • Domain dependent, so costly to acquire • May mismatch users' interests Credit: Marti Hearst