Data Mining, Search and Other Stuff Amos Fiat Tel Aviv University Mostly based on joint work with Azar, Karlin, McSherry and Saia (STOC 2001), and Achlioptas, Karlin and McSherry (FOCS 2001)
What this talk is about • Introduce Data Mining and specific problems: • Document Classification • Collaborative Filtering • Web Search • Describe LSA • Provable probabilistic generative models: • Papadimitriou et al. • Generalizations, capturing document indexing and other problems (collaborative filtering) • Web search: • Google, HITS • New Web search algorithm (Smarty Pants) • Generative model from which Smarty Pants is derived • Sketch of proof
What is Data Mining? • First SIAM International conference on Data Mining, April 5-7, 2001 (from Call for Papers): • Advances in information technology and data collection methods have led to the availability of large data sets in commercial enterprises and in a wide variety of scientific and engineering disciplines… • …The field of data mining draws upon extensive work in areas such as statistics, machine learning, pattern recognition, databases, and high performance computing to discover interesting and previously unknown information in data sets…
From SIAM CFP: Topics of Interest: • Methods and Paradigms: • … • Mining high-dimensional data … • Collaborative filtering … • Data cleaning and pre-processing … • Applications • … • Web data … • Financial and e-commerce data … • Text, document, and multimedia data … • Human Factors and Social Issues …
Example: Document Classification / Search / Similarity • Classify documents in some meaningful way • Find documents by search terms, find similar documents
Example: Collaborative Filtering • Gather information on Supermarket purchases • Make good recommendations to customer at checkout. Good = likely to purchase • With/Without customer identification
2nd Example: Collaborative Filtering • Movie recommendations • User gives some input on movies he/she likes/dislikes • User does not know grade for movies not yet seen
Example: Web Search / Scientific Citation Search • Classify documents in some meaningful way • Find documents by search terms, find similar documents • Find High Quality documents
Latent Semantic Analysis • Deerwester, Dumais, Landauer, Furnas, Harshman, Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990. • Supposed to work rather well in practice. See http://lsa.colorado.edu - Latent Semantic Analysis@CU Boulder
Idea of LSA • Embed corpus of documents into a low-dimensional “semantic” space by computing a good low-rank approximation to the term-document matrix. • Compute document/keyword relationships in this low-dimensional space. • Intuition: forcing low-rank representation maintains only usage patterns that correspond to strong linear trends. • Every action must be due to one or other of seven causes: chance, nature, compulsion, habit, reasoning, anger, or appetite • Aristotle, Rhetoric, Bk. II
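As a concrete illustration of the idea, here is a minimal LSA sketch in NumPy on a made-up 5-term, 4-document corpus (the terms, the counts, and the choice of rank 2 are our own toy assumptions, not data from the talk):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Docs 0-1 are about cars, docs 2-3 about cooking; the term "oil"
# is polysemous and appears in both blocks.
A = np.array([
    [3, 2, 0, 0],   # "engine"
    [2, 3, 0, 0],   # "wheel"
    [1, 1, 2, 2],   # "oil"
    [0, 0, 3, 2],   # "recipe"
    [0, 0, 2, 3],   # "bake"
], dtype=float)

# Rank-2 approximation via SVD: keep only the two strongest linear trends.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents embedded in the k-dimensional "semantic" space.
doc_embed = Vt[:k, :].T  # one row per document

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(doc_embed[0], doc_embed[1]))  # same topic: near 1
print(cos(doc_embed[0], doc_embed[2]))  # different topics: near 0
```

Documents on the same topic end up nearly parallel in the 2-dimensional semantic space, while cross-topic pairs come out nearly orthogonal, even though the polysemous term "oil" links the two blocks.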
Let’s prove that LSA works • What’s there to prove? • Papadimitriou, Raghavan, Tamaki, Vempala, Latent Semantic Indexing: A Probabilistic Analysis, PODS, 1997 • Introduce a probabilistic generative model for documents. Real documents are an instantiation of this probabilistic process • LSA effectively reconstructs the probabilistic model (if the model is very very simple – a block matrix). If you know all the probabilistic parameters used to generate documents, classification, similarity, etc. are obvious. • Our contribution (STOC 2001): • very very simple -> simple • Block matrix -> arbitrary matrix
Our (somewhat) More General Model • Entries m_ij are generated from arbitrary (unknown) distributions with bounded deviation • Errors are introduced, entries are omitted • Can be used to model documents/terms, customers/preferences, web sites/links and web sites/terms (We will use yet another model for the web) • Theorem: Given A*, we can compute the expected values of the m_ij's, under certain conditions.
Supermarket Collaborative Filtering • Build an n x m 0/1 matrix with one row per cart and one column per product • Place m/(m-r_i) in entry i,j if cart i contained product j, 0 otherwise. • r_i = number of items in cart i • Add a last row for the current cart. • Take the SVD of the matrix, discard all singular values below a threshold • Read out current customer preferences from last row of the low rank matrix • Theorem: If customer preferences form a low rank matrix (with large singular values), then this algorithm is guaranteed to give approximately correct results.
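The steps above can be sketched directly in NumPy. Everything below (two customer types, 40 products, 200 carts, rank 2) is hypothetical toy data of our own, chosen purely to exercise the algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: two customer "types", each buying mostly from its own
# block of products (all parameters here are illustrative assumptions).
m = 40
types = np.full((2, m), 0.05)
types[0, :10] = 0.8    # type 0 favours products 0..9
types[1, 30:] = 0.8    # type 1 favours products 30..39
labels = rng.integers(0, 2, size=200)
carts = (rng.random((200, m)) < types[labels]).astype(float)

# Current (partial) cart: a type-0 shopper who has scanned 3 items so far.
current = np.zeros(m)
current[[0, 1, 2]] = 1.0

# Scale row i by m/(m - r_i), r_i = #items in cart i, as on the slide.
X = np.vstack([carts, current])
r = X.sum(axis=1)
X = X * (m / (m - r))[:, None]

# Keep only the large singular values (a rank-2 approximation here).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Read the shopper's estimated preferences off the last row.
prefs = Xk[-1]
order = np.argsort(prefs)[::-1]
rec = next(j for j in order if current[j] == 0)
print(rec)   # expected: some product in type 0's block (3..9)
```

The low-rank projection fills in the unseen entries of the last row from the two dominant purchase patterns, so the top recommendation lands in the shopper's own block.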
Main Idea: • The matrix of user product preferences can be viewed as proportional to the matrix of expected cart contents, plus an additive “error” matrix. • The additive error matrix may be (relatively) large in (say) Frobenius norm. However, it will be small (with high probability) in terms of 2-norm. • Requires Furedi and Komlos' (and later Boppana's) result that a matrix of independent random variables has small 2-norm. • 2-norm: max{ |Mu| : |u|=1 } • Discarding the singular vectors with small singular values effectively removes the “error” contribution from the matrix.
Web Search Issues • What pages are relevant? • What pages are high quality? • Huge amount of research….
Google [Brin & Page] and HITS [Kleinberg] • Relevance: • Google: Documents are potentially relevant if they contain the search terms • HITS: Documents are potentially relevant if they are in a “neighborhood” of pages containing the search terms. • Quality: • Google: Universal query-independent measure of quality called PageRank; essentially “normalized popularity”. • HITS: Quality is a more complex function of the “associated documents”; compute an “authority” and a “hub” score for each page. Quality = authority.
Determining Quality: Google and HITS • Google (Simplified): Quality is derived from the quality of the pages linking into a page: Q(p) = Σ_{q→p} Q(q)/outdegree(q) • HITS: Quality is derived from a subset of “associated pages”, where every page has 2 quality measures: Authority quality A and Hub quality H • A(p) = Σ_{q→p} H(q) • H(p) = Σ_{p→q} A(q)
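Both quality computations reduce to short power iterations. A toy sketch on a 4-page web of our own invention:

```python
import numpy as np

# Tiny 4-page web; W[i, j] = 1 if page i links to page j (toy example).
W = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Simplified PageRank: Q(p) = sum over pages q linking to p of
# Q(q)/outdegree(q) -- the stationary distribution of a random walk
# on the link graph (restarts/damping omitted).
P = W / W.sum(axis=1, keepdims=True)    # row-stochastic transition matrix
q = np.full(4, 0.25)
for _ in range(200):
    q = q @ P
print(np.argmax(q))   # page 2, with the most in-links, ranks highest

# HITS: authority and hub scores reinforce each other.
a = np.ones(4)
h = np.ones(4)
for _ in range(100):
    h = W @ a            # H(p) = sum of A(q) over pages q that p links to
    a = W.T @ h          # A(p) = sum of H(q) over pages q linking to p
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)
print(np.argmax(a))      # page 2 is also the top authority
```

On this graph the two notions of quality agree (page 2), but in general PageRank's stationary distribution and HITS's singular-vector authority need not coincide; the rank-one case where they do is revisited later in the talk.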
Potential Issues with Google and HITS • Both may have problems with • Polysemy: “Bug” could be an insect, a software problem, a listening device, to bother someone, etc. • Synonymy: “Terrorist” and “Freedom Fighter” refer to the same thing. • What is the basis for the heuristic used to choose “associated pages” in HITS ? • Does it make sense to determine quality based on contributions from pages on irrelevant topics, as Google does? Key question: what are the mathematical conditions under which these algorithms work?
Key Questions • What would the web have to look like for these algorithms to give the “right” answer? • What are the mathematical conditions under which these algorithms work? Preview of Answer: if the web is rank 1.
The Rest of this Talk • A new (entirely untested) web search algorithm: Smarty Pants • A unified mathematical model for web link structure, document content, and human generated queries • Proof that algorithm gives an approximately correct result given the model
Modeling the Web and a new Algorithm (Smarty Pants) • We define a common probabilistic generative model for: • the link structure of the web • the term content of web documents • the query generation process • Each component is generated via “Our (somewhat) more general model” described earlier • Our algorithm is entirely derived from the model. • If the model describes reality, our algorithm is guaranteed to give the correct answer.
New Algorithm Inputs: • n, the total number of web pages • l, the total number of terms • W, the web graph, W(i,j) = #links from page i to page j • S, the document/term matrix, S(i,j) = #occurrences of term j in document i • q, the query vector q(j) = #occurrences of term j in query
Smarty Pants Query Independent Part: • Find a “good” low rank approximation to the matrix W, say Wr. • Find a “good” low rank approximation to the matrix M = (W^T|S), say Mm. • Compute the pseudo-inverse of Mm, say Mm^-1.
Smarty Pants Query Independent Part (via SVD): • Compute the Singular Value Decompositions of the matrices M and W • Let m be such that the discarded singular values of M are small • Let r be such that the discarded singular values of W are small • Let Mm be the rank-m SVD approximation to M • Let Wr be the rank-r SVD approximation to W • Let Mm^-1 be the pseudo-inverse of Mm
Smarty Pants, cont. Query Dependent Part: • Let q be the characteristic vector of the query, q(i) = #occurrences of term i in query • q'^T = [0^n | q^T] • Compute w = q'^T Mm^-1 Wr • Output pages p in order of decreasing w(p)
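Putting the two parts together, a shapes-only sketch in NumPy (W, S and q below are random placeholders of our own, not model-generated data, so the scores themselves are meaningless; the point is just the sequence of operations):

```python
import numpy as np

# Stand-in inputs; real ones would be crawled from the web.
n, l = 50, 30          # pages, terms
r, m = 3, 5            # chosen approximation ranks
rng = np.random.default_rng(1)
W = rng.random((n, n))                  # W(i, j): #links from page i to j
S = rng.random((n, l))                  # S(i, j): #occurrences of term j in doc i
q = rng.random(l)                       # q(j): #occurrences of term j in query

def rank_approx(X, k):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Query-independent part.
Wr = rank_approx(W, r)                  # rank-r approximation of W
M = np.hstack([W.T, S])                 # M = (W^T | S), an n x (n + l) matrix
Mm = rank_approx(M, m)
Mm_pinv = np.linalg.pinv(Mm, rcond=1e-6)  # drop numerically-zero singular values

# Query-dependent part.
qp = np.concatenate([np.zeros(n), q])   # q'^T = [0^n | q^T]
w = qp @ Mm_pinv @ Wr                   # one authority score per page
ranking = np.argsort(w)[::-1]
print(w.shape, ranking[:5])
```

Note the `rcond` cutoff: Mm has numerical rank m, and the pseudo-inverse must not invert its roundoff-level trailing singular values.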
An Alternative View • This algorithm is provably equivalent to: • Take the query vector and determine the topic the human searcher is interested in • Output the documents in order of their quality on the specific topic that the user is interested in • We call this synthesizing a perfect hub for the topic. • Topics, quality and hubs have not been defined. • In fact, it is provably impossible to determine the topic from the inputs available
Inspirations for Model • Latent Semantic Analysis [Deerwester et al] and models thereof [Papadimitriou et al] • PLSI [Hofmann] • PHITS [Cohn & Chang] • combined model of [Cohn & Hofmann] • the list goes on and on…
The Model: Concepts and Topics • There exist k fundamental concepts(latent semantic categories) that can be used to describe reality • How large k is and what the concepts mean is unknown • A topic is a k-tuple describing the relative proportion of the fundamental concepts in the topic • Two k-tuples that are scalar multiples of each other refer to the same topic
The Model: Web Pages • Every web page p has two k-tuples associated with it. • Its authority topic A(p) captures the content on which this page is an authority, and therefore influences incoming links to this page. • E.g., authority on Linux • Its hub topic H(p), i.e., the topic of the outgoing links • E.g., hub on Microsoft Bashing • H is the n by k matrix whose p-th row is H(p). • A is the n by k matrix whose p-th row is A(p).
The Model: Link Generation • Model assumes that the number of links from page p to page q is a random variable with expected value <H(p), A(q)> • Intuition: the more closely aligned the hub topic of page p is with the topic on which q is an authority, the more likely it is that there will be a link from p to q. • The web link matrix W is an instantiation of the probabilistic process defined by H A^T
Terms: Authority and Hub • Model allows general term distributions for any topic. • Model allows for possibility of different uses of terminology in an authoritative sense and in a hub sense. Example: • Hubs on Microsoft may refer to “Evil Empire” whereas few of Microsoft’s own sites will use this term. • For hub terminology, think: anchor text.
Terms and Topics Associated with each term t are two distributions: • Use as authoritative term, given by k-tuple SA(t): the i'th coordinate is the expected number of occurrences of term t in a pure authority on the i'th concept. • Use as hub term, given by k-tuple SH(t): the i'th coordinate is the expected number of occurrences of term t in a pure hub on the i'th concept. • SH (resp. SA) is the l by k matrix whose t-th row is SH(t) (resp. SA(t))
Document/Term Structure Terms on a page with authority topic A(p) and hub topic H(p) are generated from a distribution where Expected(#occurrences of term t in p) = <H(p), SH(t)> + <A(p), SA(t)> Document-term matrix S is an instantiation of the probabilistic process defined by the matrix H SH^T + A SA^T
The Model: Query Generation • The searcher chooses a k-tuple v representing the topic he wants to search, and computes q'^T = v^T SH^T • q'[u] is the expected number of occurrences of term u in a pure hub page on topic v. • The searcher decides whether or not to include term u among the search terms by sampling from a distribution with expectation q'[u] • Result: a query q which is an instantiation of this random process.
A Perfect Hub: • The correct search results are the pages ordered by their authoritativeness on topic v • w = v^T A^T gives the relative authority of all pages on topic v
Model Summary • Documents have 2 k-tuples, one for the topic on which the document is an authority, one for the topic on which the document is a hub • Terms also have 2 k-tuples, one for the use of the term in an authoritative context and one for the use of the term in a hub context • Humans generate queries by first choosing a k-tuple representing the topic of the query, and then choosing search terms using the hub term dist’n for the topic. • The correct answer is now well defined: it’s the sites ordered by authoritativeness on the topic, i.e., a perfect hub • The real web (links and content) and real queries are derived by an instantiation of the probabilistic model
There exist: H, A, SH, SA, v (web hub topics, web authority topics, hub term dist'ns, authority term dist'ns, query topic) such that: • Link structure W: instantiation of H A^T = E(W) • Doc-term S: instantiation of H SH^T + A SA^T = E(S) • Query q: instantiation of v^T SH^T = E(q) Goal: Given W, S, q, we want to compute a good approximation to v^T A^T, the vector of authoritativeness of pages on topic v.
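The whole pipeline can be exercised on a synthetic instantiation of the model. All parameters below are our own toy assumptions (the talk does not fix the distributions; we use Poisson, whose deviation is well-behaved at these means):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model parameters: k = 2 concepts; all scales are our own choices.
n, l, k = 300, 200, 2
H  = rng.random((n, k)) * 5       # hub topics
A  = rng.random((n, k)) * 5       # authority topics
SH = rng.random((l, k)) * 5       # hub term distributions
SA = rng.random((l, k)) * 5       # authority term distributions
v  = np.array([1.0, 0.0])         # searcher's (hidden) query topic

# Instantiate W, S and q from their expectations, as the model prescribes.
W = rng.poisson(H @ A.T).astype(float)              # E(W) = H A^T
S = rng.poisson(H @ SH.T + A @ SA.T).astype(float)  # E(S) = H SH^T + A SA^T
q = rng.poisson(20 * SH @ v).astype(float)          # E(q) = 20 * v^T SH^T

def svd_trunc(X, r):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]

# Smarty Pants with r = k and m = 2k (= rank of E(W) and of E(M)).
Uw, sw, Vtw = svd_trunc(W, k)
Wr = Uw @ np.diag(sw) @ Vtw
Um, sm, Vtm = svd_trunc(np.hstack([W.T, S]), 2 * k)
Mm_pinv = Vtm.T @ np.diag(1 / sm) @ Um.T            # pseudo-inverse of Mm

w_hat = np.concatenate([np.zeros(n), q]) @ Mm_pinv @ Wr
w_true = A @ v                     # the "perfect hub" answer, v^T A^T

# The two score vectors should rank pages almost identically.
corr = np.corrcoef(w_hat, w_true)[0, 1]
print(round(corr, 3))
```

In runs of this kind the computed scores correlate strongly with the true authority vector v^T A^T (the query's scale factor does not affect the ranking), which is what the Main Theorem promises.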
About the Model • The model is fairly general • This is an advantage, not a disadvantage • The more powerful the model, the greater the flexibility in using the model to approximate reality • If reality is indeed simpler than the full generality that the model allows, then the results still hold, don’t need to use the full flexibility of the model (e.g., the case H=A is particularly easy to deal with).
Main Theorem If the Web link structure W and document-term matrix S are generated according to the model, and some other technical conditions hold, then w.h.p. for any query q generated according to the model, with sufficiently many terms, our algorithm produces an answer w' (= q'^T Mm^-1 Wr) such that for 1-o(1) of the entries, the correct answer is produced up to lower order terms, i.e., |w'(i)-w(i)| = o(|w(i)|), where w (= v^T A^T) is the correct answer to the query.
Sufficiently Many Terms? • We assume that k, the number of fundamental concepts, is a constant. • The number of query terms required to guarantee success with high probability depends on k and the singular values of M = (W^T|S) and W. • If the singular values are sufficiently high then we only require a constant number of terms in the query. • For example, if they are Zipf-distributed, i.e., the i'th singular value of the Web is proportional to n/i, then we only need a constant number of terms. • The faster the singular values drop, the worse our guarantee, but the algorithm still works for a wide range (with ever increasing query term requirements).
Proof Techniques: • Key idea #1: Instantiation of a random variable can be viewed as an additive error process: E(W) = H A^T, W = H A^T + Error • Key idea #2: In many cases, the effect of a random error can be estimated (as a function of various spectral properties of the underlying matrices) • Incredibly useful: A' = A + E, where A is rank k with large singular values => A'_k, the best rank k approximation to A', is essentially equal to A
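Key ideas #1 and #2 are easy to see numerically; a toy demonstration (sizes and distributions are our own choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(3)

# A: rank-k matrix with large singular values; E: dense random noise.
n, k = 400, 3
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))
E = rng.standard_normal((n, n))

# E is huge in Frobenius norm but small in 2-norm (Furedi-Komlos/Boppana:
# an n x n matrix of independent random variables has 2-norm O(sqrt(n))).
print(round(np.linalg.norm(E, 'fro')), round(np.linalg.norm(E, 2)))

# Best rank-k approximation of A + E essentially recovers A.
U, s, Vt = np.linalg.svd(A + E, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err_before = np.linalg.norm(E, 'fro') / np.linalg.norm(A, 'fro')
err_after = np.linalg.norm(Ak - A, 'fro') / np.linalg.norm(A, 'fro')
print(err_after < err_before)   # truncating the SVD strips most of the noise
```

The relative error of the truncated SVD is a small fraction of the raw noise level, because only the noise components falling inside the k retained directions survive.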
Proof Techniques (Cont.) • Key idea #3: • Forget that the model has: • real web link structure derived via a random process, • real document content derived via a random process, • real query derived via a random process. • Imagine that we had the original (non-instantiated) Expected(web matrix), Expected(document/term matrix), and Expected(query vector) • Key idea #4: Pray that the errors introduced by this “forgetfulness” are amenable to analysis and some magical matrix perturbation theorems from Stewart and Sun can be applied.
Proof Techniques – Synthesizing a Perfect Hub • So, imagine that we have: • Expected(web matrix) = E(W) = H A^T • Expected(document/term matrix) = E(S) = H SH^T + A SA^T • Expected(query) = v^T SH^T • E(M) = (E(W^T)|E(S)) • What the algorithm tries to do is: • Find a linear combination of the rows of E(M) that gives (0^n | v^T SH^T). • Apply the same linear combination to the rows of E(W).
Proof Techniques – Synthesizing a Perfect Hub What? I.e., find a hub term distribution (with no authority content) giving the query distribution. Simultaneously derive the required perfect hub on the query topic.
Revisiting Google and HITS • Google's authority vector is the primary left eigenvector of the stochastic matrix representing a random walk on the web matrix W (ignoring periodic restarts). • HITS's authority vector is the primary right singular vector of the web matrix W (ignoring the “associated pages” issue), which is essentially the same as the primary right singular vector of the expected web matrix E(W) under our model when k=1. • But, if rank(E(W))=1, then the primary right singular vector of E(W) and the primary left eigenvector of the stochastic matrix associated with E(W) are one and the same. • I.e., for a rank one model, our algorithm, HITS, and Google all coincide.
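The rank-one claim can be checked directly: for E(W) = h a^T every row of the random-walk matrix equals a, normalized, so the stationary distribution, like the top right singular vector, is proportional to a (a toy sketch with numbers of our own):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
h = rng.random(n) + 0.1
a = rng.random(n) + 0.1
W = np.outer(h, a)            # rank-1 expected web matrix E(W) = h a^T

# Google-style: stationary distribution of the random walk on W.
P = W / W.sum(axis=1, keepdims=True)   # every row equals a / sum(a)
pi = np.full(n, 1 / n)
for _ in range(50):
    pi = pi @ P

# HITS-style: top right singular vector of W.
_, _, Vt = np.linalg.svd(W)
v1 = np.abs(Vt[0])            # abs() resolves the SVD sign ambiguity

# Both are proportional to a, hence give identical rankings.
print(np.allclose(pi / pi.sum(), a / a.sum()))   # True
print(np.allclose(v1 / v1.sum(), a / a.sum()))   # True
```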
A Few Obvious Limitations of the Model • All correlations are linear • Entries in various matrices are instantiated independently. • The inner product measure of quality means that more authoritative pages on a related topic may be chosen over less authoritative pages more closely aligned with the topic. We have a heuristic suggestion as to how to deal with this issue using a recursive approach. • The model simply disallows a page like the “50 Worst Computer Science Sites”. Such a site has a hub topic of Computer Science and therefore, by the model, will more likely point to good authorities in Computer Science.
Summary Use of generative probabilistic models to • Prove the correctness of algorithms (LSA) • Understand when algorithms work (Google, HITS) • Generate new provably correct algorithms: • Collaborative Filtering • Web Search: Smarty Pants • Generative models can have varying complexity; the stronger the better
Future Work • Test the algorithm….
Some Bibliography (Apologies Extended to all Omitted References) • Kleinberg, Authoritative Sources in a Hyperlinked Environment, JACM, 1999 • Page, Brin, Motwani, Winograd, The PageRank Citation Ranking: Bringing Order to the Web, 1998 • Brin, Page, The Anatomy of a Large-Scale Web Search Engine, 1998 • Chakrabarti, Dom, Gibson, Kleinberg, Kumar, Raghavan, Rajagopalan, Tomkins, Hypersearching the Web, Scientific American, June 1999. (Also Computer, 1998.) • Deerwester, Dumais, Landauer, Furnas, Harshman, Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 1990 • Papadimitriou, Raghavan, Tamaki, Vempala, Latent Semantic Indexing: A Probabilistic Analysis, PODS, 1997 • Azar, Fiat, Karlin, McSherry, Saia, Spectral Analysis of Data, STOC 2001 • Boppana, Eigenvalues and Graph Bisection, 28th FOCS, 1987 • Stewart and Sun, Matrix Perturbation Theory, 1990