Search Algorithms, Winter Semester 2004/2005
7th Lecture, 29 Nov 2004
Christian Schindelhauer, schindel@upb.de
Chapter III Organization 29 Nov 2004 Mid Term Exam
Mid Term Exam
• Wednesday, 8 Dec 2004, 1pm-1.45pm, F1.110
• 4 parts
  • 1. Short questions, testing general understanding
  • 2.-4. Show that you understand
    • Text search algorithms
    • Searching in Compressed Text
    • The Pagerank algorithm
• If you have successfully presented an exercise:
  • Only the best 3 of 4 parts count
• If you fail, or if you receive a bad grade
  • then the oral exam at the end of the semester will cover the complete lecture
• If you are happy with your grade
  • this grade counts for half of the grade of the complete lecture
  • provided you pass the oral exam (which then covers only the rest of the lecture)
Chapter II Searching in Compressed Text 15 Nov 2004
Searching in LZW-Codes: Inside a node
• Example: Search for tapioca
• [Figure: node texts abtapiocaab, abarb, blahblah]
• tapioca is “inside” a node
  • Then we have found tapioca
• For all nodes u of the trie:
  • Set Is_inside[u] = 1 if the text of u contains the pattern
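Searching in LZW-Codes: Torn apart
• Example: Search for tapioca
• Parts are hidden in some other nodes
• [Figure: the pattern is spread over the node texts abrastap, io and carasi]
  • Starting somewhere in a node (abrastap ends with tap)
  • All parts are nodes of the LZW-Trie (io)
  • The end is the start of another node (carasi starts with ca)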
Finding the start: longest_prefix
The Suffix of Nodes = Prefix of Patterns
• Is the suffix of the node a prefix of the pattern?
• And if yes, how long is it?
• Classify all nodes of the trie
• For very long text encoded by a node only the last m letters matter
• Can be computed using the KMP-Matcher algorithm while building the trie
• Example: Pattern: “manamana”
  • pamana, amanaplanacanalpamana: the last four letters are the first four of the pattern; result: 4
  • mama: the length of the suffix of the node which is a prefix of the pattern is 2
  • papa: result: 0
  • mana: result: 4
  • [Figure: trie that also contains the node texts m and amanaplanacanalpamanam]
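The classification above can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the function names `failure_function` and `longest_prefix` are chosen here, and the sketch scans a complete node text, whereas in the trie each node extends its parent by one letter, so the matcher state could be carried from parent to child.

```python
def failure_function(pattern):
    """KMP failure table: fail[k] = length of the longest proper border of pattern[:k]."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    return fail


def longest_prefix(node_text, pattern):
    """Length of the longest suffix of node_text that is a prefix of pattern.

    Assumes a non-empty pattern (illustrative helper, not from the slides).
    """
    fail = failure_function(pattern)
    state = 0  # number of pattern letters currently matched
    for ch in node_text:
        if state == len(pattern):
            state = fail[state]  # full match: fall back before reading on
        while state > 0 and ch != pattern[state]:
            state = fail[state]
        if ch == pattern[state]:
            state += 1
    return state


print(longest_prefix("pamana", "manamana"))
print(longest_prefix("mama", "manamana"))
```

Only the last m letters of a node text influence the final state, which matches the remark on the slide.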
Finding the End: longest_suffix
Prefix of the Node = Suffix of the Pattern
• Is the prefix of the node a suffix of the pattern?
• And if yes, does it complete the pattern, if already i letters were found?
• Classify all nodes of the trie
• For very long text encoded by a node only the first m letters matter
• Since the text is added at the right side this property can be derived from the ancestor
• Example: Pattern: “manamana”
  • ananimal: here 3 and 1 could be the solution. We take 3, because 1 can be derived from 3 using the technique shown in the KMP-Matcher (applied to the reversed string)
  • manamanamana: result: 8
  • panamacanal: result: 0
  • [Figure: trie that also contains long node texts such as manammanaaaaaaaaaaaa]
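A small Python sketch of this classification, assuming a direct comparison instead of the reversed KMP technique mentioned on the slide. The names `suffix_candidates` and `longest_suffix` are hypothetical helpers introduced for illustration.

```python
def suffix_candidates(node_text, pattern):
    """All k (longest first) such that the first k letters of node_text
    equal the last k letters of pattern. Direct comparison, O(m^2) time."""
    m = len(pattern)
    limit = min(m, len(node_text))  # only the first m letters matter
    return [k for k in range(limit, 0, -1) if node_text[:k] == pattern[m - k:]]


def longest_suffix(node_text, pattern):
    """Longest prefix of node_text that is a suffix of pattern (0 if none)."""
    candidates = suffix_candidates(node_text, pattern)
    return candidates[0] if candidates else 0


print(suffix_candidates("ananimal", "manamana"))
print(longest_suffix("manamanamana", "manamana"))
```

For ananimal the candidate list contains both solutions named on the slide; storing only the longest one suffices because the shorter ones are borders of it.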
How does it fit?
• On the left side we have the maximum prefix of the pattern
• On the right side we have the maximum suffix of the pattern
• Example: 10 letter pattern pamapamapa
  • 8 letter prefix found, 6 letter suffix found: 14 letters?
  • The pattern is inside nevertheless, since the first 8 letters followed by the last 6 letters of the pattern contain the pattern
• Solution: Define the prefix-suffix-table
  • PS-T(p,s) = 1 if the p-letter prefix of P followed by the s-letter suffix of P contains the pattern
Computing the PS-Table in time $O(m^3)$
• For all p and s such that $p+s \ge m$ compute PS-T[p,s]
• Run the KMP-Matcher for pattern P in the text $P[1..p]\,P[m-s+1..m]$
  • needs time $O(m)$ for each combination of p and s
• Leads to a run time of $O(m^3)$
• Example: 10 letter pattern pamapamapa
  • [Figure: one node ends with xyzpamapama, the next node starts with pamapaxyz]
  • The pattern pamapamapa is found in the text pamapamapamapa, hence PS-T[8,6] = 1
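The cubic-time definition can be checked with a few lines of Python. This sketch uses Python's built-in substring test in place of the KMP-Matcher, and the function name `ps_table_bruteforce` is an assumption made here for illustration.

```python
def ps_table_bruteforce(pattern):
    """table[p][s] = 1 if the p-letter prefix of the pattern followed by its
    s-letter suffix contains the pattern, for 0 <= p, s <= m."""
    m = len(pattern)
    table = [[0] * (m + 1) for _ in range(m + 1)]
    for p in range(m + 1):
        for s in range(m + 1):
            if p + s >= m:  # shorter texts cannot contain the pattern
                text = pattern[:p] + pattern[m - s:]
                table[p][s] = 1 if pattern in text else 0
    return table


table = ps_table_bruteforce("pamapamapa")
print(table[8][6], table[8][2])
```

Such a brute-force table is mainly useful as a reference against which faster constructions can be compared.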
Computing the Prefix-Suffix-Table in Time $O(m^2)$: Preparation
• [Figure: pattern p a m a p a m a p a with pointers to earlier occurrences of its suffixes]
• ptr[i,j] = next position to the left of i where the suffix of P of length j occurs
• $\mathrm{ptr}[i,j] = \max\{\, k < i \mid P[m-j+1..m] = P[k..k+j-1] \ \text{or}\ k = 0 \,\}$
Computing the Prefix-Suffix-Table in time $O(m^2)$: Initialization
Init-ptr(P)
  m ← length(P)
  for i ← 1 to m do
    ptr[i,0] ← i-1
  od
  for j ← 1 to m-1 do
    last ← m-j+1
    i ← ptr[last+1,j-1]-1
    while i > 0 do
      if P[i]=P[last] then
        ptr[last,j] ← i
        last ← i
      fi
      i ← ptr[i+1,j-1]-1
    od
  od
• [Figure: pattern p a m a p a m a p a with the computed pointers]
• Run time: $O(m^2)$
Computing the Prefix-Suffix-Table in time $O(m^2)$
Init-PS-T(P)
  m ← length(P)
  ptr ← Init-ptr(P)
  for i ← 1 to m-1 do
    j ← i+1
    while j > 0 do
      PS-T[i,m-j+1] ← 1
      j ← ptr[j,m-i]
    od
  od
• Example (pattern p a m a p a m a p a, i = 8):
  • j = 9 gives PS-T[8,2]=1
  • j = ptr[9,2] = 5 gives PS-T[8,6]=1
  • j = ptr[5,2] = 1 gives PS-T[8,10]=1
• WRONG!! Problem: What happens if the search pattern does not start at the beginning? This case is not covered here!
Counter-Example
• Search pattern: b a a b a a a b a
• Init-PS-T(P) is the procedure from the previous slide
• Problem: What happens if the search pattern does not start at the beginning? This case is not covered here! WRONG!!
• Prefix length 6 (b a a b a a):
  • PS-T[6,3] = 1
  • PS-T[6,7] = 1
  • PS-T[6,6] = 1 would be correct (prefix length 6, suffix length 6: baabaa followed by baaaba contains baabaaaba)
  • Algorithm sets: PS-T[6,6] = 0
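Example of a Prefix-Suffix-Table
• Pattern bababbab, prefix babab (p = 5) combined with different suffixes:
  • babab + babbab contains bababbab
  • babab + bab contains bababbab
  • babab + abbab contains bababbab
  • babab + ababbab contains bababbab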
Computing the Prefix-Suffix-Table PS-T (revisited)
• $\mathrm{Pr}[i] := \max\{\, k \mid 1 \le k \le m-i+1 \ \text{and}\ P[i..i+k-1] = P[1..k] \,\}$ (0 if no such k exists)
  • Length of the longest prefix starting at P[i]
• $\mathrm{Su}[j] := \max\{\, k \mid 1 \le k \le j \ \text{and}\ P[j-k+1..j] = P[m-k+1..m] \,\}$ (0 if no such k exists)
  • Length of the longest suffix ending at P[j]
• Can be computed in time $O(m^2)$
• Example for the pattern b a b a b b a b:
  • Pr[1]=8 Pr[2]=0 Pr[3]=3 Pr[4]=0 Pr[5]=1 Pr[6]=3 Pr[7]=0 Pr[8]=1
  • Su[1]=1 Su[2]=0 Su[3]=3 Su[4]=0 Su[5]=3 Su[6]=1 Su[7]=0 Su[8]=8
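The two arrays can be computed by direct letter comparison, as in the following Python sketch. The function names `prefix_lengths` and `suffix_lengths` are chosen here; the lists are indexed with the 1-based positions used on the slide, with Pr[m+1] = Su[0] = 0 added as an assumption for the boundary cases.

```python
def prefix_lengths(pattern):
    """pr[i] for i = 1..m+1: length of the longest prefix of the pattern
    that starts at position i (1-based); pr[0] is unused, pr[m+1] = 0."""
    m = len(pattern)
    pr = [0] * (m + 2)
    for i in range(1, m + 1):
        k = 0
        while i + k <= m and pattern[i - 1 + k] == pattern[k]:
            k += 1
        pr[i] = k
    return pr


def suffix_lengths(pattern):
    """su[j] for j = 0..m: length of the longest suffix of the pattern
    that ends at position j (1-based); su[0] = 0."""
    m = len(pattern)
    su = [0] * (m + 1)
    for j in range(1, m + 1):
        k = 0
        while k < j and pattern[j - 1 - k] == pattern[m - 1 - k]:
            k += 1
        su[j] = k
    return su


print(prefix_lengths("bababbab")[1:9])
print(suffix_lengths("bababbab")[1:])
```

Both loops compare at most m letters per position, which gives the quadratic bound stated above.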
Computing the Prefix-Suffix-Table PS-T (revisited)
• Pr[i] := length of the longest prefix starting at P[i]
• Su[j] := length of the longest suffix ending at P[j]
For all i ∈ {1,..,m+1}, j ∈ {0,..,m} do
  if Pr[i]+Su[j] ≥ m then
    s ← i+m-Su[j]-1    /* = i-1+Pr[i]-(Pr[i]+Su[j]-m) */
    for k ← 0 to Su[j]+Pr[i]-m do
      PS-T[s+k,m-j+Su[j]-k] ← 1
    od
  fi
od
• Problem: Running time: $O(m^3)$
• Example (pattern b a b a b b a b): Pr[3]=3 (b a b), Su[8]=8 (b a b a b b a b)
  • Overlap: Pr[3]+Su[8]-8 = 3
  • Leads to 3+1 = 4 entries in PS-T: PS-T[2,8]=1, PS-T[3,7]=1, PS-T[4,6]=1, PS-T[5,5]=1
Computing the Prefix-Suffix-Table PS-T (revisited)
• Same definitions of Pr[i] and Su[j] and same algorithm as on the previous slide
• Problem: Running time: $O(m^3)$
• [Figure: the pattern with positions 1..m; the prefix of length Pr[i] starts at position i and ends before i+Pr[i]; the suffix of length Su[j] starts after j-Su[j] and ends at position j; overlap: Pr[i]+Su[j]-m; further labels: i+m-overlap and m-j+Su[j]]
Example of a Prefix-Suffix-Table (pattern bababbab)
• First diagram:
  -----++++++
  ------+++++
  -------++++
  --------+++
• Second diagram:
  -----++++++++
  ------+++++++
  -------++++++
  --------+++++
Computing the Prefix-Suffix-Table PS-T in time $O(m^2)$
• $O(m^3)$-algorithm:
  PS-T[0..m,0..m] ← 0
  For all i ∈ {1,..,m+1}, j ∈ {0,..,m} do
    if Pr[i]+Su[j] ≥ m then
      s ← i+m-Su[j]-1    /* = i-1+Pr[i]-(Pr[i]+Su[j]-m) */
      for k ← 0 to Su[j]+Pr[i]-m do
        PS-T[s+k,m-j+Su[j]-k] ← 1
      od
    fi
  od
• $O(m^2)$-algorithm, step 1: mark diagonals with the length of the series of 1s
  PS-T[0..m,0..m] ← -1
  For all i ∈ {1,..,m+1}, j ∈ {0,..,m} do
    if Pr[i]+Su[j] ≥ m then
      s ← i+m-Su[j]-1
      PS-T[s,m-j+Su[j]] ← max{PS-T[s,m-j+Su[j]], Su[j]+Pr[i]-m}
    fi
  od
• $O(m^2)$-algorithm, step 2: for all diagonals, fill up the diagonals
  for d ← 0 to 2m do
    z ← -1
    for i ← max{0,d-m} to min{d,m} do
      j ← d-i
      z ← max{z,PS-T[i,j]}
      if z ≥ 0 then
        PS-T[i,j] ← 1
        z ← z-1
      else
        PS-T[i,j] ← 0
      fi
    od
  od
• Run time $O(m^2)$
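The mark-and-fill idea can be sketched in Python as follows. This is an illustration under assumptions rather than the lecture's code: the function name `ps_table` is chosen here, the second index of a run's first cell is taken as m-j+Su[j] (the form used in the cubic algorithm), Pr[m+1] and Su[0] are set to 0, and the diagonal sweep is restricted to cells inside the table.

```python
def ps_table(pattern):
    """Prefix-suffix table via diagonal marking; table[p][s] for 0 <= p, s <= m."""
    m = len(pattern)
    pr = [0] * (m + 2)  # pr[i], i = 1..m+1 (1-based positions)
    su = [0] * (m + 1)  # su[j], j = 0..m
    for i in range(1, m + 1):
        k = 0
        while i + k <= m and pattern[i - 1 + k] == pattern[k]:
            k += 1
        pr[i] = k
    for j in range(1, m + 1):
        k = 0
        while k < j and pattern[j - 1 - k] == pattern[m - 1 - k]:
            k += 1
        su[j] = k

    # Step 1: store the overlap in the first cell of every run of 1s.
    table = [[-1] * (m + 1) for _ in range(m + 1)]
    for i in range(1, m + 2):
        for j in range(m + 1):
            overlap = pr[i] + su[j] - m
            if overlap >= 0:
                p = i + m - su[j] - 1
                s = m - j + su[j]
                table[p][s] = max(table[p][s], overlap)

    # Step 2: sweep each diagonal p + s = d and extend the marks into runs.
    for d in range(2 * m + 1):
        z = -1  # number of cells still to fill after the current one
        for p in range(max(0, d - m), min(d, m) + 1):
            s = d - p
            z = max(z, table[p][s])
            if z >= 0:
                table[p][s] = 1
                z -= 1
            else:
                table[p][s] = 0
    return table


print(ps_table("bababbab")[5][5], ps_table("baabaaaba")[6][6])
```

Each of the two steps touches every pair of indices once, so the work after computing Pr and Su is quadratic in m.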
Chapter III Searching the Web 29 Nov 2004
Searching the Web
• Introduction
• The Anatomy of a Search Engine
• Google’s Pagerank algorithm
  • The Simple Algorithm
  • Periodicity and convergence
• Kleinberg’s HITS algorithm
  • The algorithm
  • Convergence
• The Structure of the Web
  • Pareto distributions
  • Search in Pareto-distributed graphs
The Anatomy of a Web Search Engine
• “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page, Computer Networks and ISDN Systems, Vol. 30, 1-6, p. 107-117, 1998
• Design of the prototype
  • Stanford University 1998
• Key components:
  • Web Crawler
  • Indexer
  • Pagerank
  • Searcher
• Main difference between Google and other search engines (in 1998)
  • The Pagerank mechanism
Simplified PageRank Algorithm
• Rank of a web page: $R(u) \in [0,1]$
• Important pages hand their rank down to the pages they link to:
  $R(u) = c \sum_{v \in B_u} \frac{R(v)}{|F_v|}$
• c is a normalisation factor such that $\|R\|_1 = 1$, i.e. the sum of all page ranks adds up to 1
• Predecessor nodes $B_u$
• Successor nodes $F_u$
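Matrix representation
• $R \leftarrow c\,M\,R$, where R is the vector (R(1), R(2), ..., R(n)) and M denotes the following $n \times n$ matrix:
  $M_{u,v} = 1/|F_v|$ if page v links to page u (i.e. $v \in B_u$), and $M_{u,v} = 0$ otherwise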
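The iteration R ← c M R can be tried out on a small graph with the following Python sketch. The function name `simplified_pagerank`, the dictionary-based graph format, the uniform start vector and the fixed number of rounds are assumptions made for illustration; the sketch also assumes that not all rank is lost in sinks, since otherwise the normalisation would divide by zero.

```python
def simplified_pagerank(links, rounds=100):
    """links[v] is the list of pages that page v links to (the set F_v).

    Each round distributes R(v) evenly over the successors of v and then
    rescales so that the ranks sum to 1 (the factor c on the slide).
    """
    pages = sorted(links)
    rank = {u: 1.0 / len(pages) for u in pages}  # assumed start vector
    for _ in range(rounds):
        new_rank = {u: 0.0 for u in pages}
        for v in pages:
            for u in links[v]:
                new_rank[u] += rank[v] / len(links[v])
        total = sum(new_rank.values())
        rank = {u: r / total for u, r in new_rank.items()}  # c = 1 / total
    return rank


example = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(simplified_pagerank(example))
```

In this example every page has an outgoing link, so the normalisation factor stays at 1 and the ranks approach 0.4, 0.2 and 0.4 for a, b and c.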
The Simplified Pagerank Algorithm • Does it converge? • If it converges, does it converge to a single result? • Is the result reasonable?
The Eigenvector and Eigenvalue of the Matrix
• For a vector x, an $n \times n$ matrix M and a number λ:
  • If $M x = \lambda x$ (with $x \ne 0$) then x is called an eigenvector and λ an eigenvalue
• Every $n \times n$ matrix M has at most n eigenvalues
• Compute the eigenvalues by eigen-decomposition
  • $M x = \lambda x \iff (M - \lambda I)\, x = 0$, where I is the identity matrix
  • This equation has non-trivial solutions only if $\det(M - \lambda I) = 0$
  • This leads to a polynomial equation of degree n, which always has n (complex, not necessarily distinct) solutions $\lambda_1, \lambda_2, \ldots, \lambda_n$
    • (Fundamental theorem of algebra)
  • Solving the linear equations $(M - \lambda_i I)\, x = 0$ leads to the eigenvectors
• An eigenvector of the matrix is a fixed point of the recursion of the simplified Pagerank algorithm
Stochastic Matrices
• Consider n discrete states and a sequence of random variables $X_1, X_2, \ldots$ over this set of states
• The sequence $X_1, X_2, \ldots$ is a Markov chain if
  $P[X_{t+1} = j \mid X_1 = i_1, \ldots, X_t = i_t] = P[X_{t+1} = j \mid X_t = i_t]$
• A stochastic matrix M is the transition matrix of a finite Markov chain, also called a Markov matrix:
  • The elements of the matrix M must be real numbers in [0, 1]
  • The entries of every column of M sum to 1
• Observation for the matrix M of the simplified Pagerank algorithm
  • M is stochastic if all nodes have at least one outgoing link
The Random Surfer
• Consider the following algorithm
  • Start in a random web-page according to a probability distribution
  • Repeat the following for t rounds
    • If no link is on this page, exit and produce no output
    • Uniformly and randomly choose a link of the web-page
    • Follow that link and go to this web-page
  • Output the web-page
• Lemma: The probability that a web-page i is output by the random surfer after t rounds, started with probability distribution $x_1, \ldots, x_n$, is described by the i-th entry of the output of the simplified Pagerank algorithm iterated for t rounds without normalization.
• The proof follows by applying the definition of Markov chains
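The lemma can be illustrated by comparing two computations on a small graph: the unnormalised matrix iteration and an exhaustive enumeration of all surfer paths with their probabilities. Both function names and the example graph (including the sink d) are assumptions chosen for this sketch.

```python
def iterate_without_normalization(links, start, t):
    """Apply the simplified Pagerank step t times without rescaling."""
    dist = dict(start)
    for _ in range(t):
        nxt = {u: 0.0 for u in links}
        for v, prob in dist.items():
            for u in links[v]:
                nxt[u] += prob / len(links[v])
        dist = nxt
    return dist


def surfer_output_probabilities(links, start, t):
    """Exact output distribution of the random surfer, by enumerating paths."""
    out = {u: 0.0 for u in links}

    def walk(page, prob, rounds_left):
        if rounds_left == 0:
            out[page] += prob  # surfer outputs the current page
            return
        if not links[page]:
            return  # no link on this page: exit without output
        for nxt in links[page]:
            walk(nxt, prob / len(links[page]), rounds_left - 1)

    for page, prob in start.items():
        walk(page, prob, t)
    return out


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
uniform = {u: 0.25 for u in graph}
print(iterate_without_normalization(graph, uniform, 3))
print(surfer_output_probabilities(graph, uniform, 3))
```

The probabilities sum to less than 1 here because a surfer who starts in the sink d produces no output, which is why the lemma speaks of the iteration without normalization.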
Eigenvalues of Stochastic Matrices
• Notations
  • The L1-norm of a vector x is defined as $\|x\|_1 = \sum_{i=1}^{n} |x_i|$
  • $x \ge 0$, if for all i: $x_i \ge 0$
  • $x \le 0$, if for all i: $x_i \le 0$
• Lemma: For every stochastic matrix M and every vector x we have
  • $\|M x\|_1 \le \|x\|_1$
  • $\|M x\|_1 = \|x\|_1$, if $x \ge 0$ or $x \le 0$
  • Consequence: for the eigenvalues of M, $|\lambda_i| \le 1$
• Theorem: For every stochastic matrix M there is an eigenvector x with eigenvalue 1 such that $x \ge 0$ and $\|x\|_1 = 1$
Eigenvalues of Stochastic Matrices
• Lemma: For every stochastic matrix M and every vector x we have
  • $\|M x\|_1 \le \|x\|_1$
  • $\|M x\|_1 = \|x\|_1$, if $x \ge 0$ or $x \le 0$
• Proof (case $x \ge 0$ or $x \le 0$)
Eigenvalues of Stochastic Matrices
• Lemma: For every stochastic matrix M and every vector x we have
  • $\|M x\|_1 \le \|x\|_1$
  • $\|M x\|_1 = \|x\|_1$, if $x \ge 0$ or $x \le 0$
• Proof
  • Decompose x into two vectors $p \ge 0$ and $n \le 0$ with $x = p + n$
  • Use the triangle inequality $|a+b| \le |a| + |b|$, which holds for all norms
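The two proof steps can be written out as follows. This is a reconstruction under the assumptions stated for stochastic matrices (all entries $M_{ij} \ge 0$, every column of $M$ sums to 1); the index names and the choice $p_i = \max\{x_i, 0\}$, $n_i = \min\{x_i, 0\}$ are made here for illustration.

```latex
% Case x >= 0 (the case x <= 0 is symmetric): all inner sums are non-negative.
\begin{align*}
\|Mx\|_1
  &= \sum_{i=1}^{n} \Bigl| \sum_{j=1}^{n} M_{ij}\, x_j \Bigr|
   = \sum_{i=1}^{n} \sum_{j=1}^{n} M_{ij}\, x_j
   = \sum_{j=1}^{n} x_j \sum_{i=1}^{n} M_{ij}
   = \sum_{j=1}^{n} x_j
   = \|x\|_1 .
\end{align*}
% General case: x = p + n with p_i = max{x_i, 0} and n_i = min{x_i, 0},
% so p >= 0, n <= 0 and \|x\|_1 = \|p\|_1 + \|n\|_1.
\begin{align*}
\|Mx\|_1
   = \|Mp + Mn\|_1
   \le \|Mp\|_1 + \|Mn\|_1
   = \|p\|_1 + \|n\|_1
   = \|x\|_1 .
\end{align*}
```

The second chain uses the first case twice, once for $p$ and once for $n$.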
Periodic Matrices
• Definition
  • A square matrix M such that the matrix power $M^k = M$ for an integer $k \ge 2$ is called a periodic matrix.
  • If k is the least such integer, then the matrix is said to have period k-1.
  • If the period is 1, then $M^2 = M$ and M is called idempotent.
• Fact
  • For periodic matrices with period at least 2 there are vectors x such that $\lim_{k \to \infty} M^k x$ does not exist.
• Definition
  • The directed graph G=(V,E) of an $n \times n$ matrix M consists of the node set V={1,..., n} and has the edges
    • $E = \{(i,j) \mid M_{ij} \ne 0\}$
  • A path is a sequence of edges $(u_1,u_2),(u_2,u_3),(u_3,u_4),\ldots,(u_t,u_{t+1})$ of a graph
  • A graph cycle is a path where the start node is the end node
  • A strongly connected subgraph S is a minimum sub-graph such that every graph cycle starting and ending in a node of S is contained in S.
Necessary and Sufficient Conditions for Periodicity
• Theorem (necessary condition)
  • If the stochastic matrix M is periodic with periodicity $t \ge 2$, then for the graph G of M there exists a strongly connected subgraph S of at least two nodes such that every directed graph cycle within S has a length of the form $i \cdot t$ for a natural number i.
• Theorem (sufficient condition)
  • Let the graph consist of one strongly connected subgraph and
  • let $L_1, L_2, \ldots, L_m$ be the lengths of all directed graph cycles of length at most n
  • Then M is non-periodic if and only if $\gcd(L_1, L_2, \ldots, L_m) = 1$
• Notation:
  • $\gcd(L_1, L_2, \ldots, L_m)$ = greatest common divisor of the numbers $L_1, L_2, \ldots, L_m$
• Corollary
  • If the graph is strongly connected and there exists a graph cycle of length 1 (i.e. a loop), then M is non-periodic.
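The gcd of all cycle lengths of a strongly connected graph can be obtained without listing the cycles. The Python sketch below uses a breadth-first search instead of the cycle enumeration from the theorem: for every edge (u,v) the value dist[u]+1-dist[v] is a multiple of the gcd of the cycle lengths, and the gcd of these values equals it. The function name `cycle_gcd` and the graph format are assumptions made here, and the input is assumed to be strongly connected.

```python
from collections import deque
from math import gcd


def cycle_gcd(links, root):
    """gcd of the lengths of all directed cycles of a strongly connected graph.

    links[u] lists the successors of u; root is an arbitrary start node.
    """
    dist = {root: 0}
    queue = deque([root])
    while queue:  # breadth-first search from root
        u = queue.popleft()
        for v in links[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    g = 0
    for u in links:
        for v in links[u]:
            g = gcd(g, dist[u] + 1 - dist[v])
    return g


print(cycle_gcd({"a": ["b"], "b": ["a"]}, "a"))
print(cycle_gcd({"a": ["a", "b"], "b": ["a"]}, "a"))
```

A result of 1 corresponds to the non-periodic case of the theorem; the second example contains a loop and therefore illustrates the corollary.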
Disadvantages of the Simplified Pagerank-Algorithm
• The Web-graph has sinks, i.e. pages without links
  • M is not a stochastic matrix
• The Web-graph is periodic
  • Convergence is uncertain
• The Web-graph is not strongly connected
  • Several convergence vectors are possible
• Rank-sinks
  • Strongly connected subgraphs absorb all weight of the predecessors
  • All predecessors pointing to a web-page lose their weight.
The (non-simplified) Pagerank-Algorithm
• Add to a sink links to all web-pages
• Uniformly and randomly choose a web-page
• With some probability q < 1 perform a step of the simplified Pagerank algorithm
• With probability 1-q start with the first step (and choose a random web-page)
• Note: M is stochastic
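The steps above can be sketched in Python as follows. The function name `pagerank`, the graph format, the number of rounds and the value q = 0.85 are assumptions made for illustration; the slide only requires q < 1.

```python
def pagerank(links, q=0.85, rounds=100):
    """links[v] is the list of pages that page v links to.

    With probability q a step of the simplified algorithm is performed,
    with probability 1 - q the surfer restarts at a uniformly random page.
    A sink is treated as if it linked to all pages.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {u: 1.0 / n for u in pages}
    for _ in range(rounds):
        new_rank = {u: (1.0 - q) / n for u in pages}  # random restart
        for v in pages:
            targets = links[v] if links[v] else pages  # sink: link to all pages
            for u in targets:
                new_rank[u] += q * rank[v] / len(targets)
        rank = new_rank
    return rank


example = {"a": ["b"], "b": ["c"], "c": []}
print(pagerank(example))
```

No normalisation step is needed in this variant: because the modified matrix is stochastic, the ranks keep summing to 1 in every round.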
Properties of the Pagerank-Algorithm
• Graph of the matrix is strongly connected
• There are graph cycles of length 1
• Theorem: In non-periodic matrices of strongly connected graphs the Markov chain converges to a unique eigenvector with eigenvalue 1.
• PageRank converges to this unique eigenvector
Thanks for your attention
End of 7th lecture
• Next lecture: Mo 29 Nov 2004, 11.15 am, FU 116
• Next exercise class: Mo 29 Nov 2004, 1.15 pm, F0.530 or We 01 Dec 2004, 1.00 pm, E2.316