Keywords Selection Problem in Hidden Web Crawling
Ka Cheung Sia, Richard
March 15, 2004
Agenda
• What is Hidden Web?
• How to crawl the Hidden Web?
• Problem formalization
• Searching for “best” keyword
  • Greedy
  • Tree searching
  • Pruning
• Experiments & results
• Conclusion
What is Hidden Web?
• Hidden
  • Unreachable by following hyperlinks
  • Dynamically generated
  • Accessible only through a search interface
• Informative
• Examples
  • http://citeseer.ist.psu.edu/ – CS research papers
  • http://www.pubmed.org – medical research papers
  • http://catalog.loc.gov – Library of Congress
What is Hidden Web? • Search interface • http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1
What is Hidden Web? • Result
What is Hidden Web? • Document
How to crawl the Hidden Web
• http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1
• Our task: figure out a keyword
• [Diagram: Query → Hidden Web → Result]
Problem formalization
• Set cover
• Vertices – documents
• Hyper-edges – query words (each hyper-edge connects the documents that contain that word)
Goal • Maximize the number of unique documents retrieved with the minimum number of query words
Problem formalization
• $P(q_i)$
  • portion of unique documents retrieved by issuing query word $q_i$ (portion of documents containing $q_i$)
• $P(q_i \lor q_j)$
  • portion of unique documents retrieved by issuing query words $q_i$ and $q_j$ (portion of documents containing $q_i$ or $q_j$)
• $P(q_i \mid q_j)$
  • portion of documents containing $q_i$ in the set of documents retrieved by issuing query word $q_j$
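The three quantities above can be illustrated on a toy collection. This is a minimal sketch, assuming each document is modelled as a set of words; the data and the helper names (`DOCS`, `retrieved`, `p`, `p_or`, `p_given`) are hypothetical and not part of the original slides.

```python
# Toy collection (hypothetical data): each document is a set of words.
DOCS = [
    {"linux", "debian"},
    {"linux", "redhat"},
    {"windows"},
    {"linux"},
]

def retrieved(word, docs=DOCS):
    """Indices of the documents that contain `word`."""
    return {i for i, d in enumerate(docs) if word in d}

def p(word, docs=DOCS):
    """P(q): fraction of the collection that contains `word`."""
    return len(retrieved(word, docs)) / len(docs)

def p_or(w1, w2, docs=DOCS):
    """P(q1 v q2): fraction of the collection that contains w1 or w2."""
    return len(retrieved(w1, docs) | retrieved(w2, docs)) / len(docs)

def p_given(w1, w2, docs=DOCS):
    """P(q1 | q2): fraction of the documents retrieved by w2 that contain w1."""
    base = retrieved(w2, docs)
    return len(retrieved(w1, docs) & base) / len(base) if base else 0.0
```

In a real crawl only the conditional quantity is observable, because the crawler sees just the documents it has already retrieved.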
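Problem formalization
• What is the next “best” query word?
• $P((q_1 \lor \dots \lor q_{i-1}) \lor q_i)$
  $= P(q_1 \lor \dots \lor q_{i-1}) + P(q_i) - P((q_1 \lor \dots \lor q_{i-1}) \land q_i)$
  $= P(q_1 \lor \dots \lor q_{i-1}) + P(q_i) - P(q_1 \lor \dots \lor q_{i-1})\,P(q_i \mid q_1 \lor \dots \lor q_{i-1})$
• $P(q_1 \lor \dots \lor q_{i-1})$ – known
• $P(q_i \mid q_1 \lor \dots \lor q_{i-1})$ – known
• $P(q_i)$ – unknown
• Approximate $P(q_i)$ using $P(q_i \mid q_1 \lor \dots \lor q_{i-1})$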
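As a small worked illustration of the coverage formula, the sketch below evaluates it directly and then with the unknown P(qi) replaced by the observed conditional frequency, as the slide suggests. The function names and the numbers in the usage note are made up for the example.

```python
def coverage_after(p_base, p_q, p_q_given_base):
    """P(base v q) = P(base) + P(q) - P(base) * P(q | base)."""
    return p_base + p_q - p_base * p_q_given_base

def estimated_coverage(p_base, p_q_given_base):
    """Same formula, with the unknown P(q) approximated by P(q | base)."""
    return coverage_after(p_base, p_q_given_base, p_q_given_base)
```

For instance, with half of the collection already retrieved and a candidate word appearing in 20% of the retrieved documents, `estimated_coverage(0.5, 0.2)` gives 0.5 + 0.2 - 0.5 * 0.2, that is, about 0.6.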
Search for best query word
• Greedy: choose the word that occurs most frequently in the documents retrieved so far as the next query
• Choose $q_i$ with maximum $P(q_i \mid q_1 \lor \dots \lor q_{i-1})$
• For the set-cover problem, greedy is proven to be within a logarithmic factor of the optimal solution
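The greedy rule can be simulated against a local collection. The following is a sketch under stated assumptions: the hidden site is replaced by an in-memory list of word sets, a seed word is given, and `greedy_crawl` is a hypothetical helper name rather than the authors' implementation.

```python
from collections import Counter

def greedy_crawl(docs, seed, max_queries):
    """Simulate greedy keyword selection.

    docs: list of sets of words (a stand-in for the hidden site)
    seed: the first query word (assumed to be given)
    Returns the issued query words and the indices of the retrieved documents.
    """
    queries, retrieved = [], set()
    word = seed
    while word is not None and len(queries) < max_queries:
        queries.append(word)
        # "Issue" the query: collect every document containing the word.
        retrieved |= {i for i, d in enumerate(docs) if word in d}
        # Document frequency, within the retrieved set, of words not yet issued.
        freq = Counter(w for i in retrieved for w in docs[i] if w not in queries)
        # Next query: the most frequent unseen word, i.e. maximum P(q | retrieved).
        word = freq.most_common(1)[0][0] if freq else None
    return queries, retrieved
```

Ties between equally frequent words are broken arbitrarily in this sketch; a real crawler would need an explicit tie-breaking rule.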
Search for best query word
• Can we do better?
• Intuition
  • Correlation of keywords
  • E.g. “linux” is correlated with “debian”, “redhat”, “suse”, “knoppix”, “fedora”, etc.
  • We might save the query word “linux”!
Search for best query word
• [Figure: Venn diagram of the whole document collection, the already retrieved documents, and the documents retrieved by $q_i$, $q_j$, and $q_k$]
Search for best query word
• [Figure: search tree with nodes linux, debian, redhat]
• f(x) = number of documents we get by issuing the queries linux, debian, redhat, minus the overlap between “redhat, linux”, “debian, linux”, and “redhat, debian”
Search for best query word
• The search tree is huge (branching factor)
• We look ahead for the 10 most frequent keywords
• We only search up to depth 6
• Pruning
Search for best query word
• DFBnB (depth-first branch and bound)
• Prune a sub-tree when the sum of documents it could retrieve, assuming no overlap between keywords, is less than the current best solution
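The pruning rule can be sketched as a small depth-first branch-and-bound search. This is an illustration under several assumptions, not the authors' code: the document set of every candidate keyword is assumed to be known in advance, keyword sets are treated as unordered (so fewer nodes are enumerated than in an ordered search tree), and `dfbnb` is a hypothetical name.

```python
def dfbnb(candidates, depth):
    """Depth-first branch and bound over sets of at most `depth` keywords.

    candidates: dict mapping word -> set of document ids it retrieves
                (assumed known here; a real crawler would have to estimate it)
    Returns (best_words, best_cover_size, nodes_examined).
    """
    # Largest sets first, so the optimistic bound is a simple slice sum.
    words = sorted(candidates, key=lambda w: len(candidates[w]), reverse=True)
    sizes = [len(candidates[w]) for w in words]
    best = {"words": (), "size": 0}
    nodes = 0

    def visit(start, chosen, covered):
        nonlocal nodes
        nodes += 1
        if len(covered) > best["size"]:
            best["words"], best["size"] = tuple(chosen), len(covered)
        remaining = depth - len(chosen)
        if remaining == 0:
            return
        for i in range(start, len(words)):
            # Optimistic bound: the next `remaining` largest sets, no overlap.
            bound = len(covered) + sum(sizes[i:i + remaining])
            if bound <= best["size"]:
                break  # later words are no larger, so they cannot do better
            chosen.append(words[i])
            visit(i + 1, chosen, covered | candidates[words[i]])
            chosen.pop()

    visit(0, [], set())
    return best["words"], best["size"], nodes
```

On a toy input such as `{"linux": {1, 2, 3, 4}, "debian": {1, 2}, "redhat": {3, 4}, "suse": {5}}` with `depth=2`, the search settles on "linux" plus "suse" and skips the sub-trees rooted at "debian" and "redhat", since even without overlap they could not beat the best cover found so far.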
Experiment
• Document collection: ~100K front pages of randomly selected websites
• Query interface: an inverted index (a program that returns documents containing the given query word)
• Methods
  • Greedy
  • DFS search (look ahead for 10 words, up to depth 6)
  • DFS search with pruning (DFBnB)
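An inverted index of the kind used as the query interface can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenisation and lower-casing; the function names and the sample pages are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> text. Returns word -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def query(index, word):
    """Return the ids of the documents that contain the given query word."""
    return index.get(word.lower(), set())

pages = {1: "Linux news", 2: "Debian Linux", 3: "privacy policy"}
idx = build_inverted_index(pages)
```

Here `query(idx, "linux")` returns the ids of the two pages that mention the word, which mimics what a search interface would return for a single-keyword query.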
Results
• Sequence 1 (each keyword followed by its count): provide 51, work 159, privacy 144, years 172, world 344, list 205, info 1467, map 184, want 57, order 87, people 85, read 56, main 2270, high 95, designed 240, latest 36, events 132, looking 46, send 80, right 380, enter 1285, local 77, browser 1216, questions 77, real 77
• Sequence 2 (each keyword followed by its count): provide 51, work 159, privacy 144, years 172, read 101, main 2364, designed 291, info 1455, latest 53, looking 60, send 101, right 402, local 99, world 239, list 142, map 150, want 42, order 69, people 67, high 85, events 126, questions 85, enter 1272, browser 1216, real 77
• Does searching help?
Results • Does searching help?
Results
• How much does pruning save?
• Without pruning – 187300 nodes are examined
  • 187300 = (10) + (10*9) + (10*9*8) + (10*9*8*7) + (10*9*8*7*6) + (10*9*8*7*6*5)
• With pruning – 5558 nodes are examined on average (when we choose the most frequent keyword to expand)
• DFBnB examines about 30 times fewer nodes
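The node count without pruning is a sum of partial permutations, which can be checked quickly. The sketch below assumes Python 3.8 or later for `math.perm`; the function name is hypothetical.

```python
from math import perm

def nodes_without_pruning(branching=10, depth=6):
    """Number of nodes at depths 1..depth when no keyword repeats on a path."""
    return sum(perm(branching, d) for d in range(1, depth + 1))

total = nodes_without_pruning()   # 10 + 10*9 + ... + 10*9*8*7*6*5
ratio = total / 5558              # 5558 is the average reported with pruning
```

With the figures on the slide, `total` is 187300 and `ratio` comes out at roughly 33.7, consistent with the "about 30 times" saving.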
Conclusion
• Searching helps little “in this problem”
• DFBnB is “really effective” in pruning search tree
More results • Prior information helps
Search for best query word
• $\text{base} = q_1 \lor \dots \lor q_i$
• $P(\text{base} \lor q_{i+1} \lor q_{i+2})$
  $= P(\text{base} \lor q_{i+1}) + P(q_{i+2}) - P((\text{base} \lor q_{i+1}) \land q_{i+2})$
• $P((\text{base} \lor q_{i+1}) \land q_{i+2})$
  $= P((\text{base} \land q_{i+2}) \lor (q_{i+1} \land q_{i+2}))$
  $= P(\text{base} \land q_{i+2}) + P(q_{i+1} \land q_{i+2}) - P(\text{base} \land q_{i+1} \land q_{i+2})$
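The two-step look-ahead identity can be sanity-checked with explicit sets. This is a small numeric sketch; the collection size and the three document sets are made up for the example, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction

N = 10                    # size of a hypothetical collection
base = {0, 1, 2, 3}       # documents retrieved by q1 v ... v qi
q_next = {3, 4, 5}        # documents containing q_{i+1}
q_after = {2, 3, 5, 6}    # documents containing q_{i+2}

def P(docs):
    """Fraction of the collection covered by the given set of documents."""
    return Fraction(len(docs), N)

# Left-hand side: coverage after issuing both look-ahead words.
lhs = P(base | q_next | q_after)

# Right-hand side: the expansion from the slide.
overlap = P(base & q_after) + P(q_next & q_after) - P(base & q_next & q_after)
rhs = P(base | q_next) + P(q_after) - overlap
```

Both sides evaluate to 7/10 for this data, as the identity requires.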