
Downloading Textual Hidden-Web Content Through Keyword Queries


Presentation Transcript


  1. Downloading Textual Hidden-Web Content Through Keyword Queries • Alexandros Ntoulas, Petros Zerfos, Junghoo Cho • University of California, Los Angeles, Computer Science Department • {ntoulas, pzerfos, cho}@cs.ucla.edu • JCDL, June 8th, 2005

  2. Motivation • I would like to buy a used ’98 Ford Taurus • Technical specs? • Reviews? • Classifieds? • Vehicle history? • Google?

  3. Why can’t we use a search engine? • Search engines today employ crawlers that find pages by following links • Many useful pages are available only after issuing queries (e.g. Classifieds, USPTO, PubMed, LoC, …) • Search engines cannot reach such pages: there are no links to them (the Hidden Web) • In this talk: how can we download Hidden-Web content?

  4. Outline • Interacting with Hidden-Web sites • Algorithms for selecting queries for Hidden-Web sites • Experimental evaluation of our algorithms

  5. Interacting with Hidden-Web pages (1) • The user issues a query (e.g. liver) through a query interface

  6. Interacting with Hidden-Web pages (2) • The user issues a query through a query interface • A result list page is presented to the user

  7. Interacting with Hidden-Web pages (3) • The user issues a query through a query interface • A result list is presented to the user • The user selects and views the “interesting” results

  8. Querying a Hidden-Web site
  Procedure:
  while ( there are available resources ) do
    (1) select a query to send to the site
    (2) send query and acquire result list
    (3) download the pages
  done
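The loop above, as a minimal Python sketch. The helpers select_next_query, issue_query, and download_page are hypothetical placeholders for a site's query interface, not part of the original procedure:

```python
# Minimal sketch of the Hidden-Web crawling loop (slide 8).
# select_next_query, issue_query, and download_page are hypothetical
# helpers standing in for a site's query interface.

def crawl_hidden_web_site(site, budget):
    downloaded = {}                                   # url -> page content
    spent = 0
    while spent < budget:                             # while resources remain
        query = select_next_query(site, downloaded)   # (1) select a query
        urls, cost = issue_query(site, query)         # (2) send query, get results
        for url in urls:                              # (3) download the pages
            if url not in downloaded:
                downloaded[url] = download_page(url)
        spent += cost
    return downloaded
```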

  9. How should we select the queries? (1) • S: set of pages in the Web site (pages as points) • qi: set of pages returned if we issue query qi (queries as circles)

  10. How should we select the queries? (2) • Find the queries (circles) that cover the maximum number of pages (points) • Equivalent to the classic set-covering problem

  11. Challenges during query selection • In practice we don’t know which pages will be returned by which queries (the qi are unknown) • Even if we did know the qi, the set-covering problem is NP-hard • We will present approximation algorithms for the query selection problem • We will assume single-keyword queries

  12. Outline • Interacting with Hidden-Web sites • Algorithms for selecting queries for Hidden-Web sites • Experimental evaluation of our algorithms

  13. Some background (1) • Assumption: when we issue query qi to a Web site, all pages containing qi are returned • P(qi): fraction of the site’s pages we get back after issuing qi • Example: q = liver • No. of docs in DB: 10,000 • No. of docs containing liver: 3,000 • P(liver) = 3,000 / 10,000 = 0.3

  14. Some background (2) • P(q1 ∧ q2): fraction of pages containing both q1 and q2 (intersection of q1 and q2) • P(q1 ∨ q2): fraction of pages containing either q1 or q2 (union of q1 and q2) • Cost and benefit: how much benefit do we get out of a query? How costly is it to issue a query?

  15. Cost function • The cost to issue a query and download the Hidden-Web pages: Cost(qi) = cq + cr·P(qi) + cd·P(qi), where (1) cq is the cost of issuing the query, (2) cr·P(qi) is the cost of retrieving a result item times the number of results, and (3) cd·P(qi) is the cost of downloading a document times the number of documents
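As a quick sanity check, a one-function Python version of this cost model; the constants below are the illustrative values used later in the talk (slide 36), not prescribed ones:

```python
def query_cost(p_qi, c_q=100, c_r=100, c_d=10_000):
    """Cost(qi) = cq + cr*P(qi) + cd*P(qi): a fixed query fee, plus
    result-list retrieval and document downloads, both proportional
    to the fraction P(qi) of matching pages."""
    return c_q + c_r * p_qi + c_d * p_qi

# With the earlier example P(liver) = 0.3:
print(query_cost(0.3))  # 100 + 30 + 3000 = 3130
```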

  16. Problem formalization • Find the set of queries q1,…,qn which maximizes P(q1∨…∨qn) • Under the constraint that the total cost Σi Cost(qi) stays within the available resource budget

  17. Query selection algorithms • Random: select a query randomly from a precompiled list (e.g. a dictionary) • Frequency-based: select queries from a precompiled list in decreasing order of frequency (e.g. frequency in a corpus previously downloaded from the Web) • Adaptive: analyze previously downloaded pages to determine “promising” future queries

  18. Adaptive query selection • Assume we have issued q1,…,qi−1. To find a promising query qi we need to estimate P(q1∨…∨qi−1∨qi) • P((q1∨…∨qi−1) ∨ qi) = P(q1∨…∨qi−1) + P(qi) − P(q1∨…∨qi−1)·P(qi | q1∨…∨qi−1) • P(q1∨…∨qi−1) is known (by counting) since we have issued q1,…,qi−1 • P(qi | q1∨…∨qi−1) can be measured by counting occurrences of qi within the pages matching q1∨…∨qi−1 • What about P(qi)?
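A small worked example of this update in Python; all numbers are illustrative, not from the paper:

```python
# Worked example of the union update from slide 18 (numbers illustrative).
p_prev = 0.50   # P(q1 v ... v qi-1): known, since those queries were issued
p_cond = 0.20   # P(qi | q1 v ... v qi-1): counted in the downloaded pages
p_qi   = 0.25   # P(qi): must be estimated (see the next slide)

p_union = p_prev + p_qi - p_prev * p_cond
print(p_union)  # 0.65: estimated coverage if we also issue qi
```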

  19. Estimating P(qi) • Independence estimator: P(qi) ≈ P(qi | q1∨…∨qi−1) • Zipf estimator [IG02]: rank queries based on frequency of occurrence and fit a power-law distribution, then use the fitted distribution to estimate P(qi)
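A minimal sketch of the Zipf-style estimator, assuming we fit a power law P(q) ≈ c·rank^(−γ) to the rank–frequency curve of terms seen in the pages downloaded so far (a least-squares fit in log–log space; the details of [IG02] are omitted):

```python
import numpy as np

def fit_zipf(observed_freqs):
    """Fit P(q) ~ c * rank**(-gamma) to observed term frequencies and
    return a function estimating P(q) for a term at a given rank."""
    freqs = np.sort(np.asarray(observed_freqs, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    # Least-squares line in log-log space: log f = log c + slope * log r
    slope, log_c = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return lambda rank: float(np.exp(log_c) * rank ** slope)  # slope ≈ -gamma

# Illustrative use: extrapolate beyond the observed ranks.
estimate_p = fit_zipf([0.30, 0.16, 0.11, 0.08, 0.06])
print(estimate_p(10))   # estimated P(q) for the 10th most frequent term
```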

  20. Query selection algorithm
  foreach qi in [potential queries] do
    Pnew(qi) = P(q1∨…∨qi−1∨qi) − P(q1∨…∨qi−1)   (estimated as on the previous slides)
  done
  return the qi with maximum Efficiency(qi) = Pnew(qi) / Cost(qi)
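Putting the pieces together, a minimal sketch of one greedy selection step, assuming an estimate_p function implementing one of slide 19's estimators and the query_cost model from slide 15; the function names and the substring-based counting are illustrative simplifications:

```python
def next_query(candidates, p_prev, downloaded_pages, estimate_p, query_cost):
    """Greedy step (slide 20): return the candidate query maximizing
    Efficiency(qi) = Pnew(qi) / Cost(qi)."""
    n = max(len(downloaded_pages), 1)
    best, best_eff = None, float("-inf")
    for q in candidates:
        # P(qi | q1 v ... v qi-1): counted directly in the downloaded pages.
        p_cond = sum(q in page for page in downloaded_pages) / n
        p_qi = estimate_p(q)              # estimated P(qi)
        p_new = p_qi - p_prev * p_cond    # Pnew(qi) = P(union with qi) - P(union)
        eff = p_new / query_cost(p_qi)    # benefit per unit of cost
        if eff > best_eff:
            best, best_eff = q, eff
    return best
```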

  21. Other practical issues • Efficient calculation of P(qi | q1∨…∨qi−1) • Selection of the initial query • Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results) • Please refer to our paper for the details

  22. Outline • Interacting with Hidden-Web sites • Algorithms for selecting queries for Hidden-Web sites • Experimental evaluation of our algorithms

  23. Experimental evaluation • Applied our algorithms to 4 different sites

  24. Policies • Random-16K: pick a query randomly from the 16,000 most popular terms • Random-1M: pick a query randomly from the 1,000,000 most popular terms • Frequency-based: pick queries based on frequency of occurrence • Adaptive

  25. Coverage of policies • What fraction of a Web site can we download by issuing queries? • Study P(q1∨…∨qi) as i increases

  26. Coverage of policies for PubMed • Adaptive gets ~80% with ~83 queries • Frequency-based needs 103 queries for the same coverage

  27. Coverage of policies for DMOZ (whole) • Adaptive outperforms the other policies

  28. Coverage of policies for DMOZ (arts) • Adaptive performs best on topic-specific text as well

  29. Other experiments • Impact of the initial query • Impact of the various parameters of the cost function • Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results) • Please refer to our paper for the details

  30. Related work • Issuing queries to databases: • Acquire a language model [CCD99] • Estimate the fraction of the Web indexed [LG98] • Estimate the relative size and overlap of indexes [BB98] • Build multi-keyword queries that can return a large number of documents [BF04] • Harvesting approaches / cooperative databases (OAI [LS01], DP9 [LMZN02])

  31. Conclusion • An adaptive algorithm for issuing queries to Hidden-Web sites • Our algorithm is highly efficient (downloaded >90% of a site with ~100 queries) • Allows users to tap into unexplored information on the Web • Allows the research community to download, mine, study, and understand the Hidden Web

  32. References • [IG02] P. Ipeirotis, L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB 2002. • [CCD99] J. Callan, M. Connell, A. Du. Automatic discovery of language models for text databases. SIGMOD 1999. • [LG98] S. Lawrence, C.L. Giles. Searching the World Wide Web. Science 280(5360):98–100, 1998. • [BB98] K. Bharat, A. Broder. A technique for measuring the relative size and overlap of public web search engines. WWW 1998. • [BF04] L. Barbosa, J. Freire. Siphoning hidden-web data through keyword-based interfaces. SBBD 2004. • [LS01] C. Lagoze, H. Van de Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework. JCDL 2001. • [LMZN02] X. Liu, K. Maly, M. Zubair, M.L. Nelson. DP9: An OAI Gateway Service for Web Crawlers. JCDL 2002.

  33. Thank you! Questions?

  34. Impact of the initial query • Does it matter what the first query is? • Crawled PubMed with the initial queries: • data (1,344,999 results) • information (308,474 results) • return (29,707 results) • pubmed (695 results)

  35. Impact of the initial query • The algorithm converges regardless of the initial query

  36. Incorporating the document download cost • Cost(qi) = cq + cr·P(qi) + cd·Pnew(qi) • Crawled PubMed with cq = 100, cr = 100, cd = 10,000

  37. Incorporating the document download cost • Adaptive uses resources more efficiently • The document download cost is a significant portion of the total cost

  38. Can we get all the results back?

  39. Downloading from sites limiting the number of results (1) • The site returns only a subset qi’ of qi instead of all of qi • For qi+1 we need to estimate P(qi+1 | q1∨…∨qi)

  40. Downloading from sites limiting the number of results (2) • Assuming qi’ is a random sample of qi

  41. Impact of the limit on results • How does the limit on results affect our algorithms? • Crawled DMOZ but restricted the algorithms to 1,000 results instead of 10,000

  42. DMOZ with a result cap at 1,000 • Adaptive still outperforms frequency-based
