CRAWLING THE HIDDEN WEB

Authors: S. Raghavan & H. Garcia-Molina • Presenter: Nga Chung

OUTLINE • Introduction • Challenges • Approach • Experimental Results • Contributions • Pros and Cons • Related Work

INTRODUCTION • Hidden Web • Content stored in databases that can be retrieved only through a user query, such as medical research databases, flight schedules, product listings, and news archives • Social media blog posts and comments • So why should we care? • The estimated scale of the web (55 to 60 billion pages) does not include the deep web or pages behind security walls [2] • A 2001 estimate put the Hidden Web at 500 times the size of the publicly indexed web • Mike Bergman, “The long-term impact of Deep Web search had more to do with transforming business than with satisfying the whims of Web surfers.” [5]

CHALLENGES • From a search engine perspective • Locate the hidden databases • Identify which databases to search for a given user query • From a crawler’s perspective • Interact with a search form • Search interfaces can be form-based, facet/guided navigation, or free-text, and all are designed for human users [3] • Know what keywords to put into the form fields • Filter search results returned from search queries • Define metrics to measure the crawler’s performance

HIDDEN WEB EXPOSER (HiWE) ARCHITECTURE • [Architecture diagram] • Components: URL List, Crawl Manager, Parser, Form Analyzer, Form Processor, Response Analyzer, LVS Manager, Label Value Set (LVS) table, Task-Specific Database, Data Sources, WWW • Labeled flows: Form Submission, Response, Feedback

FORM ANALYSIS • How does a crawler interact with a search form? • Crawler builds an “Internal Form Representation” • F = ({E1, E2, …, En}, S, M) • {E1, E2, …, En} is the set of n form elements • S is submission information, e.g. submission URL • M is meta-information, e.g. URL of form page, web site hosting form, links to form • Label(E1) is descriptive text describing the field, e.g. Date • Domain(E1) is the set of possible values for the field, which can be finite (select box) or infinite (text box)

FORM ANALYSIS • Example form • Label(E1) = Make, Domain(E1) = {Acura, Lexus…} • Label(E5) = Your ZIP, Domain(E5) = {s | s is a text string}
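The internal form representation F = ({E1, …, En}, S, M) can be sketched as a pair of small data classes. This is only an illustration: the class names, field names, and the example.com URLs are assumptions made here, not part of HiWE.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormElement:
    """One form element E with Label(E) and Domain(E)."""
    label: str                          # Label(E), e.g. "Make"
    domain: Optional[frozenset] = None  # finite Domain(E); None = any text string

    def is_finite(self) -> bool:
        # Select boxes have a finite domain, text boxes an infinite one.
        return self.domain is not None

    def accepts(self, value: str) -> bool:
        # Any string is allowed in a text box; a select box only takes listed values.
        return self.domain is None or value in self.domain


@dataclass
class Form:
    """F = ({E1, ..., En}, S, M)."""
    elements: list    # the n form elements
    submission: dict  # S: submission information, e.g. submission URL
    meta: dict        # M: meta-information, e.g. URL of the form page


# Mirrors the slide example: a select box "Make" and a text box "Your ZIP".
make = FormElement("Make", frozenset({"Acura", "Lexus"}))
zip_code = FormElement("Your ZIP")
form = Form(
    elements=[make, zip_code],
    submission={"url": "http://example.com/search"},   # hypothetical URL
    meta={"form_page": "http://example.com/cars"},     # hypothetical URL
)
```

Representing an infinite domain as None keeps the finite and infinite cases distinguishable without trying to enumerate every text string.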

TASK SPECIFIC DATABASE • How does a crawler know what keywords to put into the form fields? • Crawler has a “task-specific database” • For instance, if the task is to search archives pertaining to the automobile industry, the database will contain lists of all car makes and models. • Database has a Label Value Set (LVS) table • Each row contains • L – a label e.g. “Car Make” • V = {v1, …, vn} – a graded set of values e.g. {‘Toyota’, ‘Honda’, ‘Mercedes-Benz’, …} • Membership function Mv assigns a weight to each member of the set V
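One way to picture an LVS row is a label plus a mapping from each value to its weight. The sketch below assumes weights lie in [0, 1], which is consistent with the worked examples later in the deck; the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LVSEntry:
    """One row (L, V) of the LVS table."""
    label: str                                  # L, e.g. "Car Make"
    values: dict = field(default_factory=dict)  # V as value -> weight

    def weight(self, value: str) -> float:
        # Membership function Mv; values outside V get weight 0.
        return self.values.get(value, 0.0)


class LVSTable:
    """A label-indexed collection of LVS rows."""

    def __init__(self):
        self.entries = {}

    def add(self, label: str, values: dict) -> None:
        # Assumption of this sketch: weights are restricted to [0, 1].
        for value, w in values.items():
            if not 0.0 <= w <= 1.0:
                raise ValueError(f"weight for {value!r} must be in [0, 1]")
        self.entries[label] = LVSEntry(label, dict(values))

    def get(self, label: str):
        return self.entries.get(label)


table = LVSTable()
table.add("Car Make", {"Toyota": 1.0, "Honda": 1.0, "Jaguar": 0.5})
```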

TASK SPECIFIC DATABASE • LVS table can be populated through • Explicit initialization by human intervention • Built-in entries for commonly used categories e.g. dates • Querying external data sources e.g. Open Directory Project (categories such as Regional: North America: United States) • Crawler’s encounters with forms that have finite-domain fields

TASK SPECIFIC DATABASE • Computing weights Mv(vi) • Case 1: Precomputed • Case 2: Computed by the respective data source wrapper • Case 3: Computed from crawling experience, when the crawler meets a finite-domain element E • Extract the label of E • Label extracted and found in the LVS table: replace (L, V) with (L, V ∪ Domain(E)) • Label extracted but not found: add a new entry to the LVS table • Label not extracted: find the entry that most closely resembles Domain(E) and add Domain(E) to its set
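The three branches of Case 3 can be sketched as follows. The slide does not say how resemblance is measured or what weight new values receive, so this sketch assumes Jaccard overlap for resemblance, a fixed `default_weight` for new values, and a fuzzy-set union that keeps the larger weight. All function names are hypothetical.

```python
def jaccard(a, b) -> float:
    """Overlap of two value sets (assumed resemblance measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def _merge(entry: dict, new: dict) -> None:
    # V ∪ Domain(E): keep the larger weight when a value is already present.
    for value, w in new.items():
        entry[value] = max(entry.get(value, 0.0), w)


def update_lvs(table: dict, domain, label=None, default_weight=1.0):
    """Fold a finite Domain(E) into `table` (label -> {value: weight}).

    Returns the label of the row that was created or extended, or None
    if no row could be chosen.
    """
    new = {v: default_weight for v in domain}
    if label is not None:
        if label in table:
            _merge(table[label], new)  # label extracted and found
        else:
            table[label] = new         # label extracted, not found
        return label
    if not table:
        return None
    # Label not extracted: pick the row whose values best resemble Domain(E).
    best = max(table, key=lambda l: jaccard(table[l], domain))
    if jaccard(table[best], domain) == 0.0:
        return None
    _merge(table[best], new)
    return best


lvs = {"Car Make": {"Toyota": 1.0, "Honda": 0.8}}
update_lvs(lvs, {"Toyota", "Acura"}, label="Car Make")
update_lvs(lvs, {"2009", "2010"}, label="Year")
matched = update_lvs(lvs, {"Honda", "Lexus"})  # no label available
```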

MATCHING FUNCTION • “Matching function” maps values from the database to form fields • Step 1: Label matching • Normalize the form label and use a string matching algorithm to compute the minimum edit distance between the form label and all LVS labels • Example: form elements E1 = Car Make and E2 = Car Model are matched to values v1 = Toyota and v2 = Prius
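Label matching can be sketched with a plain Levenshtein distance. The normalization rules (lowercasing, dropping punctuation, collapsing whitespace) and the `max_distance` cutoff are assumptions of this sketch; the slide does not specify them.

```python
import re


def normalize(label: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    label = re.sub(r"[^a-z0-9 ]+", " ", label.lower())
    return " ".join(label.split())


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def match_label(form_label: str, lvs_labels, max_distance: int = 3):
    """Return the closest LVS label, or None if none is close enough."""
    target = normalize(form_label)
    best = min(lvs_labels,
               key=lambda l: edit_distance(target, normalize(l)),
               default=None)
    if best is None or edit_distance(target, normalize(best)) > max_distance:
        return None
    return best


labels = ["Car Make", "Car Model", "Year"]
```

For example, `match_label("Car make:", labels)` selects "Car Make" because the normalized strings are identical.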

MATCHING FUNCTION • Step 2: Value assignment • Take all possible combinations of value assignments, rank them, and choose the best set to use for form submission • There are three ranking functions • Fuzzy conjunction • Average • Probabilistic • Example: form with two fields, car make and year • Jaguar, 2009, where Mv1(Jaguar) = 0.5 and Mv2(2009) = 1 • ρfuz = 0.5 • ρavg = ½ (0.5 + 1) = 0.75 • ρprob = 1 – [(1 – 0.5) * (1 – 1)] = 1 • Toyota, 2010, where Mv1(Toyota) = 1 and Mv2(2010) = 1 • ρfuz = 1 • ρavg = ½ (1 + 1) = 1 • ρprob = 1 – [(1 – 1) * (1 – 1)] = 1
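The three ranking functions can be written directly from the worked example: fuzzy conjunction takes the minimum weight, average takes the mean, and probabilistic computes one minus the product of the complements. The function names and the `rank_assignments` helper are illustrative choices, not taken from the slides.

```python
from itertools import product
from math import prod


def rho_fuz(weights):
    """Fuzzy conjunction: the weakest value determines the rank."""
    return min(weights)


def rho_avg(weights):
    """Average of the individual weights."""
    return sum(weights) / len(weights)


def rho_prob(weights):
    """Probabilistic: 1 - product of (1 - weight)."""
    return 1 - prod(1 - w for w in weights)


def rank_assignments(candidates, rho):
    """Score every combination of one value per field, best first.

    `candidates` holds one {value: weight} dict per form field.
    """
    combos = product(*(c.items() for c in candidates))
    scored = [(rho([w for _, w in combo]), tuple(v for v, _ in combo))
              for combo in combos]
    return sorted(scored, key=lambda s: s[0], reverse=True)


makes = {"Jaguar": 0.5, "Toyota": 1.0}
years = {"2009": 1.0, "2010": 1.0}
ranked = rank_assignments([makes, years], rho_fuz)
```

Note that ρprob gives (Jaguar, 2009) the same score as (Toyota, 2010), because a single weight of 1 drives the product to zero. It therefore separates candidates less sharply than ρfuz in this example.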

LAYOUT-BASED INFORMATION EXTRACTION (LITE) • [Table: results by label extraction method]

RESPONSE ANALYSIS • How does the crawler determine whether a response page contains results or an error message? • Identify the significant portion of the response page by removing header, footer, etc. and finding the content in the middle of the page • See if the content matches predefined error messages e.g. “No results,” “No matches” • Store a hash of the significant portion and assume that if the same hash occurs very often, it is the hash of an error page
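A minimal sketch of the two error checks, assuming the significant portion of the page has already been extracted as text. The `repeat_threshold` parameter, the choice of SHA-256, and the class name are assumptions made for illustration.

```python
import hashlib
from collections import Counter

# Predefined error messages from the slide, lowercased for matching.
ERROR_PHRASES = ("no results", "no matches")


class ResponseAnalyzer:
    def __init__(self, repeat_threshold: int = 3):
        # How many times the same content must recur before it is
        # treated as an error page (assumed value).
        self.repeat_threshold = repeat_threshold
        self.seen = Counter()

    def is_error_page(self, significant_text: str) -> bool:
        text = " ".join(significant_text.lower().split())
        # Check 1: known error phrases.
        if any(phrase in text for phrase in ERROR_PHRASES):
            return True
        # Check 2: content whose hash keeps recurring is assumed to be
        # a generic error page rather than real results.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self.seen[digest] += 1
        return self.seen[digest] >= self.repeat_threshold


analyzer = ResponseAnalyzer(repeat_threshold=2)
```

The hash check only becomes informative after several submissions to the same site, so early pages without a known error phrase are treated as results.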

METRICS • How to measure the efficiency of the hidden web crawler? • Define submission efficiency SE • Ntotal = total number of forms submitted • Nsuccess = total number of submissions that resulted in response page containing search results • Nvalid = number of semantically correct submissions (e.g. inputting “Orange” for form element labeled “Vegetable” is semantically incorrect)
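The slide's formula did not survive extraction. The sketch below assumes the two ratios used in the original HiWE paper: a strict efficiency Nsuccess / Ntotal and a lenient efficiency Nvalid / Ntotal. The function name and return shape are hypothetical.

```python
def submission_efficiency(n_total: int, n_success: int, n_valid: int):
    """Return (strict, lenient) submission efficiency.

    strict  = n_success / n_total  (submissions that returned results)
    lenient = n_valid / n_total    (semantically correct submissions)
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not (0 <= n_success <= n_total and 0 <= n_valid <= n_total):
        raise ValueError("counts must lie between 0 and n_total")
    return n_success / n_total, n_valid / n_total


strict, lenient = submission_efficiency(n_total=100, n_success=80, n_valid=90)
```

The lenient variant only penalizes the crawler for semantically incorrect inputs, not for well-formed queries that happen to return no results.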

EXPERIMENT • Task: Market analyst interested in building an archive of information about the semiconductor industry over the past 10 years • LVS table populated from online sources such as Semiconductor Research Corporation and Lycos Companies Online

EXPERIMENTAL RESULTS – RANKING FUNCTION • Crawler executed three times, each with a different ranking function • ρfuz and ρavg achieve submission efficiency above 80% • ρfuz does better (88.8%) but submits fewer forms than ρavg (83.1%)

EXPERIMENTAL RESULTS – MINIMUM FORM SIZE • Effect of minimum form size – crawler performs better on larger forms • Submission efficiencies shown on the chart: 78.9%, 88.77%, 88.96%

CONTRIBUTIONS • Introduces HiWE, one of the first publicly available techniques for crawling the hidden web • Introduces LITE, a technique to extract form data, by incorporating the physical layout of the HTML page • Techniques prior to this were based on pattern recognition of the underlying HTML

PROS • Defines clear performance metric from which to analyze the crawler’s efficiency • Points out known limitations of technique from which future work can be done • Directs readers to technical report which provides more detailed explanation of HiWE implementation

CONS • Not an automatic approach, requires human intervention • Task-specific • Requires creation of an LVS table per task • Technique has many limitations • Can only retrieve search results from HTML-based forms • Cannot support forms that are driven by JavaScript events e.g. onclick, onselect • No mention of whether the results of forms submitted through HTTP POST were stored/indexed

RELATED WORK • USC ISI Extract Data from Web (1999 - 2001) [7, 8] • Describe relevant information on a web page with a formal grammar and automatically adapt to web page changes • Research at UCLA (2005) [4] • Adaptive approach – automatically generate queries by examining results from previous queries • Google’s Deep-Web Crawler (2008) [1] • Select only a small number of input combinations that provide good coverage of the content in the underlying database and add the resulting HTML pages to a search engine index • DeepPeep [6] • Tracks 45,000 forms across 7 domains and allows users to search for these forms

Q & A

REFERENCES
[1] J. Madhavan, D. Ko, Ł. Kot, V. Ganapathy, A. Rasmussen, & A. Halevy, “Google’s Deep-Web Crawl,” Proceedings of the VLDB Endowment, 2008. Available: http://www.cs.cornell.edu/~lucja/Publications/I03.pdf. [Accessed June 13, 2010]
[2] C. Mattmann, “Characterizing the Web,” Available: http://sunset.usc.edu/classes/cs572_2010/Characterizing_the_Web.ppt. [Accessed May 19, 2010]
[3] C. Mattmann, “Query Models,” Available: http://sunset.usc.edu/classes/cs572_2010/Query_Models.ppt. [Accessed June 10, 2010]
[4] A. Ntoulas, P. Zerfos, & J. Cho, “Downloading Textual Hidden Web Content by Keyword Queries,” Proceedings of the Joint Conference on Digital Libraries, June 2005. Available: http://oak.cs.ucla.edu/~cho/papers/ntoulas-hidden.pdf. [Accessed June 13, 2010]
[5] A. Wright, “Exploring a ‘Deep Web’ That Google Can’t Grasp,” The New York Times, February 22, 2009. Available: http://www.nytimes.com/2009/02/23/technology/internet/23search.html?_r=1&th&emc=th. [Accessed June 1, 2010]
[6] DeepPeep beta, Available: http://www.deeppeep.org/index.jsp
[7] C. A. Knoblock, K. Lerman, S. Minton, & I. Muslea, “Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach,” IEEE Data Engineering Bulletin, 1999. Available: http://www.isi.edu/~muslea/PS/deb-2k.pdf. [Accessed June 28, 2010]
[8] C. A. Knoblock, S. Minton, & I. Muslea, “Hierarchical Wrapper Induction for Semistructured Information Sources,” Journal of Autonomous Agents and Multi-Agent Systems, 2001. Available: http://www.isi.edu/~muslea/PS/jaamas-2k.pdf. [Accessed June 28, 2010]
