Crawling the Hidden Web Sriram Raghavan Hector Garcia-Molina Computer Science Department Stanford University Reviewed by Pankaj Kumar
Introduction • What are web crawlers? Programs that traverse the Web graph in a structured manner, retrieving web pages. • Are they really crawling the whole web graph? Their target: the Publicly Indexable Web (PIW). • They are missing something…
What about content that can only be reached through: • Search forms • Web pages that require authorization • Let's face the truth: • The hidden web is estimated to be far larger than the PIW • High-quality information is present out there. Example – Patent & Trademark Office, news media
Now…the goal: • To build a web crawler that can crawl and extract content from hidden databases. • To enable indexing, analysis, and mining of hidden web content. • But the path is not easy: • Automatic parsing and processing of form-based interfaces. • Providing meaningful input values to search forms.
Our approach: • Task-specificity – • Resource discovery (not the focus of this paper) • Content extraction • Human assistance – critical, because it • enables the crawler to use relevant values, and • helps gather additional potential values.
Hidden Web Crawlers • A new operational model – developed at Stanford University. • First of all… • How a user interacts with a web form:
Now, how should a crawler interact with a web form? • Wait…what is this all about? Let's understand the terminology first. That will help us.
Terminology: • Form Page: The actual web page containing the form. • Response Page: The page received in response to a form submission. • Internal Form Representation: Built by the crawler for a given web form F: F = ({E1, E2, …, En}, S, M), where the Ei are the form elements, S is the submission information, and M is meta-information about the form. • Task-specific Database: The values and information that the crawler needs for its task. • Matching Function: Implements the Match algorithm to produce value assignments for the form elements: Match(({E1, E2, …, En}, S, M), D) = [E1 ← v1, E2 ← v2, …, En ← vn] • Response Analysis: Analyzes the response page and stores the result of the form submission in the crawler's repository.
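To make the notation concrete, here is a minimal, illustrative sketch of the internal form representation F = ({E1, E2, …, En}, S, M) and of a Match step that assigns each element a value drawn from a task-specific database D. All class and function names are assumptions for illustration, not HiWE's actual code.

```python
# Illustrative sketch only: internal form representation and Match step.
from dataclasses import dataclass

@dataclass
class FormElement:
    domain: list          # finite domain (select/radio) or [] for free-text inputs
    label: str = ""       # descriptive label extracted from the form page

@dataclass
class FormRepresentation:
    elements: list        # {E1, ..., En}: the form elements
    submission_info: dict # S: submission URL, HTTP method, ...
    meta_info: dict       # M: meta-information, e.g. the URL of the form page

def match(form: FormRepresentation, database: dict) -> list:
    """Return a value assignment [E1 <- v1, ..., En <- vn]."""
    assignment = []
    for element in form.elements:
        # Look up candidate values for this element's label in D; a real
        # matcher would normalize labels and use approximate matching.
        candidates = database.get(element.label.lower(), [])
        value = candidates[0] if candidates else ""
        assignment.append((element, value))
    return assignment
```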
Submission Efficiency (Performance): Let Ntotal = total # of forms submitted by the crawler, Nsuccess = # of submissions that result in a response page containing one or more search results, and Nvalid = # of semantically correct form submissions. Then: • Strict Submission Efficiency: SEstrict = Nsuccess / Ntotal • Lenient Submission Efficiency: SElenient = Nvalid / Ntotal
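The two metrics follow directly from the three counts. The helper below is only a sketch with invented numbers, meant to make the definitions concrete.

```python
# Illustrative helper for the two efficiency metrics defined above.
def submission_efficiency(n_total: int, n_success: int, n_valid: int):
    se_strict = n_success / n_total if n_total else 0.0    # SE_strict
    se_lenient = n_valid / n_total if n_total else 0.0     # SE_lenient
    return se_strict, se_lenient

# Example (made-up counts): 100 submissions, 60 returned results,
# 85 were semantically valid.
print(submission_efficiency(100, 60, 85))  # (0.6, 0.85)
```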
HiWE: Hidden Web Exposer • HiWE Architecture:
But how does this fit into our operational model? • Form Representation • Task-specific Database (LVS table) • Matching Function • Computing Weights (a sketch of the LVS lookup follows below)
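As a rough illustration of the LVS (Label Value Set) idea, the sketch below maps each label to a fuzzy set of values with membership weights and returns the highest-weight values for a normalized label. The table contents, weights, and function names are invented for illustration; they are not taken from HiWE.

```python
# Invented example of a task-specific LVS table: label -> {value: weight}.
lvs_table = {
    "company name": {"IBM": 0.9, "Intel": 0.85, "AMD": 0.8},
    "date":         {"last 10 years": 1.0},
}

def best_values(label: str, k: int = 3) -> list:
    """Return the k highest-weight values stored for a (normalized) label."""
    entry = lvs_table.get(label.strip().lower(), {})
    return sorted(entry, key=entry.get, reverse=True)[:k]

print(best_values("Company Name"))  # ['IBM', 'Intel', 'AMD']
```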
LITE: Layout-based Information Extraction Technique What is it? A technique in which the page layout aids label extraction: • Prune the form page. • Approximately lay out the pruned page using a custom layout engine. • Identify and rank candidate labels (see the ranking sketch below). • The highest-ranked candidate is the label associated with the form element.
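The ranking step can be pictured with a small sketch: given approximate layout coordinates, text pieces closer to the form element score better, with a preference for candidates to its left or above it. The scoring formula and coordinates below are assumptions for illustration, not the paper's exact heuristic.

```python
import math

def rank_candidates(element_center, candidates):
    """element_center is the (x, y) center of the form element; candidates
    is a list of (text, (x, y)) pairs for nearby text pieces."""
    ex, ey = element_center
    scored = []
    for text, (cx, cy) in candidates:
        distance = math.hypot(ex - cx, ey - cy)
        # Assumed bonus: labels usually sit to the left of or above the element.
        bonus = 0.8 if (cx < ex or cy < ey) else 1.0
        scored.append((distance * bonus, text))
    scored.sort()
    return [text for _, text in scored]   # best-ranked candidate first

print(rank_candidates((200, 100),
                      [("Company Name", (80, 100)), ("Search", (350, 100))]))
# -> ['Company Name', 'Search']
```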
Experiments • Task description: Collect Web pages containing “News articles, reports, press releases, and white papers relating to the semiconductor industry, dated sometime in the last ten years”. • Parameter values:
Effect of the value-assignment ranking function (ρfuzz, ρavg, and ρprob): • Label extraction accuracy: • LITE: 93% • Heuristic based purely on textual analysis: 72% • Heuristic based on extensive manual observation: 83%
Effect of α: • Effect of crawler input to the LVS table:
Pros and Cons… • Pros • More information is crawled • Quality of the information is very high • More focused results • Crawler input increases the number of successful submissions • Cons • Crawling becomes slower • The task-specific database can limit the accuracy of results • Unable to handle simple dependencies between form elements • Lack of support for partially filled-out forms
Where does our course fit in here? • In content extraction • Given the set of resources, i.e. sites and databases, automate the information retrieval • In label matching (the matching function) • Label normalization • Edit distance calculation (see the sketch below) • In the LITE-based heuristic for extracting labels • Identify and rank candidates • In maintaining the crawler's repository
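For the label-matching ingredients listed above, here is a minimal sketch of label normalization and Levenshtein edit distance. It is a textbook dynamic-programming implementation, not HiWE's code.

```python
def normalize(label: str) -> str:
    """Lower-case a raw label and strip decoration such as ':' and '*'."""
    return " ".join(label.lower().replace(":", " ").replace("*", " ").split())

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance(normalize("Company Name:"), normalize("company name")))  # 0
```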
Related Work… • J. Madhavan et al., “Google's Deep Web Crawl,” VLDB, 2008 • J. Madhavan et al., “Harnessing the Deep Web: Present and Future,” CIDR, Jan. 2009 • Manuel Álvarez, Juan Raposo, Fidel Cacheda, and Alberto Pan, “A Task-specific Approach for Crawling the Deep Web,” Aug. 2006 • Lu Jiang, Zhaohui Wu, Qian Feng, Jun Liu, and Qinghua Zheng, “Efficient Deep Web Crawling Using Reinforcement Learning” • Manuel Álvarez et al., “Crawling the Content Hidden Behind Web Forms” • Yongquan Dong and Qingzhong Li, “A Deep Web Crawling Approach Based on Query Harvest Model,” 2012 • Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho, “Downloading Hidden Web Content” • Rosy Madaan, Ashutosh Dixit, A.K. Sharma, and Komal Kumar Bhatia, “A Framework for Incremental Hidden Web Crawler,” 2010 • Ping Wu, Ji-Rong Wen, Huan Liu, and Wei-Ying Ma, “Query Selection Techniques for Efficient Crawling of Structured Web Sources” • http://deepweb.us/
So…what's the conclusion? • Traditional crawlers' limitations • Issues in extending crawlers to access the hidden web • Need for a narrow application focus • Promising results of HiWE • Limitations of HiWE: • Inability to handle simple dependencies between form elements • Lack of support for partially filled-out forms