Crawling the Hidden Web. Sriram Raghavan Hector Garcia-Molina @ Stanford University. Introduction. What’s the problem? Current-day crawlers retrieve only the Publicly Indexable Web (PIW). Why is it a problem? Large amounts of high-quality information are ‘hidden’ behind search forms
Crawling the Hidden Web Sriram Raghavan Hector Garcia-Molina @ Stanford University
Introduction • What’s the problem? • Current-day crawlers retrieve only the Publicly Indexable Web (PIW) • Why is it a problem? • Large amounts of high-quality information are ‘hidden’ behind search forms • The hidden Web is estimated to be 500 times as large as the PIW
Introduction (cont’d) • What’s the solution? • Design a crawler capable of extracting content from the hidden Web • A generic operational model of a hidden Web crawler, Hidden Web Exposer (HiWE) • Why is HiWE a solution?
Challenges and Simplifications • Challenges • Parse, process and interact with search forms • Fill out forms for submission • Simplifications • Application-dependent • With user assistance • Only address content retrieval; the resource discovery step is assumed to be already done
Performance Metrics • Coverage Metric • Submission Efficiency • Lenient Submission Efficiency
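The slide only names the metrics, so the following is a minimal sketch of how the two submission-efficiency figures might be computed. It assumes that strict efficiency is the fraction of submissions whose response page contains at least one result, and that lenient efficiency is the fraction of submissions that were semantically valid. The class name `SubmissionLog` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SubmissionLog:
    """Counts collected during a crawl (hypothetical structure)."""
    total: int               # all form submissions made by the crawler
    with_results: int        # submissions whose response contained results
    semantically_valid: int  # submissions whose values fit the form's fields

    def strict_efficiency(self) -> float:
        # Assumed definition: successful submissions / total submissions.
        return self.with_results / self.total if self.total else 0.0

    def lenient_efficiency(self) -> float:
        # Assumed definition: semantically valid submissions / total.
        return self.semantically_valid / self.total if self.total else 0.0


log = SubmissionLog(total=200, with_results=90, semantically_valid=150)
print(log.strict_efficiency(), log.lenient_efficiency())
```

Under these assumptions the lenient figure is never lower than the strict one when every result-bearing submission is also semantically valid, since a valid query can still return no results.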
Design Issues • Internal Form Representation • Task-specific Database • Matching Function • Response Analysis
HiWE – Task-Specific Database • Label Value-Set (LVS) Tables • Each entry pairs a label with a value set • The value set is a fuzzy set of element values • A membership function assigns a weight in [0, 1] to each member of the set
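An LVS table can be pictured as a mapping from labels to weighted value sets. The sketch below is one possible in-memory representation, not HiWE's actual storage format; the names `add_value` and `membership`, and the choice to keep the higher weight when a value is added twice, are assumptions made for illustration.

```python
from typing import Dict

# label -> {value -> membership weight in [0, 1]}
LVSTable = Dict[str, Dict[str, float]]


def add_value(table: LVSTable, label: str, value: str, weight: float) -> None:
    """Add a value to the fuzzy value set stored under `label`."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("membership weight must lie in [0, 1]")
    values = table.setdefault(label, {})
    # Assumption: a repeated value keeps the larger of the two weights.
    values[value] = max(weight, values.get(value, 0.0))


def membership(table: LVSTable, label: str, value: str) -> float:
    """Membership function: 0.0 for values outside the set."""
    return table.get(label, {}).get(value, 0.0)


table: LVSTable = {}
add_value(table, "company", "IBM", 1.0)
add_value(table, "company", "Intel", 0.6)
print(membership(table, "company", "Intel"))
```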
HiWE – Populating the LVS Table • Explicit Initialization • Built-in Entries • Wrapped Data Sources • Crawling Experience
HiWE – Computing Weights • Values from explicit initialization and built-in categories have weight 1 • Values from external data sources are assigned weights in [0, 1] by wrappers • Values gathered by the crawler • Label extracted and matched – add the new values to the matching entry • Label extracted but not matched – add a new entry (L, V) • Label cannot be extracted – find the closest entry and add the new values to it
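The three crawling-experience cases above can be sketched as follows. This is an illustration under stated assumptions, not HiWE's algorithm: label matching uses Python's `difflib` as a stand-in for the crawler's own string matcher, the "closest entry" is taken to be the one sharing the most values, and the fixed weight of 0.5 for learned values is arbitrary. All function names are hypothetical, and table keys are assumed to be already normalized.

```python
import difflib
import re
from typing import Dict, List, Optional

LVSTable = Dict[str, Dict[str, float]]


def normalize(label: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", label.lower()).split())


def closest_label(table: LVSTable, label: str, cutoff: float = 0.8) -> Optional[str]:
    hits = difflib.get_close_matches(normalize(label), list(table), n=1, cutoff=cutoff)
    return hits[0] if hits else None


def learn_from_form(table: LVSTable, label: Optional[str],
                    values: List[str], weight: float = 0.5) -> Optional[str]:
    """Return the entry that was updated or created, or None if nothing fit."""
    if label is not None:
        # Case 1: label matches an entry; case 2: no match, create a new entry.
        key = closest_label(table, label) or normalize(label)
    else:
        # Case 3: no label; pick the entry that shares the most values.
        key = max(table, key=lambda k: len(set(table[k]) & set(values)), default=None)
        if key is None or not set(table[key]) & set(values):
            return None
    entry = table.setdefault(key, {})
    for v in values:
        entry.setdefault(v, weight)  # existing weights are left untouched
    return key


table = {"company name": {"IBM": 1.0}}
print(learn_from_form(table, "Company Name:", ["Intel"]))
print(learn_from_form(table, None, ["IBM", "Oracle"]))
```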
HiWE – Matching Function • Enumerate values for finite domain elements • Label matching • step 1: string normalization • step 2: string matching • Evaluate value assignment • Fuzzy Conjunction • Average • Probabilistic
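The three ways of evaluating a value assignment can be sketched as aggregation functions over the membership weights of the chosen values. The formulas below are the usual readings of these names (minimum, arithmetic mean, and one minus the product of complements) and are offered as an assumption; the function names are hypothetical.

```python
import math
from typing import Sequence


def fuzzy_conjunction(weights: Sequence[float]) -> float:
    """Rank an assignment by its weakest value."""
    return min(weights)


def average(weights: Sequence[float]) -> float:
    """Rank an assignment by the mean weight of its values."""
    return sum(weights) / len(weights)


def probabilistic(weights: Sequence[float]) -> float:
    """Treat weights as independent probabilities that a value is useful."""
    return 1.0 - math.prod(1.0 - w for w in weights)


# Weights of the values chosen for two form elements (illustrative numbers).
assignment = [0.5, 0.8]
for rank in (fuzzy_conjunction, average, probabilistic):
    print(rank.__name__, rank(assignment))
```

The fuzzy conjunction is the most conservative of the three, because one low-weight value pulls the whole assignment down, while the probabilistic form rewards any single high-weight value. Each function assumes a non-empty list of weights.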
HiWE – Extraction from Pages • Prune the form page and keep only the forms • Approximately lay out the pruned page using a layout engine • Use the layout engine to identify candidate labels for each form element • Rank each candidate and choose the best one
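The final ranking step can be sketched as picking the candidate text closest to a form element in the laid-out page. The slide does not say how candidates are ranked, so using Euclidean distance alone is an assumption made for illustration; the `Box` type and `best_label` function are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """A piece of text or a form element with its laid-out position."""
    text: str
    x: float
    y: float


def best_label(element: Box, candidates: List[Box]) -> Optional[str]:
    """Return the text of the candidate nearest to the form element."""
    if not candidates:
        return None
    nearest = min(candidates,
                  key=lambda c: math.hypot(c.x - element.x, c.y - element.y))
    return nearest.text


field = Box("", x=100, y=50)
texts = [Box("Author", 40, 50), Box("Title", 40, 10), Box("Search", 300, 200)]
print(best_label(field, texts))
```

A fuller ranking would probably also weigh cues such as relative position and font, which this sketch leaves out.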
HiWE – Experiments (cont’d) • 93% accuracy
Future Work • Recognize and respond to the dependencies between form elements • Support filling out forms partially
Conclusion • Propose an application-specific approach to hidden Web crawling • Implement a prototype crawler – HiWE • Set the stage for designing a variety of hidden Web crawlers