Accessing the Hidden Web

Hidden Web vs. Surface Web
• Surface Web (static or visible Web): accessible to conventional search engines via hyperlinks.
• Hidden Web (dynamic or invisible Web):
  • Web pages that require HTML forms to access the information.
  • Dynamically generated web pages: interactive and responsive Ajax applications, for instance Google Maps.

Obstacles in Accessing the Hidden Web
• Certain properties of Ajax applications (client-side dynamism)
  • Client-side execution
  • State changes and navigation
  • Dynamic representational model
  • Clickables
• Access interface using HTML forms (input dynamism)
  • The form filters the information, which is then passed to the server; the server processes it and generates the answer page.
• Issue of scale
  • Comprehensive coverage of the hidden Web is not possible because of its enormous size.

CRAWLJAX
• An approach to improve the search engine discoverability of Ajax applications.
• A linked multi-page mirror site, fully accessible to search engines, is generated after the Ajax application has been built.
• The application is analyzed dynamically by actually running it.
• div and span elements in Ajax applications might have clickables attached to them.
• Detecting whether such an element is clickable by inspecting the code alone is very difficult.

State Flow Graph
[Figure: example state flow graph. Node labels as extracted: Index, Current Reserve ratio, Current Currency ratio, Money Base, Federal Updation Policy, Previous RR.]
• Root: the initial state, "Index".
• Vertex set: the set of runtime states, e.g. "Money Base", "Previous RR", "Current Currency Ratio".
• Edge set: an edge from v1 to v2 represents a clickable through which v2 is reachable from v1.
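
The state flow graph described above can be sketched as a small directed graph. This is a minimal illustration, not CRAWLJAX's own data structure: the class name `StateFlowGraph`, the clickable identifiers, and the particular edges between the states are assumptions, since the edges of the original figure are not recoverable.

```python
from collections import defaultdict


class StateFlowGraph:
    """Directed graph: vertices are runtime states, edges are clickables."""

    def __init__(self, root):
        self.root = root
        # source state -> {clickable: target state}
        self.edges = defaultdict(dict)
        self.edges[root]  # register the root as a vertex

    def add_edge(self, source, clickable, target):
        """Record that clicking `clickable` in `source` leads to `target`."""
        self.edges[source][clickable] = target
        self.edges[target]  # make sure the target is registered too

    def reachable(self, start=None):
        """Return every state reachable from `start` (default: the root)."""
        start = self.root if start is None else start
        seen, stack = {start}, [start]
        while stack:
            for target in self.edges[stack.pop()].values():
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen


# Hypothetical edges; only the state names come from the slide.
graph = StateFlowGraph("Index")
graph.add_edge("Index", "link-money-base", "Money Base")
graph.add_edge("Index", "link-currency", "Current Currency Ratio")
graph.add_edge("Money Base", "link-previous-rr", "Previous RR")
```

Calling `graph.reachable()` lists the states a crawler could turn into mirror pages when starting from the root.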

Working of CRAWLJAX
• The URL of the website is given as input, along with the prospective click elements (a set of HTML tags).
• A robot simulates real user clicks in the embedded browser to fire the events and actions attached to candidate clickables.
• A state flow graph is built from the resulting states.
• The updated state flow graph becomes the input to the crawl procedure, which is called recursively.
• After each click, the distance between the current DOM and the previous DOM is computed.
• The state flow graph is updated accordingly.
• Links between the nodes of the graph are established by replacing each clickable with a hypertext link.
• HTML string representations of all DOM objects are generated and uploaded to the server.
• The original Ajax site is then linked to the mirror site, which gives search engines a first doorway into the content.
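
The DOM comparison step can be illustrated with a short sketch. The slides do not say which distance measure is used, so this example swaps in Python's `difflib.SequenceMatcher` as a stand-in; the function names and the threshold value of 0.1 are assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher


def dom_distance(old_html: str, new_html: str) -> float:
    """Return a dissimilarity in [0, 1]; 0 means the two strings are identical."""
    return 1.0 - SequenceMatcher(None, old_html, new_html).ratio()


def is_new_state(old_html: str, new_html: str, threshold: float = 0.1) -> bool:
    """Treat the DOM after a click as a new state only if it changed enough."""
    return dom_distance(old_html, new_html) > threshold


before = "<div>Index</div>"
after = "<table><tr><td>Money Base: 42</td></tr></table>"
```

A click that leaves the serialized DOM unchanged gives a distance of zero and adds no state; a click that replaces most of the page crosses the threshold.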

CRAWLJAX Architecture
[Figure: CRAWLJAX architecture. Components as extracted: Robot, embedded browser (UI, Ajax Engine, DOM), CrawlJax Controller, State Machine, Linker, Site Map Generator, Mirror Site Generator, DOM to HTML Transformer. Flow labels: update, click, generate click event, access, generate sitemap, generate mirror, delta updates, link up, transform, output. Outputs: Site Map XML, Multi-Page HTML.]

1. procedure START(url, Set tags)
     browser = initBrowser(url)
     robot = initRobot()
     sm = initStateMachine()
     crawl(sm, tags)
     linkupAndSaveAsHTML(sm)
     generateSitemap(sm)
   end procedure

2. procedure CRAWL(StateMachine sm, Set tags)
     cs = sm.getCurrentState()
     deltaupdate = diff(cs.getDom(), browser.getDom())
     Set C = getCandidateClickables(deltaupdate, tags)
     for c in C do
       robot.click(c)
       dom = browser.getDom()
       if distance(cs.getDom(), dom) > t then
         ns = State(c, dom)
         sm.addState(ns)
         sm.changeState(ns)
         crawl(sm, tags)
         sm.changeState(cs)
         if browser.history.canBack then
           browser.history.goBack()
         else
           browser.reload()
           clickThroughTo(cs)
         end if
       end if
     end for
   end procedure
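
To make the recursion and the back-navigation easier to follow, here is a runnable sketch of the crawl loop against a simulated browser. It is a simplification under stated assumptions, not the CRAWLJAX implementation: `FakeBrowser` and the `site` dictionary are hypothetical, a page is identified by its name instead of its DOM, and the distance test is replaced by a plain inequality plus a visited set.

```python
class FakeBrowser:
    """Stand-in for the embedded browser: each page maps clickables to pages."""

    def __init__(self, site, start):
        self.site = site
        self.current = start
        self.history = [start]

    def clickables(self):
        return sorted(self.site[self.current])

    def click(self, clickable):
        self.current = self.site[self.current][clickable]
        self.history.append(self.current)

    def go_back(self):
        self.history.pop()
        self.current = self.history[-1]


def crawl(browser, graph, visited):
    """Depth-first crawl that records state -> {clickable: next state}."""
    state = browser.current
    visited.add(state)
    for clickable in browser.clickables():
        browser.click(clickable)
        new_state = browser.current
        if new_state != state:  # stands in for distance(...) > t
            graph.setdefault(state, {})[clickable] = new_state
            if new_state not in visited:
                crawl(browser, graph, visited)
        browser.go_back()  # return to `state` before trying the next clickable
    return graph


# Hypothetical site layout; only the state names come from the slides.
site = {
    "Index": {"a": "Money Base", "b": "Current Currency Ratio"},
    "Money Base": {"c": "Previous RR"},
    "Current Currency Ratio": {},
    "Previous RR": {"home": "Index"},
}
browser = FakeBrowser(site, "Index")
graph = crawl(browser, {}, set())
```

The visited set is what keeps the cycle from "Previous RR" back to "Index" from recursing forever; a real crawler would also need the reload-and-click-through fallback from the pseudocode when the browser history cannot go back.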

Evaluation of CRAWLJAX Features
• High number of Ajax server calls: hot nodes need to be identified and hot calls minimized.
• State explosion handling mechanism: configuration options such as similarity threshold, maximum crawling time and maximum depth level can be specified.
• Redundant clickables: for <span id="a"><div id="CreatingContent">Textual</div></span>, two clickables are generated rather than one. The span is the actual clickable, not the div.
• Mouse-over clickables ignored: certain tags like <image> have mouse-over clickables that are ignored.
• Dependence on back implementation: unique IDs are assigned to elements so that the crawler can navigate its way back to the original state.

SmartCrawl
• An approach to access the Web hidden behind forms.
• Generates (name, value) pairs for input elements such as text boxes.
• A form is represented as F = {U, (N1, V1), (N2, V2), ..., (Nn, Vn)}.
• Handles combo boxes, radio buttons, check boxes, text fields, etc.
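
The form representation F can be sketched as a small data class. This is an illustration under assumptions: U is read here as the form's action URL, each Ni as a field name and each Vi as that field's list of candidate values; the class name, the example URL and the field names are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Form:
    url: str                                    # U (assumed: the action URL)
    fields: dict = field(default_factory=dict)  # Ni -> list of candidate values Vi

    def queries(self):
        """Yield one {name: value} assignment per combination of values."""
        names = list(self.fields)
        for combo in product(*(self.fields[name] for name in names)):
            yield dict(zip(names, combo))


form = Form(
    "http://example.com/search",
    {"category": ["books", "music"], "in_stock": ["yes", "no"]},
)
all_queries = list(form.queries())
```

Enumerating every combination grows multiplicatively with the number of fields, which is one reason a form crawler has to bound or prioritize the queries it generates.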

Working of SmartCrawl
• Finding forms: pages containing forms are indexed.
• Query generation: queries are generated from the information collected from the forms.
• Visiting the results: label values are injected into the form and the query is submitted. The parameter values are sent as HTTP requests using the GET and POST methods.
• Searching for stored pages: after a user issues a search, the indexed web pages are consulted and the pages with a high probability of answering the user's search are returned.
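
The submission step can be sketched with the Python standard library. This only builds the request objects and does not send anything; the helper name `build_request`, the example URL and the parameter names are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(action_url, params, method="GET"):
    """Encode form parameters into a GET query string or a POST body."""
    encoded = urlencode(params)
    if method.upper() == "GET":
        separator = "&" if "?" in action_url else "?"
        return Request(action_url + separator + encoded, method="GET")
    return Request(
        action_url,
        data=encoded.encode("ascii"),  # urlencode output is plain ASCII
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


params = {"q": "hidden web", "page": "1"}
get_request = build_request("http://example.com/search", params)
post_request = build_request("http://example.com/search", params, method="POST")
```

Passing either object to `urllib.request.urlopen` would perform the actual submission; the result page could then be stored and indexed.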

SmartCrawl Architecture
[Figure: SmartCrawl architecture. Labels as extracted: Warehouse, Crawler, DownLoader, URL List, Document List, Form Result, Indexer, Form Parser, URL Queue, Form List, Form Inquirer, Lexicon, Query List, Document Seeker, Categories, Word Match, New Query.]

Label Extraction Algorithm
• The form extractor looks for nodes that represent forms (<form>, </form>).
• Each form is converted to a hierarchical table representation (e.g., the form contains a checkbox, the checkbox contains options).
• The first pass over the generated table checks what lies to the left of each field. If it is a label, it is associated with the field.
• The second pass looks for labels one cell above those fields that received no label in the first pass.
• For checkboxes, the labels of the option values are extracted from the right.
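
The two passes can be sketched over a simplified grid. This is a hedged illustration, not SmartCrawl's code: the table is modelled as rows of cells, each cell being `("label", text)`, `("field", name)` or `None`, and the checkbox rule (labels to the right of options) is left out to keep the example short.

```python
def extract_labels(table):
    """Map field name -> label text using a left-then-above lookup."""
    labels = {}
    # First pass: a label in the cell immediately to the left of a field.
    for row in table:
        for col, cell in enumerate(row):
            if cell and cell[0] == "field" and col > 0:
                left = row[col - 1]
                if left and left[0] == "label":
                    labels[cell[1]] = left[1]
    # Second pass: for fields still unlabelled, look one cell above.
    for r, row in enumerate(table):
        for col, cell in enumerate(row):
            if cell and cell[0] == "field" and cell[1] not in labels and r > 0:
                above_row = table[r - 1]
                above = above_row[col] if col < len(above_row) else None
                if above and above[0] == "label":
                    labels[cell[1]] = above[1]
    return labels


# Hypothetical form layout: one label beside its field, one label above.
table = [
    [("label", "Title"), ("field", "title")],
    [("label", "Author"), None],
    [("field", "author"), None],
]
labels = extract_labels(table)
```

Running the left pass first means a field with labels both beside and above it keeps the one on its left.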

Evaluation of SmartCrawl Features
• Incomplete extraction of labels: the label extraction algorithm only accesses elements inside the <form> </form> tags.
• Incomplete form extraction: the API used to extract the DOM tree (NekoHTML) has no support for malformed HTML.
• Architecture compatibility: its strategies are easy to implement in current search engines to gain performance and scalability.
• Quality of indexed data: the "result pages" are not analyzed.
• Slow searching and indexing: unsophisticated structures are used for index storage.

HiWE: Hidden Web Exposer
[Figure: HiWE architecture. Labels as extracted: LVS Table, URL List, WWW, Parser, LVS Manager, Crawl Manager, Form Analyzer, Form Processor, Response Analyzer, Data sources. Flow labels: Submission, Response, Feedback.]
• Addresses the issue of input dynamism.

Working of HiWE
• Form analysis: an internal representation is built in the form of (element, domain) pairs.
• Value assignment and submission: "best values" are generated for the input fields and the form is submitted. The values come from explicit initialization, built-in categories and crawling experience.
• Response analysis: the response page is analyzed to check its relevance.
• Response navigation: hypertext links in the response page are followed recursively up to a pre-specified depth threshold.
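
Form analysis and value assignment can be illustrated with a toy label-value table. Everything concrete here is an assumption: the table contents, the exact-match label normalization, and the rule that a `None` domain marks a free-text element while a list marks a finite one.

```python
from itertools import product

# Hypothetical label-value table: normalized label -> candidate values.
LVS = {"state": ["California", "Texas"], "company type": ["public", "private"]}


def normalize(label):
    """Lower-case a label and drop colons and extra whitespace."""
    return " ".join(label.lower().replace(":", " ").split())


def assign_values(form_elements, lvs=LVS):
    """Return every value assignment for a list of (label, domain) pairs.

    A finite element carries its own domain as a list; a free-text element
    has domain None and takes its values from the table. If any element has
    no candidate values, the form is skipped and [] is returned.
    """
    candidates = []
    for label, domain in form_elements:
        values = domain if domain is not None else lvs.get(normalize(label), [])
        if not values:
            return []
        candidates.append((label, values))
    names = [label for label, _ in candidates]
    return [dict(zip(names, combo)) for combo in product(*(v for _, v in candidates))]


form_elements = [("State:", None), ("Sort by", ["date", "name"])]
assignments = assign_values(form_elements)
```

Each resulting dictionary corresponds to one form submission; a real crawler would rank these assignments and submit only the most promising ones.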

Evaluation of HiWE Features
• Sophisticated label extraction algorithm: relies on visual adjacency and partial page layout.
• Response analysis: the result pages are analyzed to filter out erroneous pages.
• Efficient ranking function: fuzzy, average and probabilistic methods are used.
• Partial inputs in the form model ignored: certain forms are skipped, such as those with a low number of input elements or with no matching entries in the LVS table.
• Effective form filling using crawl history: the finite and infinite values in the LVS table are populated based on past values.
• Task-specific initialization: helps HiWE avoid topic drift.
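
The three ranking methods named above can be sketched as follows. The slides give only the names, so the formulas are an assumption about one common reading: each value in an assignment carries a weight in [0, 1], and the assignment's rank is the minimum (fuzzy), the mean (average), or one minus the product of the complements (probabilistic).

```python
from math import prod


def rank_fuzzy(weights):
    """An assignment is only as good as its weakest value."""
    return min(weights)


def rank_average(weights):
    """Every value contributes equally to the rank."""
    return sum(weights) / len(weights)


def rank_probabilistic(weights):
    """Treat weights as independent probabilities that a value is useful."""
    return 1 - prod(1 - w for w in weights)


weights = [0.5, 0.25]  # hypothetical weights for a two-field assignment
ranks = (rank_fuzzy(weights), rank_average(weights), rank_probabilistic(weights))
```

The fuzzy rank is the most conservative of the three and the probabilistic rank the most optimistic, so the choice affects how many assignments pass a minimum-rank cutoff.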

Comparisons
• Kinds of applications crawled
  • CRAWLJAX: automatically clicks.
  • SmartCrawl, HiWE: automatically fill in forms.
• Topic drift
  • CRAWLJAX blindly clicks the clickable elements.
  • HiWE follows a task-specific approach.
  • SmartCrawl initially fills in default values for the fields, then fills in other combinations of values.
• Different label extraction algorithms
  • HiWE: visual adjacency.
  • SmartCrawl: generation of the DOM tree (hierarchical table structure).

Comparisons (contd.)
• Performance: relevant pages in the result set.
  • HiWE: clever ranking function, crawler input, response analysis.
  • CRAWLJAX: depends on the kind of clickables discovered.
  • SmartCrawl: low-performance data structures, naïve label extraction algorithm, no analysis of response pages, etc.
• Speed of execution
  • HiWE and SmartCrawl execute faster than CRAWLJAX. The presence of hot calls makes CRAWLJAX slower.
• Maintenance of IDs
  • An overhead of CRAWLJAX, required to implement the back functionality. There is no such requirement for HiWE and SmartCrawl.
