Guest Lecture Large-Scale Deep Web Integration:Exploring and Querying Structured Data on the Deep Web Zhen Zhang
What you will learn in this lecture • What is deep Web? • Why information integration on deep Web? • What are integration paradigms? • What are technical challenges? • What are proposed solutions?
The Web becomes increasingly dynamic! Static surface Web → Dynamic deep Web
The Deep Web: Databases on the Web How to enable effective access to the deep Web?
Survey the frontier: BrightPlanet.com, March 2000 [Bergman00] • Overlap analysis of search engines. • Estimated 43,000 – 96,000 deep Web sites. • Content size estimated at 500 times that of the surface Web.
Survey the frontier UIUC MetaQuerier, April 2004 [ChangHL+04] • Macro: Deep Web at large • Data: Automatically-sampled 1 million IPs • Micro: per-source specific characteristics • Data: Manually-collected sources • 8 representative domains, 494 sources Airfare (53), Autos (102), Books (69), CarRentals (24) Hotels (38), Jobs (55), Movies (78), MusicRecords (75) • Available at http://metaquerier.cs.uiuc.edu/repository
We want to observe • How many deep-Web sources are out there? • 307,000 sites, 450,000 DBs, 1,258,000 interfaces. • How many are structured databases? • 348,000 (structured) : 102,000 (text) ≈ 3 : 1 • How well do search engines cover them? • Google covered 5% of objects with fresh copies and 21% with stale copies. • InvisibleWeb.com covered 7.8% of sources. • How hidden are they? • CarRental (0%) > Airfares (~4%) > … > MusicRec > Books > Movies (80%+)
System: Example Applications
Vertical Search Engines—"Warehousing" approach e.g., Libra Academic Search [NieZW+05] (courtesy MSRA) [Figure: many Web databases plus journal, author, and conference homepages (PDF/DOC/PS) feeding one warehouse] • Integrating information from multiple types of sources • Ranking papers, conferences, and authors for a given query • Handling structured queries
On-the-fly Meta-querying Systems, e.g., WISE [HeMYW03], MetaQuerier [ChangHZ05] MetaQuerier@UIUC: FIND sources (Amazon.com, Cars.com, Apartments.com, 411localte.com, …) in a "db of dbs," then QUERY sources through a unified query interface
What needs to be done? Technical Challenges: • Source Modeling & Selection • Schema Matching • Source Querying, Crawling, and Obj Ranking • Data Extraction
Technical Challenges • 1. Source Modeling & Selection • How to describe a source and find right sources for query answering?
Source Modeling & Selection: for Large Scale Integration • Focus: Discovery of sources. • Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05]. • Focus: Extraction of source models. • Hidden grammar-based parsing [ZhangHC04]. • Proximity-based extraction [HeMY+04]. • Classification to align with a given taxonomy [HessK03, Kushmerick03]. • Focus: Organization of sources and query routing. • Offline clustering [HeTC04, PengMH+04]. • Online search for query routing [KabraLC05].
Form Extraction: the Problem • Output all the query conditions; for each: • Group form elements (into query conditions) • Tag elements with their "semantic roles": attribute, operator, value
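The grouping-and-tagging output above can be sketched as a small data structure: each extracted condition bundles role-tagged form elements into an (attribute, operator, value) triple. This is a minimal illustration, not the papers' actual representation; the grouping rule (a new attribute tag starts a new condition) is a simplifying assumption.

```python
from dataclasses import dataclass

# Hypothetical representation of one extracted query condition.
@dataclass
class QueryCondition:
    attribute: str   # e.g., the label "Title"
    operator: str    # e.g., "contains"
    value: str       # the input element that will hold the user's value

def extract_conditions(tagged_elements):
    """Group role-tagged form elements into conditions.

    `tagged_elements` is a flat list of (role, text) pairs in page order;
    each 'attribute' tag starts a new condition (a simplifying assumption).
    """
    conditions, current = [], None
    for role, text in tagged_elements:
        if role == "attribute":
            current = QueryCondition(text, "", "")
            conditions.append(current)
        elif current and role == "operator":
            current.operator = text
        elif current and role == "value":
            current.value = text
    return conditions

elems = [("attribute", "Title"), ("operator", "contains"), ("value", "<input name=title>"),
         ("attribute", "Price"), ("operator", "<="), ("value", "<input name=max_price>")]
print([c.attribute for c in extract_conditions(elems)])  # → ['Title', 'Price']
```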
Form Extraction: Parsing Approach [ZhangHC04] • Observation: Interfaces share "patterns" of presentation. • Hypothesis: A hidden syntactic model, an interface creation grammar, exists. • Now, the problem: Given a query interface, how to find its query capabilities?
Best-Effort Visual Language Parsing Framework • Input: HTML query form • Pipeline: Tokenizer → Layout Engine → BE-Parser, driven by a 2P grammar (Productions + Preferences), with Ambiguity Resolution and Error Handling • Output: semantic structure
Form Extraction: Clustering Approach [HessK03, Kushmerick03] Concept: A form as a Bayesian network. • Training: Estimate the Bayesian probabilities. • Classification: Max-likelihood predictions given terms.
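The training/classification split above can be sketched with a plain naive-Bayes model: estimate priors and term counts from labeled fields, then pick the maximum-likelihood concept for a new field's terms. This is a generic sketch of the idea, not the cited papers' exact Bayesian-network structure; the Books-domain training data and concept names are invented.

```python
import math
from collections import defaultdict

def train(labeled):
    """labeled: list of (concept, [terms]) pairs -> priors, counts, vocabulary."""
    prior = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for concept, terms in labeled:
        prior[concept] += 1
        for t in terms:
            counts[concept][t] += 1
            vocab.add(t)
    return prior, counts, vocab

def classify(terms, prior, counts, vocab):
    """Return the maximum-likelihood concept given a field's terms."""
    total = sum(prior.values())
    def log_p(c):
        denom = sum(counts[c].values()) + len(vocab)  # Laplace smoothing
        return (math.log(prior[c] / total)
                + sum(math.log((counts[c][t] + 1) / denom) for t in terms))
    return max(prior, key=log_p)

data = [("author", ["author", "name", "last"]),
        ("title", ["title", "book", "keywords"]),
        ("author", ["first", "name", "author"])]
prior, counts, vocab = train(data)
print(classify(["author", "name"], prior, counts, vocab))  # → author
```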
Technical Challenges • 2. Schema Matching • How to match the schematic structures between sources?
Schema Matching: for Large Scale Integration [Figure example: one interface's "Category: Business | Fiction | Computer" matches another's "Type: Business | Fiction | Computer", so Subject ≈ Category] • Focus: Matching a large number of interface schemas, often in a holistic way. • Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05]. • Query probing [WangWL+04]. • Clustering [HeMY+03, WuYD+04]. • Corpus-assisted [MadhavanBD+05]; Web-assisted [WuDY06]. • Focus: Constructing unified interfaces. • As a global generative model [HeC03]. • Cluster-merge-select [HeMY+03].
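The correlation-mining idea behind holistic matching can be sketched as follows: across many schemas in one domain, synonymous attributes tend to be negatively correlated (a site uses one or the other, rarely both). The toy Books-domain schemas and the zero-co-occurrence cutoff below are invented for illustration, not the papers' actual measure.

```python
from itertools import combinations

# Invented example schemas from one domain (Books).
schemas = [
    {"author", "title", "subject", "isbn"},
    {"author", "title", "category"},
    {"name", "title", "keyword", "category"},
    {"author", "isbn", "subject"},
    {"name", "title", "subject"},
]

def cooccurrence(a, b):
    """Fraction of schemas containing both a and b, among those with either."""
    both = sum(1 for s in schemas if a in s and b in s)
    either = sum(1 for s in schemas if a in s or b in s)
    return both / either if either else 0.0

attrs = sorted(set().union(*schemas))
# Reasonably frequent attribute pairs that never co-occur: synonym candidates.
candidates = [(a, b) for a, b in combinations(attrs, 2)
              if cooccurrence(a, b) == 0.0
              and sum(a in s for s in schemas) >= 2
              and sum(b in s for s in schemas) >= 2]
print(candidates)
```

On this toy input the candidates include (author, name) and (category, subject), which are exactly the synonym pairs the example interfaces suggest.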
WISE-Integrator: Cluster-Merge-Represent [HeMY+03] • Matching attributes: • Synonymous labels: WordNet, string similarity • Compatible value domains (enumerated values or type) • Constructing the integrated interface: • form = initially empty • until all attributes are covered: • take one attribute • find the clusters it belongs to • select a representative and merge the values • add the representative to the interface if not already there
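The cluster-merge-represent loop above can be sketched in a few lines: cluster attributes across interfaces by label similarity, merge their value domains, and pick the most common label as the cluster's representative. The similarity measure, the 0.6 threshold, and the sample interfaces are illustrative assumptions, not WISE-Integrator's actual matching rules (which also use WordNet and type compatibility).

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.6):
    """Crude label similarity stand-in for the paper's matching rules."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def integrate(sources):
    """sources: list of {attribute_label: [enumerated values]} dicts."""
    clusters = []  # each cluster: {"labels": [...], "values": set(...)}
    for schema in sources:
        for label, values in schema.items():
            for c in clusters:
                if any(similar(label, l) for l in c["labels"]):
                    c["labels"].append(label)
                    c["values"].update(values)
                    break
            else:
                clusters.append({"labels": [label], "values": set(values)})
    # Representative: the most common label in each cluster; values merged.
    return {max(set(c["labels"]), key=c["labels"].count): sorted(c["values"])
            for c in clusters}

sources = [{"Author": [], "Subject": ["Business", "Fiction"]},
           {"Author": [], "Subject Category": ["Computer"]},
           {"Authors": [], "Subject": ["Fiction"]}]
print(integrate(sources))
```

Here "Author"/"Authors" and "Subject"/"Subject Category" each collapse into one attribute on the unified interface, with the enumerated values merged.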
Statistical Schema Matching: MGS [HeC03, HeCH04, HeC05] [Figure: a hidden generative model M, with concepts C1, C2, C3 generating attributes α, β, γ, η, δ under model parameters, produces the observed schemas] • Observation: Schemas share "tendencies" of attribute usage. • Hypothesis: A hidden statistical (generative) model exists. • Now, the problem: Given the generated schemas, how to find M, and thus the attribute matchings?
Technical Challenges • 3. Source Querying, Crawling & Search • How to query a source? How to crawl all objects and to search them?
Source Querying: for Large Scale Integration • Meta-querying model: • Focus: On-the-fly querying. • MetaQuerier Query Assistant [ZhangHC05]. • Vertical-search-engine model: • Focus: Source crawling to collect objects. • Form submission by query generation/selection, e.g., [RaghavanG01, WuWLM06]. • Focus: Object search and ranking [NieZW+05].
On-the-fly Querying: [ZhangHC05] Type-locality-based Predicate Translation [Figure: a source predicate P and a target template go through a Predicate Mapper and Type Recognizer, which dispatch to domain-specific, text, numeric, and datetime handlers to produce the target predicate t*] • Correspondences occur within type localities • Translation by type-specific handlers
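The handler dispatch above can be sketched as a type recognizer routing a predicate to a type-specific translator. The handler names, the range-template shape, and the example predicates are illustrative assumptions, not the paper's actual interfaces.

```python
def numeric_handler(pred, template):
    """Map e.g. 'price <= 25000' onto a target range template [lo, hi]."""
    lo, hi = template["range"]
    return {"attr": pred["attr"], "op": "between",
            "value": (lo, min(pred["value"], hi))}

def text_handler(pred, template):
    """Map e.g. 'title contains X' onto a target keyword box."""
    return {"attr": pred["attr"], "op": "keyword", "value": pred["value"]}

HANDLERS = {"numeric": numeric_handler, "text": text_handler}

def recognize_type(pred):
    """Crude stand-in for the paper's type recognizer."""
    return "numeric" if isinstance(pred["value"], (int, float)) else "text"

def translate(pred, template):
    return HANDLERS[recognize_type(pred)](pred, template)

src = {"attr": "price", "op": "<=", "value": 25000}
tgt = {"range": (0, 50000)}
print(translate(src, tgt))  # → {'attr': 'price', 'op': 'between', 'value': (0, 25000)}
```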
Source Crawling by Query Selection [WuWL+06] [Figure: example attribute graph with nodes such as System, Compiler, Theory, Application, Automata, Data Mining, Ullman, Han] • Conceptually, model the DB as a graph: • Nodes: attribute values • Edges: occurrence relationships • Crawling is then a graph-traversal (set-cover-style) problem: find a set of nodes N in graph G such that every node i in G is reachable from some node j in N (j → i), and the total cost of the nodes in N is minimum.
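Since this minimum-cost cover formulation is set-cover-like, a natural heuristic is a cost-effectiveness greedy: repeatedly issue the query (node) that covers the most still-uncovered nodes per unit cost. This is a generic sketch under that assumption, not the paper's actual algorithm; the graph and unit costs are invented.

```python
def reachable(graph, start):
    """All nodes reachable from `start` by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return seen

def select_queries(graph, cost):
    """Greedy cover: pick nodes maximizing newly covered nodes per unit cost."""
    uncovered = set(graph) | {v for vs in graph.values() for v in vs}
    chosen = []
    while uncovered:
        best = max(uncovered,
                   key=lambda n: len(reachable(graph, n) & uncovered) / cost[n])
        chosen.append(best)
        uncovered -= reachable(graph, best)
    return chosen

graph = {"database": ["data mining", "systems"],
         "data mining": ["Han"],
         "systems": ["compiler"],
         "Ullman": ["automata"]}
cost = {n: 1 for n in ["database", "data mining", "systems", "Han",
                       "compiler", "Ullman", "automata"]}
print(select_queries(graph, cost))  # → ['database', 'Ullman']
```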
Object Ranking: the Object Relationship Graph [NieZW+05] • A popularity propagation factor for each type of relationship link • An object's popularity is also affected by the popularity of the Web pages containing it
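The two bullets above can be sketched as a PageRank-style iteration in which each relationship type carries its own propagation factor and each object receives a baseline from its containing pages. The tiny object graph, the factor values, and the damping constant are invented for illustration; this is not the paper's exact PopRank formulation.

```python
def poprank(objects, links, factors, web_pop, damping=0.85, iters=50):
    """links: (src, dst, rel_type) triples; web_pop: popularity from Web pages."""
    rank = {o: web_pop[o] for o in objects}
    # Per-source total outgoing factor mass, used to normalize propagation.
    out_weight = {o: sum(factors[t] for s, d, t in links if s == o) for o in objects}
    for _ in range(iters):
        rank = {o: (1 - damping) * web_pop[o]
                   + damping * sum(rank[s] * factors[t] / out_weight[s]
                                   for s, d, t in links
                                   if d == o and out_weight[s] > 0)
                for o in objects}
    return rank

objects = ["paper_a", "paper_b", "author_x", "conf_y"]
links = [("paper_a", "paper_b", "cites"),
         ("author_x", "paper_a", "writes"),
         ("paper_a", "conf_y", "published_in")]
factors = {"cites": 0.3, "writes": 0.4, "published_in": 0.2}
web_pop = {"paper_a": 0.4, "paper_b": 0.2, "author_x": 0.3, "conf_y": 0.1}
ranks = poprank(objects, links, factors, web_pop)
print(max(ranks, key=ranks.get))  # → paper_a
```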
Technical Challenges • 4. Data Extraction • How to extract result pages into relations?
Data Extraction: Circa 2000. The need for rapid wrapper construction was well recognized. • Focus: Semi-automatic wrapper construction. • Techniques: • Wrapper-mediator architecture [Wiederhold92]: a mediator queries per-source wrappers. • Manual construction. • Semi-automatic, learning-based: HLRT [KushmerickWD97], Stalker [MusleaMK99], Softmealy [HsuD98].
Data Extraction: for Large Scale. Even more automatic approaches. • Focus: Reducing per-source human effort in wrapper construction. • Techniques: • Semi-automatic, learning-based: [ZhaoMWRY05], [IRMKS06]. • Automatic, syntax-based: RoadRunner [MeccaCM01], ExAlg [ArasuG03], DEPTA [LiuGZ03, ZhaiL05].
HLRT Wrapper: the first "Wrapper Induction" [KushmerickWD97] A manual wrapper: ExtractCCs(page P): skip past the first occurrence of <P> in P; while the next <B> is before the next <HR> in P: for each <l_k, r_k> in {<<B>, </B>>, <<I>, </I>>}: skip past the next occurrence of l_k in P; extract an attribute from P up to the next occurrence of r_k; return the extracted tuples. A generalized wrapper, parameterized by delimiter rules <h, t, l1, r1, ..., lK, rK>: ExecuteHLRT(<h, t, l1, r1, ..., lK, rK>, page P): skip past the first occurrence of h in P; while the next l1 is before the next t in P: for each <l_k, r_k> in {<l1, r1>, ..., <lK, rK>}: skip past the next occurrence of l_k in P; extract an attribute from P up to the next occurrence of r_k; return the extracted tuples. An induction algorithm learns the delimiters (h, t, and each <l_k, r_k>) from labeled data.
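The generalized ExecuteHLRT pseudocode above translates almost line for line into runnable code. The sample page below is invented; the delimiters h, t, and the per-attribute (l, r) pairs play the roles described on the slide.

```python
def execute_hlrt(h, t, pairs, page):
    """Execute an HLRT wrapper: head h, tail t, and (left, right) delimiter
    pairs, one per attribute, following the slide's pseudocode."""
    tuples = []
    pos = page.index(h) + len(h)           # skip past the head delimiter
    while True:
        next_l1 = page.find(pairs[0][0], pos)
        next_t = page.find(t, pos)
        if next_l1 == -1 or (next_t != -1 and next_t < next_l1):
            break                           # tail comes first: no more tuples
        row = []
        for l, r in pairs:                  # one (l, r) pair per attribute
            start = page.index(l, pos) + len(l)
            end = page.index(r, start)
            row.append(page[start:end])     # extract up to the right delimiter
            pos = end + len(r)
        tuples.append(tuple(row))
    return tuples

page = ("<P>Listing<B>Congo</B><I>242</I>"
        "<B>Egypt</B><I>20</I><HR>footer")
print(execute_hlrt("<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")], page))
# → [('Congo', '242'), ('Egypt', '20')]
```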
RoadRunner [MeccaCM01] • Basic idea: • Page generation: filling (encoding) data into a template • Data extraction: the reverse, decoding data out of the template • Algorithm: • Compare two HTML pages at a time: one as the (evolving) wrapper, the other as a sample • Resolve the mismatches: • string mismatch: a content (data) slot • tag mismatch: structural variance
RoadRunner: the inferred template [Figure: two sample pages aligned against the template they were generated from]
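RoadRunner's core move, aligning two pages and turning string mismatches into data slots, can be sketched as below. This toy handles only the string-mismatch case from the slide; real RoadRunner also generalizes over tag mismatches (optionals and repetitions). The sample pages are invented.

```python
import re

def tokenize(html):
    """Split HTML into tag tokens and text tokens."""
    return re.findall(r"<[^>]+>|[^<]+", html)

def infer_template(page1, page2):
    """Align two pages token by token; string mismatches become data slots."""
    template = []
    for a, b in zip(tokenize(page1), tokenize(page2)):
        if a == b:
            template.append(a)                 # shared template text
        elif not a.startswith("<") and not b.startswith("<"):
            template.append("#DATA")           # string mismatch → content slot
        else:
            raise ValueError("tag mismatch: needs structural generalization")
    return "".join(template)

p1 = "<html><b>Title:</b>Databases<i>by</i>Ullman</html>"
p2 = "<html><b>Title:</b>Compilers<i>by</i>Aho</html>"
print(infer_template(p1, p2))
# → <html><b>Title:</b>#DATA<i>by</i>#DATA</html>
```

Running the inferred template backwards over a new page would then extract the #DATA slots as field values, which is exactly the "decoding" view of extraction on the previous slide.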
Finally, observations Large scale is not only a challenge, but also an opportunity!
Thank You! For more information: http://metaquerier.cs.uiuc.edu