Guest Lecture Large-Scale Deep Web Integration:Exploring and Querying Structured Data on the Deep Web Zhen Zhang
What you will learn in this lecture • What is deep Web? • Why information integration on deep Web? • What are integration paradigms? • What are technical challenges? • What are proposed solutions?
The Web becomes increasingly dynamic! Static surface Web → Dynamic deep Web
The Deep Web: Databases on the Web How to enable effective access to the deep Web?
Survey the frontier: BrightPlanet.com, March 2000 [Bergman00] • Overlap analysis of search engines. • Estimated 43,000 – 96,000 deep Web sites. • Content size estimated at 500 times that of the surface Web.
Survey the frontier UIUC MetaQuerier, April 2004 [ChangHL+04] • Macro: Deep Web at large • Data: Automatically-sampled 1 million IPs • Micro: per-source specific characteristics • Data: Manually-collected sources • 8 representative domains, 494 sources Airfare (53), Autos (102), Books (69), CarRentals (24) Hotels (38), Jobs (55), Movies (78), MusicRecords (75) • Available at http://metaquerier.cs.uiuc.edu/repository
We want to observe • How many deep-Web sources are out there? • 307,000 sites, 450,000 DBs, 1,258,000 interfaces. • How many are structured databases? • 348,000 (structured) : 102,000 (text) ≈ 3 : 1 • How well do search engines cover them? • Google covered 5% of objects with fresh copies and 21% with stale copies. • InvisibleWeb.com covered 7.8% of sources. • How hidden are they? • CarRental (0%) > Airfares (~4%) > … > MusicRec > Books > Movies (80%+)
System: Example Applications
Vertical Search Engines—"Warehousing" approach e.g., Libra Academic Search [NieZW+05] (courtesy MSRA) [Figure: many Web databases plus journal, author, and conference homepages (PDF/DOC/PS) feeding one warehouse] • Integrating information from multiple types of sources • Ranking papers, conferences, and authors for a given query • Handling structured queries
On-the-fly Meta-querying Systems, e.g., WISE [HeMYW03], MetaQuerier [ChangHZ05] MetaQuerier@UIUC: FIND sources (Amazon.com, Cars.com, Apartments.com, 411localte.com, …) in a "db of dbs," then QUERY sources through a unified query interface
What needs to be done? Technical Challenges: • Source Modeling & Selection • Schema Matching • Source Querying, Crawling, and Obj Ranking • Data Extraction
Technical Challenges • 1. Source Modeling & Selection • How to describe a source and find right sources for query answering?
Source Modeling & Selection: for Large Scale Integration • Focus: Discovery of sources. • Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05]. • Focus: Extraction of source models. • Hidden grammar-based parsing [ZhangHC04]. • Proximity-based extraction [HeMY+04]. • Classification to align with a given taxonomy [HessK03, Kushmerick03]. • Focus: Organization of sources and query routing. • Offline clustering [HeTC04, PengMH+04]. • Online search for query routing [KabraLC05].
Form Extraction: the Problem • Output all the query conditions; for each: • Group form elements (into query conditions) • Tag elements with their "semantic roles": attribute, operator, value
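The grouping-and-tagging output above can be sketched as a small data structure: each extracted condition bundles role-tagged form elements into an (attribute, operator, value) triple. This is a minimal illustration, not the papers' actual representation; the grouping rule (a new attribute tag starts a new condition) is a simplifying assumption.

```python
from dataclasses import dataclass

# Hypothetical representation of one extracted query condition.
@dataclass
class QueryCondition:
    attribute: str   # e.g., the label "Title"
    operator: str    # e.g., "contains"
    value: str       # the input element that will hold the user's value

def extract_conditions(tagged_elements):
    """Group role-tagged form elements into conditions.

    `tagged_elements` is a flat list of (role, text) pairs in page order;
    each 'attribute' tag starts a new condition (a simplifying assumption).
    """
    conditions, current = [], None
    for role, text in tagged_elements:
        if role == "attribute":
            current = QueryCondition(text, "", "")
            conditions.append(current)
        elif current and role == "operator":
            current.operator = text
        elif current and role == "value":
            current.value = text
    return conditions

elems = [("attribute", "Title"), ("operator", "contains"), ("value", "<input name=title>"),
         ("attribute", "Price"), ("operator", "<="), ("value", "<input name=max_price>")]
print([c.attribute for c in extract_conditions(elems)])  # → ['Title', 'Price']
```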
Form Extraction: Parsing Approach [ZhangHC04] • Observation: Interfaces share "patterns" of presentation. • Hypothesis: A hidden syntactic model, an interface creation grammar, exists. • Now, the problem: Given a query interface, how to find its query capabilities?
Best-Effort Visual Language Parsing Framework • Input: HTML query form • Pipeline: Tokenizer → Layout Engine → BE-Parser, driven by a 2P grammar (Productions + Preferences), with Ambiguity Resolution and Error Handling • Output: semantic structure
Form Extraction: Clustering Approach [HessK03, Kushmerick03] Concept: A form as a Bayesian network. • Training: Estimate the Bayesian probabilities. • Classification: Max-likelihood predictions given terms.
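The training/classification split above can be sketched with a plain naive-Bayes model: estimate priors and term counts from labeled fields, then pick the maximum-likelihood concept for a new field's terms. This is a generic sketch of the idea, not the cited papers' exact Bayesian-network structure; the Books-domain training data and concept names are invented.

```python
import math
from collections import defaultdict

def train(labeled):
    """labeled: list of (concept, [terms]) pairs -> priors, counts, vocabulary."""
    prior = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for concept, terms in labeled:
        prior[concept] += 1
        for t in terms:
            counts[concept][t] += 1
            vocab.add(t)
    return prior, counts, vocab

def classify(terms, prior, counts, vocab):
    """Return the maximum-likelihood concept given a field's terms."""
    total = sum(prior.values())
    def log_p(c):
        denom = sum(counts[c].values()) + len(vocab)  # Laplace smoothing
        return (math.log(prior[c] / total)
                + sum(math.log((counts[c][t] + 1) / denom) for t in terms))
    return max(prior, key=log_p)

data = [("author", ["author", "name", "last"]),
        ("title", ["title", "book", "keywords"]),
        ("author", ["first", "name", "author"])]
prior, counts, vocab = train(data)
print(classify(["author", "name"], prior, counts, vocab))  # → author
```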
Technical Challenges • 2. Schema Matching • How to match the schematic structures between sources?
Schema Matching: for Large Scale Integration [Figure example: one interface's "Category: Business | Fiction | Computer" matches another's "Type: Business | Fiction | Computer", so Subject ≈ Category] • Focus: Matching a large number of interface schemas, often in a holistic way. • Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05]. • Query probing [WangWL+04]. • Clustering [HeMY+03, WuYD+04]. • Corpus-assisted [MadhavanBD+05]; Web-assisted [WuDY06]. • Focus: Constructing unified interfaces. • As a global generative model [HeC03]. • Cluster-merge-select [HeMY+03].
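The correlation-mining idea behind holistic matching can be sketched as follows: across many schemas in one domain, synonymous attributes tend to be negatively correlated (a site uses one or the other, rarely both). The toy Books-domain schemas and the zero-co-occurrence cutoff below are invented for illustration, not the papers' actual measure.

```python
from itertools import combinations

# Invented example schemas from one domain (Books).
schemas = [
    {"author", "title", "subject", "isbn"},
    {"author", "title", "category"},
    {"name", "title", "keyword", "category"},
    {"author", "isbn", "subject"},
    {"name", "title", "subject"},
]

def cooccurrence(a, b):
    """Fraction of schemas containing both a and b, among those with either."""
    both = sum(1 for s in schemas if a in s and b in s)
    either = sum(1 for s in schemas if a in s or b in s)
    return both / either if either else 0.0

attrs = sorted(set().union(*schemas))
# Reasonably frequent attribute pairs that never co-occur: synonym candidates.
candidates = [(a, b) for a, b in combinations(attrs, 2)
              if cooccurrence(a, b) == 0.0
              and sum(a in s for s in schemas) >= 2
              and sum(b in s for s in schemas) >= 2]
print(candidates)
```

On this toy input the candidates include (author, name) and (category, subject), which are exactly the synonym pairs the example interfaces suggest.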
WISE-Integrator: Cluster-Merge-Represent [HeMY+03] • Matching attributes: • Synonymous labels: WordNet, string similarity • Compatible value domains (enumerated values or type) • Constructing the integrated interface: • form = initially empty • until all attributes are covered: • take one attribute • find the clusters it belongs to • select a representative and merge the values • add the representative to the interface if not already there
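The cluster-merge-represent loop above can be sketched in a few lines: cluster attributes across interfaces by label similarity, merge their value domains, and pick the most common label as the cluster's representative. The similarity measure, the 0.6 threshold, and the sample interfaces are illustrative assumptions, not WISE-Integrator's actual matching rules (which also use WordNet and type compatibility).

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.6):
    """Crude label similarity stand-in for the paper's matching rules."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def integrate(sources):
    """sources: list of {attribute_label: [enumerated values]} dicts."""
    clusters = []  # each cluster: {"labels": [...], "values": set(...)}
    for schema in sources:
        for label, values in schema.items():
            for c in clusters:
                if any(similar(label, l) for l in c["labels"]):
                    c["labels"].append(label)
                    c["values"].update(values)
                    break
            else:
                clusters.append({"labels": [label], "values": set(values)})
    # Representative: the most common label in each cluster; values merged.
    return {max(set(c["labels"]), key=c["labels"].count): sorted(c["values"])
            for c in clusters}

sources = [{"Author": [], "Subject": ["Business", "Fiction"]},
           {"Author": [], "Subject Category": ["Computer"]},
           {"Authors": [], "Subject": ["Fiction"]}]
print(integrate(sources))
```

Here "Author"/"Authors" and "Subject"/"Subject Category" each collapse into one attribute on the unified interface, with the enumerated values merged.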
Statistical Schema Matching: MGS [HeC03, HeCH04, HeC05] [Figure: a hidden generative model M, with concepts C1, C2, C3 generating attributes α, β, γ, η, δ under model parameters, produces the observed schemas] • Observation: Schemas share "tendencies" of attribute usage. • Hypothesis: A hidden statistical (generative) model exists. • Now, the problem: Given the generated schemas, how to find M, and thus the attribute matchings?
Technical Challenges • 3. Source Querying, Crawling & Search • How to query a source? How to crawl all objects and to search them?
Source Querying: for Large Scale Integration • Meta-querying model: • Focus: On-the-fly querying. • MetaQuerier Query Assistant [ZhangHC05]. • Vertical-search-engine model: • Focus: Source crawling to collect objects. • Form submission by query generation/selection, e.g., [RaghavanG01, WuWLM06]. • Focus: Object search and ranking [NieZW+05].
On-the-fly Querying: [ZhangHC05] Type-locality-based Predicate Translation [Figure: a source predicate P and a target template go through a Predicate Mapper and Type Recognizer, which dispatch to domain-specific, text, numeric, and datetime handlers to produce the target predicate t*] • Correspondences occur within type localities • Translation by type-specific handlers
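The handler dispatch above can be sketched as a type recognizer routing a predicate to a type-specific translator. The handler names, the range-template shape, and the example predicates are illustrative assumptions, not the paper's actual interfaces.

```python
def numeric_handler(pred, template):
    """Map e.g. 'price <= 25000' onto a target range template [lo, hi]."""
    lo, hi = template["range"]
    return {"attr": pred["attr"], "op": "between",
            "value": (lo, min(pred["value"], hi))}

def text_handler(pred, template):
    """Map e.g. 'title contains X' onto a target keyword box."""
    return {"attr": pred["attr"], "op": "keyword", "value": pred["value"]}

HANDLERS = {"numeric": numeric_handler, "text": text_handler}

def recognize_type(pred):
    """Crude stand-in for the paper's type recognizer."""
    return "numeric" if isinstance(pred["value"], (int, float)) else "text"

def translate(pred, template):
    return HANDLERS[recognize_type(pred)](pred, template)

src = {"attr": "price", "op": "<=", "value": 25000}
tgt = {"range": (0, 50000)}
print(translate(src, tgt))  # → {'attr': 'price', 'op': 'between', 'value': (0, 25000)}
```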
Source Crawling by Query Selection [WuWL+06] [Figure: example attribute graph with nodes such as System, Compiler, Theory, Application, Automata, Data Mining, Ullman, Han] • Conceptually, model the DB as a graph: • Nodes: attribute values • Edges: occurrence relationships • Crawling is then a graph-traversal (set-cover-style) problem: find a set of nodes N in graph G such that every node i in G is reachable from some node j in N (j → i), and the total cost of the nodes in N is minimum.
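Since this minimum-cost cover formulation is set-cover-like, a natural heuristic is a cost-effectiveness greedy: repeatedly issue the query (node) that covers the most still-uncovered nodes per unit cost. This is a generic sketch under that assumption, not the paper's actual algorithm; the graph and unit costs are invented.

```python
def reachable(graph, start):
    """All nodes reachable from `start` by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return seen

def select_queries(graph, cost):
    """Greedy cover: pick nodes maximizing newly covered nodes per unit cost."""
    uncovered = set(graph) | {v for vs in graph.values() for v in vs}
    chosen = []
    while uncovered:
        best = max(uncovered,
                   key=lambda n: len(reachable(graph, n) & uncovered) / cost[n])
        chosen.append(best)
        uncovered -= reachable(graph, best)
    return chosen

graph = {"database": ["data mining", "systems"],
         "data mining": ["Han"],
         "systems": ["compiler"],
         "Ullman": ["automata"]}
cost = {n: 1 for n in ["database", "data mining", "systems", "Han",
                       "compiler", "Ullman", "automata"]}
print(select_queries(graph, cost))  # → ['database', 'Ullman']
```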
Object Ranking: the Object Relationship Graph [NieZW+05] • A popularity propagation factor for each type of relationship link • An object's popularity is also affected by the popularity of the Web pages containing it
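The two bullets above can be sketched as a PageRank-style iteration in which each relationship type carries its own propagation factor and each object receives a baseline from its containing pages. The tiny object graph, the factor values, and the damping constant are invented for illustration; this is not the paper's exact PopRank formulation.

```python
def poprank(objects, links, factors, web_pop, damping=0.85, iters=50):
    """links: (src, dst, rel_type) triples; web_pop: popularity from Web pages."""
    rank = {o: web_pop[o] for o in objects}
    # Per-source total outgoing factor mass, used to normalize propagation.
    out_weight = {o: sum(factors[t] for s, d, t in links if s == o) for o in objects}
    for _ in range(iters):
        rank = {o: (1 - damping) * web_pop[o]
                   + damping * sum(rank[s] * factors[t] / out_weight[s]
                                   for s, d, t in links
                                   if d == o and out_weight[s] > 0)
                for o in objects}
    return rank

objects = ["paper_a", "paper_b", "author_x", "conf_y"]
links = [("paper_a", "paper_b", "cites"),
         ("author_x", "paper_a", "writes"),
         ("paper_a", "conf_y", "published_in")]
factors = {"cites": 0.3, "writes": 0.4, "published_in": 0.2}
web_pop = {"paper_a": 0.4, "paper_b": 0.2, "author_x": 0.3, "conf_y": 0.1}
ranks = poprank(objects, links, factors, web_pop)
print(max(ranks, key=ranks.get))  # → paper_a
```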
Technical Challenges • 4. Data Extraction • How to extract result pages into relations?
Data Extraction: Circa 2000. The need for rapid wrapper construction was well recognized. • Focus: Semi-automatic wrapper construction. • Techniques: • Wrapper-mediator architecture [Wiederhold92]: a mediator queries per-source wrappers. • Manual construction. • Semi-automatic, learning-based: HLRT [KushmerickWD97], Stalker [MusleaMK99], Softmealy [HsuD98].
Data Extraction: for Large Scale. Even more automatic approaches. • Focus: Reducing per-source human effort in wrapper construction. • Techniques: • Semi-automatic, learning-based: [ZhaoMWRY05], [IRMKS06]. • Automatic, syntax-based: RoadRunner [MeccaCM01], ExAlg [ArasuG03], DEPTA [LiuGZ03, ZhaiL05].
HLRT Wrapper: the first "Wrapper Induction" [KushmerickWD97] A manual wrapper: ExtractCCs(page P): skip past the first occurrence of <P> in P; while the next <B> is before the next <HR> in P: for each <l_k, r_k> in {<<B>, </B>>, <<I>, </I>>}: skip past the next occurrence of l_k in P; extract an attribute from P up to the next occurrence of r_k; return the extracted tuples. A generalized wrapper, parameterized by delimiter rules <h, t, l1, r1, ..., lK, rK>: ExecuteHLRT(<h, t, l1, r1, ..., lK, rK>, page P): skip past the first occurrence of h in P; while the next l1 is before the next t in P: for each <l_k, r_k> in {<l1, r1>, ..., <lK, rK>}: skip past the next occurrence of l_k in P; extract an attribute from P up to the next occurrence of r_k; return the extracted tuples. An induction algorithm learns the delimiters (h, t, and each <l_k, r_k>) from labeled data.
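The generalized ExecuteHLRT pseudocode above translates almost line for line into runnable code. The sample page below is invented; the delimiters h, t, and the per-attribute (l, r) pairs play the roles described on the slide.

```python
def execute_hlrt(h, t, pairs, page):
    """Execute an HLRT wrapper: head h, tail t, and (left, right) delimiter
    pairs, one per attribute, following the slide's pseudocode."""
    tuples = []
    pos = page.index(h) + len(h)           # skip past the head delimiter
    while True:
        next_l1 = page.find(pairs[0][0], pos)
        next_t = page.find(t, pos)
        if next_l1 == -1 or (next_t != -1 and next_t < next_l1):
            break                           # tail comes first: no more tuples
        row = []
        for l, r in pairs:                  # one (l, r) pair per attribute
            start = page.index(l, pos) + len(l)
            end = page.index(r, start)
            row.append(page[start:end])     # extract up to the right delimiter
            pos = end + len(r)
        tuples.append(tuple(row))
    return tuples

page = ("<P>Listing<B>Congo</B><I>242</I>"
        "<B>Egypt</B><I>20</I><HR>footer")
print(execute_hlrt("<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")], page))
# → [('Congo', '242'), ('Egypt', '20')]
```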
RoadRunner [MeccaCM01] • Basic idea: • Page generation: filling (encoding) data into a template • Data extraction: the reverse, decoding data out of the template • Algorithm: • Compare two HTML pages at a time: one as the (evolving) wrapper, the other as a sample • Resolve the mismatches: • string mismatch: a content (data) slot • tag mismatch: structural variance
RoadRunner: the inferred template [Figure: two sample pages aligned against the template they were generated from]
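RoadRunner's core move, aligning two pages and turning string mismatches into data slots, can be sketched as below. This toy handles only the string-mismatch case from the slide; real RoadRunner also generalizes over tag mismatches (optionals and repetitions). The sample pages are invented.

```python
import re

def tokenize(html):
    """Split HTML into tag tokens and text tokens."""
    return re.findall(r"<[^>]+>|[^<]+", html)

def infer_template(page1, page2):
    """Align two pages token by token; string mismatches become data slots."""
    template = []
    for a, b in zip(tokenize(page1), tokenize(page2)):
        if a == b:
            template.append(a)                 # shared template text
        elif not a.startswith("<") and not b.startswith("<"):
            template.append("#DATA")           # string mismatch → content slot
        else:
            raise ValueError("tag mismatch: needs structural generalization")
    return "".join(template)

p1 = "<html><b>Title:</b>Databases<i>by</i>Ullman</html>"
p2 = "<html><b>Title:</b>Compilers<i>by</i>Aho</html>"
print(infer_template(p1, p2))
# → <html><b>Title:</b>#DATA<i>by</i>#DATA</html>
```

Running the inferred template backwards over a new page would then extract the #DATA slots as field values, which is exactly the "decoding" view of extraction on the previous slide.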
Finally, observations Large scale is not only a challenge, but also an opportunity!
Thank You! For more information: http://metaquerier.cs.uiuc.edu