Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web

Guest Lecture. Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web. Zhen Zhang. What you will learn in this lecture. What is deep Web? Why information integration on deep Web? What are integration paradigms? What are technical challenges?

Presentation Transcript


  1. Guest Lecture: Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web. Zhen Zhang

  2. What you will learn in this lecture • What is deep Web? • Why information integration on deep Web? • What are integration paradigms? • What are technical challenges? • What are proposed solutions?

  3. The Web is becoming increasingly dynamic! Static surface Web vs. dynamic deep Web

  4. Data are hidden behind query forms

  5. Bringing data up front: one piece of evidence is Google Base

  6. The Deep Web: Databases on the Web • How to enable effective access to the deep Web?

  7. Survey the frontier: BrightPlanet.com, March 2000 [Bergman00] • Overlap analysis of search engines. • Estimated 43,000 – 96,000 deep Web sites. • Content size 500 times that of surface Web.

  8. Survey the frontier: UIUC MetaQuerier, April 2004 [ChangHL+04] • Macro: Deep Web at large • Data: Automatically sampled 1 million IPs • Micro: Per-source specific characteristics • Data: Manually collected sources • 8 representative domains, 494 sources: Airfare (53), Autos (102), Books (69), CarRentals (24), Hotels (38), Jobs (55), Movies (78), MusicRecords (75) • Available at http://metaquerier.cs.uiuc.edu/repository

  9. We want to observe • How many deep-Web sources are out there? • 307,000 sites, 450,000 DBs, 1,258,000 interfaces. • How many structured databases? • 348,000 (structured) : 102,000 (text), roughly 3 : 1 • How do search engines cover them? • Google covered 5% fresh and 21% stale objects. • InvisibleWeb.com covered 7.8% of sources. • How hidden are they? • CarRental (0%) > Airfares (~4%) > … > MusicRec > Books > Movies (80%+)

  10. Google’s Recent Survey [courtesy Jayant Madhavan]

  11. System: Example Applications

  12. Vertical Search Engines: the “Warehousing” approach, e.g., Libra Academic Search [NieZW+05] (courtesy MSRA) • Integrating information from multiple types of sources • Ranking papers, conferences, and authors for a given query • Handling structured queries [Figure: source types include Web databases, PDF/DOC/PS documents, and journal, author, and conference homepages]

  13. On-the-fly Meta-querying Systems, e.g., WISE [HeMYW03], MetaQuerier [ChangHZ05] • MetaQuerier@UIUC: FIND sources (a db of dbs), then QUERY sources (a unified query interface) • Example sources: Amazon.com, Cars.com, Apartments.com, 411localte.com

  14. What needs to be done? Technical Challenges: • Source Modeling & Selection • Schema Matching • Source Querying, Crawling, and Object Ranking • Data Extraction

  15. Technical Challenges • 1. Source Modeling & Selection • How to describe a source and find the right sources for query answering?

  16. Source Modeling & Selection: for Large Scale Integration • Focus: Discovery of sources. • Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05]. • Focus: Extraction of source models. • Hidden grammar-based parsing [ZhangHC04]. • Proximity-based extraction [HeMY+04]. • Classification to align with given taxonomy [HessK03, Kushmerick03]. • Focus: Organization of sources and query routing. • Offline clustering [HeTC04, PengMH+04]. • Online search for query routing [KabraLC05].

  17. Form Extraction: the Problem • Output all the conditions, for each: • Grouping elements (into query conditions) • Tagging elements with their “semantic roles”: attribute, operator, value

  18. Form Extraction: Parsing Approach [ZhangHC04] Does a hidden syntactic model exist? • Observation: Interfaces share “patterns” of presentation. • Hypothesis: a hidden interface creation grammar turns query capabilities into query interfaces. • Now, the problem: given a query interface, how to find its query capabilities?

  19. Best-Effort Visual Language Parsing Framework • Input: HTML query form • Tokenizer (with Layout Engine) • BE-Parser, driven by the 2P Grammar (Productions, Preferences), with Ambiguity Resolution and Error Handling • Output: semantic structure

  20. Form Extraction: Clustering Approach [HessK03, Kushmerick03] Concept: A form as a Bayesian network. • Training: Estimate the Bayesian probabilities. • Classification: Max-likelihood predictions given terms.
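
The training and classification steps on this slide can be illustrated with the simplest Bayesian network, a naive Bayes classifier over the terms found near a form field. This is only a sketch under that assumption, not the model of [HessK03, Kushmerick03]: the function names, the add-one smoothing, and the inclusion of class priors are all choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Estimate counts from (terms, label) pairs; labels are hypothetical field categories."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for terms, label in examples:
        label_counts[label] += 1
        for term in terms:
            term_counts[label][term] += 1
            vocabulary.add(term)
    return label_counts, term_counts, vocabulary


def predict(model, terms):
    """Return the label with the highest log-probability given the observed terms."""
    label_counts, term_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # class prior (an assumption of this sketch)
        denominator = sum(term_counts[label].values()) + len(vocabulary)
        for term in terms:
            # Add-one smoothing so unseen terms do not zero out the product.
            score += math.log((term_counts[label][term] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


examples = [
    (["departure", "city"], "Origin"),
    (["from", "city"], "Origin"),
    (["author", "name"], "Author"),
    (["writer"], "Author"),
]
model = train(examples)
print(predict(model, ["departure"]))
```

With equal priors, the prediction reduces to the maximum-likelihood choice named on the slide.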

  21. Technical Challenges • 2. Schema Matching • How to match the schematic structures between sources?

  22. Schema Matching: for Large Scale Integration • Focus: Matching a large number of interface schemas, often in a holistic way. • Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05]. • Query probing [WangWL+04]. • Clustering [HeMY+03, WuYD+04]. • Corpus-assisted [MadhavanBD+05]; Web-assisted [WuDY06]. • Focus: Constructing unified interfaces. • As a global generative model [HeC03]. • Cluster-merge-select [HeMY+03]. [Figure: attributes labeled Category, Type, and Subject sharing the values Business, Fiction, Computer]

  23. WISE-Integrator: Cluster-Merge-Represent [HeMY+03]

  24. WISE-Integrator: Cluster-Merge-Represent [HeMY+03] • Matching attributes: • Synonymous labels: WordNet, string similarity • Compatible value domains (enum values or type) • Constructing the integrated interface: • form = initially empty • until all attributes are covered: • take one attribute • find the clusters it belongs to • select a representative and merge values • add the representative to the interface if not already there
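
The cluster-merge-represent loop above can be sketched as follows. This is a minimal illustration, not the WISE-Integrator implementation: it replaces the WordNet synonym check with plain string similarity from `difflib`, treats any shared enum value as a "compatible domain", and picks the most frequent label as the representative. The function names and the 0.8 threshold are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def attributes_match(a, b, threshold=0.8):
    """Two (label, values) attributes match on similar labels or overlapping values."""
    (label_a, values_a), (label_b, values_b) = a, b
    similar = SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio() >= threshold
    overlap = bool(set(values_a) & set(values_b))
    return similar or overlap


def cluster_attributes(attributes):
    """Greedily place each attribute in the first cluster containing a match."""
    clusters = []
    for attr in attributes:
        for cluster in clusters:
            if any(attributes_match(attr, other) for other in cluster):
                cluster.append(attr)
                break
        else:
            clusters.append([attr])
    return clusters


def build_unified_interface(clusters):
    """Pick a representative label per cluster and merge the cluster's values."""
    interface = {}
    for cluster in clusters:
        labels = Counter(label for label, _ in cluster)
        # Most frequent label wins; ties are broken alphabetically.
        representative = min(labels, key=lambda name: (-labels[name], name))
        merged = []
        for _, values in cluster:
            for value in values:
                if value not in merged:
                    merged.append(value)
        if representative not in interface:
            interface[representative] = merged
    return interface


attributes = [
    ("Category", ["Business", "Fiction"]),
    ("Subject", ["Fiction", "Computer"]),
    ("Category", ["Computer"]),
    ("Title", []),
]
print(build_unified_interface(cluster_attributes(attributes)))
```

Greedy first-match clustering keeps the sketch short; a real system would need a more careful similarity measure and tie-breaking.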

  25. Statistical Schema Matching: MGS [HeC03, HeCH04, HeC05] Does a hidden statistical model exist? • Observation: Schemas share “tendencies” of attribute usage. • Hypothesis: a hidden statistical model M generates the observed schemas. • Now, the problem: given the observed schemas, how to find M, and from it the attribute matchings? [Figure: statistical model M with concepts C1, C2, C3 over attributes α, β, γ, η, δ; schema generation from M]

  26. Technical Challenges • 3. Source Querying, Crawling & Search • How to query a source? How to crawl all objects and to search them?

  27. Source Querying: for Large Scale Integration • Metaquerying model: • Focus: On-the-fly querying. • MetaQuerier Query Assistant [ZhangHC05]. • Vertical-search-engine model: • Focus: Source crawling to collect objects. • Form submission by query generation/selection, e.g., [RaghavanG01, WuWLM06]. • Focus: Object search and ranking [NieZW+05].

  28. On-the-fly Querying [ZhangHC05]: Type-locality based Predicate Translation • Correspondences occur within localities • Translation by type-handler [Figure: source predicates and a target template pass through the Predicate Mapper and Type Recognizer to a handler (Domain Specific, Text, Numeric, or Datetime), which produces the target predicate t*]

  29. Source Crawling by Query Selection [WuWL+06] • Conceptually, the DB as a graph: • Node: Attributes • Edge: Occurrence relationship • Crawling is transformed into a graph traversal problem: find a set of nodes N in the graph G such that for every node i in G there exists a node j in N with j->i, and the total cost of the nodes in N is minimal. [Figure: example graph with nodes System, Compiler, Theory, Application, Automata, Data Mining, Ullman, Han]
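
The node-selection problem stated above is a weighted dominating set problem, which is NP-hard in general, so a greedy heuristic is a natural first approximation. The sketch below is one such heuristic and is not claimed to be the algorithm of [WuWL+06]. It assumes a selected node also covers itself, and the example edges and unit costs are made up around the node names from the slide's figure.

```python
def select_queries(graph, cost):
    """Greedily pick nodes until every node is covered.

    graph: dict mapping each node j to the set of nodes i with j -> i.
    cost:  dict mapping each node to the (positive) cost of querying it.
    """
    uncovered = set(graph)
    selected = []

    def newly_covered(node):
        # Assumption: issuing a query on a node also covers the node itself.
        return ({node} | graph[node]) & uncovered

    while uncovered:
        # Best ratio of newly covered nodes to cost; sorted() makes ties deterministic.
        best = max(sorted(graph), key=lambda n: len(newly_covered(n)) / cost[n])
        uncovered -= newly_covered(best)
        selected.append(best)
    return selected


# Hypothetical co-occurrence graph; only the node names come from the slide.
graph = {
    "Ullman": {"Compiler", "Automata", "Theory"},
    "Han": {"Data Mining", "Application"},
    "Compiler": {"Ullman"},
    "Automata": {"Ullman"},
    "Theory": {"Ullman"},
    "Data Mining": {"Han"},
    "Application": {"Han"},
}
cost = {node: 1.0 for node in graph}
print(select_queries(graph, cost))
```

The loop always terminates because any uncovered node covers at least itself, so each iteration shrinks the uncovered set.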

  30. Object Ranking: Object Relationship Graph [NieZW+05] • Popularity Propagation Factor for each type of relationship link • Popularity of an object is also affected by the popularity of the Web pages containing the object

  31. Technical Challenges • 4. Data Extraction • How to extract result pages into relations?

  32. Data Extraction: Circa 2000 • Need for rapid wrapper construction well recognized. • Focus: Semi-automatic wrapper construction. • Techniques: • Wrapper-mediator architecture [Wiederhold92]. • Manual construction. • Semi-automatic: Learning-based • HLRT [KushmerickWD97], Stalker [MusleaMK99], Softmealy [HsuD98]. [Figure: a mediator on top of three wrappers]

  33. Data Extraction: for Large Scale • Focus: Even more automatic approaches. • Techniques: • Semi-automatic: Learning-based • [ZhaoMWRY05], [IRMKS06]. • Automatic: Syntax-based • RoadRunner [MeccaCM01], ExAlg [ArasuG03], DEPTA [LiuGZ03, ZhaiL05]. [Figure: a mediator on top of three wrappers]

  34. HLRT Wrapper: the first “Wrapper Induction” [KushmerickWD97]
      A manual wrapper:
        ExtractCCs(page P)
          skip past first occurrence of <P> in P
          while next <B> is before next <HR> in P
            for each <lk,rk> in {<<B>,</B>>, <<I>,</I>>}
              skip past next occurrence of lk in P
              extract attribute from P to next occurrence of rk
          return extracted tuples
      A generalized wrapper:
        ExecuteHLRT(<h,t,l1,r1,..,lk,rk>, page P)
          skip past first occurrence of h in P
          while next l1 is before next t in P
            for each <lk,rk> in {<l1,r1>,..,<lk,rk>}
              skip past next occurrence of lk in P
              extract attr from P to next occurrence of rk
          return extracted tuples
      Induction Algorithm: labeled data -> wrapper rules (delimiters): h; l1, r1; l2, r2; ……; lk, rk; t
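
The generalized wrapper's execution step translates almost line by line into runnable code. The sketch below covers only `ExecuteHLRT`, not the induction algorithm that learns the delimiters; the function name, the argument layout, and the sample page are assumptions made for illustration.

```python
def execute_hlrt(head, tail, delimiters, page):
    """Extract tuples from page using HLRT delimiters.

    head, tail:  strings marking the start and end of the data region (h and t).
    delimiters:  list of (left, right) string pairs, one per attribute.
    """
    tuples = []
    position = page.find(head)
    if position == -1:
        return tuples
    position += len(head)  # skip past first occurrence of h
    first_left = delimiters[0][0]
    while True:
        next_left = page.find(first_left, position)
        next_tail = page.find(tail, position)
        # Stop when no further l1 exists or the tail delimiter comes first.
        if next_left == -1 or (next_tail != -1 and next_tail < next_left):
            break
        row = []
        for left, right in delimiters:
            start = page.find(left, position)
            if start == -1:
                return tuples  # malformed row: give up with what we have
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            position = end + len(right)
        tuples.append(tuple(row))
    return tuples


# Hypothetical page in the style of the manual wrapper's <P>, <B>, <I>, <HR> tags.
page = (
    "<HTML><BODY><B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)
print(execute_hlrt("<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")], page))
```

The head delimiter keeps the bold title from being read as a row, and the tail delimiter does the same for the bold footer.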

  35. RoadRunner [MeccaCM01] • Basic idea: • Page generation: filling (encoding) data into a template • Data extraction: the reverse, decoding the template • Algorithm: • Compare two HTML pages at a time • one as the wrapper and the other as the sample • Resolve the mismatches • string mismatch -- content slot • tag mismatch -- structure variance
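
The string-mismatch half of this idea fits in a few lines. The sketch below is a heavily simplified illustration, not the RoadRunner algorithm: it assumes both pages have exactly the same tag structure, turns each string mismatch into a `#PCDATA` slot, and simply rejects tag mismatches, which the real system resolves by discovering optional and repeated structure. All names are hypothetical.

```python
import re


def tokenize(html):
    """Split a page into tags and non-empty text runs."""
    return [tok.strip() for tok in re.findall(r"<[^>]+>|[^<]+", html) if tok.strip()]


def generalize(wrapper_page, sample_page):
    """Align two pages token by token; string mismatches become content slots."""
    wrapper, sample = tokenize(wrapper_page), tokenize(sample_page)
    if len(wrapper) != len(sample):
        raise ValueError("tag mismatch: structure variance is not handled in this sketch")
    template = []
    for w, s in zip(wrapper, sample):
        if w == s:
            template.append(w)
        elif not w.startswith("<") and not s.startswith("<"):
            template.append("#PCDATA")  # string mismatch -> content slot
        else:
            raise ValueError("tag mismatch: structure variance is not handled in this sketch")
    return template


def extract(template, page):
    """Decode a page against the template by reading the tokens at slot positions."""
    return [tok for slot, tok in zip(template, tokenize(page)) if slot == "#PCDATA"]


page_one = "<html><b>Title:</b><i>Database Systems</i></html>"
page_two = "<html><b>Title:</b><i>Data Mining</i></html>"
template = generalize(page_one, page_two)
print(template)
print(extract(template, page_two))
```

Text that is identical on both pages, such as the "Title:" label, stays in the template as fixed content, which is how the template is separated from the data.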

  36. RoadRunner: the template

  37. Finally, observations Large scale is not only a challenge, but also an opportunity!

  38. Thank You! For more information: http://metaquerier.cs.uiuc.edu
