
Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web


Presentation Transcript


  1. Tutorial in SIGMOD’06 Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web Kevin C. Chang

  2. Still challenges on the Web? Google is only the start of search (and MSN will not be the end of it).

  3. Structured Data--- Prevalent but ignored!

  4. Challenges on the Web come in “dual”: Getting access to the structured information! • Kevin’s 4-quadrants: the Web along two axes, Access (Surface Web vs. Deep Web) and Structure

  5. Tutorial Focus: Large-Scale Integration of structured data over the Deep Web • That is: Search-flavored integration. • Disclaimer--What it is not: • Small-scale, pre-configured, mediated-querying settings • many related techniques → some we will relate today • Text databases (or, meta-search) • Several related but “text-oriented” issues in meta-search • e.g., Stanford, Columbia, UIC • more in the IR community (distributed IR) • And, never a “complete” bibliography!! • http://metaquerier.cs.uiuc.edu/ “Web Integration” bibliography • Finally, no intention to “finish” this tutorial.

  6. Evidence in beta: Google Base.

  7. When Google speaks up… “What is an ‘Attribute’?” says Google!

  8. And things are indeed happening!

  9. The Deep Web: Databases on the Web

  10. The previous Web: Search used to be “crawl and index”

  11. The current Web: Search must eventually resort to integration

  12. How to enable effective access to the deep Web? Cars.com Amazon.com Biography.com Apartments.com 411localte.com 401carfinder.com

  13. Survey the frontier: BrightPlanet.com, March 2000 [Bergman00] • Overlap analysis of search engines. • “Search sites” not clearly defined. • Estimated 43,000 – 96,000 deep Web sites. • Content size estimated at 500 times that of the surface Web.
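A minimal sketch of the capture–recapture idea behind such overlap analysis: compare two independent listings of deep-Web sites and estimate the total population from their overlap (the Lincoln–Petersen estimator). The function and numbers below are illustrative assumptions, not BrightPlanet's actual method or data.

```python
# Overlap analysis via capture-recapture (Lincoln-Petersen), as an illustration.
def estimate_total(listing_a, listing_b):
    a, b = set(listing_a), set(listing_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no overlap between the listings; estimate undefined")
    # Total population ~= |A| * |B| / |A intersect B|
    return len(a) * len(b) / overlap

# Hypothetical listings of site identifiers, for illustration only:
# estimate_total(range(20_000), range(10_000, 40_000)) -> 60000.0
```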

  14. Survey the frontier: UIUC MetaQuerier, April 2004 [ChangHL+04] • Macro: Deep Web at large • Data: Automatically sampled 1 million IPs • Micro: per-source specific characteristics • Data: Manually collected sources • 8 representative domains, 494 sources Airfare (53), Autos (102), Books (69), CarRentals (24) Hotels (38), Jobs (55), Movies (78), MusicRecords (75) • Available at http://metaquerier.cs.uiuc.edu/repository
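A hedged sketch of the macro estimate behind the IP-sampling survey: count the deep-Web sites found in a random IP sample and scale up to the full address space. The figures below are placeholders, not the survey's raw numbers.

```python
# Scale a per-sample site count up to the whole IP space (placeholder numbers).
def scale_up(sites_in_sample, sample_size, total_ips):
    return sites_in_sample * (total_ips / sample_size)

# e.g., 100 deep-Web sites found among 1,000,000 sampled IPs, scaled to an
# assumed 2,000,000,000 addresses considered:
# scale_up(100, 1_000_000, 2_000_000_000) -> 200000.0
```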

  15. They wanted to observe… • How many deep-Web sources are out there? • “The dot-com bust has brought down DBs on the Web.” • How many structured databases? • “There are just (or, much more) text databases.” • How do search engines cover them? • “Google does it all.” – Or, “InvisibleWeb.com does it all.” • How hidden are they? • “It is the hidden Web.” • How complex are they? • “Queries on the Web are much simpler, even trivial.” • “Coping with semantics is hopeless – Let’s just wait till the Semantic Web.”

  16. And their results are… • How many deep-Web sources are out there? • 307,000 sites, 450,000 DBs, 1,258,000 interfaces. • How many structured databases? • 348,000 (structured) : 102,000 (text) == 3 : 1 • How do search engines cover them? • Google covered 5% of objects fresh and 21% stale. • InvisibleWeb.com covered 7.8% of sources. • How hidden are they? • CarRental (0%) > Airfares (~4%) > … > MusicRec > Books > Movies (80%+) • How complex are they? • “Amazon effects”

  17. Reported the “Amazon effect”… Attributes converge in a domain! Condition patterns converge even across domains!
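A small illustrative sketch (an assumption, not the survey's code) of how such convergence can be observed: as more source schemas in one domain are examined, the set of distinct attribute names grows far more slowly than the number of sources.

```python
# Track the growth of the attribute vocabulary as sources are added.
def vocabulary_growth(schemas):
    """schemas: one list of attribute names per source, in the order examined."""
    seen, growth = set(), []
    for attrs in schemas:
        seen.update(a.strip().lower() for a in attrs)
        growth.append(len(seen))
    return growth

# e.g., for hypothetical Books sources the curve might flatten like
# [4, 6, 7, 7, 8, 8, 8, ...], i.e., attributes converge within the domain.
```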

  18. Google’s Recent Survey [courtesy Jayant Madhavan]

  19. Driving Force: The Large Scale

  20. Circa 2000: Example System– Information Agents [MichalowskiAKMTT04, Knoblock03]

  21. Circa 2000: Example System– Comparison Shopping Engines [GuptaHR97] Virtual Database

  22. System: Example Applications

  23. Vertical Search Engines – “Warehousing” approach, e.g., Libra Academic Search [NieZW+05] (courtesy MSRA) • (Figure: many Web databases plus PDF/DOC/PS papers and journal, author, and conference homepages feed one warehouse.) • Integrating information from multiple types of sources • Ranking papers, conferences, and authors for a given query • Handling structured queries

  24. On-the-fly Meta-querying Systems – e.g., WISE [HeMYW03], MetaQuerier [ChangHZ05] • (Figure: MetaQuerier@UIUC builds a “db of dbs” to FIND sources such as Amazon.com, Cars.com, Apartments.com, and 411localte.com, and to QUERY sources through a unified query interface.)

  25. What needs to be done? Technical Challenges: • Source Modeling & Selection • Schema Matching • Source Querying, Crawling, and Object Ranking • Data Extraction • System Integration

  26. The Problems: Technical Challenges

  27. Technical Challenges • 1. Source Modeling & Selection • How to describe a source and find the right sources for query answering?

  28. Source Modeling: Circa 2000 • Focus: • Design of expressive modeling mechanisms. • Techniques: • View-based mechanisms: answering queries using views, LAV, GAV (see [Halevy01] for a survey). • Hierarchical or layered representations for modeling in-site navigations ([KnoblockMA+98], [DavulcuFK+99]).

  29. Source Modeling & Selection: for Large-Scale Integration • Focus: Discovery of sources. • Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05]. • Focus: Extraction of source models. • Hidden grammar-based parsing [ZhangHC04]. • Proximity-based extraction [HeMY+04]. • Classification to align with a given taxonomy [HessK03, Kushmerick03]. • Focus: Organization of sources and query routing • Offline clustering [HeTC04, PengMH+04]. • Online search for query routing [KabraLC05].

  30. Form Extraction: the Problem • Output all the query conditions; for each: • Grouping elements (into query conditions) • Tagging elements with their “semantic roles”: attribute, operator, value
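A minimal sketch of the extraction target assumed here: each extracted condition groups the relevant form elements and tags them with the three semantic roles. The class and field names are illustrative, not the tutorial's data model.

```python
# One extracted query condition: attribute + operator + the value-holding elements.
from dataclasses import dataclass, field

@dataclass
class QueryCondition:
    attribute: str                                        # e.g. "price"
    operator: str                                         # e.g. "between", "contains"
    value_elements: list = field(default_factory=list)    # input fields carrying values

# A form's extraction output is then a list of such conditions, e.g.:
conditions = [
    QueryCondition("title", "contains", ["<input name='title'>"]),
    QueryCondition("price", "between", ["<input name='min'>", "<input name='max'>"]),
]
```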

  31. Form Extraction: Parsing Approach [ZhangHC04] – A hidden syntactic model exists? • Observation: Interfaces share “patterns” of presentation. • Hypothesis: a hidden Interface Creation Grammar generates the interfaces. • Now, the problem: Given the interfaces, how to find their query capabilities?

  32. Best-Effort Visual Language Parsing Framework • Input: HTML query form → Tokenizer → Layout Engine → BE-Parser, driven by a 2P Grammar (Productions + Preferences), with Ambiguity Resolution and Error Handling → Output: semantic structure
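A hedged sketch of the “preferences” half of such a 2P setup: when several productions can explain the same layout, preference rules score the competing parses and the best-effort parser keeps the highest-scoring one. The rules and weights below are invented for illustration, not the paper's actual preferences.

```python
# Resolve ambiguity by scoring candidate parses against weighted preference rules.
def resolve_ambiguity(candidate_parses, preferences):
    def score(parse):
        return sum(weight for holds, weight in preferences if holds(parse))
    return max(candidate_parses, key=score)

# Example (assumed) preferences: a label just left of a textbox probably names it.
preferences = [
    (lambda p: p.get("label_position") == "left", 2.0),
    (lambda p: p.get("distance_to_field", 99) <= 1, 1.0),
]
```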

  33. Form Extraction: Clustering Approach [HessK03, Kushmerick03] Concept: A form as a Bayesian network. • Training: Estimate the Bayesian probabilities. • Classification: Max-likelihood predictions given terms.
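A simplified stand-in (naive max-likelihood rather than the paper's full Bayesian network) for the classification step: given term probabilities estimated from training forms, predict the most likely concept for a form element from the terms around it.

```python
# Max-likelihood concept prediction for a form element, given its nearby terms.
import math

def classify(terms, term_probs, prior):
    """term_probs[concept][term] -> P(term | concept); prior[concept] -> P(concept)."""
    def log_likelihood(concept):
        probs = term_probs[concept]
        return math.log(prior[concept]) + sum(math.log(probs.get(t, 1e-6)) for t in terms)
    return max(term_probs, key=log_likelihood)

# e.g. classify(["departure", "city"], term_probs, prior) might return "origin"
# for a hypothetical airfare-domain model.
```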

  34. Technical Challenges • 2. Schema Matching • How to match the schematic structures between sources?

  35. Schema Matching: Circa 2000 • Focus: • Generic matching without assuming Web sources • Techniques: see the survey [RahmB01]

  36. Schema Matching: for Large-Scale Integration • Focus: Matching a large number of interface schemas, often in a holistic way. • Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05]. • Query probing [WangWL+04]. • Clustering [HeMY+03, WuYD+04]. • Corpus-assisted [MadhavanBD+05]; Web-assisted [WuDY06]. • Focus: Constructing unified interfaces. • As a global generative model [HeC03]. • Cluster-merge-select [HeMY+03].

  37. WISE-Integrator: Cluster-Merge-Represent [HeMY+03]

  38. WISE-Integrator: Cluster-Merge-Represent [HeMY+03] • Matching attributes: • Synonymous labels: WordNet, string similarity • Compatible value domains (enum values or type) • Constructing the integrated interface: • start with an empty form • until all attributes are covered: • take one attribute • select a representative and merge its values
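A rough sketch of that cluster-merge loop, under assumed inputs (not the WISE-Integrator code): attributes whose labels are similar or whose value domains overlap are grouped, and each group keeps one representative label with the merged values.

```python
# Group source attributes into integrated attributes by label similarity and
# value-domain overlap; a crude illustration of cluster-merge-represent.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def integrate(attributes):
    """attributes: iterable of (label, set_of_enumerated_values)."""
    groups = []
    for label, values in attributes:
        for group in groups:
            if similar(label, group["label"]) or (values and values & group["values"]):
                group["values"] |= values          # merge value domains
                break
        else:                                      # no match: start a new group
            groups.append({"label": label, "values": set(values)})
    return groups
```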

  39. Statistical Schema Matching: MGS [HeC03, HeCH04, HeC05] – A hidden statistical model exists? • Observation: Schemas share “tendencies” of attribute usage. • Hypothesis: a hidden statistical model generates the observed schemas (Statistical Model → Schema Generation). • Now, the problem: Given the schemas, how to find the attribute matchings?

  40. Statistical Hypothesis Discovery • Statistical formulation: • Given as observations: the query interfaces (QIs) • Find the underlying hypothesis: the probabilistic model / matchings that explain them • “Global” approach: Hidden model discovery [HeC03] • Find the entire global model at once • “Local” approach: Correlation mining [HeCH04, HeC05] • Find local fragments of matchings one at a time.
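A hedged sketch of the correlation-mining intuition: synonym attributes rarely co-occur within the same interface, so strong negative correlation across many schemas suggests a matching. The measure below is a simple placeholder, not the papers' actual correlation measure.

```python
# Mine candidate synonym pairs from many interface schemas via negative correlation.
from itertools import combinations

def negative_correlation(schemas, a, b):
    """schemas: list of attribute-name sets, one per query interface."""
    n = len(schemas)
    f_a = sum(a in s for s in schemas)
    f_b = sum(b in s for s in schemas)
    f_ab = sum(a in s and b in s for s in schemas)
    expected = f_a * f_b / n
    return (expected - f_ab) / max(expected, 1e-9)   # > 0: co-occur less than chance

def candidate_synonyms(schemas, vocabulary, threshold=0.8):
    return [(a, b) for a, b in combinations(sorted(vocabulary), 2)
            if negative_correlation(schemas, a, b) >= threshold]
```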

  41. Technical Challenges • 3. Source Querying, Crawling & Search • How to query a source? How to crawl all of its objects and search them?

  42. Source Querying: Circa 2000 • Focus: Mediation of cross-source, join-able queries • Query rewriting, planning – extensive study: e.g., [LevyRO96, AmbiteKMP01, Halevy01]. • Focus: Execution & optimization of queries • Adaptive, speculative query optimization; e.g., [NaughtonDM+01, BarishK03, IvesHW04].

  43. Source Querying: for Large-Scale Integration • Metaquerying model: • Focus: On-the-fly querying. • MetaQuerier Query Assistant [ZhangHC05]. • Vertical-search-engine model: • Focus: Source crawling to collect objects. • Form submission by query generation/selection, e.g., [RaghavanG01, WuWLM06]. • Focus: Object search and ranking [NieZW+05]

  44. On-the-fly Querying [ZhangHC05]: Type-locality based Predicate Translation • (Figure: a source predicate P goes through a Predicate Mapper and Type Recognizer, which dispatches to a Text, Numeric, Datetime, or Domain-Specific Handler to produce the target predicate t* against the target template.) • Correspondences occur within localities • Translation by type-handler
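A minimal sketch of that dispatch, under assumed data structures: recognize the value type of a source predicate, then let a type-specific handler rewrite it against the target form's template. The type tests and the example handler are illustrative assumptions.

```python
# Type-locality translation: recognize the value type, dispatch to a handler.
def recognize_type(value):
    if value.replace(".", "", 1).isdigit():
        return "numeric"
    if ("/" in value or "-" in value) and any(c.isdigit() for c in value):
        return "datetime"
    return "text"

def translate(source_pred, target_template, handlers):
    """source_pred: e.g. {'attr': 'price', 'op': '<', 'value': '20000'}."""
    handler = handlers[recognize_type(source_pred["value"])]
    return handler(source_pred, target_template)

# e.g. a numeric handler could map "price < 20000" onto a target's
# [min_price, max_price] range template by filling only max_price.
```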

  45. Source Crawling by Query Selection [WuWL+06] • Conceptually, the DB as a graph: • Node: attributes • Edge: occurrence relationship • (Figure: an example graph with nodes such as System, Compiler, Theory, Application, Automata, Data Mining, Ullman, Han.) • Crawling is transformed into a graph-covering problem: find a set of nodes N in graph G such that every node i in G is reachable from some node j in N (j → i), while minimizing the total cost of the nodes in N.
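As a sketch of one natural heuristic for that formulation (offered as an illustration, not the paper's exact algorithm), a greedy weighted set-cover rule picks, at each step, the query node that covers the most not-yet-covered nodes per unit cost.

```python
# Greedy selection of query nodes to cover the database graph at minimum cost.
def greedy_query_selection(reachable, cost):
    """reachable[q] -> set of nodes retrieved by issuing query q; cost[q] -> its cost."""
    uncovered = set().union(*reachable.values()) if reachable else set()
    chosen = []
    while uncovered:
        q = max(reachable, key=lambda q: len(reachable[q] & uncovered) / cost[q])
        if not reachable[q] & uncovered:
            break                                  # remaining nodes are unreachable
        chosen.append(q)
        uncovered -= reachable[q]
    return chosen
```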

  46. Object Ranking -- Object Relationship Graph [NieZW+05] • Popularity Propagation Factor (PPF) for each type of relationship link • Popularity of an object is also affected by the popularity of the Web pages containing the object
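A simplified sketch of that propagation, with damping and normalization details assumed rather than taken from the PopRank paper: each object's popularity combines the popularity of the pages containing it with popularity flowing over typed relationship links weighted by their PPFs.

```python
# PopRank-style iteration over a typed object graph (illustrative, simplified).
def poprank(objects, links, page_pop, ppf, epsilon=0.15, iters=50):
    """links: list of (src, dst, link_type); ppf[link_type] -> propagation factor."""
    rank = {o: 1.0 / len(objects) for o in objects}
    out_weight = {o: sum(ppf[t] for s, _, t in links if s == o) or 1.0 for o in objects}
    for _ in range(iters):
        new = {o: epsilon * page_pop[o] for o in objects}       # Web-page contribution
        for src, dst, t in links:
            new[dst] += (1 - epsilon) * rank[src] * ppf[t] / out_weight[src]
        rank = new
    return rank
```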

  47. Object Ranking -- Training Process [NieZW+05] • Learning the PPFs by search: start from an initial combination of PPFs, generate a new combination from its neighbors, run the PopRank Calculator over the link graph, and compare the result with an expert ranking using a Ranking Distance Estimator; keep it if it is better than the best so far, and decide whether to accept a worse one to continue the search. • Subgraph selection to approximate rank calculation for speeding up.
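A rough sketch of that loop, with the acceptance rule treated as a simulated-annealing-style assumption and the helper functions (perturb, rank_distance) supplied by the caller; it is an illustration, not the paper's training procedure.

```python
# Search over PPF combinations, guided by distance to an expert ranking.
import math, random

def train_ppfs(initial_ppfs, perturb, rank_distance, iters=200, temperature=1.0):
    """perturb(ppfs) -> neighboring combination; rank_distance(ppfs) -> distance to expert ranking."""
    best = current = initial_ppfs
    best_d = current_d = rank_distance(current)
    for i in range(iters):
        candidate = perturb(current)
        d = rank_distance(candidate)
        if d < best_d:                                   # better than the best: keep it
            best, best_d = candidate, d
        temp = max(temperature / (i + 1), 1e-9)
        if d < current_d or random.random() < math.exp((current_d - d) / temp):
            current, current_d = candidate, d            # sometimes accept a worse one
    return best
```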

  48. Technical Challenges • 4. Data Extraction • How to extract result pages into relations?
