Deep Web Integration: Querying Structured Data on the Deep Web Fangjiao Jiang
Outline • Background • Access Deep Web • MetaQuerier • Metasearch engine vs. MetaQuerier • Related research groups • Conclusion • … • Some suggestions
Background Part 1
The current Web: Getting “deeper” • A large amount of data is hidden behind query forms
The problem of accessing data from the Deep Web • Deep = not accessible through traditional search engines
Why is it important? • More than 10 million distinct forms
Why is it important? • Up to 5,000 billion dynamic result pages
Why is it important? Google’s recent survey [CIDR 2007] • If there are 1 billion web pages, there are 25 million potential Deep Web sources
Cars.com Challenge: How to enable effective access to the Deep Web?
Access the Deep Web Part 2
Three different approaches • Warehouse-like approach: Web databases are collected into a repository • MetaQuerier: an integrated query interface queries the Web databases • Surfacing the Deep Web 1) Pre-compute appropriate queries over the forms 2) Insert the resulting pages into a web-search index
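The first surfacing step, pre-computing queries over a form, can be sketched as follows. This is a minimal illustration, not the method of any particular system: the form URL, the input names, and the candidate values are all hypothetical, and the form is assumed to submit via GET.

```python
from itertools import product
from urllib.parse import urlencode


def precompute_query_urls(action_url, input_values, max_urls=100):
    """Enumerate GET URLs for combinations of candidate form input values.

    input_values maps each (hypothetical) form input name to its candidate values.
    max_urls caps the enumeration, since combinations grow multiplicatively.
    """
    names = sorted(input_values)
    urls = []
    for combo in product(*(input_values[name] for name in names)):
        urls.append(action_url + "?" + urlencode(dict(zip(names, combo))))
        if len(urls) >= max_urls:
            break
    return urls


# Hypothetical car-search form with two select inputs.
form_values = {"make": ["ford", "honda"], "body": ["sedan", "suv"]}
urls = precompute_query_urls("http://example.com/search", form_values)
for url in urls:
    print(url)
```

Each generated URL would then be fetched and the resulting page handed to the indexer like any other web page (step 2).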
(1) Warehouse-like approach [Figure: Web databases (e.g., the Chinese Journal Full-text Database, the National Natural Science Foundation information database, …), journal, conference, and author homepages, and PDF, DOC, and PS documents]
(2) MetaQuerier [Figure: MetaQuerier architecture over the Deep Web] • MetaQuerier is what we focus on. • Front-end: Query Execution (query Web databases): schema matching, result processing, query translation, source selection • Back-end: Semantics Discovery (find Web databases): database crawler, interface extraction, source clustering, interface integration • Deep Web Repository: query interfaces, query capabilities, subject domains, unified interfaces
(3) Surfacing the Deep Web [VLDB’08] • Viewpoint: many domains and many languages; no human in the loop, no site-specific scripts • Main idea: predicting input values for text boxes; predicting input combinations • Google’s Deep-Web crawling system: affects more than 1000 queries per second; enables access to more than a million Deep-Web sites; spans 50+ languages and 100+ domains
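One possible way to predict input values for a text box is iterative probing: submit a few seed keywords, mine frequent new words from the result pages, and reuse them as the next inputs. The sketch below is an assumption made for illustration and is not necessarily the procedure of [VLDB’08]; `toy_search` and `DOCS` are hypothetical stand-ins for a real form submission.

```python
import re
from collections import Counter

# Hypothetical in-memory "Web database" standing in for a real search form.
DOCS = ["honda civic sedan", "honda accord sedan", "ford focus hatchback"]


def toy_search(keyword):
    """Return the documents that contain the keyword as a whole word."""
    return [doc for doc in DOCS if keyword in doc.split()]


def probe_keywords(search, seeds, rounds=2, per_round=2):
    """Grow a keyword list by mining frequent unseen words from result pages."""
    selected = list(seeds)
    seen = set(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        counts = Counter()
        for keyword in frontier:
            for doc in search(keyword):
                words = re.findall(r"[a-z]+", doc.lower())
                counts.update(w for w in words if w not in seen)
        # Most frequent first; ties broken alphabetically for determinism.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        frontier = [word for word, _ in ranked[:per_round]]
        if not frontier:
            break
        selected.extend(frontier)
        seen.update(frontier)
    return selected


selected = probe_keywords(toy_search, ["honda"])
print(selected)
```

In this toy run the record about "ford" is never reached from the seed "honda", which illustrates why the choice of seeds and the number of probing rounds matter for coverage.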
MetaQuerier Part 3
A Survey on Deep Web [SIGMOD 2006] • How many deep-Web sources are out there? 307,000 sites, 450,000 DBs, 1,258,000 interfaces. • How structured is the Deep Web? 348,000 (structured) : 102,000 (text) ≈ 3 : 1 • How do search engines cover them? They cover 10% of the sources. • What is the subject distribution of Web databases? Across all areas. • How complex are they? The “Amazon effect”
Reported the “Amazon effect”… Condition patterns converge even across domains! Attributes converge in a domain!
Technical Challenges • How to discover the query interface? • Which form is the query interface of a Web database? • How to understand a query interface? • Where is the first condition? What’s its attribute? • How to match query interfaces? • What does “author” on this source match on that source? • How to translate queries? • How to ask this query on that source?
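The matching and translation questions above can be illustrated with a small sketch. It assumes attribute labels have already been extracted from each interface; the synonym table, the similarity threshold, and all labels are hypothetical, and string similarity is only one of many possible matching signals.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; a real system would learn or curate this.
SYNONYMS = {"writer": "author", "book title": "title"}


def normalize(label):
    """Lowercase a label, drop a trailing colon, and apply known synonyms."""
    label = label.strip().lower().rstrip(":")
    return SYNONYMS.get(label, label)


def match_attributes(unified_labels, site_labels, threshold=0.8):
    """Map each unified-interface label to its most similar site label."""
    mapping = {}
    for unified in unified_labels:
        best, best_score = None, 0.0
        for site in site_labels:
            score = SequenceMatcher(None, normalize(unified), normalize(site)).ratio()
            if score > best_score:
                best, best_score = site, score
        if best_score >= threshold:
            mapping[unified] = best
    return mapping


def translate_query(query, mapping):
    """Rewrite a query on the unified interface into the site's own labels.

    Conditions on attributes the site does not support are dropped here;
    a real translator would also have to handle value and operator mismatches.
    """
    return {mapping[attr]: value for attr, value in query.items() if attr in mapping}


mapping = match_attributes(["author", "title", "price"], ["Writer:", "Book title", "ISBN"])
translated = translate_query({"author": "Tom Clancy", "price": "<20"}, mapping)
print(mapping)
print(translated)
```

Here “author” on the unified interface is matched to “Writer:” on the site, while “price” has no counterpart and is left out of the translated query.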
Technical Challenges • How to extract the query results? • Based on visual information? • How to identify the same entity? • Especially large-scale entity identification. • How to annotate the query results? • How to specify the semantics of the data?
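As a rough illustration of entity identification at scale, the sketch below first groups records into blocks (so that only records within a block are compared) and then flags pairs whose titles have a high token-level Jaccard similarity. The record fields, the blocking key, and the threshold are assumptions chosen for the example.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Split text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(records, threshold=0.6):
    """Return id pairs of records that likely describe the same entity."""
    # Blocking: compare only records that share the same normalized author,
    # which avoids the quadratic cost of comparing every pair.
    blocks = defaultdict(list)
    for record in records:
        blocks[record["author"].strip().lower()].append(record)
    pairs = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            if jaccard(tokens(left["title"]), tokens(right["title"])) >= threshold:
                pairs.append((left["id"], right["id"]))
    return pairs


# Hypothetical result records extracted from two Web databases.
records = [
    {"id": 1, "title": "Database Systems: The Complete Book", "author": "Garcia-Molina"},
    {"id": 2, "title": "Database systems - the complete book (2nd ed.)", "author": "garcia-molina"},
    {"id": 3, "title": "Database Management Systems", "author": "Ramakrishnan"},
]
duplicates = find_duplicates(records)
print(duplicates)
```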
Preliminary • Online data: Surface Web vs. Deep Web • Surface Web: Search Engine 1, Search Engine 2, …, Search Engine n, accessed through a metasearch engine (example: mamma.com) • Deep Web: Web database 1, Web database 2, …, Web database n, accessed through a MetaQuerier (example: Addall.com)
Search Engine vs. Web Database • Search engine: document search engine • Web database: database search engine • Key technology of search engines • Crawling the Web: re-crawl for changed and added pages • Indexing Web pages: index terms, stop words, stemming, inverted file structure (term → (p, w))
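The indexing steps listed above (index terms, stop words, stemming, inverted file) can be sketched in a few lines. The sketch assumes that p in (p, w) is a page identifier and w a term weight, here simply the term frequency; the stop-word list and the suffix-stripping stemmer are crude placeholders for real ones.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "on", "and", "is", "in"}


def stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_index(pages):
    """Build term -> [(page_id, weight), ...] with term frequency as the weight."""
    index = defaultdict(list)
    for page_id, text in pages.items():
        words = re.findall(r"[a-z]+", text.lower())
        terms = [stem(word) for word in words if word not in STOP_WORDS]
        for term, count in sorted(Counter(terms).items()):
            index[term].append((page_id, count))
    return dict(index)


pages = {1: "Querying the deep web", 2: "Deep web databases and web forms"}
index = build_inverted_index(pages)
print(index["web"])
```

A query term is stemmed the same way and looked up in the index, which returns the posting list of pages and weights without scanning any page text.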
Search Engine vs. Web Database • Search engine: document search engine • Web database: database search engine • Key technology of search engines • Ranking pages: similarity(query, page), linkage information (PageRank) • Result organization: matching score (descending), clustering/categorizing of large result sets (e.g., “apple”) • Effective and efficient retrieval: recall-precision curve
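For the linkage-information part of ranking, a minimal PageRank power iteration is sketched below. The link graph is hypothetical, the damping factor 0.85 is the commonly used default, and the handling of pages without out-links (spreading their rank evenly) is one of several conventions.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Compute PageRank by power iteration over a dict: page -> list of out-links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: distribute its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page link graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
ranking = sorted(rank, key=rank.get, reverse=True)
print(ranking)
```

In practice such a link-based score is combined with the query-dependent similarity score rather than used alone.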
Metasearch Engine vs. MetaQuerier • Online data: Surface Web vs. Deep Web • Surface Web: Search Engine 1, Search Engine 2, …, Search Engine n, accessed through a metasearch engine (example: mamma.com) • Deep Web: Web database 1, Web database 2, …, Web database n, accessed through a MetaQuerier (example: Addall.com)
Metasearch Engine vs. MetaQuerier • Metasearch engine: search engine selection, search result extraction, result merging • MetaQuerier: query interface integration, database selection, query translation, result extraction, entity identification and annotation
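Result merging, the last metasearch step above, has to reconcile scores that are not comparable across engines. The sketch below uses min-max normalization per engine and keeps the best normalized score for a URL returned by several engines; this is one simple strategy chosen for illustration, and the engine names, URLs, and scores are hypothetical.

```python
def merge_results(results_by_engine):
    """Merge per-engine (url, score) lists into one ranked list.

    Scores are min-max normalized within each engine so that engines with
    different score scales become comparable; duplicates keep their best score.
    """
    merged = {}
    for results in results_by_engine.values():
        if not results:
            continue
        scores = [score for _, score in results]
        low, high = min(scores), max(scores)
        for url, score in results:
            normalized = 1.0 if high == low else (score - low) / (high - low)
            merged[url] = max(merged.get(url, 0.0), normalized)
    # Highest score first; ties broken by URL for a deterministic order.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


results_by_engine = {
    "engine1": [("a.com", 10), ("b.com", 5), ("c.com", 0)],
    "engine2": [("b.com", 90), ("d.com", 60), ("e.com", 10)],
}
merged = merge_results(results_by_engine)
for url, score in merged:
    print(url, score)
```

Note that min-max normalization always pushes each engine's lowest result to zero, so rank-based or learned merging may be preferable when engines return very short lists.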
Main research groups Part 5
Main research groups • Weiyi Meng, Professor, Binghamton University (with Yiyao Lu, Eduard Dragut, Hai He): interface extraction, interface integration, query translation, result annotation • Kevin Chen-Chuan Chang, Associate Professor, University of Illinois at Urbana-Champaign (with Bin He, Zhen Zhang): interface extraction, interface integration, query translation
Main research groups • Others … • Jayant Madhavan, Google, Inc.: Google Base • Zaiqing Nie, Microsoft, Inc.: vertical search • Luis Gravano, Columbia University: top-k query • Panagiotis G. Ipeirotis, New York University: classification
Conclusion: Our work toward large-scale integration • Completed several key subtasks: • Deep Web data extraction [TKDE 2009, WEBDB 2006, WISE 2005, WAIM 2005] • Query translation [DASFAA 2009, DASFAA 2007, SKG 2008] • Deep Web survey [VLDB Workshop 2006, Chinese Journal of Computers 2007] • Schema matching [Chinese Journal of Computers 2008] • Database selection [Journal of Software 2008] • Moving forward to exciting system issues: • System integration for building an integration system • Web data integration in mobile environments
Some suggestions Part 6
Four years ago… • How to find a paper? Is Google enough? • What theories should we be familiar with first?
Find the papers … • Google • Google Scholar • DBLP Bibliography • C-DBLP • Libra Academic Search • ACM Digital Library • CiteSeer • Authors’ homepages • Email the author
Find the papers … • Journals: TOIS, TODS, VLDB J., TKDE • Conferences/workshops: SIGMOD/WebDB, VLDB, ICDE, EDBT, WWW, SIGIR, CIKM/WIDM, WISE, DASFAA
Read the books … • Information Retrieval • Data Mining • Machine Learning • Statistics • Probability theory …
Three years ago… • How to find a problem? • Which problem is significant?
Two years ago… • How to write a paper?
Helpful points… • Right subject • Well-defined problem • Clear contribution • Good structure and logical flow • Proper use of words • Pay attention to format, equations, references… • Ask others to read your paper • Record your own mistakes • Do not leave out important related work
Take some time to learn… • LaTeX • MATLAB or Gnuplot (for charts, if necessary)