Part II Large-Scale Web Database Integration Systems
Definitions • Web database (database search engine): Web-accessible database (WDB) Characteristics: • Data are structured and are stored in database systems. • Data are accessible through a Web search interface. • Result pages are dynamically generated by wrapping data in HTML files. • Web database integration: the process of enabling unified access to multiple Web databases in the same application domain.
WDB Integration System vs. MSE • Major differences between Web databases and regular document search engines (DSE): • DSE searches Web pages while WDB searches database entities. • WDB usually has a complex interface while DSE usually has a simple interface. • DSE ranks results by similarity while WDB usually ranks results by some attribute values.
WDB Integration System Architecture [Figure: architecture diagram with two modules, the Integrated Interface Generation Module and the Query Processing Module. A Web user submits a query through an integrated interface and receives a result. Components shown: WDB List, Web Database Discovery, WDB Interface Schema Extraction, WDB Clustering By Domain, Interface Integration, Domain Mapping, Database Selection, Query Translation and Dispatch, Result Extraction, Result Annotation, Entity Identification, Result Merging. WDB 1 through WDB m on the World Wide Web are grouped into WDB Cluster 1 through WDB Cluster n, with Integrated Interface 1 through Integrated Interface n.]
Main Technical Problems • WDB Search Interface Modeling • WDB Search Interface Extraction • WDB Search Interface Clustering • WDB Search Interface Integration • Global Query Mapping and Optimization • Search Result Extraction and Annotation • Online Entity Identification • Remaining Research Challenges
A Related Book • Eduard Dragut, Weiyi Meng, Clement Yu. Deep Web Query Interface Understanding and Integration. Morgan & Claypool Publishers, June 2012. • Table of Contents • Introduction • Query Interface Representation and Extraction • Query Interface Clustering and Categorization • Query Interface Matching • Query Interface Attribute Integration • Query Interface Integration • Summary and Future Research
WDB Query Interface Modeling Problem: Represent the information on each interface in a format that is suitable for integration and query submission.
An Example WDB Interface [Figure: an example WDB search interface, with a callout labeled "An attribute".]
WDB Interface Modeling Different models have been proposed: • WISE Three-Level Model: site-level, attribute-level, and element-level. • Hierarchical Model: A search interface is modeled as an ordered tree of elements. • Hierarchical model is designed to capture the order semantics and the nested grouping of the attributes in an interface. • Querying Capability Model: Formally characterize what kinds of queries are valid for a search interface.
Hierarchical Model: An Example
aa.com
  1. Where Do You Want to Go?
     origin (From: City or Airport Code)
     destination (To: City or Airport Code)
  2. When Do You Want to Go?
     Departure Date: depMonth, depDay, depTime
     Return Date: retMonth, retDay, retTime
  3. Number of Passengers
     numAdult (Adults)
     numChild (Children)
  4. What are Your Service Preferences?
     cabinClass (Class of Service)
     maxiumStops (Number of Connections)
  5. Choose a Carrier
     carrier
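The ordered-tree idea can be sketched in a few lines of Python. This is only an illustration: the `Node` class and the `leaves` helper are hypothetical names, the tree covers just the first three groups of the aa.com example, and the assignment of fields to groups is read off the slide rather than taken from any published implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of the ordered tree; children keep their on-screen order."""
    label: str                      # text shown to the user (may be empty)
    name: str = ""                  # internal field name; empty for group nodes
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[str]:
    """Return the field names in left-to-right (depth-first) order."""
    if not node.children:
        return [node.name or node.label]
    ordered: List[str] = []
    for child in node.children:
        ordered.extend(leaves(child))
    return ordered


# Assumed grouping, following the slide's example for aa.com.
interface = Node("aa.com", children=[
    Node("Where Do You Want to Go?", children=[
        Node("From: City or Airport Code", "origin"),
        Node("To: City or Airport Code", "destination"),
    ]),
    Node("When Do You Want to Go?", children=[
        Node("Departure Date", children=[
            Node("", "depMonth"), Node("", "depDay"), Node("", "depTime")]),
        Node("Return Date", children=[
            Node("", "retMonth"), Node("", "retDay"), Node("", "retTime")]),
    ]),
    Node("Number of Passengers", children=[
        Node("Adults", "numAdult"),
        Node("Children", "numChild"),
    ]),
])

print(leaves(interface))
```

Because children are stored in a list, both the nesting (groups) and the sibling order survive, which is what the hierarchical model is meant to capture.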
Query Interface Extraction • Automatic interface extraction: Automatically extract information described in an interface representation model from any given WDB interface. • Primarily two tasks: • Attribute extraction • Extract elements and labels from the interface. • Group elements and labels into logical attributes. • Attribute analysis • Extract and derive meta-information about each attribute based on the interface representation model.
WDB Query Interface Clustering Objective: Group WDBs into clusters such that all WDBs in the same cluster belong to the same domain (e.g., they sell the same type of products). Techniques: • First, construct a concept hierarchy. • Then apply one of the following techniques: • Supervised clustering (training required) • Unsupervised clustering (no training required)
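As a rough illustration of the unsupervised option, the sketch below groups interfaces by the overlap of their attribute labels. It is a minimal single-pass scheme chosen for brevity, not the method used by any particular system; the Jaccard measure, the `0.3` threshold, the site names, and the label sets are all assumptions, and the concept-hierarchy step is skipped.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of attribute labels."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(interfaces, threshold=0.3):
    """Assign each interface to the most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"terms": set of labels, "members": [site names]}
    for site, terms in interfaces.items():
        best = max(clusters, key=lambda c: jaccard(c["terms"], terms), default=None)
        if best is not None and jaccard(best["terms"], terms) >= threshold:
            best["members"].append(site)
            best["terms"] |= terms      # grow the cluster's vocabulary
        else:
            clusters.append({"terms": set(terms), "members": [site]})
    return [c["members"] for c in clusters]


# Hypothetical interfaces described by their attribute labels.
sites = {
    "airline-a": {"from", "to", "departure", "return", "adults"},
    "airline-b": {"from", "to", "departure", "passengers"},
    "books-a": {"title", "author", "isbn"},
    "books-b": {"title", "author", "publisher"},
}

print(cluster(sites))
```

The number of clusters is not fixed in advance here; it falls out of the threshold, which matters when the number of domains is unknown.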
Query Interface Integration • It is related to database schema integration. • Schema integration has been studied since the 1980s. • Based on different data models: ER model, relational model, object-oriented model, etc. • In different contexts: a single database during database design, or multiple databases in multidatabase/data warehouse systems. • Key issues: resolving name conflicts, data type conflicts, structural conflicts, data inconsistency, etc. • Manual approach: Integration rules are manually written.
Schema Integration vs. Interface Integration Comparing WDB interface integration and database schema integration: • A WDB interface schema is simpler (one table/view versus the multiple tables of a database schema). • Attributes in a WDB interface are more complex, as they may consist of multiple elements. • A WDB interface mixes attributes and query conditions, while a database schema does not. • Meta-data need to be extracted from a WDB interface, while they are readily available in a database schema. • WDB interface integration needs to integrate element format, attribute layout, and external values, while database schema integration does not.
Attribute Matching A key problem in schema/interface integration is to match attributes from different schemas/interfaces. A general framework for attribute matching [Rahm and Bernstein, VLDB Journal 2001]. • Develop a number of matchers based on different information. • Dictionary-level information: attribute names • Schema-level information: data type, key, foreign key, … • Instance-level data: values of attributes • Utilize auxiliary information: Special dictionaries, thesaurus, user-input, …
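The multi-matcher framework can be illustrated with a small sketch that combines a name matcher, a schema-level matcher, and an instance-level matcher into one weighted score. The attribute dictionaries, the three weights, and the function names are assumptions made for the example; auxiliary sources such as thesauri are left out.

```python
from difflib import SequenceMatcher


def name_sim(a, b):
    """Name-based matcher: string similarity of the attribute names."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def type_sim(a, b):
    """Schema-level matcher: 1.0 if the declared data types agree."""
    return 1.0 if a["type"] == b["type"] else 0.0


def value_sim(a, b):
    """Instance-level matcher: Jaccard overlap of the attribute values."""
    va, vb = set(a["values"]), set(b["values"])
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0


# Assumed weights; in practice they would be tuned or learned.
MATCHERS = [(name_sim, 0.4), (type_sim, 0.2), (value_sim, 0.4)]


def match_score(a, b):
    """Weighted combination of all matchers, in the range [0, 1]."""
    return sum(weight * matcher(a, b) for matcher, weight in MATCHERS)


cabin_1 = {"name": "Class of Service", "type": "enum",
           "values": ["Economy", "Business", "First"]}
cabin_2 = {"name": "Cabin Class", "type": "enum",
           "values": ["Economy", "Business"]}
adults = {"name": "Adults", "type": "int", "values": ["1", "2", "3"]}

print(match_score(cabin_1, cabin_2) > match_score(cabin_1, adults))
```

Here the two cabin attributes score higher mainly through the type and value matchers, which shows why relying on names alone is fragile.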
Attribute Integration • After attribute matching, attributes are divided into clusters such that each cluster corresponds to a global attribute in the integrated interface. Remaining issues: • Determine the name of the global attribute for each cluster. • Determine the domain type of each global attribute. The domain type will determine the format. • Determine the external values of each global attribute.
Hierarchical Interface Integration (1) An example of hierarchical schema representation. [Figure: a flight-search interface with four groups (1. Where Do You Want to Go?: From: City, To: City; 2. When Do You Want to Go?: Departure Date, Return Date; 3. Number of Passengers?: Adults, Children; 4. Class of Service: Economy, Business, First Class) shown next to its schema tree. The Root has the children Where, When, Number, and Class; Where has From and To; When has Departure (Dmonth, Dday, Dtime) and Return (Rmonth, Rday, Rtime); Number has Adult and further children.] Siblings are ordered!
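Hierarchical Interface Integration (2) Simple mapping versus complex mapping • Simple mapping: 1-to-1 mapping between two fields • Complex mapping: 1-to-m mapping between one field in one interface and multiple fields in another interface [Figure: examples of 1-to-m mappings over the fields from date, departure date, passengers, No. of passengers, adults, children, month, day, and year; for instance, a single passenger-count field corresponds to adults and children, and a single date field corresponds to month, day, and year.]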
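A minimal sketch of how 1-to-1 and 1-to-m mappings might be applied when translating a global query into a local one. The `FIELD_MAP` table, the field names, and the splitting functions are hypothetical; a real system would derive them from the matching step.

```python
from datetime import date

# Each global field maps to a list of local fields plus a function that
# splits the global value into one value per local field.
FIELD_MAP = {
    # simple (1-to-1) mapping
    "origin": (["from"], lambda v: [v]),
    # complex (1-to-m) mapping: one date field becomes three local fields
    "departure_date": (["month", "day", "year"],
                       lambda d: [d.month, d.day, d.year]),
}


def translate(global_query):
    """Rewrite a global query into the field names of one local interface."""
    local_query = {}
    for field_name, value in global_query.items():
        local_fields, split = FIELD_MAP[field_name]
        local_query.update(zip(local_fields, split(value)))
    return local_query


print(translate({"origin": "ORD", "departure_date": date(2024, 1, 5)}))
```

The reverse direction (m-to-1) would need a combining function instead of a splitting one, which is why complex mappings are harder to handle than simple ones.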
Hierarchical Interface Integration (3) Tree Merging: how to merge? [Figure: the interface trees of two application forms, American Express and Chase, with groups such as "Please tell us about yourself" and "Please tell us about your employment" and fields including Phone, Years there, Address, Occupation, Company address, Country, State, City, and Street.]
Hierarchical Interface Integration (4) Grouping Constraint: Given subgroups in different user interfaces, is it possible to find a group such that all elements in each subgroup are in adjacent locations? Example: The subgroups {state, city, street} and {country, state} satisfy this requirement, because both are adjacent in the group {country, state, city, street}.
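The grouping constraint can be checked by brute force for small groups, as in the sketch below. Trying every permutation is an assumption made for clarity and is exponential in the number of fields; the function names are hypothetical. The underlying question is the consecutive-ones problem, for which efficient algorithms such as PQ-trees exist.

```python
from itertools import permutations


def is_contiguous(order, group):
    """True if all fields of `group` occupy adjacent positions in `order`."""
    positions = sorted(order.index(f) for f in group)
    return positions[-1] - positions[0] + 1 == len(positions)


def find_layout(subgroups):
    """Return one ordering that keeps every subgroup adjacent, or None."""
    fields = sorted(set().union(*subgroups))
    for order in permutations(fields):
        if all(is_contiguous(order, g) for g in subgroups):
            return list(order)
    return None


subgroups = [{"state", "city", "street"}, {"country", "state"}]
print(find_layout(subgroups))

# Three pairwise groups over three fields cannot all be adjacent.
print(find_layout([{"a", "b"}, {"b", "c"}, {"a", "c"}]))
```

The first call returns some valid ordering of the four address fields (not necessarily the one on the slide), and the second returns `None`, since one of the three pairs is always separated.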
Hierarchical Interface Integration (5) Preserving ancestor-descendant relationships [Figure: the American Express and Chase interface trees next to the integrated tree. The integrated tree keeps the groups "Please tell us about yourself" and "Please tell us about your employment" and places the fields Phone, Years there, Address, Occupation, Company address, Country, State, City, and Street under them so that ancestor-descendant relationships from both source trees are preserved.]
Hierarchical Interface Integration (6) Naming attributes • Group Naming Compatibility: Names of attributes within a group in a user interface should be compatible. • Compatible naming: {adults, children} and {adults, infants} give {adults, children, infants}. • Incompatible naming: {adults, children} and {#children, #infants} give {adults, children, #infants}.
Search Result Annotation Goal: Identify the semantic meaning of each piece of information within each search result record (SRR). • Before result annotation, SRRs on the result pages returned from search engines need to be extracted first. • Some approaches combine result extraction and result annotation in one step. Data annotation is needed for • Comparison-shopping applications: entity identification, result merging, … • Deep Web crawling and data collection
Result Annotation: Problem Description [Figure: search result records whose data units are annotated with labels such as title and authors.]
Entity Identification • Problem: Automatically derive rules to determine if two search result records from different WDBs are in fact the same entity (product). • Entity identification is closely related to entity matching, entity resolution, duplicate detection, and record linkage. • It is a classical problem in federated systems that deal with data from multiple sources.
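To make the problem concrete, the sketch below applies one hand-written matching rule to pairs of search result records. The slide's goal is to derive such rules automatically; this example only shows what a rule might look like once derived. The record fields, the `0.85` threshold, and the second record's source are assumptions.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and collapse whitespace so formatting differences vanish."""
    return " ".join(text.lower().split())


def same_entity(r1, r2, threshold=0.85):
    """Rule: same publication year and highly similar normalized titles."""
    if r1["year"] != r2["year"]:
        return False
    similarity = SequenceMatcher(
        None, normalize(r1["title"]), normalize(r2["title"])).ratio()
    return similarity >= threshold


# Two records for the same book as they might come back from different WDBs.
record_a = {"title": "Deep Web Query Interface Understanding and Integration",
            "year": 2012}
record_b = {"title": "Deep web  query interface understanding and integration ",
            "year": 2012}
record_c = {"title": "Database Systems: The Complete Book", "year": 2008}

print(same_entity(record_a, record_b))
print(same_entity(record_a, record_c))
```

The first pair matches because the titles are identical after normalization; the second pair is rejected on the year check alone.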
Remaining Research Challenges (1) 1. Automatic WDB discovery Goal: Discover Web database interfaces from the Web automatically. Some issues to consider: • How to identify Web pages that have a search interface? • There is already some existing work on this. • How to differentiate search interfaces for Web databases from those for text search engines? • Is the information from the search interface sufficient? Do we need information from search results? • How to learn a classifier?
Remaining Research Challenges (2) 2. Extraction and understanding of dynamic query interfaces • An increasing number of query interfaces are dynamic in the sense that the query interface may change after certain fields are selected. Two types of dynamic changes have been observed: • Changes to the values of some fields (e.g., the values in a selection list). • Changes to the structure of the query interface (e.g., some fields are added, deleted, or modified). • Current query interface models do not consider dynamic query interfaces.
Remaining Research Challenges (3) 3. Handling boundary query interfaces in Web-scale clustering • There are two challenges in Web-scale clustering of query interfaces [Madhavan et al., 2007; Mahmoud and Aboulnaga, 2010]: • The number of domains is unknown in advance, which means that the number of clusters is unknown in advance. • There are likely many query interfaces with unclear domains, i.e., they lie on the boundaries between multiple domains. • The current solutions are not sufficiently accurate and leave significant room for improvement.
Remaining Research Challenges (4) 4. Web database selection Goal: For any given user query, identify the Web databases that are most likely to return good results. Some issues to consider: • How to summarize the content of a Web database? • Numerical attributes • Categorical attributes • Textual attributes • Relationships among the attributes
Remaining Research Challenges (5) Web database selection (continued) • How to obtain the summaries automatically? • How to design sample queries for each type of attribute? • How to use the summaries to do Web database selection? • How to measure “usefulness” based on different types of attributes? • How to combine “usefulness” across different attributes?
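One possible way to frame these open questions is sketched below. It is purely illustrative: the summaries (a value range for numerical attributes, value frequencies for categorical ones), the uniform-distribution assumption, and the product used to combine per-attribute scores are all assumptions, as are the database names and numbers.

```python
def numeric_score(value_range, wanted_range):
    """Fraction of the database's value range that overlaps the query range,
    assuming values are spread uniformly over the range."""
    low, high = value_range
    q_low, q_high = wanted_range
    if high <= low:
        return 0.0
    overlap = min(high, q_high) - max(low, q_low)
    return max(0.0, overlap) / (high - low)


def categorical_score(frequencies, wanted_value):
    """Fraction of records carrying the wanted value (0 if never seen)."""
    return frequencies.get(wanted_value, 0.0)


def usefulness(summary, query):
    """Combine per-attribute scores by product (independence assumption)."""
    score = 1.0
    for attribute, wanted in query.items():
        stats = summary.get(attribute)
        if stats is None:
            return 0.0
        if isinstance(stats, tuple):
            score *= numeric_score(stats, wanted)
        else:
            score *= categorical_score(stats, wanted)
    return score


# Hypothetical summaries of two car databases.
summaries = {
    "cars_a": {"price": (10000, 30000), "make": {"Toyota": 0.5, "Honda": 0.5}},
    "cars_b": {"price": (40000, 90000), "make": {"Toyota": 0.1}},
}
query = {"price": (15000, 20000), "make": "Toyota"}

ranked = sorted(summaries, key=lambda db: usefulness(summaries[db], query),
                reverse=True)
print(ranked)
```

Textual attributes and relationships among attributes are not covered; handling them, and replacing the independence assumption, is exactly where the listed research questions come in.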
Remaining Research Challenges (6) 5. Automatic SRR extraction from complex result pages Goal: Automatically identify the rules to extract search result records from complex result pages. Some characteristics of complex result pages: • Records contain both text and images. • SRRs may be organized into multiple columns or multiple sections. • SRRs have a variety of formats. • Pages have no fixed sections (i.e., some sections appear only in some result pages). • Some SRRs are divided into multiple blocks.
Remaining Research Challenges (7) 6. Global query processing and optimization Goal: Evaluate global queries efficiently and correctly. Some issues to consider: • It consists of many steps: • Identify relevant Web databases (global cost) • Translate/map global queries to local queries (global cost) • Submit queries and receive results (communication cost) • Evaluate translated queries by local Web databases (local cost) • Extract search results from result pages (global cost) • Filter out unqualified results (global cost) • How to optimize the above process? • What are the differences between Web integration systems and multidatabase/federated database systems?
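The cost components listed above can be put into a simple response-time model, sketched here under stated assumptions: global steps run sequentially at the integration system, while queries are dispatched to all selected databases in parallel, so the remote part is bounded by the slowest database. The `DbCost` class, the function name, and all numbers (in milliseconds) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DbCost:
    """Per-database cost estimates, in milliseconds."""
    translate: float     # map the global query to a local query (global cost)
    communicate: float   # submit the query and receive results (communication cost)
    local_eval: float    # evaluation by the local Web database (local cost)
    extract: float       # extract results from result pages (global cost)


def response_time(select_cost: float, filter_cost: float,
                  dbs: List[DbCost]) -> float:
    """Estimated response time of one global query."""
    before = select_cost + sum(d.translate for d in dbs)
    # Parallel dispatch: wait only for the slowest database.
    remote = max((d.communicate + d.local_eval for d in dbs), default=0.0)
    after = sum(d.extract for d in dbs) + filter_cost
    return before + remote + after


selected = [
    DbCost(translate=1.0, communicate=120.0, local_eval=80.0, extract=10.0),
    DbCost(translate=1.0, communicate=200.0, local_eval=50.0, extract=12.0),
]
print(response_time(select_cost=5.0, filter_cost=2.0, dbs=selected))  # 281.0
```

Under this model the remote part dominates, which suggests that selecting fewer or faster databases may matter more than speeding up the global steps.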