Top-K Generation of Integrated Schemas Based on Directed and Weighted Correspondences

By Ahmed Radwan, Lucian Popa, Ioana R. Stanoi, Akmal Younis. Presented by Prasanna Kunchavaram (800690762), ITCS 6265, 3rd November 2009.

Introduction

Schema integration is the process of unifying heterogeneous data sources to obtain a single, non-redundant, consistent data source. Put another way, it is the process of combining local schemas into a global, integrated schema.

Examples: combining databases or tables after a merger or acquisition of companies; combining two products into one, with the resulting combination of historic sales data; creating a new table for employees from employment data and medical records.

Introduction (continued)

A correspondence is a match between elements of heterogeneous schemas. Each correspondence has a weight and a direction, based on the confidence of the match. For example, the weight of the correspondence from A to B might differ from the weight of the correspondence from B to A. Previous schema integration tools provide an interactive option for users to select a desired integrated schema using the surviving correspondences.

Problem

Previous schema integration techniques do not consider the direction and weight of correspondences, and they need user interaction for the final integration decision. The user must laboriously select the easy integrations as well as the difficult ones, which is time consuming and resource intensive.

Problem (continued)

Example: weighted and directed correspondence
Options for schema integration

Solution

1. Relationships in the integrated schema are defined using the direction and weight of the correspondences between elements.
2. These relationships are ranked based on priority of similarity and coverage to produce the top k schemas.
3. Easy integrations are adopted without user interaction.
4. For difficult integrations, the user is given the option to select constraints on the schemas involved.
5. The system generates revised top k schemas that satisfy the constraints.
6. Steps 4 and 5 are repeated until the final schema is obtained.

Example of easy integration (integration without user interaction)

Concept and Concept graph

A concept is a relation name associated with a set of attributes in a schema. Correspondences between schemas are expressed using a concept graph. A concept graph is a pair (V, has), where V is a set of concepts and has is a set of directed and labeled edges between concepts.

The correspondence of concepts across schemas is defined by a pair of weights, one for each direction. Consider a pair (C1, C2) of concepts, where C1 is from schema S1 and C2 is from schema S2. The weight of the directed correspondence C1 → C2 is denoted by ŝ(C1, C2), and the weight of the directed correspondence C2 → C1 is denoted by ŝ(C2, C1). The correspondence of concepts across S1 and S2 is then defined by the pair [ŝ(C1, C2), ŝ(C2, C1)].
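As a rough illustration, the definitions above could be held in a small data structure. This is only a minimal sketch: the class names, the helper `correspondence`, and the example concepts and weights are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A relation name with its set of attributes (example names are made up)."""
    schema: str
    name: str
    attributes: frozenset


@dataclass
class ConceptGraph:
    """A pair (V, has): a set of concepts plus directed, labeled 'has' edges."""
    concepts: set = field(default_factory=set)
    has: set = field(default_factory=set)  # entries are (parent, label, child)

    def add_has(self, parent, label, child):
        self.concepts.update({parent, child})
        self.has.add((parent, label, child))


def correspondence(weights, c1, c2):
    """Return the pair [s(C1, C2), s(C2, C1)]; a missing direction counts as 0."""
    return (weights.get((c1, c2), 0.0), weights.get((c2, c1), 0.0))


# Hypothetical concepts from two schemas S1 and S2.
employee = Concept("S1", "Employee", frozenset({"id", "name"}))
worker = Concept("S2", "Worker", frozenset({"id"}))

# Directed weights: the two directions are stored separately and may differ.
weights = {(employee, worker): 0.9, (worker, employee): 0.6}
pair = correspondence(weights, employee, worker)  # (0.9, 0.6)
```

Keeping the two directions as separate dictionary entries makes it explicit that ŝ(C1, C2) and ŝ(C2, C1) are independent values.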

Example: Concept graph

Assignment

An assignment A is a fixed-size, ordered vector of bits, where each bit X represents the state of one correspondence: the value 1 represents a correspondence and the value 0 represents the absence of a correspondence. The set of assignments is ranked to get the top K assignments.

For each bit with value 1, the concepts involved in the respective correspondence should be combined. There are two ways in which concepts can be combined, depending on the weight and direction of the similarity: merge and has. A threshold λ is used to decide which of the two methods is used for the combination.
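The sketch below shows one way an assignment vector could be decoded. The exact rule that λ applies is not spelled out in the text, so `combine_method` uses an assumed rule (merge when both directed weights reach λ, otherwise has); the function names and the sample correspondences are hypothetical as well.

```python
def selected(assignment, correspondences):
    """Return the correspondences whose bit is 1 in the assignment vector."""
    return [c for bit, c in zip(assignment, correspondences) if bit == 1]


def combine_method(s12, s21, lam):
    """ASSUMED rule, not taken from the text: merge when both directed
    weights reach the threshold, otherwise relate the concepts with 'has'."""
    return "merge" if min(s12, s21) >= lam else "has"


# Hypothetical correspondences: (concept in S1, concept in S2, s12, s21).
correspondences = [
    ("Employee", "Worker", 0.9, 0.8),
    ("Dept", "Worker", 0.7, 0.2),
    ("Dept", "Unit", 0.3, 0.4),
]
assignment = [1, 1, 0]  # one bit per correspondence, in the same order
lam = 0.6

decisions = {
    (c1, c2): combine_method(s12, s21, lam)
    for c1, c2, s12, s21 in selected(assignment, correspondences)
}
# decisions == {("Employee", "Worker"): "merge", ("Dept", "Worker"): "has"}
```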

Example of λ effect on integration decision

Algorithm

Cost function (used to rank assignments)

In the cost function, n is the total number of non-zero correspondences.

Example: assignment with weights

Top K algorithm

1. Calculate Ŝi and D̂i, and assign 1 to the correspondences where Ŝi > D̂i and 0 where Ŝi < D̂i. The result is the optimal assignment A1 (k = 1).
2. The next k-1 best assignments are obtained by deciding which bits of the assignment vector to flip. Let the vector Δf be the difference between Ŝi and D̂i. For each i, Δf quantifies the cost impact of flipping bit i from its current value in A1, that is, the increase in cost with respect to cost(A1) if bit i in A1 were flipped.
3. Sort Δf in increasing order and denote the result Δfs.
4. Find the next assignment that minimizes the increase in cost: the 2nd best assignment is obtained by flipping the bit Xi that gives the least cost increase.
5. To compute the 3rd best assignment, change the variable with the next smallest cost increase and leave Xi unflipped. If there are two choices, select the one that gives the smaller cost increase.
6. The remaining assignments are calculated likewise.
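The procedure can be sketched in Python under two labeled assumptions. First, the cost formula itself does not appear in the text above, so the sketch assumes that setting bit i to 1 costs D̂i and setting it to 0 costs Ŝi, which is consistent with choosing 1 whenever Ŝi > D̂i. Second, the "select the cheaper choice" step is implemented here with a min-heap over sets of flipped bits, a standard way to enumerate flip sets in order of increasing total cost; it is not necessarily how the authors implement it. The function name and sample weights are hypothetical.

```python
import heapq


def top_k_assignments(S, D, k):
    """Return up to k (cost, assignment) pairs in order of increasing cost.

    ASSUMPTION: cost(A) = sum of D[i] where A[i] == 1 and S[i] where A[i] == 0.
    """
    n = len(S)
    # Optimal assignment A1: bit is 1 exactly where S[i] > D[i] (ties give 0).
    best = [1 if s > d else 0 for s, d in zip(S, D)]
    base_cost = sum(d if x else s for x, s, d in zip(best, S, D))

    # Cost increase of flipping each bit of A1, sorted in increasing order.
    order = sorted(range(n), key=lambda i: abs(S[i] - D[i]))
    delta = [abs(S[i] - D[i]) for i in order]

    results = [(base_cost, best)]
    if n == 0:
        return results[:k]

    # Heap entries: (cost increase, last position used, positions flipped).
    heap = [(delta[0], 0, (0,))]
    while heap and len(results) < k:
        inc, last, flips = heapq.heappop(heap)
        assignment = best[:]
        for pos in flips:
            assignment[order[pos]] ^= 1  # flip this bit relative to A1
        results.append((base_cost + inc, assignment))
        if last + 1 < n:
            nxt = last + 1
            # Choice 1: keep the current flips and also flip the next bit.
            heapq.heappush(heap, (inc + delta[nxt], nxt, flips + (nxt,)))
            # Choice 2: un-flip the last bit and flip the next bit instead.
            heapq.heappush(
                heap, (inc - delta[last] + delta[nxt], nxt, flips[:-1] + (nxt,))
            )
    return results[:k]


# Hypothetical weights for three correspondences.
ranked = top_k_assignments([0.9, 0.2, 0.6], [0.1, 0.8, 0.5], 4)
# Assignments in rank order: [1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 1, 0]
```

In this example the third bit has the smallest difference, so flipping it alone yields the 2nd best assignment; the 3rd best flips only the second bit, and the 4th best flips both.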

Top K Algorithm: Example

Tuning λ

As stated before, λ is the threshold used to decide how concepts are combined in an integration.

Steps to calculate λ:
1. Iteratively scan all the correspondences in E, where E is the set of correspondences that are selected by at least one of the top k assignments.
2. For each such correspondence, record max(ŝ1, ŝ2) and add this value to a list L.
3. Set λ to be the minimum of the values in L.
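The three steps translate almost directly into code. The following is a minimal sketch; the function name `tune_lambda`, the input layout (one pair of directed weights per correspondence, one bit vector per assignment), and the sample values are assumptions made for illustration.

```python
def tune_lambda(weights, top_k):
    """Compute lambda from the correspondences used by the top k assignments.

    weights[i] is the pair (s1, s2) of directed weights of correspondence i;
    top_k is a list of bit vectors, one per assignment.
    """
    # Step 1: E = correspondences selected by at least one top-k assignment.
    E = [i for i in range(len(weights)) if any(a[i] == 1 for a in top_k)]
    # Step 2: record max(s1, s2) for each correspondence in E.
    L = [max(weights[i]) for i in E]
    # Step 3: lambda is the minimum of the recorded values (None if E is empty).
    return min(L) if L else None


# Hypothetical directed weights for three correspondences.
weights = [(0.9, 0.8), (0.7, 0.2), (0.3, 0.4)]
lam = tune_lambda(weights, [[1, 1, 0], [1, 0, 0]])  # 0.7
```

With this choice, every correspondence selected by some top k assignment has at least one directed weight that reaches λ.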

Example of Schema Integration with different λ values

Results

Conclusion

A top K algorithm for schema integration that executes in polynomial time has been developed. Important information such as the weight and direction of each correspondence is used to reduce user interaction. Easy integrations are performed by the system without any user interaction while keeping the data consistent. The results indicate that the algorithm can reduce user interaction and thus the time taken to achieve complex schema integration. Future work includes automation (integration without user interaction) and enhancements that allow the algorithm to handle a couple of hundred schemas.

Questions?
