Top-K Generation of Integrated Schemas Based on Directed and Weighted Correspondences by Ahmed Radwan, Lucian Popa, Ioana R. Stanoi, Akmal Younis. Presented by Prasanna Kunchavaram (800690762), ITCS 6265, 3rd November 2009.
Schema integration is the process of unifying heterogeneous data sources to obtain a single non-redundant, consistent data source; equivalently, it is the process of combining local schemas into a global, integrated schema. Examples: combining databases or tables after a merger or acquisition of companies; combining two products into one, which requires combining their historic sales data; creating a new employee table from employment data and medical records. Introduction
A correspondence is a match between elements of heterogeneous schemas. Each correspondence has a weight and a direction, based on the confidence of the match. For example, the weight of the correspondence from A to B may differ from the weight of the correspondence from B to A. Previous schema integration tools let users interactively select a desired integrated schema using the surviving correspondences. Introduction (continued)
Previous schema integration techniques: do not consider the direction and weight of correspondences; need user interaction for the final integration decision; make the user laboriously decide easy as well as difficult integrations; are time consuming; are resource intensive. Problem
Example: weighted and directed correspondences. Options for schema integration. Problem (continued)
1. Relationships in the integrated schema are defined using the direction and weight of the correspondences between elements. 2. These relationships are ranked by similarity and coverage to produce the top k schemas. 3. Easy integrations are adopted without user interaction. 4. For difficult integrations, the user can select constraints on the schemas involved. 5. The system generates revised top k schemas that satisfy the constraints. 6. Steps 4 and 5 are repeated until the final schema is obtained. Solution
Example of Easy integration (Integration without user interaction)
A concept is a relation name associated with a set of attributes in a schema. Correspondences between schemas are expressed using concept graphs. A concept graph is a pair (V, has), where V is a set of concepts and has is a set of directed, labeled edges between concepts. A correspondence between concepts across schemas is defined by a pair of weights, one per direction. Consider a pair (C1, C2) of concepts, where C1 is from schema S1 and C2 is from schema S2. The weight of the directed correspondence C1 → C2 is denoted ŝ(C1, C2), and the weight of the directed correspondence C2 → C1 is denoted ŝ(C2, C1). The correspondence between C1 and C2 is then defined by the pair [ŝ(C1, C2), ŝ(C2, C1)]. Concept and Concept graph
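The definitions above can be written down as a small data structure. This is only a minimal sketch: the class names `Concept`, `ConceptGraph`, and `Correspondence`, the field names, and the `dept`/`emp`/`employee` sample data are all hypothetical and do not come from the paper or the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A relation name plus its set of attributes, tagged with its schema."""
    schema: str
    name: str
    attributes: tuple


@dataclass
class ConceptGraph:
    """A pair (V, has): concepts and directed, labeled edges between them."""
    concepts: set = field(default_factory=set)
    has: set = field(default_factory=set)  # edges stored as (parent, label, child)

    def add_has(self, parent, label, child):
        # Adding an edge also registers both endpoints in V.
        self.concepts.update((parent, child))
        self.has.add((parent, label, child))


@dataclass(frozen=True)
class Correspondence:
    """Directed weights between C1 (schema S1) and C2 (schema S2)."""
    c1: Concept
    c2: Concept
    s12: float  # weight of C1 -> C2
    s21: float  # weight of C2 -> C1

    def weights(self):
        return (self.s12, self.s21)


# Hypothetical sample data, for illustration only.
dept = Concept("S1", "dept", ("dno", "dname"))
emp = Concept("S1", "emp", ("eno", "ename"))
employee = Concept("S2", "employee", ("id", "name"))

g1 = ConceptGraph()
g1.add_has(dept, "emps", emp)
corr = Correspondence(emp, employee, s12=0.8, s21=0.6)
```

Keeping the two directed weights in one record mirrors the pair notation on the slide, and makes the asymmetry between the two directions explicit.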
An assignment A is a fixed-size, ordered vector of bits in which each bit X represents the state of one correspondence: 1 means the correspondence is selected and 0 means it is absent. The set of assignments is ranked to obtain the top K assignments. For each bit with value 1, the concepts involved in the respective correspondence are combined. Concepts can be combined in two ways, merge and has, depending on the weight and direction of the similarity. A threshold λ decides which method is used for the combination. Assignment
Here n is the total number of non-zero correspondences. Example: assignment with weights. Cost function (used to rank assignments)
Calculate Ŝi and D̂i, and assign 1 to correspondences where Ŝi > D̂i and 0 where Ŝi < D̂i. The result is the optimal assignment A1 (k = 1). The next k-1 best assignments are obtained by deciding which bits of the assignment vector to flip. Let the vector Δf be the difference between Ŝi and D̂i; Δf quantifies the cost impact of flipping bit i from its current value in A1. For each i, Δf represents the increase in cost with respect to cost(A1) if bit i in A1 were flipped. Sort Δf in increasing order and denote the result Δfs. The 2nd best assignment is obtained by flipping the bit Xi that gives the least cost increase. To compute the 3rd best assignment, flip the variable with the next smallest cost increase and leave Xi unflipped. If there are two choices, select the one that gives the smaller cost increase. The other assignments are calculated likewise. Top K algorithm
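The bit-flipping procedure can be sketched in Python as follows. The slides do not preserve the cost formula, so this sketch assumes a cost of the form cost(A) = Σ (Xi·D̂i + (1 − Xi)·Ŝi), which is consistent with the rule "set Xi = 1 where Ŝi > D̂i" but is an assumption, not the paper's stated definition. The function names `assignment_cost` and `top_k_assignments` are hypothetical, and the heap-based enumeration of flip sets is a standard technique for listing the smallest subset sums, not necessarily the authors' exact implementation.

```python
import heapq


def assignment_cost(bits, s, d):
    # Assumed cost: a selected bit pays d[i], an unselected bit pays s[i].
    return sum(d[i] if bits[i] else s[i] for i in range(len(bits)))


def top_k_assignments(s, d, k):
    """Return up to k (cost, assignment) pairs in non-decreasing cost order."""
    n = len(s)
    best = [1 if s[i] > d[i] else 0 for i in range(n)]  # optimal assignment A1
    base = assignment_cost(best, s, d)
    # Cost increase of flipping each bit, sorted ascending (the sorted delta vector).
    delta = sorted((abs(s[i] - d[i]), i) for i in range(n))
    results = [(base, best)]
    if n == 0:
        return results[:k]
    # Heap entries: (extra cost, last flipped position in delta, flipped positions).
    heap = [(delta[0][0], 0, (0,))]
    while heap and len(results) < k:
        extra, last, flipped = heapq.heappop(heap)
        bits = best[:]
        for pos in flipped:
            bits[delta[pos][1]] ^= 1
        results.append((base + extra, bits))
        if last + 1 < n:
            nxt = delta[last + 1][0]
            # Choice 1: additionally flip the next bit.
            heapq.heappush(heap, (extra + nxt, last + 1, flipped + (last + 1,)))
            # Choice 2: unflip the last bit and flip the next one instead.
            heapq.heappush(heap, (extra - delta[last][0] + nxt, last + 1,
                                  flipped[:-1] + (last + 1,)))
    return results
```

For example, `top_k_assignments([0.9, 0.6, 0.2], [0.1, 0.4, 0.8], 3)` starts from the assignment `[1, 1, 0]` and then flips the bit with the smallest |Ŝi − D̂i| first. The two heap pushes correspond to the "two choices" mentioned on the slide; the heap always yields the cheaper one next.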
As stated before, λ is the threshold used to decide how concepts are combined in an integration. Steps to calculate λ: 1. Iteratively scan all the correspondences in E, where E is the set of correspondences selected by at least one of the top k assignments. 2. For each such correspondence, record max(ŝ1, ŝ2) and add this value to a list L. 3. Set λ to the minimum of the values in L. Tuning λ
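The three steps for tuning λ translate almost directly into code. The sketch below assumes each correspondence is given as a pair of directed weights (ŝ1, ŝ2) and each assignment as a list of bits in the same order; the function name `tune_lambda` and the sample values are hypothetical.

```python
def tune_lambda(weights, top_k):
    """Return the minimum of max(s1, s2) over correspondences selected
    by at least one top-k assignment, or None if none is selected."""
    # Step 1: E = indices of correspondences set to 1 in some assignment.
    selected = [i for i in range(len(weights))
                if any(a[i] == 1 for a in top_k)]
    # Step 2: record the larger of the two directed weights for each.
    recorded = [max(weights[i]) for i in selected]
    # Step 3: lambda is the minimum of the recorded values.
    return min(recorded) if recorded else None


# Hypothetical weights for three correspondences and two top-k assignments.
weights = [(0.9, 0.7), (0.4, 0.6), (0.3, 0.2)]
top_k = [[1, 1, 0], [1, 0, 0]]
lam = tune_lambda(weights, top_k)  # correspondences 0 and 1 are selected
```

Here the third correspondence is never selected, so it does not influence the threshold.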
A top K algorithm for schema integration that executes in polynomial time is developed. Important information, such as the weight and direction of the correspondences, is used to reduce user interaction. Easy integrations are performed by the system without any user interaction while keeping the data consistent. The results indicate that the algorithm can reduce user interaction and thus the time taken to achieve complex schema integration. Future work includes full automation (integration without user interaction) and enhancements to the algorithm so that it can handle a couple of hundred schemas. Conclusion