Semantic Integration in Heterogeneous Databases Using Neural Networks Wen-Syan Li, Chris Clifton Presentation by Jeff Roth

Introduction
• Basic schema matching problem
• GTE’s data integration project included 27,000 data elements
• Manual matching took 4 hours per data element, or 25 full-time employees 2 years to complete
• This method takes 0.1 seconds per data element, 144,000 times faster
• “How to match” knowledge is discovered
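
The figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch; the 2,000-hour working year is an assumption that does not appear on the slide, and it is used only to show that the quoted 25 employees for 2 years (50 person-years) is in the same range as 27,000 elements at 4 hours each.

```python
# Figures quoted on the slide.
DATA_ELEMENTS = 27_000
MANUAL_HOURS_PER_ELEMENT = 4
AUTOMATED_SECONDS_PER_ELEMENT = 0.1

# Assumption (not from the slide): hours in one full-time working year.
HOURS_PER_WORK_YEAR = 2_000

# Total manual effort for the whole project.
total_manual_hours = DATA_ELEMENTS * MANUAL_HOURS_PER_ELEMENT
person_years = total_manual_hours / HOURS_PER_WORK_YEAR

# Speedup per element: 4 hours expressed in seconds, divided by 0.1 seconds.
speedup = (MANUAL_HOURS_PER_ELEMENT * 3600) / AUTOMATED_SECONDS_PER_ELEMENT

print(f"manual effort: {total_manual_hours} hours, about {person_years:.0f} person-years")
print(f"speedup per element: about {speedup:,.0f}x")
```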

Method Outline
“The end user is able to distinguish between unreasonable and reasonable answers, and exact results aren’t critical. This method allows a user to obtain reasonable answers requiring database integration at a low cost”

Automated semantic integration methods
• Attribute name comparison: not used in this paper
• Attribute values and domains comparison: typically measured as Equal, Contains, Overlap, Contained-in and Disjoint. Values and domains are used in this method, but not with these measures
• Field specifications: data type, field length, constraints and others. These are also used in this method

Field Specifications
The following measures are used
• Data types: each possible data type has its own network input; the input for the field's data type has a value of 1 and all the others have a value of 0
• Field length: Length = 2 * (1/(1 + k^(-length)) - 0.5), which maps any length into the range [0, 1) for a constant k > 1
• Format specifications: encoded similarly to data type
• Constraints (primary key, foreign key, disallowing nulls, access restrictions, etc.): encoded similarly to data type
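
The two encodings on this slide can be sketched as follows. The list of data types, the function names and the value k = 1.01 are illustrative assumptions, not details taken from the slides; the length formula is read as 2 * (1/(1 + k^(-length)) - 0.5).

```python
# Hypothetical list of data types; a real system would take this from the DBMS.
DATA_TYPES = ["char", "number", "date"]


def encode_data_type(data_type, known_types=DATA_TYPES):
    """One network input per possible data type: 1 for the field's type, 0 otherwise."""
    return [1.0 if t == data_type else 0.0 for t in known_types]


def normalize_length(length, k=1.01):
    """Squash a field length into [0, 1). The value of k is an assumed constant > 1."""
    return 2.0 * (1.0 / (1.0 + k ** (-length)) - 0.5)


# Example: a numeric field declared with length 10.
inputs = encode_data_type("number") + [normalize_length(10)]
print(inputs)
```

Constraints and format specifications could be encoded the same way as the data type, with one 0/1 input per possible value.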

Attribute Values and Domains
Divide measures into character fields and numeric fields
• Patterns for character fields
  1. Ratio of numerical characters. Address: 146 South 920 West would score 6/18
  2. Ratio of white space. Address: 146 South 920 West would score 3/18
  3. Length statistics: average, variance, and coefficient of variation of the “used” length relative to the maximum length
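
The character-field patterns above can be sketched as below. The function names are hypothetical, and the use of population (rather than sample) variance is an assumption, since the slide does not say which is meant.

```python
import statistics


def digit_ratio(value):
    """Fraction of characters in the value that are digits."""
    return sum(ch.isdigit() for ch in value) / len(value) if value else 0.0


def whitespace_ratio(value):
    """Fraction of characters in the value that are white space."""
    return sum(ch.isspace() for ch in value) / len(value) if value else 0.0


def length_stats(values, max_length):
    """Mean, variance and coefficient of variation of used length / maximum length."""
    used = [len(v) / max_length for v in values]
    mean = statistics.mean(used)
    variance = statistics.pvariance(used)  # population variance (assumption)
    cv = statistics.pstdev(used) / mean if mean else 0.0
    return mean, variance, cv


address = "146 South 920 West"
print(digit_ratio(address), whitespace_ratio(address))
```

For the address on the slide, the string has 18 characters, 6 of them digits and 3 of them spaces, which matches the 6/18 and 3/18 scores.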

Attribute Values and Domains (cont.)
• Patterns for numeric fields
  1. Average (mean)
  2. Variance
  3. Coefficient of variation: recognizes similarity between values of different units and granularity. This can also help recognize which fields may need unit conversions
  4. Grouping, for example: area code, zip code, first three digits of SSN
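
A small sketch of why the coefficient of variation helps with units and granularity: it is the standard deviation divided by the mean, so rescaling every value by a constant leaves it unchanged. The sample salaries and the function name are made up for illustration, and population statistics are an assumption.

```python
import statistics


def numeric_signature(values):
    """Mean, variance and coefficient of variation of a numeric field."""
    mean = statistics.mean(values)
    variance = statistics.pvariance(values)  # population variance (assumption)
    cv = statistics.pstdev(values) / abs(mean) if mean else 0.0
    return {"mean": mean, "variance": variance, "cv": cv}


# The same hypothetical salaries stored in dollars and in thousands of dollars.
salary_dollars = numeric_signature([30000, 45000, 60000])
salary_thousands = numeric_signature([30, 45, 60])

# Mean and variance differ by the unit scale, but the coefficient of variation agrees.
print(salary_dollars["cv"], salary_thousands["cv"])
```

Two fields with matching coefficients of variation but very different means are candidates for the same attribute stored in different units.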

Self-Organizing Grouping algorithm
• N = number of possible discriminators
• M = number of categories; this can be adjusted by the user. “ideally this is |attributes| - |foreign keys|”
• This is unsupervised, i.e. you don’t have to provide a correct classification; it simply groups attributes based on similarity
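
The idea of unsupervised grouping can be illustrated with a simple stand-in. The sketch below is not the classifier from the paper; it is a threshold-based clustering in which each N-dimensional discriminator vector joins the nearest existing category if it lies within `max_radius`, and otherwise starts a new one. The `max_radius` parameter is a hypothetical knob that plays the role of the user's control over M.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_attributes(vectors, max_radius):
    """Group vectors without labels; returns (category centers, member indices)."""
    centers = []  # running mean of each category
    members = []  # indices of the vectors assigned to each category
    for idx, vec in enumerate(vectors):
        best, best_dist = None, None
        for c, center in enumerate(centers):
            dist = euclidean(vec, center)
            if best_dist is None or dist < best_dist:
                best, best_dist = c, dist
        if best is not None and best_dist <= max_radius:
            members[best].append(idx)
            n = len(members[best])
            # Move the center to the mean of its members.
            centers[best] = [m + (x - m) / n for m, x in zip(centers[best], vec)]
        else:
            centers.append(list(vec))
            members.append([idx])
    return centers, members


# Four hypothetical attributes described by N = 2 discriminators each.
vectors = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
centers, members = group_attributes(vectors, max_radius=0.3)
print(len(centers), members)
```

A smaller radius yields more categories (larger M), and a larger radius merges more attributes into each category.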

Training the Back-Prop Network
• Inputs (N) are identical to the classifier's inputs
• Outputs (M) are trained using back-propagation, with the classifier’s results as the targets
• Categories are labeled with the attributes they grouped together
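
A minimal sketch of this training step, assuming a single hidden layer of sigmoid units and squared error. The hidden layer size, learning rate, epoch count, seed and sample vectors are all illustrative assumptions; the category numbers stand in for the classifier's output.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BackPropNet:
    """Tiny N-H-M feed-forward network trained by back-propagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # One row of weights per unit; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_hidden]
        hb = hidden + [1.0]
        out = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w_out]
        return hidden, out

    def train_step(self, x, target, lr=0.5):
        hidden, out = self.forward(x)
        xb, hb = list(x) + [1.0], hidden + [1.0]
        # Output deltas for squared error with sigmoid units.
        d_out = [(o - t) * o * (1.0 - o) for o, t in zip(out, target)]
        # Hidden deltas: push the output deltas back through the output weights.
        d_hidden = [h * (1.0 - h) * sum(d_out[k] * self.w_out[k][j] for k in range(len(out)))
                    for j, h in enumerate(hidden)]
        for k, row in enumerate(self.w_out):
            for j in range(len(row)):
                row[j] -= lr * d_out[k] * hb[j]
        for j, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                row[i] -= lr * d_hidden[j] * xb[i]
        return sum((o - t) ** 2 for o, t in zip(out, target))


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def train(net, vectors, categories, n_categories, epochs=2000, lr=0.5):
    """Train on (discriminator vector, category) pairs; returns the error per epoch."""
    errors = []
    for _ in range(epochs):
        errors.append(sum(net.train_step(x, one_hot(c, n_categories), lr)
                          for x, c in zip(vectors, categories)))
    return errors


# Hypothetical training data: N = 3 discriminators, M = 2 categories from the classifier.
vectors = [[1.0, 0.0, 0.1], [0.9, 0.0, 0.2], [0.0, 1.0, 0.8], [0.0, 0.9, 0.9]]
categories = [0, 0, 1, 1]
net = BackPropNet(n_in=3, n_hidden=4, n_out=2)
errors = train(net, vectors, categories, n_categories=2)
print(errors[0], errors[-1])
```

After training, `net.forward(x)[1]` gives M output activations for a new attribute's discriminator vector, and the strongest output suggests which category it most resembles.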

What is the classifier for?
• Ease of training: “ideally [M] is |attributes| - |foreign keys|”, and it is less computationally expensive to train M categories when M < |attributes| - |foreign keys|
• It is less computationally complex to compare new elements to the M categories than to every attribute of the training database, or to |attributes| - |foreign keys| of them
• Networks can still be trained when some attributes are identical, because identical attributes fall into the same category

Integration Procedure
1. DBMS Specific Parser
2. Classify (Categorize) Training Data
3. Train Neural Network
4. DBMS Specific Parser
5. Classification by Neural Network
6. User Checks Results

Conclusion and Future Work
• Human effort needed for semantic integration is minimized
• Different systems have different attribute properties available, which calls for an automated solution
• Future work: extend to automated information integration
• C source code available at eecs.nwu.edu/pub/semint