Measuring Referential Integrity in Distributed Databases
Dhara Shah
Introduction
• Distributed database: multiple databases residing at different locations that communicate over the Internet.
• Referential integrity can be violated when similar content arrives from different sources.
• Goal: identify referential integrity problems in order to detect and avoid inconsistency or incompleteness.
• A promising alternative for detecting and fixing data quality issues in scientific databases.
Assumptions
• Every site holds the same tables, but with different content.
• Rows may have null values for the primary key.
• Metadata has already been integrated.
• Content may be inconsistent due to both local and global issues.
• Updates are broadcast independently and asynchronously.
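Column Metrics
• Metrics are measured on a scale of [0, 1] (1 being optimal).
• $\mathrm{lrcom}(T_i.K) = |T_i \bowtie_{K} T_j| \,/\, |T_i|$
• $\mathrm{grcom}(T_i.K) = |T_i \bowtie_{K} T_j| \,/\, |T_i|$
• $\mathrm{lrcon}(T_i.F) = |T_i \bowtie_{K,F} T_j| \,/\, |T_i|$
• $\mathrm{grcon}(T_i.K, T_i.F) = |T_i \bowtie_{K,F} T_j| \,/\, |T_i|$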
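The column metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the numerators denote the size of a join of T_i with T_j on K (completeness) or on K and F (consistency), that K is unique in T_j once null keys are dropped, and that both tables use the same column names. The function names `rcom` and `rcon` and the `orders`/`customers` data are hypothetical.

```python
def rcom(referencing_rows, referenced_rows, key):
    """Share of rows in T_i whose key value also appears as a key in T_j."""
    if not referencing_rows:
        return 1.0  # assumption: an empty table counts as fully complete
    # Null keys never join, so they are left out of the lookup set.
    keys = {r[key] for r in referenced_rows if r[key] is not None}
    matched = sum(1 for r in referencing_rows if r[key] in keys)
    return matched / len(referencing_rows)


def rcon(referencing_rows, referenced_rows, key, col):
    """Share of rows in T_i whose (key, col) pair also appears in T_j."""
    if not referencing_rows:
        return 1.0  # assumption: an empty table counts as fully consistent
    pairs = {(r[key], r[col]) for r in referenced_rows if r[key] is not None}
    matched = sum(1 for r in referencing_rows if (r[key], r[col]) in pairs)
    return matched / len(referencing_rows)


# Hypothetical data: orders references customers through cust_id,
# and city is a column copied from customers into orders.
customers = [
    {"cust_id": 1, "city": "Houston"},
    {"cust_id": 2, "city": "Dallas"},
]
orders = [
    {"cust_id": 1, "city": "Houston"},   # valid reference, same city
    {"cust_id": 2, "city": "Austin"},    # valid reference, city disagrees
    {"cust_id": None, "city": "Dallas"}, # null key
    {"cust_id": 9, "city": "Dallas"},    # dangling reference
]

completeness = rcom(orders, customers, "cust_id")
consistency = rcon(orders, customers, "cust_id", "city")
```

With this data, two of the four order rows join on the key and only one also agrees on the copied column, so consistency can never exceed completeness under this reading.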
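Table Metrics
• $\mathrm{gcur}(T_i) = \dfrac{|D_1.T_i \cap D_2.T_i \cap \cdots \cap D_n.T_i|}{|D_1.T_i \cup D_2.T_i \cup \cdots \cup D_n.T_i|}$
• $\mathrm{grcom}(T_i) = \dfrac{\sum_{j=1}^{k} |T_i|\,\mathrm{grcom}(T_i.K_j)}{k\,|T_i|}$
• $\mathrm{grcon}(T_i) = \dfrac{\sum_{j=1}^{f} |T_i|\,\mathrm{grcon}(T_i.F_j)}{f\,|T_i|}$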
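A small sketch may help with the table metrics. It assumes each copy of T_i can be represented as a set of hashable row tuples, so that gcur is the number of rows present at every site divided by the number of rows present at any site. It also uses the fact that |T_i| cancels in the grcom and grcon formulas, which leaves a plain average of the column metrics. The names `gcur`, `table_metric` and the sample sites are illustrative only.

```python
def gcur(copies):
    """Rows shared by all copies of a table divided by rows found in any copy."""
    sets = [set(c) for c in copies]
    if not sets:
        raise ValueError("at least one copy of the table is required")
    union = set().union(*sets)
    if not union:
        return 1.0  # assumption: all-empty copies count as fully current
    common = set.intersection(*sets)
    return len(common) / len(union)


def table_metric(column_metrics):
    """Table-level grcom or grcon: |T_i| cancels, leaving the mean over columns."""
    return sum(column_metrics) / len(column_metrics)


# Hypothetical copies of one table held at three sites.
site1 = {(1, "a"), (2, "b"), (3, "c")}
site2 = {(1, "a"), (2, "b"), (4, "d")}
site3 = {(1, "a"), (2, "x"), (3, "c")}

currency = gcur([site1, site2, site3])  # 1 common row out of 5 distinct rows
table_grcom = table_metric([0.5, 1.0])  # two foreign keys in the table
```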
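Database Metrics
• $\mathrm{lrcom}(D_i) = \dfrac{\sum_{j=1}^{m} |T_j|\,\mathrm{lrcom}(T_j)}{\sum_{j=1}^{m} |T_j|}$
• $\mathrm{lrcon}(D_i) = \dfrac{\sum_{j=1}^{m} |T_j|\,\mathrm{lrcon}(T_j)}{\sum_{j=1}^{m} |T_j|}$
• $\mathrm{grcom}(D) = \dfrac{\sum_{j=1}^{m} |T_j|\,\mathrm{grcom}(T_j)}{\sum_{j=1}^{m} |T_j|}$
• $\mathrm{grcon}(D) = \dfrac{\sum_{j=1}^{m} |T_j|\,\mathrm{grcon}(T_j)}{\sum_{j=1}^{m} |T_j|}$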
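All four database metrics share one shape: an average of table metrics weighted by table size, so large tables dominate the score. The sketch below illustrates that aggregation under the assumption that the table-level values have already been computed. `database_metric` and the sample numbers are hypothetical.

```python
def database_metric(tables):
    """Size-weighted average of table metrics.

    tables: list of (row_count, table_metric) pairs, one per table T_j.
    """
    total_rows = sum(rows for rows, _ in tables)
    if total_rows == 0:
        return 1.0  # assumption: a database with no rows has no violations
    return sum(rows * metric for rows, metric in tables) / total_rows


# Hypothetical database with a small clean table and a large noisy one.
score = database_metric([(100, 0.9), (300, 0.5)])  # (90 + 150) / 400
```

An unweighted mean of the same two tables would give 0.7, whereas the weighted form pulls the result toward the larger table.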
Query Optimization
• Local metrics in a single database
  • For a table with several FKs, aggregate by grouping on each FK before joining.
  • Create a secondary index on each FK.
• Global metrics in a distributed database
  • Transfer n-1 copies to a central site.
  • Compute the metrics at one site and then update them incrementally.
  • Compute the metrics for each pair of tables linked by an FK.
  • When a join involves two tables at different sites, transfer the smaller table.
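The local optimizations can be illustrated with SQLite from the Python standard library. This is only a sketch under assumed names (`orders`, `customers`, `cust_id`, `idx_orders_cust`), not the queries used by the authors. The referencing table is first collapsed to one row per FK value, and only that smaller result is joined against the referenced table. A secondary index on the FK column supports the grouping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (cust_id INTEGER PRIMARY KEY);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER);
-- Secondary index on the foreign key column.
CREATE INDEX idx_orders_cust ON orders(cust_id);
INSERT INTO customers VALUES (1), (2);
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2), (13, 9), (14, NULL);
""")

# Group by the FK first, then join the (smaller) grouped result.
# NULL keys form their own group but never satisfy the join condition.
query = """
SELECT COALESCE(SUM(g.cnt), 0)
FROM (SELECT cust_id, COUNT(*) AS cnt
      FROM orders
      GROUP BY cust_id) AS g
JOIN customers AS c ON c.cust_id = g.cust_id
"""
matched = conn.execute(query).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Completeness of orders.cust_id: matched rows over all rows.
lrcom_orders_cust = matched / total
conn.close()
```

Grouping first means the join touches one row per distinct FK value rather than one row per order, which matters most when the FK column has few distinct values relative to the table size.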
Applications
• Applications with scientific databases
  • Central database: needs a fast connection and must be available at all times.
  • Local database: flexible and faster, but may have more referential errors.
• Program:
  • Uses the logical data model (LDM) to calculate the metrics.
  • Has a graphical user interface with a list that explains why errors happened.
Conclusion
• Related work:
  • MOCHA: a middleware system for integrating distributed data sources.
• The metrics measure absolute and relative error with respect to referential integrity.
• They measure completeness and consistency.
• The approach raises new issues, such as distributed query optimization.
Citation
• Authors: Carlos Ordonez, Javier Garcia-Garcia, Zhibo Chen
• Title: Measuring Referential Integrity in Distributed Databases
• Name of Journal: CIMS '07, Proceedings of the ACM first workshop on CyberInfrastructure: information management in eScience
• Publication Date: November 2007
• Page Range: 61-66