
Measuring Referential Integrity in Distributed Databases

Measuring Referential Integrity in Distributed Databases. Dhara Shah. Introduction. Distributed database: multiple databases residing at different locations that communicate over the Internet. Referential integrity can be violated when similar content comes from different sources.


Presentation Transcript


  1. Measuring Referential Integrity in Distributed Databases. Dhara Shah

  2. Introduction
     • Distributed database: multiple databases residing at different locations that communicate over the Internet.
     • Referential integrity can be violated when similar content comes from different sources.
     • Goal: identify referential integrity problems in order to detect and avoid inconsistency or incompleteness.
     • A promising alternative for detecting and fixing data quality issues in scientific databases.

  3. Assumptions
     • Same tables but different content.
     • Rows may have null values for the primary key.
     • Metadata has been integrated beforehand.
     • Content may be inconsistent due to both local and global issues.
     • Updates are broadcast independently and asynchronously.

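  4. Column Metrics
     • Metrics are measured on a scale of [0, 1], with 1 being optimal.
     • $\mathrm{lrcom}(T_i.K) = |T_i \bowtie_{K} T_j| \,/\, |T_i|$
     • $\mathrm{grcom}(T_i.K) = |T_i \bowtie_{K} T_j| \,/\, |T_i|$
     • $\mathrm{lrcon}(T_i.F) = |T_i \bowtie_{K,F} T_j| \,/\, |T_i|$
     • $\mathrm{grcon}(T_i.K, T_i.F) = |T_i \bowtie_{K,F} T_j| \,/\, |T_i|$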
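The column metrics above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. It assumes that the "com" metrics measure completeness (the fraction of rows in the referencing table whose key K matches a row in the referenced table), that the "con" metrics measure consistency (the fraction whose pair (K, F) matches), and that K is unique in the referenced table, so the join size equals the number of matching rows. The function names, the sample tables, and the choice to score an empty table as 1.0 are all assumptions made for the example.

```python
def completeness(referencing, referenced, key):
    """Fraction of rows in `referencing` whose `key` value exists in `referenced`."""
    if not referencing:
        return 1.0  # assumption: an empty table has nothing to violate
    valid_keys = {row[key] for row in referenced if row[key] is not None}
    matched = sum(1 for row in referencing if row[key] in valid_keys)
    return matched / len(referencing)


def consistency(referencing, referenced, key, column):
    """Fraction of rows in `referencing` whose (key, column) pair exists in `referenced`."""
    if not referencing:
        return 1.0  # same assumption as above
    valid_pairs = {
        (row[key], row[column]) for row in referenced if row[key] is not None
    }
    matched = sum(
        1 for row in referencing if (row[key], row[column]) in valid_pairs
    )
    return matched / len(referencing)


# Hypothetical data: `sales` references `cities` through city_id and also
# stores a copy of the state column.
cities = [
    {"city_id": 1, "state": "TX"},
    {"city_id": 2, "state": "CA"},
]
sales = [
    {"city_id": 1, "state": "TX"},     # valid key, matching state
    {"city_id": 2, "state": "TX"},     # valid key, wrong state
    {"city_id": 3, "state": "CA"},     # key missing from cities
    {"city_id": None, "state": "TX"},  # null key
]

print(completeness(sales, cities, "city_id"))          # 0.5
print(consistency(sales, cities, "city_id", "state"))  # 0.25
```

Both values fall in [0, 1], and a table with no referential errors scores 1 on both.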

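  5. Table Metrics
     • $\mathrm{gcur}(T_i) = |D_1.T_i \cap D_2.T_i \cap \cdots \cap D_n.T_i| \,/\, |D_1.T_i \cup D_2.T_i \cup \cdots \cup D_n.T_i|$
     • $\mathrm{grcom}(T_i) = \sum_{j=1}^{k} |T_i|\,\mathrm{grcom}(T_i.K_j) \,/\, (k\,|T_i|)$
     • $\mathrm{grcon}(T_i) = \sum_{j=1}^{f} |T_i|\,\mathrm{grcon}(T_i.F_j) \,/\, (f\,|T_i|)$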
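A minimal Python sketch of the table metrics follows. It treats each site's copy of a table as a set of row tuples, so gcur becomes the size of the intersection of all copies divided by the size of their union. In the grcom and grcon formulas the factor |Ti| appears in both numerator and denominator and cancels, so each table metric reduces to the plain average of its column metrics. The function names, the sample rows, and the choice to return 1.0 when no rows exist anywhere are assumptions made for this example.

```python
def gcur(copies):
    """Rows present at every site divided by rows present at any site."""
    row_sets = [set(copy) for copy in copies]
    if not row_sets:
        return 1.0  # assumption: no copies means nothing is out of date
    union = set().union(*row_sets)
    if not union:
        return 1.0  # assumption: all copies are empty
    common = set.intersection(*row_sets)
    return len(common) / len(union)


def table_metric(column_metrics):
    """Average of the per-column metrics; |Ti| cancels out of the slide's formula."""
    return sum(column_metrics) / len(column_metrics)


# Hypothetical copies of the same table held at three sites.
site1 = [(1, "a"), (2, "b"), (3, "c")]
site2 = [(1, "a"), (2, "b")]
site3 = [(1, "a"), (2, "x")]

print(gcur([site1, site2, site3]))  # 0.25: one shared row out of four distinct rows
print(table_metric([0.5, 1.0]))     # 0.75
```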

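  6. Database Metrics
     • $\mathrm{lrcom}(D_i) = \sum_{j=1}^{m} |T_j|\,\mathrm{lrcom}(T_j) \,/\, \sum_{j} |T_j|$
     • $\mathrm{lrcon}(D_i) = \sum_{j=1}^{m} |T_j|\,\mathrm{lrcon}(T_j) \,/\, \sum_{j} |T_j|$
     • $\mathrm{grcom}(D) = \sum_{j=1}^{m} |T_j|\,\mathrm{grcom}(T_j) \,/\, \sum_{j} |T_j|$
     • $\mathrm{grcon}(D) = \sum_{j=1}^{m} |T_j|\,\mathrm{grcon}(T_j) \,/\, \sum_{j} |T_j|$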
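All four database metrics share one shape: an average of the table metrics weighted by table size, so large tables influence the result more than small ones. The sketch below shows that computation. The function name `database_metric`, the table sizes, and the metric values are hypothetical, as is the choice to return 1.0 for a database with no rows.

```python
def database_metric(tables):
    """Size-weighted average of table metrics.

    `tables` is a list of (row_count, table_metric) pairs.
    """
    total_rows = sum(rows for rows, _ in tables)
    if total_rows == 0:
        return 1.0  # assumption: an empty database has nothing to violate
    return sum(rows * metric for rows, metric in tables) / total_rows


# Hypothetical database: a small clean table and a large, less complete one.
tables = [(100, 0.9), (300, 0.5)]
print(database_metric(tables))  # (100*0.9 + 300*0.5) / 400, about 0.6
```

An unweighted average of the same two tables would give 0.7, so the weighting pulls the result toward the larger table.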

  7. Query Optimization
     • Local metrics in a single database
       • Aggregate by grouping on the FK before joining, for tables with several FKs.
       • Create a secondary index on each FK.
     • Global metrics in a distributed database
       • Transfer n-1 copies to a central site.
       • Compute metrics at one site and then update them incrementally.
       • Compute metrics for each pair of tables linked by an FK.
       • Transfer the smallest table when a join is required between two tables at different sites.
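The first local optimization, aggregating by FK before joining, can be illustrated in Python. The sketch counts the rows for each distinct FK value first and then checks each distinct value against the referenced keys once, so the lookup work is proportional to the number of distinct FK values rather than to the number of rows. It only illustrates the idea and is not the SQL used by the authors. The function name and the data are hypothetical.

```python
from collections import Counter


def completeness_preaggregated(fk_values, referenced_keys):
    """Completeness computed from per-value counts instead of row by row."""
    counts = Counter(fk_values)      # plays the role of GROUP BY fk, COUNT(*)
    total = sum(counts.values())
    if total == 0:
        return 1.0  # assumption: an empty column has nothing to violate
    # "Join" the small aggregate with the referenced keys.
    matched = sum(n for value, n in counts.items() if value in referenced_keys)
    return matched / total


# Hypothetical FK column with repeated values, one dangling value, one null.
fk_values = [1, 1, 1, 2, 3, None]
referenced_keys = {1, 2}
print(completeness_preaggregated(fk_values, referenced_keys))  # 4 of 6 rows match
```

With six rows but only four distinct FK values, the membership check runs four times instead of six. The saving grows with the amount of repetition in the FK column.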

  8. Applications
     • Applications with scientific databases
       • Central database: needs a fast connection and should be available at all times.
       • Local database: flexible and faster, but may have more referential errors.
     • Program:
       • Uses the logical data model (LDM) to calculate metrics.
       • Has a graphical user interface with a list that explains why errors happened.

  9. Conclusion
     • Related work:
     • MOCHA: middleware system to integrate distributed data sources.
     • Metrics that measure absolute and relative error with respect to referential integrity.
     • Measures completeness and consistency.
     • Raises new issues such as distributed query optimization.

  10. Citation
      • Authors: Carlos Ordonez, Javier Garcia-Garcia, Zhibo Chen
      • Title: Measuring Referential Integrity in Distributed Databases
      • Name of Journal: CIMS '07 Proceedings of the ACM first workshop on CyberInfrastructure: information management in eScience
      • Publication Date: November 2007
      • Page Range: 61-66
