Data Quality Assurance in Telecommunications Databases C.-M. Chen , M. Cochinwala {chungmin, munir}@research.telcordia.com Applied Research Telcordia Technologies Morristown, NJ 07960
Outline • Operation Support Systems • Data Quality Issues • Record Matching Problem & Techniques • Example Applications • References
Operation Support Systems • Telecom carriers or service providers use multiple Operation Support Systems (OSS) for • network configuration • network engineering • service provisioning • network performance monitoring • customer care & billing, etc. • Each OSS may maintain its own database using a DBMS
OSS Databases • The databases may overlap on the network entities they describe • The OSSs may use different models/schemas to describe the same entity
Data Quality Issues • Corrupted data: data do not correctly reflect the properties/specifications of the modeled entities • Inaccurate data: discrepancy between the data and the real-world entities they model (e.g., outdated information) • Inconsistent data: • records in different OSS databases referencing the same real-world entity do not agree on the attribute values of the entity • an entity is represented by more than one record in a single OSS database
Example: DSL Deployment • corrupted data: many of the Central Offices (COs) do not have correct location information (longitude, latitude) • inaccurate data: the database says a port or bandwidth is available at the CO when in fact it is not (or vice versa) • inconsistent data: the Service Provisioning database says a line is available while the Network Configuration database says otherwise (which one to believe?)
Why is data quality an issue? • corrupted data disrupt/delay business deployment • inaccurate data impede decision making • inconsistent data degrade operational efficiency • bottom line: • increased cost • decreased revenue • customer dissatisfaction
Cost of Data Quality Assurance • To date, the practice of data quality assurance is mostly manual and labor-intensive • “ … it costs a typical telephone company in the mid-1990s $2 per line per year to keep the network database reasonably accurate.” • Now imagine the scale and complexity of the networks brought by IP and wireless technologies, and their effect on the cost of data quality • (Semi-)automatic tools are needed
Data Quality Assurance Techniques • data correctness and accuracy: • database validation and correction via autodiscovery tools • data consistency (reconciliation) • record matching (record linkage): identify records that correspond to the same real-world entity
Data Accuracy • to assess and correct the discrepancy between data in the database and the modeled entities in the real-world • autodiscovery tools: • automatically probe network elements for configuration parameters • reconstruct the network connection topology at different layers (from physical to application layers) • how to efficiently validate the data stored in the databases against the auto-discovered information? • sampling • graph algorithms
Record Matching Techniques [Cochinwala01b] • Problem Definition • Record Matching Phases • Quality Metrics • Searching Techniques (to reduce search space) • Matching Decision Rules (to determine matched pairs)
The Record Matching Problem Goal: To identify records in the same or different databases that correspond to the same real-world entity. • two records that correspond to the same entity are called a matched pair • in a relational table, no two records can hold identical values on all attributes; a matched pair therefore consists of two different records that refer to the same entity • the same entity may be represented differently in different tables/databases, with no agreement on key value(s)
Three Phases of Reconciliation • Data Preparation: scrubbing and cleansing • Searching: reducing search space • Matching: finding matched pairs
Data Preparation • Parsing • Data transformation • Standardization
Searching Problem Consider reconciling tables A and B
• search space: P = A × B (Cartesian product)
• M: matched pairs
• U: unmatched pairs
• P = M ∪ U
Problem: searching P requires |A| × |B| comparisons
Goal: reduce the search space to a smaller P' ⊆ P, such that M ⊆ P'.
Searching Techniques Heuristic: potentially matched records should fall into the same cluster • Relational join operators: nested-loop join, merge-sort join, hash join, band join [Hernadez96] • Blocking: Soundex code, NYSIIS [Newcombe88] • Windowing (sorted neighborhood) [Hernadez98] (see the sketch below) • Priority queue [MongeElkan97] • Multi-pass merge-sort joins [Hernadez98]
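A minimal sketch of the sorted-neighborhood (windowing) idea in Python: sort the records on a blocking key and compare only records that fall within a sliding window. The record fields and the key function are illustrative assumptions, not the schemas discussed in this deck.

```python
def blocking_key(record):
    # Illustrative key: first 3 letters of the name plus zip code.
    return (record["name"][:3].upper(), record.get("zip", ""))

def sorted_neighborhood(records, window=5):
    """Yield candidate pairs by sorting on the blocking key and
    comparing each record only with its next (window - 1) neighbors."""
    ordered = sorted(records, key=blocking_key)
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            yield rec, ordered[j]

if __name__ == "__main__":
    recs = [
        {"name": "Chen Chung-Min", "zip": "07960"},
        {"name": "Chen C.-M.", "zip": "07960"},
        {"name": "Cochinwala Munir", "zip": "07960"},
    ]
    for a, b in sorted_neighborhood(recs, window=3):
        print(a["name"], "<->", b["name"])
```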
String Matching Techniques • Edit distance: counts of insert, delete, replace [Manber89] • Smith-Waterman: dynamic programming to find minimum edit distance [SmithWaterman81]
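To make the dynamic-programming formulation concrete, here is a plain Python edit-distance sketch with unit costs for insert, delete, and replace:

```python
def edit_distance(s, t):
    """Classic dynamic-programming edit distance: the minimum number of
    insertions, deletions, and replacements that turn s into t."""
    m, n = len(s), len(t)
    # dp[i][j] = distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # replace / match
    return dp[m][n]

print(edit_distance("Street", "St."))   # -> 4
```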
Record Matching Techniques Given a (reduced) space of pairs, how to determine if a pair (x,y) is a match, i.e., records x and y refer to the same real-world entity ? • Probabilistic approaches [Newcombe59] • Bayes test for minimum error • Bayes test for minimum cost • Non-probabilistic approaches • Supervised Learning (to generate decision rules) [Cochinwala01a] • Equation theory model [Hernadez98] • Distance-based techniques [DeySarkarDe98]
Bayes Test for Minimum Error
Cartesian product P = A × B = M ∪ U
For each record pair (a1, a2, …, an, b1, b2, …, bn) ∈ P, define x = (x1, x2, …, xn), where xi = 1 if ai = bi, and xi = 0 if ai ≠ bi
A-priori probabilities: πM and πU
Conditional density functions: p(x|M) for class M, p(x|U) for class U
Unconditional density function: p(x) = πM p(x|M) + πU p(x|U)
Bayes Test for Minimum Error (cont.)
Assume πM, πU, p(x|M), and p(x|U) are known.
Bayes theorem gives the posterior probability p(M|x) = πM p(x|M) / (πM p(x|M) + πU p(x|U))
Likelihood Ratio Decision Rule: decide x belongs to M iff p(M|x) ≥ p(U|x) iff πM p(x|M) ≥ πU p(x|U) iff l(x) = p(x|M) / p(x|U) ≥ πU / πM
This test gives the minimum probability of error (misclassification)
Bayes Test for Minimum Cost
ci,j: cost of a class-j record being (mis)classified to class i
Conditional cost of classifying x to class M: cM(x) = cMM p(M|x) + cMU p(U|x)
Similarly, cU(x) = cUM p(M|x) + cUU p(U|x)
Likelihood Ratio Decision Rule: decide x belongs to M iff cM(x) ≤ cU(x) iff l(x) = p(x|M) / p(x|U) ≥ (cMU - cUU) πU / ((cUM - cMM) πM)
This test gives the minimum cost
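To make the two decision rules concrete, here is a small Python sketch of both likelihood-ratio tests; the priors, costs, and density values passed in are made-up numbers used purely for illustration.

```python
def min_error_decision(p_x_given_M, p_x_given_U, pi_M, pi_U):
    """Bayes test for minimum error: decide M iff the likelihood
    ratio l(x) = p(x|M)/p(x|U) is at least pi_U / pi_M."""
    l = p_x_given_M / p_x_given_U
    return "M" if l >= pi_U / pi_M else "U"

def min_cost_decision(p_x_given_M, p_x_given_U, pi_M, pi_U,
                      c_MM=0.0, c_UU=0.0, c_MU=1.0, c_UM=1.0):
    """Bayes test for minimum cost: the likelihood-ratio threshold is
    scaled by the misclassification costs c_ij (cost of a class-j
    record pair being classified as class i)."""
    l = p_x_given_M / p_x_given_U
    threshold = ((c_MU - c_UU) * pi_U) / ((c_UM - c_MM) * pi_M)
    return "M" if l >= threshold else "U"

# Made-up densities and priors, for illustration only.
print(min_error_decision(0.60, 0.01, pi_M=0.02, pi_U=0.98))           # -> M
print(min_cost_decision(0.60, 0.01, pi_M=0.02, pi_U=0.98, c_MU=5.0))  # -> U (false matches penalized)
```

With equal costs the minimum-cost rule reduces to the minimum-error rule; raising the cost of a false match (c_MU) raises the threshold and makes the test more conservative.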
Supervised Learning • Take a small sample S from A × B • For every record s ∈ S, label it as M (matched) or U (unmatched) • Select a predictive model, along with associated decision rules, that best discriminates between classes M and U for records in S • Apply the selected model to classify all records in A × B
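A minimal sketch of this workflow, assuming scikit-learn is available and using a CART-style decision tree; the feature vectors and labels below are invented placeholders for the manually labeled sample S.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Illustrative parameter vectors for labeled sample pairs:
# [name edit dist, address edit dist, name len A, name len B, addr len A, addr len B]
features = np.array([
    [0, 1, 12, 12, 20, 19],    # near-identical pair -> match
    [9, 14, 10, 15, 22, 30],   # very different pair -> unmatch
    [1, 2, 11, 11, 18, 18],
    [8, 12, 14, 9, 25, 21],
])
labels = np.array(["M", "U", "M", "U"])

tree = DecisionTreeClassifier(max_depth=3)                # CART-style model
scores = cross_val_score(tree, features, labels, cv=2)    # cross-validated error estimate
tree.fit(features, labels)                                # learn the decision rules
print("cv accuracy:", scores.mean())
print(tree.predict([[2, 3, 10, 10, 19, 19]]))             # classify a new pair
```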
Arroyo: a data reconciliation tool [Cochinwala01a] • matches customer records across two databases: • wireline database: 860,000 records • wireless database: 1,300,000 records • methodology • pre-processing (application dependent) • parameter space definition (application dependent) • matching rule generation & pruning (learning, model selection) • matching
Preprocessing (Application Dependent) • elimination of stop-words • blanks, special characters, ... • “Chung-Min Chen” becomes “ChungMin Chen” • word re-ordering • “20 PO BOX” becomes “PO BOX 20” • word substitution • “St.” becomes “Street”
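A sketch of these preprocessing steps in Python; the substitution table and the re-ordering rule are simplified assumptions for illustration.

```python
import re

SUBSTITUTIONS = {"ST": "STREET", "AVE": "AVENUE"}   # illustrative word substitutions

def preprocess(text):
    """Eliminate special characters, re-order a leading house number,
    and apply word substitutions (simplified, illustrative rules)."""
    text = re.sub(r"[^A-Za-z0-9 ]", "", text).upper()   # drop '-', '.', etc.
    words = text.split()
    if words and words[0].isdigit():                    # "20 PO BOX" -> "PO BOX 20"
        words = words[1:] + words[:1]
    words = [SUBSTITUTIONS.get(w, w) for w in words]    # "ST" -> "STREET"
    return " ".join(words)

print(preprocess("Chung-Min Chen"))   # CHUNGMIN CHEN
print(preprocess("20 PO BOX"))        # PO BOX 20
print(preprocess("Main St."))         # MAIN STREET
```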
Parameter Space Definition (Application Dependent) Six parameters used: • edit distance between the Name fields • edit distance between the Address fields • length of Name field in Wireline • length of Name field in Wireless • length of Address field in Wireline • length of Address field in Wireless
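A sketch of how the six-parameter vector for a (wireline, wireless) record pair could be computed; it assumes the preprocess and edit_distance helpers sketched earlier in this deck, and the field names are assumptions.

```python
def parameter_vector(wireline, wireless):
    """Build the six-parameter vector for one (wireline, wireless) pair."""
    name_a = preprocess(wireline["name"])
    name_b = preprocess(wireless["name"])
    addr_a = preprocess(wireline["address"])
    addr_b = preprocess(wireless["address"])
    return [
        edit_distance(name_a, name_b),   # edit distance between Name fields
        edit_distance(addr_a, addr_b),   # edit distance between Address fields
        len(name_a),                     # length of Name field in Wireline
        len(name_b),                     # length of Name field in Wireless
        len(addr_a),                     # length of Address field in Wireline
        len(addr_b),                     # length of Address field in Wireless
    ]
```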
Matching Rule Generation 1. Learning Set Generation • select records that contain the word “Falcon” in the Name field • Wireless - 241 records • Wireline - 883 records • Choose the sample first from the database that is the image of the mapping with the higher “onto” degree. For example, Wireline → Wireless has a higher “onto” degree than Wireless → Wireline
Matching Rule Generation 2. Learning Set Labeling • identify prospective matches • match if the edit distances on the Name and Address fields are both less than 3 • (30% false matches, but no true matches missed) • verdict assignment • each pair that is not a prospective match is assigned “unmatch” • each prospective-match pair is manually examined and labeled as “match”, “unmatch”, or “ambiguous” (labor intensive)
Matching Rule Generation 3. Model Selection • Input: 241 × 883 = 212,803 pairs with 7 fields (parameters): Label (verdict); Edit distance between Name fields; Edit distance between Address fields; Length of Name field in Wireless; Length of Name field in Wireline; Length of Address in Wireless; Length of Address in Wireline • Three models were tried, with cross-validation (half the sample to build the model, half to estimate the error rate):
Model | Error Rate (average of 50 runs)
CART | 5.1%
Linear Discriminant Analysis | 5.3%
Vector Quantization | 9.4%
Matching Rule Generation 3. Model Selection (cont.)
Model | Error Rate (average of 50 runs)
CART | 5.1%
Linear Discriminant Analysis | 5.3%
Vector Quantization | 9.4%
• LDA could produce rules that are inefficient to evaluate against the database, e.g.: 2*(Address edit distance) + 1.3*(Name length on Wireless) < 3 ⇒ match • an index on Name is useless for such a rule • CART rules can be evaluated efficiently using existing database indices
Matching Rule Pruning • determines which parameters (rules) to drop to reduce tree complexity while retaining tree quality • maximize the delta (ΔJ) of a cost function • dynamic programming with a threshold • complexity of tree T: C(T) = Σp∈T C(p) / Σq∈S C(q), where C(p) is the complexity of computing the value of parameter p and S is the full parameter set, e.g. C(edit distance between fields A and B) = avg_length(A) * avg_length(B) Note: 0 < C(T) ≤ 1
Matching Rule Pruning (cont.) C(edit distance between fields A and B) = avg_length(A) * avg_length(B)
Parameter | Avg. length | Complexity
Address length (wireless) | 6.88 | 6.88
Address length (wireline) | 6.72 | 6.72
Name length (wireless) | 10.87 | 10.87
Name length (wireline) | 10.70 | 10.70
Edit distance on Addresses | - | 6.88 * 6.72 = 46.23
Edit distance on Names | - | 10.87 * 10.70 = 116.31
Matching Rule Pruning (cont.) • misclassification rate of T: M(T) = # misclassified pairs / # pairs in the test set, Note: 0 ≤ M(T) ≤ 1 • Cost of tree T: J(T) = w1*C(T) + w2*M(T), where w1 and w2 are weights (we used w1 = w2 = 1) • Let T* = T - {all rules involving parameter p}; define ΔJ(T) = J(T) - J(T*) = (C(T) - C(T*)) + (M(T) - M(T*)) = ΔC(T) + ΔM(T) • Goal: find a set of parameters to drop that maximizes ΔJ(T)
Matching Rule Pruning (cont.) • Dynamic programming: • find a parameter p1 that maximizes ΔJ(T) • fix p1 and find a second parameter p2 such that dropping {p1, p2} maximizes ΔJ(T) • repeat this step until we reach a set of parameters {p1, p2, …, pk} for which ΔJ(T) falls below a threshold value (a code sketch of this loop follows the tables below)
Single-parameter pruning:
Parameter | ΔC(T) | ΔM(T) | ΔJ(T)
Edit dist. on Name | 0.5882 | -0.0140 | 0.5742
Edit dist. on Address | 0.2338 | -0.1840 | 0.0498
Len. on Name (Wireless) | 0.0566 | -0.0610 | -0.0044
Len. on Name (Wireline) | 0.0541 | -0.0160 | 0.0381
Len. on Address (Wireless) | 0.0347 | -0.0081 | 0.0266
Len. on Address (Wireline) | 0.0339 | -0.0070 | 0.0269
Matching Rule Pruning (cont.) Dual-parameter pruning (second parameter, with the first already fixed):
2nd Parameter | ΔC(T) | ΔM(T) | ΔJ(T)
Len. on Name (Wireline) | 0.0541 | -0.0088 | 0.0453
Len. on Address (Wireless) | 0.0347 | -0.0072 | 0.0275
Len. on Address (Wireline) | 0.0339 | -0.0091 | 0.0248
...
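A minimal sketch of the pruning loop described above: repeatedly add to the drop set the parameter whose removal yields the largest ΔJ, stopping when the best achievable ΔJ falls below a threshold. The delta_J callback, which would rebuild the tree without the given parameters and re-evaluate its cost, is assumed to be supplied by the caller.

```python
def prune_parameters(parameters, delta_J, threshold=0.0):
    """Greedy, threshold-bounded pruning: grow the set of dropped
    parameters one at a time, always picking the parameter whose
    removal maximizes delta_J(drop_set) = J(T) - J(T*)."""
    dropped = []
    remaining = list(parameters)
    while remaining:
        best = max(remaining, key=lambda p: delta_J(dropped + [p]))
        if delta_J(dropped + [best]) < threshold:
            break                      # no worthwhile parameter left to drop
        dropped.append(best)
        remaining.remove(best)
    return dropped
```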
Matching Rule Pruning (cont.) • Reduced from 6 parameters (original tree) to 2 parameters (final tree) • index on the Name field can be used to evaluate the tree
Final tree: the root splits on Address edit distance; if Address edit distance < 1.5, the pair is matched; if Address edit distance > 1.5, split on Wireless Name length: < 11.5 gives ambiguous, > 11.5 gives unmatched
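Written as a plain function, the final two-parameter tree might look like the following sketch (the leaf assignments follow the tree described above):

```python
def classify_pair(address_edit_distance, wireless_name_length):
    """Final pruned CART tree: split on Address edit distance at 1.5,
    then on Wireless Name length at 11.5."""
    if address_edit_distance < 1.5:
        return "matched"
    if wireless_name_length < 11.5:
        return "ambiguous"
    return "unmatched"

print(classify_pair(0.0, 10))   # matched
print(classify_pair(4.0, 9))    # ambiguous
print(classify_pair(4.0, 15))   # unmatched
```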
CART Matching Result Prob(predicted as x | given y):
predicted x \ actual y | Matched | Unmatched | Ambiguous
Matched | 0.965 | 0.014 | 0.03
Unmatched | 0.020 | 0.956 | 0.18
Ambiguous | 0.015 | 0.030 | 0.79
• approximately 10% error rate • reduced the computation cost of matching by 50% • about 47% more correctly matched records than a leading commercial product
Major Telephone Operating Company Problem: • Mergers and acquisitions of wireless companies resulted in the RBOC’s inability to determine common customers among wireline and wireless businesses • Customer databases for each business unit use different schema and contain many quality problems • RBOC’s experience with commercial vendor’s data reconciliation tool was unsatisfactory • Solution: • Use small manually-verified data samples (~100 records) to determine appropriate matching rules • Use machine learning to prune the rules for efficient analysis of the large dataset • Resulted in 30% more correct matches than the commercial tool
Large Media Conglomerate Problem: • Company provides magazines to wholesalers, who in turn provide magazines to retailers for distribution • Reconciliation of wholesaler and retailer databases would make it easier to track where gaps in reporting are occurring • Identify ‘bad’ retailers • Solution: • Group by primary keys • Match by secondary keys (see the sketch below) • E.g. 3000 C.V.S. Pharmacies are grouped and compared by zip code and street address to identify ‘bad’ pharmacies
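A sketch of the group-then-match idea: group records on a primary key, then compare only within each group on the secondary keys. The field names (chain, zip, street) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def group_and_match(records, primary_key, secondary_keys):
    """Group records on a primary key, then report pairs within each
    group that agree on all secondary keys."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[primary_key]].append(rec)
    matches = []
    for members in groups.values():
        for a, b in combinations(members, 2):
            if all(a[k] == b[k] for k in secondary_keys):
                matches.append((a, b))
    return matches

stores = [
    {"chain": "CVS", "zip": "07960", "street": "10 MAIN ST", "id": 1},
    {"chain": "CVS", "zip": "07960", "street": "10 MAIN ST", "id": 2},  # likely duplicate
    {"chain": "CVS", "zip": "07043", "street": "5 OAK AVE", "id": 3},
]
print(group_and_match(stores, "chain", ["zip", "street"]))
```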
International Government Problem: • Reconcile vital taxpayer data from several different sources • Known problems include record duplication, address mismatches, address obsolescence, distributed responsibility for database accuracy and updates • Identify causes for mistakes • Solution: • Improve the process flows and architecture to allow for rapid modification of pre-processing rules and matching rules • Detection and classification of likely causes of duplication • Analysis and improvements reduced number of records that required manual verification
ILEC-CLEC Billing Reconciliation • Problem • ILECs charge CLECs for use of network resources • Verification of actual usage vs. charges • e.g., a customer changing providers • Identify actual usage and send verification to the ILEC • Resource identification differs between the ILEC and the CLEC • Solution • Check charges in the bill against actual usage • Common identification of resources (matching table) • The solution has so far been implemented only for access-line charges
Future Directions • Reduction of manual work • Type based standardization, rules • Rules and Parameter pruning? • Database and tool • Rule execution plan • E.g. string length before edit distance • Sharing/maintenance of index across steps in process • Extending matching to structures (trees, graphs – circuits) • Matching across layers in networks (OSI layers) • How often to discover/audit network? – sampling techniques
References • [Cochinwala01a] M. Cochinwala, V. Kurien, G. Lalk and D. Shasha, “Efficient data reconciliation”, Information Sciences, 137 (1-4), 2001. • [Cochinwala01b] M. Cochinwala, S. Dalal, A. Elmagarmid and V. Verykios, “Record matching: past, present and future”, submitted for publication. • [DeySarkarDe98] D. Dey, S. Sarkar and P. De, “Entity matching in heterogeneous databases: a distance-based decision model”, 31st Hawaii Int. Conf. on System Sciences, 1998. • [Hernadez96] M. Hernadez, “A generalization of band joins and the merge/purge problem”, Ph.D. Thesis, CS Dept., Columbia University, 1996. • [Hernadez98] M. Hernadez and S. Stolfo, “Real-world data is dirty: data cleansing and the merge/purge problem”, Data Mining and Knowledge Discovery, 2 (1), 1998. • [Manber89] U. Manber, “Introduction to Algorithms: A Creative Approach”, Addison-Wesley, 1989. • [MongeElkan97] A. Monge and C. Elkan, “An efficient domain-independent algorithm for detecting approximately duplicate database records”, SIGMOD DMKD Workshop, 1997. • [Newcombe59] H. Newcombe, J. Kennedy, S. Axford and A. James, “Automatic linkage of vital records”, Science 130(3381), 1959. • [Newcombe88] H. Newcombe, “Handbook of Record Linkage”, Oxford University Press, 1988. • [SmithWaterman81] T. Smith and M. Waterman, “Identification of common molecular subsequences”, J. Mol. Biol. 147, 1981.