Application of Regular Expressions in the German Business Register
Session 5: Projects on Improvements for Business Registers
Wiesbaden Group on Business Registers, Paris, November 26th 2007
Patrizia Moedinger
Example 1: Improving legal form coding by using regular expressions
Background
• Information on legal forms comes mainly from VAT records.
• Not all administrative sources provide information on legal forms.
• Sources use different, incompatible legal form codings or different aggregation levels.
• Other purposes, such as the coding of institutional sectors, impose special requirements.
Background
• Enterprises (legal units) with certain legal forms are legally obliged to carry their legal form in the enterprise name:
  • incorporated firms
  • non-incorporated firms
  • cooperatives
  • merchants that are registered in the German Commercial Register
• Consequently, enterprise names can be used for legal form coding.
Definition of search patterns
• Patterns from nomenclature, abbreviations and notations (tax authorities): GmbH, AG & Co.KG, Limited, Ltd.
• Patterns from real BR data: spelling mistakes, missing blanks, ..
• Result: construction of regular expressions
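The construction of regular expressions from such patterns could look roughly like the following Python sketch. The pattern table, the code labels and the function name `code_legal_form` are illustrative assumptions, not the patterns actually used for the German BR; only the example notations (GmbH, AG & Co.KG, Limited, Ltd.) come from the slide.

```python
import re

# Hypothetical pattern table (not the BR's real one).
# Order matters: compound forms must be tested before their parts.
LEGAL_FORM_PATTERNS = [
    # "AG & Co. KG", also tolerating missing blanks and a missing dot
    ("AG & Co. KG", re.compile(r"\bAG\s*&\s*Co\.?\s*KG\b", re.IGNORECASE)),
    # "GmbH", "G.m.b.H.", "Gmbh"; no leading \b so that a missing blank
    # before the legal form ("MusterbauGmbH") is still found
    ("GmbH", re.compile(r"G\.?\s?m\.?\s?b\.?\s?H\b", re.IGNORECASE)),
    # "Limited" and its abbreviation "Ltd" / "Ltd."
    ("Limited", re.compile(r"\b(?:Limited|Ltd)\b", re.IGNORECASE)),
    # Plain "AG": case-sensitive and word-bounded to avoid hits inside words
    ("AG", re.compile(r"\bAG\b")),
]


def code_legal_form(name):
    """Return the first matching legal form label, or None if nothing matches."""
    for label, pattern in LEGAL_FORM_PATTERNS:
        if pattern.search(name):
            return label
    return None
```

For example, `code_legal_form("MusterbauGmbH")` yields `"GmbH"` despite the missing blank, while a name without any legal form yields `None`. Loosening a pattern to catch writing mistakes also raises the risk of false hits, which is why the coding results need to be evaluated.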
Evaluation of search patterns
• Completeness of coding: because of the legal obligation, a high share of legal forms is found in enterprise names.
• Reliability of coding: evaluation of the coding results
  • drawing a sample after legal form coding
  • classification of the coding results
  • calculation of sensitivity, specificity, positive predictive value and negative predictive value
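The four measures named above follow from the counts of true/false positives and negatives in the classified sample. A minimal sketch, assuming the sample has already been classified by hand; the function name and the returned keys are illustrative:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute evaluation measures from a 2x2 classification of coding results.

    tp: legal form coded and correct      fp: legal form coded but wrong
    fn: legal form present but not coded  tn: correctly left uncoded
    """
    def ratio(numerator, denominator):
        # Return None instead of raising when a measure is undefined
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true cases that were found
        "specificity": ratio(tn, tn + fp),  # share of non-cases left alone
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }
```

With made-up counts such as `classification_metrics(90, 5, 10, 95)`, the sensitivity is 90 / 100 = 0.9 and the specificity is 95 / 100 = 0.95.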
Example 2: Data pre-processing as a preliminary for record linkage
Background
• No common unique identifiers are available.
• Data from different sources are initially linked by names and addresses.
• Address standards differ or are missing.
• Notations differ: “BMW”, “Bayerische Motorenwerke” or “Bay. Motorenwerke”.
• The German BR can technically store only two addresses per unit (dispatch and domicile).
Problem of non-standardized notations
• Matching by administrative identifiers
• Dependent variable = match by administrative identifiers + no change in the postal code
• Independent variables = differences between enterprise names, street names and town names (Levenshtein edit distance)
• Within the same (administrative) source
• Between different sources (administrative source and BR)
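The Levenshtein edit distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A small self-contained sketch of the standard dynamic-programming computation; the function name is illustrative and the slides do not say which implementation was used:

```python
def levenshtein(a, b):
    """Return the Levenshtein edit distance between strings a and b."""
    # Keep the shorter string in the inner loop to limit memory use
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

For instance, `levenshtein("Muellerstr.", "Mullerstr.")` is 1, because a single deleted character separates the two street notations.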
Matching probability against string similarity within an administrative source (Employment Agency) (Model: Logistic regression)
Matching probability against string similarity between an administrative source (Employment Agency) and BR (Model: Logistic regression)
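The logistic regression named in the two captions above maps the edit distances to a matching probability. The sketch below shows only the general form of such a model; the coefficient values in the usage note are invented for illustration and are not the estimates from the presentation.

```python
import math


def match_probability(distances, intercept, coefficients):
    """Logistic model: P(match) = 1 / (1 + exp(-(b0 + sum(b_k * d_k)))).

    distances:    edit distances, e.g. (name, street, town)
    intercept:    b0 (hypothetical value supplied by the caller)
    coefficients: one b_k per distance (hypothetical values)
    """
    linear = intercept + sum(b * d for b, d in zip(coefficients, distances))
    return 1.0 / (1.0 + math.exp(-linear))
```

With invented negative coefficients such as `(-0.5, -0.3, -0.8)`, the probability falls as the distances grow, which is the shape of relationship the figures plot.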
Pre-processing of administrative data for record linkage
• high level of similarity between two strings: identical units
• high level of disparity between two strings: different units
[Figure: scale of differences in name or address from low (identical unit) to high (different unit)]
Pre-processing of administrative data for record linkage
• conversion into specific variables for string matching
• simplification of comparison strings
Example: the enterprise address "BMW AG Branch Munich Mr Mueller" is split into
  enterprise name: BMW
  legal form: AG
  other elements: Branch Munich Mr Mueller
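The BMW example on this slide can be sketched as a split at the first legal form token. The short list of legal forms, the function name and the output keys are assumptions made for illustration; a real pre-processing step would need a pattern set adapted to the data source.

```python
import re

# Hypothetical, deliberately short list of legal form tokens
LEGAL_FORM = re.compile(r"\b(GmbH|AG|KG|Ltd\.?|Limited)(?=\s|$)")


def split_enterprise_address(text):
    """Split a raw name string into name, legal form and other elements."""
    match = LEGAL_FORM.search(text)
    if match is None:
        # No legal form found: keep the whole string as the name
        return {"name": text.strip(), "legal_form": None, "other": ""}
    return {
        "name": text[:match.start()].strip(),   # text before the legal form
        "legal_form": match.group(1),
        "other": text[match.end():].strip(),    # branch, contact person, ...
    }
```

Applied to `"BMW AG Branch Munich Mr Mueller"`, this returns the name `"BMW"`, the legal form `"AG"` and the remaining elements `"Branch Munich Mr Mueller"`, so that only the short name enters the string comparison.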
Methods for evaluation
• Evaluate the link between string similarity and match before and after pre-processing the data.
• Evaluation of matching results
  • (drawing a sample after the matching process)
  • classification of the matching results
  • calculation of sensitivity, specificity, positive predictive value and negative predictive value
• Controlling for effects caused by the matching program used
Synopsis
• BR text data need special treatment in data processing.
• Applications for regular expressions:
  • simple application: legal form coding (limited set of search patterns)
  • more complex application: pre-processing (set of patterns depends on data source and later use)
• The application of regular expressions should always be evaluated.