
Application of Regular Expressions in the German Business Register

This session explores the application of regular expressions in the German Business Register, focusing on projects related to improving legal form coding and data preprocessing for record linkage. The examples provided demonstrate the benefits of using regular expressions in these processes. The evaluation methods used for assessing coding completeness and matching results are also discussed.

Presentation Transcript


  1. Application of Regular Expressions in the German Business Register • Session 5: Projects on Improvements for Business Registers • Wiesbaden Group on Business Registers, Paris, November 26th 2007 • Patrizia Moedinger

  2. Example 1: Improving legal form coding by using regular expressions

  3. Background • information on legal forms comes mainly from VAT records • not all administrative sources provide information on legal forms • sources use different, incompatible legal form codings or different aggregation levels • special requirements for other purposes, such as the coding of institutional sectors

  4. Background • enterprises (legal units) with certain legal forms are legally obliged to carry their legal form in the enterprise name: incorporated firms, non-incorporated firms, cooperatives, and merchants registered in the German Commercial Register • consequence: enterprise names can be used for legal form coding

  5. Definition of search patterns • patterns from nomenclature, abbreviations and notations (tax authorities), e.g. GmbH, AG & Co.KG, Limited, Ltd. • patterns from real BR data, e.g. spelling mistakes, missing blanks, ... • construction of regular expressions
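The pattern construction described on slide 5 can be sketched in Python as follows. This is a minimal illustration, not the pattern set actually used in the German Business Register: the labels, the regular expressions and the function name `code_legal_form` are all assumptions made for the example. It tolerates a few notation variants (dots, missing blanks as in "Co.KG", different capitalisation) and checks more specific forms first.

```python
import re

# Hypothetical pattern set: labels and regexes are illustrative only,
# not the patterns used in the German Business Register.
LEGAL_FORM_PATTERNS = [
    # More specific forms come first, so "AG & Co. KG" is not coded as plain "AG".
    ("AG & Co. KG", re.compile(r"\bAG\s*(?:&|und|u\.)\s*Co\.?\s*KG\b", re.IGNORECASE)),
    # Accepts "GmbH", "Gmbh", "G.m.b.H." and similar spellings.
    ("GmbH", re.compile(r"\bG\.?\s?m\.?\s?b\.?\s?H\b\.?", re.IGNORECASE)),
    ("Limited", re.compile(r"\b(?:Limited|Ltd)\b\.?", re.IGNORECASE)),
    # Case-sensitive on purpose, to avoid matching ordinary words containing "ag".
    ("AG", re.compile(r"\bAG\b")),
]


def code_legal_form(enterprise_name):
    """Return the label of the first matching pattern, or None if nothing matches."""
    for label, pattern in LEGAL_FORM_PATTERNS:
        if pattern.search(enterprise_name):
            return label
    return None


if __name__ == "__main__":
    for name in ["Muster GmbH", "Beispiel AG & Co.KG", "Example Ltd.", "Bäckerei Schmidt"]:
        print(name, "->", code_legal_form(name))
```

The order of the list matters: a first-match strategy only works if compound forms are tested before the simple forms they contain.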

  6. Evaluation of search patterns • completeness of coding: the legal obligation implies a high level of legal forms found in enterprise names • reliability of coding: evaluation of coding results • drawing a sample after legal form coding • classification of the coding results • calculation of sensitivity, specificity, positive predictive value, negative predictive value
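The four measures named on slide 6 follow directly from a 2x2 classification of the sampled coding results. The sketch below uses the standard definitions; the function name `coding_quality` and the counts in the usage example are invented for illustration and are not results from the presentation.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero."""
    return numerator / denominator if denominator else None


def coding_quality(tp, fp, fn, tn):
    """Compute standard quality measures from a 2x2 classification.

    tp: legal form coded and the coding is correct
    fp: legal form coded although the unit does not have it
    fn: legal form present but not coded
    tn: legal form absent and not coded
    """
    return {
        "sensitivity": _ratio(tp, tp + fn),  # share of true legal forms that were found
        "specificity": _ratio(tn, tn + fp),  # share of non-cases correctly left uncoded
        "ppv": _ratio(tp, tp + fp),          # positive predictive value
        "npv": _ratio(tn, tn + fn),          # negative predictive value
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    print(coding_quality(tp=90, fp=5, fn=10, tn=95))
```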

  7. Completeness of coding

  8. Evaluation of Type I and II errors

  9. Example 2: Data pre-processing as a preliminary for record linkage

  10. Background • no common unique identifiers available • data from different sources are initially linked by names and addresses • different or no address standards • different notations: “BMW”, “Bayerische Motorenwerke” or “Bay. Motorenwerke” • German BR is technically limited in storing several addresses (only dispatch and domicile)

  11. Problem of non-standardized notations • matching by administrative identifiers • dependent variable = match by administrative identifiers + no change in the postal code • independent variables = differences between enterprise names, street names and town names (Levenshtein edit distance) • compared within the same (administrative) source and between different sources (administrative source and BR)
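The Levenshtein edit distance used as the difference measure on slide 11 counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal sketch of the textbook dynamic-programming version follows; the function name is an assumption, and the presentation does not say which implementation was used.

```python
def levenshtein(a, b):
    """Return the Levenshtein edit distance between strings a and b."""
    # Keep the shorter string in the inner loop so the rows stay short.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Bayerische Motorenwerke", "Bay. Motorenwerke"))
    print(levenshtein("BMW", "Bayerische Motorenwerke"))
```

The second call shows why raw edit distance struggles with abbreviations: the two names refer to the same enterprise but share almost no characters.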

  12. Matching probability against string similarity within an administrative source (Employment Agency) (Model: Logistic regression)

  13. Matching probability against string similarity between an administrative source (Employment Agency) and BR (Model: Logistic regression)
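The logistic regression named on slides 12 and 13 can be written out as below. This is a sketch under assumptions: the slides name the dependent and independent variables, but the additive form and the symbols (the coefficients and the three distance variables) are introduced here for illustration and are not taken from the presentation.

```latex
% d_name, d_street, d_town: Levenshtein distances between the two records'
% enterprise names, street names and town names (notation assumed here).
\[
  P(\text{match} \mid d_{\text{name}}, d_{\text{street}}, d_{\text{town}})
  = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1 d_{\text{name}}
      + \beta_2 d_{\text{street}} + \beta_3 d_{\text{town}})\bigr)}
\]
```

Under this form, negative coefficients on the distances would mean that the matching probability falls as the strings become more dissimilar.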

  14. Pre-processing of administrative data for record linkage • high level of similarity between two strings: identical units • high level of disparity between two strings: different units • [Figure: identical unit vs. different unit, plotted against low to high differences in name or address]

  15. Pre-processing of administrative data for record linkage • conversion into specific variables for string matching • simplify comparison strings • example: enterprise address “BMW AG Branch Munich Mr Mueller” is split into enterprise name: BMW, legal form: AG, other elements: Branch Munich Mr Mueller
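The conversion on slide 15 can be sketched as a split around the legal form: text before it becomes the enterprise name, text after it the other elements. The list of legal forms, the function name `split_enterprise_address` and the dictionary keys below are assumptions for illustration; as the synopsis notes, a real pattern set would depend on the data source and the later use.

```python
import re

# Hypothetical, deliberately short list of legal forms; case-sensitive on purpose.
LEGAL_FORM = re.compile(r"\b(GmbH|AG|KG|Ltd\.?|Limited)(?=\s|$)")


def split_enterprise_address(line):
    """Split an address line into name, legal form and remaining elements."""
    match = LEGAL_FORM.search(line)
    if match is None:
        # No recognisable legal form: keep the whole line as the name.
        return {"name": line.strip(), "legal_form": None, "other": ""}
    return {
        "name": line[:match.start()].strip(),
        "legal_form": match.group(1),
        "other": line[match.end():].strip(),
    }


if __name__ == "__main__":
    print(split_enterprise_address("BMW AG Branch Munich Mr Mueller"))
```

Comparing only the `name` parts of two records removes the branch and contact-person elements that would otherwise inflate the edit distance between records for the same unit.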

  16. Methods for evaluation • evaluate link between string similarity and match before and after pre-processing the data • evaluation of matching results • (drawing sample after matching process) • classification of the matching results • calculation of sensitivity, specificity, positive predictive value, negative predictive value • controlling for effects caused by the used matching program

  17. Synopsis • BR text data needs special treatment in data processing • applications for regular expressions • simple application: legal form coding (limited set of search patterns) • more complex application: pre-processing (set of patterns depends on data source and later use) • application of regular expressions should always be evaluated

  18. Thank you for your attention.
