
Improving data quality in OJ/TED

Jáchym Hercher, DG MARKT, ESWG 29.4.2014.


Presentation Transcript


  1. Improving data quality in OJ/TED
     Jáchym Hercher, DG MARKT, ESWG 29.4.2014

  2. Presentation outline
     • Introduction
     • Miscoded data
       • Numbers
       • Text

  3. Quality data is crucial for both procurement and analyses (Introduction)
     • The only way to have good data is for it to be filled in correctly.
     • Member states (and academics) are often provided with OJ/TED data, and it is important to know what to expect from this data.
     • Many of you are dealing with the same issues at national level, so it may be worth sharing your experience.

  4. There are two problems with data: some is miscoded, some is missing (Introduction)

  5. There are very many ways in which filling in a field can go wrong (Miscoded numbers)
     • Some values do not look trustworthy: they are too large, too small, or strange.
     • Typos (e.g. adding or removing zeroes)
     • Wrong units (e.g. thousands instead of ones)
     • Confusing contract award values with notice values
     • Ill intent (e.g. notice value 12345678)
     • Per-unit values instead of total values (e.g. €1.2)
     • …
     • This concerns not only values, but also the number of bidders and dates.
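The symptoms listed on this slide can be illustrated with a small screening function. This is only a minimal sketch, not the method used for OJ/TED: the function names `looks_like_placeholder` and `flag_value` and the thresholds `low` and `high` are assumptions chosen for illustration, and real thresholds would depend on the type of contract.

```python
def looks_like_placeholder(value: float) -> bool:
    """Return True for digit patterns such as 12345678 or 9999999.

    Hypothetical heuristic: an ascending run of digits, or one digit repeated,
    at least five digits long.
    """
    digits = str(int(value))
    if len(digits) < 5:
        return False
    ascending_run = "1234567890"
    return digits in ascending_run or len(set(digits)) == 1


def flag_value(value: float, low: float = 1_000.0, high: float = 1_000_000_000.0) -> list[str]:
    """List the reasons why a contract value looks untrustworthy.

    `low` and `high` are illustrative thresholds, not official ones.
    An empty list means no check fired, not that the value is correct.
    """
    reasons = []
    if value < low:
        reasons.append("too_small")  # e.g. a per-unit price such as 1.2
    if value > high:
        reasons.append("too_large")  # e.g. extra zeroes or wrong units
    if looks_like_placeholder(value):
        reasons.append("placeholder_pattern")  # e.g. 12345678
    return reasons


if __name__ == "__main__":
    for v in (1.2, 12345678, 250_000.0):
        print(v, flag_value(v))
```

A flag only marks a value for review; as the slide notes, the same checks would need counterparts for the number of bidders and for dates.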

  6. Once the data is incorrectly filled in, it is extremely difficult to correct (Miscoded numbers)
     • A challenge faced by everyone who uses the data.
     • If a value is wrongly filled in, it is impossible to correct it perfectly. We only try to minimize the damage.
     • Any corrections are probabilistic: they are intended to remove more wrong values than correct ones.

  7. We use many tools to correct values, but all are far from perfect (Miscoded numbers)
     • Our corrections:
       • Comparing notice values and sum of award values
       • Comparing prior information notice values vs. contract notice values
       • Comparing estimated values vs. final values
       • Removing very small values
       • Removing very large values
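The first correction above, comparing a notice value with the sum of its award values, could be sketched as follows. The slides do not describe how the comparison is actually implemented, so everything here is an assumption made for illustration: the function name `compare_notice_to_awards`, the `tolerance` parameter, and the rule that a ratio close to a power of ten suggests added or removed zeroes or wrong units.

```python
import math


def compare_notice_to_awards(notice_value: float,
                             award_values: list[float],
                             tolerance: float = 0.5) -> str:
    """Classify how a notice value relates to the sum of its award values.

    `tolerance` is a hypothetical relative deviation still treated as consistent.
    """
    total = sum(award_values)
    if notice_value <= 0 or total <= 0:
        return "missing_or_zero"

    ratio = total / notice_value
    log_ratio = math.log10(ratio)
    nearest_power = round(log_ratio)

    # A ratio of almost exactly 10, 100, 0.001, ... hints at a zero typo
    # or a units mix-up (thousands instead of ones).
    if nearest_power != 0 and abs(log_ratio - nearest_power) < 0.05:
        return "power_of_ten_mismatch"
    if abs(ratio - 1) <= tolerance:
        return "consistent"
    return "inconsistent"


if __name__ == "__main__":
    print(compare_notice_to_awards(1_000_000, [400_000, 600_000]))
    print(compare_notice_to_awards(1_000_000, [1_000]))
    print(compare_notice_to_awards(1_000_000, [3_000_000]))
```

Such a check is probabilistic in the sense of the previous slide: a genuine award can legitimately be ten times smaller than the notice value, so the classification indicates which record deserves a closer look rather than which one is wrong.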

  8. Only names enable us to really understand who is buying what (Miscoded text)
     • Same contracting authorities and winning companies with different names. E.g.:
       • Typos
       • Different legitimate names and naming conventions
       • Multiple acceptable languages
       • Special symbols
       • Inclusion of legal forms
     • Number of entities overestimated ~3-5x.
     • We are starting an extensive IT project to clean up the text data.
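Some of the name variations above (special symbols, letter case, inclusion of legal forms) can be reduced with simple normalization before entities are counted. The sketch below is not the IT project mentioned on the slide. The function `normalize_name` and the short `LEGAL_FORMS` list are assumptions for illustration, and the approach does not address typos, translated names, or non-Latin scripts.

```python
import re
import unicodedata

# Hypothetical, deliberately short list of legal-form tokens (dots removed).
LEGAL_FORMS = {"ltd", "limited", "gmbh", "sa", "sro", "as", "ag", "plc", "inc", "bv", "spa"}


def normalize_name(name: str) -> str:
    """Return a simplified key for comparing organisation names."""
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = without_marks.casefold()
    # Remove dots first so that "s.r.o." becomes the single token "sro".
    lowered = lowered.replace(".", "")
    # Any remaining character outside a-z and 0-9 acts as a separator.
    tokens = re.split(r"[^a-z0-9]+", lowered)
    kept = [t for t in tokens if t and t not in LEGAL_FORMS]
    return " ".join(kept)


if __name__ == "__main__":
    names = ["Škoda Auto a.s.", "SKODA AUTO, A.S.", "Acme s.r.o.", "ACME, S.R.O"]
    print(len(names), "raw names ->", len({normalize_name(n) for n in names}), "normalized keys")
```

Dropping tokens such as "as" or "sa" can also merge names that are genuinely different, so a key like this is better suited to proposing candidate duplicates for review than to merging records automatically.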
