Improving data quality in OJ/TED
Jáchym Hercher, DG MARKT, ESWG 29.4.2014
Presentation outline
• Introduction
• Miscoded data
  • Numbers
  • Text
Introduction: Quality data is crucial for both procurement and analyses
• The only way to have good data is for it to be filled in correctly.
• Member states (and academics) are often provided with OJ/TED data, and it is important to know what to expect from this data.
• Many of you are dealing with the same issues at national level, so it may be worth sharing your experience.
Introduction: There are two problems with the data: some is miscoded, some is missing
Miscoded numbers: There are very many ways in which filling in a field can go wrong
• Some values do not look trustworthy: they are too large, too small, or strange.
  • Typos (e.g. adding or removing zeroes)
  • Wrong units (e.g. thousands instead of ones)
  • Mistaking contract award values and notice values
  • Ill intent (e.g. notice value 12345678)
  • Per-unit values instead of total values (e.g. €1.2)
  • …
• The problem concerns not only values, but also the number of bidders and dates.
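One of the patterns above, a filler value such as 12345678, can be flagged mechanically. The following is a minimal sketch, not the method used for OJ/TED: the function name `looks_like_placeholder`, the minimum length of four digits, and the extra check for repeated digits (e.g. 7777777) are all assumptions made for illustration.

```python
def looks_like_placeholder(value: int) -> bool:
    """Flag integers that look typed in as filler rather than as a real amount."""
    digits = str(abs(value))
    # Assumption: shorter numbers match these patterns by chance too often.
    if len(digits) < 4:
        return False
    # Ascending run such as 12345678: every digit is one more than the previous.
    ascending = all(int(b) - int(a) == 1 for a, b in zip(digits, digits[1:]))
    # Assumption (not from the slides): a single repeated digit is also suspicious.
    repeated = len(set(digits)) == 1
    return ascending or repeated
```

A flag like this only marks a value for review. A genuine contract value can match the pattern by coincidence, so dropping flagged values outright would also remove some correct ones.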
Miscoded numbers: Once the data is incorrectly filled in, it is extremely difficult to correct
• This is a challenge faced by everyone who has ever used the data.
• If a value is wrongly filled in, it is impossible to correct perfectly. We only try to minimize the damage.
• Any corrections are probabilistic: they are intended to remove more wrong values than correct ones.
Miscoded numbers: We use many tools to correct values, but all are far from perfect
• Our corrections
  • Comparing notice values and sum of award values
  • Comparing prior information notice values vs. contract notice values
  • Comparing estimated values vs. final values
  • Removing very small values
  • Removing very large values
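Two of the corrections listed above (range filtering, and comparing a notice value with the sum of its award values) could be sketched as follows. This is an illustration under stated assumptions, not the actual OJ/TED procedure: the `Notice` structure, the function names, and all three thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical thresholds, chosen only for illustration.
MIN_PLAUSIBLE_EUR = 100.0
MAX_PLAUSIBLE_EUR = 10_000_000_000.0
MAX_RATIO = 10.0  # largest tolerated factor between notice value and award sum


@dataclass
class Notice:
    notice_value: Optional[float]
    award_values: List[float]


def in_plausible_range(value: Optional[float]) -> bool:
    """True if the value is present and neither very small nor very large."""
    return value is not None and MIN_PLAUSIBLE_EUR <= value <= MAX_PLAUSIBLE_EUR


def clean_notice_value(notice: Notice) -> Optional[float]:
    """Return the notice value if it passes both checks, otherwise None."""
    value = notice.notice_value
    if not in_plausible_range(value):
        return None
    # Only award values that are themselves plausible take part in the comparison.
    awards = [v for v in notice.award_values if in_plausible_range(v)]
    if awards:
        total = sum(awards)
        ratio = max(value, total) / min(value, total)
        if ratio > MAX_RATIO:
            return None
    return value
```

For example, `clean_notice_value(Notice(1_200.0, [700_000.0, 500_000.0]))` returns `None`, because the notice value is 1000 times smaller than the award sum, which is typical of a units error. The check cannot tell which of the two sides is wrong, so it discards the value instead of repairing it.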
Miscoded text: Only names enable us to really understand who is buying what
• The same contracting authorities and winning companies appear under different names, e.g. because of:
  • Typos
  • Different legitimate names and naming conventions
  • Multiple acceptable languages
  • Special symbols
  • Inclusion of legal forms
• The number of entities is overestimated roughly 3-5x.
• We are starting an extensive IT project to clean up the text data.
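Two of the causes listed above, special symbols and legal forms, can be reduced with simple rule-based normalization before names are compared. The sketch below is an assumption-laden illustration, not the planned clean-up project: `normalize_name` and the short `LEGAL_FORMS` list are hypothetical, and typos and multiple languages are not handled at all.

```python
import re
import unicodedata

# Hypothetical, deliberately short list of legal forms; a real one would be far longer.
LEGAL_FORMS = {"ltd", "gmbh", "sa", "sro", "as", "spa", "bv", "ag"}


def normalize_name(name: str) -> str:
    """Reduce a company or authority name to a rough comparison key."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase and drop dots, so that "s.r.o." becomes "sro".
    lowered = unaccented.lower().replace(".", "")
    # Treat every remaining non-alphanumeric character as a separator.
    tokens = re.sub(r"[^a-z0-9]+", " ", lowered).split()
    # Remove tokens that are only a legal form.
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)
```

With this key, "Škoda Transportation a.s." and "SKODA TRANSPORTATION, A.S." both become `skoda transportation` and can be counted as one entity. The sketch has clear limits: it discards non-Latin scripts such as Greek or Cyrillic entirely, and a legal-form token like "as" can collide with an ordinary word.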