Explore the critical importance of quality data in procurement and analysis, focusing on miscoded data issues and their impact on decision-making and transparency. Learn about common problems with numeric and text data, and discover strategies to improve accuracy and reliability.
Improving data quality in OJ/TED
Jáchym Hercher, DG MARKT, ESWG, 29.4.2014
Presentation outline
• Introduction
• Miscoded data
  • Numbers
  • Text
Quality data is crucial for both procurement and analyses
Introduction
• The only way to have good data is for it to be filled in correctly.
• Member states (and academics) are often provided with OJ/TED data, and it is important to know what to expect from this data.
• Many of you are dealing with the same issues at national level, so it may be worth sharing your experience.
There are two problems with data: some is miscoded, some is missing
Introduction
There are very many ways in which filling in a field can go wrong
Miscoded numbers
• Some values do not look trustworthy: they are too large, too small, or strange.
  • Typos (e.g. adding or removing zeroes)
  • Wrong units (e.g. thousands instead of ones)
  • Mixing up contract award values and notice values
  • Ill intent (e.g. notice value 12345678)
  • Per-unit values instead of total values (e.g. €1.2)
  • …
• The problem is not only about values, but also about the number of bidders and dates.
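The warning signs listed on this slide can be turned into simple automated flags. The following is a minimal sketch, not the checks actually applied to OJ/TED data: the function names and the thresholds `low` and `high` are hypothetical and would need tuning for real notices.

```python
def is_ascending_run(value: int) -> bool:
    """True for placeholder-like values such as 1234 or 12345678."""
    digits = str(value)
    if len(digits) < 4:
        return False
    # Every digit is exactly one more than the digit before it.
    return all(int(b) - int(a) == 1 for a, b in zip(digits, digits[1:]))


def flag_value(value: float, low: float = 100.0, high: float = 1e9) -> list:
    """Return a list of reasons why a contract value looks untrustworthy.

    `low` and `high` are illustrative thresholds in euros (assumptions).
    """
    flags = []
    if value < low:
        flags.append("too_small")  # e.g. a per-unit price such as 1.2
    if value > high:
        flags.append("too_large")  # e.g. extra zeroes or wrong units
    if value == int(value) and is_ascending_run(int(value)):
        flags.append("ascending_digits")  # e.g. 12345678
    return flags


for v in (1.2, 12345678, 5e10, 250000):
    print(v, flag_value(v))
```

A flag only marks a value for review. It does not say what the correct value is, and some flagged values will be genuine.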
Once the data is incorrectly filled in, it is extremely difficult to correct
Miscoded numbers
• A challenge faced by everyone who has ever used the data.
• If a value is wrongly filled in, it is impossible to correct perfectly. We only try to minimize the damage.
• Any corrections are probabilistic: they are intended to remove more wrong values than correct ones.
We use many tools to correct values, but all are far from perfect
Miscoded numbers
• Our corrections
  • Comparing notice values with the sum of award values
  • Comparing prior information notice values with contract notice values
  • Comparing estimated values with final values
  • Removing very small values
  • Removing very large values
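The first correction in the list, comparing a notice value with the sum of its award values, could look roughly like this. It is a sketch under stated assumptions: the function name, the returned labels, and the `max_ratio` tolerance are hypothetical and do not describe the rules actually used.

```python
from typing import Optional, Sequence


def compare_notice_to_awards(
    notice_value: Optional[float],
    award_values: Sequence[float],
    max_ratio: float = 10.0,  # assumed tolerance: values may differ up to 10x
) -> str:
    """Classify a notice by how well its value matches the sum of its awards."""
    if notice_value is None or notice_value <= 0 or not award_values:
        return "unknown"  # nothing to compare against
    total = sum(award_values)
    if total <= 0:
        return "unknown"
    ratio = notice_value / total
    if 1 / max_ratio <= ratio <= max_ratio:
        return "consistent"
    # A large gap hints at typos or wrong units in one of the two fields.
    return "suspicious"


print(compare_notice_to_awards(1_000_000, [400_000, 550_000]))
print(compare_notice_to_awards(1_000, [400_000, 550_000]))
```

The check cannot tell which side is wrong, only that the two disagree. This fits the probabilistic nature of the corrections: a wide tolerance keeps more correct values at the cost of letting some wrong ones through.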
Only names enable us to really understand who is buying what
Miscoded text
• The same contracting authorities and winning companies appear under different names. E.g.:
  • Typos
  • Different legitimate names and naming conventions
  • Multiple acceptable languages
  • Special symbols
  • Inclusion of legal forms
• The number of entities is overestimated roughly 3-5x.
• We are starting an extensive IT project to clean up the text data.
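Some of the naming differences above (special symbols, capitalisation, legal forms) can be reduced by normalising names before counting entities. The sketch below is illustrative only and is not the clean-up project mentioned on the slide: `normalize_name` and the short `LEGAL_FORMS` list are hypothetical, and the approach does not handle typos, translated names, or non-Latin scripts.

```python
import re
import unicodedata

# Assumed, deliberately incomplete list of legal-form abbreviations.
LEGAL_FORMS = {"ltd", "limited", "gmbh", "sa", "sro", "as", "plc", "ag", "bv", "spa"}


def normalize_name(name: str) -> str:
    """Return a simplified key for matching organisation names."""
    # Split accented letters into base letter + combining mark, drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = unaccented.casefold()
    # Join dotted abbreviations ("a.s." -> "as") before splitting into tokens.
    no_dots = lowered.replace(".", "")
    # Replace every run of other non-alphanumeric characters with a space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", no_dots)
    tokens = [t for t in cleaned.split() if t not in LEGAL_FORMS]
    return " ".join(tokens)


print(normalize_name("Škoda Auto, a.s."))
print(normalize_name("SKODA AUTO A.S."))
```

Both calls produce the same key, so the two spellings would be counted as one entity. The trade-off is the risk of false merges: dropping a token such as "as" also removes it where it is an ordinary word, so matches from a key like this still need review.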