Synthetic Data Generation - Darshana Pathak
Synthetic Data • The process of creating a realistic data set. • Realistic means having the characteristics of real-world data: • Errors • Duplicates • Similar entities • Changing data
Types of Errors: • Spelling mistakes • Typographical errors • Insert, replace, delete • Transposition errors • Missing attributes • Computational errors (e.g. year) • …
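The character-level error types listed above (insert, replace, delete, transposition) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the slides: the function names, the `ERROR_FUNCS` table, and the choice of lowercase ASCII letters as the replacement alphabet are all assumptions.

```python
import random
import string


def insert_char(s, rng):
    """Insert one random lowercase letter at a random position."""
    i = rng.randrange(len(s) + 1)
    return s[:i] + rng.choice(string.ascii_lowercase) + s[i:]


def replace_char(s, rng):
    """Replace one character with a different lowercase letter."""
    if not s:
        return s
    i = rng.randrange(len(s))
    choices = [c for c in string.ascii_lowercase if c != s[i].lower()]
    return s[:i] + rng.choice(choices) + s[i + 1:]


def delete_char(s, rng):
    """Delete one character at a random position."""
    if not s:
        return s
    i = rng.randrange(len(s))
    return s[:i] + s[i + 1:]


def transpose_chars(s, rng):
    """Swap two adjacent characters."""
    if len(s) < 2:
        return s
    i = rng.randrange(len(s) - 1)
    return s[:i] + s[i + 1] + s[i] + s[i + 2:]


# Hypothetical lookup table: error name -> function that applies it.
ERROR_FUNCS = {
    "insert": insert_char,
    "replace": replace_char,
    "delete": delete_char,
    "transpose": transpose_chars,
}


def corrupt(value, rng, kind=None):
    """Apply one error of the given kind, or a randomly chosen kind."""
    if kind is None:
        kind = rng.choice(sorted(ERROR_FUNCS))
    return ERROR_FUNCS[kind](value, rng)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    for kind in sorted(ERROR_FUNCS):
        print(kind, corrupt("pathak", rng, kind))
```

Passing an explicit `random.Random` instance keeps the corruption reproducible, which helps when the same synthetic data set has to be regenerated for a later experiment.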
Why do we need it? • Availability of data suitable for record linkage and data visualization research • With all required attributes easily available • Privacy concerns • Personally Identifying Information • IRB approvals • Information disclosure laws
Base Data • We made our task easier by using a real data set as the base data for generating synthetic data. • The idea of using voter registration data came from a Vanderbilt University student’s PhD dissertation. • Voter registration data for one of the large counties in NC. • http://www.wakegov.com/elections/8data.htm
Voter Registration Data • Why Voter Registration Data is Available: • According to North Carolina law (General Statute 132), "The public records and public information compiled by the agencies of North Carolina Government or its subdivisions are the property of the people. Therefore, it is the policy of this State that the people may obtain copies of their public records and public information free or at minimal cost unless otherwise specifically provided by law." (Voter registration records are not exempt from this law.)
Data Generation - 1 • Pretty clean data!!! • Introduce realistic errors… Chicken and egg problem. • How do we know the pattern and percentage of different types of errors in real data? • If we knew the answer to this question, we could easily solve the record linkage problem. • Insert an ID/SSN-like column • Registration number
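One possible way to add an ID/SSN-like column is to draw unique random nine-digit numbers and format them as `XXX-XX-XXXX`. The sketch below is an assumption about how this could be done, not the procedure from the slides: the record layout (a list of dictionaries), the function name `add_synthetic_ids`, and the field name `synthetic_id` are all hypothetical.

```python
import random


def add_synthetic_ids(records, rng, field="synthetic_id"):
    """Return copies of the records with a unique SSN-like identifier added.

    The identifiers are purely random and carry no real-world meaning.
    """
    # rng.sample draws without replacement, so every number is distinct.
    numbers = rng.sample(range(10**9), len(records))
    result = []
    for rec, n in zip(records, numbers):
        digits = f"{n:09d}"  # zero-pad to nine digits
        new_rec = dict(rec)  # copy so the input record is not mutated
        new_rec[field] = f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"
        result.append(new_rec)
    return result


if __name__ == "__main__":
    # Hypothetical voter records for illustration only.
    voters = [
        {"first_name": "Ann", "last_name": "Doe"},
        {"first_name": "Bob", "last_name": "Roe"},
    ]
    for row in add_synthetic_ids(voters, random.Random(7)):
        print(row)
```

Because the identifier is generated before any errors are introduced, it can serve as ground truth when checking which corrupted records refer to the same person.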
Please Read: Date of birth is not provided in voter records. Per § 163-82.10, effective June 1, 2005, dates of birth that may be generated in the voter registration process, by either the State Board of Elections or a County Board of Elections, are confidential and shall not be considered public records and subject to disclosure to the general public under Chapter 132 of the General Statutes. No list produced under this section shall contain a voter's date of birth; however, lists may be produced according to voters' ages. • Bingo! We have ages, we can get the birth year!
Data Generation - 2 • Insert DOB column to the voters dataset. • Birth year = Current year – age. • Day and month? • Simulate the duplicates, twins, couples and families based on the last name, address, age and accordingly assign DOB
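The DOB step above can be sketched as follows. Everything beyond "birth year = current year - age" is an assumption made for illustration: the day and month are drawn uniformly at random, records that share last name, address, and age are treated as possible twins or duplicates and reuse one DOB with probability `twin_prob`, and the field names `last_name`, `address`, `age`, and `dob` are hypothetical. Note that the birth year derived this way can be off by one, depending on whether the person's birthday has already passed in the current year.

```python
import calendar
import datetime
import random


def random_dob(age, current_year, rng):
    """Pick a random date in the year given by current_year - age."""
    year = current_year - age
    month = rng.randint(1, 12)
    # monthrange returns (weekday of first day, number of days in month).
    last_day = calendar.monthrange(year, month)[1]
    day = rng.randint(1, last_day)
    return datetime.date(year, month, day)


def assign_dobs(records, current_year, rng, twin_prob=0.5):
    """Return copies of the records with a 'dob' field added.

    Records sharing (last_name, address, age) reuse the first DOB drawn
    for that group with probability twin_prob (an assumed parameter).
    """
    group_dob = {}
    result = []
    for rec in records:
        key = (rec["last_name"], rec["address"], rec["age"])
        if key in group_dob and rng.random() < twin_prob:
            dob = group_dob[key]  # twin or duplicate: same DOB
        else:
            dob = random_dob(rec["age"], current_year, rng)
            group_dob.setdefault(key, dob)
        result.append({**rec, "dob": dob})
    return result


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    voters = [
        {"last_name": "Doe", "address": "1 Main St", "age": 30},
        {"last_name": "Doe", "address": "1 Main St", "age": 30},
        {"last_name": "Roe", "address": "2 Oak Ave", "age": 41},
    ]
    for row in assign_dobs(voters, 2010, random.Random(3)):
        print(row)
```

The same grouping key could be relaxed (for example, last name and address only) to simulate couples and families, with age differences constrained rather than DOBs shared.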
Future Plan • Machine learning techniques to simulate real-world data errors during synthetic data generation?