Data Preparation and Profiling: Strategies, challenges, and experiences Tim Norris and Mark Lundgren
Today's Agenda • Introductions • Data Profiling and Readiness • Lessons Learned • Future Direction
About the P20W Data Warehouse • Statewide longitudinal data system • De-identified data about people's early childhood, kindergarten through 12th grade, higher education, and workforce experiences and performance • Collected and linked from existing state agency data systems • Includes data about the kinds of services people receive, the programs in which they participate, their academic performance, and their program or degree completion • Also includes a variety of demographic data, so different groups of people can be examined • Personally identifiable information, such as names, social security numbers, addresses, and other data that can identify a person as an individual, is not part of the research database
ERDC Data Sources Output data Research OFM
Data Flow Process • Chart of data flow goes here
Data Source Characteristics • Over 20 source data feeds • Data systems being developed in parallel • Some sources migrated historical data, some did not
Data Preparation: Data Profiling • Do it early, do it often • Verification of data dictionary • Descriptive statistics • Distinct counts and percentages • Zeros, blanks, and nulls • Minimum and maximum values • Patterns of data
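The descriptive statistics listed above can be sketched in a few lines of Python. This is only an illustration, not the presenters' tooling (the slides name SAS, Informatica Data Analyst, and Excel). The function names `pattern_of` and `profile_column` are hypothetical, and the sketch assumes a column arrives as a list of strings with `None` standing in for a null.

```python
from collections import Counter
import re


def pattern_of(value):
    """Reduce a value to its shape: digits become 9, letters become A."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)


def profile_column(values):
    """Profile one column given as a list of strings (None means null)."""
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    blanks = sum(1 for v in values if v is not None and v.strip() == "")
    # Values that are neither null nor blank, with whitespace trimmed.
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    zeros = sum(1 for v in present if v == "0")
    counts = Counter(present)
    # Share of all rows (including nulls and blanks) taken by each value.
    percentages = {k: round(100.0 * n / total, 1) for k, n in counts.items()}
    try:
        # Compare numerically when every present value parses as a number.
        ordered = sorted(present, key=float)
    except ValueError:
        # Otherwise fall back to plain string ordering.
        ordered = sorted(present)
    return {
        "total": total,
        "nulls": nulls,
        "blanks": blanks,
        "zeros": zeros,
        "distinct": len(counts),
        "percentages": percentages,
        "min": ordered[0] if ordered else None,
        "max": ordered[-1] if ordered else None,
        "patterns": Counter(pattern_of(v) for v in present),
    }


# Made-up sample column for illustration only.
sample = ["0", "9", "9", "", None, "10"]
report = profile_column(sample)
```

Running the same function on every column of every incoming file, each time a file arrives, is one way to act on "do it early, do it often". The pattern counts make format drift visible, for example an identifier that is usually `999999` but sometimes `AA9999`.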
Data Preparation: Data Profiling • Dataset validation checks • Counts of records by time, institution • Values and codes over time • Systematic changes (0,1 to Y,N) • Values defined in data dictionary • Quality of data • Names and identifiers • Data elements
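As a rough sketch of the validation checks above, the following Python counts records per period, flags consecutive periods whose code sets share no values (the kind of systematic change the slide describes as 0,1 to Y,N), and lists codes missing from the data dictionary. All function names, field names, and sample rows are hypothetical; the presenters' actual checks are not shown in the slides.

```python
from collections import Counter, defaultdict


def records_by_period(records, period_field):
    """Count records per reporting period."""
    return Counter(row[period_field] for row in records)


def codes_by_period(records, period_field, code_field):
    """Map each period to the set of codes observed in that period."""
    seen = defaultdict(set)
    for row in records:
        seen[row[period_field]].add(row[code_field])
    return dict(seen)


def coding_changes(codes_per_period):
    """Return (previous, current) period pairs whose code sets do not overlap."""
    periods = sorted(codes_per_period)
    return [
        (prev, cur)
        for prev, cur in zip(periods, periods[1:])
        if codes_per_period[prev].isdisjoint(codes_per_period[cur])
    ]


def undefined_codes(records, code_field, allowed):
    """Count observed codes that the data dictionary does not define."""
    return Counter(
        row[code_field] for row in records if row[code_field] not in allowed
    )


# Made-up rows: the flag switches from 0/1 to Y/N in 2011.
rows = [
    {"year": 2009, "flag": "0"},
    {"year": 2009, "flag": "1"},
    {"year": 2010, "flag": "1"},
    {"year": 2011, "flag": "Y"},
    {"year": 2011, "flag": "N"},
]
counts = records_by_period(rows, "year")
changes = coding_changes(codes_by_period(rows, "year", "flag"))
unknown = undefined_codes(rows, "flag", allowed={"Y", "N"})
```

The disjoint-set test is deliberately strict: it only reports a change when no code carries over between periods. A gradual transition, where old and new codes coexist for a year, would need a looser comparison.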
Data Preparation: Data Profiling • Toolset varied by analyst • SAS • Informatica Data Analyst • Excel • Goal of understanding the data • Constraints • Completeness, patterns over time • Values of each data element
Data Preparation: Data Readiness • Document and expand results of profiling process • Generate the “go-to” resource for follow-up questions • Resource to begin data loading • Content that feeds the data dictionary
Data Preparation: Data Readiness • Information about: • Data provider • Data file • Data elements
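One possible way to hold the three levels of readiness information (provider, file, element) is a small nested record. The sketch below is an assumption about structure only: every class, field name, and sample value is hypothetical and is not taken from the presenters' actual readiness document.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DataElement:
    """One data element, with what the dictionary says and what profiling found."""
    name: str
    dictionary_definition: str
    profiling_notes: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)


@dataclass
class DataFile:
    """One delivered file and the elements it contains."""
    file_name: str
    record_count: int
    elements: List[DataElement] = field(default_factory=list)


@dataclass
class ReadinessDocument:
    """Readiness information for a single data provider."""
    provider: str
    files: List[DataFile] = field(default_factory=list)

    def open_questions(self) -> List[Tuple[str, str, str]]:
        """List (file, element, question) triples still awaiting an answer."""
        return [
            (f.file_name, e.name, q)
            for f in self.files
            for e in f.elements
            for q in e.open_questions
        ]


# Made-up example entry.
flag = DataElement(
    name="flag",
    dictionary_definition="Y or N",
    profiling_notes=["Coded 0/1 before 2011"],
    open_questions=["Does 1 map to Y?"],
)
doc = ReadinessDocument(
    provider="Example Agency",
    files=[DataFile("enrollment.csv", 5, [flag])],
)
pending = doc.open_questions()
```

Keeping open questions next to the element they concern lets the same record serve as a follow-up list for the provider and as source material for the data dictionary.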
What we’ve learned • Customers need to be involved • Dictionaries don’t match data • Educate our analysts on the data and the customer on the vision of the database • Avoid custom extracts • More time required up front
Toward the Future • Empower the provider by offering guidance and tools for profiling • Develop a feedback process that returns data quality findings and edits to the customer • Be open and transparent