
Data Preparation and Profiling: Strategies, challenges, and experiences

Data Preparation and Profiling: Strategies, challenges, and experiences. Tim Norris and Mark Lundgren. Today's agenda: introductions, data profiling and readiness, lessons learned, future direction. About the P20W Data Warehouse: statewide longitudinal data system.


Presentation Transcript


  1. Data Preparation and Profiling: Strategies, challenges, and experiences Tim Norris and Mark Lundgren

  2. Today's Agenda • Introductions • Data Profiling and Readiness • Lessons Learned • Future Direction

  3. About the P20W Data Warehouse • Statewide longitudinal data system • De-identified data about people's early childhood, kindergarten through 12th grade, higher education, and workforce experiences and performance • Collected and linked from existing state agency data systems • Includes data about the kinds of services people receive, the programs in which they participate, and their academic performance and program or degree completion • Also includes a variety of demographic data, so we can look at different groups of people • Personally identifiable information, such as names, Social Security numbers, addresses, and other data that can identify a person as an individual, is not part of the research database

  4. ERDC Data Sources • [Diagram labels: Data sources, Output data, Research, OFM]

  5. Data Flow Process • Chart of data flow goes here

  6. Data Source Characteristics • Over 20 source data feeds • Data systems being developed in parallel • Some sources migrated historical data, some did not

  7. Data Preparation: Data Profiling • Do it early, do it often • Verification of data dictionary • Descriptive statistics • Distinct counts and percentages • Zeros, blanks, and nulls • Minimum and maximum values • Patterns of data
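
The descriptive statistics listed on this slide can be illustrated with a small sketch. The presenters name SAS, Informatica Data Analyst, and Excel as their tools, so the Python below is not their method. It is only one way the same checks might look, assuming a column arrives as a list of raw strings with `None` standing for a null. The function names `profile_column`, `to_pattern`, and `as_number` are hypothetical.

```python
import re
from collections import Counter


def to_pattern(text):
    """Reduce a value to its shape: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", text))


def as_number(text):
    """Return the value as a float, or None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def profile_column(values):
    """Profile one column of raw string values, where None stands for a null."""
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    blanks = sum(1 for v in values if v is not None and v.strip() == "")
    # Values that are neither null nor blank, with surrounding spaces removed.
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    numbers = [as_number(v) for v in present]
    all_numeric = bool(present) and all(n is not None for n in numbers)
    # Compare numerically when every value parses as a number, else as text.
    ordered = numbers if all_numeric else present
    counts = Counter(present)
    return {
        "total": total,
        "nulls": nulls,
        "blanks": blanks,
        "zeros": sum(1 for n in numbers if n == 0),
        "distinct": len(counts),
        # Share of all rows (including nulls and blanks) taken by each value.
        "percentages": {v: 100.0 * c / total for v, c in counts.items()},
        "min": min(ordered) if ordered else None,
        "max": max(ordered) if ordered else None,
        "patterns": Counter(to_pattern(v) for v in present),
    }
```

Calling `profile_column(["12", "7", "", None, "0", "12"])` returns a dictionary whose `patterns` entry groups values by shape, which is a quick way to spot identifiers or dates that do not follow the expected format.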

  8. Data Preparation: Data Profiling • Dataset validation checks • Counts of records by time, institution • Values and codes over time • Systematic changes (0,1 to Y,N) • Values defined in data dictionary • Quality of data • Names and identifiers • Data elements
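
As a rough illustration of the validation checks on this slide, the sketch below counts records by time and institution, flags values not defined in the data dictionary, and reports periods where the set of codes changes (for example 0,1 becoming Y,N). It assumes records are plain dictionaries. The field names `year`, `institution`, and `enrolled` and all function names are hypothetical, not taken from the P20W warehouse.

```python
from collections import Counter, defaultdict


def count_by(records, *keys):
    """Count records for each combination of the given fields."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def undefined_values(records, field, allowed):
    """Count values of a field that the data dictionary does not define."""
    return Counter(r[field] for r in records if r[field] not in allowed)


def code_changes(records, field, period_key):
    """List consecutive periods whose sets of codes differ."""
    by_period = defaultdict(set)
    for r in records:
        by_period[r[period_key]].add(r[field])
    periods = sorted(by_period)
    changes = []
    for prev, cur in zip(periods, periods[1:]):
        if by_period[prev] != by_period[cur]:
            changes.append((prev, cur, by_period[prev], by_period[cur]))
    return changes


# Made-up sample rows, only to show the shape of the input.
records = [
    {"year": 2010, "institution": "A", "enrolled": "0"},
    {"year": 2010, "institution": "A", "enrolled": "1"},
    {"year": 2010, "institution": "B", "enrolled": "1"},
    {"year": 2011, "institution": "A", "enrolled": "Y"},
    {"year": 2011, "institution": "B", "enrolled": "N"},
]

record_counts = count_by(records, "year", "institution")
not_in_dictionary = undefined_values(records, "enrolled", {"Y", "N"})
changes = code_changes(records, "enrolled", "year")
```

Here `changes` holds one entry for the 2010 to 2011 boundary, because the codes switch from `0`/`1` to `Y`/`N`. A sudden drop in `record_counts` for one institution or year would be a similar prompt for a follow-up question to the provider.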

  9. Data Preparation: Data Profiling • Toolset varied by analyst • SAS • Informatica Data Analyst • Excel • Goal of understanding the data • Constraints • Completeness, patterns over time • Values of each data element

  10. Data Preparation: Data Readiness • Document and expand results of profiling process • Generate the “go-to” resource for follow-up questions • Resource to begin data loading • Content that feeds the data dictionary

  11. Data Preparation: Data Readiness • Information about: • Data provider • Data file • Data elements
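
One way to picture a readiness document organized around provider, file, and elements is as a small data structure. The sketch below is an assumption for illustration only: the slides do not list the actual template fields, so every class and field name here (`ReadinessDocument`, `ElementReadiness`, `notes`, `profile`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ElementReadiness:
    """Findings for one data element (hypothetical fields)."""
    name: str
    notes: str = ""
    profile: dict = field(default_factory=dict)  # e.g. profiling results


@dataclass
class ReadinessDocument:
    """Readiness information grouped by provider, file, and elements."""
    provider: str
    file_name: str
    elements: list = field(default_factory=list)

    def dictionary_entries(self):
        """Return the element notes in a form that could feed a data dictionary."""
        return [{"element": e.name, "notes": e.notes} for e in self.elements]
```

Keeping the element notes in a structured form is what would let the same document serve both as the reference for follow-up questions and as input to the data dictionary.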

  12. Readiness Content Items

  13. Data Readiness Template

  14. What we’ve learned • Customers need to be involved • Dictionaries don’t match data • Educate our analysts on the data, and the customer on the vision of the database • Avoid custom extracts • More time required up front

  15. Toward the Future • Empower the provider by offering guidance and tools for profiling • Develop a process for feeding data quality findings and edits back to the customer • Open and transparent

  16. Questions?
