130 likes | 255 Views
Data Cleaning and Transformation. Playing in the Mud. The Many Roles of Knowledge Workers. Brilliant IS idea. The Gist of the Problem. Getting data out of some system to: Analyze it (e.g. Excel, Access, stats package) Get it into another system (e.g. ERP)
E N D
Data Cleaning and Transformation Playing in the Mud
The Many Roles of Knowledge Workers Brilliant IS idea
The Gist of the Problem • Getting data out of some system to: • Analyze it (e.g. Excel, Access, stats package) • Get it into another system (e.g. ERP) • “Smart” manipulation of electronic reports with embedded data • Don’t want to do it manually (why?)
A former student describing new job with major consulting firm • “Lots of modeling, muddy data problems, and working with OLAP tools and data warehousing. I know the muddy data area was a particular area of interest to you and it seems that it is a really BIG issue for many businesses - actually bigger than I imagined.”
ACD Report Example Report Header Date Split Blank lines Data! Totals Lines Next report
Why Talk About This? • Very common problem in business • [insert examples] • Huge amount of time wasted doing manual processing • Really useful spreadsheet and database skills (and mindset) • Example: from the EXCEL-L Developers listserv • It’s fun to play in the mud sometimes • Commercial products such as Content Extractor (formerly Cambio) (www.datajunction.com) are available if you need to do this stuff routinely. Doing it here will allow you to become power users of such products quickly. • Next few slides are screen shots from Content Extractor
This is the main window where one defines style definitions for the different types of lines in the data file.
The date is always in the same spot so we can use Fixed Column to get it out of the line we’ve defined as “DAY”.
Use “Floating Tags” for data that could appear in different positions on a line. Ex: we don’t know how long the split name will be, but we do know that it will end with a period. For data appearing in headers/footers, (e.g. Split Name and #), we tell Content Extractor to “Propogate Field Contents” so we get the name and number with each detail line.
Repeating day, split num and split name Check out our progress.
Export to a wide variety of file types. ContExTest-acddata.asc Here’s an export I did to a delimited ASCII file.