310 likes | 678 Views
Data Mining Process Source: CRISP-DM (SPSS.com website). Data Cleaning . MIS Issues (Source: Article by Ralph Kimball) Analyst Issues. MIS Issues . Elementizing (Parsing) Standardizing Verifying Matching, Householding Documenting. Elementising.
E N D
Data Mining Process Source: CRISP-DM (SPSS.com website)
Data Cleaning • MIS Issues • (Source: Article by Ralph Kimball) • Analyst Issues
MIS Issues • Elementizing (Parsing) • Standardizing • Verifying • Matching, • Householding • Documenting
Elementising • Ralph B and Julianne Kimball Trustees for Kimball Fred CSte. 11613150 Hiway 9Box 1234 Boulder CrkColo 95006
Addressee First Name(1): RalphAddressee Middle Initial(1): BAddressee Last Name(1): KimballAddressee First Name(2): JulianneAddressee Last Name(2): KimballAddressee Relationship: Trustees forRelationship Person First Name: FredRelationship Person Middle Name: CRelationship Person Last Name: KimballStreet Address Number: 13150Street Name: Hiway 9Suite Number: 116Post Office Box Number: 1234City: Boulder CrkState: ColoFive Digit Zip: 95006
Standardizing • Ste = suite • Hiway 9 = Highway 9 • Other example - • Grade “D” = Distinction in Australia
Verification • Zip code 95006 is CA, not Colorado
Matching/Householding • Match record with other customer records containing Ralph and Julianne Kimball • Establish that they are part of the same household
Analyst Issues • Physical data problems • Data Dictionaries • Validation (Frequencies) • Missing Data • The “zero” value problem • Inappropriate (Future) data for modeling • Unavailable data
Physical • Cannot access data • ASCII vs EBCDIC • On a medium that you can’t use (certain type of tape, for instance)
Data Dictionaries • What are the fields? • Where are they located? • What format are they stored in?
Missing Data • Ignore • Find the right values if you can • Use Average for that variable • Replace with number that matches its characteristics (What do the missing people look like in terms of the dependent? Who else looks like that?
The zero problem • What does 0 mean? • If “Number of Revolving Bankcard Trades Currently Past Due” = 0, what does that mean?
# of Bank Rev. Trds Currently Past Due Cumulative Cumulative BRPSTD Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0 13485 96.0 13485 96.0 1 486 3.5 13971 99.5 2 57 0.4 14028 99.9 3 12 0.1 14040 100.0 4 2 0.0 14042 100.0
# of Trds Cumulative Cumulative TRADES Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0 1606 11.4 1606 11.4 1 1080 7.7 2686 19.1 2 1056 7.5 3742 26.6 3 1007 7.2 4749 33.8 4 949 6.8 5698 40.6 5 911 6.5 6609 47.1 6 849 6.0 7458 53.1 7 793 5.6 8251 58.8 8 682 4.9 8933 63.6 9 622 4.4 9555 68.0 10+ 4487 32.0 14042 100.0
# of Bank Rev. Trds Cumulative Cumulative BRTRDS Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ INQS. & PR ONLY 64 0.5 64 0.5 PR ONLY 22 0.2 86 0.6 INQS. ONLY 960 6.8 1046 7.4 NO RECORD 560 4.0 1606 11.4 0 6183 44.0 7789 55.5 1 2616 18.6 10405 74.1 2 1427 10.2 11832 84.3 3 831 5.9 12663 90.2 4 496 3.5 13159 93.7 5 287 2.0 13446 95.8 6 188 1.3 13634 97.1 7 142 1.0 13776 98.1 8 92 0.7 13868 98.8 9 60 0.4 13928 99.2 10+ 114 0.8 14042 100.0
# of Bank Rev. Trds Currently Past Due Cumulative Cumulative BRPSTD Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ NO TRADES OF THIS TYPE 6183 44.0 6183 44.0 INQS. & PR ONLY 64 0.5 6247 44.5 PR ONLY 22 0.2 6269 44.6 INQS. ONLY 960 6.8 7229 51.5 NO RECORD 560 4.0 7789 55.5 MISSING 3475 24.7 11264 80.2 0 2221 15.8 13485 96.0 1 486 3.5 13971 99.5 2 57 0.4 14028 99.9 3 12 0.1 14040 100.0 4 2 0.0 14042 100.0
Inappropriate Data Used • Future data used to build great looking model. • Used payments till month end instead of payments until cycle date.
Unavailable Data • Data on Rejected Applicants • Would they have been Good or Bad had they been accepted? • Use “Reject Inferencing” techniques.