In memory of Dr. Jan Zytkow (SEP 09 1944 - JAN 16 2001)
ITCS - 8265
Mining Financial Data: Histograms & Contingency Tables
Shishir Gupta
Under the guidance of Dr. Mirsad Hadzikadic
Agenda • Database • Task goals • Tool & technique used • Data preparation and cleaning • Attribute selection • Data transformation • Data Mining/Pattern Evaluation • Knowledge presentation • Pros/Cons • Questions & Demonstration
Database • Financial Dataset from PKDD 1999 • Financial Dataset from a Czech Bank • Relational Dataset • 8 Relations: ACCOUNT, LOAN, DEMOGRAPH, ORDER, TRANSACTION, CARD, DISPOSITION, CLIENT
Task Goal • Identify good clients to offer additional services • Identify bad clients to watch carefully to minimize bank losses • Services offered: • Loan • Credit Card
Technique Used – Histogram • SQL statement used:
SELECT age, COUNT(age)
FROM table_x
GROUP BY age
ORDER BY age
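To make the histogram query concrete, here is a minimal sketch that runs it against an in-memory SQLite database using Python's standard `sqlite3` module. The table contents are made-up sample values, and SQLite itself is an assumption for illustration; the slides do not say which RDBMS was used.

```python
import sqlite3

# Hypothetical in-memory stand-in for table_x; the ages are made-up sample values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_x (id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany(
    "INSERT INTO table_x (age) VALUES (?)",
    [(25,), (31,), (25,), (47,), (31,), (25,)],
)

# Same query as on the slide: one row per distinct age with its frequency.
histogram = conn.execute(
    "SELECT age, COUNT(age) FROM table_x GROUP BY age ORDER BY age"
).fetchall()

# Render each (age, count) pair as a simple text bar.
for age, count in histogram:
    print(f"{age:>3} | {'#' * count} ({count})")
```

Each result row is a histogram bin, so the plotting step reduces to drawing one bar per row.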
Technique Used – C-Tables • SQL statement used:
SELECT sex, COUNT(sex), age
FROM table_x a, table_y b
WHERE a.id = b.fid
GROUP BY sex, age
ORDER BY sex, age
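The contingency-table query returns one row per (sex, age) combination; pivoting those rows gives the familiar two-way table. The sketch below assumes SQLite, hypothetical table layouts for `table_x` and `table_y`, already-discretized age ranges, and made-up sample rows.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schemas and sample rows; only the query itself comes from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_x (id INTEGER PRIMARY KEY, sex TEXT);
    CREATE TABLE table_y (fid INTEGER, age TEXT);
    INSERT INTO table_x VALUES (1, 'F'), (2, 'M'), (3, 'F'), (4, 'M');
    INSERT INTO table_y VALUES (1, '20-39'), (2, '20-39'), (3, '40-59'), (4, '20-39');
""")

# Query from the slide: joint frequency of each (sex, age) pair.
rows = conn.execute(
    "SELECT sex, COUNT(sex), age FROM table_x a, table_y b "
    "WHERE a.id = b.fid GROUP BY sex, age ORDER BY sex, age"
).fetchall()

# Pivot the flat rows into table[sex][age] = count.
table = defaultdict(dict)
for sex, count, age in rows:
    table[sex][age] = count

for sex, counts in table.items():
    print(sex, counts)
```

Combinations that never occur produce no row in the query result, so a complete table would need the missing cells filled with zero.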
Technique Used – Correlation • SQL statement used:
SELECT x, y
FROM table_x a, table_y b
WHERE a.id = b.fid
ORDER BY x, y
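The correlation query only retrieves the (x, y) pairs, so the coefficient has to be computed outside the database. The slides do not name the measure; the sketch below assumes the Pearson correlation coefficient, with SQLite, hypothetical table layouts, and made-up values.

```python
import math
import sqlite3

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical schemas and sample values; only the query comes from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_x (id INTEGER PRIMARY KEY, x REAL);
    CREATE TABLE table_y (fid INTEGER, y REAL);
    INSERT INTO table_x VALUES (1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0);
    INSERT INTO table_y VALUES (1, 2.0), (2, 4.1), (3, 5.9), (4, 8.0);
""")

pairs = conn.execute(
    "SELECT x, y FROM table_x a, table_y b WHERE a.id = b.fid ORDER BY x, y"
).fetchall()

r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
print(f"r = {r:.4f}")
```

A value of r near +1 or -1 suggests a strong linear relationship between the two continuous attributes; a value near 0 suggests none.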
Data Cleaning • Missing values • Relation DEMOGRAPH • Incorrect values • Relation TRANSACTION (data reduced by 10% after cleaning)
Data Preparation • Relation CLIENT • Separating SEX & BDATE from BIRTHNUMBER • All date fields converted to AGE, relative to the reference date 199901
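The split of BIRTHNUMBER into SEX and BDATE, and the conversion to AGE, might look like the sketch below. It rests on three assumptions not stated on the slide: that the birth number is encoded as YYMMDD with 50 added to the month for women (the convention described for the PKDD 1999 data), that all birth years fall in the 1900s, and that "199901" means January 1999 in YYYYMM form. The function names are hypothetical.

```python
from datetime import date

# Assumption: "Ref 199901" is read as year 1999, month 01.
REF_YEAR, REF_MONTH = 1999, 1

def split_birth_number(birth_number: int):
    """Split a YYMMDD birth number into (sex, birth date).

    Assumption: women have 50 added to the month field.
    """
    yy = birth_number // 10000
    mm = (birth_number // 100) % 100
    dd = birth_number % 100
    sex = "F" if mm > 50 else "M"
    if mm > 50:
        mm -= 50
    return sex, date(1900 + yy, mm, dd)  # assumption: 20th-century births only

def age_at_reference(bdate: date) -> int:
    """Completed years of age at the reference month (month-level precision)."""
    age = REF_YEAR - bdate.year
    if bdate.month > REF_MONTH:
        age -= 1
    return age

sex, bdate = split_birth_number(706213)
print(sex, bdate, age_at_reference(bdate))
```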
Data Preparation (cont.) • Creating table definitions • Setting up data in a table-compatible format • Loading data into the database • Evaluating loading errors and changing attribute definitions accordingly
Attribute Selection • Decision Relation • LOAN • Decision Attribute • STATUS • Classification Attributes • All attributes that do not belong to the LOAN relation
[Figure: decision tree with Y/N tests on attributes A4, A1 and A6, ending in leaves Class1 and Class2]
Data Transformation • Discretization • Continuous attributes into 4 to 10 buckets • Only transactions performed in 1997 considered for relation TRANSACTION • Due to resource limitations • Most loans were approved during this period
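The slide does not say how the bucket boundaries were chosen. As one possible reading, the sketch below uses equal-width binning, with the 4-to-10 range enforced as a guard; the function name and sample values are hypothetical.

```python
def equal_width_buckets(values, n_buckets):
    """Map each value to a bucket index 0..n_buckets-1 using equal-width bins.

    Assumption: equal-width binning; the slides only give the bucket count range.
    """
    if not 4 <= n_buckets <= 10:
        raise ValueError("the slides use between 4 and 10 buckets")
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    if width == 0:
        # All values identical: everything falls into the first bucket.
        return [0] * len(values)
    # The maximum value would land in bucket n_buckets, so clamp it to the last one.
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]

amounts = [0, 10, 25, 50, 75, 100]  # made-up sample values
buckets = equal_width_buckets(amounts, 4)
print(buckets)
```

Equal-frequency binning is a common alternative when the attribute is heavily skewed, since equal-width bins can leave most buckets nearly empty.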
Data Mining/Pattern Evaluation • Run a histogram on every non-key attribute to study its distribution. • Discretize continuous attributes. • Run a contingency table to study the dependence between two attributes. • Check significance with the correlation function if both attributes are continuous.
Knowledge Presentation - 1 • All loans on accounts where a second person is allowed to dispose are GOOD LOANS (100%)
Knowledge Presentation - 2 • Permanent orders of type household & leasing indicate financial stability
Knowledge Presentation - 3 • Loans on accounts with cash withdrawals are more likely to be repaid
Knowledge Presentation - 4 • Accounts with low transaction amounts indicate good loans
Knowledge Presentation - 5 • Accounts that are in debt indicate BAD LOANS
Pros • Flexibility to alter the data presentation to understand the nature of the data • Customers with no data mining background can appreciate the results because of their simplicity • Since results can be stored in a file, subsequent analysis on a subset of the data becomes very easy
Cons • Needs capability for Multi-Variable analysis. • Some kind of quantification needs to be put in. • Performance issues with using RDBMS.
Questions & Demonstration Thank You