350 likes | 367 Views
Credit Card Analysis of Czech Bank. Goals. Our goal for this project is to analyze customer and credit-card information, from the Berka dataset, to extrapolate the type of customer who makes a good candidate for a credit-card, and what level of credit to extend to that customer.
E N D
Credit Card Analysis of Czech Bank Goals • Our goal for this project is to analyze customer and credit-card information, from the Berka dataset, to extrapolate the type of customer who makes a good candidate for a credit-card, and what level of credit to extend to that customer. • The Berka dataset is from the 1999 PKDD Discovery Challenge. • The Berka dataset is a collection of financial information from a Czech bank.
Credit Card Analysis of Czech Bank Domain Description • Entity-Relationship • Description • Each account has both static characteristics (e.g. date of creation, address of the branch) given in relation "account" and dynamic characteristics (e.g. payments debited or credited, balances) given in relations "permanent order" and "transaction". • Relation "client" describes characteristics of persons who can manipulate with the accounts. • Relations "loan" and "credit card" describe some services which the bank offers to its clients; • Relation "demographic data" gives some publicly available information about the districts (e.g. the unemployment rate); additional information about the clients can be deduced from this.
Credit Card Analysis of Czech Bank Domain Description - Continued • The dataset contains the following tables: • Accounts • Each record describes static characteristics of an account • Size: 4500 records • Clients • Each record describes characteristics of a client • Size : 5369 records • Disposition (Disp) • Each record relates a client with an account and describes the client’s right to operate that account • Size: 5369 records
Credit Card Analysis of Czech Bank Domain Description - Continued • Dataset Table Description, Continued • Permanent Orders, Debit Only (Orders) • Each record describes characteristics of a payment order • Size : 6471 records • Transactions (Trans) • Each record describes one transaction on an account • Size: 1056320 records • Loans • Each record describes a loan granted for a given account • Size: 682 records
Credit Card Analysis of Czech Bank Domain Description - Continued • Dataset Table Description, Continued • Credit Cards (Cards) • Each record describes a credit card issued to an account • Size : 892 records • Demographic Data (District) • Each record describes demographic characteristics of a district • Size: 77 records
Credit Card Analysis of Czech Bank Data Preprocessing Activities • Converted the ascii files to: • MS Excel and/or MS Word files for cleaning data • MS Access database for use in • data mining, • de-normalizing or ‘flattening’ files, and • basic querying to learn more about the data. • Put all modified files into file types recognized by Weka. • These files are comma delimited with a ‘heading’ of attribute definition information.
Credit Card Analysis of Czech Bank Data Preprocessing Activities - Continued • Verified all table relationships: • Every account has an Owner via Disp and Account tables • Order and Loan records are duplicated in transaction records. That is, the transactions include Order records and Loan payments. • Loan records in Trans are identified by k_symbol=”LP”
Credit Card Analysis of Czech Bank Data Preprocessing Activities - Continued • De-normalize, or ‘flatten’, files for mining. Our database is relational. In order to mine or cluster attributes, those attributes must be in a single table. We have created a de-normalized table based on our goals. • Goal: Analyze credit-card information to extrapolate the type of customer who makes a good candidate for a credit-card. • i.Account-Client-Disp-Card-District-Loan-Transaction Table • Using information we discovered about accounts from previous clustering, cluster customer information • ·Using card type as clustering attribute • ·Added "N" (None) as a possible value to the Loan Status attribute • In order to better understand customers, we looked at this table in two ways: • To identify, from all customers, which were credit card holders and which were not. • To examine the variances that exist between all credit card holding customers.
Credit Card Analysis of Czech Bank Data Preprocessing Activities - Continued Column Description Changes Missing or Invalid Values Notes Frequency Frequency of Statement Issuance • Translated values as follows: • POPLATEK MESICNE changed to MONTHLY ISSUANCE (MI) • POPLATEK TYDNE changed to WEEKLY ISSUANCE (WI) • POPLATEK PO OBRATU change to ISSUANCE AFTER TRANSACTION (TI) N/A Translated for ease of use Date Date of account creation Removed Ignore This attribute is not used in our mining effort • Add, change, remove, and descretize attributes as necessary. The tables shown below describe the changes made: • Preprocessing Activities – Account Table: Changes Made With Excel
Credit Card Analysis of Czech Bank Data Preprocessing Activities - Continued Column Description Changes Missing or Invalid Values Notes Changes Made With Excel BirthNumber Birthday and gender Removed This is a 6-digit number. The documentation says that its format is as follows: - YYMMDD (Men) - YYMM50+DD (Women) Analysis suggests that the format is as follows: - YYMMDD (Men) - YY50+MMDD (Women) Format changed to: - MM/DD/YYYY - Created a new field for Client_Sex N/A Attribute removed from dataset and replaced by attributes Client_Sex and Client_Age • Preprocessing Activities – Client Table:
Credit Card Analysis of Czech Bank Data Preprocessing Activities - Continued Column Description Changes Missing or Invalid Values Notes Client_Sex Gender derived from BirthNumber attribute Added Built during data pre-processing. Values are MALE (M) and FEMALE (F) N/A Distribution of Values: - 49% Male- 51% Female Client_Age Discretized value derived from BirthNumber attribute Values: - 1 = YOUTH (0 - 24) - 2 = ADULT (24 - 35) - 3 = MIDDLE-AGE (36 - 64) - 4 = SENIOR (65 - *) Added Built during data pre-processing. N/A None • Preprocessing Activities – Client Table, Continued:
Credit Card Analysis of Czech Bank Data Preprocessing Activities - Continued Column Column Description Description Changes Changes Missing or Invalid Values Missing or Invalid Values Notes Notes Type Type of Disposition(owner/user) • Translated as follows: • Disponent = User • Owner = Owner User Distribution of Values: - 84% Owner - 16% User K_Symbol Characterization of the payment Removed 22% of the values are missing Removed attribute due to missing values • Preprocessing Activities – Disposition Table Changes Made With Excel • Preprocessing Activities – Order Table Changes Made With Excel
Credit Card Analysis of Czech Bank Column Description Changes Missing or Invalid Values Notes Data Preprocessing Activities - Continued Date Date the loan was granted Changed format from YYMMDD to MM/DD/YYYY Ignore • Correlation with Status Attribute: • R2 = 98% • r = 0.49 Amount Amount of loan • Removed • Discretized values stored in Amt attribute N/A • Correlation with Status Attribute: • R2 = 68% • r = 0.34 Duration • Duration of the loan • Possible Values: • 12 months • 24 months • 36 months • 48 months • 60 months • Removed • Discretized values stored in Dur attribute N/A • Correlation with Status attribute: • R2 = 100% • r = 0.51 • Distribution of Values: • 12 = 19% • 20 = 20% • 36 = 19% • 40 = 20% • 60 = 21% • Preprocessing Activities – Loan Table Changes Made With Excel
Credit Card Analysis of Czech Bank Column Description Changes Missing or Invalid Values Notes Data Preprocessing Activities - Continued Payments Monthly loan payment • Removed • Discretized values stored in Amt attribute N/A Dur • Discretized Duration Attribute: • (*,30) • (31,42) • (43,54) • (55,*) • Added • Discretized in Rosetta using Entropy Algorithm N/A • Distribution of values: • (*,30) = 39% • (31,42) = 19% • (43,54) = 20% • (55,*) = 21% Pmt • Discretized Payment Attribute: • (*,8041) = 1 • (8041,*) = 2 • Added • Discretized in Rosetta using Entropy Algorithm • Rosetta discretized into 50+ values.. We merged values using <1% N/A • Distribution of values: • (*,8041) = 94% • (8041,*) = 6% Amt • Discretized Amount Attribute: • (*,30708) = 1 • (30709,49380) = 2 • (49381,76926) = 3 • (76927,230310) = 4 • (230311,*) = 5 • Added • Discretized in Rosetta using Entropy Algorithm • Rosetta discretized into 100+ values.. We merged values using <1% N/A • Distribution of values: • (*,30708) = 9% • (30709,49380) = 10% • (49381,76926) = 13% • (76927,230310) = 47% • (230311,*) = 21% • Preprocessing Activities – Loan Table, Continued
Credit Card Analysis of Czech Bank Column Description Changes Missing or Invalid Values Notes Data Preprocessing Activities - Continued Date Date of transaction Format changed from YYMMDD to MM/DD/YYYY Ignore Type +/- transaction • Translated Values: • PRIJEM = Credit • VYDAJ = Withdrawal Flag Record - Do not use Operation Mode of transaction • Translated values: • VYBER KARTOU = Credit Card Withdrawal (CCW) • VKLAD = Credit in Cash (CRC) • PREVOD Z UCTU = Collection from Another Bank (CAB) • VYBER = Withdrawal in Cash (WC) • PREVOD NA UCET = Remittance to Another Bank (RAB) Flag record - Do not use K_Symbol Characterization of transaction • Translated Values: • POJISTNE = Insurance Payment (IP) • SLUZBY = Payment on Statement (PS) • UROK = Interest Credited (ICR) • SANKC. UROK - Sanction Interest (SI) • SIPO - Household (H) • DUCHOD - Old-age Pension (OP) • UVER - Loan Payment (LP) Flag record - Do not use Translated for ease of use • Preprocessing Activities – Transaction Table Changes Made With Excel
Credit Card Analysis of Czech Bank Data Preprocessing Activities - Continued Column Description Changes Missing or Invalid Values Notes A2 District Name Replaced Spaces with underscore ("_") N/A A3 Region Name Replaced Spaces with underscore ("_") N/A • Distribution of Values: • Prague = 1% • Central_Bohemia = 16% • South_Bohemia = 10% • West_Bohemia = 13% • North_Bohemia = 13% • East_Bohemia = 16% • South_Moravia = 18% • North_Moravia = 14% Issued Date card issued • Removed • Eliminated null time-stamp and changed date format from YYMMDD to MM/DD/YYYY Ignore For our purposes, date card issued is not material • Preprocessing Activities – Demographic Table Changes Made With Excel
Credit Card Analysis of Czech Bank Data Preprocessing Activities - Continued Column Description Changes Missing or Invalid Values Notes Issued Date card issued • Removed • Eliminated null time-stamp and changed date format from YYMMDD to MM/DD/YYYY Ignore For our purposes, date card issued is not material • Preprocessing Activities – Credit Card Table Changes Made With Excel
Credit Card Analysis of Czech Bank Methodology • Methodology Overview • Attribute Ranking • Classification Analysis • Clustering Analysis
Credit Card Analysis of Czech Bank Methodology – Classification Analysis • Classification is a process where a model is built describing a predetermined set of data classes. • The model is constructed by analyzing all the records in the database. • Each tuple is assumed to belong to a predefined class, as determined by one of the attributes called the class label. • The tuples analyzed to build the model form the training data set. • Typically, the learned model is expressed in terms of decision trees or classsification rules. • These rules can be used to predict the test data set.
Credit Card Analysis of Czech Bank Methodology – Classification Analysis • We used See5 as the classification tool for our project. A brief description of the algorithm is as follows: • The tree starts as a single node representing the training samples. • If the samples are all of the same class, then the node becomes a leaf and is labeled with that class. • Otherwise, the algorithm uses an entropy-based measure known as Information Gain as a heuristic for selecting the attribute that will best separate the samples into individual classes. This attribute becomes the "test" or "decision" attribute at the node. • A branch is created for each known value of the test attribute and the samples are partitioned accordingly.
Credit Card Analysis of Czech Bank Methodology – Classification Analysis • See5 Algorithm, cont. • The algorithm uses the same process recursively to form a decision tree for the samples at each partition. • The recursive partition stops when all the samples for a given node belong to the same class or if there are no remaining attributes on which samples may be further partitioned.
Credit Card Analysis of Czech Bank Methodology – Classification Analysis We ran See5 using the RuleSets option and the Boost option with 3 trials. This represents simplified version of the decision tree generated by See5. To build this tree, we used Rules obtained, which had high confidence (>85%). Trans_Avg_Balance <= 37097.18 Class :0 (1828.3/93.4) Client_District_ID = 10 Class: 1 (6.5) Client_District ID = 38 Class: 1 (8.5 /0.9) Trans_Avg_Balance > 46123.76 Class: 1 (8.3) Account_Opened = 1995 Client_District ID = 1 Client_District ID = 15 Client_District ID = 13 Trans_Avg_Balance > 66899.77 Class: 1 (5.3) Class: 1 (5.3) Class: 1 (8.3) Client_Age = Senior Class: 1 (5.3)
Credit Card Analysis of Czech Bank Methodology – Classification Analysis • See5 also has the ability to express the clasifiers as Rule sets, which are easier to understand. 96 rules were generated, out of which we have listed 3 rules. All of these have confidence of more than 90% • Rule 1 • Rule 1/1: (1828.4/93.4, lift 1.3) Trans_Avg_Balance <= 37097.18 --> class 0 [0.948] • If Trans_Avg_Balance <= 37097.18, Then Class Prediction is taken as Non-CardHolder
Credit Card Analysis of Czech Bank Methodology – Classification Analysis • Rule 2 • Rule 1/3: (8.3, lift 3.7) Client_Age = M Client_District_ID = 1 Trans_Avg_Balance > 66899.77 --> class 1 [0.903] • If (Client_Age = Middle-Age) and (Client_District_ID = 1) and (Trans_Avg_Balance = 66899.77), Then Class Prediction is that of a Card Holder.
Credit Card Analysis of Czech Bank Methodology – Classification Analysis • Rule 3 • Rule 1/4: (8.3, lift 3.7) Client_District_ID = 60 Trans_Avg_Balance > 46123.76 Loan_Status = none --> class 1 [0.903] • If (Client_District_ID = 60) and (Trans_Avg_Balance > 46123.76) and (Loan_Status = none), Then Class Prediction is that of a Card Holder.
Credit Card Analysis of Czech Bank Methodology – Classification Analysis Training Set Evaluation • The estimated predictive error = (64 + 390)/3600 = 12.6% • The percentage of instances that were correctly classified as Non-card holders = (2816/2880) = 97% • The percentage of instances that were incorrectly classified as Non-card holders = (64/2880) = 22.2% • The percentage of instances that were in-correctly classified as Card holders = (390/720) = 54% The percentage of instances that were correctly classified as Card holders = (330/720) = 45%
Credit Card Analysis of Czech Bank Methodology – Classification Analysis Test Set Evaluation • The estimated predictive error on the Test set is = ( 59 +119)/900 = 19.7% • The percentage of instances that were correctly classified as Non-card holders = (669/728) =91% • The percentage of instances that were incorrectly classified as Non-card holders = (59/728) = 8.1% • The percentage of instances that were in-correctly classified as Card holders = (119/172) = 69% • The percentage of instances that were correctly classified as Card holders = (53/172)= 73.6%
Credit Card Analysis of Czech Bank Methodology – Cluster Analysis, Conceptual Method • We used cluster analysis to partition the data into a set of classes, grouping together customers or attributes with similar characteristics. • The purpose of cluster analysis is to place observations into groups or clusters suggested by the data such that observations in a given cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar. • The COBWEB Algorithm was chosen for this task. The COBWEB algorithm used was implemented in the Weka Toolkit.
Credit Card Analysis of Czech Bank Cluster 0# Instances: 892 Cluster 2# Instances: 71 Cluster 3# Instances: 185 Cluster 5# Instances: 83 Cluster 6# Instances: 84 Cluster 7# Instances: 469 Cluster 57# Instances: 278 Cluster 59# Instances: 88 Cluster 60# Instances: 103 Cluster 1# Instances: 423 Cluster 4# Instances: 167 Cluster 58# Instances: 191 Methodology – Cluster Analysis , Conceptual Method • The COBWEB algorithm was run twice. Once with a cutoff value of 0.18 and once with a cutoff value of 0.17. • This is done because Weka returns the cluster information only for the leaf clusters and prunes the tree by levels. • The results of the COBWEB Clustering analysis are interpreted in the following chart.
Credit Card Analysis of Czech Bank Methodology – Cluster Analysis , Partitioning Method
Credit Card Analysis of Czech Bank Methodology – Cluster Analysis , Partitioning Method • The distinction of each cluster was not obvious beyond the coupling of age group and card type. • To further investigate the characteristics of customers, we ran another clustering analysis using partitioning methods. We used the SAS Enterprise Miner product. SAS uses a partitioning clustering tool implementing the WARD (Minimum variance) method.
Credit Card Analysis of Czech Bank Methodology – Cluster Analysis , Conceptual Method
Credit Card Analysis of Czech Bank Results • The following rules represent our findings in analyzing whether or not a customer will/will not have a credit card: • If the average account balance is less-than-or-equal 37097.18, • then the client will not be a card holder. • If the client is middle-age and lives in Prague, and his/her average • account balance is greater-than 66899.77, then that client will • be a card holder. • If the client lives in Prostejov, and has an average account balance • in excess of 46123.76, then that client will be a card holder.
Credit Card Analysis of Czech Bank Results • The following are characterizations of cardholders within the current population of bank clients • There is no apparent difference between men and women in the • issuance of credit cards. • Middle aged single females have a higher percentage of classic • and gold cards than the general population. • All credit card customers that have taken a loan have good • status in repayment. • Customers who are in the age bracket less than 24 and whose • average balance is under 50000.00 are likely candidates for a • JUNIOR Card. • Otherwise, it is not easily determined from customer information • which card type has been selected. Perhaps further examination of • customer behavior is warranted.
Credit Card Analysis of Czech Bank Next Steps • Future continuation of this analysis might include • identifying customers with a junior or classic credit card to whom the bank can offer a higher-limit card. • Other Opportunities • Which accounts are likely to default on loan payments? Why? • What are the characteristics of a good bank client ("good" is defined here, in the most general of terms, as who will make payments on their outstanding balances in a timely manner)? • What are the identifying characteristics of a good bank branch (again, we define "good" only in the most general terms of which branches are most successful in the collection of loan payments)?