
Clustering Algorithms


sukey


Presentation Transcript


  1. Clustering Algorithms • Minimize distance, but to centers of groups

  2. Clustering • First need to identify clusters • Can be done automatically • Often clusters determined by problem • Then simple matter to measure distance from new observation to each cluster • Use same measures as with memory-based reasoning

  3. Partitioning • Define new categorical variables • Divide data into fixed number (k) of regions • K-means clustering
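A minimal sketch of the k-means partitioning named on this slide, written in plain Python. It assumes points are tuples of already standardized numbers and that the caller supplies k starting centers; the function names, toy data, and starting centers are all hypothetical and not taken from the presentation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centers, max_iter=100):
    """Alternate assignment and update steps until assignments stop changing."""
    centers = [tuple(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Toy data (hypothetical): two obvious groups in two dimensions, so k = 2.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
centers, labels = k_means(points, centers=[(0.0, 0.0), (1.0, 1.0)])
```

The resulting `labels` list plays the role of the "new categorical variable" from the slide: one region index per observation.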

  4. Clustering Uses • Segment customers • Find profitability of each, treat accordingly • Star classification: • Red giants, white dwarfs, normal • Brightness & temperature used to classify • U.S. Army • Identify sizes needed for female soldiers • (males – one size fits all)

  5. Tires • Segment customers into product categories • High end (they would buy Michelins) • Intermediate & Low • Standardize data (as in memory-based reasoning)

  6. Raw Tire Data

  7. Standardize • INCOME: MIN(1, INCOME/200000) • AGE OF CAR: IF {AGE OF CAR} < 12 months, 1; ELSE MIN((8 − Years)/7, 1)
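The two standardization rules on this slide can be sketched as small Python functions. This follows the formulas as transcribed; the function names are hypothetical, and the car's age is assumed to be given in years.

```python
def standardize_income(income):
    """Scale income to at most 1, with 200,000 or more mapping to 1."""
    return min(1.0, income / 200_000)


def standardize_car_age(years):
    """Score a car's age: 1 if under 12 months, else min((8 - years) / 7, 1)."""
    if years < 1:  # less than 12 months old
        return 1.0
    return min((8 - years) / 7, 1.0)


print(standardize_income(100_000))  # 0.5
print(standardize_car_age(8))       # 0.0
```

As transcribed, the car-age rule has no lower bound, so a car older than eight years scores below zero in this sketch; the slide does not say whether the original clamps such values at 0.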

  8. Sort Data by Outcome

  9. Standardized Training Data

  10. Identify Cluster Means (could use median, mode)
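A minimal sketch of this step, assuming each training row is a tuple of standardized values and carries a known outcome label. The table from the slide is not in the transcript, so the rows below are made-up placeholders; `cluster_centers` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean, median


def cluster_centers(rows, labels, summary=mean):
    """Group rows by label and summarize each variable within each group."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: tuple(summary(column) for column in zip(*members))
        for label, members in groups.items()
    }


# Hypothetical standardized rows: (income score, car-age score).
rows = [(0.9, 1.0), (0.7, 0.8), (0.3, 0.4), (0.5, 0.2)]
labels = ["Michelin", "Michelin", "Goodyear", "Goodyear"]

mean_centers = cluster_centers(rows, labels)                    # cluster means
median_centers = cluster_centers(rows, labels, summary=median)  # medians instead
```

Passing `summary=median` swaps in the median, which is the alternative the slide mentions.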

  11. New Case #1 • From new data (could be a test set or new observations to classify) • Squared distance to each centroid: Michelin 0.840, Goodyear 0.025, Opie’s tires 0.047 • So minimum distance is to Goodyear

  12. New Case #2 • Squared distance to each centroid: Michelin 0.634, Goodyear 0.255, Opie’s tires 0.057 • So minimum distance is to Opie’s
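The classification rule in the two new cases is simply "pick the centroid with the smallest squared distance". A minimal Python sketch follows, using the distances quoted on the slides; the helper names are hypothetical, and `distances_to_centroids` only shows how such distances could be computed when the centroid coordinates are available, which they are not in this transcript.

```python
def distances_to_centroids(observation, centroids):
    """Squared Euclidean distance from one observation to each named centroid."""
    return {
        name: sum((x - c) ** 2 for x, c in zip(observation, center))
        for name, center in centroids.items()
    }


def nearest_cluster(distances):
    """Return the cluster name with the smallest distance."""
    return min(distances, key=distances.get)


# Squared distances as reported on slides 11 and 12.
case_1 = {"Michelin": 0.840, "Goodyear": 0.025, "Opie's": 0.047}
case_2 = {"Michelin": 0.634, "Goodyear": 0.255, "Opie's": 0.057}

print(nearest_cluster(case_1))  # Goodyear
print(nearest_cluster(case_2))  # Opie's
```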

  13. Software Methods • Hierarchical clustering • Number of clusters unspecified a priori • Two-step is a form of hierarchical clustering • K-means clustering • Self-organizing maps • Neural network • Hybrids combine methods

  14. Application: Credit Cards • Credit scoring critical • Use past applicants; develop model to predict payback • Look for indicators providing early warning of trouble

  15. British Credit Card Company • Monthly account status – over 90 thousand customers, one year operations • Outcome variable STATE: cumulative months of missed payments (integer) • Some errors & missing data (eliminated observations) • Biased sample of 10 thousand observations • Required initial STATE of 0

  16. British Credit Card Company • Compared clustering approaches with pattern detection method • Used medians rather than centroids • More stable • Partitioned data • Clustering useful for general profile behavior • Pattern search method sought local clusters • Unable to partition entire data set • Identified a few groups with unusual behavior
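The point that medians are "more stable" than centroids can be illustrated with a single outlier. This is a sketch on made-up numbers, not data from the credit card study.

```python
from statistics import mean, median

# Hypothetical values for one variable within one cluster.
typical = [100, 110, 120, 130]
with_outlier = typical + [5000]

mean_shift = mean(with_outlier) - mean(typical)        # 1092 - 115
median_shift = median(with_outlier) - median(typical)  # 120 - 115

print(mean_shift, median_shift)
```

One extreme observation moves the mean by several hundred units while the median barely changes, which is why a median-based cluster center is less affected by a few unusual accounts.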

  17. Insurance Claim Application • Large data warehouse of financial transactions & claims • Customer retention very important • Recent heavy growth in policies • Decreased profitability • Used clustering to analyze claim patterns • Wanted hidden trends & patterns

  18. Insurance Claim Mining • Undirected knowledge discovery • Cluster analysis to identify risk categories • Data for 1996-1998 • Quarterly data • Claims for prior 12 months • Contribution to profit of each policy • Over 100,000 samples • Heavy growth in young people with expensive automobiles • Transformed data to normalize, remove outliers

  19. Insurance Claim Mining • Number of clusters • Too few gives no discrimination; best here was 50 • Used k-means algorithm to minimize squared error • Identified a few clusters with high claims frequency and unprofitability • Compared 1998 data with 1996 data to find trends • Developed model to predict new policy holder performance • Used for pricing
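The squared-error criterion that k-means minimizes, and the way it responds to the number of clusters, can be sketched as follows. The data and the helper name `within_cluster_sse` are hypothetical; the study's own data and its choice of 50 clusters are not reproduced here.

```python
def within_cluster_sse(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(point, center)) for point in members
        )
    return total


# Toy one-dimensional data (hypothetical) with two visible groups.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]
one_cluster = within_cluster_sse(points, [0, 0, 0, 0])   # everything lumped together
two_clusters = within_cluster_sse(points, [0, 0, 1, 1])  # groups separated

print(one_cluster, two_clusters)  # 104.0 4.0
```

With too few clusters the error stays large because distinct groups are averaged together. The error generally keeps falling as more clusters are added, so the count is usually chosen where the improvement levels off rather than where the error is smallest.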

  20. Computational Constraints • Each cluster should have adequate sample size • Since cluster averages are used, cluster analysis is less sensitive to disproportionate cluster sizes than matching • The more variables you have, the greater the computational complexity • The curse of dimensionality • (it won’t run in a reasonable time if you have too many variables)
