Linear Clustering Algorithm

By Horne Ken, Khan Farhana & Padubidri Shweta

Overview
• Introduction
• Data Preprocessing
• Data Mining
• Data Visualization
• Experiment
• Conclusion

Responsibility
• Data Preprocessing: Farhana & Ken
• Data Mining: Ken
• Data Visualization: Shweta

Overview
• A Linear Clustering Algorithm
• Applications
  • Feature selection
    • Choose features based on information gain
  • Discretization
    • Partition based on data set characteristics
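Choosing features by information gain can be sketched as follows. The slides do not say how information gain is computed, so this assumes the standard entropy-based definition over a categorical class label. The names `entropy` and `information_gain` are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their expected entropy after splitting on the feature.

    Assumption: the feature is already discrete (one group per distinct value).
    """
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

Under this definition, a feature that separates the classes perfectly gets the highest gain and a feature unrelated to the class gets a gain near zero, so features can be ranked by the returned value.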

Data Preprocessing
• DataFerrett (Federated Electronic Research, Review, Extraction & Tabulation Tool)
  • Install the software
  • Web version
  • http://www.thedataweb.org/what_ferrett.html

Data Preprocessing: Steps
• Extracted data from the CPS (Current Population Survey)
• Preprocessing
  • Number of features: 43
  • Years: 2007-2008
  • 115,000 rows/month over 50 states
  • Number of features after preprocessing: 23
• Normalization
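The slides do not say which normalization was applied. As one common possibility, a minimal sketch assuming min-max scaling of a single numeric column to the range [0, 1]. The function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(column):
    """Scale a list of numbers to [0, 1]; assumed method, not confirmed by the slides."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```

Putting every feature on the same scale keeps a distance-based partition score from being dominated by whichever attribute has the largest raw range.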

Data Mining
• Algorithm
  • Choose an ordinal attribute (X)
  • Order the data points by that attribute
  • List potential partition points (between successive values of X)
  • For each potential partition point P
    • Calculate the distance between the data points where X < P and those where X > P
• Results
  • Can partition data points
  • Order data points by information gain
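The algorithm steps above can be sketched in Python. The slides do not define the distance between the two sides of a partition point, so this sketch assumes the Euclidean distance between the centroids of the remaining attributes. The names `others`, `centroid`, and `score_partition_points` are illustrative and are not taken from the authors' Java implementation.

```python
from math import dist  # available in Python 3.8+


def others(point, x_index):
    """Return the point's attributes with the ordinal attribute X removed."""
    return point[:x_index] + point[x_index + 1:]


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def score_partition_points(points, x_index):
    """Return (partition_point, distance) pairs, largest distance first.

    points  : list of numeric tuples
    x_index : position of the ordinal attribute X in each tuple
    """
    # Order the data points by X and collect its distinct values.
    ordered = sorted(points, key=lambda p: p[x_index])
    values = sorted({p[x_index] for p in ordered})
    scored = []
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2  # candidate P between successive values of X
        left = [others(p, x_index) for p in ordered if p[x_index] < cut]
        right = [others(p, x_index) for p in ordered if p[x_index] > cut]
        # Assumed distance measure: Euclidean distance between group centroids.
        scored.append((cut, dist(centroid(left), centroid(right))))
    return sorted(scored, key=lambda s: s[1], reverse=True)
```

For example, with `points = [(1, 0.0), (2, 0.1), (3, 5.0), (4, 5.1)]` and `x_index = 0`, the top-ranked partition point is 2.5, which separates the two low values of the second attribute from the two high ones.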

Data Mining
• Test dataset

Data Mining
• Test dataset 2

Experimental Setup
• Environment
  • DataFerrett: data preprocessing
  • Java Platform: implementation of the data mining algorithm
  • Data Visualization
    • Google App Engine
    • Datastore API
    • Python, JavaScript and Django Framework
    • Google Chart API
• Hardware
  • Windows XP laptop, Core2 2.16 GHz
  • 2.00 GB RAM (that hurt)

Visualization Demo
• Link to the website: http://householdstructure-project.appspot.com/

Conclusions
• Preliminary results are encouraging
  • Discretization was successful
• Lessons learnt and future work
  • Comparison with other methods on well-known datasets
  • Evaluate performance in feature selection
  • OPTIMIZE
  • Don't pick a novel dataset & novel algorithm at the same time

Thank you
• Questions?
