Linear Clustering Algorithm
By Horne Ken, Khan Farhana & Padubidri Shweta
Overview
• Introduction
• Data Preprocessing
• Data Mining
• Data Visualization
• Experiment
• Conclusion
Responsibility
• Data Preprocessing: Farhana & Ken
• Data Mining: Ken
• Data Visualization: Shweta
Overview
• A Linear Clustering Algorithm
• Applications
  • Feature selection: choose features based on information gain
  • Discretization: partition based on data set characteristics
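The feature-selection application above depends on information gain. The following is a minimal Python sketch of the standard entropy-based definition, not the presenters' Java implementation. The function names (`entropy`, `information_gain`, `rank_features`) and the toy data in the usage note are illustrative assumptions, and features are assumed to be already discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    the rows by each distinct (already discretized) feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(columns, labels):
    """Return feature names sorted by information gain, highest first.

    columns: dict mapping a (hypothetical) feature name to its list of values.
    """
    return sorted(columns,
                  key=lambda name: information_gain(columns[name], labels),
                  reverse=True)
```

For example, with labels `[0, 0, 1, 1]`, a feature with values `["a", "a", "b", "b"]` separates the classes perfectly and has a gain of 1 bit, while `["a", "b", "a", "b"]` has a gain of 0, so `rank_features` would place the first feature ahead of the second.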
Data Preprocessing
• DataFerrett (Federated Electronic Research, Review, Extraction & Tabulation Tool)
• Install the software
• Web version
• http://www.thedataweb.org/what_ferrett.html
Data Pre-processing: Steps
• Extracted data from the CPS (Current Population Survey)
• Pre-processing
• Number of features: 43
• Years: 2007-2008
• 115,000 rows/month over 50 states
• Number of features after preprocessing: 23
• Normalization
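The slides do not say which normalization was applied. As one common possibility, min-max scaling of a numeric column to the range [0, 1] can be sketched in Python as below. The choice of min-max scaling and the function name `min_max_normalize` are assumptions made for illustration only.

```python
def min_max_normalize(column):
    """Rescale a list of numbers linearly so the minimum maps to 0.0
    and the maximum maps to 1.0.

    ASSUMPTION: min-max scaling is only one candidate; the slides do not
    state which normalization the presenters used.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```

Scaling every feature to a common range keeps a distance-based step from being dominated by whichever attribute happens to have the largest raw values.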
Data Mining
• Algorithm
  • Choose an ordinal attribute (X)
  • Order the data points based on that attribute
  • List potential partition points (between successive values of X)
  • For each potential partition point P
    • Calculate the distance between the data points where X < P and those where X > P
• Results
  • Can partition data points
  • Order data points by information gain
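The steps above can be sketched in Python roughly as follows. This is not the presenters' Java code. The slides do not define the "distance" between the two sides of a partition point, so this sketch assumes the Euclidean distance between the two group centroids over the remaining attributes. It also assumes the ordinal attribute is numerically coded, so that a midpoint between successive values is meaningful. The names `centroid` and `score_partition_points` are hypothetical.

```python
import math


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length numeric tuples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def score_partition_points(points, attr_index):
    """Return (partition_point, distance) pairs, largest distance first.

    points:     list of numeric tuples, one per data point
    attr_index: position of the ordinal attribute X in each tuple

    ASSUMPTION: "distance" is the Euclidean distance between the centroids
    of the two sides, computed over all attributes except X.
    """
    def others(p):
        # Drop X so the distance reflects only the remaining attributes.
        return tuple(v for i, v in enumerate(p) if i != attr_index)

    # Order the data points by X and collect its distinct values.
    ordered = sorted(points, key=lambda p: p[attr_index])
    values = sorted({p[attr_index] for p in ordered})

    results = []
    # Candidate partition points lie between successive values of X.
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2
        left = [others(p) for p in ordered if p[attr_index] < cut]
        right = [others(p) for p in ordered if p[attr_index] > cut]
        results.append((cut, math.dist(centroid(left), centroid(right))))

    results.sort(key=lambda r: r[1], reverse=True)
    return results
```

On the toy input `[(1, 0.0), (2, 0.1), (3, 5.0), (4, 5.2)]` with X at index 0, the highest-scoring partition point is 2.5, which separates the two low second-attribute values from the two high ones. `math.dist` requires Python 3.8 or later.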
Data Mining • Test dataset
Data Mining • Test dataset 2
Experimental Setup
• Environment
  • DataFerrett: data pre-processing
  • Java platform: implement the data mining algorithm
  • Data visualization
    • Google App Engine
    • Datastore API
    • Python, JavaScript and Django framework
    • Google Chart API
• Hardware
  • Windows XP laptop, Core2 2.16 GHz
  • 2.00 GB RAM (that hurt)
Visualization Demo
• Link to the website: http://householdstructure-project.appspot.com/
Conclusions
• Preliminary results are encouraging
• Discretization was successful
• Lessons learnt and future work
  • Comparison with other methods on well-known datasets
  • Evaluate performance in feature selection
  • OPTIMIZE
  • Don't pick a novel dataset & novel algorithm at the same time
Thank you • Questions