COMBO-17 Galaxy Dataset
Colin Holden
COSC 4335
April 17, 2012
Contains data on 3,462 objects classified as galaxies in the Chandra Deep Field South, a patch of sky in the Fornax constellation.
• There are 65 columns of data in this dataset, ranging from luminosities in 10 different bands of the spectrum to size and brightness. However, the dataset's website notes that the vast majority of these attributes are redundant and not independent.
• This analysis focuses on three main attributes of the dataset.
• Total R (red band) magnitude is a measure of the brightness of the galaxy. Magnitudes are on an inverted logarithmic scale, so a galaxy with R=21 is 100 times brighter than one with R=26.
• ApDRmag is the difference between the total and aperture magnitudes in the R band. This is a rough measure of the size of the galaxy.
• rsMAG, which is the magnitude of the vector coming from the galaxy. Roughly a vector measurement of distance.
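The R=21 versus R=26 comparison follows from the standard magnitude convention that a difference of 5 magnitudes is a factor of 100 in brightness. A minimal sketch of that arithmetic is below; the function name `brightness_ratio` is made up for illustration and is not part of the dataset or the original analysis.

```python
def brightness_ratio(mag_faint: float, mag_bright: float) -> float:
    """Return how many times brighter the object with magnitude `mag_bright`
    is than the object with magnitude `mag_faint`.

    Hypothetical helper for illustration only.
    """
    # Magnitudes are inverted and logarithmic: a difference of 5 magnitudes
    # is a factor of 100 in brightness, so 1 magnitude is 100 ** (1 / 5).
    return 100.0 ** ((mag_faint - mag_bright) / 5.0)


ratio = brightness_ratio(26.0, 21.0)
print(ratio)  # 100.0
```

A one-magnitude difference works out to a factor of roughly 2.5, which is why small changes in R correspond to noticeable changes in brightness.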
At first glance, the data appeared to have some sort of linear relationship, so the analysis started with the Pearson correlation coefficient to test for one.
• The calculated Pearson correlation coefficient was about 0.6789.
• The Pearson correlation coefficient is usually interpreted under the assumption that the data are normally distributed, which may not be the case here. This was just a first step, and the data do seem to have a roughly linear relationship.
• The brightness of the galaxy seems to decrease as the size grows.
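For reference, the Pearson correlation coefficient can be computed directly from its definition (covariance divided by the product of the standard deviations). The sketch below is not the tool used in the original analysis, and the sample values are invented toy numbers, not COMBO-17 data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only (not taken from the dataset).
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(toy_x, toy_y)
print(round(r, 4))  # 0.7746
```

Values near +1 or -1 indicate a strong linear relationship; a value such as 0.6789 suggests a moderate positive one.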
K-Means Clustering
• Attempt to break the data set into smaller data sets.
• The number of clusters was chosen to be 5.
• The number of iterations had to be capped, as a stopping condition for refining each cluster's centroid.
• The initial centroids were chosen to be the first 5 records.
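The setup described above (first K records as initial centroids, a cap on iterations) can be sketched as a plain implementation of Lloyd's algorithm. This is an illustrative sketch, not the software used in the original analysis; the function name, the `max_iter` default, and the toy points are assumptions, and the toy example uses K=2 rather than 5 to stay small.

```python
import math


def kmeans_first_k(points, k, max_iter=100):
    """K-means where the initial centroids are the first k records.

    Returns (centroids, labels). Stops when centroids no longer change
    or after max_iter iterations, whichever comes first.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged before hitting the iteration cap
        centroids = new_centroids
    return centroids, labels


# Toy 2-D points for illustration only (not COMBO-17 data).
toy_points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
              (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
toy_centroids, toy_labels = kmeans_first_k(toy_points, k=2)
print(toy_labels)  # [0, 1, 0, 0, 1, 1]
```

Because the points are plain tuples, the same function works unchanged for two or three variables per record. Seeding from the first K records makes the result depend on record order, which is one reason different runs or orderings can give different clusters.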
Hierarchical Clustering
• Stopped at 5 clusters to allow comparison with the K-Means results.
• Proximity was measured using Euclidean distance.
• Used Ward’s method to determine cluster similarity when merging clusters.
• Computationally expensive.
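To make the merging rule concrete, here is a naive sketch of agglomerative clustering with Ward's criterion: at each step it merges the pair of clusters whose union gives the smallest increase in within-cluster sum of squares, which for Euclidean distance equals |A||B|/(|A|+|B|) times the squared distance between the two centroids. The function name and toy points are assumptions for illustration, and this is not the implementation used in the original analysis. The nested search over all pairs at every merge also shows why the method is expensive on thousands of records.

```python
def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering using Ward's merge cost.

    Returns a list of clusters, each a list of indices into `points`.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    clusters = [[i] for i in range(len(points))]  # start with singletons
    centroids = [tuple(p) for p in points]
    while len(clusters) > n_clusters:
        best = None  # (cost, a, b) of the cheapest merge found so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                sq_dist = sum((x - y) ** 2
                              for x, y in zip(centroids[a], centroids[b]))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * sq_dist
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        na, nb = len(clusters[a]), len(clusters[b])
        # The merged centroid is the size-weighted mean of the two centroids.
        centroids[a] = tuple((na * x + nb * y) / (na + nb)
                             for x, y in zip(centroids[a], centroids[b]))
        clusters[a] = clusters[a] + clusters[b]
        # b > a, so deleting index b does not shift index a.
        del clusters[b]
        del centroids[b]
    return clusters


# Toy 2-D points for illustration only (not COMBO-17 data).
ward_points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
               (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
ward_result = ward_clusters(ward_points, n_clusters=2)
print(sorted(sorted(c) for c in ward_result))  # [[0, 2, 3], [1, 4, 5]]
```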
K-Means with 3 Variables
• Wanted to see what kind of results clustering on 3 variables would yield.
• Same parameters as the previous K-Means run.
• Chose brightness, size, and distance from Earth as the 3 variables.
• Difficult to present graphically.
Conclusions
• Got to see how outliers affect the clustering algorithms differently for AHC (agglomerative hierarchical clustering) vs K-Means. K-Means was more sensitive to outliers.
• Also got to see how versatile cluster analysis can be, with many different options, e.g. the value of K, the number of attributes to compare, etc.
• The many options can also be a downside of clustering, in that one small change can yield very different results.
Afterthoughts
• I would have done another K-Means clustering analysis after removing the outliers from my original data, to see how the clusters and their centroids differ.
• I would have experimented with different values of K and looked at the results.