This term project explores the use of kernel PCA and KNN classification to identify the low-dimensional manifold of climate data sets and make predictions on the original space. The project focuses on monthly sea surface temperature data and discusses the results and conclusions.
Columbia University
Advanced Machine Learning & Perception – Fall 2006
Term Project
Nonlinear Dimensionality Reduction and K-Nearest Neighbor Classification Applied to Global Climate Data
Carlos Henrique Ribeiro Lima
New York – Dec/2006
Outline
• Goals
• Motivation and Dataset
• Methodology
• Results
  • Low-Dimensional Manifold
  • KNN on Low-Dimensional Manifold
• Conclusion
1. Goals
• Use kernel PCA based on Semidefinite Embedding to identify the low-dimensional, nonlinear manifold of climate data sets → identification of the main modes of spatial variability;
• Classification on the feature space → predictions on the original space (KNN method).
2. Motivation
• Dataset of Monthly Sea Surface Temperature (SST)
• Huge economic and social impacts of extreme El Niño events (e.g. 1997)
• Need for forecasting models!
2. Dataset
• Monthly Sea Surface Temperature (SST) data from Jan/1856 to Dec/2005
• Latitudinal band: 25°S-25°N
• Grid with 599 cells
• Training set: Jan/1856 to Dec/1975 = 120 years
• Testing set: Jan/1976 to Dec/2005 = 30 years
• Input matrix (training set): n = 1440 points (120 years × 12 months), m = 599 dimensions
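The row counts on this slide follow directly from the dates. A minimal sketch, assuming the data matrix has one row per month starting at Jan/1856 (the helper names `month_index` and `split_rows` are hypothetical and not part of the project code):

```python
# Hypothetical helpers; only the dates and the grid size come from the slide.
START_YEAR = 1856
N_CELLS = 599  # grid cells in the 25°S-25°N band (columns of the matrix)

def month_index(year, month):
    """Row index of (year, month) in a matrix whose first row is Jan/1856."""
    return (year - START_YEAR) * 12 + (month - 1)

def split_rows(first, last):
    """Row range from first=(year, month) through last=(year, month), inclusive."""
    return range(month_index(*first), month_index(*last) + 1)

train_rows = split_rows((1856, 1), (1975, 12))  # 120 years of monthly fields
test_rows = split_rows((1976, 1), (2005, 12))   # 30 years of monthly fields

print(len(train_rows), len(test_rows))
```

With these ranges the training block has 1440 rows, matching n on the slide, and the testing block has 360 rows.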
3. Methodology
1) Semidefinite Embedding (code from K. Q. Weinberger)
• Positive semidefiniteness of the kernel matrix
• Inner products centered on the origin
• Isometry: local distances of the input space are preserved on the feature space
2) KNN
• Euclidean distance
3) Probabilistic forecasting
• Skill score: Ranked Probability Score (RPS)
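The three Semidefinite Embedding properties listed above correspond to the constraints of a semidefinite program over the kernel (Gram) matrix K. The slide lists only the constraints; the trace objective below is taken from the published formulation by Weinberger and Saul, and is an assumption about the exact setup used in this project:

```latex
\begin{aligned}
\max_{K}\quad & \operatorname{tr}(K) \\
\text{s.t.}\quad
  & K \succeq 0
    && \text{(positive semidefiniteness)} \\
  & \textstyle\sum_{i,j} K_{ij} = 0
    && \text{(inner products centered on the origin)} \\
  & K_{ii} - 2K_{ij} + K_{jj} = \lVert x_i - x_j \rVert^2
    \quad \text{for neighboring pairs } (i,j)
    && \text{(local isometry)}
\end{aligned}
```

Here x_i are the input points (monthly SST fields). The low-dimensional coordinates are then read off by kernel PCA on the learned K: with eigenvalues λ_1 ≥ λ_2 ≥ … and eigenvectors v_α, the α-th coordinate of point i is √λ_α · v_{αi}.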
4. Results Low-Dimensional Manifold
4. Results Labeling on the feature space
4. Results Forecasts – Testing Set
KNN method and skill score
E.g. March/1997:
1) Want to predict the class of nino3 in Dec/1997 → lead time = 9 months;
2) KNN on the feature space (March of each year, 1856 to 1975);
3) Take the classes and weights of the k neighbors;
4) Compute the skill score.
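Steps 2 to 4 above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the slide does not say how neighbors are weighted or how many nino3 classes are used, so the inverse-distance weights, the three ordered classes, and the toy feature-space points are all hypothetical. The score is the unnormalized Ranked Probability Score (some definitions divide by the number of classes minus one).

```python
import math

def knn_class_probs(query, train_points, train_labels, k, n_classes):
    """Weighted class probabilities from the k nearest training points (Euclidean distance)."""
    nearest = sorted(
        (math.dist(query, p), label) for p, label in zip(train_points, train_labels)
    )[:k]
    # Assumption: inverse-distance weights; the slide does not specify the weighting.
    weights = [1.0 / (d + 1e-12) for d, _ in nearest]
    total = sum(weights)
    probs = [0.0] * n_classes
    for w, (_, label) in zip(weights, nearest):
        probs[label] += w / total
    return probs

def ranked_probability_score(probs, observed_class):
    """Sum of squared differences between forecast and observed cumulative distributions."""
    rps, cum_forecast = 0.0, 0.0
    for m, p in enumerate(probs):
        cum_forecast += p
        cum_observed = 1.0 if m >= observed_class else 0.0
        rps += (cum_forecast - cum_observed) ** 2
    return rps

# Toy feature-space points (3 dimensions) with ordered class labels 0 < 1 < 2.
train_points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 1.0, 1.0),
                (1.1, 1.0, 1.0), (2.0, 2.0, 2.0)]
train_labels = [0, 0, 1, 1, 2]

probs = knn_class_probs((1.05, 1.0, 1.0), train_points, train_labels, k=2, n_classes=3)
score = ranked_probability_score(probs, observed_class=1)
print(probs, score)
```

A lower RPS is better: a forecast that puts all probability on the observed class scores 0, and probability placed on classes far from the observed one is penalized more than probability placed on adjacent classes.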
4. Results Forecasts – Testing Set
KNN method and skill score – El Niño events of 1982 and 1997
5. Conclusions
• Semidefinite Embedding performs well on the SST data (high-dimensional data → just 3 dimensions, ~90% of explained variance);
• The KNN method provides very good classification and forecasts;
• Need to check sensitivity to changes in some parameters (number of local neighbors, number of neighbors in KNN);
• Plan to extend to other climate datasets;
• Try other metrics, multivariate data, etc.
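The explained-variance figure in the first conclusion is the kind of quantity that can be read from the eigenvalue spectrum of the kernel matrix. A minimal sketch, assuming NumPy is available and using a plain linear kernel on synthetic data as a stand-in (this is not a kernel learned by Semidefinite Embedding, and the function name `explained_variance_ratio` is hypothetical):

```python
import numpy as np

def explained_variance_ratio(K, d):
    """Fraction of total variance captured by the top-d eigenvalues of a Gram matrix K."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    Kc = H @ K @ H                            # center the kernel in feature space
    eigvals = np.linalg.eigvalsh(Kc)[::-1]    # eigenvalues in descending order
    eigvals = np.clip(eigvals, 0.0, None)     # remove tiny negative round-off values
    return eigvals[:d].sum() / eigvals.sum()

# Synthetic stand-in: 50 points on a 2-D plane embedded in 10 dimensions.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 2))
A = rng.normal(size=(2, 10))
X = Z @ A
K = X @ X.T                                   # linear kernel, NOT an SDE-learned kernel

ratio = explained_variance_ratio(K, 2)
print(ratio)
```

Because the synthetic points lie on a 2-D plane, two eigenvalues capture essentially all of the variance; for real data the same ratio indicates how many dimensions the embedding needs.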