Framework for creating a large-scale content-based image retrieval (CBIR) system for solar data analysis Juan M. Banda
Agenda • Project Objectives • Datasets • Framework Description • Feature Extraction • Attribute Evaluation • Dimensionality Reduction • Dissimilarity Measures Component • Indexing Component
Project Objectives • Creation of a CBIR system building framework • Creation of a composite multi-dimensional data indexing technique • Creation of a CBIR system for the Solar Dynamics Observatory
Contributions • Framework is the first of its kind • Custom solution for high-dimensional data indexing and retrieval • First domain-specific CBIR system for solar data • Motivation • Lack of simple CBIR system creation tools • High-dimensional data indexing and retrieval has been shown to be very domain-specific • SDO (with AIA) produces around 69,120 images per day, around 700 Gigabytes of image data per day
TRACE Dataset • Created using the Heliophysics Events Knowledgebase (HEK) portal • Contains 8 classes: Active Region, Coronal Jet, Emerging Flux, Filament, Filament Activation, Filament Eruption, Flare, and Oscillation • 200 images per class, available on the web: http://www.cs.montana.edu/angryk/SDO/data/TRACEbenchmark/
Sample Images from subset of classes Active Region Oscillation Flare Filament Filament Eruption Filament Activation
INDECS Database • Images of indoor environments under changing conditions • Contains 8 classes: Corridor Cloudy and Night; Kitchen Cloudy, Night, and Sunny; Two-persons Office Cloudy, Night, and Sunny • 200 images per class, available on the web: http://cogvis.nada.kth.se/INDECS/
Sample Images from subset of classes Corridor - Cloudy Corridor - Night Kitchen - Cloudy Kitchen - Night Kitchen - Sunny Two-persons Office - Cloudy
ImageCLEFmed Dataset • The 2005 dataset contains 9,000 radiograph images divided into 57 classes • The 2006-2007 datasets increased to 116 classes and grew by 1,000 images each year • The 2010 dataset contains over 77,000 images (perfect for scalability evaluation)
Sample Images from subset of classes Head Profile Lungs Hand Vertebrae
Labeling • TRACE Dataset • One label per image (as a whole) • One label per cell (several per image) • INDECS Database • One label per image (as a whole) • ImageCLEFmed • One label per image (as a whole)
Classifiers for Comparative Evaluation Purposes • Naïve Bayes • C4.5 • Support Vector Machines (SVM) • AdaBoost with C4.5 • Future work: better parameter tuning
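The four classifiers above can be approximated with scikit-learn, as a rough sketch on synthetic stand-in data (not the actual image parameter vectors); note that scikit-learn has no true C4.5, so an entropy-based decision tree stands in for it:

```python
# Hypothetical sketch of the four evaluation classifiers, using
# scikit-learn approximations on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for labeled image parameter data
X, y = make_classification(n_samples=400, n_features=20, n_classes=4,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "C4.5 (approx.)": DecisionTreeClassifier(criterion="entropy"),
    "SVM": SVC(kernel="rbf"),
    # AdaBoost over shallow entropy-based trees, approximating AdaBoost+C4.5
    "AdaBoost C4.5 (approx.)": AdaBoostClassifier(
        DecisionTreeClassifier(criterion="entropy", max_depth=3)),
}
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in classifiers.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```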
Refereed publications from this work • 2010: J. M. Banda and R. Angryk, "Selection of Image Parameters as the First Step Towards Creating a CBIR System for the Solar Dynamics Observatory," to appear, International Conference on Digital Image Computing: Techniques and Applications (DICTA), Sydney, Australia, December 1-3, 2010. • J. M. Banda and R. Angryk, "Usage of dissimilarity measures and multidimensional scaling for large scale solar data analysis," to appear, NASA Conference on Intelligent Data Understanding (CIDU 2010), Computer History Museum, Mountain View, CA, October 5-6, 2010 (invited for submission to the Best of CIDU 2010 issue of Statistical Analysis and Data Mining, the official journal of the ASA). • J. M. Banda and R. Angryk, "An Experimental Evaluation of Popular Image Parameters for Monochromatic Solar Image Categorization," Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference (FLAIRS-23), Daytona Beach, Florida, USA, May 19-21, 2010, pp. 380-385. • 2009: J. M. Banda and R. Angryk, "On the effectiveness of fuzzy clustering as a data discretization technique for large-scale classification of solar images," Proceedings of the 18th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '09), Jeju Island, Korea, August 2009, pp. 2019-2024.
Image Segmentation / Feature Extraction 8 by 8 grid segmentation (128 x 128 pixels per cell)
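The grid segmentation above can be sketched in a few lines of NumPy: a 1024×1024 image (the synthetic array below is a placeholder for a real TRACE image) is split into an 8×8 grid of 128×128-pixel cells, each of which is then fed to feature extraction independently:

```python
# Sketch of the 8-by-8 grid segmentation: each cell of an evenly
# divisible image is yielded for independent feature extraction.
import numpy as np

def grid_cells(image, rows=8, cols=8):
    """Yield (row, col, cell) tuples for an evenly divisible 2-D image."""
    h, w = image.shape
    ch, cw = h // rows, w // cols
    for r in range(rows):
        for c in range(cols):
            yield r, c, image[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]

image = np.random.rand(1024, 1024)      # placeholder for a 1024x1024 solar image
cells = list(grid_cells(image))
print(len(cells), cells[0][2].shape)    # 64 cells, each 128x128 pixels
```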
Comparative Evaluation Average classification accuracy with cell labeling Some of these results are part of the paper accepted for publication in the FLAIRS-23 conference (2010)
Motivation for this stage • By selecting the most relevant image parameters we will be able to save processing and storage costs for each parameter that we remove • The SDO image parameter vector will grow by 6 Gigabytes per day
Unsupervised Attribute Evaluation • Average correlation map for the Active Region class with one image as a query against: • (a) the same class (intra-class correlation): 1 image vs. 199 images • (b) other classes (inter-class correlation): 1 image vs. 1,400 images
Better Visualization? • MDS map for the Active Region class with one image as a query against: • (a) the same class (intra-class correlation): 1 image vs. 199 images • (b) other classes (inter-class correlation): 1 image vs. 1,400 images • Multidimensional Scaling (MDS) allows us to better visualize these correlations
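As a rough illustration of the MDS step (not the authors' exact pipeline), a correlation matrix over hypothetical parameter vectors can be turned into a dissimilarity matrix and embedded in 2-D with scikit-learn's metric MDS:

```python
# Illustrative sketch: correlation -> dissimilarity -> 2-D MDS map,
# on synthetic stand-in vectors rather than real image parameters.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
features = rng.random((50, 10))     # 50 hypothetical parameter vectors
corr = np.corrcoef(features)        # sample correlation matrix (50 x 50)
dissim = 1.0 - corr                 # symmetric dissimilarity, zero diagonal

embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(dissim)
print(embedding.shape)              # (50, 2): one 2-D point per vector
```

Each row of `embedding` is a plottable 2-D point, so intra-class vs. inter-class structure becomes visible as clustering in the scatter plot.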
Supervised Attribute Evaluation • Chi Squared • Gain Ratio • Info Gain User Extendable (WEKA has more than 15 other methods that the user can select)
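The supervised rankings above can be sketched with scikit-learn stand-ins for the WEKA evaluators (chi-squared directly; mutual information approximating info gain; gain ratio has no direct scikit-learn equivalent), on synthetic non-negative data:

```python
# Hedged sketch: chi-squared and information-gain-style attribute
# ranking via scikit-learn, on synthetic stand-in data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import chi2, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)
X = MinMaxScaler().fit_transform(X)     # chi2 requires non-negative values

chi2_scores, _ = chi2(X, y)                            # chi-squared per attribute
info_gain = mutual_info_classif(X, y, random_state=0)  # approximates info gain

# Rank attributes from most to least relevant under each criterion
chi2_rank = np.argsort(chi2_scores)[::-1]
ig_rank = np.argsort(info_gain)[::-1]
print("chi2 top-5:", chi2_rank[:5])
print("info-gain top-5:", ig_rank[:5])
```

The lowest-ranked attributes under each criterion are the candidates for removal in the experiments that follow.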
Experimental Set-up • Objective: 30% dimensionality reduction • Remove 3 parameters for each set of experiments
Attribute Evaluation - Preliminary Conclusions • Removal of some image parameters maintains comparable classification accuracy • Saving up to 30% of storage and processing costs • Paper: Accepted for publication in DICTA 2010 conference
Motivation • By eliminating redundant dimensions we will be able to save retrieval and storage costs • In our case: 540 kilobytes per dimension per day, since we will have a 10,240-dimensional image parameter vector per image (5.27 GB per day)
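Those storage figures are consistent with the SDO image rate from the earlier slide (about 69,120 images per day), assuming 8 bytes (a double) per stored dimension value; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the per-dimension and per-day storage
# figures. The 8-bytes-per-value assumption is ours, not stated in the
# slides; the 69,120 images/day rate comes from the SDO slide above.
IMAGES_PER_DAY = 69_120
BYTES_PER_VALUE = 8        # assumption: double-precision float per dimension
DIMENSIONS = 10_240        # dimensions in the image parameter vector

per_dim_kib = IMAGES_PER_DAY * BYTES_PER_VALUE / 1024
total_gib = IMAGES_PER_DAY * BYTES_PER_VALUE * DIMENSIONS / 1024**3
print(f"{per_dim_kib:.0f} KiB per dimension per day")   # 540 KiB
print(f"{total_gib:.2f} GiB per day")                   # 5.27 GiB
```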
Linear dimensionality reduction methods • Principal Component Analysis (PCA) • Singular Value Decomposition (SVD) • Locality Preserving Projections (LPP) • Factor Analysis (FA)
Non-linear Dimensionality Reduction Methods • Kernel PCA • Isomap • Locally-Linear Embedding (LLE) • Laplacian Eigenmaps (LE)
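Most of the listed methods have scikit-learn counterparts, so a comparison harness can be sketched as below on synthetic manifold data (SVD and LPP are omitted: LPP is not in scikit-learn, and Laplacian Eigenmaps is exposed as `SpectralEmbedding`):

```python
# Illustrative sketch: several of the listed linear and non-linear
# dimensionality reduction methods applied via scikit-learn.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA, KernelPCA, FactorAnalysis
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding

X, _ = make_swiss_roll(n_samples=300, random_state=0)   # toy non-linear data

methods = {
    "PCA": PCA(n_components=2),
    "Factor Analysis": FactorAnalysis(n_components=2),
    "Kernel PCA": KernelPCA(n_components=2, kernel="rbf"),
    "Isomap": Isomap(n_components=2),
    "LLE": LocallyLinearEmbedding(n_components=2, random_state=0),
    "Laplacian Eigenmaps": SpectralEmbedding(n_components=2, random_state=0),
}
embeddings = {name: m.fit_transform(X) for name, m in methods.items()}
for name, Z in embeddings.items():
    print(name, Z.shape)            # each method yields a (300, 2) embedding
```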
Experimental Set-up • We selected 67% of our data as the training set and the remaining 33% for evaluation • Full image labeling • For comparative evaluation we utilize the number of components returned by the standard PCA and SVD algorithms, setting a variance threshold between 96% and 99% of the variance
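The variance-threshold criterion can be sketched with scikit-learn, which lets PCA choose the component count directly from a variance fraction; the 67/33 split mirrors the setup above, and the data here is a synthetic stand-in:

```python
# Sketch of the PCA variance-threshold criterion: pass a fraction in
# (0, 1) as n_components and PCA picks the smallest component count
# that explains at least that share of the variance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=600, n_features=64, n_informative=20,
                           random_state=0)   # stand-in for parameter vectors
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.67, random_state=0)   # 67% train / 33% evaluation

counts = {}
for threshold in (0.96, 0.99):
    pca = PCA(n_components=threshold).fit(X_train)
    counts[threshold] = pca.n_components_
    print(f"{threshold:.0%} variance -> {pca.n_components_} components")
```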
Dimensionality Reduction - Preliminary Experimental Results Average classification accuracy per method
Dimensionality Reduction - Preliminary Experimental Results Average classification accuracy per number of generated dimensions
Dimensionality Reduction – Preliminary Conclusions • Selecting anywhere between 42 and 74 dimensions provided stable results • For our current benchmark dataset we can reduce dimensionality by around 90% from the 640 dimensions we started with • For the SDO mission a 90% reduction would imply savings of up to 4.74 Gigabytes per day (from 5.27 Gigabytes of data per day) • Paper: Under Review
Motivation for this stage • Literature reports very interesting results for different measures in different scenarios • The need to identify peculiar relationships between image parameters and different measures
Dissimilarity Measures • 1) Euclidean distance [30]: Defined as the distance between two points given by the Pythagorean Theorem. Special case of the Minkowski metric where p = 2. • 2) Standardized Euclidean distance [30]: Defined as the Euclidean distance calculated on standardized data, in this case standardized by the standard deviations.
Dissimilarity Measures • 3) Mahalanobis distance [30]: Defined as the Euclidean distance normalized based on a covariance matrix to make the distance metric scale-invariant. • 4) City block distance [30]: Also known as Manhattan distance, it represents distance between points in a grid by examining the absolute differences between coordinates of a pair of objects. Special case of the Minkowski metric where p=1.
Dissimilarity Measures • 5) Chebychev distance [30]: Measures distance assuming only the most significant dimension is relevant. Special case of the Minkowski metric where p = ∞. • 6) Cosine distance [26]: Measures the dissimilarity between two vectors by finding the cosine of the angle between them.
Dissimilarity Measures • 7) Correlation distance [26]: Measures the dissimilarity of the sample correlation between points as sequences of values. • 8) Spearman distance [25]: Measures the dissimilarity of the Spearman rank correlation [25] between observations as sequences of values.
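All eight measures above are available in SciPy (Spearman via `scipy.stats`), so they can be sketched on two hypothetical parameter vectors; note how Euclidean, city block, and Chebychev realize the Minkowski metric at p = 2, p = 1, and p = ∞:

```python
# Sketch of the eight dissimilarity measures, computed with SciPy on
# synthetic stand-in vectors; V and VI supply the per-dimension
# variances and inverse covariance the standardized Euclidean and
# Mahalanobis distances require.
import numpy as np
from scipy.spatial import distance
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
data = rng.random((100, 16))            # hypothetical sample set
u, v = data[0], data[1]

V = data.var(axis=0, ddof=1)            # per-dimension variances
VI = np.linalg.inv(np.cov(data.T))      # inverse covariance matrix

measures = {
    "euclidean": distance.euclidean(u, v),          # Minkowski p = 2
    "std. euclidean": distance.seuclidean(u, v, V),
    "mahalanobis": distance.mahalanobis(u, v, VI),
    "city block": distance.cityblock(u, v),         # Minkowski p = 1
    "chebychev": distance.chebyshev(u, v),          # Minkowski p = inf
    "cosine": distance.cosine(u, v),
    "correlation": distance.correlation(u, v),
    "spearman": 1 - spearmanr(u, v)[0],             # 1 - rank correlation
}
for name, d in measures.items():
    print(f"{name:15s} {d:.4f}")
```

The Minkowski ordering gives a built-in sanity check: for any pair of vectors, Chebychev ≤ Euclidean ≤ city block.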