Data mining: satellite images indexation. Feature Selection. Marine Campedel, 5 February 2004
Data Mining (1) • “Fouille de données” (data mining), ECD, KDD, … • Automatic process giving access to raw data in the context of a given application; • A necessity given the increasing size of databases: find the “relevant” information; • Indexation: an automatic process that associates a set of labels with a raw data item.
Data Mining (2) • [System diagram] Off-line process: data acquisition, raw data (images), information extraction, features, supervised learning, semantic models. • On-line process: user query, user interface, information retrieval.
Data Mining (3) • Information Extraction: from raw data and a priori knowledge (unsupervised), or between raw data and application-based knowledge (supervised). • Information Retrieval: goal is to get relevant examples (raw images) corresponding to any user query (‘find frozen woods area’) in a specified application (‘satellite image retrieval’).
Feature Selection (1) • [Diagram] Data acquisition, Raw Data (images), Information Extraction, Features. Is there any a priori knowledge from the data type or the final application? • Computation cost and storage capacity call for reducing the number of features (dimension); • Reduce redundancy while maintaining noise robustness and discriminative power; • A feature selection algorithm is therefore needed.
Feature Selection (2) • [Flowchart] Raw Data (images) -> compute all a priori features (colour, texture, shape features, …) and construct new features (PCA, ICA, …) -> Feature Selection -> Relevant features (a PCA sketch follows below). • Open questions: domain a priori knowledge? predefined properties? relevance definition?
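The “construct new features (PCA, ICA, …)” step can be illustrated with a minimal sketch; scikit-learn and the 400 x 50 feature matrix below are assumptions for illustration only, not the tooling of the original work.

```python
# Minimal sketch (assumption: scikit-learn, random data standing in for an
# a priori feature matrix of 400 images x 50 features).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))      # hypothetical a priori features

pca = PCA(n_components=10)          # construct 10 new decorrelated features
X_new = pca.fit_transform(X)
print(X_new.shape, pca.explained_variance_ratio_.sum())
```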
Feature Selection (3) • Unsupervised: quantization; define the selection criterion from a priori knowledge (‘filter’ approach); typical use of correlation coefficients, mutual information, … with thresholding; traditional drawback: cannot evaluate a set of features jointly (a filter-style sketch is shown below). • Supervised: define the selection criterion according to the final application (‘wrapper’ or ‘embedded’ approach); typical use of labelled databases and classifiers; traditional drawback: computation cost.
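A minimal sketch of the ‘filter’ approach above, assuming scikit-learn’s mutual information estimator, synthetic labelled data and a hypothetical threshold:

```python
# Filter approach: score each feature individually by mutual information with
# the labels, then keep the features above a threshold (assumption: scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=400, n_features=50, n_informative=2,
                           n_redundant=6, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)
threshold = 0.05                      # hypothetical threshold
selected = np.where(scores > threshold)[0]
print("kept features:", selected)
# The drawback noted above is visible here: features are scored one by one,
# so a group of redundant features can all pass the threshold together.
```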
Supervised Feature Selection • Inputs: labelled database + classification task + exhaustive feature library • Goal: select the feature set that achieves the best classification score • Problem: selection of the inputs themselves (the database, classifier type and feature library are chosen from a priori knowledge).
Constraints • The (hand-)labelled database size is limited by the acquisition process (hundreds to thousands of examples?) • The feature library size can be huge (hundreds of features?) • The classifier must therefore be able to train from a limited number of examples in a high-dimensional space, while ensuring strong generalization.
SVM choice • Support Vector Machine • Parametric classifier; • Support vectors: the examples that define the limits of each class; • Designed to be robust to outliers; • Tractable with high-dimensional data; • Lots of recent literature and tools on the web (Matlab: SPIDER; C/C++: svmlib, svmlight; Java: WEKA).
SVM principle (1/4) • Two-class linear SVM without error: labelled training patterns (x_i, y_i), with y_i in {-1, +1}, are linearly separable if there exist w (weights) and b (bias) such that y_i (w · x_i + b) ≥ 1 for all i. • The optimal hyperplane separates the data with maximal margin: it determines the direction w/|w| along which the distance between the projections of the two classes is maximal.
SVM principle (2/4) • Support vectors: the training examples lying on the margin; they alone determine the optimal hyperplane. • SVM problem: find the maximal-margin hyperplane, i.e. minimize |w|^2 subject to correct classification of all training patterns (the formulation is written out after slide 3/4).
SVM principle (3/4) • Dual problem: the constrained optimization is rewritten in terms of one Lagrange multiplier α_i per training example; only the support vectors obtain non-zero α_i. • Kernel: the dual depends on the data only through dot products x_i · x_j, which can be replaced by a kernel K(x_i, x_j) to obtain non-linear decision boundaries. Both formulations are written out below.
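The formula images of the original slides are not reproduced in this text version; the standard hard-margin formulation, its dual and the kernel substitution (as in [Schoelkopf and Smola, 2002]) are written out below with the notation of slide 1/4.

```latex
% Standard two-class hard-margin SVM, its dual, and the kernel trick
% (reference formulation from [Schoelkopf and Smola, 2002]).
\begin{align*}
  \text{Primal:}\quad
    & \min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
      \quad\text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1,\quad i = 1,\dots,\ell \\
  \text{Dual:}\quad
    & \max_{\alpha}\ \sum_{i=1}^{\ell} \alpha_i
      - \tfrac{1}{2}\sum_{i,j=1}^{\ell} \alpha_i \alpha_j\, y_i y_j\,(x_i \cdot x_j)
      \quad\text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_{i=1}^{\ell} \alpha_i y_i = 0 \\
  \text{Solution:}\quad
    & w = \sum_{i=1}^{\ell} \alpha_i y_i\, x_i,
      \qquad\text{kernel trick: } x_i \cdot x_j \;\longrightarrow\; K(x_i, x_j)
\end{align*}
```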
SVM principle (4/4) • Soft margin: slack variables allow some training errors, traded off against the margin width through a regularization constant C (for non-separable or noisy data). • Multi-class: one-vs-all combinations of binary SVMs, or the direct multi-class formulation MC-SVM. A soft-margin, one-vs-rest sketch follows below.
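A minimal sketch of a soft-margin, one-vs-rest SVM on synthetic data; scikit-learn’s SVC (a wrapper around LIBSVM) is an assumption here, not one of the tools listed on the “SVM choice” slide.

```python
# Soft-margin SVM with an RBF kernel, combined one-vs-rest for a 3-class
# problem (assumption: scikit-learn, synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# C controls the soft margin: a small C tolerates more training errors.
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```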
Selection algorithms using SVM • RFE (Recursive Feature Elimination) [Guyon, 2002]: iteratively eliminates the features with the smallest weights until the desired number of features is reached (a sketch is given below). • Minimization of the L0 norm of the feature weights, i.e. of the number of non-zero weights [Weston, 2003]: iterative process using a linear SVM; at each step the data are rescaled by multiplying each feature by its estimated weight.
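A minimal RFE sketch with a linear SVM; scikit-learn’s RFE implementation and the synthetic data are assumptions for illustration, not the code of [Guyon, 2002].

```python
# RFE: rank features by the magnitude of the linear SVM weights and drop the
# weakest ones, repeating until the desired number remains (assumption: scikit-learn).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=50, n_informative=2,
                           n_redundant=6, random_state=0)

# Linear kernel so that the weight vector w is available for ranking features.
svm = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator=svm, n_features_to_select=2, step=1)
rfe.fit(X, y)
print("selected feature indices:",
      [i for i, kept in enumerate(rfe.support_) if kept])
```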
Proposed experiment • Database: synthetic, Brodatz (texture images), or a satellite image database • Feature library: Gabor filters, orthogonal wavelets, co-occurrence matrices, basic local statistics, … computed at several neighbourhood sizes (scales); one such feature is sketched below • Classifier: SVM • Goal: compare the performance of different selection algorithms (supervised and unsupervised) • Robustness to database modifications? To classifier parameter modifications?
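One possible feature-library entry (co-occurrence matrix statistics plus basic local stats) sketched in plain NumPy; the quantization level, offset and patch size are hypothetical choices, not those of the planned experiment.

```python
# Hypothetical texture features: grey-level co-occurrence matrix over a
# horizontal offset, plus mean/std of the patch (plain NumPy sketch).
import numpy as np

def cooccurrence_features(patch, levels=16, dx=1, dy=0):
    """Contrast and energy of the co-occurrence matrix, plus mean/std of the patch."""
    q = np.floor(patch.astype(float) / 256.0 * levels).astype(int).clip(0, levels - 1)
    glcm = np.zeros((levels, levels))
    h, w = q.shape
    for i in range(h - dy):
        for j in range(w - dx):
            glcm[q[i, j], q[i + dy, j + dx]] += 1
    glcm /= glcm.sum()
    ii, jj = np.indices((levels, levels))
    contrast = np.sum(glcm * (ii - jj) ** 2)
    energy = np.sum(glcm ** 2)
    return np.array([contrast, energy, patch.mean(), patch.std()])

patch = np.random.randint(0, 256, size=(32, 32))   # hypothetical 32x32 texture patch
print(cooccurrence_features(patch))
```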
SPIDER example: two-class synthetic linear problem with 2 relevant dimensions • The first 2 dimensions are relevant (uniform distribution) • The next 6 features are noisy versions of the first two dimensions • The 42 remaining features are independent uniformly distributed variables (noise) • 400 examples, 50 dimensions • Evaluation using cross-validation (train on 80% of the data, test on 20%, 5 repetitions) • Score = classification error rate. The setup is sketched below.
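A sketch reproducing the described synthetic setup with a plain cross-validation loop; the linear labelling rule and the scikit-learn utilities are assumptions, since the original SPIDER (Matlab) script is not reproduced here.

```python
# Synthetic problem: 2 relevant dims, 6 noisy copies, 42 noise dims,
# 400 examples x 50 dims; 80/20 cross-validation repeated 5 times.
import numpy as np
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
relevant = rng.uniform(-1.0, 1.0, size=(n, 2))           # 2 relevant dimensions
y = (relevant[:, 0] + relevant[:, 1] > 0).astype(int)    # hypothetical linear labels
noisy_copies = np.repeat(relevant, 3, axis=1) + 0.3 * rng.normal(size=(n, 6))
noise = rng.uniform(-1.0, 1.0, size=(n, 42))             # 42 irrelevant dimensions
X = np.hstack([relevant, noisy_copies, noise])           # 400 examples x 50 dims

# Train on 80%, test on 20%, 5 repetitions; score = error rate.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
acc = cross_val_score(SVC(kernel="linear"), X, y, cv=cv)
print("error rate per split:", 1.0 - acc)
```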
SPIDER example (results) • The results confirm the gain brought by the selection process • The correlation-based selection algorithm performs poorly compared with the ‘wrapper’ methods.
Conclusion and what next? • Subject: feature selection algorithms • Determine an automatic procedure for selecting relevant features in the context of satellite image indexation • Applicable to any data indexation? (Is the data-type a priori knowledge concentrated in the feature library design?) • Experiment in progress…
Bibliography • [Elisseeff, 2003] “Technical documentation of the multi-class SVM”, 2003. • [Guyon, 2002] “Gene selection for cancer classification using support vector machines”, I. Guyon, J. Weston, S. Barnhill and V. Vapnik, Machine Learning 46(1-3), 389-422, 2002. • [Guyon, 2003] “An Introduction to Variable and Feature Selection”, I. Guyon and A. Elisseeff, JMLR 3, 1157-1182, 2003. • [Schoelkopf and Smola, 2002] “Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond”, B. Schoelkopf and A. J. Smola, MIT Press, 2002. • [Weston, 2003] “Use of the Zero-Norm with Linear Models and Kernel Methods”, J. Weston, A. Elisseeff, B. Schoelkopf and M. Tipping, JMLR 3, 1439-1461, 2003.