CLASSIFICATION. Measures of similarity: i) distance, ii) angular (correlation).
Periodic Table of Elements: 1789 Lavoisier, 1869 Mendeleev.
Variable space (Var 1 vs Var 2) • Two objects, x_k^T and x_l^T, plotted in the two-dimensional variable space. The difference between the object vectors is defined as the Euclidean distance between the objects, d_kl = ||x_k^T - x_l^T||; the angle between the vectors measures angular (correlation) similarity.
Measuring similarity: Distance. i) Euclidean, ii) Minkowski ("Manhattan", "taxicab"), iii) Mahalanobis (correlated variables).
Distance (figure: two points p1 and p2 in the X1-X2 plane). Euclidean: d = [Σ_i (x_1i - x_2i)²]^(1/2). Manhattan: d = Σ_i |x_1i - x_2i|.
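A minimal sketch (not from the slides) of the three distance measures on two object vectors; the vectors and the training data used to estimate the covariance matrix for the Mahalanobis distance are made up for illustration.

```python
import numpy as np

x_k = np.array([1.0, 2.0, 3.0])
x_l = np.array([2.0, 0.5, 3.5])

# Euclidean distance: square root of the sum of squared differences
d_euclid = np.sqrt(np.sum((x_k - x_l) ** 2))

# Manhattan ("taxicab") distance: sum of absolute differences
d_manhattan = np.sum(np.abs(x_k - x_l))

# Mahalanobis distance: accounts for correlation between variables.
# The covariance matrix is estimated from simulated training data X.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = x_k - x_l
d_mahalanobis = np.sqrt(diff @ S_inv @ diff)

print(d_euclid, d_manhattan, d_mahalanobis)
```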
Classification using distance: the nearest neighbor(s) define the membership of an object. KNN (K nearest neighbors), e.g. K = 1 or K = 3.
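A hedged sketch of KNN class assignment with K = 1 and K = 3 using scikit-learn; the two small training classes and the new object are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.3],   # class 0
                    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])
x_new = np.array([[0.8, 0.9]])

for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k)  # Euclidean distance by default
    knn.fit(X_train, y_train)
    print(f"K = {k}: predicted class {knn.predict(x_new)[0]}")
```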
Classification (figure: two classes in the X1-X2 plane). X1 and X2 are uncorrelated, cov(X1, X2) = 0, for both subsets (classes) => KNN can be used to measure similarity.
Classification (figure: four classes in the X1-X2 plane with PC1 and PC2 indicated). Univariate classification can NOT provide a good separation between class 1 and class 2; bivariate classification (KNN) provides separation. For class 3 and class 4, PC analysis provides excellent separation on PC2.
Classification (figure). X1 and X2 are correlated, cov(X1, X2) ≠ 0, for both "classes" (high X1 => high X2). KNN fails, but PC analysis provides the correct classification.
Classification. Cluster methods like KNN (K nearest neighbors) use all the data in the calculation of distances. Drawback: no separation of noise from information. Cure: use scores from the major PCs, as sketched below.
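A sketch of that cure under assumed, simulated data: project onto the major PCs and run KNN on the scores so that the distance calculation ignores the noise carried by the minor components. The choice of two components and three neighbors is illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
# Two classes in 10 variables; only the first two variables carry class structure.
X0 = rng.normal(0.0, 1.0, size=(30, 10))
X1 = rng.normal(0.0, 1.0, size=(30, 10)) + np.r_[3.0, 3.0, np.zeros(8)]
X = np.vstack([X0, X1])
y = np.r_[np.zeros(30), np.ones(30)]

# PCA scores feed the KNN classifier instead of the raw variables
model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict(X[:3]), model.predict(X[-3:]))
```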
CORRELATION & SIMILARITY (figure: objects in the Var 1-Var 2 variable space).
CORRELATION & SIMILARITY: SUPERVISED COMPARISON (SIMCA) (figure: separate PC models, PC_class 1 and PC_class 2, fitted to each class in the variable space).
CORRELATION & SIMILARITY: UNSUPERVISED COMPARISON (PCA) (figure: a single PC1/PC2 model fitted to all objects in the variable space).
CORRELATION & SIMILARITY (figure: object vector x_k^T, its point x_c^T on the class model, and the residual vector e_k^T in the variable space).
CORRELATION & SIMILARITY. Unsupervised: PCA score plot, fuzzy clustering. Supervised: SIMCA.
CORRELATION & SIMILARITY (figure: map with 0-30 km scale). Characterisation and correlation of crude oils; Kvalheim et al. (1985), Anal. Chem.
CORRELATION & SIMILARITY (figure: chromatograms of Sample 1, Sample 2, …, Sample N).
CORRELATION & SIMILARITY: SCORE PLOT (figure: samples 1-14 plotted by their scores t1 vs t2 on PC1 and PC2).
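An illustrative sketch of such a score plot for unsupervised comparison of samples; the data (14 samples, 20 peak areas) are simulated, not the crude-oil data from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(14, 20))            # 14 samples, 20 peak areas
X = X - X.mean(axis=0)                   # mean-centre before PCA

scores = PCA(n_components=2).fit_transform(X)
plt.scatter(scores[:, 0], scores[:, 1])
for k, (t1, t2) in enumerate(scores, start=1):
    plt.annotate(str(k), (t1, t2))       # label samples as in the score plot
plt.xlabel("t1 (PC1)")
plt.ylabel("t2 (PC2)")
plt.show()
```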
SIMCA: Data (variance) = Model (covariance pattern; angular correlation) + Residuals (unique variance, noise; distance).
SIMCA. Data matrix X with elements x_ki: objects as rows, variables as columns 1, 2, 3, 4, …, M. Rows 1 to N form the training set (reference set), grouped into Class 1, Class 2, …, Class Q. Rows N+1 to N+N' are unassigned objects, the test set. Terminology: class = group of similar objects; object = sample, individual; variable = feature, characteristic, attribute.
SIMCA example. Data matrix: chromatograms as rows, peak areas as columns 1, 2, 3, 4, …, M. Rows 1 to N are the training set (reference set) x_ki, grouped by oil field (Oil field 1, Oil field 2, …, Oil field Q). Rows N+1 to N+N' are new samples, the test set.
PC MODELS (figures: objects 1, 2, 3 and their fitted points in variable space).
Zero-component model (class mean only): x_ki = x̄_i + e_ki, or in vector form x_k' = x̄' + e_k'.
One-component model: x_ki = x̄_i + t_k p_i' + e_ki, or x_k' = x̄' + t_k p' + e_k' (loading vector p_1).
Two-component model: x_k' = x̄' + t_k1 p_1' + t_k2 p_2' + e_k' (loading vectors p_1 and p_2).
PRINCIPAL COMPONENT CLASS MODEL: X_c = X̄_c + T_c P_c' + E_c, where X̄_c + T_c P_c' is the information (structure) and E_c the noise. Indices: k = 1, 2, …, N (object, sample); i = 1, 2, …, M (variable); a = 1, 2, …, A (principal component); c = 1, 2, …, C (class).
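A minimal sketch of this decomposition for one class, computed with an ordinary PCA; the data, the class size, and the choice A = 2 are assumptions for illustration, and the split into structure and noise depends on the chosen A.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X_c = rng.normal(size=(20, 8))           # N = 20 objects, M = 8 variables
A = 2                                    # number of principal components

x_mean = X_c.mean(axis=0)
pca = PCA(n_components=A).fit(X_c)
T_c = pca.transform(X_c)                 # scores, N x A
P_c = pca.components_                    # loadings, A x M

X_model = x_mean + T_c @ P_c             # structure (information)
E_c = X_c - X_model                      # residuals (noise)
print(np.allclose(X_c, X_model + E_c))   # True: data = model + residuals
```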
PC MODELS (figure: deletion-pattern diagrams). • Deletion pattern for objects in the leave-out-one-group-of-elements-at-a-time cross-validation procedure developed by Wold.
CROSS-VALIDATING PC MODELS.
i) Calculate scores and loadings for PC a+1, t_(a+1) and p_(a+1)', excluding the elements in one group.
ii) Predict values for the excluded elements: ê_(ki,a+1) = t_(k,a+1) p_(a+1,i)'.
iii) Sum the squared prediction errors over the excluded elements.
iv) Repeat i)-iii) for all the other groups of elements.
v) Compare this prediction error sum with the residual sum of squares of the a-component model, adjusted for degrees of freedom; keep PC a+1 only if it predicts better.
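A much-simplified, hedged sketch of this idea: here whole objects are left out rather than Wold's groups of elements, and the left-out object is projected onto the model fitted without it; the simulated data and the PRESS comparison across component counts are only meant to show the principle, not the slides' exact procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 10))

def press(X, A):
    """Leave-one-object-out prediction error sum for an A-component model."""
    total = 0.0
    for k in range(X.shape[0]):
        X_train = np.delete(X, k, axis=0)
        mean = X_train.mean(axis=0)
        P = PCA(n_components=A).fit(X_train - mean).components_   # A x M loadings
        r = X[k] - mean
        e = r - (r @ P.T) @ P           # residual after projecting onto A PCs
        total += np.sum(e ** 2)
    return total

for A in range(1, 5):
    print(A, press(X, A))               # choose A where PRESS stops improving
```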
1-component PC model (figure: residual limits S_max at p = 0.01 and p = 0.05 around PC 1).
Residual Standard Deviation (RSD) (figure: S_max and S_0 around PC 1). Mean RSD of class: S_0 = [Σ_k Σ_i e_ki² / ((N - A - 1)(M - A))]^(1/2). RSD of object k: s_k = [Σ_i e_ki² / (M - A)]^(1/2).
(Figure: one-component class model along PC 1 with residual limit s_max; the score range t_min to t_max is extended by ½ s_t at each end to give t_lower and t_upper.)
CLASSIFICATION OF A NEW OBJECT. i) Fit the object to the class model. ii) Compare the residual distance of the object to the class model with the average residual distance of the objects used to obtain the class (F-test).
CLASSIFICATION OF A NEW OBJECT.
i) Fit the object to the class model: for a = 1, 2, …, A calculate the scores t_ka, then calculate the residuals of the object, e_ki = x_ki - x̄_i - Σ_a t_ka p_ai', and s_k² = Σ_i e_ki² / (M - A).
ii) Compare the residual distance of the object to the class model with the average residual distance of the objects used to obtain the class (F-test): s_k²/S_0² <= F_critical => object k belongs to class q; s_k²/S_0² > F_critical => object k does not belong to class q.
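A hedged sketch of steps i) and ii) on simulated data: fit a new object to a one-class PC model and compare its residual variance with the class residual variance via an F-test. Degrees-of-freedom conventions differ between SIMCA variants; the ones below are a common choice, not necessarily the slides' exact choice.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X_q = rng.normal(size=(20, 8))           # training objects of class q
x_new = rng.normal(size=8) + 2.0         # new object (deliberately atypical)
A = 2
N, M = X_q.shape

mean = X_q.mean(axis=0)
pca = PCA(n_components=A).fit(X_q - mean)
P = pca.components_                      # A x M loadings

# Class residual variance S0^2 from the training objects
E_train = (X_q - mean) - pca.transform(X_q - mean) @ P
s0_sq = np.sum(E_train ** 2) / ((N - A - 1) * (M - A))

# i) Fit the new object: scores, then residuals
t = (x_new - mean) @ P.T
e = (x_new - mean) - t @ P
s_k_sq = np.sum(e ** 2) / (M - A)

# ii) F-test of the object's residual variance against the class variance
F = s_k_sq / s0_sq
F_crit = stats.f.ppf(0.95, dfn=M - A, dfd=(N - A - 1) * (M - A))
print("member of class q" if F <= F_crit else "not a member of class q")
```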
(Figure: the same one-component class model, now showing objects outside the model: scores beyond t_lower/t_upper or residuals above s_max.)
Detection of atypical objects (figure: PC 1 with residual limit RSD_max and score range t_min - ½s_t to t_max + ½s_t).
Object k: s_k > RSD_max => k is outside the class.
Object l: t_l is outside the "normal area", [t_min - ½s_t, t_max + ½s_t] => calculate the distance to the nearest extreme point of the model; if s_l > RSD_max => l is outside the class.
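A small sketch of these two checks on invented scores and residual standard deviations; RSD_max here is just an illustrative cut-off, and in the full procedure the residual of an object outside the score range would be recomputed towards the nearest extreme point before the comparison.

```python
import numpy as np

t_scores = np.array([-2.1, -1.0, 0.2, 0.8, 1.9])  # scores of class objects on PC 1
s = np.array([0.3, 0.2, 0.25, 0.4, 0.9])           # their residual standard deviations
RSD_max = 0.6                                      # illustrative residual limit

s_t = t_scores.std(ddof=1)
t_lower, t_upper = t_scores.min() - 0.5 * s_t, t_scores.max() + 0.5 * s_t

for t_k, s_k in zip(t_scores, s):
    if s_k > RSD_max:
        verdict = "outside the class (residual too large)"
    elif not (t_lower <= t_k <= t_upper):
        # full procedure: recompute the residual towards the nearest extreme point
        verdict = "outside the normal score range - check distance to extreme point"
    else:
        verdict = "ok"
    print(f"t = {t_k:5.2f}, s = {s_k:.2f}: {verdict}")
```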
Detection of outliers: 1. Score plots. 2. Dixon tests on each latent variable. 3. Normal plots of scores for each latent variable. 4. Test of residuals, F-test (class model).
MODELLING POWER: the variable's contribution to the class model q (intra-class variation). MP_i^q = 1 - S_i,A^q / S_i,0^q. MP_i = 1.0 => variable i is completely explained by the class model; MP_i = 0.0 => variable i does NOT contribute to the class model.
DISCRIMINATION POWER: the variable's ability to separate two class models (inter-class variation). DP_i^(r,q) = 1.0 => no discrimination power; DP_i^(r,q) > 3-4 => "good" discrimination power.
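A hedged sketch of per-variable MP and DP on simulated two-class data. MP follows the slide's formula MP_i = 1 - S_i,A / S_i,0; the DP expression below is the commonly quoted SIMCA form (residuals of each class's objects fitted to the other class's model, relative to the within-class residuals), since the slides' own DP formula is not legible in this extract.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_class(X, A):
    mean = X.mean(axis=0)
    P = PCA(n_components=A).fit(X - mean).components_
    return mean, P

def residuals(X, mean, P):
    R = X - mean
    return R - (R @ P.T) @ P                 # per-element residuals

rng = np.random.default_rng(6)
X_q = rng.normal(size=(25, 6))
X_r = rng.normal(size=(25, 6)) + np.r_[4.0, np.zeros(5)]   # classes differ in variable 1
A = 2

mean_q, P_q = fit_class(X_q, A)
mean_r, P_r = fit_class(X_r, A)

# Modelling power of class q: residual SD after the model vs. raw SD per variable
s_iA = residuals(X_q, mean_q, P_q).std(axis=0, ddof=1)
s_i0 = X_q.std(axis=0, ddof=1)
MP = 1.0 - s_iA / s_i0

# Discrimination power between classes q and r
s_q_on_r = residuals(X_q, mean_r, P_r).std(axis=0, ddof=1)
s_r_on_q = residuals(X_r, mean_q, P_q).std(axis=0, ddof=1)
s_q_on_q = residuals(X_q, mean_q, P_q).std(axis=0, ddof=1)
s_r_on_r = residuals(X_r, mean_r, P_r).std(axis=0, ddof=1)
DP = np.sqrt((s_q_on_r**2 + s_r_on_q**2) / (s_q_on_q**2 + s_r_on_r**2))

print("MP:", np.round(MP, 2))
print("DP:", np.round(DP, 2))                # DP >> 1 for the discriminating variable
```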
SEPARATION BETWEEN CLASSES (figure: objects k and l fitted to class q and class r, with residual distances s_k(q), s_l(q), s_k(r), s_l(r)). The worst ratio and the class distance are computed from these residual distances; a sufficiently large class distance => "good separation".
POLISHED CLASSES: 1) Remove "outliers". 2) Remove variables with both low MP (< 0.3-0.4) and low DP (< 2-3).
How does SIMCA differ from other multivariate methods?
i) Models systematic intra-class variation (angular correlation).
ii) Assuming a normally distributed population, the residuals can be used to decide class membership (F-test).
iii) "Closed" models.
iv) Considers correlation, important for large data sets.
v) SIMCA separates noise from systematic (predictive) variation in each class.
Separating surface: Linear Discriminant Analysis (LDA). • New classes? • Outliers? • Asymmetric case? • Looking for dissimilarities.
MISSING DATA (figure: objects in the x1-x2 plane with missing values, marked "?", relative to the class models f1(x1, x2) and f2(x1, x2)).
WHEN DOES SIMCA WORK? 1. Similarity between objects in the same class, i.e. homogeneous data. 2. Some variables relevant to the problem in question (MP, DP). 3. At least 5 objects and 3 variables.
ALGORITHM FOR SIMCA MODELLING (flowchart):
1. Read raw data.
2. Pretreatment of data (square root, normalise, and more).
3. Select subset/class; variable weighting, standardise.
4. Cross-validated PC model.
5. Outliers? If yes, remove them and remodel.
6. Eliminate variables with low modelling and discrimination power => "polished" subsets; remodel if needed.
7. More classes? If yes, return to step 3.
8. Fit new objects; evaluation of subsets.