CLASSIFICATION. Measures of similarity: i) distance, ii) angular (correlation).
Periodic Table of Elements: 1789 Lavoisier, 1869 Mendeleev.
Variable space (Var 1 vs Var 2) • Two objects, x_k^T and x_l^T, plotted in the two-dimensional variable space. The difference between the object vectors is defined as the Euclidean distance between the objects, d_kl = ||x_k^T - x_l^T||; the angle between the vectors measures angular (correlation) similarity.
Measuring similarity: Distance. i) Euclidean, ii) Minkowski ("Manhattan", "taxicab"), iii) Mahalanobis (correlated variables).
Distance (figure: two points p1 and p2 in the X1-X2 plane). Euclidean: d = [Σ_i (x_1i - x_2i)²]^(1/2). Manhattan: d = Σ_i |x_1i - x_2i|.
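A minimal sketch (not from the slides) of the three distance measures on two object vectors; the vectors and the training data used to estimate the covariance matrix for the Mahalanobis distance are made up for illustration.

```python
import numpy as np

x_k = np.array([1.0, 2.0, 3.0])
x_l = np.array([2.0, 0.5, 3.5])

# Euclidean distance: square root of the sum of squared differences
d_euclid = np.sqrt(np.sum((x_k - x_l) ** 2))

# Manhattan ("taxicab") distance: sum of absolute differences
d_manhattan = np.sum(np.abs(x_k - x_l))

# Mahalanobis distance: accounts for correlation between variables.
# The covariance matrix is estimated from simulated training data X.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = x_k - x_l
d_mahalanobis = np.sqrt(diff @ S_inv @ diff)

print(d_euclid, d_manhattan, d_mahalanobis)
```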
Classification using distance: the nearest neighbor(s) define the membership of an object. KNN (K nearest neighbors), e.g. K = 1 or K = 3.
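A hedged sketch of KNN class assignment with K = 1 and K = 3 using scikit-learn; the two small training classes and the new object are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.3],   # class 0
                    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])
x_new = np.array([[0.8, 0.9]])

for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k)  # Euclidean distance by default
    knn.fit(X_train, y_train)
    print(f"K = {k}: predicted class {knn.predict(x_new)[0]}")
```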
Classification (figure: two classes in the X1-X2 plane). X1 and X2 are uncorrelated, cov(X1, X2) = 0, for both subsets (classes) => KNN can be used to measure similarity.
Classification (figure: four classes in the X1-X2 plane with PC1 and PC2 indicated). Univariate classification can NOT provide a good separation between class 1 and class 2; bivariate classification (KNN) provides separation. For class 3 and class 4, PC analysis provides excellent separation on PC2.
Classification (figure). X1 and X2 are correlated, cov(X1, X2) ≠ 0, for both "classes" (high X1 => high X2). KNN fails, but PC analysis provides the correct classification.
Classification. Cluster methods like KNN (K nearest neighbors) use all the data in the calculation of distances. Drawback: no separation of noise from information. Cure: use scores from the major PCs, as sketched below.
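A sketch of that cure under assumed, simulated data: project onto the major PCs and run KNN on the scores so that the distance calculation ignores the noise carried by the minor components. The choice of two components and three neighbors is illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
# Two classes in 10 variables; only the first two variables carry class structure.
X0 = rng.normal(0.0, 1.0, size=(30, 10))
X1 = rng.normal(0.0, 1.0, size=(30, 10)) + np.r_[3.0, 3.0, np.zeros(8)]
X = np.vstack([X0, X1])
y = np.r_[np.zeros(30), np.ones(30)]

# PCA scores feed the KNN classifier instead of the raw variables
model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict(X[:3]), model.predict(X[-3:]))
```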
CORRELATION & SIMILARITY (figure: objects in the Var 1-Var 2 variable space).
CORRELATION & SIMILARITY: SUPERVISED COMPARISON (SIMCA) (figure: separate PC models, PC_class 1 and PC_class 2, fitted to each class in the variable space).
CORRELATION & SIMILARITY: UNSUPERVISED COMPARISON (PCA) (figure: a single PC1/PC2 model fitted to all objects in the variable space).
CORRELATION & SIMILARITY (figure: object vector x_k^T, its point x_c^T on the class model, and the residual vector e_k^T in the variable space).
CORRELATION & SIMILARITY. Unsupervised: PCA score plot, fuzzy clustering. Supervised: SIMCA.
CORRELATION & SIMILARITY (figure: map with 0-30 km scale). Characterisation and correlation of crude oils; Kvalheim et al. (1985), Anal. Chem.
CORRELATION & SIMILARITY (figure: chromatograms of Sample 1, Sample 2, …, Sample N).
CORRELATION & SIMILARITY: SCORE PLOT (figure: samples 1-14 plotted by their scores t1 vs t2 on PC1 and PC2).
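An illustrative sketch of such a score plot for unsupervised comparison of samples; the data (14 samples, 20 peak areas) are simulated, not the crude-oil data from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(14, 20))            # 14 samples, 20 peak areas
X = X - X.mean(axis=0)                   # mean-centre before PCA

scores = PCA(n_components=2).fit_transform(X)
plt.scatter(scores[:, 0], scores[:, 1])
for k, (t1, t2) in enumerate(scores, start=1):
    plt.annotate(str(k), (t1, t2))       # label samples as in the score plot
plt.xlabel("t1 (PC1)")
plt.ylabel("t2 (PC2)")
plt.show()
```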
SIMCA: Data (variance) = Model (covariance pattern; angular correlation) + Residuals (unique variance, noise; distance).
SIMCA. Data matrix X with elements x_ki: objects as rows, variables as columns 1, 2, 3, 4, …, M. Rows 1 to N form the training set (reference set), grouped into Class 1, Class 2, …, Class Q. Rows N+1 to N+N' are unassigned objects, the test set. Terminology: class = group of similar objects; object = sample, individual; variable = feature, characteristic, attribute.
SIMCA example. Data matrix: chromatograms as rows, peak areas as columns 1, 2, 3, 4, …, M. Rows 1 to N are the training set (reference set) x_ki, grouped by oil field (Oil field 1, Oil field 2, …, Oil field Q). Rows N+1 to N+N' are new samples, the test set.
PC MODELS (figures: objects 1, 2, 3 and their fitted points in variable space).
Zero-component model (class mean only): x_ki = x̄_i + e_ki, or in vector form x_k' = x̄' + e_k'.
One-component model: x_ki = x̄_i + t_k p_i' + e_ki, or x_k' = x̄' + t_k p' + e_k' (loading vector p_1).
Two-component model: x_k' = x̄' + t_k1 p_1' + t_k2 p_2' + e_k' (loading vectors p_1 and p_2).
PRINCIPAL COMPONENT CLASS MODEL: X_c = X̄_c + T_c P_c' + E_c, where X̄_c + T_c P_c' is the information (structure) and E_c the noise. Indices: k = 1, 2, …, N (object, sample); i = 1, 2, …, M (variable); a = 1, 2, …, A (principal component); c = 1, 2, …, C (class).
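A minimal sketch of this decomposition for one class, computed with an ordinary PCA; the data, the class size, and the choice A = 2 are assumptions for illustration, and the split into structure and noise depends on the chosen A.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X_c = rng.normal(size=(20, 8))           # N = 20 objects, M = 8 variables
A = 2                                    # number of principal components

x_mean = X_c.mean(axis=0)
pca = PCA(n_components=A).fit(X_c)
T_c = pca.transform(X_c)                 # scores, N x A
P_c = pca.components_                    # loadings, A x M

X_model = x_mean + T_c @ P_c             # structure (information)
E_c = X_c - X_model                      # residuals (noise)
print(np.allclose(X_c, X_model + E_c))   # True: data = model + residuals
```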
PC MODELS (figure: deletion-pattern diagrams). • Deletion pattern for objects in the leave-out-one-group-of-elements-at-a-time cross-validation procedure developed by Wold.
CROSS-VALIDATING PC MODELS.
i) Calculate scores and loadings for PC a+1, t_(a+1) and p_(a+1)', excluding the elements in one group.
ii) Predict values for the excluded elements: ê_(ki,a+1) = t_(k,a+1) p_(a+1,i)'.
iii) Sum the squared prediction errors over the excluded elements.
iv) Repeat i)-iii) for all the other groups of elements.
v) Compare this prediction error sum with the residual sum of squares of the a-component model, adjusted for degrees of freedom; keep PC a+1 only if it predicts better.
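A much-simplified, hedged sketch of this idea: here whole objects are left out rather than Wold's groups of elements, and the left-out object is projected onto the model fitted without it; the simulated data and the PRESS comparison across component counts are only meant to show the principle, not the slides' exact procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 10))

def press(X, A):
    """Leave-one-object-out prediction error sum for an A-component model."""
    total = 0.0
    for k in range(X.shape[0]):
        X_train = np.delete(X, k, axis=0)
        mean = X_train.mean(axis=0)
        P = PCA(n_components=A).fit(X_train - mean).components_   # A x M loadings
        r = X[k] - mean
        e = r - (r @ P.T) @ P           # residual after projecting onto A PCs
        total += np.sum(e ** 2)
    return total

for A in range(1, 5):
    print(A, press(X, A))               # choose A where PRESS stops improving
```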
1-component PC model (figure: residual limits S_max at p = 0.01 and p = 0.05 around PC 1).
Residual Standard Deviation (RSD) (figure: S_max and S_0 around PC 1). Mean RSD of class: S_0 = [Σ_k Σ_i e_ki² / ((N - A - 1)(M - A))]^(1/2). RSD of object k: s_k = [Σ_i e_ki² / (M - A)]^(1/2).
(Figure: one-component class model along PC 1 with residual limit s_max; the score range t_min to t_max is extended by ½ s_t at each end to give t_lower and t_upper.)
CLASSIFICATION OF A NEW OBJECT. i) Fit the object to the class model. ii) Compare the residual distance of the object to the class model with the average residual distance of the objects used to obtain the class (F-test).
CLASSIFICATION OF A NEW OBJECT.
i) Fit the object to the class model: for a = 1, 2, …, A calculate the scores t_ka, then calculate the residuals of the object, e_ki = x_ki - x̄_i - Σ_a t_ka p_ai', and s_k² = Σ_i e_ki² / (M - A).
ii) Compare the residual distance of the object to the class model with the average residual distance of the objects used to obtain the class (F-test): s_k²/S_0² <= F_critical => object k belongs to class q; s_k²/S_0² > F_critical => object k does not belong to class q.
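A hedged sketch of steps i) and ii) on simulated data: fit a new object to a one-class PC model and compare its residual variance with the class residual variance via an F-test. Degrees-of-freedom conventions differ between SIMCA variants; the ones below are a common choice, not necessarily the slides' exact choice.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X_q = rng.normal(size=(20, 8))           # training objects of class q
x_new = rng.normal(size=8) + 2.0         # new object (deliberately atypical)
A = 2
N, M = X_q.shape

mean = X_q.mean(axis=0)
pca = PCA(n_components=A).fit(X_q - mean)
P = pca.components_                      # A x M loadings

# Class residual variance S0^2 from the training objects
E_train = (X_q - mean) - pca.transform(X_q - mean) @ P
s0_sq = np.sum(E_train ** 2) / ((N - A - 1) * (M - A))

# i) Fit the new object: scores, then residuals
t = (x_new - mean) @ P.T
e = (x_new - mean) - t @ P
s_k_sq = np.sum(e ** 2) / (M - A)

# ii) F-test of the object's residual variance against the class variance
F = s_k_sq / s0_sq
F_crit = stats.f.ppf(0.95, dfn=M - A, dfd=(N - A - 1) * (M - A))
print("member of class q" if F <= F_crit else "not a member of class q")
```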
(Figure: the same one-component class model, now showing objects outside the model: scores beyond t_lower/t_upper or residuals above s_max.)
Detection of atypical objects (figure: PC 1 with residual limit RSD_max and score range t_min - ½s_t to t_max + ½s_t).
Object k: s_k > RSD_max => k is outside the class.
Object l: t_l is outside the "normal area", [t_min - ½s_t, t_max + ½s_t] => calculate the distance to the nearest extreme point of the model; if s_l > RSD_max => l is outside the class.
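A small sketch of these two checks on invented scores and residual standard deviations; RSD_max here is just an illustrative cut-off, and in the full procedure the residual of an object outside the score range would be recomputed towards the nearest extreme point before the comparison.

```python
import numpy as np

t_scores = np.array([-2.1, -1.0, 0.2, 0.8, 1.9])  # scores of class objects on PC 1
s = np.array([0.3, 0.2, 0.25, 0.4, 0.9])           # their residual standard deviations
RSD_max = 0.6                                      # illustrative residual limit

s_t = t_scores.std(ddof=1)
t_lower, t_upper = t_scores.min() - 0.5 * s_t, t_scores.max() + 0.5 * s_t

for t_k, s_k in zip(t_scores, s):
    if s_k > RSD_max:
        verdict = "outside the class (residual too large)"
    elif not (t_lower <= t_k <= t_upper):
        # full procedure: recompute the residual towards the nearest extreme point
        verdict = "outside the normal score range - check distance to extreme point"
    else:
        verdict = "ok"
    print(f"t = {t_k:5.2f}, s = {s_k:.2f}: {verdict}")
```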
Detection of outliers: 1. Score plots. 2. Dixon tests on each latent variable. 3. Normal plots of scores for each latent variable. 4. Test of residuals, F-test (class model).
MODELLING POWER: the variable's contribution to the class model q (intra-class variation). MP_i^q = 1 - S_i,A^q / S_i,0^q. MP_i = 1.0 => variable i is completely explained by the class model; MP_i = 0.0 => variable i does NOT contribute to the class model.
DISCRIMINATION POWER: the variable's ability to separate two class models (inter-class variation). DP_i^(r,q) = 1.0 => no discrimination power; DP_i^(r,q) > 3-4 => "good" discrimination power.
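A hedged sketch of per-variable MP and DP on simulated two-class data. MP follows the slide's formula MP_i = 1 - S_i,A / S_i,0; the DP expression below is the commonly quoted SIMCA form (residuals of each class's objects fitted to the other class's model, relative to the within-class residuals), since the slides' own DP formula is not legible in this extract.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_class(X, A):
    mean = X.mean(axis=0)
    P = PCA(n_components=A).fit(X - mean).components_
    return mean, P

def residuals(X, mean, P):
    R = X - mean
    return R - (R @ P.T) @ P                 # per-element residuals

rng = np.random.default_rng(6)
X_q = rng.normal(size=(25, 6))
X_r = rng.normal(size=(25, 6)) + np.r_[4.0, np.zeros(5)]   # classes differ in variable 1
A = 2

mean_q, P_q = fit_class(X_q, A)
mean_r, P_r = fit_class(X_r, A)

# Modelling power of class q: residual SD after the model vs. raw SD per variable
s_iA = residuals(X_q, mean_q, P_q).std(axis=0, ddof=1)
s_i0 = X_q.std(axis=0, ddof=1)
MP = 1.0 - s_iA / s_i0

# Discrimination power between classes q and r
s_q_on_r = residuals(X_q, mean_r, P_r).std(axis=0, ddof=1)
s_r_on_q = residuals(X_r, mean_q, P_q).std(axis=0, ddof=1)
s_q_on_q = residuals(X_q, mean_q, P_q).std(axis=0, ddof=1)
s_r_on_r = residuals(X_r, mean_r, P_r).std(axis=0, ddof=1)
DP = np.sqrt((s_q_on_r**2 + s_r_on_q**2) / (s_q_on_q**2 + s_r_on_r**2))

print("MP:", np.round(MP, 2))
print("DP:", np.round(DP, 2))                # DP >> 1 for the discriminating variable
```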
SEPARATION BETWEEN CLASSES (figure: objects k and l fitted to class q and class r, with residual distances s_k(q), s_l(q), s_k(r), s_l(r)). The worst ratio and the class distance are computed from these residual distances; a sufficiently large class distance => "good separation".
POLISHED CLASSES: 1) Remove "outliers". 2) Remove variables with both low MP (< 0.3-0.4) and low DP (< 2-3).
How does SIMCA differ from other multivariate methods?
i) Models systematic intra-class variation (angular correlation).
ii) Assuming a normally distributed population, the residuals can be used to decide class membership (F-test).
iii) "Closed" models.
iv) Considers correlation, important for large data sets.
v) SIMCA separates noise from systematic (predictive) variation in each class.
Separating surface: Linear Discriminant Analysis (LDA). • New classes? • Outliers? • Asymmetric case? • Looking for dissimilarities.
MISSING DATA (figure: objects in the x1-x2 plane with missing values, marked "?", relative to the class models f1(x1, x2) and f2(x1, x2)).
WHEN DOES SIMCA WORK? 1. Similarity between objects in the same class, i.e. homogeneous data. 2. Some variables relevant to the problem in question (MP, DP). 3. At least 5 objects and 3 variables.
ALGORITHM FOR SIMCA MODELLING (flowchart):
1. Read raw data.
2. Pretreatment of data (square root, normalise, and more).
3. Select subset/class; variable weighting, standardise.
4. Cross-validated PC model.
5. Outliers? If yes, remove them and remodel.
6. Eliminate variables with low modelling and discrimination power => "polished" subsets; remodel if needed.
7. More classes? If yes, return to step 3.
8. Fit new objects; evaluation of subsets.