Multivariate Resolution in Chemistry. Lecture 1. Roma Tauler IIQAB-CSIC, Spain e-mail: rtaqam@iiqab.csic.es. Lecture 1. Introduction to data structures and soft-modelling methods. Factor Analysis of two-way data: Bilinear models. Rotation and intensity ambiguities.
Multivariate Resolution in Chemistry Lecture 1 Roma Tauler IIQAB-CSIC, Spain e-mail: rtaqam@iiqab.csic.es
Lecture 1 • Introduction to data structures and soft-modelling methods. • Factor Analysis of two-way data: Bilinear models. • Rotation and intensity ambiguities. • Pseudo-rank, local rank and rank deficiency. • Evolving Factor Analysis.
Chemical sensors and analytical data structures
• one variable x1, e.g. pH
• two variables x1, x2, e.g. pH and T
• three variables x1, x2 and x3, e.g. pH, T and P
• n variables: ?
[Figure: measurements plotted against one (pH), two (pH, T) and three (pH, T, P) axes]
Data Structures: Zero order / Zero-way data
x; one sample gives one scalar (tensor of order 0)
Examples:
 - selective electrodes, pH
 - absorption at one wavelength
 - height/area of a chromatographic peak
Assumptions:
 - total selectivity
 - known linear response
Tools:
 - univariate algebra and statistics
Advantages:
 - simple and easy to understand
Disadvantages:
 - information on only one compound
 - total selectivity is required
 - one sensor for every analyte
 - low information content
[Figures: signal versus time; calibration plot of response h_i versus concentration c_i]
Data Structures: First order / One-way data
x1, x2, ....., xn; one sample gives one vector (tensor of order 1)
Examples:
 - array of sensors
 - absorption at many wavelengths (spectra)
 - chromatograms at a single wavelength
 - current intensities at many potentials E
 - readings with time (kinetics)
 - ...
Assumptions:
 - known linear responses
 - different and independent responses
Tools:
 - linear algebra
 - multivariate statistics
 - spectral analysis
 - chemometrics (PCA, MLR, PCR, PLS, ...)
Advantages:
 - calibration in the presence of interferences is possible
 - multicomponent analysis is possible
Disadvantages:
 - interferences must be present in the calibration samples
[Figures: absorbance spectrum versus wavelength (nm); chromatogram versus time (0 to 40 min)]
Data Structures: Second order / Two-way data
xij; each sample gives a data table/matrix (tensor of order 2)
$X = \sum_k x_k y_k^T$
Examples:
 - LC-DAD, LC-FTIR, GC-MS, LC-MS, FIA-DAD, CE-MS, ... (hyphenated techniques)
 - excitation/emission (fluorescence) spectroscopy
 - MS/MS, 2D NMR, GCxGC-MS, ...
 - spectroscopic/voltammetric monitoring of chemical reactions/processes with pH, time, T, etc.
Assumptions:
 - linear responses
 - sufficient rank (of the data matrices)
Tools:
 - linear algebra
 - chemometrics
Advantages:
 - calibration for the analyte is possible in the presence of interferences not modelled in the calibration samples
 - full characterization of the analyte and interferents may be possible
 - few calibration samples are needed (one-sample calibration)
Data Structures: Third order / Three-way data
xijk; each sample gives a data cube (tensor of order 3)
$X = \sum_k x_k \otimes y_k \otimes z_k$
Examples:
 - several spectroscopic matrices
 - several hyphenated chromatographic runs
 - hyphenated multidimensional chromatography (GC x GC / MS)
 - excitation/emission/time
 - ...
Assumptions:
 - bilinear/trilinear responses
 - sufficient rank (of the data matrices)
Tools:
 - multilinear and tensor algebra
 - chemometrics
Advantages:
 - unique solutions (no ambiguities)
 - calibration for the analyte is possible in the presence of interferences not modelled in the calibration samples
 - full characterization of the analyte and interferents is possible
 - few calibration samples are needed (one-sample calibration)
Methods: multi-way data analysis (PARAFAC, GRAM); extended multivariate resolution
[Figure: data matrices D_i (time × wavelength) from several runs stacked along the run number to form a cube]
0th order data: ISE, pH, ...
1st order data: spectra
2nd order data: LC/DAD, GC/MS, fluorescence
3rd order data: time/excitation/emission
Examples: Monitoring chemical reactions
Chemical reaction systems monitored using spectroscopic measurements (even at the femtosecond scale) to follow the evolution of a reaction with time, pH, temperature, etc., and to detect the formation and disappearance of intermediate and transient species.
[Figure: experimental set-up with computer, printer, autoburette, stirrer, pH meter, peristaltic pump, spectrophotometer and thermostatic bath (T = 37 °C)]
Data matrix D (NR,NC): pH × wavelength
Examples: Process analysis
Quality control and optimisation of industrial batch reactions and processes, where on-line measurements are applied to monitor the process.
[Figure: process probe connected to a spectrometer and a computer]
Data matrix D (NR,NC): time × wavelength
Examples: Chromatographic hyphenated techniques
Analytical characterisation of complex environmental, industrial and food mixtures using hyphenated techniques (chromatography and continuous-flow methods with spectroscopic detection): LC-DAD, GC-MS, LC-MS, LC-MS/MS, ...
Data matrix D (NR,NC): time × wavelength
Examples FIA-DAD-UV with pH gradient for the analysis of a mixture of drugs. D (NR,NC) pH wavelength
Examples: Excitation-emission (fluorescence) EEM techniques
Analytical characterisation of complex sea-water samples by means of excitation-emission spectra, for an unknown containing triphenyltin (in the reaction with flavonol).
Data matrix D (NR,NC): excitation × emission
Examples: Conformation changes
Protein folding and dynamic protein-nucleic acid interaction processes. In the post-genomic era, understanding these complex evolving biochemical processes is one of the main challenges of current proteomics research.
[Figure: primary structure (amino acids: Val, Leu, Ser, Ala, Asp, Ala, Trp, Gly, Val, His); secondary structure (helix and sheet formation: α-helix, β-sheet, turn, random coil); tertiary structure (globule formation); quaternary structure (assembled subunits)]
Data matrix D (NR,NC): temperature × wavelength
Examples: Spectroscopic image analysis
Image analysis of spatially distributed chemicals on 2D surfaces, measured using coupled microscopy-spectroscopy techniques in geological samples, biological tissues or food samples.
[Figure: image of x by y pixels; total number of pixels = x × y]
Data Structures in Chemistry
Experimental data D(NR,NC) have two orders/ways/modes of measurement:
• row order (way, mode): usually a change in chemical composition (concentration order)
• column order (way, mode): usually a change in system properties, as in spectroscopy, voltammetry, ... (spectral order)
Chemical data tables (two-way data)
Data table or matrix D: I rows (spectra recorded at different times) by J columns (variables, e.g. wavelengths)
• rows: instrumental measurements (spectra, voltammograms, ...)
• columns: concentration changes with the measurement conditions (time, temperature, pH, ...)
[Figures: plot of spectra (rows); plot of elution profiles (columns)]
Chemical data modelling
• Chemical data modelling methods may be divided into:
 - Hard-modelling methods (deterministic)
 - Soft-modelling methods (data driven)
 - Hybrid hard-soft modelling methods
[Diagram, hard modelling: data + physical hard model → analytical information]
[Diagram, soft modelling: data → data-driven soft model → analytical information and physical model]
Hard-modelling
• Hard-modelling approaches for chemical (stationary, dynamic, evolving…) systems are based on an accurate physical description of the system and on solving complex systems of (differential) equations that are fitted to the experimental measurements describing the evolution and dynamics of these systems. They are deterministic models.
• Hard-modelling methods usually use non-linear least-squares regression (Marquardt algorithm) and optimisation methods to find the best values for the parameters of the model.
• Hard-modelling usually deals with univariate data. It was widely used before the advent of modern instrumentation and computers that deliver large amounts of data.
• Hard-modelling is often successful for laboratory experiments, where all the variables are under control and the physicochemical nature of the dynamic model is known and can be fully described by a known mathematical model.
Hard-modelling
• However, even at the laboratory level, there are cases where the requirements and constraints of hard-modelling are not totally fulfilled, or where no physicochemical model is known to describe the process (e.g. in chromatographic separations or in protein folding experiments).
• Data sets obtained from the study of natural and industrial evolving processes are too complex and difficult to analyse using hard-modelling methods. In these cases, either no physical model is known or it is too complex to be set up in a general way.
• Advanced hard-modelling in industrial applications has attempted to model experimental difficulties such as changes in temperature, pH, ionic strength and activity coefficients. This is a very difficult task!
Reference: P. Gans, Data Fitting in the Chemical Sciences, John Wiley and Sons, New York, 1992
Hard modelling
Non-linear model fitting:
• $C = f(k_1, k_2)$: the concentration profiles are calculated from the model parameters
• $S^T = C^+ D$: linear least-squares estimate of the spectra from D and C, LS(D, C) → S^T
• $\min_{k_1, k_2} \| (I - C C^+) D \|$: the parameters are refined to minimise the residuals
• Output: C, S and model parameters.
• The model should describe all the variation in the experimental measurements.
[Figures: data matrix D (absorbance versus wavelength), concentration profiles C (concentration versus time), spectra S^T (absorptivities versus wavelength)]
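The fitting loop on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the A → B → C first-order scheme, the rate constants, the Gaussian spectra and the restriction k1 > k2 are all hypothetical, and a coarse grid search stands in for a Marquardt-type optimiser.

```python
import numpy as np

def conc_profiles(t, k1, k2):
    """C = f(k1, k2) for an assumed A -> B -> C first-order scheme, [A]0 = 1, k1 != k2."""
    a = np.exp(-k1 * t)
    b = k1 / (k2 - k1) * (np.exp(-k1 * t) - np.exp(-k2 * t))
    c = 1.0 - a - b
    return np.column_stack([a, b, c])

def residual_norm(D, C):
    """||(I - C C+) D||: the part of D that the model profiles cannot reproduce."""
    ST = np.linalg.pinv(C) @ D          # linear least-squares estimate of the spectra
    return np.linalg.norm(D - C @ ST)

# Simulated noise-free data with known (hypothetical) rate constants
t = np.linspace(0, 10, 50)
w = np.linspace(0, 100, 40)
ST_true = np.vstack([np.exp(-((w - c0) / 15) ** 2) for c0 in (25, 50, 75)])
D = conc_profiles(t, 1.0, 0.3) @ ST_true

# Coarse grid search over k1 > k2 (assumption: the first step is the faster one)
grid = np.round(np.arange(0.1, 2.01, 0.1), 2)
best = min((residual_norm(D, conc_profiles(t, k1, k2)), k1, k2)
           for k1 in grid for k2 in grid if k1 > k2)
_, k1_fit, k2_fit = best
```

Only the rate constants are searched non-linearly; the spectra are eliminated at each trial by the linear least-squares step, which is what the residual expression on the slide expresses.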
Soft-modelling
• Soft-modelling, instead, attempts to describe these systems without postulating an a priori physical or (bio)chemical model. The goal of these methods is to explain the variation observed in the systems using minimal and softer assumptions about the data. They are data-driven models.
• Soft models usually give an improved analytical description of the analysed process.
• Soft modelling needs more data than hard-modelling. Soft-modelling methods deal with multivariate data. Their use has increased in recent years because modern analytical instrumentation and computers provide large amounts of data.
• The disadvantage of soft models is their poorer extrapolating capability (compared with hard-modelling).
Soft-modelling
• A soft model can hardly predict the behaviour of the system under conditions very different from those under which it was derived.
• Complex multivariate soft-modelling data analysis methods, such as those derived from Factor Analysis, have been introduced for the study of chemical processes/systems.
• Factor Analysis is a multivariate technique for reducing matrices of data to their lowest dimensionality by the use of orthogonal factor space and transformations that yield predictions and/or recognizable factors.
Reference: E.R. Malinowski, Factor Analysis in Chemistry, 3rd Edition, Wiley, New York, 2002
Soft modelling
Constrained ALS optimisation:
• LS(D, C) → S*: least-squares estimate of the spectra from D and the current C
• LS(D, S*) → C*: least-squares estimate of the concentrations from D and S*
• $\min \| D - C^* S^{*T} \|$
• Output: C and S.
• All absorbing contributions in and out of the process are modelled.
[Figures: data matrix D (absorbance versus wavelength), concentration profiles C (concentration versus time), spectra S^T (absorptivities versus wavelength)]
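A minimal sketch of the alternating least-squares loop may help to fix ideas. It is not the MCR-ALS implementation: negative values are simply clipped to zero as a crude stand-in for a proper non-negativity constraint, and the profiles, sizes and initial estimate are hypothetical.

```python
import numpy as np

def als(D, C0, n_iter=200):
    """Alternate the two least-squares steps, clipping negative values after each one."""
    C = C0.copy()
    for _ in range(n_iter):
        ST = np.clip(np.linalg.pinv(C) @ D, 0, None)   # LS(D, C)  -> S*
        C = np.clip(D @ np.linalg.pinv(ST), 0, None)   # LS(D, S*) -> C*
    residual = np.linalg.norm(D - C @ ST)
    return C, ST, residual

# Synthetic, noise-free two-component data (hypothetical profiles)
t = np.linspace(0, 10, 40)
w = np.linspace(0, 100, 60)
C_true = np.column_stack([np.exp(-0.4 * t), 1 - np.exp(-0.4 * t)])
ST_true = np.vstack([np.exp(-((w - 35) / 12) ** 2),
                     np.exp(-((w - 65) / 12) ** 2)])
D = C_true @ ST_true

# Rough initial estimate of C (assumption: the true profiles plus random error)
rng = np.random.default_rng(4)
C0 = np.clip(C_true + 0.1 * rng.standard_normal(C_true.shape), 0, None)

C, ST, residual = als(D, C0)
```

A good fit does not mean that C and ST equal the true profiles: other pairs that satisfy the constraints can reproduce D equally well.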
Lecture 1 • Introduction to data structures and soft-modelling methods. • Factor Analysis of two-way data: Bilinear models. • Rotation and intensity ambiguities. • Pseudo-rank, local rank and rank deficiency. • Evolving Factor Analysis.
Soft-modelling: Factor Analysis (Bilinear Model)
Experimental data are modelled as a linear sum of factors (loadings) weighted by scores.
In matrix form: data = scores × loadings
Soft-modelling: BILINEARITY
D = A + B + C + ... + E
Assumption: bilinearity (the contributions of the components in the two orders of measurement are additive)
[Figure: data matrix D drawn as the sum of the component matrices A, B, C, ... and the residual matrix E]
Soft-modelling: GOALS OF THE BILINEAR MODEL
Recovery of the responses of every component (chemical species) in the different modes of measurement
[Figure: resolved component profiles in each of the two modes of measurement]
Soft-modelling: Factor Analysis
[Flow chart with the boxes: Experimental Data Matrix, Principal Components, Target testing, Cluster Analysis, Factor Identification, Real Factor Models, Predictions]
Soft-modelling: Factor Analysis (traditional approach)
[Flow chart with the boxes: Data matrix, Covariance matrix, Abstract Factors, New Abstract Factors, Real Factors; steps labelled: matrix multiplication, decomposition, combination, abstract reproduction, target transformation, abstract rotation]
Soft-modelling methods (I) • Factor Analysis methods based on the use of latent variables or eigenvalue/singular value data matrix decompositions. Examples • PCA, SVD, rotation FA methods • Evolving Factor Analysis methods • Rank Annihilation methods • Window Factor Analysis methods • Heuristic Evolving Latent Projections methods • Subwindow Factor Analysis methods • …..
Soft-modelling methods (II) • Multivariate Resolution methods decompose a data matrix into its ‘pure’ components without explicitly using latent-variable analysis techniques. Examples: • SIMPLISMA • Orthogonal Projection Approach (OPA) • Positive Matrix Factorization methods (and Multilinear Engine extensions) • Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) • Gentle • .....
Soft-modelling methods (III) • Three-way and multiway methods, which decompose three-way or multiway data structures. Examples: • Multiway and multiset extensions of PCA • Generalized Rank Annihilation Method (GRAM); Direct Trilinear Decomposition (DTD, TLD) • Multiway and multiset extensions of MCR-ALS methods • PARAFAC-ALS • Tucker3-ALS • .......
Soft-modelling
• E.R. Malinowski, Factor Analysis in Chemistry, 3rd Ed., John Wiley & Sons, New York, 2002
• I.T. Jolliffe, Principal Component Analysis, 2nd Ed., Springer, Berlin, 2002
• A. Smilde, R. Bro and P. Geladi, Multiway Analysis, Applications in the Chemical Sciences, John Wiley & Sons, New York, 2004
• P. Geladi, Multivariate Image Analysis, John Wiley and Sons, 1996
• A. de Juan, E. Casassas and R. Tauler, Soft Modeling of Analytical Data, in Encyclopedia of Analytical Chemistry: Instrumentation and Applications, edited by R.A. Meyers, John Wiley & Sons, 2000, Vol. 11, 9800-9837
Soft-modelling: data structures and types of models
Indices: i = 1,...,I samples; j = 1,...,J variables; k = 1,...,K conditions
• One-way data (vectors): linear and non-linear models
 - $d_i = b_0 + b\,c_i$
 - $d_i = f_{\text{non-linear}}(c_i)$
• Two-way data (matrices): bilinear and non-bilinear models
 - non-bilinear data can still be linear in one of the two modes
• Three-way data (cubes): trilinear and non-trilinear models
 - non-trilinear data can still be bilinear in two modes
[Figure: vector d_i (I samples) and matrix D with elements d_ij (I samples × J variables)]
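As a small worked example of the linear one-way model d_i = b0 + b c_i, the sketch below fits a univariate calibration line and inverts it for an unknown. The concentrations, the response values and the unknown reading are hypothetical.

```python
import numpy as np

# Hypothetical calibration standards and noise-free responses (b0 = 0.05, b = 0.8)
c = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
d = 0.05 + 0.8 * c

# Least-squares fit of d = b0 + b*c; polyfit returns the highest power first
b, b0 = np.polyfit(c, d, 1)

# Invert the calibration line for a hypothetical unknown reading
d_unknown = 2.0
c_unknown = (d_unknown - b0) / b
```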
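Soft-modelling: Bilinear models for two-way data
$d_{ij} = \sum_{n=1}^{N} c_{in} s_{nj} + e_{ij}$
• $d_{ij}$ is the measurement (response) of variable j in sample i
• n = 1,...,N indexes the components (species, sources, ...); N is the number of components
• $c_{in}$ is the concentration of component n in sample i
• $s_{nj}$ is the response of component n at variable j
[Figure: data matrix D with I rows (samples) and J columns (variables)]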
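The bilinear model can be checked numerically by building a small data matrix from assumed profiles. In this sketch the sizes, the exponential concentration profiles, the Gaussian spectra and the noise level are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

I, J, N = 50, 80, 2                  # samples, variables, components (illustrative)
t = np.linspace(0, 10, I)
w = np.linspace(0, 100, J)

# Hypothetical concentration profiles (columns of C) and responses (rows of S^T)
C = np.column_stack([np.exp(-0.3 * t), 1 - np.exp(-0.3 * t)])
ST = np.vstack([np.exp(-((w - 30) / 10) ** 2),
                np.exp(-((w - 60) / 15) ** 2)])

E = 1e-3 * rng.standard_normal((I, J))   # residual (noise) matrix
D = C @ ST + E                           # matrix form of the model

# Element-wise form of the same model for one entry: d_ij = sum_n c_in * s_nj + e_ij
i, j = 7, 42
d_ij = sum(C[i, n] * ST[n, j] for n in range(N)) + E[i, j]
```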
Soft-modelling: Bilinear models for two-way data
D (I × J) = [U or C] (I × N) × [V^T or S^T] (N × J) + E (I × J), with N << I or J
PCA: $D = U V^T + E$
• U orthogonal, V^T orthonormal
• V^T in the direction of maximum variance
• Unique solutions but without physical meaning
• Useful for interpretation but not for resolution!
MCR: $D = C S^T + E$
• Other constraints (non-negativity, unimodality, local rank, …)
• U = C and V^T = S^T non-negative, ...
• C or S^T normalization
• Non-unique solutions but with physical meaning
• Useful for resolution (and obviously for interpretation)!
PCA Model (Principal Component Analysis): $X = U V^T + E$
• U: ‘scores’ matrix (orthogonal)
• V^T: ‘loadings’ matrix (orthonormal)
SVD Model (Singular Value Decomposition): $D = U^* S V^T + E$
• U*: ‘scores’ matrix (orthonormal)
• S: diagonal matrix of the singular values s, with $s = \lambda^{1/2}$, where λ are the eigenvalues of the covariance matrix $D D^T$
• V^T: ‘loadings’ matrix (orthonormal)
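The relation between the two decompositions can be verified with NumPy. The sketch uses a random placeholder matrix, keeps all components (so E = 0) and applies no centering; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.random((20, 6))              # placeholder data matrix

# Economy-size SVD: D = U* diag(s) V^T, with orthonormal U* and V^T
U_star, s, VT = np.linalg.svd(D, full_matrices=False)

# PCA-style scores absorb the singular values: U = U* diag(s)
U = U_star * s

# Singular values are the square roots of the largest eigenvalues of D D^T
eigvals = np.linalg.eigvalsh(D @ D.T)[::-1][: len(s)]
```

With this convention the columns of U are orthogonal but not normalised, since U^T U = diag(s^2), while the rows of V^T are orthonormal.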
PCA Model: $D = U V^T + E$
• U: scores
• V^T: loadings (projections)
• E: unexplained variance
$D = u_1 v_1^T + u_2 v_2^T + \dots + u_n v_n^T + E$
• n: number of components (<< number of variables in D)
• each term $u_k v_k^T$ is a rank-1 matrix
PCA Model
• $X = U V^T + E$
• X = structure + noise
• It is an approximation to the experimental data matrix X
• Loadings, projections: V^T gives the relationships between the original variables and the principal components (eigenvectors of the covariance matrix). The vectors in V^T (loadings) are orthonormal (orthogonal and normalized).
• Scores, targets: U gives the relationships between the samples (coordinates of the samples or objects in the space defined by the principal components). The vectors in U (scores) are orthogonal.
• Noise: E contains the experimental error and the non-explained variance.
Summary of Principal Component Analysis (PCA)
1. Formulation of the problem to solve
2. Plot of the original data
3. Data pretreatment (data centering, autoscaling, logarithmic transformation, …)
4. Build the PCA model. Determination of the number of components. Graphical inspection of the explained/residual variance plots
5. Study of the PCA model. Multivariate data exploration
 - ‘loadings’ plot ==> map of the variables
 - ‘scores’ plot ==> map of the samples
6. Interpretation of the PCA model. Identification of the main sources of data variance
7. Analysis of the residuals matrix $E = D - U V^T$
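Steps 3, 4 and 7 of this summary can be sketched with a plain SVD. The `pca` helper, the simulated data (22 sites, 10 variables, 2 latent sources) and the choice of column centering as the only pretreatment are assumptions made for illustration.

```python
import numpy as np

def pca(D, n_components):
    """Column-centre D; return scores U, loadings VT, residuals E, explained-variance fractions."""
    Dc = D - D.mean(axis=0)                       # step 3: data centering
    U_star, s, VT = np.linalg.svd(Dc, full_matrices=False)
    U = U_star[:, :n_components] * s[:n_components]
    VT = VT[:n_components]
    E = Dc - U @ VT                               # step 7: residuals of the centred data
    explained = s ** 2 / np.sum(s ** 2)           # step 4: variance per component
    return U, VT, E, explained

rng = np.random.default_rng(2)
# Hypothetical data: 22 samples x 10 variables driven by 2 latent sources plus noise
D = rng.random((22, 2)) @ rng.random((2, 10)) + 0.01 * rng.standard_normal((22, 10))

U, VT, E, explained = pca(D, n_components=2)
```

Plotting the columns of U against each other gives the scores plot (map of the samples) and plotting the rows of VT gives the loadings plot (map of the variables).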
PCA example: U scores, V^T loadings
Data set D: pollutant concentrations ([org]1, [org]2, [org]3, ..., [org]96) measured at sampling sites (Site 1, Site 2, Site 3, ..., Site 22)
[Figures: scores plot, loadings plot and biplot; axes PC1 (41%) and PC2 (27%)]
Multivariate Curve Resolution (MCR)
Mixed information: data matrix D (retention times tR × wavelengths)
Pure component information: C (columns c_1, ..., c_n) and S^T (rows s_1, ..., s_n)
• S^T, pure signals: compound identity, source identification and interpretation
• C, pure concentration profiles: chemical model, process evolution, compound contribution, relative quantitation
Lecture 1 • Introduction to data structures and soft-modelling methods. • Factor Analysis of two-way data: Bilinear models. • Rotation and intensity ambiguities. • Pseudo-rank, local rank and rank deficiency. • Evolving Factor Analysis.
Factor Analysis: ambiguities in the analysis of a data matrix (two-way data)
Rotation and scale/intensity ambiguities
Rotation ambiguities:
• Factor Analysis (PCA) data matrix decomposition: $D = U V^T + E$
• ‘True’ data matrix decomposition: $D = C S^T + E$
Factor Analysis: ambiguities in the analysis of a data matrix (two-way data)
Rotation and scale/intensity ambiguities
Rotation ambiguities:
$D = U T T^{-1} V^T + E = C S^T + E$
$C = U T$; $S^T = T^{-1} V^T$
How to find the rotation matrix T?
Rotation and scale/intensity ambiguities
$D = C S^T + E = D^* + E$
$C_{new} = C\,T$, with dimensions (NR,N) = (NR,N)(N,N)
$S^T_{new} = T^{-1} S^T$, with dimensions (N,NC) = (N,N)(N,NC)
$D^* = C S^T = C_{new} S^T_{new}$
Matrix decomposition is not unique! T(N,N) is any non-singular matrix: there is rotational freedom for any T.
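A quick numerical check of the rotation ambiguity: any non-singular T gives a different pair of factors with exactly the same product. The matrix sizes, the random factors and the particular T are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
NR, NC, N = 30, 40, 2
C = rng.random((NR, N))
ST = rng.random((N, NC))
D_star = C @ ST

# Any non-singular N x N matrix will do; this one is arbitrary
T = np.array([[1.0, 0.3],
              [0.2, 1.0]])

C_new = C @ T                     # (NR,N) = (NR,N)(N,N)
ST_new = np.linalg.inv(T) @ ST    # (N,NC) = (N,N)(N,NC)
```

`C_new @ ST_new` reproduces `D_star` although `C_new` differs from `C`. The rotated factors may contain negative entries, which is why constraints such as non-negativity reduce the set of admissible T.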
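Rotation and scale/intensity ambiguities
Rotation ambiguities and rotation matrix T(N,N), written out for N = 2:
$[\,c_{new,1},\; c_{new,2}\,] = [\,c_{old,1},\; c_{old,2}\,]\, T, \qquad T = \begin{pmatrix} t_{1,1} & t_{1,2} \\ t_{2,1} & t_{2,2} \end{pmatrix}$
$\begin{pmatrix} s^T_{new,1} \\ s^T_{new,2} \end{pmatrix} = T^{-1} \begin{pmatrix} s^T_{old,1} \\ s^T_{old,2} \end{pmatrix}, \qquad T^{-1} = \begin{pmatrix} t^{-1}_{1,1} & t^{-1}_{1,2} \\ t^{-1}_{2,1} & t^{-1}_{2,2} \end{pmatrix}$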
Rotation and scale/intensity ambiguities
Intensity (scale) ambiguities: for any non-zero scalar k,
$d_{ij} = \sum_n c_{in} s_{nj} = \sum_n (k\,c_{in}) \left( \tfrac{1}{k}\, s_{nj} \right)$
Intensity/scale ambiguities make it difficult to obtain quantitative information. Once they are solved, quantitative information can also be obtained.
Rotation and scale/intensity ambiguities
Intensity (scale) ambiguities:
$c_{new} = k\,c_{old}$; $s_{new} = \tfrac{1}{k}\, s_{old}$
$c_{old}\, s_{old} = (k\,c_{old}) \left( \tfrac{1}{k}\, s_{old} \right) = c_{new}\, s_{new}$
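The same check for the intensity ambiguity, using one component. The profile values and k are hypothetical; fixing the scale by normalising the spectrum to unit length is one common convention, used here as an assumption.

```python
import numpy as np

c_old = np.array([0.0, 0.2, 0.5, 1.0])   # hypothetical concentration profile
s_old = np.array([0.1, 0.8, 0.3])        # hypothetical spectrum
k = 2.5

c_new = k * c_old
s_new = s_old / k

# Both pairs give the same rank-1 contribution to D
D_old = np.outer(c_old, s_old)
D_new = np.outer(c_new, s_new)

# Fixing the scale: normalise the spectrum to unit length and rescale the profile
s_unit = s_old / np.linalg.norm(s_old)
c_unit = c_old * np.linalg.norm(s_old)
```

After normalisation the scale of the profile is fixed, but it is still in arbitrary units; absolute quantitation needs external information such as a sample of known concentration.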