1. Introduction to multiway analysis Quimiometria Teórica e Aplicada Instituto de Química - UNICAMP
Why build models of chemical data?
• Data exploration
  • e.g. find important sources of variation in complex environmental samples
• Compound identification and calibration in mixtures
  • e.g. identification and quantification of pollutants in river water
• Statistical process control
  • e.g. detect disturbances in product quality
• Models are useful approximations of reality
  • first-principles models are based on chemical/physical knowledge: do they fit well with the measured data?
  • empirical models (e.g. PCA, PLS) are purely mathematical: do they have a chemical meaning?
Multiway data
• Multiway data is becoming more common in chemistry. Examples are:
  • Chromatography
    • sample number × elution time × wavelength
  • On-line analysis
    • experiment number × time × wavelength/temperature/pressure
  • Tandem mass spectrometry (MS-MS)
    • sample number × parent ion mass × daughter ion mass
  • Image analysis
    • experiment number × time × x-position × y-position
Multiway data – an example
• Batch process data:
  • One batch: X (J × K), time × process variable
  • A series of batches: X (I × J × K), batch × time × process variable
Multiway modelling
• The PARAFAC (or ‘CANDECOMP’) and Tucker models were developed by psychometricians 30 years ago, but are especially useful in chemistry, because chemical data often has a multilinear structure.
• PARAFAC and Tucker are different generalizations of PCA for higher-order data.
• There also exist generalizations of PLS for higher-order data, e.g. N-PLS.
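Two-way modelling
• Two-way data X (time × process variable) can be modelled using bilinear models:
  (1) $\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathsf{T}} + \mathbf{E}$ (PCA)
  (2) $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^{\mathsf{T}} + \mathbf{E}$ (SVD)
  (3) $\mathbf{X} = \mathbf{A}\mathbf{G}\mathbf{B}^{\mathsf{T}} + \mathbf{E}$ (TMCA)
• These models give the same residuals, $\mathbf{E}$.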
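Multiway models - PARAFAC
• Multiway data can be modelled using multilinear models, such as the PARAFAC model...
[Figure: the three-way array X (batch × time × process variable) is written in terms of the loading matrices $\mathbf{A}$, $\mathbf{B}^{\mathsf{T}}$ and $\mathbf{C}^{\mathsf{T}}$, plus a residual array E]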
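Written out element by element, the PARAFAC model for a three-way array takes the form below. The slide shows the model only as a diagram, so the lowercase element notation and the number of components $R$ are assumptions introduced here for illustration.

```latex
x_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr} + e_{ijk},
\qquad i = 1,\dots,I,\quad j = 1,\dots,J,\quad k = 1,\dots,K
```

Here $a_{ir}$, $b_{jr}$ and $c_{kr}$ are elements of the loading matrices A (batch mode), B (time mode) and C (process variable mode), and $e_{ijk}$ is an element of the residual array E.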
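Multiway models - Tucker
• ...or the Tucker model:
[Figure: the three-way array X (batch × time × process variable) is written in terms of the loading matrices $\mathbf{A}$, $\mathbf{B}^{\mathsf{T}}$ and $\mathbf{C}^{\mathsf{T}}$ and a core array G, plus a residual array E]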
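In element form, the Tucker model can be sketched as follows. The index names and the numbers of components $P$, $Q$ and $R$ are notation assumed here, not taken from the slide.

```latex
x_{ijk} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R}
          g_{pqr}\, a_{ip}\, b_{jq}\, c_{kr} + e_{ijk}
```

The core array element $g_{pqr}$ weights the interaction between component $p$ of the batch mode, component $q$ of the time mode and component $r$ of the process variable mode, and each mode may have a different number of components. PARAFAC can be seen as the special case in which $P = Q = R$ and the core array is superdiagonal.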
Unfolding
• Another option is to matricize (or ‘unfold’) the data and use standard two-way methods:
[Figure: the slabs X1 ... XI of the three-way array X (I × J × K) are rearranged into the two-way matrix X (I × JK)]
• Can also unfold along other modes: X (J × KI) and X (K × IJ)
• But if a multiway structure exists in the data, multiway methods have some important advantages!
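As a rough illustration of unfolding, the sketch below matricizes a small three-way array stored as nested Python lists. The function name `unfold` and the column ordering are assumptions made for this example; toolboxes may order the columns differently, which changes the layout of the matrix but not the information it holds.

```python
def unfold(X, mode):
    """Matricize a three-way nested list X[i][j][k] along one mode.

    mode 0 -> I x JK, mode 1 -> J x KI, mode 2 -> K x IJ.
    The column ordering is an assumed convention for this sketch.
    """
    I, J, K = len(X), len(X[0]), len(X[0][0])
    if mode == 0:
        return [[X[i][j][k] for j in range(J) for k in range(K)] for i in range(I)]
    if mode == 1:
        return [[X[i][j][k] for k in range(K) for i in range(I)] for j in range(J)]
    if mode == 2:
        return [[X[i][j][k] for i in range(I) for j in range(J)] for k in range(K)]
    raise ValueError("mode must be 0, 1 or 2")


# Small 2 x 3 x 4 array whose entries encode their own indices.
X = [[[100 * i + 10 * j + k for k in range(4)] for j in range(3)] for i in range(2)]
for mode in range(3):
    M = unfold(X, mode)
    print(mode, len(M), "x", len(M[0]))
```

Standard two-way methods such as PCA can then be applied to any of the three matrices.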
Advantages of multiway
• Multiway models need fewer model parameters to describe the data, e.g. a three-component model of X (30 × 800 × 200) uses
  • 480090 parameters for unfold-PCA: 3 × (30 + 800 × 200)
  • 3090 parameters for PARAFAC: 3 × (30 + 800 + 200)
• PARAFAC is more parsimonious than unfold-PCA.
• Multiway models use one set of loadings for each mode, so results are much easier to plot and understand.
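The parameter counts can be checked with a few lines of code. This sketch assumes that an unfold-PCA model with F components is counted as F × (I + J·K) scores and loadings, and a PARAFAC model as F × (I + J + K) loadings. The function names are hypothetical.

```python
def unfold_pca_params(shape, n_components):
    """Scores (I per component) plus loadings (J*K per component)."""
    I, J, K = shape
    return n_components * (I + J * K)


def parafac_params(shape, n_components):
    """One loading vector per mode and component: I + J + K per component."""
    I, J, K = shape
    return n_components * (I + J + K)


shape = (30, 800, 200)
print("unfold-PCA:", unfold_pca_params(shape, 3))
print("PARAFAC:   ", parafac_params(shape, 3))
```

The gap widens as the second and third modes grow, because the unfolded loadings scale with the product J·K while the PARAFAC loadings scale with the sum J + K.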
Disadvantages of multiway
• PARAFAC and Tucker models are usually calculated using a technique called ‘alternating least squares’ (ALS).
  • This is sometimes slow...
  • ...and sometimes gives convergence problems if an inappropriate model is used.
• However, ALS algorithms are easy to understand and there is now some high-quality, free MATLAB code available on the internet:
  • The N-way Toolbox (Andersson & Bro, http://www.models.kvl.dk)
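To give a feel for how ALS works, here is a minimal sketch for the simplest case, a one-component PARAFAC model fitted to a small three-way array. It is not the algorithm used in the N-way Toolbox. The function names, the all-ones starting values and the fixed number of iterations are assumptions made for illustration, and a practical implementation would add a convergence check and a better initialisation.

```python
def rank_one_als(X, n_iter=20):
    """Fit x[i][j][k] ~ a[i] * b[j] * c[k] by alternating least squares.

    Each update is the least-squares solution for one mode while the
    other two are held fixed. Assumes the loadings never become all
    zero (otherwise the divisions below would fail).
    """
    I, J, K = len(X), len(X[0]), len(X[0][0])
    a, b, c = [1.0] * I, [1.0] * J, [1.0] * K  # assumed starting values
    for _ in range(n_iter):
        bb, cc = sum(v * v for v in b), sum(v * v for v in c)
        a = [sum(X[i][j][k] * b[j] * c[k] for j in range(J) for k in range(K)) / (bb * cc)
             for i in range(I)]
        aa = sum(v * v for v in a)
        b = [sum(X[i][j][k] * a[i] * c[k] for i in range(I) for k in range(K)) / (aa * cc)
             for j in range(J)]
        bb = sum(v * v for v in b)
        c = [sum(X[i][j][k] * a[i] * b[j] for i in range(I) for j in range(J)) / (aa * bb)
             for k in range(K)]
    return a, b, c


def sse(X, a, b, c):
    """Sum of squared residuals between X and the one-component model."""
    return sum((X[i][j][k] - a[i] * b[j] * c[k]) ** 2
               for i in range(len(a)) for j in range(len(b)) for k in range(len(c)))


# Noise-free test array built from known loadings (made-up numbers).
a0, b0, c0 = [1.0, 2.0], [1.0, 0.5, 2.0], [3.0, 1.0]
X = [[[ai * bj * ck for ck in c0] for bj in b0] for ai in a0]
a, b, c = rank_one_als(X)
print("residual sum of squares:", sse(X, a, b, c))
```

The individual loadings are only determined up to scaling (a factor can be moved from one mode to another), so the fit is judged on the residuals rather than on the loadings themselves.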
Conclusions
• PARAFAC and Tucker are both generalizations of the PCA model for multiway data.
• PARAFAC and Tucker models use fewer parameters and are easier to interpret than unfold-PCA.
• Models can be calculated in MATLAB using the ‘N-way Toolbox’ (or ‘PLS_Toolbox’).