2. The PARAFAC model. Quimiometria Teórica e Aplicada, Instituto de Química - UNICAMP.
Example: fluorescence data (1)
• Each fluorescence spectrum is a matrix of emission vs. excitation wavelengths: $\mathbf{X}_i$ ($201 \times 61$).
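Example: fluorescence data (2)
• Each spectrum is a linear sum of three components: tryptophan, phenylalanine and tyrosine.
$$\mathbf{X}_i = a_{i1}\mathbf{b}_1\mathbf{c}_1^T + a_{i2}\mathbf{b}_2\mathbf{c}_2^T + a_{i3}\mathbf{b}_3\mathbf{c}_3^T + \mathbf{E}_i$$
[Figure: $\mathbf{X}_i$ drawn as a sum of three outer products plus $\mathbf{E}_i$. For the tryptophan term, $a_{i1}$ is the concentration of tryptophan in sample $i$, $\mathbf{b}_1$ is the emission spectrum of pure tryptophan and $\mathbf{c}_1^T$ is the excitation spectrum of pure tryptophan.]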
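Example: fluorescence data (3)
• Five samples were measured and stacked to give a three-way array: $\mathbf{X}$ ($5 \times 201 \times 61$).
[Figure: the five matrices $\mathbf{X}_1, \dots, \mathbf{X}_5$ stacked into $\mathbf{X}$ (5 samples × 201 emission λ's × 61 excitation λ's) and written as the sum of three triads $\mathbf{a}_r$, $\mathbf{b}_r^T$, $\mathbf{c}_r^T$ ($r = 1, 2, 3$) plus $\mathbf{E}$; $\mathbf{a}_1$ holds the concentration of tryptophan in each sample.]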
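The stacking described on this slide can be checked numerically. The following is a minimal NumPy sketch, assuming random placeholder loadings `A`, `B`, `C` (not the measured fluorescence data): it builds a three-way array from three components and confirms that each sample slice is a sum of three outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 5, 201, 61, 3          # samples, emission, excitation, components

# Placeholder loadings (assumption: random numbers stand in for real profiles)
A = rng.random((I, R))              # relative concentrations
B = rng.random((J, R))              # emission profiles
C = rng.random((K, R))              # excitation profiles

# Three-way array: X[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
X = np.einsum("ir,jr,kr->ijk", A, B, C)

# Slice for sample i, written as a sum of three outer products b_r c_r^T
i = 0
X_i = sum(A[i, r] * np.outer(B[:, r], C[:, r]) for r in range(R))

print(X.shape, np.allclose(X[i], X_i))
```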
Example: fluorescence data (4)
• If we are given a set of fluorescence spectra, $\mathbf{X}$, how can we determine:
  • How many chemical species are present?
  • Which chemical species are present? What are their pure excitation and emission spectra?
    • i.e. self-modelling curve resolution (SMCR)
  • What is the concentration of each species in each sample?
    • i.e. (second-order) calibration
• Answer: use the PARAFAC model!
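The PARAFAC model (1)
[Figure: the three-way array $\mathbf{X}$ ($I \times J \times K$) written as a sum of $R$ triads $\mathbf{a}_r$, $\mathbf{b}_r^T$, $\mathbf{c}_r^T$ ($r = 1, \dots, R$) plus residuals $\mathbf{E}$, and the same model drawn compactly with the loading matrices $\mathbf{A}$, $\mathbf{B}^T$ and $\mathbf{C}^T$.]
$$x_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr} + e_{ijk}$$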
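The PARAFAC model (2)
[Figure: $\mathbf{X}$ ($I \times J \times K$) drawn as the loading matrices $\mathbf{A}$, $\mathbf{B}^T$ and $\mathbf{C}^T$ plus $\mathbf{E}$.]
• Loadings
  • $\mathbf{A}$ ($I \times R$) describes variation in the first mode.
  • $\mathbf{B}$ ($J \times R$) describes variation in the second mode.
  • $\mathbf{C}$ ($K \times R$) describes variation in the third mode.
• Residuals
  • $\mathbf{E}$ ($I \times J \times K$) are the model residuals.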
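Example: fluorescence data (5)
[Figure: $\mathbf{X}$ (5 samples × 201 emission λ's × 61 excitation λ's) drawn as the loading matrices $\mathbf{A}$, $\mathbf{B}^T$ and $\mathbf{C}^T$ plus $\mathbf{E}$.]
• Loadings
  • $\mathbf{A}$ ($5 \times 3$) describes the component concentrations.
  • $\mathbf{B}$ ($201 \times 3$) describes the pure component emission spectra.
  • $\mathbf{C}$ ($61 \times 3$) describes the pure component excitation spectra.
• Residuals
  • $\mathbf{E}$ ($5 \times 201 \times 61$) describes instrument noise.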
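Example: fluorescence data (6)
• A 3-component PARAFAC model describes 99.94% of the variation in $\mathbf{X}$.
[Figure: the loadings $\mathbf{B}$ ($201 \times 3$) and $\mathbf{C}$ ($61 \times 3$), each plotted as three curves labelled tryptophan, tyrosine and phenylalanine.]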
Example: fluorescence data (7)
• The A-loadings describe the relative amounts of species 1 (tryptophan), 2 (tyrosine) and 3 (phenylalanine) in each sample.
• In order to know the absolute amounts, it is necessary to use a standard of known concentrations, i.e. sample 5.
[Table: $\mathbf{A}$ ($5 \times 3$) shown alongside Concentrations (ppm). The row and column layout was lost in extraction; the values in extracted order are: -0.0853 -1.8151 2.7867 -0.0135 -0.0042 13.172 0.2714 0.0147 2.0803 0.0006 0.1484 785.09 0.0492 0.0234 1.8358 5.3045 341.68 1.6140 0.8378 0.7990 0.8790 4.4000 297.00 0.9179 0.6949 0.6945 2.6685 0.0141 0.0471 1.5455]
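The scaling step mentioned on this slide can be sketched as follows. All numbers below are hypothetical and chosen only for illustration (they are not the values from the slide's table): the relative loadings of every sample are multiplied, per species, by the ratio between the known concentration in the standard and the standard's loading.

```python
import numpy as np

# Hypothetical relative loadings (rows: samples, columns: species)
A = np.array([[2.0, 0.0, 0.1],
              [0.0, 1.5, 0.0],
              [1.0, 0.5, 0.5]])

std = 2                                  # index of the standard sample (assumption)
c_std = np.array([1.0, 4.0, 300.0])      # hypothetical known concentrations (ppm)

# One scale factor per species, taken from the standard
scale = c_std / A[std]

# Absolute concentration estimates for all samples
conc = A * scale
print(conc)
```

A single standard is enough here because each PARAFAC component has one unknown scale factor per species; this sketch assumes the standard contains every species in a non-zero amount.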
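The PARAFAC formula
$$\mathbf{X}_{I \times JK} = \mathbf{A}(\mathbf{C} \odot \mathbf{B})^T + \mathbf{E}_{I \times JK}$$
where $\odot$ is the Khatri-Rao matrix product.
• Data array
  • $\mathbf{X}$ ($I \times J \times K$) is matricized into $\mathbf{X}_{I \times JK}$ ($I \times JK$).
• Loadings
  • $\mathbf{A}$ ($I \times R$) describes variation in the first mode.
  • $\mathbf{B}$ ($J \times R$) describes variation in the second mode.
  • $\mathbf{C}$ ($K \times R$) describes variation in the third mode.
• Residuals
  • $\mathbf{E}$ ($I \times J \times K$) is matricized into $\mathbf{E}_{I \times JK}$ ($I \times JK$).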
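To make the matricized formula concrete, here is a small NumPy sketch. It assumes the Khatri-Rao product is the column-wise Kronecker product and that the unfolded columns run through the second mode fastest; the helper name `khatri_rao` and the random loadings are choices made for this example only.

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product: column r equals kron(C[:, r], B[:, r])."""
    K, R = C.shape
    J, R_b = B.shape
    assert R == R_b, "both matrices need the same number of columns"
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, R)

rng = np.random.default_rng(1)
I, J, K, R = 4, 6, 5, 3
A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))

# Noise-free trilinear array (E = 0)
X = np.einsum("ir,jr,kr->ijk", A, B, C)

# Matricize X (I x J x K) into I x JK; column index is k * J + j
X_unf = X.transpose(0, 2, 1).reshape(I, K * J)

# Model side of the formula: A (C kr B)^T
X_model = A @ khatri_rao(C, B).T
print(np.allclose(X_unf, X_model))
```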
PCA vs PARAFAC
• PCA
  • Bilinear model: $\mathbf{X} = \mathbf{A}\mathbf{B}^T + \mathbf{E}$
  • Components are calculated sequentially in order of importance.
  • Orthogonal, i.e. $\mathbf{B}^T\mathbf{B} = \mathbf{I}$.
  • Solution has rotational freedom.
• PARAFAC
  • Trilinear model: $\mathbf{X}_{I \times JK} = \mathbf{A}(\mathbf{C} \odot \mathbf{B})^T + \mathbf{E}_{I \times JK}$
  • Components are calculated simultaneously in random order.
  • Not (usually) orthogonal.
  • Solution is unique (i.e. not possible to rotate factors without losing fit).
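Rotational freedom
• The bilinear model $\mathbf{X} = \mathbf{A}\mathbf{B}^T + \mathbf{E}$ contains rotational freedom. There are many sets of loadings (and scores) which give exactly the same residuals, $\mathbf{E}$:
$$\mathbf{X} = \mathbf{A}\mathbf{B}^T + \mathbf{E} = \mathbf{A}\mathbf{R}\mathbf{R}^{-1}\mathbf{B}^T + \mathbf{E} = \mathbf{A}^*\mathbf{B}^{*T} + \mathbf{E}, \qquad \mathbf{A}^* = \mathbf{A}\mathbf{R}, \quad \mathbf{B}^{*T} = \mathbf{R}^{-1}\mathbf{B}^T$$
• This model is not unique: there are many different sets of loadings which give the same % fit.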
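A quick numerical check of this argument, as a sketch with random placeholder matrices: any invertible `Rm` (here made invertible by adding a dominant diagonal, an arbitrary choice for the example) yields different loadings with an identical fitted part, hence identical residuals.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((5, 3))
B = rng.random((8, 3))

# Arbitrary invertible 3 x 3 matrix (strictly diagonally dominant)
Rm = rng.random((3, 3)) + 3 * np.eye(3)

A_star = A @ Rm                      # A* = A R
B_star = B @ np.linalg.inv(Rm).T     # B*^T = R^{-1} B^T

same_fit = np.allclose(A @ B.T, A_star @ B_star.T)
same_loadings = np.allclose(A, A_star)
print(same_fit, same_loadings)
```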
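PARAFAC solution is unique
• The trilinear model $\mathbf{X}_{I \times JK} = \mathbf{A}(\mathbf{C} \odot \mathbf{B})^T + \mathbf{E}_{I \times JK}$ is said to be unique, because it is not possible to rotate the loadings without changing the residuals, $\mathbf{E}$:
$$\mathbf{X} = \mathbf{A}(\mathbf{C} \odot \mathbf{B})^T + \mathbf{E} = \mathbf{A}\mathbf{R}\mathbf{R}^{-1}(\mathbf{C} \odot \mathbf{B})^T + \mathbf{E} = \mathbf{A}^*(\mathbf{C}^* \odot \mathbf{B}^*)^T + \mathbf{E}^*, \qquad \mathbf{E}^* \neq \mathbf{E} \text{ in general}$$
• The reason is that $\mathbf{R}^{-1}(\mathbf{C} \odot \mathbf{B})^T$ is in general no longer the transpose of a Khatri-Rao product (apart from trivial rescaling and reordering of the components), so rotated loadings $\mathbf{A}^* = \mathbf{A}\mathbf{R}$ can only be combined with some $\mathbf{B}^*$, $\mathbf{C}^*$ at the cost of different residuals.
• This is why PARAFAC is able to find the correct fluorescence profiles: the unique solution is close to the true solution.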
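One way to see why a rotation breaks the trilinear structure is sketched below with random placeholder loadings. Each column of a Khatri-Rao product, reshaped to a K × J matrix, is a rank-one outer product; after mixing the columns with an invertible matrix `Rm` (an arbitrary choice for this example), the reshaped column generally has rank greater than one, so it cannot be a column of any Khatri-Rao product.

```python
import numpy as np

rng = np.random.default_rng(3)
J, K, R = 6, 5, 3
B = rng.random((J, R))
C = rng.random((K, R))

# Khatri-Rao product of C and B (column-wise Kronecker product), shape JK x R
Z = (C[:, None, :] * B[None, :, :]).reshape(K * J, R)

# Rotate: (R^{-1} Z^T)^T = Z R^{-T}
Rm = rng.random((R, R)) + 3 * np.eye(R)
Z_rot = Z @ np.linalg.inv(Rm).T

# A Khatri-Rao column reshaped to K x J is the outer product c_r b_r^T
rank_before = np.linalg.matrix_rank(Z[:, 0].reshape(K, J))
rank_after = np.linalg.matrix_rank(Z_rot[:, 0].reshape(K, J))
print(rank_before, rank_after)
```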
Spot the difference!
[Figure: PCA loadings shown next to PARAFAC loadings.]
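Alternating least squares (ALS)
• How to estimate the PCA model $\mathbf{X} = \mathbf{A}\mathbf{B}^T + \mathbf{E}$?
• Step 0 - Initialize $\mathbf{B}$.
• Step 1 - Estimate $\mathbf{A}$ using least squares: $\mathbf{A} = \mathbf{X}\mathbf{B}(\mathbf{B}^T\mathbf{B})^{-1}$
• Step 2 - Estimate $\mathbf{B}$ using least squares: $\mathbf{B} = \mathbf{X}^T\mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}$
• Step 3 - Check for convergence. If not converged, go to Step 1.
• Each update must reduce (or at least not increase) the sum of squares of the residuals, $\|\mathbf{X} - \mathbf{A}\mathbf{B}^T\|^2$.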
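The steps above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the course's own code: the function name `als_bilinear`, the random initialization, the relative-change stopping rule and the simulated data are all choices made for this example. `np.linalg.lstsq` is used for the least-squares steps instead of forming explicit inverses.

```python
import numpy as np

def als_bilinear(X, R, n_iter=200, tol=1e-12, seed=0):
    """Fit X ~ A B^T by alternating least squares (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    B = rng.random((X.shape[1], R))                      # Step 0: initialize B
    history = []                                         # sum of squares per iteration
    for _ in range(n_iter):
        A = np.linalg.lstsq(B, X.T, rcond=None)[0].T     # Step 1: A given B
        B = np.linalg.lstsq(A, X, rcond=None)[0].T       # Step 2: B given A
        history.append(np.sum((X - A @ B.T) ** 2))
        # Step 3: stop when the relative improvement is negligible
        if len(history) > 1 and history[-2] - history[-1] <= tol * history[-2]:
            break
    return A, B, history

# Simulated rank-2 data with a little noise (hypothetical, for illustration)
rng = np.random.default_rng(10)
X = rng.standard_normal((20, 2)) @ rng.standard_normal((15, 2)).T
X += 0.01 * rng.standard_normal(X.shape)

A, B, history = als_bilinear(X, R=2)
explained = 1 - history[-1] / np.sum(X ** 2)
print(len(history), explained)
```

Note that this sketch does not impose orthogonality or order the components, so it recovers the PCA subspace rather than the PCA loadings themselves.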
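Three different unfoldings: the formula is symmetric
$$\mathbf{X}_{I \times JK} = \mathbf{A}(\mathbf{C} \odot \mathbf{B})^T + \mathbf{E}_{I \times JK}$$
$$\mathbf{X}_{J \times KI} = \mathbf{B}(\mathbf{A} \odot \mathbf{C})^T + \mathbf{E}_{J \times KI}$$
$$\mathbf{X}_{K \times IJ} = \mathbf{C}(\mathbf{B} \odot \mathbf{A})^T + \mathbf{E}_{K \times IJ}$$
[Figure: the three unfolded arrays $\mathbf{X}_{I \times JK}$, $\mathbf{X}_{J \times KI}$ and $\mathbf{X}_{K \times IJ}$.]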
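The symmetry of the three unfoldings can be verified numerically. The sketch below assumes the Khatri-Rao product is the column-wise Kronecker product; the `transpose`/`reshape` calls are one way of producing unfoldings whose column order matches that convention, and the random loadings are placeholders.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product of P (m x R) and Q (n x R)."""
    return (P[:, None, :] * Q[None, :, :]).reshape(P.shape[0] * Q.shape[0], -1)

rng = np.random.default_rng(4)
I, J, K, R = 4, 6, 5, 3
A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
X = np.einsum("ir,jr,kr->ijk", A, B, C)        # noise-free array, E = 0

# The three unfoldings of the same array
X_IJK = X.transpose(0, 2, 1).reshape(I, K * J)  # rows: mode 1
X_JKI = X.transpose(1, 0, 2).reshape(J, I * K)  # rows: mode 2
X_KIJ = X.transpose(2, 1, 0).reshape(K, J * I)  # rows: mode 3

ok_A = np.allclose(X_IJK, A @ khatri_rao(C, B).T)
ok_B = np.allclose(X_JKI, B @ khatri_rao(A, C).T)
ok_C = np.allclose(X_KIJ, C @ khatri_rao(B, A).T)
print(ok_A, ok_B, ok_C)
```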
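How is the PARAFAC model calculated?
• How to estimate the model $\mathbf{X}_{I \times JK} = \mathbf{A}(\mathbf{C} \odot \mathbf{B})^T + \mathbf{E}_{I \times JK}$?
• Step 0 - Initialize $\mathbf{B}$ and $\mathbf{C}$.
• Step 1 - Estimate $\mathbf{A}$: $\mathbf{A} = \mathbf{X}_{I \times JK}\mathbf{Z}(\mathbf{Z}^T\mathbf{Z})^{-1}$, with $\mathbf{Z} = \mathbf{C} \odot \mathbf{B}$
• Step 2 - Estimate $\mathbf{B}$ in the same way: $\mathbf{B} = \mathbf{X}_{J \times KI}\mathbf{Z}(\mathbf{Z}^T\mathbf{Z})^{-1}$, with $\mathbf{Z} = \mathbf{A} \odot \mathbf{C}$
• Step 3 - Estimate $\mathbf{C}$ in the same way: $\mathbf{C} = \mathbf{X}_{K \times IJ}\mathbf{Z}(\mathbf{Z}^T\mathbf{Z})^{-1}$, with $\mathbf{Z} = \mathbf{B} \odot \mathbf{A}$
• Step 4 - Check for convergence. If not converged, go to Step 1.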
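The steps above can be sketched as a short NumPy routine. This is a minimal illustration under stated assumptions rather than a reference implementation: the names `khatri_rao` and `parafac_als`, the random initialization, the stopping rule and the simulated data are all choices made for this example, and no constraints or normalization are applied.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product of P (m x R) and Q (n x R)."""
    return (P[:, None, :] * Q[None, :, :]).reshape(P.shape[0] * Q.shape[0], -1)

def parafac_als(X, R, n_iter=1000, tol=1e-10, seed=0):
    """Fit a three-way PARAFAC model by alternating least squares (sketch)."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    B, C = rng.random((J, R)), rng.random((K, R))       # Step 0: initialize B and C
    X_A = X.transpose(0, 2, 1).reshape(I, K * J)        # unfolding used for A
    X_B = X.transpose(1, 0, 2).reshape(J, I * K)        # unfolding used for B
    X_C = X.transpose(2, 1, 0).reshape(K, J * I)        # unfolding used for C
    history = []                                        # sum of squares per iteration
    for _ in range(n_iter):
        A = np.linalg.lstsq(khatri_rao(C, B), X_A.T, rcond=None)[0].T  # Step 1
        B = np.linalg.lstsq(khatri_rao(A, C), X_B.T, rcond=None)[0].T  # Step 2
        C = np.linalg.lstsq(khatri_rao(B, A), X_C.T, rcond=None)[0].T  # Step 3
        history.append(np.sum((X_C - C @ khatri_rao(B, A).T) ** 2))
        # Step 4: stop when the relative improvement is negligible
        if len(history) > 1 and history[-2] - history[-1] <= tol * history[-2]:
            break
    return A, B, C, history

# Simulated two-component data with a little noise (hypothetical)
rng = np.random.default_rng(42)
A0, B0, C0 = (rng.standard_normal((n, 2)) for n in (6, 7, 8))
X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)
X += 0.01 * rng.standard_normal(X.shape)

A, B, C, history = parafac_als(X, R=2)
explained = 1 - history[-1] / np.sum(X ** 2)
print(len(history), explained)
```

The recovered components come out in arbitrary order and with arbitrary scaling between the three modes, so comparing them with known profiles requires matching and rescaling first.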
Good initialization is sometimes important
[Figure: response surface. ALS started from one initialization ($\mathbf{B}$ and $\mathbf{C}$) reaches the good solution; ALS started from another ($\mathbf{B}^*$ and $\mathbf{C}^*$) ends in a local minimum.]
Initialization methods
• random numbers (do this ten times and compare models)
• use another method to give a rough estimate (e.g. DTLD, MCR)
• use sensible guesses (e.g. elution profiles are Gaussian)
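The "ten random starts" strategy can be sketched as below. The compact ALS routine here is an assumption-laden illustration: it uses the normal-equations form of each update, which relies on the identity that the cross-product of a Khatri-Rao product equals the element-wise product of the individual cross-products, and it runs a fixed number of iterations instead of testing convergence. The function name, data and settings are hypothetical.

```python
import numpy as np

def parafac_als(X, R, seed, n_iter=300):
    """Compact PARAFAC-ALS with a random start (illustrative sketch)."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    B, C = rng.random((J, R)), rng.random((K, R))
    for _ in range(n_iter):
        A = np.einsum("ijk,jr,kr->ir", X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum("ijk,ir,kr->jr", X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum("ijk,ir,jr->kr", X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    ssq = np.sum((X - np.einsum("ir,jr,kr->ijk", A, B, C)) ** 2)
    return (A, B, C), ssq

# Simulated two-component data with a little noise (hypothetical)
rng = np.random.default_rng(7)
A0, B0, C0 = (rng.standard_normal((n, 2)) for n in (6, 7, 8))
X = np.einsum("ir,jr,kr->ijk", A0, B0, C0) + 0.01 * rng.standard_normal((6, 7, 8))

# Ten random initializations; compare the residual sums of squares
results = [parafac_als(X, R=2, seed=s) for s in range(10)]
ssqs = np.array([ssq for _, ssq in results])
best_model, best_ssq = results[int(np.argmin(ssqs))]
print(np.round(ssqs, 4))
```

If most starts reach the same lowest sum of squares, that solution is a reasonable candidate for the good solution; starts that end noticeably higher have probably stopped in a local minimum or were not run long enough.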
Conclusions (1)
• The PARAFAC model decomposes a three-way array into three sets of loadings, one for each 'mode'. Each set of loadings describes the variation in that mode, e.g. differences in concentration, changes in time, spectral profiles etc.
• PARAFAC components are calculated together and have no particular order. PARAFAC components are not (usually) orthogonal and cannot be rotated without losing fit.
• PARAFAC can be used for curve resolution and for calibration.
Conclusions (2)
• Some data sets have a chemical structure which is particularly suitable for the PARAFAC model, e.g. fluorescence spectroscopy.
• The PARAFAC model can also be used for four-way, five-way, N-way etc. data by simply using more sets of loadings.