Modelling procedures for directed network of data blocks
Agnar Höskuldsson, Centre for Advanced Data Analysis, Copenhagen
Data structures: directed network of data blocks
• Input data blocks
• Output data blocks
• Intermediate data blocks
Methods
• Optimization procedures for each passage through the network
• Balanced optimization of fit and prediction (H-principle)
• Scores, loadings, loading weights, regression coefficients for each data block
• Methods of regression analysis applicable at each data block
• Evaluation procedures at each data block
• Graphic procedures at each data block
Chemometric methods
1. Regression estimation
• X, Y. Traditional presentation: $Y_{est} = XB$, and standard deviations for $B$.
• Latent structure: $X = TP^T + X_0$, where $X_0$ is not used; $Y = TQ^T + Y_0$, where $Y_0$ is not explained.
2. Fit and precision
• Both fit and precision are controlled.
3. Selection of score vectors
• As large as possible
• Describe Y as well as possible
• Modelling stops when no more are found (cross-validation)
4. Graphic analysis of latent structure
• Score and loading plots
• Plots of weight (and loading weight) vectors
Chemometric methods
5. Covariance as measure of relationship
• $X^T Y$ for scaled data measures strength
• $X_1^T Y = 0$ implies that $X_1$ is removed from the analysis
6. Causal analysis
• $T = XR$: from score plots we can infer about the original measurement values
• Control charts for score values can be related to contribution charts
7. Analysis of X
• Most of the analysis time is devoted to understanding the structure of X.
• Plots are marked by symbols to better identify points in score or loading plots.
8. Model validation
• Cross-validation is used to validate the results
• Bootstrapping (re-sampling from data) is used to establish confidence intervals
Chemometric methods
9. Different methods
• Different types of data/situations may require different types of method
• One is looking for interpretations of the latent structure found
10. Theory generation
• Results from analysis are used to establish views/theories on the data
• Results motivate further analysis (groupings, non-linearity etc.)
Partitioning data, 1
[Figure: the data are partitioned into measurement data (X1, X2, ..., XL), response data (Y1, Y2) and reference data (Z1, Z2, Z3).]
Partitioning data, 2
• There is often a natural sub-division of data.
• It is often required to study the role of a sub-block.
• A data block with few variables may 'disappear' among one with many variables; e.g. optical instruments often give many variables.
Path diagram 1
[Figure: path diagram connecting the data blocks X1, X2, X3, X4, X5, X6 and X7.]
Examples:
• Production process
• Organisational data
• Diagram for sub-processes
• Causal diagram
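Path diagram 2, schematic application of modelling
[Figure: path diagram for X1, ..., X7, with new samples x10, x20 and x30 entering at X1, X2 and X3.]
Resulting estimating equations:
$X_{4,est} = X_1 B_{14} + X_2 B_{24} + X_3 B_{34}$
$X_{5,est} = X_1 B_{15} + X_2 B_{25} + X_3 B_{35}$
$X_{6,est} = X_4 B_{46} + X_5 B_{56}$
$X_{7,est} = X_6 B_{67}$
$x_{10}$ is a new sample from $X_1$, $x_{20}$ is a new one from $X_2$, and $x_{30}$ is a new one from $X_3$. How do they generate new samples for $X_4$, $X_5$, $X_6$ and $X_7$?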
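One way to read the estimating equations above is as a forward pass: new input samples are pushed through the fitted coefficient matrices block by block. The sketch below illustrates this under loud assumptions: the data are simulated, the block sizes are hypothetical, and ordinary least squares (`np.linalg.lstsq`) stands in for whatever estimator the modelling procedure actually uses at each block.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # number of training samples (hypothetical)

# Hypothetical input blocks X1, X2, X3 with 3, 2 and 4 variables.
X1, X2, X3 = (rng.normal(size=(n, k)) for k in (3, 2, 4))

def fit(inputs, target):
    """Least-squares coefficients for target ~ [inputs]; a stand-in estimator."""
    B, *_ = np.linalg.lstsq(np.hstack(inputs), target, rcond=None)
    return B

def predict(inputs, B):
    """Estimate a block from the blocks that lead to it."""
    return np.hstack(inputs) @ B

# Simulated, noise-free downstream blocks so the fitted equations are exact.
X4 = predict([X1, X2, X3], rng.normal(size=(9, 2)))
X5 = predict([X1, X2, X3], rng.normal(size=(9, 3)))
X6 = predict([X4, X5], rng.normal(size=(5, 2)))
X7 = predict([X6], rng.normal(size=(2, 1)))

# Each stacked matrix holds the B's of one equation, e.g. B4 = [B14; B24; B34].
B4 = fit([X1, X2, X3], X4)
B5 = fit([X1, X2, X3], X5)
B6 = fit([X4, X5], X6)
B7 = fit([X6], X7)

# New samples x10, x20, x30 are propagated through the network in path order.
x10, x20, x30 = (rng.normal(size=(1, k)) for k in (3, 2, 4))
x4 = predict([x10, x20, x30], B4)
x5 = predict([x10, x20, x30], B5)
x6 = predict([x4, x5], B6)
x7 = predict([x6], B7)
```

Stacking the coefficient matrices row-wise makes `np.hstack([X1, X2, X3]) @ B4` identical to $X_1 B_{14} + X_2 B_{24} + X_3 B_{34}$.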
Path diagram 3
[Figure: the path diagram for X1, ..., X7 aligned to a time axis with times t1 and t2.]
Data blocks can be aligned to time. Modelling can start at time t2.
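Notation and schematic illustrations
[Figure: instrumental data X with weight vector w and score vector t; response data Y with loading vector q and Y-score vector u.]
• w: weight vector (to be found)
• t: score vector, $t = Xw = w_1 x_1 + \dots + w_K x_K$
• q: loading vector, $q = Y^T t = [(y_1^T t), \dots, (y_M^T t)]$
• u: Y-score vector, $u = Yq = q_1 y_1 + \dots + q_M y_M$
Vectors are collected into matrices, e.g. $T = (t_1, \dots, t_A)$.
Adjustments: $X \leftarrow X - t p^T/(t^T t)$, $Y \leftarrow Y - t q^T/(t^T t)$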
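The notation can be made concrete with a small numerical sketch of one step. The data are random and hypothetical. The weight vector is chosen here as the dominant left singular vector of $X^T Y$, which is an assumption for illustration since the slide only says that w is "to be found". The loading $p = X^T t$ is taken from the conjugate-vector slides.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))   # hypothetical instrumental data, N x K
Y = rng.normal(size=(20, 2))   # hypothetical response data, N x M

# Assumed weight choice: dominant left singular vector of X^T Y.
U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
w = U[:, 0]

t = X @ w      # score vector, a weighted sum of the columns of X
p = X.T @ t    # X-loading vector
q = Y.T @ t    # loading vector, the inner products of t with each column of Y
u = Y @ q      # Y-score vector, a weighted sum of the columns of Y

# Adjustments: remove from X and Y the part described by t.
X_adj = X - np.outer(t, p) / (t @ t)
Y_adj = Y - np.outer(t, q) / (t @ t)
```

After the adjustment both blocks are orthogonal to t, so the next score vector describes variation that t has not already used.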
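Conjugate vectors 1
[Figure: three schematic cases of a data block X with its weight, score, loading and conjugate vectors.]
• Conjugate vectors r: $t = Xw$, $p = X^T t$, with $p_a^T r_b = 0$ for $a \neq b$.
• Conjugate vectors r: $t = Xq$, with $q_a^T r_b = 0$ for $a \neq b$.
• Conjugate vectors r and s: $t = Xw$, $p = X^T v$, with $p_a^T r_b = 0$ and $t_a^T s_b = 0$ for $a \neq b$.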
Conjugate vectors 2
The conjugate vectors $R = (r_1, r_2, \dots, r_A)$ satisfy $T = XR$.
Latent structure solution:
$X = TP^T + X_0$, where $X_0$ is the part of X that is not used
$Y = TQ^T + Y_0$, where $Y_0$ is the part of Y that could not be explained
$Y = TQ^T + Y_0 = X(RQ^T) + Y_0 = XB + Y_0$, for $B = RQ^T$
The conjugate vectors are always computed together with the score vectors. Once the regression on the score vectors has been computed, the regression on the original variables is computed as shown.
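A minimal sketch of how R can be carried along with the score vectors so that $T = XR$ and $B = RQ^T$ hold. Several things here are assumptions, not taken from the slides: the function name `latent_structure`, the weight choice (dominant singular vector of the adjusted $X^T Y$), the scaling of each score vector to unit length, and the bookkeeping matrix `M`, which tracks how the adjusted X relates to the original X.

```python
import numpy as np

def latent_structure(X, Y, n_components):
    """Return T, P, Q, R and residuals X0, Y0 with X = T P^T + X0, Y = T Q^T + Y0."""
    Xa = np.array(X, dtype=float)      # adjusted copies of the data blocks
    Ya = np.array(Y, dtype=float)
    M = np.eye(Xa.shape[1])            # invariant: Xa == X @ M
    T, P, Q, R = [], [], [], []
    for _ in range(n_components):
        # Assumed weight choice: dominant left singular vector of Xa^T Ya.
        w = np.linalg.svd(Xa.T @ Ya, full_matrices=False)[0][:, 0]
        t = Xa @ w
        c = np.linalg.norm(t)
        w, t = w / c, t / c            # scale so that t^T t = 1
        p, q = Xa.T @ t, Ya.T @ t      # loading vectors
        r = M @ w                      # conjugate vector: t = X @ r
        Xa -= np.outer(t, p)           # adjustments with t^T t = 1
        Ya -= np.outer(t, q)
        M -= np.outer(r, p)            # keeps Xa == X @ M
        T.append(t); P.append(p); Q.append(q); R.append(r)
    return (np.column_stack(T), np.column_stack(P),
            np.column_stack(Q), np.column_stack(R), Xa, Ya)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 6))           # hypothetical data
Y = rng.normal(size=(30, 2))
T, P, Q, R, X0, Y0 = latent_structure(X, Y, 3)
B = R @ Q.T                            # regression on the original variables
```

Because each score vector is scaled to unit length, the division by $t^T t$ in the adjustments drops out and $Y = TQ^T + Y_0$ holds without extra scaling factors.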
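Optimization procedure, 1
• One data block, X1 with weight vector w1 and score vector t1: maximize $|t_1|^2$.
• Two data blocks, X1 → X2 with weight vector w1, score vector t1 and loading vector q2: maximize $|q_2|^2$.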
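Both tasks can be sketched as singular value problems, assuming the weight vector is constrained to unit length (the slide does not state the constraint, so this is an assumption). With $t_1 = X_1 w_1$ and $q_2 = X_2^T t_1$, maximizing $|t_1|^2$ gives the first right singular vector of $X_1$, and maximizing $|q_2|^2$ gives the first right singular vector of $X_2^T X_1$. The helper name `best_unit_weight` and the data are hypothetical.

```python
import numpy as np

def best_unit_weight(M):
    """Unit vector w maximizing |M w|^2, and the attained value of |M w|."""
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[0], s[0]

rng = np.random.default_rng(3)
X1 = rng.normal(size=(25, 4))   # hypothetical data blocks sharing rows
X2 = rng.normal(size=(25, 3))

# One data block: maximize |t1|^2 with t1 = X1 w1.
w_one, s_one = best_unit_weight(X1)
t1 = X1 @ w_one

# Two data blocks: maximize |q2|^2 with q2 = X2^T X1 w1.
w_two, s_two = best_unit_weight(X2.T @ X1)
q2 = X2.T @ (X1 @ w_two)
```

Under the unit-length assumption, the one-block case gives the first principal component direction and the two-block case gives the usual PLS weight vector.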
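Three data blocks
[Figure: the weight vector w starts at X; blocks X, Y and Z with vectors t, ty, tz, qy and qz; X is the basis and Y is estimated, then Y is the basis and Z is estimated; criterion: maximize $|q_z|^2$.]
[Figure: blocks X1, X2, X3 and X4 with weight vector w and vectors t1, q2, t3, t4 and q4.]
Adjustments:
• t1 describes X1: $X_1 \leftarrow X_1 - t_1 p_1^T/(t_1^T t_1)$, $p_1 = X_1^T t_1$.
• t1 describes X2: $X_2 \leftarrow X_2 - t_1 q_2^T/(t_1^T t_1)$, $q_2 = X_2^T t_1$.
• q2 describes X3: $X_3 \leftarrow X_3 - t_3 q_2^T/(q_2^T q_2)$, $t_3 = X_3 q_2$.
• t3 describes X4: $X_4 \leftarrow X_4 - t_3 q_4^T/(t_3^T t_3)$, $q_4 = X_4^T t_3$.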
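The four adjustments can be checked numerically with a short sketch. The block shapes are hypothetical: X1 and X2 share rows, X2 and X3 share columns, and X3 and X4 share rows. The criterion used to pick w (a unit vector maximizing $|q_4|^2$) is an assumption made for illustration, since the slide does not state the criterion for this chain.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 15, 12
X1 = rng.normal(size=(n, 4))    # hypothetical blocks
X2 = rng.normal(size=(n, 5))    # same rows as X1
X3 = rng.normal(size=(m, 5))    # same columns as X2
X4 = rng.normal(size=(m, 3))    # same rows as X3

# Assumed criterion: unit w maximizing |q4|^2, where q4 = X4^T X3 X2^T X1 w.
_, _, Vt = np.linalg.svd(X4.T @ X3 @ X2.T @ X1)
w = Vt[0]

t1 = X1 @ w        # score vector of X1
p1 = X1.T @ t1     # loading vector of X1
q2 = X2.T @ t1     # loading vector of X2
t3 = X3 @ q2       # score vector of X3
q4 = X4.T @ t3     # loading vector of X4

# Adjustments, one per block, as listed on the slide.
X1_adj = X1 - np.outer(t1, p1) / (t1 @ t1)
X2_adj = X2 - np.outer(t1, q2) / (t1 @ t1)
X3_adj = X3 - np.outer(t3, q2) / (q2 @ q2)
X4_adj = X4 - np.outer(t3, q4) / (t3 @ t3)
```

Each adjusted block is orthogonal to the vector that described it: the columns of `X1_adj` and `X2_adj` to t1, the rows of `X3_adj` to q2, and the columns of `X4_adj` to t3.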
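Optimization procedure, 2
Two input and two output data blocks:
• Find w1 and w2 that maximize $|q_{13} + q_{23} + q_{14} + q_{24}|^2$.
• [Figure: X1 (with w1, t1) and X2 (with w2, t2) lead to X3 (loadings q13, q23) and X4 (loadings q14, q24).]
Two input, one intermediate and one output data blocks:
• Find w1 and w2 that maximize $|q_{134} + q_{234}|^2$.
• [Figure: X1 (with w1, t1) and X2 (with w2, t2) lead to X3 (loadings q13, q23), which leads to X4 (loadings q134, q234).]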
Balanced optimization of fit and prediction (H-principle)
Linear regression of Y on X: we are looking for a weight vector w such that the resulting score vector $t = Xw$ is good.
The basic measure of quality is the prediction variance for a sample $x_0$. Assuming negligible bias and the standard assumptions, it can be written
$F(w) = \mathrm{Var}(y(x_0)) = k[1 - (y^T t)^2/(t^T t)][1 + t_0^2/(t^T t)]$.
It can be shown that $F(cw) = F(w)$ for all $c > 0$. Choose c such that $t^T t = 1$. Then
$F(w) = k[1 - (y^T t)^2][1 + t_0^2]$.
To get a prediction variance as small as possible, it is natural to choose w such that $(y^T t)^2$ becomes as large as possible:
maximize $(y^T t)^2$ = maximize $|q|^2$ (PLS regression).
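The scale invariance $F(cw) = F(w)$ and the reduced form for $t^T t = 1$ can be checked numerically. The sketch below evaluates the expression as written on the slide, with some assumptions that are not in the source: $k = 1$, y scaled to unit length, the score of the new sample taken as $t_0 = x_0^T w$, and random hypothetical data.

```python
import numpy as np

def prediction_variance(w, X, y, x0, k=1.0):
    """F(w) = k [1 - (y^T t)^2 / (t^T t)] [1 + t0^2 / (t^T t)], with t = X w."""
    t = X @ w
    t0 = x0 @ w                  # assumed score of the new sample x0
    tt = t @ t
    return k * (1.0 - (y @ t) ** 2 / tt) * (1.0 + t0 ** 2 / tt)

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))     # hypothetical data
y = rng.normal(size=30)
y /= np.linalg.norm(y)           # assumption: y scaled to unit length
x0 = rng.normal(size=4)          # hypothetical new sample
w = rng.normal(size=4)

F_w = prediction_variance(w, X, y, x0)
F_cw = prediction_variance(3.7 * w, X, y, x0)   # rescaled weight vector

# Reduced form after choosing c so that t^T t = 1.
w_unit = w / np.linalg.norm(X @ w)
t, t0 = X @ w_unit, x0 @ w_unit
F_reduced = (1.0 - (y @ t) ** 2) * (1.0 + t0 ** 2)
```

The first factor shrinks as $(y^T t)^2$ grows, which is the fit part; the second factor grows with $t_0^2$, which is the prediction part that the H-principle balances against it.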
Optimization procedure, 3
Weighing along objects (rows); same algorithm, but using the transposes.
• Task: find weight vector v1 that maximizes $|t_2|^2$. [Figure: X1 with v1 and p1; X2 with t2.]
• Task: find weight vector v1 that maximizes $|q_3|^2$. [Figure: X1 with v1 and p1; X2 with t2; X3 with q3.]
Optimization procedure, 4
[Figure: X1 with w1, t1 and p1; X2 with t2; X3 with q3.]
Task: find weight vector w1 that maximizes $|q_3|^2$, where
$q_3 = X_3^T t_2 = X_3^T X_2 p_1 = X_3^T X_2 X_1^T t_1 = X_3^T X_2 X_1^T X_1 w_1$
If p1 is a good weight vector for X2, a good result may be expected. Pre-processing may be needed to find variables in X1 and in X2 that are highly correlated with each other.
Regression equations: $X_{3,est} = X_2 B_{23}$, $X_{2,est} = B_{12} X_1$, $X_{1,est} = X_1 B_{11}$
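The chain of equalities for $q_3$ suggests solving the task as a singular value problem on the product matrix $X_3^T X_2 X_1^T X_1$. A minimal sketch follows, assuming w1 is restricted to unit length (an assumption, not stated on the slide) and using hypothetical random blocks in which X1 and X2 share variables (columns) and X2 and X3 share rows.

```python
import numpy as np

rng = np.random.default_rng(6)
X1 = rng.normal(size=(10, 6))   # hypothetical block, n x K
X2 = rng.normal(size=(8, 6))    # m x K, same variables as X1
X3 = rng.normal(size=(8, 2))    # m x M, same rows as X2

# q3 = C w1 with C = X3^T X2 X1^T X1, following the chain on the slide.
C = X3.T @ X2 @ X1.T @ X1
_, s, Vt = np.linalg.svd(C)
w1 = Vt[0]                      # unit vector maximizing |C w1|^2 (assumed constraint)

# Step through the network explicitly.
t1 = X1 @ w1                    # score vector of X1
p1 = X1.T @ t1                  # loading vector of X1, used as weight for X2
t2 = X2 @ p1                    # score vector of X2
q3 = X3.T @ t2                  # loading vector of X3
```

Stepping through the network gives the same $q_3$ as the single product `C @ w1`, and its length equals the largest singular value of C.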
Three types of reports
Reports:
• How a data block is doing in a network
• How a data block can be described by the data blocks that lead to it
• How a data block can be described by one data block that leads to it
[Figure: three diagrams illustrating the reports for a block Xi and preceding blocks Xi-1, Xi-2 and Xi-3.]
Production data, 1
X1: Process parameters, 8 variables
X2: NIR data, 1560 variables (reduced to 120)
Values in %:
No    |X2|^2    |Y|^2     |X|^2     |Y|^2
1     78.961    51.483    74.969    51.964
2     91.538    67.559    86.786    69.553
3     96.351    76.291    91.627    80.643
4     97.942    81.383    95.373    85.058
5     98.620    83.900    95.919    89.056
6     98.967    85.705    97.054    90.050
7     99.205    87.917    97.508    91.990
8     99.294    90.472    97.990    93.455
9     99.349    92.183    98.667    94.020
10    99.426    92.947    98.896    94.708
11    99.606    93.084    99.103    95.082
12    99.657    93.376    99.202    95.740
X1 'disappears' in the NIR data X2.
Production data, 2
[Figure: X1 (with w1, t1) and X2 (with w2, t2) both lead to Y.]
Results for X1, process parameters: 5 score vectors explain 11.92% of Y.
Results for X2, NIR data: 12 score vectors explain 84.14% of Y.
At each step the score vectors are evaluated. Non-significant ones are excluded.
In total, 96.06% = 11.92% + 84.14% of Y is explained.
Production data, 3
[Figure: plot of estimated versus observed quality variable using only score vectors for process parameters; R2 = 0.7512.]
R2-values shown in the diagram of X1, X2 and Y: 96.06%, 75.12% and 87.75%.
The process parameters contribute marginally, by 11.92%. But if only they were used, they would explain 75.12% of the variation of Y.
Directed network of data blocks
• Input blocks: give weight vectors for initial score vectors.
• Intermediate blocks: are described by previous blocks and give score vectors for succeeding blocks.
• Output blocks: are described by previous blocks.
Magnitudes computed between two data blocks, Xi → Xk
• Ti: score vectors
• Qi: loading vectors
• Bi: regression coefficients
• Measures of precision
• Measures of fit
• Etc.
Different views:
• As a part of a path
• If the results are viewed marginally
• If only Xi → Xk
• ...
Stages in batch processes
[Figure: batches followed over time through stages 1, 2, ..., K, giving data blocks X1, X2, ..., XK and the final quality Y.]
Paths:
• X1 → X2 → ... → XK → Y: given a sample x10, the path model gives estimated samples for later blocks.
• [X1 X2 X3] → X4 → Y: given values of (x10, x20, x30), estimates for values of x4 and y are given.
• [X1 X2 X3] → [X4 X5] → Y: given values of (x10, x20, x30), estimates for values of (x4, x5) and y are given.
Schematic illustration of the modelling task for sequential processes
[Figure: stages X1, X2, X3, X4 and Y along the process, marked as initial conditions, known process parameters, now, next stage and later stages.]
Plots of score vectors
[Figure: blocks X1, X2, ..., XL with score vectors t1, t2, ..., tL; plots of t2 against t1 (X1 – X2) and of tL against t1 (X1 – XL).]
The plots will show how the changes are relative to the first data block.
Graphic software to specify paths
[Figure: blocks X1, X2, X3, X4, X5, ..., XL placed on the screen and connected.]
Blocks are dragged onto the screen and the relationships are specified.
Pre-processing of data
• Centring. If desired, centring of the data is carried out.
• Scaling. In the computations all variables are scaled to unit length (or unit standard deviation if centred). It is checked whether scaling disturbs the variable, e.g. if it is constant except for two values, or if the variable is at the noise level. When the analysis has been completed, values are scaled back so that results are in the original units.
• Redundant variable. It is investigated whether a variable fails to contribute to the explanation of any of the variables that the present block leads to. If it is redundant, it is eliminated from the analysis.
• Redundant data block. It is investigated whether a data block can provide a significant description of the blocks that it is connected to later in the network. If it cannot contribute to the description of those blocks, it is removed from the network.
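The centring and scaling steps can be sketched as a pair of functions that scale a block and later restore the original units. The function names and the tolerance are hypothetical. The check for problematic variables is deliberately simplified to flagging (near-)constant columns; the tests for "constant except for two values" or for variables at the noise level are not implemented here.

```python
import numpy as np

def scale_block(X, centre=True, tol=1e-12):
    """Centre (optionally) and scale the columns of X; flag constant columns."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0) if centre else np.zeros(X.shape[1])
    Xc = X - mean
    # Unit standard deviation if centred, unit length otherwise.
    scale = Xc.std(axis=0, ddof=1) if centre else np.linalg.norm(Xc, axis=0)
    constant = X.std(axis=0) <= tol          # simplified check (assumption)
    scale = np.where(constant, 1.0, scale)   # leave flagged columns unscaled
    return Xc / scale, mean, scale, constant

def unscale_block(Xs, mean, scale):
    """Scale values back so that they are in the original units."""
    return Xs * scale + mean

X = np.array([[1.0, 10.0, 5.0],
              [2.0, 30.0, 5.0],
              [4.0, 20.0, 5.0],
              [7.0, 60.0, 5.0]])             # third variable is constant
Xs, mean, scale, constant = scale_block(X)
X_back = unscale_block(Xs, mean, scale)
```

Keeping `mean` and `scale` alongside the scaled block is what allows results to be reported in the original units after the analysis.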
Post-processing of results
• Score vectors computed in the passages through the network are evaluated in the analysis at one passage. Apart from the input blocks, the score vectors found between passages are not independent.
• The score vectors found in a relationship Xi → Xj are evaluated to see whether all are significant or some should be removed for this relationship.
• Cross-validation, as in standard regression methods
• Confidence intervals for parameters by resampling technique
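The resampling idea in the last point can be sketched as a percentile bootstrap over samples (rows). This is an illustration under assumptions, not the procedure used in the presentation: ordinary least squares stands in for the regression method, and the function name `bootstrap_ci`, the number of resamples and the simulated data are all hypothetical.

```python
import numpy as np

def bootstrap_ci(X, y, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for least-squares regression coefficients."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample rows with replacement
        coefs[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    lower, upper = np.percentile(
        coefs, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    return lower, upper

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 3))                 # hypothetical data
beta = np.array([1.0, -2.0, 0.5])            # hypothetical true coefficients
y = X @ beta + 0.1 * rng.normal(size=40)
lower, upper = bootstrap_ci(X, y)
```

The same loop applies to any parameter of interest: refit on each resample, collect the estimates, and read the interval off their percentiles.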
International workshop on Multi-block and Path Methods 24. – 30. May 2009, Mijas, Malaga, Spain