
Scientific Methods 1


Presentation Transcript


1. Scientific Methods 1
'Scientific evaluation, experimental design & statistical methods'
COMP80131 Lecture 11: Statistical Methods – Intro to PCA & Monte Carlo Methods
Barry & Goran
www.cs.man.ac.uk/~barry/mydocs/MyCOMP80131

2. Correlation & covariance
Correlation coefficient for two samples of M values:
  r = Σi (xi − mean_x)(yi − mean_y) / √[ Σi (xi − mean_x)² · Σi (yi − mean_y)² ]
Lies between −1 and 1.
Covariance between two samples of M values:
  cov(x, y) = (1/(M−1)) Σi (xi − mean_x)(yi − mean_y)
Use 1/(M−1) if mean_x is the sample mean & 1/M if mean_x is the population mean.
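These formulas translate directly into MATLAB (a minimal sketch of my own; the two sample vectors are the x1 and x2 samples used in the worked example later in the lecture):

x = [1 2 3 1 4];  y = [2 1 3 -1 2];            % two samples of M = 5 values
M = numel(x);
cv = sum((x - mean(x)).*(y - mean(y)))/(M-1)   % sample covariance = 1.15
r = cv/(std(x)*std(y))                         % correlation coefficient, lies in [-1, 1]
% cov(x,y) and corrcoef(x,y) return 2-by-2 matrices whose
% off-diagonal entries match cv and r respectively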

3. In vector notation
With column vectors x = (x1, …, xM)^T and y = (y1, …, yM)^T from which the means have been subtracted:
  correlation: x^T y / √((x^T x)(y^T y))
  covariance: (1/M) x^T y (or 1/(M−1) for the sample version)
Both measure similarity between 2 columns of numbers (vectors).
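In MATLAB the vector form is just a dot product (continuing the sketch above with the same invented vectors):

xz = x - mean(x);  yz = y - mean(y);     % mean-subtracted row vectors
cv2 = (xz*yz')/(M-1)                     % vector form of covariance: same 1.15
r2 = (xz*yz')/sqrt((xz*xz')*(yz*yz'))    % normalised dot product: same r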

4. Principal Components Analysis
• PCA converts samples of M variables into samples of a smaller number of variables called principal components.
• Produces shorter columns.
• Exploits interdependency or correlation among the M variables in each column.
• The evidence for this is the similarity between columns as seen across many examples.
• If there is no such correlation, PCA cannot reduce the number of variables.
• The first principal component has the highest variance: it accounts for as much of the variability as possible.
• Each succeeding principal component has the highest variance possible while being orthogonal to (uncorrelated with) the previous ones.

5. PCA
• Reduces the number of variables (dimensionality) without significant loss of information.
• Also named: 'discrete Karhunen–Loève transform (KLT)', 'Hotelling transform', 'proper orthogonal decomposition (POD)'.
• Related to (but not the same as) 'factor analysis'.

6. Example
Assume 5 observations of the behaviour of 3 variables x1, x2, x3:
x1: 1 2 3 1 4     sample mean = 11/5 = 2.2
x2: 2 1 3 -1 2    sample mean = 1.4
x3: 3 4 7 1 8     sample mean = 4.6
Subtract the means (call the result matrix X):
x1': -1.2 -0.2 0.8 -1.2 1.8
x2':  0.6 -0.4 1.6 -2.4 0.6
x3': -1.6 -0.6 2.4 -3.6 3.4
Calculate the 'Covariance Matrix' C:
       x1'   x2'   x3'
x1':   1.7   1.15  3.6
x2':   1.15  2.3   3.45
x3':   3.6   3.45  8.3
e.g. cov(x1', x2') = average value of x1'·x2'
  = (−1.2×0.6 + (−0.2)×(−0.4) + 0.8×1.6 + (−1.2)×(−2.4) + 1.8×0.6) / 4
  = 4.6/4 = 1.15
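A quick numerical check of the worked figures (a small sketch; the variable names are my own):

X = [1 2 3 1 4; 2 1 3 -1 2; 3 4 7 1 8];    % rows are x1, x2, x3
Xz = X - repmat(mean(X,2), 1, 5);          % subtract the sample mean of each row
C = Xz*Xz'/4                               % 1/(M-1) with M = 5 observations
% gives [1.7 1.15 3.6; 1.15 2.3 3.45; 3.6 3.45 8.3] as above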

7. Eigenvalues & eigenvectors
[U, diagV] = eig(C);
       0.811   0.458   0.364                0     0       0
U =    0.324  -0.870   0.372       D =      0   0.964     0
      -0.487   0.184   0.854                0     0     11.34
        u3      u2      u1
The columns of U are the eigenvectors u3, u2, u1 of C; the diagonal of D holds the corresponding eigenvalues λ3 = 0, λ2 = 0.964, λ1 = 11.34 (MATLAB's eig returns them in ascending order).

8. Transforming the measurements
• For each column x' of matrix X, multiply by U^T to transform it to a different set of numbers: x' becomes U^T x'.
• Or do it all at once by calculating Y = U^T * X. We get:
        0      0       0       0      0
Y =   -1.37   0.146  -0.583   0.874  0.929
      -1.58  -0.734   2.936  -4.404  3.781
• The first column of X is now expressed as: 0×u3 − 1.37×u2 − 1.58×u1 (the rows of Y correspond to u3, u2, u1, in the same order as the columns of U).
• Similarly for all the other four columns of X.
• Each column is now expressed in terms of the eigenvectors of C.
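The transform and the decomposition of the first column can be checked directly (my own sketch):

X = [1 2 3 1 4; 2 1 3 -1 2; 3 4 7 1 8];
Xz = X - repmat(mean(X,2), 1, 5);   % mean-subtracted data (matrix X of slide 6)
[U, ~] = eig(Xz*Xz'/4);
Y = U'*Xz                           % the matrix Y shown above
U*Y(:,1)                            % rebuilds column 1 of the data: [-1.2; 0.6; -1.6]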

9. Reducing dimensionality
U^T C U = D, therefore C = U D U^T (since U^T is the inverse of U).
Now we can express: C = λ1 (u1 u1^T) + λ2 (u2 u2^T) + λ3 (u3 u3^T).
Now look at the eigenvalues λ1, λ2, λ3.
Strike out the zero-valued ones (λ3) with their corresponding eigenvectors (u3). This leaves u1 & u2 as the principal components: we can represent all the data, without loss, with just these two.
We can also remove smaller eigenvalues (such as λ2) with their eigenvectors. (If they do not affect C much, they should not affect X much either.)
The whole data can then be represented by just u1 without serious loss of accuracy.
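The rank-1 expansion of C can be verified numerically (a small sketch of my own):

C = [1.7 1.15 3.6; 1.15 2.3 3.45; 3.6 3.45 8.3];
[U, D] = eig(C);  lam = diag(D);     % lam = [λ3; λ2; λ1] = [0; 0.964; 11.34]
Csum = lam(1)*U(:,1)*U(:,1)' + lam(2)*U(:,2)*U(:,2)' + lam(3)*U(:,3)*U(:,3)';
norm(C - Csum)                       % ~0: the three rank-1 terms rebuild C exactly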

10. Reconstructing original data from principal components
• Because Y = U^T * X, then X = U * Y.
• So if we don't strike out any eigenvalues & eigenvectors, this gets us back to the original data.
• If we strike out row 1 of Y and the first column of U (the eigenvector whose eigenvalue is zero), we still get back to the original data exactly.
• If we also strike out row 2 of Y and the second column of U, we get back something close to the original data.
• We do not lose much information by keeping just one principal component.
• Dimensionality reduces from 3 to 2 or 1.
• (Normally the eigenvalues are reordered in decreasing magnitude, but I have not done that here.)

11. In MATLAB
clear all;
origData = [1 2 3 1 4 ; 2 1 3 -1 2 ; 3 4 7 1 8]
[M,N] = size(origData);
meanofCols = mean(origData,2);
% subtract off mean for EACH dimension
zmData = origData - repmat(meanofCols,1,N)
covarMat = 1 / (N-1) * (zmData * zmData')
% find the eigenvectors and eigenvalues of covarMat
[eigVecs, diagV] = eig(covarMat)
eigVals = diag(diagV)
[reigVals, Ind] = sort(eigVals,'descend');  % sort the variances in decreasing order
reigVecs = eigVecs(:,Ind);                  % reorder eigenvectors accordingly
proj_zmData = reigVecs' * zmData
disp('Approx to original data taking just a few principal components');
nPC = input('How many PCs do you need (look at eigenvals to decide):');
PCproj_zmData = proj_zmData(1:nPC,:)
PCVecs = reigVecs(:,1:nPC)                  % only keep the first few reordered eigVecs
RecApprox_zmData = PCVecs * PCproj_zmData
RecApprox_origData = RecApprox_zmData + repmat(meanofCols,1,N)
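For comparison, if the Statistics and Machine Learning Toolbox is available, the built-in pca function carries out the same steps; note that it expects observations in rows, hence the transpose (a sketch, not part of the original script):

[coeff, score, latent] = pca(origData');
% coeff:  principal component directions, as reigVecs above
% score:  projected data (transposed), as proj_zmData' above
% latent: variances (eigenvalues) in decreasing order, as reigVals above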

12. Monte Carlo Methods
• Monte Carlo methods use repeated random sampling of the behaviour of highly complex, multidimensional mathematical equations describing real or simulated systems, in order to determine their properties.
• The repeated random sampling produces observations to which statistical inference can be applied.
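As a minimal illustration of the idea (my own sketch; the integrand is an arbitrary choice): estimate the integral of exp(x) over [0,1], whose exact value is e − 1.

N = 1e5;
x = rand(N,1);            % N pseudo-random samples, uniform on [0,1]
est = mean(exp(x))        % Monte Carlo estimate, close to exp(1)-1 = 1.7183
% statistical inference applies: the standard error is std(exp(x))/sqrt(N)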

13. Pseudo-random processes
• The name Monte Carlo refers to the famous casino.
• Gambling requires a random process such as the spinning of a roulette wheel.
• Monte Carlo methods use pseudo-random processes implemented in software.
• 'Pseudo-random' processes are not truly random: the numbers produced can be predicted by a person who knows the algorithm being used (see the sketch below).
• However, they can be used to simulate the effects of true randomness.
• Simulations are not required to be numerically identical to real processes.
• The aim is to produce statistical results such as averages & distributions.
• This requires a 'sampling' of the population of all possible modes of behaviour of the system.
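Re-seeding the generator reproduces the 'random' sequence exactly, which shows the process is deterministic (a minimal sketch):

rng(12345);  a = rand(1,5);   % five pseudo-random numbers
rng(12345);  b = rand(1,5);   % reset the seed and draw again
isequal(a, b)                 % true: the sequence is predictable from the seed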

14. Illustration
• Monte Carlo methods may be used to evaluate multi-dimensional integrals.
• Consider the problem of calculating the area of an ellipse by generating a set of N pseudo-random number pairs (xi, yi) uniformly covering the square −1 < x < 1, −1 < y < 1, as illustrated next:

15. Area of an ellipse
[Figure: an ellipse inscribed in the square −1 < x < 1, −1 < y < 1]
Area of the square is 2 × 2 = 4.
If there are N points and M of them fall inside the ellipse,
  area of ellipse ≈ 4 × M / N as N → ∞ (frequentist approach)
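A direct implementation of this estimate (my own sketch; the semi-axes a and b are illustrative choices):

a = 1.0;  b = 0.5;                    % assumed semi-axes of the ellipse
N = 1e5;
x = 2*rand(N,1) - 1;                  % points uniform over the square
y = 2*rand(N,1) - 1;
M = sum(x.^2/a^2 + y.^2/b^2 <= 1);    % count points falling inside the ellipse
areaEst = 4*M/N                       % tends to pi*a*b = 1.5708 as N grows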

16. Simplicity of MC methods
• This example illustrates the simplicity of MC techniques, but not their computational advantages.
• We could have used a regularly placed grid of points rather than randomly placed points in the rectangle, as on the next slide.

17. Regular grid
[Figure: a regular grid of points covering the same square]

18. Multi-dimensional integration
• In fact there are no advantages for such a 2-dimensional problem.
• Consider a multi-dimensional integral:
  I = ∫ … ∫ f(x1, x2, …, xL) dx1 dx2 … dxL over an L-dimensional region.

19. Disadvantage of regular grid
• f(x1, x2, …, xL) may be evaluated at regularly spaced points as a means of evaluating the integral.
• The number of regularly spaced points, N, must increase exponentially with dimension L if the error is not to increase exponentially with L.
• If N = 100 when L = 2, then adjacent points will be 0.2 cm apart.
• If L increases to 3, we need N = 10³ points to maintain the same separation between them.
• When L = 4, we need N = 10⁴, etc. – the 'curse of dimensionality'. (See the sketch after this list.)
• Look at this another way:
• Assume N remains fixed with regular sampling, and L increases.
• Each dimension must be sampled more & more sparsely – and less and less efficiently.
• More & more points share the same value in each dimension.
• The error scales as N^(−2/L), so for fixed N it grows rapidly as L increases.
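The exponential growth in the number of grid points is easy to tabulate (a small sketch):

n = 10;                                          % points per dimension, fixed spacing
for L = 1:6
  fprintf('L = %d needs N = %g grid points\n', L, n^L);
end
% N = n^L grows exponentially with L, whereas the Monte Carlo
% error 1/sqrt(N) does not depend on L at all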

20. Advantage of random sampling
• Use uniformly distributed random points in L-dimensional space.
• This avoids the inefficiency of the rectangular grids created by regular sampling, by using a purely random sample of N points uniformly distributed over the region.
• For high dimension L, the error is proportional to 1/√N.
• To reduce the error by a factor of 2, the sample size N must be increased by a factor of 4, regardless of the dimensionality.
• There are ways of decreasing the Monte Carlo error to make the technique still more efficient.
• One approach is to use 'quasi-random' or 'low-discrepancy' sampling (sketched below).
• The use of such quasi-random sampling for numerical integration is referred to as 'quasi-Monte Carlo' integration.
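If the Statistics and Machine Learning Toolbox is available, low-discrepancy sequences can be generated directly; haltonset is one of several generators (a minimal sketch):

p = haltonset(2, 'Skip', 1);      % 2-D Halton low-discrepancy sequence
Xq = net(p, 1000);                % first 1000 quasi-random points in [0,1]^2
scatter(Xq(:,1), Xq(:,2), 4, 'filled');
title('Halton points: more evenly spread than rand, but not a regular grid');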

21. MATLAB: Area of Semicircle
for N = 1:200
  M = 0;
  for i = 1:N
    x = 2*rand - 1;
    y = rand*1.0;
    I = sqrt(1 - x*x);
    if y <= I, M = M+1; end;   % if y <= I, point (x,y) is below the curve
  end;
  Int(N) = M*2/N;
end; % of N loop
figure(6); plot(Int);
title('Area of semicircle');
xlabel('Number of points');
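As N grows, Int(N) settles towards the true area of the unit semicircle, π/2 ≈ 1.5708; the bounding rectangle −1 < x < 1, 0 < y < 1 has area 2, hence the factor 2 in M*2/N.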

22. Convergence as N → ∞
[Figure: the estimate Int(N) plotted against N, settling towards π/2 ≈ 1.5708]

23. MATLAB code for scatter plot
clear; N = 6000;
M = 0;
for i = 1:N
  x(i) = rand*2 - 1;
  y(i) = rand*1.0;
  I = sqrt(1 - x(i)*x(i));
  if y(i) <= I, M = M+1; C(i) = 2; else C(i) = 1; end;   % colour by above/below curve
end;
scatter(x, y, 6, C, 'filled');
Int = M*2/N
title(sprintf('Scatter of MC area method: N=%d, Int=%g', N, Int));
disp('Red if y <= I, blue if y > I');
xlabel('x (plot red if y <= I)'); ylabel('y');

24. Integration of circle – scatter plot
[Figure: scatter of the N random points, coloured red below the semicircle and blue above]
