Principal Component Analysis

Principal Component Analysis. Royi Itzhack. Algorithms in Computational Biology

Linear algebra in MATLAB: a multidimensional matrix is defined with square brackets, and its elements are accessed with parentheses.

Addition is element by element; multiplication follows the rules of matrix multiplication; operations with scalars apply to the whole matrix.

Basis: a set of vectors is called a basis if it spans the whole space and is linearly independent. Linear independence can be checked in two ways: the determinant of the matrix whose rows are the basis vectors is nonzero, or the number of nonzero rows of that matrix after Gaussian elimination still equals the dimension of the vectors.

Orthogonal basis: all the distinct vectors that make up the basis are perpendicular to one another, i.e. their inner product equals 0. Normalizing a vector: dividing the elements of the vector by its 2-norm, so that the inner product of the vector with itself equals 1.

Defining a norm: norm(vector, k), where k selects the norm. Normalizing a vector: element-by-element division by a scalar (the "." operator). Scalar product.
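The normalization described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the MATLAB used in the lecture; the helper names `norm`, `normalize` and `dot` and the sample vector are assumptions made for the example.

```python
def norm(vector, k=2):
    """k-norm of a vector: (sum of |x_i|^k) raised to the power 1/k."""
    return sum(abs(x) ** k for x in vector) ** (1.0 / k)


def normalize(vector):
    """Divide every element by the 2-norm so the result has unit length."""
    length = norm(vector, 2)
    return [x / length for x in vector]


def dot(u, v):
    """Inner (scalar) product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


v = [3.0, 4.0]          # hypothetical example vector
unit = normalize(v)     # each element divided by norm(v) = 5
self_product = dot(unit, unit)  # close to 1 for a normalized vector
```

After normalization the inner product of the vector with itself is 1 up to floating-point rounding, which is the property the slide asks for.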

Matrix arithmetic, etc.
• Product: A*B. If either factor is 1x1, i.e., a scalar, then this is scalar multiplication.
• Transpose: A'. This is the conjugate transpose for a complex matrix.
• Inverse: A^(-1) or inv(A). There is also a pseudoinverse, pinv, for nonsquare matrices.
• Determinant: det(A)

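Characteristic equation: Ax = λx, where x denotes an eigenvector and λ denotes the eigenvalue that corresponds to that vector.

Example:

Continued: for a nonzero solution x to exist, the determinant of the matrix (A - λI) must equal zero.

Computing the characteristic equation from the determinant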

Computing the eigenvectors: to find the vector that corresponds to λ = 3, substitute the eigenvalue into the characteristic matrix and check which vector gives zero when multiplied by that matrix. Because the matrix is singular there are infinitely many solutions, and one basis of the solution space has to be chosen.

Continued... so we set t = 1 and obtain the vector; hence every vector of the form (t, t), in which both components are equal, is an eigenvector for the eigenvalue 3.

Continued... the same process is carried out for the second eigenvalue, -1.
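The eigenvalue procedure above can be checked numerically. The example matrix from the slides did not survive, so the sketch below assumes the matrix [[1, 2], [2, 1]], which is one matrix that has eigenvalues 3 and -1 and the eigenvector (1, 1) for the eigenvalue 3; the function names are illustrative only.

```python
import math


def eig_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from the characteristic equation
    lambda^2 - trace*lambda + det = 0 (real roots are assumed)."""
    trace = a + d
    det = a * d - b * c
    root = math.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


A = [[1, 2], [2, 1]]             # assumed example matrix, not from the slides
lam1, lam2 = eig_2x2(1, 2, 2, 1)

# An eigenvector must satisfy A x = lambda x.
first = matvec(A, [1, 1])        # equals 3 * (1, 1)
second = matvec(A, [1, -1])      # equals -1 * (1, -1)
```

Any multiple of (1, 1) passes the same check, which matches the remark that the solution space is infinite and one representative is chosen.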

The dimension problem
• Suppose we want to calculate the probability of having a severe disease based on M parameters: age, height, weight, blood pressure, country, treatment history, genetics, etc.
• We calculate M features for each sample; with N samples, the data can be described as an MxN matrix.
• Probably only a small number of features are important. How can we find them?

The dimension problem
• Some features are not informative.
• Constant features: the variance of the feature vector is zero or close to zero. Suppose that in our experiment we record the birth country of the samples, and 98% of them were born in Israel while 2% were born in other countries.
• Features that are linearly dependent on other features, such as blood pressure and weight.
• Informative features have high variance between groups and low variance within each group.
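One simple way to flag the constant features mentioned above is to compute the variance of every feature and compare it with a threshold. The sketch below is only an illustration: the sample values, the column meanings and the threshold are all made up, and rows are assumed to be samples.

```python
from statistics import pvariance


def low_variance_features(samples, threshold):
    """Return the column indices whose population variance is <= threshold.

    `samples` is a list of rows, one row per sample (an assumption of this sketch).
    """
    columns = list(zip(*samples))
    return [j for j, column in enumerate(columns) if pvariance(column) <= threshold]


# Hypothetical data: age, born-in-Israel flag (1 or 0), weight in kg.
samples = [
    (30, 1, 70.0),
    (45, 1, 82.5),
    (52, 1, 90.0),
    (61, 1, 77.0),
]

dropped = low_variance_features(samples, threshold=1e-9)
```

Here only the birth-country column is flagged, because every sample has the same value. A variance filter does not detect linearly dependent features; those show up later as near-zero eigenvalues of the covariance matrix.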

Algebraic Interpretation – 1D
• Given m points in an n-dimensional space, for large n, how does one project onto a 1-dimensional space?
• Choose a line that fits the data so that the points are spread out well along the line.

Algebraic Interpretation – 1D
• Given m points in an n-dimensional space, for large n, how does one project onto a low-dimensional space while preserving broad trends in the data and allowing it to be visualized?
• Formally, minimize the sum of squares of the distances to the line.
• Why the sum of squares? Because it allows fast minimization, assuming the line passes through 0.

Principal Components
• All principal components (PCs) start at the origin of the ordinate axes.
• The first PC is the direction of maximum variance from the origin.
• Subsequent PCs are orthogonal to the first PC and describe the maximum residual variance.
[Figure: two plots of Wavelength 2 against Wavelength 1 (both axes from 0 to 30), one showing PC 1 and the other PC 2.]

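Variance, dependence
• Variance: a measure of the spread around the mean.
• Conditional probability formula: P(A|B) = P(A∩B) / P(B).
• Dependence: a pair A, B is called independent if P(A|B) = P(A) or, equivalently (trivially, from the formula), P(A∩B) = P(A)*P(B).
Example: what is the probability that the sum of two dice rolls is 6, given that the first roll came up 4? Are the events dependent?
Let A be the event that the sum of the two results is 6.
Let B be the event that the result of the first roll is 4.
P(A) = 5/36, while P(A|B) = 1/6 (the second roll must be 2), so the events are dependent!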
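The dice example can be verified by enumerating all 36 equally likely outcomes. This is a small sketch using exact fractions; the function names are chosen for the illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first roll, second roll) pairs.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a predicate on an outcome."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))


def sum_is_six(outcome):        # event A
    return outcome[0] + outcome[1] == 6


def first_is_four(outcome):     # event B
    return outcome[0] == 4


p_a = prob(sum_is_six)
p_b = prob(first_is_four)
p_a_and_b = prob(lambda o: sum_is_six(o) and first_is_four(o))
p_a_given_b = p_a_and_b / p_b   # conditional probability P(A|B)

dependent = p_a_given_b != p_a  # independence would require equality
```

Using `Fraction` keeps the comparison exact, so the dependence test is not affected by floating-point rounding.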

Joint distribution, covariance
• When two variables x, y are involved, we build a table in which every pair of values (one from x and one from y) has a probability.
• The entries of the table sum to 1.
• The row sums and column sums give the marginal distributions.
• Covariance: a measure of the association between variables (how strongly they tend to vary together).
• Independence => no correlation.

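Covariance matrix
1. A two-digit number is formed at random from the digits 1, 2, 3, 4. Let X be the number of distinct digits that appear in the number and Y the number of times the digit 1 appears. Find:
a. the joint distribution of the pair (X, Y);
b. whether X and Y are independent;
c. the covariance COV(X, Y).
X and Y are dependent, since P(X|Y) != P(X); for example, P(X=1|Y=2) = 1 while P(X=1) = 0.25.
E(X) = 0.25*1 + 0.75*2 = 1.75
E(Y) = 0*9/16 + 1*6/16 + 2*1/16 = 0.5
Cov(X,Y) = E(XY) - E(X)*E(Y) = 2*1*6/16 + 1*2*1/16 - 1.75*0.5 = 0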

Covariance Matrix
• Each entry (i, j) is cov(xi, xj).
• Each diagonal entry (i, i) is var(xi).
• In the previous question:
• V(X) = 1*0.25 + 4*0.75 - 1.75*1.75 = 0.1875
• V(Y) = 1*6/16 + 4*1/16 - 0.5*0.5 = 0.375
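The two-digit exercise can be reproduced by enumeration. The sketch below assumes, as the probabilities 9/16, 6/16 and 1/16 in the solution imply, that the two digits are drawn independently with repetition allowed, giving 16 equally likely numbers. The helper names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# 16 equally likely two-digit numbers over the digits 1..4 (repetition allowed).
numbers = list(product([1, 2, 3, 4], repeat=2))
p = Fraction(1, len(numbers))


def x_of(number):
    """X: how many distinct digits appear in the number."""
    return len(set(number))


def y_of(number):
    """Y: how many times the digit 1 appears."""
    return number.count(1)


def expect(f):
    """Expected value of f over the uniform distribution on `numbers`."""
    return sum(p * f(n) for n in numbers)


# Joint distribution table: (x, y) -> probability.
counts = Counter((x_of(n), y_of(n)) for n in numbers)
joint = {pair: Fraction(c, len(numbers)) for pair, c in counts.items()}

e_x, e_y = expect(x_of), expect(y_of)
cov_xy = expect(lambda n: x_of(n) * y_of(n)) - e_x * e_y
var_x = expect(lambda n: x_of(n) ** 2) - e_x ** 2
var_y = expect(lambda n: y_of(n) ** 2) - e_y ** 2

# Covariance matrix: variances on the diagonal, covariances off the diagonal.
cov_matrix = [[var_x, cov_xy], [cov_xy, var_y]]
```

The covariance comes out as zero even though X and Y are dependent, which illustrates that zero covariance does not imply independence.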

The Algorithm
• Step 1: Calculate the covariance matrix of the observation matrix.
• Step 2: Calculate the eigenvalues and the corresponding eigenvectors.
• Step 3: Sort the eigenvectors by the magnitude of their eigenvalues.
• Step 4: Project the data points onto those vectors.

PCA – Step 1: Covariance matrix C, computed from the data matrix
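Step 1 might look as follows in code. The formula on the slide was lost, so this sketch assumes the usual sample covariance with division by n - 1, and assumes that rows are observations and columns are variables (transpose first if the data is stored as features by samples). The data values are invented.

```python
def covariance_matrix(data):
    """Sample covariance matrix of `data` (rows = observations, columns = variables).

    Each column is centered on its mean; entry (i, j) is the average product of
    centered columns i and j, using the n - 1 normalization (an assumption here).
    """
    n = len(data)
    k = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    centered = [[row[j] - means[j] for j in range(k)] for row in data]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(k)]
        for i in range(k)
    ]


# Hypothetical data: 3 observations of 2 variables.
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 8.0]]
C = covariance_matrix(data)
```

The result is symmetric, with the variances of the two variables on the diagonal.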

Linear Algebra Review – Eigenvalue and Eigenvector
• C: a square n×n matrix
• Cx = λx, where λ is an eigenvalue of C and x is the corresponding eigenvector

PCA – Step 3 • Sort eigenvectors by the magnitude of their eigenvalues
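Step 3 amounts to reordering the eigenvalue and eigenvector pairs together. The sketch below assumes the pairs are already available (the numbers are made up) and sorts from largest to smallest, which is the order normally wanted for PCA.

```python
# Hypothetical eigen-decomposition: eigenvectors[i] belongs to eigenvalues[i].
eigenvalues = [0.5, 4.0, 1.5]
eigenvectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Indices that order the eigenvalues from largest to smallest.
order = sorted(range(len(eigenvalues)), key=lambda i: eigenvalues[i], reverse=True)

# Apply the same permutation to both lists so the pairs stay matched.
sorted_values = [eigenvalues[i] for i in order]
sorted_vectors = [eigenvectors[i] for i in order]
```

Sorting an index list rather than the values themselves keeps each eigenvector attached to its eigenvalue.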

PCA – Step 4: Project the input data onto the principal components. A new data value (score) is generated for each observation and each principal component. It is a linear combination of the observation's variables, each weighted by the loading (between -1 and 1) of that variable on the principal component.
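The projection in Step 4 can be sketched as a sum of loading times variable for every observation and component. The data and the two unit-length components below are invented for the illustration, and the sketch assumes the data is centered on the column means before projecting.

```python
import math


def project(data, components):
    """Scores of each observation on each component.

    `data` has one row per observation; `components` is a list of unit-length
    loading vectors. Centering on the column means is an assumption of this sketch.
    """
    n = len(data)
    k = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    return [
        [sum((row[j] - means[j]) * comp[j] for j in range(k)) for comp in components]
        for row in data
    ]


s = 1 / math.sqrt(2)
components = [[s, s], [s, -s]]                  # hypothetical orthonormal loadings
data = [[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]]

scores = project(data, components)              # one row of scores per observation
```

Each row of `scores` holds the coordinates of one observation in the rotated axis system defined by the components.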

PCA: General
From k original variables: x1, x2, ..., xk
Produce k new variables: y1, y2, ..., yk
y1 = a11x1 + a12x2 + ... + a1kxk
y2 = a21x1 + a22x2 + ... + a2kxk
...
yk = ak1x1 + ak2x2 + ... + akkxk
such that:
• the yk's are uncorrelated (orthogonal)
• y1 explains as much as possible of the original variance in the data set
• y2 explains as much as possible of the remaining variance
• etc.

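[Figure: the 1st principal component, y1, and the 2nd principal component, y2.]

[Figure: PCA scores. A point with original coordinates (xi1, xi2) has scores yi,1 and yi,2.]

[Figure: PCA eigenvalues λ1 and λ2.]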

PCA: Another Explanation
From k original variables: x1, x2, ..., xk
Produce k new variables: y1, y2, ..., yk
y1 = a11x1 + a12x2 + ... + a1kxk
y2 = a21x1 + a22x2 + ... + a2kxk
...
yk = ak1x1 + ak2x2 + ... + akkxk
The yk's are the principal components, such that:
• the yk's are uncorrelated (orthogonal)
• y1 explains as much as possible of the original variance in the data set
• y2 explains as much as possible of the remaining variance
• etc.

The command imagesc() displays a matrix as a heat map: high values are red and low values are blue. Example + MATLAB: consider the following microarray, which contains data for 100 genes and 60 patients, 30 healthy and 30 sick.

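cov(X): computes the covariance matrix. [V E] = eig(C) returns two matrices: a diagonal matrix E holding the eigenvalues (on the diagonal only), and a matrix V whose columns are the eigenvectors corresponding to the eigenvalues in E. For a symmetric matrix such as a covariance matrix the eigenvalues typically come out in ascending order, although eig does not guarantee a sorted result in general. diag(X): returns a vector holding the diagonal of the matrix.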

Finding the indices of a vector that satisfy a condition: find(logical condition on the vector). Finding the PCs and using them to define the rotation matrix.

Example from 2005b: perform PCA on the following data set: X = (0,0), (1,1), (2,2), (3,3), (-1,-1), (-2,-2), (-3,-3). Mean(X) = 0
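The four steps can be traced on this data set with a small script. This is a sketch, not the lecture's MATLAB solution: it assumes the n - 1 normalization for the covariance and uses the closed-form eigen-decomposition of a symmetric 2x2 matrix instead of a general eigenvalue routine. The function names are illustrative.

```python
import math


def covariance_2d(points):
    """Mean and sample covariance matrix (n - 1 normalization) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), [[sxx, sxy], [sxy, syy]]


def eig_symmetric_2x2(c):
    """Eigenpairs of [[a, b], [b, d]] as (value, unit vector), largest value first."""
    a, b, d = c[0][0], c[0][1], c[1][1]
    if b == 0:  # already diagonal: the coordinate axes are the eigenvectors
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
        return sorted(pairs, key=lambda pair: pair[0], reverse=True)
    mid = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    pairs = []
    for lam in (mid + radius, mid - radius):
        v = (b, lam - a)  # solves (C - lam*I) v = 0 when b != 0
        length = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / length, v[1] / length)))
    return pairs


points = [(0, 0), (1, 1), (2, 2), (3, 3), (-1, -1), (-2, -2), (-3, -3)]

(mx, my), C = covariance_2d(points)       # Step 1
pairs = eig_symmetric_2x2(C)              # Steps 2 and 3 (sorted, largest first)
scores = [                                # Step 4: project the centered points
    [(x - mx) * v[0] + (y - my) * v[1] for _, v in pairs]
    for x, y in points
]
```

All the points lie on the line y = x, so the first component points along (1, 1) and carries all the variance, while the second eigenvalue and every score on the second component are zero up to rounding. The data is effectively one-dimensional.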
