This course covers topics on gene regulation, proteomic and transcript profiling, and classification algorithms for gene expression analysis using microarray data.
Final exam syllabus
• Take home
• The entire course, but emphasis will be given to post-midterm lectures:
• HMMs
• Gene finding
• Mass spectrometry
• Microarray analysis
Biol. Data analysis: Review
[Overview figure: Assembly, Protein Sequence Analysis, Sequence Analysis/DNA signals, Gene Finding]
Other static analysis is possible
[Overview figure: Genomic Analysis/Pop. Genetics, Assembly, Protein Sequence Analysis, Sequence Analysis, Gene Finding, ncRNA]
A Static picture of the cell is insufficient
• Each cell is continuously active:
• Genes are being transcribed into RNA
• RNA is translated into proteins
• Proteins are post-translationally (PT) modified and transported
• Proteins perform various cellular functions
• Can we probe the cell dynamically?
• Which transcripts are active?
• Which proteins are active?
• Which proteins interact?
[Figure labels: Gene Regulation, Proteomic profiling, Transcript profiling]
Protein quantification via LC-MS maps
• A peptide/feature can be labeled with the triple (M, T, I): monoisotopic m/z, centroid retention time, and intensity.
• An LC-MS map is a collection of features in the (m/z, retention time) plane.
[Figure: elution profiles of Peptide 1 and Peptide 2, and a map of features plotted against m/z and time]
LC-Map Comparison for Quantification
[Figure: Map 1 (normal) compared with Map 2 (diseased)]
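As a toy illustration of how such maps can be handled in code, here is a hedged NumPy sketch that stores each map as an array of (M, T, I) features and pairs features between two maps; the feature values, the tolerances, and the match_features helper are illustrative assumptions, not material from the lecture.

```python
import numpy as np

# A minimal sketch: an LC-MS map as an array of (M, T, I) features --
# monoisotopic m/z, centroid retention time, intensity (values made up).
map1 = np.array([
    [500.27, 22.1, 1.8e6],
    [612.33, 35.4, 9.2e5],
])
map2 = np.array([
    [500.28, 22.9, 3.5e5],
    [612.34, 36.0, 8.8e5],
])

def match_features(a, b, mz_tol=0.02, rt_tol=2.0):
    """Pair features whose m/z and retention time agree within tolerances."""
    pairs = []
    for i, (mz, rt, _) in enumerate(a):
        d_mz = np.abs(b[:, 0] - mz)
        d_rt = np.abs(b[:, 1] - rt)
        hits = np.where((d_mz <= mz_tol) & (d_rt <= rt_tol))[0]
        if hits.size:
            pairs.append((i, int(hits[0])))
    return pairs

print(match_features(map1, map2))   # [(0, 0), (1, 1)]
```

Matched pairs can then be compared by intensity to quantify a peptide across the normal and diseased samples.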
Quantitation: transcript versus Protein Expression

              Sample 1   Sample 2
  Protein 1      35          4
  Protein 2
  Protein 3

• Our goal is to construct a matrix as shown for proteins, and RNA, and use it to identify differentially expressed transcripts/proteins.
Transcript quantification
• Instead of counting protein molecules, we count active transcripts in the cell.
Quantitation: transcript versus Protein Expression

              Sample 1   Sample 2               Sample 1   Sample 2
  Protein 1      35          4       mRNA 1       100         20
  Protein 2                          mRNA 2
  Protein 3                          mRNA 3
                                     mRNA 4
                                     mRNA 5

• Our goal is to construct a matrix as shown for proteins, and RNA, and use it to identify differentially expressed transcripts/proteins.
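A small NumPy sketch of the goal stated above: assemble the expression matrix and flag rows whose two samples differ by a large fold change. The numbers and the 2-fold cutoff are illustrative assumptions, not the lecture's data.

```python
import numpy as np

# Rows = proteins/transcripts, columns = samples (illustrative values).
expression = np.array([
    [35.0, 4.0],     # Protein 1
    [100.0, 20.0],   # mRNA 1
    [50.0, 48.0],    # an unchanged feature, for contrast
])

# Flag differentially expressed rows by the magnitude of their log2 ratio.
log2_ratio = np.log2(expression[:, 0] / expression[:, 1])
differential = np.abs(log2_ratio) >= 1.0   # at least 2-fold change

print(log2_ratio.round(2))   # [3.13 2.32 0.06]
print(differential)          # [ True  True False]
```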
Gene Expression Data
• Each row corresponds to a gene.
• Each column corresponds to an experiment (sample), with one expression value per gene.
• Can we separate the experiments into two or more classes?
• Given a training set of two classes, can we build a classifier that places a new experiment in one of the two classes?
[Figure: expression matrix with sample columns s1, s2, … and gene rows g]
Formalizing Classification
• Classification problem: find a surface (hyperplane) that will separate the classes.
• Given a new sample point, its class is then determined by which side of the surface it lies on.
• How do we find the hyperplane? How do we find the side that a point lies on?

          1     2     3     4     5     6
   g1     1    .9    .8    .1    .2    .1
   g2    .1     0    .2    .8    .7    .9
Basic geometry
• What is ||x||² ?
• What is x/||x|| ?
• Dot product?
[Figure: vectors x = (x1, x2) and y]
Dot Product
• Let β be a unit vector, ||β|| = 1.
• Recall that βᵀx = ||x|| cos θ, where θ is the angle between β and x.
• What is βᵀx if x is orthogonal (perpendicular) to β?
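A short worked answer, writing θ for the angle between β and x:

$$\beta^T x = \|\beta\|\,\|x\|\cos\theta = \|x\|\cos\theta, \qquad x \perp \beta \;\Rightarrow\; \cos\theta = 0 \;\Rightarrow\; \beta^T x = 0.$$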
Hyperplane
• How can we define a hyperplane L?
• Find the unit vector β that is perpendicular (normal) to the hyperplane.
Points on the hyperplane
• Consider a hyperplane L defined by unit vector β and distance β0.
• Notes:
• For all x ∈ L, xᵀβ must be the same: xᵀβ = β0.
• For any two points x1, x2 ∈ L, (x1 − x2)ᵀβ = 0.
Hyperplane properties
• Given an arbitrary point x, what is the distance from x to the plane L?
• D(x, L) = βᵀx − β0
• When are points x1 and x2 on different sides of the hyperplane?
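A hedged NumPy sketch of these two properties; β, β0, and the points are made up for illustration.

```python
import numpy as np

# Signed distance of a point to the hyperplane L = (beta, beta0),
# where beta is a unit normal vector.
beta = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit normal
beta0 = 1.0                                  # offset of L

def signed_distance(x):
    """D(x, L) = beta^T x - beta0; the sign tells which side of L x lies on."""
    return beta @ x - beta0

x1 = np.array([2.0, 2.0])
x2 = np.array([0.0, 0.0])
print(signed_distance(x1) > 0, signed_distance(x2) > 0)   # True False

# x1 and x2 are on different sides exactly when the signs differ:
print(signed_distance(x1) * signed_distance(x2) < 0)      # True
```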
Separating by a hyperplane
• Input: a training set of +ve and −ve examples.
• Goal: find a hyperplane that separates the two classes.
• Classification: a new point x is +ve if it lies on the +ve side of the hyperplane, −ve otherwise.
• The hyperplane is represented by the line {x : β1x1 + β2x2 − β0 = 0}.
[Figure: +ve and −ve points in the (x1, x2) plane, separated by a line]
Error in classification
• An arbitrarily chosen hyperplane might not separate the two classes. We need to minimize a mis-classification error.
• Error: sum of distances of the misclassified points.
• Let yi = −1 for a misclassified +ve example i, and yi = 1 otherwise.
• Other definitions are also possible.
[Figure: +ve and −ve points with a hyperplane that misclassifies some of them]
Gradient Descent
• The function D(β) defines the error.
• We follow an iterative refinement: in each step, refine β so that the error is reduced.
• Gradient descent is an approach to such iterative refinement.
[Figure: error curve D(β) and its derivative D′(β)]
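One way to write this error and its gradient-descent refinement, using the yi convention from the previous slide (a sketch; the exact form used in the lecture may differ, and η is an assumed step size):

$$D(\beta,\beta_0) = \sum_{i\,\text{misclassified}} y_i\left(\beta^T x_i - \beta_0\right)$$

$$\frac{\partial D}{\partial \beta} = \sum_{i\,\text{misclassified}} y_i\,x_i, \qquad \frac{\partial D}{\partial \beta_0} = -\sum_{i\,\text{misclassified}} y_i$$

$$\beta \leftarrow \beta - \eta\,\frac{\partial D}{\partial \beta}, \qquad \beta_0 \leftarrow \beta_0 - \eta\,\frac{\partial D}{\partial \beta_0}$$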
Classification based on perceptron learning
• Use Rosenblatt’s algorithm to compute the hyperplane L = (β, β0).
• Assign x to class 1 if f(x) = βᵀx − β0 ≥ 0, and to class 2 otherwise.
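A minimal Python sketch of Rosenblatt-style perceptron learning on synthetic, linearly separable data; the data, learning rate, and epoch cap are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))    # +ve class
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))  # -ve class
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)                         # class labels

beta = np.zeros(2)
beta0 = 0.0
eta = 0.1                                                  # learning rate

for _ in range(100):                                       # epochs
    updated = False
    for xi, yi in zip(X, y):
        if yi * (beta @ xi - beta0) <= 0:                  # misclassified
            beta += eta * yi * xi                          # move hyperplane
            beta0 -= eta * yi                              # toward the point
            updated = True
    if not updated:                                        # converged: all
        break                                              # points correct

print(beta, beta0)
print(np.all(np.sign(X @ beta - beta0) == y))              # True if separated
```

Each misclassified point nudges the hyperplane toward its correct side; when the data are separable, a full pass with no updates means every training point is classified correctly.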
Perceptron learning
• If many solutions are possible, it does not choose between solutions.
• If the data is not linearly separable, it will converge to some local minimum.
• Time of convergence is not well understood.
Supervised classification
• We have learnt a generic scheme for understanding biological data:
• Represent each data point (gene expression sample) as a vector in a high-dimensional space.
• In supervised classification, we have two classes of points.
• We separate the classes using a linear surface (hyperplane).
• The perceptron algorithm is one of the simplest to implement.
[Figure: +ve and −ve points in the (x1, x2) plane]
End of Lecture
Linear Discriminant analysis
• Provides an alternative approach to classification with a linear function.
• Project all points, including the means, onto a vector β.
• We want to choose β such that:
• the difference of the projected means is large;
• the variance within each group is small.
[Figure: +ve and −ve points in the (x1, x2) plane]
Choosing the right β
• β1 is a better choice than β2, as the variance within a group is small and the difference of means is large.
• How do we compute the best β?
[Figure: the same +ve/−ve points with two candidate projection directions β1 and β2]
Linear Discriminant analysis
• Fisher Criterion
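For reference, a standard form of the Fisher criterion, under the assumption that m1, m2 denote the class means and S1, S2 the within-class scatter matrices (the slide's own notation may differ):

$$J(\beta) = \frac{\left(\beta^T(m_1 - m_2)\right)^2}{\beta^T\left(S_1 + S_2\right)\beta},
\qquad S_k = \sum_{i \in \text{class } k}(x_i - m_k)(x_i - m_k)^T$$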
LDA cont’d
• What is the projection of a point x onto β?
• Ans: βᵀx
• What is the distance between the projected means?
LDA cont’d
• Fisher Criterion
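A sketch of the step the next slide refers to: setting the gradient of J(β) to zero gives the maximizing direction, up to scale,

$$\beta \;\propto\; (S_1 + S_2)^{-1}(m_1 - m_2),$$

so only a matrix inverse (or a linear solve) is needed.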
LDA
• Therefore, a simple computation (matrix inverse) is sufficient to compute the ‘best’ separating hyperplane.
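A hedged NumPy sketch of that computation on toy two-class data; the data and the use of a midpoint threshold for classification are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(loc=[2, 2], scale=0.7, size=(30, 2))   # class 1 samples
X2 = rng.normal(loc=[-1, 0], scale=0.7, size=(30, 2))  # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)              # class means
S1 = (X1 - m1).T @ (X1 - m1)                           # within-class scatter
S2 = (X2 - m2).T @ (X2 - m2)

beta = np.linalg.solve(S1 + S2, m1 - m2)               # (S1 + S2)^(-1)(m1 - m2)
beta /= np.linalg.norm(beta)                           # unit-length direction

# Classify by which side of the projected midpoint a projected point falls.
threshold = beta @ (m1 + m2) / 2
print(np.mean(X1 @ beta > threshold))                  # fraction of class 1 correct
```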
Supervised classification summary
• Most techniques for supervised classification are based on the notion of a separating hyperplane.
• The ‘optimal’ separation can be computed using various combinatorial (perceptron), algebraic (LDA), or statistical (ML) analyses.
The dynamic nature of the cell
• Proteomic and transcriptomic analyses provide a snapshot of the active proteins (transcripts) in a cell in a particular state (columns of the matrix).
• This snapshot allows us to classify the state of the cell.
• Other queries are also possible (not in the syllabus):
• Which genes behave in a similar fashion? Each row describes the behaviour of a gene under different conditions; this is solved by ‘clustering’ rows.
• Find a subset of genes that is predictive of the state of the cell, using PCA and other dimensionality-reduction techniques.
Silly Quiz
• Who are these people, and what is the occasion?
DNA Sequencing
• DNA is double-stranded.
• The strands are separated, and a polymerase is used to copy the second strand.
• Special bases terminate this process early.
PCA: motivating example
• Consider the expression values of 2 genes over 6 samples.
• Clearly, the expression of g1 is not informative, and it suffices to look at g2 values.
• Dimensionality can be reduced by discarding the gene g1.
[Figure: the 6 samples plotted against g1 and g2]
Principal Components Analysis
• Consider the expression values of 2 genes over 6 samples.
• Clearly, the expression of the two genes is highly correlated.
• Projecting all the points onto a single line could explain most of the data.
• This is a generalization of “discarding the gene”.
Projecting
• Consider the mean m of all points, and a vector φ emanating from the mean.
• Algebraically, this projection onto φ means that every sample x can be represented by a single value φᵀ(x − m).
[Figure: point x, the mean m, and the projection φᵀ(x − m) of x − m onto φ]
Higher dimensions
• Consider a set of 2 (in general, k) orthonormal vectors φ1, φ2, …
• Once projected, each sample x can be represented by a 2- (k-) dimensional vector: (φ1ᵀ(x − m), φ2ᵀ(x − m), …).
[Figure: projection of x − m onto the orthonormal directions φ1, φ2]
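A small NumPy sketch of this projection; the sample values and the two orthonormal directions are illustrative.

```python
import numpy as np

X = np.array([[2.0, 1.9], [1.5, 1.6], [0.2, 0.1],
              [0.9, 1.1], [2.4, 2.3], [0.4, 0.5]])   # 6 samples x 2 genes
m = X.mean(axis=0)                                   # mean of all points

phi1 = np.array([1.0, 1.0]) / np.sqrt(2.0)           # orthonormal directions
phi2 = np.array([1.0, -1.0]) / np.sqrt(2.0)
Phi = np.column_stack([phi1, phi2])                  # columns are phi_1, phi_2

coords = (X - m) @ Phi     # row i = (phi1^T(x_i - m), phi2^T(x_i - m))
print(coords.round(2))
```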
How to project
• The generic scheme allows us to project an m-dimensional surface onto a k-dimensional one.
• How do we select the k ‘best’ dimensions?
• The strategy used by PCA is one that maximizes the variance of the projected points around the mean.
PCA
• Suppose all of the data were to be reduced by projecting onto a single line φ from the mean m.
• How do we select the line φ?
PCA cont’d
• Let each point xk map to x'k = m + ak·φ. We want to minimize the error Σk ||xk − x'k||².
• Observation 1: each point xk maps to x'k = m + (φᵀ(xk − m))·φ, i.e. ak = φᵀ(xk − m).
[Figure: xk, its projection x'k, and the mean m]
Proof of Observation 1
• Differentiating w.r.t. ak
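A sketch of the omitted algebra, assuming the squared-error objective above and a unit vector φ:

$$J = \sum_k \| (m + a_k\varphi) - x_k \|^2
    = \sum_k \left( a_k^2 - 2 a_k\,\varphi^T (x_k - m) + \|x_k - m\|^2 \right)$$

$$\frac{\partial J}{\partial a_k} = 2 a_k - 2\,\varphi^T (x_k - m) = 0
  \;\Longrightarrow\; a_k = \varphi^T (x_k - m).$$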
Minimizing PCA Error
• To minimize the error, we must maximize φᵀSφ, where S is the scatter matrix of the (centered) data.
• At the maximum, Sφ = λφ; that is, λ is an eigenvalue of S and φ the corresponding eigenvector.
• Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.
PCA steps
• X = starting matrix with n columns, m rows
[Figure: matrix X with columns xj]
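A hedged NumPy sketch of the full sequence of steps (centering, scatter matrix, eigendecomposition, projection) on illustrative random data.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 6))                 # m = 5 rows, n = 6 columns

m_vec = X.mean(axis=1, keepdims=True)       # mean over the n samples
Xc = X - m_vec                              # center the data
S = Xc @ Xc.T / (X.shape[1] - 1)            # scatter/covariance matrix (m x m)

eigvals, eigvecs = np.linalg.eigh(S)        # eigh: symmetric eigendecomposition
order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
k = 2
Phi = eigvecs[:, order[:k]]                 # top-k principal directions

Y = Phi.T @ Xc                              # k x n matrix of projected samples
print(eigvals[order][:k])                   # variance captured per direction
print(Y.shape)                              # (2, 6)
```

Using np.linalg.eigh exploits the symmetry of S; for large matrices, an SVD of the centered data is a common alternative route to the same directions.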