
Computational BioMedical Informatics



  1. Computational BioMedical Informatics SCE 5095: Special Topics Course Instructor: Jinbo Bi Computer Science and Engineering Dept.

  2. Course Information • Instructor: Dr. Jinbo Bi • Office: ITEB 233 • Phone: 860-486-1458 • Email: jinbo@engr.uconn.edu • Web: http://www.engr.uconn.edu/~jinbo/ • Time: Mon / Wed. 2:00pm – 3:15pm • Location: CAST 204 • Office hours: Mon. 3:30-4:30pm • HuskyCT • http://learn.uconn.edu • Login with your NetID and password

  3. Review of previous classes • Introduced unsupervised learning – particularly, cluster analysis techniques • Discussed one important application of cluster analysis – cardiac ultrasound view recognition • Reviewed papers on medical, health, or public health topics • In this class, we start to discuss dimension reduction

  4. Topics today • What is feature reduction? • Why feature reduction? • More general motivations of component analysis • Feature reduction algorithms • Principal component analysis • Canonical correlation analysis • Independent component analysis

  5. What is feature reduction? • Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space. • The criterion for feature reduction can differ based on the problem setting. • Unsupervised setting: minimize the information loss • Supervised setting: maximize the class discrimination • Given a set of data points $x_1, \dots, x_n$ of $p$ variables, compute a linear transformation (projection) $G \in \mathbb{R}^{p \times d}$ that maps each $x \in \mathbb{R}^p$ to $y = G^T x \in \mathbb{R}^d$, with $d < p$.

  6. What is feature reduction? [Diagram: original data → linear transformation $y = G^T x$ → reduced data]
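A minimal NumPy sketch of this mapping (the data matrix and projection below are made-up illustrations, not taken from the slides): each row of X is a data point x, and each row of Y = XG is the corresponding reduced point G^T x.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 samples, p = 5 original variables (one per row)
G = rng.normal(size=(5, 2))     # projection matrix, p x d with d = 2

Y = X @ G                       # reduced data: row i is G^T x_i
print(X.shape, "->", Y.shape)   # (100, 5) -> (100, 2)
```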

  7. High-dimensional data [Example images: gene expression data, face images, handwritten digits]

  8. Feature reduction versus feature selection • Feature reduction • All original features are used • The transformed features are linear combinations of the original features. • Feature selection • Only a subset of the original features are used.
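To make the contrast concrete, a small illustrative sketch (hypothetical data, not from the slides): feature selection keeps a subset of the original columns, while feature reduction builds each new feature from all of them.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))        # 6 samples, 4 original features

# Feature selection: keep only a subset of the original columns (here features 0 and 2)
X_selected = X[:, [0, 2]]

# Feature reduction: each new feature is a linear combination of ALL original features
G = rng.normal(size=(4, 2))
X_reduced = X @ G

print(X_selected.shape, X_reduced.shape)   # (6, 2) (6, 2)
```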

  9. Why feature reduction? • Most machine learning and data mining techniques may not be effective for high-dimensional data • Curse of Dimensionality • Query accuracy and efficiency degrade rapidly as the dimension increases. • The intrinsic dimension may be small. • For example, the number of genes responsible for a certain type of disease may be small.

  10. Why feature reduction? • Understanding: reason about or obtain insights from data • Combinations of observed variables may be more effective bases for insights, even if their physical meaning is obscure • Visualization: projection of high-dimensional data onto 2D or 3D, since the raw data contain too much noise • Data compression: reduce the data to a smaller set of factors for efficient storage and retrieval • Better representation of the data without losing much information • Noise removal: more effective data analyses can be built on the reduced-dimensional space (classification, clustering, pattern recognition), with a positive effect on query accuracy

  11. Application of feature reduction • Face recognition • Handwritten digit recognition • Text mining • Image retrieval • Microarray data analysis • Protein classification

  12. More general motivations • We study phenomena that cannot be directly observed • ego, personality, intelligence in psychology • Underlying factors that govern the observed data • We want to identify and operate with underlying latent factors rather than the observed data • E.g. topics in news articles • Transcription factors in genomics • We want to discover and exploit hidden relationships • “beautiful car” and “gorgeous automobile” are closely related • So are “driver” and “automobile” • But does your search engine know this? • Reduces noise and error in results

  13. More general motivations • Discover a new set of factors/dimensions/axes against which to represent, describe or evaluate the data • For more effective reasoning, insights, or better visualization • Reduce noise in the data • Typically a smaller set of factors: dimension reduction • Better representation of data without losing much information • Can build more effective data analyses on the reduced-dimensional space: classification, clustering, pattern recognition • Factors are combinations of observed variables • May be more effective bases for insights, even if physical meaning is obscure • Observed data are described in terms of these factors rather than in terms of original variables/dimensions

  14. Feature reduction algorithms • Unsupervised • Principal component analysis (PCA) • Canonical correlation analysis (CCA) • Independent component analysis (ICA) • Supervised • Linear discriminant analysis (LDA) • Sparse support vector machine (SSVM) • Semi-supervised • Research topic

  15. Feature reduction algorithms • Linear • Principal Component Analysis (PCA) • Linear Discriminant Analysis (LDA) • Canonical Correlation Analysis (CCA) • Independent component analysis (ICA) • Nonlinear • Nonlinear feature reduction using kernels • Kernel PCA, kernel CCA, … • Manifold learning • LLE, Isomap

  16. Basic Concept • Areas of variance in data are where items can be best discriminated and key underlying phenomena observed • Areas of greatest “signal” in the data • If two items or dimensions are highly correlated or dependent • They are likely to represent highly related phenomena • If they tell us about the same underlying variance in the data, combining them to form a single measure is reasonable • Parsimony • Reduction in Error • So we want to combine related variables, and focus on uncorrelated or independent ones, especially those along which the observations have high variance • We want a smaller set of variables that explain most of the variance in the original data, in more compact and insightful form

  17. Basic Concept • What if the dependences and correlations are not so strong or direct? • And suppose you have 3 variables, or 4, or 5, or 10000? • Look for the phenomena underlying the observed covariance/co-dependence in a set of variables • Once again, phenomena that are uncorrelated or independent, and especially those along which the data show high variance • These phenomena are called “factors” or “principal components” or “independent components,” depending on the methods used • Factor analysis: based on variance/covariance/correlation • Independent Component Analysis: based on independence

  18. What is Principal Component Analysis? • Principal component analysis (PCA) • Reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables • Retains most of the sample's information. • Useful for the compression and classification of data. • By information we mean the variation present in the sample, given by the correlations between the original variables.

  19. Principal Component Analysis • Most common form of factor analysis • The new variables/dimensions • are linear combinations of the original ones • are uncorrelated with one another • Orthogonal in original dimension space • capture as much of the original variance in the data as possible • are called Principal Components • are ordered by the fraction of the total information each retains

  20. Some Simple Demos • http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html

  21. What are the new axes? [Figure: data plotted against Original Variable A and Original Variable B, with new orthogonal axes PC 1 and PC 2 overlaid] • Orthogonal directions of greatest variance in data • Projections along PC 1 discriminate the data most along any one axis

  22. Principal Components • First principal component is the direction of greatest variability (covariance) in the data • Second is the next orthogonal (uncorrelated) direction of greatest variability • So first remove all the variability along the first component, and then find the next direction of greatest variability • And so on …

  23. Principal Components Analysis (PCA) • Principle • Linear projection method to reduce the number of parameters • Transform a set of correlated variables into a new set of uncorrelated variables • Map the data into a space of lower dimensionality • Form of unsupervised learning • Properties • It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables • New axes are orthogonal and represent the directions of maximum variability

  24. Computing the Components • Data points are vectors in a multidimensional space • The projection of a vector $x$ onto an axis (dimension) $u$ is $u^T x$ • The direction of greatest variability is the one in which the average squared projection is greatest • I.e. $u$ such that $E((u^T x)^2)$ over all $x$ is maximized • (we subtract the mean along each dimension and center the original axis system at the centroid of all data points, for simplicity) • This direction $u$ is the direction of the first principal component

  25. Computing the Components • $E((u^T x)^2) = E((u^T x)(u^T x)) = E(u^T x\, x^T u)$ • The matrix $C = x\,x^T$ (with $x$ the centered data matrix) contains the correlations (similarities) of the original axes based on how the data values project onto them • So we are looking for $u$ that maximizes $u^T C u$, subject to $u$ being unit-length • It is maximized when $u$ is the principal eigenvector of the matrix $C$, in which case $u^T C u = u^T \lambda u = \lambda$ if $u$ is unit-length, where $\lambda$ is the principal eigenvalue of the covariance matrix $C$ • The eigenvalue denotes the amount of variability captured along that dimension
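A short NumPy check of this step, assuming the data have already been collected into a centered matrix (the data here are synthetic): build C, take its principal eigenvector u, and verify that u^T C u equals the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))            # 200 samples, 3 variables
X = X - X.mean(axis=0)                   # center each variable, as the slides assume

C = X.T @ X / X.shape[0]                 # covariance matrix (3 x 3)
eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
u = eigvecs[:, -1]                       # principal eigenvector (unit length)

print(u @ C @ u, eigvals[-1])            # both equal the principal eigenvalue lambda
```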

  26. Why the Eigenvectors? Maximize $u^T x x^T u$ s.t. $u^T u = 1$. Construct the Lagrangian $u^T x x^T u - \lambda\, u^T u$. Setting the vector of partial derivatives to zero gives $x x^T u - \lambda u = (x x^T - \lambda I)\, u = 0$. Since $u \neq 0$, $u$ must be an eigenvector of $x x^T$ with eigenvalue $\lambda$.

  27. Singular Value Decomposition The first root is called the principal eigenvalue, which has an associated orthonormal ($u^T u = 1$) eigenvector $u$. Subsequent roots are ordered such that $\lambda_1 > \lambda_2 > \dots > \lambda_M$, with rank($D$) non-zero values. The eigenvectors form an orthonormal basis, i.e. $u_i^T u_j = \delta_{ij}$. The eigenvalue decomposition: $C = x x^T = U \Sigma U^T$, where $U = [u_1, u_2, \dots, u_M]$ and $\Sigma = \mathrm{diag}[\lambda_1, \lambda_2, \dots, \lambda_M]$. Similarly, the eigenvalue decomposition of $x^T x = V \Sigma V^T$. The SVD is closely related to the above: $x = U \Sigma^{1/2} V^T$, with left singular vectors $U$, right singular vectors $V$, and singular values equal to the square roots of the eigenvalues.
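A hedged numerical sketch of the relationship stated above, on synthetic data: the left singular vectors of a centered data matrix x match the eigenvectors of x x^T (up to sign), and the singular values are the square roots of its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 50))                     # p = 4 variables, n = 50 samples as columns
x = x - x.mean(axis=1, keepdims=True)            # center each variable

# Eigenvalue decomposition of C = x x^T
eigvals, U_eig = np.linalg.eigh(x @ x.T)
eigvals, U_eig = eigvals[::-1], U_eig[:, ::-1]   # sort in descending order

# Singular value decomposition of x
U_svd, s, Vt = np.linalg.svd(x, full_matrices=False)

print(np.allclose(s**2, eigvals))                # singular values squared = eigenvalues
print(np.allclose(np.abs(U_svd), np.abs(U_eig))) # same directions, up to sign
```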

  28. Computing the Components • Similarly for the next axis, etc. • So, the new axes are the eigenvectors of the covariance matrix of the original variables, which captures the similarities of the original variables based on how data samples project to them • Geometrically: centering followed by rotation • Linear transformation

  29. PCs, Variance and Least-Squares • The first PC retains the greatest amount of variation in the sample • The kth PC retains the kth greatest fraction of the variation in the sample • The kth largest eigenvalue of the correlation matrix C is the variance in the sample along the kth PC • The least-squares view: PCs are a series of linear least squares fits to a sample, each orthogonal to all previous ones

  30. How Many PCs? • For n original dimensions, correlation matrix is nxn, and has up to n eigenvectors. So n PCs. • Where does dimensionality reduction come from?

  31. Dimensionality Reduction • Can ignore the components of lesser significance. • You do lose some information, but if the eigenvalues are small, you don’t lose much • p dimensions in original data • calculate p eigenvectors and eigenvalues • choose only the first d eigenvectors, based on their eigenvalues • final data set has only d dimensions
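A compact sketch of this recipe (function and variable names are illustrative, not from the slides): compute all p eigenpairs of the covariance matrix, keep the d eigenvectors with the largest eigenvalues, and project.

```python
import numpy as np

def pca_reduce(X, d):
    """Project X (n samples x p features) onto its first d principal components."""
    Xc = X - X.mean(axis=0)                    # center the data
    C = Xc.T @ Xc / Xc.shape[0]                # covariance matrix (p x p)
    eigvals, eigvecs = np.linalg.eigh(C)       # all p eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]          # sort by eigenvalue, descending
    G = eigvecs[:, order[:d]]                  # keep only the first d eigenvectors
    explained = eigvals[order[:d]].sum() / eigvals.sum()
    return Xc @ G, G, explained                # final data set has only d dimensions

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))   # correlated features
Y, G, explained = pca_reduce(X, d=3)
print(Y.shape, round(explained, 3))            # (100, 3) and the fraction of variance kept
```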

  32. Eigenvectors of a Correlation Matrix

  33. Geometric picture of principal components • the 1st PC is a minimum distance fit to a line in X space • the 2nd PC is a minimum distance fit to a line in the plane perpendicular to the 1st PC PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous.

  34. Optimality property of PCA [Diagram: the original data is mapped to a lower dimension (dimension reduction) and then mapped back (reconstruction)]

  35. Optimality property of PCA Main theoretical result: The matrix $G$ consisting of the first $d$ eigenvectors of the covariance matrix $S$ solves the following min problem: $\min_{G \in \mathbb{R}^{p \times d},\, G^T G = I_d} \sum_{i=1}^n \| x_i - G G^T x_i \|_2^2$ (the reconstruction error). The PCA projection minimizes the reconstruction error among all linear projections of size $d$.
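A small numerical illustration of this optimality claim (a sketch on synthetic data, not a proof): the reconstruction error of the PCA basis G is never worse than that of a random orthonormal basis of the same size d.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))   # 200 samples, correlated features
Xc = X - X.mean(axis=0)
d = 3

def reconstruction_error(Xc, G):
    # sum_i || x_i - G G^T x_i ||^2
    return np.sum((Xc - Xc @ G @ G.T) ** 2)

# PCA basis: first d eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
G_pca = eigvecs[:, np.argsort(eigvals)[::-1][:d]]

# Random orthonormal basis of the same size, via QR
G_rand, _ = np.linalg.qr(rng.normal(size=(8, d)))

print(reconstruction_error(Xc, G_pca) <= reconstruction_error(Xc, G_rand))   # True
```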

  36. Applications of PCA • Eigenfaces for recognition. Turk and Pentland. 1991. • Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001. • Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.

  37. PCA applications - Eigenfaces • the principal eigenface looks like a bland, androgynous average human face http://en.wikipedia.org/wiki/Image:Eigenfaces.png

  38. Eigenfaces – Face Recognition • When properly weighted, eigenfaces can be summed together to create an approximate gray-scale rendering of a human face. • Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces • Hence eigenfaces provide a means of applying data compression to faces for identification purposes. • Similarly, Expert Object Recognition in Video

  39. PCA for image compression [Figure: reconstructions of an image using d = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, alongside the original image]
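One way such a figure could be produced, as a hedged sketch: treat the rows of a grayscale image as samples, keep d principal components, and reconstruct. The image below is a synthetic placeholder, not the image from the slide.

```python
import numpy as np

rng = np.random.default_rng(6)
img = rng.normal(size=(64, 64)).cumsum(axis=0).cumsum(axis=1)   # smooth synthetic "image"

def pca_compress(img, d):
    mean = img.mean(axis=0)
    Xc = img - mean                                   # rows of the image act as samples
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
    G = eigvecs[:, np.argsort(eigvals)[::-1][:d]]     # first d principal directions
    return Xc @ G @ G.T + mean                        # reconstruction with d components

for d in (1, 2, 4, 8, 16, 32, 64):
    err = np.linalg.norm(img - pca_compress(img, d)) / np.linalg.norm(img)
    print(f"d={d:3d}  relative reconstruction error = {err:.4f}")   # shrinks as d grows
```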

  40. Topics today • Feature reduction algorithms • Principal component analysis • Canonical correlation analysis • Independent component analysis

  41. Canonical correlation analysis (CCA) • CCA was developed first by H. Hotelling. • H. Hotelling. Relations between two sets of variates. Biometrika, 28:321-377, 1936. • CCA measures the linear relationship between two multidimensional variables. • CCA finds two bases, one for each variable, that are optimal with respect to correlations. • Applications in economics, medical studies, bioinformatics and other areas.

  42. Canonical correlation analysis (CCA) • Two multidimensional variables • Two different measurements on the same set of objects • Web images and associated text • Protein (or gene) sequences and related literature (text) • Protein sequence and corresponding gene expression • In classification: feature vector and class label • Two measurements on the same object are likely to be correlated. • May not be obvious on the original measurements. • Find the maximum correlation on transformed space.

  43. Canonical Correlation Analysis (CCA) [Diagram: each of the two measurements on the same objects is transformed (projected), and the correlation is computed between the transformed data]

  44. Problem definition • Find two sets of basis vectors, one for x and the other for y, such that the correlations between the projections of the variables onto these basis vectors are maximized. Given paired observations $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}^q$, compute two basis vectors $w_x \in \mathbb{R}^p$ and $w_y \in \mathbb{R}^q$.

  45. Problem definition • Compute the two basis vectors so that the correlation of the projections onto these vectors is maximized: $\rho = \max_{w_x, w_y} \operatorname{corr}(w_x^T x,\; w_y^T y)$

  46. Algebraic derivation of CCA The optimization problem is equivalent to $\max_{w_x, w_y} \dfrac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x}\,\sqrt{w_y^T C_{yy} w_y}}$, where $C_{xx}$ and $C_{yy}$ are the sample covariance matrices of $x$ and $y$, and $C_{xy}$ is their sample cross-covariance matrix.

  47. Geometric interpretation of CCA • The Geometry of CCA Maximization of the correlation is equivalent to the minimization of the distance.

  48. Algebraic derivation of CCA The optimization problem is equivalent to $\max_{w_x, w_y} w_x^T C_{xy} w_y$ s.t. $w_x^T C_{xx} w_x = 1$ and $w_y^T C_{yy} w_y = 1$.
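A minimal NumPy sketch of solving this problem via the standard reduction to an eigenvalue problem (synthetic two-view data; variable names are illustrative, not from the slides): w_x is the leading eigenvector of C_xx^{-1} C_xy C_yy^{-1} C_yx, and w_y = C_yy^{-1} C_yx w_x up to scale.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
z = rng.normal(size=(n, 1))                          # shared latent signal
X = np.hstack([z + 0.5 * rng.normal(size=(n, 1)),    # view 1: one column tracks z
               rng.normal(size=(n, 2))])
Y = np.hstack([z + 0.5 * rng.normal(size=(n, 1)),    # view 2: one column tracks z
               rng.normal(size=(n, 1))])

Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
Cxx, Cyy = Xc.T @ Xc / n, Yc.T @ Yc / n
Cxy = Xc.T @ Yc / n

# w_x: leading eigenvector of Cxx^{-1} Cxy Cyy^{-1} Cyx
M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
eigvals, eigvecs = np.linalg.eig(M)
wx = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
wy = np.linalg.solve(Cyy, Cxy.T @ wx)                # w_y, determined up to scale

u, v = Xc @ wx, Yc @ wy
rho = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(rho, 3))                                 # canonical correlation between the views
```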

  49. Algebraic derivation of CCA

  50. Applications in bioinformatics • CCA can be extended to multiple views of the data • Multiple (more than 2) data sources • Two different ways to combine the different data sources • Multiple CCA • Consider all pairwise correlations • Integrated CCA • Divide into two disjoint sources
