
LECTURE 3: Introduction to PCA and PLS, K-means clustering


nhung

Presentation Transcript


  1. LECTURE 3 • Introduction to PCA and PLS • K-means clustering • Protein function prediction using network concepts • Network centrality measures

  2. Handling multivariate data • Multivariate data example

  3. Principal Component Analysis (PCA) and Partial Least Squares (PLS) • Two major common effects of using PCA or PLS • Convert a group of correlated predictive variables into a group of uncorrelated variables • Construct a “strong” predictive variable from several “weaker” predictive variables • Major difference between PCA and PLS • PCA is performed without considering the target variable, so PCA is an unsupervised analysis • PLS is performed to maximize the covariance between the target variable and factors extracted from the predictive variables, so PLS is a supervised analysis

  4. [Diagram: PLS vs. PCA] • PLS: step 1 (decomposition) extracts factors T (n x c) from X (n x p) and U (n x c) from Y (n x q) so that their covariance is maximal; step 2 is the regression step • X = matrix of predictors, Y = matrix of responses, T = factors of predictors, U = factors of responses, n = # of observations, p = # of predictors, q = # of responses, c = # of extracted factors • PCA: step 1 transforms A (n x p) into PC (n x p) • A = data matrix, PC = principal component matrix, n = # of observations, p = # of variables

  5. Principal Component Analysis (PCA) • In Principal Component Analysis, we look for a few linear combinations of the predictive variables that can be used to summarize the data without losing too much information. • Intuitively, principal component analysis is a method of extracting information from higher-dimensional data by projecting it onto a lower dimension. Example: Consider the scatter plot of 3-dimensional data (3 variables). The data across the 3 variables are highly correlated, and the majority of the points cluster around the center of the space. This is also the direction of the 1st PC, which gives roughly equal weight to the 3 variables: PC1 = –0.56 X1 – 0.57 X2 – 0.59 X3
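The example on this slide can be imitated with a small sketch. The data below are synthetic (three variables driven by one shared latent factor), not the data behind the slide's scatter plot, and the variable names are illustrative assumptions. It shows that for highly correlated variables the first principal direction puts roughly equal weight, about 1/sqrt(3) ≈ 0.58 in absolute value, on each variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (assumed) data: three variables sharing one latent factor plus small noise.
latent = rng.normal(size=200)
X = np.column_stack([latent + 0.3 * rng.normal(size=200) for _ in range(3)])

# Center the data, then take the SVD; the rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

pc1_loadings = Vt[0]                # weights of X1, X2, X3 in PC1
explained = s**2 / np.sum(s**2)     # share of total variance per component
pc1_scores = Xc @ pc1_loadings      # projection of each observation onto PC1

print("PC1 loadings:", np.round(pc1_loadings, 2))
print("Variance explained by PC1:", round(float(explained[0]), 3))
```

The overall sign of a principal direction is arbitrary, which is why the slide's weights can all be negative while a run of this sketch may give all positive ones.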

  6. Properties of Principal Components • Var(PCi) = λi, the i-th largest eigenvalue of the covariance matrix • Cov(PCi, PCj) = 0 for i ≠ j • Var(PC1) ≥ Var(PC2) ≥ … ≥ Var(PCp)
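These three properties can be checked numerically. The sketch below uses randomly generated correlated data (an assumption, not data from the lecture), computes the principal components from the eigenvectors of the sample covariance matrix, and compares the covariance matrix of the components with the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy data: 500 observations of 4 correlated variables.
A = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
Ac = A - A.mean(axis=0)

# Eigen-decomposition of the sample covariance matrix, sorted by decreasing eigenvalue.
cov = np.cov(Ac, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal components are the projections onto the eigenvectors.
pcs = Ac @ eigvecs
pc_cov = np.cov(pcs, rowvar=False)

print("Eigenvalues:       ", np.round(eigvals, 4))
print("Var(PC_i):         ", np.round(np.diag(pc_cov), 4))
print("Largest |Cov(PC_i, PC_j)|, i != j:",
      np.abs(pc_cov - np.diag(np.diag(pc_cov))).max())
```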

  7. Numerical Example • The following are the high school grades of 10 students in 6 subjects (scale 1-10) • Math = Mathematics • Chem = Chemistry • Phy = Physics • Bio = Biology • Eco = Economics • Soc = Sociology

  8. Results

  9. Partial Least Squares (PLS) • Unlike PCA, the PLS technique works by successively extracting factors from both the predictive and target variables such that the covariance between the extracted factors is maximized • Decomposition step • X = TWᵀ + E • Y = UVᵀ + F • Regression step • Y = TB + D = XWB + D = XB_PLS + D, where B_PLS = WB
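The two steps can be sketched for the single-response case (often called PLS1) with a NIPALS-style loop. This is a minimal illustration under stated assumptions, not necessarily the procedure used in the lecture: the data are synthetic, the function names `pls1_fit` and `pls1_predict` are hypothetical, and the code keeps separate weight (`W`) and loading (`P`) matrices as common textbook formulations do, whereas the slide writes a single matrix W.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Single-response PLS: extract factors one by one, then build B_PLS."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean            # centered working copies
    p = X.shape[1]
    W = np.zeros((p, n_components))            # weight vectors
    P = np.zeros((p, n_components))            # X loadings
    q = np.zeros(n_components)                 # y loadings
    for k in range(n_components):
        w = Xk.T @ yk                          # unit direction maximizing cov(Xk w, yk)
        w /= np.linalg.norm(w)
        t = Xk @ w                             # factor extracted from the predictors
        tt = t @ t
        P[:, k] = Xk.T @ t / tt
        q[k] = yk @ t / tt
        W[:, k] = w
        Xk = Xk - np.outer(t, P[:, k])         # deflate X (decomposition step)
        yk = yk - q[k] * t                     # deflate y
    B = W @ np.linalg.solve(P.T @ W, q)        # coefficients so that y ~ X B (regression step)
    return B, x_mean, y_mean

def pls1_predict(X, B, x_mean, y_mean):
    return (X - x_mean) @ B + y_mean

# Synthetic (assumed) data: 60 observations, 4 partly correlated predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
X[:, 1] += 0.8 * X[:, 0]
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=60)

B, x_mean, y_mean = pls1_fit(X, y, n_components=2)
y_hat = pls1_predict(X, B, x_mean, y_mean)
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print("R^2 with 2 extracted factors:", round(float(r2), 3))
```

When as many factors are extracted as there are predictors, this procedure reproduces ordinary least squares; the point of PLS is to stop earlier, with c smaller than p.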

  10. Numerical Example • The following are the high school grades of 10 students in 6 subjects (scale 1-10) • Math = Mathematics • Chem = Chemistry • Phy = Physics • Bio = Biology • Eco = Economics • Soc = Sociology • and the corresponding GPA scores at the undergraduate level. Objective: Can we use information about students' performance in high school to predict their GPA scores when they enter the undergraduate level?

  11. K-means clustering

  12. Source: “Clustering Challenges in Biological Networks”, edited by S. Butenko et al.

  13. Source: Teknomo, Kardi. K-Means Clustering Tutorials. http://people.revoledu.com/kardi/tutorial/kMean/

  14. Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1,1) and c2 = (2,1)
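Starting from these initial centroids, the k-means iteration alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its points. The sketch below is a minimal version of that loop. Only the coordinates of A = (1,1) and B = (2,1) come from the slide; the coordinates of the two other points, C and D, are assumed here for illustration.

```python
import numpy as np

def kmeans(points, centroids, max_iter=100):
    """Plain k-means with given initial centroids; returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Euclidean distance of every point to every centroid, shape (n_points, k).
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # New centroid = mean of assigned points (unchanged if a cluster is empty).
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# A and B are taken from the slide; C and D are assumed values.
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
labels, centroids = kmeans(list(medicines.values()), [(1, 1), (2, 1)])

for name, label in zip(medicines, labels):
    print(f"medicine {name} -> cluster {label + 1}")
print("final centroids:", centroids.tolist())
```

With these assumed points, A and B end up in one cluster and C and D in the other, with final centroids (1.5, 1) and (4.5, 3.5).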

  15. Protein function prediction using network concepts

  16. The topology of a protein-protein interaction (PPI) network is informative, but further analysis can reveal other information. A popular assumption, which is true in many cases, is that proteins with similar functions interact with each other. Based on this assumption, we have developed methods to predict protein functions and protein complexes from PPI networks, mainly based on cluster analysis.

  17. Cluster Analysis • Cluster analysis, also called data segmentation, means grouping or segmenting a collection of objects into subsets or "clusters", such that objects within each cluster are more closely related to one another than to objects assigned to different clusters. • In the context of a graph, densely connected nodes are considered clusters. • Visually, we can detect two clusters in this graph.

  18. K-cores of Protein-Protein Interaction Networks • Definition: Let a graph G = (V, E) consist of a finite set of nodes V and a finite set of edges E. A subgraph S = (V′, E′), where V′ ⊆ V and E′ ⊆ E, is a k-core (a core of order k) of G if and only if ∀ v ∈ V′: deg(v) ≥ k within S, and S is the maximal subgraph with this property.

  19. Concept of a k-core graph • Graph G • 1-core graph: every node has degree one or more

  20. Concept of a k-core graph • 1-core graph: every node has degree one or more

  21. Concept of a k-core graph • 2-core graph: every node has degree two or more

  22. Concept of a k-core graph • 1-core graph: every node has degree one or more

  23. Graph G • 3-core graph: every node has degree three or more • The 3-core is the highest k-core subgraph of the graph G
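A k-core can be computed by repeatedly deleting nodes whose degree is below k until no such node remains. The sketch below implements that pruning in plain Python and searches for the highest non-empty k-core. The function names and the small example graph are assumptions for illustration; the graph is not the graph G drawn on the slides.

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the k-core as {node: set_of_neighbours}; empty dict if none exists."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                      # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    while True:
        low = [n for n, nb in adj.items() if len(nb) < k]
        if not low:
            break
        for n in low:                   # delete low-degree nodes and their edges
            for m in adj.pop(n):
                if m in adj:
                    adj[m].discard(n)
    return {n: set(nb) for n, nb in adj.items()}

def highest_k_core(edges):
    """Return (k, core) for the largest k whose k-core is non-empty."""
    k, core = 0, k_core(edges, 0)
    while True:
        nxt = k_core(edges, k + 1)
        if not nxt:
            return k, core
        k, core = k + 1, nxt

# Assumed example: a 4-clique a-b-c-d, node e attached to a and b, node f attached to e.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("e", "a"), ("e", "b"), ("f", "e")]

for k in (1, 2, 3):
    print(f"{k}-core nodes:", sorted(k_core(edges, k)))
k_max, core = highest_k_core(edges)
print("highest k-core: k =", k_max, "nodes =", sorted(core))
```

In this example, removing f (degree 1) leaves e with degree 2, so e survives in the 2-core but not in the 3-core, which consists of the 4-clique only.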

  24. Application of a k-core graph • “Analyzing protein-protein interaction data obtained from different sources”, G. D. Bader and C. W. V. Hogue, Nature Biotechnology, Vol. 20, 2002

  25. Protein function prediction using k-core graphs

  26. Introduction: Function prediction • Schwikowski, B., Uetz, P. and Fields, S. A network of protein-protein interactions in yeast. Nature Biotech. 18, 1257-1261 (2000). Deals with a network of 2039 proteins and 2709 interactions. 65% of the interactions occurred between protein pairs with at least one common function. • Hishigaki, H., Nakai, K., Ono, T., Tanigami, A., and Takagi, T. Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 18, 523-531 (2001). Reported similar results.

  27. Introduction: Function prediction • Hypothesis: Proteins of unknown function that form a densely connected subgraph with proteins of a particular function may belong to that functional group. We utilize this concept by determining k-cores of strategically constructed sub-networks.

  28. Prediction of Protein Functions Based on K-cores of Protein-Protein Interaction Networks • “Prediction of Protein Functions Based on K-cores of Protein-Protein Interaction Networks and Amino Acid Sequences”, Md. Altaf-Ul-Amin, Kensaku Nishikata, Toshihiro Koma, Teppei Miyasato, Yoko Shinbo, Md. Arifuzzaman, Chieko Wada, Maki Maeda, Taku Oshima, Hirotada Mori, Shigehiko Kanaya. The 14th International Conference on Genome Informatics, December 14-17, 2003, Yokohama, Japan.

  29. E. coli PPI network • In total, 3007 proteins and 11531 interactions • Around 2000 are proteins of unknown function • The highest k-core of this total graph is not very helpful

  30. 10-core graph: the highest k-core of the E. coli PPI network

  31. We separate 1072 interactions (out of 11531) involving protein synthesis proteins and function-unknown proteins. Figure labels: P. S. = protein synthesis, U. F. = unknown function.

  32. Function-unknown proteins in this 6-core graph are likely to be involved in protein synthesis.

  33. Extending the k-core based function prediction method and its application to PPI data of Arabidopsis thaliana • “Protein Function Prediction based on k-cores of Interaction Networks”, Norihiko Kamakura, Hiroki Takahashi, Kensuke Nakamura, Shigehiko Kanaya and Md. Altaf-Ul-Amin, Proceedings of the 2010 International Conference on Bioinformatics and Biomedical Technology (ICBBT 2010)

  34. Materials and Methods: Dataset • All PPI data of Arabidopsis thaliana • 3118 interactions involving 1302 proteins • Collected from databases and scientific literature by our laboratory • Green = unknown proteins (289 proteins) • Pink = known proteins (1013 proteins)

  35. Materials and Methods: Dataset • Functional groups in the network • The PPI dataset contains proteins of 19 different functions according to the first-level categories of the KNApSAcK database.

  36. Materials and Methods: Dataset • The trends of interactions in the context of functional similarity • Diagonal elements show the number of interactions between proteins of similar function.
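A table of this kind can be built by counting, for every interaction, the functional classes of its two proteins. The sketch below uses a tiny made-up dataset: the protein IDs, the annotations, and the class names other than "Cellular communication" are placeholders, not entries from the actual Arabidopsis dataset or the KNApSAcK categories.

```python
import numpy as np

# Hypothetical annotations and interactions, for illustration only.
functions = ["Metabolism", "Transcription", "Cellular communication"]
protein_function = {
    "P1": "Metabolism", "P2": "Metabolism",
    "P3": "Transcription", "P4": "Transcription",
    "P5": "Cellular communication",
}
interactions = [("P1", "P2"), ("P1", "P3"), ("P3", "P4"), ("P4", "P5"), ("P2", "P3")]

index = {name: i for i, name in enumerate(functions)}
counts = np.zeros((len(functions), len(functions)), dtype=int)

for a, b in interactions:
    i, j = index[protein_function[a]], index[protein_function[b]]
    counts[i, j] += 1
    if i != j:
        counts[j, i] += 1       # keep the matrix symmetric

# Diagonal entries count interactions between proteins of the same function.
same_function_fraction = np.trace(counts) / len(interactions)

print(counts)
print("fraction of same-function interactions:", same_function_fraction)
```

A large diagonal relative to the off-diagonal entries would support the assumption that proteins of similar function tend to interact with each other.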

  37. Materials and Methods: Flowchart of the method

  38. Results: Subnetworks • [Table columns: Subnetwork name, Number of interactions] • In this work we do not consider subnetworks that contain fewer than 100 interactions. We finally consider subnetworks corresponding to 9 functional classes.

  39. Results: Subnetwork corresponding to cellular communication • As an example, here we show the subnetwork and k-cores corresponding to cellular communication. • Subnetwork extraction: we extracted the following 3 types of interactions: Cellular communication-Cellular communication, Cellular communication-Unknown, and Unknown-Unknown • Total: 603 interactions
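The extraction rule and the subsequent k-core step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the protein IDs and annotations are made up, proteins missing from the annotation dictionary are treated as function-unknown, and the helper names are hypothetical.

```python
from collections import defaultdict

def extract_subnetwork(interactions, function_of, target):
    """Keep target-target, target-unknown and unknown-unknown interactions."""
    allowed = {target, None}            # None stands for "function unknown"
    return [(a, b) for a, b in interactions
            if function_of.get(a) in allowed and function_of.get(b) in allowed]

def k_core_nodes(edges, k):
    """Nodes of the k-core, found by repeatedly pruning nodes of degree < k."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    while True:
        low = [n for n, nb in adj.items() if len(nb) < k]
        if not low:
            return set(adj)
        for n in low:
            for m in adj.pop(n):
                if m in adj:
                    adj[m].discard(n)

# Hypothetical toy data: C1-C3 are annotated, U1 and U2 are unknown, M1 has another function.
function_of = {"C1": "Cellular communication", "C2": "Cellular communication",
               "C3": "Cellular communication", "M1": "Metabolism"}
interactions = [("C1", "C2"), ("C1", "C3"), ("C2", "C3"),
                ("U1", "C1"), ("U1", "C2"), ("U1", "C3"),
                ("U2", "U1"), ("M1", "C1"), ("M1", "U2")]

sub = extract_subnetwork(interactions, function_of, "Cellular communication")
core = k_core_nodes(sub, 3)
candidates = sorted(n for n in core if n not in function_of)

print("subnetwork interactions:", len(sub))
print("unknown proteins in the 3-core:", candidates)
```

Here U1 is densely connected to the annotated proteins and stays in the 3-core, so it becomes a candidate for the cellular communication class, while U2 is pruned.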

  40. Results: Subnetwork corresponding to cellular communication • 1-core • Red nodes: known proteins. Green nodes: function-unknown proteins.

  41. Results: k-cores corresponding to cellular communication • Panels: 2-core, 3-core • Red nodes: known proteins. Green nodes: function-unknown proteins.

  42. Results: k-cores corresponding to cellular communication • Panels: 4-core, 5-core, 6-core, 7-core • Red nodes: known proteins. Green nodes: function-unknown proteins. • This figure implies that determining k-cores in strategically constructed sub-networks can reveal which unknown proteins are densely connected to proteins of a particular functional class.

  43. Results: Function Predictions The number of unknown genes included in different k-cores corresponding to different functional groups

  44. Results: Function Predictions • Prediction based on 2-cores, 3-cores and 4-cores • 4-core • 2-core • Most proteins have been assigned unique functions • 3-core • Most proteins have been assigned unique functions and some have been assigned multiple functions
