Characterizing Gene Functional Expression Profiles Zoran Obradovic Slobodan Vucetic Hongbo Xie, Hao Sun, Pooja Hedge Information Science and Technology Center, Temple University
Outline • Microarray Data Analysis Process • Functional Expression Profile Analysis • Functional Expression Profile Ranking • Functional Expression Profile Clustering • Functional Characterization of • Plasmodium falciparum, • Saccharomyces cerevisiae, • Mus musculus and • Homo sapiens
What is a DNA Microarray? DNA microarray technology allows the expression of tens of thousands of genes to be measured at a time. (Source: Analysis of Replicated Experiments, Gordon Smyth, Walter and Eliza Hall Institute)
Scanning/Signal Detection [Figure: scanned array images for the Cy3 channel and the Cy5 channel; spots indicate equal expression, higher expression in Cy3, or higher expression in Cy5]
Microarray Data Analysis Process • Designing gene expression experiments • Image processing and analysis • Preprocessing raw intensity data • Discovering differentially expressed genes • Advanced analysis • Finding relevant pathways • Discovering gene expression patterns • Understanding gene functions More information: • www.ist.temple.edu/research/biocore.html
Designing Gene Expression Experiments [Figure: comparative experimental designs: reference design, loop design, and a saturated design] Source: http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp
Image Processing and Analysis (figure obtained using Imagene software)
Preprocessing Raw Intensity Data [Figure: normalization of raw intensity data] (Source: Analysis of Replicated Experiments, Gordon Smyth, Walter and Eliza Hall Institute)
Discovering Differentially Expressed Genes • Fold change (log ratio) • Statistical methods: 1) t-test 2) ANOVA 3) Non-parametric analysis (Wilcoxon rank-sum test)
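As a rough illustration of the first two options on this slide, the sketch below computes a log2 fold change and a Welch t statistic for a single gene. The intensities are made up, the function names are hypothetical, and choosing Welch's unequal-variance form of the t-test is an assumption; the slide does not say which variant was used.

```python
import math
import statistics


def log2_fold_change(treated, control):
    """Log2 ratio of mean intensities (treated over control)."""
    return math.log2(statistics.mean(treated) / statistics.mean(control))


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    var_a = statistics.variance(a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Made-up replicate intensities for one gene in two conditions.
control = [100.0, 110.0, 90.0, 100.0]
treated = [400.0, 420.0, 380.0, 400.0]

lfc = log2_fold_change(treated, control)  # mean ratio is 4, so log2 is 2.0
t_stat = welch_t(treated, control)
```

A fold change alone ignores replicate variability, which is why the slide lists test statistics next to it.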
Advanced Analysis: Finding Relevant Pathways (figure obtained using Ingenuity software)
Advanced Analysis: Discovering Gene Expression Patterns • Plasmodium falciparum intraerythrocytic developmental cycle • Genes are sorted by the time of their expression peaks (Bozdech Z et al., PLoS Biol. 2003 Oct;1(1))
Advanced Analysis: Identifying Unknown Gene Functions Based on Expression Profiles • Standard practice: • Basic assumption: expression profiles of functionally related genes are correlated • Objectives: confirm a specific biological hypothesis; predict functional properties of less characterized genes; or uncover new or unexpected biological knowledge • Methodology: cluster genes by the similarity of their expression profiles, then perform functional analysis of the obtained clusters [Figure: an unknown sequence tag (functions?) is compared with the gene 1 expression profile (function A) and the gene 2 expression profile (function B); the unknown sequence has high correlation with the gene 1 expression profile, so the sequence tag is assigned function A. Is this alignment reliable?]
Problems with old approaches • Genes with the same function do not necessarily have the same expression profiles • Clustering on the expression profiles of all genes could be unreliable
Our Approach: Analyzing Microarray Functional Expression Profiles (FEP) • FEPs: compute an FEP as the average profile of all genes associated with a given highly correlated GO term • Example GO terms: GO:0004721: phosphoprotein phosphatase activity; GO:0016311: dephosphorylation
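The FEP definition on this slide can be sketched as a per-time-point average over the genes annotated with a GO term. The gene names, expression values, and the gene-to-term mapping below are invented for illustration, and `functional_expression_profile` is a hypothetical helper name, not part of the authors' package.

```python
from statistics import mean


def functional_expression_profile(profiles):
    """Average equal-length expression profiles time point by time point."""
    return [mean(values) for values in zip(*profiles)]


# Hypothetical expression profiles: three genes, three time points each.
expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [3.0, 2.0, 1.0],
    "geneC": [2.0, 5.0, 2.0],
}

# Hypothetical annotation: all three genes carry the same GO term.
go_to_genes = {"GO:0016311": ["geneA", "geneB", "geneC"]}

fep = {
    term: functional_expression_profile([expression[g] for g in genes])
    for term, genes in go_to_genes.items()
}
# fep["GO:0016311"] is [2.0, 3.0, 2.0]
```

Averaging only makes sense when the member profiles agree with each other, which is what the ranking step is meant to check.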
Questions that we address: • How to perform functional analysis in an objective manner • How to estimate the biological significance of discoveries
Tools and Applications • Developed tools: • (1) Tool 1, the functional expression profile ranking package, explores which functions have conserved expression profiles • (2) Tool 2, the functional expression profile clustering package, explores which functions have similar expression profiles and tests their functional similarity • Applications: • Functional characterization of gene expression in the intraerythrocytic developmental cycle of Plasmodium falciparum and in Saccharomyces cerevisiae, Mus musculus and Homo sapiens
Tools Architecture [Figure: architecture diagram with the components: microarray raw data; data preprocessing; gene function annotation database; functional expression profile ranking; list of significantly correlated GO terms; functional expression profile clustering; gene function semantic distance mapping space; clusters of functional expression profiles; report]
Tool 1: Functional Expression Profile (FEP) Ranking Package • Objective: • Identify genes with the same function that have correlated expression profiles • Task: • Evaluate gene expression correlation within each FEP • Methodology • Step 1: calculate the average pairwise correlation coefficient S among the n gene expression profiles for a given function term • Step 2: randomly select n genes from the whole dataset and compute their average pairwise correlation coefficient S’ • Step 3: repeat Step 2 m times (m > 10,000) and compare the distribution of S’ with the original S to estimate a p-value
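The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' package: it assumes Pearson correlation, a one-sided p-value with a +1 correction, and synthetic data; the function names are hypothetical, and m is kept small here for speed, whereas the slide uses m > 10,000.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_pairwise_correlation(profiles):
    """Step 1: mean correlation over all unordered pairs of profiles."""
    pairs = list(itertools.combinations(profiles, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


def fep_p_value(term_genes, expression, m=1000, seed=0):
    """Steps 2 and 3: compare S with S' from m random gene sets of size n."""
    rng = random.Random(seed)
    s = average_pairwise_correlation([expression[g] for g in term_genes])
    all_genes = sorted(expression)
    at_least_as_high = 0
    for _ in range(m):
        sample = rng.sample(all_genes, len(term_genes))
        s_random = average_pairwise_correlation([expression[g] for g in sample])
        if s_random >= s:
            at_least_as_high += 1
    # Assumed estimator: the +1 correction keeps the p-value above zero.
    return s, (at_least_as_high + 1) / (m + 1)


# Synthetic data: 3 co-regulated genes (noisy sine wave) plus 30 noise genes.
data_rng = random.Random(1)
timepoints = 12
expression = {}
for i in range(3):
    expression[f"term_gene_{i}"] = [
        math.sin(2 * math.pi * t / timepoints) + data_rng.gauss(0, 0.1)
        for t in range(timepoints)
    ]
for i in range(30):
    expression[f"background_{i}"] = [data_rng.gauss(0, 1) for _ in range(timepoints)]

s, p = fep_p_value(["term_gene_0", "term_gene_1", "term_gene_2"], expression, m=500)
```

Because the random sets have the same size n as the term's gene set, the null distribution already reflects how the average correlation varies with n.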
Dataset 1: Plasmodium falciparum Intraerythrocytic Developmental Cycle Objective: Identification of P. falciparum genes whose RNA levels vary periodically within the asexual intraerythrocytic developmental cycle (IDC) transcriptome Materials: 5080 ORFs, 3532 unique genes, 46 assays (sampled in time) using cDNAs Methods: Permutation test with the Fast Fourier Transform algorithm and correlations Found: 60% of genes are transcriptionally active, and most genes are active only once during the IDC Figure: Major morphological stages during the IDC and transcriptional profiles of 2712 genes (Bozdech Z et al., (2003) PLoS Biol. Oct; 1(1))
Dataset 2: Saccharomyces cerevisiae Cell Cycle (Spellman et al., (1998) Molecular Biology of the Cell 9, 3273-3297) Objective: Identification of yeast genes whose RNA levels vary periodically within the cell cycle process Materials: 6178 ORFs, 4450 unique genes, 77 assays (sampled in time) using cDNAs Methods: Periodicity and correlation algorithm Found: 800 genes that meet an objective minimum criterion for cell cycle regulation Figure: The M/G1 clusters
Dataset 3: Homo sapiens Cell Cycle (R. Cho et al. (2001) Nature, 27) Objective: Identification of human genes whose RNA levels vary periodically within the cell cycle process Materials: 6800 ORFs, 5795 unique genes, 14 assays (sampled in time) using Affymetrix arrays Methods: Fold change Found: 700 genes that display transcriptional fluctuation with a periodicity consistent with that of the cell cycle Figure: Clustering analysis of cell-cycle-regulated transcripts
Dataset 4: Mus musculus Cell Cycle (Ishida, S. et al. (2001) Mol. Cell. Biol. 21, 4684-4699) Objective: Analysis of gene regulation during the mammalian cell cycle Materials: 6347 unique genes, 14 assays Methods: Clustering Found: 7 distinct clusters of genes that exhibit unique patterns of expression Figure: Patterns of gene expression following growth stimulation and during the mammalian cell cycle
Applying FEP Ranking Package: Cumulative Distributions of GO Term p-Values for Human, Yeast, Mouse and P. falciparum
Applying FEP Ranking Package: GO Terms with the Most Conserved FEP Among Multi-organisms
Applying FEP Ranking Package: Selection of GO Terms with Significantly Correlated Expression Patterns in the Plasmodium falciparum Developmental Cycle Data [Figure: cumulative distribution of p-values for GO terms associated with at least two genes; marked examples: GO:0016311: dephosphorylation, GO:0007028: cytoplasm organization and biosynthesis] Selected: 46% of all function GO terms and 52% of all process GO terms are significantly correlated
Plasmodium falciparum: Processes and Functions with the Highest/Lowest Correlation [Figure: two panels, highest correlation and lowest correlation]
Plasmodium falciparum: Findings by FEP Ranking Package • Of the 12 FEPs referenced by Bozdech et al., two have a p-value larger than 0.05. • E.g. the average correlation coefficient among genes associated with the Ribonucleotide Synthesis function is only 0.258 (p-value = 0.11), which weakens the claim that it is related to the Ring stage of the IDC. • No linear relationship was found between the number of genes associated with a given GO term and the average correlation coefficient among these genes • Ranking GO terms by p-value could be useful for rapid identification of functions that are closely related to a specific developmental stage (of Plasmodium falciparum)
All Datasets: Findings by FEP Ranking Package • To some extent, genes with identical functions have similar expression profiles • However, a large fraction of functions do not follow the underlying hypothesis! • Higher-level organisms seem to have a lower fraction of significantly correlated expression profiles for identical functions. • Fractions of correlated FEPs:* • Saccharomyces cerevisiae: 59% (643/1,083) • Plasmodium falciparum: 48.4% (428/884) • Homo sapiens: 16.4% (249/1,514) • Mus musculus: 13.3% (182/1,366) *fractions are for both processes and functions
Tool 2: FEP Clustering Package • Objective: • Identify genes with similar functions and similar expression profiles • Tasks: • Cluster the FEPs selected by the FEP ranking package • Evaluate the found clusters for biological relevance by • Identifying similar functions based on the GO term hierarchy tree structure • Evaluating inter-cluster GO term distance • Methodology • Randomly generate k sets, each containing the same number of GO terms as the corresponding cluster • Calculate the total GO term distance within each generated set and sum the total distances of all sets to get the overall score S’ • Repeat the procedure 1000 times and compare the distribution of S’ with the overall distance obtained through clustering
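The evaluation procedure above might be sketched as follows. Several details are assumptions made for illustration: the random sets are drawn as disjoint groups from the pool of clustered terms, the one-sided p-value uses a +1 correction, and `toy_distance` (absolute difference of integer labels) is only a stand-in for a GO tree distance. All function names are hypothetical.

```python
import itertools
import random


def total_within_distance(groups, distance):
    """Sum of pairwise distances inside each group, added over all groups."""
    return sum(
        distance(a, b)
        for group in groups
        for a, b in itertools.combinations(group, 2)
    )


def random_scores(cluster_sizes, terms, distance, repeats=1000, seed=0):
    """Overall score S' for `repeats` random groupings with the given sizes."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = rng.sample(terms, sum(cluster_sizes))
        groups, start = [], 0
        for size in cluster_sizes:
            groups.append(shuffled[start:start + size])
            start += size
        scores.append(total_within_distance(groups, distance))
    return scores


def toy_distance(a, b):
    """Stand-in for a GO term distance: terms are integers on a line."""
    return abs(a - b)


terms = list(range(12))
found_clusters = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

found = total_within_distance(found_clusters, toy_distance)  # 10 per cluster, 30 total
scores = random_scores([len(c) for c in found_clusters], terms, toy_distance, repeats=200)

# Assumed one-sided p-value: how often a random grouping is at least as tight.
p_value = (sum(score <= found for score in scores) + 1) / (len(scores) + 1)
```

A small total distance for the found clusters relative to the random groupings suggests that functionally close GO terms ended up in the same expression cluster.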
Structure of GO Term Tree (Example) • Level 1: GO:0008150: biological process • Level 2: GO:0007275: development; GO:0007582: physiological process • Level 3: GO:0007389: pattern specification; GO:0008152: metabolism • Level 4: GO:0000003: reproduction; GO:0009798: axis specification • Level 5: GO:0009948: anterior/posterior axis specification • Measuring Distance of GO Terms: the measure uses the length of the minimal chain between terms X and Y in the GO tree and the length of the maximal chain from the top to the bottom of the tree
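The "length of the minimal chain" between two terms can be sketched on a small tree like the one on this slide. The parent links below are assumptions reconstructed from the level layout (reproduction is left out because its parent is not recoverable from the slide), the names `PARENT`, `path_to_root` and `go_distance` are hypothetical, and the sketch treats GO as a tree even though the full ontology allows multiple parents.

```python
# Assumed child -> parent links for the example terms.
PARENT = {
    "GO:0007275": "GO:0008150",  # development -> biological process
    "GO:0007582": "GO:0008150",  # physiological process -> biological process
    "GO:0007389": "GO:0007275",  # pattern specification -> development
    "GO:0008152": "GO:0007582",  # metabolism -> physiological process
    "GO:0009798": "GO:0007389",  # axis specification -> pattern specification
    "GO:0009948": "GO:0009798",  # anterior/posterior axis spec. -> axis specification
}


def path_to_root(term):
    """List of terms from `term` up to the root, inclusive."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def go_distance(x, y):
    """Number of edges on the shortest chain between x and y in the tree."""
    steps_from_y = {term: i for i, term in enumerate(path_to_root(y))}
    for steps_from_x, term in enumerate(path_to_root(x)):
        if term in steps_from_y:  # first shared ancestor on the way up
            return steps_from_x + steps_from_y[term]
    raise ValueError("terms do not share a root")


d_far = go_distance("GO:0009948", "GO:0008152")   # chain passes through the root: 6
d_near = go_distance("GO:0007389", "GO:0009948")  # ancestor and descendant: 2
```

Dividing by the maximal top-to-bottom chain length, which the slide also mentions, would put such distances on a common scale; the exact normalization is not recoverable from the slide, so it is omitted here.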
Determination of Number of Clusters • Measured • Larger z-score indicates a better grouping of functions within clusters.
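The z-score formula itself did not survive on this slide. One conventional definition, given here purely as an assumption, standardizes the found clusters' total GO distance against the random-grouping scores S’, with the sign chosen so that tighter found clusters give a larger z-score. The function name and the numbers are made up.

```python
import statistics


def cluster_z_score(found_score, random_scores):
    """Assumed form: (mean of S' minus found score) / standard deviation of S'."""
    mu = statistics.mean(random_scores)
    sigma = statistics.stdev(random_scores)  # sample standard deviation
    return (mu - found_score) / sigma


# Made-up scores: five random groupings and one found clustering.
z = cluster_z_score(30, [50, 54, 46, 52, 48])  # (50 - 30) / sqrt(10)
```

Under this reading, one would compute z for each candidate number of clusters k and prefer the k with the largest value.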
Number of Clusters vs Z-score: Results for Plasmodium falciparum [Figure: two plots of number of clusters vs z-score, one for Plasmodium falciparum biological processes and one for Plasmodium falciparum molecular functions]
Applying FEP Clustering Package: Results on Plasmodium falciparum Processes [Figure: k-means clustering profiles of FEPs for 238 identified processes, clusters 1 to 4; cluster vs stage of IDC]
Applying FEP Clustering Package: Results on Plasmodium falciparum Functions [Figure: k-means clustering profiles of FEPs for 199 identified molecular functions, clusters 1 to 4; cluster vs stage of IDC]
Statistical Evaluation: Found vs. Random Clusters for P. falciparum [Figure: two panels, Biological Processes and Molecular Functions, each marking the found clusters] • The distance from the found clusters to the random clusters is larger for biological processes • Random clusters for biological processes have smaller variance
Statistical Evaluation: Clustering All GO Terms for P. falciparum • Clustering all GO terms leads to a smaller z-score, which indicates lower-quality clusters • The figure shows the P. falciparum functional clustering result with the found clusters marked: the z-score is 8.5, compared to 12 when clustering correlated GO terms only
Statistical Evaluation: Found vs. Random Clusters for S. cerevisiae and Homo sapiens [Figure: four panels, Yeast Processes, Human Processes, Yeast Functions and Human Functions, each marking the found clusters]
Remarks • Statistical significance of identified clusters (separation between clusters and random groupings) is increased by • Normalizing data (Plasmodium falciparum) • Eliminating noise through singular value decomposition (SVD) • Reducing data through Principal Component Analysis
Conclusions • The proposed microarray tools help identify • genes with the same function and correlated expression profiles • genes with similar functions and similar expression profiles • Measuring GO tree based distance was useful for evaluating the biological relevance of clusters; however, • many GO terms have only 1 associated gene • many genes do not even have a GO term • parenthood and sibling relationships in GO trees should be differentiated, with a smaller penalty for a sibling relationship than for parenthood • More robust clustering methods could be used
Thank You ! More information: www.ist.temple.edu/research/biocore.html Contact: Zoran Obradovic, director IST Center, Temple University 215 204-6265 zoran@ist.temple.edu