FUNCTIONAL GENOMICS COURSE 26.5.2006. Petri Pehkonen, Laboratory of Functional Genomics and Bioinformatics, Department of Neurobiology, A.I.Virtanen Institute for Molecular Sciences, University of Kuopio
[ I N T R O D U C T I O N ] Gene lists from microarrays
• Genes are compared between two or more samples
• A t-test is used to detect differential expression
• Genes are ranked according to their p-values
• A p-value cutoff is set to select differentially expressed genes (up- or down-regulated)
• Output: up-regulated, down-regulated and non-regulated gene lists
Figure labels: Dataset, Hybridization, mRNA
2 X UP: AA407331 BG062929 BG062930 BG062931 BG062932 AA407367 BG062933 BG062934 AA407377 BG062935 BG062936 BG062937 BG063015 BG063016 BG063017 BG063018
2 X DOWN: BG069315 BG082333 BG082334 BG069318 BG069319 BG069320 BG069321 BG069322 AU018797 BG069323 BG069324 BG069325 BG069326 BG069327 BG069328 BG069406 BG069407 BG082348 AU018835 BG069409 BG082350 BG082351 BG069412 BG069413
NON REGULATED: BG068248 AU022405 BG068249 BG068250 BG068251 BG068252 BG068253 BG068254 BG081291 BG081292 BG068257 BG068258 BG068259 BG068260 BG068261 BG068262 BG068263 BG068264 BG068265 AU022448 BG081301 BG068267 AU022455 BG068268 BG068269 BG068270 BG081306 BG068272 BG068273 BG068274 BG068275 BG068276 AU022477 BG068277 BG068278 BG068279 BG068280 BG068281 BG081317 BG068283 BG068284
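The ranking and cutoff steps on this slide can be sketched in a few lines of Python. This is only an illustration: the gene ids are taken from the slide, but the log ratios and p-values are made up, the function name `split_gene_lists` is hypothetical, and the fold-change criterion implied by the "2 X" labels is left out.

```python
# Hypothetical per-gene results: (gene id, mean log2 ratio, t-test p-value).
# The ids appear on the slide; the numbers are invented for illustration.
results = [
    ("AA407331", 1.4, 0.003),
    ("BG062929", 1.1, 0.020),
    ("BG069315", -1.3, 0.008),
    ("BG068248", 0.1, 0.640),
    ("AU022405", -0.2, 0.410),
]


def split_gene_lists(results, alpha=0.05):
    """Rank genes by p-value, then split into up, down and non-regulated lists."""
    up, down, non_regulated = [], [], []
    for gene, log_ratio, p in sorted(results, key=lambda r: r[2]):
        if p >= alpha:
            non_regulated.append(gene)   # not significant at the chosen cutoff
        elif log_ratio > 0:
            up.append(gene)              # significant, higher in the test sample
        else:
            down.append(gene)            # significant, lower in the test sample
    return up, down, non_regulated


up, down, non_regulated = split_gene_lists(results)
print("up:", up)
print("down:", down)
print("non-regulated:", non_regulated)
```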
[ I N T R O D U C T I O N ] Gene associated data
Figure: genes (G = a gene; G1, G2, G3, G4) linked to associated data elements (T = element) of four kinds:
• Chromosomal regions (6p12, 6p13, 6p14, 6p15)
• Transcription factor binding sites
• Functional classes (cell cycle, apoptosis, neurogenesis, cell death, ATPase activity)
• Words from scientific literature ("is", "segregation", "protein", "antioxidant", "dopamine")
[ C H R O M O S O M A L A N A L Y S I S ] Motivation
Aim
• Find the chromosomal regions where our genes are over-represented
Biological basis
• Gene duplication during evolution
• Genes that are located near each other and/or have similar regulatory elements may be regulated by the same factors and participate in the same biological process
[ C H R O M O S O M A L A N A L Y S I S ] Data representation
• We can have gene sets obtained from any kind of laboratory technology or in silico technique
• Gene locations in a chromosome can be represented as ordered binomial (0/1) data vectors
• 1 indicates that the gene belongs to the row's class, 0 indicates that it does not
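A minimal sketch of this representation, assuming a hypothetical list of genes already sorted by chromosomal position and labelled with the gene list each one fell into (the names `ordered_genes` and `indicator_vectors` are illustrative, not from the slides):

```python
# Hypothetical genes in chromosomal order, each with the list it belongs to.
ordered_genes = [
    ("G1", "up"), ("G2", "up"), ("G3", "none"),
    ("G4", "down"), ("G5", "up"), ("G6", "none"),
]


def indicator_vectors(ordered_genes, classes):
    """Return one 0/1 vector per class, aligned with chromosomal gene order."""
    return {
        c: [1 if label == c else 0 for _, label in ordered_genes]
        for c in classes
    }


vectors = indicator_vectors(ordered_genes, ["up", "down"])
for name, row in vectors.items():
    print(f"{name:>5}: {row}")
```

Each row is one class, each column one gene position, which is the layout the later region and segmentation analyses operate on.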
[ C H R O M O S O M A L A N A L Y S I S ] Our first analysis
• We can select a region and find the number of genes there
• We can calculate the statistical significance of the region
• Contingency table tests: Fisher's exact test, chi-squared test etc.
Total: 6 genes Over-expressed: 4 genes Under-expressed: 1 gene
[ C H R O M O S O M A L A N A L Y S I S ] Our first analysis
• Contingency table test: determines whether there is a difference between two proportions
• We can now test whether the selected chromosomal region and the whole genome differ IN THEIR PROPORTIONS OF OVER-EXPRESSED GENES
Total: 6 genes Over-expressed: 4 genes Under-expressed: 1 gene
[ C H R O M O S O M A L A N A L Y S I S ] Our first analysis
• 2×2 contingency table for over-expressed vs. other genes:
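The table itself is not reproduced in this transcript. As a rough sketch of how such a table is laid out, the region counts below come from the slide (6 genes, 4 over-expressed), while the genome-wide totals (1000 genes, 100 over-expressed) and the function name are hypothetical.

```python
def contingency_2x2(region_hits, region_total, genome_hits, genome_total):
    """Rows: inside / outside the region. Columns: over-expressed / other genes."""
    inside = [region_hits, region_total - region_hits]
    outside_hits = genome_hits - region_hits
    outside = [outside_hits, (genome_total - region_total) - outside_hits]
    return [inside, outside]


# Region counts from the slide; genome-wide totals are assumed for illustration.
table = contingency_2x2(region_hits=4, region_total=6,
                        genome_hits=100, genome_total=1000)

print(f"{'':<16}{'over':>6}{'other':>7}")
print(f"{'in region':<16}{table[0][0]:>6}{table[0][1]:>7}")
print(f"{'rest of genome':<16}{table[1][0]:>6}{table[1][1]:>7}")
```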
[ C H R O M O S O M A L A N A L Y S I S ] Hypergeometric distribution
• Regulated genes can be seen as a sample taken one by one without replacement from a population of all genes tested in the experiment
• The hypergeometric distribution describes how random this kind of sample is for one associated class
Figure: genes (regulated genes and other genes) against classes C1, C2 and C3, with the hypergeometric distribution function (HygeCDF) and the hypergeometric probability density function (HygePDF). Axes: number of genes associated to class C in the regulated list; hypergeometric probability.
[ C H R O M O S O M A L A N A L Y S I S ] Hypergeometric probability
• The hypergeometric probability f answers whether class C is randomly distributed between regulated and other genes
• To calculate f for class C, we need the sizes of the regulated gene list (N) and of the array (M), and the numbers of class C associated genes in the gene list (x) and in the array (n)
Figure: value f at x on the hypergeometric probability curve. Axes: number of genes associated to class C in the regulated list; hypergeometric probability.
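As a sketch of how f could be computed from these four numbers, using only the Python standard library (the function name `hyge_pmf` and the toy numbers are assumptions; the argument order follows the slide's x, M, N, n):

```python
from math import comb


def hyge_pmf(x, M, N, n):
    """Probability of seeing exactly x class-C genes in the regulated list.

    M: genes on the array, N: genes in the regulated list,
    n: class-C genes on the array, x: class-C genes in the list.
    """
    # Outside this range the outcome is impossible.
    if x < max(0, N + n - M) or x > min(N, n):
        return 0.0
    return comb(n, x) * comb(M - n, N - x) / comb(M, N)


# Toy example: array of 10 genes, list of 4 genes, 3 class-C genes on the array.
for x in range(4):
    print(x, round(hyge_pmf(x, M=10, N=4, n=3), 4))
```

The probabilities over all possible x sum to one, which is a quick sanity check on the parameter order.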
[ C H R O M O S O M A L A N A L Y S I S ] Hypergeometric probability
• A low probability means a non-random distribution => the class is either over- or under-represented
• Figures: class C1 is over-represented and C3 under-represented in the gene list. How about C2?
Figure: positions of C3, C2 and C1 on the hypergeometric probability curve. Axes: number of genes associated to class C in the regulated list; hypergeometric probability.
[ C H R O M O S O M A L A N A L Y S I S ] Fisher's exact test
• The hypergeometric probability f(x,M,N,n) is the probability of detecting exactly x genes associated to class C in a random sample
• It is more natural to ask: what is the probability of detecting x or more class-associated genes in a random sample?
• Fisher’s exact test F answers this by summing the tail of HygePDF
FISHER’S TEST FOR CLASS C2: F(x=36, M=1300, N=400, n=95) = 0.1916
C2 DOES NOT SHOW STATISTICALLY SIGNIFICANT OVER-REPRESENTATION (WITH A SIGNIFICANCE LEVEL α=0.05)
Figure: tail of the hypergeometric probability curve from x upwards for class C2. Axes: number of genes associated to class C in the regulated list; hypergeometric probability.
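A minimal sketch of the tail sum, assuming the one-sided "x or more" convention described above and the slide's argument order (x, M, N, n). The function name is hypothetical and the numbers are a toy case, not the slide's example.

```python
from math import comb


def fisher_upper_tail(x, M, N, n):
    """P(X >= x) for a hypergeometric X: one-sided over-representation p-value.

    M: genes on the array, N: genes in the regulated list,
    n: class-C genes on the array, x: class-C genes observed in the list.
    """
    lo = max(x, 0, N + n - M)   # smallest outcome that is possible and >= x
    hi = min(N, n)              # largest possible outcome
    tail = sum(comb(n, i) * comb(M - n, N - i) for i in range(lo, hi + 1))
    return tail / comb(M, N)


# Toy example: array of 10 genes, list of 4 genes, 3 class-C genes on the array.
p = fisher_upper_tail(2, M=10, N=4, n=3)
print(f"P(X >= 2) = {p:.4f}")
```

Working with exact integers until the final division avoids rounding problems when the binomial coefficients become very large.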
[ C H R O M O S O M A L A N A L Y S I S ] Analysis of a chromosome
• We want to analyse the chromosome along its whole length
• We can discretize the chromosome into equal-sized regions
  • based on the number of genes or
  • based on the physical locations of genes
• Then we can calculate the significance of each region separately
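The two ways of discretizing can be sketched as follows. The gene names, start coordinates and function names are hypothetical; the 250 kbp bin width is only an example value.

```python
def bins_by_position(genes, bin_size_bp):
    """Group (gene, start_bp) pairs into fixed-width physical bins.

    Returns a dict mapping bin index -> genes whose start falls in that bin.
    """
    bins = {}
    for gene, start in genes:
        bins.setdefault(start // bin_size_bp, []).append(gene)
    return bins


def bins_by_count(genes, genes_per_bin):
    """Sort genes by position and cut the order into bins of a fixed gene count."""
    ordered = [g for g, _ in sorted(genes, key=lambda item: item[1])]
    return [ordered[i:i + genes_per_bin]
            for i in range(0, len(ordered), genes_per_bin)]


# Hypothetical gene start coordinates in base pairs.
genes = [("G1", 120_000), ("G2", 240_000), ("G3", 260_000),
         ("G4", 510_000), ("G5", 740_000), ("G6", 760_000)]

print(bins_by_position(genes, 250_000))
print(bins_by_count(genes, 4))
```

Physical bins can hold very different numbers of genes, while count-based bins cover very different physical lengths, so the choice affects which regions can reach significance.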
[ C H R O M O S O M A L A N A L Y S I S ] Statistics applied
• We wanted to see whether over-represented chromosomal regions exist in the down-regulated gene list obtained from a C. elegans strain comparison
• Chromosomes were split into 250 kbp segments (themes)
• Fisher's exact test, based on the hypergeometric distribution, was used to assess over-representation
Chromosomal regions as themes: genes G (G1, G2, G3, G4) are assigned to regions T (250-500 kbp, 500-750 kbp, 750-1000 kbp, 1000-1250 kbp)
FISHER’S EXACT TEST F(x, M, N, n) where:
x = Down-regulated genes in region T
M = All chip genes in chromosome
N = All down-regulated genes
n = All genes in region T
[ C H R O M O S O M A L A N A L Y S I S ] Statistics applied
• The resulting p-values were transformed into negative base-10 logarithms, so that p < 0.05 corresponds to -log10 p > 1.3
• Some regions were found where genes were highly over-represented
Kaja Reisner, Petri Pehkonen, Garry Wong
FISHER’S EXACT TEST EXAMPLE For region T (12750...13000 kbp):
x = Down-regulated genes in region T = 14
M = All chip genes in chromosome = 3407
N = All down-regulated genes = 92
n = All genes in region T = 59
P(x,M,N,n) ~ 0.0000000001984
-log P ~ 9.70
Figure: significance along chromosomes I, II, III, IV, V and X. GENE LIST: DOWN REGULATED BETWEEN C. ELEGANS HAWAIIAN VS. N2 STRAINS
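The worked example can be recomputed with a short sketch. It assumes that P is the one-sided upper tail P(X >= x) of the hypergeometric distribution; the function name is hypothetical, and the four counts are the ones given on the slide.

```python
from math import comb, log10


def fisher_upper_tail(x, M, N, n):
    """P(X >= x): chance of x or more down-regulated genes in a region of n genes.

    M: chip genes on the chromosome, N: down-regulated genes on the chromosome,
    n: genes in region T, x: down-regulated genes in region T.
    """
    lo = max(x, 0, N + n - M)
    hi = min(N, n)
    tail = sum(comb(n, i) * comb(M - n, N - i) for i in range(lo, hi + 1))
    return tail / comb(M, N)


# Counts from the slide for region T (12750...13000 kbp).
p = fisher_upper_tail(x=14, M=3407, N=92, n=59)
print(f"P = {p:.3e}")
print(f"-log10 P = {-log10(p):.2f}")

# The significance threshold on the same scale.
print(f"-log10 0.05 = {-log10(0.05):.2f}")
```

Under that assumption the result should land close to the p-value quoted on the slide, and the last line shows where the 1.3 threshold comes from.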
[ C H R O M O S O M A L A N A L Y S I S ] Analysis of a chromosome
• What is the shortcoming of the previous approach?
• Discretization can split 'good' regions across bin boundaries, so that we may fail to find them
Total: 4 genes Over-expressed: 3 genes Under-expressed: 1 gene
[ C H R O M O S O M A L A N A L Y S I S ] Analysis of a chromosome
• Our bin size can also be too large to detect some regions and too small to detect others
Total: 8 genes Over-expressed: 4 genes Under-expressed: 4 genes
[ C H R O M O S O M A L A N A L Y S I S ] Analysis of a chromosome
Solution (partial):
• Sliding window technique
• Simply slide a bin along the chromosome
• Calculate the significance at each position
• The window size can be a number of genes or a physical region size
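A minimal sketch of a sliding window scan, assuming a window defined by a fixed number of genes and a one-sided hypergeometric tail as the significance measure. The function names and the 0/1 vector are hypothetical.

```python
from math import comb


def upper_tail(x, M, N, n):
    """P(X >= x) when n of M genes are in the window and N of M are in the list."""
    lo, hi = max(x, 0, N + n - M), min(N, n)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(lo, hi + 1)) / comb(M, N)


def sliding_window_scan(flags, window):
    """Score every window of `window` consecutive genes.

    flags: 0/1 per gene in chromosomal order (1 = gene is in the gene list).
    Returns one (start index, hits in window, p-value) tuple per position.
    """
    M, N = len(flags), sum(flags)
    hits = sum(flags[:window])
    scan = []
    for start in range(M - window + 1):
        if start > 0:
            # Update the count: one gene enters on the right, one leaves on the left.
            hits += flags[start + window - 1] - flags[start - 1]
        scan.append((start, hits, upper_tail(hits, M, N, window)))
    return scan


flags = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0]   # hypothetical chromosome
scan = sliding_window_scan(flags, window=4)
best = min(scan, key=lambda row: row[2])
print("most significant window starts at gene index", best[0], "with", best[1], "hits")
```

Because neighbouring windows overlap heavily, their p-values are strongly correlated, which is worth keeping in mind when interpreting many positions at once.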
[ C H R O M O S O M A L A N A L Y S I S ] Analysis of a chromosome
Good sides
• Now we do not split any region into halves
• Rather, we gather information at every possible position
[ C H R O M O S O M A L A N A L Y S I S ] Analysis of a chromosome
Shortcomings
• The user has to select a constant window size
• We can still miss 'good' regions that are much smaller or larger than the window
[ C H R O M O S O M A L A N A L Y S I S ] Analysis of a chromosome
Our solution (Pehkonen, Törönen and Wong)
• Hierarchical segmentation analysis of the chromosome
• Makes it possible to find 'significant' regions of different sizes in different locations of the chromosome
[ S E G M E N T A T I O N ] Background
• An ordered set of data includes signal-rich regions among noise
• Data can be partitioned into segments in order to separate signal from noise, and patterns from other patterns
• Existing method: recursive segmentation
• Existing applications: DNA segmentation, image analysis etc.
Figure labels: classes of data; ordered set of data points as binary image; noisy region; segmentation split; signal-rich region
[ S E G M E N T A T I O N ] Existing methods
• Recursive segmentation
• Used with e.g. a maximum likelihood ratio test to decide whether to continue splitting
• Shortcomings:
  • Unsatisfactory definition of the stopping criterion
  • The algorithm is often unable to detect the global optimum
  • The ML model is based on classical probability theory and does not take uncertainty into account
[ S E G M E N T A T I O N ] Existing methods
• Iterative algorithm instead of recursive
• Proceeds to the split that most increases the global segmentation score
• Reasonable visualization of the result with a dendrogram
• Facilitates observation of changes in segmentation score between local maxima and minima, and global maxima
• Bayesian segmentation score
• Dirichlet multinomial model
• Takes uncertainty into account
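The idea of an iterative, score-driven segmentation can be sketched as below. This is a generic greedy illustration, not the implementation behind the slides: it assumes a symmetric Dirichlet prior with alpha = 1, scores each segment by its Dirichlet-multinomial log marginal likelihood, and at every step takes the single split that raises the total score most. All names are hypothetical.

```python
from math import lgamma


def dm_log_score(counts, alpha=1.0):
    """Log marginal likelihood of class counts under a symmetric Dirichlet prior."""
    k, n = len(counts), sum(counts)
    score = lgamma(k * alpha) - lgamma(k * alpha + n)
    for c in counts:
        score += lgamma(alpha + c) - lgamma(alpha)
    return score


def segment_score(data, lo, hi, n_classes, alpha):
    """Score the half-open segment data[lo:hi] from its class counts."""
    counts = [0] * n_classes
    for label in data[lo:hi]:
        counts[label] += 1
    return dm_log_score(counts, alpha)


def greedy_segmentation(data, n_classes, max_segments, alpha=1.0):
    """Add one split per iteration, always the one with the largest score gain.

    Returns a list of (boundaries, total score), one entry per segmentation level.
    """
    bounds = [0, len(data)]
    total = segment_score(data, 0, len(data), n_classes, alpha)
    history = [(list(bounds), total)]
    while len(bounds) - 1 < max_segments:
        best = None
        for lo, hi in zip(bounds, bounds[1:]):
            base = segment_score(data, lo, hi, n_classes, alpha)
            for cut in range(lo + 1, hi):
                gain = (segment_score(data, lo, cut, n_classes, alpha)
                        + segment_score(data, cut, hi, n_classes, alpha) - base)
                if best is None or gain > best[0]:
                    best = (gain, cut)
        if best is None:          # no segment can be split any further
            break
        total += best[0]
        bounds = sorted(bounds + [best[1]])
        history.append((list(bounds), total))
    return history


# Hypothetical binary data: a block of class 1 inside a background of class 0.
data = [0] * 10 + [1] * 10 + [0] * 10
history = greedy_segmentation(data, n_classes=2, max_segments=3)
for bounds, score in history:
    print(bounds, round(score, 2))
```

Recording the score at every level is what makes a dendrogram-style view possible, since each level differs from the previous one by exactly one split. The sketch rescans every candidate cut at each level, which is simple but slow for long chromosomes.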
[ S E G M E N T A T I O N ] Our improvements
• Improved model selection criterion for finding the best segmentation level
• There are N data points in the data => there are N-1 positions between them
• There are k clusters in the model => there are k-1 edges (splits)
• In how many ways can k-1 edges be placed into the N-1 positions between data points?
• Corresponds to an MDL-based model selection criterion for detecting the clustering level
[ S E G M E N T A T I O N ] Evaluation of our method
• Simulated data creator
• Creates artificial data with random clusters and noise from a given model
• Evaluation of segmentation methods
• How closely the clustering result corresponds to the given original model
• Kullback-Leibler divergence, Jensen-Shannon divergence, mutual information
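The counting argument gives a binomial coefficient: k-1 edges chosen from N-1 positions. The sketch below computes that count and, as an assumption only, shows one way its logarithm could act as a complexity penalty in an MDL-style score; the slides do not give the exact form of the criterion, and the function names are hypothetical.

```python
from math import comb, log


def n_segmentations(N, k):
    """Number of ways to place k-1 split points into the N-1 gaps between N points."""
    return comb(N - 1, k - 1)


def penalised_score(log_likelihood, N, k):
    """Assumed MDL-style score: fit minus the cost of encoding the split positions."""
    return log_likelihood - log(n_segmentations(N, k))


print(n_segmentations(10, 3))                 # C(9, 2) ways for 10 points, 3 segments
print(round(penalised_score(-17.6, 30, 3), 2))
```

The count grows quickly with k up to the midpoint, so a penalty of this kind makes each additional split pay for itself through a better fit.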
[ S E G M E N T A T I O N ] Evaluation of our method
• Comparison of given model vs. created clustering result with KL-divergence
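For reference, the Kullback-Leibler and Jensen-Shannon divergences between two discrete distributions can be computed as in this sketch. How the model and the clustering result are turned into distributions is not shown on the slides, so the two example vectors are purely hypothetical class-probability vectors.

```python
from math import log


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats.

    Terms with p_i = 0 contribute nothing; q_i must be nonzero wherever p_i > 0.
    """
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and finite even when supports differ."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


model = [0.7, 0.2, 0.1]        # hypothetical class probabilities in the true model
estimate = [0.6, 0.3, 0.1]     # hypothetical probabilities from a found segment

print(round(kl_divergence(model, estimate), 4))
print(round(js_divergence(model, estimate), 4))
```

KL divergence is asymmetric and zero only when the two distributions agree, so smaller values mean the clustering is closer to the generating model.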
[ S E G M E N T A T I O N ] Evaluation of our method
• Multinomial data with 5 classes
Figure: clustering vs. model distance (KL divergence) on simulated artificial datasets
[ S E G M E N T A T I O N ] Evaluation of our method
• Multinomial data with 30 classes
Figure: clustering vs. model distance (KL divergence) on simulated artificial datasets
CATALIST SOFTWARE
• Two main parts
• Method testing: creation of artificial datasets and evaluation of segmentation methods
• Analysis of biological datasets: import of microarray data or gene lists from files
[ C A T A L I S T ] Results visualization
• Draws a dendrogram that shows the progress of the segmentation
• The dendrogram reveals the hierarchical localization of significant regions
• Detailed information on genes is available by clicking the regions found
• C. elegans strain comparison: see the three clusters found in chromosome 4
Figure: CHROMOSOME IV
[ L O C A T I O N A N A L Y S I S ] Results
• Yeast cell cycle genes were clustered according to gene expression data
• K-means clustering was performed with 2 to 10 clusters
[ L O C A T I O N A N A L Y S I S ] Results
• Each clustering into k groups was encoded as a multinomial data vector with k classes
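One way to picture this encoding: sort the genes by chromosomal position and read off each gene's cluster label, giving an ordered vector over k classes. The gene names, positions, cluster assignments and function name below are all hypothetical.

```python
def labels_in_chromosome_order(position, cluster):
    """Return cluster labels (0..k-1) ordered by gene position on the chromosome."""
    ordered_genes = sorted(position, key=position.get)
    return [cluster[gene] for gene in ordered_genes]


# Hypothetical start positions and k-means cluster assignments (k = 3).
position = {"geneA": 5200, "geneB": 1200, "geneC": 9800, "geneD": 3100}
cluster = {"geneA": 2, "geneB": 0, "geneC": 1, "geneD": 0}

vector = labels_in_chromosome_order(position, cluster)
print(vector)
```

Runs of the same label in such a vector are the kind of pattern a segmentation method can pick up as a region enriched for one expression cluster.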
[ L O C A T I O N A N A L Y S I S ] Results
• Yeast cell cycle gene expression clusters, chromosome 4:
[ L O C A T I O N A N A L Y S I S ] Results
• Result of hierarchical segmentation:
[ L O C A T I O N A N A L Y S I S ] Results
• Real results (red) compared to segmentation of several randomized data sets (blue):
[ L O C A T I O N A N A L Y S I S ] Results
• Nuclear receptors gene list: