Gene Ontology as a tool for the systematic analysis of large-scale gene-expression data. Stefan Bentink Joint groupmeeting Klipp/Spang 11-20-2002. Overview. Microarrays and the Gene Ontology (GO) database Scoring differential gene-expression in GO groups
Gene Ontology as a tool for the systematic analysis of large-scale gene-expression data Stefan Bentink Joint groupmeeting Klipp/Spang 11-20-2002
Overview • Microarrays and the Gene Ontology (GO) database • Scoring differential gene-expression in GO groups • Checking scores against different null hypotheses • Sample data (two types of Breast Cancer) and results
Microarrays: sample scheme • [Figure: genes A to D; genes B and C are transcribed into mRNA (differential gene expression)] • RNA isolation and synthesis of cDNA with labeled nucleotides (reverse transcription) • Hybridisation of the labeled cDNA • Fluorescence indicates that gene B and gene C are transcribed
Microarrays: comparative analysis • [Figure: ranking of genes, marked with "?"]
How to interpret the data? • Long list of significant genes • Which genes are of interest? • Solution: pooling of genes into functional classes • Provides a general overview • The Gene Ontology database provides such a functional classification
The Gene Ontology database • GO is a database of terms for genes • Known genes are annotated to the terms • Terms are connected as a directed acyclic graph • Levels represent the specificity of the terms
The Gene Ontology database • [Figure: excerpt of the GO graph, from general to specific terms: Gene Ontology; Molecular function; Apoptosis regulator, Enzyme activator; Apoptosis activator, Protease activator; Apoptotic protease activator]
The Gene Ontology database • Every child-term is a member of its parent-term • GO contains three different sub-ontologies: Molecular function, Biological process, Cellular component • Unique identifier for every term, e.g. GO:0003673 (root = Gene Ontology)
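A minimal sketch of the graph structure described on these slides, assuming the parent links as read off the example figure (the exact edges are an assumption, and `geneX` is a made-up gene name). It shows how a gene annotated to a specific term also becomes a member of every parent term.

```python
# Hypothetical excerpt of the GO graph: each term maps to its parent terms.
# The edges are read off the example figure and are an assumption.
PARENTS = {
    "Gene Ontology": [],
    "Molecular function": ["Gene Ontology"],
    "Apoptosis regulator": ["Molecular function"],
    "Enzyme activator": ["Molecular function"],
    "Apoptosis activator": ["Apoptosis regulator"],
    "Protease activator": ["Enzyme activator"],
    "Apoptotic protease activator": ["Apoptosis activator", "Protease activator"],
}


def ancestors(term, parents=PARENTS):
    """Return all terms reachable from `term` by following parent links."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


def go_groups(annotations, parents=PARENTS):
    """Map every term to the genes annotated to it or to any of its descendants."""
    groups = {term: set() for term in parents}
    for gene, terms in annotations.items():
        for term in terms:
            groups[term].add(gene)
            for ancestor in ancestors(term, parents):
                groups[ancestor].add(gene)
    return groups


# Hypothetical annotation: the gene name is made up for illustration.
groups = go_groups({"geneX": ["Apoptotic protease activator"]})
```

Because the graph is acyclic but not a tree, a term can have more than one parent, so the traversal keeps a set of visited terms.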
Gene Ontology and microarrays • Hypothesis: Functionally related, differentially expressed genes should accumulate in the corresponding GO-group. • Problem: Find a method that scores the accumulation of differential gene expression in a node of the Gene Ontology.
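Gene Ontology and microarrays • [Figure: expression matrix with genes as rows and samples as columns; samples belong to tissue type 1 or 2; genes are mapped to GO-groups GO:1 to GO:4] • P-value for every gene by a two-sample t-test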
Scoring methods • [Figure: the p-values of genes 1, 2, 3, ... in a GO-group are combined into one score, e.g. Σ -log P] • Number of significant genes in a GO-group • Sum of negative logarithms of all p-values • sup|P(n)-F(n)| according to Kolmogorov-Smirnov
The p-value • t < 0 => p = cdf(t) • t > 0 => p = 1 - cdf(t) • => p ∈ (0, 0.5], m ∈ (0, 1], m = 2*p • cdf: cumulative distribution function of t
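A minimal sketch of the p-value definition above, using only the Python standard library. The pooled-variance form of the two-sample t-test is an assumption (the slides do not say which variant was used), and the Student t distribution is integrated numerically with Simpson's rule, so very small p-values are only approximate. All function names are chosen for illustration.

```python
import math
from statistics import mean, variance


def t_statistic(group1, group2):
    """Two-sample t statistic with pooled variance (an assumption)."""
    n1, n2 = len(group1), len(group2)
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))


def half_area(t, df, steps=2000):
    """Integral of the Student t density from 0 to |t| (Simpson's rule, even `steps`)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3


def one_sided_p(t, df):
    """Equals cdf(t) for t < 0 and 1 - cdf(t) for t > 0, so p lies in (0, 0.5]."""
    return 0.5 - half_area(t, df)


def folded_m(t, df):
    """m = 2*p, which lies in (0, 1]."""
    return 2 * one_sided_p(t, df)
```

For two groups of sizes n1 and n2, the pooled-variance test uses df = n1 + n2 - 2.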
Sum of log-score • Pavlidis, Lewis, Noble 2001; Zien, Küffner, Zimmer, Lengauer 2000 • 2*p -> 1 => -log(2*p) -> 0 • Small p-values, high score
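The sum of log-score can be sketched in a few lines. The natural logarithm is an assumption (the slides do not state the base), and `m_values` stands for the doubled p-values 2*p of the genes in one GO-group.

```python
import math


def sum_log_score(m_values):
    """Sum of negative logarithms of the doubled p-values in one GO-group."""
    return sum(-math.log(m) for m in m_values)
```

A group whose values are all close to 1 scores near 0, while a few very small values dominate the sum.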
Kolmogorov-Smirnov-Score • Hypothesis: the calculated p-values (multiplied by 2) are uniformly distributed between 0 and 1. • [Figure: empirical vs. theoretical cumulative distribution of the values, plotted from 0 to 1 over n] • S = sup|P(n)-F(n)| • P(n): p-values for genes that fall into a GO-group. • F(n): uniformly distributed values between 0 and 1.
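A minimal sketch of the score S = sup|P(n)-F(n)|, assuming the standard one-sample Kolmogorov-Smirnov statistic against the uniform distribution on [0, 1]. The function name is chosen for illustration.

```python
def ks_score(m_values):
    """Largest gap between the empirical cdf of the values and the uniform cdf."""
    xs = sorted(m_values)
    n = len(xs)
    # The empirical cdf jumps at each sorted value, so the gap is checked
    # just after the jump ((i + 1) / n - x) and just before it (x - i / n).
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
```

Unlike the sum of logarithms, this score reacts to any deviation from uniformity, including an accumulation of large values.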
Null hypotheses • The significant genes (according to Bonferroni: α=0.05/n) are distributed over the GO-groups by chance • The existing differential gene expression is distributed over the GO-groups by chance • There is no differential gene expression in a GO-group
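The first score, the number of significant genes in a GO-group, can be sketched as follows. Whether the one-sided p or the doubled value 2*p is compared with the threshold, and whether n counts all genes on the array, are assumptions left to the caller.

```python
def count_significant(p_values, n_tests, alpha=0.05):
    """Count the p-values below the Bonferroni threshold alpha / n_tests."""
    threshold = alpha / n_tests
    return sum(1 for p in p_values if p < threshold)
```

With roughly 7000 genes the threshold is about 7*10^-6, so a gene with a p-value around 5*10^-4 is not counted.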
Checking H0 by permutation • [Figure: expression matrix with genes as rows and samples as columns] • Permutation of rows: mapping of p-values into GO-groups is randomized. H0: Distribution of differential gene expression • Permutation of columns: level of p-values is randomized. H0: No differential gene expression in a GO-group
Checking H0 by permutation • 1000 random permutations => background distributions • H0: Distribution of significant genes: randomizing GO-groups (rows) • H0: Distribution of all p-values: randomizing GO-groups (rows) • H0: Level of p-values: permutation of columns
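Both permutation schemes can be sketched as below. This is an illustration under stated assumptions, not the original implementation: for a single GO-group, permuting the rows is treated as drawing a random gene set of the same size, and `score` and `gene_value` are hypothetical callables supplied by the caller (for example a sum of log-score and a function that recomputes 2*p for one gene).

```python
import random


def row_permutation_fraction(m_values, group, score, n_perm=1000, seed=0):
    """Fraction of random gene sets whose score exceeds the observed score.

    Permuting the rows randomizes which values fall into the GO-group,
    which for one group amounts to drawing a random gene set of equal size.
    """
    rng = random.Random(seed)
    observed = score([m_values[i] for i in group])
    exceed = 0
    for _ in range(n_perm):
        random_group = rng.sample(range(len(m_values)), len(group))
        if score([m_values[i] for i in random_group]) > observed:
            exceed += 1
    return exceed / n_perm


def column_permutation_fraction(expr, labels, group, gene_value, score,
                                n_perm=1000, seed=0):
    """Fraction of label shuffles whose group score exceeds the observed score.

    `expr[i]` holds the expression values of gene i across all samples and
    `gene_value(row, labels)` recomputes that gene's (doubled) p-value.
    """
    rng = random.Random(seed)

    def group_score(current_labels):
        return score([gene_value(expr[i], current_labels) for i in group])

    observed = group_score(labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if group_score(shuffled) > observed:
            exceed += 1
    return exceed / n_perm
```

The row version reuses the p-values and is cheap. The column version recomputes every p-value in the group for each shuffle, which is why the background it produces reflects the level of the p-values and not their assignment to groups.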
Methods (summary) • Data => P-values => scores: number of significant genes, sum of -log P, sup|P(n)-F(n)| • Check against 1000 permutations of rows (GO-groups) • Check against 1000 permutations of columns (samples => level of p-values)
Results: Data (Breast Cancer) • Two major subclasses: estrogen receptor positive (ER+) and estrogen receptor negative (ER-) • Estrogen receptor positive: susceptible to Tamoxifen, slightly better survival rate • Great molecular differences between the two types
Results: Data (Breast Cancer) • Data: 25 ER+, 24 ER- • Array: Affymetrix HuGeneFL • ~ 7000 genes • ~ 4000 annotated to GO-terms • Data were normalized by variance stabilization (Heydebreck et al. 2001)
Results: Pre-conditions • A GO-group is considered significant if fewer than 5% of the random permutations exceed its score • Only GO-groups with more than 5 and fewer than 1000 genes were taken into account
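The two pre-conditions can be sketched as a simple filter. The dictionaries, the function name, and the example fractions in the usage note are hypothetical; `exceed_fractions` stands for the fraction of permutations whose score exceeded the observed score of each group.

```python
def significant_groups(group_sizes, exceed_fractions,
                       max_fraction=0.05, min_genes=5, max_genes=1000):
    """Return the GO-groups that satisfy both pre-conditions."""
    return [
        go_id
        for go_id, size in group_sizes.items()
        if min_genes < size < max_genes and exceed_fractions[go_id] < max_fraction
    ]
```

For example, a group with 37 genes and a made-up fraction of 0.01 passes, while a group with 5 genes is excluded regardless of its score.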
Results: Number of significant genes • According to the pre-conditions, 16 GO-groups were found
Results: Permutation of rows (distribution hypothesis) • [Figure panels: Sum of -log P; Kolmogorov-Smirnov]
Results: Permutation of columns (differential gene-expression hypothesis) • [Figure panels: Sum of -log P; Kolmogorov-Smirnov]
Results • The column-permutation leads to a very low background distribution • Many "significant" GO-groups • May help to find functional groups without differential gene-expression • Different scoring methods seem to be complementary as indicated by the results of the row-permutation
Results: Permutation of the rows • Sum of log: 44 GO-groups were found (5% cond., ...) • KS-score: 77 GO-groups were found (5% cond., ...) • GO:0000087 M-Phase of mitotic cell-cycle (37 genes)
Results: Comparing the scoring-methods (from the row-permutation) • A: counting of significant genes in GO-groups • B: Kolmogorov-Smirnov • C: sum of logarithms • [Venn diagram] A: 16; B: 77; C: 43; A and B: 3; A and C: 13; C and B: 13; A, B and C: 3; C without A: 30; B without A: 74
Results: Interesting GO-term (M-Phase) • Contains a couple of interesting proliferative genes (p-value ~5*10^-4 => "not significant") • E.g.: polo-like kinase • t-value: -3.45; p-value: 5.59*10^-4 • would not have been found by a single-gene approach • correlation with the estrogen receptor could be found in literature (Wolf et al., 2000)
Summary/outlook • GO provides a general view on large-scale gene-expression data • Less deregulated but very interesting genes could be found • Third null hypothesis => differential gene expression over a wide range of genes (outlook: which GO-groups contain no differential gene-expression) • No bias of scores by top-level genes (outlook: leaving out top-level genes for scoring) • Possible modification of scoring-methods: up- and downregulation