
Gene Ontology as a tool for the systematic analysis of large-scale gene-expression data

Stefan Bentink. Joint group meeting Klipp/Spang, 11-20-2002.


Presentation Transcript


  1. Gene Ontology as a tool for the systematic analysis of large-scale gene-expression data • Stefan Bentink • Joint group meeting Klipp/Spang, 11-20-2002

  2. Overview • Microarrays and the Gene Ontology (GO) database • Scoring differential gene expression in GO groups • Checking scores against different null hypotheses • Sample data (two types of breast cancer) and results

  3. Overview • Microarrays and the Gene Ontology (GO) database • Scoring differential gene expression in GO groups • Checking scores against different null hypotheses • Sample data (two types of breast cancer) and results

  4. Microarrays: sample scheme • Genes are transcribed into mRNA (differential gene expression) • RNA isolation and synthesis of cDNA with labeled nucleotides (reverse transcription) • Hybridisation of the labeled cDNA to the array • Fluorescence indicates that gene B and gene C are transcribed

  5. Microarrays: comparative analysis (figure: ranking of genes)

  6. How to interpret the data? • Long list of significant genes • Which genes are of interest? • Solution: pooling of genes into functional classes • Provides a general overview • The Gene Ontology database provides such a functional classification

  7. The Gene Ontology database

  8. The Gene Ontology database • GO is a database of terms for genes • Known genes are annotated to the terms • Terms are connected as a directed acyclic graph • Levels represent the specificity of the terms

  9. The Gene Ontology database (figure: example terms, from general to specific) • Gene Ontology • Molecular function • Apoptosis regulator; Enzyme activator • Apoptosis activator; Protease activator • Apoptotic protease activator

  10. The Gene Ontology database • Every child term is a member of its parent term • GO contains three different sub-ontologies: Molecular function, Biological process, Cellular component • Unique identifier for every term, e.g. GO:0003673 (root = Gene Ontology)
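
The rule that every child term is a member of its parent term implies that a gene annotated to a specific term also counts for all of its ancestors. A minimal Python sketch of that propagation, using the terms from slide 9. The parent links, the gene name `gene_x`, and the helper names are assumptions made for illustration and are not taken from the GO database itself:

```python
# Toy DAG: each term maps to its parent terms (assumed links, for illustration).
PARENTS = {
    "apoptotic protease activator": ["apoptosis activator", "protease activator"],
    "apoptosis activator": ["apoptosis regulator"],
    "protease activator": ["enzyme activator"],
    "apoptosis regulator": ["molecular function"],
    "enzyme activator": ["molecular function"],
    "molecular function": ["Gene Ontology"],
    "Gene Ontology": [],
}


def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def build_groups(annotations, parents=PARENTS):
    """Map each term to the genes annotated to it or to any of its descendants."""
    groups = {}
    for gene, terms in annotations.items():
        for term in terms:
            for anc in ancestors(term, parents):
                groups.setdefault(anc, set()).add(gene)
    return groups


# Hypothetical annotation: one gene attached to the most specific term.
groups = build_groups({"gene_x": ["apoptotic protease activator"]})
```

Because a term can have several parents, the traversal keeps a `seen` set so that shared ancestors are visited only once.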

  11. Gene Ontology and microarrays • Hypothesis: functionally related, differentially expressed genes should accumulate in the corresponding GO-group. • Problem: find a method that scores the accumulation of differential gene expression in a node of the Gene Ontology.

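  12. Gene Ontology and microarrays • Expression matrix: genes (rows) by samples (columns), samples split into tissue type 1 and 2 • Genes are mapped to GO-groups (GO:1, GO:2, GO:3, GO:4) • P-value for every gene by a two-sample t-test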
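
As a rough illustration of the per-gene test, the following Python sketch computes a two-sample t statistic for every row of an expression matrix. The pooled-variance (equal-variance) form, the function names, and the toy numbers are assumptions; the slides do not say which variant of the t-test was used:

```python
import math
from statistics import mean, variance


def t_statistic(group1, group2):
    """Pooled-variance two-sample t statistic (equal-variance form, an assumption)."""
    n1, n2 = len(group1), len(group2)
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))


def t_per_gene(matrix, labels):
    """matrix: one row of expression values per gene; labels: tissue type (1 or 2) per column."""
    result = []
    for row in matrix:
        type1 = [x for x, lab in zip(row, labels) if lab == 1]
        type2 = [x for x, lab in zip(row, labels) if lab == 2]
        result.append(t_statistic(type1, type2))
    return result


# Made-up values: 2 genes x 6 samples, three samples per tissue type.
labels = [1, 1, 1, 2, 2, 2]
matrix = [
    [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],  # same values in both tissue types
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # lower in tissue type 1
]
t_values = t_per_gene(matrix, labels)
```

A negative t here means lower expression in tissue type 1 than in type 2; the sign convention is a choice of this sketch.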

  13. Overview • Microarrays and the Gene Ontology (GO) database • Scoring differential gene expression in GO groups • Checking scores against different null hypotheses • Sample data (two types of breast cancer) and results

  14. Scoring methods • Input: the p-values of the genes 1, 2, 3, ... in a GO-group • Number of significant genes in a GO-group • Sum of negative logarithms of all p-values (Σ -log P) • sup|P(n)-F(n)| according to Kolmogorov-Smirnov
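
The first scoring method, counting significant genes, can be sketched in a few lines of Python. The threshold follows the Bonferroni rule named on slide 19 (α = 0.05/n); reading n as the number of genes on the array, as well as the function name and the example p-values, are assumptions:

```python
def count_significant(p_values, n_genes, alpha=0.05):
    """Count p-values below the Bonferroni-corrected threshold alpha / n_genes."""
    threshold = alpha / n_genes
    return sum(1 for p in p_values if p < threshold)


# Made-up p-values for one GO-group, with n = 7000 genes on the array.
group_p = [1e-7, 5e-6, 2e-4, 0.03, 0.4]
n_significant = count_significant(group_p, n_genes=7000)
```

With these numbers the threshold is about 7.1e-6, so only the two smallest p-values count.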

  15. The p-value • t < 0 => p = cdf(t) • t > 0 => p = 1 - cdf(t) • => p ∈ (0, 0.5] • m = 2*p => m ∈ (0, 1] • cdf: cumulative distribution function of t
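
The mapping from t to p and m can be written down directly. In this Python sketch the standard normal CDF stands in for the CDF of the t distribution, only because the standard library has no t distribution; that substitution, and both function names, are assumptions, so the numbers will differ from those on the slides:

```python
from statistics import NormalDist


def one_sided_p(t, cdf=NormalDist().cdf):
    """p = cdf(t) for t < 0 and 1 - cdf(t) otherwise, so p lies in (0, 0.5]."""
    return cdf(t) if t < 0 else 1.0 - cdf(t)


def m_value(t, cdf=NormalDist().cdf):
    """m = 2 * p, which lies in (0, 1]."""
    return 2.0 * one_sided_p(t, cdf)
```

Any other `cdf` callable, for example the CDF of a t distribution with the appropriate degrees of freedom, can be passed in instead of the default.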

  16. Sum of log-score • Pavlidis, Lewis, Noble 2001; Zien, Küffner, Zimmer, Lengauer 2000 • 2*p -> 1 => -log(2*p) -> 0 • Small p-values give a high score
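
A minimal Python sketch of the sum of log-score, assuming the input is the list of one-sided p-values (each in (0, 0.5]) of the genes in one GO-group. The function name and the example values are made up:

```python
import math


def sum_log_score(p_values):
    """Sum of -log(2 * p) over the one-sided p-values of a GO-group."""
    return sum(-math.log(2.0 * p) for p in p_values)


weak = sum_log_score([0.5, 0.45, 0.4])      # 2*p close to 1, terms close to 0
strong = sum_log_score([1e-4, 5e-4, 1e-3])  # small p-values, large terms
```

Note that the score grows with the number of genes in the group, which is one reason to compare it against a permutation background computed for the same group size.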

  17. Kolmogorov-Smirnov score • Hypothesis: the calculated p-values (multiplied by 2) are uniformly distributed between 0 and 1 • S = sup|P(n)-F(n)| • P(n): empirical distribution of the p-values for genes that fall into a GO-group • F(n): theoretical distribution of uniformly distributed values between 0 and 1 • (figure: empirical vs. theoretical distribution)
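
The Kolmogorov-Smirnov score can be sketched as the largest gap between the empirical distribution of the doubled p-values in a group and the uniform distribution on [0, 1]. This is the standard one-sample KS statistic; whether the talk used exactly this form is an assumption, and the function name is made up:

```python
def ks_score(m_values):
    """sup |P(x) - F(x)| between the empirical CDF of m_values and F(x) = x on [0, 1]."""
    xs = sorted(m_values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # The empirical CDF jumps from (i - 1) / n to i / n at x; check both sides.
        d = max(d, i / n - x, x - (i - 1) / n)
    return d
```

Unlike the sum of log-score, this statistic responds to any deviation from uniformity, not only to an excess of small values, which may be part of why the two methods pick up different GO-groups.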

  18. Overview • Microarrays and the Gene Ontology (GO) database • Scoring differential gene expression in GO groups • Checking scores against different null hypotheses • Sample data (two types of breast cancer) and results

  19. Null hypotheses • The significant genes (according to Bonferroni: α=0.05/n) are distributed over the GO-groups by chance • The existing differential gene expression is distributed over the GO-groups by chance • There is no differential gene expression in a GO-group

  20. Checking H0 by permutation (matrix of genes by samples) • Permutation of rows: the mapping of p-values into GO-groups is randomized. H0: distribution of differential gene expression • Permutation of columns: the level of p-values is randomized. H0: no differential gene expression in a GO-group
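
Row permutation can be sketched as shuffling the per-gene values across genes while the GO-group membership stays fixed. The Python sketch below uses the sum of log-score on doubled p-values; the function names, the fixed seed, and the toy data are assumptions, and it counts permutations that reach or exceed the observed score:

```python
import math
import random


def sum_log_score(m_values):
    """Sum of -log(m) over doubled p-values m = 2 * p."""
    return sum(-math.log(m) for m in m_values)


def row_permutation_fraction(m_values, group_rows, score=sum_log_score,
                             n_perm=1000, seed=0):
    """Fraction of row permutations whose score reaches the observed group score."""
    rng = random.Random(seed)
    observed = score([m_values[i] for i in group_rows])
    shuffled = list(m_values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomizes which value belongs to which gene
        if score([shuffled[i] for i in group_rows]) >= observed:
            hits += 1
    return hits / n_perm


# Made-up data: 100 genes; the first 10 form a GO-group with very small values.
m_values = [1e-6] * 10 + [0.5] * 90
enriched = row_permutation_fraction(m_values, group_rows=range(10))
ordinary = row_permutation_fraction(m_values, group_rows=range(90, 100))
```

The group size is preserved in every permutation, so the background distribution is specific to each GO-group.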

  21. Checking H0 by permutation • 1000 random permutations => background distributions • H0: distribution of significant genes => randomizing GO-groups (rows) • H0: distribution of all p-values => randomizing GO-groups (rows) • H0: level of p-values => permutation of columns
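
Column permutation differs in that the sample labels are shuffled and the p-values are recomputed each time. The Python sketch below does this for a single GO-group with the sum of log-score. The pooled-variance t statistic, the normal CDF used in place of the t distribution, the clamping constant, and all names and data are assumptions made to keep the example within the standard library:

```python
import math
import random
from statistics import NormalDist, mean, variance


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic (an assumed variant)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def group_score(rows, labels):
    """Sum of -log(2 * p) over the genes (rows) of one GO-group."""
    total = 0.0
    for row in rows:
        a = [x for x, lab in zip(row, labels) if lab == 1]
        b = [x for x, lab in zip(row, labels) if lab == 2]
        t = t_statistic(a, b)
        # Normal CDF as a stand-in for the t distribution; clamp to avoid log(0).
        p = max(NormalDist().cdf(-abs(t)), 1e-300)
        total += -math.log(2.0 * p)
    return total


def column_permutation_fraction(rows, labels, n_perm=1000, seed=0):
    """Fraction of label permutations whose group score reaches the observed one."""
    rng = random.Random(seed)
    observed = group_score(rows, labels)
    permuted = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # breaks the link between samples and tissue type
        if group_score(rows, permuted) >= observed:
            hits += 1
    return hits / n_perm


# Made-up data: one GO-group with 2 genes, 6 samples per tissue type.
labels = [1] * 6 + [2] * 6
rows = [
    [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 3.0, 3.2, 2.9, 3.1, 3.3, 2.8],
    [2.0, 2.1, 1.9, 2.2, 1.8, 2.3, 4.0, 4.1, 3.9, 4.2, 3.8, 4.3],
]
fraction = column_permutation_fraction(rows, labels)
```

Shuffling labels removes differential expression altogether, so this background tests whether the group shows any differential expression, not whether it shows more than other groups.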

  22. Methods (summary) • Data => p-values • Scores: number of significant genes; sum of -log P; sup|P(n)-F(n)| • Check against 1000 permutations of rows (GO-groups) • Check against 1000 permutations of columns (samples => level of p-values)

  23. Overview • Microarrays and the Gene Ontology (GO) database • Scoring differential gene expression in GO groups • Checking scores against different null hypotheses • Sample data (two types of breast cancer) and results

  24. Results: Data (Breast Cancer) • Two major subclasses: estrogen receptor positive (ER+) and estrogen receptor negative (ER-) • Estrogen receptor positive: susceptible to Tamoxifen, slightly better survival rate • Great molecular differences between the two types

  25. Results: Data (Breast Cancer) • Data: 25 ER+, 24 ER- • Array: Affymetrix HuGeneFL • ~ 7000 genes • ~ 4000 annotated to GO-terms • Data were normalized by variance stabilization (Heydebreck et al. 2001)

  26. Results: Pre-conditions • A GO-group is considered significant if fewer than 5% of the random permutations exceed its score • Only GO-groups with more than 5 and fewer than 1000 genes were taken into account
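
Both pre-conditions are simple enough to state as code. A minimal Python sketch, with hypothetical function and parameter names:

```python
def is_significant(observed, background, max_fraction=0.05):
    """True if fewer than max_fraction of the permutation scores exceed the observed score."""
    exceeding = sum(1 for s in background if s > observed)
    return exceeding / len(background) < max_fraction


def eligible(group_size, min_size=5, max_size=1000):
    """Keep GO-groups with more than min_size and fewer than max_size genes."""
    return min_size < group_size < max_size
```

With 1000 permutations, a group passes when at most 49 background scores exceed its observed score.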

  27. Results: Number of significant genes • According to the pre-conditions, 16 GO-groups were found

  28. Results: Permutation of rows (distribution hypothesis) • Figure panels: sum of -log P; Kolmogorov-Smirnov

  29. Results: Permutation of columns (differential gene-expression hypothesis) • Figure panels: sum of -log P; Kolmogorov-Smirnov

  30. Results • The column permutation leads to a very low background distribution • Many "significant" GO-groups • May help to find functional groups without differential gene expression • Different scoring methods seem to be complementary, as indicated by the results of the row permutation

  31. Results: Permutation of the rows • Sum of log: 44 GO-groups were found (5% cond., ...) • KS-score: 77 GO-groups were found (5% cond., ...) • Example: GO:0000087 M-Phase of mitotic cell-cycle (37 genes)

  32. Results: Comparing the scoring-methods (from the row-permutation) • A: counting of significant genes in GO-groups: 16 • B: Kolmogorov-Smirnov: 77 • C: sum of logarithms: 43 • A and B: 3 • A and C: 13 • C and B: 13 • A, B and C: 3 • C without A: 30 • B without A: 74
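
The counts on this slide can be cross-checked with a few lines of arithmetic. The Python sketch below takes the reported set sizes and overlaps at face value and derives the two differences plus the size of the union by inclusion-exclusion; the union is not stated on the slide and is computed here only as a consistency check:

```python
# Set sizes and overlaps as reported on the slide.
a, b, c = 16, 77, 43
ab, ac, bc, abc = 3, 13, 13, 3

c_without_a = c - ac  # 30, matches the slide
b_without_a = b - ab  # 74, matches the slide

# Inclusion-exclusion: GO-groups found by at least one method.
union = a + b + c - ab - ac - bc + abc  # 110
```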

  33. Browsing the results

  34. Results: Interesting GO-term (M-Phase) • Contains a couple of interesting proliferative genes (p-value ~5*10^-4 => "not significant") • E.g.: polo-like kinase • t-value: -3.45; p-value: 5.59*10^-4 • Would not have been found by a single-gene approach • Correlation with the estrogen receptor could be found in the literature (Wolf et al., 2000)

  35. Summary / outlook • GO provides a general view on large-scale gene-expression data • Less deregulated but very interesting genes could be found • Third null hypothesis => differential gene expression over a wide range of genes (outlook: which GO-groups contain no differential gene expression?) • No bias of scores by top-level genes (outlook: leaving out top-level genes for scoring) • Possible modification of the scoring methods: up- and downregulation
