Gene Ontology as a tool for the systematic analysis of large-scale gene-expression data

Stefan Bentink
Joint group meeting Klipp/Spang, 11-20-2002

Overview
• Microarrays and the Gene Ontology (GO) database
• Scoring differential gene expression in GO groups
• Checking scores against different null hypotheses
• Sample data (two types of breast cancer) and results

Microarrays: sample scheme
[Figure: genes A to D; genes B and C are transcribed into mRNA (differential gene expression). RNA isolation and synthesis of cDNA with labeled nucleotides (reverse transcription), followed by hybridisation of the labeled cDNA. Fluorescence indicates that gene B and gene C are transcribed.]

Microarrays: comparative analysis
[Figure: ranking]

How to interpret the data?
• Long list of significant genes
• Which genes are of interest?
• Solution: pooling of genes into functional classes
• Provides a general overview
• The Gene Ontology database provides such a functional classification

The Gene Ontology database

The Gene Ontology database
• GO is a database of terms for genes
• Known genes are annotated to the terms
• Terms are connected as a directed acyclic graph
• Levels represent the specificity of the terms

The Gene Ontology database
[Figure: excerpt of the GO graph showing the terms Gene Ontology, Molecular function, Apoptosis regulator, Enzyme activator, Apoptosis activator, Protease activator and Apoptotic protease activator]

The Gene Ontology database
• Every child term is a member of its parent term
• GO contains three different sub-ontologies: Molecular function, Biological process, Cellular component
• Unique identifier for every term, e.g. GO:0003673 (root = Gene Ontology)
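The membership rule above (a gene annotated to a child term also belongs to every parent term) can be sketched as a walk up the graph. This is a minimal sketch: the edges in `PARENTS` are an assumption reconstructed from the example terms on the previous slide, and the gene names are made up.

```python
from collections import defaultdict

# Hypothetical excerpt of the GO graph: child term -> list of parent terms.
# The edges are an assumption, not taken from the GO database itself.
PARENTS = {
    "molecular function": ["Gene Ontology"],
    "apoptosis regulator": ["molecular function"],
    "enzyme activator": ["molecular function"],
    "apoptosis activator": ["apoptosis regulator"],
    "protease activator": ["enzyme activator"],
    "apoptotic protease activator": ["apoptosis activator", "protease activator"],
}

def ancestors(term, parents=PARENTS):
    """Return all terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(direct_annotations, parents=PARENTS):
    """Map every term to the genes annotated to it or to any of its descendants."""
    groups = defaultdict(set)
    for gene, terms in direct_annotations.items():
        for term in terms:
            groups[term].add(gene)
            for anc in ancestors(term, parents):
                groups[anc].add(gene)
    return dict(groups)

# Made-up gene names, for illustration only.
groups = propagate({
    "geneA": ["apoptotic protease activator"],
    "geneB": ["enzyme activator"],
})
```

Because a term can have several parents, the walk keeps a `seen` set so that shared ancestors are visited only once.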

Gene Ontology and microarrays
• Hypothesis: functionally related, differentially expressed genes should accumulate in the corresponding GO-group.
• Problem: find a method that scores the accumulation of differential gene expression in a node of the Gene Ontology.

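Gene Ontology and microarrays
• P-value for every gene by a two-sample t-test
[Figure: expression matrix with genes as rows and samples as columns; samples belong to tissue type 1 or 2; genes are mapped to the GO-groups GO:1 to GO:4]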
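The per-gene test statistic could be computed along the following lines. This is a sketch under assumptions: it uses the pooled-variance (Student) form of the two-sample t-test, which the slides do not specify, and the expression values are made up.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(x, y):
    """Student's two-sample t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Equal variances in both groups are
    an assumption of this sketch.
    """
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))
    return t, df

# Made-up expression values: one row per gene, one column per sample.
expression = {
    "gene1": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],
    "gene2": [2.0, 2.1, 1.9, 2.0, 2.0, 1.9, 2.1, 2.0],
}
labels = [1, 1, 1, 1, 2, 2, 2, 2]  # tissue type of each column

t_values = {}
for gene, row in expression.items():
    type1 = [v for v, lab in zip(row, labels) if lab == 1]
    type2 = [v for v, lab in zip(row, labels) if lab == 2]
    t_values[gene] = two_sample_t(type1, type2)
```

In this toy data `gene1` is clearly lower in tissue type 1, so its t-value is strongly negative, while `gene2` has identical group means and a t-value of zero.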

Scoring methods
• Number of significant genes in a GO-group
• Sum of negative logarithms of all p-values (Σ -log P)
• sup|P(n)-F(n)| according to Kolmogorov-Smirnov
[Figure: the p-values of genes 1, 2, 3, ... in a GO-group are combined into one score]

The p-value
• t < 0 => p = cdf(t)
• t > 0 => p = 1 - cdf(t)
• => p in (0, 0.5]
• m = 2*p, so m in (0, 1]
• cdf: cumulative distribution function of the t-distribution
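The mapping from a t-value to p and m = 2*p can be sketched with the standard library alone. The numerical integration (Simpson's rule) and the function names are choices made for this illustration, not something the slides prescribe; in practice a statistics library would supply the cdf.

```python
from math import exp, lgamma, log, pi

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))

def t_cdf(t, df, steps=2000):
    """CDF via Simpson's rule on [0, |t|], using symmetry around 0.

    `steps` must be even. Accuracy is limited for extremely small tails.
    """
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # integral of the density from 0 to |t|
    return 0.5 + area if t > 0 else 0.5 - area

def one_sided_p(t, df):
    """p = cdf(t) for t < 0 and p = 1 - cdf(t) otherwise, so p lies in (0, 0.5]."""
    c = t_cdf(t, df)
    return c if t < 0 else 1.0 - c

def m_value(t, df):
    """m = 2*p, which lies in (0, 1]."""
    return 2.0 * one_sided_p(t, df)
```

The sign of t only decides which tail is used, so `one_sided_p(t, df)` and `one_sided_p(-t, df)` agree.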

Sum of log-score
• Pavlidis, Lewis, Noble 2001; Zien, Küffner, Zimmer, Lengauer 2000
• 2*p -> 1 => -log(2*p) -> 0
• Small p-values give a high score
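As a minimal sketch, the score of one GO-group is the sum of -log(2*p) over its genes. The natural logarithm is an assumption here; the slides do not state the base.

```python
from math import exp, log

def sum_log_score(m_values):
    """Sum of -log(m) over the genes of one GO-group, with m = 2*p in (0, 1]."""
    return sum(-log(m) for m in m_values)

# Values close to 1 contribute almost nothing, small values contribute a lot.
flat_group = sum_log_score([0.5, 0.9])
strong_group = sum_log_score([0.001, 0.01])
```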

Kolmogorov-Smirnov score
• Hypothesis: the calculated p-values (multiplied by 2) are uniformly distributed between 0 and 1.
• S = sup|P(n)-F(n)|
• P(n): p-values for genes that fall into a GO-group (empirical).
• F(n): uniformly distributed values between 0 and 1 (theoretical).
[Figure: empirical and theoretical values between 0 and 1, plotted against n]
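One way to compute S is the one-sample Kolmogorov-Smirnov statistic against the uniform distribution on [0, 1]. The sketch below assumes that reading of sup|P(n)-F(n)|, with P as the empirical distribution function of a group's values and F(x) = x.

```python
def ks_score(m_values):
    """sup |P - F| between the empirical CDF P of the values and the
    uniform CDF F(x) = x on [0, 1]."""
    xs = sorted(m_values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - x), abs(x - i / n))
    return d
```

Evenly spread values give a small score, while values crowded near 0 give a score close to 1. For example, `ks_score([0.25, 0.75])` is 0.25 and `ks_score([0.01, 0.02, 0.03])` is 0.97.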

Null hypotheses
• The significant genes (according to Bonferroni: α=0.05/n) are distributed over the GO-groups by chance
• The existing differential gene expression is distributed over the GO-groups by chance
• There is no differential gene expression in a GO-group
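Counting the significant genes of a GO-group under the Bonferroni threshold could look like this. It is a sketch that assumes n is the total number of genes tested; the p-values are made up.

```python
def count_significant(p_values, group_genes, alpha=0.05):
    """Number of genes in a GO-group whose p-value is below alpha / n,
    where n is the total number of tested genes (an assumption)."""
    threshold = alpha / len(p_values)  # Bonferroni: alpha / n
    return sum(1 for g in group_genes if p_values[g] < threshold)

# Made-up p-values for four genes; the threshold is 0.05 / 4 = 0.0125.
p_values = {"g1": 1e-6, "g2": 0.02, "g3": 5e-4, "g4": 0.4}
n_sig = count_significant(p_values, ["g1", "g2", "g3"])
```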

Checking H0 by permutation
• Permutation of rows: the mapping of p-values into GO-groups is randomized. H0: distribution of differential gene expression
• Permutation of columns: the level of p-values is randomized. H0: no differential gene expression in a GO-group
[Figure: expression matrix with genes as rows and samples as columns]
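A row permutation for a single GO-group can be sketched as follows. The function names, the seed and the toy data are assumptions made for illustration; the score is the sum of -log values, and any other group score could be passed in instead.

```python
import random
from math import log

def sum_log_score(values):
    """Sum of -log(m) over the given values."""
    return sum(-log(v) for v in values)

def row_permutation_pvalue(m_by_gene, group_genes, score=sum_log_score,
                           n_perm=1000, seed=0):
    """Fraction of random gene-to-group assignments whose score exceeds
    the observed score of the group."""
    rng = random.Random(seed)
    all_values = list(m_by_gene.values())
    observed = score([m_by_gene[g] for g in group_genes])
    k = len(group_genes)
    exceed = 0
    for _ in range(n_perm):
        # Drawing k values without replacement is what a random permutation
        # of the rows assigns to a fixed group of k genes.
        if score(rng.sample(all_values, k)) > observed:
            exceed += 1
    return exceed / n_perm

# Toy data: 100 genes, the first five with very small values.
m_by_gene = {f"gene{i}": (1e-4 if i < 5 else (i + 1) / 101) for i in range(100)}
p_low = row_permutation_pvalue(m_by_gene, [f"gene{i}" for i in range(5)])
p_high = row_permutation_pvalue(m_by_gene, [f"gene{i}" for i in range(95, 100)])
```

The group holding the five smallest values is almost never beaten by a random draw, whereas the group holding the five largest values is beaten by nearly every draw.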

Checking H0 by permutation
• 1000 random permutations => background distributions
• H0: distribution of significant genes: randomizing GO-groups (rows)
• H0: distribution of all p-values: randomizing GO-groups (rows)
• H0: level of p-values: permutation of columns
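The column permutation shuffles the sample labels and recomputes the group score each time. In the sketch below the score is a simple stand-in (the sum of absolute differences of group means), NOT one of the scores from the slides; it keeps the example short, and a function that recomputes t-tests and p-values could be passed as `score` instead. All names and data are hypothetical.

```python
import random
from statistics import mean

def mean_difference_score(expression, labels, group_genes):
    """Stand-in score: sum over the group's genes of |mean(type 1) - mean(type 2)|.
    This replaces the p-value based scores purely for brevity."""
    total = 0.0
    for gene in group_genes:
        row = expression[gene]
        type1 = [v for v, lab in zip(row, labels) if lab == 1]
        type2 = [v for v, lab in zip(row, labels) if lab == 2]
        total += abs(mean(type1) - mean(type2))
    return total

def column_permutation_pvalue(expression, labels, group_genes,
                              score=mean_difference_score,
                              n_perm=1000, seed=0):
    """Fraction of label permutations whose score exceeds the observed score."""
    rng = random.Random(seed)
    observed = score(expression, labels, group_genes)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign tissue types to the columns
        if score(expression, shuffled, group_genes) > observed:
            exceed += 1
    return exceed / n_perm

# Made-up data: both genes separate the two tissue types perfectly.
expression = {
    "gene1": [0, 0, 0, 0, 0, 10, 10, 10, 10, 10],
    "gene2": [1, 1, 1, 1, 1, 9, 9, 9, 9, 9],
}
labels = [1] * 5 + [2] * 5
p_col = column_permutation_pvalue(expression, labels, ["gene1", "gene2"])
```

Because the true labels already give the largest possible separation in this toy data, no relabelling exceeds the observed score.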

Methods (summary)
• Data => p-values
• Scores: number of significant genes; sum of –log P; sup|P(n)-F(n)|
• Check against 1000 permutations of rows (GO-groups)
• Check against 1000 permutations of columns (samples => level of p-values)

Results: Data (Breast Cancer)
• Two major subclasses: estrogen receptor positive (ER+) and estrogen receptor negative (ER-)
• Estrogen receptor positive: susceptible to Tamoxifen; slightly better survival rate
• Great molecular differences between the two types

Results: Data (Breast Cancer)
• Data: 25 ER+, 24 ER-
• Array: Affymetrix HuGeneFL
• ~ 7000 genes
• ~ 4000 annotated to GO-terms
• Data were normalized by variance stabilization (Heydebreck et al. 2001)

Results: Pre-conditions
• A GO-group is considered significant if fewer than 5% of the random permutations exceed its score
• Only GO-groups with more than 5 and fewer than 1000 genes were taken into account
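Both pre-conditions can be written down directly. This is a minimal sketch; the function names and the toy values are hypothetical.

```python
def eligible_groups(groups, min_size=5, max_size=1000):
    """Keep only GO-groups with more than `min_size` and fewer than `max_size` genes."""
    return {term: genes for term, genes in groups.items()
            if min_size < len(genes) < max_size}

def is_significant(observed, background_scores, level=0.05):
    """True if fewer than `level` of the permutation scores exceed the observed score."""
    exceed = sum(1 for s in background_scores if s > observed)
    return exceed / len(background_scores) < level

# Toy example: 1000 background scores, 30 of which exceed the observed score.
background = [1.0] * 970 + [9.0] * 30
kept = eligible_groups({"GO:a": set(range(5)), "GO:b": set(range(6))})
```

With 30 of 1000 background scores above an observed score of 5.0, the fraction is 3% and the group counts as significant. A group of exactly 5 genes is dropped because the size limits are strict.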

Results: Number of significant genes
According to the pre-conditions, 16 GO-groups were found

Results: Permutation of rows (distribution hypothesis)
[Figure panels: Sum of –log P; Kolmogorov-Smirnov]

Results: Permutation of columns (differential gene-expression hypothesis)
[Figure panels: Sum of –log P; Kolmogorov-Smirnov]

Results
• The column permutation leads to a very low background distribution
• Many "significant" GO-groups
• May help to find functional groups without differential gene expression
• Different scoring methods seem to be complementary, as indicated by the results of the row permutation

Results: Permutation of the rows
• Sum of log: 44 GO-groups were found (5% cond., ...)
• KS-score: 77 GO-groups were found (5% cond., ...)
• GO:0000087 M-Phase of mitotic cell-cycle (37 genes)

Results: Comparing the scoring-methods (from the row-permutation)
• A: counting of significant genes in GO-groups
• B: Kolmogorov-Smirnov
• C: sum of logarithms
[Venn diagram: A: 16; B: 77; C: 43; A and B: 3; A and C: 13; C and B: 13; A, B and C: 3; C without A: 30; B without A: 74]
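The counts in the diagram can be cross-checked with a few lines of arithmetic. The differences reproduce the values given on the slide; the size of the union is derived here by inclusion-exclusion and is not stated in the slides.

```python
# Group counts per scoring method, as given in the Venn diagram.
a, b, c = 16, 77, 43
a_and_b, a_and_c, b_and_c, a_and_b_and_c = 3, 13, 13, 3

c_without_a = c - a_and_c  # groups found by C but not by A
b_without_a = b - a_and_b  # groups found by B but not by A

# Inclusion-exclusion: groups found by at least one method (derived value).
found_by_any = a + b + c - a_and_b - a_and_c - b_and_c + a_and_b_and_c
```

The value C = 43 is the one consistent with "C without A: 30", while the preceding slide reports 44 groups for the sum of log score.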

Browsing the results

Results: Interesting GO-term (M-Phase)
• Contains a couple of interesting proliferative genes (p-value ~5*10^-4 => "not significant")
• E.g.: polo-like kinase
• t-value: -3.45; p-value: 5.59*10^-4
• Would not have been found by a single-gene approach
• Correlation with the estrogen receptor could be found in the literature (Wolf et al., 2000)

Summary / outlook
• GO provides a general view on large-scale gene-expression data
• Less deregulated but very interesting genes could be found
• Third null hypothesis => differential gene expression over a wide range of genes (outlook: which GO-groups contain no differential gene expression)
• No bias of scores by top-level genes (outlook: leaving out top-level genes for scoring)
• Possible modification of scoring methods: up- and downregulation
