340 likes | 482 Views
Microarrays and Cancer Segal et al. . CS 466 Saurabh Sinha. Genomics and pathology. Genomics provides high-throughput measurements of molecular mechanisms Microarrays, ChIP-on-chip, etc. Genomics may provide the molecular underpinnings of pathology, in a highly comprehensive manner
E N D
Microarrays and CancerSegal et al. CS 466 Saurabh Sinha
Genomics and pathology • Genomics provides high-throughput measurements of molecular mechanisms • Microarrays, ChIP-on-chip, etc. • Genomics may provide the molecular underpinnings of pathology, in a highly comprehensive manner • Revolutionize the diagnosis and management of diseases, including cancer
Prior applications to cancer • Gene expression measurements have been applied to cancer diagnosis • Measure each gene’s expression in several normal tissue samples, and several pathological (diseased) samples • Find subset of genes differentially expressed in the two sample groups • If such “gene signatures” of particular cancer types are found, they can become the basis of tests for malignancy
We want better … • Genes may be differentially expressed, but not enough to cross certain thresholds used in the analysis • Analyzing the data on a gene-by-gene basis is error prone -- microarray data has inherent noise • Finding the genes involved in one type of cancer is only the first step; it does not reveal the underlying processes
A “module” level view • Many methods use “gene modules” (sets of genes) as basic blocks for analysis • Instead of trying to find changes in individual gene expression profiles, look out for entire sets of genes with changing expression profiles
The study of Mootha et al. • Showed that expression of “oxidative phosphorylation” genes (a particular set of genes) is reduced in diabetic muscle • Signal not very strong when looking at individual genes, but highly significant when looking at the “gene module”
Source: Nature Genetics 37, S38 - S45 (2005) Disease tissue (Diabetes mellitus type 2) Grey: all genes Red: oxidative phosphorylation genes Normal tissue (Normal tolerance to glucose)
Segal et al.: Methodology • Compile a large collection of cancer-related microarrays • microarrays measuring gene expression in cancer tissues or normal tissue • Compile a large collection of gene sets (modules) from earlier studies • Identify gene set (modules) induced or repressed in a microarray • Identify modules induced in several arrays, or repressed in several arrays • Check if these arrays are enriched in some clinical annotation
Identify gene set (modules) induced or repressed in a microarray • Given expression value Eg,mof each gene g in the microarray experiment m • Compute average expression Egof the gene g over all microarrays • If Eg,mis 2-fold greater than Eg, call the gene g as induced in array m • Categorize each gene as being induced or not-induced in the array. Source: Nature Genetics 36, 1090 - 1098 (2004)
Identify gene set (modules) induced or repressed in a microarray All genes • |All genes| = N • |Module| = n • |Induced| = m • |Intersection| = k • Hypergeometric test(N,n,m,k): • If a set of m genes was chosen at random (sampling w/o replacement), what is the probability that the intersection would be larger than or equal to k? Induced Module Intersection
Identify gene set (modules) induced or repressed in a microarray All genes • |All genes| = N • |Module| = n • |Induced| = m • |Intersection| = k • Hypergeometric test(N,n,m,k): • Sum over i>=k: If a set of m genes was chosen at random (sampling w/o replacement), what is the probability that the intersection would be equal to i? Induced Module Intersection
“p-value” of the Hypergeometric test Identify gene set (modules) induced or repressed in a microarray All genes • |All genes| = N • |Module| = n • |Induced| = m • |Intersection| = k • Hypergeometric test(N,n,m,k): Induced Module Intersection
Identify gene set (modules) induced or repressed in a microarray All genes • |All genes| = N • |Module| = n • |Induced| = m • |Intersection| = k • Hypergeometric test(N,n,m,k) • If the “p-value” is very small, then we infer that the intersection is “statistically significant”, i.e., the module is induced in the microarray • Similarly define module repressed in microarray Induced Module Intersection
Segal et al.: Methodology • Compile a large collection of cancer-related microarrays • microarrays measuring gene expression in cancer tissues or normal tissue • Compile a large collection of gene sets (modules) from earlier studies • Identify gene set (modules) induced or repressed in a microarray • Identify modules induced in several arrays, or repressed in several arrays • Check if these arrays are enriched in some clinical annotation
Segal et al.: Methodology • Compile a large collection of cancer-related microarrays • microarrays measuring gene expression in cancer tissues or normal tissue • Compile a large collection of gene sets (modules) from earlier studies • Identify gene set (modules) induced or repressed in a microarray • Identify modules induced in several arrays, or repressed in several arrays • Check if these arrays are enriched in some clinical annotation
Source: Nature Genetics 36, 1090 - 1098 (2004) Identify modules induced in several arrays, or repressed in several arrays
Segal et al.: Methodology • Compile a large collection of cancer-related microarrays • microarrays measuring gene expression in cancer tissues or normal tissue • Compile a large collection of gene sets (modules) from earlier studies • Identify gene set (modules) induced or repressed in a microarray • Identify modules induced in several arrays, or repressed in several arrays • Check if these arrays are enriched in some clinical annotation
Source: Nature Genetics 36, 1090 - 1098 (2004) Check if these arrays are enriched in some clinical annotation
Segal et al: Cancer “module maps” Red(m,c): Microarrays in which module m was overexpressed (induced) are enriched in condition c Green: Microarrays in which module m was underexpressed (repressed) are enriched in condition c Rows and columns are not in an arbitrary order. They have been “clustered” to display similar rows (or columns) together Source: Nature Genetics 37, S38 - S45 (2005)
Insights from cancer module map • Some modules activated or repressed across many tumor types. Such modules could be related to general tumorogenic processes • Some modules specifically activated or repressed in certain tumor types or stages of tumor progression
From modules to regulation • A module map shows the transcriptional changes underlying cancer • Transcriptional changes are a result of transcription factors and their binding sites • A deeper understanding of cancer would come from finding out which transcription factors and binding sites led to the transcriptional changes
Genomics and gene regulation • Such knowledge comes from genomics data • ChIP-chip studies identify which transcription factors bind which DNA sequences • Analysis of DNA sequence, using known binding site motifs, gives us putative binding sites • Cross-species conservation also tells us something about possible locations of binding sites
Cis-regulatory analysis • Identify a set of genes whose promoters contain the same binding sites • Such a set of genes is likely to be regulated by the same TF • Often called a “regulatory module” • Earlier studies mined microarrays for “co-expressed” genes, then used motif finding algorithms to discover their shared binding sites
Cis-regulatory analysis • Another approach (Segal et al. 2003) tried to solve the problem in an integrated manner • Find a set of genes such that • their expression profiles are similar (microarrays) • they share the same binding sites (sequence) • Joint learning of “regulatory module” from two very different types of data: microarray and sequence • An important theme in current bioinformatics
Cis-regulatory analysis • Connection between gene expression and cis-regulatory elements (binding sites) also explored in Beer & Tavazoie. • Found rules on combinations and locations of binding sites that would cause the gene to be over- or under-expressed
Source: Nature Genetics 37, S38 - S45 (2005) • The binding sites “RRPE” and “PAC” must occur within • 240 bp and 140 bp of gene start • Genes containing both motifs, following certain rules on location, are tightly co-regulated • Genes containing any one motif, or both in incorrect positional configuration, have close to random expression
Eukaryotes • These studies have mostly focused on yeast (which is a eukaryote, but has a small, compact genome) • Not much work of this type in the longer, more complex genomes of metazoans (e.g., humans, rodents, fruitflies) • The genome is not compact; may not suffice to look at sequence right next to a gene. Intergenic regions are long, and cis-regulatory signals may not be close to gene
One study in humans • HeLa cells are an “immortal” cell-line derived from cervical cancer cells in a person who died in 1951. • Used extensively in studying cancer • Method of Segal et al. (joint learning of regulatory modules from gene expression and sequence data) applied to these cells
One study in humans • Gene expression data used: microarrays measuring genes during cell cycle in HeLa cells • Sequence: 1000 bp promoters (upstream) of human genes
Source: Nature Genetics 37, S38 - S45 (2005) Result of analysis: Two motifs found to be shared by this set of genes. The genes have similar expression profiles. One of the identified motifs (NFAT) known to be involved in cell-cycle
Summary • The common theme is to analyze sets of genes, and relate their common expression patterns to cancer types or to presence of cis-regulatory motifs • Search algorithms may be required to identify some of these features