Microarrays and Promoter Analysis. Gregory Gonye Topic 5 ELEG-667. Overview. Opportunity: general cDNA data, ESTs Genomic data Microarrays nuts and bolts analysis Opportunity: specific Yeast example (Church et al.) Statement of Work: Project Review. Opportunity.
Microarrays and Promoter Analysis Gregory Gonye Topic 5 ELEG-667
Overview • Opportunity: general • cDNA data, ESTs • Genomic data • Microarrays • nuts and bolts • analysis • Opportunity: specific • Yeast example (Church et al.) • Statement of Work: Project Review
Opportunity • Large scale EST projects for many organisms • Genome sequencing projects for many organisms • Highly parallel technologies for gene expression measurement • Sophisticated analysis for associating genes by expression • Full length cDNA sequence data
Opportunity • Large scale EST projects for many organisms • Sequence data for mRNAs and some protein • Clones corresponding to mRNAs (cDNA clones) for reagents • Expression data (significance with scale) • Homologies between species
Opportunity • Genome sequencing projects for many organisms • genomic DNA data contains all the information • mRNA • transcriptional control elements: promoters and enhancers • organization • Comparative genomics • Conservation=Function
Opportunity • Highly parallel technologies for gene expression measurement • Sophisticated analysis for associating genes by expression • exploits EST projects for data and reagents • produces associations/hypotheses to be tested • uses and produces annotation
Opportunity • Large scale EST projects for many organisms • Genome sequencing projects for many organisms • Highly parallel technologies for gene expression measurement • Sophisticated analysis for associating genes by expression
Opportunity • Full length cDNA sequence data • Promoter-proximal mRNA sequence • “real” protein data (full length ORFs) • reagents for expression • data for converting mRNA to gene
Opportunity • Large scale EST projects for many organisms • Genome sequencing projects for many organisms • Highly parallel technologies for gene expression measurement • Sophisticated analysis for associating genes by expression • Full length cDNA sequence data
Opportunity • Whole greater than sum of parts: • [mRNA data, EST and full length] X [genomic data] = [Genes] • [Genes] X [genomic data] = [Promoters] • [cDNA]microarray X [mRNA] = [Genes]assoc • [Genes]assoc X [Promoters] = Functional data
Opportunity • In plain English: The cDNA data is used to identify and locate genes in the genomic data. Using the gene location and gene structure, we predict the promoter region for each gene. A subset of these genes is then associated by a microarray experiment. Analysis of the promoters of the associated genes, within and across species, can lead to knowledge of how these genes are regulated.
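The promoter-prediction step described above can be sketched in a few lines. This is a minimal illustration, not the method used in the course: the function names, the 0-based coordinate convention, and the idea of taking a fixed-length window upstream of the transcription start are all assumptions made for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def upstream_region(genome, tss, strand, length):
    """Return up to `length` bases upstream of a transcription start site.

    `tss` is a 0-based index on the forward strand (assumed convention).
    For a minus-strand gene, "upstream" lies to the right of `tss` on the
    forward strand, so that window is reverse-complemented before returning.
    """
    if strand == "+":
        start = max(0, tss - length)
        return genome[start:tss]
    end = min(len(genome), tss + 1 + length)
    return reverse_complement(genome[tss + 1:end])


# Toy example: the same 3-base window seen from each strand.
genome = "AAACCCGGGTTT"
plus_promoter = upstream_region(genome, 6, "+", 3)
minus_promoter = upstream_region(genome, 5, "-", 3)
```

In practice the transcription start would come from aligning 5' cDNA sequence to the genome, which is why the full length cDNA data matters for this step.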
Overview • Opportunity: general • cDNA data, ESTs • Genomic data • Microarrays • nuts and bolts • analysis • Opportunity: specific • Yeast example (Church et al.) • Statement of Work: Project Review
Molecular Biology 101:Genome (DNA) to Genes to mRNA Required to understand relationships of data sets Three large scale efforts: • Expressed Sequence Tag • Full length cDNA • Genomic DNA
Biological Information Flow = Central Dogma • DNA: ATGACTGCTTTT (coding strand), TACTGACGAAAA (complementary template strand) • transcription >> RNA: AUGACUGCUUUU • splicing (higher organisms) • translation >> Protein: Met-Thr-Ala-Phe
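The slide's example can be reproduced directly. The sketch below is only an illustration: the codon table is deliberately partial (just the four codons on the slide), and the function names are made up for this example.

```python
# Partial codon table: only the codons that appear in the slide's example.
CODON_TABLE = {"AUG": "Met", "ACU": "Thr", "GCU": "Ala", "UUU": "Phe"}


def complement(dna):
    """Base-by-base complement, written in the same left-to-right order."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in dna.upper())


def transcribe(coding_strand):
    """mRNA has the coding-strand sequence with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases at a time from position 0."""
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        residues.append(CODON_TABLE.get(codon, "???"))  # unknown codon marker
    return "-".join(residues)


coding = "ATGACTGCTTTT"
template = complement(coding)
mrna = transcribe(coding)
protein = translate(mrna)
```

A real translation routine would use the full 64-codon table and stop at a stop codon; both are omitted here to keep the example tied to the slide.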
Exons and Introns in Eukaryotes • DNA: exon 1, intron 1, exon 2, intron 2 • Primary Transcript (exons and introns) • Mature messenger RNA (exons only)
cDNA and ESTs mRNA is converted to a DNA copy = complementary DNA, cDNA
cDNA Synthesis • mRNA with 3’ poly(A) tail (AAAAAn) annealed to TTTTTn primer • Reverse Transcriptase, dNTPs and primer >> cDNA first strand • RNAse H, DNAP, dNTPs >> Second Strand
cDNA and ESTs • mRNA is converted to a DNA copy • = complementary DNA, cDNA • cDNA is directionally inserted into a vector (plasmid) DNA • 1. Clonal propagation/amplification in E. coli • 2. Addition of known sequence flanking unknown cDNA
cDNA and ESTs • mRNA is converted to a DNA copy • = complementary DNA, cDNA • cDNA is directionally inserted into a vector (plasmid) DNA • 1. Clonal propagation/amplification in E. coli • 2. Addition of known sequence flanking unknown cDNA • EST is obtained from cDNA insert (~400-800 bases) using known vector sequence (universal) as priming site • The EST is sequence data. The EST clone is the reagent used to obtain that data. The EST covers only part of the cDNA in the EST clone.
EST data resources • dbEST and UniGene • http://ncbi.nlm.nih.gov/UniGene • TIGR Gene Indexes • http://www.tigr.org/tdb/tgi.shtml Objectives: • Cluster sequences to identify individual mRNAs (~1 cluster=1 mRNA) • Annotate clusters • Distribute clones
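The clustering objective (~1 cluster = 1 mRNA) can be illustrated with a toy example. This is not how UniGene or the TIGR Gene Indexes are built; it is a minimal sketch that assumes two ESTs belong together if they share any exact k-mer, and it ignores sequencing errors and reverse-strand reads. The function names and the default `k` are invented for the illustration.

```python
def kmers(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8):
    """Single-linkage clustering of ESTs that share at least one k-mer.

    Returns a sorted list of clusters, each a list of EST indices.
    """
    parent = list(range(len(ests)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first EST that contained it
    for idx, seq in enumerate(ests):
        for kmer in kmers(seq, k):
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# Two overlapping reads from one transcript plus one unrelated read.
example = ["ACGTACGTGGA", "GTACGTGGATTC", "TTTTTTTTCCCC"]
groups = cluster_ests(example)
```

Real EST clustering uses alignment with tolerance for mismatches; exact k-mer sharing is used here only because it keeps the example short.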
Molecular Biology 101:Genome (DNA) to Genes to mRNA Required to understand relationships of data sets Three large scale efforts: • Expressed Sequence Tag • Full length cDNA • Genomic DNA
Full length cDNA • Why?: • complete protein coding information • complete exon information (=gene when combined with genomic data) • 5’ end directs search for promoter elements
Full length cDNA • Why?: • complete protein coding information • complete exon information (=gene when combined with genomic data) • 5’ end directs search for promoter elements • Who?: • Y. Hayashizaki, RIKEN group Yokohama, Japan http://genome.rtc.riken.go.jp/
Full Length cDNA: How? • Issues: • Need copying to be complete, no partial products • Need to differentiate full length mRNAs from degraded mRNAs • Cloning requirements remain the same: • directional, universal vector, high efficiency • Need to redirect sequencing to full insert, not single run from one end
Full Length cDNA: How? • Issues: • Need copying to be complete, no partial products • Problem is secondary structure of mRNAs and lack of processivity of Reverse Transcriptase • Solution: Thermo-stabilize RTase using trehalose and run the RT reaction at high temperature, destabilizing secondary structure and allowing longer elongation products
Full Length cDNA: How? • Issues: • Need to differentiate full length mRNAs from degraded mRNAs • Step One: Cap-specific chemical biotinylation, then RTase+trehalose • Full length mRNA (7meG cap) becomes biotin-labeled (Bio-G) • Degraded mRNA (P, no cap) is not labeled
Full Length cDNA: How? • Issues: • Need to differentiate full length mRNAs from degraded mRNAs • Step Two: Capture with SA (streptavidin) • Full length (Bio-G) binds SA and is purified • Degraded is not bound
Full length cDNA • Results: • FANTOM Consortium: Functional Annotation of Mouse • ~21,000 nonredundant cDNAs • Full length clones used for protein expression • Large scale protein-protein interaction matrix • 5’ mRNA sequence available on large scale
Molecular Biology 101:Genome (DNA) to Genes to mRNA Required to understand relationships of data sets Three large scale efforts: • Expressed Sequence Tag • Full length cDNA • Genomic DNA
Genomic DNA Sequencing • Genomic DNA Structure: • Genes: Promoters, Enhancers, Exons, Introns • Intergenic: Structural DNA, repeats, telomeres • Chromosomes: Varied size and number intron1 exon 1 exon 2 intron 2 Promoter
Genomic DNA Sequencing • Mammalian genomes about 3 billion bases • Genomes broken into chunks of 100-500kb • Bacterial Artificial Chromosomes, BAC libraries • BAC inserts ordered by end sequencing and cross hybridization to generate nonredundant “Golden path” • Sequencing effort distributed/coordinated internationally
Genomic DNA Sequencing • Two complementary approaches: • Walking: • Subclone pieces of BAC inserts • start from both ends of sub-BAC inserts and sequence inward • from sequence generated design next set of sequencing primers, rerun, redesign,… • Shotgun: • Generate random fragments and size-select ~2000bp • Sequence from both ends • Assemble sequences to contigs, assemble contigs
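The shotgun "assemble sequences to contigs" step can be sketched with a greedy overlap merge. This is a toy illustration under strong assumptions: reads are error-free, all come from the same strand, and no read lies wholly inside another. Production assemblers do far more; the function names and `min_len` default are chosen for the example only.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three overlapping reads drawn from the sequence ATGACTGCTTTT.
reads = ["ATGACTG", "GACTGCT", "TGCTTTT"]
contigs = greedy_assemble(reads)
```

The "assemble contigs" step on the slide corresponds to the later rounds of the loop, where already-merged contigs are joined to each other.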
Genomic DNA Sequencing • Resources: • Trace Archives (NCBI): Individual unassembled shotgun sequence data • Genomic section of GenBank (NCBI): Contigs, BAC end sequences, BAC assemblies • Whitehead Institute: Assemblies • Ensembl: Annotation of assembled genomes
Convergence of Data Three large scale efforts: • Expressed Sequence Tag>>partial Exons • Full length cDNA>>5’ Exons • Genomic DNA>>Gene Predictions Annotated Genome
Overview • Opportunity: general • cDNA data, ESTs • Genomic data • Microarrays • nuts and bolts • analysis • Opportunity: specific • Yeast example (Church et al.) • Statement of Work: Project Review
Highly Parallel Gene Expression Analysis: cDNA Microarrays • Molecular reagents produced from EST sequencing projects • EST= Expressed Sequence Tag • Clustering of ESTs identifies mRNA diversity • Nonredundant sets of reagents available (>40,000 for human) • Technology to use reagents in parallel: microarrays • cDNA or oligonucleotide microarrays
Microarray-based Gene Expression Analysis • Manufactured at high density with robotics • 1x3” glass slide common format • “printing” or “in situ synthesis” • 20-30,000 spots per slide • mRNA is converted to labeled cDNA for fluorescent hybridization analysis • Ratio-metric approach compares expression between samples or to a reference sample by cohybridization: “fold-change”
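The ratio-metric "fold-change" idea can be shown with a small calculation. This is a sketch only: the intensities are made-up numbers, the gene names are placeholders, and the `floor` parameter (used to avoid dividing by very small or zero intensities) is an assumption rather than part of any particular analysis package.

```python
import math


def log2_ratios(sample, reference, floor=1.0):
    """Per-gene log2(sample / reference) from two channel intensities.

    `sample` and `reference` map gene name -> background-corrected intensity.
    Intensities below `floor` are raised to `floor` before taking the ratio.
    """
    ratios = {}
    for gene, s in sample.items():
        r = reference[gene]
        ratios[gene] = math.log2(max(s, floor) / max(r, floor))
    return ratios


# Hypothetical intensities for three spots on one cohybridized slide.
sample = {"gene1": 4000.0, "gene2": 500.0, "gene3": 1000.0}
reference = {"gene1": 1000.0, "gene2": 2000.0, "gene3": 1000.0}
ratios = log2_ratios(sample, reference)
```

Working in log2 makes up- and down-regulation symmetric: a 4-fold increase and a 4-fold decrease become +2 and -2.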
Schematic Diagram: Microarray-based Analysis • EST project >> cDNA clones >> PCR >> printing >> cDNA microarray • Tissue1 >> RNA1 >> labeled cDNA1 • Tissue2 >> RNA2 >> labeled cDNA2 • Cohybridize to microarray • Scan microarray to detect each fluorophore (16 bit grayscale images) • Identify signal pixels (spot finding) • Quantitate pixel intensity >> ratios
[Figure: example microarray images; panel labels: Control, Ethanol, Control Mix]
Advanced Analyses • Clustering • What: • genes • experiments • experiments and genes • Why: • Classification: how many “types” of samples are there? can type be predicted? Which genes are best predictors? • Coregulation: which genes respond alike? Pathway(s) implicated? Epistasis? Regulatory network prediction
Overview • Opportunity: general • cDNA data, ESTs • Genomic data • Microarrays • nuts and bolts • analysis • Opportunity: specific • Yeast example (Church et al.) • Statement of Work: Project Review
Post-Clustering Analysis • Why are members of a cluster clustering together? • Functional (Pathway): all ribosomal proteins, DNA synthesis machinery, lysine biosynthesis • Serial regulation: cascades • Transcriptional coregulation via common regulatory elements: conserved promoters
What is a promoter? • Cis-acting: physically associated with the gene • Directional: defines transcription initiation site and coding strand • with TATA box fairly homogeneous • without TATA box less stringent • Core elements recruit pol II (but inactive) • Regulatory elements are binding sites for Transcription Factors (+/- active pol II complex) • Sequence-specificity determines P(occupied)
Convergence at Harvard • Yeast: • Large genomic-scale gene expression data set • complete annotated genome • NO introns, minimal intergenic sequence • Predicted promoters for every ORF • Church et al.: Combined microarray data, clustering, and promoter informatics to identify conserved, cluster enriched, regulatory domains
Tavazoie et al. • Whole genome expression data (Affy) • Synchronized culture • 15 time points across two cell cycles • K-means clustering (Euclidean distance) of 3000 highest variance ORFs to 30 clusters • Obtained 600bp promoter sequence from genome sequence for each ORF • Used AlignACE (Gibbs sampling algorithm) to discover motifs
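The clustering step named on this slide, K-means with Euclidean distance, can be sketched in plain Python. This is not the authors' implementation: the initialization (first k profiles as starting centroids), the iteration cap, and the toy data are assumptions chosen to keep the example deterministic and short.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(profiles, k, max_iter=100):
    """Plain K-means. Returns (labels, centroids).

    Starting centroids are simply the first k profiles (assumed here for
    reproducibility; real analyses usually use random or seeded starts).
    """
    centroids = [list(p) for p in profiles[:k]]
    labels = None
    for _step in range(max_iter):
        # Assignment step: each profile goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments stable, so converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Four toy ORFs measured at four time points; two obvious groups.
profiles = [[0, 1, 2, 1], [5, 4, 5, 4], [0, 1, 2, 2], [5, 5, 5, 4]]
labels, centroids = kmeans(profiles, k=2)
```

In the study described on the slide the input would be 3000 ORFs by 15 time points with k = 30; the promoter sequences of each resulting cluster are then passed to motif discovery.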
Livesey et al.: extend to mouse • Crx knockout mouse vs. wt • cone and rod differentiation in retina • Microarray analysis • Crx+/+ vs Crx-/- retina RNA • 16 genes out of 960 tested • Promoter analysis • proximal 250bp of genes with available promoters (5) • used AlignACE to detect motifs • found single or tandem CBE elements in all
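Once a motif has been found, checking each promoter for single or tandem copies amounts to a scan on both strands. The sketch below is illustrative only: the motif `TAATC` and the promoter string are placeholders invented for the example, not the CBE consensus or real sequence, and the function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def revcomp(seq):
    """Reverse complement of an A/C/G/T string."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


def find_motif(promoter, motif):
    """Return sorted (position, strand) hits of an IUPAC motif.

    Positions are 0-based left ends on the forward strand. A lookahead is
    used so that overlapping occurrences are all reported.
    """
    pattern = re.compile("(?=(" + "".join(IUPAC[b] for b in motif) + "))")
    hits = [(m.start(), "+") for m in pattern.finditer(promoter)]
    n, width = len(promoter), len(motif)
    for m in pattern.finditer(revcomp(promoter)):
        hits.append((n - m.start() - width, "-"))
    return sorted(hits)


# Placeholder promoter with one forward and one reverse-strand copy.
promoter = "GGTAATCGGATTAGG"
hits = find_motif(promoter, "TAATC")
```

Two hits close together in the same promoter would correspond to the "tandem" case on the slide; a single hit to the "single" case.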