BIONF/BENG 203: Functional Genomics

BIONF/BENG 203: Functional Genomics Lecture TI 1 Trey Ideker UCSD Department of Bioengineering Sources of Functional Data Lectures 1 and 2

Grading • 40% Problem Sets (best 4 of 5) • 30% Midterm • 30% Final Project

Outline of the course Biological data sources (2) Data pre-processing (6) Total of 17 lectures Project Presentations (2)

Functional Genomics Data • Expression mRNA, protein • Molecular interactions Protein, mRNA, small molecules • Knockout phenotypes 1st, 2nd, higher orders • SNP sequence (polymorphism) data • Imaging data Sub-cellular localization Cell morphology • Gene ontology

Dividing the data into two classes of information: Biological Networks and Network States
1) Biological networks: directly observe the network “wires” themselves • Protein-protein interactions: two-hybrid system, coIP, protein antibody arrays (databases: BIND, DIP) • Protein-DNA interactions: chromatin IP (databases: BIND, Transfac, SCPD) • Other types not yet possible: e.g., protein-small molecule
2) Network states: observe molecular states that result from the interaction wiring • DNA/RNA gene expression: DNA microarrays, SAGE • Protein levels, locations, and modifications: mass spectrometry, fluorescence microscopy, protein arrays • Gross phenotypes: e.g., growth rates of single and double deletion strains

High-throughput methods for measuring cellular states • Gene expression levels: RT-PCR, arrays • Protein levels, modifications: mass spec • Protein locations: fluorescent tagging • Metabolite levels: NMR and mass spec • Systematic phenotyping

The transcriptome and proteome • The transcriptome is the full complement of RNA molecules produced by a genome • The proteome is the full complement of proteins enabled by the transcriptome • DNA → RNA → protein • Genome → transcriptome → proteome • 30,000 genes → ??? RNAs → ??? proteins? • For example, the Drosophila gene Dscam can generate roughly 40,000 distinct transcripts through alternative splicing. • What is the minimum number of exons that would be required?
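One way to approach the exon question is a simple counting argument. The sketch below assumes the simplest possible model, in which each alternative exon is independently included or skipped, so n such exons give 2^n transcripts. That model is an assumption made here for illustration; Dscam itself uses clusters of mutually exclusive exons, which need more exons to reach the same count. The function name is hypothetical.

```python
def min_cassette_exons(n_transcripts: int) -> int:
    """Smallest n such that 2**n >= n_transcripts.

    Assumes every alternative exon is independently included or skipped.
    """
    # (k - 1).bit_length() is an exact integer version of ceil(log2(k)).
    return (n_transcripts - 1).bit_length()


n = min_cassette_exons(40_000)
print(n, 2 ** (n - 1), 2 ** n)
```

Under this model 15 exons fall short (32,768 combinations) and 16 are enough (65,536 combinations).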

Expression: High-throughput approaches RNA • DNA Microarrays • cDNA / EST sequencing • RT-PCR • Differential display • SAGE • Massively parallel signature sequencing (MPSS) Proteins • 2D PAGE • Mass spectrometry

Gene expression arrays They are really, really, really, really, really, really, really, really, really, really, really, really, really important

Microarrays • Monitor the expression level of each gene: • Is it turned on or off in a particular biological condition? • Is this on/off state different between two biological conditions? • A microarray is a rectangular grid of spots printed on a glass microscope slide, where each spot contains DNA for a different gene

Two-color DNA microarray design Reverse Transcription

cDNA-chip of brain glioblastoma

Types of microarrays • Spotted (cDNA) • Robotic transfer of cDNA clones or PCR products • Spotting on nylon membranes or glass slides coated with poly-lysine • Synthetic (oligo) • Direct oligo synthesis on solid microarray substrate • Uses photolithography (Affymetrix) or ink-jet printing (Agilent) • All configurations assume the DNA on the array is in excess of the hybridized sample; thus the kinetics are linear and the spot intensity reflects the amount of hybridized sample. • Labeling can be radioactive, fluorescent (one-color), or two-color
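If spot intensity is proportional to the amount of hybridized sample, a two-color spot can be summarized as a log ratio of its two channels. The sketch below is only an illustration of that idea; the function name, the optional background terms, and the example intensities are assumptions, not part of the lecture.

```python
import math


def log2_ratio(red: float, green: float,
               red_background: float = 0.0,
               green_background: float = 0.0) -> float:
    """log2 of background-subtracted red over green intensity for one spot."""
    r = red - red_background
    g = green - green_background
    if r <= 0 or g <= 0:
        raise ValueError("background-subtracted intensities must be positive")
    return math.log2(r / g)


# Hypothetical spot: red channel four times brighter than green.
print(log2_ratio(2000.0, 500.0))
```

A log scale treats a 4-fold increase and a 4-fold decrease symmetrically (+2 and -2), which a raw ratio does not.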

Microarray Spotter

Affymetrix High Density Arrays

Microarrays (continued) Imaging • Radioactive ³²P labeling: autoradiography or phosphorimager • Fluorescent labeling: confocal microscope (invented by Marvin Minsky!!) Feature density • Nylon membrane macroarrays → 100-1000 features • Glass slide spotted array → 5,000 features / cm² • Synthesized arrays → 50,000 features / cm²

Microarray confocal scanner • Collects sharply defined optical sections from which 3D renderings can be created • The key is spatial filtering to eliminate out-of-focus light or glare in specimens whose thickness exceeds the immediate plane of focus. • Two lasers for excitation • Two color scan in less than 10 minutes • High resolution, 10 micron pixel size

cDNA / EST sequencing projects • cDNA = complementary or copy DNA • EST = Expressed Sequence Tag • The microarray could be described as a “closed system” because information about RNAs is limited by the targets available for hybridization. RNAs not represented on the array are not interrogated. • Direct sequencing of cDNAs (yielding ESTs) overcomes this problem by large-scale random sampling of sequences from a whole-cell RNA extract • Statistical counting of distinct sequences provides an estimate of expression level • Conversely, a cDNA library can be normalized to capture rare messages • Requires large-scale sequencing to get statistical significance
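The need for large-scale sequencing follows from simple sampling statistics. As a sketch, assume each sequenced clone is an independent random draw from the RNA pool, so a message that makes up a fraction f of the pool is seen at least once in N reads with probability 1 - (1 - f)^N. The function names and the example fraction are illustrative assumptions.

```python
import math


def prob_detected(fraction: float, n_reads: int) -> float:
    """Probability that a message at the given fraction is sampled at least once."""
    return 1.0 - (1.0 - fraction) ** n_reads


def reads_needed(fraction: float, confidence: float = 0.95) -> int:
    """Smallest number of reads giving at least the requested detection probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))


# Hypothetical rare message: 1 in 100,000 of the sequenced clones.
n = reads_needed(1e-5, confidence=0.95)
print(n, prob_detected(1e-5, n))
```

For a message this rare, the required number of reads comes out at roughly three times the reciprocal of its fraction, which is why rare transcripts are expensive to detect by random sampling.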

cDNA / EST Sequencing: Preparation of a cDNA library in phage λ vector

SAGE Technology: Serial Analysis of Gene Expression • Takes the idea of sequence sampling to the extreme • Generates short ESTs (9-14 nt), which are joined into long concatemers and then sequenced • 4^9 = 262,144 possible 9-nt tags, ~5-fold the number of human genes • The count of each type of tag estimates RNA copy number • >50X more efficient than cDNA sequencing because many RNAs are represented in a single sequencing run
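The tag-space arithmetic can be checked directly. In the sketch below the gene count is a parameter, because the "~5-fold" figure depends on it: it corresponds to an assumed count of roughly 50,000 genes, which is an inference from the arithmetic rather than a number stated on the slide.

```python
def tag_space(tag_length: int) -> int:
    """Number of distinct DNA sequences of the given length (4 bases per position)."""
    return 4 ** tag_length


assumed_gene_count = 50_000  # assumption; chosen to reproduce the ~5-fold figure

for length in range(9, 15):
    space = tag_space(length)
    print(length, space, round(space / assumed_gene_count, 1))
```

Each extra base multiplies the tag space by four, so longer tags reduce the chance that two different genes share a tag.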

Steps to SAGE • Copy mRNA → ds cDNA using a biotinylated oligo(dT) primer • Cleave with the anchoring enzyme (AE), which cuts within ~250 bp of the poly-A tail at the 3’ end • Capture this segment on streptavidin beads • Ligate to linkers containing a type IIs restriction site; the type IIs enzyme cleaves the DNA 14 bp away from this site, releasing a short tag • Ligate the tags to each other (forming ditags) and PCR amplify • Cleave with AE to remove linkers • Concatenate, clone, and sequence

Velculescu et al. Science (1995) WHY DI-TAGS? Ditags are used to detect bias in the PCR amplification step. The probability of any two tags being coupled in the same ditag is small, so biased amplification shows up as many copies of a ditag that always contains the same two tags. [Figure: tags A and B ligated into ditags and amplified with Primer A and Primer B]
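The ditag check can be sketched as a duplicate count. The example below assumes ditags have already been reduced to pairs of tag labels; the function name and the toy data are hypothetical. Because a ditag can be read in either orientation, each pair is sorted before counting.

```python
from collections import Counter


def find_repeated_ditags(ditags):
    """Return ditags seen more than once, mapped to their counts."""
    # Sort each pair so that (A, B) and (B, A) count as the same ditag.
    counts = Counter(tuple(sorted(pair)) for pair in ditags)
    return {pair: n for pair, n in counts.items() if n > 1}


# Hypothetical ditags, each written as a (tag, tag) pair.
observed = [("A", "B"), ("C", "D"), ("B", "A"), ("A", "B"), ("E", "F")]
repeated = find_repeated_ditags(observed)
print(repeated)
```

Repeated ditags are candidates for PCR bias and would typically be collapsed to a single observation before the individual tags are counted.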

SAGE (continued) Example of a concatemer: CATGACCCACGAGCAGGGTACGATGATACATGGAAACCTATGCACCTTGGGTAGCACATG TAG1 TAG2 TAG3 TAG4 Counting the tags:
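Tag counting on the example concatemer can be sketched as follows. The code splits the sequence at the CATG sites that punctuate it, treats each piece as a ditag, and reads one tag from each end, reverse-complementing the second. The tag length of 10 nt is an assumption made for illustration; the slide does not state where the tag boundaries fall. All function names are hypothetical.

```python
from collections import Counter

ANCHOR = "CATG"  # site separating ditags in the example sequence
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def split_ditags(concatemer: str, anchor: str = ANCHOR) -> list:
    """Split a concatemer at anchor sites, dropping empty end pieces."""
    return [piece for piece in concatemer.split(anchor) if piece]


def tags_from_ditag(ditag: str, tag_len: int) -> tuple:
    """Left tag as read; right tag read from the opposite strand."""
    return ditag[:tag_len], reverse_complement(ditag[-tag_len:])


concatemer = "CATGACCCACGAGCAGGGTACGATGATACATGGAAACCTATGCACCTTGGGTAGCACATG"
ditags = split_ditags(concatemer)
tag_counts = Counter(
    tag for ditag in ditags for tag in tags_from_ditag(ditag, tag_len=10)
)
for tag, count in tag_counts.items():
    print(tag, count)
```

With a real library the same counter would be accumulated over thousands of concatemers, and each tag's count would serve as the estimate of its transcript's abundance.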

Proteomics SDS PAGE 2D PAGE MS/MS

An example SDS-PAGE How many proteins are in a band? Protein stains: Silver Copper Coomassie Blue

2D-PAGE • Dimension 1: isoelectric focusing gel • Dimension 2: size

2D gel from macrophage phagosomes

Mass spectrometry • Mass spectrometers consist of three essential parts • Ionization source: converts peptides into gas-phase ions (MALDI, ESI) • Mass analyzer: separates ions by mass-to-charge (m/z) ratio (ion trap, time of flight, quadrupole) • Ion detector: current over time indicates the amount of signal at each m/z value
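Because the analyzer separates ions by m/z rather than by mass, the same peptide appears at different positions depending on its charge. A minimal sketch, assuming positive-mode ions formed by adding z protons to a neutral peptide of mass M; the function name and the example mass are hypothetical.

```python
PROTON_MASS = 1.007276  # Da


def mz(neutral_mass: float, charge: int) -> float:
    """m/z of a peptide carrying `charge` extra protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


# Hypothetical 1000 Da peptide observed in charge states 1 to 3.
for z in (1, 2, 3):
    print(z, round(mz(1000.0, z), 4))
```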

MS/MS Overview

A raw fragmentation spectrum • By calculating the molecular weight difference between adjacent ions of the same type, the sequence can be determined. • SEQUEST uses the fragmentation pattern to search through a complete protein database to identify the sequence that best fits the pattern.
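The mass-difference idea can be illustrated on a simulated, noise-free ladder. The sketch below builds singly charged b-ion masses for the peptide EACDPLR (the example peptide from the ICAT slide, here with unmodified cysteine) and then reads the sequence back from the differences between consecutive ions. This is an idealized illustration, not how SEQUEST works; real spectra have missing and extra peaks. Leucine and isoleucine have identical mass, so both are reported as L.

```python
PROTON_MASS = 1.007276  # Da

# Monoisotopic residue masses in Da (I omitted; same mass as L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def b_ion_ladder(peptide: str) -> list:
    """Simulated singly charged b-ion m/z values, b1 through bn."""
    ladder, total = [], PROTON_MASS
    for residue in peptide:
        total += RESIDUE_MASS[residue]
        ladder.append(total)
    return ladder


def read_sequence(ladder: list, tol: float = 0.01) -> str:
    """Assign a residue to each gap between consecutive ladder ions."""
    sequence, previous = [], PROTON_MASS
    for ion in ladder:
        delta = ion - previous
        hits = [aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol]
        sequence.append(hits[0] if hits else "?")
        previous = ion
    return "".join(sequence)


ladder = b_ion_ladder("EACDPLR")
print(read_sequence(ladder))
```

The tolerance matters: Q and K differ by only about 0.036 Da, so an instrument with poorer mass accuracy could not tell them apart from the ladder alone.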

Tandem Mass Spec (MS/MS)

Typical nanoelectrospray source

Isotope Coded Affinity Tags (ICAT) • Mass spec based method for measuring relative protein abundances between two samples • Heavy reagent: d8-ICAT (X = deuterium) • Normal reagent: d0-ICAT (X = hydrogen) • ICAT reagent parts: biotin tag, linker (d0 or d8, carrying the eight X positions), thiol-specific reactive group [Figure: chemical structure of the ICAT reagent]

Protein Quantification & Identification via ICAT Strategy • Mixture 1 and Mixture 2: ICAT-labeled cysteines (light and heavy) • Combine and proteolyze (trypsin) • Affinity separation (avidin) • Quantitation: light and heavy peaks in the MS spectrum • Protein identification: fragmentation spectrum, e.g., NH2-EACDPLR-COOH [Figure: workflow with a quantitation panel (relative intensity 0-100 vs. m/z 550-580) and an identification panel (relative intensity 0-100 vs. m/z 0-800)] ICAT Flash animation: http://occawlonline.pearsoned.com/bookbind/pubbooks/bc_mcampbell_genomics_1/medialib/method/ICAT/ICAT.html

ICAT continued • The heavy (blue) and light (gray) peptides are separated and quantified to produce a ratio for each peptide – here, a single peptide ratio is shown • Each peptide is subjected to CID fragmentation in the second MS stage in order to identify it
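The peptide ratio can be sketched in two steps: predict where the heavy partner of a light peak should appear, then divide the two intensities. The example assumes the d8 and d0 reagents differ by exactly eight deuterium-for-hydrogen substitutions (about 1.006277 Da each) per labeled cysteine; the function names and peak values are hypothetical.

```python
D_MINUS_H = 1.006277            # Da, mass difference between deuterium and hydrogen
ICAT_SHIFT = 8 * D_MINUS_H      # d8 vs. d0 reagent, per labeled cysteine


def expected_heavy_mz(light_mz: float, charge: int, n_cys: int = 1) -> float:
    """Predicted m/z of the heavy partner of a light ICAT peptide."""
    return light_mz + n_cys * ICAT_SHIFT / charge


def heavy_to_light_ratio(light_intensity: float, heavy_intensity: float) -> float:
    """Relative abundance of the peptide in the heavy-labeled sample."""
    if light_intensity <= 0:
        raise ValueError("light intensity must be positive")
    return heavy_intensity / light_intensity


# Hypothetical doubly charged peptide with one labeled cysteine.
print(round(expected_heavy_mz(560.0, charge=2), 4))
print(heavy_to_light_ratio(50.0, 100.0))
```

Note that the spacing between the paired peaks shrinks with charge: about 8 m/z units for a singly charged peptide and about 4 for a doubly charged one.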

Metabolomic measurements 2D NMR or mass spectrometry Currently not global and in less widespread use than microarrays, but have tremendous potential

Gene knockout and RNAi libraries for model species • Example from yeast: replacement of yeast ORFs with the kanMX gene flanked by unique oligo barcodes (Yeast Deletion Project Consortium)

YFP tagging for protein localization • YFP is green, transmitted light is red • NIC96: nuclear pore • TUB1: tubulin cytoskeleton • HHF2: histone, nucleus • BNI4: bud neck • Images courtesy T. Davis lab • See also recent work by Weissman and O’Shea labs at UCSF

Systematic phenotyping • Deletion strains: yfg1Δ, yfg2Δ, yfg3Δ, … • Barcodes (UPTAG): CTAACTC, TCGCGCA, TCATAAT, … • Rich media • Growth 6 hrs in minimal media (how many doublings?) • Harvest and label genomic DNA

Systematic phenotyping with a barcode array (Ron Davis and friends…) • These oligo barcodes are also spotted on a DNA microarray • Growth time in minimal media: red = 0 hours, green = 6 hours
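A barcode array readout can be summarized per strain as the log ratio of the 6-hour (green) signal to the 0-hour (red) signal. The sketch below uses the three UPTAG barcodes shown for the deletion strains; the intensities and the 2-hour doubling time are hypothetical values chosen for illustration, and the function names are assumptions.

```python
import math


def doublings(hours: float, doubling_time_hours: float) -> float:
    """Number of population doublings in the given growth time."""
    return hours / doubling_time_hours


def log2_change(red_0h: float, green_6h: float) -> float:
    """log2 of the 6-hour signal over the 0-hour signal for one barcode."""
    return math.log2(green_6h / red_0h)


# Hypothetical (red, green) intensities for each barcode.
spots = {
    "CTAACTC": (1000.0, 1000.0),  # keeps pace with the pool
    "TCGCGCA": (1000.0, 250.0),   # depleted after growth in minimal media
    "TCATAAT": (800.0, 900.0),
}

print(doublings(6.0, doubling_time_hours=2.0))  # assumed 2 h doubling time
for barcode, (red, green) in spots.items():
    print(barcode, round(log2_change(red, green), 2))
```

A strongly negative value marks a strain whose barcode has dropped out of the pool, which points to a growth defect of that deletion in minimal media.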

Molecular Interactions Among proteins, mRNA, small molecules, and so on…

Protein→DNA interactions: ▲ Chromatin IP • ▼ DNA microarray • Gene levels (on/off)
Protein-protein interactions: ▲ Protein coIP • ▼ Mass spectrometry • Protein levels (present/absent)
Biochemical reactions: ▲ Not yet!!! • ▼ Metabolic flux measurements • Biochemical levels

Like sequence data, protein interaction data are growing exponentially… (As are the false positives!!!) [Figure: EMBL Database Growth, total nucleotides (gigabases), and DIP Database Growth, total interactions; y-axis 0-10, x-axis 1980-2000]

High-throughput methods for measuring interaction networks • 2-hybrid • co-immunoprecipitation w/ mass spec • ChIP-on-chip • systematic genetic analysis

Yeast two-hybrid method Fields and Song

Detection of protein interactions with antibody arrays MacBeath and Schreiber

Kinase-target interactions Mike Snyder and colleagues
