Managing and Exploiting “Post-Genomics Era” Data • David O. Nelson, Matt Coleman • Lawrence Livermore National Laboratory
The “New” Biology: X-omics • Traditional reductionistic approach: • One gene/protein/reaction at a time. • Test/validate isolated models at bench. • New “systems” approach: • All DNA/RNA/proteins surveyed at once. • Need to • Manage data globally (across labs, sites, …) • Analyze large batches of intermediate results. • Provide links to minute details when required.
Outline • Introduction to DOE Low-Dose Program • Microarrays • Overview and analyses • Microarrays at LLNL • An Example Project • Gene regulation after exposure to Ionizing Radiation (IR)
The DOE Low-Dose Radiation Research Program • Goal: develop radiation standards based on risk. • Focus: biological mechanisms of radiation response • low-dose (< 0.1 Gy) • low dose-rate (< 0.1 Gy / Yr) • Scope: • ~54 projects • See http://lowdose.org
Microarrays in the Low-Dose Program Use microarrays to • Identify candidate low-dose biomarkers of radiation • exposure, • early cellular response, • downstream effects, and • susceptibility. • Assess risk through mechanism-based understanding of cell and tissue response. • Genomic regulation of low dose response. • Identifying affected biological pathways and functions. • Predicting novel biological pathways and functions.
What Is A Microarray? • Microarray—a 2d array of spots on a glass slide. • Each spot contains DNA (or RNA). • Usually different DNA on each spot. • Some within-slide reps for QC. • Make bunches at a time and hybridize with tissue extracts.
Hybridize Arrays With Tissue Extracts • An array is simultaneously exposed to one or more tissue extracts (different treatments). • DNA in each extract labeled with fluorescent tag. • Different tag for each tissue extract. • DNA in tissue sticks (hybridizes) with its mate on slide. • Competitive hybridization when >1 tissue. • Quantity hybridized ~ concentration.
Scan Arrays to Acquire Concentration Data • Fluorescent tags are excited by a laser scanner. • Intensity is read by photomultiplier tubes (PMTs). • One band-pass filter and PMT per fluorophore. • Result is an intensity image in one or more spectral bands.
The “Raw” Data Is Really Highly Cooked • Find ROI for each spot. • Estimate background outside of spots. • Eliminate background. • Combine intensities from each pixel to estimate signal. • Assess quality of estimate. • Result: an n x k array of intensities for hybridization. • n spots, k colors.
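The per-spot steps above (ROI, local background, subtraction, pixel combination) can be sketched in a few lines. This is only an illustration under stated assumptions, not the image-processing pipeline used at LLNL: the circular ROI, the ring-shaped background region, the median background estimate, and the function name `quantify_spot` are all hypothetical choices, and the quality-assessment step is left out.

```python
from statistics import mean, median


def quantify_spot(image, center, spot_radius, bg_radius):
    """Return (signal, background) for one spot in one spectral band.

    image: list of rows of pixel intensities (hypothetical layout).
    Pixels within spot_radius of center form the ROI; pixels farther out
    but within bg_radius form the local background ring, so bg_radius
    must be larger than spot_radius.
    """
    cy, cx = center
    spot, ring = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            d2 = (y - cy) ** 2 + (x - cx) ** 2
            if d2 <= spot_radius ** 2:
                spot.append(value)      # pixel belongs to the spot ROI
            elif d2 <= bg_radius ** 2:
                ring.append(value)      # pixel belongs to the background ring
    background = median(ring)           # robust local background estimate
    # Background-subtracted mean intensity, floored at zero.
    signal = max(mean(spot) - background, 0.0)
    return signal, background
```

Running this once per spot and per band would fill the n x k intensity array described on the slide.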
Experiments with cDNA Microarrays • Two probes per experiment. • Probes are cDNA from tissues. • Labeled with different dyes. • Data are intensities from two band-pass filters. • Usually red and green.
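Two-channel intensities are commonly summarized per spot as a log ratio and an average log intensity. The slides do not say which summary was used, so the sketch below is an assumption: it uses the widely used M (log2 ratio) and A (mean log2 intensity) values, and the function name `log_ratio` is hypothetical.

```python
import math


def log_ratio(red, green):
    """Summarize one spot's two background-corrected intensities.

    M is the log2 ratio of red to green (M = 1 means a two-fold excess of
    red); A is the average log2 intensity of the two channels.
    Both inputs must be positive.
    """
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a
```

For example, `log_ratio(1024, 256)` gives M = 2 (a four-fold difference between channels) and A = 9.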
Custom Arrays at LLNL
• ~1200 interesting genes
• Genes associated with radiation effect: IR modulated; associated with differential IR response
• Genes for DNA repair and stress response: DNA repair, cell cycle control, apoptosis, stress response, meiosis genes
• Developmental and spermatogenesis genes
• Mouse arrays:
  • 50-gene array, includes ~15 radiation response genes
  • 403-gene array, includes ~100 radiation response genes
  • 833-gene array, includes ~200 radiation response genes
• Human arrays:
  • 99-gene array, includes ~40 radiation response genes
  • ~500-gene array, includes ~200 radiation response genes
  • ~800-gene array, includes ~200 radiation response genes
Oligonucleotide Microarrays from Affymetrix (Affy Arrays) • One probe per experiment. • Probes are labeled RNA. • Data is intensity from one band-pass filter. • Treatment and blocking completely confounded.
Available Affy Arrays
• Human Genome U95 and U133 sets (6 chips): comprehensive transcript coverage for the human genome. Study the expression level of >60,000 human genes.
• Mouse Genome U74 set (3 chips): biggest mouse gene set currently available. 36,000 mouse genes and ESTs.
• Others available: SNP analysis; cancer set; yeast, fly, E. coli, rat.
Problems in Analyzing Microarray Data • Experimental design creates problems. • Chip-to-chip variation confounded with treatment differences. • Normalization is used to try to adjust. • Only one or two treatments per chip. • Treatment comparisons more complex. • Modern designs can help. • “Best” way to arrange and process data still in flux. • How to pick the winners with 10^4 tests? • Multiple testing vs. exploratory analyses.
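With on the order of 10^4 tests, a per-test threshold of 0.05 would yield about 500 false positives if no gene truly changed. The slides do not name a correction method, so the sketch below is an assumption: it shows two standard options, the Bonferroni correction and the Benjamini-Hochberg step-up procedure, written from their textbook definitions.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of tests significant after Bonferroni correction."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of tests rejected while controlling the false discovery rate."""
    m = len(p_values)
    # Rank the tests from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= fdr * rank / m:
            cutoff = rank  # largest rank that passes the step-up criterion
    # Reject every test ranked at or below the cutoff.
    return sorted(order[:cutoff])
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg tolerates a fixed fraction of false positives among the selected genes, which may suit exploratory screening better.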
Experiments at LLNL and Elsewhere
• Mouse IR gene expression: tissue: brain, testis; dose: 0, 0.1, 2.0 Gy; time: 0.5, 4 hours
• Human IR gene expression: tissue: 3 cell lines; dose: 0, 2.0, 0.05+2.0, 0.1 Gy; time: 4 hours
• Y. pestis: 3 strains, 2 host cell lines; temperature: 20 °C and 37 °C; time: 1.0 and 10 hours
• Mouse development: 5 early embryo stages; 5 spermatogenesis stages; 5+ mutagens
• Other IR-related labs: A. Fornace, NCI; G. Chu, Stanford; B. Lehnert, LANL; D. Chen, LBNL
Infrastructure Development Underway at LLNL • Web-based data acquisition and storage. • Open-source tools in R for describing designs and analyses. • HTML and XML tools for results presentation. • Integrate with SDM for exploitation of downstream tools.
What Information Must Be Stored: MGED Effort • Minimum Information About a Microarray Experiment (MIAME) • Experimental Design • Array Design • Samples • Hybridizations • Measurements • Controls • Interchange Formats, Ontologies, etc. • http://www.mged.org
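The six sections listed above can be treated as a checklist for each stored experiment. The sketch below is a hypothetical in-memory record, not an MGED interchange format: the class name `MicroarrayRecord`, the field types, and the helper `missing_sections` are assumptions made for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class MicroarrayRecord:
    """One experiment, with one field per section named on the slide."""
    experimental_design: dict = field(default_factory=dict)
    array_design: dict = field(default_factory=dict)
    samples: list = field(default_factory=list)
    hybridizations: list = field(default_factory=list)
    measurements: list = field(default_factory=list)
    controls: list = field(default_factory=list)


def missing_sections(record):
    """Names of sections that are still empty, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

A data-acquisition front end could call `missing_sections` before accepting a submission, so incomplete records are flagged at entry time rather than at analysis time.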
Analytical Tools Are Developing Rapidly • Bioconductor project. • Open-source tools for microarray analysis. • http://www.bioconductor.org • Statisticians at HSPH, Lucent, UC Berkeley, Stanford, Johns Hopkins, LLNL, etc. are involved. • Mike Eisen at LBNL is one of the biology pioneers.
Summary: Microarray Data Integration Needs • Current experiments will produce 10^10 to 10^12 bytes of data. • Planned experiments much more. • Must integrate with data from • Intermediate results/analyses. • Local data repositories. • External sequence/protein data from NCBI and elsewhere. • External analysis tools. • Tool set undergoing rapid development.
Example Project: Genome-scale Modeling of IR Gene Networks • Hypothesis: Similar expression patterns in response to low-dose IR => genes in coordinated expression groups. • Significance: Understanding regulation of expression groups will help • Understand biological processes. • Identify determinants of IR susceptibility.
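One simple way to look for "similar expression patterns" is to correlate each gene's profile across conditions with a seed gene's profile. The slides do not state which similarity measure or clustering method was used, so Pearson correlation, the 0.9 threshold, and the names `pearson` and `coexpressed_with` below are assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def coexpressed_with(seed, profiles, threshold=0.9):
    """Genes whose profile correlates with the seed gene's at or above threshold.

    profiles maps a gene name to its expression values across the same
    ordered set of conditions (for example dose and time combinations).
    """
    reference = profiles[seed]
    return sorted(
        gene
        for gene, values in profiles.items()
        if gene != seed and pearson(reference, values) >= threshold
    )
```

Genes returned for a given seed would form a candidate coordinated expression group, to be examined further with the promoter analysis that follows.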
Building and Extending a Promoter Model
• Find interesting genes using microarrays.
• Obtain cDNA sequences for genes of interest.
• BLAST cDNA sequences against Unfinished High-Throughput Genomic Sequences or “Nonredundant Databases”.
• Identify start of transcription based on cDNA-genomic sequence alignment.
• Select 1000 bases upstream of the transcription start site.
• Analyze sequences for transcription factor binding sites (TFBSs) using ModelInspector.
• Build a consensus model using location and consensus of TFBSs.
• Search for other genes with same promoter model.
• Compare new genes with genes already in group.
Promoter Model Discovery Workflow
1. Do Microarray Experiment
2. Extract cDNA Cluster(s) (A, B, C)
3. Search DBMSs for Sequences
4. Extract Upstream Sequences
5. Identify Potential Promoters
6. Hypothesize Promoter Model
7. Search DBMSs for Other Genes
8. Potential Genes
Adapted from Thomas Werner, Biomolecular Engineering, 17: 87-94 (2001)
Assume 1st Position of cDNA is Start of Transcription • Need 1000 bases upstream. • [Figure: cDNA aligned against genomic sequence]
Step 3: Get and Annotate 1kb Upstream • User-added annotation for use in later analysis (loc, clone & cDNA accession number, direction)
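Extracting the upstream region depends on which strand the gene lies on, which is presumably why the direction is recorded in the annotation. The sketch below is an assumption-laden illustration: it takes the first cDNA position as the transcription start (as stated earlier), uses 0-based forward-strand coordinates, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(genomic, tss, strand="+", length=1000):
    """Up to `length` bases immediately upstream of the transcription start.

    genomic: forward-strand genomic sequence.
    tss: 0-based forward-strand index of the first transcribed base.
    The result is always returned 5'->3' relative to the gene and is
    shorter than `length` if the genomic sequence ends first.
    """
    if strand == "+":
        start = max(tss - length, 0)
        return genomic[start:tss]
    # Minus strand: upstream bases sit at higher forward coordinates.
    end = min(tss + 1 + length, len(genomic))
    return reverse_complement(genomic[tss + 1:end])
```

For a gene on the plus strand, `upstream_region(genomic, tss)` returns the 1 kb window that would then be passed to the promoter-analysis tool.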
Step 4: Use External Tool Made for Promoter Analysis • http://genomatix.gsf.de
A Putative Model for the HK2 Cluster… • [Figure: DNA with transcription factor binding sites upstream of the start of transcription; putative model for down regulation]
Subsequent Steps… • Go back into GenBank and find other sequences with same promoter patterns. • May reflect genes that are co-regulated. • Figure out how new genes fit into picture.
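Searching for other sequences with the same promoter pattern amounts to checking whether the model's binding sites occur in the right order and within allowed distances. The sketch below is a deliberately simplified assumption, not how ModelInspector works: it matches exact motif strings, whereas TFBS tools typically score degenerate patterns, and the model format (a list of `(motif, max_gap)` pairs) and the motifs used in any example are hypothetical.

```python
def matches_model(sequence, model):
    """True if the motifs in `model` occur in order within the allowed gaps.

    model: list of (motif, max_gap) pairs in 5'->3' order. max_gap is the
    largest number of bases allowed between the end of the previous motif
    and the start of this one; it is ignored for the first motif.
    """
    def search(idx, start, latest):
        if idx == len(model):
            return True  # every motif has been placed
        motif = model[idx][0]
        pos = sequence.find(motif, start)
        while pos != -1 and pos <= latest:
            end = pos + len(motif)
            if idx + 1 < len(model):
                next_latest = end + model[idx + 1][1]
            else:
                next_latest = end
            if search(idx + 1, end, next_latest):
                return True
            # Backtrack: try the next occurrence of this motif.
            pos = sequence.find(motif, pos + 1)
        return False

    return search(0, 0, len(sequence))


def genes_matching(upstream_by_gene, model):
    """Names of genes whose upstream sequence fits the promoter model."""
    return sorted(g for g, s in upstream_by_gene.items() if matches_model(s, model))
```

Genes returned by `genes_matching` would only be candidates for co-regulation; as the slide notes, the next step is to work out how they fit into the picture.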
What Do We Know Now? • Data management problems will be severe by traditional biological standards. • Exploiting these data will require better tools for integrating disparate data, databases, and analytics. • Must adapt rapidly to changing technology and scientific directions.
Key Low-Dose Personnel and Collaborators
• Expression array technology (custom and commercial arrays; image capture & processing): Paul VanHummelen; Rajiv Raja; Matt Coleman; Laura Kegelmeyer; Brenda Marsh; Don Peters; Shalini Mabery
• Array informatics; Image clones, selection: David Nelson; Christa Prange; Tom Slezak, etc.; Dave Wilson; Leif Peterson, Baylor U.; Jeff Gregg, UC Davis
• Mouse in vivo model (baseline; IR response): Lisa Cheeseman; Eric Yin
• Human lymphoblastoid model (effects of IR; adaptive response): Matt Coleman; Jim Tucker; Brenda Marsh; Karen Sorensen
• Other key collaborators: J. Gregg, UC Davis, bioinformatics and hybridization technology; S. McCutchin-Maloney, LLNL, protein analyses; D. Wilson, LLNL, DNA repair; F. Marchetti, LLNL, cytogenetics