CS491JH: Data Mining in Bioinformatics
• Introduction to Microarray Technology
• Technology Background
• Data Processing Procedure
• Characteristics of Data
• Data Integration and Data Mining
Substrates for High Throughput Arrays
• Nylon Membrane: single label (P33)
• GeneChip: single label (biotin, streptavidin)
• Glass Slides: dual label (Cy3, Cy5)
GeneChip® Probe Arrays
• GeneChip Probe Array: 1.28 cm, >200,000 different complementary probes
• Hybridized Probe Cell: 24 µm, millions of copies of a specific oligonucleotide probe
• Single stranded, labeled RNA target; oligonucleotide probe
• Image of Hybridized Probe Array
GeneChip® Expression Array Design
• Gene Sequence (5´ to 3´)
• Multiple oligo probes
• Probes designed to be Perfect Match
• Probes designed to be Mismatch
Procedures for Target Preparation
Cells → Poly(A)+ / Total RNA → cDNA → IVT (Biotin-UTP, Biotin-CTP) → Labeled transcript → Fragment (heat, Mg2+) → Labeled fragments → Hybridize (16 hours) → Wash & Stain → Scan
Printing Arrays on 50 slides
(NSF Soybean Functional Genomics, Steve Clough / Vodkin Lab)
Ratio of expression of genes from two sources
• Cells from condition A: mRNA → cDNA, Label Dye 1
• Cells from condition B: mRNA → cDNA, Label Dye 2
• Mix
• Ratio: equal, over, under
Total or
(NSF / U of Illinois Microarray Workshop, Steve Clough / Vodkin Lab)
GSI Lumonics
(NSF Soybean Functional Genomics, Steve Clough / Vodkin Lab)
Cattle and Soy Controls
• Beta Actin
• PKG
• HPRT
• Beta 2 microglobulin
• Rubisco
• AB binding protein
• Major latex protein homologue (MSG)
Array of cattle and soy spiking controls. 50 µg of cattle brain total RNA was labeled with Cy3 (green). 1 µl each of in vitro transcribed soy Rubisco (5 ng), AB binding protein (0.5 ng), and MSG (0.05 ng) was labeled with Cy5. The two labeled samples were cohybridized on superamine slides (Telechem, Inc.). To the right of each set of spots are five negative controls (water).
Fetal Spleen (Cy3) vs. Adult Spleen (Cy5)
Labeled spots: IgM, MYLK, IgM heavy chain, COL1A2
GenePix Image Analysis Software
Placenta vs. Brain: 3800 Cattle Placenta Array (Cy3, Cy5)
Microarray Data Process
• Experimental Design
• Image Analysis: raw data
• Normalization: “clean” data
• Data Filtering: informative data
• Model Building
• Data Mining (clustering, pattern recognition, etc.)
• Validation
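The data filtering step can be illustrated with a minimal sketch. It assumes that each spot has already been normalized, that expression is summarized as a log10 ratio of Cy5 to Cy3 intensity, and that a cutoff of 0.3 (roughly a two-fold change) separates informative genes from unchanged ones. The gene names, intensities, and the cutoff itself are hypothetical choices for illustration, not values prescribed by the slides.

```python
import math

# Assumed cutoff: |log10 ratio| > 0.3, i.e. roughly a two-fold change.
THRESHOLD = 0.3


def log10_ratio(cy5, cy3):
    """Return log10(cy5 / cy3) for two positive intensities."""
    return math.log10(cy5 / cy3)


def classify(ratio, threshold=THRESHOLD):
    """Label a log10 ratio as 'over', 'under', or 'equal'."""
    if ratio > threshold:
        return "over"
    if ratio < -threshold:
        return "under"
    return "equal"


# Hypothetical normalized intensities: gene -> (Cy3, Cy5).
spots = {
    "gene_a": (1000.0, 4000.0),  # Cy5 is four times Cy3
    "gene_b": (2000.0, 2100.0),  # nearly equal in both channels
    "gene_c": (3000.0, 600.0),   # Cy5 is one fifth of Cy3
}

# Call each gene over-expressed, under-expressed, or equal.
calls = {
    gene: classify(log10_ratio(cy5, cy3))
    for gene, (cy3, cy5) in spots.items()
}

# Keep only the genes that pass the filter.
informative = [gene for gene, call in calls.items() if call != "equal"]
```

Only the genes in `informative` would be carried forward to model building and data mining; the rest are treated as unchanged between the two samples.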
Scatterplot of Normalized Data: Fetal vs. Adult
Legend: < -0.3, > 0.3
Characteristics of Data
Data can be viewed as an N×M matrix (N >> M):
• N is the number of genes
• M is the number of data points for each gene
Or as an N×(M+K) matrix:
• K is the number of features describing each gene (genome location, functional description, metabolic pathway, etc.)
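One way to picture this layout is the sketch below, which stores each gene as a row holding M measurements and K annotation features. The class name `GeneRow`, the gene identifiers, and all values are hypothetical and serve only to show the shape of the data.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRow:
    """One row of the N x (M + K) table."""
    gene_id: str
    values: list                                   # M expression measurements
    features: dict = field(default_factory=dict)   # K descriptive features


# Hypothetical data: N = 3 genes, M = 4 data points, K = 2 features.
rows = [
    GeneRow("gene_1", [0.1, 0.4, 0.9, 1.2],
            {"location": "chr1", "pathway": "cell cycle"}),
    GeneRow("gene_2", [0.0, -0.2, -0.8, -1.1],
            {"location": "chr2", "pathway": "signaling"}),
    GeneRow("gene_3", [0.3, 0.2, 0.3, 0.1],
            {"location": "chr2", "pathway": "unknown"}),
]


def shape(table):
    """Return (N, M, K) for a non-empty list of GeneRow objects."""
    first = table[0]
    return len(table), len(first.values), len(first.features)


# The plain N x M expression matrix, without the annotation columns.
matrix = [row.values for row in rows]
```

In a real experiment N would be in the thousands while M stays small, which is why most analyses work row by row (gene by gene) rather than column by column.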
Model for Data Analysis
• Gene expression is a dynamic process
• Each microarray experiment is a snapshot of the process
• Basic biological knowledge is needed to build a model
• For example:
• Assumption: in most experiments, only a small set of genes (100s to 1000s) is affected significantly.
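The assumption that most genes are unaffected is what makes simple global normalization reasonable: if the typical gene is unchanged, the median log ratio should be zero, and any offset can be attributed to dye or labeling bias. The sketch below shows global median centering, a common technique chosen here for illustration; the slides do not state which normalization method was used, and the values are hypothetical.

```python
import statistics


def median_center(log_ratios):
    """Subtract the median so that the typical gene has a log ratio of 0."""
    offset = statistics.median(log_ratios)
    return [r - offset for r in log_ratios]


# Hypothetical raw log ratios: most genes sit near +0.5 (a systematic
# bias), while two genes are strongly changed in opposite directions.
raw = [0.5, 0.4, 0.6, 0.5, 3.5, -2.5, 0.5]

normalized = median_center(raw)

# After centering, the unchanged genes cluster around 0 and the two
# strongly changed genes stand out at about +3 and -3.
changed = [i for i, r in enumerate(normalized) if abs(r) > 1.0]
```

Because the median ignores the few extreme genes, the correction is driven by the unchanged majority, which is exactly what the assumption above requires.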
Data Mining
Need for Data Mining
• Data volumes are too large for traditional analysis methods
• Large number of records and high dimensional data
• Only a small portion of the data is analyzed
• Decision support process becomes more complex
Functions of Data Mining
• Use the data to build predictors: prediction, classification, deviation detection, segmentation
• Generate more sophisticated summaries and reports to aid understanding of the data: find clusters and partitions in the data
Data Mining Methods
• Classification, Regression (Predictive Modeling)
• Clustering (Segmentation)
• Association Discovery (Summarization)
• Change and Deviation Detection
• Dependency Modeling
• Information Visualization
Clustered display of data from time course of serum stimulation of primary human fibroblasts.
• Cholesterol Biosynthesis
• Cell Cycle
• Immediate Early Response
• Signaling and Angiogenesis
• Wound Healing and Tissue Remodeling
Eisen et al., Proc. Natl. Acad. Sci. USA 95 (1998), p. 14865
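A clustered display of this kind groups genes whose expression profiles rise and fall together over the time course. The sketch below is a small, unoptimized illustration of average-linkage hierarchical clustering with a Pearson correlation distance, written in plain Python. It is not the software behind the figure; the gene names and time-course values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def dist(c1, c2):
        # Mean of (1 - correlation) over all cross-cluster gene pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Hypothetical log-ratio profiles over four time points.
profiles = {
    "early_1": [0.1, 2.0, 1.0, 0.2],
    "early_2": [0.0, 1.8, 1.1, 0.1],
    "late_1": [0.0, 0.2, 1.5, 2.5],
    "late_2": [0.1, 0.1, 1.2, 2.8],
}

clusters = average_linkage(profiles, n_clusters=2)
```

With these profiles the two early-peaking genes end up in one cluster and the two late-peaking genes in the other, because correlation compares the shape of each profile rather than its magnitude.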
Gene Expression Profile of Aging and Its Retardation by Caloric Restriction
Cheol-Koo Lee, Roger G. Klopp, Richard Weindruch, Tomas A. Prolla