1. 1 Introduction to gene expression microarrays Terry Speed
2. 2 DNA A deoxyribonucleic acid or DNA molecule is a double-stranded polymer composed of four basic molecular units called nucleotides.
Each nucleotide comprises a phosphate group, a deoxyribose sugar, and one of four nitrogen bases: adenine (A), guanine (G), cytosine (C), and thymine (T).
The two chains are held together by hydrogen bonds between nitrogen bases.
Base-pairing occurs according to the following rule: G pairs with C, and A pairs with T.
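The pairing rule above can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the reversal step relies on the standard fact (not stated on the slide) that the two strands run in opposite directions.

```python
# Watson-Crick pairing rule from the slide: G pairs with C, A pairs with T.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRING[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in its own 5'->3' direction."""
    return complement(strand)[::-1]


print(complement("GATTACA"))          # CTAATGT
print(reverse_complement("GATTACA"))  # TGTAATC
```

Hybridization on a microarray depends on exactly this complementarity between probe and target.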
3. 3
4. 4 Genes The human genome is distributed along 23 pairs of chromosomes.
22 autosomal pairs;
the sex chromosome pair, XX for females and XY for males.
In each pair, one chromosome is paternally inherited, the other maternally inherited.
Chromosomes are made of compressed and entwined DNA.
A (protein-coding) gene is a segment of chromosomal DNA that directs the synthesis of a protein.
5. 5 Chromosomes and DNA
6. 6 Exons and introns Genes comprise only about 2% of the human genome; the rest consists of non-coding regions, whose functions may include providing chromosomal structural integrity and regulating when, where, and in what quantity proteins are made (regulatory regions).
The terms exon and intron refer to coding (translated into a protein) and non-coding DNA, respectively.
7. 7 RNA A ribonucleic acid or RNA molecule is a nucleic acid similar to DNA, but
single-stranded;
ribose sugar rather than deoxyribose sugar;
uracil (U) replaces thymine (T) as one of the bases.
RNA plays an important role in protein synthesis and other chemical activities of the cell.
There are several classes of RNA molecules, including messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), and other small RNAs.
8. 8 Central dogma The expression of the genetic information stored in the DNA molecule occurs in two stages:
(i) transcription, during which DNA is transcribed into mRNA;
(ii) translation, during which mRNA is translated to produce a protein.
DNA → mRNA → protein
Other important aspects of regulation: methylation, alternative splicing, etc.
The correspondence between DNA's four-letter alphabet and a protein's twenty-letter alphabet is specified by the genetic code, which relates nucleotide triplets to amino acids.
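As a rough illustration of the two stages, the sketch below transcribes a short coding-strand sequence and translates it with a deliberately partial codon table. The example sequence and the function names are invented for illustration, and the real genetic code has 64 codons rather than the handful listed here.

```python
# Partial codon table, for illustration only (the full code has 64 codons).
CODON_TABLE = {
    "AUG": "M",                                      # methionine, usual start
    "UUU": "F", "UUC": "F",                          # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "AAA": "K", "AAG": "K",                          # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",              # stop codons
}


def transcribe(coding_dna: str) -> str:
    """mRNA matches the coding strand, with U in place of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read nucleotide triplets until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGTTTGGCAAATAA")
print(mrna)             # AUGUUUGGCAAAUAA
print(translate(mrna))  # MFGK
```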
9. 9
10. 10 Differential expression Each cell contains a complete copy of the organism's genome.
Cells are of many different types and states
E.g. blood, nerve, and skin cells, dividing cells, cancerous cells, etc.
What makes the cells different?
Differential gene expression, i.e., when, where, and in what quantity each gene is expressed.
On average, 40% of our genes are expressed at any given time.
11. 11 Functional genomics The various genome projects have yielded the complete DNA sequences of many organisms. E.g. human, mouse, yeast, fruitfly, etc.
Human: 3 billion base-pairs, 30-40 thousand genes.
Challenge: go from sequence to function, i.e., define the role of each gene and understand how the genome functions as a whole.
12. 12 Gene expression assays DNA microarrays rely on the hybridization properties of nucleic acids to monitor DNA or RNA abundance on a genomic scale in different types of cells.
The main types of gene expression assays:
High-density nylon membrane arrays;
Serial analysis of gene expression (SAGE);
Short oligonucleotide arrays (Affymetrix);
Long oligonucleotide arrays (Agilent);
Fibre optic arrays (Illumina);
cDNA arrays (Brown/Botstein)*.
13. 13 Applications of microarrays Measuring transcript abundance;
Genotyping;
Estimating DNA copy number;
Determining identity by descent;
Measuring mRNA decay rates;
Identifying protein binding sites;
Determining sub-cellular localization of gene products;
……..
14. 14
15. 15 GeneChip® Probe Arrays
The core of the platform is our unique arrays
Oligonucleotides synthesized de novo (photolithography & combinatorial chemistry)
Currently 65,000 (50 micron features) to 250,000 (24 micron) different oligos on commercially available products
Each oligo represented in $10^7$ to $10^8$ full-length copies
16. 16 Affymetrix GeneChip® Format 25mer DNA oligos are synthesized on the chip by photolithography
16-20 pairs of oligos represent each gene
Hybridization samples represent mRNA populations from, e.g., control and drug-treated samples
Expression of each gene is quantitated as fluorescence
Typically 10-30 chips per experiment
E.g. 5 treatments x 3 animals per treatment
We use primarily the Affymetrix chip format, in which…
17. 17 A probe set = 11-20 PM,MM pairs
18. 18 Affymetrix GeneChip® A little more detail
19. 15 replicates of a high responding probe set PM blue, MM red
21. 15 replicates of a middle responding probe set PM blue, MM red
23. 15 replicates of a low responding probe set PM blue, MM red
25. 25 Summarizing 11-20 probe intensity pairs to give a measure of expression for a probe set A key issue. The original Affymetrix solution left something to be desired.
There are many possible low-level summaries, but they fall into three main classes.
In part this talk is about comparing them.
In part it is about using them to measure differential expression.
26. 26 Competing measures of expression The original GeneChip® software used AvDiff
$\mathrm{AvDiff} = |A|^{-1} \sum_{j \in A} (PM_j - MM_j)$
where A is a suitable set of pairs chosen by the software.
Here 30%-40% of the values could be < 0, which was a major irritant.
$\log(PM_j / MM_j)$ was also used in the above.
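A minimal sketch of the AvDiff calculation follows. It assumes that the set A is simply all probe pairs in the probe set; the slide says only that A is "a suitable set of pairs chosen by the software", so that selection step is not reproduced. The intensities are made-up numbers.

```python
from statistics import mean


def avdiff(pm, mm):
    """Average PM - MM difference over all pairs (assumption: A = all pairs)."""
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical probe set with a clear signal.
strong = avdiff(pm=[120, 95, 300, 80], mm=[100, 110, 150, 90])
# Hypothetical weakly expressed probe set: MM exceeds PM for every pair.
weak = avdiff(pm=[50, 60, 55], mm=[70, 65, 58])

print(strong)  # 36.25
print(weak)    # negative, which is the irritant mentioned above
```

A negative summary cannot be log-transformed, which is part of why such values were troublesome.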
27. 27 Competing measures of expression, 2 Li and Wong (dChip) fit the following model to sets of chips
$PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$
where $\epsilon_{ij} \sim N(0, \sigma^2)$. They consider $\theta_i$ to be expression in chip $i$. Their model is also fitted to PM only, or to both PM and MM. Note that by taking logs, assuming the LHS is > 0, this is close to an additive model.
Efron et al. consider $\log PM_j - 0.5 \log MM_j$. It is much less frequently < 0.
Another summary is the second largest PM, PM(2).
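One way to fit a multiplicative chip-by-probe model of this kind is alternating least squares, sketched below. This is not the dChip implementation: the function name, the fixed iteration count, and the scaling constraint (sum of squared probe effects equal to the number of probes) are assumptions made for illustration, and dChip's outlier handling is omitted.

```python
def fit_multiplicative(y, n_iter=50):
    """Fit y[i][j] ~ theta[i] * phi[j] by alternating least squares.

    y[i][j] is the PM - MM difference for chip i and probe pair j.
    """
    n_chips, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes
    theta = [0.0] * n_chips
    for _ in range(n_iter):
        # Least-squares update of chip effects with probe effects held fixed.
        phi_ss = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / phi_ss
                 for i in range(n_chips)]
        # Least-squares update of probe effects with chip effects held fixed.
        theta_ss = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_chips)) / theta_ss
               for j in range(n_probes)]
        # Assumed identifiability constraint: sum(phi^2) == n_probes.
        scale = (n_probes / sum(p * p for p in phi)) ** 0.5
        phi = [p * scale for p in phi]
    # Final chip-effect update so theta matches the rescaled phi.
    phi_ss = sum(p * p for p in phi)
    theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / phi_ss
             for i in range(n_chips)]
    return theta, phi


# Noise-free toy data: three chips, three probe pairs.
true_theta, true_phi = [100.0, 200.0, 400.0], [0.5, 1.0, 1.5]
y = [[t * p for p in true_phi] for t in true_theta]
theta, phi = fit_multiplicative(y)
print(theta[1] / theta[0])  # close to 2: chip 2 is twice as expressed as chip 1
```

Only ratios of the chip effects are determined by the data; the absolute scale depends on the constraint chosen for the probe effects.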
28. 28 Competing measures of expression, 3 The latest version of GeneChip® uses something else, namely
$\log\{\text{Signal Intensity}\} = \mathrm{TukeyBiweight}\{\log(PM_j - MM_j^*)\}$
with $MM_j^*$ a version of $MM_j$ that is never bigger than $PM_j$. Here TukeyBiweight can be regarded as a kind of robust/resistant mean.
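To make "robust/resistant mean" concrete, here is a sketch of a one-step Tukey biweight estimate. The tuning constant `c = 5` and the small `eps` are assumptions chosen for illustration, and the input values stand in for already computed $\log(PM_j - MM_j^*)$ terms; how $MM_j^*$ is derived is not covered here.

```python
from statistics import mean, median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that ignores far outliers."""
    center = median(values)
    mad = median([abs(v - center) for v in values])  # median absolute deviation
    weights = []
    for v in values:
        u = (v - center) / (c * mad + eps)  # scaled distance from the median
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical log-scale probe values; the last one is an outlier.
log_values = [7.9, 8.0, 8.1, 8.2, 12.0]
print(mean(log_values))            # pulled upwards by the outlier
print(tukey_biweight(log_values))  # stays near the bulk of the data
```

Points far from the median receive zero weight, so a single aberrant probe has no influence on the summary.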
29. 29 What we do: four steps We use only PM, and ignore MM. Also, we
Adjust for background on the raw intensity scale;
Carry out quantile normalization of PM-*BG with chips in suitable sets, and call the result n(PM-*BG);
Take log2 of normalized background adjusted PM;
Carry out a robust multi-chip analysis (RMA) of the quantities log2 n(PM-*BG).
We call our approach RMA; see later for a bit more.
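The last step can be illustrated with Tukey's median polish, a standard robust way to fit an additive probe-plus-chip model to a table of log intensities. The slide does not say how the robust fit is carried out, so treat this as a sketch under that assumption; the function name and the toy data are invented.

```python
from statistics import median


def median_polish(x, n_iter=10):
    """Fit x[j][i] ~ overall + probe[j] + chip[i] by alternating median sweeps.

    x[j][i] is the log2 normalized, background-adjusted PM for probe j, chip i.
    """
    n_rows, n_cols = len(x), len(x[0])
    resid = [row[:] for row in x]
    overall, probe, chip = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row (probe) medians out of the residuals.
        for j in range(n_rows):
            m = median(resid[j])
            probe[j] += m
            resid[j] = [v - m for v in resid[j]]
        m = median(chip)
        overall += m
        chip = [c - m for c in chip]
        # Sweep column (chip) medians out of the residuals.
        for i in range(n_cols):
            m = median([resid[j][i] for j in range(n_rows)])
            chip[i] += m
            for j in range(n_rows):
                resid[j][i] -= m
        m = median(probe)
        overall += m
        probe = [p - m for p in probe]
    return overall, probe, chip, resid


# Toy data: 3 probes x 3 chips, exactly additive on the log2 scale.
x = [[8.0 + a + b for b in (-1.0, 0.0, 1.0)] for a in (-1.0, 0.0, 2.0)]
overall, probe, chip, resid = median_polish(x)
print([overall + c for c in chip])  # per-chip expression: [7.0, 8.0, 9.0]
```

The per-chip expression measure is the overall level plus the chip effect; the probe effects absorb differences in probe affinity.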
30. 30 cDNA microarrays: the process
31. 31 Building the chip
Ideally, the slide would contain all genes in the genome. The design of the array (template) is an important bioinformatics question, as you are only able to detect expression of genes on this template.
32. 32
33. 33 Sample preparation
34. 34 Hybridization Binding of cDNA target samples to cDNA probes on the slide
35. 35 Hybridization chamber
Humidity
Temperature
Formamide (lowers the Tm)
36. 36 Expression profiling with DNA microarrays
You will have another sample of interest, consisting of a portion of the genes, and the question is to find out which genes are being expressed in this sample. In a simplified version of the experiment, you label the sample with a colored dye, hybridize the sample to the slide, and scan the slide with a laser to see which spots light up. This provides a list of genes and their corresponding expression levels in our sample.
Our interest is in comparing the expression levels between the two samples.
37. 37 RGB overlay of Cy3 and Cy5 images
38. 38 Raw data Human cDNA arrays
~43K spots;
16-bit TIFFs: ~20 MB per channel;
~2,000 x 5,500 pixels per image;
Spot separation: ~136 µm;
For a “typical” array:
Mean = 43, median = 32, SD = 26 pixels per spot
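The quoted file size follows from the image dimensions and bit depth. The short check below assumes an uncompressed image and ignores the TIFF header.

```python
width, height = 2000, 5500   # approximate pixels per image
bytes_per_pixel = 16 // 8    # 16-bit greyscale

size_bytes = width * height * bytes_per_pixel
print(size_bytes / 1e6)      # 22.0 (megabytes per channel)
```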
39. 39 Image analysis The raw data from a cDNA microarray experiment consist of pairs of image files, 16-bit TIFFs, one for each of the dyes.
Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.
40. 40 Image analysis
41. 41 Image analysis Motivation behind background adjustment: A spot’s measured fluorescence intensity includes a contribution that is not specifically due to the hybridization of the target to the probe, but to something else, e.g. the chemical treatment of the slide, autofluorescence, etc. We want to estimate and remove this unwanted contribution.
42. 42 Spot Batch automatic addressing.
Segmentation. Seeded region growing (Adams & Bischof 1994): adaptive segmentation method, no restriction on the size or shape of the spots.
Intensity extraction
Foreground. Mean of pixel intensities within a spot.
Background. Morphological opening: non-linear filter which generates an image of the estimated background intensity for the entire slide.
Spot quality measures.
Software package. Spot, built on the freely available software package R.
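Morphological opening is an erosion (local minimum) followed by a dilation (local maximum). The pure-Python sketch below uses a square window that is assumed to be larger than any spot, so that bright spots are removed and only the slowly varying background remains. The window shape, the function names, and the toy image are illustrative assumptions, not the settings used by the Spot package.

```python
def window_filter(image, size, func):
    """Apply func (min or max) over a size x size window around each pixel."""
    height, width = len(image), len(image[0])
    r = size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [image[yy][xx]
                      for yy in range(max(0, y - r), min(height, y + r + 1))
                      for xx in range(max(0, x - r), min(width, x + r + 1))]
            out[y][x] = func(window)
    return out


def opening(image, size):
    """Erosion followed by dilation: an estimate of the background image."""
    return window_filter(window_filter(image, size, min), size, max)


# Toy 7 x 7 image: flat background of 10 with a bright 2 x 2 spot.
image = [[10] * 7 for _ in range(7)]
for y, x in [(3, 3), (3, 4), (4, 3), (4, 4)]:
    image[y][x] = 200

background = opening(image, size=3)
print(background[3][3])                # 10: the spot has been removed
print(image[3][3] - background[3][3])  # 190: background-adjusted foreground
```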
43. 43 Quality measures Spot
One channel, R or G
Signal/noise ratio;
Variation in pixel intensities;
Identification of “bad spots” (no signal), etc.
Two channels, R/G
Brightness: foreground/background ratio;
Uniformity: variation in pixel intensities and ratios of intensities;
Morphology: area, perimeter, circularity.
Array (slide)
Percentage of spots with no signal;
Range of intensities;
Distribution of spot signal area, etc.
How to use quality measures in subsequent analyses?
44. 44 Normalization
45. 45 Which spots to use?
All genes on the array.
Constantly expressed genes (housekeeping).
Controls
Spiked controls (e.g. plant genes);
Genomic DNA titration series.
Rank invariant set.
46. 46 $\log_2(R/G) \to \log_2(R/G) - c = \log_2(R/(kG))$
Standard practice (in most software)
c is a constant such as the mean or median log ratio.
Our preference
c is a function of overall spot intensity A and print-tip group, and possibly other variables.
Compute c by robust locally weighted regression of M on these variables, e.g. lowess.
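The two choices of c can be contrasted in code. The sketch uses the usual definitions M = log2(R/G) and A = (1/2) log2(RG), so that c corresponds to log2 k. The intensity-dependent version is a plain locally weighted linear fit with tricube weights, that is, the core of lowess without its robustness iterations; it is a simplified stand-in, and the function names and the `frac` parameter are assumptions. Print-tip groups could be handled by applying it within each group.

```python
import math
from statistics import median


def ma_values(red, green):
    """Log ratio M and average log intensity A for each spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def normalize_global(m):
    """Standard practice: subtract a single constant c (here the median M)."""
    c = median(m)
    return [mi - c for mi in m]


def local_linear_fit(x, y, frac=0.5):
    """Tricube-weighted local linear fit (lowess without robustness steps)."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))  # points in each neighbourhood
    fitted = []
    for x0 in x:
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12  # bandwidth
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def normalize_intensity_dependent(m, a, frac=0.5):
    """Subtract c(A), the smooth trend of M as a function of A."""
    return [mi - ci for mi, ci in zip(m, local_linear_fit(a, m, frac))]


# Toy data with an intensity-dependent dye bias: M drifts linearly with A.
a = [6.0 + i for i in range(10)]
m = [0.5 * ai - 3.0 for ai in a]
print(max(abs(v) for v in normalize_global(m)))                  # bias remains
print(max(abs(v) for v in normalize_intensity_dependent(m, a)))  # close to 0
```

A single constant only recentres the log ratios, whereas the intensity-dependent fit also removes a trend in M across A.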
47. 47 Which spots to use, cont.?
48. 48 Location normalization: details
49. 49 MA-plot
50. 50 MA-plot
51. 51 Boxplot: print-tip effects remain after global normalization
52. 52 Normalization: details cont.
53. 53
54. 54
55. 55 Smoothed histograms of M values
56. 56
57. 57 How we normalize Quantile normalization is a method to make the distribution of probe intensities the same for every chip.
The normalization distribution is chosen by averaging each quantile across chips.
The diagram that follows illustrates the transformation.
58. 58 The two distribution functions are effectively estimated by the sample quantiles.
Quantile normalization is fast.
After normalization, the variability of expression measures across chips is reduced.
Looking at post-normalization PM vs pre-normalization PM (natural and log scales), you can see that the transformation is nonlinear.
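Quantile normalization as described above can be sketched in plain Python: sort each chip, average the k-th smallest values across chips to obtain the reference distribution, then give each probe the reference value for its rank. Ties are broken here by position, which is a simplification, and the function name and toy data are illustrative.

```python
def quantile_normalize(chips):
    """chips: list of equal-length lists of probe intensities, one per chip."""
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [sum(sc[k] for sc in sorted_chips) / len(chips)
                 for k in range(n_probes)]
    normalized = []
    for chip in chips:
        order = sorted(range(n_probes), key=lambda idx: chip[idx])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace value by reference quantile
        normalized.append(out)
    return normalized


chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(chips))  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After the transformation every chip has exactly the same set of values; only the assignment of values to probes differs between chips.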
59. 59
60. 60
61. 61
62. 62
63. 63 Terminology Probe: DNA spotted on the array, aka. spot, immobile substrate.
Target: DNA hybridized to the array, mobile substrate.
Sector: collection of spots printed using the same print-tip (or pin), aka. print-tip-group, pin-group, spot matrix, grid.
Batch: collection of slides with the same probe layout.
The terms slide or array are often used to refer to the printed microarray.
64. 64
65. 65 WWW resources Complete guide to “microarraying”
http://cmgm.stanford.edu/pbrown/mguide/
http://www.microarrays.org
Parts and assembly instructions for printer and scanner;
Protocols for sample prep;
Software;
Forum, etc.
Animation: http://www.bio.davidson.edu/courses/genomics/chip/chip.html
66. 66 Acknowledgments Introduction to biology: based on Bioconductor short course lecture 1 (http://www.bioconductor.org/) and Temple short course lecture 1 with pictures from http://www.accessexcellence.org/
Introduction to microarrays: based on IPAM, UCLA
http://www.ipam.ucla.edu/programs/fg2000/tutorials.html#terry_speed,
See also
http://www.stat.berkeley.edu/users/terry/zarray/Html/index.html
A special thanks to Yee Hwa (Jean) Yang for her help here.