Copy-number estimation on the latest generation of high-density oligonucleotide microarrays

Copy-number estimation on the latest generation of high-density oligonucleotide microarrays Henrik Bengtsson (work with Terry Speed) Dept of Statistics, UC Berkeley January 24, 2008 Postdoctoral Seminars, Mathematical Biosciences Institute, The Ohio State University

Acknowledgments UC Berkeley: James Bullard Kasper Hansen Elizabeth Purdom Terry Speed WEHI, Melbourne: Mark Robinson Ken Simpson ISREC, Lausanne: “Asa” Wirapati John Hopkins: Benilton Carvalho Rafael Irizarry Affymetrix, California: Ben Bolstad Simon Cawley Luis Jevons Chuck Sugnet Jim Veitch

Size = 264 kb, Number of loci = 72 Copy number analysis is about finding "aberrations" in a person's genome.

Single Nucleotide Polymorphisms (SNPs) make us unique Definition: A sequence variation such that two genomes may differ by a single nucleotide (A, T, C, or G). Allele A:A ...CGTAGCCATCGGTA/GTACTCAATGATAG... Allele B:G A person has either genotype AA, AB, or BB at this SNP.

Human Genetic Variation:Breakthrough of the Year 2007 (Science) • 3 billion DNA bases. • First sequenced 2001. • HapMap: 270 individuals genotyped.3 million known SNPs (places where one base differ from one person to another). Estimate: 15 million SNPs. • Genomewide association studies takeover (over linkage analysis). • Copy Number Polymorphism:- 1,000s to millions of bases lost or added.- Estimate: 20% of differences in gene activity are due to copy-number variants; SNPs (genotypes) account for the rest. • January 22, 2008: The 3-year "1,000 Genomes Project" will sequence 1,000 individuals. This follows the HapMap Project (SNPs).

Objectives of this presentation • Total copy number estimation/segmentation • Estimate single-locus CNs well(segmentation methods take it from there) • All generations of Affymetrix SNP arrays: • SNP chips: 10K, 100K, 500K • SNP & CN chips: 5.0, 6.0 • Small and very large data sets

Available in aroma.affymetrix “Infinite” number of arrays: 1-1,000s Requirements: 1-2GB RAM Arrays: SNP, exon, expression, (tiling). Dynamic HTML reports Import/export to existing methods Open source: R Cross platform: Windows, Linux, Mac

Affymetrix chips

Running the assaytake 4-5 working days • Start with target gDNA (genomic DNA) or mRNA. • Obtain labeled single-strandedtargetDNA fragments forhybridization to theprobeson the chip. • After hybridization, washing, and scanning we get a digital image. • Image summarized across pixels to probe-level intensities before we begin. Thisis our "raw data".

Restriction enzymes digest the DNA, which is then amplified and hybridized

* * * * 5 µm 5 µm > 1 million identical 25bp sequences The Affymetrix GeneChip is a synthesized high-density 25-mer microarray * 1.28 cm 1.28 cm 6.5 million probes/chip

Target DNA find their way to complementary probes by massive parallel hybridization

Hybridization + Scanning DAT File(s) [Image, pixel intensities] Image analysis CDF [Chip Description File] CEL File(s) [Probe Cell Intensity] + workable raw data Pre-processing Segmentation

Affymetrix copy-number & genotyping arrays

* * * * * * * * 25 nucleotides Other DNA Other DNA Target seq. Other seq. other PMs PM Terminology Target sequence: ...CGTAGCCATCGGTAAGTACTCAATGATAG... ||||||||||||||||||||||||| Perfect match (PM): ATCGGTAGCCATTCATGAGTTACTA

* * * * * * * * * CN=1 CN=2 CN=3 PM = c PM = 2¢c PM = 3¢c Copy-number probes are used to quantify the amount of DNA at known loci CN locus:...CGTAGCCATCGGTAAGTACTCAATGATAG... PM:ATCGGTAGCCATTCATGAGTTACTA

Raw copy numbers- log-ratios relative to a reference From the preprocessing, we obtain for sample i=1,2,...,I, CN locus j=1,2,...,J: Observed signals: (i1, i2, ..., iJ) These are not absolute copy-number levels. In order to interpret these, we compare each of them to a reference "R", i.e. ij / Rj, but even better "raw copy numbers": Mij = log2 (ij / Rj) = log2(ij) - log2(Rj) The reference can be from normal tissue, or from a pool of normal samples.

Even without a segmentation algorithm, we can easily spot a deletion here. Copy number regions are found by lining up estimates along the chromosome Example: Log-ratios for one sample on Chromosome 22.

Single Nucleotide Polymorphisms (SNPs) make us unique Definition: A sequence variation such that two genomes may differ by a single nucleotide (A, T, C, or G). Allele A:A ...CGTAGCCATCGGTA/GTACTCAATGATAG... Allele B:G A person has either genotype AA, AB, or BB at this SNP.

* * * * * * * * * * * * * * * * * * * AA BB AB PMA >> PMB PMA << PMB PMA ¼PMB Affymetrix probes for a SNP - can be used for genotyping PMA:ATCGGTAGCCATTCATGAGTTACTA Allele A:...CGTAGCCATCGGTAAGTACTCAATGATAG...Allele B:...CGTAGCCATCGGTAGGTACTCAATGATAG...PMB:ATCGGTAGCCATCCATGAGTTACTA

* * * * * * * * * * * * * * * * * * * * * * * * AA AB AAB PM =PMA+PMB = 2c PM = PMA + PMB = 2c PM =PMA+PMB = 3c BB PM =PMA + PMB = 2c SNPs can also be used forestimating copy numbers *

Combing CN estimates from SNPs and CN probes means higher resolution SNPs + CN probes

A brief history...

Genome-Wide Human SNP Array 6.0is the state-of-the-art array • > 906,600 SNPs: • Unbiased selection of 482,000 SNPs:historical SNPs from the SNP Array 5.0 (== 500K) • Selection of additional 424,000 SNPs: • Tag SNPs • SNPs from chromosomes X and Y • Mitochondrial SNPs • Recent SNPs added to the dbSNP database • SNPs in recombination hotspots • > 946,000 copy-number probes: • 202,000 probes targeting 5,677 CNV regions from the Toronto Database of Genomic Variants. Regions resolve into 3,182 distinct, non-overlapping segments; on average 61 probe sets per region • 744,000 probes, evenly spaced along the genome

How did we get here? Data from 2003 on Chr22 (on of the smaller chromosomes)

2003: 10,000 loci x1

2004: 100,000 loci x10

2005: 500,000 loci x50

2006: 900,000 loci x90

2007: 1,800,000 loci x180

4£ further out… 10K 294kb 100K 26kb 500K 6.0kb # loci 5.0 3.6kb 6.0 1.6kb year Rapid increase in density Distance between loci: next? 2003 2004 2005 2006 2007

Affymetrix & Illumina are competing - we get more bang for the buck (cup) Price source: Affymetrix Pricing Information, http://www.affymetrix.com/, January 2008.

Preprocessing forcopy-number analysisCopy-number estimation using Robust Multichip Analysis (CRMA)

Copy-number estimation using Robust Multichip Analysis (CRMA)

* * * * * * * * * * * * * * * * * * * AA BB AB PMA >> PMB PMA << PMB PMA ¼PMB Crosstalk between alleles adds significant artifacts to signals Cross-hybridization: Allele A: TCGGTAAGTACTC Allele B: TCGGTATGTACTC

BB AB PMB AA + PMA offset Crosstalk between alleles is easy to spot

PMB PMA Crosstalk between alleles can be estimated and corrected for

Before removing crosstalk the arrays differ significantly... Crosstalk calibration corrects for differences in distributions too log2 PM

When removing crosstalk systemdifferences between arrays goes away Crosstalk calibration corrects for differences in distributions too log2 PM

Four measurements of the same thing: log2 PM With different scales: log(b*PM) = log(b)+log(PM) log2 PM With different scales and some offset:log(a+b*PM) = ... log2 PM How can a translation and a rescaling make such a big difference?

* * * * * * * * * * * * * * * * * * PM = PMA +PMB Copy-number estimation using Robust Multichip Analysis (CRMA) AA AB PM =PMA+PMB * BB PM =PMA +PMB

1 2 3 4 5 6 7 1 2 3 4 5 6 7 PMA PMB PMA PMB Genotype AA Genotype BB 1 2 3 4 5 6 7 PMA PMB Genotype AB For robustness (against outliers), there are multiple probes per SNP

Copy-number estimation using Robust Multichip Analysis (CRMA) The log-additive model: log2(PMijk) = log2ij+ log2jk + ijk sample i, SNP j, probe k. Fit using robust linear models (rlm)

Probe-level summarization- probe affinity model For a particular SNP, the total CN signal for sample i=1,2,...,I is: i Which we observe via K probe signals: (PMi1, PMi2, ..., PMiK) rescaled by probe affinities: (1, 2, ..., K) A model for the observed PM signals is then: PMik = k * i + xik where xik is noise.

Probe-level summarization- the log-additive model For one SNP, the model is: PMik = k * i + ik Take the logarithm on both sides: log2(PMik) = log2(k * i + ik) ¼ log2(k * i)+ ik = log2k + log2i + ik Sample i=1,2,...,I, and probe k=1,2,...,K.

Probe-level summarization- the log-additive model With multiple arrays i=1,2,...,I, we can estimate the probe-affinity parameters {k} and therefore also the "chip effects" {i} in the model: log2(PMik) = log2k + log2i + ik Conclusion: We have summarized signals (PMAk,PMBk)for probes k=1,2,...,K into one signal iper sample.

100K Copy-number estimation using Robust Multichip Analysis (CRMA) Longer fragments ) less amplified by PCR ) weaker SNP signals 

500K Copy-number estimation using Robust Multichip Analysis (CRMA) Longer fragments ) less amplified by PCR ) weaker SNP signals 

Copy-number estimation using Robust Multichip Analysis (CRMA) Normalize to get same fragment-length effect for all hybridizations

Copy-number estimation on the latest generation of high-density oligonucleotide microarrays

Copy-number estimation on the latest generation of high-density oligonucleotide microarrays

Presentation Transcript

Theoretical and experimental comparisons of gene expression indexes for oligonucleotide microarrays

More on Microarrays

Probability density estimation

Exploratory Data Analysis of High Density Oligonucleotide Array

Applied Kernel Density Estimation: Dynamic Spatiotemporal Analysis of Density Maps on Crime Data

DNA Copy Number Analysis

INFERRING HAPLOTYPES OF COPY NUMBER VARIATIONS

DNA Copy Number Analysis

A NOVEL APPROACH TO IMPROVE THE NOISE IN DETECTING COPY NUMBER VARIATIONS USING OLIGONUCLEOTIDE

Estimation of the spectral density function

Kernel Density Estimation

Evaluation of oligonucleotide rice microarrays as a high-throughput, low-cost

Quantitation of Gene Expression for High-Density Oligonucleotide Arrays: A SAFER Approach

Copy Number Variation

Printing High Quality Microarrays

Density Estimation

DNA Copy Number Analysis

Copy Number

Chapter 5 – Density estimation based on distances

Color Reproduction on High Density Printing

Probability density estimation

Printing High Quality Microarrays