Class 12 – 2nd next-generation seq. method
High-throughput DNA sequencing using “bridging” (surface) PCR and reversible terminator chemistry
Article from Illumina, Nature 456:53 (2008)
Why sequence?
1. Basic biology
• determine amino acid seq. of proteins
• learn role of non-coding seq.
• study evolutionary relationships – can help identify functional regions of DNA
2. Medical applications
• some mutations cause disease (CF, SCD) – shed light on disease mechanism
• some sequence variants assoc. w/ disease, drug sensitivity (“personalized med”)
• diagnose microbial infections
3. Non-medical applications – e.g. plant engineering
4. Forensics – individual identification
Key ideas and innovations in Illumina method
• Biochemistry
  • “bridging” PCR to get array of ~10⁸ DNA spots on glass slide, each containing ~10⁴ copies of an individual ~200 bp DNA species in ~1 µm wide area
  • sequencing by synthesis, 1 base at a time, using dNTPs with removable fluors and 3’ blocking groups
  • reading ~35 b from both ends of each DNA species to get seq. that should be known distance apart in ref. seq.
• Image analysis – automated collection and analysis of ~10⁶ microscope images/run
• Informatics – mapping short seq. runs to genome
How does Illumina method differ from Sanger sequencing? (Sanger step -> Illumina replacement)
1. clone DNA in bacteria to get many copies needed for adequate signal -> “bridging” PCR to create surface array of clusters of DNA fragments, each cluster containing many copies of a single DNA species
2. seq. rxn: DNA template + primer + DNA pol + dNTPs + 4 diff. dye-labeled ddNTP chain terminators -> reversibly 3’-blocked dye-labeled dNTPs (each a diff. color); extend 1 base, then remove 3’ block, add next base, etc.
3. electrophoresis to size-separate DNA species -> sequential photos to see order of base addition in each cluster
First challenge – how to assemble multiple copies of individual templates on solid surface where sequencing will be done
• Shear genomic DNA (nebulizer) into segments ~200 – 2000 bp
• “Blunt” ends w/ DNA pol
• Ligate “forked” adapter oligonucleotide
• PCR w/ oligos complementary to adapter seq. forked ends A, B -> A and B seq. at 5’-ends of alternate strands of all fragments
Substrate = glass flow cell, 8 channels ~100 µm height; thin layer of polyacrylamide applied in each channel
Polyacrylamide contains bromo… (BRAPA), which covalently links to phosphorothioate group on 5’ end of new primers
3’ ~20 bases of attached primers match those of oligo A or B used to PCR the genomic fragments, so melted amplified genomic fragments anneal to the attached primers
Primer ext. w/ DNA pol makes copy of 1 strand of particular genomic fragment at some spot on surface
Next challenge – make multiple copies of each fragment in small region on substrate surface (to have enough copies to get a strong sequencing signal)
Now melt off template
Newly synthesized strand anneals at its 3’-end to nearby, 5’-attached oligo A
Repetition grows thicket of both strands of particular genomic fragment in small spot on surface = “bridging” PCR; note all strands are covalently attached via 5’ ends
For unexplained reason they do this surface PCR by repeated cycles of chemical rather than thermal denaturation
Image of DNA fragments on surface after bridging PCR; each fragment is labeled (during sequencing) with 1 of 4 differently colored fluors by method explained below
Each spot = “polony” or “cluster” of many copies of single DNA fragment
Spot diameters ~1 µm; each spot contains ~10⁴ strands -> primers ~10 nm apart
Areal density of spots consistent w/ initial conc. of annealed genomic fragments ~3 pM
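The "~10 nm apart" figure can be checked with a back-of-envelope estimate. The sketch below assumes a circular spot ~1 µm across holding ~10⁴ strands, spread evenly on a square grid; the function name and the even-grid assumption are illustrative, not from the paper.

```python
import math

def mean_spacing_nm(strands: float, spot_diameter_um: float) -> float:
    """Mean strand-to-strand spacing, assuming strands are spread evenly
    (square grid) over a circular spot. Illustrative model only."""
    radius_nm = spot_diameter_um * 1000 / 2
    area_nm2 = math.pi * radius_nm ** 2
    # Each strand "owns" area/strands nm^2; the grid pitch is its square root.
    return math.sqrt(area_nm2 / strands)

spacing = mean_spacing_nm(1e4, 1.0)
print(f"~{spacing:.0f} nm between strands")
```

Under these assumptions the spacing comes out just under 10 nm, in line with the slide's estimate.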
Next challenge – how to make surface PCR’d DNA single-stranded to serve as sequencing template
Clever method – cut one strand of DNA at chemically sensitive site (*) engineered in oligo B, then melt off non-covalently attached DNA, add free primer B that anneals to distal (3’) end of attached template, extend B w/ pol
How to make the single-strand cut? Put diol-modified base in attached oligo B; diol can be chemically cleaved by periodate
How to sequence other end of template? After sequencing 1st strand, melt off primer-ext. product, perform another cycle of bridging PCR (ii), make single-stranded cut in attached oligo A, melt off oligo A extension product, seq. w/ soluble primer A
Note you need a new way to make ss cut in oligo A so you can make the A and B cuts separately; here are 2 ways:
• Synthesize oligo A with uracil (U) instead of T at given position; enzyme uracil glycosylase removes uracil (not normally in DNA); heat or high pH then breaks A strand at site of removed U
• Alternative: put oxoG in place of one G in oligo A; enzyme Fpg glycosylase removes abnormal oxoG; heat or high pH then breaks A strand at site of removed G
Novel use of enzymes that remove abnormal bases (repair mutations in vivo) plus ability to insert abnormal bases during oligo synthesis makes this possible
Additional complication: any free 3’ ends on DNA on surface might “fold back” and serve as primer for competing sequence rxn
They block this by enzymatically adding nucleotide w/ blocked 3’ OH group to all DNAs before adding seq. primer
How is sequence read biochemically? They synthesized novel nucleotides!
• base T modified with fluor
• sugar 3’ azide group (N₃) blocks extension
• A, C and G similarly modified but with diff. colored fluors
• only one base is added at a time due to 3’ blocking group
Treatment with TCEP removes fluor and 3’ blocking group, which allows next nucleotide to be added and its color detected (prev. fluor is removed)
Amazing that bulky, unnatural chemical groups left attached do not inhibit polymerase or mess up base-pairing
They say they had to engineer (mutate) DNA polymerase to get it to incorporate these modified bases efficiently
This is another innovative step!
Repeated cycles of flowing in polymerase plus 4 modified nucleotides (1 of which gets incorporated in given spot), washing, taking picture, treating with TCEP -> sequence
Picture taken at step n during sequencing run; all strands in a given cluster are labeled with A, C, G or T depending on sequence at nth base in template strand
How does spot density compare to Ion Torrent?
Note they use TIRF microscopy to reduce background; only see fluors within <1 µm of surface
Why “custom” objective?
How big is typical microscope field of view (FOV) at 60x magnification? Imagine FOV expanded 60x in each direction and mapped to 3 x 3 mm CCD
How many images would they need to cover ~10 cm² flow cell surface?
How long would it take to collect these images serially if they have to move slide 1 FOV between images?
Their “custom” lens gives them ?? (0.1 mm)² FOV
How many sets of images do they need (1 for each base addition)?
How long does it take to collect data for 1 run? ~week
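The imaging questions above reduce to simple arithmetic. This sketch takes the 3 mm CCD, 60x magnification, ~10 cm² surface and (0.1 mm)² custom FOV from the slide; the 1 second per tile is a purely hypothetical placeholder (the source gives no per-image time), and the function names are illustrative.

```python
def fov_side_um(sensor_side_um: int, magnification: int) -> int:
    """Side of the square field of view imaged onto the sensor."""
    return sensor_side_um // magnification

def tiles_to_cover(surface_cm2: int, fov_um: int) -> int:
    """Number of non-overlapping square tiles needed to cover the surface."""
    surface_um2 = surface_cm2 * 10**8        # 1 cm^2 = 10^8 um^2
    return -(-surface_um2 // fov_um**2)      # ceiling division

standard_fov = fov_side_um(3000, 60)               # 3 mm CCD at 60x
tiles_standard = tiles_to_cover(10, standard_fov)  # standard 60x objective
tiles_custom = tiles_to_cover(10, 100)             # (0.1 mm)^2 custom FOV

SECONDS_PER_TILE = 1.0  # hypothetical: move stage + expose, not from the paper
days_per_pass = tiles_standard * SECONDS_PER_TILE / 86400

print(standard_fov, tiles_standard, tiles_custom, round(days_per_pass, 1))
```

A 4-fold larger FOV area cuts the tile count 4-fold, which is one plausible answer to "why a custom objective?"; each base addition then needs its own full set of images.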
How do they adjust focus to correct stage drift over hours? Laser spot off-centered on lens and reflected off of surface has different x-y position depending on z-position of slide. Adjusting z so spot is in same x-y position in FOV fixes z so image is in focus
Do they need to align the spots in images of the same FOV taken hours apart? Automated spot alignment program
Cross-talk of different fluors – they need to adjust image intensities to correct for “red” fluorescence of “green” fluor, etc. to get best estimate of which dNTP was incorporated
If base extension or deblocking is not complete for all strands in cluster, different nucleotides will be incorporated at subsequent steps, and purity of fluorescence signal will erode (phasing problem)
Quality control measures used to decide when base calling is unreliable; e.g. purity filter: intensity of 1 base must be > 0.6 x sum of it plus next brightest base in 1st 12 positions
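The purity filter can be written down directly from the rule on the slide. A minimal sketch, assuming each cycle yields four cross-talk-corrected intensities per cluster and that every one of the first 12 positions must pass; the function names and the sample intensities are made up for illustration.

```python
def purity(intensities):
    """Brightest intensity divided by the sum of the two brightest."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0

def passes_purity_filter(cycles, threshold=0.6, first_n=12):
    """True if purity exceeds the threshold at each of the first `first_n` cycles.

    `cycles` is a list of (A, C, G, T) intensity tuples, one per cycle.
    Requiring all positions to pass is an assumption of this sketch.
    """
    return all(purity(c) > threshold for c in cycles[:first_n])

clean_cluster = [(100, 5, 3, 2)] * 35
# Same cluster but with two nearly equal signals at cycle 6 (e.g. overlapping spots)
mixed_cluster = clean_cluster[:5] + [(50, 45, 3, 2)] + clean_cluster[6:]

print(passes_purity_filter(clean_cluster), passes_purity_filter(mixed_cluster))
```

A mixed signal after position 12 does not fail this filter; later-cycle quality has to be judged by other measures.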
# errors determined by sequencing DNA with known seq.
[Plot: # errors per 35 bases]
Even with QC criteria to select good reads, get only ~35 b of reliable seq.!
Informatics – mapping short seq. reads to genome
2 programs used to look for matches betw. the ~35 b end seq. they obtain for a cluster and ref. seq.
• ELAND – finds all seq. in reference that match first 32 bases of cluster seq., allowing up to 2 mismatches but no gaps; then sees which of these best match cluster seq. at any bases beyond 32
• MAQ – more sophisticated in allowing gaps betw. ref. and cluster seq., so picks up more matches with small “indels”, but potentially more errors
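The ELAND matching rule (seed of 32 bases, up to 2 mismatches, no gaps, then rank by the remaining bases) can be illustrated with a brute-force scan. This is only a sketch of the rule as stated on the slide: the real program uses indexing to be fast, and the function names, the tie-breaking and the toy sequences here are assumptions.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def seed_hits(read, reference, seed_len=32, max_mismatches=2):
    """All ungapped reference positions where the read's seed matches
    with at most `max_mismatches` mismatches (naive linear scan)."""
    seed = read[:seed_len]
    hits = []
    for pos in range(len(reference) - len(seed) + 1):
        mm = hamming(seed, reference[pos:pos + len(seed)])
        if mm <= max_mismatches:
            hits.append((pos, mm))
    return hits

def best_hit(read, reference, seed_len=32, max_mismatches=2):
    """Among seed hits, pick the position with fewest mismatches over the
    whole read (ties go to the lowest position; an arbitrary choice)."""
    scored = []
    for pos, seed_mm in seed_hits(read, reference, seed_len, max_mismatches):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        scored.append((seed_mm + hamming(read[seed_len:], window[seed_len:]), pos))
    return min(scored)[1] if scored else None

# Toy example with an 8-base seed instead of 32
reference = "GGGGACGTTGCAATCCGGGG"
print(best_hit("ACGTAGCAATCC", reference, seed_len=8))  # read with 1 mismatch
```

Because no gaps are allowed, a read spanning a small indel accumulates mismatches and is lost, which is the case MAQ is said to handle better.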
If genome seq. were random, what length seq. would be unique (unlikely to occur more than once)?
Complication: some sequences >35 b occur many times
“Selfish” genes have replicated and re-inserted in different positions in the genome, e.g.
• short interspersed nuclear elements (SINEs, Alu): ~300 bp; ~10⁶ copies (~10% of genome)
• long interspersed nuclear elements (LINEs): ~6000 bp; ~10⁵ copies (~20% of genome)
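For the random-genome question, a given sequence of length L is expected to occur about G / 4^L times in G bases, so it is likely unique once 4^L exceeds G. A small sketch, assuming a 3 Gb genome with equal base frequencies; the `strands` parameter (count both strands or not) is an illustrative knob.

```python
def expected_occurrences(length: int, genome_size: int, strands: int = 1) -> float:
    """Expected number of chance occurrences of one specific sequence
    of the given length in a random genome with equal base frequencies."""
    return strands * genome_size / 4 ** length

def min_unique_length(genome_size: int, strands: int = 1) -> int:
    """Smallest length whose expected chance occurrence count is below 1."""
    length = 1
    while 4 ** length <= strands * genome_size:
        length += 1
    return length

GENOME = 3_000_000_000  # ~3 Gb human genome
print(min_unique_length(GENOME), min_unique_length(GENOME, strands=2))
print(expected_occurrences(35, GENOME))
```

So in a random genome ~16 to 17 bases would already suffice, and 35 b reads would be unique with a huge margin; the real difficulty is that the genome is not random.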
Two features help assignment of 35 b reads to correct position in genome
• they know the paired-end read should map to other DNA strand about 200 bp away in reference sequence
• each region of DNA is read many times, so they can just map consensus sequence for any segment
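The first feature can be sketched as a filter: a read that hits several repeat copies is kept only where its mate lands on the other strand roughly 200 bp away. The hit format `(position, strand)`, the function name and the ±50 bp tolerance are assumptions made for this illustration.

```python
def concordant_pairs(read_hits, mate_hits, expected_gap=200, tolerance=50):
    """Return (read_hit, mate_hit) pairs that sit on opposite strands and
    about `expected_gap` bp apart. Hits are (position, strand) tuples.
    The tolerance is a hypothetical value, not taken from the paper."""
    pairs = []
    for read_pos, read_strand in read_hits:
        for mate_pos, mate_strand in mate_hits:
            opposite = read_strand != mate_strand
            gap_ok = abs(abs(mate_pos - read_pos) - expected_gap) <= tolerance
            if opposite and gap_ok:
                pairs.append(((read_pos, read_strand), (mate_pos, mate_strand)))
    return pairs

# A read from a repeat maps to three places; its mate maps uniquely.
read_hits = [(1_000, "+"), (50_000, "+"), (90_000, "+")]
mate_hits = [(50_210, "-")]
print(concordant_pairs(read_hits, mate_hits))
```

Only the placement near 50,000 survives, so the unique mate anchors the ambiguous read.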
Tests of quality
• How uniformly does their data cover the ref. seq.?
• If some DNA segments don’t amplify well (? due to high GC content) they might be absent in their seq.
• If cluster seq. is random sample of ref. seq., Poisson dist. predicts how many times, n, a ref. seq. base should appear in cluster seq.:
  $p_n = e^{-m} m^n / n!$, where m = aver. # times a base is sequenced
  m = 130 Gb of cluster seq. / 3 Gb per genome ≈ 43
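To get a feel for what the Poisson model predicts at m ≈ 43, the probabilities can be evaluated numerically. A minimal sketch using only the formula above; it works in log space because m^n overflows a float for large n. The cutoff of 10 reads is an arbitrary illustration.

```python
import math

def poisson_pmf(n: int, m: float) -> float:
    """p_n = e^(-m) m^n / n!, computed via logs to avoid overflow."""
    return math.exp(-m + n * math.log(m) - math.lgamma(n + 1))

m = 130 / 3  # ~43-fold average coverage

p_zero = poisson_pmf(0, m)                                  # base never sequenced
p_at_most_10 = sum(poisson_pmf(n, m) for n in range(11))    # poorly covered base

print(f"P(n = 0)   = {p_zero:.1e}")
print(f"P(n <= 10) = {p_at_most_10:.1e}")
print(f"expected uncovered bases in 3 Gb = {3e9 * p_zero:.1e}")
```

Under purely random sampling essentially no base should be missed at this depth, so any sizeable gaps in the real data point to non-random effects such as poor amplification or unmappable repeats.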
Fig. 2
Take every 50th base of ref. seq.; how many times is an overlapping frag. found in a cluster seq. mapped to the ref. seq.?
Make a histogram of the # of such bases found n times in the cluster seq. data set
For interest, consider separately bases that don’t occur in repetitive elements like SINEs and LINEs (unique only)
The dist. is pretty close to Poisson (only slightly more samples in tails), so the method seems to sample pretty randomly
Does GC content affect how often a region is sampled?
Plot # times a particular base is sequenced (in the data set) as function of GC content of seq. in which it occurs
Only cluster sequences with most extreme GC contents were sequenced less than the average ~40 times
So what? If a seq. (with extreme GC content) is under-sampled, you might get only the maternal or paternal copy (allele) in the seq. data set and so miss finding a polymorphism (false negative)
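The false-negative risk can be put in numbers. The sketch below assumes each read at a heterozygous site comes from the maternal or paternal copy with equal probability, independently of the other reads (a simplification); the function name is illustrative.

```python
def prob_one_allele_only(depth: int) -> float:
    """Chance that all `depth` reads at a heterozygous site come from the
    same allele, so the site looks homozygous. With no reads the variant
    is missed for certain."""
    if depth < 1:
        return 1.0
    return 2 * 0.5 ** depth  # all maternal OR all paternal

for depth in (3, 5, 10, 40):
    print(depth, prob_one_allele_only(depth))
```

At ~40-fold coverage the risk is negligible, but a region sampled only a handful of times has a real chance of hiding one allele.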
Next evaluation – compare how often SNPs are identified by seq. vs. SNP hybridization assays (“GT”, genotyping)
Note this company makes SNP hybrid. assays, so it is working hard on technology that may replace its current platform!
Table, using ELAND program:
• std version of hybrid. assay (GT) w/ 0.5M SNPs
• latest version of hybrid. assay w/ >3M SNPs
• <1% discordant calls
• most often the array assay (GT) finds a SNP missed by seq.
Same table, using MAQ program: seq. does slightly better, but in general GT and seq. have similar fail-to-detect rates
• Their new, favorite set of SNPs with least ambiguity
• Most GT failures-to-detect are due to person carrying so variant a seq. that it fails to hybridize to anything on the chip
• Most seq. failures-to-detect are due to low sampling rate of one allele
But seq. picked up ~1M new SNPs in this person! Why?
• Std SNP panels selected for SNPs that occur fairly frequently in population
• This individual is of African ancestry - ? underrepresented in std SNP panel
• Maybe most of us carry lots of “private” SNPs that are very rare in the population
How can you get information about structural changes larger than 35 bases from 35-base-long reads? Use info from paired-end reads!
Idea:
• label ends of genomic DNA segments w/ biotin nucleotides (B) using DNA pol
• circularize DNA segments (ligate diluted sample)
• re-shear DNA; purify biotinylated DNA
• make clusters as before and read seq. of ends of junction frags.
Now sequence at opposite ends of small frags. comes from genomic DNA regions separated by length of circularized fragments; also, orientation w.r.t. each other is flipped
If you can map both end sequences to genome, you can find:
• deletions (end seq. further apart in ref. seq. than circularized fragment length)
• insertions (end seq. closer together in ref. seq. than circ. frag. length)
• inversions (orientation reversed)
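The three rules above amount to a small decision procedure on each mapped read pair. A sketch under stated assumptions: the caller already knows the expected fragment length and whether the pair's relative orientation is as expected; the function name and the tolerance value are hypothetical, and real callers would require several supporting pairs before reporting a variant.

```python
def classify_pair(ref_distance: int, expected_length: int,
                  orientation_as_expected: bool, tolerance: int = 100) -> str:
    """Classify one read pair by comparing its spacing in the reference
    with the known fragment length. Tolerance is an illustrative value."""
    if not orientation_as_expected:
        return "inversion"
    if ref_distance > expected_length + tolerance:
        return "deletion"    # sample lacks sequence present in reference
    if ref_distance < expected_length - tolerance:
        return "insertion"   # sample has sequence absent from reference
    return "concordant"

# Ends ~2 kb apart in the reference but only ~0.5 kb apart in the sample
print(classify_pair(2000, 500, True))
```

The size of the event follows from the difference: a pair 2 kb apart in the reference from a 0.5 kb fragment implies roughly 1.5 kb missing in the sample.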
They identified 1000s of >50 bp deletions, many of which were known selfish DNA elements present in reference seq. but not in the seq. of the person whose DNA they analyzed
90% of these are SINEs present in reference but not in this individual
60% are LINEs
They also found 2345 insertions
How many are in coding sequences? How many are homozygous?
Map of a region containing an inversion flanked by 2 small deletions. What do symbols represent?
• Note ~2 kb region of ref. seq. with no read pairs (green)
• “short insert” pairs flanking this region (orange) map to sites ~2 kb apart in ref. but ~0.5 kb in this sample (i.e. span deletion)
Last level of complexity – bio-medical interpretation of seq. information Example - variability greater in certain areas of genome e.g. parts of X chromosome - why?
Potentially medically relevant findings – your DNA is likely similar!
• 26,140 SNPs in protein-coding regions
• 5,361 encode non-conservative amino acid changes
• 153 encode premature terminations
“many of which are expected to affect protein function”
(excerpt of Table 9)
Summary - Impressive accomplishment!
Innovations in many fields – all needed for useful product
• molecular biology: bridging PCR to get ~10⁴ copies of individual fragments arrayed on surface; nicking tricks to convert PCR products to ss for sequencing and to get the complementary ss for sequencing; new dNTPs with reversibly blocked 3’ ends and chemically removable fluors, to seq. 1 base at a time; engineered DNA pols that use these new dNTPs
• photonics, data acquisition, informatics …
Lots of detail -> fuller explanation than Ion Torrent
Major challenges remaining
• quantifying errors
• methods for resequencing variants for confirmation
• identifying structural variants larger than the pieces of DNA sequenced – e.g. deletions, insertions, duplications, inversions
• speeding up (parallelization of) data acquisition
• interpretation – clinical significance of variants; implications for human biology
Some key ideas you should take away from today:
• How they get array of spots, each with many copies of a DNA to sequence
• How they get sequence, 1 base at a time, using reversible dye terminator chem.
• How they get information about structural variants larger than the 35 bp runs (paired-end reads)
• How over-sequencing (fold-coverage) helps
• How they evaluate seq. accuracy
• What kind of mutational load we are all likely to carry in our DNA