Hierarchical Sequencing. Strategy: obtain a large collection of BAC clones, map them onto the genome (physical mapping), select a minimum tiling path, sequence each clone in the path with shotgun, assemble, and put everything together.
Hierarchical Sequencing Strategy
• Obtain a large collection of BAC clones
• Map them onto the genome (Physical Mapping)
• Select a minimum tiling path
• Sequence each clone in the path with shotgun
• Assemble
• Put everything together
[Figure labels: genome, map, a BAC clone]
Methods of physical mapping
Goal:
• Make a map of the locations of each clone relative to one another
• Use the map to select a minimal set of clones to sequence
Methods:
• Hybridization
• Digestion
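Selecting a minimal set of clones can be sketched as a greedy interval cover. The sketch assumes the map already gives each clone approximate start and end coordinates, which a real physical map only provides roughly; the function name `minimum_tiling_path` and the `(name, start, end)` tuples are illustrative, not from the slides.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Greedy cover of [region_start, region_end] by clones.

    clones: list of (name, start, end) tuples (hypothetical map coordinates).
    Returns the chosen clones in left-to-right order.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones that start inside the covered part,
        # keep the one that reaches furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in the map at position {covered_to}")
        path.append(best)
        covered_to = best[2]
    return path


clones = [("A", 0, 100), ("B", 50, 180), ("C", 90, 200),
          ("D", 150, 300), ("E", 190, 310)]
print([name for name, _, _ in minimum_tiling_path(clones, 0, 300)])
# prints ['A', 'C', 'E']
```

The greedy choice is the standard one for covering an interval with the fewest sub-intervals; it says nothing about how much overlap between neighbouring clones is desirable in practice.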
1. Hybridization
Short words, the probes, attach to complementary words
• Construct many probes p1, …, pn
• Treat each BAC with all probes
• Record which ones attach to it
• If the same words attach to BACs X and Y, then X and Y overlap
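The overlap rule for hybridization can be sketched as a set intersection. The dictionary `hits` (clone name to the set of probes that attached) and the `min_shared` threshold are assumptions made for illustration; the slides do not say how many shared probes are required.

```python
from itertools import combinations


def candidate_overlaps(hits, min_shared=1):
    """Return (clone_a, clone_b, shared_probes) for clones sharing probes.

    hits: dict mapping a clone name to the set of probes that attached to it.
    min_shared: hypothetical threshold on the number of shared probes.
    """
    pairs = []
    for a, b in combinations(sorted(hits), 2):
        shared = hits[a] & hits[b]
        if len(shared) >= min_shared:
            pairs.append((a, b, sorted(shared)))
    return pairs


hits = {"X": {"p1", "p2"}, "Y": {"p2", "p3"}, "Z": {"p4"}}
print(candidate_overlaps(hits))
# prints [('X', 'Y', ['p2'])]
```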
2. Digestion
Restriction enzymes cut DNA where specific words appear
• Cut each clone separately with an enzyme
• Run fragments on a gel and measure length
• If clones Ca and Cb both have fragments of lengths { li, lj, lk }, they overlap
Double digestion: cut with enzyme A, with enzyme B, then with enzymes A + B
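Comparing digestion fingerprints amounts to counting fragment lengths that two clones have in common. Because gel measurements are inexact, this sketch matches lengths within a relative tolerance; the 2% default and the function name `shared_fragments` are assumptions, not values from the slides.

```python
def shared_fragments(lengths_a, lengths_b, tolerance=0.02):
    """Count fragment lengths common to two clones.

    Two lengths match if they differ by at most `tolerance`
    (a hypothetical relative gel measurement error) of the larger one.
    Each fragment is matched at most once.
    """
    a, b = sorted(lengths_a), sorted(lengths_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


print(shared_fragments([1200, 3400, 5100, 800], [3410, 5090, 2500, 9000]))
# prints 2
```

How many shared fragments are enough to call an overlap depends on the enzyme and the clone size, so any cutoff would have to be chosen for the data at hand.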
The Walking Method
• Build a very redundant library of BACs with sequenced clone-ends (cheap to build)
• Sequence some “seed” clones
• “Walk” from seeds using clone-ends to pick library clones that extend left & right
Some Terminology
insert: a fragment that was incorporated in a circular genome, and can be copied (cloned)
vector: the circular genome (host) that incorporated the fragment
BAC: Bacterial Artificial Chromosome, a type of insert–vector combination, typically of length 100-200 kb
read: a 500-900 letter long word that comes out of a sequencing machine
coverage: the average number of reads (or inserts) that cover a position in the target DNA piece
shotgun sequencing: the process of obtaining many reads from random locations in DNA, to detect overlaps and assemble
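The coverage definition reduces to simple arithmetic: total sequenced letters divided by the length of the target. The numbers in the example (30 million reads of 800 letters over a 3-billion-letter target) are illustrative assumptions, and `expected_coverage` is a hypothetical helper name.

```python
def expected_coverage(num_reads, read_length, target_length):
    """Average number of reads covering a position of the target."""
    return num_reads * read_length / target_length


# Illustrative numbers only: 30 million reads of 800 letters
# over a target of 3 billion letters.
print(expected_coverage(30_000_000, 800, 3_000_000_000))
# prints 8.0
```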
Whole Genome Shotgun Sequencing
[Figure: the genome is cut many times at random; inserts are held in plasmids (2 – 10 Kbp) and cosmids (40 Kbp); forward-reverse paired reads of ~800 bp each are taken a known distance apart]
Fragment Assembly
Given N reads, where N ~ 30 million, we need to use a linear-time algorithm
Steps to Assemble a Genome
Some Terminology
read: a 500-900 letter long word that comes out of the sequencer
mate pair: a pair of reads from two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
consensus sequence: sequence derived from the multiple sequence alignment of reads in a contig
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence ..ACGATTACAATAGGTT..
1. Find Overlapping Reads
[Figure: each read is split into its words (k-mers). The example reads are aaactgcagtacggatct, gggcccaaactgcagtac and gtacggatctactacaca. The words are first listed as (read, pos., word, orient.) and then sorted by word as (word, read, orient., pos.), so that identical words from different reads, such as actgcagta, ctgcagtac, gtacggatc and tacggatct, end up next to each other.]
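The sorted word table can be sketched as follows, using the three example reads and 9-letter words from the slide. This is a minimal sketch that ignores orientation (forward strand only), which is a simplification of the (word, read, orient., pos.) table; `kmer_table` is a hypothetical name.

```python
def kmer_table(reads, k):
    """List every (word, read_index, position), sorted by word.

    Orientation is ignored here (forward strand only), which is a
    simplification of the table on the slide.
    """
    table = []
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            table.append((seq[pos:pos + k], read_id, pos))
    table.sort()
    return table


reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
table = kmer_table(reads, 9)
print(table[:2])
# prints [('aaactgcag', 0, 0), ('aaactgcag', 1, 6)]
```

After sorting, equal words from different reads sit in adjacent rows, so shared words can be found in one pass over the table.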
1. Find Overlapping Reads
• Find pairs of reads sharing a k-mer, k ~ 24
• Extend to full alignment; throw away if not >98% similar
[Figure: example pairwise alignments, including TAGATTACACAGATTAC aligned to an identical TAGATTACACAGATTAC]
• Caveat: repeats
  • A k-mer that occurs N times causes O(N^2) read/read comparisons
  • ALU k-mers could cause up to 1,000,000^2 comparisons
• Solution:
  • Discard all k-mers that occur “too often”
  • Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available
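The k-mer filter can be sketched as below: reads are paired only through k-mers that occur at most `max_occurrences` times, since a k-mer seen N times would otherwise contribute N(N-1)/2 pairs. The function name and the cutoff parameter are illustrative assumptions; the alignment and >98% similarity check that would follow are not shown.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k, max_occurrences):
    """Pairs of read indices that share a k-mer that is not too frequent.

    max_occurrences is a hypothetical cutoff: k-mers seen more often
    than this (likely repeats) are discarded before pairing.
    """
    readers = defaultdict(set)   # k-mer -> indices of reads containing it
    counts = defaultdict(int)    # k-mer -> total number of occurrences
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            word = seq[pos:pos + k]
            readers[word].add(read_id)
            counts[word] += 1
    pairs = set()
    for word, ids in readers.items():
        if counts[word] > max_occurrences:
            continue  # occurs "too often"
        pairs.update(combinations(sorted(ids), 2))
    return pairs


reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
print(sorted(candidate_pairs(reads, 9, max_occurrences=2)))
# prints [(0, 1), (0, 2)]
```

Lowering the cutoff speeds things up but can lose true overlaps: with `max_occurrences=1` on these three reads, every shared k-mer is discarded and no pairs remain.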
1. Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
1. Find Overlapping Reads
• Correct errors using multiple alignment
• Isolated errors are fixed from the other reads in the alignment (insert A, replace T with C)
• Correlated errors are probably caused by repeats: disentangle overlaps
• In practice, error correction removes up to 98% of the errors
[Figure: a multiple alignment of reads TAGATTACACAGATTACTGA in which some reads show TAG-TTACACAGATTACTGA or TAGATTACACAGATTATTGA; the reads that share both differences, TAG-TTACACAGATTATTGA, are separated into their own group]
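Correcting isolated errors from a multiple alignment can be sketched as a per-column majority vote. This is a simplified stand-in, not the procedure behind the slide: it assumes the reads are already aligned to equal length with `-` for gaps, and it does not handle correlated errors, which need the overlaps disentangled first. `column_consensus` is a hypothetical name.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Most common symbol in each column of equal-length aligned reads.

    A gap is written as '-'. Ties go to the symbol seen first.
    """
    consensus = []
    for column in zip(*aligned_reads):
        symbol, _ = Counter(column).most_common(1)[0]
        consensus.append(symbol)
    return "".join(consensus)


alignment = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTATTGA",  # isolated substitution: T instead of C
    "TAG-TTACACAGATTACTGA",  # isolated deletion: missing A
]
print(column_consensus(alignment))
# prints TAGATTACACAGATTACTGA
```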
2. Merge Reads into Contigs
• Overlap graph:
  • Nodes: reads r1, …, rn
  • Edges: overlaps (ri, rj, shift, orientation, score)
[Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat]
Note: of course, we don’t know the “color” of these nodes
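One possible in-memory form of the overlap graph is an adjacency list keyed by read index. The field names below follow the edge tuple (ri, rj, shift, orientation, score), but the types, the `Overlap` class and `build_overlap_graph` are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Overlap:
    read_i: int             # index of read ri
    read_j: int             # index of read rj
    shift: int              # offset of rj relative to ri
    same_orientation: bool  # False if rj is reverse-complemented
    score: float            # alignment quality, e.g. fraction identical


def build_overlap_graph(overlaps):
    """Adjacency list: read index -> overlaps touching that read."""
    graph = defaultdict(list)
    for ov in overlaps:
        graph[ov.read_i].append(ov)
        graph[ov.read_j].append(ov)
    return graph


overlaps = [Overlap(0, 1, 6, True, 1.0), Overlap(0, 2, 8, True, 1.0)]
graph = build_overlap_graph(overlaps)
print(len(graph[0]), len(graph[1]), len(graph[2]))
# prints 2 1 1
```

A read inside a repeat would show up here as a node with edges to reads from several genomic regions, which is the situation the blue and red nodes illustrate.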