From Bauhaus to Bio-House NATURE | VOL 422 | 24 APRIL 2003 |www.nature.com/nature
Dude, Where is My Genome? Past Present & Future Of Genomics Technologies
Bud Mishra, Professor of Computer Science, Mathematics and Cell Biology | Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai School of Medicine
Tools of the trade Where we collect three important tools from biotechnology: scissors, glues and copiers…
Scissors
• Type II Restriction Enzyme
  • Biochemicals capable of cutting double-stranded DNA by breaking two -O-P-O- bridges, one on each backbone
• Restriction Site:
  • Corresponds to specific short sequences, e.g. EcoRI: GAATTC
• Naturally occurring protein in bacteria… defends the bacterium from invading viral DNA… the bacterium produces another enzyme that methylates the restriction sites of its own DNA
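The "scissors" idea can be sketched in a few lines of Python. This is a minimal illustration, not a model of the enzyme: it treats DNA as a single top-strand string and cuts one base into each GAATTC site (EcoRI cuts between G and AATTC). The function name `ecori_digest` and its parameters are hypothetical, and methylation protection is ignored.

```python
def ecori_digest(seq: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut `seq` at every occurrence of `site`, `cut_offset` bases into the site.

    Simplification (assumption): only the top strand is modelled, so the
    staggered cut on the opposite strand is not represented.
    """
    seq = seq.upper()
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset          # EcoRI: G^AATTC
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)   # look for the next site
    fragments.append(seq[start:])       # tail after the last cut
    return fragments


if __name__ == "__main__":
    print(ecori_digest("TTGAATTCAAGAATTCGG"))
```

The ordered list of fragment lengths produced this way is the kind of data a restriction map records.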
Glue
• DNA Ligase
  • Cellular Enzyme: Joins two strands of DNA molecules by repairing phosphodiester bonds
  • T4 DNA Ligase (E. coli infected with bacteriophage T4)
• Hybridization
  • Hydrogen bonding between two complementary single-stranded DNA fragments, or an RNA fragment and a complementary single-stranded DNA fragment… results in a double-stranded DNA or a DNA-RNA fragment
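Hybridization depends on base complementarity, which is easy to check in code. The sketch below assumes perfect, full-length, antiparallel pairing (real hybridization tolerates mismatches and depends on temperature and salt, none of which is modelled). The names `reverse_complement` and `can_hybridize` are illustrative.

```python
# Map each base to its Watson-Crick partner; RNA's U pairs with A.
_COMPLEMENT = str.maketrans("ACGTU", "TGCAA")


def reverse_complement(strand: str) -> str:
    """Return the DNA strand that would pair with `strand` (DNA or RNA)."""
    return strand.upper().translate(_COMPLEMENT)[::-1]


def can_hybridize(dna: str, other: str) -> bool:
    """True if `other` (DNA or RNA) is the exact reverse complement of `dna`."""
    return reverse_complement(other) == dna.upper()


if __name__ == "__main__":
    print(can_hybridize("ATGC", "GCAT"))  # DNA-DNA duplex
    print(can_hybridize("ATGC", "GCAU"))  # DNA-RNA duplex
```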
Copier
• DNA Amplification:
  • Main Ingredients: Insert (the DNA segment to be amplified), Vector (a cloning vector that combines with an insert to create a replicon), Host Organism (usually bacteria).
Copier
• PCR (Polymerase Chain Reaction):
  • Main Ingredients: Primers, Catalysts, Templates, and the dNTPs.
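As a rough illustration of what PCR copies, the sketch below finds the stretch of a template bounded by a forward and a reverse primer and computes the idealized copy count, assuming every cycle doubles the product. Exact primer matching and perfect efficiency are simplifying assumptions, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    return strand.upper().translate(_COMPLEMENT)[::-1]


def amplicon(template: str, fwd_primer: str, rev_primer: str):
    """Return the top-strand region amplified by the primer pair, or None.

    The forward primer matches the top strand directly; the reverse primer
    anneals to the top strand, so its reverse complement appears in `template`.
    """
    template = template.upper()
    start = template.find(fwd_primer.upper())
    if start == -1:
        return None
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


def ideal_copies(cycles: int, starting_molecules: int = 1) -> int:
    """Copy count if every cycle exactly doubles the product (an idealization)."""
    return starting_molecules * 2 ** cycles


if __name__ == "__main__":
    print(amplicon("CCATGAAACCCGGGTTTAGC", "ATGAAA", "AAACCC"))
    print(ideal_copies(30))
```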
Sir Ernest Rutherford “For Mike’s sake, Soddy, don’t call it transmutation. They’ll have our heads off as alchemists.” Rutherford, winner of 1908 Nobel prize for chemistry for cataloging alpha and beta particles… “All science is either physics or stamp collecting.”
The Middle Way
• Two Extremes:
  • Indexing: For each character ‘b’ in the genome, make a list of each position where it occurs.
  • Shotgunning: For each long sentence in the genome, select it with low probability (o(lg n/n)), and then read it reasonably accurately.
• The Middle Way:
  • Indexed-Shotgun: For each short word in the genome, select it with high probability (o(1)), and then measure its position and read it reasonably accurately.
• Where is the middle???
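The "indexing" extreme, and its generalization from characters to short words, can be sketched as a position index. This is only an illustration of the data structure implied by the slide; the function name `index_words` and the word length `k` are assumptions, and no sampling or read error is modelled.

```python
from collections import defaultdict


def index_words(genome: str, k: int = 1) -> dict[str, list[int]]:
    """Map every length-k word in `genome` to the sorted list of its start positions.

    k = 1 is the character index from the slide; a larger k indexes short words.
    """
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return dict(index)


if __name__ == "__main__":
    print(index_words("GATTACA", 1)["A"])
    print(index_words("GATTACA", 2))
```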
Outline:
• Physical Mapping & Sequencing:
  • Map: assign physical locations to important markers (e.g., restriction sites or hybridization probes).
  • Sequence: align short sequence reads to the markers (map-based sequence assembly) or align long sequence reads to each other (shotgun assembly)
• Array Mapping
• Optical Mapping
• Sequencing
Measuring distances:
• A one-dimensional “Buffon’s needle problem.”
• Take two points on a line, and drop unit-length needles of some color.
• The probability that the two points will have different colors
  • increases monotonically with the distance between the two points as the distance goes from 0 to 1;
  • attains a fixed value for all distances longer than 1.
• One can generalize by considering
  • more than two points… P points;
  • dropping a small set of bichromatic needles…
[Figure: six needles dropped over points p; example estimate: Distance ≈ 3/6 = 0.5]
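A small simulation makes the needle argument concrete. The sketch below uses a simplified variant chosen for illustration, not the slide's bichromatic scheme: it drops unit-length needles that are conditioned to cover point `a` and counts how often they miss point `b`. Under that assumption the miss fraction estimates min(|a - b|, 1). The function name, needle count, and seed are arbitrary.

```python
import random


def estimate_distance(a: float, b: float, n_needles: int = 20000, seed: int = 0) -> float:
    """Estimate min(|a - b|, 1) from unit-length needles that cover point `a`.

    Each needle occupies [left, left + 1] with `left` uniform on [a - 1, a],
    so it always covers `a`. It misses `b` with probability min(|a - b|, 1).
    """
    rng = random.Random(seed)
    separated = 0
    for _ in range(n_needles):
        left = rng.uniform(a - 1.0, a)
        if not (left <= b <= left + 1.0):
            separated += 1          # needle covers a but not b
    return separated / n_needles


if __name__ == "__main__":
    for d in (0.0, 0.25, 0.5, 1.0, 2.0):
        print(d, estimate_distance(2.0, 2.0 + d))
```

The estimate saturates at 1 once the points are more than one needle length apart, which is why only local distances carry information.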
The Experiments:
• Probes are “points”
• BACs are “needles”
• Hybridization on an array simulates “dropping the bichromatic needles”
[Figure: a high-coverage BAC library from which M subsamples of cX coverage each are drawn]
A Mathematical Problem
• A set of P points $\{x_1, x_2, \ldots, x_P\} \subset [0, G]$, drawn i.i.d. with pdf $f(x) = 1/G$ for all $x \in [0, G]$.
• Distance $d_{i,j} = d(|x_i - x_j|)$, “measured” between two arbitrary points $x_i$ and $x_j$, with $|x_i - x_j| = x$.
• Given $O(P^2)$ distances, infer positions.
Matrix-to-Line
• Given a $P \times P$ positive symmetric real-valued matrix $D$ of “measured distances”.
• The entry $d_{i,j} \sim f(d \mid x)$.
• Choose an embedding of the points, $\{x'_1, x'_2, \ldots, x'_P\} \subset [0, G]$, which maximizes a likelihood function
• $\prod_{1 \le i, j \le P} f(|x'_i - x'_j| \mid d_{i,j})$
A Physical Model
• Mass-less balls connected with springs of different stiffness…
[Figure: four balls P1, P2, P3, P4 joined pairwise by springs labelled d1,2, d1,3, d1,4, d2,3, d2,4, d3,4]
Algorithm Join
• Consider measured distances of length $L' \le \theta L$; examine these distances in increasing order.
• $\theta \in (0,1)$ to be determined by the Chernoff bounds.
• Initially, every probe is a singleton contig.
• Two operations, Join and Adjust, either combine smaller contigs or improve an existing contig.
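The bookkeeping behind Join can be sketched with a union-find structure. This is an assumption-laden simplification: it only tracks which probes end up in the same contig when short distances are processed in increasing order, and it does not compute probe positions or orientations, nor does it implement Adjust. The function name `join_contigs` and the `(distance, i, j)` tuple format are hypothetical; `theta` and `L` stand for the cutoff parameters on the slide.

```python
def join_contigs(n_probes: int, distances, L: float, theta: float) -> list[set[int]]:
    """Group probes into contigs using only measured distances <= theta * L.

    `distances` is an iterable of (d, i, j) tuples for probe pairs i, j.
    Returns the contigs as sets of probe indices, ordered by smallest member.
    """
    parent = list(range(n_probes))      # every probe starts as a singleton contig

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for d, i, j in sorted(distances):   # increasing order of distance
        if d > theta * L:
            break                       # longer distances are not trusted
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj             # join two contigs

    contigs: dict[int, set[int]] = {}
    for p in range(n_probes):
        contigs.setdefault(find(p), set()).add(p)
    return sorted(contigs.values(), key=min)


if __name__ == "__main__":
    measured = [(0.2, 0, 1), (0.3, 1, 2), (0.9, 2, 3), (0.1, 3, 4)]
    print(join_contigs(5, measured, L=1.0, theta=0.5))
```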
Algorithm Adjust
• Join and Adjust locally minimize the “log-likelihood cost function”
• This yields a local minimum of a weighted sum-of-squares error function
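One way to see why a likelihood leads to a weighted sum of squares is to assume a Gaussian error model. The slides do not state the form of f, so the Gaussian shape and the variances $\sigma_{i,j}^2$ below are assumptions made only for this worked step. Taking the negative logarithm of the likelihood from the Matrix-to-Line slide then gives:

```latex
% Assumed model: f(x \mid d_{i,j}) is Gaussian in x with mean d_{i,j}
% and variance \sigma_{i,j}^2 (not stated in the slides).
\begin{align*}
-\log \prod_{1 \le i, j \le P} f\bigl(|x'_i - x'_j| \,\big|\, d_{i,j}\bigr)
  &= \sum_{1 \le i, j \le P}
     \frac{\bigl(|x'_i - x'_j| - d_{i,j}\bigr)^2}{2\sigma_{i,j}^2}
     + \text{const} \\
  &= \sum_{1 \le i, j \le P} w_{i,j}\,\bigl(|x'_i - x'_j| - d_{i,j}\bigr)^2
     + \text{const},
  \qquad w_{i,j} = \frac{1}{2\sigma_{i,j}^2}.
\end{align*}
```

Under this assumption each weight $w_{i,j}$ plays the role of a spring stiffness in the physical model: a precisely measured distance corresponds to a stiff spring.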
Local Distances
[Figure: local distances among Probe 111, Probe 79, Probe 101, Probe 85, and Probe 95]
Optical Approaches are Inherently Noisy!
• Since many biological macromolecules are smaller than the Rayleigh limit, the optical approaches involve attaching single fluorescent probes to specific macromolecules.
• Controlling Noise:
  • Magnitude of Stokes shift
  • Steric hindrance
  • Absorption cross-section
  • Point spread function (PSF)
  • Image Processing
Optical Mapping
1. Capture and immobilize whole genomes as massive collections of single DNA molecules
• Cells gently lysed to extract genomic DNA
• DNA captured in parallel arrays of long single DNA molecules using a microfluidic device
• Genomic DNA, captured as single DNA molecules produced by random breakage of intact chromosomes
Optical Mapping
2. Interrogate with restriction endonucleases
3. Maintain order of restriction fragments in each molecule
• Digestion reveals 6-nucleotide cleavage sites as “gaps”
Optical Mapping 4. Determine size of fragments
Optical Mapping 5. GENTIG Robust Bayesian Map Assembler to make whole-genome restriction map
Computational Analysis
[Figure: single DNA molecule on Optical Chip after digestion and staining]
• Image analysis software measures size and order of restriction fragments
• Overlapping single molecule maps are aligned to produce a map assembly covering an entire chromosome
Map Assembly Overlapping single molecule maps are aligned to produce a map assembly covering an entire chromosome
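The alignment step can be illustrated with a toy overlap test between two ordered fragment-size lists. This is not the GENTIG algorithm: it assumes both maps have the same orientation, no missing or false cuts, and a simple relative sizing tolerance, all of which are assumptions for the sake of a short sketch. The names `best_overlap`, `merge_maps`, and `rel_tol` are hypothetical.

```python
def best_overlap(a: list[float], b: list[float], rel_tol: float = 0.1, min_frags: int = 2) -> int:
    """Largest k such that the last k fragments of `a` match the first k of `b`.

    Two fragment sizes match if they differ by at most rel_tol * the larger size.
    Returns 0 if no overlap of at least `min_frags` fragments exists.
    """
    for k in range(min(len(a), len(b)), max(min_frags, 1) - 1, -1):
        pairs = zip(a[-k:], b[:k])
        if all(abs(x - y) <= rel_tol * max(x, y) for x, y in pairs):
            return k
    return 0


def merge_maps(a: list[float], b: list[float], k: int) -> list[float]:
    """Merge two maps that overlap by k fragments, averaging the shared sizes."""
    if k <= 0:
        raise ValueError("maps do not overlap")
    shared = [(x + y) / 2 for x, y in zip(a[-k:], b[:k])]
    return a[:-k] + shared + b[k:]


if __name__ == "__main__":
    map_a = [100, 52, 79, 30]   # fragment sizes, e.g. in kb
    map_b = [81, 31, 120]
    k = best_overlap(map_a, map_b)
    print(k, merge_maps(map_a, map_b, k))
```

Averaging the shared fragments is one simple way to combine repeated noisy measurements; a Bayesian assembler would weight them by their estimated error instead.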
Complexity Issues Various combinations of error sources lead to NP-hard Problems
SMRM (Single Molecule Restriction Map)
[Figure: molecule Dj with cut sites s1j, s2j, s3j, …, sM,j, and DRj with cut sites sR1j, sR2j, sR3j, …, sRM,j]
Probabilistic Analysis Where we design the experiments to generate easy instances of a difficult problem…
Intuition
[Figure: labels “+ - - -”; remaining content not recoverable]