
Today’s Topics


Presentation Transcript


  1. Today’s Topics • Computer Science • Enabled by Computing: Decoding the Human Genome • Upcoming • Review for Final Exam

  2. Enabled by Computers • Things we now take for granted are possible only because of computing • Several examples (most mentioned before) • Modern camera zoom lens • Certain space missions: e.g., “slingshot” paths • Medical imaging • CAT scans (Nobel Prize!) • Other imaging procedures: PET, MRI, … • Designing and manufacturing a modern computer • Communications (error checking, compression, …) • Decoding the Human Genome

  3. The Human Genome • Each cell contains • Nucleus • The human nucleus contains • 24 distinct chromosomes (22 autosomes plus X and Y) • Chromosomes (composed of DNA) collectively include • 20-25 thousand genes • E.g., Chromosome 5 includes 5923 genes • Chromosomes are collectively composed of • 3.5 Gbp (3,500,000,000 base pairs) • Good Diagram of DNA • http://www.accessexcellence.org/RC/VL/GG/dna2.html

  4. The Human Genome • Makeup: The Double Helix - DNA • 3.5 Gbp • (how big a number can an int hold?) • Bases denoted by letters A, C, G, T • Adenine, Cytosine, Guanine, Thymine • Each strand of DNA (in each of our cells) is approx. 6 feet long! • (packed into a volume approx. 0.0004 inches across) • Letters printed as a string, 1 mm apart, would be almost 1900 miles long • Good Diagram of DNA • http://www.accessexcellence.org/RC/VL/GG/dna2.html
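The slide's question about int can be checked directly. The sketch below is illustrative only: the class name GenomeCount and the helper fitsInInt are made up for this example, and the 3.5 billion figure is the one quoted on the slide. A Java int is a 32-bit signed value, so it tops out at 2,147,483,647, which is less than the number of base pairs; a long is needed to count them.

```java
class GenomeCount {
    // Figure quoted on the slide: about 3.5 billion base pairs.
    static final long BASE_PAIRS = 3_500_000_000L;

    // True when the value fits in a 32-bit signed int.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println("Integer.MAX_VALUE = " + Integer.MAX_VALUE);
        System.out.println("Base pairs        = " + BASE_PAIRS);
        System.out.println("Fits in an int?     " + fitsInInt(BASE_PAIRS));
        System.out.println("Long.MAX_VALUE    = " + Long.MAX_VALUE);
    }
}
```

One practical consequence: Java arrays and Strings are indexed by int, so a single String could not hold the whole sequence even if memory allowed it.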

  5. How to Read (Sequence) DNA? • Look at the following strings • Assume we didn’t know the alphabet • Can we reconstruct the alphabet from these fragments? A AB ABCDE ABCDEF BCDEF CDEFGH FGHIJ GHIJK GHIJKL IJKLMN KLMNO LMNOP MNOPQR OPQRST PQRST QRSTU STUVWX UVWXY UVWXYZ VWXYZ YZ Z • If we assume each letter is used only once, can match on a single character: ABCDEF + FGHIJ yields ABCDEFGHIJ • If uncertain about the nature of the sequence, may require a longer overlap: IJKLMN + MNOPQR yields IJKLMNOPQR • Can reconstruct the complete alphabet from fragments

  6. Reconstruction from DNA fragments • Problem is more difficult • Only 4 characters: A C G T • All kinds of repetition in the sequence • Need larger overlap – how large? • Depends on kind of repetition we find • Look at example with a sequence much longer than alphabet • Fragments shown come from chopping up three identical copies of the sequence • Breaks at “random” points

  7. Reconstruction from DNA sequence • Look at the following fragments (from 3 originals): AAGATGGTTCATTCT ACGGGCGGTGTTGGAGCAGA AGAGCT AGGTATATTGAGGAAG ATTGT CAAGTAAAAGGA CATTGTCAAGTAAAAG CCAACTAGTCAGCACTAC CCAACTAGTCAGCACTACAT CGGGCGGTGTTGGAGC CTGCAATTTCTG GAAGGTATAT GACTTGGGTA GCTCTGCAATTTCTG GCTGGGG GCTGGGGA GCTGGGGACGGGCGGTGT TAGTCAGCACTA TCTGCAATTTCTGCCAAC TGAGGAAGAAGA TGAGGAAGAAGATGGTTCA TGGAGCAGAGC TGGTTCATTCTGACTTGGGTA TGTCAAGTAAAAGGAAGGTATAT TTCTGACTTGGGTA • Identify overlaps to reconstruct longer pieces, e.g.: TCTGCAATTTCTGCCAACTAGTCAGCACTACAT and AGGTATATTGAGGAAGAAGATGGTTCA • Eventually can get the original sequence: GCTGGGGACGGGCGGTGTTGGAGCAGAGCTCTGCAATTTCTGCCAACTAGTCAGCACTACATTGTCAAGTAAAAGGAAGGTATATTGAGGAAGAAGATGGTTCATTCTGACTTGGGTA
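One way to picture "identify overlaps to reconstruct" is a greedy merge: repeatedly join the two fragments with the longest suffix-to-prefix overlap until nothing overlaps any more. The sketch below is an illustration under that assumption, not the method the slides or any real assembler prescribe; the class GreedyAssembler and its methods are hypothetical names. It is run on the alphabet fragments from the earlier slide, where every letter is unique and a greedy merge cannot go wrong. On real DNA, repeats can mislead a greedy merge, which is why a minimum overlap parameter is included.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GreedyAssembler {
    // Length of the longest suffix of a that is also a prefix of b.
    static int overlap(String a, String b) {
        int max = Math.min(a.length(), b.length());
        for (int len = max; len > 0; len--) {
            if (a.regionMatches(a.length() - len, b, 0, len)) {
                return len;
            }
        }
        return 0;
    }

    // Drop duplicates and fragments wholly contained in another fragment.
    static List<String> dropContained(List<String> frags) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < frags.size(); i++) {
            String f = frags.get(i);
            boolean redundant = false;
            for (int j = 0; j < frags.size() && !redundant; j++) {
                if (i == j) continue;
                String g = frags.get(j);
                // Of two equal strings, keep only the earlier copy.
                redundant = g.equals(f) ? j < i : g.contains(f);
            }
            if (!redundant) kept.add(f);
        }
        return kept;
    }

    // Greedy merge: returns the pieces left when no pair overlaps by at
    // least minOverlap characters (one piece if everything joined up).
    static List<String> assemble(List<String> fragments, int minOverlap) {
        List<String> pool = dropContained(fragments);
        while (pool.size() > 1) {
            int bestLen = 0, bestI = -1, bestJ = -1;
            for (int i = 0; i < pool.size(); i++) {
                for (int j = 0; j < pool.size(); j++) {
                    if (i == j) continue;
                    int len = overlap(pool.get(i), pool.get(j));
                    if (len > bestLen) { bestLen = len; bestI = i; bestJ = j; }
                }
            }
            if (bestLen == 0 || bestLen < minOverlap) break;
            String merged = pool.get(bestI) + pool.get(bestJ).substring(bestLen);
            pool.remove(Math.max(bestI, bestJ)); // remove higher index first
            pool.remove(Math.min(bestI, bestJ));
            pool.add(merged);
            pool = dropContained(pool);
        }
        return pool;
    }

    public static void main(String[] args) {
        List<String> alphabetFragments = Arrays.asList(
            "A", "AB", "ABCDE", "ABCDEF", "BCDEF", "CDEFGH", "FGHIJ", "GHIJK",
            "GHIJKL", "IJKLMN", "KLMNO", "LMNOP", "MNOPQR", "OPQRST", "PQRST",
            "QRSTU", "STUVWX", "UVWXY", "UVWXYZ", "VWXYZ", "YZ", "Z");
        System.out.println(assemble(alphabetFragments, 1));
    }
}
```

The same assemble call accepts the DNA fragments listed on this slide; with only four letters and repeated substrings, a larger minOverlap is the natural knob to experiment with.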

  8. The Real World • Have looked at toy problems: back to reality • String lengths are huge: (3 * 10^9) • Why the obsession with fragments? • If we can sequence (read) a fragment, why not just do the whole thing? • Automatic sequencers are available • Limited to lengths on the order of 1000 bases from an end • (Can sequence a whole strand if it is short enough) • Thus the use of the Shotgun Method of Sequencing

  9. Shotgun Sequencing • Strand much too long for automatic sequencing • Randomly cut it into small pieces (~5 Kbp) • Make many identical copies of these strands • Each of these small pieces is sequenced to produce reads • What’s left is a Data Processing Problem • Need to reconstruct the original DNA strand by matching ends • If random reads match nicely, work can be completed • There may be problems!

  10. Shotgun Problems • Gaps • Due to the random nature of shearing strands, there may be gaps in the sequence • (Maybe all pieces broke at the same place) • May need to repeat the process for some sub-areas to fill gaps • Repeats • Long repeats may make matching ambiguous • Need extra-long fragments with ends sequenced • Can tell how many repeats “fit” in • Also can bridge gaps that have resisted sequencing • Sequencing Errors • Automatic sequencing is error prone – need multiple passes

  11. The Computations Required • Appears to be a Simple String Matching Problem • Remember the int indexOf(String) method: String a, b; /* ... input or compute data ... */ int pos = a.indexOf(b); • pos tells where in a the string b is located (-1 if b does not occur in a) • Combined with use of String substring(int, int), can check for overlap in the ends of strings • Effectively “slide” the ends over each other looking for a match
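The "sliding" idea can be sketched with the two String methods the slide names. This is a minimal illustration, not the course's reference solution: OverlapDemo, overlap, merge, and the minOverlap parameter are names assumed for the example. It tries the longest possible overlap first, takes that many characters from the end of a with substring(int, int), and checks whether b begins with them (startsWith(tail) is equivalent to b.indexOf(tail) == 0 but stops at the first mismatch).

```java
class OverlapDemo {
    // Length of the longest suffix of a that is also a prefix of b,
    // considering only overlaps of at least minOverlap characters.
    // Returns 0 when there is no such overlap.
    static int overlap(String a, String b, int minOverlap) {
        int max = Math.min(a.length(), b.length());
        for (int len = max; len >= minOverlap && len > 0; len--) {
            String tail = a.substring(a.length() - len, a.length());
            if (b.startsWith(tail)) {   // same as b.indexOf(tail) == 0
                return len;
            }
        }
        return 0;
    }

    // Joins a and b across their overlap, or returns null if they
    // do not overlap by at least minOverlap characters.
    static String merge(String a, String b, int minOverlap) {
        int len = overlap(a, b, minOverlap);
        return len == 0 ? null : a + b.substring(len);
    }

    public static void main(String[] args) {
        System.out.println(merge("ABCDEF", "FGHIJ", 1));   // ABCDEFGHIJ
        System.out.println(merge("IJKLMN", "MNOPQR", 2));  // IJKLMNOPQR
        System.out.println(merge("ABCDEF", "FGHIJ", 2));   // null: overlap too short
    }
}
```

Requiring a minimum overlap is the code-level version of the earlier point that a four-letter alphabet with repeats needs longer overlaps before a match can be trusted.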

  12. The Computations Required • Seems simple enough in principle, but… • Large numbers involved make the task daunting • E.g., must compare each read to every other read • For N reads, that involves N^2 compares • Wouldn’t seem bad except when we calculate N • N = 3*10^9 / 10^3 = 3*10^6 (genome length divided by approx. size of a read) • N^2 is ~ 9*10^12 compares • That’s only for 1x coverage (need more!) • Each compare also involves up to L^2 char compares! (where L is the length of a read)
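The arithmetic on this slide is easy to reproduce. The sketch below is illustrative: CompareEstimate and its method names are made up for the example, the inputs are the slide's round figures (a 3*10^9-base genome and 1000-base reads), and the per-compare cost is taken at its stated upper bound of read length squared. The pair count already exceeds what an int can hold, so long is used, and the character-compare total is computed with BigInteger because it lands close to the limit of a long.

```java
import java.math.BigInteger;

class CompareEstimate {
    // Number of reads needed to cover the genome once.
    static long readCount(long genomeLength, long readLength) {
        return genomeLength / readLength;
    }

    // Every read compared against every other read: about N^2 compares.
    static long pairCompares(long reads) {
        return Math.multiplyExact(reads, reads); // throws if a long overflows
    }

    // Upper bound on character compares: pairs times readLength squared.
    static BigInteger charCompares(long pairs, long readLength) {
        return BigInteger.valueOf(pairs)
                         .multiply(BigInteger.valueOf(readLength).pow(2));
    }

    public static void main(String[] args) {
        long genome = 3_000_000_000L;  // slide's round figure
        long read = 1_000L;            // approximate read length
        long n = readCount(genome, read);
        long pairs = pairCompares(n);
        System.out.println("reads N         = " + n);
        System.out.println("pair compares   = " + pairs);
        System.out.println("char compares  <= " + charCompares(pairs, read));
    }
}
```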

  13. The Computations Required • Previous analysis is naïve • + Can do better by grouping things • Like matching “words” rather than “letters” • - Other problems not considered make things much more complex • Whole process is not in an error-free environment • Maybe strings that match at 99% of positions must be considered a match • Many good computer scientists and mathematicians are involved
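Treating a 99% agreement as a match can be sketched by counting agreeing positions instead of demanding exact equality. This is a simplified illustration: ApproxMatch, identity, and overlapsApprox are hypothetical names, and the code tolerates only substituted characters. Inserted or deleted bases, which also occur in real reads, would need an alignment method that is not shown here.

```java
class ApproxMatch {
    // Fraction of positions at which the last len characters of a
    // agree with the first len characters of b.
    static double identity(String a, String b, int len) {
        if (len <= 0 || len > a.length() || len > b.length()) {
            throw new IllegalArgumentException("bad overlap length: " + len);
        }
        int start = a.length() - len;
        int same = 0;
        for (int i = 0; i < len; i++) {
            if (a.charAt(start + i) == b.charAt(i)) {
                same++;
            }
        }
        return (double) same / len;
    }

    // True when the candidate overlap agrees at minIdentity or more
    // of its positions (for example 0.99 for the slide's 99%).
    static boolean overlapsApprox(String a, String b, int len, double minIdentity) {
        return identity(a, b, len) >= minIdentity;
    }

    public static void main(String[] args) {
        String a = "AAAACGTACGTA";
        String exact = "CGTACGTATTTT";     // first 8 chars equal a's last 8
        String oneError = "CGTTCGTATTTT";  // one substituted base in the overlap
        System.out.println(identity(a, exact, 8));
        System.out.println(identity(a, oneError, 8));
        System.out.println(overlapsApprox(a, oneError, 8, 0.85));
        System.out.println(overlapsApprox(a, oneError, 8, 0.99));
    }
}
```

With only 8 positions a single error drops the identity well below 99%, which hints at why a percentage threshold is only meaningful on overlaps of realistic length.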

  14. Interesting Competition • BAC to BAC Sequencing • Public Human Genome Project (1988 - ) • Many cooperating laboratories, worldwide • Started much earlier than the competition • Started with more primitive technologies • Top-down approach using bacterial artificial chromosomes (BACs) • Builds framework (scaffolding) first • Then fills in details

  15. Interesting Competition • Whole Genome Shotgun Sequencing • Celera Genomics (private: Craig Venter, Eugene Myers) • Later start (1998 - ), “finished” at same time • Benefited from much improved technology • Sequencers much better • Longer strands, better accuracy • Faster computers • Shotgun from the top down • Use three sizes of fragments (1 Mbp, 50 Kbp, 10 Kbp) • Can use longer pieces to deal with repeats • Everything done in parallel.

  16. Interesting Competition • Whole Genome Shotgun method appears to have won • Much controversy at first • Hybrid methods • Job just beginning! • Need to find out what in Genome affects what in practice • Much labeled “junk” DNA because it doesn’t seem to affect anything. • Is that the last word?
