Information Theory of DNA Sequencing

Information Theory of DNA Sequencing. David Tse, Dept. of EECS, U.C. Berkeley. ITA 2012, Feb. 10. Research supported by NSF Center for Science of Information. Abolfazl Motahari. Guy Bresler.



  1. Information Theory of DNA Sequencing. David Tse, Dept. of EECS, U.C. Berkeley. ITA 2012, Feb. 10. Research supported by NSF Center for Science of Information. Abolfazl Motahari, Guy Bresler.

  2. DNA sequencing. DNA: the blueprint of life. Problem: to obtain the sequence of nucleotides. …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

  3. Impetus: Human Genome Project. 1990: start. 2001: draft. 2003: finished. 3 billion base pairs, $3 billion.

  4. Sequencing Gets Cheaper and Faster. Cost of one human genome: • HGP: $3 billion • 2004: $30,000,000 • 2008: $100,000 • 2010: $10,000 • 2011: $4,000 • 2012-13: $1,000 • ???: $300. Time to sequence one genome: years/months → hours/days. Massive parallelization.

  5. But many genomes to sequence: 100 million species (e.g. phylogeny); 7 billion individuals (SNP, personal genomics); 10^13 cells in a human (e.g. somatic mutations such as HIV, cancer).

  6. Whole Genome Shotgun Sequencing Reads are assembled to reconstruct the original DNA sequence.

  7. A Gigantic Jigsaw Puzzle

  8. Computation versus Information View • Many proposed assembly algorithms for many sequencing technologies. • But what is the minimum number of reads required for reliable reconstruction? • How much intrinsic information does each read provide about the DNA sequence? • This depends on the sequencing technology but not on the assembly algorithm.

  9. Communication and Sequencing: An Analogy Communication: source sequence Sequencing: Question: what is the max. sequencing rate such that reliable reconstruction is possible?

  10. The read channel • Capacity depends on • read length: L • DNA length: G • Normalized read length: • E.g. L = 100, G = 3 × 10^9. Diagram: the sequence AGCTTATAGGTCCGCATTACC enters the read channel, which outputs a read such as AGGTCC.

  11. Result: Sequencing Capacity. Plot labels: no coverage (Lander-Waterman 88); duplication (Arratia et al. 96); greedy algorithm. H2(p) is the (Rényi) entropy rate of the DNA sequence. The higher the entropy, the easier the problem!

  12. Complexity is in the eyes of the beholder. Low entropy: easier to communicate, harder jigsaw puzzle. High entropy: harder to communicate, easier jigsaw puzzle.

  13. Conclusion • DNA sequencing is an important problem. • Many new technologies and new applications. • An analogy between sequencing and communication is drawn. • A notion of sequencing capacity is formulated. • A principled design framework?
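
Two ideas from slides 10 and 11 can be made concrete with a small sketch: the order-2 Rényi entropy H2(p) and the coverage condition that the set of reads must satisfy. The code below is only an illustration, not the authors' method. It assumes an i.i.d. base distribution p (so that H2(p) = -ln Σ p_i²), a circular genome, noiseless reads, and uniformly random read start positions. The function names `renyi_entropy_2`, `sample_reads`, and `is_covered` are hypothetical.

```python
import math
import random


def renyi_entropy_2(probs):
    """Order-2 Renyi entropy in nats: H2(p) = -ln(sum_i p_i^2)."""
    return -math.log(sum(p * p for p in probs))


def sample_reads(genome, read_len, num_reads, rng):
    """Draw noiseless reads of fixed length from uniformly random start
    positions. The genome is treated as circular (an assumption), so reads
    that start near the end wrap around to the beginning."""
    g = len(genome)
    wrapped = genome + genome[:read_len - 1]
    starts = [rng.randrange(g) for _ in range(num_reads)]
    return [(s, wrapped[s:s + read_len]) for s in starts]


def is_covered(starts, genome_len, read_len):
    """True if every position of the circular genome lies in some read."""
    covered = [False] * genome_len
    for s in starts:
        for k in range(read_len):
            covered[(s + k) % genome_len] = True
    return all(covered)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    genome = "".join(rng.choice("ACGT") for _ in range(2000))
    reads = sample_reads(genome, read_len=50, num_reads=400, rng=rng)
    starts = [s for s, _ in reads]
    print("H2 of uniform bases (nats):", renyi_entropy_2([0.25] * 4))
    print("genome covered by reads:", is_covered(starts, len(genome), 50))
```

A uniform base distribution gives the largest H2 (ln 4 nats), and a distribution concentrated on one base gives 0. Coverage is only a necessary condition for reconstruction: reads can cover the whole genome and still leave it ambiguous when long repeats are present, which is the regime slide 11 labels "duplication".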
