Information Theory of DNA Sequencing. David Tse, Dept. of EECS, U.C. Berkeley. ITA 2012, Feb. 10. Research supported by NSF Center for Science of Information. Abolfazl Motahari. Guy Bresler.
DNA sequencing • DNA: the blueprint of life • Problem: to obtain the sequence of nucleotides. • …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…
Impetus: Human Genome Project • 1990: Start • 2001: Draft • 2003: Finished • 3 billion basepairs • $3 billion
Sequencing Gets Cheaper and Faster • Cost of one human genome: • HGP: $3 billion • 2004: $30,000,000 • 2008: $100,000 • 2010: $10,000 • 2011: $4,000 • 2012-13: $1,000 • ???: $300 • Time to sequence one genome: years/months → hours/days • Massive parallelization.
But many genomes to sequence • 100 million species (e.g. phylogeny) • 7 billion individuals (SNP, personal genomics) • 10^13 cells in a human (e.g. somatic mutations such as HIV, cancer)
Whole Genome Shotgun Sequencing • Reads are assembled to reconstruct the original DNA sequence.
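To make the shotgun idea concrete, here is a minimal Python sketch of the two steps: drawing reads from random positions and stitching them back together. The function names (`sample_reads`, `overlap`, `greedy_assemble`) are hypothetical, the reads are assumed error-free and fixed-length, and the overlap-merging loop is only one simple assembly strategy chosen for illustration, not a method taken from the slides.

```python
import random


def sample_reads(genome, read_len, num_reads, rng):
    """Draw fixed-length reads from uniformly random start positions.

    Assumption: error-free reads, linear genome (no wrap-around).
    """
    last_start = len(genome) - read_len
    starts = [rng.randrange(last_start + 1) for _ in range(num_reads)]
    return [genome[s:s + read_len] for s in starts]


def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads):
    """Repeatedly merge the two contigs with the largest suffix-prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    contigs = [c for c in contigs
               if not any(c != d and c in d for d in contigs)]
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; return separate contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
        # Remove contigs that the new merged contig now contains.
        contigs = [c for c in dict.fromkeys(contigs)
                   if c == merged or c not in merged]
    return contigs
```

For instance, the four overlapping reads `ATGCGT`, `GCGTAC`, `GTACCT`, `ACCTGA` merge back into the single contig `ATGCGTACCTGA`. With repeats or too few reads, this greedy merge can return several contigs or a wrong sequence, which is exactly why the number and length of reads matter.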
Computation versus Information View • Many proposed assembly algorithms for many sequencing technologies. • But what is the minimum number of reads required for reliable reconstruction? • How much intrinsic information does each read provide about the DNA sequence? • This depends on the sequencing technology but not on the assembly algorithm.
Communication and Sequencing: An Analogy • Communication: source sequence (diagram not recovered) • Sequencing: (diagram not recovered) • Question: what is the maximum sequencing rate such that reliable reconstruction is possible?
The read channel • Capacity depends on • read length: L • DNA length: G • Normalized read length: (formula not recovered) • E.g. L = 100, G = 3 × 10^9 • Figure: the sequence AGCTTATAGGTCCGCATTACC passes through the read channel, producing the read AGGTCC
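The formula for the normalized read length did not survive on this slide. As a hedged worked example, the sketch below assumes the definition used in the authors' companion paper on shotgun sequencing, normalized length = L / ln G (natural logarithm). Both that definition and the helper name `normalized_read_length` are assumptions here, not text recovered from the slide.

```python
import math


def normalized_read_length(read_len, genome_len):
    """ASSUMED definition: read length divided by the natural log of the genome length."""
    if read_len <= 0 or genome_len <= 1:
        raise ValueError("need read_len > 0 and genome_len > 1")
    return read_len / math.log(genome_len)


# The slide's example values: L = 100, G = 3 x 10^9.
example = normalized_read_length(100, 3e9)
print(f"normalized read length: {example:.2f}")
```

Under that assumption, ln(3 × 10^9) is about 21.8, so reads of 100 bases on a human-sized genome give a normalized read length of roughly 4.6.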
Result: Sequencing Capacity • Figure labels: "no coverage" (Lander-Waterman 88), "duplication" (Arratia et al 96), "greedy algorithm" • H2(p) is the (Rényi) entropy rate of the DNA sequence. • The higher the entropy, the easier the problem!
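The claim that higher entropy makes the problem easier can be checked numerically. The sketch below uses the standard order-2 Rényi entropy of an i.i.d. base distribution, H2(p) = -ln Σ p_i². The threshold used in `critical_read_length` (reads must satisfy L / ln G > 2 / H2(p)) is an assumption taken from the authors' companion paper, since the formula itself is not legible on the slide; the function names are hypothetical.

```python
import math


def renyi2_entropy(p):
    """Order-2 Renyi entropy in nats: -ln(sum of squared probabilities)."""
    if abs(sum(p) - 1.0) > 1e-9 or any(x < 0 for x in p):
        raise ValueError("p must be a probability distribution")
    return -math.log(sum(x * x for x in p))


def critical_read_length(p, genome_len):
    """ASSUMED threshold: smallest L with L / ln(G) equal to 2 / H2(p).

    Below this length, repeats of length L are expected in an i.i.d.
    sequence, so reconstruction is assumed to fail regardless of coverage.
    """
    return 2.0 / renyi2_entropy(p) * math.log(genome_len)


uniform = [0.25, 0.25, 0.25, 0.25]  # maximum-entropy base distribution
skewed = [0.4, 0.1, 0.1, 0.4]       # lower-entropy, illustrative values only

for name, p in [("uniform", uniform), ("skewed", skewed)]:
    h2 = renyi2_entropy(p)
    crit = critical_read_length(p, 3e9)
    print(f"{name}: H2 = {h2:.3f} nats, critical read length = {crit:.1f}")
```

Under these assumptions the uniform distribution needs reads of only about 31 to 32 bases for a genome of 3 × 10^9, while the skewed, lower-entropy distribution needs longer reads: lower entropy means more repeats, and repeats are what make the jigsaw ambiguous.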
Complexity is in the eyes of the beholder • Low entropy: easier to communicate, harder jigsaw puzzle • High entropy: harder to communicate, easier jigsaw puzzle
Conclusion • DNA sequencing is an important problem. • Many new technologies and new applications. • An analogy between sequencing and communication is drawn. • A notion of sequencing capacity is formulated. • A principled design framework?