180 likes | 358 Views
Computational Genomics. Lecture 1, Tuesday April 1, 2003. Biology in One Slide: 2 Paradigms. Molecular Paradigm. Evolution Paradigm. High Throughput Biology. Gene Expression. Biology is becoming an information science. DNA Sequencing. …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT
E N D
Computational Genomics Lecture 1, Tuesday April 1, 2003
Biology in One Slide: 2 Paradigms Molecular Paradigm Evolution Paradigm
High Throughput Biology Gene Expression Biology is becoming an information science DNA Sequencing …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…
Goals of this course • Introduction to Computational Biology • Basic biology for computer scientists • Breadth: mention many topics & applications • In-depth coverage of Computational Genomics • Algorithms for sequence analysis • Current applications, trends, and open problems • Coverage of useful algorithms • Hidden Markov models • Dynamic Programming • String algorithms • Applications of AI techniques
Topics in CS262 Part 1: In-depth coverage of basic computational methods for analysis of biological sequences • Sequence Alignment & Dynamic Programming • Hidden Markov models These methods are used heavily in most genomics applications: • DNA sequencing • Comparison of DNA and proteins across organisms • Discovery of genes, promoters, regulatory sites
Topics in CS262 Part 2: Topics in computational genomics, more algorithms, and areas of active research • DNA sequencing & assembly: reading a complete genome such as the human DNA • Gene finding: marking genes on the DNA sequence • Large-scale comparative genomics: comparing whole genomes from multiple organisms • Microarrays & regulation: understanding the regulatory code, and potential disease-causing genes • RNA structure: predicting the folding of RNA • Phylogeny and evolution: quantifying the evolution of biological sequences
Course responsibilities • Homeworks [72%] • 4 challenging problem sets, 4-5 problems/pset • Collaboration allowed – please give credit • Hws due Thursday, solutions explained Friday • Two worst problems in all hws do not count • Final [18%] • Takehome, 1 day • Collaboration not allowed • Basic questions – much easier than homeworks • Scribing [10%] • Due one week after the lecture, except special permission
Reading material • Books • “Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchinson • Chapters 1-4, 6, (7-8), (9-10) • “Algorithms on strings, trees, and sequences”by Gusfield • Chapters (5-7), 11-12, (13), 14, (17) • Papers • Lecture notes
Evolution at the DNA level C …ACGGTGCAGTCACCA… …ACGTTGCAGTCCACCA… SEQUENCE EDITS REARRANGEMENTS
Evolutionary Rates next generation OK OK OK Changes in non-functional sites are OK, so will be propagated X X Still OK? Most changes in functional sites are deleterious and will be rejected
Sequence conservation implies function 100% Interleukin region in human and mouse 40%
Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x1x2...xM, y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence
What is a good alignment? Alignment: The “best” way to match the letters of one sequence with those of the other How do we define “best”? Alignment: A hypothesis that the two sequences come from a common ancestor through sequence edits Parsimonious explanation: Find the minimum number of edits that transform one sequence into the other
Scoring Function • Sequence edits: AGGCCTC • Mutations AGGACTC • Insertions AGGGCCTC • Deletions AGG.CTC Scoring Function: Match: +m Mismatch: -s Gap: -d Score F = (# matches) m - (# mismatches) s – (#gaps) d