Algorithmic Functions of Computational Biology – Course 1 Professor Istrail
Of Sea Urchins, Birds and Men

Darwin’s Finches and Coco

The Father of All Dot Plots: The Human Genome

The Synteny Problem
• Between distant species: comparison can reveal function
  • Conservation reveals selective pressure
• Between near species
  • Conservation reveals evolutionary history
• Between similar or the same species
  • Recent events in subpopulations
  • Phenotypic differences

Matching, Chaining, Extension
• Matching Phase
• Chaining Phase
• Extension Phase
Dot Plots 101
• a, b, c, d stand for letters; A, B, C, D stand for words
• Where letters match, put a dot
• Where words match, put a line (words can be reverse-complemented)

Dot Plots 101
• When words line up
• Reversed
• Misplaced
• Something gained (relative to horizontal)
• Something lost (relative to horizontal)
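The dot-plot rules above can be sketched in a few lines of Python. This is a minimal illustration, not the tool used for the genome plots in this deck: the function names `dot_plot` and `word_matches` are hypothetical, words are taken to be fixed-length substrings of length `k`, and reverse-complemented words are not handled.

```python
def dot_plot(horizontal: str, vertical: str) -> list[str]:
    """Return one text row per letter of `vertical`; '*' marks a letter match."""
    return [
        "".join("*" if v == h else "." for h in horizontal)
        for v in vertical
    ]


def word_matches(horizontal: str, vertical: str, k: int) -> list[tuple[int, int]]:
    """Return (i, j) pairs where the length-k word starting at position i of
    `horizontal` equals the length-k word starting at position j of `vertical`.
    Forward strand only (an assumption of this sketch)."""
    index: dict[str, list[int]] = {}
    for i in range(len(horizontal) - k + 1):
        index.setdefault(horizontal[i:i + k], []).append(i)
    return [
        (i, j)
        for j in range(len(vertical) - k + 1)
        for i in index.get(vertical[j:j + k], [])
    ]


if __name__ == "__main__":
    # Letter-level plot: a swapped pair of letters shows up off the diagonal.
    for row in dot_plot("abcd", "abdc"):
        print(row)
    # Word-level matches: consecutive (i, j) pairs along a diagonal form a line.
    print(word_matches("abcdab", "cdab", 2))
```

Runs of matches with constant `i - j` lie on one diagonal, which is what the slides draw as a line; a shifted diagonal corresponds to something gained or lost relative to the horizontal sequence.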
Some large reversals in GP

NCBI has more of the centromere than anyone else (or is that N’s?)

Many reversals in GP; a piece of the end is reordered to the middle; the Celera assemblies are boringly good.

Again, everyone misses the first 10 Mb of NCBI31 (or are those N’s?)
Rube Goldberg’s Innovation
GENOMIC REGULATORY SYSTEMS
Mixed character of the problem: continuous mathematics, discrete mathematics

Rube Goldberg’s Pencil Sharpener invention
Open window (A) and fly kite (B). String (C) lifts small door (D), allowing moths (E) to escape and eat red flannel shirt (F). As weight of shirt becomes less, shoe (G) steps on switch (H) which heats electric iron (I) and burns hole in pants (J). Smoke (K) enters hole in tree (L), smoking out opossum (M), which jumps into basket (N), pulling rope (O) and lifting cage (P), allowing woodpecker (Q) to chew wood from pencil (R), exposing lead. Emergency knife (S) is always handy in case opossum or the woodpecker gets sick and can't work.
A Tale of Two Networks
• Drosophila
• Sea Urchin

A Proposal for Nobel Prize
Eric H. Davidson, Genomic Regulatory Systems
“Programs built into the DNA of every animal.”
One gene, 30 years of study, 300 docs and postdocs
The Dogma

Genomic Regulatory Regions

TF Binding Site Complexity

Genome Complexity
• 1 Billion DNA bases
• 20,000 Genes

cis-Regulatory Modules Complexity
• 200,000 cis-Modules

The DNA program that regulates the expression of endo16 in sea urchin
• THE FIRST GENE

The View from the Genome

The View from the Nucleus
Building Protein-DNA Assemblies
• DNA
• cismodule
• Cooperativity
• Linear-amp
• Gates
• Potentiality
• Inter-cismodule linkage
• Insulation
• Communication

The Building Blocks
• Free Energy: free energy is the “GLUE”
• Protein
• DNA
• Protein-DNA Binding (free energy)
Information Processing

Boolean Circuit
• Synchronous input and output
• Completely defined gates
Figure values: 0 1 1 0 0 1 0 0 0

Boolean Circuit and Boolinear Circuit
• Boolean Circuit: synchronous input and output; completely defined gates
• Boolinear Circuit: asynchronous input and output; incompletely defined gates
Figure values: 1.4 0.5 0 1 1 0 0 1 0 0 0 1.1

1 1 0 1
AND OR NOT OR
1
IF (x1 = 1 AND x2 = 1) THEN …..
GTAGGATTAAG …...
CATCCTAATTC …….
GTATCTAGAAG …….
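As a rough illustration of the Boolean-circuit side of this comparison (synchronous inputs, completely defined gates), the sketch below evaluates a small circuit of AND, OR and NOT gates. The wiring in `CIRCUIT`, the wire names, and the function `evaluate` are all hypothetical; the slides do not specify how the gates are connected, and the Boolinear case is not modelled here.

```python
from typing import Callable

# Completely defined gates: every input combination has a defined 0/1 output.
GATES: dict[str, Callable[..., int]] = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NOT": lambda a: 1 - a,
}

# Hypothetical wiring, listed in evaluation order:
# (output wire, gate name, input wires).
CIRCUIT = [
    ("g1", "AND", ("x1", "x2")),
    ("g2", "OR", ("x3", "x4")),
    ("g3", "NOT", ("g2",)),
    ("out", "OR", ("g1", "g3")),
]


def evaluate(circuit, inputs: dict[str, int]) -> dict[str, int]:
    """Evaluate every gate once, assuming all inputs are present at the
    same time (synchronous) and gates are listed in dependency order."""
    values = dict(inputs)
    for out, gate, ins in circuit:
        values[out] = GATES[gate](*(values[w] for w in ins))
    return values


if __name__ == "__main__":
    print(evaluate(CIRCUIT, {"x1": 1, "x2": 1, "x3": 0, "x4": 1})["out"])
```

The first gate is the slide's rule "IF (x1 = 1 AND x2 = 1) THEN …" written as a gate; everything downstream of it is an assumption made only to have something to evaluate.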
Web page: http://www.its.caltech.edu/~chyuh/cathy-mirsky-info.html
Caltech, Davidson Lab, October 2004
Introduction SNPs, HAPLOTYPES
Single Nucleotide Polymorphism (SNP)
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
• The most abundant type of polymorphism
• A SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%.
• The two alleles at the site are G and T.
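To make the example concrete, the following sketch compares the two sequences from this slide position by position and reports where they differ. The function name `differing_sites` is hypothetical. Note that a difference between two sequences is only a candidate SNP: the >1% frequency condition needs population data, which this sketch does not have.

```python
def differing_sites(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, base in seq_a, base in seq_b) for every
    position where two equal-length, already aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


if __name__ == "__main__":
    # The two sequences shown on the slide.
    print(differing_sites("GATTTAGATCGCGATAGAG", "GATTTAGATCTCGATAGAG"))
    # prints [(11, 'G', 'T')]
```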
(Background: excerpt of human genomic DNA sequence.)
• The human genome contains ~3 G base pairs per haploid copy; a diploid cell carries 46 chromosomes (23 pairs).
• Two individuals are 99.9% the same, i.e., they differ in ~3 M base pairs.
• SNPs occur once every ~600 bp
• Average gene in the human genome spans ~27 kb
• ~50 SNPs per gene
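The figures on this slide follow from simple arithmetic, sketched below using only the slide's own numbers. The variable names are illustrative. The computed SNPs-per-gene value is 45, which the slide rounds to roughly 50.

```python
GENOME_BP = 3_000_000_000      # ~3 G base pairs (slide figure)
DIFFERENCE_PER_BP = 1000       # 99.9% identical: 1 difference per 1,000 bp
SNP_SPACING_BP = 600           # one SNP every ~600 bp (slide figure)
AVERAGE_GENE_BP = 27_000       # average gene span, ~27 kb (slide figure)

# Base pairs that differ between two individuals.
differing_bp = GENOME_BP // DIFFERENCE_PER_BP

# Expected number of SNPs falling inside an average gene.
snps_per_gene = AVERAGE_GENE_BP / SNP_SPACING_BP

if __name__ == "__main__":
    print(differing_bp)   # 3000000, i.e. ~3 M
    print(snps_per_gene)  # 45.0, i.e. on the order of 50
```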
Haplotype
Two individuals, three SNP sites:
G C T C G A C A A C A G   (haplotype C A G)
G T T C G T C A A C A G   (haplotype T T G)
Mutations
Infinite Sites Assumption: each site mutates at most once.

Haplotype Pattern
C A G T T T G A C A T G C T G T
0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 1
• At each SNP site, label the two alleles as 0 and 1.
• The choice of which allele is 0 and which is 1 is arbitrary.
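The 0/1 labelling can be sketched as follows. Because the choice is arbitrary, this sketch simply calls the first allele seen at each site 0 and the other allele 1. The function name `encode_haplotypes` and the three example haplotypes are hypothetical and are not the ones drawn on the slide.

```python
def encode_haplotypes(haplotypes: list[str]) -> list[str]:
    """Encode each haplotype as a 0/1 string, site by site.
    At every site the first allele encountered is labelled 0, the other 1."""
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must cover the same SNP sites")
    rows: list[list[str]] = [[] for _ in haplotypes]
    for site in range(n_sites):
        alleles: list[str] = []
        for h in haplotypes:
            if h[site] not in alleles:
                alleles.append(h[site])
        if len(alleles) > 2:
            raise ValueError(f"site {site + 1} has more than two alleles")
        for row, h in zip(rows, haplotypes):
            row.append(str(alleles.index(h[site])))
    return ["".join(row) for row in rows]


if __name__ == "__main__":
    # Hypothetical haplotypes over three SNP sites.
    print(encode_haplotypes(["CAG", "TTG", "CTG"]))
    # prints ['000', '110', '010']
```

Swapping the labels at any site gives an equally valid encoding; only the pattern of agreement between haplotypes matters.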
Recombination
• The two alleles are linked, i.e., they are “traveling together”
G T T C G A C A A C A T
A C G T A T C T A T T A
• Recombination disrupts the linkage
G T T C G A C T A T T A
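A single crossover can be sketched as taking the left part of one parental chromosome and the right part of the other. The function name `recombine` is hypothetical, and the crossover point (after the 7th base) is inferred from the recombinant shown on the slide rather than stated there.

```python
def recombine(parent_a: str, parent_b: str, crossover: int) -> str:
    """Single crossover: the first `crossover` bases come from parent_a,
    the remaining bases from parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parental sequences must have equal length")
    if not 0 <= crossover <= len(parent_a):
        raise ValueError("crossover point is outside the sequence")
    return parent_a[:crossover] + parent_b[crossover:]


if __name__ == "__main__":
    # The two parental sequences from the slide, crossover after base 7.
    print(recombine("GTTCGACAACAT", "ACGTATCTATTA", 7))
    # prints GTTCGACTATTA
```

Alleles on opposite sides of the crossover point no longer travel together, which is the sense in which recombination disrupts linkage.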
Emergence of Variations Over Time
• Common Ancestor
• Disease Mutation
• Variations in Chromosomes Within a Population (present time)
• Linkage Disequilibrium (LD)

Extent of Linkage Disequilibrium
• Disease-causing mutation
• 2,000 generations ago
• 1,000 generations ago
• Time = present
A Data Compression Problem
• Select SNPs to use in an association study
• Would like to associate single nucleotide polymorphisms (SNPs) with disease.
• Very large number of candidate SNPs
• Chromosome-wide studies, whole-genome scans
• For cost effectiveness, select only a subset.
• Closely spaced SNPs are highly correlated
• It is less likely that there has been a recombination between two SNPs if they are close to each other.
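One common way to quantify how correlated two SNPs are is the squared correlation r² between their 0/1 allele columns. The slides do not name a specific measure, so treating r² as the measure is an assumption of this sketch, as are the function name `r_squared` and the toy data.

```python
def r_squared(site_a: list[int], site_b: list[int]) -> float:
    """Squared correlation between two biallelic SNPs, each given as a
    list of 0/1 alleles, one entry per sampled chromosome."""
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("both sites need the same, non-zero number of samples")
    n = len(site_a)
    p_a = sum(site_a) / n                       # frequency of allele 1 at site A
    p_b = sum(site_b) / n                       # frequency of allele 1 at site B
    p_ab = sum(a & b for a, b in zip(site_a, site_b)) / n  # both carry allele 1
    d = p_ab - p_a * p_b                        # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    return d * d / denom


if __name__ == "__main__":
    # Hypothetical data: two SNPs that always travel together ...
    print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))
    # ... and two SNPs that are unrelated in this sample.
    print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))
```

When r² is close to 1, genotyping one of the two SNPs tells you almost everything about the other, which is why a subset can stand in for the full set.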
Association studies
• Groups: Control / Non-responder and Disease / Responder
• Marker A: Allele 0, Allele 1
• Marker A is associated with Phenotype
Association studies
Evaluate whether nucleotide polymorphisms associate with phenotype
Five SNP sites (one per row), six samples:
T T C T C T
A G G G G A
G A A A G G
A C A A A A
T T G T G G

Association studies
The same five sites with the samples regrouped:
T T T C C T
G G A G A G
G A G G A A
A A C A A A
G T T T G G

Association studies
The regrouped sites with alleles labelled 0 and 1:
1 1 1 0 0 1
0 0 1 0 1 0
1 0 1 1 0 0
0 0 1 0 0 0
0 1 1 1 0 0
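A simple way to evaluate whether one marker associates with a phenotype is a chi-square test on the 2×2 table of allele by phenotype. The slides do not say which test is used, so Pearson's chi-square is an assumption here, and the function names and data below are hypothetical (the slides' matrices carry no phenotype labels).

```python
def allele_table(alleles: list[int], phenotypes: list[int]) -> list[list[int]]:
    """Count samples in a 2x2 table: rows are phenotype (0 = control,
    1 = disease), columns are the marker allele (0 or 1)."""
    table = [[0, 0], [0, 0]]
    for allele, phenotype in zip(alleles, phenotypes):
        table[phenotype][allele] += 1
    return table


def chi_square(table: list[list[int]]) -> float:
    """Pearson chi-square statistic for a 2x2 table (1 degree of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one sample")
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # Hypothetical marker: allele 1 appears only in the disease group.
    alleles = [0, 0, 0, 0, 1, 1, 1, 1]
    phenotypes = [0, 0, 0, 0, 1, 1, 1, 1]
    print(chi_square(allele_table(alleles, phenotypes)))
```

With one degree of freedom, a statistic above about 3.84 corresponds to p < 0.05 for a single test; a genome-wide scan over many markers would need a much stricter threshold.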
Data Compression
Haplotype Blocks based on LD (Method of Gabriel et al. 2002)
ACGATCGATCATGAT
GGTGATTGCATCGAT
ACGATCGGGCTTCCG
ACGATCGGCATCCCG
GGTGATTATCATGAT
Selecting Tagging SNPs in blocks
A------A---TG--
G------G---CG--
A------G---TC--
A------G---CC--
G------A---TG--
Real Haplotype Data
• A region of Chr. 22, 45 Caucasian samples
• Our block-free algorithm
• Two different runs of the Gabriel et al. block detection method + Zhang et al. SNP selection algorithm