UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL. Bioinformatics Tutorials. BIOINFORMATICS AND GENE DISCOVERY. Iosif Vaisman. 1998.
From genes to proteins
DNA (PROMOTER ELEMENTS) -> TRANSCRIPTION -> RNA (SPLICE SITES) -> SPLICING -> mRNA (START CODON, STOP CODON) -> TRANSLATION -> PROTEIN
Comparative Sequence Sizes (bases)
• Yeast chromosome 3: 350,000
• Escherichia coli (bacterium) genome: 4,600,000
• Largest yeast chromosome now mapped: 5,800,000
• Entire yeast genome: 15,000,000
• Smallest human chromosome (Y): 50,000,000
• Largest human chromosome (1): 250,000,000
• Entire human genome: 3,000,000,000
Low-resolution physical map of chromosome 19
Computational Gene Prediction
• Where are genes unlikely to be located?
• How do transcription factors know where to bind a region of DNA?
• Where are the transcription, splicing, and translation start and stop signals?
• What do coding regions do (that non-coding regions do not)?
• Can we learn from examples?
• Does this sequence look familiar?
Artificial Intelligence in Biosciences
• Neural Networks (NN)
• Genetic Algorithms (GA)
• Hidden Markov Models (HMM)
• Stochastic context-free grammars (SCFG)
Information Theory
[Figure: a choice between two alternatives (0, 1) is labeled 1 bit; a choice among four alternatives (00, 01, 11, 10) is labeled as two successive 1-bit choices.]
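The bit counts on these slides can be checked with a short calculation. This is a minimal sketch, assuming equally likely alternatives for the first function; the function names and the example strings are illustrative and do not come from the slides.

```python
import math
from collections import Counter

def bits_to_choose(n_alternatives):
    """Bits needed to pick one of n equally likely alternatives."""
    return math.log2(n_alternatives)

def shannon_entropy(seq):
    """Average information per symbol (in bits) for the symbol
    frequencies observed in seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(bits_to_choose(2))        # two alternatives (0 or 1)
print(bits_to_choose(4))        # four alternatives (00, 01, 10, 11)
print(shannon_entropy("ACGT"))  # four equally frequent nucleotides
```

Under the equal-frequency assumption, one nucleotide out of four carries the same information as the two 1-bit choices in the figure.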
Scientific Models
Physical models -- Mathematical models
Mathematical models: Stochastic models, Mechanistic models
[Diagram labels: Mechanism, Black box, Predictive power, Elegance, Consistency, Predictive power]
Hidden Markov models: Stochastic mechanism
Neural Networks
• an interconnected assembly of simple processing elements (units or nodes)
• node functionality is similar to that of the animal neuron
• processing ability is stored in the inter-unit connection strengths (weights)
• weights are obtained by a process of adaptation to, or learning from, a set of training patterns
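The last two points can be illustrated with the simplest possible case: a single unit whose weights are adapted to a set of training patterns. This is a sketch using the classic perceptron learning rule, which is an assumption chosen for illustration and is not named on the slide. The training set (logical AND), the learning rate, and the epoch count are also illustrative.

```python
def predict(weights, bias, inputs):
    """Single unit: weighted sum of inputs followed by a threshold."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(patterns, epochs=20, rate=1.0):
    """Adapt the weights to the training patterns (perceptron rule)."""
    n_inputs = len(patterns[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in patterns:
            error = target - predict(weights, bias, inputs)
            # Strengthen or weaken each connection in proportion to its input.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

# Illustrative training patterns: logical AND of two inputs.
AND_PATTERNS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_PATTERNS)
print([predict(weights, bias, x) for x, _ in AND_PATTERNS])
```

Everything the trained unit "knows" is held in `weights` and `bias`, which is the sense in which processing ability is stored in the connection strengths.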
Genetic Algorithms
Search or optimization methods using simulated evolution. A population of potential solutions is subjected to natural selection, crossover, and mutation.
  choose initial population
  evaluate each individual's fitness
  repeat
    select individuals to reproduce
    mate pairs at random
    apply crossover operator
    apply mutation operator
    evaluate each individual's fitness
  until terminating condition
Crossover and Mutation
[Figure: Parent A and Parent B are cut at a crossover point and recombined into Child AB and Child BA; mutation then alters individual positions.]
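The loop and the two operators can be sketched in a few lines. This is a minimal illustration, not the tutorial's implementation: the bit-string encoding, the fitness function (count of 1 bits), fitness-proportional selection, single-point crossover, and all parameter values are assumptions made for the example.

```python
import random

def fitness(bits):
    """Hypothetical fitness: number of 1 bits in the individual."""
    return sum(bits)

def crossover(a, b, point):
    """Single-point crossover: returns Child AB and Child BA."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - x if rng.random() < rate else x for x in bits]

def select(population, rng):
    """Fitness-proportional selection (+1 keeps all weights positive)."""
    weights = [fitness(ind) + 1 for ind in population]
    return rng.choices(population, weights=weights, k=len(population))

def run_ga(length=20, pop_size=30, generations=50, rate=0.01, seed=0):
    rng = random.Random(seed)
    # choose initial population
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):           # repeat ... until terminating condition
        parents = select(population, rng)  # select individuals to reproduce
        rng.shuffle(parents)               # mate pairs at random
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            point = rng.randint(1, length - 1)
            children.extend(crossover(a, b, point))
        population = [mutate(c, rate, rng) for c in children]
    return max(population, key=fitness)

best = run_ga()
print(best, fitness(best))
```

The terminating condition here is a fixed number of generations; a fitness threshold would work equally well.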
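Markov Model (or Markov Chain)
Example sequence: A T C T A G
The probability of each character depends only on a fixed number of preceding characters in the sequence.
Number of preceding characters = order of the Markov model
Probability of a sequence (second-order model):
P(s) = P[A] P[A,T] P[A,T,C] P[T,C,T] P[C,T,A] P[T,A,G]
Here P[x,y,z] denotes the probability of z given the preceding characters x, y.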
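The factorization can be reproduced programmatically. The sketch below assumes conditional probabilities are estimated from counts in training sequences with add-one smoothing; the smoothing, the function names, and the handling of the first characters (shorter contexts) are assumptions for illustration.

```python
from collections import defaultdict

def factors(seq, order):
    """List the (context, character) pairs whose probabilities are multiplied."""
    return [(seq[max(0, i - order):i], ch) for i, ch in enumerate(seq)]

def train_markov(sequences, order, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(character | context) from counts, with pseudocounts."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for context, ch in factors(seq, order):
            counts[context][ch] += 1.0

    def prob(context, ch):
        total = sum(counts[context].values()) + pseudocount * len(alphabet)
        return (counts[context][ch] + pseudocount) / total

    return prob

def sequence_probability(seq, order, prob):
    """Product of the conditional probabilities along the sequence."""
    p = 1.0
    for context, ch in factors(seq, order):
        p *= prob(context, ch)
    return p

print(factors("ATCTAG", 2))
model = train_markov(["ATCTAGATCTAG", "ATCTTG"], order=2)  # made-up training data
print(sequence_probability("ATCTAG", 2, model))
```

For the sequence ATCTAG and order 2, `factors` yields the contexts A, AT, TC, CT, TA in turn, matching the six terms of P(s).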
Hidden Markov Models
• States -- well-defined conditions
• Edges -- transitions between the states
• Each transition is assigned a probability.
[Figure: model with nucleotide-labeled states that generates the sequences ATGAC, ATTAC, ACGAC, ACTAC]
Probability of the sequence:
• single path with the highest probability -- Viterbi path
• sum of the probabilities over all paths -- forward algorithm (the computation underlying the Baum-Welch method)
Hidden Markov Model of Biased Coin Tosses
• States (Si): two biased coins {C1, C2}
• Outputs (Oj): two possible outputs {H, T}
• p(Output Oij): p(C1, H), p(C1, T), p(C2, H), p(C2, T)
• Transitions: from state X to Y {A11, A22, A12, A21}
• p(Initial Si): p(I, C1), p(I, C2)
• p(End Si): p(C1, E), p(C2, E)
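The two-coin model is small enough to compute both the best single path (Viterbi) and the sum over all paths (forward algorithm) directly. The sketch below uses made-up probabilities, since the slide gives none, and omits the end state for brevity; both are assumptions.

```python
STATES = ["C1", "C2"]
INIT = {"C1": 0.5, "C2": 0.5}                  # p(I, C1), p(I, C2): assumed values
TRANS = {"C1": {"C1": 0.9, "C2": 0.1},         # A11, A12: assumed values
         "C2": {"C1": 0.1, "C2": 0.9}}         # A21, A22: assumed values
EMIT = {"C1": {"H": 0.5, "T": 0.5},            # p(C1, H), p(C1, T): assumed values
        "C2": {"H": 0.9, "T": 0.1}}            # p(C2, H), p(C2, T): assumed values

def forward(obs):
    """Probability of the outputs summed over all state paths."""
    alpha = {s: INIT[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        alpha = {s: EMIT[s][o] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())

def viterbi(obs):
    """Probability and state sequence of the single most probable path."""
    best = {s: (INIT[s] * EMIT[s][obs[0]], [s]) for s in STATES}
    for o in obs[1:]:
        new_best = {}
        for s in STATES:
            prob, path = max(((best[r][0] * TRANS[r][s], best[r][1])
                              for r in STATES), key=lambda t: t[0])
            new_best[s] = (prob * EMIT[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])

tosses = "HHHH"
print(forward(tosses))
print(viterbi(tosses))
```

With these numbers a run of heads is best explained by staying on the heads-biased coin C2, and the Viterbi probability is never larger than the forward probability because it is one term of that sum.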
Hidden Markov Model for Exon and Stop Codon (VEIL Algorithm)
GRAIL gene identification program
[Figure labels: POSSIBLE EXONS, REFINED EXON POSITIONS, FINAL EXON CANDIDATES]
Suboptimal Solutions for the Human Growth Hormone Gene (GeneParser)
Measures of Prediction Accuracy: Nucleotide Level
(c = coding, nc = non-coding)

                     REALITY
                     c      nc
PREDICTION    c      TP     FP
              nc     FN     TN

Sensitivity: Sn = TP / (TP + FN)
Specificity: Sp = TP / (TP + FP)
[Figure: reality and prediction tracks with each nucleotide labeled TP, FP, TN, or FN]
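The nucleotide-level measures can be computed by comparing two per-nucleotide label strings. This is a minimal sketch; the encoding ("c" for coding, "n" for non-coding), the function name, and the example strings are assumptions for illustration.

```python
def nucleotide_accuracy(reality, prediction):
    """Return (Sn, Sp) from per-nucleotide labels: 'c' coding, 'n' non-coding."""
    if len(reality) != len(prediction):
        raise ValueError("label strings must have the same length")
    pairs = list(zip(reality, prediction))
    tp = sum(1 for r, p in pairs if r == "c" and p == "c")
    fp = sum(1 for r, p in pairs if r == "n" and p == "c")
    fn = sum(1 for r, p in pairs if r == "c" and p == "n")
    sn = tp / (tp + fn) if tp + fn else 0.0   # Sn = TP / (TP + FN)
    sp = tp / (tp + fp) if tp + fp else 0.0   # Sp = TP / (TP + FP)
    return sn, sp

reality    = "nncccccnnn"
prediction = "ncccccnnnn"
print(nucleotide_accuracy(reality, prediction))
```

In this example the prediction is shifted by one position, giving four true positives, one false positive, and one false negative. TN does not enter either formula as defined on the slide.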
Measures of Prediction Accuracy: Exon Level
Sensitivity: Sn = number of correct exons / number of actual exons
Specificity: Sp = number of correct exons / number of predicted exons
[Figure: reality and prediction tracks showing a MISSING EXON, a WRONG EXON, and a CORRECT EXON]
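The exon-level measures can be sketched the same way. The slide does not define "correct", "missing", or "wrong" precisely, so the code below assumes that a correct exon matches both boundaries exactly, a missing exon is an actual exon overlapping no prediction, and a wrong exon is a predicted exon overlapping no actual exon. Coordinates are hypothetical inclusive (start, end) pairs.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share any position."""
    return a[0] <= b[1] and b[0] <= a[1]

def exon_accuracy(actual, predicted):
    """Return exon-level Sn, Sp plus missing and wrong exons."""
    actual, predicted = set(actual), set(predicted)
    correct = actual & predicted              # exact match on both boundaries
    sn = len(correct) / len(actual) if actual else 0.0
    sp = len(correct) / len(predicted) if predicted else 0.0
    missing = sorted(a for a in actual
                     if not any(overlaps(a, p) for p in predicted))
    wrong = sorted(p for p in predicted
                   if not any(overlaps(p, a) for a in actual))
    return sn, sp, missing, wrong

actual = [(10, 50), (100, 150), (200, 260)]
predicted = [(10, 50), (100, 140), (300, 350)]
print(exon_accuracy(actual, predicted))
```

Note that the second predicted exon overlaps a real exon but has the wrong end coordinate, so it counts as neither correct nor wrong under these assumptions.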
Bibliography
http://linkage.rockefeller.edu/wli/gene/list.html
http://www-hto.usc.edu/software/procrustes/fans_ref/
Gene Discovery Exercise
http://metalab.unc.edu/pharmacy/Bioinfo/Gene