BCB 444/544 Lecture 33 Genomics #33_Nov09 BCB 444/544 F07 ISU Dobbs#33 - Genomics
Required Reading (before lecture) √ Mon Nov 5 - Lecture 31 Phylogenetics – Parsimony and ML • Chp 11 - pp 142 – 169 √ Wed Nov 7 - Lecture 32 Machine Learning Fri Nov 9 - Lecture 33 Functional and Comparative Genomics • Chp 17 and Chp 18
Assignments & Announcements Fri Nov 9 - HW#6 (will be posted this weekend) HW#6 - More fun with Machine Learning!! Due: Fri Nov 16 (or sometime before Mon Nov 26)
Seminars this Week BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html • Nov 7 Wed - BBMB Seminar 4:10 in 1414 MBB • Sharon Roth Dent, MD Anderson Cancer Center • Role of chromatin and chromatin modifying proteins in regulating gene expression • Nov 8 Thurs - BBMB Seminar 4:10 in 1414 MBB • Jianzhi George Zhang, U. Michigan • Evolution of new functions for proteins • Nov 9 Fri - BCB Faculty Seminar 2:10 in 102 SciI • Amy Andreotti, ISU • T cell signaling: insights from protein NMR spectroscopy
Chp 11 – Phylogenetic Tree Construction Methods and Programs SECTION IV MOLECULAR PHYLOGENETICS Xiong: Chp 11 Phylogenetic Tree Construction Methods and Programs • Distance-Based Methods • Character-Based Methods • Phylogenetic Tree Evaluation • Phylogenetic Programs
Machine Learning • What is learning? • What is machine learning? • Learning algorithms • Machine learning applied to bioinformatics and computational biology • Some slides adapted from Dr. Vasant Honavar and Dr. Byron Olson
Examples of Machine Learning Algorithms • Naïve Bayes (NB) • Bayes Theorem • Neural network (NN) or Artificial Neural Net (ANN) • Perceptrons • Support Vector Machine (SVM) • Kernel functions Lab - WEKA: Decision Trees (DT), NB, SVM
An Application: Predicting RNA Binding Sites in Proteins • Problem: Given an amino acid sequence, classify each residue as RNA binding or non-RNA binding • Input to the classifier is a string of amino acid identities • Output from the classifier is a class label, either binding or not
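One common way to turn "a string of amino acid identities" into one classifier input per residue is a fixed-length window centered on that residue. The sketch below assumes that representation; the function name `residue_windows`, the `half_width` parameter, and the `-` padding character are illustrative choices, not taken from the slides.

```python
def residue_windows(sequence, half_width=5, pad="-"):
    """Return one fixed-length window per residue, centered on that residue.

    Assumption: positions beyond the sequence ends are filled with `pad`.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


# Toy usage with the 11-residue stretch shown on the Naive Bayes example slide.
windows = residue_windows("TSKKKRQRGSR", half_width=2)
print(windows[5])  # window centered on the sixth residue
```

Each window would then be paired with a binding / non-binding label for training.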
Bayes Theorem Applied to RNA Binding Site Prediction
Naïve Bayes for Binary Classification Assign c = 1 if $\frac{\prod_i p(X_i = x_i \mid c = 1)}{\prod_i p(X_i = x_i \mid c = 0)} \ge \theta$; otherwise, assign c = 0
Example: Is ARG 6 RNA-binding or not? Sequence: T S K K K R Q R G S R (ARG 6 is the sixth residue). Predict binding if $\frac{p(X_1 = T \mid c = 1)\, p(X_2 = S \mid c = 1) \cdots}{p(X_1 = T \mid c = 0)\, p(X_2 = S \mid c = 0) \cdots} \ge \theta$
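The likelihood-ratio test on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the classifier used in the lecture: the training windows and labels are made-up toy data, the pseudocount smoothing is an added assumption, and all function names are hypothetical.

```python
from collections import defaultdict


def train_naive_bayes(windows, labels, alphabet, pseudocount=1.0):
    """Estimate p(X_i = a | c) per window position i and class c from counts."""
    length = len(windows[0])
    counts = {c: [defaultdict(float) for _ in range(length)] for c in (0, 1)}
    totals = {0: 0, 1: 0}
    for window, c in zip(windows, labels):
        totals[c] += 1
        for i, aa in enumerate(window):
            counts[c][i][aa] += 1

    def prob(c, i, aa):
        # Pseudocounts (an assumption) keep unseen residues from giving p = 0.
        return (counts[c][i][aa] + pseudocount) / (totals[c] + pseudocount * len(alphabet))

    return prob


def likelihood_ratio(prob, window):
    """Product over positions of p(X_i | c = 1) / p(X_i | c = 0)."""
    ratio = 1.0
    for i, aa in enumerate(window):
        ratio *= prob(1, i, aa) / prob(0, i, aa)
    return ratio


def classify(prob, window, theta=1.0):
    """Assign c = 1 if the ratio reaches the threshold theta, else c = 0."""
    return 1 if likelihood_ratio(prob, window) >= theta else 0


# Made-up toy data: 1 = binding, 0 = non-binding.
toy_windows = ["KRK", "RKR", "KKR", "GSA", "ASG", "SGA"]
toy_labels = [1, 1, 1, 0, 0, 0]
prob = train_naive_bayes(toy_windows, toy_labels, alphabet="KRGSA")
print(classify(prob, "RKK"), classify(prob, "GSA"))
```

Raising theta makes the classifier more conservative about calling a residue RNA-binding; lowering it does the opposite.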
Predicted vs Actual RNA Binding for Ribosomal protein L15 (PDB ID 1JJ2:K) Predicted Actual
Artificial Neural Networks (ANNs or NNs) • Neural networks classify “input vectors” or “examples” into categories (2 or more) • They are loosely based on biological neurons • Some of the most successful methods for predicting secondary structure are based on neural networks • Neural networks are trained to recognize amino acid patterns corresponding to known secondary structure elements; these patterns are used to predict secondary structure type for aa sequences in proteins of unknown structure
Biological Neurons “Sum” Input Signals & Generate Output Signal Dendrites receive inputs, Axon sends output Image from Christos Stergiou and Dimitrios Siganos http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
Simple Neuron = “Perceptron” Perceptron is “Simplest ANN” = feed-forward NN = linear classifier Image from Christos Stergiou and Dimitrios Siganos http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
The Perceptron [Figure: inputs X1 … XN, weights w1 … wN, summation S, threshold T, output F] The perceptron combines the inputs X1 … XN, compares the “sum” S with a threshold T, and generates an output class label: either 1 or 0. If the weights W and threshold T are not known in advance, the perceptron must be trained. Ideally, the perceptron is trained to return the correct answer for all training examples and to perform well on test examples it has never seen. The training set must contain both classes of data (i.e., with “1” and “0” output).
Perceptron “Sums” Inputs by Computing Dot Product S = X·W • Input is a vector X; the weights are another vector W • Perceptron summation S computes the dot product, S = X·W • Perceptron output F is a function of S: it is often discrete (1 or 0), in which case the function is a step function • For continuous output, a sigmoidal function is often used, e.g. the logistic function $F(S) = 1 / (1 + e^{-S})$, which rises from 0 to 1 and equals 1/2 at S = 0
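The dot product and the two output functions can be written out directly. The sketch below is a plain-Python illustration; treating S equal to the threshold as output 1, and shifting the sigmoid by the threshold, are assumptions, since the slide does not specify either.

```python
import math


def dot(x, w):
    """Perceptron summation S = X . W."""
    return sum(xi * wi for xi, wi in zip(x, w))


def step_output(x, w, threshold):
    """Discrete output: 1 if S reaches the threshold T, otherwise 0."""
    return 1 if dot(x, w) >= threshold else 0


def sigmoid_output(x, w, threshold=0.0):
    """Continuous output in (0, 1) using the logistic function of S - T."""
    return 1.0 / (1.0 + math.exp(-(dot(x, w) - threshold)))


# Hand-picked weights that make a step perceptron compute logical AND.
w_and, t_and = (1.0, 1.0), 1.5
print([step_output(x, w_and, t_and) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
```

With weights (1, 1) and threshold 1.5, only the input (1, 1) produces a sum large enough to fire.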
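Training a Perceptron Find the weights W that minimize the error function E: $E = \sum_{i=1}^{P} \left( F(W \cdot X_i) - t(X_i) \right)^2$, where P: number of training examples; Xi: training vectors; F(W·Xi): output of perceptron; t(Xi): target value for Xi. Use steepest descent: - compute gradient: $\nabla E = \left( \partial E / \partial w_1, \ldots, \partial E / \partial w_N \right)$ - update weight vector: $W \leftarrow W - \eta \nabla E$ - iterate ($\eta$: learning rate)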
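Steepest descent for a single perceptron can be sketched as follows. The sketch assumes a sum-of-squared-errors E and a logistic output F, which are common choices but are not confirmed by the slide; the threshold is folded into the weights by giving every example a constant input of 1, and the learning rate and epoch count are arbitrary.

```python
import math


def output(w, x):
    """Logistic perceptron output F(W . X)."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-s))


def error(w, examples, targets):
    """Assumed error function: sum over examples of (F(W . Xi) - t(Xi))^2."""
    return sum((output(w, x) - t) ** 2 for x, t in zip(examples, targets))


def gradient(w, examples, targets):
    """Gradient of the error with respect to each weight."""
    grad = [0.0] * len(w)
    for x, t in zip(examples, targets):
        f = output(w, x)
        common = 2.0 * (f - t) * f * (1.0 - f)  # chain rule through the logistic
        for j, xj in enumerate(x):
            grad[j] += common * xj
    return grad


def train(examples, targets, eta=0.5, epochs=5000):
    """Repeat W <- W - eta * grad(E), starting from all-zero weights."""
    w = [0.0] * len(examples[0])
    for _ in range(epochs):
        grad = gradient(w, examples, targets)
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w


# Logical AND; the third input is a constant 1 acting as a bias term.
examples = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
targets = [0, 0, 0, 1]
w = train(examples, targets)
predictions = [1 if output(w, x) >= 0.5 else 0 for x in examples]
print(predictions)
```

The training set here contains both classes, as the perceptron slide requires; with only one class present the gradient would simply push every output toward that class.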
Artificial Neural Network (ANN) • Artificial neural network • Set of perceptrons • interconnected such that • outputs of some units become inputs of other units • Many topologies are possible! • Can have multiple layers Neural networks are trained in the same way perceptrons are trained, by minimizing an error function.
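To illustrate "outputs of some units become inputs of other units", here is a small two-layer network of step perceptrons with hand-set (not trained) weights. It computes XOR, a function that no single perceptron can represent because it is not linearly separable. The weights and thresholds are illustrative choices.

```python
def perceptron(inputs, weights, threshold):
    """Step perceptron: 1 if the weighted sum reaches the threshold, else 0."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return 1 if s >= threshold else 0


def xor_network(x1, x2):
    """Two hidden units feed one output unit."""
    h_or = perceptron((x1, x2), (1, 1), 0.5)    # fires if either input is 1
    h_and = perceptron((x1, x2), (1, 1), 1.5)   # fires only if both inputs are 1
    # Output unit: OR but not AND, i.e. exactly one input is 1.
    return perceptron((h_or, h_and), (1, -1), 0.5)


print([xor_network(a, b) for a in (0, 1) for b in (0, 1)])
```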
Support Vector Machines - SVMs Image from http://en.wikipedia.org/wiki/Support_vector_machine
SVM Finds Maximum-Margin Hyperplane (i.e., hyperplane that provides maximum separation between two classes of instances in dataset) Image from http://en.wikipedia.org/wiki/Support_vector_machine
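The sketch below does not train an SVM; it only measures the margin of a given hyperplane w·x + b = 0, so that two candidate separators can be compared. The points, labels (+1 / -1), and candidate hyperplanes are made-up toy values.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points, labels):
    """Smallest distance from any point to the hyperplane, counted as
    negative if that point lies on the wrong side."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))


points = [(3, 0), (4, 1), (1, 0), (0, 2)]
labels = [1, 1, -1, -1]

# Two vertical separating lines: x = 2 and x = 1.5.
print(margin((1, 0), -2.0, points, labels))   # line x = 2
print(margin((1, 0), -1.5, points, labels))   # line x = 1.5
```

Both lines separate the two classes, but x = 2 sits midway between the closest points of each class and therefore has the larger margin; an SVM searches for the hyperplane that maximizes this quantity.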
Kernel “Trick”
Kernel Function
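The slide content for these two titles is not preserved here, so the following is a standard textbook illustration rather than the lecture's own example. A kernel function returns the dot product two vectors would have in a higher-dimensional feature space without constructing that space. For 2-D inputs, the degree-2 polynomial kernel K(x, z) = (x·z)^2 matches the explicit feature map phi(x) = (x1^2, sqrt(2)·x1·x2, x2^2).

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit 3-D feature map whose dot product equals poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, 4.0)
print(poly2_kernel(x, z), dot(phi(x), phi(z)))
```

The two printed values agree up to floating-point rounding, which is the point of the "trick": a linear separator in the mapped space can be found using only kernel evaluations on the original inputs.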
Take Home Messages • Must consider how to set up the learning problem (supervised or unsupervised, generative or discriminative, classification or regression, etc.) • Lots of algorithms out there • No algorithm performs best on all problems
Genomics - for excellent overview lectures, see these posted by NHGRI & Pevsner: 1- Genomic sequencing: Mapping and Sequencing, CTGA2005Lecture1.pdf, Eric Green, NHGRI 2- Human genome project: The Human Genome, 2005-10-19_ch17.pdf, Jonathan Pevsner, Kennedy Krieger Institute 3- SNPs: Studying Genetic Variation II: Computational Techniques, Jim Mullikin, NHGRI, CTGA2005Lecture13.pdf 4- Comparative Genomics: Comparative Sequence Analysis, Elliott Margulies, NHGRI, CTGA2005Lecture8.pdf
1- Genomic sequencing Many thanks to: Eric Green, NHGRI for the following slides extracted from his lecture on: Mapping and Sequencing CTGA2005Lecture1.pdf
Genomic Sequencing - Brief Review E Green 2005
Comparison of Sequenced Genome Sizes E Green 2005
Comparison of Genetic & Physical Maps E Green 2005
STSs: Provide common markers for "linking" genetic & physical maps E Green 2005
With complete genomes (now), why bother to generate physical maps? E Green 2005
Genomic sequencing requires assembly of sequences obtained from cloned DNA E Green 2005
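The assembly step can be illustrated with a toy greedy overlap merger. This is a teaching sketch only: the reads are made up, the minimum overlap of 3 is arbitrary, and real genome assemblers handle sequencing errors, both strands, and repeats, none of which this code attempts.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads


# Made-up overlapping reads from a 15-base toy sequence.
contigs = greedy_assemble(["ATGGCGTA", "CGTACGTT", "CGTTAGC"])
print(contigs)
```

Greedy merging can go wrong when the same subsequence occurs more than once in the genome, which is one reason map information (as in the BAC-by-BAC approach) helps assembly.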
Human Genome Sequencing • Two approaches: • Public (government) - International Consortium • (6 countries, NIH-funded in US) • "Hierarchical" cloning & BAC-by-BAC sequencing • Map-based assembly • Private (industry) - Celera (Craig Venter) • Whole genome random "shotgun" sequencing • Computational assembly • (took advantage of public maps & sequences, too) • Guess which human genome Celera sequenced?
NIH: "Hierarchical" BAC-by-BAC Sequencing E Green 2005
"Hierarchical" Subcloning Strategy E Green 2005
Celera: Whole-Genome "Shotgun" Sequencing E Green 2005
"Shotgun" Sequencing Strategy E Green 2005
Either Strategy: Sequence "Finishing" = Hardest part!! E Green 2005
Advances in DNA Sequencing Technology E Green 2005
Sequencing Method #1: Maxam-Gilbert "Chemical Degradation" E Green 2005
Sequencing Method #2: Sanger "Di-deoxy Chain Termination" E Green 2005
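The logic of reading a sequence from chain-terminated fragments can be mimicked in a few lines. This is an idealized sketch: it assumes one reaction per dideoxy base, that each reaction yields a fragment ending at every occurrence of that base in the newly synthesized strand, and that fragments are perfectly resolved by length. The function names and the example strand are illustrative.

```python
def chain_termination_fragments(strand):
    """For each dideoxy base, list the fragment lengths at which synthesis stops."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(strand, start=1):
        lanes[base].append(position)
    return lanes


def read_gel(lanes):
    """Sort all fragments by length and read off the terminating base of each."""
    by_length = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in by_length)


lanes = chain_termination_fragments("GATTACA")
print(lanes["A"], read_gel(lanes))
```

Reading the shortest fragment first recovers the synthesized strand from its 5' end onward.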
Automated Sequencing for Genome Projects: Sanger method - with improvements. Another “recent” improvement: rapid & high resolution separation of fragments in capillaries instead of gels (E Yeung, Ames Lab, ISU) E Green 2005
Recent technologies? Pyro- & 454 Sequencing
1st Eukaryotic Genome Sequence: S. cerevisiae E Green 2005
1st Animal Genome Sequence: C. elegans E Green 2005
Timetable for Human Genome Sequencing: Faster than expected! E Green 2005
1st Draft Human Genome: "Complete" in 2001 E Green 2005
Public Sequencing - International Consortium E Green 2005
"Finishing" the Human Genome - continues… E Green 2005
After "Complete" Human Genome Sequence What next? E Green 2005