1 / 69

Master’s course Bioinformatics Data Analysis and Tools

C. E. N. T. E. R. F. O. R. I. N. T. E. G. R. A. T. I. V. E. B. I. O. I. N. F. O. R. M. A. T. I. C. S. V. U. Master’s course Bioinformatics Data Analysis and Tools. Lecture 1: Introduction Centre for Integrative Bioinformatics FEW/FALW heringa@few.vu.nl.

keith
Download Presentation

Master’s course Bioinformatics Data Analysis and Tools

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. C E N T E R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Master’s courseBioinformatics Data Analysis and Tools Lecture 1: Introduction Centre for Integrative Bioinformatics FEW/FALW heringa@few.vu.nl

  2. Course objectives • There are two extremes in bioinformatics work • Tool users (biologists): know how to press the buttons and know the biology but have no clue what happens inside the program • Tool shapers (informaticians): know the algorithms and how the tool works but have no clue about the biology Both extremes can be dangerous at times, need a breed that can do both

  3. Course objectives • How do you become a good bioinformatics problem solver? • You need to know basic analysis and data mining modes • You need to know some important backgrounds of analysis and prediction techniques (e.g. statistical thermodynamics) • You need to have knowledge of what has been done and what can be done (and what not) • Is this enough to become a creative tool developer? • Need to like doing it • Experience helps

  4. Course objectives The most important thing in tools creating (and science in general) • Be able to ask the proper question! • Should address a real problem • Should be targeted • Should be solvable • Bioinformatics challenge: from Genomics to Systems Biology • Bottom up: start at the components, assemble and learn the system • Top down: observe the system behaviour, model and learn the details • What about bottom down or top up questions?

  5. Contents (tentative dates) Date Lecture Title Lecturer 1 [wk 19] 01/04/08 Introduction Jaap Heringa 2 [wk 19] 03/04/08 Microarray data analysis Jaap Heringa 3 [wk 20] 08/04/08 Machine learning Jaap heringa 4 [wk 21] 10/04/08 Clustering algorithms Bart van Houte 5 [wk 21] 15/04/08 Feature Selection  Bart van Houte 6[wk 23] 17/04/08 Molecular Simulation & Sampling Techniques Anton Feenstra 7[wk 23] 22/04/08 Introduction to Statistical Thermodynamics I Anton Feenstra 8[wk 24] 24/04/08 Introduction to Statistical Thermodynamics II Anton Feenstra 9[wk 24] 06/05/08 Databases and parsing Sandra Smit 10[wk 24] 08/05/08 Semantic Web and Ontologies Frank van Harmelen 11[wk 25] 13/05/08 Parallelisation& Grid Computing Thilo Kielmann 12[wk 25] 15/05/08 Application area I: Protein Domain Prediction Jaap Heringa13[wk 25] 20/05/08 Application Area II: Repeats Detection Jaap Heringa

  6. At the end of this course… • You will have seen a couple of algorithmic examples • You will have got an idea about methods used in the field • You will have a firm basis of the physics and thermodynamics behind a lot of processes and methods • You will have learned about state-of-the-art computational issues, such as Semantic Web and HTP Computing • You will have an idea of and some experience as to what it takes to shape a bioinformatics tool

  7. Bioinformatics “Studying informatic processes in biological systems” (Hogeweg) “Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith) Applying algorithms and mathematical formalisms to biology (genomics)

  8. This course • General theory of crucial algorithms (GA, NN, HMM, SVM, etc..) • Method examples • Research projects within own group • Repeats • Domain boundary prediction • Physical basis of biological processes and of (stochastic) tools

  9. Bioinformatics Bioinformatics Large - external (integrative) ScienceHuman Planetary Science Cultural Anthropology Population BiologySociology SociobiologyPsychology Systems Biology BiologyMedicine Molecular Biology Chemistry Physics Small – internal (individual)

  10. Genomic Data Sources • DNA/protein sequence • Expression (microarray) • Proteome (xray, NMR, • mass spectrometry) • PPI • Metabolome • Physiome (spatial, • temporal) Integrative bioinformatics

  11. Protein structural data explosion Protein Data Bank (PDB): 14500 Structures (6 March 2001) 10900 x-ray crystallography, 1810 NMR, 278 theoretical models, others...

  12. Bioinformatics inspiration and cross-fertilisation Chemistry Biology Molecular biology Mathematics Statistics Bioinformatics Computer Science Informatics Medicine Physics

  13. Joint international programming initiatives • Bioperl http://www.bioperl.org/wiki/Main_Page http://bioperl.org/wiki/How_Perl_saved_human_genome • Biopython http://www.biopython.org/ • BioTcl http://wiki.tcl.tk/12367 • BioJava www.biojava.org/wiki/Main_Page

  14. Algorithms in bioinformatics • string algorithms • dynamic programming • machine learning (NN, k-NN, SVM, GA, ..) • Markov chain models • hidden Markov models • Markov Chain Monte Carlo (MCMC) algorithms • stochastic context free grammars • EM algorithms • Gibbs sampling • clustering • tree algorithms (suffix trees) • graph algorithms • text analysis • hybrid/combinatorial techniques and more… Some techniques can be reapplied to many different problems, e.g. clustering, NN, etc.

  15. Algorithms in bioinformatics • Approaches in methods can be reapplied, e.g. kowledge-based (inverse) prediction • Different approaches can be combined, e.g. consensus prediction, metaservers

  16. PSI-BLAST iteration Run query sequence against database Q Run PSSM against database hits T DB PSSM Discarded sequences

  17. Fold recognition by threading: THREADER and GenTHREADER Fold 1 Fold 2 Fold 3 Fold N Query sequence Compatibility scores

  18. Polutant recognition by microarray mapping: Cond. 1 Cond. 2 Cond. 3 Cond. N Contaminant 1 Contaminant 2 Query array Contaminant 3 Compatibility scores Contaminant N

  19. Protein-protein interaction prediction • Two extreme approaches, phlogenetic prediction and Molecular Dynamics (MD) simulation • Mesoscopic modelling • Soft-core Molecular Dynamics (MD) • Fuzzy residues • Fuzzy (surface) locations

  20. ENFIN WP5 - BioRange (Anton Feenstra) • Protein-protein interaction prediction • Mesoscopic modelling • Soft-core Molecular Dynamics (MD) • Fuzzy residues • Fuzzy (surface) locations

  21. Where are important new questions?

  22. New neighbouring disciplines • Translational Medicine A branch of medical research that attempts to more directly connect basic research to patient care. Translational medicine is growing in importance in the healthcare industry, and is a term whose precise definition is in flux. In particular, in drug discovery and development, translational medicine typically refers to the "translation" of basic research into real therapies for real patients. The emphasis is on the linkage between the laboratory and the patient's bedside, without a real disconnect. This is often called the "bench to bedside" definition. • Computational Systems Biology Computational systems biology aims to develop and use efficient algorithms, data structures and communication tools to orchestrate the integration of large quantities of biological data with the goal of modeling dynamic characteristics of a biological system. Modeled quantities may include steady-state metabolic flux or the time-dependent response of signaling networks. Algorithmic methods used include related topics such as optimization, network analysis, graph theory, linear programming, grid computing, flux balance analysis, sensitivity analysis, dynamic modeling, and others. • Neuro-informatics Neuroinformatics combines neuroscience and informatics research to develop and apply the advanced tools and approaches that are essential for major advances in understanding the structure and function of the brain

  23. Translational Medicine • “From bench to bed side” • Genomics data to patient data • Integration

  24. Natural progression of a gene

  25. Functional GenomicsFrom gene to function Genome Expressome Proteome TERTIARY STRUCTURE (fold) Metabolome TERTIARY STRUCTURE (fold)

  26. Systems Biology is the study of the interactions between the components of a biological system, and how these interactions give rise to the function and behaviour of that system (for example, the enzymes and metabolites in a metabolic pathway). The aim is to quantitatively understand the system and to be able to predict the system’s time processes • the interactions are nonlinear • the interactions give rise to emergent properties, i.e. properties that cannot be explained by the components in the system

  27. Systems Biology understanding is often achieved through modeling and simulation of the system’s components and interactions. Many times, the ‘four Ms’ cycle is adopted: Measuring Mining Modeling Manipulating

  28. Neuroinformatics • Understanding the human nervous system is one of the greatest challenges of 21st century science. • Its abilities dwarf any man-made system - perception, decision-making, cognition and reasoning. • Neuroinformatics spans many scientific disciplines - from molecular biology to anthropology.

  29. Neuroinformatics • Main research question: How does the brain and nervous system work? • Main research activity: gathering neuroscience data, knowledge and developing computational models and analytical tools for the integration and analysis of experimental data, leading to improvements in existing theories about the nervous system and brain. • Results for the clinic:Neuroinformatics provides tools, databases, models, networks technologies and models for clinical and research purposes in the neuroscience community and related fields.

  30. Bioinformatics algorithms For problems such as alignenment, secondary/tertiary structure prediction, phylogenetic tree determination, etc. Algorithmic main components: • Search function • Scoring function

  31. Bioinformatics algorithms • Search function • Search space can be large • High time complexity of algorithms, or even NP-complete/NP-hard problems • Scoring function • Also called the Objective Function • Often the most important

  32. Pair-wise alignmentComplexity of the problem T D W V T A L K T D W L - - I K Combinatorial explosion - 1 gap in 1 sequence: n+1 possibilities - 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc. 2n (2n)! 22n = ~ n (n!)2 n 2 sequences of 300 a.a.: ~1088 alignments 2 sequences of 1000 a.a.: ~10600 alignments!

  33. Levinthal’s paradox (1969) • Denatured protein refolds in ~ 0.1 – 1000 seconds • Protein with e.g. 100 amino acids each with 2 torsions (f en y) • Each can assume 3 conformations (1 trans, 2 gauche) • 3100x2 1095 possible conformations! • Or: 100 amino acids with 3 possibilities in Ramachandran plot (a, b, coil): 3100 » 1047 conformations • If the protein can visit one conformation in one ps (10-12s) exhaustive search costs 1047 x 10-12 s = 1035 s  1027 years! • (the lifetime of the universe  1010 years…)

  34. Phylogenetic tree combinatorial explosion Number of unrooted trees = Number of rooted trees =

  35. Combinatoric explosion # sequences # unrooted # rooted trees trees 2 1 1 3 1 3 4 3 15 5 15 105 6 105 945 7 945 10,395 8 10,395 135,135 9 135,135 2,027,025 10 2,027,025 34,459,425

  36. QUERY DATABASE A recap on Scoring and Benchmarking True Positive True Positive True Negative POSITIVES False Positive T NEGATIVES False Negative True Negative

More Related