
Introduction to Bioinformatics

Juris Viksna, IMCS UL, 2014. Alvis Brazma, European Bioinformatics Institute.


Presentation Transcript


  1. Introduction to Bioinformatics Juris Viksna, IMCS UL 2014 Alvis Brazma, European Bioinformatics Institute

  2. Planned course schedule Regular lectures: Wednesdays, 14:30-16:00. Most likely the first 4 (or so) lectures will stick to this schedule, followed by a break in March and a restart in April, with replacement lectures scheduled sometime in April-May. I will try my best to invite guest lecturers (most likely this will involve rescheduled lecture times), but this depends on options that might (or might not) become available.

  3. Course requirements To obtain credit for this course you must: - submit a programming project (worth 50% of the grade); - take a (written) exam (worth 50% of the grade). Course web page: http://susurs.mii.lu.lv/juris/courses/bi2014.html

  4. Bioinformatics • Databases and tools to store and access biomolecular data • Sequence algorithms – assembly from short fragments, alignment of similar sequences, analysis of properties • Evolution and phylogenetics • 3D structure analysis of biomolecules • Machine learning and data mining application to genome and related information • Biomolecular interaction analysis (e.g., protein interactions) • Dynamic systems, modelling of biological networks and systems • Analysis of noisy measurement data, statistical analysis • Data management, databases, interfaces, web services • Links with health records, biomedical informatics

  5. Why might bioinformatics be important for you? • This is a growing science involving an increasing number of computer professionals (e.g., the 1000 human genomes project has just started) • Links with medical and health informatics information systems: a growing and important market for software • Latvian genome project and participation in European genotyping projects: software experts who understand the underlying problems are needed

  6. Topics covered in this course: • Introduction to biology as an information science • Overview of some bioinformatics problems • Biosequence and structure analysis, molecular evolution and phylogeny, etc. • Genomics: DNA assembly, haplotypes, etc. • Gene regulation network modelling (graph theory, Boolean networks, dynamic systems) • Analysis of gene expression data, cluster analysis, data mining and analysis • Data management and analysis for biomedical studies • Some new, recently evolving topics (time and material availability permitting...)

  7. FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks)

  8. FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks) • Biological information: its diversity and volume • Biology, statistics, information technology and programming as the basic elements of bioinformatics • Genome organization and evolution • Comparative genomics • Biological information databases. Information search and retrieval systems

  9. FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks) • Basic principles of nucleic acid and protein sequence similarity. Different comparison methods, their advantages and conditions of use • Phylogenetics. Cluster and cladistic methods for reconstructing phylogenetic trees • Genome expression analysis • DNA chips in genome polymorphism analysis. Genetics of gene expression

  10. FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks) • DNA topology, protein structure, methods for predicting it and their application in pharmacology • Proteomics and systems biology. Network structures as a natural component of biological systems • Prospects of bioinformatics. Bioinformatics as a prerequisite for learning modern biology

  11. NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, July 17, 2000 • Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualize such data. • Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, behavioural, and social systems.

  12. Human Genome Project • Began in 1990 in the US • The primary goal: to sequence the human DNA, which is about 3 billion base pairs long • A working draft of the genome was released in 2000 • Finished in 2003, with further analysis still being published

  13. The results of HGP • A sequence 3 billion letters long over the four letters A, T, G and C, containing all the inherited human information • Genomes of many other organisms • Development of biotechnology that allows not only sequencing DNA but also studying the function of different biomolecules, producing many TB of data • Databases storing this information (GenBank and the EMBL Data Library) • Data analysis and management needs leading to the emergence and development of bioinformatics. Things have, however, recently changed again: with NGS technologies, sequencing a specific individual has become affordable, with direct implications for the amount of data that needs to be stored and/or analyzed.

  14. All you need to know about Molecular Biology 

  15. One of the first textbooks in bioinformatics: MIT Press, 2000

  16. A few other textbooks for «Computer Scientists»: MIT Press, 2004; Chapman and Hall/CRC, 1995

  17. Some bioinformatics problems from the perspective of Computer Science Genome sequencing and assembly

  18. Genome sequencing and assembly E.Green (2001) Strategies for the systematic sequencing of complex genomes. Nat Rev Genetics, Vol 2:8, 573-583.

  19. Ensembl genome browser

  20. Genome sequence assembly

  21. Genome sequence assembly

  22. Sequence assembly problem OK, let us assume that we have these hybridizations. How can we reconstruct the initial DNA sequence from them? Affymetrix GeneChip. W. Bains, C. Smith (1988) A novel method for nucleic acid sequence determination. Journal of Theoretical Biology, Vol. 135:3, 303-307.

  23. Sequence assembly problem OK, let us assume that we have these hybridizations. How can we reconstruct the initial DNA sequence from them?

  24. SBH – Hamiltonian path approach

  25. SBH – Hamiltonian path approach

  26. Hamiltonian path (cycle) problem For a given graph, find a path (cycle) that visits every vertex exactly once (or show that no such path exists). Unfortunately, the problem is known to be NP-hard. That means that no polynomial-time algorithm is known, and the known exact algorithms take exponential time in the worst case, which can become impractical already for comparatively small graphs.
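The Hamiltonian path view of SBH can be illustrated with a minimal brute-force sketch. It assumes an idealized, error-free spectrum in which every k-mer occurs once; each k-mer is a vertex, and vertex u is joined to v when the last k-1 letters of u equal the first k-1 letters of v. The function name `sbh_hamiltonian` and the example spectrum are made up for illustration. The search tries every vertex ordering, so its running time grows factorially with the number of k-mers.

```python
from itertools import permutations


def sbh_hamiltonian(spectrum):
    """Return a sequence whose k-mers are exactly `spectrum`, or None.

    Brute force: try every ordering of the k-mers (vertices) and accept
    the first one in which each consecutive pair overlaps by k-1 letters.
    """
    kmers = list(spectrum)
    for order in permutations(kmers):
        if all(a[1:] == b[:-1] for a, b in zip(order, order[1:])):
            # Spell the sequence: first k-mer plus the last letter of the rest.
            return order[0] + "".join(b[-1] for b in order[1:])
    return None


# Hypothetical spectrum of 3-mers.
print(sbh_hamiltonian(["GCA", "ATG", "GGC", "TGG"]))  # ATGGCA
```

Already at 12 k-mers the loop may have to examine 12! (about 479 million) orderings, which is why this formulation does not scale.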

  27. SBH – Eulerian path approach

  28. Eulerian path (cycle) problem For a given graph, find a path (cycle) that visits every edge exactly once (or show that no such path exists).

  29. Eulerian path (cycle) problem A connected graph has an Eulerian cycle if and only if every vertex has even degree. Moreover, there is a simple linear-time algorithm for finding an Eulerian cycle. Problem statement: for a given graph, find a path (cycle) that visits every edge exactly once (or show that no such path exists).
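The Eulerian formulation of SBH can be sketched as follows. This is an illustrative sketch, not the lecture's own code, and it assumes the directed variant of the graph: each k-mer is an edge from its (k-1)-letter prefix to its (k-1)-letter suffix, and the spectrum is error-free. The path is found with Hierholzer's stack-based algorithm, which touches each edge once; the function name `sbh_eulerian` is hypothetical.

```python
from collections import Counter, defaultdict


def sbh_eulerian(spectrum):
    """Return a sequence whose k-mers are exactly `spectrum`, or None."""
    kmers = list(spectrum)
    if not kmers:
        return None
    k = len(kmers[0])

    # Each k-mer is an edge: (k-1)-prefix -> (k-1)-suffix.
    graph = defaultdict(list)
    indegree = defaultdict(int)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        indegree[kmer[1:]] += 1

    # An Eulerian path starts where out-degree exceeds in-degree by one;
    # if no such vertex exists (a cycle), any vertex with an edge will do.
    nodes = list(graph)
    start = next((v for v in nodes if len(graph[v]) - indegree[v] == 1), nodes[0])

    # Hierholzer's algorithm: walk until stuck, then back up.
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    # Spell the sequence and verify it reproduces the spectrum exactly;
    # the check fails when the graph has no Eulerian path.
    sequence = path[0] + "".join(v[-1] for v in path[1:])
    recovered = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return sequence if recovered == Counter(kmers) else None


print(sbh_eulerian(["GCA", "ATG", "GGC", "TGG"]))  # ATGGCA
```

When a spectrum admits several Eulerian paths, this sketch returns just one of them; the reconstruction is then ambiguous.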

  30. Next Generation Sequencing (Illumina) In the case of de novo sequencing we have essentially the same fragment assembly problem as for SBH, only the number of DNA sequence fragments is much higher and their size is larger (~50-150 bp).

  31. Sequence mappers

  32. Sequence assembly – deBruijn graphs

  33. Sequence assembly – de Bruijn graphs D. Zerbino, E. Birney (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, Vol. 18:5, 821-829.
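A de Bruijn graph can be built from reads in a few lines. The sketch below is a bare-bones illustration, not Velvet's implementation: it assumes error-free reads over A, C, G, T, ignores reverse complements and k-mer multiplicities, and uses a hypothetical function name `de_bruijn`. Vertices are (k-1)-mers, and every k-mer found in a read contributes an edge from its prefix to its suffix.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two hypothetical overlapping reads; shared k-mers collapse into one edge.
for node, successors in de_bruijn(["ATGGC", "GGCA"], 3).items():
    print(node, "->", sorted(successors))
```

Overlapping reads are merged implicitly because identical k-mers map to the same edge, so no pairwise read comparison is needed.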

  34. All you need to know about Molecular Biology 

  35. Central dogma of molecular biology DNA → (transcription) → RNA → (translation) → Protein
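The two steps of the central dogma can be mimicked on strings. This is a simplified sketch: it assumes the input is the coding strand (so transcription is just T replaced by U), uses the standard genetic code, reads from the first letter in a single frame, and stops at the first stop codon. The names `transcribe` and `translate` are chosen for illustration.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand):
    """DNA coding strand -> mRNA (same letters, with U in place of T)."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """mRNA -> protein, one letter per amino acid, until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGTTTGGATGA")
print(mrna)             # AUGUUUGGAUGA
print(translate(mrna))  # MFG
```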

  36. DNA Four different nucleotides: adenine, guanine, cytosine and thymine. They are usually referred to as bases and denoted by their initial letters: A, C, G and T.
      5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
         | | | | | | | | | | | | | | |
      3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'
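Because A pairs with T and C pairs with G, one strand determines the other. A small sketch of this, using the example strand from the slide (the function names are chosen for illustration):

```python
# Watson-Crick pairing: A<->T, C<->G.
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Base-by-base complement, in the same left-to-right order."""
    return strand.translate(PAIRING)


def reverse_complement(strand):
    """The opposite strand, read in its own 5'->3' direction."""
    return complement(strand)[::-1]


top = "CGATTGCAACGATGC"           # 5'->3'
print(complement(top))           # GCTAACGTTGCTACG (3'->5', as drawn)
print(reverse_complement(top))   # GCATCGTTGCAATCG (5'->3')
```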

  37. DNA

  38. DNA - Biology as an information science
      5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
         | | | | | | | | | | | | | | |
      3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'
      Thus, for many information-related purposes, the molecule can be represented as CGATTGCAACGATGC. With four possible letters per position, the maximal amount of information that can be encoded in such a molecule is 2 bits times the length of the sequence. Noting that the distance between nucleotide pairs in DNA is about 0.34 nm, so that 1 cm holds about 3x10^7 base pairs, we can calculate that the linear information storage density in DNA is about 6x10^7 bits/cm, which is roughly 7 MB per cm, or about one hundredth of a CD-ROM.
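The density estimate can be rechecked in a few lines. The sketch assumes only the two figures given on the slide (0.34 nm between base pairs, 2 bits per base) and the conversion 1 cm = 10^7 nm; the constant and function names are chosen for illustration. Under these assumptions the result is close to 6x10^7 bits, or roughly 7 MB, per centimetre.

```python
NM_PER_CM = 1e7          # 1 cm = 10^7 nm
BASE_SPACING_NM = 0.34   # distance between neighbouring base pairs
BITS_PER_BASE = 2        # log2(4) for the alphabet A, C, G, T


def bits_per_cm(spacing_nm=BASE_SPACING_NM, bits_per_base=BITS_PER_BASE):
    """Linear information density of DNA in bits per centimetre."""
    bases_per_cm = NM_PER_CM / spacing_nm
    return bases_per_cm * bits_per_base


density = bits_per_cm()
print(f"{density:.2e} bits/cm")           # 5.88e+07 bits/cm
print(f"{density / 8 / 1e6:.2f} MB/cm")   # 7.35 MB/cm
```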

  39. DNA replication – copying the information

  40. Polymerase chain reaction – PCR – Xeroxing the DNA

  41. Genome sequencing • Reading the nucleotides in the DNA molecule and storing the readout in a computer • Basic technology ideas • A version of PCR • Separation of molecules by chemical properties such as weight or length of the DNA • Molecule labelling, and fluorescent labelling in particular • DNA fragmentation into pieces of random length

  42. Anatomy of a chromosome • Centromeres are the largest constriction of the chromosome • Site of attachment of spindle fibers • Hundreds of thousands of copies of a 171-base-pair repeat, called alpha satellite sequences • Centromere-associated proteins are bound [Adapted from R. Yasbin]

  43. Genomes, chromosomes A genome is a set of DNA molecules. Each chromosome contains one (long) DNA molecule. The 23 human chromosomes

  44. Genome sizes Information in the human genome: up to about 0.75 GB (3 billion base pairs at 2 bits each)

  45. www.ensembl.org

  46. Genomes and genes [Figure: a gene on the DNA with control statements (including the TATA box), start, gene body and termination (stop). Transcription (RNA polymerase) produces the mRNA with 5' UTR, ribosome binding site and 3' UTR. Translation (ribosome) produces the protein.]

  47. Chromosomes - Eukaryotes

  48. Chromosomes - Prokaryotes

  49. (Eukaryotic) cell [Adapted from Online Biology Book]

  50. (Prokaryotic) cell
