Introduction to Bioinformatics

Juris Viksna, IMCS UL 2019
Alvis Brazma, European Bioinformatics Institute

Planned course schedule
Regular lecture times: Thursdays 16:30-18:00 and 18:15-19:45, auditorium 413, in the second week of September and every two weeks thereafter (i.e. on 12.09, 26.09, 10.10, 24.10, 07.11, 21.11, 05.12, 19.12).
It is likely that a few lectures will be rescheduled (hopefully not too many). The dates and times of these (and of replacement lectures) will be announced when known.
I will try my best to invite guest lecturers (quite likely this will involve rescheduled lecture times), but this depends on options that may or may not become available.

Course requirements
To obtain credit for this course you must:
- submit a programming project (worth 50% of grade) or
- submit a ‘data analysis’ project (worth 50% of grade)
- take a (written) exam (open book, open internet :) (worth 50% of grade).
Course webpage: http://susurs.mii.lu.lv/juris/courses/bi2019.html

Topics from the original (A. Brazma 2008) bioinformatics course
The subjects covered during the course will be roughly distributed as follows:
• Biology as information science (4 hours)
• Genome sequencing and architecture (4 hours)
• Discrete vs. continuous problems in bioinformatics (2 hours)
• Gene expression data analysis (2 hours)
• Comparison of protein sequences - algorithms and heuristics (4 hours)
• Phylogenetic trees (4 hours)
• Modelling and comparison of protein structures (2 hours)
• Comparative genomics (2 hours)
• Supervised learning approaches to data analysis (2 hours)
• Gene networks and methods for their analysis (4 hours)
• Biomedical informatics (2 hours)

Bioinformatics
• Databases and tools to store and access biomolecular data
• Sequence algorithms – assembly from short fragments, alignment of similar sequences, analysis of properties
• Evolution and phylogenetics
• 3D structure analysis of biomolecules
• Machine learning and data mining application to genome and related information
• Biomolecular interaction analysis (e.g., protein interactions)
• Dynamic systems, modelling of biological networks and systems
• Analysis of noisy measurement data, statistical analysis
• Data management, databases, interfaces, web services
• Links with health records, biomedical informatics

Why might bioinformatics be important for you?
• This is a growing science involving an increasing number of computer professionals (e.g., the 1000-human genome project has just started)
• Links with medical and health informatics information systems – a growing and important market for software
• Latvian genome project and participation in European genotyping projects – software experts who understand the underlying problems are needed

Topics covered in this course:
• Introduction to biology as information science
• Overview of some bioinformatics problems
• Bio sequence and structure analysis, molecular evolution and phylogeny, etc.
• Genomics – DNA assembly, haplotypes, etc.
• Gene regulation network modelling (graph theory, Boolean networks, dynamic systems)
• Analysis of gene expression data, cluster analysis, data mining and analysis
• Some recently evolving topics (time and material availability permitting...)

FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks)
• Biological information: its diversity and volume
• Biology, statistics, information technology and programming as the basic elements of bioinformatics
• Genome organization and evolution
• Comparative genomics
• Biological information databases. Information search and retrieval systems
• Basic principles of nucleic acid and protein sequence similarity. Various comparison methods, their advantages and conditions of use
• Phylogenetics. Cluster and cladistic methods in the reconstruction of phylogenetic trees
• Genome expression analysis
• DNA chips in the analysis of genome polymorphism. Genetics of gene expression
• DNA topology, protein structure, methods for its prediction and their application in pharmacology
• Proteomics and systems biology. Network structures as a natural component of biological systems
• Prospects of bioinformatics. Bioinformatics as a prerequisite for learning modern biology

NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, July 17, 2000
• Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualize such data.
• Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, behavioural, and social systems.

Human Genome Project
• Began in 1990 in the US
• The primary goal – to sequence the human DNA, about 3 billion base pairs long
• A working draft of the genome was released in 2000
• Finished in 2003, with further analysis still being published

The results of HGP
• A sequence 3 billion letters long, written in four letters (A, T, G and C) and containing all human inherited information
• Genomes of many other organisms
• Development of biotechnology that allows not only sequencing DNA but also studying the function of different biomolecules, producing many TB of data
• Databases storing this information (GenBank and the EMBL data library)
• Data analysis and management needs, leading to the emergence and development of bioinformatics
Things have, however, recently changed again: with NGS technologies, sequencing a specific individual has become affordable, with direct implications for the amount of data that needs to be stored and/or analyzed.

All you need to know about Molecular Biology 

One of the first textbooks in bioinformatics
MIT Press, 2000

A few other textbooks for «Computer Scientists»
MIT Press, 2004; Chapman and Hall/CRC, 1995

A few other textbooks for «Computer Scientists»
Cambridge University, 2015; CRC, 2017

A few other textbooks
Cambridge University, 2009; Oxford University, 2002

Some bioinformatics problems from the perspective of Computer Science Genome sequencing and assembly

Genome sequencing and assembly E.Green (2001) Strategies for the systematic sequencing of complex genomes. Nat Rev Genetics, Vol 2:8, 573-583.

Ensembl genome browser

Genome sequence assembly

Sequence assembly problem
Ok, let us assume that we have these hybridizations. How can we reconstruct the initial DNA sequence from them?
Affymetrix GeneChip
W. Bains, C. Smith (1988) A novel method for nucleic acid sequence determination. Journal of Theoretical Biology, Vol. 135:3, 303-307.


SBH – Hamiltonian path approach

Hamiltonian path (cycle) problem
For a given graph, find a path (cycle) that visits every vertex exactly once (or show that no such path exists). Unfortunately, the problem is known to be NP-hard: no polynomial-time algorithm is known, and exact methods become impractical already for comparatively small graphs.
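A minimal sketch of the Hamiltonian path approach to SBH, assuming an error-free spectrum of distinct l-mers. Each l-mer is a vertex, and there is an edge u → v when the last l-1 letters of u equal the first l-1 letters of v. The function names and the toy spectrum are illustrative assumptions, and the search is plain backtracking, so it takes exponential time in the worst case.

```python
def overlap_graph(kmers):
    """Directed graph: edge u -> v if the (k-1)-suffix of u equals the (k-1)-prefix of v."""
    return {u: [v for v in kmers if u != v and u[1:] == v[:-1]] for u in kmers}


def hamiltonian_path(graph):
    """Return one path visiting every vertex exactly once, or None (backtracking search)."""
    n = len(graph)

    def extend(path, visited):
        if len(path) == n:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None

    for start in graph:
        found = extend([start], {start})
        if found:
            return found
    return None


def spell(path):
    """Merge consecutive k-mers that overlap in k-1 letters into one sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


spectrum = ["ATG", "TGC", "GCA", "CAT", "ATT"]  # toy spectrum (made up for illustration)
path = hamiltonian_path(overlap_graph(spectrum))
print(spell(path))
```

On this toy spectrum the only Hamiltonian path spells ATGCATT; real spectra with repeats can admit several reconstructions.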

SBH – Eulerian path approach

Eulerian path (cycle) problem
For a given graph, find a path (cycle) that visits every edge exactly once (or show that no such path exists).

A connected undirected graph has an Eulerian cycle if and only if every vertex has even degree; a connected directed graph, as used in SBH, has one if and only if every vertex has equal in-degree and out-degree. Moreover, there is a simple linear-time algorithm for finding an Eulerian cycle.
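A minimal sketch of the Eulerian path approach to SBH, assuming an error-free spectrum. Each l-mer becomes a directed edge from its (l-1)-prefix to its (l-1)-suffix, and the path is found with Hierholzer's algorithm, one standard linear-time method. The function names and the toy spectrum are illustrative assumptions.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Hierholzer's algorithm for a directed multigraph given as (u, v) pairs.

    Returns the vertices along a path that uses every edge exactly once,
    or None if no such path exists.
    """
    if not edges:
        return []
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1

    starts = [v for v, b in balance.items() if b == 1]
    if any(abs(b) > 1 for b in balance.values()) or len(starts) > 1:
        return None

    # Start at the unique vertex with one extra outgoing edge, if there is one.
    start = starts[0] if starts else edges[0][0]
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # If some edge was never reached, the graph is not connected.
    return path if len(path) == len(edges) + 1 else None


def sbh_reconstruct(spectrum):
    """Spell the sequence along an Eulerian path of the (l-1)-mer graph."""
    path = eulerian_path([(kmer[:-1], kmer[1:]) for kmer in spectrum])
    if path is None:
        return None
    return path[0] + "".join(v[-1] for v in path[1:])


print(sbh_reconstruct(["ATG", "TGC", "GCA", "CAT", "ATT"]))  # toy spectrum
```

Unlike the Hamiltonian formulation, the work here is proportional to the number of l-mers, which is why this formulation scales to large spectra.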

Next Generation Sequencing (Illumina)
In the case of de novo sequencing we have essentially the same fragment assembly problem as for SBH, only the number of DNA sequence fragments is much higher and the fragments are longer (~50-150 bp).

Sequence mappers

Sequence assembly – de Bruijn graphs

Sequence assembly – de Bruijn graphs
D. Zerbino, E. Birney (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, Vol. 18:5, 821-829.
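A minimal sketch of how a de Bruijn graph can be built from short reads, assuming error-free reads from a single strand. Every k-mer in a read contributes an edge from its (k-1)-prefix to its (k-1)-suffix, and edge multiplicities are counted. This is a simplified illustration, not Velvet's actual implementation; the function names and the toy reads are assumptions.

```python
from collections import Counter


def kmers(read, k):
    """All substrings of length k, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Map each edge (prefix, suffix) to the number of times its k-mer occurs in the reads."""
    edge_counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            edge_counts[(kmer[:-1], kmer[1:])] += 1
    return edge_counts


reads = ["ATGCA", "GCATT", "TGCAT"]  # toy reads (made up for illustration)
graph = de_bruijn_graph(reads, k=3)
for (u, v), count in sorted(graph.items()):
    print(f"{u} -> {v}  (seen {count}x)")
```

Overlapping reads collapse onto the same edges, so the graph size depends on the number of distinct k-mers rather than on the number of reads.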

All you need to know about Molecular Biology 

Central dogma of molecular biology
DNA → (transcription) → RNA → (translation) → Protein
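Seen as string processing, the two steps can be sketched as follows. The sketch assumes the input is the coding strand read 5' to 3', that the reading frame starts at the first letter, and that the standard genetic code applies; the function names and the example sequence are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: same letters, with T replaced by U."""
    return dna.replace("T", "U")


def translate(mrna):
    """Read codons from position 0 until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCATTGTAATGGGCCGCTGA"  # made-up example sequence
mrna = transcribe(dna)
print(mrna)
print(translate(mrna))
```

The protein is written in the usual one-letter amino acid codes.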

DNA
Four different nucleotides, with the bases adenine, guanine, cytosine and thymine. They are usually referred to as bases and denoted by their initial letters, A, C, G and T.
5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
   | | | | | | | | | | | | | | |
3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'
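Because A pairs with T and C pairs with G, one strand determines the other. A minimal sketch of that rule, using the strand from the slide (the function names are illustrative):

```python
# Watson-Crick pairing: A<->T, C<->G
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Base-by-base complement, written in the same left-to-right order."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand):
    """The opposite strand written in its own 5'->3' direction."""
    return complement(strand)[::-1]


top = "CGATTGCAACGATGC"           # upper strand, 5'->3'
print(complement(top))            # lower strand as drawn on the slide, 3'->5'
print(reverse_complement(top))    # lower strand read 5'->3'
```

Sequence databases conventionally store only one strand, since the other can always be recovered this way.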

DNA

DNA - Biology as an information science
5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
   | | | | | | | | | | | | | | |
3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'
Thus, for many information-related purposes, the molecule can be represented as CGATTGCAACGATGC.
The maximal amount of information that can be encoded in such a molecule is therefore 2 bits times the length of the sequence. Noting that the distance between nucleotide pairs in DNA is about 0.34 nm, we can calculate that the linear information storage density in DNA is about 6×10^7 bits/cm, which is approximately 7.5 MB per cm, or roughly one CD-ROM (about 700 MB) per metre.
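The density estimate can be checked with a few lines of arithmetic. The sketch assumes 2 bits per base, 0.34 nm between neighbouring base pairs, and decimal megabytes (10^6 bytes); the variable names are illustrative. Under these assumptions it gives roughly 5.9×10^7 bits, or about 7.35 MB, per cm.

```python
BITS_PER_BASE = 2               # four letters -> log2(4) = 2 bits
RISE_PER_BASE_PAIR_NM = 0.34    # distance between neighbouring base pairs
NM_PER_CM = 1e7


def bits_needed(sequence):
    """Maximal information content of a DNA sequence, in bits."""
    return BITS_PER_BASE * len(sequence)


bases_per_cm = NM_PER_CM / RISE_PER_BASE_PAIR_NM
bits_per_cm = BITS_PER_BASE * bases_per_cm
megabytes_per_cm = bits_per_cm / 8 / 1e6

print(bits_needed("CGATTGCAACGATGC"))
print(f"{bits_per_cm:.2e} bits/cm, {megabytes_per_cm:.2f} MB/cm")
```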

DNA replication – copying the information

Polymerase chain reaction – PCR – Xeroxing the DNA

Genome sequencing
• Reading the nucleotides in the DNA molecule and storing the readout in a computer
• Basic technology ideas
• A version of PCR
• Separation of molecules by chemical properties such as weight or length of the DNA
• Molecule labelling, and fluorescent labelling in particular
• DNA fragmentation into pieces of random length

Anatomy of a chromosome
• The centromere is the largest constriction of the chromosome
• Site of attachment of spindle fibers
• Hundreds of thousands of copies of a 171 base pair repeat, called alpha satellite sequences
• Centromere-associated proteins are bound
[Adapted from R. Yasbin]

Genomes, chromosomes
A genome is a set of DNA molecules. Each chromosome contains one (long) DNA molecule.
The 23 human chromosomes

Genome sizes
Information in the human genome – up to 0.75 GB (about 3 billion base pairs at 2 bits each)

www.ensembl.org

Genomes and genes
[Figure: gene structure and expression. DNA: control statements, TATA box, start, gene, termination (stop). Transcription (RNA polymerase) gives mRNA with 5' UTR, ribosome binding site and 3' UTR. Translation (ribosome) gives protein.]

Chromosomes - Eukaryotes

Chromosomes - Prokaryotes
Two subgroups: Archaea and Bacteria
