Introduction to Bioinformatics Juris Viksna, IMCS UL 2014 Alvis Brazma, European Bioinformatics Institute
Planned course schedule Regular lectures: Wednesdays, 14:30-16:00. Most likely the first 4 (or so) lectures will stick to this schedule, followed by a break in March and a restart in April, with replacement lectures scheduled sometime in April-May. I will try my best to invite guest lecturers (most likely this will involve rescheduled lecture times), but this depends on options that might (or might not) become available.
Course requirements To obtain credit for this course you must: - submit a programming project (worth 50% of the grade); - take a (written) exam (worth 50% of the grade). Course web page: http://susurs.mii.lu.lv/juris/courses/bi2014.html
Bioinformatics • Databases and tools to store and access biomolecular data • Sequence algorithms – assembly from short fragments, alignment of similar sequences, analysis of properties • Evolution and phylogenetics • 3D structure analysis of biomolecules • Machine learning and data mining application to genome and related information • Biomolecular interaction analysis (e.g., protein interactions) • Dynamic systems, modelling of biological networks and systems • Analysis of noisy measurement data, statistical analysis • Data management, databases, interfaces, web services • Links with health records, biomedical informatics
Why might bioinformatics be important for you? • This is a growing science involving an increasing number of computer professionals (e.g., the 1000-human genome project has just started) • Links with medical and health informatics information systems – a growing and important market for software • Latvian genome project and participation in European genotyping projects – software experts who understand the underlying problems are needed
Topics covered in this course: • Introduction to biology as an information science • Overview of some bioinformatics problems • Biosequence and structure analysis, molecular evolution and phylogeny, etc. • Genomics – DNA assembly, haplotypes, etc. • Gene regulation network modelling (graph theory, Boolean networks, dynamic systems) • Analysis of gene expression data, cluster analysis, data mining and analysis • Data management and analysis for biomedical studies • Some new, recently evolving topics (time and material availability permitting...)
FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks) • Biological information: its diversity and volume • Biology, statistics, information technology and programming as the basic elements of bioinformatics • Genome organization and evolution • Comparative genomics • Databases of biological information. Information search and retrieval systems
FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks) • Basic principles of nucleic acid and protein sequence similarity. Various comparison methods, their advantages and conditions of use • Phylogenetics. Cluster and cladistic methods for reconstructing phylogenetic trees • Genome expression analysis • DNA chips in the analysis of genome polymorphism. Genetics of gene expression
FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks) • DNA topology, protein structure, methods for its prediction and applications in pharmacology • Proteomics and systems biology. Network structures as a natural component of biological systems • Prospects of bioinformatics. Bioinformatics as a prerequisite for mastering modern biology
NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, July 17, 2000 • Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualize such data. • Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, behavioural, and social systems.
Human Genome Project • Began in 1990 in the US • The primary goal – to sequence the roughly 3 billion base pairs of human DNA • A working draft of the genome was released in 2000 • Finished in 2003, with further analysis still being published
The results of HGP • A sequence about 3 billion letters long over the four-letter alphabet A, T, G and C, containing all the inherited human information • Genomes of many other organisms • Development of biotechnology that allows not only sequencing DNA but also studying the function of different biomolecules, producing many TB of data • Databases storing this information (GenBank and the EMBL data library) • Data analysis and management needs, leading to the emergence and development of bioinformatics. However, things have recently changed again: with NGS technologies, sequencing the genome of a specific individual has become affordable, with direct implications for the amount of data that needs to be stored and/or analyzed.
One of the first textbooks in bioinformatics MIT press 2000
A few other textbooks for «Computer Scientists» MIT press 2004 Chapman and Hall/CRC 1995
Some bioinformatics problems from the perspective of Computer Science Genome sequencing and assembly
Genome sequencing and assembly E.Green (2001) Strategies for the systematic sequencing of complex genomes. Nat Rev Genetics, Vol 2:8, 573-583.
Sequence assembly problem OK, let us assume that we have these hybridizations. How can we reconstruct the initial DNA sequence from them? Affymetrix GeneChip W.Bains, C.Smith (1988) A novel method for nucleic acid sequence determination. Journal of Theoretical Biology, Vol. 135:3, 303-307.
Hamiltonian path (cycle) problem For a given graph find a path (cycle) that visits every vertex exactly once (or show that such a path does not exist). Unfortunately the problem is known to be NP-hard. This means that no algorithm is known that runs in realistic time, even for comparatively small graphs.
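To connect this with the hybridization problem, here is a minimal sketch under stated assumptions: each hybridized fragment (k-mer) is a vertex, and there is an edge u -> v when the last k-1 letters of u equal the first k-1 letters of v. A Hamiltonian path in this overlap graph then spells a candidate sequence. The function names `hamiltonian_path` and `spell` and the toy spectrum are illustrative only, and the search is plain backtracking, so its running time grows exponentially in the worst case.

```python
def hamiltonian_path(kmers):
    """Return the k-mers ordered along a Hamiltonian path of the overlap graph, or None."""
    n = len(kmers)
    # Edge i -> j when the (k-1)-suffix of kmers[i] equals the (k-1)-prefix of kmers[j].
    adj = {
        i: [j for j in range(n) if i != j and kmers[i][1:] == kmers[j][:-1]]
        for i in range(n)
    }

    def extend(path, used):
        # Backtracking: try every unused successor of the last vertex on the path.
        if len(path) == n:
            return path
        for j in adj[path[-1]]:
            if j not in used:
                result = extend(path + [j], used | {j})
                if result is not None:
                    return result
        return None

    for start in range(n):
        path = extend([start], {start})
        if path is not None:
            return [kmers[i] for i in path]
    return None


def spell(path):
    """Merge consecutive overlapping k-mers into one sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


if __name__ == "__main__":
    spectrum = ["GGC", "ATG", "CGT", "TGG", "GCG"]  # toy 3-mer spectrum
    print(spell(hamiltonian_path(spectrum)))
```

On real spectra the path is usually not unique, and the exhaustive search quickly becomes impractical, which is the motivation for the Eulerian formulation.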
Eulerian path (cycle) problem For a given graph find a path (cycle) that visits every edge exactly once (or show that such a path does not exist).
Eulerian path (cycle) problem An Eulerian cycle exists if and only if the graph is connected and every vertex has even degree. Moreover, there is a simple linear-time algorithm for finding an Eulerian cycle.
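A linear-time method of this kind is Hierholzer's algorithm. The sketch below is an illustration, not the lecture's own code: it uses the directed analogue of the degree condition (in-degree equals out-degree everywhere, except for at most one start and one end vertex when a path is wanted), because k-mer overlaps are directed. Each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix. The names `eulerian_path` and `reconstruct` are hypothetical.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Hierholzer's algorithm on a directed multigraph given as (u, v) pairs.

    Returns the list of visited vertices, or None if no Eulerian path exists.
    """
    if not edges:
        return []
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [v for v, b in balance.items() if b == 1]
    ends = [v for v, b in balance.items() if b == -1]
    if any(abs(b) > 1 for b in balance.values()) or len(starts) > 1 or len(ends) > 1:
        return None
    # Start at the unbalanced vertex if there is one; otherwise any vertex works (cycle).
    stack = [starts[0] if starts else edges[0][0]]
    path = []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: vertex is finished
    path.reverse()
    # If some edge was never used, the graph is not connected.
    return path if len(path) == len(edges) + 1 else None


def reconstruct(kmers):
    """Spell a sequence from a k-mer spectrum via an Eulerian path."""
    path = eulerian_path([(k[:-1], k[1:]) for k in kmers])
    if path is None:
        return None
    return path[0] + "".join(v[-1] for v in path[1:])


if __name__ == "__main__":
    print(reconstruct(["GGC", "ATG", "CGT", "TGG", "GCG"]))
```

Every edge is pushed and popped once, so the running time is linear in the number of k-mers.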
Next Generation Sequencing (Illumina) In the case of de novo sequencing we have essentially the same fragment assembly problem as for SBH (sequencing by hybridization), only the number of DNA sequence fragments is much higher and their size is larger (~50-150 bp).
Sequence assembly – de Bruijn graphs D.Zerbino, E.Birney (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, Vol. 18:5, 821-829.
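As a rough illustration of the data structure (a simplified sketch, not Velvet's actual implementation, which among other things also handles reverse complements and sequencing errors), a de Bruijn graph can be built by sliding a window of length k over every read and recording an edge from the (k-1)-prefix to the (k-1)-suffix of each k-mer. The function name `de_bruijn` and the example reads are made up for this example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph as {prefix: {suffix: multiplicity}} from reads."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: (k-1)-prefix -> (k-1)-suffix.
            graph[kmer[:-1]][kmer[1:]] += 1
    # Convert to plain dicts so that lookups do not create new entries.
    return {u: dict(vs) for u, vs in graph.items()}


if __name__ == "__main__":
    reads = ["ATGGC", "GGCGT"]  # two overlapping toy reads
    for node, successors in sorted(de_bruijn(reads, 3).items()):
        print(node, "->", successors)
```

The edge multiplicities record how often a k-mer was seen across reads, which assemblers can use as a coverage estimate.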
Central dogma of molecular biology DNA → (transcription) → RNA → (translation) → Protein
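The two steps can be mimicked on strings. The sketch below makes several simplifying assumptions: transcription is modelled as copying the coding strand with T replaced by U, translation uses the standard genetic code, starts at the first letter (no search for a start codon) and stops at the first stop codon. The names `transcribe`, `translate` and `CODON_TABLE` are illustrative.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """DNA coding strand -> mRNA (simplified: replace T with U)."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """mRNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTAA")
    print(mrna, translate(mrna))
```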
DNA Four different nucleotides: adenine, guanine, cytosine and thymine. They are usually referred to as bases and denoted by their initial letters, A, C, G and T.
5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
   | | | | | | | | | | | | | | |
3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'
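Because A always pairs with T and C with G, one strand determines the other. A minimal sketch, with hypothetical helper names `complement` and `reverse_complement`, using the duplex shown on this slide:

```python
# Watson-Crick pairing: A <-> T, C <-> G.
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Base-by-base complement, in the same left-to-right order."""
    return strand.translate(PAIRING)


def reverse_complement(strand):
    """The opposite strand read in its own 5' -> 3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    top = "CGATTGCAACGATGC"             # 5' -> 3'
    print(complement(top))              # bottom strand as drawn, 3' -> 5'
    print(reverse_complement(top))      # bottom strand read 5' -> 3'
```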
DNA - Biology as an information science
5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
   | | | | | | | | | | | | | | |
3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'
Thus, for many information-related purposes, the molecule can be represented as CGATTGCAACGATGC. The maximal amount of information that can be encoded in such a molecule is therefore 2 bits times the length of the sequence. Noting that the distance between nucleotide pairs in DNA is about 0.34 nm, we can calculate that the linear information storage density in DNA is about 6×10^7 bits/cm, which is approximately 7 MB per cm, or roughly one CD-ROM's worth (about 700 MB) per metre.
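The density estimate is a short calculation that can be checked directly. The sketch below assumes only the figures given on the slide (four equally possible bases, 0.34 nm between consecutive base pairs) and decimal megabytes; the variable names are illustrative.

```python
import math

BITS_PER_BASE = math.log2(4)   # four possible letters -> 2 bits per position
RISE_NM = 0.34                 # distance between consecutive base pairs, in nm
NM_PER_CM = 1e7                # 1 cm = 10^7 nm

bases_per_cm = NM_PER_CM / RISE_NM
bits_per_cm = bases_per_cm * BITS_PER_BASE
megabytes_per_cm = bits_per_cm / 8 / 1e6   # 8 bits per byte, 10^6 bytes per MB

if __name__ == "__main__":
    print(f"{bases_per_cm:.3g} base pairs per cm")
    print(f"{bits_per_cm:.3g} bits per cm")
    print(f"{megabytes_per_cm:.2f} MB per cm")
```

This comes to roughly 3×10^7 base pairs, or about 6×10^7 bits, per centimetre of DNA.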
Genome sequencing • Reading the nucleotides in the DNA molecule and storing the readout in a computer • Basic technology ideas: • A version of PCR • Separation of molecules by chemical properties such as weight or length of the DNA • Molecule labelling, fluorescent labelling in particular • DNA fragmentation into random-length pieces
Anatomy of a chromosome • The centromere is the largest constriction of the chromosome • Site of attachment of spindle fibers • Hundreds of thousands of copies of a 171-base-pair repeat, called alpha satellite sequences • Centromere-associated proteins are bound [Adapted from R.Yasbin]
Genomes, chromosomes A genome is a set of DNA molecules. Each chromosome contains one (long) DNA molecule. The 23 human chromosomes
Genome sizes Information in the human genome – up to 0.75 GB (about 3 billion bases × 2 bits per base)
Genomes and genes [Figure: structure of a gene. DNA level: control statements, TATA box, start, gene, termination (stop). Transcription (RNA polymerase) produces mRNA with a 5' UTR, a ribosome binding site and a 3' UTR. Translation (ribosome) produces the protein.]
(Eukaryotic) cell [Adapted from Online Biology Book]