This course provides an introduction to bioinformatics and molecular biology, covering algorithms and methods in bioinformatics from a computational viewpoint. Prerequisites include programming experience, a strong background in algorithms and data structures, basic understanding of statistics and probability, and an interest in biology.
CS 4233 & 5263 Bioinformatics Part I: Introduction to Bioinformatics and Molecular Biology
Course description • A survey of algorithms and methods in bioinformatics, approached from a (more or less) computational viewpoint. • Prerequisites: • Programming experience • Strong background in algorithms and data structures • Basic understanding of statistics and probability • Appetite to learn some biology • For other information, check the course website
Why bioinformatics • The advance of biomedical experimental technology has resulted in a huge amount of data • The human genome is “finished” • Even if it were, that’s only the beginning… • The bottleneck is how to integrate and analyze the data • Noisy • Diverse
Growth of GenBank vs Moore’s law • Last week I just received 100GB of DNA data from a “small” NGS experiment my collaborator did
Genome annotations Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006
What is bioinformatics • National Institutes of Health (NIH): • Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
What is bioinformatics • National Center for Biotechnology Information (NCBI): • the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.
Computer Scientists vs Biologists (courtesy Serafim Batzoglou, Stanford)
Biologists vs computer scientists • (almost) Everything is true or false in computer science • (almost) Nothing is ever true or false in Biology
Biologists vs computer scientists • Biologists seek to understand the complicated, messy natural world • Computer scientists strive to build their own clean and organized virtual world
Biologists vs computer scientists • Computer scientists are obsessed with being the first to invent or prove something • Biologists are obsessed with being the first to discover something
1. Genome sequencing • Genome: 3×10^9 nucleotides • Sequenced fragments: ~500 nucleotides each • e.g. AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
1. Genome sequencing • 3×10^9 nucleotides • A big puzzle: ~60 million pieces • Computational Fragment Assembly • Introduced ~1980 • 1995: assemble up to 1,000,000 long DNA pieces • 2000: assemble whole human genome
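The "puzzle" idea behind fragment assembly can be illustrated with the most basic step: finding where the end of one fragment overlaps the start of another and merging them. This is only a minimal sketch, not how production assemblers work; the function names `overlap` and `merge`, the `min_len` threshold, and the two short fragments (cut from the start of the sequence shown on the slide) are assumptions made for illustration.

```python
from typing import Optional


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a: str, b: str, min_len: int = 3) -> Optional[str]:
    """Glue `b` onto the end of `a` if they overlap, otherwise return None."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None


# Two hypothetical fragments taken from the start of the slide's sequence.
left = "AGTAGCACAGACTACG"
right = "GACTACGACGAGACG"
print(overlap(left, right))  # 7
print(merge(left, right))    # AGTAGCACAGACTACGACGAGACG
```

Real reads contain sequencing errors and repeats, so exact suffix-prefix matching like this is only the starting intuition.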
2. Gene Finding • Where are the genes? • In humans: ~22,000 genes, ~1.5% of human DNA
2. Gene Finding • Gene structure (figure): 5’, Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, 3’ • Start codon: ATG • Stop codon: TAG/TGA/TAA • Splice sites at the exon-intron boundaries • Hidden Markov Models (well studied for many years in speech recognition)
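The start and stop codons named on the slide are enough to sketch the simplest possible gene-finding heuristic: scan each reading frame for an ATG followed, in frame, by a stop codon. This is a naive open-reading-frame scan, not the Hidden Markov Model approach the slide refers to, and it ignores introns and splice sites entirely. The function name `find_orfs`, the `min_codons` parameter, and the test sequence are assumptions for illustration.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}  # stop codons listed on the slide
START_CODON = "ATG"


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end, sequence) for each ATG...stop stretch.

    Scans the three reading frames of the given strand only; `end` is
    exclusive and the stop codon is included in the returned sequence.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # close it and keep scanning
    return orfs


print(find_orfs("CCATGGCTTGACC"))  # [(2, 11, 'ATGGCTTGA')]
```

A complete scan would also run the same function on the reverse complement, since genes can lie on either strand.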
3. Protein Folding • The amino-acid sequence of a protein determines the 3D fold • The 3D fold of a protein determines its function • Can we predict the 3D fold of a protein given its amino-acid sequence? • Holy grail of computational biology: a 40-year-old problem • Molecular dynamics, computational geometry, machine learning
4. Sequence Comparison: Alignment • Query: AGGCTATCACCTGACCTCCAGGCCGATGCCC • DB: TAGCTATCACGACCGCGGTCGATTTGCCCGAC • Alignment ("|" = match, "x" = mismatch, "-" = gap):
query  -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
        || |||||||  ||||x|  || |||  |||||
DB     TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Sequence alignment introduced ~1970 • BLAST: 1990, one of the most cited papers in history • Still a very active area of research • BLAST: efficient string matching algorithms, fast database index techniques
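To make the idea of scoring an alignment concrete, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch recurrence). It is not BLAST, which uses heuristics to avoid filling the full table. The scoring values (match +1, mismatch -1, gap -1) and the function name `global_alignment_score` are assumptions chosen for illustration.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Best global alignment score of `a` and `b` (linear gap penalty).

    Keeps only two rows of the dynamic-programming table, so memory is
    O(len(b)) while time is O(len(a) * len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]       # row for empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACT"))  # 2: three matches, one gap

# The two sequences from the slide can be scored the same way.
query = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
db = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
print(global_alignment_score(query, db))
```

The quadratic running time is the reason database search tools rely on indexing and fast string matching instead of aligning the query against every entry this way.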
Lipman & Pearson, 1985 • “…, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).” • Database size today (2007): 10^12 (a 2-million-fold increase) • BLAST search: 1.5 minutes
5. Microarray data analysis • Example: clinical prediction of leukemia type • 2 types of leukemia • Acute lymphoid (ALL) • Acute myeloid (AML) • Different treatments & outcomes • Predict type before treatment? • Bone marrow samples: ALL vs AML • Measure amount of each gene
Some goals of biology for the next 50 years • List all molecular parts that build an organism • Genes, proteins, other functional parts • Understand the function of each part • Understand how parts interact physically and functionally • Study how function has evolved across all species • Find genetic defects that cause diseases • Design drugs rationally • Sequence the genome of every human, use it for personalized medicine • Bioinformatics is an essential component for all the goals above
Life • Two main categories: • Prokaryotes (e.g. bacteria) • Unicellular • No nucleus • Eukaryotes (e.g. fungi, plant, animal) • Unicellular or multicellular • Has nucleus
Prokaryote vs Eukaryote • A eukaryote has many membrane-bounded compartments inside the cell • Different biological processes occur at different cellular locations
Organism, Organ, Cell
Chemical contents of cell • Water • Macromolecules (polymers) - “strings” made by linking monomers from a specified set (alphabet) • Protein • DNA • RNA • … • Small molecules • Sugar • Ions (Na+, K+, Ca2+, Cl-, …) • Hormone • …
DNA • DNA: forms the genetic material of all living organisms • Can be replicated and passed to descendants • Contains information to produce proteins • To computer scientists, DNA is a string made from alphabet {A, C, G, T} • e.g. ACAGAACGTAGTGCCGTGAGCG • Each letter is a nucleotide • Length varies from hundreds to billions
RNA • Historically thought to be mainly an information carrier • DNA => RNA => Protein • Very important new roles have been found recently • To computer scientists, RNA is a string made from alphabet {A, C, G, U} • e.g. ACAGAACGUAGUGCCGUGAGCG • Each letter is a nucleotide • Length varies from tens to thousands
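Treating DNA and RNA as strings makes the first arrow of "DNA => RNA => Protein" easy to sketch: the RNA copy of a DNA string reads the same with U in place of T. This is a simplification that assumes the input is the coding (sense) strand; the function name `transcribe` is an assumption for illustration.

```python
DNA_ALPHABET = set("ACGT")


def transcribe(dna: str) -> str:
    """Return the RNA string for a DNA coding-strand string (T -> U)."""
    dna = dna.upper()
    unknown = set(dna) - DNA_ALPHABET
    if unknown:
        raise ValueError(f"not a DNA string, unexpected letters: {sorted(unknown)}")
    return dna.replace("T", "U")


# The DNA example from the slides maps onto the RNA example from the slides.
print(transcribe("ACAGAACGTAGTGCCGTGAGCG"))  # ACAGAACGUAGUGCCGUGAGCG
```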
Protein • Protein: the actual “worker” for almost all processes in the cell • Enzymes: speed up reactions • Signaling: information transduction • Structural support • Production of other macromolecules • Transport • To computer scientists, protein is a string made from an alphabet of 20 letters • E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP • Each letter is called an amino acid • Length varies from tens to thousands
DNA/RNA zoom-in • Commonly referred to as Nucleic Acid • DNA: Deoxyribonucleic acid • RNA: Ribonucleic acid • Found mainly in the nucleus of a cell (hence “nucleic”) • Contain phosphoric acid as a component (hence “acid”) • They are made up of a string of nucleotides
Nucleotides • A nucleotide has 3 components • Sugar ring (ribose in RNA, deoxyribose in DNA) • Phosphoric acid • Nitrogen base • Adenine (A) • Guanine (G) • Cytosine (C) • Thymine (T) in DNA and Uracil (U) in RNA
Units of RNA: ribo-nucleotide • A ribonucleotide has 3 components • Sugar - Ribose • Phosphate group • Nitrogen base • Adenine (A) • Guanine (G) • Cytosine (C) • Uracil (U)
Units of DNA: deoxy-ribo-nucleotide • A deoxyribonucleotide has 3 components • Sugar – Deoxy-ribose • Phosphate group • Nitrogen base • Adenine (A) • Guanine (G) • Cytosine (C) • Thymine (T)
Polymerization: Nucleotides => nucleic acids • (Figure: three nucleotides, each with a phosphate, a sugar, and a nitrogen base, linked into a chain)
DNA • 5’-AGCGACTG-3’, or simply AGCGACTG • The 5’ (5 prime) end carries a free phosphate • Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. DNA replication, transcription, etc. • (Figure: base, phosphate, and sugar ring with carbons numbered 1 to 5)
RNA • 5’-AGUGACUG-3’, or simply AGUGACUG • The 5’ (5 prime) end carries a free phosphate • Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. translation.
DNA usually exists in pairs • Base-pair: A = T, G = C • Forward (+) strand: 5’-AGCGACTG-3’ • Backward (-) strand: 3’-TCGCTGAC-5’ • One strand is said to be reverse-complementary to the other
DNA double helix • G-C pair is stronger than A-T pair
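Because G-C pairs are stronger than A-T pairs, the fraction of G and C letters in a sequence is a common rough indicator of how tightly the two strands hold together. A minimal sketch of that calculation follows; the function name `gc_content` is an assumption for illustration.

```python
def gc_content(seq: str) -> float:
    """Fraction of letters in `seq` that are G or C (between 0.0 and 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


print(gc_content("AGCGACTG"))  # 0.625 (3 G + 2 C out of 8 letters)
```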
Reverse-complementary sequences • 5’-ACGTTACAGTA-3’ • The reverse complement is: 3’-TGCAATGTCAT-5’ => 5’-TACTGTAACGT-3’ • Or simply written as TACTGTAACGT
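The two steps above (complement each base, then read the result in the opposite direction) can be sketched directly on strings. The function name `reverse_complement` is an assumption for illustration; only the four letters A, C, G, T are handled.

```python
# Translation table for the base-pairing rules A<->T and C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string, written 5' to 3'."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


print(reverse_complement("ACGTTACAGTA"))  # TACTGTAACGT
```

Applying the function twice returns the original string, which matches the point that each strand is the reverse complement of the other.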
Orientation of the double helix • Double helix is anti-parallel • 5’ end of one strand pairs with 3’ end of the other • 5’ to 3’ motion in one strand is 3’ to 5’ in the other • Double helix has no orientation • Biology has no “forward” and “reverse” strand • Relative to any single strand, there is a “reverse complement” or “reverse strand” • Information can be encoded by either strand or both strands 5’TTTTACAGGACCATG 3’ 3’AAAATGTCCTGGTAC 5’
RNA • RNAs are normally single-stranded • Form complex structure by self-base-pairing • A=U, C=G • Can also form RNA-DNA and RNA-RNA double strands. • A=T/U, C=G
Protein zoom-in • Protein is the actual “worker” for almost all processes in the cell • A string built from 20 kinds of chars • E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKH • Each letter is called an amino acid • Generic chemical form of amino acid: H2N-CH(R)-COOH, with amino group (H2N), carboxyl group (COOH), and side chain (R)
Units of Protein: Amino acid • 20 amino acids, only differ at side chains • Each can be expressed by three letters • Or a single letter: A-Y, except B, J, O, U, X, Z • Alanine = Ala = A • Histidine = His = H
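The one-letter rule above (A to Y, skipping B, J, O, U, X) can be checked in a few lines, together with a lookup from one-letter to three-letter codes. This is a small sketch; the names `AMINO_ACIDS`, `THREE_LETTER`, `is_protein`, and `to_three_letter` are assumptions for illustration, while the three-letter codes themselves are the standard ones.

```python
import string

# One-letter alphabet derived from the rule on the slide.
EXCLUDED = set("BJOUXZ")
AMINO_ACIDS = [c for c in string.ascii_uppercase if c not in EXCLUDED]

# Standard three-letter codes for the 20 amino acids.
THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}


def is_protein(seq: str) -> bool:
    """True if every letter of `seq` is one of the 20 one-letter codes."""
    return all(c in THREE_LETTER for c in seq.upper())


def to_three_letter(seq: str) -> str:
    """Spell a one-letter protein string with three-letter codes."""
    return "-".join(THREE_LETTER[c] for c in seq.upper())


print(len(AMINO_ACIDS))                               # 20
print(is_protein("MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP"))  # True
print(to_three_letter("AH"))                          # Ala-His
```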
Amino acids => peptide • H2N-CH(R)-COOH + H2N-CH(R)-COOH => H2N-CH(R)-CO-NH-CH(R)-COOH • The CO-NH link is the peptide bond
Protein • Chain of residues (R R R R R R …) running from the N-terminal (H2N) to the C-terminal (COOH) • Has an orientation • Usually recorded from N-terminal to C-terminal • Peptide vs protein: basically the same thing • Conventions • Peptide is shorter (< 50aa), while protein is longer • Peptide refers to the sequence, while protein has 2D/3D structure
Protein structure • Linear sequence of amino acids folds to form a complex 3-D structure. • The structure of a protein is intimately connected to its function.
Genome and chromosome • Genome: the complete DNA sequences in the cell of an organism • May contain one (in most prokaryotes) or more (in eukaryotes) chromosomes • Chromosome: a single large DNA molecule in the cell • May be circular or linear • Contains genes as well as “junk DNA” • Highly packed!
Formation of chromosome • The packed chromosome is 50,000 times shorter than the extended DNA • The total length of DNA present in one adult human is the equivalent of nearly 70 round trips from the earth to the sun
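The "70 round trips" figure can be sanity-checked with back-of-the-envelope arithmetic. Only the genome size (3×10^9 nucleotides) comes from the slides; the spacing of about 0.34 nm per base pair, two genome copies per cell, and the mean Earth-Sun distance are assumptions added here, so treat the result as an order-of-magnitude check rather than a measurement.

```python
BP_PER_GENOME = 3e9        # from the slides: 3x10^9 nucleotides
RISE_PER_BP_M = 0.34e-9    # assumed: ~0.34 nm per base pair
COPIES_PER_CELL = 2        # assumed: two genome copies (diploid cell)
EARTH_SUN_M = 1.496e11     # assumed: mean Earth-Sun distance in meters
ROUND_TRIPS = 70           # from the slides

# Extended DNA length in one cell, in meters.
dna_per_cell_m = BP_PER_GENOME * RISE_PER_BP_M * COPIES_PER_CELL

# Total distance covered by 70 round trips, in meters.
claimed_total_m = ROUND_TRIPS * 2 * EARTH_SUN_M

# Number of DNA-containing cells needed for the claim to hold.
implied_cells = claimed_total_m / dna_per_cell_m

print(f"DNA per cell: {dna_per_cell_m:.2f} m")
print(f"70 round trips: {claimed_total_m:.3e} m")
print(f"implied number of cells: {implied_cells:.2e}")
```

Under these assumptions each cell holds about 2 m of DNA, and the claim corresponds to roughly 10^13 cells.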