
CS 5263 Bioinformatics


Presentation Transcript


  1. CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology

  2. Outline • Administrivia • What is bioinformatics • Why bioinformatics • Course overview • Short introduction to molecular biology

  3. Survey form • Your name • Email • Academic preparation • Interests • help me better design lectures and assignments

  4. Course Info • Instructor: Jianhua Ruan Office: S.B. 4.01.48 Phone: 458-6819 Email: jruan@cs.utsa.edu Office hours: MW 2-3pm • Web: http://www.cs.utsa.edu/~jruan/teaching/cs5263_fall_2008/

  5. Course description • A survey of algorithms and methods in bioinformatics, approached from a computational viewpoint. • Prerequisites: • Programming experience • Some knowledge of algorithms and data structures • Basic understanding of statistics and probability • Appetite to learn some biology

  6. Textbooks • An Introduction to Bioinformatics Algorithms by Jones and Pevzner • Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin, Eddy, Krogh and Mitchison • Additional resources • Papers • Handouts • See course website

  7. Grading • Attendance: 10% • At most 2 classes missed without affecting grade • Homework: 50% • About 5 assignments • Combination of theoretical and programming exercises • No exams • No late submissions accepted • Read the collaboration policy! • Final project and presentation: 40%

  8. Why bioinformatics • Advances in experimental technology have generated huge amounts of data • The human genome is “finished” • Even if it were, that’s only the beginning… • The bottleneck is how to integrate and analyze the data • Noisy • Diverse

  9. Growth of GenBank vs Moore’s law

  10. Genome annotations Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006

  11. What is bioinformatics • National Institutes of Health (NIH): • Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

  12. What is bioinformatics • National Center for Biotechnology Information (NCBI): • the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

  13. What is bioinformatics • Wikipedia • Bioinformatics refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems posed by or inspired from the management and analysis of biological data.

  14. Course objectives • Learn the basics of sequence analysis and other computational biology algorithms • Become familiar with the research topics in bioinformatics • Be able to • Read / critique bioinformatics research articles • Identify subareas that best suit your background • Communicate and exchange ideas with (computational) biologists

  15. What will you learn? • Basic concepts in molecular biology and genetics • Algorithms to address selected problems in bioinformatics • Dynamic programming, string algorithms, graph algorithms • Statistical learning algorithms: HMM, EM, Gibbs sampling • Data mining: clustering / classification • Applications to real data

  16. What will you not learn? • Designing / performing biological experiments (duh!) • Programming (in Perl, etc.) • Building bioinformatics software tools (GUI, database, Web, …) • Using existing tools / databases (well, not exactly true)

  17. Covered topics • Biology (1 week) • Sequence analysis (8 weeks) • Sequence alignment • Pairwise, multiple, global, local, optimal, heuristic • String matching • Motif finding • Gene prediction • RNA structure prediction • Phylogenetic tree • Functional genomics (5 weeks) • Microarray data analysis • Biological networks

  18. Computer Scientists vs Biologists (courtesy Serafim Batzoglou, Stanford)

  19. Biologists vs computer scientists • (almost) Everything is true or false in computer science • (almost) Nothing is ever true or false in Biology

  20. Biologists vs computer scientists • Biologists seek to understand the complicated, messy natural world • Computer scientists strive to build their own clean and organized virtual world

  21. Biologists vs computer scientists • Computer scientists are obsessed with being the first to invent or prove something • Biologists are obsessed with being the first to discover something

  22. Some examples of central role of CS in bioinformatics

  23. 1. Genome sequencing • 3x10^9 nucleotides • ~500 nucleotides: AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT

  24. 1. Genome sequencing • 3x10^9 nucleotides • A big puzzle: ~60 million pieces • Computational fragment assembly • Introduced ~1980 • 1995: assemble up to 1,000,000 long DNA pieces • 2000: assemble whole human genome • AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
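
The "big puzzle" view of fragment assembly can be illustrated with a toy greedy merger: repeatedly join the two fragments with the longest suffix-prefix overlap. This is only a sketch under simplifying assumptions (error-free fragments, a single strand, no repeats); the function names and the `min_len` threshold are illustrative and not taken from the lecture, and real assemblers are far more involved.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Merge fragments greedily by largest overlap until none remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave several contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


# Three overlapping pieces cut from the start of the slide's sequence.
pieces = ["AGTAGCACAG", "CACAGACTACG", "ACTACGACGAG"]
print(greedy_assemble(pieces))
```

The quadratic all-pairs search is fine for a handful of fragments but would not scale to the ~60 million pieces mentioned above.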

  25. 2. Gene Finding Where are the genes? In humans: ~22,000 genes ~1.5% of human DNA

  26. 2. Gene Finding • Gene structure (figure): 5’ Exon 1, Intron 1, Exon 2, Intron 2, Exon 3 3’ • Start codon: ATG • Stop codon: TAG/TGA/TAA • Splice sites • Hidden Markov Models (well studied for many years in speech recognition)
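
As a much simpler stand-in for the Hidden Markov Model approach named on the slide, the start and stop codons alone already allow a naive open-reading-frame scan. The sketch below is an assumption-laden illustration, not the lecture's method: it ignores introns and splice sites, scans only the forward strand, and the name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}
START_CODON = "ATG"


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end) index pairs of ATG...stop stretches, end exclusive."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):          # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i           # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None        # close it at the first in-frame stop
    return orfs


seq = "CCATGAAATGGTAGCC"
for s, e in find_orfs(seq):
    print(s, e, seq[s:e])
```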

  27. 3. Protein Folding • The amino-acid sequence of a protein determines the 3D fold • The 3D fold of a protein determines its function • Can we predict the 3D fold of a protein given its amino-acid sequence? • Holy grail of compbio: a 40-year-old problem • Molecular dynamics, computational geometry, machine learning

  28. 4. Sequence Comparison: Alignment • Query: AGGCTATCACCTGACCTCCAGGCCGATGCCC • DB: TAGCTATCACGACCGCGGTCGATTTGCCCGAC • Alignment (| = match, x = mismatch, - = gap):
      -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
       || |||||||  ||||x|  || |||  |||||
      TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
      • Sequence alignment • Introduced ~1970 • BLAST: 1990, one of the most cited papers in history • Still a very active area of research • BLAST relies on: • Efficient string matching algorithms • Fast database index techniques
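
The idea of an optimal global alignment can be sketched with the classic dynamic-programming recurrence (Needleman-Wunsch style). The scoring values below (+1 match, -1 mismatch, -1 per gap position) are assumptions chosen for illustration, not values from the slides, and the sketch returns only the score, not the alignment itself. BLAST itself is a heuristic and works differently.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Best global alignment score of `a` and `b` under a linear gap cost."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[-1]


query = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
db = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
print(global_alignment_score(query, db))
```

Under this scoring, the alignment shown on the slide (24 matches, 2 mismatches, 11 gap positions) scores 11, so the value printed here is at least 11. Keeping only two rows of the table uses O(len(b)) memory while time stays O(len(a) x len(b)).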

  29. Lipman & Pearson, 1985 • “…, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).” • Database size today: 10^12 (a 2-million-fold increase) • BLAST search: 1.5 minutes
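
The fold increase quoted on this slide follows directly from the two database sizes; a quick check, with variable names chosen here for illustration:

```python
library_1985 = 500_000      # residues in the NBRF library (from the quote)
database_today = 10**12     # database size given on the slide

fold_increase = database_today // library_1985
print(f"{fold_increase:,}-fold increase")
```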

  30. 5. Microarray analysis: Clinical prediction of Leukemia type • 2 types • Acute lymphoid (ALL) • Acute myeloid (AML) • Different treatments & outcomes • Predict type before treatment? • Bone marrow samples: ALL vs AML • Measure amount of each gene

  31. Some goals of biology for the next 50 years • List all molecular parts that build an organism • Genes, proteins, other functional parts • Understand the function of each part • Understand how parts interact physically and functionally • Study how function has evolved across all species • Find genetic defects that cause diseases • Design drugs rationally • Sequence the genome of every human, use it for personalized medicine • Bioinformatics is an essential component for all the goals above

  32. A short introduction to molecular biology

  33. Life • Two categories: • Prokaryotes (e.g. bacteria) • Unicellular • No nucleus • Eukaryotes (e.g. fungi, plants, animals) • Unicellular or multicellular • Have a nucleus

  34. Prokaryote vs Eukaryote • Eukaryotes have many membrane-bound compartments inside the cell • Different biological processes occur at different cellular locations

  35. Organism, Organ, Cell (figure)

  36. Chemical contents of cell • Water • Macromolecules (polymers) - “strings” made by linking monomers from a specified set (alphabet) • Protein • DNA • RNA • … • Small molecules • Sugar • Ions (Na+, K+, Ca2+, Cl-, …) • Hormone • …

  37. DNA • DNA: forms the genetic material of all living organisms • Can be replicated and passed to descendants • Contains information to produce proteins • To computer scientists, DNA is a string made from alphabet {A, C, G, T} • e.g. ACAGAACGTAGTGCCGTGAGCG • Each letter is a nucleotide • Length varies from hundreds to billions
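
Treating DNA as a string over {A, C, G, T} makes basic statistics one-liners. A minimal sketch using the slide's example sequence; the function name `nucleotide_counts` is illustrative:

```python
from collections import Counter


def nucleotide_counts(dna: str) -> Counter:
    """Count how often each letter occurs in a DNA string."""
    return Counter(dna.upper())


example = "ACAGAACGTAGTGCCGTGAGCG"
counts = nucleotide_counts(example)
print(len(example), dict(counts))
```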

  38. RNA • Historically thought to be information carrier only • DNA => RNA => Protein • New roles have been found for them • To computer scientists, RNA is a string made from alphabet {A, C, G, U} • e.g. ACAGAACGUAGUGCCGUGAGCG • Each letter is a nucleotide • Length varies from tens to thousands
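
At the string level, the DNA => RNA step amounts to replacing T with U when the DNA is read as the coding strand. This is a simplification that ignores the template strand and any RNA processing, and the function name is illustrative. Applied to the DNA example from the previous slide, it yields the RNA example given here.

```python
def transcribe(dna: str) -> str:
    """Return the RNA string for a coding-strand DNA string (T -> U)."""
    return dna.upper().replace("T", "U")


print(transcribe("ACAGAACGTAGTGCCGTGAGCG"))
```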

  39. Protein • Protein: the actual “worker” for almost all processes in the cell • Enzymes: speed up reactions • Signaling: information transduction • Structural support • Production of other macromolecules • Transport • To computer scientists, protein is a string made from 20 kinds of characters • E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP • Each letter is called an amino acid • Length varies from tens to thousands
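
With DNA, RNA, and protein all viewed as strings, a sequence's type can be guessed from the letters it uses. The sketch below assumes the 20 standard one-letter amino-acid codes and checks the smallest alphabet first; the name `guess_alphabet` is hypothetical. The guess is inherently ambiguous: a string such as "ACG" is valid in all three alphabets and is reported as DNA here.

```python
DNA_LETTERS = set("ACGT")
RNA_LETTERS = set("ACGU")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def guess_alphabet(seq: str) -> str:
    """Guess whether `seq` is DNA, RNA, or protein from its letters."""
    letters = set(seq.upper())
    if not letters:
        return "unknown"
    if letters <= DNA_LETTERS:
        return "DNA"
    if letters <= RNA_LETTERS:
        return "RNA"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"


for s in ("ACAGAACGTAGTGCCGTGAGCG",
          "ACAGAACGUAGUGCCGUGAGCG",
          "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP"):
    print(guess_alphabet(s), len(s))
```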

  40. DNA/RNA zoom-in • Commonly referred to as Nucleic Acid • DNA: Deoxyribonucleic acid • RNA: Ribonucleic acid • Found mainly in the nucleus of a cell (hence “nucleic”) • Contain phosphoric acid as a component (hence “acid”) • They are made up of a string of nucleotides

  41. Nucleotides • A nucleotide has 3 components • Sugar ring (ribose in RNA, deoxyribose in DNA) • Phosphoric acid • Nitrogen base • Adenine (A) • Guanine (G) • Cytosine (C) • Thymine (T) or Uracil (U)

  42. Monomers of RNA: ribo-nucleotide • A ribonucleotide has 3 components • Sugar - Ribose • Phosphate group • Nitrogen base • Adenine (A) • Guanine (G) • Cytosine (C) • Uracil (U)

  43. Monomers of DNA: deoxy-ribo-nucleotide • A deoxyribonucleotide has 3 components • Sugar – Deoxy-ribose • Phosphate group • Nitrogen base • Adenine (A) • Guanine (G) • Cytosine (C) • Thymine (T)

  44. Polymerization: Nucleotides => nucleic acids (figure: three linked nucleotides, each with a phosphate, a sugar, and a nitrogen base)

  45. DNA • 5’-AGCGACTG-3’, or simply AGCGACTG • 5’ (5 prime) end: free phosphate • 3’ (3 prime) end • Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. DNA replication, transcription, etc. • (figure: base, phosphate, and sugar, with the sugar carbons numbered 1 to 5)

  46. RNA • 5’-AGUGACUG-3’, or simply AGUGACUG • 5’ (5 prime) end: free phosphate • 3’ (3 prime) end • Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. translation.

  47. DNA usually exists in pairs • Base pairs: A = T, G = C • Forward (+) strand: 5’-AGCGACTG-3’ • Backward (-) strand: 3’-TCGCTGAC-5’ • One strand is said to be reverse-complementary to the other

  48. DNA double helix • A G-C pair is stronger than an A-T pair (three hydrogen bonds versus two)
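
Because G-C pairs are stronger than A-T pairs, the fraction of G and C letters in a sequence is a commonly computed statistic. A minimal sketch; the function name `gc_content` is illustrative:

```python
def gc_content(dna: str) -> float:
    """Fraction of letters in `dna` that are G or C (0.0 for an empty string)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


print(gc_content("AGCGACTG"))
```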

  49. Reverse-complementary sequences • 5’-ACGTTACAGTA-3’ • The reverse complement is: 3’-TGCAATGTCAT-5’ => 5’-TACTGTAACGT-3’ • Or simply written as TACTGTAACGT
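
The two steps on this slide (complement each base, then reverse so the result reads 5’ to 3’) can be sketched in a few lines. The function name is illustrative, and the sketch assumes the input contains only A, C, G, and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse to read the other strand 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTTACAGTA"))
```

Applying the function twice returns the original sequence, which is a handy sanity check.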
