Biology 4900

Biology 4900 Biocomputing

Chapter 1 Introduction

Goals of the course • To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and European Bioinformatics Institute (EBI) • To focus on the analysis of DNA, RNA and proteins • To introduce you to the analysis of genomes • To combine theory and practice to help you solve research problems

Websites to Bookmark • Make a “Biocomputing” favorites folder in your normal internet browser • Add these sites to the Biocomputing folder • Galileo: http://www.galileo.usg.edu/scholar/clayton/subjects/ • Interlibrary Loan: http://adminservices.clayton.edu/library/depts/articlerequestform.aspx • Specialized search engines & tools • NCBI (Entrez/PubMed/etc.): http://www.ncbi.nlm.nih.gov/ • Entrez: http://www.ncbi.nlm.nih.gov/Entrez/ • Tutorial for Entrez: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/entrez_tutorial_BIB.pdf • Swiss Institute of Bioinformatics portal: http://www.expasy.org/

More Websites to Bookmark • Biology WorkBench: http://workbench.sdsc.edu/ • PyMOL: http://www.pymol.org/ • KEGG: http://www.genome.ad.jp/kegg/ • Swiss-Prot: http://us.expasy.org/sprot/ • PIR: http://pir.georgetown.edu/pirwww/search/textpsd.shtml • GenBank: http://www.ncbi.nlm.nih.gov/Genbank/index.html • EMBL: http://www.ebi.ac.uk/embl/index.html • DDBJ: http://www.ddbj.nig.ac.jp/ • Protein Data Bank: http://www.rcsb.org/pdb/ • ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html • GENSCAN: http://genes.mit.edu/GENSCAN.html

Software to Download/Install • You should have the following: • Clayton State University e-mail account (CHECK IT REGULARLY) • Microsoft Office (from the Hub Software Center) • Adobe Acrobat Reader: www.acrobat.com • Biology WorkBench: http://workbench.sdsc.edu/ • PyMOL: http://www.pymol.org/

Formal Definitions • Bioinformatics: Research, development or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, analyze or visualize such data. • Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral and social systems. • Genomics: A discipline in genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze the function and structure of genomes (the complete set of DNA within a single cell of an organism). Pevsner J, Bioinformatics and Functional Genomics, 2nd Edition, 2009

What is biocomputing? • Interface of biology, biochemistry and computers. • Analysis of proteins, genes and genomes using computer algorithms and computer databases. • Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects. • Application of computer algorithms and databases to store and analyze huge quantities of biological and biochemical data.

Why we need to use this computer stuff… Haploid human genome (23 chromosomes) • Contains 20,000–30,000 distinct genes. • Is ~ 3.2 billion bp in length • Represented by ~800 MB of data. http://www.214bio.com/BOOK/ch_11_genes.html; http://en.wikipedia.org/wiki/Human_genome
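The ~800 MB figure can be checked with a short calculation. The sketch below assumes each base is stored in 2 bits (enough to distinguish A, C, G and T) and that 1 MB means 1,000,000 bytes; both are assumptions about how the slide's number was derived, and the function name is made up for illustration.

```python
# Rough storage estimate for a haploid human genome.
# Assumption: 2 bits per base (4 possible nucleotides), 1 MB = 1,000,000 bytes.

GENOME_BP = 3_200_000_000   # ~3.2 billion base pairs (from the slide)
BITS_PER_BASE = 2           # assumed encoding: A, C, G, T -> 00, 01, 10, 11


def genome_size_megabytes(base_pairs, bits_per_base=BITS_PER_BASE):
    """Return the storage needed for `base_pairs` bases, in megabytes."""
    total_bits = base_pairs * bits_per_base
    total_bytes = total_bits / 8
    return total_bytes / 1_000_000


print(genome_size_megabytes(GENOME_BP))  # 800.0
```

Real sequence files (for example FASTA text at one byte per base) are larger than this packed estimate.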

Many roles of biocomputing Allows us to ask a number of diverse questions, and use known data to provide full or partial answers to those questions • Explore evolutionary origins of genes/proteins and determine phylogeny (MSA and construction of phylogenetic trees). • Predict gene locations (ORF Finder, pattern searching) • Predict gene product function (BLAST or FASTA searches) • Predict protein structure and function (Protein Explorer) • Identify genes that are expressed before the onset of cancer through genome sequencing • Identify drugs that can be used to treat specific diseases • Determine who was responsible for publishing information, data, results, related studies, etc., through literature searches (e.g., PubMed).

Why should we care? • Locate mutations responsible for genetic diseases • Aid in the treatment and diagnosis of those diseases • Pharmacogenomics (human genetic variability in relation to drug action) • Targeted drugs and therapies (e.g., design receptor-targeting moieties) • Discover and exploit new proteins • Environmental clean-up (e.g., enzymatic bioremediation, Chakrabarty’s oil-eating microbes) • Antibiotics and other chemotherapeutic agents • Useful products

Where did these data come from? • History • First scientific journal published in France in the 1600s. • Discovery of DNA in the 1860s, leading to our modern understanding of the genetic code, protein synthesis, etc. • 1981: IBM releases its first PC • 1996: First release of PubMed • Computers have made a dramatic impact in these areas • It would be impossible to analyze data on a large scale without computer databases to organize information, and computer programs to facilitate inquiries

Managing Data: The Database • Database: Organized collection of data • Relational model: Collection of tables storing different information, but linked with a common “key”. • Database Management System (DBMS): System to control creation, use and maintenance of a database. • Accessible via query languages (e.g., SQL) • Database System: DBMS and database combined
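A minimal sketch of the relational model, using Python's built-in sqlite3 module as the DBMS. The table names, column names and rows are hypothetical placeholders invented for illustration (they are not real PDB entries); the point is only that two tables are linked by a common key (pdb_id) and queried with SQL.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two tables linked by the common key "pdb_id" (hypothetical schema).
conn.executescript("""
CREATE TABLE structure (pdb_id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE ligand    (pdb_id TEXT REFERENCES structure(pdb_id), ion TEXT);
""")

# Placeholder rows, made up for this example.
conn.executemany("INSERT INTO structure VALUES (?, ?)",
                 [("XXX1", "example protein A"), ("XXX2", "example protein B")])
conn.executemany("INSERT INTO ligand VALUES (?, ?)",
                 [("XXX1", "CA"), ("XXX2", "PB"), ("XXX2", "CA")])


def structures_with_ion(connection, ion):
    """Return (pdb_id, title) for every structure that binds the given ion."""
    query = """
        SELECT s.pdb_id, s.title
        FROM structure AS s
        JOIN ligand AS l ON s.pdb_id = l.pdb_id   -- join on the common key
        WHERE l.ion = ?
        ORDER BY s.pdb_id
    """
    return connection.execute(query, (ion,)).fetchall()


print(structures_with_ion(conn, "PB"))  # [('XXX2', 'example protein B')]
```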

Database Example • Example: PDB data http://www.qbyv.com/en/network_capabilities; http://www.museumsandtheweb.com/mw2001/papers/stuer/stuer.html

How can we exploit the available data? • Development of algorithms or databases to: • Compare sequences (DNA, RNA, proteins) • Predict structure • secondary structure • homology modelling, threading • ab initio 3D prediction • Analyze 3D structure • structure comparison/alignment • prediction of function from structure • molecular mechanics/molecular dynamics • prediction of molecular interactions, docking • Perform energy minimization calculations • Predict useful mutations for protein engineering • Statistical analyses • Extract and analyze meaningful information that can be applied toward some end

Three perspectives on bioinformatics • The cell • The organism • The tree of life

The Cell

Central Dogma of Molecular Biology • DNA → RNA → protein → phenotype • Transcription: DNA → RNA • Translation (in ribosome): RNA → protein • Levels shown: cell, organism • Phenotype: An organism's traits: morphology, development, biochemistry, physiology, phenology (biological cycles), behavior (from both genes and environment)

Central Dogma of Molecular Biology • DNA → RNA → protein → phenotype • Central Dogma of Genomics • genome → transcriptome → proteome → phenotype • The “ome”: a collection of specified units • Polymers • DNA is a polymer of deoxyribonucleotides • RNA is a polymer of ribonucleotides • Protein is a polymer of amino acids • Because these are all polymers, or sequences of repeating units, we can devise algorithms to study these sequences for trends or to compare the sequences

Nucleotides http://en.wikipedia.org/wiki/File:RNA-comparedto-DNA_thymineAndUracilCorrected.png

Nucleic Acids • Base pairing: A-T, G-C • H-bonds: ~7 kJ/mole http://en.wikipedia.org/wiki/File:RNA-comparedto-DNA_thymineAndUracilCorrected.png; http://mcat-review.org/molecular-biology-dna.php
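Because pairing follows fixed rules (A with T, G with C), the opposite strand of a DNA sequence can be computed directly. A minimal sketch; the function name is chosen for illustration and the input is assumed to contain only uppercase A, C, G and T.

```python
# Watson-Crick pairing rules for DNA: A<->T, G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'.

    Assumes `dna` contains only the uppercase letters A, C, G, T.
    """
    # Complement each base, then reverse, because the two strands
    # run in opposite (antiparallel) directions.
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```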

Structures of Amino Acids • Proteins and polypeptides are biochemical compounds consisting of amino acids • Chains of amino acids bonded together by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues • Proteins • Longer and more complex than polypeptides • Typically folded into a globular or fibrous form • Structure facilitates a biological function • Figure labels: peptide linkages, amino acid, polypeptide, protein

Proteins have different levels of structure • Primary (1°): Sequence of amino acids • Determines 3D structure • Secondary (2°): H-bonding interactions between AA residues begin to produce regular, identifiable structures • Alpha (α) helices • Beta (β) strands • Random coil • Tertiary (3°): Overall structure of single protein in 3 dimensions • Quaternary (4°): Assemblies of multiple polypeptides and/or proteins http://protein-pdb.com/2011/10/04/primary-protein-structure/

Amino Acid Codes • Know these one-letter AA codes, or you will know what it means to be roasted in the depths of the Slor…

Sequence Analyses
DNA Sequence (Genomics)
ATCTTCAGTGTTTCCCCTGTTTTGCCC.ATTTAGTTCGCTC
||||||||||||||||||||||||||| |||||||||||||
ATCTTCAGTGTTTCCCCTGTTTTGCCCGATTTAGTTCGCTC
Protein Sequence (Proteomics)
AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
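The vertical bars in an alignment mark columns where the two sequences agree. A minimal sketch of how such a comparison could be computed for two sequences that are already aligned (same length, with "." or "-" marking a gap); the function name is illustrative, and this is not how ClustalW or any other alignment program works internally.

```python
def compare_aligned(seq_a, seq_b, gap_chars=".-"):
    """Compare two pre-aligned sequences column by column.

    Returns (number of identical columns, 1-based positions that differ).
    Gap characters never count as a match.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    differences = []
    for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1):
        if a == b and a not in gap_chars:
            matches += 1
        else:
            differences.append(position)
    return matches, differences


# The two DNA sequences from the slide ("." marks a gap in the first one).
dna_1 = "ATCTTCAGTGTTTCCCCTGTTTTGCCC.ATTTAGTTCGCTC"
dna_2 = "ATCTTCAGTGTTTCCCCTGTTTTGCCCGATTTAGTTCGCTC"

matches, differences = compare_aligned(dna_1, dna_2)
percent_identity = 100 * matches / len(dna_1)
print(matches, differences)  # 40 [28]
```

Here the only difference is the gap at column 28, so 40 of 41 columns are identical.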

Sequence Analysis Using ClustalW

Protein Secondary Structure: PDBsum (EMBL-EBI) • http://www.ebi.ac.uk/pdbsum/ • Either enter a PDB file or load a new/existing sequence

Applications of Sequence Analyses • Codons (three consecutive RNA bases) each specify one amino acid of the protein that is expressed
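Reading codons can be sketched in a few lines. The example below builds the standard genetic code from its conventional 64-letter layout (bases ordered U, C, A, G) and translates an mRNA string three bases at a time. The function name is illustrative, and the sketch assumes an uppercase RNA sequence read from its first base, stopping at the first stop codon.

```python
from itertools import product

# Standard genetic code, listed in the conventional order UUU, UUC, UUA, UUG, UCU, ...
# ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string into one-letter amino acid codes.

    Reads from the first base in steps of three and stops at the first
    stop codon; any trailing partial codon is ignored.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUAA"))  # MA
```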

Statistical Analysis of PDB Data: Ca2+ vs. Pb2+ • Figure panels: holo- and hemi-directed geometries; pentagonal bipyramidal geometry; Pb: ligand distribution; Ca: EF-hand; Ca: non-EF-hand (Kirberger, Wang et al. 2008; Kirberger and Yang 2008; Glusker et al. 1998)

Develop Algorithms/Programs to Address Specific Problems • Identify calcium-binding proteins by matching patterns of known calcium-binding sites in sequences.
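Pattern matching of this kind can be sketched with a regular expression. The pattern below is a deliberately simplified, hypothetical stand-in for a 12-residue EF-hand calcium-binding loop (D at position 1, D/N/S at positions 3 and 5, G at position 6, E at position 12); it is an assumption made for illustration, not the pattern used in the studies cited in this course or in any curated database, and it will both miss real sites and report false ones.

```python
import re

# Simplified, illustrative 12-residue loop pattern (an assumption, see above):
# position:  1  2  3      4  5      6  7-11   12
#            D  x  [DNS]  x  [DNS]  G  xxxxx  E
EF_HAND_LIKE = re.compile(r"D.[DNS].[DNS]G.{5}E")


def find_candidate_sites(protein):
    """Return (1-based start position, matched residues) for each match."""
    return [(m.start() + 1, m.group()) for m in EF_HAND_LIKE.finditer(protein)]


# Protein sequence shown on the "Sequence Analyses" slide.
sequence = "AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN"

print(find_candidate_sites(sequence))  # [(20, 'DKDGDGTITTKE')]
```

Note that re.finditer reports non-overlapping matches only; a more careful scan would also consider overlapping candidates and score them against known binding sites.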

The Organism

The Organism • Time of development • Genes: Segments of DNA or RNA that code for a polypeptide or for a functional segment of RNA. • Genes of an individual organism can change over time. http://en.wikipedia.org/wiki/File:Gene.png; http://www.mardianinmotion.com/2009/11/anti-aging-medicine-%E2%80%93-hope-hype-or-hucksters

The Tree of Life

Tree of Life http://www.allvoices.com/contributed-news/4553607-is-chimps-as-smart-as-human; After Pace NR (1997) Science 276:734; http://en.wikipedia.org/wiki/File:E_coli_at_10000x,_original.jpg
