REMINDERS • 2nd Exam on • Coverage: • Central Dogma of DNA • Replication • Transcription • Translation • Recombinant DNA technology and molecular biology • Protein analysis
BIOINFORMATICS • Study of the structure of biological information and biological systems • Integrates theories and tools of mathematics/statistics, computer science and information technology • Involves the use of hardware and software to study vast amounts of biological data
What is Bioinformatics? • the field of science in which biology, computer science, and information technology merge to form a single discipline • application of information technology to the storage, management and analysis of biological information • facilitated by the use of computers
FUNCTIONS • Data Management • Storage • Retrieval • Data Analysis *Literature/Bibliography, Sequence, Structure, Taxonomy, Expression, etc.
BIOLOGICAL DATABASES • Systematic data storage/retrieval • Maintained on a regular basis • Can contain various types of data (integration) • Sequence • Structure • Other pertinent information • Nucleic acids and proteins are most common
DATABASES • a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system • Biological databases consist usually of the nucleic acid sequences of the genetic material of various organisms as well as protein sequences and structures
DATABASES • e.g. nucleotide sequence database typically contains information such as • contact name • the input sequence with a description of the type of molecule • the scientific name of the source organism from which it was isolated • additional requirements • easy access to the information • a method for extracting only that information needed to answer a specific biological question
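As a rough illustration of the two requirements above (easy access, and extracting only the information needed), a record could be modeled as follows. This is a minimal sketch: the class `NucleotideRecord`, its field names, the helper `find_by_organism`, and the sample entries are all hypothetical and do not reflect the schema of any real database.

```python
from dataclasses import dataclass


@dataclass
class NucleotideRecord:
    # Hypothetical fields mirroring the slide; not a real database schema.
    contact_name: str
    molecule_type: str
    organism: str
    sequence: str


def find_by_organism(records, organism):
    """Return only the records whose source organism matches the query."""
    return [r for r in records if r.organism == organism]


# Made-up toy entries, for illustration only.
records = [
    NucleotideRecord("A. Researcher", "mRNA", "Saccharomyces cerevisiae", "ATGGCC"),
    NucleotideRecord("B. Researcher", "genomic DNA", "Homo sapiens", "ATGTTT"),
]

yeast = find_by_organism(records, "Saccharomyces cerevisiae")
for record in yeast:
    print(record.organism, record.molecule_type, record.sequence)
```

Real databases answer such queries through indexed search rather than a linear scan, but the idea of filtering on a stored field is the same.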
DATABASES • Sequence • GenBank, European Nucleotide Archive (ENA) and DNA Data Bank of Japan (DDBJ); managed by the International Nucleotide Sequence Database Collaboration (INSDC) • UniGene • Saccharomyces Genome Database (SGD) • UniProtKB (UniProtKB/Swiss-Prot or UniProtKB/TrEMBL) • ExPASy
DATABASES • Structure • Nucleic Acid Database (NDB) • Protein Data Bank (PDB) • Worldwide Protein Data Bank (wwPDB) • ExPASy
DATA MINING • Process by which testable hypotheses are created regarding function/structure of gene/protein of interest through identifying similar sequences in “more established” organisms • Tools: • Text-term search • Sequence similarity search
Machine Learning • Studies methods and the design of computer programs that improve based on past experience • Why? • New methods are being introduced • Old ones should be improved
“Units” of Information • DNA (genome) • RNA (transcriptome) • Protein (proteome)
What is Being Analyzed? • Sequence • Structure • Interactions • Pathways • Mutations/Evolution
Why? • Increasing amount of biological information entails • Organization • Archiving • Global unification/harmonization • More biological discoveries • Functional/Structural similarities • Phylogenetic/Evolutionary patterns
Applications • Medicine • Pharmaceuticals • Biotechnology • Agriculture
Molecular Data • When you draw a molecule, • You start with atoms • Then proceed with the structure • And the three-dimensional data • What can be stored? • Coordinates • Sequences • Chemical graphs • Atoms and bonds
Databases • Protein Data Bank (PDB) • Molecular Modeling Database (MMDB)
Techniques in the Laboratory • X-ray Crystallography • Nuclear Magnetic Resonance
Formats • PDB • mmCIF • MMDB
Structure Viewers • Cn3D • RasMol • WebMol • Mage • VRML • CAD • Swiss PDB Viewer
Promises of bioinformatics • Medicine • Knowledge of protein structure facilitates drug design • Understanding of genomic variation allows the tailoring of medical treatment to the individual’s genetic make-up • Genome analysis allows the targeting of genetic diseases • The effect of a disease or of a therapeutic on RNA and protein levels can be elucidated • The same techniques can be applied to biotechnology, crop and livestock improvement, etc...
Challenges in bioinformatics • Explosion of information • Need for faster, automated analysis to process large amounts of data • Need for integration between different types of information (sequences, literature, annotations, protein levels, RNA levels etc…) • Need for “smarter” software to identify interesting relationships in very large data sets • Lack of “bioinformaticians” • Software needs to be easier to access, use and understand • Biologists need to learn about the software, its limitations, and how to interpret its results
Two or More Sequences • Measure similarity • Determine correspondences between residues • Find patterns of conservation • Derive evolutionary relationships
Alignment • Correspondences between nucleotides/amino acids in two or more sequences are assigned • An assignment of correspondences that preserves the order of the residues within the sequences is an alignment • Gaps are used to achieve this • Sequence alignment refers to the identification of residue-residue correspondences
Uses • Homology • Similarities • “Ancestry” • Genome annotation • Assigning structure and function to genes • Database queries • For newly-discovered/unknown sequences
Tools • Dot Plots • Diagonal lines of dots showing similarities between two sequences • Scoring Matrices • Score reflects quality of each possible alignment; best possible score is identified • Scoring scheme is crucial • PAM (Point Accepted Mutation) and BLOSUM (BLOCKS Substitution Matrix) • Dynamic Programming • Algorithmic technique that reuses previous computations
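To make the dot plot idea concrete, here is a minimal text-based sketch. The function name `dot_plot` and the choice of `*` and `.` as plot characters are assumptions made for illustration; real tools usually compare windows of residues and filter noise rather than single positions.

```python
def dot_plot(seq_a, seq_b):
    """Return one text row per residue of seq_a.

    A '*' marks a position where the residue of seq_a equals the
    residue of seq_b; a '.' marks a difference.
    """
    return [
        "".join("*" if a == b else "." for b in seq_b)
        for a in seq_a
    ]


rows = dot_plot("GATTACA", "GATTACA")
for row in rows:
    print(row)
```

Comparing a sequence with itself, as here, always fills the main diagonal with `*`. Shorter diagonal runs away from the main diagonal indicate repeated or similar regions.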
Scoring • Penalties/Scores • Match (e.g. A – A) • Mismatch (e.g. A – C) • Gap (e.g. A – _) • Linear Gap Penalty: uniform penalty for every gap position • Affine Gap Penalty: separate penalties for gap existence (opening a gap) and gap extension
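The difference between the two gap penalty schemes can be sketched by scoring an alignment that is already given. The function `score_alignment`, its parameter names, and the numeric values are illustrative assumptions. The convention assumed here is that the first position of a gap costs `gap_open` and each further position costs `gap_extend`; other conventions exist.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-2, gap_extend=-2):
    """Score two equal-length aligned rows that use '-' for gaps.

    With gap_open == gap_extend the gap penalty is linear;
    with different values it is affine.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    previous_gap = None  # row that had a gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            current_gap = "a" if a == "-" else "b"
            # Continuing a gap in the same row is an extension.
            score += gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            score += match if a == b else mismatch
            previous_gap = None
    return score


linear = score_alignment("GATTACA", "GA--ACA", gap_open=-2, gap_extend=-2)
affine = score_alignment("GATTACA", "GA--ACA", gap_open=-4, gap_extend=-1)
print(linear, affine)  # 1 0
```

Both calls count five matches (+5). The linear scheme charges -2 for each of the two gap positions, while the affine scheme charges -4 to open the gap and -1 to extend it.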
Local vs. Global Alignments • Global Alignment • Similarities between majority of two sequences • Local Alignment • Similarities between specific parts of two sequences
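The contrast can be sketched by computing both kinds of score for the same pair of sequences. The functions below are a simplified illustration with assumed names and an assumed scoring scheme (match +1, mismatch -1, linear gap -1). `global_score` follows the Needleman-Wunsch recurrence, and `local_score` follows the Smith-Waterman recurrence, which is not named on the slides but is the standard dynamic programming approach to local alignment. Only scores are computed, not the alignments themselves.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can restart anywhere.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


seq_a = "CCCCGATTACA"
seq_b = "GATTACAGGGG"
print(global_score(seq_a, seq_b), local_score(seq_a, seq_b))
```

The two sequences share only the region `GATTACA`. The local score rewards that shared region on its own, while the global score is pulled down by the unrelated flanking residues that must also be aligned.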
Programs • Pairwise Sequence Alignment • BLAST • VAST (compares three-dimensional structures rather than sequences) • FASTA • Multiple Sequence Alignment • MAFFT
Needleman-Wunsch Algorithm • Used for global alignments • Maximum-value function • A simple scoring scheme is assumed • Three steps • Initialization • Matrix fill (scoring) • Traceback (alignment)
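The three steps can be sketched in a few lines of code. This is a minimal illustration, not a reference implementation: the function name, the simple scoring scheme (match +1, mismatch -1, linear gap -1), and the tie-breaking order in the traceback are all assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: initialization. The first row and column hold gap-only scores.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Step 2: matrix fill. Each cell takes the maximum of three moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align two residues
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Step 3: traceback from the bottom-right cell to the top-left cell.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, aligned_a, aligned_b = needleman_wunsch("GATTACA", "GATACA")
print(best)  # 5: six matches and one gap
print(aligned_a)
print(aligned_b)
```

When several alignments share the best score, the traceback here prefers the diagonal move, so a different tie-breaking order may report a different but equally good alignment.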