Bioinformatics Dillon Dugan | BIOL 446L
What is Bioinformatics?
• No one seems to agree on a single definitive explanation
• The best explanation I found: “Bioinformatics involves the integration of computers, software tools, and databases in an effort to address biological questions.”
• The National Center for Biotechnology Information (NCBI) divides bioinformatics into three important sub-disciplines:
  • The development of new algorithms and statistics with which to assess relationships among members of large data sets
  • The analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures
  • The development and implementation of tools that enable efficient access and management of different types of information
History of Bioinformatics
• The demand for bioinformatic databases began in 1956, when Sanger reported the first protein sequence, and grew nearly a decade later, when the first nucleic acid sequence was reported
• In 1966, Margaret Belle (Oakley) Dayhoff and Richard V. Eck pioneered the field of bioinformatics by using computational analysis to compare protein sequences and reconstruct their evolutionary histories from the resulting sequence alignments
• Their database was published as the Atlas of Protein Sequence and Structure, which is known as the first bioinformatic database
• The field of bioinformatics would be fueled by the need for databases with large storage capacity and for computer programs to process the data collected from sequences
Databases
• What is a database?
  • An organized collection of data
  • Generally stored and accessed electronically
• How do electronic databases work?
  • In general, information is stored as bytes, either on a local hard drive or on a remote cloud service
  • The information is organized into rows called records, and each record contains columns of similar information called fields
  • Queries are used to search the stored data for the desired information
  • A search can be based on the value you want in a given field
  • Additional code handles maintenance of the database
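The records, fields, and queries described above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table name, the field names, and the DEMO accessions and sequences are made up for the example and do not come from any real biological database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one table of sequence records."""
    conn = sqlite3.connect(":memory:")
    # Each column is a field; each inserted row is a record.
    conn.execute(
        "CREATE TABLE sequences ("
        " accession TEXT PRIMARY KEY,"
        " organism TEXT,"
        " molecule TEXT,"
        " sequence TEXT)"
    )
    # Hypothetical records, invented for illustration only.
    rows = [
        ("DEMO0001", "Escherichia coli", "DNA", "ATGGCGTACGTTAG"),
        ("DEMO0002", "Homo sapiens", "DNA", "ATGCCCGGGTTTAAA"),
        ("DEMO0003", "Homo sapiens", "protein", "MKTAYIAKQR"),
    ]
    conn.executemany("INSERT INTO sequences VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def find_by_field(conn, organism, molecule):
    """Query: return (accession, sequence) pairs whose fields match."""
    cur = conn.execute(
        "SELECT accession, sequence FROM sequences"
        " WHERE organism = ? AND molecule = ? ORDER BY accession",
        (organism, molecule),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for accession, sequence in find_by_field(conn, "Homo sapiens", "DNA"):
        print(accession, sequence)
    conn.close()
```

The query uses "?" placeholders so that the search values are passed separately from the SQL text, which is the usual way to search on a field safely.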
Biological Databases
• There are approximately 180 biological databases available at present
• The three primary databases are GenBank, EMBL, and DDBJ
• Biological databases are divided into nucleic acid sequence, protein/amino acid sequence, signal transduction pathway, and metabolic pathway databases, plus a few other minor categories
• The main databases are the nucleic acid and protein/amino acid databases
• How do biological databases work?
  • A DNA sequence is read through sequencing
  • That data is deposited in a server/database
  • The information can then be analyzed by bioinformatic tools such as BLAST or MEME
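The store-then-analyze workflow above can be sketched in a few lines of Python. The sketch assumes the sequences arrive as FASTA-formatted text (a ">" header line followed by sequence lines), which the slides do not specify. The read names and sequences are hypothetical, and GC content is only a stand-in for the much richer analysis done by tools such as BLAST or MEME.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records[name] = "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else "unnamed"
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical sequencing output, invented for illustration only.
DEMO_FASTA = """>demo_read_1 hypothetical example
ATGCGC
GGTA
>demo_read_2
ttttaaaa
"""

if __name__ == "__main__":
    database = parse_fasta(DEMO_FASTA)  # the "store" step
    for name, seq in database.items():  # the "analyze" step
        print(name, len(seq), round(gc_content(seq), 2))
```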
GenBank
• Started as the Los Alamos Sequence Database in 1979 at Los Alamos National Laboratory (LANL)
• Walter Goad, a nuclear physicist, decided to focus on biological work and created the Los Alamos Sequence Database with some of his colleagues
• This effort culminated in the creation of the public GenBank in 1982
• In collaboration with BBN, nearly 2,000 sequences were stored in this database by 1983
• Later, in the mid-1980s, LANL would collaborate with Stanford University
• As one of the earliest widely accessible biological databases, GenBank started a program to promote open-access communication among biological scientists
• By 1992, the GenBank project was transferred to the newly created National Center for Biotechnology Information (NCBI) under the National Library of Medicine
• Currently, nucleotide sequences and protein translations from nearly 100,000 distinct organisms are publicly accessible through GenBank
• As you all know, GenBank's host, NCBI, provides BLAST, a web-based tool that searches for sequences similar to yours and gives detailed information about each match
European Molecular Biology Laboratory (EMBL)
• The EMBL was conceived by Leó Szilárd, James Watson, and John Kendrew and created in 1974 as an international research center to rival the American-dominated field of molecular biology
• Its main laboratory is in Heidelberg, Germany, with outstations in England, France, Italy, Spain, and a second site in Germany
• The EMBL focuses on research in molecular biology and molecular medicine as well as training for scientists, students, and visitors
• Hosts two important tools: ClustalX and HMMER
  • ClustalX: multiple sequence alignment of DNA or protein sequences
  • HMMER: fast and sensitive homology searches
DNA Data Bank of Japan (DDBJ)
• DNA sequence database located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan
• DDBJ started its activity in 1986 and remains the only nucleotide sequence database in Asia
• Although it is mostly used by Japanese researchers, DDBJ accepts data from researchers of any country
• Has its own BLAST tool for nucleotide sequence searches and TXSearch for taxonomic searches
Types of Bioinformatics Tools
• BLAST
  • Basic Local Alignment Search Tool
  • An algorithm for comparing biological sequence information, such as nucleic acid sequences or protein/amino acid sequences
• FASTA
  • A heuristic search program that first finds short word matches and then refines the best-scoring regions with a Smith-Waterman-style local alignment
  • Generally slower than BLAST
  • Can give more sensitive results
• Clustal
  • Progressive multiple sequence alignment guided by a tree built from pairwise sequence comparisons (UPGMA cluster analysis in the original Clustal)
  • Written in C++
• HMMER
  • Detects homologous protein or nucleotide sequences by comparing a profile HMM (Hidden Markov Model) to either a single sequence or a database of sequences
  • Profile HMMs turn a multiple sequence alignment into a position-specific scoring system, which can be used to align sequences and search databases for remotely homologous sequences
• SignalP
  • Predicts the presence and location of signal peptide cleavage sites in amino acid sequences from eukaryotes, Gram-positive prokaryotes, and Gram-negative prokaryotes
  • Predictions are made by combining several artificial neural networks
• SMART
  • Simple Modular Architecture Research Tool
  • Biological database used for the identification and analysis of protein domains within protein sequences
  • Protein domain: a conserved part of a given protein sequence and structure that can evolve, function, and exist independently of the rest of the protein chain
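Local alignment, the idea behind the Smith-Waterman algorithm named above, can be illustrated with a short Python sketch. This is a textbook dynamic-programming version with a simple linear gap penalty, not the implementation used by BLAST or FASTA, which add heuristics and substitution matrices for speed and sensitivity. The scoring values (match = 2, mismatch = -1, gap = -2) and the example sequences are arbitrary assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best local alignment score ending at a[i-1], b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            diag = score[i - 1][j - 1] + sub
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Flooring at 0 lets an alignment restart anywhere (local).
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTAAA", "GGACGTCC"))
```

The full table costs time proportional to the product of the two sequence lengths, which is why database search tools rely on word-matching shortcuts before running this kind of alignment.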
Works Cited
• Fox, Joanne. 4 Aug. 2006. What is Bioinformatics? The Science Creative Quarterly. www.scq.ubc.ca/what-is-bioinformatics/
• Stretton, Antony O. W. 2002. The First Sequence: Fred Sanger and Insulin. Genetics Society of America.
• Christophe. 19 Aug. 2015. How Does a Relational Database Work. Coding Geek. coding-geek.com/how-databases-work/
• Masic, Izet. 2016. The Most Influential Scientists in the Development of Medical Informatics: Margaret Belle Dayhoff. National Center for Biotechnology Information.
• Lee, John. 2007. Richard V. Eck (1922-2006): Bioinformatics: In the beginning. National Center for Biotechnology Information.
• Thampi, Sabu M. 2001. Bioinformatics. LBS College of Engineering.
• http://www.cbs.dtu.dk/services/SignalP/
• http://smart.embl-heidelberg.de/help/smart_about.shtml
• https://en.wikipedia.org/wiki/Bioinformatics
• https://en.wikipedia.org/wiki/Margaret_Oakley_Dayhoff
• https://en.wikipedia.org/wiki/Database
• https://en.wikipedia.org/wiki/GenBank
• https://en.wikipedia.org/wiki/DNA_Data_Bank_of_Japan
• https://en.wikipedia.org/wiki/HMMER
• https://en.wikipedia.org/wiki/Clustal
• https://en.wikipedia.org/wiki/Simple_Modular_Architecture_Research_Tool