Bioinformatics at Promega Corporation Intro to Bioinformatics Biotec May 4, 2006 Ethan Strauss Sr. Scientist R&D Bioinformatics, Promega, Ethan.strauss@promega.com http://q7.com/~ethan/molbio
My Background • Bachelor’s degree in biology • PhD and work experience in Molecular Biology • Eight years in Promega Technical Services • Almost a year in Bioinformatics (officially) • No formal computer training • No formal bioinformatics training
Bioinformatics at Promega Corporation • Bioinformatics did not exist as a separate function until 2001 • One person 2001 - 2005 • Two people 2005 - ? • Bioinformatics supports primarily R&D (~100 scientists) • Mentor and train R&D scientists • Provide expertise for projects (~120 requests per year) • Propose and evaluate new acquisitions • Liaison to IT department • Manage bioinformatics infrastructure (~15 tools) • Develop new tools and adapt existing tools in house
Bioinformatics Projects • Programming • Tools for internal and external Promega customers • Plexor™ Primer Design System • Biomath • siRNA Designer • Sequence analysis for Excel and Microsoft Word • Analysis of BLAST results • Automated data retrieval (Web services) • Database for tracking vector construction • Database for keeping track of plasmid features • Laboratory Information Management System (LIMS) • Chemical Database
Bioinformatics Projects • Biocomputing (use of computers in biological research) • Database searches • data mining • discovery research • Analysis & in silico design of nucleic acid and protein sequence • Molecular visualization • Modeling • Simulation (proteins, ligands)
Programming • Tools for Promega customers • Biomath (http://www.promega.com/biomath/) • Basic calculations (most can be done easily by hand) • Simple code (JavaScript) • Established theory • Universal (not Promega specific) • siRNA Designer (http://www.promega.com/siRNADesigner/) • Complex calculations • More complex code (VBScript) • Rapidly evolving theory • Partially Promega specific
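As an illustration of the kind of "basic calculation" a Biomath-style tool performs, here is a minimal Python sketch that converts between mass and molar amount of double-stranded DNA. It is not the Biomath code (which the slide describes as JavaScript); the function names are hypothetical, and the 660 g/mol average mass per base pair is the usual textbook approximation.

```python
# Average mass of one double-stranded DNA base pair in g/mol (standard approximation).
AVG_BP_MASS = 660.0


def dsdna_ug_to_pmol(micrograms: float, length_bp: int) -> float:
    """Convert a mass of double-stranded DNA (micrograms) to picomoles of molecules."""
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    # 1 ug = 1e6 pg, and pg divided by (g/mol) gives pmol.
    return micrograms * 1e6 / (length_bp * AVG_BP_MASS)


def dsdna_pmol_to_ug(picomoles: float, length_bp: int) -> float:
    """Inverse conversion: picomoles of double-stranded DNA to micrograms."""
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    return picomoles * length_bp * AVG_BP_MASS / 1e6


if __name__ == "__main__":
    # 1 ug of a 1000 bp fragment is roughly 1.5 pmol.
    print(round(dsdna_ug_to_pmol(1.0, 1000), 2))
```

Calculations like this are easy by hand, which is the slide's point: the value of the tool is convenience and consistency, not algorithmic difficulty.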
Programming • Tools for Promega customers • Plexor Primer Design (https://www.promega.com/techserv/tools/plexor) • Complex calculations • Complex code (C#.Net) • Separate user interface and main calculations • Multiple interacting modules • Database integration • Integration with Genbank (through a web service) • Proprietary improvements on established theory • Very Promega specific
Programming • Tools for internal use • BLAST analysis of Plexor Primers • Primer specificity is important • BLAST can determine specificity, but output is very complex • Simplify • Combine all hits from the same “Gene” • Only show hits which could mis-prime • Group hits by species • Allow sorting by species
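The simplification steps listed above can be sketched in Python as follows. This is not the internal Promega tool: the `Hit` record, its field names, and the mis-priming rule (the alignment must reach the primer's 3' end with at most two mismatches) are all assumptions made for illustration, and the hits would come from parsed BLAST output in practice.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLAST hit for a primer (hypothetical, already-parsed record)."""
    gene: str
    species: str
    primer_start: int  # 1-based first primer base covered by the alignment
    primer_end: int    # 1-based last primer base covered by the alignment
    mismatches: int


def could_misprime(hit: Hit, primer_len: int, max_mismatches: int = 2) -> bool:
    # Assumed rule: the alignment reaches the primer's 3' end and has few mismatches.
    return hit.primer_end == primer_len and hit.mismatches <= max_mismatches


def summarize(hits, primer_len):
    """Keep mis-priming hits, one per (species, gene), grouped and sorted by species."""
    best = {}
    for hit in hits:
        if not could_misprime(hit, primer_len):
            continue
        key = (hit.species, hit.gene)
        # Combine all hits from the same gene by keeping the closest match.
        if key not in best or hit.mismatches < best[key].mismatches:
            best[key] = hit
    by_species = defaultdict(list)
    for (species, _gene), hit in best.items():
        by_species[species].append(hit)
    # Sort species alphabetically, and genes alphabetically within each species.
    return {sp: sorted(hs, key=lambda h: h.gene) for sp, hs in sorted(by_species.items())}
```

Collapsing to one row per gene and species is what turns tens of pages of raw output into a short table.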
Programming • Tools for internal use • BLAST analysis of Plexor Primers • [Figure: initial BLAST results (1 page out of ~30)] • [Figure: analyzed BLAST results (complete!)]
Programming • Tools for internal use • Vector/Insert Database • Promega’s Flexi vector system has a very structured cloning procedure. • R&D has been making many different Flexi vector backbones with many inserts. • Keeping track has been a problem. • A database is in development
Programming • Tools for internal use
Programming • Internal Projects • Which Restriction enzyme cuts least frequently in human ORFs? • Method: • Download human Refseq database (ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/) • Load into local database • Scan each sequence for each RE site • The scan took 2-3 hours to complete http://www.promega.com/pnotes/89/12416_11/12416_11.pdf
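The scanning step can be sketched in a few lines of Python. This is an assumption-laden illustration, not the method used for the study: the ORFs are passed in as a plain dictionary rather than read from a local database, only palindromic recognition sites are handled (so scanning one strand is enough), and the five enzymes are just examples.

```python
import re

# Example palindromic recognition sites (illustrative subset, not the full RE list).
SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "NotI": "GCGGCCGC",
    "SgfI": "GCGATCGC",
    "PmeI": "GTTTAAAC",
}


def count_sites(seq: str, site: str) -> int:
    """Count occurrences of site in seq, including overlapping ones."""
    # A zero-width lookahead matches at every start position of the site.
    return len(re.findall(f"(?={re.escape(site)})", seq.upper()))


def orfs_cut_per_enzyme(orfs: dict, sites: dict = SITES) -> dict:
    """Return {enzyme: number of ORFs containing at least one site}."""
    cut = {name: 0 for name in sites}
    for seq in orfs.values():
        for name, site in sites.items():
            if count_sites(seq, site) > 0:
                cut[name] += 1
    return cut


def rarest_cutters(orfs: dict, sites: dict = SITES) -> list:
    """Enzymes ordered from fewest to most ORFs cut (ties broken by name)."""
    cut = orfs_cut_per_enzyme(orfs, sites)
    return sorted(cut.items(), key=lambda kv: (kv[1], kv[0]))
```

Counting ORFs that contain at least one site, rather than total sites, matches the cloning question: an ORF is unusable with an enzyme as soon as it is cut once.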
Programming • Internal Projects • Which human genes in Genbank are the most “popular”? • Method • Download “Gene” database (ftp://ftp.ncbi.nlm.nih.gov/gene/) • Download Gene Ontology information (http://www.geneontology.org/) • Use web services to get pathway information from KEGG (http://www.genome.jp/kegg/) • Use web services to get citation information from Pubmed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed) • Load all into local database • Rank genes by desired criteria • Size • Function • Localization • Pathways • Publications
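Once everything is loaded into a local database, the ranking step is a simple query. The sketch below uses Python's built-in `sqlite3` with an in-memory database; the table layout, column names, and the `GENE_A`-style records are hypothetical placeholders, not the actual schema or data from the project.

```python
import sqlite3

# Columns that may be used as ranking criteria (hypothetical schema).
ALLOWED_CRITERIA = {"length_bp", "n_pathways", "n_publications"}


def build_db(records):
    """Create an in-memory gene table from (symbol, length, pathways, publications) rows."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE gene ("
        "symbol TEXT PRIMARY KEY, length_bp INTEGER, "
        "n_pathways INTEGER, n_publications INTEGER)"
    )
    con.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", records)
    return con


def rank_genes(con, criterion, limit=10):
    """Return the top genes by the chosen criterion, highest value first."""
    # Column names cannot be bound as SQL parameters, so check against a whitelist.
    if criterion not in ALLOWED_CRITERIA:
        raise ValueError(f"unknown criterion: {criterion}")
    query = (
        f"SELECT symbol, {criterion} FROM gene "
        f"ORDER BY {criterion} DESC, symbol LIMIT ?"
    )
    return con.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    # Placeholder records; real values would come from Gene, GO, KEGG, and PubMed.
    con = build_db([("GENE_A", 1200, 3, 50), ("GENE_B", 900, 5, 500), ("GENE_C", 3000, 1, 5)])
    print(rank_genes(con, "n_publications", limit=2))
```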
Database searches and data mining • Question: Can you reformat this sequence for me? Tool: ReadSeq (http://bimas.dcrt.nih.gov/molbio/readseq) & Macros • Question: How many viral proteins start with MetHis? Tool: Hits database & motif searches (http://hits.isb-sib.ch/) • Question: How many different bacterial two-domain proteins are known? Tool: SCOP database (http://scop.berkeley.edu/) • Question: How do I design PCR primers selective for bacterial species X? Tool: Ribosomal database 16S rRNA alignment (http://rdp.cme.msu.edu)
In silico design – RNA sequences • Goal: Design RNA sequence that folds into specific structure (specific structure provides desired function) • Tools: • mfold (Michael Zuker) http://www.bioinfo.rpi.edu/~zukerm/ • Vienna RNA Package http://www.tbi.univie.ac.at/
In silico design – DNA sequences • Goal: Express protein of interest in E. coli cells – fastest way • Steps: • Obtain protein or DNA sequence from database • Optimize codon usage for expression in E. coli • Match restriction enzyme sites to expression vector • Send DNA sequence for synthesis (cost ~$1/base) • Tools: • NCBI database http://www.ncbi.nlm.nih.gov • Codon usage database http://www.kazusa.or.jp/codon/ • Restriction enzyme database http://rebase.neb.com/rebase/rebase.html • Sequence analysis software
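The codon-optimization and restriction-site steps above can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the naive "one preferred codon per amino acid" approach, the `PREFERRED_CODON` table is a small placeholder that should be replaced with values taken from the codon usage database, and the function names are hypothetical.

```python
# Placeholder table: one codon per amino acid, covering only a few residues.
# Replace with real E. coli preferences from a codon usage database before use.
PREFERRED_CODON = {
    "M": "ATG", "W": "TGG", "L": "CTG", "K": "AAA",
    "E": "GAA", "A": "GCG", "G": "GGC", "*": "TAA",
}


def back_translate(protein: str, table: dict = PREFERRED_CODON) -> str:
    """Back-translate a protein sequence using one preferred codon per residue."""
    missing = sorted({aa for aa in protein.upper() if aa not in table})
    if missing:
        raise KeyError(f"no codon defined for: {', '.join(missing)}")
    return "".join(table[aa] for aa in protein.upper())


def find_sites(dna: str, sites: dict) -> list:
    """Return (enzyme, 0-based position) for every occurrence of each site."""
    found = []
    for name, site in sorted(sites.items()):
        pos = dna.find(site)
        while pos != -1:
            found.append((name, pos))
            pos = dna.find(site, pos + 1)  # step by one to catch overlaps
    return found


if __name__ == "__main__":
    dna = back_translate("MKLEAG*")
    # Check the designed sequence for sites of the enzymes planned for cloning.
    print(dna, find_sites(dna, {"EcoRI": "GAATTC", "BamHI": "GGATCC"}))
```

Any internal site reported by `find_sites` for an enzyme used in cloning would need a synonymous codon swap before the sequence is sent for synthesis.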
In silico design – reporter gene Goal: Design optimal DNA sequence coding for reporter protein (maximize expression and minimize unintended regulation)
In silico design – reporter genes • Tools: • Optimize codon usage: Codon Usage DB (http://www.kazusa.or.jp/codon/), INCA (http://www.bioinfo-hr.org/inca/) • Identify & remove regulatory sites: TRANSFAC DB (http://www.biobase.de/), TESS (http://www.cbil.upenn.edu/tess/), Genomatix tools (http://www.genomatix.de), others • hRluc • Expression: up 10x • Background: down 10x • Non-specific regulation: lower
Visualization – molecular system of interest Goal: Visualize molecule of interest (blue) and interaction partners Tools: World Index of Molecular Visualization Resources http://molvis.sdsc.edu/visres/index.html
Modeling – protein fold • Goal: 3D structure model of enzyme => location of N/C termini => find active site => other • Tools: • NCBI BLink http://www.ncbi.nlm.nih.gov/ • Protein Data Bank http://www.rcsb.org/pdb • SwissModel http://swissmodel.expasy.org/ • WHAT IF http://swift.cmbi.ru.nl/whatif/ • InsightII Modeler http://www.accelrys.com/insight • Unknown 3D structure: Renilla luciferase • Homologue with known 3D structure: Hydrolase • Sequence identity: 36%
Modeling – protein engineering • Goal: Alter catalytic activity of enzyme => predict structural effects of different point mutations • [Figure panels: mutation disrupts structure; mutation does not disrupt structure] • Tools: InsightII Modeler http://www.accelrys.com/insight/
Modeling – protein engineering • Goal: Improve substrate binding rate of enzyme => identify specific amino acids to mutate • [Figure panels: constricted binding tunnel; open binding tunnel (mutant)] • Tools: InsightII Modeler http://www.accelrys.com/insight/
Modeling – substrate engineering Goal: Find better substrate for enzyme => analyze geometric constraints of substrate binding pocket Tools: Hetero-compound Info Center http://alpha2.bmc.uu.se/hicup/ InsightII Modeler http://www.accelrys.com/insight/
LIMS – Laboratory Information Management System • Goal: Manage in-house DNA sequences and associated data • Eval: UW-Madison Center for Eukaryotic Structural Genomics • Sesame http://www.sesame.wisc.edu/ • “…Sesame is designed to organize and record data relevant to complex scientific projects, to launch computer-controlled processes, and to help decide about subsequent steps on the basis of information available. The Sesame system is based on the multi-tier paradigm, and it consists of a framework and application modules that carry out specific tasks. Users interact with Sesame through a series of web-based Java applet-applications designed to organize data. It allows collaborators on a given project to enter, process, view, and extract relevant data, regardless of location, so long as web access is available. Data reside in an Oracle relational database. Sesame serves as a digital laboratory notebook and allows users to attach numerous files and images…”
Bioinformatics Advice • Be aware of bias in databases! • Search Genbank (nucleotide) for Human[Organism] apoptosis. How many hits? • Now try Orcinus[Organism] apoptosis. How many hits? • Can you conclude that Orcinus does not have apoptosis?
Bioinformatics Advice • Bioinformatics is changing and advancing very rapidly. • Don’t forget to notice what is new. • NCBI now has ~20 different databases. It had only two 3-5 years ago. • If you want to do something that you know can’t be done, check again in two weeks! • My standard computer can process the entire human genome for restriction sites, ORFs, etc. in a few hours. Not long ago, the best computers couldn’t even hold that much data! • If old tools work, don’t feel you need to use the newest tools. • I still do much of my analysis with Microsoft Word…