Use of Computers in Molecular Biology • Meena K Sakharkar, Training Manager, BioInformatics Centre, National University of Singapore
What is BioInformatics? • Many related terms and buzzwords • A multiplicity of names: • bioinformatics • biocomputing • biological computing • computational biology • computational genomics • biological data mining
Overview of the challenges of Molecular Biology Computing • The huge dataset problem • automated DNA sequencers • the Human Genome Project • bulk sequencing of cDNAs (ESTs)
GenBank Growth Chart • [Chart: bases in GenBank by year] • As of Oct. 1999, GenBank contains over 3.8 billion bases of DNA and protein sequence, which requires about 18 gigabytes of computer disk storage space.
Human Genome Project • What is the Human Genome Project? • 15-year effort formally begun in October 1990, coordinated by the U.S. Department of Energy and the National Institutes of Health. • identify all the estimated 80,000 genes in human DNA, • determine the sequences of the 3 billion chemical bases that make up human DNA, • store this information in databases, • develop tools for data analysis, and • address the ethical, legal, and social issues (ELSI) that may arise from the project.
Who is head of the U.S. Human Genome Project? • The DOE Human Genome Program is directed by Ari Patrinos, and Francis Collins directs the NIH Human Genome Program. • Ari Patrinos also heads the Department of Energy Office of Biological and Environmental Research.
Related fields • molecular evolution • origin of life • genomics and proteomics • the Human Genome Project • theoretical biology • complexity and information theory • biotechnology • lead drug discovery • computing with biomolecules
Our (working) definition • Bioinformatics: the body of tools, algorithms and know-how needed to handle complex biological information (the technological aspect) • Computational biology: the application of bioinformatics tools to perform biological studies (the scientific aspect) • A very broad and diverse field
Bioinformatics is clearly a multidisciplinary field including: • computer systems management • networking, database design • computer programming • molecular biology
Integrating bioinformatics and computational biology: • A biologist can use existing tools but might misinterpret results: the black-box effect (the 'software kit') • A biologist might refrain from doing some interesting analysis if the existing software doesn't offer it as an option: the ability to program is important • A computer scientist or a programmer can produce interesting and/or efficient algorithms and tools, but these might lack biological relevance: a biological training/background is important • Beware of the 'just a tool maker' stigma • Best results are achieved by integrating the development of tools with their usage in interesting biological systems
How to handle all the information? • Producing • Processing • Storing • Sharing • Querying • Retrieving • Visualising • Annotating • Curating
Use of Computers in Molecular Biology • Powerful tools to organise the data itself. • Exponential growth. • A new release is made every two months. • Data Analysis. • Retrieval. • Homology Search. • Modelling purposes - Drug Design • Data Integration • Data Visualisation
Paradigmatic Shift: • Getting new sequences is now easy. • Having a new sequence, we can start by analysing it using the computer, or we can start by doing experimental work. • "A month in the lab can often save an hour in the library." - Westheimer ... or searching the Internet, or doing computerised analyses. • From 'wet lab' to 'soft lab'. • in vivo, in vitro, and in silico
Information is being collected, organized, and made available: • GenBank is the central sequence information database in the United States • Data is shared between GenBank, the European Molecular Biology Laboratory (EMBL) and the DNA Database of Japan (DDBJ) • All sequence data submitted to any of these databases is automatically integrated into the others. • Sequence data is also incorporated from the Genome Sequence Data Base (GSDB) and from patent applications.
Similarity Searching in the databanks • "Are there any sequences in the databanks similar to my sequence?" • Directly searching the databanks by comparing sequences uses too much computer time • The Biologist uses timesaving tools: FASTA and BLAST • Relies on statistics and the informed judgement of the Biologist.
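Much of the time saved by tools such as FASTA and BLAST comes from first looking for short exact word matches, so that only promising database sequences need closer comparison. The sketch below illustrates that word-lookup idea only; it is not either program's actual algorithm, and the function names and toy databank are invented for illustration.

```python
from collections import defaultdict

def build_word_index(database, k=3):
    """Map every k-letter word to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def candidate_hits(query, index, k=3):
    """Rank database sequences by how many distinct query words they share."""
    counts = defaultdict(int)
    words = {query[i:i + k] for i in range(len(query) - k + 1)}
    for word in words:
        for name in index.get(word, ()):
            counts[name] += 1
    # Highest shared-word count first; ties broken by name
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy databank (hypothetical sequences, for illustration only)
toy_db = {
    "seqA": "ACGTACGTGA",
    "seqB": "TTTTTTTTTT",
    "seqC": "GGACGTTCCC",
}
index = build_word_index(toy_db)
hits = candidate_hits("ACGTAC", index)
print(hits)
```

Sequences that share no word with the query never appear in the result, which is where the saving comes from. Deciding whether a surviving candidate is a meaningful match still needs the statistics and judgement mentioned above.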
Pairwise and Multiple Alignments • Multiple Alignment is the basis for the study of protein families and functional domains. • When pairwise alignment is expanded to multiple sequences, it becomes a computationally huge problem. • To reduce the enormous number of possible alignments, a simplified heuristic (approximate) algorithm known as progressive pairwise alignment is used
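The pairwise step that progressive methods build on can be sketched with the classic dynamic-programming (Needleman-Wunsch style) global alignment score. This is a minimal sketch: the scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, not values from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]

print(global_alignment_score("ACGT", "ACT"))
```

The table has one cell per pair of positions, so two sequences of length L need about L × L cells. Aligning N sequences exactly in the same way would need a table with N dimensions, which is why multiple alignment falls back on a heuristic built from pairwise steps.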
Structure-function relationships: Sequence patterns that predict function • A challenging area of computational molecular biology is the prediction of the function of protein molecules from their sequence. • Sequence determines 3-D structure, structure determines function • Identify conserved regions (domains or motifs) • Domain databases can be used to scan any unknown protein sequence
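Scanning a sequence for a known motif can be sketched with a regular expression. The pattern used here is the PROSITE N-glycosylation site pattern N-{P}-[ST]-{P}; the protein sequence and the function name are hypothetical, and a pattern match only suggests a possible site rather than demonstrating function.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P}, written as a regular expression.
# The lookahead lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")

def scan_motif(sequence, pattern=N_GLYCOSYLATION):
    """Return (1-based position, matched residues) for each motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence)]

# Hypothetical protein fragment, for illustration only
sites = scan_motif("MKNGSAANPTLNQTW")
print(sites)
```

The N at position 8 is skipped because it is followed by a proline, which the pattern excludes.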
PubMed • Project by NIH and NLM. • Search tool for accessing literature citations. • The PubMed search system covers the MedLine and PreMedLine databases, alongside the molecular biology databases indexed under Entrez.
MedLine • MedLine (MEDLARS onLINE) is NLM's premier bibliographic database. • Covers medicine, nursing, dentistry, veterinary medicine, the health care sciences and pre-clinical sciences. • Indexes over 3900 current biomedical journals published in the US and other countries.
MedLine • 9 million records. • Since 1966.
PreMedLine • Introduced in August 1996. • Provides basic citations and abstracts before the full records are prepared and added to MedLine.
MEDLINE SAMPLE RECORD
UI 98408838
AU Tao X, Dafu D
TI Relationship between synonymous codon usage and protein structure.
MH Codon*
MH Protein Folding*
MH Protein Structure, Secondary*
MH Proteins / genetics
……
AB The hypothesis that synonymous codon usage is related to protein three-dimensional structure is examined by …
PT Journal article
SO FEBS Lett 1998 Aug 28 : 434 (1-2) : 93-6
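Because each field in such a record starts with a two-letter tag, the record is straightforward to split into fields. The sketch below assumes one field per line and assumes that a trailing asterisk on an MH line marks a major topic of the article; the elided abstract is left out and the function name is hypothetical.

```python
from collections import defaultdict

SAMPLE_RECORD = """\
UI 98408838
AU Tao X, Dafu D
TI Relationship between synonymous codon usage and protein structure.
MH Codon*
MH Protein Folding*
MH Protein Structure, Secondary*
MH Proteins / genetics
PT Journal article
SO FEBS Lett 1998 Aug 28 : 434 (1-2) : 93-6
"""

def parse_record(text):
    """Group field values by tag; repeated tags such as MH become lists."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition(" ")
        fields[tag].append(value.strip())
    return dict(fields)

record = parse_record(SAMPLE_RECORD)

# Assumption: a trailing "*" flags a major MeSH topic
major_topics = [mh.rstrip("*") for mh in record["MH"] if mh.endswith("*")]
print(record["TI"][0])
print(major_topics)
```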
MEDLINE Indexing • MeSH Terms to LIMIT Retrieval: human, animal, male, female, age groups, organism, etc. • Publication Types (another way to LIMIT): review, clinical trial, letter, journal article, etc.
MEDLINE Subject Headings • Advantages of MeSH Terms • Represent a subject concept, so no term synonyms are needed • Find relevant articles on a search topic that may not be explicitly mentioned in a title or abstract • Focus search & be specific to eliminate irrelevant records • Increase search efficiency to save time … Get reliable results
Searching MEDLINE Subject Headings • Disadvantages of MeSH • Thesaurus terms may not cover all concepts, esp. jargon • Not every concept in an abstract or article gets a thesaurus term
MEDLINE Searching Search terms are combined with Boolean “OR” and “AND” .
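The effect of the two operators can be sketched with sets of record numbers: OR is a union and broadens retrieval, AND is an intersection and narrows it. The index below is a toy example with invented record numbers, not real MEDLINE data.

```python
# Toy index: search term -> set of (hypothetical) record numbers
toy_index = {
    "codon": {1, 2, 5},
    "protein folding": {2, 3, 5},
    "review": {3, 5},
}

# OR: records indexed under either term (union)
either = toy_index["codon"] | toy_index["protein folding"]

# AND: records indexed under both terms (intersection)
both = toy_index["codon"] & toy_index["protein folding"]

# Adding another AND concept narrows the result further
both_reviews = both & toy_index["review"]

print(sorted(either), sorted(both), sorted(both_reviews))
```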
Modifying Retrieval --- NOT ENOUGH Found • Reduce number of concepts to combine • Add synonyms or related terms • Use both free-text words & MeSH terms • Truncate free-text words as appropriate • Explode subject term, if it has narrower terms • Do NOT use limits (e.g., major point, review) • Consult a professional searcher … Librarian
Modifying Retrieval --- TOO Many Found • Use MeSH terms only … Use no free-text words • Use “MeSH Power” to Focus Your Search • Try a more specific MeSH term • Limit MeSH terms to MAJOR point of article • Use a Subheading with your MeSH term • Reduce number of synonyms, if free-text searching • Add additional concepts to your search • Use Limits … English language, reviews • Restrict to human, animal, or organism
What is the Internet? • A world wide collection of networks of computers • A network of computer networks • A network based on the TCP/IP protocol
Standalone Computer • A typical setup at home: PC, printer, speakers
LAN • A small Local Area Network of two computers and one printer in your office
The INTERNET • Campus Network • Wide Area Network • National Network • InterCountry Network • Global Network
What can you do with the Internet? INTERNET APPLICATIONS • Electronic Mail (Email) • Internet Talk/Chat (IRC) • File Transfer (FTP) • Remote Login (Telnet) • Internet News (Usenet) • Info retrieval (Gopher, World Wide Web) • Audio/Video Conferencing (CU-SeeMe, Mbone) • Internet Phone
FTP: File Transfer Protocol • ftp ncbi.nlm.nih.gov • login: anonymous • passwd: your email address • To ftp from a server on which you have an account, use your own login and passwd
FTP commands continued... • cd - change directory • ls - list files • pwd - print working directory • bin - transfer in binary mode • asc - transfer in ascii mode • hash - show transfer progress • lcd - local change directory
FTP commands continued... • prompt - toggle interactive prompting during multiple file transfers • mget - get multiple files (use get for a single file) • mput - put multiple files onto the server • put - put a single file onto the server
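The same anonymous session can also be scripted. The sketch below uses Python's standard `ftplib` module and mirrors the interactive commands (login, cd, ls, bin, get); the function name, its parameters and the placeholder email address are assumptions for illustration, and the directory and file name must be supplied by the caller.

```python
from ftplib import FTP

def fetch_file(host, remote_dir, filename, email="user@example.org"):
    """Log in anonymously, change directory, and download one file in binary mode.

    Returns the directory listing so the caller can see what else is there.
    """
    with FTP(host) as ftp:
        # Anonymous login: the email address stands in for a password
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(remote_dir)          # cd
        names = ftp.nlst()           # ls
        with open(filename, "wb") as out:
            # bin + get: retrieve the file as raw bytes
            ftp.retrbinary(f"RETR {filename}", out.write)
    return names
```

Nothing is transferred until the function is called with a host, a directory and a file name; on a server where you hold an account, your own login and password would replace the anonymous ones.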
Telnet • Work on another machine by remote login. • Telnet intron.bic.nus.edu.sg login: passwd: • Must have an account on the machine for doing telnet • Must have internet connection • Space allocated to you on the machine
What is Hypertext? • Non-Linear Text • Links embedded in the text • Jumps to other locations in the document/db • [Figure: in "the quick brown fox jumps over the fence", the word "fence" links to a separate "Fence" entry]
Creating a Web Page • Terms to Know • WWW/Web: World Wide Web • HTML: Hyper Text Mark-up Language • URL: Uniform Resource Locator • I assume that you: • know how to use Netscape or some other Web browser • have access to a Web server (or want to produce HTML documents for personal use in local-viewing mode)