Ollie Bridle BSc. Hons., MA., MPhil. oliver.bridle@ouls.ox.ac.uk May 2008 WISER: Bioinformatics: sources for research in biology
Outline • Introduction. • Information sources in biology and associated problems. • What is bioinformatics? • DNA databases. • Entrez. (+ exercise) • Summary.
Aims • Convince you that these bioinformatics resources are valuable for research. • Give you some important searching strategies. • Show you how to find what you want. • Suggest other resources and further help.
What I won’t Cover • All the resources available. • Commercial software. • Huge amounts of scientific detail. • Bibliographic and abstract databases • Check out some of the other WISER sessions.
About Me… • Trainee librarian. • Formerly a biologist - degrees in Microbiology (BSc) and Microbial Genetics (MPhil). • Much less familiar with animal and population genetics…but… • As far as searching databases goes, similar principles apply.
Information Sources for Research - Key Questions • What is available? • Where do I find it? • How do I search it?
Problems with Biological Data • Data collection. • The base of information is large, expanding and diverse. • Organisation and accessibility. • Requirement for special search techniques. You can’t Google a DNA sequence…yet! • A student/researcher wants the right information quickly!!!
The Good News • Large projects working to organise this information. • Much is freely available over the internet. • University subscribes to many e-journals and bibliographic databases available through Oxlip.
A Definition of Bioinformatics ‘…information technology applied to the management and analysis of biological data’ (Attwood, T. K) A multidisciplinary subject.
Bioinformatics aims to… • Collect, • Organise, • Store, • Retrieve, • Analyse, ….biological data with the use of computers.
What is a DNA Sequence? The DNA double helix is made up of a series of chemical bases strung along a sugar backbone. There are 4 bases, usually represented by the letters A, T, C and G. The linear sequence in which these bases occur determines all the instructions for building an organism.
What is a Protein Sequence? Proteins are complex molecules which control most aspects of cell biology. They are constructed of small subunits called amino acids, of which there are 20 types. Proteins are assembled by ‘reading’ (or translating) the DNA sequence: every set of 3 bases (e.g. ATG) corresponds to an amino acid or to a stop signal. So a protein is built up one amino acid at a time according to the DNA blueprint.
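The ‘reading’ step described on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code and a sequence that already starts in the correct reading frame; the names `CODON_TABLE` and `translate` are invented for this example and do not belong to any tool mentioned in the session.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string three bases at a time, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing anything other than A, T, C or G become 'X' (unknown).
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # MA
```

Each letter in the output is the one-letter code for an amino acid, so `ATG GCC TAA` reads as methionine, alanine, stop.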
In Summary… DNA Molecule → DNA Sequence → Proteins → Complete Organism
Looking at DNA sequences I • Analysis of DNA or protein sequences is a frequent requirement of research. • Locating genes within a sequence. • Comparing two sequences for similarity. • Searching for similar genes (orthologues) in other organisms.
Looking at DNA sequences II DNA sequences are easily stored, retrieved, compared and manipulated on computers. Just represent each base as a letter! Computers can compare two or more sequences and find similar regions. Much analysis of genetic information now takes place in silico.
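As a rough illustration of how a computer can find a similar region in two sequences, the sketch below uses `difflib.SequenceMatcher` from the Python standard library to locate the longest run of bases that two strings share exactly. This is a deliberately simple stand-in, not the method used by real alignment tools, which also tolerate mismatches and gaps; the function name `longest_shared_run` and the second test sequence are made up for the example.

```python
from difflib import SequenceMatcher


def longest_shared_run(seq_a, seq_b):
    """Return (start_in_a, start_in_b, shared_text) for the longest exact common run."""
    # autojunk=False stops difflib from ignoring very frequent letters in long strings.
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    match = matcher.find_longest_match(0, len(seq_a), 0, len(seq_b))
    return match.a, match.b, seq_a[match.a:match.a + match.size]


first = "GTCGACGCGCAAATGG"   # the opening bases of a longer sequence
second = "TTTACGCGCAAAGG"    # a hypothetical second sequence
print(longest_shared_run(first, second))  # (4, 3, 'ACGCGCAAA')
```

The positions are counted from zero, so the shared run starts at the fifth base of the first sequence and the fourth base of the second.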
Looking at DNA Sequences III DNA sequences can be determined experimentally. Software allows biologists to construct and view maps of DNA sequence. The DNA code of ATCG gets transformed into something much more human friendly. Artemis is one available map viewer.
DNA Databases Free access to vast numbers of sequences deposited by researchers all over the world. Used alongside scientific papers. Can be searched or ‘mined’ in a variety of ways.
Global Bioinformatics Agencies International Nucleotide Sequence Database Collaboration: • DNA Data Bank of Japan • European Molecular Biology Laboratory • National Center for Biotechnology Information
NCBI and GenBank GenBank is NCBI’s DNA database. Extensive search and deposit capabilities. 606 sequences
A Practical Example A researcher might start with a piece of DNA rather than a literature citation. Here we will – Search a DNA database using a piece of DNA sequence. Use the results of the search to identify relevant literature.
The Experiment 1) Grow some bugs. 2) Extract the DNA. 3) Amplify up the desired section of DNA. 4) Generate sequence.
A DNA Sequence The following sequence is in FASTA format.
>G08_CHEV11Fed.seq
GTCGACGCGCAAATGGTTCTATATCCATACCAATAGCAGTATCGTTGCCA
TTATCACGAATGGAATTAAGTAAAGTTTTCATTCTATCAATAGACTCTAA
AACCACATCCATGATATCTGGAGTTATTTTTAACTCGCCATGTCTTGCTT
TGTTTAAAACATCCTCCATGTGGTGAGTTAACTTTGTTAAAACATCAAAA
TTTAAGAAGCTTGATGATCCTTTAACCGTATGTGCAACACGGAAAATTCT
ATTTAATAATTCTAAATCTTCTGGATTTGATTCAAGCTCTACTAAATCAT
GGTCGATTTGCTCAACAAGCTCAAAAGCTTCAACCAAAAAGTCTTCAAGT
ATTTCTTGCATATCTTCCATATTTTACCCCTGTTCTTGAGATTGATGTTT
TTTAATAACCTTTGCAATTTCATTGAAGAAATCGCTAGCGTTAAATTTGA
CAAGATAGCCTTCTCCACCAGCTTCTTGAACACCTTTCTCATTCATAAAT
TCATTTGATAAAGATGAGTTAAAGACTATAGGAATATCTTTAAATCCGGG
ATCTTCTTTAATGCGTGCAGCGGATCCCGGGTACCTGCAGAATTCAGCTG
CGCCCTTTAGTTCCTAAAGGGTTTTTATCAGTGCGACAAACTGGGATTTT
ATTTATTCAGCAAGTCTTGTAATTCATCCAAAAAACGGCAAACATGAAAG
CCGTCACAAACGGCATGATGCACTTGAATCGATAAGGGAATATAGTATTT
TCCGCCCTCCTCATAATACTTCCCAAACGTAAATATCGGCAGTAGATAGT
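A FASTA file is plain text: a header line starting with `>` followed by one or more lines of sequence. The sketch below shows one way such text could be read in Python. It is a minimal reader written for illustration only (the function name `parse_fasta` is an assumption, and established libraries handle many more edge cases); the sample text uses the header and the first 100 bases of the sequence on this slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """>G08_CHEV11Fed.seq
GTCGACGCGCAAATGGTTCTATATCCATACCAATAGCAGTATCGTTGCCA
TTATCACGAATGGAATTAAGTAAAGTTTTCATTCTATCAATAGACTCTAA
"""

for name, sequence in parse_fasta(sample):
    print(name, len(sequence))  # G08_CHEV11Fed.seq 100
```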
A BLAST Search Basic Local Alignment Search Tool. Aimed at finding highly similar sequences in the database. Let’s see how to submit a sequence query to the GenBank database.
BLAST Search Screen Enter sequence. Select database. Select BLAST type.
The Statistics • Guidelines for evaluating stats (data from ‘Introduction to Bioinformatics’, Lesk, A, OUP (2005)) • E ≤ 0.02 – Sequences probably homologous (i.e. derived from a common ancestor). • E between 0.02 and 1 – Homology unproven but can’t be ruled out. • E > 1 – A match this good would be expected by chance. • Putting the amino acid sequence NELLYTHEELEPHANT into a BLAST protein search produces results! • Best match E value = 9
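The guideline thresholds on this slide can be written down as a small helper, which may be handy when scanning a long list of hits. This is a sketch of the rule of thumb quoted from Lesk only; the function name `interpret_e_value` is invented here, and treating an E value of exactly 1 as belonging to the middle band is an assumption, since the slide does not say which side the boundary falls on.

```python
def interpret_e_value(e_value):
    """Map a BLAST E value onto the three guideline bands quoted on the slide."""
    if e_value < 0:
        raise ValueError("E values cannot be negative")
    if e_value <= 0.02:
        return "probably homologous"
    if e_value <= 1:  # assumption: exactly 1 falls in the middle band
        return "homology unproven but cannot be ruled out"
    return "a match this good would be expected by chance"


# The NELLYTHEELEPHANT search on the slide gave a best E value of 9.
print(interpret_e_value(9))  # a match this good would be expected by chance
```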
BLAST Results II Two possible matches.
BLAST Results III Literature references allow us to go straight to citations in PubMed relevant to the sequence we have found. Here is the name of the gene!
Evaluating the Data • There are errors in these databases! Is a BLAST search appropriate? Should I cross reference? What is the source of this sequence? What are the statistics telling me?
Using Accession Numbers Papers often contain accession numbers. No database submission = No publication. Using HTML versions of papers you can link directly to the gene or protein sequence. Here’s one I made earlier….
Exploring Further Start with a completely unknown sequence. Searching for ‘CheV’ in WOS will not bring up all the relevant papers. Starting from a DNA sequence you have a new way to search. ‘Having a BLAST with bioinformatics (and avoiding BLASTphemy)’, A. Pertsemlidis and J. W. Fondon III. Genome Biology (2001), 2(10), pp. 1-10
Structure of Entrez Powerful resource for research. Entrez is a cross-database search engine. Records are cross referenced and linked.
Single Keyword Search • Type keyword into the search box and click ‘GO’ • The number of hits for the search term is shown by each database. • Single keyword searches are limited. • Advanced search techniques refine results and produce fewer irrelevant hits.
Using Boolean Operators • Boolean operators and phrases build complex searches. • Use AND, OR and NOT to join terms. Chemotaxis AND “Campylobacter jejuni” • Use UPPERCASE for the operators. • A phrase is enclosed in quotation marks. “Protein glycosylation”
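The rules on this slide (uppercase operators, quotation marks around phrases) are easy to apply mechanically. The sketch below builds the example query and, as an aside, the matching URL for NCBI's E-utilities `esearch` service, which is the programmatic counterpart of the Entrez search box. The session itself uses the web page, so the E-utilities step is an addition for illustration; the helper names `as_term`, `combine` and `esearch_url` are invented for this example, and nothing is sent over the network.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
OPERATORS = {"AND", "OR", "NOT"}


def as_term(text):
    """Put multi-word terms in double quotes so they are searched as a phrase."""
    return f'"{text}"' if " " in text else text


def combine(operator, left, right):
    """Join two search terms with an uppercase Boolean operator."""
    operator = operator.upper()
    if operator not in OPERATORS:
        raise ValueError(f"unknown operator: {operator}")
    return f"{as_term(left)} {operator} {as_term(right)}"


def esearch_url(database, query):
    """Build (but do not fetch) an E-utilities search URL for the query."""
    return ESEARCH_URL + "?" + urlencode({"db": database, "term": query})


query = combine("and", "Chemotaxis", "Campylobacter jejuni")
print(query)  # Chemotaxis AND "Campylobacter jejuni"
print(esearch_url("pubmed", query))
```

Typing the printed query into the Entrez search box is equivalent to the slide's example.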
Your Turn! • A little practice using Entrez. • Follow the instructions on the handout. • Shout if you have problems. 10 Minutes
Notes on the Exercise Using brackets with Boolean operators refines search results. Care with placing brackets is essential! The clipboard is helpful for recording results of searches.
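Why bracket placement matters can be seen with a toy model in which each search term matches a set of record numbers, AND is set intersection and OR is set union. The record numbers and the third search term below are entirely hypothetical and chosen only to make the difference visible; this models the logic, not how Entrez works internally.

```python
# Hypothetical record numbers matched by each search term.
chemotaxis = {1, 2, 3}
campylobacter = {3, 4}
helicobacter = {2, 5}

# (Campylobacter OR Helicobacter) AND Chemotaxis
brackets_first = (campylobacter | helicobacter) & chemotaxis

# Campylobacter OR (Helicobacter AND Chemotaxis)
brackets_last = campylobacter | (helicobacter & chemotaxis)

print(sorted(brackets_first))  # [2, 3]
print(sorted(brackets_last))   # [2, 3, 4]
```

The same three terms return different records: the second grouping lets in record 4, which mentions Campylobacter but has nothing to do with chemotaxis.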
Refining Searches and Setting Limits. • Within an individual database results may be further refined by setting limits. • The number and type of limits will depend on the database. • Click the ‘limits’ tab from within one of the databases.
Steps in Setting a Limit • Select a field to limit the search by. • Type in the limiting term in the search box. • Select other limiting options e.g. – • Publication date. • Database. • Hit ‘GO’ to retrieve the results.
Using the History • The history keeps track of previous searches. • You can combine searches and limits quickly and easily. • You can isolate records matching very specific criteria. • A demonstration....
Jumping Between Databases • Records in Entrez are extensively cross linked. • The ‘links’ hyperlink next to each record lets you jump between databases.
Entrez in Summary • We’ve looked at – • Simple and advanced searching. • Accessing and moving between records. • Using the clipboard. • Setting limits. • Using the history. • Sorting results.
Evaluating Entrez I Advantages Quickly cross reference many databases. Elaborate searches can be constructed within each database. Tools to save and modify searches. Pools many resources.
Evaluating Entrez II Disadvantages Can return many irrelevant results. Syntax for advanced searching is complicated (many databases = many fields). Doesn't cover everything!
Summary • Bioinformatics resources help collect, organise and analyse biological data. • Essential resources for biology research. • Bioinformatics databases can be searched in unique ways. • Entrez provides a powerful cross-database searching tool. • Many more resources out there!
And Finally… Thanks for listening! Any Questions?