
Introduction to Bioinformatics


hubert

Presentation Transcript


  1. Introduction to Bioinformatics BLAST

  2. BLAST • Introduction • What is BLAST? • Query Sequence Formats • What does BLAST tell you? • Choices • Variety of BLAST • BLAST Programs: Which One to Use? • Commonly Used BLAST programs • BLAST Databases: Which One to Search? • Understanding the Output • Database Search with BLAST • BLAST Steps – How It Works • Acknowledgement: This presentation includes adaptations from NCBI’s Introduction to Molecular Biology Information Resources modules.

  3. What is BLAST? • Basic Local Alignment Search Tool • The Google™ of bioinformatics • Query is a DNA or protein sequence, not a text term • Character string comparison against all the sequences in the target database • Rigorous statistics used to identify statistically significant matches

  4. Query Sequence Formats • Bare sequence (plain residues, or with position numbers and spaces)
        QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
        KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
        VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
        FLFLIKHNPTNTIVYFGRYWSP
          1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
         61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
        121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
        181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp
     • Identifiers: accession, accession.version, or GI numbers (e.g., p01013, AAA68881.1, 129295, gi|129295) • FASTA format

  5. Query Sequence in FASTA Format • FASTA definition line ("def line") that begins with a >, followed by some text that briefly describes the query sequence on a single line • Up to 80 nucleotide bases or amino acids per line • Blank lines not allowed in the middle • Example
        >gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
        QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
        KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
        VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
        FLFLIKHNPTNTIVYFGRYWSP
     • Additional information
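
The FASTA layout described on this slide can be read with a few lines of Python. The sketch below is only an illustration: the function name `parse_fasta` is made up for this example, and it silently skips blank lines rather than rejecting them, which is more lenient than the rule on the slide.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (def_line, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: tolerate stray blank lines in this sketch
        if line.startswith(">"):
            # A new def line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
"""
records = parse_fasta(example)
def_line, sequence = records[0]
print(def_line)
print(len(sequence), "residues")
```

The def line is returned without the leading `>`, and the sequence lines are joined into one string so that it can be passed on as a bare sequence.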

  6. What does BLAST tell you? • Putative identity and function of your query sequence • Helps to direct experimental design to prove the function • Find similar sequences in model organisms (e.g., yeast, C. elegans, mouse), which can be used to further study the gene • Compare complete genomes against each other to identify similarities and differences among organisms

  7. Variety of BLASTs: http://www.ncbi.nlm.nih.gov/BLAST/

  8. BLAST Programs: Which One to Use? Depends on: • What type of query sequence you have (nucleotide or protein) • What type of database you will search against (nucleotide or protein) • BLAST program descriptions • brief list • BLAST program selection guide

  9. Commonly Used BLAST Programs • Examples of BLAST programs • BLASTN • Nucleic acids against nucleic acids • BLASTP • Protein query against protein database • Usually better to use than nucleotide-nucleotide BLAST • Since the genetic code is degenerate, blastn can often give less specific results than blastp • But what if we don't have a protein query sequence? What are our options? • BLASTX • Translated nucleic acids against protein database • One way to do a protein BLAST search if you have a nucleotide query sequence • The BLAST program does the translating for you, in all 6 reading frames
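
The six-frame translation that BLASTX performs on a nucleotide query can be sketched as follows. This is a minimal illustration, not NCBI's implementation: it assumes the standard genetic code, marks codons it cannot look up as `X`, and the helper names (`translate`, `six_frame_translations`) are made up for this example.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for codons with ambiguous bases
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate both strands at offsets 0, 1 and 2."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translated strings can then be treated as a protein query, which is why a nucleotide sequence can still be searched against a protein database.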

  10. BLAST Databases: Which One to Search? What type of data do you want to search against? For example: • Characterized sequences? • Specialized sequences? • Complete genomes or chromosomes? • BLAST database descriptions are available in the: • BLAST help document • BLAST program selection guide

  11. Request ID: RID • An RID is like a ticket number that allows you to retrieve your search results and format them in many different ways over the next 24 hours. • If you've saved RIDs from your recent searches, you can enter the RIDs directly using the Retrieve results with a Request ID page, which is accessible from the bottom of the BLAST home page

  12. Search Results: Understanding the Output • Reference to BLAST paper • Reminders about your specific query • RID • query sequence reminder (contains the information from your FASTA def line) • what database you searched against • Graphical summary • shows where the hits aligned to your query • colors indicate score range • mouse over a colored bar to see info about that hit • Text summary (GI numbers and Def lines) • GI links to complete record in Entrez • Score links to pairwise alignment between your query sequence and the hit • Pairwise alignments • BLAST statistics for your search

  13. Database Search w/ BLAST • Primary use of bioinformatics • Finding similar sequences • BLAST • Acknowledgement: Slides 15 – 19 are adapted, with permission, from lecture notes of Professor Chau-Wen Tseng of the CS Department at the University of Maryland.

  14. Database Search w/ BLAST • Set up format options and click the Format button (screenshot callouts: RID; "Click button!")

  15. Database Search w/ BLAST • Versions of BLAST • BLASTN • Nucleic acids against nucleic acids • BLASTP • Protein query against protein database • BLASTX • Translated nucleic acids against protein database • TBLASTN • Protein query against translated nucleic acid database • TBLASTX • Translated nucleic acids against translated nucleic acids
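
The choice among these five programs depends only on the query type, the database type, and whether both sides are translated, so it can be written as a small lookup. The function `choose_program` and its argument names are hypothetical and exist only for this sketch; they are not part of any BLAST interface.

```python
def choose_program(query_type, database_type, translate_both=False):
    """Return the BLAST program name for a query/database combination.

    query_type and database_type are "nucleotide" or "protein".
    translate_both=True asks for a translated-vs-translated search (tblastx).
    """
    table = {
        ("nucleotide", "nucleotide"): "blastn",
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # query translated in 6 frames
        ("protein", "nucleotide"): "tblastn",  # database translated in 6 frames
    }
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    return table[(query_type, database_type)]


print(choose_program("nucleotide", "protein"))
print(choose_program("nucleotide", "nucleotide", translate_both=True))
```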

  16. Database Search w/ BLAST

  17. Database Search w/ BLAST • BLAST graphic result

  18. Database Search w/ BLAST • BLAST result • Matching sequences w/ bit-score & E-value • Hyperlinks to database entry for sequence • Example
        Hyperlinks to sequences                  Bit Score   E-value
        gi|17330420|gb|BH384278.1|BH384278...    153         3e-36
        gi|17320126|gb|BH373984.1|BH373984...    140         9e-34
        gi|17338337|gb|BH392196.1|BH392196...    112         8e-25
        gi|20373967|gb|BH771010.1|BH771010...    105         1e-21
        gi|17314411|gb|BH368367.1|BH368367...    104         2e-21
        gi|17332712|gb|BH386570.1|BH386570...    64          3e-21

  19. BLAST – Statistical Evaluation • E Value • The number of different alignments with scores equal to or better than the observed alignment score that are expected to occur in a database search by chance. • The lower the E value, the more significant the score.

  20. BLAST – How It Works • Find high-scoring local alignments between query sequence and target database • Assumption • True matches are very likely to contain short stretches (words) of very high-scoring matches • Steps • Seeding • Searching • Extension • Evaluation

  21. BLAST Steps • Seeding • For each word of length w in the query (w-mer), generate a list of all possible words (neighbors) that score at least threshold T when aligned with it (scores come from the scoring matrix) • Default • w = 3 for protein • w = 11 for DNA

  22. Query word (w = 3) • This example uses BLOSUM 62.
        Query:      GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
        Query word: PQG
        Neighborhood words (score): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
        Neighborhood score threshold (T = 13)
        Below the threshold: PQA 12, PQN 12, …
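
The seeding step for the query word PQG can be reproduced with a brute-force enumeration. The sketch below assumes the standard BLOSUM62 values and, to stay short, hard-codes only the three matrix rows this one word needs (P, Q and G). The function names are made up for this example, and real BLAST builds its word list far more efficiently.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for P, Q and G only; columns follow AMINO_ACIDS order.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def pair_score(query_residue, other_residue):
    """Substitution score of one query residue against any residue."""
    return BLOSUM62_ROWS[query_residue][AMINO_ACIDS.index(other_residue)]


def neighborhood(word, threshold):
    """Return {neighbor: score} for every w-mer scoring >= threshold against word."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(pair_score(q, c) for q, c in zip(word, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    # Highest-scoring neighbors first, as on the slide.
    return dict(sorted(hits.items(), key=lambda item: -item[1]))


for neighbor, score in neighborhood("PQG", 13).items():
    print(neighbor, score)
```

With T = 13 the enumeration keeps the nine words listed on the slide, while PQA and PQN score 12 and are dropped. Lowering T enlarges the list, which increases sensitivity at the cost of more word hits to process.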

  23. BLOSUM 62

  24. BLAST Steps • Searching • Determine the locations of all common “words” between the query and the database (“word hits”) • Identifies all word hits

  25. Query word (w = 3) • Hit
        Query:      GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
        Query word: PQG
        Neighborhood words (score): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
        Neighborhood score threshold (T = 13)
        Below the threshold: PQA 12, PQN 12, …
        Hit:
        Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
        Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
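
The searching step amounts to scanning each database sequence for any word in the neighborhood list. A minimal sketch, using the neighborhood words and the subject sequence from the slide; the function name `find_word_hits` is made up for this example, and a real implementation would index the words of the whole query rather than a single one.

```python
# Neighborhood of the query word PQG at T = 13, copied from the slide.
NEIGHBORHOOD = {
    "PQG": 18, "PEG": 15, "PRG": 14, "PKG": 14, "PNG": 13,
    "PDG": 13, "PHG": 13, "PMG": 13, "PSG": 13,
}

SUBJECT = "TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA"


def find_word_hits(subject, words, w=3):
    """Return (position, word, score) for every neighborhood word found in subject."""
    hits = []
    for pos in range(len(subject) - w + 1):
        word = subject[pos:pos + w]
        if word in words:  # dictionary lookup keeps the scan linear in the subject length
            hits.append((pos, word, words[word]))
    return hits


print(find_word_hits(SUBJECT, NEIGHBORHOOD))
```

The only hit is PMG at 0-based position 11 of the subject, which lines up with PQG in the query segment shown on the slide. That hit is the seed handed to the extension step.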

  26. BLAST Steps • Extension • Extend hits to find HSPs (high-scoring segment pairs) that have scores higher than a threshold • Introduce gaps using dynamic programming • Problem of extension • Time-consuming to find the highest score • Solution (heuristic) • Extend until the score drops by X below the best score seen so far • Example (Match = 1, Mismatch = -1, X = 5)
        Query:          ABCDEFGHIJKLMNOPQRST
                        ||||||  |||||    |
        Subject:        ABCDEFZYIJKLMXWVURAB
        Score:          12345654567898765654
        Drop-off score: 00000012100001234345
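
The drop-off heuristic can be sketched for the ungapped case as below. It uses the slide's scoring (match = 1, mismatch = -1, X = 5) and the two example strings, with the subject's 18th letter read as R, which is what the slide's match line and score row imply. The function name `extend_right` is made up for this example, and it extends in one direction only, whereas BLAST extends a seed both ways.

```python
def extend_right(query, subject, match=1, mismatch=-1, x_drop=5):
    """Ungapped extension to the right with an X-drop stopping rule.

    Returns (best_score, best_length): the highest running score seen and
    the number of aligned columns that achieve it.
    """
    score = best_score = best_length = 0
    for column, (q, s) in enumerate(zip(query, subject), start=1):
        score += match if q == s else mismatch
        if score > best_score:
            best_score, best_length = score, column
        elif best_score - score >= x_drop:
            break  # the score has fallen X below the best: stop extending
    return best_score, best_length


best_score, best_length = extend_right("ABCDEFGHIJKLMNOPQRST",
                                       "ABCDEFZYIJKLMXWVURAB")
print(best_score, best_length)
```

The extension is trimmed back to the column with the best score, so the reported segment ends at column 13 with score 9 even though the scan continued past it. A larger X tolerates longer poor stretches before giving up, trading speed for sensitivity.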

  27. Query word (w = 3) • Hit, extended
        Query:      GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
        Query word: PQG
        Neighborhood words (score): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
        Neighborhood score threshold (T = 13)
        Below the threshold: PQA 12, PQN 12, …
        Hit:
        Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
                 +LA++L+   TP G R++ +W+  P+ D   + ER   + A
        Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA

  28. BLAST Steps • Evaluation • Maximal segment pairs (MSPs) – maximum-scoring HSPs • Evaluate the statistical significance of extended hits (HSPs) • Report only those above the determined threshold (MSPs)

  29. BLAST – Statistical Evaluation • For local, ungapped alignments: • m: size of query • n: size of database • E: expected number of HSPs with scores of at least S • p: probability of finding at least one HSP with a score of at least S • Good tutorial at: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
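
The formulas relating these quantities did not survive in the transcript. The sketch below assumes the standard Karlin-Altschul forms given in the linked tutorial: E = K·m·n·e^(-λS) and p = 1 - e^(-E), with the bit score S' = (λS - ln K) / ln 2 so that E = m·n·2^(-S'). The function names are made up for this example, and the values of K and λ are illustrative placeholders, since the real ones depend on the scoring system.

```python
import math


def expected_hits(raw_score, m, n, K, lam):
    """E: expected number of HSPs scoring at least raw_score by chance."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalized score that no longer depends on K and lambda."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expected_hits_from_bits(bits, m, n):
    """The same E value computed from a bit score."""
    return m * n * 2.0 ** (-bits)


def p_value(e_value):
    """p: probability of at least one HSP scoring that high by chance."""
    return -math.expm1(-e_value)  # 1 - exp(-E), accurate for tiny E


# Illustrative placeholders, not taken from the slides.
K, LAM = 0.134, 0.318
m, n = 250, 1_000_000  # query length, total database length

for raw_score in (50, 100, 150):
    e = expected_hits(raw_score, m, n, K, LAM)
    print(raw_score, round(bit_score(raw_score, K, LAM), 1), e, p_value(e))
```

For small E the two numbers are nearly equal (p ≈ E), which is why BLAST output can report E values alone. E also grows linearly with database size n, so the same alignment looks less significant against a larger database.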

  30. Interpretations of Expected Value • Expected value ranges • E < 10^-100 → very low, homologs or identical genes • E < 10^-3 → moderate, may be related genes • E > 1 → high, probably unrelated • 0.5 < E < 1 → ??? In the “twilight zone”; try a more detailed search • If database search • Long list of gradually declining E values → large gene family • Long regions of moderate similarity → more significant than short regions of high identity • Biological relevance • Still need to determine biological significance!
