
Course Module: Genomics and Personalized Care, Lecture 2: BLAST, UCSC Genome Browser, FlyBase

wilhelminaa

Presentation Transcript


  1. Course Module: Genomics and Personalized Care, Lecture 2: BLAST, UCSC Genome Browser, FlyBase

  2. Pairwise Local Alignment • Pairwise local sequence alignment: identify similar segments in two sequences • The Smith-Waterman algorithm (a dynamic programming algorithm) is guaranteed to find optimal alignments, but it is computationally expensive. • BLAST is a heuristic approximation to local alignment; it runs much faster than the Smith-Waterman algorithm while retaining most of the sensitivity of the search
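To make the dynamic programming idea concrete, here is a minimal sketch of Smith-Waterman local alignment. The scoring values (match +2, mismatch -1, linear gap penalty -2) are illustrative assumptions rather than values from the lecture; real searches use a substitution matrix and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring parameters are illustrative assumptions (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```

The table has (len(a)+1) x (len(b)+1) cells, so time and memory grow with the product of the two sequence lengths, which is why this exact method is expensive for database searches.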

  3. BLAST • BLAST [Basic Local Alignment Search Tool] is a sequence comparison algorithm, optimized for speed, used to search sequence databases for optimal local alignments to a query • It is one of the most widely used and referenced computational biology resources • The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length W with a score of at least T when compared to the query using a substitution matrix • Word hits are then extended in both directions to generate an alignment with a score exceeding a given threshold S

  4. BLAST Algorithm • Filter out low-complexity regions • Locate words of a fixed size in the query sequence • Scan the sequence database for entries that match the words in the query sequence • If there is a hit (i.e. a match between a word in the query and a word in the database entry), extend the hit in both directions. Keep track of the score and stop the extension when the score drops below a threshold
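The steps above can be sketched as a toy seed-and-extend search. This is a simplified illustration, not the BLAST implementation: it skips low-complexity filtering, requires exact word matches, extends without gaps, and stops extending once the running score falls more than `xdrop` below the best score seen (one common way to read "drops below a threshold"). The function names and the scoring values are assumptions made for this example.

```python
from collections import defaultdict


def index_words(query, w):
    """Map every word of length w in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def extend_hit(query, subject, qpos, spos, w, match=1, mismatch=-3, xdrop=5):
    """Ungapped extension of a word hit; returns (score, q_start, s_start, length)."""
    score = best = w * match
    q_start, q_end = qpos, qpos + w          # q_end is exclusive

    # Extend to the right of the word
    qi, si = qpos + w, spos + w
    while qi < len(query) and si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        qi += 1; si += 1
        if score > best:
            best, q_end = score, qi
        elif best - score > xdrop:
            break

    # Extend to the left of the word, starting from the best score so far
    score = best
    qi, si = qpos - 1, spos - 1
    while qi >= 0 and si >= 0:
        score += match if query[qi] == subject[si] else mismatch
        if score > best:
            best, q_start = score, qi
        elif best - score > xdrop:
            break
        qi -= 1; si -= 1

    s_start = spos - (qpos - q_start)
    return best, q_start, s_start, q_end - q_start


def seed_and_extend(query, subject, w=4, **scoring):
    """Scan the subject for query words and extend each hit in both directions."""
    index = index_words(query, w)
    hits = set()                              # overlapping seeds give the same segment
    for s in range(len(subject) - w + 1):
        for q in index.get(subject[s:s + w], ()):
            hits.add(extend_hit(query, subject, q, s, w, **scoring))
    return sorted(hits, reverse=True)


print(seed_and_extend("TTGACCATGCAA", "GGGGACCATGCCCC"))
```

In this example five different seed words fall inside the same shared segment, and each one extends to the same 8-residue match, so only one hit is reported.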

  5. Word Size • The initial search is done for a word of length W • Default values: • Protein sequence search: W = 3 • Nucleotide sequence search: W = 11 • Highly similar nucleotide sequence: W=28 • Each word in the query sequence index is compared to the database index and residue pairs are scored
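As a small illustration of how word size affects the initial search, the sketch below lists the overlapping words a query yields for each default W quoted on this slide. The dictionary keys and the function name are labels chosen for this example.

```python
# Default word sizes from the slide; the key names are labels for this example
DEFAULT_WORD_SIZE = {
    "protein": 3,
    "nucleotide": 11,
    "highly_similar_nucleotide": 28,
}


def query_words(sequence, search_type):
    """Return all overlapping words of the default length for this search type."""
    w = DEFAULT_WORD_SIZE[search_type]
    # A sequence of length L yields L - W + 1 words, or none if L < W
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


print(query_words("MKVLAT", "protein"))              # 4 words of length 3
print(len(query_words("ACGT" * 5, "nucleotide")))    # 20 - 11 + 1 words
```

A query shorter than W produces no words at all, which is one reason a smaller word size is needed for very short queries.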

  6. Four Steps of a BLAST search • Enter query sequence • Select one BLAST program • Choose the database to search • Set optional parameters

  7. Enter Query Sequence • A sequence can be pasted into a text field in FASTA format or as an accession number • A sequence or a sequence list can also be uploaded as a file • Users may indicate a range of the query sequence instead of using the whole query sequence • You may enter a descriptive title for your BLAST search
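For readers unfamiliar with FASTA format, here is a minimal sketch of reading FASTA text and taking a subrange of a sequence. The function names are hypothetical, and the subrange is assumed to use 1-based, inclusive coordinates.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subrange(sequence, start, end):
    """Return residues start..end (assumed 1-based, inclusive)."""
    return sequence[start - 1:end]


example = ">seq1 demo\nACGT\nTTAA\n>seq2\nGGCC\n"
print(parse_fasta(example))
print(subrange("ACGTTTAA", 2, 4))
```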

  8. Align Two or More Sequences • You may provide two or more sequences and perform a pairwise BLAST search

  9. Select a BLAST Program • BLAST Programs: • BLASTN: DNA query sequence against a DNA database • BLASTP: protein query sequence against a protein database • BLASTX: DNA query sequence, translated into all six reading frames, against a protein database • TBLASTN: protein query sequence against a DNA database, translated into all six reading frames • TBLASTX: DNA query sequence, translated into all six reading frames, against a DNA database, translated into all six reading frames • Choose the right one according to the sequence you have and your purpose of the search
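The phrase "translated into all six reading frames" can be illustrated with a short sketch: three frames on the given strand and three on its reverse complement. It assumes the standard genetic code, marks stop codons with `*`, and uses frame labels (`+1` to `-3`) chosen for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Return the six translations keyed by frame label (+1..+3, -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translate("ATGGCC"))
```

BLASTX applies this kind of translation to the query, TBLASTN to the database, and TBLASTX to both, which is why the translated searches cost more than BLASTN or BLASTP.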

  10. Choose the Database to Search • BLASTN

  11. Optional Parameters • Specify the organism to search or exclude • Common name, taxonomy ID, … • Exclude certain sequences • Exclude predicted sequences or sequences from metagenomics • Use an Entrez query to select a subset of the BLAST database (page 93)

  12. Algorithm Parameters Optional Parameters

  13. Algorithm Parameters • Expect value • Word size • Filtering/masking • Substitution matrix

  14. BLASTN Algorithm Parameters

  15. Expect Value

  16. Expect Value

  17. Expect Value • It is important to assess the statistical significance of search results. • For local alignments, the scores follow an extreme value distribution • The expect value (E value) is the number of matches with a score of at least S expected to occur by chance • The lower the E value, the more significant the match. • $E = K m n \, e^{-\lambda S}$ • K: a parameter whose value depends on the substitution matrix used, adjusted for search space size • m, n: lengths of the query and database sequences • λ: a statistical parameter used as a natural scale for the scoring system • S: alignment score
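A short numerical sketch of the E value formula follows. The values of K and λ used here are placeholders chosen for illustration only; in practice they depend on the substitution matrix and gap penalties.

```python
import math


def expect_value(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)


# Placeholder parameters (assumptions, not values from the lecture)
K, LAM = 0.1, 0.3
m, n = 250, 10_000_000   # query length, total database length

for S in (20, 40, 60, 80):
    print(S, expect_value(S, m, n, K, LAM))

# Doubling the database length doubles E for the same score
print(expect_value(60, m, 2 * n, K, LAM) / expect_value(60, m, n, K, LAM))
```

Each additional point of score multiplies E by the constant factor e^(-λ), and E scales linearly with the size of the search space m·n.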

  18. More about E Value • The value of E decreases exponentially with increasing alignment score S (higher S values correspond to better alignments). Very high scores correspond to very low E values. • For E=1, one match with a similar score is expected to occur by chance. • For a much larger or smaller database, you would expect E to vary accordingly

  19. Why Set Expect Threshold to 1000 • When you perform a search with a short query (e.g. 9 amino acids), there are not enough residues to accumulate a big score (or a small E value). • A match of 9 out of 9 residues could yield a small score with an E value of 100 or 200, and yet this result could be real and of interest to you. • By setting the E value cutoff to 1000 or higher you do not change the way the search is done, but you do change which results are reported to you. • All hits with an E value less than 1000 are reported

  20. E Values • Orthologs from closely related species will have the highest scores and lowest E values • Often E = $10^{-30}$ to $10^{-100}$ • Closely related homologs with highly conserved function and structure will have high scores • Often E = $10^{-15}$ to $10^{-50}$ • Distantly related homologs may be hard to identify • Less than E = $10^{-4}$ • These values may serve as a general guideline, not as strict ranges for those situations

  21. Set the Expect Threshold • The Expect Threshold can be any positive real number. • The lower the number the more stringent the matches displayed. • The default value of 10 signifies that 10 matches can be expected by chance in a search of the database using a random query with similar length. • No match with an E-value higher than the Expect Threshold selected will be displayed • Increase the Expect Threshold to 1000 or more when searching with a short query
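The effect of the Expect Threshold can be sketched as a simple filter on an already computed hit list: changing it changes what is displayed, not how the search is done. The hit names and E values below are made up for illustration.

```python
def report_hits(hits, expect_threshold=10.0):
    """Keep hits whose E value does not exceed the threshold, best first."""
    kept = [hit for hit in hits if hit[1] <= expect_threshold]
    return sorted(kept, key=lambda hit: hit[1])


# Hypothetical (name, E value) pairs
hits = [
    ("ortholog", 1e-30),
    ("weak_match", 5.0),
    ("short_peptide_match", 150.0),
    ("noise", 5000.0),
]

print(report_hits(hits))                          # default threshold of 10
print(report_hits(hits, expect_threshold=1000))   # short-query setting
```

With the default of 10 the short-peptide match (E = 150) is hidden; raising the threshold to 1000 displays it, at the cost of also showing more chance matches.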

  22. BLAST Search Output

  23. BLASTN Output (header)

  24. BLASTN Output (Graphic Summary) • Labels in the figure: matches to itself; probable homologs; distantly related homologs; distant homolog with shared domain or motif

  25. BLASTN Output (Descriptions)

  26. BLASTN Output (Sequence Alignments)

  27. UCSC Genome Browser Adopted from OpenHelix Training Materials

  28. UCSC Genome Browser • http://genome.ucsc.edu

  29. Genome Browser Gateway • Use this Gateway to search by: • Gene names, symbols, IDs • Chromosome number: chr7, or region: chr11:1038475-1075482 • Keywords: kinase, receptor • See lower part of page for help with format
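The position formats above (a whole chromosome such as chr7, or a region such as chr11:1038475-1075482) can be recognized with a small parser. This is a sketch for illustration, not the browser's own logic; it assumes chromosome names start with "chr" and that commas may appear inside the coordinates.

```python
import re

# Assumed pattern: "chrN" optionally followed by ":start-end"
POSITION_RE = re.compile(r"^(chr[0-9A-Za-z_]+)(?::([\d,]+)-([\d,]+))?$")


def parse_position(text):
    """Return (chrom, start, end), (chrom, None, None), or None for keywords."""
    m = POSITION_RE.match(text.strip())
    if m is None:
        return None                      # e.g. a gene name or keyword search
    chrom, start, end = m.groups()
    if start is None:
        return chrom, None, None         # whole chromosome
    start, end = int(start.replace(",", "")), int(end.replace(",", ""))
    if start > end:
        raise ValueError(f"start {start} is after end {end}")
    return chrom, start, end


print(parse_position("chr11:1038475-1075482"))
print(parse_position("chr7"))
print(parse_position("kinase"))
```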

  30. The Genome Browser Gateway • Make your Gateway choices: • Select clade • Select genome = species: search 1 species at a time • Assembly: the official backbone DNA sequence • Position: location in the genome to examine • Image width: how many pixels in the display window; 5000 max • Configure: make fonts bigger + other choices

  31. UCSC Genome Browser

  32. The Genome Browser Gateway • Sample search: human, March 2006 assembly, tp53 • Select from results list • ID search may go right to a viewer page, if unique

  33. Sample Genome Viewer Image, TP53 Region • Tracks labeled in the image: base position; UCSC genes; RefSeq genes; MGC clones; mRNAs & ESTs; many species compared; single species compared; SNPs; repeats

  34. Visual Cues on the Genome Browser • Tick marks: a single location (STS, SNP) • Gene structure: exons, 5' UTR, 3' UTR; introns carry arrows (<<< or >>>) showing the direction of transcription • Track colors may have meaning, for example in the UCSC Gene track: • If there is a corresponding PDB entry = black • If there is a corresponding reviewed/validated seq = dark blue • If there is a non-RefSeq seq = lightest blue • Alignment indications (Conservation pairs: “chain” or “net” style) • Alignments = boxes, Gaps = lines • For some tracks (e.g. the conservation track), the height of a bar indicates an increased likelihood of an evolutionary relationship

  35. Options for Changing Images: Upper Section • Change your view or location with controls at the top: walk left or right, zoom in, zoom out, specify a position • Use “base” to get right down to the nucleotides • Configure: to change font, window size, next item, more… • Next item, next exon navigation assistance can be turned on • Click to zoom 3x and re-center

  36. Annotation Track Display Options • Some data is ON or OFF by default • Menu links to info about the tracks (content, methods) and/or filters • You change the track view with pulldown menus • After making changes, REFRESH to enforce the change

  37. Annotation Track Options Defined • Hide: removes a track from view • Dense: all items collapsed into a single line • Squish: each item = separate line, but 50% height • Pack: each item separate, but efficiently stacked (full height) • Full: each item on separate line

  38. Mid-page Options to Change Settings • You control the views • Use pulldown menus • Configure options page • Buttons labeled in the image: enforce any changes (hide, full, squish…); flip display to genomic 3' to 5'; reset, back to defaults; start from scratch

  39. Cookies and Sessions • Your browser remembers where you were (cookies) • To clear your “cart” or parameters, click default tracks or reset • Save your setup as “sessions” and store/share them

  40. Get DNA Sequence for Region Shown in Browser

  41. GEP Drosophila Genome Browser • UCSC Genome Browser, GEP version: parts of genomes and GEP data, used for annotation of Drosophila species • http://gander.wustl.edu • Image: male Drosophila melanogaster (http://en.wikipedia.org/wiki/Drosophila_melanogaster)

  42. FlyBase

  43. Introduction

  44. Quick Searches

  45. Sequence Searches (BLAST)
