Course Module: Genomics and Personalized Care Lecture 2 Blast, UCSC Genome Browser, Flybase

Pairwise Local Alignment • Pairwise local sequence alignment identifies similar segments in two sequences • The Smith-Waterman algorithm (a dynamic programming algorithm) is guaranteed to find optimal alignments, but it is computationally expensive • BLAST is a heuristic approximation to local alignment; it runs much faster than the Smith-Waterman algorithm while retaining much of the sensitivity of the search
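To make the dynamic programming idea concrete, here is a minimal Smith-Waterman sketch. The scoring values (match = 2, mismatch = -1, linear gap = -2) and the function name are illustrative assumptions, not values from the lecture; real searches use a substitution matrix and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this makes it local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score reaches 0
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTTACGTAAA", "GGACGTCC")
print(score, top, bottom)
```

The table has (len(a) + 1) × (len(b) + 1) cells, so the cost grows with the product of the two sequence lengths. That is why an exhaustive search of a large database is expensive and a heuristic is attractive.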

BLAST • BLAST [Basic Local Alignment Search Tool] is a sequence comparison algorithm, optimized for speed, that is used to search sequence databases for optimal local alignments to a query • It is one of the most widely used and referenced computational biology resources • The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length W with a score of at least T when compared to the query using a substitution matrix • Word hits are then extended in both directions to generate an alignment with a score exceeding a given threshold S

BLAST Algorithm • Filter out low complexity regions • Locate words of a fixed size in the query sequence • Scan the sequence database for entries that match the words in the query sequence • If there is a hit (i.e. a match between a word in the query and a word in the database entry), extend the hit in both directions. Keep track of the score and stop the extension when the score drops below a threshold
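The word lookup and extension steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the BLAST implementation: it uses exact word matches only (no neighborhood words scored against T), ungapped extension, no low-complexity filtering, and a hypothetical stopping rule that ends the extension once the running score falls more than `x_drop` below the best score seen. All function names and scoring values are made up for the example.

```python
from collections import defaultdict


def index_words(query, w):
    """Map every length-w word in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def extend(query, subject, qi, si, step, match, mismatch, x_drop):
    """Extend without gaps in one direction (step is +1 or -1).

    Returns (best score gained, number of positions needed to reach it).
    """
    score = best = 0
    best_len = n = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break  # score dropped too far below the best: stop extending
        qi += step
        si += step
    return best, best_len


def seed_and_extend(query, subject, w=4, match=1, mismatch=-2, x_drop=3):
    """Return sorted (score, query_start, subject_start, length) tuples, 0-based."""
    index = index_words(query, w)
    hits = set()  # different seeds often extend to the same segment pair
    for si in range(len(subject) - w + 1):
        for qi in index.get(subject[si:si + w], ()):
            r_gain, r_len = extend(query, subject, qi + w, si + w, 1,
                                   match, mismatch, x_drop)
            l_gain, l_len = extend(query, subject, qi - 1, si - 1, -1,
                                   match, mismatch, x_drop)
            score = w * match + l_gain + r_gain
            hits.add((score, qi - l_len, si - l_len, l_len + w + r_len))
    return sorted(hits, reverse=True)


print(seed_and_extend("TTACGTGATT", "CCCACGTGACCC"))
```

In this toy example three overlapping seeds (ACGT, CGTG, GTGA) all extend to the same segment, which is why the results are collected in a set.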

Word Size • The initial search is done for a word of length W • Default values: • Protein sequence search: W = 3 • Nucleotide sequence search: W = 11 • Highly similar nucleotide sequence: W=28 • Each word in the query sequence index is compared to the database index and residue pairs are scored

Four Steps of a BLAST search • Enter query sequence • Select one BLAST program • Choose the database to search • Set optional parameters

Enter Query Sequence • A sequence can be pasted into a text field in FASTA format or as accession number • A sequence or a sequence list can also be uploaded as a file • Users may indicate a range of the query sequence instead of using the whole query sequence • You may enter a descriptive title for your BLAST search
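For readers unfamiliar with FASTA, a record is a header line starting with ">" followed by one or more sequence lines. The sketch below parses pasted FASTA text and extracts a query subrange. The function names and the sample record are hypothetical, and the 1-based inclusive coordinates are an assumption about how the range fields are interpreted.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subrange(sequence, start, stop):
    """Return positions start..stop of the sequence (assumed 1-based, inclusive)."""
    return sequence[start - 1:stop]


# Hypothetical two-record input, for illustration only
example = """>seq1 example query
ACGTACGT
ACGT
>seq2 another query
TTTT
"""
records = parse_fasta(example)
print(records)
print(subrange(records[0][1], 3, 6))
```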

Align Two or More Sequences • You may provide two or more sequences and perform a pairwise BLAST search

Select a BLAST Program • BLAST Programs: • BLASTN: DNA query sequence against a DNA database • BLASTP: protein query sequence against a protein database • BLASTX: DNA query sequence, translated into all six reading frames, against a protein database • TBLASTN: protein query sequence against a DNA database, translated into all six reading frames • TBLASTX: DNA query sequence, translated into all six reading frames, against a DNA database, translated into all six reading frames • Choose the program that fits the type of sequence you have and the purpose of your search
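BLASTX, TBLASTN and TBLASTX all depend on translating DNA in six reading frames: three offsets on the given strand and three on its reverse complement. The sketch below illustrates that step using the standard genetic code. The function names are hypothetical, and translating straight through stop codons (shown as "*") is a simplification for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCC").items():
    print(frame, protein)
```

Because every DNA sequence yields six candidate protein sequences, the translated programs do roughly six times the comparison work per sequence, which is one reason to choose them only when a protein-level comparison is needed.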

Choose the Database to Search • BLASTN

Optional Parameters • Specify the organism to search or exclude • Common name, taxonomy id, … • Exclude certain sequences • Exclude predicted sequences or sequences from metagenomics • Use Entrez query to select a subset of the BLAST database

Algorithm Parameters Optional Parameters

Algorithm Parameters • Expect value • Word size • Filtering/masking • Substitution matrix

BLASTN Algorithm Parameters

Expect Value

Expect Value • It is important to assess the statistical significance of search results • For local alignments, the scores follow an extreme value distribution • The expect value (E value) is the number of matches with at least a given score that are expected to occur by chance • The lower the E value, the more significant the match • E = K m n e^{-λS} • K: a variable with a value dependent upon the substitution matrix used and adjusted for search base size • m, n: lengths of the query and database sequences • λ: a statistical parameter used as a natural scale for the scoring system • S: alignment score

More about E Value • The value of E decreases exponentially with increasing alignment score S (higher S values correspond to better alignments). Very high scores correspond to very low E values. • For E=1, one match with a similar score is expected to occur by chance. • For a much larger or smaller database, you would expect E to vary accordingly
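The behaviour of the formula E = K·m·n·e^(−λS) can be checked numerically. In the sketch below, the values of K and λ are illustrative placeholders chosen for the example (in practice they depend on the scoring system), and the function name and sequence lengths are hypothetical.

```python
import math

# Illustrative placeholder parameters; real values depend on the
# substitution matrix and gap penalties in use.
ILLUSTRATIVE_K = 0.134
ILLUSTRATIVE_LAMBDA = 0.3176


def expect_value(score, query_len, db_len,
                 k=ILLUSTRATIVE_K, lam=ILLUSTRATIVE_LAMBDA):
    """E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


m, n = 100, 1_000_000  # hypothetical query and database lengths

# E falls exponentially as the alignment score rises
for s in (30, 40, 50, 60):
    print(f"S={s:3d}  E={expect_value(s, m, n):.3g}")

# E scales linearly with database size: the same score is less
# surprising in a larger database
print(expect_value(50, m, 2 * n) / expect_value(50, m, n))
```

Each increase of 10 in S divides E by e^(10λ), while doubling the database length doubles E. This is why the same alignment can have different E values in searches against databases of different sizes.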

Why Set Expect Threshold to 1000 • When you perform a search with a short query (e.g. 9 amino acids), there are not enough residues to accumulate a big score (or a small E value) • A match of 9 out of 9 residues could yield a small score with an E value of 100 or 200, and yet this result could be real and of interest to you • By setting the E value cutoff to 1000 or a bigger value, you do not change the way the search is done, but you do change which results are reported to you • All hits with an E value less than 1000 are reported

E Values • Orthologs from closely related species will have the highest scores and lowest E values • Often E = 10^-30 to 10^-100 • Closely related homologs with highly conserved function and structure will have high scores • Often E = 10^-15 to 10^-50 • Distantly related homologs may be hard to identify • E less than 10^-4 • These values may serve as a general guideline, not as strict ranges for those situations

Set the Expect Threshold • The Expect Threshold can be any positive real number. • The lower the number the more stringent the matches displayed. • The default value of 10 signifies that 10 matches can be expected by chance in a search of the database using a random query with similar length. • No match with an E-value higher than the Expect Threshold selected will be displayed • Increase the Expect Threshold to 1000 or more when searching with a short query
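The Expect Threshold acts as a reporting filter applied to hits that the search has already found. The sketch below shows this with made-up hits: the hit names, E values, and the function name are all hypothetical, chosen only to show a short-query match with E = 150 appearing once the threshold is raised.

```python
def report_hits(hits, expect_threshold=10.0):
    """Keep hits whose E value does not exceed the threshold, best first.

    hits is a list of (name, e_value) tuples.
    """
    kept = [hit for hit in hits if hit[1] <= expect_threshold]
    return sorted(kept, key=lambda hit: hit[1])


# Hypothetical search results, for illustration only
hits = [
    ("short_exact_match", 150.0),
    ("ortholog", 1e-80),
    ("weak_match", 5.0),
    ("noise", 2000.0),
]

print(report_hits(hits))                          # default threshold of 10
print(report_hits(hits, expect_threshold=1000))   # relaxed for a short query
```

The same list of hits is passed in both calls; only the set that is reported changes, which mirrors the point that raising the threshold does not alter how the search is done.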

BLAST Search Output

BLASTN Output (header)

BLASTN Output (Graphic Summary) • Figure labels: match to itself; probable homologs; distantly related homologs; distant homolog with shared domain or motif

BLASTN Output (Descriptions)

BLASTN Output (Sequence Alignments)

UCSC Genome Browser Adopted from OpenHelix Training Materials

UCSC Genome Browser • http://genome.ucsc.edu

Genome Browser Gateway • Use this Gateway to search by: • Gene names, symbols, IDs • Chromosome number: chr7, or region: chr11:1038475-1075482 • Keywords: kinase, receptor • See lower part of page for help with format
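The position formats above (a whole chromosome such as chr7, or a region such as chr11:1038475-1075482) are easy to handle programmatically. The sketch below is a hypothetical helper, not part of the Genome Browser itself; it assumes the coordinates are 1-based and inclusive and that commas may appear as thousands separators.

```python
import re

# chrom alone, or chrom:start-end (digits may contain commas)
POSITION_RE = re.compile(r"^(chr[0-9A-Za-z_]+)(?::([\d,]+)-([\d,]+))?$")


def parse_position(text):
    """Return (chrom, start, end); start and end are None for a whole chromosome."""
    m = POSITION_RE.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognized position: {text!r}")
    chrom, start, end = m.groups()
    if start is None:
        return chrom, None, None
    start_i = int(start.replace(",", ""))
    end_i = int(end.replace(",", ""))
    if start_i > end_i:
        raise ValueError("start must not exceed end")
    return chrom, start_i, end_i


def region_length(start, end):
    """Length of a region, assuming 1-based inclusive coordinates."""
    return end - start + 1


chrom, start, end = parse_position("chr11:1038475-1075482")
print(chrom, start, end, region_length(start, end))
print(parse_position("chr7"))
```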

The Genome Browser Gateway • Make your Gateway choices: • Select Clade • Select genome = species: search 1 species at a time • Assembly: the official backbone DNA sequence • Position: location in the genome to examine • Image width: how many pixels in display window; 5000 max • Configure: make fonts bigger + other choices

UCSC Genome Browser

The Genome Browser Gateway • Sample search: human, March 2006 assembly, tp53 • Select from results list • ID search may go right to a viewer page, if unique

Sample Genome Viewer Image, TP53 Region • Track labels in the figure: base position; UCSC genes; RefSeq genes; MGC clones; mRNAs & ESTs; many species compared; single species compared; SNPs; repeats

Visual Cues on the Genome Browser • Tick marks: a single location (STS, SNP) • Gene structure labels in the figure: exon, 3' UTR, 5' UTR, and intron with direction of transcription (<<< or >>>) • Track colors may have meaning, for example, UCSC Gene track: • If there is a corresponding PDB entry = black • If there is a corresponding reviewed/validated seq = dark blue • If there is a non-RefSeq seq = lightest blue • Alignment indications (Conservation pairs: “chain” or “net” style) • Alignments = boxes, Gaps = lines • For some tracks, the height of a bar indicates increased likelihood of an evolutionary relationship (conservation track)

Options for Changing Images: Upper Section • Change your view or location with controls at the top • Use “base” to get right down to the nucleotides • Configure: to change font, window size, more… • Next item, next exon navigation assistance can be turned on • Figure callouts: walk left or right; zoom in; zoom out; specify a position; fonts, window, next item, more; click to zoom 3x and re-center

Annotation Track Display Options • Some data is ON or OFF by default • Menu links to info about the tracks: content, methods • You change the view with pulldown menus • After making changes, REFRESH to enforce the change • Figure callouts: change track view; enforce changes; links to info and/or filters

Annotation Track Options Defined • Hide: removes a track from view • Dense: all items collapsed into a single line • Squish: each item = separate line, but 50% height • Pack: each item separate, but efficiently stacked (full height) • Full: each item on separate line

Mid-page Options to Change Settings • You control the views • Use pulldown menus • Configure options page • Figure callouts: enforce any changes (hide, full, squish…); flip display to Genomic 3’5’; reset, back to defaults; start from scratch

Cookies and Sessions • Your browser remembers where you were (cookies) • To clear your “cart” or parameters, click default tracks or reset • Save your setup as “sessions” and store/share them

Get DNA Sequence for Region Shown in Browser

GEP Drosophila Genome Browser • UCSC Genome Browser, GEP version: parts of genomes and GEP data, used for annotation of Drosophila species • http://gander.wustl.edu • Image: male Drosophila melanogaster (http://en.wikipedia.org/wiki/Drosophila_melanogaster)

Flybase

Introduction

Quick Searches

Sequence Searches (BLAST)
