330 likes | 464 Views
School B&I TCD Bioinformatics. Database homology searching May 2010. Why search a Database. To find homologous sequences to your unknown to determine function To find other related sequences to do evolutionary studies (trees) or to make specialised database (nematode 16sRNA)
E N D
School B&I TCD Bioinformatics Database homology searching May 2010
Why search a Database • To find homologous sequences to your unknown to determine function • To find other related sequences to do evolutionary studies (trees) or to make specialised database (nematode 16sRNA) • To find the mouse or E.coli homolog of your gene of interest • To find genes in a newly sequenced genome • To predict 3-D structure (blast vs PDB)
BLAST • For many molecular biologists • a Blast search • with a DNA sequence • at NCBI • accepting default parameters • IS bioinformatics • 70% of searches at NCBI at blastN
BLAST • Basic local alignment search tool • More or less unchanged for 20 years • Extraordinarily quick: • Submit seq via satellite to Bethesda MD • Crunch thro 400 GB of sequence • Return results via satellite • ALL in 30 seconds • Available at other sites (EBI, ExPaSy etc.) • with better response times, fewer conditions
NCBI penalties • 1st request: current time • 2nd request: current time + 60 seconds • 3rd request: current time + 120 seconds • 4th request: current time + 180 seconds • 5th request: current time + 240 second • DON’T submit multiple simultaneous queries
Reliable servers • http://www.ch.embnet.org/software/bBLAST.html • http://www.ch.embnet.org/software/aBLAST.html • Basic and advanced SwissBlast • http://www.ebi.ac.uk/blast2/index.html • EBI (Cambridge, UK) NCBI Blast • http://www.ncbi.nlm.nih.gov/blast/ • Not the only Blast server!
BLAST varieties • blastn: searches a DNA sequence against a DNA database like EMBL, Genbank, or dbEST. also megablast for exact matches • blastp: searches a protein sequence against a protein database such as UniProt, or "nr" a non-redundant database which ideally contains one copy of every available sequence. Then you have: • blastx: searches a DNA sequence (translated in all six reading frames) against a protein database. • tblastn: searches a protein sequence against a DNA database (translated in all six reading frames) – essential for searching EST databases. and in the interests of completeness there is: • tblastx: searches a DNA sequence (translated in all six reading frames) against a DNA database (translated in all six reading frames). finally • Psi-blast an iterative process using position specific subst matrix PSSM to find distantly related sequences
Algorithm • Five steps • Break the query seq into short “words” – typically 3 consecutive residues for protein • Search the database for (nearly) exact matches to these words • Extend match “hits” out on either side until score stops going up – these are HSPs (high scoring segment pairs) • Sort the HSPs by some “optimum” criterion • Significant hits are then formally scored, aligned and displayed
Alternatives to BLAST • FASTA • A little slower than Blast • More sensitive (so recommended) for DNA to DNA searches • Smith-Waterman (Blitz) • Much slower than Blast (20x slower) • Much more sensitive • But Blast is standard because it gives a “good enough” answer quickly.
Blast Blast at NCBI Paste your query sequence here Parallel search in Conserved domain DB
Input sequence parameters Optional parameters to change include • Parameters that alter the hits found • Database searched • Word size • Substitution matrix • Gap penalties • Low complexity masking • Parameters that alter the results delivered • Expectation cut-off • Limit by organism or taxonomic group • Number of hits reported • Number of alignments shown
Database • If query is a coding gene: translate and search protein database • Search PDB if you want a 3-D structure • Search NR if you want any hit • Search UniProt to know what the hits are • Search dbEST to know if your sequence is expressed • UniProt90: no seq is more than 90% ident to any other (for an uncluttered tree) also UniProt50
Word size • Default at NCBI Blast • 11 for DNA; option of 7 and 5 • 3 for protein; option of 2 • Increasing wordsize will speed search … but will lose sensitivity • It will miss some useful but distant hits • Decreasing wordsize will be more sensitive … but take longer
Substitution matrix • By default BLAST uses BLOSUM62 • Can change this • Blosum90 mirrors changes in closely related sequences • Blosum30 changes in distant sequences • Should run 3 blast searches with different substitution matrices.
Changes 30-62-90 not linear Substitution matrices Top left part of a BLOSUM 90 matrix A R N D C Q E G H I L A 5 -2 -2 -3 -1 -1 -1 0 -2 -2 -2 R -2 6 -1 -3 -5 1 -1 -3 0 -4 -3 N -2 -1 7 1 -4 0 -1 -1 0 -4 -4 D -3 -3 1 7 -5 -1 1 -2 -2 -5 -5 C -1 -5 -4 -5 9 -4 -6 -4 -5 -2 -2 Q -1 1 0 -1 -4 7 2 -3 1 -4 -3 E -1 -1 -1 1 -6 2 6 -3 -1 -4 -4 G 0 -3 -1 -2 -4 -3 -3 6 -3 -5 -5 H -2 0 0 -2 -5 1 -1 -3 8 -4 -4 I -2 -4 -4 -5 -2 -4 -4 -5 -4 5 1 L -2 -3 -4 -5 -2 -3 -4 -5 -4 1 5 Positive off-diagonals aresimilar
Reality of matrix change • Query: GHDEICI • GH + C • Sbjct: GHACNCG • Scores 39 with Blosum30; 5 with Blosum90 • Query: HEQCRLEN • +E LEN • Sbjct: QENAHLEN • Scores 19 with Blosum30; 24 with Blosum90 • So GHDEICIHEQCRLEN will find different hits
Matrix families • PAM – point accepted mutation • Margaret Dayhoff 1970s (before Genbank) • BLOSUM – based on aligned blocks • Henikoff and Henikoff 1992 • Blosum 62 default (based on aligned seqs 62% identical – don’t ask why 62: it works) • Low PAM = High Blosum • PAM 250 == Blosum 30 • PAM 30 == Blosum 90
Gap penalty • Original Blast was gap-free • Because gapped aligns much more CPU • Now can insert “affine” gaps • Gap open 10; gap extend 1 • Raise gap penalty to discourage gaps • Preferentially gets closely related hits • Lower gap penalty for sensitive search for distant relatives
Low Complexity Masking • Seg masks low-complexity regions • “too many” serines, prolines etc. • What a masked sequence looks like: >P04729 Wheat gamma gliadin MKTFLVFALIAVVATSAIAQMETSCISGLERPWQQQPLPPQQSFSQQPPFSQQQQQPLPQ QPSFSQQQPPFSQQQPILSQQPPFSQQQQPVLPQQSPFSQQQQLVLPPQQQQQQLVQQQI PIVQPSVLQQLNPCKVFLQQQCSPVAMPQRLARSQMWQQSSCHVMQQQCCQQLQQIPEQS RYEAIRAIIYSIILQEQQQGFVQPQQQQPQQSGQGVSQSQQQSQQQLGQCSFQQPQQQLG QQPQQQQQQQVLQGTFLQPHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTG VGAY* and after low complexity masking: >P04729 SEG low-complexity masked MKTFLVFALIAVVATSAIAQMETSCISGLERPWXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXLNPCKVFLQQQCSPVAMPQRLARSQMWXXXXXXXXXXXXXXXXXXXXXXX RYEAIRAIIYSIIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTG VGAY* • Xnu masks repeats (from database of common repeats)
Masking • Check if masking on/off by default (differs at sites) • Run 2 searches with masking on, masking off ANNYway • Masking by hand: • If you know about the DNA binding domain already • Which is really common and will occupy the top 100 hits against any database • Replace the region with Xs • Re-run blast to find other motifs/domains/information • NCBI blast allows select subsequences • DUST masks low info DNA sequences • Like polyA tails
Expectation cut-off • E value • Expected number of chance hits • Given the database searched • Query HSP length • Related to probability • Default is 10: “expect to find 10 hits as good as this by chance alone” • E values less than 10-4 unreliable • So different from p < 0.05 • For short seqs crank E up to 100 or 1000
Data deluge – hits delivered • Default suits general purpose • V = 50 num of “hits” one line descriptions • B = 50 num of alignments between query and hit which are displayed
Limit by taxon/organism Also search at www.ensembl.org for “genomed” organisms
Database choice NR for getting aNNy hit swissprot or refseq for getting hits that have annotation Month for recent hits
The output • Is in five parts • Admin – date, size of database, id of query • Graphic display of query and hits • List of hits with links to database and to… • …alignments (may be > 1 HSP per hit) • More admin, including errors/warnings
Notes! • Record the top admin stuff. Your search will be different • Next week • On a different server • With different DB • With different input parameters • Mouse-over any graphic display • Shows domain structure • Shows if hit is global or local • Read the bottom admin stuff
Blast output sp|P06471|HOR3_HORVU B3-HORDEIN Length = 264 Score = 62.5 bits (149), Expect = 1e-09 Identities = 32/63 (50%), Positives = 38/63 (59%) Query: 131 LNPCARSQMWXXXXXXXXXXXXXXXXXXXXXXXRYEAIY LNPCARSQM R+EA+Y Sbjct: 111 LNPCARSQMLQQSSCHVLQQQCCQQLPQIPEQLRHEAVY Query: 191 SII 193 SI+ Sbjct: 171 SIV 173 Low-complexity masked region What sort of residues masked in The query sequence?? May be more HSPs for same hit If big deletion in either seq
General blast protocol • Find server, choose DB, paste seq, GO • Rerun search with/out masking • Rerun search with two diff subs matrix • 2 x 3 for six searches • If top N hits all same family/domain then XXX this region and resubmit • LOOK at the results; esp strange ones • Limit results by organism
Blast notes • Twilight zone <25% protein <70% NA • Because two sequences give high blast scores to a third, doesn’t mean they are related • NNNNDOMAINANNNNDOMAINBNNN • NNNNDAMIANANNNNNNNNNNNNNN • NNNNNNNNNNNNNNNDIMIAMBNNN • E-value vs % ID. >10-4 unreliable • Bit score <50 unreliable
PSI-BLAST • Uses a PSSM rather than BloSum/PAM • Iterative … • Can find very distant relatives • …so deep insight • BUT can iterate off with the fairies