BLAST
Similarity and Homology
• Similarity is a measure of "sameness". It is expressed as a percentage and implies nothing about the reasons for the observed sameness; it is simply a measure of the observed likeness.
• Homology is an evolutionary term used to describe a relationship via descent from a common ancestor. Homologous things are often similar, but not always; for example, the flipper of a whale and your arm, or the DNA sequence for actin in humans and chickens.
• Homology is NEVER expressed as a percentage: the things being compared are either related or they are not.
• Similarity is not homology. Things may be some percentage similar, but they are either homologous or not.
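Because similarity is just a percentage, it can be computed directly from two aligned sequences. The following is a minimal Python sketch, not part of the original slides; the function name `percent_identity` and the choice to skip gap columns in the denominator are assumptions made for illustration (other conventions for the denominator exist).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of non-gap alignment columns in which both residues match.

    Assumption: columns containing a gap ('-') in either sequence are
    excluded from the comparison. This is one convention among several.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))
```

Note that the result says nothing about homology: it only measures observed likeness between the two strings.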
Similarity and Homology
• Sequence homology can be reliably inferred from statistically significant similarity over a majority of the sequence length.
• Non-homology CANNOT be inferred from non-similarity, because non-similar things can still share a common ancestor.
• Homologous proteins share common structures, but not necessarily common sequence or function.
What is BLAST?
• Basic Local Alignment Search Tool
• It is a sequence database search program.
• It tries to match a query sequence against each sequence in a target database.
• It produces local alignments: only a portion of each sequence is aligned.
• It uses statistical theory to determine whether a match might have occurred by chance.
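To make "local alignment" concrete, here is a small Python sketch of Smith-Waterman scoring, the classic exact method for local alignment. This is not how BLAST itself searches (BLAST uses faster heuristics), and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the function name are assumptions chosen only for illustration.

```python
def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best_score, (query_end, target_end)) of the best local alignment.

    Smith-Waterman dynamic programming with a linear gap penalty.
    Scores are floored at zero, which is what lets the alignment cover
    only a portion of each sequence.
    """
    cols = len(target) + 1
    prev = [0] * cols          # previous row of the scoring matrix
    best = 0
    best_end = (0, 0)
    for i in range(1, len(query) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            pair = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + pair, # align the two residues
                prev[j] + gap,      # gap in target
                curr[j - 1] + gap,  # gap in query
            )
            if curr[j] > best:
                best = curr[j]
                best_end = (i, j)
        prev = curr
    return best, best_end


if __name__ == "__main__":
    # Only the shared "ACG" core aligns; the flanks are ignored.
    print(local_alignment_score("TTTACGTTT", "GGGACGGGG"))
```

The returned end positions are 1-based offsets into the query and target, marking where the best-scoring local region stops.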
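BLAST programs by query and database type (diagram)
• blastn: nucleotide sequence query vs. nucleotide DB
• blastp: protein sequence query vs. protein DB
• blastx: nucleotide sequence query, translated in 6 frames, vs. protein DB
• tblastn: protein sequence query vs. nucleotide DB translated in 6 frames
• tblastx: nucleotide sequence query, translated in 6 frames, vs. nucleotide DB translated in 6 frames
Translated DBs contain amino acid sequences.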
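Translation "in 6 frames" means reading the nucleotide sequence in three offsets on the forward strand and three on the reverse complement. The Python sketch below illustrates the idea using the standard genetic code; the function names and the `+1`..`-3` frame labels are assumptions for this example, and unknown codons are rendered as `X`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for codons with unknown bases
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six resulting protein strings can then be compared against protein sequences, which is the idea behind the translated searches.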
NCBI BLAST Databases

Peptide Sequence Databases
• nr: All non-redundant GenBank CDS translations + RefSeq Proteins + PDB + SwissProt + PIR + PRF.
• refseq: Protein sequences from NCBI's Reference Sequence Project.
• swissprot: Last major release of the SWISS-PROT protein sequence database (no updates).
• pat: Proteins from the Patent division of GenPept.
• pdb: Sequences derived from the 3-dimensional structures from the Brookhaven Protein Data Bank.
• month: All new or revised GenBank CDS translations + PDB + SwissProt + PIR + PRF released in the last 30 days.
• env_nr: Protein sequences from environmental samples.

Nucleotide Sequence Databases
• nr: All GenBank + RefSeq Nucleotides + EMBL + DDBJ + PDB sequences (excluding HTGS0,1,2, EST, GSS, STS, PAT, WGS). No longer "non-redundant".
• refseq_rna: RNA entries from NCBI's Reference Sequence project.
• refseq_genomic: Genomic entries from NCBI's Reference Sequence project.
• est: Database of GenBank + EMBL + DDBJ sequences from EST Divisions.
• est_human: Human subset of est.
• est_mouse: Mouse subset of est.
• est_others: Non-mouse, non-human subset of est.
• gss: Genome Survey Sequence; includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
• htgs: Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr).
• pat: Nucleotides from the Patent division of GenBank.
• pdb: Sequences derived from the 3-dimensional structures from the Brookhaven Protein Data Bank.
• month: All new or revised GenBank + EMBL + DDBJ + PDB sequences released in the last 30 days.
• dbsts: Database of GenBank + EMBL + DDBJ sequences from STS Divisions.
• chromosome: A database with complete genomes and chromosomes from the NCBI Reference Sequence project.
• wgs: A database for whole genome shotgun sequence entries.
• env_nt: Nucleotide sequences from environmental samples.
What do you need to run BLAST?
• The BLAST program
• A BLASTable (formatted) database that can be queried
• A query sequence
• Query parameters
Making your own BLAST DB
• Any file of FASTA-formatted sequences can be turned into a BLAST DB.
• How you do this depends on which BLAST variant you are using.
• NCBI BLAST, protein DB: formatdb -p T -i myseqfile
• NCBI BLAST, nucleotide DB: formatdb -p F -i myseqfile
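Since the input must be FASTA formatted, it can help to sanity-check a sequence file before formatting it as a database. The following Python sketch is a minimal FASTA reader written for illustration; the function name `read_fasta` and its error rules (no sequence before the first header, no empty headers) are assumptions, not part of any BLAST tool.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {identifier: sequence}.

    A record starts with a '>' header line; the identifier is the first
    whitespace-delimited word of the header. Sequence lines that follow
    are concatenated until the next header.
    """
    records = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("empty FASTA header")
            current = words[0]
            records[current] = ""
        else:
            if current is None:
                raise ValueError("sequence data found before first '>' header")
            records[current] += line
    return records


if __name__ == "__main__":
    example = [">seq1 test protein", "MKV", "LLA", "", ">seq2", "ACDE"]
    print(read_fasta(example))
```

To check a file on disk, the same function can be given an open file handle, for example `read_fasta(open("myseqfile"))`.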
Command line BLAST
• blastall -p blastp -d formatteddb -i myseq -o myseq.blastp
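The formatdb and blastall commands shown in these slides can also be assembled from a script. The Python sketch below only builds the argument lists, using exactly the flags that appear in the slides (`-p`, `-i`, `-d`, `-o`); the helper names are hypothetical, and actually running a command assumes the legacy NCBI BLAST executables are installed and on the PATH.

```python
import shutil
import subprocess

# The five search programs named in the slides.
BLAST_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def formatdb_command(fasta_path, protein):
    """Build the formatdb argument list (-p T for protein, -p F for nucleotide)."""
    return ["formatdb", "-p", "T" if protein else "F", "-i", fasta_path]


def blastall_command(program, database, query_path, output_path):
    """Build the blastall argument list for one search."""
    if program not in BLAST_PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program}")
    return ["blastall", "-p", program, "-d", database,
            "-i", query_path, "-o", output_path]


def run_if_available(command):
    """Run the command only if its executable is found; return True if it ran."""
    if shutil.which(command[0]) is None:
        return False  # executable not installed, nothing is run
    subprocess.run(command, check=True)
    return True


if __name__ == "__main__":
    print(" ".join(formatdb_command("myseqfile", protein=True)))
    print(" ".join(blastall_command("blastp", "formatteddb",
                                    "myseq", "myseq.blastp")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.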
PSI-BLAST
• PSI stands for Position-Specific Iterated.
• This search method makes use of a profile, which is a position-specific accounting of which amino acid residues are found in a family of aligned homologous proteins.
• PSI-BLAST accepts a protein sequence as input and first conducts a normal BLAST search to identify homologues in the database.
• A profile is constructed from the spectrum of sequences found in the initially identified homologues.
• This profile is used as the search key to identify more distant relatives.
• The process is then iterated, each time refining the profile based on inclusion of the new members.
• Ideally, the process is expected to converge on a unique set of proteins.
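The idea of a profile can be illustrated with a toy position-specific scoring matrix. The Python sketch below is a simplification and not PSI-BLAST's actual procedure: it assumes an ungapped alignment, a uniform background frequency, and a flat pseudocount, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Build a list of per-position log-odds scores from aligned sequences.

    Assumptions: all sequences have equal length, the background
    frequency of every amino acid is uniform (1/20), and a flat
    pseudocount is added to every residue count.
    """
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_against_profile(profile, sequence):
    """Sum the position-specific scores of a sequence of matching length."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match profile length")
    # Residues outside the 20 standard amino acids contribute zero.
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, sequence))


if __name__ == "__main__":
    profile = build_profile(["ACD", "ACE", "ACD"])
    print(score_against_profile(profile, "ACD"))
    print(score_against_profile(profile, "WWW"))
```

In an iterated search, sequences that score well against the profile would be added to the alignment and the profile rebuilt, which is the refinement step the bullets describe.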
PHI-BLAST
• Pattern Hit Initiated BLAST
• PHI-BLAST expects as input a protein query sequence and a pattern contained in that sequence.
• PHI-BLAST searches the specified database for other protein sequences that also contain the input pattern and have significant similarity to the query sequence in the vicinity of the pattern occurrences.
• PHI-BLAST is integrated with Position-Specific Iterated BLAST (PSI-BLAST), so that the results of a PHI-BLAST query can be used to initiate one or more rounds of PSI-BLAST searching.
• By filling in the "regular expression" box on the PSI-BLAST page, you can execute a PHI-BLAST search.
• PHI-BLAST enforces the presence of a motif in addition to the usual PSI-BLAST criteria for matching.
• An example of a regular expression is W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P. This means a W followed by 9 to 11 of anything, followed by one of the residues V, F, or Y, etc.
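The pattern syntax in the example above maps closely onto ordinary regular expressions. The Python sketch below converts such a pattern into a Python `re` expression; it handles only the elements that appear in the example (single residues, `x`, bracketed sets, and repeat counts), and the function name `pattern_to_regex` is a hypothetical helper, not part of PHI-BLAST.

```python
import re

# One pattern element: x, a residue letter, or a [set], optionally
# followed by a repeat count such as (2) or (9,11).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def pattern_to_regex(pattern: str) -> str:
    """Convert a dash-separated residue pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        token = "." if residue == "x" else residue  # x means any residue
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


if __name__ == "__main__":
    example = "W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P"
    print(pattern_to_regex(example))
```

The resulting string can be passed to `re.search` to find where the motif occurs in a protein sequence, which is only the pattern-matching half of what PHI-BLAST does; the similarity search around each hit is not shown here.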
BLAST Assignment
• http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml
• After reading the tutorial, go to basic BLAST, input a sequence, and run BLAST.
• Go to the advanced BLAST page and use the same input sequence. Change the parameters and see whether the output changes.
• Go to the PSI-BLAST tutorial page, follow the tutorial, and proceed to a PHI-BLAST search.