Lecture 4: Practical use of sequence alignment methods and introduction of projects

CZ5225 Methods in Computational Biology

Sequence Alignment Methods
• Pairwise alignment: best-matching
  • Global alignment
  • Local alignment
• Multiple alignment
• Software
  • FASTA
  • Clustal
  • BLAST (Basic Local Alignment Search Tool)
  • PSI-BLAST (detecting remote homologues)
  • HMM-based methods (detecting remote homologues)

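Pairwise Alignment Algorithms
• Needleman-Wunsch
  • Global alignment only.
• Smith-Waterman
  • Local alignment.
• Both rely on a substitution matrix (BLOSUM, PAM, etc.) and a gap-scoring scheme (affine gaps with separate gap-opening and gap-extension penalties, etc.).
• Dynamic programming is fairly demanding of time and memory resources, which is why faster heuristic tools such as FASTA and BLAST are used for database searches.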
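The dynamic programming behind both algorithms can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not from the lecture: a flat match/mismatch score stands in for a BLOSUM or PAM matrix, and a linear gap penalty stands in for an affine one. The function names and default scores are hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (simplified: flat scores, linear gap penalty)."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against nothing
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score: same recurrence, but cells never drop below 0."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Both functions fill a table of size len(a) × len(b), which is the time cost the slide refers to. Keeping only two rows, as here, saves memory for the score but is not enough to recover the alignment itself.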

Multiple Sequence Alignment
• FASTA: superseded by BLAST
• BLAST: emphasizes the balance between speed and sensitivity
• PSI-BLAST: profile alignments, remote homology identification
• HMM: profile alignments, remote homology identification
• Clustal: profile alignments

BLAST Programs
• There are five different BLAST programs, distinguished by the type of the query sequence (DNA or protein) and the type of the subject database:
  • BLASTP compares an amino acid query sequence against a protein sequence database.
  • BLASTN compares a nucleotide query sequence against a nucleotide sequence database.
  • BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
  • TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
  • TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
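To make "six-frame conceptual translation" concrete, here is a minimal sketch of what BLASTX, TBLASTN and TBLASTX conceptually do to a nucleotide sequence before comparing it at the protein level. It assumes the standard genetic code and plain A/C/G/T input. The function names and the "+1" … "-3" frame labels are illustrative choices, not BLAST's own code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate three reading frames on each strand."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Stop codons are kept as "*" so that open reading frames remain visible in each frame.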

Practical Use of BLAST
1. Database curation (data collection).
2. The sequence data transformation.
3. formatdb: indexing the sequence database for BLAST.
4. Do BLAST against the designed database.
5. Identify the homologues from the BLAST results.
6. Score the BLAST hits according to their E-value and their drug susceptibility.

Preparation: Get the BLAST package
• Why do we need a local version?
• Where to get the software package?
  • http://www.ncbi.nlm.nih.gov/blast/
• Tree structure after unpacking:

2. The sequence data transformation.
• Convert any sequence format to FASTA format: each record starts with a greater-than symbol (">") followed by the description line, and the sequence follows on the next lines.

>Example1 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLE
>Example2 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFN
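The FASTA layout described above is simple enough to read and write without a library. The following is a minimal sketch. The function names and the 60-residue line width (chosen to match the example records) are assumptions for illustration, and real pipelines often use Biopython instead.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(description, sequence, width=60):
    """Write one record: '>' + description, then the sequence wrapped at `width`."""
    lines = [f">{description}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"
```

Converting from another format then reduces to extracting a description and a sequence string from the source records and passing them to `format_fasta`.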

3. formatdb: indexing the sequence database for BLAST
• formatdb -i ecoli.nt -p F -o T
• -i  Input file(s) for formatting [File In] Optional
• -p  Type of file [T/F] Optional, default = T
  • T - protein
  • F - nucleotide
• -s  Create indexes limited only to accessions - sparse [T/F] Optional, default = F
• -V  Verbose: check for non-unique string ids in the database [T/F] Optional, default = F
• -o  Parse options [T/F] Optional, default = F
  • T - True: Parse SeqId and create indexes.
  • F - False: Do not parse SeqId. Do not create indexes.
• -F  Gifile (file containing list of gi's) [File In] Optional
• … …
• formatdb.exe -i ourOwnDatabase -p T -o T

4. Do BLAST against the designed database.
• blastall arguments:
  • -p  Program Name [String]
  • -d  Database [String], default = nr
  • -i  Query File [File In], default = stdin
  • -e  Expectation value (E) [Real], default = 10.0
  • -v  Number of database sequences to show one-line descriptions for, default = 500
  • -b  Number of database sequences to show alignments for, default = 250

4. Do BLAST against the designed database.
• EXAMPLE:
  • blastall -p blastp -d db/swissprot -i Q9Y5N1.txt -o Q9Y5N1.out
  • blastall -p blastp -d db/swissprot -e 1 -i Q9Y5N1.txt -o Q9Y5N1.out

Q9Y5N1.txt:

>newSP|Q9Y5N1|HH3R_HUMAN Histamine H3 receptor (HH3R) (G protein-coupled receptor 97)
MERAPPDGPLNASGALAGEAAAAGGARGFSAAWTAVLAALMALLIVATVLGNALVMLAFV
ADSSLRTQNNFFLLNLAISDFLVGAFCIPLYVPYVLTGRWTFGRGLCKLWLVVDYLLCTS
SAFNIVLISYDRFLSVTRAVSYRAQQGDTRRAVRKMLLVWVLAFLLYGPAILSWEYLSGG
SSIPEGHCYAEFFYNWYFLITASTLEFFTPFLSVTFFNLSIYLNIQRRTRLRLDGAREAA
GPEPPPEAQPSPPPPPGCWGCWQKGHGEAMPLHRYGVGEAAVGAEAGEATLGGGGGGGSV
ASPTSSSGSSSRGTERPRSLKRGSKPSASSASLEKRMKMVSQSFTQRFRLSRDRKVAKSL
AVIVSIFGLCWAPYTLLMIIRAACHGHCVPDYWYETSFWLLWANSAVNPVLYPLCHHSFR
RAFTKLLCPQKLKIQPHSSLEHCWK
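Steps 3 and 4 can be driven from a script instead of being typed by hand. The sketch below only assembles the command lines using the flags listed on the slides (`-i`, `-p`, `-o` for formatdb; `-p`, `-d`, `-i`, `-o`, `-e` for blastall) and runs them if the executables are on the PATH. The helper names are hypothetical, and it assumes the legacy `formatdb`/`blastall` binaries from the package above are installed.

```python
import shutil
import subprocess


def formatdb_command(fasta_path, protein=True, parse_seqids=True):
    """Build the formatdb argument list (flags as listed on the slide)."""
    return ["formatdb", "-i", fasta_path,
            "-p", "T" if protein else "F",
            "-o", "T" if parse_seqids else "F"]


def blastall_command(program, database, query_path, output_path, evalue=None):
    """Build the blastall argument list; -e is added only when given."""
    cmd = ["blastall", "-p", program, "-d", database,
           "-i", query_path, "-o", output_path]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    return cmd


def run_if_available(cmd):
    """Run the command if its executable is on PATH; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True)
```

For example, `run_if_available(blastall_command("blastp", "db/swissprot", "Q9Y5N1.txt", "Q9Y5N1.out", evalue=1))` corresponds to the second example command. Passing the arguments as a list avoids shell quoting problems with file names.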

4. Do BLAST against the designed database.
Q9Y5N1.out:

Query= newSP|Q9Y5N1|HH3R_HUMAN Histamine H3 receptor (HH3R) (G protein-coupled receptor 97)
         (445 letters)

Database: swissprot
           172,892 sequences; 63,586,428 total letters

                                                                     Score    E
Sequences producing significant alignments:                          (bits) Value

sp|Q9Y5N1|HRH3_HUMAN Histamine H3 receptor (HH3R) (G-protein cou...   668   0.0
…………………..
sp|P18871|ADA2A_PIG Alpha-2A adrenergic receptor (Alpha-2A adren...   105   2e-022
sp|Q9N2B2|HRH1_PANTR Histamine H1 receptor                            105   2e-022
…………………………………..
  Database: swissprot
    Posted date:  Jul 8, 2005  9:35 PM
  Number of letters in database: 63,586,428
  Number of sequences in database:  172892
…………………………………
Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 40,317,827
Number of Sequences: 172892
…………………………………………….

Parsing and interpreting the results
• BioJava
• BioPerl
• BioRuby
• Biopython
• Or your own code. Why?
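As an illustration of the "your own code" option, the sketch below pulls the one-line hit summaries (identifier, bit score, E-value) out of a plain-text blastall report like Q9Y5N1.out and filters them by E-value. It is a hedged example, not a full parser. It assumes that hit identifiers contain "|" as in the Swiss-Prot example, that the last two columns are the score and the E-value, and that legacy BLAST may print E-values such as "e-100" without a leading digit. The names `parse_hits` and `significant` are hypothetical.

```python
import re

# identifier, description, bit score, E-value (last two whitespace-separated columns)
HIT_LINE = re.compile(r"^(\S+)\s+(.*\S)\s+(\d+(?:\.\d+)?)\s+(\S+)\s*$")


def parse_evalue(text):
    """Convert an E-value string to float; 'e-100' is read as 1e-100."""
    if text.startswith("e"):
        text = "1" + text
    return float(text)


def parse_hits(report_text):
    """Return a list of dicts for the one-line hit summaries in a report."""
    hits = []
    for line in report_text.splitlines():
        m = HIT_LINE.match(line.strip())
        if not m or "|" not in m.group(1):
            continue  # not a hit summary line (assumes db|accession|name ids)
        try:
            evalue = parse_evalue(m.group(4))
        except ValueError:
            continue
        hits.append({"id": m.group(1), "description": m.group(2),
                     "bits": float(m.group(3)), "evalue": evalue})
    return hits


def significant(hits, max_evalue=1e-5):
    """Keep hits whose E-value is at or below the chosen threshold."""
    return [h for h in hits if h["evalue"] <= max_evalue]
```

Writing such a parser by hand gives full control over thresholds and output, at the cost of being tied to one report layout. The Bio* libraries trade that control for robustness across BLAST versions.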

Workflow for Handling Batched BLAST Queries: Shell Programming
• Prepare the jobs and put them into the queue
• Handle each individual request
• Analyze/output the result after each job request
• Remaining: collect the results and finalize the report
• Possible languages: Basic/Bash/C/C++/C#/Java/Python/Perl/R/Ruby/TCL
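The four workflow steps can be sketched as a small queue-driven loop. This is an assumed skeleton, not the lecture's own script: `run_one` and `analyze` are placeholders supplied by the caller (for instance, a function that invokes blastall on one query file and a function that parses its output), and the name `run_batch` is hypothetical.

```python
from collections import deque


def run_batch(queries, run_one, analyze):
    """Process queued queries one by one and collect a final report.

    queries : iterable of job identifiers (e.g. query file names)
    run_one : callable that executes one job and returns its raw result
    analyze : callable that turns a raw result into a report entry
    """
    queue = deque(queries)          # 1. prepare and queue the jobs
    report, failures = [], []
    while queue:
        name = queue.popleft()      # 2. handle an individual request
        try:
            raw = run_one(name)
        except Exception as exc:    # keep going if a single job fails
            failures.append((name, str(exc)))
            continue
        report.append((name, analyze(raw)))  # 3. analyze after each job
    return report, failures         # 4. collected results for the final report
```

Recording failures instead of aborting means one malformed query does not lose the results of a long batch.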

Another way to BLAST like a robot
• BLAST URL API (from NCBI): http://www.ncbi.nlm.nih.gov/blast/Blast.cgi

A Sample URL
• http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Put&PROGRAM=blastn&DATABASE=nr&FILTER=L&QUERY=AF123456

Parameter  Value     Meaning
CMD        Put       submit a query
PROGRAM    blastn    run BLASTn
DATABASE   nr        search against nr
FILTER     L         turn low complexity filtering on
QUERY      AF123456  accession, GI, or FASTA

An interim update to BLAST URLAPI, still being reviewed, is at: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/node_0.html
Quoted from: NCBI, Programming with BLAST
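The sample URL can be assembled programmatically rather than by string concatenation. The sketch below uses only the parameters shown on the slide (CMD, PROGRAM, DATABASE, FILTER, QUERY) and does not contact NCBI. The function name and its defaults are illustrative assumptions.

```python
from urllib.parse import urlencode

# Endpoint as quoted on the slide; it may have moved since the lecture.
BLAST_CGI = "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi"


def build_put_url(query, program="blastn", database="nr", low_complexity_filter=True):
    """Build a CMD=Put submission URL from the slide's parameters."""
    params = {"CMD": "Put", "PROGRAM": program, "DATABASE": database}
    if low_complexity_filter:
        params["FILTER"] = "L"
    params["QUERY"] = query
    # urlencode escapes characters that are not URL-safe (useful for FASTA queries).
    return f"{BLAST_CGI}?{urlencode(params)}"
```

Calling `build_put_url("AF123456")` reproduces the sample URL above.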

NCBI BLAST Server
[Figure: architecture of the NCBI BLAST server. Components shown: End Users, Blast.cgi, Formatter, Database server (mssql, with a replicated backup mssql), splitd, Merger daemon, and three Intel Pentium Linux 2-way farms. Labels on the connections: Search Request, RID, Result, Alignment, blastalign obj, Database loading if needed.]
• splitd: split the query into chunks for distributed computing on multiple available CPUs.
• Merger daemon: finished chunks are merged to generate the final blastalign object.
Quoted from: NCBI, Programming with BLAST

Posting a URL
• User Agent: $ua = LWP::UserAgent->new
• HTTP Request: $req = new HTTP::Request POST
• HTTP Response (from NCBI): $response = $ua->request($req)
Quoted from: NCBI, Programming with BLAST
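The same user agent / request / response pattern can be written with Python's standard library, which may be more convenient when the rest of the pipeline is not in Perl. This is a rough equivalent of the Perl fragments above, not a translation of NCBI's sample script. The function names are hypothetical, the endpoint is the one quoted on the slides, and `send` is left for the reader to call because it needs network access.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint as quoted on the slides; it may have moved since the lecture.
BLAST_CGI = "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi"


def build_post_request(params, url=BLAST_CGI):
    """Counterpart of HTTP::Request POST: form-encode params into the body."""
    body = urlencode(params).encode("ascii")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/x-www-form-urlencoded"})


def send(request, timeout=30):
    """Counterpart of $ua->request($req): perform the request, return the text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

A POST body is preferable to a long query string when the query is a full FASTA sequence rather than an accession.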

Introduction of projects
• Drug-resistant mutation data collection and database development
• Development of a scoring matrix from sequence variations and their drug susceptibility data
• Prediction of drug-resistant mutations
