170 likes | 405 Views
Lab7. QRNA, HMMER, PFAM. Sean Eddy’s Lab. http://selab.janelia.org/software.html. HMMER. Introduction.
E N D
Lab7 QRNA, HMMER, PFAM
Sean Eddy’s Lab • http://selab.janelia.org/software.html
Introduction • HMMER2 is an implementation in UNIX (Linux, MacOS) platform of profile hidden Markov model, whose source code, executables, and user guide can be downloaded from http://hmmer.janelia.org/ • The experiment of HMMER is to look for known domains in a query sequence by searching a single sequence again a library of HMMs. • One such library is PFAM, and you can also create your own library using HMMER
HMMER executables • hmmalign ‐ Align sequences to an existing model. • hmmbuild‐ Build a model from a multiple sequence alignment. • hmmcalibrate‐ Takes an HMM and empirically determines parameters that are used to make searches more sensitive, by calculating more accurate expectation value scores (E‐values). • hmmconvert ‐ Convert a model file into different formats, including a compact HMMER 2 binary format, and “best effort” emulation of GCG profiles. • hmmemit ‐ Emit sequences probabilistically from a profile HMM. • hmmfetch ‐ Get a single model from an HMM database. • hmmindex ‐ Index an HMM database. • hmmpfam ‐ Search an HMM database for matches to a query sequence. • hmmsearch‐ Search a sequence database for matches to an HMM.
Installation • Simple installation • Download the current version of HMMER “hmmer‐2.3.2.bin.intel‐linux.tar.gz” from http://hmmer.janelia.org/#download; • Unpack the software by typing “tar –xvf hmmer‐2.3.2.bin.intel‐linux.tar.gz” in the command line. You will see a new directory “hmmer‐2.3.2.bin.intel‐linux”. • Enter the directory of hmmer‐2.3.2.bin.intel‐linux. You will see NINE executables ready in the subdirectory “/binaries”, and also nine files in the subdirectory “/tutorial”; • Installation from source code • Download the current HMMER source code version “hmmer‐2.3.2.tar.gz” from http://hmmer.janelia.org/#download; • Create a new directory in your Linux account and upload or move the software package to the directory; • Unpack the software by typing “tar –xvf hmmer‐2.3.2.tar.gz” in the command line; • Type “cd hmmer‐2.3.2” to enter the software directory; • Type “./configure” to configure for your system and build the programs; • Type “make” to generate the executables; • Type “make check” to run the automated test suite; (This is optional but recommended, and all these tests should pass); • Please note that by default programs are in “/usr/local/bin/” and man pages are in “/usr/local/man/man1”; • Type “make install” to install all executables;
Hmmbuild : build a profile HMM from an aignment • hmmbuild[options]hmmfilealignfile • hmmbuild test.hmm test.aln • hmmbuild -h • hmmbuild reads a multiple sequence alignment file alignfile , builds a new profile HMM, and saves the HMM in hmmfile. • alignfile may be in ClustalW, GCG MSF, or SELEX alignment format. • By default, the model is configured to find one or more non-overlapping alignments to the complete model. • To configure the model for a single global alignment, use the -g option; • To configure the model for multiple local alignments, use the -f option; • To configure the model for a single local alignment (standard Smith/Waterman), use the -s option.
Hmmcalibrate: calibrate HMM search statistics • hmmcalibrate[options] hmmfile • hmmcalibrate test.hmm • Hmmcalibrate -h • hmmcalibrate reads an HMM file from hmmfile, scores a large number of synthesized random sequences with it, fits an extreme value distribution (EVD) to the histogram of those scores, and re-saves hmmfile now including the EVD parameters. • This step is optional, but it will increase the sensitivity of your database search • hmmcalibrate may take several minutes (or longer) to run. While it is running, a temporary file called hmmfile.xxx is generated in your working directory. • If you abort hmmcalibrate prematurely (ctrl-C, for instance), your original hmmfile will be untouched, and you should delete the hmmfile.xxx temporary file.
Hmmsearch: - search a sequence database with a profile HMM • hmmsearch[options]hmmfileseqfile • hmmsearch test.hmm query.faa > query.faa.domain • hmmsearch -h • hmmsearch reads an HMM from hmmfile and searches seqfile for significantly similar sequence matches. • hmmsearch may take minutes or even hours to run, depending on the size of the sequence database. It is a good idea to redirect the output to a file. • The output consists of four sections: • a ranked list of the best scoring sequences, • a ranked list of the best scoring domains, • alignments for all the best scoring domains, and • a histogram of the scores. • A sequence score may be higher than a domain score for the same sequence if there is more than one domain in the sequence; the sequence score takes into account all the domains. All sequences scoring above the -E and -T cutoffs are shown in the first list, then every domain found in this list is shown in the second list of domain hits. If desired, E-value and bit score thresholds may also be applied to the domain list using the -domE and -domT options.
Pfam 23.0 (July 2008, 10340 families) • The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). • Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. • The identification of domains that occur within proteins can therefore provide insights into their function. • There are two components to Pfam: Pfam-A and Pfam-B. • Pfam-A entries are high quality, manually curated families. • Although these Pfam-A entries cover a large proportion of the sequences in the underlying sequence database, in order to give a more comprehensive coverage of known proteins we also generate a supplement using the ADDA database. These automatically generated entries are called Pfam-B. • Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. • Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM. (see Pfam-C)
Sequence analysis with HMM • ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases/Pfam23.0/ to download files “Pfam_fs.gz” and “Pfam_ls.gz” • Pfam_ls - All global (ls mode) Pfam-A HMMs in an HMM library searchable with the hmmpfam program. • Pfam_fs - All local (fs mode) Pfam-A HMMs in an HMM library searchable with the hmmpfam program. • Data location • /home/kwchoi/public_html/I529-09-lab/Lab7/Data/PFAM_data/ • Copy to your working directory or make symbolic link • To search for domains in “test.faa” in the global sequence database, type • hmmpfam Pfam_fs test.faa > test.faa.pfam” • The results is logged into an output file “test.faa.pfam”;
QRNA • QRNA is a prototype structural noncoding RNA genefinder tools for detecting novel structural RNA genes. • It uses three probabilistic "pair-grammars": • a pair stochastic context free grammar modeling alignments constrained by structural RNA evolution, • a pair hidden Markov model modeling alignments constrained by coding sequence evolution, and • a pair hidden Markov model modeling a null hypothesis of position-independent evolution. • Given an input pairwise sequence alignment (e.g. from a BLASTN comparison of two related genomes), it classify the alignment into the coding, RNA, or null class according to the posterior probability of each class.
Local Installation • The latest version of QRNA (qrna-2.0.3c.tar.gz ) can be download from ftp://selab.janelia.org/pub/software/qrna/ • Configure QRNA and install • tar -xvf qrna-2.0.3c.tar • cd qrna-2.0.3c • cd squid • make • cd ../squid02 • make • cd ../src • make • QRNA is installed in • /home/kwchoi/Installed/qrna-2.0.3c/
Data location: • /home/kwchoi/public_html/I529-09-lab/Lab7/Data/QRNA_data/ • blastn2qrnadepth.pl • /home/kwchoi/Installed/qrna-2.0.3c/src/scripts/blastn2qrnadepth.pl -g human HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast • HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q • HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q.gff • HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q.rep • Simple test • Set the running enveriment and run a simple example: • export QRNADB=/home/kwchoi/Installed/qrna-2.0.3c/lib • /home/kwchoi/Installed/qrna-2.0.3c/src/eqrna -a 5s_rRNA.q > 5s_rRNA.q.eqrna
QRNA demo • Option -C shuffles the columns of the pairwise alignment while maintaining the gap and conserved structure of the original alignment. Compare the two results of using -C and without using -C: • /home/kwchoi/Installed/qrna-2.0.3c/src/eqrna -a -C 5s_rRNA.q > 5s_rRNA.q.con_shuffle.eqrna • Example using the scanning version with a window. Consider file “Scerevisiae orf v other yeasts.q” which contains an alignment of a S. cerevisiae ORF The alignment has 514 nucleotides, and we would like to score it with eqrna using a window of 150 nucleotides, and moving the window 50 nucleotides each time: • /home/kwchoi/Installed/qrna-2.0.3c/src/eqrna -w 150 -x 50 Scerevisiae_orf_v_other_yeasts.q > Scerevisiae_orf_v_other_yeasts.q.w150.x50.eqrna • example start with a blastn output: • /home/kwchoi/Installed/qrna-2.0.3c/src/eqrn -w 50 -x 50 HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q.W150.X50.eqrna • The results files are shown as following: • 5s_rRNA.q.eqrna • 5s_rRNA.q.con_shuffle.eqrna • Scerevisiae_orf_v_other_yeasts.q.w150.x50.eqrna