Introduction to the GCG Wisconsin Package The Center for Bioinformatics UNC at Chapel Hill Jianping (JP) Jin Ph.D. Bioinformatics Scientist Phone: (919)843-6105 E-mail: jjin@email.unc.edu Fax: (919)843-3103
What is GCG • An integrated package of over 130 programs (the GCG Wisconsin Package). • For extensive analyses of nucleic acid and protein sequences. • Associated with most major public nucleic acid and protein databases. • Works on UNIX OS.
Why use GCG • Removes the need for end users to constantly collect new software. • Removes the need to learn a new interface each time new software is released. • Provides a flow of analyses within a single interface. • The UNIX environment allows users to automate complex, repetitive tasks. • Allows users to use multiple processors to accelerate their jobs. • Supports almost all public databases, which can be updated daily, and provides fast local searches.
Flexibility or Automation • 1. MEME: upstream regulatory motifs; • 2. MotifSearch: genes sharing these potential regulatory motifs; • 3. PileUp: multiple sequence alignment; • 4. Distances: extract pairwise distances from the alignment; • 5. GrowTree: a phylogenetic tree.
Interfaces • Command Line: running programs from the UNIX system prompt. • SeqLab: graphical user interface, requiring an X Windows display. • SeqWeb: web browser access to a core set of sequence analysis programs.
Limitations with GCG • The GUI does not give users full access to the power of the command line, nor to the complete set of programs. • Many programs place a limit on the maximum size of the sequences that they can handle (350 Kb). This limitation will be removed in version 11.
Databases GCG Supports • Nucleic acid databases • GenBank • EMBL (abridged) • Protein databases • NRL_3D • UniProt (SWISS-PROT, PIR, TrEMBL) • PROSITE, Pfam • Restriction Enzymes (REBASE)
Database Update Services • DataServe: automatically updates nucleic acid data on a daily basis via FTP. • DataExtended: the most complete set of nucleic acid and protein data. The timing of the release is coordinated with the major GenBank release, every 2-3 months. • DataBasic: similar to DataExtended, but excludes EST and GSS data from GenBank and EMBL.
File Importing and Exporting • Reformat • FromEMBL • FromGenBank • FromPIR ToPIR • FromStaden ToStaden • FromIG ToIG • FromFastA ToFastA
File Formats with GCG • Single sequence files (in GCG format) • List (a list of files) • MSF (multiple sequence format) • RSF (rich sequence format)
GCG Programs • 1. Comparison • 2. Database Searching and Retrieval • 3. DNA/RNA Secondary Structure • 4. Editing and Publication • 5. Evolution • 6. Fragment Assembly • 7. Importing and exporting • 8. Mapping • 9. Primer Selection • 10. Protein Analysis • 11. Translation
Pairwise Comparison (Gap) • Needleman & Wunsch algorithm. • A global alignment covering the whole length of both sequences; the resulting aligned sequences are of the same length, with gaps inserted. • Good when two sequences are closely related.
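The global alignment idea behind Gap can be sketched in a few lines of Python. This is only an illustration of the Needleman & Wunsch dynamic programming scheme, not GCG's implementation: the function name `global_align`, the linear gap penalty, and the match/mismatch/gap scores are assumptions chosen for readability (Gap itself uses scoring matrices and separate gap creation and extension penalties).

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch sketch with a linear gap penalty (illustrative scores)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = global_align("ACGT", "AGT")
    print(total)
    print(top)
    print(bottom)
```

Both returned strings always have the same length, which is the property the slide describes: the whole of each sequence is covered, and gaps pad out the shorter one.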
Pairwise Comparison (BestFit) • Algorithm of Smith and Waterman. • Local homology alignment that finds the best segment of similarity between two sequences. • The most sensitive sequence comparison method available.
Multiple Comparison (PileUp) • The method of Feng and Doolittle, similar to that of Higgins & Sharp. • A series of progressive pairwise alignments (up to 500 sequences) generates a final alignment. • An extension of Gap, not ideal for finding the best local region of similarity, such as a shared motif.
Database Search • Nearly always employs local alignment algorithms. • Often uses “heuristic” methods (as a screen), such as FASTA and BLAST. • Sequences that pass the screen are given a correct local similarity score, but there is no guarantee that all sequences with high Smith-Waterman scores pass through the screen.
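For contrast with a global alignment, here is a minimal sketch of the Smith and Waterman local scoring scheme. It is not BestFit's code: the function name `local_score` and the match/mismatch/gap values are assumptions for illustration. The key difference from the global case is that a cell may never drop below zero, so an alignment can start and end anywhere.

```python
def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman sketch: best local alignment score (illustrative scores)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a new local alignment start at any cell.
            curr[j] = max(0,
                          prev[j - 1] + sub,   # extend diagonally
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared segment GATTACA is found despite unrelated flanking sequence.
    print(local_score("TTTGATTACATTT", "CCCGATTACACCC"))
```

With these scores the flanking regions contribute nothing, so only the shared segment determines the result.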
BLAST • Accepts a number of sequences as input and lets you specify any number of databases, e.g. $Blast -INfile2=PIR,SWPLUS; -INfile=hsp70.msf{*}. • Supports 5 BLAST programs, but no gapped alignment is available for TBLASTX. • For non-coding nucleotide homology searches, consider either reducing the word size from 11 to 6 or 7, or using FASTA. • The number of scoring matrices is limited: BLOSUM62/45/80 and PAM70 are available for the -MATRix parameter.
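Why a smaller word size helps with diverged, non-coding sequences can be seen with a toy example. The sketch below only counts exact words shared by two sequences; it is not how BLAST is implemented (real BLAST also extends and scores the hits), and the function name `shared_words` and the two short sequences are made up for illustration.

```python
def shared_words(query, subject, word_size):
    """Return the exact words of length word_size found in both sequences."""
    def words(seq):
        # All overlapping substrings of the requested length.
        return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}
    return words(query) & words(subject)


if __name__ == "__main__":
    query = "ATGCGTACGTTAGC"
    subject = "ATGCGTACCTTAGC"  # one substitution in the middle
    for w in (11, 6):
        print(w, sorted(shared_words(query, subject, w)))
```

Here a single substitution destroys every shared 11-letter word, so a word size of 11 gives no seed at all, while a word size of 6 still finds shared words on one side of the mismatch.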
Database Search (SSearch) • A rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type. • The most sensitive method available for similarity search. • Very slow.
HmmerSearch • Uses a profile HMM as a query to search a sequence database. • Profile HMM: a position-specific scoring table, a statistical model of the consensus of a multiple sequence alignment. • Output can be used by any GCG program that accepts a list file.
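The "position-specific scoring table" idea can be illustrated with a much simpler model than a real profile HMM. The sketch below builds per-column log-odds scores from an ungapped alignment and slides them along a sequence. It is an assumption-laden simplification: it has no insert or delete states, uses a uniform background and a fixed pseudocount, and the names `build_profile`, `score_window`, and `best_hit` are hypothetical, not part of HMMER or GCG.

```python
import math


def build_profile(alignment, alphabet="ACGT", pseudocount=1.0):
    """Per-column log2-odds scores versus a uniform background (ungapped input)."""
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            sym: math.log2((column.count(sym) + pseudocount) / total / background)
            for sym in alphabet
        })
    return profile


def score_window(profile, window):
    """Sum the column scores for one window of the same width as the profile."""
    return sum(col[sym] for col, sym in zip(profile, window))


def best_hit(profile, sequence):
    """Return (score, start) of the best-scoring window, or None if too short."""
    width = len(profile)
    hits = [(score_window(profile, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits) if hits else None


if __name__ == "__main__":
    aligned = ["TATAAT", "TATGAT", "TACAAT", "TATATT"]
    profile = build_profile(aligned)
    print(best_hit(profile, "GGCCTATAATGGCC"))
```

Each column scores residues by how often they appear at that position in the alignment, which is why conserved positions contribute most to a hit. A real profile HMM adds position-specific insertion and deletion probabilities on top of this.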