Phred/Phrap/Consed Analysis: A User's View
International Training Course on Bioinformatics Applied to Genomic Studies, Rio de Janeiro, 2001
Arthur Gruber
Faculty of Veterinary Medicine and Zootechny, University of São Paulo, Brazil
What is Phred/Phrap/Consed?
Phred/Phrap/Consed is a widely distributed software package for:
a. Reading trace files (chromatograms);
b. Assigning a quality (confidence) value to each individual base;
c. Identifying and masking vector and repeat sequences;
d. Sequence assembly;
e. Assembly visualization and editing;
f. Automatic finishing.
Why assemble?
• Current DNA sequencing methods generate reads of 500-700 bp, a limit set by the resolution of electrophoresis.
• Whole genomes or large clones therefore need to be fragmented and cloned into a library.
• The short fragments are sequenced at random (shotgun approach), and the reads are assembled to form the final consensus sequence.
How do we deal with the enormous number of reads generated by high-throughput DNA sequencers?
[Image: Sanger Centre]
Phred
Phred is a program that performs several tasks:
a. Reads trace files: compatible with most file formats, including SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases: assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.
c. Assigns quality values to the bases: a "Phred value" based on an error rate estimate calculated for each individual base.
d. Creates output files: base calls and quality values are written to output files.
Phred value formula
$q = -10 \log_{10}(p)$
where
q: quality value
p: estimated error probability for a base call
Examples:
q = 20 means $p = 10^{-2}$ (1 error in 100 bases)
q = 40 means $p = 10^{-4}$ (1 error in 10,000 bases)
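The formula above can be checked with a few lines of Python. This is a minimal sketch, not part of Phred itself; the function names `error_prob_to_phred` and `phred_to_error_prob` are made up for illustration.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Convert an estimated error probability p into a quality value q."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Convert a quality value q back into an error probability p."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Reproduce the two examples from the slide (q = 20 and q = 40).
    for q in (20, 40):
        print(f"q = {q}: p = {phred_to_error_prob(q):g}")
```

Phred itself reports integer quality values, so a real pipeline would round the result of `error_prob_to_phred`.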
The structure of a phd file
Each line in the DNA block holds the called base, its quality value and the peak position in the trace. Lines reading "..." mark parts of the file not shown on the slide.

BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
...
t 24 2221
a 24 2232
a 22 2245
a 27 2261
g 25 2272
c 19 2286
c 12 2302
t 19 2314
g 12 2324
g 15 2331
g 19 2346
g 23 2363
t 33 2378
g 36 2390
c 44 2404
c 44 2419
t 39 2433
a 39 2446
a 34 2460
t 35 2470
g 34 2482
...
t 16 8191
g 19 8200
t 13 8211
c 13 8229
g 4 8241
n 4 8253
c 4 8263
t 10 8276
t 9 8286
c 12 8301
t 16 8313
c 12 8329
c 12 8336
c 15 8343
t 19 8356
c 9 8371
g 13 8386
g 14 8397
a 7 8417
g 9 8427
g 4 8445
...
t 6 11908
a 6 11921
g 6 11927
t 6 11947
c 6 11953
a 6 11964
g 6 11981
c 4 11994
n 4 12015
c 4 12037
n 4 12044
n 4 12058
n 4 12071
n 4 12085
n 4 12098
n 4 12111
n 4 12124
c 4 12144
n 4 12151
END_DNA
END_SEQUENCE
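A phd file with the layout shown above can be read with a short parser. The sketch below is an illustration only and assumes the simple structure visible on the slide (one sequence per file, `key: value` comment lines, and `base quality position` triplets in the DNA block). The function name `parse_phd` and the shortened `SAMPLE` text are hypothetical.

```python
from typing import Dict, List, Tuple

# Shortened sample based on the file shown on the slide.
SAMPLE = """BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
CALL_METHOD: phred
TIME: Thu May 24 00:18:58 2001
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
END_DNA
END_SEQUENCE
"""


def parse_phd(text: str) -> Tuple[str, Dict[str, str], List[Tuple[str, int, int]]]:
    """Return (read name, comment fields, list of (base, quality, position))."""
    name = ""
    comments: Dict[str, str] = {}
    calls: List[Tuple[str, int, int]] = []
    section = None  # tracks whether we are inside the comment or DNA block
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("BEGIN_SEQUENCE"):
            parts = line.split(None, 1)
            name = parts[1] if len(parts) > 1 else ""
        elif line == "BEGIN_COMMENT":
            section = "comment"
        elif line == "BEGIN_DNA":
            section = "dna"
        elif line in ("END_COMMENT", "END_DNA"):
            section = None
        elif line == "END_SEQUENCE":
            break
        elif section == "comment":
            # Split on the first colon only, so time stamps stay intact.
            key, _, value = line.partition(":")
            comments[key.strip()] = value.strip()
        elif section == "dna":
            base, qual, pos = line.split()
            calls.append((base, int(qual), int(pos)))
    return name, comments, calls


if __name__ == "__main__":
    read_name, fields, base_calls = parse_phd(SAMPLE)
    print(read_name, fields["CALL_METHOD"], len(base_calls))
```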
Phrap
Phragment Assembly Program, or... Phil's Revised Assembly Program
Phrap is a program for assembling shotgun DNA sequence data.
Key features:
a. Uses the entire read content: no need for trimming.
b. Combines user-supplied data (e.g. Repbase) with internally computed data: better assembly accuracy in the presence of repeats.
c. The contig sequence is a mosaic of the highest-quality parts of the reads: it's not a consensus!
d. Provides extensive information about the assembly, contained in the phrap.out, *.ace and *.screen.contigs.qual files.
e. Handles very large datasets: hundreds of thousands of reads are easily handled.
f. Generates output files that contain important data and enable visualization by other programs.
Phrap output files
• *.contigs: FASTA file containing the contigs
   • Contigs with more than one read
   • Singletons (single reads that match some other contig but could not be merged consistently with it)
• *.singlets: FASTA file of the singlet reads
   • Reads with no match to any other read
• *.ace: allows viewing the assembly with Consed
• *.view: required for viewing the assembly with Phrapview
Consed
Consed is a program for viewing and editing assemblies produced by Phrap.
Key features:
a. Assembly viewer: shows the contigs, the assembly (aligned reads), the quality values of the reads and the final sequence.
b. Trace file viewer: single or multiple trace files can be displayed, allowing a given sequence to be compared across several reads.
c. Navigation: identifies and lists regions that fall below a given quality threshold, contain high-quality discrepancies, have single-strand coverage, etc.
d. Autofinish: an automatic set of functions for gap closure, improvement of sequence quality, determination of the relative orientation of contigs, and identification of regions covered by a single read or by reads from a single strand. The program automatically picks primers and chooses the templates.
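The idea behind navigating to low-quality regions can be sketched in a few lines. This is not Consed's implementation; it is a simplified illustration that scans a list of per-base quality values and reports the stretches below a threshold. The function name `low_quality_regions` and the default threshold of 20 are assumptions chosen for the example.

```python
from typing import List, Tuple


def low_quality_regions(quals: List[int], threshold: int = 20) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where quality < threshold."""
    regions: List[Tuple[int, int]] = []
    start = None  # index where the current low-quality run began, if any
    for i, q in enumerate(quals):
        if q < threshold:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        # The sequence ended inside a low-quality run.
        regions.append((start, len(quals)))
    return regions


if __name__ == "__main__":
    example = [30, 10, 12, 40, 5]
    print(low_quality_regions(example, threshold=20))
```

Each reported range could then be presented to the user as a position to inspect in the trace viewer.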
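Phred/Phrap/Consed Pipeline
Directories:
• chromat_dir (trace files)
• phd_dir (phd files written by Phred)
• edit_dir (Phrap output files; working directory for Consed)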
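The three-directory layout can be prepared before the first run. The sketch below only creates the directories. It assumes the lowercase names `chromat_dir`, `phd_dir` and `edit_dir` conventionally used by the pipeline scripts, and the function name `make_project_layout` is hypothetical.

```python
from pathlib import Path
from typing import Dict

# Assumed directory names (lowercase, as conventionally used by the pipeline).
SUBDIRS = ("chromat_dir", "phd_dir", "edit_dir")


def make_project_layout(project_root: str) -> Dict[str, Path]:
    """Create the three pipeline directories under project_root and return their paths."""
    root = Path(project_root)
    paths: Dict[str, Path] = {}
    for name in SUBDIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        paths[name] = path
    return paths


if __name__ == "__main__":
    layout = make_project_layout("my_project")
    for name, path in layout.items():
        print(name, "->", path)
```

Trace files would then be copied into `chromat_dir` before running the pipeline from `edit_dir`.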
Finishing Problems
Finishing can be a tedious and difficult task due to:
DNA sequencing problems
a. High GC content: genomes with a high GC content are more prone to artifacts such as compressions, sudden drops and bad-quality regions. Try Dye Primer instead of Dye Terminator, change the chemistry, add DMSO, increase the annealing temperature, use deaza-dGTP instead of dGTP, etc.
b. Palindromic regions: these lead to strong secondary structures that cause sudden drops. Try deaza-dGTP instead of dGTP, or amplify the problematic region by PCR and sequence the product.
c. Homopolymeric regions: these can reduce DNA synthesis efficiency for some chemistries. Try Dye Primer instead of Dye Terminator, or change the chemistry (dRhodamine instead of BigDye).
Finishing Problems
Finishing can be a tedious and difficult task due to:
DNA assembly problems
a. High content of repeats: highly repeated elements reduce the accuracy of DNA assembly. Identify the repeat unit, screen for it with cross_match or RepeatMasker and mask it. Try to assemble again and add the repetitive region only at the end. Map the repetitive region with restriction enzymes to estimate its size and the number of repeat units.
b. High AT content: some highly biased genomes (e.g. Plasmodium falciparum; plastid genomes) can pose a problem for assembly programs. This is very difficult to solve. Try to determine a restriction map and combine the mapping with the DNA sequencing data.
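The masking step suggested for repeats can be illustrated with a toy example. The sketch below is not how cross_match or RepeatMasker work (both rely on sequence alignment and tolerate mismatches). It only replaces exact copies of a known repeat unit with a mask character, and the function name `mask_repeat` and the use of "X" as the mask character are assumptions made for illustration.

```python
def mask_repeat(sequence: str, repeat_unit: str, mask_char: str = "X") -> str:
    """Replace every exact (case-insensitive) copy of repeat_unit with mask_char."""
    if not repeat_unit:
        raise ValueError("repeat_unit must not be empty")
    seq_upper = sequence.upper()
    unit = repeat_unit.upper()
    masked = list(sequence)
    start = seq_upper.find(unit)
    while start != -1:
        for i in range(start, start + len(unit)):
            masked[i] = mask_char
        # Advance by one base so overlapping copies are also masked.
        start = seq_upper.find(unit, start + 1)
    return "".join(masked)


if __name__ == "__main__":
    read = "ACGTTAGGGTTAGGGACGT"
    print(mask_repeat(read, "TTAGGG"))
```

Masked positions keep the read length unchanged, so coordinates in the masked read still correspond to the original trace.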