280 likes | 534 Views
Sequencing Data Quality. Saulo Aflitos. Assembly - Concepts. Read (≈100bp). Contig (≈2Kbp). Paired-End Mate-Pair. Scaffold (≈ 2Mbp). Pseudo Molecule (Super Scaffold). Low Complexity Region. Scaffolding. Paired-End Mate-Pair. Scaffold (≈ 2Mbp). Pseudo Molecule (Super Scaffold).
E N D
Sequencing Data Quality SauloAflitos
Assembly - Concepts Read (≈100bp) Contig (≈2Kbp) Paired-End Mate-Pair Scaffold (≈ 2Mbp) Pseudo Molecule (Super Scaffold) Low Complexity Region
Scaffolding Paired-End Mate-Pair Scaffold(≈ 2Mbp) Pseudo Molecule (Super Scaffold) Low Complexity Region
Scaffolding Repeats?!
Reality Consensus 3x 2x 1x 1x 3x Contig Reads Depth of Coverage Goldberg SMD et al. 2006
Heterozygozity A/C A A A A A A A C C C C C C C A A A A A A A A A A A A A A A N A A A C G T A C G T A A A A 95% ±5 50% ±10
Consequences of Data Cleaning 265.89 Raw Filtered 41.61 48.65 50.37 57.60
Sequencing Shotgun RNAseq
Sequencing Paired End Mate Pair
Sample Preparation Genome Ultrasound Physical RE Shred Gel Beads Size Selection ID Binding to Surface Circularization Adapter Illumina 454 PacBio Sequencing
Sequencing Illumina PE Insert Size 150bp-2Kbp 100bp 100bp Read Length
Sequencing 454 MP Insert Size 2K-20Kbp Read Length 500bp 150bp 150bp 150bp
FastQ Machine Name Read ID (unique) Encoded Quality 0-40 Chance of being wrong
FastQ Statistics 13 0.05 5%
FastQC Quality Checking Tool Contamination screen fastq screen Per base sequence quality Per base sequence content Per sequence quality Sequence duplication Sequence length distribution Per base GC content Per sequence GC content Per base N-content
Exercise • Create “cleaning” folder • mkdir cleaning; cd cleaning • Inside it, run: wget -O saulo.bash http://goo.gl/Tx8g6 • Run it with: bash saulo.bash • This will download FastQC and SolexaQA • FASTQC HELP : http://goo.gl/EE8M7 • FASTQC TUTORIAL: http://goo.gl/rihyA • FASTQC MANUAL : http://goo.gl/9yihC • SolexaQAHelp : http://solexaqa.sourceforge.net/ • Run FastQC: ./FastQC/fastqc & • File > open [Files of Type = FastQ files]
Exercise • Verify the two .fqfiles (you can use less): • bad_MiSeq_dataset.fq • good_MiSeq_dataset.fq • Clean the bad dataset with SolexaQA’s DynamicTrim.pl script: • perlSolexaQA_v.2.1/DynamicTrim.pl ► bad_MiSeq_dataset.fq-h 25 • Verify the improvement (or not) by opening • bad_MiSeq_dataset.fq.trimmed