Data Formats & QC Analysis for NGS

Data Formats & QC Analysis for NGS. Rosana O. Babu. Sequence Formats. Most sequence formats are ASCII text containing a sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence.

Presentation Transcript


  1. Data Formats & QC Analysis for NGS Rosana O. Babu

  2. Sequence Formats • Most sequence formats are ASCII text containing a sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence (some, such as SFF and BAM, are binary) • Formats are designed to hold sequence data together with other information about the sequence

  3. Why so many formats? • Supply the required information for each step of the analysis • Efficient data management: moving data across the file system takes time • Data formats vary in the information they contain • Five types of sequence file formats • Raw sequence files • Coordinate files • Parameter files • Annotation files • Metadata files

  4. Sequencers & Sequence Analysis Packages

  5. Read output formats • 454 • Solexa/Illumina • SOLiD

  6. 454 output formats • .sff • .fna • .qual

  7. Illumina output formats • .seq.txt • .prb.txt • Illumina FASTQ (ASCII code minus 64 is the Illumina score) • Qseq (ASCII code minus 64 is the Phred score) • Illumina single-line format (SCARF)
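
The "ASCII minus 64" rule above can be illustrated with a minimal sketch. The function name `decode_quality` and the example strings are hypothetical; the offsets are the ones named on the slides (64 for the older Illumina encodings, 33 for "standard" Sanger FASTQ).

```python
def decode_quality(quality_string, offset=33):
    """Convert a quality string to integer scores by subtracting the ASCII offset."""
    return [ord(ch) - offset for ch in quality_string]


def phred_to_error_probability(score):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-score / 10)


# The same scores look different depending on the encoding offset.
sanger_scores = decode_quality("II5!", offset=33)    # Sanger / standard FASTQ
illumina_scores = decode_quality("hhTB", offset=64)  # older Illumina FASTQ

print(sanger_scores)
print(illumina_scores)
print(phred_to_error_probability(20))
```

Decoding a file with the wrong offset shifts every score by 31, so it is worth checking the encoding (FastQC reports it under Basic Statistics) before any filtering.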

  8. SOLiD output format(s) CSFASTA

  9. If reads should be deposited in a public repository: • SRA (Sequence Read Archive, formerly Short Read Archive) at NCBI • ENA (European Nucleotide Archive) at EMBL-EBI

  10. Common (“standard”) format for read alignments: Alignment/Assembly Format • SAM • BAM (= binary SAM)
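
As an illustration of what a SAM alignment line holds, here is a minimal sketch that splits one tab-separated record into its 11 mandatory fields. The helper `parse_sam_line` and the example read are made up for illustration; in practice a dedicated library would normally be used for SAM/BAM files.

```python
from collections import namedtuple

# The 11 mandatory SAM columns, followed by any optional TAG:TYPE:VALUE fields.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),      # 1-based leftmost mapping position
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=fields[11:],
    )


# Hypothetical example record.
example = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(example)
is_reverse_strand = bool(record.flag & 0x10)  # flag bit 0x10: read maps to the reverse strand
print(record.rname, record.pos, is_reverse_strand)
```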

  11. Formats for Genome/Gene annotation • BED format (genome-browser tracks) • GFF format (gene/genome features) • BioXSD (XML) (any annotation; under development)
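
One practical difference between the two main annotation formats is the coordinate system: BED intervals are 0-based and half-open, while GFF intervals are 1-based and inclusive. A minimal conversion sketch (function names are hypothetical):

```python
def bed_to_gff_coords(bed_start, bed_end):
    """BED (0-based, half-open) -> GFF (1-based, inclusive)."""
    return bed_start + 1, bed_end


def gff_to_bed_coords(gff_start, gff_end):
    """GFF (1-based, inclusive) -> BED (0-based, half-open)."""
    return gff_start - 1, gff_end


# The first 100 bases of a chromosome in both systems.
print(bed_to_gff_coords(0, 100))
print(gff_to_bed_coords(1, 100))
```

In BED the feature length is simply end minus start; in GFF it is end minus start plus one.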

  12. Deposit genome/metagenome in a public repository: • INSDC (International Nucleotide Sequence Database Collaboration) databases: GenBank, EMBL, DDBJ • Deposit genome/metagenome metadata: MIGS/MIMS standard by the GSC (Genomic Standards Consortium)

  13. MIGS: Minimum Information about a Genome Sequence • MIMS: Minimum Information about a Metagenome Sequence/Sample

  14. Points to remember on Data Formats • Use the raw sequencing data format when possible • For base-call data, use “standard” FASTQ (Sanger, Phred) • For read alignments, use SAM/BAM format • For annotation results, use a standard annotation format (e.g. GFF or BED)
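
For reference, a "standard" FASTQ record is four lines: an `@` header, the sequence, a `+` separator, and a quality string of the same length as the sequence. The reader below is a minimal sketch that assumes this strict four-line layout (no wrapped sequences); `read_fastq` and the example reads are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a four-line-per-record FASTQ file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("FASTQ header must start with '@'")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


# Hypothetical two-read example held in memory instead of a file.
example_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n5555\n"
records = list(read_fastq(io.StringIO(example_text)))
print(records)
```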

  15. QC analysis

  16. Need for QC & Preprocessing • QC analysis of sequence data is extremely important for meaningful downstream analysis • To detect problems in the quality scores/statistics of sequencing data • To check whether further analysis of the sequences is possible • To remove redundancy (filtering) • To remove low-quality reads from the analysis • Highly efficient and fast processing tools are required to handle large volumes of data
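
The low-quality filtering step can be sketched as follows. This is only an illustration, not what any particular tool does: the function `keep_read` and both thresholds (mean quality of at least 20, at most 10% N bases) are assumptions chosen for the example, and Phred+33 encoding is assumed.

```python
def keep_read(sequence, quality, min_mean_quality=20, max_n_fraction=0.1, offset=33):
    """Return True if a read passes a simple mean-quality and N-content filter."""
    if not sequence:
        return False
    scores = [ord(ch) - offset for ch in quality]
    mean_quality = sum(scores) / len(scores)
    n_fraction = sequence.upper().count("N") / len(sequence)
    return mean_quality >= min_mean_quality and n_fraction <= max_n_fraction


# Hypothetical reads: (sequence, quality string).
reads = [
    ("ACGTACGTAC", "IIIIIIIIII"),  # high quality, no Ns
    ("ACGTACGTAC", "!!!!!!!!!!"),  # every base has quality 0
    ("NNNTACGTAC", "IIIIIIIIII"),  # 30% N bases
]
kept = [seq_qual for seq_qual in reads if keep_read(*seq_qual)]
print(len(kept), "of", len(reads), "reads kept")
```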

  17. FastQC and FastX Toolkit • Use FastQC in preliminary analysis • Use FastX-toolkit to optimize different datasets and visualize the results with FastQC

  18. FastQC output • Basic statistics • Quality- Per base position • Per Sequence Quality Distribution • Nucleotide content per position • Per sequence GC distribution • Per base GC distribution • Per base N content • Length Distribution • Overrepresented/ duplicated sequences • K-mer content

  19. FastQC (Box-Whisker plot) Y axis- Quality Score X axis- Base position
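
The numbers behind such a box-whisker plot are a per-position summary of the quality scores across all reads. The sketch below computes minimum, quartiles, median, and maximum for each base position; it is not FastQC's own implementation, and it assumes Phred+33 encoding and reads of equal length (with unequal lengths, `zip` stops at the shortest read).

```python
import statistics


def per_position_summary(quality_strings, offset=33):
    """Summarise quality scores at each base position across reads."""
    decoded = [[ord(ch) - offset for ch in q] for q in quality_strings]
    summary = []
    # zip(*decoded) groups the scores of all reads by base position.
    for position, scores in enumerate(zip(*decoded), start=1):
        q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
        summary.append({
            "position": position,  # X axis: base position
            "min": min(scores),
            "q1": q1,
            "median": median,      # Y axis values: quality scores
            "q3": q3,
            "max": max(scores),
        })
    return summary


# Three hypothetical reads of length 3.
summary = per_position_summary(["II!", "I5!", "I+!"])
for row in summary:
    print(row)
```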

  20. Basic Statistics Contains information about • File_type • ASCII encoding quality value • Total sequences, filtered sequence • Sequence length • Percentage GC content

  21. 2. Quality- Per base position

  22. 2. Quality- Per base position

  23. 3. Per Sequence Quality Distribution

  24. 3. Per Sequence Quality Distribution

  25. 4. Nucleotide content per position

  26. 4. Nucleotide content per position

  27. 5. Per sequence GC distribution

  28. 5. Per sequence GC distribution

  29. 6. Per base GC distribution

  30. 6. Per base GC distribution

  31. 7. Per base N content

  32. 8. Length Distribution

  33. 9. K-mer content

  34. 10. Overrepresented/duplicate sequences • A high level of duplicated sequences may indicate a sequencing or library-preparation problem
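
A rough way to look at duplication in a small read set is simply to count identical sequences. The sketch below does that with exact matching; it is a simplification and not the method FastQC uses, and the function name and example reads are hypothetical.

```python
from collections import Counter


def duplication_summary(sequences, top_n=3):
    """Count exact duplicate reads and list the most frequent sequences."""
    counts = Counter(sequences)
    total = sum(counts.values())
    distinct = len(counts)
    # Reads beyond the first copy of each distinct sequence count as duplicates.
    percent_duplicate = 100.0 * (total - distinct) / total if total else 0.0
    return {
        "total": total,
        "distinct": distinct,
        "percent_duplicate": percent_duplicate,
        "most_common": counts.most_common(top_n),
    }


reads = ["ACGT", "ACGT", "ACGT", "GGCA", "TTAA"]
result = duplication_summary(reads, top_n=1)
print(result)
```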

  35. FASTX Toolkit • fastx_quality_stats.txt • fastq_quality_boxplot_graph.png • fastx_nucleotide_distribution.png • QC report.txt

  36. QC Report
      • Sequence Statistics
        • Total No. of Sequences: 6970943
        • Avg. Sequence Length: 54
        • Max Sequence Length: 54
        • Min Sequence Length: 54
        • Total Sequence Length: 376430922
        • Total N bases: 14254521
        • % N bases: 3.78676
        • No. of Sequences with Ns: 278635
        • % Sequences with Ns: 3.99709
      • Quality Statistics
        • Total HQ bases: 334195496
        • % HQ bases: 88.78
        • Total HQ reads: 6350256
        • % HQ reads: 91.0961
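
The percentages in this report follow directly from the raw counts, which makes a quick consistency check possible. The sketch below recomputes them from the figures on the slide; the variable names are hypothetical, and the cutoff that defines a "HQ" base or read is not stated in the source, so the HQ counts are taken as given.

```python
# Counts copied from the QC report above.
total_sequences = 6970943
sequence_length = 54          # min, max and average length are all 54
total_n_bases = 14254521
sequences_with_n = 278635
total_hq_bases = 334195496
total_hq_reads = 6350256

# All reads have the same length, so the total length is a simple product.
total_bases = total_sequences * sequence_length

pct_n_bases = 100 * total_n_bases / total_bases
pct_sequences_with_n = 100 * sequences_with_n / total_sequences
pct_hq_bases = 100 * total_hq_bases / total_bases
pct_hq_reads = 100 * total_hq_reads / total_sequences

print("Total Sequence Length:", total_bases)
print("% N bases:", round(pct_n_bases, 5))
print("% Sequences with Ns:", round(pct_sequences_with_n, 5))
print("% HQ bases:", round(pct_hq_bases, 2))
print("% HQ reads:", round(pct_hq_reads, 4))
```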

  37. quality_boxplot_graph & nucleotide_distribution

  38. Thank you
