
Next Gen Sequencing Data


sun

Presentation Transcript


  1. Next Gen Sequencing Data • The FASTQ file format is the current standard for next generation sequencer output. • This is the format for the Illumina Genome Analyzer

  2. FASTQ Format • ASCII text • No standard file extension, but .fq, .fastq, and .txt are commonly used • 4 lines per sequence • Line 1 begins with the @ character, followed by a sequence ID and an optional description • Similar to the > line in a FASTA file • Line 2 is the sequence letters • Line 3 begins with the + character, optionally followed by the same sequence ID and description • Line 4 encodes quality values for the sequence letters in line 2 • Must contain the same number of characters as the sequence in line 2

  3. FASTQ example
     @SEQ_ID
     GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
     +
     !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
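The four-line layout described on slides 2 and 3 can be read with a few lines of Python. This is a minimal sketch, assuming unwrapped records (exactly four lines each); the names `FastqRecord` and `parse_fastq` are made up for illustration and do not come from any library.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str       # text after '@' up to the first space
    description: str  # optional text after the ID (may be empty)
    sequence: str     # line 2
    quality: str      # line 4, one character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from unwrapped FASTQ text (exactly 4 lines per record)."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence, plus, quality = next(it), next(it), next(it)
        except StopIteration:
            raise ValueError(f"truncated record after {header!r}") from None
        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        seq_id, _, description = header[1:].partition(" ")
        yield FastqRecord(seq_id, description, sequence, quality)


example = """@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
"""
record = next(parse_fastq(example.splitlines()))
print(record.seq_id, len(record.sequence), len(record.quality))
```

The length check enforces the rule that line 4 must contain the same number of characters as line 2.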

  4. FASTQ format • Line 2 (sequence characters) and Line 4 (quality value characters) may be wrapped (split over multiple lines). • Wrapping is discouraged because it makes parsing more complicated
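Wrapping complicates parsing mainly because '@' and '+' are also valid quality characters, so a quality line can look like the start of a new record. One possible approach, sketched below under the assumption that sequence lines never start with '+', is to read quality lines until they cover as many characters as the sequence. The function name `parse_wrapped_fastq` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def parse_wrapped_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from FASTQ text that may be wrapped."""
    it = iter(line.rstrip("\r\n") for line in lines)

    def next_header():
        # Skip blank lines; return None at end of input.
        for line in it:
            if line:
                return line
        return None

    header = next_header()
    while header is not None:
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("missing '+' separator line")
        sequence = "".join(seq_parts)
        # Quality lines run until they cover the whole sequence. A leading
        # '@' cannot be used as a record boundary here, because '@' is
        # itself a valid quality character.
        qual_parts, qual_len = [], 0
        while qual_len < len(sequence):
            line = next(it, None)
            if line is None:
                raise ValueError("quality shorter than sequence")
            qual_parts.append(line)
            qual_len += len(line)
        if qual_len != len(sequence):
            raise ValueError("quality longer than sequence")
        yield header[1:], sequence, "".join(qual_parts)
        header = next_header()
```

Compared with a strict four-line reader, this version has to track the sequence length across lines, which illustrates why unwrapped files are easier to handle.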

  5. Illumina Sequence ID format • Illumina uses a special format for their sequence IDs: @HWUSI-EAS100R:6:73:941:1973#0/1 • HWUSI-EAS100R: the unique instrument name • 6: the flowcell lane • 73: tile number within the flowcell lane • 941: 'x'-coordinate of the cluster within the tile • 1973: 'y'-coordinate of the cluster within the tile • #0: index number for a multiplexed sample (0 for no indexing) • /1: the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
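The fields of this ID layout can be pulled apart with a regular expression. The sketch below assumes the exact layout shown on the slide (colon-separated fields, a '#' index, and an optional '/1' or '/2' suffix) and keeps the index as text; `IlluminaReadId` and `parse_illumina_id` are illustrative names, not part of any Illumina tool.

```python
import re
from typing import NamedTuple, Optional


class IlluminaReadId(NamedTuple):
    instrument: str             # unique instrument name
    lane: int                   # flowcell lane
    tile: int                   # tile number within the lane
    x: int                      # x-coordinate of the cluster within the tile
    y: int                      # y-coordinate of the cluster within the tile
    index: str                  # multiplex index ("0" for no indexing)
    pair_member: Optional[int]  # 1 or 2 for paired reads, else None


_ID_PATTERN = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_illumina_id(read_id: str) -> IlluminaReadId:
    """Split an ID such as '@HWUSI-EAS100R:6:73:941:1973#0/1' into fields."""
    match = _ID_PATTERN.match(read_id.strip())
    if match is None:
        raise ValueError(f"not in the expected Illumina ID layout: {read_id!r}")
    pair = match.group("pair")
    return IlluminaReadId(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        index=match.group("index"),
        pair_member=int(pair) if pair is not None else None,
    )


print(parse_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```

IDs written by other pipeline versions may not match this pattern, in which case the function raises `ValueError` rather than guessing.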

  6. Quality Values • Quality values are encoded differently on different next gen sequencers. • Illumina uses pipeline software originally developed by Solexa to determine quality values for each base in a sequence. • Older versions of the Illumina pipeline software used different equations to calculate quality. • Current software (v. 1.5) uses the Phred (i.e. Sanger) scoring scheme • The Phred score of a base is Qphred = -10 * log10(e) • e is the estimated probability of the base being wrong
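The Phred formula on slide 6 is easy to check numerically. The sketch below converts between an error probability e and a Phred score; for comparison it also includes the older Solexa equation, Q = -10 * log10(e / (1 - e)), as described in the NAR article listed in the references. The function names are made up for this example.

```python
import math


def phred_from_error(e: float) -> float:
    """Phred score: Q = -10 * log10(e), for 0 < e <= 1."""
    if not 0.0 < e <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(e)


def error_from_phred(q: float) -> float:
    """Inverse of the Phred formula: e = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def solexa_from_error(e: float) -> float:
    """Older Solexa score: Q = -10 * log10(e / (1 - e)), for 0 < e < 1."""
    if not 0.0 < e < 1.0:
        raise ValueError("error probability must be in (0, 1)")
    return -10.0 * math.log10(e / (1.0 - e))


# Each step of 10 in Phred score is a tenfold drop in error probability.
for q in (10, 20, 30, 40):
    print(q, error_from_phred(q))
```

A Phred score of 20 therefore corresponds to a 1 in 100 chance that the base call is wrong, and 30 to 1 in 1000. The two scales nearly coincide for high-quality bases and diverge for low-quality ones; at e = 0.5 the Solexa score is 0 while the Phred score is about 3.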

  7. Encoding Phred Quality Scores • Phred scores are presented on Line 4 of a FASTQ file, one character per base: !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 • Each character is the ASCII character whose code is the Phred quality score plus a fixed offset • The example string above uses the Sanger offset of 33 (its first character, '!', is ASCII 33, i.e. Phred 0) • Versions 1.3 and later of the Illumina software add 64 instead, so the Phred values 0-62 can be encoded as ASCII 64-126 • Values greater than 40 are not expected in raw read data • In the newest version of the Illumina pipeline software, Phred scores 0 and 1 are no longer used. A Phred score of 2 (ASCII 66, ‘B’) is now only used at the end of a read. • Phred 2 is now a read segment quality control indicator
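Decoding a quality string is a matter of subtracting the offset from each character code. The sketch below assumes the two offsets in common use: 33 for Sanger-style files and 64 for Illumina 1.3+ files. The names `decode_quality` and `encode_quality` are hypothetical.

```python
from typing import List

SANGER_OFFSET = 33       # '!' encodes Phred 0
ILLUMINA_13_OFFSET = 64  # '@' encodes Phred 0, 'B' encodes Phred 2


def decode_quality(quality: str, offset: int) -> List[int]:
    """Turn a FASTQ quality string into a list of integer scores."""
    scores = [ord(ch) - offset for ch in quality]
    if any(score < 0 for score in scores):
        raise ValueError("character below the offset; wrong FASTQ variant?")
    return scores


def encode_quality(scores: List[int], offset: int) -> str:
    """Turn integer scores back into a FASTQ quality string."""
    if any(not 0 <= s + offset <= 126 for s in scores):
        raise ValueError("score outside the printable ASCII range")
    return "".join(chr(s + offset) for s in scores)


print(decode_quality("!''*", SANGER_OFFSET))      # [0, 6, 6, 9]
print(decode_quality("hB", ILLUMINA_13_OFFSET))   # [40, 2]
```

Decoding a Sanger-style string with offset 64 fails immediately here, because characters such as '!' fall below ASCII 64; that check is a cheap way to catch a mismatched variant.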

  8. Format Conversion • Because each version of the Illumina pipeline software uses a different quality value scheme, file conversion software is sometimes necessary. • There are many conversion options: BioPerl v. 1.6.1+ can convert between Sanger (Phred), Solexa (Illumina 1.0), and Illumina 1.3+ files. • Other options are Biopython, EMBOSS, BioRuby, and MAQ
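To show what such a conversion involves, here is a rough sketch of the quality-line arithmetic only, not a replacement for the tools named above. It assumes Illumina 1.3+ input uses offset 64 and Sanger output uses offset 33, and it maps Solexa scores to Phred scores with Qphred = 10 * log10(10^(Qsolexa/10) + 1), the relation given in the NAR article listed in the references. Function names are made up for illustration.

```python
import math


def illumina13_to_sanger(quality: str) -> str:
    """Re-encode an Illumina 1.3+ quality string (offset 64) with offset 33."""
    out = []
    for ch in quality:
        q = ord(ch) - 64
        if not 0 <= q <= 62:
            raise ValueError(f"{ch!r} is not a valid Illumina 1.3+ quality character")
        out.append(chr(q + 33))
    return "".join(out)


def solexa_to_phred(q_solexa: int) -> int:
    """Map a Solexa score to the nearest integer Phred score."""
    return round(10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0))


def solexa_to_sanger(quality: str) -> str:
    """Re-encode a Solexa quality string (offset 64, scores from -5) as Sanger."""
    out = []
    for ch in quality:
        q = ord(ch) - 64
        if not -5 <= q <= 62:
            raise ValueError(f"{ch!r} is not a valid Solexa quality character")
        out.append(chr(solexa_to_phred(q) + 33))
    return "".join(out)


print(illumina13_to_sanger("hhB@"))  # II#!
```

In practice a maintained library is the safer choice. For example, Biopython's `Bio.SeqIO.convert` accepts the format names "fastq-illumina", "fastq-solexa", and "fastq-sanger" and rewrites whole files rather than single quality lines.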

  9. Storage Requirements • Large sequencing centers have terabytes and sometimes petabytes of sequence data that must be analyzed. • This data is in ASCII or other plain text formats • A new encoding method called G-SQZ (Genomic SQueeZ) has recently been invented. • This method can compress sequence data by as much as 80% • G-SQZ can encode ACGT frequencies, annotation information, data quality, and erroneous entries (unidentified bases) • G-SQZ allows data access at regular intervals, such as every millionth base • Not all of the information has to be decoded from the start • Multiple computer processes could decode and process different chunks of the data simultaneously

  10. References • New Technology Reduces Storage Needs and Costs for Genomic Data http://www.sciencedaily.com/releases/2010/07/100706150614.htm • The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants http://nar.oxfordjournals.org/content/38/6/1767
