Next Gen Sequencing Data
• The FASTQ file format is the current standard for next generation sequencer output.
• This is the format for the Illumina Genome Analyzer.
FASTQ Format
• ASCII text
• No standard file extension, but .fq, .fastq, and .txt are commonly used
• 4 lines per sequence
  • Line 1 begins with the @ character, followed by a sequence ID and an optional description
    • Similar to the > line in a FASTA file
  • Line 2 is the sequence letters
  • Line 3 begins with the + character, optionally followed by the same sequence ID and description
  • Line 4 encodes quality values for the sequence letters in line 2
    • Must contain the same number of characters as the sequence in line 2
FASTQ example
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
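The four-line layout can be read with a few lines of Python. This is a minimal sketch, assuming unwrapped records (exactly four lines each); the names `FastqRecord` and `read_fastq` are illustrative and do not come from any library.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str       # text after '@' up to the first space
    description: str  # optional text after the ID (may be empty)
    sequence: str
    quality: str      # raw quality characters, not yet decoded


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from unwrapped FASTQ text (4 lines per record)."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        plus = next(it, None)
        quality = next(it, None)
        if quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        # Line 4 must have one quality character per sequence letter.
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in {header!r}")
        seq_id, _, description = header[1:].partition(" ")
        yield FastqRecord(seq_id, description, sequence, quality)


# Hypothetical two-line-wide demo record, not real sequencer output.
demo = ["@read1 demo", "GATTACA", "+", "IIIIIII"]
for record in read_fastq(demo):
    print(record.seq_id, len(record.sequence))
```

In practice the same function can be fed an open file handle, since a text file iterates line by line.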
FASTQ format
• Line 2 (sequence characters) and Line 4 (quality value characters) may be wrapped (split over multiple lines).
• Wrapping is discouraged because it makes parsing more complicated.
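To see why wrapping complicates parsing, consider that both @ and + are themselves valid quality characters, so a parser cannot find the next record just by looking for a line that starts with @. One possible approach, sketched below under that assumption, is to collect sequence lines up to the + line and then read quality lines until as many characters have been seen as there are sequence letters. The function name `read_wrapped_fastq` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_wrapped_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from FASTQ text that may be wrapped."""
    it = iter(line.rstrip("\r\n") for line in lines)
    # Skip blank lines and fetch the first header (None at end of input).
    header = next((ln for ln in it if ln), None)
    while header is not None:
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("missing '+' line")
        sequence = "".join(seq_parts)
        # Quality lines are delimited by character count, not by '@'.
        qual_parts, seen = [], 0
        while seen < len(sequence):
            line = next(it, None)
            if line is None:
                raise ValueError("truncated quality string")
            qual_parts.append(line)
            seen += len(line)
        quality = "".join(qual_parts)
        if len(quality) != len(sequence):
            raise ValueError("quality string longer than sequence")
        yield header[1:], sequence, quality
        header = next((ln for ln in it if ln), None)
```

With unwrapped files the character counting is unnecessary, which is one reason the four-line layout is preferred.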
Illumina Sequence ID format
• Illumina uses a special format for its sequence IDs: @HWUSI-EAS100R:6:73:941:1973#0/1
  • HWUSI-EAS100R: the unique instrument name
  • 6: the flowcell lane
  • 73: tile number within the flowcell lane
  • 941: 'x'-coordinate of the cluster within the tile
  • 1973: 'y'-coordinate of the cluster within the tile
  • #0: index number for a multiplexed sample (0 for no indexing)
  • /1: the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
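The fields listed above can be pulled apart with a regular expression. This is a sketch that assumes the ID follows exactly the layout shown on this slide; `IlluminaId` and `parse_illumina_id` are illustrative names, and treating the #index and /pair parts as optional is an assumption.

```python
import re
from typing import NamedTuple, Optional


class IlluminaId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: Optional[str]  # text after '#', e.g. "0" for no indexing
    pair: Optional[int]   # 1 or 2 for paired reads, otherwise None


# Assumed layout: instrument:lane:tile:x:y, then optional #index and /pair.
_ID_PATTERN = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>-?\d+):(?P<y>-?\d+)"
    r"(?:#(?P<index>[^/]+))?(?:/(?P<pair>[12]))?$"
)


def parse_illumina_id(seq_id: str) -> IlluminaId:
    """Split an Illumina-style sequence ID into its named fields."""
    m = _ID_PATTERN.match(seq_id.strip())
    if m is None:
        raise ValueError(f"not an Illumina-style ID: {seq_id!r}")
    return IlluminaId(
        instrument=m["instrument"],
        lane=int(m["lane"]),
        tile=int(m["tile"]),
        x=int(m["x"]),
        y=int(m["y"]),
        index=m["index"],
        pair=int(m["pair"]) if m["pair"] else None,
    )


print(parse_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```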
Quality Values
• Quality values are encoded differently on different next gen sequencers.
• Illumina uses pipeline software originally developed by Solexa to determine quality values for each base in a sequence.
• Older versions of the Illumina pipeline software used different equations to calculate quality.
• Current software (v. 1.5) uses the Phred (i.e. Sanger) scoring scheme
  • The Phred score of a base is Qphred = -10 * log10(e)
  • e is the estimated probability of a base being wrong
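The Phred formula is easy to check numerically. The sketch below converts in both directions; the function names are illustrative only.

```python
import math


def phred_from_error(e: float) -> float:
    """Qphred = -10 * log10(e), where e is the probability the base is wrong."""
    if not 0.0 < e <= 1.0:
        raise ValueError("e must be in the interval (0, 1]")
    return -10.0 * math.log10(e)


def error_from_phred(q: float) -> float:
    """Inverse of the formula above: e = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# A 1-in-1000 chance of a wrong base corresponds to roughly Q30,
# and Q20 corresponds to roughly a 1-in-100 chance.
print(round(phred_from_error(0.001)), error_from_phred(20))
```

Each step of 10 in the score therefore means a tenfold drop in the estimated error probability.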
Encoding Phred Quality Scores
• Phred scores are presented on Line 4 of a FASTQ file: !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
• In Illumina 1.3+ files, each character is the ASCII character whose code is the Phred quality score plus 64
  • The example string above contains characters below ASCII 64 (such as !), so it uses the Sanger encoding, which adds 33 instead
• In version 1.3 of the Illumina software, the Phred values 0-62 can be encoded as ASCII 64-126
  • Values greater than 40 are not expected in raw read data
• In the newest version of the Illumina pipeline software, Phred scores 0 and 1 are no longer used. A Phred score of 2 (ASCII 66, ‘B’) is now used only at the end of a read.
  • Phred 2 is now a read segment quality control indicator
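Decoding a quality line is a matter of subtracting the offset from each character code. The sketch below takes the offset as a parameter: 64 for Illumina 1.3+ files as described on this slide, and 33 for Sanger-style files. The function names are illustrative.

```python
from typing import List

ILLUMINA_13_OFFSET = 64  # Illumina 1.3+ encoding
SANGER_OFFSET = 33       # Sanger encoding


def decode_quality(quality: str, offset: int) -> List[int]:
    """Turn a line-4 quality string into a list of integer scores."""
    scores = [ord(ch) - offset for ch in quality]
    if any(q < 0 for q in scores):
        raise ValueError("character below the offset: wrong encoding assumed?")
    return scores


def encode_quality(scores: List[int], offset: int) -> str:
    """Inverse of decode_quality."""
    return "".join(chr(q + offset) for q in scores)


# 'B' is ASCII 66, so with offset 64 it decodes to Phred 2.
print(decode_quality("Bh", ILLUMINA_13_OFFSET))
```

A negative score after subtraction is a useful hint that the file was written with a different offset than the one assumed.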
Format Conversion
• Because each version of the Illumina pipeline software uses a different quality value scheme, file conversion software is sometimes necessary.
• There are many conversion options. BioPerl v. 1.6.1+ can convert Sanger (Phred), Solexa (Illumina 1.0), and Illumina 1.3+ files.
• Other options are Biopython, EMBOSS, BioRuby, and MAQ.
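For illustration, the core of such a conversion can be sketched in plain Python. Re-encoding Illumina 1.3+ as Sanger only shifts the ASCII offset from 64 to 33. Solexa scores use a different equation, so they are first mapped to Phred with Qphred = 10 * log10(10^(Qsolexa/10) + 1), the relation given in the NAR paper listed in the references. The function names are hypothetical, and this sketch handles only the quality line, not whole files.

```python
import math


def illumina13_to_sanger(quality: str) -> str:
    """Re-encode an Illumina 1.3+ quality string (offset 64) as Sanger (offset 33)."""
    out = []
    for ch in quality:
        q = ord(ch) - 64
        if q < 0:
            raise ValueError(f"{ch!r} is not valid in Illumina 1.3+ encoding")
        out.append(chr(q + 33))
    return "".join(out)


def solexa_to_phred(q_solexa: int) -> int:
    """Map a Solexa score to the nearest integer Phred score."""
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def solexa_to_sanger(quality: str) -> str:
    """Re-encode a Solexa quality string (offset 64) as Sanger (offset 33)."""
    return "".join(chr(solexa_to_phred(ord(ch) - 64) + 33) for ch in quality)


print(illumina13_to_sanger("Bh@"))
```

For real files a maintained library is the safer choice. With Biopython installed, for example, `Bio.SeqIO.convert(in_path, "fastq-illumina", out_path, "fastq")` re-encodes an Illumina 1.3+ file as Sanger FASTQ.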
Storage Requirements
• Large sequencing centers have terabytes and sometimes petabytes of sequence data that must be analyzed.
• This data is in ASCII or other plain text formats
• A new encoding method called G-SQZ (Genomic SQueeZ) has recently been invented.
  • This method can compress sequence data by as much as 80%
  • G-SQZ can encode ACGT frequencies, annotation information, data quality, and erroneous entries (unidentified bases)
• G-SQZ allows data access at regular intervals, such as every millionth base
  • The data does not all have to be decoded from the start
  • Multiple computer processes could decode and process different chunks of the data simultaneously
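The idea of access at regular intervals can be illustrated with a generic technique. The sketch below is not G-SQZ and does not use its encoding: it simply compresses fixed-size blocks independently with Python's standard zlib module, so that a slice can be read by decompressing only the blocks it touches. The function names and the block size are assumptions made for the example.

```python
import zlib
from typing import List


def compress_in_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Compress each fixed-size block on its own so blocks decode independently."""
    return [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def read_at(blocks: List[bytes], block_size: int, position: int, length: int) -> bytes:
    """Return data[position:position + length], decoding only the needed blocks."""
    first = position // block_size
    last = (position + length - 1) // block_size
    chunk = b"".join(zlib.decompress(b) for b in blocks[first:last + 1])
    start = position - first * block_size
    return chunk[start:start + length]


bases = b"ACGT" * 1000
blocks = compress_in_blocks(bases, block_size=256)
print(read_at(blocks, 256, position=1000, length=10))
```

Because each block stands alone, separate processes could also decode different blocks at the same time, which is the property the slide describes.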
References
• New Technology Reduces Storage Needs and Costs for Genomic Data. http://www.sciencedaily.com/releases/2010/07/100706150614.htm
• The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. http://nar.oxfordjournals.org/content/38/6/1767