Read Corrector D. Lavenier & N. Maillet IRISA, Rennes
Assembly from NGS data
Next Generation Sequencer → billions of bad reads → Correction → billions of good reads → Assembly → Contigs
D. LAVENIER & N. Maillet - IRISA
Benchmarks
Data: 1 000 257 40-bp reads generated from CP_000025 (1.8 Mbp) with MetaSim
Substitutions: 216 520
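As a rough orientation, the benchmark figures above can be turned into an average coverage and a substitution rate. The sketch below only uses the numbers on the slide; taking the genome length as exactly 1 800 000 bp is an assumption, since the slide rounds it to 1.8 Mbp.

```python
# Figures taken from the benchmark slide.
n_reads = 1_000_257
read_len = 40              # bp
genome_len = 1_800_000     # assumption: "1.8 Mbp" taken as an exact value
n_substitutions = 216_520

total_bases = n_reads * read_len
coverage = total_bases / genome_len           # average sequencing depth
error_rate = n_substitutions / total_bases    # substitutions per sequenced base
errors_per_read = n_substitutions / n_reads   # average substitutions per read

print(f"coverage        ~ {coverage:.1f}x")
print(f"error rate      ~ {error_rate:.2%}")
print(f"errors per read ~ {errors_per_read:.2f}")
```

With a depth of roughly 22x, each genome position is seen by many reads, which is the redundancy the correction relies on.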
Read Error Correction
• Principle:
  • Use the coverage redundancy to correct erroneous reads

AGGATGACCAGGATTAGGACCAGT
  GATGACCAGGATTAGGACCAGTTC
  GATGACCAGGATTAGGACCAGTTC
   ATGACCAGGATTAGGACCAGTTCA
      ACCAGGATTCGGACCAGTTCATTC   ← the C (A in the other reads) is probably due to a sequencing error
      ACCAGGATTAGGACCAGTTCATTC
      ACCAGGATTAGGACCAGTTCATTC
       CCAGGATTAGGACCAGTTCATTCA
Correction principle
• Index the reads
• Perform correction directly on the index structure
• Update the index as soon as a read is corrected
• Stop the process when no corrections occur
• Reject reads which cannot be corrected
Index structure
Seed size = 4, INDEX with 4^4 entries
Read #k: AGGATGACCAGGATTAGGATCAGT
A | GGAT | G | ACCA | G | GATTAGGATCAGT   (seedX = GGAT, seedY = ACCA)
INDEX entry: ACCA, k, 5, A, G, G
Index structure
Seed size = 4, INDEX with 4^4 entries
Read #k: AGGATGACCAGGATTAGGATCAGT
A | G | GATG | A | CCAG | G | ATTAGGATCAGT   (seedX = GATG, seedY = CCAG)
INDEX entry: CCAG, k, 6, G, A, G
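The two slides above can be read as sliding a window of the form before | seedX | middle | seedY | after along each read. The sketch below builds such an index in Python. The function names and the choice of a dictionary keyed by the (seedX, seedY) pair are assumptions made for illustration; the slides do not specify the in-memory layout. The stored position is the 0-based offset of the middle base, which matches the values 5 and 6 in the two example entries.

```python
from collections import defaultdict

SEED = 4  # seed size used on the slides

def double_seed_entries(read, read_id, seed=SEED):
    """Yield ((seedX, seedY), (read_id, pos, before, middle, after))
    for every window  before | seedX | middle | seedY | after  in the read."""
    window = 2 * seed + 3
    for start in range(len(read) - window + 1):
        before = read[start]
        seed_x = read[start + 1:start + 1 + seed]
        middle = read[start + 1 + seed]
        seed_y = read[start + 2 + seed:start + 2 + 2 * seed]
        after = read[start + 2 + 2 * seed]
        pos = start + 1 + seed  # 0-based offset of the middle base
        yield (seed_x, seed_y), (read_id, pos, before, middle, after)

def build_index(reads):
    """Hypothetical index layout: (seedX, seedY) -> list of entries."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for key, entry in double_seed_entries(read, read_id):
            index[key].append(entry)
    return index

index = build_index(["AGGATGACCAGGATTAGGATCAGT"])
print(index[("GGAT", "ACCA")])  # entry of the first slide
print(index[("GATG", "CCAG")])  # entry of the second slide
```

Keying on both seeds means that every entry in one bucket describes the same local context, so the three stored bases can be compared directly across reads.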
Read Error Correction
INDEX entries for the double seed GATT / GGAC:
k1  AGGATGACCAGGATTAGGACCAGT   GGAC, k1, 15, G, A, C
k2  GATGACCAGGATTAGGACAAGTTC   GGAC, k2, 13, G, A, A
k3  GATGACCAGGATTAGGACCAGTTC   GGAC, k3, 13, G, A, C
k4  ATGACCAGGATTAGGACCAGTTCA   GGAC, k4, 12, G, A, C
k5  ACCAGGATTCGGACCAGTTCATTC   GGAC, k5, 09, G, C, C
k6  ACCAGGATTAGGACCAGTTCATTC   GGAC, k6, 09, G, A, C
k7  ACCAGGATTAGGACCAGTTCATTC   GGAC, k7, 09, G, A, C
k8  CCAGGATTAGGACCAGTTCATTCA   GGAC, k8, 08, G, A, C
Voting algorithm
k1  AGGATGACCAGGATTAGGACCAGT
k2  GATGACCAGGATTAGGACAAGTTC  →  GATGACCAGGATTAGGACCAGTTC
k3  GATGACCAGGATTAGGACCAGTTC
k4  ATGACCAGGATTAGGACCAGTTCA
k5  ACCAGGATTCGGACCAGTTCATTC  →  ACCAGGATTAGGACCAGTTCATTC
k6  ACCAGGATTAGGACCAGTTCATTC
k7  ACCAGGATTAGGACCAGTTCATTC
k8  CCAGGATTAGGACCAGTTCATTCA
k5: majority of A in the column, so C is replaced by A. k2: majority of C in the column, so A is replaced by C.
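The vote on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming a plain majority vote with no minimum-support threshold (the slides do not state one); the function name `vote_and_correct` is hypothetical. For one double seed it collects the bases before seedX, between the seeds and after seedY in every read that contains it, then overwrites minority bases with the majority base.

```python
from collections import Counter

reads = [
    "AGGATGACCAGGATTAGGACCAGT",  # k1
    "GATGACCAGGATTAGGACAAGTTC",  # k2, error after seedY
    "GATGACCAGGATTAGGACCAGTTC",  # k3
    "ATGACCAGGATTAGGACCAGTTCA",  # k4
    "ACCAGGATTCGGACCAGTTCATTC",  # k5, error between the seeds
    "ACCAGGATTAGGACCAGTTCATTC",  # k6
    "ACCAGGATTAGGACCAGTTCATTC",  # k7
    "CCAGGATTAGGACCAGTTCATTCA",  # k8
]

def vote_and_correct(reads, seed_x, seed_y):
    """Correct the three bases flanking one double seed by majority vote."""
    n = len(seed_x)
    window = 2 * n + 3
    hits = []  # (read index, offset of the 'before' base)
    for i, read in enumerate(reads):
        for s in range(len(read) - window + 1):
            if read[s + 1:s + 1 + n] == seed_x and read[s + 2 + n:s + 2 + 2 * n] == seed_y:
                hits.append((i, s))
    if not hits:
        return list(reads)
    corrected = [list(r) for r in reads]
    # Offsets of the before / middle / after bases relative to s.
    for col in (0, n + 1, 2 * n + 2):
        majority = Counter(reads[i][s + col] for i, s in hits).most_common(1)[0][0]
        for i, s in hits:
            corrected[i][s + col] = majority
    return ["".join(r) for r in corrected]

fixed = vote_and_correct(reads, "GATT", "GGAC")
print(fixed[1])  # k2 after the vote
print(fixed[4])  # k5 after the vote
```

On the eight reads above, k2 and k5 come out identical to k3 and k6 respectively, as on the slide.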
General algorithm
3 steps:
• Read indexing
  • based on double seeds
• Read correction
  • iterate until no more corrections are possible
• Read rejection
  • remove reads which cannot be corrected
Step 2 : Correction
do
  nb_err = 0
  for each entry of INDEX
    LR = list of bad reads
    nb_err += len(LR)
    for each element of LR
      un-index read from INDEX
      correct read
      index read into INDEX
until nb_err == 0
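The loop above can be made runnable as follows. This is a sketch under stated assumptions, not the authors' implementation: the index is a dictionary from (seedX, seedY) to a set of (read id, offset) pairs, a read is "bad" for an entry when its three flanking bases differ from the per-column majority, and `min_support` and `max_rounds` are hypothetical safeguards that do not appear on the slides.

```python
from collections import Counter, defaultdict

SEED = 4                              # seed size from the slides
COLS = (0, SEED + 1, 2 * SEED + 2)    # offsets of the before / middle / after bases

def windows(read):
    """Yield ((seedX, seedY), s) for every double-seed window starting at s."""
    for s in range(len(read) - (2 * SEED + 3) + 1):
        yield (read[s + 1:s + 1 + SEED], read[s + 2 + SEED:s + 2 + 2 * SEED]), s

def correct_reads(reads, min_support=3, max_rounds=10):
    reads = list(reads)
    index = defaultdict(set)          # (seedX, seedY) -> {(read id, s)}
    for rid, read in enumerate(reads):
        for key, s in windows(read):
            index[key].add((rid, s))

    for _ in range(max_rounds):       # do ... until nb_err == 0
        nb_err = 0
        for key in list(index):
            hits = index[key]
            if len(hits) < min_support:   # too few reads to vote
                continue
            consensus = tuple(
                Counter(reads[r][s + c] for r, s in hits).most_common(1)[0][0]
                for c in COLS
            )
            bad = [(r, s) for r, s in hits
                   if tuple(reads[r][s + c] for c in COLS) != consensus]
            nb_err += len(bad)
            for r, s in bad:
                if (r, s) not in index[key]:    # hit invalidated by an earlier fix
                    continue
                for k, o in windows(reads[r]):  # un-index read
                    index[k].discard((r, o))
                chars = list(reads[r])          # correct read
                for c, base in zip(COLS, consensus):
                    chars[s + c] = base
                reads[r] = "".join(chars)
                for k, o in windows(reads[r]):  # index read again
                    index[k].add((r, o))
        if nb_err == 0:
            break
    return reads

example = [
    "AGGATGACCAGGATTAGGACCAGT", "GATGACCAGGATTAGGACAAGTTC",
    "GATGACCAGGATTAGGACCAGTTC", "ATGACCAGGATTAGGACCAGTTCA",
    "ACCAGGATTCGGACCAGTTCATTC", "ACCAGGATTAGGACCAGTTCATTC",
    "ACCAGGATTAGGACCAGTTCATTC", "CCAGGATTAGGACCAGTTCATTCA",
]
result = correct_reads(example)
print(result[1], result[4])
```

Re-indexing a read right after its correction means later entries in the same pass already see the corrected bases, which is what allows a second error in the same read to be fixed in a later iteration.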
Correcting 2 errors
Read with 2 errors (expected bases: T and C):
TTGGACCTGTGAGACTTGAGCACAGATGGACCCA
Iteration 1 will correct C:
TTGGACCTGTGA G ACTT G AGCA C AGATGGACCCA
Iteration 2 will correct A (its seed ATAG contains the base fixed in iteration 1):
TTGGACCTGTGAGACTTGAG C ATAG A TGGA C CCA
Step 3 : Read rejection
• Principle:
  • Each double seed of the index is counted
  • A read is rejected if even one of its double seeds is rare
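The rejection rule can be sketched as a counting pass followed by a filter. The threshold `min_count` below which a double seed counts as "rare" is a hypothetical parameter; the slides do not give a value. The function names are also chosen for illustration only.

```python
from collections import Counter

SEED = 4  # seed size from the slides

def double_seeds(read, seed=SEED):
    """Return the (seedX, seedY) pair of every window in the read."""
    return [
        (read[s + 1:s + 1 + seed], read[s + 2 + seed:s + 2 + 2 * seed])
        for s in range(len(read) - (2 * seed + 3) + 1)
    ]

def reject_reads(reads, min_count=2):
    """Split reads into (kept, rejected); min_count is an assumed threshold."""
    counts = Counter(key for read in reads for key in double_seeds(read))
    kept, rejected = [], []
    for read in reads:
        if any(counts[key] < min_count for key in double_seeds(read)):
            rejected.append(read)   # at least one rare double seed
        else:
            kept.append(read)
    return kept, rejected

good = "ACCAGGATTAGGACCAGTTCATTC"
bad = "ACCAGGATTCGGACCAGTTCATTC"    # one substitution left uncorrected
kept, rejected = reject_reads([good, good, good, bad])
print(len(kept), rejected)
```

An uncorrected error falls inside several seeds of its read, and those double seeds appear nowhere else, so the read is caught by the rarity test.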
Extrapolation
Data: 1 000 257 40-bp reads generated from CP_000025 (1.8 Mbp) with MetaSim
Substitutions: 216 520
Step 2 : Parallelization
Sequential loop:
do
  nb_err = 0
  for i = 0 to size(INDEX)
    LR = list of bad reads in INDEX[i]
    nb_err += len(LR)
    for k = 0 to len(LR)
      un-index LR[k]
      correct LR[k]
      index LR[k]
until nb_err == 0

Same loop, processed in blocks of S entries:
do
  nb_err = 0
  for i = 0 to size(INDEX) step S
    for j = 0 to S
      LR[j] = list of bad reads in INDEX[i+j]
      nb_err += len(LR[j])
    for j = 0 to S
      for k = 0 to len(LR[j])
        un-index LR[j][k]
        correct LR[j][k]
        index LR[j][k]
until nb_err == 0
Step 2 : Parallelization
Blocked loop:
do
  nb_err = 0
  for i = 0 to size(INDEX) step S
    for j = 0 to S
      LR[j] = list of bad reads in INDEX[i+j]
      nb_err += len(LR[j])
    for j = 0 to S
      for k = 0 to len(LR[j])
        un-index LR[j][k]
        correct LR[j][k]
        index LR[j][k]
until nb_err == 0

Blocked loop with the detection phase run in parallel (forall):
do
  nb_err = 0
  for i = 0 to size(INDEX) step S
    forall j = 0 to S
      LR[j] = list of bad reads in INDEX[i+j]
      nb_err += len(LR[j])
    for j = 0 to S
      for k = 0 to len(LR[j])
        un-index LR[j][k]
        correct LR[j][k]
        index LR[j][k]
until nb_err == 0
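The blocked scheme separates a read-only detection phase, which can run in parallel over S index entries, from an update phase that stays sequential. The sketch below shows that structure with Python's `concurrent.futures`. It is a simplified stand-in: each INDEX entry is reduced to a dictionary mapping a read id to the base it shows in one column, and `find_bad`, `fix` and `correct_blocked` are hypothetical names. Threads are used only to show the shape of the loop; in CPython they do not speed up CPU-bound work, so a real implementation would need processes or native code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def find_bad(bucket):
    """Read-only scan of one INDEX entry: return (bucket, read id, majority base)
    for every read that disagrees with the majority."""
    if not bucket:
        return []
    majority = Counter(bucket.values()).most_common(1)[0][0]
    return [(bucket, rid, majority) for rid, base in bucket.items() if base != majority]

def fix(item):
    """Stand-in for un-index / correct / index of one bad read."""
    bucket, rid, majority = item
    bucket[rid] = majority

def correct_blocked(index, S=2):
    """Repeat blocked passes until a pass finds no error; return the pass count."""
    rounds = 0
    with ThreadPoolExecutor(max_workers=S) as pool:
        while True:
            nb_err = 0
            for i in range(0, len(index), S):
                lr = list(pool.map(find_bad, index[i:i + S]))  # forall j
                nb_err += sum(len(bad) for bad in lr)           # reduced after the forall
                for bad in lr:                                  # sequential update phase
                    for item in bad:
                        fix(item)
            rounds += 1
            if nb_err == 0:
                return rounds

toy_index = [
    {1: "A", 2: "A", 3: "C"},
    {1: "G", 2: "G", 3: "G"},
    {4: "T", 5: "A", 6: "T"},
]
n_rounds = correct_blocked(toy_index, S=2)
print(n_rounds, toy_index)
```

Summing the error counts after the parallel map, rather than incrementing a shared counter inside it, avoids a race on `nb_err`.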
Read Correction: a very time-consuming process
• Data
  • Salmonella enterica subsp. enterica serovar Typhimurium str. (5 Mbp)
  • 12 726 271 reads, 80 bp each
Challenges
• Correction of billions of reads
  • Development of scalable algorithms
• Parallelism needs to be extended
  • Multicore is not enough
  • Are GPUs good candidates?
    • Not sure, but this needs to be tested on new GPU architectures
• Find parallel data structures
  • To decrease the memory footprint per processor
  • To break the computation into hundreds of tasks