SOAP 2.0: Speed-up and a scoring system
BGI, 2008-05-27
Indexing reference genome
• SOAP 2 is designed specifically for longer (>36 bp) reads; finding all 2-mismatch hits requires only 3 indexing tables.
• BWT-based compressed suffix array; ~7 GB of RAM for the human genome.
• The reference genome is loaded into RAM once, which significantly reduces I/O.
• Reads are used as the queries, which makes threaded parallel computation straightforward and suits multi-core CPUs.
• Reads of varied sizes are supported within a single file.
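To make the BWT-based index more concrete, here is a minimal sketch of exact-match lookup by backward search on a toy FM-index. It is not SOAP 2's implementation: the names `build_fm_index` and `find_exact` are hypothetical, the suffix array is built naively and kept uncompressed, and nothing here handles mismatches. It only illustrates why the reference can be indexed once and then queried read by read.

```python
def build_fm_index(text):
    """Build a toy FM-index (naive construction, for short demo strings only)."""
    text += "$"  # sentinel; sorts before A, C, G, T
    # Suffix array: start positions of all suffixes in sorted order.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in text that are strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def find_exact(pattern, index):
    """Return sorted start positions of exact occurrences of pattern."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):  # extend the match one base to the left
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("ACGTACGTTACG")
print(find_exact("ACG", index))  # [0, 4, 9]
```

Each query costs one table lookup per base of the read, independent of the reference length, which is what makes read-as-query threading attractive.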
Alignment strategy
• “XOR” + lookup table.
• <20 min to align 1M reads onto the human genome; 4 h for 1X data against a 3 Gb genome on an 8-core node; even faster for paired-end read mapping.
• Allows more mismatches at the 3’-end of reads.
• Gapped alignment (by enumeration) if no ungapped hits exist.
• Can report all hits if necessary.
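The “XOR” + lookup table idea can be sketched as follows. This is an illustration under assumptions, not the SOAP 2 code: bases are assumed to be packed two bits each, and the helper names (`pack`, `count_mismatches`) and the 8-bit table width are hypothetical choices. XOR-ing two packed sequences leaves a non-zero 2-bit group wherever the bases differ, and a precomputed table counts those groups one byte (four bases) at a time.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding

# MISMATCHES[b]: number of non-zero 2-bit groups in byte b (4 bases per byte).
MISMATCHES = [
    sum(1 for shift in (0, 2, 4, 6) if (b >> shift) & 3)
    for b in range(256)
]


def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def count_mismatches(read, ref_window):
    """Count differing bases between two equal-length sequences."""
    if len(read) != len(ref_window):
        raise ValueError("sequences must have the same length")
    diff = pack(read) ^ pack(ref_window)  # non-zero groups mark mismatches
    total = 0
    while diff:
        total += MISMATCHES[diff & 0xFF]  # table lookup on the low byte
        diff >>= 8
    return total


print(count_mismatches("ACGTACGT", "ACGAACGA"))  # 2
```

In a compiled implementation the packed words would be fixed-width machine integers, so one XOR compares many bases at once.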
Scoring system
Trying two methods:
• Heng’s method, as implemented in Maq;
• Similar in principle:
  • Set a quality cutoff (Q10?) and do not count low-quality mismatches;
  • Multiple equal best hits are treated as repeat hits;
  • For one best hit and multiple second-best hits, P = 1/(1 + a × N_second), where N_second is the number of second-best hits with one more mismatch and a is the estimated average error probability (a = 0.01?).
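The second scoring method above can be sketched in a few lines. This is a rough illustration, not the shipped scoring code: the function names are hypothetical, the defaults Q10 and a = 0.01 are the tentative values from the slide, qualities are assumed to be Phred integers, treating a quality equal to the cutoff as "counted" is an assumption, and the "unmapped" label for an empty hit list is added only so the function is total.

```python
def counted_mismatches(read, ref, quals, q_cutoff=10):
    """Count mismatches, ignoring bases whose quality is below the cutoff."""
    return sum(
        1
        for r, g, q in zip(read, ref, quals)
        if r != g and q >= q_cutoff  # assumption: q == cutoff still counts
    )


def classify_hits(mismatch_counts, a=0.01):
    """Classify a read from the mismatch counts of all its hits.

    Returns ("repeat", None) for several equally good best hits, or
    ("unique", P) with P = 1 / (1 + a * N_second) for a single best hit.
    """
    if not mismatch_counts:
        return ("unmapped", None)  # hypothetical label, not from the slide
    best = min(mismatch_counts)
    if mismatch_counts.count(best) > 1:
        return ("repeat", None)
    # N_second: hits with exactly one more mismatch than the best hit.
    n_second = mismatch_counts.count(best + 1)
    return ("unique", 1.0 / (1.0 + a * n_second))


print(counted_mismatches("ACGT", "AGGA", [30, 5, 30, 30]))  # 1
print(classify_hits([1, 1, 2]))  # ('repeat', None)
```

With one best hit and two second-best hits, the formula gives P = 1/(1 + 0.01 × 2) = 1/1.02, so each additional near-equal hit lowers confidence in the best placement only slightly at a = 0.01.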
Input & Output
• Input
  • Text (.fa, .fq)
  • gzipped
• Output
  • SOAP
  • .glz (GLF)
  • gzipped
  • binary
  • ACE
  • Others as necessary
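Accepting both plain-text and gzipped input can be done transparently by checking the gzip magic bytes before opening the file. The sketch below is only an illustration of that idea using Python's standard `gzip` module; `open_reads` and `read_fasta` are hypothetical helper names, and only FASTA (.fa) parsing is shown.

```python
import gzip


def open_reads(path):
    """Open a read file as text, whether it is plain or gzip-compressed."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip files start with these two bytes
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(path):
    """Yield (name, sequence) pairs from a plain or gzipped FASTA file."""
    name, chunks = None, []
    with open_reads(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

Detecting compression from the file content rather than the extension means a renamed or extension-less file is still read correctly.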