Finding Deletions with Exact Break Points from Noisy Low Coverage Paired-end Short Sequence Reads

Finding Deletions with Exact Break Points from Noisy Low Coverage Paired-end Short Sequence Reads Jin Zhang and Yufeng Wu Department of Computer Science and Engineering University of Connecticut

Introduction • Structural variants Reference Alternative deletion insertion • Our problem: • Finding Exact break points of deletions • using low coverage noisy data efficiently Mean insertion size + 3 sds Reference Alternative • Data( from 1000 genomes project pilot 1 ) Deletion low coverage (2-6x);Illumina (2009_08); 45 individuals in CEU population;(combine) 580 G in BAM format; Abnormal insertion size Only one end mapped 24G in fastqformat 250 Million reads anchor Both ends mapped deletion Abnormal insertion size In region without Deletion Real Split-reads (also contain error) Sequence error & other errors (more 99%) Split-reads Mapping • Achieve Deletions with exact break point efficiently

Method Spanning pair (abnormal insertion size) anchor anchor Split-read Split-read • We map Split-read on Burrows-Wheeler Transform (BWT) reference reference efficiency alternative alternative deletion Inexact mapping deletion Ex. Hit at 2 positions CACAATACCCTCTCACACCAACGTTACG SVs near SNPs and indels can be found; Reads with errors can be used reference Split not unique Hits not unique or Report the hits and splits with the best quality mismatch Split-read CAAT CCCTC ACGTAACG Search locally reference ex. search near region of 1Mb reference indel We pre-build local BWTs with length 102k (2k for overlap) on each strand. Search on which BWT is decided by the anchor.

Method • Calling candidate deletions TACGTTTAACCATACGGCCAAAACGTAACGT TTAACCAT (leftmost) ACGTAACGT or TTAACCATACGTAACGT (rightmost) TTAACCATACG TAACGT (1)Sorting split-reads leftmost break points (2)cluster the split-reads supporting the same candidate (3)call candidate Cutoff value: at least how many split-reads support the candidate Reference

Method • Candidates validation(calling deletions) Has spanning pair Alternative Abnormal insertion size maybe caused by deletion Reference deletion candidate If there existsspanning pairs with abnormal insertion size, we validate the deletion Notice that the candidate is from split-reads mapping No spanning pair (not enough information, we can’t validate) Reference candidate low coverage may cause no spanning pair, there is a deletion. Split-read is not mapped right or is with error, there is No deletion.

Results • Benchmarks Benchmark 1 The 1000 Genomes Project Consortium, (2010) A map of human genome variation from population-scale sequencing, Nature, 467, 1061-1073. Benchmark 2 Mills,R.E., et al., (2011) Mapping copy number variation by population-scale genome sequencing, Nature 470, 5965. We run our test with 1000 genomes project releases as benchmarks; The Deletions in these releases are found by multiple methods • Data Low coverage (can be as low as 2x) Illumina (2009_08); 45 individuals in CEU population;(combine) 580 G in BAM format; • Comparison Ye,K., el al., (2009) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads, Bioinformatics, 25, 2865-s2871

Results • Vs. Pindel v1 (Max Event 1Mb) What is the percentage of true deletions? We use the 1000 genomes project releases as benchmarks Total number of called deletions With the benchmark, our method with cutoff 2, has more true Positives and less false positives There might be deletions not in the benchmarks but found by our method Chromosomes Chromosomes True positive with precise deletions in benchmark 1 True positive with precise deletions in benchmark 2 chromosomes chromosomes

Results • Vs. Pindel v2 (Max Event 8092k) Total number of deletions found Comparison with v2 has the same trend with the comparison with v1 Chromosomes True positive with precise deletions in benchmark 1 True positive with precise deletions in benchmark 2 Chromosomes

Results • Running time Data: 45 individuals on chromosome 1 Running on our Xeon server with 24 CPUs Finding SV with Maximum Event Size include 1Mb • An example of inexact mapping • (P1_M_061510_1 9_22) Experiment is run on workstations supported by NSF grant IIS-0916948 Research partly supported by NSF grant IIS-0953563

Finding Deletions with Exact Break Points from Noisy Low Coverage Paired-end Short Sequence Reads

Finding Deletions with Exact Break Points from Noisy Low Coverage Paired-end Short Sequence Reads

Presentation Transcript

Parallel de novo Assembly of Large Genomes from Paired Short Reads

Short Break Holidays

Genome Resequencing with Short Reads

PE-Assembler: De novo assembler using short paired-end reads

FINDING THE SLOPE FROM 2 POINTS

Supplementary Figure 1 – Mapped contigs from reassembled paired end reads to the cranberry

An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq Reads

Break-points detection with atheoretical regression trees

NGS Part I RNA-Seq Short Reads Sequence Analysis Feb 29, 2012

Detecting Copy Number Variation With Short Paired Reads

Finding Sequence Similarities

Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities

Finding Approximate Repeating Patterns from Sequence Data

LD-Based Genotype and Haplotype Inference from Low-Coverage Short Sequencing Reads

Detecting copy number variations using paired-end sequence data

Algorithms for Genotype and Haplotype Inference from Low-Coverage Short Sequencing Reads

EST sequence reads from chromatogram and base calling

Fast Imputation Using Medium- or Low-Coverage Sequence Data

Removal of low quality reads

Shimla Short Break with SOTC Holidays

Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads

Algorithms for Genotype and Haplotype Inference from Low-Coverage Short Sequencing Reads