De Novo Repeat Classification and Fragment Assembly • Presented by 김우연 (Kim Woo-yeon), first-year master's student
Repeat-Related Programs
• Repeat annotation with libraries
  • RepeatMasker (A.F.A. Smit and P. Green, unpubl.)
  • MaskerAid (Bedell et al. 2000)
  • Limitation: no de novo compilation of repeat libraries
• Repeat analysis
  • RepeatMatch (Delcher et al. 1999)
  • REPuter (Kurtz et al. 2000, 2001)
  • RECON, RepeatFinder, LTR_STRUC
  • Limitation: no compact overview or summary of the repeat families
Genome Research • Received January 27, 2004 • Accepted in revised form June 29, 2004
CONTENTS
• Introduction
• Concepts
• Methods
• De Bruijn Graphs & A-Bruijn Graphs
• RepeatGluer Algorithm
• Constructing A-Bruijn Graphs Without the Similarity Matrix
• Fragment Assembly
• FragmentGluer Algorithm
• Results and Discussion
INTRODUCTION
• “The problem of automated repeat sequence family classification is inherently messy and ill-defined and does not appear to be amenable to a clean algorithmic attack” – Bao and Eddy (2002)
• One of the difficulties in repeat classification is that many repeats are mosaics of sub-repeats – Bailey et al. (2002)
• Aims
  • Propose a new approach to repeat classification
  • Introduce the FragmentGluer assembler
[Figure: An imaginary evolutionary process produces the final genome. Panels: the genomic dot-plot of the imaginary sequence; the repeat graph obtained by gluing the repeated regions.]
The idea of our approach
• Gluing aligned points of the dot-plot together transforms the repeats into the A-Bruijn graph
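The gluing idea can be illustrated with a small sketch. This is not the authors' implementation: the function names `glue_positions` and `glued_graph` are hypothetical, the aligned position pairs are assumed to be given, and union-find is used here only as a convenient way to merge positions. Each position of the sequence becomes a vertex, aligned positions are merged into one vertex, and consecutive positions are joined by an edge, so a repeated stretch shows up as an edge that is traversed more than once.

```python
from collections import Counter


def glue_positions(n, aligned_pairs):
    """Merge positions 0..n-1 that are linked by aligned pairs.

    Returns a list mapping each position to the smallest position
    in its glued class (used as the vertex label).
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the class representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in aligned_pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Always keep the smaller index as the representative.
            parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(n)]


def glued_graph(n, aligned_pairs):
    """Return vertex labels and edge multiplicities after gluing."""
    label = glue_positions(n, aligned_pairs)
    # One edge per pair of consecutive positions in the sequence.
    edges = Counter((label[i], label[i + 1]) for i in range(n - 1))
    return label, edges


# Toy input (an assumption for illustration): positions 1-2 repeat at 5-6.
label, edges = glued_graph(8, [(1, 5), (2, 6)])
print(label)
print(edges[(1, 2)])
```

In this toy input the edge between the glued vertices 1 and 2 has multiplicity 2, which is how a two-copy repeat appears in the glued graph.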
Mosaic repeat organization
• Example: a BAC from human Chromosome Y
• Repeat pairs found by REPuter and sub-repeats obtained by our division
• Repeat multigraph
• Repeat graph
• RepeatFinder vs. RECON vs. REPuter
De Bruijn Graphs & A-Bruijn Graphs
• De Bruijn graph of the sequence ACTGCTGCC
• [Figure: de Bruijn graph with labels ACT, CTG, TGC, GCT, GCC]
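A minimal sketch of how such a graph can be built for the slide's sequence ACTGCTGCC. It assumes that the five labels in the figure are vertices, that is, the distinct 3-mers of the sequence, with an edge between 3-mers that occur consecutively. Other conventions place the k-mers on the edges instead. The function name `de_bruijn` is hypothetical.

```python
from collections import Counter


def de_bruijn(sequence, k):
    """Build a de Bruijn-style graph whose vertices are the distinct k-mers.

    Returns the sorted vertex list and a Counter of directed edges,
    where the count is the number of times the edge is traversed.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    vertices = sorted(set(kmers))
    # Consecutive k-mers in the sequence overlap by k-1 letters.
    edges = Counter(zip(kmers, kmers[1:]))
    return vertices, edges


vertices, edges = de_bruijn("ACTGCTGCC", 3)
print(vertices)
for (u, v), multiplicity in sorted(edges.items()):
    print(u, "->", v, "x", multiplicity)
```

The repeated substring CTGC makes the edge CTG -> TGC appear twice, while every other edge is traversed once.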
De Bruijn Graphs & A-Bruijn Graphs
• A-Bruijn graph: … AT … ACT … ACAT …
Whirls & Bulges
• Gaps and mismatches are allowed in the alignments; they show up in the graph as whirls and bulges
RepeatGluer Algorithm
1. Construct the A-Bruijn graph
2. Eliminate whirls
3. Remove bulges
4. Erosion: remove all leaves
5. Straighten zigzag paths
6. Form the consensus sequence
7. Output the repeat families
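The erosion step can be sketched as follows. This is only an illustration under stated assumptions, not the RepeatGluer code: the graph is given as a plain list of edges, a leaf is taken to be a vertex touched by exactly one edge, and the function name `erode` and its `rounds` parameter are hypothetical.

```python
from collections import Counter


def erode(edges, rounds=1):
    """Remove all leaves from a multigraph given as a list of (u, v) edges.

    A leaf is a vertex of degree 1. Each round deletes every edge that
    touches a leaf; the loop stops early when no leaves remain.
    """
    edges = list(edges)
    for _ in range(rounds):
        degree = Counter()
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        leaves = {x for x, d in degree.items() if d == 1}
        if not leaves:
            break
        edges = [(u, v) for u, v in edges
                 if u not in leaves and v not in leaves]
    return edges


# Toy graph (an assumption for illustration): a triangle 0-1-2 with a tail 2-3.
print(erode([(0, 1), (1, 2), (2, 0), (2, 3)]))
```

On the toy graph, one round removes the tail edge and leaves the triangle untouched. Note that this sketch would also trim the two ends of a simple path, so a real implementation would need to protect genuine sequence ends.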
Constructing A-Bruijn Graphs Without the Similarity Matrix
• Construction of the A-Bruijn graph so far assumes that both S and A are given
• The A-Bruijn graph of S can also be constructed from a set of substrings { S1, …, St } of S
• The set should cover every pair of consecutive positions in S
• Each pair of substrings has a matrix of size |Si| x |Sj|
• Each such matrix is a snapshot of a “small” area of the matrix A
Notation
• S: a genomic sequence
• n: the length of S
• A: the n x n similarity matrix
• { S1, …, St }: a set of substrings of S
• |Si|: the length of the string Si
Fragment Assembly
• Assemblers
  • Phrap (Green 1994)
  • Celera assembler (Myers et al. 2000)
  • EULER assembler (Pevzner et al. 2001)
    • http://nbcr.sdsc.edu/euler
  • ARACHNE, Phusion, CAP, TIGR
• Building an accurate assembler
  • EULER + Phrap → EULER+
  • Combines EULER’s accuracy in analyzing repeats with Phrap’s ability to handle low-coverage regions, low-quality reads, and read ends
  • Uses less memory than the original EULER
  • FragmentGluer algorithm
FragmentGluer Algorithm
1. Construct the A-Bruijn graph of S
2. Eliminate whirls by splitting the composed vertices
3. Remove bulges
4. Erosion: remove all leaves
5. Straighten zigzag paths
6. Thread each read through the graph
7. Define the consensus sequence
8. Output the repeat families
• After step 6, transform mate-pairs into mate-paths
• Assemble the resulting contigs into scaffolds with the EULER Scaffolding algorithm
Benchmarking
• EULER produced the fewest misassembled contigs.
• EULER also had the fewest missing repeat copies (4), ahead of Phrap (5) and ARACHNE (9).
• Average coverage over 518 clones was 99.3% for Phrap, 98.8% for EULER, and 98.6% for ARACHNE.
• The average number of contigs per clone was lowest for EULER (6.2), followed by Phrap (6.8) and ARACHNE (13.8).
More research
• Analyze the consensus sequences produced by FragmentGluer
• Detect de novo HERVs from the FragmentGluer consensus sequences