A clustering method for repeat analysis in DNA sequences. Presented by Kim Woo-yeon (김우연), first-year master's student, Molecular Biology & Phylogeny Laboratory. Paper by Natalia Volfovsky, Brian J Haas and Steven L Salzberg, The Institute for Genomic Research, USA. Genome Biology 2001.
A clustering method for repeat analysis in DNA sequences
Molecular Biology & Phylogeny Laboratory
Kim Woo-yeon (김우연), first-year master's student
A clustering method for repeat analysis in DNA sequences
• Natalia Volfovsky, Brian J Haas and Steven L Salzberg
• The Institute for Genomic Research, USA
• Genome Biology 2001
Pusan Bioinformatics & Biocomplexity Research Center
Abstract
Suffix Trie
• Definition
  • Tree: a finite set consisting of one or more nodes
  • Suffix: the longest substring starting at a given position
  • Suffix trie: a trie that represents all suffixes
• Example: T = ababa# (positions 123456)
[Figure: suffix trie of T, one character per edge, with leaves labeled by suffix start positions 1 to 6]
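The definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names and the nested-dict representation (with the key `None` marking a leaf) are assumptions made for the example.

```python
LEAF = None  # hypothetical marker key: stores the 1-based start of a suffix


def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1  # the suffix starting here ends at this node
    return root


def leaf_positions(node):
    """Collect all suffix start positions stored below `node`."""
    positions = []
    for key, child in node.items():
        if key is LEAF:
            positions.append(child)
        else:
            positions.extend(leaf_positions(child))
    return sorted(positions)


def contains(trie, pattern):
    """A pattern occurs in the text iff it labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ababa#")
print(leaf_positions(trie))   # one leaf per suffix of T
print(contains(trie, "aba"))
```

Because the terminator # occurs only once, every suffix ends at its own leaf. The trie needs quadratic space in the worst case, which is why the compacted form on the next slide is preferred.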
Suffix Tree
• Definition
  • Suffix tree: a compacted trie that represents all suffixes
• Example: T = ababa# (positions 123456), P = aba
  • Edge: label
  • Internal node
  • Sibling edge
  • Leaf node <=> Suffix
[Figure: suffix tree of T, with edge labels a, ba, # and ba# and leaves 1 to 6]
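A compacted trie can be obtained, for illustration only, by building the suffix trie and collapsing every chain of single-child nodes into one labeled edge. The sketch below does this naively in quadratic time; it is not the linear-time construction that the tools named later in the slides would use, and all names in it are assumptions.

```python
LEAF = None  # hypothetical marker key: stores the 1-based start of a suffix


def build_suffix_tree(text):
    """Naive construction: build the suffix trie, then merge unary paths."""
    trie = {}
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1

    def compact(node):
        tree = {}
        for key, child in node.items():
            if key is LEAF:
                tree[LEAF] = child
                continue
            label = key
            # Absorb nodes that have exactly one child and are not leaves.
            while len(child) == 1 and LEAF not in child:
                (next_ch, next_child), = child.items()
                label += next_ch
                child = next_child
            tree[label] = compact(child)
        return tree

    return compact(trie)


def leaves(node):
    """Suffix start positions stored below `node`."""
    found = []
    for label, child in node.items():
        if label is LEAF:
            found.append(child)
        else:
            found.extend(leaves(child))
    return sorted(found)


def find_occurrences(tree, pattern):
    """Walk down edge labels; the leaves below the match are the positions."""
    node, rest = tree, pattern
    while rest:
        for label, child in node.items():
            if label is LEAF:
                continue
            if label.startswith(rest):      # pattern ends inside this edge
                node, rest = child, ""
                break
            if rest.startswith(label):      # consume the whole edge
                node, rest = child, rest[len(label):]
                break
        else:
            return []                        # no edge matches
    return leaves(node)


tree = build_suffix_tree("ababa#")
print(sorted(tree.keys()))            # edge labels leaving the root
print(find_occurrences(tree, "aba"))  # start positions of P in T
```

For T = ababa# and P = aba, the walk follows the edges a and ba and reaches an internal node whose leaves are 1 and 3, the two start positions of P in T.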
Example
T = ATGATGC# (positions 12345678)
[Figure: suffix tree of T, with edges such as ATG, TG, G, C#, ATGC# and # leading to leaves 1 to 8]
Numerous methods for detecting repeats
• RepeatMasker
  • Uses a database of known repeat sequences and implements a string-matching algorithm
• MaskerAid
  • Same approach
  • More rapid than RepeatMasker
• WU-BLAST
  • Uses the BLAST engine
• Based on suffix trees
  • RepeatMatch, REPuter, RepeatFinder
  • Find all exact repeats
  • Sequences of 10-100 megabases (Mb)
Definitions
• An exact repeat
  • A subsequence occurring in a DNA sequence at least twice
• A maximal repeat
  • An exact repeat that cannot be extended in either direction without incurring a mismatch
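The two definitions can be checked on a small sequence with a brute-force sketch. It compares every pair of positions directly instead of using a suffix tree, so it is only suitable for short inputs; the function name and the `min_len` parameter are assumptions made for the example.

```python
def maximal_repeat_pairs(seq, min_len=2):
    """Return (pos1, pos2, length) for every maximal exact repeat pair.

    Positions are 1-based. A pair is kept only if the two copies cannot be
    extended to the left or to the right without a mismatch.
    """
    n = len(seq)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Not left-maximal: the preceding characters also match.
            if i > 0 and seq[i - 1] == seq[j - 1]:
                continue
            # Extend to the right as far as the two copies agree.
            length = 0
            while j + length < n and seq[i + length] == seq[j + length]:
                length += 1
            if length >= min_len:
                pairs.append((i + 1, j + 1, length))
    return pairs


print(maximal_repeat_pairs("ATGATGC"))
```

On ATGATGC the only pair reported is ATG at positions 1 and 4. The shorter repeats TG and G are exact repeats but not maximal, because both copies can be extended to the left. Note that this sketch allows the two copies to overlap.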
Exact repeats
Definition of repeats
Algorithm description
• Uses either of two suffix-tree methods
  • RepeatMatch, REPuter
• Based on first identifying all exact repeats
• Defines repeat classes by merging and extending
  • Step1: Selection and pre-processing
  • Step2: Merging procedure
  • Step3: Classification
  • Step4: BLAST searches and repeat class updates
STEP1: Selection and pre-processing
The output of RepeatMatch or REPuter is interpreted as a partition of the original genome sequence.
F: forward
RC: reverse complement
l: length
STEP2: Merging procedure
Merges two exact repeats that either overlap or occur within a limited distance (a gap) of each other.
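The merging rule can be sketched as interval merging with a tolerance. This is a simplified illustration that works on one list of (start, end) coordinates; the published procedure may impose further conditions (for example on both copies of a repeat pair), and the function name is an assumption. The default gap of 25 mirrors the value given on the microbial genomes slide.

```python
def merge_with_gap(intervals, gap=25):
    """Merge intervals that overlap or lie at most `gap` bases apart.

    Intervals are (start, end) pairs with inclusive coordinates.
    """
    merged = []
    for start, end in sorted(intervals):
        # Number of bases strictly between the previous interval and this
        # one; the value is negative when the two intervals overlap.
        if merged and start - merged[-1][1] - 1 <= gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


repeats = [(100, 140), (130, 170), (190, 230), (400, 450)]
print(merge_with_gap(repeats, gap=25))
print(merge_with_gap(repeats, gap=0))
```

With a gap of 25 the first three intervals collapse into (100, 230), since 19 bases separate 170 and 190. With a gap of 0 only the overlapping pair is merged.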
STEP3: Classification
One step of the classification procedure
STEP4: BLAST searches and further merging
If a class appears in multiple similarity pairs, all these similar classes are merged with the original class.
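One way to picture this class update is a union-find (disjoint-set) structure over class indices. The sketch assumes that similarity pairs have already been extracted from the BLAST results as pairs of class indices and that merging is transitive; both points, along with the function name, are assumptions rather than details from the slides.

```python
def merge_classes(num_classes, similarity_pairs):
    """Group classes connected through any chain of similarity pairs."""
    parent = list(range(num_classes))

    def find(x):
        # Follow parent links to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similarity_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller index as the representative class.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for c in range(num_classes):
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())


# Class 2 appears in two similarity pairs, so classes 0, 2 and 5 end up merged.
print(merge_classes(6, [(0, 2), (2, 5), (3, 4)]))
```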
Repeat analysis of microbial genomes
Minimum exact repeat length: 25 bp
Gap: 25 bp
Prototype repeat sequences
• Prototype
  • The most representative element for each class
Finding new HERVs by Suffix Tree