A clustering method for repeat analysis in DNA sequences

Molecular Biology & Phylogeny Laboratory, first-year master's student 김우연

A clustering method for repeat analysis in DNA sequences
• Natalia Volfovsky, Brian J Haas and Steven L Salzberg
• The Institute for Genomic Research, USA
• Genome Biology 2001
Pusan Bioinformatics & Biocomplexity Research Center

Abstract

Suffix Trie
• Definition
  • Tree: a finite set consisting of one or more nodes
  • Suffix: the longest substring starting at each position
  • Suffix trie: a trie representing all suffixes
• Example: T = ababa# (positions 1-6)
[Figure: suffix trie of T; each leaf is labeled with the start position (1-6) of its suffix]
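The definition above can be tried out on the slide's example. This is a minimal sketch, not the implementation used by any of the tools discussed later: the nested-dict representation, the `"leaf"` marker key and the function names are assumptions made for illustration, and the construction is the naive quadratic one.

```python
def build_suffix_trie(text):
    """Build a suffix trie as nested dicts, one character per edge.

    The multi-character key "leaf" (an illustrative choice) stores the
    1-based start position of the suffix that ends at that node.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def leaf_positions(node):
    """Collect the start positions of all suffixes below a node."""
    positions = []
    for key, child in node.items():
        if key == "leaf":
            positions.append(child)
        else:
            positions.extend(leaf_positions(child))
    return sorted(positions)


def occurrences(trie, pattern):
    """Return the 1-based positions where pattern occurs in the text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    return leaf_positions(node)


trie = build_suffix_trie("ababa#")
print(leaf_positions(trie))      # [1, 2, 3, 4, 5, 6]
print(occurrences(trie, "aba"))  # [1, 3]
```

Because the terminator # occurs only once, every suffix ends at its own leaf, so the trie for T = ababa# has exactly six leaves.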

Suffix Tree
• Definition
  • Suffix tree: a compacted trie representing all suffixes
• Example: T = ababa# (positions 1-6), P = aba
  • Edge: label
  • Internal node
  • Sibling edge
  • Leaf node <=> Suffix
[Figure: suffix tree of T with edge labels a, ba, ba# and #; leaves numbered 1-6]
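To make "compacted trie" concrete, the sketch below builds the plain trie for T = ababa# and then merges every chain of single-child nodes into one edge with a string label. It is a naive illustration under assumed names (`build_trie`, `compact`, the `"leaf"` marker); practical tools use linear-time constructions instead.

```python
def build_trie(text):
    """Plain suffix trie as nested dicts; "leaf" holds the 1-based start."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def compact(node):
    """Merge chains of single-child nodes so each edge carries a string label."""
    out = {}
    for label, child in node.items():
        if label == "leaf":
            out["leaf"] = child
            continue
        # Follow the chain while the child has exactly one outgoing edge
        # and is not itself a leaf.
        while len(child) == 1 and "leaf" not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = compact(child)
    return out


tree = compact(build_trie("ababa#"))
print(sorted(tree))       # ['#', 'a', 'ba']
print(tree["a"]["ba"])    # {'ba#': {'leaf': 1}, '#': {'leaf': 3}}
```

The node reached by reading a and then ba spells the pattern P = aba, and the leaves below it (1 and 3) are the positions where P occurs in T.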

Example
T = ATGATGC# (positions 1-8)
[Figure: suffix tree of T with edge labels ATG, TG, G, ATGC#, C# and #; leaves labeled with suffix start positions 1-8]

Numerous methods for detecting repeats
• RepeatMasker
  • Uses a database of known repeat sequences and implements a string-matching algorithm
• MaskerAid
  • Same approach
  • More rapid than RepeatMasker
  • WU-BLAST: uses the BLAST engine
• Based on suffix trees
  • RepeatMatch, REPuter, RepeatFinder
  • Find all exact repeats
  • Sequences of 10-100 megabases (Mb)

Definitions
• An exact repeat
  • A subsequence occurring in a DNA sequence at least twice
• A maximal repeat
  • An exact repeat that can't be extended in either direction without incurring a mismatch
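The two definitions can be checked on a small sequence with a brute-force scan. This is only a sketch for illustration: it compares every pair of positions in quadratic time, whereas the suffix tree tools find the same repeats far more efficiently. The function name and the `min_len` parameter are assumptions.

```python
def maximal_repeat_pairs(seq, min_len=2):
    """Return (i, j, length) for every maximal exact repeat pair (0-based, i < j)."""
    n = len(seq)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Not left-maximal if the preceding characters match:
            # the pair could be extended to the left.
            if i > 0 and seq[i - 1] == seq[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of the sequence.
            length = 0
            while j + length < n and seq[i + length] == seq[j + length]:
                length += 1
            if length >= min_len:
                pairs.append((i, j, length))
    return pairs


print(maximal_repeat_pairs("ATGATGC", min_len=3))  # [(0, 3, 3)]
```

In ATGATGC the substring TG also occurs twice, so it is an exact repeat, but it is not maximal: both copies can be extended one base to the left to give ATG.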

Exact repeats

Definition of repeats

Algorithm description
• Uses either of two suffix tree methods
  • RepeatMatch, REPuter
• Based on first identifying all exact repeats
• Defines repeat classes by merging and extending
  • Step 1: Selection and pre-processing
  • Step 2: Merging procedure
  • Step 3: Classification
  • Step 4: BLAST searches and repeat class updates

STEP1: Selection and pre-processing
Interpreting the output of RepeatMatch or REPuter as a partition of the original genome sequence
F: forward
RC: reverse complement
l: length
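The F and RC labels distinguish repeats whose second copy matches directly from those whose second copy matches on the opposite strand. The sketch below shows what an RC match of length l means. The function names and the (start1, start2, length) arguments are hypothetical and do not reproduce the actual output format of RepeatMatch or REPuter.

```python
# Translation table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_rc_repeat(genome, start1, start2, length):
    """True if the copy at start2 is the reverse complement of the copy at start1.

    Positions are 0-based; this argument layout is an assumption for illustration.
    """
    first = genome[start1:start1 + length]
    second = genome[start2:start2 + length]
    return first == reverse_complement(second)


print(reverse_complement("ATGC"))             # GCAT
print(is_rc_repeat("ATGCAAGCAT", 0, 6, 4))    # True
```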

STEP2: Merging procedure
Merging two exact repeats that either overlap or occur within a limited distance (a gap) of each other
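The merging rule can be sketched as interval merging with a gap tolerance. This is a simplified illustration, not the paper's full procedure: it treats each repeat copy as a 0-based half-open (start, end) interval on one sequence, and the function name and coordinates are assumptions. The default gap of 25 bp is the value quoted later for the microbial genome analysis.

```python
def merge_repeats(intervals, gap=25):
    """Merge (start, end) intervals that overlap or lie within `gap` bases."""
    merged = []
    for start, end in sorted(intervals):
        # A negative distance means the two intervals overlap.
        if merged and start - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


copies = [(100, 130), (120, 160), (180, 210), (400, 430)]
print(merge_repeats(copies, gap=25))  # [(100, 210), (400, 430)]
```

Here the first two intervals overlap, the third starts 20 bases after the merged region ends, and the fourth is 190 bases away, so it stays separate.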

STEP3: Classification
[Figure: one step of the classification procedure]

STEP4: BLAST searches and further merging
If a class appears in multiple similarity pairs, all these similar classes are merged with the original class.
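Merging every class that is linked to another through a similarity pair amounts to finding connected components, which a union-find structure handles compactly. The sketch below assumes the similarity pairs have already been extracted from the BLAST results; the class names and the function name are hypothetical.

```python
def merge_classes(class_ids, similarity_pairs):
    """Group classes that are connected through similarity pairs."""
    parent = {c: c for c in class_ids}

    def find(c):
        # Walk up to the representative, shortening the path on the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in similarity_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for c in class_ids:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(group) for group in groups.values())


classes = ["C1", "C2", "C3", "C4", "C5"]
pairs = [("C1", "C2"), ("C1", "C4")]
print(merge_classes(classes, pairs))  # [['C1', 'C2', 'C4'], ['C3'], ['C5']]
```

C1 appears in two similarity pairs, so C2 and C4 are both merged into its class, while C3 and C5 remain on their own.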

Repeat analysis of microbial genomes
Minimal exact repeat length: 25 bp
Gap: 25 bp

Prototype repeat sequences
• Prototype
  • The most representative element for each class


Finding new HERVs by Suffix Tree
