Panagiotis Papapetrou*, Gary Benson ** and George Kollios * *Department of Computer Science

Discovering Frequent Poly-Regions in DNA Sequences Panagiotis Papapetrou*, Gary Benson** and George Kollios* *Department of Computer Science **Departments of Biology and Computer Science Boston University

Introduction and Motivation (1/3) • In cells, DNA forms up long chains made up of four chemical units, known as nucleotides. • A number of important regions (known functional regions) at both large and small scales, contain a high occurrence of one or more nucleotides (calledpoly-regions). • Example: • Poly-A: a region “rich” in nucleotide A. • Poly-(C,T): a region “rich” in nucleotides C and T. • Many methods have addressed the problem. However, they only focus on specific types of poly-regions.

Introduction and Motivation (2/3) • Isochores: • Multi-megabase regions of genomic sequence. • Specifically, GC-rich or GC-poor. • CpG islands: • Regions of several hundreds of nucleotides rich in the dinucleotide CpG. • The level of methylation of the cystine (C) is associated with gene expression in nearby genes. • Protein binding regions: • Tens of nucleotides long. • DNA flexibility: dinucleotide, base-step composition.

Introduction and Motivation (3/3) Chromosome: (Nucleodite A) (Nucleodite C) (Nucleodite G) 3’ 5’ Position in the Gene

Related Work • Statistical Methods: • Maximum Likelihood Estimation (MLE) of segments (Auger et. al. 1989, Bement et. al 1977, Fu et. al.1990). • Hidden Markov Chain Model (Churchill et. al. 1992). • Walking Markov Model (Ficket et. al. 1992). • Change-points: (Carlstein et. al. 1994, Braun et. al. 1998, et. al. 2000). • Hierarchical Segmentation (Grosse et. al. 2002, Galvan et. al. 2002 Zhang et. al. 2005).

Main Contributions • Formal definition of the problem of detecting poly-regions in a sequence. • Application of an existing recursive segmentation technique to solve the problem. • Development of an efficient algorithm based on multiple sliding windows. • Application of an efficient arrangement mining algorithm to extract the complete set of frequent arrangements of these regions. • Extensive experimental evaluation of our algorithms on the dog gemone.

Preliminaries (1/4) • Sequence: S = {s1, s2, …, sm}, an ordered set of items, defined over an alphabet. • In our case, sicorresponds to a nucleotide. • k Poly-Region: Hd,k = {I, pstart, pend} • k: number of items in the region. • d: density of the region. • starts and end with one of the k items. • each of the k items has at least (d/k)% frequencuy in the region.

Preliminaries (2/4) • Example of a k Poly-Region: • (1) poly-A • A: 8/10 • (2) poly-(A,C) • A: 4/10 • C: 4/10

Preliminaries (3/4) • Different types of relations1 can occur among Poly-Regions. 1. J. F. Allen and G. Ferguson. “Actions and events in interval temporal logic”. Technical Report 521, The University of Rochester, July 1994”.

Preliminaries (4/4) • k-Arrangement: a set of k temporally correlated events in an e-sequence, denoted as A = {E , R}, where: • E : the set of labels of the event intervals in the arrangement. • R : the set of temporal relations between the events in E. • where is the temporal relation between Eiand Ej.

Problem Statement • Given: • A sequence database S. • A minimum density constraint d. • A range [min, max]. • Find the complete set of maximal poly-regions H of size [min, max], and density of at least d %. • Given: • The set of maximal poly-regions H. • A minimum support threshold min_sup. • Extract the complete set of frequent arrangements of poly-regions in H.

Recursive Segmentation (1/2) • Recursively segment the sequence • Homogeneity difference between each segment in maximized with respect to a measure λ. • In our case we use the Jensen-Shannon Entropy (JSE). • Split point is chosen, where JSE is maximized. • Segmentation of a subsequence is stopped when minimum poly-region size is reached. • Expand each segment to define poly-regions.

Recursive Segmentation (2/2) • To improve the efficiency of the segmentation: • When looking for H-regions of two nucleotides replace the rest of the nucleotides with a single literal. • Example: • S = AAACCCAGGTAGCT • Looking for poly-(A,C): • Snew = AACCCAXXXAXCX

Sliding Windows (1/3) • Define a set of sliding windows W = {w1, w2, …, wN} over the sequence • # of windows: N = max – min + 1. • size of window i : min + i -1. • Each window keeps statistics of a segment: • For each nucleotide: the # of occurrences in the segment. • Each window w = {C, Start, End} • C: set of statistics. • Start: pointer to the start point of the segment in the sequence. • End: pointer to the end point of the segment in the sequence.

Sliding Windows (2/3)

Sliding Windows (3/3) • Heuristic: • NCi = # of items of type C in window i. • C is dense in wi if NCi / |wi| >= d. • Observe: the maximum size of the window where items of type C can fit and fulfill the density constraint is NCi / d. • This indicates which windows of the lower level should be searched for a candidate poly-region. • Start with the window of size max, and for each literal apply the heuristic. • Move to the lower levels.

Frequent Arrangement Mining Algorithm • Use a sliding window W of size M>>max. • At each position of W find the set of arrangements in W. • Keep a global frequency of all arrangements. • Update after each slide. • For the enumeration, the arrangement enumeration tree is used.

Experimental Setup • 39 Chromosomes. • Organism: Canis Familiaris (Dog). • Two phases: • Extract poly-regions. • Discover frequent arrangements of poly-regions in the DNA sequences. • Density constraint d varied between 40-80%. • H-region size varied between 8-64 nucleotides.

Performance Analysis • Recursive Segmentation: • Has accuracy of 85-90%. • Performs better in smaller sequences. • Sliding Window: • Extracts the complete set of poly-regions. • Faster in terms of run time.

Sample Results (1/2) Chromosome 1 (Canis Familiaris)

Sample Results (2/2) Frequent Arrangements

Conclusions • The problem of discovering poly-regions and their frequent arrangements in DNA sequences has been introduced. • Two efficient methods for solving the problem have been discussed. • Recursive Segmentation: approximate. • Sliding Windows: exact. • An efficient algorithm for mining frequent arrangements of intervals has been applied to the extracted poly-regions.

Future Work • Generalize the definition of a poly-region. • Poly-regions of dinucleotides/trinucleotides. • Poly-patterns. • Detect arrangements of poly-regions that occur frequently over: • Coding regions (Genes). • Nucleosomes.

Panagiotis Papapetrou*, Gary Benson ** and George Kollios * *Department of Computer Science