Low-complexity and Repetitive Regions

Low-complexity and Repetitive Regions • OraLee Branch • John Wootton • NCBI • branch@ncbi.nlm.nih.gov

Sequence Composition • DNA Sequences • What would be the expected number of occurrences of a particular sequence in a genome? • Size: human genome, 6 × 10^9 positions considering both strands • Base frequency: equal • Sequence length: 20 nucleotides • Bernoulli model: 6 × 10^9 / 4^20 ≈ 0.005 • But: (GT)n with n > 10 occurs about 10^5 times
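The Bernoulli estimate on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures given on the slide and ignores edge effects; the variable names are illustrative.

```python
# Expected count of one specific 20-mer under an equal-frequency Bernoulli model.
genome_positions = 6 * 10**9   # human genome, both strands (figure from the slide)
k = 20                         # sequence length in nucleotides
p_match = 0.25 ** k            # each position matches with probability 1/4

expected = genome_positions * p_match
print(f"{expected:.4f}")       # about 0.005, i.e. far fewer than one occurrence
```

Against this expectation, the roughly 10^5 observed (GT)n tracts show how strongly simple repeats depart from the random model.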

Low-complexity Regions • Simple Sequence Repeats (SSR) • MICRO- or MINISATELLITES • Regions that have significant biases in AA or nucleotide composition: repeats of simple motifs • (GT)n (AAC)n (P)n (NANP)n • Low-Complexity Regions/Segments • Complexity can be measured by Shannon’s Entropy • Regarding an amino acid sequence • For each composition of a complexity state, there exists a large number of possible sequences
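To make the entropy measure concrete, here is a minimal sketch that computes Shannon entropy (in bits per residue) from the residue composition of a short sequence. The function name and the test strings are illustrative and not taken from SEG itself.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy (bits per residue) of the residue composition of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    # sum of p * log2(1/p) over the residue types present
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())

for s in ("PPPPPPPP", "NANPNANP", "GTGTGTGTGTGT", "ACDEFGHIKLMN"):
    print(s, round(shannon_entropy(s), 3))
```

The value depends only on composition, not on residue order, so a perfect (GT)n repeat and a shuffled string with the same G and T counts get the same entropy.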

Low-Complexity Regions • Locally abundant residues may be • continuous or loosely clustered, irregular or aperiodic • >25% of AA in currently sequenced genomes are in LC regions • non-globular domains, SSR • Examples: myosins, pilins, segments in antigens, short subsequences of 10-50 residues with unknown function • Beta-pleated sheets • Alpha helices • Coiled-coils

Detecting Low-Complexity • SEG and PSEG/NSEG algorithms • Wootton and Federhen • Methods in Enzymology 266:33 (1996) • Computers and Chemistry 17:149 (1993) • SEG • UNIX executable available on NCBI servers • seg FASTAfile Window TriggerComplexity Extension, where TriggerComplexity is K2(1) and Extension is K2(2) • Longer window lengths define more sustained regions, but overlook short biased subsequences

clobber> seg hu.piron.fa 12 2.20 2.50
>gi|730388|sp|P40250|PRIO_CERAE MAJOR PRION PROTEIN PRECURSOR (PRP)

                               1-49     MANLGCWMLVVFVATWSDLGLCKKRPKPGG
                                        WNTGGSRYPGQGSPGGNRY
ppqggggwgqphgggwgqphgggwgqphgg 50-86
                       gwgqggg
                               87-104   THNQWHKPSKPKTSMKHM
       agaaaagavvgglggymlgsams 105-127
                               128-179  RPLIHFGNDYEDRYYRENMYRYPNQVYYRP
                                        VDQYSNQNNFVHDCVNITIKQH
                tvttttkgenftet 180-193
                               194-228  DVKMMERVVEQMCITQYEKESQAYYQRGSS
                                        MVLFS
              sppvillisflifliv 229-244
                               245-245  G

clobber> seg hu.piron.fa 12 2.20 2.50 -l
>gi|730388|sp|P40250|PRIO_CERAE(50-86) complexity=1.90 (12/2.20/2.50)
ppqggggwgqphgggwgqphgggwgqphgggwgqggg
>gi|730388|sp|P40250|PRIO_CERAE(105-127) complexity=2.47 (12/2.20/2.50)
agaaaagavvgglggymlgsams
>gi|730388|sp|P40250|PRIO_CERAE(180-193) complexity=2.26 (12/2.20/2.50)
tvttttkgenftet
>gi|730388|sp|P40250|PRIO_CERAE(229-244) complexity=2.50 (12/2.20/2.50)
sppvillisflifliv

SEG on the prion example with different window lengths • question-based • exploratory tool • optimization step

Detecting Low-Complexity • Intuitive explanation • Take a 20-residue-long sequence • (20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0) • (1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1) • (3 3 3 3 3 2 2 2 2 1 1 0 0 0 0 0 0 0 0 0) • Complexity can be described by Shannon’s Entropy (K2) • Regarding an amino acid sequence • For each composition of a complexity state, there exists a large number of possible sequences (K1)
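The composition vectors on this slide can be explored numerically. This is a minimal sketch with illustrative function names. It computes the Shannon entropy of each vector and the number of distinct sequences that share one fixed assignment of those counts to residue types (a multinomial coefficient, the quantity the slide associates with K1). The third vector, as transcribed, sums to 25 rather than 20; the functions simply use whatever total a vector has.

```python
import math

def k2_entropy(composition):
    """Shannon entropy (bits) of a vector of residue counts."""
    n = sum(composition)
    return sum((c / n) * math.log2(n / c) for c in composition if c > 0)

def sequences_per_composition(composition):
    """Multinomial coefficient n! / (c1! * c2! * ...) for one fixed composition."""
    count = math.factorial(sum(composition))
    for c in composition:
        count //= math.factorial(c)
    return count

examples = [
    [20] + [0] * 19,                               # one residue type only
    [1] * 20,                                      # every residue type once
    [3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1] + [0] * 9,   # intermediate case from the slide
]
for comp in examples:
    print(round(k2_entropy(comp), 2), sequences_per_composition(comp))
```

The homopolymer has zero entropy and exactly one possible sequence, while the uniform vector reaches the maximum log2(20) bits with 20! possible orderings.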

How SEG works • seg FASTAfile Window TriggerComplexity Extension, where TriggerComplexity is K2(1) and Extension is K2(2) • Looks within each window of the given length: if complexity < K2(1), the segment is triggered and then extended for as long as complexity stays below K2(2) • Uniform prior probabilities • The protein sequence database is a heterogeneous statistical mixture, so the initially unknown AA frequencies in low-complexity subsets need have no similarity to the frequencies in the total database • Unbiased view of low-complexity regions • Gives equiprobable compositions for any complexity state
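The trigger-and-extend idea can be sketched in a few lines. This is a simplified illustration, not the SEG implementation: it uses plain Shannon entropy per window, treats the thresholds as inclusive (an assumption), merges overlapping hits, and omits the optimization step that SEG applies afterwards. The function names and the toy sequence are hypothetical.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (bits) of one window of residues."""
    n = len(window)
    return sum((c / n) * math.log2(n / c) for c in Counter(window).values())

def seg_like(seq, window=12, k2_trigger=2.2, k2_extend=2.5):
    """Return 1-based inclusive (start, end) intervals of low-complexity residues."""
    if len(seq) < window:
        return []
    ent = [window_entropy(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    segments = []
    i = 0
    while i < len(ent):
        if ent[i] <= k2_trigger:                     # trigger window (inclusive: assumption)
            lo = hi = i
            while lo > 0 and ent[lo - 1] <= k2_extend:
                lo -= 1                              # extend to the left
            while hi + 1 < len(ent) and ent[hi + 1] <= k2_extend:
                hi += 1                              # extend to the right
            start, end = lo + 1, hi + window         # window indices -> residue positions
            if segments and start <= segments[-1][1] + 1:
                segments[-1] = (segments[-1][0], max(segments[-1][1], end))
            else:
                segments.append((start, end))
            i = hi + 1
        else:
            i += 1
    return segments

toy = "ACDEFGHIKLMNPQRSTVWY" + "Q" * 15 + "ACDEFGHIKLMNPQRSTVWY"
print(seg_like(toy))
```

Because whole windows are reported, the intervals here overshoot into the flanking residues; trimming that overshoot is one reason SEG follows the extension with an optimization step.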

How SEG works, continued • How do you correct for the background AA/nuc composition bias? • After randomly shuffling all the residues, determine the trigger complexity that results in 4% of the database being within low-complexity regions • Then use this trigger complexity and subtract 4% from the %AA found in low-complexity regions
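The calibration step can be approximated as follows. This is a rough sketch under stated assumptions: residues are pooled across all sequences before shuffling, and "4% of the database" is approximated by the 4th percentile of per-window entropies rather than by the fraction of residues inside final segments. The function names and the toy database are hypothetical.

```python
import math
import random
from collections import Counter

def window_entropies(seq, window):
    """Shannon entropy (bits) of every full window, stepping one residue at a time."""
    out = []
    for i in range(len(seq) - window + 1):
        counts = Counter(seq[i:i + window])
        out.append(sum((c / window) * math.log2(window / c) for c in counts.values()))
    return out

def calibrate_trigger(sequences, window=12, target_fraction=0.04, seed=0):
    """Entropy threshold at which about target_fraction of shuffled windows trigger."""
    rng = random.Random(seed)
    pooled = list("".join(sequences))   # assumption: shuffle across the whole database
    rng.shuffle(pooled)
    entropies = sorted(window_entropies("".join(pooled), window))
    if not entropies:
        raise ValueError("database is shorter than one window")
    index = max(0, math.ceil(target_fraction * len(entropies)) - 1)
    return entropies[index]

# Hypothetical toy database; a real run would use the cleaned protein set.
toy_db = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "NANPNANPNANPNANPNANP", "GGSGGSGGSGGSAVLKE"]
trigger = calibrate_trigger(toy_db)
print(round(trigger, 2))
```

With the threshold fixed this way, roughly 4% of the low-complexity signal in the real database is attributable to background composition alone, which is the amount the slide says to subtract.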

Detecting Low-complexity with repetitive motif: SSR • PSEG or NSEG • Repetition of residue types or k-grams • Period 3: (n V E n K N n V D n K D n V N n K S n K) • (n m i n m i n m i n m i n m i n m i n m) • (n m E n m N n m D n m D n m N n m S n m) • Sliding window along sequence in single residue steps
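One way to see the period-3 structure in the example above is to split the sequence into three phases and compute the entropy of each phase separately. This is an illustrative sketch only, not the PSEG algorithm; the function names are hypothetical and the sequence is the slide's example written in upper case.

```python
import math
from collections import Counter

def entropy(residues):
    """Shannon entropy (bits) of a collection of residues."""
    n = len(residues)
    return sum((c / n) * math.log2(n / c) for c in Counter(residues).values())

def phase_entropies(seq, period):
    """Entropy of the residues found at each phase (position modulo period)."""
    return [entropy(seq[phase::period]) for phase in range(period)]

seq = "NVENKNNVDNKDNVNNKSNK"   # the period-3 example from the slide
for phase, h in enumerate(phase_entropies(seq, 3)):
    print(phase, seq[phase::3], round(h, 2))
```

Phase 0 is invariant (all N), phase 1 alternates between two residues, and phase 2 is the most variable, which matches the "n m i" pattern on the slide.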

Evolutionary Mechanisms • Evolution of sequences in general • Evolution rate of 10^-5 to 10^-9 • Base pair substitution (10^-9) • Insertions/deletions • Recombination • In SSR and low-complexity regions, mutations are changes in length, with steps typically +/- one repeat unit • Evolution rate 10^-3 • Biased nucleotide substitution due to increased recombination in repetitive regions • Unequal crossing over (recombination) • Replication slippage • Alignment of repeats does not imply relationship/ancestry
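The length-change behaviour described for SSRs resembles a stepwise mutation process, which can be sketched as a small simulation. The slide does not state the units of the 10^-3 rate; treating it as a per-locus, per-generation probability is an assumption here, as are the function name, the symmetric +/- 1 step, and the minimum length of one repeat unit.

```python
import random

def simulate_repeat_length(start_units, generations, rate=1e-3, seed=0):
    """Follow one repeat tract; each generation it may gain or lose one unit."""
    rng = random.Random(seed)
    units = start_units
    for _ in range(generations):
        if rng.random() < rate:              # a slippage event occurs (rate units assumed)
            units += rng.choice((-1, 1))     # step of one repeat unit, either direction
            units = max(1, units)            # assumption: tract never shrinks below one unit
    return units

# Ten independent tracts, each starting at 15 units, after 10,000 generations.
print([simulate_repeat_length(15, 10_000, seed=s) for s in range(10)])
```

Independent lineages drift to different lengths and can also converge on the same length by chance, which is one reason aligned repeats do not by themselves imply common ancestry.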

Low-Complexity and BLAST searches • Without masking, BLAST search results are dominated by hits to low-complexity regions because of their biased AA/nuc composition • BLAST added “mask low-complexity” by default • SEG parameters: 12 2.2 2.5 • BLAST now also uses a compositional bias filter on the whole database • Masks compositionally biased regions using seg 10 1.8 2.1 • YOU MAY WANT TO TURN THESE OPTIONS OFF and use your own organism-specific seg parameters when doing protein homology searching • YOU WILL NEED TO TURN THESE OPTIONS OFF if you are interested in looking at sequence similarities of repetitive/low-complexity regions.
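Masking itself is a simple operation once the low-complexity intervals are known. The sketch below replaces residues in given 1-based inclusive intervals with "x", in the spirit of the seg -x output; the function name and toy input are illustrative, and this is not BLAST's own filtering code.

```python
def mask_segments(seq, segments, mask_char="x"):
    """Replace residues inside 1-based inclusive (start, end) intervals with mask_char."""
    residues = list(seq)
    for start, end in segments:
        for i in range(max(start, 1) - 1, min(end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)

# Toy example: mask positions 5-12 of a short peptide.
print(mask_segments("MKTAQQQQQQQQLVE", [(5, 12)]))
```

Masked residues cannot seed or extend alignments, which is why masking has to be turned off when the repetitive regions themselves are the object of study.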

Example: Plasmodium falciparum • Using whole-genome sequences is important to limit PCR sequencing bias for antigens: hydrophilic proteins • Considering GC-content / AA bias • P. falciparum is approximately 28% GC • Visualization of individual proteins

A helpful tool here and in general • SEALS: A System for Easy Analysis of Lots of Sequences, R. Walker and E. Koonin, NCBI • www.ncbi.nlm.nih.gov/CBBresearch/Walker/SEALS/index.html • Demonstrate getting an appropriate data set • Taxnode2gi, gi2fasta • Daffy • Purge • Gref • Fanot • Use cleaned data set of P. falciparum proteins

Protein Analysis • Setting the trigger complexity: • Dbcomp • Shuffledb • Seg • Run SEG on P. falciparum MSP1, PfEMP2, Cg2 • Options • -p (tree form output) • -l (only report Low-C segs) • -h (don’t report Low-C segs) • -x (substitute Low-C with x) • Run PSEG on P. falciparum MSP1, PfEMP2, Cg2 with different -z (periodicity)

Usefulness of studying Low-Complexity • Within a protein • secondary structure, homology searches, protein location • genetic disorders • Within taxa • microsatellite markers • polymorphism • comparisons between proteins • Between taxa • synteny, orthologs • different selection pressures upon different organisms • parasites: immunogenicity, rapid evolution of antigens, recombination
