Low-complexity and Repetitive Regions • OraLee Branch • John Wootton • NCBI • branch@ncbi.nlm.nih.gov
Sequence Composition • DNA Sequences • What would be the expected number of occurrences of a particular sequence in a genome? • Size: human genome, 6 × 10^9 positions considering both strands • Base frequency: equal (0.25 each) • Sequence length: 20 nucleotides • Bernoulli model: expected number ≈ 0.005 • But: • (GT)n with n > 10 occurs ≈ 10^5 times
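The Bernoulli estimate on this slide can be reproduced with a short calculation. This is a minimal sketch, assuming independent positions and equal base frequencies as stated on the slide; the function name `expected_occurrences` is illustrative and not part of any tool mentioned here.

```python
def expected_occurrences(genome_positions, motif_length, base_freq=0.25):
    """Expected count of one specific motif under a Bernoulli model.

    Assumes every position is independent and each base has the same
    probability (base_freq), so a given motif of length k matches at any
    position with probability base_freq ** k.
    """
    return genome_positions * base_freq ** motif_length


# Human genome, both strands: 6 x 10^9 positions; motif of 20 nucleotides.
estimate = expected_occurrences(6e9, 20)
print(f"{estimate:.4f}")  # about 0.005, matching the slide
```

The gap between this value and the roughly 10^5 observed (GT)n tracts is the point of the slide: simple repeats are far more common than a random-sequence model predicts.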
Low-complexity Regions • Simple Sequence Regions (SSR) • MICRO- or MINISATELLITES • Regions that have significant biases in AA or nucleotide composition: repeats of simple motifs • (GT)n, (AAC)n, (P)n, (NANP)n • Low-Complexity Regions/Segments • Complexity can be measured by Shannon’s Entropy • Regarding an amino acid sequence • For each composition of a complexity state, there exists a large number of possible sequences
Low-Complexity Regions • Locally abundant residues may be • contiguous or loosely clustered, irregular or aperiodic • >25% of the amino acids in currently sequenced genomes are in LC regions • non-globular domains, SSR • Examples: myosins, pilins, segments in antigens, short subsequences of 10-50 residues with unknown function • Beta-pleated sheets • Alpha helices • Coiled-coils
Detecting Low-Complexity • SEG and PSEG/NSEG algorithms • Wootton and Federhen • Methods in Enzymology 266:33 (1996) • Computers and Chemistry 17:149 (1993) • SEG • UNIX Executable available on ncbi servers • seg FASTAfile Window TriggerComplexity Extension K2(1) K2(2) • Longer Window lengths define more sustained regions, but overlook short biased subsequences
clobber> seg hu.piron.fa 12 2.20 2.50
>gi|730388|sp|P40250|PRIO_CERAE MAJOR PRION PROTEIN PRECURSOR (PRP)

                                  1-49   MANLGCWMLVVFVATWSDLGLCKKRPKPGG
                                         WNTGGSRYPGQGSPGGNRY
ppqggggwgqphgggwgqphgggwgqphgg   50-86
                       gwgqggg
                                 87-104  THNQWHKPSKPKTSMKHM
       agaaaagavvgglggymlgsams  105-127
                                128-179  RPLIHFGNDYEDRYYRENMYRYPNQVYYRP
                                         VDQYSNQNNFVHDCVNITIKQH
                tvttttkgenftet  180-193
                                194-228  DVKMMERVVEQMCITQYEKESQAYYQRGSS
                                         MVLFS
              sppvillisflifliv  229-244
                                245-245  G

clobber> seg hu.piron.fa 12 2.20 2.50 -l
>gi|730388|sp|P40250|PRIO_CERAE(50-86) complexity=1.90 (12/2.20/2.50)
ppqggggwgqphgggwgqphgggwgqphgggwgqggg
>gi|730388|sp|P40250|PRIO_CERAE(105-127) complexity=2.47 (12/2.20/2.50)
agaaaagavvgglggymlgsams
>gi|730388|sp|P40250|PRIO_CERAE(180-193) complexity=2.26 (12/2.20/2.50)
tvttttkgenftet
>gi|730388|sp|P40250|PRIO_CERAE(229-244) complexity=2.50 (12/2.20/2.50)
sppvillisflifliv
SEG on the prion protein with different window lengths • question-based • exploratory tool • optimization step
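Detecting Low-Complexity • Intuitive explanation • Take a 20-residue long sequence and count how often each of the 20 amino acid types occurs, sorted from most to least frequent • (20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0): one residue type only, lowest complexity • ( 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ): every residue type once, highest complexity • ( 3 3 3 3 3 2 2 2 2 1 1 0 0 0 0 0 0 0 0 0): intermediate complexity • Complexity can be described by Shannon’s Entropy (K2) • Regarding an amino acid sequence • For each composition of a complexity state, there exists a large number of possible sequences (K1)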
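The two measures named on this slide can be computed directly from a composition vector. The sketch below assumes the definitions commonly given for SEG: K2 is the Shannon entropy of the residue frequencies in bits, and K1 is the log (base alphabet size) of the number of distinct sequences sharing the composition, divided by the window length. The function names are illustrative, and each function normalizes by the sum of the counts it is given.

```python
import math


def k2_entropy(counts):
    """Shannon entropy (bits) of a composition vector of residue counts."""
    total = sum(counts)
    # Zero counts contribute nothing; "+ 0.0" avoids printing -0.0.
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0) + 0.0


def k1_complexity(counts, alphabet_size=20):
    """(1/L) * log_N of the number of sequences with this composition.

    The number of sequences is the multinomial coefficient L! / prod(n_i!),
    evaluated through lgamma to avoid huge integers.
    """
    total = sum(counts)
    log_omega = math.lgamma(total + 1) - sum(math.lgamma(n + 1) for n in counts)
    return log_omega / (total * math.log(alphabet_size))


homopolymer = [20] + [0] * 19   # one residue type only
uniform = [1] * 20              # every residue type once

print(k2_entropy(homopolymer))        # 0.0
print(round(k2_entropy(uniform), 3))  # 4.322, i.e. log2(20)
print(k1_complexity(homopolymer) < k1_complexity(uniform))  # True
```

Both measures rank the homopolymer lowest and the uniform composition highest; compositions in between fall between these extremes.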
How SEG works • seg FASTAfile Window TriggerComplexity Extension K2(1) K2(2) • Slides a window of the given length along the sequence: if the window complexity is ≤ K2(1), a segment is triggered and then extended in both directions for as long as overlapping windows have complexity ≤ K2(2) • Uniform prior probabilities • A protein sequence database is a heterogeneous statistical mixture, so the initially unknown AA frequencies in low-complexity subsets need have no similarity to the frequencies in the total database • Unbiased view of low-complexity regions • Gives equiprobable compositions for any complexity state
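The trigger-and-extend step can be sketched in a few lines. This is a simplified illustration, not the SEG implementation: it covers only the window scan with the two K2 thresholds and omits SEG's later refinement of each raw segment. The function names and the 0-based, half-open coordinates are choices made for this example.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of one window."""
    length = len(window)
    return -sum((c / length) * math.log2(c / length)
                for c in Counter(window).values()) + 0.0


def raw_low_complexity_segments(seq, window=12, trigger=2.2, extension=2.5):
    """Return raw low-complexity segments as 0-based, half-open (start, end)."""
    seq = seq.upper()
    if len(seq) < window:
        return []
    k2 = [window_entropy(seq[i:i + window])
          for i in range(len(seq) - window + 1)]
    segments = []
    i = 0
    while i < len(k2):
        if k2[i] <= trigger:
            # Extend left and right over windows that pass the looser cutoff.
            lo = i
            while lo > 0 and k2[lo - 1] <= extension:
                lo -= 1
            hi = i
            while hi + 1 < len(k2) and k2[hi + 1] <= extension:
                hi += 1
            start, end = lo, hi + window
            if segments and start <= segments[-1][1]:
                # Overlaps the previous segment: merge the two.
                segments[-1] = (segments[-1][0], max(end, segments[-1][1]))
            else:
                segments.append((start, end))
            i = hi + 1
        else:
            i += 1
    return segments


diverse = "ACDEFGHIKLMNPQRSTVWY"
print(raw_low_complexity_segments(diverse + "G" * 20 + diverse))
print(raw_low_complexity_segments(diverse))  # []
```

Using a stricter trigger than extension threshold means a segment starts only where the bias is strong, but is then allowed to grow through moderately biased flanks.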
How SEG works, continued • How do you correct for the background AA/nucleotide composition bias? • After randomly shuffling all the residues in the database, determine the trigger complexity at which 4% of the shuffled database falls within low-complexity regions • Then run SEG on the real database with this trigger complexity and subtract 4% from the percentage of AA found in low-complexity regions
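The calibration idea can be illustrated as follows. This is a rough sketch under simplifying assumptions: the "database" is a single string of residues, a residue counts as low-complexity if any window covering it has entropy at or below the trigger, and candidate triggers are scanned in steps of 0.1 bits. The slide's workflow uses the SEALS tools for this; the functions below are hypothetical stand-ins.

```python
import math
import random
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of one window."""
    length = len(window)
    return -sum((c / length) * math.log2(c / length)
                for c in Counter(window).values()) + 0.0


def covered_fraction(entropies, n_residues, window, trigger):
    """Fraction of residues inside at least one window with K2 <= trigger."""
    covered = [False] * n_residues
    for start, k2 in enumerate(entropies):
        if k2 <= trigger:
            for pos in range(start, start + window):
                covered[pos] = True
    return sum(covered) / n_residues


def calibrate_trigger(residues, window=12, target=0.04, seed=0):
    """Largest trigger (0.1-bit steps) keeping shuffled coverage <= target.

    Assumes len(residues) >= window. Returns (trigger, fraction); if even a
    trigger of 0.0 exceeds the target, that pair is returned unchanged.
    """
    shuffled = list(residues)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    entropies = [window_entropy(shuffled[i:i + window])
                 for i in range(n - window + 1)]
    best = (0.0, covered_fraction(entropies, n, window, 0.0))
    for step in range(1, 44):          # 0.1 .. 4.3 bits (log2(20) is ~4.32)
        trigger = step / 10
        fraction = covered_fraction(entropies, n, window, trigger)
        if fraction > target:
            break
        best = (trigger, fraction)
    return best


demo = "".join(random.Random(1).choices("ACDEFGHIKLMNPQRSTVWY", k=2000))
print(calibrate_trigger(demo))
```

Because shuffling destroys local order but keeps the overall composition, whatever is still flagged after shuffling reflects background composition bias alone, which is why that 4% is subtracted from the result on the real data.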
Detecting Low-complexity with repetitive motif: SSR • PSEG or NSEG • Repetition of residue types or k-grams • Period 3 (n V E n K N n V D n K D n V N n K S n K) (n m i n m i n m i n m i n m i n m i n m) (n m E n m N n m D n m D n m N n m S n m) • Sliding window along sequence in single residue steps
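The period-3 example on this slide can be explored by splitting a sequence into its phases and measuring the complexity of each phase separately. This is only an illustration of why periodicity matters, not a description of how PSEG or NSEG work internally; `phase_entropies` is a hypothetical helper.

```python
import math
from collections import Counter


def phase_entropies(seq, period):
    """Shannon entropy (bits) of the residues found at each phase of a period."""
    result = []
    for phase in range(period):
        residues = seq[phase::period]      # every period-th residue
        total = len(residues)
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in Counter(residues).values()) + 0.0
        result.append(entropy)
    return result


# The period-3 sequence from the slide, written without spaces.
example = "NVENKNNVDNKDNVNNKSNK"
print([round(h, 2) for h in phase_entropies(example, 3)])
```

The first phase contains only N and therefore has zero entropy, even though the sequence as a whole uses several residue types; a window-based measure that ignores period would underestimate how repetitive it is.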
Evolutionary Mechanisms • Evolution of sequences in general • Evolution rate of 10^-5 to 10^-9 • Base pair substitution (10^-9) • Insertion/deletions • Recombination • In SSR and low-complexity regions, mutations are in length, with steps typically +/- one repeat unit • Evolution rate 10^-3 • Biased nucleotide substitution due to increased recombination in repetitive regions • Unequal crossing over (recombination) • Replication slippage • Alignment of repeats does not imply relationships/ancestry
Low-Complexity and BLAST searches • Low-complexity regions, with their biased AA/nucleotide composition, can cause BLAST search results to be dominated by low-complexity matches • BLAST added “mask low-complexity” by default • Seg parameters: 12 2.2 2.5 • BLAST now also uses a compositional bias filter on the whole database • Masks compositionally biased regions using seg 10 1.8 2.1 • YOU MAY WANT TO TURN THESE OPTIONS OFF and use your own organism-specific seg parameters when doing protein homology searching • YOU WILL NEED TO TURN THESE OPTIONS OFF if you are interested in looking at sequence similarities of repetitive/low-complexity regions.
Example: Plasmodium falciparum • Using whole genome sequences is important to limit the PCR sequencing bias toward antigens: hydrophilic proteins • Considering GC-content / AA bias • P. falciparum is approximately 28% GC • Visualization of individual proteins
A helpful tool here and in general • SEALS: A System for Easy Analysis of Lots of Sequences, R. Walker and E. Koonin, NCBI • www.ncbi.nlm.nih.gov/CBBresearch/Walker/SEALS/index.html • Demonstrate getting an appropriate data set • Taxnode2gi, gi2fasta • Daffy • Purge • Gref • Fanot • Use cleaned data set of P. falciparum proteins
Protein Analysis • Setting the trigger complexity: • Dbcomp • Shuffledb • Seg • Run SEG on P. falciparum MSP1, PfEMP2, Cg2 • Options • -p (tree form output) • -l (only report Low-C segs) • -h (don’t report Low-C segs) • -x (substitute Low-C with x) • Run PSEG on P. falciparum MSP1, PfEMP2, Cg2 with different -z (periodicity)
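The effect of the -x option can be mimicked on segment coordinates such as those reported with -l. This is a minimal sketch, assuming 1-based inclusive ranges as in the SEG output format (for example 50-86); `mask_segments` is a hypothetical helper, not part of SEG or SEALS.

```python
def mask_segments(seq, segments, mask_char="x"):
    """Replace residues inside 1-based inclusive (start, end) ranges with mask_char."""
    residues = list(seq)
    for start, end in segments:
        for pos in range(start - 1, min(end, len(residues))):
            residues[pos] = mask_char
    return "".join(residues)


print(mask_segments("MANLGCWM", [(3, 5)]))  # MAxxxCWM
```

A sequence masked this way can be passed to a homology search with the built-in filters turned off, so that the masking reflects organism-specific parameters rather than the defaults.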
Usefulness of studying Low-Complexity • Within a protein: secondary structure, homology searches, protein location, genetic disorders • Within taxa: microsatellite markers, polymorphism, comparisons between proteins • Between taxa: synteny, orthologs, different selection pressures upon different organisms • Parasites: immunogenicity, rapid evolution of antigens, recombination