Zipf’s monkeys Observations from real and random genomes
Environmental genomics • When an organism dies, it decomposes and the DNA in its cells degenerates into smaller and smaller fragments • Given a collection of DNA fragments (i.e. reads), figure out which organisms they came from
The data AGTCGATGCAGTCAGCATACGATCAGACTGCAGCTTATATCGCATCGCGCATGATTACTACTGCGCGATCAGCATCATATACGACTACGGCAGATCATCATCGCGCATCAATCAGTG…
The data [figure: the sequences drawn as long lines that are progressively broken into fragments (reads) of varying length] How can we reconstruct the original genomes?
Approaches • Jigsaw puzzle: find common subsequences; align overlapping regions • Statistics: compute histograms of oligonucleotides (n-grams); match to distributions for known organisms; use rare polymers to select anchor points (BLAST-like)
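The statistical approach can be sketched as follows. This is a minimal illustration, not the method used in the talk: the function names and the L1 distance used to compare two histograms are assumptions, and the read is the toy sequence from the data slide.

```python
from collections import Counter

def kmer_histogram(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer (oligonucleotide n-gram) in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def relative_frequencies(counts: Counter) -> dict:
    """Turn raw k-mer counts into a probability distribution."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def l1_distance(p: dict, q: dict) -> float:
    """Sum of absolute frequency differences (an assumed similarity measure)."""
    keys = set(p) | set(q)
    return sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)

# Toy read taken from the data slide.
read = "AGTCGATGCAGTCAGCATACGATCAGACTGCAGCT"
hist = kmer_histogram(read, 2)
print(hist.most_common(3))
```

A read would then be assigned to the known organism whose k-mer distribution lies closest to the read's own distribution.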
Compression distance • Conjecture: a lossless, dictionary-based sequence compressor built for a genome compresses one of that genome's own subsequences better than a compressor built for another genome would • (Normalized) universal compression distance, where C(s) is the compressed size of s: $\mathrm{UCD}(x,y) = \dfrac{\max[\,C(xy) - C(x),\; C(yx) - C(y)\,]}{\max[\,C(x),\; C(y)\,]}$
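A minimal sketch of the distance, assuming zlib (DEFLATE) as a stand-in for the dictionary-based compressor; the talk does not say which compressor was used, and the function names here are illustrative.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """C(s): length in bytes of the zlib-compressed sequence (assumed compressor)."""
    return len(zlib.compress(data, 9))

def ucd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two sequences."""
    cx, cy = compressed_size(x), compressed_size(y)
    numerator = max(compressed_size(x + y) - cx, compressed_size(y + x) - cy)
    return numerator / max(cx, cy)

# Two unrelated random "genomes" as toy data.
rng = random.Random(0)
a = "".join(rng.choice("ACGT") for _ in range(2000)).encode()
b = "".join(rng.choice("ACGT") for _ in range(2000)).encode()

print("UCD(a, a) =", round(ucd(a, a), 3))  # a sequence compared with itself
print("UCD(a, b) =", round(ucd(a, b), 3))  # two unrelated sequences
```

A sequence compared with itself should score close to 0 and two unrelated sequences close to 1. Note that zlib only looks back 32 KB, so this stand-in is unsuitable for genome-scale inputs.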
CM clustering • Compression Maximization • Adapt compression into a kind of EM clustering • Partition reads randomly into [say] two groups • For each read, compute compression distance to each group (à la leave-one-out) • Reassign read to closest group • Iterate until some stopping criterion • Apply recursively to each group
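The loop above can be sketched as follows. This is a sketch under stated assumptions, not the talk's implementation: zlib stands in for the compressor, the read-to-group distance is a one-sided conditional compression cost, the stopping criterion is "no read changed group" (or an iteration cap), and the recursive step is left out.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Compressed length in bytes, with zlib as the assumed compressor."""
    return len(zlib.compress(data, 9))

def conditional_cost(read: bytes, context: bytes) -> float:
    """Extra bytes needed to compress `read` after `context`, relative to
    compressing the read on its own (assumed distance to a group)."""
    extra = compressed_size(context + read) - compressed_size(context)
    return extra / compressed_size(read)

def cm_cluster(reads, k=2, max_iter=10, seed=0):
    """Assign each read to one of k groups by iterative reassignment."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in reads]      # random initial partition
    for _ in range(max_iter):
        changed = False
        for i, read in enumerate(reads):
            costs = []
            for g in range(k):
                # Leave-one-out: the read itself is excluded from every group.
                others = b"".join(r for j, r in enumerate(reads)
                                  if labels[j] == g and j != i)
                costs.append(conditional_cost(read, others))
            best = min(range(k), key=costs.__getitem__)
            if best != labels[i]:
                labels[i] = best                    # reassign to closest group
                changed = True
        if not changed:                             # assumed stopping criterion
            break
    return labels

# Toy data: overlapping reads cut from two random "genomes".
rng = random.Random(1)
genomes = ["".join(rng.choice("ACGT") for _ in range(2000)).encode()
           for _ in range(2)]
reads = [g[s:s + 400] for g in genomes for s in range(0, 1601, 200)]
labels = cm_cluster(reads)
print(labels)
```

Like EM or k-means, a loop of this kind can settle in a local optimum that depends on the initial partition, so several random restarts may be needed.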
Experiment • groupA: DG2, NM1, MR2, DE3, AD5, AF1, DE1, AF3, DG4, AF5, CA1, MR4, CA3, DE4, CA5, NM4, CS2, AD2, CS4, MR3 • groupB: AF2, DE2, AD4, CA4, DE5, DG1, AD1, NM3, AF4, DG5, MR1, AD3, CS5, CA2, MR5, CS3, NM2, DG3, CS1, NM5
Experiment: result • groupA: AD1, AD2, AD3, AD4, AD5, AF1, AF2, AF3, AF4, AF5, CA1, CA2, NM1, NM2, NM3, CS1, CS2, CS3, CS4, CS5 • groupB: DE1, DE2, DE3, DE4, DE5, DG1, DG2, DG3, DG4, DG5, MR1, MR2, MR3, MR4, MR5, CA3, CA4, CA5, NM4, NM5 • Stop when µCD > 70
Reassembly • Can the LZ trie be used to reassemble reads into genomes? • The LZ trie is a regular grammar of the set of reads • A long phrase is an extension of a shorter phrase • The start of one read is the end of another • The part of a long phrase that is the suffix after a shorter phrase (i.e. the difference between the short phrase and the long one) is the prefix of another phrase
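The "long phrase extends a shorter phrase" property can be seen in a small LZ78-style parse. This sketch assumes LZ78 as the dictionary scheme behind the "LZ trie"; the talk does not name the variant, and the function name is illustrative.

```python
def build_lz_trie(sequence: str):
    """Parse a sequence into LZ78 phrases and return (trie, phrases).

    The trie is a nested dict; each new phrase is an existing phrase
    (a path from the root) extended by exactly one symbol.
    """
    root = {}
    node = root
    phrase = ""
    phrases = []
    for symbol in sequence:
        phrase += symbol
        if symbol in node:
            node = node[symbol]          # still inside a known phrase
        else:
            node[symbol] = {}            # extend the trie by one symbol
            phrases.append(phrase)
            node, phrase = root, ""      # start the next phrase at the root
    # A trailing, already-known partial phrase is dropped in this sketch.
    return root, phrases

trie, phrases = build_lz_trie("AAGAGT")
print(phrases)  # ['A', 'AG', 'AGT']
```

Here "AGT" extends "AG", which in turn extends "A", so the suffix that distinguishes a long phrase from its parent is always a single symbol at the moment the phrase is created.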
Along the way … • While setting up the initial experiments, we started to ponder things that might go wrong • Different genomes might share many common subsequences, which would confound the clustering result • SNPs and missing fragments might thwart compression • The compressor might take too long to converge on a useful model (paucity of data) • What is the underlying principle being leveraged?
Information theory • A linear sequence of symbols intended for communication exhibits a balance between randomness and regularity • If a sequence is entirely random, it is noise • If a sequence is entirely predictable, it is redundant • Patterns provide means for recognition (interpretation) and irregularities provide for novelty (information) • Compression attempts to minimize redundancy
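The two extremes can be illustrated with a general-purpose compressor. This sketch assumes zlib and toy sequences; it is an illustration only, not a measurement from the talk.

```python
import random
import zlib

def compression_ratio(sequence: str) -> float:
    """Compressed size divided by original size (lower means more redundancy)."""
    raw = sequence.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)

rng = random.Random(0)
noise = "".join(rng.choice("ACGT") for _ in range(4000))   # no regularity
repeat = "ACGT" * 1000                                     # fully predictable

print("random    :", round(compression_ratio(noise), 3))
print("repetitive:", round(compression_ratio(repeat), 3))
```

Even the random sequence shrinks, because four symbols need only two bits each while every stored character occupies a full byte; the repetitive sequence shrinks far more because the compressor also removes its redundancy.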
Information theory • Human languages exhibit non-uniform distributions over letters, phonemes, words, etc.
DNA primary sequences • Four nucleotide symbols: A, C, G, T • Much of a genome is non-coding; the rest is genes • A gene is copied off the genome (transcription), and the copy is used to build a protein (translation) • Three consecutive nucleotides form a codon, which codes for a specific amino acid • A sequence of amino acids (residues) constitutes a protein • Proteins are where structure definitely exists
DNA primary sequences • 4³ = 64 possible codons • 20 possible amino acids • Many amino acids have more than one codon
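The counting argument can be checked by enumeration. The sketch below assumes the standard genetic code, in which TAA, TAG and TGA are the three stop codons.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code (assumption)

# Every ordered triple of nucleotides is a codon: 4 * 4 * 4 of them.
codons = ["".join(triple) for triple in product(NUCLEOTIDES, repeat=3)]
coding = [c for c in codons if c not in STOP_CODONS]

print(len(codons))        # 64
print(len(coding))        # 61
print(len(coding) / 20)   # 3.05 coding codons per amino acid on average
```

With 61 coding codons shared among 20 amino acids, at least some amino acids must be encoded by more than one codon.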
Genomic regularities • Most genes start with ATG and end with a stop codon (TAG, TAA, and TGA most frequent) • TATA-box in regulatory region (for binding) • GC-rich regions (for stability) But: • Frequency of individual nucleotides or residues is not so interesting (no syntax) • Tertiary structure of proteins is The Thing: the interactions of amino acid residues are paramount
Genomic regularities • Do genomes have sequential syntactic structures?
Problems from paucity of data • It takes time for an LZ compression trie to become saturated with characteristic phrases • The experimental data are somewhat small, so interesting sequences may not manifest quickly enough • Prime the trie by prepending some random DNA to the data prior to computing CD • How much? How about a million?
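The priming idea can be sketched with a phrase-counting stand-in for the compressor. Everything here is an assumption made for illustration: LZ78 phrase count serves as the compression cost, the primer is uniformly random DNA, and the sizes are far smaller than the million symbols suggested above.

```python
import random

def lz78_phrase_count(sequence: str) -> int:
    """Number of phrases in an LZ78-style parse (a proxy for compressed size)."""
    seen = set()
    phrase = ""
    count = 0
    for symbol in sequence:
        phrase += symbol
        if phrase not in seen:       # new phrase: add it to the dictionary
            seen.add(phrase)
            count += 1
            phrase = ""
    return count + (1 if phrase else 0)   # count a trailing partial phrase

def random_dna(length: int, seed: int = 0) -> str:
    """Uniformly random nucleotide string used as the primer."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))

def primed_cost(read: str, primer: str) -> int:
    """Phrases added by `read` once the dictionary has been primed."""
    return lz78_phrase_count(primer + read) - lz78_phrase_count(primer)

read = random_dna(500, seed=1)
primer = random_dna(10_000, seed=2)
print("unprimed cost:", lz78_phrase_count(read))
print("primed cost  :", primed_cost(read, primer))
```

With a primed dictionary the parse starts from longer known phrases, so the read is covered by fewer new phrases than when the dictionary starts empty.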
Miller’s monkey • 19th century – Vilfredo Pareto showed that power-law distributions abound in social, scientific, economic and geophysical data • 1949 – G.K. Zipf argued that power-law distributions are an interesting linguistic phenomenon • 1957 – G.A. Miller argued that the effect is related to the random placement of spaces, and that a monkey at a typewriter would produce ‘language’ with a Zipfian distribution • 1968 – David Howes argued that Miller’s proof is flawed • 2004 – Michael Mitzenmacher demonstrated the connection between power-law distributions and log-normal distributions
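Miller's thought experiment is easy to simulate. The sketch below assumes a four-letter keyboard plus a space bar, all keys equally likely; the alphabet, text length and function names are illustrative choices, not taken from the talk.

```python
import random
from collections import Counter

def monkey_text(n_chars: int, alphabet: str = "abcd", seed: int = 0) -> str:
    """Random typing: every key, including the space bar, is equally likely."""
    rng = random.Random(seed)
    keys = alphabet + " "
    return "".join(rng.choice(keys) for _ in range(n_chars))

def rank_frequency(text: str):
    """Words sorted from most to least frequent, as (word, count) pairs."""
    return Counter(text.split()).most_common()

ranked = rank_frequency(monkey_text(200_000))
for rank, (word, count) in enumerate(ranked[:8], start=1):
    print(rank, word, count)
```

Short "words" dominate the top ranks and counts fall off in steps as word length grows, which on a log-log rank-frequency plot approximates the straight line associated with Zipf's law.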
Conclusion • Probably nothing!