The Statistical Nature of English with Implications for Data Compression
Joshua Blackburn
Communications Theory Honors
April 28, 2006
Statistical Structure in Language
• Letters
  • E: 0.1024
  • W: 0.0142
• Digrams
  • TH: 0.0254
  • AO: 0.0001
  • Not simply the product of the individual letter probabilities:
    • T: 0.0835
    • H: 0.0442
    • A: 0.0640
    • O: 0.0621
• Trigrams
  • THE: 0.0167
  • QMZ: 0
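The digram figures above can be checked against the letter figures directly: if letters were independent, P(TH) would equal P(T)·P(H). The short Python sketch below (Python is used for illustration only; the presentation's own code was Matlab and is not shown) computes the ratio of the observed digram probability to that product, using only the numbers quoted on this slide. The function name `independence_ratio` is made up for this example.

```python
# Letter and digram probabilities quoted on the slide.
letter_p = {"T": 0.0835, "H": 0.0442, "A": 0.0640, "O": 0.0621}
digram_p = {"TH": 0.0254, "AO": 0.0001}

def independence_ratio(digram):
    """Observed digram probability divided by the product of its letter probabilities.

    A ratio of 1 would mean the two letters behave as if independent.
    """
    first, second = digram
    expected_if_independent = letter_p[first] * letter_p[second]
    return digram_p[digram] / expected_if_independent

for d in digram_p:
    print(d, round(independence_ratio(d), 3))
```

With these figures TH turns up roughly seven times more often than independence would predict, and AO roughly forty times less often, which is the structure a digram model captures and a letter model cannot.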
Types of Languages
• Natural Spoken Language
  • English: The cat chased the dog up the hill.
  • German: Die Katze jagte den Hund herauf den Hügel.
  • Spanish: El gato persiguió el perro encima de la colina.
  • Polish: Ten kot goniący ten pies w górze ten pagórek.
  • Czech: Člen určitý kočka cizelovat člen určitý být v patách autobus člen určitý vrch.
• Pulse Code Modulated Samples of Continuous Process
  • Non-Return to Zero
  • Return to Zero
• Mathematical Cases
  • Sequence of symbols with defined probabilities.
Mathematical Cases
• Zeroth Order
  • All symbols equiprobable.
  • BDCBCECCCADCBDDAAECEEAABBDAEECACEEBAEECBC
• First Order
  • Accounts for letter probabilities.
  • P{A, B, C, D, E}={0.4, 0.1, 0.2, 0.2, 0.1}
  • AAACDCBDCEAADADACEDAEADCABEDADDCECAAAAAD
• Second Order
  • Accounts for transition probabilities.
  • ABBABABABABABABBBABBBB BABABABABABBBACACABB
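The three orders of approximation can be sketched as three small generators. This is an illustrative Python sketch, not the presentation's code. The first-order probabilities are the ones given on the slide; the transition table `TRANSITIONS` is hypothetical, since the slides do not list the transition probabilities behind the second-order example.

```python
import random

SYMBOLS = "ABCDE"
FIRST_ORDER_P = [0.4, 0.1, 0.2, 0.2, 0.1]  # P{A, B, C, D, E} from the slide

# Hypothetical P(next | current) table, invented for illustration only.
TRANSITIONS = {
    "A": {"B": 0.8, "C": 0.2},
    "B": {"A": 0.6, "B": 0.4},
    "C": {"A": 1.0},
}

def zeroth_order(n, rng):
    """Every symbol is equally likely, independent of everything else."""
    return "".join(rng.choice(SYMBOLS) for _ in range(n))

def first_order(n, rng):
    """Symbols are independent but drawn with their own probabilities."""
    return "".join(rng.choices(SYMBOLS, weights=FIRST_ORDER_P, k=n))

def second_order(n, rng, start="A"):
    """Each symbol is drawn from a distribution that depends on the previous one."""
    out = [start]
    while len(out) < n:
        row = TRANSITIONS[out[-1]]
        out.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(out[:n])

rng = random.Random(0)  # fixed seed so runs are repeatable
print(zeroth_order(40, rng))
print(first_order(40, rng))
print(second_order(40, rng))
```

A third-order generator would follow the same pattern as `second_order`, with the lookup keyed on the previous two symbols instead of one.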
Mathematical Cases
• Third Order
  • Uses transition probabilities from the previous two symbols.
• Word Analysis
  • Basic unit of analysis can be words instead of symbols.
  • Any order of analysis can be used.
  • Typical First Order Analysis: DAB EE A BEBE DEED DEB ADEE ADEE EE DEB BEBE BEBE BEBE ADEE BED DEED DEED CEED ADEE A DEED DEED BEBE CABED BEBE BED
Entropy
• Definition: average amount of information per symbol, in bits/symbol
• Calculation: $H = -\sum_i p_i \log_2 p_i$
• Translation (average length of a particular mapping from symbols to bits):
  • Letters equiprobable
    • Mapping 1: 2 bits/symbol
    • Mapping 2: 2.25 bits/symbol
  • P{A, B, C, D}={0.4, 0.3, 0.2, 0.1}
    • Mapping 1: 2 bits/symbol
    • Mapping 2: 1.9 bits/symbol
• Intrinsic (entropy of the source itself):
  • Letters equiprobable: 2 bits/symbol
  • P{A, B, C, D}={0.4, 0.3, 0.2, 0.1}: 1.846 bits/symbol
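The figures on this slide can be reproduced with a few lines of Python. The two mappings themselves are not legible in the transcript, so the code lengths below are an assumption: a fixed-length code (2 bits per symbol) for Mapping 1 and a variable-length prefix code with lengths 1, 2, 3, 3 for Mapping 2. These lengths were chosen because they match the averages quoted on the slide.

```python
import math

def entropy(probs):
    """Shannon entropy in bits/symbol: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def average_length(probs, code_lengths):
    """Expected number of bits per symbol for a given set of codeword lengths."""
    return sum(p * n for p, n in zip(probs, code_lengths))

# Assumed codeword lengths (the actual mappings are not in the transcript).
FIXED = [2, 2, 2, 2]      # e.g. A=00, B=01, C=10, D=11
VARIABLE = [1, 2, 3, 3]   # e.g. A=0, B=10, C=110, D=111

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.4, 0.3, 0.2, 0.1]

for name, probs in (("equiprobable", uniform), ("skewed", skewed)):
    print(name,
          "fixed:", round(average_length(probs, FIXED), 3),
          "variable:", round(average_length(probs, VARIABLE), 3),
          "entropy:", round(entropy(probs), 3))
```

The comparison shows why the variable-length code only pays off for the skewed source: its 1.9 bits/symbol beats the fixed 2 bits/symbol, yet still sits above the 1.846 bits/symbol entropy, which no symbol-by-symbol code can go below.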
Matlab Implementation
• Analyzed the English Language
  • cleanstring()
    • Read standard text file.
    • Converted uppercase to lowercase.
    • Removed punctuation.
    • 60 lines of code.
  • createpmf()
    • Mapped 26 letters and space to integers 1 to 27
    • Used the mappings of the current letter and previous two as indices of a 27x27x27 frequency table.
    • Incremented the proper location as each letter was read.
    • 94 lines of code.
  • createmarginals()
    • Creates a CDF of each letter conditioned upon the two previous letters.
    • 66 lines of code.
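The cleaning and counting steps described above might look roughly like this in Python. This is a sketch under assumptions, not a port of the Matlab functions, whose source is not shown: the names `clean_string` and `count_trigrams` are made up here, indices run 0 to 26 rather than Matlab's 1 to 27, and the choices to drop digits and collapse runs of whitespace into a single space are guesses about details the slide does not specify.

```python
import string

ALPHABET = string.ascii_lowercase + " "              # 26 letters plus space
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}     # symbol -> 0..26

def clean_string(text):
    """Lowercase the text, keep only a-z and spaces, collapse whitespace runs."""
    chars = []
    for ch in text.lower():
        if ch.isspace():
            chars.append(" ")       # newlines and tabs become spaces
        elif ch in INDEX:
            chars.append(ch)
        # anything else (punctuation, digits, accented letters) is dropped
    return " ".join("".join(chars).split())

def count_trigrams(text):
    """Return a 27x27x27 nested list of counts indexed [previous2][previous1][current]."""
    n = len(ALPHABET)
    counts = [[[0] * n for _ in range(n)] for _ in range(n)]
    for a, b, c in zip(text, text[1:], text[2:]):
        counts[INDEX[a]][INDEX[b]][INDEX[c]] += 1
    return counts

sample = clean_string("The Cat,\nthe HAT!")
table = count_trigrams(sample)
print(sample)
print(table[INDEX["t"]][INDEX["h"]][INDEX["e"]])
```

Summing the table over its first index gives digram counts, and summing again gives letter counts, so the lower-order models can be derived from the one trigram table.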
Matlab Implementation
• Created English Approximation
  • createEnglish()
    • Randomly generates a stream of letters according to the probability model for the desired order.
    • Created the lower order marginal CDFs.
    • Zeroth Order: randint()
    • Higher Orders
      • Used the proper CDF to map the uniform rand() function to the proper nonuniform probability model.
    • 152 lines of code.
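Mapping a uniform random number through a CDF, as described for the higher orders, is the standard inverse-CDF sampling technique. A minimal Python sketch follows; the helper names `make_cdf`, `sample_index`, and `generate` are hypothetical and the weights are example values, not the English letter counts.

```python
import bisect
import itertools
import random

def make_cdf(weights):
    """Turn non-negative weights (counts or probabilities) into a CDF ending at 1.0."""
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    return [c / total for c in cumulative]

def sample_index(cdf, u):
    """Map a uniform draw u in [0, 1) to the first index whose CDF value exceeds u."""
    return bisect.bisect_right(cdf, u)

def generate(symbols, weights, n, rng):
    """Draw n independent symbols with the given weights (a first-order model)."""
    cdf = make_cdf(weights)
    return "".join(symbols[sample_index(cdf, rng.random())] for _ in range(n))

rng = random.Random(0)  # fixed seed so runs are repeatable
print(generate("ABCD", [0.4, 0.3, 0.2, 0.1], 40, rng))
```

For the second and third orders the same `sample_index` call applies; the only change is that the CDF is chosen according to the previous one or two letters before each draw.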
Matlab Implementation
• Results
  • Zeroth Order
    • qvytekzylybjvadffqhfaumzmlwofswaskwntliffsioeskxxq
    • Equal presence of the rare letters (q, v, w, x, z, etc.)
  • First Order
    • o tew dtgsm eshnmtet ik thy g laftnae iearuac uot
    • Increased spaces and vowels
  • Second Order
    • tivecurm al aris ch at gero inanhah b s tallest hat
    • Divided into syllables
  • Third Order
    • hour goes of ind procaughtiven torst wit mink ing
    • Pronounceable text
Matlab Implementation
• Calculated the Entropy
  • marginalpmf()
    • Normalized the frequency tables to sum to one.
    • 56 lines of code.
  • entropy()
    • Calculated the joint entropy for each order of approximation.
      • First: 4.0936 bits/letter
      • Second: 7.4486 bits/digram
      • Third: 10.113 bits/trigram
    • 23 lines of code.
• Results (joint entropy divided by the number of letters per block)
  • First Order: 4.0936 bits/letter
  • Second Order: 3.7243 bits/letter (7.4486 / 2)
  • Third Order: 3.371 bits/letter (10.113 / 3)
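The relationship between the two sets of numbers on this slide is a division by block length: the entropy of the digram table is in bits per two-letter block, so halving it gives bits per letter, and likewise for trigrams. The Python sketch below illustrates the calculation on an arbitrary string. It is not the presentation's entropy() function, and the names `block_entropy` and `per_letter` are made up for this example.

```python
import math
from collections import Counter

def block_entropy(text, n):
    """Entropy in bits per n-letter block, estimated from overlapping n-gram counts."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def per_letter(block_bits, n):
    """Convert bits per n-letter block into bits per letter."""
    return block_bits / n

# Reproduce the per-letter results from the block entropies quoted on the slide.
for order, bits in ((1, 4.0936), (2, 7.4486), (3, 10.113)):
    print(order, round(per_letter(bits, order), 4))

# The same pipeline applied to a toy string (not the English corpus).
toy = "the cat chased the dog up the hill"
for n in (1, 2, 3):
    print(n, round(per_letter(block_entropy(toy, n), n), 3))
```

The per-letter figure falls as the order rises because longer blocks capture more of the dependence between neighbouring letters, which is the redundancy a compressor can exploit.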