Probabilistic Text Generation: A Nifty Assignment from Joe Zachary, School of Computing, University of Utah
Probabilistic Text Generation
• Based on an idea by Claude Shannon (1948), popularized by A. K. Dewdney (1989)
• Generates probabilistic text based on the patterns in a source file
• Both fun and appropriate for CS 2 students
• For each excerpt slide that follows, guess which passage is actually from the text; the other two are randomly generated
King James Bible (nGram length == 6)
• For every man putteth one and my mother: for them, to David sent Samson, and a sacrifice.
• Then went Samson down, and his father and his mother, to Timnath, and came to the vineyards of Timnath: and, behold, a young lion roared against him.
• Now the ark; and treasures upon them: and, Who also in the evil: so do your heart?
Tom Sawyer (nGram length == 6)
• Huck started to act very intelligently on the back of his pocket behind, as usual on Sundays.
• He was always dressed fitten for drinking some old empty hogsheads.
• The men contemplated the treasure awhile in blissful silence.
Hamlet (nGram length == 5)
• Ay me, what act, That roars so loud and thunders in the index?
• Worse that a rat? Dead for a ducat, drugs fit that I bid you not?
• Leave heart; for to our lord, it we show him, but skin and he, my lord, I have fat all not over thought, good my lord?
Niftiness
• Not a toy: it slurps up entire books
• Defies expectations: it turns out to be both straightforward and educational
• Entertaining: I (Joe Zachary) run a contest to find the funniest generated text
nGram length == 0
The probability that c is the next character to be produced equals the probability that c occurs in the source file.
Sample output: rla bsht eS ststofo hhfosdsdewno oe wee h .mr ae irii ela iad o r te u t mnyto onmalysnce, ifu en c fDwn oee iteo
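As a rough illustration, here is a minimal Java sketch of the order-0 case, assuming the source text has already been read into a String; the class name, the short sample string, and the output length of 60 are placeholders rather than part of the assignment.

    import java.util.Random;

    public class Order0 {
        public static void main(String[] args) {
            // Placeholder source text; the real assignment reads an entire book.
            String text = "We hold these truths to be self-evident: "
                        + "that all men are created equal; that they";
            Random rng = new Random();
            StringBuilder out = new StringBuilder();
            // Each output character is drawn uniformly from the positions of the
            // source text, so c appears with the probability it occurs in the source.
            for (int i = 0; i < 60; i++) {
                out.append(text.charAt(rng.nextInt(text.length())));
            }
            System.out.println(out);
        }
    }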
nGram length == 1
Let s be the previously produced character. The probability that c is the next character to be produced equals the probability that c follows s in the source text.
Sample output: "Shand tucthiney m?" le ollds mind Theybooure He, he s whit Pereg lenigabo Jodind alllld ashanthe ainofevids tre lin--p asto oun
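A similar sketch of the order-1 case, again with a placeholder source string; the map of follower lists is just one convenient way to organize the counts, not a prescribed data structure.

    import java.util.*;

    public class Order1 {
        public static void main(String[] args) {
            // Placeholder source text; the real assignment reads an entire book.
            String text = "We hold these truths to be self-evident: "
                        + "that all men are created equal; that they";
            // For each character s, record every character that follows s in the text.
            Map<Character, List<Character>> followers = new HashMap<>();
            for (int i = 0; i < text.length() - 1; i++) {
                followers.computeIfAbsent(text.charAt(i), k -> new ArrayList<>())
                         .add(text.charAt(i + 1));
            }
            Random rng = new Random();
            char s = text.charAt(rng.nextInt(text.length() - 1));
            StringBuilder out = new StringBuilder().append(s);
            for (int i = 0; i < 60; i++) {
                List<Character> next = followers.get(s);
                if (next == null) break;        // s only occurs at the end of the text
                // Picking uniformly from the list matches the follower frequencies.
                s = next.get(rng.nextInt(next.size()));
                out.append(s);
            }
            System.out.println(out);
        }
    }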
nGram length == 3..15
Let nGram be the previously produced k characters (k == 4 in this sample). The probability that c is the next character to be produced equals the probability that c follows nGram in the source text.
Sample output: Mr. Welshman, but him awoke, the balmy shore. I'll give him that he couple overy because in the slated snufflindeed structure's
Algorithm
• Pick a random k-letter nGram from the text; this is the seed
• Repeatedly:
  • Make a list of every character that follows the seed in the text
  • Randomly pick a character c from the list
  • Print c
  • Remove the first character from the seed and append c to form the new seed
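A minimal Java sketch of this loop, assuming the source text is already in a String; the class and method names, the brute-force indexOf search for followers, and the output length are illustrative choices, not the assignment's required design.

    import java.util.*;

    public class NGramGenerator {
        public static String generate(String text, int k, int length) {
            Random rng = new Random();
            // Pick a random k-letter nGram from the text as the initial seed.
            int start = rng.nextInt(text.length() - k);
            String seed = text.substring(start, start + k);
            StringBuilder out = new StringBuilder(seed);
            for (int i = 0; i < length; i++) {
                // Make a list of every character that follows the seed in the text.
                List<Character> followers = new ArrayList<>();
                for (int j = text.indexOf(seed); j >= 0; j = text.indexOf(seed, j + 1)) {
                    if (j + k < text.length()) {
                        followers.add(text.charAt(j + k));
                    }
                }
                if (followers.isEmpty()) break;   // seed only occurs at the very end
                // Randomly pick a follower, append it, and slide the seed forward.
                char c = followers.get(rng.nextInt(followers.size()));
                out.append(c);
                seed = seed.substring(1) + c;
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // Placeholder source text; the real assignment slurps up an entire book.
            String text = "We hold these truths to be self-evident: "
                        + "that all men are created equal; that they";
            System.out.println(generate(text, 2, 60));
        }
    }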
Example (nGram length == 2)
Seed: th
Text: We hold these truths to be self-evident: that all men are created equal; that they
List: [e, s, a, a, e]
Character: s (20% of the time)
New nGram: hs
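The follower-list step can be checked directly against this example; the short sketch below simply repeats the sample text above and collects the characters that follow the seed "th".

    import java.util.*;

    public class SeedExample {
        public static void main(String[] args) {
            String text = "We hold these truths to be self-evident: "
                        + "that all men are created equal; that they";
            String seed = "th";
            // Collect every character that follows "th" in the sample text.
            List<Character> followers = new ArrayList<>();
            for (int j = text.indexOf(seed); j >= 0; j = text.indexOf(seed, j + 1)) {
                if (j + seed.length() < text.length()) {
                    followers.add(text.charAt(j + seed.length()));
                }
            }
            System.out.println(followers);   // [e, s, a, a, e]
        }
    }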