
A Random Text Model for the Generation of Statistical Language Invariants

A Random Text Model for the Generation of Statistical Language Invariants. Chris Biemann University of Leipzig, Germany HLT-NAACL 2007, Rochester, NY, USA Monday, April 23, 2007. Outline. Previous random text models Large-scale measures for text A novel random text model


Presentation Transcript


  1. A Random Text Model for the Generation of Statistical Language Invariants. Chris Biemann, University of Leipzig, Germany. HLT-NAACL 2007, Rochester, NY, USA. Monday, April 23, 2007

  2. Outline • Previous random text models • Large-scale measures for text • A novel random text model • Comparison to natural language text

  3. Necessary property: Zipf's Law • Zipf: When the words in a corpus are ordered by descending frequency, the frequency of the word at rank r is given by $f(r) \sim r^{-z}$, where z is the exponent of the power law and corresponds to the slope of the curve in a log-log plot. For word frequencies in natural language (NL), $z \approx 1$ • Zipf-Mandelbrot: $f(r) \sim (r + c_1)^{-(1 + c_2)}$ approximates the lower frequencies at very high ranks
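As a rough illustration of how the exponent z can be read off a rank-frequency list, the sketch below fits a least-squares line to log f(r) against log r. The function names `rank_frequency` and `zipf_exponent` are hypothetical helpers, not part of the talk, and a plain least-squares fit is only one of several ways to estimate the slope.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return word frequencies in descending order (index 0 is rank 1)."""
    return sorted(Counter(tokens).values(), reverse=True)

def zipf_exponent(freqs):
    """Estimate z as the negated least-squares slope of log f(r) vs. log r.

    Assumes at least two ranks and strictly positive frequencies.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var

# Synthetic check: frequencies constructed to follow f(r) = 1000 / r.
ideal = [1000 / r for r in range(1, 201)]
z = zipf_exponent(ideal)  # close to 1 for this constructed input
```

On real text the fit is usually restricted to a middle range of ranks, since the highest and lowest ranks deviate from a straight line.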

  4. Previous Random Text Models B. B. Mandelbrot (1953) • Sometimes called the "monkey at the typewriter" • At each step, a word separator is generated with probability w, • and each letter of an alphabet of size N is generated with probability (1-w)/N H. A. Simon (1955) • No alphabet of single letters • At each time step, a previously unseen word is added to the stream with probability α, whereas with probability (1-α) the next word is chosen from among the words at previous positions • Produces a frequency distribution that follows a power law with exponent z=(1-α) • Modified by Zanette and Montemurro (2002): - sublinear growth for higher exponents - Zipf-Mandelbrot law by a maximum probability threshold
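The two earlier models can be sketched in a few lines each. This is a minimal reading of the slide, not the original authors' code: the names `mandelbrot_text` and `simon_text`, the parameter name `alpha` for Simon's new-word probability, and the use of integer ids as words in Simon's model are all assumptions made for illustration.

```python
import random
import string

def mandelbrot_text(n_chars, w=0.2, alphabet=string.ascii_uppercase, rng=None):
    """Monkey at the typewriter: a separator with probability w,
    otherwise one of the N letters, each with probability (1-w)/N."""
    rng = rng or random.Random()
    chars = []
    for _ in range(n_chars):
        if rng.random() < w:
            chars.append(" ")
        else:
            chars.append(rng.choice(alphabet))
    # split() discards the empty words produced by consecutive separators
    return "".join(chars).split()

def simon_text(n_words, alpha=0.1, rng=None):
    """Simon's process: with probability alpha emit a previously unseen
    word, otherwise repeat the word found at a uniformly chosen earlier
    position (so frequent words are repeated more often)."""
    rng = rng or random.Random()
    stream = []
    next_id = 0
    while len(stream) < n_words:
        if not stream or rng.random() < alpha:
            stream.append(next_id)  # words are integer ids; there are no letters
            next_id += 1
        else:
            stream.append(rng.choice(stream))
    return stream
```

Neither function looks at word order or sentence boundaries, which is the gap the following slides address.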

  5. Critique of Previous Models • Mandelbrot: All words of the same length are equiprobable, because all letters are equiprobable. Ferrer i Cancho and Solé (2002): Initialisation with letter probabilities obtained from natural language text solves this problem, but where do these letter frequencies come from? • Simon: No concept of "letter" at all. • Both: • no concept of sentence • no word order restrictions: Simon = bag of words, Mandelbrot does not take the generated stream into account at all

  6. Large-scale Measures for Text • Zipf's law and lexical spectrum: the rank-frequency plot should follow a power law with $z \approx 1$, and the frequency spectrum (probability of frequencies) should follow a power law with $z \approx 2$ (Pareto distribution) • Word length: should be distributed as in natural language text, according to a variant of the gamma distribution (Sigurd et al. 2004) • Sentence length: should also be distributed as in NL, following the same gamma distribution • Significant neighbour-based co-occurrence graph: should be similar in random text and NL in terms of degree distribution and connectivity.

  7. A Novel Random Text Model Two parts: • Word Generator • Sentence Generator Both follow the principle of beaten tracks: • Memorize what has been generated before • Generate with higher probability if generated before more often Inspired by Small World network generation, especially (Kumar et al. 1999).

  8. Word Generator • Initialisation: • Letter graph of N letters. • Vertices are connected to themselves with weight 1. • Choice: • When generating a word, the generator chooses a letter x according to its probability P(x), which is computed as the normalized weight sum of outgoing edges: $P(x) = \mathrm{weightsum}(x) / \sum_{v \in V} \mathrm{weightsum}(v)$, with $\mathrm{weightsum}(y) = \sum_{u \in \mathrm{neigh}(y)} \mathrm{weight}(y, u)$ • Parameter: • At every position, the word ends with probability $w \in (0,1)$; otherwise a next letter is generated according to the letter production probability given above. • Update: • For every letter bigram, the weight of the directed edge from the preceding to the current letter in the letter graph is increased by one. • Effect: self-reinforcement of letter probabilities: • the more often a letter is generated, the higher its weight sum will be in subsequent steps, • leading to an increased generation probability.
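The word generator described on this slide can be sketched as follows. The class name `WordGenerator` and its methods are hypothetical. Two details are assumptions where the slide is silent: every word gets at least one letter before the end-of-word check is applied, and edge weights are updated immediately after each bigram rather than once per word.

```python
import random
from collections import defaultdict

class WordGenerator:
    def __init__(self, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", w=0.4, rng=None):
        self.alphabet = list(alphabet)
        self.w = w  # word-end probability, in (0, 1)
        self.rng = rng or random.Random()
        # weight[a][b] is the weight of the directed edge a -> b
        self.weight = {a: defaultdict(int) for a in self.alphabet}
        for a in self.alphabet:
            self.weight[a][a] = 1  # initial self-loop of weight 1

    def letter_probabilities(self):
        """P(x): weight sum of x's outgoing edges, normalised over all letters."""
        sums = {a: sum(self.weight[a].values()) for a in self.alphabet}
        total = sum(sums.values())
        return {a: s / total for a, s in sums.items()}

    def _draw_letter(self):
        probs = self.letter_probabilities()
        weights = [probs[a] for a in self.alphabet]
        return self.rng.choices(self.alphabet, weights=weights)[0]

    def generate(self):
        word = [self._draw_letter()]  # assumption: at least one letter per word
        while self.rng.random() >= self.w:  # continue with probability 1 - w
            nxt = self._draw_letter()
            self.weight[word[-1]][nxt] += 1  # reinforce edge preceding -> current
            word.append(nxt)
        return "".join(word)

gen = WordGenerator(alphabet="ABC", w=0.4, rng=random.Random(0))
sample = [gen.generate() for _ in range(10)]
```

Because each bigram adds weight to the outgoing edges of its first letter, letters that have been produced often become more likely to be produced again, which is the self-reinforcement effect named on the slide.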

  9. Word Generator Example The small numbers next to edges are edge weights. The probabilities of the letters for the next step are P(A)=0.4 P(B)=0.4 P(C)=0.2

  10. Measures on the Word Generator • Word Generator fulfills measures much better than the Mandelbrot model. • For other measures, we need something extra...

  11. Sentence Generator I • Initialisation: • Word graph is initialized with a begin-of-sentence (BOS) and an end-of-sentence (EOS) symbol, with an edge of weight 1 from BOS to EOS. • Word Graph: (directed) • Vertices correspond to words • Edge weights correspond to the number of times two words were generated in sequence. • Generation: • A random walk on the directed edges starts at the BOS vertex. • With probability (1-s), where s is the new word probability, an existing edge is followed from the current vertex to the next vertex • The probability of choosing endpoint X from the endpoints of all outgoing edges from the current vertex C is given by $P(X \mid C) = \mathrm{weight}(C, X) / \sum_{Y \in \mathrm{neigh}(C)} \mathrm{weight}(C, Y)$

  12. Sentence Generator II • Parameter: • With probability $s \in (0,1)$, a new word is generated by the word generator model • The next word is then chosen from the word graph in proportion to its weighted indegree: the probability of choosing an existing vertex E as successor of a newly generated word N is given by $P(E) = \mathrm{indeg}_w(E) / \sum_{v \in V} \mathrm{indeg}_w(v)$, with $\mathrm{indeg}_w(X) = \sum_{v \in V} \mathrm{weight}(v, X)$ • Update: • For each sequence of two words generated, the weight of the directed edge between them is increased by 1
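The random walk from the two Sentence Generator slides can be sketched as below. The class name `SentenceGenerator`, the `<BOS>`/`<EOS>` marker strings, and the `new_word` callable are hypothetical. Edge weights are updated only after the walk reaches EOS, which is an assumption made so that every vertex the walk stands on already has an outgoing edge. In the usage lines, a uniform draw from three letters stands in for the word generator.

```python
import random
from collections import defaultdict

BOS, EOS = "<BOS>", "<EOS>"

class SentenceGenerator:
    def __init__(self, new_word, s=0.08, rng=None):
        self.new_word = new_word  # callable returning a word (string)
        self.s = s                # new word probability, in (0, 1)
        self.rng = rng or random.Random()
        self.out = defaultdict(lambda: defaultdict(int))  # out[a][b]: weight of a -> b
        self.indeg = defaultdict(int)                     # weighted indegree per vertex
        self._add_edge(BOS, EOS)                          # initial edge of weight 1

    def _add_edge(self, a, b):
        self.out[a][b] += 1
        self.indeg[b] += 1

    def _weighted_choice(self, weights):
        keys = list(weights)
        return self.rng.choices(keys, weights=[weights[k] for k in keys])[0]

    def generate(self):
        path = [BOS]
        while path[-1] != EOS:
            if self.rng.random() < self.s:
                # New word, followed by a successor drawn by weighted indegree.
                path.append(self.new_word())
                path.append(self._weighted_choice(self.indeg))
            else:
                # Follow an existing edge in proportion to its weight.
                path.append(self._weighted_choice(self.out[path[-1]]))
        # Assumption: reinforce all bigrams once the sentence is complete.
        for a, b in zip(path, path[1:]):
            self._add_edge(a, b)
        return path[1:-1]  # an empty list is an empty sentence (BOS -> EOS)

word_rng = random.Random(0)
gen = SentenceGenerator(lambda: word_rng.choice("ABC"), s=0.3, rng=random.Random(1))
sentences = [sent for sent in (gen.generate() for _ in range(50)) if sent]
```

BOS never receives an incoming edge, so it is never drawn as a successor, and the list comprehension in the last line drops empty sentences from the output.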

  13. Sentence Generator Example • In the last step, the second CA was generated as a new word from the word generator. • The generation of empty sentences happens frequently. These are omitted in the output.

  14. Comparison to Natural Language • Corpus for comparison: the first 1 million words of the BNC, spoken English. • 26 letters, uppercase, punctuation removed → same in word generator • 125,395 sentences → set s=0.08, remove first 50K sentences • Average sentence length: 7.975 words • Average word length: 3.502 letters → w=0.4 Corpus sample: OOH OOH ERM WOULD LIKE A CUP OF THIS ER MM SORRY NOW THAT S NO NO I DID NT I KNEW THESE PEWS WERE HARD OOH I DID NT REALISE THEY WERE THAT BAD I FEEL SORRY FOR MY POOR CONGREGATION

  15. Word Frequency • Zipf-Mandelbrot distribution • Smooth curve • Similar to English

  16. Word Length • More 1-letter words in the sentence generator • Longer words in the sentence generator • Curve is similar • Gamma distribution here: $f(x) \sim x^{1.5} \cdot 0.45^{x}$
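To get a feel for the shape of this gamma-distribution variant, the sketch below normalises x^1.5 · 0.45^x over integer word lengths and locates its peak. The function name `gamma_variant_pmf` and the cut-off at 30 letters are assumptions for illustration; the slide gives only the unnormalised form.

```python
def gamma_variant_pmf(max_len=30, a=1.5, b=0.45):
    """Normalise f(x) = x**a * b**x over word lengths 1..max_len."""
    raw = {x: x ** a * b ** x for x in range(1, max_len + 1)}
    total = sum(raw.values())
    return {x: v / total for x, v in raw.items()}

pmf = gamma_variant_pmf()
mode = max(pmf, key=pmf.get)                # most probable word length
mean = sum(x * p for x, p in pmf.items())   # expected length under this curve
```

With these parameters the curve peaks at two letters and then decays roughly geometrically, so long words are possible but rare.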

  17. Sentence Length • Longer sentences in English • More 2-word sentences in English • Curve is similar

  18. Neighbor-based Co-occurrence Graph • Min. co-occurrence frequency = 2, min. log-likelihood ratio = 3.84 • NB-graph is a small world • Qualitatively, English and the sentence generator are similar • The word generator shows far fewer co-occurrences • Factor of 2 difference in clustering coefficient and number of vertices
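One way to apply the two thresholds above is to score each adjacent word pair with a log-likelihood ratio over its 2x2 contingency table and keep the pairs that pass both cut-offs. The sketch below uses the standard G² statistic; whether the talk used exactly this formulation is an assumption, and the names `llr` and `significant_neighbours` are hypothetical.

```python
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of pair counts."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = ((k11, row1, col1), (k12, row1, col2),
             (k21, row2, col1), (k22, row2, col2))
    # Cells with a zero count contribute nothing to the sum.
    return 2 * sum(k * math.log(k * n / (r * c)) for k, r, c in cells if k > 0)

def significant_neighbours(sentences, min_freq=2, min_llr=3.84):
    """Return {(left, right): score} for adjacent pairs passing both thresholds."""
    pairs = Counter()
    for sent in sentences:
        pairs.update(zip(sent, sent[1:]))
    n = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c
        right[b] += c
    edges = {}
    for (a, b), k11 in pairs.items():
        if k11 < min_freq:
            continue
        k12 = left[a] - k11         # a followed by something other than b
        k21 = right[b] - k11        # b preceded by something other than a
        k22 = n - k11 - k12 - k21   # pairs involving neither position
        score = llr(k11, k12, k21, k22)
        if score >= min_llr:
            edges[(a, b)] = score
    return edges

toy = [["A", "B"]] * 5 + [["C", "D"]] * 5
graph = significant_neighbours(toy)
```

The value 3.84 is the 5% critical value of a chi-square distribution with one degree of freedom. Note that G² is also large for pairs that co-occur less often than expected, so a practical version may add a check on the direction of the association.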

  19. Formation of Sentences • Word graph grows and contains the full vocabulary used so far for generating in every time step. • Random walks starting from BOS always end in EOS. • Sentence length slowly increases: random walk has more possibilities before finally arriving at the EOS vertex. • Sentence length is influenced by both parameters of the model: • the word end probability w in the word generator • the new word probability s in the sentence generator.

  20. Conclusion Novel random text model • obeys Zipf's law • obeys word length distribution • obeys sentence length • shows similar nb-cooccurrence data First model that: • produces smooth lexical spectrum without initial letter probabilities • incorporates notion of a sentence • models word order restrictions

  21. Sentence generator at work Beginning: Q . U . RFXFJF . G . G . U . R . U . RFXFJF . XXF . RFXFJF . U . QYVHA . RFXFJF . R TCW . CV . Z U . G . XXF . RFXFJF . M XXF . Q . G . RFXFJF . U . RFXFJF . RFXFJF . Z U . G . RFXFJF . RFXFJF . M XXF . R . Z U . Later: X YYOXO QO OEPUQFC T TYUP QYFA FN XX TVVJ U OCUI X HPTXVYPF . FVFRIK . Y TXYP VYFI QC TPS Q UYYLPCQXC . G QQE YQFC XQXA Z JYQPX. QRXQY VCJ XJ YAC VN PV VVQF C XJN JFEQ QYVHA. U VIJ Q YT JU OF DJWI QYM U YQVCP QOTE OD XWY AGFVFV U XA YQYF AVYPO CDQQ TY NTO FYF QHT T YPXRQ R GQFRVQ . MUHVJ Q VAVF YPF QPXPCY Q YYFRQQ. JP VGOHYY F FPYF OM SFXNJJ A VQA OGMR L QY . FYC T PNXTQ . R TMQCQ B QQTF J PVX YT DTYO RXJYYCGFJ CYFOFUMOCTM PQRYQQYC AHXZQJQ JTW O JJ VX QFYQ YTXJTY YTYYFXK . RFXFJF JY XY RVV J YURQ CM QOXGQ QFMVGPQ. OY FDXFOXC. N OYCT . L MMYMT CY YAQ XAA J YHYJ MPQ XAQ UYBX RW XXF O UU COF XXF CQPQ VYYY XJ YACYTF FN . TA KV XJP O EGV J HQY KMQ U .

  22. Questions? Danke sdf sehr gf thank fdgf you g fd tusen sd ee takk erte dank we u trew wel wwd muchas werwe ewr gracias werwe rew merci mille werew re ew ee ew grazie d fsd ffs df d fds spassiva fs fdsa rtre trerere rteetr trpemma eedm
