
Peter Grzybek

Peter Grzybek.  Von der Ökonomie der Sprache zur Selbst-Regulation kultureller Systeme Korpuslinguistik vs. Textanalyse  Exakte Literaturwissenschaft: Zur Prosa Karel Č apeks  Was tun die Wörter im Vers miteinander? Zur Poesie A.S. Pu š kins. http://www-gewi.uni-graz.at/quanta


Presentation Transcript


  1. Peter Grzybek • From the Economy of Language to the Self-Regulation of Cultural Systems • Corpus Linguistics vs. Text Analysis • Exact Literary Studies: On the Prose of Karel Čapek • What Do the Words in Verse Do with One Another? On the Poetry of A.S. Puškin • http://www-gewi.uni-graz.at/quanta • Austrian Research Fund Project #15485

  2. Peter Grzybek: Corpus Linguistics vs. Text Analysis. http://www-gewi.uni-graz.at/quanta Austrian Research Fund Project #15485

  3. Analysis of Letter Frequencies: Methodological Problems in Former Studies • Insufficient Data Distinction (graphemic and phonematic/phonetic data) • Insufficient Control of Data Homogeneity (texts / text segments / text mixtures (corpora)) • Frequency Models: Continuous vs. Discrete: (a) theoretical entropy, repeat rate; (b) Σ p_i = 1 • Goodness of Fit: graphics vs. tests, R² vs. χ²

  4. Analysis of Letter Frequencies: Methodological Decisions • 1. Data Distinction: graphemic data • 2. Control of Data Homogeneity: text vs. text segments vs. text cumulations vs. text mixtures (corpus) • 3. Discrete Frequency Models: test of relevant models • 4. Goodness of Fit: χ² test → C = χ² / N (C < 0.02 = *; C < 0.01 = **)
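The goodness-of-fit criterion on slide 4 can be sketched in a few lines. This is a minimal illustration, assuming the usual Pearson form of χ² and taking N as the total number of observed letters; the function names and the letter counts are hypothetical and do not come from the presentation.

```python
def discrepancy_coefficient(observed, expected):
    """Return (chi2, C), where C = chi2 / N and N is the total observed count.

    Assumes the Pearson chi-square statistic: sum of (obs - exp)^2 / exp.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    n_total = sum(observed)
    return chi2, chi2 / n_total


def rating(c):
    """Map C to the star notation of slide 4 (C < 0.02 = *, C < 0.01 = **)."""
    if c < 0.01:
        return "**"
    if c < 0.02:
        return "*"
    return ""


# Hypothetical counts, invented purely for illustration.
observed = [120, 90, 60, 30]
expected = [115.0, 95.0, 58.0, 32.0]

chi2, c = discrepancy_coefficient(observed, expected)
print(f"chi2 = {chi2:.4f}, C = {c:.5f}, rating = '{rating(c)}'")
```

Dividing χ² by N keeps the criterion usable for very large samples, where the raw χ² value grows with sample size.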

  5. Analysis of Letter Frequencies Slavic Alphabets

  6. Analysis of Letter Frequencies Russian

  7. Zipf (Zeta) distribution. Basic assumption: r × f_r = c → f_r = c / r

  8. Zipf-Mandelbrot distribution. Basic assumption: f_r = c / (r + b)^a
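The two rank-frequency assumptions on slides 7 and 8 can be compared with a short sketch. It assumes that the constant c is fixed by normalising over a finite inventory of n ranks, so that the probabilities sum to 1; that choice, the function name, and the parameter values below are illustrative assumptions, not taken from the slides.

```python
def zipf_mandelbrot(n, a, b):
    """Probabilities p_r proportional to 1 / (r + b)**a for ranks r = 1..n.

    Assumption: c is obtained by normalising over the n ranks.
    With a = 1 and b = 0 this reduces to the Zipf form f_r = c / r.
    """
    weights = [1.0 / (r + b) ** a for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical inventory of 32 letters and made-up parameter values.
zipf_probs = zipf_mandelbrot(32, a=1.0, b=0.0)
zm_probs = zipf_mandelbrot(32, a=1.2, b=0.5)

for r in (1, 2, 3):
    print(r, round(zipf_probs[r - 1], 4), round(zm_probs[r - 1], 4))
```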

  9. Zipf and Zipf-Mandelbrot Distributions: Goodness of Fit (38 Russian samples)

  10. Geometric Distribution and Good Distribution

  11. Negative Hypergeometric Distribution: n = inventory size, x = class; 2 parameters: K and M. Analysis of Russian Letter Frequencies. Corpus: 37 texts (ca. 8.5 million letters)
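The slide names only n, K, and M, so the following is a sketch under a stated assumption: it uses the parametrisation of the negative hypergeometric distribution that is common in quantitative linguistics, P_x = C(M+x-1, x) · C(K-M+n-x-1, n-x) / C(K+n-1, n) for x = 0, ..., n, with K > M > 0. Any shift of x so that ranks start at 1 is left out, and the parameter values are hypothetical.

```python
from math import exp, lgamma


def log_gen_binom(a, x):
    """log of C(a + x - 1, x) = Gamma(a + x) / (Gamma(x + 1) * Gamma(a)), a > 0."""
    return lgamma(a + x) - lgamma(x + 1) - lgamma(a)


def neg_hypergeom_pmf(n, K, M):
    """Probabilities P_0..P_n under the assumed parametrisation (K > M > 0)."""
    if not K > M > 0:
        raise ValueError("requires K > M > 0")
    log_norm = log_gen_binom(K, n)
    return [
        exp(log_gen_binom(M, x) + log_gen_binom(K - M, n - x) - log_norm)
        for x in range(n + 1)
    ]


# Hypothetical parameter values, chosen only to show the shape.
probs = neg_hypergeom_pmf(n=31, K=3.5, M=0.8)
mean = sum(x * p for x, p in enumerate(probs))
print(f"sum = {sum(probs):.6f}, mean = {mean:.4f}")
```

Working with log-gamma values avoids overflow for larger n and allows non-integer K and M, which fitted parameters usually are.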

  12. Negative Hypergeometric Distribution: Analysis of Russian Letter Frequencies. Comparison of texts, text segments, text cumulations, text mixtures, and complete corpus: constancy of goodness of fit (C); constancy of parameters (K, M)

  13. Negative Hypergeometric Distribution: Analysis of Slovene Letter Frequencies. Corpus: ca. 130,000 letters. Goodness of fit: C = 0.0094

  14. Negative Hypergeometric Distribution: Analysis of Slovene Letter Frequencies. Comparison of texts, text segments, text cumulations, text mixtures, and complete corpus: constancy of goodness of fit (C); constancy of parameters (K, M)

  15. Slovene Letters / Slovene Phonemes. Analysis of Slovene Letter and Phoneme Frequencies. Corpus: ca. 130,000

  16. First Tentative Results of Slovak Letter Frequencies • Tasks: • Interpretation of parameters: the “foreign letters” Q, W, X influence inventory size • Exploration of the data basis: texts, text segments, text cumulations, text mixtures

  17. The Question of Data Homogeneity

  18. “[…] the magnitude of words tends, on the whole, to stand in an inverse (not necessarily proportionate) relationship to the number of occurrences” (Zipf 1935: 25). Four major problems in research:

  19. • What is the direction of dependence: Does frequency depend on length or vice versa? • What is the unit of measurement: Is word length measured in letters, phonemes, syllables, morphemes, ...? • What is frequency: Absolute occurrence or the rank of words, or of word forms? • What is the text basis: Corpus data, frequency dictionaries, ..., individual texts?

  20. Assuming that word length is a function of frequency, measuring word length in the number of syllables per word, and analyzing the absolute occurrence of words, the influence of the text basis shall be tested: individual texts vs. text cumulations vs. corpus data → DATA HOMOGENEITY

  21. Different Languages, Different Authors, Different Text Types: • complete novel, composed of chapters • complete book of a novel, consisting of several chapters • individual chapters • dialogical vs. narrative sequences within a text

  22. Russian: Anna Karenina (ch. 1)
      x (frequency)   y (length, observed)   y (length, fitted)
       1              2.92                   3.03
       2              2.14                   2.04
       3              2.05                   1.70
       4              1.50                   1.53
       5              1.33                   1.43
       6              1.50                   1.36
       7              1.67                   1.31
       8              1.00                   1.27
       9              1.00                   1.24
      10              1.00                   1.22
      13              1.00                   1.17
      19              1.00                   1.12
      20              1.00                   1.11
      37              1.00                   1.06
      a = 2.0261, b = 0.9660, R² = 0.88, N = 397
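The numbers on slide 22 can be checked with a short sketch. It assumes that the curve is the model y = ax^-b + 1 named on slide 31, and that R² is the ordinary, unweighted coefficient of determination over the listed frequency classes; both are assumptions, although the reproduced values are consistent with them. The data are transcribed from the slide; the variable names are illustrative.

```python
# Values transcribed from slide 22 (Anna Karenina, ch. 1).
A, B = 2.0261, 0.9660
freq = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 19, 20, 37]
observed = [2.92, 2.14, 2.05, 1.50, 1.33, 1.50, 1.67,
            1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
slide_fitted = [3.03, 2.04, 1.70, 1.53, 1.43, 1.36, 1.31,
                1.27, 1.24, 1.22, 1.17, 1.12, 1.11, 1.06]


def model(x, a=A, b=B):
    """Assumed model: word length y = a * x**(-b) + 1 for frequency x."""
    return a * x ** (-b) + 1.0


fitted = [model(x) for x in freq]

# Unweighted coefficient of determination over the frequency classes.
mean_obs = sum(observed) / len(observed)
ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
ss_tot = sum((o - mean_obs) ** 2 for o in observed)
r2 = 1.0 - ss_res / ss_tot

for x, f, s in zip(freq, fitted, slide_fitted):
    print(f"x = {x:2d}  computed = {f:.2f}  slide = {s:.2f}")
print(f"R^2 = {r2:.2f}")
```

The additive constant 1 reflects that no word is shorter than one syllable, so the curve approaches 1 as frequency grows.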

  23. Text                     Language     N      R²     a      b
      Anna Karenina (I,1)      Russian      397    0.88   2.03   0.97
      Evgenij Onegin (I)       Russian      1871   0.96   1.70   0.79
      Na badnjak               Croatian     2450   0.93   1.95   0.51
      Zářivé hlubiny           Czech        1363   0.94   1.76   0.59
      Hiša M.P. (I)            Slovenian    1147   0.84   1.80   0.40
      Zakliata panna           Slovak       926    0.88   1.48   0.69
      Hänsel und Gretel        German       803    0.87   1.16   0.51
      Fairy Tale by Móra       Hungarian    234    0.96   1.57   0.84
      Di lembung kuring        Sundanese    431    0.91   1.86   0.51
      Burung api               Indonesian   1393   0.92   2.44   0.26
      Portrait of a Lady (I)   English      1104   0.89   1.23   0.83
      0.84 ≤ R² ≤ 0.96

  24. The course of the theoretical curves

  25. The relationship between parameters a and b

  26. The relationship between text length (N) and parameter a

  27. Obvious data inhomogeneity 1. Texts from different languages, authors, and various text types 2. Violation of the ceteris paribus condition Ergo: The data in this mixture are not adequate for testing the hypothesis at stake

  28. Lev N. Tolstoj: Anna Karenina Chap. I,1 vs. I (34 chapters)

  29. Henry James: Portrait of a Lady Chap. 1 vs. novel (52 chapters)

  30. Ks.Š. Gjalski: Na badnjak Narrative vs. dialogical sequences

  31. Evgenij Onegin: Text cumulation (I – VIII). Results of fitting y = ax^-b + 1 to the cumulative text of Evgenij Onegin
      Chapter         N (types)   M (tokens)   a      b      R²
      I               1871        3209         1.70   0.79   0.96
      I+II            2918        5546         1.84   0.69   0.88
      I-III           3951        8359         1.92   0.57   0.88
      I-IV            4851        10936        1.97   0.53   0.92
      I-V             5737        13376        1.95   0.48   0.94
      I-VI            6509        15978        1.97   0.52   0.94
      I-VII           7476        19061        2.03   0.43   0.86
      text (I-VIII)   8329        22482        2.05   0.40   0.88

  32. Evgenij Onegin: text cumulation (chap. I – VIII). Dependence of parameter b on parameter a. Fitting y = ax^-b: R² = 0.92

  33. Evgenij Onegin: Text cumulation (I – VIII). Dependence of a on text length (N): a = 0.6493 N^0.1286 (R² = 0.96)
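The power law on slide 33 can be compared with the fitted values of a listed on slide 31. The sketch below assumes that N on slide 33 is the quantity in the first numeric column of slide 31 (1871 for chapter I up to 8329 for the whole text); the pairs are transcribed from slide 31, and the names are illustrative.

```python
# (chapters, N, fitted a) transcribed from slide 31.
STEPS = [
    ("I", 1871, 1.70),
    ("I+II", 2918, 1.84),
    ("I-III", 3951, 1.92),
    ("I-IV", 4851, 1.97),
    ("I-V", 5737, 1.95),
    ("I-VI", 6509, 1.97),
    ("I-VII", 7476, 2.03),
    ("I-VIII", 8329, 2.05),
]


def predicted_a(n, coef=0.6493, exponent=0.1286):
    """Power law from slide 33: a = 0.6493 * N**0.1286."""
    return coef * n ** exponent


deviations = []
for chapters, n, a_fitted in STEPS:
    a_pred = predicted_a(n)
    deviations.append(abs(a_pred - a_fitted))
    print(f"{chapters:7s} N = {n:5d}  a (slide 31) = {a_fitted:.2f}  "
          f"a (power law) = {a_pred:.2f}")

print(f"largest absolute deviation: {max(deviations):.3f}")
```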

  34. Summary & Results (I) Data corroborate hypothesis: There is a specific interrelation of parameters: a = f(N), b = g(a), b = h(N), where f, g, h are functions of the same type

  35. Summary & Results (II) • Homogeneous texts do not interfere with linguistic laws; inhomogeneous texts can distort the textual reality. • Text mixtures can evoke phenomena which do not exist as such in individual texts. • Short texts do not allow a property to take appropriate shape; long texts (and corpora) contain mixed generating regimes superposing different layers, which may lead to “artificial” phenomena. • With an increase of text size, the resulting curve of the frequency-length relationship is shifted upwards; this is caused by the fact that the number of words occurring only once increases up to a certain text length. If this assumption is correct, then b converges to zero, yielding the limit y = a.

  36. F I N I S
