
Automatic Term Weighting, Lexical Statistics and …… Quantitative Terminology

Kyo Kageura, National Institute of Informatics. July 05, 2003.


Presentation Transcript


  1. Automatic Term Weighting, Lexical Statistics and …… Quantitative Terminology. Kyo Kageura, National Institute of Informatics, July 05, 2003

  2. Project • To rescue/recover the sphere of lexicology • To release the richness and productivity of lexico-conceptual sets from the dominance of discourse …… • while maintaining the traceable procedure in the process of doing this • and starting from textual corpora

  3. Contents • Sphere of Texts and Sphere of Lexicon/ology • Three (representative) methods of automatic term weighting and their meanings • From corpus-based lexical statistics to (still) corpus-based quantitative lexicology • Measuring lexical productivity in lexicon (i.e. lexicological concept of productivity) from textual data, with some experiments • Conclusions

  4. Textual Sphere and Lexicological Sphere [Diagram labels: Lexicological Sphere ("This exists"): lexicon, lexicology, quantitative lexicology, terms, complex terms; Textual Sphere] • So what about talking about lexicology when talking about corpus-based…

  5. Lexicological Sphere and Texts • Lexicology deals with the actual set of words • which does not mean it is natural history • A lexicological model with expectations addresses the “realistic possibility of existence,” not permissible forms or a fantasy land • thus actual data is required • and the primary language data is texts • Thus the recovery of lexicological characteristics from texts becomes the task of lexicology

  6. Automatic Term Weighting (ATW) • Reviewing some representative ATW methods gives important insights into the current topic • while at the same time giving insights into ATW itself • We look at • tfidf (and its information-theoretic interpretation by Aizawa) • Term representativeness (by Hisamitsu) • Lexical measure (by Nakagawa), which goes from texts to lexicology, almost.

  7. ATW1: tfidf Tfidf and many similar measures, in fact most of those used in IR, are based on the document-term matrix, which has a formal duality. Thus the weight of a term is always and only meaningful vis-à-vis the given set of documents or its population (dfitf thus makes sense, as in the probabilistic model).
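
The dependence of tfidf on the document set can be seen in a small sketch. It assumes the common `tf * log(N / df)` variant; the slide does not say which variant is meant, and the function name `tfidf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # df: number of documents in which each term appears at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy documents (hypothetical data), already tokenised.
docs = [
    ["term", "weighting", "term"],
    ["term", "lexicon"],
    ["lexicon", "productivity"],
]
weights = tfidf(docs)
print(weights[0])
```

A term that occurs in every document of the set gets weight zero, whatever its frequency, so the same term is weighted differently as soon as the document set changes.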

  8. ATW2: Term representativeness • You shall know the meaning of a word by the company it keeps (or see friends to know a person … if there are any, anyway) • To calculate the weight of a term ti, take the distribution of words that accompany ti within a certain window size and calculate the distance between this and the distribution of a random chunk of the same window size (NB: size normalisation is necessary due to the LNRE, i.e. large number of rare events, nature of language data).

  9. ATW2: Term representativeness • This method discards the factor of dominant discourse or minor discourse at the level of observed texts (or does no favours to people who indiscriminately buy friends with money). • This method measures what the term ti, if it appears at all, can attract at the level of discourse (depending on the nature of the window the method takes, of course).
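
The idea of comparing a term's company with a random chunk of the same size can be sketched as follows. This is a simplified reading, not Hisamitsu's exact formulation: the distance is assumed to be a smoothed Kullback-Leibler divergence against the whole corpus, and the size normalisation is assumed to be a division by the mean distance of random chunks of the same size. All function names and the toy corpus are hypothetical.

```python
import math
import random
from collections import Counter

def context_tokens(tokens, target, window):
    """Collect the words within `window` positions of each occurrence of `target`."""
    ctx = []
    for i, tok in enumerate(tokens):
        if tok == target:
            ctx.extend(tokens[max(0, i - window):i])
            ctx.extend(tokens[i + 1:i + 1 + window])
    return ctx

def kl_divergence(sample, reference, vocab):
    """KL(sample || reference) over `vocab`, with add-one smoothing (an assumed choice)."""
    p, q = Counter(sample), Counter(reference)
    n_p, n_q = len(sample) + len(vocab), len(reference) + len(vocab)
    total = 0.0
    for w in vocab:
        pw = (p[w] + 1) / n_p
        qw = (q[w] + 1) / n_q
        total += pw * math.log(pw / qw)
    return total

def representativeness(tokens, target, window=2, draws=50, seed=0):
    """Distance of the target's context from the corpus, divided by the mean
    distance of random chunks of the same size (the size normalisation)."""
    rng = random.Random(seed)  # seeded so that the baseline is reproducible
    vocab = set(tokens)
    ctx = context_tokens(tokens, target, window)
    raw = kl_divergence(ctx, tokens, vocab)
    baseline = sum(
        kl_divergence(rng.choices(tokens, k=len(ctx)), tokens, vocab)
        for _ in range(draws)
    ) / draws
    return raw / baseline

# Toy corpus (hypothetical data).
corpus = ("the system model uses knowledge the system design uses data "
          "the information system stores data").split()
score = representativeness(corpus, "system", window=1)
print(score)
```

Because the baseline is computed from chunks of the same size as the term's context, a frequent term does not score higher merely for having a larger context.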

  10. ATW3: Nakagawa’s method • Observe the number of different elements (element types) that accompany ti within the complex lexical units in texts. • This therefore reflects the lexical productivity of the focal element ti, but together with the degree of its use in discourse (texts)
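
The counting step described above can be sketched as below. The sketch covers only the observation of element types adjacent to ti inside complex units; how Nakagawa combines these counts into a term score is not stated on the slide and is not attempted here. The function name and the toy terms are hypothetical.

```python
from collections import defaultdict

def neighbour_types(complex_terms):
    """For each element, count the distinct elements seen immediately to its
    left and to its right inside complex terms."""
    left, right = defaultdict(set), defaultdict(set)
    for term in complex_terms:
        for a, b in zip(term, term[1:]):  # adjacent element pairs
            right[a].add(b)
            left[b].add(a)
    elements = set(left) | set(right)
    return {e: (len(left.get(e, ())), len(right.get(e, ()))) for e in elements}

# Complex terms as they occur in a text (hypothetical data); repeats are allowed.
terms = [
    ("expert", "system"),
    ("knowledge", "base", "system"),
    ("system", "design"),
    ("expert", "system"),
]
counts = neighbour_types(terms)
print(counts["system"])
```

Repeated occurrences of the same complex term add nothing to the counts, yet an element that is used more often in the text has more chances to be seen with new neighbours, which is the mix of productivity and discourse use noted on the slide.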

  11. ATW to Quantitative Lexicology • To characterise the lexicological nature of elements from their occurrence in texts: • As in Hisamitsu’s method of term representativeness, the “discourse size” factor should be reduced, and more fundamentally so; • As in Nakagawa’s method, the point of observation should be limited to complex terms (or those which are supposed to be registered, or can be registered, in the lexicon/lexicological sphere).

  12. A Quantitative Terminological Study • Aim: To recover the productivity of constituent elements of simplex and complex terms as head. • Observe, like Nakagawa, the window range of simplex and complex terms in texts, e.g.

  13. Some preconditions/assumptions • The corpus and the target terminological space should: • belong to and represent the same domain • cover the same period of time • in general, match qualitatively • We are concerned with defining a measure which can compare the “productivity” of elements in the same lexicological/terminological sphere.

  14. Definition of measures (a) • f(i,N): frequency of ti in the text of size N • This is the extent of use in discourse and has nothing to do with lexicological productivity • d(i,N): number of different complex words whose head is ti in the text of size N • the first manifestation of lexicological productivity • basically identical to Nakagawa (2000) • thus this is the point of departure
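
The two measures can be computed side by side in a short sketch. It rests on assumptions the slide does not state: the text has already been reduced to a sequence of term occurrences, the head is the last element of a term, and f(i,N) is counted over occurrences of ti as head. The name `f_and_d` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def f_and_d(term_occurrences):
    """Return {head: (f, d)}: f counts occurrences of the element as head,
    d counts distinct complex terms headed by it."""
    f = Counter()
    complex_types = defaultdict(set)
    for term in term_occurrences:
        head = term[-1]  # assumption: the head is the rightmost element
        f[head] += 1
        if len(term) > 1:  # only complex terms contribute to d
            complex_types[head].add(term)
    return {h: (f[h], len(complex_types.get(h, ()))) for h in f}

# Term occurrences in text order (hypothetical data).
occurrences = [
    ("expert", "system"),
    ("expert", "system"),
    ("system",),
    ("database", "system"),
    ("knowledge",),
    ("domain", "knowledge"),
]
measures = f_and_d(occurrences)
print(measures)
```

In this toy data "system" is used four times but heads only two distinct complex terms, which shows why f and d have to be kept apart.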

  15. Definition of measures (b) • d(i,N) means the manifestation of the productivity of ti as it occurs in the corpus • d(i,N) is sensitive to the extent of use of the focal element in the textual corpus, • e.g. the following can be the case…

  16. Definition of measures (c) • Better measure for manifested productivity d(i,λN): the overall transition pattern of d(i,λN), where λ takes a positive real value (à la Hisamitsu). • The measure for potential productivity d(i) = d(i,λN) as λ→∞: discard all quantitative factors • Can be computed by LNRE models
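
For λ no greater than 1, the transition pattern of d(i,λN) can be approximated without fitting any model, using binomial interpolation: if each token is kept independently with probability λ, a complex word observed m times survives with probability 1 - (1 - λ)^m. The sketch below covers only that range. Values of λ above 1, and the limit λ→∞, require a fitted LNRE model and are not attempted here. The function name and the frequencies are hypothetical.

```python
def expected_types(type_frequencies, lam):
    """Expected number of distinct complex words in a sample of size lam * N,
    by binomial interpolation; valid only for 0 < lam <= 1."""
    if not 0 < lam <= 1:
        raise ValueError("interpolation only; extrapolation needs an LNRE model")
    # A type with m tokens is absent from the reduced sample with prob. (1 - lam)^m.
    return sum(1 - (1 - lam) ** m for m in type_frequencies)

# Frequencies of the distinct complex words headed by one element (hypothetical data).
freqs = [3, 1, 1]
curve = [(lam, expected_types(freqs, lam)) for lam in (0.25, 0.5, 0.75, 1.0)]
print(curve)
```

At λ = 1 the function returns the observed d(i,N); the shape of the curve below that point shows how strongly the count depends on sample size.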

  17. The measures and prob. distributions • Three distributions 1) The occurrence probability of heads in theoretical lexicological space. 2) The occurrence probability of modifiers for each head. 3) The probability of use of the head in the text. • Relations… • f(i,N) ⇔ 3) • d(i) ⇔ 1) • d(i,N) ⇔ 2),3)

  18. Experiments (1/5) • Artificial intelligence abstracts in Japanese • 4 elements are observed: 「system」 and 「model」 (general), and 「knowledge」 and 「information」 (specific)

  19. Experiments (2/5)

  20. Experiments (3/5)

  21. Experiments (4/5)

  22. Experiments (5/5) General elements, such as “system” or “model,” have high lexicological productivity, while subject-specific elements, such as “knowledge” or “information,” have rather low productivity.

  23. Summary • Starting from the observation of ATW methods and going into examining corpus-based quantitative terminological study, we • clarified the position of lexicology/lexicon • clarified the basic framework of quantitative lexicology/terminology, with relevant measures. • gave some corresponding distributions • gave the framework of interpretation to measures • carried out experiments …

  24. Remaining problems • Concepts of “lexicologisation” and “word” • To be registered to the lexicon • To be consolidated as a lexical unit within the syntagmatic stream of language manifestations • Distribution of complex words in texts and word unit • “reference+head” vs. “modifier+head” • The former is related to an essential concept(ualisation) of lexicon/lexicology…
