Statistical Analysis of Light Verb Variations in Chinese Corpus

Explore semantic nuances and selectional restrictions of light verbs in Chinese through statistical analysis of comparable corpora. Investigate variations between Mainland and Taiwan Mandarin usage. Analyze the distributional patterns and factors influencing light verb choices.

Presentation Transcript


  1. A Comparable Corpus Driven, Multivariate Approach to Light Verb Variations in World Chineses • Jingxia LIN (2), Menghan JIANG (1), and Chu-Ren HUANG (1) • (1) The Hong Kong Polytechnic University, (2) Nanyang Technological University

  2. Light verbs in Chinese • Similar to English light verbs: take a rest, give advice, give a description • Semantically bleached: they contain no eventive information • The predicative content comes mainly from the complement they take: 進行討論 jin4xing2 tao3lun4 ‘have a discussion’ • Being semantically bleached, they do not strongly select their objects • They can take a wide range of objects, including deverbal nouns, eventive nouns, and sometimes concrete numbers with eventive meaning • They are sometimes interchangeable with the same nominal object

  3. Underspecified Selectional Restriction of Chinese Light Verbs • 從事 cong2shi4, 搞 gao3, 加以 jia1yi3, 進行 jin4xing2, and 做 zuo4 are among the most frequently used (and most typical) light verbs in Modern Chinese • These five light verbs are sometimes interchangeable • 從事/搞/加以/進行/做研究 • cong2shi4/gao3/jia1yi3/jin4xing2/zuo4 yan2jiu1 • “to do research”

  4. Underspecified Selectional Restriction of Chinese Light Verbs II • Collocation constraints are sometimes found with these light verbs • e.g., 進行/*加以/*從事/搞/*做比賽 jin4xing2/*jia1yi3/*cong2shi4/gao3/*zuo4 bi3sai4 “play a game” • *進行/加以/*從事/*搞/*做考慮 *jin4xing2/jia1yi3/*cong2shi4/*gao3/*zuo4 kao3lv4 “give consideration”

  5. Variations of Light Verb Usages in Mainland and Taiwan Mandarin Variants • Even with the very limited collocation constraints, variations still exist: Taiwan light verbs tend to take more types of NPs, and even VPs, as their complements • 進行感恩之旅/君子之爭 jin4xing2 gan3en1zhi1lv3/jun1zi3zhi1zheng1 “to proceed with a ‘thanksgiving trip’/‘gentlemen’s dispute’” • 進行抹黑/開票 jin4xing2 mo3hei1/kai1piao4 “to proceed with ‘mud-slinging’/‘ballot counting’” (Huang et al. 2013)

  6. Theoretical Challenges for Corpus-based Studies of Chinese Light Verbs • Can distribution-based statistical analysis identify the differences among different Chinese light verbs? • The contrasts among the light verbs are often tendencies rather than grammaticality dichotomies; hence the distributional patterns are less prominent and harder to characterize • Can the subtle light verb variations between different variants of Chinese be identified through statistical analysis based on comparable corpora (cf. Huang et al. 2013)?

  7. Main Research Questions • Facing the above challenges, we try to answer the following four research questions: • Can light verbs be differentiated from each other by statistical methods? • Can the grammatical differences between variants of the same language be empirically verified by distributional features? • Are these differences statistically significant? • If the answers to both questions are yes, how do they differ statistically from each other? That is, which is more prominent: the distributional difference between two different light verbs, or that between two variants of the same light verb?

  8. Methodology • A comparable-corpus-driven statistical approach • 加以 jia1yi3, 進行 jin4xing2, 從事 cong2shi4, 搞 gao3, and 做 zuo4 in Mainland Mandarin and Taiwan Mandarin • Statistical methods and tools • Univariate analysis + multivariate analysis • Polytomous package in R (Arppe 2008)

  9. Data • Chinese Gigaword corpus (over 1.1 billion Chinese characters) • Central News Agency (Taiwan, about 700 million characters) • Xinhua News Agency (Mainland China, about 400 million characters) • Random sample: 200 sentences for each of the five light verbs from each of the Mainland and Taiwan corpora • 1,000 in total for Mainland Chinese • 1,000 in total for Taiwan Chinese

  10. 12 factors: (e.g. Zhu 1985, Zhou 1987, Cai 1982, Huang et al. 1995, among others)

  11. Mainland Chinese: an overall look at the factors
      > str(MLLV3)
      'data.frame': 1000 obs. of 13 variables:
       $ LV       : Factor w/ 5 levels "congshi","gao",..: 1 1 1 1 1 1 1 1 1 1 ...
       $ POS      : Factor w/ 2 levels "N","V": 2 2 2 2 1 1 2 2 2 2 ...
       $ ARGSTR   : Factor w/ 3 levels "one","two","zero": 1 1 2 1 3 3 2 1 1 1 ...
       $ VOCOMP   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
       $ EVECOMP  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
       $ OTHERLV  : Factor w/ 1 level "no": 1 1 1 1 1 1 1 1 1 1 ...
       $ ASP      : Factor w/ 4 levels "guo","le","no",..: 3 3 3 3 3 3 3 3 3 3 ...
       $ SPONTEVT : Factor w/ 1 level "yes": 1 1 1 1 1 1 1 1 1 1 ...
       $ DUREVT   : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
       $ FOREVT   : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
       $ PSYEVT   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
       $ INTEREVT : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
       $ ACCOMPEVT: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
      • Among the 12 independent variables, two have only one level • OTHERLV: co-occurrence of the dependent variable (the light verb) with another light verb • In none of the 1,000 sentences does any of the five light verbs co-occur with another light verb • SPONTEVT: spontaneous events as the complement of the light verb • In all 1,000 sentences, the five light verbs take spontaneous events as their complements • Therefore, the two factors cannot distinguish the five light verbs and are excluded from further statistical analysis
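The exclusion step on this slide can be sketched in a few lines. This is a minimal plain-Python illustration, not the authors' R code: the helper name `drop_constant_factors` and the toy rows are hypothetical, and only the column names come from the `str()` output above.

```python
from typing import Dict, List


def drop_constant_factors(columns: Dict[str, List[str]], outcome: str) -> Dict[str, List[str]]:
    """Keep the outcome column plus every predictor that has at least two distinct levels."""
    return {
        name: values
        for name, values in columns.items()
        if name == outcome or len(set(values)) > 1
    }


# Hypothetical toy rows (NOT the study's data); column names follow the slide.
toy = {
    "LV": ["congshi", "gao", "jiayi", "jinxing", "zuo"],
    "POS": ["V", "N", "V", "V", "N"],
    "OTHERLV": ["no"] * 5,    # a single level, so it cannot discriminate
    "SPONTEVT": ["yes"] * 5,  # a single level, so it cannot discriminate
}

kept = drop_constant_factors(toy, outcome="LV")
print(sorted(kept))
```

A factor with one level has the same value for every light verb, so it carries no information about which verb occurs; that is why it can be dropped before any test is run.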

  12. Univariate analysis of Chinese light verbs • Chi-squared tests for the significance of the co-occurrence of each factor with individual light verbs • The chisq.posthoc() function in the Polytomous package automatically transforms the results (standardized Pearson residuals e_ij (Agresti 2002)) into signs • “+”: e_ij > 2, statistically significant overuse of the light verb with the factor • “-”: e_ij < -2, statistically significant underuse of the light verb with the factor • “0”: e_ij in [-2, 2], lack of statistical significance
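The sign coding above can be reproduced by hand for a small contingency table. The sketch below is plain Python rather than the Polytomous package; it assumes the usual definition of the standardized Pearson residual, e_ij = (O_ij - E_ij) / sqrt(E_ij (1 - p_i+)(1 - p_+j)), and the counts are hypothetical.

```python
import math
from typing import List


def standardized_residuals(table: List[List[float]]) -> List[List[float]]:
    """Standardized Pearson residuals for an r x c table of observed counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            # Assumes no empty row or column, otherwise the denominator is zero.
            denom = math.sqrt(expected * (1 - row_tot[i] / n) * (1 - col_tot[j] / n))
            out_row.append((observed - expected) / denom)
        result.append(out_row)
    return result


def to_sign(e: float, cutoff: float = 2.0) -> str:
    """Map a residual to '+', '-' or '0' using the |e| > 2 rule from the slide."""
    if e > cutoff:
        return "+"
    if e < -cutoff:
        return "-"
    return "0"


# Hypothetical counts: rows = two light verbs, columns = feature absent / present.
counts = [[30, 70], [60, 40]]
signs = [[to_sign(e) for e in row] for row in standardized_residuals(counts)]
print(signs)
```

With these made-up counts the first verb is coded as underused without the feature and overused with it, and the second verb shows the mirror image.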

  13. Mainland Chinese: a univariate analysis • Four features show no significance (at the 0.05 level) in distinguishing the five light verbs.

  14. Mainland Chinese: a univariate analysis • The table also shows that each light verb has a significant preference for certain factors.

  15. Polytomous Logistic Regression • 加以/進行/從事/搞/做研究 • jia1yi3/jin4xing2/cong2shi4/gao3/zuo4 yan2jiu1 • “to do research” • The five light verbs are the possible outcomes • Estimate the probability of occurrence of each candidate light verb • Polytomous logistic regression • An extension of standard logistic regression • Allows for simultaneous estimation of the probabilities of multiple outcomes (the light verbs in the current study)
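To make "simultaneous estimation of the probabilities of multiple outcomes" concrete, the sketch below turns one linear score per light verb into probabilities that sum to 1. It uses the generic multinomial (softmax) form as an assumption; the Polytomous package may estimate its models with a different heuristic, and the scores are hypothetical rather than fitted coefficients.

```python
import math
from typing import Dict


def outcome_probabilities(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert per-outcome linear scores into probabilities that sum to 1 (softmax)."""
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {verb: math.exp(score - top) for verb, score in scores.items()}
    total = sum(exps.values())
    return {verb: value / total for verb, value in exps.items()}


# Hypothetical scores for one context (one combination of factor values).
scores = {"congshi": 0.2, "gao": -1.0, "jiayi": 1.1, "jinxing": 0.6, "zuo": -0.3}
probs = outcome_probabilities(scores)
predicted = max(probs, key=probs.get)

for verb, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{verb:8s} {p:.3f}")
print("predicted:", predicted)
```

Every verb keeps a non-zero probability, so the model predicts a most likely verb for a context without ruling the others out.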

  16. Main Results of Polytomous for Mainland Chinese • odds > 1: the chance of occurrence of a light verb is significantly increased by the feature (marked in orange) • odds < 1: the chance of occurrence of a light verb is significantly decreased by the feature (marked in blue) • Non-significant odds (p-value > 0.05) are given in parentheses
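The reading convention for the odds table can be written down as a small helper. This is an illustrative sketch under two assumptions: the odds are obtained by exponentiating a logistic-regression coefficient (log-odds), and the helper `describe_odds` and its inputs are hypothetical, not values from the study.

```python
import math


def describe_odds(log_odds: float, p_value: float, alpha: float = 0.05) -> str:
    """Format one table cell: odds = exp(coefficient); non-significant odds go in parentheses."""
    odds = math.exp(log_odds)
    if p_value > alpha:
        return f"({odds:.2f})"
    if odds > 1:
        direction = "increases"
    elif odds < 1:
        direction = "decreases"
    else:
        direction = "does not change"
    return f"{odds:.2f} ({direction} the chance)"


# Hypothetical coefficients and p-values.
print(describe_odds(math.log(3.0), 0.01))
print(describe_odds(math.log(0.25), 0.001))
print(describe_odds(math.log(0.5), 0.20))
```

Odds of 3 mean the feature triples the odds of that light verb relative to the feature being absent, and odds of 0.25 cut them to a quarter.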

  17. Distributional Contrasts Can Differentiate Light Verb Pairs • Most pairs of light verbs can be effectively differentiated by one or more factors (i.e., those for which they have contrasting positive/negative tendencies to appear) • congshi/gao: ARGSTRtwo • congshi/jiayi: ARGSTRtwo • congshi/jinxing: INTEREVTyes • gao/jiayi: ACCOMPEVTyes • gao/zuo: ARGSTRtwo/ARGSTRzero • jiayi/jinxing: ACCOMPEVTyes • jiayi/zuo: ARGSTRtwo • jinxing/zuo: INTEREVTyes • Only two pairs have no contrasting significant features: congshi/zuo and gao/jinxing
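The pairwise comparison described on this slide amounts to looking, for every pair of verbs, for factors on which one verb is significantly positive and the other significantly negative. A minimal sketch follows; the function name and the sign table are hypothetical, because the actual directions are in a table that did not survive in this transcript.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def contrasting_factors(signs: Dict[str, Dict[str, str]]) -> Dict[Tuple[str, str], List[str]]:
    """For each verb pair, list the factors where one verb is '+' and the other is '-'."""
    result = {}
    for a, b in combinations(sorted(signs), 2):
        shared = sorted(set(signs[a]) & set(signs[b]))
        result[(a, b)] = [f for f in shared if {signs[a][f], signs[b][f]} == {"+", "-"}]
    return result


# Hypothetical significance signs ('+', '-', or '0' for non-significant).
signs = {
    "congshi": {"ARGSTRtwo": "-", "INTEREVTyes": "0"},
    "jiayi": {"ARGSTRtwo": "+", "INTEREVTyes": "0"},
    "zuo": {"ARGSTRtwo": "-", "INTEREVTyes": "-"},
}

pairs = contrasting_factors(signs)
for pair, factors in pairs.items():
    print(pair, factors if factors else "no contrasting factor")
```

A pair where one verb is significant and the other is "0" on a factor does not count as contrasting here, which mirrors the slide's distinction between positive/negative contrasts and significant/non-significant differences.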

  18. Probability of Occurrence of Light Verbs • A probability model is adopted to predict the identity of the light verb at its position of occurrence. • The overall performance of the model is good • The most frequently predicted light verb in each column corresponds to the light verb that actually occurs in the data (see the red figures)

  19. F-score of Automatic Identification of Five Light Verbs Based on Mainland Mandarin Data

  20. Analysis of Outcome (ML) • Each light verb, with the exception of 搞 gao3, can be successfully identified with a better F-score than chance (0.2), although the performance varies from light verb to light verb: 加以 jia1yi3 > 從事 cong2shi4/做 zuo4 > 進行 jin4xing2 > 搞 gao3 • 加以 jia1yi3 is the only light verb with factors that effectively differentiate it from all other light verbs. All four of its significant factors are positive (i.e., direct evidence for its occurrence). • 從事 cong2shi4/做 zuo4: both have only one type of significant factor, and these are negative (i.e., indirect evidence). • 搞 gao3 and 進行 jin4xing2 have both positive and negative factors, which may have cancelled each other out. The significance of their factors is also relatively weak. • Note that the low F-score of 搞 gao3 is consistent with the linguistic observation that it is rarely used as an LV in ML.
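For readers unfamiliar with the metric, the per-verb F-score compared against chance above can be computed from actual and predicted labels as follows. This is a generic sketch that assumes the balanced F1 measure (harmonic mean of precision and recall); the label lists are hypothetical, not the study's predictions.

```python
from typing import Dict, List


def f_scores(actual: List[str], predicted: List[str]) -> Dict[str, float]:
    """Per-label F1 = 2*TP / (2*TP + FP + FN), computed one label against the rest."""
    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        tp = sum(a == label and p == label for a, p in zip(actual, predicted))
        fp = sum(a != label and p == label for a, p in zip(actual, predicted))
        fn = sum(a == label and p != label for a, p in zip(actual, predicted))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical gold labels and model predictions.
actual = ["jiayi", "jiayi", "gao", "gao", "zuo", "zuo"]
predicted = ["jiayi", "jiayi", "zuo", "gao", "zuo", "jiayi"]

scores = f_scores(actual, predicted)
for verb, score in scores.items():
    print(f"{verb:6s} F1 = {score:.2f}")
```

The chance level of 0.2 on the slide corresponds to five equally frequent light verbs (200 sentences each), where picking a verb at random is right one time in five.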

  21. F-score of Automatic Identification of Five Light Verbs Based on Taiwan Mandarin Data

  22. Analysis of Outcome (TW) • Each light verb can be successfully identified with a better F-score than chance (0.2), but the performance varies from light verb to light verb: 搞 gao3/加以 jia1yi3 > 進行 jin4xing2/從事 cong2shi4 > 做 zuo4 • 搞 gao3/加以 jia1yi3 each have only positive significant factors (i.e., direct evidence for their occurrence). • 從事 cong2shi4 has only negative significant factors (i.e., indirect evidence). 進行 jin4xing2 has more positive than negative significant factors. • 做 zuo4 has both types of significant factors, but negative ones outnumber positive ones. • Linguistically,

  23. Comparison of Mainland and Taiwan light verbs: univariate analysis • Key results: ML and TW 做 zuo4 show opposite usage tendencies for the feature ARGSTR.two • ML and TW 進行 jin4xing2 show opposite usage tendencies for the features ASP.le and ASP.no • However, the difference is between a significant and a non-significant feature, rather than between a significantly positive and a significantly negative feature

  24. Probability estimates of Mainland and Taiwan light verbs by Polytomous • In both ML and TW, the overall performance of the model is good: • the most frequently predicted light verb in each column corresponds to the light verb that actually occurs in the data (see the red figures) • The results also show that while one light verb has the highest probability in a particular context (a set of factors), other light verbs may also have a chance to occur. This is why, empirically, more than one light verb can occur in the same context.

  25. Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression

  26. Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression • Both have similar, non-contradictory distributional patterns. They differ only in that TW is less likely to take formal events as arguments (FOREVTyes). This is consistent with the intuition that jinxing will be preferred in this context in TW.

  27. Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression • Both have similar, non-contradictory distributional patterns. Both ML and TW 搞 gao3 are significantly favored by ... ML 搞 gao3 is less likely to occur with accomplishment objects. This, and the fact that it is unlikely to occur with the aggregate of default variable values, suggests that it is unlikely to be used as a light verb in ML.

  28. Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression • Both have similar, non-contradictory distributional patterns. ML 加以 jia1yi3 is more likely to occur with two arguments (ARGSTRtwo) and to take VO compounds or psychological events as objects (VOCOMPyes and PSYEVTyes). This confirms the intuition that it is more frequently used in ML.

  29. Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression • Both have similar, non-contradictory distributional patterns. • ML 進行 jin4xing2 is not likely to take accomplishment objects (ACCOMPEVTyes), while TW 進行 jin4xing2 is very likely to take VO compound objects (VOCOMPyes), consistent with Huang et al. (2013)

  30. Comparison of Mainland and Taiwan light verbs in multivariate polytomous regression • Both have similar, non-contradictory distributional patterns. Their distributional patterns are consistent with the analysis of 做 zuo4 as the most bleached of the Mandarin light verbs. (The attachment of the perfective aspect marker -le is known to be a grammatical potential shared by all light verbs.)

  31. Conclusion • This study compares the usage tendencies of Chinese light verbs • (1) Among five different light verbs • (2) Between Mainland and Taiwan Mandarin usage of the same light verb • The comparable-corpus-driven statistical analysis is able to generalize about the similarities and differences among light verbs with different factors • The contrast between different light verbs in a pair can be anchored by statistically significant positive vs. statistically significant negative factors • The difference between the two Chinese varieties for the same light verb, however, is between statistically significant and non-significant factors • The above results allow us to hypothesize that • Different light verbs, even with their weak selectional features, can be identified and differentiated by contrasting distributional tendencies • Variants of the same language, however, do not show contrasting tendencies but can be differentiated by the existence (i.e., significant vs. non-significant) of some distributional tendencies

  32. References • Arppe, A. (2008) Univariate, bivariate and multivariate methods in corpus-based lexicography – a study of synonymy. Publications of the Department of General Linguistics, University of Helsinki, No. 44. URN: http://urn.fi/URN:ISBN:978-952-10-5175-3. • Arppe, A. (2009) Linguistic choices vs. probabilities – how much and what can linguistic theory explain? In: Featherston, S. & S. Winkler (eds.) The Fruits of Empirical Linguistics. Volume 1: Process. Berlin: de Gruyter, pp. 1–24. • Arppe, A. (in prep.) Solutions for fixed and mixed effects modeling of polytomous outcome settings. • Han, Weifeng, Arppe, Antti & Newman, John (2013). Topic marking in a Shanghainese corpus: from observation to prediction. Corpus Linguistics and Linguistic Theory (preprint). • Butt, M., & Geuder, W. (2001). On the (semi) lexical status of light verbs. Semi-lexical Categories, 323-370. • Cattell, R. (1984). Composite Predicates in English. Syntax and Semantics Volume 17. Sydney: Academic Press Australia. • Cai, Wenlan. (1982). Issues on the Complement of ‘jinxing’ (“進行”帶賓問題). Chinese Language Learning (漢語學習) (3), 7-11.

  33. References • Huang, Chu-Ren and Jingxia Lin. (2013). The ordering of Mandarin Chinese light verbs. Proceedings of the 13th Chinese Lexical Semantics Workshop. D. Ji and G. Xiao (Eds.): CLSW 2012, LNAI 7717, pp. 728-735. Heidelberg: Springer. • Huang Chu-Ren, Jingxia Lin, and Huarui Zhang (2013). World Chineses based on comparable corpus: The case of grammatical variations of jinxing. 《澳门语言文化研究》, 397-414. • Jespersen, O. (1965). A Modern English Grammar on Historical Principles. Part VI, Morphology. London: George Allen and Unwin Ltd. • Zhou, Gang. (1987a). Subdivision of Dummy Verbs (形式動詞的次分類). Chinese Language Learning (漢語學習), 1, 11-14. • Zhou, Xiaobing. (1987b). Sentence Pattern Comparison of ‘jinxing’ and ‘jiayi’ (“進行”“加以”句型比較). Chinese Language Learning (漢語學習), 6, 1-5. • Zhu, Dexi. (1985). Dummy Verbs and NV in Modern Chinese (現代書面漢語里的虛化動詞和名動詞). Journal of Peking University (Humanities and Social Sciences) (北京大學學報(哲學社會科學版)), 5, 1-6.

  34. Thank you