Corpora in Linguistic Research • Nanjing University • Li Changsheng (李长生) • Tel: 025-8443-6787 • Email: csli@jlonline.com
Order of Presentation • I. Corpus Research versus Linguistic Research • II. Influential Corpora • III. Corpus Analysis • IV. More on Statistical Analysis • V. Q and maybe A (anytime during presentation)
I. Corpus Research versus Linguistic Research • Corpus Research=Linguistic Research • Language (features) • Learner language (features)
I. Corpus Research versus Linguistic Research • Corpus Research≠Linguistic Research • (Large,) representative authentic data
II. Influential Corpora • Native-speaker corpora • Learner corpora
Collins Corpus/Bank of English • A 2.5-billion word analytical database of English. • Contains written material from websites, newspapers, magazines and books published around the world, and spoken material from radio, TV and everyday conversations. • New data is fed into the corpus every month, to help the Collins dictionary editors identify new words and meanings from the moment they are first used. • Bank of English: part of the Collins Corpus. • Contains 650 million words from a carefully chosen selection of sources, to give a balanced and accurate reflection of English as it is used every day.
British National Corpus • Contains approximately 100 million words of written texts (90%) and transcripts of speech (10%) in modern British English. • Can be accessed online remotely using the BNC Online service.
American National Corpus • Contains 11.5 million words of written and spoken American English data (8.3 million words for writing and 3.2 million words for speech)
Longman/Lancaster Corpus • Contains about 30 million words of published English. • British data takes up 50% and American data 40% while the other 10% represents other varieties such as Australian, African and Irish English.
International Corpus of Learner English • Contains argumentative essays written by advanced learners of English, i.e. university students of English as a foreign language (EFL) in their 3rd or 4th year of study. • Contains over 2.5 million words in 3,640 texts, each 500 to 1,000 words long, written by EFL learners from 11 mother tongue backgrounds: Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish, and Swedish.
CLEC • Contains one million words from writing produced by Chinese learners of English from five proficiency levels: middle school students, junior and senior non-English majors, and junior and senior English majors. • Annotated with learner errors using an annotation scheme which consists of 61 error types clustered in 11 categories.
SWECCL • Contains about 2 million words in total of spoken and written English produced by Chinese university English majors
LSECCL • Year 1 • Recording 1 • Task 1 - Reading aloud • Task 2 - Monologue - The Most Unforgettable Birthday • Task 3 - Dialogue - Holiday plan • Recording 2 • Task 1 - Retelling • Task 2 - Monologue - Whether it is appropriate for college students to rent apartments outside the campus and live there • Task 3 - Dialogue - Whether exams should be abolished
LSECCL • Year 2 • Recording 1 • Task 1 - Reading aloud • Task 2 - Monologue - Describe one of your persons you admire most • Task 3 - Dialogue - What gift to buy for a friend - Lily • Recording 2 • Task 1 - Retelling • Task 2 - Monologue - Make critical comments on the use of electronic dictionaries among college students • Task 3 - Dialogue - Whether it is a good practice or not to keep one’s own computer in dorm
LSECCL • Year 3 • Recording 1 • Task 1 - Reading aloud • Task 2 - Monologue - Describe one of your experiences when you had a great ambition to do something • Task 3 - Dialogue - Talk about ways of relaxation after a month-long preparation for an exam • Recording 2 • Task 1 - Retelling • Task 2 - Monologue - Do you think it is appropriate for college students to get married • Task 3 - Dialogue - Talk about the necessity of having certificates
LSECCL • Year 4 • Recording 1 • Task 1 - Reading aloud • Task 2 - Monologue - The Most Unforgettable Birthday • Task 3 - Dialogue - Holiday plan • Recording 2 • Task 1 - Retelling • Task 2 - Monologue - Whether it is appropriate for college students to rent apartments outside the campus and live there • Task 3 - Dialogue - Whether exams should be abolished
III. Corpus Analysis • (Tagging corpus data) • Calculating frequencies and frequency differences • Frequencies of occurrence • Frequencies of co-occurrence • Frequency differences across registers/corpora/periods of time • (Transferring frequencies) • Statistical analysis
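To make "frequencies of occurrence" and "frequency differences across corpora" concrete, here is a minimal Python sketch. The crude tokenizer, the sample sentence, and the per-million normalization are illustrative assumptions, not steps taken from the slides; normalization is shown only because raw counts from corpora of different sizes are not directly comparable.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lower-cased word tokens (a deliberately crude tokenizer)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens), len(tokens)

def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return count / corpus_size * 1_000_000

# Hypothetical mini "corpus" used only for illustration.
sample = "The cat sat on the mat. The dog sat too."
freqs, size = word_frequencies(sample)
print(freqs["the"], size)
print(per_million(freqs["the"], size))
```

With real corpora, the same `per_million` call would be applied to each corpus separately before the figures are compared.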
Lexis • Reference word list of the College English Curriculum Requirements (《大学英语课程教学要求》, 2007)
Lexis • headwords
Lexis • meanings: deal (Biber et al., 1998)
Lexis • synonyms: utterly, perfectly
Lexis • synonyms: big, large, great (Biber et al., 1998)
Lexis • collocations: system
Lexis • chunks (Qi, 2006) • Step 1: Run WordList • Step 2: Select the corpus • Step 3: Build the index • Step 4: Click Compute, then Clusters
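The cluster (chunk) extraction described on this slide is done in WordList. As a rough illustration of what such a cluster count involves, the sketch below counts recurring n-word sequences in Python. It is an assumption-laden stand-in, not WordList's actual algorithm; the function name `clusters`, the cut-off `min_freq`, and the sample tokens are all hypothetical.

```python
from collections import Counter

def clusters(tokens, n=3, min_freq=2):
    """Return n-word sequences that occur at least min_freq times."""
    grams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return [(" ".join(g), c) for g, c in grams.most_common() if c >= min_freq]

# Hypothetical learner utterance, already tokenized on whitespace.
tokens = "i think that i think that it is".split()
print(clusters(tokens))
```

Raising `n` or `min_freq` narrows the list to longer or more frequent chunks.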
Grammar • that-clause, to-clause (Biber et al., 1998) <V* that <CST> to <TO> * <V?I>/to <TO> * <R* * <V?I>/to <TO> * <R* R <* * <V?I>
Grammar • syntactic co-occurrences of try (McEnery and Wilson, 2001)
Learner Language • Frequency differences across corpora • Frequency differences across periods of time
Across Corpora ICLE L1 (NNS-NNS) SWECCL L1 (NNS-NS) BNC
Tagging Corpus Data • CLAWS • book → book_NN1 • Batch text replacement tool (超级批量文本替换) • book_NN1 → book <NN1>
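The conversion of CLAWS output such as book_NN1 into book <NN1> can also be scripted. The following Python sketch assumes whitespace-separated word_TAG tokens and splits each token at its last underscore; the function name and the sample sentence are hypothetical, and real CLAWS output may need extra handling.

```python
def underscore_to_angle(tagged_text):
    """Rewrite word_TAG tokens as 'word <TAG>' (split at the last underscore)."""
    out = []
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("_")
        # Tokens without an underscore are passed through unchanged.
        out.append(f"{word} <{tag}>" if sep and word else token)
    return " ".join(out)

print(underscore_to_angle("book_NN1"))
print(underscore_to_angle("The_AT book_NN1 is_VBZ here_RL"))
```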
Calculating Frequencies and Frequency Differences • passive voice (be done) (Li, 2007a) • * <VB* * <V?N>
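The search pattern * <VB* * <V?N> looks for any word tagged as a form of be, followed directly by any word tagged as a past participle. A hedged Python equivalent is sketched below; it assumes the "word <TAG>" format shown earlier, treats ? as exactly one character, and uses a made-up sample sentence. Like the original pattern, it misses passives with intervening words (e.g. an adverb between be and the participle).

```python
import re

# A be-form tag (<VB...>) followed immediately by a participle tag (<V?N>).
PASSIVE = re.compile(r"\S+ <VB\w*> \S+ <V\wN>")

def count_passives(tagged_text):
    """Count adjacent be + past participle sequences in tagged text."""
    return len(PASSIVE.findall(tagged_text))

# Hypothetical tagged sentence for illustration.
sample = ("The <AT> book <NN1> was <VBDZ> written <VVN> and <CC> "
          "it <PPH1> is <VBZ> sold <VVN> here <RL>")
print(count_passives(sample))
```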
Statistical Analysis • Differences • Two or three corpora • 1. chi-square • Under Analyze, choose Descriptive Statistics, then Crosstabs. Move one variable into the Row(s) box and the other into the Column(s) box. Click Statistics, and check Chi-square. Click Cells, and check Expected. • 2. one-way chi-square • Under Analyze, choose Nonparametric Tests, then Chi-Square. Move the variable into the Test Variable List box. Click OK.
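For readers who want to see what the two menu procedures compute, here is a minimal Python sketch of the Pearson chi-square statistic for a contingency table and of the one-way (goodness-of-fit) version with equal expected frequencies. The counts are invented for illustration, the function names are hypothetical, and no continuity correction or p-value is calculated; this is a sketch of the arithmetic, not a replacement for the statistical package.

```python
def chi_square(table):
    """Pearson chi-square for a table of observed counts (rows x columns)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count = row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

def one_way_chi_square(observed):
    """Goodness-of-fit chi-square assuming equal expected frequencies."""
    e = sum(observed) / len(observed)
    return sum((o - e) ** 2 / e for o in observed), len(observed) - 1

# Hypothetical counts: [feature occurrences, all other words] per corpus.
table = [[30, 70], [10, 90]]
stat, df, expected = chi_square(table)
print(stat, df, expected)
print(one_way_chi_square([30, 10]))
```

With df = 1, a statistic above 3.84 corresponds to p < .05.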
Another Example • AWL (Li, 2007a) • +matchlist
Across Periods of Time LSECCL Grades (Year 1-Year 2-Year 3-Year 4)
Li (2007b) Title • 1) Key terms • 3) Noun phrase • 4) Word limit (<20) • 5) Capitalization
Abstract • Summary
Acknowledgments • Specific
Introduction • Motivation for the study, theoretical and practical significance of the study, overall structure
Literature Review • Key terms • Theoretical issues • Empirical studies • Unresolved issues
Literature Review • Bibliographies/Indices/Databases (ERIC, NJU, Google Scholar, corpus4u) • Papers (Chen, 2004) • Journals (Applied Linguistics, Language Learning) • Books (FLTRP)
Research Questions LSECCL Grades (Year 1-Year 2-Year 3-Year 4)
Tagging Corpus Data • Microsoft Word • I think → I think <sv> <ip> <cm> <0>
Calculating Frequencies and Frequency Differences • <sv>/<ap>/<dn> • <cm>
Transferring Frequencies • Microsoft Excel • =COUNTIF(N1:N5000,"D:\YEAR1\1-2-B02B.TXT")
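The COUNTIF formula counts how many rows in the file-name column belong to one text, i.e. the number of hits per file. A rough Python equivalent using `collections.Counter` is sketched below. The first file path is the one from the slide; the second path and the list contents are hypothetical.

```python
from collections import Counter

def hits_per_file(file_column):
    """Count how many concordance hits come from each source file."""
    return Counter(file_column)

# One entry per concordance line; only the first path appears in the slides.
file_column = [
    r"D:\YEAR1\1-2-B02B.TXT",
    r"D:\YEAR1\1-2-B02B.TXT",
    r"D:\YEAR1\HYPOTHETICAL.TXT",
]
counts = hits_per_file(file_column)
print(counts[r"D:\YEAR1\1-2-B02B.TXT"])
```

Unlike a single COUNTIF cell, the counter returns the tally for every file at once, and files with no hits simply report 0.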
Statistical Analysis • Changes in frequency differences • Data from three or more time points • Wilcoxon • Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move the variables into the Test Pair(s) List box.
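The Wilcoxon signed-rank procedure compares two related samples at a time, so data from several years would be tested pair by pair (for example Year 1 against Year 2). The Python sketch below only computes the sums of positive and negative ranks, from which statistical packages typically derive a Z score and p-value; the per-student frequencies are invented and the function name is hypothetical.

```python
def wilcoxon_signed_rank(x, y):
    """Return (sum of positive ranks, sum of negative ranks, n) for paired data."""
    # Zero differences are dropped, as in the standard procedure.
    diffs = [b - a for a, b in zip(x, y) if b != a]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        # Tied absolute differences share the mean of ranks i+1 .. j.
        avg = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = avg
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus, len(ordered)

# Hypothetical frequencies of one feature for five students in two years.
year1 = [3, 5, 2, 8, 4]
year2 = [6, 7, 2, 7, 9]
print(wilcoxon_signed_rank(year1, year2))
```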
Results and Discussion • Answers to the research questions, and reasons for the answers
Conclusion • Summary of the findings, theoretical and practical implications of the findings, and limitations of the study