Compiling and Analyzing Your Own Learner Corpus

Xiaofei Lu. CALPER 2012 Summer Workshop, July 16, 2012.


Presentation Transcript


  1. Compiling and Analyzing Your Own Learner Corpus • Xiaofei Lu • CALPER 2012 Summer Workshop • July 16, 2012

  2. Workshop outline • Opening discussion and corpora overview • Graphic Online Language Diagnostic (GOLD) overview • Sample GOLD (and related) projects • GOLD (or related tool) project lab • GOLD (or related tool) project discussions • Concluding discussion

  3. Opening discussion • Brief introduction of your professional/language background and teaching/research interests • Prior experience with corpus linguistics • Primary challenges you are dealing with • Primary purposes and goals for taking this workshop and for learning about corpus linguistics in general • Any other relevant information

  4. Corpora overview • What is a corpus • Types of corpora • Corpus design and compilation • Corpus annotation • Corpus querying and analysis • Learner corpora and L2 development • Resources

  5. What is a corpus? • Leech (1992): • an unexciting phenomenon, a helluva lot of text, stored on a computer • Sinclair (1991): • a collection of naturally-occurring language text, chosen to characterize a state or a variety of language • Sinclair (2004): • a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research

  6. Types of corpora • General-purpose vs. specialized corpora • British National Corpus & Russian National Corpus • Michigan Corpus of Academic Spoken English • Native vs. learner corpora • International Corpus of Learner English • Spanish Learner Language Oral Corpora • Monolingual vs. parallel & comparable corpora • The JRC-Acquis Multilingual Parallel Corpus • The English-Chinese Parallel Concordancer

  7. Types of corpora (cont.) • Corpora representing one or diverse varieties • International Corpus of English • Synchronic vs. diachronic corpora • The Corpus of Historical American English • Spoken vs. written corpora • Michigan Corpus of Upper-Level Student Papers

  8. Corpus design • Purpose and type of corpus • Spoken/written; cross-sectional/longitudinal • External criteria for content selection • Communicative function of a text • Mode, medium, interaction, domain, topic • Representativeness, balance, size, sampling • Design of the BNC

  9. Corpus design (cont.) • Encoding meaningful metadata information • Learner: L1, gender, program level, discipline … • Sample: date, mode, task, genre, rating … • Facilitates contrastive and longitudinal studies • MICASE speaker and transcript attributes • Corpus markup: The ICE example
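
The point about metadata facilitating contrastive and longitudinal studies can be sketched in a few lines of Python. This is only an illustration: the `Sample` class, its field names, the `select` helper, and the three records are all invented here and are not the schema of MICASE, ICE, or any other corpus named in the workshop.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One learner text plus the metadata recorded for it (hypothetical schema)."""
    text: str
    l1: str             # learner's first language
    program_level: int  # e.g. semester of study
    date: str           # ISO date, so string order is chronological order
    genre: str

def select(samples, **conditions):
    """Return the samples whose metadata equal every given condition."""
    return [s for s in samples
            if all(getattr(s, field) == value
                   for field, value in conditions.items())]

# Invented records for illustration only.
samples = [
    Sample("First essay ...", "Chinese", 1, "2012-02-01", "narrative"),
    Sample("Second essay ...", "Chinese", 2, "2012-09-01", "argumentative"),
    Sample("Another essay ...", "Korean", 1, "2012-02-01", "narrative"),
]

# Contrastive view: same level and genre, different L1 backgrounds.
level_one_narratives = select(samples, program_level=1, genre="narrative")

# Longitudinal view: one L1 group ordered by collection date.
chinese_over_time = sorted(select(samples, l1="Chinese"), key=lambda s: s.date)
```

Without the metadata fields, neither subset could be pulled out of the collection, which is why they are worth recording at compilation time.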

  10. Corpus annotation • Why annotate • Levels of corpus annotation • Difficulties for corpus annotation • Standards and encoding

  11. Why annotate • Raw text vs. annotated text: How do you… • Count the number of words in a Chinese text? • Calculate the lexical density of an English text? • Count the frequency of can as a modal verb? • Know how many T-units in a text are complex? • Extract all imperative sentences from a text? • Know whether a syntactic structure is used in a text?
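
One of the questions above, counting "can" as a modal verb, shows the gap between raw and annotated text. The sketch below assumes a hand-tagged toy sentence using Penn Treebank tags (MD for modal, NN for noun); the sentence and the two helper functions are made up for illustration.

```python
# Hand-tagged toy sentence (Penn Treebank tags: MD = modal, NN = singular noun).
tagged = [("You", "PRP"), ("can", "MD"), ("recycle", "VB"),
          ("the", "DT"), ("can", "NN"), (".", ".")]

def count_word(tokens, word):
    """Count a word form regardless of its part of speech (all raw text allows)."""
    return sum(1 for form, _ in tokens if form.lower() == word)

def count_word_with_tag(tokens, word, tag):
    """Count a word form only when it carries the given POS tag."""
    return sum(1 for form, pos in tokens if form.lower() == word and pos == tag)

all_can = count_word(tagged, "can")                   # modal and noun together
modal_can = count_word_with_tag(tagged, "can", "MD")  # modal uses only
```

The raw count lumps the modal and the noun together; only the tagged version can separate them.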

  12. Levels of corpus annotation • Sentence and word segmentation • Part-of-speech (POS) tagging and lemmatization • Syntactic parsing • Semantic, pragmatic, and discourse annotation • Learner corpora: error annotation • Project-specific annotation

  13. Sentence and word segmentation • Why is this non-trivial? • I went to the shops in Jones St. Saturday afternoon with Mr. Smith. • I can’t remember whether it’s a second- or third-grade book. • 克林顿在讲话中指出: Clinton pointed out in his speech (that…) • 克林顿 在 讲话 中 指出: Clinton at speech middle point-out • 克林顿 在 讲话 中指 出: Clinton at speech middle-finger out
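
A minimal Python sketch of why the English example is non-trivial: splitting at every period breaks the first sentence at "St." and "Mr.". The abbreviation list and both functions are assumptions made for this illustration, not the method of any tool shown in the workshop.

```python
import re

# Illustrative abbreviation list (an assumption, far from complete).
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "St."}

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def abbreviation_aware_split(text):
    """Like naive_split, but rejoin pieces that end in a known abbreviation."""
    sentences, buffer = [], ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}".strip()
        if buffer.split()[-1] not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

text = ("I went to the shops in Jones St. Saturday afternoon with Mr. Smith. "
        "I can't remember whether it's a second- or third-grade book.")

print(len(naive_split(text)))               # pieces from the naive splitter
print(len(abbreviation_aware_split(text)))  # sentences after rejoining
```

The abbreviation-aware version recovers the two intended sentences here, but it would wrongly merge sentences whenever an abbreviation really does end one, which is part of why segmentation is usually left to trained tools.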

  14. POS tagging • The what and why • What are the difficulties? • Ambiguity: 48% of tokens in the Brown Corpus • Unknown words: neologisms • Tagsets: overspecification vs. underspecification • Penn Treebank Tagset vs. CLAWS7 Tagset
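
To make the ambiguity figure concrete, here is a hedged sketch of how such a rate could be computed from a tagged corpus. It assumes that "ambiguous" means a word form that occurs with more than one tag in the same corpus; the slide does not say how its Brown Corpus figure was derived, and the tagged sentence below is invented.

```python
from collections import defaultdict

def ambiguity_rates(tagged_tokens):
    """Return (share of ambiguous word types, share of ambiguous tokens)."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    # A word type counts as ambiguous if it was seen with more than one tag.
    ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_by_word)
    token_rate = (sum(1 for w, _ in tagged_tokens if w.lower() in ambiguous)
                  / len(tagged_tokens))
    return type_rate, token_rate

# Invented, hand-tagged example (Penn Treebank tags).
tagged = [("The", "DT"), ("present", "NN"), ("is", "VBZ"), ("a", "DT"),
          ("present", "NN"), ("you", "PRP"), ("can", "MD"), ("present", "VB")]

type_rate, token_rate = ambiguity_rates(tagged)
```

Because ambiguous words tend to be frequent, the token rate is typically much higher than the type rate, as it is in this toy example.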

  15. Lemmatization • Counting linguistic items • Types – number of different words • Tokens – number of words • What constitutes a different word type? • go, went, gone, goes, going? • differ, difference, different, differently? • can as a noun, verb, and modal verb?
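
The effect of lemmatization on type counts can be shown with a toy example. The lemma lookup table below is hypothetical and covers only the forms of "go" from the slide; a real lemmatizer would cover the whole lexicon and use POS information.

```python
# Toy lemma lookup (hypothetical; covers only the forms of "go" from the slide).
LEMMAS = {"went": "go", "gone": "go", "goes": "go", "going": "go"}

def count_types_and_tokens(words, lemmatize=False):
    """Return (number of types, number of tokens) for a list of word forms."""
    forms = [w.lower() for w in words]
    if lemmatize:
        forms = [LEMMAS.get(w, w) for w in forms]
    return len(set(forms)), len(forms)

words = "She goes where he went and they go where we have gone".split()

surface_counts = count_types_and_tokens(words)                 # (11, 12)
lemma_counts = count_types_and_tokens(words, lemmatize=True)   # (8, 12)
```

The token count stays the same, but the type count drops once goes, went, go, and gone are treated as one word, so any type-based measure depends on this decision.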

  16. Demos and tools: Part 1 • Xerox morphological analyzer (demo only) • ICTCLAS for Chinese segmentation and POS tagging • Querying POS-tagged corpora and Stanford POS tagger for English • TreeTagger for multiple languages

  17. Chunking and parsing • Partial/full structural analysis of each sentence • My dog likes eating sausage. • (ROOT (S (NP (PRP$ My) (NN dog)) (VP (VBZ likes) (S (VP (VBG eating) (NP (NN sausage))))) (. .)))
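
A bracketed parse like the one above is easy to read into a program. The sketch below is a small hand-written reader, not the Stanford parser or Tregex; the function names `read_tree`, `leaves`, and `phrases` are invented for this example.

```python
import re

def read_tree(bracketed):
    """Parse a Penn Treebank-style bracketed string into nested lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                          # skip "("
        node = [tokens[pos]]              # node label, e.g. "NP"
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())      # a nested constituent
            else:
                node.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1                          # skip ")"
        return node

    return parse()

def leaves(node):
    """Return the words dominated by a node, left to right."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words

def phrases(node, label):
    """Return the word sequences of all subtrees with the given label."""
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        if isinstance(child, list):
            found.extend(phrases(child, label))
    return found

tree = read_tree("(ROOT (S (NP (PRP$ My) (NN dog)) (VP (VBZ likes) "
                 "(S (VP (VBG eating) (NP (NN sausage))))) (. .)))")
noun_phrases = phrases(tree, "NP")   # ["My dog", "sausage"]
```

Once the tree is a data structure, questions such as "which noun phrases occur in this text" become simple traversals.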

  18. Chunking and parsing (cont’d) • What is it useful for? • Retrieving examples of grammatical patterns • Grammar checking, syntactic complexity analysis • NLP applications that require syntactic analysis • Difficulties • Ungrammatical sentences • Ambiguities, e.g., PP attachment • Errors from preprocessing steps

  19. Semantic and discourse analysis • Semantic and discourse features • Word sense disambiguation • Propositional idea density • Coherence and cohesion

  20. Annotation standards and encoding • Useful standards • Separable, linguistically consensual • Documentation, compatibility with existing standards • Encoding • Simple encoding: present_JJ • XML-style: <w type="JJ">present</w> • Format varies, depending on level of annotation • Manual, computer-aided, and automatic annotation • Efficiency, scale, reliability • UAM CorpusTool
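
The two encodings on this slide carry the same information, so one can be converted into the other mechanically. The sketch below assumes every token has the form word_TAG with the tag after the last underscore; the function name `to_xml` is made up for the illustration.

```python
from xml.sax.saxutils import escape, quoteattr

def to_xml(tagged_text):
    """Convert 'word_TAG word_TAG ...' into one <w> element per token.

    Assumes every token contains an underscore followed by its tag.
    """
    elements = []
    for item in tagged_text.split():
        word, _, tag = item.rpartition("_")   # split on the last underscore
        # escape() and quoteattr() keep characters such as & and < well-formed.
        elements.append(f"<w type={quoteattr(tag)}>{escape(word)}</w>")
    return "\n".join(elements)

print(to_xml("a_DT present_JJ danger_NN"))
```

The XML form is more verbose but leaves room for further attributes, such as a lemma or an error code, without changing the format.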

  21. Demos and tools: Part 2 • Stanford parser for Arabic, Chinese and English • Word sense disambiguation demo • Computerized Propositional Idea Density Rater • Coh-Metrix for text coherence analysis • CHILDES and CLAN • Computerized Profiling • WMatrix

  22. Corpus querying and analysis • Manual analysis? • Corpus-specific online interfaces • Raw: MICASE and MICUSP • POS-tagged: Corpora @ BYU • Grammatically and semantically tagged: RNC • General-purpose online interfaces: GOLD • Windows-based querying/concordancing tools • WordSmith Tools & AntConc

  23. Corpus querying and analysis • Natural language processing tools • Good for processing annotated corpora • Extracting occurrences of grammatical patterns • Examples: Stanford parser and Tregex

  24. Resources • Books and journals • Hunston (2002): Corpora in Applied Linguistics • McEnery et al. (2006): Corpus-Based Language Studies • International Journal of Corpus Linguistics • Corpus Linguistics and Linguistic Theory • Corpora • Websites and mailing lists • Bookmarks for corpus-based linguists • Linguistic Data Consortium • The corpora list; corpus in delicious • Stanford Natural Language Processing Group

  25. Discussion • What kind of corpus do you intend to compile and/or use? For what purpose? • What are the design issues? • How do you intend to format, organize and store your files? • Do you intend to annotate your corpus in some way? How? • How do you intend to search/query your corpus?

  26. Learner corpora and L2 development • Samples from the same students at different times • Did (targeted) language development take place? • Was a particular pedagogical intervention effective? • Samples from different students • In what areas do students show different levels of development? • What factors affect students’ language development?

  27. Graphic Online Language Diagnostic • A free online tool for teachers to assess their students’ language development • Developed at CALPER, Penn State, funded by DOE • Project co-directors: Xiaofei Lu and Michael McCarthy • Teachers can use GOLD to • Compile, upload, and manage their own corpora • Share corpora with each other • Search and analyze corpora • Demonstration

  28. Corpus compilation • A user can compile a corpus by • Directly compiling and uploading an XML file • Using the easy-to-use guided XML creation interface • An uploaded corpus can be easily managed • Documents can be added or deleted • The whole corpus can be deleted • Content and metadata of individual documents can be easily accessed

  29. Corpus sharing • GOLD facilitates easy data sharing • A corpus may be set to be private, shared, or public • The corpus owner may give other users the right to view, add, edit, or delete corpora • Demonstration

  30. Basic corpus information • Word count • Alphabetic or numeric order • Can be downloaded as a text file • Corpus and document statistics • Mean sentence length • Mean word length • Type-token ratio • Demonstration
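
The statistics listed on this slide are simple to compute by hand. The sketch below is not GOLD's implementation, which the slides do not describe; it assumes words are runs of letters or digits and sentences end at '.', '!' or '?', which works for this English toy text but not for unsegmented languages such as Chinese.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-cased word tokens (assumption: a word is a run of word characters)."""
    return re.findall(r"\w+", text.lower())

def word_list(text, by="frequency"):
    """Word counts in alphabetic order or by descending frequency."""
    counts = Counter(tokenize(text))
    if by == "alphabetic":
        return sorted(counts.items())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def corpus_statistics(text):
    """Word count, mean sentence length, mean word length, type-token ratio."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    words = tokenize(text)
    return {
        "word_count": len(words),
        "mean_sentence_length": len(words) / len(sentences),
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }

text = "The cat sat. The cat ran away."
stats = corpus_statistics(text)
```

Type-token ratio falls as texts get longer, so it is best compared across samples of similar length.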

  31. Corpus search • Select one or more corpora to search • Specify key words or phrases • May use the wildcard character, e.g. book* • Specify contexts • Size of context window • Context words and their positions • Specify metadata conditions
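
A wildcard search with a context window, of the kind described on this slide, might look like the following in Python. This is a sketch under stated assumptions, not GOLD's search engine: the `kwic` function is invented here, the text is pre-tokenized on whitespace, and the wildcard is handled by the standard-library `fnmatch` module.

```python
import fnmatch

def kwic(tokens, pattern, window=3):
    """Return (left context, keyword, right context) for each match.

    `pattern` may use the * wildcard, e.g. "book*".
    """
    hits = []
    for i, token in enumerate(tokens):
        if fnmatch.fnmatchcase(token.lower(), pattern.lower()):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

tokens = "I bought two books and booked a table at the bookstore".split()
hits = kwic(tokens, "book*", window=2)

# Sorting by keyword (or by left or right context) gives a sortable KWIC view.
for left, keyword, right in sorted(hits, key=lambda hit: hit[1].lower()):
    print(f"{left:>15}  {keyword:<10} {right}")
```

Aligning the keyword in a fixed column is what makes recurring patterns to its left and right easy to spot.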

  32. Corpus search results • Display of search results • Sortable KWIC display of search results • Sortable graphic display of search results • Demonstration

  33. Lexical bundle/collocation search • Procedure • Select one or more corpora to search • Specify search word • Specify contexts • Specify metadata conditions • Search results • Sortable list of n-grams found in selected corpora • Demonstration
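
The n-gram list produced by a lexical bundle search can be approximated as below. The function `ngrams_with` and the toy text are assumptions for illustration; the slides do not specify how GOLD counts or ranks n-grams.

```python
from collections import Counter

def ngrams_with(tokens, word, n=3):
    """Count the n-grams that contain `word`, most frequent first."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for i in range(len(tokens) - n + 1)
        if word in tokens[i:i + n]
    )
    return counts.most_common()

tokens = ("on the other hand it is on the other side "
          "and on the other hand it works").split()
bundles = ngrams_with(tokens, "other", n=3)

for ngram, frequency in bundles:
    print(frequency, " ".join(ngram))
```

Sequences that recur across many texts and writers, rather than within one text as in this toy example, are the ones usually treated as lexical bundles.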

  34. Summary of features • Difference from other online tools • Can create, share, and search multiple corpora • Can easily search subsets of data • Can work with any language • Summary of corpus analysis functions • Word list • Corpus and document statistics: mean sentence length, mean word length, type-token ratio • Corpus search and collocation search

  35. Sample questions to ask • With data from an individual student, one can either describe or track development in • Patterns of usage of words and phrases – frequency, underuse, overuse, etc. • Lexical and syntactic complexity • Appropriate usage of words and phrases in context • Patterns of usage of lexical bundles

  36. Sample questions to ask (cont.) • With data from different (groups of) students, one can compare similarities or differences among different (groups of) students in terms of • Patterns of usage of words and phrases – frequency, underuse, overuse, etc. • Lexical and syntactic complexity • Appropriate usage of words and phrases in context • Patterns of usage of lexical bundles

  37. Future enhancements • Corpora for benchmarking • Multilingual natural language processing • Suggestions on desirable functions welcome
