Probabilistic Tagging of a Corpus of Mennonite Low German: A Case Study Using Qtag. Christopher Cox, University of Alberta christopher.cox@ualberta.ca AACL - March 15, 2008
Introduction • Presentation considers the application of probabilistic part-of-speech (POS) tagging methods to minority language data • Methods applied profitably in large-scale corpus construction • Time, linguistic data, technical expertise, and financial resources often comparatively abundant in such projects • Perhaps less documented: challenges of applying similar techniques when resources are limited
Introduction • Both linguistic and technical-financial challenges in minority language corpus development: • Existing computational techniques not amenable to linguistic structure of given language • Lack of standardization (e.g. in POS tags, spelling) • Limited resources (cf. McEnery & Ostler 2000) • Important to understand which factors produce acceptable results and minimize investment of effort, given set goals and resources for tagging
Introduction • Presentation offers case study in probabilistic tagging of minority language data: • Applies Qtag to a small (~120,000-token) corpus of written Mennonite Low German (Plautdietsch) • Opportunity to consider problems faced in tagging minority language data in concrete detail • Chance to evaluate tagging procedure adopted, consider alternatives which may have produced results of comparable quality
“Qtag?” • Qtag: language-independent, “pure” probabilistic tagger designed by Oliver Mason • Freely available for non-commercial use • Well-documented Java API • Unicode support • Not alone among pure probabilistic taggers; arguably presents a reasonable point of departure into probabilistic tagging
“Plautdietsch?” • Plautdietsch: Mennonite Low German • Variety of Eastern Low German, once spoken near Gdansk, Poland • Approx. 400,000 speakers, predominantly descendants of Dutch-Russian Mennonites (Anabaptist Christians) • Sizeable Plautdietsch speech communities on four continents and in no fewer than a dozen countries (cf. Epp 1993: 103-4)
A Corpus of Plautdietsch • Corpus intended primarily for research into syntax of verbal complementation in Plautdietsch: • Adequate tagging for verbal-inflectional features important (e.g. tense, person, number) • Dialectal variation potentially relevant in analysis • Technical resources furnished largely by the Text Analysis Portal for Research (TAPoR) at University of Alberta; time expenditure should be minimized
Challenges • Plautdietsch poses challenges common in minority language corpus construction: • No single orthographic standard. Systems vary between authors and individual published works • No corpora published to date. No tagsets proposed; little consensus on POS classes • Dialectal variation. Substantial variation between and within national varieties.
Corpus Construction • Three-stage corpus construction procedure: • Spelling normalization. Created versions of all corpus source texts normalized according to a published orthographic standard (Epp 1996) • Tagset selection. Adapted a tagset proposed for Standard German (Münster Tagset for German, MT/D; Steiner 2003) to Plautdietsch
Example: Corpus Preparation
<?xml version="1.0" encoding="utf-8"?>
<document doc_id="1">
  <word wd_id="31">Goon</word>
  <word wd_id="32">dach</word>
  <word wd_id="32">,</word>
  <word wd_id="33">kompt</word>
  <word wd_id="34">ennen</word>
  <word wd_id="34">,</word>
  <word wd_id="35">sat</word>
  <word wd_id="36">junt</word>
  <word wd_id="37">dol</word>
  <word wd_id="37">.</word>
  . . .
</document>
Example: Corpus Preparation
<?xml version="1.0" encoding="utf-8"?>
<document doc_id="1">
  <word wd_id="31">Go’n</word>
  <word wd_id="32">Dag</word>
  <word wd_id="32">,</word>
  <word wd_id="33">komt</word>
  <word wd_id="34">’enenn</word>
  <word wd_id="34">,</word>
  <word wd_id="35">sat</word>
  <word wd_id="36">Junt</word>
  <word wd_id="37">dol</word>
  <word wd_id="37">.</word>
  . . .
</document>
Corpus Construction • Three-stage corpus construction procedure: • Tagging. Normalized texts then tagged gradually with the adopted tagset, in an iterative, interactive process:
Iterative Interactive Tagging [Diagram: document segmented into chunks, labelled n, n+1, n+2, n+3, n+4, n+5, ..., c] • Segment the document into c “chunks” of n tokens. • Manually assign tags to the first chunk. • Train Qtag on all correct tags and have it tag the next chunk. • Manually correct the tags assigned to the chunk just tagged, adding them to the training data. . . .
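The loop described above can be sketched in a few lines of Python. This is only an illustration of the procedure, not the code used for the corpus: Qtag is replaced here by a trivial most-frequent-tag lookup, manual correction is simulated by reading the tags from already-corrected data, and the names `train_unigram`, `iterative_tagging`, and `default_tag` are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram(tagged):
    """Build a most-frequent-tag lookup from (token, tag) pairs.

    Stand-in for training Qtag; a real tagger also uses tag context.
    """
    counts = defaultdict(Counter)
    for token, tag in tagged:
        counts[token][tag] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def iterative_tagging(gold, chunk_size, default_tag="UNK"):
    """Simulate iterative, interactive tagging over corrected data.

    gold is a list of (token, correct_tag) pairs. Returns the accuracy
    of the automatic tags for each chunk after the first.
    """
    chunks = [gold[i:i + chunk_size] for i in range(0, len(gold), chunk_size)]
    if not chunks:
        return []
    training = list(chunks[0])  # first chunk is tagged manually
    accuracies = []
    for chunk in chunks[1:]:
        model = train_unigram(training)  # retrain on all correct tags so far
        predicted = [model.get(token, default_tag) for token, _ in chunk]
        correct = sum(p == tag for p, (_, tag) in zip(predicted, chunk))
        accuracies.append(correct / len(chunk))
        training.extend(chunk)  # corrected chunk joins the training data
    return accuracies


# Toy data with made-up tokens and tags, for illustration only.
toy = [("a", "D"), ("b", "N"), ("a", "D"), ("c", "N"), ("a", "D"), ("b", "N")]
print(iterative_tagging(toy, chunk_size=2))
```

Each round retrains on everything corrected so far, so the per-chunk accuracies returned by the function trace the "rate of accuracy development" for a given chunk size.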
The Road(s) Not Taken • Iterative, interactive process successful, albeit time-consuming and labour-intensive: what could have been done to reduce the burden of corpus construction without lessening the quality of the resulting data? • Was it necessary to normalize spelling in advance? • Should greater numbers of tokens have been tagged at each stage? • Should the tagset have been less elaborate?
Simulating Iterative Tagging • Simulations of different models of iterative, interactive tagging conducted using the corrected data • Parameters of the tagging models simulated: • Normalization. Normalized, unnormalized data • Chunk size. 100, 200, 300, 400, 500, 750, 1000, 1500, 2000, 3000, 4000, 5000, 7500, 10000 tokens tagged per round • Tagset selection. 99 tags, 50 tags, 13 tags
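To give a sense of the size of this parameter space, the sketch below enumerates the settings listed above. It assumes that every normalization setting, chunk size, and tagset was crossed with every other, which the slides do not state explicitly. The names `configurations` and `rounds_needed` are hypothetical, and the corpus size of 120,000 tokens is the approximate figure given earlier.

```python
import math
from itertools import product

NORMALIZATION = ("normalized", "unnormalized")
CHUNK_SIZES = (100, 200, 300, 400, 500, 750, 1000, 1500,
               2000, 3000, 4000, 5000, 7500, 10000)
TAGSETS = (99, 50, 13)  # number of tags in each tagset

# Assumption: a full crossing of the three parameters.
configurations = list(product(NORMALIZATION, CHUNK_SIZES, TAGSETS))


def rounds_needed(corpus_size, chunk_size):
    """Number of chunks (tagging rounds) needed to cover the corpus."""
    return math.ceil(corpus_size / chunk_size)


print(len(configurations))
for size in (100, 1000, 10000):
    print(size, rounds_needed(120_000, size))
```

Under that assumption there are 2 x 14 x 3 = 84 models, and the number of train-and-correct rounds ranges from about 1,200 at the smallest chunk size down to 12 at the largest.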
Evaluating Tagging Models • Evaluation of each model by rate of accuracy development and estimated time requirement • Estimated time requirement as a function of time requirements for initial manual tagging and subsequent tag correction (at various error rates) of c chunks of n tokens using tagset t:
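The formula itself is not reproduced in the text. Purely as an illustration of the kind of function described, the sketch below charges a per-token cost for tagging the first chunk manually and, for every later chunk, a per-token review cost plus a per-error correction cost. The functional form, the parameter names, and all numeric values are assumptions, not the presentation's actual model; the dependence on tagset t would enter through the per-token costs, which would presumably be higher for larger tagsets.

```python
def estimated_hours(chunk_size, error_rates, manual_sec_per_token,
                    review_sec_per_token, fix_sec_per_error):
    """Hypothetical time estimate for one tagging model, in hours.

    chunk_size  -- n, tokens per chunk
    error_rates -- tagger error rate for each chunk after the first
                   (so there are len(error_rates) + 1 chunks in total)
    The three *_sec_* arguments are per-token or per-error costs in
    seconds and would depend on the tagset in use.
    """
    # First chunk: every token is tagged by hand.
    first = chunk_size * manual_sec_per_token
    # Later chunks: every token is reviewed, erroneous tags are fixed.
    later = sum(
        chunk_size * (review_sec_per_token + rate * fix_sec_per_error)
        for rate in error_rates
    )
    return (first + later) / 3600


# Made-up figures: three chunks of 1000 tokens, with error rates of
# 50% and 20% on the second and third chunks.
print(round(estimated_hours(1000, [0.5, 0.2], 10, 2, 10), 2))
```

A model of this shape makes the later trade-off visible: a large first chunk is paid for entirely at the manual rate, whereas later chunks become cheaper as the error rate falls.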
Evaluating Normalization • Does orthographic normalization matter, either for the rate of accuracy development or estimated overall time expenditure? • Holding tag set and chunk size constant, compare simulations of tagging normalized and unnormalized data:
Evaluating Normalization • Does orthographic normalization matter, either for the rate of accuracy development or estimated overall time requirement? • Rate of accuracy development: on average 20% lower for unnormalized data over all tagsets • Estimated time requirement: for unnormalized data, on average 26 hours longer for POS-99, 15 hours longer for POS-50, and 11 hours longer for POS-13
Evaluating Training Data • Does chunk size matter, either for the rate of development of accuracy or estimated overall time expenditure? • Holding tagsets and normalization constant, compare simulations of tagging for different chunk sizes:
Evaluating Training Data • Does chunk size matter, either for the rate of development of accuracy or estimated overall time expenditure? • Rate of accuracy development: no substantial difference in accuracy development for chunk sizes <= 5000 • Estimated time requirement: considerable differences, with smaller chunk sizes (< 2000) taking less time • Minimize time required to tag first chunk manually without the aid of automatically-assigned tags
Evaluating Tagsets • Does tagset detail matter, either to the rate of accuracy development or estimated overall time requirement? • Holding chunk size and normalization constant, compare all three tagsets:
Evaluating Tagsets • Does tagset detail matter, either for the rate of development of accuracy or estimated overall time requirement? • Rate of accuracy development: average 15% increase of mean accuracy for minimal tagset over full tagset, regardless of normalization • Estimated time requirement: time requirement for full tagset (80.5 hours) more than double that of minimal tagset (36.5 hours)
Evaluation: Summary • In the present case, the following guidelines would appear relevant to ‘successful’ tagging: • Normalization: Accuracy gains (here, 20%) may be substantial; however, gains must be weighed against cost of normalization itself • Chunk size: Favour smaller chunk sizes; choose tag correction over manual tag assignment • Tagset: Minimize tagset complexity (wherever corpus goals permit)
Evaluation and Planning • Determining interaction of all such factors in their relation to accuracy likely impossible during corpus planning • Nevertheless, planning and evaluation might profitably enter into corpus construction: • Consideration of general guidelines, such as those proposed in this case study, during corpus design • Periodic evaluation as additional part of iterative tagging process
Tagging and Minority Language Data • Such suggestions must be measured against requirements, resources, and stated goals of the corpus project: • In present case, detailed verbal coding needed; cost of tagset mitigated through normalization • Sociolinguistic situation may require preservation (in some form) of original orthographies or other “distinctive” features of source data
Tagging and Minority Language Data • Selection of pure probabilistic methods over others in part determined by typological features and available sources of data: • Highly fusional or polysynthetic languages may benefit from morphological parsing, rather than probabilistic POS assignment alone; • Integration of tagged documents with other linguistic data (e.g. dictionaries, word lists) may encourage use of hybrid tools permitting concurrent lemmatization
Conclusion • Computer-assisted part-of-speech assignment a complex problem, one profitably viewed in the larger context of minority language corpus construction: • Computational methods, probabilistic or otherwise, of clear importance, but not sole object of inquiry • Rather, consideration required of resources, requirements, and (socio-)linguistic conditions which bear upon minority language corpus construction as a whole
Conclusion • Case studies of minority language corpus design might contribute to an understanding of such problems in context: • Present direction for further quantitative study of corpus and tagset design • Offer assessment of the challenges facing corpus-based language documentation, providing guidelines from which similar projects might benefit
Acknowledgements • Text Analysis Portal for Research (TAPoR), University of Alberta • Social Sciences and Humanities Research Council of Canada (SSHRC) • Members of the Department of Linguistics, University of Alberta • Oliver Mason (for Qtag)
References • Epp, Reuben. 1993. The History of Low German and Plautdietsch: Tracing a language across the globe. Hillsboro, Kansas: The Reader’s Press. • Epp, Reuben. 1996. The Spelling of Low German and Plautdietsch. Hillsboro, Kansas: The Reader’s Press. • McEnery, Tony and Nick Ostler. 2000. A New Agenda for Corpus Linguistics - Working with all of the World’s Languages. Literary and Linguistic Computing 15.403-49.
References • Tufis, Dan and Oliver Mason. 1998. “Tagging Romanian Texts: a Case Study for QTAG, a Language Independent Probabilistic Tagger.” Proceedings of the First International Conference on Language Resources & Evaluation (LREC), Granada (Spain), 28-30 May 1998, 589-596. • Steiner, Petra. 2003. Das revidierte Münsteraner Tagset Deutsch (MT/D). Beschreibung, Anwendung, Beispiele und Problemfälle [The revised Münster Tagset for German (MT/D). Description, Application, Examples and Problematic Cases]. Online: http://xlex.uni-muenster.de/Portal/MTPD/tagsetDescriptionDE.ps
Qtag Algorithm • Read in the next token. • Retrieve all tags observed for this token (if none are available, guess possible tags). • For each possible tag: • Calculate Pw = P(tag | token), the probability that the token has this tag • Calculate Pc = P(tag | t1, t2), the probability that the tag follows t1, t2 • Calculate Pw,c = Pw * Pc • Repeat this calculation for the other two tags in the window (except with Pc = P(t1 precedes t2, tag), Pc = P(t2 between t1, tag))
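The first part of this calculation (scoring each candidate tag for the incoming token) can be sketched as follows. This is a simplified illustration under stated assumptions, not Qtag's implementation: probabilities are plain relative frequencies with no smoothing, the guessing step for unknown tokens and the re-scoring of the other two window positions are omitted, and the names `train`, `score_tags`, and the `<s>` padding tag are hypothetical.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Collect counts from sentences of (token, tag) pairs."""
    lexical = defaultdict(Counter)  # token -> counts of its tags
    context = defaultdict(Counter)  # (t1, t2) -> counts of the next tag
    for sentence in sentences:
        # Pad with two start markers so every token has two predecessors.
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
        for i, (token, tag) in enumerate(sentence):
            lexical[token][tag] += 1
            context[(tags[i], tags[i + 1])][tag] += 1
    return lexical, context


def score_tags(token, t1, t2, lexical, context):
    """Return {tag: Pw * Pc} for every tag observed with this token."""
    seen = lexical.get(token)
    if not seen:
        return {}  # unknown token: the guessing step is not sketched here
    token_total = sum(seen.values())
    following = context.get((t1, t2), Counter())
    context_total = sum(following.values())
    scores = {}
    for tag, count in seen.items():
        p_w = count / token_total  # P(tag | token)
        p_c = following[tag] / context_total if context_total else 0.0
        scores[tag] = p_w * p_c  # Pw,c = Pw * Pc
    return scores


# Toy English training data, for illustration only.
data = [
    [("the", "D"), ("run", "N")],
    [("they", "P"), ("run", "V")],
    [("the", "D"), ("run", "N")],
]
lexical, context = train(data)
print(score_tags("run", "<s>", "D", lexical, context))
print(score_tags("run", "<s>", "P", lexical, context))
```

In the toy data, "run" is more often a noun overall, but the context term overrides that preference when the preceding tag is a pronoun.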
“Qtag” http://www.english.bham.ac.uk/staff/omason/software/qtag.html
Example: Corpus Preparation
<?xml version="1.0" encoding="utf-8"?>
<document doc_id="1">
  <word wd_id="31" pos99a="Aa">Go’n</word>
  <word wd_id="32" pos99a="Ngns">Dag</word>
  <word wd_id="32" pos99a="Fi">,</word>
  <word wd_id="33" pos99a="Vfvca2p">komt</word>
  <word wd_id="34" pos99a="Qv">’enenn</word>
  <word wd_id="34" pos99a="Fi">,</word>
  <word wd_id="35" pos99a="Vfvca2p">sat</word>
  <word wd_id="36" pos99a="Rs">Junt</word>
  <word wd_id="37" pos99a="Bg">dol</word>
  <word wd_id="37" pos99a="Fs">.</word>
  . . .
</document>
Example: Corpus Preparation
<?xml version="1.0" encoding="utf-8"?>
<document doc_id="1">
  <word wd_id="31" pos99a="Aa" pos99c="Aa">Go’n</word>
  <word wd_id="32" pos99a="Ngns" pos99c="Ngas">Dag</word>
  <word wd_id="32" pos99a="Fi" pos99c="Fi">,</word>
  <word wd_id="33" pos99a="Vfvca2p" pos99c="Vfvca2p">komt</word>
  <word wd_id="34" pos99a="Qv" pos99c="Qv">’enenn</word>
  <word wd_id="34" pos99a="Fi" pos99c="Fi">,</word>
  <word wd_id="35" pos99a="Vfvca2p" pos99c="Vfvca2p">sat</word>
  <word wd_id="36" pos99a="Rs" pos99c="Rs">Junt</word>
  <word wd_id="37" pos99a="Bg" pos99c="Qv">dol</word>
  <word wd_id="37" pos99a="Fs" pos99c="Fs">.</word>
  . . .
</document>
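With both attributes stored on each word, tagging accuracy for a document can be read directly off the XML. The sketch below assumes, from the attribute names, that `pos99a` holds the automatically assigned tag and `pos99c` the manually corrected one; the function name `tag_accuracy` is hypothetical, and the sample repeats the ten tokens shown above with the XML declaration left out.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<document doc_id="1">
  <word wd_id="31" pos99a="Aa" pos99c="Aa">Go’n</word>
  <word wd_id="32" pos99a="Ngns" pos99c="Ngas">Dag</word>
  <word wd_id="32" pos99a="Fi" pos99c="Fi">,</word>
  <word wd_id="33" pos99a="Vfvca2p" pos99c="Vfvca2p">komt</word>
  <word wd_id="34" pos99a="Qv" pos99c="Qv">’enenn</word>
  <word wd_id="34" pos99a="Fi" pos99c="Fi">,</word>
  <word wd_id="35" pos99a="Vfvca2p" pos99c="Vfvca2p">sat</word>
  <word wd_id="36" pos99a="Rs" pos99c="Rs">Junt</word>
  <word wd_id="37" pos99a="Bg" pos99c="Qv">dol</word>
  <word wd_id="37" pos99a="Fs" pos99c="Fs">.</word>
</document>"""


def tag_accuracy(xml_text, auto_attr="pos99a", gold_attr="pos99c"):
    """Share of <word> elements whose automatic tag matches the corrected tag."""
    words = ET.fromstring(xml_text).iter("word")
    pairs = [(w.get(auto_attr), w.get(gold_attr)) for w in words]
    matches = sum(auto == gold for auto, gold in pairs)
    return matches / len(pairs)


print(tag_accuracy(SAMPLE))
```

For this fragment the automatic and corrected tags differ on two of the ten tokens ("Dag" and "dol"), giving an accuracy of 0.8.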