Kilgarriff: GDEX GDEX: Automatically finding good dictionary examples in a corpus
Users appreciate examples
• Paper: space constraints
• Electronic: no space constraints
• Give lots of examples
• Constraint: cost of selection, editing
Project
• Macmillan English Dictionary
• Already had 1000 collocation boxes
• Average 8 per box
• New electronic version
• All 8000 collocations need examples
• Authentic; from corpus
Old method
• Lexicographer gets concordance for collocation
• Reads through until they find a good example
• Cut, paste, edit
New method
• Lexicographer gets sorted concordance
• 20 best examples in spreadsheet
• Less reading through
• Tick the first good one, edit
What makes a good example?
• Readable (for EFL users)
• Informative
• Typical, for the collocation
• Gives context which helps user understand the target word/phrase
Readability
• 70 years of research
• Not just (or mainly) EFL
• Educational theory
• Teaching children to read
• Instruction manuals
• Early work: US military
• Publishing
• People like newspapers and magazines that they find easy to read
Readability tests
• Flesch Reading Ease test, 1948
• Average sentence length, average word length
• In some word processing software
• Many similar measures
• Recent work: training data for different reading levels
• Language modelling
• Target levels: US grades
• Now, increasingly: Common European Framework
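As an illustration of the kind of measure listed above, here is a minimal sketch of the standard Flesch Reading Ease formula, which combines average sentence length (words per sentence) with average word length (syllables per word). The syllable counter is a crude vowel-group heuristic and the tokenisation is a simple regular expression; both are assumptions made for this sketch and not part of the talk.

```python
import re


def count_syllables(word: str) -> int:
    """Crude estimate (assumption): one syllable per run of vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    # Split on sentence-final punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = flesch_reading_ease("The cat sat on the mat.")
hard = flesch_reading_ease(
    "Institutional heterogeneity complicates multidimensional comparative evaluation."
)
```

Short sentences made of short words score high; long, polysyllabic sentences score low, which is the intuition that the sentence-length and common-word heuristics in GDEX also rely on.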
GDEX
• Get concordance for collocation
• For each sentence: score it
• Sort
• Show best ones to lexicographer
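The score-sort-show loop can be sketched in a few lines. This is only an outline under assumptions: `best_examples` and `toy_score` are hypothetical names, the concordance is taken to be a plain list of sentence strings, and the placeholder scorer checks nothing but word count. The default of 20 results mirrors the "20 best examples" shown to the lexicographer.

```python
from typing import Callable, List, Tuple


def best_examples(
    sentences: List[str],
    score: Callable[[str], float],
    n: int = 20,
) -> List[Tuple[float, str]]:
    """Score every concordance sentence, sort best first, keep the top n."""
    scored = [(score(s), s) for s in sentences]
    # Python's sort is stable, so ties keep their original corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


def toy_score(sentence: str) -> float:
    """Placeholder scorer (assumption): 1.0 if the sentence has 10 to 26 words."""
    return 1.0 if 10 <= len(sentence.split()) <= 26 else 0.0


concordance = [
    "Too short.",
    "This sentence has exactly ten words in it, you see.",
]
top = best_examples(concordance, toy_score, n=1)
```

Keeping the scorer as a parameter means the ranking step does not change when the heuristics or their weights do.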
GDEX heuristics
• Sentence length (10-26 words)
• Mostly common words is good
• Rare words are bad
• Sentences: start with capital, end with one of . ! ?
• No [, ], <, >, http, \
• Not much other punctuation, numbers
• Not too many capitals
• Typicality: third collocate is a plus
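The heuristics above could be turned into per-sentence feature values along the following lines. This is a sketch, not the GDEX implementation: the function name, the feature names, the 0-to-1 scaling, and the `common_words` set (standing in for a real frequency list) are all assumptions. The third-collocate feature is left out because it needs collocation data.

```python
import re
from typing import Dict, Set

# Strings the slide lists as unwanted in an example sentence.
BANNED = ("[", "]", "<", ">", "http", "\\")


def heuristic_scores(sentence: str, common_words: Set[str]) -> Dict[str, float]:
    """Return one value in [0, 1] per heuristic; higher is better."""
    text = sentence.strip()
    tokens = text.split()
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    letters = [c for c in text if c.isalpha()]
    # Anything that is neither a letter nor whitespace: punctuation, digits.
    symbols = [c for c in text if not (c.isalpha() or c.isspace())]
    capitals = [c for c in letters if c.isupper()]
    return {
        "length_ok": 1.0 if 10 <= len(tokens) <= 26 else 0.0,
        "common_ratio": sum(w in common_words for w in words) / max(1, len(words)),
        "well_formed": 1.0
        if text[:1].isupper() and text.endswith((".", "!", "?"))
        else 0.0,
        "no_banned": 0.0 if any(b in text for b in BANNED) else 1.0,
        "few_symbols": 1.0 - len(symbols) / max(1, len(text)),
        "few_capitals": 1.0 - len(capitals) / max(1, len(letters)),
    }


common = {"the", "a", "after", "and", "long"}
good = heuristic_scores(
    "The committee reached a firm decision after a long and careful debate.", common
)
bad = heuristic_scores("see http://example.com [1]", common)
```

Returning named values instead of a single number keeps the weighting step separate, so the same features can be reweighted without being recomputed.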
Weighting
• For each sentence: score on each heuristic
• Weight scores
• Add together: weighted score
• How to set weights?
• Two students: manually judged 1000 “good examples”
• Weights set so system makes same choices as students
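The weighted sum itself is straightforward; how the weights were fitted to the students' judgements is not described in the slides. The sketch below therefore uses a coarse grid search as a stand-in, chosen only because it is easy to read. The function names, the decision threshold, the weight grid, and the four toy training items are all assumptions.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Dict[str, float], bool]  # (heuristic scores, human judgement)


def weighted_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Add together the weighted heuristic scores for one sentence."""
    return sum(weights[name] * value for name, value in features.items())


def agreement(examples: List[Example], weights: Dict[str, float],
              threshold: float) -> float:
    """Fraction of sentences where the system's choice matches the human's."""
    hits = sum(
        (weighted_score(feats, weights) >= threshold) == label
        for feats, label in examples
    )
    return hits / len(examples)


def grid_search(examples: List[Example], names: Sequence[str],
                grid: Sequence[float] = (0.0, 0.5, 1.0),
                threshold: float = 1.0) -> Tuple[Dict[str, float], float]:
    """Try every weight combination on the grid; keep the best-agreeing one."""
    best_weights: Dict[str, float] = {}
    best_acc = -1.0
    for combo in product(grid, repeat=len(names)):
        weights = dict(zip(names, combo))
        acc = agreement(examples, weights, threshold)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc


# Toy judgements (assumption): good only if length is fine AND words are common.
judged = [
    ({"length_ok": 1.0, "common_ratio": 0.9}, True),
    ({"length_ok": 1.0, "common_ratio": 0.2}, False),
    ({"length_ok": 0.0, "common_ratio": 0.9}, False),
    ({"length_ok": 0.0, "common_ratio": 0.1}, False),
]
weights, acc = grid_search(judged, ["length_ok", "common_ratio"])
```

With more than a handful of heuristics a grid search becomes impractical, and a standard learner such as logistic regression would be the more usual choice.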
Was it successful?
• Did it save lexicographer time?
• Definitely (says project manager)
• Rough guess: average number of corpus lines to read until you find a good one
• Unsorted: 20
• Sorted: 5
Corpus choice
• Started with BNC, but
• Too old
• Not enough examples
• If no good examples in corpus, GDEX can’t help
• Changed to UKWaC
• 20 times bigger; from web; contemporary
• Better
• Most web junk filtered out
• Usually a good example in top twenty
GDEX and TALC
• TALC (Teaching and Language Corpora)
• Goal: bring corpora into language teaching
• Usual problem: concordances are tough for learners to read
• Way forward: GDEX examples
• Half way between dictionary and corpus
GDEX: Models for use
• More examples for dictionaries
• Speed up, as with MED, or
• Fully automatic “more examples”
• Corpus query tool
• Sort concordances, best first
• Now an option in the Sketch Engine
• Automatic collocations dictionary
• http://forbetterenglish.com
Recent developments
• Configurable GDEX
• For other languages
• Interface to help set up
• Commonest string
• Between ‘bare collocate’ and example