Bilingual term extraction revisited
Špela Vintar
University of Ljubljana
spela.vintar@ff.uni-lj.si

Extracting terms from the Acquis corpus
• Using a bilingual subcorpus on Nuclear Energy (EN-SL)
• No linguistic preprocessing, only stop lists
• Universal terms and collocations:
  • Council regulation
  • European Union
  • Member State
  • Commission directive
  • Article
  • Having regard to
• Danger of “Acquis stoplists”: European Atomic Energy Community

Measures of “keyness”
• subcorpus vs. general language corpus (here: Acquis): relative corpus frequency
• document vs. document collection: tf.idf

$$\mathrm{weight}(i, j) = (1 + \log \mathrm{tf}_{i,j}) \cdot \log \frac{N}{\mathrm{df}_i}$$

Applied to single or multi-word units.

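A minimal sketch of the tf.idf weight from this slide. The log base (natural log), the zero weight for absent terms, and the toy counts are assumptions; the slides do not specify them.

```python
import math

def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    """weight(i, j) = (1 + log(tf_ij)) * log(N / df_i).

    tf:     frequency of term i in document j
    df:     number of documents that contain term i
    n_docs: N, the size of the document collection
    Returns 0.0 when the term does not occur (assumed convention).
    """
    if tf <= 0 or df <= 0:
        return 0.0
    return (1.0 + math.log(tf)) * math.log(n_docs / df)

# Toy counts, invented for illustration only.
rare_term = tfidf_weight(tf=5, df=2, n_docs=100)
common_term = tfidf_weight(tf=5, df=100, n_docs=100)  # in every document -> 0.0
print(rare_term, common_term)
```

A term that occurs in every document gets weight 0 regardless of its frequency, which is what makes the measure useful for separating document-specific vocabulary from collection-wide vocabulary.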
Examples of unigrams extracted through rel. freq.

Words not found in the reference corpus (score 1 each):
sievert, gray, Sv, wT, radon, becquerel, wTHT, DT, EDA, aboveground, APPRENTICES, Thermonuclear, wR, dN, ankles, mSv, after-effects, DOSE, forearms, avertable, ITER, Committed, cosmic, HT, Bq, dt

Words with high rel. freq.:
Score   Word
0.54    Radiological
0.49    concerned
0.21    Board
0.11    aid
0.10    Potential
0.08    reasonably
0.08    Reconstruction
0.08    give
0.08    extend
0.07    alia
0.04    CHAPTER
0.01    qualified
0.01    measurement
0.01    Nuclear
0.01    materials
0.01    steps
0.01    energy
0.01    declared
0.01    relevant
0.01    contaminating
0.01    Design
0.01    developments
0.01    contribute
0.01    procedure
0.01    reduce
0.01    costs

Tf.idf

Word            Score
radiological    0.67
exposures       0.25
JRC             0.20
lens            0.19
radiation       0.17
apprentices     0.14
ionizing        0.14
serviceable     0.13
dose            0.13
nuclear         0.12
doses           0.12
workplaces      0.11
EXPOSURE        0.11
radioactive     0.10
joule           0.10
Resolutions     0.10
Governors       0.10
Dose            0.08
students        0.08
Chernobyl       0.08
Cabinet         0.076
Nuclear         0.067
exposure        0.067
non-Member      0.059
gender          0.056
workers         0.052
Reactor         0.050
Euratom         0.049
proceeds        0.047
disregarded     0.043
Exchanges       0.042
Optimization    0.042
PRACTICES       0.042
dosimetric      0.042
exposed         0.037
population      0.036
contaminating   0.033

Tf.idf - Slovene

Word              Score
sevanju           0.19082
radiološkega      0.17864
dozimetrijo       0.17052
sivert            0.13804
radionuklidov     0.13804
sevanja           0.13195
Dana              0.12992
Černobil          0.12180
Izpostavljenost   0.12180
Jedrska           0.11368
dozo              0.09473
prebivalstva      0.09256
sevanjem          0.08932
ITER              0.08120
Oddelkom          0.07308
inovativnosti     0.07308
študente          0.07308
izpostavljenosti  0.07308
radioaktivne      0.06766
SRS               0.06766
doza              0.06496
posameznike       0.06090
pooblaščenimi     0.05684
cepitve           0.05684
nivoji            0.05684
efektivno         0.05684
medicine          0.05278
fuzije            0.05075
zaposlitvijo      0.04872
termonuklearni    0.04872
študentov         0.04872
guvernerjev       0.04872
prioritete        0.04872
reaktorja         0.04872
jedrske           0.04872
delodajalca       0.04669
izpostavljenih    0.04601
ionizirajočemu    0.04466
ekvivalentno      0.04263
dosegljive        0.04060
ionizirajočega    0.04060
jedrskem          0.04060
nuklearnih        0.04060
kontrolirana      0.04060
radiološki        0.04060

Other indicators of termhood
• Acronyms (NPP, SG, RBB ...)
• Unknown words
  • not found in the reference corpus
  • unknown to the lemmatizer
• Cognates & Named entities

English         Slovene           Score
radioactive     radioaktivna      1.0
radioactive     Radioaktivna      1.0
Radioactive     Radioaktivna      1.0
radioactive     radioaktivne      1.0
radioactive     radioaktivnih     1.0
radioactive     radioaktivnimi    1.0
radioactive     radioaktivno      1.0
radioactive     radioaktivnosti   1.0
radioactive     radiokativnega    1.0
radiography     radiografijo      1.0
radionuclide    radionuklid       1.0
radionuclide    radionuklida      1.0
radionuclide    radionuklidov     1.0
radionuclides   radionuklide      1.0
radionuclides   radionuklidov     1.0
ratify          ratificirajo      1.0
Reactor         reaktorja         1.0
reactor         reaktorjev        1.0
reactors        reaktorji         1.0

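The slides do not say how the cognate pairs were detected. One common option, shown here only as a hedged sketch, is the longest common subsequence ratio (LCSR) between the two word forms. The function names and the 0.6 threshold are hypothetical and are not taken from the talk.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(source: str, target: str) -> float:
    """Case-insensitive LCS length divided by the length of the longer word."""
    s, t = source.lower(), target.lower()
    if not s or not t:
        return 0.0
    return lcs_length(s, t) / max(len(s), len(t))

def is_cognate(source: str, target: str, threshold: float = 0.6) -> bool:
    """Hypothetical decision rule: treat the pair as cognates above the threshold."""
    return lcsr(source, target) >= threshold

print(lcsr("radionuclide", "radionuklid"))  # 10 shared characters out of 12
print(is_cognate("Reactor", "reaktorja"))   # similar forms
print(is_cognate("power", "elektrarna"))    # unrelated forms
```

Inflected Slovene forms are longer than their English counterparts, so any fixed threshold trades recall on long inflected forms against false matches on short words.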
Identifying multi-word units
• Collocation extraction techniques
  • Mutual Information (Church & Hanks 1990)
  • Log-likelihood ratio (Dunning 1993)
  • Entropy-based (Shimohata et al. 1997)
  • Semantic non-compositionality (Pearce 2001)
• Daille (1994): LL is the most appropriate measure
• for n > 3: n-gram frequency (+ stopword filtering) also works

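As an illustration of the first two measures in the list above, the sketch below computes pointwise mutual information and the log-likelihood ratio for bigrams from a 2x2 contingency table. It treats the corpus as one flat token list (no sentence boundaries), and the toy text is invented; neither detail comes from the slides.

```python
import math
from collections import Counter

def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11 = count(w1 w2), k12 = count(w1, not w2),
    k21 = count(not w1, w2), k22 = count(neither).
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = ((k11, row1, col1), (k12, row1, col2),
             (k21, row2, col1), (k22, row2, col2))
    # Empty cells contribute nothing to the sum.
    return 2.0 * sum(k * math.log(k * n / (r * c)) for k, r, c in cells if k > 0)

def bigram_association(tokens):
    """Map each bigram to a (PMI, LLR) pair."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    first, second = Counter(), Counter()  # positional marginals
    for (w1, w2), count in bigrams.items():
        first[w1] += count
        second[w2] += count
    scores = {}
    for (w1, w2), k11 in bigrams.items():
        k12 = first[w1] - k11
        k21 = second[w2] - k11
        k22 = total - k11 - k12 - k21
        pmi = math.log2(k11 * total / (first[w1] * second[w2]))
        scores[(w1, w2)] = (pmi, llr(k11, k12, k21, k22))
    return scores

# Invented toy text, for illustration only.
text = "ionizing radiation and cosmic radiation and ionizing radiation".split()
for bigram, (pmi, g2) in sorted(bigram_association(text).items(),
                                key=lambda item: -item[1][1]):
    print(bigram, round(pmi, 3), round(g2, 3))
```

PMI tends to overrate rare pairs, while the log-likelihood ratio also takes the amount of evidence into account.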
N-gram term weighting
• statistically extracted n-grams are not necessarily terms, hence the need for filtering / weighting
• Stopword filtering
• Weighting with tf.idf, ll-rank/core frequency

$$w(t_{w_1, w_2, w_3}) = \frac{\mathrm{tf.idf}_{w_1} + \mathrm{tf.idf}_{w_2} + \mathrm{tf.idf}_{w_3}}{n} \cdot \frac{1}{\mathrm{rank}}$$

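A rough sketch of the two steps on this slide. It assumes that the per-word weights are added before dividing by n (the operators are not legible in the source), that `rank` is the n-gram's log-likelihood rank, and that stopword filtering discards n-grams which begin or end with a stopword. The stop list and the weights are made up for the example.

```python
STOPWORDS = {"the", "of", "and", "to", "in"}  # hypothetical mini stop list

def keep_ngram(ngram) -> bool:
    """Assumed filter: reject n-grams that start or end with a stopword."""
    return ngram[0].lower() not in STOPWORDS and ngram[-1].lower() not in STOPWORDS

def ngram_weight(ngram, word_weight, ll_rank: int) -> float:
    """Average the per-word weights, then scale by 1 / log-likelihood rank."""
    total = sum(word_weight.get(word.lower(), 0.0) for word in ngram)
    return total / len(ngram) / ll_rank

# Invented per-word weights and ranks, for illustration only.
weights = {"nuclear": 0.12, "power": 0.04, "radiation": 0.17}
candidates = {("nuclear", "power"): 2, ("of", "radiation"): 1}
for ngram, rank in candidates.items():
    if keep_ngram(ngram):
        print(" ".join(ngram), ngram_weight(ngram, weights, rank))
```

Filtering only at the edges keeps candidates such as "feet and ankles", where the stopword sits inside the n-gram.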
2-grams, weighted with rel. freq.

2-gram                        Weight
Thermonuclear Experimental    1.91766291545192
International Thermonuclear   1.90047962704222
wR values                     1.74111305022281
cosmic radiation              1.68720469442766
non-Member States             1.67427461796584
Atomic Energy                 0.996377043841846
European Atomic               0.995366262170687
Energy Community              0.995029334946967
Member States                 0.994692407723247
Member State                  0.994355480499528
exposed workers               0.990312353814892
radiation protection          0.988290790472574
ionizing radiation            0.985847228548466
nuclear power                 0.975824483194946
Nuclear Safety                0.97077057483915

3-grams

3-gram                                     Weight
Thermonuclear Experimental Reactor         2.83532583090384
International Thermonuclear Experimental   2.81814254249414
mSv per year                               2.73507410483208
APPRENTICES AND STUDENTS                   2.69461709334949
exceed 1 mSv                               2.46078960008804
feet and ankles                            2.2734580636999
European Atomic Energy                     1.99321785686789
Atomic Energy Community                    1.99288092964417
DECIDED AS FOLLOWS                         1.95055494141597
nuclear power stations                     1.94785693049428
Nuclear Safety Account                     1.94301570053366
controlled nuclear fusion                  1.88877041751479
Energy Community represented               1.87461947411856
natural radiation sources                  1.87453104455193
nuclear power station                      1.87309490609042
apprentices and students                   1.86800777160465
Chernobyl nuclear power                    1.86257180721416
establishing the European                  1.85670559767151

Treatment of nested terms
C-value (Frantzi & Ananiadou 1996)

$$\text{C-value}(a) = (\mathrm{length}(a) - 1)\left(\mathrm{freq}(a) - \frac{t(a)}{c(a)}\right)$$

n-gram                          C-value
Chernobyl nuclear               7.3
nuclear power plant             15.2
Chernobyl nuclear power plant   20.4

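A small sketch of the C-value formula on this slide. In Frantzi & Ananiadou's definition, t(a) is the frequency of a inside longer candidate terms and c(a) is the number of those longer candidates. Here t(a) is approximated by the summed frequencies of the longer candidates, and a candidate that is not nested anywhere keeps its plain frequency. The frequencies are invented and are not meant to reproduce the scores on the slide.

```python
def contains(longer, shorter) -> bool:
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def c_value(term, freq) -> float:
    """C-value(a) = (length(a) - 1) * (freq(a) - t(a) / c(a)).

    term: tuple of words; freq: dict mapping every candidate term to its frequency.
    """
    longer = [cand for cand in freq if cand != term and contains(cand, term)]
    length_factor = len(term) - 1
    if not longer:  # not nested in any longer candidate
        return length_factor * freq[term]
    t = sum(freq[cand] for cand in longer)  # simplified t(a)
    return length_factor * (freq[term] - t / len(longer))

# Invented frequencies, for illustration only.
freq = {
    ("nuclear", "power", "plant"): 10,
    ("Chernobyl", "nuclear", "power", "plant"): 4,
}
for term in freq:
    print(" ".join(term), c_value(term, freq))
```

The subtraction discounts occurrences that only exist because the shorter string is part of a longer term, so a fragment that never occurs outside the longer term scores 0.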
Bilingual lexicon extraction
• using Twente (Hiemstra 1998)
• based on the Iterative Proportional Fitting Procedure (IPFP), word-to-word translation model
• outputs translation candidates + scores for each word in the corpus; both ways
• using stopword-filtered corpora to improve results
• bilingual lexicon expanded with cognates

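For readers unfamiliar with IPFP, here is a generic sketch of the procedure on a small matrix: rows and columns are rescaled in turn until the table matches the given marginal totals. This is not the Twente implementation and it does not show how the translation model is built on top of IPFP; the matrix and the target totals are invented.

```python
def ipfp(matrix, row_targets, col_targets, iterations: int = 100):
    """Iterative proportional fitting on a non-negative 2-D table.

    Alternately scales rows to match row_targets and columns to match
    col_targets. The targets are assumed to have equal grand totals.
    """
    fitted = [list(row) for row in matrix]
    for _ in range(iterations):
        # Row step: scale each row to its target sum.
        for i, row in enumerate(fitted):
            row_sum = sum(row)
            if row_sum > 0:
                fitted[i] = [value * row_targets[i] / row_sum for value in row]
        # Column step: scale each column to its target sum.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in fitted)
            if col_sum > 0:
                for row in fitted:
                    row[j] *= target / col_sum
    return fitted

# Invented table and marginals, for illustration only.
table = ipfp([[1.0, 2.0], [3.0, 4.0]], row_targets=[4.0, 6.0], col_targets=[5.0, 5.0])
for row in table:
    print([round(value, 4) for value in row])
```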
Term alignment
• for each source term candidate we collect all single-word equivalents from the bilingual lexicon

Source term candidate: jedrska elektrarna Černobil

Single-word equivalents   Score
power                     0.50
plant                     0.50
Chernobyl                 1.00
nuclear                   1.00

Target term candidates          Score
Nuclear power plant             2.00
Power plant                     1.00
Chernobyl nuclear power plant   3.00

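The candidate scores in this example are consistent with adding up the lexicon scores of the words each target candidate contains. The sketch below assumes exactly that, plus case-insensitive matching; the slides do not spell out either point, and the function name is illustrative.

```python
def score_candidates(equivalents, target_candidates):
    """Score each target candidate by the summed scores of its matched words."""
    lexicon = {word.lower(): score for word, score in equivalents.items()}
    return {
        candidate: sum(lexicon.get(word.lower(), 0.0) for word in candidate.split())
        for candidate in target_candidates
    }

# Single-word equivalents collected for "jedrska elektrarna Černobil" (from the slide).
equivalents = {"power": 0.50, "plant": 0.50, "Chernobyl": 1.00, "nuclear": 1.00}
candidates = ["Nuclear power plant", "Power plant", "Chernobyl nuclear power plant"]

scores = score_candidates(equivalents, candidates)
best = max(scores, key=scores.get)
print(scores)  # 2.0, 1.0 and 3.0, matching the slide
print(best)    # Chernobyl nuclear power plant
```

Summed scores favour the longest candidate that covers every equivalent, which here picks the full term over its nested parts.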
Outcome
• Corpus: 17,000 tokens
• Extracted 193 Slovene and 199 English term candidates
• Bilingual (aligned): 112
• What we miss:
  • Term variation:
    • disposal of waste / emplacement of waste
    • safety levels / levels of safety
  • Hapax:
    • radiation weighting factor, tissue weighting factor

Purpose of term extraction
• Extraction vs. annotation

Problems
• Distinguishing between generic and text-specific terms (same form, same frequency!)
• Capturing low-frequency terms in inflected languages
• We want to capture domain-specific terms. But most texts are multi-domain!