Automatic Eurovoc Indexing
Bruno Pouliquen
Joint Research Centre, European Commission, Ispra, Italy
http://www.jrc.cec.eu.int/langtech
Addressing the Language Barrier Problem in the Enlarged EU
Automating Eurovoc Descriptor Assignment
Contents • Overview of the process • Background • Eurovoc Thesaurus • Corpus of texts • Approaches to thesaurus indexing • Vector space • Training • Pre-processing the texts • Building Eurovoc profiles • Tuning various parameters • Assignment • “guessing” the descriptors for a new text • …results
Overview: starting point • Set of texts, manually indexed
[Figure: sample texts shown as word lists (e.g. "Europe, Poland, Polish …", "Culture, programme, artistic, revival …"), each labelled with its manually assigned descriptors, here "Poland" and "Cultural policy"]
Overview: learning process for automatic assignment • Produce descriptor profiles
[Figure: the manually indexed texts are turned into one weighted word list (profile) per descriptor]
Poland: Poland 23, Polish 20, …, Producers 9, …
Cultural policy: Culture 41, Cultural 32, …, Artistic 21, Revival 10, …
Overview: assignment • A new document is compared with descriptor profiles ...
Overview: assignment results
TITLE: Association council between the European Communities and the Republic of Poland
DECISION No 3/96 OF THE ASSOCIATION COUNCIL between the European Communities and their Member States, of the one part, and the Republic of Poland, of the other part of 16 July 1996 settling the dispute between the European Communities and the Republic of Poland concerning skins and hides in accordance with Article 105 (1) and (2) of the Europe Agreement between the European Communities and their Member States, of the one part, and the Republic of Poland, of the other part (96/496/Euratom, ECSC, EC)
THE ASSOCIATION COUNCIL,
Having regard to the Europe Agreement establishing an Association between the European Communities and their Member States, of the one part, and the Republic of Poland, of the other part (hereinafter 'the Europe Agreement'), and in particular Article 105 thereof,
Whereas it is laid down in Article 105 (1) and (2) of the Europe Agreement that the Association Council may settle by means of a decision any dispute relating to the application or interpretation of the Europe Agreement;
Considering that in view of a critical shortage of raw material in the form of skins and hides the Republic of Poland introduced on 1 January 1994 a quota for the export of skins and hides set at 1 400 tonnes for 1994 and 1995 and 3 000 tonnes for 1996, invoking Article 31 of the Europe Agreement;
Recognizing that, at the first meeting of the Association Council held in Warsaw on 23 and 24 June 1994, the Community requested Poland to increase the quota to 15 000 tonnes for 1994 and 20 000 tonnes for 1995 in order to maintain a balance, in accordance with the Europe Agreement, between the measures taken by Poland and the real shortage of the raw material existing;
Considering that Poland informed the Community that the restriction had been introduced temporarily, was the result of the existing shortage and would be withdrawn as soon as the causes of its implementation disappeared;
Considering that both sides have not reached a common understanding;
Recognizing that in its letter of 28 July the Community referred the matter to the Association Council in accordance with Article 105 (1) of the Europe Agreement in order that it might settle the dispute;
Considering that at the second meeting of the Association Council held in Brussels on 17 July 1995 the Community proposed that the quota for 1995 be raised to 13 500 tonnes;
Recognizing that, as Poland could not accept the Community's proposal and as various proposals by Poland to increase the quota had not been accepted by the Community, both sides agreed to the application of Article 105 (4) of the Europe Agreement;
Considering that the Republic of Poland and the Community have both notified their arbiters;
Considering that, in the meantime the Republic of Poland in its letter of 18 March 1996 submitted a compromise proposal concerning the establishment of a timetable for liberalization of the export of skins and hides which envisages the final withdrawal of restrictions on 1 January 1999 at the latest and provides for another investigation of the matter in 1997 in order to hasten the process of full liberalization by one year;
Recognizing that in such circumstances both sides have decided to stop the arbitration procedure provided for in Article 105 (4) and finish it according to Article 105 (2) of the Europe Agreement,
HAS DECIDED AS FOLLOWS:
Article 1
The amount of the annual quota for exports from Poland of skins and hides, set by Poland at 3 000 tonnes for 1996 shall be increased for the same products to 10 000 tonnes for 1996, 12 000 tonnes for 1997 and 15 000 tonnes for 1998. The Republic of Poland will eliminate the restriction in export of skins and hides with effect from 1 January 1999.
Assigned descriptors (with scores): Poland 28; EC association agreement 16; Eastern Europe 14; …; Cultural Policy 0.05
Visual overview of the process
[Diagram: Training corpus → pre-processing → training → Descriptor profiles; New text → pre-processing → assignment (against the descriptor profiles) → Descriptors]
Background: Eurovoc thesaurus • Created for indexing European texts • Created for human use • Contains some abstract concepts (cultural policy) • Hierarchically organised (BT/NT)
Background: Corpus • Corpus: set of (homogeneous) texts • Collected: • Parliamentary questions • Debate • Resolution • Protocol • Council regulation • Council decision • Council proposition • Agreements and contracts • ...
Background: Corpus • Our corpus: • About 75000 texts in de,en,fr,it… • About 30000 in fi, sv... • And... • 22000 in Lithuanian • 8000 in Hungarian • Hopefully 8000, or more, in the new EU languages • ...
Main approaches for thesaurus indexing • Look for the descriptor text in documents • Linguistic, rule-based approach • Machine learning, statistical approach (JRC)
Look for the descriptor text in documents • Most intuitive attempt for Eurovoc indexing • Try to find a descriptor text explicitly in documents • Many texts (69%) do not include descriptor text explicitly
Example text: Article 1 Poland shall participate in the Culture 2000 programme according to the terms and conditions set out in Annexes I and II which shall form an integral part of this Decision. Article 2 This Decision shall enter into force on the day of its adoption. It shall apply for the duration of the Culture 2000 programme, starting from 1 January 20 …
Descriptors: Poland; Cultural policy ?; Community programme ?; Financing of the community project ?
Look for the descriptor text in documents (2) • Many documents do contain some descriptor text without being indexed with it (90%)
Example text: Article 1 Poland shall participate in the Culture 2000 programme according to the terms and conditions set out in Annexes I and II which shall form an integral part of this Decision. Article 2 This Decision shall enter into force on the day of its adoption. It shall apply for the duration of the Culture 2000 programme, starting from 1 January 20 …
Descriptor text present in the example: Culture, Form, Decision
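The string-lookup approach on the two slides above can be sketched as follows. This is a minimal illustration, not the tool evaluated in the talk: the function name `descriptors_found_in_text` and the case-insensitive whole-word matching are assumptions made for the example.

```python
import re

def descriptors_found_in_text(text, descriptor_labels):
    """Return the labels that occur verbatim (case-insensitively) in the text."""
    found = []
    for label in descriptor_labels:
        # Whole-word match so that 'Form' does not match 'formal'.
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(label)
    return found

text = ("Poland shall participate in the Culture 2000 programme according to "
        "the terms and conditions set out in Annexes I and II which shall form "
        "an integral part of this Decision.")
labels = ["Poland", "Cultural policy", "Community programme",
          "Culture", "Form", "Decision"]

hits = descriptors_found_in_text(text, labels)
print(hits)  # ['Poland', 'Culture', 'Form', 'Decision']
```

The lookup misses "Cultural policy" and "Community programme", which never appear literally, and it returns "Form" and "Decision", which appear only as ordinary words. Both failure modes are the ones the slides describe.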
Rule-based approach for Eurovoc indexing • Manually write a set of rules for each descriptor Example: [Hlava & Hainebach 1996] • 40000 rules for English • Words and word combinations • Fishery AND management • Word locations in the text • Proximity (words in the same sentence…) • Location (title/text/beginning of sentence…) • Format (capital letters) • Exploit legal references • E.g. “Council directive 79/112/EEC” • Too expensive, difficult to update
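To give a feel for what one hand-written rule involves, here is a hypothetical rule combining a word combination ("Fishery AND management") with proximity and location conditions. The function names, the sentence splitter and the exact conditions are assumptions for illustration; they are not taken from [Hlava & Hainebach 1996].

```python
import re

def split_sentences(text):
    """Naive sentence splitter (assumption: sentences end with . ! or ?)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def rule_fishery_management(title, body):
    """Hypothetical rule: fire if 'fishery' AND 'management' occur together
    in the title (location) or within one sentence of the body (proximity)."""
    def has_both(passage):
        words = set(re.findall(r"[a-z]+", passage.lower()))
        return {"fishery", "management"} <= words
    return has_both(title) or any(has_both(s) for s in split_sentences(body))

print(rule_fishery_management("Report", "Management of the fishery improved."))
print(rule_fishery_management("Report", "The fishery is large. Management was absent."))
```

With tens of thousands of such rules per language, every change to the thesaurus means revisiting rules by hand, which is the cost the slide points to.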
Machine learning • Inductive process • Tries to “learn” from manually indexed examples • Tries to “reproduce” indexing on new document • Advantages • Fast and cheap • Easy to adapt to new languages • Easier to update • Possibility to re-index old documents with new descriptors • More consistent than manual indexing • Ranked assignment (better relevance ranking in search)
Machine learning: our approach • Trained on manually Eurovoc-indexed documents • Basic and fast representation of texts: “bag of words” • No consideration of relationships between words (syntax, semantics, discourse…) • Build a representation of each descriptor • the profile: (weighted) list of the most representative words
Vector space representation • A text is represented as a vector • The dimensions are the words • A Eurovoc descriptor profile is also a vector
[Figure labels: text “Agreement in the form of an exchange of letters (…) relating to the amendment of the Convention of 20 May 1987 on a common transit procedure”; descriptor “Nuclear Material”]
See “Vector space” annex for more information
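A small sketch of the idea that a text and a descriptor profile live in the same word space. The three dimensions follow the annex example (Programme, Poland, Culture) and the profile weights are the ones shown on the overview slide; the toy document and the helper name `to_vector` are assumptions.

```python
from collections import Counter

def to_vector(weights, dimensions):
    """Project a word -> weight mapping onto a fixed list of dimensions."""
    return [weights.get(word, 0) for word in dimensions]

# One dimension per word (three words only, to keep the example readable).
dimensions = ["programme", "poland", "culture"]

# Toy document, already lower-cased and tokenised (assumption).
document_words = "poland culture programme culture programme poland poland".split()
doc_vector = to_vector(Counter(document_words), dimensions)

# Descriptor profile "Cultural policy": word weights from the overview slide.
cultural_policy = {"culture": 41, "cultural": 32, "artistic": 21, "revival": 10}
profile_vector = to_vector(cultural_policy, dimensions)

print(doc_vector)      # [2, 3, 2]
print(profile_vector)  # [0, 0, 41]
```

Words outside the chosen dimensions ("cultural", "artistic", "revival") are simply dropped here; in a real setting the space has one dimension per word of the vocabulary.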
Training Eurovoc assignment: text pre-processing • Why? Find a better representation in the “bag-of-words” • Semantically not relevant to consider some common words • ‘a’, ‘the’, ‘very’… • Unify word forms • ‘culture’, ‘cultures’, ‘cultural’… • Word sense disambiguation • ‘pilot’, ‘order’…
Training Eurovoc assignment: text pre-processing • How? • “Stop word” list, avoid the use of non-meaningful words • Lemmatisation: replace inflected word forms by their base form (lemma) • ‘towns’ => ‘town’ • ‘difficulties’ => ‘difficulty’ • ‘mice’ => ‘mouse’ • Multi-word units: reduce polysemy • ‘European_Union’ • ‘in_order_to’ • ‘sustainable_development’ • ‘pilot_project’ • Remove annexes and signatures
Text pre-processing example
Before: Article 1 Poland shall participate in the Culture 2000 programme according to the terms and conditions set out in Annexes I and II which shall form an integral part of this Decision. Article 2 This Decision shall enter into force on the day of its adoption. It shall apply for the duration of the Culture 2000 programme, starting from 1 January 20 …
After: Article 1 Poland shall participate_in the Culture 2000 programme accord_to the term_and_condition set_out in annex_i and II which shall form an integral_part of this Decision . Article 2 This Decision shall enter_into_force on the day of its adoption . It shall apply_for the duration of the Culture 2000 programme , start from 1 January 2001 .
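The pre-processing steps (lemmatisation, multi-word mark-up, stop-word removal) can be sketched with tiny hand-made resources. The word lists below are illustrative assumptions; a real system would rely on full lemmatisation dictionaries and multi-word lexicons for each language.

```python
import re

# Tiny illustrative resources (assumptions, not the lists used in the talk).
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "very", "and"}
LEMMAS = {"towns": "town", "difficulties": "difficulty", "mice": "mouse"}
MULTI_WORD_UNITS = [("european", "union"), ("pilot", "project"),
                    ("sustainable", "development")]

def preprocess(text):
    """Lower-case, lemmatise, join multi-word units, then drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [LEMMAS.get(token, token) for token in tokens]
    merged = []
    i = 0
    while i < len(tokens):
        for unit in MULTI_WORD_UNITS:
            if tuple(tokens[i:i + len(unit)]) == unit:
                merged.append("_".join(unit))  # e.g. 'pilot_project'
                i += len(unit)
                break
        else:
            merged.append(tokens[i])
            i += 1
    return [token for token in merged if token not in STOP_WORDS]

sentence = ("The pilot project supports sustainable development "
            "in the towns of the European Union")
result = preprocess(sentence)
print(result)
# ['pilot_project', 'supports', 'sustainable_development', 'town', 'european_union']
```

Stop words are removed last in this sketch so that units containing them, such as ‘in_order_to’, could still be matched first.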
Machine learning: training • “learn” from examples • Based on human assignment • Take every text indexed with a given descriptor as the “training sample” of this descriptor
[Figure: the manually indexed texts are grouped by descriptor, giving one training set for “Poland” and one for “Cultural policy”]
Training: representation of texts • Bag-of-Words representation of text
Text: Article 1 Poland shall participate in the Culture 2000 programme according to the terms and conditions set out in Annexes I and II which shall form an integral part of this Decision. Article 2 This Decision shall enter into force on the day of its adoption. It shall apply for the duration of the Culture 2000 programme, starting from 1 January 20 …
Bag of words: Poland, Culture, Programme, Decision, starting, Terms, …
Training: identifying most representative words
Text 1: radioactive ukraine resolution plutonium deuterium parliament nuclear blottnitz ...
Text 2: plutonium deuterium assembly nuclear schmidt radioactive korea iaea ...
Text 3: Illegal_traffic chernobyl radioactive ukrainian plutonium lithium dangerous mox ...
Text 1 + Text 2 + Text 3 = RADIOACTIVE MATERIALS: radioactive (3), plutonium (3), nuclear (2), deuterium (2), Illegal_traffic (1), chernobyl (1), ...
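The counts on this slide are the number of training texts in which each word occurs. A short sketch reproducing them from the three word lists shown (lower-cased here; the function name is an assumption):

```python
from collections import Counter

# The three training texts for RADIOACTIVE MATERIALS, as listed on the slide.
training_texts = [
    "radioactive ukraine resolution plutonium deuterium parliament nuclear blottnitz".split(),
    "plutonium deuterium assembly nuclear schmidt radioactive korea iaea".split(),
    "illegal_traffic chernobyl radioactive ukrainian plutonium lithium dangerous mox".split(),
]

def count_texts_per_word(texts):
    """For each word, count the number of texts in which it occurs."""
    counts = Counter()
    for words in texts:
        counts.update(set(words))  # set(): count each word once per text
    return counts

text_counts = count_texts_per_word(training_texts)
for word in ["radioactive", "plutonium", "nuclear", "deuterium", "chernobyl"]:
    print(word, text_counts[word])
```

Raw counts like these are only the starting point; the following slides describe how they are weighted and normalised.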
Training: identifying most representative words
[Figure: per-text word counts (e.g. “Poland 2, Polish 2, producers, europe”; “Poland 3, Polish 2, producers, Europe, programme”) are added up into one profile per descriptor]
Poland: Poland 23, Polish 20, …, Producers 9, …
Cultural policy: Culture 41, Cultural 32, …, Artistic 21, Revival 10, …
Building Eurovoc descriptor profiles For each Eurovoc Descriptor • Find the texts it appears in • Find the words appearing in those texts • Combine various weights to compute the weight of each word for this descriptor • Various normalisations used: • A very common word has less impact than a rare word • The word ‘contradiction’ (400 times) is less meaningful than ‘cloud’ (40 times) • A word used with only one descriptor has higher impact • ‘Chernobyl’ does not appear with many descriptors • ‘redistribute’ (same frequency in texts) appears with various descriptors • Texts indexed with a single descriptor have more impact than texts indexed with 20
Descriptor profile: weight of a word Weight of a word in a descriptor profile • Based on the frequency of the word • Number of texts it appears in • Each text being indexed with Nd descriptors, word contribution is 1/Nd • Normalised by the number of other descriptors it appears in
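One possible reading of these weighting rules, as a sketch. The slide does not give the exact formula, so two choices here are assumptions: a text with Nd descriptors adds 1/Nd to each of its words, and the sum is then divided by the number of descriptors whose training texts contain the word. The function name and the toy corpus are also hypothetical.

```python
from collections import defaultdict

def build_profiles(indexed_texts):
    """indexed_texts: list of (set of words, list of descriptors) pairs."""
    raw = defaultdict(lambda: defaultdict(float))
    for words, descriptors in indexed_texts:
        contribution = 1.0 / len(descriptors)  # 1/Nd for this text
        for descriptor in descriptors:
            for word in set(words):
                raw[descriptor][word] += contribution
    # Count, for each word, the descriptors it appears with.
    descriptors_per_word = defaultdict(set)
    for descriptor, weights in raw.items():
        for word in weights:
            descriptors_per_word[word].add(descriptor)
    # Assumed normalisation: divide by the number of such descriptors.
    return {
        descriptor: {word: weight / len(descriptors_per_word[word])
                     for word, weight in weights.items()}
        for descriptor, weights in raw.items()
    }

toy_corpus = [
    ({"poland", "polish", "producers"}, ["POLAND"]),
    ({"poland", "culture", "programme"}, ["POLAND", "CULTURAL POLICY"]),
    ({"culture", "artistic", "revival"}, ["CULTURAL POLICY"]),
]
profiles = build_profiles(toy_corpus)
print(profiles["POLAND"]["poland"])            # 0.75
print(profiles["CULTURAL POLICY"]["artistic"])  # 1.0
```

In the toy corpus, ‘artistic’ occurs with one descriptor only and keeps its full weight, while ‘poland’ occurs with both descriptors and is halved.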
Eurovoc descriptor profile • List of weighted words (associates)
RADIOACTIVE MATERIALS Associate List:
Associate List: FISHERY MANAGEMENT fishery-related management-related
Tuning: various parameters • Minimum size and number of training texts available for each descriptor. We chose to require at least 5 texts (with at least 2000 characters each). • Select words in texts: log-likelihood formula (p-value to the low value of 0.15 to produce long associate lists) • Reference corpus for the log-likelihood formula. A general corpus vs. the training corpus • Meta-text vs individual texts • Minimum number of texts per descriptor for which the word is an associate (a word has to appear in at least 2 texts to be an associate) • Use: number of texts / cumulated frequency of word / log-likelihood value • Impact of all associates occurring at least 10% as often as the most common associate word • We do not consider the length of each training text • Minimum weight threshold for each associate • Minimum length of the associate list
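The word-selection step can be illustrated with a commonly used two-corpus form of the log-likelihood statistic. The slides do not say which variant was used, so the formula, the one-degree-of-freedom chi-square approximation for the p-value, and all function names below are assumptions; only the 0.15 threshold comes from the slide.

```python
import math

def log_likelihood(count_sample, size_sample, count_reference, size_reference):
    """Two-corpus log-likelihood (G2) of a word: observed vs. expected counts."""
    total = count_sample + count_reference
    expected_sample = size_sample * total / (size_sample + size_reference)
    expected_reference = size_reference * total / (size_sample + size_reference)
    g2 = 0.0
    if count_sample > 0:
        g2 += count_sample * math.log(count_sample / expected_sample)
    if count_reference > 0:
        g2 += count_reference * math.log(count_reference / expected_reference)
    return 2.0 * g2

def p_value(g2):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(g2, 0.0) / 2.0))

def is_associate_candidate(count_sample, size_sample,
                           count_reference, size_reference, max_p=0.15):
    """Keep words that are over-represented in the descriptor's texts."""
    over_represented = (count_sample / size_sample
                        > count_reference / size_reference)
    g2 = log_likelihood(count_sample, size_sample,
                        count_reference, size_reference)
    return over_represented and p_value(g2) <= max_p

# Word seen 30 times in 1000 words of training text, 100 times in a
# 10000-word reference corpus (made-up numbers).
print(is_associate_candidate(30, 1000, 100, 10000))  # True
print(is_associate_candidate(10, 1000, 100, 10000))  # False: same relative frequency
```

A lenient threshold such as 0.15 lets more words through than the conventional 0.05, which matches the stated aim of producing long associate lists.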
Assignment Phase • Normalise new document (lemmatise, multi-word mark-up) • Produce word frequency list (excluding stop words) • Calculate similarity between word frequency list and descriptor associate lists, using statistical formulae
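The three assignment steps can be sketched end to end. The similarity used here is a plain weighted overlap, chosen only to keep the example short; the profiles, their weights and the function name are made up for illustration.

```python
from collections import Counter

def rank_descriptors(lemmas, profiles, stop_words=frozenset()):
    """Score every descriptor profile against a pre-processed document."""
    frequencies = Counter(l for l in lemmas if l not in stop_words)
    scores = {}
    for descriptor, associates in profiles.items():
        # Simple overlap: word frequency times associate weight, summed.
        scores[descriptor] = sum(freq * associates.get(lemma, 0.0)
                                 for lemma, freq in frequencies.items())
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical associate lists (weights invented for the example).
profiles = {
    "ETHIOPIA": {"ethiopia": 2.0, "ethiopian": 1.5},
    "HUMAN RIGHTS": {"human_rights": 2.0, "condemn": 0.5, "respect": 0.5},
    "FISHERY MANAGEMENT": {"fishery": 2.0, "management": 1.0},
}
document = ["ethiopia", "human_rights", "condemn", "ethiopian", "ethiopia", "the"]

ranking = rank_descriptors(document, profiles, stop_words={"the"})
for descriptor, score in ranking:
    print(descriptor, score)
```

The output is a ranked list, so a cut-off (top n descriptors, or a minimum score) still has to be chosen by whoever uses the assignment.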
Eurovoc assignment: example
Document: Resolution on human rights in Ethiopia
Assigned descriptors (with scores): ETHIOPIA 30%; HUMAN RIGHTS 25%; POLITICAL VIOLENCE 19%; REPRESSION 18%; DEMOCRATIZATION 18%; …; EXTREMISM 10%; DEATH PENALTY 10%; …; TREATY ON EUROPEAN UNION 6%
Matching associates shown next to the descriptors, in slide order: Ethiopia, ethiopian; human_rights (condemn, respect…); human_rights (condemn, killing…); human_rights (condemn, repression…); human_rights (condemn, democratic…); human_rights (condemn, call_on…); human_rights (condemn, call_on…); human_rights (citizen, respect…)
[Figure: Eurovoc assignment in vector space: a document and two descriptors (Descriptor 1, Descriptor 2) plotted on the axes Keyword 1, Keyword 2, Keyword 3]
Formulae tested for descriptor assignment
• Term Frequency, Inverse Document Frequency (TF.IDF): considers the occurrence frequency of lemma l in meta-text t (TF_{l,t}) and the number of descriptors d for which the lemma is an associate (DF_l)
• Cosine: uses TF.IDF; computes the angle between two multi-dimensional vectors (that of the document t and that of the descriptor associate list)
• Okapi: considers the occurrence frequency of the lemma as an associate (DF_l); the number of associates in the associate list (size, |d|); the average size of descriptor associate lists (M); the total number of descriptors used (N)
• ‘Scalar Product’: adds the products of the TF.IDF values of associates and text lemmas
• ‘622’: mixed formula, uses all of the above
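Three of these measures are easy to sketch over sparse word-to-weight dictionaries. The IDF form log(N / DF_l) is the textbook variant and is an assumption here, since the slide does not spell out the exact formula; Okapi and the ‘622’ mix are left out because their details are not given.

```python
import math

def tf_idf(term_frequencies, descriptor_frequencies, total_descriptors):
    """TF.IDF weight per lemma: TF * log(N / DF), skipping unknown lemmas."""
    return {
        lemma: tf * math.log(total_descriptors / descriptor_frequencies[lemma])
        for lemma, tf in term_frequencies.items()
        if descriptor_frequencies.get(lemma, 0) > 0
    }

def scalar_product(u, v):
    """Sum of products of the weights of the lemmas shared by u and v."""
    return sum(weight * v[lemma] for lemma, weight in u.items() if lemma in v)

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if one is empty)."""
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return scalar_product(u, v) / (norm_u * norm_v)

# Made-up numbers: 'chernobyl' is an associate of 10 descriptors out of 100.
weights = tf_idf({"chernobyl": 3}, {"chernobyl": 10}, total_descriptors=100)
print(weights["chernobyl"])  # 3 * ln(10)
print(cosine({"a": 1.0}, {"a": 2.0}))  # 1.0: same direction
print(cosine({"a": 1.0}, {"b": 2.0}))  # 0.0: no shared lemma
```

The scalar product rewards long documents and long associate lists, whereas the cosine divides their lengths out; that difference is one reason to test several formulae side by side.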
Sample Assignment Result
Title: Legislative resolution embodying Parliament's opinion on the proposal for a Council Regulation amending Regulation N. 2847/93 establishing a control system applicable to the common fisheries policy (COM(95)0256 - C4-0272/95 - 95/0146) (Consultation procedure)
Results: …starting from a plain text…
[Diagram label: New text]
Resolution on human rights in Ethiopia
The European Parliament,
- having regard to its resolution of 18 July 1996 on human rights in Ethiopia (OJ C 261, 9.9.1996, p. 166),
A. whereas respect for human rights, democratic principles and the rule of law constitute essential elements of the revised Lomé IV Convention and whereas the Ethiopian constitution also includes respect for human rights,
B. having regard to the continuing process of democratic and institutional change in Ethiopia,
C. concerned by the repression of civil society associations which recently forced into exile leaders like Mr Kefale Mamo and Mr Mulugeta Lule (president and vice-president respectively of the Ethiopian Free Press Journalist Association), Mr Gemorav Kassa (General Secretary of the Ethiopian Teachers Association), and Mr Dawi Ibrahim (president of the Confederation of Ethiopian Trade Unions),
D. deeply concerned by the killing on 11 June 1997 of Assefa Maru, an executive board member of both the Ethiopian Teachers Association and of the Ethiopian Human Rights Council,
1. Condemns the killing of Assefa Maru;
2. Condemns all human rights violations committed by the government and military forces;
3. Calls on the Ethiopian authorities to guarantee the fundamental rights of all Ethiopian citizens and to put an end to politically motivated persecutions and to abuses such as extrajudicial disappearances, torture, detention, rapes and arrests, in accordance with the Ethiopian constitution;
4. Calls on the Ethiopian authorities fully to respect freedom of the press, independence of unions and the right of association of citizens;
5. Urges the Ethiopian Government to release all prisoners of conscience and to provide corrective procedures to the judiciary system whereby people can be charged and tried in a fair way;
6. Calls on the Council, the Commission and the Member States to monitor closely human rights in Ethiopia and use all means to improve the situation;
7. Instructs its President to forward this resolution to the Council, the Commission, the Government of Ethiopia, the Secretary-General of the United Nations and the UN High Commissioner for Human Rights.
Pre-processing and keyword extraction
[Diagram: New text → pre-processing]
[Diagram: assignment → Descriptor, Descriptor, Descriptor]
What next? • New languages • Experiments, using output of current assignment • SVM • Other categorization techniques • Filter output • Geographic descriptors
Automatic Eurovoc Indexing: vector model
Annex: Vector space representation
[Figure: a document plotted as a vector on the axes Word 1, Word 2, Word 3]
Vector space representation • Example: document in a three-dimensional space
[Figure: the document “EU-Poland culture 2000” on the axes Programme, Poland, Culture]
Vector space representation • A Eurovoc descriptor in the same three-dimensional space
[Figure: the descriptor “Cultural policy” on the axes Programme, Poland, Culture]
Vector space representation • Eurovoc descriptor and documents comparison
[Figure: a Eurovoc descriptor and a document on the axes Keyword 1, Keyword 2, Keyword 3]
Vector space representation • Eurovoc descriptor and documents comparison
[Figure: “EU-Poland culture 2000” and “Cultural policy” on the axes Programme, Poland, Culture]
Eurovoc assignment in vector space • A text is “compared” to each Eurovoc descriptor profile
[Figure: a document and two descriptors (Descriptor 1, Descriptor 2) on the axes Keyword 1, Keyword 2, Keyword 3]