Language Technology, JRC, European Commission
Multilingual Information Extraction
Bruno Pouliquen
European Commission – Joint Research Centre (JRC) – Italy
http://langtech.jrc.it/
http://press.jrc.it/NewsExplorer
UkVisit’2006, Slide 1
Joint Research Centre, IPSC, SES, langtech
• Joint Research Centre
  Mission: provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. (As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.)
• Institute for the Protection and Security of the Citizen
• Support for external Security (unit)
• Web and language technology (action)
UkVisit’2006, Slide 2
Agenda
• Motivation
  • EU: need for multilinguality
  • Enhanced Information Extraction by combining information from texts in various languages
  • Applications (analysing news, web mining, parliament texts…)
• Overview of our approach
• Description of the components, challenges
  • IE: Locations
  • IE: Person and Organisation names
  • Categorisation into subject domains (Eurovoc classes)
  • Clustering of news
  • Linking clusters over time and languages
  • Others: event extraction, quotations, medical-concept indexing etc.
• Future work and Conclusions
UkVisit’2006, Slide 3
Motivation for cross-lingual document linking
• EU: need for multilinguality
  • 23 official languages (since 1 January 2007: Irish, Romanian, Bulgarian)
  • 253 language pairs
  • + interest in Arabic, Farsi, Russian…
• Enhanced Information Extraction by combining information from texts in various languages
• Applications
  • NewsExplorer
    • Based on Europe Media Monitor (collecting news in 30 languages from 1000 news web sites)
    • Analysing news in 18 languages (ar, bg, da, de, en, es, et, fa, fr, it, nl, pl, pt, ro, ru, sl, sv, tr)
  • OSInt
    • Combines internet searches and entity extraction
  • Medisys (medical intelligence)
UkVisit’2006, Slide 4
NewsExplorer UkVisit’2006, Slide 5
OSInt UkVisit’2006, Slide 6
Medisys UkVisit’2006, Slide 7
Approach to cross-lingual document similarity (CLDS) in a nutshell
• Map documents onto multilingual thesauri, nomenclatures, gazetteers, … (general, medical, technical, geographical, …)
• Identify and normalise language-independent text features
  • Numbers; dates; currency expressions; measurement expressions; …
    le treize août 2007, am 13.8.2007, by 13/08/07 → DATE_2007_08_13
  • Names of persons, organisations, …
    Muqtada al-Sadr, Mouqtada Sadr, Mokdata Sadr, Muqtadā aş-Şadr, Муктада Садр → ID=236
• Calculate the various vector similarities and add them up
• For details, see Steinberger et al. (2004, Informatica 28).
UkVisit’2006, Slide 9
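The date normalisation step above can be illustrated with a minimal sketch. It only covers the three surface forms shown on the slide; the function name `normalise_date`, the tiny month and day-word tables, and the rule that two-digit years belong to the 2000s are assumptions made for this example, not the JRC implementation.

```python
import re

# Illustrative lookup tables (assumed): just enough for the slide's examples.
MONTHS = {"août": 8, "august": 8}
DAY_WORDS = {"treize": 13}

def normalise_date(text):
    """Map a date expression to DATE_YYYY_MM_DD, or return None if unknown."""
    text = text.lower().strip()
    # Numeric day-first forms such as 13.8.2007 or 13/08/07.
    m = re.fullmatch(r"(\d{1,2})[./](\d{1,2})[./](\d{4}|\d{2})", text)
    if m:
        day, month, year = (int(g) for g in m.groups())
        if year < 100:
            year += 2000  # assumption: two-digit years are in the 2000s
        return f"DATE_{year:04d}_{month:02d}_{day:02d}"
    # Spelled-out forms such as "le treize août 2007" or "13 août 2007".
    m = re.fullmatch(r"(?:le\s+)?(\w+)\s+(\w+)\s+(\d{4})", text)
    if m and m.group(2) in MONTHS:
        day_word = m.group(1)
        day = DAY_WORDS.get(day_word) or (int(day_word) if day_word.isdigit() else None)
        if day is not None:
            return f"DATE_{int(m.group(3)):04d}_{MONTHS[m.group(2)]:02d}_{day:02d}"
    return None
```

Because all three variants collapse to the same token, documents in different languages that mention the same date end up sharing a feature.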
Geo-coding multilingual text
• Place names cannot be recognised by looking for patterns in text (Gey 2000) → multilingual gazetteer needed
  ‘Санкт-Петербург’, ‘Saint Petersburg’, ‘Saint Pétersbourg’, ‘Leningrad’, ‘Petrograd’, …
• Place name recognition via the lookup of text words in the gazetteer
(Figure labels: Latitude / Longitude)
UkVisit’2006, Slide 10
Major challenges for geocoding (1)
1. Places homographic with common words → geo-stopwords
2. Places homographic with person names → location only if not part of a person name, e.g. ‘Kofi Annan’, ‘Annan’
UkVisit’2006, Slide 11
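A minimal sketch of gazetteer lookup combined with the two filters above. The gazetteer entries, the choice of ‘split’ as a geo-stopword and the function name `find_places` are hypothetical, and person-name positions are assumed to come from a separate name recogniser.

```python
# Toy gazetteer (assumed): each known variant maps to one canonical place.
GAZETTEER = {
    "saint petersburg": "Saint Petersburg",
    "saint pétersbourg": "Saint Petersburg",
    "санкт-петербург": "Saint Petersburg",
    "leningrad": "Saint Petersburg",
    "annan": "Annan",   # town in Scotland, homographic with a surname
    "split": "Split",   # city in Croatia, homographic with an English word
}
# Place names that are too often common words to be trusted (assumed list).
GEO_STOPWORDS = {"split"}

def find_places(tokens, person_token_idx=frozenset(), max_len=2):
    """Longest-match lookup of token n-grams, skipping stopwords and person names."""
    places = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):  # try the longest candidate first
            if i + n > len(tokens):
                continue
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in GEO_STOPWORDS:
                continue
            if any(j in person_token_idx for j in range(i, i + n)):
                continue  # token belongs to a recognised person name
            if candidate in GAZETTEER:
                places.append(GAZETTEER[candidate])
                i += n - 1  # jump over the tokens just consumed
                break
        i += 1
    return places
```

For the sentence "Kofi Annan visited Saint Petersburg", passing the token positions of the person name (0 and 1) suppresses the Scottish town while the city is still found.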
Major challenges for geocoding (2)
3. Completeness of the gazetteer: exonyms, endonyms, …
   ‘Санкт-Петербург’, ‘Saint Petersburg’, ‘Saint Pétersbourg’, ‘Leningrad’, ‘Petrograd’, …
4. Inflection
   • Romanian: Parisului (of Paris)
   • Estonian: Londonit (London), New Yorgile (New York)
   • Arabic: الباريسون [albaRiziu:n] (the Paris inhabitants)
   → Usage of context-sensitive suffix lists and replacement rules to generate all variants (Romanian example): Paris(ul|ului)? / Români(a|e)
UkVisit’2006, Slide 12
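The suffix-list idea can be sketched as a small pattern generator. The helper `variant_pattern` and its `optional` flag are assumptions for illustration; the slide only shows the resulting Romanian patterns.

```python
import re

def variant_pattern(base, suffixes, optional=True):
    """Compile a regex matching `base` followed by one of `suffixes`.

    With optional=True the bare base form also matches (Paris, Parisul, Parisului);
    with optional=False a suffix is required (stem Români + a|e).
    """
    # Longest suffix first so that 'ului' is preferred over 'ul'.
    alternatives = "|".join(re.escape(s) for s in sorted(suffixes, key=len, reverse=True))
    quantifier = "?" if optional else ""
    return re.compile(rf"\b{re.escape(base)}(?:{alternatives}){quantifier}\b")

PARIS_RO = variant_pattern("Paris", ["ul", "ului"])
ROMANIA_RO = variant_pattern("Români", ["a", "e"], optional=False)
```

The word boundaries keep the pattern from firing inside longer words such as "Parisian", which is one reason the suffix lists have to be language-specific.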
Major challenges for geocoding (3)
5. Homographic place names
   • Use size class information
   • Use country context (news source; unambiguous anchors)
   • Use kilometric distance, e.g. “from Warsaw to Brest”:
     Brest (France): 2000 km from Warsaw
     Brest (Belarus): 200 km from Warsaw
UkVisit’2006, Slide 13
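The kilometric-distance heuristic can be sketched with the haversine great-circle formula. The coordinates below are approximate values supplied for this example, and `closest_candidate` is a hypothetical helper; the slide does not say which distance formula is actually used.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Approximate coordinates (assumed for the example).
WARSAW = (52.23, 21.01)
BREST_CANDIDATES = {
    "Brest (France)": (48.39, -4.49),
    "Brest (Belarus)": (52.10, 23.70),
}

def closest_candidate(anchor, candidates):
    """Pick the homographic place that lies nearest to an unambiguous anchor."""
    return min(candidates, key=lambda name: haversine_km(*anchor, *candidates[name]))
```

With Warsaw as the unambiguous anchor, the Belarusian Brest is chosen because it is roughly ten times closer than the French one.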
Multilingual name recognition and variant merging
en: death of former Prime Minister Rafik Hariri, blamed by many opposition
es: asesinato del exprimer ministro Rafic al-Hariri, que la oposición atribuyó
fr: l'assassinat de l'ex-dirigeant Rafic Hariri et le départ du chef de la diplom
nl: na de moord op oud-premier Rafiq al-Hariri gingen gisteren bijna een
de: libanesischen Regierungschef Rafik Hariri vor einem Monat wichtige B
sl: danjega libanonskega premiera Rafika Haririja. Libanonska opozicija si
et: möödumisele ekspeaminister Rafik al-Hariri surma põhjustanud pommipl
ar: اغتيال رئيس الوزراء السابق رفيق الحريري بأياد يهودية وما حدث سابقا
ru: Бывший премьер-министр Ливана Рафик Харири, который
UkVisit’2006, Slide 15
Recognition of person names
• Lookup of known names from database
  • Currently over 600,000 names, but only about 50,000 are actively looked for (+ 135,000 variants)
  • Insertion of about 1000 new names per day
  • Pre-generate morphological variants (Slovene example): Tony(a|o|u|om|em|m|ju|jem|ja)?\s+Blair(a|o|u|om|em|m|ju|jem|ja)?
• Guessing names using empirically-derived lexical patterns
  • Trigger word(s) + Name Surname
    • President, Minister, Head of State, Sir, American
    • “death of”, “[0-9]+-year-old”, …
    • Known first names (John, Jean, Hans, Giovanni, Johan, …)
    • …
  • Combinations: “56-year-old former prime minister Kurmanbek Bakiyev”
UkVisit’2006, Slide 16
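A minimal sketch of name guessing with trigger words. The trigger list below is a small illustrative subset (the word "former" is added so that the combination example matches), the name pattern only handles ASCII capitalised words, and `guess_names` is a hypothetical function name.

```python
import re

# Illustrative trigger words (assumed subset); matched case-insensitively.
TRIGGERS = [r"\d+-year-old", "former", "prime minister", "president",
            "minister", "head of state", "sir", "american", "death of"]
TRIGGER_RE = "|".join(TRIGGERS)
# At least two capitalised words: Name Surname (ASCII only in this sketch).
NAME_RE = r"(?:[A-Z][a-z]+\s+)+[A-Z][a-z]+"
# One or more triggers, then the candidate name in the capture group.
PATTERN = re.compile(rf"\b(?:(?i:{TRIGGER_RE})\s+)+({NAME_RE})")

def guess_names(text):
    """Return capitalised word sequences that directly follow trigger words."""
    return PATTERN.findall(text)
```

Stacking several triggers in front of one name, as in the combination example, simply repeats the trigger group before the capitalised words.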
Person name recognition – dealing with inflection
• Recognition of person names, using regular expressions (Slovene example):
  • kandidat(a|u|om)?
  • legend(a|e|i|o)
  • milijarder(ja|ju|jem)?
  • predsednik(a|u|om|em)?
  • predsednic(a|e|i|o)
  • ministric(a|e|i|o)
  • sekretar(ja|ju|jom|jem)?
  • diktator(ja|ju|jem)?
  • playboy(a|u|om|em)?
  + uppercase words
• Example: … verskega voditelja Moktade al Sadra je z notranjim … = Muqtada al-Sadr (ID=236)
UkVisit’2006, Slide 17
Learning name variants
• For all new names found: apply approximate name matching
  • Slow method:
    • Based on sets of letter bigrams and letter trigrams
    • Merge two names if cosine similarity is > 70%
  • Fast method:
    • Pre-select merging candidates using a “normalised form”
• Collect variants automatically from Wikipedia
• Cross-script name matching (Cyrillic, Greek, Arabic + Farsi, Hindi)
UkVisit’2006, Slide 18
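The "slow method" can be sketched as cosine similarity over sets of letter bigrams and trigrams, with the 70% threshold from the slide. Lowercasing, keeping the space character inside n-grams and the function names are assumptions of this sketch.

```python
import math

def letter_ngrams(name, sizes=(2, 3)):
    """Set of letter bigrams and trigrams of a lowercased name."""
    s = name.lower()
    return {s[i:i + n] for n in sizes for i in range(len(s) - n + 1)}

def set_cosine(a, b):
    """Cosine similarity of two sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def same_person(name1, name2, threshold=0.7):
    """Merge decision: similarity strictly above the threshold."""
    return set_cosine(letter_ngrams(name1), letter_ngrams(name2)) > threshold
```

Comparing every new name against every known name this way is quadratic, which is presumably why the slide also lists a faster method that pre-selects candidates through a normalised form.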
Person name recognition – Result
• Frequency list of normalised person names found (numerical person ID)
• Together with the associated “trigger words”
UkVisit’2006, Slide 19
Eurovoc Thesaurus
http://europa.eu.int/celex/eurovoc
• Over 6000 classes
• Covering many different subject domains (wide coverage)
• Multilingual (over 20 languages, one-to-one translations)
• Developed by the European Parliament and others
• Hierarchically organised
UkVisit’2006, Slide 21
Eurovoc categorisation – Major challenges
• Eurovoc is a conceptual thesaurus → categorisation vs. term extraction
• Large number of classes (~ 6000)
• Very unevenly distributed
• Various text types (heterogeneous training set)
• Multi-label categorisation (both for training and assignment), e.g.
  • SPORT
  • PROTECTION OF MINORITIES
  • CONSTRUCTION AND TOWN PLANNING
  • RADIOACTIVE MATERIALS
UkVisit’2006, Slide 22
Eurovoc categorisation – Approach
• Profile-based, category ranking task
  • Training: identification of most significant words for each class
  • Assignment: combination of measures to calculate similarity between profiles and new document
• Empirical refinement of parameter settings
  • Training:
    • Stop words
    • Lemmatisation
    • Multi-word terms
    • Consider number of classes of each training document
    • Thresholds for training document length and number of training documents per class
    • Methods to determine significant words per document (log-likelihood vs. chi-square, etc.)
    • Choice of reference corpus
    • …
  • Assignment:
    • Selection and combination of similarity measures (cosine, okapi, …)
    • ...
• For details, see Pouliquen et al. (Eurolan 2003)
UkVisit’2006, Slide 23
Assignment Result (Example)
Title: Legislative resolution embodying Parliament's opinion on the proposal for a Council Regulation amending Regulation No 2847/93 establishing a control system applicable to the common fisheries policy (COM(95)0256 - C4-0272/95 - 95/0146(CNS)) (Consultation procedure)
UkVisit’2006, Slide 24
The ‘JRC-Acquis’ parallel corpus in 21 languages
• Freely available for research purposes on our web site: http://langtech.jrc.it/JRC-Acquis.html
• For details, see Steinberger et al. (2006, LREC)
• Average of 8.8 million words per language
• Pair-wise alignment for all 210 language pairs!
• Average of 7600 documents per language
• Most documents have been Eurovoc-classified manually
• New version coming soon (about twice the size)
• Useful for:
  • Training of multilingual subject domain classifiers.
  • Creation of multilingual lexical space (LSA, KCCA)
  • Training of automatic systems for Statistical Machine Translation.
  • Producing multilingual lexical or semantic resources such as dictionaries or ontologies.
  • Training and testing multilingual information extraction software.
  • Automatic translation consistency checking.
  • Testing and benchmarking alignment software (sentences, words, etc.), across a larger variety of language pairs.
  • All types of multilingual and cross-lingual research.
UkVisit’2006, Slide 25
Monolingual document representation
• Vector of keywords and their keyness using log-likelihood test (Dunning 1993)
Original cluster: “Michael Jackson Jury Reaches Verdicts”
Keyness   Keyword
109.24    jackson
 41.54    neverland
 37.93    santa
 32.61    molestation
 24.51    boy
 24.43    pop
 20.68    documentary
 18.79    accuser
 13.59    courthouse
 11.12    jury
 10.08    ranch
  9.60    california
  9.39    verdict
  7.56    testimony
  6.50    maria
  4.09    michael
  1.73    reached
  1.68    ap
  1.05    appeared
  0.53    child
  0.50    trial
  0.45    monday
  0.26    children
  0.09    family
UkVisit’2006, Slide 27
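A minimal sketch of a log-likelihood keyness score in the spirit of Dunning (1993), computed on a 2×2 contingency table (word vs. all other words, document vs. reference corpus). The function name and the counts in the usage note are made up for illustration; the slide does not give the exact variant or reference corpus used.

```python
import math

def log_likelihood(k_doc, n_doc, k_ref, n_ref):
    """G2 statistic for a word seen k_doc times in n_doc tokens of the document
    (or cluster) and k_ref times in n_ref tokens of a reference corpus."""
    observed = [[k_doc, n_doc - k_doc], [k_ref, n_ref - k_ref]]
    total = n_doc + n_ref
    row_sums = [n_doc, n_ref]
    col_sums = [k_doc + k_ref, total - k_doc - k_ref]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_sums[i] * col_sums[j] / total  # expected count under independence
            if o > 0:
                g2 += o * math.log(o / e)
    return 2.0 * g2
```

A word with the same relative frequency in both corpora scores about zero, for example `log_likelihood(10, 1000, 100, 10000)`, while a word that is strongly over-represented in the document scores high. In practice only over-represented words would be kept as keywords, since the statistic itself does not distinguish over-use from under-use.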
Calculation of a text’s Country Score
• Aim: show to what extent a text talks about a certain country
• Sum of references to a country, normalised using the log-likelihood test
• Add country score vector to keyword vector
Keyness    Keyword
109.2478   jackson
 41.5450   neverland
 37.9347   santa
 32.6105   molestation
 24.5193   boy
 24.4351   pop
 20.6824   documentary
 18.7973   accuser
 13.5945   courthouse
 11.1224   jury
 10.4184   *us*
 10.0838   ranch
  9.6021   california
  9.3905   verdict
  7.5620   testimony
  6.5014   maria
  4.0957   michael
  1.7368   reached
  1.6857   ap
  1.5610   *gb*
  1.5610   *il*
  1.5610   *br*
  1.0520   appeared
  0.5384   child
  0.5045   trial
  0.4502   monday
  0.2647   children
  0.0946   family
UkVisit’2006, Slide 28
Monolingual news clustering
• Input: vectors consisting of keywords and country score
• Similarity measure: cosine
• Method: bottom-up group average unsupervised clustering
• Build the binary hierarchical clustering tree (dendrogram)
• Retain only “big” nodes in the tree with a high cohesion (empirically refined minimum intra-node similarity: 45%)
• Use the title of the cluster’s medoid as the cluster title
• For details, see Pouliquen et al. (CoLing 2004)
(Figure: dendrogram with node similarities 10%, 20%, 40%, 45%, 76%, 87%, 97%)
UkVisit’2006, Slide 29
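A minimal sketch of bottom-up group-average clustering over sparse keyword vectors with cosine similarity. It simplifies the procedure above: instead of building the full dendrogram and then retaining cohesive nodes, it stops merging once the best group-average similarity falls below 45%. Averaging over cross-cluster pairs, the function names and the toy vectors in the usage note are assumptions.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def group_average(c1, c2, docs):
    """Mean pairwise similarity between the members of two clusters."""
    return sum(cosine(docs[i], docs[j]) for i in c1 for j in c2) / (len(c1) * len(c2))

def cluster(docs, min_similarity=0.45):
    """Agglomerative clustering; returns lists of document indices."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        (a, b), best = max(
            (((a, b), group_average(clusters[a], clusters[b], docs))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < min_similarity:
            break  # no remaining pair is cohesive enough
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters
```

For example, two vectors dominated by "jackson" and "verdict" merge into one cluster, while an unrelated vector about an election stays on its own.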
Cross-lingual cluster linking – combination of 4 ingredients
CLDS = α∙S1 + β∙S2 + γ∙S3 + δ∙S4
• Ranked list of Eurovoc classes (40%)
• Country score (30%)
• Names + frequency (20%)
• Monolingual cluster representation without country score (10%)
UkVisit’2006, Slide 31
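The weighted combination above is a plain linear sum. The sketch below uses the four percentages from the slide as weights; the dictionary keys and the function name `clds` are made up for this example, and each ingredient similarity is assumed to lie between 0 and 1.

```python
# Weights from the slide: Eurovoc classes, country score, names, monolingual keywords.
WEIGHTS = {"eurovoc": 0.4, "country": 0.3, "names": 0.2, "keywords": 0.1}

def clds(similarities, weights=WEIGHTS):
    """Weighted sum of the four ingredient similarities."""
    return sum(weights[k] * similarities[k] for k in weights)

# Hypothetical ingredient similarities for one pair of clusters.
example = {"eurovoc": 0.5, "country": 0.8, "names": 0.25, "keywords": 0.1}
score = clds(example)  # 0.4*0.5 + 0.3*0.8 + 0.2*0.25 + 0.1*0.1
```

Because the weights sum to 1, the combined score stays in the same 0 to 1 range as its ingredients, which makes a single similarity threshold meaningful.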
example 46% UkVisit’2006, Slide 32
Cross-lingual cluster linking – evaluation
• Evaluation results depending on similarity threshold
• Ingredients: 40/30/30 (names not yet considered)
• Evaluation for EN → FR and EN → IT (136 EN clusters)
* Recall at 15% similarity threshold = 100%
• For details, see Pouliquen et al. (CoLing 2004)
UkVisit’2006, Slide 33
Filter out bad links by exploiting all cross-lingual links
• Assumption: if EN is linked to FR, ES, IT, …, then FR should also be linked to ES, IT, ...
• If not: lower link likelihood
UkVisit’2006, Slide 34
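One way to turn this assumption into a number is to measure, for each link, how many of the other links of its source cluster are confirmed from the target side. The scoring rule, the cluster identifiers and the function name `link_support` are assumptions of this sketch; the slide does not specify how the link likelihood is lowered.

```python
from collections import defaultdict

def link_support(links):
    """For each link (a, b), the share of a's other linked clusters that are
    also linked to b. Low values flag suspicious links."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    support = {}
    for a, b in links:
        others = neighbours[a] - {b}
        if not others:
            support[(a, b)] = 1.0  # nothing to cross-check against
            continue
        confirmed = sum(1 for c in others if b in neighbours[c])
        support[(a, b)] = confirmed / len(others)
    return support

# Hypothetical links: en1, fr1, es1, it1 are mutually linked; de9 only to en1.
LINKS = [("en1", "fr1"), ("en1", "es1"), ("en1", "it1"),
         ("fr1", "es1"), ("fr1", "it1"), ("es1", "it1"),
         ("en1", "de9")]
SUPPORT = link_support(LINKS)
```

In this toy data the EN to DE link gets no confirmation from the FR, ES or IT clusters, so it would be the first candidate for a lowered likelihood.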
Agenda
• Other applications
  • Automatic detection of man-made violent events
  • Social networks and relation extraction
  • Quotation extraction
  • Medisys: extraction of medical concepts (MeSH thesaurus) from health news
UkVisit’2006, Slide 35
Automatic detection of man-made violent events
• NEXUS: News cluster Event eXtraction Using language Structures
• 3415 templates in total
• Processes one day's news clusters in about 30 minutes
• Detects about 10-15 security-related events per day
UkVisit’2006, Slide 36
NEXUS: Learning Templates by Bootstrapping - Example
• We learned core templates like [action killed] [person X] and others
• In some of the news clusters we match a text like “U.S. troops killed Al Zarkawi”
• Al Zarkawi is taken as an anchor entity, and we consider the contexts in which this name appears in the same news cluster as the matched text
• For example: “the body of Al Zarkawi was identified”, “the body of Al Zarkawi was found”, etc.
• All these contexts from different anchor entities are passed to a pattern learning algorithm to extract candidate patterns (e.g. the body of X)
UkVisit’2006, Slide 37
NEXUS: Event Structure UkVisit’2006, Slide 38
Knowledge base UkVisit’2006, Slide 39
Social networks
• Based on co-occurrences
• Result of machine learning approach
UkVisit’2006, Slide 40
Associated names: weighting formula
• Number of clusters in which they appear together
• Weighting depending on the total number of clusters they appear in
• Weighting depending on the total number of persons they are associated with
UkVisit’2006, Slide 41
Associated names example
Javier Solana (European Union's current Foreign Secretary)
(Figure labels: Without weighting; With weight; callouts: Solana’s spokesperson, Solana’s assistant)
UkVisit’2006, Slide 42
Social network visualisation: interactive UkVisit’2006, Slide 43
Social network visualisation, summarising relations around a person UkVisit’2006, Slide 44
Relation Extraction
• We parsed the news articles from a six-month span using a full dependency parser (MiniPar) and represented the corpus via a model called Syntactic Network (Tanev, Magnini, 2006)
• Beginning with a set of hand-coded syntactic patterns (e.g. “X <-subj- contact -obj-> Y”), we capture anchor entities which fill the variable slots
• In the next step, using these anchor entities, we learn new syntactic templates (e.g. “X <-subj- meet -obj-> Y”)
• These two steps repeat until no more templates can be learned
UkVisit’2006, Slide 45
Relation Extraction
• Using the bootstrapping algorithm described above and the Syntactic Network, we extracted instances of the “contact”, “support”, “family”, and “criticize” relations.
• In total, we captured about 9000 occurrences of these relations in the corpus
UkVisit’2006, Slide 46
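Quotations
e.g. -- Vi försökte uppmuntra samverkan, säger Urban Lundmark. (“We tried to encourage cooperation,” says Urban Lundmark.)
Annotated parts: quotation mark (--), verb (säger), person name (Urban Lundmark)
UkVisit’2006, Slide 47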
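The quotation pattern (quotation mark, quoted text, reporting verb, person name) can be sketched as a single regular expression. This version only handles the dash-introduced style of the Swedish example; the verb list, the name pattern and the function name `extract_quote` are assumptions, not the rules of the actual system.

```python
import re

# Illustrative reporting verbs (assumed); a real system would use per-language lists.
REPORTING_VERBS = ["säger", "sade", "says", "said"]

QUOTE_RE = re.compile(
    r"--\s*(?P<quote>.+?),\s+"                        # quotation mark and quoted text
    rf"(?P<verb>{'|'.join(REPORTING_VERBS)})\s+"       # reporting verb
    r"(?P<name>[A-ZÅÄÖ]\w+(?:\s+[A-ZÅÄÖ]\w+)+)"        # capitalised Name Surname
)

def extract_quote(sentence):
    """Return {'quote', 'verb', 'name'} for a dash-introduced quotation, else None."""
    match = QUOTE_RE.search(sentence)
    return match.groupdict() if match else None
```

Linking the extracted name to a known person ID would then attach the quotation to the right entity across spelling variants.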
Medisys UkVisit’2006, Slide 48
Medisys: extraction of medical concepts (MeSH thesaurus) from health news
• MeSH extractor (en, es, de, fr, it, pt)
• Geocoding (allows for region-level analysis)
• Highlight medical terms in text
• NewsExplorer-like medical application
UkVisit’2006, Slide 49
Agenda • Future work and Conclusions UkVisit’2006, Slide 50