Language Technology, JRC, European Commission Multilingual Information Extraction


Presentation Transcript


  1. Language Technology, JRC, European Commission: Multilingual Information Extraction
     Bruno Pouliquen, European Commission – Joint Research Centre (JRC) – Italy
     http://langtech.jrc.it/
     http://press.jrc.it/NewsExplorer
     UkVisit’2006, Slide 1

  2. Joint Research Centre, IPSC, SES, langtech
     • Joint Research Centre mission: provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. (As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.)
     • Institute for the Protection and Security of the Citizen
     • Support for external Security (unit)
     • Web and language technology (action)

  3. Agenda
     • Motivation
       • EU: need for multilinguality
       • Enhanced Information Extraction by combining information from texts in various languages
       • Applications (analysing news, web mining, parliament texts…)
     • Overview of our approach
     • Description of the components, challenges
       • IE: Locations
       • IE: Person and Organisation names
       • Categorisation into subject domains (Eurovoc classes)
       • Clustering of news
       • Linking clusters over time and languages
       • Others: event extraction, quotations, medical-concept indexation etc.
     • Future work and Conclusions

  4. Motivation for cross-lingual document linking
     • EU: need for multilinguality
       • 23 official languages (since 1 January 2007, with Irish, Romanian and Bulgarian)
       • 253 language pairs (23 × 22 / 2)
       • Plus interest in Arabic, Farsi, Russian…
     • Enhanced Information Extraction by combining information from texts in various languages
     • Applications
       • NewsExplorer
         • Based on Europe Media Monitor (collecting news in 30 languages from 1000 news web sites)
         • Analysing news in 18 languages (ar, bg, da, de, en, es, et, fa, fr, it, nl, pl, pt, ro, ru, sl, sv, tr)
       • OSInt
         • Combines internet searches and entity extraction
       • Medisys (medical intelligence)

  5. NewsExplorer

  6. OSInt

  7. Medisys

  9. Approach to CLDS (cross-lingual document similarity) in a nutshell
     • Map documents onto multilingual thesauri, nomenclatures, gazetteers, … (general, medical, technical, geographical, …)
     • Identify and normalise language-independent text features
       • Numbers; dates; currency expressions; measurement expressions; …
         le treize août 2007, am 13.8.2007, by 13/08/07 → DATE_2007_08_13
       • Names of persons, organisations, …
         Muqtada al-Sadr, Mouqtada Sadr, Mokdata Sadr, Muqtadā aş-Şadr, Муктада Садр → ID=236
     • Calculate the various vector similarities and add them up
     • For details, see Steinberger et al. (2004, Informatica 28).
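The date normalisation step above can be sketched as follows. This is a minimal illustration, not the JRC implementation: it only handles numeric day-first dates (not spelled-out forms such as "le treize août 2007"), and the function name and the rule that two-digit years mean 20xx are assumptions made for the example.

```python
import re

# Numeric day-first dates with '.', '/' or '-' separators, e.g. 13.8.2007 or 13/08/07.
DATE_RE = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4}|\d{2})\b")


def normalise_dates(text: str) -> str:
    """Replace numeric dates with a language-independent DATE_yyyy_mm_dd token."""

    def to_token(match: re.Match) -> str:
        day, month, year = (int(group) for group in match.groups())
        if year < 100:
            year += 2000  # assumption: two-digit years are read as 20xx
        return f"DATE_{year:04d}_{month:02d}_{day:02d}"

    return DATE_RE.sub(to_token, text)


print(normalise_dates("am 13.8.2007"))
print(normalise_dates("by 13/08/07"))
```

Both calls map to the same token, which is what lets documents in different languages share a feature.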

  10. Geo-coding multilingual text (latitude / longitude)
      • Place names cannot be recognised by looking for patterns in text (Gey 2000) → multilingual gazetteer needed
        ‘Санкт-Петербург’, ‘Saint Petersburg’, ‘Saint Pétersbourg’, ‘Leningrad’, ‘Petrograd’, …
      • Place name recognition via lookup of text words in the gazetteer

  11. Major challenges for geocoding (1)
      1. Places homographic with common words → geo-stopwords
      2. Places homographic with person names → location only if not part of a person name, e.g. ‘Kofi Annan’, ‘Annan’
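A gazetteer lookup with the two filters just listed might look like the sketch below. The gazetteer entries, the geo-stopword list and the known-person list are tiny hypothetical samples invented for illustration; coordinates are approximate.

```python
import re

# Hypothetical toy data: surface form -> (canonical place, latitude, longitude).
GAZETTEER = {
    "saint petersburg": ("Saint Petersburg", 59.94, 30.31),
    "leningrad": ("Saint Petersburg", 59.94, 30.31),
    "annan": ("Annan", 54.99, -3.26),
    "split": ("Split", 43.51, 16.44),
}
GEO_STOPWORDS = {"split"}        # place names that are also common words
KNOWN_PERSONS = {"kofi annan"}   # names that must not yield a location


def find_places(text):
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    # Mark token positions that belong to a known person name.
    in_person = set()
    for name in KNOWN_PERSONS:
        parts = name.split()
        for i in range(len(tokens) - len(parts) + 1):
            if tokens[i:i + len(parts)] == parts:
                in_person.update(range(i, i + len(parts)))
    places, i = [], 0
    while i < len(tokens):
        for n in (2, 1):  # try the longer candidate first
            candidate = " ".join(tokens[i:i + n])
            if (i + n <= len(tokens)
                    and candidate in GAZETTEER
                    and candidate not in GEO_STOPWORDS
                    and in_person.isdisjoint(range(i, i + n))):
                places.append(GAZETTEER[candidate][0])
                i += n
                break
        else:
            i += 1
    return places


print(find_places("Kofi Annan visited Saint Petersburg"))
```

Note that a bare ‘Annan’ is still accepted as a place here; handling it as a back-reference to the person would need document-level context.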

  12. Major challenges for geocoding (2)
      3. Completeness of the gazetteer: exonyms, endonyms, …
         ‘Санкт-Петербург’, ‘Saint Petersburg’, ‘Saint Pétersbourg’, ‘Leningrad’, ‘Petrograd’, …
      4. Inflection
         • Romanian: Parisului (of Paris)
         • Estonian: Londonit (London), New Yorgile (New York)
         • Arabic: الباريسون [albaRiziu:n] (the Paris inhabitants)
         → Use context-sensitive suffix lists and replacement rules to generate all variants (Romanian example): Paris(ul|ului)? / Români(a|e)
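The suffix-list idea can be illustrated by compiling a base form plus its suffixes into one regular expression, in the style of the Romanian example. The helper name is hypothetical and the suffix list is limited to the two suffixes shown on the slide.

```python
import re


def variant_pattern(base, suffixes):
    """Build a regex matching the base form with any one optional suffix."""
    # Longest suffix first so that 'ului' is preferred over 'ul'.
    alternatives = "|".join(
        re.escape(s) for s in sorted(suffixes, key=len, reverse=True)
    )
    return re.compile(rf"\b{re.escape(base)}(?:{alternatives})?\b")


paris = variant_pattern("Paris", ["ul", "ului"])
print(paris.findall("Primarul Parisului a vizitat Paris"))
print(paris.search("Parisian"))
```

The trailing word boundary keeps unrelated words that merely start with the place name from matching.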

  13. Major challenges for geocoding (3)
      5. Homographic place names
         → Use size class information
         → Use country context (news source; unambiguous anchors)
         → Use kilometric distance
         E.g. “from Warsaw to Brest”
         • Brest (France): 2000 km from Warsaw
         • Brest (Belarus): 200 km from Warsaw
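The kilometric-distance heuristic can be sketched with the haversine great-circle formula. The coordinates below are approximate values supplied for the example, and picking the candidate nearest to an unambiguous anchor is a simplification of whatever scoring the real system uses.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Approximate coordinates, for illustration only.
WARSAW = (52.23, 21.01)
CANDIDATES = {
    "Brest (France)": (48.39, -4.49),
    "Brest (Belarus)": (52.10, 23.70),
}


def closest_candidate(anchor, candidates):
    """Pick the homographic place nearest to an unambiguous anchor place."""
    return min(candidates, key=lambda name: haversine_km(anchor, candidates[name]))


print(closest_candidate(WARSAW, CANDIDATES))
```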

  15. Multilingual name recognition and variant merging
      en: death of former Prime Minister Rafik Hariri, blamed by many opposition
      es: asesinato del exprimer ministro Rafic al-Hariri, que la oposición atribuyó
      fr: l'assassinat de l'ex-dirigeant Rafic Hariri et le départ du chef de la diplom
      nl: na de moord op oud-premier Rafiq al-Hariri gingen gisteren bijna een
      de: libanesischen Regierungschef Rafik Hariri vor einem Monat wichtige B
      sl: danjega libanonskega premiera Rafika Haririja. Libanonska opozicija si
      et: möödumisele ekspeaminister Rafik al-Hariri surma põhjustanud pommipl
      ar: اغتيال رئيس الوزراء السابق رفيق الحريري بأياد يهودية وما حدث سابقا
      ru: Бывший премьер-министр Ливана Рафик Харири, который

  16. Recognition of person names
      • Lookup of known names from database
        • Currently over 600,000 names, of which only about 50,000 are actively looked up (+ 135,000 variants)
        • Insertion of about 1000 new names per day
        • Pre-generate morphological variants (Slovene example):
          Tony(a|o|u|om|em|m|ju|jem|ja)?\s+Blair(a|o|u|om|em|m|ju|jem|ja)?
      • Guessing names using empirically derived lexical patterns
        • Trigger word(s) + Name Surname
          • President, Minister, Head of State, Sir, American
          • “death of”, “[0-9]+-year-old”, …
        • Known first names (John, Jean, Hans, Giovanni, Johan, …)
        • …
        • Combinations: “56-year-old former prime minister Kurmanbek Bakiyev”
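The "trigger word(s) + Name Surname" pattern can be approximated with a single regular expression. The trigger list here is a small hypothetical sample, and requiring at least two capitalised words after the trigger is an assumption made to keep the sketch short.

```python
import re

# Hypothetical sample of trigger words; a real list would be much longer.
TRIGGERS = ["president", "prime minister", "minister", "head of state", "sir"]

_trigger_alt = "|".join(
    re.escape(t) for t in sorted(TRIGGERS, key=len, reverse=True)
)
# Triggers match case-insensitively; the name itself must be capitalised.
NAME_RE = re.compile(
    rf"\b(?i:{_trigger_alt})\s+([A-Z][\w'-]+(?:\s+[A-Z][\w'-]+)+)"
)


def guess_names(text):
    """Return capitalised word sequences that directly follow a trigger word."""
    return NAME_RE.findall(text)


print(guess_names("56-year-old former prime minister Kurmanbek Bakiyev resigned"))
```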

  17. Person name recognition – dealing with inflection
      • Recognition of person names using regular expressions (Slovene example): trigger word + uppercase words
        • kandidat(a|u|om)?
        • legend(a|e|i|o)
        • milijarder(ja|ju|jem)?
        • predsednik(a|u|om|em)?
        • predsednic(a|e|i|o)
        • ministric(a|e|i|o)
        • sekretar(ja|ju|jom|jem)?
        • diktator(ja|ju|jem)?
        • playboy(a|u|om|em)?
      • Example: “… verskega voditelja Moktade al Sadra je z notranjim …” = Muqtada al-Sadr (ID=236)

  18. Learning name variants
      • For all new names found: apply approximate name matching
      • Slow method:
        • Based on sets of letter bigrams and letter trigrams
        • Merge two names if cosine similarity is > 70%
      • Fast method:
        • Pre-select merging candidates using a “normalised form”
      • Collect variants automatically from Wikipedia
      • Cross-script name matching (Cyrillic, Greek, Arabic + Farsi, Hindi)
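The "slow method" can be sketched as cosine similarity over sets of letter bigrams and trigrams with the 70% merge threshold from the slide. Lower-casing the names and keeping the space as a character are assumptions of this example.

```python
from math import sqrt


def ngram_set(name, sizes=(2, 3)):
    """Set of letter bigrams and trigrams of a lower-cased name."""
    s = name.lower()
    return {s[i:i + n] for n in sizes for i in range(len(s) - n + 1)}


def cosine(a, b):
    """Cosine similarity of two sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def same_person(name1, name2, threshold=0.70):
    """Merge candidate test: similarity strictly above the threshold."""
    return cosine(ngram_set(name1), ngram_set(name2)) > threshold


print(cosine(ngram_set("Rafik Hariri"), ngram_set("Rafic Hariri")))
print(same_person("Rafik Hariri", "Tony Blair"))
```

Comparing every new name against every known name this way is quadratic, which is presumably why a fast pre-selection on a normalised form is used first.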

  19. Person name recognition – Result
      • Frequency list of normalised person names found (numerical person ID)
      • Together with the associated “trigger words”

  21. Eurovoc Thesaurus (http://europa.eu.int/celex/eurovoc)
      • Over 6000 classes
      • Covering many different subject domains (wide coverage)
      • Multilingual (over 20 languages, one-to-one translations)
      • Developed by the European Parliament and others
      • Hierarchically organised

  22. Eurovoc categorisation – Major challenges
      • Eurovoc is a conceptual thesaurus → categorisation vs. term extraction
      • Large number of classes (~ 6000)
      • Very unevenly distributed
      • Various text types (heterogeneous training set)
      • Multi-label categorisation (both for training and assignment)
      E.g.
      • SPORT
      • PROTECTION OF MINORITIES
      • CONSTRUCTION AND TOWN PLANNING
      • RADIOACTIVE MATERIALS

  23. Eurovoc categorisation – Approach
      • Profile-based, category ranking task
        • Training: identification of the most significant words for each class
        • Assignment: combination of measures to calculate similarity between profiles and a new document
      • Empirical refinement of parameter settings
        • Training:
          • Stop words
          • Lemmatisation
          • Multi-word terms
          • Consider number of classes of each training document
          • Thresholds for training document length and number of training documents per class
          • Methods to determine significant words per document (log-likelihood vs. chi-square, etc.)
          • Choice of reference corpus
          • …
        • Assignment:
          • Selection and combination of similarity measures (cosine, okapi, …)
          • …
      For details, see Pouliquen et al. (Eurolan 2003)

  24. Assignment Result (Example)
      Title: Legislative resolution embodying Parliament's opinion on the proposal for a Council Regulation amending Regulation No 2847/93 establishing a control system applicable to the common fisheries policy (COM(95)0256 - C4-0272/95 - 95/0146(CNS)) (Consultation procedure)

  25. The ‘JRC-Acquis’ parallel corpus in 21 languages
      • Freely available for research purposes on our web site: http://langtech.jrc.it/JRC-Acquis.html
      • For details, see Steinberger et al. (2006, LREC)
      • Average of 8.8 million words per language
      • Pair-wise alignment for all 210 language pairs (21 × 20 / 2)
      • Average of 7600 documents per language
      • Most documents have been Eurovoc-classified manually
      • New version coming soon (about twice the size)
      • Useful for:
        • Training of multilingual subject domain classifiers
        • Creation of multilingual lexical space (LSA, KCCA)
        • Training of automatic systems for Statistical Machine Translation
        • Producing multilingual lexical or semantic resources such as dictionaries or ontologies
        • Training and testing multilingual information extraction software
        • Automatic translation consistency checking
        • Testing and benchmarking alignment software (sentences, words, etc.) across a larger variety of language pairs
        • All types of multilingual and cross-lingual research

  27. Monolingual document representation
      • Vector of keywords and their keyness using log-likelihood test (Dunning 1993)
      Original cluster: “Michael Jackson Jury Reaches Verdicts”

      Keyness  Keyword
      109.24   jackson
       41.54   neverland
       37.93   santa
       32.61   molestation
       24.51   boy
       24.43   pop
       20.68   documentary
       18.79   accuser
       13.59   courthouse
       11.12   jury
       10.08   ranch
        9.60   california
        9.39   verdict
        7.56   testimony
        6.50   maria
        4.09   michael
        1.73   reached
        1.68   ap
        1.05   appeared
        0.53   child
        0.50   trial
        0.45   monday
        0.26   children
        0.09   family
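Keyness scores of this kind can be computed with a log-likelihood statistic. The sketch below uses a common two-term form that compares a word's frequency in the document with its frequency in a reference corpus; the slides do not say which exact variant or reference corpus was used, so both are assumptions, as are the function and parameter names.

```python
from math import log


def keyness(freq_doc, freq_ref, size_doc, size_ref):
    """Log-likelihood (G2) of a word: document vs. reference corpus.

    freq_doc / freq_ref: occurrences of the word in each collection.
    size_doc / size_ref: total number of tokens in each collection.
    """
    total = freq_doc + freq_ref
    expected_doc = size_doc * total / (size_doc + size_ref)
    expected_ref = size_ref * total / (size_doc + size_ref)
    g2 = 0.0
    if freq_doc:
        g2 += freq_doc * log(freq_doc / expected_doc)
    if freq_ref:
        g2 += freq_ref * log(freq_ref / expected_ref)
    return 2 * g2


# A word that is far more frequent in the document than in the reference
# corpus gets a high score; equal relative frequency gives zero.
print(keyness(20, 5, 1000, 100000))
print(keyness(10, 10, 1000, 1000))
```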

  28. Calculation of a text’s Country Score
      • Aim: show to what extent a text talks about a certain country
      • Sum of references to a country, normalised using the log-likelihood test
      • Add country score vector to keyword vector

      Keyness   Keyword
      109.2478  jackson
       41.5450  neverland
       37.9347  santa
       32.6105  molestation
       24.5193  boy
       24.4351  pop
       20.6824  documentary
       18.7973  accuser
       13.5945  courthouse
       11.1224  jury
       10.4184  *us*
       10.0838  ranch
        9.6021  california
        9.3905  verdict
        7.5620  testimony
        6.5014  maria
        4.0957  michael
        1.7368  reached
        1.6857  ap
        1.5610  *gb*
        1.5610  *il*
        1.5610  *br*
        1.0520  appeared
        0.5384  child
        0.5045  trial
        0.4502  monday
        0.2647  children
        0.0946  family

  29. Monolingual news clustering
      (Figure: dendrogram with node similarities 10%, 20%, 40%, 45%, 76%, 87%, 97%)
      • Input: vectors consisting of keywords and country score
      • Similarity measure: cosine
      • Method: bottom-up group average unsupervised clustering
        • Build the binary hierarchical clustering tree (dendrogram)
        • Retain only “big” nodes in the tree with a high cohesion (empirically refined minimum intra-node similarity: 45%)
        • Use the title of the cluster’s medoid as the cluster title
      • For details, see Pouliquen et al. (CoLing 2004)
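A bottom-up group-average clustering with cosine similarity can be sketched in plain Python as below. This is a simplification: instead of building the full dendrogram and then keeping only large, cohesive nodes, it stops merging once the best group-average similarity falls below the 45% threshold. The sparse-dict document representation and all function names are assumptions.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def group_average(c1, c2, docs):
    """Mean pairwise similarity between the members of two clusters."""
    sims = [cosine(docs[i], docs[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def cluster(docs, min_similarity=0.45):
    """Merge the two most similar clusters until none is similar enough."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        sim, ia, ib = max(
            (group_average(a, b, docs), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        if sim < min_similarity:
            break
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)
    return clusters


def medoid(members, docs):
    """Member with the highest total similarity to the rest of its cluster."""
    return max(members, key=lambda i: sum(cosine(docs[i], docs[j]) for j in members))


docs = [
    {"jackson": 3.0, "jury": 1.0},
    {"jackson": 2.0, "verdict": 1.0},
    {"election": 2.0, "vote": 1.0},
]
print(cluster(docs))
```

The naive pairwise search is cubic in the number of documents, so it is only meant to show the logic, not to scale to a day's worth of news.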

  31. Cross-lingual cluster linking – combination of 4 ingredients
      CLDS = α∙S1 + β∙S2 + γ∙S3 + δ∙S4
      • Ranked list of Eurovoc classes (40%)
      • Country score (30%)
      • Names + frequency (20%)
      • Monolingual cluster representation without country score (10%)
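The weighted combination can be written out directly. The sketch assumes that the four percentages are the weights α, β, γ, δ in the order listed and that each ingredient similarity is already scaled to [0, 1]; the dictionary keys are hypothetical names.

```python
# Assumption: the percentages on the slide are the weights, in listed order.
WEIGHTS = {
    "eurovoc": 0.40,   # ranked list of Eurovoc classes
    "country": 0.30,   # country score
    "names": 0.20,     # names + frequency
    "keywords": 0.10,  # monolingual cluster representation without country score
}


def clds(similarities, weights=WEIGHTS):
    """Weighted sum of the four ingredient similarities (each in [0, 1])."""
    return sum(weights[key] * similarities[key] for key in weights)


example = {"eurovoc": 0.6, "country": 0.5, "names": 0.25, "keywords": 0.1}
print(round(clds(example), 2))
```

Because the weights sum to 1, the combined score stays in the same [0, 1] range as its ingredients and can be compared against a single similarity threshold.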

  32. Example: 46%

  33. Cross-lingual cluster linking – evaluation
      • Evaluation results depending on similarity threshold
      • Ingredients: 40/30/30 (names not yet considered)
      • Evaluation for EN → FR and EN → IT (136 EN clusters)
      * Recall at 15% similarity threshold = 100%
      • For details, see Pouliquen et al. (CoLing 2004)

  34. Filter out bad links by exploiting all cross-lingual links
      • Assumption: if EN is linked to FR, ES, IT, …, then FR should also be linked to ES, IT, …
      • If not: lower link likelihood
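One way to turn this assumption into a number is to check, for a link between clusters a and b, how many of a's other linked clusters are also linked to b. The slides do not give the actual scoring, so the support measure, the cluster identifiers and the sample links below are all hypothetical.

```python
def link_support(links, a, b):
    """Fraction of a's other link targets that are also linked to b."""
    others = {x for pair in links if a in pair for x in pair} - {a, b}
    if not others:
        return 1.0  # nothing available to contradict the link
    confirmed = sum(1 for x in others if frozenset((x, b)) in links)
    return confirmed / len(others)


# Hypothetical clusters: EN1 is linked to FR1, ES1, IT1 and (wrongly) DE9.
LINKS = {
    frozenset(pair)
    for pair in [
        ("EN1", "FR1"), ("EN1", "ES1"), ("EN1", "IT1"),
        ("FR1", "ES1"), ("FR1", "IT1"),
        ("EN1", "DE9"),
    ]
}

print(link_support(LINKS, "EN1", "FR1"))
print(link_support(LINKS, "EN1", "DE9"))
```

A link with low support, such as EN1 to DE9 here, would then have its likelihood lowered.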

  35. Agenda
      • Other applications
        • Automatic detection of man-made violent events
        • Social networks and relation extraction
        • Quotation extraction
        • Medisys: extraction of medical concepts (MeSH thesaurus) from health news

  36. Automatic detection of man-made violent events
      • NEXUS: News cluster Event eXtraction Using language Structures
      • 3415 templates in total
      • Processes one day's news clusters in about 30 minutes
      • Detects about 10-15 security-related events per day

  37. NEXUS: Learning Templates by Bootstrapping – Example
      • We learned core templates like [action killed] [person X] and others
      • In some of the news clusters we match a text like “U.S. troops killed Al Zarkawi”
      • Al Zarkawi is taken as an anchor entity, and we consider the contexts in which this name appears within the same news cluster as the matched text
      • For example: “the body of Al Zarkawi was identified”, “the body of Al Zarkawi was found”, etc.
      • All these contexts from different anchor entities are passed to a pattern learning algorithm to extract candidate patterns (e.g. the body of X)
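The context-collection step can be illustrated with a toy function that replaces the anchor entity by X and counts the words to its left. This is not the NEXUS pattern learner: the fixed three-word left window, the whitespace tokenisation and the function name are assumptions made for the example.

```python
from collections import Counter


def candidate_patterns(anchor, sentences, window=3):
    """Count left contexts of an anchor entity, with the anchor replaced by X."""
    counts = Counter()
    for sentence in sentences:
        if anchor not in sentence:
            continue
        # Crude tokenisation: split on whitespace and strip edge punctuation.
        tokens = [t.strip(".,;:!?") for t in sentence.replace(anchor, "X").split()]
        if "X" not in tokens:
            continue
        i = tokens.index("X")
        counts[" ".join(tokens[max(0, i - window): i + 1])] += 1
    return counts


cluster_sentences = [
    "U.S. troops killed Al Zarkawi",
    "the body of Al Zarkawi was identified",
    "the body of Al Zarkawi was found",
]
print(candidate_patterns("Al Zarkawi", cluster_sentences).most_common(1))
```

Contexts that recur across many anchor entities, like "the body of X" here, are the ones a pattern learner would propose as new templates.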

  38. NEXUS: Event Structure

  39. Knowledge base

  40. Social networks
      • Based on co-occurrences
      • Result of machine learning approach

  41. Associated names: weighting formula
      • Number of clusters in which they appear together
      • Weighting depending on the total number of clusters they appear in
      • Weighting depending on the total number of persons they are associated with

  42. Associated names example: Javier Solana (European Union's current Foreign Secretary)
      (Figure: associated names without weighting vs. with weighting; callouts: Solana’s spokesperson, Solana’s assistant)

  43. Social network visualisation: interactive

  44. Social network visualisation, summarising relations around a person

  45. Relation Extraction
      • We parsed the news articles from a six-month span using a full dependency parser (MiniPar) and represented the corpus via a model called Syntactic Network (Tanev, Magnini, 2006)
      • Beginning with a set of hand-coded syntactic patterns (e.g. “X <-subj- contact -obj-> Y”), we capture anchor entities which fill the variable slots
      • In the next step, using these anchor entities, we learn new syntactic templates (e.g. “X <-subj- meet -obj-> Y”)
      • These two steps repeat until no more templates can be learned

  46. Relation Extraction
      • Using the bootstrapping algorithm described above and the Syntactic Network, we extracted instances of the “contact”, “support”, “family” and “criticize” relations
      • We captured in total about 9000 occurrences of these relations in the corpus

  47. Quotations
      E.g. “-- Vi försökte uppmuntra samverkan, säger Urban Lundmark.” (“We tried to encourage cooperation, says Urban Lundmark.”)
      Annotated parts: quotation mark (--), quotation, verb (säger), person name (Urban Lundmark)
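The pattern "quotation mark, quotation, verb, person name" can be sketched as a regular expression for the Swedish example. The reporting-verb list contains only the single verb shown on the slide, and the rule that a name is two or more capitalised words is an assumption; the real system presumably relies on its person-name recogniser instead.

```python
import re

# Assumption: a one-item list of reporting verbs, taken from the slide example.
REPORTING_VERBS = ["säger"]

QUOTE_RE = re.compile(
    r"--\s*(?P<quote>.+?),\s+"
    rf"(?P<verb>{'|'.join(map(re.escape, REPORTING_VERBS))})\s+"
    r"(?P<name>[A-ZÅÄÖ]\w+(?:\s+[A-ZÅÄÖ]\w+)+)"
)


def extract_quotation(sentence):
    """Return (speaker, quotation) or None if the pattern does not match."""
    match = QUOTE_RE.search(sentence)
    if match is None:
        return None
    return match.group("name"), match.group("quote")


print(extract_quotation("-- Vi försökte uppmuntra samverkan, säger Urban Lundmark."))
```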

  48. Medisys

  49. Medisys: extraction of medical concepts (MeSH thesaurus) from health news
      • MeSH extractor (en, es, de, fr, it, pt)
      • Geocoding (allows for region-level analysis)
      • Highlight medical terms in text
      • NewsExplorer-like medical application

  50. Agenda
      • Future work and Conclusions
