CAP: A Hierarchical Lexical Function

Amalia Todirascu, Linguistique, Langues, Paroles (LILPA), University of Strasbourg, todiras@unistra.fr. The project studies a specific lexical function, CAP, in several languages (French, English, German) in the domains of economy and politics.

Presentation Transcript


  1. CAP: A Hierarchical Lexical Function • Amalia Todirascu, Linguistique, Langues, Paroles (LILPA), University of Strasbourg, todiras@unistra.fr

  2. The Project • Goals • to study a specific lexical function, CAP, in several languages (French, English, German) and two domains (economy, politics) • to provide a complete linguistic description of this function • to extend a multilingual ontology, Prolexbase (Tran and Maurel, 2006)

  3. The Project (II) • collaboration with the CLARIN European project (http://www.clarin.eu) • WP3 Humanities overview • WP3.3 Call for collaboration with Humanities projects • Collaboration • access to existing corpora and tools • consultancy

  4. CAP: a Lexical Function • CAP lexical function (Mel'čuk 1984, 1988, 1992, 1999): hierarchical relations • Two persons • François Fillon est premier ministre de Nicolas Sarkozy • Sebek em war ein Oberpriester ca. 1780 v.Chr • Two organisations • Swiss Private Aviation AG, a fully-owned subsidiary of Swiss International Air Lines AG • Peugeot est une firme sochalienne • A person and an organisation or a country • SWISS Finanzchef Marcel Klaus • Traian Băsescu is the Romanian president

  5. Context • linguistics: noun classifications (Kleiber 1990, Kleiber 1999, Jonasson 1994) • lexical databases: WordNet (Miller, 1995), EuroWordNet (Vossen, 1998), BalkaNet (Tufis, 2004), FrameNet (Baker et al., 1998) • ontologies: Prolexbase (Tran and Maurel, 2006; Grass et al., 2004), SUMO (Niles and Pease, 2001) • several applications: information extraction, QA systems

  6. The Methodology • we identify existing monolingual and parallel corpora • DE, EN, FR • CLARIN language resource registry • tagged and raw corpora • annotation tools (both from the repository and on-line web services) • we create our own multilingual corpora

  7. The Methodology (II) • we apply several data extraction strategies • searching for synonyms of "chef/head of/Vorsitzender"; • searching for Named Entities related by the CAP relation (Martine Aubry – Parti Socialiste); • searching for annotated persons and organizations in aligned corpora • we analyse the contexts to classify the expressions and their arguments • we extend the Prolexbase ontology
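The first strategy (searching for trigger words) can be sketched as a simple keyword-in-context search. This is only an illustration, not the project's actual tooling, which relied on CQP/web interfaces and Unitex: the function name `kwic`, the `window` parameter and the sample sentence are assumptions made for this example; only the trigger words come from the slide.

```python
import re

# Trigger words quoted on the slide; a real list would be much longer.
TRIGGERS = ["chef", "head of", "Vorsitzender"]

def kwic(text, triggers, window=30):
    """Return (left context, match, right context) for each trigger hit.

    Hypothetical helper: matches whole words only, case-insensitively.
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in triggers) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        hits.append((left, m.group(1), right))
    return hits

# Invented sample sentence, for illustration only.
sample = "Marcel Klaus is head of finance at SWISS."
hits = kwic(sample, TRIGGERS)
for left, match, right in hits:
    print(f"{left}[{match}]{right}")
```

Because the search matches whole words, compounds such as "Finanzchef" would not be found by the trigger "chef"; handling them would need lemmatization or compound splitting.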

  8. Corpora (I) • Available public data • Web interfaces (CQP) • Various domains and genres • monolingual: • Wortschatz (http://corpora.informatik.uni-leipzig.de), IULA (http://bwananet.iula.upf.edu), COSMAS (http://www.ids-mannheim.de/cosmas2), BNC (http://www.natcorp.ox.ac.uk/) • multilingual: • Oslo (http://www.hf.uio.no), CLUVI (http://sli.uvigo.es/CLUVI), DGT-TM (http://langtech.jrc.it/DGT-TM.html)

  9. Corpora (II) • Corpora built for the project • monolingual: party chiefs (DE, EN), French president (FR) (200,000 tokens/language) • multilingual (parallel and comparable): • airplane companies (51,000-54,000 tokens) • European Parliament (127,000-134,000 tokens) • European Commission (175,000-195,000 tokens) • Domains: politics, economy

  10. Preprocessing the Corpora • Unitex tool (Paumier, 2000) • Resources available for the three languages • Tools: tokenizer, lemmatizer and tagger • CasSys (Friburger and Maurel, 2004) to annotate French Named Entities • WebLicht platform • NE annotations for German and English • sentence aligner: Alinea (Kraif, 2001)

  11. Data Extraction • three strategies for data extraction • we identify synonyms/hyponyms for English (WordNet, FrameNet) and their equivalents in French and German • chef, président, PDG, directeur général • chief executive officer, president, head of • Vorsitzender, Direktor • we search for pairs of entities which are related by a CAP relation • Barack Obama – United States of America • José Manuel Barroso – la Commission européenne • Marcel Klaus – SWISS • we use aligned corpora and the French NER tool CasSys (Friburger and Maurel, 2004) to obtain relevant contexts of Person or Organization
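The second strategy (searching for entity pairs related by CAP) might look roughly like the sketch below, which keeps every sentence mentioning both members of a known pair. The helper names, the naive sentence splitter and the sample text are assumptions made for this illustration; the entity pairs are the ones listed on the slide.

```python
import re

# Entity pairs from the slide.
PAIRS = [
    ("Barack Obama", "United States of America"),
    ("José Manuel Barroso", "Commission européenne"),
    ("Marcel Klaus", "SWISS"),
]

def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace.

    An assumption for this sketch; a real pipeline would use a proper
    sentence segmenter.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def contexts_for_pairs(text, pairs):
    """Map each (person, organisation) pair to the sentences naming both."""
    found = {pair: [] for pair in pairs}
    for sentence in split_sentences(text):
        for person, org in pairs:
            if person in sentence and org in sentence:
                found[(person, org)].append(sentence)
    return found

# Invented sample text, for illustration only.
sample = (
    "Barack Obama is the president of the United States of America. "
    "Marcel Klaus spoke yesterday. "
    "SWISS Finanzchef Marcel Klaus presented the results."
)
result = contexts_for_pairs(sample, PAIRS)
for pair, sentences in result.items():
    print(pair, len(sentences))
```

Exact string matching misses inflected or abbreviated names (for example "USA"), which is one reason NE-annotated corpora are preferable when they exist.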

  12. Data Extraction (II) • Problems • few contexts from existing corpora (30 to 50) • Various queries • CQP/web interface • raw texts • Various annotations • few tagged corpora • almost no NE-annotated corpora • heterogeneous tools to preprocess corpora

  13. 'Cap' Lexical Units • various lexical categories • nouns: positions (e.g. Finanzdirektor), professions (infirmière en chef), titles (Dr.), army ranks (General) • verbs: to lead, to organize, to command • A trilingual ontology • 95 lexical units (FR), 93 lexical units (EN), 67 lexical units (DE) • From existing lexical databases • From corpora

  14. Linguistic Analysis • argument types • persons, organizations, places • common nouns: anaphoric references to organisations or persons in charge, nationality adjectives • various linguistic expressions • Nouns: morpho-syntactic variations • Verbs • complex verbo-nominal predicates (sous la gouverne de, unter der Leitung von, under the direction of, become président, être élu …)

  15. Morpho-Syntactic Properties • Nouns • affixation: général, généralissime (FR) • composition: vice-roi (FR), vice-roy (EN), Vizekönig (DE) • modification • adjective (directeur général FR, Generaldirektor DE) • prepositional phrase (infirmière en chef FR, head nurse EN, Oberschwester DE) • noun being the possessor of another noun: du Conseil de Sécurité des Nations Unies, United Nations Security Council, des UN-Sicherheitsrates

  16. Conclusion and Further Work • study from the lexical semantics field: a hierarchy relation, CAP, in a multilingual perspective • various expressions and various argument types • data from monolingual and multilingual corpora • trilingual ontology (FR, DE, EN): extension of Prolexbase • Overall experience • querying various interfaces • heterogeneous annotation information • heterogeneous tools • combining linguists' and computational linguists' competences

  17. The Lexico-Syntactic Patterns • French patterns • <Organization> de <Organization> • Conseil d'Administration de SWISS • English patterns • <CAP function> of <Organisation>, <Person> • Chief executive officer of the company TAROM, M. Gheorghe Birla • German patterns • <Person> <sein> <tokens>* <CAP function> <Organisation> • Peter Siegenthaler ist seit Juli 2000 Direktor der Eidgenössischen Finanzverwaltung
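The German and English patterns above can be approximated with regular expressions over raw text, as in the sketch below. This is an illustration under strong assumptions, not the project's implementation: in the slides, <Person> and <Organisation> come from Named Entity annotation, whereas here a person is approximated as two or more capitalized tokens and the organisation as the remaining text. The word lists, pattern names and the helper `match_cap` are hypothetical.

```python
import re

# Small illustrative word lists; the project's lexicon has 67 (DE) and 93 (EN) units.
CAP_FUNCTIONS_DE = ["Direktor", "Vorsitzender"]
CAP_FUNCTIONS_EN = ["Chief executive officer", "president", "head"]
SEIN_FORMS = ["ist", "war"]  # forms of "sein" (assumed subset)

# <Person> <sein> <tokens>* <CAP function> <Organisation>
GERMAN_PATTERN = re.compile(
    r"(?P<person>[A-ZÄÖÜ][\w-]+(?: [A-ZÄÖÜ][\w-]+)+) "  # two or more capitalized tokens
    r"(?:" + "|".join(SEIN_FORMS) + r") "
    r"(?:\S+ )*?"                                        # optional intervening tokens
    r"(?P<function>" + "|".join(CAP_FUNCTIONS_DE) + r") "
    r"(?P<organisation>.+)"
)

# <CAP function> of <Organisation>, <Person>
ENGLISH_PATTERN = re.compile(
    r"(?P<function>" + "|".join(CAP_FUNCTIONS_EN) + r") of "
    r"(?P<organisation>[^,]+), "
    r"(?P<person>.+)"
)

def match_cap(sentence, pattern):
    """Return the named groups of the first match, or None if no match."""
    m = pattern.search(sentence)
    return m.groupdict() if m else None

# Example sentences taken from the slide.
de = match_cap(
    "Peter Siegenthaler ist seit Juli 2000 Direktor der Eidgenössischen Finanzverwaltung",
    GERMAN_PATTERN,
)
en = match_cap(
    "Chief executive officer of the company TAROM, M. Gheorghe Birla",
    ENGLISH_PATTERN,
)
print(de)
print(en)
```

Running the patterns over NE-annotated text instead of raw strings would remove the capitalization heuristic, which over-generates in German because all nouns are capitalized.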