Amalia Todirascu Linguistique, Langues, Paroles (LILPA) University of Strasbourg todiras@unistra.fr CAP: A Hierarchical Lexical Function
The Project • Goals • to study a specific lexical function, CAP, in several languages (French, English, German) and domains (economy, politics) • to provide a complete linguistic description of this function • to extend a multilingual ontology, Prolexbase (Tran and Maurel, 2006)
The Project (II) • collaboration with CLARIN European project (http://www.clarin.eu) • WP3 Humanities overview • WP3.3 Call for collaboration with Humanities projects • Collaboration • access to existing corpora and tools • consultancy
CAP – a Lexical Function • CAP lexical function (Mel'čuk 1984, 1988, 1992, 1999) – hierarchical relations • Two persons: François Fillon est premier ministre de Nicolas Sarkozy; Sebek em war ein Oberpriester ca. 1780 v.Chr • Two organisations: Swiss Private Aviation AG, a fully-owned subsidiary of Swiss International Air Lines AG; Peugeot est une firme sochalienne • A person and an organisation or a country: SWISS Finanzchef Marcel Klaus; Traian Băsescu is the Romanian president
Context • linguistics: noun classifications (Kleiber 1990, Kleiber 1999, Jonasson 1994) • lexical databases: WordNet (Miller, 1995), EuroWordNet (Vossen, 1998), BalkaNet (Tufis, 2004), FrameNet (Baker et al., 1998) • ontologies: Prolexbase (Tran and Maurel, 2006; Grass et al., 2004), SUMO (Niles and Pease, 2001) • several applications: information extraction, QA systems
The Methodology • we identify existing monolingual and parallel corpora • DE, EN, FR • CLARIN language resource registry • tagged and raw corpora • annotation tools (both from the repository and on-line web services) • we create our own multilingual corpora
The Methodology (II) • we apply several data extraction strategies • searching synonyms of "chef/head of/Vorsitzender"; • searching Named Entities related by the CAP relation (Martine Aubry – Parti Socialiste); • searching annotated persons and organizations through aligned corpora • we analyse the contexts to classify the expressions and their arguments • we extend Prolexbase ontology
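The first strategy, searching for seed terms such as "chef/head of/Vorsitzender", amounts to a keyword-in-context query. The snippet below is only a minimal sketch of that idea over raw sentences: the function name `kwic`, the window size and the sample sentences are assumptions made for illustration, not the project's actual tooling (the slides mention CQP and web interfaces). It matches surface forms only, so inflected variants would need lemmatized input.

```python
import re

# Seed terms taken from the slide; everything else here is illustrative.
SEED_TERMS = ["chef", "head of", "Vorsitzender"]


def kwic(sentences, terms, window=40):
    """Return (left context, matched term, right context) for every hit."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for sentence in sentences:
        for m in pattern.finditer(sentence):
            left = sentence[max(0, m.start() - window):m.start()]
            right = sentence[m.end():m.end() + window]
            hits.append((left.strip(), m.group(1), right.strip()))
    return hits


# Invented sample sentences, one per language.
sample = [
    "She was appointed head of the finance department.",
    "Er ist Vorsitzender der Partei.",
    "Le chef du gouvernement a parlé.",
]
hits = kwic(sample, SEED_TERMS)
for left, term, right in hits:
    print(f"{left} [{term}] {right}")
```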
Corpora (I) • Available public data • Web interfaces (CQP) • Various domains and genres • monolingual: • Wortschatz (http://corpora.informatik.uni-leipzig.de), IULA (http://bwananet.iula.upf.edu), COSMAS (http://www.ids-mannheim.de/cosmas2), BNC (http://www.natcorp.ox.ac.uk/) • multilingual: • Oslo (http://www.hf.uio.no), CLUVI (http://sli.uvigo.es/CLUVI), DGT-TM (http://langtech.jrc.it/DGT-TM.html)
Corpora (II) • Corpora built for the project • monolingual: party chiefs (DE, EN), French president (FR) (200,000 tokens/language) • multilingual (parallel and comparable): airline companies (51,000-54,000 tokens), European Parliament (127,000-134,000 tokens), European Commission (175,000-195,000 tokens) • Domains: politics, economy
Preprocessing the Corpora • Unitex tool (Paumier, 2000) • resources available for the three languages • tools: tokenizer, lemmatizer and tagger • CasSys (Friburger and Maurel, 2004) to annotate French Named Entities • WebLicht platform: NE annotations for German and English • sentence aligner: Alinea (Kraif, 2001)
Data Extraction • three strategies for data extraction • we identify synonyms/hyponyms for English (WordNet, FrameNet) and their equivalents in French and German: chef, président, PDG, directeur général (FR); chief executive officer, president, head of (EN); Vorsitzender, Direktor (DE) • we search pairs of entities which are related by a CAP relation: Barack Obama – United States of America; José Manuel Barroso – la Commission européenne; Marcel Klaus – SWISS • we use aligned corpora and the French NER tool CasSys (Friburger and Maurel, 2004) to obtain relevant contexts of Person or Organization
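The second strategy, starting from known entity pairs and collecting the contexts in which both members occur, can be sketched as follows. This is a simplified illustration and not the project's implementation: the helper names `contexts_for_pairs` and `between` are hypothetical, matching is plain substring search, and the assumption that the CAP expression tends to sit between the two mentions is ours. The pairs come from the slide; the first sample string is the "SWISS Finanzchef Marcel Klaus" example from the deck, the other two are invented.

```python
# Entity pairs listed on the slide (person, organisation or country).
PAIRS = [
    ("Barack Obama", "United States of America"),
    ("José Manuel Barroso", "la Commission européenne"),
    ("Marcel Klaus", "SWISS"),
]


def contexts_for_pairs(sentences, pairs):
    """Map each pair to the sentences that mention both of its members."""
    found = {pair: [] for pair in pairs}
    for sentence in sentences:
        for person, org in pairs:
            if person in sentence and org in sentence:
                found[(person, org)].append(sentence)
    return found


def between(sentence, person, org):
    """Return the text between the two mentions (a candidate CAP expression)."""
    i, j = sentence.find(person), sentence.find(org)
    if i < j:
        return sentence[i + len(person):j].strip()
    return sentence[j + len(org):i].strip()


sample = [
    "SWISS Finanzchef Marcel Klaus",
    "Barack Obama is the president of the United States of America.",
    "The weather in Strasbourg was fine.",
]
found = contexts_for_pairs(sample, PAIRS)
for (person, org), sentences in found.items():
    for s in sentences:
        print(person, "|", org, "|", between(s, person, org))
```

The extracted middle strings ("Finanzchef", "is the president of the") are the raw material that would then be classified by hand, as the methodology slides describe.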
Data Extraction (II) • Problems • few contexts from existing corpora (30 to 50) • Various queries • CQP/web interface • raw texts • Various annotations • few tagged corpora • almost no NE-annotated corpora • heterogeneous tools to preprocess corpora
'CAP' Lexical Units • various lexical categories • nouns: positions (e.g. Finanzdirektor), professions (infirmière en chef), titles (Dr.), army ranks (General) • verbs: to lead, to organize, to command • A trilingual ontology • 95 lexical units (FR), 93 lexical units (EN), 67 lexical units (DE) • From existing lexical databases • From corpora
Linguistic Analysis • argument types • persons, organizations, places • common nouns: anaphoric references to organisations or persons in charge, nationality adjectives • various linguistic expressions • Nouns – morpho-syntactic variations • Verbs • complex verbo-nominal predicates (sous la gouverne de, unter der Leitung von, under the direction of, become president, être élu …)
Morpho-Syntactic Properties • Nouns • affixation: général, généralissime (FR) • composition: vice-roi (FR), viceroy (EN), Vizekönig (DE) • modification • adjective (directeur général FR, Generaldirektor DE) • prepositional phrase (infirmière en chef FR, head nurse EN, Oberschwester DE) • noun being the possessor of another noun: du Conseil de Sécurité des Nations Unies, United Nations Security Council, des UN-Sicherheitsrates
Conclusion and Further Work • study from the lexical semantics field: a hierarchy relation in a multilingual perspective – CAP • various expressions and various argument types • data from monolingual and multilingual corpora • trilingual ontology (FR, DE, EN) – extension of Prolexbase • Overall experience • querying various interfaces • heterogeneous annotation information • heterogeneous tools • combining linguists’ and computational linguists’ competences
The Lexico-Syntactic Patterns • French patterns: <Organization> de <Organization> (Conseil d'Administration de SWISS) • English patterns: <CAP function> of <Organisation>, <Person> (Chief executive officer of the company TAROM, M. Gheorghe Birla) • German patterns: <Person> <sein> <tokens>* <CAP function> <Organisation> (Peter Siegenthaler ist seit Juli 2000 Direktor der Eidgenössischen Finanzverwaltung)
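To make the German pattern concrete, here is a minimal sketch of how it might be applied to NE-annotated text with a regular expression. The inline `<Person>…</Person>` / `<Organisation>…</Organisation>` markup is a hypothetical format chosen for the example (real CasSys or WebLicht output will look different), the list of forms of "sein" is partial, and the CAP nouns are just the German ones quoted elsewhere in the slides. The optional article before the organisation is our addition to cover the "Direktor der …" example.

```python
import re

# Assumed, partial inventories; extend from the ontology in practice.
SEIN_FORMS = r"(?:ist|war|sind|waren)"
CAP_NOUNS = r"(?:Direktor|Vorsitzender|Finanzchef|Oberpriester)"

# <Person> <sein> <tokens>* <CAP function> <Organisation>
GERMAN_PATTERN = re.compile(
    r"<Person>(?P<person>[^<]+)</Person>\s+"
    + SEIN_FORMS
    + r"\s+(?:[^<\s]+\s+)*?"          # any intervening plain tokens
    + r"(?P<cap>" + CAP_NOUNS + r")\s+"
    + r"(?:(?:der|des|die|das)\s+)?"  # optional article (assumption)
    + r"<Organisation>(?P<org>[^<]+)</Organisation>"
)


def extract_cap(sentence):
    """Return (person, CAP noun, organisation) or None if the pattern fails."""
    m = GERMAN_PATTERN.search(sentence)
    if m is None:
        return None
    return m.group("person"), m.group("cap"), m.group("org")


annotated = (
    "<Person>Peter Siegenthaler</Person> ist seit Juli 2000 Direktor der "
    "<Organisation>Eidgenössischen Finanzverwaltung</Organisation>"
)
result = extract_cap(annotated)
print(result)
```

The French and English patterns could be encoded the same way, with one compiled expression per language.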