Verbal Predications for Definition Extraction from Specialised Corpora. Gerardo Sierra, César Aguilar & Rodrigo Alarcón (gsierram, caguilar, ralarconm)@iingen.unam.mx. 4th International Conference Practical Applications in Language Corpora, Lodz, Poland, 4 – 6 April 2003.
Outline • Introduction • Background • Recurrent patterns in definitional contexts • Verbal paradigm evaluation • Conclusions
Introduction • Terminographical work: to identify terms and definitions in specialised texts • Main goal: to develop a conceptual information extraction system
Conceptual information extraction system • To identify recurrent patterns in textual fragments where a term is defined • To expand the paradigm of the recurrent patterns
Conceptual information extraction system • To evaluate those paradigms • To develop a computational linguistic technique to retrieve definitional contexts
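Definitional context • A textual fragment of a specialised text that contains the necessary information to define a term. • Example: En este estudio, la “erosión del bordo” se entiende como el desgaste de las superficies expuestas a la acción directa del agua y viento llegando a producir un adelgazamiento de tal magnitud que propicie o permita el paso del agua (por pérdida del bordo libre) o el desplome de una parte del bordo (por debilitamiento local). (In this study, “levee erosion” is understood as the wear of the surfaces exposed to the direct action of water and wind, to the point of producing a thinning of such magnitude that it causes or allows the passage of water (through loss of freeboard) or the collapse of part of the levee (through local weakening).) • Term: “erosión del bordo” • Definition: the text that follows “se entiende como”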
Definitional context • A textual fragment of a specialised text that contains the necessary information to define a term. • Example: En este estudio, la “erosión del bordo” se entiende como el desgaste de las superficies expuestas a la acción directa del agua y viento llegando a producir un adelgazamiento de tal magnitud que propicie o permita el paso del agua (por pérdida del bordo libre) o el desplome de una parte del bordo (por debilitamiento local). • Characteristic elements in the example: • “En este estudio” (in this study) • Quotation marks around the term • “se entiende como” (is understood as)
Background • Jennifer Pearson (1998): Terms in Contexts • Ingrid Meyer (1998): Knowledge-rich contexts • Carlos Rodríguez (1999): Operaciones Metalingüísticas Explícitas (explicit metalinguistic operations)
Corpus-based analysis • Engineering texts: • Logistics • Transport • Expert systems • Bioclimatic structures • Artificial intelligence
Elements of definitional contexts • Minimal elements: • Term (T) • Definition (D)
Elements of definitional contexts • Characteristic elements: • Typographical mark (tm) • Pragmatic predication (PP) • Verbal predication (VP)
Pattern classification • Typographical • Syntactic • Mixed
Typographical patterns • Text formatting is used to emphasise the term, the definition, or both • They exclude verbal predications • Verbal predications are replaced by punctuation marks
Typographical patterns • (examples table not preserved)
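A minimal sketch of one typographical pattern, assuming the punctuation mark that stands in for the verbal predication is a colon. The regular expression, the function name `split_on_colon` and the sample sentence are all hypothetical illustrations, not material from the slides.

```python
import re

# Assumed pattern: "<term>: <definition>", where the colon replaces the verb.
# The term is limited to 80 characters without a colon or full stop.
COLON_PATTERN = re.compile(r"^(?P<term>[^:.]{1,80}):\s+(?P<definition>\S.*)$")

def split_on_colon(line):
    """Return (term, definition) if the line fits the pattern, else None."""
    match = COLON_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.group("term").strip(), match.group("definition")

# Hypothetical sentence, written for this example only.
print(split_on_colon("Bordo libre: distancia vertical entre el nivel del agua y la corona."))
```

A pattern this loose would also fire on any line that happens to contain a colon, so it is best read as a candidate generator whose output still needs filtering.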
Syntactic patterns • They do not include typographical features • They are built from pragmatic and verbal predications • General schema: (PP/VP) + T/D + (PP/VP) + D/T + (PP)
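One instance of the schema above, (PP) + T + VP + D, can be sketched as a regular expression. The two small lexicons are assumptions assembled from phrases quoted in the slides; the real paradigms would come from the corpus analysis, and the names `PATTERN` and `parse` are hypothetical.

```python
import re

# Illustrative mini-lexicons (assumption: not the full paradigms).
PRAGMATIC = ["en términos generales", "generalmente", "en este estudio"]
VERBAL = ["se define como", "se denomina como", "se entiende como"]

def alternation(items):
    # Longest phrases first so that a long phrase is preferred over a prefix.
    ordered = sorted(items, key=len, reverse=True)
    return "|".join(re.escape(item) for item in ordered)

# Optional pragmatic predication, then term, verbal predication, definition.
PATTERN = re.compile(
    rf"^(?:(?P<pp>{alternation(PRAGMATIC)}),?\s+)?"
    rf"(?P<term>.+?)\s+"
    rf"(?P<vp>{alternation(VERBAL)})\s+"
    rf"(?P<definition>.+)$",
    re.IGNORECASE,
)

def parse(sentence):
    """Return the matched parts as a dict, or None if the schema does not fit."""
    match = PATTERN.match(sentence.strip())
    return match.groupdict() if match else None

parts = parse("En este estudio, la erosión del bordo se entiende como "
              "el desgaste de las superficies")
print(parts)
```

The other orderings allowed by the schema (definition before term, predication at the end) would each need their own expression.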
Pragmatic predications • Information about the usage or treatment of the term • Clues for understanding a term in the context in which it appears
Pragmatic predications • Adverbial phrases: • generalmente (generally) • Prepositional phrases: • en términos generales (in general terms) • Simple words: • concepto (concept)
Verbal predications • Verbs that connect a term with its definition • Commonly called metalinguistic verbs • definir (to define) • denominar (to denominate)
Verbal predications • Simple forms • verb + grammatical particle • Compound forms • pronoun se + verb + grammatical particle
Verbal predications • Simple forms • también llamado (also called) • consiste de (consists of) • Compound forms • se define como (it is defined as) • se denomina como (it is denominated as)
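The compound form (pronoun se + verb + grammatical particle) lends itself to a simple surface search. The sketch below is an assumption about how such a search might look; the particle list is illustrative and the name `compound_forms` is hypothetical.

```python
import re

# Illustrative particle list (assumption); "como" is the one seen in the slides.
PARTICLES = ["como", "de", "a"]

# Compound form: pronoun "se" + any single word (the verb) + a particle.
COMPOUND = re.compile(
    rf"\bse\s+(?P<verb>\w+)\s+(?P<particle>{'|'.join(PARTICLES)})\b",
    re.IGNORECASE,
)

def compound_forms(text):
    """Return (verb, particle) pairs for every compound form found."""
    return [(m.group("verb"), m.group("particle")) for m in COMPOUND.finditer(text)]

sample = "La erosión se define como el desgaste. También se denomina como socavación."
print(compound_forms(sample))
```

Because the verb slot accepts any word, the expression also matches sequences that are not definitional, which is the kind of noise the evaluation stage has to measure.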
Syntactic patterns • (examples table not preserved)
Mixed patterns • Typographical marks • Syntactic elements • Pragmatic predications • Verbal predications
Mixed patterns • (examples table not preserved)
Verbal paradigm evaluation • To expand the verbal paradigm obtained • What grammatical particles could appear with each verb?
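One way to explore which particles appear with each verb is to generate every verb and particle combination as a candidate search string and then check each one against a corpus. This is a sketch under that assumption; the particle list is hypothetical and `candidate_predications` is not a name from the slides.

```python
from itertools import product

# Third person singular forms of verbs quoted in the slides.
VERBS = ["define", "denomina", "entiende"]
# Hypothetical candidate particles to test against the corpus.
PARTICLES = ["como", "a", "por"]

def candidate_predications(verbs, particles):
    """Build 'se + verb + particle' strings for every combination."""
    return [f"se {verb} {particle}" for verb, particle in product(verbs, particles)]

candidates = candidate_predications(VERBS, PARTICLES)
print(candidates)
```

Most generated combinations will not occur at all, so the corpus query acts as the filter that decides which ones enter the expanded paradigm.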
Verbal predication examples • (examples table not preserved)
CREA • Corpus de Referencia del Español Actual (Reference corpus of today’s Spanish) • www.rae.es • Boolean operators • (* / AND / AND NOT / OR / Dist #) • Restrictive criteria • (Theme / Media / Geographical)
Results • DCI (Definitional Context Index) = definitional contexts / textual fragments retrieved
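The index is a simple ratio, so it can be computed per predication and used to rank them. The sketch below assumes counts are available as pairs; the counts and the labels "predication A" and "predication B" are hypothetical placeholders, not results from the study.

```python
def dci(definitional_contexts, fragments_retrieved):
    """DCI = definitional contexts / textual fragments retrieved."""
    if fragments_retrieved == 0:
        return 0.0  # assumption: a search that retrieves nothing scores zero
    return definitional_contexts / fragments_retrieved

def rank_by_dci(counts):
    """counts maps a predication to (definitional contexts, fragments retrieved)."""
    return sorted(counts, key=lambda name: dci(*counts[name]), reverse=True)

# Hypothetical counts for illustration only.
example_counts = {"predication A": (3, 10), "predication B": (8, 10)}
print(rank_by_dci(example_counts))
```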
Verbal predication evaluation • Automatic search of the expanded verbal paradigm • Precision & Recall
Automatic search • All possible structures of the verbs • Verb tenses (present, past, future and present perfect indicative) • Grammatical persons (1st and 3rd, singular and plural) • Without pragmatic predications
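The search space described above can be sketched by listing the inflected forms of one verb and turning each into a search string. The table below is hand-written for definir and rests on two assumptions that the slides do not settle: "past" is read as the simple preterite, and first person forms are searched without the pronoun se.

```python
# Hand-written forms of "definir" (assumption: "past" = simple preterite).
FORMS = {
    "present": {"1sg": "defino", "3sg": "define", "1pl": "definimos", "3pl": "definen"},
    "past": {"1sg": "definí", "3sg": "definió", "1pl": "definimos", "3pl": "definieron"},
    "future": {"1sg": "definiré", "3sg": "definirá", "1pl": "definiremos", "3pl": "definirán"},
    "present_perfect": {"1sg": "he definido", "3sg": "ha definido",
                        "1pl": "hemos definido", "3pl": "han definido"},
}

def search_strings(forms, particle="como"):
    """Build one search string per distinct inflected form."""
    strings = set()
    for persons in forms.values():
        for person, form in persons.items():
            # Assumption: only third person forms take the pronoun "se".
            prefix = "se " if person.startswith("3") else ""
            strings.add(f"{prefix}{form} {particle}")
    return sorted(strings)

queries = search_strings(FORMS)
print(len(queries), queries)
```

The set removes the one form shared by present and preterite (definimos), so sixteen table cells yield fifteen distinct search strings.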
Precision & Recall • Precision = definitional contexts automatically retrieved / textual fragments automatically retrieved • Recall = definitional contexts automatically retrieved / definitional contexts in the corpus
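Both measures can be computed from two sets of fragment identifiers: the fragments the automatic search retrieved and the fragments judged to be definitional contexts. This is a minimal sketch; the identifiers in the usage lines are hypothetical.

```python
def precision_recall(retrieved, definitional):
    """Return (precision, recall) for sets of fragment identifiers."""
    retrieved, definitional = set(retrieved), set(definitional)
    # Definitional contexts that the automatic search actually found.
    hits = retrieved & definitional
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(definitional) if definitional else 0.0
    return precision, recall

# Hypothetical identifiers: 4 fragments retrieved, 3 definitional contexts in the corpus.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5})
print(p, r)
```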
Precision & Recall • Verbal predications shown (values not preserved): • It is defined as • It is based on • It is denominated as • To visualise • It is considered as
Improving precision and recall • Recall • Expand the verbal paradigm (grammatical particles) • Precision • Consider other characteristic elements (typographical marks, pragmatic predications)
Corpus tagging • Some tags to consider: • Fonts • Size • Colour • Capital and small capital letters, etc. • Head elements (titles, subtitles, etc.) • Word spacing: “los d a ñ o s se definen como…” • Bullets, footnotes, quotes, superscripts, subscripts… • No need to tag punctuation (it can simply be recognised)
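The word-spacing case can be normalised before pattern matching. The sketch below is one possible approach, assuming a letter-spaced word is a run of at least three single letters separated by single spaces; the threshold and the name `collapse_letter_spacing` are hypothetical choices.

```python
import re

# A single letter (any alphabet, no digits or underscore).
LETTER = r"[^\W\d_]"
# Three or more single letters separated by single spaces, e.g. "d a ñ o s".
SPACED_WORD = re.compile(rf"\b(?:{LETTER} ){{2,}}{LETTER}\b")

def collapse_letter_spacing(text):
    """Join letter-spaced words so that later patterns see normal words."""
    return SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)

print(collapse_letter_spacing("los d a ñ o s se definen como"))
```

Runs of genuine one-letter words (a, y, o) could be merged by mistake, so a tagger would probably record the spacing as an emphasis tag rather than discard it.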
Corpus tagging • POS tags • Necessary to determine the internal structure of the phrases (noun, verb and adverbial) that constitute: • Terms • Definitions • Verbal and pragmatic predications
Corpus tagging • POS tags • Some attributes are not relevant: • Gender and number (noun phrases) • Verb tense inflection (present, future, past, imperfect, subjunctives, etc.) • Relevant attributes: • Grammatical person (verb phrases): conceptual information is typically introduced by the 3rd person • Whether or not a verb is an auxiliary
Corpus tagging • Parsing tags • Necessary to determine the syntactic relations among all kinds of phrases inside (and outside): • Terms • Definitions • Verbal and pragmatic predications
Corpus tagging • POS tags • Syntactic phrases • Terms may consist of both NP + PP (Cabré 2001) • Definitions are composed of at least one well-formed sentence (one or more syntactic phrases) • Pragmatic predications (related to style) • Prepositional phrase: “en términos generales” (in general terms) • Noun phrase: “la característica principal” (the main characteristic) • Adverbial phrase: “tradicionalmente” (traditionally) • Overlapping: prepositional phrases with adverbial function
Corpus tagging • POS tags • Verbal predications • Metalinguistic verbs: “se define como” (it is defined as) • Non-metalinguistic verbs: “se visualiza como” (it is visualised as) • Other structures • Verb phrases consisting of verb + noun, where the verb has been semantically eroded • “tiene la finalidad” (it has the aim)
Conclusions • Definitional context extraction system • Linguistic analysis of definitional contexts • Recurrent patterns
Conclusions • Expand and evaluate the verbal paradigm • Corpus-based study