Identification of Composite Named Entities in a Spanish Textual Database

Sofía N. Galicia-Haro, Facultad de Ciencias, UNAM • Alexander F. Gelbukh and Igor A. Bolshakov, Lab. Lenguaje Natural, CIC-IPN • México, D. F.

Contents • Introduction • Named Entities in Textual Databases • NE Analysis • Recognition Method • Conclusions

Textual Databases Texts have been entered into computers and onto the Web • to save tons of paper • to allow people to have remote access • to provide much better access to texts in electronic format, etc. Searching through this huge body of material for information is a time-consuming task

Named Entities The NEs mentioned in textual databases constitute an important part of their semantic content • A collection of electronic political texts shows that almost 50% of all sentences contain at least one NE • This indicates the relevance of NE identification and its role in document indexing and retrieval

Composite Named Entities • NE with coordinated constituents Luz y Fuerza del Centro • NE with prepositional phrases Ejército Zapatista de Liberación Nacional

NEs in Mexican Textual DB Collections of Mexican political texts

                                Coll. 1    Coll. 2
# Sentences                     442,719    208,298
# Sentences w/named entities    243,165    100,602

• NEs appear in about 50% of the sentences • A selection of Collection 1 was taken for training
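The "about 50%" figure can be checked directly from the sentence counts on this slide. A minimal sketch; the function and variable names are illustrative only:

```python
# Sentence counts as given on the slide: (total sentences, sentences with NEs).
COLLECTIONS = {
    "Coll. 1": (442_719, 243_165),
    "Coll. 2": (208_298, 100_602),
}


def ne_sentence_share(total, with_ne):
    """Percentage of sentences that contain at least one named entity."""
    return 100.0 * with_ne / total


# Rounded to one decimal place for display.
shares = {
    name: round(ne_sentence_share(total, with_ne), 1)
    for name, (total, with_ne) in COLLECTIONS.items()
}

for name, share in shares.items():
    print(f"{name}: {share}% of sentences contain an NE")
```

Collection 1 comes out slightly above one half and Collection 2 slightly below it.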

Initial NE Recognition Step • Identification of linguistic characteristics Example: prepositions either link two different NEs or are included in the NE • Identification of style characteristics Example: specific words introduce convention names, as in coordinadora del programa Mundo Maya ‘Mundo Maya program’s coordinator’

Training File • A Perl program extracts “compounds” Example: Los miembros del Ejercito Federal (1) lejos de aplicar la Ley sobre Armas de Fuego y Explosivos parecen (2) proteger a los participantes en el tiroteo. • Compounds contain no more than three non-capitalized words between capitalized words • Compounds are delimited on the left and on the right by a punctuation mark or a word
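The extraction rule can be sketched in a few lines. This is not the authors' Perl program, only a minimal Python illustration of the "at most three non-capitalized words between capitalized words" rule. It is a naive reading: it gives no special treatment to sentence-initial capitalization (so "Los" opens a compound here) and does not add the delimiting context words. The names `extract_compounds` and `MAX_GAP` are hypothetical.

```python
import re

MAX_GAP = 3  # at most three non-capitalized words between capitalized words


def extract_compounds(sentence, max_gap=MAX_GAP):
    """Return word groups that start and end with a capitalized word."""
    # Tokens are runs of word characters or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    compounds = []
    current = []  # compound under construction; always ends in a capitalized word
    gap = []      # lowercase words seen since the last capitalized word
    for tok in tokens:
        if tok[0].isupper():
            # A capitalized word absorbs the pending lowercase words.
            current.extend(gap)
            current.append(tok)
            gap = []
        elif tok[0].isalpha() and current and len(gap) < max_gap:
            gap.append(tok)
        else:
            # Punctuation, a number, or a fourth lowercase word closes the compound.
            if current:
                compounds.append(" ".join(current))
            current, gap = [], []
    if current:
        compounds.append(" ".join(current))
    return compounds


sentence = ("Los miembros del Ejercito Federal lejos de aplicar la Ley sobre "
            "Armas de Fuego y Explosivos parecen proteger a los participantes "
            "en el tiroteo.")
for compound in extract_compounds(sentence):
    print(compound)
```

On the slide's sentence, the four lowercase words "lejos de aplicar la" exceed the limit, so this sketch yields two compounds rather than one.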

Sentences of Coll. 1 • From 243,165 sentences, 472,087 compounds were extracted • 500 randomly selected sentences were manually analyzed • Main result of the analysis: syntactic ambiguity is frequent

Syntactic Ambiguity • Coordination of coordinated names Comisión Federal de Electricidad y Luz y Fuerza del Centro Margarita Diéguez y Armas y Carlos Virgilio • Prepositional phrase attachment Different names linked by prepositions Comandancia General del Ejército Zapatista de Liberación Nacional

Knowledge Contributions • External lists • Linguistic knowledge • Heuristics • Statistics

External Lists • Hand-made list of similes (625 items) paz y justicia ‘peace and justice’ Latinoamérica y el Caribe • Hand-made list of words • Lists from the Web • personal names (697 items) • main Mexican cities (910 items)

Linguistic Knowledge Examples of linguistic restrictions • Lists of groups of capitalized words Corea del Sur (1), Taiwan (2), Checoslovaquia (3) y Sudáfrica (4) • The preposition por followed by an indefinite article cannot be the link within a personal name Cuauhtémoc Cárdenas (1) por la Alianza por la Ciudad de México (2)
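The por restriction can be illustrated with a small sketch. Everything here is an assumption made for illustration: `FIRST_NAMES` stands in for the list of personal names, `ARTICLES` is a hypothetical article set chosen so that the slide's example (which uses la) is covered, and dropping the article from the second group is a choice of this sketch, not something stated on the slide.

```python
# Hypothetical stand-ins for the external resources.
FIRST_NAMES = {"Cuauhtémoc", "Margarita", "Carlos"}
ARTICLES = {"el", "la", "los", "las", "un", "una", "unos", "unas"}


def split_personal_name(compound):
    """Split a compound at 'por' + article when it starts with a personal name.

    If the compound opens with a known first name, 'por' followed by an
    article is taken as a boundary between two entities, not as a link
    inside the personal name.
    """
    words = compound.split()
    if not words or words[0] not in FIRST_NAMES:
        return [compound]
    # Stop early enough that at least one word follows the article.
    for i in range(1, len(words) - 2):
        if words[i].lower() == "por" and words[i + 1].lower() in ARTICLES:
            return [" ".join(words[:i]), " ".join(words[i + 2:])]
    return [compound]


parts = split_personal_name(
    "Cuauhtémoc Cárdenas por la Alianza por la Ciudad de México"
)
print(parts)
```

Only the first por is treated as a boundary. The second one stays inside the organization name, matching the two groups marked (1) and (2) on the slide.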

Heuristics and Statistics • Heuristic example: a first name can be part of only one name sequence among those coordinated. Ex.: Margarita Diéguez y Armas y Carlos Virgilio. Carlos belongs to the list of first names. Thus there are two name sequences here: Margarita Diéguez y Armas vs. Carlos Virgilio • Statistics from the training file: with a high score, Estados Unidos is a 2-word group. Thus Estados Unidos sobre México could be separated
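The first-name heuristic can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the coordinating conjunction is y and that a short set of first names (hypothetical here) is available. A y is treated as a boundary only when the next word is a known first name.

```python
def split_coordinated_names(compound, first_names):
    """Split a coordinated compound before each known first name that follows 'y'."""
    words = compound.split()
    groups, current = [], []
    for i, word in enumerate(words):
        next_word = words[i + 1] if i + 1 < len(words) else None
        if word == "y" and current and next_word in first_names:
            # The conjunction joins two separate names: close the current one.
            groups.append(" ".join(current))
            current = []
        else:
            # Otherwise 'y' is kept as part of the name (e.g. "Diéguez y Armas").
            current.append(word)
    if current:
        groups.append(" ".join(current))
    return groups


first_names = {"Margarita", "Carlos"}  # hypothetical excerpt of the name list
names = split_coordinated_names(
    "Margarita Diéguez y Armas y Carlos Virgilio", first_names
)
print(names)
```

The first y is followed by Armas, which is not a first name, so it stays inside the name. The second y precedes Carlos and therefore separates the two name sequences.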

Application of the Method • Obtaining compounds with functional words • Using the previous resources, the program decides whether to split, delimit, or leave each compound as it is • The program extracts: coordinated groups • prepositional phrases • the remaining groups of capitalized words

Results - 1 (obtained from 500 sentences of Coll. 2)

             Coordinated groups    Prepositional phrase groups    Total
Precision    54                    69                             89
Recall       48                    67                             87
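For reference, precision and recall are conventionally computed as in the sketch below. The slides do not say how a match was counted, so exact string match against a manually marked set is an assumption, and the toy data is invented purely for illustration.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted entities against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # exact-match hits (an assumption)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for illustration only.
gold = {
    "Luz y Fuerza del Centro",
    "Comisión Federal de Electricidad",
    "Estados Unidos",
    "México",
}
predicted = {
    "Luz y Fuerza del Centro",
    "Estados Unidos",
    "Comisión Federal de Electricidad y Luz",  # wrongly delimited group
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

A wrongly delimited group lowers precision, while each gold entity that is missed lowers recall.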

Results - 2 • Total: 1496 NEs • 63 names with coordination • 167 prepositional groups • For comparison: Carreras, X., L. Màrquez and L. Padró. Named Entity Extraction using AdaBoost, CoNLL-2002 • 92% precision and 91% recall • However, the test file includes only one coordinated name • If an NE is embedded in another one, only the top-level entity was marked

Conclusions • We present a method to identify and disambiguate groups of capitalized words • Our work is focused on composite named entities • Our method uses extremely short lists and a small POS-marked dictionary • The method uses heterogeneous knowledge to decide on splitting or joining groups of capitalized words

Thanks! sofia@fciencias.unam.mx gelbukh@cic.ipn.mx igor@cic.ipn.mx
