
Identification of Composite Named Entities in a Spanish Textual Database

Sofía N. Galicia-Haro, Facultad de Ciencias - UNAM. Alexander F. Gelbukh and Igor A. Bolshakov, Lab. Lenguaje Natural CIC – IPN. México, D. F.



  1. Identification of Composite Named Entities in a Spanish Textual Database Sofía N. Galicia-Haro Facultad de Ciencias - UNAM Alexander F. Gelbukh and Igor A. Bolshakov Lab. Lenguaje Natural CIC – IPN México, D. F.

  2. Contents • Introduction • Named Entities in Textual Databases • NE Analysis • Recognition Method • Conclusions

  3. Contents • Introduction • Named Entities in Textual Databases • NE Analysis • Recognition Method • Conclusions

  4. Textual Databases Texts have been entered into computers and put on the Web • to save tons of paper • to allow people to have remote access • to provide much better access to texts in electronic format, etc. Searching through this huge amount of material for information is a time-consuming task

  5. Named Entities Named entities (NEs) mentioned in textual databases constitute an important part of their semantic content • In a collection of political electronic texts, almost 50% of the sentences contain at least one NE • This indicates the relevance of NE identification and its role in document indexing and retrieval

  6. Composite Named Entities • NEs with coordinated constituents: Luz y Fuerza del Centro • NEs with prepositional phrases: Ejército Zapatista de Liberación Nacional

  7. Contents • Introduction • Named Entities in Textual Databases • NE Analysis • Recognition Method • Conclusions

  8. NEs in Mexican Textual DB • Collections of political Mexican texts:

       Count                          Coll. 1    Coll. 2
       # Sentences                    442,719    208,298
       # Sentences w/named entities   243,165    100,602

     • NEs appear in about half of the sentences (roughly 55% in Coll. 1 and 48% in Coll. 2) • A selection of Collection 1 was taken for training

  9. Initial NE Recognition Step • Identification of linguistic characteristics. Example: prepositions • link two different NEs • are included in the NE • Identification of style characteristics. Example: specific words introduce convention names: coordinadora del programa Mundo Maya ‘Mundo Maya program’s coordinator’

  10. Contents • Introduction • Named Entities in Textual Databases • NE Analysis • Recognition Method • Conclusions

  11. Training File • A Perl program extracts “compounds”. Example: Los miembros del Ejercito Federal (1) lejos de aplicar la Ley sobre Armas de Fuego y Explosivos parecen (2) proteger a los participantes en el tiroteo. • Compounds contain no more than three non-capitalized words between capitalized words • Compounds are delimited on the left and on the right by a punctuation mark or a word
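The extraction rule on this slide can be sketched as follows. This is a minimal Python illustration, not the authors' Perl program: the function name `extract_compounds`, the punctuation set, and the treatment of every uppercase-initial word as "capitalized" are assumptions made for the example.

```python
import re

MAX_GAP = 3  # at most three non-capitalized words between capitalized words


def extract_compounds(sentence, max_gap=MAX_GAP):
    """Return word groups that start and end with a capitalized word.

    Hypothetical sketch of the rule on the slide, not the original program.
    """
    compounds = []
    # Punctuation always delimits a compound, so handle each chunk separately.
    for chunk in re.split(r"[.,;:!?()\"«»¿¡]+", sentence):
        current = []  # words of the compound being built
        pending = []  # non-capitalized words seen since the last capitalized word
        for word in chunk.split():
            if word[0].isupper():
                # A capitalized word absorbs the pending gap words.
                current.extend(pending)
                current.append(word)
                pending = []
            elif current and len(pending) < max_gap:
                pending.append(word)
            else:
                # Either no compound is open or the gap is too long:
                # close the open compound (if any) and start over.
                if current:
                    compounds.append(" ".join(current))
                current, pending = [], []
        if current:
            compounds.append(" ".join(current))
    return compounds


example = ("Los miembros del Ejercito Federal lejos de aplicar la Ley sobre "
           "Armas de Fuego y Explosivos parecen proteger a los participantes "
           "en el tiroteo.")
print(extract_compounds(example))
```

On the slide's example sentence this sketch yields two compounds. It does not treat sentence-initial capitalization specially, so the first compound begins with the sentence-initial word Los.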

  12. Sentences of Coll. 1 • From 243,165 sentences, 472,087 compounds were extracted • 500 randomly selected sentences were manually analyzed • Main result of the analysis: syntactic ambiguity is frequent

  13. Syntactic Ambiguity • Coordination of coordinated names: Comisión Federal de Electricidad y Luz y Fuerza del Centro; Margarita Diéguez y Armas y Carlos Virgilio • Prepositional phrase attachment (different names linked by prepositions): Comandancia General del Ejército Zapatista de Liberación Nacional

  14. Contents • Introduction • Named Entities in Textual Databases • NE Analysis • Recognition Method • Conclusions

  15. Knowledge Contributions • External lists • Linguistic knowledge • Heuristics • Statistics

  16. External Lists • Hand-made list of similes (625 items): paz y justicia ‘peace and justice’, Latinoamérica y el Caribe • Hand-made list of words • Lists from the Web • personal names (697 items) • main Mexican cities (910 items)

  17. Linguistic Knowledge Examples of linguistic restrictions • Lists of groups of capitalized words: Corea del Sur (1), Taiwan (2), Checoslovaquia (3) y Sudáfrica (4) • Preposition por followed by an indefinite article cannot be the link within a personal name: Cuauhtémoc Cárdenas (1) por la Alianza por la Ciudad de México (2)

  18. Heuristics and Statistics • Heuristic example: a first name can be part of only one name sequence among those coordinated. Ex.: Margarita Diéguez y Armas y Carlos Virgilio. Carlos belongs to the list of first names. Thus there are two name sequences here: Margarita Diéguez y Armas vs. Carlos Virgilio • Statistics from the training file: with a high score, Estados Unidos is a 2-word group. Thus Estados Unidos sobre México could be separated
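The first-name heuristic above might look like the following sketch. It assumes a coordinated compound is split at a conjunction y only when the next word is a known first name. The names `FIRST_NAMES` and `split_coordinated_names` are hypothetical, and the two-item set is a stand-in for a real list of first names.

```python
# Tiny stand-in for a list of first names (hypothetical sample data).
FIRST_NAMES = {"Margarita", "Carlos"}


def split_coordinated_names(compound, first_names=FIRST_NAMES):
    """Split a coordinated compound into separate name sequences.

    Assumption: a new name sequence starts at "y" only if the word
    after it is a known first name; any other "y" stays inside the name.
    """
    words = compound.split()
    sequences, current = [], []
    for i, word in enumerate(words):
        next_word = words[i + 1] if i + 1 < len(words) else None
        if word == "y" and current and next_word in first_names:
            sequences.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        sequences.append(" ".join(current))
    return sequences


print(split_coordinated_names("Margarita Diéguez y Armas y Carlos Virgilio"))
```

With this sample list, the first y is kept inside Margarita Diéguez y Armas because Armas is not a first name, while the second y separates Carlos Virgilio.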

  19. Application of the Method • Obtaining compounds with functional words • Using the resources above, the program decides whether to split, delimit, or leave each compound as is • Extracted: • coordinated groups • prepositional phrases • the rest of the groups of capitalized words
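One way statistics from the training file could feed the splitting decision is sketched below, using the Estados Unidos sobre México example. The scores, the threshold, the preposition set, and the function name `split_on_known_group` are all invented for illustration. The slides do not say how the score is computed or applied.

```python
# Hypothetical resources: invented scores for word groups seen in training.
GROUP_SCORES = {"Estados Unidos": 0.9}
PREPOSITIONS = {"sobre", "de", "del", "por", "en", "con"}
THRESHOLD = 0.5  # assumed cut-off for a "high score"


def split_on_known_group(compound, scores=GROUP_SCORES, threshold=THRESHOLD):
    """Split a compound at a preposition when the words before it form
    a group with a high score; otherwise leave the compound as is."""
    words = compound.split()
    for i, word in enumerate(words):
        left = " ".join(words[:i])
        has_right_part = i + 1 < len(words)
        if (word in PREPOSITIONS and has_right_part
                and scores.get(left, 0.0) >= threshold):
            return [left, " ".join(words[i + 1:])]
    return [compound]


print(split_on_known_group("Estados Unidos sobre México"))
print(split_on_known_group("Ejército Zapatista de Liberación Nacional"))
```

Under these assumptions the first compound is separated into two names, while the second is left whole because Ejército Zapatista has no high-scoring entry of its own.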

  20. Results - 1 • Obtained from 500 sentences of Coll. 2:

                   Coordinated groups   Prepositional phrase groups   Total
       Precision   54                   69                            89
       Recall      48                   67                            87
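For reference, precision and recall in their standard form can be computed as in this sketch. It assumes exact-match scoring against a manually annotated set, which the slides do not confirm, and the example entities are made up to exercise the function rather than taken from the evaluation.

```python
def precision_recall(found, gold):
    """Standard precision and recall over sets of entity strings.

    Assumption: an entity counts as correct only on an exact match.
    """
    found, gold = set(found), set(gold)
    correct = len(found & gold)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up illustration, not data from the evaluation.
found = {"Luz y Fuerza del Centro", "Estados Unidos", "Carlos Virgilio y"}
gold = {"Luz y Fuerza del Centro", "Estados Unidos", "Carlos Virgilio",
        "Comisión Federal de Electricidad"}
p, r = precision_recall(found, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```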

  21. Results - 2 • Total: 1496 NEs • 63 names with coordination • 167 prepositional groups • For comparison: Carreras, X., L. Màrquez and L. Padró. Named Entity Extraction using AdaBoost, CoNLL-2002 • 92% precision and 91% recall • However, the test file includes only one coordinated name • If an NE is embedded in another one, only the top-level entity was marked

  22. Conclusions • We present a method to identify and disambiguate groups of capitalized words • Our work is focused on composite named entities • Our method uses extremely short lists and a small POS-marked dictionary • The method uses heterogeneous knowledge to decide on splitting or joining groups of capitalized words

  23. Thanks! sofia@fciencias.unam.mx gelbukh@cic.ipn.mx igor@cic.ipn.mx
