Named Entities in Czech Texts and Their Processing

Named Entities in Czech Texts and Their Processing. Magda Ševčíková, Zdeněk Žabokrtský, {sevcikova,zabokrtsky}@ufal.mff.cuni.cz, ÚFAL MFF UK. Outline of the talk: the term ‘named entities’; named entities in Czech; named entity classification; data annotation.


Presentation Transcript


  1. Named Entities in Czech Texts and Their Processing. Magda Ševčíková, Zdeněk Žabokrtský, {sevcikova,zabokrtsky}@ufal.mff.cuni.cz, ÚFAL MFF UK

  2. Outline of the talk
     • The term ‘named entities’
     • Named entities in Czech
     • Named entity classification
     • Data annotation
     • Quantitative characteristics of the data
     • Experiments in automatic named entity recognition
     • Future work

  3. The term ‘named entities’
     • English term ‘named entities’ (NE)
     • words and word sequences that do not have a common lexical meaning:
       • proper nouns
         • e.g., person names, names of institutions, products, towns
       • numeric expressions whose meaning is other than that of quantity
         • e.g., telephone number, page number
     • NE processing is of crucial importance for NLP
       • question answering, information extraction, machine translation
     • the NE task was ‘born’ at the MUC conference in 1995

  4. Named entities in Czech
     • ‘pojmenované entity’ – direct equivalent of ‘named entities’
     • up to now, the NE task has not been solved for Czech
     • now: within the project 1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů; in English: Integration of language resources for information extraction from natural-language texts)
     • some examples from Czech
       • jeho hlava (his head) vs. pan Hlava (Mr. Hlava), k jeho hlavě (to his head) vs. k panu Hlavovi (to Mr. Hlava)
       • 289 stran (289 pages) vs. na straně 289 (on page 289)

  5. Named entity classification
     • NE-type, NE-super-type, NE-container; special tags
     • 1st version for the 1st round of annotation (focused on proper nouns):
       • 42 NE-types: pf, ps,...
       • 7 NE-super-types: a, g, i, m, o, p, t
       • 4 NE-containers: A, C, P, T
     • 2nd version for the 2nd round of annotation (extended to numeric expressions):
       • 62 NE-types: pf, ps,... na, np,...
       • 10 NE-super-types: a, c, g, i, m, n, o, p, q, t
       • 4 NE-containers: A, C, P, T
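The relation between the tag levels on this slide can be sketched in a few lines of Python. This is only an illustration under a stated assumption: the slide's examples (pf, ps under p; na, np under n) suggest that the first letter of a two-character NE-type is its NE-super-type, but the slide does not say so explicitly. The names super_type and is_container are hypothetical helpers, not part of the authors' tooling.

```python
# Tag inventories as listed on the slide.
SUPER_TYPES_V1 = set("agimopt")       # 7 NE-super-types, 1st version
SUPER_TYPES_V2 = set("acgimnopqt")    # 10 NE-super-types, 2nd version
CONTAINERS = set("ACPT")              # 4 NE-containers, both versions


def super_type(ne_type, version=2):
    """Return the NE-super-type of a two-character NE-type.

    ASSUMPTION: the first character of the NE-type is its super-type
    (e.g. "pf" -> "p"); this is inferred from the slide's examples only.
    """
    allowed = SUPER_TYPES_V2 if version == 2 else SUPER_TYPES_V1
    if len(ne_type) != 2:
        raise ValueError(f"expected a two-character NE-type, got {ne_type!r}")
    candidate = ne_type[0]
    if candidate not in allowed:
        raise ValueError(f"{candidate!r} is not a super-type in version {version}")
    return candidate


def is_container(tag):
    """True if the tag is one of the four upper-case NE-container tags."""
    return tag in CONTAINERS
```

Under this assumption, super_type("na", version=1) raises an error, because numeric expressions were only added in the 2nd version of the classification.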

  6. Named entity classification: Types of person names

  7. Named entity classification: NE-containers

  8. Named entity classification: Special tags

  9. Data annotation
     • NE-type, NE-container; special tags; spam; NE-instance
     • 2 rounds of annotation
     • 1st round
       • 2,000 sentences from the SYN2000 corpus
       • randomly selected from 5,364,071 sentences found, query: ([word=“.*[a-z0-9]”] [word=“.*[A-Z].*”])
       • 2 parallel annotations, 3rd ‘unifying’ annotation
       • defective sentences eliminated, another 100 sentences annotated
       • -> 2,010 sentences = train and test data
     • 2nd round
       • 2,000 sentences from the SYN2005 corpus
       • randomly selected from 1,356,321 sentences found, query: [word=“.*[0-9].*”]
       • 1 annotation, not yet revised
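The two corpus queries can be approximated outside a corpus manager. The sketch below is a minimal Python rendering, assuming that sentences are already tokenized into lists of strings and that each quoted pattern must match a whole token; the function names and the seed parameter are hypothetical. The 1st-round query looks for a token ending in a lower-case letter or digit that is immediately followed by a token containing a capital letter; the 2nd-round query looks for any token containing a digit. The character classes are copied from the slide as written, so they cover ASCII letters only.

```python
import random
import re

# Patterns copied from the slide; fullmatch() is used so that each pattern
# must cover the whole token (an assumption about the query semantics).
ENDS_LOWER_OR_DIGIT = re.compile(r".*[a-z0-9]")
CONTAINS_CAPITAL = re.compile(r".*[A-Z].*")
CONTAINS_DIGIT = re.compile(r".*[0-9].*")


def matches_round1(tokens):
    """Two adjacent tokens: first ends in [a-z0-9], second contains [A-Z]."""
    return any(
        ENDS_LOWER_OR_DIGIT.fullmatch(first) and CONTAINS_CAPITAL.fullmatch(second)
        for first, second in zip(tokens, tokens[1:])
    )


def matches_round2(tokens):
    """At least one token contains a digit."""
    return any(CONTAINS_DIGIT.fullmatch(token) for token in tokens)


def sample_sentences(sentences, predicate, k, seed=0):
    """Randomly pick up to k sentences that satisfy the predicate."""
    candidates = [s for s in sentences if predicate(s)]
    return random.Random(seed).sample(candidates, min(k, len(candidates)))
```

With the examples from slide 4, ["k", "panu", "Hlavovi"] satisfies the 1st-round condition, while ["na", "straně", "289"] satisfies only the 2nd-round one.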

  10. Data annotation: Example of annotated text

  11. Quantitative characteristics of the data
     • 2,010 sentences
       • 51,921 tokens
       • 11,644 NE-instances
     • train:dtest:etest ~ 8:1:1
     • in the train data
       • 1,608 sentences
       • 41,710 tokens
       • 6,109 NE-instances
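An 8:1:1 split of this kind can be sketched as follows. The slide does not say how the sentences were assigned to the three parts, so the shuffling, the fixed seed and the function name split_8_1_1 are assumptions made for illustration. With 2,010 sentences, an 80% cut yields 1,608 training sentences, which matches the figure above; dividing the remaining 402 sentences evenly into 201 and 201 is likewise an assumption.

```python
import random


def split_8_1_1(sentences, seed=0):
    """Shuffle the sentences and cut them into train, dtest and etest parts.

    ASSUMPTION: a seeded random shuffle followed by an 80/10/10 cut; the
    original assignment procedure is not described on the slide.
    """
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10          # 80% for training
    n_dtest = (len(items) - n_train) // 2   # half of the rest for dtest
    train = items[:n_train]
    dtest = items[n_train:n_train + n_dtest]
    etest = items[n_train + n_dtest:]       # whatever remains goes to etest
    return train, dtest, etest
```

Keeping the seed fixed makes the split reproducible, so that results on dtest and etest stay comparable across experiments.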

  12. Quantitative characteristics of the data: Tags of all NE-instances in the train data

  13. Quantitative characteristics of the data: Tags of all NE-instances in the train data

  14. Experiments in automatic NE recognition

  15. Future work
