Named Entities in Czech Texts and Their Processing
Magda Ševčíková, Zdeněk Žabokrtský
{sevcikova,zabokrtsky}@ufal.mff.cuni.cz
ÚFAL MFF UK
Outline of the talk
• The term ‘named entities’
• Named entities in Czech
• Named entity classification
• Data annotation
• Quantitative characteristics of the data
• Experiments in automatic named entity recognition
• Future work
The term ‘named entities’
• English term ‘named entities’ (NE)
• words and word sequences that do not have a common lexical meaning:
  • proper nouns
    • e.g., person names, names of institutions, products, towns
  • numeric expressions whose meaning is other than that of quantity
    • e.g., telephone number, page number
• NE processing is of crucial importance for NLP
  • question answering, information extraction, machine translation
• NE task ‘born’ at the MUC conference in 1995
Named entities in Czech
• ‘pojmenované entity’ – direct equivalent of ‘named entities’
• up to now, the NE task has not been solved for Czech
• now: within the project 1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů; in English: Integration of language resources for the purpose of information extraction from natural texts)
• some examples from Czech
  • jeho hlava (his head) vs. pan Hlava (Mr. Hlava), k jeho hlavě (to his head) vs. k panu Hlavovi (to Mr. Hlava)
  • 289 stran (289 pages) vs. na straně 289 (on page 289)
Named entity classification
• NE-type, NE-super-type, NE-container; special tags
• 1st version for the 1st round of annotation (focused on proper nouns):
  • 42 NE-types: pf, ps,...
  • 7 NE-super-types: a, g, i, m, o, p, t
  • 4 NE-containers: A, C, P, T
• 2nd version for the 2nd round of annotation (extended to numeric expressions):
  • 62 NE-types: pf, ps,... na, np,...
  • 10 NE-super-types: a, c, g, i, m, n, o, p, q, t
  • 4 NE-containers: A, C, P, T
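The tag inventory above can be read as a small lookup. The sketch below is only illustrative: the super-type and container letters are copied from the slide (2nd version), while the rule that a two-letter NE-type begins with the letter of its super-type is an assumption inferred from the listed examples (pf, ps under p; na, np under n). The function name `describe_tag` is hypothetical.

```python
# Inventories copied from the slide (2nd version of the classification).
SUPER_TYPES = set("acgimnopqt")   # 10 NE-super-types
CONTAINERS = set("ACPT")          # 4 NE-containers

def describe_tag(tag):
    """Return (kind, super_type) for a tag string; illustrative only."""
    if tag in CONTAINERS:
        return ("container", None)
    if tag in SUPER_TYPES:
        return ("super-type", tag)
    # Assumption: a two-letter NE-type starts with its super-type letter.
    if len(tag) == 2 and tag.islower() and tag[0] in SUPER_TYPES:
        return ("type", tag[0])
    return ("unknown", None)

for example in ["pf", "np", "P", "t"]:
    print(example, describe_tag(example))
```

The check does not validate the second letter, since the slide lists only a few of the 62 NE-types.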
Named entity classification: Types of person names [table not recovered]
Named entity classification: NE-containers
Named entity classification: Special tags
Data annotation
• NE-type, NE-container; special tags; spam; NE-instance
• 2 rounds of annotation
• 1st round
  • 2,000 sentences from the SYN2000 corpus
  • randomly selected from 5,364,071 sentences found, query: ([word=".*[a-z0-9]"] [word=".*[A-Z].*"])
  • 2 parallel annotations, 3rd ‘unifying’ annotation
  • defective sentences eliminated, another 100 sentences annotated
  • -> 2,010 sentences = train and test data
• 2nd round
  • 2,000 sentences from the SYN2005 corpus
  • randomly selected from 1,356,321 sentences found, query: [word=".*[0-9].*"]
  • 1 annotation, not yet revised
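To make the two selection queries concrete, here is a rough Python approximation. It assumes that each `[word="..."]` pattern must match a whole token and that adjacent brackets mean adjacent tokens; the function names and the pre-tokenized input are hypothetical, and this is not the corpus manager's actual implementation.

```python
import re

# Token patterns copied from the two queries on the slide.
ENDS_LOWER_OR_DIGIT = re.compile(r".*[a-z0-9]")
HAS_UPPER = re.compile(r".*[A-Z].*")
HAS_DIGIT = re.compile(r".*[0-9].*")

def matches_round1(tokens):
    """A token ending in a-z or 0-9 is directly followed by a token containing A-Z."""
    return any(
        ENDS_LOWER_OR_DIGIT.fullmatch(prev) and HAS_UPPER.fullmatch(nxt)
        for prev, nxt in zip(tokens, tokens[1:])
    )

def matches_round2(tokens):
    """Some token contains a digit."""
    return any(HAS_DIGIT.fullmatch(tok) for tok in tokens)

print(matches_round1(["k", "panu", "Hlavovi"]))  # capitalized word inside the sentence
print(matches_round1(["Jeho", "hlava"]))         # only a sentence-initial capital
print(matches_round2(["na", "straně", "289"]))   # numeric expression
```

Read this way, the first query skips sentences whose only capital letter is sentence-initial. Note that the character classes are ASCII-only, exactly as written in the query, so letters with diacritics are not covered by them.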
Data annotation: Example of annotated text
Quantitative characteristics of the data
• 2,010 sentences
• 51,921 tokens
• 11,644 NE-instances
• train:dtest:etest ~ 8:1:1
• in the train data
  • 1,608 sentences
  • 41,710 tokens
  • 6,109 NE-instances
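The 8:1:1 proportion can be checked against the sentence counts with a short sketch. The slides do not say how sentences were assigned to train, dtest and etest; the seeded random shuffle and the function name `split_8_1_1` below are assumptions for illustration only.

```python
import random

def split_8_1_1(sentences, seed=0):
    """Shuffle and cut into train/dtest/etest in roughly 8:1:1 proportion."""
    items = list(sentences)
    random.Random(seed).shuffle(items)   # assumption: random assignment
    n_train = len(items) * 8 // 10       # integer arithmetic avoids rounding issues
    n_dtest = (len(items) - n_train) // 2
    train = items[:n_train]
    dtest = items[n_train:n_train + n_dtest]
    etest = items[n_train + n_dtest:]
    return train, dtest, etest

train, dtest, etest = split_8_1_1(range(2010))
print(len(train), len(dtest), len(etest))  # 1608 201 201
```

With 2,010 sentences, the train portion comes out at 1,608 sentences, which agrees with the figure reported for the train data.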
Quantitative characteristics of the data: Tags of all NE-instances in the train data
Experiments in automatic NE recognition
Future work