Processing of Named Entities in Czech Texts

Named Entities in Czech Texts and Their Processing
Magda Ševčíková, Zdeněk Žabokrtský
{sevcikova,zabokrtsky}@ufal.mff.cuni.cz
ÚFAL MFF UK

Outline of the talk
• The term ‘named entities’
• Named entities in Czech
• Named entity classification
• Data annotation
• Quantitative characteristics of the data
• Experiments in automatic named entity recognition
• Future work

The term ‘named entities’
• English term ‘named entities’ (NE)
• words and word sequences that do not have a common lexical meaning:
  • proper nouns
    • e.g., person names, names of institutions, products, towns
  • numeric expressions whose meaning is other than that of quantity
    • e.g., telephone number, page number
• NE processing is of crucial importance for NLP
  • question answering, information extraction, machine translation
• NE task ‘born’ at the MUC conference in 1995

Named entities in Czech
• ‘pojmenované entity’ – direct equivalent of ‘named entities’
• up to now, the NE task has not been solved for Czech
• now: within the project 1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů; in English: Integration of language resources for information extraction from natural-language texts)
• some examples from Czech
  • jeho hlava (his head) vs. pan Hlava (Mr. Hlava), k jeho hlavě (to his head) vs. k panu Hlavovi (to Mr. Hlava)
  • 289 stran (289 pages) vs. na straně 289 (on page 289)

Named entity classification
• NE-type, NE-super-type, NE-container; special tags
• 1st version for the 1st round of annotation (focused on proper nouns):
  • 42 NE-types: pf, ps,...
  • 7 NE-super-types: a, g, i, m, o, p, t
  • 4 NE-containers: A, C, P, T
• 2nd version for the 2nd round of annotation (extended to numeric expressions):
  • 62 NE-types: pf, ps,... na, np,...
  • 10 NE-super-types: a, c, g, i, m, n, o, p, q, t
  • 4 NE-containers: A, C, P, T
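The two-level scheme can be mirrored in a small lookup. The sketch below is illustrative only: it assumes that the first letter of a two-letter NE-type names its super-type (as `pf` and `ps` suggest for `p`, and `na`, `np` for `n`) and that single upper-case tags are containers. The function name `describe_tag` is hypothetical, and the full list of NE-types is not given on the slide.

```python
# Super-types and containers of the 2nd classification version, as listed on the slide.
SUPER_TYPES = set("acgimnopqt")
CONTAINERS = set("ACPT")


def describe_tag(tag: str) -> tuple[str, str]:
    """Return (kind, super-type or container letter) for an annotation tag.

    Assumption (not stated on the slide): a two-letter lower-case NE-type
    starts with the letter of its super-type.
    """
    if tag in CONTAINERS:
        return ("container", tag)
    if len(tag) == 2 and tag.islower() and tag[0] in SUPER_TYPES:
        return ("type", tag[0])
    if tag in SUPER_TYPES:
        return ("super-type", tag)
    raise ValueError(f"unknown tag: {tag!r}")


print(describe_tag("pf"))  # ('type', 'p')
print(describe_tag("P"))   # ('container', 'P')
```

Because the check only looks at the shape of a tag, it cannot tell a real NE-type from an unused letter pair; a complete validator would need the actual inventory of 62 types.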

Named entity classification: Types of person names

Named entity classification: NE-containers

Named entity classification: Special tags

Data annotation
• NE-type, NE-container; special tags; spam; NE-instance
• 2 rounds of annotation
• 1st round
  • 2,000 sentences from SYN2000 corpus
  • randomly selected from 5,364,071 sentences found, query: ([word=".*[a-z0-9]"] [word=".*[A-Z].*"])
  • 2 parallel annotations, 3rd ‘unifying’ annotation
  • defective sentences eliminated, another 100 sentences annotated
  • -> 2,010 sentences = train and test data
• 2nd round
  • 2,000 sentences from SYN2005 corpus
  • randomly selected from 1,356,321 sentences found, query: [word=".*[0-9].*"]
  • 1 annotation, not yet revised
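To make the two selection queries concrete, here is a minimal Python sketch of what they match on tokenised sentences. It is not the corpus manager's own implementation: the helper names (`matches_round1`, `matches_round2`, `sample_sentences`), the toy sentences and the fixed seed are assumptions for illustration. The patterns are applied literally, so `[a-z]` and `[A-Z]` cover only unaccented letters here.

```python
import random
import re

# Token patterns copied from the slide's corpus queries. A corpus query matches
# a pattern against the whole token, which re.fullmatch imitates here.
ENDS_LOWER_OR_DIGIT = re.compile(r".*[a-z0-9]")
HAS_UPPER = re.compile(r".*[A-Z].*")
HAS_DIGIT = re.compile(r".*[0-9].*")


def matches_round1(tokens: list[str]) -> bool:
    """True if a token ending in a-z/0-9 is directly followed by a token containing A-Z."""
    return any(
        ENDS_LOWER_OR_DIGIT.fullmatch(first) and HAS_UPPER.fullmatch(second)
        for first, second in zip(tokens, tokens[1:])
    )


def matches_round2(tokens: list[str]) -> bool:
    """True if any token contains a digit."""
    return any(HAS_DIGIT.fullmatch(tok) for tok in tokens)


def sample_sentences(sentences, predicate, k, seed=0):
    """Randomly pick k sentences that satisfy the predicate (the seed is arbitrary)."""
    hits = [s for s in sentences if predicate(s)]
    return random.Random(seed).sample(hits, k)


toy = [
    ["k", "panu", "Hlavovi"],
    ["k", "jeho", "hlavě"],
    ["na", "straně", "289"],
]
print([matches_round1(s) for s in toy])  # [True, False, False]
print([matches_round2(s) for s in toy])  # [False, False, True]
```

The first query targets capitalised words that are not sentence-initial, which is why it suits a round focused on proper nouns; the second simply requires a digit, matching the extension to numeric expressions.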

Data annotation: Example of annotated text

Quantitative characteristics of the data
• 2,010 sentences
• 51,921 tokens
• 11,644 NE-instances
• train : dtest : etest ~ 8 : 1 : 1 (dtest = development test data, etest = evaluation test data)
• in the train data
  • 1,608 sentences
  • 41,710 tokens
  • 6,109 NE-instances
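The 8:1:1 proportion can be checked against the sentence and token counts above. The sketch below also shows one simple way to cut a list of sentences in that ratio; the slide does not say how the split was actually made, so `split_8_1_1` is a hypothetical helper, not the authors' procedure.

```python
# Figures copied from the slide.
total_sentences, total_tokens = 2010, 51921
train_sentences, train_tokens = 1608, 41710

print(f"sentences in train: {train_sentences / total_sentences:.3f}")  # 0.800
print(f"tokens in train:    {train_tokens / total_tokens:.3f}")        # 0.803


def split_8_1_1(items):
    """Cut a list into train/dtest/etest parts in an 8:1:1 ratio (contiguous slices)."""
    n_train = len(items) * 8 // 10
    n_dtest = (len(items) - n_train) // 2
    train = items[:n_train]
    dtest = items[n_train:n_train + n_dtest]
    etest = items[n_train + n_dtest:]
    return train, dtest, etest


train, dtest, etest = split_8_1_1(list(range(total_sentences)))
print(len(train), len(dtest), len(etest))  # 1608 201 201
```

In practice the sentences would be shuffled before slicing so that the three parts are comparable.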

Quantitative characteristics of the data: Tags of all NE-instances in the train data

Experiments in automatic NE recognition

Future work
