Semantic Annotation – Week 3
Team: Louise Guthrie, Roberto Basili, Fabio Zanzotto, Hamish Cunningham, Kalina Boncheva, Jia Cui, Klaus Macherey, David Guthrie, Martin Holub, Marco Cammisa, Cassia Martin, Jerry Liu, Kris Haralambiev, Fred Jelinek
Our Hypotheses
• A transformation of a corpus that replaces words and phrases with coarse semantic categories will help overcome the data sparseness problem encountered in language modeling
• Semantic category information will also help improve machine translation
• An initially noun-centric approach will allow bootstrapping to other syntactic categories
An Example
• Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday
• Humans aboard space_vehicle dodge satellite time_ref.
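The transformation can be pictured as a dictionary lookup over noun phrases. Below is a minimal sketch, assuming a tiny hypothetical lexicon standing in for the LDOCE-derived tag set; the real pipeline tags GATE-recognised noun phrases rather than raw tokens.

```python
# Minimal sketch of the corpus transformation (hypothetical lexicon and phrase
# list; the real system tags GATE-recognised noun phrases, not raw tokens).

PHRASE_CATEGORY = {                       # multi-word phrase -> coarse category
    ("space", "shuttle", "endeavor"): "space_vehicle",
}
WORD_CATEGORY = {                         # single head word -> coarse category
    "astronauts": "Human",
    "satellite": "satellite",
    "friday": "time_ref",
}

def transform(tokens):
    """Greedy longest-match replacement of phrases, then single words."""
    low = [t.lower() for t in tokens]
    out, i = [], 0
    while i < len(tokens):
        for phrase, cat in PHRASE_CATEGORY.items():
            if tuple(low[i:i + len(phrase)]) == phrase:
                out.append(cat)
                i += len(phrase)
                break
        else:
            out.append(WORD_CATEGORY.get(low[i], tokens[i]))
            i += 1
    return out

tokens = ("Astronauts aboard the space shuttle Endeavor were forced "
          "to dodge a derelict Air Force satellite Friday").split()
print(" ".join(transform(tokens)))
# -> Human aboard the space_vehicle were forced to dodge a derelict Air Force
#    satellite time_ref
```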
Our Progress – Preparing the Data (Pre-Workshop)
• Identify a tag set
• Create a human-annotated corpus
• Create a doubly annotated corpus
• Process all data for named entity and noun phrase recognition using GATE tools
• Develop algorithms for mapping target categories to WordNet synsets to support the tag set assessment (see the sketch below)
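For the WordNet mapping step, one plausible sketch is to test whether a hand-picked anchor synset for each target category dominates any sense of a noun. The anchor synsets below are illustrative assumptions, not the project's actual mapping, and the code assumes NLTK with the WordNet data installed.

```python
# Hedged sketch of checking whether a noun falls under a target category by
# walking WordNet hypernyms (requires NLTK and its WordNet corpus; the anchor
# synsets chosen for each category are illustrative, not the project's own).
from nltk.corpus import wordnet as wn

CATEGORY_ANCHORS = {
    "Human":  wn.synset("person.n.01"),
    "Animal": wn.synset("animal.n.01"),
    "Plant":  wn.synset("plant.n.02"),
}

def categories_for(noun):
    """Return target categories whose anchor synset dominates some sense of the noun."""
    cats = set()
    for sense in wn.synsets(noun, pos=wn.NOUN):
        ancestors = set(sense.closure(lambda s: s.hypernyms()))
        for cat, anchor in CATEGORY_ANCHORS.items():
            if anchor == sense or anchor in ancestors:
                cats.add(cat)
    return cats

print(categories_for("astronaut"))   # expected: {'Human'}
print(categories_for("satellite"))   # likely empty for these three anchors
```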
The Semantic Classes for Annotators
• A subset of the classes available in the Longman Dictionary of Contemporary English (LDOCE), electronic version
• Rationale:
  • The number of semantic classes is small
  • The classes are somewhat reliable, since a team of lexicographers used them to code:
    • Noun senses
    • Adjective preferences
    • Verb preferences
Semantic Classes
[Class hierarchy diagram showing target classes and annotated evidence: Abstract (T), Concrete (C), Animate (Q), Inanimate (I), PhysQuant (4), Organic (5), Plant (P), Animal (A), Human (H), Liquid (L), Gas (G), Solid (S), Non-movable (J), Movable (N); additional evidence codes B, D, F, M]
More Categories
• U: Collective
• K: Male
• R: Female
• W: Not animate
• X: Not concrete or animal
• Z: Unmarked
We allowed annotators to choose "none of the above" (shown as ? in the slides that follow)
Our Progress – Data Preparation
• Assess the annotation format; define uniform descriptions for irregular phenomena and normalize them
• Determine the distribution of the tag set in the training corpus
• Analyze inter-annotator agreement
• Determine a reliable set of tags – T
• Parse all training data
Doubly Annotated Data
• Instances (headwords): 10,960
• 8,950 instances without question marks
• 8,446 of those are marked the same
• Inter-annotator agreement is 94% (83% when question-mark instances are included)
• Recall: these are non-named-entity noun phrases
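As a quick sanity check of the figures above (a worked calculation over the slide's numbers, not new data):

```python
# Sanity check of the agreement figures quoted above (slide numbers only).
total          = 10_960   # doubly annotated headword instances
without_qmarks = 8_950    # instances where neither annotator used '?'
same_label     = 8_446    # of those, instances labelled identically

print(f"agreement excluding '?': {same_label / without_qmarks:.0%}")   # ~94%
# Reproducing the 83% figure that includes question marks would also need the
# counts of instances where both annotators chose '?', which are not on the slide.
```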
Inter-annotator Agreement – for Each Category
A Few Statistics on the Human-Annotated Data
• Total annotated: 262,230 instances
• 48,175 with ?
• 214,055 with a category; of those:
  • Z: 0.5%
  • W and X: 0.5%
  • 4 and 5: 1.6%
Our Progress – Baselines
• Determine baselines for automatic tagging of noun phrases
• Baselines for tagging observed words in new contexts (new instances of known words)
• Baselines for tagging unobserved words
  • Unseen words – not in the training material but in the dictionary
  • Novel words – in neither the training material nor the dictionary/WordNet
Overlap of Dictionary and Head Nouns (in the BNC)
• 85% of NPs are covered
• only 33% of the vocabulary (words in both LDOCE and WordNet) appears in the covered NPs
Preparation of the Test Environment
• Selected the blind portion of the human-annotated data for late evaluation
• Divided the remaining corpus into training and held-out portions
  • Random division of files
  • Unambiguous words for training – ambiguous words for testing
Baselines Using Only (Target) Words and Preceding Adjectives
Baselines Using Multiple Knowledge Sources
• Experiments in Sheffield:
  • Unambiguous tagger (assigns the only available semantic category)
  • Bag-of-words tagger (IR-inspired)
    • window size of 50 words
    • nouns and verbs
  • Frequency-based tagger (assigns the most frequent semantic category)
Baselines Using Multiple Knowledge Sources (cont'd)
• Frequency-based tagger: 16–18% error rate
• Bag-of-words tagger: 17% error rate
• Combined architecture: 14.5–15% error rate
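A compact sketch of the two simplest baselines above, run on a toy corpus; the category labels and data structures are hypothetical, and the real taggers worked over a 50-word window on the annotated BNC data.

```python
# Sketch of the frequency-based and bag-of-words baselines (toy data only).
from collections import Counter, defaultdict

# headword -> Counter of semantic categories observed for it in training
category_counts = defaultdict(Counter)
# semantic category -> Counter of context words seen near it in training
context_counts = defaultdict(Counter)

def train(annotated):
    """annotated: iterable of (headword, category, context_words) triples."""
    for head, cat, context in annotated:
        category_counts[head][cat] += 1
        context_counts[cat].update(context)

def frequency_tag(head):
    """Frequency-based tagger: most frequent category seen for this headword."""
    counts = category_counts.get(head)
    return counts.most_common(1)[0][0] if counts else None

def bag_of_words_tag(context):
    """Bag-of-words tagger: category whose training contexts best overlap this one."""
    scores = {cat: sum(ctr[w] for w in context) for cat, ctr in context_counts.items()}
    return max(scores, key=scores.get) if scores else None

train([
    ("satellite", "Movable", ["orbit", "launch", "space"]),
    ("satellite", "Movable", ["dodge", "space", "shuttle"]),
    ("astronaut", "Human",   ["space", "shuttle", "crew"]),
])
print(frequency_tag("satellite"))                     # Movable
print(bag_of_words_tag(["crew", "launch", "space"]))  # category with best overlap
```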
Bootstrapping to Unseen Words
• Problem: automatically identify the semantic class of words in LDOCE whose behavior was not observed in the training data
• Basic idea: use the unambiguous words (unambiguous with respect to our semantic tag set) to learn contexts for tagging unseen words
Bootstrapping: Statistics
• 6,656 different unambiguous lemmas in the (visible) human-tagged corpus
• ...these contribute 166,249 instances of data
• ...134,777 of those instances were considered correct by the annotators!
• Observation: unambiguous words can be used in the corpus in an "unforeseen" way
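The bootstrapping idea can be sketched as follows: contexts around unambiguous words build per-category profiles, and an ambiguous or unseen word is assigned the category whose profile best matches its own context. Everything below is illustrative; the window size, lexicon, and scoring are assumptions, not the workshop's implementation.

```python
# Bootstrapping sketch: unambiguous words supply labelled contexts; an
# ambiguous or unseen word gets the category whose contexts it most resembles.
from collections import Counter, defaultdict

def collect_contexts(corpus, unambiguous_lexicon, window=5):
    """corpus: list of token lists. Returns category -> Counter of nearby words."""
    profiles = defaultdict(Counter)
    for sent in corpus:
        for i, tok in enumerate(sent):
            cat = unambiguous_lexicon.get(tok.lower())
            if cat is not None:
                left = sent[max(0, i - window):i]
                right = sent[i + 1:i + 1 + window]
                profiles[cat].update(w.lower() for w in left + right)
    return profiles

def tag_unseen(context_words, profiles):
    """Pick the category whose profile gives this context the highest overlap score."""
    scores = {cat: sum(prof[w] for w in context_words) for cat, prof in profiles.items()}
    return max(scores, key=scores.get) if scores else None

lexicon  = {"astronauts": "Human", "water": "Liquid"}   # toy unambiguous lexicon
corpus   = [["the", "astronauts", "drank", "recycled", "water", "aboard"]]
profiles = collect_contexts(corpus, lexicon, window=3)
print(tag_unseen(["the", "water", "drank"], profiles))  # -> Human on this toy data
```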
Bootstrapping Baselines
• Test instances (instances of ambiguous words): 62,853
Metrics for Intrinsic Evaluation
• Need to take into account the hierarchical structure of the target semantic categories
• Two fuzzy measures based on:
  • dominance between categories
  • edge distance in the category tree/graph
• Results with respect to inter-annotator agreement are almost identical to exact match
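One way the edge-distance measure could be realised is sketched below, over a hypothetical fragment of the category tree; the workshop's actual scoring function may differ.

```python
# Sketch of a fuzzy, tree-aware score: credit decays with the number of edges
# between the predicted and the gold category in the class hierarchy.
# The parent map is a hypothetical fragment of the tag-set tree.

PARENT = {
    "Concrete": None, "Animate": "Concrete", "Inanimate": "Concrete",
    "Human": "Animate", "Animal": "Animate", "Plant": "Animate",
    "Liquid": "Inanimate", "Solid": "Inanimate", "Gas": "Inanimate",
}

def path_to_root(cat):
    path = []
    while cat is not None:
        path.append(cat)
        cat = PARENT[cat]
    return path

def edge_distance(a, b):
    """Number of edges between two categories via their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(c for c in pa if c in pb)
    return pa.index(common) + pb.index(common)

def fuzzy_score(predicted, gold):
    """1 for an exact match, decaying with tree distance otherwise."""
    return 1.0 / (1.0 + edge_distance(predicted, gold))

print(fuzzy_score("Human", "Human"))    # 1.0
print(fuzzy_score("Human", "Animal"))   # 0.33... (distance 2 via Animate)
print(fuzzy_score("Human", "Liquid"))   # 0.2     (distance 4 via Concrete)
```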
What's Next
• Investigate the respective contribution of (independent) features
• Incorporate syntactic information
• Refine some coarse categories:
  • using subject codes
  • using genus terms
  • re-mapping via WordNet
What's Next (cont'd)
• Reduce the number of features/values via external resources:
  • lexical vs. semantic models of the context
  • use selectional preferences
• Concentrate on complex cases (e.g. unseen words)
• Prepare test data for extrinsic evaluation (MT)