Attacking the Data Sparseness Problem
Team: Louise Guthrie, Roberto Basili, Fabio Zanzotto, Hamish Cunningham, Kalina Bontcheva, Jia Cui, Klaus Macherey, David Guthrie, Martin Holub, Marco Cammisa, Cassia Martin, Jerry Liu, Kris Haralambiev, Fred Jelinek
Motivation for the project Texts for text extraction contain sentences like: The IRA bombed a family owned shop in Belfast yesterday. FMLN set off a series of explosions in central Bogota today.
Motivation for the project
We'd like to automatically recognize that both are of the form ORGANIZATION – ATTACKED – LOCATION – DATE:
The IRA bombed a family owned shop in Belfast yesterday.
FMLN set off a series of explosions in central Bogota today.
Our Hypotheses • A transformation of a corpus to replace words and phrases with coarse semantic categories will help overcome the data sparseness problem encountered in language modeling, and text extraction. • Semantic category information might also help improve machine translation • A noun-centric approach initially will allow bootstrapping for other syntactic categories
A six week goal – Labeling noun phrases • Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday • Humans aboard space_vehicle dodge satellite time_ref.
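The substitution step itself is simple once noun-phrase spans and their coarse categories are known. Below is a minimal sketch of that step only (the token spans, label names, and function are illustrative, not the workshop's actual pipeline, which also simplifies the surrounding text):

```python
def label_noun_phrases(tokens, spans):
    """spans: (start, end, LABEL) token spans, non-overlapping, end exclusive."""
    out, i = [], 0
    for start, end, label in sorted(spans):
        out.extend(tokens[i:start])   # keep the text between noun phrases
        out.append(label)             # replace the noun phrase with its category
        i = end
    out.extend(tokens[i:])
    return " ".join(out)

tokens = ("Astronauts aboard the space shuttle Endeavor were forced to "
          "dodge a derelict Air Force satellite Friday").split()
spans = [(0, 1, "HUMAN"), (2, 6, "SPACE_VEHICLE"),
         (10, 15, "SATELLITE"), (15, 16, "TIME_REF")]
print(label_noun_phrases(tokens, spans))
# -> HUMAN aboard SPACE_VEHICLE were forced to dodge SATELLITE TIME_REF
```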
Preparing the data – Pre-Workshop • Identify a tag set • Create a human-annotated corpus • Create a double-annotated corpus • Process all data for named entity and noun phrase recognition using GATE tools (26 million words) • Parsed the data (about 26 million words) • Develop algorithms for mapping target categories to WordNet synsets to support the tag set assessment
The Semantic Classes and the Corpus • A subset of classes available in the electronic version of the Longman Dictionary of Contemporary English (LDOCE) • Rationale: • The number of semantic classes was small • The classes are somewhat reliable, since a team of lexicographers used them to code noun senses, adjective preferences and verb preferences • Many words have subject area information, which might be useful
The Semantic Classes
• Top level: Concrete | Abstract
• Concrete: Animate | Inanimate
• Animate: Plant, Animal (Female Anim., Male Anim.), Human (Female, Male)
• Inanimate: Solid (Movable, Non-movable), Gas, Liquid
• Further classes: Organic, Physical Qualities, Collective
The human annotated statistics • Inter-annotator agreement is 94%, so that is the upper limit for our task • 214,446 total annotated noun phrases (262,683 including "None of the Above") • 29,071 unique vocabulary items (unlemmatized) • 25 semantic categories (162 associated subject areas were identified) • 127,569 instances (59%) carry the most frequent semantic category, Abstract
The experimental setup
• Corpus: BNC (Science, Politics, Business), 26 million words
• Human annotated with semantic tags (noun phrases only): 220,000 instances, 2 million words
The main development set (dev)
• Training: 113,000 instances
• Held out: 85,000 instances
• A blind portion is kept aside; machine learning is used to improve tagging performance on it
A challenging development set for experiments on unseen words (the Hard data set)
• Training: all unambiguous words, 125,000 instances
• Held out: ambiguous words, 73,000 instances
• A blind portion is kept aside; machine learning is used to improve tagging performance on it
Our Experiments include: • Supervised Approaches (Learning from Human Annotated data) • Unsupervised approaches • Using outside evidence (the dictionary or wordnet) • Syntactic information from parsing or pattern matching • Context words, the use of preferences, the use of topical information
Experiments on unseen words - Hard data set • Training corpus has only words with unambiguous annotations • 125,000 training instances • 73,000 instances held-out • Perplexity – 21 • Baseline – Accuracy 45% • Improvement – Accuracy 68.5 % • Context can contribute greatly in unsupervised experiments
Results on the dev set • Random split, with some frequent ambiguous words moved into testing • 113,000 training instances • 85,000 instances held-out • Perplexity – 3.44 • Baseline – Accuracy 80% • Improvement – Accuracy 87%
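For reference, a rough sketch of how a most-frequent-category baseline and a per-instance perplexity over p(category | word) can be computed from the annotated data; the exact definitions and smoothing used in the workshop may differ:

```python
import math
from collections import Counter, defaultdict

def train_counts(pairs):
    """pairs: (word, category) instances from the training portion."""
    counts = defaultdict(Counter)
    for w, c in pairs:
        counts[w][c] += 1
    return counts

def baseline_accuracy(counts, heldout):
    """Most-frequent-category-per-word baseline."""
    hits = sum(1 for w, c in heldout
               if counts[w] and counts[w].most_common(1)[0][0] == c)
    return hits / len(heldout)

def perplexity(counts, heldout, floor=1e-6):
    """Geometric-mean perplexity of p(category | word) on the held-out instances."""
    logp = 0.0
    for w, c in heldout:
        total = sum(counts[w].values())
        p = counts[w][c] / total if total else floor
        logp += math.log(max(p, floor))
    return math.exp(-logp / len(heldout))
```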
The scheme for annotating the large corpus • After experimenting with the development sets, we need a scheme for making use of all of the dev corpus to tag the blind corpus. • We developed an incremental scheme within the maximum entropy framework. • Several talks have to do with re-estimation techniques useful to the bootstrapping process.
Terminology • Seen words – words seen in the human-annotated data (new instances of known words) • Unseen words – not in the training material, but in the dictionary • Novel words – neither in the training material nor in the dictionary/WordNet
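In code, the distinction is just a pair of vocabulary lookups; the function and argument names below are illustrative:

```python
def word_status(word, training_vocab, dictionary_vocab):
    """training_vocab / dictionary_vocab are illustrative sets of known words."""
    if word in training_vocab:
        return "seen"     # new instance of a word from the human-annotated data
    if word in dictionary_vocab:
        return "unseen"   # not in training, but covered by LDOCE/WordNet
    return "novel"        # in neither resource
```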
Bootstrapping: the human-annotated data (including the blind portion) is combined with the unannotated data.
The Unannotated Data – Four types
• Unambiguous: 515,000 instances
• Seen in training: 550,000 instances
• Unseen but in the dictionary: 9,000 instances
• Novel: 20,000 instances
How each type enters the bootstrapping step:
• Annotated (201K) and unambiguous (515K) instances are marked with the indicator vector <0, 0, ..., 0, 1> and used for training.
• Seen instances (550K) are marked with the appropriate probabilities, e.g. a seen word w gets <p(C1|w), ..., p(Cn|w)>, and are also used for training.
• Unseen (9K) and novel (20K) instances are tagged and treated as test data.
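A sketch of the training targets assigned when the unannotated data is folded into training; it assumes counts[word][category] holds the category counts observed for that word in the annotated data, and the names are illustrative:

```python
def target_vector(word, categories, gold_category=None, counts=None):
    """
    Sketch of the bootstrapping targets:
      - annotated or unambiguous instances get the indicator vector <0, ..., 0, 1>;
      - seen words get their empirical distribution <p(C1|w), ..., p(Cn|w)>.
    counts[word] is assumed to map categories to counts from the annotated data.
    """
    if gold_category is not None:
        return [1.0 if c == gold_category else 0.0 for c in categories]
    total = sum(counts[word].values())
    return [counts[word].get(c, 0) / total for c in categories]
```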
Results on the Blind Data • We set aside one tenth of the annotated corpus • Randomly selected within each of the domains • It contained 13,000 annotated instances • The baseline here was very high - 90% with simple techniques • We were able to achieve 93.5% accuracy
Overview • Bag of words (Kalina) • Evaluation (Kris) • Supervised methods using maximum entropy (Klaus) • Incorporating context preferences (Jerry) • Experiments with Adjective Classes and Subject (David, Jia, Martin) • Structuring the context using syntax and semantics (Cassia, Fabio) • Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia) • Unsupervised Re-estimation (Roberto) • Student Proposals (Jia, Dave, Marco) • Conclusion
Semantic Categories and MT • 10 test words – high, medium, and low frequency • Collected their target translations using EuroWordNet (e.g. Dutch) • Crane: • [lifts and moves heavy objects] – hijskraan, kraan • [large long-necked wading bird] - kraanvogel
SemCats and MT (2) • Manually mapped synonym sets to semantic categories • automatic mapping will be presented later • Studied how many synonym sets are ruled out as translations by the semantic category
Some Results • 3 words – full disambiguation • crane (Mov.Solid/Animal), medicine (Abstract/Liquid), plant (Plant/Solid) • 7 words – the categories substantially reduce the possible translations • club - [Abstr/an association of people...], [Mov.Solid/stout stick...], [Mov.Solid/an implement used by a golfer...], [Mov.Solid/a playing card...], [NonMov.Solid/a building …] • club/NonMov.Solid – [clubgebouw, clubhuis, …] • club/Abstr. – [bevolkingsgroep, broederschap, …] • club/Mov.Solid – [knots, kolf, malie], [kolf, malie], [club]
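A toy illustration of how a semantic category rules out candidate translations, using the crane and club senses listed above; the lookup table and function are hypothetical, not the actual EuroWordNet mapping:

```python
# Hypothetical lookup built from the EuroWordNet synonym sets quoted above,
# keyed by the LDOCE-style semantic category of each sense.
translations = {
    ("crane", "Mov.Solid"):    ["hijskraan", "kraan"],
    ("crane", "Animal"):       ["kraanvogel"],
    ("club",  "NonMov.Solid"): ["clubgebouw", "clubhuis"],
    ("club",  "Abstract"):     ["bevolkingsgroep", "broederschap"],
    ("club",  "Mov.Solid"):    ["knots", "kolf", "malie", "club"],
}

def candidate_translations(word, category):
    """The semantic category assigned to the source noun rules out the other senses."""
    return translations.get((word, category), [])

print(candidate_translations("crane", "Animal"))   # -> ['kraanvogel']
```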
The architecture • The “multiple-knowledge sources” WSD architecture (Stevenson 03) • Allow use of multiple taggers and combine their results through a weighted function • Weights can be learned from a corpus • All taggers implemented as GATE components and combined in applications
The Bag-of-Words Tagger • The bag-of-words tagger is an Information Retrieval-inspired tagger with parameters: • Window size (default: 50) • Which POS to put in the content vectors (default: nouns and verbs) • Which similarity measure to use • Used in WSD (Leacock et al 92) • Crane/Animal={species, captivity, disease…} • Crane/Mov.Solid={worker, disaster, machinery…}
BoW classifier (2) • Seen words classified by calculating the inner product between their context vector and the vectors for each possible category • Inner product calculated as: • Binary vectors – number of matching terms • Weighted vectors: • Leacock’s measure – favour concepts that occur frequently in exactly one category • Take into account the polysemy of concepts in the vectors
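For the binary-vector case the classification rule reduces to set intersection; a minimal sketch (the category vectors and example context are illustrative):

```python
def bow_score(context_words, category_terms):
    """Binary-vector inner product = number of matching terms."""
    return len(set(context_words) & category_terms)

def classify(context_words, category_vectors):
    """category_vectors maps each category to the content words seen with it."""
    return max(category_vectors,
               key=lambda c: bow_score(context_words, category_vectors[c]))

category_vectors = {"Animal":    {"species", "captivity", "disease"},
                    "Mov.Solid": {"worker", "disaster", "machinery"}}
context = ["the", "worker", "was", "hurt", "when", "the", "crane", "machinery", "failed"]
print(classify(context, category_vectors))    # -> Mov.Solid
```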
Current performance measures • The baseline frequency tagger on its own – 91% on the test (blind) set • Bag-of-words tagger on its own – 92.7% • Combined architecture – 93.2% (window size 50, using only nouns, binary vectors)
Future work on the architecture • Integrate syntactic information, subject codes, and document topics • Experiment with cosine similarity • Implement [Yarowsky’92] WSD algorithm • Implement the weighted function module • Experiment with integrating the ME tools as one of the taggers supplying preferences for the weighting module
Overview • Bag of words (Kalina) • Evaluation (Kris) • Supervised methods using maximum entropy (Klaus) • Incorporating context preferences (Jerry) • Experiments with Adjective Classes and Subject (David, Jia, Martin) • Structuring the context using syntax and semantics (Cassia, Fabio) • Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia) • Unsupervised Re-estimation (Roberto) • Student Proposals (Jia, Dave, Marco) • Conclusion
Accuracy Measurements – Kris Haralambiev
• How to measure the accuracy
• How to distinguish "correct", "almost correct" and "wrong"
Exact Match Measurements
• W = (w_1, w_2, …, w_n) – vector of the annotated words
• X = (x_1, x_2, …, x_n) – categories assigned by the annotators
• Y = (y_1, y_2, …, y_n) – categories assigned by a program
• Exact match (default) measurement – 1 for a match and 0 for a mismatch of each (x_i, y_i) pair: accuracy(X, Y) = |{i : x_i = y_i}|
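Equivalently, in code (reported here as a fraction of the n instances rather than as a raw count):

```python
def exact_match_accuracy(X, Y):
    """X = annotators' categories, Y = program's categories (parallel sequences)."""
    return sum(1 for x, y in zip(X, Y) if x == y) / len(X)
```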
The Hierarchy (categories with their LDOCE codes)
• Abstract (T), Concrete (C)
• Animate (Q), Inanimate (I), PhysQual (4), Organic (5)
• Plant (P), Animal (A), Human (H), Liquid (L), Gas (G), Solid (S)
• Non-movable (J), Movable (N)
• B, D, F, M (the female/male subclasses of Animal and Human)
Ancestor Relation Measurement
• Exact match assigns 0 to pairs like (H, M), (H, F), (A, Q), … even though one category is an ancestor of the other
• Instead, give a partial score to two categories in an ancestor relation:
• weight(Cat) = |{i : x_i is in the subtree rooted at Cat}|
• score(x_i, y_i) = min(weight(x_i)/weight(y_i), weight(y_i)/weight(x_i))
• accuracy(X, Y) = Σ_i score(x_i, y_i)
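A sketch of this measure over a toy slice of the hierarchy; the parent links, the guard against empty subtrees, and the normalization by n are illustrative choices, not necessarily the exact implementation used:

```python
# Toy slice of the hierarchy (child -> parent links); X plays the role of the gold data.
parent = {"Human": "Animate", "Animal": "Animate", "Female": "Human", "Male": "Human"}

def ancestors(cat):
    """cat itself plus all of its ancestors up to the root."""
    chain = [cat]
    while cat in parent:
        cat = parent[cat]
        chain.append(cat)
    return chain

def subtree_weight(cat, gold):
    """weight(cat) = number of gold categories lying in the subtree rooted at cat."""
    return sum(1 for x in gold if cat in ancestors(x))

def ancestor_accuracy(X, Y):
    total = 0.0
    for x, y in zip(X, Y):
        wx, wy = subtree_weight(x, X), subtree_weight(y, X)
        # partial credit only when one category is an ancestor of the other
        if (x in ancestors(y) or y in ancestors(x)) and wx and wy:
            total += min(wx / wy, wy / wx)
    return total / len(X)

print(ancestor_accuracy(["Male", "Female", "Animal", "Male"],
                        ["Human", "Female", "Animate", "Animal"]))  # ~0.48
```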
Edge Distance Measurement
• The ancestor relation measure assigns some score to pairs like (H, M), (A, Q), but still assigns 0 to pairs like (M, F), (A, H)
• Going further, we want to compute the similarity (distance) between X and Y
• distance(x_i, y_i) = the length of the simple path from x_i to y_i in the hierarchy
• Each edge can be given an individual length, or all edges have length 1 (we prefer the latter)
Edge Distance Measurement (cont'd)
• distance(X, Y) = Σ_i distance(x_i, y_i)
• Map distance to accuracy: distance 0 → 100%, max_possible_distance → 0%, with distance(X, Y) interpolated linearly in between
• max_possible_distance = Σ_i max_cat distance(x_i, cat)
• It might be reasonable to use the average instead of the max
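A sketch of the edge-distance accuracy over the same kind of toy hierarchy, assuming a single root, unit edge lengths, and the max-based normalization described above:

```python
# Toy slice of the hierarchy with a single root and unit edge lengths.
parent = {"Human": "Animate", "Animal": "Animate", "Female": "Human", "Male": "Human"}
all_categories = ["Animate", "Human", "Animal", "Female", "Male"]

def path_to_root(cat):
    chain = [cat]
    while cat in parent:
        cat = parent[cat]
        chain.append(cat)
    return chain

def edge_distance(a, b):
    """Length of the simple path between two categories in the tree."""
    pa, pb = path_to_root(a), path_to_root(b)
    lca = next(c for c in pa if c in pb)      # lowest common ancestor
    return pa.index(lca) + pb.index(lca)

def distance_accuracy(X, Y):
    dist = sum(edge_distance(x, y) for x, y in zip(X, Y))
    max_dist = sum(max(edge_distance(x, c) for c in all_categories) for x in X)
    return 1.0 - dist / max_dist              # distance 0 -> 100%, maximum -> 0%

print(edge_distance("Male", "Female"))                              # 2
print(distance_accuracy(["Male", "Animal"], ["Female", "Animate"]))  # 0.5
```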
Some Baselines
• Training + held-out data
• Blind data
Overview • Bag of words (Kalina) • Evaluation (Kris) • Supervised methods using maximum entropy (Klaus) • Incorporating context preferences (Jerry) • Experiments with Adjective Classes and Subject (David, Jia, Martin) • Structuring the context using syntax and semantics (Cassia, Fabio) • Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia) • Unsupervised Re-estimation (Roberto) • Student Proposals (Jia, Dave, Marco) • Conclusion
Supervised Methods using Maximum Entropy Jia Cui, David Guthrie, Martin Holub, Jerry Liu, Klaus Macherey
Overview • Maximum Entropy Approach • Feature Functions • Word Classes • Experimental Results
Maximum Entropy Approach • Principle: • Define suitable features (constraints) on training data • Find the maximum entropy distribution that satisfies the constraints (GIS) • Properties: • Easy to integrate information from several knowledge sources • Always converges to the global optimum on the training data • Tools used: YASMET toolkit (by F. J. Och) & JME (by J. Cui)
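The workshop used the YASMET and JME toolkits rather than hand-rolled code, but the core of the approach can be sketched in a few lines of Python: binary feature functions (a prior feature and a lexical feature, as on the next slide), a log-linear posterior, and Generalised Iterative Scaling updates. Everything below (feature choice, toy data, names) is illustrative:

```python
import math
from collections import defaultdict

def features(x, c):
    """Illustrative binary features: a prior feature on the category and a
    lexical feature tying the head word of the noun phrase to the category."""
    return [("prior", c), ("word", x["head"], c)]

def train_gis(data, categories, iterations=100):
    """data: list of (x, category) pairs.  Minimal GIS sketch."""
    emp = defaultdict(float)                       # empirical feature counts
    for x, c in data:
        for f in features(x, c):
            emp[f] += 1.0
    C = max(len(features(x, c)) for x, c in data)  # constant number of active features
    lam = {f: 0.0 for f in emp}                    # weights for observed features only

    def posterior(x):
        scores = {c: math.exp(sum(lam.get(f, 0.0) for f in features(x, c)))
                  for c in categories}
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

    for _ in range(iterations):
        model = defaultdict(float)                 # expected feature counts under the model
        for x, _ in data:
            p = posterior(x)
            for c in categories:
                for f in features(x, c):
                    if f in lam:
                        model[f] += p[c]
        for f in lam:                              # GIS update: lambda += (1/C) log(emp/model)
            if model[f] > 0:
                lam[f] += math.log(emp[f] / model[f]) / C
    return lam, posterior

# Toy usage: predict a semantic category from the head word of a noun phrase.
data = [({"head": "shop"}, "NonMov.Solid"), ({"head": "explosion"}, "Abstract"),
        ({"head": "satellite"}, "Mov.Solid"), ({"head": "shop"}, "NonMov.Solid")]
lam, posterior = train_gis(data, ["Abstract", "Mov.Solid", "NonMov.Solid"])
p = posterior({"head": "shop"})
print(max(p, key=p.get))                           # -> NonMov.Solid
```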
Feature Functions • Prior Features • Use Unigram probabilities P(c) for semantic categories c as feature • Lexical Features • Use the lexical information directly as a feature • Reduce number of features by using the following definition