Declarative Learning Models for Natural Language Processing
Aria Haghighi
12/08/2006
Overview
Need Quick NLP System Deployment
• New languages
• New domains
Typical User
• Sophisticated engineer
• Little statistical expertise
• No time to label data!
Overview
[Figure: Annotated Data versus Unlabeled Data + Prototype List; the prototype list has the columns Target Label and Prototypes]
Sequence Modeling Tasks
Information Extraction: Classified Ads
Labels: Size, Restrict, Terms, Location, Features
Newly remodeled 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Mall area. Walking distance to shopping, public transportation, schools and park. Paid water and garbage. No dogs allowed.
Prototype List
Sequence Modeling Tasks
English POS
Tags: PUNC NN VBN CC JJ CD IN DET NNS IN NNP RB
Newly remodeled 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Mall area. Walking distance to shopping, public transportation, schools and park. Paid water and garbage. No dogs allowed.
Prototype List
Generalizing Prototypes
Example: a witness reported
Prototypes: said, the, president
• Tie each word to its most similar prototype
Generalizing Prototypes
y: DET NN VBD
x: a witness reported
Features for ‘reported’:
• ‘reported’ & VBD
• suffix-2=‘ed’ & VBD
• sim=‘said’ & VBD
Weights
• ‘reported’ & VBD = 0.35
• suffix-2=‘ed’ & VBD = 0.23
• sim=‘said’ & VBD = 0.35
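The weights on this slide can be read as contributions to a score for tagging ‘reported’ as VBD. The following is a minimal sketch, assuming the score is simply the sum of the weights of the (feature, tag) pairs that fire. The name `tag_score` and the dictionary layout are illustrative and do not come from the slides.

```python
# Weights copied from the slide; the data layout is an assumption for illustration.
weights = {
    ("word=reported", "VBD"): 0.35,
    ("suffix-2=ed", "VBD"): 0.23,
    ("sim=said", "VBD"): 0.35,
}

def tag_score(active_features, tag, weights):
    """Sum the weights of every (feature, tag) pair that fires."""
    return sum(weights.get((f, tag), 0.0) for f in active_features)

# Features active on the token 'reported'.
features = ["word=reported", "suffix-2=ed", "sim=said"]
score = tag_score(features, "VBD", weights)
print(round(score, 2))
```

The similarity feature `sim=said` contributes as much as the word identity itself, which is how a prototype can carry evidence over to a word that is not on the prototype list.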
English POS Experiments
• Data
  • 193K tokens (about 8K sentences) of WSJ portion of Penn Treebank
• Features [Smith & Eisner 05]
  • Trigram tagger
  • Word type, suffixes up to length 3, contains hyphen, contains digit, initial capitalization
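The per-word features listed on this slide could be computed along the following lines. This is a sketch only: the feature-string format and the function name `word_features` are assumptions, and the slides do not say whether words are case-normalized, so the word is kept as written.

```python
def word_features(word):
    """Return the slide's per-word features as a list of strings."""
    feats = ["word=" + word]  # word type (kept as written; an assumption)
    # Suffixes up to length 3, skipping suffixes that would be the whole word.
    for n in range(1, 4):
        if len(word) > n:
            feats.append(f"suffix-{n}={word[-n:]}")
    if "-" in word:
        feats.append("has-hyphen")
    if any(ch.isdigit() for ch in word):
        feats.append("has-digit")
    if word[:1].isupper():
        feats.append("init-cap")
    return feats

print(word_features("reported"))
print(word_features("Mid-2006"))
```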
English POS Experiments
BASE
• Fully Unsupervised
  • Random initialization
  • Greedy label remapping
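The slide does not spell out the remapping procedure. One common reading, sketched here as an assumption, is a greedy one-to-one mapping: count how often each induced state co-occurs with each gold tag, then assign pairs from most to least frequent. The name `greedy_remap` and the toy data are hypothetical.

```python
from collections import Counter

def greedy_remap(predicted, gold):
    """Greedily build a one-to-one map from induced states to gold tags."""
    counts = Counter(zip(predicted, gold))
    mapping, used_tags = {}, set()
    # Visit (state, tag) pairs from most to least frequent.
    for (state, tag), _ in counts.most_common():
        if state not in mapping and tag not in used_tags:
            mapping[state] = tag
            used_tags.add(tag)
    return mapping

# Toy example: induced state ids versus gold tags (made-up data).
predicted = [0, 1, 2, 0, 1, 2, 0]
gold = ["DET", "NN", "VBD", "DET", "NN", "VBD", "NN"]
mapping = greedy_remap(predicted, gold)
accuracy = sum(mapping.get(p) == g for p, g in zip(predicted, gold)) / len(gold)
print(mapping, accuracy)
```

Remapping is needed only for the unsupervised baseline, whose states carry no names; accuracy is then measured after the states have been renamed.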
English POS Experiments
• Prototype List
  • 3 prototypes per tag
  • Automatically extracted by frequency
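Extraction by frequency could look roughly like this. The slide gives no further detail, so the sketch assumes a small tagged sample and simply takes the most frequent words per tag; the exact selection rule used in the experiments may differ. `extract_prototypes` and the sample data are illustrative.

```python
from collections import Counter, defaultdict

def extract_prototypes(tagged_tokens, per_tag=3):
    """Return the `per_tag` most frequent words for each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[tag][word] += 1
    return {tag: [w for w, _ in c.most_common(per_tag)]
            for tag, c in counts.items()}

# Made-up tagged sample.
tagged = (
    [("the", "DET")] * 4 + [("a", "DET")] * 3 + [("an", "DET")] * 2
    + [("this", "DET")]
    + [("said", "VBD")] * 2 + [("reported", "VBD")]
)
protos = extract_prototypes(tagged)
print(protos)
```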
English POS Distributional Similarity
• Judge a word by the company it keeps
  <s> the president said a downturn is near </s> (context positions -1 and +1)
• Collect context counts from 40M words of WSJ
• Similarity [Schuetze 93]
  • SVD dimensionality reduction
  • cos() similarity measure
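A small sketch of the context-count and cosine steps follows. It counts the words at positions -1 and +1 around each word and compares the raw count vectors directly; the SVD dimensionality reduction named on the slide is deliberately left out to keep the example dependency-free. The function names and the toy sentences are assumptions.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences):
    """Count the words seen at positions -1 and +1 around each word."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]][("-1", padded[i - 1])] += 1
            vectors[padded[i]][("+1", padded[i + 1])] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Toy corpus (made up, apart from the first sentence taken from the slide).
sentences = [
    "the president said a downturn is near".split(),
    "the witness reported a downturn".split(),
    "a witness said the downturn is near".split(),
]
vecs = context_vectors(sentences)
sim_verbs = cosine(vecs["said"], vecs["reported"])
sim_mixed = cosine(vecs["said"], vecs["the"])
print(sim_verbs, sim_mixed)
```

Even on three sentences, ‘reported’ lands closer to ‘said’ than ‘the’ does, because the two verbs share left and right neighbors.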
English POS Experiments
PROTO+SIM
• Add similarity features
  • Top five most similar prototypes that exceed threshold
67.8% accuracy on non-prototypes
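Turning similarity scores into features might be sketched as below: keep the prototypes whose similarity to the current word exceeds a threshold, then take the five highest. The threshold value and the similarity numbers are hypothetical; the slide does not give them.

```python
def similarity_features(sims, threshold, top_k=5):
    """sims maps each prototype to its similarity with the current word."""
    above = [(p, s) for p, s in sims.items() if s > threshold]
    above.sort(key=lambda ps: ps[1], reverse=True)
    return ["sim=" + p for p, _ in above[:top_k]]

# Made-up similarities between the word 'reported' and some prototypes.
sims = {
    "said": 0.71, "the": 0.05, "president": 0.20, "of": 0.36,
    "is": 0.41, "was": 0.66, "new": 0.38, "year": 0.37,
}
feats = similarity_features(sims, threshold=0.35)  # threshold is an assumption
print(feats)
```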
English POS Transition Counts
[Figure: transition counts, Learned Structure versus Target Structure]
Classified Ads Experiments
• Data
  • 100 ads (about 119K tokens) from [Grenager et al. 05]
• Features
  • Trigram tagger
  • Word type
Classified Ads Experiments
BASE
• Fully Unsupervised
  • Random initialization
  • Greedy label remapping
Classified Ads Experiments
• Prototype List
  • 3 prototypes per tag
  • 33 words in total
  • Automatically extracted by frequency