Understanding the more data effect: A closer look at learning curves. Antal van den Bosch, Tilburg University, http://ilk.uvt.nl
Overview • The More Data effect • Case study 1: learning curves and feature representations • Case study 2: continuing learning curves with more data
The More Data effect • There’s no data like more data (speech recognition motto) • Banko and Brill (2001): confusibles • Differences between algorithms flip or disappear • Differences between representations disappear • Growth of curve seems log-linear (constant improvement with exponentially more data) • Explanation sought in “Zipf’s tail”
Banko and Brill (2001) • Demonstrated on {to, two, too} using 1M to 1G examples • Accuracy range across 3 classifiers: • At 1M: 83-85% • At 1G: 96-97% • Extremely simple memory-based classifier (one word left, one word right): • 86% at 1M, 93% at 1G • Apparent constant improvement with log-growth of the data
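The "constant improvement on log-growth" reading can be checked with simple arithmetic. The following is a minimal sketch, assuming accuracy is exactly linear in log10 of the number of training examples between the two points reported for the memory-based classifier (86% at 1M, 93% at 1G). The function names are illustrative, and any value at an intermediate size is an interpolation under that assumption, not a figure from Banko and Brill.

```python
import math

def loglinear_fit(n1, acc1, n2, acc2):
    """Return (intercept, slope) of acc = intercept + slope * log10(n)."""
    slope = (acc2 - acc1) / (math.log10(n2) - math.log10(n1))
    intercept = acc1 - slope * math.log10(n1)
    return intercept, slope

# The two points quoted on the slide for the simple memory-based classifier.
intercept, slope = loglinear_fit(1e6, 86.0, 1e9, 93.0)

def predicted_accuracy(n):
    """Interpolated accuracy (in %) under the log-linear assumption."""
    return intercept + slope * math.log10(n)

gain_per_tenfold = slope                    # points gained per 10x more data
gain_per_doubling = slope * math.log10(2)   # points gained per 2x more data

print(f"gain per tenfold increase: {gain_per_tenfold:.2f} points")
print(f"gain per doubling:         {gain_per_doubling:.2f} points")
```

Under this assumption every doubling buys the same fixed number of points, which is what "log-linear" means here: equal gains cost exponentially more data.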
Zipf • Frequency of nth most frequent word is inversely proportional to n • ~ log-linear relation between token frequencies vs numbers of types that have these frequencies
Chasing Zipf’s tail • More data brings two benefits: • More observations of words already seen. • More new words become known (the tail) • This effect persists, no matter how often the data is doubled.
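The two benefits can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any real corpus: tokens are drawn independently from a finite vocabulary whose probabilities follow an idealised Zipf distribution (probability proportional to 1/rank). The vocabulary size, starting sample size and seed are arbitrary choices.

```python
import random

def new_types_per_doubling(vocab_size=50_000, start=1_000, doublings=6, seed=0):
    """Return [(tokens_seen_so_far, new_types_in_this_step), ...]."""
    rng = random.Random(seed)
    ranks = range(1, vocab_size + 1)
    weights = [1.0 / r for r in ranks]  # Zipf: frequency proportional to 1/rank
    seen = set()
    total = 0
    target = start
    results = []
    for _ in range(doublings + 1):
        # Draw only the tokens needed to reach the next corpus size.
        sample = rng.choices(ranks, weights=weights, k=target - total)
        before = len(seen)
        seen.update(sample)
        total = target
        results.append((total, len(seen) - before))
        target *= 2
    return results

for tokens, new_types in new_types_per_doubling():
    print(f"{tokens:>7} tokens: {new_types} new types")
```

Because the simulated vocabulary is finite, the stream of new types would eventually dry up here; the slide's claim concerns natural language, where the tail keeps producing new words.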
Case study 1 • Learning curves vs feature representations • Van den Bosch & Buchholz, ACL 2002 • Perspective: how important are PoS features in shallow parsing? • Idea: • PoS features are robust • Robustness effect may decrease when more data is available
Words, PoS, shallow parsing • “Assign limited syntactic structure to text” • Input: words and/or relevant clues from computed PoS • Most systems assume PoS • HPSG (Pollard & Sag 87) • Abney (91) • Collins (96), Ratnaparkhi (97): interleaved • Charniak (00): back-off
Could words replace PoS? Simple intuition: • PoS disambiguate explicitly: suspect-N vs suspect-V • Words disambiguate implicitly: … the suspect … vs … we suspect …
Could words replace PoS? Words could provide PoS info implicitly • Pro: • No intermediary computation • No spurious PoS errors • Contra: • PoS offers back-off; PoS data is not sparse • PoS does resolve relevant ambiguity • What happens when there is more data?
Case study: Overall setup • “chunking-function tagging”, English • Select input: • Gold-standard or predicted PoS • Words only • Both • Learn with increasing amounts of training data • Which learning curve grows faster? • Do they meet or cross? Where?
Data (1): Get tree from PTB ((S (ADVP-TMP Once) (NP-SBJ-1 he) (VP was (VP held (NP *-1) (PP-TMP for (NP three months)) (PP without (S-NOM (NP-SBJ *-1) (VP being (VP charged) ))))) .))
Data (2): Shallow parse [ADVP Once ADVP-TMP] [NP he NP-SBJ] [VP was held VP/S] [PP for PP-TMP] [NP three months NP] [PP without PP] [VP being charged VP/S-NOM]
Data (3): Make instances • … _ Once he … I-ADVP - ADVP-TMP • … Once he was … I-NP - NP-SBJ • … he was held … I-VP - NOFUNC • … was held for … I-VP - VP/S • … held for three … I-PP - PP-TMP • … for three months … I-NP - NOFUNC • … three months without … I-NP - NP • … months without being … I-PP - PP • … without being charged … I-VP - NOFUNC • … being charged . … I-VP - VP/S-NOM • … charged . _ … O - NOFUNC
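The instance-making step can be sketched as a sliding window over the sentence. This is a minimal illustration, assuming one instance per token, "_" as padding at sentence boundaries, and a window of one word to the left and right, as visible on the slide. The ellipses on the slide suggest the real windows were wider, so `left` and `right` are parameters. The function name is hypothetical.

```python
def make_instances(words, labels, left=1, right=1, pad="_"):
    """Pair each token's context window with its chunk/function label."""
    padded = [pad] * left + list(words) + [pad] * right
    instances = []
    for i, label in enumerate(labels):
        # padded[i + left] is the focus word; take `left` words before it
        # and `right` words after it.
        window = tuple(padded[i : i + left + 1 + right])
        instances.append((window, label))
    return instances

words = "Once he was held for three months without being charged .".split()
labels = [
    "I-ADVP - ADVP-TMP", "I-NP - NP-SBJ", "I-VP - NOFUNC", "I-VP - VP/S",
    "I-PP - PP-TMP", "I-NP - NOFUNC", "I-NP - NP", "I-PP - PP",
    "I-VP - NOFUNC", "I-VP - VP/S-NOM", "O - NOFUNC",
]

for window, label in make_instances(words, labels):
    print(" ".join(window), "->", label)
```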
Case study: Details • Experiments based on Penn Treebank III (WSJ, Brown, ATIS) • 74K sentences, 1,637,268 tokens (instances) • 62,472 unique words, 874 chunk-tag codes • 10-fold cross-validation experiments: • Split data 10 times into 90% train and 10% test • Grow every training set stepwise • Precision-recall on correctly chunked and typed chunks with correct function tags • Memory-based learning (TiMBL) • MVDM, k=7, gain ratio feature weights, inverse distance class voting • TRIBL level 2 (approximate k-NN)
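The cross-validated learning-curve protocol can be sketched as follows. This is an illustration only: the slide says the training sets grow "stepwise" without giving the step sizes, so the fractions below are assumptions, as are the shuffling of individual instances and the function name. Smaller training sets are prefixes of larger ones, so each point on the curve extends the previous one.

```python
import random

def learning_curve_splits(n_instances, n_folds=10,
                          steps=(0.1, 0.25, 0.5, 1.0), seed=0):
    """Yield (fold, train_indices, test_indices) for growing training sets."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Partition the shuffled indices into n_folds disjoint test sets.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        for frac in steps:
            size = max(1, int(len(train) * frac))
            yield f, train[:size], test

for fold, train, test in learning_curve_splits(100):
    if fold == 0:
        print(f"fold {fold}: {len(train)} training, {len(test)} test instances")
```

A learner would be trained on each `train` subset and scored on the matching `test` set; averaging the scores per step over the ten folds gives one learning curve.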
Case study: Extension (1) • Word attenuation (after Eisner 96): • Distrust low-frequency information (<10) • But keep whatever is informative (back-off) • Convert to MORPH-[CAP|NUM|SHORT|ss], where ss appears to be the word's last two letters, judging by the example • Original: A Daikin executive in charge of exports when the high-purity halogenated hydrocarbon was sold to the Soviets in 1986 received a suspended 10-month jail sentence . • Attenuated: A MORPH-CAP executive in charge of exports when the MORPH-ty MORPH-ed MORPH-on was sold to the Soviets in 1986 received a suspended MORPH-th jail sentence .
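Word attenuation can be sketched as a small rewrite function. The slide gives only the frequency threshold (<10) and the four output classes, so the rest is assumed here: the order in which the tests are applied, the reading of NUM as "digits but no letters", and the length cut-off for SHORT. These were chosen so that the slide's example comes out as shown (for instance "10-month" becoming MORPH-th). The frequency counts in the usage example are invented for illustration.

```python
from collections import Counter

def attenuate(word, freq, threshold=10, short_len=2):
    """Replace a low-frequency word by a coarse MORPH-* code."""
    if freq[word] >= threshold:
        return word                      # frequent enough: trust the word
    if word[0].isupper():
        return "MORPH-CAP"               # capitalised first letter
    has_alpha = any(c.isalpha() for c in word)
    has_digit = any(c.isdigit() for c in word)
    if has_digit and not has_alpha:
        return "MORPH-NUM"               # assumed: purely numeric tokens
    if len(word) <= short_len:
        return "MORPH-SHORT"             # assumed cut-off for "short"
    return "MORPH-" + word[-2:]          # back off to the last two letters

sentence = ("A Daikin executive in charge of exports when the high-purity "
            "halogenated hydrocarbon was sold to the Soviets in 1986 "
            "received a suspended 10-month jail sentence .").split()

# Invented counts: every word is frequent except the five rare ones.
freq = Counter({w: 10 for w in sentence})
for rare in ("Daikin", "high-purity", "halogenated", "hydrocarbon", "10-month"):
    freq[rare] = 1

print(" ".join(attenuate(w, freq) for w in sentence))
```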
Case study: Extension (2) • In contrast with gold-standard PoS, use automatically generated PoS • Memory-based tagger (MBT, Daelemans et al., 1996) • Separate optimized modules for known and unknown words • Generate tagger on training set • Apply generated tagger to test set • With all training data: 96.7% correct tags
Case study: Observations • Word curve grows roughly log-linearly • PoS curve flattens more • Merit of words vs. PoS for the current task depends on the amount of training material • Extensions: • Attenuation improves performance • Adding (real) PoS improves performance • Both effects become smaller with more training material
Case study 2 • Continuing learning curves with more data • Work in progress • Idea: • Add data from • the same annotated source, • a different annotated source, • unlabeled data, • And measure curve on test data from • the same annotated source, • a different annotated source
Data • PTB II (red=test) • Wall Street Journal financial news articles • CoNLL shared task training set (211,737 words) • CoNLL shared task test set (47,377 words) • Rest of WSJ (914,662 words) • Brown (459,148 words) written English mix • ATIS (4,353 words): spoken English, questions, first-person sentences • Reuters Corpus (3.7Gb xml) newswire • Tagged by MBL trained on CoNLL shared task training set (w/ paramsearch)
Tasks • CoNLL 2000 shared task (chunking) • (Tjong Kim Sang and Buchholz, 2000) • Kudo & Matsumoto, pairwise classif. SVM, 93.5 F-score • Later improvements over 94 • Function tagging on same data • (Van den Bosch and Buchholz, 2002) • MBL, 78 F-score • 3-1-3 word windows for both (no PoS) • Paramsearch and attenuation on both
Use unlabeled data • Why not classify unlabeled data and add that? Well, • “One classifier does not work” • Negative effects outweigh positive (from % correct) • Adds more of the same • Imports errors • What does? (the million-dollar question) • Co-training (2 interleaved classifiers) • Active learning (n classifiers plus 1 human) • Yarowsky-boosting (1 iterated classifier) • Cf. Abney (2002)
Yarowsky boosting (1995) • Given labeled data and unlabeled data • Train rule inducer on labeled data, • Loop: • Relabel labeled data with rules • Remove examples below labeling confidence threshold • Label unlabeled data with rules • Add labeled examples above confidence threshold to labeled set • Train rule inducer again. • Demonstrated to work for WSD.
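The loop above can be sketched generically. This is an illustration of the control flow only: the rule inducer is swapped for a toy nearest-centroid learner over single numbers, and the confidence measure, threshold and fixed number of rounds are assumptions made for the example. All function names are hypothetical.

```python
def train_centroids(labeled):
    """Toy stand-in for the rule inducer: 1-D nearest-centroid classifier.
    Returns predict(x) -> (label, confidence in [0.5, 1])."""
    sums = {}
    for x, y in labeled:
        s, n = sums.get(y, (0.0, 0))
        sums[y] = (s + x, n + 1)
    centroids = {y: s / n for y, (s, n) in sums.items()}

    def predict(x):
        ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
        if len(ranked) == 1:
            return ranked[0], 1.0
        d0 = abs(x - centroids[ranked[0]])   # distance to nearest centroid
        d1 = abs(x - centroids[ranked[1]])   # distance to runner-up
        conf = 1.0 if d0 + d1 == 0 else d1 / (d0 + d1)
        return ranked[0], conf

    return predict

def self_training_loop(labeled, unlabeled, train, threshold=0.8, rounds=5):
    labeled, pool = list(labeled), list(unlabeled)
    predict = train(labeled)
    for _round in range(rounds):
        kept = []
        # Relabel the labeled set; drop examples below the threshold.
        for x, _old_label in labeled:
            y, conf = predict(x)
            if conf >= threshold:
                kept.append((x, y))
        # Label the unlabeled pool; move confident examples over.
        remaining = []
        for x in pool:
            y, conf = predict(x)
            if conf >= threshold:
                kept.append((x, y))
            else:
                remaining.append(x)
        if not kept:
            break                            # nothing confident left to train on
        labeled, pool = kept, remaining
        predict = train(labeled)             # train again on the revised set
    return labeled, pool, predict

seed_data = [(0.0, "a"), (10.0, "b")]
unlabeled = [1.0, 1.5, 9.0, 8.5, 5.1]
labeled, pool, predict = self_training_loop(seed_data, unlabeled, train_centroids)
print(labeled, pool)
```

In this toy run the four points close to a seed are absorbed in the first round, while the ambiguous point 5.1 never reaches the confidence threshold and stays unlabeled.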
Example boosting • Given labeled data and unlabeled data • Train MBL on labeled data, and use it to label unlabeled data • Reverse roles: • Train MBL on automatically labeled data • Test on original labeled data • For each training instance, measure class prediction strength (Aha et al., Salzberg) • CPS = (# times the instance was a nearest neighbour with the correct class) / (# times it was a nearest neighbour) • Select examples with top CPS, add them to labeled data • Here: 4M examples of Reuters labeled
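The CPS measurement in the reversed-roles step can be sketched as below. This is a simplified illustration, assuming a plain 1-nearest-neighbour classifier with overlap distance (number of mismatching feature values) and first-index tie-breaking; the experiments on the slides used TiMBL with other settings. Instances that are never a nearest neighbour get no score (`None`) here, which is also an assumption. The toy data is invented.

```python
def overlap_distance(a, b):
    """Number of feature positions at which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def class_prediction_strength(auto_labeled, gold_labeled):
    """CPS per automatically labeled instance: correct NN uses / NN uses."""
    used = [0] * len(auto_labeled)
    correct = [0] * len(auto_labeled)
    for features, gold in gold_labeled:
        # Nearest automatically labeled instance (ties: lowest index).
        nn = min(range(len(auto_labeled)),
                 key=lambda i: overlap_distance(auto_labeled[i][0], features))
        used[nn] += 1
        if auto_labeled[nn][1] == gold:
            correct[nn] += 1
    return [c / u if u else None for c, u in zip(correct, used)]

def select_top(auto_labeled, cps, n):
    """Return the n scored instances with the highest CPS."""
    scored = [i for i, s in enumerate(cps) if s is not None]
    ranked = sorted(scored, key=lambda i: cps[i], reverse=True)
    return [auto_labeled[i] for i in ranked[:n]]

auto = [(("the", "suspect", "was"), "N"),
        (("we", "suspect", "that"), "V"),
        (("a", "suspect", "that"), "V")]
gold = [(("the", "suspect", "is"), "N"),
        (("the", "suspect", "was"), "N"),
        (("we", "suspect", "it"), "V"),
        (("a", "suspect", "that"), "N")]

cps = class_prediction_strength(auto, gold)
print(cps)
print(select_top(auto, cps, 2))
```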
CPS: examples • High CPS: • zone with far less options , " I-NP-NOFUNC • would allow it to bypass its requirements I-VP-NOFUNC • surprise after he faded from the picture I-VP-VP/S • remain under pressure in the next quarter I-PP-PP-TMP • Low CPS: • ago and are not renewable . _ O-NOFUNC • six world championship points and is just I-NP-NP • is competitive , influenced mainly by the I-VP-VP/S • of key events in the rebellion : I-PP-PP-LOC
Observations, chunking • Testing on WSJ: • Curve flattens • Example boosting (50k and 100k) in line with others • Adding random data does not lower curve • Testing on ATIS: • Curve does not appear to flatten • Adding Brown or example boosting works just as well • Adding random data shows negative effect • Note lower scores on ATIS
Observations, function tagging • Testing on WSJ: • Curve still going up • Brown yields flat curve • Example boosting (100k) unclear • Testing on ATIS: • Curve steeper at later stage • Adding Brown or example boosting works just as well
Summary of results • Adding data from same source is generally good (testing on same data or other data) • Adding data from other source may only be effective when testing on other data • Learning curves testing on other data may go up later • Vocabulary effects
Summary of results (2) • Adding random data labeled with 1 classifier shows predicted negative effect • Negative effect presumably outweighs positive • Except when curve is already flat: negative effects are muted because there is sufficient ‘positive’ data? • Example boosting promising, but • Threshold issue: why 100k of 4M (2.5%)? Higher percentages of course make the curve approach the random-data curve • CPS is weak; smoothing (e.g. Laplace correction) • More work and comparisons needed
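Why raw CPS is weak, and what smoothing would change, can be shown with two numbers. This is a sketch assuming add-one (Laplace) smoothing over the two outcomes "correct" and "incorrect"; the slide names the correction only as a possibility, and the counts below are invented.

```python
def cps_raw(correct, used):
    """Unsmoothed CPS: fraction of nearest-neighbour uses that were correct."""
    return correct / used

def cps_laplace(correct, used):
    """Add-one smoothed CPS over two outcomes (correct / incorrect)."""
    return (correct + 1) / (used + 2)

rarely_used = (1, 1)      # used once, correct once
often_used = (90, 100)    # used 100 times, correct 90 times

print(cps_raw(*rarely_used), cps_raw(*often_used))
print(cps_laplace(*rarely_used), cps_laplace(*often_used))
```

Raw CPS ranks the instance seen once above the one that was right 90 times out of 100; the smoothed score reverses that order, because a single observation no longer counts as certainty.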
Discussion • ‘More data’ tends to hold when source of data added is not changed, except that • Flattened curves appear to remain flat with more data • Vocabulary effects when testing on data from other sources • May produce delayed upward learning curves • Positive effect from adding data from other sources to training • Learning curves are a useful instrument • Comparisons between algorithms and between parameter settings (here: algorithm fixed = bias) • Comparisons between representations • Predictions for annotation projects