Boosting Textual Source Attribution Foaad Khosmood Department of Computer Science University of California, Santa Cruz Winter 2006
HEY??? What’s so funny? • What makes something funny? • Can we tell by just reading? Can a computer? • Shakespeare’s Comedies and Tragedies. • Actually, Comedies, Tragedies, Historical Plays and Sonnets.
Experimenting with Boosting • Most work done on binary classification. • Needs lots of “weak” learners. • Some variants work well with limited Data Set. • Will provide knowledge about importance of features.
Data Set (Training) • Tragedies • Antony and Cleopatra • Titus Andronicus • Hamlet • Julius Caesar • Romeo and Juliet • Comedies • Measure for Measure • Much Ado about Nothing • Merchant of Venice • Midsummer Night’s Dream • Taming of the Shrew • Twelfth Night
Data Set (Test) • All’s Well That Ends Well [c] • Comedy of Errors [c] • As You Like It [c] • The Tempest [c] • Merry Wives of Windsor [c] • King Lear [t] • Macbeth [t] • Coriolanus [t] • Othello [t]
Feature Selection • Features: words • Selection method: picked the 2500 most common words in the Training Set • Preprocessing: 300 common English words and grammar operators removed • HTML and stage directions removed • 429 of the 2500 words were not common to all plays; these 429 were chosen as weak learner functions (in this particular run)
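The selection steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the original program: the function name `select_features`, the toy plays, and the one-word stop list are all hypothetical, and "not common to all plays" is read here as "missing from at least one training play".

```python
from collections import Counter

def select_features(plays, stop_words, top_n):
    """plays maps a play name to its list of lowercase word tokens."""
    # Count words over the whole training set, skipping stop words.
    totals = Counter()
    for tokens in plays.values():
        totals.update(t for t in tokens if t not in stop_words)
    # Keep the top_n most frequent words (2500 on the slide).
    vocabulary = [word for word, _ in totals.most_common(top_n)]
    # Keep only words that are absent from at least one play
    # (assumed reading of "not common to all plays").
    word_sets = [set(tokens) for tokens in plays.values()]
    discriminating = [w for w in vocabulary
                      if not all(w in s for s in word_sets)]
    return vocabulary, discriminating

# Toy data, invented for illustration only.
plays = {
    "play_a": "love love crown jest fool".split(),
    "play_b": "love blood crown death".split(),
    "play_c": "love jest fool the the the".split(),
}
vocabulary, discriminating = select_features(plays, stop_words={"the"}, top_n=6)
print(discriminating)
```

In the toy data "love" occurs in every play, so it is dropped from the discriminating set, in the same way the slide narrows 2500 words down to 429.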
Comedy Words and Tragedy Words • 429 words: 225 (comedy), 204 (tragedy) • Data: vector of 2500 words, X = [X1, X2, …, X2500] • Weak learners F1(X) … F429(X), each returning 1 for a positive hit.
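One way to picture the weak learners is as one tiny function per selected word. The sketch below assumes that X holds word counts and that a "positive hit" means the word occurs at least once; neither detail is stated on the slide, and the names `vectorize` and `make_weak_learner` are hypothetical.

```python
from collections import Counter

def vectorize(tokens, vocabulary):
    """Turn a token list into a count vector aligned with the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def make_weak_learner(index):
    """Build F_index: returns 1 if the word at that position occurs, else 0."""
    def learner(x):
        return 1 if x[index] > 0 else 0
    return learner

# Toy vocabulary of three words instead of 2500.
vocabulary = ["crown", "jest", "blood"]
learners = [make_weak_learner(i) for i in range(len(vocabulary))]

x = vectorize("a jest and a jest".split(), vocabulary)
hits = [f(x) for f in learners]
print(x, hits)
```

Each learner looks at a single coordinate of X, which is why a large vocabulary yields many weak learners so cheaply.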
Boosting • A mix of LPBoost and TotalBoost • No termination condition (finite set of weak learners) • Didn’t have a Gamma function; used Eta (error) instead • Didn’t use the zero-sum constraint on normalization of weight updates.
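The slide does not spell out the weight update, so the sketch below swaps in a plain AdaBoost-style rule rather than the LPBoost/TotalBoost mix described above. It only illustrates the general shape: one pass over a fixed, finite list of weak learners, with each learner's weight derived from its weighted error. The function name `boost_weights`, the clamping constant `eps`, and the toy data are assumptions.

```python
import math

def boost_weights(hits, labels, eps=1e-6):
    """hits[i][j] is 1 if weak learner j fires on training play i, else 0.
    labels[i] is +1 (comedy) or -1 (tragedy). Returns one weight per learner."""
    n, m = len(labels), len(hits[0])
    dist = [1.0 / n] * n              # distribution over training plays
    weights = []
    for j in range(m):                # single pass, no termination test
        # Assumption: a hit votes for comedy, a miss votes for tragedy.
        preds = [1 if hits[i][j] else -1 for i in range(n)]
        err = sum(d for d, p, y in zip(dist, preds, labels) if p != y)
        err = min(max(err, eps), 1 - eps)   # clamp so the log stays finite
        alpha = 0.5 * math.log((1 - err) / err)
        weights.append(alpha)
        # AdaBoost-style re-weighting of the plays, then normalisation.
        dist = [d * math.exp(-alpha * p * y)
                for d, p, y in zip(dist, preds, labels)]
        total = sum(dist)
        dist = [d / total for d in dist]
    return weights

# Toy data: learner 0 fires only on comedies, learner 1 only on tragedies.
hits = [[1, 0], [1, 0], [0, 1], [0, 1]]
labels = [1, 1, -1, -1]
weights = boost_weights(hits, labels)
print(weights)
```

With this rule a word that tracks tragedies ends up with a negative weight, which is one way weights of both signs can arise.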
Classification • Used accumulated weights at the very end • Every presence of a word in the Test Corpus adds (1*W) to totalW; some W’s are negative • At the end it was a simple matter of observing whether the result was positive or negative, and by how much.
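The scoring rule above amounts to summing the weights of the words that appear in a test play. A minimal sketch, with made-up words and weights and a hypothetical function name `score`:

```python
def score(test_words, word_weights):
    """Add W once for every weighted word that is present in the test play."""
    present = set(test_words)
    return sum(w for word, w in word_weights.items() if word in present)

# Hypothetical weights, not values from the original run.
word_weights = {"jest": 1.5, "fool": 0.5, "blood": -2.0}

total_a = score("a fool and a jest".split(), word_weights)
total_b = score("blood and jest".split(), word_weights)
print(total_a, total_b)
```

The sign of the total gives the predicted class and its magnitude gives a rough sense of confidence.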
Program Output
[root@localhost output]# ./classify.sh
00_allswell.html-ratio.txt: 14.6807
01_comedyErrors.html-ratio.txt: 13.2634
02_measure.html-ratio.txt: 34.2748
03_muchAdo.html-ratio.txt: -6.43018
04_asyoulikeit.html-ratio.txt: 18.8413
05_cleopatra.html-ratio.txt: 14.1148
06_lear.html-ratio.txt: 32.2858
07_macbeth.html-ratio.txt: -21.095
08_coriolanus.html-ratio.txt: 43.5599
09_titus.html-ratio.txt: -3.31167
10_cleopatraFull.html-ratio.txt: -300.179
11_learFull.html-ratio.txt: 356.504
13_tempestFull.html-ratio.txt: 454.171
14_marryWivesFull.html-ratio.txt: 147.738
15_measure2.html-ratio.txt: 39.0357
16_measureFull.html-ratio.txt: 112.527
17_muchAdoFull.html-ratio.txt: 256.078
18_veronaFull.html-ratio.txt: -222.444
19_othelloFull.html-ratio.txt: -433.769
20_titusFull.html-ratio.txt: -564.977
Results ([1] = classified correctly, [0] = misclassified) • All’s Well That Ends Well [c][1] • Comedy of Errors [c][1] • As You Like It [c][1] • The Tempest [c][1] • Merry Wives of Windsor [c][1] • King Lear [t][0] • Macbeth [t][1] • Coriolanus [t][0] • Othello [t][1] • 2/9 mistakes; 7/9 correct, or about 78% (also 66% and 69%) • Previous run on Neural Net (different setup: 5/13 61%) - With no proportionals!
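The tally can be checked against the scores in the program output. The sketch below assumes that a positive score means comedy and a negative score means tragedy, and it assumes which output file corresponds to each test play; both are inferred from the file names and numbers, not stated in the slides.

```python
# (score from the program output, true label) for the nine test plays.
scores = {
    "00_allswell": (14.6807, "c"),
    "01_comedyErrors": (13.2634, "c"),
    "04_asyoulikeit": (18.8413, "c"),
    "13_tempestFull": (454.171, "c"),
    "14_marryWivesFull": (147.738, "c"),
    "06_lear": (32.2858, "t"),
    "07_macbeth": (-21.095, "t"),
    "08_coriolanus": (43.5599, "t"),
    "19_othelloFull": (-433.769, "t"),
}

# Assumed sign convention: positive total = comedy, negative = tragedy.
predicted = {name: ("c" if s > 0 else "t") for name, (s, _) in scores.items()}
mistakes = [name for name, (_, label) in scores.items()
            if predicted[name] != label]
accuracy = 1 - len(mistakes) / len(scores)
print(mistakes, round(accuracy, 3))
```

Under these assumptions the two misclassified plays are King Lear and Coriolanus, matching the two [0] entries, and the accuracy works out to 7/9.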
Challenges • Natural language has a lot of nuances that could make a difference (preprocessing methods, “common word” sets, adaptations) • Boosting has great potential in this area • Words provide an easy method for coming up with (many) weak learners