1. 1 Bootstrapping without the Boot Jason Eisner
Damianos Karakos Several very funny students and I sent out an April Fool’s conference call this year.
This was one of our suggested topics. And here is the actual paper.
2. 2 Basically a talk about clustering Get a clustering algorithm (unsupervised) from (~any) classifier learning algorithm (supervised).
Lets you do domain-specific clustering.
Cute idea, works well
Builds on some tricks often used in natural language processing
But essentially procedural
Like the tricks that it builds on
Not clear what is being optimized
3. 3 First, a version for the “space” people Want to learn a red/blue classifier
Nearest neighbor, kernel SVM, fit a surface …
4. 4 First, a version for the “space” people Harder if you have fewer training data
But maybe you can use the unlabeled data
5. 5 First, a version for the “space” people How unlabeled can you go?
(Could you use single-link clustering?)
6. 6 “Bootstrapping” How unlabeled can you go?
Let’s try “bootstrapping” (not the statistical bootstrap)
7. 7 “Bootstrapping”
8. 8 “Bootstrapping” Oops!
Doesn’t work even with soft labeling (as in EM)
Sparse good data immediately get swamped by bad guesses
9. 9 “Bootstrapping”
10. 10 “Bootstrapping” When will this work?
11. 11 “Bootstrapping” When will this work?
Depends on where you start …
but real datasets may allow many good starting points
12. 12 “Bootstrapping” Here’s a really bad starting point
incorrect! red & blue actually in same class?
but even if we pick at random, ½ chance of different classes
13. 13 Executive Summary (if you’re not an executive, you may stay for the rest of the talk) What:
We like minimally supervised learning (bootstrapping).
Let’s convert it to unsupervised learning (“strapping”).
How:
If the supervision is so minimal, let’s just guess it!
Lots of guesses → lots of classifiers.
Try to predict which one looks plausible (!?!).
We can learn to make such predictions.
Results (on WSD):
Performance actually goes up!
(Unsupervised WSD for translational senses, English Hansards, 14M words.)
A very popular family of methods is bootstrapping, where you get a machine learning algorithm started with a little bit of supervision,
and it pulls itself up by its bootstraps.
In this talk, we’d like to eliminate that initial bit of supervision and go to unsupervised learning.
Now, how could that be possible?
Well, you could guess the starting point.
You don’t just have to make one guess. Make lots of guesses. Try them in parallel.
Then decide which of the resulting classifiers “looks like a real classifier” for this task.
And when we tried this on word sense disambiguation, it worked.
14. 14 WSD by bootstrapping we know “plant” has 2 senses
we hand-pick 2 words that indicate the desired senses
use the word pair to “seed” some bootstrapping procedure
Start with an ambiguous word like plant. We’re told it has two senses.
We observe that we can pick out those senses with the collocates “leaves” and “machinery.”
And from that seed, that pair of words, we kick off some bootstrapping algorithm that grows a sense classifier for “plant.”
But if we’d picked a different seed, it might have done even better.
What do I mean “better”?
Well, for every seed s, we can measure its fertility f(s) – i.e., how good is the bootstrapped classifier that we grew.
So with a different seed, you could have grown an even mightier oak.
A mighty oak is one that beats the baseline with statistical significance.
15. 15 Yarowsky’s bootstrapping algorithm The minimally supervised scenario from which we’ll eliminate supervision today:
16. 16 Yarowsky’s bootstrapping algorithm
17. 17 Yarowsky’s bootstrapping algorithm
18. 18 Yarowsky’s bootstrapping algorithm
19. 19 Yarowsky’s bootstrapping algorithm
20. 20 Yarowsky’s bootstrapping algorithm
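Since slides 16–20 are animation frames with no transcribed text, here is a minimal Python sketch of the kind of Yarowsky-style self-training loop they step through: label only the tokens containing a seed collocate, learn a decision list, relabel the confident tokens, repeat. The scoring rule, smoothing, threshold, and the toy `contexts` representation are illustrative assumptions, not the exact settings of the talk or of Yarowsky (1995).

```python
import math
from collections import Counter, defaultdict

def yarowsky_bootstrap(contexts, seed_a, seed_b, rounds=10, threshold=1.0):
    """Grow a two-sense classifier from a seed pair of collocates.

    contexts : list of sets of context words, one set per token of the target word
    seed_a, seed_b : collocates assumed to pick out sense A and sense B
    Returns (decision_list, labels): rules (word, sense, score) and per-token labels.
    """
    # Step 1: label only the tokens that contain exactly one of the seed words.
    labels = [None] * len(contexts)
    for i, ctx in enumerate(contexts):
        if seed_a in ctx and seed_b not in ctx:
            labels[i] = "A"
        elif seed_b in ctx and seed_a not in ctx:
            labels[i] = "B"

    decision_list = []
    for _ in range(rounds):
        # Step 2: train a decision list on the currently labeled tokens.
        counts = defaultdict(Counter)                        # context word -> sense counts
        for ctx, lab in zip(contexts, labels):
            if lab is not None:
                for w in ctx:
                    counts[w][lab] += 1
        decision_list = []
        for w, c in counts.items():
            llr = math.log((c["A"] + 0.1) / (c["B"] + 0.1))  # smoothed log-likelihood ratio
            decision_list.append((w, "A" if llr > 0 else "B", abs(llr)))
        decision_list.sort(key=lambda rule: -rule[2])        # strongest evidence first

        # Step 3: relabel each token by its strongest matching rule, if confident enough.
        new_labels = list(labels)
        for i, ctx in enumerate(contexts):
            for w, sense, score in decision_list:
                if w in ctx:
                    if score >= threshold:
                        new_labels[i] = sense
                    break
        if new_labels == labels:                             # converged
            break
        labels = new_labels
    return decision_list, labels

# Toy usage: six tokens of "plant", seeded with "leaves" vs. "machinery".
contexts = [{"green", "leaves"}, {"machinery", "noise"}, {"growth", "green"},
            {"factory", "noise"}, {"leaves", "soil"}, {"machinery", "factory"}]
rules, senses = yarowsky_bootstrap(contexts, "leaves", "machinery")
```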
21. 21 Other applications Yarowsky (1995) has had a lot of influence …
~ 10 bootstrapping papers at EMNLP’05
Examples (see paper):
Is a webpage at CMU a course home page?
Is a webpage relevant to the user? (relevance feedback)
Is an English noun subjective (e.g., “atrocious”)?
Is a noun phrase a person, organization, or location?
Is this noun masculine or feminine? (in Spanish, Romanian,…)
Is this substring a noun phrase?
Any example of EM … including grammar induction
22. 22 WSD by bootstrapping we know “plant” has 2 senses
we hand-pick 2 words that indicate the desired senses
use the word pair to “seed” some bootstrapping procedure
Start with an ambiguous word like plant. We’re told it has two senses.
We observe that we can pick out those senses with the collocates “leaves” and “machinery.”
And from that seed, that pair of words, we kick off some bootstrapping algorithm that grows a sense classifier for “plant.”
But if we’d picked a different seed, it might have done even better.
What do I mean “better”?
Well, for every seed s, we can measure its fertility f(s) – i.e., how good is the bootstrapped classifier that we grew.
So with a different seed, you could have grown an even mightier oak.
A mighty oak is one that beats the baseline with statistical significance.
23. 23 How do we choose among seeds? Want to maximize fertility but we can’t measure it!
Fertility is measured against the right answers, the gold standard.
If you had the gold standard, you’d use it to do supervised learning.
24. 24 How do we choose among seeds? Want to maximize fertility but we can’t measure it!
Fertility is measured against the right answers, the gold standard.
If you had the gold standard, you’d use it to do supervised learning.
Without a gold standard, you have to do learning.
25. 25 Why not pick a seed by hand? Your intuition might not be trustworthy
(even a sensible seed could go awry)
You don’t speak the language / sublanguage
You want to bootstrap lots of classifiers
All words of a language
Multiple languages
On ad hoc corpora, i.e., results of a search query
You’re not sure that # of senses = 2
(life, manufacturing) vs. (life, manufacturing, sow)
which works better?
maybe you don’t even have an intuition, because you don’t speak the language.
Or maybe you just don’t have time to intuit seeds for all the classifiers you want to bootstrap.
Lots of words, lots of languages, so little time.
Or maybe you want to discover the senses of “plant” in a particular document set returned by a search engine,
so you need seeds appropriate to that document set.
26. 26 How do we choose among seeds? Our answer (ahem) …
You look at the seeds, and who knows? But you look at the classifiers they grew, and you can tell by the smell when all is not well.
Even a computer could see that something went wrong with the seed on the left. We don’t need no stinkin’ gold standard.
Instead of finding out the classifier is actually good, we check whether it passes a few sniff tests. And we hope that correlates with the real fertility and picks good classifiers.
27. 27 “Strapping” Somehow pick a bunch of candidate seeds
For each candidate seed s:
grow a classifier Cs
compute h(s) (i.e., guess whether s was fertile)
Return Cs where s maximizes h(s)
heuristically generate a lot of seeds
for each seed, grow a classifier, then figure out if you liked the classifier, then return the classifier you like best.
generate and test
one thing about those methods is that they combine many classifiers.
we’re not doing that. we just return a single classifier.
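The pseudocode on this slide is just a generate-and-test loop. A minimal sketch, assuming a `bootstrap(seed)` function (e.g., the Yarowsky-style loop sketched earlier) and some fertility-guessing heuristic `h`; the names are placeholders, not APIs from the paper:

```python
def strap(candidate_seeds, bootstrap, h):
    """Grow one classifier per candidate seed, guess each seed's fertility with
    the unsupervised heuristic h, and return the single best-looking classifier."""
    grown = {seed: bootstrap(seed) for seed in candidate_seeds}
    best_seed = max(grown, key=lambda seed: h(grown[seed]))   # argmax of guessed fertility
    return grown[best_seed], best_seed
```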
28. 28 Data for this talk Unsupervised learning from 14M English words (transcribed formal speech).
Focus on 6 ambiguous word types:
drug, duty, land, language, position, sentence
each has from 300 to 3000 tokens
basically 2 clusters.
29. 29 Data for this talk Unsupervised learning from 14M English words (transcribed formal speech).
Focus on 6 ambiguous word types:
drug, duty, land, language, position, sentence
We’re going to try to learn the contextual clues in English that tell us whether this English sentence has got drug1 or drug2.
If we had enough translated data, we could do supervised learning – we’d see how the words were translated depending on context.
But we’re going to assume we don’t have enough data for that, and do unsupervised learning, i.e., from the English only.
30. 30 Data for this talk Unsupervised learning from 14M English words (transcribed formal speech).
Focus on 6 ambiguous word types:
drug, duty, land, language, position, sentence
Thank you Canada … you host us … you give us data
31. 31 Strapping word-sense classifiers
Quickly pick a bunch of candidate seeds
For each candidate seed s:
grow a classifier Cs
compute h(s) (i.e., guess whether s was fertile)
Return Cs where s maximizes h(s)
32. 32 Strapping word-sense classifiers Quickly pick a bunch of candidate seeds
For each candidate seed s:
grow a classifier Cs
compute h(s) (i.e., guess whether s was fertile)
Return Cs where s maximizes h(s)
33. 33 Strapping word-sense classifiers Quickly pick a bunch of candidate seeds
For each candidate seed s:
grow a classifier Cs
compute h(s) (i.e., guess whether s was fertile)
Return Cs where s maximizes h(s)
“good” means accuracy is 3-4% worse than best
“lousy” means accuracy is 50%, i.e., random.
traffickers & trafficking appear a lot with drug and never appear together – but they pick out the SAME sense.
length could show up with either sense. But if it’s a life sentence, you don’t otherwise mention its length, so these two words never appear together.
34. 34 Strapping word-sense classifiers Quickly pick a bunch of candidate seeds
For each candidate seed s:
grow a classifier Cs
compute h(s) (i.e., guess whether s was fertile)
Return Cs where s maximizes h(s)
Followed some careful written rules to keep ourselves honest.
35. 35 Strapping word-sense classifiers Quickly pick a bunch of candidate seeds
For each candidate seed s:
grow a classifier Cs
compute h(s) (i.e., guess whether s was fertile)
Return Cs where s maximizes h(s)
“good” means accuracy is 3-4% worse than best
“lousy” means accuracy is 50%, i.e., random.
traffickers & trafficking appear a lot with drug and never appear together – but they pick out the SAME sense.
length could show up with either sense. But if it’s a life sentence, you don’t otherwise mention its length, so these two words never appear together.
36. 36 Unsupervised WSD as clustering Easy to tell which clustering is “best”
A good unsupervised clustering has high
p(data | label) – minimum-variance clustering
p(data) – EM clustering
MI(data, label) – information bottleneck clustering
Well, unsupervised WSD is really trying to cluster tokens of the target word into two senses.
If this were Euclidean clustering, you could tell me which clusterings looked better.
In fact there are several metrics that people use, and we could draw on any of these.
We’ll do something like this on the next slide.
We’re also careful to define our metrics so a skewed classifier like this one doesn’t get an inappropriately high score.
In fact, EM clustering is a kind of bootstrapping
You could try many seeds (starting points)
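To make the last of these scores concrete, here is a small sketch that computes mutual information between induced sense labels and an observed context feature directly from counts. It is a generic MI(data, label) estimate, not the exact objective or skew correction used in the talk.

```python
import math
from collections import Counter

def mutual_information(labels, features):
    """I(label; feature) in nats, estimated from paired observations.

    labels   : induced sense label for each token (e.g., 'A' or 'B')
    features : an observed context feature for the same token (e.g., a collocate)
    """
    n = len(labels)
    joint = Counter(zip(labels, features))
    marg_l = Counter(labels)
    marg_f = Counter(features)
    mi = 0.0
    for (l, f), count in joint.items():
        p_joint = count / n
        mi += p_joint * math.log(p_joint * n * n / (marg_l[l] * marg_f[f]))
    return mi
```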
37. 37 Clue #1: Confidence of the classifier Final decision list for Cs
Does it confidently classify the training tokens, on average?
Opens the “black box” classifier to assess confidence (but so does bootstrapping itself)
Think about the final phase of bootstrapping, when you’ve got some words labeled as A, and some labeled B, and you learn a decision list to tell them apart.
If you got the true senses, it should be relatively easy to tell them apart.
But if you got some random labeling, that final decision list won’t be very sure of itself.
Look at the final learned decision list Cs.
When classifying training tokens, is it pretty sure of itself on average?
Or more precisely,
Then correct for the “classifier skew.” Cs may be sure of itself just because it decided everything is sense A.
Opens the decision-list black box a bit … but Yarowsky’s algorithm already needs to choose the “most confidently classified” examples.
Other variants on this, e.g.,
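A rough rendering of this clue in code: average how confidently the final decision list tags each training token, then discount the trivial confidence a one-class (skewed) labeling would get. The particular skew correction here (subtracting the majority-class proportion) is an illustrative assumption, not necessarily the correction used in the paper.

```python
from collections import Counter

def confidence_clue(decisions):
    """Clue #1: average confidence of the final classifier on its own training tokens.

    decisions : list of (predicted_sense, confidence) pairs, confidence in [0, 1].
    """
    avg_confidence = sum(conf for _, conf in decisions) / len(decisions)
    # Skew correction: a classifier that dumps everything into one sense is
    # "confident" for the wrong reason, so subtract that trivial baseline.
    sense_counts = Counter(sense for sense, _ in decisions)
    skew = sense_counts.most_common(1)[0][1] / len(decisions)
    return avg_confidence - skew
```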
38. 38 Clue #1: Confidence of the classifier Q: For an SVM kernel classifier, what is confidence?
A: We are more confident in a large-margin classifier.
This leads to semi-supervised SVMs:
A labeling smells good if ∃ large-margin classifier for it
De Bie & Cristianini 2003, Xu et al 2004 optimize over all labelings, not restricting to bootstrapped ones as we do.
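For the SVM version of the same clue, one concrete (and assumed) proxy is: train a linear SVM on the candidate labeling and measure the distance from its hyperplane to the closest training point. The sketch below uses scikit-learn's LinearSVC; it is a simplification of the semi-supervised SVM formulations cited above, which optimize over all labelings rather than scoring one.

```python
import numpy as np
from sklearn.svm import LinearSVC

def labeling_margin_score(X, guessed_labels):
    """Score a candidate labeling by how large a margin a linear classifier
    trained on it achieves (distance from the hyperplane to the nearest point)."""
    clf = LinearSVC(C=10.0, max_iter=10000).fit(X, guessed_labels)
    distances = np.abs(clf.decision_function(X)) / np.linalg.norm(clf.coef_)
    return float(distances.min())
```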
39. 39 Clue #2: Agreement with other classifiers Intuition: for WSD, any reasonable seed s should find a true sense distinction.
So it should agree with some other reasonable seeds r that find the same distinction.
Two 90-10 classifiers will tend to agree most of the time just by chance.
So we actually do a significance test …
40. 40 Clue #2: Agreement with other classifiers
41. 41 Clue #2: Agreement with other classifiers Intuition: for WSD, any reasonable seed s should find a true sense distinction.
So it should agree with some other reasonable seeds r that find the same distinction.
Two 90-10 classifiers will tend to agree most of the time just by chance.
So we actually do a significance test …
42. 42 Clue #2: Agreement with other classifiers Remember, ˝ of starting pairs are bad (on same spiral)
But they all lead to different partitions: poor agreement!
The other ½ all lead to the same correct 2-spiral partition
(if spirals are dense and well-separated)
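In code, the "agreement, corrected for chance" idea might look like the following sketch: compare observed agreement between two seeds' labelings with the chance agreement implied by their label marginals, expressed as a z-score, trying both possible alignments of the arbitrary sense names. The exact significance test in the paper may differ.

```python
import math
from collections import Counter

def agreement_clue(labels_s, labels_r):
    """How surprisingly often two sense labelings agree (z-score above chance).

    labels_s, labels_r : parallel lists of 'A'/'B' labels over the same tokens.
    Sense names are arbitrary, so we take the better of the two alignments.
    """
    n = len(labels_s)
    best = -math.inf
    for flip in (False, True):
        other = [("B" if l == "A" else "A") for l in labels_r] if flip else labels_r
        agree = sum(a == b for a, b in zip(labels_s, other)) / n
        p_s, p_r = Counter(labels_s), Counter(other)
        chance = sum((p_s[k] / n) * (p_r[k] / n) for k in ("A", "B"))
        stderr = math.sqrt(chance * (1 - chance) / n) or 1e-9   # avoid divide-by-zero
        best = max(best, (agree - chance) / stderr)
    return best

def total_agreement(labels_s, other_labelings):
    """Clue #2: average chance-corrected agreement of seed s with the other seeds."""
    return sum(agreement_clue(labels_s, r) for r in other_labelings) / len(other_labelings)
```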
43. 43 Clue #3: Robustness of the seed Cs was trained on the original dataset.
Construct 10 new datasets by resampling the data (“bagging”).
Use seed s to bootstrap a classifier on each new dataset.
How well, on average, do these agree with the original Cs? (again use prob of agreeing this well by chance)
Suppose your seed didn’t lock onto the signal. It picked up on some irrelevant feature of the data.
Then if you changed the data a bit, you’d get a different answer.
So let’s try planting the same seed in different data.
If the answer is always different, then who do you trust? You don’t trust any of them.
You should only trust a seed if it seems to get the same answer no matter what.
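A sketch of this robustness clue, assuming a `bootstrap(data, seed)` function and a `label_with(classifier, data)` helper (both placeholder names) and reusing the chance-corrected agreement score sketched under clue #2; the resampling details are assumptions.

```python
import random

def robustness_clue(data, seed, bootstrap, label_with, n_resamples=10, rng=None):
    """Clue #3: does the same seed give roughly the same labeling on perturbed data?"""
    rng = rng or random.Random(0)
    original_labels = label_with(bootstrap(data, seed), data)
    scores = []
    for _ in range(n_resamples):
        resampled = [rng.choice(data) for _ in data]        # bagging-style resample
        clf = bootstrap(resampled, seed)
        relabeled = label_with(clf, data)                   # relabel the *original* tokens
        scores.append(agreement_clue(original_labels, relabeled))
    return sum(scores) / len(scores)
```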
44. 44 How well did we predict actual fertility f(s)? Measure true fertility f(s) for all 200 seeds.
Spearman rank correlation with f(s):
0.748 Confidence of classifier
0.785 Agreement with other classifiers
0.764 Robustness of the seed
(avg correlation over 6 words)
0.794 Average rank of all 3 clues
Per-word rank correlations, by clue (A = agreement, R = robustness, C = confidence, E = equal-weighted regression):

word       A       R       C       E
drug       0.708   0.697   0.651   0.739
sentence   0.900   0.875   0.797   0.890
land       0.720   0.602   0.713   0.697
duty       0.855   0.902   0.747   0.879
language   0.850   0.716   0.746   0.776
position   0.679   0.797   0.836   0.781
average    0.7853  0.7648  0.7483  0.7937

(The column averages are the figures quoted on the slide above.)
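For reference, the rank correlations and the equal-weight combination reported above can be computed along these lines; ties are broken arbitrarily here for brevity, whereas a careful implementation would use average ranks.

```python
def ranks(values):
    """Rank positions (1 = smallest value); ties broken arbitrarily for brevity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)

def average_rank(clue_scores):
    """Equal-weight combination: average each seed's rank across all clues.

    clue_scores : dict clue_name -> list of clue values, one per seed.
    """
    per_clue = [ranks(scores) for scores in clue_scores.values()]
    n_seeds = len(per_clue[0])
    return [sum(r[i] for r in per_clue) / len(per_clue) for i in range(n_seeds)]
```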
45. 45 Smarter combination of clues? Really want a “meta-classifier”!
Output: Distinguishes good from bad seeds.
Input: Multiple fertility clues for each seed (amount of confidence, agreement, robustness, etc.)
how good seeds behave: how confident they are, how robust, how agreeable, …
Maybe confidence is important and robustness isn’t, or vice-versa
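The meta-classifier idea can be prototyped as a simple regression from clue values to fertility, fit on words for which a gold standard happens to exist and applied to new words; a polynomial SVM (slide 55) would slot in the same way. This least-squares sketch assumes numpy and is illustrative only.

```python
import numpy as np

def fit_fertility_predictor(clue_matrix, fertilities):
    """Fit a linear predictor from clue values to true fertility.

    clue_matrix : array of shape (n_seeds, n_clues), measured on words WITH gold data
    fertilities : array of shape (n_seeds,), true fertility f(s) for those seeds
    Returns a function mapping one clue vector (for a new word's seed) to a score.
    """
    X = np.hstack([clue_matrix, np.ones((clue_matrix.shape[0], 1))])   # add bias column
    weights, *_ = np.linalg.lstsq(X, fertilities, rcond=None)
    return lambda clues: float(np.append(np.asarray(clues, dtype=float), 1.0) @ weights)
```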
46. 46 Yes, the test is still unsupervised WSD? Unsupervised WSD research has always relied on supervised WSD instances to learn about the space (e.g., what kinds of features & classifiers work).
We’re just using the supervised instances to learn about the WSD problem.
They tell us what kind of classifier to look for when we’re doing unsupervised sense discovery.
47. 47 How well did we predict actual fertility f(s)? Spearman rank correlation with f(s):
0.748 Confidence of classifier
0.785 Agreement with other classifiers
0.764 Robustness of the seed
0.794 Average rank of all 3 clues
0.851 Weighted average of clues
word       plant/tank  pseudowords
drug       0.891       0.863
duty       0.873       0.905
land       0.752       0.718
language   0.839       0.825
position   0.811       0.842
sentence   0.942       0.937
average    0.851       0.848
48. 48 How good are the strapped classifiers??? the one that agreed best with gold standard …
the strapped classifier always had significantly better than chance agreement with the gold standard …
it always did significantly better than either hand-picked seed.
A lot of the hand-picked seeds turned out to be duds, actually, They just sort of grew weeds, at the baseline.
This doesn’t look so good for ordinary bootstrapping! But of course Yarowsky had much better results picking seeds by hand. Probably because he used 460 million words. We only used 14M, and simpler feature sets.
So we had fewer good seeds available to either the human or the machine.
49. 49 Hard word, low baseline: drug You can see that 89% correlation between “how good they smelled” and up “how tall they actually grew.” (gesture right & up)
These seeds stink, and they basically just grew weeds, down here around the 50% baseline.
At the other end, our best-smelling seed actually grew the most accurate classifier.
Here are the 2 hand-picked seeds, somewhere in between.
These are the most confident, most agreeable, and most robust seeds, and in fact our winner was also the most robust seed.
This was actually our worst-performing word, but the plot is pretty typical.
50. 50 Hard word, high baseline: land Here’s another hard word – our second worst – again, none of the 200 seeds do that great.
But the baseline now is much higher, and most seeds perform below it. (Could be why the correlation’s a little worse.)
In fact, lots of them are as low as you can go – 50%. And they smell pretty bad.
At least the hand-picked seeds match baseline, but you could match baseline a lot easier by giving the same sense to all tokens.
Even these guys are now below baseline.
If you go by classifier confidence, the least useful clue on the previous graph, you can now eke out a small but significant margin. This seed is actually 3rd from the top.
And you get the very top seed if you insist that it does well on all clues, using our weighted combination.
51. 51 Reducing supervision for decision-list WSD
52. 52 How about no supervision at all? We’re just using the supervised instances to learn about the WSD problem.
They tell us what kind of classifier to look for when we’re doing unsupervised sense discovery.
53. 53 Automatic construction of pseudowords … and we know in which contexts it was originally death, or originally page.
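The pseudoword construction itself is a few lines: merge every occurrence of two real words (the slide's example pair is "death" and "page") into one artificial ambiguous token, remembering the original word as the gold sense. Tokenization is simplified in this sketch.

```python
def make_pseudoword_corpus(tokens, word1="death", word2="page"):
    """Replace word1/word2 with a merged pseudoword; keep the true word as gold sense."""
    pseudo = f"{word1}-{word2}"
    new_tokens, gold = [], []
    for position, token in enumerate(tokens):
        if token in (word1, word2):
            new_tokens.append(pseudo)
            gold.append((position, token))   # where it occurred and which word it really was
        else:
            new_tokens.append(token)
    return new_tokens, gold

# Example: ["turn", "the", "page", "of", "death"] ->
#          ["turn", "the", "death-page", "of", "death-page"], gold = [(2, "page"), (4, "death")]
```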
54. 54 Does pseudoword training work as well? This doesn’t look so good for ordinary bootstrapping! But of course Yarowsky did much better using 460 million words. We only used 14M, and simpler feature sets.
So we had fewer good seeds available to either the human or the machine.
55. 55 Opens up lots of future work Compare to other unsupervised methods (Schütze 1998)
More than 2 senses
Other tasks (discussed in the paper!)
Lots of people have used bootstrapping!
Seed grammar induction with basic word order facts?
Make WSD even smarter:
Better seed generation (e.g., learned features → new seeds)
Better meta-classifier (e.g., polynomial SVM)
Additional clues: Variant ways to measure confidence, etc.
Task-specific clues
56. 56 Future work: Task-specific clues
57. 57 Summary Bootstrapping requires a “seed” of knowledge.
Strapping = try to guess this seed.
Try many reasonable seeds.
See which ones grow plausibly.
You can learn what’s plausible.
Useful because it eliminates the human:
You may need to bootstrap often.
You may not have a human with the appropriate knowledge.
Human-picked seeds often go awry, anyway.
Works great for WSD! (Other unsup. learning too?)