Third Recognizing Textual Entailment Challenge
Potential SNeRG Submission
RTE3 Quick Notes
• RTE web site: http://www.pascal-network.org/Challenges/RTE3/
• Textual Entailment resource pool: http://aclweb.org/aclwiki/index.php?title=Textual_Entailment_Resource_Pool
• New development set released last week to correct errors
• Test set released on March 5th
• !!! New !!! submission date: March 12th
• Report deadline: March 26th
Development set examples
• Example of a YES result:
<pair id="5" entailment="YES" task="IE" length="short">
  <t>A bus collision with a truck in Uganda has resulted in at least 30 fatalities and has left a further 21 injured.</t>
  <h>30 die in a bus collision in Uganda.</h>
</pair>
• Example of a NO result:
<pair id="20" entailment="NO" task="IE" length="short">
  <t>Blue Mountain Lumber is a subsidiary of Malaysian forestry transnational corporation, Ernslaw One.</t>
  <h>Blue Mountain Lumber owns Ernslaw One.</h>
</pair>
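A rough MATLAB sketch of how the pairs could be read in for our implementation, using regular expressions over the tag structure shown above. The file name rte3_dev.xml is an assumption, not the official name.

% Pull (label, text, hypothesis) triples out of the development set.
xml = fileread('rte3_dev.xml');                                % assumed file name
pairs = regexp(xml, '<pair[^>]*entailment="(\w+)"[^>]*>.*?<t>(.*?)</t>.*?<h>(.*?)</h>', 'tokens');
labels = cellfun(@(p) p{1}, pairs, 'UniformOutput', false);    % 'YES' / 'NO'
texts  = cellfun(@(p) p{2}, pairs, 'UniformOutput', false);
hyps   = cellfun(@(p) p{3}, pairs, 'UniformOutput', false);
fprintf('Loaded %d pairs\n', numel(pairs));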
Development set examples – cont.
• Four different types of entailment tasks:
  • Information Retrieval (IR)
  • Question Answering (QA)
  • Information Extraction (IE)
  • Multi-document summarization (SUM)
• The development set consists of 200 samples of each task type
• 400 pairs evaluate to "YES" and 400 to "NO"
• Another attribute, "length", has only 134 long samples and 666 short. [Note to self: gather a group of demon hunters to hunt down the short samples; will need volunteers and holy water.]
Evaluation
• Two submissions per team can be made
• Program output is a file with the following format:
  Line 1 must contain: "ranked:<blank space>yes/no"
  Lines 2..end contain: "pair_id<blank space>judgment"
  For example:
  ranked: yes
  4 YES
  3 YES
  6 YES
  1 NO
  5 NO
  2 NO
• Accuracy is calculated from the fraction of answers returned correctly
• Precision is determined by the order and the correctness of the returned answers, using the formula (a sketch of both scores follows below):
  precision = 1/R * sum for i = 1 to n of ( E(i) * #correct-up-to-pair-i / i )
  where n is the number of pairs in the test set, R is the total number of positive pairs in the test set, E(i) is 1 if the i-th pair is positive and 0 otherwise, and i ranges over the pairs ordered by their ranking.
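A minimal MATLAB sketch of the two scores above, assuming pred and gold are cell arrays of 'YES'/'NO' strings already ordered by rank (most confident first). The function and variable names are mine.

% Compute accuracy and the ranked precision score described above.
function [acc, prec] = rte_scores(pred, gold)
    correct = strcmp(pred(:), gold(:));            % per-pair correctness (column vector)
    acc = sum(correct) / numel(gold);              % plain accuracy

    E = strcmp(gold(:), 'YES');                    % E(i) = 1 if the i-th pair is positive
    R = sum(E);                                    % total positive pairs
    running = cumsum(correct);                     % #correct up to pair i
    i = (1:numel(gold))';
    prec = (1 / R) * sum(E .* (running ./ i));     % the formula from the slide
end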
Possible Implementation
• Discover features that can be measured with a continuous variable, for example:
  Wordbag match ratio = # of words matched between text and hypothesis / # of words in the hypothesis
• Arrange the feature values in a feature vector x
• Apply the general multivariate normal density to the assembled feature vector x (a sketch follows below)
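A minimal sketch of this idea in MATLAB, assuming the Statistics Toolbox is available for mvnpdf and that the class means and covariances (muY, SigY, muN, SigN) have already been estimated from the development set. The function names are mine, not part of any submission.

% Score one text/hypothesis pair with a single wordbag-ratio feature and
% per-class Gaussian densities.
function label = classify_pair(t, h, muY, SigY, muN, SigN)
    x = wordbag_ratio(t, h);                        % 1-D feature vector for now
    if mvnpdf(x, muY, SigY) > mvnpdf(x, muN, SigN)  % compare class-conditional densities
        label = 'YES';
    else
        label = 'NO';
    end
end

function r = wordbag_ratio(t, h)
    tw = strsplit(lower(t));                        % naive whitespace tokenization
    hw = strsplit(lower(h));
    r = sum(ismember(hw, tw)) / numel(hw);          % matched words / |hypothesis|
end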
Implementation to Determine Baseline
• I have implemented a baseline to estimate what we can expect from a full implementation of all syntactic features
• First baseline result: one feature, wordbag count > n, where n is chosen after the development set is processed
  Success: 509, Fail: 290, final rate: 63.9%
• Second baseline result: simple preprocessing plus the wordbag count: removing punctuation, case insensitivity, ignoring simple words (a sketch follows below)
  Success: 534, Fail: 265, final rate: 66.8%
• Attempted a little semantic processing, such as increasing weight based on "negative" words for returning negative results, but the results did not improve
• In the RTE2 competition the highest accuracy was only 70%!
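A rough sketch of the second baseline in MATLAB. The stopword list here is a placeholder for the "simple words", and the threshold n would be tuned on the development set; neither is the exact setting used for the numbers above.

% Judge a pair by thresholding the preprocessed wordbag count.
function label = baseline_judge(t, h, n)
    stop = {'a', 'an', 'the', 'of', 'in', 'to', 'and'};            % assumed "simple words" list
    tw = clean_words(t, stop);
    hw = clean_words(h, stop);
    matched = sum(ismember(hw, tw));                                % wordbag count
    if matched > n, label = 'YES'; else, label = 'NO'; end
end

function w = clean_words(s, stop)
    w = strsplit(strtrim(lower(regexprep(s, '[^\w\s]', ''))));     % strip punctuation, lowercase
    w = w(~ismember(w, stop));                                     % drop simple words
end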
Potential Features
• Wordbag ratio = # of matches between text and hypothesis / # of words in the hypothesis
  Works for:
  <t>A bus collision with a truck in Uganda has resulted in at least 30 fatalities and has left a further 21 injured.</t>
  <h>30 die in a bus collision in Uganda.</h>
  Wordbag ratio = 6 / 8
  Fails for:
  <t>Blue Mountain Lumber is a subsidiary of Malaysian forestry transnational corporation, Ernslaw One.</t>
  <h>Blue Mountain Lumber owns Ernslaw One.</h>
  Wordbag ratio = 5 / 6
• A potential solution needs to include semantic knowledge about the relationship between key words such as "subsidiary" and "owns"
Potential Features – cont.
• Word proximity = average distance between matched words in the text (a sketch follows below)
  For example:
  <t>A bus collision with a truck in Uganda has resulted in at least 30 fatalities and has left a further 21 injured.</t>
  <h>30 die in a bus collision in Uganda.</h>
  Matched words: 30, in, bus, collision, in, Uganda
  Distances from 30: 3, 12, 11, 3, 6; from in: 3, 5, 4, 1; from bus: 12, 5, 1, 5, 6; from collision: etc.
• May not help much or at all, but by adding additional independent features (modeled with a Gaussian distribution) we can potentially increase P(wn|x)
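A rough MATLAB sketch of the proximity feature: the average pairwise distance between the positions, in the text, of words that also appear in the hypothesis. Tokenization is the same naive whitespace split as in the earlier sketches.

% Average pairwise distance between matched-word positions in the text.
function d = word_proximity(t, h)
    tw = strsplit(lower(t));
    hw = strsplit(lower(h));
    pos = find(ismember(tw, hw));                        % positions of matched words in the text
    if numel(pos) < 2, d = 0; return; end
    D = abs(pos' - pos);                                 % pairwise distance matrix (implicit expansion)
    d = sum(D(:)) / (numel(pos) * (numel(pos) - 1));     % mean over ordered pairs, excluding self-distances
end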
Potential Features – cont.
• Word grouping = # of adjacent word pairs (groups of length 2) in the hypothesis that also appear in the text / # of possible pairs in the hypothesis (a sketch follows below)
  For example:
  <t>A bus collision with a truck in Uganda has resulted in at least 30 fatalities and has left a further 21 injured.</t>
  <h>30 die in a bus collision in Uganda.</h>
  Matched groups: "bus collision", "in Uganda"; 7 possible combinations = 2/7
  <t>Blue Mountain Lumber is a subsidiary of Malaysian forestry transnational corporation, Ernslaw One.</t>
  <h>Blue Mountain Lumber owns Ernslaw One.</h>
  Matched groups: "Blue Mountain", "Mountain Lumber", "Ernslaw One"; 5 combinations = 3/5
• Once again this may not help much or at all, but it may help us brainstorm a bit
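A quick MATLAB sketch of this word-grouping feature: the fraction of hypothesis bigrams that also occur in the text. The function names are mine.

% Fraction of hypothesis word pairs (adjacent bigrams) found in the text.
function r = bigram_ratio(t, h)
    tb = bigrams(strsplit(lower(t)));
    hb = bigrams(strsplit(lower(h)));
    r = sum(ismember(hb, tb)) / numel(hb);     % matched groups / possible groups
end

function b = bigrams(w)
    b = cell(1, numel(w) - 1);
    for k = 1:numel(w) - 1
        b{k} = [w{k} ' ' w{k+1}];              % join neighbouring words with a space
    end
end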
Potential Features – cont.
• Quick and easy stats we can generate may include:
  • Stemmers – count matching verbs?
  • Synonyms/antonyms – count any matches for both types
  • Parts of speech – brainstorm, anyone?
  • Removal or weighting of names and place-names – collapse a multi-word "match" into a single symbol so as not to give extra weight to names or place-names (a small illustration follows below)
  • Matching phrases that appear similar in both the text and the hypothesis
• Any "count" that can be created from any semantic or syntactic processing could be used as a feature
• I am now using Matlab for the implementation, so any Unix program can be used to process a feature – maybe someone knows of an existing feature-extraction Unix command-line program
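A small illustration of the name-collapsing idea in MATLAB. The name list here is just taken from the earlier example pair; a real implementation would need a gazetteer or a capitalization heuristic to find the names.

% Replace known multi-word names with a single token before counting.
names = {'Blue Mountain Lumber', 'Ernslaw One'};                               % assumed list
collapse = @(s) regexprep(s, strrep(names, ' ', '\s+'), strrep(names, ' ', '_'));
collapse('Blue Mountain Lumber owns Ernslaw One.')
% ans = 'Blue_Mountain_Lumber owns Ernslaw_One.'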
RTE3 Important Dates
• Test set released on March 5th
  • Gives us a week before we have to submit
• Last day to submit is March 12th
  • Submission consists of running the data yourself and then submitting the result file
  • A cheater says whaaaa?
• Technical report deadline: March 26th
• I will be working on this on and off until March 6th, then I can devote full time to our submission