
Third Recognizing Textual Entailment Challenge


Presentation Transcript


  1. Third Recognizing Textual Entailment Challenge
     Potential SNeRG Submission

  2. RTE3 Quick Notes
  • RTE Web Site: http://www.pascal-network.org/Challenges/RTE3/
  • Textual Entailment resource pool: http://aclweb.org/aclwiki/index.php?title=Textual_Entailment_Resource_Pool
  • New development set released last week to correct errors
  • Test set released on March 5th
  • !!! New !!! submission date: March 12th
  • Report deadline: March 26th

  3. Development set examples
  • Example of a YES result:
    <pair id="5" entailment="YES" task="IE" length="short">
      <t>A bus collision with a truck in Uganda has resulted in at least 30 fatalities and has left a further 21 injured.</t>
      <h>30 die in a bus collision in Uganda.</h>
    </pair>
  • Example of a NO result:
    <pair id="20" entailment="NO" task="IE" length="short">
      <t>Blue Mountain Lumber is a subsidiary of Malaysian forestry transnational corporation, Ernslaw One.</t>
      <h>Blue Mountain Lumber owns Ernslaw One.</h>
    </pair>
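For reference, a minimal sketch of loading this pair format with Python's standard library. The file name "rte3_dev.xml" is only a placeholder, and I am assuming the <pair> elements sit under a single root element.

```python
import xml.etree.ElementTree as ET

def load_pairs(path):
    """Yield (id, task, gold label, text, hypothesis) from an RTE-style XML file.

    Assumes the <pair> elements shown above sit under one root element;
    "rte3_dev.xml" below is only a placeholder file name.
    """
    root = ET.parse(path).getroot()
    for pair in root.iter("pair"):
        yield (pair.get("id"), pair.get("task"), pair.get("entailment"),
               pair.findtext("t"), pair.findtext("h"))

for pair_id, task, label, text, hypothesis in load_pairs("rte3_dev.xml"):
    print(pair_id, task, label, hypothesis)
```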

  4. Development set examples – cont.
  • 4 different types of entailment tasks:
    • Information Retrieval (IR)
    • Question Answering (QA)
    • Information Extraction (IE)
    • Multi-document summarization (SUM)
  • The development set consists of 200 samples of each task type
  • 400 pairs evaluate to “YES” and 400 to “NO”
  • Another attribute, “length”, splits the development set into only 134 long and 666 short samples. [Note to self: gather a group of demon hunters to hunt down the short samples, will need volunteers and holy water.]

  5. Evaluation
  • Two submissions per team can be made
  • Program output is a file that contains the following information:
    Line 1 must contain: “ranked:<blank space>yes/no”
    Lines 2..end contain: “pair_id<blank space>judgment”
    For example:
      ranked: yes
      4 YES
      3 YES
      6 YES
      1 NO
      5 NO
      2 NO
  • Accuracy is calculated as the fraction of answers returned correctly
  • Average precision is determined by the order and the correctness of the answers, using the formula:
      (1/R) * sum for i = 1 to n of E(i) * (# correct up to pair i) / i
    where n is the number of pairs in the test set, R is the total number of positive pairs in the test set, E(i) is 1 if the i-th pair is positive and 0 otherwise, and i ranges over the pairs ordered by their ranking.
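As a worked reading of the two metrics, a small sketch in Python (not the official scorer; the function names and the toy ranking are mine):

```python
def accuracy(gold, predicted):
    """Fraction of pairs whose returned judgment matches the gold judgment."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def average_precision(gold_by_rank):
    """(1/R) * sum over i of E(i) * (# correct up to pair i) / i.

    gold_by_rank: gold labels ("YES"/"NO") in the order the system ranked the
    pairs, most confident entailment first.
    """
    R = sum(g == "YES" for g in gold_by_rank)   # total positive pairs
    correct_so_far, total = 0, 0.0
    for i, g in enumerate(gold_by_rank, start=1):
        if g == "YES":                          # E(i) = 1 for positive pairs
            correct_so_far += 1
            total += correct_so_far / i
    return total / R

# Toy example: gold labels of pairs 4, 3, 6, 1, 5, 2 in the submitted ranking.
print(average_precision(["YES", "YES", "NO", "YES", "NO", "NO"]))
```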

  6. Possible Implementation
  • Discover features that can be measured with a continuous variable. For example:
    • Wordbag match ratio = # of words matched between text and hypothesis / # of words in the hypothesis
  • Arrange the feature values in a feature vector x
  • Apply the general multivariate normal density to the assembled feature vector x
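A minimal sketch of that pipeline, assuming we fit one Gaussian density per class (YES/NO) on development pairs and label a new pair by whichever class density is higher. The feature choices, names, and the second feature below are illustrative, not a settled design:

```python
import numpy as np
from scipy.stats import multivariate_normal

def wordbag_ratio(text, hypothesis):
    """# of hypothesis words also found in the text / # of words in the hypothesis."""
    text_words = set(text.lower().split())
    hyp_words = hypothesis.lower().split()
    return sum(w in text_words for w in hyp_words) / len(hyp_words)

def features(text, hypothesis):
    """Assemble the feature vector x for one pair; more continuous features can be appended."""
    return np.array([wordbag_ratio(text, hypothesis),
                     len(hypothesis.split()) / len(text.split())])

def fit_density(X):
    """Fit a general multivariate normal density to rows of feature vectors."""
    return multivariate_normal(mean=X.mean(axis=0),
                               cov=np.cov(X, rowvar=False),
                               allow_singular=True)

def judge(x, density_yes, density_no):
    """Label the pair by the class whose fitted density gives x the higher likelihood."""
    return "YES" if density_yes.pdf(x) >= density_no.pdf(x) else "NO"
```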

  7. Implementation to Determine Baseline
  • I have done an implementation to establish a baseline for what we can expect from a full implementation of all syntactic features
  • First baseline result: used 1 feature: Wordbag count > n, where n is decided after the development set is processed
    Success: 509, Fail: 290, Final rate: 63.9%
  • Second baseline result: used simple preprocessing and the Wordbag count: removing punctuation, case insensitivity, ignoring simple words
    Success: 534, Fail: 265, Final rate: 66.8%
  • Attempted a little semantic processing, such as increasing the weight of “negative” words when returning negative results, but the results did not improve
  • In the RTE2 competition the highest accuracy was only 70%!
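The actual baseline is implemented in Matlab (see slide 11); purely as an illustration of the idea, a rough Python equivalent of the second baseline. The stopword list and the threshold n below are placeholders, not the values actually used:

```python
import string

# Placeholder list of "simple words" to ignore; the real list is not specified on the slide.
SIMPLE_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "has", "have"}

def preprocess(sentence):
    """Lowercase, strip punctuation, and drop simple words."""
    sentence = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in sentence.split() if w not in SIMPLE_WORDS]

def wordbag_count(text, hypothesis):
    """Number of hypothesis words that also appear in the text after preprocessing."""
    text_words = set(preprocess(text))
    return sum(w in text_words for w in preprocess(hypothesis))

def judge(text, hypothesis, n=3):
    """Baseline decision: YES when more than n hypothesis words match the text.

    n would be tuned on the development set; 3 is only an illustrative value.
    """
    return "YES" if wordbag_count(text, hypothesis) > n else "NO"
```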

  8. Potential Features
  • Wordbag ratio = # of matches between text and hypothesis / # of words in hypothesis
    Works for:
      <t>A bus collision with a truck in Uganda has resulted in at least 30 fatalities and has left a further 21 injured.</t>
      <h>30 die in a bus collision in Uganda.</h>
      Wordbag ratio = 6 / 8
    Fails for:
      <t>Blue Mountain Lumber is a subsidiary of Malaysian forestry transnational corporation, Ernslaw One.</t>
      <h>Blue Mountain Lumber owns Ernslaw One.</h>
      Wordbag ratio = 5 / 6
  • A potential solution needs to include semantic knowledge about the relationship between the key words (“subsidiary” vs. “owns”, highlighted in red on the original slide).

  9. Potential Features – cont.
  • Word proximity = average distance between matched words in the text
    For example:
      <t>A bus collision with a truck in Uganda has resulted in at least 30 fatalities and has left a further 21 injured.</t>
      <h>30 die in a bus collision in Uganda.</h>
      Matched words: 30 in bus collision in Uganda
      30: 3, 12, 11, 3, 6
      in: 3, 5, 4, 1
      bus: 12, 5, 1, 5, 6
      collision: etc…
  • May not help much or at all, but by adding additional independent features (drawn from a Gaussian distribution), we can potentially increase the class posterior P(wn|x)
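A sketch of one way this proximity feature could be computed, under my reading of the slide: take the positions in the text of every word that also occurs in the hypothesis and average the pairwise token distances. The handling of repeated words and punctuation is an assumption:

```python
def word_proximity(text, hypothesis):
    """Average token distance in the text between words that the hypothesis also contains."""
    text_tokens = [w.strip(".,").lower() for w in text.split()]
    hyp_words = {w.strip(".,").lower() for w in hypothesis.split()}
    # Positions of every text token that also occurs in the hypothesis.
    positions = [i for i, w in enumerate(text_tokens) if w in hyp_words]
    if len(positions) < 2:
        return 0.0
    distances = [q - p for k, p in enumerate(positions) for q in positions[k + 1:]]
    return sum(distances) / len(distances)

text = ("A bus collision with a truck in Uganda has resulted in at least "
        "30 fatalities and has left a further 21 injured.")
print(word_proximity(text, "30 die in a bus collision in Uganda."))
```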

  10. Potential Features – cont.
  • Word grouping = count of matched word groups of length 2 / possible combinations
    For example:
      <t>A bus collision with a truck in Uganda has resulted in at least 30 fatalities and has left a further 21 injured.</t>
      <h>30 die in a bus collision in Uganda.</h>
      Matched groups: “bus collision”, “in Uganda”; 7 possible combinations = 2/7
      <t>Blue Mountain Lumber is a subsidiary of Malaysian forestry transnational corporation, Ernslaw One.</t>
      <h>Blue Mountain Lumber owns Ernslaw One.</h>
      Matched groups: “Blue Mountain”, “Mountain Lumber”, “Ernslaw One”; 5 combinations = 3/5
  • Once again this may not help much or at all, but it may help us brainstorm a bit
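A sketch of this word-grouping feature, assuming “possible combinations” means the adjacent word pairs (bigrams) of the hypothesis. The tokenization details are my own, but with them the second example above does come out to 3/5:

```python
def word_grouping(text, hypothesis):
    """Fraction of the hypothesis's adjacent word pairs (bigrams) also found in the text."""
    def bigrams(sentence):
        tokens = [w.strip(".,").lower() for w in sentence.split()]
        return list(zip(tokens, tokens[1:]))
    text_bigrams = set(bigrams(text))
    hyp_bigrams = bigrams(hypothesis)
    return sum(b in text_bigrams for b in hyp_bigrams) / len(hyp_bigrams)

text = ("Blue Mountain Lumber is a subsidiary of Malaysian forestry "
        "transnational corporation, Ernslaw One.")
print(word_grouping(text, "Blue Mountain Lumber owns Ernslaw One."))  # 3 of 5 bigrams match
```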

  11. Potential Features – cont.
  • Quick and easy stats we can generate may include:
    • Stemmers – count matching verbs?
    • Synonyms/Antonyms – count any matches for both types
    • Parts of speech – brainstorm, anyone?
    • Removal or weighting of names and place-names – make a multiple-word “match” into a single symbol so as not to give extra weight to names or place-names
    • Matching phrases that appear similar in both the text and the hypothesis
  • Any “count” that can be created from processing semantic or syntactic information can be used (a small stemmer/synonym sketch follows this slide)
  • I am now using Matlab for the implementation, so any Unix program can be used to process a feature – maybe someone knows of an existing feature-extraction Unix command-line program
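For the stemmer and synonym bullets, a hedged sketch of what such counts could look like using NLTK. The slide does not commit to any toolkit; NLTK is just one readily available option, and WordNet requires a one-time nltk.download("wordnet"):

```python
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet   # requires a one-time: nltk.download("wordnet")

stemmer = PorterStemmer()

def stem_match_count(text, hypothesis):
    """Count hypothesis words whose stem also appears among the stems of the text's words."""
    text_stems = {stemmer.stem(w.lower().strip(".,")) for w in text.split()}
    return sum(stemmer.stem(w.lower().strip(".,")) in text_stems for w in hypothesis.split())

def synonym_match_count(text, hypothesis):
    """Count hypothesis words that share a WordNet synonym (lemma) with some text word."""
    text_words = {w.lower().strip(".,") for w in text.split()}
    count = 0
    for w in hypothesis.split():
        lemmas = {lemma.lower().replace("_", " ")
                  for synset in wordnet.synsets(w.lower().strip(".,"))
                  for lemma in synset.lemma_names()}
        if lemmas & text_words:
            count += 1
    return count
```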

  12. RTE3 Important Dates
  • Test set released on March 5th
  • Gives us 10 days before we must submit
  • Last day to submit is March 12th
  • Submission consists of running the data yourself and then submitting the result file
  • A cheater says whaaaa?
  • Technical report deadline: March 26th
  • I will be working on this on and off until March 6th; after that I can devote full time to our submission
