
LING 573 Deliverable 3



  1. LING 573 Deliverable 3. Jonggun Park, Haotian He, Maria Antoniak, Ron Lockwood

  2. Closed Class Filters (14) • Animals, colors, companies, continents, countries, sports teams, languages, occupations, periodic table, race, US cities, US presidents, US states, and US universities.
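
A filter like this can be a simple membership test against the word lists. A minimal sketch, assuming one plain-text list per class; the file paths and class keys below are hypothetical:

```python
# Closed-class filter sketch. Assumes one lowercase entry per line in each
# list file; the file paths and class names are illustrative only.
CLASS_FILES = {
    "colors": "lists/colors.txt",
    "us-states": "lists/us-states.txt",
    # ... one entry for each of the 14 closed classes
}

def load_lists(class_files):
    lists = {}
    for cls, path in class_files.items():
        with open(path, encoding="utf-8") as f:
            lists[cls] = {line.strip().lower() for line in f if line.strip()}
    return lists

def passes_closed_class_filter(candidate, cls, lists):
    """Keep a candidate answer only if it appears in the class's word list."""
    return candidate.strip().lower() in lists.get(cls, set())
```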

  3. Query Expansion!

  4. Query Expansion • Who is the president of the United States? → president united states nations council • How long did it take to build the Tower of Pisa? → long build tower pisa women's station
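
The examples above boil each question down to its content words and append related terms; as the second example shows, the expansion can pull in noise ("women's station"). A rough sketch of that step, with the stop-word list and the related-term lookup as stand-ins for the real resources:

```python
# Query expansion sketch: strip stop/wh-words, then append related terms.
# STOPWORDS and the related_terms callable are placeholders, not the
# system's actual resources.
STOPWORDS = {"who", "what", "how", "is", "the", "of", "did", "it", "to", "take", "a"}

def expand_query(question, related_terms):
    content = [w for w in question.lower().rstrip("?").split()
               if w not in STOPWORDS]
    return " ".join(content + related_terms(content))  # expansion may add noise
```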

  5. Question Classification • Software package: Mallet • Classification algorithms: MaxEnt, Naive Bayes, Winnow, Decision Tree • Training data: TREC-2004.xml; Training set 5 (5,500 labeled questions) (Li & Roth) • Test data: TREC-2005.xml; Testing set (Li & Roth)

  6. Feature selection: - Unigram - Bigram - Trigram - Question word - NER tags
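
A sketch of how these features might be generated for each question, with the NER tagger left abstract since the slides don't name one:

```python
WH_WORDS = {"who", "what", "when", "where", "why", "how", "which", "whom", "whose"}

def extract_features(tokens, ner_tags=None):
    """Map a tokenized question to classifier features: unigrams, bigrams,
    trigrams, wh-words, and (optionally) NER tags. The feature-name format
    here is an assumption, not Mallet's required input format."""
    feats = [f"UNI={t}" for t in tokens]
    feats += [f"BI={a}_{b}" for a, b in zip(tokens, tokens[1:])]
    feats += [f"TRI={a}_{b}_{c}" for a, b, c in zip(tokens, tokens[1:], tokens[2:])]
    feats += [f"WH={t}" for t in tokens if t.lower() in WH_WORDS]
    if ner_tags:
        feats += [f"NER={tag}" for tag in ner_tags]
    return feats
```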

  7. Conclusion: Maximum accuracy • TREC-2005 as test file: 0.8536 (MaxEnt, Unigram + Bigram + Wh-words) • TREC-10 as test file: 0.854 (MaxEnt, Unigram + Bigram + Wh-words)

  8. Other findings: • The trigram feature does not help and drags accuracy down. • The NER feature does not help and causes a slight drop in accuracy.

  9. Web Boosting • Resources: jsoup, Bing.com • Query: original question + target string • Results: top 50 web snippets, stored in a text file
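
In outline: build the query from the question plus the target string, fetch results, and write the top 50 snippets to a text file. The fetch function below is a placeholder for the team's jsoup/Bing scraping code, whose details aren't in the slides:

```python
def save_web_snippets(question, target, fetch_snippets, out_path, k=50):
    """Query the web with question + target and store the top-k snippets,
    one per line. fetch_snippets(query) is a stand-in that should return
    a list of snippet strings."""
    query = f"{question} {target}"
    with open(out_path, "w", encoding="utf-8") as f:
        for snippet in fetch_snippets(query)[:k]:
            f.write(snippet.replace("\n", " ").strip() + "\n")
```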

  10. Web Boosting Challenges and Successes • Which search engine or answer website to use? • How to avoid throttling? • How to integrate results into our system? • How to edit results to make them more useful for our answer ranking system?

  11. Main Changes • Use web query results as input to the redundancy-based answer extraction engine • This replaces our paragraph-based index • Answer type classification now feeds into answer extraction • Candidate answers are filtered by answer type in combination with NER on the answers • The following types are handled: NUM, LOC, HUM, ENTY
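
The type filter can be sketched as a map from the coarse answer types to the NER tags that may realize them; the tag inventory below is illustrative, not necessarily the team's:

```python
# Map coarse answer types to plausible NER tags (illustrative inventory).
TYPE_TO_NER = {
    "HUM": {"PERSON"},
    "LOC": {"LOCATION", "GPE"},
    "NUM": {"NUMBER", "DATE", "MONEY", "PERCENT"},
    "ENTY": {"ORGANIZATION", "MISC"},
}

def matches_answer_type(candidate_ner_tags, answer_type):
    """Keep a candidate if any of its NER tags fits the predicted type;
    leave unhandled types unfiltered."""
    allowed = TYPE_TO_NER.get(answer_type)
    return allowed is None or any(tag in allowed for tag in candidate_ner_tags)
```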

  12. Main Changes (continued) • Filtering of closed-class questions using lists • E.g. pro sports teams, colors, etc. • Filtering out terms that occur in fewer than 2 snippets • Return a 250-character answer instead of 1-4 words

  13. Answer Extraction Details 1. Input to the extraction engine • Query word list • Stop-word list • Focus-word list (e.g. meters, liters, miles, etc.) • Passage list: the paragraph results of the query 2. N-gram generation and occurrence counting 3. Filtering out stop words and query words; filtering by answer type
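
Steps 2-3 are the core of redundancy-based extraction: enumerate n-grams over the snippets, count occurrences, and drop anything containing a stop word or a query word. A compact sketch (the maximum n-gram length is an assumption):

```python
from collections import Counter

def count_candidate_ngrams(snippets, stopwords, query_words, max_n=3):
    """Count n-grams (n <= max_n) across all snippets, skipping n-grams
    that contain a stop word or a word from the query itself."""
    banned = {w.lower() for w in stopwords} | {w.lower() for w in query_words}
    counts = Counter()
    for snippet in snippets:
        tokens = snippet.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if any(tok in banned for tok in gram):
                    continue
                counts[" ".join(gram)] += 1
    return counts
```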

  14. Answer Extraction Details 4. Combining unigram counts with n-gram counts 5. Weighting candidates with idf scores 6. Re-ranking candidates • Eliminate ones that don't have evidence in at least 2 snippets • Eliminate ones that don't match a closed-class list (for certain questions) 7. Verifying candidates in documents • Use a bag-of-words query built from the candidate's sub-snippet plus the query words against the Lucene index
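
Steps 4-6 can be sketched as idf-weighted scoring plus the two-snippet evidence rule; the exact combination formula isn't on the slide, so the averaging below is one plausible choice:

```python
import math

def make_idf(snippets):
    """idf(w) = log(N / df(w)), computed over the snippet collection."""
    n = len(snippets)
    df = {}
    for s in snippets:
        for w in set(s.lower().split()):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / d) for w, d in df.items()}

def score_candidates(counts, snippets, idf):
    """Weight each candidate's count by the mean idf of its words and
    drop candidates with evidence in fewer than 2 snippets."""
    scored = []
    for cand, count in counts.items():
        support = sum(1 for s in snippets if cand in s.lower())
        if support < 2:
            continue
        words = cand.split()
        weight = sum(idf.get(w, 0.0) for w in words) / len(words)
        scored.append((count * weight, cand))
    return sorted(scored, reverse=True)
```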

  15. Results • D2: strict = 0.01, lenient = 0.064 • D3: strict = 0.133, lenient = 0.371
