LING 573 Deliverable 3

LING 573 Deliverable 3 • Jonggun Park, Haotian He, Maria Antoniak, Ron Lockwood

Closed Class Filters • 14 lists: animals, colors, companies, continents, countries, sports teams, languages, occupations, periodic table, race, us-cities, us-presidents, us-states, and us-universities.

Query Expansion!

Query Expansion • Who is the president of the United States? • President united states nations council • How long did it take to build the Tower of Pisa? • long build tower pisa women’s station
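The two examples above suggest that the question is reduced to its content words before expansion terms are appended. A minimal sketch of that idea follows; the stop-word list and the `expand_query` helper are assumptions for illustration, and the slides do not say where the expansion terms come from, so they are passed in by the caller.

```python
import re

# Hypothetical stop-word list; the slides do not give the one actually used.
STOP_WORDS = {"who", "is", "the", "of", "how", "did", "it", "take", "to", "a", "an"}


def expand_query(question, expansion_terms):
    """Drop stop words from the question, then append expansion terms."""
    tokens = re.findall(r"[a-z0-9']+", question.lower())
    content = [t for t in tokens if t not in STOP_WORDS]
    # Only append expansion terms that are not already in the query.
    extra = [t for t in expansion_terms if t not in content]
    return " ".join(content + extra)


print(expand_query("Who is the president of the United States?", ["nations", "council"]))
print(expand_query("How long did it take to build the Tower of Pisa?", []))
```

With an empty expansion list, the second call returns only the content words of the question.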

Question Classification • Software package: Mallet • Classification algorithms: MaxEnt, NaiveBayes, Winnow, DecisionTree • Training data: TREC-2004.xml; Training set 5 (5500 labeled questions) (Li & Roth) • Test data: TREC-2005.xml; Testing set (Li & Roth)

Feature selection: - Unigram - Bigram - Trigram - Question word - NER tags
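As an illustration of the unigram, bigram, and question-word features listed above, the sketch below turns a question into a flat list of feature strings. The feature-name prefixes (`uni=`, `bi=`, `wh=`), the wh-word set, and the `extract_features` helper are assumptions; the slides do not show how features were encoded for Mallet, and trigram and NER features are left out here.

```python
# Hypothetical wh-word set; the slides do not list the one actually used.
WH_WORDS = {"who", "whom", "whose", "what", "when", "where", "why", "which", "how"}


def extract_features(question):
    """Return unigram, bigram, and question-word features for one question."""
    tokens = question.lower().rstrip("?").split()
    features = ["uni=" + t for t in tokens]
    features += ["bi=" + a + "_" + b for a, b in zip(tokens, tokens[1:])]
    # Use the first wh-word in the question, if there is one.
    wh = next((t for t in tokens if t in WH_WORDS), None)
    if wh is not None:
        features.append("wh=" + wh)
    return features


print(extract_features("Who is the president?"))
```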

Conclusion: Maximum accuracy: TREC-2005 as test file: 0.8535911602209945 - MaxEnt, Unigram + Bigram + Wh-words. TREC-10 as test file: 0.854 - MaxEnt, Unigram + Bigram + Wh-words.

Other findings: Trigram features do not help and drag the accuracy down. NER features do not help and cause a slight drop.

Web Boosting • Resources: jsoup, Bing.com • Query: original question + target string • Results: top 50 web snippets, stored in a text file

Web Boosting Challenges and Successes • Which search engine or answer website to use? • How to avoid throttling? • How to integrate results into our system? • How to edit results to make them more useful for our answer ranking system?

Main Changes • Use web query as input to the redundancy-based answer extraction engine • This replaces our paragraph-based index • Answer type classification now feeds into answer extraction • Filtering of candidate answers by answer type in combination with NER on the answers • The following types are handled: NUM, LOC, HUM, ENTY

Main Changes (continued) • Filtering of closed class questions using lists • E.g. pro sports teams, colors, etc. • Filtering out of terms that occur in fewer than 2 snippets • Return a 250-character answer instead of 1-4 words
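The two filters above (closed class lists and the two-snippet minimum) can be sketched as follows. This is a simplified illustration: the function names are hypothetical, and case-insensitive substring matching is an assumption about how a candidate is matched against a snippet.

```python
def snippet_support(candidate, snippets):
    """Count the snippets that contain the candidate (case-insensitive substring)."""
    cand = candidate.lower()
    return sum(1 for s in snippets if cand in s.lower())


def filter_candidates(candidates, snippets, closed_class=None, min_snippets=2):
    """Keep candidates with enough snippet support and, if given, list membership."""
    kept = []
    for cand in candidates:
        if snippet_support(cand, snippets) < min_snippets:
            continue  # evidence in too few snippets
        if closed_class is not None and cand.lower() not in closed_class:
            continue  # not on the closed class list for this question
        kept.append(cand)
    return kept


snippets = ["The sky is blue today", "Blue is a calm color", "Red cars go fast"]
colors = {"blue", "red", "green"}  # stand-in for one closed class list
print(filter_candidates(["blue", "red", "sky"], snippets, closed_class=colors))
```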

Answer Extraction Details • Input to the Extraction Engine • Query word list • Stop-word list • Focus-word list (e.g. meters, liters, miles, etc.) • Passage list – the paragraph results of the query • N-gram generation and occurrence counting • Filtering out stop words and query words • Filter by answer type

Answer Extraction Details 4. Combining unigram counts with n-gram counts 5. Weighting candidates with idf scores 6. Re-rank candidates • Eliminate ones that don’t have evidence in at least 2 snippets • Eliminate ones that don’t match a closed class list (for certain questions) 7. Verifying candidates in documents • Use bag of words query from the candidate sub-snippet + query words against Lucene index
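The counting and weighting steps can be sketched roughly as below. The slides do not say how unigram and n-gram counts are combined or how idf is computed, so the additive combination, the averaged smoothed idf, and the names `score_candidates`, `doc_freq`, and `num_docs` are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def score_candidates(snippets, exclude, doc_freq, num_docs, max_n=3):
    """Count n-grams over snippets, combine with unigram counts, weight by idf."""
    counts = Counter()
    for snippet in snippets:
        tokens = snippet.lower().split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                # Drop n-grams containing stop words or query words.
                if not any(w in exclude for w in gram.split()):
                    counts[gram] += 1
    scores = {}
    for cand, count in counts.items():
        words = cand.split()
        combined = count
        if len(words) > 1:
            # Assumed combination: add the counts of the component unigrams.
            combined += sum(counts[w] for w in words)
        # Assumed weighting: smoothed idf averaged over the candidate's words.
        idf = sum(math.log((1 + num_docs) / (1 + doc_freq.get(w, 0))) for w in words)
        scores[cand] = combined * idf / len(words)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy snippets and document frequencies, invented for this example.
snippets = ["the tower took 199 years", "built over 199 years", "199 years of work"]
doc_freq = {"199": 1, "years": 50, "took": 40, "tower": 5}
ranked = score_candidates(snippets, {"the", "of", "over"}, doc_freq, num_docs=100)
print(ranked[0][0])
```

In this toy input, the bigram that recurs in all three snippets ends up ranked first, which is the behavior a redundancy-based approach relies on.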

Results • D2: strict = 0.01, lenient = 0.064 • D3: strict = 0.133, lenient = 0.371
