Artificial intelligence & natural language processing Mark Sanderson Porto, 2000
Aims • To provide an outline of the attempts made at using NLP techniques in IR
Objectives • At the end of this lecture you will be able to • Outline a range of attempts to get NLP to work with IR systems • Idly speculate on why they failed • Describe the successful use of NLP in a limited domain
Why? • Seems an obvious area of investigation • Why isn’t it working?
Use of NLP • Syntactic • Parsing to identify phrases • Full syntactic structure comparison • Semantic • Building an understanding of a document’s content • Discourse • Exploiting document structure?
Syntactic • Parsing to identify phrases • The issues. • Explain how it’s done (a bit). • Is it worth it? • Other possibilities • Grammatical tagging • Full syntactic structure comparison • Explain how it’s done (a little bit). • Show results.
Simple phrase identification • High frequency terms could be good candidates. • Why? • Terms co-occurring more often than chance. • Within small number of words. • Surrounding simple terms. • Not surrounding punctuation.
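The statistical criteria above can be sketched as a small co-occurrence test: keep word pairs that appear within a small window more often than their individual frequencies would predict. The window size, count and ratio thresholds here are illustrative, not from any particular system.

```python
from collections import Counter

def phrase_candidates(tokens, window=3, min_count=2, min_ratio=1.5):
    """Flag word pairs that co-occur within `window` positions
    more often than chance and more than `min_count` times."""
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            pair_freq[(w, v)] += 1
    candidates = []
    for (w, v), observed in pair_freq.items():
        if observed < min_count:
            continue
        # expected co-occurrences if w and v were independent
        expected = word_freq[w] * word_freq[v] * (window - 1) / n
        if observed / expected >= min_ratio:
            candidates.append((w, v))
    return candidates

text = ("information retrieval systems index text "
        "information retrieval research evaluates retrieval").split()
print(phrase_candidates(text))  # [('information', 'retrieval')]
```

Note that without the `min_count` filter, pairs of rare words pass the ratio test trivially, which is exactly why the slide stresses high-frequency terms as candidates.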
Problems • Close words that aren’t phrases. • “the use of computers in science & technology” • Distant words that are phrases. • “preparation & evaluation of abstracts and extracts”
Parsing for phrases • Using parsers to identify noun phrases. • Make a phrase out of a head and the head of its modifiers. • [Parse tree of “automatic analysis of scientific text”: NP(ADJ, NOUN, PP(PREP, ADJ, NOUN))]
Errors • Not a perfect rule by any means. • Need restrictions to eliminate bogus phrases. • [Parse tree of “automatic analysis of these four scientific texts”: NP(ADJ, NOUN, PP(PREP, DET, QUANT, ADJ, NOUN))]
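The head-plus-modifier-head rule can be sketched on a hand-built parse. The nested-tuple representation and the "rightmost direct NOUN child is the head" rule are simplifications for illustration; a real parser's output and head-finding rules are more involved.

```python
# A toy parse as nested tuples: (label, children...) with (tag, word) leaves.
# DET/QUANT leaves are the determiners the restrictions must filter out.
parse = ("NP",
         ("ADJ", "automatic"),
         ("NOUN", "analysis"),           # head of the NP
         ("PP",
          ("PREP", "of"),
          ("DET", "these"),
          ("QUANT", "four"),
          ("ADJ", "scientific"),
          ("NOUN", "texts")))            # head of the PP

def head(node):
    """Rightmost direct NOUN child, a simplified head-finding rule."""
    nouns = [child[1] for child in node[1:] if child[0] == "NOUN"]
    return nouns[-1] if nouns else None

def phrases(node):
    """Pair the head of each constituent with the heads of its
    non-leaf modifiers, recursing into those modifiers."""
    h = head(node)
    out = []
    for child in node[1:]:
        if not isinstance(child[1], str):   # child is a constituent
            out.append((h, head(child)))
            out.extend(phrases(child))
    return out

print(phrases(parse))  # [('analysis', 'texts')]
```

The DET and QUANT leaves never reach the output because only heads are paired, which is one way of encoding the restrictions the slide calls for.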
Do they work? • Fagan compared statistical with syntactic phrases; statistics won, but only just • J. Fagan (1987) Experiments in phrase indexing for document retrieval: a comparison of syntactic & nonsyntactic methods, TR 87-868, Department of Computer Science, Cornell University • More research has been conducted. • T. Strzalkowski (1995) Natural language information retrieval, in Information Processing & Management, Vol. 31, No. 3, pp 397-417
Check out TREC • Overview of the Seventh Text REtrieval Conference (TREC-7), E. Voorhees, D. Harman (National Institute of Standards and Technology) • http://trec.nist.gov/ • Ad hoc track • Fairly even between statistical phrases, syntactic phrases and no phrases.
Grammatical tagging? • Tag document text with grammatical codes? • R. Garside (1987). The CLAWS word tagging system, in The computational analysis of English: a corpus-based approach, R. Garside, G. Leech, G. Sampson Eds., Longman: 30-41. • Doesn’t appear to work • R. Sacks-Davis, P. Wallis, R. Wilkinson (1990). Using syntactic analysis in a document retrieval system that uses signature files, in Proceedings of 13th ACM SIGIR Conference: 179-191.
Syntactic structure comparison • Has been tried… • A. F. Smeaton & P. Sheridan (1991) Using morpho-syntactic language analysis in phrase matching, in Proceedings of RIAO ‘91, Pages 414-429 • Method • Parse sentences into tree structures • When you get a phrase match • Look at the linking syntactic operator. • Look at the residual tree structure that didn’t match • Does not seem to work
Semantic • Disambiguation • Given a word appearing in a certain context, disambiguators will tell you what sense it is. • IR system • Index document collections by senses rather than words • Ask the users which sense each query word is used in • Retrieve on senses
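Sense-based indexing needs a disambiguator. A minimal sketch in the spirit of the classic Lesk overlap method follows; the two-sense inventory for “bank” is invented for illustration, where a real system would draw glosses from a dictionary or thesaurus.

```python
# Hypothetical sense inventory: sense tag -> gloss words.
SENSES = {
    "bank": {
        "bank/finance": {"money", "deposit", "loan", "account"},
        "bank/river":   {"river", "water", "shore", "flood"},
    },
}

def disambiguate(word, context_words):
    """Pick the sense whose gloss overlaps the context most (Lesk-style)."""
    senses = SENSES.get(word)
    if not senses:
        return word                      # unambiguous: index the word itself
    context = set(context_words)
    return max(senses, key=lambda s: len(senses[s] & context))

print(disambiguate("bank", ["the", "loan", "was", "approved"]))  # bank/finance
```

The retrieval side then indexes documents and queries by these sense tags instead of raw words, which is exactly the pipeline the slide outlines.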
Disambiguation • Does it work? • No (well maybe) • M. Sanderson, Word sense disambiguation and information retrieval, in Proceedings of the 17th ACM SIGIR Conference, Pages 142-151, 1994 • M. Sanderson & C.J. van Rijsbergen, The impact on retrieval effectiveness of skewed frequency distributions, in ACM Transactions on Information Systems (TOIS) Vol. 17 No. 4, 1999, Pages 440-465.
Partial conclusions • NLP has yet to prove itself in IR • Agree • D.D. Lewis & K. Sparck-Jones (1996) Natural language processing for information retrieval, in Communications of the ACM (CACM) 1996 Vol. 39, No. 1, 92-101 • Sort of don’t agree • A. Smeaton (1992) Progress in the application of natural language processing to information retrieval tasks, in The Computer Journal, Vol. 35, No. 3.
Mark’s idle speculation • What people think is going on • [diagram: relative contributions of “Keywords” and “NLP”]
Mark’s idle speculation • What’s usually actually going on • [diagram: relative contributions of “Keywords” and “NLP”]
Areas where NLP does work • Systems with the following ingredients. • Collection documents cover small domain. • Language use is limited in some manner. • User queries cover tight subject area. • Documents/queries very short • Image captions • LSI, pseudo-relevance feedback • People willing to spend money getting NLP to work
RIME & IOTA • From Grenoble • Y. Chiaramella & J. Nie (1990) A retrieval model based on an extended modal logic and its application to the RIME experimental approach, in Proceedings of the 13th SIGIR conference, Pages 25-43 • Medical record retrieval system • Some database’y parts • Free text descriptions of cases
Indexing • Legend: SGN = observed sign, LOC = localisation • Example: “an opacity affecting probably the lung and the trachea” is indexed as a tree: the root {[p], SGN} (“probably”) dominates {[and], SGN}, which joins two {[bears-on], SGN} nodes, one relating {[opacity], SGN} to {[lung], LOC} and the other relating {[opacity], SGN} to {[trachea], LOC}
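These indexing trees pair each concept with a role from the slide’s legend. A sketch of the structure as nested tuples, with a small traversal to pull out concepts by role (the representation is an illustration, not RIME’s actual data structure):

```python
# Each node is (concept, role, children).
# Roles: SGN = observed sign, LOC = localisation.
index_tree = ("p", "SGN", [                      # "probably"
    ("and", "SGN", [
        ("bears-on", "SGN", [
            ("opacity", "SGN", []),
            ("lung", "LOC", []),
        ]),
        ("bears-on", "SGN", [
            ("opacity", "SGN", []),
            ("trachea", "LOC", []),
        ]),
    ]),
])

def concepts(node, role=None):
    """Collect concepts in the tree, optionally filtered by role."""
    term, r, children = node
    found = [term] if role in (None, r) else []
    for child in children:
        found += concepts(child, role)
    return found

print(concepts(index_tree, role="LOC"))  # ['lung', 'trachea']
```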
Retrieval • Legend: t = uncertainty • How do we match a user’s query to these structures? • Using transformations, a bit like logic • Example: the tree {[bears-on], SGN}({[opacity], SGN}, {[lung], LOC}) ⇒ {[lung], LOC} with uncertainty t, or ⇒ {[opacity], SGN} with uncertainty t
Tree transformation • Example: the tree {[has-for-value], SGN} with a {[bears-on], SGN} branch (relating {[opacity], SGN} to {[lung], LOC}) and a {[has-for-value], SGN} branch (relating {[contour], SGN} to {[blurred], LOC}) ⇒ with uncertainty t, the same tree with the bears-on branch collapsed to {[opacity], SGN}
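One such transformation can be sketched as a tree-matching function: exact node matches score fully, and collapsing a bears-on relation to its sign argument is allowed at a discount of t. The value of t and the stripped-down (term, children) node shape are illustrative; RIME’s actual model is richer.

```python
T = 0.7  # hypothetical uncertainty attached to the collapse rule

def match(query, doc):
    """Score a query tree against a document tree.
    Nodes are (term, children); roles are dropped for brevity.
    Equal terms score 1 times the product of their children's matches;
    a document bears-on relation may collapse to its first argument
    (the observed sign) at a discount of T."""
    q_term, q_children = query
    d_term, d_children = doc
    if q_term == d_term:
        score = 1.0
        for qc, dc in zip(q_children, d_children):
            score *= match(qc, dc)
        return score
    if d_term == "bears-on" and d_children:
        return T * match(query, d_children[0])
    return 0.0

doc = ("bears-on", [("opacity", []), ("lung", [])])
print(match(("opacity", []), doc))  # 0.7 - matched via one collapse
```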
Term transforms • Basic medical terms stored in a hierarchy. • Transformations possible, again with uncertainty added. • Hierarchy (Level 1 → Level 2 → Level 3): tumour → cancer → sarcoma; tumour → hygroma; kyste → polykystosis; kyste → pseudokyst; kyste → polyp → polyposis
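Broadening a query term up such a hierarchy, losing certainty at each level, can be sketched as follows. The parent links are taken from the hierarchy above; the per-level uncertainty value is invented for illustration.

```python
# Child -> parent links from the three-level term hierarchy.
PARENT = {
    "sarcoma": "cancer", "cancer": "tumour", "hygroma": "tumour",
    "polykystosis": "kyste", "pseudokyst": "kyste",
    "polyp": "kyste", "polyposis": "polyp",
}
T = 0.8  # hypothetical uncertainty added per level climbed

def broadenings(term):
    """Yield (ancestor, certainty) pairs walking up the hierarchy."""
    certainty = 1.0
    while term in PARENT:
        term = PARENT[term]
        certainty *= T
        yield term, certainty

print(list(broadenings("sarcoma")))
```

So a query on “sarcoma” can also match documents indexed under “cancer” or “tumour”, but at successively lower certainty.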
Isn’t this a bit slow? • Yes • Optimisation • Scan for potential documents. • Process them intensively. • Evaluation? • Not in that paper.
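The scan-then-process optimisation is essentially filter-and-rerank: a cheap keyword pass shortlists candidates, and only those get the expensive NLP treatment. A minimal sketch, where both scoring functions are simple stand-ins:

```python
def cheap_filter(query_words, docs, top_k=2):
    """Stage 1: quick keyword scan to shortlist candidate documents."""
    scored = [(sum(w in d["text"] for w in query_words), d) for d in docs]
    scored.sort(key=lambda x: -x[0])
    return [d for score, d in scored[:top_k] if score > 0]

def expensive_score(query_words, doc):
    """Stage 2 stand-in for the intensive per-document processing."""
    words = doc["text"].split()
    return sum(words.count(w) for w in query_words) / len(words)

docs = [{"id": 1, "text": "opacity in the lung"},
        {"id": 2, "text": "fracture of the femur"},
        {"id": 3, "text": "lung opacity with blurred contour"}]
query = ["opacity", "lung"]
candidates = cheap_filter(query, docs)
ranked = sorted(candidates, key=lambda d: -expensive_score(query, d))
print([d["id"] for d in ranked])  # [1, 3]
```

Document 2 never reaches the expensive stage, which is where the speed is saved.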
Not unique • SCISOR • P.S. Jacobs & L.F. Rau (1990) SCISOR: Extracting Information from On-line News, in Communications of the ACM (CACM), Vol. 33, No. 11, 88-97
Why do they work? • Because of the restrictions • Small subject domain. • Limited vocabulary. • Restricted type of question. • Compare with a large-scale IR system: • Keywords are good enough. • Long time to set up. • Hard to adapt to new domains.
Anything else for NLP? • Text Generation • IR system explaining itself?
Conclusions • By now, you will be able to • Outline a range of attempts to get NLP to work with IR systems • Idly speculate on why they failed • Describe the successful use of NLP in a limited domain