Artificial intelligence & natural language processing Mark Sanderson Porto, 2000
Aims • To provide an outline of the attempts made at using NLP techniques in IR
Objectives • At the end of this lecture you will be able to • Outline a range of attempts to get NLP to work with IR systems • Idly speculate on why they failed • Describe the successful use of NLP in a limited domain
Why? • Seems an obvious area of investigation • Why is it not working?
Use of NLP • Syntactic • Parsing to identify phrases • Full syntactic structure comparison • Semantic • Building an understanding of a document’s content • Discourse • Exploiting document structure?
Syntactic • Parsing to identify phrases • The issues. • Explain how it’s done (a bit). • Is it worth it? • Other possibilities • Grammatical tagging • Full syntactic structure comparison • Explain how it’s done (a little bit). • Show results.
Simple phrase identification • High frequency terms could be good candidates. • Why? • Terms co-occurring more often than chance. • Within small number of words. • Surrounding simple terms. • Not surrounding punctuation.
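The criteria above can be sketched in code. This is a minimal illustration (not any published system's implementation): count pairs of words that co-occur within a small window, never crossing punctuation, and keep pairs that occur more often than a simple chance estimate predicts.

```python
from collections import Counter

# Toy sketch of statistical phrase identification: pairs of words
# co-occurring within a small window, more often than chance, with
# punctuation acting as a phrase boundary. Thresholds are invented.

def phrase_candidates(tokens, window=3, min_ratio=2.0):
    unigrams = Counter(t for t in tokens if t.isalpha())
    total = sum(unigrams.values())
    pairs = Counter()
    for i, w in enumerate(tokens):
        if not w.isalpha():
            continue  # do not start a pair at punctuation
        for j in range(i + 1, min(i + window, len(tokens))):
            v = tokens[j]
            if not v.isalpha():
                break  # never form pairs across punctuation
            pairs[(w, v)] += 1
    out = []
    for (w, v), n in pairs.items():
        expected = unigrams[w] * unigrams[v] / total  # crude chance estimate
        if n > 1 and n / expected >= min_ratio:
            out.append((w, v, n))
    return out

tokens = ("information retrieval systems index text . "
          "information retrieval needs phrases . "
          "good information retrieval .").split()
print(phrase_candidates(tokens))  # → [('information', 'retrieval', 3)]
```

Only "information retrieval" survives: it is frequent, close, and never separated by punctuation.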
Problems • Close words that aren’t phrases. • “the use of computers in science & technology” • Distant words that are phrases. • “preparation & evaluation of abstracts and extracts”
Parsing for phrases • Using parsers to identify noun phrases. • Make a phrase out of a head and the head of its modifiers. • Example: "automatic analysis of scientific text" parses as [NP ADJ NOUN [PP PREP [NP ADJ NOUN]]].
Errors • Not a perfect rule by any means. • Need restrictions to eliminate bogus phrases. • Example: "automatic analysis of these four scientific texts" parses as [NP ADJ NOUN [PP PREP [NP DET QUANT ADJ NOUN]]].
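The head-plus-modifier-head rule, with the restrictions needed to skip determiners and quantifiers, can be sketched as follows. The tag sequence is supplied by hand here (a real system would get it from a parser), and the pairing logic is a simplified illustration, not Fagan's or Strzalkowski's actual method.

```python
# Hedged sketch of the rule "make a phrase from a head and the head of
# its modifiers": adjectives pair with the noun they precede, and the
# noun heads on either side of a preposition pair with each other.
# Determiners and quantifiers are skipped, so "these four" never
# produces a bogus phrase.

SKIP = {"DET", "QUANT"}

def head_modifier_pairs(tagged):
    pairs = []
    last_head = None     # head noun of the enclosing NP
    pending_adjs = []    # adjectives waiting for their head noun
    for word, tag in tagged:
        if tag in SKIP:
            continue
        if tag == "ADJ":
            pending_adjs.append(word)
        elif tag == "NOUN":
            for adj in pending_adjs:
                pairs.append((adj, word))        # modifier + its head
            pending_adjs = []
            if last_head is not None:
                pairs.append((last_head, word))  # NP head + PP head
            last_head = word
        # PREP needs no action: the next noun attaches to last_head
    return pairs

sent = [("automatic", "ADJ"), ("analysis", "NOUN"), ("of", "PREP"),
        ("these", "DET"), ("four", "QUANT"),
        ("scientific", "ADJ"), ("texts", "NOUN")]
print(head_modifier_pairs(sent))
# → [('automatic', 'analysis'), ('scientific', 'texts'), ('analysis', 'texts')]
```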
Do they work? • Fagan compared statistical phrases with syntactic phrases; the statistical approach won, but only just • J. Fagan (1987) Experiments in phrase indexing for document retrieval: a comparison of syntactic & nonsyntactic methods, TR 87-868, Department of Computer Science, Cornell University • More research has been conducted. • T. Strzalkowski (1995) Natural language information retrieval, in Information Processing & Management, Vol. 31, No. 3, pp 397-417
Check out TREC • Overview of the Seventh Text REtrieval Conference (TREC-7), E. Voorhees, D. Harman (National Institute of Standards and Technology) • http://trec.nist.gov/ • Ad hoc track • Fairly even between statistical phrases, syntactic phrases and no phrases.
Grammatical tagging? • Tag document text with grammatical codes? • R. Garside (1987). The CLAWS word tagging system, in The Computational Analysis of English: a corpus-based approach, R. Garside, G. Leech, G. Sampson Eds., Longman: 30-41. • Doesn't appear to work • R. Sacks-Davis, P. Wallis, R. Wilkinson (1990). Using syntactic analysis in a document retrieval system that uses signature files, in Proceedings of the 13th ACM SIGIR Conference: 179-191.
Syntactic structure comparison • Has been tried… • A. F. Smeaton & P. Sheridan (1991) Using morpho-syntactic language analysis in phrase matching, in Proceedings of RIAO '91, Pages 414-429 • Method • Parse sentences into tree structures • When you get a phrase match • Look at the linking syntactic operator. • Look at the residual tree structure that didn't match • Does not appear to work
Semantic • Disambiguation • Given a word appearing in a certain context, a disambiguator will tell you which sense it is being used in. • IR system • Index document collections by senses rather than words • Ask users which sense each of their query words carries • Retrieve on senses
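The three steps above can be sketched with a tiny sense-based inverted index. The sense identifiers ("bank/1", "bank/2") are invented for illustration; they are not the output of any real disambiguator.

```python
from collections import defaultdict

# Sketch of sense-based indexing: documents are indexed by
# (word, sense) identifiers rather than plain words, and retrieval
# intersects posting sets keyed by sense.

index = defaultdict(set)

def add_doc(doc_id, sense_tagged_words):
    for sense in sense_tagged_words:
        index[sense].add(doc_id)

def retrieve(query_senses):
    hits = [index[s] for s in query_senses if s in index]
    return set.intersection(*hits) if hits else set()

add_doc("d1", ["bank/1", "loan/1"])   # bank/1 = financial institution
add_doc("d2", ["bank/2", "river/1"])  # bank/2 = side of a river
print(retrieve(["bank/1"]))           # → {'d1'}
```

A keyword index would return both documents for "bank"; the sense index separates them, which is exactly the benefit disambiguation was hoped to deliver.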
Disambiguation • Does it work? • No (well maybe) • M. Sanderson, Word sense disambiguation and information retrieval, in Proceedings of the 17th ACM SIGIR Conference, Pages 142-151, 1994 • M. Sanderson & C.J. van Rijsbergen, The impact on retrieval effectiveness of skewed frequency distributions, in ACM Transactions on Information Systems (TOIS) Vol. 17 No. 4, 1999, Pages 440-465.
Partial conclusions • NLP has yet to prove itself in IR • Agree • D.D. Lewis & K. Sparck-Jones (1996) Natural language processing for information retrieval, in Communications of the ACM (CACM) 1996 Vol. 39, No. 1, 92-101 • Sort of don’t agree • A. Smeaton (1992) Progress in the application of natural language processing to information retrieval tasks, in The Computer Journal, Vol. 35, No. 3.
Mark's idle speculation • What people think is going on: [diagram of the keyword and NLP components]
Mark's idle speculation • What's usually actually going on: [diagram of the keyword and NLP components]
Areas where NLP does work • Systems with the following ingredients. • Collection documents cover small domain. • Language use is limited in some manner. • User queries cover tight subject area. • Documents/queries very short • Image captions • LSI, pseudo-relevance feedback • People willing to spend money getting NLP to work
RIME & IOTA • From Grenoble • Y. Chiaramella & J. Nie (1990) A retrieval model based on an extended modal logic and its application to the RIME experimental approach, in Proceedings of the 13th SIGIR conference, Pages 25-43 • Medical record retrieval system • Some database-like parts • Free text descriptions of cases
Indexing • SGN = observed sign, LOC = localisation • "an opacity affecting probably the lung and the trachea" is indexed as the structure:
  {[p], SGN}
    {[and], SGN}
      {[bears-on], SGN} → {[opacity], SGN}, {[lung], LOC}
      {[bears-on], SGN} → {[opacity], SGN}, {[trachea], LOC}
Retrieval • How do we match a user's query to these structures? • Using transformations, a bit like logic; t = uncertainty.
  {[bears-on], SGN} → {[opacity], SGN}, {[lung], LOC}
    ⇒ {[opacity], SGN}, t
    ⇒ {[lung], LOC}, t
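The idea of the transformation can be sketched in a few lines. The uncertainty value and tree encoding below are invented for illustration; the original system's transformations and weights are defined in the Chiaramella & Nie paper.

```python
# Sketch of a RIME-style transformation: a [bears-on] node linking a
# sign to a location may be replaced by either child alone, at the
# cost of multiplying in an uncertainty factor t. The value of t here
# is an assumption, not taken from the paper.

T = 0.8  # assumed uncertainty for dropping half of a bears-on structure

def transform(tree, certainty=1.0):
    """Yield (subtree, certainty) variants a query could match."""
    yield tree, certainty
    op, *children = tree
    if op == "bears-on":
        for child in children:
            yield from transform(child, certainty * T)

doc = ("bears-on", ("opacity",), ("lung",))
for variant, c in transform(doc):
    print(variant, round(c, 2))
```

A query for just "opacity" then matches this record, but with certainty 0.8 rather than 1.0, so exact structural matches rank higher than transformed ones.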
Tree transformation
  {[has-for-value], SGN}
    {[bears-on], SGN} → {[opacity], SGN}, {[lung], LOC}
    {[has-for-value], SGN} → {[contour], SGN}, {[blurred], LOC}
  ⇒
  {[has-for-value], SGN}, t
    {[opacity], SGN}
    {[has-for-value], SGN} → {[contour], SGN}, {[blurred], LOC}
Term transforms • Basic medical terms stored in a hierarchy. • Transformations possible, again with uncertainty added.
  Level 1: tumour
  Level 2: cancer → Level 3: sarcoma, hygroma
  Level 2: kyste → Level 3: polykystosis, pseudokyst
  Level 2: polyp → Level 3: polyposis
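The term transform can be sketched as expanding a query term to its descendants in the hierarchy, with each step down multiplying in an uncertainty factor. The hierarchy below is reconstructed from the slide and the per-level factor is an assumption, not a figure from the paper.

```python
# Sketch of a term transformation over the medical hierarchy: a query
# for a general term also matches more specific terms, at reduced
# certainty per level descended. STEP is an invented value.

HIERARCHY = {
    "tumour": ["cancer", "kyste", "polyp"],
    "cancer": ["sarcoma", "hygroma"],
    "kyste":  ["polykystosis", "pseudokyst"],
    "polyp":  ["polyposis"],
}
STEP = 0.7  # assumed per-level uncertainty

def expand(term, certainty=1.0):
    yield term, certainty
    for child in HIERARCHY.get(term, []):
        yield from expand(child, certainty * STEP)

for term, c in expand("tumour"):
    print(f"{term}: {c:.2f}")
```

A query for "tumour" thus retrieves records indexed under "sarcoma", but at certainty 0.49, two levels of uncertainty below an exact match.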
Isn’t this a bit slow? • Yes • Optimisation • Scan for potential documents. • Process them intensively. • Evaluation? • Not in that paper.
Not unique • SCISOR • P.S. Jacobs & L.F. Rau (1990) SCISOR: Extracting Information from On-line News, in Communications of the ACM (CACM), Vol. 33, No. 11, 88-97
Why do they work? • Because of the restrictions • Small subject domain. • Limited vocabulary. • Restricted type of question. • Compare with large scale IR system. • Keywords are good enough. • Long time to set up. • Hard to adapt to new domain.
Anything else for NLP? • Text Generation • IR system explaining itself?
Conclusions • By now, you will be able to • Outline a range of attempts to get NLP to work with IR systems • Idly speculate on why they failed • Describe the successful use of NLP in a limited domain