1 / 30

Artificial intelligence & natural language processing

Artificial intelligence & natural language processing . Mark Sanderson Porto, 2000. Aims. To provide an outline of the attempts made at using NLP techniques in IR . Objectives. At the end of this lecture you will be able to Outline a range of attempts to get NLP to work with IR systems

gella
Download Presentation

Artificial intelligence & natural language processing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Artificial intelligence & natural language processing Mark Sanderson Porto, 2000

  2. Aims • To provide an outline of the attempts made at using NLP techniques in IR

  3. Objectives • At the end of this lecture you will be able to • Outline a range of attempts to get NLP to work with IR systems • Idly speculate on why they failed • Describe the successful use of NLP in a limited domain

  4. Why? • Seems an obvious area of investigation • Why not working?

  5. Use of NLP • Syntactic • Parsing to identify phrases • Full syntactic structure comparison • Semantic • Building an understanding of a document’s content • Discourse • Exploiting document structure?

  6. Syntactic • Parsing to identify phrases • The issues. • Explain how it’s done (a bit). • Is it worth it? • Other possibilities • Grammatical tagging • Full syntactic structure comparison • Explain how it’s done (a little bit). • Show results.

  7. Simple phrase identification • High frequency terms could be good candidates. • Why? • Terms co-occurring more often than chance. • Within small number of words. • Surrounding simple terms. • Not surrounding punctuation.

  8. Problems • Close words that aren’t phrases. • “the use of computers in science & technology” • Distant words that are phrases. • “preparation & evaluation of abstracts and extracts”

  9. Parsing for phrases • Using parsers to identify noun phrases. • Make a phrase out of a head and the head of its modifiers. NP PP PREP ADJ NOUN ADJ NOUN “automatic analysis of scientific text”

  10. Errors • Not a perfect rule by any means. • Need restrictions to eliminate bogus phrases. NP PP PREP ADJ NOUN DET QUANT ADJ NOUN “automatic analysis of these four scientific texts”

  11. Do they work? • Fagan compared statistical with syntactic, statistics won, just • J. Fagan (1987) Experiments in phrase indexing for document retrieval: a comparison of syntactic & nonsyntactic methods, in TR 87-868 - Department of Computer Science, Cornell University • More research has been conducted. • T. Strzalkowski (1995) Natural language information retrieval, in Information Processing & Management, Vol. 31, No. 3, pp 397-417

  12. Check out TREC • Overview of the Seventh Text REtrieval Conference (TREC-7), E. Voorhees, D. Harman (National Institute of Standards and Technology) • http://trec.nist.gov/ • Ad hoc track • Fairly even between statistical phrases, syntactic phrases and no phrases.

  13. Grammatical tagging? • Tag document text with grammatical codes? • R. Garside (1987). The CLAWS word tagging system, in The computational analysis of english: a corpus based approach, R. Garside, G. Leech, G. Sampson Eds., Longman: 30-41. • Doesn’t appear to work • R. Sacks-Davis, P. Wallis, R. Wilkinson (1990). Using syntactic analysis in a document retrieval system that uses signature files, in Proceedings of 13th ACM SIGIR Conference: 179-191.

  14. Syntactic structure comparison • Has been tried… • A. F. Smeaton & P. Sheridan (1991) Using morpho-syntactic language analysis in phrase matching, in Proceedings of RIAO ‘91, Pages 414-429 • Method • Parse sentences into tree structures • When you get a phrase match • Look at linking syntactic operator. • Look at the residual tree structure that didn’t match • Does not to work

  15. Semantic • Disambiguation • Given a word appearing in a certain context, disambiguators will tell you what sense it is. • IR system • Index document collections by senses rather than words • Ask the users what senses the query words are • Retrieve on senses

  16. Disambiguation • Does it work? • No (well maybe) • M. Sanderson, Word sense disambiguation and information retrieval, in Proceedings of the 17th ACM SIGIR Conference, Pages 142-151, 1994 • M. Sanderson & C.J. van Rijsbergen, The impact on retrieval effectiveness of skewed frequency distributions, in ACM Transactions on Information Systems (TOIS) Vol. 17 No. 4, 1999, Pages 440-465.

  17. Partial conclusions • NLP has yet to prove itself in IR • Agree • D.D. Lewis & K. Sparck-Jones (1996) Natural language processing for information retrieval, in Communications of the ACM (CACM) 1996 Vol. 39, No. 1, 92-101 • Sort of don’t agree • A. Smeaton (1992) Progress in the application of natural language processing to information retrieval tasks, in The Computer Journal, Vol. 35, No. 3.

  18. Mark’s idle speculation • What people think is going on always Keywords NLP

  19. Keywords Mark’s idle speculation • What’s usually actually going on NLP

  20. Areas where NLP does work • Systems with the following ingredients. • Collection documents cover small domain. • Language use is limited in some manner. • User queries cover tight subject area. • Documents/queries very short • Image captions • LSI, pseudo-relevance feedback • People willing to spend money getting NLP to work

  21. RIME & IOTA • From Grenoble • Y. Chiaramella & J. Nie (1990) A retrieval model based on an extended modal logic and its application to the RIME experimental approach, in Proceedings of the 13th SIGIR conference, Pages 25-43 • Medical record retrieval system • Some database’y parts • Free text descriptions of cases

  22. SGN - observed sign LOC - localisation Indexing • “an opacity affecting probably the lung and the trachea” {[p], SGN} {[and], SGN} {[bears-on], SGN} {[bears-on], SGN} {[opacity], SGN} {[lung], LOC} {[opacity], SGN} {[trachea], LOC}

  23. t - uncertainty Retrieval • How do we match a user’s query to these structures? • Using transformations - bit like logic. {[bears-on], SGN} Þ {[lung], LOC}, t Þ {[opacity], SGN}, t {[opacity], SGN} {[lung], LOC}

  24. Tree transformation {[has-for-value], SGN} {[bears-on], SGN} {[has-for-value], SGN} {[opacity], SGN} {[lung], LOC} {[contour], SGN} {[blurred], LOC} Þ {[has-for-value], SGN}, t {[opacity], SGN} {[has-for-value], SGN} {[contour], SGN} {[blurred], LOC}

  25. Term transforms • Basic medical terms stored in a hierarchy. • Transformations possible again with uncertainty added. Level 1 Level 2 Level 3 tumour cancer sarcoma hygroma kyste polykystosis pseudokyst polyp polyposis

  26. Isn’t this a bit slow? • Yes • Optimisation • Scan for potential documents. • Process them intensively. • Evaluation? • Not in that paper.

  27. Not unique • SCISOR • P.S. Jacobs & L.F. Rau (1990) SCISOR: Extracting Information from On-line News, in Communications of the ACM (CACM), Vol. 33, No. 11, 88-97

  28. Why do they work? • Because of the restrictions • Small subject domain. • Limited vocabulary. • Restricted type of question. • Compare with large scale IR system. • Keywords are good enough. • Long time to set up. • Hard to adapt to new domain.

  29. Anything else for NLP? • Text Generation • IR system explaining itself?

  30. Conclusions • By now, you will be able to • Outline a range of attempts to get NLP to work with IR systems • Idly speculate on why they failed • Describe the successful use of NLP in a limited domain

More Related