ICT619 Intelligent Systems Topic 9: Natural Language Processing and Language Technology
What is natural language processing (NLP)? • An ideal goal for human-computer communication is the ability to communicate in a natural language • NLP grew as a sub-domain of AI and linguistics: the task of developing software capable of understanding information (commands, text) expressed in a natural language in order to achieve specific goals • Understanding natural languages is a challenging task for computers because of ambiguity, frequent reliance on context, and the overall problem of acquiring and using knowledge
Speech (voice) recognition and natural language processing • Speech recognition concerns understanding spoken commands or sentences from voice inputs Example: Telstra’s directory assistance • A speech recognition system must first extract and recognise words from audio input • We might also like the system to be able to answer in speech - this requires speech generation as well • In NLP, input is already available in machine-readable form (eg words as Unicode text) • Future improvements of speech recognition will to some extent depend on progress in NLP ICT619
Speech Recognition – The state-of-the-art • 60-90% accuracy - good enough for general dictation • Speaker dependent – needs training • Cheap desktop software available • Examples: IBM ViaVoice, Dragon NaturallySpeaking • Issues: • Isolated vs. continuous speech • Vocabulary size • Better speaker independence
Language Technology • Covers all areas related to NLP with a practical focus • Language technology is defined as: The application of knowledge about human language in computer-based solutions • Applications covered by language technology include: • Spoken language dialogue systems (speech recognition, some understanding, and speech generation) • Machine translation • Text summarisation • Information retrieval ICT619
Language Technology (cont’d) • The input to a language technology system may be provided through • speech recognition • optical character recognition (OCR) • handwriting recognition • The output may be in the form of speech, tailored documents, or web pages
Approaches to natural language processing Main Approaches • Keyword searching • Linguistic analysis • AI-based • ANN-based • Statistical analysis Keyword searching systems • Early NLP systems - and some in use today - are based on keyword searching (pattern matching) ICT619
Keyword searching NLP systems • Selected keywords or phrases are searched for in the input sentence • The program responds with specific pre-stored responses based on the keywords or phrases • Program may actually construct a response based on a partial reply coupled with keywords and phrases from the input • No real understanding of the input is involved ICT619
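The keyword-matching behaviour described above can be illustrated with a minimal sketch. The rule table, the response templates and the function name `respond` are hypothetical and chosen only for illustration; they are not taken from ELIZA or any other real system. The sketch shows both a pre-stored response and a response constructed from part of the input, with no understanding involved.

```python
import re

# Hypothetical keyword table: each entry pairs a pattern with a response template.
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bmother\b", re.IGNORECASE), "Tell me more about your family."),
    (re.compile(r"\bsorry\b", re.IGNORECASE), "There is no need to apologise."),
]
DEFAULT = "Please go on."  # used when no keyword is found


def respond(sentence: str) -> str:
    """Return the response for the first rule whose keyword pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(sentence)
        if match:
            # Reuse a fragment of the input if the pattern captured one.
            fragment = match.group(1).rstrip(".!?") if match.groups() else ""
            return template.format(fragment)
    return DEFAULT


print(respond("I need a holiday."))
print(respond("The weather is nice"))
```

Anything outside the rule table falls through to the default reply, which is the inflexibility that limits systems of this kind.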
Keyword searching NLP systems (cont’d) The best-known example is the ELIZA program, developed at MIT in the mid-1960s
Keyword systems • Limitations • Inflexible - really just reactive responses • Unable to cope with anything not in their keyword look-up tables, and • No knowledge modelling • Today’s more sophisticated NLP systems • Try to understand the content of language by doing syntactical, semantic and pragmatic analyses • May be able to do some conceptual modelling • Better able to maintain continuous dialogues • Attempt to cope with the ambiguity and other features common in natural language ICT619
Other approaches to NLP Linguistic analysis approach • Based on encoding formal grammar rules for sentence-level processing • A linguistically-oriented system focuses on syntax and semantics AI-based systems • Focus on using world knowledge to understand language • One example of an AI-based NLP system is BORIS • written by Michael Dyer, a student of Roger Schank's • a story understanding program that reads a narrative and answers questions about it
AI-based NLP example - BORIS The BORIS system (from Roger Schank and Peter Childers, The Cognitive Computer). ICT619
Artificial neural networks based NLP ANN-based systems • Use ANNs for processing language, particularly for lexical disambiguation • A neural net is trained to disambiguate by using context • Training presents units of 6 or so words containing the target word to be learned • Example: Disambiguation of word “bank” in “We got a bank loan to buy a house” • Two possible senses: money sense, river sense • Groups of co-occurring words (neighbourhoods): • Money sense: bank money loan branch fee robbery • River sense: bank river bridge erosion earth slope
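To make the idea of neighbourhoods concrete, here is a small sketch that picks a sense for “bank” by counting how many context words fall in each neighbourhood listed above. This is deliberately not a neural network: a plain overlap count stands in for the trained net, and the function name `disambiguate` and the tie-breaking rule (first sense listed wins) are assumptions made for illustration.

```python
# Neighbourhoods copied from the slide; everything else is a hypothetical sketch.
NEIGHBOURHOODS = {
    "money": {"bank", "money", "loan", "branch", "fee", "robbery"},
    "river": {"bank", "river", "bridge", "erosion", "earth", "slope"},
}


def disambiguate(sentence, target="bank"):
    """Score each sense by its overlap with the context words around the target."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    words.discard(target)  # the target word itself appears in every neighbourhood
    scores = {sense: len(words & hood) for sense, hood in NEIGHBOURHOODS.items()}
    best = max(scores, key=scores.get)  # ties go to the first sense listed
    return best, scores


sense, scores = disambiguate("We got a bank loan to buy a house")
print(sense, scores)
```

A trained network would learn graded associations from many such contexts rather than relying on a fixed word list.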
Statistical approach to NLP • Based on extracting statistically significant information - tags - from large corpora or bodies of text (millions of words) and using these as very general indexes to model parts or responses • Valuable because it does not require as much hand-modelling of knowledge; the tags are acquired automatically • Statistical methods are now receiving much attention, and more systems are likely to incorporate them in future • Most NLP systems use a combination of the linguistic and AI approaches
Components of NLP systems • Five major elements: the parser, the lexicon, the semantic analyser, the knowledge base, and the generator ICT619
Components of NLP systems (cont’d) • A syntactical parser analyses the input sentence using the language's grammar or rules of syntax • The output produced is a structural description of the sentence, known as a parse tree • Some rules of syntax for English: • S = NP + VP, where S: sentence, NP: noun phrase, VP: predicate or verb phrase • The noun phrase can be more than a single noun: NP = D + ADJ + N, where D: determiner (eg, “a”, “this”), ADJ: adjective, N: main noun
Components of NLP systems (cont.) The lexicon • An internal dictionary used to perform the syntactic and semantic analysis • Contains semantic and grammatical information (eg, part-of-speech) about words or word strings Fig. An example parse tree for the sentence “Mary had a little lamb” ICT619
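The interplay of parser and lexicon can be sketched with a tiny hand-written parser for the rules S = NP + VP and NP = D + ADJ + N. Several details here are assumptions not stated on the slides: the determiner and adjective are treated as optional, the verb phrase is taken to be VP = V + NP, and the five-word lexicon and the nested-tuple tree format are invented for this example.

```python
# Hypothetical lexicon: word -> part of speech.
LEXICON = {"mary": "N", "had": "V", "a": "D", "little": "ADJ", "lamb": "N"}


def parse_np(tags, words, i):
    """NP = [D] [ADJ] N. Returns (subtree, index of the next unread word)."""
    children = []
    for optional in ("D", "ADJ"):
        if i < len(tags) and tags[i] == optional:
            children.append((optional, words[i]))
            i += 1
    if i >= len(tags) or tags[i] != "N":
        raise ValueError("expected a noun at position %d" % i)
    children.append(("N", words[i]))
    return ("NP", children), i + 1


def parse_sentence(sentence):
    """S = NP + VP, with VP assumed to be V + NP."""
    words = sentence.rstrip(".").split()
    tags = [LEXICON[w.lower()] for w in words]  # lexicon lookup
    subject, i = parse_np(tags, words, 0)
    if i >= len(tags) or tags[i] != "V":
        raise ValueError("expected a verb")
    verb = ("V", words[i])
    obj, i = parse_np(tags, words, i + 1)
    if i != len(tags):
        raise ValueError("unexpected trailing words")
    return ("S", [subject, ("VP", [verb, obj])])


print(parse_sentence("Mary had a little lamb"))
```

The nested tuples play the role of the parse tree that the semantic analyser would consume next.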
The semantic analyser and the knowledge base • The semantic analyser uses the parse tree and the knowledge base to try to determine what the sentence means • It creates another data structure that represents the meaning of the input sentences • It can also draw inferences from input statements using general knowledge in the KB • The semantic analyser's data structure and those in the KB should be in a common knowledge representation, such as KQML or Conceptual Graphs ICT619
The Generator • The generator uses the KB data structure created by the semantic analyser to create a usable output • The response depends in part on the pragmatics of the input language eg greetings require greetings, questions require answers, commands require actions • The data structure can be used to initiate some action, • eg the language system is a front-end to a DBMS. The generator writes commands in a query language to begin a search • Simple generators feed standard pre-stored output responses to the user based on the built meaning representation • More sophisticated generators construct an original response by instantiating templates based on models of language use ICT619
Applications of NLP - Natural language interfaces (NLI) • An NLP system can be the front-end of information systems to provide a more user-friendly interface • Eg, the command “List details of all files in this folder sorted by time of creation” is much friendlier than “ls -atl”, especially using voice input • An NLI processes sentences exchanged between a user and an application • Task made easier by the restricted domain of discourse, eg, in databases due to their highly restricted domain containing information on a single area of application • Interfaces for expert systems, operating systems and document retrieval systems are also being developed
Homer: A Language-using Agent Source: Vere, S. & Bickmore, T. A Basic Agent Computational Intelligence, 1990, 6, 4, 41-60.
Natural Language - Homer
TIM> Drop the package at the barge next Saturday at 9pm.
HOMER> OK.
TIM> Are you going to be at the pier next Saturday?
HOMER> Yes.
TIM> Why are you going to the pier?
HOMER> I’ll go to the pier in order to pick up the package.
TIM> What time are you going to pick it up?
HOMER> 8:56pm.
STEVE> Where will the package be next Saturday?
HOMER> At the barge.
STEVE> What is in front of you?
HOMER> A log.
STEVE> Do you own the log?
HOMER> No I don’t.
STEVE> The log belongs to you.
HOMER> Oh.
STEVE> Cows eat grass.
HOMER> I know.
STEVE> Do you own the log now?
HOMER> Yes I do.
Examples of commercial NLIs: Intellect Intellect (Trinzic Corp.) • One of the most widely used natural language front-end interfaces available for mainframes • Designed for use with DBMS under IBM operating system environments • In addition to allowing access to data in a database, Intellect allows creation of databases using natural language • The built-in lexicon may be modified to fit a particular application
Q&A (Symantec Corp.) • A basic file manager with a natural language front-end called “The Intelligent Assistant” • Parses common English input questions and converts them into queries that the file manager can understand • Paraphrases input requests to ensure full understanding of what the user wants • Eg, User input: Show the total 1992 sales for the Central Region • Q&A Intelligent Assistant’s response: Shall I do the following? Create a report showing the amount of sales for the central region in 1992? Y(es) – Continue N(o) – Cancel request • Symantec discontinued and then sold Q&A to a German company called CAB GmbH.
Machine translation Goal: • To translate text or speech from one natural language into another Applications include: • Desktop and web-based translation services • Spoken language translation services (eg phone-based) Requirements: • Understanding meaning of input sentences • This would involve a semantic analysis of the input using semantic knowledge • An automatic translation system is expected to be robust and not stop whenever it encounters an item it cannot understand
Machine translation (cont’d) Current approaches use a transfer grammar • Input text → Partial analysis → 1st intermediate representation of content (related to the source language) • 1st intermediate representation → Transformation using a transfer grammar → 2nd intermediate representation (related to the target language) • 2nd intermediate representation → NL generator → Text in target language • Machine translation as performed since the mid-1960s is not true “understanding” of text • By 1991, systems that could process sentences with limited vocabulary started appearing
Current state-of-the-art of machine translation • Broad coverage MT systems already available on the Web with fast turnaround time and acceptable error rate • Higher accuracy achieved by domain-specific systems • For example, controlled language used in Caterpillar manuals Machine translation products • Bowne Global Solution’s iTranslator • www.itranslator.com • Systran’s Babel Fish (used by AltaVista) • www.systransoft.com ICT619
Current state-of-the-art of machine translation (cont’d) An example: Systran’s Web-based Translator ICT619
Spoken language dialogue systems • Communicate with users via automatic speech recognition and text-to-speech interfaces • Mediate the user’s access to a back-end database Examples: • Information services: stock quotes, timetables • Transaction services: banking, betting, flight reservations • Current technology has been claimed to be capable of reducing call centre costs from $75 to 18c a call Some issues: • Telephony-based systems cannot afford a training period • Making a conversation too realistic falsely raises user expectations and can confuse the system ICT619
Spoken language dialogue systems (cont’d) More issues: • Error handling is a significant issue • Giving initiative to the user increases difficulty Some relatively successful examples: • A Sydney taxi booking service (about 30% of cases have to go to human operators) • Telstra directory assistance service (15-20% accuracy but 15-20% of automation may be useful enough) • Fielded spoken language dialogue system applications: • Nuance (www.nuance.com) • ScanSoft/SpeechWorks (www.scansoft.com) • Philips (www.speech.philips.com)
Text processing • A number of different applications dealing with the processing of continuous text may be grouped together under this heading • Editing tools • Most common example: spelling and syntax (or grammar) checkers Characterised by avoidance of deep semantic processing • Content extraction • Concerns extraction of specific information from texts • Examples: • Extraction of information related to financial transaction from a bank telex or of bibliographic information from research papers ICT619
Text processing (cont’d) • Content extraction (cont’d) • Requires deep semantic analysis which is aided by the restricted domain and a priori knowledge of the information to be extracted • Commercial systems exist for electronic mail processing, banking systems and automatic summary generation • Examples: • ATRANS from Cognitive Systems • DEAL-READER from Gecosys ICT619
Text processing (cont.) Text summarisation Objective: • To produce a version of a document shorter than the original document • Applications of text summarisation are found in • Information browsing • Voice delivery of Web pages and email • Issues concerning text summarisation • Different kinds of summaries: • Indicative (what is it about?) vs Informative (what is there of interest to user?) • Real summarisation requires real understanding ICT619
Text summarisation state-of-the-art • Commercial systems work on a ‘sentence-extraction’ model Sentences regarded as ‘important’ are extracted and put together • Importance of sentences decided on the basis of location, inclusion of key words, statistical information such as frequency • Current systems are relatively knowledge-free • Not based on real understanding of the text • Some text summarisation applications currently available: • CognIT’s CORPORUM (www.cognit.com) • INXight’s Summarizer (www.inxight.com) • MS Word’s summarisation tool ICT619
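The sentence-extraction model can be sketched in a few lines. The version below scores each sentence by the average corpus frequency of its non-stopword terms and returns the top-scoring sentences in their original order. The stopword list, the scoring formula and the function name `summarise` are assumptions made for illustration; commercial systems also weigh sentence location and key words, which this sketch ignores.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to", "it", "on"}


def tokens(sentence):
    """Lower-cased content words of a sentence."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarise(text, n_sentences=1):
    """Extract the n highest-scoring sentences, kept in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in tokens(s))

    def score(sentence):
        toks = tokens(sentence)
        return sum(freq[w] for w in toks) / len(toks) if toks else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore original order
    return " ".join(sentences[i] for i in chosen)


text = ("Parsing builds a parse tree. The weather was mild. "
        "A parse tree describes sentence structure.")
print(summarise(text, 2))
```

As the slide notes, nothing here depends on understanding the text: the off-topic sentence is dropped only because its words are rare.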
Search and Information Retrieval • Ever increasing amount of information available worldwide, particularly on the Internet • Searching for and retrieving information relevant to a topic of interest is an active area of research and application • Document retrieval (DR) • Also known as text retrieval • Involves retrieving text ranging from paragraph to book length for humans to read • DR may involve • searching well-maintained bibliographic databases • scanning hard disks for missing files • searching thousands of Web servers for natural language articles on a topic of interest
Search and Information Retrieval (cont’d) • Efficacy of a DR system is measured by • Precision: the proportion of retrieved documents that are relevant, and • Recall: the proportion of relevant documents that are retrieved • Retrieval depends on indexing - indicating what documents are about • Indexing requires an indexing language, a term vocabulary, and a method for constructing requests and document descriptions • Both controlled language indexing and the more sophisticated natural language indexing require NLP capabilities • Compact descriptions of a document’s significance may increase the efficiency of matching • Increasing both recall and precision is the fundamental goal of index languages
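Precision and recall are simple ratios over sets of documents, as the short sketch below shows. The function name `precision_recall`, the document identifiers and the convention of returning 0.0 for an empty set are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, two of them relevant; three relevant documents exist.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r)
```

Here precision is 2/4 and recall is 2/3. Retrieving every document would push recall to 1 while precision collapses, which is why an index language has to improve both together.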
Search and Information Retrieval (cont’d) Current topics of interest in search and information retrieval include: • In a concept-based search, documents are characterised by relevant concepts and not just key words • For example, a search for ‘car’ should also retrieve documents on 'automobiles' • Named entity recognition involves recognising names of people, places, organisations etc. • One person or organisation can be referred to by many name variants – eg, John Howard, Mr. Howard, J.W. Howard, the PM • Many persons or organisations can share the same name – eg, politician John Howard, actor John Howard
Search and Information Retrieval (cont’d) Search and Information Retrieval State-of-the-art • Current trend (eg Google) is to expand the search vocabulary by using thesauri (eg, ‘car’, ‘automobile’) • Linguistic analysis to identify phrases relevant to the initial query • Key phrases can be more useful than key words alone • Can be used to expand an initial user query (Khan & Khor 2004) • Some current search and information retrieval applications: • Ultra Find: www.ultradesign.com/untrafind/ultrafind.html • Lotus Discovery Server: www.lotus.com/products/discserver.nsf • Smart text processing suites: • Inxight: www.inxight.com • Verity: wwwl.verity.com
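Thesaurus-based expansion of the search vocabulary can be sketched as follows. The two-entry thesaurus and the function name `expand_query` are hypothetical, and this is a plain synonym lookup, not the phrase-based method of Khan & Khor (2004) nor the approach of any particular search engine.

```python
# Hypothetical thesaurus: term -> list of synonyms.
THESAURUS = {"car": ["automobile"], "automobile": ["car"]}


def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms plus their synonyms, without duplicates, in order."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("used car prices"))
```

Matching documents against the expanded term list lets a search for ‘car’ also reach documents that only mention ‘automobile’, typically raising recall at some cost in precision.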
Challenges faced by NLP • A good NLP system must be capable of handling common linguistic problems caused by ambiguities and the use of context • Prepositional phrase attachment • A sentence can often be analysed in more than one way, producing multiple parse trees for the sentence. • Example sentence: • “John saw the boy in the park with a telescope” has 3 possible parses Without contextual knowledge, it is not known whether John was looking through the telescope, the boy had a telescope, or the park had a telescope in it. ICT619
Challenges faced by NLP (cont’d) Lexical ambiguity • When words have multiple meanings • A classic example: • Time flies like an arrow. • Fruit flies like a banana. • In the first case, “flies” is a verb and “like” is a preposition • In the second case, “flies” is a noun and “like” is a verb.
Challenges faced by NLP (cont.) Anaphoric reference or pronoun resolution • Problem of figuring out what a pronoun refers to • Examples: (1) Give me the names of all managers and how much they earn. (2) Mary went to see Jane. She was happy to see her. • In (1), it is easy to decide that “they” refers to the managers already mentioned • In (2), it is difficult to decide who “she” and “her” refer to – was Mary happy to see Jane, or was Jane happy to see Mary?
Challenges faced by NLP (cont.) Ellipsis • Sentences appearing to have parts missing • Example • John works in Personnel, Mary in Accounting. • “Mary in Accounting” lacks a verb but is understandable using the context of the entire sentence • “Mary in Accounting” is an elliptical form of “Mary works in Accounting”
Challenges faced by NLP (cont.) Quantifier scope • Quantifiers such as “all”, “every”, “some”, and “no” can be ambiguous • Example: • Every employee does not like Mr Smith • Meaning: either not a single employee likes Mr Smith, or some do and some don’t • No current NLP system can handle all of these problems - there is no unrestricted NLP system yet • Yet some, such as HOMER, can handle the most common forms
REFERENCES
• Germain, E., Introducing Natural Language Processing, AI Expert, August 1992, pp. 30-35.
• Lewis, D.D., and Jones, K.S., Natural Language Processing for Information Retrieval, Communications of the ACM, Vol. 39, No. 1 (January 1996), pp. 92-100.
• Turban, E., Decision Support and Expert Systems, Prentice Hall, Englewood Cliffs, New Jersey, 1995, pp. 242-257.
• Thayse, A. (Editor), From Natural Language Processing to Logic for Expert Systems, John Wiley & Sons, 1991.
• Cole, R., Zaenen, A., & Zampolli, A. (eds), Survey of the State of the Art in Human Language Technology, Cambridge University Press, 1998. Available on the web: http://cslu.cse.ogi.edu/HLTsurvey/
• Dale, R., Language Technology: Applications and Techniques, Tutorial 2004, The 8th Pacific Rim Int. Conf. on Artificial Intelligence, Auckland, 9-13 August, 2004.
• Khan, M.S., and Khor, S., “Automatic Query Expansion for Enhanced Web Document Retrieval”, Journal of the American Society for Information Science and Technology, Vol. 55, No. 1, 2004, pp. 29-40.