Computational Language

Computational Language Andrew Hippisley

Computational Language • Computational language and AI • Language engineering: applied computational language • Case study: spell checkers

Computational language & AI • Artificial Intelligence: “the simulation on computer of distinctly human mental functions.” Wilks (1993)

Computational language & AI • Language integral to intelligent systems • Artificial Intelligence • Turing Test • ELIZA

ELIZA

Computational language & AI • Why language engineering? • Language integral to intelligent systems • Artifiicial Intelligence • Turing Test • ELIZA • Expert systems: natural language interface, natural language database

Computational language & AI • Methods shared across systems • Finite State Transition Networks (FSTN) • Logic • Formal rules • Probability • Data: you know it!

Applied computational language • History of the field • Machine Translation: 1960, 1966, post 1966 • Database access • Text interpretation • Information retrieval • Text categorisation

Language engineering • Information overload • Need a way of automatically processing text documents • Information extraction

Language engineering • Information extraction • GIDA: system for automatically monitoring financial market sentiment

GIDA

Language engineering • Information overload • Need a way of automatically processing text documents • Information extraction • Summarisation

Automatic summarisation(courtesy of Paulo FERNANDES de OLIVEIRA, PhD) • Get information source; • Extract some content from it; • Present the most important part to the user xx xxx xxxx x xx xxxx xxx xx xxx xx xxxxx x xxx xx xxx xx x xxx xx xx xxx x xxx xx xxx x xx x xxxx xxxx xxxx xx xx xxxx xxx xxx xx xx xxxx x xxx xx x xx xx xxxxx x x xx xxx xxxxxx xxxxxx x x xxxxxxx xx x xxxxxx xxxx xx xx xxxxx xxx xx x xx xx xxxx xxx xxxx xx xxx xx xxx xxxx xx xxx x xxxx x xx xxxx xx xxx xxxx xx x xxx xxx xxxx x xxx x xxx xx xx xxxxx x x xx xxxxxxx xx x xxxxxx xxxx xx xx xxxxx xxx xx xxx xx xxxx x xxxxx xx xxxxx x

Lexical Cohesion Links Example Sentence 23: J&J's stock added 83 cents to $65.49. Sentence 15: "For the stockmarket this move was so deeply discounted that I don't think it will have a major impact". Sentence 26: Flagging stockmarkets kept merger activity and new stock offerings on the wane, the firm said. Sentence 42: Lucent, the most active stock on the New York Stock Exchange, skidded 47 cents to $4.31, after falling to a low at $4.30. Text title: U.S. stocks hold some gains. Collected from Reuters’ Website on 20 March 2002.

Lexical Cohesion Bonds Example 17. In other news, Hewlett-Packard said preliminary estimates showed shareholders had approved its purchase of CompaqComputer -- a result unconfirmed by voting officials. 19. In a related vote, Compaqshareholders are expected on Wednesday to back the deal, catapulting HP into contention against International Business Machines for the title of No. 1 computer company. Text title: U.S. stocks hold some gains. Collected from Reuters’ Website on 20 March 2002.

Language engineering • Information overload • Need a way of automatically processing text documents • Information extraction • Summarisation • Translation • Retrieve only relevant documents • Voice processing

Language engineering • Two main approaches • Symbolic • Stochastic

Case study spell checkers

Spelling dictionaries • aim? • given a sequence of symbols: • 1. identify misspelled strings • 2. generate a list of possible ‘candidate’ correct strings • 3. select most probable candidate from the list

Spelling dictionaries • Implementation: • Probabilistic framework • bayesian rule • noisy channel model

Spelling dictionaries • Types of spelling error • actual word errors • non-word errors

Spelling dictionaries • Types of spelling error • actual word errors • /piece/ instead of /peace/ • /there/ instead of /their/ • non-word errors

Spelling dictionaries • Types of spelling error • actual word errors • /piece/ instead of /peace/ • /there/ instead of /their/ • non-word errors • /graffe/ instead of /giraffe/

Spelling dictionaries • Types of spelling error • actual word errors • /piece/ instead of /peace/ • /there/ instead of /their/ • non-word errors • /graffe/ instead of /giraffe/ • of all errors in type written texts, 80% are non-word errors

Spelling dictionaries • non-word errors • Cognitive errors • /seperate/ instead of /separate/ • phonetically equivalent sequence of symbols has been substituted • due to lack of knowledge about spelling conventions

Spelling dictionaries • non-word errors • Cognitive errors • Typographic (‘typo’) errors • influenced by keyboard • e.g. substitution of /w/ for /e/ due to its adjacency on the keyboard • /thw/ instead of /the/

Spelling dictionaries • non-word errors • noisy channel model • The actual word has been passed through a noisy communication channel • This has distorted the word, thereby changing it in some way • The misspelled word is the distorted version of the actual word • Aim: recover the actual word by hypothesising about the possible ways in which it could have been distorted

Spelling dictionaries • non-word errors • noisy channel model • What are the possible distortions? • insertion • deletion • substitution • transposition • all of these viewed as transformations that take place in the noisy channel

Spelling dictionaries • Implementing spelling identification and correction algorithm

Spelling dictionaries • Implementing spelling identification and correction algorithm • STAGE 1: compare each string in document with a list of legal strings; if no corresponding string in list mark as misspelled • STAGE 2: generate list of candidates • Apply any single transformation to the typo string • Filter the list by checking against a dictionary • STAGE 3: assign probability values to each candidate in the list • STAGE 4: select best candidate

Spelling dictionaries • STAGE 3 • prior probability • given all the words in English, is this candidate more likely to be what the typist meant than that candidate? • P(c) = c/N where N is the number of words in a corpus • likelihood • Given, the possible errors, or transformation, how likely is it that error y has operated on candidate x to produce the typo? • P(t/c), calculated using a corpus of errors, or transformations • Bayesian rule: • get the product of the prior probability and the likelihood • P(c) X P(t/c)

Spelling dictionaries • non-word errors • Implementing spelling identification and correction algorithm • STAGE 1: identify misspelled words • STAGE 2: generate list of candidates • STAGE 3a: rank candidates for probability • STAGE 3b: select best candidate • Implement: • noisy channel model • Bayesian Rule

Next week Finite state machines and regular expressions

Computational Language