Towards an NLP ‘module’

The role of an utterance-level interface

Modular architecture
[Diagram: language-independent application; meaning representation; language module; utterance-level interface; text or speech]

Desiderata for NLP module
• Application- and domain-independent
• Bidirectional processing
• No grammar-specific information should be needed in the application
• Architecture should support multiple languages
• Practical
• Coverage: all well-formed input should be accepted, and the module should be robust to speaker errors

Why?
• developers could build ‘intelligent’ responsive applications without being NLP experts themselves
• less time-consuming and expensive than doing the NLP for each application domain
• multilingual applications
• support further research

LinGO/DELPH-IN
• Software and ‘lingware’ for application- and domain-independent NLP
• Linguistically motivated (HPSG), deep processing
• Multiple languages
• Analysis and generation
• Informal collaboration since c. 1995
• NLP research and development, theoretical research and teaching
• www.delph-in.net

What’s different?
• Open Source, integrated systems
• Data-driven techniques combined with linguistic expertise
• Testing
  • empirical basis, evaluation
  • linguistic motivation
• No toy systems!
  • large-scale grammars
  • maintainable software
  • development and runtime tools

Progress
• Application- and domain-independent: reasonable (lexicons, text structure)
• Bidirectional processing: yes
• No grammar-specifics in applications: yes
• Multiple languages: English, Japanese, German, Norwegian, Korean, Greek, Italian, French, plus grammar sharing via the Matrix
• Practical: efficiency OK for some applications and improving; interfaces?
• Coverage and robustness: 80%+ coverage on English, good parse selection, not robust

Integrating deep and shallow processing
• Shallow processing: speed and robustness, but lacks precision
• Pairwise integration of systems is time-consuming and brittle
• Common semantic representation language: shallow processing yields underspecified representations
• Demonstrated effectiveness on IE (Deep Thought)
• Requires either that systems share tokenization (undesirable and impractical) or that output can be precisely aligned with the original document
• Markup complicates this

Utterance-level interface
• text or speech
• complex cases
• text structure (e.g., headings, lists)
• non-text (e.g., formulae, dates, graphics)
• segmentation (esp. Japanese, Chinese)
• speech lattices
• integration of multiple analyzers

Utterance interface
• Standard interface language
  • allow for ambiguity at all levels
  • XML
  • collaborating with ISO working group (MAF)
  • processors deliver standoff annotations to original text
• Plan to develop finite-state preprocessors for some text types, allow for others
• Plan to experiment with speech lattices
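The idea of standoff annotation can be sketched as follows: the original text is left untouched, and each annotation points into it by character offset. This is a minimal illustration only. The element and attribute names (`annotations`, `token`, `start`, `end`) are assumptions made for the example and do not reproduce the MAF schema.

```python
import re
import xml.etree.ElementTree as ET


def standoff(text):
    """Build standoff token annotations for whitespace-separated units.

    Element and attribute names are hypothetical, chosen for illustration.
    """
    root = ET.Element("annotations")
    for i, m in enumerate(re.finditer(r"\S+", text)):
        # Each token records only where it sits in the original string.
        ET.SubElement(root, "token", id=f"t{i}",
                      start=str(m.start()), end=str(m.end()))
    return root


text = "I want a laptop-with a case."
root = standoff(text)
print(ET.tostring(root, encoding="unicode"))

# The surface string of any annotation is recovered from the offsets.
for tok in root.findall("token"):
    print(tok.get("id"), text[int(tok.get("start")):int(tok.get("end"))])
```

Because the annotations never copy or alter the text, several processors can annotate the same document independently, and competing analyses can cover the same span.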

Assumptions about tokenization
• tokenization: input data is transformed into a form suitable for morphological processing or lexical lookup:
  • What’s in those 234 dogs’ bowls, Fred?
  • what ’s in those <num 234> dogs ’s bowls , Fred ?
• tokenization is therefore pre-lexical and cannot depend on lexical lookup
• normalization (case, numbers, dates, formulae) as well as segmentation
• stripping punctuation used to be common, but large-coverage systems make use of it
• in generation: go from tokens to final output
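The example above can be reproduced with a small pre-lexical tokenizer. This is a sketch under stated assumptions, not the DELPH-IN preprocessor: the rules (lowercase only the first word, wrap digit sequences as `<num …>`, rewrite a bare apostrophe after a final s as ’s) are simply the ones needed for this one sentence, and the function name `tokenize` is hypothetical.

```python
import re

# Alternatives are tried left to right at each position.
TOKEN_RE = re.compile(
    r"""
    (?P<num>\d+)                    # digit sequence
  | (?P<clitic>[’']s\b)             # clitic or possessive 's
  | (?P<poss>(?<=s)[’'](?=\s|$))    # bare apostrophe after a final s
  | (?P<word>[^\W\d_]+)             # run of letters
  | (?P<punct>[^\w\s]|_)            # any other single symbol
    """,
    re.VERBOSE,
)


def tokenize(sentence):
    """Segment and normalize one sentence without any lexical lookup."""
    tokens = []
    for m in TOKEN_RE.finditer(sentence):
        kind, text = m.lastgroup, m.group()
        if kind == "num":
            text = f"<num {text}>"
        elif kind == "poss":
            # Assumes the apostrophe is possessive; it could be a closing quote.
            text = "’s"
        elif kind == "word" and not tokens:
            # Naive case normalization: lowercase the sentence-initial word only.
            text = text.lower()
        tokens.append(text)
    return tokens


print(" ".join(tokenize("What’s in those 234 dogs’ bowls, Fred?")))
```

The possessive rule commits to a single reading of the apostrophe, which is exactly the kind of decision that becomes a problem in a pipelined system.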

Tokenization ambiguity
• Unusual to find cases where humans have any difficulty: the problem arises because we need a pipelined system
• Some examples:
  • `I washed the dogs’ bowls’, I said. (first ’ could be end of quote)
  • The ’keeper’s reputations are on the line. (first ’ actually indicates an abbreviation for goalkeeper, but could be start of quote in text where ’ is not distinct from `)
  • I want a laptop-with a case. (common in email not to have spaces around a dash)
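One way to avoid committing too early is to pass every plausible tokenization downstream. The sketch below does this for the dash case only. The function names are hypothetical, and sentence-final punctuation is left out of the input to keep the example short.

```python
def hyphen_readings(word):
    """Return the alternative token sequences for one whitespace-delimited word."""
    left, sep, right = word.partition("-")
    if not (sep and left and right):
        return [[word]]
    return [
        [word],              # a genuine hyphenated word
        [left, "-", right],  # a dash typed without surrounding spaces
    ]


def tokenizations(sentence):
    """Enumerate every combination of per-word readings."""
    results = [[]]
    for word in sentence.split():
        results = [prev + alt
                   for prev in results
                   for alt in hyphen_readings(word)]
    return results


for reading in tokenizations("I want a laptop-with a case"):
    print(reading)
```

Later stages, which do have lexical and syntactic knowledge, can then discard the reading that fails to parse.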

Modularity problems
• lexicon developers may assume a particular tokenization: e.g., hyphen removal
• different systems tokenize differently: big problem for system integration
• DELPH-IN ‘characterization’: record original string character positions in each token and all subsequent units
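Characterization can be illustrated with tokens that carry their character span, so that any larger unit built from them still maps back to the original string. This is a sketch of the idea only: the `Token` class and function names are assumptions for the example, not DELPH-IN data structures.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # normalized form handed to later stages
    start: int  # offset of the first character in the original string
    end: int    # offset one past the last character


def characterize(text):
    """Detach punctuation and lowercase, keeping original offsets."""
    return [Token(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def span_of(tokens):
    """Character span of a unit built from a contiguous run of tokens."""
    return tokens[0].start, tokens[-1].end


text = "I want a laptop-with a case."
toks = characterize(text)
start, end = span_of(toks[3:6])   # the tokens laptop, -, with
print(text[start:end])            # the original substring for that unit
```

Two systems that tokenize this sentence differently can still be aligned, because both report spans over the same original string.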

Speech output
• Speech output from a transcribing recognizer is treated as a lattice of tokens
• may actually require retokenization
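A token lattice can be sketched as a directed acyclic graph whose edges are word hypotheses. The representation below (a dict from node to outgoing edges) and the example words are illustrative assumptions, not the output format of any particular recognizer.

```python
def paths(lattice, node, final):
    """Yield every token sequence from node to final.

    lattice maps a node to a list of (token, next_node) edges.
    """
    if node == final:
        yield []
        return
    for token, nxt in lattice.get(node, []):
        for rest in paths(lattice, nxt, final):
            yield [token] + rest


# Hypothetical recognizer output with two competing segmentations.
lattice = {
    0: [("recognize", 3), ("wreck", 1)],
    1: [("a", 2)],
    2: [("nice", 3)],
    3: [("speech", 4), ("beach", 4)],
}

hypotheses = list(paths(lattice, 0, 4))
for h in hypotheses:
    print(" ".join(h))
```

The lattice stores the four hypotheses compactly, and a parser that accepts lattice input can explore them without the recognizer having to pick one.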

Non-whitespace languages
• Segmentation in Japanese (e.g., ChaSen) is (in effect) accompanied by lexical lookup / morphological analysis
• we definitely do not want to assume this for English: for some forms of processing we may not have a lexicon
