SProUT Shallow Processing with Unification and Typed Feature Structures

Jakub Piskorski, Language Technology Lab, DFKI GmbH

Shallow Text Processing
[Diagram: text documents are fed into shallow text processing. Components: tokens, word stems, phrases, clause structure, named entities, domain-specific patterns. Tasks and benefits: concept indices and more accurate queries, building ontologies, document indexing/retrieval, term association extraction, template generation, text mining, information extraction, Q/A systems, semi-structured data, fine-grained concept matching, text classification, automatic database construction. Application areas: executive information systems, multi-agents, e-commerce, data warehousing, workflow management.]

Finite-State based approaches
- SPPC: pure finite-state based STP, small number of basic predicates
- SMES: predicates inspect arbitrary properties of the input tokens/fragments
- FASTUS: uses CPSL (Common Pattern Specification Language)
- GATE: uses JAPE (Java Annotation Patterns Engine)

Motivation for SProUT
One System for Multilingual and Domain Adaptive Shallow Text Processing
- Trade-off between efficiency and expressiveness
- Modularity
- Flexible integration of different processing modules
- Portability
- Industrial standards

SProUT is joint work by: Markus Becker, Witold Drożdżyński, Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Feiyu Xu

SProUT Architecture
[Diagram: the architecture is divided into a part labeled "Grammar development environment" and a part labeled "Online processing". Labeled elements: input data, linguistic processing resources, lexical resources, stream of text items, XTDL grammar, regular compiler, extended optimized finite-state network, XTDL interpreter, structured output data, JTFS, finite-state machine toolkit.]

Core Components: FSM Toolkit
- Finite-state machine toolkit for building, combining, and optimizing finite-state devices
- Finite-state machine models: FSA, WFSA, FST, WFST
- Arbitrary real-valued semirings
- Some new, crucial STP-relevant operations (e.g., incremental construction of minimal deterministic FSAs)
- Functionality similar to the AT&T tools
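To illustrate what "arbitrary real-valued semirings" means for weighted automata, here is a minimal sketch. All names in it (SemiringSketch, Semiring, TROPICAL, PROBABILITY, pathWeight, totalWeight) are hypothetical and are not taken from the SProUT toolkit. It only shows the general idea: the semiring's "times" combines arc weights along one path, and its "plus" combines the weights of alternative paths.

```java
/** Hypothetical illustration of semiring-weighted paths; not the SProUT FSM toolkit API. */
final class SemiringSketch {

    /** A semiring over doubles: 'plus' combines alternative paths, 'times' extends a path. */
    interface Semiring {
        double plus(double a, double b);
        double times(double a, double b);
        double zero();  // identity element of plus
        double one();   // identity element of times
    }

    /** Tropical semiring (min, +): the total weight is the cost of the cheapest path. */
    static final Semiring TROPICAL = new Semiring() {
        public double plus(double a, double b) { return Math.min(a, b); }
        public double times(double a, double b) { return a + b; }
        public double zero() { return Double.POSITIVE_INFINITY; }
        public double one() { return 0.0; }
    };

    /** Probability semiring (+, *): the total weight sums the path probabilities. */
    static final Semiring PROBABILITY = new Semiring() {
        public double plus(double a, double b) { return a + b; }
        public double times(double a, double b) { return a * b; }
        public double zero() { return 0.0; }
        public double one() { return 1.0; }
    };

    /** Weight of one path: the 'times' product of its arc weights. */
    static double pathWeight(Semiring s, double[] arcWeights) {
        double weight = s.one();
        for (double arc : arcWeights) {
            weight = s.times(weight, arc);
        }
        return weight;
    }

    /** Weight of a set of alternative paths: the 'plus' sum of the path weights. */
    static double totalWeight(Semiring s, double[][] paths) {
        double total = s.zero();
        for (double[] path : paths) {
            total = s.plus(total, pathWeight(s, path));
        }
        return total;
    }
}
```

The same two loops work for either semiring, which is why a toolkit can leave the semiring as a parameter rather than hard-coding one weight interpretation.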

Core Components: Regular Compiler
- Definition and configuration via XML
- Unicode compatible
- Extensible set of about 20 operations
- Scanner definitions vs. general regular expressions
- Biasing the optimization process
- Various ways of handling ambiguities
- Direct database connection for flexible pattern-based transformation of linguistic resources into an optimized FS representation
- Regular expressions over TFSs (SProUT) with restrictions

Core Components: Typed Feature Structure Package
- Java implementation of TFSs
- Efficient unification operations
- Dynamic extension of the type hierarchy
- Other operations: subsumption checking, deep copying, path selection, feature iteration, and various printers
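The following sketch illustrates what unification and subsumption checking over typed feature structures involve. It is a toy model, not the actual package: the class and method names (TfsSketch, Tfs, subsumes, glb, unify) are hypothetical, the type hierarchy is assumed to be a tree (only "morph is a subtype of sign" comes from the slides, and the types noun and det are invented for the example), and coreferences are left out entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical toy model of typed feature structures; not the SProUT TFS package API. */
final class TfsSketch {

    /** Child -> parent links of an assumed tree-shaped type hierarchy rooted at "*top*". */
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("atom", "*top*");
        PARENT.put("*avm*", "*top*");
        PARENT.put("sign", "*avm*");
        PARENT.put("token", "sign");
        PARENT.put("morph", "sign");
        PARENT.put("noun", "atom");  // invented for the example
        PARENT.put("det", "atom");   // invented for the example
    }

    /** True if type a is equal to, or more general than, type b. */
    static boolean subsumes(String a, String b) {
        for (String t = b; t != null; t = PARENT.get(t)) {
            if (t.equals(a)) return true;
        }
        return false;
    }

    /** Greatest lower bound in a tree: the more specific type, or null if incompatible. */
    static String glb(String a, String b) {
        if (subsumes(a, b)) return b;
        if (subsumes(b, a)) return a;
        return null;
    }

    /** A feature structure: a type plus feature -> value pairs (no coreferences). */
    static final class Tfs {
        final String type;
        final Map<String, Tfs> features = new TreeMap<>();

        Tfs(String type) { this.type = type; }

        Tfs with(String feature, Tfs value) {
            features.put(feature, value);
            return this;
        }

        @Override public String toString() {
            return features.isEmpty() ? type : type + features;
        }
    }

    /** Non-destructive unification; returns null if the two structures are incompatible. */
    static Tfs unify(Tfs x, Tfs y) {
        String type = glb(x.type, y.type);
        if (type == null) return null;
        Tfs result = new Tfs(type);
        result.features.putAll(x.features);  // substructures are shared, inputs stay unmodified
        for (Map.Entry<String, Tfs> entry : y.features.entrySet()) {
            Tfs existing = result.features.get(entry.getKey());
            Tfs value = existing == null ? entry.getValue() : unify(existing, entry.getValue());
            if (value == null) return null;  // a feature value clash makes the whole unification fail
            result.features.put(entry.getKey(), value);
        }
        return result;
    }
}
```

Unifying a sign that has POS noun with a morph that has STEM atom yields a morph carrying both features, while two structures whose POS values are noun and det fail to unify.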

XTDL Formalism  Combines typed feature structures (TFS) and regular expressions, including coreferences and functional application XTDL grammar rules – production part on LHS, and output description on RHS TDL used for establishment of a type hierarchy of linguistic entities *top* atom *avm* *rule* tense sign infl index-avm present token morph lang tokentype de en separator url morph := sign & [POS atom, STEM atom, INFL infl]

XTDL Formalism  Couple of standard regular operators: concatenation optionality ?disjunction | Kleene star *Kleene plus + n-fold repetition {n}m-n span repetition {m,n} Unidirectional coreference under Kleene star (and restricted iteration) [POS Det, ...] ([POS Adj, ..., RELN %LIST])* [POS Noun, ...] -> [..., RELN %LIST]

XTDL Formalism
Example rule:
  loc-pp :> morph & [POS Prep & #preposition, INFL [CASE #1, NUMBER #2, GENDER #3]]
            (morph & [POS Determiner, INFL [CASE #1, NUMBER #2, GENDER #3]]) ?
            (morph & [POS Adjective, INFL [CASE #1, NUMBER #2, GENDER #3]]) *
            gazetteer & [TYPE general-location, SURFACE #location]
         -> [CAT location-pp, PREP #preposition, LOCATION #location].

XTDL Interpreter
1. Matching of regular patterns using unifiability (LHS)
2. LHS pattern instance creation
3. Unification of the rule instance and the matched input
- Longest match strategy
- Ambiguities allowed
- The interpreter generates TFSs as output (cascaded architecture)
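The first step and the longest match strategy can be sketched as follows. This is a heavily simplified illustration, not the SProUT interpreter: the names (LongestMatchSketch, Element, Quantifier, unifiable, longestMatch) are hypothetical, input items and pattern elements are flat feature maps instead of typed feature structures, the KIND feature is an invented stand-in for the types morph and gazetteer, and coreferences and output construction are omitted.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of longest-match pattern matching; not the SProUT interpreter. */
final class LongestMatchSketch {

    enum Quantifier { ONE, OPTIONAL, STAR }

    /** One pattern element: flat feature constraints plus a quantifier. */
    static final class Element {
        final Map<String, String> constraints;
        final Quantifier quantifier;

        Element(Map<String, String> constraints, Quantifier quantifier) {
            this.constraints = constraints;
            this.quantifier = quantifier;
        }
    }

    /** Flat stand-in for unifiability: no shared feature carries two different values. */
    static boolean unifiable(Map<String, String> constraints, Map<String, String> item) {
        for (Map.Entry<String, String> c : constraints.entrySet()) {
            String value = item.get(c.getKey());
            if (value != null && !value.equals(c.getValue())) return false;
        }
        return true;
    }

    /**
     * Tries to match pattern elements p.. against the input starting at position pos.
     * Returns the largest end index over all possible matches, or -1 if there is none.
     */
    static int longestMatch(List<Element> pattern, int p,
                            List<Map<String, String>> input, int pos) {
        if (p == pattern.size()) return pos;  // whole pattern consumed: match ends here
        Element element = pattern.get(p);
        int best = -1;
        boolean fits = pos < input.size() && unifiable(element.constraints, input.get(pos));
        if (fits) {
            // Consume one item; a STAR element stays active and may consume more.
            int next = element.quantifier == Quantifier.STAR ? p : p + 1;
            best = Math.max(best, longestMatch(pattern, next, input, pos + 1));
        }
        if (element.quantifier != Quantifier.ONE) {
            // OPTIONAL and STAR elements may also match the empty sequence.
            best = Math.max(best, longestMatch(pattern, p + 1, input, pos));
        }
        return best;
    }
}
```

With a pattern of preposition, optional determiner, any number of adjectives, and a location gazetteer entry, an input modeled on "im sonnigen Rom" is matched up to end index 3. A full interpreter would then instantiate the rule and unify it with the matched items, which this sketch does not attempt.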

XTDL Interpreter  Matched input sequence “im sonnigen Rom” (in sunny Rome)

XTDL Interpreter  Rule with an instantiated pattern on the LHS

XTDL formalism  Unified result

Linguistic Processing Resources
- Tokenization with fine-grained token classification
- Gazetteer (static named-entity lexica)
- Morphology: full-form lexica obtained from 'compactified' MMORPH:
  English: 200,000 entries
  German: 830,000 entries + shallow compound recognition
  French: 225,000 entries
  Spanish: 570,000 entries
  Italian: 330,000 entries
- Asian languages: Chinese (Shanxi), Japanese (ChaSen)
- Other: Czech (HMM-based part-of-speech tagging + morphology)

System Description Language  Construction of a concrete system instance via definition of a regular expression of module specifications  All lingusitic modules must implement a specific JAVA interface  Automatic compilation of system description into a single JAVA class

System Description Language
Sequence, (M1 M2)(input):
    M1.clearState();
    M1.setInput(input);
    M1.setOutput(M1.computeOutput(M1.getInput()));
    M2.clearState();
    M2.setInput(mediateSeq(M1,M2));
    M2.setOutput(M2.computeOutput(M2.getInput()));
    return M2.getOutput();
Iteration, (M*)(input):
    M.clearState();
    M.setInput(input);
    M.setOutput(mediateFix(M));
    return M.getOutput();
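The two compilation schemes above can be pictured as ordinary function composition. The sketch below is an illustration under assumptions, not the generated code: the names SystemDescriptionSketch, ProcessingModule, seq, and star are hypothetical, the state handling (clearState, setInput, setOutput) is dropped, mediateSeq is assumed to simply pass the output of M1 on to M2, and mediateFix is assumed to reapply M until its output stops changing.

```java
/** Hypothetical sketch of module composition; not the code generated by SProUT. */
final class SystemDescriptionSketch {

    /** Stand-in for a processing module: maps input data to output data. */
    interface ProcessingModule<T> {
        T computeOutput(T input);
    }

    /** (M1 M2): feed the output of the first module into the second. */
    static <T> ProcessingModule<T> seq(ProcessingModule<T> first, ProcessingModule<T> second) {
        return input -> second.computeOutput(first.computeOutput(input));
    }

    /** (M*): reapply the module until the output no longer changes (assumed fixpoint semantics). */
    static <T> ProcessingModule<T> star(ProcessingModule<T> module) {
        return input -> {
            T current = input;
            T next = module.computeOutput(current);
            while (!next.equals(current)) {  // loops forever if the module never stabilizes
                current = next;
                next = module.computeOutput(current);
            }
            return current;
        };
    }
}
```

Because seq and star both return a ProcessingModule, they nest freely, for example seq(tokenizer, star(grammar)) with hypothetical tokenizer and grammar modules, which mirrors how a regular expression over modules describes a whole system.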

Future Work  Optimization of grammar interpretation Various search strategiesAdditional linguistic processing resourcesReal data testing: large grammars and real-world texts
