SProUT: Shallow Processing with Unification and Typed Feature Structures
  Jakub Piskorski
  Language Technology Lab, DFKI GmbH
Shallow Text Processing
  [Figure: overview diagram. Input: text documents. Shallow text processing components: tokens, word stems, phrases, clause structure, named entities. Tasks and outputs: document indexing/retrieval, text classification, text mining, information extraction, Q/A systems, automatic database construction, template generation, term association extraction, building ontologies, concept indices and more accurate queries, domain-specific patterns, fine-grained concept matching, semi-structured data. Application areas: executive information systems, multi-agents, e-commerce, data warehousing, workflow management.]
Finite-State Based Approaches
  SPPC: pure finite-state based STP, small number of basic predicates
  SMES: predicates inspect arbitrary properties of the input tokens/fragments
  FASTUS: uses CPSL (Common Pattern Specification Language)
  GATE: uses JAPE (Java Annotation Patterns Engine)
Motivation for SProUT
  One system for multilingual and domain-adaptive shallow text processing
  Trade-off between efficiency and expressiveness
  Modularity
  Flexible integration of different processing modules
  Portability
  Industrial standards
SProUT is joint work by: Markus Becker, Witold Drożdżyński, Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Feiyu Xu
SProUT Architecture
  [Figure: architecture diagram in two parts, "Grammar development environment" and "Online processing". Component labels: lexical resources, XTDL grammar, regular compiler, finite-state machine toolkit, JTFS, extended optimized finite-state network, linguistic processing resources, XTDL interpreter. Data labels: input data, stream of text items, structured output data.]
Core Components: FSM Toolkit
  Finite-state machine toolkit for building, combining, and optimizing finite-state devices
  Finite-state machine models: FSA, WFSA, FST, WFST
  Arbitrary real-valued semirings
  Some new, crucial STP-relevant operations (e.g., incremental construction of minimal deterministic FSAs)
  Functionality similar to the AT&T tools
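To make "arbitrary real-valued semirings" concrete, here is a minimal Java sketch of a weighted acceptor (WFSA) whose string weight is computed under an interchangeable semiring. It is not the toolkit's API: the class SemiringSketch, its members, and the example automaton are all hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

/** Illustrative only: all names here are hypothetical, not part of the SProUT toolkit. */
final class SemiringSketch {
    /** A semiring over doubles: plus merges alternative paths, times extends a path. */
    static final class Semiring {
        final double zero;
        final double one;
        final BinaryOperator<Double> plus;
        final BinaryOperator<Double> times;
        Semiring(double zero, double one, BinaryOperator<Double> plus, BinaryOperator<Double> times) {
            this.zero = zero;
            this.one = one;
            this.plus = plus;
            this.times = times;
        }
    }

    /** Tropical semiring (min, +): the weight of a string is the cost of its cheapest path. */
    static final Semiring TROPICAL =
        new Semiring(Double.POSITIVE_INFINITY, 0.0, (a, b) -> Math.min(a, b), (a, b) -> a + b);

    /** Real semiring (+, *): the weight of a string is the sum over paths of the arc-weight products. */
    static final Semiring REAL =
        new Semiring(0.0, 1.0, (a, b) -> a + b, (a, b) -> a * b);

    static final class Arc {
        final int from;
        final char label;
        final int to;
        final double weight;
        Arc(int from, char label, int to, double weight) {
            this.from = from;
            this.label = label;
            this.to = to;
            this.weight = weight;
        }
    }

    /** Forward pass over a possibly nondeterministic WFSA; returns k.zero if s is not accepted. */
    static double weightOf(String s, List<Arc> arcs, int start, Map<Integer, Double> finals, Semiring k) {
        Map<Integer, Double> current = new HashMap<>();
        current.put(start, k.one);
        for (char c : s.toCharArray()) {
            Map<Integer, Double> next = new HashMap<>();
            for (Arc a : arcs) {
                Double w = current.get(a.from);
                if (w != null && a.label == c) {
                    double extended = k.times.apply(w, a.weight);
                    next.merge(a.to, extended, k.plus); // combine paths reaching the same state
                }
            }
            current = next;
        }
        double total = k.zero;
        for (Map.Entry<Integer, Double> e : current.entrySet()) {
            Double finalWeight = finals.get(e.getKey());
            if (finalWeight != null) {
                total = k.plus.apply(total, k.times.apply(e.getValue(), finalWeight));
            }
        }
        return total;
    }

    /** Hypothetical automaton with two paths for the string "ab". */
    static List<Arc> demoArcs() {
        return List.of(
            new Arc(0, 'a', 1, 0.5), new Arc(0, 'a', 2, 0.25),
            new Arc(1, 'b', 3, 0.5), new Arc(2, 'b', 3, 1.0));
    }

    public static void main(String[] args) {
        System.out.println(weightOf("ab", demoArcs(), 0, Map.of(3, 0.0), TROPICAL));
        System.out.println(weightOf("ab", demoArcs(), 0, Map.of(3, 1.0), REAL));
    }
}
```

The same traversal code serves both semirings; only the two operations and their neutral elements change, which is the design point behind parameterizing a finite-state toolkit by semiring.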
Core Components: Regular Compiler
  Definition and configuration via XML
  Unicode compatible
  Extensible set of circa 20 operations
  Scanner definitions vs. general regular expressions
  Biasing of the optimization process
  Various ways of handling ambiguities
  Direct database connection for flexible pattern-based transformation of linguistic resources into an optimized FS representation
  Regular expressions over TFSs (SProUT) with restrictions
Core Components: Typed Feature Structure Package
  Java implementation of TFSs
  Efficient unification operations
  Dynamic extension of the type hierarchy
  Other operations: subsumption checking, deep copying, path selection, feature iteration, and various printers
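The two central operations named above, unification and subsumption checking, can be sketched in a few lines of Java. This is a deliberately reduced illustration, not the package's implementation: it assumes a tree-shaped type hierarchy and has no coreferences, and the class TfsSketch, its methods, and the atom types noun and rom are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: a reduced TFS model without coreferences; all names are hypothetical. */
final class TfsSketch {
    /** Assumed tree-shaped hierarchy: each type maps to its single parent, "*top*" is the root. */
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("atom", "*top*");
        PARENT.put("sign", "*top*");
        PARENT.put("morph", "sign");
        PARENT.put("token", "sign");
        PARENT.put("noun", "atom");
        PARENT.put("rom", "atom");
    }

    /** True if general equals specific or is one of its ancestors. */
    static boolean subsumesType(String general, String specific) {
        for (String t = specific; t != null; t = PARENT.get(t)) {
            if (t.equals(general)) return true;
        }
        return false;
    }

    /** Greatest lower bound in a tree: the more specific of two comparable types, else null. */
    static String glb(String a, String b) {
        if (subsumesType(a, b)) return b;
        if (subsumesType(b, a)) return a;
        return null;
    }

    static final class Fs {
        final String type;
        final Map<String, Fs> features = new TreeMap<>();
        Fs(String type) { this.type = type; }
        Fs with(String feature, Fs value) { features.put(feature, value); return this; }
        @Override public String toString() {
            return features.isEmpty() ? type : type + features;
        }
    }

    /** Returns a new structure carrying the information of both inputs, or null on a clash. */
    static Fs unify(Fs a, Fs b) {
        String type = glb(a.type, b.type);
        if (type == null) return null;
        Fs result = new Fs(type);
        result.features.putAll(a.features);
        for (Map.Entry<String, Fs> e : b.features.entrySet()) {
            Fs existing = result.features.get(e.getKey());
            Fs merged = existing == null ? e.getValue() : unify(existing, e.getValue());
            if (merged == null) return null;
            result.features.put(e.getKey(), merged);
        }
        return result;
    }

    /** True if every type and feature constraint of general is satisfied by specific. */
    static boolean subsumes(Fs general, Fs specific) {
        if (!subsumesType(general.type, specific.type)) return false;
        for (Map.Entry<String, Fs> e : general.features.entrySet()) {
            Fs other = specific.features.get(e.getKey());
            if (other == null || !subsumes(e.getValue(), other)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Fs pattern = new Fs("sign").with("POS", new Fs("noun"));
        Fs input = new Fs("morph").with("STEM", new Fs("rom"));
        Fs unified = unify(pattern, input);
        System.out.println(unified);
        System.out.println(subsumes(pattern, unified));
    }
}
```

A production implementation would additionally handle coreferences (structure sharing) and a hierarchy with multiple inheritance, both of which this sketch leaves out.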
XTDL Formalism
  Combines typed feature structures (TFSs) and regular expressions, including coreferences and functional application
  XTDL grammar rules: production part on the LHS, output description on the RHS
  TDL is used to establish a type hierarchy of linguistic entities
  [Figure: type hierarchy rooted at *top*, containing the types atom, *avm*, *rule*, tense, sign, infl, index-avm, present, token, morph, lang, tokentype, de, en, separator, url]
  Example type definition:
    morph := sign & [POS atom, STEM atom, INFL infl]
XTDL Formalism
  A couple of standard regular operators:
    concatenation
    optionality ?
    disjunction |
    Kleene star *
    Kleene plus +
    n-fold repetition {n}
    m-n span repetition {m,n}
  Unidirectional coreference under Kleene star (and restricted iteration):
    [POS Det, ...] ([POS Adj, ..., RELN %LIST])* [POS Noun, ...] -> [..., RELN %LIST]
XTDL Formalism
  loc-pp :>
      morph & [POS Prep & #preposition, INFL [CASE #1, NUMBER #2, GENDER #3]]
      morph & [POS Determiner, INFL [CASE #1, NUMBER #2, GENDER #3]] ?
      morph & [POS Adjective, INFL [CASE #1, NUMBER #2, GENDER #3]] *
      gazetteer & [TYPE general-location, SURFACE #location]
      -> [CAT location-pp, PREP #preposition, LOCATION #location].
XTDL Interpreter
  1. Matching of regular patterns using unifiability (LHS)
  2. LHS pattern instance creation
  3. Unification of the rule instance and the matched input
  Longest-match strategy
  Ambiguities allowed
  Interpreter generates TFSs as output (cascaded architecture)
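The first step and the longest-match strategy can be illustrated with a small Java sketch. It is a strong simplification and not the interpreter's actual algorithm: feature structures are flattened to attribute maps, the type is simulated by a hypothetical KIND attribute, agreement coreferences are omitted, and matching uses plain backtracking rather than a compiled finite-state network. The class LongestMatchSketch and the extra input token "Urlaub" are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: flat attribute maps stand in for TFSs; all names are hypothetical. */
final class LongestMatchSketch {
    enum Quant { ONE, OPTIONAL, STAR }

    static final class Element {
        final Map<String, String> constraints;
        final Quant quant;
        Element(Map<String, String> constraints, Quant quant) {
            this.constraints = constraints;
            this.quant = quant;
        }
    }

    /** Flat stand-in for unifiability: no attribute may carry two different values. */
    static boolean unifiable(Map<String, String> pattern, Map<String, String> item) {
        for (Map.Entry<String, String> e : pattern.entrySet()) {
            String value = item.get(e.getKey());
            if (value != null && !value.equals(e.getValue())) return false;
        }
        return true;
    }

    /** Largest end index reachable by matching pattern[p..] against input[i..], or -1. */
    static int longestEnd(List<Element> pattern, int p, List<Map<String, String>> input, int i) {
        if (p == pattern.size()) return i;
        Element e = pattern.get(p);
        int best = -1;
        if (i < input.size() && unifiable(e.constraints, input.get(i))) {
            // Consume one item; a starred element may match again at the next position.
            int nextElement = e.quant == Quant.STAR ? p : p + 1;
            best = longestEnd(pattern, nextElement, input, i + 1);
        }
        if (e.quant != Quant.ONE) {
            // Optional and starred elements may also be skipped without consuming input.
            best = Math.max(best, longestEnd(pattern, p + 1, input, i));
        }
        return best;
    }

    /** Scans left to right, keeping the longest match at each position as {start, end}. */
    static List<int[]> scan(List<Element> pattern, List<Map<String, String>> input) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < input.size()) {
            int end = longestEnd(pattern, 0, input, i);
            if (end > i) {
                spans.add(new int[] {i, end});
                i = end;
            } else {
                i++;
            }
        }
        return spans;
    }

    /** Simplified version of the loc-pp pattern: Prep, Determiner?, Adjective*, location. */
    static List<Element> demoPattern() {
        return List.of(
            new Element(Map.of("KIND", "morph", "POS", "Prep"), Quant.ONE),
            new Element(Map.of("KIND", "morph", "POS", "Determiner"), Quant.OPTIONAL),
            new Element(Map.of("KIND", "morph", "POS", "Adjective"), Quant.STAR),
            new Element(Map.of("KIND", "gazetteer", "TYPE", "general-location"), Quant.ONE));
    }

    /** "Urlaub im sonnigen Rom"; the first token is added only to show the scan skipping it. */
    static List<Map<String, String>> demoInput() {
        return List.of(
            Map.of("KIND", "morph", "POS", "Noun", "SURFACE", "Urlaub"),
            Map.of("KIND", "morph", "POS", "Prep", "SURFACE", "im"),
            Map.of("KIND", "morph", "POS", "Adjective", "SURFACE", "sonnigen"),
            Map.of("KIND", "gazetteer", "TYPE", "general-location", "SURFACE", "Rom"));
    }

    public static void main(String[] args) {
        for (int[] span : scan(demoPattern(), demoInput())) {
            System.out.println("match from item " + span[0] + " up to (excluding) item " + span[1]);
        }
    }
}
```

Steps 2 and 3 (instantiating the rule and unifying it with the matched input to bind coreferences such as #preposition and #location) are not covered here; they would build on a unification routine over real feature structures.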
XTDL Interpreter
  Matched input sequence: “im sonnigen Rom” (in sunny Rome)
XTDL Interpreter
  Rule with an instantiated pattern on the LHS
XTDL Interpreter
  Unified result
Linguistic Processing Resources
  Tokenization with fine-grained token classification
  Gazetteer (static named-entity lexica)
  Morphology
    Full-form lexica obtained from ‘compactified’ MMORPH:
      English: 200,000 entries
      German: 830,000 entries + shallow compound recognition
      French: 225,000 entries
      Spanish: 570,000 entries
      Italian: 330,000 entries
    Asian languages:
      Chinese: Shanxi
      Japanese: ChaSen
    Other:
      Czech: HMM-based part-of-speech tagging + morphology
System Description Language
  A concrete system instance is constructed by defining a regular expression over module specifications
  All linguistic modules must implement a specific Java interface
  The system description is compiled automatically into a single Java class
System Description Language
  (M1 M2)(input)
    M1.clearState();
    M1.setInput(input);
    M1.setOutput(M1.computeOutput(M1.getInput()));
    M2.clearState();
    M2.setInput(mediateSeq(M1,M2));
    M2.setOutput(M2.computeOutput(M2.getInput()));
    return M2.getOutput();
  (M*)(input)
    M.clearState();
    M.setInput(input);
    M.setOutput(mediateFix(M));
    return M.getOutput();
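The two expansions above can be turned into a runnable Java sketch. The slides do not define mediateSeq or mediateFix, so their behavior here is an assumption: mediateSeq is taken to hand M1's output to M2 unchanged, and mediateFix to re-apply M until its output stops changing. The Module base class, the two example modules, and the class SystemDescriptionSketch are hypothetical; only the method names come from the slide.

```java
import java.util.Locale;
import java.util.Objects;

/** Illustrative only: the Module class and the mediator semantics are assumptions. */
final class SystemDescriptionSketch {
    /** Hypothetical module base class exposing the method names used on the slide. */
    abstract static class Module<T> {
        private T input;
        private T output;
        void clearState() { input = null; output = null; }
        void setInput(T in) { input = in; }
        T getInput() { return input; }
        void setOutput(T out) { output = out; }
        T getOutput() { return output; }
        abstract T computeOutput(T in);
    }

    /** Assumed meaning of mediateSeq: pass M1's output to M2 unchanged. */
    static <T> T mediateSeq(Module<T> m1, Module<T> m2) {
        return m1.getOutput();
    }

    /** Assumed meaning of mediateFix: apply M repeatedly until a fixpoint is reached. */
    static <T> T mediateFix(Module<T> m) {
        T current = m.getInput();
        while (true) {
            T next = m.computeOutput(current);
            if (Objects.equals(next, current)) return next;
            current = next;
        }
    }

    /** Expansion of (M1 M2)(input), following the slide line by line. */
    static <T> T seq(Module<T> m1, Module<T> m2, T input) {
        m1.clearState();
        m1.setInput(input);
        m1.setOutput(m1.computeOutput(m1.getInput()));
        m2.clearState();
        m2.setInput(mediateSeq(m1, m2));
        m2.setOutput(m2.computeOutput(m2.getInput()));
        return m2.getOutput();
    }

    /** Expansion of (M*)(input), following the slide line by line. */
    static <T> T star(Module<T> m, T input) {
        m.clearState();
        m.setInput(input);
        m.setOutput(mediateFix(m));
        return m.getOutput();
    }

    /** Example module: lower-cases its input. */
    static Module<String> lowerCase() {
        return new Module<String>() {
            @Override String computeOutput(String in) { return in.toLowerCase(Locale.ROOT); }
        };
    }

    /** Example module: one pass that replaces each pair of spaces by a single space. */
    static Module<String> collapseSpaces() {
        return new Module<String>() {
            @Override String computeOutput(String in) { return in.replace("  ", " "); }
        };
    }

    public static void main(String[] args) {
        System.out.println(seq(lowerCase(), collapseSpaces(), "A  B"));
        System.out.println(star(collapseSpaces(), "a    b"));
    }
}
```

The fixpoint loop terminates only if the module eventually reproduces its input; a real system would presumably guard against modules that never converge.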
Future Work
  Optimization of grammar interpretation
  Various search strategies
  Additional linguistic processing resources
  Real data testing: large grammars and real-world texts