The CHAOS Project: Theory and Practice

Fabio Massimo Zanzotto, Department of Computer Science, Systems and Production, University of Roma “Tor Vergata”

People • INVESTIGATORS • Roberto Basili • Fabio Massimo Zanzotto • Maria Teresa Pazienza • FORMER CONTRIBUTORS • Daniele Pighin • Daniele Previtali • Alessandro Bahgat • Marco Pennacchiotti • Massimo Di Nanni • Michele Vindigni • Luigi Mazzucchelli • Paola Velardi • Paolo Zirilli • Alessandro Cucchiarelli • Alessandro Marziali • Fabrizio Grisoli • Gianluca De Rossi

Outline • Theory: Customizable parsing architectures • XDG: eXtended Dependency Graph • Task oriented parsing design • Practice: System Implementation and Use • A component-based approach • An object-oriented platform • Linguistic data • Processing modules • How to use the parser in an application • Demo!!!

Theory: Customizable parsing architectures

Motivation • The CHAOS Project unofficially began in ’96 • … building on the long tradition of ARIOSTO (Basili, Pazienza, Velardi) @ the University of Rome “Tor Vergata” (RTV) • Aim • building robust parsers for Italian and for English • that use verb sub-categorization (syntactic) lexicons induced from corpora • that can be used in applications • Constraints • use the long tradition @ RTV • “Social” background • Microtheories for microphenomena • Language analysis can be reduced to a cascade of modules (e.g., FSA) • Application-oriented language analysis (e.g., IE) • Robust (formerly, shallow) parsing approaches

Motivation • Example sentence: [Mr. Gaubert] [contributed] [real estate] [valued] [at $ 25 million] [to the assets] [of Independent American] • Lexicalized verb frames: contribute-NP-PP(to), value-NP-PP(at) • Clause labels in the figure: Inf(S1), Inf(S2)

Motivation • Different NLP applications have different performance constraints in terms of: • Accuracy • Throughput • Customizable parsing architectures are reusable in different application scenarios if the architectural design supports performance control

Customizable parsing architectures • Modularization • clarifies the interdependency between different kinds of syntactic information (grammatical/lexicalized) • allows control of • throughput, by selecting which modules to run • quality, via a clear relation between modules (prerequisites/contributions)

Modular approach • Syntactic parser: $SP(S, K) = I$, written $SP(S) = I$ when $K$ is left implicit • Syntactic parsing module: $P_i(S_i, K_i) = S_{i+1}$, written $P_i(S_i) = S_{i+1}$ • Modular syntactic parser: $SP = P_n \circ \dots \circ P_2 \circ P_1$
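The composition of parsing modules can be illustrated with a minimal sketch. It assumes that a sentence representation is a plain dictionary and that each module is a function from representation to representation. The two modules shown (`tokenize`, `count_tokens`) are hypothetical stand-ins, not CHAOS components.

```python
from functools import reduce
from typing import Callable, Dict, List

# Assumption: a sentence representation S_i is modeled as a plain dict.
Representation = Dict[str, object]
Module = Callable[[Representation], Representation]


def compose(modules: List[Module]) -> Module:
    """Build SP = P_n o ... o P_2 o P_1; modules are given in application order."""
    def parser(s: Representation) -> Representation:
        # Feed the output S_{i+1} of each module into the next one.
        return reduce(lambda rep, module: module(rep), modules, s)
    return parser


def tokenize(rep: Representation) -> Representation:
    # Hypothetical P_1: whitespace tokenization.
    return {**rep, "tokens": rep["text"].split()}


def count_tokens(rep: Representation) -> Representation:
    # Hypothetical P_2: relies on the contribution of P_1 (its prerequisite).
    return {**rep, "n_tokens": len(rep["tokens"])}


sp = compose([tokenize, count_tokens])
result = sp({"text": "Mr. Gaubert contributed real estate"})
print(result["tokens"], result["n_tokens"])
```

Dropping or reordering entries in the list passed to `compose` is the point at which throughput and quality would be traded off in such a design.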

Modular approach • To push a modular approach we need: • a suitable annotation scheme • a classification of the processing modules

A suitable annotation scheme • Requirements: • Modularization • a stable representation of partially analyzed structures • Lexicalization • a clear representation of the (semantic) head of a given structure able to activate the lexicalized rule

XDG: Extended Dependency Graph • XDG combines constituency and dependency based formalisms • $XDG_{\Gamma\Delta} = (C, D)$ • $C = \{(c, t, h) \mid c \subseteq S,\ t \in \Gamma,\ h \in c\}$ • $D = \{(c_1, c_2, t) \mid c_1, c_2 \in C,\ t \in \Delta\}$ • Nice property: allows persistent ambiguity to be stored (for interpretations projected by the same nodes)

XDG: Extended Dependency Graph • C are constituents • syntactic head • potential semantic governor • D are dependencies among constituents
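One way to picture the pair (C, D) is the following sketch. The class and field names are illustrative assumptions, not the CHAOS object model, and the `VP_PP` label is borrowed from the target phenomena named elsewhere in these slides. The example stores two competing attachments for the same prepositional chunk, which is the kind of persistent ambiguity the graph is meant to hold.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Constituent:
    tokens: Tuple[str, ...]  # c: the words covered by the constituent
    ctype: str               # t: constituent type, e.g. "NPK"
    head: str                # h: head word, required to be one of the tokens

    def __post_init__(self):
        if self.head not in self.tokens:
            raise ValueError("the head must belong to the constituent")


@dataclass(frozen=True)
class Dependency:
    governor: Constituent
    dependent: Constituent
    dtype: str               # dependency type


@dataclass
class XDG:
    constituents: List[Constituent] = field(default_factory=list)
    dependencies: List[Dependency] = field(default_factory=list)

    def add_dependency(self, governor, dependent, dtype):
        # Dependencies may only link constituents already in C.
        if governor not in self.constituents or dependent not in self.constituents:
            raise ValueError("both ends must be constituents of the graph")
        self.dependencies.append(Dependency(governor, dependent, dtype))

    def alternatives(self, dependent):
        """All stored attachments of one constituent (persistent ambiguity)."""
        return [d for d in self.dependencies if d.dependent == dependent]


contributed = Constituent(("contributed",), "VPK", "contributed")
valued = Constituent(("valued",), "VPK", "valued")
to_assets = Constituent(("to", "the", "assets"), "PPK", "assets")

xdg = XDG([contributed, valued, to_assets])
xdg.add_dependency(contributed, to_assets, "VP_PP")
xdg.add_dependency(valued, to_assets, "VP_PP")  # competing attachment, kept as well
print(len(xdg.alternatives(to_assets)))
```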

Classification of parsing modules • $P_i(XDG_i, K_i) = P_i(XDG_i) = XDG_{i+1}$ • The classification is performed according to: • the type of information K used • how they manipulate the sentence representation

Task-oriented parsing design • Given: • The NLP application requirements R • The test-bed T • A pool of parsing modules PM • The design activity is: • The search for a combination of the parsing modules PM that fits R on T
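The design activity can be read as a search over module combinations. The sketch below is only one possible reading: it assumes that the pool order is the processing order, that shorter chains are preferred, and that an `evaluate` function returns scores measured on the test-bed. The numbers in `TOY_GAIN` are made-up placeholders chosen for illustration, not measurements of any CHAOS module.

```python
from itertools import combinations


def design(pool, evaluate, meets_requirements):
    """Return the first (shortest) chain of modules whose scores satisfy R."""
    for size in range(1, len(pool) + 1):
        # combinations() preserves the order of the pool inside each chain.
        for chain in combinations(pool, size):
            scores = evaluate(chain)
            if meets_requirements(scores):
                return chain, scores
    return None  # no combination of the pool fits the requirements


# Placeholder contributions (integer percentage points), purely illustrative.
TOY_GAIN = {"chunker": 50, "verb_shallow_analyzer": 30, "shallow_analyzer": 10}


def evaluate(chain):
    # Stand-in for running the chain on the test-bed T and scoring it.
    return {"f_measure": sum(TOY_GAIN[m] for m in chain), "modules": len(chain)}


def meets_requirements(scores):
    # Stand-in for the application requirements R.
    return scores["f_measure"] >= 75


chosen = design(list(TOY_GAIN), evaluate, meets_requirements)
print(chosen)
```

An exhaustive search like this grows exponentially with the pool size, so it is only reasonable for a small pool of modules.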

NLP application requirements • Target phenomena: e.g., VP_PP, NP_PP, etc. • Metrics: • Recall R per sentence • Precision P per sentence • F-measure per sentence
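The per-sentence metrics can be computed over sets of dependencies, as in the sketch below. It assumes that a dependency is a (governor, dependent, type) triple, that the F-measure is the balanced F1, and that the gold and predicted sets are hand-written for illustration. The `VP_NP` label is a hypothetical addition next to the `VP_PP` and `NP_PP` phenomena named above.

```python
def precision_recall_f(gold, predicted):
    """Precision, recall and balanced F-measure for one sentence."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative annotations for the example sentence, not real system output.
gold = {
    ("contributed", "real estate", "VP_NP"),
    ("contributed", "to the assets", "VP_PP"),
    ("valued", "at $ 25 million", "VP_PP"),
    ("to the assets", "of Independent American", "NP_PP"),
}
predicted = {
    ("contributed", "real estate", "VP_NP"),
    ("valued", "to the assets", "VP_PP"),   # wrong attachment
    ("valued", "at $ 25 million", "VP_PP"),
}

p, r, f = precision_recall_f(gold, predicted)
print(round(p, 3), round(r, 3), round(f, 3))
```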

CHAOS: Levels of Analysis • Dependencies • Clauses • Chunks: NPK VPK PPK NPK VPK • POS: NNS TO VB IN NNS PRP MD VB • Example sentence: Strategies to use with questions you cannot answer

Verb dependencies and Clause Boundaries • Example sentence: [Mr. Gaubert] [contributed] [real estate] [valued] [at $ 25 million] [to the assets] [of Independent American] • Lexicalized verb frames: contribute-NP-PP(to), value-NP-PP(at) • Clause labels in the figure: Inf(S1), Inf(S2)

Verb dependencies and Clause Boundaries • The algorithm: • Initial hypotheses: • Minimal boundaries of the clauses in the sentence • Derived hierarchy • Until all verbs have been analyzed: • Take the rightmost verb v not yet analyzed: • Take the lexicalized rules R(v) for the verb v • Find the dependencies of • Augment the clause boundaries
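The loop above can be sketched on the Mr. Gaubert example. Several details are assumptions made for illustration: only chunks to the right of a verb are considered, the lexicalized rules R(v) are reduced to a list of (chunk type, preposition) slots, the scan stops at the first chunk that fills no open slot, and the clause of an already analyzed verb is skipped as a block. The actual CHAOS algorithm may differ on each of these points.

```python
from typing import NamedTuple, Optional


class Chunk(NamedTuple):
    ctype: str                 # "NPK", "VPK" or "PPK"
    text: str
    key: Optional[str] = None  # verb lemma for VPK, preposition for PPK


def analyze_verbs(chunks, frames):
    # Initial hypothesis: each clause covers only its own verb chunk.
    clauses = {i: (i, i) for i, c in enumerate(chunks) if c.ctype == "VPK"}
    dependencies = []
    for v in sorted(clauses, reverse=True):        # rightmost verb first
        open_slots = list(frames.get(chunks[v].key, []))
        j = v + 1
        while j < len(chunks) and open_slots:
            if j in clauses:
                # A verb further right is already analyzed: skip its clause.
                j = clauses[j][1] + 1
                continue
            slot = (chunks[j].ctype, chunks[j].key)
            if slot not in open_slots:
                break                              # assumption: stop at first mismatch
            open_slots.remove(slot)
            dependencies.append((v, j))            # verb v governs chunk j
            clauses[v] = (v, j)                    # augment the clause boundary
            j += 1
    return dependencies, clauses


chunks = [
    Chunk("NPK", "Mr. Gaubert"),
    Chunk("VPK", "contributed", "contribute"),
    Chunk("NPK", "real estate"),
    Chunk("VPK", "valued", "value"),
    Chunk("PPK", "at $ 25 million", "at"),
    Chunk("PPK", "to the assets", "to"),
    Chunk("PPK", "of Independent American", "of"),
]
# contribute-NP-PP(to) and value-NP-PP(at), encoded as slot lists.
frames = {
    "contribute": [("NPK", None), ("PPK", "to")],
    "value": [("NPK", None), ("PPK", "at")],
}

dependencies, clauses = analyze_verbs(chunks, frames)
print(dependencies, clauses)
```

In this run the clause of "valued" is closed first, so "to the assets" is still available when "contributed" is analyzed, and the clause of "contributed" ends up enclosing the clause of "valued".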

Practice: System Implementation and Use

A Computational Framework • Object-oriented backbone • Objects for the different data • Objects for the different sub-processes • Linguistic sub-processors as libraries • Coexisting languages: Java, C++, C, Prolog

System implementation • A component-based approach • An object-oriented platform • Linguistic data • Textual entities: Text, Paragraphs • XDG • Linguistic processors

A Component-based Approach Advantages: • Computational efficiency • Rapid prototyping • Integration of different technologies • Easy reuse

Linguistic processors • Tokenizer, Complex Tokenizer • Dictionary lookup modules • Yellow page look-up • Morphology analyzer • Named Entity Recognition • Part-of-speech tagging • Chunker • Verb shallow analyzer • Shallow analyzer

Linguistic modules • Each process is encapsulated in an object • initialize() • Load lexicons and rules (general or domain specific) • finalize() • Release the rules and lexicons of the process • run() • Enrich the input with the contributions of the process
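The three-method life cycle can be sketched as an abstract base class. This is an illustration in Python, not the CHAOS class hierarchy: the class names, the dictionary-based input and the tiny in-memory lexicon are all assumptions.

```python
from abc import ABC, abstractmethod


class LinguisticModule(ABC):
    """Life cycle of a processing module: initialize, run (repeatedly), finalize."""

    def initialize(self):
        """Load lexicons and rules."""

    @abstractmethod
    def run(self, rep):
        """Enrich the input representation and return it."""

    def finalize(self):
        """Release lexicons and rules."""


class DictionaryLookup(LinguisticModule):
    # Hypothetical module: maps word forms to lemmas.
    def __init__(self, source):
        self._source = source
        self.lexicon = None

    def initialize(self):
        self.lexicon = dict(self._source)

    def run(self, rep):
        if self.lexicon is None:
            raise RuntimeError("initialize() must be called before run()")
        # Unknown forms are left unchanged.
        rep["lemmas"] = [self.lexicon.get(t, t) for t in rep["tokens"]]
        return rep

    def finalize(self):
        self.lexicon = None


module = DictionaryLookup({"compra": "comprare", "comprano": "comprare"})
module.initialize()
try:
    output = module.run({"tokens": ["compra", "pane"]})
finally:
    module.finalize()
print(output["lemmas"])
```

Wrapping `run()` in `try`/`finally` keeps the resources released even when a sentence fails, which matters when modules hold large lexicons.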

Linguistic processors: Microtheories for microphenomena • Each processor implements its own theory: • It has its own language for describing rules • It is written in its own programming language

Processor: Yellow page look-up, Morphology analyzer
Dictionary
compra     comprare  d(a)  v.tran.sempl  2.sing.imper.pres    ~:u:~
compra     comprare  d(a)  v.tran.sempl  3.sing.ind.pres      ~:u:~
comprai    comprare  d(a)  v.tran.sempl  1.sing.ind.pass_rem  ~:u:~
comprammo  comprare  d(a)  v.tran.sempl  1.plur.ind.pass_rem  ~:u:~
compran    comprare  d(a)  v.tran.sempl  3.plur.ind.pres      ~:u:~
comprando  comprare  d(a)  v.tran.sempl  geru.pres            ~:u:~
comprano   comprare  d(a)  v.tran.sempl  3.plur.ind.pres      ~:u:~
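A reader for entries in this layout could look like the sketch below. The field names are guesses based on the visible columns. The third and sixth columns (`d(a)` and `~:u:~`) are kept as opaque strings because the slides do not explain them, and the whitespace-separated six-column layout is itself an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    form: str      # inflected word form
    lemma: str
    flag: str      # third column, meaning not documented here
    category: str  # e.g. "v.tran.sempl"
    features: str  # e.g. "3.sing.ind.pres"
    trailer: str   # sixth column, meaning not documented here


def parse_line(line):
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    return Entry(*fields)


def build_index(lines):
    """Map each word form to all of its analyses."""
    index = defaultdict(list)
    for line in lines:
        if line.strip():
            entry = parse_line(line)
            index[entry.form].append(entry)
    return index


sample = [
    "compra comprare d(a) v.tran.sempl 2.sing.imper.pres ~:u:~",
    "compra comprare d(a) v.tran.sempl 3.sing.ind.pres ~:u:~",
    "comprando comprare d(a) v.tran.sempl geru.pres ~:u:~",
]
index = build_index(sample)
print([e.features for e in index["compra"]])
```

The form "compra" comes back with two analyses, so the lookup has to return a list rather than a single entry.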

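Processor: Chunker
Rules
…
% Judging by the predicate names: a finite verb group of three constituents,
% a finite form of "to have", the past participle of "to be",
% and a further past participle.
constituent_class([_cst1, _cst2, _cst3], 'VerFin', _mor, 1, 3) :-
    verb_finite(_cst1), verb_to_have(_cst1),
    verb_past_particle(_cst2), verb_to_be(_cst2),
    verb_past_particle(_cst3),
    common_morfology(_cst1, _mor).
…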

Processor: Verb Shallow Analyser
Sub-categorization lexicon
…
pattern(comprare, [[(oggetto,Post),(per,Post)], [(oggetto,Post),(da,Post),(per,Post)], [(oggetto,Post),(a,Post),(per,Post)], [(oggetto,Post)]]).
pattern(comprendere, [[(oggetto,Post)], [], [(oggetto,Post)]]).
pattern(comprimere, [[(oggetto,Post)], [(oggetto,Post)]]).
pattern(compromettere, [[(con,Post)], [(oggetto,Post)]]).
pattern(comunicare, [[], [(con,Post)], [(a,Post)], [(oggetto,Post),(a,Post)], [(oggetto,Post)]]).
…

Implemented Italian Shallow Grammar • Constituent Categories • Part-of-Speech Tags • Chunk Types • Dependency Categories • Dependency Categories over Chunk Types

A survival user guide • Stand-alone version: • chaosparser -h • Client-server version: • chaosserver -h • chaosclient -h • XDG editor and current GUI: • chaosgui

Using CHAOS in applications
• In Java applications:
    ConfigurationHandler.initialize();
    ConfigurationHandler.parseKBPropFile("LANGUAGE", "KB");
    Parser ms = new Parser();
    ms.initialize();
• In non-Java applications: • Using one of the possible output forms: • XDG in XML • XDG in Prolog • XDG in QLF (in Prolog)

Perspective • Building a statistical Italian parser • Increasing the Italian annotated corpora • Reusing existing corpora • TUT • SITAL • VIT

Tools • XDG editor • DEMO!!!! • Syntactic annotation transformer
