Logic Programming for Natural Language Processing • Menyoung Lee, TJHSST Computer Systems Lab • Mentor: Matt Parker, Analytic Services, Inc.
Purpose • To link together • Recent developments in natural language processing (NLP): Information Extraction (IE) • Classical logic programming: Prolog • New paradigm: a bifurcated process • An IE application that produces structured output from a corpus of free, unstructured text • Transformation of the extracted information into a Prolog knowledge base (sets of fact triples) • Documents: biographies
Why NLP? • Language is the cornerstone of intelligence • The Turing Test: the ability to converse like a human • Understanding and generating texts in a natural language, e.g. English • Many specific NLP tasks • Chatterbots, e.g. Eliza • Machine Translation • Information Retrieval (IR), e.g. Google • Information Extraction!! • Sci-fi dreams: universal translation, computers you can talk to, etc.
Information Extraction (IE) • Most generally, the transformation of • Information contained in free, unstructured text in a natural language into • A prescribed, structured format • More specifically, the identification of • Instances of certain object classes • Their attributes • Relationships between object instances • Always restricted to a particular domain • In order to keep the ontology reasonably sized yet sufficiently expressive
Why IE? • An expert must read many documents • Advent of the Internet & Information Age • Explosion in the sheer volume of textual information readily available in electronic form • New opportunity: lots and lots of available information to exploit • Formidable challenge: impossible for an expert to read and analyze that much text • A pragmatic approach: • Full text understanding is out of reach • Automate just some of the tasks, i.e. the identification of objects, attributes, and relations
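IE - Details • Five Tasks in IE • Named Entity Recognition (NE) • Coreference Resolution (CO) • Template Element Construction (TE) • Template Relation Construction (TR) • Scenario Template Production (ST) • Metrics for Evaluation • Precision: the fraction of produced annotations that are correct • Recall: the fraction of correct annotations that were actually produced • F-measure (borrowed from IR): a weighted harmonic mean of precision and recall, with weight β • More intuitive reformulation: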
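The formulas for these metrics are not part of the slide text. As a reference point, the standard definitions can be written as below. This assumes the usual IR conventions, with TP, FP and FN counting correct, spurious and missed annotations; the slide's own notation, and whatever "more intuitive reformulation" it showed, may differ. The last line is only one common equivalent form of the F-measure.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R}

% Equivalent form: 1/F is a weighted average of 1/P and 1/R.
\frac{1}{F_\beta} = \frac{1}{1 + \beta^2} \cdot \frac{1}{P}
                  + \frac{\beta^2}{1 + \beta^2} \cdot \frac{1}{R}
```

With β = 1 the two weights are equal and F reduces to the plain harmonic mean 2PR / (P + R).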
Annotations • Annotations identify objects in text • Annotation graph: a directed acyclic graph (DAG) • Nodes: positions in the text • Edges: the literal text and the annotations
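A minimal sketch of such an annotation graph, assuming nodes are character offsets and each edge carries either the label "text" or an annotation type. The `Edge` class, the `build_graph` helper and the sample sentence are hypothetical illustrations, not part of the project or of GATE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    start: int   # node: character offset where the span begins
    end: int     # node: character offset where the span ends
    label: str   # "text" for literal text, otherwise an annotation type


def build_graph(text, annotations):
    """Return (nodes, edges) for `text`; `annotations` holds (start, end, type) spans."""
    # Nodes: both ends of the document plus every annotation boundary.
    nodes = sorted({0, len(text)} | {o for s, e, _ in annotations for o in (s, e)})
    # Literal-text edges join consecutive nodes.
    edges = [Edge(a, b, "text") for a, b in zip(nodes, nodes[1:])]
    # Annotation edges join the two nodes at the ends of each span.
    edges += [Edge(s, e, t) for s, e, t in annotations]
    return nodes, edges


text = "Galois wrote to Cauchy."
spans = [(0, 6, "Mathematician"), (16, 22, "Mathematician")]
nodes, edges = build_graph(text, spans)
for edge in edges:
    print(edge.label, repr(text[edge.start:edge.end]))
```

Every edge runs from a smaller offset to a larger one, so as long as no span is empty the graph cannot contain a cycle.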
Frames • Frame: a representation of an object, consisting of slots that contain values • Typical Prolog fact: Frame(Slot, Value). • We propose to combine frames with annotations: Doc(Annot, Text). • Main idea: represent the document directly as an object, a compromise between text and knowledge • Several advantages • A corpus of multiple related documents • Direct link between information and its source • Opens the door to the application of Prolog's logic
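To make the Doc(Annot, Text) form concrete, here is a small sketch that renders (document, annotation, text) triples as Prolog fact text. The helper names `quote_atom` and `to_fact` are hypothetical, and the sketch only handles single quotes inside atoms (doubled, as in standard Prolog); backslashes and other escapes are not treated.

```python
def quote_atom(s):
    """Quote a string as a Prolog atom; an embedded single quote is doubled."""
    return "'" + s.replace("'", "''") + "'"


def to_fact(doc, annot, text):
    """Render one (document, annotation, text) triple as a Prolog fact."""
    return f"{quote_atom(doc)}({quote_atom(annot)}, {quote_atom(text)})."


# A two-slot fragment of a document frame (values taken from the Galois example).
frame = [
    ("Galois.html.xml", "Mathematician", "Abel"),
    ("Galois.html.xml", "Mathematician", "Evariste Galois"),
]
for triple in frame:
    print(to_fact(*triple))
```

The first line printed is `'Galois.html.xml'('Mathematician', 'Abel').`, which has the same shape as the goals posed in the sample queries.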
Design • The IE application • Input: corpus of free, unstructured text • Output: the annotated documents, represented as annotation graphs • How: use GATE (language: JAPE) • The Prolog application • Input: the annotated document • Output: a frame, i.e. a set of Prolog facts. • How: use XSB (language: Prolog)
General Architecture for Text Engineering (GATE) • A comprehensive architecture for the development of NLP applications • Documents are treated as annotation graphs • Java Annotation Patterns Engine (JAPE) • Its own language for writing grammars that identify instances of object classes to annotate • A Nearly New Information Extraction (ANNIE) system • An already implemented, rudimentary IE system that can be extended by adding • JAPE grammars for annotating • Machine-learning models for annotating
Procedures • Obtain the corpus (Python script) • Write the JAPE grammars • annotations 'Mathematician', 'Father' • Train a model • annotation 'Protagonist' • Write the Prolog application to • Parse GATE's XML output into a structure • Construct the annotation graph from it • Process the annotations into a document frame • Output the document frame • Test by posing queries
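The parsing and frame-building steps can be sketched as follows. This is an illustration in Python, not the project's Prolog code, and the XML in `SAMPLE` is a simplified stand-in modelled on GATE's serialization: the element and attribute names used here (`TextWithNodes`, `Node`, `Annotation`, `Type`, `StartNode`, `EndNode`) are assumptions about that format, and the functions `parse` and `to_frame` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed stand-in for an annotated document exported as XML.
SAMPLE = (
    "<GateDocument>"
    "<TextWithNodes>"
    '<Node id="0"/>Galois<Node id="6"/> wrote to <Node id="16"/>Cauchy<Node id="22"/>.'
    "</TextWithNodes>"
    "<AnnotationSet>"
    '<Annotation Id="1" Type="Mathematician" StartNode="0" EndNode="6"/>'
    '<Annotation Id="2" Type="Mathematician" StartNode="16" EndNode="22"/>'
    "</AnnotationSet>"
    "</GateDocument>"
)


def parse(xml_text):
    """Return the plain text and a list of (annotation type, covered text) pairs."""
    root = ET.fromstring(xml_text)
    body = root.find("TextWithNodes")
    # Rebuild the plain text and record the character offset of each node.
    text = body.text or ""
    offsets = {}
    for node in body.findall("Node"):
        offsets[node.get("id")] = len(text)
        text += node.tail or ""
    # Each annotation covers the text between its start node and end node.
    pairs = []
    for ann in root.iter("Annotation"):
        start = offsets[ann.get("StartNode")]
        end = offsets[ann.get("EndNode")]
        pairs.append((ann.get("Type"), text[start:end]))
    return text, pairs


def to_frame(doc_name, pairs):
    """Turn annotation pairs into sorted, de-duplicated (doc, annot, text) triples."""
    return sorted({(doc_name, annot, covered) for annot, covered in pairs})


text, pairs = parse(SAMPLE)
frame = to_frame("Galois.html.xml", pairs)
print(frame)
```

De-duplicating in `to_frame` means a name annotated several times in one biography yields a single fact, which is consistent with the sample query listing each mathematician once.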
IE Result: Fermat.html • Precision: 1.0 (why so high?) • use of a gazetteer list • aggressive pruning by context • Recall: 0.9474 • the price of aggressive pruning: some instances were missed • F-measure (β = 2) • 0.973
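As a quick arithmetic check, the reported precision and recall can be plugged into the standard F-measure. This is a sketch that assumes the usual definition F = (1 + β²)PR / (β²P + R), which may not be the formula used in the talk; `f_measure` is a hypothetical helper.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-beta: a weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 1.0, 0.9474  # values reported for Fermat.html

print(round(f_measure(precision, recall, beta=1), 3))  # 0.973
print(round(f_measure(precision, recall, beta=2), 3))  # 0.957
```

Under this definition the reported 0.973 matches the balanced case (β = 1), so the slide's β presumably follows a different convention or formula.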
Prolog Result • Correctly constructs facts. • Sample session: | ?- 'Galois.html.xml'('Mathematician', X). X = Abel; X = Cauchy; X = Evariste Galois; X = Fourier; X = Galois; X = Gauss; X = Gergonne; X = Jacobi; X = Lagrange; X = Legendre; X = Libri; X = Liouville; X = Poisson; X = Vernier
Results • The Prolog layer is universal and cross-domain • The IE application may produce any annotation, not restricted to one subject area • Bifurcation: a success • Opens the door to logic and rules, esp. for cross-document relations | ?- 'Galois.html.xml'('Mathematician', X), 'Cauchy.html.xml'('Protagonist', X). X = Cauchy; no
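The cross-document query is a conjunction with a shared variable, i.e. a join on X. A rough Python analogue over fact triples is sketched below, assuming a small subset of the facts shown in the sample sessions; the `values` helper is hypothetical, and a real Prolog engine resolves the goals by unification and backtracking rather than set intersection.

```python
# A few (document, annotation, text) facts taken from the sample sessions.
facts = {
    ("Galois.html.xml", "Mathematician", "Abel"),
    ("Galois.html.xml", "Mathematician", "Cauchy"),
    ("Galois.html.xml", "Mathematician", "Gauss"),
    ("Cauchy.html.xml", "Protagonist", "Cauchy"),
}


def values(doc, annot):
    """All X such that the fact doc(annot, X) holds."""
    return {text for d, a, text in facts if d == doc and a == annot}


# Both goals must hold for the same X, so intersect the two answer sets.
shared = values("Galois.html.xml", "Mathematician") & values("Cauchy.html.xml", "Protagonist")
print(sorted(shared))  # ['Cauchy']
```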
Conclusion • With recent advances in computing power, logic programming is finally feasible for practical use • I ran my Prolog application on the server robustus with 2 GB of memory • However, computing power continues to be a limitation (GATE crashed every day) • Where do we go from here? • A more expressive document frame • Context analysis (through proximity, etc.) • Better IE applications through statistical processing