
Logic Programming for Natural Language Processing

Menyoung Lee, TJHSST Computer Systems Lab. Mentor: Matt Parker, Analytic Services, Inc. Purpose: to link recent developments in natural language processing (NLP), namely Information Extraction (IE), with classical logic programming in Prolog.


Presentation Transcript


  1. Logic Programming for Natural Language Processing • Menyoung Lee, TJHSST Computer Systems Lab • Mentor: Matt Parker, Analytic Services, Inc.

  2. Purpose • To link together • Recent developments in natural language processing (NLP): Information Extraction (IE) • Classical logic programming: Prolog • New paradigm: a bifurcated process • An IE application that produces structured output from a corpus of free, unstructured text • Transformation of the extracted information into a Prolog knowledge base (sets of fact triples) • Documents: biographies

  3. Why NLP? • Language is the cornerstone of intelligence • The Turing Test: the ability to converse like a human • Understanding and generating texts in a natural language, e.g. English • Many specific NLP tasks • Chatterbots, e.g. Eliza • Machine Translation • Information Retrieval (IR), e.g. Google • Information Extraction!! • Sci-fi dreams: universal translation, computers you can talk to, etc.

  4. Information Extraction (IE) • Most generally, the transformation of • Information contained in free, unstructured text in a natural language into • A prescribed, structured format • More specifically, the identification of • Instances of certain object classes • Their attributes • Relationships between object instances • Always restricted to a particular domain • In order to have a reasonably sized and sufficiently expressive ontology

  5. Why IE? • An expert must read many documents • Advent of the Internet & Information Age • Explosion in the sheer volume of textual information readily available in electronic form • New opportunity: lots and lots of available information to exploit • Formidable challenge: impossible for an expert to read and analyze that much text • A pragmatic approach: • Full text understanding is out of reach • Automate just some of the tasks, i.e. the identification of objects, attributes, and relations

  6. IE - Details • Five Tasks in IE • Named Entity Recognition (NE) • Coreference Resolution (CO) • Template Element Construction (TE) • Template Relation Construction (TR) • Scenario Template Production (ST) • Metrics for Evaluation • Precision • Recall • F-measure (borrowed from IR) • More intuitive reformulation [the slide's formulas are not preserved in this transcript]
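
The metrics named on this slide are conventionally defined in terms of counts of correct, spurious, and missing annotations. The sketch below uses the standard IR definitions, which may differ from the exact formulas on the original slide. The function names and the count parameters (`correct`, `spurious`, `missing`) are assumptions made for illustration.

```python
def precision(correct: int, spurious: int) -> float:
    """Fraction of the annotations produced by the system that are correct."""
    produced = correct + spurious
    return correct / produced if produced else 0.0


def recall(correct: int, missing: int) -> float:
    """Fraction of the annotations in the answer key that the system found."""
    expected = correct + missing
    return correct / expected if expected else 0.0


def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Standard F_beta: a weighted harmonic mean of precision and recall.

    beta = 1 weights both equally; beta > 1 weights recall more heavily.
    """
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * p * r / (b2 * p + r)


if __name__ == "__main__":
    # Hypothetical counts: 8 correct, 2 spurious, 8 missed.
    p = precision(correct=8, spurious=2)
    r = recall(correct=8, missing=8)
    print(p, r, f_measure(p, r))
```

With equal weighting the F-measure is the harmonic mean of precision and recall, so it is pulled toward the lower of the two values.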

  7. Annotations • Annotations identify objects in text • Annotation graph: a directed acyclic graph (DAG) • Nodes: positions in the text • Edges: the literal text and the annotations
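
One way to picture the annotation graph is as a list of labelled edges between character offsets. The following is a minimal sketch under that assumption; the class names `Edge` and `AnnotationGraph` and the method `annotate_phrase` are hypothetical and do not come from GATE.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    start: int  # node: character offset where the span begins
    end: int    # node: character offset where the span ends
    label: str  # annotation type, e.g. "Mathematician"


@dataclass
class AnnotationGraph:
    text: str
    edges: list = field(default_factory=list)

    def annotate(self, start: int, end: int, label: str) -> None:
        # Every edge points forward in the text, so the graph cannot contain a cycle.
        if not (0 <= start < end <= len(self.text)):
            raise ValueError("span must lie inside the text and be non-empty")
        self.edges.append(Edge(start, end, label))

    def annotate_phrase(self, phrase: str, label: str) -> None:
        # Convenience helper: annotate the first occurrence of a phrase.
        start = self.text.find(phrase)
        if start < 0:
            raise ValueError(f"phrase not found: {phrase!r}")
        self.annotate(start, start + len(phrase), label)

    def spans(self, label: str) -> list:
        # The literal text covered by each edge carrying the given label.
        return [self.text[e.start:e.end] for e in self.edges if e.label == label]


if __name__ == "__main__":
    graph = AnnotationGraph("Fermat wrote to Pascal.")
    graph.annotate_phrase("Fermat", "Mathematician")
    graph.annotate_phrase("Pascal", "Mathematician")
    print(graph.spans("Mathematician"))
```

Because each annotation is stored with its offsets, the link between a piece of extracted information and its place in the source text is never lost.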

  8. Frames • Frame: representation of an object, consisting of slots, which contain values • Typical Prolog fact: Frame(Slot, Value). • We propose to synthesize it with the idea of annotations: Doc(Annot, Text). • Main idea: represent the document directly as an object: compromise between text and knowledge • Several Advantages • A corpus of multiple related documents • Direct link between information and its source • Opens the door for the application of Prolog's logic.
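
To make the Doc(Annot, Text) idea concrete, the sketch below turns (annotation type, text) pairs into Prolog fact strings whose predicate name is the document name. It is an illustration only: `quote_atom` and `frame_facts` are hypothetical helpers, and the quoting assumes that doubling a single quote inside a quoted atom is enough for the target Prolog system.

```python
def quote_atom(value: str) -> str:
    """Write a string as a quoted Prolog atom (assumes '' escapes a quote)."""
    return "'" + value.replace("'", "''") + "'"


def frame_facts(doc_name: str, annotations) -> list:
    """Build one fact Doc(Annot, Text) per distinct (annotation, text) pair.

    `annotations` is an iterable of (annotation_type, covered_text) pairs.
    """
    facts = []
    seen = set()
    for annot, text in annotations:
        fact = f"{quote_atom(doc_name)}({quote_atom(annot)}, {quote_atom(text)})."
        if fact not in seen:  # skip repeats, keep first-seen order
            seen.add(fact)
            facts.append(fact)
    return facts


if __name__ == "__main__":
    pairs = [
        ("Mathematician", "Abel"),
        ("Mathematician", "Cauchy"),
        ("Mathematician", "Abel"),  # repeated mention, emitted once
    ]
    for line in frame_facts("Galois.html.xml", pairs):
        print(line)
```

Using the document name as the predicate keeps every fact tied to its source, which is what makes queries across several documents possible.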

  9. Design • The IE application • Input: corpus of free, unstructured text • Output: the annotated documents, represented as annotation graphs • How: use GATE (language: JAPE) • The Prolog application • Input: the annotated document • Output: a frame, i.e. a set of Prolog facts. • How: use XSB (language: Prolog)

  10. General Architecture for Text Engineering (GATE) • A comprehensive architecture for the development of NLP applications • Each document is treated as an annotation graph • Java Annotation Patterns Engine (JAPE) • Its own language for writing grammars that identify instances of object classes to annotate • A Nearly New Information Extraction (ANNIE) system • An already implemented, rudimentary IE system that can be extended by adding • JAPE grammars for annotating • Machine-learning models for annotating

  11. GATE

  12. Procedures • Obtain the corpus: Python script • Write the JAPE grammars • annotations 'Mathematician', 'Father' • Train a model • annotation 'Protagonist' • Write the Prolog application to • Parse GATE's XML output into a structure • Construct the annotation graph from it • Process the annotations into a document frame • Output the document frame • Test by posing queries
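
The parsing step can be sketched as follows. The project did this in Prolog; this Python version only illustrates the idea. The sample document is a simplified stand-in modelled loosely on GATE's saved XML, and its element and attribute names (`TextWithNodes`, `Node`, `Annotation`, `StartNode`, `EndNode`) are assumptions to check against real GATE output. `parse_gate_xml` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample; not real GATE output.
SAMPLE_XML = """
<GateDocument>
<TextWithNodes><Node id="0"/>Galois<Node id="6"/> wrote to <Node id="16"/>Cauchy<Node id="22"/>.</TextWithNodes>
<AnnotationSet>
<Annotation Id="1" Type="Mathematician" StartNode="0" EndNode="6"/>
<Annotation Id="2" Type="Mathematician" StartNode="16" EndNode="22"/>
</AnnotationSet>
</GateDocument>
""".strip()


def parse_gate_xml(xml_string: str):
    """Return (plain_text, [(annotation_type, covered_text), ...])."""
    root = ET.fromstring(xml_string)
    twn = root.find("TextWithNodes")

    # Rebuild the plain text and record the offset at which each node sits.
    pieces = [twn.text or ""]
    offset = len(pieces[0])
    node_offset = {}
    for node in twn.findall("Node"):
        node_offset[node.get("id")] = offset
        tail = node.tail or ""  # text between this node and the next one
        pieces.append(tail)
        offset += len(tail)
    text = "".join(pieces)

    # Each annotation is an edge between two nodes; look up the text it covers.
    annotations = []
    for ann in root.iter("Annotation"):
        start = node_offset[ann.get("StartNode")]
        end = node_offset[ann.get("EndNode")]
        annotations.append((ann.get("Type"), text[start:end]))
    return text, annotations


if __name__ == "__main__":
    text, annotations = parse_gate_xml(SAMPLE_XML)
    print(text)
    print(annotations)
```

Offsets are recomputed from the text itself rather than read from the node ids, so the sketch does not depend on ids being character positions.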

  13. IE Result: Fermat.html • Precision: 1 (why so high?) • use of a gazetteer list • aggressive pruning by context • Recall: 0.9474 • the price of aggressive pruning: some instances were missed • F-measure (β = 2) • 0.973
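
As a rough check of these figures, the reported precision and recall can be plugged into the standard F-measure. Since the slide's own formula is not preserved, the author's use of β may differ from the standard definition assumed here. A recall of 0.9474 would be consistent with, for example, 18 of 19 instances found, but the actual counts are not given.

```latex
\[
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R},
\qquad P = 1,\quad R = 0.9474
\]
\[
F_1 = \frac{2 \times 1 \times 0.9474}{1 + 0.9474}
    = \frac{1.8948}{1.9474} \approx 0.973
\]
\[
F_2 = \frac{5 \times 1 \times 0.9474}{4 \times 1 + 0.9474}
    = \frac{4.7370}{4.9474} \approx 0.957
\]
```

Under this standard definition, the reported 0.973 matches the equally weighted case, so the slide's "β = 2" presumably refers to a different parameterization.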

  14. Prolog Result • Correctly constructs facts. • Sample session: | ?- 'Galois.html.xml'('Mathematician', X). X = Abel; X = Cauchy; X = Evariste Galois; X = Fourier; X = Galois; X = Gauss; X = Gergonne; X = Jacobi; X = Lagrange; X = Legendre; X = Libri; X = Liouville; X = Poisson; X = Vernier

  15. Results • The Prolog layer is universal, cross-domain • The IE application may produce any annotation, not restricted to one subject area • Bifurcation: success • Opens door to logic and rules, esp. for cross-document relations | ?- 'Galois.html.xml'('Mathematician', X), 'Cauchy.html.xml'('Protagonist', X). X = Cauchy; no
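
The cross-document query above is a join on the shared variable X. As an illustration outside Prolog, the same join can be sketched with set intersection. The fact set below copies only a few of the values shown in the sample sessions, and `query` is a hypothetical helper.

```python
# (document, annotation, text) triples; a small subset of the facts shown above.
FACTS = {
    ("Galois.html.xml", "Mathematician", "Abel"),
    ("Galois.html.xml", "Mathematician", "Cauchy"),
    ("Galois.html.xml", "Mathematician", "Gauss"),
    ("Cauchy.html.xml", "Protagonist", "Cauchy"),
}


def query(facts, doc: str, annot: str) -> set:
    """All values X such that doc(annot, X) holds."""
    return {value for d, a, value in facts if d == doc and a == annot}


if __name__ == "__main__":
    # Mathematicians mentioned in the Galois biography who are also
    # the protagonist of the Cauchy biography.
    shared = (query(FACTS, "Galois.html.xml", "Mathematician")
              & query(FACTS, "Cauchy.html.xml", "Protagonist"))
    print(shared)
```

Prolog finds the same answer by unification and backtracking, without the query having to be written as an explicit set operation.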

  16. Conclusion • With recent advances in computing power, logic programming is finally feasible for practical use • My Prolog application ran on the server robustus with 2 GB of memory • However, computing power continues to be a limitation (GATE crashed every day) • Where do we go from here? • More expressive document frame • Context analysis (through proximity, etc.) • Better IE applications through statistical processing
