Natural Language Processing Applied to Archival Description of Textual E-records. William Underwood Georgia Tech Research Institute Atlanta, Georgia WVU/NETL/ERA Workshop on Digital Preservation of Complex Engineering Data WVU NRCCE, Morgantown, West Virginia April 20-21, 2009.

  1. Natural Language Processing Applied to Archival Description of Textual E-records William Underwood Georgia Tech Research Institute Atlanta, Georgia WVU/NETL/ERA Workshop on Digital Preservation of Complex Engineering Data WVU NRCCE, Morgantown, West Virginia April 20-21, 2009

  2. Archival Description • Method for extracting metadata from textual e-records • Use of the metadata in archival description • Next Steps Overview

  3. Archival Description includes: • The titling of records that do not have titles • The summary of the content of records, folders of records and series of records. • When time allows, the creation of other finding aids such as subject indexes to record series. Archival Description

  4. Archivists cannot describe a series until the record series has been manually read and reviewed. • With increasing volumes of e-records, it may be decades, even centuries, before new acquisitions are described. • In responding to FOIA requests, archivists need to be able to search collections of e-records with high precision and recall. • However, at the time of responding to FOIA requests, archivists have not read all of the records, so they cannot index the records and search on document types, dates of records, authors’ and addressees’ names, and the topics of records. • The result set of a query is a list of file names, not record titles and summaries of content. Archival Description: Research Motivation

  5. Descriptions of records include names of author(s) and addressees, topics, actions and sometimes dates. Example of an item (record) description from NARA’s Archival Research Catalog (ARC) This letter was typewritten by President George H. W. Bush and addressed to his children: George, Jeb, Neil, Marvin, and Doro. He expresses his happiness at their Christmas celebration held at Camp David, then writes concerning his conflicted feelings as he prepares for the possibility of war with Iraq. Archival Description: Item Scope and Content Note

  6. Input: Textual Document • Information Extraction • Document Type Recognition • Speech Act Transducer • Discourse Analysis for Topic Recognition Output: [document(e1), author(e1, S), addressee(e1, H), act(e1, F(P)), topic(e1, T), date(e1, D)] A Method for Extracting Metadata for Archival Description
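The output of the method is a list of predicates about one document event. A minimal Python sketch of that representation follows; the class name `RecordMetadata` and its field layout are assumptions made for illustration, not part of the presented system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecordMetadata:
    """Hypothetical container for the metadata extracted from one record."""
    event: str                       # document identifier, e.g. "e1"
    author: Optional[str] = None     # S
    addressee: Optional[str] = None  # H
    act: Optional[str] = None        # F(P): illocutionary force applied to content
    topic: Optional[str] = None      # T
    date: Optional[str] = None       # D

    def predicates(self) -> List[str]:
        """Render the populated fields in the predicate notation of the slide."""
        preds = [f"document({self.event})"]
        for name in ("author", "addressee", "act", "topic", "date"):
            value = getattr(self, name)
            if value is not None:
                preds.append(f"{name}({self.event}, {value})")
        return preds


example = RecordMetadata("e1", author="S", addressee="H",
                         act="F(P)", topic="T", date="D")
print("[" + ", ".join(example.predicates()) + "]")
```

Fields that a pipeline stage could not fill are simply omitted from the rendered list.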

  7. Information extraction (semantic tagging) is a technology used to identify and annotate semantic categories in text (e.g. names of persons, organizations and locations, job titles, dates). • Document Reader • English Tokenizer • Wordlist Lookup + enhanced wordlists • Sentence Splitter • Hepple POS Tagger + lexicon • Semantic Tagger + Named Entity Rules Information Extraction: Method

  8. Person_female_first.lst (8,263) • Person_female_first_ambig.lst (117) • Person_male_first.lst (3,704) • Person_male_first_ambig.lst (1,117) • Person_surname.lst (83,805) • Person_surname_ambig.lst (6,802) • Person_headofstate_90.lst (478) • Location_city_US.lst (33,017) • Location_city_us_ambig.lst (5,478) • Location_foreign_city.lst (3,802) Information Extraction: Wordlist Lookup
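Wordlist lookup can be illustrated with a small Python sketch that tags each token with the lists it appears in. This is not the lookup component used in the presented system: the list contents below are tiny hypothetical samples, the function name `lookup` is invented, and the role of the `_ambig` lists is not modeled because the slide does not describe it.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical miniature wordlists; the real lists hold thousands of entries.
WORDLISTS: Dict[str, Set[str]] = {
    "person_male_first": {"george", "sam"},
    "person_female_first": {"susan", "sally"},
    "person_surname": {"bush", "skinner", "black"},
    "location_city_us": {"atlanta", "morgantown"},
}

Annotation = Tuple[str, int, int, List[str]]


def lookup(text: str) -> List[Annotation]:
    """Return (token, start, end, matching list names) for each listed token."""
    annotations: List[Annotation] = []
    for match in re.finditer(r"[A-Za-z]+", text):
        token = match.group(0)
        categories = sorted(
            name for name, entries in WORDLISTS.items()
            if token.lower() in entries
        )
        if categories:
            annotations.append((token, match.start(), match.end(), categories))
    return annotations


for annotation in lookup("Sam Skinner flew to Atlanta."):
    print(annotation)
```

Later stages, such as named entity rules, could combine adjacent lookups (a first name followed by a surname) into a single person annotation.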

  9. Java Annotation Pattern Engine (JAPE) Rules

  10. Annotated Person Names and Job Titles

  11. Information Extraction: Performance

  12. Agenda Bar Chart Biography Briefing Memo Decision Memo Correspondence Diary Executive Order Information Memo Job Application List of Candidates for Federal Office Mailing List Memo Minutes of Meeting National Security Directive (NSD) Newsletter Nomination to Federal Office Notes Presidential Statement Press Pool Report Press Release Referral Memo Resume Schedule Signature Memo Situation Report Summary Transcript of Speech Telephone Call Recommendation Transcript of News Conference Document Types

  13. Input: Annotated text from Information Extractor • Intellectual Element Annotator + Intellectual Element Rules • SUPPLE Parser/Interpreter + Document Type Grammars augmented with Semantics • Extract Metadata Output: [document(e1), author(e1, S), addressee(e1, H), topic(e1, T), date(e1, D)] Document Type Recognition

  14. Document Types: Intellectual Element Recognition

  15. Document Types: Grammar for the Structure of a Memorandum

  16. Document Types: Grammar for Memorandum with Semantic Rules

  17. Parse Tree and Semantics of a Document

  18. Document_Type = memo Date = April 27, 1992 Author = SAM SKINNER Addressee = EDE HOLIDAY Topic = California Earthquake A memorandum dated April 27, 1992 from Sam Skinner to Ede Holiday regarding California Earthquake. Extracted Metadata and Item Description
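An item description of this kind can be produced by filling a sentence template from the extracted fields. The Python sketch below is only an illustration under stated assumptions: the function `describe_item`, the `DOC_TYPE_NAMES` mapping, and the use of title casing for names are hypothetical, and the Author field is taken as the sender and the Addressee field as the recipient.

```python
from typing import Dict

# Hypothetical mapping from extracted type codes to display names.
DOC_TYPE_NAMES: Dict[str, str] = {"memo": "memorandum"}


def article_for(noun: str) -> str:
    """Choose 'A' or 'An' from the first letter (a rough heuristic)."""
    return "An" if noun[:1].lower() in set("aeiou") else "A"


def describe_item(meta: Dict[str, str]) -> str:
    """Fill a fixed sentence template from the extracted metadata fields."""
    doc_type = DOC_TYPE_NAMES.get(meta["Document_Type"], meta["Document_Type"])
    author = meta["Author"].title()        # assumption: normalize upper-case names
    addressee = meta["Addressee"].title()
    return (f"{article_for(doc_type)} {doc_type} dated {meta['Date']} "
            f"from {author} to {addressee} regarding {meta['Topic']}.")


metadata = {
    "Document_Type": "memo",
    "Date": "April 27, 1992",
    "Author": "SAM SKINNER",
    "Addressee": "EDE HOLIDAY",
    "Topic": "California Earthquake",
}
print(describe_item(metadata))
```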

  19. Annotation of Explicit Speech Acts • Annotation of Implicit Speech Acts • Annotation of Speech Acts Indicated by Text Structure • Annotation of Indirect Speech Acts • Annotation of the Primary Speech Acts Speech Act Transducer

  20. Performative verb: a verb whose action is accomplished merely by saying or writing it. Example: I recommend that you attend the conference. • Illocutionary force of a message: recommend • Propositional content of a message: you attend the conference • An explicit performative sentence is a sentence in which the illocutionary force is made explicit by naming the force. Example: I promise to be there. • An implicit performative sentence is a sentence in which the illocutionary force is not made explicit by naming the force. Example: I shall be there. Speech Acts
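The split of an explicit performative sentence into illocutionary force F and propositional content P can be sketched with a regular expression. This is a simplified illustration, not the transducer described in the talk: the verb list is a small hypothetical subset, the pattern only handles first-person singular sentences, and the function name `explicit_speech_act` is invented.

```python
import re
from typing import Optional, Tuple

# Hypothetical subset of performative verbs, for illustration only.
PERFORMATIVE_VERBS = ("recommend", "promise", "request", "suggest", "urge")

PATTERN = re.compile(
    r"^I\s+(?:hereby\s+)?(?P<force>" + "|".join(PERFORMATIVE_VERBS) + r")\b\s+"
    r"(?:that\s+)?(?P<content>.+?)[.!]?$",
    re.IGNORECASE,
)


def explicit_speech_act(sentence: str) -> Optional[Tuple[str, str]]:
    """Return (force, propositional content), or None if no verb is named."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return match.group("force").lower(), match.group("content")


print(explicit_speech_act("I recommend that you attend the conference."))
print(explicit_speech_act("I promise to be there"))
print(explicit_speech_act("I shall be there"))
```

The third sentence names no performative verb, so the sketch returns `None`; recognizing its implicit force needs a different rule.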

  21. Declarative, imperative and interrogative sentences also express speech acts. • Declarative (state) • You completed the report. • Imperative (request) • Please, complete the report. • Interrogative (ask) • Did you complete the report? Speech Acts: Implicit
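The mapping from sentence mood to implicit speech act can be approximated with surface cues. The Python sketch below is a rough heuristic under stated assumptions: it uses the final question mark and a small hypothetical list of sentence-initial words instead of part-of-speech tags, and the function name `implicit_force` is invented.

```python
import re

# Hypothetical sentence-initial words treated as marking an imperative.
IMPERATIVE_STARTERS = {"please", "complete", "send", "attend", "review"}


def implicit_force(sentence: str) -> str:
    """Classify a sentence as 'ask', 'request', or 'state' from surface cues."""
    text = sentence.strip()
    if text.endswith("?"):
        return "ask"                      # interrogative
    first_word = re.match(r"[A-Za-z]+", text)
    if first_word and first_word.group(0).lower() in IMPERATIVE_STARTERS:
        return "request"                  # imperative
    return "state"                        # declarative (default)


for example in ("You completed the report.",
                "Please, complete the report.",
                "Did you complete the report?"):
    print(example, "->", implicit_force(example))
```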

  22. An indirect speech act is a speech act that is performed indirectly by way of performing another. Can you pass the salt? (ask) in the appropriate context means Please, pass the salt. (request) • Textual structure can also indicate illocutionary force. Example: a section heading RECOMMENDATIONS can indicate the sentences in a section have the illocutionary force recommend. Speech Acts
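The use of text structure as a cue can be sketched by letting a section heading assign an illocutionary force to the sentences under it. In the Python sketch below, the `HEADING_FORCE` table, the rule that an all-capitals line is a heading, and the sample lines are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical table of headings that indicate an illocutionary force.
HEADING_FORCE: Dict[str, str] = {
    "RECOMMENDATIONS": "recommend",
    "RECOMMENDATION": "recommend",
}


def annotate_by_heading(lines: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each body line with the force indicated by its section heading."""
    current_force: Optional[str] = None
    annotated: List[Tuple[str, Optional[str]]] = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.isupper():                # assumption: headings are all capitals
            current_force = HEADING_FORCE.get(text.rstrip(":"))
            continue
        annotated.append((text, current_force))
    return annotated


# Invented sample text, not taken from any record.
sample = [
    "BACKGROUND",
    "The storm damaged the plant.",
    "RECOMMENDATIONS",
    "Approve the funding request.",
]
for sentence, force in annotate_by_heading(sample):
    print(force, "|", sentence)
```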

  23. assert, deny, state, declare(1), tell(1), report, advise(1), remind, inform, certify(1), agree(1), acknowledge, praise(1), commit, pledge, direct, request, ask(1), ask(2), urge, encourage, invite, order(1), prohibit, suggest(2), propose, recommend, declare(2), resign, confirm, nominate, appoint, authorize, pray, terminate, veto, approve(1), disapprove, revoke, mourn, congratulate, thank, apologize, and welcome(2). • concur, salute, amend, counsel, welcome(1), tender(2), call on, block, retire, proclaim, delegate, designate, determine, find, reject(2), endorse, appreciate, regret, trust(1), believe, want, desire, and intend. Speech Acts in Presidential Records

  24. Signature Memorandum from Boyden Gray to the President recommending the nomination of Ronald B. Leighton to be a US District Judge. • Letter from President Bush to President Mikhail Gorbachev suggesting an informal meeting. • Memorandum from President Bush to Boyden Gray requesting an analysis of the War Powers Resolution. • Letter from Susan Black to President Bush expressing appreciation for nomination and commitment to serve. • Referral Memorandum from Sally Kelley to FEMA requesting appropriate action to a letter from Beryl Anthony to the President. Uses of Extracted Metadata in Automatic Description

  25. Induce grammars for documentary form from samples • Create rules for annotating implicit speech acts and speech acts indicated by textual structure • Evaluate the performance of the speech act recognition method • Recognize the topics of sentences • Apply discourse analysis to identify the primary topic(s) of records • Generate item, folder and series descriptions and evaluate the method Next Steps

  26. Website: perpos.gtri.gatech.edu • W. Underwood and S. Isbell. Semantic Annotation of Presidential E-Records. Technical Report ITTL/CSITD 08-01, May 2008. • W. Underwood and S. Laib. Automatic Recognition of Documentary Forms. Technical Report ITTL/CSITD 08-02, May 2008. • W. Underwood. Recognizing Communication Acts in Presidential E-Records. Technical Report ITTL/CSITD 08-03, October 2008. Additional Information

