
Natural Language Processing within the Archaeotools Project

Natural Language Processing within the Archaeotools Project. Michael Charno, Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman and Ziqi Zhang. CAA Williamsburg, March 2009.


Presentation Transcript


  1. Natural Language Processing within the Archaeotools Project Michael Charno, Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman and Ziqi Zhang. CAA Williamsburg, March 2009

  2. “To support research, learning and teaching with high quality and dependable digital resources.”

  3. FUNDING: AHRC-EPSRC-JISC (Joint Information Systems Committee) eScience research grants scheme.
  PARTNERS: Natural Language Processing Research Group, Department of Computer Science, University of Sheffield.
  AIM: To allow archaeologists to discover, share and analyse datasets and legacy publications which have hitherto been very difficult to integrate into existing digital frameworks.
  BUILDS UPON: Common Information Environment Enhanced Geospatial browser.

  4. Three distinct Work packages:
  Work package 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media) – reported on at CAA, Budapest.
  Work package 2 – Natural language processing / Data-mining of Grey Literature.
  Work package 3 – Data-mining of Historic Literature; plus geoXwalk.

  5. WP1 datasets include: National Monuments Records (Scotland, Wales, England); Excavation Index (EH); Archive Holdings; Local Authority Historic Environment Records.
  WP2/3 datasets include: ‘Grey’ (Gray) Literature; Proceedings of the Society of Antiquaries of Scotland (PSAS).
  Thesauri include: Thesaurus of Monument Types (TMT); Thesaurus of Object Types; MIDAS Period list; UK Government list of administrative areas, County, District, Parish (CDP) – not MIDAS.

  6. [Architecture diagram showing: Oracle RDBMS; MIDAS XML Record; RDF Resource; Information Extraction Input; When, Where and What ontologies as entries to the faceted index; Knowledge triple store; XML docs of the thesauri; Query; User Interface.]

  7. [Up-to-date version of the architecture diagram on the previous slide.]

  8. Three distinct Work packages:
  Work package 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media) – reported on at CAA, Budapest.
  Work package 2 – Natural language processing / Data-mining of Grey Literature.
  Work package 3 – Data-mining of Historic Literature; plus geoXwalk.

  9. BARROW BARROW BARROW

  10. Was it Bonnie or Clyde? “I never said she stole my money” – seven meanings, depending on which word is stressed:
  Stress on “I”: someone else said it, but I didn’t.
  Stress on “never”: I simply didn’t ever say it.
  Stress on “said”: I might have implied it, but I never said it.
  Stress on “she”: I said someone stole it; I didn’t say it was she.
  Stress on “stole”: I just said she probably borrowed it.
  Stress on “my”: I said she stole someone else’s money.
  Stress on “money”: I said she stole something, but not my money.

  11. State-of-the-art review – approaches to rule induction. Two mainstream methodologies:
  • Human handcrafted rules (rule-based systems) – built manually by analysing example annotations and deriving human-readable discriminative patterns. Easy to understand, easy to implement, effective for structured texts and simple patterns, and no training of learning models is needed; but not robust to less-structured texts, and deriving rules for large numbers of example annotations is time-consuming and difficult.
  • Machine-learned rules (machine learning) – built automatically by analysing example annotations and converting features into numeric representations, which mathematical models consume to derive discriminative patterns that are not human-readable. Very robust, and copes with large amounts of data and complex patterns: we only select the features, and the machine analyses the examples and induces the rules; but very sensitive to feature selection, implementation and feature tuning are difficult and take time, and it may not work well with small numbers of examples.

  12. The fundamental idea... The fundamental idea is to study the features of positive and negative examples of entities, and/or their surrounding N words, over a large collection of annotated documents (training data prepared by humans), and to design rules that capture instances of a given type (Nadeau et al., 2006). The rules are then applied to a new corpus to classify each individual token (both previously seen and unseen) into suitable classes.
  • Features – descriptors or characteristic attributes of words, designed for algorithmic consumption.
  • Positive examples – instances of the given type to be extracted.
  • Negative examples – any text units not annotated as the given type.
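The learn-then-classify loop described above can be sketched in a few lines of Python. This is a toy illustration only, not the project's actual pipeline (Archaeotools used machine-learning tools developed at the University of Sheffield); the feature names and training data here are invented. It counts how often each feature co-occurs with positive versus negative training examples, then labels a new token by comparing those counts:

```python
from collections import Counter

def train(examples):
    """Count feature occurrences separately for positive and negative examples.
    `examples` is a list of (feature_set, is_positive) pairs."""
    pos, neg = Counter(), Counter()
    for features, is_positive in examples:
        (pos if is_positive else neg).update(features)
    return pos, neg

def classify(features, pos, neg):
    """Label a token positive if its features were seen more often
    with positive training examples than with negative ones."""
    pos_score = sum(pos[f] for f in features)
    neg_score = sum(neg[f] for f in features)
    return pos_score > neg_score

# Toy training data: monument-type mentions (positive) vs. other tokens.
training = [
    ({"capitalised", "in_gazetteer", "prev=the"}, True),
    ({"capitalised", "in_gazetteer", "prev=a"}, True),
    ({"lowercase", "prev=was"}, False),
    ({"capitalised", "prev=."}, False),
]
pos, neg = train(training)
print(classify({"capitalised", "in_gazetteer", "prev=the"}, pos, neg))  # True
print(classify({"lowercase", "prev=was"}, pos, neg))                    # False
```

A real system would use a proper statistical model over many thousands of annotations, but the shape is the same: features in, discriminative evidence accumulated, unseen tokens classified.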

  13. The fundamental idea... Example annotations in highlighted colours are positive examples; un-annotated text provides the negative examples.
  Features of this annotation:
  • first_letter_capitalised: true
  • word_found_in_gazetteer: true
  • preceded_by: the
  • followed_by: period
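The four features listed on this slide can be computed mechanically from a tokenised sentence. A minimal sketch, assuming a toy gazetteer (the entries below are illustrative, not the project's actual thesaurus-derived gazetteer):

```python
GAZETTEER = {"barrow", "henge", "york"}  # illustrative gazetteer entries

def token_features(tokens, i):
    """Compute the slide's four features for the token at position i."""
    word = tokens[i]
    return {
        "first_letter_capitalised": word[:1].isupper(),
        "word_found_in_gazetteer": word.lower() in GAZETTEER,
        "preceded_by": tokens[i - 1].lower() if i > 0 else None,
        "followed_by": tokens[i + 1] if i + 1 < len(tokens) else None,
    }

print(token_features(["the", "Barrow", "."], 1))
# {'first_letter_capitalised': True, 'word_found_in_gazetteer': True,
#  'preceded_by': 'the', 'followed_by': '.'}
```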

  14. Rule-based systems are good for extracting information that matches simple patterns and/or occurs in regular contexts, and are therefore applied to:
  • Grid reference (easting and northing)
  • Report title*
  • Report creator*
  • Report publisher*
  • Report publication date*
  • Report publisher contact
  • Bibliography & references
  Machine learning is good for extracting information that cannot be matched by patterns, occurs in irregular contexts, or is present in large quantities, and is therefore applied to:
  • What (subject)
  • Where (place name)
  • When (temporal info)
  • Event date
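Grid references illustrate why the rule-based approach works for the first group: the format is rigid enough for a pattern to capture. A minimal sketch, assuming UK Ordnance Survey references of the form "SE 6052 5238" (the actual Archaeotools rules were more extensive than this single expression):

```python
import re

# Two-letter OS grid square, then easting and northing digit groups,
# optionally space-separated.
GRID_REF = re.compile(r"\b[A-Z]{2}\s?(\d{2,5})\s?(\d{2,5})\b")

def find_grid_refs(text):
    """Return candidate grid references found in a report's text,
    keeping only matches where easting and northing have equal precision."""
    return [m.group(0) for m in GRID_REF.finditer(text)
            if len(m.group(1)) == len(m.group(2))]

text = "The trench was located at SE 6052 5238, south of the rampart."
print(find_grid_refs(text))  # ['SE 6052 5238']
```

Contrast this with the second group: no comparably compact pattern distinguishes a subject term or a place name from surrounding prose, which is why those classes go to the machine-learning component.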

  15. From the 1st batch of annotated corpus: 35 unique annotated documents. Number of annotations by class:
  • publisher.name: 93
  • title: 53
  • date.event: 129
  • coverage.temporal: 2185
  • subject: 7935
  • publisher.contact: 21
  • date.publication: 28
  • coverage.spatial.placename: 1467
  • creator: 67

  16. * These features are generally applied to all the other classes too. See the following slides.
