CLiMB: Computational Linguistics for Metadata Building


Presentation Transcript


  1. Center for Research on Information Access, Columbia University Libraries. CLiMB: Computational Linguistics for Metadata Building. CLiMB - Columbia University

  3. Overall Goals
     • Research: Development of richer retrieval through increased numbers of descriptors
     • Research and Practice: Creation of enabling technologies for new large digitization projects
     • Research and Practice: Expanded capability for cross-collection searching
     • Practice: Development of a suite of CLiMB tools
     • Resources: A vocabulary list that can be used by other visual resource professionals
     The essence of CLiMB:
     • Use scholars themselves as “catalogers” by utilizing scholarly publications
     • Enhance existing descriptive metadata

  4. CLiMB Project Teams
     • Coordinating
     • Collections (Curatorial)
     • Technical
     • External Advisory

  5. CLiMB: 2-year timetable
     • YEAR 1
       • Evaluating existing computational tools
       • Developing additional software as needed
       • Selecting and building (scanning, converting) needed candidate texts
       • Loading initial descriptive metadata into end-user system
       • Evaluating initial results with user groups

  6. CLiMB: 2-year timetable
     • YEAR 2
       • Use feedback to refine metadata generation & filtering
       • Prepare additional collections for testing
       • Incorporate data in different user platforms
       • Seek external partners for using the CLiMB toolset

  7. Computational Linguistic Techniques
     • What techniques have we tried?
     • Goal: Identify high quality metadata terms
     • Goal: Use metadata for finding images
     • How well have they worked?
     • What else do we want to try?

  8. Text about Images
     The Blacker House is known for its porte cochère and adjacent terraces. Samuel Parker Williams, an occasional Greene collaborator, worked on the site, particularly on the sandstone boulder foundation for the sleeping porch.
     -- Based on Bosley

  9. Techniques We Have Tried
     Supervised (using existing resources)
     • Matching algorithms - proper names & variants
     • Back of book index analysis
     • Composite list of terms from authoritative lists
     Unsupervised
     • Part of speech tagging
     • Noun phrase identification
     • Proper noun identification

  10. Computational Linguistic Techniques
      • What techniques have we tried?
      • Goal: Identify high quality metadata terms
      • Goal: Load metadata into image search database
      • Goal: Use enriched metadata for finding images
      • How well have they worked?
      • What else do we want to try?

  11. CLiMB Art Keys (CAKEs)
      • Need Unique Identifiers
        • Key of database records
      • Varies from collection to collection
        • Greene & Greene – Project Names
        • Chinese Paper Gods – God Names
        • South Asian Temples – Temple Names

  13. CLiMB processing steps
      • Compile list of subject vocabulary
      • Find meaningful terms in texts
      • Segment relevant texts
      • Collect terms from all sources
      • Identify and link CAKEs described in text
      • Determine term relationships
      • Extract metadata
      • Insert into existing metadata records
      • Mount in image search platform
      • Process queries and evaluate

  14. Create Composite List of Subject Terms
      Philosophy: Use whatever resources exist
      • Catalog records
        • Robert R. Blacker house (Pasadena, Calif.)
        • Greene, Charles Sumner
        • Blacker, Robert R.
      • Art and Architecture Thesaurus
        • porte cochère
      • Back of the book index
        • Blacker house

  15. How CLiMB Works Today
      Official pages:
      • www.columbia.edu/cu/cria
      Work in progress (temporary sites):
      • www.cs.columbia.edu/~delson/cni
      • www.cs.columbia.edu/~delson/CAKEFinder

  19. Progress – Composite List
      • Greene & Greene
        • Extracted back of the book indexes
        • Direct matching of index terms to the text
      • Terms found - highlighted in yellow
        • David Gamble
        • Pasadena
        • Westmoreland Place
        • furniture
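The direct matching of index terms to text described on this slide could be sketched roughly as follows. This is only an illustration, not the CLiMB implementation: the function name `find_terms` and the sample sentence are hypothetical, and whole-phrase, case-insensitive matching is an assumption.

```python
import re

def find_terms(text, terms):
    """Return {term: [start offsets]} for case-insensitive whole-phrase matches."""
    hits = {}
    for term in terms:
        # Lookarounds keep "furniture" from matching inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        starts = [m.start() for m in pattern.finditer(text)]
        if starts:
            hits[term] = starts
    return hits

# Index terms taken from the slide; the sentence is made up for illustration.
index_terms = ["David Gamble", "Pasadena", "Westmoreland Place", "furniture"]
sample = "David Gamble chose a lot on Westmoreland Place in Pasadena."
matches = find_terms(sample, index_terms)
```

The returned offsets are what a highlighter would need in order to mark the matched terms in the displayed text.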

  22. Three Term Types and Approaches
      1) Art Object keys (Charles Pratt)
         • Named Entity noun phrase finders, POS taggers
      2) Proper nouns important to the domain
      3) Common noun terms
         • generic domain vocabulary (chimney)
         • semantically significant to the domain (V-shaped plan)

  23. Part of Speech (POS) taggers
      • Why use a part of speech tagger?
        • To identify nouns, verbs and proper nouns
      • The Blacker House is known for its porte cochère…
        • <Determiner>The
        • <Proper_Noun>
          • <Singular_Proper_Noun>Blacker
          • <Singular_Proper_Noun>House
        • <Verb_Present>is
        • <Verb_Past_Participle>known
        • <Preposition>for
        • <Possessive_Pronoun>its
        • <Adjective>adjacent
        • <Noun_Plural>terraces

  24. Part of Speech (POS) taggers
      • Strength: An essential step that allows the rest of the system to work
      • Weakness: The best POS taggers have 95% accuracy
        • A typical 20-word sentence is likely to contain a mistake!
      • But: some errors do not matter much
        • E.g. sleeping porch
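The claim about 20-word sentences can be checked with a quick calculation. The sketch below assumes that tagging errors are independent across words, which is a simplification and not something the slide states.

```python
def p_at_least_one_error(per_token_accuracy: float, n_tokens: int) -> float:
    """Probability that at least one of n_tokens is mis-tagged,
    assuming each token is tagged correctly independently."""
    return 1.0 - per_token_accuracy ** n_tokens

p = p_at_least_one_error(0.95, 20)
print(f"{p:.2f}")  # prints 0.64
```

Under that assumption, roughly two out of three 20-word sentences would contain at least one tagging error.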

  25. Proper Nouns
      • Alembic WorkBench Results
        • 91.2% recall
          • Misses The senior Pratt, Hall brothers
        • 97.5% precision using Alembic
          • Successfully finds William Issac Ott, University of California
        • This is very good!
      • LTChunk proper nouns highlighted in peach
        • Laurabelle Robinson
        • Greenes
        • Pasadena
        • Etc.
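For readers unfamiliar with the two measures, precision and recall can be computed from the set of phrases a tool finds and a hand-labelled gold set. The sets below are a toy illustration built from the names on the slide. They are not the evaluation data behind the 91.2% and 97.5% figures.

```python
def precision_recall(found: set, gold: set):
    """Precision = correct finds / all finds; recall = correct finds / all gold items."""
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy data: two names found, two missed.
gold = {"William Issac Ott", "University of California", "The senior Pratt", "Hall brothers"}
found = {"William Issac Ott", "University of California"}
precision, recall = precision_recall(found, gold)
```

In this toy case everything found is correct (precision 1.0) but only half of the gold names are found (recall 0.5).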

  27. Noun Phrase Chunking
      [The [Blacker House]] is known for [[its porte cochère] and [adjacent terraces]]. [Samuel Parker Williams], [an occasional Greene collaborator], worked on [the site], particularly on [the [[sandstone boulder] foundation]] for [the [sleeping porch]].
      -- Based on Bosley

  28. NP Chunkers
      • Columbia’s LinkIT
        • Regular expression grammar over POS tags
        • Improves WorkBench results through finding simplex NPs
      • LTChunk
        • By LTG Group, University of Edinburgh
        • Not as many NPs
      • Arizona - commercialized
      • IBM – also commercial
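A regular expression grammar over POS tags can be illustrated with a very small chunker. This is a sketch of the general idea only and is not LinkIT's actual grammar: the Penn Treebank style tags, the one-letter tag classes, and the pattern `D?J*N+` are all assumptions made for the example, and the tags in the sample sentence were assigned by hand.

```python
import re

# Collapse each tag to one letter so a token index equals a string index.
TAG_CLASS = {"DT": "D", "PRP$": "D", "JJ": "J",
             "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N"}

# Simplex NP: optional determiner/possessive, any adjectives, one or more nouns.
SIMPLEX_NP = re.compile(r"D?J*N+")

def chunk_noun_phrases(tagged):
    """tagged is a list of (word, tag) pairs; returns the matched noun phrases."""
    codes = "".join(TAG_CLASS.get(tag, "O") for _, tag in tagged)
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in SIMPLEX_NP.finditer(codes)]

tagged = [("The", "DT"), ("Blacker", "NNP"), ("House", "NNP"), ("is", "VBZ"),
          ("known", "VBN"), ("for", "IN"), ("its", "PRP$"), ("porte", "NN"),
          ("cochère", "NN"), ("and", "CC"), ("adjacent", "JJ"), ("terraces", "NNS")]
phrases = chunk_noun_phrases(tagged)
```

On this input the sketch returns the three flat phrases "The Blacker House", "its porte cochère", and "adjacent terraces". It does not produce the nested brackets shown on the Noun Phrase Chunking slide.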

  29. Results: NP Chunking
      • Common noun phrases highlighted in light blue:
        • His brother’s property
        • Planter boxes
        • The south wing
        • Etc.

  31. Experiments with Algorithms
      • TF/IDF and term frequency ratios
        • Filter technical terms from frequent common nouns
        • Term frequency ratio algorithm to improve accuracy
      • Co-occurrence
        • Useful terms may appear near other good ones
      • Machine learning
        • Use learning algorithms to discover complex associational context
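One way to read the term frequency ratio idea is to compare how often a term occurs in the domain text against a general reference text, so that frequent but uninformative words score near 1 and domain terms score higher. The sketch below is an assumption about how such a ratio might be computed. The add-one smoothing, the function name, and both toy corpora are hypothetical.

```python
from collections import Counter

def frequency_ratios(domain_tokens, general_tokens, smoothing=1.0):
    """Ratio of smoothed relative frequency in the domain text vs. a general text."""
    dom, gen = Counter(domain_tokens), Counter(general_tokens)
    dom_total, gen_total = sum(dom.values()), sum(gen.values())
    vocab_size = len(set(dom) | set(gen))
    ratios = {}
    for term, count in dom.items():
        p_dom = (count + smoothing) / (dom_total + smoothing * vocab_size)
        p_gen = (gen[term] + smoothing) / (gen_total + smoothing * vocab_size)
        ratios[term] = p_dom / p_gen
    return ratios

# Toy corpora, invented for illustration.
domain = "the porch and the chimney of the house the porch".split()
general = "the cat sat on the mat and the dog sat in the house".split()
ratios = frequency_ratios(domain, general)
```

Here "porch" and "chimney" score well above "the", which is the filtering effect the slide describes.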

  33. What is Segmentation?
      • Divide texts into cohesive chunks
      • Needed for determining associational context
      • Needed to determine what terms are related to an art object

  34. Results: Segmentation
      • Use the frequency with which our terms appear within a document to estimate where the document is about each term
      • This graph shows where different names are mentioned in Bosley on Greene & Greene, Ch. 5
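A frequency-based segmentation of this kind might look like the following sketch: label each paragraph with the name mentioned most often, then merge neighbouring paragraphs that share a label. This is an illustrative guess at the approach and not the in-house tool. The function names and toy paragraphs are hypothetical, and `terms` is assumed to be non-empty.

```python
def dominant_term_per_paragraph(paragraphs, terms):
    """Label each paragraph with its most frequently mentioned term (or None)."""
    labels = []
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        counts = {term: lowered.count(term.lower()) for term in terms}
        best = max(counts, key=counts.get)
        labels.append(best if counts[best] > 0 else None)
    return labels

def merge_runs(labels):
    """Merge consecutive identical labels into (label, first_index, last_index)."""
    runs = []
    for i, label in enumerate(labels):
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1], i)
        else:
            runs.append((label, i, i))
    return runs

# Toy paragraphs, invented for illustration.
paragraphs = ["Blacker house. Blacker terraces.", "Blacker porch.", "Thorsen house. Thorsen site."]
labels = dominant_term_per_paragraph(paragraphs, ["Blacker", "Thorsen"])
runs = merge_runs(labels)
```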

  35. What We’ve Tried: Segmenters
      • Marti Hearst’s TextTiling
        • Performs well for a general algorithm, but not sufficient for this specialized task
        • M. Hearst, ACL, 1993
      • F. Choi’s C99 segmenter
        • Performance comparable to TextTiling
        • F. Y. Y. Choi, NAACL, 2000
      • Frequency ratio approach outperformed TextTiling
      • In-house tool to be tested
        • Kan & Klavans, WVLC-6, 1998, Segmenter

  36. Meronymy as “Part-Of”
      • Why is this potentially useful?
        • A method for identifying “hot” paragraphs
      • Descriptive text contains “part of” relations
        • Details that correlate to the whole
        • Porch is a part of house
      • An early hypothesis – in testing stages

  37. Meronymy for Cohesion
      The Spinks house design is an elaboration of the rectangular, large-gabled form of the “California House” …. has … porches and terraces. In front, an expanse of … lawn rises nearly to the level of the entry terrace…. The front door is approached obliquely in the shaded recess of the terrace….

  38. Meronymy and Other Relations
      (Diagram; labels as extracted: The California House, Other Houses, Spinks House, entry terrace, front entry, terrace, porch, front door)
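The "hot paragraph" hypothesis from the meronymy slides can be illustrated by counting how many known parts of a whole a passage mentions. The part-of lexicon below is a hypothetical hand-made stand-in. A real system might draw such relations from a resource such as WordNet.

```python
import re

# Hypothetical part-of lexicon: whole -> words naming its parts.
PARTS_OF = {"house": {"porch", "porches", "terrace", "terraces", "door"}}

def meronym_score(paragraph, whole):
    """Count tokens in the paragraph that name a part of the given whole."""
    parts = PARTS_OF.get(whole, set())
    tokens = re.findall(r"\w+", paragraph.lower())
    return sum(1 for token in tokens if token in parts)

descriptive = "The front door is approached obliquely in the shaded recess of the terrace."
biographical = "Samuel Parker Williams, an occasional Greene collaborator, worked on the site."
hot = meronym_score(descriptive, "house")
cold = meronym_score(biographical, "house")
```

The descriptive sentence scores 2 and the biographical one scores 0, which is the kind of contrast the hypothesis relies on.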

  40. Progress – Project Name Matching
      • Finding project names in Greene & Greene
      • Challenge: finding variations
        • CLiMB Art key (CAKE): Robert Roe Blacker House
        • RRB House
        • The house
        • 1214 Fairlawn Terrace
      • Possible techniques to improve matching
        • Developing a semi-automatic technique
        • Use existing information to label text
        • An iterative platform for manual intervention

  41. Variants of The Culbertson House
      • Cordelia A. Culbertson house (Pasadena, Calif.)
      • Francis F. Prentiss house (Pasadena, Calif.)
      • Culbertson sisters house (Pasadena, Calif.)
      • Prentiss, Francis F.
      • Culbertson, Cordelia A.
      • Allen, Elizabeth S.
      • Allen, Mrs. Dudley P.
      • The house was purchased by Allen, who remarried and became Prentiss!

  42. Zaoshen (Chinese deity)
      • USE FOR: Dingfuzhenjun (Chinese deity)
      • USE FOR: Kitchen God (Chinese deity)
      • USE FOR: Simingzaojun (Chinese deity)
      • USE FOR: Simingzaoshen (Chinese deity)
      • USE FOR: Ssu-ming-tsao-chün (Chinese deity)
      • USE FOR: Ssu-ming-tsao-shen (Chinese deity)
      • USE FOR: Ting-fu-chen-chün (Chinese deity)
      • USE FOR: Tsao-chün (Chinese deity)
      • USE FOR: Tsao-shen (Chinese deity)
      • USE FOR: Tsao-wang (Chinese deity)
      • USE FOR: Tsao-wang-yeh (Chinese deity)
      • USE FOR: Zaojun (Chinese deity)
      • USE FOR: Zaowang (Chinese deity)
      • REFERENCE: Encyc. Britannica (Tsao Shen, pinyin Zao Shen, in Chinese mythology, the god of the kitchen (god of the hearth), who is believed to report to the celestial gods on family conduct and have it within his power to bestow poverty or riches on individual families; has also been confused with Ho Shen (god of fire) and Tsao Chün (Furnace Prince))
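An authority record like this one can be used as a lookup table that maps every variant name to the preferred heading. The sketch below uses a handful of the USE FOR entries from the slide. The function name and the case-insensitive lookup are assumptions for illustration.

```python
PREFERRED = "Zaoshen"
# A subset of the USE FOR variants listed in the record above.
USE_FOR = ["Kitchen God", "Zaojun", "Zaowang", "Tsao-shen", "Tsao-wang"]

# Map each variant (and the preferred form itself) to the preferred heading.
VARIANT_TO_PREFERRED = {name.lower(): PREFERRED for name in USE_FOR + [PREFERRED]}

def normalize_name(name):
    """Return the preferred heading for a known variant, else the name unchanged."""
    return VARIANT_TO_PREFERRED.get(name.lower(), name)

resolved = normalize_name("kitchen god")
```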

  43. Some Data to Illustrate
      • Unaltered Project Names
        • 0 matches (both case sensitive and insensitive)
      • Case Insensitive Project Name matching
        • 4 matches
        • {Theodore Irwin house} occurs 1 time
        • {California Institute of Technology} occurs 1 time
        • {William R. Thorsen house} occurs 1 time
        • {William T. Bolton house} occurs 1 time
      • At least double in the chapter
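The gap between catalog headings and the way names appear in running text suggests generating variants before matching. The sketch below strips the parenthetical place qualifier and adds a shortened form. Treating the last two words as surname plus building type is a hypothetical heuristic, not the project's method.

```python
import re

def name_variants(catalog_heading):
    """Generate simple text variants of a catalog heading."""
    # Drop a trailing qualifier such as "(Pasadena, Calif.)".
    base = re.sub(r"\s*\([^)]*\)\s*$", "", catalog_heading)
    words = base.split()
    variants = [base]
    if len(words) >= 3:
        variants.append(" ".join(words[-2:]))  # assumed: surname + "house"
    return variants

def count_mentions(text, catalog_heading):
    """Case-insensitive count of each variant in the text."""
    return {variant: len(re.findall(re.escape(variant), text, flags=re.IGNORECASE))
            for variant in name_variants(catalog_heading)}

heading = "Robert R. Blacker house (Pasadena, Calif.)"
text = "The Blacker House is known for its porte cochère and adjacent terraces."
mentions = count_mentions(text, heading)
```

The full heading finds nothing in this sentence, while the shortened variant "Blacker house" matches once. A bare reference such as "The house" would still need context to resolve.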

  44. Results: Finding CAKEs
      • References to CAKEs are the highlighted phrases:
        • Robert R. Blacker House (Pasadena, Calif.)
          • The Blacker house
          • The house
        • William R. Thorsen House (Berkeley, Calif.)
          • The Thorsen house
          • The house

  46. A Future Solution
      • Bootstrapping algorithm
        • Seed terms hand labelled
        • Terms mapped into multi-dimensional feature space
        • Other terms that are close to the seed terms are added to the set
      • Features:
        • Window size
        • Headedness
        • Modifier similar to that of a seed term
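The bootstrapping idea on this slide can be sketched as an iterative nearest-neighbour expansion from the seed set. Everything concrete below is an assumption made for illustration: the use of cosine similarity, the threshold, and the three-dimensional feature vectors (whose values are invented) are not taken from the presentation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def bootstrap(features, seeds, threshold=0.9, max_rounds=10):
    """Repeatedly add terms whose vectors are close to any accepted term."""
    accepted = set(seeds)
    for _ in range(max_rounds):
        added = {term for term, vec in features.items()
                 if term not in accepted
                 and any(cosine(vec, features[s]) >= threshold for s in accepted)}
        if not added:
            break
        accepted |= added
    return accepted

# Invented feature vectors: (context score, head noun is "house", shared modifier).
features = {
    "Blacker house": (1.0, 1.0, 0.0),
    "Thorsen house": (0.9, 1.0, 0.1),
    "Gamble house": (0.8, 1.0, 0.3),
    "furniture": (0.1, 0.0, 0.0),
}
accepted = bootstrap(features, seeds={"Blacker house"})
```

With these toy vectors the two other house names join the seed, and "furniture" stays out.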

  47. Summary: Research Tools Tested
      • Part of Speech Taggers
      • Noun Phrase Chunkers
      • Merging techniques
      • Proper Noun Finders
      • Proper Name Variant Finder
      • Segmenters

  49. Future: Determine relationships
      • The Blacker House is related to Greene
        • The Greenes built the house.
      • Porte cochère is related to Blacker House
        • because it is directly a part of the house.
      • William Issac Ott is related to
        • Blacker House (on which he worked)
        • Greene (with whom he worked).
      • Detecting these semantic relationships statistically is a challenge for our next steps:
        • Co-occurrence
        • Use of subject headings
        • Meronymy and other relations (WordNet)
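Of the three next steps listed, co-occurrence is the simplest to illustrate: count how often two known terms appear in the same sentence and treat high counts as candidate relationships. The sketch below is a minimal version under that assumption. The function name and the choice of the sentence as the co-occurrence window are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences, terms):
    """Count unordered pairs of terms that appear in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        present = sorted(term for term in terms if term.lower() in lowered)
        counts.update(combinations(present, 2))
    return counts

# Sentences from the "Text about Images" slide.
sentences = [
    "The Blacker House is known for its porte cochère and adjacent terraces.",
    "Samuel Parker Williams, an occasional Greene collaborator, worked on the site.",
]
terms = ["Blacker House", "porte cochère", "Greene", "Samuel Parker Williams"]
pairs = cooccurrence_counts(sentences, terms)
```

Counts like these say only that two terms are related, not how. Telling "part of" from "built by" would need the other techniques listed on the slide.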
