Rada Mihalcea University of North Texas

Building Multilingual and Crosslingual Semantic Resources with Volunteer Contributions over the Web. Rada Mihalcea, University of North Texas.


Presentation Transcript


  1. Building Multilingual and Crosslingual Semantic Resources with Volunteer Contributions over the Web Rada Mihalcea University of North Texas

  2. Facts • Globalization • “Breaking down of political, cultural, and trade barriers” (Thomas Friedman) • Universal communication • Dying languages • One language dying every other week

  3. Some Figures (set 1) • 7,000 languages spoken worldwide • + even more dialects • [http://ethnologue.com] • Resources currently available for 15-20 languages (or less)

  4. Some Figures (set 2) • On average, an Internet user spends 11 h 24′ / month online • United States users: 25 h 25′ [home] + 74 h 26′ [work] / month • [Lyman & Varian 2003] • Internet population • 945,000,000 total Web users • [“The Main Thing”, June 2004] • http://www.rebron.org/mozilla/archives/000085.html • Total: 10,773,000,000 hours spent online every month [some 5 million man-years!]
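
The monthly total on this slide follows from the two figures quoted next to it. A minimal sketch of the arithmetic is given below; the conversion to man-years assumes 2,080 working hours per man-year (52 weeks of 40 hours), which is not stated on the slide.

```python
# Figures quoted on the slide
web_users = 945_000_000            # total Web users
minutes_per_user = 11 * 60 + 24    # 11 h 24' online per user per month

# Total hours spent online per month, across all users
total_hours_month = web_users * minutes_per_user / 60

# Assumption (not on the slide): one man-year = 52 weeks x 40 hours
HOURS_PER_MAN_YEAR = 52 * 40
man_years = total_hours_month / HOURS_PER_MAN_YEAR

print(f"{total_hours_month:,.0f} hours per month")
print(f"{man_years / 1e6:.1f} million man-years")
```

Under that assumption the result is a little over 5 million man-years, in line with the "some 5 million" quoted on the slide.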

  5. Availability vs. Needs • The Web as collective mind • A different view of the Web: • WWW ≠ large set of pages • WWW = a way to ask millions of people • Availability: users spending 10,773,000,000 hours / mo. online [~ 5,000,000 man-years] • Needs: resources required for 7,000 languages

  6. Outline [building resources] • I: Building multilingual WordNets • II: Building a (crosslingual) pictorial WordNet [using resources] • III: Applications

  7. Outline [building resources] • I: Building multilingual WordNets • II: Building a (crosslingual) pictorial WordNet [using resources] • III: Applications

  8. Building WordNets • Other WordNets: • Princeton WordNet • Euro WordNet • BalkaNet • … • Methodology: • Manual • Lexicographers • Time-consuming and expensive

  9. Romanian Semantic Dictionary (RSDNET) • Distributed / Web based • Non-expert users / expert validators • [Diagram labels: automatic, manual, bilingual, monolingual, non-expert, expert, corpus, WordNet, RSDNET]

  10. Resources Used • WordNet semantic network • English-Romanian dictionary • Romanian dictionary • Romanian corpus

  11. Main Phases • non-expert contributions • choose a WordNet synset • pick the correct translations (add other words to the synset) • choose a sentence from the corpus that displays the appropriate meaning • confirm the new synset • get points and rewards! • expert validation • correct errors / remove entries • [Diagram labels: words, choose meaning, confirm, score!]

  12. Result • roSynset: synsetID, definition, example, validated • engSynset: synsetID, definition, example • engWords: ID, word, synsetID, roMatch • roWords: ID, word, synsetID, engMatch
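
The four tables named on this slide can be sketched as a small relational schema. The example below is only an illustration: the table and column names come from the slide, while the column types, the reading of roMatch / engMatch as the ID of the matching word in the other language, and all sample rows are assumptions.

```python
import sqlite3

# Table and column names are taken from the slide. Column types and the
# meaning of roMatch / engMatch (assumed here: ID of the matching word in
# the other language) are assumptions.
SCHEMA = """
CREATE TABLE engSynset (synsetID INTEGER PRIMARY KEY, definition TEXT, example TEXT);
CREATE TABLE roSynset  (synsetID INTEGER PRIMARY KEY, definition TEXT, example TEXT,
                        validated INTEGER DEFAULT 0);
CREATE TABLE engWords  (ID INTEGER PRIMARY KEY, word TEXT, synsetID INTEGER, roMatch INTEGER);
CREATE TABLE roWords   (ID INTEGER PRIMARY KEY, word TEXT, synsetID INTEGER, engMatch INTEGER);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical rows standing in for one non-expert contribution
conn.execute("INSERT INTO engSynset VALUES (1, 'a domesticated canine', 'the dog barked')")
conn.execute("INSERT INTO roSynset VALUES (1, 'animal domestic', 'câinele latră', 0)")
conn.execute("INSERT INTO engWords VALUES (1, 'dog', 1, 1)")
conn.execute("INSERT INTO roWords VALUES (1, 'câine', 1, 1)")

# Expert validation step: mark the Romanian synset as validated
conn.execute("UPDATE roSynset SET validated = 1 WHERE synsetID = 1")

# Look up English/Romanian word pairs together with the validation flag
rows = conn.execute("""
    SELECT e.word, r.word, s.validated
    FROM engWords e
    JOIN roWords r ON r.ID = e.roMatch
    JOIN roSynset s ON s.synsetID = r.synsetID
""").fetchall()
print(rows)
```

Keeping the validated flag on roSynset mirrors the two-stage process described earlier: contributions are stored immediately and flagged once an expert has checked them.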

  13. Quantity • Large number of contributions in short amount of time • 6 months: more than 2,000 synsets from 150 contributors

  14. Quality

  15. Pros / Cons • Pros • faster than manual experts • more accurate than automatic • derived from WordNet => inherits WordNet relations • Limitations • bilingual users (English/Romanian) • capturing difficult concepts

  16. Outline [building resources] • I: Building multilingual WordNets • II: Building a (crosslingual) pictorial WordNet [using resources] • III: Applications

  17. A Picture is Worth 7,000 Words

  18. An Image Dictionary • Add image representations to concepts defined in WordNet • Encode word/image associations • Combine visual and linguistic representations of world concepts

  19. Typical entry in a dictionary (+ pictorial representations) • pipe, tobacco pipe • a tube with a small bowl at one end; used for smoking tobacco • pipe, pipage, piping • a long tube made of metal or plastic that is used to carry water or oil or gas etc. • pipe, tabor pipe • a tubular wind instrument

  20. What for? • Language learning • Children • Second (foreign) language • People with language disorders • International language-independent knowledge base • Pictures are transparent to languages • Applications • Pictorial translations (“Letters to my cousin”) • Bridge the gap between research in image and text processing • Image retrieval/classification, natural language

  21. Word/Image Associations • Difficult • First iteration: • Concrete nouns (flower, dog) • Concrete verbs (write, drink) • Next: • Abstractions (friendship, love) • Object properties (red, large)

  22. Building PicNet • An illustrated semantic dictionary • Web-users perform the mapping • Resources • WordNet • 150,000+ words, grouped in synsets • 250,000+ semantic relations • Image Search Engines • PicSearch http://www.picsearch.com • AltaVista http://www.altavista.com/image • To date 72,000 images automatically collected

  23. Activities in PicNet • Administrator functions • Word/image associations (Web-users) • Free association • Competitive free association (tournament) • Image validation / Scoring • Image donation • Word lookup (search)

  24. Administrator functions • Validate uploaded images • Determine whether to allow the images into the system • Does not verify the mapping • Delete corporate, offensive, or unclear images • Options • Ban User • can delete all activity by a particular user from the database

  25. Word lookup • User contribution • Contributing / validating images • Free association • Tournaments (competitive free association)

  26. Activity 1 Word Lookup (Search) • Synsets with words matching the search phrase are displayed with their best image match. • Having found the desired synset, a user may: • rate the validity of the current synset-image mapping • upload a new image to be attached to this synset.

  27. Activity 2 Image Validation (Scoring) • User is shown a synset-image pair and ranks its appropriateness. • Factors to consider: • fitness for the given synset • quality of the image (size, clarity)

  28. Scoring • Score based on the user response • Not related ( -5 ) • Loosely related ( 1 ) • Some similarity ( 2 ) • Well suited ( 3 ) • Result: • Determine a score for each synset-image pair • Concept/image pairs that are not related are quickly discovered • Typically after a response from one or two users
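
The scoring scheme on this slide can be sketched in a few lines. The point values are those listed above; combining responses by summing them, the cut-off used to flag unrelated pairs, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

# Point values as listed on the slide
RESPONSE_POINTS = {
    "not related": -5,
    "loosely related": 1,
    "some similarity": 2,
    "well suited": 3,
}

def score_pairs(responses):
    """Sum the points of all user responses for each (synset, image) pair.

    Summation is an assumption; the slide only says a score is determined.
    """
    totals = defaultdict(int)
    for synset, image, response in responses:
        totals[(synset, image)] += RESPONSE_POINTS[response]
    return dict(totals)

def unrelated_pairs(totals, threshold=0):
    """Return pairs whose total is below the threshold (threshold is an assumption)."""
    return [pair for pair, total in totals.items() if total < threshold]

# Hypothetical responses: (synset, image, user response)
responses = [
    ("tobacco pipe", "img_01.jpg", "well suited"),
    ("tobacco pipe", "img_01.jpg", "some similarity"),
    ("tobacco pipe", "img_02.jpg", "not related"),
]
totals = score_pairs(responses)
flagged = unrelated_pairs(totals)
print(totals, flagged)
```

Because "not related" carries a large negative weight, a single such response is enough to push a pair below zero, which matches the observation that unrelated pairs are discovered after one or two users.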

  29. Activity 3 Free Word Association • Task: given an image, provide a word to match.

  30. Free Word Association – problems • Difficult to identify images with optimal specificity • E.g. violet vs. flower • Sometimes tedious to find the intended word from the synset list • However, the user can often determine a hypernym (more general concept) • useful information • [Scoring] A free word association is considered to be “well suited” and scores 3

  31. Activity 4 Image Upload • Given a concept, upload a matching image • Search facilitated with shortcuts to three search engines (PicSearch, AltaVista, Google) • Scoring for uploaded images • An image uploaded for a particular synset is considered “well suited” and scored at 5 • Account for the extra effort required from the user • Possible indicator of a stronger correlation.

  32. User Motivation • Points for each activity • Leaderboard • Competitive activity – The PicNet Game • Combine ideas into a competitive game

  33. Activity 5 The PicNet Game • Phase 1: Each player is shown an image and asked to provide a matching synset (as in free word association)

  34. The PicNet Game • Phase 2: Each player votes for the best match (cannot vote on her own entry).

  35. Scoring and Winning • Each synset-image pair scores one point for being entered, and one point for each vote received. • If multiple players enter the same synset-image pair, the score is 2 * number of players entering that synset • Players also receive a “game score”, which counts towards winning the game • A player receives 100 points for winning the round • If multiple players entered the synset-image pair winning the best match, the score is split evenly • A player reaching 300 points wins the tournament
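
One possible reading of these rules is sketched below. The 100-point round prize and the 300-point tournament goal come from the slide; adding vote points on top of the doubled entry score, picking the best match by vote count, breaking ties by entry order, and the player names are all assumptions.

```python
from collections import Counter, defaultdict

ROUND_PRIZE = 100       # game points for winning a round (from the slide)
TOURNAMENT_GOAL = 300   # game points needed to win the tournament (from the slide)

def score_round(entries, votes):
    """Score one round.

    entries: {player: synset the player suggested for the round's image}
    votes:   {voter: synset voted for}
    Returns (pair_scores, game_points).
    """
    for voter, choice in votes.items():
        if entries.get(voter) == choice:
            raise ValueError(f"{voter} voted for their own entry")

    entered_by = defaultdict(list)
    for player, synset in entries.items():
        entered_by[synset].append(player)
    vote_counts = Counter(votes.values())

    pair_scores = {}
    for synset, players in entered_by.items():
        # One point for being entered, or 2 * number of players if several entered it.
        entry_points = 1 if len(players) == 1 else 2 * len(players)
        # Assumption: vote points are added on top in both cases.
        pair_scores[synset] = entry_points + vote_counts[synset]

    # Assumption: the best match is the entry with the most votes;
    # ties go to the synset that was entered first.
    best = max(entered_by, key=lambda s: vote_counts[s])
    winners = entered_by[best]
    game_points = {player: ROUND_PRIZE / len(winners) for player in winners}
    return pair_scores, game_points

# Hypothetical round with four players
entries = {"ana": "violet", "bob": "flower", "cris": "flower", "dan": "rose"}
votes = {"ana": "flower", "dan": "flower", "bob": "violet", "cris": "rose"}
pair_scores, game_points = score_round(entries, votes)
print(pair_scores, game_points)
```

In this round "flower" wins with two votes, so the two players who entered it split the round prize; a tournament loop would repeat such rounds until some player's accumulated game points reach TOURNAMENT_GOAL.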

  36. Quality and Quantity • [1 year] 6,200 concepts from 320 contributors • Competitive free association • Number of users voting for the same synset suggestion in each round • User concurrence: 43% (consistent agreement) • Random sampling 100 images • 85% correct associations

  37. Sample Word/Image Associations exodus, hegira, hejira – a journey by a large group to escape from a hostile environment

  38. Sample Word/Image Associations humerus – bone extending from the shoulder to the elbow

  39. Sample Word/Image Associations Castro, Fidel Castro – Cuban socialist leader who overthrew a dictator in 1959 and established a socialist state in Cuba (born in 1927)

  40. Outline [building resources] • I: Building multilingual WordNets • II: Building a (crosslingual) pictorial WordNet [using resources] • III: Applications

  41. Translation with Pictures • What do you understand by the following ? The house has four bedrooms and one kitchen.

  42. Understanding with Pictures: Pros • Universal • Requires minimal learning • Intuitive • Cheap (free contribution by users of PicNet) • Proven success (iconic languages for augmentative communication)

  43. Understanding with Pictures: Cons • Complex information cannot be conveyed through pictures • e.g. “An inhaled form of insulin won federal approval yesterday” • A large number of concepts with a level of abstraction that prohibits a visual representation • e.g. politics, paradigm, regenerate • Culture differences • e.g. some Latin American tribes do not understand the concept of coffee

  44. A First Cut • Simple sentences • no complex states or events (e.g. emotional states, temporal markers, change) or their attributes (adjectives, adverbs) • no linguistic structure (e.g. complex noun phrases, prepositional attachments, lexical order, certainty) • basic concrete nouns and verbs translated “as is” • Evaluate the amount of understanding achieved through pictures as opposed to words

  45. Does It Work? • Experiments carried out within a translation framework with simple sentences • A communication process • a speaker of an “unknown” language • a listener of a “known” language • Chinese (unknown) to English (known) • Three translation scenarios • fully pictorial representations (PicNet) • mixed pictorial/linguistic representations • fully linguistic representations

  46. Sample Pictorial and Linguistic Translations

  47. Evaluation Study • Interpretations • Users asked to provide an interpretation based on their first intuition • Users’ background: Hispanics, Caucasians, Latin Americans • Data set: 50 short sentences (10-15 words) • 30 sentences from language learning courses • 20 sentences from various domains (sports, politics, …) • Various levels of difficulty • 15 (average) interpretations for each sentence • One interpretation for each translation scenario • Total of 15 × 3 × 50 = 2,250 interpretations

  48. Sample Interpretations

  49. Evaluation Results • Manual and automatic evaluations: • Adequacy • NIST [Bleu] • GTM

  50. Evaluation Results • Significant amount of information can be conveyed through pictures • 76%, compared to the baseline of 0% • Due to the intuitive visual descriptions that can be assigned to some of the concepts in the text • Due to humans’ ability to contextualize • Read a book is a more common interpretation than read about a book • “He sees the riverbank illuminated by a torch”
