
Building a Dictionary from WWW

Building a Dictionary from WWW. Virach Sornlertlamvanich, Information Research and Development Division, National Electronics and Computer Technology Center, Thailand (virach@nectec.or.th). Languages and Cultures of the East and West (LACEW), July 25-27, 2001, Tsukuba University.


Presentation Transcript


  1. Building a Dictionary from WWW
     Virach Sornlertlamvanich, Information Research and Development Division, National Electronics and Computer Technology Center, Thailand (virach@nectec.or.th)
     Languages and Cultures of the East and West (LACEW), July 25-27, 2001, Tsukuba University

  2. Motivations
     • The WWW is the one huge, up-to-date online resource with thorough coverage of languages and areas
     • Lexicon and terminology databases are needed in
       • Electronic dictionaries
       • Machine translation
       • Text summarization, etc.
     • Lack of open, sharable resources
       • No standard formats
       • Legal issues
     Proposal: Collaborative Open Lexicon Development

  3. Concepts
     • Open Format: XML-based
     • Open Protocol: XML-based request/response; DICT (RFC 2229)
     • Open Participation: data entry; approver
     • Open Source: software tools; dictionary content
     • Corpus-based: to reflect the uses and meanings of terms in real life; to assist the human thinking process

  4. Format: Standards Survey
     • Models
       • ISO 12620:1999: Terminology data categories
       • ISO 12200:1999: Machine-Readable Terminology Interchange Format (MARTIF)
     • Variations
       • OTELO: OLIF (Open Lexicon Interchange Format), a format for MT dictionary interchange
       • OSCAR (LISA): TMX (Translation Memory eXchange format)
       • SALT (Standards-based Access to multilingual Lexicons and Terminologies): XLT (XML representation of Lexicons and Terminologies)

  5. Development Procedure
     • Document collection with robot (with language identification)
     • Term candidate extraction (C4.5 on MI, Entropy, etc.)
     • Context-based concept classification (text classification)
     • Syntactic structure extraction (POS tagger)
     • Semantic correlation discovery (Ontology)
     Diagram: WWW -> Robot -> Text Corpus (Sample Texts) -> Term Candidate Extraction -> Terms -> Concept Classification -> Term-Concepts -> Syntactic Structure Analysis -> Annotated Concepts -> Concept Correlation Discovery -> Ontology

  6. A Thai Running Text
     สวัสดีครับ ผมชื่อวิรัช ศรเลิศล้ำวาณิช ปัจจุบันเป็นผู้อำนวยการฝ่ายวิจัยและพัฒนาสาขาสารสนเทศ ศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์แห่งชาติ ผมเริ่มสนใจงานวิจัยในสาขาการประมวลผลภาษาธรรมชาติตั้งแต่ที่ได้มีโอกาสเข้าร่วมโครงการวิจัยและพัฒนาระบบแปลภาษาในปี 1989
     (Hello. My name is Virach Sornlertlamvanich. I am currently the director of the Information Research and Development Division, National Electronics and Computer Technology Center. I became interested in natural language processing research when I had the opportunity to join a machine translation research and development project in 1989.)
     Word/Sentence Segmentation
     สวัสดีครับ ผมชื่อวิรัชศรเลิศล้ำวาณิชปัจจุบันเป็นผู้อำนวยการฝ่ายวิจัยและพัฒนาสาขาสารสนเทศศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์แห่งชาติ ผมเริ่มสนใจงานวิจัยในสาขาการประมวลผลภาษาธรรมชาติตั้งแต่ที่ได้มีโอกาสเข้าร่วมโครงการวิจัยและพัฒนาระบบแปลภาษาในปี1989 . . . .

  7. Writing System
     • 46 consonants, 18 vowels, 4 tones, 9 symbols, and 10 digits, written on 4 levels
     • No punctuation
     • No word/sentence marker
     • No upper/lower case letters
     • No inflection
     Diagram labels: vowel, tone, consonant, vowel; baseline
     Consequence: it is hard to identify a (single/compound) word, phrase, or sentence

  8. Sentence Extraction
     • Training: POS-tagged corpus -> Winnow (feature-based ML) -> trained network
     • Extraction: input paragraph -> word segmentation and POS tagging -> word sequence with tagged POS -> Winnow -> paragraph with sentence breaks

  9. Accuracy in Word/Sentence Segmentation (supervised approaches)
     • Word Segmentation
       • Longest matching (92%)
       • Maximal matching (93%)
       • POS tri-gram (96%)
       • Machine learning (97%)
     • Sentence Segmentation
       • POS tri-gram (85%)
       • Machine learning (89%)
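As an illustration of the simplest baseline listed above, longest matching can be sketched as a greedy dictionary lookup. This is a minimal sketch, not the system evaluated on the slide: the function name `longest_matching`, the toy lexicon, and the fallback of emitting an unknown character as its own token are all assumptions made for the example.

```python
def longest_matching(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation.

    At each position, take the longest lexicon entry that starts there.
    Characters that start no lexicon entry are emitted as single-character
    tokens (an assumption of this sketch).
    """
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Toy lexicon (hypothetical). The greedy choice "theme" blocks the
# segmentation "them" + "end", which every entry of the lexicon would cover.
toy_lexicon = {"the", "them", "theme", "me", "men", "end"}
print(longest_matching("themend", toy_lexicon))  # ['theme', 'n', 'd']
```

The example shows why a purely greedy strategy loses accuracy on ambiguous strings: an early long match can leave a remainder that no dictionary entry covers.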

  10. Term Candidate Extraction
      • Virach Sornlertlamvanich et al. (COLING 2000): Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm
      • A C4.5-trained decision tree determines potential word boundaries from MI, entropy, and linguistic information
      • Capable of discovering new words in a document without assistance from a static dictionary

  11. Mutual Information
      $$Lm(xyz) = \frac{p(xyz)}{p(x)\,p(yz)}, \qquad Rm(xyz) = \frac{p(xyz)}{p(xy)\,p(z)}$$
      where
      • x is the leftmost character of string xyz
      • y is the middle substring of xyz
      • z is the rightmost character of string xyz
      • p( ) is the probability function
      High mutual information implies that xyz co-occurs more than expected by chance. If xyz is a word, both its left mutual information Lm and its right mutual information Rm must be high.
      Examples: …Efunction… and …Function…
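The left and right mutual information of a candidate string can be estimated from raw character n-gram counts. The sketch below assumes the definitions Lm(xyz) = p(xyz) / (p(x) p(yz)) and Rm(xyz) = p(xyz) / (p(xy) p(z)), and it assumes p( ) is estimated as the relative frequency of a string among all substrings of the same length. The helper names and the toy corpus are hypothetical.

```python
from collections import Counter


def substring_counts(corpus, max_n):
    """Count every substring of length 1..max_n in the corpus."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    return counts


def prob(s, counts, corpus_len):
    """Relative frequency of s among all substrings of the same length."""
    return counts[s] / (corpus_len - len(s) + 1)


def left_right_mi(s, counts, corpus_len):
    """Return (Lm, Rm) for a string s of length >= 2 that occurs in the corpus."""
    if len(s) < 2 or counts[s] == 0:
        raise ValueError("s must have length >= 2 and occur in the corpus")
    x, yz = s[0], s[1:]    # leftmost character and the rest
    xy, z = s[:-1], s[-1]  # all but the last character, and the rightmost character
    p_s = prob(s, counts, corpus_len)
    lm = p_s / (prob(x, counts, corpus_len) * prob(yz, counts, corpus_len))
    rm = p_s / (prob(xy, counts, corpus_len) * prob(z, counts, corpus_len))
    return lm, rm


corpus = "abcabcabd"  # toy corpus, hypothetical
counts = substring_counts(corpus, max_n=3)
print(left_right_mi("abc", counts, len(corpus)))
```

Values well above 1 indicate that the parts occur together more often than independence would predict.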

  12. Entropy
      $$Le(y) = -\sum_{x \in A} p(xy \mid y)\,\log_2 p(xy \mid y), \qquad Re(y) = -\sum_{z \in A} p(yz \mid y)\,\log_2 p(yz \mid y)$$
      where
      • A is the set of characters
      • x is the leftmost character of string xyz
      • y is the middle substring of xyz
      • z is the rightmost character of string xyz
      • p( ) is the probability function
      Entropy shows the variety of characters before and after a word. If y is a word, its left and right entropy must be high.
      Examples: …?function… and …?unction…
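Left and right entropy can be estimated by collecting the characters that immediately precede and follow each occurrence of a candidate string y. The sketch below assumes the definition Le(y) = -Σ p(xy|y) log2 p(xy|y) over preceding characters x (and symmetrically for following characters z), with the conditional probabilities estimated from observed neighbor counts. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(counter):
    """Entropy in bits of the distribution given by a Counter of observations."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def left_right_entropy(corpus, y):
    """Return (left entropy, right entropy) of candidate string y in corpus."""
    left, right = Counter(), Counter()
    start = corpus.find(y)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1   # character just before y
        end = start + len(y)
        if end < len(corpus):
            right[corpus[end]] += 1        # character just after y
        start = corpus.find(y, start + 1)  # allow overlapping occurrences
    return shannon_entropy(left), shannon_entropy(right)


# Toy corpus (hypothetical): "ab" is preceded by x, y, z and followed by y, z.
le, re_ = left_right_entropy("xabyabzab", "ab")
print(le, re_)
```

A string that is a genuine word tends to appear in many different contexts, so both values are high; a fragment such as "unction" is almost always preceded by the same character, so its left entropy is low.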

  13. Other Features
      • Frequency: words tend to be used more often than non-word string sequences.
      • Length: short strings are likely to occur by chance, so long and short strings should be treated differently.
      • Functional words: functional words are used mostly in phrases. They are useful for distinguishing words from phrases.
      Result of subjective test: word precision 85%, word recall 56%

  14. Evaluation Result of Word Extraction
      RID: Royal Institute Dictionary (a Thai-Thai dictionary of 30,000 words)

  15. Concept Classification
      เกาะ (sense 1: to attach)
      • … มัน เกาะ ตัวเองกับกิ่งไม้ … (It clings to a tree branch.)
      • … ผู้โดยสารไม่จำเป็นต้องยืน เกาะ ห่วงอีกต่อไปแล้ว … (Passengers no longer have to stand holding the straps.)
      เกาะ (sense 2: an island)
      • … บ้านผมอยู่ที่ เกาะ สมุย … (My home is on Samui island.)
      • … ญี่ปุ่นประกอบด้วยเกาะ ใหญ่ 4 เกาะ … (Japan consists of four big islands.)
      Approach
      • Words and their contexts in the corpora
      • Manual word-sense disambiguation
      • Unsupervised word sense disambiguation (Yarowsky 1995)

  16. Concept Classification

  17. Syntactic Structure Analysis
      • Sentence/word segmentation by POS trigram tagger
      • POS assignment
      • Word co-occurrence
      • Parser
      • Patterns of usage

  18. Ontologies
      • EDR
        • Approach: word descriptions as employed in dictionaries
        • Problem: ambiguities and incomputability
      • WordNet
        • Approach: synonym sets and simple semantic relations to other words
        • Problem: ambiguities
      • UW
        • Approach: headwords and semantic restrictions
        • Advantage: computability and no ambiguity

  19. Ontologies: representation of the concept "tired" in different schemes
      • EDR
        - having or displaying a need for rest
        - having lost of interest
        - lack of imagination
      • WordNet 1.5
        - A1: tired (vs. rested)
        - A2: bromidic, commonplace, hackneyed, …
        - V1: tire, pall, grow weary, fatigue
        - V2: tire, wear upon, fag out
        - V3: run down, exhaust, sap, …
        - V4: bore, tire, ...
      • UW
        - tired
        - tired(icl>physical)
        - tired(icl>mental)

  20. Universal Word (UW)
      • UW format: <headword> ( <list of restrictions> ), e.g. book (icl > do, obj > room)
      • Headword: an English word that roughly describes the UW sense.
      • Restrictions:
        • Inclusion (icl) indicates the class of the sense, e.g. car ( icl > movable thing)

  21. Universal Word (UW)
      • Restrictions (continued)
        • UNL semantic relations, e.g. eat ( agt > volitional thing, obj > food )
          The agent of this UW is restricted to be a volitional thing.
          The object of this UW is restricted to be food.
      Diagram: UW Class Hierarchy
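The UW notation `<headword> ( <list of restrictions> )` is regular enough to parse with a few lines of code. The following is a minimal sketch under stated assumptions: restrictions are comma-separated `relation > value` pairs, and nested restrictions inside a value are not interpreted. The names `UW` and `parse_uw` are hypothetical and do not come from any UNL toolkit.

```python
import re
from typing import NamedTuple


class UW(NamedTuple):
    headword: str
    restrictions: tuple  # tuple of (relation, value) pairs


# Headword, optionally followed by one parenthesized restriction list.
_UW_RE = re.compile(r"^\s*([^()]+?)\s*(?:\(\s*(.*?)\s*\))?\s*$")


def parse_uw(text):
    """Parse a string such as 'book (icl > do, obj > room)' into a UW."""
    m = _UW_RE.match(text)
    if m is None:
        raise ValueError(f"not a UW: {text!r}")
    headword, body = m.group(1), m.group(2)
    restrictions = []
    if body:
        for part in body.split(","):
            relation, sep, value = part.partition(">")
            if not sep:
                raise ValueError(f"restriction without '>': {part!r}")
            restrictions.append((relation.strip(), value.strip()))
    return UW(headword, tuple(restrictions))


print(parse_uw("book (icl > do, obj > room)"))
print(parse_uw("eat ( agt > volitional thing, obj > food )"))
print(parse_uw("tired"))
```

Representing restrictions as ordered pairs keeps the slide's examples round-trippable; a dictionary keyed by relation would also work if each relation appears at most once.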

  22. Architecture
      • Centralized activities
        • For data integrity and consistency
      • Distributed sites
        • For open participation
        • For backing up
      • Job-based
        • Jobs are generated by corpus analysis tools
        • Participants download jobs to work off-line and submit them back when done

  23. Job-based Working
      Diagram components:
      • Job Billboard
      • Per job type (Job A, Job B): Generator, Pool, Submittal, Acceptor, Approval Pool, Approved Pool, Approval Acceptor
      • Actors: Participant, Approver
      • Central DB
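One way to read the job-based workflow is as a small state machine in which a job moves from a pool, to a participant, to an approval pool, and finally into the central database. The sketch below is an interpretation, not the system's actual design: the state names, the action names, the ordering of the stages, and the `reject` transition are all assumptions inferred from the diagram labels.

```python
from enum import Enum


class JobState(Enum):
    POOLED = "in job pool"
    CHECKED_OUT = "downloaded by participant"
    AWAITING_APPROVAL = "in approval pool"
    APPROVED = "in approved pool"
    COMMITTED = "in central DB"


# (current state, action) -> next state. All names here are hypothetical.
TRANSITIONS = {
    (JobState.POOLED, "download"): JobState.CHECKED_OUT,
    (JobState.CHECKED_OUT, "submit"): JobState.AWAITING_APPROVAL,
    (JobState.AWAITING_APPROVAL, "approve"): JobState.APPROVED,
    (JobState.AWAITING_APPROVAL, "reject"): JobState.POOLED,  # assumed, not on the slide
    (JobState.APPROVED, "commit"): JobState.COMMITTED,
}


def advance(state, action):
    """Return the next state, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action!r} a job that is {state.value}") from None


state = JobState.POOLED
for action in ("download", "submit", "approve", "commit"):
    state = advance(state, action)
print(state)  # JobState.COMMITTED
```

Keeping the transitions in a table makes it easy to add a second job type or an extra review step without touching the control flow.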

  24. Network Connection
      • Committee nodes
        • Replicate the same database
        • Closely synchronized
        • Provide service to neighboring participants
      • Agents (optional)
        • Propagate communications between committee and participants

  25. Contents to Develop
      • Thai word list
      • Corpus-based Thai lexicon
      • Co-occurrence dictionary
      • Thai ontology

  26. SNLP+O-COCOSDA
      The Fifth Symposium on Natural Language Processing + Oriental COCOSDA Workshop 2002
      9-11 May 2002, Hua Hin, Prachuapkirikhan, Thailand
      http://kind.siit.tu.ac.th/snlp-o-cocosda2002/ or http://www.links.nectec.or.th/itech/snlp-o-cocosda2002/
