260 likes | 380 Views
Building a Dictionary from WWW. Virach Sornlertlamvanich Information Research and Development Division National Electronics and Computer Technology Center Thailand virach@nectec.or.th Languages and Cultures of the East and West (LACEW) July 25-27, 2001, Tsukuba University. Motivations.
E N D
Building a Dictionary from WWW Virach Sornlertlamvanich Information Research and Development Division National Electronics and Computer Technology Center Thailand virach@nectec.or.th Languages and Cultures of the East and West (LACEW) July 25-27, 2001, Tsukuba University
Motivations • WWW is an only one huge, up-to-date, language-and-area thoroughness online resource • Lexicon & terminology database needed in • Electronic Dictionary • Machine Translation • Text Summarization, etc. • Lack of open sharable resource • No standard formats • Legal issues Collaborative Open Lexicon Development National Electronics and Computer Technology Center
Open Format XML-based Open Protocol XML-based request/response DICT (RFC 2229) Open Participation Data entry Approver Open Source Software Tools Dictionary Content Corpus-based To reflect the uses and meanings of terms in real life To assist human thinking process Concepts National Electronics and Computer Technology Center
Format : Standards Survey • Models • ISO 12620:1999 : Terminology data categories • ISO 12200:1999 : Machine-Readable Terminology Interchange Format (MARTIF) • Variations • OTELO : OLIF (Open Lexicon Interchange Format) A format for MT dictionary interchange • OSCAR (LISA) : TMX (Translation Memory eXchange format) • SALT : Standards-based Access to multilingual Lexicons and Terminologies XLT (XML representation of Lexicons and Terminologies) National Electronics and Computer Technology Center
Development Procedure WWW Text Corpus • Document collection with robot (w/ language identification) Robot Sample Texts • Term candidate extraction (C4.5 on MI, Entropy, etc.) Terms Term Candidate Extraction • Context-based concept classification (text classification) Concept Classification Term-Concepts Syntactic Structure Analysis Annotated Concepts • Syntactic structure extraction (POS tagger) Concept Correlation Discovery Ontology • Semantic correlation discovery (Ontology) National Electronics and Computer Technology Center
A Thai Running Text สวัสดีครับ ผมชื่อวิรัช ศรเลิศล้ำวาณิช ปัจจุบันเป็นผู้อำนวยการฝ่ายวิจัยและพัฒนาสาขาสารสนเทศ ศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์แห่งชาติ ผมเริ่มสนใจงานวิจัยในสาขาการประมวลผลภาษาธรรมชาติตั้งแต่ที่ได้มีโอกาสเข้าร่วมโครงการวิจัยและพัฒนาระบบแปลภาษาในปี 1989 Word/Sentence Segmentation สวัสดีครับ ผมชื่อวิรัชศรเลิศล้ำวาณิชปัจจุบันเป็นผู้อำนวยการฝ่ายวิจัยและพัฒนาสาขาสารสนเทศศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์แห่งชาติ ผมเริ่มสนใจงานวิจัยในสาขาการประมวลผลภาษาธรรมชาติตั้งแต่ที่ได้มีโอกาสเข้าร่วมโครงการวิจัยและพัฒนาระบบแปลภาษาในปี1989 . . . . National Electronics and Computer Technology Center
vowel tone consonant vowel Writing System • 46 consonants; 18 vowels;4 tones; 9 symbols; 10 digitswritten 4 levels • No punctuation • No word/sentence marker • No upper/lower case letter • No inflection baseline Hard to identify (single/compound) word/phrase/sentence National Electronics and Computer Technology Center
Sentence Extraction Training POS tagged corpus Input paragraph Word segmentation andPOS tagging Winnow(Feature-based ML) Word sequence with tagged POS Winnow Trained network Paragraph with sentence break National Electronics and Computer Technology Center
Accuracy in Word/Sentence Segmentation • Word Segmentation • Longest matching (92%) • Maximal matching (93%) • POS tri-gram (96%) • Machine learning (97%) • Sentence Segmentation • POS tri-gram (85%) • Machine learning (89%) Supervised approaches National Electronics and Computer Technology Center
Term Candidate Extraction • Virach Sornlertlamvanich et. al. (COLING 2000) : • Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm • C4.5-trained decision tree for determining potential word boundary from MI, Entropy, Linguistic information • Capable of discovering new words in document without assistance from static dictionary National Electronics and Computer Technology Center
Mutual Information y z x y x z where x is the leftmost character of string xyz y is the middle substring of xyz z is the rightmost character of string xyz p( ) is the probability function. High mutual information implies that xyz co-occurs more than expected by chance. If xyz is a word, its Lm and Rm must be high.…Efunction… and ...Function... National Electronics and Computer Technology Center
Entropy y x y z where A is the set of characters x is the leftmost character of string xyz y is the middle substring of xyz z is the rightmost character of string xyz p( ) is the probability function. Entropy shows the variety of characters before and after a word. If y is a word, its left and right entropy must be high....?function... , ...?unction... National Electronics and Computer Technology Center
Other Features • Frequency Words tend to be used more often than non-word string sequences. • Length Short strings are likely to happen by chance. The long and short strings should be treated differently. • Functional Words Functional words are used mostly in phrases. They are useful to disambiguate words and phrases. Result of subjective test : Word precision 85% Word recall 56% National Electronics and Computer Technology Center
Evaluation Result of Word Extraction RID : Royal Institute Dictionary (30,000 words of Thai-Thai dictionary) National Electronics and Computer Technology Center
เกาะ (sense1: to attach) … มัน เกาะ ตัวเองกับกิ่งไม้ …(It clings itself on a tree ) … ผู้โดยสารไม่จำเป็นต้องยืน เกาะ ห่วงอีกต่อไปแล้ว …(Passengers don't have to hold peddlesanymore.) เกาะ (sense2: an island) …บ้านผมอยู่ที่ เกาะ สมุย…(I live at the Samui island.) …ญี่ปุ่นประกอบด้วยเกาะ ใหญ่ 4 เกาะ… (There are four big islands in Japan.) Concept Classification • Word and their contexts in the corpora • Manual word-sense disambiguation • Unsupervised word sense disambiguation (Yarowsky 1995) National Electronics and Computer Technology Center
Concept Classification National Electronics and Computer Technology Center
Syntactic Structure Analysis • Sentence/word segmentation by POS trigram tagger • POS assignment • Word co-occurrence • Parser • Pattern of usages National Electronics and Computer Technology Center
Ontologies • EDR • Approach: Word description as employed in dictionaries • Problem: Ambiguities and incomputability • Wordnet • Approach: Synonym set and simple semantic relations to other words • Problem: Ambiguities • UW • Approach: Headwords and semantic restrictions • Advantage: Computability and no ambiguity National Electronics and Computer Technology Center
Ontologies Representation of concept “tired”in different schemes EDR Wordnet 1.5 UW - having or displaying a need for rest- having lost of interest- lack of imagination -A1: tired (vs. rested)-A2: bromidic, commonplace, hackneyed, … -V1: tire, pall, grow weary, fatigue-V2: tire, wear upon, fag out-V3: run down, exhaust, sap, …-V4: bore, tire, ... - tired- tired(icl>physical)- tired(icl>mental) National Electronics and Computer Technology Center
Universal Word (UW) • UW format :<headword> ( <list of restrictions> ) e.g. book (icl > do, obj > room) • Headword : An English word roughly describes the UW sense. • Restrictions : • Inclusion (icl ) indicates the class of the sensee.g. car ( icl > movable thing) National Electronics and Computer Technology Center
Universal Word (UW) • Restrictions (continued) • UNL semantic relationse.g. eat ( agt > volitional thing, obj > food )The agent of this UW is restricted to be volitional thing.The object of this UW is restricted to befood. UW Class Hierarchy National Electronics and Computer Technology Center
Architecture • Centralized activities • For data integrity & consistency • Distributed sites • For open participation • For backing up • Job-based • Jobs generated by corpus analysis tools • Participants download jobs to work off-line and submit back when done National Electronics and Computer Technology Center
Job-based Working Job Billboard Job A Generator Job B Generator Job A Pool Job B Pool Participant Job A Submittal Job B Submittal Participant Job A Acceptor Job B Acceptor Job A Approval Pool Job B Approval Pool Approver Approver Job A Approved Pool Job B Approved Pool Job A Approval Acceptor Job B Approval Acceptor Central DB National Electronics and Computer Technology Center
Network Connection • Committee nodes • Replicate same database • Closely synchronized • Provide service to neighbor participants • Agents (optional) • Propagate communications between committee and participants National Electronics and Computer Technology Center
Contents to Develop • Thai word list • Corpus-based Thai lexicon • Co-occurrence dictionary • Thai ontology National Electronics and Computer Technology Center
SNLP+O-COCOSDA The Fifth Symposium on Natural Language Processing + Oriental COCOSDA Workshop 20029-11 May 2002Hua Hin, Prachuapkirikhan, Thailandhttp://kind.siit.tu.ac.th/snlp-o-cocosda2002/ orhttp://www.links.nectec.or.th/itech/snlp-o-cocosda2002/ National Electronics and Computer Technology Center