Notes about my studies of Information Engineering and Natural Language Processing
by Changhua Yang, 04/09, 2003.
Outline
• Knowledge Management
• Information from images
• Classification Problem
• SVM
• A Chinese product
• Chinese Character Genes (漢字基因)
• SVM Tool Demo
Knowledge Epoch
• Data?
• Information?
• Knowledge?
• Data Processing -> Information Engineering -> Knowledge Management
• Data: a compressed JPEG file, a bitmap {(x, y, color)}
• Metadata: data describing data
• Information:
  • A dog (a Spitz, 狐狸狗) on grassland
• Knowledge:
  • Daytime photograph
  • An easy case for outlining the objects
Problem
• From Data to Information
• A search (matching) problem: relevance
• A classification problem
• A decision problem
[Figure: plot with axes Feature X and Feature Y]
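Problem Conversion
• Asking "is this a dog?"
  • Form a temporary classifier {dog, !dog} from Knowledge
• Asking "is there a dog in this image?"
  • Phase 1: identify all objects
  • Phase 2: for each object, determine {dog, !dog}
• Asking "is there a Spitz (狐狸狗) in this image?"
  • Option 1: a classifier for {Spitz, !Spitz} built from the training sets of all objects
  • Option 2: a classifier for {Spitz, !Spitz} built from the training sets of all dogs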
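The two phases and the two options can be sketched as a cascade of binary classifiers. This is only an illustration: the object records, `is_dog`, and `is_spitz` are hypothetical stand-ins for the output of Phase 1 and for trained classifiers, none of which are specified on the slide.

```python
from typing import Callable, Dict, List

# Hypothetical record for an object found in Phase 1 (not from the slides).
Obj = Dict[str, str]

def find_matches(objects: List[Obj],
                 cascade: List[Callable[[Obj], bool]]) -> List[Obj]:
    """Phase 2: keep the objects accepted by every classifier in the cascade."""
    return [o for o in objects if all(clf(o) for clf in cascade)]

# Toy stand-ins for trained binary classifiers {dog, !dog} and {Spitz, !Spitz}.
def is_dog(o: Obj) -> bool:
    return o["kind"] == "dog"

def is_spitz(o: Obj) -> bool:
    return o["breed"] == "spitz"

# Pretend Phase 1 already identified these objects in the photograph.
objects = [
    {"kind": "dog", "breed": "spitz"},
    {"kind": "dog", "breed": "beagle"},
    {"kind": "grass", "breed": ""},
]

# Option 1: one classifier applied to all objects.
option1 = find_matches(objects, [is_spitz])
# Option 2: first restrict to dogs, then separate Spitz from other dogs.
option2 = find_matches(objects, [is_dog, is_spitz])
```

With real classifiers the two options need not agree: Option 2 only has to separate a Spitz from other dogs, which may be an easier decision boundary.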
Shallow Semantic Parsing using Support Vector Machines
Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James H. Martin, & Dan Jurafsky
HLT-NAACL 2004
Using PropBank
• PropBank (Kingsbury et al., 2002)
  • A 300k-word corpus
  • The Wall Street Journal (WSJ) part of the Penn TreeBank (Marcus et al., 1994), with hand-corrected parses
  • Predicate-argument relations are marked for part of the verbs
• The arguments of a verb are labeled ARG0 to ARG5
  • ARG0 is the PROTO-AGENT (usually the subject)
  • ARG1 is the PROTO-PATIENT (usually the direct object)
  • PropBank attempts to treat semantically related verbs consistently
• In addition to these CORE ARGUMENTS, ADJUNCTIVE ARGUMENTS, referred to as ARGMs, are marked
  • Examples are ARGM-LOC for locatives and ARGM-TMP for temporals
Problem Description: Shallow Semantic Parsing
• Argument Identification
  • Identifying the parsed constituents in the sentence that represent semantic arguments of a given predicate
• Argument Classification
  • Given constituents known to represent arguments of a predicate, assigning the appropriate argument labels to them
• Argument Identification and Classification
  • A combination of the above two tasks
Baseline Features
• Predicate
• Path: NP↑S↓VP↓VBD
• Phrase Type (NP, PP, S)
• Position
• Voice
• Head Word
  • Syntactic head
• Sub-categorization: VP->VBD-PP
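The Path and Sub-categorization features above can be computed from a parse tree roughly as follows. This is a minimal sketch: the `Node` class and both function names are made up for illustration, and the tree is a toy S -> NP VP, VP -> VBD PP fragment chosen to reproduce the two example values on the slide.

```python
class Node:
    """Minimal parse-tree node (hypothetical helper, not from any toolkit)."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain

def path_feature(constituent, predicate):
    """Labels from the constituent up to the lowest common ancestor, then down to the predicate."""
    up = ancestors(constituent)
    down = ancestors(predicate)
    down_ids = {id(n) for n in down}
    # Index of the first node above the constituent that also dominates the predicate.
    i = next(k for k, n in enumerate(up) if id(n) in down_ids)
    j = next(k for k, n in enumerate(down) if n is up[i])
    up_labels = [n.label for n in up[:i + 1]]
    down_labels = [n.label for n in reversed(down[:j])]
    return "↑".join(up_labels) + "".join("↓" + label for label in down_labels)

def subcategorization(predicate):
    """Expansion rule of the predicate's parent node, e.g. VP->VBD-PP."""
    parent = predicate.parent
    return parent.label + "->" + "-".join(c.label for c in parent.children)

# Toy tree: S -> NP VP, VP -> VBD PP
np_node, vbd, pp = Node("NP"), Node("VBD"), Node("PP")
tree = Node("S", [np_node, Node("VP", [vbd, pp])])

path = path_feature(np_node, vbd)
subcat = subcategorization(vbd)
```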
Classifier and Implementation
• SVMs are binary classifiers
  • One vs. All (OVA) formalism
  • Training n binary classifiers for an n-class problem
• Converted multi-class problem
  • 80% of the nodes have NULL labels
  • A binary NULL vs. NON-NULL classifier
  • Remaining data used for training the OVA classifiers
• Tools
  • TinySVM
  • YamCha
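The label conversion behind this setup can be sketched as below. It only shows how the training targets are built, not the SVM training itself, and it makes a simplifying assumption: NULL nodes are dropped by their gold label, whereas the actual system may filter by the confidence of the NULL vs. NON-NULL classifier. All function names are hypothetical.

```python
from typing import Dict, List

NULL = "NULL"

def null_vs_non_null(labels: List[str]) -> List[int]:
    """Stage 1 targets: -1 for NULL nodes, +1 for any argument label."""
    return [-1 if y == NULL else +1 for y in labels]

def one_vs_all(labels: List[str]) -> Dict[str, List[int]]:
    """Stage 2 targets: one +1/-1 target vector per class (n classes -> n binary problems)."""
    classes = sorted(set(labels))
    return {c: [+1 if y == c else -1 for y in labels] for c in classes}

def ova_predict(scores: Dict[str, float]) -> str:
    """Pick the class whose binary classifier gives the highest score."""
    return max(scores, key=scores.get)

# Toy node labels; most nodes are NULL, as on the slide.
node_labels = ["ARG0", "NULL", "NULL", "ARG1", "NULL", "ARGM-TMP", "NULL", "NULL"]

stage1_targets = null_vs_non_null(node_labels)
# Simplification (assumption): keep only nodes whose gold label is not NULL.
remaining = [y for y in node_labels if y != NULL]
stage2_targets = one_vs_all(remaining)
```

Filtering out the NULL majority first keeps each OVA training set much smaller, which matters because SVM training cost grows quickly with the number of examples.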
New Features
• Named entities (NE)
• Headword POS
• Verb clustering
• Partial path
• Verb sense information
• Head of PP
• First and last word/POS (W/P)
• Ordinal position
• Tree distance
• Relative features
• Temporal cue words
• Dynamic class context
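Technology
• Culturecom (文化傳信) of Hong Kong, led by Chu Bong-Foo (朱邦復), inventor of the Cangjie Chinese input method, is working with IBM to develop V-Dragon (飛龍), an embedded processor for Chinese. The hope is that, combined with the Linux operating system, it will cut the price of a personal computer to one third of the current level and break the Wintel architecture of Intel and Microsoft.
• Culturecom's V-Dragon is a Chinese CPU (central processing unit) with 32,000 Chinese characters built in; it runs the Linux operating system Midori Linux.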
UCLA Report Confirms Culturecom Processor for Chinese Character Generation
• The SCS 1610 can generate about 32,000 characters in three fonts and sizes ranging from 11x11 to 127x127 pixels.
• The display quality of the characters is optimized aesthetically for the sizes generated.
• The code and data for the generation algorithm and the character representations occupy no more than 256KB.
• The speed of character generation is good.
This technology is the most effective solution for Chinese and other non-phonetic scripts.
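Built on a Chinese CPU with a fully Chinese environment: all Chinese glyphs are generated by the CPU as vectors, multiple typefaces can be produced and freely scaled up or down, and no Mask-ROM is needed to store the glyphs. English is also fully supported.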
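Chinese Character Genes (漢字基因) (1/2)
• Chinese characters
  • Ninety percent are phono-semantic compounds (形聲字)
  • Beyond the phonetic component, phono-semantic characters also have a "loan" (假借) function
  • That is, the character head (字首) indicates the category, and the character body (字身) can serve as a definition
• The first requirement of a character lookup method is an understanding of character meaning
• The character-root (字根) concept gives rise to a "vector glyph generator"
• The Chinese character concept is found to have six functions: code (字碼), order (字序), form (字形), discrimination (字辨), sound (字音), and meaning (字義)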
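Chinese Character Genes (漢字基因) (2/2)
• Code (字碼): Cangjie, 25 codes
• Order (字序): ordering by the 24 Cangjie character letters
• Form (字形)
  • 9 vector strokes and 64 roots, used to compose characters for the font library
  • Occupies only 160kb of system space and can compose nearly ten million glyph forms; scaling is stepless, and any known typeface variation can be selected. Composition speed: on a P450, 46,000 characters per second generated and displayed at 16*16
• Discrimination (字辨): 73 (9+64) classes of glyph gene features, converted into codes
• Sound (字音): a waveform-tracking method based on the phono-semantic (形聲) principle of the Six Writings (六書)
• Meaning (字義): 512 meaning genes
  • 1/3 from the Song-dynasty Confucians' "substance, function, cause, effect" (體用因果)
  • 1/4 of "common-sense definitions" (常識定義)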
TinySVM
• Supports standard C-SVR and C-SVM
• Uses a sparse vector representation
• Can handle several tens of thousands of training examples and hundreds of thousands of feature dimensions
• Fast optimization algorithms stemming from SVM_light
+1 1:0.5 2:0.5
+1 1:1 2:1
+1 1:2 2:2
+1 1:3 2:2
+1 1:4 2:2
-1 1:2 2:1
-1 1:2 2:1.5
-1 1:3 2:1.5
-1 1:4 2:1.5
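Each training example above is a class label followed by sparse `index:value` pairs. A minimal reader for this layout might look like the sketch below; `parse_example` is a hypothetical helper written for illustration and is not part of TinySVM.

```python
from typing import Dict, List, Tuple

Example = Tuple[int, Dict[int, float]]

def parse_example(line: str) -> Example:
    """Parse one 'label index:value index:value ...' line into (label, sparse vector)."""
    fields = line.split()
    label = int(fields[0])          # int() accepts both "+1" and "-1"
    features: Dict[int, float] = {}
    for field in fields[1:]:
        index, value = field.split(":")
        features[int(index)] = float(value)
    return label, features

TRAINING_DATA = """\
+1 1:0.5 2:0.5
+1 1:1 2:1
+1 1:2 2:2
+1 1:3 2:2
+1 1:4 2:2
-1 1:2 2:1
-1 1:2 2:1.5
-1 1:3 2:1.5
-1 1:4 2:1.5
"""

examples: List[Example] = [
    parse_example(line) for line in TRAINING_DATA.splitlines() if line.strip()
]
positives = [features for label, features in examples if label > 0]
negatives = [features for label, features in examples if label < 0]
```

Only non-zero features need to be listed, which is what keeps the representation compact for high-dimensional text features.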
Steps
• Define the feature space
• Get feature values from the [training|testing] set
• Create a model from the feature values of the training set
  • svm_learn -t 1 -d 2 -c 1 news.train news_model
• Verify the testing set with its feature values
  • svm_classify -V news.test news_model
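The options in the `svm_learn` call appear to follow the SVM_light convention: `-t 1` selects a polynomial kernel, `-d 2` sets its degree, and `-c 1` sets the soft-margin cost. Assuming the kernel has the form (s·a·b + r)^d, it can be evaluated on sparse vectors as sketched below; the values `s = 1` and `r = 1` are assumptions here, so check the tool's help output for the actual defaults.

```python
from typing import Dict

SparseVector = Dict[int, float]

def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product over the feature indices present in both vectors."""
    return sum(value * b[index] for index, value in a.items() if index in b)

def polynomial_kernel(a: SparseVector, b: SparseVector,
                      degree: int = 2, s: float = 1.0, r: float = 1.0) -> float:
    """(s * a.b + r) ** degree; s and r are assumed values, degree mirrors -d 2."""
    return (s * sparse_dot(a, b) + r) ** degree

# Two points in the same sparse index:value layout as the training data.
x1 = {1: 1.0, 2: 1.0}
x2 = {1: 2.0, 2: 1.0}
k = polynomial_kernel(x1, x2)   # (1*2 + 1*1 + 1) ** 2
```

A non-linear kernel is a reasonable choice for data such as the two-feature sample set: the negative point (2, 1.5) is the midpoint of the positive points (1, 1) and (3, 2), so no straight line can separate the two classes.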
My Trial
• 13 features are defined
• Training set: 4 articles
  • 2 are annotated positive (to the advantage of the government)
  • 2 are annotated negative
• 2 test articles