An Attempt to Obtain A Similar Japanese Historical Material Using The Variable Order N-gram
Taizo YAMADA, National Institutes for the Humanities, JAPAN
PNC2012
Contribution
• We prototyped a search system which has
  • a method of collecting similar materials using the Vector Space Model (VSM) for Japanese historical materials, and
  • a method of displaying similar materials on a timeline.
• In order to use VSM, we introduced a word segmentation method based on a nonparametric Bayesian model.
• The word segmentation technique is unsupervised learning:
  • no dictionary;
  • it tries to find hidden structure in unlabeled text.
Outline
• Background, purpose
• Methodology
  • Show search results sorted by similarity using VSM
  • Word segmentation using a nonparametric Bayesian model
• Conclusion, future work
Background
• Recently, the amount of encoded text of Japanese historical materials has grown rapidly, and not only catalogue encoding.
• Japanese historical full-text databases in Japan:
  • SHIPS (Historiographical Institute, The University of Tokyo)
  • Azuma Kagami Database (National Institute of Japanese Literature)
  • Kaneaki-kyo-ki Database (National Museum of Japanese History)
  • …
Example: SHIPS @ Historiographical Institute
• URL: http://wwwap.hi.u-tokyo.ac.jp/ships/shipscontroller
• 27 databases
• Target: premodern Japanese historical materials, from the Nara period to the Edo period
Example: SHIPS @ Historiographical Institute
• The Dai Nihon Shiryo Unified Database
• Full-text Databases:
  • The Full-text Database of the Old Japanese Diaries
  • The Komonjo full-text database
  • The Nara Period Komonjo full-text database
  • The Heian Ibun full-text database
  • The Kamakura Ibun full-text database
The Komonjo full-text database (1)
• Search by keywords.
The Komonjo full-text database (2)
• The result list shows the number of results, the search condition, bibliographic information, a link to the image, the date, and the text (the substring where the query hit).
The Komonjo full-text database (3)
• The system can sort the results by date or by bibliographic information.
• But… sorting by similarity to the query is not supported.
• Furthermore, collecting similar texts is not supported, and the system has no visual analysis method (e.g., a timeline display).
Purpose
• For efficient analysis of a huge amount of text, methods of obtaining and visualizing similar texts are important.
  • Which material is similar to a query?
  • Which material is similar to a material selected by a user?
• Our approach is as follows:
  • Show search results sorted by similarity to a query.
  • Show similar materials according to the selected search result.
  • Display the similar materials on a timeline.
Sorting according to similarity (1)
• We search with the following conditions:
  • Query: “足利直義” (ASHIKAGA, Tadayoshi)
  • Target: the full-text databases in SHIPS
  • Temporal: “南北朝時代”, the Nanboku-chō period, JAPAN (spanning from 1333 to 1392)
    • 7,007 materials (catalogues)
    • Text: 4,067 kinds of characters, 1,204,594 total characters
Sorting according to similarity (2)
• The search results are sorted according to their similarity to the query.
• The similarity (score) is calculated using the Vector Space Model (VSM).
• Here we select result No. 2.
Show similar materials (1)
• When a result is selected, the system presents its details (title, text, temporal information, …) and the materials which are similar to it.
• Here we further select similar material No. 2.
Show similar materials (2)
• The selected material is presented in the same way.
Displaying on the timeline (1)
• From the same view, a click opens the timeline display.
Displaying on the timeline (2)
• This timeline feature was developed using Simile Timeline (version 2.3.0, with the pre-2.3.0 Ajax library).
• The selected material and its similar materials are plotted as icons.
• The icon color shows the similarity rank of the material:
  • green: the selected material itself
  • top 1–5
  • top 6–9
  • top 11 and lower
• Each material has normalized temporal information, which determines its position on the timeline.
What should be solved?
• In order to sort by similarity, a method of similarity calculation is needed.
• We use the Vector Space Model (VSM):
  • In VSM, a material is represented as a vector.
  • Each vector element (the weight of a term in the material) is calculated using the tf.idf measure.
  • To calculate the similarity between a query and a text, a vector is also created for the query.
Similarity Calculation using VSM
• The similarity between materials x and y is calculated as the cosine of their tf.idf vectors:

  sim(x, y) = Σ_i w_{i,x} · w_{i,y} / ( √(Σ_i w_{i,x}²) · √(Σ_i w_{i,y}²) ),  where w_{i,x} = tf_{i,x} · log(N / df_i)

  • tf_{i,x}: the frequency of term i in document (material) x
  • df_i: the number of documents (materials) containing term i
  • N: the number of documents (materials)
• Currently, the number of terms in the system is 119,282.
• By the way… the tf.idf measure needs terms extracted from the text. Is it possible to extract terms from Japanese historical text?
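The weighting and cosine similarity above can be sketched in a few lines of Python (a minimal illustration, not the system's actual code; the function names are ours, and documents are assumed to be already segmented into terms):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf.idf vectors for a list of pre-segmented documents.

    Each document is a list of terms (e.g., the word segmentation of a
    material). Returns one sparse {term: weight} dict per document.
    """
    N = len(docs)
    df = Counter()                      # df[i]: number of docs containing term i
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # tf[i]: frequency of term i in this doc
        vectors.append({t: f * math.log(N / df[t]) for t, f in tf.items()})
    return vectors

def cosine(x, y):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0
```

A query would be vectorized the same way (reusing the same df counts), and the search results sorted by cosine(query_vec, doc_vec) in descending order.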
Word segmentation (1)
• An example of the text of a Japanese historical material:

  足利直義御教書案(切紙) 島津家文書_1
  御教書案師直師泰誅伐事早馳参御方可致軍忠之状如件観応元年十一月三日御判島津左京進入道殿

• Term extraction is very hard:
  • there is no morphological analyzer for the text of Japanese historical materials, and
  • we have no dictionaries (of persons, locations, …) covering the Nanboku-chō period.
• Here, we use the word segmentation results of the text as terms.
• In Japanese historical text there are many nouns, some auxiliary verbs, some verbs, and few others.
Word segmentation (2)
• We introduce a word segmentation technique using NPYLM (Nested Pitman-Yor Language Model) [1].
• NPYLM generates probabilistic distributions by a nonparametric Bayesian model, and is a word n-gram model that needs no dictionary (an unsupervised learning technique).
• A word is represented by a character-level variable-order n-gram in the model.
• The world of “nonparametric”: there are “many parameters”, not “no parameters”.
• An example of word segmentation using NPYLM:

  text: 御教書案師直師泰誅伐事早馳参御方可致軍忠之状如件観応元年十一月三日御判島津左京進入道殿
  result: 御教書 | 案 | 師直師 | 泰誅伐事 | 早馳 | 参御方 | 可致 | 軍忠之 | 状如件 | 観応元年十一月三 | 日 | 御判 | 島津 | 左京進 | 入道殿
  correct?: 御教書 | 案 | 師直 | 師泰 | 誅伐事 | 早馳参 | 御方 | 可致 | 軍忠之状 | 如件 | 観応 | 元年 | 十一月 | 三日 | 御判 | 島津 | 左京進 | 入道 | 殿

  It is not so bad!? (There are a number of “correct” segmentations.)

[1] Daichi Mochihashi, Takeshi Yamada, Naonori Ueda: "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling". ACL-IJCNLP 2009, pp. 100-108, 2009.
The technique of word segmentation (1)
• Forward filtering – Backward sampling.
• This is often used in Bayesian learning of HMMs (Hidden Markov Models).
• 2 phases: forward filtering and backward sampling.
• Forward filtering phase: computing the forward variables α[t][k].
• α[t][k] is the probability of the string c1…ct with its final k characters being a word:

  α[t][k] = Σ_j p(c_{t−k+1}…c_t | c_{t−k−j+1}…c_{t−k}) · α[t−k][j]

• This aggregates the marginal probabilities of all segmentations of the prefix. For example, for s = 御教書案師直師泰誅伐事早馳参…, candidate splits such as 御教書案師 | 直師泰 and 御教書案師直 | 師泰 are both covered.
• The bigram probability p(·|·) is computed using NPYLM.
The technique of word segmentation (2)
• Backward sampling phase: determining the value of k, i.e., the length of each word, from the end of the string.
• k is drawn in proportion to p(w | c_{t−k+1}…c_t) · α[t][k], where w is the word sampled in the previous step.
• Start state: t = N and w = $, where
  • N: the length of the string s (thus s = c1, …, cN), and
  • $: the sentence boundary at the end of the string.
• After drawing k, the word c_{t−k+1}…c_t is fixed, w is set to it, and t ← t − k.
• Termination condition: t = 0.
• Again, the bigram probability p(·|·) is computed using NPYLM.
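The two phases can be sketched as follows (a toy illustration with our own naming; in the real system the bigram probability comes from NPYLM, which is not implemented here and is passed in as a plain callable):

```python
import random

def segment(s, bigram_prob, max_word_len=8):
    """Forward filtering / backward sampling word segmentation (sketch).

    bigram_prob(word, prev) -> P(word | prev) is assumed to be supplied
    by a language model (NPYLM in the paper). "$" marks the sentence
    boundary at both ends of the string.
    """
    N = len(s)
    L = min(max_word_len, N)
    # alpha[t][k]: probability of s[:t] with its final k characters being a word
    alpha = [[0.0] * (L + 1) for _ in range(N + 1)]
    alpha[0][0] = 1.0
    for t in range(1, N + 1):
        for k in range(1, min(L, t) + 1):
            word = s[t - k:t]
            if t - k == 0:
                # the word starts the sentence: condition on the boundary
                alpha[t][k] = bigram_prob(word, "$")
            else:
                # marginalize over the length j of the previous word
                alpha[t][k] = sum(
                    bigram_prob(word, s[t - k - j:t - k]) * alpha[t - k][j]
                    for j in range(1, min(L, t - k) + 1)
                )
    # backward sampling: draw word lengths k from the end of the string
    words, t, prev = [], N, "$"   # prev: the word sampled in the previous step
    while t > 0:
        ks = range(1, min(L, t) + 1)
        weights = [bigram_prob(prev, s[t - k:t]) * alpha[t][k] for k in ks]
        k = random.choices(list(ks), weights=weights)[0]
        words.append(s[t - k:t])
        prev = s[t - k:t]
        t -= k
    return list(reversed(words))
```

With a toy probability such as `lambda w, v: 0.5 ** len(w)`, the sampled words always concatenate back to the input string.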
The technique of word segmentation (3)
• NPYLM is an extension of HPYLM [1] (Hierarchical Pitman-Yor Language Model).
  • [1] Y. W. Teh: A Hierarchical Bayesian Language Model Based On Pitman-Yor Processes, ACL 2006.
• In HPYLM, for a context h and word w, the probability p(w|h) that w follows h is computed by

  p(w|h) = (c(w|h) − d·t_{hw}) / (θ + c(h)) + ((θ + d·t_h) / (θ + c(h))) · p(w|h′)

  • h′: the suffix of h consisting of all but the first word
  • c(w|h): the count of w after h; t_{hw}: the number of tables serving w in context h; t_h = Σ_w t_{hw}
  • d, θ: parameters of HPYLM … many parameters, so “nonparametric”
• The n-gram frequency is discounted and interpolated with the (n−1)-gram probability.
• NPYLM has a character n-gram model and a word n-gram model:
  • the character n-gram is embedded in the base measure of the word n-gram;
  • thus, the 0-gram of the word n-gram is the character n-gram;
  • both n-grams are represented by HPYLM.
• Details are given in Daichi Mochihashi, Takeshi Yamada, Naonori Ueda: "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling". ACL-IJCNLP 2009, pp. 100-108, 2009.
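The interpolation above can be sketched as a recursive function (a simplified illustration with our own data layout; the seating arrangements and the per-depth d, θ values used in the actual model are taken as given here):

```python
def pyp_prob(w, h, counts, tables, d=0.5, theta=1.0, base=None):
    """Predictive probability p(w | h) in a hierarchical Pitman-Yor LM (sketch).

    counts[h][w]: customer count of word w in context h
    tables[h][w]: number of tables serving w in context h
    base(w): base distribution (in NPYLM, the character n-gram model)
    h is a tuple of words; the parent context drops the first word of h.
    """
    c_hw = counts.get(h, {}).get(w, 0)
    c_h = sum(counts.get(h, {}).values())
    t_hw = tables.get(h, {}).get(w, 0)
    t_h = sum(tables.get(h, {}).values())
    # shorter-context distribution, or the base measure at the root
    if h:
        parent = pyp_prob(w, h[1:], counts, tables, d, theta, base)
    else:
        parent = base(w)
    # discounted count plus interpolation with the parent distribution
    return (max(c_hw - d * t_hw, 0.0) + (theta + d * t_h) * parent) / (theta + c_h)
```

With empty counts the function falls back to the base measure, which is exactly how unseen n-grams are smoothed by shorter contexts.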
The technique of word segmentation (4)
• Finally, the segmented string is obtained:

  text: 御教書案師直師泰誅伐事早馳参御方可致軍忠之状如件観応元年十一月三日御判島津左京進入道殿
  segment: 御教書 | 案 | 師直師 | 泰誅伐事 | 早馳 | 参御方 | 可致 | 軍忠之 | 状如件 | 観応元年十一月三 | 日 | 御判 | 島津 | 左京進 | 入道殿

• Note: the procedure is carried out by an MCMC (Markov chain Monte Carlo) method; various parameters are sampled over the training data using Gibbs sampling:
  • all parameters are fixed except the current sampling target;
  • the target parameter is sampled given the training data;
  • the next sampling target is chosen;
  • this is repeated until all parameters have been sampled.
• The whole sweep is repeated many times, until “you are satisfied” (e.g., the change of the parameters is quite small) or the parameters have converged.
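The sampling loop just described can be sketched as blocked Gibbs sampling over sentences (a schematic of the training loop only; remove_fn, add_fn, and segment_fn are hypothetical hooks into the language model: forget a sentence's current analysis, resample a segmentation, and add the new analysis back to the model counts):

```python
import random

def gibbs_sweep(sentences, remove_fn, add_fn, segment_fn, n_iters=100):
    """Blocked Gibbs sampling over sentences (training-loop sketch)."""
    segmentations = {s: None for s in sentences}
    for it in range(n_iters):
        random.shuffle(sentences)             # visit sentences in random order
        for s in sentences:
            if segmentations[s] is not None:
                remove_fn(segmentations[s])   # forget the old segmentation
            seg = segment_fn(s)               # draw a new one (forward filtering
                                              # / backward sampling)
            add_fn(seg)                       # update the model counts
            segmentations[s] = seg
    return segmentations
```

Each sweep resamples every sentence's segmentation given the current analyses of all the others, which is why the parameters gradually converge over many iterations.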
Examples: Word segmentation
• 13750880550 赤堀文書 大日本史料6_46
  text: 譲與所領事右美作国塩湯郷壱分地頭職并公文職内半分者兄義季観応二年二月廿一日戴御判今半分事康季同日戴御判者也彼御判同御施行遵行軍忠之御判等相副之甥為帯刀左衛門尉季治於猶子譲與者也聊一族親類中不可有違乱妨儀也仍為後日譲状如件永和元年八月十日藤原康季
  segment: 譲與 | 所領 | 事右 | 美作国 | 塩湯郷壱分 | 地頭 | 職并 | 公文 | 職内 | 半分者 | 兄義季 | 観応二 | 年二月 | 廿一日 | 戴御判 | 今半分事 | 康季同日 | 戴御 | 判者也 | 彼 | 御判 | 同 | 御施行 | 遵行 | 軍忠之 | 御判 | 等相副之 | 甥為 | 帯刀 | 左衛門 | 尉季 | 治於 | 猶子 | 譲與 | 者也 | 聊一 | 族親 | 類中 | 不可有 | 違乱 | 妨儀也 | 仍為後日 | 譲状 | 如件 | 永和元 | 年八月 | 十日 | 藤原 | 康季
• 13900080170 大内義弘挙状案 吉川家文書_2
  text: 吉河讃岐入道玄龍申安芸国山県郡内志路原庄事依軍忠預置之候御下文事申御沙汰候者可然候恐惶謹言明徳元年八月十七日左京権大夫義弘在判進上御奉行所
  segment: 吉河 | 讃岐 | 入道 | 玄龍申 | 安芸国 | 山県郡内 | 志路原庄事 | 依 | 軍忠 | 預置 | 之候 | 御下文 | 事申 | 御沙汰 | 候者 | 可然候 | 恐惶 | 謹言 | 明徳元 | 年八月 | 十七日 | 左京 | 権大夫 | 義弘 | 在判 | 進上御 | 奉行所
• The segmentation might become finer if some dictionaries could be introduced. We should try semi-supervised learning!?
Conclusion, Future Work
• To support searching and exploring the huge amount of text in Japanese historical materials, we introduced a search system using VSM and a timeline.
• To apply VSM, we introduced word segmentation using NPYLM.
• All the methods we introduced are still at the “sketch” stage:
  • the search system is a prototype;
  • we introduced and prototyped the methods, but have not evaluated them (evaluation is very hard!?).
• We will consider introducing semi-supervised learning, for finer word segmentation.
• We will also investigate text-analysis methods for Japanese historical materials, for example:
  • topic extraction,
  • with temporal data: topic detection / topic tracking, …
  • for discovering similar materials “semantically”.
Thank you for listening to my presentation.
• E-mail: t_yamada@nihu.jp