Buddha Ngram Viewer: a N-gram Visualization Tool of Chinese Buddhist Translations. Jen-Jou (Joey) Hung @ PNC Annual Conference and Joint Meetings 2013, Kyoto University, DEC 10-12, 2013. Building the Digital Research Platform for Chinese Buddhist Literature.
Buddha Ngram Viewer: a N-gram Visualization Tool of Chinese Buddhist Translations Jen-Jou (Joey) Hung @ PNC Annual Conference and Joint Meetings 2013, Kyoto University, DEC 10-12, 2013
Building the Digital Research Platform for Chinese Buddhist Literature
Achievements of Digitized Chinese Buddhist Texts • CBETA (Chinese Buddhist Electronic Text Association) was founded in 1998. • In the last 15 years, CBETA has converted a substantial number of Chinese Buddhist scriptures to digital format. • The CBETA 2011 DVD contains more than 160 million Chinese characters.
The Chance and Challenge with “BIG DATA” (I) • The rapid growth of digital resources lets scholars acquire more relevant materials in less time. • However, most digital resources are not integrated. Scholars have to find a more efficient way to master the large amount of data so as not to drown in the data ocean.
The Chance and Challenge with “BIG DATA” (II) • We also believe that this large amount of digital resources will not only provide a convenient research environment but also help scholars gain new insights. • One very promising solution is to perform text analysis on the Buddhist electronic text corpus to find hidden patterns behind the texts. • However, this sounds like a very difficult task for Buddhist scholars.
Digital Research Platform for Chinese Buddhist Literature • Main Mission of the Digital Research Platform: • Data Providing: Provide complete, integrated reference data in an easily accessible way. • Data Organizing: Provide customization tools for users to organize materials into knowledge. • Data Analyzing: Provide digital analysis tools for discovering hidden patterns.
Project Information • Two-year project, funded by the National Science Council (Digital Humanities Project). It consists of three sub-projects: • Sub-project 1: responsible for digitizing new resources to support this platform (directed by Aming TU). • Sub-project 2: responsible for developing new methodology for analyzing the digital corpus, especially focusing on phonology materials (directed by Chien-Kang Huang). • Sub-project 3: responsible for integrating the project results, developing quantitative text analysis tools and establishing the platform.
Plan for the First Year • Target 1: build up the platform for integrating resources • Design a good way to integrate digital resources. • Includes: CBETA full text, catalogue, dictionaries, phonology materials, and other digital resources created by DDBC. • Target 2: implement text analysis functions • Build up a data set for text analysis. • Create tools. Ex: Buddha N-gram Viewer is an example tool for this purpose. It visualizes the occurrences over time of input phrases in Chinese Buddhist texts.
Idea of the Research Platform • Our experience: in the last decade, we have carried out more than 20 digital archive projects. • Every database has its own archive content, design principles and media types. • The only overlap is perhaps the sutra text. • To integrate those resources, we decided to establish a feature-rich sutra reading interface and bind other related information to the text.
Basic Information • Catalogue Data: information from the sutra catalogue; clicking here leads to our catalogue project. • Sutra text: embeds only the critical apparatus and gaiji information. • Other Related Sutra: Other Parallel Translation, List of Commentary, Related Research. • N-gram Information: 婆羅,727 如是,705 比丘,694 羅門,693 沙門,614 世尊,477 如來,469 云何,428 眾生,388 由旬,387 爾時,384 復有,358 是為,346 阿難,317 無有,313
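The N-gram list on this slide reads like counts of overlapping two-character sequences (婆羅 and 羅門 would both arise from 婆羅門). The following is a minimal Python sketch of that kind of counting, assuming plain character bigrams that never cross punctuation. The function name `count_char_ngrams` and the sample line are hypothetical illustrations, not the platform's actual code or data.

```python
from collections import Counter
import re

def count_char_ngrams(text, n=2):
    """Count overlapping character n-grams, never crossing punctuation or whitespace."""
    counts = Counter()
    # Split on anything outside the basic CJK Unified Ideographs block
    # (punctuation, spaces, digits), so n-grams stay inside one phrase.
    for segment in re.split(r"[^\u4e00-\u9fff]+", text):
        for i in range(len(segment) - n + 1):
            counts[segment[i:i + n]] += 1
    return counts

# Illustrative sample line only (not taken from the CBETA corpus).
sample = "爾時世尊告諸比丘。婆羅門白世尊言。"
bigrams = count_char_ngrams(sample)
print(bigrams.most_common(3))
```

Splitting on punctuation first is a design choice: it avoids counting a bigram that straddles two sentences. Rare characters outside the basic CJK block (for example gaiji) would need a wider character range.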
Extra Information for Selected Terms • Catalogue Data • Dictionary Lookup: 婆羅《丁福保佛學大辭典》 【職位】Vihārapāla,維那之別名,譯曰次第,司僧中之次第順序者。行事鈔下二曰:「維那出要律儀翻為寺護,又云悅眾。本正音婆邏,云次第。」 Information from our glossaries project; clicking here leads to the glossaries project website. • Other Related Sutra: Other Parallel Translation, List of Commentary, Related Research. • Occurrences of 婆羅 in different time periods. This information is from Buddha Ngram Viewer. • N-gram Information: 婆羅,727 如是,705 比丘,694 羅門,693 沙門,614 世尊,477 如來,469 云何,428 眾生,388 由旬,387 爾時,384 復有,358 是為,346 阿難,317 無有,313 • Word Segmentation Tools • Place Name, Person Name, Calendar Look up
What is Text Analysis • Text analysis: using computer software to analyze the text content of a large corpus, e.g. CBETA. The objective is to discover hidden patterns and derive new insights from them. • The patterns could be: • Words that are frequently used in one place but never appear anywhere else. • High-frequency collocations in a group of documents. • Special usage patterns of commonly used words. • Other possible and meaningful patterns ……
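As one illustration of the first pattern (terms frequent in one place but absent everywhere else), here is a minimal Python sketch. It assumes each document is a plain string of characters and treats character bigrams as candidate terms; the function name `exclusive_terms`, the `min_count` threshold and the two toy documents are all hypothetical.

```python
from collections import Counter

def exclusive_terms(docs, n=2, min_count=2):
    """For each document, list n-grams seen at least min_count times there
    and not at all in any other document."""
    counts = {
        name: Counter(text[i:i + n] for i in range(len(text) - n + 1))
        for name, text in docs.items()
    }
    result = {}
    for name, own in counts.items():
        # Collect every n-gram that appears in some other document.
        elsewhere = set()
        for other_name, other in counts.items():
            if other_name != name:
                elsewhere.update(other)
        result[name] = sorted(
            gram for gram, k in own.items()
            if k >= min_count and gram not in elsewhere
        )
    return result

# Toy documents for illustration only.
docs = {"A": "泥洹泥洹如是", "B": "涅槃涅槃如是"}
print(exclusive_terms(docs))
```

On a real corpus the threshold would need tuning, and punctuation should be stripped first so that bigrams do not span phrase boundaries.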
Difficulties in applying text analysis to the CBETA corpus • The data is too complex: • The textual content and structure of Buddhist works are highly complex. • Analysis tools are very difficult to learn: • Using general text analysis tools requires some programming skill and advanced statistical knowledge. • How can we help more (humanities) scholars adopt text analysis techniques in addressing their research questions? • We create some easy-to-use tools.
Buddha Ngram Viewer: (http://dev.ddbc.edu.tw/BuddhaNgramViewer/) • Buddha Ngram Viewer (under construction) • A tool that allows users to visualize the occurrences over time of input phrases in Chinese Buddhist texts. Click any point in the chart to start.
Idea of Buddha Ngram Viewer • Combine search results with sutra translation dates from the Tripitaka catalogue. • Search results in CBReader + translation dates from the catalogue = number of occurrences of the search term in different time periods.
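The combination described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalogue is modelled as a mapping from text ID to translation year, the texts as plain strings, and matching as simple substring counting. The function name `occurrences_by_year`, the text IDs and the years are hypothetical, not the viewer's actual implementation or data.

```python
from collections import defaultdict

def occurrences_by_year(terms, texts, translation_year):
    """Sum the occurrences of each search term per translation year."""
    totals = {term: defaultdict(int) for term in terms}
    for text_id, body in texts.items():
        year = translation_year.get(text_id)
        if year is None:
            continue  # skip texts with no date in the catalogue
        for term in terms:
            # str.count counts non-overlapping matches of the term.
            totals[term][year] += body.count(term)
    # Return plain dicts ordered by year, ready for plotting.
    return {term: dict(sorted(by_year.items())) for term, by_year in totals.items()}

# Hypothetical toy data: three short texts and their translation years.
texts = {"A": "泥洹泥洹", "B": "涅槃泥洹", "C": "涅槃"}
translation_year = {"A": 401, "B": 401, "C": 650}
series = occurrences_by_year(["泥洹", "涅槃"], texts, translation_year)
print(series)
```

Each resulting year-to-count mapping corresponds to one line in a chart of occurrences over time. Texts without a known translation date are skipped here, which is one possible policy among several.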
Chart: occurrences of 泥洹 and 涅槃 over time. Axis labels: Chinese Dynasties, Western Years, Number of occurrences. Click a point (e.g. C.E. 401) to see its details.
The occurrences of 泥洹 and 涅槃 in the sutras translated in C.E. 401 • The occurrences in the 22 fascicles of T1 (長阿含經). • Click this point to see the details of the 3rd fascicle of T1. • Scroll down for more sutras. • A quick way to understand the frequencies of selected terms in texts.
Shows the matched places of 泥洹 and 涅槃 in the third fascicle of T1 • Click here to display only matches of 泥洹
Displays only matches of 泥洹 in the third fascicle of T1 • Click to view this line in the CBETA text
Integrate Buddha Ngram Viewer into the Research Platform • Dictionary Lookup: 婆羅《丁福保佛學大辭典》 【職位】Vihārapāla,維那之別名,譯曰次第,司僧中之次第順序者。行事鈔下二曰:「維那出要律儀翻為寺護,又云悅眾。本正音婆邏,云次第。」 • Occurrences of 婆羅 over time. This information is from Buddha Ngram Viewer. • Word Segmentation Tools • Place Name, Person Name, Calendar Look up
Future Work • Keep adding temporal and spatial information about sutras: • Taisho Shinshu Daizokyo, Showa Hobo Somokuroku. • The Korean Buddhist Canon: A Descriptive Catalogue by Dr. Lewis R. Lancaster, 1979. • Complete the sutra reading interface and continue to integrate more related information into the platform. • Keep bringing new ideas to the platform. Ex
Thank you for listening. Q & A !!