
Context discovery with SHY (Song Huiyao – 宋會要)

Context discovery with SHY (Song Huiyao – 宋會要). Jieh Hsiang (項潔), National Taiwan University and Academia Sinica. Joint work with Hsieh-Chang Tu (杜協昌), NTU, and Shih-Pei Chen (陳詩沛), Harvard. With special thanks to Cheng-yun Liu (劉錚雲) of IHP, Academia Sinica.


Presentation Transcript


  1. Context discovery with SHY (Song Huiyao – 宋會要) Jieh Hsiang (項潔) National Taiwan University and Academia Sinica PNC 2012, Berkeley

  2. Joint work with • Hsieh-Chang Tu (杜協昌), NTU • Shih-Pei Chen (陳詩沛), Harvard With special thanks to • Cheng-yun Liu (劉錚雲) of IHP, Academia Sinica • Peter Bol of Harvard University

  3. Songhuiyao《宋會要》 • Huiyao (會要): decrees and laws, usually collected throughout a dynasty • Songhuiyao (宋會要): the huiyao of the Song Dynasty (960–1279 AD), the most important government record of the Song Dynasty • The current version is only a remnant, extracted by Xu Song (清,徐松) around 1800 from the Yong-le Dadian (永樂大典)

  4. Songhuiyao《宋會要》 • 35,000,000 words in full text, 17 categories • Full text prepared by the Institute of History and Philology (IHP) of the Academia Sinica and the China Biographical Database (CBDB) project of Harvard University • Included in the Scripta Sinica of IHP • Why another system? • Songhuiyao is fragmented and very difficult to use • Need a better way to re-contextualize the material

  5. Introducing THDL • Originally designed as a system for a Chinese corpus of full text historical documents related to Taiwan (thus the name THDL: Taiwan History Digital Library) • Tailored for scholarly use with many special features

  6. Key design philosophy of THDL (preserving old) (creating new) (observing different) • Assumes that documents are related • Treats a query return as a sub-collection of inter-related documents • Provides ways to discover the collective meanings of a sub-collection • Contexts, contexts, contexts

  7. Features in THDL • Main goal: provide ways to show collective meanings (contexts) of documents • Multi-level classification of query result • Term co-occurrence analysis • GIS/time distributions • Term extraction tools • Text mining tools • Annotation/correction tools

  8. THDL as a shell • THDL • Taiwanese land deeds • Ming Qing court documents • Dan-Xin archives • KMT (Nationalist Party) archives • Taiwanese democratic magazines • Songhuiyao (宋會要) (this talk) • Qingshilu – Veritable Records of Qing (清實錄) (IP) • Gujin tushu jicheng (古今圖書集成) and other leishu (類書) (IP), other smaller books • Over 400,000,000 Chinese words, 1,000,000 metadata records, 2,000,000 images

  9. XMLize the data • CBDB processed the data into 80,396 entries in Excel form, each with 7 fields: category, emperor, dates (4 fields), and full text
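A minimal sketch of how one spreadsheet row might be turned into an XML record. The element names (`entry`, `category`, `emperor`, `date1` to `date4`, `fulltext`) and the sample row are placeholders assumed for illustration; the slides give neither the actual schema nor the meaning of the four date fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names: the slide only says there are 7 fields
# (category, emperor, 4 date fields, full text).
FIELDS = ["category", "emperor", "date1", "date2", "date3", "date4", "fulltext"]

def row_to_xml(entry_id, row):
    """Build one <entry> element from a 7-item spreadsheet row."""
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
    entry = ET.Element("entry", id=str(entry_id))
    for name, value in zip(FIELDS, row):
        child = ET.SubElement(entry, name)
        # Empty spreadsheet cells become empty elements.
        child.text = "" if value is None else str(value)
    return entry

# Placeholder row, not real SHY data.
row = ["category-x", "emperor-y", "era-z", "1", "2", "3", "sample text"]
print(ET.tostring(row_to_xml(1, row), encoding="unicode"))
```

Keeping every field as its own child element makes it straightforward to facet on category, emperor, or date later.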

  10. XMLize the data • Dates: use DDBC from Dharma Drum to convert the dates to the Western calendar (61,002 documents) • Extract names for SHY • 9,470 person names from CBDB (CBDB has 35,632 Song names) • 3,366 official titles from CBDB • 4,010 locations from CBDB • Text-mined 11,901 additional potential names (estimated correctness: 33%)
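The slides do not say how the CBDB name lists were matched against the text. One plausible approach, sketched here purely as an assumption, is a longest-match dictionary lookup over the characters of each entry. The function `tag_terms`, the tiny lexicon, and the sample sentence are all made up for illustration.

```python
def tag_terms(text, lexicon):
    """Return (start, term, kind) for every lexicon term found in text.

    lexicon maps term -> kind ("person", "office", "location").
    At each position the longest matching term wins.
    """
    max_len = max((len(t) for t in lexicon), default=0)
    hits = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in lexicon:
                match = candidate
                break
        if match:
            hits.append((i, match, lexicon[match]))
            i += len(match)  # skip past the matched term
        else:
            i += 1
    return hits

# Illustrative lexicon and sentence, not taken from SHY.
lexicon = {"史彌遠": "person", "溫州": "location", "饒州": "location"}
text = "史彌遠言溫州、饒州事"
hits = tag_terms(text, lexicon)
print(hits)
```

A lookup like this only finds names already in the lexicon, which is why the text-mined candidates mentioned on the slide need separate verification.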

  11. Features of SHY (1) • Finding documents • Full text search, plus logical operations • Multiple contextual presentations of query results • Term frequency and co-occurrence (contextual) analysis of people, locations, and offices • Biography of people (from CBDB)

  12. Features of SHY (2) • Chronological distribution of query results • Geographic distribution of query results • Self-defined document sets (with all the features above) • Chronological comparison of two query result sets • User-feedback mechanism (especially useful for Song research community) • Appositional term analysis

  13. Full text search in SHY • Query term: “locust”

  14. Multi-contextual classification • Years • Era (of emperors) • Categories • Subcategories • Error detection

  15. Error detection using facets • Years that are not supposed to exist (e.g., 2nd month of first year of Xinguo)
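A rough sketch of the idea: count how many documents fall under each facet value, then flag values that a reference calendar says cannot occur. The field name `era_year`, the placeholder era names, and the set of valid values are assumptions for illustration only, not SHY's actual data or implementation.

```python
from collections import Counter

def facet_counts(docs, field):
    """Count documents per value of one metadata field."""
    return Counter(doc[field] for doc in docs)

def suspicious_values(docs, field, valid_values):
    """Return facet values (with counts) that are not in the valid set."""
    counts = facet_counts(docs, field)
    return {value: n for value, n in counts.items() if value not in valid_values}

# Placeholder data: (era name, year within era). Not real SHY records.
docs = [
    {"id": 1, "era_year": ("EraX", 1)},
    {"id": 2, "era_year": ("EraX", 2)},
    {"id": 3, "era_year": ("EraX", 2)},
    {"id": 4, "era_year": ("EraX", 9)},  # assume EraX had only 2 years
]
valid = {("EraX", 1), ("EraX", 2)}

print(facet_counts(docs, "era_year"))
print(suspicious_values(docs, "era_year", valid))
```

Because impossible dates show up as their own facet value, a reader can spot them in the facet list even without an automated check.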

  16. Facets within a facet • Distribution of result of the query “locust” within the category Ruiyi (瑞異 strange phenomenon)

  17. Biography of people from CBDB • Click biography (生平) by any name and get the information from CBDB

  18. Term frequency analysis • Common names and locations in the query result • df: document frequency (number of documents in which a term appears) • tf: term frequency (total number of occurrences of a term) • df(A)=4, tf(A)=6 • df(B)=3, tf(B)=4 • df(C)=2, tf(C)=3 • df(D)=2, tf(D)=2 • [Figure: sample documents with term occurrences A…C A…A D D…B …C…C A…B … A B…A …B]
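The counts on this slide can be reproduced with a few lines of code. The split of the occurrences into four documents below is hypothetical (the transcript does not preserve document boundaries); it is simply one arrangement that yields the df and tf values listed.

```python
from collections import Counter

def term_stats(docs):
    """Return (df, tf) counters for a list of tokenised documents."""
    tf = Counter()
    df = Counter()
    for doc in docs:
        tf.update(doc)       # every occurrence counts toward tf
        df.update(set(doc))  # each document counts once toward df
    return df, tf

# Hypothetical document boundaries, chosen to match the slide's numbers.
docs = [
    ["A", "C", "A"],
    ["A", "D", "B"],
    ["C", "C", "A", "B", "D"],
    ["A", "B", "A", "B"],
]
df, tf = term_stats(docs)
for term in "ABCD":
    print(f"df({term})={df[term]}, tf({term})={tf[term]}")
```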

  19. Term frequency analysis • Query: 「史彌遠」 (Shi Miyuan) • df(t): given a query q, the number of documents in the query result in which term t appears • tq: df(t) as a percentage of the total number of documents in which t appears (the higher it is, the more relevant t is to q)
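Read literally, tq is the share of all documents containing t that also fall inside the query result. A small sketch under that reading follows; the function name `tq`, the set-of-terms document model, and the toy collection are assumptions for illustration.

```python
def tq(term, result_docs, all_docs):
    """Percentage of the documents containing `term` that are in the query result.

    Each document is modelled as a set of terms; result_docs is assumed
    to be a subset of all_docs.
    """
    df_in_result = sum(1 for doc in result_docs if term in doc)
    df_overall = sum(1 for doc in all_docs if term in doc)
    if df_overall == 0:
        return 0.0  # term never appears anywhere
    return 100.0 * df_in_result / df_overall

# Toy collection; "Q" stands in for the query term.
all_docs = [{"Q", "X"}, {"Q", "X", "Y"}, {"Q", "Y"}, {"X"}, {"X", "Y"}]
result_docs = [doc for doc in all_docs if "Q" in doc]

print(tq("X", result_docs, all_docs))
print(tq("Y", result_docs, all_docs))
```

A term that occurs almost only inside the query result gets a tq near 100, which is what makes it a strong contextual signal for q even when its raw df is small.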

  20. Chronological distribution of documents • Chronological distribution of documents is often useful • Among the 80,396 documents in Songhuiyao, 61,002 have dates that were extracted automatically

  21. Comparing timelines of two queries • q1 vs. q2 • Example: Wenzhou vs. Raozhou • Grey: with Raozhou • Red: with Wenzhou
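A sketch of how two query timelines could be tabulated side by side before plotting. The document layout (`year`, `text`), the substring match used as a stand-in for full-text search, and the sample records are assumptions; undated documents are skipped, mirroring the fact that only part of the collection carries extracted dates.

```python
from collections import Counter

def timeline(docs, term):
    """Count dated documents per year whose text contains `term`."""
    return Counter(
        doc["year"]
        for doc in docs
        if doc.get("year") is not None and term in doc["text"]
    )

def compare_timelines(docs, term_a, term_b):
    """Return (year, count_a, count_b) rows for every year either term occurs."""
    a, b = timeline(docs, term_a), timeline(docs, term_b)
    years = sorted(set(a) | set(b))
    return [(year, a[year], b[year]) for year in years]

# Placeholder records, not real SHY data.
docs = [
    {"year": 1100, "text": "... Wenzhou ..."},
    {"year": 1100, "text": "... Raozhou ..."},
    {"year": 1101, "text": "... Wenzhou ... Raozhou ..."},
    {"year": None, "text": "... Wenzhou ..."},  # undated, ignored
    {"year": 1103, "text": "... Wenzhou ..."},
]
rows = compare_timelines(docs, "Wenzhou", "Raozhou")
for row in rows:
    print(row)
```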

  22. Geographic distribution • Locations (with df) plotted on a map • Location names obtained from CBDB • Query: “locust”

  23. Self-defined folders • User can define her own folders of documents so that they can be used later • All the features described above apply to all self-defined folders (i.e., any sets of documents, not only query results)

  24. Self-defined folders • Light green color means that document has been kept in some folder

  25. User feedback mechanism • Simple way for users to report errors in metadata or full-text • Also used effectively for SHY users to determine the correctness of new names found through term extraction

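  26. User feedback mechanism • In the term frequency analysis, each term listed under “Other” (其他) currently has an “error report” (錯誤回報) link to its right • At the bottom right of the full text there is a “correct full-text errors” (更正全文錯誤) link • Correction and reporting of person and place name terms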

  27. User feedback mechanism • Feedback on terms • Feedback on full text

  28. Ask the user community to check for correctness

  29. So far: 966 names confirmed from 2,390 candidates

  30. Appositional term analysis • Given a set of documents, find which terms (and their frequencies) appear before or after a certain word • Example: which words appear before “tax” (which also indicates what types of taxes there were) • Simple interface: type the keyword and a number indicating how many words before or after the keyword to consider
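The analysis described here amounts to counting the fixed-length strings that sit immediately before or after each occurrence of a keyword. A minimal sketch, assuming that "words" means characters in Chinese text; the function name `neighbours` and the sample strings are made up and are not quotations from SHY.

```python
from collections import Counter

def neighbours(docs, keyword, n, side="before"):
    """Count the n-character strings directly before or after each keyword hit."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if side not in ("before", "after"):
        raise ValueError("side must be 'before' or 'after'")
    counts = Counter()
    for text in docs:
        start = text.find(keyword)
        while start != -1:
            if side == "before":
                chunk = text[max(0, start - n):start]
            else:
                end = start + len(keyword)
                chunk = text[end:end + n]
            # Ignore hits too close to the edge of the text.
            if len(chunk) == n:
                counts[chunk] += 1
            start = text.find(keyword, start + 1)
    return counts

# Made-up sample strings containing 稅 ("tax").
docs = ["徵商稅及田稅", "免商稅", "鹽稅增"]
before = neighbours(docs, "稅", 1, side="before")
after = neighbours(docs, "稅", 1, side="after")
print(before.most_common())
print(after.most_common())
```

Sorting the counts by frequency gives the kind of "x tax" table the next slide refers to.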

  31. Statistics of x tax

  32. Can directly read the text

  33. Discussion • SHY is an example of a new methodology of search systems • It can analyze contexts of documents resulting from a query • The first prototype of SHY was completed within a week (fine-tuning took longer) • Critical input from CBDB, especially on terms, locations, calendar, biography • THDL as a shell is very effective for quickly prototyping such systems

  34. Thank you
