Some studies on Vietnamese multi-document summarization and semantic relation extraction
Laboratory of Data Mining & Knowledge Science
Content
• Vietnamese multi-document summarization
  • Vietnamese VNSEN search engine
  • Clustering
  • Semantic similarity
  • Multi-document summarization
• Semantic relation extraction
  • Vietnamese medical ontology
  • Object relation extraction
  • Cause-and-effect relations
  • Vietnamese entity search engine
Vietnamese multi-document summarization
• Vietnamese VNSEN search engine
  • Based on NUTCH
  • Integrated Vietnamese word segmentation tool: JvnSegmenter
  • Indexed 500,000 pages from vi.wikipedia.org
Vietnamese multi-document summarization
• Clustering
  • Clustering integrated into the VNSEN search engine
  • Uses snippet results from the VNSEN search engine
  • Hierarchical Agglomerative Clustering (HAC) algorithm
  • Evaluation against clustering on the Vivisimo search engine
  • Cluster labeling
  • Compactness of clusters
  • Isolation of clusters
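The slide names HAC over search snippets but gives no details. As a minimal sketch under stated assumptions, the Python below clusters snippets with average-link HAC over bag-of-words cosine similarity. The linkage choice, the `threshold` value, the whitespace tokenization (a Vietnamese pipeline would use a word segmenter such as JvnSegmenter instead), and the English toy snippets are all assumptions, not the VNSEN implementation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hac(snippets, threshold=0.2):
    """Average-link agglomerative clustering; returns lists of snippet indices.

    Merging stops when the best pair of clusters is less similar than
    `threshold` (an assumed stopping rule).
    """
    vectors = [Counter(s.lower().split()) for s in snippets]
    clusters = [[i] for i in range(len(snippets))]

    def link(c1, c2):
        # Average pairwise similarity between the members of two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = link(clusters[x], clusters[y])
                if score > best:
                    best, pair = score, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical snippets, for illustration only.
snippets = [
    "hanoi weather forecast today",
    "weather forecast hanoi weekend",
    "football match result",
    "football league match schedule",
]
clusters = hac(snippets)
print(clusters)
```

On this toy input the two weather snippets and the two football snippets end up in separate clusters; compactness and isolation could then be measured as within-cluster and between-cluster average similarity.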
Vietnamese multi-document summarization
• Implementation of semantic similarity measures
  • Semantic similarity between words based on a semantic network
    • Path length (PL)
    • Information content (IC)
  • Semantic similarity between sentences based on topic analysis
  • Word order similarity between sentences
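The slide lists the measures without formulas. The sketch below illustrates two of them under stated assumptions: a path-length word similarity computed by breadth-first search over a toy semantic network, mapped to a score with 1 / (1 + length), and a word order similarity of the form 1 - ||r1 - r2|| / ||r1 + r2|| over word-position vectors. The toy network, the scoring formula, and the exact-match simplification of word order are assumptions, not the authors' implementation.

```python
import math
from collections import deque

# Hypothetical toy semantic network (undirected is-a links); not from the slides.
EDGES = [
    ("entity", "animal"), ("entity", "plant"),
    ("animal", "dog"), ("animal", "cat"),
    ("plant", "tree"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph, source, target):
    """Number of edges on the shortest path, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def pl_similarity(graph, w1, w2):
    """Map path length to (0, 1]; 1 / (1 + length) is one assumed choice."""
    dist = path_length(graph, w1, w2)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


def word_order_similarity(s1, s2):
    """1 - ||r1 - r2|| / ||r1 + r2|| over 1-based word positions.

    Simplification: a word missing from a sentence gets position 0
    instead of being matched to its most similar word.
    """
    t1, t2 = s1.split(), s2.split()
    joint = list(dict.fromkeys(t1 + t2))  # unique words, first-seen order

    def order_vector(tokens):
        return [tokens.index(w) + 1 if w in tokens else 0 for w in joint]

    r1, r2 = order_vector(t1), order_vector(t2)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total if total else 0.0


graph = build_graph(EDGES)
print(pl_similarity(graph, "dog", "cat"))
print(word_order_similarity("a b c", "c b a"))
```

An information content measure would additionally need corpus frequencies for each node, which the slide does not provide, so it is left out here.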
Vietnamese multi-document summarization
• Building a Vietnamese semantic corpus
  • Hidden topic corpus
    • Using the Latent Dirichlet Allocation (LDA) model
    • Using the JGibbsLDA tool to analyze topics
  • Vietnamese Wikipedia corpus
    • Using a category graph model
  • Results
    • Hidden topic corpora with 120/150/200 topics based on the Vnexpress/Wikipedia data sets
    • Category graph with 14,000 category nodes and 200,000 articles
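To make the hidden topic step concrete, here is a minimal collapsed Gibbs sampling sketch for LDA in Python. It is not JGibbsLDA or its API; the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Return per-document topic distributions (theta) for tokenized docs."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts.
                z = assignments[d][n]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Sample a new topic from the collapsed conditional.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed topic proportions per document.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


# Hypothetical toy corpus, already tokenized.
docs = [
    "fever cough flu fever".split(),
    "flu cough medicine".split(),
    "goal match football".split(),
    "football team match goal".split(),
]
theta = lda_gibbs(docs, num_topics=2)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` is a topic distribution that can serve as a feature vector when comparing sentences or documents by hidden topics.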
Vietnamese multi-document summarization
• Multi-document summarization
  • Maximal Marginal Relevance (MMR) method
  • Improved with semantic similarity measures based on hidden topic analysis
[Diagram: summarization pipeline with the labels Cluster, Pre-processing, Hidden topic, Label, Cosine measure, List of documents, Document weights (D1 … Dk), List of sentences, Sentence weights (S1 … Sk), Summary document]
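The MMR selection step can be sketched as follows. This is a minimal illustration assuming plain bag-of-words cosine similarity and a query string standing in for the cluster label; the slides' improvement, a similarity measure based on hidden topics, would replace `cosine` here. The function name `mmr_summary`, the parameters `k` and `lam`, and the English toy sentences are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_summary(sentences, query, k=2, lam=0.5):
    """Greedily pick k sentences, trading relevance against redundancy.

    Score = lam * sim(sentence, query)
            - (1 - lam) * max sim(sentence, already selected).
    """
    vectors = [Counter(s.lower().split()) for s in sentences]
    query_vec = Counter(query.lower().split())
    selected = []
    candidates = list(range(len(sentences)))

    def score(i):
        relevance = cosine(vectors[i], query_vec)
        redundancy = max(
            (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
        )
        return lam * relevance - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


# Hypothetical sentences, for illustration only.
sentences = [
    "dengue fever causes high fever",
    "dengue fever causes high fever and rash",
    "mosquito nets prevent dengue",
]
summary = mmr_summary(sentences, "dengue fever", k=2, lam=0.5)
print(summary)
```

With `lam=0.5` the near-duplicate second sentence is skipped in favor of the less redundant third one; with `lam=1.0` the selection reduces to pure relevance ranking.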
Vietnamese multi-document summarization
• Multi-document summarization for a simple Vietnamese medical Q&A system
  • Semantic similarity measures based on the Vietnamese Wikipedia corpus
  • Medical ontology
  • Hidden topic analysis
  • Clustering
Vietnamese multi-document summarization
• Table-of-contents generation
  • Using text segmentation and title generation techniques to generate a table of contents automatically
Vietnamese multi-document summarization
• Some of our Vietnamese language processing utilities
  • Nguyen Cam Tu, Phan Xuan Hieu. JvnSegmenter: A Java-based Vietnamese Word Segmentation Tool
  • Nguyen Cam Tu. JVnTextpro: A Java-based Vietnamese Text Processing Toolkit
  • Nguyen Cam Tu. JGibbsLDA: A Java and Gibbs Sampling based Implementation of Latent Dirichlet Allocation (LDA)
  • VNSEN Search Engine: http://203.113.130.205:8080/sise (implementers: Nguyen Thu Trang, Nguyen Cam Tu, Nguyen Viet Cuong, Tran Mai Vu, Nguyen Minh Tuan, etc.)
Semantic Relation Extraction
• Vietnamese Medical Ontology
  • 23 entity classes
  • 14 relations
  • 200 entities
• Techniques to improve the ontology
  • Named Entity Recognition
  • Relation extraction
  • …
Semantic Relation Extraction
• Object relation extraction
  • Product domain
  • Medical domain
• Techniques
  • Wrapper technique for structured data (HTML/XML/tables)
  • NLP for unstructured data (text)
    • HMM model
    • CRF model
    • …
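For structured pages, a wrapper reads object attributes directly out of the markup. The sketch below is a hand-written wrapper for a simple HTML table using Python's standard `html.parser`; the class name `TableWrapper`, the helper `extract_records`, and the sample product table are hypothetical, and the slides do not say how the lab's wrappers were built or induced.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_records(html):
    """Treat the first row as attribute names and the rest as objects."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical product page fragment.
page = (
    "<table>"
    "<tr><th>product</th><th>price</th></tr>"
    "<tr><td>Phone A</td><td>100</td></tr>"
    "<tr><td>Phone B</td><td>150</td></tr>"
    "</table>"
)
records = extract_records(page)
print(records)
```

Each resulting dictionary is one extracted object with its attribute values; free text needs the sequence-labeling route (HMM or CRF) instead.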
Semantic Relation Extraction
• Cause-and-effect relations
Using the research results of Corina Roxana Girju [Rox08] to investigate cause-and-effect relations such as:
  • Adverbial causal link
  • Prepositional causal link
  • Subordination causal link
  • Clause-integrated link
[Rox08] Corina Roxana Girju (2008). Semantic Relation Extraction and its Applications, invited tutorial at the European Summer School in Logic, Language and Information (ESSLLI 2008), Hamburg, Germany, August 2008.
Semantic Relation Extraction
• Vietnamese entity search engine for the medical and health care domain
  • Using the medical ontology, object relation extraction, cause-and-effect relation extraction…
  • In association with the UIUC DB&IS Lab (University of Illinois at Urbana-Champaign)
    • Object search
    • Query log mining
    • Object extraction
[Cha08] Kevin C. Chang (2008). Data-Aware Search on the Web, Act. 2: Entity Search, Technical Report, University of Illinois at Urbana-Champaign (a talk at the College of Technology, Vietnam National University, Hanoi, July 8, 2008).
Some articles in 2008
[LNH08] Dieu-Thu Le, Cam-Tu Nguyen, Quang-Thuy Ha, Xuan-Hieu Phan, and Susumu Horiguchi (2008). Matching and Ranking with Hidden Topics towards Online Contextual Advertising, The 2008 IEEE/WIC/ACM International Conference on Web Intelligence (WI-08), University of Technology, Sydney, Australia, December 9-12, 2008 (accepted).
[PNL08] Xuan-Hieu Phan, Cam-Tu Nguyen, Dieu-Thu Le, Le-Minh Nguyen, Susumu Horiguchi, and Quang-Thuy Ha (2008). Classification and Contextual Match on the Web with Hidden Topics from Large Data Collections, IEEE Transactions on Knowledge and Data Engineering (submitted).
[VUH08] Tran Mai Vu, Pham Thi Thu Uyen, Hoang Minh Hien, Ha Quang Thuy (2008). Semantic Similarity of sentences and application for multi-document summarization to evaluate on clustering component of Vietnamese search engine, Workshop on Information Communication Technology (ICTFIT08), College of Science, Vietnam National University, Ho Chi Minh City, November 14, 2008 (in Vietnamese, accepted).
THANK YOU