Topic Modeling using Semantic and Network Structure
Sophia (Xueyao) Liang
CPSC 503 Final Project
Topic modeling
• Unsupervised
• Example with K=3 topics: "Olympic, Vancouver"; "snow, cold"; "moon light, spider man"
• Each document d gets a probability P(z|d) for each topic z
PLSA
• Latent topics: $z_k \in \{z_1, z_2, \ldots, z_N\}$
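For reference, the mixture model that this notation usually points to can be written as below. This is the textbook PLSA form, not a formula recovered from the slide. The symbol $n(d, w)$ (the count of word $w$ in document $d$) and the log-likelihood name $\mathcal{L}$ are introduced here as assumptions. The number of topics is written $N$ to match the slide, although the earlier example calls it K.

```latex
P(w \mid d) = \sum_{k=1}^{N} P(w \mid z_k)\, P(z_k \mid d),
\qquad
\mathcal{L} = \sum_{d} \sum_{w} n(d, w)\, \log \sum_{k=1}^{N} P(w \mid z_k)\, P(z_k \mid d)
```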
PLSA - Parameter inference
• Expectation step
• Maximization step
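The two steps can be sketched in plain Python using the standard PLSA updates: the expectation step computes the posterior P(z | d, w) ∝ P(w | z) P(z | d), and the maximization step re-estimates P(w | z) and P(z | d) from the expected counts. This is a minimal sketch, not the project's implementation. The function names `plsa_em` and `log_likelihood`, the dense list-of-lists count matrix, and the random initialisation are all assumptions made for illustration.

```python
import math
import random


def normalize(row):
    """Scale a non-negative row to sum to 1 (uniform if the row is all zeros)."""
    total = sum(row)
    if total <= 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM. counts[d][w] is the count of word w in document d.

    Returns (p_z_d, p_w_z): p_z_d[d][k] = P(z_k | d), p_w_z[k][w] = P(w | z_k).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random positive initialisation (an assumption; any valid start works).
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: posterior over topics for this (document, word) pair.
                post = normalize([p_w_z[k][w] * p_z_d[d][k]
                                  for k in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for k in range(n_topics):
                    new_z_d[d][k] += n_dw * post[k]
                    new_w_z[k][w] += n_dw * post[k]
        # M-step: renormalise the expected counts into probabilities.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


def log_likelihood(counts, p_z_d, p_w_z):
    """Sum over (d, w) of n(d, w) * log sum_k P(w | z_k) P(z_k | d)."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw:
                p = sum(p_z_d[d][k] * p_w_z[k][w] for k in range(len(p_w_z)))
                total += n_dw * math.log(max(p, 1e-300))
    return total
```

On a small matrix with two clear word blocks, for example `plsa_em([[4, 3, 0, 0], [0, 0, 3, 4]], 2)`, each returned row is a probability distribution, and running more iterations from the same seed should not lower the log-likelihood.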
NetPLSA
• Parameter inference: no closed-form solution for the expectation step
• Efficient algorithm:
• Expectation (PLSA)
• Maximization (PLSA)
• The result of the previous steps may not yield a better value of O
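The slide does not spell out how the network enters the algorithm, so the following is only a sketch under stated assumptions. It assumes a graph regularizer of the form 0.5 · Σ w(u, v) · Σ_k (P(z_k | u) − P(z_k | v))², a smoothing pass that mixes each document's topic distribution with the weighted average of its neighbours, and a guard that keeps the smoothed result only when the objective improves. The names `network_regularizer`, `smooth_topics`, `guarded_smooth`, the mixing weight `gamma`, and the callable `objective_fn` (lower is better) are hypothetical.

```python
def network_regularizer(p_z_d, edges):
    """0.5 * sum over edges (u, v, w) of w * squared distance between topic rows."""
    total = 0.0
    for u, v, w in edges:
        total += w * sum((a - b) ** 2 for a, b in zip(p_z_d[u], p_z_d[v]))
    return 0.5 * total


def smooth_topics(p_z_d, edges, gamma=0.5):
    """One smoothing pass over the topic distributions.

    Each document's row becomes (1 - gamma) * its own row plus gamma * the
    weighted average of its neighbours' rows. Edges are treated as undirected.
    Documents without neighbours are left unchanged.
    """
    n_docs, n_topics = len(p_z_d), len(p_z_d[0])
    acc = [[0.0] * n_topics for _ in range(n_docs)]
    weight = [0.0] * n_docs
    for u, v, w in edges:
        for a, b in ((u, v), (v, u)):
            weight[a] += w
            for k in range(n_topics):
                acc[a][k] += w * p_z_d[b][k]
    smoothed = []
    for d in range(n_docs):
        if weight[d] == 0:
            smoothed.append(list(p_z_d[d]))
        else:
            smoothed.append([(1 - gamma) * p_z_d[d][k] + gamma * acc[d][k] / weight[d]
                             for k in range(n_topics)])
    return smoothed


def guarded_smooth(p_z_d, edges, objective_fn, gamma=0.5):
    """Keep the smoothed distributions only if objective_fn (lower is better) drops."""
    candidate = smooth_topics(p_z_d, edges, gamma)
    if objective_fn(candidate) < objective_fn(p_z_d):
        return candidate
    return p_z_d
```

Because each smoothed row is a convex combination of probability rows, it still sums to 1. The guard reflects the last bullet above: a step that does not improve O is simply discarded in this sketch.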
NetPLSA
• Potential problems of the model
• Parameter inference
• Higher time complexity and slower to converge
Corpus
• Cora Data version 1.0: 30000 scientific papers, with citation information
• Important files: papers (ID-name, link, author…..), citations (ID-cited ID), classifications (link-category)
• Directory: extractions (PostScript form of the papers)
• Problems with the data:
• Cited paper not in the corpus
• No abstract for some PostScript files
• Too many categories
• Duplicated or isolated papers
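One way to deal with the citation problems listed above is to filter the citation pairs before building the network. The sketch below works on in-memory (citing ID, cited ID) pairs and does not parse the actual Cora files. Dropping out-of-corpus citations, duplicate pairs, self-citations, and isolated papers is an assumption about the cleaning policy, and the function name `clean_citation_graph` is hypothetical.

```python
def clean_citation_graph(paper_ids, citations):
    """Filter citation pairs down to a usable network.

    paper_ids: iterable of paper IDs kept in the corpus (duplicates allowed).
    citations: iterable of (citing_id, cited_id) pairs.
    Returns (kept_ids, edges): the sorted IDs that still have at least one
    edge, and the sorted, de-duplicated edges between corpus papers.
    """
    corpus = set(paper_ids)  # a set also collapses duplicated paper IDs
    edges = set()
    for citing, cited in citations:
        # Drop citations that point outside the corpus, and self-citations.
        if citing in corpus and cited in corpus and citing != cited:
            edges.add((citing, cited))
    # Papers with no surviving edge are isolated and are dropped here.
    connected = {paper for edge in edges for paper in edge}
    return sorted(connected), sorted(edges)
```

For example, with papers `[1, 2, 3, 4]` and citations `[(1, 2), (1, 2), (2, 9)]`, only the edge `(1, 2)` and the papers `1` and `2` remain.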
Corpus
• Cora Data version 1.0
• Papers in the category Machine Learning
• About 2700 papers
• 1400 frequent words (stop words removed, stemmed)
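The vocabulary step (stop-word removal, stemming, keeping the most frequent words) can be sketched as follows. The stop-word list here is a tiny placeholder, and `crude_stem` is a deliberately rough stand-in for a real stemmer such as Porter's. Neither is taken from the project, and the function names are hypothetical.

```python
import re
from collections import Counter

# Placeholder stop-word list; a real run would use a much fuller one.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "for", "we", "on"}


def crude_stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs, vocab_size):
    """Return (vocab, matrix) with matrix[d][i] = count of vocab[i] in docs[d].

    The vocabulary is the vocab_size most frequent stems across all documents.
    """
    doc_tokens = [tokenize(doc) for doc in docs]
    freq = Counter(t for tokens in doc_tokens for t in tokens)
    vocab = [word for word, _ in freq.most_common(vocab_size)]
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for tokens in doc_tokens:
        row = [0] * len(vocab)
        for t in tokens:
            if t in index:
                row[index[t]] += 1
        matrix.append(row)
    return vocab, matrix
```

Calling `bag_of_words(abstracts, 1400)` on the paper abstracts would give a count matrix of the size described above, which is the input shape a PLSA fit expects.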
Results
• Overall: Accuracy (A), Accuracy (B), Recall
• Accuracy and recall for each category
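The slide does not define Accuracy (A) and Accuracy (B), nor how topics are mapped to categories, so the sketch below does not try to reproduce them. It only shows the generic computation of overall accuracy and per-category recall, assuming each paper already has one true category label and one predicted label. The function name `accuracy_and_recall` is hypothetical.

```python
from collections import Counter


def accuracy_and_recall(true_labels, predicted_labels):
    """Return (overall accuracy, {category: recall}) for paired label lists."""
    correct = Counter()
    total = Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    # Overall accuracy: fraction of papers whose predicted label is right.
    accuracy = sum(correct.values()) / len(true_labels)
    # Recall per category: fraction of that category's papers that were found.
    recall = {category: correct[category] / total[category] for category in total}
    return accuracy, recall
```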
Evaluation
• Justified the claim that adding network structure to the model can improve the result of topic modeling
• Modeled the network at the level of individual articles
• Inherent problems exist in the chosen framework
• The result is still far from satisfactory
Future work
• How to model the network structure of blog articles, especially when modeling it at the level of individual articles
• Bag-of-words matrix extraction
• A better integrated model, maybe LDA-based
• Efficiency of the algorithm
• Recommendation based on topic community discovery