Project 2: Latent Dirichlet Allocation
2014/4/29
Beom-Jin Lee
Data Selection
• Enron Email Dataset
  • http://www.cs.cmu.edu/~enron/
• NIPS 1-17 data
  • http://ai.stanford.edu/~gal/data.html
  • http://www.cs.nyu.edu/~roweis/data.html
• Datahub (Wikipedia Data, Wikinews, etc.)
  • http://datahub.io/en/dataset
• Reuters Corpora (RCV1, RCV2, TRC2)
  • http://trec.nist.gov/data/reuters/reuters.html
• Newsgroup data
  • http://www.infochimps.com/datasets/20-newsgroups-dataset-de-duped-version
• Company Datasets
  • http://endb-consolidated.aihit.com/datasets.htm
• Twitter Data
  • http://snap.stanford.edu/data/twitter7.html
Methodology
• Original Paper
  • David M. Blei, Andrew Y. Ng, Michael I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research 3, 993-1022, 2003
• Toolbox
  • http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
• Help
  • http://www.4four.us/article/2010/11/latent-dirichlet-allocation-simply
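For orientation, here is a minimal sketch of fitting LDA to a tiny corpus. It uses collapsed Gibbs sampling, which is a different inference scheme from the variational method in the original paper, and it is not the toolbox's code. The function name `lda_gibbs`, the toy documents, and the hyperparameter values (`alpha`, `beta`, `iters`) are all assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents.

    Illustrative sketch only: hyperparameters are arbitrary assumptions.
    Returns (vocab, theta, phi) where theta[d][k] is the estimated topic
    mixture of document d and phi[k][v] is the estimated word distribution
    of topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables used by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]        # n(d, k)
    topic_word = [[0] * V for _ in range(num_topics)]   # n(k, v)
    topic_total = [0] * num_topics                      # n(k)

    # Assign every token to a random topic to start.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token
                # (unnormalized; random.choices normalizes the weights).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed point estimates from the final counts.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + V * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return vocab, theta, phi


# Toy corpus (made up for this example, not taken from any dataset above).
docs = [
    "stock market price trade".split(),
    "market trade price stock stock".split(),
    "neural network learning model".split(),
    "learning model network neural neural".split(),
]
vocab, theta, phi = lda_gibbs(docs, num_topics=2)

# Show the three most probable words of each topic.
for k in range(2):
    top = sorted(range(len(vocab)), key=lambda v: -phi[k][v])[:3]
    print("topic", k, [vocab[v] for v in top])
```

For any of the corpora listed under Data Selection, the same function would need a tokenization and stop-word removal step in front of it, and a pure-Python sampler like this is far too slow for the larger datasets.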
Evaluation Criteria
• Baseline
  • Data selection, data inspection, methodology report, results from using LDA
• Plus points
  • Big-data processing methods (Wikipedia, Wall Street Journal, etc.)
  • Comparison with other kinds of models
  • Improvements to LDA