Content Classification Analysis based on LDA Topic Model

Project leader: Hongbo Zhao

• Web crawler: acquiring web news; Chinese parsing and extraction
• Advanced TF-IDF: content processing; adding content-based tests; finding the best parameters on small data
• Testing parameters: testing on big data; comparing with the content-based algorithm

Web crawler
• Acquired nearly 17,000 web news articles from the Sougou Database
• The raw articles include HTML characters and other insignificant content

Web crawler
• Used ICTCLAS to segment the text and extract Chinese words, excluding stop words, conjunctions, prepositions and numerals
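The filtering step above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the segmenter has already produced (word, tag) pairs, and the tag letters `c`, `p` and `m` for conjunctions, prepositions and numerals are an assumption about the tag set in use. The function name `filter_tokens` and the sample data are hypothetical.

```python
# Assumed part-of-speech tags to drop: conjunction, preposition, numeral.
EXCLUDED_TAGS = {"c", "p", "m"}


def filter_tokens(tagged_tokens, stop_words):
    """Keep words that are not stop words and whose tag is not excluded."""
    return [
        word
        for word, tag in tagged_tokens
        if word not in stop_words and tag not in EXCLUDED_TAGS
    ]


# Toy input standing in for segmenter output (hypothetical data).
tagged = [("新闻", "n"), ("和", "c"), ("在", "p"), ("三", "m"), ("的", "u"), ("分类", "v")]
stop_words = {"的"}
kept = filter_tokens(tagged, stop_words)
print(kept)
```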

Advanced TF-IDF
• Each news article is split into TITLE, BEGIN, CONTENT and END sections, each with a different weight
• TF-IDF is used to compute the top 5 keywords; the accuracy is 81% compared with the pre-sorted database
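One way to read the section-weighted scheme above is sketched below. The weight values are hypothetical (the slides do not state them), and the smoothed IDF formula is an assumption chosen so that no term gets a zero score; the actual formula used in the project may differ.

```python
import math
from collections import Counter

# Hypothetical section weights; the slides do not give the actual values.
SECTION_WEIGHTS = {"TITLE": 3.0, "BEGIN": 2.0, "CONTENT": 1.0, "END": 2.0}


def weighted_tf(doc):
    """doc maps a section name to its token list; returns weighted term counts."""
    tf = Counter()
    for section, tokens in doc.items():
        for tok in tokens:
            tf[tok] += SECTION_WEIGHTS[section]
    return tf


def top_keywords(docs, index, k=5):
    """Rank the terms of docs[index] by weighted TF times a smoothed IDF."""
    n = len(docs)
    df = Counter()
    for d in docs:
        # Count each term once per document, whatever section it appears in.
        df.update({tok for tokens in d.values() for tok in tokens})
    tf = weighted_tf(docs[index])
    scores = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (hypothetical data).
docs = [
    {"TITLE": ["football", "final"], "BEGIN": ["team", "football"],
     "CONTENT": ["match", "team", "news"], "END": ["football"]},
    {"TITLE": ["stock", "market"], "BEGIN": ["market", "news"],
     "CONTENT": ["price", "news"], "END": ["stock"]},
]
keywords = top_keywords(docs, 0)
print(keywords)
```

A term that appears in the title counts more than the same term in the body, so title words tend to rise to the top of the keyword list.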

Advanced TF-IDF
• Adding the content-based algorithm: the accuracy only moves from 81% to 82% as the semantic weight goes from 1.0 to 0.0, which is not a significant change
• We conclude that semantics does not help in this setting

Advanced TF-IDF
• Searching for the best parameters on small data (fewer than 2,000 news articles), considering accuracy and time efficiency
• Testing set = 30% of the whole data
• Training set = 70% of the whole data
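The 70/30 split above can be sketched as below. The shuffling and the fixed seed are assumptions added for reproducibility; the slides do not say how articles were assigned to each set, and `split_dataset` is a hypothetical helper name.

```python
import random


def split_dataset(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Article ids 0..1999 stand in for a small data set (hypothetical data).
train, test = split_dataset(range(2000))
print(len(train), len(test))
```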

Advanced TF-IDF
• Using the same number of keywords in the training sets as in the testing sets
• Result: unstable

Advanced TF-IDF
• Using all keywords in the training sets
• Result: extremely slow

Advanced TF-IDF
• Using all keywords in the testing sets
• With 10 keywords in the training sets, accuracy, error score and time efficiency give the best results

Testing parameters
• Testing on big data: as the training set in every section increases gradually to 200, 450, 750 and finally 1343 (all words), the accuracy changes as shown in the figure
• The final accuracy reaches 82.5%, or 85.1% excluding the culture section
• The results support the parameters we selected
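The two accuracy figures above differ only in which sections are counted. A minimal sketch of that calculation follows; the records are toy data and do not reproduce the reported 82.5% or 85.1%, and the `accuracy` helper is hypothetical.

```python
def accuracy(records, exclude=()):
    """Fraction of (true_label, predicted_label) pairs that match.

    Records whose true label is in `exclude` are left out of the count.
    """
    kept = [(truth, pred) for truth, pred in records if truth not in exclude]
    if not kept:
        raise ValueError("no records left after exclusion")
    return sum(truth == pred for truth, pred in kept) / len(kept)


# Toy (true label, predicted label) pairs (hypothetical data).
records = [
    ("sports", "sports"), ("sports", "sports"),
    ("finance", "finance"), ("finance", "sports"),
    ("culture", "sports"), ("culture", "culture"),
]
overall = accuracy(records)
without_culture = accuracy(records, exclude={"culture"})
print(overall, without_culture)
```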

Testing parameters
• Compared with the content-based algorithm, the accuracy is higher; however, the time efficiency is lower

Summary
• Partial encoding and decoding problems
• Errors in keyword parsing lead to classification faults
• Partially repeated passages lead to errors in accuracy
• A successful algorithm in general

Thanks
