Content Classification Analysis based on LDA Topic Model. Project leader: Hongbo Zhao. Topics: web crawler, advanced TF-IDF, testing parameters.
Content Classification Analysis based on LDA Topic Model
Project leader: Hongbo Zhao
Content Classification Analysis based on LDA Topic Model
• Web crawler: collecting web news; Chinese parsing & extracting
• Advanced TF-IDF: content processing; adding content-based tests; finding the best parameters on small data
• Testing parameters: testing on big data; comparing to the content-based algorithm
Web crawler
• Collected nearly 17,000 web news articles from the Sougou Database
• The raw articles include HTML characters and other insignificant content
Web crawler
• Using ICTCLAS to parse and extract Chinese words, excluding stop words, conjunctions, prepositions and numerals
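The filtering step described on this slide can be sketched as follows. This is only an illustration, not the project's code: it assumes the words have already been segmented and tagged (for example by ICTCLAS) into (word, tag) pairs, and it assumes a PKU-style tag set in which conjunction, preposition and numeral tags begin with "c", "p" and "m". The stop-word list and the function name filter_tokens are hypothetical.

```python
# Assumed input: tokens already segmented and POS-tagged as (word, tag) pairs.
# The tag prefixes below follow a PKU-style tag set and are an assumption.
STOP_WORDS = {"的", "了", "是"}            # tiny illustrative stop-word list
EXCLUDED_TAG_PREFIXES = ("c", "p", "m")    # conjunction, preposition, numeral


def filter_tokens(tagged_tokens):
    """Keep only words that are neither stop words nor excluded word classes."""
    kept = []
    for word, tag in tagged_tokens:
        if word in STOP_WORDS:
            continue
        if tag.startswith(EXCLUDED_TAG_PREFIXES):
            continue
        kept.append(word)
    return kept


sample = [("北京", "ns"), ("和", "c"), ("上海", "ns"), ("的", "u"),
          ("在", "p"), ("三", "m"), ("公司", "n")]
print(filter_tokens(sample))
```

Only the place names and the noun survive in this sample; the conjunction, preposition, numeral and stop word are dropped.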
Advanced TF-IDF
• Splitting each news article into TITLE, BEGIN, CONTENT and END sections, each with a different weight
• Using TF-IDF to compute the top 5 keywords; the accuracy is 81% compared to the sorted database
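A minimal sketch of section-weighted TF-IDF keyword extraction, under stated assumptions. The slide does not give the section weights or the exact TF-IDF variant, so the weights in SECTION_WEIGHTS, the smoothed IDF formula and all function names here are hypothetical.

```python
import math
from collections import Counter

# Assumed weights; the slide only says the four sections are weighted differently.
SECTION_WEIGHTS = {"TITLE": 3.0, "BEGIN": 2.0, "CONTENT": 1.0, "END": 1.5}


def weighted_tf(doc):
    """Term frequency where each occurrence counts by its section weight."""
    tf = Counter()
    for section, words in doc.items():
        for word in words:
            tf[word] += SECTION_WEIGHTS[section]
    total = sum(tf.values())
    return {word: count / total for word, count in tf.items()}


def idf_table(docs):
    """Smoothed inverse document frequency over a list of documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update({word for words in doc.values() for word in words})
    return {word: math.log((1 + n) / (1 + d)) + 1.0 for word, d in df.items()}


def top_keywords(doc, idf, k=5):
    """Return the k highest-scoring words; ties are broken alphabetically."""
    tf = weighted_tf(doc)
    scores = {word: tf[word] * idf.get(word, 1.0) for word in tf}
    return sorted(scores, key=lambda word: (-scores[word], word))[:k]


docs = [
    {"TITLE": ["football", "match"], "BEGIN": ["team", "wins"],
     "CONTENT": ["the", "team", "played", "football"], "END": ["score"]},
    {"TITLE": ["stock", "market"], "BEGIN": ["shares", "rise"],
     "CONTENT": ["the", "market", "gained"], "END": ["shares"]},
]
idf = idf_table(docs)
print(top_keywords(docs[0], idf))
```

A word that appears in the title contributes more than the same word in the body, which is the effect the section weights are meant to produce.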
Advanced TF-IDF
• Adding a content-based algorithm: the accuracy moves only from 81% to 82% as the semantic weight goes from 1.0 to 0.0, which is not a significant change. We conclude that semantic information is not useful in this setting.
Advanced TF-IDF
• Testing for the best parameters on small data (fewer than 2,000 news articles), considering accuracy and time efficiency
• testing sets = 30% of the whole data
• training sets = 70% of the whole data
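The 70% / 30% split can be sketched as below. The shuffling, the fixed seed and the function name split_train_test are assumptions added for illustration; the slide only states the proportions.

```python
import random


def split_train_test(items, train_fraction=0.7, seed=0):
    """Shuffle the items reproducibly and split them into training and testing sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test(range(2000))
print(len(train), len(test))
```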
Advanced TF-IDF
• Using as many keywords in the training sets as in the testing sets
• Result: unstable
Advanced TF-IDF
• Using all keywords in the training sets
• Result: extremely low speed
Advanced TF-IDF
• Using all keywords in the testing sets
• When using 10 keywords in the training sets, the accuracy, error score and time efficiency are the best
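One way to picture the setting on this slide (10 keywords per training article, all keywords per testing article) is a simple keyword-overlap classifier. The slides do not describe the actual classification rule, so this is a sketch under that assumption; build_profiles and classify are hypothetical names.

```python
from collections import Counter, defaultdict


def build_profiles(training_docs, k=10):
    """Count, per category, the top-k keywords of each training article.

    training_docs is a list of (category, ranked_keywords) pairs.
    """
    profiles = defaultdict(Counter)
    for category, keywords in training_docs:
        profiles[category].update(keywords[:k])
    return profiles


def classify(keywords, profiles):
    """Pick the category whose profile overlaps most with all test keywords."""
    def score(category):
        return sum(profiles[category][word] for word in keywords)
    # Sorting first makes ties resolve to the alphabetically first category.
    return max(sorted(profiles), key=score)


training = [
    ("sports", ["football", "team", "match"]),
    ("finance", ["stock", "market", "shares"]),
]
profiles = build_profiles(training, k=10)
print(classify(["team", "match", "score"], profiles))
```

Limiting k bounds the size of each category profile, which is consistent with the speed problem the slides report when all training keywords are used.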
Testing parameters
• Testing on big data: when the training set in every section increases gradually to 200, 450, 750 and finally 1343 (all words), the accuracy is as shown in the figure. The final accuracy reaches 82.5%, or 85.1% excluding the culture section. The results support the parameters we selected.
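Reporting accuracy both overall and with one section left out, as done for the culture section above, can be sketched like this. The toy results list is invented purely to show the calculation and does not reproduce the reported figures.

```python
def accuracy(results, exclude=()):
    """Fraction of correct predictions, skipping articles whose true label is excluded.

    results is a list of (true_label, predicted_label) pairs.
    """
    kept = [(true, pred) for true, pred in results if true not in exclude]
    if not kept:
        raise ValueError("no results left after exclusion")
    return sum(true == pred for true, pred in kept) / len(kept)


# Toy data for illustration only.
results = [
    ("sports", "sports"),
    ("culture", "sports"),
    ("finance", "finance"),
    ("culture", "culture"),
]
print(accuracy(results))
print(accuracy(results, exclude={"culture"}))
```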
Testing parameters
• Compared to the content-based algorithm, the accuracy is greater; however, the time efficiency is lower
Summary
• Partial encoding & decoding problems
• Errors in keyword parsing lead to classification faults
• Partially repeated passages lead to errors in accuracy
• A successful algorithm in general
Thanks Content Classification Analysis based on LDA Topic Model