Content Classification Analysis based on LDA Topic Model

Project leader: Hongbo Zhao. Web crawler: obtaining web news, Chinese parsing & extracting. Advanced TF-IDF: content processing. Testing parameters.

bobby

Presentation Transcript


  1. Content Classification Analysis based on LDA Topic Model • Project leader: Hongbo Zhao

  2. Content Classification Analysis based on LDA Topic Model • Web crawler: obtaining web news; Chinese parsing & extracting • Advanced TF-IDF: content processing; adding content-based tests; finding the best parameters in small data • Testing parameters: testing in big data; comparing to the content-based algorithm

  3. Web crawler • Obtained nearly 17,000 web news articles through the Sougou Database • The raw articles include insignificant HTML characters

  4. Web crawler • Using ICTCLAS to parse and extract Chinese words, excluding stop words, conjunctions, prepositions and numerals
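The filtering step on this slide can be illustrated with a minimal sketch. It assumes the segmenter has already produced (word, part-of-speech) pairs; the tag names `c`, `p`, `m` and the tiny stop-word list are illustrative assumptions, not ICTCLAS's confirmed output or the project's actual lists.

```python
# Assumed tag names: "c" conjunction, "p" preposition, "m" numeral.
# The real ICTCLAS tag set and the project's stop-word list may differ.
EXCLUDED_POS = {"c", "p", "m"}
STOP_WORDS = {"的", "了", "是"}  # tiny illustrative list


def filter_tokens(tagged_tokens):
    """Keep words that are not stop words and whose tag is not excluded."""
    return [
        word
        for word, pos in tagged_tokens
        if word not in STOP_WORDS and pos not in EXCLUDED_POS
    ]


# Hypothetical segmenter output for a short phrase.
sample = [
    ("北京", "ns"), ("和", "c"), ("上海", "ns"), ("的", "u"),
    ("三", "m"), ("在", "p"), ("新闻", "n"),
]
kept = filter_tokens(sample)
print(kept)
```

Only the place names and the noun survive in this toy input; everything else is dropped either by tag or by the stop-word list.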

  5. Advanced TF-IDF • Splitting each news article into TITLE, BEGIN, CONTENT and END sections with different weights • Using TF-IDF to calculate the top 5 keywords; the accuracy is 81% compared to the sorted database
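One way to read the section-weighted TF-IDF on this slide is sketched below. The weight values, the smoothed IDF formula, and the function names are assumptions for illustration; the slides do not state the actual weights or formula.

```python
import math
from collections import Counter

# Hypothetical section weights; the slides do not give the real values.
SECTION_WEIGHTS = {"TITLE": 3.0, "BEGIN": 2.0, "CONTENT": 1.0, "END": 1.5}


def weighted_tf(doc):
    """doc maps a section name to its token list; returns weighted term counts."""
    tf = Counter()
    for section, tokens in doc.items():
        for token in tokens:
            tf[token] += SECTION_WEIGHTS[section]
    return tf


def top_keywords(docs, index, k=5):
    """Rank the terms of docs[index] by weighted TF times a smoothed IDF."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update({t for tokens in doc.values() for t in tokens})
    tf = weighted_tf(docs[index])
    scores = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus, not the project's data.
docs = [
    {"TITLE": ["football", "match"], "BEGIN": ["team"],
     "CONTENT": ["football", "goal", "news"], "END": ["news"]},
    {"TITLE": ["stock", "market"], "BEGIN": ["news"],
     "CONTENT": ["stock", "price"], "END": ["market"]},
]
top = top_keywords(docs, 0, k=2)
print(top)
```

In this toy corpus, a word that appears in the TITLE outranks one that appears only in CONTENT, which is the effect the different section weights are meant to produce.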

  6. Advanced TF-IDF • Adding a content-based algorithm: the accuracy moves only from 81% to 82% as the semantic weight goes from 1.0 to 0.0, so there is no significant change. We conclude that semantics is useless in this circumstance.
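The slides do not say how the semantic weight enters the score. A common choice, assumed here purely for illustration, is a linear blend of a TF-IDF score and a content-based (semantic) score; `blend_scores` and its arguments are hypothetical names.

```python
def blend_scores(tfidf_score, semantic_score, semantic_weight):
    """Linear blend (an assumption): weight 1.0 is purely semantic, 0.0 purely TF-IDF."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0.0 and 1.0")
    return semantic_weight * semantic_score + (1.0 - semantic_weight) * tfidf_score


# Sweep the weight over a few values for one made-up pair of scores.
for weight in (1.0, 0.5, 0.0):
    print(weight, blend_scores(tfidf_score=0.8, semantic_score=0.2, semantic_weight=weight))
```

Under this reading, the experiment on the slide amounts to sweeping `semantic_weight` from 1.0 down to 0.0 and re-measuring classification accuracy at each step.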

  7. Advanced TF-IDF • Testing for the best parameters on small data (fewer than 2,000 news articles), considering accuracy and time efficiency • testing set = 30% of the whole data • training set = 70% of the whole data
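The 70/30 split can be reproduced with a short sketch like the following. The shuffling, the fixed seed, and the function name are assumptions; the slides only give the two proportions.

```python
import random


def split_train_test(items, test_ratio=0.3, seed=0):
    """Shuffle a copy of items and return (training set, testing set)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder article ids standing in for a small data set.
articles = list(range(1000))
train, test = split_train_test(articles)
print(len(train), len(test))
```

Shuffling before splitting avoids putting whole categories into only one of the two sets when the source data is ordered by category.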

  8. Advanced TF-IDF • The keywords in the training sets equal those in the testing sets • Result: unstable

  9. Advanced TF-IDF • Using all keywords in the training sets • Result: extremely low speed

  10. Advanced TF-IDF • Using all keywords in the testing sets • When using 10 keywords in the training sets, the accuracy, error score and time efficiency are the best

  11. Testing parameters • Testing on big data: when the training set in every section increases gradually to 200, 450, 750 and finally 1343 (all words), the accuracy is as shown in the figure. The final accuracy reaches 82.5%, or 85.1% excluding the culture section. The results support the parameters we selected.
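Reporting accuracy both overall and with one section left out, as this slide does for the culture section, could be computed along these lines. The labels below are toy values, not the project's results, and the `accuracy` helper is a hypothetical name.

```python
def accuracy(true_labels, predicted_labels, exclude=None):
    """Share of correct predictions, optionally skipping one true category."""
    pairs = [
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != exclude
    ]
    if not pairs:
        raise ValueError("no samples left after exclusion")
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy labels for illustration only.
truth = ["sports", "culture", "finance", "culture", "sports"]
guess = ["sports", "finance", "finance", "culture", "sports"]
overall = accuracy(truth, guess)
without_culture = accuracy(truth, guess, exclude="culture")
print(overall, without_culture)
```

Excluding a category removes both its correct and incorrect predictions from the count, so the figure rises only if that category was classified worse than the rest.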

  12. Testing parameters • Compared to the content-based algorithm, the accuracy is greater; however, the time efficiency is lower

  13. Summary • Partial encoding & decoding problems • Errors in keyword parsing lead to classification faults • Partially repeated passages lead to errors in accuracy • Successful algorithm in general

  14. Thanks Content Classification Analysis based on LDA Topic Model
