Correlating Language Model Clusters of News Agencies and Political Campaigns with Political Biases
CS224N, Winter 2011
Yaron Friedman, Issao Fujiwara
Scraping the web is hard!
wget -r -l inf -D <domain> <target> -R w -o log-<source>
HTML is unstructured and inconsistent
Webpages are littered with ads and comments
The internet is full of spam!
Filtering heuristics based on HTML markup structure and text properties such as word count
Share of raw HTML that was usable English text varied from ~75% to 99%
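The filtering heuristics are not spelled out on the slide, so the following is only a minimal sketch of the general idea, assuming a simple rule: split the page into block-level elements, drop script and style content, and keep only blocks with at least `min_words` words. The class name `BlockTextExtractor`, the function `extract_usable_text`, the tag lists, and the threshold are all hypothetical choices for illustration, not the authors' actual rules.

```python
from html.parser import HTMLParser


class BlockTextExtractor(HTMLParser):
    """Collect text per block-level element, skipping script/style content."""

    SKIP = {"script", "style", "noscript"}  # assumed non-content tags
    BLOCK = {"p", "div", "li", "td", "h1", "h2", "h3", "blockquote", "br"}

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished text blocks, in document order
        self._buf = []         # text fragments of the block being built
        self._skip_depth = 0   # > 0 while inside a skipped element

    def flush(self):
        # Collapse whitespace and store the current block if it is non-empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)


def extract_usable_text(html, min_words=10):
    """Return text blocks with at least min_words words (assumed heuristic)."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text outside a closed block
    return [b for b in parser.blocks if len(b.split()) >= min_words]
```

A word-count threshold like this tends to drop short ad blurbs, navigation links, and one-line comments while keeping article paragraphs; a real pipeline would likely combine it with markup-based rules per site.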
Comparing and clustering corpora
Compute the perplexity of each corpus under the language model trained on every other corpus
Normalize the perplexities for each model
Apply hierarchical clustering
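The three steps above can be sketched end to end. The slides do not say which language model, normalization, or linkage was used, so this example makes three loudly stated assumptions: an add-one-smoothed unigram model stands in for the real language model, each model's perplexities are normalized by dividing by that model's perplexity on its own training corpus, and clusters are merged by average linkage. All function names are hypothetical.

```python
import math
from collections import Counter


def perplexity(counts, tokens, vocab_size):
    """Perplexity of tokens under an add-one-smoothed unigram model (assumed LM)."""
    total = sum(counts.values())
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return math.exp(-log_prob / len(tokens))


def cross_perplexity(corpora):
    """Return (names, matrix); matrix[m][c] is corpus c's perplexity under model m."""
    names = sorted(corpora)
    vocab = {t for toks in corpora.values() for t in toks}
    models = {n: Counter(corpora[n]) for n in names}
    matrix = [
        [perplexity(models[m], corpora[c], len(vocab)) for c in names]
        for m in names
    ]
    return names, matrix


def normalize_rows(matrix):
    """Assumed normalization: divide each model's row by its self-perplexity."""
    return [[v / row[i] for v in row] for i, row in enumerate(matrix)]


def symmetric_distance(norm):
    """Average the two directed scores so that d[i][j] == d[j][i]."""
    n = len(norm)
    return [[(norm[i][j] + norm[j][i]) / 2 for j in range(n)] for i in range(n)]


def average_linkage(names, dist):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((
            sorted(names[i] for i in clusters[a]),
            sorted(names[i] for i in clusters[b]),
            d,
        ))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


def cluster_corpora(corpora):
    """Run the full pipeline on a dict mapping source name to token list."""
    names, matrix = cross_perplexity(corpora)
    dist = symmetric_distance(normalize_rows(matrix))
    return average_linkage(names, dist)
```

Calling `cluster_corpora` with a dict of tokenized corpora returns the merge order, which can be read as a dendrogram: sources merged early have language models that predict each other's text well. Every corpus must be non-empty, since perplexity divides by the token count.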