LDA-based Dark Web Analysis
Outline • What is the Dark Web? • Why do we need to analyze it? • How to analyze the Dark Web: Our Strategy • Web Crawling • Topic Discovery based on Latent Dirichlet Allocation (LDA) • Optimization Process • Conclusion
What is the Dark Web? • The Web is a global information platform accessible from many locations. • It is a fast way to spread information anonymously and with few regulations. • Its cost is relatively low compared with other media. • The Dark Web is where terrorist/extremist organizations and their sympathizers • exchange ideology • spread propaganda • recruit members • plan attacks • An example of a Dark Web site: www.natall.com
Why do we need to analyze it? • To find the hidden topics in the Dark Web community, whose content is: • embedded in other large-scale on-line web sites • overloaded with information • multi-lingual
How to analyze the Dark Web: architecture of our strategy • GS: Gibbs Sampling – a random walk through the sample space used to approximate the target distribution • LDA: Latent Dirichlet Allocation
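The Gibbs-sampling step can be sketched as a collapsed Gibbs sampler for LDA: each word's topic is repeatedly resampled from its conditional distribution given all other assignments, a random walk over the space of topic assignments. This is a minimal stdlib-only sketch under standard LDA assumptions, not the GibbsLDA implementation itself; all names are illustrative.

```python
import random

def gibbs_lda(docs, K, V, alpha, beta, iters):
    """Collapsed Gibbs sampling for LDA.
    docs: list of documents, each a list of word indices < V.
    Returns the topic assignment z[d][i] for every word."""
    nkw = [[0] * V for _ in range(K)]   # topic-word counts
    ndk = [[0] * K for _ in docs]       # document-topic counts
    nk = [0] * K                        # words per topic
    z = []
    # random initialization of topic assignments
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = random.randrange(K)
            nkw[k][w] += 1; ndk[d][k] += 1; nk[k] += 1
            zs.append(k)
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the word's current assignment from the counts
                nkw[k][w] -= 1; ndk[d][k] -= 1; nk[k] -= 1
                # conditional p(z = k | rest), up to a constant
                p = [(nkw[k][w] + beta) / (nk[k] + V * beta)
                     * (ndk[d][k] + alpha) for k in range(K)]
                # draw a new topic proportional to p
                r = random.random() * sum(p)
                k, acc = 0, p[0]
                while r >= acc and k < K - 1:
                    k += 1; acc += p[k]
                nkw[k][w] += 1; ndk[d][k] += 1; nk[k] += 1
                z[d][i] = k
    return z
```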
How to analyze the Dark Web: architecture of our strategy • Use a web crawler to download text-based documents • Prune them by removing: • all HTML tags • irrelevant content such as images and navigation instructions • Format them into a plain text file F F := header {doc} header := a line containing the number of documents doc := {term} • Feed the text file to the GibbsLDA analyzer to discover the latent topics • Optimize the topic discovery
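The pruning and formatting steps above can be sketched as follows. This is a minimal sketch: `to_gibbslda_input` is a hypothetical helper name, the regex-based tag stripping is a stand-in for real HTML parsing, and real pruning would also drop navigation text and other boilerplate.

```python
import re

def to_gibbslda_input(pages):
    """Render crawled HTML pages as a corpus string in the format
    described above: a header line containing the number of
    documents, then one whitespace-separated document per line."""
    docs = []
    for html in pages:
        text = re.sub(r"<[^>]+>", " ", html)          # prune HTML tags
        terms = re.findall(r"[a-z]+", text.lower())   # keep word tokens only
        docs.append(" ".join(terms))
    return "\n".join([str(len(docs))] + docs) + "\n"
```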
Criteria for selecting a web crawler • Able to parse ill-coded web pages • Supports parameterized URLs • Flexible enough to handle different web site structures • The downloaded pages will be read by machines rather than humans, so the output must be normalized into a well-formatted, readable text corpus • Easy to maintain and light on hardware resources • Does not need to be especially fast • Must not introduce intellectual property problems
Topic discovery based on LDA • LDA is an Information Retrieval (IR) technique • Information Retrieval (IR) • reduces information overload • preserves the essential statistical relationships • Basic and traditional IR methods • tf-idf scheme: term-count pairs => term-by-document matrix • LSI (Latent Semantic Indexing) • pLSI (probabilistic LSI) • Clustering: divides the data set into subsets
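The tf-idf scheme above can be sketched as follows: per-document term counts are weighted by inverse document frequency to build the term-by-document matrix. A minimal sketch with a hypothetical `tfidf_matrix` helper; production systems use optimized sparse implementations.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a term-by-document tf-idf matrix from tokenized documents.
    Returns the sorted vocabulary and a dict mapping each term to its
    row of weights (one entry per document)."""
    M = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set(t for doc in docs for t in doc))
    # document frequency: number of documents containing each term
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    matrix = {}
    for t in vocab:
        idf = math.log(M / df[t])                 # rarer terms weigh more
        matrix[t] = [c[t] * idf for c in counts]  # tf * idf per document
    return vocab, matrix
```

A term that appears in every document gets idf = log(1) = 0, so uninformative terms vanish from the matrix.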
Dirichlet Distribution • a multivariate generalization of the Beta distribution, defined over points on the probability simplex
Beta Distribution • a continuous probability distribution whose probability density function (pdf) is defined on the interval [0, 1]
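The relationship between the two distributions can be checked numerically: for K = 2 the Dirichlet density reduces exactly to the Beta density. A small stdlib-only sketch; function names are illustrative.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in [0, 1]."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

def dirichlet_pdf(xs, alphas):
    """Density of Dirichlet(alphas) at a point xs on the simplex
    (the components of xs must sum to 1)."""
    B = math.prod(math.gamma(a) for a in alphas) / math.gamma(sum(alphas))
    p = 1.0
    for x, a in zip(xs, alphas):
        p *= x ** (a - 1)
    return p / B

# For K = 2, dirichlet_pdf([x, 1 - x], [a, b]) equals beta_pdf(x, a, b).
```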
LDA graph • corpus level: • α: Dirichlet prior hyper-parameter on the topic mixing proportions • β: Dirichlet prior hyper-parameter on the mixture component distributions • M: number of documents • document level: • θ: the document's topic mixing proportion • φ: the mixture component distributions of the document • N: number of words in a document • word level: • z: hidden topic variable • w: observed word variable [H. Zhang et al., 2007]
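The graphical model above corresponds to LDA's generative process: draw a topic mixture θ from Dir(α) for the document, then for each word draw a hidden topic z from θ and a word w from that topic's word distribution. A minimal stdlib-only sketch under standard LDA assumptions; names are illustrative, and the topic-word distributions `phi` are taken as given rather than drawn from Dir(β).

```python
import random

def sample_dirichlet(alphas):
    """Draw from Dirichlet(alphas) via normalized Gamma draws."""
    gs = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(gs)
    return [g / s for g in gs]

def sample_categorical(probs):
    """Draw an index i with probability probs[i]."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_document(n_words, alpha, phi):
    """LDA generative process for one document:
    theta ~ Dir(alpha); for each word: z ~ Mult(theta), w ~ Mult(phi[z])."""
    theta = sample_dirichlet(alpha)        # document's topic mixture
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta)      # hidden topic variable
        w = sample_categorical(phi[z])     # observed word index
        words.append(w)
    return words
```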
LDA vs. Clustering • Clustering simply partitions the corpus; each document belongs to exactly one category • LDA-based analysis allows one document to be classified into several categories because of its hierarchical structure
Optimizing the results (1) • LDA does not know how many topics there should be; this value is set by the user • However, we can evaluate multiple "wild guesses" and choose the best one • f(x) is the number of documents that contain the word x • f(y) is the number of documents that contain the word y • f(x,y) is the number of documents that contain both word x and word y • M is the total number of documents
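The slide's actual distance formula is not reproduced in the text, but one common word distance built from exactly these four quantities is the normalized Google distance, sketched here as a plausible stand-in rather than the authors' definition:

```python
import math

def word_distance(fx, fy, fxy, M):
    """Normalized Google distance between words x and y, computed
    from f(x), f(y), f(x,y) and the corpus size M.
    NOTE: shown as one plausible co-occurrence distance; the
    original slide's formula may differ."""
    if fxy == 0:
        return float("inf")   # never co-occur: maximally distant
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(M) - min(lx, ly))
```

Words that always co-occur get distance 0; words that never co-occur get infinite distance.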
Optimizing the results (2) • For each candidate topic count, compute the average distance between the words of each topic, and pick the run that minimizes it.
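The selection step can be sketched as follows: run LDA once per candidate topic count, then keep the run whose topics have the smallest average pairwise word distance. `best_topic_count` is a hypothetical helper, and `dist` is any word-distance function, e.g. one built from the document counts on the previous slide.

```python
def best_topic_count(runs, dist):
    """runs maps each candidate topic count to its LDA output:
    a list of topics, each topic a list of its top words.
    Returns the topic count with the lowest average
    pairwise word distance inside its topics."""
    def avg_topic_dist(topics):
        total, pairs = 0.0, 0
        for words in topics:
            for i in range(len(words)):
                for j in range(i + 1, len(words)):
                    total += dist(words[i], words[j])
                    pairs += 1
        return total / pairs if pairs else float("inf")
    return min(runs, key=lambda k: avg_topic_dist(runs[k]))
```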
Optimizing the results (3) Results: four topics yields the minimum average distance between words within each topic.
A List of Topics Discovered from www.natall.com
Discovering New Topics after Optimization
Conclusion Web-Harvest integrated with LDA is able to • discover hidden latent topics on Dark Web sites. • provide a more flexible and automated counter-terrorism tool. • support a measurable way to optimize the results of LDA. • serve as a generic tool for analyzing a variety of web sites, e.g. financial or medical ones.
References • Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003. • Zhang, H., Qiu, B., Giles, C. L., Foley, H. C., and Yen, J. An LDA-based Community Structure Discovery Approach for Large-Scale Social Networks. In Proceedings of IEEE Intelligence and Security Informatics, 2007. • Yang, C. C., Shi, X., and Wei, C.-P. Tracing the Event Evolution of Terror Attacks from On-Line News. In Proceedings of IEEE Intelligence and Security Informatics, 2006. • Xu, J., Chen, H., Zhou, Y., and Qin, J. On the Topology of the Dark Web of Terrorist Groups. In Proceedings of IEEE Intelligence and Security Informatics, 2006.