Song Gao, Chengcui Zhang, Wei Bang Chen
Department of Computer and Information Sciences, The University of Alabama at Birmingham
{gaos, zhang, wbc0522}@cis.uab.edu
In this paper we present a two-phase spam image clustering framework. The proposed framework performs histogram-based projective clustering on visual features in the first phase, followed by text-based clustering in the second phase. This study makes three contributions. First, we address the complex nature of spam image obfuscation techniques. Second, we develop a multi-clue framework to profile spam images from common spamming sources, which provides evidence for tracking spam gangs. Third, projective clustering eliminates the need to choose among distance metrics for clustering analysis, while systematically exploring the subspaces that correspond to clusters.

Introduction and Motivation
• "Image spam is a kind of email spam where the message text of the spam is presented as a picture in an image file" (Wikipedia).
• Spam images accounted for more than 30% of all spam emails in 2006.
• Many spam images look similar but are essentially not the same.
• Wavy images defeat text recognition algorithms such as optical character recognition (OCR).
• Challenges:
  • Current state of anti-spam: filtering techniques, such as text classification and image classification.
  • Disadvantage: filtering CANNOT tell the origins of spam.

Wavy Image Correction
To extract the embedded text from wavy images, each vertical line must be realigned to its correct position. Two approaches are proposed to find the guideline on which the realignment is based:
• Edge-based method: curved lines that were horizontal lines in the undistorted image serve as the guideline for image correction.
• Color-matching method: this approach finds the best color match between two adjacent vertical lines by fixing one line and slightly shifting the other upward or downward.
[Figure panels: Edge, Color]
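The color-matching method for wavy image correction can be illustrated with a minimal Python sketch. It is not the authors' implementation: the grayscale column representation, the mean absolute difference used as the match cost, the `max_shift` search range, and the `fill` value for vacated pixels are all assumptions made for illustration.

```python
def best_shift(fixed, moving, max_shift=3):
    """Return the vertical shift of `moving` that best matches `fixed`.

    Both arguments are equal-length lists of gray levels (one image column).
    A positive shift moves the column downward. The cost (an assumption) is
    the mean absolute difference over the rows where the columns overlap.
    """
    n = len(fixed)
    max_shift = min(max_shift, n - 1)  # keep at least one overlapping row
    best, best_cost = 0, float("inf")
    for s in range(-max_shift, max_shift + 1):
        rows = range(max(0, s), min(n, n + s))
        cost = sum(abs(fixed[r] - moving[r - s]) for r in rows) / len(rows)
        # Prefer the smaller displacement when two shifts match equally well.
        if cost < best_cost or (cost == best_cost and abs(s) < abs(best)):
            best, best_cost = s, cost
    return best


def apply_shift(column, shift, fill=255):
    """Shift a column vertically, padding vacated rows with `fill`."""
    n = len(column)
    shifted = [fill] * n
    for r in range(n):
        src = r - shift
        if 0 <= src < n:
            shifted[r] = column[src]
    return shifted


def correct_wavy(columns, max_shift=3, fill=255):
    """Realign each column against the previously corrected column."""
    corrected = [list(columns[0])]
    for col in columns[1:]:
        s = best_shift(corrected[-1], col, max_shift)
        corrected.append(apply_shift(col, s, fill))
    return corrected


def make_column(line_row, height=10):
    """Hypothetical test column: white background with one dark pixel."""
    return [0 if r == line_row else 255 for r in range(height)]


# A horizontal line at row 5, displaced column by column by a wave.
wave = [0, 1, 2, 1, 0, -1, -2, -1]
wavy = [make_column(5 + offset) for offset in wave]
straight = correct_wavy(wavy)
```

Because each column is aligned to its already-corrected neighbor, small matching errors can accumulate across a wide image; aligning against a fixed reference column is one possible alternative.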
Identifying Image Spam Authorship with a Variable Bin-width Histogram-based Projective Clustering

Goal
• Provide scientific evidence of the origins of spam.
• Assist in tracking down the common sources of spam based on spam image clustering.

Projective Clustering
REVBH (Relative Entropy on Variable Bin-width Histogram) is a histogram-based projective clustering algorithm that proceeds as follows:
• Construct a variable bin-width histogram for each k-dimensional subspace (e.g., k = 2).
• Detect dense areas iteratively in each histogram using our proposed density threshold.
• Convert each object into a signature that describes how that data object is projected into the different subspaces.
• Merge similar object signature entries.
• Assign data objects to their corresponding clusters.
• Partition on one dimension by using the original histogram and the equalized histogram.
[Figure labels: Group 1, Group 2, Group 3]

Multi-clue Framework
• A histogram-based clustering framework:
  • Image preprocessing
    • Wavy correction.
    • Spam image segmentation into foreground and background.
  • Feature extraction
    • Color features: 6-bit color-code histogram.
    • Texture features: histogram of gradient direction, with each bin representing k degrees of the full 360 degrees.
    • Layout features: proportion of foreground object pixels in each cell of a 9-cell grid.
    • Text contents: recognized by performing OCR.
  • Two-phase clustering
    • Histogram-based projective clustering on visual features.
    • Text-based clustering on extracted text information.
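Two of the visual features listed in the multi-clue framework, the 6-bit color-code histogram and the 9-grid layout feature, can be sketched in Python as follows. The poster does not spell out either definition, so this sketch assumes that the 6-bit code keeps the two most significant bits of each RGB channel, and that the layout feature is the fraction of foreground pixels inside each cell of a 3×3 grid. The function names are hypothetical.

```python
def color_code_histogram(pixels):
    """Normalized 64-bin histogram of 6-bit color codes.

    `pixels` is a non-empty list of (r, g, b) tuples with 8-bit channels.
    Assumption: the code packs the top 2 bits of R, G, and B.
    """
    hist = [0] * 64
    for r, g, b in pixels:
        code = ((r >> 6) << 4) | ((g >> 6) << 2) | (b >> 6)
        hist[code] += 1
    total = len(pixels)
    return [count / total for count in hist]


def layout_feature(mask):
    """Foreground fraction in each cell of a 3x3 grid, in row-major order.

    `mask` is a list of rows of 0/1 values (1 = foreground) and must be at
    least 3 pixels high and wide so that no grid cell is empty.
    """
    height, width = len(mask), len(mask[0])
    features = []
    for gy in range(3):
        y0, y1 = gy * height // 3, (gy + 1) * height // 3
        for gx in range(3):
            x0, x1 = gx * width // 3, (gx + 1) * width // 3
            cell = [mask[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell))
    return features


# Hypothetical inputs: three white pixels and one black pixel, and a 6x6
# mask whose foreground fills only the top-left grid cell.
hist = color_code_histogram([(255, 255, 255)] * 3 + [(0, 0, 0)])
mask = [[1 if y < 2 and x < 2 else 0 for x in range(6)] for y in range(6)]
layout = layout_feature(mask)
```

Both functions return fixed-length vectors (64 and 9 values), so their outputs can be concatenated with the other features before normalization.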
• In the variable bin-width histogram, the bin width of each sub-range along one dimension is determined by Freedman and Diaconis's rule or Scott's rule:
  $h = \max\{2 \cdot \mathrm{IQR} \cdot n^{-1/3},\ 3.5 \cdot \sigma \cdot n^{-1/3}\}$
• Dense bins are detected in terms of a relative entropy metric:
  $h_{r\_low}(x) \le \frac{1}{T} H_r(X) \le h_{r\_high}(x)$
  where $h_r(x)$ and $H_r(X)$ denote the relative entropy of a single bin and of its corresponding k-dimensional histogram, respectively.
[Figure: signature of object O1]
[Figure: original image; foreground mask after segmentation; resized illustration mask for layout feature extraction]

Experimental Results
• 2100 spam images, including 37 wavy images.
• 476 classes labeled manually.
• All feature values are normalized to z-scores.
• Clustering results are evaluated by V-measure and the number of clusters produced.
[Figure panels 1.a, 1.b, 2, 3.a, 3.b: effectiveness of wavy image correction; performance comparison between the proposed approach and hierarchical clustering]
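The bin-width rule stated above, the larger of the Freedman and Diaconis width and the Scott width, can be computed with Python's standard library. This is a minimal sketch: the use of the sample standard deviation and of the default "exclusive" quartile method of `statistics.quantiles` are assumptions, since the poster does not say which estimators were used.

```python
import statistics


def bin_width(values):
    """Bin width h = max(2 * IQR * n^(-1/3), 3.5 * sigma * n^(-1/3)).

    `values` holds the feature values of one dimension (at least two).
    Assumptions: sample standard deviation, "exclusive" quartiles.
    """
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    sigma = statistics.stdev(values)
    scale = n ** (-1 / 3)
    freedman_diaconis = 2 * iqr * scale
    scott = 3.5 * sigma * scale
    return max(freedman_diaconis, scott)


# Hypothetical one-dimensional sample.
width = bin_width([1, 2, 3, 4, 5, 6, 7, 8])
```

Taking the maximum of the two widths yields the coarser of the two binnings, so a sub-range is never split more finely than either rule alone would allow.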