N-Gram-based Dynamic Web Page Defacement Validation Woonyon Kim Aug. 23, 2004 NSRI, Korea
Contents • Introduction • Related Works • N-Gram Frequency Index • N-Gram-based Index Distance • Experiments • Conclusions
Introduction • Defacement of Web Sites • CSI/FBI 2001 • 38% of web sites were hacked. • 21% of the hacked sites were not aware of their own defacement. • Zone-h • The number of defaced web pages is increasing rapidly year by year (.kr domain: about a 200% increase). • Current solutions • Hash-based detection systems for minimizing damage • Intrusion-tolerant systems for continuous service • Problems of current solutions • Current solutions use a hash code as the validation metric, but hash codes cannot handle dynamically changing content.
Introduction • N-Gram-based Index Distance (NGID) • A validation metric for dynamically changing web pages • The sum of the absolute differences of the frequency probabilities of the N-Grams found in both indexes • NGID represents the similarity of two web pages. • NGID can be used to validate web pages with either static or dynamic components.
Related Works • Hash-based validation system • Detects web page defacement by comparing two hash codes • A hash code is a useful metric for large, static web pages. • Hash codes do not work properly on dynamically changing web pages. • Intrusion-tolerant system • A hash code is used to validate web pages. • It also has limitations on dynamic web pages.
N-Gram Frequency Index (1) • N-Gram • An N-character slice of a string • For example, the 2-Grams of "TEXT" are TE, EX, XT. • N-Gram Frequency Index • An index file sorted from the most frequent N-Grams to the least frequent ones • N-Grams below a particular rank are cut off, so minor changes are ignored; this feature of the N-Gram Frequency Index supports dynamic content.
N-Gram Frequency Index (2) • How to generate (see the sketch below) • Count the frequencies of all N-Grams in a web page. • Sort the N-Grams from the most frequent to the least frequent. • Cut off the N-Grams below a particular rank. • Sum up the frequencies of the remaining N-Grams. • Compute the frequency probability of each N-Gram. • Save the N-Grams, their frequencies, and their probabilities into an index file.
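A minimal Python sketch of the index-generation steps above; the names (ngrams, build_index) and the cutoff value are illustrative assumptions, not taken from the original system.

    from collections import Counter

    N = 2             # N-Gram size (2-Grams, as in the "TEXT" example)
    CUTOFF_RANK = 50  # assumed cutoff rank; N-Grams below it are ignored

    def ngrams(text, n=N):
        # Every n-character slice of the string.
        return (text[i:i + n] for i in range(len(text) - n + 1))

    def build_index(page_text, cutoff=CUTOFF_RANK):
        # Returns {ngram: (frequency, probability)} for the top-ranked N-Grams.
        counts = Counter(ngrams(page_text))                  # count all N-Grams
        top = counts.most_common(cutoff)                     # sort and cut off below the rank
        total = sum(freq for _, freq in top)                 # sum the remaining frequencies
        return {g: (freq, freq / total) for g, freq in top}  # probability of each N-Gram

The resulting dictionary can then be saved to an index file, for example with json.dump.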
N-Gram-based Index Distance (NGID) • The sum of the absolute differences of the frequency probabilities of the N-Grams found in both web pages • A metric for detecting whether a web page is defaced or not
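Written as a formula (our notation, following the definition above), where G_A and G_B are the N-Gram sets of the two indexes and p_A(g), p_B(g) are their frequency probabilities:

    \mathrm{NGID}(A, B) = \sum_{g \in G_A \cap G_B} \left| p_A(g) - p_B(g) \right|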
N-Gram-based Index Distance • Evaluation is done by comparing NGID to a validation threshold • Evaluation • Valid: NGID <= Validation Threshold • Invalid: NGID > Validation Threshold
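A minimal Python sketch of this rule, assuming indexes in the form produced by build_index above; the function names are ours.

    def ngid(index_a, index_b):
        # Sum of absolute differences of the probabilities of N-Grams found in both indexes.
        shared = index_a.keys() & index_b.keys()
        return sum(abs(index_a[g][1] - index_b[g][1]) for g in shared)

    def is_valid(index_a, index_b, threshold=0.1):
        # Valid if NGID <= threshold; invalid (possibly defaced) otherwise.
        return ngid(index_a, index_b) <= threshold

The default threshold of 0.1 matches the value chosen in the experiments below.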
Experiments • Assumptions • Select 100 web pages. • Choose 0.1 as the validation threshold of NGID. • Procedure for false positives (sketched below) • Connect to each selected web page from a remote location. • Download the page and save it to a file. • Validate it using NGID. • Validate it using the hash code. • The above four steps are repeated every 30 minutes for one day.
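A hedged sketch of this monitoring loop, reusing build_index and is_valid from above; fetch_page and load_normal_index are hypothetical helpers, and the choice of SHA-1 for the hash code is an assumption.

    import hashlib
    import time

    def monitor(urls, threshold=0.1, interval=30 * 60, rounds=48):
        # 48 rounds at 30-minute intervals covers one day.
        for _ in range(rounds):
            for url in urls:
                page = fetch_page(url)                               # hypothetical: download and save the page
                normal_index, normal_hash = load_normal_index(url)   # hypothetical: stored normal index and hash
                ngid_ok = is_valid(build_index(page), normal_index, threshold)
                hash_ok = hashlib.sha1(page.encode()).hexdigest() == normal_hash
                print(url, "NGID:", "valid" if ngid_ok else "invalid",
                      "Hash:", "match" if hash_ok else "mismatch")
            time.sleep(interval)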
Experiments • False Positive
Experiments • NGID value as time flows (figure: points 1 and 2 mark the times of content updates)
Experiments • Procedure for false negatives • Collect 50 web pages from zone-h, consisting of normal pages and their hacked versions. • Validate each using NGID. • Validate each using the hash code. • Result for the hash code • All 50 web pages are detected as defaced. • The number of false negatives is 0.
Experiments • False Negative
Conclusions • N-Gram-based Index Distance • A metric to evaluate dynamic web page defacement • NGID can validate dynamically changing web pages. • Future Works • Need a learning model to determine the validation threshold for each web page. • Need a feedback mechanism for the normal index.