Q&A for “Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs”

Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker
KDD 2009
By Fu-Chi Ao

Questions
• What’s the error rate?
• What are the relevant/dominant features out of the selected 30783 features?
• Indication of TTL values?
• How to construct the feature vectors?
• What are the 3959 WHOIS information features?

What’s the error rate? (In binary classification)
• Accuracy: the proportion of correctly classified examples (true positives plus true negatives) among all classified examples
• Error rate = 1 - Accuracy
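The relationship between the two quantities can be checked with a minimal sketch. The labels below are made up for illustration (1 = malicious, 0 = benign) and are not results from the paper.

```python
def error_rate(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical labels: 1 = malicious, 0 = benign.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

err = error_rate(y_true, y_pred)   # one false positive and one false negative
acc = 1 - err
print(f"accuracy = {acc:.3f}, error rate = {err:.3f}")
```

Note that the error rate counts false positives and false negatives equally; it does not say which kind of mistake dominates.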

What are the relevant/dominant features out of the selected 30783 features?
[Table: breakdown of features for L1-regularized LR (logistic regression) for an instance of the Yahoo-PhishTank data set; columns include non-zero, benign, and malicious features]
• The training phase for L1-regularized LR yields a sparse parameter vector w
• The model therefore focuses on a smaller number of relevant features (those with non-zero weights)
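A breakdown like the one in the table can be read off a trained weight vector. The sketch below assumes that the malicious class is the positive label, so a positive weight pushes toward "malicious" and a negative weight toward "benign". The weights and group names are invented for illustration and are not the paper's numbers.

```python
from collections import Counter


def breakdown(weights, groups):
    """Count total, non-zero, benign (w < 0) and malicious (w > 0) weights per group."""
    table = {}
    for w, g in zip(weights, groups):
        row = table.setdefault(g, Counter())
        row["total"] += 1
        if w != 0.0:
            row["nonzero"] += 1
            row["benign" if w < 0 else "malicious"] += 1
    return table


# Hypothetical sparse weight vector w and the feature group of each entry.
weights = [0.0, -1.2, 0.0, 0.8, 0.0, 0.0, 2.1, -0.3]
groups = ["lexical", "lexical", "lexical", "lexical",
          "whois", "whois", "whois", "dns"]

table = breakdown(weights, groups)
for name, row in table.items():
    print(name, row["total"], row["nonzero"], row["benign"], row["malicious"])
```

Features whose weight is exactly zero drop out of the decision rule, which is what "sparse" means here.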

Certain “Red Flags” Indicate Malicious Intent
• 1) Suspicious ownership of the site
  • Benign features: IP ranges belonging to Google, Yahoo and AOL
  • Malicious features: having an NS record in one of the IP prefixes run by GoDaddy
• 2) Where the site is hosted geographically
  • Top-6 benign features: ‘.gov’, ‘.edu’, ‘.com’, ‘.org’, ‘.ca’ and ‘.se’
  • Top-6 malicious features include: ‘.info’, ‘.kr’, ‘.it’, ‘.hu’, and ‘.es’
• 3) The registration date of the site
  • Malicious: a recent registration or update date, or missing any of the three WHOIS dates (registration, update, expiration)
• 4) What kind of connection the server is using
  • Top-2 benign features: T1 speed for the DNS A and MX records
  • Malicious sites are often hosted on compromised machines in residential ISPs
• 5) The presence of certain URL extensions
  • "bankofamerica.com" vs. "bankofamerica.com.cz.rnl"
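The last red flag is a purely lexical pattern: a trusted domain appears inside the hostname, but not at its end. The function below is a hand-written illustration of that pattern only. The paper's classifier learns weights over tokens rather than applying a rule like this, and the function name is hypothetical.

```python
def embeds_brand_domain(hostname, brand_domain):
    """True if brand_domain occurs inside hostname without being its actual suffix."""
    host_labels = hostname.lower().rstrip(".").split(".")
    brand_labels = brand_domain.lower().rstrip(".").split(".")
    n = len(brand_labels)

    # The brand domain itself, or a real subdomain of it: not suspicious.
    if host_labels[-n:] == brand_labels:
        return False

    # The brand labels appear as a contiguous run somewhere before the end.
    for i in range(len(host_labels) - n + 1):
        if host_labels[i:i + n] == brand_labels:
            return True
    return False


print(embeds_brand_domain("bankofamerica.com.cz.rnl", "bankofamerica.com"))
print(embeds_brand_domain("www.bankofamerica.com", "bankofamerica.com"))
```

Comparing whole DNS labels, rather than raw substrings, avoids flagging an unrelated hostname that merely contains the same letters.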

What are the relevant/dominant features out of the selected 30783 features? (cont’d)
• The experiments show that different data sets have different feature distributions for distinguishing malicious from benign URLs
• Machine learning techniques adapt to these differing distributions by learning the appropriate decision rules automatically
• This avoids manually discovering and adjusting the decision rules for each data set

What are the relevant/dominant features out of the selected 30783 features? (cont’d)
• Automation of the classifier
  • It selected malicious and benign features for which domain experts had prior intuition
  • It also automatically selected new, non-obvious features that were highly predictive and yielded additional, substantial performance improvements

Indication of TTL values?
• Feature question: “What is the time-to-live (TTL) value for the DNS records associated with the hostname?”
• The TTL is set by an authoritative name server for a particular resource record
• Low TTL values
  • Some well-known larger web sites depend on low TTL values to enable quick changes to their web sites, e.g. “www.cnn.com”
  • Some small web sites require frequent DNS updates (when their IP address changes), e.g. sites run on ADSL or cable connections with dynamic IP addresses

How to construct the feature vectors?
• Use the selected features to encode individual URLs as very high-dimensional feature vectors
• Most features are generated by the “bag-of-words” representation of the URL, registrar name, and registrant name
• Binary features are also used to encode all possible ASes, prefixes and geographic locales of an IP address
• The resulting URL descriptors typically have tens of thousands of binary features
• Overfitting
  • It is not known in advance which features are relevant
  • Only a subset of the generated features may correlate with malicious web sites
  • When there are more features than labeled examples, the model is prone to overfitting!

Feature vector construction
• URL: http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
• WHOIS registration: 3/25/2009
• Hosted from 208.78.240.0/22
• IP hosted in San Mateo
• Connection speed: T1
• Has DNS PTR record? Yes
• Registrant “Chad”
• ...
[Figure: feature vector [ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …], with segments labeled host-based, lexical, and real-valued]
• No clear illustration for the construction methodology…
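Since the construction is not spelled out, here is one plausible sketch of how the example above could be turned into a sparse binary vector. Everything in it is an assumption made for illustration: the delimiter set used to split the path, the feature-name prefixes, and the helper names `lexical_tokens` and `encode`. The WHOIS date is left out because the slides do not say how real-valued features are encoded.

```python
import re
from urllib.parse import urlparse


def lexical_tokens(url):
    """Bag-of-words tokens from the hostname and the path (assumed delimiters)."""
    parsed = urlparse(url)
    host_tokens = [t for t in (parsed.hostname or "").split(".") if t]
    path_tokens = [t for t in re.split(r"[/?.=\-_&]+", parsed.path) if t]
    return [f"host:{t}" for t in host_tokens] + [f"path:{t}" for t in path_tokens]


def encode(features, index, grow=True):
    """Map feature names to column ids and return a sparse vector {column: value}.

    With grow=True (training) unseen names get a new column; with grow=False
    (prediction) unseen names are dropped.
    """
    vec = {}
    for name, value in features.items():
        if name not in index:
            if not grow:
                continue
            index[name] = len(index)
        vec[index[name]] = value
    return vec


url = "http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll"
features = {tok: 1 for tok in lexical_tokens(url)}
# Host-based properties from the example, one binary feature per observed value.
features.update({
    "ip_prefix:208.78.240.0/22": 1,
    "geo:San Mateo": 1,
    "conn_speed:T1": 1,
    "has_ptr": 1,
    "registrant:chad": 1,
})

index = {}                      # feature name -> column id, shared by all URLs
vector = encode(features, index)
print(len(index), sorted(vector))
```

Storing only the non-zero columns keeps each URL small even when the shared index grows to tens of thousands of features.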

What are the 3959 WHOIS information features?
• WHOIS is a distributed database that contains contact and registration information for a domain:
  • the owner and registrar of the domain (including home page URL)
  • dates of registration, last update, and expiration
  • primary and secondary DNS servers
  • any additional status information of the domain
• The 3959 features are mainly tokens in the names of the registrar and registrant of the domain name
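To make "tokens in the names" concrete, the sketch below splits a registrar name and a registrant name into lowercase word tokens and tags each with its field. The tokenization rule, the field prefixes, and the sample names are assumptions for illustration; the slides do not specify them.

```python
import re


def whois_name_tokens(registrar, registrant):
    """Set of binary feature names built from registrar and registrant names."""
    feats = set()
    for field, text in (("registrar", registrar), ("registrant", registrant)):
        # Missing WHOIS fields (None or empty) simply contribute no tokens.
        for tok in re.findall(r"[a-z0-9]+", (text or "").lower()):
            feats.add(f"{field}:{tok}")
    return feats


# Illustrative inputs only.
feats = whois_name_tokens("GoDaddy.com, Inc.", "Chad")
print(sorted(feats))
```

Each distinct token seen in the training data becomes one binary feature, which is how a handful of WHOIS fields can expand into thousands of features.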
