1. An Invitation to Data-Mining Virgil -- virgil@yak.net
GregR -- gregr@yak.net
Interz0ne IV
March 12, 2005
2. Lecture Outline Introducing Data-Mining
Google Hacking
Intermission
Examples of Using Data-Mining for:
Money
Power
Sex
Closing
3. The Advent of Databases and the Internet Fact: The amount of data we have access to is greater than ever before and is still growing exponentially.
If nothing else, the continued archival of current data will quickly add up.
4. Continued Growth of the Internet
5. Growth of Digital Information A Practical Example…
Back in the old days news of interesting websites propagated through word of mouth.
Then it moved to USENET groups (blogs are a modern equivalent).
But, then it became difficult to find the hottest newsgroups.
To compensate for this we started using search engines.
Today, we’re frequently using meta-search engines & meta-blogging sites like technorati.com, memestreams.net, and del.icio.us.
Data-Mining is an increasingly powerful tool for taking advantage of the huge amounts of digitized information now available.
6. What is Data-Mining… From Wikipedia... Data mining has been defined as
[1] “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data”
[2] “The science of extracting useful information from large data sets or databases”.
Like Artificial Intelligence, “Data-Mining” is a widely used term with general connotations.
7. What is Data-Mining (contd.) Data-Mining is usually broken up into two distinct steps.
1. “Data-Warehousing” – Collecting large amounts of data
2. “Mining / Extraction” – Analysis (often statistical) of the collected information
8. Some Examples of Data Mining... Amazon.com’s Recommendation System
MusicPlasma.com
National Security Agency’s ECHELON
ECHELON is the largest electronic spy network in history, run by the United States, the United Kingdom, Canada, Australia, and New Zealand. It captures telephone calls, faxes, e-mails, and IMs from around the world. ECHELON is estimated to intercept about 3 billion communications every day. (text-mining)
9. Other Users of Data Mining Nazis in France during WWII
Mormons
The Alexa/Google Toolbar
Wal-Mart (i.e. urban myth of correlation of purchase of beer and diapers)
RIAA/MPAA in P2P
Microsoft in BitTorrent
Rotten.com’s NNDB
Basically, just about everyone is using data mining for all sorts of things.
10. Getting your feet wet in Data-Mining: Using Google Using Google is a great place to start data-mining.
The data collection stage has already been done for you!
All you need to do is craft the perfect query to find the interesting parts.
11. But what could you possibly find just using Google?
12. How About…
13. Intro: “Google Hacking” “Google Hacking” is the use of Google’s data stores for naughty things.
Makes extensive use of the advanced Google syntaxes.
Is trivially easy to do and is rather trendy.
An excellent guide to get up to speed on the techniques of “Google Hacking” is the O'Reilly book Google Hacks by Tara Calishain.
14. Google Hacking: Tools of the Trade On the surface, searching Google is straightforward.
But there are many special parameters (some of which are undocumented).
You can use these parameters to exclude everything but the data you're looking for.
15. Google Syntax Examples Operators:
" " / - / + / ( )
site:
filetype:
related:
link:
[all]inanchor:
[all]inurl:
[all]intext:
[all]intitle:
Example queries:
(interz0ne | outerz0ne) extraz0ne
site:.mil
filetype:.doc
related:yak.net
inanchor:"miserable failure"
inurl:robots.txt
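A minimal sketch of how queries like these can be assembled programmatically. The helper names `build_query` and `search_url` are made up for illustration, and the `https://www.google.com/search?q=...` URL layout is an assumption rather than anything stated on the slides.

```python
from urllib.parse import urlencode


def build_query(terms=(), **operators):
    """Join free-text terms and operator:value pairs into one query string."""
    parts = list(terms)
    for name, value in operators.items():
        # Quote multi-word values so they stay attached to their operator.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{name}:{value}")
    return " ".join(parts)


def search_url(query):
    """Return a search URL for the query (the URL layout is an assumption)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    q = build_query(["(interz0ne | outerz0ne)"], site=".mil", inurl="robots.txt")
    print(q)
    print(search_url(q))
```

Keeping the operators as keyword arguments makes it easy to generate many variations of one query in a loop, for example one `site:` restriction per domain of interest.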
16. Some Undocumented Syntaxes… Find between ranges of numbers
Single word wild-card
“Fuzzify”
Search only documents indexed within a particular timeframe.
17. Google Hacking: Further Reading Due to its ease, Google Hacking already has a large following.
Johnny Long runs a user-contributed "Google Hacking Database" which contains over 1,000 ready-made search queries.
http://johnny.ihackstuff.com/
Johnny Long also has a concise Google Hacking guide. http://johnny.ihackstuff.com/security/premium/The_Google_Hackers_Guide_v1.0.pdf
18. Intermission Questions on anything related or unrelated so far?
19. Going Beyond Google “Google Hacking” is just the easy stuff.
Data Mining techniques are applicable to virtually everything.
There is a large amount of interesting information digitally available which is not indexed by Google (or anyone else).
To do more interesting things you'll typically be using one of these as your data set.
All sorts of data are already out there; all you need is the ingenuity to find applications for them.
20. Further Examples of Data Mining
21. Using Data-Mining to… Derive Mother's Maiden Names
Uncover Corporate and Government Secrets
Embarrass minor-celebrities
22. Deriving Mother's Maiden Names Mother’s Maiden Names (MMNs) are a common security authenticator
Used as an authenticator for credit cards, email accounts, websites, etc.
Idea: You could mine public records information from online databases to automatically derive MMNs for random people.
23. About our Study The most relevant records are the birth and marriage records, both of which are “vital records” in the public domain.
At the very least, there will be some easy cases for deriving MMNs (e.g. uncommon last names, hyphenated last names, “Jr.”, “III”, etc.)
Although these techniques can be applied anywhere, we focused on Texas.
24. Availability of Related Records Related public records are available at the county, state, and national level.
US Census records are held for 72 years before release
Searchsystems.net has a large listing of county-level records
Rootsweb provides full user-submitted family trees
We got most of our records from the Texas Bureau of Vital Statistics’ website
25. Getting Texas Vital Records Collected marriage data from the State Dept of Vital Statistics (records 1966-2002).
However, the birth records were sealed in 2000, the death records in 2003.
We found partial copies of the sealed records on archive.org and full copies on rootsweb.com and searchsystems.net.
Furthermore, the death records were only unlinked, and you can still download death info from their own servers 2 ½ years later.
26. Analyzing the Records Once we have a large corpus of both birth and marriage data, we can apply whatever heuristics we want in connecting children to marriages.
Lucky us! Birth records from 1950 and earlier include the MMN in plaintext!
This left us with mostly state marriage records from 1966-2002 and state birth records from 1951-1995 to analyze.
27. Children will have the same last name as their parents.
We do not have to link a child to a particular marriage record, only to a particular maiden name. An attacker doesn’t have to pick the correct parents, just the correct MMN!
The parents' first and middle names are often repeated within a child's first or middle name.
Children are often born in the same county in which their parents were recently married.
Factor in Divorce Records [public domain]
Factor in SSDI / State Death Records [public domain]
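A rough sketch of how the first and fourth heuristics above (same last name, same county) might narrow the candidate maiden names for one child. The record layout (`groom_last`, `bride_maiden`, `county`) and the sample rows are invented for illustration; they are not the format of the Texas records.

```python
from collections import Counter


def candidate_mmns(child, marriages, same_county=False):
    """Count maiden names from marriages compatible with the child's record."""
    names = Counter()
    for m in marriages:
        # Heuristic: children share their parents' last name.
        if m["groom_last"] != child["last"]:
            continue
        # Optional heuristic: born in the county where the parents married.
        if same_county and m["county"] != child["county"]:
            continue
        names[m["bride_maiden"]] += 1
    return names


# Toy data, made up for this example.
marriages = [
    {"groom_last": "Smith", "bride_maiden": "Jones", "county": "Travis"},
    {"groom_last": "Smith", "bride_maiden": "Garcia", "county": "Harris"},
    {"groom_last": "Smith", "bride_maiden": "Jones", "county": "Harris"},
    {"groom_last": "Nguyen", "bride_maiden": "Lee", "county": "Travis"},
]
child = {"last": "Smith", "county": "Travis"}

if __name__ == "__main__":
    print(candidate_mmns(child, marriages))
    print(candidate_mmns(child, marriages, same_county=True))
```

Note that the function returns maiden names, not marriages: two different couples with the same maiden name collapse into one candidate, which is all an attacker needs.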
28. Measuring our Success for Compromise Recall we need only match up to the correct MMN, not the correct parents.
After applying our heuristics we’ll have a list of possible maiden names. We use data entropy (Shannon entropy) to measure the ‘disorder’ of the set of remaining MMNs.
We then compare the entropy before and after the application of the heuristics to measure the success of our attack.
Before any heuristics are applied, the entropy of the set of MMNs is ≈ 13 bits.
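For reference, 13 bits is as much uncertainty as a uniform choice among 2^13 = 8192 names. A minimal sketch of the entropy measurement follows; the name counts are toy values chosen for illustration, not figures from the study.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over name frequencies."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


# Toy "before" set: 8 equally likely maiden names, so H = 3.0 bits.
before = Counter({name: 1 for name in "ABCDEFGH"})
# Toy "after" set: heuristics left two equally likely names, so H = 1.0 bit.
after = Counter({"A": 1, "B": 1})

if __name__ == "__main__":
    h_before = shannon_entropy(before)
    h_after = shannon_entropy(after)
    print(f"before: {h_before:.2f} bits, after: {h_after:.2f} bits")
    print(f"bits removed by the heuristics: {h_before - h_after:.2f}")
```

Each bit removed halves the effective number of guesses an attacker needs, which is why the before/after difference is a convenient measure of how well a heuristic works.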
29. Entropy Graph assuming only same last names
30. Results from just assuming same last name.
31. Questions? (By the way, George Bush’s MMN is “Pierce”)
32. Data-Mining for .doc’s In case you weren't aware, the Microsoft .doc format contains all sorts of interesting “metadata” within the document.
At times, this metadata has been known to be intensely interesting.
This metadata includes (among other things) the: Title, Author, Date Created, Date Last Saved, Editing Time, User’s Machine ID#, and usernames of who made the last 10 revisions.
This fact is known to some groups (such as lawyers), but by and large people don't know about it.
33. Past Incidents UK Prime Minister Tony Blair published a dossier on the Iraq War
A Cambridge prof revealed that most of the document was plagiarized from a grad student in Monterey.
Inspired by this, Richard Smith of computerbytesman.com ran an analysis of the dossier's .doc metadata.
Smith uncovered a good deal more incriminating evidence and made the Blair government squirm. [Link]
34. That's a great idea! Let's do it better! Do massive crawling for all .doc’s on a particular domain
Extract all of their metadata
Put it into a database with a web interface
See if anything interesting turns up!
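A minimal sketch of the "put it into a database" step, using SQLite from the Python standard library. The crawling and metadata extraction are not shown; the `doc_meta` table, its columns, and the sample records are hypothetical stand-ins for whatever an extractor would actually produce.

```python
import sqlite3


def store_metadata(conn, records):
    """Create the table if needed and insert one row per crawled document."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_meta "
        "(url TEXT, author TEXT, last_saved_by TEXT)"
    )
    conn.executemany(
        "INSERT INTO doc_meta VALUES (:url, :author, :last_saved_by)", records
    )
    conn.commit()


def authors_by_count(conn):
    """Return (author, document count) pairs, most prolific author first."""
    return conn.execute(
        "SELECT author, COUNT(*) FROM doc_meta "
        "GROUP BY author ORDER BY COUNT(*) DESC, author"
    ).fetchall()


# Invented sample rows standing in for extractor output.
sample = [
    {"url": "http://example.org/a.doc", "author": "alice", "last_saved_by": "bob"},
    {"url": "http://example.org/b.doc", "author": "alice", "last_saved_by": "alice"},
    {"url": "http://example.org/c.doc", "author": "carol", "last_saved_by": "bob"},
]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_metadata(conn, sample)
    for author, n in authors_by_count(conn):
        print(author, n)
```

Once the metadata is in a table, "see if anything interesting turns up" becomes a matter of ad hoc queries, for example documents whose author and last editor differ.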
35. What we've done (work in progress) No conclusive word metadata analysis system exists.
We’ve been weaving bits and pieces together into an eventual whole.
Demonstrations:
[Demo of “The Revisionist” by Michal Zalewski]
[Demo of Yak’ified “WordLeaker” by Madelman]
[Demo of unreleased script strings_against_references. (Works similarly to Simon Byers’ work)]
36. .doc Mining -- Conclusions Okay, it's not finished yet.
But not bad for starting this project last week.
The core concept works completely, but needs a little more refinement.
Better integration is needed, still a few bugs.
37. Last Example
38. Cat Schwartz, TechTV eye candy As one of her fans comments… Cat Schwartz is one of the cute girls on TechTV. I know everybody jerks it to Morgan Webb, but Cat has that nerdy emo girl cuteness that I and many others find hard to resist. She has a blog on which she does bloggy things like posting pics of herself, writing crappy poems, and keeping her fans abreast of her schedule.
39. Cat Schwartz and her blog Like all blog girls, she likes to post suggestive images of herself on her blog. No one knows why blog girls do this, but for now let us simply accept that they do.
[www.catschwartz.com]
42. A little known fact… Programs like Photoshop store a full thumbnail of the photo in the EXIF header extension.
Furthermore, if only a slight alteration is made (e.g. cropping), Photoshop doesn’t regenerate the thumbnail stored in the EXIF header.
43. So....
48. And the net goes wild! One enthusiastic fan comments…
“I SPANKED TWICE IN A ROW TO THESE!!! AND I'M GONNA SPANK AGAIN!!! OMG! OMG! OMG! I EVEN LICKED MY MONITOR!!!!!!!”
49. Doing This Even Better Crawl USENET for images
Do math to determine if the image in the EXIF thumbnail is different from the actual image
Display the images
Live Demo using a “Hot or Not” rating system
Sadly, the results haven’t been that amazing; most are just uninteresting croppings.
But a few interesting bits….
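One possible reading of the "do math" step, sketched under assumptions: both images are already decoded into 2D grids of grayscale values (decoding the JPEG and pulling out the EXIF thumbnail are not shown), the full image is box-averaged down to the thumbnail's size, and the two are compared by mean absolute difference. The function names and the threshold of 10 are arbitrary choices for illustration.

```python
def downscale(pixels, out_w, out_h):
    """Box-average a 2D grayscale grid down to out_w x out_h."""
    in_h, in_w = len(pixels), len(pixels[0])
    out = []
    for oy in range(out_h):
        y0 = oy * in_h // out_h
        y1 = max((oy + 1) * in_h // out_h, y0 + 1)
        row = []
        for ox in range(out_w):
            x0 = ox * in_w // out_w
            x1 = max((ox + 1) * in_w // out_w, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def mean_abs_diff(a, b):
    """Average per-pixel absolute difference between two same-sized grids."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def thumbnail_differs(full, thumb, threshold=10.0):
    """True if the thumbnail does not look like a shrunken copy of the image."""
    small = downscale(full, len(thumb[0]), len(thumb))
    return mean_abs_diff(small, thumb) > threshold


if __name__ == "__main__":
    full = [[100] * 4 for _ in range(4)]
    print(thumbnail_differs(full, [[100, 100], [100, 100]]))
    print(thumbnail_differs(full, [[100, 100], [0, 0]]))
```

A cropped image usually has a different aspect ratio from its stale thumbnail, so comparing width-to-height ratios first would be a cheaper pre-filter before any pixel comparison.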
50. Some Data Sets dying for interesting applications FEC Political Donation Data http://ftp.fec.gov/FEC/presidential/
GPS Coordinates of Zipcodes + TerraServer http://www.census.gov/geo/www/tiger/zip1999.zip
More Public Records // Sexual Offender Databases http://www.searchsystems.net/
Social Security Death Index http://ssdi.genealogy.rootsweb.com/
Library of Congress Print Catalog http://www.loc.gov/rr/print/catalog.html
Flickr.com Ex:http://www.mappr.com
P2P Network User Behavior
Nanpa.com
51. End V. Griffith, M. Jakobsson (2005), “Messin’ with Texas: Deriving Mother’s Maiden Names Using Public Records”, is available at: http://romanpoet.org/1/mmn.pdf
EXIF Data Mining References:
Steven J. Murdoch: www.cl.cam.ac.uk/~sjm217
Maximillian Dornseif: md.hudora.de