Bulk Extractor Advanced Topics Webinar BitCurator Consortium. Michael Olson, Stanford University Sandy Ortiz, Stanford University February 16, 2017. Topics. Bulk Extractor overview Why we use it at Stanford Bulk Extractor 1.6.0 –dev Advanced features – definitions. Topics continued.
Bulk Extractor Advanced Topics Webinar BitCurator Consortium Michael Olson, Stanford University Sandy Ortiz, Stanford University February 16, 2017
Topics • Bulk Extractor overview • Why we use it at Stanford • Bulk Extractor 1.6.0-dev • Advanced features: definitions
Topics continued • Requirements to run • Configuration • Sample run • Results • Discussion / Questions
What is Bulk Extractor? • Software that scans disk images, files, file directories • Identifies potentially sensitive information: SSN, financial data, etc. • Creates histograms of features
Bulk Extractor @ Stanford • Identify PII in born-digital (BD) collections • Data classification mandate • Identify collection-specific data for further analysis
Bulk Extractor • BitCurator 1.8.16 running Bulk Extractor Viewer 1.6.0-dev • Performance / scanner improvements • BEViewer usability improvements
Source: https://www.krogen.co/alice-in-wonderland-paintings/alice-in-wonderland-paintings-top-25-best-alice-in-wonderland-artwork-ideas-on-pinterest-picture/
Going Down the Rabbit hole... Source: https://image.shutterstock.com/z/stock-vector-alice-is-falling-down-into-the-rabbit-hole-170986505.jpg
Overview • Define Stop List, Wordlist, Alert List, Find regex text • Define requirements to run the General Option "Find regex text" • Sample run configuration • Sample run results review
Definitions
• Stop List (white list; saves time and processing): A stop list can simply be a list of words that the user wants bulk_extractor to ignore. Stop lists can also be used to remove features not relevant to a case. See Section 4.4, "Suppressing False Positives," p. 24.
• Wordlist (for password cracking or custom analysis): A list of all "words" extracted from the disk, useful for password cracking or for discovering whether an author ever used a specific term (including in deleted or hidden files). The words this scanner can access depend on which other scanners are enabled; to include words in .zip files, for example, the "zip" scanner must be enabled. General option; disabled by default. See Section 5.4, p. 32.
• Alert List (red list; context-sensitive term search): The alert list can contain words and/or feature file names; when a match is found, bulk_extractor alerts the user. Feature file alerts work much like context-sensitive stop lists: the alert fires on a specified feature only when it is found in the specified context. General option; disabled by default. See Section 4.5, p. 26.
• Find Regex Text File (custom lexicon file; read vs. find occurrences of): The find scanner searches the data for anything listed in the global find list. The find list is formatted as rows of regular expressions; any line beginning with # is treated as a comment. Matching is case sensitive: a term matches only in the exact case given. See Section 5.3, pp. 29-32.
Source: Bulk Extractor Users Manual v1.4
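The find list format described above (rows of regular expressions, # comment lines, case-sensitive matching) can be illustrated with a short Python sketch. This is not bulk_extractor's own parser. The function names and the sample rows are hypothetical, and skipping blank lines is an assumption the manual excerpt does not confirm.

```python
import re


def load_find_list(text):
    """Compile each row of a find list as a case-sensitive regular expression.

    Lines beginning with '#' are comments, as the manual describes.
    Skipping blank lines is an assumption made for this sketch.
    """
    patterns = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        patterns.append(re.compile(line))  # no re.IGNORECASE: case matters
    return patterns


def find_terms(patterns, data):
    """Return every match, pattern by pattern, in the order found."""
    hits = []
    for pattern in patterns:
        hits.extend(match.group(0) for match in pattern.finditer(data))
    return hits


# Hypothetical lexicon rows, not taken from the webinar's lexicon file.
SAMPLE_FIND_LIST = """\
# faculty-related terms
tenure
grant[s]?
"""

patterns = load_find_list(SAMPLE_FIND_LIST)
hits = find_terms(patterns, "Tenure review; tenure case; two grants and one grant.")
print(hits)
```

Note that the capitalized "Tenure" is not reported. A lexicon that should catch both forms needs a row for each, or a pattern such as `[Tt]enure`.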
Sample Run: How it works
Program > Output (destination path): bulk_extractor -o /media/veracrypt1/NTFS_Pract_2017/Find_NTFS_Pract_2017
Option (use Find regex text file) > File path: -F /home/bcadmin/Desktop/Persona.faculty2.UCI.english.lex.txt
Source (image path): /media/veracrypt1/NTFS_Pract_2017/NTFS_Pract_2017.E01
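For repeated runs it can help to assemble the same command programmatically. The sketch below only builds and prints the argument list from this slide (program, -o output path, -F find file, image path). The helper name is hypothetical, and nothing is executed.

```python
import shlex


def build_bulk_extractor_command(output_dir, find_file, image_path):
    """Assemble the argument list in the order shown on the slide:
    program, -o <output dir>, -F <find regex file>, <source image>."""
    return ["bulk_extractor", "-o", output_dir, "-F", find_file, image_path]


command = build_bulk_extractor_command(
    "/media/veracrypt1/NTFS_Pract_2017/Find_NTFS_Pract_2017",
    "/home/bcadmin/Desktop/Persona.faculty2.UCI.english.lex.txt",
    "/media/veracrypt1/NTFS_Pract_2017/NTFS_Pract_2017.E01",
)
print(shlex.join(command))
```

On a machine where bulk_extractor is installed, the list could be passed to `subprocess.run(command, check=True)`. Keeping it as a list rather than a single string avoids quoting problems if a path contains spaces.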
Sample Run: How it works
• Find scanner: one term, one pass over the entire image. For example, an 853-term lexicon means 853 passes over the image. Very inefficient.
• Lightgrep scanner: the whole group of terms is searched for in the current buffer (processing segment). One pass through the image, looking for all terms in each "chunk." Higher efficiency.
• Refer to the liblightgrep dev blog for details: http://strozfriedberg.github.io/liblightgrep/
• Refer to the Bulk Extractor summary by Garfinkel: http://downloads.digitalcorpora.org/downloads/bulk_extractor/2014-07-17_BE15.pdf
• Several scanners may write to several different feature files.
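The difference between the two strategies can be sketched in a few lines of Python. This is a toy model of the slide's description, not the implementation of either scanner. Terms are treated as literal strings, each line of text stands in for a "chunk," and the function names are hypothetical.

```python
import re


def search_one_term_per_pass(data, terms):
    """Toy model of the find scanner as described: one full scan per term."""
    counts = {}
    passes = 0
    for term in terms:
        passes += 1  # every term triggers its own scan of all the data
        counts[term] = len(re.findall(re.escape(term), data))
    return counts, passes


def search_all_terms_per_chunk(data, terms):
    """Toy model of the lightgrep approach: one scan, all terms per chunk.

    Simplifications: a chunk is one line of text, so terms never straddle a
    chunk boundary, and where terms overlap in the text only one of them is
    reported.
    """
    combined = re.compile("|".join(re.escape(term) for term in terms))
    counts = {term: 0 for term in terms}
    for chunk in data.splitlines():
        for match in combined.finditer(chunk):
            counts[match.group(0)] += 1
    return counts, 1  # the data is walked through exactly once


data = "alpha beta\ngamma alpha\nbeta beta\n"
terms = ["alpha", "beta"]

counts_find, passes_find = search_one_term_per_pass(data, terms)
counts_grep, passes_grep = search_all_terms_per_chunk(data, terms)
print(counts_find, passes_find)
print(counts_grep, passes_grep)
```

Both functions report the same counts here, but the first scans the data once per term, so its cost grows with the size of the lexicon, while the second scans it once regardless of how many terms are loaded.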
Note: bulk_extractor version. Scanners included: httplogs, lightgrep, msxml, sqlite. Scanners missing:
Sample Run Start: Hardware Monitor. Keep an eye on your CPU... Open Hardware Monitor: CPU temperature rising, 53 °C; CPU load 53%. Host memory: 16 GB. Guest memory: 10 GB. Guest swap file size: 11 GB.
Sample Run: Finish. Approximately 12 min processing time, using default scanners + faculty lexicon. Rabbit hole curiosity #1: 272 MB processed, but the source image is 524 MB?
Sample Run: Results. Alert feature file: only one term found out of 41. Rabbit hole curiosity #2: why?
Sample Run: Results Service term count 3693 Service term count 3435
References
Bradley, J. R., & Garfinkel, S. (2015, March 23). Bulk Extractor Users Manual v. 1.4 [PDF]. Retrieved from http://downloads.digitalcorpora.org/downloads/bulk_extractor/BEUsersManual.pdf
Bulk Extractor 1.6.0 release notes. Retrieved from https://github.com/simsong/bulk_extractor/blob/master/doc/announce/announce_1.6.0.md
ePadd Lexicon (n.d.). Persona.faculty2.UCI.english.lex.txt. Retrieved from https://drive.google.com/open?id=0B89h5GBZe8FaMGptb1locDQzOUk
ePadd Lexicon Library (n.d.). Retrieved from https://library.stanford.edu/projects/epadd/community/lexicon-working-group
Stroz Friedberg (n.d.). Liblightgrep technical info [Blog]. Retrieved from http://strozfriedberg.github.io/liblightgrep/
References
Garfinkel, S. (n.d.). Bulk Extractor 1.5 Overview [PDF]. Retrieved from http://downloads.digitalcorpora.org/downloads/bulk_extractor/2014-07-17_BE15.pdf
Linux LEO (n.d.). Sample Image [.E01], NTFS, 524 MB. Retrieved from http://linuxleo.com/Files/NTFS_Pract_2017_E01.tar.gz
Stanford Risk Classifications. Retrieved from https://uit.stanford.edu/guide/riskclassifications
Stevens, C., Malan, D., Garfinkel, S., Dubec, K. A., & Pham, C. (2006). Advanced forensic format: An open, extensible format for disk imaging. International Federation for Information Processing. Retrieved from https://dash.harvard.edu/handle/1/2829932