Digital Forensics

Digital Forensics Dr. Bhavani Thuraisingham The University of Texas at Dallas Evidence Correlation November 2011

Papers to discuss • Forensic feature extraction and cross-drive analysis • http://dfrws.org/2006/proceedings/10-Garfinkel.pdf • A correlation method for establishing provenance of timestamps in digital evidence • http://dfrws.org/2006/proceedings/13-%20Schatz.pdf

Abstract of Paper 1 • This paper introduces Forensic Feature Extraction (FFE) and Cross-Drive Analysis (CDA), two new approaches for analyzing large data sets of disk images and other forensic data. FFE uses a variety of lexigraphic techniques for extracting information from bulk data; CDA uses statistical techniques for correlating this information within a single disk image and across multiple disk images. An architecture for these techniques is presented that consists of five discrete steps: imaging, feature extraction, first-order cross-drive analysis, cross-drive correlation, and report generation. CDA was used to analyze 750 images of drives acquired on the secondary market; it automatically identified drives containing a high concentration of confidential financial records as well as clusters of drives that came from the same organization. FFE and CDA are promising techniques for prioritizing work and automatically identifying members of social networks under investigation. The authors believe these techniques are likely to have other uses as well.

Outline • Introduction • Forensics Feature Extraction • Single Drive Analysis • Cross drive analysis • Implementation • Directions

Introduction: Why? • Improper prioritization. In these days of cheap storage and fast computers, the critical resource to be optimized is the attention of the examiner or analyst. Today, work is not prioritized based on the information that the drive contains. • Lost opportunities for data correlation. Because each drive is examined independently, there is no opportunity to automatically “connect the dots” on a large case involving multiple storage devices. • Improper emphasis on document recovery. Because today’s forensic tools are based on document recovery, they have taught examiners, analysts, and customers to be primarily concerned with obtaining documents.

Feature Extraction • An email address extractor, which can recognize RFC822-style email addresses. • An email Message-ID extractor. • An email Subject: extractor. • A Date extractor, which can extract date and time stamps in a variety of formats. • A cookie extractor, which can identify cookies from the Set-Cookie: header in web page cache files. • A US social security number extractor, which identifies the patterns ###-##-#### and ######### when preceded by the letters SSN and an optional colon. • A credit card number extractor.
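Two of the extractors listed above can be sketched with regular expressions run over raw bytes. This is a minimal illustration, not the authors' implementation: the email pattern is a deliberate simplification of the RFC 822 address grammar, and the names `EMAIL_RE`, `SSN_RE`, and `extract_features` are hypothetical.

```python
import re

# Simplified email pattern (assumption): the real RFC 822 grammar is far more permissive.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# SSN in ###-##-#### or ######### form, accepted only when preceded by
# the letters "SSN" and an optional colon, as the slide describes.
SSN_RE = re.compile(rb"SSN:?\s*(\d{3}-\d{2}-\d{4}|\d{9})(?!\d)")


def extract_features(data: bytes) -> dict:
    """Return the email addresses and SSN-like strings found in raw bytes."""
    return {
        "email": [m.group(0).decode("ascii") for m in EMAIL_RE.finditer(data)],
        "ssn": [m.group(1).decode("ascii") for m in SSN_RE.finditer(data)],
    }


# Hypothetical fragment of bulk data with binary noise around the features.
sample = b"\x00\x01From: alice@example.com\x00junk SSN: 123-45-6789 \xff bob@example.org"
features = extract_features(sample)
```

Working on bytes rather than decoded text lets the same patterns run over unallocated space and file fragments, where no single character encoding can be assumed.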

Single Drive analysis • Extracted features can be used to speed initial analysis and answer specific questions about a drive image. • The authors have successfully used extracted features for drive image attribution and to build a tool that scans disks to report the likely existence of information that should have been destroyed under the Fair and Accurate Credit Transactions Act. • Drive attribution: an analyst might encounter a hard drive and wish to determine to whom that drive previously belonged. For example, the drive might have been purchased on eBay and the analyst might be attempting to return it to its previous owner. • A powerful technique for making this determination is to create a histogram of the email addresses on the drive (as returned by the email address feature extractor).
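The histogram idea can be sketched in a few lines. This assumes the most frequent address on a drive points to its primary user, which is a heuristic and not a guarantee; the sample addresses and the name `email_histogram` are hypothetical.

```python
from collections import Counter


def email_histogram(addresses):
    """Count occurrences of each address, ignoring case."""
    return Counter(a.lower() for a in addresses)


# Hypothetical output of the email address extractor for one drive.
extracted = [
    "alice@example.com",
    "Alice@example.com",
    "bob@example.org",
    "alice@example.com",
    "carol@example.net",
]

hist = email_histogram(extracted)
# The top entry is only a candidate owner; an analyst still has to confirm it.
likely_owner, count = hist.most_common(1)[0]
```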

Cross drive analysis (CDA) • Cross-drive analysis is the term the authors coined to describe forensic analysis of a data set that spans multiple drives. • The fundamental theory of cross-drive analysis is that data gleaned from multiple drives can improve the forensic analysis of a drive in question, both when the multiple drives are related to the drive in question and when they are not. • There are two forms of CDA: first order, in which the results of a feature extractor are compared across multiple drives, an O(n) operation; and second order, in which the results are correlated, an O(n^2) operation.
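The difference between the two forms can be illustrated on toy data. In this sketch the drive IDs and feature sets are hypothetical, and scoring a pair of drives by the raw number of shared features is an assumption made for illustration; the slides do not specify the scoring function.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical feature sets keyed by drive ID.
drives = {
    "drive1": {"a@example.com", "b@example.com", "SSN:123-45-6789"},
    "drive2": {"a@example.com", "b@example.com"},
    "drive3": {"c@example.com"},
}


def first_order(drives):
    """Map each feature to the drives containing it: one pass over the n drives."""
    seen = defaultdict(set)
    for drive_id, features in drives.items():
        for feature in features:
            seen[feature].add(drive_id)
    return seen


def second_order(drives):
    """Score every pair of drives by shared-feature count: n*(n-1)/2 comparisons."""
    return {
        (a, b): len(drives[a] & drives[b])
        for a, b in combinations(sorted(drives), 2)
    }


feature_index = first_order(drives)
pair_scores = second_order(drives)
```

High-scoring pairs in `pair_scores` are candidates for drives that came from the same organization.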

Implementation • 1. Disks collected are imaged into a single AFF file. (AFF is the Advanced Forensic Format, a file format for disk images that contains all of the data accession information, such as the drive’s manufacturer and serial number, as well as the disk contents.) • 2. The afxml program is used to extract drive metadata from the AFF file and build an entry in the SQL database. • 3. Strings are extracted with an AFF-aware program in three passes: one for 8-bit characters, one for 16-bit characters in lsb format, and one for 16-bit characters in msb format. • 4. Feature extractors run over the string files and write their results to feature files. • 5. Extracted features from newly-ingested drives are run against a watch list; hits are reported to the human operator. • 6. The feature files are read by indexers, which build indexes in the SQL server of the identified features.

Implementation • 7. A multi-drive correlation is run to see if the newly accessioned drive contains features in common with any drives that are on a drive watch list. • 8. A user interface allows multiple analysts to simultaneously interact with the database, to schedule new correlations to be run in a batch mode, or to view individual sectors or recovered files from the drive images that are stored on the file server.
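The three string-extraction passes in step 3 can be sketched as follows. This is not the AFF-aware program from the paper: it works on an in-memory byte buffer, reads "lsb" and "msb" as little-endian and big-endian 16-bit characters, handles printable ASCII only, and uses a hypothetical `MIN_LEN` threshold.

```python
import re

MIN_LEN = 4  # assumed minimum run length (4 is the GNU strings default)

# One pattern per pass: 8-bit, 16-bit little-endian (lsb), 16-bit big-endian (msb).
PATTERNS = {
    "8bit": (re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN), "ascii"),
    "16le": (re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_LEN), "utf-16-le"),
    "16be": (re.compile(rb"(?:\x00[\x20-\x7e]){%d,}" % MIN_LEN), "utf-16-be"),
}


def extract_strings(data: bytes, encoding: str) -> list:
    """Return the printable ASCII runs stored in the given encoding."""
    pattern, codec = PATTERNS[encoding]
    return [m.group(0).decode(codec) for m in pattern.finditer(data)]


# Hypothetical buffers standing in for disk contents.
plain = b"\x01\x02hello world\x00\x03"
wide = b"\xff" + "secret".encode("utf-16-le") + b"\xff"

ascii_strings = extract_strings(plain, "8bit")
wide_strings = extract_strings(wide, "16le")
```

Because big-endian text looks like little-endian text shifted by one byte, the two 16-bit passes can report overlapping, partially truncated strings for the same region; later stages should tolerate such duplicates.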

Directions • Improve feature extraction • Improve the algorithms • Develop end to end systems

Abstract of Paper 2 • Establishing the time at which a particular event happened is a fundamental concern when relating cause and effect in any forensic investigation. Reliance on computer generated timestamps for correlating events is complicated by uncertainty as to clock skew and drift, environmental factors such as location and local time zone offsets, as well as human factors such as clock tampering. Establishing that a particular computer’s temporal behavior was consistent during its operation remains a challenge. The contributions of this paper are both a description of assumptions commonly made regarding the behavior of clocks in computers, and empirical results demonstrating that real world behavior diverges from the idealized or assumed behavior. The authors present an approach for inferring the temporal behavior of a particular computer over a range of time by correlating commonly available local machine timestamps with another source of timestamps. They show that a general characterization of the passage of time may be inferred from an analysis of commonly available browser records.

Outline • Introduction • Factors to consider • Drifting clocks • Identifying computer timescales by correlation with corroborating sources • Directions

Introduction • Timestamps are increasingly used to relate events which happen in the digital realm to each other and to events which happen in the physical realm, helping to establish cause and effect. • The difficulty with timestamps is how to interpret and relate timestamps generated by separate computer clocks when they are not synchronized. • Current approaches to inferring the real world interpretation of timestamps assume idealized models of computer clocks. • There is uncertainty about the behavior of a suspect’s computer clock before seizure. • The authors explore two themes related to this uncertainty. • First, they investigate whether it is reasonable to assume uniform behavior of computer clocks over time, and test these assumptions by attempting to characterize how computer clocks behave in the wild. • Second, they investigate the feasibility of automatically identifying the local time on a computer by correlating timestamps embedded in digital evidence with corroborative time sources.

Factors • Computer timekeeping • Real-time synchronization • Factors affecting timekeeping accuracy • Clock configuration • Tampering • Synchronization protocol • Misinterpretation • Usage of timestamps in forensics

Drifting clocks behavior • The authors enumerate the main factors influencing the temporal behavior of a computer's clock, and then attempt to experimentally validate whether one can make informed assumptions about such behavior. • They do this by empirically studying the temporal behavior of a network of computers found in the wild. • The subject of the case study is a network of machines in active use by a small business. The network consists of a Windows 2000 domain, with one Windows 2000 server and a mixed number of Windows XP and 2000 workstations. • The goal is to observe the temporal behavior. To do so, the authors constructed a simple service that logs both the system time of a host computer and the civil time for the location. • The program samples both sources of time and logs the results to a file. The logging program was deployed on all workstations and the server.
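A logging service of the kind described above might look like the sketch below. It is not the authors' program: the clocks are passed in as callables so the example can run with fake data, and `reference_clock` is a hypothetical stand-in for whatever source of civil time is used in practice (for example, a network time query).

```python
import csv
import io
import time
from datetime import datetime, timezone


def iso(ts: float) -> str:
    """Format a Unix timestamp as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


def log_samples(out, system_clock, reference_clock, count, interval=0.0):
    """Sample both clocks `count` times and write one CSV row per sample."""
    writer = csv.writer(out)
    writer.writerow(["system_utc", "reference_utc", "offset_s"])
    for _ in range(count):
        sys_t = system_clock()
        ref_t = reference_clock()
        # A positive offset means the system clock runs ahead of the reference.
        writer.writerow([iso(sys_t), iso(ref_t), f"{sys_t - ref_t:.3f}"])
        time.sleep(interval)


# Fake clocks (assumption): the system clock gains 0.5 s between the two samples.
fake_system = iter([1000.0, 1060.5]).__next__
fake_reference = iter([1000.0, 1060.0]).__next__

buf = io.StringIO()
log_samples(buf, fake_system, fake_reference, count=2)
rows = buf.getvalue().splitlines()
```

In a real deployment `system_clock` would be `time.time` and the output would be an append-mode file; plotting `offset_s` over days shows whether a clock drifts steadily or jumps.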

Correlation • The authors propose an automated approach that correlates time stamped events found on a suspect computer with time stamped events from a more reliable, corroborating source. • Web browser records are increasingly employed as evidence in investigations, and are a rich source of time stamped data. • Two techniques are implemented: a click stream correlation algorithm and a non-cached correlation algorithm. • The authors compare the results of both algorithms.
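The general idea of correlating against a corroborating source can be illustrated with a toy estimator. This is neither the click stream nor the non-cached algorithm from the paper, whose details are not given in these slides; it simply pairs events by URL and takes the median time difference. The event lists and function names are hypothetical.

```python
from collections import Counter
from statistics import median


def unique_by_url(events):
    """Keep only URLs that occur exactly once, so each match is unambiguous."""
    counts = Counter(url for url, _ in events)
    return {url: ts for url, ts in events if counts[url] == 1}


def estimate_offset(local_events, reference_events):
    """Median of (local - reference) seconds over events matched by URL.

    Each argument is a list of (url, unix_timestamp) pairs.
    Returns None when no URL can be matched.
    """
    local = unique_by_url(local_events)
    reference = unique_by_url(reference_events)
    offsets = [local[u] - reference[u] for u in local.keys() & reference.keys()]
    return median(offsets) if offsets else None


# Hypothetical browser history (suspect machine) and server-side records.
browser_history = [("/a", 1000.0), ("/b", 1100.0), ("/c", 1205.0), ("/d", 1300.0), ("/d", 1310.0)]
server_log = [("/a", 940.0), ("/b", 1040.0), ("/c", 1140.0), ("/d", 1245.0)]

offset = estimate_offset(browser_history, server_log)
```

The median is used so that a few mismatched or delayed records do not dominate the estimate; a single offset also ignores drift, which the empirical study suggests cannot be assumed away.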

Directions • Need to determine whether the conditions and the assumptions of the experiments are realistic • What are the most appropriate correlation algorithms? • Need to integrate with clock synchronization algorithms
