Reversing Data Formats: What Data Can Reveal. by Anton Dorfman ZeroNights 0x03, Moscow, 2013. About me. Fan of & Fun with Assembly language Researcher Scientist Teach Reverse Engineering since 2001 Candidate of technical science. Observed topics. Samples Basics Thoughts Fields
Reversing Data Formats: What Data Can Reveal by Anton Dorfman ZeroNights 0x03, Moscow, 2013
About me • Fan of & Fun with Assembly language • Researcher • Scientist • Teaching reverse engineering since 2001 • Candidate of Technical Sciences
Observed topics • Samples • Basics • Thoughts • Fields • Related works • Applications
Standard way • Hex editor • Researcher • Brain (equally important as the hex editor) • Basic knowledge of how data can be organized (in the brain) • Analysis of the executable file that manipulates the data format
3 ways of researching • Dynamic analysis – reverse engineer the application that manipulates the data format • Static analysis – try to extract the header, structures and fields, and try to find relationships between the data • Statistical analysis – when a number of samples are available, use statistics of the changes and ranges of values
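A minimal sketch of the statistical approach, assuming several samples of the same format are already loaded as byte strings. The function name `byte_variability` and the sample records are hypothetical and only illustrate the idea: offsets that never change are candidates for IDs or constants, while offsets that always change are candidates for counters or data.

```python
def byte_variability(samples):
    """For each offset, count how many distinct byte values occur across samples."""
    length = min(len(s) for s in samples)  # compare only the common prefix
    return [len({s[i] for s in samples}) for i in range(length)]


# Hypothetical samples: a 4-byte constant ID, a changing byte, a payload byte.
samples = [
    b"DEMO\x01\x10",
    b"DEMO\x02\x7f",
    b"DEMO\x03\x10",
]
print(byte_variability(samples))  # 1 means "constant across all samples"
```

With only three samples the counts are weak evidence; the more samples are available, the more the constant and variable regions stand out.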
Breaking the rules • There is the rule RTFM (Read The F**king Manual) • Nobody likes it • I’m no exception • First I start my own research, and only then try to find related works and generalize ideas from them
Definitions • Field – a value used to describe the data format • Structure – a way of organizing various fields • Data – the information that the data format was developed to represent • Header – a common structure placed before the data; may contain substructures
Main Idea • A data format is developed by a human (not pets or aliens) • Sometimes it looks as if no human worked on it… • Format developers use common data organization concepts and similar thoughts when creating new data formats • If we find regularities in the rules of data format organization, we can automate the search for them
Three types of data format • File formats – multimedia formats, database formats, internal formats for exchange between program components, etc. • Protocols – network protocols, hardware device interaction protocols, protocols of interaction between a driver and a user-space application, etc. • Structures in memory – OS structures, application structures, etc.
Levels of abstraction • Bit Order • Byte Order • Field Size • Field Basic Type • Field Type • Structure • Field Semantics
Tasks to solve • Extract the header, separate it from the data • Find field boundaries • Find structures and substructures • Find types of fields • Detect bit and byte ordering • Determine the semantics of fields • Interpret the purpose of each field – its semantics
Bit order • There are the most significant bit (MSB) and the least significant bit (LSB) • MSB is the left-most bit – commonly used – the bits 10010101 give the value 10010101b – 95h – 149 • LSB is the left-most bit – the same bits give the value 10101001b – A9h – 169
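The two readings of the same bit sequence can be checked with a short sketch. The helper name `reverse_bits` is an illustrative assumption, not something from the talk; it simply reverses the bit order within a field of a given width.

```python
def reverse_bits(value, width=8):
    """Reverse the order of the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)  # move the current LSB to the other end
        value >>= 1
    return result


bits = "10010101"                # the bit sequence as it appears in the data
msb_first = int(bits, 2)         # left-most bit is the MSB
lsb_first = int(bits[::-1], 2)   # left-most bit is the LSB
print(hex(msb_first), msb_first)
print(hex(lsb_first), lsb_first)
```

When a field looks meaningless in one bit order, trying the reversed interpretation is a cheap check.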
Byte order • Little-endian – Intel byte order • Big-endian – network byte order • Middle-endian or mixed-endian – mixed byte order – occurs when the length of a value is greater than the default processor machine word
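As a rough illustration, the same four bytes can be decoded under each byte order with Python's `struct` module. The mixed layout shown here is only one possible variant (PDP-11 style: 16-bit little-endian words, most significant word first), chosen as an assumption for the example.

```python
import struct

raw = bytes([0x12, 0x34, 0x56, 0x78])

little = struct.unpack("<I", raw)[0]  # least significant byte first
big = struct.unpack(">I", raw)[0]     # most significant byte first

# One mixed-endian variant: each 16-bit word is little-endian,
# but the high word is stored before the low word.
high_word, low_word = struct.unpack("<HH", raw)
mixed = (high_word << 16) | low_word

print(f"little: {little:#010x}")
print(f"big:    {big:#010x}")
print(f"mixed:  {mixed:#010x}")
```

Decoding a suspected numeric field under each ordering and seeing which result is plausible (a size that fits the file, an offset that lands on a structure) is usually enough to settle the question.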
Types of fields • Service fields – describe the data format itself (size of structure, etc.) • Common fields – “fields from life” (time, date, etc.) • Specific fields – we can find a range for that type (bit flags, etc.) • Application-specific fields – can be interpreted only by the application; we should use dynamic analysis
Field size and levels of field interpretation • Commonly a field is a byte sequence • Sometimes a field is a bit sequence • Fixed size in bytes (1, 2, 3, 4, …) • Fixed size in bits (1, 2, 3, 4, …) • Variable size in bytes • Variable size in bits
Basic field types • Hex value – by default • Decimal value • Character value (up to 4 symbols) • String (ASCII or Unicode) • Float value • Bits value
Types of field • Text • Offset • Size • Quantity • ID • Flag • Counter • DateTime • Protect
Text • Usually fixed size, with the remaining space after the text filled with 0 • Variable-size text ends with 0, or there is a separate size field for it • Text strings usually reveal their own meaning
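The two text layouts can be read with a sketch like the following. The helper names `read_fixed_text` and `read_cstring` and the sample record are hypothetical, and ASCII is assumed for simplicity.

```python
def read_fixed_text(buf, offset, size):
    """Read a fixed-size text field and drop the zero padding after the text."""
    raw = buf[offset:offset + size]
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")


def read_cstring(buf, offset):
    """Read zero-terminated text; return the text and the offset just past the 0."""
    end = buf.index(b"\x00", offset)
    return buf[offset:end].decode("ascii", errors="replace"), end + 1


# Hypothetical record: 8-byte padded field, then a zero-terminated string.
record = b"name\x00\x00\x00\x00title\x00rest"
print(read_fixed_text(record, 0, 8))
print(read_cstring(record, 8))
```

A run of zero bytes after printable text that always ends on the same offset hints at the fixed-size layout; text followed directly by unrelated data hints at the terminated one.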
Offset • Offset to another field • Offset to data • Pointer to another structure (may be child or parent) • Pointer to another instance of such structure (next or previous in linked list) • Can be relative and absolute
Size • Size of data • Size of structure • Size of substructure • Size of field • May be measured not only in bytes, but also in bits, words (2 bytes), double words, paragraphs (16 bytes), etc.
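One simple heuristic for spotting size fields is to look for values that match a length already known from the sample itself. The sketch below assumes little-endian fields measured in bytes; the function name `find_size_fields` and the sample layout are hypothetical.

```python
import struct


def find_size_fields(buf, width=4):
    """List offsets whose little-endian value matches a known length of buf."""
    fmt = {2: "<H", 4: "<I"}[width]
    hits = []
    for off in range(len(buf) - width + 1):
        value = struct.unpack_from(fmt, buf, off)[0]
        if value == len(buf):
            hits.append((off, "total size"))
        elif value == len(buf) - (off + width):
            hits.append((off, "size of what follows"))
    return hits


# Hypothetical sample: 4-byte ID, 4-byte payload size, payload.
payload = b"hello, world"
sample = b"DEMO" + struct.pack("<I", len(payload)) + payload
print(find_size_fields(sample))
```

Sizes expressed in other units (bits, words, paragraphs) would need the candidate value scaled before the comparison, and a hit in one sample should be confirmed against samples of different lengths.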
Quantity • Quantity of substructures • Quantity of elements • IMAGE_NT_HEADERS.IMAGE_FILE_HEADER.NumberOfSections • IMAGE_EXPORT_DIRECTORY.NumberOfFunctions
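The `NumberOfSections` example can be read from a PE image in a few lines. This is a sketch that relies on the documented PE layout (the `e_lfanew` field at offset 0x3C of the DOS header, the `PE\0\0` signature, then `IMAGE_FILE_HEADER`); the function name and the synthetic test image are assumptions made for illustration, and the image is not a valid executable.

```python
import struct


def pe_number_of_sections(buf):
    """Return IMAGE_FILE_HEADER.NumberOfSections from a PE image."""
    if buf[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (e_lfanew,) = struct.unpack_from("<I", buf, 0x3C)  # offset of the PE header
    if buf[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # IMAGE_FILE_HEADER starts right after the signature:
    # Machine (2 bytes), then NumberOfSections (2 bytes).
    _machine, number_of_sections = struct.unpack_from("<HH", buf, e_lfanew + 4)
    return number_of_sections


# Synthetic image with just enough structure for the lookup.
stub = bytearray(0x40)
stub[:2] = b"MZ"
struct.pack_into("<I", stub, 0x3C, 0x40)
image = bytes(stub) + b"PE\x00\x00" + struct.pack("<HH", 0x014C, 3)
print(pe_number_of_sections(image))
```

The example also shows an offset field (`e_lfanew`) and an ID (the signature) working together with a quantity field, which is typical of real headers.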
ID • Identifier of the data format, often called a magic number • ID of structure • ID of substructure • Often the ID is a value consisting of character symbols • Often the ID looks like “magic” – BE BA FE CA
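A small sketch of how a candidate ID might be classified: printable bytes are shown as text, anything else as a number in both byte orders. The function name `describe_id` is hypothetical. The bytes BE BA FE CA from the slide read as CAFEBABE when interpreted as a little-endian 32-bit value.

```python
def describe_id(raw):
    """Describe a candidate ID as text or as a number in both byte orders."""
    if all(0x20 <= b < 0x7F for b in raw):  # printable ASCII only
        return "text ID: " + raw.decode("ascii")
    little = int.from_bytes(raw, "little")
    big = int.from_bytes(raw, "big")
    return f"numeric ID: LE=0x{little:X} BE=0x{big:X}"


print(describe_id(b"RIFF"))
print(describe_id(bytes.fromhex("BE BA FE CA")))
```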
Flag • Bit flags • Enum values used as flags • Special case – Bool value
Counter • Increment (possibly decrement) • Starts with 0 / begins with another value • Changes by 1 or another value
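Given samples in the order they were captured, a counter shows up as a field whose value changes by the same non-zero step every time. The sketch below is a heuristic under that assumption; `find_counters` and the sample packets are hypothetical, and little-endian fields are assumed.

```python
def find_counters(samples, width=1):
    """Find offsets whose value changes by a constant non-zero step between samples."""
    length = min(len(s) for s in samples)
    hits = []
    for off in range(length - width + 1):
        values = [int.from_bytes(s[off:off + width], "little") for s in samples]
        steps = {b - a for a, b in zip(values, values[1:])}
        if len(steps) == 1 and 0 not in steps:
            hits.append((off, values[0], steps.pop()))  # (offset, start value, step)
    return hits


# Hypothetical packets in capture order: "PK", a one-byte counter, two data bytes.
samples = [b"PK\x07\x10\x99", b"PK\x08\x42\x07", b"PK\x09\x11\x50"]
print(find_counters(samples))
```

With wider fields, windows that overlap the real counter can also show a constant step, so hits need to be reviewed together rather than one by one.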
DateTime • Storing format • Resolution • Moment of beginning • Range
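One way to test a DateTime hypothesis is to pick a storing format, resolution and starting moment, decode every candidate, and keep only values that fall in a plausible range. The sketch below assumes 32-bit little-endian seconds since 1970-01-01 UTC; the function name, the year range and the sample record are all illustrative assumptions.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed moment of beginning


def plausible_unix_times(buf, earliest=1990, latest=2030):
    """Find 32-bit little-endian values that decode to a year in the given range."""
    hits = []
    for off in range(len(buf) - 3):
        (value,) = struct.unpack_from("<I", buf, off)
        moment = EPOCH + timedelta(seconds=value)  # resolution: one second
        if earliest <= moment.year <= latest:
            hits.append((off, moment))
    return hits


# Hypothetical record: 4 zero bytes, a timestamp, 4 bytes of 0xFF.
record = bytes(4) + struct.pack("<I", 1384300800) + b"\xff" * 4
for off, moment in plausible_unix_times(record):
    print(off, moment.isoformat())
```

A range check alone gives false positives on real data, so candidates should be compared across samples, for example against file modification times.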
Protect • CRC value of data • CRC value of substructure • Various hash functions, etc.
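A protection field can sometimes be confirmed by computing a well-known checksum over a candidate region and searching for the result. The sketch below tries only one hypothesis: a little-endian CRC-32 (as computed by `zlib.crc32`) stored immediately before the data it covers. The function name and the sample layout are assumptions; real formats may use other polynomials, regions or byte orders.

```python
import struct
import zlib


def find_crc32_of_tail(buf):
    """Find offsets where a 32-bit value equals the CRC-32 of everything after it."""
    hits = []
    for off in range(len(buf) - 4):
        (stored,) = struct.unpack_from("<I", buf, off)
        if stored == zlib.crc32(buf[off + 4:]):
            hits.append(off)
    return hits


# Hypothetical sample: 4-byte ID, CRC-32 of the payload, payload.
payload = b"some protected data"
sample = b"DEMO" + struct.pack("<I", zlib.crc32(payload)) + payload
print(find_crc32_of_tail(sample))
```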
Projects, articles and authors • Protocol Informatics Project – based on “Network Protocol Analysis using Bioinformatics Algorithms” by Marshall A. Beddoe • Discoverer – “Discoverer: Automatic Protocol Reverse Engineering from Network Traces” by Weidong Cui, Jayanthkumar Kannan, Helen J. Wang • Laika – “Digging For Data Structures” by Anthony Cozzie, Frank Stratton, Hui Xue, and Samuel T. King • RolePlayer – “Protocol-Independent Adaptive Replay of Application Dialog” by Weidong Cui, Vern Paxson, Nicholas C. Weaver, Randy H. Katz • Tupni – “Tupni: Automatic Reverse Engineering of Input Formats” by Weidong Cui, Marcus Peinado, Karl Chen, Helen J. Wang, Luiz Irun-Briz
Protocol Informatics Project • Global and local sequence alignment – Needleman-Wunsch and Smith-Waterman algorithms
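To illustrate what global alignment does with protocol messages, here is a compact textbook Needleman-Wunsch sketch. It is not the Protocol Informatics Project's code; the scoring values (match +1, mismatch -1, gap -1) and the example messages are assumptions chosen for readability.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("GET /a", "GET /ab"))
```

Columns that match across many aligned messages suggest constant fields such as keywords or IDs, while gaps and mismatches mark variable-length or variable-content fields.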
Laika • Using Bayesian unsupervised learning • For fixed size structures only
Tupni • Based on a taint tracking engine
Applications • RE any program • RE undocumented/proprietary file formats • RE undocumented/proprietary network protocols • RE undocumented structures in memory • Fuzzing • Examination of protocol implementation • Replay network interaction • Zero-day vulnerability signature generation
Thank you! • Questions? • Meet me at researcher@inbox.ru, • dorfmananton@gmail.com