
Deriving Input Syntactic Structure From Execution Zhiqiang Lin Xiangyu Zhang

The 16th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE’08). Deriving Input Syntactic Structure From Execution. Zhiqiang Lin, Xiangyu Zhang. Purdue University, November 11th, 2008. Motivation -- Most software takes structural input.


Presentation Transcript


  1. The 16th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE’08) Deriving Input Syntactic Structure From Execution Zhiqiang Lin Xiangyu Zhang Purdue University November 11th, 2008

  2. Motivation -- Most software takes structural input

  3. Applications -- Software Testing/Debugging
     • Using Input Grammar to Generate Test Cases
       • K. Hanford. Automatic Generation of Test Cases. In IBM Systems Journal, 9(4), 1970.
       • P. Purdom. A sentence generator for testing parsers. In BIT Numerical Mathematics, 12(3), 1972.
       • Grammar-based whitebox fuzzing [PLDI’08]
     • Delta Debugging
       • Reducing large failure input [TSE’02]
       • Hierarchical Delta Debugging (HDD) [ICSE’06]
     • Execution Fast Forwarding
       • Reducing Event Log for failure replay [FSE’06]

  4. Applications -- Computer Security
     • Malware/attack instance signature generation
       • Exploit (input) signature
       • Payload length, keywords, field structure…
     • Penetration testing → software vulnerability
       • Play with input (fuzz)
       • Packet Vaccine [CCS’06]
       • ShieldGen [IEEE S&P’07]
     • Malware Protocol Replayer
       • Malware feature → replay the protocol → input format

  5. Challenges
     • Input structure exists in a machine-unfriendly way
       • Plain text (ASCII stream, e.g., C file)
       • Binary code (protocol message stream)
     • Known specification (RFC)
       • Implementation deviation
     • Unknown specification
       • Malware
         • Bot → botnet protocol
       • Legal software
         • SAMBA protocol (12 years for open source community)

  6. Challenges
     • May not have source code access
       • Penetration testing
       • Malware analysis
       • Legal software
     • Working on binary

  7. Our Contributions
     • Two different approaches to handling two types of parsers
       • Using dynamic control dependency to handle top-down parsers
       • A new dynamic analysis to handle bottom-up parsers by identifying and analyzing the parsing stack
     • Experimental results show that the proposed analyses are highly effective in producing very precise input syntax trees

  8. Outline
     • Motivation
     • Technical Description
       • Handling Inputs with a Top-down Parser
       • Handling Inputs with a Bottom-up Parser
     • Evaluation
     • Discussion
     • Related Work
     • Conclusion

  9. I. Top-down Parser
     • Parses input in a top-down manner.
     • Grammar:
         S → H B
         H → h N
         N → 1 | 2
         B → b B | ε
     • Input: h1bbε
     • Parse tree: S(H(h, N(1)), B(b, B(b, B(ε))))

  10. Implementation
      • Grammar: S → H B, H → h N, N → 1 | 2, B → b B | ε
      • Parser (line numbers are referenced on later slides):
              void Parser() {
           1    char c = getchar();
           2    if (c == 'h') {
           3      c = getchar();
           4      if (c == '1' || c == '2') {
           5        c = getchar();
           6      } else error();
           7    } else
           8      error();
           9    while (c == 'b') {
          10      c = getchar();
          11      if (c == 'ε') {
          12        break;
          13      }
          14    }
          15    error();
          16  }

  11. Execution Trace
      • Input: h1bbε (b1 and b2 denote the first and second b)
      • Trace of Parser() from slide 10:
          c = getchar()                 reads h
          if (c == 'h')
          c = getchar()                 reads 1
          if (c == '1' || c == '2')
          c = getchar()                 reads b1
          while (c == 'b')
          c = getchar()                 reads b2
          if (c == 'ε')
          while (c == 'b')
          c = getchar()                 reads ε
          if (c == 'ε')
          break
      • Control dependency: a statement Y is control-dependent on X iff X directly determines whether Y executes.

  12. Execution Trace (continued; same code, trace, and control dependency definition as slide 11)

  13. Control dependency graph for the execution trace
      • A control dependency graph is a graph in which any given node directly controls the execution of its child nodes.
      • Graph for input h1bbε (each c = getchar() is labeled with the character it reads):
          START
            c = getchar() [h]
            if (c == 'h')
              c = getchar() [1]
              if (c == '1' || c == '2')
                c = getchar() [b1]
            while (c == 'b')
              c = getchar() [b2]
              if (c == 'ε')
              while (c == 'b')
                c = getchar() [ε]
                if (c == 'ε')
                  break
      • The parse tree S(H(h, N(1)), B(b, B(b, B(ε)))) is shown alongside the graph.

  14. Eliminate non data use node
      • Figure: the control dependency graph from slide 13, from which the c = getchar() and break nodes are eliminated, leaving only the predicate nodes.

  15. Add Data Use Leaf Node
      • Graph after elimination, before the leaves are added:
          START
            if (c == 'h')
              if (c == '1' || c == '2')
            while (c == 'b')
              if (c == 'ε')
              while (c == 'b')
                if (c == 'ε')

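  16. Add Data Use Leaf Node
      • Each predicate node gets a leaf for the input character it uses:
          START
            if (c == 'h')                   leaf: h
              if (c == '1' || c == '2')     leaf: 1
            while (c == 'b')                leaf: b1
              if (c == 'ε')                 leaf: b2
              while (c == 'b')              leaf: b2
                if (c == 'ε')               leaf: ε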

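  17. Eliminate Redundant Node
      • Nodes are labeled with the source line of the predicate; a subscript marks the execution instance:
          START
            2:   if (c == 'h')                  leaf: h
              4:   if (c == '1' || c == '2')    leaf: 1
            9₁:  while (c == 'b')               leaf: b1
              11₁: if (c == 'ε')                leaf: b2
              9₂:  while (c == 'b')             leaf: b2
                11₂: if (c == 'ε')              leaf: ε
      • Callout in the figure: "Identical Node"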
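The tree built on slides 13 to 16 can be sketched in C. This is a minimal, hand-instrumented illustration under stated assumptions, not the authors' binary-level analysis. The names `parse_traced`, `Trace`, `Node`, and `add_node` are hypothetical, `'$'` stands in for the ε end marker, and the `break` on line 12 is treated as acceptance. Each predicate's controlling parent is written in by hand here, whereas a real analysis would derive it from the program's control flow.

```c
#include <assert.h>

#define MAX_NODES 64

/* One executed predicate instance in the dynamic control dependency tree. */
typedef struct {
    int line;      /* source line of the predicate, as numbered on slide 10 */
    int parent;    /* index of the controlling predicate, -1 for START */
    int input_pos; /* index of the input character the predicate tests */
} Node;

typedef struct {
    Node nodes[MAX_NODES];
    int count;
} Trace;

/* Append a node and return its index. Inputs that would need more than
   MAX_NODES predicate instances trip the assertion. */
static int add_node(Trace *t, int line, int parent, int input_pos) {
    assert(t->count < MAX_NODES);
    Node n = {line, parent, input_pos};
    t->nodes[t->count] = n;
    return t->count++;
}

/* Parse `in` with the toy grammar S -> H B, H -> h N, N -> 1|2, B -> b B | end,
   recording every executed predicate. Returns 1 on acceptance, 0 otherwise. */
int parse_traced(const char *in, Trace *t) {
    int pos = 0;
    t->count = 0;

    int p2 = add_node(t, 2, -1, pos);      /* line 2: if (c == 'h') */
    if (in[pos] != 'h') return 0;
    pos++;

    add_node(t, 4, p2, pos);               /* line 4: if (c == '1' || c == '2') */
    if (in[pos] != '1' && in[pos] != '2') return 0;
    pos++;

    int ctrl = -1;                         /* first loop test depends on START */
    for (;;) {
        int w = add_node(t, 9, ctrl, pos); /* line 9: while (c == 'b') */
        if (in[pos] != 'b') return 0;
        pos++;
        add_node(t, 11, w, pos);           /* line 11: if (c == end marker) */
        if (in[pos] == '$') return 1;
        ctrl = w;                          /* next iteration depends on this one */
    }
}
```

For the input `"h1bb$"` the recorded nodes correspond to 2, 4, 9₁, 11₁, 9₂, and 11₂ on slide 17, with `input_pos` playing the role of the data-use leaf.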

  18. II. Bottom-up Parser
      • Parses input in a bottom-up manner
      • Programming languages
        • lex/yacc
      • Grammar:
          S → A B
          A → a a
          B → b
      • Input: aab
      • Parse tree: S(A(a, a), B(b))

  19. A General Bottom Up Parsing Algorithm
      • Algorithm:
          while (…) {
            if (stack should not be reduced) {
              stack.push(c);
              …
            } else {
              // A → β
              stack.pop(|β|);
              stack.push(A);
            }
          }
      • Grammar: S → A B, A → a a, B → b; input: aab
      • Trace:
          while (…); if (stack should not be reduced); stack.push(a),
          while (…); if (stack should not be reduced); stack.push(a),
          while (…); if (stack should not be reduced); stack.pop(aa); stack.push(A) …

  20. A General Bottom Up Parsing Algorithm (continued; same algorithm, grammar, and trace as slide 19)
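The push/pop loop above can be made concrete for the slide's grammar S → A B, A → a a, B → b. The following is a minimal sketch that assumes a greedy "reduce whenever a right-hand side is on top of the stack" policy, which is enough for this toy grammar but is not a general LR parser. `Parser`, `parse_bottom_up`, `try_reduce`, and the log format are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_STACK 64
#define MAX_LOG 512

typedef struct {
    char stack[MAX_STACK]; /* the parsing stack of grammar symbols */
    int top;
    char log[MAX_LOG];     /* human-readable stack operation trace */
} Parser;

/* Append "Op(symbols)" to the log; snprintf truncates if the log is full. */
static void log_op(Parser *p, const char *op, const char *syms, int n) {
    size_t used = strlen(p->log);
    snprintf(p->log + used, MAX_LOG - used, "%s%s(%.*s)",
             used ? ", " : "", op, n, syms);
}

static void push(Parser *p, char sym) {
    assert(p->top < MAX_STACK);
    p->stack[p->top++] = sym;
    log_op(p, "Push", &sym, 1);
}

/* If the top of the stack equals rhs, pop it and push lhs (one reduction). */
static int try_reduce(Parser *p, const char *rhs, char lhs) {
    int n = (int)strlen(rhs);
    if (p->top < n || memcmp(p->stack + p->top - n, rhs, (size_t)n) != 0)
        return 0;
    log_op(p, "Pop", rhs, n);
    p->top -= n;
    push(p, lhs);
    return 1;
}

/* Shift each input symbol, then reduce as long as any rule applies.
   Returns 1 if the whole input reduces to the start symbol S. */
int parse_bottom_up(const char *in, Parser *p) {
    p->top = 0;
    p->log[0] = '\0';
    for (; *in; in++) {
        push(p, *in);
        while (try_reduce(p, "aa", 'A') ||
               try_reduce(p, "b", 'B') ||
               try_reduce(p, "AB", 'S')) {
            /* keep reducing */
        }
    }
    return p->top == 1 && p->stack[0] == 'S';
}
```

Calling `parse_bottom_up("aab", &p)` leaves the sequence of stack operations in `p.log`, which is the kind of trace the analysis needs to observe once the parsing stack has been identified.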

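  21. Tree Construction
      • Identify the parsing stack
      • Grammar: S → A B, A → a a, B → b; input: aab
      • Stack operation trace:
          Push(a), Push(a), Pop(aa), Push(A)
          Push(b), Pop(b), Push(B), Pop(AB), Push(S)
      • Tree built from the trace (the figure also carries a Pop(b) label and an "Identical Node" callout):
          Push(S)
            Push(A)
              Push(a)
              Push(a)
            Push(B)
              Push(b)
      • Parse tree shown alongside: S(A(a, a), B(b))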
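One way to turn such a stack operation trace into a tree is to make every popped entry a child of the next pushed symbol. The sketch below assumes that rule and assumes the trace is already available as an array. `StackOp`, `TreeNode`, and `build_tree` are hypothetical names, and the slides do not say that the authors' implementation works exactly this way.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 64

typedef enum { OP_PUSH, OP_POP } OpKind;

typedef struct {
    OpKind kind;
    char sym;   /* symbol pushed (OP_PUSH only) */
    int count;  /* number of entries popped (OP_POP only) */
} StackOp;

typedef struct {
    char sym;
    int parent; /* index of the parent node, -1 for a root */
} TreeNode;

/* Replay the trace on a shadow stack of node indices. Entries removed by a
   pop become children of the node created by the next push. Returns the
   number of nodes written, or -1 if the trace is malformed or too large. */
int build_tree(const StackOp *ops, int n_ops, TreeNode *nodes, int max_nodes) {
    assert(ops != NULL && nodes != NULL);
    int shadow[MAX_DEPTH];
    int top = 0;      /* height of the shadow stack */
    int pending = 0;  /* popped entries still stored at shadow[top..] */
    int n_nodes = 0;

    for (int i = 0; i < n_ops; i++) {
        if (ops[i].kind == OP_POP) {
            if (ops[i].count < 0 || ops[i].count > top) return -1;
            top -= ops[i].count;       /* popped indices stay in the array */
            pending += ops[i].count;
        } else {
            if (n_nodes >= max_nodes || top >= MAX_DEPTH) return -1;
            nodes[n_nodes].sym = ops[i].sym;
            nodes[n_nodes].parent = -1;
            for (int k = 0; k < pending; k++)
                nodes[shadow[top + k]].parent = n_nodes;
            pending = 0;
            shadow[top++] = n_nodes++;
        }
    }
    return n_nodes;
}
```

Replaying the nine operations from the slide yields six nodes, with both `a` nodes under `A`, `b` under `B`, and `A` and `B` under `S`, which matches the parse tree S(A(a, a), B(b)).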

  22. Evaluation – Top down grammar Bad?

  23. Evaluation – Top down grammar

  24. Evaluation – Bottom up grammar Identical Node

  25. Performance Overhead 5X-45X 6X-8X

  26. Discussion
      • Grammar categories
        • Top down, bottom up, any others?
        • Possible to evade the control dependency structure in a top-down parser implementation.
      • Individual input
        • Multiple inputs → final grammar
      • Syntactic structure
        • Semantics

  27. Related Work
      • Network Protocol Format Reverse Engineering
        • Instruction semantics (comparison, loop keyword, delimiter)
          • Polyglot [CCS’07]
          • Automatic Network Protocol Analysis [NDSS’08]
          • Tupni [CCS’08]
        • Execution context (call stack, PC)
          • AutoFormat [NDSS’08]
      • Limitations
        • Part of the problem space
          • Only top-down parsers.
        • Part of the problem’s essence.
          • Comparison (predicate), call stack → control dependency

  28. Conclusion
      • Two dynamic analyses that construct input structure from program execution.
      • No source code access or symbolic information is required.
      • The analyses are highly effective and produce high-quality input syntax trees.

  29. Q & A
      • Thank you
      • Contact: {zlin,xyzhang}@cs.purdue.edu
