A Strategy for Efficient Crawling of Rich Internet Applications CASCON2010

Introduction – uOttawa SSRG team Guy-Vincent Jourdan, Ph.D., (uOttawa) Gregor v. Bochmann, Ph.D., (uOttawa) Iosif Viorel (Vio) Onut, Ph.D., (IBM) Emre Dincturk (PhD candidate) Kamara Benjamin, M.Sc. Suryakant Choudhary (Master candidate)

Agenda • (10 min) Project presentation • Vio Onut – Research Staff Member, CAS, IBM • (10 min) Product presentation • Jeff Turnham – Development Manager, Rational, IBM • (5 min) Problem statement • Jeff Turnham – Development Manager, Rational, IBM • Vio Onut – Research Staff Member, CAS, IBM • (10 min) High level algorithms • Emre Dincturk – Ph.D. candidate • (15 min) Experimental results • Guy-Vincent Jourdan, Prof., uOttawa • (10 min) Questions

Agenda • Project presentation • Vio Onut – Research Staff Member, CAS, IBM • Product presentation • Problem statement • High level algorithms • Experimental results • Questions

Project Presentation • Rational AppScan Enterprise • Rational Policy Tester • Rational AppScan Standard • Crawling Web 2.0 Applications

Project Presentation • Collaborators – IBM and uOttawa • Timeline • 2 years in • Jan ’10 RIA crawling strategy • Feb ’10 Crawling Ajax prototype – early version • May ’10 Crawling Ajax prototype – early version • July–August ’10 AppScan Enterprise prototype • Oct ’10 1st Master's thesis – Kamara • IBM is considering filing 5 patents

Project Presentation – working environment • Collaborators – IBM and uOttawa • Weekly meetings • Constant interaction • Students on IBM site to implement the technology

Where can you find us? • Technology Showcase • One of the 4 innovation impact projects • Website • http://ssrg.site.uottawa.ca

Agenda • Project presentation • Product presentation • Jeff Turnham – Development Manager, Rational, IBM • Problem statement • High level algorithms • Experimental results • Questions

IBM Security Solutions IBM Rational AppScan Enterprise Edition Product overview

IBM Rational AppScan Suite – Comprehensive Application Vulnerability Management Dynamic Analysis/Blackbox – Static Analysis/Whitebox - SECURITY CODE REQUIREMENTS BUILD PRE-PROD QA PRODUCTION AppScan Enterprise AppScan onDemand AppScan Reporting Console Security Requirements Definition AppScan Source AppScan Standard AppScan Tester AppScan Standard AppScan Build Security requirements defined before design & implementation Security & Compliance Testing, oversight, control, policy, audits Outsourced testing for security audits & production site monitoring Security / compliance testing incorporated into testing & remediation workflows Automate Security / Compliance testing in the Build Process Build security testing into the IDE Application Security Best Practices – Secure Engineering Framework

AppScan Enterprise Edition capabilities

AppScan Enterprise Workflows • Management • Review most common security issues • View trends • Assess risk • Evaluate progress • Development & QA • Conduct assessments • View assessment results • Remediate issues • Assign issue status • Compliance Officers • Review compliance reports AppScan Enterprise • Build automation • Source code analysis for security issues as part of build verification • Publish findings for remediation and trending • Information Security • Schedule and automate assessments • Conduct assessments with AppScan Standard and AppScan Source and publish findings for remediation and trending • Tools: • AppScan Standard Edition • AppScan Source Edition • Tools: • AppScan Source for Automation • AppScan Standard Edition CLI

View detailed security issues reports • Security Issues Identified with Static Analysis • Security Issues Identified with Dynamic Analysis • Aggregated and correlated results • Remediation Tasks • Security Risk Assessment

Obtain a high-level view of the security of your applications • Compare the number of issues across teams and applications • Identify top security issues and risks • View trending of the number of issues by severity over time • Monitor the progress of issue resolution

Assess regulatory compliance risk • Over 40 compliance reports, including: • The Payment Card Industry Data Security Standard (PCI) • VISA CISP • Children's Online Privacy Protection Act (COPPA) • Financial Services (GLBA) • Healthcare Services (HIPAA) • Sarbanes-Oxley Act (SOX)

Agenda • Project presentation • Product presentation • Problem statement • Jeff Turnham – Development Manager, Rational, IBM • Vio Onut – Research Staff Member, CAS, IBM • High level algorithms • Experimental results • Questions

Problem Statement • Challenges developing the AppScan family of products: • Application languages/frameworks are constantly evolving • Deep static or dynamic analysis involves heavy computation • Product challenge: • More clients are moving to RIAs (Rich Internet Applications) • Crawling/analyzing RIAs is challenging due to their dynamic nature

Problem Statement

Agenda • Project presentation • Product presentation • Problem statement • High level algorithms • Emre Dincturk – Ph.D. candidate • Experimental results • Questions

Strategy: The Methodology • To generate an efficient strategy, we need to anticipate how the application behaves, so we make some hypotheses about its behavior. • Once we have the anticipated model, we can generate an efficient (optimal) strategy that tries to confirm the model's validity. • If, during crawling, the actual model of the application deviates from the anticipated model, then the strategy and the anticipated model are updated. • We still assume that the parts that have not yet been crawled obey the initial hypotheses.

Strategy: The Anticipated Model We anticipate the model of the application based on 2 hypotheses: • The events {e1, e2, …, en} that are enabled at a state are pairwise independent (at a given state, executing a set of events in different orders leads to the same state). • When an event e is executed at state s, the set of events enabled at the reached state is the same as the set enabled at s, except for e. Using only these two hypotheses, our anticipated model becomes a hypercube.
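The two hypotheses can be turned into a small model builder. The sketch below is illustrative only and is not the prototype's code: the function name `anticipated_model` and the choice to label each state by the set of events still enabled (the labelling used on the later slides) are assumptions.

```python
from itertools import combinations


def anticipated_model(events):
    """Build the anticipated hypercube for the events enabled at a state.

    Each state is labelled by the frozenset of events still enabled.
    Executing event e at state s leads to s - {e} (second hypothesis),
    so the order of execution does not matter (first hypothesis).
    """
    events = sorted(events)
    # All subsets of the enabled events, largest first (initial state first).
    states = [frozenset(c)
              for r in range(len(events), -1, -1)
              for c in combinations(events, r)]
    # One transition per (state, enabled event) pair.
    transitions = {(s, e): s - {e} for s in states for e in s}
    return states, transitions


states, transitions = anticipated_model(["e1", "e2", "e3", "e4"])
top = states[0]
print(len(states), len(transitions))
# Independence: e1 then e2 reaches the same state as e2 then e1.
print(transitions[(transitions[(top, "e1")], "e2")]
      == transitions[(transitions[(top, "e2")], "e1")])
```

For 4 events this gives the 16 states and 32 transitions of a hypercube of dimension 4.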

Strategy: Hypercube • We aim for a strategy that will • visit every state using the least possible number of paths (chains) • execute every transition using the least possible number of paths (chains) • visit every state as soon as possible

Step 1: Minimal Chain Decomposition (MCD)
The minimal number of paths needed to cover every state of the hypercube can be obtained from an MCD of the hypercube. An MCD of a hypercube of dimension n contains C(n, ⌊n/2⌋) paths (6 paths for n = 4):
• {e1,e2,e3,e4}<{e2,e3,e4}<{e3,e4}<{e4}<{}
• {e1,e2,e3}<{e2,e3}<{e3}
• {e1,e2,e4}<{e2,e4}<{e2}
• {e1,e2}
• {e1,e3,e4}<{e1,e4}<{e1}
• {e1,e3}
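As a sanity check, the six chains listed on this slide can be verified mechanically. The sketch below does not compute an MCD; it only checks that the listed chains visit each of the 16 states exactly once and that each step executes exactly one event. The helper names `S` and `is_chain_decomposition` are hypothetical. The comparison value C(4, 2) = 6 comes from the standard result that the largest antichain of the subset lattice has C(n, ⌊n/2⌋) elements, so no decomposition can use fewer chains.

```python
from itertools import combinations
from math import comb


def S(*idx):
    """State labelled by the events still enabled, e.g. S(1, 2) is {e1, e2}."""
    return frozenset(f"e{i}" for i in idx)


# Chains copied from the slide (each chain is a path in the hypercube).
MCD_CHAINS = [
    [S(1, 2, 3, 4), S(2, 3, 4), S(3, 4), S(4), S()],
    [S(1, 2, 3), S(2, 3), S(3)],
    [S(1, 2, 4), S(2, 4), S(2)],
    [S(1, 2)],
    [S(1, 3, 4), S(1, 4), S(1)],
    [S(1, 3)],
]


def is_chain_decomposition(chains, n):
    """True if the chains partition all 2**n states into valid paths."""
    events = [f"e{i}" for i in range(1, n + 1)]
    all_states = {frozenset(c)
                  for r in range(n + 1)
                  for c in combinations(events, r)}
    visited = [s for chain in chains for s in chain]
    # Each step must remove exactly one event from the enabled set.
    steps_ok = all(b < a and len(a) - len(b) == 1
                   for chain in chains
                   for a, b in zip(chain, chain[1:]))
    no_repeats = len(visited) == len(set(visited))
    return steps_ok and no_repeats and set(visited) == all_states


print(is_chain_decomposition(MCD_CHAINS, 4))
print(len(MCD_CHAINS) == comb(4, 4 // 2))
```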

Step 2: Minimal Transition Coverage • 8 different complete chains • All transitions can be covered with 2 chains • In a hypercube of dimension n, it is [expression not preserved in the slide text] • “Finishing” the hypercube • Missing transitions must be crawled • However, there is no need to crawl every possible chain!

Step 2: Minimal Transition Coverage (MTC)
The upper chains are built one at a time, then the lower chains. Each MTC chain joins a lower chain to the upper chain that starts at the same middle state (listed in matching order below).
• Upper Chains
  • {e3,e4}<{e4}<{}
  • {e3,e4}<{e3}<{}
  • {e2,e4}<{e4}
  • {e2,e4}<{e2}<{}
  • {e1,e4}<{e4}
  • {e1,e4}<{e1}<{}
  • {e2,e3}<{e3}
  • {e2,e3}<{e2}
  • {e1,e3}<{e3}
  • {e1,e3}<{e1}
  • {e1,e2}<{e2}
  • {e1,e2}<{e1}
• Lower Chains
  • {e1,e2,e3,e4}<{e2,e3,e4}<{e3,e4}
  • {e1,e2,e3,e4}<{e1,e3,e4}<{e3,e4}
  • {e2,e3,e4}<{e2,e4}
  • {e1,e2,e3,e4}<{e1,e2,e4}<{e2,e4}
  • {e1,e3,e4}<{e1,e4}
  • {e1,e2,e4}<{e1,e4}
  • {e2,e3,e4}<{e2,e3}
  • {e1,e2,e3,e4}<{e1,e2,e3}<{e2,e3}
  • {e1,e3,e4}<{e1,e3}
  • {e1,e2,e3}<{e1,e3}
  • {e1,e2,e4}<{e1,e2}
  • {e1,e2,e3}<{e1,e2}
• MTC Chains
  • {e1,e2,e3,e4}<{e2,e3,e4}<{e3,e4}<{e4}<{}
  • {e1,e2,e3,e4}<{e1,e3,e4}<{e3,e4}<{e3}<{}
  • {e2,e3,e4}<{e2,e4}<{e4}
  • {e1,e2,e3,e4}<{e1,e2,e4}<{e2,e4}<{e2}<{}
  • {e1,e3,e4}<{e1,e4}<{e1}<{}
  • {e1,e2,e4}<{e1,e4}<{e4}
  • {e2,e3,e4}<{e2,e3}<{e2}
  • {e1,e2,e3,e4}<{e1,e2,e3}<{e2,e3}<{e3}
  • {e1,e3,e4}<{e1,e3}<{e3}
  • {e1,e2,e3}<{e1,e3}<{e1}
  • {e1,e2,e4}<{e1,e2}<{e2}
  • {e1,e2,e3}<{e1,e2}<{e1}
• MCD Chains
  • {e1,e2,e3,e4}<{e2,e3,e4}<{e3,e4}<{e4}<{}
  • {e1,e2,e3}<{e2,e3}<{e3}
  • {e1,e2,e4}<{e2,e4}<{e2}
  • {e1,e2}
  • {e1,e3,e4}<{e1,e4}<{e1}
  • {e1,e3}
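The final MTC chains can be checked against the hypercube of dimension 4. The sketch below is a verification aid, not the construction algorithm from the talk: it joins each lower chain to the upper chain at the same position, then checks that the 12 resulting chains execute all 32 transitions exactly once. The helper `S` and the variable names are hypothetical. The lower bound used at the end rests on one observation: a chain passes through each level of the hypercube at most once, so at least as many chains are needed as there are transitions leaving the busiest level.

```python
from math import comb


def S(*idx):
    """State labelled by the events still enabled, e.g. S(3, 4) is {e3, e4}."""
    return frozenset(f"e{i}" for i in idx)


# Lower and upper chains copied from the slide, in the same order.
LOWER = [
    [S(1, 2, 3, 4), S(2, 3, 4), S(3, 4)], [S(1, 2, 3, 4), S(1, 3, 4), S(3, 4)],
    [S(2, 3, 4), S(2, 4)], [S(1, 2, 3, 4), S(1, 2, 4), S(2, 4)],
    [S(1, 3, 4), S(1, 4)], [S(1, 2, 4), S(1, 4)],
    [S(2, 3, 4), S(2, 3)], [S(1, 2, 3, 4), S(1, 2, 3), S(2, 3)],
    [S(1, 3, 4), S(1, 3)], [S(1, 2, 3), S(1, 3)],
    [S(1, 2, 4), S(1, 2)], [S(1, 2, 3), S(1, 2)],
]
UPPER = [
    [S(3, 4), S(4), S()], [S(3, 4), S(3), S()],
    [S(2, 4), S(4)], [S(2, 4), S(2), S()],
    [S(1, 4), S(4)], [S(1, 4), S(1), S()],
    [S(2, 3), S(3)], [S(2, 3), S(2)],
    [S(1, 3), S(3)], [S(1, 3), S(1)],
    [S(1, 2), S(2)], [S(1, 2), S(1)],
]

# Join each pair at the shared middle state (kept once in the result).
pairs_match = all(lo[-1] == up[0] for lo, up in zip(LOWER, UPPER))
MTC = [lo + up[1:] for lo, up in zip(LOWER, UPPER)]

n = 4
covered = [(a, b) for chain in MTC for a, b in zip(chain, chain[1:])]
total_transitions = n * 2 ** (n - 1)  # each of the 2**n states has |s| exits
# States with k enabled events have k outgoing transitions each.
lower_bound = max(comb(n, k) * k for k in range(n + 1))

print(pairs_match)
print(len(covered) == len(set(covered)) == total_transitions)
print(len(MTC) == lower_bound)
```

Under these assumptions the 12 chains meet the lower bound for n = 4, so no smaller set of chains could cover every transition.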

Step 3: Handling Discrepancies Our theoretical view of the states as a perfect hypercube might not reflect the reality of the crawled RIA. When executing an event, we may observe a combination of: • Unexpected split: we expect to reach a state that has already been visited, but the state actually reached is new. • Unexpected merge: we unexpectedly reach a known state. • Appearing events: events that we did not expect are enabled at the reached state. • Disappearing events: events that we expected are not enabled at the reached state.
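The four kinds of discrepancy can be told apart by comparing what the anticipated model predicted with what the crawler observed. The following is a minimal sketch under stated assumptions: states are compared by opaque identifiers (how the crawler decides that two states are the same is not covered here), and the function name and parameters are hypothetical rather than taken from the prototype.

```python
def classify_discrepancy(expected_state, expected_events,
                         actual_state, actual_events, known_states):
    """Return the set of discrepancy kinds seen after executing one event.

    expected_*   : what the anticipated (hypercube) model predicted
    actual_*     : what the crawler observed
    known_states : identifiers of the states visited so far
    """
    kinds = set()
    if actual_state != expected_state:
        if expected_state in known_states and actual_state not in known_states:
            # Expected an already visited state, reached a new one.
            kinds.add("unexpected split")
        elif actual_state in known_states:
            # Reached a known state that the model did not predict.
            kinds.add("unexpected merge")
    if set(actual_events) - set(expected_events):
        kinds.add("appearing events")
    if set(expected_events) - set(actual_events):
        kinds.add("disappearing events")
    return kinds


result = classify_discrepancy(
    expected_state="s3", expected_events={"e2"},
    actual_state="s9", actual_events={"e2", "e5"},
    known_states={"s0", "s3"},
)
print(sorted(result))  # ['appearing events', 'unexpected split']
```

An empty result means the executed event behaved as the anticipated model predicted.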

Step 3: Handling Discrepancies • Replacing the problematic chains (those having the same prefix) • Try to find an alternative prefix to the first state s_k, i+1 ≤ k ≤ n • If there is one, a new chain that executes the suffix should be added to the strategy • The problematic chains should be removed • Generate a new strategy for the reached state if it is new • We still consider the hypotheses valid for the unexplored part • Hence the anticipated model for the reached state is another hypercube, and the strategy can be generated accordingly
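The chain replacement step can be sketched as follows. This is one possible reading of the bullets above, not the authors' implementation. It assumes that a chain is a list of state identifiers, that a chain is "problematic" when it passes through the transition that misbehaved, and that a caller-supplied `find_prefix` function (hypothetical) returns a still-valid path ending at a given state, or `None` when there is none.

```python
def replace_problematic_chains(strategy, broken, find_prefix):
    """Drop chains that use the broken transition and re-route their suffixes.

    strategy    : list of chains, each a list of state identifiers
    broken      : (s_i, s_next) transition that did not behave as anticipated
    find_prefix : state -> alternative path ending at that state, or None
    """
    kept, added = [], []
    for chain in strategy:
        steps = list(zip(chain, chain[1:]))
        if broken not in steps:
            kept.append(chain)  # unaffected chain stays in the strategy
            continue
        i = steps.index(broken)  # the chain breaks between s_i and s_(i+1)
        # Look for the first later state s_k that is reachable another way.
        for k in range(i + 1, len(chain)):
            prefix = find_prefix(chain[k])
            if prefix is not None:
                # New chain: alternative prefix, then the remaining suffix.
                added.append(prefix + chain[k + 1:])
                break
        # The problematic chain itself is not kept.
    return kept + added


strategy = [["A", "B", "C", "D"], ["A", "X", "Y"]]
alternatives = {"C": ["A", "X", "C"]}  # hypothetical known detour to state C
new_strategy = replace_problematic_chains(strategy, ("A", "B"), alternatives.get)
print(new_strategy)  # [['A', 'X', 'Y'], ['A', 'X', 'C', 'D']]
```

If no later state of a problematic chain has an alternative prefix, the chain is simply removed, which matches the last bullet of the replacement procedure.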

Agenda • Project presentation • Product presentation • Problem statement • High level algorithms • Experimental results • Guy-Vincent Jourdan, Prof. uOttawa • Questions

Experimental results A prototype of the RIA crawling method has been implemented, and some experiments have been run against test web sites. For each experiment, we present: • Uncovered models: • The actual model of the RIA • The model uncovered by our method • A model uncovered without our method, and with CrawlJax • A comparison between our method and breadth-first and depth-first crawling: • Rate of state discovery • Number of resets required to find the states • Rate of transition discovery • Number of resets required to find the transitions

Websites http://localhost/ajax/

Uncovered Models: Discovered States

Efficiency comparison: perfect hypercube

Efficiency comparison: perfect hypercube

Efficiency comparison: non-hypercube 1

Efficiency comparison: non-hypercube 1
