
Semi-Automated Discovery of Application Session Structure

Jayanthkumar Kannan (Berkeley), Jaeyeon Jung (Mazu Networks), Vern Paxson (Berkeley), Can Emre Koksal (EPFL). ACM Internet Measurement Conference 2006.


Presentation Transcript


  1. Semi-Automated Discovery of Application Session Structure Jayanthkumar Kannan (Berkeley), Jaeyeon Jung (Mazu Networks), Vern Paxson (Berkeley), Can Emre Koksal (EPFL) ACM Internet Measurement Conference 2006

  2. Outline • Introduction • Background • Session Extraction • Structure Abstraction • Results • Conclusion & Comments Speaker: Li-Ming Chen

  3. Network traffic analysis • Previous work has extensively examined network behavior at the level of packets and connections: • Dynamics, self-similarity • Packet delays and losses • Connection characteristics at different sites • Transport behavior, structural analysis • Applications: traffic engineering, capacity planning, anomaly detection • What about session-level analysis?

  4. Understanding traffic at a higher level - Session-level analysis • Comparatively, the structure of user-initiated sessions remains much less explored • Sessions = application sessions • Defined as a group of connections associated with a single network task (a response to a user event) • What could be considered an application session? • Applications with pre-specified forms (e.g., FTP sessions) • More types of sessions: • User behavior (e.g., Web surfing, sending e-mail) • Anomalies, misconfiguration • Malicious activities (e.g., botnets)

  5. Results/Examples • FTP session • Or imagine: • Logging into a website and listening to online music • A botnet zombie receiving instructions from its master and proceeding to act on them

  6. Benefits of session level analysis • For the researchers • Aid with traffic characterization and monitoring • Provide a foundation for forming source models • Descriptions of network activity in terms of what a source is attempting to achieve using the network • Aid with anomaly detection • For the administrators • Track application use in their network at a higher level • Provide richer information for framing network policies • Anomaly detection

  7. Problem & Goals • Mine a connection-level trace • Derive session descriptors (abstract descriptions of the session structure) for the different applications present in the trace • Without any a priori knowledge about the applications • Deduce descriptors that convey the qualitative structure to analysts • Express these descriptors as • Regular expressions • Deterministic finite automata (DFAs) • The descriptors focus on the order, type, and directionality of the connections, but not on their inter-arrival timing

  8. Approach (the concept) • Pipeline: connection-level traffic trace -> Session Extraction -> Structure Abstraction -> session descriptors • Session Extraction: • Reduces a stream of connections down to a stream of sessions • (Observation) connections belonging to the same session tend to occur “close” to one another • Models the temporal characteristics of session arrivals • Structure Abstraction: • Attempts to infer succinct session descriptors for each application • Simplifies the raw descriptions to a generalized form • Provides complexity-coverage curves to represent the trade-off between economy of expression and more detailed fidelity

  9. Outline • Introduction • Background • Session Extraction • Structure Abstraction • Results • Conclusion & Comments

  10. Dataset • Connection-level traces collected at the border of LBNL • 1-month trace, about 2700K connections per day • 1st half: used to develop and calibrate the model • 2nd half: the model is applied to infer descriptors for about 40 different applications, including: • Content transfer (SMTP, FTP, HTTP) • Remote access (SSH, Telnet) • Database (OracleSQL, MySQL) • P2P (BitTorrent) • Mapping, authentication, remote desktop, etc. • How to evaluate? Based on the protocol specification or on human inspection

  11. Terminology • Connection C: • Denoted by (proto, dir, remote-host, local-host, start-time, duration) • proto: the destination port • dir: incoming or outgoing connection • Type of a connection T(C): • Defined as (proto, dir) • Session S = (C1, C2, …, Cn): • A sequence of connections involving only a single local host and a single remote host • Application A(S): • The application associated with a session S, defined as T(C1) • A session S belongs to the session type ST(S) = (T1, T2, …, Tn) • For all i ≤ n, Ti = T(Ci)
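The terminology on this slide maps naturally onto a small data model. The following is a minimal sketch, not the authors' code: the class name `Connection`, the field names, and the string encoding of `proto` and `direction` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A connection type T(C) is the pair (proto, dir).
ConnType = Tuple[str, str]


@dataclass(frozen=True)
class Connection:
    """One connection C; field names are hypothetical."""
    proto: str        # application protocol, identified by destination port
    direction: str    # "in" or "out"
    remote_host: str
    local_host: str
    start_time: float
    duration: float

    def conn_type(self) -> ConnType:
        """T(C) = (proto, dir)."""
        return (self.proto, self.direction)


def session_type(session: List[Connection]) -> Tuple[ConnType, ...]:
    """ST(S) = (T(C1), ..., T(Cn))."""
    return tuple(c.conn_type() for c in session)


def application(session: List[Connection]) -> ConnType:
    """A(S) = T(C1), the type of the first connection."""
    return session[0].conn_type()
```

For an FTP control connection followed by an incoming data connection, `session_type` would return `(("ftp", "out"), ("data", "in"))` and `application` would return `("ftp", "out")`.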

  12. Types of Sessions • Singleton • A lone connection by itself • Homogeneous sessions • Sessions consisting of consecutive invocations of the same application protocol, all with the same directionality (i.e., the same connection type) • Mixed sessions • Sessions involving different connection types • Sessions involving multiple remote hosts are left as future work

  13. Applications vs. Types of Sessions • Different applications vary widely in how prevalent each of these types of session structure is • E.g., • LDAP (mapping): 11% singleton, 88% homogeneous • SSH (remote access): 80% singleton, 18% homogeneous • GridFTP (content transfer): 58% singleton, 42% mixed • About half of the 40 applications involve more complex structure

  14. Outline • Introduction • Background • Session Extraction • Structure Abstraction • Results • Conclusion & Comments

  15. Session extraction • Problem: • Given a stream of connections, parse and reduce it into a stream of application-level sessions • When observing a new connection Ci, the algorithm must decide: • (a) Is Ci part of a current session? • (b) Does Ci represent the beginning of a new session? • Observation/Assumption: • The connections in a session are causally related • Such connections tend to occur “close” to each other

  16. 1. Extracting homogeneous sessions (the aggregation rule) • Consider connections less than a time Taggreg apart as part of the same session [24] • For Ci and an already existing active session Sj • Sj = (C1j, …, Cnj) and A(Sj) ≡ T(C1j) = T(Ci) • If Cnj arrived less than Taggreg before Ci’s arrival, then we consider Ci part of Sj • What about connections involving a different proto, or somewhat further apart? • (Figure: timeline showing Sj = (C1j, C2j, …, Cnj), the new connection Ci, and the interval Taggreg) [24] C. Nuzman, I. Saniee, W. Sweldens, and A. Weiss, “A compound model for TCP connection arrivals for LAN and WAN applications,” Computer Networks, 2002.
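The aggregation rule can be sketched as a single pass over time-ordered connections. This is an illustrative sketch only: the tuple layout of `Conn`, the function name `extract_homogeneous`, and the choice to measure the gap between connection start times are assumptions not stated on the slide. The default of 100 seconds matches the Taggreg value listed with the parameters later in the deck.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (proto, direction, remote_host, local_host, start_time)
Conn = Tuple[str, str, str, str, float]

T_AGGREG = 100.0  # seconds


def extract_homogeneous(conns: List[Conn],
                        t_aggreg: float = T_AGGREG) -> List[List[Conn]]:
    """Group connections of the same type and host pair that start
    less than t_aggreg seconds after the previous one."""
    sessions: List[List[Conn]] = []
    # Most recent session per (proto, direction, remote_host, local_host).
    latest: Dict[Tuple[str, str, str, str], List[Conn]] = {}
    for c in sorted(conns, key=lambda c: c[4]):
        key = c[:4]
        current = latest.get(key)
        if current is not None and c[4] - current[-1][4] < t_aggreg:
            current.append(c)          # Ci joins the active session Sj
        else:
            current = [c]              # Ci starts a new session
            sessions.append(current)
            latest[key] = current
    return sessions
```

Connections of a different type never match the key, so they always open a new session here; deciding whether they belong to an existing session is the job of the causality test on the following slides.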

  17. 2. Extracting mixed sessions • Attempt to assess possible causality • For Ci and an already existing active session Sk • Sk = (C1k, …, Cmk) and A(Sk) ≡ T(C1k) ≠ T(Ci) (the connection types are not exactly the same) • Try to determine whether Ci is a “triggered” connection of C1k • Based on the observation that, if Ci is causally related to Sk, its arrival is likely to be “closer” to Sk than it would be if Ci were a normal (unrelated) connection • (Approach) A statistical test: • Identifies pairs of causally linked connections • Builds a base model of what is “normal”, and flags deviations • Uses a null-hypothesis test

  18. 2. Extracting mixed sessions (causality detection algorithm) • On the arrival of a connection C of type T involving a local host L: • Let the sessions observed at L in the previous Ttrigger (500 seconds) be S1, S2, …, Sn • First check the aggregation rule and, if it applies, simply aggregate C into the most recent homogeneous session Si • Estimate the rate of connection arrivals at L for each session type within the past Trate (3600 seconds) • For 1 ≤ i ≤ n, compute P[Ti, T, xi], where xi is the interval between the arrival of Si and C • If P[Ti, T, xi] < α and C and Si involve the same remote host, then add C to Si • Else C is considered to be the 1st connection of a new session Sn+1
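The steps above can be sketched roughly as follows. This is a simplified illustration under several assumptions that the slides do not confirm: the names `Conn`, `Session`, `coincidence_prob`, and `assign` are hypothetical; P[Ti, T, x] is approximated by the Poisson probability that at least one unrelated type-T connection arrives within x seconds; the rate is a plain count over the past Trate seconds; and the homogeneous aggregation step is omitted.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

T_TRIGGER = 500.0   # seconds
T_RATE = 3600.0     # seconds
ALPHA = 0.1


@dataclass
class Conn:
    ctype: Tuple[str, str]   # (proto, dir)
    remote: str
    start: float


@dataclass
class Session:
    conns: List[Conn] = field(default_factory=list)

    @property
    def start(self) -> float:
        return self.conns[0].start

    @property
    def remote(self) -> str:
        return self.conns[0].remote


def coincidence_prob(rate: float, x: float) -> float:
    """Assumed null model: chance of at least one Poisson arrival in x seconds."""
    return 1.0 - math.exp(-rate * x)


def assign(conn: Conn, sessions: List[Session], history: List[Conn]) -> Session:
    """Attach conn to a recent session at the same local host, or open a new one.

    sessions: sessions seen at this local host, in arrival order.
    history:  earlier connections seen at this local host.
    """
    # Estimated arrival rate of connections of conn's type over the past T_RATE.
    recent = [h for h in history
              if h.ctype == conn.ctype and 0 <= conn.start - h.start <= T_RATE]
    rate = len(recent) / T_RATE
    for s in reversed(sessions):            # most recent session first
        x = conn.start - s.start
        if x > T_TRIGGER:
            continue                        # outside the trigger window
        if s.remote == conn.remote and coincidence_prob(rate, x) < ALPHA:
            s.conns.append(conn)            # too close to be a coincidence
            return s
    new_session = Session([conn])
    sessions.append(new_session)
    return new_session
```

With a quiet host (no recent connections of the same type) the estimated rate is zero, so any connection to the same remote host inside the trigger window is attached. On a busy host the same gap is easily explained by chance and the connection starts a new session.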

  19. 2. Extracting mixed sessions (causality detection algorithm) (cont’d) • (Empirically known fact) The arrival model is often roughly stationary Poisson over hourly periods • Identify connections whose arrivals deviate from this model as triggered connections • Arrival process of unrelated (normal) connections = union of independent Poisson processes • Very close coincidental arrivals are very rare • Therefore: arrivals that are close are likely related, i.e., part of the same session • P[T1, T2, x] is the probability that two sessions have an arrival within time x • If P[..] < α, declare C1, C2 to be in the same session • (Figure: an FTP connection C1 (rate λ1) and an HTTP connection C2 (rate λ2) separated by inter-arrival time x, which might be longer than Taggreg)
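The slide does not give a closed form for P[T1, T2, x]. As a hedged illustration only, assume that unrelated connections of type T2 arrive as a Poisson process with rate λ2 and that P depends on λ2 alone; the paper's exact expression may differ. The standard Poisson result then gives:

```latex
\begin{align}
P[T_1, T_2, x] &= 1 - e^{-\lambda_2 x} \approx \lambda_2 x
  \quad \text{when } \lambda_2 x \ll 1, \\
P[T_1, T_2, x] < \alpha
  &\iff x < -\frac{\ln(1 - \alpha)}{\lambda_2}.
\end{align}
```

For example, with λ2 = 1 connection per hour (1/3600 per second) and x = 10 s, P ≈ 0.0028, well below α = 0.1, so the pair would be declared part of the same session. Under this assumed model the test passes for any gap shorter than about 379 s, which shows how a triggered connection can lie further apart than Taggreg and still be linked.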

  20. Outline • Introduction • Background • Session Extraction • Structure Abstraction • Results • Conclusion & Comments

  21. Structure abstraction • Derive succinct descriptions of application sessions based on the set of session types (ST) reported by Session Extraction • Use regular expressions and DFAs to represent an application session • Good balance between expressiveness and ease of generation • Further refine this representation by labeling state transitions with probabilities • Avoid false positives

  22. Exact DFA vs. “Nature” DFA • Exact FTP DFA: • (Naïve approach) Simply build a DFA that matches the list of all the observed sessions • More complex because it has to completely capture several FTP sessions • Nature FTP DFA: • A more traceable DFA for FTP • Benefits: simplicity, generalization, highlighting common behavior, minimizing false positives

  23. Structure Abstraction Framework • Pipeline: connection-level traffic trace -> Session Extraction -> Structure Abstraction (4 steps) -> session descriptors • Semi-automatic • Lack of ground truth • Categorize sessions based on the server port of the 1st connection • Construct the exact DFA E from the union of all observed session types (ST)

  24. Step 3: Coverage Phase • Given the exact DFA E • Aim to extract a set of DFAs that capture subsets of the observed session behavior • Find the best trade-off between simplicity of expression (fewest states/edges) and coverage (capturing most types of behavior) • A greedy algorithm: DFA E -> DFAs F1, F2, …, Fn • Feed every session instance in ST to E • Compute the hit count h(e) for every edge e • Next, compute the augmented hit count h’(e) = Σ h(e’), summed over all edges e’ reachable from e • Order the edges by decreasing h’(e), denoted e1, e2, … • Construct each DFA Fi by taking the union of the edges e1, …, ei
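The greedy ordering can be sketched on a simple representation of the exact DFA. The sketch below assumes that E is a prefix tree (each state is the sequence of connection types seen so far) and that "reachable from e" includes e itself; neither detail is stated on the slide, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Symbol = str
# An edge is (source state, label); a state is the prefix consumed so far.
Edge = Tuple[Tuple[Symbol, ...], Symbol]


def hit_counts(instances: Sequence[Sequence[Symbol]]) -> Counter:
    """h(e): number of session instances that traverse edge e."""
    h: Counter = Counter()
    for inst in instances:
        for i, sym in enumerate(inst):
            h[(tuple(inst[:i]), sym)] += 1
    return h


def augmented_hit_counts(h: Counter) -> Counter:
    """h'(e): sum of h(e') over edges e' reachable from e (e included)."""
    h_aug: Counter = Counter()
    for (prefix, sym), count in h.items():
        path = prefix + (sym,)
        # In a prefix tree, e' is reachable from every edge on its root path.
        for i in range(len(path)):
            h_aug[(path[:i], path[i])] += count
    return h_aug


def coverage_dfas(instances: Sequence[Sequence[Symbol]]) -> List[List[Edge]]:
    """Return edge sets F1, F2, ... where Fi holds the top-i edges by h'."""
    h_aug = augmented_hit_counts(hit_counts(instances))
    ordered = sorted(h_aug, key=lambda e: (-h_aug[e], e))
    return [ordered[:i] for i in range(1, len(ordered) + 1)]
```

Because an edge's augmented count includes its own hits plus those of everything below it, a parent edge always sorts ahead of its children under these assumptions, so every Fi is a connected sub-automaton rooted at the start state.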

  25. Step 4: Generalization Phase • Generalize F1, F2, … through a set of transformations into generalized DFAs G1, G2, … • 3 workable generalization rules: • Prefix Rule: STi in trace -> all prefixes of STi • Counting Rule: (aBc) & (aB^n c) in trace -> (aB+c) • Invert Direction Rule: STi in trace -> invert(STi) • (Figure: example FTP DFA with edges labeled ftp_in, ftp_out, data_in, data_out) Refer to author’s slides
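The three rules can be illustrated on session types written as tuples of labels such as `ftp_out`. This is a simplified sketch, not the authors' procedure: it works on session types rather than on DFAs, the function names are hypothetical, the `+` suffix is an ad hoc marker for "one or more", and the counting rule is restricted to a repeated block B consisting of a single connection type (the deck lists |B| = 2 among its parameters, which this sketch does not handle).

```python
from typing import Set, Tuple

SessionType = Tuple[str, ...]


def prefix_rule(st: SessionType) -> Set[SessionType]:
    """All non-empty prefixes of an observed session type."""
    return {st[:i] for i in range(1, len(st) + 1)}


def invert(st: SessionType) -> SessionType:
    """Swap the direction of every connection (labels look like 'ftp_out')."""
    swap = {"in": "out", "out": "in"}
    inverted = []
    for sym in st:
        proto, direction = sym.rsplit("_", 1)
        inverted.append(f"{proto}_{swap[direction]}")
    return tuple(inverted)


def counting_rule(observed: Set[SessionType]) -> Set[SessionType]:
    """If both a B c and a B^n c (n >= 2) were observed, emit a B+ c."""
    patterns: Set[SessionType] = set()
    for st in observed:
        i = 0
        while i < len(st):
            j = i
            while j < len(st) and st[j] == st[i]:
                j += 1                       # st[i:j] is a run of one symbol
            if j - i >= 2:
                base = st[:i + 1] + st[j:]   # the same type with a single B
                if base in observed:
                    patterns.add(st[:i] + (st[i] + "+",) + st[j:])
            i = j
    return patterns
```

For instance, observing both `("ftp_out", "data_in")` and `("ftp_out", "data_in", "data_in", "data_in")` yields the pattern `("ftp_out", "data_in+")`.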

  26. Outline • Introduction • Background • Session Extraction • Structure Abstraction • Results • Conclusion & Comments • Parameters: • Taggreg = 100 sec • Ttrigger = 500 sec • Trate = 1 hr • Threshold α = 0.1 • Tservice ≥ 5 • Counting rule |B| = 2 • Only feed session types of length ≤ 10

  27. FTP session structures (content transfer) • Coverage of Gi: the fraction of session types in ST accepted by Gi, weighted by the frequency with which each type occurs • Gi may have more or fewer than i edges • (Figure: coverage curve with annotated points “2: singleton” and “4”; DFA with 4 edges)
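The frequency-weighted coverage described on this slide amounts to a short computation. In the sketch below, the function name `coverage` and the `accepts` callback (standing in for "Gi accepts this session type") are illustrative assumptions.

```python
from collections import Counter
from typing import Callable, Tuple

SessionType = Tuple[str, ...]


def coverage(type_counts: Counter,
             accepts: Callable[[SessionType], bool]) -> float:
    """Fraction of observed sessions whose type is accepted,
    weighting each session type by how often it occurs."""
    total = sum(type_counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for st, n in type_counts.items() if accepts(st))
    return covered / total
```

As a usage example, if 90 sessions are singletons and 10 have two connections, a descriptor that accepts only singletons has a coverage of 0.9.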

  28. FTP session structure (cont’d) • DFA: 8 edges • Single data transfer in the opposite direction • DFA: 8 edges, but fewer actual edges • DFA: 10 edges • HTTP connections can occur during FTP sessions • DFA: 18 edges • Coverage: 99%

  29. Timbuktu session structures (remote desktop) • 2: Singleton > 90% • Others < 10% • (Figure: DFA with 4 edges)

  30. Timbuktu session structures (cont’d) DFA: 10 edges

  31. HTTP session structure (content transfer) • DFA: 30 edges • (To save space, only sessions that begin with an outgoing HTTP connection are shown) • More complex; ~99% are singletons or aggregated sessions that reflect successive retrieval of multiple pages from the same server

  32. Finding Attacks Using Anomaly Detection • One goal is to detect network attacks by finding sessions that deviate from established session structures. • Such deviations would reflect unintended misconfigurations, scanning, or “phone home” connections associated with compromises.

  33. Outline • Introduction • Background • Session Extraction • Structure Abstraction • Results • Conclusion & Comments

  34. Conclusion • Session extraction • A statistical technique to extract application sessions from a connection-level trace of network activity • Structure abstraction • A method to deduce descriptors that an analyst can use to capture the qualitative structure of such sessions • The results show that the proposed method works well for many of the applications in the trace • Future work: • Evaluate/validate the proposed method on more applications • Extend the method to support single-to-multiple host sessions • Try to collate descriptors for closely related protocols

  35. Comments • This method statistically correlates connections by observing connection-level traffic traces • Might not be suitable for a complex environment • What if packet-level traces can be acquired? • Surprisingly, a particular application can manifest various session structures • The session structures in this paper will help to detect host-based anomalies • Single-to-multiple host sessions might be more helpful for observing/identifying worm-like activities
