Semi-Automated Discovery of Application Session Structure. Jayanthkumar Kannan (Berkeley) , Jaeyeon Jung (Mazu Networks) , Vern Paxson (Berkeley) , Can Emre Koksal (EPFL) ACM Internet Measurement Conference 2006. Outline. Introduction Background Session Extraction Structure Abstraction
Semi-Automated Discovery of Application Session Structure Jayanthkumar Kannan (Berkeley), Jaeyeon Jung (Mazu Networks), Vern Paxson (Berkeley), Can Emre Koksal (EPFL) ACM Internet Measurement Conference 2006
Outline • Introduction • Background • Session Extraction • Structure Abstraction • Results • Conclusion & Comments Speaker: Li-Ming Chen
Network traffic analysis
• Previous work has extensively examined network behavior at the level of packets and connections.
• Dynamics, self-similarity
• Packet delays and losses
• Connection characteristics at different sites
• Transport behavior, structural analysis
• Applications: traffic engineering, capacity planning, anomaly detection
• What about session-level analysis?
Understanding traffic at a higher level - Session level analysis
• Comparatively, the structure of user-initiated sessions remains much less explored
• Sessions: application sessions
• Defined as a group of connections associated with a single network task (a response to a user event!)
• What could be considered an application session?
• Applications that have pre-specified forms (e.g., FTP sessions)
• More types of sessions:
• User behavior (e.g., Web surfing, sending e-mail)
• Anomalies, misconfiguration
• Malicious activities (e.g., botnets)
Results/Examples
• FTP session
• Or imagine:
• Logging into a website and listening to music online
• A botnet zombie receiving instructions from its master and then acting on them
Benefits of session level analysis
• For researchers
• Aid with traffic characterization and monitoring
• Provide a foundation for forming source models
• Descriptions of network activity in terms of what a source is attempting to achieve using the network
• Aid with anomaly detection
• For administrators
• Track application use in their network at a higher level
• Provide richer information for framing network policies
• Anomaly detection
Problem & Goals
• Mine a connection-level trace
• Derive session descriptors (abstract descriptions of the session structure) for the different applications present in the trace
• Without any a priori knowledge about the application
• Deduce descriptors that give analysts a qualitative view of the structure
• Express these descriptors as
• Regular expressions
• Deterministic finite automata (DFAs)
• The expressions focus on the order, type, and directionality of the connections, but not on their inter-arrival timing!
Approach (the concept)
[Figure: pipeline from connection-level traffic trace through Session Extraction and Structure Abstraction to session descriptors]
Session Extraction
• Reduces a stream of connections down to a stream of sessions
• (Observation) connections belonging to the same session tend to occur “close” to one another
• Models the temporal characteristics of session arrivals
Structure Abstraction
• Attempts to infer succinct session descriptors for each application
• Simplifies the raw descriptions to a generalized form
• Provides complexity-coverage curves to represent the trade-off between economy of expression and more detailed fidelity
Outline • Introduction • Background • Session Extraction • Structure Abstraction • Results • Conclusion & Comments
Dataset
• Connection-level traces collected at the border of LBNL
• 1-month trace, about 2700K connections per day
• 1st half: used to develop and calibrate the model
• 2nd half: used to apply the model and infer descriptors for about 40 different applications, including:
• Content transfer (SMTP, FTP, HTTP)
• Remote access (SSH, Telnet)
• Database (OracleSQL, MySQL)
• P2P (BitTorrent)
• Mapping, authentication, remote desktop, etc.
• How to evaluate? Against the protocol specification or by human inspection
Terminology
• Connection C:
• Denoted by (proto, dir, remote-host, local-host, start-time, duration)
• proto: destination port X
• dir: incoming or outgoing connection
• Type of a connection T(C):
• Defined as (proto, dir)
• Session S = (C1, C2, …, Cn)
• A sequence of connections involving only a single local-host and a single remote-host
• Application A(S):
• Associated with a session S as T(C1)
• A session S belongs to the session type ST(S) = (T1, T2, …, Tn)
• For all i ≤ n, Ti = T(Ci)
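The definitions above can be written down as a small data model. This is only an illustrative sketch: the class name, field names, helper names, and the example addresses are assumptions made here, not something taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

ConnType = Tuple[str, str]  # (proto, dir)


@dataclass(frozen=True)
class Connection:
    proto: str        # application protocol, identified by destination port
    direction: str    # "in" or "out"
    remote_host: str
    local_host: str
    start_time: float  # seconds
    duration: float    # seconds


def conn_type(c: Connection) -> ConnType:
    """T(C) = (proto, dir)."""
    return (c.proto, c.direction)


def application(session: List[Connection]) -> ConnType:
    """A(S) = T(C1): the type of the first connection in the session."""
    return conn_type(session[0])


def session_type(session: List[Connection]) -> Tuple[ConnType, ...]:
    """ST(S) = (T(C1), ..., T(Cn))."""
    return tuple(conn_type(c) for c in session)


# Hypothetical FTP session: a control connection, then one data connection.
ftp_session = [
    Connection("ftp", "out", "198.51.100.7", "10.0.0.5", 0.0, 30.0),
    Connection("ftp-data", "in", "198.51.100.7", "10.0.0.5", 2.0, 5.0),
]
```

With this model, the application of `ftp_session` is the type of its first connection, and its session type is the ordered tuple of both connection types.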
Types of Sessions
• Singleton
• A lone connection by itself
• Homogeneous sessions
• Sessions consisting of consecutive invocations of the same application protocol, all with the same directionality
• -> same connection type!
• Mixed sessions
• Sessions involving different connection types
• Sessions involving multiple remote hosts: future work
Applications vs. Types of Sessions
• Applications vary widely in how prevalent each of these types of session structure is
• E.g.,
• LDAP (mapping): 11% singleton, 88% homogeneous
• SSH (remote access): 80% singleton, 18% homogeneous
• GridFTP (content transfer): 58% singleton, 42% mixed
• About half of the 40 applications involve more complex structure
Outline • Introduction • Background • Session Extraction • Structure Abstraction • Results • Conclusion & Comments
[Figure: pipeline from connection-level traffic trace through Session Extraction and Structure Abstraction to session descriptors]
Session extraction
• Problem:
• Given a stream of connections,
• Parse and reduce it into a stream of application-level sessions
• When observing a new connection Ci, the algorithm must decide:
• (a) Is Ci part of a current session?
• (b) Does Ci represent the beginning of a new session?
• Observation/Assumption:
• The connections in a session are causally related
• Such connections tend to occur “close” to each other
1. Extracting homogeneous sessions (the aggregation rule)
• Consider connections less than a time T_aggreg apart as part of the same session [24]
• For Ci and an existing active session Sj
• Sj = (C_1^j, …, C_n^j) and A(Sj) ≡ T(C_1^j) = T(Ci)
• If C_n^j arrived less than T_aggreg before Ci’s arrival, then we consider Ci part of Sj
• What about connections involving a different proto, or ones somewhat further apart?
[Figure: timeline showing Sj = (C_1^j, C_2^j, …, C_n^j) followed by Ci within T_aggreg]
[24] C. Nuzman, I. Saniee, W. Sweldens, and A. Weiss, “A compound model for TCP connection arrivals for LAN and WAN applications,” Computer Networks, 2002.
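The aggregation rule can be sketched as a single pass over time-ordered connections. This is a minimal sketch under stated assumptions: the tuple layout and the function name are invented for illustration, gaps are measured between connection start times, and the default of 100 seconds is the T_aggreg value quoted on the parameters slide.

```python
from typing import Dict, List, Tuple

T_AGGREG = 100.0  # seconds (value quoted on the parameters slide)

# Assumed layout: (start_time, proto, direction, local_host, remote_host)
Conn = Tuple[float, str, str, str, str]


def extract_homogeneous(conns: List[Conn],
                        t_aggreg: float = T_AGGREG) -> List[List[Conn]]:
    """Group same-type connections between the same host pair into sessions."""
    sessions: List[List[Conn]] = []
    # Index of the most recent session for each (type, host pair) key.
    latest: Dict[Tuple[str, str, str, str], int] = {}
    for c in sorted(conns):  # process in order of arrival
        start, proto, direction, local, remote = c
        key = (proto, direction, local, remote)
        idx = latest.get(key)
        # Join the session if its last connection arrived < t_aggreg ago.
        if idx is not None and start - sessions[idx][-1][0] < t_aggreg:
            sessions[idx].append(c)
        else:
            sessions.append([c])
            latest[key] = len(sessions) - 1
    return sessions


demo = extract_homogeneous([
    (0.0, "ssh", "out", "L", "R"),
    (50.0, "ssh", "out", "L", "R"),    # 50 s after the previous one: same session
    (60.0, "ssh", "out", "L", "R2"),   # different remote host: new session
    (200.0, "ssh", "out", "L", "R"),   # 150 s gap: new session
])
```

Connections of a different type never join a session here, which is exactly the gap the mixed-session step is meant to fill.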
2. Extracting mixed sessions
• Attempt to assess possible causality
• For Ci and an existing active session Sk
• Sk = (C_1^k, …, C_m^k) and A(Sk) ≡ T(C_1^k) ≠ T(Ci) (the types are not exactly the same)
• Try to determine whether Ci is a “triggered” connection of C_1^k
• Based on the observation that if Ci is causally related to Sk, its arrival is likely to be “closer” to Sk than if Ci were a normal (unrelated) connection
• (Approach) devise a statistical test:
• Identifies pairs of causally linked connections
• Builds a base model of what is “normal”, and flags deviations
• Uses a null hypothesis test
2. Extracting mixed sessions (causality detection algorithm)
• On the arrival of a connection C of type T involving a local-host L
• Let the sessions observed at L in the previous T_trigger (500 seconds) be S1, S2, …, Sn
• First check whether C can simply be aggregated into the most recent homogeneous session Si
• Estimate the rate of connection arrivals at L for each session type within the past T_rate (3600 seconds)
• For 1 ≤ i ≤ n, compute P[Ti, T, xi], where xi is the interval between the arrival of Si and C
• If P[Ti, T, xi] < α and C and Si involve the same remote-host, then add C to Si
• Otherwise, C is considered the first connection of a new session
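The decision step can be sketched as follows. This is a hedged illustration, not the authors' test: the slide does not give the form of P[Ti, T, xi], so the code substitutes a simple stand-in, the probability that a Poisson process with the estimated rate of type-T arrivals produces at least one arrival within the gap. All function names are hypothetical, and picking the most recent qualifying session is an assumption made here.

```python
import math
from typing import List, Optional, Tuple

T_TRIGGER = 500.0   # seconds, from the slide
T_RATE = 3600.0     # seconds, from the slide
ALPHA = 0.1         # threshold quoted on the parameters slide


def estimate_rate(arrivals: List[float], now: float,
                  window: float = T_RATE) -> float:
    """Arrivals per second of one connection type at L over the past window."""
    recent = [t for t in arrivals if now - window <= t < now]
    return len(recent) / window


def coincidence_probability(rate: float, gap: float) -> float:
    """Stand-in for P[Ti, T, x]: chance of >= 1 Poisson arrival within gap."""
    return 1.0 - math.exp(-rate * gap)


def find_parent_session(now: float, remote: str,
                        past_arrivals: List[float],
                        sessions: List[Tuple[float, str]],
                        alpha: float = ALPHA) -> Optional[int]:
    """Return the index of the session the new connection joins, or None.

    sessions holds (start_time, remote_host) for sessions seen at the
    same local host; past_arrivals holds earlier arrival times of the
    new connection's type at that host.
    """
    rate = estimate_rate(past_arrivals, now)
    best: Optional[int] = None
    for i, (start, sess_remote) in enumerate(sessions):
        gap = now - start
        if gap < 0 or gap > T_TRIGGER or sess_remote != remote:
            continue
        # A coincidence this close is unlikely under the null model.
        if coincidence_probability(rate, gap) < alpha:
            if best is None or start > sessions[best][0]:
                best = i  # assumption: prefer the most recent session
    return best


history = [100.0, 1500.0, 3000.0]            # 3 arrivals in the last hour
active = [(3590.0, "R"), (3300.0, "R"), (3595.0, "Q")]
choice = find_parent_session(3600.0, "R", history, active)
```

In the example, only the session that started 10 seconds earlier with the same remote host passes the test; the one that started 300 seconds earlier does not, because a coincidental arrival within 300 seconds is not rare at that rate.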
2. Extracting mixed sessions (causality detection algorithm) (cont’d)
• (Empirically known fact) the arrival model is often roughly stationary Poisson over hourly periods
• Identify connections whose arrivals deviate from this model as triggered connections
• Arrival process of unrelated (normal) connections = union of independent Poisson processes
• Very close coincidental arrivals are very rare
• Therefore: arrivals that are close are likely related, i.e., part of the same session
• P[T1, T2, x] is the probability that two sessions have an arrival within time x
• If P[..] < α, declare C1, C2 to be in the same session
[Figure: timeline with an FTP connection C1 (rate λ1) and an HTTP connection C2 (rate λ2) separated by inter-arrival time x, which might be longer than T_aggreg]
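As a reading aid, the Poisson reasoning above can be written out. This is a simplified version that uses a single arrival rate λ; the exact form of P[T1, T2, x], which involves both session types, is not given on the slide and may differ.

```latex
% N(x): number of unrelated arrivals in an interval of length x
N(x) \sim \mathrm{Poisson}(\lambda x), \qquad
\Pr[N(x) \ge 1] = 1 - e^{-\lambda x} \approx \lambda x
\quad \text{for } \lambda x \ll 1
```

For example, with λ = 1 arrival per hour (1/3600 per second) and x = 10 seconds, the probability is 1 - e^{-1/360} ≈ 0.0028, far below α = 0.1, so such a close arrival would be treated as triggered rather than coincidental.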
Outline • Introduction • Background • Session Extraction • Structure Abstraction • Results • Conclusion & Comments
[Figure: pipeline from connection-level traffic trace through Session Extraction and Structure Abstraction to session descriptors]
Structure abstraction
• Derive succinct descriptions of application sessions based on the set of session types (ST) reported by Session Extraction
• Use regular expressions & DFAs to represent an application session
• Good balance between expressiveness and ease of generation
• Further refine this representation by labeling state transitions with probabilities
• Helps avoid false positives
Exact DFA vs. “Natural” DFA
Exact FTP DFA
• (Naïve approach) Simply build a DFA that matches the list of all the observed sessions
• More complex, because it has to completely capture several FTP sessions
Natural FTP DFA
• A more tractable DFA for FTP
• Benefits:
• Simplicity
• Generalization
• Highlighting common behavior
• Minimizing false positives
Structure Abstraction Framework
[Figure: pipeline from connection-level traffic trace through Session Extraction and Structure Abstraction (4 steps) to session descriptors]
• Semi-automatic
• Lack of ground truth
• Step 1: Categorize sessions based on the server port of the 1st connection
• Step 2: Construct the exact DFA E from the union of the observed session types (ST)
Step 3: Coverage Phase
• Given the exact DFA E
• Aim to extract a set of DFAs that capture subsets of the observed session behavior
• Find the best trade-off between simplicity of expression (fewest states/edges) and coverage (capturing most types of behavior)
• A greedy algorithm: DFA E -> DFA F1, F2, …, Fn
• Feed every session instance in ST to E
• Compute the hit count h(e) for every edge e
• Next, compute the augmented hit count h’(e) = Σ h(e’), summed over all edges e’ reachable from e
• Order edges by decreasing h’(e), denoted e1, e2, …
• Construct DFA Fi by taking the union of edges e1, …, ei
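The greedy ordering can be sketched on a prefix tree. The sketch makes several assumptions that are not on the slide: the exact DFA E is modeled as a trie whose edges are (prefix, symbol) pairs, an edge counts as reachable from itself, ties are broken by depth, and all names are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

SessionType = Tuple[str, ...]
Edge = Tuple[SessionType, str]  # (prefix reached so far, next symbol)


def hit_counts(observed: Dict[SessionType, int]) -> Counter:
    """h(e): number of session instances that traverse edge e of the trie."""
    h: Counter = Counter()
    for st, count in observed.items():
        for i, sym in enumerate(st):
            h[(st[:i], sym)] += count
    return h


def augmented_hit_counts(h: Counter) -> Counter:
    """h'(e): sum of h(e') over edges e' reachable from e (including e)."""
    h_aug: Counter = Counter()
    for (prefix, sym), count in h.items():
        path = prefix + (sym,)
        # Every edge on the root-to-e' path can reach e'.
        for i in range(len(path)):
            h_aug[(path[:i], path[i])] += count
    return h_aug


def coverage_dfas(observed: Dict[SessionType, int]) -> List[Set[Edge]]:
    """F_i = union of the i edges with the largest augmented hit counts."""
    h_aug = augmented_hit_counts(hit_counts(observed))
    ordered = sorted(h_aug, key=lambda e: (-h_aug[e], len(e[0]), e))
    return [set(ordered[:i]) for i in range(1, len(ordered) + 1)]


# Hypothetical FTP-like session types with their observed frequencies.
observed = {
    ("ftp_out",): 6,
    ("ftp_out", "data_in"): 3,
    ("ftp_out", "data_in", "data_in"): 1,
}
h_aug = augmented_hit_counts(hit_counts(observed))
dfas = coverage_dfas(observed)
```

In a trie, an edge's augmented count is always strictly larger than that of any edge below it, so every F_i produced this way is a connected sub-automaton that starts at the root.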
Step 4: Generalization Phase
• Generalize F1, F2, … into a set of generalized DFAs G1, G2, …
• 3 workable generalization rules:
• Prefix Rule: STi in trace -> all prefixes of STi
• Counting Rule: (aBc) & (aB^n c) in trace -> (aB+c)
• Invert Direction Rule: STi in trace -> invert(STi)
[Figure: example DFA with edges labeled ftp_in, ftp_out, data_in, data_out; refer to the authors’ slides]
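The three rules can be illustrated on session types written as tuples of symbols such as `ftp_out` and `data_in`. This is a sketch with assumed names and an assumed symbol format (`<proto>_<in|out>`); it applies the rules to individual session types, whereas the slide applies them to DFAs.

```python
import re
from typing import Pattern, Set, Tuple

SessionType = Tuple[str, ...]


def prefix_rule(st: SessionType) -> Set[SessionType]:
    """Prefix Rule: accept every non-empty prefix of an observed type."""
    return {st[:i] for i in range(1, len(st) + 1)}


def invert_direction(st: SessionType) -> SessionType:
    """Invert Direction Rule: swap in/out on every connection type."""
    swap = {"in": "out", "out": "in"}
    inverted = []
    for sym in st:
        proto, direction = sym.rsplit("_", 1)
        inverted.append(proto + "_" + swap[direction])
    return tuple(inverted)


def encode(st: SessionType) -> str:
    """Flatten a session type into a string, one space after each symbol."""
    return "".join(sym + " " for sym in st)


def counting_regex(a: SessionType, b: SessionType,
                   c: SessionType) -> Pattern[str]:
    """Counting Rule: build the regular expression a B+ c."""
    def enc(seq: SessionType) -> str:
        return "".join(re.escape(sym) + " " for sym in seq)
    return re.compile(enc(a) + "(?:" + enc(b) + ")+" + enc(c))


# Hypothetical generalization: ftp_out followed by one or more data_in.
ftp_pattern = counting_regex(("ftp_out",), ("data_in",), ())
```

For instance, `ftp_pattern.fullmatch(encode(("ftp_out", "data_in", "data_in")))` succeeds even if only one and four repetitions of `data_in` were seen in the trace, which is the generalization the counting rule is after.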
Outline • Introduction • Background • Session Extraction • Structure Abstraction • Results • Conclusion & Comments
Parameters:
• T_aggreg = 100 sec
• T_trigger = 500 sec
• T_rate = 1 hr
• Threshold α = 0.1
• T_service ≥ 5
• Counting rule |B| = 2
• Only feed session types of length ≤ 10
FTP session structures (content transfer)
• Coverage: the fraction of session types in ST accepted by Gi, weighted by the frequency with which each type occurs
• Gi may have more or fewer than i edges
[Figure: FTP DFA with 4 edges; annotation “2: singleton”]
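The frequency-weighted fraction described above can be computed directly once a descriptor is available as an acceptance test. The sketch below assumes a hypothetical `accepts` predicate and made-up frequencies; it is not taken from the paper.

```python
from typing import Callable, Dict, Tuple

SessionType = Tuple[str, ...]


def weighted_coverage(observed: Dict[SessionType, int],
                      accepts: Callable[[SessionType], bool]) -> float:
    """Fraction of observed session instances whose type is accepted."""
    total = sum(observed.values())
    if total == 0:
        return 0.0
    covered = sum(n for st, n in observed.items() if accepts(st))
    return covered / total


# Made-up frequencies for illustration only.
observed_ftp = {
    ("ftp_out",): 6,
    ("ftp_out", "data_in"): 3,
    ("ftp_out", "ssh_out"): 1,
}


def simple_ftp_descriptor(st: SessionType) -> bool:
    """Hypothetical descriptor: only ftp_out and data_in connections."""
    return all(sym in {"ftp_out", "data_in"} for sym in st)


coverage = weighted_coverage(observed_ftp, simple_ftp_descriptor)
```

Here 9 of the 10 observed session instances are accepted, so the coverage is 0.9 even though only two of the three distinct session types are covered.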
FTP session structure (cont’d)
• DFA: 8 edges. Single data transfer in the opposite direction
• DFA: 8 edges, but fewer actual edges
• DFA: 10 edges. HTTP connections can occur during FTP sessions
• DFA: 18 edges. Coverage: 99%
Timbuktu session structures (remote desktop)
• 2: Singleton > 90%
• Others < 10%
[Figure: Timbuktu DFA with 4 edges]
Timbuktu session structures (cont’d)
[Figure: Timbuktu DFA with 10 edges]
HTTP session structure (content transfer)
• DFA: 30 edges
• (To save space, only sessions that begin with an outgoing HTTP connection are shown)
• More complex, but ~99% are singletons or aggregated sessions that reflect successive retrieval of multiple pages from the same server!
Finding Attacks Using Anomaly Detection
• One goal is to detect network attacks by finding sessions that deviate from established session structures.
• Such deviations would reflect unintended misconfigurations, scanning, or “phone home” connections associated with compromises.
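One way to picture this use is to run each extracted session type through a descriptor DFA and flag the ones it rejects. The DFA below is a hypothetical FTP descriptor invented for this sketch, not one of the paper's results, and the transition-table layout is likewise an assumption.

```python
from typing import Dict, Iterable, List, Set, Tuple

SessionType = Tuple[str, ...]
Transitions = Dict[Tuple[int, str], int]  # (state, symbol) -> next state


def accepts(transitions: Transitions, accepting: Set[int],
            st: SessionType, start: int = 0) -> bool:
    """Run the DFA over a session type; a missing transition means reject."""
    state = start
    for sym in st:
        nxt = transitions.get((state, sym))
        if nxt is None:
            return False
        state = nxt
    return state in accepting


# Hypothetical descriptor: ftp_out followed by any number of data transfers.
FTP_DFA: Transitions = {
    (0, "ftp_out"): 1,
    (1, "data_in"): 1,
    (1, "data_out"): 1,
}
FTP_ACCEPTING = {1}


def anomalous(session_types: Iterable[SessionType]) -> List[SessionType]:
    """Return the session types that deviate from the FTP descriptor."""
    return [st for st in session_types
            if not accepts(FTP_DFA, FTP_ACCEPTING, st)]


flagged = anomalous([
    ("ftp_out", "data_in"),
    ("ftp_out", "ssh_out"),   # unexpected connection type inside the session
])
```

A flagged session is only a candidate for inspection; as the slide notes, deviations can come from misconfiguration as well as from attacks.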
Outline • Introduction • Background • Session Extraction • Structure Abstraction • Results • Conclusion & Comments
Conclusion
• Session extraction
• A statistical technique to extract application sessions from a connection-level trace of network activity
• Structure abstraction
• A method to deduce descriptors that an analyst can use to capture the qualitative structure of such sessions
• The results show that the proposed method works well for many of the applications in the trace
• Future work:
• Evaluate/validate the proposed method on more applications
• Extend the method to support single-to-multiple host sessions
• Try to collate descriptors for closely related protocols
Comments
• This method statistically correlates connections by observing connection-level traffic traces
• It might not be suitable for a complex environment
• What if packet-level traces can be acquired?
• Surprisingly, a particular application can manifest various session structures
• The session structures in this paper can help detect host-based anomalies
• Single-to-multiple host sessions might be more helpful for observing/identifying worm-like activities