Silvio Cesare and Yang Xiang School of Management and Information Systems Centre for Intelligent and Networked Systems Central Queensland University
Motivation Malware, short for malicious software, is hostile, intrusive, or annoying software and program code. Malware is a significant problem in distributed computer systems and in endhost security. To prevent malware from causing damage, untrusted programs can be analysed to identify malicious intent before they are allowed to execute. Many malware families have variants, so detecting previously unknown variants is valuable.
Introduction Automated malware analysis can be dynamic or static. Traditional antivirus is static: it must be efficient and keep pace with users' productivity demands. String signatures based on byte-level content are the dominant approach; they are efficient, but not always effective against malware variants. Polymorphism describes malware variants sharing a common history of code. Variants may arise automatically from code mutation, or be created manually by malware authors through code reuse. Byte-level content may vary significantly between variants.
Introduction (cont) Static analysis can provide non-traditional features to characterize malware. Control flow describes the possible execution paths through a program. Control flow is considered more invariant across polymorphic malware than traditional byte-level features. Malware often hinders control flow analysis, and static analysis generally, through code packing. Code packing hides, encrypts, compresses, or obfuscates malware. Automated unpacking reverses this obfuscation and is required for effective malware classification.
Our Contribution We propose an algorithm to identify malware variants by determining program similarity through estimating isomorphic control flow graphs. We implement and evaluate our idea in a novel prototype system. We demonstrate the system is fast enough for desktop adoption on the endhost.
Related Work Previous approaches include: API call sequences; n-grams and n-perms of byte-level content; basic block matching using edit distances, inverted indexes, and Bloom filters; approximate matching of call graphs; and approximate matching of control flow graphs. Our approach is more effective than byte-level approaches, and more efficient than existing flowgraph-based systems.
The Software Similarity Problem The software similarity problem is to determine the similarity between programs. Similarity is a real number between 0 and 1: 0 means not at all similar, 1 means identical. It is calculated from characteristics that remain invariant between programs. Given a query program, is it malicious? A high similarity between the query program and existing malware identifies it as malicious. This is implemented by performing a range or similarity search that identifies the query program's similar neighbours in a malware database. Our system measures static software similarity. A similarity >= 0.6 indicates a variant; the 0.6 threshold was chosen through manual and empirical evaluation.
System Design and Implementation Identify whether the query program is packed. Unpack it if so. Generate control flow graphs. Generate flowgraph signatures. Classify: find high similarity between the query's signatures and existing malware. Update the malware database with variant information.
System Design and Implementation Block diagram of the malware classification system.
Flowgraph Signatures A flowgraph signature is defined as the string representing the graph after labelling the nodes using a depth-first order traversal of the graph. This signature, a graph invariant, is used to estimate graph isomorphism by testing signatures for equality. The signature string can be hashed to allow for more efficient searching; we use CRC64. Each procedure or flowgraph is assigned a normalized weight, and a similarity ratio is defined between two flowgraphs x and y.
Flowgraph Signatures A depth first ordered flowgraph and its signature.
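The signature idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the adjacency-list encoding of the signature string, the function names `flowgraph_signature` and `signature_hash`, and the example addresses are all assumptions made for illustration. It also assumes each node lists its successors in a consistent order (for example, fall-through before branch target), and it substitutes `zlib.crc32` for CRC64, which the Python standard library does not provide.

```python
import zlib

def flowgraph_signature(cfg, entry):
    """Label nodes in depth-first (preorder) order, then serialise the edges.

    cfg maps each node (e.g. a basic block address) to an ordered list of
    successors. Nodes unreachable from `entry` are ignored.
    """
    order = {}
    stack = [entry]
    while stack:
        node = stack.pop()
        if node in order:
            continue
        order[node] = len(order)  # next depth-first label
        # Push successors reversed so the first successor is visited first.
        stack.extend(reversed(cfg.get(node, [])))
    parts = []
    for node in sorted(order, key=order.get):
        succs = ",".join(str(order[s]) for s in cfg.get(node, []))
        parts.append(f"{order[node]}:{succs}")
    return ";".join(parts)

def signature_hash(signature):
    """Hash the signature string (CRC32 here as a stand-in for CRC64)."""
    return zlib.crc32(signature.encode("ascii"))

# Two flowgraphs with the same shape but different block names.
cfg_a = {0x401000: [0x401010, 0x401020], 0x401010: [0x401030],
         0x401020: [0x401030], 0x401030: []}
cfg_b = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
same = flowgraph_signature(cfg_a, 0x401000) == flowgraph_signature(cfg_b, "A")
```

Because the labels depend only on traversal order and not on addresses, the two example graphs yield the same string, so equality of hashes can serve as a fast estimate of isomorphism.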
Malware Classification The Dice coefficient is a measure of similarity between two sets A and B: $D(A, B) = \frac{2|A \cap B|}{|A| + |B|}$. We represent a program as a set of control flow graph signatures and use the weighted Dice coefficient to measure similarity between programs. Because the weights have been normalized, the equation simplifies to the sum of the weights of the flowgraphs common to both sets. We define an asymmetric similarity over these weights; two sets of weights are possible, representing either the query weight or the database weight. The program similarity is then computed from these asymmetric similarities.
Improving Performance in Malware Classification To improve performance, we do not evaluate the program similarity function linearly or exhaustively against each malware in the database. We propose a novel algorithm that searches the entire database for sets similar to the query. Iterate through the query program's procedures. Find each procedure's matching flowgraphs and malware in the database. Build the asymmetric similarities incrementally. Process unique or non-matching flowgraphs first. Prune low-similarity objects, then process the remaining flowgraphs.
Analysis The expected time to classify a query is $O(N \log M)$, where N is the number of procedures (control flow graphs) in the query and M is the flowgraph database size. The worst-case time is $O(N \log M + AN^2)$, where A is the number of malware highly similar to the query. In the previous literature on approximate call graph matching, pairwise similarity complexity is $O(N^3)$, and searching the database used metric trees, which have logarithmic search time but growth that is also exponential in the dimensionality of the objects. Binary trees in our system have more predictable and efficient performance.
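A small Python sketch may make the set-based similarity concrete. The plain Dice coefficient is standard; everything else is an assumption for illustration: the helper names, the use of a per-flowgraph size as the basis for normalized weights, counting only exact signature matches, and combining the two asymmetric similarities by multiplication (the slide's own formula is not reproduced in this text).

```python
def dice(a, b):
    """Plain Dice coefficient between two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def normalise(sizes):
    """Turn per-flowgraph sizes into weights that sum to 1 (assumed weighting)."""
    total = sum(sizes.values())
    return {sig: size / total for sig, size in sizes.items()}

def asymmetric_similarity(own_weights, other_weights):
    """Sum of this program's weights over signatures shared with the other."""
    return sum(w for sig, w in own_weights.items() if sig in other_weights)

def program_similarity(query_weights, db_weights):
    """Combine both directions; the product is an assumption of this sketch."""
    return (asymmetric_similarity(query_weights, db_weights)
            * asymmetric_similarity(db_weights, query_weights))

# Hypothetical programs described by signature -> flowgraph size.
query = normalise({"s1": 6, "s2": 3, "s3": 1})
known = normalise({"s1": 5, "s2": 5})
similarity = program_similarity(query, known)
is_variant = similarity >= 0.6  # threshold from the text
```

Using weights from both the query and the database penalises a match in which only a small part of either program is shared.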
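The search steps above can be sketched in Python as follows. This is one possible reading, not the authors' algorithm: the inverted index layout, the function names, the rarest-first processing order, the pruning rule, and the product used for the final score are all assumptions. A Python dict stands in for the binary tree index described in the analysis.

```python
from collections import defaultdict

def build_index(database):
    """Invert {malware: {signature: weight}} into {signature: [(malware, weight)]}."""
    index = defaultdict(list)
    for name, weights in database.items():
        for sig, weight in weights.items():
            index[sig].append((name, weight))
    return index

def search(query_weights, index, threshold=0.6):
    """Return {malware: similarity} for database entries at or above threshold."""
    a_query = defaultdict(float)  # query-side asymmetric similarity per malware
    a_db = defaultdict(float)     # database-side asymmetric similarity per malware
    remaining = 1.0               # query weight not yet processed
    # Process signatures with the fewest database matches first.
    for sig in sorted(query_weights, key=lambda s: len(index.get(s, ()))):
        w_query = query_weights[sig]
        for name, w_db in index.get(sig, ()):
            # A candidate first seen now can gain at most `remaining` on the
            # query side, so it cannot reach the threshold: prune it.
            if name not in a_query and remaining + 1e-9 < threshold:
                continue
            a_query[name] += w_query
            a_db[name] += w_db
        remaining -= w_query
    scores = {name: a_query[name] * a_db[name] for name in a_query}
    return {name: s for name, s in scores.items() if s >= threshold}

# Hypothetical database of normalized signature weights.
database = {
    "family_a.v1": {"s1": 0.5, "s2": 0.5},
    "family_b.v1": {"s3": 0.2, "s7": 0.8},
    "family_c.v1": {"s9": 1.0},
}
index = build_index(database)
result = search({"s1": 0.6, "s2": 0.3, "s3": 0.1}, index)
```

Only malware that shares at least one flowgraph with the query is ever touched, which is why the database does not need to be scanned exhaustively.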
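The expected-case bound can be illustrated with a short Python sketch, assuming signature hashes are kept in sorted order. A sorted list searched with `bisect` stands in for the binary tree; the function name and the sample values are hypothetical. Each of the N query hashes costs one O(log M) lookup.

```python
import bisect

def count_matches(query_hashes, sorted_db_hashes):
    """Count query hashes present in the database: N lookups of O(log M) each."""
    matches = 0
    for h in query_hashes:
        i = bisect.bisect_left(sorted_db_hashes, h)  # binary search
        if i < len(sorted_db_hashes) and sorted_db_hashes[i] == h:
            matches += 1
    return matches

db_hashes = sorted([17, 3, 99, 42, 8])
found = count_matches([42, 5, 3], db_hashes)
```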
Evaluation - Effectiveness Similarity matrices for malware families. A dash marks a cell with no value.

Klez
     a     b     c     d     g     h
a    -     0.76  0.82  0.69  0.52  0.51
b    0.76  -     0.83  0.80  0.52  0.51
c    0.82  0.83  -     0.69  0.51  0.51
d    0.69  0.80  0.69  -     0.51  0.50
g    0.52  0.52  0.51  0.51  -     0.85
h    0.51  0.51  0.51  0.50  0.85  -

Netsky
     aa    ac    f     j     p     t     x     y
aa   -     0.74  0.59  0.67  0.49  0.72  0.50  0.83
ac   0.74  -     0.69  0.78  0.40  0.55  0.37  0.63
f    0.59  0.69  -     0.88  0.44  0.61  0.41  0.70
j    0.67  0.78  0.88  -     0.49  0.69  0.46  0.79
p    0.49  0.40  0.44  0.49  -     0.68  0.85  0.58
t    0.72  0.55  0.61  0.69  0.68  -     0.63  0.86
x    0.50  0.37  0.41  0.46  0.85  0.63  -     0.54
y    0.83  0.63  0.70  0.79  0.58  0.86  0.54  -

Roron
     ao    b     d     e     g     k     m     q     a
ao   -     0.44  0.28  0.27  0.28  0.55  0.44  0.44  0.47
b    0.44  -     0.27  0.27  0.27  0.51  1.00  1.00  0.58
d    0.28  0.27  -     0.48  0.56  0.27  0.27  0.27  0.27
e    0.27  0.27  0.48  -     0.59  0.27  0.27  0.27  0.27
g    0.28  0.27  0.56  0.59  -     0.27  0.27  0.27  0.27
k    0.55  0.51  0.27  0.27  0.27  -     0.51  0.51  0.75
m    0.44  1.00  0.27  0.27  0.27  0.51  -     1.00  0.58
q    0.44  1.00  0.27  0.27  0.27  0.51  1.00  -     0.58
a    0.47  0.58  0.27  0.27  0.27  0.75  0.58  0.58  -

Evaluation - Efficiency

Malware processing time.
Time (s)  Num. of Samples
0-1       299
1-2       401
2-3       46
3-4       30
4-5       32
5+        1

Benign processing time.
Time (s)  Num. of Samples
0.0       0
0.1       139
0.2       80
0.3       42
0.4       28
0.5       10
0.6       10
0.7       3
0.8       6
0.9       5
1-2       17
2+        6

Evaluation - Scalability

Scalability.
Database Size  1000  2000  4000  8000  16000  32000  64000
Time (ms)      < 1   < 1   < 1   < 1   < 1    < 1    < 1

Evaluation - Accuracy

False positive evaluation.
Similarity  Matches (approx.)  Matches (exact)
0.0         105497             97791
0.1         2268               1598
0.2         637                532
0.3         342                324
0.4         199                175
0.5         121                122
0.6         44                 34
0.7         72                 24
0.8         24                 22
0.9         20                 12
1.0         6                  0

Similarity matrix for non-similar programs. A dash marks a cell with no value.
           cmd.exe  calc.exe  netsky.aa  klez.a  roron.ao
cmd.exe    -        0.00      0.00       -       0.00
calc.exe   0.00     -         0.00       0.00    0.00
netsky.aa  0.00     0.00      -          0.15    0.09
klez.a     -        0.00      0.15       -       0.13
roron.ao   0.00     0.00      0.09       0.13    -
Limitations Disassembly and control flow reconstruction of an obfuscated program is an undecidable problem. In practice, analysis is possible because malware is obfuscated using packing. However, automated unpacking using application level emulation is detectable. Packing using instruction virtualization is also resistant to automated unpacking.
Conclusion Malware variants can be detected based on similarity in their control flow. We proposed estimating isomorphic control flow graphs using graph invariants. We implemented this approach in a prototype system. Our system was able to detect real malware variants. It was resilient to false positives, and had logarithmic performance in the expected case. It was shown to have suitable performance for use on the endhost.

A Fast Flowgraph Based Classification System for Packed and Polymorphic Malware on the Endhost
