MALWARE DETECTION USING MACHINE LEARNING
ABHIJIT MOHANTA
ABOUT PRESENTER
• Worked as a security researcher for
Symantec, McAfee, Cyphort
• Experience in reverse engineering,
malware analysis and detection
• Worked on antivirus engines and sandbox
engines
DISCLAIMER
I have used some content from the
following sites
Reference:
• analyticsvidhya.com
• datadrivensecurity.info
• home.agh.edu.pl
• neuralnetworksanddeeplearning.com
• http://www.astroml.org
• YouTube
• Google images
Malware Detection in Antivirus:
How do antiviruses detect malware?
• Traditional AVs use pattern matching on static files
• Partially decrypt samples using techniques like emulation
How does malware evade antivirus?
• Uses polymorphic packers, which evade static pattern
matching
Why Machine Learning?
• Too many types of malware: bots, viruses
• Categories based on target: stealers, POS malware, banking malware
• Too much data for humans to process
MACHINE LEARNING INTRO
• Some prerequisites:
statistics, calculus, vectors, algebra
• Problems solved: classification / regression
• Types: supervised, semi-supervised,
unsupervised
• What is our problem? Classification
Supervised Learning:
• What is it?
• Steps:
– Feature Selection
– Training (provide labelled data)
– Prediction
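The three steps above can be sketched with scikit-learn. This is a minimal illustration, not a real detector: the two features (file entropy and a signed/unsigned flag) and all sample values are invented, and the decision tree is just one possible model choice.

```python
from sklearn.tree import DecisionTreeClassifier

# Step 1: feature selection. Each sample is [entropy, is_signed].
# All values below are invented toy data, not measurements.
X_train = [
    [7.8, 0],
    [7.5, 0],
    [5.1, 1],
    [4.9, 1],
]

# Step 2: training. Every sample comes with a label.
y_train = ["malicious", "malicious", "benign", "benign"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 3: prediction on samples the model has not seen before.
predictions = list(model.predict([[7.9, 0], [5.0, 1]]))
print(predictions)
```

The same fit/predict pattern applies to the other scikit-learn classifiers, so the model can be swapped without changing the surrounding code.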
FEATURE SELECTION
• How are features selected in classification?
• A feature is a property with which you can distinguish
two classes
• A feature can be represented as a vector, Boolean, etc.
• Apple vs Orange classes:
– Features: colour, weight, shape
– Labels: apple, orange
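As a rough sketch of how such properties become a feature vector, the snippet below encodes colour as a one-hot vector, weight as a number and shape as a Boolean. The sample values, the colour vocabulary and the helper name `to_feature_vector` are all assumptions made for illustration.

```python
# Hypothetical raw observations; the values are invented for illustration.
samples = [
    {"colour": "red", "weight": 150.0, "round": False, "label": "apple"},
    {"colour": "orange", "weight": 135.0, "round": True, "label": "orange"},
]

COLOURS = ["red", "green", "orange"]  # assumed colour vocabulary


def to_feature_vector(sample):
    """Encode one sample as a flat list of numbers."""
    # One-hot encode the colour: 1.0 in the slot of the matching colour.
    one_hot = [1.0 if sample["colour"] == c else 0.0 for c in COLOURS]
    # Weight is already numeric; the shape flag becomes 0.0 or 1.0.
    return one_hot + [sample["weight"], 1.0 if sample["round"] else 0.0]


X = [to_feature_vector(s) for s in samples]  # feature vectors
y = [s["label"] for s in samples]            # labels
print(X)
print(y)
```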
MODEL SELECTION
Models covered (supervised unless noted):
• K-Nearest Neighbours (KNN): classification
• K-Means clustering (unsupervised)
• SVM
• Decision Tree
• Random Forest
• Naive Bayes algorithm
K-Nearest Neighbours(KNN)
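• Supervised learning
• Classification algorithm
• Similarity to neighbours is measured by a distance metric (Euclidean, Manhattan, Minkowski)
• Euclidean distance is a common default
• Picture a circle around the point to be classified that contains the k nearest points; the point takes the majority class among them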
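A small pure-Python sketch of the idea, assuming Euclidean distance and a simple majority vote. The points, labels and function names are made up for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2.0, 2.0)))
print(knn_predict(points, labels, (6.0, 6.5)))
```

Swapping `euclidean` for a Manhattan or Minkowski distance changes only the `key` function; the voting step stays the same.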
K-Means
• Unsupervised learning
• Clustering algorithm
• Given some data, we cluster the data into K
groups
• In each iteration the mean value (centre) of each
cluster is updated
• Points are assigned to the nearest centre using Euclidean
distance
• Ref video: https://www.youtube.com/watch?v=aiJ8II94qck
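The assign-then-update loop can be sketched in plain Python as follows. The data, the fixed starting centres and the fixed iteration count are assumptions chosen to keep the example deterministic; real implementations usually pick starting centres randomly and stop on convergence.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of K-Means iterations from the given starting centres."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [math.dist(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: each centre moves to the mean of its cluster.
        # An empty cluster keeps its previous centre.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centres, groups = kmeans(data, [(0.0, 0.0), (10.0, 10.0)])
print(centres)
print(groups)
```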
Support Vector Machines
• Classifier
• What are support vectors?
• Linearly separating hyperplane
• Margin with maximum separation
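A short scikit-learn sketch of these ideas on invented, linearly separable data. The large `C` value is an assumption used to approximate a hard margin; the margin width is derived from the learned weight vector as 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data, invented for illustration: two linearly separable groups.
X = [[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel gives a separating hyperplane w.x + b = 0.
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

# Support vectors are the training points that lie on the margin.
print(clf.support_vectors_)

# For a linear kernel, coef_ holds w; the margin width is 2 / ||w||.
margin_width = 2.0 / np.linalg.norm(clf.coef_)
print(margin_width)

svm_predictions = list(clf.predict([[1.5, 1.5], [4.5, 4.5]]))
print(svm_predictions)
```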
Support Vector Machines
• Ref: http://www.saedsayad.com/support_vector_machine.htm
• Videos:
• https://www.youtube.com/watch?v=1NxnPkZM9bc
• https://www.youtube.com/watch?v=5zRmhOUjjGY
Decision Tree
Ref: https://databricks.com/blog/2014/09/29/scalable-decision-trees-in-mllib.html
Random Forest
• Ensemble learning method
• Uses output of multiple decision trees
Ref: https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/
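A minimal scikit-learn sketch of the ensemble idea, using invented toy data. The forest trains 25 decision trees on random resamples of the data and combines their outputs; the tree count and the data are assumptions for the example.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data, invented for illustration: class 0 near (1, 1), class 1 near (9, 9).
X = [[1, 1], [1, 2], [2, 1], [2, 2], [1.5, 1.5],
     [8, 8], [8, 9], [9, 8], [9, 9], [8.5, 8.5]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# estimators_ holds the individual fitted decision trees.
print(len(forest.estimators_))

# predict_proba averages the class probabilities of all trees;
# predict returns the class with the highest averaged probability.
queries = [[1.2, 1.8], [8.7, 8.2]]
print(forest.predict_proba(queries))
forest_predictions = list(forest.predict(queries))
print(forest_predictions)
```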
Features for Malware Detection
• Static:
– Size
– Signed/unsigned
– Icon: exe file without an icon
– Entropy
• Behaviour:
– Process executed from %appdata% or %temp%
– Dropped file has a random name, e.g. xszsde.exe
– Process creating Run registry entries
– Code injection
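Some of the static features above are easy to compute. The sketch below calculates size and Shannon entropy over a file's bytes (0 to 8 bits per byte); packed or encrypted files tend to score near the top of that range. The function names and the feature dictionary layout are hypothetical, and the signed/unsigned flag is passed in because checking a signature is outside the scope of this example.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def static_features(data: bytes, is_signed: bool) -> dict:
    """Collect a few static features for one file (hypothetical layout)."""
    return {
        "size": len(data),
        "signed": is_signed,
        "entropy": shannon_entropy(data),
    }


low = static_features(b"A" * 1024, is_signed=True)        # repetitive content
high = static_features(bytes(range(256)) * 4, is_signed=False)  # every byte value equally often
print(low)
print(high)
```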
Training Sets for malware
Some Applications for Malware Traffic Detection
• DGA detection
• DGA: what is a DGA? (Domain Generation Algorithm)
• Features:
– N-grams
– Entropy
– Dictionary
• Reference: http://datadrivensecurity.info
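A rough sketch of how the three feature types could be extracted from a domain name. The tiny word list stands in for a real dictionary, and the function names and example domains are invented; this is not the method used by the referenced site.

```python
import math
from collections import Counter

# Tiny stand-in word list; a real system would load a full dictionary.
WORDS = {"mail", "book", "face", "shop"}


def char_entropy(text: str) -> float:
    """Shannon entropy of the characters in text, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(text).values())


def ngrams(text: str, n: int = 2) -> list:
    """All overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def dga_features(domain: str) -> dict:
    """Features for the left-most label of a domain (hypothetical layout)."""
    name = domain.split(".")[0].lower()
    return {
        "bigrams": ngrams(name, 2),
        "entropy": char_entropy(name),
        "has_dictionary_word": any(word in name for word in WORDS),
    }


print(dga_features("facebook.com"))
print(dga_features("xkqzvbnw.com"))
```

In practice the n-grams would be turned into counts or frequencies before being fed to a classifier.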
ADVANCED TOPICS
• NEURAL NETWORKS
• DEEP NEURAL NETWORKS
PYTHON LIBRARIES
• scikit-learn
• NumPy
• pandas