The Road to Uncovering Botnets
From Python Scikit-Learn
to Scala Spark
whoami
• Avi Aminov
– ~2 years Security Researcher at Akamai
– Physics PhD student
• Asaf Nadler
– ~1.5 years Security Researcher at Akamai
– CS PhD student
Enterprise Threat Protection
• Detect malware presence from outbound traffic
– Behavioral pattern analysis
– Domain blacklisting
• Availability – End of June ’17
Akamai
Recursive
DNS
Branch / HQ
Enterprise
DNS
Data
• Akamai Data
– 20-30% of internet traffic
– Customer ISP/Enterprise logs – 20B DNS queries/day
• Third party data
– e.g. Authoritative DNS log lines
• Open data sources
– e.g. WHOIS information
Bot Networks – IP Fluxing
• Goal – Evasion
– Regular bots: waiting for orders
– Proxies: concealing origin server
Command
& Control
server
Bots
Proxy Bots
Bot Networks Detection
• Detect illegitimate IP fluxing
• Features
– IP dispersity (Geo, systems)
– TTL features
– Lexical
Domain Description #Systems #Countries
astro-travels.net PoS CNC Host 157 11
Decision Tree Model
Malicious with high confidence
• Spread across systems
• Unpopular
Benign with high confidence
• IPs in the same system
• Contains meaningful words
Challenge – Going to Production
Feature
Extraction
Scoring Blacklist
Feature
Extraction
Model
Training Model
Model
Evaluation
Data
Sources
What have we done so far?
• Flow
– Researcher describes an algorithm (document + Hive query)
– Dev rewrites the code in MapReduce (now Scala/Spark)
• Problems
– Not applicable to ML pipelines
– Prone to mistakes
– Longer development cycle
Can We Do Better? Option #1
• Research side – Pipeline in Scala/Spark
• Dev side – Implement the algorithms
• Pros
– Greater flexibility
– Research scale
• Cons
– Learning curve
– Lose sklearn/R benefits
Can We Do Better? Option #2
• Research side – Train locally and export model
• Dev side – Transform data using imported model
• Pros
– Quick implementation
– Unified procedure
• Cons
– No support for all models
Export scheme
• Predictive Model Markup Language
• General scheme for ML pipelines
– Data transformations
– Scoring models
• XML format – Readable
• Supported by major data science / ML
frameworks using jPMML (R, sklearn)
PMML Simple Boilerplate
Python (Research side) Scala (Dev side)
Credit: jpmml lib http://openscoring.io/ , https://github.com/jpmml/
Maintained by Villu Ruusmann
Lessons Learned
• Work process adjusted to the task
– Training locally? Export the model
– Training on larger scales? Better to use Spark
• Use jpmml for model export
• When applicable, reduce workload in production
– Example – only look at domains with many IPs
Challenge solved
Feature
Extraction
Scoring Blacklist
Data
Collection
Model
Training Model
Model
Evaluation
Data
Sources PMML
Thank you!
@AviBachsh

From Python Scikit-learn to Scala Apache Spark—The Road to Uncovering Botnets with Avi Aminov

  • 1.
    The Road toUncovering Botnets From Python Scikit-Learn to Scala Spark
  • 2.
    whoami • Avi Aminov –~2 years Security Researcher at Akamai – Physics PhD student • Asaf Nadler – ~1.5 years Security Researcher at Akamai – CS PhD student
  • 3.
    Enterprise Threat Protection •Detect malware presence from outbound traffic – Behavioral pattern analysis – Domain blacklisting • Availability – End of June ’17 Akamai Recursive DNS Branch / HQ Enterprise DNS
  • 4.
    Data • Akamai Data –20-30% of internet traffic – Customer ISP/Enterprise logs – 20B DNS queries/day • Third party data – e.g. Authoritative DNS log lines • Open data sources – e.g. WHOIS information
  • 5.
    Bot Networks –IP Fluxing • Goal – Evasion – Regular bots: waiting for orders – Proxies: concealing origin server Command & Control server Bots Proxy Bots
  • 6.
    Bot Networks Detection •Detect illegitimate IP fluxing • Features – IP dispersity (Geo, systems) – TTL features – Lexical Domain Description #Systems #Countries astro-travels.net PoS CNC Host 157 11
  • 7.
    Decision Tree Model Maliciouswith high confidence • Spread across systems • Unpopular Benign with high confidence • IPs in the same system • Contains meaningful words
  • 8.
    Challenge – Goingto Production Feature Extraction Scoring Blacklist Feature Extraction Model Training Model Model Evaluation Data Sources
  • 9.
    What have wedone so far? • Flow – Researcher describes an algorithm (document + Hive query) – Dev rewrites the code in MapReduce (now Scala/Spark) • Problems – Not applicable to ML pipelines – Prone to mistakes – Longer development cycle
  • 10.
    Can We DoBetter? Option #1 • Research side – Pipeline in Scala/Spark • Dev side – Implement the algorithms • Pros – Greater flexibility – Research scale • Cons – Learning curve – Lose sklearn/R benefits
  • 11.
    Can We DoBetter? Option #2 • Research side – Train locally and export model • Dev side – Transform data using imported model • Pros – Quick implementation – Unified procedure • Cons – No support for all models
  • 12.
    Export scheme • PredictiveModel Markup Language • General scheme for ML pipelines – Data transformations – Scoring models • XML format – Readable • Supported by major data science / ML frameworks using jPMML (R, sklearn)
  • 13.
    PMML Simple Boilerplate Python(Research side) Scala (Dev side) Credit: jpmml lib http://openscoring.io/ , https://github.com/jpmml/ Maintained by Villu Ruusmann
  • 14.
    Lessons Learned • Workprocess adjusted to the task – Training locally? Export the model – Training on larger scales? Better to use Spark • Use jpmml for model export • When applicable, reduce workload in production – Example – only look at domains with many IPs
  • 15.
  • 16.