Supervised Machine Learning Approaches for
Log-Based Anomaly Detection: A Case Study on the
Spirit Dataset
Bekkouche Mohammed, Meski Melissa, Khodja Yousra, Benslimane Sidi Mohammed
LabRI-SBA Laboratory, École Supérieure en Informatique, Sidi Bel
Abbes 22000, Algeria
TACC 2025
November 20-22, 2025
Structure
1 Introduction
2 Related Work
3 Dataset: Spirit
4 Methodology
5 Results and Analysis
6 Conclusion
Introduction
Anomaly detection is essential for building secure and reliable
computer systems.
Increasing system complexity ⇒ higher risk of bugs and vulnerabilities.
Failures may lead to user dissatisfaction or substantial financial losses.
Logs are valuable resources:
Capture system events and states during runtime.
Provide insights into operational behavior.
Enable automated anomaly detection.
Manual log inspection is impractical ⇒ automated, ML-based
approaches are required.
Introduction
Machine learning for log anomaly detection:
Unsupervised methods: work without labels, but commonly prone to
false positives.
Supervised methods: higher accuracy when labeled data is available.
Spirit dataset:
Real system logs with ground-truth anomaly labels.
Underexplored in supervised learning studies.
Our contribution: Evaluation of four supervised models (SVM, DT,
RF, XGBoost) on five dataset versions with TF-IDF and Word2Vec.
Important findings:
Shorter windows improve detection.
Tree-based models perform best with TF-IDF; SVM with Word2Vec.
Related Work
Machine learning methods for log anomaly detection are commonly
divided into:
Supervised methods:
Require labeled training data.
Typically use classification algorithms (Decision Trees, SVMs, Random
Forests, XGBoost).
Recent advances include CNN-based classifiers and Transformer-based
approaches.
Unsupervised methods:
Do not require labeled data.
Detect anomalies as rare or unusual patterns in feature space.
Include PCA, Isolation Forest, Autoencoders, LSTMs (e.g., DeepLog,
LogAnomaly), and Transformer-based models.
Related Work
Semi-supervised approaches:
Use a small amount of labeled data to enhance anomaly detection.
Strategies include semi-supervised classifiers, integrating known
anomalies, or pseudo-labeling.
Datasets:
HDFS and Thunderbird are widely studied.
The Spirit dataset is significantly less explored, especially with
supervised learning.
Our focus:
Provide a detailed study of the Spirit dataset using supervised machine
learning.
Evaluate the effectiveness of traditional models in detecting abnormal
log sequences.
Dataset: Spirit
Subset statistics: 5,000,000 log messages, 2,880 log events (templates).

Grouping     # Log sequences   Avg. seq. length   Training (80%): # seq. (# anomalous)   Testing (20%): # seq. (# anomalous)
1 hour       1,173             4,262.57           937 (761, 81.22%)                      236 (191, 80.93%)
30 minutes   2,345             2,132.20           1,875 (1,437, 76.64%)                  470 (360, 76.60%)
15 minutes   4,690             1,066.10           3,751 (2,747, 73.23%)                  939 (687, 73.16%)
5 minutes    14,068            355.42             11,253 (7,901, 70.21%)                 2,815 (1,976, 70.20%)
1 minute     70,327            71.10              56,261 (38,486, 68.41%)                14,066 (9,622, 68.41%)
The Spirit supercomputing system at Sandia National Laboratories (1,028
processors, 1,024 GB memory).
Original dataset: over 172 million log messages, each labeled as normal or
anomalous.
Subset used:
1 GB of continuous log lines (first 5M entries).
764,891 anomalies ⇒ anomaly ratio ≈ 15.3%.
Event templates: 2,880.
Sequence generation: fixed time windows (1m, 5m, 15m, 30m, 1h).
Shorter windows ⇒ more sequences, shorter length.
Labeling rule: sequence is anomalous if it contains ≥ 1 anomalous message.
Dataset split: 80% training, 20% testing (uniform distribution of
normal/anomalous).
Methodology
Log-Based Anomaly Detection System
1. Log Parsing: extract structured events from raw logs (e.g., using Drain or Spell parsers).
2. Feature Extraction / Engineering: transform parsed logs into numerical vectors (e.g., TF-IDF, Word2Vec embeddings, frequency counts).
3. Model Training: train supervised learning models using labeled data (e.g., SVM, Decision Tree, Random Forest).
4. Anomaly Detection: apply the trained model to detect abnormal patterns (e.g., binary classification: normal vs. anomaly).
Figure: Process of log-based anomaly detection using supervised machine learning
models
Methodology
Log-Based Anomaly Detection System
Log parsing:
Transforms unstructured log messages into a structured format.
Produces an event template (constant part) + parameters (variable
parts).
Example:
Raw: sendmail[17795]: j0170NVv017795: from=root,
size=117, class=0, nrcpts=1,
msgid=<200501010700.j0170NVv017795@sn209>,
relay=#2#@localhost
Template: <*> <*> from=root, <*> class=0, nrcpts=1, <*>
relay=#2#@localhost
In our work: a pre-parsed version of the Spirit dataset is used, with
fields such as:
Timestamp, Event ID, Event Template, Anomaly Label
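To make the masking idea concrete, here is a toy Python sketch (an illustration only: real parsers such as Drain cluster messages rather than applying fixed regex rules, so the resulting template differs from the dataset's actual templates):

```python
import re

def mask_message(msg: str) -> str:
    """Toy template extraction: replace variable-looking tokens with <*>."""
    msg = re.sub(r"msgid=<[^>]+>", "<*>", msg)    # message identifiers
    msg = re.sub(r"\b\w*\d[\w.]*\b", "<*>", msg)  # any token containing a digit
    return msg

raw = ("sendmail[17795]: j0170NVv017795: from=root, size=117, class=0, nrcpts=1, "
       "msgid=<200501010700.j0170NVv017795@sn209>, relay=#2#@localhost")
print(mask_message(raw))
# sendmail[<*>]: <*>: from=root, size=<*>, class=<*>, nrcpts=<*>, <*>, relay=#<*>#@localhost
```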
Methodology
Log-Based Anomaly Detection System
Feature Engineering:
Goal: transform log entries into numerical representations for
machine learning.
Logs are grouped into sequences using timestamps:
Each sequence captures a snapshot of system behavior.
This step is crucial: quality of features strongly impacts anomaly
detection performance.
In the Spirit dataset:
No explicit identifiers (e.g., session IDs).
Grouping performed using timestamp-based strategies.
Strategies:
Fixed window partitioning.
Sliding window partitioning (defined by window size + step size).
Our work: adopt the fixed time window approach with
non-overlapping intervals.
Methodology
Log-Based Anomaly Detection System
Feature Engineering:
Algorithm 1: Time Window Log Grouping
Input: log_dataset — list of log entries with timestamp, event ID, event template, and anomaly label (True = anomalous); time_window — fixed time interval (e.g., 15 minutes)
Output: log_sequences — list of log sequences grouped by time window
1. Sort log_dataset by timestamp in ascending order; log_sequences ← [ ]
2. current_sequence ← [ ]; current_label ← False   // False means the sequence is normal
3. window_start ← timestamp of first log entry; window_end ← window_start + time_window
4. foreach log_entry in log_dataset do
5.     if log_entry.timestamp ≥ window_end then   // close the current window, open a new one
6.         Append (current_sequence, current_label) to log_sequences
7.         current_sequence ← [ ]; current_label ← False
8.         window_start ← log_entry.timestamp; window_end ← window_start + time_window
9.     Append log_entry.event_ID (or event_template) to current_sequence
10.    current_label ← current_label ∨ log_entry.anomaly_label   // anomalous if ≥ 1 anomalous entry
11. if current_sequence is not empty then append (current_sequence, current_label) to log_sequences
12. return log_sequences
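A minimal Python sketch of this grouping procedure is shown below. It assumes each parsed entry is a dict with 'timestamp' (datetime), 'event_id', and 'is_anomalous' fields; these names are illustrative, not the dataset's actual column names.

```python
from datetime import timedelta

def group_by_time_window(log_entries, window):
    """Group chronologically ordered log entries into fixed, non-overlapping
    time windows; a sequence is labeled anomalous if it contains at least
    one anomalous entry."""
    entries = sorted(log_entries, key=lambda e: e["timestamp"])
    sequences = []
    if not entries:
        return sequences

    current = {"events": [], "is_anomalous": False}
    window_end = entries[0]["timestamp"] + window

    for entry in entries:
        if entry["timestamp"] >= window_end:
            # Close the current window and open a new one anchored at this entry.
            sequences.append(current)
            current = {"events": [], "is_anomalous": False}
            window_end = entry["timestamp"] + window
        current["events"].append(entry["event_id"])
        current["is_anomalous"] = current["is_anomalous"] or entry["is_anomalous"]

    sequences.append(current)  # the last window is never empty here
    return sequences

# Example: the 15-minute version of the dataset.
# sequences = group_by_time_window(parsed_logs, timedelta(minutes=15))
```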
Methodology
Log-Based Anomaly Detection System
Feature Engineering:
Each log sequence can be represented using:
Event Templates
Event IDs (one ID per template, 2,880 in Spirit dataset)
The representation choice depends on the feature extraction
technique.
Techniques used in this study:
TF-IDF: sequences represented with event IDs.
Word2Vec: sequences represented with event templates.
Methodology
Log-Based Anomaly Detection System
Feature Engineering: TF-IDF
Log sequences must be converted into numerical vectors to be used
with ML models.
TF-IDF measures the importance of an event ID within a sequence relative to its occurrence across all sequences.
Each sequence is transformed into a sparse vector of size 2,880 (the number of unique event IDs).
Formula:
TF-IDF(e_i, s_j) = TF(e_i, s_j) × log(N / n_i)
N: total number of log sequences
n_i: number of sequences containing e_i
TF(e_i, s_j): frequency of e_i in s_j
Output: a 2,880-dimensional sparse vector per sequence.
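As an illustration, a sequence-level TF-IDF matrix can be built with scikit-learn (a sketch, not the authors' exact pipeline; note that TfidfVectorizer applies a smoothed IDF by default, a slight variant of the formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each sequence becomes a "document" of space-separated event IDs,
# reusing the sequences produced by the grouping sketch above.
docs = [" ".join(str(e) for e in seq["events"]) for seq in sequences]
y = [int(seq["is_anomalous"]) for seq in sequences]   # 1 = anomalous, 0 = normal

# One dimension per distinct event ID (at most 2,880 for this Spirit subset).
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X_tfidf = vectorizer.fit_transform(docs)   # sparse matrix: n_sequences x n_event_ids
```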
Methodology
Log-Based Anomaly Detection System
Feature Engineering: Word2Vec
Word2Vec learns dense vector embeddings for event templates based
on their context in sequences.
Trained on all log sequences, treating each event template as a
“word”.
Learns semantic similarities: events appearing in similar contexts get
similar vectors.
Representation:
Each template → 500-dimensional embedding.
Sequence → average of its templates’ embeddings.
Output: a 500-dimensional dense vector encoding semantic
patterns.
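A possible implementation with gensim is sketched below; hyperparameters other than the 500-dimensional embedding size are assumptions, and 'templates' is an illustrative field name holding each sequence's list of event templates.

```python
import numpy as np
from gensim.models import Word2Vec

corpus = [seq["templates"] for seq in sequences]   # one list of templates per sequence

# Treat each event template as a single "word" and learn 500-dimensional embeddings.
w2v = Word2Vec(sentences=corpus, vector_size=500, window=5, min_count=1, workers=4)

def sequence_vector(templates, model):
    """Average the embeddings of a sequence's templates (zero vector if none are known)."""
    vecs = [model.wv[t] for t in templates if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_w2v = np.vstack([sequence_vector(t, w2v) for t in corpus])   # dense: n_sequences x 500
```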
Methodology
Log-Based Anomaly Detection System
Classifiers
Based on the labeled feature vectors, supervised models are trained to
distinguish between normal and anomalous log sequences.
Models used:
SVM – boundary-based, finds optimal hyperplane, effective on dense
data (e.g., Word2Vec).
DT – rule-based, interpretable, works well on sparse data (e.g.,
TF-IDF).
RF – ensemble of decision trees, improves generalization and
robustness.
XGBoost – gradient-boosted decision trees, sequential correction of
errors, efficient and scalable.
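For illustration, the four classifiers can be instantiated as follows (default or lightly chosen hyperparameters; the slides do not report the exact settings used):

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "SVM": SVC(kernel="rbf"),                       # boundary-based, suits dense Word2Vec vectors
    "DT":  DecisionTreeClassifier(random_state=0),  # interpretable rules, handles sparse TF-IDF
    "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
    "XB":  XGBClassifier(n_estimators=100, eval_metric="logloss"),
}
```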
Methodology
Log-Based Anomaly Detection System
Training and Evaluation
Datasets (Spirit logs) are split into:
80% training set
20% test set
A uniform (stratified) split is applied so that the proportion of anomalies
remains consistent across both sets.
Models are trained on training data and evaluated on test data.
Evaluation metrics:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = Harmonic mean of Precision and Recall
Performance is compared across:
Feature extraction methods (TF-IDF vs. Word2Vec).
Different dataset versions (various time window sizes).
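Putting the pieces together, a stratified split and per-model evaluation might look like the sketch below (continuing from the sketches above; X_tfidf, y, and models are assumed to be defined there):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Stratified 80/20 split keeps the anomaly ratio consistent across both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, stratify=y, random_state=42)

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: P={precision_score(y_test, y_pred):.3f} "
          f"R={recall_score(y_test, y_pred):.3f} "
          f"F1={f1_score(y_test, y_pred):.3f}")
```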
Results and Analysis
Table: Precision, Recall, and F1-Score for the SVM, DT, RF, and XGBoost (XB) supervised
learning approaches in detecting anomalies across different versions of the Spirit
log dataset, using TF-IDF and Word2Vec for feature extraction. These versions
are generated using a fixed-size window grouping strategy with varying time
window lengths.
All values are reported as TF-IDF / Word2Vec.

Model  Window       Precision       Recall          F1-Score
SVM    1 hour       0.724 / 0.699   0.353 / 0.356   0.472 / 0.472
SVM    30 minutes   0.742 / 0.721   0.587 / 0.515   0.655 / 0.601
SVM    15 minutes   0.743 / 0.753   0.742 / 0.697   0.742 / 0.724
SVM    5 minutes    0.746 / 0.754   0.861 / 0.862   0.800 / 0.804
SVM    1 minute     0.782 / 0.986   0.959 / 0.948   0.860 / 0.966
DT     1 hour       0.841 / 0.716   0.227 / 0.390   0.334 / 0.505
DT     30 minutes   0.894 / 0.713   0.470 / 0.509   0.605 / 0.594
DT     15 minutes   1.000 / 0.708   0.622 / 0.612   0.767 / 0.656
DT     5 minutes    1.000 / 0.751   0.820 / 0.849   0.901 / 0.797
DT     1 minute     1.000 / 0.948   0.947 / 0.935   0.973 / 0.941
RF     1 hour       0.733 / 0.714   0.387 / 0.393   0.507 / 0.507
RF     30 minutes   0.782 / 0.736   0.578 / 0.572   0.665 / 0.644
RF     15 minutes   0.851 / 0.736   0.707 / 0.706   0.773 / 0.721
RF     5 minutes    0.843 / 0.751   0.870 / 0.857   0.857 / 0.801
RF     1 minute     0.851 / 0.961   0.965 / 0.934   0.904 / 0.947
XB     1 hour       0.691 / 0.678   0.340 / 0.309   0.456 / 0.424
XB     30 minutes   0.736 / 0.746   0.550 / 0.581   0.630 / 0.653
XB     15 minutes   1.000 / 0.744   0.620 / 0.741   0.765 / 0.743
XB     5 minutes    1.000 / 0.755   0.820 / 0.864   0.901 / 0.806
XB     1 minute     1.000 / 0.858   0.947 / 0.948   0.973 / 0.901
Results and Analysis
Comparison of SVM, DT, RF, and XB.
Evaluation across:
Five fixed time windows: 1h, 30m, 15m, 5m, 1m
Two feature extraction methods: TF-IDF, Word2Vec
Metrics used: Precision, Recall, F1-Score.
Important Observation:
Recall and F1-Score improve with shorter time windows.
Smaller windows → more sequences with fewer events.
Anomalies become easier to isolate and distinguish.
Results and Analysis
[Line plot: F1-Score (y-axis, 0.4–1.0) versus time window (x-axis: 1h, 30m, 15m, 5m, 1m), one curve per model/feature combination: SVM, DT, RF, and XB, each with TF-IDF and Word2Vec.]
Figure: F1-Score evolution of SVM, DT, RF, and XB across different time
windows using TF-IDF and Word2Vec.
Results and Analysis
SVM
Moderate overall performance.
Improves with shorter windows and Word2Vec.
Best F1-Score: 0.966 (1 min, Word2Vec).
Weak at longer windows (0.472 at 1h).
Decision Tree (DT)
Excels at short windows with TF-IDF.
Perfect precision (1.000) at 15m, 5m, and 1m (TF-IDF).
Best F1-Score: 0.973 (1 min, TF-IDF).
Struggles with recall at long windows (0.227 at 1h).
Results and Analysis
Random Forest (RF)
Strong and stable across all settings.
Best F1-Score: 0.947 (1 min, Word2Vec).
Ensemble nature improves robustness.
XGBoost (XB)
Competitive, especially with TF-IDF.
Best F1-Score: 0.973 (1 min, TF-IDF).
Slightly below RF on Word2Vec at 1m (0.901 vs 0.947).
General Insights
TF-IDF favors tree-based models (DT, RF, XB).
Word2Vec benefits SVM.
Shorter windows consistently improve performance.
Conclusion
Comparative evaluation on the Spirit dataset:
Models: SVM, Decision Tree (DT), Random Forest (RF), XGBoost
(XB).
Feature extraction: TF-IDF and Word2Vec.
Log segmentation with time-based fixed windows.
Experimental Results:
Shorter windows ⇒ improved detection accuracy.
Best performance at 1-minute window.
DT and XB (TF-IDF): F1-Score = 0.973.
SVM (Word2Vec): F1-Score = 0.966.
Conclusion
Important Findings:
Tree-based supervised models + structured features are highly
effective.
Adjusting log grouping parameters greatly enhances anomaly
detection.
Future Work:
Explore ensemble techniques (voting among RF, DT, XB).
Investigate hybrid features (TF-IDF + Word2Vec).
Apply Explainable AI to interpret model decisions.
Conclusion
Thank You
Questions?