An Introduction to
Anomaly Detection
Ken Graham
What we’ll cover
•What is Anomaly Detection?
•What’s an anomaly?
•Detecting Anomalies
•Methods and Applications
What is Anomaly Detection?
• Trying to find patterns in data that are different from what is expected. 
• Some applications: 
◦credit card fraud
◦insurance fraud
◦intrusion detection (cybersecurity)
◦insider threats
◦image processing
◦text analysis
◦sensor networks
◦industrial damage
Detecting Anomalies
So, how would we detect some of these? Let’s take a
naive approach.
1. Define a “normal” region. 
2. Observations not in the “normal” region are
anomalies. 
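A minimal sketch of this naive approach for one-dimensional data; the synthetic dataset and the 3-standard-deviation boundary are illustrative choices, not part of the original slides.

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 1000), [8.0, -7.5]])  # mostly "normal" points plus two outliers

# Step 1: define the "normal" region from the data itself (3 standard deviations around the mean).
lower, upper = data.mean() - 3 * data.std(), data.mean() + 3 * data.std()

# Step 2: anything outside the region is flagged as an anomaly.
print(data[(data < lower) | (data > upper)])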
Will this work? 
• Boundary hard to define
• Definitions change over time
• Definitions are domain-dependent
• Labeled training data is hard to find
• Training data is often heavily imbalanced
Types of Data
• Collection of data instances
• a data instance has a set of attributes
• Attributes can be of different types
• binary
• categorical
• continuous
• The attributes help determine the detection
method.
• The relationship between data instances is
important.
• Most existing anomaly detection techniques don’t
assume any particular relationship between the
data instances. We have to identify relationships.
Types of input data
• Sequential
• time-series, sequences of symbols
• Spatial
• each data instance is related to its neighbors
• images, vehicular traffic
• Graph
• data instances are nodes in a graph or network
Three Types of Anomalies
• 😃 There are only three. 
• 😔 No, that doesn’t make it any easier to detect
them.
• Point anomaly
• Contextual anomaly
• Collective anomaly
Point Anomaly
• Generally a single data instance. 
• Anomalous compared to the entirety of the data
• Most research focuses on point anomalies
• Can occur in any dataset
Contextual Anomaly
• Anomalous in relation to a specific context
• Context comes from how data is structured
• Context has to be specified as a part of the problem
formulation
• Each data instance can be defined using two sets of
attributes:
• contextual: determines the context (e.g. lat/long or time)
• behavioral: non-contextual characteristics of an instance
• Anomalous behavior is determined by the
behavioral attributes within a specific context
• A data instance might be a contextual anomaly in a
given context, but a data instance with identical
behavioral attributes could be considered normal in
a different context. 
• Contextual anomalies are generally found in time-series data. Example:
• Avg monthly temp. of an area over last few years.
• 35 degrees F in winter might be normal
• 35 degrees F in summer in same place is
anomalous
• Another example: Credit card fraud
• Contextual attribute: time of purchase. 
• $100 average weekly shopping bill, except during
the Christmas week, when it reaches $1000. 
• A new purchase of $1000 in July would be
considered a contextual anomaly, since it’s
unusual for July. 
• The same amount spent during Christmas week
will be considered normal.
Collective Anomaly
• A group of data instances are anomalous
• They need not be anomalies by themselves
• Again, the relationship between the data matters
• A point or collective anomaly problem, combined with context information, becomes a contextual anomaly problem
Three Types of
Anomaly Detection Methods
• Supervised
• Use labeled training data to build a predictive model
• Imbalanced data (many normal, few anomalies)
• Semi-Supervised
• Only need normal data
• Model learns what normal data looks like; deviations are flagged (see the sketch after this list)
• Unsupervised (no labeled data)
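As an illustration of the semi-supervised setting above, here is a minimal sketch using scikit-learn's OneClassSVM trained on normal data only; the synthetic data and the nu/kernel parameters are illustrative.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal_train = rng.normal(0, 1, size=(500, 2))     # labeled "normal" data only
new_points = np.array([[0.1, -0.2], [4.0, 4.0]])   # one normal-looking point, one far-away point

# Learn a boundary around the normal data; nu bounds the fraction of training
# points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal_train)
print(model.predict(new_points))  # +1 = normal, -1 = anomaly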
Applications
Credit Card Fraud
Data used
• user ID
• amount spent
• time between consecutive card usage
Credit card companies have complete, labeled data and 
user profiles
Kinds of anomalies 
• point anomalies in transaction records
◦high payments
◦items never before purchased by the user
◦high rate of purchase
• contextual anomalies
◦User defines the context
▪ Each credit card user is profiled based on card usage
history. 
▪ Each new transaction compared to user profile,
flagged if it doesn’t match
◦Location defines the context
▪ Detects anomalies among transactions at a specific
geographic location. 
Cellphone Fraud
Data used 
• Call data records (CDRs)
• CDR = vector of features
◦continuous (e.g., CALL-DURATION)
◦discrete (e.g., CALLING-CITY). 
Kinds of anomalies
• point anomalies from aggregated CDR data
◦aggregated by time, user, or area
◦high volume of calls
◦calls made to unlikely destinations
Insider Trading
Data used
• Option trading data
• Stock trading data
• News
• Data is time-series or otherwise temporally sequenced.
Medical
• Patient records
◦Electronic Health Records (EHRs)
▪ demographics, medical history, medication and allergies, immunization status, laboratory test results, radiology images, vital signs, personal statistics like age and weight, and billing information
◦Electrocardiograms (ECG) and Electroencephalograms (EEG)
• Temporal and/or spatial data 
Types of anomalies
• point anomalies
◦e.g., abnormal patient condition, instrumentation errors, recording errors
• contextual
◦Disease outbreaks can be contextual anomalies (e.g. geo-temporal pattern of viral infections)
• collective
• False negatives can cost $$$ and lives
• A colleague (David Gilmore) said: 
• "Precision saves money, recall saves lives."
Methods
Classification
• Train a model from labeled data (supervised)
• Use the model to classify other data
• Many different ways to do this
◦SVMs, PGMs, Rules
◦Neural nets have shown much promise
▪ LSTMs learn features across a sequence
▪ Autoencoders reconstruct the data; the reconstruction error tells you whether the data is anomalous
Recurrent Neural Nets and
LSTMs
Now we’ll look at a method or two for time-series data.
• Method needs to learn patterns present in the sequence
• Sequences can have patterns of unknown length
• Recurrent neural networks (RNNs)[1][2] let you address
sequences of data
• Detect deviations from normalcy
• Steps
◦Train the NN to predict several time steps into the future 
◦Each point in the sequence has several corresponding
predicted values made at different points in the past,
resulting in multiple error values. 
◦Compute error distribution
• More generally, to detect anomalies in a time series
◦Anomalous if prediction error is larger than expected
◦Can pick an error threshold, e.g. 2 std. dev. from the mean
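A minimal sketch of the thresholding step. A trivial "predict the previous value" rule stands in for the trained RNN/LSTM; the synthetic series, the injected anomaly, and the 2-standard-deviation threshold are all illustrative.

import numpy as np

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.05, 500)
series[300] += 2.0  # inject a point that breaks the learned pattern

# Placeholder predictor: in practice these would be the trained network's forecasts.
predictions = np.roll(series, 1)      # "predict" each value as the previous one
errors = np.abs(series[1:] - predictions[1:])

# Model the error distribution and flag errors more than 2 std. dev. above the mean.
threshold = errors.mean() + 2 * errors.std()
print(np.where(errors > threshold)[0] + 1)  # indices with unusually large error (the injected point and the step after it)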
Autoencoders for Anomaly Detection
• Train the autoencoder.
• If the data is sequential, you can incorporate RNNs
or LSTMs.
• Use the model to reconstruct the input.
• If the reconstruction error is above some threshold,
label it as an anomaly
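A minimal sketch of the reconstruction-error idea with a small dense autoencoder in PyTorch; the architecture, training length, synthetic data, and the percentile threshold are all illustrative choices.

import numpy as np
import torch
from torch import nn

rng = np.random.default_rng(0)
normal = torch.tensor(rng.normal(0, 1, size=(1000, 8)), dtype=torch.float32)

# Bottleneck network trained to reproduce its input.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2),
                      nn.ReLU(), nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 8))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(normal), normal)
    loss.backward()
    opt.step()

# Reconstruction error as the anomaly score; threshold taken from the training errors.
with torch.no_grad():
    train_err = ((model(normal) - normal) ** 2).mean(dim=1)
    threshold = torch.quantile(train_err, 0.99)
    test = torch.tensor([[0.1] * 8, [6.0] * 8], dtype=torch.float32)
    test_err = ((model(test) - test) ** 2).mean(dim=1)
print(test_err > threshold)  # True marks an anomaly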
Nearest-Neighbor Methods
Assumption 
• Normal data are close together, while anomalies are far away
Two Methods
1. Anomaly score is distance to kth nearest neighbor.
2. Anomaly score is based on the density of each point's neighborhood (lower density => more anomalous)
• Distance metric affects computational complexity
• Easy to adapt to different problem domains: just define a suitable distance metric (see the sketch below)
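A minimal sketch of method 1 (distance to the k-th nearest neighbor as the anomaly score) using scikit-learn; k and the synthetic data are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, size=(300, 2)), [[6.0, 6.0]]])  # a cluster plus one far-away point

k = 5
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(data)  # +1 because each point is its own nearest neighbor
distances, _ = nbrs.kneighbors(data)
scores = distances[:, k]                              # distance to the k-th true neighbor

print(np.argsort(scores)[-3:])  # indices of the three highest-scoring (most anomalous) points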
Statistical Methods
• Assumption
• Normal data lies in high probability regions,
anomalies in low probability regions
• Parametric and non-parametric methods
Parametric
• Assumes normal data is distributed according to a parametric
distribution
• Anomaly score is inverse of the PDF 
• Or, use a hypothesis test. Anomaly score can be test statistic
Examples: 
• Gaussian models => maximum likelihood estimation (MLE), Grubbs' test and variants (see the sketch after this list)
• Regression models => ARIMA, ARMA
• mixtures of models
◦Assume each data point has prob. p of being an anomaly
◦N = PDF of normal data
◦A = PDF of anomalies (assume to be uniform)
◦D = PDF of all the data = pA + (1-p)N
◦Start with all points assigned to N
◦Anomaly score comes from how much the data likelihood changes when a point is moved from N to A.
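A minimal sketch of the Gaussian case from the list above: fit the mean and standard deviation by MLE, then score by the negative log-density; the synthetic data and the quantile threshold are illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 2, 1000), [25.0]])  # normal data plus one outlier

# MLE for a Gaussian is just the sample mean and (population) standard deviation.
mu, sigma = data.mean(), data.std()

# Low density (high negative log-likelihood) => high anomaly score.
scores = -norm.logpdf(data, loc=mu, scale=sigma)
threshold = np.quantile(scores, 0.995)
print(np.where(scores > threshold)[0])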
Non-parametric
• Histogram models
◦Does the test instance fall into an existing bin?
◦Or, derive a score from the (relative) height of the bin in which it lands (see the sketch below)
• Kernel methods estimate the data PDF and are similar to
parametric methods 
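A minimal sketch of the histogram idea: build bins from training data, then score a test instance by the relative height of the bin it falls into. The bin count, data, and the helper histogram_score are illustrative, not a standard API.

import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 5000)

counts, edges = np.histogram(train, bins=30)
freqs = counts / counts.sum()

def histogram_score(x):
    # Anomaly score = 1 - relative frequency of the bin x falls into (1.0 if outside all bins).
    i = np.searchsorted(edges, x, side="right") - 1
    if i < 0 or i >= len(freqs):
        return 1.0
    return 1.0 - freqs[i]

print(histogram_score(0.1), histogram_score(6.0))  # low score vs. maximal score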
Spectral Methods
Assumption
• "Data can be embedded into a lower dimensional subspace
in which normal instances and anomalies appear significantly
different.” - Anomaly Detection: A Survey
Main idea: 
Find a subspace where the anomalies are easy to see and
project data onto it.
Methods 
• Unsupervised or semi-supervised
• PCA
◦Project data onto the low-variance principal components; anomalous instances will have large projections there (see the sketch after this list)
◦For graphs: apply PCA to the adjacency matrix at different points in time; changes in the principal components indicate an anomalous graph
• Reconstruction errors from a Compact Matrix Decomposition (CMD) of a graph's adjacency matrix can likewise indicate an anomalous graph
• PCA can be expensive
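A minimal sketch of the PCA idea: keep the high-variance components and score each point by its reconstruction error, i.e., the energy left in the discarded low-variance directions. The synthetic data (normal points near a 2-D subspace of a 5-D space) and the number of components are illustrative.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(0, 1, size=(500, 2))
data = latent @ rng.normal(0, 1, size=(2, 5)) + rng.normal(0, 0.05, size=(500, 5))
data = np.vstack([data, rng.normal(0, 3, size=(1, 5))])  # append one point that violates the structure

pca = PCA(n_components=2).fit(data)
reconstructed = pca.inverse_transform(pca.transform(data))
scores = ((data - reconstructed) ** 2).sum(axis=1)  # residual energy in the low-variance directions

print(np.argmax(scores))  # index of the highest-scoring point (most likely the appended off-subspace row)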
Contextual Anomalies
Contextual attributes are key
• sequential: position in sequence is the context
◦time-series
◦event data (timestamped)
▪ inter-arrival time between events can be uneven
• spatial: location is the context
• graphs: the edges between data instances (the nodes) are the context
• profiles (user defines context, like for credit card fraud)
Contextual Methods
• Convert to a point anomaly problem (see the sketch after this list)
• 1. identify a context for a data instance
• 2. compute an anomaly score within that context using a point anomaly method
• Use the structure of the data when breaking data
into contexts is hard (time-series and sequences)
• time-series
◦regression, RNNs
• sequences
◦Use events occurring before a particular time to predict the
event occurring at that time. 
◦If the prediction doesn't match the actual event, it's labeled rare.
◦Use Finite State Automata (FSA) or Hidden Markov Models (HMMs) to compute conditional probabilities for events in the sequence based on previous events. 
◦Model event sequence as a Poisson process 
• graphs
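A minimal sketch of the two-step reduction above, in the spirit of the credit card example: the context is the week of the year, and the point-anomaly score is a per-context z-score. The synthetic spending data, the grouping, and the use of pandas are all illustrative.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
weeks = np.tile(np.arange(52), 11)                                 # 11 years of weekly spending totals
amounts = rng.normal(100, 10, weeks.size)
amounts[weeks == 51] = rng.normal(1000, 50, (weeks == 51).sum())   # Christmas week is normally high
df = pd.DataFrame({"week": weeks, "amount": amounts})

history, new = df.iloc[:520].copy(), df.iloc[520:].copy()          # 10 years of profile, 1 year to score
new.loc[new.index[30], "amount"] = 1000.0                          # a $1000 week in July of the new year

# Step 1: identify the context (week of year). Step 2: score within that context.
stats = history.groupby("week")["amount"].agg(["mean", "std"])
z = (new["amount"] - new["week"].map(stats["mean"])).abs() / new["week"].map(stats["std"])
print(new.assign(z=z).nlargest(3, "z"))  # the July spike dominates; Christmas weeks score low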
Collective Anomalies 
• Hardest to detect, because the anomaly lies in the collective behavior of a group of instances rather than in any single instance.
• Relationship between data points is important
◦Sequential => find an anomalous subsequence
▪ lots of research here b/c lots of time-series and
event sequence data in the wild
◦Spatial => find an anomalous subregion
▪ image/video processing
◦Graph => find an anomalous subgraph
◦The task is to find an anomalous subset
Detecting Collective
Sequential Anomalies
Reduce to point anomaly problem:
• transform subsequences into points and then use a point anomaly method (see the sketch below)
• FSA, Markov Models, HMMs, CRFs for symbols
Neural Nets would be powerful here
• RNNs + LSTMs + Autoencoders: Could use a sequence to
sequence model on the subsequences and compute
reconstruction error
• For every example we’ve looked at that used FSA or HMMs,
you could use neural nets instead
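A minimal sketch of the reduction: slide a fixed-length window over the series (NumPy's sliding_window_view) so each subsequence becomes a point, then apply any point-anomaly method, here distance to the k-th nearest window. The window length, k, the synthetic series, and the injected flat stretch are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 60, 1200)) + rng.normal(0, 0.05, 1200)
series[600:630] = 0.0   # a flat stretch: each value is plausible on its own, the subsequence is not

w = 30                  # window length
windows = np.lib.stride_tricks.sliding_window_view(series, w)

k = 5
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(windows)
dist, _ = nbrs.kneighbors(windows)
scores = dist[:, k]     # distance from each window to its k-th nearest other window

print(np.argmax(scores))  # start index of the top-scoring window; it overlaps the injected flat stretch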
Detecting Collective Spatial
Anomalies
• Most work here has been on images
• Anomaly detection in videos would likely be a combination of
techniques for spatial and sequential anomalies (collective or
otherwise). 
◦Video = sequence of images + an audio stream
• Convolutional neural networks (CNNs) have been used for
anomaly detection in images
◦Fully Convolutional Neural Network for Fast Anomaly Detection in Crowded Scenes (2016): https://arxiv.org/abs/1609.00866
Most important thing…
• Understand your problem before picking a method. 
• Just because a method is the most accurate doesn’t
automatically make it the best solution for your problem.
