An Introduction to
Anomaly Detection
Ken Graham
What we’ll cover
•What is Anomaly Detection?
•What’s an anomaly?
•Detecting Anomalies
•Methods and Applications
What is Anomaly Detection?
• Trying to find patterns in data that are different from what is expected. 
• Some applications: 
◦credit card fraud
◦insurance fraud
◦intrusion detection (cybersecurity)
◦insider threats
◦image processing
◦text analysis
◦sensor networks
◦industrial damage
Detecting Anomalies
So, how would we detect some of these? Let’s take a
naive approach.
1. Define a “normal” region. 
2. Observations not in the “normal” region are
anomalies. 
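A minimal sketch of this naive approach for one-dimensional data; the synthetic dataset and the 3-standard-deviation boundary are illustrative choices, not part of the original slides.

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 1000), [8.0, -7.5]])  # mostly "normal" points plus two outliers

# Step 1: define the "normal" region from the data itself (3 standard deviations around the mean).
lower, upper = data.mean() - 3 * data.std(), data.mean() + 3 * data.std()

# Step 2: anything outside the region is flagged as an anomaly.
print(data[(data < lower) | (data > upper)])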
Will this work? 
• Boundary hard to define
• Definitions change over time
• Definitions are domain-dependent
• Labeled training data is hard to find
• Training data is often heavily imbalanced
Types of Data
• Collection of data instances
• a data instance has a set of attributes
• Attributes can be of different types
• binary
• categorical
• continuous
• The attributes help determine the detection
method.
• The relationship between data instances is
important.
• Most existing anomaly detection techniques don’t
assume any particular relationship between the
data instances. We have to identify relationships.
Types of input data
• Sequential
• time-series, sequences of symbols
• Spatial
• each data instance is related to its neighbors
• images, vehicular traffic
• Graph
• data instances are nodes in a graph or network
Three Types of Anomalies
• 😃 There are only three. 
• 😔 No, that doesn’t make it any easier to detect
them.
• Point anomaly
• Contextual anomaly
• Collective anomaly
Point Anomaly
• Generally a single data instance. 
• Anomalous compared to the entirety of the data
• Most research focuses on point anomalies
• Can occur in any dataset
Contextual Anomaly
• Anomalous in relation to a specific context
• Context comes from how data is structured
• Context has to be specified as a part of the problem
formulation
• Each data instance can be defined using two sets of
attributes:
• contextual: determines the context (e.g. lat/long or time)
• behavioral: non-contextual characteristics of an instance
• Anomalous behavior is determined by the
behavioral attributes within a specific context
• A data instance might be a contextual anomaly in a
given context, but a data instance with identical
behavioral attributes could be considered normal in
a different context. 
• Contextual anomalies are generally found in time-series data. Example:
• Avg monthly temp. of an area over last few years.
• 35 degrees F in winter might be normal
• 35 degrees F in summer in same place is
anomalous
• Another example: Credit card fraud
• Contextual attribute: time of purchase. 
• $100 average weekly shopping bill, except during
the Christmas week, when it reaches $1000. 
• A new purchase of $1000 in July would be
considered a contextual anomaly, since it’s
unusual for July. 
• The same amount spent during Christmas week
will be considered normal.
Collective Anomaly
• A group of data instances are anomalous
• They need not be anomalies by themselves
• Again, the relationship between the data matters
• A point or collective anomaly problem, combined with context information, becomes a contextual anomaly problem
Three Types of
Anomaly Detection Methods
• Supervised
• Use labeled training data to build a predictive model
• Imbalanced data (many normal, few anomalies)
• Semi-Supervised
• Only need normal data
• Model learns what normal data looks like; deviations are flagged (see the sketch after this list)
• Unsupervised (no labeled data)
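As an illustration of the semi-supervised setting above, here is a minimal sketch using scikit-learn's OneClassSVM trained on normal data only; the synthetic data and the nu/kernel parameters are illustrative.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal_train = rng.normal(0, 1, size=(500, 2))     # labeled "normal" data only
new_points = np.array([[0.1, -0.2], [4.0, 4.0]])   # one normal-looking point, one far-away point

# Learn a boundary around the normal data; nu bounds the fraction of training
# points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal_train)
print(model.predict(new_points))  # +1 = normal, -1 = anomaly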
Applications
Credit Card Fraud
Data used
• user ID
• amount spent
• time between consecutive card usage
Credit card companies have complete, labeled data and 
user profiles
Kinds of anomalies 
• point anomalies in transaction records
◦high payments
◦items never before purchased by the user
◦high rate of purchase
• contextual anomalies
◦User defines the context
▪ Each credit card user is profiled based on card usage
history. 
▪ Each new transaction compared to user profile,
flagged if it doesn’t match
◦Location defines the context
▪ Detects anomalies among transactions at a specific
geographic location. 
Cellphone Fraud
Data used 
• Call data records (CDRs)
• CDR = vector of features
◦continuous (e.g., CALL-DURATION)
◦discrete (e.g., CALLING-CITY). 
Kinds of anomalies
• point anomalies from aggregated CDR data
◦aggregated by time, user, or area
◦high volume of calls
◦calls made to unlikely destinations
Insider Trading
Data used
• Option trading data
• Stock trading data
• News
• Data is time-series or otherwise temporally sequenced.
Medical
• Patient records
◦Electronic Health Records (EHRs)
▪ demographics, medical history, medication and allergies, immunization status, laboratory test results, radiology images, vital signs, personal statistics like age and weight, and billing information
◦Electrocardiograms (ECG) and Electroencephalograms (EEG)
• Temporal and/or spatial data 
Types of anomalies
• point anomalies
◦e.g., abnormal patient condition, instrumentation errors, recording errors
• contextual
◦Disease outbreaks can be contextual anomalies (e.g. geo-temporal pattern of viral infections)
• collective
• False negatives can cost $$$ and lives
• A colleague (David Gilmore) said: 
• "Precision saves money, recall saves lives."
Methods
Classification
• Train a model from labeled data (supervised)
• Use the model to classify other data
• Many different ways to do this
◦SVMs, PGMs, Rules
◦Neural nets have shown much promise
▪ LSTMs learn features across a sequence
▪ Autoencoders reconstruct the data; the reconstruction error tells you whether the data is anomalous
Recurrent Neural Nets and
LSTMs
Now we’ll look at a method or two for time-series data.
• Method needs to learn patterns present in the sequence
• Sequences can have patterns of unknown length
• Recurrent neural networks (RNNs)[1][2] let you address
sequences of data
• Detect deviations from normalcy
• Steps
◦Train the NN to predict several time steps into the future 
◦Each point in the sequence has several corresponding
predicted values made at different points in the past,
resulting in multiple error values. 
◦Compute error distribution
• More generally, to detect anomalies in a time series
◦Anomalous if prediction error is larger than expected
◦Can pick an error threshold, e.g. 2 std. dev. from the mean
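A minimal sketch of the thresholding step. A trivial "predict the previous value" rule stands in for the trained RNN/LSTM; the synthetic series, the injected anomaly, and the 2-standard-deviation threshold are all illustrative.

import numpy as np

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.05, 500)
series[300] += 2.0  # inject a point that breaks the learned pattern

# Placeholder predictor: in practice these would be the trained network's forecasts.
predictions = np.roll(series, 1)      # "predict" each value as the previous one
errors = np.abs(series[1:] - predictions[1:])

# Model the error distribution and flag errors more than 2 std. dev. above the mean.
threshold = errors.mean() + 2 * errors.std()
print(np.where(errors > threshold)[0] + 1)  # indices with unusually large error (the injected point and the step after it)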
Autoencoders for Anomaly Detection
• Train the autoencoder.
• If the data is sequential, you can incorporate RNNs
or LSTMs.
• Use the model to reconstruct the input.
• If the reconstruction error is above some threshold,
label it as an anomaly
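A minimal sketch of the reconstruction-error idea with a small dense autoencoder in PyTorch; the architecture, training length, synthetic data, and the percentile threshold are all illustrative choices.

import numpy as np
import torch
from torch import nn

rng = np.random.default_rng(0)
normal = torch.tensor(rng.normal(0, 1, size=(1000, 8)), dtype=torch.float32)

# Bottleneck network trained to reproduce its input.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2),
                      nn.ReLU(), nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 8))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(normal), normal)
    loss.backward()
    opt.step()

# Reconstruction error as the anomaly score; threshold taken from the training errors.
with torch.no_grad():
    train_err = ((model(normal) - normal) ** 2).mean(dim=1)
    threshold = torch.quantile(train_err, 0.99)
    test = torch.tensor([[0.1] * 8, [6.0] * 8], dtype=torch.float32)
    test_err = ((model(test) - test) ** 2).mean(dim=1)
print(test_err > threshold)  # True marks an anomaly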
Nearest-Neighbor Methods
Assumption 
• Normal data are close together, while anomalies are far away
Two Methods
1. Anomaly score is distance to kth nearest neighbor.
2. Anomaly score is based on the density of each point's neighborhood (lower density => more anomalous)
• Distance metric affects computational complexity
• Easy to adapt to different problem domains: just define a suitable distance metric (see the sketch below)
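A minimal sketch of method 1 (distance to the k-th nearest neighbor as the anomaly score) using scikit-learn; k and the synthetic data are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, size=(300, 2)), [[6.0, 6.0]]])  # a cluster plus one far-away point

k = 5
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(data)  # +1 because each point is its own nearest neighbor
distances, _ = nbrs.kneighbors(data)
scores = distances[:, k]                              # distance to the k-th true neighbor

print(np.argsort(scores)[-3:])  # indices of the three highest-scoring (most anomalous) points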
Statistical Methods
• Assumption
• Normal data lies in high probability regions,
anomalies in low probability regions
• Parametric and non-parametric methods
Parametric
• Assumes normal data is distributed according to a parametric
distribution
• Anomaly score is inverse of the PDF 
• Or, use a hypothesis test. Anomaly score can be test statistic
Examples: 
• Gaussian models => maximum likelihood estimation (MLE), Grubbs' test and variants (see the sketch after this list)
• Regression models => ARIMA, ARMA
• mixtures of models
◦Assume each data point has prob. p of being an anomaly
◦N = PDF of normal data
◦A = PDF of anomalies (assume to be uniform)
◦D = PDF of all the data = pA + (1-p)N
◦Start with all points assigned to N
◦Anomaly score comes from how much the data likelihood changes when a point is moved from N to A.
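A minimal sketch of the Gaussian case from the list above: fit the mean and standard deviation by MLE, then score by the negative log-density; the synthetic data and the quantile threshold are illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 2, 1000), [25.0]])  # normal data plus one outlier

# MLE for a Gaussian is just the sample mean and (population) standard deviation.
mu, sigma = data.mean(), data.std()

# Low density (high negative log-likelihood) => high anomaly score.
scores = -norm.logpdf(data, loc=mu, scale=sigma)
threshold = np.quantile(scores, 0.995)
print(np.where(scores > threshold)[0])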
Non-parametric
• Histogram models
◦Does the test instance fall into an existing bin?
◦Or, derive a score from the (relative) height of the bin in which it lands (see the sketch below)
• Kernel methods estimate the data PDF and are similar to
parametric methods 
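A minimal sketch of the histogram idea: build bins from training data, then score a test instance by the relative height of the bin it falls into. The bin count, data, and the helper histogram_score are illustrative, not a standard API.

import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 5000)

counts, edges = np.histogram(train, bins=30)
freqs = counts / counts.sum()

def histogram_score(x):
    # Anomaly score = 1 - relative frequency of the bin x falls into (1.0 if outside all bins).
    i = np.searchsorted(edges, x, side="right") - 1
    if i < 0 or i >= len(freqs):
        return 1.0
    return 1.0 - freqs[i]

print(histogram_score(0.1), histogram_score(6.0))  # low score vs. maximal score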
Spectral Methods
Assumption
• "Data can be embedded into a lower dimensional subspace
in which normal instances and anomalies appear significantly
different.” - Anomaly Detection: A Survey
Main idea: 
Find a subspace where the anomalies are easy to see and
project data onto it.
Methods 
• Unsupervised or semi-supervised
• PCA
◦Project data onto the low-variance principal components; anomalous instances will have large projections there (see the sketch after this list)
◦For graphs: apply PCA to the adjacency matrix at different points in time; changes in the principal components indicate an anomalous graph
• Reconstruction errors from a Compact Matrix Decomposition (CMD) of a graph's adjacency matrix can likewise indicate an anomalous graph
• PCA can be expensive
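A minimal sketch of the PCA idea: keep the high-variance components and score each point by its reconstruction error, i.e., the energy left in the discarded low-variance directions. The synthetic data (normal points near a 2-D subspace of a 5-D space) and the number of components are illustrative.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(0, 1, size=(500, 2))
data = latent @ rng.normal(0, 1, size=(2, 5)) + rng.normal(0, 0.05, size=(500, 5))
data = np.vstack([data, rng.normal(0, 3, size=(1, 5))])  # append one point that violates the structure

pca = PCA(n_components=2).fit(data)
reconstructed = pca.inverse_transform(pca.transform(data))
scores = ((data - reconstructed) ** 2).sum(axis=1)  # residual energy in the low-variance directions

print(np.argmax(scores))  # index of the highest-scoring point (most likely the appended off-subspace row)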
Contextual Anomalies
Contextual attributes are key
• sequential: position in sequence is the context
◦time-series
◦event data (timestamped)
▪ inter-arrival time between events can be uneven
• spatial: location is the context
• graphs: the edges between data instances (the nodes) are the context
• profiles (user defines context, like for credit card fraud)
Contextual Methods
• Convert to a point anomaly problem (see the sketch after this list)
• 1. identify a context for a data instance
• 2. compute an anomaly score within that context using a point anomaly method
• Use the structure of the data when breaking data
into contexts is hard (time-series and sequences)
• time-series
◦regression, RNNs
• sequences
◦Use events occurring before a particular time to predict the
event occurring at that time. 
◦If the prediction doesn't match the actual event, it's labeled rare.
◦Use Finite State Automata (FSA) or Hidden Markov Models (HMMs) to compute conditional probabilities for events in the sequence based on previous events. 
◦Model event sequence as a Poisson process 
• graphs
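A minimal sketch of the two-step reduction above, in the spirit of the credit card example: the context is the week of the year, and the point-anomaly score is a per-context z-score. The synthetic spending data, the grouping, and the use of pandas are all illustrative.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
weeks = np.tile(np.arange(52), 11)                                 # 11 years of weekly spending totals
amounts = rng.normal(100, 10, weeks.size)
amounts[weeks == 51] = rng.normal(1000, 50, (weeks == 51).sum())   # Christmas week is normally high
df = pd.DataFrame({"week": weeks, "amount": amounts})

history, new = df.iloc[:520].copy(), df.iloc[520:].copy()          # 10 years of profile, 1 year to score
new.loc[new.index[30], "amount"] = 1000.0                          # a $1000 week in July of the new year

# Step 1: identify the context (week of year). Step 2: score within that context.
stats = history.groupby("week")["amount"].agg(["mean", "std"])
z = (new["amount"] - new["week"].map(stats["mean"])).abs() / new["week"].map(stats["std"])
print(new.assign(z=z).nlargest(3, "z"))  # the July spike dominates; Christmas weeks score low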
Collective Anomalies 
• Hardest to detect, because the anomaly lies in the collective behavior of a group of instances rather than in any single instance.
• Relationship between data points is important
◦Sequential => find an anomalous subsequence
▪ lots of research here b/c lots of time-series and
event sequence data in the wild
◦Spatial => find an anomalous subregion
▪ image/video processing
◦Graph => find an anomalous subgraph
◦The task is to find an anomalous subset
Detecting Collective
Sequential Anomalies
Reduce to point anomaly problem:
• transform subsequences into points and then use a point anomaly method (see the sketch below)
• FSA, Markov Models, HMMs, CRFs for symbols
Neural Nets would be powerful here
• RNNs + LSTMs + Autoencoders: Could use a sequence to
sequence model on the subsequences and compute
reconstruction error
• For every example we’ve looked at that used FSA or HMMs,
you could use neural nets instead
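A minimal sketch of the reduction: slide a fixed-length window over the series (NumPy's sliding_window_view) so each subsequence becomes a point, then apply any point-anomaly method, here distance to the k-th nearest window. The window length, k, the synthetic series, and the injected flat stretch are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 60, 1200)) + rng.normal(0, 0.05, 1200)
series[600:630] = 0.0   # a flat stretch: each value is plausible on its own, the subsequence is not

w = 30                  # window length
windows = np.lib.stride_tricks.sliding_window_view(series, w)

k = 5
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(windows)
dist, _ = nbrs.kneighbors(windows)
scores = dist[:, k]     # distance from each window to its k-th nearest other window

print(np.argmax(scores))  # start index of the top-scoring window; it overlaps the injected flat stretch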
Detecting Collective Spatial
Anomalies
• Most work here has been on images
• Anomaly detection in videos would likely be a combination of
techniques for spatial and sequential anomalies (collective or
otherwise). 
◦Video = sequence of images + an audio stream
• Convolutional neural networks (CNNs) have been used for
anomaly detection in images
◦Fully Convolutional Neural Network for Fast Anomaly Detection in Crowded Scenes (2016): https://arxiv.org/abs/1609.00866
Most important thing…
• Understand your problem before picking a method. 
• Just because a method is the most accurate doesn’t
automatically make it the best solution for your problem.
