Bridging Batch and Real-time Systems for Anomaly Detection

Costin Leau
@costinl

www.elastic.co
Interesting != Common
Datasets tend to have hot / common entities
Monopolize the data set
Create too much noise
Cannot be easily avoided
Common = frequent
Interesting = frequently different

Finding the uncommon
Background vs foreground == things that stand out
Example:
Background: “flu”
“H5N1” appears in 5 / 10M docs
Foreground: “bird flu”
“H5N1” appears in 4 / 100 docs
[Figure: “H5N1” shown inside the “bird flu” set and inside the “flu” set]
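
The comparison on this slide can be sketched in a few lines of Python. The counts come from the slide; the function name `share` and the plain ratio used as a score are illustrative assumptions, not necessarily the heuristic Elasticsearch applies.

```python
def share(hits: int, total: int) -> float:
    """Fraction of documents in a set that contain the term."""
    return hits / total

# Counts taken from the slide.
background = share(5, 10_000_000)  # "H5N1" among docs matching "flu"
foreground = share(4, 100)         # "H5N1" among docs matching "bird flu"

# Assumption: a simple ratio of the two shares as the "stands out" score.
lift = foreground / background
print(f"background={background:.7f} foreground={foreground:.2f} lift={lift:,.0f}")
```

A term that is rare overall but dense in the foreground gets a large ratio, which is what makes it interesting rather than merely common.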

Execution options
Long-running (Batch?)
Great for bulk data
Does not require prepared data
Useful for building data models
Real-time
Time-sensitive operations
Rich answers rely on metadata
Work great with data streams

Finding the uncommon - Stack
Deal with big data sets
Hadoop / Spark
Perform the analysis & exploration
Elasticsearch
Mine the data
Spark
Elasticsearch

The Big Picture
HDFS
Slow, in-depth (machine) learning
Fast, real-time learning
ETL

Hadoop
De facto platform for big data
HDFS - Used for storing data and performing ETL at scale
YARN - Job scheduling and resource management

Apache Spark
Apache Spark™ is a fast and general engine for large-scale data processing
Provides building blocks for fast reads and transformations
streaming (Spark Streaming)
machine learning (MLlib)
works great with Hadoop (HDFS and YARN)

Elasticsearch
Open-source real-time search and analytics engine
• Fully-featured search
Relevance-ranked text search
Scalable search
High-performance geo, temporal, range and key lookup
Highlighting
Support for complex / nested document types *
Spelling suggestions
Powerful query DSL *
“Standing” queries *
Real-time results *
Extensible via plugins *
• Powerful faceting/analysis
Summarize large sets by any combination of time, geo, category and more. *
“Kibana” visualization tool *
• Management
Simple and robust deployments *
REST APIs for handling all aspects of administration/monitoring *
“Marvel” console for monitoring and administering clusters *
Special features to manage the life cycle of content *
• Integration
Hadoop (MapReduce, Hive, Pig, Cascading…) *
Client libraries (Python, Java, Ruby, JavaScript…)
Data connectors (Twitter, JMS…)
Logstash ETL framework *
• Support
Development and Production support with tiered levels
Support staff are the core developers of the product *
* Features we see as differentiators

Unstructured search

Structured search

Elasticsearch Hadoop

Discovering the relevant

Inverted index
Inverting Shakespeare
Take all the plays and break them down word by word
For each word, store the ids of the documents that contain it
Sort all tokens (words)
token      doc freq.  postings (doc ids)
Anthony    2          1, 2
Brutus     1          5
Caesar     2          2, 3
Calpurnia  2          4, 5
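
The three steps above can be sketched as follows. The five tiny documents are hypothetical placeholders chosen only so that the resulting postings match the table; they are not the plays themselves, and the tokenizer (lowercase, split on whitespace) is a simplifying assumption.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: naive tokenization, lowercase and split on whitespace.
        for token in text.lower().split():
            postings[token].add(doc_id)
    # Sort tokens, and sort the doc ids within each postings list.
    return {token: sorted(ids) for token, ids in sorted(postings.items())}

# Hypothetical stand-ins for the plays.
docs = {
    1: "Anthony",
    2: "Anthony Caesar",
    3: "Caesar",
    4: "Calpurnia",
    5: "Brutus Calpurnia",
}

index = build_inverted_index(docs)
for token, ids in index.items():
    print(token, len(ids), ids)  # token, doc freq., postings
```

The document frequency is simply the length of each postings list, so it does not need to be stored separately in this sketch.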

Relevancy
How well does a document match a query?
step                query         d1                                       d2
The text            brown fox     The quick brown fox likes brown nuts     The red fox
The terms           (brown, fox)  (brown, brown, fox, likes, nuts, quick)  (red, fox)
A frequency vector  (1, 1)        (2, 1)                                   (0, 1)
Relevancy           -             2?                                       1?
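
The frequency vectors in this table can be reproduced with a short sketch. It assumes, as the slide appears to, that “the” is dropped as a stopword and that each vector has one coordinate per query term; the helper names are illustrative.

```python
from collections import Counter

STOPWORDS = {"the"}  # assumption: the slide drops "the" from every text

def terms(text):
    """Lowercase, split on whitespace, drop stopwords, sort."""
    return sorted(t for t in text.lower().split() if t not in STOPWORDS)

def frequency_vector(text, query_terms):
    """Count how often each query term occurs in the text."""
    counts = Counter(terms(text))
    return tuple(counts[t] for t in query_terms)

query_terms = terms("brown fox")
q = frequency_vector("brown fox", query_terms)
d1 = frequency_vector("The quick brown fox likes brown nuts", query_terms)
d2 = frequency_vector("The red fox", query_terms)
print(query_terms, q, d1, d2)
```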

Relevancy - Vector Space Model
How well does q match d1 and d2?
The coordinates in the vector represent weights per term
The simple (1, 0) vector defines these weights based on the frequency of each term
But to generalize:
[Figure: q, d1 and d2 plotted as vectors on the axes “tf: brown” and “tf: fox”]
q: (brown, fox)
d1: (brown, brown, fox)
d2: (fox)
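
One common way to compare vectors in the vector space model is cosine similarity. Whether this slide used exactly that measure is an assumption; the sketch below applies it to the term-frequency vectors for q, d1 and d2 over the axes (brown, fox).

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Term-frequency vectors over (brown, fox).
q = (1, 1)
d1 = (2, 1)
d2 = (0, 1)

print(f"q vs d1: {cosine(q, d1):.4f}")
print(f"q vs d2: {cosine(q, d2):.4f}")
```

Under this measure d1 points in nearly the same direction as q and scores higher than d2, which agrees with the ordering suggested on the previous slide.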

Relevancy – TF/IDF
Term Frequency / Inverse Document Frequency
TF = the more a token appears in a doc, the more important it is
IDF = the more documents containing the term, the less important it is
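
A minimal sketch of the two definitions, assuming the textbook form tf × log(N / df). Lucene's actual formula differs in its details, and the two-document corpus is reused from the earlier relevancy example purely for illustration.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: how often the term appears in one document."""
    return doc_tokens.count(term)

def idf(term, corpus):
    """Inverse document frequency: lower when more documents contain the term."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    ["brown", "brown", "fox", "likes", "nuts", "quick"],  # d1
    ["fox", "red"],                                       # d2
]

for term in ("brown", "fox"):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

In this tiny corpus “fox” appears in every document, so its IDF (and therefore its weight) is zero, while “brown” appears only in d1 and keeps a positive weight.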

Ranking Formula
Called Lucene Similarity
Score of a document for a given query
Can be ignored (was an attempt to make query scores comparable across indices; it is there for backward compatibility)
Core TF/IDF weight
Boost of query term t
Normalized doc length; shorter docs are more likely to be relevant than longer docs
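
The formula itself did not survive in this transcript. Assuming the slide showed Lucene's classic practical scoring function, which matches the annotations above, it can be written as below; the coord factor is part of that classic formula but is not among the annotations that survived here.

```latex
\mathrm{score}(q, d) =
  \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Bigl( \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)^{2} \cdot
  \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Bigr)
```

Here score(q, d) is the score of a document for a given query, queryNorm(q) is the factor that can be ignored, tf · idf² is the core TF/IDF weight, boost(t) is the boost of query term t, and norm(t, d) carries the normalized document length.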

Discovering the interesting

Frequency differentiator
TF-IDF by itself is not enough
Need to compare the DF in the foreground vs the background
Precision vs Recall balance

Single-set analysis
Query results: A C F H I K
Dataset: A B C D E … X Y Z W

Single-set analysis example
[Figure: two charts of crimes vs. bicycle theft, one for British Police Force and one for British Transport Police]
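
In Elasticsearch this kind of analysis is exposed as the `significant_terms` aggregation: the query selects the foreground set and the aggregation reports terms that are unusually frequent in it. The sketch below only builds the request body; the field names `force` and `crime_type` are hypothetical and depend on how the crime data was indexed.

```python
import json

def significant_terms_request(foreground_field, foreground_value, terms_field):
    """Build a search body: the query picks the foreground set, the
    aggregation looks for terms that stand out within it."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"term": {foreground_field: foreground_value}},
        "aggs": {
            "significant": {
                "significant_terms": {"field": terms_field}
            }
        },
    }

# Hypothetical field names for the crime example.
body = significant_terms_request("force", "British Transport Police", "crime_type")
print(json.dumps(body, indent=2))
```

The body would be sent to the index's `_search` endpoint; the background set defaults to the whole index.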

Multi-set analysis
Dataset: A B C D E … X Y Z W
Query results: A C F H I K M Q R …
Aggregate: A B C D .. J L M N O .. U

Aggregation (geo-aggregation)

Aggregation + Analysis

Hadoop / Spark
Off-line / “slow” learning
In-depth analysis
Break down data into hot spots
Eliminate noise
Build multiple models

Elasticsearch
Search features
Scoring, TF-IDF
Significant terms (multi-set analysis)
Aggregations
Buckets & Metrics
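
The buckets-and-metrics idea can be illustrated without a cluster: documents are grouped into buckets by a key, and a metric is computed per bucket. This is an in-memory sketch of the concept, not the Elasticsearch API; the records and field names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def aggregate(records, bucket_key, metric_field):
    """Group records into buckets by bucket_key, then compute a
    document count and an average of metric_field for each bucket."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[bucket_key]].append(record[metric_field])
    return {
        key: {"doc_count": len(values), "avg": mean(values)}
        for key, values in buckets.items()
    }

# Hypothetical records.
records = [
    {"city": "Berlin", "amount": 10.0},
    {"city": "Berlin", "amount": 30.0},
    {"city": "Paris", "amount": 5.0},
]

result = aggregate(records, "city", "amount")
print(result)
```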

Reacting to data

Reacting to live data
Preventing
execute queries as the data flows in
Routing
place suspicious data into a dedicated pipeline
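
Both ideas can be sketched together: a set of standing queries is evaluated against each document as it arrives, and any match sends the document to a separate pipeline. Elasticsearch provides standing queries through its percolator; the code below is only an in-memory stand-in, and the rule names, fields and thresholds are hypothetical.

```python
def route(doc, standing_queries):
    """Run every standing query against the incoming document and pick
    a pipeline based on whether any of them matched."""
    matched = [name for name, predicate in standing_queries.items() if predicate(doc)]
    pipeline = "suspicious" if matched else "default"
    return pipeline, matched

# Hypothetical rules; in practice these would be registered queries.
standing_queries = {
    "large_amount": lambda d: d.get("amount", 0) > 10_000,
    "unknown_country": lambda d: d.get("country") not in {"DE", "FR"},
}

incoming = [
    {"amount": 50_000, "country": "DE"},
    {"amount": 10, "country": "FR"},
]

for doc in incoming:
    print(route(doc, standing_queries))
```

Because the rules live in a plain mapping, they can be replaced while data keeps flowing, which is the property the live loop relies on.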

Reacting to streaming data

Live loop
Data keeps changing
Adapt the set of rules
Improves reaction time
Build a model for fast decision making
Keeps the prevention rate high
Categorize data on the fly
[Figure: loop between Elasticsearch and Streaming]

Finding interesting data - basic approach

Finding interesting data - analytics

Finding interesting data - through a ML model

Q & A
Thank you!
@costinl
