Bridging Batch and Real-time Systems for Anomaly Detection
Costin Leau
@costinl
www.elastic.co
Interesting != Common
Datasets tend to have hot / common entities
Monopolize the data set
Create too much noise
Cannot be easily avoided
Common = frequent
Interesting = frequently different
Finding the uncommon
Background vs foreground == things that stand out
Example:
Background: “flu”
“H5N1” appears in 5 / 10M docs
Foreground: “bird flu”
“H5N1” appears in 4 / 100 docs
[Diagram: "H5N1" within the "bird flu" foreground set vs. "H5N1" within the "flu" background set]
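The foreground/background comparison above comes down to simple arithmetic. The sketch below is a minimal illustration using the counts from the slide; it is not Elasticsearch's actual significance heuristic, and the function name `lift` is a hypothetical helper.

```python
def lift(fg_hits, fg_total, bg_hits, bg_total):
    """Ratio of a term's foreground rate to its background rate."""
    fg_rate = fg_hits / fg_total
    bg_rate = bg_hits / bg_total
    return fg_rate / bg_rate

# Counts taken from the slide: "H5N1" appears in 4 of 100 "bird flu" docs
# (foreground) versus 5 of 10M "flu" docs (background).
h5n1_lift = lift(4, 100, 5, 10_000_000)
print(f"foreground rate: {4 / 100:.2%}")
print(f"background rate: {5 / 10_000_000:.5%}")
print(f"lift: {h5n1_lift:,.0f}x")
```

With these counts the term is about 80,000 times more frequent in the foreground than in the background, which is what makes it stand out.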
Stack
Execution options
Long-running (Batch?)
Great for bulk data
Does not require prepared data
Useful for building data models
Real-time
Time-sensitive operations
Rich answers rely on metadata
Works great with data streams
Finding the uncommon - Stack
Deal with big data sets
Hadoop / Spark
Perform the analysis & exploration
Elasticsearch
Mine the data
Spark
Elasticsearch
The Big Picture
HDFS
Slow, in-depth (machine) learning
Fast, real-time learning
ETL
Hadoop
De facto platform for big data
HDFS - Used for storing and performing ETL at scale
YARN - Job scheduling and resource management
Apache Spark
Apache Spark™ is a fast and general engine for large-scale data processing
Provides building blocks for fast reads and transformations
Streaming (Spark Streaming)
Machine learning (MLlib)
Works great with Hadoop (HDFS and YARN)
Elasticsearch
Open-source real-time search and analytics engine
• Fully-featured search
Relevance-ranked text search
Scalable search
High-performance geo, temporal, range and key lookup
Highlighting
Support for complex / nested document types *
Spelling suggestions
Powerful query DSL *
“Standing” queries *
Real-time results *
Extensible via plugins *
• Powerful faceting/analysis
Summarize large sets by any combination of time, geo, category and more *
“Kibana” visualization tool *
• Management
Simple and robust deployments *
REST APIs for handling all aspects of administration/monitoring *
“Marvel” console for monitoring and administering clusters *
Special features to manage the life cycle of content *
• Integration
Hadoop (MapReduce, Hive, Pig, Cascading, ...) *
Client libraries (Python, Java, Ruby, JavaScript, …)
Data connectors (Twitter, JMS, …)
Logstash ETL framework *
• Support
Development and production support with tiered levels
Support staff are the core developers of the product *
* Features we see as differentiators
Unstructured search
Sorting
Pagination
Enrichment
Suggestions
Structured search
Aggregations
Elasticsearch Hadoop
Discovering the relevant
Inverted index
Inverting Shakespeare
Take all the plays and break them down word by word
For each word, store the ids of the documents that contain it
Sort all tokens (words)
token       doc freq.   postings (doc ids)
Anthony     2           1, 2
Brutus      1           5
Caesar      2           2, 3
Calpurnia   2           4, 5
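The three steps above (tokenize, record document ids per token, sort the tokens) can be sketched in a few lines. This is a minimal illustration, not how Lucene stores its index; the tiny `docs` mapping is made up so that the result matches the table, and `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of ids of the docs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():        # break the text down word by word
            postings[token].add(doc_id)   # remember which doc the word came from
    # Sort the tokens, and the doc ids within each postings list.
    return {token: sorted(ids) for token, ids in sorted(postings.items())}

# Made-up stand-ins for the plays; only the names from the table are included.
docs = {
    1: "Anthony",
    2: "Anthony Caesar",
    3: "Caesar",
    4: "Calpurnia",
    5: "Brutus Calpurnia",
}

index = build_inverted_index(docs)
for token, ids in index.items():
    print(token, len(ids), ids)   # token, doc frequency, postings
```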
Relevancy
How well does a document match a query?
step                 query          d1                                        d2
The text             brown fox      The quick brown fox likes brown nuts      The red fox
The terms            (brown, fox)   (brown, brown, fox, likes, nuts, quick)   (red, fox)
A frequency vector   (1, 1)         (2, 1)                                    (0, 1)
Relevancy            -              2?                                        1?
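The frequency vectors in the table can be reproduced by counting, for each text, how often each query term occurs. The sketch below assumes lowercasing and a simple word-character tokenizer, which is far cruder than a real Elasticsearch analyzer; `term_vector` is a hypothetical helper.

```python
import re
from collections import Counter

def term_vector(text, query_terms):
    """Count how often each query term occurs in the text."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return tuple(counts[term] for term in query_terms)

query_terms = ("brown", "fox")
q_vec = term_vector("brown fox", query_terms)
d1_vec = term_vector("The quick brown fox likes brown nuts", query_terms)
d2_vec = term_vector("The red fox", query_terms)
print(q_vec, d1_vec, d2_vec)
```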
Relevancy - Vector Space Model
How well does q match d1 and d2?
The coordinates in the vector represent weights per term
The simple (1, 0) vector defines these weights based on the frequency of each term
But to generalize:
[Plot: axes "tf: brown" and "tf: fox", with vectors q: (brown, fox), d1: (brown, brown, fox), d2: (fox)]
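One common way to compare vectors in this model is the cosine of the angle between the query vector and each document vector. The slide's own formula did not survive extraction, so the sketch below assumes cosine similarity and uses the term-frequency vectors over (brown, fox) from the earlier relevancy table.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # an all-zero vector matches nothing
    return dot / (norm_a * norm_b)

# Term-frequency vectors over the axes (brown, fox).
q, d1, d2 = (1, 1), (2, 1), (0, 1)
sim_d1 = cosine(q, d1)
sim_d2 = cosine(q, d2)
print(f"q vs d1: {sim_d1:.3f}")
print(f"q vs d2: {sim_d2:.3f}")
```

Under this measure d1 ranks above d2, because its direction is closer to that of the query.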
Relevancy – TF/IDF
Term Frequency / Inverse Document Frequency
TF = the more a token appears in a doc, the more important it is
IDF = the more documents containing the term, the less important it is
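The two definitions above can be sketched with the textbook weighting tf × log(N / df). This is an assumption chosen for illustration; Lucene's classic similarity uses damped variants of both factors. The two documents are the ones from the relevancy example, and `tf_idf` is a hypothetical helper.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: tf[term] * math.log(n / df[term]) for term in tf})
    return weights

docs = ["The quick brown fox likes brown nuts", "The red fox"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", {term: round(score, 3) for term, score in w.items()})
```

Terms that occur in every document ("the", "fox") get a weight of zero, while "brown", which is repeated in one document and absent from the other, gets the highest weight.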
Ranking Formula
Score of a document for a given query
Called Lucene Similarity
Can be ignored (was an attempt to make query scores comparable across indices; it’s there for backward compatibility)
Core TF/IDF weight
Normalized doc length: shorter docs are more likely to be relevant than longer docs
Boost of query term t
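The formula image itself did not survive extraction. The annotations appear to describe Lucene's classic practical scoring function (TF/IDF similarity); assuming that is what the slide showed, it can be written as follows, where the symbol names are chosen here for readability:

```latex
\mathrm{score}(q, d) =
  \mathrm{queryNorm}(q) \cdot \mathrm{coord}(q, d) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)^2 \cdot
  \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Big)
```

Under that assumption, score(q, d) is the score of a document for a given query, queryNorm(q) is the factor that can be ignored, tf · idf² is the core TF/IDF weight, boost(t) is the boost of query term t, and norm(t, d) is the normalized doc length. The coord(q, d) factor, which rewards documents matching more of the query terms, is not among the surviving annotations.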
Discovering the interesting
Frequency differentiator
TF-IDF by itself is not enough
Need to compare the DF in foreground vs background
Precision vs Recall balance
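Precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that are retrieved. The sketch below computes both from two sets of document ids; the ids and the `relevant` set are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 6 docs retrieved, 4 of them relevant, 8 relevant overall.
retrieved = {"A", "C", "F", "H", "I", "K"}
relevant = {"A", "C", "F", "H", "M", "Q", "R", "Z"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Widening the query to retrieve more documents tends to raise recall and lower precision, which is the balance the slide refers to.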
Single-set analysis
Query results: A C F H I K
Dataset: A B C D E … X Y Z W
Single-set analysis example
[Two diagrams: the share of "bicycle theft" within "crimes", for British Police Force vs. British Transport Police]
Multi-set analysis
Dataset: A B C D E … X Y Z W
Query results: A C F H I K M Q R …
Aggregate: A B C D .. J L M N O .. U
Aggregation (geo-aggregation)
Aggregation + Analysis
Hadoop / Spark
Off-line / “slow” learning
In-depth analysis
Break down data into hot spots
Eliminate noise
Build multiple models
Elasticsearch
Search features
Scoring, TF-IDF
Significant terms (multi-set analysis)
Aggregations
Buckets & Metrics
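A significant-terms analysis is requested from Elasticsearch as an aggregation inside a search body: the query selects the foreground set and the `significant_terms` aggregation compares term frequencies against the background index. The sketch below only builds such a body as a Python dict; the field names `force` and `crime_type` and the aggregation name `significant` are assumptions for illustration, and no cluster is contacted.

```python
import json

def significant_terms_request(filter_field, filter_value, terms_field, size=10):
    """Build a search body: the query is the foreground, the index the background."""
    return {
        "size": 0,  # only the aggregation is of interest, not the hits
        "query": {"term": {filter_field: filter_value}},
        "aggs": {
            "significant": {
                "significant_terms": {"field": terms_field, "size": size}
            }
        },
    }

# Hypothetical field names; the value echoes the earlier crime example.
body = significant_terms_request("force", "British Transport Police", "crime_type")
print(json.dumps(body, indent=2))
```

The resulting JSON could be sent to an index's `_search` endpoint with any HTTP or Elasticsearch client.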
Reacting to data
Reacting to live data
Preventing
Execute queries as the data flows in
Routing
Place suspicious data into a dedicated pipeline
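The idea of running stored queries against each incoming document and routing the suspicious ones elsewhere can be sketched in plain Python. This is a stand-in for the concept only, not the Elasticsearch API for standing queries; the rule names, thresholds, and event fields are all made up.

```python
def route(doc, standing_queries):
    """Return the names of all standing queries that match the document."""
    return [name for name, matches in standing_queries.items() if matches(doc)]

def dispatch(stream, standing_queries):
    """Split a stream into normal docs and suspicious docs with their matches."""
    normal, suspicious = [], []
    for doc in stream:
        hits = route(doc, standing_queries)
        (suspicious if hits else normal).append((doc, hits))
    return normal, suspicious

# Hypothetical rules, evaluated as each document flows in.
standing_queries = {
    "large_amount": lambda d: d.get("amount", 0) > 10_000,
    "unknown_country": lambda d: d.get("country") not in {"US", "DE", "RO"},
}

events = [
    {"id": 1, "amount": 25, "country": "US"},
    {"id": 2, "amount": 50_000, "country": "US"},
    {"id": 3, "amount": 10, "country": "ZZ"},
]
normal, suspicious = dispatch(events, standing_queries)
for doc, hits in suspicious:
    print(doc["id"], "->", hits)
```

Keeping the rules in a dict makes it easy to swap in an updated rule set as the batch side produces new models.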
Reacting to streaming data
Live loop
Data keeps changing
Adapt the set of rules
Improves reaction time
Build a model for fast decision making
Keeps the prevention rate high
Categorize data on the fly
[Diagram labels: Elasticsearch, Streaming]
Finding interesting data – basic approach
Finding interesting data - analytics
Finding interesting data - through a ML model
Q & A
Thank you!
@costinl

Editor's Notes

  • #8 Conceptually they are the same (same data structure, same APIs), but the usage model differs.
  • #14 Bullet 1: As user content is created, it's indexed and automatically available in real time for others to see in their search results; it's even relevant to them based on boosting, scoring, etc. Bullet 2: New features can go out quickly thanks to the schemaless or schema-lite nature of Elasticsearch. No need to change the schema; just update and go as you develop the document and application.
  • #19 Including highlighting and did-you-mean (DYM) suggestions
  • #20 Filtering between dates/ranges
  • #31 Precision = how many of the retrieved documents are relevant; Recall = how many of the relevant documents are retrieved
  • #35 Term aggregation
  • #36 Significant terms