DRIVING INNOVATION 
THROUGH DATA 
LARGE-SCALE LOG PROCESSING WITH CASCADING & LOGSTASH 
Elasticsearch Meetup, Oct 30 2014
WHAT IS LOG FILE ANALYTICS? 
• Making sense of large amounts of [semi|un]structured data 
• What type of log file data? 
‣ Syslog 
‣ Web log files (Apache, Nginx, WebTrends, Omniture) 
‣ POS transactions 
‣ Advertising impressions (Doubleclick DART, OpenX, Atlas) 
‣ Twitter firehose (yes, it’s a log file!) 
• Anything with a timestamp and data 
LOGSTASH ARCHITECTURE 
http://www.slashroot.in/logstash-tutorial-linux-central-logging-server 
• Data collection is flexible 
• Lots of input/output plugins 
• Grok filtering is easy 
• Kibana UI is attractive
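The flexible collection described above can be sketched as a minimal Logstash pipeline: one input plugin, a Grok filter, and an Elasticsearch output. The log path, pattern, and host below are illustrative placeholders, not values from the talk:

```
input {
  file {
    path => "/var/log/apache2/access.log"   # hypothetical source log
  }
}

filter {
  grok {
    # parse Apache combined-format access lines without writing code
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch { hosts => ["localhost:9200"] }   # hypothetical ES endpoint
}
```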
WHAT CAN WE DO WITH CASCADING + LOGSTASH? 
• Provide richer log-processing capabilities 
• Integrate & correlate with other information 
‣ Large list of integration adapters 
• Analyze large volumes of log data 
• Capture & retain unfiltered log data 
• Operationalize your log-processing application 
GET TO KNOW CONCURRENT 
Leader in Application Infrastructure for Big Data 
• Building enterprise software to simplify Big Data application 
development and management 
Products and Technology 
• CASCADING 
Open Source - The most widely used application infrastructure for 
building Big Data apps with over 175,000 downloads each month 
• DRIVEN 
Enterprise data application management for Big Data apps 
Proven — Simple, Reliable, Robust 
• Thousands of enterprises rely on Concurrent to provide their data 
application infrastructure. 
Founded: 2008 
HQ: San Francisco, CA 
CEO: Gary Nakamura 
CTO, Founder: Chris Wensel 
www.concurrentinc.com
CASCADING - DE-FACTO STANDARD FOR DATA APPS 
Cascading Apps 
[Diagram: Cascading apps (SQL, Clojure, Ruby) running on new fabrics (Tez, Storm) and on supported fabrics and data stores: Mainframe, DB/DW, In-Memory Data Stores, Hadoop] 
• Standard for enterprise 
data app development 
• Your programming 
language of choice 
• Cascading applications 
that run on MapReduce 
will also run on Apache 
Spark, Storm, and …
CASCADING 3.0 
“Write once and deploy on your fabric of choice.” 
• The Innovation — Cascading 3.0 will 
allow for data apps to execute on 
existing and emerging fabrics 
through its new customizable query 
planner. 
• Cascading 3.0 will support — Local 
In-Memory, Apache MapReduce and 
soon thereafter (3.1) Apache Tez, 
Apache Spark and Apache Storm 
[Diagram: Enterprise Data Applications running on computation fabrics: Local In-Memory, MapReduce, and Apache Tez, Storm, ...]
… AND INCLUDES RICH SET OF EXTENSIONS 
http://www.cascading.org/extensions/
DEMO: WORD COUNT EXAMPLE WITH CASCADING 
String docPath = args[ 0 ]; 
String wcPath = args[ 1 ]; 

// configuration 
Properties properties = new Properties(); 
AppProps.setApplicationJarClass( properties, Main.class ); 
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties ); 

// integration: create source and sink taps 
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath ); 
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath ); 

// processing: specify a regex to split "document" text lines into a token stream 
Fields token = new Fields( "token" ); 
Fields text = new Fields( "text" ); 
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" ); 
// only returns "token" 
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); 

// determine the word counts 
Pipe wcPipe = new Pipe( "wc", docPipe ); 
wcPipe = new GroupBy( wcPipe, token ); 
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); 

// scheduling: connect the taps, pipes, etc., into a flow definition 
FlowDef flowDef = FlowDef.flowDef().setName( "wc" ) 
  .addSource( docPipe, docTap ) 
  .addTailSink( wcPipe, wcTap ); 

// create the Flow 
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- unit of work 
wcFlow.complete(); // <<-- runs jobs on the cluster
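To see what the GroupBy/Count flow above computes, here is the same logic in plain Java (a sketch using only the standard library, not the Cascading API); the class and method names are illustrative:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java sketch of the word-count flow: split each line on the demo's
// token regex, then group equal tokens and count them.
public class WordCountSketch {

    // Same token-splitting pattern as the RegexSplitGenerator in the demo:
    // spaces, square brackets, parentheses, commas, and periods.
    static final String SPLIT_REGEX = "[ \\[\\](),.]";

    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split(SPLIT_REGEX)))
            .filter(tok -> !tok.isEmpty())   // drop empty tokens between delimiters
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(List.of("the quick fox", "the [lazy] dog."));
        System.out.println(counts);   // e.g. {the=2, quick=1, fox=1, lazy=1, dog=1}
    }
}
```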
SOME COMMON PATTERNS 
• Functions 
• Filters 
• Joins 
‣ Inner / Outer / Mixed 
‣ Asymmetrical / Symmetrical 
• Merge (Union) 
• Grouping 
‣ Secondary Sorting 
‣ Unique (Distinct) 
• Aggregations 
‣ Count, Average, etc. 
[Diagram: a pipeline of functions and filters over data; split, join, and merge topologies]
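As an illustration of the join pattern above, here is an inner join of two keyed lists in plain Java, a sketch of what Cascading's CoGroup pipe does rather than its API; all type and field names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the inner-join pattern: pair up records from two streams
// that share the same key, dropping rows with no match on either side.
public class InnerJoinSketch {

    record User(String id, String name) {}
    record Click(String userId, String url) {}

    static List<String> innerJoin(List<User> users, List<Click> clicks) {
        // index the left side by key, then probe it with the right side
        Map<String, User> byId = users.stream()
            .collect(Collectors.toMap(User::id, u -> u));
        List<String> joined = new ArrayList<>();
        for (Click c : clicks) {
            User u = byId.get(c.userId());
            if (u != null) {                     // inner join: keep matches only
                joined.add(u.name() + " -> " + c.url());
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<User> users = List.of(new User("u1", "ada"), new User("u2", "bob"));
        List<Click> clicks = List.of(new Click("u1", "/home"), new Click("u3", "/x"));
        System.out.println(innerJoin(users, clicks));   // [ada -> /home]
    }
}
```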
THE STANDARD FOR DATA APPLICATION DEVELOPMENT 
www.cascading.org 
Proven application development framework for building data apps 
Build data apps that are scale-free: design principles ensure best practices at any scale 
Application platform that addresses: 
• Test-Driven Development: efficiently test code and process local files before deploying on a cluster 
• Staffing Bottleneck: use existing Java, SQL, and modeling skill sets 
• Application Portability: write once, then run on different computation fabrics 
• Operational Complexity: simple; package up into one jar and hand to operations 
• Systems Integration: Hadoop never lives alone; easily integrate with existing systems 
CASCADING 
• Java API 
• Separates business logic from integration 
• Testable at every lifecycle stage 
• Works with any JVM language 
• Many integration adapters 
[Diagram: Cascading stack. The Processing API and Integration API sit over the Process Planner, Scheduler API, and Scheduler, which run on Apache Hadoop and data stores; accessible from enterprise Java and scripting languages (Scala, Clojure, JRuby, Jython, Groovy)]
BUSINESSES DEPEND ON US 
• Cascading Java API 
• Data normalization and cleansing of search and click-through logs for 
use by analytics tools, Hive analysts 
• Easy to operationalize heavy lifting of data in one framework 
BROAD SUPPORT 
Hadoop ecosystem supports Cascading
CASCADING DEPLOYMENTS 
OPERATIONAL EXCELLENCE 
Visibility Through All Stages of App Lifecycle 
From Development — Building and Testing 
• Design & Development 
• Debugging 
• Tuning 
To Production — Monitoring and Tracking 
• Maintain Business SLAs 
• Balance & Controls 
• Application and Data Quality 
• Operational Health 
• Real-time Insights 
DRIVEN ARCHITECTURE
DEEPER VISUALIZATION INTO YOUR HADOOP CODE 
• Easily comprehend, debug, and tune 
your data applications 
• Get rich insights on your application 
performance 
• Monitor applications in real-time 
• Compare app performance with 
historical (previous) iterations 
Debug and optimize your Hadoop applications more effectively with Driven
GET OPERATIONAL INSIGHTS WITH DRIVEN 
• Quickly break down how often 
applications execute based on their tags, 
teams, or names 
• Immediately identify if any application is 
monopolizing cluster resources 
• Understand the utilization of your cluster 
with a timeline of all applications running 
Visualize the activity of your applications to help maintain SLAs
ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY 
• Easily keep track of all your 
applications by segmenting them with 
user-defined tags 
• Segment your applications for 
trending analysis, cluster analysis, 
and developing chargeback models 
• Quickly break down how often 
applications execute based on their 
tags, teams, or names 
Segment your applications for greater insights across all your applications
COLLABORATE WITH TEAMS 
Utilize teams to collaborate and gain visibility over your set of applications 
• Invite others to view and collaborate 
on a specific application 
• Gain visibility to all the apps and their 
owners associated with each team 
• Simply manage your teams and the 
users assigned to them 
MANAGE PORTFOLIO OF BIG DATA APPLICATIONS 
Fast, powerful, rich search capabilities enable you to easily find the exact set of 
applications that you're looking for 
• Identify problematic apps with their 
owners and teams 
• Search for groups of applications 
segmented by user-defined tags 
• Compare specific applications with their 
previous iterations to ensure that your 
application can meet its SLA 
DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS 
• Understand the anatomy of your Hive app 
• Track execution of queries as a single business process 
• Identify outlier behavior by comparison with historical runs 
• Analyze rich operational meta-data 
• Correlate Hive app behavior with other events on cluster 
TAKE AWAY POINTS 
• Logstash provides a flexible and robust way to collect 
log data; Grok lets you parse logs without coding, and the 
Kibana UI makes the information easy to analyze 
• Cascading is the de-facto framework for building Big Data 
(Hadoop) applications and processing data at scale 
• Cascading + Logstash lets you develop applications to 
collect and process large volumes of data 
• With Driven, you can put your mission-critical log-processing 
applications in production and monitor SLAs 
CONTACT INFORMATION 
Supreet Oberoi 
supreet@concurrentinc.com 
650-868-7675 (m) 
@supreet_online
DRIVING INNOVATION 
THROUGH DATA 
THANK YOU 
Supreet Oberoi