DRIVING INNOVATION 
THROUGH DATA 
LARGE-SCALE LOG PROCESSING WITH CASCADING & LOGSTASH 
Elasticsearch Meetup, Oct 30 2014
WHAT IS LOG FILE ANALYTICS? 
• Making sense of large amounts of [semi|un]structured data 
• What type of log file data? 
‣ Syslog 
‣ Web log files (Apache, Nginx, WebTrends, Omniture) 
‣ POS transactions 
‣ Advertising impressions (Doubleclick DART, OpenX, Atlas) 
‣ Twitter firehose (yes, it’s a log file!) 
• Anything with a timestamp and data 
LOGSTASH ARCHITECTURE 
http://www.slashroot.in/logstash-tutorial-linux-central-logging-server 
• Data collection is flexible 
• Lots of input/output plugins 
• Grok filtering is easy 
• Kibana UI is attractive
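The flexible collection described above can be sketched as a minimal Logstash pipeline: one input plugin, a Grok filter, and an Elasticsearch output. The log path, pattern, and host below are illustrative placeholders, not values from the talk:

```
input {
  file {
    path => "/var/log/apache2/access.log"   # hypothetical source log
  }
}

filter {
  grok {
    # parse Apache combined-format access lines without writing code
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch { hosts => ["localhost:9200"] }   # hypothetical ES endpoint
}
```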
WHAT CAN WE DO WITH CASCADING + LOGSTASH? 
• Provide richer log-processing capabilities 
• Integrate & correlate with other information 
‣ Large list of integration adapters 
• Analyze large volumes of log data 
• Capture & retain unfiltered log data 
• Operationalize your log-processing application 
GET TO KNOW CONCURRENT 
Leader in Application Infrastructure for Big Data 
• Building enterprise software to simplify Big Data application 
development and management 
Products and Technology 
• CASCADING 
Open Source - The most widely used application infrastructure for 
building Big Data apps with over 175,000 downloads each month 
• DRIVEN 
Enterprise data application management for Big Data apps 
Proven — Simple, Reliable, Robust 
• Thousands of enterprises rely on Concurrent to provide their data 
application infrastructure. 
Founded: 2008 
HQ: San Francisco, CA 
CEO: Gary Nakamura 
CTO, Founder: Chris Wensel 
www.concurrentinc.com
CASCADING - DE-FACTO STANDARD FOR DATA APPS 
Cascading Apps 
[Diagram: Cascading apps (SQL, Clojure, Ruby) running on new fabrics (Tez, Storm) and on supported fabrics and data stores: Mainframe, DB/DW, In-Memory Data Stores, Hadoop] 
• Standard for enterprise 
data app development 
• Your programming 
language of choice 
• Cascading applications 
that run on MapReduce 
will also run on Apache 
Spark, Storm, and …
CASCADING 3.0 
“Write once and deploy on your fabric of choice.” 
• The Innovation — Cascading 3.0 will 
allow for data apps to execute on 
existing and emerging fabrics 
through its new customizable query 
planner. 
• Cascading 3.0 will support — Local 
In-Memory, Apache MapReduce and 
soon thereafter (3.1) Apache Tez, 
Apache Spark and Apache Storm 
[Diagram: Enterprise Data Applications running on computation fabrics: Local In-Memory, MapReduce, and Apache Tez, Storm, ...]
… AND INCLUDES RICH SET OF EXTENSIONS 
http://www.cascading.org/extensions/
DEMO: WORD COUNT EXAMPLE WITH CASCADING 
String docPath = args[ 0 ]; 
String wcPath = args[ 1 ]; 

// configuration 
Properties properties = new Properties(); 
AppProps.setApplicationJarClass( properties, Main.class ); 
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties ); 

// integration: create source and sink taps 
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath ); 
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath ); 

// processing: specify a regex to split "document" text lines into a token stream 
Fields token = new Fields( "token" ); 
Fields text = new Fields( "text" ); 
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" ); 
// only returns "token" 
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); 

// determine the word counts 
Pipe wcPipe = new Pipe( "wc", docPipe ); 
wcPipe = new GroupBy( wcPipe, token ); 
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); 

// scheduling: connect the taps, pipes, etc., into a flow definition 
FlowDef flowDef = FlowDef.flowDef().setName( "wc" ) 
  .addSource( docPipe, docTap ) 
  .addTailSink( wcPipe, wcTap ); 

// create the Flow 
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- unit of work 
wcFlow.complete(); // <<-- runs jobs on the cluster
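To see what the GroupBy/Count flow above computes, here is the same logic in plain Java (a sketch using only the standard library, not the Cascading API); the class and method names are illustrative:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java sketch of the word-count flow: split each line on the demo's
// token regex, then group equal tokens and count them.
public class WordCountSketch {

    // Same token-splitting pattern as the RegexSplitGenerator in the demo:
    // spaces, square brackets, parentheses, commas, and periods.
    static final String SPLIT_REGEX = "[ \\[\\](),.]";

    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split(SPLIT_REGEX)))
            .filter(tok -> !tok.isEmpty())   // drop empty tokens between delimiters
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(List.of("the quick fox", "the [lazy] dog."));
        System.out.println(counts);   // e.g. {the=2, quick=1, fox=1, lazy=1, dog=1}
    }
}
```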
SOME COMMON PATTERNS 
• Functions 
• Filters 
• Joins 
‣ Inner / Outer / Mixed 
‣ Asymmetrical / Symmetrical 
• Merge (Union) 
• Grouping 
‣ Secondary Sorting 
‣ Unique (Distinct) 
• Aggregations 
‣ Count, Average, etc. 
[Diagram: a pipeline of functions and filters over data; split, join, and merge topologies]
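As an illustration of the join pattern above, here is an inner join of two keyed lists in plain Java, a sketch of what Cascading's CoGroup pipe does rather than its API; all type and field names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the inner-join pattern: pair up records from two streams
// that share the same key, dropping rows with no match on either side.
public class InnerJoinSketch {

    record User(String id, String name) {}
    record Click(String userId, String url) {}

    static List<String> innerJoin(List<User> users, List<Click> clicks) {
        // index the left side by key, then probe it with the right side
        Map<String, User> byId = users.stream()
            .collect(Collectors.toMap(User::id, u -> u));
        List<String> joined = new ArrayList<>();
        for (Click c : clicks) {
            User u = byId.get(c.userId());
            if (u != null) {                     // inner join: keep matches only
                joined.add(u.name() + " -> " + c.url());
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<User> users = List.of(new User("u1", "ada"), new User("u2", "bob"));
        List<Click> clicks = List.of(new Click("u1", "/home"), new Click("u3", "/x"));
        System.out.println(innerJoin(users, clicks));   // [ada -> /home]
    }
}
```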
THE STANDARD FOR DATA APPLICATION DEVELOPMENT 
www.cascading.org 
Proven application development framework for building data apps 
Build data apps that are scale-free: design principles ensure best practices at any scale 
Application platform that addresses: 
• Test-Driven Development: efficiently test code and process local files before deploying on a cluster 
• Staffing Bottleneck: use existing Java, SQL, and modeling skill sets 
• Application Portability: write once, then run on different computation fabrics 
• Operational Complexity: simple; package up into one jar and hand to operations 
• Systems Integration: Hadoop never lives alone; easily integrate with existing systems 
CASCADING 
• Java API 
• Separates business logic from integration 
• Testable at every lifecycle stage 
• Works with any JVM language 
• Many integration adapters 
[Diagram: Cascading stack. The Processing API and Integration API sit over the Process Planner, Scheduler API, and Scheduler, which run on Apache Hadoop and data stores; accessible from enterprise Java and scripting languages (Scala, Clojure, JRuby, Jython, Groovy)]
BUSINESSES DEPEND ON US 
• Cascading Java API 
• Data normalization and cleansing of search and click-through logs for 
use by analytics tools, Hive analysts 
• Easy to operationalize heavy lifting of data in one framework 
BROAD SUPPORT 
Hadoop ecosystem supports Cascading
CASCADING DEPLOYMENTS 
OPERATIONAL EXCELLENCE 
Visibility Through All Stages of App Lifecycle 
From Development — Building and Testing 
• Design & Development 
• Debugging 
• Tuning 
To Production — Monitoring and Tracking 
• Maintain Business SLAs 
• Balance & Controls 
• Application and Data Quality 
• Operational Health 
• Real-time Insights 
DRIVEN ARCHITECTURE
DEEPER VISUALIZATION INTO YOUR HADOOP CODE 
• Easily comprehend, debug, and tune 
your data applications 
• Get rich insights on your application 
performance 
• Monitor applications in real-time 
• Compare app performance with 
historical (previous) iterations 
Debug and optimize your Hadoop applications more effectively with Driven
GET OPERATIONAL INSIGHTS WITH DRIVEN 
• Quickly break down how often 
applications execute based on their tags, 
teams, or names 
• Immediately identify if any application is 
monopolizing cluster resources 
• Understand the utilization of your cluster 
with a timeline of all applications running 
Visualize the activity of your applications to help maintain SLAs
ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY 
• Easily keep track of all your 
applications by segmenting them with 
user-defined tags 
• Segment your applications for 
trending analysis, cluster analysis, 
and developing chargeback models 
• Quickly break down how often 
applications execute based on their 
tags, teams, or names 
Segment your applications for greater insights across all your applications
COLLABORATE WITH TEAMS 
Utilize teams to collaborate and gain visibility over your set of applications 
• Invite others to view and collaborate 
on a specific application 
• Gain visibility to all the apps and their 
owners associated with each team 
• Simply manage your teams and the 
users assigned to them 
MANAGE PORTFOLIO OF BIG DATA APPLICATIONS 
Fast, powerful, rich search capabilities enable you to easily find the exact set of 
applications that you're looking for 
• Identify problematic apps with their 
owners and teams 
• Search for groups of applications 
segmented by user-defined tags 
• Compare specific applications with their 
previous iterations to ensure that your 
application can meet its SLA 
DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS 
• Understand the anatomy of your Hive app 
• Track execution of queries as a single business process 
• Identify outlier behavior by comparison with historical runs 
• Analyze rich operational meta-data 
• Correlate Hive app behavior with other events on cluster 
TAKE AWAY POINTS 
• Logstash provides a flexible and robust way to collect 
log data; Grok lets you parse logs without coding, and the 
Kibana UI makes the information easy to analyze 
• Cascading is the de-facto framework for building Big Data 
(Hadoop) applications and processing data at scale 
• Cascading + Logstash lets you develop applications to 
collect and process large volumes of data 
• With Driven, you can put your mission-critical log-processing 
applications in production and monitor SLAs 
CONTACT INFORMATION 
Supreet Oberoi 
supreet@concurrentinc.com 
650-868-7675 (m) 
@supreet_online
DRIVING INNOVATION 
THROUGH DATA 
THANK YOU 
Supreet Oberoi