Apache Flume 
Data Aggregation At Scale 
Arvind Prabhakar 
© 2014 StreamSets, Inc. All rights reserved
Who am I? 
❏ Founder/CTO, StreamSets 
Apache Software Foundation 
❏ Flume - PMC Chair 
❏ Sqoop - PMC Chair 
❏ Storm - PMC, Committer 
❏ MetaModel - Mentor 
❏ Sentry - Mentor 
❏ ASF Member 
Previously... 
❏ Cloudera 
❏ Informatica
What is Flume? 
[Diagram: data sources (logs, files, click streams, sensors/devices, database logs, 
social data streams, feeds, other) flowing through Flume into the enterprise data 
infrastructure: raw storage (HDFS, S3), EDW/NoSQL (Hive, Impala, HBase, Cassandra), 
and search (Solr, ElasticSearch)] 
Apache Flume is a 
continuous data 
ingestion system that 
is... 
● open-source, 
● reliable, 
● scalable, 
● manageable, 
● customizable, 
...and designed for the Big Data ecosystem.
...for the Big Data ecosystem? 
“Big data is an all-encompassing term for any collection 
of data sets so large and complex that it becomes difficult 
to process using traditional data processing applications.” 
Big Data from a Data Ingestion Perspective 
● Logical Data Sources are physically distributed 
● Data production is continuous / never ending 
● Data structure and semantics change without notice 
Physically Distributed Data Sources 
● Many physical sources 
that produce data 
● Number of physical 
sources changes 
constantly 
● Sources may exist in 
different governance 
zones, data centers, 
continents...
Continuous Data Production 
“Every two days now we create as much information 
as we did from the dawn of civilization up until 2003” 
- Eric Schmidt, 2010 
● Weather 
● Traffic 
● Automobiles 
● Trains 
● Airplanes 
● Geological/Seismic 
● Oceanographic 
● Smart Phones 
● Health Accessories 
● Medical Devices 
● Home Automation 
● Digital Cameras 
● Social Media 
● Geolocation 
● Shop Floor Sensors 
● Network Activity 
● Industry Appliances 
● Security/Surveillance 
● Server Workloads 
● Digital Telephony 
● Bio-simulations...
Ever Changing Structure of Data 
● One of your data centers upgrades to IPv6: 
  192.168.0.4  becomes  fe80::21b:21ff:fe83:90fa 
● Application developer changes logs (again): 
  M0137: User {jonsmith} granted access to {accounts} 
  M0137: [jonsmith] granted access to [sys.accounts] 
● JSON data may contain more attributes than expected: 
  { "first": "jon", "last": "smith", "add1": "123 Main St.", "add2": "Ste - 4", 
    "city": "Little Town", "state": "AZ", "zip": "12121" } 
  { "first": "jon", "last": "smith", "add1": "123 Main St.", "add2": "Ste - 4", 
    "city": "Little Town", "state": "AZ", "zip": "12121", "phone": "(408) 555-1212" } 
So, from Data Ingestion Perspective: 
Massive collection of ever changing physical sources... 
Never ending data production... 
Data structure and semantics evolve continuously... 
Flume to the Rescue!
Apache Flume 
● Originally designed to be a log 
aggregation system by 
Cloudera Engineers 
● Evolved to handle any type of 
streaming event data 
● Low-cost of installation, 
operation and maintenance 
● Highly customizable and extensible 
A Closer Look at Flume 
[Diagram: Input → Agent → Agent → Agent → Agent → Destination] 
● Distributed Pipeline Architecture 
● Optimized for commonly used data sources and destinations 
● Built in support for contextual routing 
● Fully customizable and extensible 
Anatomy of a Flume Agent 
[Diagram: Incoming Data → Source → Channel → Sink → Outgoing Data, all inside one Flume Agent] 
Source 
● Accepts incoming data 
● Scales as required 
● Writes data to Channel 
Sink 
● Removes data from Channel 
● Sends data to downstream Agent or Destination 
Channel 
● Stores data in the order received
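
To make this concrete, here is a minimal sketch of a single-agent pipeline in Flume's 
properties-file configuration format; the agent and component names (a1, r1, c1, k1) 
and the port are illustrative placeholders: 

  # Name the components of agent "a1" 
  a1.sources = r1 
  a1.channels = c1 
  a1.sinks = k1 

  # Source: accepts incoming data and writes it to the channel 
  a1.sources.r1.type = netcat 
  a1.sources.r1.bind = localhost 
  a1.sources.r1.port = 44444 
  a1.sources.r1.channels = c1 

  # Channel: buffers events in the order received 
  a1.channels.c1.type = memory 
  a1.channels.c1.capacity = 10000 

  # Sink: removes events from the channel and logs them 
  a1.sinks.k1.type = logger 
  a1.sinks.k1.channel = c1 

An agent with this configuration would typically be launched with something like: 
flume-ng agent --name a1 --conf-file example.conf --conf conf 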
Transactional Data Exchange 
[Diagram: an upstream Agent's Sink TX delivers data into this Agent's Source; the 
Source writes to the Channel under a Source TX, and the Sink removes data from the 
Channel under its own Sink TX] 
● Source uses transactions to write to the channel 
● Sink uses transactions to remove data from the channel 
● Sink transaction commits only after successful transfer of data 
● This ensures no data loss in the Flume pipeline (a batch-sizing sketch follows)
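
In practice the unit of exchange is a batch: a sink takes each batch inside one 
channel transaction, so a sink's batch size should not exceed the channel's 
transactionCapacity. A sketch, with hypothetical names and illustrative values: 

  # The channel's transaction capacity bounds the largest batch a sink may take 
  a1.channels.c1.type = memory 
  a1.channels.c1.capacity = 100000 
  a1.channels.c1.transactionCapacity = 1000 

  # Avro sink to a downstream agent: one batch per transaction, committed 
  # only after the downstream agent acknowledges receipt 
  a1.sinks.k1.type = avro 
  a1.sinks.k1.channel = c1 
  a1.sinks.k1.hostname = collector-1.example.com 
  a1.sinks.k1.port = 4141 
  a1.sinks.k1.batch-size = 1000 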
Routing and Replicating 
[Diagram: Incoming Data → Source → Channel 1 → Sink 1 → Outgoing Data, and 
Source → Channel 2 → Sink 2 → Outgoing Data, inside one Flume Agent] 
● Source can replicate or multiplex data across many channels 
● Metadata headers can be used to do contextual selection of 
channels 
● Channels can be drained by different sinks to different 
destinations or pipelines (a selector sketch follows)
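
As an illustration of contextual channel selection, a sketch of a multiplexing 
selector keyed on a metadata header; the header name (datacenter) and its values 
are hypothetical: 

  # Route events to a channel based on the "datacenter" header; 
  # unmatched events fall through to the default channel 
  a1.sources.r1.channels = c1 c2 
  a1.sources.r1.selector.type = multiplexing 
  a1.sources.r1.selector.header = datacenter 
  a1.sources.r1.selector.mapping.east = c1 
  a1.sources.r1.selector.mapping.west = c2 
  a1.sources.r1.selector.default = c1 

  # Replicating (the default selector) would instead copy every event to all channels: 
  # a1.sources.r1.selector.type = replicating 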
Why Channels? 
● Buffers data and insulates downstream systems from load spikes 
● Provides a persistent store for data in case the process restarts (see the file channel sketch below) 
● Provides flow ordering* and transactional guarantees 
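
The persistence point is typically realized with the file channel, which journals 
events to disk and survives agent restarts; a minimal sketch with placeholder paths: 

  # File channel: events survive a process restart; spreading dataDirs 
  # across separate physical disks improves throughput 
  a1.channels.c1.type = file 
  a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint 
  a1.channels.c1.dataDirs = /disk1/flume/data,/disk2/flume/data 
  a1.channels.c1.capacity = 1000000 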
Use-Case: Log Aggregation
Starting Out Simple 
● You would like to move your web-server logs to HDFS 
● Let’s assume there are only 3 web 
servers at the time of launch 
● An ad-hoc solution will likely suffice! 
Challenges 
● How do you manage your output paths on HDFS? (see the sketch below) 
● How do you maintain your client code in the face of a changing 
environment and requirements?
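
On the output-path question: the HDFS sink can derive its directory layout from 
event timestamps using escape sequences, which keeps path logic out of the client 
code; a sketch with illustrative paths and values: 

  # Bucket web-server logs into per-day directories 
  a1.sinks.k1.type = hdfs 
  a1.sinks.k1.channel = c1 
  a1.sinks.k1.hdfs.path = /flume/weblogs/%Y/%m/%d 
  a1.sinks.k1.hdfs.filePrefix = access 
  a1.sinks.k1.hdfs.fileType = DataStream 
  a1.sinks.k1.hdfs.rollInterval = 300 
  # use the agent's clock when events carry no timestamp header 
  a1.sinks.k1.hdfs.useLocalTimeStamp = true 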
Adding a Single Flume Agent 
Advantages 
● Insulation from HDFS downtime 
● Quick offload of logs from Web 
Server machines 
● Better Network utilization 
Challenges 
● What if the Flume node goes down? 
● Can one Flume node accommodate all load from Web Servers? 
Adding Two Flume Agents 
Advantages 
● Redundancy and Availability 
● Better handling of downstream 
failures 
● Automatic load balancing and failover (sketched below) 
Challenges 
● What happens when new Web Servers are added? 
● Can two Flume Agents keep up with all the load from more Web 
Servers? 
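
The load balancing and failover are configured on the upstream (web server) agents 
through a sink group; a sketch with hypothetical hostnames: 

  # Client agent: two avro sinks draining one channel, grouped for load balancing 
  client.sinks = k1 k2 
  client.sinks.k1.type = avro 
  client.sinks.k1.channel = c1 
  client.sinks.k1.hostname = flume-agg-1.example.com 
  client.sinks.k1.port = 4141 
  client.sinks.k2.type = avro 
  client.sinks.k2.channel = c1 
  client.sinks.k2.hostname = flume-agg-2.example.com 
  client.sinks.k2.port = 4141 

  client.sinkgroups = g1 
  client.sinkgroups.g1.sinks = k1 k2 
  client.sinkgroups.g1.processor.type = load_balance 
  client.sinkgroups.g1.processor.backoff = true 
  client.sinkgroups.g1.processor.selector = round_robin 
  # Alternative: processor.type = failover, with per-sink priorities 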
Handling a Server Farm 
A Converging Flow 
● Traffic is aggregated by Tier-2 and 
Tier-3 before being put into 
destination system 
● The closer a tier is to the destination, the larger the batch size 
it delivers downstream (sketched below) 
● Optimized handling of destination 
systems
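
A sketch of how exit batch sizes might grow along the converging flow; agent names 
and values are illustrative: 

  # Tier-1 (web server) agent: small exit batches toward the aggregation tier 
  tier1.sinks.k1.type = avro 
  tier1.sinks.k1.channel = c1 
  tier1.sinks.k1.hostname = tier2-agent.example.com 
  tier1.sinks.k1.port = 4141 
  tier1.sinks.k1.batch-size = 100 

  # Tier-2 (aggregator) agent: larger exit batches toward the destination 
  tier2.sinks.k1.type = hdfs 
  tier2.sinks.k1.channel = c1 
  tier2.sinks.k1.hdfs.path = /flume/weblogs/%Y/%m/%d 
  tier2.sinks.k1.hdfs.batchSize = 1000 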
Data Volume Per Agent 
[Chart: Batch Size Variation per Agent] 
● Event volume is lowest in the 
outermost tier 
● Event volume increases as the 
flow converges 
● Event volume is highest in the 
innermost tier
Data Volume Per Tier 
[Chart: Batch Size Variation per Tier] 
● In steady state, all tiers carry 
the same event volume 
● Transient variations in flow are 
absorbed and ironed out by 
channels 
● Load spikes are handled smoothly 
without overwhelming the 
infrastructure
Planning and Sizing Flume Topology 
for Log-Aggregation Use-Case 
Planning and Sizing Your Topology 
What we need to know: 
● Number of Web Servers 
● Log volume per Web 
Server per unit time 
● Destination System and 
layout (Routing 
Requirements) 
● Worst case downtime for 
destination system 
What we will calculate: 
● Number of tiers 
● Exit Batch Sizes 
● Channel capacity
Calculating Number of Tiers 
Rule of Thumb 
One Aggregating Agent can be used with 
anywhere from 4 to 16 client Agents (worked example below) 
Considerations 
● Must handle projected ingest volume 
● Resulting number of tiers should provide for 
routing, load-balancing and failover 
requirements 
Gotchas 
Load test to ensure that steady state and peak 
load are addressed with adequate failover capacity 
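
A worked example, with assumed numbers that are purely illustrative: 

  web servers (client Agents) = 100            (assumed) 
  fan-in ratio                = 8:1            (within the 4-16 rule of thumb) 
  Tier-2 aggregators          = 100 / 8, rounded up = 13 
  Tier-3 aggregators          = 13 / 8, rounded up  = 2 (also provides redundancy) 
  Result: three tiers in all; the client tier, Tier-2, and Tier-3 writing to the destination 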
Calculating Exit Batch Size 
Rule of Thumb 
Exit batch size is the total exit data volume 
divided by the number of Agents in a tier (worked example below) 
Considerations 
● Having some extra room is good 
● Keep contextual routing in mind 
● Consider duplication impact when batch 
sizes are large 
Gotchas 
Load test fail-over scenario to ensure near 
steady-state drain 
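
A worked example with assumed numbers: 

  steady-state exit volume of tier = 10,000 events/sec   (assumed) 
  agents in tier                   = 10 
  per-agent exit volume            = 10,000 / 10 = 1,000 events/sec 
  exit batch size                  = roughly 1,000 events, plus headroom so that 
                                     surviving agents can drain near steady state 
                                     when one agent fails over 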
Calculating Channel Capacity 
Rule of Thumb 
Equal to the worst-case data ingest rate sustained 
over the worst-case downstream outage interval 
(worked example below) 
Considerations 
● Multiple disks will yield better performance 
● Channel size impacts the back-pressure 
buildup in the pipeline 
Gotchas 
You may need more disk space than the 
physical footprint of the data size 
[Diagram: a failed sink (marked X) stops draining its channel while the source 
continues to ingest]
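
A worked example with assumed numbers, applied to a file channel; values are 
illustrative: 

  # worst-case ingest rate        = 2,000 events/sec      (assumed) 
  # worst-case downstream outage  = 4 hours = 14,400 sec  (assumed) 
  # required capacity             = 2,000 x 14,400 = 28,800,000 events 
  a1.channels.c1.type = file 
  a1.channels.c1.capacity = 30000000 
  # budget more raw disk than the event payload alone, since the 
  # channel's on-disk format adds overhead 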
To Recap 
Number of Tiers 
Calculated with upstream-to-downstream Agent ratios ranging from 4:1 to 
16:1. Factor in routing, failover, and load-balancing requirements... 
Exit Batch Size 
Calculated for steady state data volume exiting the tier, divided by 
number of Agents in that tier. Factor in contextual routing and duplication 
due to transient failure impact... 
Channel Capacity 
Calculated as worst case ingest rate sustained over the worst case 
downstream downtime. Factor in number of disks used etc... 
...and that’s all there is to it!
Some Highlights of Flume 
● Flume is suitable for large volume data collection, especially when 
data is being produced in multiple locations 
● Once planned and sized appropriately, Flume will practically run 
itself without any operational intervention 
● Flume provides a weak ordering guarantee, i.e., in the absence of 
failures the data will arrive in the order it was received in the Flume 
pipeline 
● Transactional exchange ensures that Flume never loses any data in 
transit between Agents. Sinks use transactions to ensure data is not 
lost at point of ingest or terminal destinations. 
● Flume has rich out-of-the-box features such as contextual routing, 
and support for popular data sources and destination systems 
Things that could be better... 
● Handling of poison events 
● Ability to tail files 
● Ability to handle preset data formats such as JSON, CSV, XML 
● Centralized configuration 
● Once-only delivery semantics 
● ...and more 
Remember: patches are welcome! 
Thank You! 
Contact: 
● Email: arvind at streamsets dot com 
● Twitter: @aprabhakar 
More on Flume: 
● http://flume.apache.org/ 
● User Mailing List: user-subscribe@flume.apache.org 
● Developer Mailing List: dev-subscribe@flume.apache.org 
● JIRA: https://issues.apache.org/jira/browse/FLUME 
