Ben Slater, Instaclustr
Processing 50,000 events per second with Cassandra and Spark
Introduction
• Ben Slater, Chief Product Officer, Instaclustr
• Cassandra + Spark Managed Service, Support, Consulting
• 20+ years' experience as a developer, architect and dev/DevOps team lead
• DataStax MVP for Apache Cassandra
Processing 50,000 events per second with Cassandra and Spark
1 Problem background and overall architecture
2 Implementation process & lessons learned
3 What’s next?
Problem background
• How to efficiently monitor >600 servers all running Cassandra
• Need to develop a metric history over time for tuning alerting & automated response systems
• Off-the-shelf systems are available, but:
• they probably don't give us the flexibility we want to be able to optimize for our environment
• we wanted a meaty problem to tackle ourselves, to dog-food our own offering and build our internal skills and understanding
Solution Overview
[Architecture diagram] Managed nodes (AWS, Azure, SoftLayer – x many) · Riemann (x3) · RabbitMQ (x2) · Cassandra + Spark cluster (x15) · Console/API (x2) · Admin tools · PagerDuty
500 nodes * ~2,000 metrics / 20 secs = 50k metrics/sec
Implementation Approach
1. Writing Data
2. Rolling Up Data
3. Presenting Data
~9(!) months (with quite a few detours and distractions)
Writing Data
• Worked, Filled Up, Worked, Broke, Kind of Works, Works!
• Key lessons:
• Aligning Data Model with DTCS
• Initial design did not have time value in partition key
• Settled on bucketing by 5 mins
• Enables DTCS to work
• Works really well for extracting data for roll-up
• Adds complexity for retrieving data
• When running with STCS, needed unchecked_tombstone_compaction = true to avoid a build-up of TTL'd data
• Batching of writes
• Found batching of 200 rows per insert to give optimal throughput and client load (an illustrative schema and write sketch follows this list)
• See Adam’s talk from yesterday for all the detail
• Controlling data volumes from column family metrics
• Limited, rotating set of CFs per check-in
• Managing back pressure is important
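As a rough illustration of the bucketing and batching lessons above (not the actual Instametrics schema), a minimal sketch: a hypothetical raw-metrics table with the 5-minute bucket in the partition key so DTCS can expire whole SSTables, plus a Spark write batched at ~200 rows. The table, column names, TTL and sample RDD are assumptions.

import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.spark.connector.writer.{RowsInBatch, WriteConf}

// Hypothetical raw table: the 5-minute bucket is part of the partition key, so
// data of similar age lands together and DTCS can drop whole TTL'd SSTables.
CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute(
    """CREATE TABLE IF NOT EXISTS instametrics.events_raw_5m_example (
      |  host text, bucket_5m timestamp, service text, ts timestamp, value double,
      |  PRIMARY KEY ((host, bucket_5m, service), ts)
      |) WITH compaction = {'class': 'DateTieredCompactionStrategy'}
      |  AND default_time_to_live = 1209600""".stripMargin)
}

// Illustrative write path: ~200 rows per batch, mirroring the figure above.
val now = new java.util.Date()
val metricsRDD = sc.parallelize(Seq(("host1", now, "cpu.idle", now, 42.0)))
metricsRDD.saveToCassandra(
  "instametrics", "events_raw_5m_example",
  SomeColumns("host", "bucket_5m", "service", "ts", "value"),
  writeConf = WriteConf(batchSize = RowsInBatch(200)))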
Rolling Up Data
• Works?, Doesn’t Work, Doesn’t Work, Doesn’t Work, Doesn’t Work, Works!
• Developing a functional solution was easy; getting to acceptable performance was hard (and time-consuming), but seemed easy once we'd solved it
• Keys to performance?
• Align raw data partition bucketing with roll-up timeframe (5 mins)
• Use joinWithCassandraTable to extract the required data – 2-3x performance improvement over alternative approaches (a follow-on roll-up sketch appears after this list)
// Assumes: import com.datastax.spark.connector._
val RDDJoin = sc.cassandraTable[(String, String)]("instametrics", "service_per_host")
  // keep only the services matching one of the broadcast regex patterns
  .filter(a => broadcastListEventAll.value.exists(r => a._2.matches(r)))
  // build (host, 5-minute bucket, service) keys for the raw events table
  .map(a => (a._1, dateBucket, a._2))
  // co-locate each key with its Cassandra replica, then join against the raw data
  .repartitionByCassandraReplica("instametrics", "events_raw_5m", 100)
  .joinWithCassandraTable("instametrics", "events_raw_5m").cache()
• Write limiting (e.g. spark.cassandra.output.throughput_mb_per_sec) not necessary as writes << reads
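As a follow-on to the join above, a minimal sketch of the aggregation step, assuming the joined rows expose a numeric "value" column and that per host/bucket/service averages go to a hypothetical events_rollup_5m_example table (the real roll-up functions and column names may differ; same connector imports as above):

// RDDJoin is an RDD of ((host, bucket, service), CassandraRow) pairs after the join.
val rollups = RDDJoin
  .map { case (key, row) => (key, (row.getDouble("value"), 1L)) }
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  .map { case ((host, bucket, service), (sum, n)) => (host, bucket, service, sum / n) }

rollups.saveToCassandra(
  "instametrics", "events_rollup_5m_example",
  SomeColumns("host", "bucket_5m", "service", "avg_value"))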
Presenting Data
• Generally, just worked
• Main challenge was working out how to find the latest data in buckets when not all data is reported in each data set (illustrative lookup sketch below)
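For illustration, a minimal sketch of that bucket-walking lookup against the hypothetical schema above (Java driver session; names are assumptions): step backwards through recent 5-minute buckets until one contains a row, since a given metric may not appear in every bucket.

import com.datastax.driver.core.Session

// Return the most recent value for a host/service by checking candidate
// 5-minute buckets newest-first and stopping at the first bucket with data.
def latestValue(session: Session, host: String, service: String,
                recentBuckets: Seq[java.util.Date]): Option[Double] =
  recentBuckets.view.flatMap { bucket =>
    val rs = session.execute(
      "SELECT value FROM instametrics.events_raw_5m_example " +
        "WHERE host = ? AND bucket_5m = ? AND service = ? ORDER BY ts DESC LIMIT 1",
      host, bucket, service)
    Option(rs.one()).map(_.getDouble("value"))
  }.headOption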
What’s Next
• Decisions to revisit:
• Use Spark Streaming for 5 min roll-ups rather than save and extract (see the sketch after this list)
• Scale-out by adding nodes is working as expected
• Continue to add additional metrics to roll-ups as we add functionality
• Plan to introduce more complex analytics & feed historic values back to Riemann for use in alerting
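A rough sketch of the Spark Streaming alternative from the first bullet (batch interval, input stream and output table are illustrative assumptions, not the production design): roll up each 5-minute micro-batch directly instead of saving raw rows and re-extracting them.

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

// 5-minute micro-batches; the queueStream stands in for the real metric feed.
val ssc = new StreamingContext(sc, Seconds(300))
val metricStream = ssc.queueStream(mutable.Queue.empty[RDD[(String, String, Double)]])

metricStream                                        // (host, service, value)
  .map { case (host, service, value) => ((host, service), (value, 1L)) }
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  .map { case ((host, service), (sum, n)) => (host, service, sum / n) }
  .saveToCassandra("instametrics", "rollup_stream_example",  // time bucket omitted for brevity
    SomeColumns("host", "service", "avg_value"))

ssc.start()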
Questions?
Further info:
• Scaling Riemann:
https://www.instaclustr.com/blog/2016/05/03/post-500-nodes-high-availability-scalability-with-riemann/
• Riemann Intro:
https://www.instaclustr.com/blog/2015/12/14/monitoring-cassandra-and-it-infrastructure-with-riemann/
• Instametrics Case Study:
https://www.instaclustr.com/project/instametrics/
• Multi-DC Spark Benchmarks:
https://www.instaclustr.com/blog/2016/04/21/multi-data-center-sparkcassandra-benchmark-round-2/
• Top Spark Cassandra Connector Tips:
https://www.instaclustr.com/blog/2016/03/31/cassandra-connector-for-spark-5-tips-for-success/
Thanks for attending!