A Real-time Processing System
based on Spark Streaming
in the field of
Telecommunications
For Hadoop Summit 2017
Geng Wang
Dong Wang
CONTENT
01
What we faced in Telecommunications
More data are produced
• 0.166 million new users / day
• 33G data / sec by mobile
• 10T voice data / day
• 100T signal data / day
What we faced in Telecommunications
More Real-time Requirements
• Smart Tourism
  • Tourist count and analysis
  • Best choice of tourist resort
  • Recommendation of travel routes
  • …
• Intelligent marketing
  • Product recommendation for specific customers
  • Based on multiple dimensions (location, age, salary …)
Evolution of Requirement
• 2014: Real-time marketing
• 2015: Operation based on location
• 2016: More data input
  • 2/3/4G signal of location
  • Content of business
• 2017: Hard real time
  • CEP, Esper
  • …
CONTENT
02
Framework - High level
Data Input → Tagging → Data Output
Detailed Framework
• Hadoop layer
  • Basic components
• OCSP core
  • Data pre-processing
  • Tagging
  • Event output (select and filter)
  • Multiple engines (Spark Streaming and Storm)
  • Multi-tenant
  • Checkpoint
• Data transformation
  • Parse data to Kafka
  • Nifi and Flume
  • Customized processor & sink
• Data source
  • Socket
  • Local files
  • HDFS
Framework - Data Input
• Flume agent 1: 2G/3G signal of location
• Flume agents 2, 3, 4: 4G signal of location
• Nifi: content data of business
• Output: Kafka partitions
Data Preprocessing
• Input: Kafka
• Each source schema (Schema 1, Schema 2, Schema 3) has its own transform, followed by select and filter
  • Select Expr 1 (imsi), Filter Expr 1 (imsi!=0)
  • Select Expr 2, Filter Expr 2
  • Select Expr 3, Filter Expr 3
• Output: uniform schema
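The select and filter step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not OCSP code: the "|" delimiter, the two source layouts, the uniform field list, and the helper name to_uniform are hypothetical. Only the select expression imsi and the filter imsi!=0 come from the slide.

```python
from typing import Optional

# Hypothetical per-source schemas: the field order of each raw record.
SCHEMAS = {
    "schema1": ["imsi", "location", "timestamp"],
    "schema2": ["timestamp", "imsi", "location"],
}

# Assumed uniform output schema; this plays the role of the select expression.
UNIFORM_FIELDS = ["imsi", "location", "timestamp"]


def to_uniform(raw: str, schema: str, sep: str = "|") -> Optional[dict]:
    """Parse one raw record, apply the filter, and select the uniform fields."""
    values = [v.strip() for v in raw.split(sep)]
    fields = SCHEMAS[schema]
    if len(values) != len(fields):
        return None  # malformed record: field count does not match the schema
    record = dict(zip(fields, values))
    # Filter expression from the slide: imsi != 0.
    # Dropping an empty imsi as well is an added assumption.
    if record["imsi"] in ("", "0"):
        return None
    return {name: record[name] for name in UNIFORM_FIELDS}
```

Records from differently ordered sources come out with the same keys, so the later tagging stage only needs to understand one layout.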
Tagging and Label
• Tagging process
  • Get by key from Codis
  • Customized operation
• Labels
  • User info
  • Stay duration
  • Cycle of marketing
  • User name
  • imsi
  • Phone number
  • Base station
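The two kinds of tagging named here, a cache lookup by key and a customized operation, might look like the sketch below. It is an illustration only: Codis is a Redis-compatible cache cluster, so a real deployment would use a Redis client, and a plain dictionary stands in for it here. The cache contents, the known_user flag, and the function names tag_record and update_stay are hypothetical.

```python
# In-memory stand-in for the Codis cache, keyed by imsi (hypothetical data).
USER_CACHE = {
    "460001234": {"name": "user_a", "phone": "13800000000"},
}


def tag_record(record: dict, cache: dict) -> dict:
    """Return a copy of the record with user labels attached (get by key)."""
    labels = cache.get(record["imsi"], {})
    tagged = dict(record)
    tagged.update(labels)
    # Record whether the lookup hit, so later filters can drop unknown users.
    tagged["known_user"] = bool(labels)
    return tagged


def update_stay(state: dict, imsi: str, location: str, ts: int) -> int:
    """Customized operation: seconds the user has stayed at the current location."""
    previous = state.get(imsi)
    if previous is None or previous[0] != location:
        state[imsi] = (location, ts)  # user entered a new location
        return 0
    return ts - previous[1]
```

The stay-duration state is kept per imsi, which is why keyed messages matter upstream: all signals for one user need to reach the same state.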
Select, filter & Output
• Input: data with labels
• Output 1: User path in a duration
• Output 2: New user marketing
• Output 3: User with specific location & specific business
• Others: Current location update for each user
• Output targets: Codis, Kafka
Configurable process
• Data with labels
• Check interval
• Filter
• Select
• Output (Codis)
• End
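One way to read the configurable process is as a list of output rules, each with its own interval check, filter, and select. The sketch below is an assumption about how such rules could be expressed, not the OCSP configuration format: the OutputRule class, the route function, the is_new label, and the reading of "check interval" as a per-user minimum gap between events are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class OutputRule:
    name: str
    predicate: Callable[[dict], bool]  # filter expression
    fields: List[str]                  # select expression
    interval_s: int                    # assumed: minimum seconds between events per user


def route(record: dict, rules: List[OutputRule],
          last_sent: Dict[Tuple[str, str], int]) -> Dict[str, dict]:
    """Apply every rule to one labelled record and return the events to emit."""
    events = {}
    ts = record["timestamp"]
    for rule in rules:
        key = (rule.name, record["imsi"])
        # Check interval: skip if this user triggered this output too recently.
        if key in last_sent and ts - last_sent[key] < rule.interval_s:
            continue
        # Filter: the record must match the rule's condition.
        if not rule.predicate(record):
            continue
        # Select: keep only the configured fields, then remember the send time.
        events[rule.name] = {f: record[f] for f in rule.fields}
        last_sent[key] = ts
    return events
```

Because each rule carries its own filter and select, one labelled stream can feed several outputs, such as new user marketing and location updates, without duplicating the tagging work.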
Framework - Deployment & Configuration
External system
SDTP Socket Source HDFS I/O
Codis I/O
Web
Deployment in
single host
Deployment/ Configuration
OCSP
CONTENT
03
Performance - scale out
[Diagram: multiple Flume and Nifi instances for Data Input; Kafka and Spark for Tagging, backed by multiple Codis instances; Kafka and Spark for Data Output]
How does OCSP work in Smart Tourism?
• Data source
  • 4G signal data
  • imsi + location + timestamp
• Data transformation
  • Flume source: socket
  • Sink: Kafka (keyed message)
• Streaming processing
  • Filter invalid data
  • Tagging: get the user's information from Codis by imsi
  • Tagging: compute the user path in a duration
• Output
  • Write the latest location to Kafka
  • Use Flume to update the latest location in HBase
Data flow: 4G data → socket → Flume → Kafka → Tagging → Filter → Select → Output → Kafka → Flume → HBase
• Input record: imsi | location | timestamp
• Output record: imsi | location | timestamp | name | age | longitude | latitude
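The "user path in a duration" tagging step could be computed along these lines. This is a batch-style Python sketch under assumptions, not the streaming implementation: the function name user_paths, the half-open window [start, end), and collapsing repeated signals from the same location are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str, int]  # (imsi, location, timestamp in seconds)


def user_paths(events: Iterable[Event], start: int, end: int) -> Dict[str, List[str]]:
    """Return, per imsi, the ordered locations visited in [start, end)."""
    by_user = defaultdict(list)
    for imsi, location, ts in events:
        if start <= ts < end:
            by_user[imsi].append((ts, location))
    paths = {}
    for imsi, visits in by_user.items():
        visits.sort()  # order by timestamp, since signals may arrive out of order
        path = []
        for _, location in visits:
            if not path or path[-1] != location:  # collapse repeated signals
                path.append(location)
        paths[imsi] = path
    return paths
```

In a streaming job the same logic would run per batch or window over records grouped by imsi, which the keyed Kafka messages make straightforward.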
Performance - time cost
Scenario   Data per 30 s   Spark             Codis     Kafka Partition   Output number
Case 1     0.6 million     20/128G/32 core   10/128G   200               3
Case 2     10 million      28/512G/64 core   10/512G   1200              11
Stages measured: Data Transformation, Tagging (Get cache), Tagging (Operation), Output
• Case 1: 5 seconds in total (timings shown: 0.5 s, 3 s, 1 s, 1.5 s)
• Case 2: 17 seconds in total (timings shown: 1 s, 11 s, 3 s, 2 s)
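A quick piece of arithmetic puts the two cases in context. The sketch below assumes that "Data per 30 s" counts records and that 30 s is the batch interval against which the total time cost is compared; neither is stated on the slide, and the function name summarize is hypothetical.

```python
BATCH_SECONDS = 30  # assumed batch interval, taken from "Data per 30 s"


def summarize(per_batch: int, total_s: float, batch_s: int = BATCH_SECONDS):
    """Return (records per second, fraction of the batch interval spent processing)."""
    rate = per_batch / batch_s
    utilization = total_s / batch_s
    return rate, utilization


case1 = summarize(600_000, 5)      # Case 1: 0.6 million per 30 s, 5 s total
case2 = summarize(10_000_000, 17)  # Case 2: 10 million per 30 s, 17 s total

for name, (rate, utilization) in (("Case 1", case1), ("Case 2", case2)):
    print(f"{name}: {rate:,.0f} records/s, {utilization:.0%} of the interval")
```

Under these assumptions Case 1 corresponds to about 20,000 records per second and Case 2 to about 333,000 per second, and in both cases processing finishes well inside the 30 s interval, so the pipeline keeps up with its input.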
CONTENT
04
Next Work
• Support more scenarios
  • Join of multiple streams in a time window
  • More streaming frameworks: Flink, Beam, etc.
• Faster
  • Spark upgrade, Structured Streaming
  • Faster cache
• HA
  • No single point of failure
Open Source
https://github.com/OCSP
Thanks
