How LinkedIn uses Automic for Big Data Processes

 Vijay Aruswamy,
 Staff Engineer, Big Data Operations,
 LinkedIn Corporation
 https://www.linkedin.com/in/vijayaruswamy

Outline
 LinkedIn Overview
 Why Data is important for LinkedIn
 LinkedIn’s Big Data Eco-System
 How Automic tools are helping LinkedIn

Our Mission
 Connect the world's professionals to make
them more productive and successful.

LinkedIn – World's Largest Professional Network

Outline
 LinkedIn’s Big Data Eco-System

“What gets measured, gets fixed”
-David Henke, Former SVP Operations, LinkedIn

A Few Data-Driven Products
 People You May Know (PYMK)
 Companies you may be interested in
 Jobs you may be interested in
 Groups you may like
 Who viewed your profile
 Economic Graph Challenge

Outline
 LinkedIn's Big Data Eco-System

Type of Data at “LinkedIn”
Identity Data + Behavioral Data + Social Data

What does “Big Data” mean at LinkedIn
[Figure: Data Volume plotted against Analytical Challenges & Complexity. Data types shown: Member Data, CRM Data, Web/Behavior Data, Social Media Data.]

High Level Data Flow

Camus
 Camus is a MapReduce job that loads data from Kafka into HDFS. It can copy data from Kafka into HDFS incrementally.
 http://etl.svbtle.com/setting-up-camus-linkedins-kafka-to-hdfs-pipeline
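Incremental copying can be pictured as tracking a per-partition offset, so that each run copies only the records that arrived since the last one. The sketch below is an in-memory illustration under that assumption. It does not use Camus or Kafka APIs, and the names `incremental_pull` and `committed_offsets` are hypothetical.

```python
# Illustrative only: an in-memory stand-in for incremental Kafka-to-HDFS copying.
# Names (incremental_pull, committed_offsets) are hypothetical, not Camus APIs.

def incremental_pull(partitions, committed_offsets):
    """Copy only records newer than the last committed offset per partition.

    partitions: dict mapping partition id -> list of records (the "log")
    committed_offsets: dict mapping partition id -> next offset to read
    Returns (copied_records, new_offsets).
    """
    copied = []
    new_offsets = dict(committed_offsets)
    for pid, log in partitions.items():
        start = committed_offsets.get(pid, 0)
        copied.extend(log[start:])      # records not seen by earlier runs
        new_offsets[pid] = len(log)     # resume here on the next run
    return copied, new_offsets


log = {0: ["a", "b"], 1: ["c"]}
first_batch, offsets = incremental_pull(log, {})
log[0].append("d")                      # new data arrives between runs
second_batch, offsets = incremental_pull(log, offsets)
```

The second run copies only the record appended after the first run, because the saved offsets mark where each partition was left.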
Gobblin
 Unified data ingestion system for internal and external data sources. Gobblin uses a worker framework in which each record runs through four stages: extraction, conversion, quality checking, and writing.
 https://engineering.linkedin.com/data-ingestion/gobblin-big-data-ease
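The four stages described for Gobblin (extraction, conversion, quality checking, writing) can be sketched as a small record pipeline. This is a toy illustration only: the function names, the record shape, and the quality rule are assumptions made for the example and are not Gobblin APIs.

```python
# Illustrative only: a toy record pipeline with the four stages named above.
# All function names and the record shape are hypothetical, not Gobblin APIs.

def extract(source):
    """Stage 1: pull raw records from a source."""
    yield from source

def convert(record):
    """Stage 2: normalize a raw record into the target schema."""
    return {"member_id": int(record["id"]), "event": record["event"].lower()}

def passes_quality_check(record):
    """Stage 3: reject records that violate simple row-level rules."""
    return record["member_id"] > 0 and record["event"] != ""

def write(record, sink):
    """Stage 4: publish the record to the sink."""
    sink.append(record)

def run_pipeline(source):
    """Run every record through the four stages, collecting rejects separately."""
    sink, rejected = [], []
    for raw in extract(source):
        record = convert(raw)
        if passes_quality_check(record):
            write(record, sink)
        else:
            rejected.append(record)
    return sink, rejected


raw_records = [{"id": "7", "event": "PageView"}, {"id": "0", "event": "Click"}]
written, rejected = run_pipeline(raw_records)
```

Keeping rejected records in their own list, rather than dropping them silently, makes it possible to inspect what the quality check filtered out.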

High Level Data Flow (Cont.)

Automic
 Data-driven scheduling: a process will not execute before its data dependency is satisfied.
 Typical time-series roll-up hierarchies (hour :: day :: week :: month :: quarter :: year) are handled by Azkaban
 Processes should execute only when the input data sets are available
 Grouping: organize components and workflows into a common area for maintenance and enhancements
 Supports external dependencies
 Use of global variables: store commonly used passwords in one place
 Throttling: assign jobs to queues, schedule when jobs are to run throughout the day, hold jobs under the same flow
 Load balancing: assign queues to run on a particular server
 Monitoring: Graphical Explorer
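The data-driven scheduling rule listed above (a process runs only once its input data sets are available) can be sketched for one step of the roll-up hierarchy: a daily roll-up that waits for 24 hourly partitions. This is a minimal illustration, not how Automic or Azkaban express dependencies. The function names, the partition naming scheme, and the example date are all hypothetical.

```python
# Illustrative only: gate a roll-up job on the availability of its inputs.
# Function names and the partition naming scheme are hypothetical; this is
# not how Automic or Azkaban express dependencies.

def hourly_inputs(day):
    """Names of the 24 hourly partitions a daily roll-up needs."""
    return [f"{day}/{hour:02d}" for hour in range(24)]

def missing_inputs(required, available):
    """Return the required data sets that have not landed yet."""
    return [name for name in required if name not in available]

def run_if_ready(job, required, available):
    """Run the job only when every input data set is available."""
    missing = missing_inputs(required, available)
    if missing:
        return f"{job}: waiting on {len(missing)} input(s)"
    return f"{job}: running"


day = "2015-06-01"                              # arbitrary example date
landed = set(hourly_inputs(day)[:23])           # hour 23 has not arrived
status_before = run_if_ready("daily_rollup", hourly_inputs(day), landed)
landed.add(f"{day}/23")
status_after = run_if_ready("daily_rollup", hourly_inputs(day), landed)
```

The same check could be applied at each level of the hierarchy, for example a weekly roll-up waiting on seven daily outputs.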
Azkaban

LinkedIn’s Big Data Architecture
[Architecture diagram. Recoverable labels: Online DBs (Prod DCs), Espresso, Service Metrics, Web Tracking.]

LinkedIn’s Application Manager

Types of jobs scheduled by Automic
 External ETL
 ODS ETL
 Hadoop ETL
 Teradata ETL
 User Input ETL
 Historical Loads
 One-time data fixes

Data Volume
 How many Kafka topics (tracking + service) do we dump on Hadoop?
– ~900+: Tracking: 300 (/data/tracking) + Service: 682 (/data/service)
– Data size/day of the above?
 ~10 TB
 How many online DB tables do we have on Hadoop?
– ~300+ (Oracle, Espresso, MySQL) tables
– Data size?
 ~8 TB
 Capacity of DWH on Teradata
– ~186 TB overall with 6-month retention, ~3 TB every day
– ~340k unique queries/day (248k from users and ~90k from ETL)
 Capacity of Hadoop
– Biggest cluster 5 PB with 2500+ nodes
– ETL clusters 3.1 PB with 360+ nodes
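The headline totals on this slide are rounded. As a quick check, the component figures quoted above can be summed directly. Only the sums are computed here, and the ETL query count is itself approximate on the slide.

```python
# Figures copied from the slide above; only the sums are computed here.
tracking_topics, service_topics = 300, 682
user_queries, etl_queries = 248_000, 90_000   # ETL count is approximate on the slide

total_topics = tracking_topics + service_topics
total_queries = user_queries + etl_queries
```

The sums, 982 topics and 338,000 queries per day, are consistent with the quoted "~900+" and "~340k".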

Q & A
