ORGANIZATION NAME©2013 LinkedIn Corporation. All Rights Reserved.
Vijay Aruswamy,
Staff Engineer, Big Data Operations,
LinkedIn Corporation
https://www.linkedin.com/in/vijayaruswamy
Outline
• LinkedIn Overview
• Why data is important for LinkedIn
• LinkedIn’s Big Data ecosystem
• How Automic tools are helping LinkedIn
Our Mission
• Connect the world's professionals to make them more productive and successful.
LinkedIn – World’s Largest Professional Network
“What gets measured, gets fixed”
- David Henke, Former SVP Operations, LinkedIn
A Few Data-Driven Products
• People You May Know (PYMK)
• Companies You May Be Interested In
• Jobs You May Be Interested In
• Groups You May Like
• Who Viewed Your Profile
• Economic Graph Challenge
Types of Data at LinkedIn
Identity Data + Behavioral Data + Social Data
What does “Big Data” mean at LinkedIn
[Figure: axes labeled "Data Volume" and "Analytical Challenges & Complexity", each extending to + ∞. Data types shown: Member Data, CRM Data, Web/Behavior Data, Social Media Data.]
High-Level Data Flow
Camus
• Camus is a MapReduce job that loads data from Kafka into HDFS. It can copy data from Kafka into HDFS incrementally.
• http://etl.svbtle.com/setting-up-camus-linkedins-kafka-to-hdfs-pipeline
Gobblin
• Unified data ingestion system for internal and external data sources. Gobblin uses a worker framework in which each record runs through four stages: extraction, conversion, quality checking, and writing.
• https://engineering.linkedin.com/data-ingestion/gobblin-big-data-ease
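The four stages a record passes through in Gobblin (extraction, conversion, quality checking, writing) can be pictured as a chain of small functions. The following is a minimal Python sketch of that idea only. It does not use Gobblin's real interfaces, and the function names, the `member_id,page` record shape, and the quality rule are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def extract(source: Iterable[str]) -> Iterator[str]:
    """Stage 1 (extraction): pull raw records from a source."""
    for raw in source:
        yield raw


def convert(raw: str) -> Dict[str, str]:
    """Stage 2 (conversion): turn a raw 'member_id,page' line into a dict.

    Assumes every line contains at least one comma (hypothetical format).
    """
    member_id, page = raw.split(",", 1)
    return {"member_id": member_id.strip(), "page": page.strip()}


def passes_quality_check(record: Dict[str, str]) -> bool:
    """Stage 3 (quality checking): hypothetical rule, both fields non-empty."""
    return bool(record["member_id"]) and bool(record["page"])


def run_pipeline(
    source: Iterable[str],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Run every record through all four stages, in order."""
    written: List[Dict[str, str]] = []
    rejected: List[Dict[str, str]] = []
    for raw in extract(source):
        record = convert(raw)
        if passes_quality_check(record):
            # Stage 4 (writing): an in-memory list stands in for a real sink.
            written.append(record)
        else:
            rejected.append(record)
    return written, rejected


if __name__ == "__main__":
    good, bad = run_pipeline(["42,/jobs", ",/feed", "7,/profile"])
    print(len(good), "written;", len(bad), "rejected")
```

Keeping rejected records in a separate list, rather than dropping them silently, is one possible design choice; it makes quality failures visible to whoever operates the pipeline.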
High-Level Data Flow (cont.)
Automic
• Data-driven scheduling: a process will not execute before its data dependency is satisfied.
• Typical time-series roll-up hierarchies (hour :: day :: week :: month :: quarter :: year) are handled by Azkaban.
• Processes should execute only when the input data sets are available.
• Grouping: organize components and workflows into a common area for maintenance and enhancements.
• Supports external dependencies.
• Use of global variables: keep commonly used passwords stored in one place.
• Throttling: assign jobs to queues, schedule when jobs are to run throughout the day, and hold jobs under the same flow.
• Load balancing: assign queues to run on a particular server.
• Monitoring: Graphical Explorer.
Azkaban
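Data-driven scheduling, where a process runs only once its input data sets are available, can be illustrated with the hour-to-day step of the roll-up hierarchy. The sketch below is a plain Python illustration under stated assumptions and is not the Automic or Azkaban API: the set of available partitions, the `(day, hour)` partition key, and all function names are hypothetical.

```python
from datetime import date
from typing import Set, Tuple

# Hypothetical partition key: (calendar day, hour of day 0..23).
Partition = Tuple[date, int]


def hourly_inputs(day: date) -> Set[Partition]:
    """All 24 hourly partitions that a daily roll-up depends on."""
    return {(day, hour) for hour in range(24)}


def missing_inputs(day: date, available: Set[Partition]) -> Set[Partition]:
    """Hourly partitions for `day` that have not landed yet."""
    return hourly_inputs(day) - available


def run_daily_rollup_if_ready(day: date, available: Set[Partition]) -> bool:
    """Run the daily roll-up only when its data dependency is satisfied.

    Returns True if the roll-up ran, False if it was held back.
    """
    if missing_inputs(day, available):
        # A real scheduler would hold the job and re-check later.
        return False
    # Placeholder for the actual aggregation of 24 hourly data sets.
    return True


if __name__ == "__main__":
    day = date(2015, 1, 1)  # arbitrary example date
    landed = {(day, hour) for hour in range(23)}  # hour 23 still missing
    print(run_daily_rollup_if_ready(day, landed))
    landed.add((day, 23))
    print(run_daily_rollup_if_ready(day, landed))
```

The same check could be repeated one level up (a weekly roll-up waiting on seven daily outputs), which is how a dependency rule of this kind would cover the rest of the hierarchy.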
LinkedIn’s Big Data Architecture
[Architecture diagram. Labels shown: Online DBs - Prod DCs, Espresso, Service Metrics, Web Tracking.]
LinkedIn’s Application Manager
Types of jobs scheduled by Automic
• External ETL
• ODS ETL
• Hadoop ETL
• Teradata ETL
• User Input ETL
• Historical Loads
• One-time data fixes
Data Volume
• How many Kafka topics (tracking + service) do we dump on Hadoop?
  – ~900+; Tracking: 300 (/data/tracking) + Service: 682 (/data/service)
  – Data size per day of the above?
    • ~10 TB
• How many online DB tables do we have on Hadoop?
  – ~300+ (Oracle, Espresso, MySQL) tables
  – Data size?
    • ~8 TB
• Capacity of DWH on Teradata
  – ~186 TB overall with 6-month retention, ~3 TB every day
  – ~340k unique queries/day (248k from users and ~90k from ETL)
• Capacity of Hadoop
  – Biggest cluster: 5 PB with 2500+ nodes
  – ETL clusters: 3.1 PB with 360+ nodes
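The breakdowns on this slide can be cross-checked with simple arithmetic. The snippet below is only a sanity check on the quoted figures, all of which are approximate in the source. The decimal conversion (1 PB = 1000 TB) is an assumption, and because the node counts are given as "2500+" and "360+", the per-node values are rough upper bounds rather than measured numbers.

```python
# Figures copied from the Data Volume slide (approximate in the source).
tracking_topics = 300
service_topics = 682
user_queries_per_day = 248_000
etl_queries_per_day = 90_000

TB_PER_PB = 1000  # assumption: decimal units; the slide does not say


def tb_per_node(capacity_pb: float, nodes: int) -> float:
    """Average capacity per node in TB, given capacity in PB."""
    return capacity_pb * TB_PER_PB / nodes


# Tracking + service topics, to compare with the "~900+" headline figure.
total_topics = tracking_topics + service_topics

# User + ETL queries, to compare with the "~340k unique queries/day" figure.
total_queries = user_queries_per_day + etl_queries_per_day

# Node counts are lower bounds ("2500+", "360+"), so these are upper bounds.
biggest_cluster_tb_per_node = tb_per_node(5, 2500)
etl_cluster_tb_per_node = tb_per_node(3.1, 360)

if __name__ == "__main__":
    print("topics:", total_topics)
    print("queries/day:", total_queries)
    print("biggest cluster TB/node:", biggest_cluster_tb_per_node)
    print("ETL clusters TB/node:", round(etl_cluster_tb_per_node, 1))
```

Both sums are consistent with the rounded totals on the slide: 982 topics against "~900+", and 338k queries against "~340k".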
Q & A

How LinkedIn Uses Automic for Big Data Processes
