LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Data Infrastructure at LinkedIn
Jun Rao and Sam Shah

LinkedIn Confidential ©2013 All Rights Reserved

Outline
1. LinkedIn introduction
2. Online/nearline infrastructure
3. Offline infrastructure
4. Conclusion



The World’s Largest Professional Network
Connecting Talent with Opportunity. At scale…

• 200M+ Members Worldwide
• 2 new Members Per Second
• 100M+ Monthly Unique Visitors
• 2M+ Company Pages


Two Product Families

For Members (Professionals)
• People You May Know
• Who’s Viewed My Profile
• Jobs You May Be Interested In
• News/Sharing
• Today
• Search
• Subscriptions

For Partners (Companies)
• Hire
• Market
• Sell

Underlying layers: Science and Analytics; Data Infrastructure
Data: Actions, Profiles, Connections, Content

The Big-Data Feedback Loop
[Diagram: a feedback loop connecting Member, Product, Data, Science, and Infrastructure. Labels on the loop: Engagement, Virality, Signals, Value, Insights, Refinement, Analytics, Scale.]


LinkedIn Data Infrastructure: Three-Phase Abstraction
[Diagram: Users, Application, and Infrastructure, with the infrastructure split into three phases: Online (Online Data Infra), Near-Line (Near-Line Infra), and Offline (Offline Data Infra).]

Latency & Freshness Requirements
Products, grouped by how quickly activity must be reflected:

Activity that should be reflected immediately
• Member Profiles
• Company Profiles
• Connections
• Messages
• Endorsements
• Skills

Activity that should be reflected soon
• Activity Streams
• Profile Standardization
• News
• Recommendations
• Search
• Messages

Activity that can be reflected later
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea…


LinkedIn Data Infrastructure: Sample Stack

Infra challenges in the 3-phase ecosystem are diverse, complex and specific.

Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms.

LinkedIn Data Infrastructure Solutions

Voldemort: Highly-Available Distributed KV Store
• Key/value access at scale


Voldemort: Architecture

• Pluggable components
• Tunable consistency/availability
• Key/value model, server-side “views”

• 10 clusters, 100+ nodes
• Largest cluster: 10K+ qps
• Avg latency: 3ms
• Hundreds of stores
• Largest store: 2.8TB+
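
One common way to make consistency and availability tunable in a replicated key/value store is quorum configuration. The sketch below is a minimal in-memory model, not Voldemort's client API: the class name `QuorumStore` and the parameters `n`, `r`, and `w` are assumptions chosen for illustration. Each key is replicated to `n` nodes, a write needs `w` live replicas, and a read consults `r` replicas and keeps the newest version. When `r + w > n`, every read quorum overlaps every write quorum.

```python
import hashlib


class QuorumStore:
    """Toy replicated key/value store with tunable read/write quorums."""

    def __init__(self, node_count=5, n=3, r=2, w=2):
        if not (1 <= r <= n <= node_count and 1 <= w <= n):
            raise ValueError("need 1 <= r, w <= n <= node_count")
        self.n, self.r, self.w = n, r, w
        self.nodes = [dict() for _ in range(node_count)]  # key -> (version, value)
        self.down = set()   # indices of nodes simulated as unavailable
        self.clock = 0      # plain counter standing in for real versioning

    def _replicas(self, key):
        # Pick n consecutive nodes starting at the key's hash position.
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.nodes)
        return [(start + i) % len(self.nodes) for i in range(self.n)]

    def put(self, key, value):
        live = [i for i in self._replicas(key) if i not in self.down]
        if len(live) < self.w:
            raise RuntimeError("write quorum not met")
        self.clock += 1
        for i in live:
            self.nodes[i][key] = (self.clock, value)

    def get(self, key):
        live = [i for i in self._replicas(key) if i not in self.down]
        if len(live) < self.r:
            raise RuntimeError("read quorum not met")
        # Consult the first r live replicas and keep the newest version.
        answers = [self.nodes[i].get(key, (0, None)) for i in live[: self.r]]
        return max(answers, key=lambda a: a[0])[1]
```

Raising `w` and `r` favors consistency; lowering them keeps the store available with more nodes down, at the cost of possibly stale reads.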


Espresso: Indexed Timeline-Consistent Distributed Data Store
• Fills the gap between Oracle and a KV store


Espresso: System Components
• Hierarchical data model
• Timeline consistency
• Rich functionality
  • Transactions
  • Secondary index
  • Text search
• Partitioning/replication
• Change propagation


Generic Cluster Manager: Helix
• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing

• Used by Espresso, Databus and Search
• Open sourced Apr 2012
• https://github.com/linkedin/helix
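
A distributed state model with rebalancing can be sketched as follows. This is a simplified illustration, not Helix's API: the function names `ideal_state` and `plan_transitions`, the round-robin placement, and the three-state OFFLINE/SLAVE/MASTER ladder are assumptions. A controller computes a target assignment for the live nodes, then emits the one-step state transitions needed to get there. The sketch does not order steps to guarantee at most one master per partition at every instant, which a real controller would have to do.

```python
ORDER = ["OFFLINE", "SLAVE", "MASTER"]  # legal transitions move one step at a time


def ideal_state(partitions, live_nodes, replicas=2):
    """Assign each partition one MASTER and the remaining replicas as SLAVE."""
    nodes = sorted(live_nodes)
    count = min(replicas, len(nodes))
    state = {}
    for p in range(partitions):
        chosen = [nodes[(p + i) % len(nodes)] for i in range(count)]
        state[p] = {n: ("MASTER" if i == 0 else "SLAVE") for i, n in enumerate(chosen)}
    return state


def plan_transitions(current, target):
    """Return (partition, node, from_state, to_state) steps, one hop each."""
    steps = []
    for p in sorted(target):
        have, want = current.get(p, {}), target[p]
        for node in sorted(set(have) | set(want)):
            i = ORDER.index(have.get(node, "OFFLINE"))
            j = ORDER.index(want.get(node, "OFFLINE"))
            hop = 1 if j > i else -1
            for k in range(i, j, hop):
                steps.append((p, node, ORDER[k], ORDER[k + hop]))
    return steps
```

For example, when a node fails, the controller can drop it from the current state, recompute `ideal_state` over the survivors, and apply the planned steps: surviving slaves are promoted and new slaves are brought up from OFFLINE.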



Databus: Timeline-Consistent Change Data Capture
• Delivers data store changes to apps

Databus at LinkedIn
[Diagram: changes are captured from the DB into a Relay that holds an event window. Clients, each running the Databus client lib with consumers 1…n, receive on-line changes from the Relay. A Bootstrap component with its own DB also receives on-line changes and provides clients with a consistent snapshot at U.]

• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In-order, at-least-once delivery

• Tens of relays
• Hundreds of sources
• Low latency: milliseconds
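
The relay window, bootstrap snapshot, and at-least-once delivery can be sketched together. This is a minimal single-process model, not the Databus client library: the class names `Relay` and `Consumer`, the integer sequence numbers, and the in-memory `snapshot` dict standing in for the bootstrap store are all assumptions. The consumer checkpoints only after applying an event, so a crash can cause redelivery but never a skipped event, and applying by key makes redelivery harmless.

```python
from collections import deque


class Relay:
    """Keeps only the most recent `window` change events in memory."""

    def __init__(self, window=3):
        self.events = deque(maxlen=window)  # (seq, key, value), ascending seq
        self.snapshot = {}                  # stands in for the bootstrap store
        self.last_seq = 0

    def capture(self, key, value):
        self.last_seq += 1
        self.events.append((self.last_seq, key, value))
        self.snapshot[key] = value

    def since(self, checkpoint):
        """Events after `checkpoint`, or None if the window no longer covers it."""
        if self.events and self.events[0][0] > checkpoint + 1:
            return None
        return [e for e in self.events if e[0] > checkpoint]


class Consumer:
    def __init__(self):
        self.state = {}
        self.checkpoint = 0

    def poll(self, relay):
        events = relay.since(self.checkpoint)
        if events is None:
            # Fell out of the relay window: load a snapshot, then resume from it.
            self.state = dict(relay.snapshot)
            self.checkpoint = relay.last_seq
            return
        for seq, key, value in events:
            self.state[key] = value   # idempotent: replaying an event is harmless
            self.checkpoint = seq     # checkpoint only after applying
```

A consumer that falls far behind (or starts fresh) takes the snapshot path; one that is only slightly behind catches up from the relay's window.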



Kafka: High-Volume Low-Latency Messaging System
• Log aggregation and queuing


Kafka Architecture
[Diagram: Producers write to Brokers 1 to 4, Consumers read from the brokers, and Zookeeper sits alongside the cluster. The brokers host partitions topic1-part1, topic1-part2, topic2-part1, and topic2-part2; each partition appears three times across the brokers.]

Key features
• Scale-out architecture
• Automatic load balancing
• High throughput/low latency
• Rewindability
• Intra-cluster replication

Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages
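
Partitioning and rewindability both follow from treating each partition as an append-only log that consumers read by offset. The sketch below is an in-memory illustration under stated assumptions, not the Kafka client API: the class names `Topic` and `Consumer`, the crc32-based partitioner, and list indices used as offsets are all hypothetical. Because the consumer, not the broker, owns its position, rereading old messages is just a matter of resetting the offset.

```python
import zlib


class Topic:
    """A topic as a set of append-only partition logs."""

    def __init__(self, partitions=2):
        self.logs = [[] for _ in range(partitions)]

    def send(self, key, message):
        # crc32 gives a hash that is stable across runs, unlike built-in hash().
        p = zlib.crc32(key.encode()) % len(self.logs)
        self.logs[p].append(message)
        return p, len(self.logs[p]) - 1  # (partition, offset)

    def fetch(self, partition, offset, max_messages=100):
        return self.logs[partition][offset : offset + max_messages]


class Consumer:
    """Reads one partition and tracks its own offset."""

    def __init__(self, topic, partition):
        self.topic, self.partition, self.offset = topic, partition, 0

    def poll(self):
        batch = self.topic.fetch(self.partition, self.offset)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        # Rewind (or skip ahead) by moving the consumer-side position.
        self.offset = offset
```

Messages with the same key land in the same partition and keep their relative order, and calling `seek(0)` replays that partition from the start.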

LinkedIn Data Infrastructure: A few take-aways
1. Building infrastructure in a hyper-growth environment is challenging.
2. Few vs. many: balance over-specialized (agile) efforts against generic (leverageable) platforms (*)
3. Balance open-source products with homegrown platforms (**)

