Data Infrastructure at LinkedIn
Jun Rao and Sam Shah

LinkedIn Confidential ©2013 All Rights Reserved
Outline
1. LinkedIn introduction
2. Online/nearline infrastructure
3. Offline infrastructure
4. Conclusion

The World’s Largest Professional Network
Connecting Talent with Opportunity. At scale…

• 200M+ Members Worldwide
• 2 new Members Per Second
• 100M+ Monthly Unique Visitors
• 2M+ Company Pages

Two Product Families
For Members (Professionals)
• People You May Know
• Who’s Viewed My Profile
• Jobs You May Be Interested In
• News/Sharing
• Today
• Search
• Subscriptions

For Partners (Companies)
• Hire
• Market
• Sell

Science and Analytics
Data Infrastructure

Data
• Actions
• Profiles
• Connections
• Content

The Big-Data Feedback Loop
[Figure: feedback-loop diagram. Labels: Refinement, Engagement, Value, Member, Product, Insights, Virality, Data, Signals, Science, Analytics, Scale, Infrastructure]

LinkedIn Data Infrastructure: Three-Phase Abstraction
[Figure: Users interact with the Application, which sits on three infrastructure tiers: Online Data Infra, Near-Line Infra, and Offline Data Infra.]

Latency & Freshness Requirements, with example products

Online: activity that should be reflected immediately
• Member Profiles
• Company Profiles
• Connections
• Messages
• Endorsements
• Skills

Near-Line: activity that should be reflected soon
• Activity Streams
• Profile Standardization
• News
• Recommendations
• Search
• Messages

Offline: activity that can be reflected later
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea…

LinkedIn Data Infrastructure: Sample Stack

Infra challenges in the 3-phase ecosystem are diverse, complex and specific.

Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms.

LinkedIn Data Infrastructure Solutions

Voldemort: Highly-Available Distributed KV Store
• Key/value access at scale

Voldemort: Architecture

• Pluggable components
• Tunable consistency / availability
• Key/value model, server side “views”

• 10 clusters, 100+ nodes
• Largest cluster – 10K+ qps
• Avg latency: 3ms
• Hundreds of Stores
• Largest store – 2.8TB+

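The "tunable consistency / availability" bullet can be illustrated with a quorum-style sketch. The slides do not spell out the mechanism, so this assumes a Dynamo-style setup with N replicas per key, R replicas consulted per read and W replicas acknowledged per write; the function names are hypothetical and are not Voldemort's API. Under those assumptions, a read is guaranteed to see the latest acknowledged write only when every read set overlaps every write set, which holds exactly when R + W > N.

```python
from itertools import combinations


def is_strict_quorum(n: int, r: int, w: int) -> bool:
    """Rule of thumb (assumed model): read and write sets always overlap iff r + w > n."""
    return r + w > n


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Brute-force check: does every r-replica read set share a replica
    with every w-replica write set, out of n replicas in total?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


if __name__ == "__main__":
    # Lower r and w favor availability and latency; higher values favor consistency.
    for n, r, w in [(3, 1, 1), (3, 2, 2), (3, 1, 3)]:
        print(f"N={n} R={r} W={w} overlap={quorums_overlap(n, r, w)}")
```

Lowering R and W trades consistency for availability and latency; raising them does the reverse. The brute-force check exists only to confirm the R + W > N rule on small inputs.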
LinkedIn Data Infrastructure Solutions

Espresso: Indexed Timeline-Consistent Distributed Data Store
• Fills the gap between Oracle and the KV store

Espresso: System Components
• Hierarchical data model
• Timeline consistency
• Rich functionality
• Transactions
• Secondary index
• Text search
• Partitioning/replication
• Change propagation

Generic Cluster Manager: Helix
• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing

• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
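One way to picture a "generic distributed state model" is as a finite-state machine per replica, with the cluster manager working out which legal transitions move a replica to its target state. The sketch below assumes a master/slave style model with OFFLINE, SLAVE and MASTER states; the transition table, function names and node names are illustrative assumptions, not the Helix API.

```python
from collections import deque

# Legal single-step transitions per replica (assumed for illustration).
TRANSITIONS = {
    "OFFLINE": ["SLAVE"],
    "SLAVE": ["MASTER", "OFFLINE"],
    "MASTER": ["SLAVE"],
}


def transition_path(current: str, target: str) -> list:
    """Shortest sequence of legal states from current to target (breadth-first search)."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in TRANSITIONS[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    raise ValueError(f"no legal path from {current} to {target}")


def plan_failover(states: dict, failed_node: str) -> dict:
    """If the failed node held MASTER, pick one surviving SLAVE to promote."""
    survivors = {n: s for n, s in states.items() if n != failed_node}
    if states[failed_node] == "MASTER" and "MASTER" not in survivors.values():
        candidates = sorted(n for n, s in survivors.items() if s == "SLAVE")
        if candidates:
            return {candidates[0]: transition_path("SLAVE", "MASTER")}
    return {}


if __name__ == "__main__":
    print(transition_path("OFFLINE", "MASTER"))
    print(plan_failover({"node1": "MASTER", "node2": "SLAVE", "node3": "SLAVE"}, "node1"))
```

Keeping the state machine separate from the planning logic is what makes the approach generic: a different system would supply a different transition table and reuse the same planner.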

LinkedIn Data Infrastructure Solutions

Databus: Timeline-Consistent Change Data Capture
• Deliver data store changes to apps
Databus at LinkedIn
[Figure: Databus architecture. Labels: DB, Capture Changes, Relay (Event Win), On-line Changes, Bootstrap (DB), Consistent Snapshot at U, Databus Client Lib, Client, Consumer 1 … Consumer n]

• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In order, at least once delivery
• Tens of relays
• Hundreds of sources
• Low latency - milliseconds

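"In order, at least once delivery" implies that a consumer may see the same change more than once, for example after a restart, so it has to apply changes idempotently. The sketch below shows one common way to do that, assuming each change event carries a unique, increasing sequence number. The class names and the `scn` field are hypothetical and are not the Databus client library API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeEvent:
    scn: int      # assumed: unique, monotonically increasing sequence number
    key: str
    value: str


@dataclass
class IdempotentConsumer:
    last_applied_scn: int = -1
    state: dict = field(default_factory=dict)
    applied_count: int = 0

    def on_event(self, event: ChangeEvent) -> bool:
        """Apply an event once; redelivered (already applied) events are skipped."""
        if event.scn <= self.last_applied_scn:
            return False  # duplicate from at-least-once redelivery
        self.state[event.key] = event.value
        self.last_applied_scn = event.scn
        self.applied_count += 1
        return True


if __name__ == "__main__":
    e1 = ChangeEvent(1, "member:42", "headline=v1")
    e2 = ChangeEvent(2, "member:42", "headline=v2")
    e3 = ChangeEvent(3, "member:7", "headline=v1")
    consumer = IdempotentConsumer()
    # e2 and e3 are each delivered twice, in order.
    for event in [e1, e2, e2, e3, e3]:
        consumer.on_event(event)
    print(consumer.state, consumer.applied_count)
```

Because delivery is in order, remembering a single high-water mark is enough to filter duplicates; out-of-order delivery would require tracking more state.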
LinkedIn Data Infrastructure Solutions

Kafka: High-Volume Low-Latency Messaging System
• Log aggregation and queuing

Kafka Architecture
[Figure: Kafka architecture. Producers write to Broker 1 through Broker 4; the brokers host replicated copies of the partitions topic1-part1, topic1-part2, topic2-part1 and topic2-part2; Consumers read from the brokers; Zookeeper is shown alongside the brokers.]

Key features
• Scale-out architecture
• Automatic load balancing
• High throughput/low latency
• Rewindability
• Intra-cluster replication

Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages

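"Rewindability" follows naturally if a partition is modeled as an append-only log and each consumer tracks its own read position. The sketch below is a toy in-memory model under that assumption; the class and method names are hypothetical and are not the Kafka client API.

```python
class PartitionLog:
    """Append-only list of messages; a message's offset is its index in the log."""

    def __init__(self):
        self._messages = []

    def append(self, message: str) -> int:
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the message just written

    def read(self, offset: int, max_messages: int = 10) -> list:
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Keeps its own offset, so re-reading is just moving the offset back."""

    def __init__(self, log: PartitionLog):
        self.log = log
        self.offset = 0

    def poll(self, max_messages: int = 10) -> list:
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def rewind(self, offset: int) -> None:
        self.offset = offset


if __name__ == "__main__":
    log = PartitionLog()
    for msg in ["page_view:1", "page_view:2", "page_view:3"]:
        log.append(msg)
    consumer = Consumer(log)
    print(consumer.poll())   # reads everything written so far
    consumer.rewind(1)
    print(consumer.poll())   # re-reads from offset 1
```

Since the log itself is never mutated by reads, several consumers can read the same partition at different positions without interfering with each other.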
LinkedIn Data Infrastructure: A few take-aways
1. Building infrastructure in a hyper-growth environment is challenging.
2. Few vs. many: balance over-specialized (agile) platforms against generic (leverageable) platforms (*)
3. Balance open-source products with homegrown platforms (**)


LinkedIn Infrastructure (analytics@webscale, at fb 2013)


Editor's Notes

  • #5 Enterprise-facing is all about segmentation and connections. Our base data leads to revenue-generating products. Enterprise application-building problems with deterministic life-cycles. Science is key for targeting and matching (e.g. CAP, Marketing Solutions). Key back-office play for Hiring, Sales and Marketing for 85% of Fortune-500.
  • #7 Transition needs to be good. Products => data infrastructure requirements in previous slide. Not all products place the same latency and freshness requirements on our data infrastructure. The way we bucketize this is… News and recommendations show up in both nearline and offline.
  • #18 Data integration is hard. Having sane and same metadata across systems. Have a schema which works across the 3 phases. Want rich, evolving schemas, and push as much data cleaning as possible to the source and upstream, so that it helps near-line and off-line. Sessionization logic is in WH, which makes it hard for near-line systems to use. Extensible system where changing schema in one phase does not break downstream systems. Don’t build over-specialized systems: e.g. a monitoring system for PYMK – build Azkaban.