Network Programming: Data Plane Development Kit (DPDK)
The document outlines the architecture and implementation of high-performance networking using Data Plane Development Kit (DPDK), emphasizing efficient packet processing and management. It discusses networking protocols, memory management, and the benefits of user space libraries for reducing latency in packet handling. Additionally, it explains the operational mechanics of data planes, including mechanisms for packet reception and transmission in a multi-core environment.
Network Programming: Data Plane Development Kit (DPDK)
Andriy Berestovskyy
2014
© Andriy Berestovskyy
[Word cloud: "networking hour" — TCP, UDP, NAT, IPsec, IPv4, IPv6, AH, ESP, authentication, authorization, accounting, encapsulation, security, BGP, OSPF, ICMP, ACL, SNAT, tunnels, PPPoE, GRE, ARP, discovery, NDP, OSI, broadcast, multicast, IGMP, PIM, MAC, DHCP, DNS, fragmentation]
Let’s Make a 10 Gbit/s Switch!
1. Receive frame, check Ethernet FCS
2. Add/update source MAC in MAC table
3. If multicast bit is set:
a. forward to all ports except the source
4. If destination is in MAC table:
a. forward to the specific port
5. Else, forward to all ports
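A minimal C sketch of steps 2-5 above; mac_table_update()/mac_table_lookup(), flood() and forward_to() are hypothetical helpers for illustration, not DPDK APIs:

#include <stdbool.h>
#include <stdint.h>

struct mac_addr { uint8_t bytes[6]; };

/* Hypothetical helpers, assumed for this sketch */
void mac_table_update(const struct mac_addr *mac, uint16_t port);
bool mac_table_lookup(const struct mac_addr *mac, uint16_t *port);
void flood(uint16_t src_port);          /* all ports except the source */
void forward_to(uint16_t port);

void switch_frame(uint16_t src_port, const struct mac_addr *src,
                  const struct mac_addr *dst)
{
        uint16_t out_port;

        mac_table_update(src, src_port);        /* 2. learn the source MAC */

        if (dst->bytes[0] & 1) {                /* 3. multicast bit set? */
                flood(src_port);
                return;
        }
        if (mac_table_lookup(dst, &out_port))   /* 4. known unicast */
                forward_to(out_port);
        else
                flood(src_port);                /* 5. unknown: flood */
}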
Let’s Make a Simple Hub (Not a Switch)!
Drop the MAC learning and lookup; a hub simply repeats every frame:
1. Receive frame, check Ethernet FCS
2. Forward to the other port
But still, 10 Gbit/s!
Performance Challenges
Minimum Ethernet frame size:
min frame size = preamble + start + min frame + interframe gap
min frame size = 7 + 1 + 64 + 12 = 84 octets (84 * 8 = 672 bits)
Maximum number of frames on a 1 Gbps link:
packets per second = 1 Gbps / 672 bits = 1 488 095 pps
Maximum number of frames on a 10 Gbps link:
packets per second = 10 Gbps / 672 bits = 14 880 952 pps ≈ 14.88 Mpps
Ethernet vs CPU
Skylake: Intel® Xeon® Processor E3-1280 v5 — 3.7 GHz
CPU budget per packet = CPU frequency / packets per second
CPU budget per packet = 3.7 GHz / 14.88 Mpps ≈ 249 cycles
249 CPU cycles per packet
Is it a lot?
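The arithmetic above, reproduced as a minimal C program (values taken from the previous slides):

#include <stdio.h>

int main(void)
{
        const double link_bps   = 10e9;                  /* 10 Gbps link */
        const double frame_bits = (7 + 1 + 64 + 12) * 8; /* 672 bits on the wire */
        const double cpu_hz     = 3.7e9;                 /* 3.7 GHz core */

        double pps    = link_bps / frame_bits;   /* ~14.88 Mpps */
        double budget = cpu_hz / pps;            /* ~249 cycles per packet */

        printf("%.0f pps, %.0f cycles per packet\n", pps, budget);
        return 0;
}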
Data Plane Development Kit — set of user space
dataplane libraries and NIC drivers
for fast packet processing
— Wikipedia
Dataplane?
Dataplane — part of the architecture that decides
what to do with packets arriving on an inbound interface
— Wikipedia
What is DPDK
1. Set of user space libraries
2. Set of user space drivers with direct NIC access
3. Support for Network Functions Virtualization
4. Open source, BSD-licensed
How?
What DPDK is Not
1. Not a TCP/IP stack
2. Not a Berkeley socket API
3. Not a ready-to-use solution, i.e. neither a router nor a switch
Fitting Into 249 Cycles: Polling
Poll for new packets, do not wait for an interrupt:
1. Interrupts are out of the budget
2. Avoid context switches
void lcore_loop(void)
{
        while (1) {
                /* poll RX queues, process and transmit; never sleep */
                ...
        }
}
Pros/cons?
Fitting Into 249 Cycles: Bursts
Process a few packets at a time (a burst), not one by one:
1. Amortize slow memory reads over the whole burst
2. Increase the cache hit ratio: do the same work for several packets at once
void lcore_loop(void)
{
        struct rte_mbuf *burst[32];
        uint16_t nb_rx;

        while (1) {
                /* receive up to 32 packets from port 0, queue 0 */
                nb_rx = rte_eth_rx_burst(0, 0, burst, 32);
                ...
                /* transmit the received burst on port 1, queue 0 */
                rte_eth_tx_burst(1, 0, burst, nb_rx);
        }
}
Pros/cons?
Fitting Into 249 Cycles: Hugepages
Virtual-to-physical address translation — Translation Lookaside Buffer:
1. Haswell first-level DTLB cache for 4K pages: 64 entries x 4 (~4 cycles)
2. Haswell second-level DTLB cache: 1024 entries x 8 (~10 cycles)
Use 2MB or 1GB hugepages to reduce TLB cache misses:
GRUB_CMDLINE_LINUX_DEFAULT="... default_hugepagesz=1G hugepagesz=1G hugepages=4"
# mount -t hugetlbfs nodev /mnt/huge
More on system requirements: http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html
Fitting Into 249 Cycles: Multicore
Use several CPU cores to process packets:
1. Use Receive Side Scaling (RSS)
2. Use flow affinity
int main(int argc, char **argv)
{
        ...
        /* run lcore_loop() on every worker lcore */
        rte_eal_mp_remote_launch(lcore_loop, ...);
        ...
}
Pros/cons?
[Diagram: lcore_loop() running on CPUs 1-3, each forwarding between two ports]
Why DPDK?
1. High performance
○ efficient memory management
(zero-copy, hugepages, user space)
○ efficient packet handling
(DIR-24-8 implementation, cuckoo hash)
○ efficient CPU management
(lockless poll-mode drivers, run-to-completion, NUMA awareness)
2. Simple
○ user space application
○ many examples
3. De facto standard for dataplanes on Linux
Dataplane Application Models
Pipeline Model:
lcore 1: RX
lcore 2: process
lcore 3: TX
Run-to-Completion Model:
lcore 1: RX, process, TX
lcore 2: RX, process, TX
lcore 3: RX, process, TX
Cons/pros?
lcore?
Lcore — logical execution unit of the processor,
sometimes called a hardware thread
(usually, pthread bound to a CPU core)
— Wikipedia
Synchronization Issues
Pipeline Model:
lcore 1: RX
lcore 2: process
lcore 3: TX
Run-to-Completion Model:
lcore 1: RX, process, TX
lcore 2: RX, process, TX
lcore 3: RX, process, TX
How?
Run-to-Completion Synchronization
Each lcore polls its own hardware queue (q1-q3) on each port,
so lcores never touch each other's queues:
lcore 1: RX, process, TX (queue q1)
lcore 2: RX, process, TX (queue q2)
lcore 3: RX, process, TX (queue q3)
Cons/pros?
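A sketch of the run-to-completion loop with per-lcore hardware queues; it assumes the ports were configured with one RX/TX queue pair per lcore and that queue ids simply follow lcore ids:

#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

/* Each lcore owns one RX and one TX queue per port, so the fast
 * path needs no locks at all. */
static int lcore_loop(void *arg __rte_unused)
{
        const uint16_t queue_id = rte_lcore_id(); /* assumed queue mapping */
        struct rte_mbuf *burst[32];
        uint16_t nb_rx;

        for (;;) {
                nb_rx = rte_eth_rx_burst(0, queue_id, burst, 32);
                /* ... process the packets ... */
                rte_eth_tx_burst(1, queue_id, burst, nb_rx);
        }
        return 0;
}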
RSS* and Flow Affinity
The NIC hashes each flow to one of the hardware queues (q1-q3),
so all packets of a flow land on the same lcore:
lcore 1: RX, process, TX (q1)
lcore 2: RX, process, TX (q2)
lcore 3: RX, process, TX (q3)
Why?
* Receive Side Scaling
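A sketch of enabling RSS when configuring a port; the constant names follow the classic ethdev API (newer DPDK releases spell them RTE_ETH_*), and the 3-queue layout matches the diagram above:

#include <rte_ethdev.h>

/* Sketch: configure a port with 3 RX/TX queue pairs (one per lcore)
 * and let RSS hash flows across the RX queues. */
static int setup_rss_port(uint16_t port_id)
{
        struct rte_eth_conf port_conf = {
                .rxmode = { .mq_mode = ETH_MQ_RX_RSS },
                .rx_adv_conf = {
                        .rss_conf = { .rss_hf = ETH_RSS_IP }, /* hash on IP addresses */
                },
        };

        return rte_eth_dev_configure(port_id, 3, 3, &port_conf);
}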
DPDK rte_hash Library
1. Array of buckets
2. Fixed number of entries per bucket
3. No data, key management only
4. 4-byte key signatures
Why?
[Diagram: rte_hash stores buckets of 4-byte key signatures plus an
internal key array; a lookup returns the key's index, and the
application keeps its own data under that index]
Data?
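A sketch of using rte_hash as described: the library manages only keys, and the index it returns addresses an application-side data array. The flow_data struct and MAX_FLOWS value are illustrative:

#include <stdint.h>
#include <rte_hash.h>
#include <rte_jhash.h>

#define MAX_FLOWS 1024

/* The application owns the data; rte_hash only maps key -> index. */
static struct flow_data { uint64_t pkts; } flows[MAX_FLOWS];

static struct rte_hash *create_flow_table(void)
{
        struct rte_hash_parameters params = {
                .name = "flow_table",
                .entries = MAX_FLOWS,
                .key_len = sizeof(uint32_t),  /* e.g. destination IPv4 */
                .hash_func = rte_jhash,
                .socket_id = 0,
        };
        return rte_hash_create(&params);
}

static void count_packet(struct rte_hash *h, uint32_t dst_ip)
{
        int32_t idx = rte_hash_lookup(h, &dst_ip);
        if (idx < 0)
                idx = rte_hash_add_key(h, &dst_ip);  /* new flow */
        if (idx >= 0)
                flows[idx].pkts++;                   /* index into app data */
}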
Cuckoo hashing — scheme for resolving hash collisions
with worst-case constant lookup time
— Wikipedia
Collision?
Cuckoo Hashing Algorithm
[Diagram: inserting keys A, B, C into a cuckoo hash table]
1. No collision: place the key at its primary hash location
2. Primary location taken: use the alternative location
3a. Both locations taken: push the resident key to its alternative location
3b. Add the new key to the vacated slot
Every key has two candidate locations (double addressing); on collision,
keys are kicked between them, cuckoo style.
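A toy sketch of cuckoo insertion with two tables and bounded kicks; the hash functions and sizes are illustrative, key 0 is reserved as the empty marker, and DPDK's real implementation lives in rte_hash:

#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 64
#define MAX_KICKS  16
#define EMPTY      0   /* key 0 is reserved as the empty marker */

static uint32_t tbl_a[TABLE_SIZE], tbl_b[TABLE_SIZE];

static uint32_t hash_a(uint32_t k) { return (k * 2654435761u) % TABLE_SIZE; }
static uint32_t hash_b(uint32_t k) { return (k * 40503u + 61u) % TABLE_SIZE; }

/* Lookup probes exactly two locations: worst-case constant time. */
static bool cuckoo_lookup(uint32_t key)
{
        return tbl_a[hash_a(key)] == key || tbl_b[hash_b(key)] == key;
}

static bool cuckoo_insert(uint32_t key)
{
        for (int kick = 0; kick < MAX_KICKS; kick++) {
                /* Place into table A, evicting any resident key. */
                uint32_t evicted = tbl_a[hash_a(key)];
                tbl_a[hash_a(key)] = key;
                if (evicted == EMPTY)
                        return true;

                /* Re-insert the evicted key into table B, and so on. */
                key = evicted;
                evicted = tbl_b[hash_b(key)];
                tbl_b[hash_b(key)] = key;
                if (evicted == EMPTY)
                        return true;
                key = evicted;
        }
        return false;  /* too many kicks: grow or rehash */
}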
Recap: IP Flexible Subnetting
Service Provider (AS100): subnet 100.0.0.0
Company 3: subnet 100.3.0.0
Office: 100.3.1.0
Lab: 100.3.2.0
The subnet/host split of address 100.3.2.5 moves with the prefix length:
100. | 3.2.5
100.3. | 2.5
100.3.2. | 5
100.3.2.5
How?
A packet from AS200 is routed step by step:
Route to: AS100
Route to: AS100, Company 3
Route to: AS100, Company 3, Lab
Deliver to: AS100, Company 3, Lab, Host 5
Recap: Router Logic
[Diagram: Service Provider (subnet 100.0.0.0) -> Company 3 (subnet 100.3.0.0) -> Lab (100.3.2.0)]
1. Receive IP packet:
○ Check Ethernet FCS
○ Remove Ethernet header
2. Decrease TTL
3. Find the best route in routing table:
○ Most specific route is the best
4. If found, send to next-hop router:
○ Destination MAC = the next-hop gateway’s MAC (resolved from its IP via ARP)
5. Else, drop the packet
Lab Router Routing Table:
100.3.2.* —> dev eth1 (Lab), directly
*.*.*.* —> dev eth0 (Company 3), via 100.3.0.1 (Company 3 Router)
Longest Prefix Match (LPM)
Example routing table:
0.0.0.0/0 -> R1
10.0.0.0/8 -> R2
10.10.0.0/16 -> R3
Destination address 10.10.0.1 matches all three routes.
Which route to use? The longest (most specific) match wins: 10.10.0.0/16 -> R3.
DPDK rte_lpm Library
IPv4:
1. 32-bit keys
2. Fixed maximum number of rules
3. LPM rule: 32-bit key + prefix len + user data (next hop)
4. DIR-24-8 algorithm (1-2 memory reads per match)
IPv6:
1. 128-bit keys
2. Similar algorithm:
24 bit + 13 x 8 bit tables = 1-14 memory reads (typically 5)
How?
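A sketch of the rte_lpm API using the routing table from the LPM slide; it follows the 2014-era signatures (later releases widened the next hop from 8 to 32 bits and changed rte_lpm_create()):

#include <stdint.h>
#include <rte_lpm.h>

static void lpm_example(void)
{
        struct rte_lpm *lpm = rte_lpm_create("routes", 0 /* socket */,
                                             1024 /* max rules */, 0);

        /* Example routing table: next hops R1=1, R2=2, R3=3 */
        rte_lpm_add(lpm, 0, 0, 1);                           /* 0.0.0.0/0    -> R1 */
        rte_lpm_add(lpm, 10u << 24, 8, 2);                   /* 10.0.0.0/8   -> R2 */
        rte_lpm_add(lpm, (10u << 24) | (10u << 16), 16, 3);  /* 10.10.0.0/16 -> R3 */

        uint8_t next_hop;
        uint32_t dst = (10u << 24) | (10u << 16) | 1;        /* 10.10.0.1 */
        if (rte_lpm_lookup(lpm, dst, &next_hop) == 0)
                ; /* next_hop == 3: the longest prefix, R3, wins */
}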
DPDK rte_lpm Library: DIR-24-8
[Diagram: DIR-24-8 lookup tables]
tbl24: 2^24 two-byte entries indexed by the top 24 address bits,
covering prefix lengths 0-24 (2^24 * 2 = 32MB table); an entry holds
either a next hop (R1, R2, R3...) or a "more specific routes" flag
tbl8: groups of 256 entries indexed by the low 8 address bits,
covering prefix lengths 25-32 (N * 2 * 256 bytes table, where N is the
max number of routes with prefix len > 24)
Lookup: the 1st read indexes tbl24; the optional 2nd read indexes tbl8
only when the "more specific routes" flag is set; entries point into
the table of LPM rules (routes) and next hops
DPDK Poll Mode Drivers
1. Lock-free —> thread unsafe
2. Based on user space IO (uio)
3. Limited number of supported NICs
4. Any interface via PCAP (slow)
DPDK Kernel NIC Interface
[Diagram: the NIC is driven via igb_uio; DPDK Poll Mode Drivers run in
user space (lcore 1), while rte_kni (lcore 2) exchanges packets with a
kernel vEth0 KNI driver, so Linux/FreeBSD tools like ping keep working]
1. Allows user space applications to access Linux control plane
2. Allows management of DPDK ports using Linux tools
3. Interface with the kernel network stack
DPDK Thread Safety
1. Thread unsafe: all performance-sensitive functions:
hash, LPM, PMD...
2. Multi-threaded: performance-insensitive functions:
malloc, memzone...
3. Fast and thread safe: rings (lockless queues) and mempools
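Item 3 in action: a minimal sketch of passing packet pointers between two lcores over a lockless ring (names are illustrative; the SP/SC flags match one producer and one consumer):

#include <rte_ring.h>

static struct rte_ring *ring;

static void ring_setup(void)
{
        /* Single-producer / single-consumer lockless queue */
        ring = rte_ring_create("rx_to_worker", 1024, 0 /* socket */,
                               RING_F_SP_ENQ | RING_F_SC_DEQ);
}

static void producer(void *pkt)
{
        if (rte_ring_enqueue(ring, pkt) != 0)
                ;  /* ring full: drop or retry */
}

static void consumer(void)
{
        void *pkt;

        if (rte_ring_dequeue(ring, &pkt) == 0)
                ;  /* got a packet to process */
}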
DPDK Performance Tips
1. Never use libc or Linux APIs on the fast path:
malloc -> rte_malloc
printf -> RTE_LOG
2. Avoid cache misses / false sharing by using per-lcore variables
Example: port statistics (see the sketch below)
3. Use NUMA sockets to allocate local memory
4. Use rings for inter-core communication
5. Use burst mode in PMDs
6. Help the branch predictor: use likely()/unlikely()
7. Prefetch data into the cache with rte_prefetchX()
Why? False sharing?
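A sketch of tip 2: per-lcore, cache-line-aligned statistics. Each lcore increments only its own slot, so no cache line ping-pongs between cores (no false sharing) and no atomics are needed; a reader sums the slots:

#include <stdint.h>
#include <rte_common.h>   /* __rte_cache_aligned */
#include <rte_lcore.h>    /* rte_lcore_id(), RTE_MAX_LCORE */

struct port_stats {
        uint64_t rx_pkts;
        uint64_t tx_pkts;
} __rte_cache_aligned;      /* one full cache line per slot */

static struct port_stats stats[RTE_MAX_LCORE];

static inline void count_rx(uint16_t n)
{
        stats[rte_lcore_id()].rx_pkts += n;   /* own slot only */
}

/* A reader (e.g. a stats thread) sums all slots for the total. */
static uint64_t total_rx(void)
{
        uint64_t sum = 0;

        for (unsigned i = 0; i < RTE_MAX_LCORE; i++)
                sum += stats[i].rx_pkts;
        return sum;
}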