Xen Summit 2010
  Extending Xen into Embedded
and Communications Workloads
Agenda


       •       Embedded Usage Models
       •       Virtual Machine Monitor Requirements
       •       Benchmarking
       •       Cisco Product Range
       •       Embedded Development Requirements
       •       High Availability




2
Embedded Usage Models

IP Media Phones
     Atom based platforms delivering Internet connectivity and media content to continuously connected devices.

Robotics
     Using the Core microarchitecture for a GUI interface combined with real time industrial control.

Routing
     Xeon microarchitecture based platforms implementing control and data-plane services on high end routers.

Unique VMM requirements across all segments

 3
Virtual Machine Monitor Implementation

Media Phone
     A critical partition is required to host the cell phone application; the hypervisor must provide Quality of Service.
     (Stack: Critical Partition and App partition on a thin vmm.)

Industrial
     Industrial control requires determinism; performance is measured in interrupt latency (10 usec or lower).
     (Stack: Microsoft GUI partition and an RTOS communicating through shared memory on the vmm.)

Comm's Appliance
     Scalability, flexibility, RAS and fail over are a few of the vmm requirements in the comm's appliance environment.
     (Stack: Linux (Service), Linux and an RTOS on the vmm.)




 4
Embedded Virtualization - Advantages


Consolidation and preservation of legacy, proprietary single threaded operating systems

Rapid deployment of new services

Integrated development environment, kept separate from critical services

(Diagram: legacy RTOS dataplane guests and a Linux control guest run on the vmm across Core 0 and Core 1 of a multi-core architecture; VT-d / SR-IOV gives each dataplane guest its own rx/tx queues on the 10 Gb/s NIC's physical function (PF).)




5
Embedded Deployment Requirements
Scheduling control for Guest Quality of Service
     Traffic prioritization to avoid packet loss requires (soft) real time scheduling
     Credit based scheduler research in progress
     (Diagram: single core scheduling on Atom - Dom0, a Phone Application guest and an App Development guest share the core and its I/O under Xen.)

Consolidate the fast path with a security Intrusion Detection application
     Requires an efficient mechanism to share packet data with the Linux application
     Grant tables (io rings) may be an efficient mechanism to meet the performance requirements (needs to be lock free); a sketch follows below.
     (Diagram: consolidated fast path on Xeon - a Dom0 Linux Intrusion Detection domain and a Fast Path ip packet forwarding domain exchange packets over grant table io rings under Xen.)
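A minimal sketch of the kind of lock free, single producer / single consumer ring the fast path and the intrusion detection application could share over a grant mapped region. All names, sizes and the use of C11 atomics here are illustrative assumptions; Xen's real shared rings are defined in the public io/ring.h headers and would sit in grant-mapped memory rather than a locally allocated structure.

    /* Illustrative lock-free SPSC packet ring over a shared region. */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    #define RING_SLOTS 256                     /* must be a power of two */
    #define SLOT_BYTES 2048                    /* room for one full frame */

    struct pkt_slot {
        uint32_t len;
        uint8_t  data[SLOT_BYTES];
    };

    struct pkt_ring {
        _Atomic uint32_t prod;                 /* written only by the producer */
        _Atomic uint32_t cons;                 /* written only by the consumer */
        struct pkt_slot  slot[RING_SLOTS];
    };

    /* Producer (fast-path domain): 0 on success, -1 if full or oversized. */
    static int ring_push(struct pkt_ring *r, const void *pkt, uint32_t len)
    {
        uint32_t p = atomic_load_explicit(&r->prod, memory_order_relaxed);
        uint32_t c = atomic_load_explicit(&r->cons, memory_order_acquire);

        if (p - c == RING_SLOTS || len > SLOT_BYTES)
            return -1;

        r->slot[p & (RING_SLOTS - 1)].len = len;
        memcpy(r->slot[p & (RING_SLOTS - 1)].data, pkt, len);

        /* Publish the slot only after the copy is complete. */
        atomic_store_explicit(&r->prod, p + 1, memory_order_release);
        return 0;
    }

    /* Consumer (Linux intrusion detection application): 0 on success, -1 if empty. */
    static int ring_pop(struct pkt_ring *r, void *buf, uint32_t *len)
    {
        uint32_t c = atomic_load_explicit(&r->cons, memory_order_relaxed);
        uint32_t p = atomic_load_explicit(&r->prod, memory_order_acquire);

        if (c == p)
            return -1;

        *len = r->slot[c & (RING_SLOTS - 1)].len;
        memcpy(buf, r->slot[c & (RING_SLOTS - 1)].data, *len);

        atomic_store_explicit(&r->cons, c + 1, memory_order_release);
        return 0;
    }

Because each free-running index is written by exactly one side, no lock is needed; Xen's existing io rings follow the same producer/consumer index pattern.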



6
Embedded Xen Deployment
The power profile of some edge based appliances is cyclical, so the potential power savings can be substantial (example: a Base Station Controller, where data traffic and voice traffic each rise and fall across the 6am-6pm day, as shown in the two load charts).

ACPI is generally not supported in real time / proprietary operating systems.

Hypervisor power management could be very useful to control the overall power budget.

"Shelf Manager" power management research is in progress.

(Diagram: several multi-core blades, each running Dom0, a Shelf Manager domain and Fast Path domains on Xen; intelligent power management balances I/O latency and throughput across the shelf.)

 7
Embedded Xen – Direct Cache Access




DCA (Direct Cache Access) delivers inbound I/O data directly into the CPU cache to reduce average memory latency, and attempts to reduce memory bandwidth.
(Diagram: the DCA-enabled NIC behind the IOH pushes tagged writes toward the CPU cache and memory controller; Dom0 and two guests run on Xen, each executing on a CPU with its own cache.)

The DCA driver uses get_cpu() to obtain the current CPU id, which it maps to an APIC ID and uses to configure the DCA enabled NIC device. Excerpt from the igb driver:

    static void igb_update_dca(struct igb_q_vector *q_vector)
    {
        struct igb_adapter *adapter = q_vector->adapter;
        struct e1000_hw *hw = &adapter->hw;
        int cpu = get_cpu();            /* id of the CPU we are running on */

        if (q_vector->cpu == cpu)       /* NIC already targets this CPU */
            goto out_no_update;
        /* ... otherwise retarget the queue's DCA tag, then put_cpu() ... */
    }
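For context, a hedged sketch of what the elided update path does, loosely following the mainline igb driver: the CPU id is converted to a DCA tag (derived from that CPU's APIC ID) and programmed into the queue's DCA control register. dca3_get_tag() is the real in-kernel helper; the register-write helper below is hypothetical.

    /* Hedged sketch of the elided update path - not verbatim igb code. */
    static void igb_retarget_dca(struct igb_q_vector *q_vector, int cpu)
    {
        struct igb_adapter *adapter = q_vector->adapter;
        /* Real helper: turn a CPU id into the DCA tag encoding its APIC ID. */
        u8 tag = dca3_get_tag(&adapter->pdev->dev, cpu);

        /* Hypothetical helper standing in for the per-queue register writes
         * (RX/TX DCA control) that steer descriptor and payload writes
         * toward the cache of 'cpu'. */
        igb_write_dca_ctrl(adapter, q_vector, tag);

        q_vector->cpu = cpu;            /* remember the new target CPU */
    }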
For DCA to work in a virtualized guest, the CPU id seen by get_cpu() must map to the valid APIC ID of the physical core where the guest is currently executing.




  8
Benchmarking, 10 GbE perspective
A 64B packet can arrive every 67.2 ns.
In terms of processor cycles: at 2.53 GHz, a 64B packet arrives every ~201 cycles.
This can generate up to 14.88 million Rx and 14.88 million Tx transactions (packets) every second.
Each packet has a 16B descriptor associated with it that must be written for every packet processed.
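For reference, a minimal sketch of the arithmetic behind these line-rate figures. The 84-byte wire size is an assumption: the 64-byte frame plus 8 bytes of preamble and a 12-byte inter-frame gap.

    #include <stdio.h>

    int main(void)
    {
        const double line_rate  = 10e9;            /* 10 Gb/s                    */
        const double wire_bytes = 64 + 8 + 12;     /* frame + preamble + IFG     */
        const double wire_time  = wire_bytes * 8.0 / line_rate;   /* seconds     */
        const double pps        = 1.0 / wire_time;                /* packets/sec */
        const double desc_bw    = pps * 16.0;      /* 16B descriptor per packet  */

        printf("one 64B packet every %.1f ns\n", wire_time * 1e9);       /* ~67.2 ns    */
        printf("%.2f Mpps per direction\n", pps / 1e6);                  /* ~14.88 Mpps */
        printf("descriptor writes: %.0f MB/s per direction\n", desc_bw / 1e6);
        return 0;
    }

The descriptor figure (~238 MB/s per direction just for 16B descriptors) is one reason small-packet forwarding is memory-bandwidth as well as cycle limited.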
The Linux forwarding code takes ~3000 cycles to process a packet. With enhancements, the cycles per (64 byte) packet can be reduced to ~1350.

(Chart: forwarded packets per second versus packet size, from 64 bytes upward.)


9
Guest Forwarding Performance
Layer 3 forwarding, 2 ports (1 core, 1 thread): native versus virtualized packets per second (PPS) across packet sizes from 64 to 1518 bytes.

(Diagram: two Linux forwarding guests, each given direct device access through VT-d, run on Core 0 and Core 1 of a multi-core architecture over the vmm, each with its own I/O port.)

                 Single threaded virtualized environments show promising performance:
                                - Near native performance for small packet sizes
                                - Native performance for large packet sizes ( >256B ).

                 Limited performance penalty for consolidation, additional scaling tests
                 in progress

                           10
Cisco Embedded Product Space

Wide range of products in a number of market segments:

     Home: Flip Video, Valet
     Branch: 3900 ISR, 2800 ISR
     Voice & Video: TelePresence, Unified Communications
     Enterprise: MDS 9222i (SAN), ASR 1000
     Data Center: UCS, Nexus 7000
     Service Provider: ASR 9000, CRS
     Security: Ironport, ASA 5500


   11
Embedded Product Environment
Hardware Environment
      General Purpose CPUs, SoCs, ASICs, FPGAs, custom processors, ixp, DSPs, …
      From large multi-core, multi-blade, multi-chassis systems to small single/dual core devices
      Terabit to Gigabit I/O



Software Environment
      Multi-OS: IOS, IOS-XE, IOS-XR, NX-OS
           Proprietary (legacy), Linux, other …
      Single threaded, multi-threaded, pipelined, flow-based, …
      Multiple vm models
           integrated services platform, distributed/load balancing, HA, control & data
           separation, …
      Control plane, data plane, management plane, appliance and service engines, …
           e.g., routing, data, voice, video, deep packet inspection, firewall, security, etc.



Memory, processor, and I/O bandwidth requirements vary by application
and network device location

 12
Embedded Development Requirements
We believe that xen is the right choice for an embedded hypervisor
     Early support for prototype hardware required: In hypervisor and dom0
     Open source xen and linux critical to this effort
     It’s the right architecture and feature set for embedded development



RAS
     High Availability (HA) for guests
          non-disruptive stateful failover, non-disruptive in service software upgrade (ISSU)
     Devices
          hot pluggable/removable (non-disruptive): shared & dedicated (including sr-iov)
     dom0
          Separate device driver domains good, but not enough
          All domains need to be restartable


Deterministic Performance
     QoS control through configuration and scheduling
     I/O linearly scalable across cores and vms
     Low latency interrupts



13
Embedded Development Requirements
Core allocation/Scheduling: vcpu → pcpu mapping
     (pinned, non-shared): deterministic performance
     (pinned, shared), (non-pinned, shared): scheduled

For pv IOS, I/O workload, 64-byte packets, 2 ports, bidirectional, 64-bit xen, NUMA on:

     (pinned, non-shared), HT off:                100% line rate (1Gb) per core; <0.1% of time spent in the hypervisor
     (non-pinned, shared), HT off:                ~10% decreased throughput
     (pinned, non-shared), NUMA-remote, HT off:   ~8% decreased throughput
     (pinned, non-shared), HT on, one vcpu on each thread of the core:   1.5x/1.7x (I/O/cpu) increase in aggregate throughput; 0.75x/0.85x (I/O/cpu) throughput per transaction for a single thread
     (pinned, non-shared), HT on, only one thread of the core in use:    same as (pinned, non-shared), HT off
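To make the "(pinned, non-shared)" case concrete, a minimal sketch of how a guest's vcpus can be pinned with the standard Xen tools of this era; the domain name and CPU numbers are made-up examples.

    # In the guest configuration: two vcpus, pinned to physical CPUs 2-3,
    # which no other domain is allowed to use.
    vcpus = 2
    cpus  = "2-3"

    # Or re-pin a running guest from dom0 with the toolstack:
    #   xm vcpu-pin ios-guest 0 2
    #   xm vcpu-pin ios-guest 1 3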


 Guest Support
      Both pv and hvm (hybrid!)
      32-bit & 64-bit
      Virtual memory paged and non-paged (single, flat address space)


14
Embedded Development Requirements
Debug and Performance Monitoring
     multi-guest, simultaneous
     32-bit & 64-bit guests (minimum is gdbsx for both pv & hvm)
     Performance monitoring tools (access to PMU data - xenoprofile & others)
     Required in the field as well as during development

Trusted Systems: Secure Products
     Trusted boot, TPM, Intel TXT/AMD-V
     Trusted guests, sandboxed 3rd party guests, anti-counterfeiting, …
     Manageable

Power Management
     Especially at the edge, branch, and consumer devices
     Policy based, managed by hypervisor
          Cases where guest should not be automatically power managed

“carrier class” xen Development Environment
     Support for rapid prototyping
     Support for production product environment




15
HA Requirements
Rationale
      HA & ISSU features available on many platforms across our product space today
           Cannot go to market without support in certain product spaces
      Software fails much more often than hardware
           Software-only HA/ISSU at much lower cost very attractive
           Natural fit on multi-core devices

High Availability (HA)
      Active-Standby: stateful, “hot” Standby
      Failure of Active causes non-disruptive failover to Standby
      Reconciliation required on switchover
           Standby progresses through state machine to Active state
      I/O devices always belong to Active and switch to [new] Active without loss of state
           Packet loss ok on switchover – higher level protocols recover
      Downstream end of device connection must not see a “failure”
      Switchover must take place in < 1 sec.

In Service Software Upgrade (ISSU)
      Built on HA infrastructure
      Automated software upgrade (or downgrade)
      Non disruptive: Fallback if required or requested



 16
HA Requirements
What is needed:
      Reliable fast failure detection mechanism
           Current: hardware uses interrupt pin; backup is heart-beat mechanism (slow)
           Need to emulate/implement a fast, reliable failure detection mechanism in xen (a baseline heartbeat sketch follows this list)
      Failover device transparently from Active to Standby
           no loss of [device] state
           Packet traffic dropped until Standby transitions to Active
      Interrupts
           redirected to new Active (old Standby) on failover
           interrupts dropped until Standby transitions to Active
           [new] Active must be able to address outstanding interrupts without complete reset
      Need to be able to run in redundant hardware configuration or on multi-core device
          drivers responsible for appropriate reconciliation protocols
      Minimize the changes to xen kernel and dom0 code
           recovery decisions need to be in the domain of the guest driver
      Support for direct assign devices (including sr-iov) and shared devices
      Non shared memory solution for DMA target memory preferred
          requires ability to either pre-program and switch or reprogram and switch on failover
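Referring to the failure-detection point above: a minimal user-space sketch of the slow baseline, a shared-memory heartbeat that the Standby polls against a deadline. The memory would in practice be a region shared between Active and Standby (for example via grant tables), and the 500 us budget is an illustrative number, not a measured one.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <time.h>

    /* Shared between Active and Standby (e.g. a grant-mapped page). */
    struct heartbeat {
        _Atomic uint64_t beat;            /* incremented by the Active */
    };

    /* Active side: call periodically from a high-priority context. */
    static void hb_tick(struct heartbeat *hb)
    {
        atomic_fetch_add_explicit(&hb->beat, 1, memory_order_release);
    }

    /* Standby side: returns 1 if the Active missed its deadline. */
    static int hb_expired(struct heartbeat *hb, uint64_t *last,
                          struct timespec *last_seen, long deadline_ns)
    {
        uint64_t now_beat = atomic_load_explicit(&hb->beat, memory_order_acquire);
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        if (now_beat != *last) {                      /* progress observed */
            *last = now_beat;
            *last_seen = now;
            return 0;
        }
        long elapsed_ns = (now.tv_sec - last_seen->tv_sec) * 1000000000L
                        + (now.tv_nsec - last_seen->tv_nsec);
        return elapsed_ns > deadline_ns;              /* e.g. 500000 ns budget */
    }

Polling like this is why the heartbeat path is characterized as slow; the requirement is for something closer to the hardware interrupt-pin behaviour, implemented or emulated inside xen.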


 17
“carrier class” xen Development Environment
Needs to support 2 different Environments:

      Rapid prototyping and development of new services
         Work often requires an unstable branch and pre-release/prototype hardware
         Straightforward, and accessible to the non xen expert
              Interest is in getting the prototype/product up and running quickly, rather than in xen infrastructure
              Developer threads, blogs, etc. are not a substitute for up-to-date documentation
         Product decisions (go/no go) are based on prototype results
              Failure/missed deadlines will eliminate a prototype as a possible solution
         Corporate networks/labs sit behind firewalls and use proxies
              This doesn't work well with current git-based source control
              Requires exceptions to corporate IT policy

      Production product
         Uses stable release
         Controlled access to performance & debug tools in customer environment
         Documentation required in field as well
         Auditing requires ability to reproduce image bit-for-bit from local build

 18
Summary


     •   The embedded market provides a great growth opportunity
     •   Deployment requires some unique features
     •   Xen is well positioned, but requires support for RAS features, debug, and a "Carrier Class" release




19
