# Sapphire Rapids

Arijit Biswas, Intel Senior Principal Engineer






Next-Gen Intel Xeon Scalable Processor

New Standard for Data Center Architecture

Designed for Microservices & AI Workloads

Pioneering Advanced Memory & IO Transitions



### Node Performance

### Data Center Performance

| Consolidation & Orchestration | Performance Consistency | Elasticity & Efficient Data Center Utilization | Infrastructure & Framework Overhead |
|-------------------------------|-------------------------|------------------------------------------------|-------------------------------------|
| Fast VM Migration<br>Better Telemetry | Low Jitter Architecture<br>Consistent Caching & Mem Latency<br>Inter-Processor Interrupt Virt. | Next Gen Quality of Service Capabilities<br>Broad WL/Usage Support and Optimizations<br>Next Gen Optane Support<br>CXL 1.1 | Integrated WL Accelerators |





Delivers a scalable, balanced architecture leveraging existing software paradigms for monolithic CPUs via a modular architecture



Sapphire Rapids: Multiple Tiles, Single CPU

Every thread has full access to all resources on all tiles: cache, memory, IO...

Provides consistent low latency & high cross-section BW across the entire SoC





# Sapphire Rapids Key Building Blocks

| Compute IP<br>Seamless Integr | Cores      | Acceleration Engines |         |
|-------------------------------|------------|----------------------|---------|
| I/O IP                        | CXL 1.1    | PCIe Gen 5           | UPI 2.0 |
| Memory IP                     | DDR5       | Optane               | HBM     |





# Performance Core

Built for Data Center

Major microarchitecture and IPC improvement

Improved support for large code/data footprint

Consistent performance for multi-tenant usages

Autonomous/Fast PM for high freq @ low jitter

Core block diagram (figure), by stage:

| Stage       | Units                                                         |
|-------------|---------------------------------------------------------------|
| Front end   | I-TLB + I-Cache, Predict, MSROM, Decode, µop Cache, µop Queue |
| Allocation  | Allocate / Rename / Move Elimination / Zero Idiom             |
| Data caches | 48KB Data Cache, 2MB ML Cache                                 |

| Port    | Integer / memory units | Vector units                    |
|---------|------------------------|---------------------------------|
| 00      | ALU, LEA, Shift, JMP   | FMA, ALU, Shift, fpDIV          |
| 01      | ALU, LEA, Mul, iDIV    | FMA, ALU, Shift, Shuffle, FADD  |
| 05      | ALU, LEA, MulHi        | FMA512, ALU, AMX, Shuffle, FADD |
| 06      | ALU, LEA, Shift, JMP   |                                 |
| 10      | ALU, LEA               |                                 |
| 04 / 09 | Store Data             |                                 |
| 02      | AGU, Load              |                                 |
| 08      | AGU, STA               |                                 |
| 03      | AGU, Load              |                                 |
| 07      | AGU, STA               |                                 |
| 11      | AGU, Load              |                                 |



Performance Core Architecture Improvements for DC Workloads & Usages

| Area             | Feature                                                                                                              |
|------------------|----------------------------------------------------------------------------------------------------------------------|
| AI               | Intel<sup>®</sup> Advanced Matrix Extensions - AMX<br>Tiled matrix operations for inference & training acceleration |
| Attached Device  | Accelerator Interfacing Architecture - AiA<br>Efficient dispatch, signaling & synchronization from user level        |
| HFNI             | Half-Precision Float New Instructions<br>Support for FP16 - higher throughput, lower precision                       |
| Cache Management | <b>CLDEMOTE</b><br>Proactive placement of cache contents                                                             |
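To make "tiled matrix operations" concrete, here is a minimal pure-Python sketch of a matrix multiply that walks the operands tile by tile and accumulates int8 products into a wider integer, as the AI slide later describes (int8 with int32 accumulation). It is an illustration of the idea only, not AMX code: the function name `tiled_matmul_int8` and the `tile` size are hypothetical, and Python integers never overflow, whereas hardware would hold the accumulator in 32 bits.

```python
def tiled_matmul_int8(a, b, tile=4):
    """Multiply a (m x k) by b (k x n), one tile at a time.

    Inputs must fit in int8; products are summed into a wide accumulator
    (an int32 in hardware, an unbounded int here).
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    for row in a + b:
        assert all(-128 <= v <= 127 for v in row), "inputs must fit in int8"

    c = [[0] * n for _ in range(m)]
    # Outer three loops select one output tile and one slice of the
    # reduction dimension; the inner loops do the work for that tile.
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(tiled_matmul_int8(a, b, tile=2))
```

The result does not depend on the tile size; tiling only changes the order in which partial sums are formed, which is what lets a hardware unit keep one tile of each operand in registers at a time.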



# Sapphire Rapids Acceleration Engines

Increases the effectiveness of cores by offloading common mode tasks to seamlessly integrated acceleration engines

Accelerator Interfacing Architecture (AiA): native dispatch, signaling & synchronization from user space

Coherent, shared memory space between cores & acceleration engines

Concurrently shareable across processes, containers and VMs





### Intel<sup>®</sup> Data Streaming Accelerator (DSA)

Optimizing streaming data movement and transformation operations



Results have been estimated or simulated based on testing on pre-production hardware and software. For workloads and configurations visit <u>www.intel.com/ArchDay21claims</u>. Results may vary



### Intel<sup>®</sup> QuickAssist Technology (QAT)

Accelerating Cryptography and Data De/Compression



Results have been estimated or simulated. Sapphire Rapids estimation based on architecture models and baseline testing with Ice Lake and Intel QAT. For workloads and configurations visit <u>www.intel.com/ArchDay21claims</u>. Results may vary.



### Intel<sup>®</sup> Dynamic Load Balancer (DLB)

Efficient Load Balancing across CPU Cores

400M Load Balancing Decisions per Second

Offloads Software Queue Management

Dynamic, flow aware load balancing & reordering

Priority Queuing (up to 8 levels)

Dynamic, power aware sizing of applications
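As a rough software picture of the queue management this engine offloads, the sketch below keeps one FIFO per priority level (8 levels, matching the slide) and hands each item to the least-loaded worker. Everything here is hypothetical and illustrative: the class `PriorityLoadBalancer` and its methods are not a DLB API, and flow-aware reordering and power-aware sizing are not modelled.

```python
from collections import deque

NUM_PRIORITIES = 8  # the slide mentions up to 8 priority levels


class PriorityLoadBalancer:
    """Toy software model: priority queues feeding the least-loaded worker."""

    def __init__(self, num_workers):
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]
        self.load = [0] * num_workers  # items assigned to each worker so far

    def enqueue(self, item, priority):
        if not 0 <= priority < NUM_PRIORITIES:
            raise ValueError("priority must be in 0..7 (0 is highest)")
        self.queues[priority].append(item)

    def dispatch(self):
        """Take the oldest item of the highest non-empty priority.

        Returns (worker_index, item), or None when every queue is empty.
        """
        for queue in self.queues:
            if queue:
                item = queue.popleft()
                worker = min(range(len(self.load)), key=self.load.__getitem__)
                self.load[worker] += 1
                return worker, item
        return None


if __name__ == "__main__":
    lb = PriorityLoadBalancer(num_workers=2)
    lb.enqueue("low", 7)
    lb.enqueue("high", 0)
    lb.enqueue("mid", 3)
    while (decision := lb.dispatch()) is not None:
        print(decision)
```

Each `dispatch` call is one load balancing decision; in software it costs a scan over the queues plus a scan over the workers on a CPU core, which is the kind of per-item overhead a dedicated engine removes.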





# Sapphire Rapids I/O Advancements

Introducing Compute Express Link (CXL) 1.1

Accelerator and memory expansion in datacenter

Expanded device performance & connectivity via PCIe 5.0

Improved DDIO & QoS capabilities

Improved Multi-Socket scaling via Intel Ultra Path Interconnect (UPI) 2.0

Up to 4 x24 UPI links operating @ 16 GT/s

New 8S-4UPI performance optimized topology
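The UPI figures above can be turned into a back-of-the-envelope raw bandwidth number. The sketch assumes that "x24" means 24 lanes per link, that each lane moves one bit per transfer, and that the result is per direction with no protocol or encoding overhead; none of these assumptions is stated on the slide, so treat the output as an upper bound rather than a specification. The function name is hypothetical.

```python
def raw_link_bandwidth_gbytes(lanes, gigatransfers_per_s):
    """Raw GB/s for one link in one direction, assuming 1 bit per lane per transfer."""
    return lanes * gigatransfers_per_s / 8  # 8 bits per byte


if __name__ == "__main__":
    per_link = raw_link_bandwidth_gbytes(lanes=24, gigatransfers_per_s=16)
    total = 4 * per_link  # "up to 4" links per socket
    print(f"per link: {per_link} GB/s, four links: {total} GB/s (raw, per direction)")
```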





# Sapphire Rapids IO - Virtualization



#### Intel<sup>®</sup> Shared Virtual Memory (SVM)

Enables devices and IA cores to access shared data in the CPU virtual address space

Consistent across host app. and offloaded tasks

Avoids memory pinning and copying overheads

Integrated & discrete, bare-metal & VM instances

#### Intel<sup>®</sup> Scalable IO Virtualization (S-IOV)

Hardware acceleration for comms between VMs/containers and PCIe devices

Scalable sharing and direct access to accelerators across 1000s of VMs/containers

Higher performance than software-only device scaling; more scalable than SR-IOV

Supports integrated & discrete devices



# Sapphire Rapids Memory and Last Level Cache

Increased Shared Last Level Cache (LLC)

Up to >100 MB LLC shared across all cores

Increased bandwidth, security & reliability via DDR5 Memory

4 memory controllers supporting 8 channels

Integrated memory encryption engine

Improved RAS
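A short worked calculation shows how the channel count translates into peak bandwidth. The slide gives 4 memory controllers and 8 channels but no data rate, so the 4800 MT/s used below is purely an assumed example value, as is the 8 bytes (64 data bits) moved per transfer on each channel. The function name is hypothetical.

```python
def peak_memory_bandwidth_gbytes(channels, megatransfers_per_s, bytes_per_transfer=8):
    """Theoretical peak GB/s: channels x transfers per second x bytes per transfer."""
    return channels * megatransfers_per_s * bytes_per_transfer / 1000


if __name__ == "__main__":
    controllers, channels = 4, 8          # from the slide
    assumed_rate = 4800                   # MT/s, assumption: not stated on the slide
    print("channels per controller:", channels // controllers)
    print("peak GB/s:", peak_memory_bandwidth_gbytes(channels, assumed_rate))
```

Sustained bandwidth is lower than this peak in practice, since the formula ignores refresh, bus turnaround and access patterns.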

Intel Optane<sup>™</sup> Persistent Memory 300 Series







Significantly higher memory bandwidth vs. baseline Xeon-SP with 8 channels of DDR5

Increased capacity and bandwidth; some usages can eliminate the need for DDR entirely

2 Modes





# Sapphire Rapids - Architected for AI

AI has become ubiquitous across usages – AI performance required in all tiers of computing

Goal: Enable efficient usage of AI across all services deployed on elastic general-purpose tier by delivering many times more performance and lower CPU utilization

| Capability                     | Details                                                                |
|--------------------------------|------------------------------------------------------------------------|
| Deep Learning Datatypes        | int8 with int32 accumulation<br>Bfloat16 with IEEE SP accumulation     |
| Acceleration at the ISA Level  | Full Intel Arch. programmability<br>Low Latency                        |
| Available and integrated with  | industry-relevant frameworks & libraries                               |

Ops/Cycle per core @ 100% utilization (figure):

| Instruction path      | Ops/Cycle per core |
|-----------------------|--------------------|
| AVX-512 (2xFMA) FP32  | 64                 |
| AVX-512 (2xFMA) INT8  | 256                |
| AMX (TMUL) BF16       | 1024               |
| AMX (TMUL) INT8       | 2048               |

Results have been simulated. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
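The two AVX-512 figures on this slide (64 and 256 ops/cycle) can be reproduced with simple lane arithmetic, which also puts the AMX figures (1024 and 2048) in proportion. The sketch assumes 512-bit vectors, two FMA units per core as the "2xFMA" label suggests, and one multiply plus one add per lane per cycle; the pairing of 1024 with BF16 and 2048 with INT8 is read off the chart labels and is likewise an assumption.

```python
VECTOR_BITS = 512  # AVX-512 register width
FMA_UNITS = 2      # "2xFMA" on the slide
OPS_PER_FMA = 2    # one multiply and one add per lane


def avx512_ops_per_cycle(element_bits):
    """Ops per cycle per core under the assumptions above."""
    lanes = VECTOR_BITS // element_bits
    return lanes * OPS_PER_FMA * FMA_UNITS


# AMX (TMUL) figures are taken from the slide, not derived.
AMX_OPS_PER_CYCLE = {"bf16": 1024, "int8": 2048}

if __name__ == "__main__":
    fp32 = avx512_ops_per_cycle(32)
    int8 = avx512_ops_per_cycle(8)
    print("AVX-512 FP32:", fp32, "AVX-512 INT8:", int8)
    print("AMX BF16 vs AVX-512 FP32:", AMX_OPS_PER_CYCLE["bf16"] // fp32, "x")
    print("AMX INT8 vs AVX-512 INT8:", AMX_OPS_PER_CYCLE["int8"] // int8, "x")
```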



# Sapphire Rapids - Built for elastic computing models - microservices

>80% of new cloud-native and SaaS applications are expected to be built as microservices

Goal: Enable higher throughput while meeting latency requirements and reducing infrastructure overhead for execution, monitoring and orchestration of thousands of microservices

| Focus area                                  | Capabilities                                                                                                                |
|---------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
| Improved Performance and Quality of Service | Runtime Languages - lower latency for Runtime Languages<br>AiA ISAs - efficient worker threads, signaling and synch.        |
| Reduced Infrastructure Overhead             | Kubernetes - enhanced for scaling, placement and policies<br>Advanced Telemetry - easier analysis & optimization            |
| Better Distributed Communication            | Improved latency of remote procedure calls and service-mesh<br>QAT, DSA etc. - optimized networking and data movement       |

Microservices Performance (figure): Throughput per Core under Latency SLA of p99 <30ms

| Platform        | Relative throughput per core |
|-----------------|------------------------------|
| Baseline        | 1.0                          |
| Icelake Server  | +24%                         |
| Sapphire Rapids | +69%                         |

Results have been simulated. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
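The metric on this slide, throughput per core under a latency SLA of p99 <30ms, can be computed from load-test data roughly as follows. This is a minimal sketch, not the methodology behind the slide: the nearest-rank percentile, the function names and the sample numbers are all assumptions chosen for illustration.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # ceiling division on integers
    return ordered[max(rank, 1) - 1]


def max_throughput_under_sla(measurements, sla_ms=30.0):
    """Highest offered load whose p99 latency stays below the SLA.

    measurements: iterable of (requests_per_second, latencies_ms) pairs,
    one pair per load level tested. Returns 0 if no level meets the SLA.
    """
    best = 0
    for rps, latencies in measurements:
        if percentile(latencies, 99) < sla_ms and rps > best:
            best = rps
    return best


if __name__ == "__main__":
    runs = [
        (1000, [10] * 99 + [50]),       # p99 = 10 ms, meets the SLA
        (1500, [20] * 99 + [100]),      # p99 = 20 ms, meets the SLA
        (2000, [20] * 98 + [40] * 2),   # p99 = 40 ms, violates the SLA
    ]
    print(max_throughput_under_sla(runs))
```

Dividing the returned load by the number of cores used gives a per-core figure comparable across platforms, which is how a relative improvement such as the ones on the slide would be expressed.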



| Theme                                       | Highlights                                                                                                                     |
|---------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| New Standard in Data Center Architecture    | Multi Tile SoC for Scalability<br>Physically Tiled, Logically Monolithic<br>General Purpose & Dedicated Acceleration Engines |
| Designed for Microservices and AI Workloads | Performance Core Architecture                                                                                                  |
| Pioneering Advanced Memory & IO Transitions | DDR5 & HBM<br>PCIe 5.0<br>Enhanced Virtualization Capabilities                                                                 |

# Sapphire Rapids

Biggest Leap in Data Center Capabilities in over a Decade




