Hot Chips 33 has concluded.
Thank you to the speakers, attendees, sponsors, press, and volunteers.

HC33 Proceedings (~140MB) (Synopsys Keynote unavailable. Press coverage w/ slides)

Tutorials: Sunday, August 22nd, 2021

Time (PDT)	Title	Presenters
9:00AM-1:30PM	Tutorial 1: ML Performance and Real World Applications Machine learning is a rich, varied, and rapidly evolving field. This tutorial will explore the applications, performance characteristics, and key challenges of many different unique workloads across training and inference. In particular, we will focus on hardware/software co-optimization for the industry-standard MLPerf™ benchmarks and selected applications and considerations at prominent cloud players. Chair: David Kanter
9:00AM-9:45AM	Modern Neural Networks and their Computational Characteristics This talk will survey the computational characteristics of modern DNN workloads by taking a look at the characteristics and trends of the major application domains: computer vision, language, speech, recommendation. We will review the main categories of operations that occur in these networks, such as convolution variants, attention modules, recurrent cells, embeddings, normalizations, etc. We will also examine how the nature of the input data (regular grid, sequence, graph, unstructured) influences the DNN model architectures and their choice of operations. Finally, we will outline how the workload of a given network differs between its training and inference, due to changes in input characteristics, operation fusions, and workload-reduction techniques such as quantization and sparsity.	Paulius Micikevicius, NVIDIA
9:45AM-10:15AM	MLPerf™ Training and Inference The MLPerf Training and Inference Benchmarks have become the industry standard for measuring machine learning system performance (speed). We will describe the design choices in benchmarking machine learning performance, and how the MLPerf Training and Inference Benchmarks navigate those choices. We will walk through the submission and review process for the benchmarks, with the goal of enabling smooth submissions for potential submitters. We will review industry progress as shown by 2+ years of results on the benchmark suites. Lastly, we will discuss ongoing work to improve the benchmark suites, and how new collaborators can become involved and make a field-wide impact.	Peter Mattson, Google
10:15AM-10:30AM	Break (15 min)
10:30AM-11:00AM	Software/hardware co-optimization on the IPU: An MLPerf™ case study Machine learning is a full system problem that requires careful optimization across software and hardware. In this case study, we present performance results and key optimizations from the Graphcore submission to the MLPerf v1.0 training benchmark, which is the culmination of all the architectural and system engineering innovations on real world AI models. We provide optimized implementations for a range of models in NLP (Natural Language Processing), and Image Classification. Central to our ability to fit these models on die are novel techniques including model parallelism, FP16 master weights, external streaming memory, and small batch size training.	Mario Michael Krell, Graphcore
11:00AM-11:30AM	Deep Learning Inference Optimizations on CPUs Deep learning (DL) inference applications are growing rapidly with the tremendous growth in data from connected devices. Although there are dedicated deep learning accelerators, CPUs remain the most available inference platform today. Optimizations on CPUs bring direct time and cost savings to deep learning applications by reducing latency, increasing throughput, or both. Unlike vision models, most language, speech and recommendation models accept a large range of input shapes and can be very large in size. To optimize these use cases, we used the following five techniques: (1) low precision inference; (2) reducing compute by introducing sparsity; (3) reducing memory accesses with ops fusion; (4) reducing primitive creation overhead; and (5) improving hardware utilization by load balancing the input sizes and introducing more parallelism. Though we consider several specific DL models, each is an example of a more general class of DL models, where we expect these optimizations to apply. Lastly, all implementation details are open sourced in Intel’s latest MLPerf™ inference v1.0 submissions.	Guokai Ma, Intel
11:30AM-12:00PM	AI at Scale for the Modern Era The past decade has witnessed a 300,000 times increase in the amount of compute for AI. The latest natural language processing model is fueled with over a trillion parameters while the memory need of neural recommendation and ranking models has grown from hundreds of gigabytes to the terabyte scale. The training of state-of-the-art industry-scale personalization and recommendation models consumes the highest number of compute cycles among all deep learning use cases at Facebook. What are the key system challenges faced by industry-scale deep learning models? This talk will highlight the scale and the implications on infrastructure optimization challenges and opportunities across the machine learning execution pipeline end to end, from ML data pre-processing to training system throughput optimization. The talk will conclude with directions for building high-performance, efficient AI systems at scale.	Carole-Jean Wu & Niket Agarwal, Facebook
12:00PM-12:15PM	Break (15 min)
12:15PM-12:45PM	The Nature of Graph Neural Network Workloads Graph neural networks (GNNs) are powerful tools for learning from graph data and are widely used in various applications such as fraud detection and recommendation. The popularity of GNN research and adoption in industry has sparked much systems research on optimizing GNN training. The characteristics of the GNN computations depend on three factors: (i) the architectures of GNN models, (ii) the graph topology and the types of node/edge features associated with it, and (iii) the training algorithm. Early GNN systems research focused on optimizing simple GNN architectures, such as GraphSage and GCN, on small homogeneous graphs with full-batch training. However, GNN workloads are moving towards GNN models with complex architectures, graphs with heterogeneous and multimodal information, and mini-batch training. To guide the GNN framework optimizations and hardware design towards emerging workloads, we are developing a GNN benchmark that covers both simple and complex GNN architectures, graph datasets of different types and different scales and various training methods. In this talk, we will present the first version of our GNN benchmark and the characteristics of the GNN workloads on various hardware platforms.	Da Zheng and George Karypis, Amazon
12:45PM-1:10PM	Challenges in large scale training of Giant models on large TPU machines In this talk we will present various challenges in scaling and training giant models on large TPU machines. We will first review large scale techniques for parallelizing large language models such as MeshTensorFlow, GShard etc. We will also discuss challenges in model development, performance optimization, machine availability, network topologies, debugging challenges and techniques to serve giant models. A major debugging challenge is to be able to run the same model at a smaller scale, while performance optimization techniques include optimizing communication overheads via aggressive techniques to overlap communication and computation.	Sameer Kumar, Google
1:10PM-1:35PM	ZeRO-Infinity and DeepSpeed: Breaking the device Memory Wall for Extreme Scale Deep Learning ZeRO-Infinity is a novel deep learning (DL) training technology for scaling deep learning model training, from a single GPU to massive supercomputers with thousands of GPUs. It powers unprecedented model sizes by leveraging the full memory capacity of a system, concurrently exploiting all heterogeneous memory (GPU, CPU, and Non-Volatile Memory Express or NVMe for short) without requiring model code refactoring. At the same time it achieves excellent training throughput and scalability, unencumbered by the limited CPU or NVMe bandwidth. ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters. It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible. In terms of training throughput and scalability, it sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak), while also demonstrating superlinear scalability. An open source implementation of ZeRO-Infinity is available through DeepSpeed.ai.	Yuxiong He and Samyam Rajbhandari, Microsoft
2:30PM-7:00PM	Tutorial 2: Advanced Packaging This tutorial will discuss advanced 3D packaging technologies that enable performance and density improvements. Industry leaders in packaging and chip design will describe the technologies and how they are used in cutting-edge applications. Chair: Ralph Wittig
2:30PM-3:30PM	Technology Provider: Intel packaging technologies for chiplets and 3D Advanced packaging technologies are critical enablers of Heterogeneous Integration (HI) because of their importance as compact, power efficient platforms. This talk will establish the value of packaging as a HI platform and describe the capabilities of different packaging architectures. These architectures will be compared primarily on the basis of their physical interconnect capabilities. Key features in leading edge 2D and 3D technologies, such as EMIB, Silicon Interposer, Foveros and Co-EMIB will be described and a roadmap for their evolution will be presented. Challenges and opportunities in developing robust advanced package architectures will be discussed in the context of Intel’s use of leading edge packaging technologies in Graphics, Client and FPGA applications.	Ravi Mahajan and Sandeep Sane, Intel
3:30PM-4:30PM	Technology Provider: TSMC packaging technologies for chiplets and 3D With the established TSMC 3DFabric™ technology platform, we continue to scale up the heterogeneous system package envelope, and to scale down on system interconnect. This system interconnect scaling is based on the roadmap we have proposed to drive 3D interconnect density, bandwidth/latency and energy efficient performance (EEP). Meanwhile, we leverage our wafer-level-system-integration technologies to provide innovative solutions to enhance heat dissipation when we move into liquid-cooling for the system. Furthermore, a disruptive Compact Universal Photonic Engine (COUPE) for Si photonics applications is newly introduced to drive the EEP of HPC and networking systems. Results will be shared here.	Doug Yu, TSMC
4:30PM-4:45PM	Break (15 min)
4:45PM-5:30PM	Case Study: Intel products built with 2.5D and 3D packaging	Ravi Mahajan and Sandeep Sane, Intel
5:30PM-6:15PM	Case Study: AMD products built with 3D packaging With chiplet architectures becoming mainstream, and recognized as fundamental to enabling the continued economically viable growth of power efficient computing, advanced packaging technologies and architectures are becoming more critical to enabling Moore’s Law’s next frontier through heterogeneous integration. In this tutorial, we will cover the advanced package architectures being developed by AMD to deliver PPAC (power, performance, area and cost) improvements as well as enable heterogeneous architectures. The direct Cu-Cu bonding technology used in AMD’s 3D V-Cache architecture will be detailed and compared to industry standard 3D architectures for PPAC benefits. Other technologies that are being developed to advance high performance computing architectures will also be previewed.	Raja Swaminathan, AMD
6:15PM-7:00PM	Expert Opinion: An overview of the package technology landscape and industry deployment The role of heterogeneous integration, especially chiplets, is pivotal in this new era of electronics packaging. There are many choices for high-performance packages and this presentation describes the options and presents the advantages for each as they apply to different applications.	Jan Vardaman, TechSearch International Inc

Conference Day 1: Monday, August 23rd, 2021

Time (PDT)	Title	Presenters
8:45AM-9:00AM	Introductions
	Welcome to HC33	Ian Bratt and Alisa Scherer, Hot Chips 33: Organizing Chair and Program Co-Chair
9:00AM-11:00AM	CPUs Chair: Nam Sung Kim
	Intel Alder Lake CPU Architectures	Efraim Rotem, Intel
	AMD Next Generation “Zen 3” Core	Mark Evers, AMD
	The >5GHz next generation IBM Z processor chip	Christian Jacobi, IBM
	Next-Gen Intel Xeon CPU - Sapphire Rapids	Arijit Biswas and Sailesh Kottapalli, Intel
11:00AM-11:30AM	Break (30 min)
11:30AM-12:30PM	Academic Spinout Chips Chair: Krste Asanovic
	Mozart: Designing for Software Maturity and the Next Paradigm for Chip Architectures	Karu Sankaralingam, University of Wisconsin-Madison
	Morpheus II: A RISC-V Security Extension for Protecting Vulnerable Software and Hardware	Todd Austin, University of Michigan
12:30PM-1:30PM	Keynote Chair: Fred Weber
	Builders of the Imaginary: From Artificial Intelligence to Artificial Architects in the Era of SysMoore	Aart de Geus, CEO, Synopsys
1:30PM-2:30PM	Break (1 hr)
2:30PM-4:00PM	Infrastructure and Data Processors Chair:
	Arm Neoverse N2: Arm’s second-generation high performance infrastructure CPUs and system products	Andrea Pellegrini, ARM
	NVIDIA Data Center Processing Unit (DPU) Architecture	Idan Burstein, NVIDIA
	Intel’s Hyperscale-Ready SmartNIC for Infrastructure Processing	Bradley Burres, Intel
4:00PM-5:00PM	Keynote Chair: Ralph Wittig
	Skydio Autonomy Engine: Enabling the Next Generation of Autonomous Flight	Abraham Bachrach, CTO, Skydio
5:00PM-5:30PM	Break (30 min)
5:30PM-7:00PM	Enabling Technologies Chair: Rob Aitken
	Heterogeneous computing to enable the highest level of safety in automotive systems	Ramanujan Venkatadri, Infineon
	Architecting an Open RISC-V 5G and AI SoC for Next Generation 5G Open Radio Access Network	Sriram Rajagopal, EdgeQ
	Aquabolt-XL: Samsung HBM2-PIM with in-memory processing for machine learning accelerators	Jin Hyun Kim, Samsung Electronics

Conference Day 2: Tuesday, August 24th, 2021

Time (PDT)	Title	Presenters
8:30AM-10:00AM	ML Inference for the Cloud Chair: Ron Diamant
	Accelerating ML Recommendation with over a Thousand RISC-V/Tensor Processors on Esperanto’s ET-SoC-1 Chip	David Ditzel, Esperanto Technologies
	AI Compute Chip from Enflame	Ryan Liu and Chuang Feng, Enflame Technology
	Qualcomm Cloud AI 100: 12 TOPs/W Scalable, High Performance and Low Latency Deep Learning Inference Accelerator	Karam Chatha, Qualcomm Inc
10:00AM-11:00AM	Keynote Chair: Kunle Olukotun
	Architectural Challenges: AI Chips, Decision Support and High Performance Computing	Dimitri Kusnezov, Deputy Under Secretary for AI and Technology, Department of Energy
11:00AM-11:30AM	Break (30 min)
11:30AM-1:30PM	ML and Computation Platforms Chair: Natalia Vassilieva
	Graphcore Colossus Mk2 IPU	Simon Knowles, Graphcore
	The Multi-Million Core, Multi-Wafer AI Cluster	Sean Lie, Cerebras Systems
	SambaNova SN10 RDU: Accelerating Software 2.0 with Dataflow	Raghu Prabhakar and Sumti Jairath, SambaNova Systems, Inc
	The Anton 3 ASIC: a Fire-Breathing Monster for Molecular Dynamics Simulations	J. Adam Butts, D.E. Shaw Research
1:30PM-2:30PM	Break (1 hr)
2:30PM-4:30PM	Graphics and Video Chair: Pradeep Dubey
	Intel’s Ponte Vecchio GPU Architecture	David Blythe, Intel
	AMD RDNA(TM) 2 Graphics Architecture	Andrew Pomianowski, AMD
	Google’s Video Coding Unit (VCU) Accelerator	Aki Kuusela and Clint Smullen, Google
	Xilinx 7nm Edge Processors	Juanjo Noguera, Xilinx
4:30PM-5:00PM	Break (30 min)
5:00PM-7:00PM	New Technologies Chair: Forest Baskett
	Mojo Lens - AR Contact Lenses for Real People	Michael Wiemer and Renaldi Winoto, Mojo Vision
	World Largest Mobile Image Sensor with All Directional Phase Detection Auto Focus Function	Sukki Yoon, Samsung Electronics
	New Value Creation by Nano-Tactile Sensor Chip Exceeding our Fingertip Discrimination Ability	Hidekuni Takao, Kagawa University
	The IonQ Trapped Ion Quantum Computer Architecture	Christopher Monroe, IonQ, Inc
7:00PM-7:15PM	Closing
	Thank you for attending	Cliff Young, Hot Chips 34: General Chair

Posters

Title	Authors & Affiliation
OmniDRL: An Energy-Efficient Mobile Deep Reinforcement Learning Accelerators with Dual-mode Weight Compression and Direct Processing of Compressed Data	Juhyoung Lee; Korea Advanced Institute of Science and Technology
Exynos 1080 high-performance, low-power CPU and GPU with AMIGO	Taehee Lee; Samsung
An Energy-efficient Floating-Point DNN Processor using Heterogeneous Computing Architecture with Exponent-Computing-in-Memory	Juhyoung Lee; Korea Advanced Institute of Science and Technology
Dynamic Neural Accelerator for Reconfigurable and Energy-efficient Neural Network Inference	Sakyasingha Dasgupta; EdgeCortix
SM6: A 16nm System-on-Chip for Accurate and Noise-Robust Attention-Based NLP Applications	Thierry Tambe; Harvard
ENIAD: A Reconfigurable Near-data Processing Architecture for Web-Scale AI-enriched Big Data Service	Jialiang Zhang; U Penn
A Plug-and-Play Universal Photonic Processor for Quantum Information Processing	Caterina Taballione; QuiX
Industry’s First 7.2 Gbps 512 GB DDR5 Memory Module with 8-Stacked DRAMs: A Promising Memory Solution for Next-Gen Servers	Sung Joo Park; Samsung
LightOn Optical Processing Unit: Scaling-up AI and HPC with a Non von Neumann co-processor	Laurent Daudet; LightOn
System-on-Chip Implementation of Trusted Execution Environment with Heterogeneous Architecture	Trong-Thuc Hoang; University of Electro-Communications
A CORDIC-based Trigonometric Hardware Accelerator with Custom Instruction in 32-bit RISC-V System-on-Chip	Khai-Duy Nguyen; University of Electro-Communications
A photonic neural network using < 1 photon per scalar multiplication	Tianyu Wang; Cornell
Edge Inference Engine for Deep & Random Sparse Neural Networks with 4-bit Cartesian-Product MAC Array and Pipelined Activation Aligner	Kota Ando; Tokyo Institute of Technology
Photonic co-processors in HPC: using LightOn OPUs for Randomized Numerical Linear Algebra	Daniel Hesslow; LightOn
Elpis: High Performance Low Power Controller for Data Center SSDs	Seungwon Lee; Samsung
SOT-MRAM – Third generation MRAM memory opens new opportunities	Jean-Pierre Nozières; Antaios
PNNPU: A Fast and Efficient 3D Point Cloud-based Neural Network Processor with Block-based Point Processing for Regular DRAM Access	Sangjin Kim; Korea Advanced Institute of Science and Technology
Samsung NPU: An AI accelerator and SDK for flagship mobile AP	Jun-Seok Park; Samsung