Program

HC33 Proceedings (~140 MB) (Synopsys keynote unavailable; press coverage with slides)

Tutorials: Sunday, August 22nd, 2021

Time (PDT) Title Presenters
9:00AM-1:30PM Tutorial 1: ML Performance and Real World Applications
Machine learning is a rich, varied, and rapidly evolving field. This tutorial will explore the applications, performance characteristics, and key challenges of many unique workloads across training and inference. In particular, we will focus on hardware/software co-optimization for the industry-standard MLPerf™ benchmarks, as well as selected applications and considerations at prominent cloud providers.
Chair: David Kanter
 
9:00AM-9:45AM Modern Neural Networks and their Computational Characteristics
This talk will survey the computational characteristics of modern DNN workloads by examining the characteristics and trends of the major application domains: computer vision, language, speech, and recommendation. We will review the main categories of operations that occur in these networks, such as convolution variants, attention modules, recurrent cells, embeddings, and normalizations. We will also examine how the nature of the input data (regular grid, sequence, graph, unstructured) influences DNN model architectures and their choice of operations. Finally, we will outline how the workload of a given network differs between training and inference, due to changes in input characteristics, operation fusions, and workload-reduction techniques such as quantization and sparsity. (A small back-of-the-envelope example follows this entry.)
Paulius Micikevicius, NVIDIA
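To make those computational characteristics concrete, here is a small back-of-the-envelope MAC counter. This is an editorial illustration, not material from the talk; the layer shapes (a ResNet-style convolution, a BERT-style attention block) are generic examples.

```python
# Back-of-the-envelope MAC counts for two common DNN operations.
# Layer shapes are illustrative, not taken from the talk.

def conv2d_macs(h, w, c_in, c_out, k):
    """MACs for a kxk convolution over an h x w feature map."""
    return h * w * c_in * c_out * k * k

def self_attention_macs(seq_len, d_model):
    """MACs for one self-attention block: Q/K/V/output projections
    plus the QK^T and attention-times-V matmuls."""
    projections = 4 * seq_len * d_model * d_model
    attention = 2 * seq_len * seq_len * d_model
    return projections + attention

# A ResNet-style conv layer vs. a BERT-style attention block.
print(f"conv 56x56, 64->64 ch, 3x3 : {conv2d_macs(56, 56, 64, 64, 3):,} MACs")
print(f"attention L=512, d=768     : {self_attention_macs(512, 768):,} MACs")
```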
9:45AM-10:15AM MLPerf™ Training and Inference
The MLPerf Training and Inference Benchmarks have become the industry standard for measuring machine learning system performance (speed). We will describe the design choices in benchmarking machine learning performance, and how the MLPerf Training and Inference Benchmarks navigate those choices. We will walk through the submission and review process for the benchmarks, with the goal of enabling smooth submissions for potential submitters. We will review industry progress as shown by 2+ years of results on the benchmark suites. Lastly, we will discuss ongoing work to improve the benchmark suites, and how new collaborators can become involved and make a field-wide impact.
Peter Mattson, Google
10:15AM-10:30AM Break (15 min)
 
10:30AM-11:00AM Software/hardware co-optimization on the IPU: An MLPerf™ case study
Machine learning is a full-system problem that requires careful optimization across software and hardware. In this case study, we present performance results and key optimizations from the Graphcore submission to the MLPerf v1.0 training benchmark, the culmination of our architectural and system engineering innovations applied to real-world AI models. We provide optimized implementations for a range of models in natural language processing (NLP) and image classification. Central to our ability to fit these models on die are novel techniques including model parallelism, FP16 master weights, external streaming memory, and small-batch training. (A short numeric illustration of the FP16 master-weight challenge follows this entry.)
Mario Michael Krell, Graphcore
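Editorial aside: a tiny numeric illustration (not Graphcore's code) of why FP16 master weights are nontrivial enough to call a novel technique: small per-step updates round away at fp16 precision, which the conventional mixed-precision recipe avoids by keeping an fp32 master copy.

```python
import numpy as np

# Generic mixed-precision weight-update sketch (not Graphcore's code).
master = np.float32(1.0)      # conventional fp32 master weight
w16 = np.float16(master)      # fp16 working copy used for compute

for _ in range(1000):
    grad = np.float16(1e-4)   # tiny per-step update
    master += np.float32(grad)       # the fp32 master absorbs it
    # In fp16, 1.0 + 1e-4 rounds back to 1.0 (eps at 1.0 is ~1e-3),
    # so a naive fp16 master never moves:
    w16 = np.float16(w16 + grad)

print(master, w16)            # ~1.1 vs. 1.0
```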
11:00AM-11:30AM Deep Learning Inference Optimizations on CPUs
Deep learning (DL) inference applications are growing rapidly with the tremendous growth in data from connected devices. Although there are dedicated deep learning accelerators, CPUs remain the most widely available inference platform today. Optimizations on CPUs bring direct time and cost savings to deep learning applications by reducing latency and/or increasing throughput. Unlike vision models, most language, speech, and recommendation models accept a large range of input shapes and can be very large in size. To optimize these use cases, we used the following five techniques: (1) low-precision inference; (2) reducing compute by introducing sparsity; (3) reducing memory accesses with ops fusion; (4) reducing primitive creation overhead; and (5) improving hardware utilization by load balancing the input sizes and introducing more parallelism. Though we consider several specific DL models, each is an example of a more general class of DL models to which we expect these optimizations to apply. Lastly, all implementation details are open sourced in Intel’s latest MLPerf™ inference v1.0 submissions. (An illustrative sketch of technique (1) follows this entry.)
Guokai Ma, Intel
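As a generic sketch of technique (1), low-precision inference (an editorial illustration in plain NumPy, not Intel's oneDNN implementation): quantize activations and weights to int8, accumulate the matrix multiply in int32, and dequantize the result.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: returns (q, scale)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 64)).astype(np.float32)   # activations
w = rng.standard_normal((64, 8)).astype(np.float32)   # weights

qa, sa = quantize_int8(a)
qw, sw = quantize_int8(w)

# Integer matmul with int32 accumulation, then dequantize.
y_int32 = qa.astype(np.int32) @ qw.astype(np.int32)
y = y_int32.astype(np.float32) * (sa * sw)

print("max abs error vs. fp32:", np.abs(y - a @ w).max())
```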
11:30AM-12:00PM AI at Scale for the Modern Era
The past decade has witnessed a 300,000-fold increase in the amount of compute used for AI. The latest natural language processing models are fueled by over a trillion parameters, while the memory needs of neural recommendation and ranking models have grown from hundreds of gigabytes to the terabyte scale. The training of state-of-the-art industry-scale personalization and recommendation models consumes the largest number of compute cycles among all deep learning use cases at Facebook. What are the key system challenges faced by industry-scale deep learning models? This talk will highlight the scale and its implications for infrastructure optimization challenges and opportunities across the end-to-end machine learning execution pipeline, from ML data pre-processing to training system throughput optimization. The talk will conclude with directions for building high-performance, efficient AI systems at scale. (A brief sizing example follows this entry.)
Carole-Jean Wu & Niket Agarwal, Facebook
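A quick editorial sizing example behind the "terabyte scale" claim (the parameter count is generic, not a Facebook model):

```python
# Weights alone for a trillion-parameter model, before optimizer state,
# embedding-table precision choices, or activation memory are counted.
params = 1.0e12            # one trillion parameters
for name, bytes_per in (("fp32", 4), ("fp16", 2)):
    print(f"{name}: {params * bytes_per / 1e12:.0f} TB of weights")
```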
12:00PM-12:15PM Break (15 min)
 
12:15PM-12:45PM The Nature of Graph Neural Network Workloads
Graph neural networks (GNNs) are powerful tools for learning from graph data and are widely used in applications such as fraud detection and recommendation. The popularity of GNN research and its adoption in industry have sparked much systems research on optimizing GNN training. The characteristics of GNN computations depend on three factors: (i) the architecture of the GNN model, (ii) the graph topology and the types of node/edge features associated with it, and (iii) the training algorithm. Early GNN systems research focused on optimizing simple GNN architectures, such as GraphSage and GCN, on small homogeneous graphs with full-batch training. However, GNN workloads are moving towards models with complex architectures, graphs with heterogeneous and multimodal information, and mini-batch training. To guide GNN framework optimizations and hardware design towards emerging workloads, we are developing a GNN benchmark that covers both simple and complex GNN architectures, graph datasets of different types and scales, and various training methods. In this talk, we will present the first version of our GNN benchmark and the characteristics of GNN workloads on various hardware platforms. (A minimal GCN-layer sketch follows this entry.)
Da Zheng and George Karypis, Amazon
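For readers unfamiliar with the simple GNN architectures mentioned above, here is a minimal NumPy sketch of one GCN layer, H' = ReLU(D^-1/2 (A+I) D^-1/2 H W). This is the textbook formulation, not the benchmark's code.

```python
import numpy as np

def gcn_layer(adj, h, w):
    """One GCN layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0], dtype=adj.dtype)   # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ h @ w, 0.0)   # aggregate, transform, ReLU

# Tiny 4-node graph, 3-dim node features, 2 output channels.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 0],
                [1, 0, 0, 0]], dtype=np.float32)
h = np.random.default_rng(0).standard_normal((4, 3)).astype(np.float32)
w = np.random.default_rng(1).standard_normal((3, 2)).astype(np.float32)
print(gcn_layer(adj, h, w))
```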
12:45PM-1:10PM Challenges in large scale training of Giant models on large TPU machines
In this talk we will present various challenges in scaling and training giant models on large TPU machines. We will first review large-scale techniques for parallelizing large language models, such as Mesh TensorFlow and GShard. We will also discuss challenges in model development, performance optimization, machine availability, network topologies, and debugging, as well as techniques to serve giant models. A major debugging challenge is being able to run the same model at a smaller scale, while performance optimization relies on aggressive techniques to overlap communication with computation. (A schematic sketch of such overlap follows this entry.)
Sameer Kumar, Google
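As a schematic, editorial sketch of the overlap technique (hypothetical allreduce/backward helpers standing in for a real collective library; not Google's implementation): the gradient all-reduce for each layer is launched asynchronously while backpropagation proceeds to the preceding layer.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def allreduce(grad):
    """Hypothetical stand-in for a collective all-reduce call."""
    return grad  # a real library would sum this across replicas

def backward(layer, grad_out):
    """Hypothetical per-layer backprop producing a weight gradient."""
    return grad_out * layer  # placeholder math

layers = [np.float32(i + 1) for i in range(4)]
comm = ThreadPoolExecutor(max_workers=1)  # acts as a communication stream

pending, grad = [], np.float32(1.0)
for layer in reversed(layers):
    grad = backward(layer, grad)
    # Launch this layer's gradient all-reduce asynchronously, then keep
    # computing the next (earlier) layer's backward pass.
    pending.append(comm.submit(allreduce, grad))

reduced = [f.result() for f in pending]  # drain outstanding communication
comm.shutdown()
print(reduced)
```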
1:10PM-1:35PM ZeRO-Infinity and DeepSpeed: Breaking the device Memory Wall for Extreme Scale Deep Learning
ZeRO-Infinity is a novel deep learning (DL) training technology for scaling model training from a single GPU to massive supercomputers with thousands of GPUs. It powers unprecedented model sizes by leveraging the full memory capacity of a system, concurrently exploiting all heterogeneous memory (GPU, CPU, and Non-Volatile Memory Express, or NVMe) without requiring model code refactoring. At the same time it achieves excellent training throughput and scalability, unencumbered by limited CPU or NVMe bandwidth. ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current-generation GPU clusters. It can be used to fine-tune trillion-parameter models on a single NVIDIA DGX-2 node, making large models more accessible. In terms of training throughput and scalability, it sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak), while also demonstrating super-linear scalability. An open-source implementation of ZeRO-Infinity is available through DeepSpeed.ai. (An illustrative configuration sketch follows this entry.)
Yuxiong He and Samyam Rajbhandari, Microsoft
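An illustrative sketch of enabling ZeRO-Infinity-style NVMe offload through DeepSpeed. The config keys follow the public DeepSpeed documentation; the model, paths, and batch size are placeholders, not the talk's settings.

```python
import torch
import deepspeed  # open-source implementation from deepspeed.ai

# Placeholder model; ZeRO-Infinity targets far larger networks.
model = torch.nn.Linear(4096, 4096)

# Illustrative ZeRO stage-3 config with optimizer and parameter state
# offloaded to NVMe. Consult the DeepSpeed docs for the full schema.
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}

# Run under the `deepspeed` launcher; the returned engine partitions and
# offloads state without refactoring the model code.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```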
2:30PM-7:00PM Tutorial 2: Advanced Packaging
This tutorial will discuss advanced 3D packaging technologies that enable performance and density improvements. Descriptions of the technologies and how they are used in cutting edge applications will be made by industry leaders in packaging and chip design.
Chair: Ralph Wittig
 
2:30PM-3:30PM Technology Provider: Intel packaging technologies for chiplets and 3D
Advanced packaging technologies are critical enablers of Heterogeneous Integration (HI) because they provide compact, power-efficient integration platforms. This talk will establish the value of packaging as an HI platform and describe the capabilities of different packaging architectures. These architectures will be compared primarily on the basis of their physical interconnect capabilities. Key features of leading-edge 2D and 3D technologies, such as EMIB, Silicon Interposer, Foveros, and Co-EMIB, will be described, and a roadmap for their evolution will be presented. Challenges and opportunities in developing robust advanced package architectures will be discussed in the context of Intel’s use of leading-edge packaging technologies in Graphics, Client, and FPGA applications.
Ravi Mahajan and Sandeep Sane, Intel
3:30PM-4:30PM Technology Provider: TSMC packaging technologies for chiplets and 3D
With the established TSMC 3DFabric™ technology platform, we continue to scale up the heterogeneous system package envelope and to scale down system interconnect. This interconnect scaling follows the roadmap we have proposed for driving 3D interconnect density, bandwidth/latency, and energy-efficient performance (EEP). Meanwhile, we leverage our wafer-level system-integration technologies to provide innovative solutions that enhance heat dissipation as we move to liquid cooling for the system. Furthermore, we newly introduce a disruptive Compact Universal Photonic Engine (COUPE) for Si photonics applications, driving system EEP for HPC and networking. Results will be shared. (An illustrative interconnect-density calculation follows this entry.)
Doug Yu, TSMC
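An editorial back-of-the-envelope on why bond pitch is the headline metric for 3D interconnect density (the pitches are generic examples for microbump-class and hybrid-bond-class stacking, not TSMC roadmap data):

```python
# Vertical 3D interconnect density scales with the inverse square of the
# bond pitch, which is what "scaling down system interconnect" buys.

def bonds_per_mm2(pitch_um):
    return (1000.0 / pitch_um) ** 2

for pitch in (36, 9):   # generic microbump vs. hybrid-bond class pitches
    print(f"{pitch:>2} um pitch -> {bonds_per_mm2(pitch):,.0f} bonds/mm^2")
```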
4:30PM-4:45PM Break (15 min)
 
4:45PM-5:30PM Case Study: Intel products built with 2.5D and 3D packaging
Ravi Mahajan and Sandeep Sane, Intel
5:30PM-6:15PM Case Study: AMD products built with 3D packaging
With chiplet architectures becoming mainstream and recognized as fundamental to the continued, economically viable growth of power-efficient computing, advanced packaging technologies and architectures are increasingly critical to enabling Moore’s Law’s next frontier through heterogeneous integration. In this tutorial, we will cover the advanced package architectures AMD is enabling to deliver PPAC (power, performance, area, and cost) improvements as well as heterogeneous architectures. The direct Cu-Cu bonding technology used in AMD’s 3D V-Cache architecture will be detailed and compared with industry-standard 3D architectures for PPAC benefits. Other technologies being enabled to advance high-performance computing architectures will also be previewed.
Raja Swaminathan, AMD
6:15PM-7:00PM Expert Opinion: An overview of the package technology landscape and industry deployment
The role of heterogeneous integration, especially chiplets, is pivotal in this new era of electronics packaging. There are many choices for high-performance packages; this presentation describes the options and presents the advantages of each as they apply to different applications.
Jan Vardaman, TechSearch International Inc

Conference Day 1: Monday, August 23rd, 2021

Time (PDT) Title Presenters
8:45AM-9:00AM Introductions
 
  Welcome to HC33
Ian Bratt and Alisa Scherer, Hot Chips 33 Organizing Chair and Program Co-Chair
9:00AM-11:00AM CPUs

Chair: Nam Sung Kim
 
  Intel Alder Lake CPU Architectures
Efraim Rotem, Intel
  AMD Next Generation “Zen 3” Core
Mark Evers, AMD
  The >5GHz next generation IBM Z processor chip
Christian Jacobi, IBM
  Next-Gen Intel Xeon CPU - Sapphire Rapids
Arijit Biswas and Sailesh Kottapalli, Intel
11:00AM-11:30AM Break (30 min)
 
11:30AM-12:30PM Academic Spinout Chips

Chair: Krste Asanovic
 
  Mozart: Designing for Software Maturity and the Next Paradigm for Chip Architectures
Karu Sankaralingam, University of Wisconsin- Madison
  Morpheus II: A RISC-V Security Extension for Protecting Vulnerable Software and Hardware
Todd Austin, University of Michigan
12:30PM-1:30PM Keynote

Chair: Fred Weber
 
  Builders of the Imaginary: From Artificial Intelligence to Artificial Architects in the Era of SysMoore
Aart de Geus, CEO, Synopsys
1:30PM-2:30PM Break (1 hr)
 
2:30PM-4:00PM Infrastructure and Data Processors

Chair:
 
  Arm Neoverse N2: Arm’s second-generation high performance infrastructure CPUs and system products
Andrea Pellegrini, ARM
  NVIDIA Data Center Processing Unit (DPU) Architecture
Idan Burstein, NVIDIA
  Intel’s Hyperscale-Ready SmartNIC for Infrastructure Processing
Bradley Burres, Intel
4:00PM-5:00PM Keynote

Chair: Ralph Wittig
 
  Skydio Autonomy Engine: Enabling the Next Generation of Autonomous Flight
Abraham Bachrach, CTO, Skydio
5:00PM-5:30PM Break (30 min)
 
5:30PM-7:00PM Enabling Technologies

Chair: Rob Aitken
 
  Heterogeneous computing to enable the highest level of safety in automotive systems
Ramanujan Venkatadri, Infineon
  Architecting an Open RISC-V 5G and AI SoC for Next Generation 5G Open Radio Access Network
Sriram Rajagopal, EdgeQ
  Aquabolt-XL: Samsung HBM2-PIM with in-memory processing for machine learning accelerators
Jin Hyun Kim, Samsung Electronics

Conference Day 2: Tuesday, August 24th, 2021

Time (PDT) Title Presenters
8:30AM-10:00AM ML Inference for the Cloud

Chair: Ron Diamant
 
  Accelerating ML Recommendation with over a Thousand RISC-V/Tensor Processors on Esperanto’s ET-SoC-1 Chip
David Ditzel, Esperanto Technologies
  AI Compute Chip from Enflame
Ryan Liu and Chuang Feng, Enflame Technology
  Qualcomm Cloud AI 100: 12 TOPs/W Scalable, High Performance and Low Latency Deep Learning Inference Accelerator
Karam Chatha, Qualcomm Inc
10:00AM-11:00AM Keynote

Chair: Kunle Olukotun
 
  Architectural Challenges: AI Chips, Decision Support and High Performance Computing
Dimitri Kusnezov, Deputy Under Secretary for AI and Technology, Department of Energy
11:00AM-11:30AM Break (30 min)
 
11:30AM-1:30PM ML and Computation Platforms

Chair: Natalia Vassilieva
 
  Graphcore Colossus Mk2 IPU
Simon Knowles, Graphcore
  The Multi-Million Core, Multi-Wafer AI Cluster
Sean Lie, Cerebras Systems
  SambaNova SN10 RDU: Accelerating Software 2.0 with Dataflow
Raghu Prabhakar and Sumti Jairath, SambaNova Systems, Inc
  The Anton 3 ASIC: a Fire-Breathing Monster for Molecular Dynamics Simulations
J. Adam Butts, D.E. Shaw Research
1:30PM-2:30PM Break (1 hr)
 
2:30PM-4:30PM Graphics and Video

Chair: Pradeep Dubey
 
  Intel’s Ponte Vecchio GPU Architecture
David Blythe, Intel
  AMD RDNA(TM) 2 Graphics Architecture
Andrew Pomianowski, AMD
  Google’s Video Coding Unit (VCU) Accelerator
Aki Kuusela and Clint Smullen, Google
  Xilinx 7nm Edge Processors
Juanjo Noguera, Xilinx
4:30PM-5:00PM Break (30 min)
 
5:00PM-7:00PM New Technologies

Chair: Forest Baskett
 
  Mojo Lens - AR Contact Lenses for Real People
Michael Wiemer and Renaldi Winoto, Mojo Vision
  World Largest Mobile Image Sensor with All Directional Phase Detection Auto Focus Function
Sukki Yoon, Samsung Electronics
  New Value Creation by Nano-Tactile Sensor Chip Exceeding our Fingertip Discrimination Ability
Hidekuni Takao, Kagawa University
  The IonQ Trapped Ion Quantum Computer Architecture
Christopher Monroe, IonQ, Inc
7:00PM-7:15PM Closing
 
  Thank you for attending
Cliff Young, Hot Chips 34 General Chair

Posters

Title Authors & Affiliation
OmniDRL: An Energy-Efficient Mobile Deep Reinforcement Learning Accelerator with Dual-mode Weight Compression and Direct Processing of Compressed Data Juhyoung Lee; Korea Advanced Institute of Science and Technology
Exynos 1080 high-performance, low-power CPU and GPU with AMIGO Taehee Lee; Samsung
An Energy-efficient Floating-Point DNN Processor using Heterogeneous Computing Architecture with Exponent-Computing-in-Memory Juhyoung Lee; Korea Advanced Institute of Science and Technology
Dynamic Neural Accelerator for Reconfigurable and Energy-efficient Neural Network Inference Sakyasingha Dasgupta; EdgeCortix
SM6: A 16nm System-on-Chip for Accurate and Noise-Robust Attention-Based NLP Applications Thierry Tambe; Harvard
ENIAD: A Reconfigurable Near-data Processing Architecture for Web-Scale AI-enriched Big Data Service Jialiang Zhang; U Penn
A Plug-and-Play Universal Photonic Processor for Quantum Information Processing Caterina Taballione; QuiX
Industry’s First 7.2 Gbps 512 GB DDR5 Memory Module with 8-Stacked DRAMs: A Promising Memory Solution for Next-Gen Servers Sung Joo Park; Samsung
LightOn Optical Processing Unit: Scaling-up AI and HPC with a Non-von Neumann co-processor Laurent Daudet; LightOn
System-on-Chip Implementation of Trusted Execution Environment with Heterogeneous Architecture Trong-Thuc Hoang; University of Electro-Communications
A CORDIC-based Trigonometric Hardware Accelerator with Custom Instruction in 32-bit RISC-V System-on-Chip Khai-Duy Nguyen; University of Electro-Communications
A photonic neural network using < 1 photon per scalar multiplication Tianyu Wang; Cornell
Edge Inference Engine for Deep & Random Sparse Neural Networks with 4-bit Cartesian-Product MAC Array and Pipelined Activation Aligner Kota Ando; Tokyo Institute of Technology
Photonic co-processors in HPC: using LightOn OPUs for Randomized Numerical Linear Algebra Daniel Hesslow; LightOn
Elpis: High Performance Low Power Controller for Data Center SSDs Seungwon Lee; Samsung
SOT-MRAM – Third generation MRAM memory opens new opportunities Jean-Pierre Nozières; Antaios
PNNPU: A Fast and Efficient 3D Point Cloud-based Neural Network Processor with Block-based Point Processing for Regular DRAM Access Sangjin Kim; Korea Advanced Institute of Science and Technology
Samsung NPU: An AI accelerator and SDK for flagship mobile AP Jun-Seok Park; Samsung