Tutorials

Time (PDT) Title Presenters
9:00AM-1:30PM Tutorial 1: ML Performance and Real World Applications
Machine learning is a rich, varied, and rapidly evolving field. This tutorial will explore the applications, performance characteristics, and key challenges of a wide range of workloads across training and inference. In particular, we will focus on hardware/software co-optimization for the industry-standard MLPerf™ benchmarks, as well as selected applications and system considerations at prominent cloud providers.
Chair: David Kanter
 
9:00AM-9:45AM Modern Neural Networks and their Computational Characteristics
This talk will survey the computational characteristics of modern DNN workloads by examining the characteristics and trends of the major application domains: computer vision, language, speech, and recommendation. We will review the main categories of operations that occur in these networks, such as convolution variants, attention modules, recurrent cells, embeddings, and normalizations. We will also examine how the nature of the input data (regular grid, sequence, graph, unstructured) influences DNN model architectures and their choice of operations. Finally, we will outline how the workload of a given network differs between training and inference due to changes in input characteristics, operation fusions, and workload-reduction techniques such as quantization and sparsity.
Paulius Micikevicius, NVIDIA
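To make the operation categories above concrete, the following minimal PyTorch sketch (illustrative only, not drawn from the talk) instantiates one example of each: a convolution and its depthwise variant, multi-head attention, a recurrent cell, an embedding lookup, and two normalizations. All shapes and sizes are arbitrary placeholders.

```python
# Illustrative sketch: one PyTorch instance of each major DNN operation category.
import torch
import torch.nn as nn

x_img = torch.randn(8, 64, 56, 56)   # (batch, channels, H, W) image-like input
x_seq = torch.randn(8, 128, 512)     # (batch, tokens, features) sequence input

conv      = nn.Conv2d(64, 64, kernel_size=3, padding=1)             # standard convolution
dw_conv   = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)  # depthwise variant
attention = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
rnn_cell  = nn.LSTM(input_size=512, hidden_size=512, batch_first=True)
embedding = nn.EmbeddingBag(num_embeddings=10_000, embedding_dim=64)  # recommendation-style lookup
batchnorm = nn.BatchNorm2d(64)
layernorm = nn.LayerNorm(512)

y = batchnorm(dw_conv(conv(x_img)))           # vision-style stack: conv -> conv -> norm
attn_out, _ = attention(x_seq, x_seq, x_seq)  # self-attention over a token sequence
seq_out, _ = rnn_cell(layernorm(x_seq))       # recurrent cell over normalized features
emb = embedding(torch.randint(0, 10_000, (8, 20)))  # sparse categorical features
print(y.shape, attn_out.shape, seq_out.shape, emb.shape)
```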
9:45AM-10:15AM MLPerf™ Training and Inference
The MLPerf Training and Inference Benchmarks have become the industry standard for measuring machine learning system performance (speed). We will describe the design choices in benchmarking machine learning performance, and how the MLPerf Training and Inference Benchmarks navigate those choices. We will walk through the submission and review process for the benchmarks, with the goal of enabling smooth submissions for potential submitters. We will review industry progress as shown by 2+ years of results on the benchmark suites. Lastly, we will discuss ongoing work to improve the benchmark suites, and how new collaborators can become involved and make a field-wide impact.
Peter Mattson, Google
10:15AM-10:30AM Break (15 min)
 
10:30AM-11:00AM Software/hardware co-optimization on the IPU: An MLPerf™ case study
Machine learning is a full-system problem that requires careful optimization across software and hardware. In this case study, we present performance results and key optimizations from the Graphcore submission to the MLPerf™ v1.0 training benchmark, which is the culmination of our architectural and system-engineering innovations on real-world AI models. We provide optimized implementations for a range of models in natural language processing (NLP) and image classification. Central to our ability to fit these models on die are novel techniques including model parallelism, FP16 master weights, external streaming memory, and small-batch-size training.
Mario Michael Krell, Graphcore
11:00AM-11:30AM Deep Learning Inference Optimizations on CPUs
Deep learning (DL) inference applications are growing rapidly with the tremendous growth in data from connected devices. Although there are dedicated deep learning accelerators, CPUs remain the most widely available inference platform today. Optimizations on CPUs bring direct time and cost savings to deep learning applications by reducing latency and/or increasing throughput. Unlike vision models, most language, speech, and recommendation models accept a large range of input shapes and can be very large in size. To optimize these use cases, we used the following five techniques: (1) low-precision inference; (2) reducing compute by introducing sparsity; (3) reducing memory accesses with op fusion; (4) reducing primitive creation overhead; and (5) improving hardware utilization by load-balancing the input sizes and introducing more parallelism. Though we consider several specific DL models, each is an example of a more general class of DL models to which we expect these optimizations to apply. Lastly, all implementation details are open sourced in Intel's latest MLPerf™ inference v1.0 submissions.
Guokai Ma, Intel
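As a concrete illustration of technique (1), low-precision inference, here is a minimal sketch using PyTorch's dynamic quantization API on a toy model. It is not the Intel submission code; the model shape and batch size are arbitrary placeholders.

```python
# Minimal sketch of low-precision (int8) CPU inference via dynamic quantization.
import torch
import torch.nn as nn

model = nn.Sequential(            # stand-in for a language/recommendation MLP block
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).eval()

# Convert Linear layers to int8 weights with dynamically quantized activations,
# trading a small accuracy loss for lower memory traffic and higher CPU throughput.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(16, 1024)
with torch.no_grad():
    ref = model(x)
    out = quantized(x)
print("max abs error:", (ref - out).abs().max().item())
```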
11:30AM-12:00PM AI at Scale for the Modern Era
The past decade has witnessed a 300,000-fold increase in the amount of compute used for AI. The latest natural language processing model is fueled by over a trillion parameters, while the memory needs of neural recommendation and ranking models have grown from hundreds of gigabytes to the terabyte scale. The training of state-of-the-art, industry-scale personalization and recommendation models consumes the largest number of compute cycles among all deep learning use cases at Facebook. What are the key system challenges faced by industry-scale deep learning models? This talk will highlight this scale and its implications for infrastructure optimization challenges and opportunities across the end-to-end machine learning execution pipeline, from ML data pre-processing to training-system throughput optimization. The talk will conclude with directions for building high-performance, efficient AI systems at scale.
Carole-Jean Wu & Niket Agarwal, Facebook
12:00PM-12:15PM Break (15 min)
 
12:15PM-12:45PM The Nature of Graph Neural Network Workloads
Graph neural networks (GNNs) are powerful tools for learning from graph data and are widely used in applications such as fraud detection and recommendation. The popularity of GNN research and its adoption in industry have sparked many systems research efforts to optimize GNN training. The characteristics of GNN computations depend on three factors: (i) the architecture of the GNN model, (ii) the graph topology and the types of node/edge features associated with it, and (iii) the training algorithm. Early GNN systems research focused on optimizing simple GNN architectures, such as GraphSage and GCN, on small homogeneous graphs with full-batch training. However, GNN workloads are moving towards models with complex architectures, graphs with heterogeneous and multimodal information, and mini-batch training. To guide GNN framework optimizations and hardware design towards these emerging workloads, we are developing a GNN benchmark that covers both simple and complex GNN architectures, graph datasets of different types and scales, and various training methods. In this talk, we will present the first version of our GNN benchmark and the characteristics of the GNN workloads on various hardware platforms.
Da Zheng and George Karypis, Amazon
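For reference, here is a minimal dense-matrix sketch of a single GCN layer, one of the "simple GNN architectures" mentioned above. The toy graph and feature sizes are arbitrary; production GNN systems would instead use sparse formats, neighbor sampling, and mini-batching, which is where the workload characterization becomes interesting.

```python
# Minimal sketch of one GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
import torch

def gcn_layer(adj: torch.Tensor, feats: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    a_hat = adj + torch.eye(adj.size(0))        # add self-loops
    deg = a_hat.sum(dim=1)                      # node degrees
    d_inv_sqrt = torch.diag(deg.pow(-0.5))      # D^-1/2
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt  # symmetric normalization
    return torch.relu(norm_adj @ feats @ weight)  # aggregate neighbors, then transform

# Toy graph: 4 nodes, 16-dim input features, 8-dim output features.
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 1.],
                    [0., 1., 0., 1.],
                    [0., 1., 1., 0.]])
feats = torch.randn(4, 16)
weight = torch.randn(16, 8)
print(gcn_layer(adj, feats, weight).shape)  # torch.Size([4, 8])
```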
12:45PM-1:10PM Challenges in Large-Scale Training of Giant Models on Large TPU Machines
In this talk, we will present various challenges in scaling and training giant models on large TPU machines. We will first review large-scale techniques for parallelizing large language models, such as MeshTensorFlow and GShard. We will also discuss challenges in model development, performance optimization, machine availability, network topology, and debugging, as well as techniques for serving giant models. A major debugging challenge is being able to run the same model at a smaller scale, while key performance optimizations include reducing communication overheads by aggressively overlapping communication and computation.
Sameer Kumar, Google
1:10PM-1:35PM ZeRO-Infinity and DeepSpeed: Breaking the Device Memory Wall for Extreme-Scale Deep Learning
ZeRO-Infinity is a novel deep learning (DL) training technology for scaling model training from a single GPU to massive supercomputers with thousands of GPUs. It powers unprecedented model sizes by leveraging the full memory capacity of a system, concurrently exploiting all heterogeneous memory (GPU, CPU, and Non-Volatile Memory Express, or NVMe) without requiring model code refactoring. At the same time, it achieves excellent training throughput and scalability, unencumbered by limited CPU or NVMe bandwidth. ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current-generation GPU clusters. It can be used to fine-tune trillion-parameter models on a single NVIDIA DGX-2 node, making large models more accessible. In terms of training throughput and scalability, it sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak), while also demonstrating super-linear scalability. An open-source implementation of ZeRO-Infinity is available through DeepSpeed.ai.
Yuxiong He and Samyam Rajbhandari, Microsoft
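As a rough illustration of how this kind of training is configured, here is a hedged sketch of a DeepSpeed configuration enabling ZeRO stage 3 with NVMe offload (the ZeRO-Infinity setup described above). The key names follow the DeepSpeed config schema as best understood here, the paths and sizes are placeholders, and deepspeed.ai remains the authoritative reference.

```python
# Hedged sketch (not from the talk): a DeepSpeed config dict for ZeRO stage 3
# with parameter and optimizer-state offload to NVMe.
ds_config = {
    "train_batch_size": 256,
    "fp16": {"enabled": True},                 # mixed-precision training
    "zero_optimization": {
        "stage": 3,                            # partition params, gradients, and optimizer state
        "offload_param":     {"device": "nvme", "nvme_path": "/local_nvme"},   # placeholder path
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},   # placeholder path
        "overlap_comm": True,                  # overlap communication with computation
    },
}

# Such a config is typically written out as JSON (or passed as a dict) to
# deepspeed.initialize(model=..., model_parameters=..., config=ds_config).
```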
2:30PM-7:00PM Tutorial 2: Advanced Packaging
This tutorial will discuss advanced 3D packaging technologies that enable performance and density improvements. Industry leaders in packaging and chip design will describe the technologies and how they are used in cutting-edge applications.
Chair: Ralph Wittig
 
2:30PM-3:30PM Technology Provider: Intel packaging technologies for chiplets and 3D
Advanced packaging technologies are critical enablers of Heterogeneous Integration (HI) because of their importance as compact, power-efficient platforms. This talk will establish the value of packaging as an HI platform and describe the capabilities of different packaging architectures. These architectures will be compared primarily on the basis of their physical interconnect capabilities. Key features of leading-edge 2D and 3D technologies, such as EMIB, silicon interposers, Foveros, and Co-EMIB, will be described, and a roadmap for their evolution will be presented. Challenges and opportunities in developing robust advanced package architectures will be discussed in the context of Intel's use of leading-edge packaging technologies in Graphics, Client, and FPGA applications.
Ravi Mahajan and Sandeep Sane, Intel
3:30PM-4:30PM Technology Provider: TSMC packaging technologies for chiplets and 3D
With the established TSMC 3DFabric™ technology platform, we continue to scale up the heterogeneous system-package envelope and to scale down the system interconnect. This system-interconnect scaling is based on the roadmap we have proposed to drive 3D interconnect density, bandwidth/latency, and energy-efficient performance (EEP). Meanwhile, we leverage our wafer-level system-integration technologies to provide innovative solutions that enhance heat dissipation as we move to liquid cooling for the system. Furthermore, we introduce a disruptive Compact Universal Photonic Engine (COUPE) for Si photonics applications to drive the system EEP of HPC and networking. Results will be shared.
Doug Yu, TSMC
4:30PM-4:45PM Break (15 min)
 
4:45PM-5:30PM Case Study: Intel products built with 2.5D and 3D packaging
Ravi Mahajan and Sandeep Sane, Intel
5:30PM-6:15PM Case Study: AMD products built with 3D packaging
With chiplet architectures becoming mainstream and recognized as fundamental to the continued, economically viable growth of power-efficient computing, advanced packaging technologies and architectures are becoming more critical to enabling Moore's Law's next frontier through heterogeneous integration. In this tutorial, we will cover the advanced package architectures AMD is enabling to deliver PPAC (power, performance, area, and cost) improvements as well as heterogeneous architectures. The direct Cu-Cu bonding technology used in AMD's 3D V-Cache architecture will be detailed and compared with industry-standard 3D architectures for PPAC benefits. Other technologies being enabled to advance high-performance computing architectures will also be previewed.
Raja Swaminathan, AMD
6:15PM-7:00PM Expert Opinion: An overview of the package technology landscape and industry deployment
The role of heterogeneous integration, especially chiplets, is pivotal in this new era of electronics packaging. There are many choices for high-performance packages, and this presentation describes the options and the advantages of each as they apply to different applications.
Jan Vardaman, TechSearch International Inc