# **Emerging Technologies and Silicon for HPC**

### (Not all Cores are Equal)

EasyBuild User Meeting, April 2023

Ian Cutress, Chief Analyst, More Than Moore

### The New HPC Era

- Types of Legacy Hardware: CPU, GPU, FPGA, ASIC
- New Paradigms: Analog, Neuro, Quantum, Optical
- The Push For Low Precision and Scale: AI Hardware

#### AI Hardware

- Established Players: NVIDIA GPU, Intel CPU
- Startup Funding: A \$10B+ investment
- Case Studies
- Roadmaps
- Software stacks: oneAPI, ROCm, vendor-specific ones

### **Q&A**

## Ian Cutress



- Chief Analyst and Founder, More Than Moore
- Online Influencer and Educator, TechTechPotato





- Senior CPU Editor, AnandTech.com (2011-2022)
- PhD, Computational Chemistry, Oxford (2011)
- MChem, Computational Chemistry, Hull (2008)

### A New HPC Era

- Types of Legacy Hardware
- New Paradigms
- Push for Low Precision
- AI Hardware
  - Established Players
  - Current \$10B Market
  - Case Studies
  - Roadmaps
  - Software



### A New HPC Era

Types of Legacy Hardware

### CPU: x86, Arm, POWER

- GPU: NVIDIA, AMD
- FPGA: Intel (Altera), AMD (Xilinx)
- ASIC: Offload

## Intel CPU





#### Super-Charged CPU

Sapphire Rapids HBM

Sampling today, production 2H'22

Up to 2.8x Perf over 3rd Gen Xeon<sup>1</sup>



### **AMD Genoa**









# Fujitsu A64FX (Arm) in Fugaku



# **TOP 500 CPU**



### A New HPC Era

Types of Legacy Hardware

### CPU: x86, Arm, POWER

- GPU: NVIDIA, AMD, Intel(?)
- FPGA: Intel (Altera), AMD (Xilinx)
- ASIC: Offload

## NVIDIA H100









## AMD MI250X







## **Intel Ponte Vecchio**





Ponte Vecchio x4 Subsystem with X<sup>e</sup> Links

+ 2S Sapphire Rapids












### **TOP 500 Accelerators**



### **TOP 500 Accelerators + Aurora**



- A New HPC Era
  - Types of Legacy Hardware
- CPU: x86, Arm, POWER
- GPU: NVIDIA, AMD
- FPGA: Intel (Altera), AMD (Xilinx)
- ASIC: Offload

### **AMD Xilinx Virtex + Versal**







# Intel Altera Agilex



Intel Agilex FPGA slide (embargo: April 2, 2019). Recoverable callouts:

- Process data: up to 40% higher performance, up to 40% lower power, up to 40 TFLOPS DSP performance (with FP16 configuration), 2nd-generation Intel Hyperflex architecture
- Store data: DDR5 and Intel Optane DC persistent memory support, HBM
- Move data: 112G transceiver data rates, Intel Xeon processor coherent connectivity and PCIe Gen5

Figures are based on Intel's estimates at the time.





#### The FPGA for the Data-Centric World

- A New HPC Era
  - Types of Legacy Hardware
- CPU: x86, Arm, POWER
- GPU: NVIDIA, AMD
- FPGA: Intel (Altera), AMD (Xilinx)
- ASIC: Offload

## SmartNICs

# 2021 STH NIC Continuum

| Category | Description |
|----------|-------------|
| Foundational NIC | Basic interface for network connectivity (popular in the 10/100/1000 and some NBASE-T NICs). Less common at 100Gbps+ speeds due to packet processing demands on host CPUs. |
| Offload NIC | Offload for common network traffic functions (e.g., TCP/IP stack, limited virtualization features). |
| SmartNIC | Offload functions with additional programmability to offload specific tasks from host systems (e.g., compression/decompression). Designed to be a more flexible and expanded offload device. |
| DPU | Extended compute, offload, memory, and OS capabilities. Designed to be an infrastructure endpoint that exposes resources to the data center and offloads key functionalities for data center scale computing (compute, storage, networking). Higher levels of compute, offload, and memory than SmartNICs. |
| Exotic | Usually FPGA-based solutions that have fully customizable pipelines, allowing for environment-specific optimization. Hardware is also generally more specific to a given deployment scenario. |

Top to bottom: increasing cost, complexity, capabilities.

# **DEShaw's Anton 3**

### **Molecular dynamics (MD) simulation**



- Understand biomolecular systems through their motions
- Numerical integration of Newton's laws of motion
  - Model atoms as point masses
  - Compute forces on every atom based on current positions
  - Update atom velocities and positions in discrete time steps of a few femtoseconds
- Force computation described by a model: the force field
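The loop described in the bullets above can be sketched in a few lines. This is a minimal illustration, assuming the velocity Verlet scheme (a common MD integrator; the slides do not say which integrator Anton uses), one coordinate per atom, reduced units, and a toy harmonic force field. The names `harmonic_force_field`, `velocity_verlet`, and `total_energy` are hypothetical.

```python
def harmonic_force_field(positions, k=1.0):
    """Toy force field (assumption): each atom is tied to the origin by a spring."""
    return [-k * x for x in positions]


def velocity_verlet(positions, velocities, masses, force_field, dt, steps):
    """Advance point masses in discrete time steps of size dt."""
    forces = force_field(positions)
    for _ in range(steps):
        # Half-step velocity update from the current forces.
        velocities = [v + 0.5 * dt * f / m
                      for v, f, m in zip(velocities, forces, masses)]
        # Full-step position update.
        positions = [x + dt * v for x, v in zip(positions, velocities)]
        # Recompute the force on every atom from the new positions.
        forces = force_field(positions)
        # Second half-step velocity update with the new forces.
        velocities = [v + 0.5 * dt * f / m
                      for v, f, m in zip(velocities, forces, masses)]
    return positions, velocities


def total_energy(positions, velocities, masses, k=1.0):
    """Kinetic plus potential energy for the toy harmonic force field."""
    kinetic = sum(0.5 * m * v * v for m, v in zip(masses, velocities))
    potential = sum(0.5 * k * x * x for x in positions)
    return kinetic + potential


masses = [1.0, 2.0]
x0, v0 = [1.0, -0.5], [0.0, 0.3]
x1, v1 = velocity_verlet(x0, v0, masses, harmonic_force_field, dt=0.01, steps=1000)
energy_drift = abs(total_energy(x1, v1, masses) - total_energy(x0, v0, masses))
```

A small `energy_drift` is a quick sanity check that the time step is short enough for the force field. In a real MD code the force evaluation dominates the cost, which is the part a machine like Anton specializes in.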

## **DEShaw's Anton 3**

#### **Baby pictures**



29 September 2020: chips arrive; MD running (water) < 9 h later

D E Shaw Research



30 September 2020: 1<sup>st</sup> protein run; faster @ 250 MHz than Anton 2



31 October 2020: Multi-node


#### Node board



#### Scale up



2×64 nodes



#### The evolution of ANTON

|                   | ANTON           | ANTON 2  | ANTON 3   |
|-------------------|-----------------|----------|-----------|
| Tape-out          | 2007            | 2012     | 2020      |
| CPU cores         | 8+4+1           | 66       | 528*      |
| PPIMs             | 32              | 76       | 528*      |
| Flex SRAM         | 0.125 MiB       | 4 MiB    | 66 MiB*   |
| Atoms / node      | 460             | 8,000    | 110,000*  |
| Clock frequency   | 0.485/0.970 GHz | 1.65 GHz | 2.8+ GHz  |
| Channel bandwidth | 0.607 Tbps      | 2.7 Tbps | 5.6+ Tbps |
| Process node      | 90 nm           | 40 nm    | 7 nm      |
| Transistors       | 0.2 G           | 2.0 G    | 31.8 G    |
| Die size          | 299 mm²         | 410 mm²  | 451 mm²   |
| Power             | 30 W            | 190 W    | 360 W     |


\* 22/24 columns


# **DEShaw's Anton 3**

### **MD performance**



### A New HPC Era

- Types of Legacy Hardware
- New Paradigms: Analog, Neuro, Quantum, Optical
- Push for Low Precision
- AI Hardware
  - Established Players
  - Current \$10B Market
  - Case Studies: Wafer Scale, Analog Edge
  - Roadmaps
  - Software



# **Analog Computing**



- Super Low Power
- Super Low Latency
- ☆ 'Any' Value Possible

- Conversion Accuracy
- Non-Linear Response
- Scaling

Recent Key Players: Mythic AI, IBM, Aspinity

# **Neuromorphic Computing**

Beware! Some companies calling themselves 'Neuromorphic' aren't actually doing it





Recent Key Players: Intel Loihi 2, SpiNNaker

## Intel Loihi 2

#### More Resources, Better Packing, Greater Density



intel labs

# **Quantum Computing**



- Physics, Chem, Bio
- Math + Encryption
- ▲ Machine Learning

- Any other math
- High Barrier to Entry
- Need a billion qubits

Recent Key Players: Intel, IBM, Google, Microsoft, Amazon, D-Wave, Alibaba, IonQ

# **Quantum Computing**

Several types of Qubits available:



|                | Ion trap                                                       | Superconducting | Semiconducting | NV-centers |
|----------------|----------------------------------------------------------------|-----------------|----------------|------------|
| Coherence time | >   s                                                          | ~ 90 μs         | ~ 28 ms        | ~ 250 ns   |
| # Qubits       | ~ 10                                                           | 17 (50)         | ~ 3            | ~ 3        |
| Materials      | <sup>9</sup>Be<sup>+</sup>, <sup>43</sup>Ca<sup>+</sup>,       | Al, Nb, TiN,    | Si, GaAs,      | C, N       |
| Scalability    | -                                                              | +               | +++            | ++         |

# **Quantum Computing**

### Error Correction and Qubit Scaling?

| Quantum error<br>correction | -                | Enabled                             | At scale                             |
|-----------------------------|------------------|-------------------------------------|--------------------------------------|
| # Physical qubits           | 10 – 100         | 100 – 1000                          | 10<sup>4</sup> – 10<sup>6</sup>      |
| # Logical qubits            | -                | 1                                   | 10 – 1000+                           |
| Logical error               | 10<sup>-3</sup>  | 10<sup>-2</sup> – 10<sup>-6</sup>   | 10<sup>-6</sup> – 10<sup>-12</sup>   |



# **Qubits Today**

#### Development Roadmap

Legend: Executed by IBM; On target

#### IBM Quantum



# **Optical Computing**



MZIs provide observation of phase shift through interference

These become useful when you modulate the phase on $\Phi_1$ and $\Phi_2$
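To make the interference argument concrete, here is a minimal sketch of an ideal Mach-Zehnder interferometer. It assumes two lossless 50:50 couplers with the symmetric transfer matrix $\frac{1}{\sqrt{2}}\begin{pmatrix}1 & i\\ i & 1\end{pmatrix}$ and light entering only the first input port; the function name `mzi_output_powers` is hypothetical and nothing here is specific to any vendor's hardware.

```python
import cmath
import math


def mzi_output_powers(phi_1, phi_2):
    """Output powers of an ideal MZI with unit power on the first input port."""
    s = 1 / math.sqrt(2)
    # First 50:50 coupler splits the input amplitudes [1, 0] across both arms.
    upper, lower = s * 1, s * 1j
    # Phase shifters: phi_1 on the upper arm, phi_2 on the lower arm.
    upper *= cmath.exp(1j * phi_1)
    lower *= cmath.exp(1j * phi_2)
    # Second 50:50 coupler recombines the arms so they interfere.
    out_bar = s * (upper + 1j * lower)
    out_cross = s * (1j * upper + lower)
    return abs(out_bar) ** 2, abs(out_cross) ** 2


balanced = mzi_output_powers(0.0, 0.0)          # no phase difference
flipped = mzi_output_powers(math.pi, 0.0)       # phase difference of pi
halfway = mzi_output_powers(math.pi / 2, 0.0)   # phase difference of pi/2
```

Under these assumptions the two output powers follow $\sin^2(\Delta\Phi/2)$ and $\cos^2(\Delta\Phi/2)$ with $\Delta\Phi = \Phi_1 - \Phi_2$, so setting the phases steers light between the ports. That tunable split is what lets a mesh of MZIs act as programmable weights.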



# **Optical Computing**



- No Power
- Speed-of-light fast
- Manufacturing

Recent Key Players: LightMatter, Lightelligence

# **Optical Computing**

### The future of artificial intelligence.

- 12LP ASIC
- 90WG PIC
- 64x64 matrix
- 1GHz vector rate
- 8-bit signed operands
- 200ps latency
- 150 mm²

Faster, lower energy and decoupled from Moore's Law.



# **Optical computing.**

### A New HPC Era

- Types of Legacy Hardware
- New Paradigms: Analog, Neuro, Quantum, Optical
- Push for Low Precision

#### AI Hardware

- Established Players
- Current \$10B Market
- Case Studies: Wafer Scale, Analog Edge
- Roadmaps
- Software



### Quantization

Using fewer bits saves power and, with the right architecture, runs faster. The tradeoff is range and accuracy.

FP16 vs FP32
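The range and accuracy tradeoff can be checked with nothing more than Python's standard library. The sketch below rounds a value to IEEE 754 FP16 and FP32 through `struct` (format codes `e` and `f`); the helper names `round_to`, `to_fp16`, `to_fp32`, and `fits` are hypothetical.

```python
import struct


def round_to(value, fmt):
    """Round a Python float (FP64) to the nearest value in a struct format."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def to_fp16(value):
    return round_to(value, "<e")  # IEEE 754 binary16


def to_fp32(value):
    return round_to(value, "<f")  # IEEE 754 binary32


def fits(value, converter):
    """True if the value is inside the finite range of the target format."""
    try:
        converter(value)
        return True
    except OverflowError:
        # struct refuses to pack values beyond the format's finite range.
        return False


x = 0.1
fp16_error = abs(to_fp16(x) - x)  # accuracy lost going to 16 bits
fp32_error = abs(to_fp32(x) - x)  # accuracy lost going to 32 bits

fp16_holds_70000 = fits(70000.0, to_fp16)  # beyond FP16's largest value, 65504
fp32_holds_70000 = fits(70000.0, to_fp32)
```

The FP16 error on 0.1 is several orders of magnitude larger than the FP32 error, and 70000 does not fit in FP16 at all: both the accuracy and the range side of the tradeoff show up even in this tiny example.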



### How numbers are represented matters:

|              | Sign | Exponent (5 bits) | Mantissa (10 bits)   |
|--------------|------|-------------------|----------------------|
| FP16 (E5M10) | S    | 5 4 3 2 1         | 10 9 8 7 6 5 4 3 2 1 |

$$\text{Value} = (-1)^{S} \cdot \text{mantissa} \cdot 2^{\text{exponent}}$$

IEEE754 defines number standards - what to do with infinities, sub-normal values, etc. Lots of chip companies are now defining their own number types to improve performance.
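As a worked example of the value formula, the sketch below decodes a raw 16-bit pattern by hand. It follows IEEE 754 binary16 conventions, which refine the simplified formula on the slide: the stored exponent carries a bias of 15, normal numbers have an implicit leading 1 in the mantissa, and the all-zeros and all-ones exponent codes are reserved for subnormals and infinities/NaN. The function names are hypothetical.

```python
import struct


def decode_fp16(bits):
    """Decode a 16-bit pattern as IEEE 754 binary16 (E5M10)."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits, stored with a bias of 15
    fraction = bits & 0x3FF          # 10 stored mantissa bits

    if exponent == 0x1F:             # all ones: infinity or NaN
        if fraction:
            return float("nan")
        return float("-inf") if sign else float("inf")
    if exponent == 0:                # subnormal: no implicit leading 1
        magnitude = (fraction / 1024) * 2.0 ** -14
    else:                            # normal: implicit leading 1
        magnitude = (1 + fraction / 1024) * 2.0 ** (exponent - 15)
    return -magnitude if sign else magnitude


def reference_fp16(bits):
    """Same decoding via the standard library, used as a cross-check."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


one = decode_fp16(0x3C00)       # sign 0, exponent 15, fraction 0
largest = decode_fp16(0x7BFF)   # largest finite FP16 value
```

Custom vendor formats typically change exactly these conventions (bias, reserved codes, subnormal handling), which is why the same bit pattern can mean different things on different chips.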





| FP32             | E8M23 | S | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
|------------------|-------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|----|----|----|----|----|----|----|---|---|---|---|---|---|---|---|---|
| TF32             | E8M10 | S | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 10 | 9  | 8  | 7  | 6  | 5  | 4  | 3  | 2  | 1  |    |    |    |    |   |   |   |   |   |   |   |   |   |
| FP16             | E5M10 |   |   |   | S | 5 | 4 | 3 | 2 | 1 | 10 | 9  | 8  | 7  | 6  | 5  | 4  | 3  | 2  | 1  |    |    |    |    |   |   |   |   |   |   |   |   |   |
| BF16             | E8M7  | S | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 7  | 6  | 5  | 4  | 3  | 2  | 1  |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |
| FP12             | E5M6  |   |   |   | S | 5 | 4 | 3 | 2 | 1 | 6  | 5  | 4  | 3  | 2  | 1  |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |
| FP8              | E3M4  |   |   |   |   |   | S | 3 | 2 | 1 | 4  | 3  | 2  | 1  |    |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |
| CFP8 (Precision) | E4M3  |   |   |   |   | S | 4 | 3 | 2 | 1 | 3  | 2  | 1  |    |    |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |
| CFP8 (Range)     | E5M2  |   |   |   | S | 5 | 4 | 3 | 2 | 1 | 2  | 1  |    |    |    |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |
| CFP8 (Other)     | E2M5  |   |   |   |   |   |   | S | 2 | 1 | 5  | 4  | 3  | 2  | 1  |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |
| FP24             | E7M16 |   | S | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9  | 8  | 7  | 6  | 5  | 4  | 3  | 2 | 1 |   |   |   |   |   |   |   |
| FP21             | E8M12 | S | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 12 | 11 | 10 | 9  | 8  | 7  | 6  | 5  | 4  | 3  | 2  | 1  |    |    |   |   |   |   |   |   |   |   |   |
| MSFP12           | E3M0  | S | 3 | 2 | 1 |   | S | 3 | 2 | 1 |    |    | S  | 3  | 2  | 1  |    | 8  | 7  | 6  | 5  | 4  | 3  | 2  | 1 |   |   |   |   |   |   |   |   |
|                  | +M8   | S | 3 | 2 | 1 |   | S | 3 | 2 | 1 |    |    | S  | 3  | 2  | 1  |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |
|                  |       | S | 3 | 2 | 1 |   | S | 3 | 2 | 1 |    |    | S  | 3  | 2  | 1  |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |
|                  |       | S | 3 | 2 | 1 |   | S | 3 | 2 | 1 |    |    | S  | 3  | 2  | 1  |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |
| FP4 (IBM)        | E3M0  | S | 3 | 2 | 1 |   |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |
| FP2              | E1M0  | S | 1 |   |   |   |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |
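The ExMy labels in the table translate directly into range and precision. The sketch below computes both for a few rows, assuming IEEE-style conventions throughout (bias of 2^(E-1) - 1, implicit leading 1, top exponent code reserved for infinities and NaN). That is an assumption: several of the custom formats in the table deliberately deviate from it, so treat the numbers for those rows as indicative only. `format_properties` and `FORMATS` are hypothetical names.

```python
def format_properties(exponent_bits, mantissa_bits):
    """Range and precision of an ExMy format under IEEE-style conventions."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # assumes the all-ones exponent code is reserved
    return {
        # Largest finite value: mantissa of all ones at the largest exponent.
        "largest": (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent,
        # Smallest positive normal value.
        "smallest_normal": 2.0 ** (1 - bias),
        # Gap between 1.0 and the next representable value.
        "epsilon": 2.0 ** -mantissa_bits,
    }


# (exponent bits, mantissa bits) taken from the table above.
FORMATS = {
    "FP32": (8, 23),
    "TF32": (8, 10),
    "FP16": (5, 10),
    "BF16": (8, 7),
    "CFP8 (Precision)": (4, 3),
    "CFP8 (Range)": (5, 2),
}

properties = {name: format_properties(e, m) for name, (e, m) in FORMATS.items()}
```

Reading the results side by side shows the design choices: BF16 and TF32 keep FP32's exponent width and therefore its range, FP16 and TF32 share the same 10-bit mantissa and therefore the same precision, and the 8-bit variants trade one for the other.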

#### **IBM Research is Leading in Reduced Precision Scaling**



### **Quantization in HPC**

### But all HPC runs are FP64 or FP32, right?



Source: NextPlatform - FP16 on Climate and Weather on Isembard (Bristol)

### **Quantization in HPC**


# Perhaps the solution is mixed precision: FP64 when you need it, FP16 when you don't

Conference programme excerpt (screenshot) showing sessions tagged Mixed Precision Algorithms:

- Research Paper: A Mixed Precision Randomized Preconditioner for the LSQR Solver on GPUs (Vasileios Georgiou)
- Research Paper: GPU-based Low-precision Detection Approach for Massive MIMO Systems (Adel Dabah, KAUST)
- Focus Session: a multi-precision benchmark talk (Piotr Luszczek, University of Tennessee)
- Focus Session: Q&A: Mixed Precision Algorithms (Hatem Ltaief, KAUST; Piotr Luszczek, University of Tennessee)
- Tutorial (Jack Dongarra, Hartwig Anzt, and Piotr Luszczek, University of Tennessee)
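One simple form of mixed precision is to keep the operands in FP16 but accumulate in a wider type. The sketch below contrasts that with accumulating in FP16 as well, using only the standard library (`struct` format `e` rounds to IEEE 754 FP16; Python floats are FP64). The function names are hypothetical, and this illustrates the idea only, not any of the algorithms from the sessions listed above.

```python
import struct


def to_fp16(value):
    """Round a Python float (FP64) to the nearest IEEE 754 FP16 value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def dot_fp16_accumulate(a, b):
    """Dot product with FP16 operands and an FP16 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + product)  # running sum is rounded back to FP16
    return acc


def dot_mixed(a, b):
    """Dot product with FP16 operands and an FP64 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)  # running sum stays in FP64
    return acc


ones = [1.0] * 4096
low = dot_fp16_accumulate(ones, ones)  # stalls: 2049 is not representable in FP16
mixed = dot_mixed(ones, ones)          # exact
```

With an FP16 accumulator the sum stops growing at 2048, because FP16 has an 11-bit significand and 2048 + 1 rounds back to 2048. Keeping only the accumulator in higher precision recovers the exact answer of 4096 while the data itself stays in 16 bits.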

### **Quantization in HPC**









### **Push for Quantization**



Source: Dylan Patel, ISSCC

### **Push for Quantization**

### Push for reduced precision comes from AI

|                           | Peak Performance                             |
|---------------------------|----------------------------------------------|
| Transistor Count          | 54 billion                                   |
| Die Size                  | 826 mm <sup>2</sup>                          |
| FP64 CUDA Cores           | 3,456                                        |
| FP32 CUDA Cores           | 6,912                                        |
| Tensor Cores              | 432                                          |
| Streaming Multiprocessors | 108                                          |
| FP64                      | 9.7 teraFLOPS                                |
| FP64 Tensor Core          | 19.5 teraFLOPS                               |
| FP32                      | 19.5 teraFLOPS                               |
| TF32 Tensor Core          | 156 teraFLOPS   312 teraFLOPS*               |
| BFLOAT16 Tensor Core      | 312 teraFLOPS   624 teraFLOPS*               |
| FP16 Tensor Core          | 312 teraFLOPS   624 teraFLOPS*               |
| INT8 Tensor Core          | 624 TOPS   1,248 TOPS*                       |
| INT4 Tensor Core          | 1,248 TOPS   2,496 TOPS*                     |
| GPU Memory                | 40 GB                                        |
| GPU Memory Bandwidth      | 1.6 TB/s                                     |
| Interconnect              | NVLink 600 GB/s<br>PCIe Gen4 64 GB/s         |
| Multi-Instance GPUs       | Various Instance sizes with up to 7MIGs @5GB |
| Form Factor               | 4/8 SXM GPUs in HGX A100                     |
| Max Power                 | 400W (SXM)                                   |
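The gap between precisions is easier to see as ratios. The short sketch below takes the dense peak figures from the table above and normalizes them to plain FP64; the starred figures, which NVIDIA quotes for structured sparsity, are left out. `A100_PEAK_DENSE` and `speedup_over_fp64` are hypothetical names, and integer TOPS are compared with floating-point TFLOPS purely as operation counts.

```python
# Dense peak throughput from the table, in tera-operations per second.
A100_PEAK_DENSE = {
    "FP64": 9.7,
    "FP64 Tensor Core": 19.5,
    "FP32": 19.5,
    "TF32 Tensor Core": 156.0,
    "BFLOAT16 Tensor Core": 312.0,
    "FP16 Tensor Core": 312.0,
    "INT8 Tensor Core": 624.0,
    "INT4 Tensor Core": 1248.0,
}


def speedup_over_fp64(table):
    """Peak throughput of each mode relative to plain FP64."""
    baseline = table["FP64"]
    return {name: peak / baseline for name, peak in table.items()}


speedups = speedup_over_fp64(A100_PEAK_DENSE)
for name, ratio in speedups.items():
    print(f"{name:22s} {ratio:6.1f}x")
```

On paper, moving from FP64 to FP16 Tensor Cores is worth roughly a factor of 32 in peak throughput on this part, which is the economic pressure behind the push for reduced precision.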

| NVIDIA Architecture | CUDA Cores FP64 | CUDA Cores FP32 | CUDA Cores FP16 | CUDA Cores INT8 | Tensor Cores FP64 | Tensor Cores TF32 | Tensor Cores FP16 | Tensor Cores INT8 | Tensor Cores INT4 | Tensor Cores INT1 |
|---------------------|------|------|------|------|------|------|------|------|------|-------|
| Volta               | 32   | 64   | 128  | 256  |      |      | 512  |      |      |       |
| Turing              | 2    | 64   | 128  | 256  |      |      | 512  | 1024 | 2048 | 8192  |
| Ampere (A100)       | 32   | 64   | 256  | 256  | 64   | 512  | 1024 | 2048 | 4096 | 16384 |
| Ampere, sparse      |      |      |      |      |      | 1024 | 2048 | 4096 | 8192 |       |



Highest Throughput for Scale-up Servers





| # of CUDA Cores       | 3840                                                  |
|-----------------------|-------------------------------------------------------|
| Peak Single Precision | 12 TeraFLOPS                                          |
| Peak INT8             | 47 TOPS                                               |
| Low Precision         | 4x 8-bit vector dot product<br>with 32-bit accumulate |
| Video Engines         | 1x decode engine, 2x encode engines                   |
| GDDR5 Memory          | 24 GB @ 346 GB/s                                      |
| Power                 | 250W                                                  |
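The "4x 8-bit vector dot product with 32-bit accumulate" entry describes the basic building block of INT8 inference. The sketch below emulates it in plain Python, assuming simple symmetric per-vector quantization to the range [-127, 127]; the scheme and the names `quantize_int8`, `dot4_accumulate`, and `int8_dot` are illustrative assumptions, not NVIDIA's implementation.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dot4_accumulate(a4, b4, acc):
    """Four 8-bit by 8-bit products added to a wider integer accumulator."""
    return acc + sum(x * y for x, y in zip(a4, b4))


def int8_dot(a, b):
    """Approximate a float dot product using INT8 operands."""
    qa, scale_a = quantize_int8(a)
    qb, scale_b = quantize_int8(b)
    acc = 0  # stands in for the 32-bit accumulator
    for i in range(0, len(qa), 4):
        acc = dot4_accumulate(qa[i:i + 4], qb[i:i + 4], acc)
    # Undo both scales once, after all integer arithmetic is done.
    return acc * scale_a * scale_b


a = [0.5, -1.0, 0.25, 0.75, 1.0, -0.5, 0.125, 0.0]
b = [1.0, 0.5, -0.25, 0.5, -1.0, 0.75, 0.5, 0.25]
exact = sum(x * y for x, y in zip(a, b))
approx = int8_dot(a, b)
```

The wide accumulator matters: each 8-bit product can need up to 16 bits, and summing many of them would overflow an 8-bit or 16-bit register long before a 32-bit one.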

GoogLeNet, AlexNet, batch size = 128, CPU: Dual Socket Intel E5-2697v4

Source: NVIDIA A100

### Silicon or Survive

#### A New HPC Era

- Types of Legacy Hardware
- New Paradigms: Analog, Neuro, Quantum, Optical
- Push for Low Precision

#### AI Hardware

- Established Players
- Current \$10B Market
- Case Studies: Wafer Scale, Analog Edge
- Roadmaps
- Software



## **AI Hardware Established Players**



### AI Hardware Investment

You can buy most of that hardware or rent it in the cloud. But 50+ startups are creating AI hardware, some of which is already in HPC.

### This market is \$10B+



# Google TPU v2

### TPUv2 Chip



- 16 GB of HBM
- 600 GB/s mem BW
- Scalar/vector units: 32b float
- MXU: 32b float accumulation but reduced precision for multipliers
- 45 TFLOPS
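
The two headline numbers on this slide (45 TFLOPS and 600 GB/s) are enough for a back-of-the-envelope roofline estimate. The sketch below is a generic roofline calculation, not something taken from Google's material; the function names are made up for illustration.

```python
PEAK_FLOPS = 45e12  # 45 TFLOPS, from the slide
MEM_BW = 600e9      # 600 GB/s HBM bandwidth, from the slide


def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity (FLOP per byte) at which compute becomes the limit."""
    return peak_flops / mem_bw


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     mem_bw: float = MEM_BW) -> float:
    """Simple roofline: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * mem_bw)


print(f"ridge point: {ridge_point(PEAK_FLOPS, MEM_BW):.0f} FLOP/byte")
for intensity in (1.0, 10.0, 75.0, 200.0):
    tflops = attainable_flops(intensity) / 1e12
    print(f"{intensity:6.1f} FLOP/byte -> {tflops:5.1f} TFLOPS attainable")
```

On these figures a kernel needs about 75 floating-point operations per byte moved from HBM before the chip becomes compute-bound, which is one way to see why such devices target dense matrix multiplication.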



# Google TPU (v2)



Google-designed device for neural net training and inference



### SambaNova Cardinal



#### SambaNova Systems® Cardinal SN10 RDU

- First Reconfigurable Dataflow Unit (RDU)
- TSMC 7nm
  - Taped out first half of 2019
  - 40B transistors, 50 km of wire
- 640 Pattern Compute Units
  - >300 BF16 TFLOPs
  - BF16 with FP32 accumulation, stochastic rounding
  - Also supports FP32, Int32, Int16, Int8 data formats
- 640 Pattern Memory Units
  - >300 MB on-chip memory
  - 150 TB/s on-chip memory bandwidth
  - Memory transformation operations
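
The slide mentions BF16 with stochastic rounding. The sketch below shows one common way to define that operation on an FP32 value, by manipulating its bit pattern. It is an illustration of the general technique, not SambaNova's implementation; the helper names are hypothetical and only finite inputs are considered.

```python
import random
import struct


def f32_to_bits(x: float) -> int:
    """Bit pattern of x after conversion to IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bf16_stochastic(x: float, rng=random) -> float:
    """Round an FP32 value to a BF16-representable value.

    BF16 keeps the top 16 bits of the FP32 pattern. The discarded low 16 bits
    give the probability (low / 65536) of stepping to the next value away
    from zero, so the result is unbiased on average.
    """
    bits = f32_to_bits(x)
    low = bits & 0xFFFF
    kept = bits & 0xFFFF0000
    if rng.randrange(1 << 16) < low:
        kept = (kept + 0x10000) & 0xFFFFFFFF
    return bits_to_f32(kept)


rng = random.Random(0)
x = 1.00390625  # exactly halfway between the BF16 neighbours 1.0 and 1.0078125
samples = [to_bf16_stochastic(x, rng) for _ in range(10_000)]
print(sorted(set(samples)), sum(samples) / len(samples))
```

Each individual result is one of the two neighbouring BF16 values, but the mean over many roundings stays close to the original, which is the property that makes stochastic rounding attractive for long accumulations in training.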

©2021 SambaNova Systems





### CGRA



Egger et al., Auto-Tuning CNNs for Coarse-Grained Reconfigurable Array-Based Accelerators, July 2018

### CGRA



Liu et al., A Survey of Coarse-Grained Reconfigurable Architecture and Design, October 2019

### SambaNova Cardinal SN10

#### Cardinal SN10: Tile





### SambaNova Cardinal SN10

### Dataflow Architecture for Terabyte Sized Models



### SambaNova

# SambaNova system added to Fugaku supercomputer to boost AI performance



World's second fastest supercomputer gets a boost

March 06, 2023 By: Sebastian Moss



AI hardware-as-a-service company SambaNova Systems will deploy one of its supercomputers at the RIKEN Center for Computational Science (R-CCS) in Japan.

The DataScale system will be paired with Fugaku, the world's second fastest supercomputer.

"The provision of SambaNova's system resources to R-CCS provides a new option for accelerating the integration of HPC simulations and AI with Fugaku," Professor Satoshi Matsuoka, director of R-CCS, said.

"The new SambaNova system will boost research into the convergence of HPC and AI, including ultra-high-resolution computer vision for building a digital twin for the Society 5.0 era."



R-CCS researchers will use DataScale to develop surrogate models to improve the accuracy of ultra-high-resolution 3D computer vision, including for the inspection of social infrastructure such as highways, and to process ultra-high-resolution image datasets.

While Fugaku uses 152,064 Fujitsu's 48-core Arm-based A64FX processor chips, SambaNova has developed its own Reconfigurable Dataflow Unit (RDU) chip, which is only available within the wider DataScale package. The company claims that developing a single hardware and software package optimizes it for AI workloads.

It previously paired a DataScale system with Lawrence Livermore National Laboratory's Corona supercomputer, and another one of its systems is being tested by Argonne National Laboratory.



#### Source: DataCenter Dynamics





#### Cerebras WSE-2: The Largest Chip Ever Built

- 46,225 mm<sup>2</sup> silicon
- 2.6 trillion transistors
- 850,000 AI optimized cores
- 40 Gigabytes on chip memory
- 20 Petabytes/s memory bandwidth
- 220 Petabits/s fabric bandwidth
- TSMC 7nm









#### **PSC**

#### New Cerebras Systems technology will double capacity, allow larger deep-learning models and data

The Neocortex high performance artificial intelligence (AI) computer at PSC has been upgraded with two new Cerebras CS-2 systems, powered by the second-generation wafer-scale engine (WSE-2) processor. The WSE-2 doubles the system's cores and on-chip memory and offers a new execution mode with even greater advantages for extreme-scale deep-learning tasks, enabling faster training, larger models and larger input data.

Neocortex, funded with \$11.25 million from the National Science Foundation to date, is supported under the NSF's Innovative HPC Program, meant to further the field of high performance computing (HPC) by funding new technologies with innovative approaches. The system now features a groundbreaking integration of two WSE-2's – an improved new technology that accelerates deep-learning AI with a unique chip architecture – with a powerful HPE Superdome Flex Server. By pairing the robustly provisioned HPE Superdome Flex server for massive data handling capability with the two WSE-2's, the system has unlocked new potential for rapidly training AI systems capable of learning from vast data sources.

"We are extremely excited to welcome the CS-2 servers into Neocortex," said Paola Buitrago, principal investigator of Neocortex and Director, Artificial Intelligence & Big Data at PSC. "This upgrade enhances support for new models, algorithms and research opportunities. We look forward to the breakthroughs that the now even greater capabilities of Neocortex would enable. We will continue working with the research community to help them take advantage of this technology that is orders of magnitude more powerful."

The CS-2 is based on the innovative WSE. The WSE-2 is the largest chip in existence and is the fastest AI processor. Whereas traditional processors are the size of postage stamps, the WSE-2 is the size of a dinner plate. In AI, big chips process information more quickly, producing answers in less time.

In deep learning, an AI program represents characteristics of a computational problem as layers, connected with each other by lines of inference. The AI first trains on data in which humans have labeled the "right answers," pruning or strengthening inference connections until it is predicting correctly. The researchers then test the AI against a dataset without such labels, to grade its performance. Finally, once the AI is performing adequately, it can be set to the task it has been designed to address.

The two-dimensional grid of cores on the WSE-2 allows the system to route machine-learning tasks in physical space, essentially reproducing the layers of a deep-learning algorithm on different parts of the chip. By leveraging a 7-nm fabrication process, the CS-2 improves upon the CS-1's capabilities by expanding the number of cores from 400,000 to 850,000 and on-chip memory from 18 GB to 40 GB. The CS-2 does this with the same footprint, power, and cooling requirements as the CS-1.





Simulation performed on a single Cerebras CS-2 within the Neocortex system at the Pittsburgh Supercomputing Center.

- 736 x 896 x 300 cells (198 million)
- Fluid volume of 23 x 28 x 9.4 meters
- Video playback rate is approximately at actual solution speed
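
The grid figures in the caption can be checked with a few lines of arithmetic. The sketch below assumes a uniform grid, which the caption does not state; variable names are illustrative.

```python
# Grid dimensions and physical extent, copied from the caption
nx, ny, nz = 736, 896, 300
lx, ly, lz = 23.0, 28.0, 9.4  # metres

cells = nx * ny * nz
# Assumption: cells are evenly spaced along each axis
dx, dy, dz = lx / nx, ly / ny, lz / nz

print(f"total cells: {cells:,} (~{cells / 1e6:.0f} million)")
print(f"cell size: {dx * 100:.2f} x {dy * 100:.2f} x {dz * 100:.2f} cm")
```

Under that assumption the cells are close to cubes roughly 3.1 cm on a side, and the product of the three dimensions matches the quoted 198 million.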



### Core Design



#### Efficient small core design

- 228µm x 170µm core area
- TSMC N7

#### Balanced logic and memory

- 50:50 logic to SRAM area ratio
- 110,000 logic standard cells
- 48kB high density SRAM memory

#### Power efficient design point

- 1.1GHz clock frequency
- 30mW peak power
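
The per-core numbers on this slide can be cross-checked against the whole-wafer WSE-2 figures quoted earlier in the deck (850,000 cores, 46,225 mm² of silicon, 40 GB of on-chip memory). The sketch below is simple arithmetic; it treats 1 kB as 1000 bytes, and the power total is only an upper bound that assumes every core hits its peak at once.

```python
# Whole-wafer figures from the WSE-2 slide
CORES = 850_000
DIE_AREA_MM2 = 46_225
ON_CHIP_MEMORY_GB = 40

# Per-core figures from the Core Design slide
CORE_W_UM, CORE_H_UM = 228, 170
SRAM_PER_CORE_KB = 48
PEAK_POWER_PER_CORE_MW = 30

core_area_mm2 = CORE_W_UM * CORE_H_UM / 1e6        # um^2 -> mm^2
total_core_area_mm2 = core_area_mm2 * CORES
core_area_fraction = total_core_area_mm2 / DIE_AREA_MM2

total_sram_gb = SRAM_PER_CORE_KB * CORES / 1e6     # kB -> GB (decimal units)
peak_core_power_kw = PEAK_POWER_PER_CORE_MW * CORES / 1e6  # mW -> kW

print(f"area covered by cores: {total_core_area_mm2:,.0f} mm^2 "
      f"({core_area_fraction:.0%} of the wafer-scale die)")
print(f"total SRAM: {total_sram_gb:.1f} GB (slide quotes {ON_CHIP_MEMORY_GB} GB)")
print(f"all cores at peak: {peak_core_power_kw:.1f} kW")
```

The SRAM total lands at about 40.8 GB, in line with the 40 GB headline figure, and the cores account for roughly 71% of the silicon area, with the remainder presumably going to fabric, I/O and redundancy.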





### Tenstorrent





### Silicon or Survive

#### A New HPC Era

- Types of Legacy Hardware
- New Paradigms: Analog, Neuro, Quantum, Optical
- Push for Low Precision
- AI Hardware
  - Established Players
  - Current \$10B Market
  - Case Studies: Wafer Scale, Analog Edge
  - Roadmaps
  - Software



## Intel DCAI

## Continuous, visible data points will provide confidence that we are **rebuilding our execution engine**



## NVIDIA

#### GRACE HOPPER SUPERCHIP

For Giant-Scale AI and HPC



- CPU+GPU Designed For Giant-Scale AI and HPC
- 600GB Memory GPU for Giant Models
- New 900 GB/s Coherent Interface
- 30X Higher System Memory B/W to GPU in a Server
- Runs NVIDIA Computing Stacks
- Available 1H 2023

#### GRACE CPU SUPERCHIP

For HPC and AI Infrastructure



- High Performance CPU for HPC and AI
- 144 Cores | 740 SPECrate®2017\_int\_base est.
- First LPDDR5X Memory With ECC, 1 TB/s Memory Bandwidth
- 2X Perf/Watt Over Traditional Servers
- Runs NVIDIA Computing Stacks
- Available 1H 2023



## **AMD CPU**





## **AMD GPU**



AMD Instinct<sup>™</sup> MI100

#### **Ecosystem Growth**

First purpose-built GPU architecture for the data center



AMD Instinct<sup>™</sup> MI200 AMD CDNA<sup>™</sup> 2

#### Driving HPC and AI to a New Frontier

First multi-die data center GPU expands scientific discovery and brings choice to AI training

AMD Instinct<sup>™</sup> MI300

#### **Data Center APU**

Breakthrough architecture designed for leadership efficiency and performance for HPC and AI





The world's first integrated data center CPU + GPU

#### AMD Instinct™ MI300

Breakthrough architecture to power the exascale AI era







### Arm

### **Rapid Pace of Innovation**



### Tenstorrent



# Software: NVIDIA CUDA 12.1.0



Full support for x86, ARM, even CPUs and Windows

CUDA Math Libraries, cuDNN, cuFFT, cuBLAS, NAMD, CFD

Multi-user virtualization support for HPC + Cloud

# Software: AMD ROCm 5.4



https://docs.amd.com/

- Translates CUDA code to AMD via hipify
- OpenMP/OpenCL support, Linux (RHEL/Ubuntu), ESXi 7/8
- Extensive release notes with each version
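
To give a feel for what "translating CUDA code via hipify" means at the source level, here is a toy renamer. It is only a sketch of the idea: the real HIPIFY tools (`hipify-perl` and `hipify-clang`) cover far more of the API and handle many cases this does not. The mapping table is a small hand-picked subset and `toy_hipify` is a hypothetical name.

```python
import re

# A few CUDA runtime names and their HIP equivalents (hand-picked subset)
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Longest names first, with word boundaries, so that cudaMemcpy never matches
# inside cudaMemcpyHostToDevice.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")\b"
)


def toy_hipify(source: str) -> str:
    """Replace known CUDA identifiers with HIP ones; leave everything else alone."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)


cuda_snippet = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
cudaFree(d_x);
"""
print(toy_hipify(cuda_snippet))
```

Because much of the HIP runtime API mirrors CUDA name for name, a large share of a port is mechanical renaming of this kind; the harder work tends to be in libraries and performance tuning.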

# Software: Intel oneAPI

### oneAPI Industry Initiative

#### Specifies:

- A standards-based cross-architecture language, DPC++, based on C++ and SYCL
- Powerful APIs designed for acceleration of key domain-focused functions
- Low-level hardware interface to provide a hardware abstraction layer to vendors

Open standard to promote community and industry support

Enables code reuse across architectures and vendors

Available now



# Software: Intel oneAPI



- Attempt to unify industry around DPC++ / SYCL
- Write once, compile often? Hardware agnostic?
- A proper ground-up redesign of Intel's software support

## Software: Intel oneAPI

#### Intel® oneAPI Base & HPC Toolkit

#### **Direct Programming**

Intel<sup>®</sup> C++ Compiler Classic

Intel<sup>®</sup> Fortran Compiler (Beta)

Intel<sup>®</sup> Fortran Compiler Classic

Intel® oneAPI DPC++/C++ Compiler

Intel® DPC++ Compatibility Tool

Intel® Distribution for Python\*

Intel® FPGA Add-On for oneAPI Base Toolkit

#### API-Based Programming

Intel® MPI Library

Intel® oneAPI DPC++ Library

Intel® oneAPI Math Kernel Library

Intel<sup>®</sup> oneAPI Data Analytics Library

Intel® oneAPI Threading Building Blocks

Intel® oneAPI Video Processing Library

Intel® oneAPI Collective Communications Library

Intel<sup>®</sup> oneAPI Deep Neural Network Library

Intel® Integrated Performance Primitives

#### Analysis & Debug Tools

Intel<sup>®</sup> Inspector

Intel® Trace Analyzer & Collector

Intel<sup>®</sup> Cluster Checker

Intel<sup>®</sup> VTune<sup>™</sup> Profiler

Intel<sup>®</sup> Advisor

Intel® Distribution for GDB\*

\*Other names and brands may be claimed as the property of others.

# What I didn't cover in this talk

- Interconnect
  - InfiniBand vs Ethernet vs GPU-to-GPU
  - Integrated Optics/Photonics
  - New paradigms like CXL
- Storage
  - Rise and fall of Intel/Micron's Optane
  - HBM vs DDR vs Compute-in-Memory
- China
  - Did they really have the first exaflop supercomputer?

## The AI Hardware Show



Source: Spotify, youtube.com/techtechpotato

### Silicon or Survive

### A New HPC Era

- Types of Legacy Hardware
- New Paradigms: Analog, Neuro, Quantum, Optical
- Push for Low Precision
- AI Hardware
  - Established Players
  - Current \$10B Market
  - Case Studies: Wafer Scale, Analog Edge
  - Roadmaps
  - Software
- Q&A

## **TechTechPotato Mugs**





#### http://merch.techtechpotato.com

First 50 orders 20% OFF with code **EUM23**