

## Materials Engineering for AI Compute

Tony Chiang, Ph.D., VP/CTO, New Markets and Alliances Group, Applied Materials

NICE Workshop, March 28, 2019



# World's #1 semiconductor and display systems company



**\$17.3 billion** revenue







Applied Materials is the leader in materials engineering solutions used to produce virtually every new chip in the world

Data as of fiscal year end, October 28, 2018



## Applied's Materials Engineering Enabling the Semiconductor Roadmap




## A.I. Is Re-Shaping the Environment





## **Performance Improvements Are Slowing...**

#### Moore's Law: Projection Held for 40 Years...

Recent data points suggest ~2x more every 5 years



#### **Classic 2D Feature Scaling Slowing**

SOURCE: University of Wisconsin

#### PERFORMANCE IMPROVEMENTS OVER TIME (VS. VAX-11/780)



SOURCE: Computer Architecture: A Quantitative Approach, Sixth Edition, John Hennessy and David Patterson, December 2017





## **Data Explosion + Rise of A.I. = Heterogeneous Computing**



#### Heterogeneous Compute Era



Domain Specific Architectures:

System level optimization meeting performance, compute and area efficiency goals for targeted workloads



## New Playbook Needed for Connectivity & Speed...

...to address industry challenges: complexity $\uparrow$, integration challenges $\uparrow\uparrow$, time to market $\uparrow\uparrow\uparrow$

Serial mindset

VS.

**Connected** mindset



Integration, Equipment, Design, Materials

**OPPORTUNITY:** Parallel development to accelerate innovation

#### **Connectivity to Accelerate Innovation**



#### **Foundation is Materials Engineering**



## **Key Challenges & Mitigation Strategies in AI**

#### **Key Challenges**

- Volumes of data and sizes of models are exploding
- Longer training times
- Larger model $\rightarrow$ more memory references $\rightarrow$ more energy
- Power use is not scalable -- energy efficiency issue

#### **Mitigation Strategies**

- Algorithm and Hardware Co-design
  - Customize/optimize per workload type
  - Minimize data moves
  - Move memory closer to computation
- New Devices
  - Integrate computation into the memory (analog compute)
  - New compute paradigms: quantum, synaptic
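The chain "larger model → more memory references → more energy" can be illustrated with a rough estimate. The sketch below is a back-of-the-envelope calculation, not a measurement: the per-operation energies are order-of-magnitude values often quoted for a roughly 45 nm process and are assumptions here, not figures from this deck. The function name, the dictionary keys, and the worst-case assumption that every MAC fetches one 32-bit weight with no reuse are likewise hypothetical.

```python
# Rough energy estimate: arithmetic vs. fetching weights from memory.
# ASSUMPTION: the per-operation energies below are illustrative
# order-of-magnitude values in picojoules; they are not taken from this deck.
ENERGY_PJ = {
    "mac_fp32": 4.6,         # assumed: 32-bit float multiply + add
    "sram_read_32b": 5.0,    # assumed: 32-bit read from a small on-chip SRAM
    "dram_read_32b": 640.0,  # assumed: 32-bit read from off-chip DRAM
}


def inference_energy_mj(num_macs, weight_source):
    """Energy in mJ if every MAC fetches one 32-bit weight from `weight_source`.

    Worst case by construction: no weight reuse or caching is modeled.
    """
    per_mac_pj = ENERGY_PJ["mac_fp32"] + ENERGY_PJ[weight_source]
    return num_macs * per_mac_pj * 1e-9  # 1 pJ = 1e-9 mJ


alexnet_macs = 724e6  # AlexNet MAC count as listed in this deck's "Common NN's" table

on_chip = inference_energy_mj(alexnet_macs, "sram_read_32b")
off_chip = inference_energy_mj(alexnet_macs, "dram_read_32b")
print(f"on-chip weights:  {on_chip:.1f} mJ")
print(f"off-chip weights: {off_chip:.1f} mJ")
print(f"ratio: {off_chip / on_chip:.0f}x")
```

Under these assumed numbers the memory access, not the arithmetic, dominates the energy budget, which is the motivation for minimizing data moves and moving memory closer to computation.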



Source: B. Dally (Chief Scientist Nvidia/Stanford), S. Han (Stanford), Efficient Methods and Hardware for Deep Learning (2017), NIPS 2016 Workshop on Efficient Methods for Deep Neural Networks (2016); V. Sze (MIT), Efficient Processing of Deep Neural Networks: A Tutorial and Survey (2017)



#### **3 Eras of AI Compute Revolution**

Figure: compute approaches grouped into three eras (ERA1, ERA2, ERA3). Axis labels: Efficiency, Computational Complexity. Additional figure label: Accelerators.

- **Processors:** CPU, GPU, xPU, FPGA. Heterogeneous computing, based on optimization using existing building blocks and node scaling.
- **Near Memory:** DRAM w/ HBM, HMC, 3D SCM. System level design focused on proximity of memory to the processors.
- **Embed Memory:** MRAM, ReRAM, CBRAM. On die memory; memory does not move.
- **In-Memory Compute:** Analog Compute, Memristor; NOR Flash, PCM, ReRAM, FeFET. Analog compute with memory; mixed signal design.
- **Disruptive Compute:** Synaptic, Brain-Inspired (Stochastic Computing; ReRAM, PCM); Quantum Computing (Superposition, Entanglement; Cryo/Quantum, Superconducting).
- **High Density Interconnect:** Packaging: 2.5D/EM, D2W / 3D; Cryo; Photonics.



#### **3 Eras of AI Compute Revolution: Implications/Needs**

- Connectivity across the ecosystem
- Acceleration of learning cycles
- Co-optimization and customization of software & hardware, extending from materials to systems



## **Materials to Systems**





## Accelerating the Path to Productization: Lab to Fab





## **Models are becoming Deeper and Larger**

#### Common NNs

| Neural Network    | Type       | # Weights | # MACs |
|-------------------|------------|-----------|--------|
| AlexNet           | CNN        | 61M       | 724M   |
| GoogLeNet         | CNN        | 7M        | 1.43G  |
| VGG-16            | CNN        | 138M      | 15.5G  |
| ResNet50          | CNN        | 25.5M     | 3.9G   |
| ResNet152         | CNN        | 60M       | 11.3G  |
| MLP0              | MLP        | 20M       |        |
| MLP1              | MLP        | 5M        |        |
| LSTM0             | LSTM (RNN) | 52M       |        |
| LSTM1             | LSTM (RNN) | 34M       |        |
| CNN0              | CNN        | 8M        |        |
| CNN1              | CNN        | 100M      |        |

MLP0, MLP1, LSTM0, LSTM1, CNN0, and CNN1 → 95% of TPU workload

Sources: V. Sze (MIT), Efficient Processing of Deep Neural Networks: A Tutorial and Survey (2017); A. Canziani, A. Paszke, E. Culurciello, An Analysis of Deep Neural Network Models for Practical Applications (2016); N. P. Jouppi et al., In-Datacenter Performance Analysis of a Tensor Processing Unit, 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 24-28, 2017
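To give a feel for what storing these weights on die would require, the weight counts in the table can be converted into storage footprints. This is a minimal sketch: the bit widths (32-bit and 8-bit) are assumptions chosen for illustration, the function name is hypothetical, and only weights are counted (no activations or overhead).

```python
# Weight counts in millions, copied from the table above.
WEIGHTS_M = {
    "AlexNet": 61,
    "GoogLeNet": 7,
    "VGG-16": 138,
    "ResNet50": 25.5,
    "ResNet152": 60,
}


def weight_footprint_mb(weights_millions, bits_per_weight):
    """Storage for the weights alone, in MB (10^6 bytes)."""
    total_bits = weights_millions * 1e6 * bits_per_weight
    return total_bits / 8 / 1e6


for name, count in WEIGHTS_M.items():
    fp32 = weight_footprint_mb(count, 32)  # assumed 32-bit weights
    int8 = weight_footprint_mb(count, 8)   # assumed 8-bit quantized weights
    print(f"{name:10s} {fp32:7.1f} MB @ 32-bit   {int8:6.1f} MB @ 8-bit")
```

Even at an assumed 8 bits per weight, the larger networks need tens to more than a hundred megabytes for weights alone.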

#### Analog Multiply-Accumulate



Vector Matrix Multiplication performed by sensing current; weights stored as cell conductances
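The analog multiply-accumulate described above can be sketched numerically: each cell contributes a current $V_i G_{ij}$ (Ohm's law), and the currents on a shared column sum (Kirchhoff's current law). The sketch below assumes an ideal, noise-free array. The conductance range, the use of differential conductance pairs for signed weights, and all function names are illustrative assumptions, not details from the slide.

```python
import numpy as np

# ASSUMPTION: hypothetical programmable conductance range, in siemens.
G_MIN, G_MAX = 1e-6, 1e-4


def weights_to_conductances(w):
    """Map signed weights to a differential pair (G_plus, G_minus).

    Positive weights raise G_plus, negative weights raise G_minus, so that
    G_plus - G_minus is proportional to w.
    """
    scale = (G_MAX - G_MIN) / np.max(np.abs(w))
    g_plus = G_MIN + scale * np.clip(w, 0, None)
    g_minus = G_MIN + scale * np.clip(-w, 0, None)
    return g_plus, g_minus, scale


def analog_mac(v_in, g_plus, g_minus):
    """Column current is sum_i V_i * G_ij; subtract the paired columns."""
    i_plus = v_in @ g_plus
    i_minus = v_in @ g_minus
    return i_plus - i_minus


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3))        # 4 inputs (rows), 3 outputs (columns)
v = rng.uniform(0.0, 0.2, size=4)  # input voltages applied to the rows

g_plus, g_minus, scale = weights_to_conductances(w)
i_out = analog_mac(v, g_plus, g_minus)  # sensed column currents
recovered = i_out / scale               # rescale currents back to weight units

print(np.allclose(recovered, v @ w))
```

In this idealized model the sensed currents reproduce the digital vector-matrix product exactly. Real devices add nonlinearity, limited conductance states, and variation.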

How to store weights on die? How to make MACs more efficient?

## **Materials to Systems Approach is Needed**





## Materials Innovation Drives Performance of AI Devices



#### Performance of Different In-Memory Compute Elements Based on MNIST data set

| Category                       | Metric                                | Unit  | 6-bit SRAM (digital synapse) | ReRAM (Ag:a-Si) | ReRAM (AlOx/HfO2) | ReRAM (TaOx/HfOx) | PCRAM (GST) | FeRAM (HZO)  |
|--------------------------------|---------------------------------------|-------|------------------------------|-----------------|-------------------|-------------------|-------------|--------------|
| Properties of materials system | # of conductance states               | #     |                              | 97              | 40                | 128               | 100-120     | 32           |
|                                | Nonlinearity (weight up/down)         | ratio |                              | 2.4 / -4.9      | 1.9 / -0.6        | 0.04 / 0.2        | 0.1 / 2.4   | 1.6 / 1.8    |
|                                | RON                                   | kΩ    |                              | 26,000          | 17                | 86                | 5           | 500          |
|                                | ON/OFF ratio                          | ratio |                              | 12.5            | 4.4               | 10                | 20          | ~1,300       |
|                                | Weight increase pulse                 | V/µs  |                              | 3.2 / 300       | 0.9 / 100         | 1.6 / 0.05        | 0.7 / 6     | 2.17 / 50    |
|                                | Weight decrease pulse                 | V/µs  |                              | -2.8 / 300      | -1 / 100          | 1.5 / 0.05        | 3 / 0.125   | -1.62 / 50   |
|                                | Cycle-to-cycle variation ( $\sigma$ ) | %     |                              | 3.5%            | 5%                | ~3.5%             | 1.5%        | <1%          |
| Power Performance Area (PPA)   | Area                                  | µm^2  | 10,311                       | 1,072           | 3,657             | 1,513             | 7,233       | 1,194        |
|                                | Latency (optimized)                   | sec   | 0.5217                       | 64,200          | 4,440             | 10                | 413         | 480          |
|                                | Energy (optimized)                    | mJ    | 2.2                          | 15              | 146               | 0.81              | 1,340       | 0.21         |
|                                | Inference Latency                     | msec  | 29.2                         | 0.24            | 0.20              | 0.20              | 0.20        | 0.20         |
|                                | Inference Energy                      | μJ    | 26.1                         | 2.4             | 5.0               | 3.1               | 6.5         | 2.7          |
| ML Algorithm                   | Online learning accuracy              | %     | ~94%                         | ~73%            | ~41%              | ~73%              | ~87%        | ~90%         |

The 6-bit SRAM column is the digital synapse; the remaining columns are potential analog synapses built from different materials systems. The materials-system properties are developed by materials engineering; the PPA rows are the resulting power, performance, and area; the last row is the AI model accuracy.

Reference: Adapted from S. Yu, ASU/Georgia Tech



#### Many different device types & mechanisms: Need to leverage intrinsic physics for AI compute
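One way to read the table above is to compare each analog candidate against the 6-bit SRAM digital synapse on inference energy and online learning accuracy. The sketch below only re-expresses numbers copied from the table; the accuracy values are approximate in the source (marked with ~), and the function name is hypothetical.

```python
# (inference energy in uJ, online learning accuracy in %), copied from the table.
DEVICES = {
    "6-bit SRAM": (26.1, 94),
    "ReRAM Ag:a-Si": (2.4, 73),
    "ReRAM AlOx/HfO2": (5.0, 41),
    "ReRAM TaOx/HfOx": (3.1, 73),
    "PCRAM GST": (6.5, 87),
    "FeRAM HZO": (2.7, 90),
}

BASE_ENERGY, BASE_ACC = DEVICES["6-bit SRAM"]


def compare_to_sram(name):
    """Return (inference-energy reduction factor, accuracy gap in points)."""
    energy, acc = DEVICES[name]
    return BASE_ENERGY / energy, BASE_ACC - acc


for name in DEVICES:
    if name == "6-bit SRAM":
        continue
    factor, gap = compare_to_sram(name)
    print(f"{name:16s} {factor:5.1f}x lower inference energy, {gap:2d} pts lower accuracy")
```

On these figures every analog candidate uses several times less inference energy than the digital synapse, while the accuracy gap varies widely between materials systems.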



## **Device Physics to Cell Behavior**



#### Understand and Exploit Cell Physics; Engineer Cell Stack Based on Understanding

Inputs: Materials/properties, film thicknesses, layering scheme, etc.

Outputs: Forming behavior, potentiation/depression behaviors, variability, etc.

Figure: weight vs. write pulse #, compared against the ideal linear response.
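The gap between a real potentiation curve and the ideal linear weight-versus-pulse response can be sketched with a simple behavioral model. The saturating exponential form below is one commonly used in the neuromorphic-device literature and is an assumption here, not something stated on the slide. The shape parameter `a`, the normalized conductance range, and the function names are hypothetical.

```python
import numpy as np


def conductance_after_pulses(p, p_max, g_min, g_max, a):
    """Assumed behavioral model: G(p) = g_min + B * (1 - exp(-p / a)).

    B is chosen so that G(0) = g_min and G(p_max) = g_max.
    Small `a` gives a strongly saturating curve; large `a` approaches linear.
    """
    p = np.asarray(p, dtype=float)
    b = (g_max - g_min) / (1.0 - np.exp(-p_max / a))
    return g_min + b * (1.0 - np.exp(-p / a))


def max_deviation_from_linear(p_max, a):
    """Largest gap between the modeled curve and the ideal linear ramp (0..1)."""
    p = np.arange(p_max + 1)
    g = conductance_after_pulses(p, p_max, 0.0, 1.0, a)
    return float(np.max(np.abs(g - p / p_max)))


for a in (20.0, 100.0, 500.0):
    dev = max_deviation_from_linear(100, a)
    print(f"a = {a:5.0f}: max deviation from ideal linear = {dev:.3f}")
```

In this model, engineering the cell stack toward a more linear response corresponds to increasing `a`, which shrinks the deviation from the ideal line.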

## **Connectivity Through Partnerships**





Applied Materials team selected by **DARPA** to develop advanced technology for AI

Applied is working with **Arm** and **Symetrix** to develop a new neuromorphic switch based on CeRAM memory

Announced July 24th 2018



Source: SUNY Poly

**ESD and SUNY** Announce New Research Partnership with Applied Materials

New Applied Materials R&D Center to Help Customers Overcome Moore's Law Challenges

Applied Ventures and Empire State Development Aim to Accelerate Innovation in Upstate New York

Announced Nov 15th 2018



**Spin Memory** Teams with Applied Materials to Produce a Comprehensive Embedded MRAM Solution

Announced Nov 11th 2018





Source: IBM

#### **IBM** Launches Research Collaboration Center to Drive Next-Generation AI Hardware

Partnerships with leading semiconductor equipment companies Applied Materials... are crucial to the successful introduction of disruptive materials and devices to fuel our AI hardware roadmap.

Announced Feb 7th 2019





**AI APPLICATIONS** 



