## Recent Progress on Neuromorphic Computing Using Adiabatic Josephson Devices

Olivia CHEN<sup>1</sup>, Tomoharu YAMAUCHI<sup>1</sup>, Zhengang LI<sup>2</sup>, Yanzhi WANG<sup>2</sup> and Nobuyuki YOSHIKAWA<sup>3</sup>

<sup>1</sup> Tokyo City University, JAPAN

<sup>2</sup> Northeastern University, USA

<sup>3</sup> Yokohama National University, JAPAN



## Motivation







## Back to 2018



- O. CHEN, ASC 2018, Young Professional Plenary, Seattle, US
- IEEE-CSC, ESAS and CSSJ SUPERCONDUCTIVITY NEWS FORUM (global edition), Issue No. 55, January, 2024. Invited presentation given at EUCAS 2023, September 4, 2023, Bologna, Italy

- First digital circuit based machine learning acceleration attempt
- Using adiabatic quantum-fluxparametron (AQFP) devices

What has been achieved during the past 5 years?

## Outline

- Introduction
  - AQFP basics
  - Design methodology
- AQFP-based neural network acceleration
  - Stochastic computing-based neural network design
  - Binarized neural network design
- Hardware-Algorithm Co-optimization
  - EDA-based circuit optimization
  - Hardware-oriented training optimization
  - Architecture optimization

# Introduction to AQFP


### Adiabatic Quantum Flux Parametron (AQFP) Logic

Basic idea to resolve the static power issue in SFQ: replace DC with AC, static power -> 0



### Adiabatic Quantum Flux Parametron (AQFP) Logic

Further optimization to resolve the static power issue in SFQ: parameter adjustment to low $I_c$ (50 µA), high $J_c$ (10 kA/cm<sup>2</sup>), and high $\beta_c$ (underdamped)



N. Takeuchi, et al., IEICE TRANS. ELECTRON., VOL.E105-C, NO.6 JUNE 2022

### AQFP Circuit Design Methodology







#### Cell library example

#### US IARPA SuperTools Program (2017 ~ 2022)

- Standard cell library with 80+ gate designs for the MIT-LL SFQ5ee process
- EDA tool chain for the designed cell library
  - Majority logic synthesis
  - Timing alignment optimization
  - Process-tailored placement and routing
  - Probabilistic power analysis

#### C. J. Fourie et al., IEEE TAS 2023.

# AQFP-based neural network acceleration

### AQFP for Stochastic Computing

#### Stochastic Computing

- Conventional binary: positional weights 2<sup>7</sup>, 2<sup>6</sup>, 2<sup>5</sup>, 2<sup>4</sup>, 2<sup>3</sup>, 2<sup>2</sup>, 2<sup>1</sup>, 2<sup>0</sup>
- Stochastic bit-stream: convert numbers to probabilities (example stream = 2/8 or 0.25)
- Approximate computing with an extremely small hardware footprint
- Robust against errors (BFR: 0.1%, 0.5%, 1.0%, 2.0%)
- In conventional CMOS, random number generation is inefficient (achieved by LFSR)
- Area issue: the digital comparator used to generate stochastic numbers takes over 90% of the entire circuit

Image: Armin Alaghi



- Superconducting current comparator with symmetrical structure.
- When Iin = 0, it has random behavior due to thermal noise
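The stochastic-computing idea on this slide can be sketched in a few lines. This is a minimal illustration, not the AQFP circuit: a seeded software pseudo-random generator stands in for the thermally driven comparator, and the helper names (`to_bitstream`, `from_bitstream`, `sc_multiply`) are hypothetical.

```python
import random


def to_bitstream(p, length, rng):
    """Encode a value p in [0, 1] as a unipolar stochastic bit-stream.

    Each bit is 1 with probability p. In hardware the random decision
    would come from the comparator; here a software RNG is an assumption.
    """
    return [1 if rng.random() < p else 0 for _ in range(length)]


def from_bitstream(bits):
    """Decode a unipolar bit-stream: the value is the fraction of 1s."""
    return sum(bits) / len(bits)


def sc_multiply(a_bits, b_bits):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [a & b for a, b in zip(a_bits, b_bits)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
a = to_bitstream(0.25, 4096, rng)
b = to_bitstream(0.5, 4096, rng)
product = from_bitstream(sc_multiply(a, b))
print(f"estimated 0.25 * 0.5 = {product:.3f}")
```

The estimate converges to 0.125 as the stream grows, which is why bit-stream length trades accuracy against latency.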

#### ASC 2018 Young Professional Plenary



Stochastic computing based AQFP MAC circuit.

### Non-linear Function Approximation w/ SC + AQFP

- Bernstein polynomials used for approximation
- 9 functions are employed for benchmarking
- Input and output normalized in range [0,1] for SC paradigm
- Bit-stream lengths from 64 to 1024 are tested
- Device parameter optimized for AIST HSTP process
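As a rough illustration of the approximation step, the sketch below evaluates a Bernstein polynomial whose coefficients are samples of the target function. The slides do not list the 9 benchmark functions, so `tanh(2x)` is only a stand-in, and the function name `bernstein_approx` is hypothetical.

```python
from math import comb, tanh


def bernstein_approx(f, degree, x):
    """Evaluate the degree-n Bernstein approximation of f at x in [0, 1].

    B_n(f; x) = sum_k f(k/n) * C(n, k) * x^k * (1 - x)^(n - k)
    The coefficients f(k/n) stay in [0, 1] when f maps [0, 1] to [0, 1],
    which is what makes them representable as stochastic numbers.
    """
    n = degree
    return sum(
        f(k / n) * comb(n, k) * x**k * (1 - x) ** (n - k)
        for k in range(n + 1)
    )


def target(x):
    """Illustrative target only (assumption): maps [0, 1] into [0, 1]."""
    return tanh(2 * x)


grid = [i / 20 for i in range(21)]
worst = max(abs(bernstein_approx(target, 8, x) - target(x)) for x in grid)
print(f"max error of degree-8 approximation on the grid: {worst:.4f}")
```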





### AQFP SC-based Neural Network

#### Collaboration w/ Northeastern University







- Efficiency improvement of multiplication-sum + activation operation using Bitonic sorter structure
- Automated design using toolsets developed in 'SuperTools'
- Prototype circuit implementation and low-temperature operation demonstration

#### Tested on MNIST dataset

| Network | Platform | Accuracy | Energy (µJ) | Throughput (images/ms) |
|---------|----------|----------|-------------|------------------------|
| SNN     | Software | 99.04%   | -           | -                      |
| SNN     | CMOS     | 97.35%   | 39.46       | 231                    |
| SNN     | AQFP     | 97.91%   | 5.606E-4    | 8305                   |
| DNN     | Software | 99.17%   | -           | -                      |
| DNN     | CMOS     | 96.62%   | 219.37      | 229                    |
| DNN     | AQFP     | 96.95%   | 2.482E-3    | 6667                   |

R. Cai, O. Chen, et al., *ISCA* 2019; O. Chen et al., *SOCC* 2022
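To make the gap in the MNIST table concrete, the ratios between the CMOS and AQFP rows can be computed directly. The numbers below are copied from the table; the dictionary layout and the helper `aqfp_gain` are only illustrative.

```python
# (energy in µJ, throughput in images/ms), copied from the MNIST table
MNIST_RESULTS = {
    "SNN": {"CMOS": (39.46, 231), "AQFP": (5.606e-4, 8305)},
    "DNN": {"CMOS": (219.37, 229), "AQFP": (2.482e-3, 6667)},
}


def aqfp_gain(network):
    """Return (energy reduction factor, throughput gain) of AQFP over CMOS."""
    cmos_energy, cmos_throughput = MNIST_RESULTS[network]["CMOS"]
    aqfp_energy, aqfp_throughput = MNIST_RESULTS[network]["AQFP"]
    return cmos_energy / aqfp_energy, aqfp_throughput / cmos_throughput


for network in MNIST_RESULTS:
    energy_gain, speedup = aqfp_gain(network)
    print(f"{network}: energy lower by {energy_gain:.3g}x, "
          f"throughput higher by {speedup:.3g}x")
```

Both networks land between 10<sup>4</sup> and 10<sup>5</sup> in energy reduction, with a throughput gain of roughly 30 to 36 times.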

# **Circuit Optimization**





#### Majority Logic Synthesis

GLSVLSI'19 LUT-based cell mapping (R. Cai et al.)

DATE'23 Bayesian optimization considering latency and fanout (R. Fu et al.)

#### **Timing Alignment**

ICCD'20 Heuristic algorithms (R. Cai et al.)

ASPDAC'23 Gate count optimization using dynamic programming and approximate solutions of ILP (R. Fu et al.)

#### **Placement**

IEEE TAS'16 GA-based placement (Murai et al.)

ICCAD'20 Analytical global placement and row-wise detailed placement (Y. Chang et al.)

DAC'22 Timing-aware placement using convex optimization (P. Dong et al.)

ICCAD'23 Placement optimization for delay-line clocking scheme (R. Fu et al.)

\*Results under the SuperTools program

### Now, What Is the Problem?



1. To achieve reasonable accuracy on the CIFAR-10 dataset, 2048  $\sim$  4096 is expected

2. Even with data-level parallelism, the structure is still processor-memory separated

### Crossbar Architecture + Binary Neural Network



- Logic-in-memory cell design to perform binary multiplication with a prestored 1-bit weight

### Analog Accumulation and AQFP Comparator-based Neuron



- Analog current accumulation for column summation (1 and 0 are represented by positive and negative current pulses in AQFP)
- Flux coupling / current accumulated via superconducting inductance
- AQFP comparator serves as the neuron to perform activation
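A behavioural sketch of one crossbar column may help here. It assumes XNOR as the binary multiplication (suggested by the "Offset XNOR" module name in the test summary), a unit current of ±1 per cell, and a comparator threshold of 0; all three are assumptions for illustration, and the function names are hypothetical.

```python
def xnor(activation, weight):
    """Binary multiplication on {0, 1} values: 1 when the bits agree."""
    return 1 if activation == weight else 0


def column_current(inputs, weights, unit_current=1.0):
    """Sum the cell outputs of one column.

    Logic 1 contributes a positive current and logic 0 a negative one,
    mirroring the positive/negative pulse encoding on the slide.
    The unit current value is an assumption.
    """
    return sum(
        unit_current if xnor(a, w) else -unit_current
        for a, w in zip(inputs, weights)
    )


def comparator_neuron(current, threshold=0.0):
    """Ideal (noise-free) comparator: a step activation on the current."""
    return 1 if current >= threshold else 0


inputs = [1, 0, 1, 1]
print(comparator_neuron(column_current(inputs, [1, 0, 1, 0])))
print(comparator_neuron(column_current(inputs, [0, 1, 0, 1])))
```

The ideal step used here ignores thermal noise and current attenuation.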

### Implementation Example of Memory




- Serial write-in, parallel read-out memory using a delayed buffer chain





|         |        |       |
|---------|--------|-------|
| DC      | -22.5% | 57.5% |
| AC1     | -61.1% | 96.7% |
| AC2     | -46.7% | 64.4% |
| BCM_DC  | -26.7% | 47.5% |
| BCM_AC1 | -43.3% | 48.9% |
| BCM_AC2 | -42.2% | 51.1% |



#### Superconducting Quantum Circuit Fabrication Facility

### Implementation of an Example 4-input Neuron Circuit




### Module Implementation and Test Summary

| Module Name    | JJ Count | AIST QuFab HSTPA 007 | AIST QuFab HSTPA 008 |
|----------------|----------|----------------------|----------------------|
| Offset XNOR    | 22       | ×                    | ○                    |
| BCM (8-bit)    | 152      | N/A                  | ○                    |
| Neuron (4-in)  | 42       | ○                    | N/A                  |
| Neuron (8-in)  | 74       | ○                    | N/A                  |
| Neuron (16-in) | 138      | ○                    | N/A                  |
| 4x4 BNN        | 690      | ×                    | ○                    |
| 8x8 BNN        | 2236     | N/A                  | Under Test           |












- Accumulated current attenuates due to the large inductance for magnetic coupling.
- Neurons are not able to perform accumulation function.



O. Chen, et al., IEEE AICAS 2023.

# Algorithm-Hardware Co-Optimization


### **Review: Problems and Motivations**

- Stochastic switching of AQFP neurons (no longer a step function).
- Current attenuation in the AQFP-based crossbar depends on the crossbar size.
- Software-hardware mismatch caused by stochastic switching and current attenuation.
- How to decide the operable crossbar size?
- How to accumulate the results of multiple crossbars efficiently?
- Hardware configurations affect both energy efficiency and model accuracy!

### Assessments on Stochastic Switching on AQFP Neuron and Crossbar Current

• Switching probability fitting using error function

$$P(I_{in}) = 0.5 + 0.5 \operatorname{erf}\left(\sqrt{\pi} \frac{(I_{in} - I_{th})}{\Delta I_{in}}\right)$$

• Current attenuation is determined by the crossbar size $C_s$. Annealing function:

$$I_1(C_s) = A \cdot C_s^{-B}$$

• DNN value conversion

$$P_v(V_{in}) = 0.5 + 0.5 \operatorname{erf}\left(\sqrt{\pi} \frac{(V_{in} - V_{th})}{\Delta V_{in}(C_s)}\right)$$

$$\Delta V_{in}(C_s) = \Delta I_{in}/I_1(C_s).$$
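The three formulas above can be combined into a small numerical model. The sketch below follows the equations as written; the fit constants `A` and `B` and the gray-zone width used in the example call are made-up placeholders, since the slides do not give their values.

```python
import math


def attenuated_current(crossbar_size, a, b):
    """I_1(C_s) = A * C_s^(-B): current representing '1' at size C_s."""
    return a * crossbar_size ** (-b)


def switching_probability(i_in, i_th, gray_zone_current):
    """P(I_in) = 0.5 + 0.5 * erf(sqrt(pi) * (I_in - I_th) / dI_in)."""
    z = math.sqrt(math.pi) * (i_in - i_th) / gray_zone_current
    return 0.5 + 0.5 * math.erf(z)


def value_gray_zone(gray_zone_current, crossbar_size, a, b):
    """dV_in(C_s) = dI_in / I_1(C_s): gray zone in units of DNN values."""
    return gray_zone_current / attenuated_current(crossbar_size, a, b)


def value_switching_probability(v_in, v_th, gray_zone_current,
                                crossbar_size, a, b):
    """P_v(V_in): the same error-function fit expressed on DNN values."""
    dv = value_gray_zone(gray_zone_current, crossbar_size, a, b)
    return 0.5 + 0.5 * math.erf(math.sqrt(math.pi) * (v_in - v_th) / dv)


# Placeholder constants (assumptions, not measured values).
A, B, GRAY_ZONE = 100.0, 0.5, 2.0
for size in (4, 16, 64):
    dv = value_gray_zone(GRAY_ZONE, size, A, B)
    p = value_switching_probability(1.0, 0.0, GRAY_ZONE, size, A, B)
    print(f"C_s={size:3d}  dV_in={dv:.3f}  P_v(1)={p:.4f}")
```

With any positive `B`, a larger crossbar shrinks $I_1$ and widens the gray zone in value units, so the neuron behaves less like a step function.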



Relationship between the output current representing value '1' and the crossbar synapse array size

### Approach: Hardware-Algorithm Co-Optimization

### **Randomness-aware Binary Neural Network Training**

• Non-deterministic activation mapping with AQFP neuron switching probability distribution

$$w_b = \text{sign}(w_r) = \begin{cases} +1, & \text{if } w_r \ge 0, \\ -1, & \text{otherwise}, \end{cases}$$

$$a_b = \mathrm{sign}\left(a_r\right) = \begin{cases} +1, & \text{with probability } P_v\left(a_r\right), \\ -1, & \text{with probability } 1 - P_v\left(a_r\right), \end{cases}$$

• Back propagation mapping

$$\begin{aligned} \frac{\partial \mathbb{E}(a_b)}{\partial a_r} &= \frac{\partial \operatorname{erf}\left(\sqrt{\pi} \frac{(a_r - V_{th})}{\Delta V_{in}(C_s)}\right)}{\partial a_r} \\ &= \frac{\partial \sqrt{\pi} \frac{(a_r - V_{th})}{\Delta V_{in}(C_s)}}{\partial a_r} \cdot \frac{2}{\sqrt{\pi}} e^{-\left(\sqrt{\pi} \frac{(a_r - V_{th})}{\Delta V_{in}(C_s)}\right)^2} \end{aligned}$$
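The forward and backward mappings can be checked numerically. The sketch below assumes plain scalar Python rather than any training framework; it uses $\mathbb{E}(a_b) = (+1)P_v + (-1)(1 - P_v) = 2P_v - 1$, which equals the error-function term differentiated above. Function names and the example values are hypothetical.

```python
import math
import random


def p_v(a_r, v_th, dv):
    """Switching probability on DNN values (error-function fit)."""
    return 0.5 + 0.5 * math.erf(math.sqrt(math.pi) * (a_r - v_th) / dv)


def stochastic_sign(a_r, v_th, dv, rng):
    """Forward pass: +1 with probability P_v(a_r), otherwise -1."""
    return 1 if rng.random() < p_v(a_r, v_th, dv) else -1


def expected_activation(a_r, v_th, dv):
    """E[a_b] = 2 * P_v(a_r) - 1, i.e. the erf term itself."""
    return 2.0 * p_v(a_r, v_th, dv) - 1.0


def activation_gradient(a_r, v_th, dv):
    """Backward pass: d E[a_b] / d a_r via the chain rule on erf."""
    z = math.sqrt(math.pi) * (a_r - v_th) / dv
    return (math.sqrt(math.pi) / dv) * (2.0 / math.sqrt(math.pi)) * math.exp(-z * z)


# Illustrative values only (assumptions).
a_r, v_th, dv = 0.3, 0.0, 1.0
rng = random.Random(1)
samples = [stochastic_sign(a_r, v_th, dv, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
h = 1e-6
finite_diff = (expected_activation(a_r + h, v_th, dv)
               - expected_activation(a_r - h, v_th, dv)) / (2 * h)
print(sample_mean, expected_activation(a_r, v_th, dv))
print(finite_diff, activation_gradient(a_r, v_th, dv))
```

At $a_r = V_{th}$ the gradient reduces to $2/\Delta V_{in}(C_s)$, so a wider gray zone gives a flatter but broader gradient.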

#### **Batch Normalization Matching (Hardware Mapping)**

• Batch norm mapping with analog threshold current input in AQFP

$$I_{th} = \left(-\frac{\beta\sqrt{\sigma^2 + \epsilon}}{\gamma \cdot \alpha} + \frac{\mu}{\alpha}\right) \cdot I_1(C_s).$$
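One way to read the threshold formula is that batch normalization followed by a sign function is folded into a single comparator threshold. The sketch below makes that reading explicit under stated assumptions: $\alpha$ is taken to be a positive scale factor applied to the integer column sum before batch normalization, and $\gamma > 0$. The slides do not define $\alpha$, so this interpretation and all numeric values are assumptions.

```python
import math


def bn_threshold_current(gamma, beta, mu, var, alpha, i_1, eps=1e-5):
    """I_th = (-beta*sqrt(var+eps)/(gamma*alpha) + mu/alpha) * I_1(C_s)."""
    return (-beta * math.sqrt(var + eps) / (gamma * alpha) + mu / alpha) * i_1


def bn_then_sign(column_sum, gamma, beta, mu, var, alpha, eps=1e-5):
    """Software reference: scale, batch-normalize, then take the sign."""
    y = gamma * (alpha * column_sum - mu) / math.sqrt(var + eps) + beta
    return 1 if y >= 0 else -1


def comparator(column_sum, i_1, i_th):
    """Hardware view: the column current is compared with I_th."""
    return 1 if column_sum * i_1 >= i_th else -1


# Placeholder parameters (assumptions).
params = dict(gamma=0.8, beta=0.3, mu=1.2, var=4.0, alpha=0.5)
i_1 = 2.0
i_th = bn_threshold_current(i_1=i_1, **params)
matches = all(
    bn_then_sign(s, **params) == comparator(s, i_1, i_th)
    for s in range(-8, 9)
)
print(f"I_th = {i_th:.4f}, comparator matches BN+sign: {matches}")
```

For a negative $\gamma$ the inequality flips, which this sketch does not handle.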



### Hardware Design of AQFP-Based Randomized BNN Accelerator



- Stochastic Computing-based Accumulation Module Design
  - Crossbar may not be large enough for the whole filter's computation in DNN. We use stochastic computing (SC) to accumulate the results from multiple crossbars.
  - Using AQFP neuron directly as the SC generator.
  - APC is used in the SC addition.
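The accumulation step can be sketched behaviourally as follows. Each crossbar's neuron is modelled as emitting one random bit per clock cycle, and a counter adds the bits from all crossbars. An exact counter stands in for the APC here, and the per-crossbar probabilities are placeholders, so this is an assumption-laden illustration rather than the accelerator's actual datapath.

```python
import random


def parallel_count(bits):
    """Exact count of 1s across crossbars (stand-in for the APC)."""
    return sum(bits)


def accumulate_crossbars(probabilities, stream_length, rng):
    """Average per-cycle count over a bit-stream of the given length.

    probabilities[i] is the chance that crossbar i's neuron outputs 1
    in a cycle; the result estimates the sum of those probabilities.
    """
    total = 0
    for _ in range(stream_length):
        bits = [1 if rng.random() < p else 0 for p in probabilities]
        total += parallel_count(bits)
    return total / stream_length


rng = random.Random(2)
probs = [0.9, 0.2, 0.6, 0.5]  # placeholder values (assumption)
estimate = accumulate_crossbars(probs, 2048, rng)
print(f"estimated sum = {estimate:.3f}, exact sum = {sum(probs):.3f}")
```

A longer bit-stream reduces the variance of the estimate at the cost of more clock cycles per result.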

### Hardware Configuration Optimization

1. Stochastic Computing Bit-stream Length Optimization:



- Relationship between SC bit-stream length with model accuracy.
- VGG-Small models trained on CIFAR-10 with 4 different crossbar sizes are deployed.

2. Optimization for Width of Gray Zone  $\Delta I_{in}$  and Crossbar Size $C_s$:



- Accuracy distribution in two dimensions of gray-zone width and crossbar size.
- The stochastic bit-stream length used here is 1.

### **Experimental Results**

#### Model accuracy on CIFAR-10 dataset under different energy efficiency constraints.

| Design              | Scheme                | Accuracy | Energy Efficiency w/o<br>cooling (TOPS/W) | Energy Efficiency w/<br>cooling (TOPS/W) | Power (mW)           | Throughput<br>(image/s) |
|---------------------|-----------------------|----------|-------------------------------------------|------------------------------------------|----------------------|-------------------------|
| DNN (VGG-Small) [1] | Full-precision        | 92.5     | 0.28                                      | -                                        | -                    | -                       |
| IMB [2]             | Binary                | 87.7     | 82.6                                      | -                                        | 12.5                 | 1.3                     |
| STT-BNN [3]         | Binary                | 80.1     | 311                                       | -                                        | -                    | -                       |
| CMOS-BNN [4]        | Binary                | 92       | 617                                       | -                                        | -                    | -                       |
| Ours (VGG-Small)    | Binary                | 91.7     | 1.9×10 <sup>5</sup>                       | 4.8×10 <sup>2</sup>                      | 6.2×10 <sup>-3</sup> | 2                       |
| Ours (VGG-Small)    | Binary                | 90.6     | 3.8×10 <sup>5</sup>                       | 9.5×10 <sup>2</sup>                      | 6.3×10 <sup>-3</sup> | 3.9                     |
| Ours (VGG-Small)    | Binary                | 89.2     | 1.5×10 <sup>6</sup>                       | 3.8×10 <sup>3</sup>                      | 6.4×10 <sup>-3</sup> | 15.2                    |
| Ours (VGG-Small)    | Binary                | 87.4     | 6.8×10<sup>6</sup>                        | 1.7×10<sup>4</sup>                       | 7.6×10<sup>-3</sup>  | 47.4                    |
| Ours (ResNet-18)    | Binary                | 92.2     | 1.9×10 <sup>5</sup>                       | 4.8×10 <sup>2</sup>                      | 6.2×10 <sup>-3</sup> | 2.2                     |
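The two energy-efficiency columns in the table above imply a cooling overhead factor, which can be recovered by dividing one by the other. The slides do not state the factor, so the calculation below is an inference from the tabulated values; the data layout and the helper name are illustrative.

```python
# (TOPS/W without cooling, TOPS/W with cooling) for the "Ours" rows above
OURS_ROWS = [
    ("VGG-Small, 91.7%", 1.9e5, 4.8e2),
    ("VGG-Small, 90.6%", 3.8e5, 9.5e2),
    ("VGG-Small, 89.2%", 1.5e6, 3.8e3),
    ("VGG-Small, 87.4%", 6.8e6, 1.7e4),
    ("ResNet-18, 92.2%", 1.9e5, 4.8e2),
]


def implied_cooling_overhead(without_cooling, with_cooling):
    """Ratio of the two efficiencies: power multiplier implied by cooling."""
    return without_cooling / with_cooling


overheads = [
    implied_cooling_overhead(without, with_) for _, without, with_ in OURS_ROWS
]
for (label, _, _), factor in zip(OURS_ROWS, overheads):
    print(f"{label}: implied cooling overhead about {factor:.0f}x")
```

All five rows give a factor close to 400, so even after accounting for cooling the reported efficiency stays in the 10<sup>2</sup> to 10<sup>4</sup> TOPS/W range.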

| Design      | Accuracy | Energy Efficiency w/o cooling (TOPS/W) | Energy Efficiency w/ cooling (TOPS/W) |
|-------------|----------|----------------------------------------|---------------------------------------|
| SyncBNN [5] | 98.4     | 36.6                                   | 36.6                                  |
| RSFQ [5]    | 97.9     | 2.4×10<sup>3</sup>                     | 8.1                                   |
| ERSFQ [5]   | 97.9     | 1.5×10<sup>4</sup>                     | 50                                    |
| SC-AQFP [6] | 96.9     | 9.8×10<sup>3</sup>                     | 24.5                                  |
| Ours        | 98.1     | 1.5×10<sup>6</sup>                     | 3.8×10<sup>3</sup>                    |

Comparison with RSFQ-JBNN, ERSFQ-JBNN, CMOS-based SyncBNN, SC-AQFP, and our implementation (MLP) on the MNIST dataset.

[1] Y. Chen, MICRO 2014.
[2] H. Kim, ASPDAC 2019.
[3] T. N. Pham, IEEE ETCS, 2022.
[4] P. C. Knag, IEEE JSSC, 2020.
[5] R. Fu, IEEE TCAD, 2022.
[6] R. Cai, ISCA 2019.

#### Will be presented at *MICRO 2023*, October, Toronto, Canada



Thanks to Dr. Chris Ayala, Dr. Naoki Takeuchi, Dr. Tsung-Yi Ho, Mr. Wenhui Luo, Coldflux Team Members, Sponsors





