IEEE CSC & ESAS SUPERCONDUCTIVITY NEWS FORUM (global edition), February 2019. Invited presentation 1EOr1C-04 given at ASC 2018, October 28-November 02, 2018, Seattle, (USA).

**30-GHz Operation of Datapath for Bit-Parallel, Gate-Level-Pipelined Rapid Single-Flux-Quantum Microprocessors** 

#### Masamitsu Tanaka, Nagoya Univ.

Co-authors: Y. Hatanaka<sup>1</sup>, Y. Matsui<sup>1</sup>, I. Nagaoka<sup>1</sup>, K. Ishida<sup>2</sup>, K. Sano<sup>1</sup>, T. Yamashita<sup>1,3</sup>, O. Takatsugu<sup>2</sup>, K. Inoue<sup>2</sup>, A. Fujimaki<sup>1</sup>

<sup>1</sup>Nagoya Univ. <sup>2</sup>Kyushu Univ. <sup>3</sup>JST-PRESTO





#### Acknowledgment

This work was supported by JSPS KAKENHI Grant Numbers JP16H02796, JP18H05211, and JP18H01498, and by VDEC, The University of Tokyo, in collaboration with Cadence Design Systems, Inc. The circuits were fabricated in CRAVITY at AIST, Japan.



# Outline

- Background
- Co-design in device/circuit/architecture levels toward throughput-oriented microprocessors
- Demonstration of datapath and design of microprocessor prototype
- Summary

# **Moore's Law**




### **RSFQ Microprocessor Projects**

- FLUX-1: SUNY Stony Brook, TRW, and JPL
- CORE: Nagoya U., Yokohama National U., Kyoto U.



After M. Dorojevets et al., IEEE Trans. Appl. Supercond. 11 (2001) 326.

#### **Program Execution with 50-GHz Clock**



# Demonstration of Stored-Program Computing with CORE e2

- Successfully executed several test programs
  - Small-scale programs, each written in 16 lines or fewer
    - ✓ Calculate 1 + 2 + ... + N
    - ✓ Calculate sum of an array
    - ✓ Integer division
    - ✓ Find the greatest divisor
    - ✓ Euclidean algorithm (GCD)
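For reference, the test kernels listed above can be sketched in Python using only addition, subtraction, and comparison. This is a hedged illustration, not the CORE e2 assembly (which the slides do not show). The function names are hypothetical, and "greatest divisor" is assumed here to mean the largest proper divisor.

```python
def sum_to_n(n):
    """Compute 1 + 2 + ... + n by repeated addition."""
    total = 0
    while n > 0:
        total += n
        n -= 1
    return total


def array_sum(values):
    """Accumulate the elements of an array one at a time."""
    total = 0
    for v in values:
        total += v
    return total


def divide(dividend, divisor):
    """Integer division by repeated subtraction; returns (quotient, remainder)."""
    if dividend < 0 or divisor <= 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")
    quotient = 0
    while dividend >= divisor:
        dividend -= divisor
        quotient += 1
    return quotient, dividend


def greatest_divisor(n):
    """Largest proper divisor of n (assumed interpretation), for n >= 2."""
    for d in range(n - 1, 0, -1):
        if divide(n, d)[1] == 0:
            return d


def gcd(a, b):
    """Subtractive form of the Euclidean algorithm, for a, b > 0."""
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
    return a
```

Each routine is a short loop around add, subtract, and compare, which is consistent with programs that fit in a handful of lines.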



#### **Expected Maximum Performance in Bit-Serial Processing (8-bit)**

- CORE e2: 333 million instructions per second (MIPS)
- Pipelined execution: up to 6250 MIPS (ideal case)
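The two figures can be related with simple arithmetic. The sketch below assumes the 50-GHz clock from the earlier slide and assumes that an ideally pipelined bit-serial machine completes one 8-bit instruction every 8 clock cycles; neither assumption is stated explicitly on this slide. The helper name `mips` is hypothetical.

```python
CLOCK_HZ = 50e9   # assumed: the 50-GHz clock quoted for program execution
WORD_BITS = 8     # bit-serial processing of 8-bit data


def mips(clock_hz, cycles_per_instruction):
    """Throughput in million instructions per second."""
    return clock_hz / cycles_per_instruction / 1e6


# Ideal case: one instruction retired every WORD_BITS cycles (assumption).
ideal_mips = mips(CLOCK_HZ, WORD_BITS)

# Cycles per instruction implied by the quoted 333 MIPS at the same clock.
implied_cpi = CLOCK_HZ / (333 * 1e6)

print(f"ideal pipelined throughput: {ideal_mips:.0f} MIPS")
print(f"implied cycles per instruction at 333 MIPS: {implied_cpi:.0f}")
```

Under these assumptions the ideal value reproduces the 6250 MIPS on the slide, and 333 MIPS corresponds to roughly 150 clock cycles per instruction.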







# Outline

- Background
- Co-design in device/circuit/architecture levels toward throughput-oriented microprocessors
- Demonstration of datapath and design of microprocessor prototype
- Summary

### **Revisiting Microarchitecture Design for More Powerful Computing**

- Exploring the architecture design space to find what suits RSFQ best
  - Bit-serial, bit-slice vs. bit-parallel processing
  - Depth of pipelines
  - How to eliminate pipeline hazards?



#### **Pipeline Depth vs. Clock Frequency**



# **Eliminating Pipeline Hazards**

- Fine-grained multithreading
  - Number of threads = Number of pipeline stages
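Why matching the thread count to the pipeline depth removes hazards can be illustrated with a toy model. The sketch below assumes strict round-robin issue (one instruction per cycle, threads taken in turn) and counts how many instructions of the same thread are in flight at once. The function name and the model itself are illustrative assumptions, not the authors' simulator.

```python
def max_in_flight_per_thread(num_stages, num_threads, cycles=100):
    """Issue one instruction per cycle in round-robin thread order and
    return the largest number of instructions that any single thread
    has in the pipeline at the same time."""
    pipeline = [None] * num_stages  # pipeline[s] holds the thread id in stage s
    worst = 0
    for cycle in range(cycles):
        # Shift every instruction one stage deeper and issue a new one.
        pipeline = [cycle % num_threads] + pipeline[:-1]
        in_flight = [t for t in pipeline if t is not None]
        for t in set(in_flight):
            worst = max(worst, in_flight.count(t))
    return worst


# Threads == stages: each thread has at most one instruction in flight,
# so a thread's next instruction never overlaps its previous one.
print(max_in_flight_per_thread(num_stages=8, num_threads=8))

# Fewer threads than stages: instructions of one thread overlap,
# so dependences between them would need stalls or forwarding.
print(max_in_flight_per_thread(num_stages=8, num_threads=4))
```

In this model, a result is always written back before the same thread issues again when the two counts match, which is the property the slide relies on.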



K. Ishida et al., IPSJ J. 58 (2017)

# **Results of Architectural Optimization**

- Our approaches:
  - Bit-parallel processing
  - Ultra-deep (gate-level) pipelining
  - Fine-grained multithreading
- We started development of throughput-oriented microprocessors with bit-parallel, gate-level-pipelined processing.
  - Challenges: hardware complexity and timing design

Can bit-parallel RSFQ circuits operate at very high clock frequencies?

# 8-bit ALU Design



- ✓ Target frequency: 50 GHz
- ✓ Gate-level pipelining
- ✓ Functions: ADD, SUB, AND, OR, XOR, NOR, etc.
- ✓ Data length: 8 bits
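A behavioral reference model is handy when checking such an ALU against test vectors. The sketch below covers only the six functions named on the slide, wraps results to 8 bits, and does not model carry or status outputs. The opcode strings and the `alu` function are hypothetical names, not the design's actual encoding.

```python
MASK = 0xFF  # 8-bit data length

# Hypothetical opcode names mapped to their 8-bit behavior.
OPERATIONS = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "AND": lambda a, b: a & b,
    "OR":  lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOR": lambda a, b: ~(a | b),
}


def alu(op, a, b):
    """Return the 8-bit result of `op` applied to the 8-bit operands a and b."""
    if op not in OPERATIONS:
        raise ValueError(f"unsupported operation: {op}")
    return OPERATIONS[op](a & MASK, b & MASK) & MASK


print(alu("ADD", 200, 100))  # wraps around modulo 256
print(alu("SUB", 3, 5))      # two's-complement wraparound
```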

#### **Based on Brent-Kung adder**

- Minimum number of logic gates (w/o D flip-flops)
- Sparse wiring tracks
- Small fanouts (Max. 3)
- Maximum logic depth
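The trade-off listed above (few gates, deep logic) can be seen in a bit-level model of the Brent-Kung prefix network. The sketch below is a generic textbook formulation, not the RSFQ netlist: it counts prefix operators and levels rather than physical gates, and the function name and return layout are assumptions made for illustration.

```python
def brent_kung_add(a, b, cin=0, width=8):
    """Add two `width`-bit integers with a Brent-Kung prefix network.

    `width` must be a power of two. Returns
    (sum, carry_out, prefix_operator_count, prefix_levels).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate bits
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate bits
    G, P = g[:], p[:]
    ops = 0
    levels = 0

    def combine(hi, lo):
        """Prefix operator: merge group `lo` into the adjacent group `hi`."""
        nonlocal ops
        G[hi] = G[hi] | (P[hi] & G[lo])
        P[hi] = P[hi] & P[lo]
        ops += 1

    # Up-sweep: build group signals over spans of 2, 4, 8, ... bits.
    d = 1
    while d < width:
        for i in range(2 * d - 1, width, 2 * d):
            combine(i, i - d)
        levels += 1
        d *= 2

    # Down-sweep: fill in the remaining prefixes.
    d = width // 4
    while d >= 1:
        for i in range(3 * d - 1, width, 2 * d):
            combine(i, i - d)
        levels += 1
        d //= 2

    # G[i], P[i] now cover bits 0..i; fold in the carry input.
    carries = [cin] + [G[i] | (P[i] & cin) for i in range(width)]
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total, carries[width], ops, levels


total, cout, ops, levels = brent_kung_add(0xB7, 0x5C)
print(hex(total), cout, ops, levels)
```

For 8 bits this model uses 11 prefix operators in 5 levels. A Kogge-Stone network of the same width needs only 3 levels but 17 operators and denser wiring, which is the trade-off behind the bullets above.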


#### Demonstration of Gate-Level-Pipelined ALU up to 56 GHz











# **Results of Architectural Optimization**

- Our approaches:
  - Bit-parallel processing
  - Ultra-deep (gate-level) pipelining
  - Fine-grained multithreading
- We started development of throughput-oriented microprocessors with bit-parallel, gate-level-pipelined processing.
  - Challenges: hardware complexity and timing design

Can bit-parallel RSFQ circuits operate at very high clock frequencies? ...**YES** 

# Outline

- Background
- Co-design in device/circuit/architecture levels toward throughput-oriented microprocessors
- Demonstration of datapath and design of microprocessor prototype
- Summary

### Architectural Design of Gate-Level-Pipelined Microprocessor Prototype

- We designed a 4-bit microprocessor prototype that supports 12 threads.







K. Ishida et al., HotSPA2018

#### **Fabrication of Datapath Test Circuit**




#### Demonstration



# **High-Frequency Test Results**

- We confirmed correct execution of several gate-level-pipelined operations at clock frequencies up to 31 GHz.



# Gate-Level-Pipelined Microprocessor Prototype

- ✓ 24 pipeline stages
- ✓ 12 threads, SIMT
- ✓ 12 x 10-bit instruction memory
- ✓ 4 x 4-bit register file
- ✓ 4 x 4-bit data memory
- ✓ 23,713 Josephson junctions (JJs)
- ✓ Up to 31.3 GHz @ 6.9 mW

[Figure: prototype chip with labeled blocks: Data Memory, SFR Controller, Instruction Memory, ALU & Register File, Clock Gen.; dimension label 4.08 mm]
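A few derived quantities follow directly from the figures on this slide. The sketch below only takes ratios of the quoted numbers (23,713 JJs, 31.3 GHz, 6.9 mW, and the memory dimensions). The variable names are illustrative, and the results are back-of-the-envelope estimates rather than measured values.

```python
JJ_COUNT = 23_713   # Josephson junctions in the prototype
CLOCK_HZ = 31.3e9   # maximum clock frequency quoted on the slide
POWER_W = 6.9e-3    # power quoted at that frequency

# Energy dissipated by the whole chip per clock cycle.
energy_per_cycle_j = POWER_W / CLOCK_HZ

# Average power per junction.
power_per_jj_w = POWER_W / JJ_COUNT

# On-chip storage implied by the quoted memory dimensions.
instruction_memory_bits = 12 * 10
register_file_bits = 4 * 4
data_memory_bits = 4 * 4

print(f"energy per clock cycle: {energy_per_cycle_j * 1e12:.3f} pJ")
print(f"average power per JJ:   {power_per_jj_w * 1e6:.3f} uW")
print(f"storage bits (imem/rf/dmem): {instruction_memory_bits}/"
      f"{register_file_bits}/{data_memory_bits}")
```

These ratios put the chip at roughly 0.22 pJ per clock cycle and about 0.29 µW per junction.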

### Summary

- We designed and tested an RSFQ 4-bit datapath aimed at extremely high-throughput, bit-parallel microprocessors.
- Gate-level (ultra-deep) pipelining combined with fine-grained multithreading is a promising architectural approach for RSFQ-based high-performance computing.
- We demonstrated high-speed operation up to 31 GHz with a power consumption of 2.5 mW. Introducing energy-efficient techniques such as LV-RSFQ or ERSFQ should improve energy efficiency considerably.
- Fabrication and testing of prototype microprocessors that include the designed datapath are ongoing.