# Design and implementation of an SFQ-based single-chip FFT processor 

T. Ono, H. Suzuki, Y. Yamanashi, Member, IEEE, N. Yoshikawa, Member, IEEE


#### Abstract

We have been working on the development of a high-speed FFT processor using single-flux-quantum (SFQ) logic circuits. In our previous studies, we designed and demonstrated a 4-bit butterfly processor, a data-shuffling circuit, and a twiddle factor ROM for 4-bit 8-point FFT using the AIST $10 \mathrm{kA} / \mathrm{cm}^{2}$ Nb advanced process 2 (ADP2) at maximum frequencies of 51.6 GHz, 59.5 GHz, and 51.5 GHz, respectively. In this study, to complete the FFT processor design, we designed and demonstrated the remaining component circuits, a rounding circuit and a data buffer, at a target frequency of 50 GHz. We also designed and experimentally confirmed the operation of an SFQ-based single-chip FFT processor integrating all the component circuits.


Index Terms-butterfly processing circuit, FFT, RSFQ, superconducting devices, SFQ circuit, Josephson integrated circuit

## I. Introduction

The fast Fourier transform (FFT) is an algorithm for real-time digital signal processing that is used in a variety of fields, such as medical imaging, wireless communications, and radio astronomy. FFT processors are built as dedicated hardware because of the large number of calculations involved. However, the heat generation and large power consumption caused by these calculations become a problem in CMOS-based FFT processors. Single-flux-quantum logic circuits are an attractive alternative to CMOS because of their high-speed operation and low power consumption [1]. These features make them well suited to the development of an SFQ-based single-chip FFT processor.

Several studies have been conducted on butterfly processing circuits using SFQ circuits. For example, a 5-bit SFQ radix-2 butterfly processing circuit was demonstrated at a low clock frequency [2]. In our previous study, a 4-bit integer-type SFQ radix-2 butterfly processing circuit was demonstrated at 50 GHz [3] by using the AIST $10 \mathrm{kA} / \mathrm{cm}^{2}$ advanced process 2 (ADP2) [4]. Operation of a 4-bit fixed-point-type SFQ radix-2 butterfly processing circuit was demonstrated at 50 GHz [5]. In addition, component circuits called twiddle factor circuits and data-shuffling circuits were demonstrated at 50 GHz [6].

In the present study, we investigated the remaining component circuits needed to reach our final goal of developing an SFQ-based single-chip FFT processor. We designed and demonstrated two component circuits, a rounding circuit and a data buffer. Then, we designed and experimentally confirmed the operation of an SFQ-based single-chip FFT processor integrating all the component circuits.[^0]

## II. SFQ-BASED SINGLE-CHIP FFT PROCESSOR

Fig. 1 shows the data-flow diagram for an 8-point FFT. The FFT is carried out by recursively calculating the two-input two-output unit operation known as the butterfly operation. The butterfly operation is split into several calculation stages, and every stage includes butterfly operations with different input orders. In our previous study, we investigated a circuit configuration adopting the pipelined architecture in order to implement an SFQ FFT system on a single chip [6]. Fig. 2 shows the block diagram of the SFQ-based single-chip FFT processor. The circuit contains a unit butterfly processing circuit, a network switch called the data-shuffling circuit [7] for data re-ordering, and a twiddle factor ROM. A postprocessing circuit called a rounding circuit is placed after the butterfly processing circuit to round down the unnecessary lower bits of the output data. A data buffer is added in between for buffering the calculation data. In this study, we designed and demonstrated a data buffer and a rounding circuit for a 4-bit 8-point FFT. Then, we designed and experimentally confirmed the operation of an SFQ-based single-chip FFT processor integrating all the component circuits.

Fig. 1. Data-flow diagram for 8-point FFT. The two-input two-output unit operation is shown as a cross in the diagram to represent the unit butterfly operation.

Fig. 2. Block diagram of an SFQ-based single-chip FFT system.

Fig. 3. Data-flow diagram of a butterfly calculation.

## III. Butterfly Processing Circuit

The butterfly calculation is the unit operation of the FFT algorithm, which was developed by Cooley and Tukey [8]. In previous studies, we adopted the decimation-in-time (DIT) algorithm because circuit synchronization is easier than in the decimation-in-frequency (DIF) algorithm. We also adopted fixed-point arithmetic instead of floating-point arithmetic to avoid a complicated architecture and to reduce the chip area [9], [10]. A butterfly calculation needs two input data $x(0)$ and $x(1)$ and a twiddle factor $W$; complex-number operations are performed on each of the data, as shown in Fig. 3. The input data length is 4 bits including a sign bit because fixed-point arithmetic was implemented. The output data length becomes 6 bits after the butterfly operation because the data to be calculated in the multiplier are 3-bit data excluding the sign bit. The circuit consists of four multipliers, three adders, three subtractors, and two two's complement converters for signed-number operation. A bit-serial architecture was used to mitigate the low integration level of SFQ circuits compared to CMOS circuits. We designed and demonstrated a 4-bit butterfly processor using ADP2 at a frequency of 51.6 GHz [5].
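The arithmetic described above can be illustrated with a small behavioral model. The following Python sketch is not the circuit itself: it assumes the standard DIT butterfly form $X_{a}=x_{a}+W x_{b}$, $X_{b}=x_{a}-W x_{b}$, treats the 4-bit words as two's complement values with 2 fraction bits, and assumes that the 6-bit results wrap around on overflow. All function names are hypothetical. Under these assumptions the model uses four multiplications, three additions, and three subtractions, and it reproduces the expected output values listed in Table I.

```python
def to_signed(value: int, width: int) -> int:
    """Wrap an integer to `width` bits and read it as two's complement."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def parse_fixed(word: str) -> int:
    """Parse a word such as '10.10' into a signed integer (scaled by 4 for 2 fraction bits)."""
    bits = word.replace(".", "")
    return to_signed(int(bits, 2), len(bits))


def format_fixed(value: int, width: int = 6, frac: int = 4) -> str:
    """Format a signed integer as a `width`-bit word with `frac` fraction bits."""
    bits = format(value & ((1 << width) - 1), f"0{width}b")
    return bits[: width - frac] + "." + bits[width - frac:]


def butterfly(xa, xb, w):
    """Assumed DIT butterfly: returns (xa + w*xb, xa - w*xb).

    Inputs are (re, im) tuples of integers scaled by 4 (2 fraction bits).
    Outputs are (re, im) tuples scaled by 16 (4 fraction bits), wrapped to 6 bits.
    """
    (ar, ai), (br, bi), (wr, wi) = xa, xb, w
    # Complex product w * xb: four multipliers, one subtractor, one adder.
    pr = wr * br - wi * bi
    pi = wr * bi + wi * br
    # Align xa to 4 fraction bits before adding the product.
    ar4, ai4 = ar << 2, ai << 2
    # Two adders for the upper output, two subtractors for the lower output.
    upper = (to_signed(ar4 + pr, 6), to_signed(ai4 + pi, 6))
    lower = (to_signed(ar4 - pr, 6), to_signed(ai4 - pi, 6))
    return upper, lower


# Fourth data pattern of Table I: x(3), x(7), and W.
x3 = (parse_fixed("01.10"), parse_fixed("11.00"))
x7 = (parse_fixed("01.00"), parse_fixed("00.11"))
w = (parse_fixed("01.00"), parse_fixed("01.00"))
upper, lower = butterfly(x3, x7, w)
print([format_fixed(v) for v in (*upper, *lower)])
# ['01.1100', '00.1100', '01.0100', '01.0100']  (Re X(3), Im X(3), Re X(7), Im X(7))
```

Note that the last value, $\operatorname{Im}[X(7)]$, only matches Table I because of the assumed 6-bit wrap-around: the unwrapped result lies outside the representable range.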

## IV. Component Circuits

## A. Data-shuffling Circuit

The data-shuffling circuit is a network switch used for re-ordering the data between FFT stages, as shown in Fig. 1. The first butterfly calculation of the first stage operates on $x(0)$ and $x(4)$, whereas the first butterfly calculation of the second stage operates on $x(0)$ and $x(2)$, so data $x(2)$ and $x(4)$ must be re-ordered between the stages. The circuit consists of two buffers and a pair of multiplexers [7]. In our previous study, we designed and demonstrated a data-shuffling circuit for 4-bit 8-point FFT by using ADP2. The maximum operation frequency was 59.5 GHz, and the bias margin at 50 GHz was $89.5 \%$ to $98.7 \%$ [6].
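The re-ordering can be made concrete with a short index-level sketch in Python. It models only which data are paired in each stage, not the two-buffer, two-multiplexer hardware. The function names are hypothetical, and the rule that the pair spacing halves from stage to stage ($N/2$, $N/4$, and so on) is an assumption that agrees with the first-stage and second-stage pairs given for Fig. 1.

```python
def butterfly_pairs(n_points: int, stage: int) -> list[tuple[int, int]]:
    """Index pairs processed in a given stage (0-based), assuming the
    spacing between paired data halves each stage: N/2, N/4, ..., 1."""
    span = n_points >> (stage + 1)
    # An index starts a pair when the bit selected by `span` is 0.
    return [(i, i + span) for i in range(n_points) if i & span == 0]


def shuffle(outputs, n_points: int, stage: int):
    """Re-order the butterfly output pairs of `stage` into the input pairs of `stage + 1`."""
    by_index = {}
    for (i, j), (xi, xj) in zip(butterfly_pairs(n_points, stage), outputs):
        by_index[i], by_index[j] = xi, xj
    return [(by_index[i], by_index[j]) for i, j in butterfly_pairs(n_points, stage + 1)]


first_stage_outputs = [("X0", "X4"), ("X1", "X5"), ("X2", "X6"), ("X3", "X7")]
print(shuffle(first_stage_outputs, n_points=8, stage=0))
# [('X0', 'X2'), ('X1', 'X3'), ('X4', 'X6'), ('X5', 'X7')]
```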

## B. Twiddle Factor ROM

The twiddle factor ROM generates a twiddle factor for every butterfly operation in every FFT stage. To realize a high-speed memory with a short access time, we adopted a ROM, which reads out stored data that are fixed in the hardware. In our previous study, we designed and demonstrated a twiddle factor ROM for 4-bit 8-point FFT by using ADP2. The maximum operation frequency was 51.5 GHz and the bias margin at 50 GHz was $102.4 \%$ to $112.9 \%$ [6]. As a first step in implementing the SFQ-based single-chip FFT processor, this circuit was not included in the present design.

## C. Rounding Circuit

The rounding circuit truncates the unnecessary lower bits of the output data of the butterfly processing circuit. The input data length of the butterfly processing circuit is 4 bits, comprising 1 sign bit, 1 integer bit, and 2 fraction bits. The output data become 6-bit data after the butterfly operation. The circuit, which consists of a counter and a non-destructive-read-out (NDRO) gate, rounds down the lower 2 bits and sends 4-bit data to the feedback loop on the single-chip FFT.

We designed and demonstrated a rounding circuit for 4-bit 8-point FFT. The number of Josephson junctions is 121, the bias current is 14.4 mA, and the circuit area is $0.48 \times 0.24 \mathrm{~mm}^{2}$, where the circuits for on-chip high-speed tests are excluded. The correct operation was confirmed up to the maximum frequency of 74.3 GHz and the bias margin at 50 GHz was $73.2 \%$ to $93.2 \%$.
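The truncation performed by the rounding circuit can be sketched at the bit level as follows. This Python model assumes that the bit-serial word arrives least significant bit first and that the counter simply keeps the gate closed for the first two bits. The actual control of the NDRO gate is not specified here, so this is an illustrative assumption and the function names are hypothetical.

```python
def round_down_serial(bits_lsb_first, drop: int = 2):
    """Pass a bit-serial word (LSB first) through a gate that stays
    closed while the counter is below `drop` (assumed control scheme)."""
    kept = []
    for count, bit in enumerate(bits_lsb_first):
        gate_open = count >= drop
        if gate_open:
            kept.append(bit)
    return kept


def round_down(word: str, drop: int = 2) -> str:
    """Round down a word such as '01.1101' to its upper bits, e.g. '01.11'."""
    point = word.index(".")
    bits = word.replace(".", "")
    lsb_first = [int(b) for b in reversed(bits)]
    kept = round_down_serial(lsb_first, drop)
    msb_first = "".join(str(b) for b in reversed(kept))
    return msb_first[:point] + "." + msb_first[point:]


print(round_down("01.1101"))  # 01.11
print(round_down("10.1011"))  # 10.10
```

For two's complement data, dropping the lower bits in this way always rounds toward the more negative value, which is why the second example moves from $-1.3125$ to $-1.5$.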

## D. Data Buffer

The data buffer is used for storing data between each FFT processing stage, because a feedback loop is formed from the output to the input of the butterfly processing circuit and the butterfly processing circuit is recursively used for the calculation in the SFQ-based single-chip FFT processor. Since the FFT processor includes a feedback loop, the clock skew between the input and output of the FFT processing stage has to be assessed properly. To separate the output clock after the previous butterfly operation and the input clock for the next butterfly operation, the data buffer consists of two shift registers. The first shift register is used for writing the data to the data buffer, whereas the second one is used for reading the data from the buffer. The output data from the butterfly processing circuit are input to the first shift register of the data buffer, then the output data move to the second shift register after all the calculations in one FFT processing stage are finished.

Fig. 4. Microphotograph of the SFQ-based single-chip FFT processor designed for 4-bit 8-point FFT using ADP2.

We designed and demonstrated a data buffer for 4-bit 8-point FFT. The number of Josephson junctions is 3500, the bias current is 397.6 mA, and the circuit area is $1.65 \times 1.17 \mathrm{~mm}^{2}$, where the circuits for on-chip high-speed tests are excluded. We confirmed the operation with the bias margin of $77.3 \%$ to $125.8 \%$ at low frequency $(\sim 100 \mathrm{kHz})$.
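The role of the two shift registers in the data buffer can be illustrated with a word-level behavioral sketch in Python. It is a simplification under stated assumptions: whole words are stored instead of bit-serial streams, clocking is not modeled, and the class name, method names, and `depth` parameter are hypothetical.

```python
from collections import deque


class DataBuffer:
    """Word-level model of a buffer built from a write register and a read register."""

    def __init__(self, depth: int):
        self.depth = depth
        self.write_reg = deque(maxlen=depth)  # filled by the butterfly outputs
        self.read_reg = deque()               # drained by the next stage

    def write(self, word):
        """Shift one output word of the current stage into the write register."""
        self.write_reg.append(word)

    def end_of_stage(self):
        """Move the data to the read register once one stage is complete."""
        if len(self.write_reg) != self.depth:
            raise ValueError("stage not finished: write register is not full")
        self.read_reg = deque(self.write_reg)
        self.write_reg.clear()

    def read(self):
        """Shift one word out of the read register for the next stage."""
        return self.read_reg.popleft()


buf = DataBuffer(depth=8)
for k in range(8):
    buf.write(f"X{k}")
buf.end_of_stage()
print([buf.read() for _ in range(8)])
# ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7']
```

Keeping the write side and the read side in separate registers means that, in this model, the next stage can read its inputs without depending on when the previous stage produced them.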

## V. Design of an SFQ-Based Single-Chip FFT Processor

Fig. 4 shows the component names overlaying the microphotograph of an SFQ-based 4-bit 8-point single-chip FFT processor. The circuit consists of a butterfly processing circuit, data-shuffling circuit, rounding circuit, and data buffer. The number of Josephson junctions is 14551, the bias current is 1.73 A, and the circuit area is $4.74 \times 4.57 \mathrm{~mm}^{2}$, where the circuits for on-chip high-speed tests are excluded. We evaluated the bias margins of the designed circuit with the digital simulator Verilog-XL. In a simulation, the maximum operating frequency was estimated to be 80 GHz except for the data buffer. The normalized bias margin at 50 GHz was $80 \%$ to $125 \%$, where $100 \%$ corresponds to the bias voltage of 2.5 mV. The margins were restricted by the passive-transmission-line (PTL) drivers and receivers.

## VI. Measurement Results

We carried out an on-chip high-speed test of the SFQ-based single-chip FFT processor. In this test, the initial data input to a data buffer are used for the input data of the butterfly processing circuit. Table I shows the input data and expected output data calculated by using the butterfly operation and the expressions in Fig. 3. As the first-stage calculation in Fig. 1, 4-bit 8-point data are input and four butterfly operations are executed for both the real and imaginary parts. In the bit-serial operation, the 4-bit data patterns (first) to (fourth) in Table I are applied sequentially. Examples of experimental output data patterns for the real part of the butterfly processing circuit and the rounding circuit at low clock frequency $(\sim 100 \mathrm{kHz})$ are shown in Fig. 5. The pattern numbers $p=0$ to 3 represent the first to fourth butterfly calculations shown in Table I.

TABLE I. Input and output data of the four data patterns shown in the waveforms.

|  | Input | Output |
| :---: | :---: | :---: |
| Data Pattern (first) | $\begin{aligned} \operatorname{Re}[x(0)] & =10.10 \\ \operatorname{Im}[x(0)] & =10.10 \\ \operatorname{Re}[x(4)] & =01.01 \\ \operatorname{Im}[x(4)] & =01.01 \\ \operatorname{Re}[W] & =01.11 \\ \operatorname{Im}[W] & =01.11 \end{aligned}$ | $\begin{aligned} \operatorname{Re}[X(0)] & =10.1000 \\ \operatorname{Re}[X(4)] & =10.1000 \\ \operatorname{Im}[X(0)] & =10.1110 \\ \operatorname{Im}[X(4)] & =10.0010 \end{aligned}$ |
| Data Pattern (second) | $\begin{aligned} \operatorname{Re}[x(1)] & =01.01 \\ \operatorname{Im}[x(1)] & =01.10 \\ \operatorname{Re}[x(5)] & =01.01 \\ \operatorname{Im}[x(5)] & =00.11 \\ \operatorname{Re}[W] & =00.11 \\ \operatorname{Im}[W] & =00.10 \end{aligned}$ | $\begin{aligned} \operatorname{Re}[X(1)] & =01.1101 \\ \operatorname{Re}[X(5)] & =00.1011 \\ \operatorname{Im}[X(1)] & =10.1011 \\ \operatorname{Im}[X(5)] & =00.0101 \end{aligned}$ |
| Data Pattern (third) | $\begin{aligned} \operatorname{Re}[x(2)] & =10.11 \\ \operatorname{Im}[x(2)] & =10.11 \\ \operatorname{Re}[x(6)] & =01.10 \\ \operatorname{Im}[x(6)] & =00.11 \\ \operatorname{Re}[W] & =01.10 \\ \operatorname{Im}[W] & =01.01 \end{aligned}$ | $\begin{aligned} \operatorname{Re}[X(2)] & =00.0001 \\ \operatorname{Re}[X(6)] & =01.0111 \\ \operatorname{Im}[X(2)] & =01.1100 \\ \operatorname{Im}[X(6)] & =11.1100 \end{aligned}$ |
| Data Pattern (fourth) | $\begin{aligned} \operatorname{Re}[x(3)] & =01.10 \\ \operatorname{Im}[x(3)] & =11.00 \\ \operatorname{Re}[x(7)] & =01.00 \\ \operatorname{Im}[x(7)] & =00.11 \\ \operatorname{Re}[W] & =01.00 \\ \operatorname{Im}[W] & =01.00 \end{aligned}$ | $\begin{aligned} \operatorname{Re}[X(3)] & =01.1100 \\ \operatorname{Re}[X(7)] & =01.0100 \\ \operatorname{Im}[X(3)] & =00.1100 \\ \operatorname{Im}[X(7)] & =01.0100 \end{aligned}$ |

Fig. 5. Examples of output data patterns for real part at low speed.

Fig. 6. Normalized bias margins of the component circuits at low speed.

Fig. 7. Examples of output data patterns for imaginary part at high speed. The output data patterns from the butterfly processing circuit, $(X(0), X(4))$, $(X(1), X(5))$, $(X(2), X(6))$, $(X(3), X(7))$, are reordered to $(X(0), X(2))$, $(X(1), X(3))$, $(X(4), X(6))$, $(X(5), X(7))$ at the output of the data-shuffling circuit (see data-flow diagram in Fig. 1).

The input and output data of the butterfly processing circuit are the second and third data patterns shown in Table I. The output data of the rounding circuit are rounded down to their upper 4 bits, as shown in the lower trace of Fig. 5. The gray numbers represent the unnecessary lower 2 bits. Note that each transition in the waveforms corresponds to the output of an SFQ pulse and represents "1", and that the least significant bit appears first in the waveforms. The data patterns show that correct operation was confirmed. Normalized bias margins of the component circuits at low local clock speed $(\sim 100 \mathrm{kHz})$ are shown in Fig. 6. The rounding circuit has the most critical margin, which is $87.8 \%$ to $96.4 \%$.

Examples of the experimental output data patterns for the imaginary part of the butterfly processing circuit and the data-shuffling circuit at high speed are shown in Fig. 7. The data patterns show that the output data of $\operatorname{Im}[x(2)], \operatorname{Im}[x(3)]$ and $\operatorname{Im}[x(4)], \operatorname{Im}[x(5)]$ are re-ordered and rounded down correctly. In the high-speed measurement, the operation was confirmed up to a maximum frequency of 47.8 GHz for specific data patterns, and the operating margin was a point margin. The measurement results show the full function for the first stage of the SFQ-based single-chip FFT processor.


Fig. 8. Scaling of power consumption for SFQ-based FFT processor.


Fig. 9. Scaling of the number of Josephson junctions (JJs) for SFQ-based FFT processor.

## VII. DISCUSSION

We estimated the specifications of SFQ-based FFT processors with bit lengths of 4, 8, 16, and 32 bits and with 8, 16, 32, and 64 points by extrapolating from the designed 4-bit 8-point FFT circuit. Figs. 8 and 9 show the power consumption and number of Josephson junctions, respectively. One can see that both the power consumption and number of Josephson junctions increase as the bit length and number of FFT points increase. It should be noted that the power consumption and number of Josephson junctions of the butterfly processing circuits do not increase as the number of FFT points increases because of the bit-serial architecture.

The CMOS-based FFT processor for the 16-bit 64-point FFT in reference [11] has a calculation time of 640 ns and a power consumption of 21.43 mW. On the other hand, the SFQ-based FFT processor for the 16-bit 64-point FFT is expected to have a calculation time of 37.7 ns with a 50 GHz local clock frequency and a power consumption of 48.05 mW. Recently, several low-power techniques for SFQ circuits have been developed, such as LR biasing [12]. With the introduction of LR-biased SFQ circuits, the power consumption could be reduced by a factor of 20 to about 2.40 mW. In this case, the speed-power product of the SFQ-based FFT has an advantage of approximately two orders of magnitude over that of the CMOS-based FFT.
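As a rough check, the speed-power products can be worked out from the figures quoted in this paragraph alone, taking the product of calculation time and power consumption ($1 \mathrm{~ns} \times 1 \mathrm{~mW}=1 \mathrm{~pJ}$):

$$
\begin{aligned}
\text{CMOS [11]:} \quad & 640 \mathrm{~ns} \times 21.43 \mathrm{~mW} \approx 13715 \mathrm{~pJ} \\
\text{SFQ (conventional biasing):} \quad & 37.7 \mathrm{~ns} \times 48.05 \mathrm{~mW} \approx 1811 \mathrm{~pJ} \\
\text{SFQ (LR biasing):} \quad & 37.7 \mathrm{~ns} \times 2.40 \mathrm{~mW} \approx 90.5 \mathrm{~pJ} \\
\text{Ratio, CMOS to LR-biased SFQ:} \quad & 13715 / 90.5 \approx 1.5 \times 10^{2}
\end{aligned}
$$

With conventional biasing the same ratio is about 7.6, so most of the two-orders-of-magnitude advantage relies on the assumed factor-of-20 power reduction.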

## VIII. CONCLUSION

We designed component circuits required for FFT processors using ADP2. We confirmed the operation of all the component circuits at our target frequency of 50 GHz. Then, we designed and evaluated an SFQ-based single-chip FFT processor by integrating all the component circuits. We confirmed full function for the first stage of an SFQ-based single-chip FFT processor. Our next step is to improve the operating margin and to demonstrate the recursive operation of the SFQ-based single-chip FFT processor.

## ACKNOWLEDGMENT

The CONNECT ADP cell library and tools were used in this study. The circuits were fabricated in the clean room for analog-digital superconductivity (CRAVITY) at AIST with advanced process 2 (ADP2). This research was partly supported by ALCA-JST.

## REFERENCES

[1] K. K. Likharev and V. K. Semenov, "RSFQ logic/memory family: A new Josephson-junction digital technology for sub-terahertz-clock frequency digital systems," IEEE Trans. Appl. Supercond., vol. 1, pp. 3-28, Mar. 1991.
[2] O. A. Mukhanov and A. F. Kirichenko, "Implementation of a FFT Radix 2 Butterfly Using Serial RSFQ Multiplier-adders," IEEE Trans. Appl. Supercond., vol. 5, no. 2, pp. 2461-2464, 1995.
[3] Y. Sakashita, Y. Yamanashi, and N. Yoshikawa, "50 GHz Demonstration of an Integer-Type Butterfly Processing Circuit for an FFT Processor Using the $10 \mathrm{kA} / \mathrm{cm}^{2}$ Nb Process," IEICE Trans. on Electron., vol. 98, no. 3, pp. 232-237, 2015.
[4] S. Nagasawa, K. Hinode, T. Satoh, H. Akaike, Y. Kitagawa, M. Hidaka, "Development of advanced Nb process for SFQ circuits," Physica C, vol. 412-414, pp. 1429-1436, 2004.
[5] Y. Sakashita, Y. Yamanashi, and N. Yoshikawa, "High-speed operation of an SFQ butterfly processing circuit for FFT processors using the $10 \mathrm{kA} / \mathrm{cm}^{2}$ Nb process," IEEE Trans. Appl. Supercond., vol. 25, 1301205, June 2015.
[6] Y. Sakashita, T. Ono, Y. Yamanashi, and N. Yoshikawa, "Design and High-Speed Component Tests of an SFQ FFT Processor Using the $10 \mathrm{kA} / \mathrm{cm}^{2}$ Nb Advanced Process," 2015 15th International Superconductive Electronics Conference, ISEC 2015. Institute of Electrical and Electronics Engineers Inc., 2016. 7383442.
[7] T. Ahmed, "A low-power time-interleaved 128-point FFT for IEEE 802.15.3c standard," in 2013 International Conference on Informatics, Electronics and Vision (ICIEV), pp. 1-5, May 2013.
[8] J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comput., vol. 19, no. 90, pp. 297-301, May 1965.
[9] H. Hara, K. Obata, Y. Yamanashi, K. Taketomi, N. Yoshikawa, M. Tanaka, A. Fujimaki, N. Takagi, K. Takagi, and S. Nagasawa, "Design, Implementation and On-Chip High-Speed Test of SFQ Half-Precision Floating-Point Multiplier," IEEE Trans. Appl. Supercond., vol. 19, no. 3, pp. 657-660, Jun. 2009.
[10] T. Kato, Y. Yamanashi, N. Yoshikawa, A. Fujimaki, N. Takagi, K. Takagi, and S. Nagasawa, "60-GHz Demonstration of an SFQ Half-precision Bit-serial Floating-point Adder Using $10 \mathrm{kA} / \mathrm{cm}^{2}$ Nb Process," 2013 IEEE 14th Int. Supercond. Electron. Conf., pp. 1-3, Jul. 2013.
[11] T. H. Tran, S. Kanagawa, D. P. Nguyen, and Y. Nakashima, "ASIC Design of MUL-RED Radix-2 Pipeline FFT Circuit," Low-Power and High-Speed Chips (COOL CHIPS XIX), 2016 IEEE Symposium in, pp. 9-11, 2016.
[12] Y. Yamanashi, T. Nishigai, and N. Yoshikawa, "Study of LR-loading technique for low-power single flux quantum circuits," IEEE Trans. Appl. Supercond., vol. 17, no. 2, pp. 150-153, Jun. 2007.


[^0]:    The present study was supported by a Grant-in-Aid for Scientific Research (S) (No. 26220904) from the Japan Society for the Promotion of Science (JSPS).
    T. Ono, H. Suzuki, Y. Yamanashi, and N. Yoshikawa are with the Department of Electrical and Computer Engineering, Yokohama National University, Yokohama 240-8501, Japan (e-mail: nyoshi@ynu.ac.jp).

