# Low-latency machine learning FPGA accelerator for multi-qubit state discrimination

Pradeep Kumar Gautam, Shantharam Kalipatnapu, Shankaranarayanan H, Ujjawal Singhal, Benjamin Lienhard, Vibhor Singh and Chetan Singh Thakur

Pradeep Kumar Gautam is with the Department of Electronic Systems Engineering, Indian Institute of Science, Bangalore 560012, India and the Defence Research and Development Organisation, Bangalore 560093, India. Shantharam Kalipatnapu, Shankaranarayanan H and Chetan Singh Thakur (Corresponding Author) are with the Department of Electronic Systems Engineering, Indian Institute of Science, Bangalore 560012, India (e-mail: csthakur@iisc.ac.in). Benjamin Lienhard is with the Department of Electrical Engineering, Princeton University. Ujjawal Singhal and Vibhor Singh (Corresponding Author) are with the Department of Physics, Indian Institute of Science, Bangalore 560012, India (e-mail: vsingh@iisc.ac.in). (Corresponding authors: Vibhor Singh; Chetan Singh Thakur)

Abstract: Measuring a qubit is a fundamental yet error-prone operation in quantum computing. These errors can stem from various sources such as crosstalk, spontaneous state transitions, and excitation caused by the readout pulse. In this work, we utilize an integrated approach to deploy neural networks (NN) onto field programmable gate arrays (FPGA). We demonstrate that it is practical to design and implement a fully connected neural network accelerator for frequency-multiplexed readout, balancing computational complexity with low-latency requirements without significant loss in accuracy. The neural network is implemented by quantization of weights, activation functions, and inputs. The hardware accelerator performs frequency-multiplexed readout of 5 superconducting qubits in less than 50 ns on an RFSoC ZCU111 FPGA, which is the first of its kind in the literature. These modules can be integrated into existing quantum control and readout platforms based on the RFSoC ZCU111 and are ready for experimental deployment.

## I. INTRODUCTION

Quantum computers leverage concepts such as superposition and entanglement to manipulate quantum bits (qubits), which enables them to potentially outperform classical computers for certain tasks [1]–[5]. During the execution of algorithms, errors are inevitable due to the difficulties in controlling and reading out the quantum states of qubits as quantum computers grow in size [6]. Quantum error correction (QEC) schemes aim to combat these errors by encoding the quantum information redundantly and performing repeated readouts to detect and correct errors during computation [7].
The successful execution of QEC algorithms is expected to rely on the constant monitoring of a group of qubits, followed by swift corrective action [8]. An efficient and scalable solution for multi-qubit readout is frequency division multiplexing (FDM), which supports QEC protocols while minimizing hardware complexity [9].

To minimize readout errors, a two-fold strategy is typically used. First, at the superconducting hardware level, techniques based on Purcell filters, large-bandwidth quantum-noise-limited amplifiers, and dynamic tuning of the dispersive shift can be employed [10]–[13]. Second, for the post-processing of multi-frequency microwave signals to discriminate and identify the qubit states, methods such as matched filters (MF), support vector machines (SVM), and neural networks can be utilized [14]–[17]. While implementing such methods, it is highly desirable to minimize the latency of the post-processing for state assignment, so that *on-the-fly* corrective action can be taken for QEC schemes.

For the post-processing of the readout signals, neural network (NN) based state discriminators outperform their traditional signal processing counterparts, as reported on different qubit platforms [17]–[20]. Furthermore, NN-based discriminators scale efficiently as they do not need any qubit-specific processing, such as demodulation or a matched filter per qubit. Scalable FPGA systems on Radio Frequency System on Chip (RFSoC) for the control and readout of individual qubits have been a recent trend in quantum computing [21]–[24]. Thus, NN-based state discriminators could be a better replacement for the traditional signal processing solutions on FPGA for FDM readout, resulting in enhanced throughput and lower latency for real-time readout applications. However, the amount of resources required by a neural network on hardware platforms like FPGAs is a challenge [25]. Addressing these issues not only benefits the post-processing of readout signals, but can also pave the way for implementing reinforcement learning (RL) agents in more complex systems [26].

Quantization is an effective approach to reduce the storage and computation requirements of a model by representing model parameters in low-precision fixed-point formats [27], [28]. Low-precision quantized models, even down to binary precision, have been demonstrated with quantization aware training (QAT) [29], [30]. In this work, we address these challenges and show that by employing QAT and automated flows for mapping an NN into an FPGA implementation, an ultra-low-latency NN-based state discriminator can be designed. We demonstrate an integrated approach for NN-based state discriminator design, starting from training up to FPGA implementation for qubit readout, using Brevitas [31] and FINN-R [29]. This results in ultra-low-latency state discriminators with latencies from 24.03 ns to 47.12 ns.
To the best of our knowledge, there have been no NN-based implementations on FPGA for FDM readout that do not employ qubit-specific pre-processing. For the first time, we integrate FPGA-based ML modules with existing quantum computing hardware platforms and control systems.

Fig. 1: Block diagram of a typical experimental setup of an FPGA system for the control and readout of superconducting qubit devices. The FPGA is a part of the ZCU111 RFSoC evaluation kit. The dilution refrigerator houses a five-qubit chip. The control signals to the qubits are generated by the RF-DAC. A single line, which carries all five readout signals multiplexed and amplified by a traveling-wave parametric amplifier (TWPA), is fed to the RF-ADC. This multiplexed readout signal received by the RF-ADC is digitized into in-phase (I) and quadrature (Q) components. These components are then fed to the machine learning accelerator for state discrimination. A feedback signal generated from the assigned state of the qubits may drive the logic for the subsequent generation of the control pulses.

The rest of the article is organised as follows. Section II sets up the preliminaries for the work. Section III describes the design methodology and implementation of neural network based state discriminators. Section IV discusses the results and the comparison with the state-of-the-art. Section V concludes the work.

## II. PRELIMINARIES

In this section, we describe the conventional setup for control and measurement of superconducting qubits and the motivation for designing a hardware neural network discriminator on FPGA.

### A. Measurement setup of superconducting qubit devices

A typical schematic of the experimental setup with the superconducting qubit device and control electronics is shown in Figure 1. It consists of a dilution refrigerator, which is maintained at a temperature of 10 mK, and an RFSoC board used to generate the microwave pulses and post-process the readout signals.
The readout signals of the five qubits are frequency multiplexed and fed to the high-speed RF Analog to Digital Converter (ADC) of the RFSoC for processing. We use the SQ-CARS architecture developed on the Xilinx RFSoC ZCU111 [24]. The RFSoC device XCZU28DR comes with eight high-precision and low-power DACs and ADCs with maximum sampling rates of 6.554 and 4.096 GSPS, respectively. These data converters are configurable and integrated with the programmable logic (PL) resources of the RFSoC through AXI interfaces.

The superconducting qubits studied here have frequencies from 4.3 to 5.2 GHz and energy relaxation times from 7 to 40 $\mu$s. A detailed characterization of the device has been discussed in Ref. [17]. For the readout of the qubits, a frequency-multiplexed pulse (duration 2 $\mu$s) comprising Intermediate Frequency (IF) signals at 64.729 MHz, 25.366 MHz, 24.79 MHz, 70.269 MHz, and 127.282 MHz is upconverted to the readout resonator frequencies using a local oscillator (LO) frequency of 7.127 GHz. The same LO is used for the down-conversion of the readout pulse for further digitization (500 MHz sampling rate) and post-processing. Figure 2 shows the results of integrated single shots in the IQ-plane for all five qubits.

A good measure of discriminator performance in the presence of cross-talk is the cross-fidelity, which describes the correlation between the assignment fidelities of individual qubits [32]. The cross-fidelity between two qubits is defined as

$$F_{ij}^{CF} = 1 - [P(1_i|\theta_j) + P(0_i|\pi_j)], \tag{1}$$

where $\theta_j$ ($\pi_j$) denotes the preparation of qubit $j$ in the ground (excited) state and $1_i$ ($0_i$) denotes the assignment of qubit $i$ to the excited (ground) state. A positive (negative) off-diagonal element represents a correlation (anti-correlation) between qubits, which might be the result of readout cross-talk. For the single-shot results shown in Figure 2(a)-(e), the cross-fidelity obtained with a matched-filter-based state discriminator is shown in Figure 2(f).
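Equation (1) can be evaluated directly from labelled single-shot data. The following is a minimal sketch, assuming each shot is recorded as a tuple of prepared states and a tuple of assigned states (0 for ground, 1 for excited); the function name `cross_fidelity` and the data layout are hypothetical and not taken from the original work.

```python
def cross_fidelity(prepared, assigned):
    """Cross-fidelity matrix F[i][j] following Eq. (1).

    prepared, assigned: lists of shots; each shot is a tuple with one
    0/1 entry per qubit (0 = ground, 1 = excited). Hypothetical layout.
    """
    n = len(prepared[0])
    F = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assignments of qubit i, split by how qubit j was prepared.
            ground_j = [a[i] for p, a in zip(prepared, assigned) if p[j] == 0]
            excited_j = [a[i] for p, a in zip(prepared, assigned) if p[j] == 1]
            p1_given_ground = sum(ground_j) / len(ground_j)         # P(1_i | theta_j)
            p0_given_excited = 1 - sum(excited_j) / len(excited_j)  # P(0_i | pi_j)
            F[i][j] = 1 - (p1_given_ground + p0_given_excited)
    return F


# Illustrative data: two qubits, every basis state prepared once,
# and an ideal discriminator that always assigns the prepared state.
shots = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(cross_fidelity(shots, shots))  # [[1.0, 0.0], [0.0, 1.0]]
```

For an ideal, cross-talk-free discriminator the matrix is the identity, so any non-zero off-diagonal entry indicates that the assignment of one qubit depends on the state of another.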
The off-diagonal colored cells show the poor performance in the presence of cross-talk.

Fig. 2: Single-shot readout results and cross-fidelity matrix. (a-e) Single-shot readout results for the 5 qubits. Red markers represent the ground state and blue markers represent the excited state of the corresponding qubit. (f) Cross-fidelity matrix computed using the matched filter approach.

## III. LOW-LATENCY NEURAL NETWORK STATE DISCRIMINATOR

Quantization is an effective approach to reduce the storage and computation requirements of a model by representing model parameters in low-precision fixed-point formats with little impact on accuracy [27], [28]. When Quantization Aware Training (QAT) is applied, low-precision quantized models, including those with binary precision, do not experience a significant loss in accuracy [29], [30]. We use Brevitas' QAT [31] to exploit arbitrary bit-widths and mixed representations in the PyTorch framework [33].

### A. NN Model optimization and Quantization

The possibility of mixed-precision quantization of inputs, weights, and activations down to 1 bit opens up a large NN design space, which calls for a careful selection of the architecture. To keep the design exploration limited, we adopt an approach similar to [34] and start with the base model reported in [17], which has an input size of 1000, consisting of 500 samples each of the in-phase (I) and quadrature (Q) components, corresponding to 1 $\mu$s of integration time. It has three hidden layers consisting of 1000, 500, and 250 nodes, respectively, and an output layer of 32 nodes. First, we change the input feature size and the number of nodes in each hidden layer to their closest powers of 2. Next, we perform a step-wise reduction of the input size, the number of layers, and the number of nodes in the hidden layers. The training is performed for 50 epochs with a linearly decaying learning rate starting at $10^{-3}$. The figure of merit is $F_{GM}$, the geometric mean (GM) fidelity of the five qubits.
The fidelity of the $i^{th}$ qubit, $F_i$, and $F_{GM}$ are defined as follows:

$$F_i = 1 - [P(0_i|\pi_i) + P(1_i|\theta_i)]/2 \tag{2}$$

$$F_{GM} = \sqrt[5]{F_1 F_2 F_3 F_4 F_5},\tag{3}$$

where $P(0_i|\pi_i)$ $(P(1_i|\theta_i))$ is the conditional probability of assigning the ground (excited) state with label 0 (1) to qubit $i$ when it was prepared in the excited (ground) state.

We keep the number of output nodes equal to the number of qubits rather than the total number of possible state combinations. This greatly reduces the output layer size, as we use only 5 ($N$) output nodes compared to 32 ($2^N$) [17], [25]. The different model architectures are represented as $N_I \times N_{H_1} \times \cdots \times N_{H_k} \times N_O$, where $N_I$, $N_{H_i}$, and $N_O$ denote the number of nodes in the input layer, the $i^{th}$ hidden layer ($i \in \mathbb{Z}^+$), and the output layer, respectively. Each hidden layer consists of a linear layer followed by batch normalization, a dropout layer, and a Rectified Linear Unit (ReLU). The inputs to the model are obtained by a box-car operation on the in-phase (I) and quadrature (Q) samples from the RF-ADC, thereby reducing the input feature size from 1024 to 512.

We also explored the influence of model size on fidelity. Different architectures were trained using full floating-point precision. Figure 3(a) shows the combined effects of model size variation and input feature size on the GM fidelity. It shows that the reduction of the input feature size has very little impact on the fidelity (maximum 0.3%). Similarly, a reduction in the number of hidden layers ($\geq 2$) and of nodes in the hidden layers has minimal impact on $F_{GM}$. From these results, it is evident that a careful reduction of the model size is possible without significant loss of accuracy. We choose the model with a configuration of $512 \times 64 \times 5$, beyond which further reduction in model size starts to degrade $F_{GM}$ rapidly.
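As a small worked example of Eqs. (2) and (3), the sketch below computes per-qubit fidelities from misassignment probabilities and combines them into the geometric mean. The helper names and the numerical error rates are hypothetical and chosen only for illustration; they are not measured values from this work.

```python
import math


def qubit_fidelity(p0_given_excited, p1_given_ground):
    """Eq. (2): assignment fidelity of one qubit from its two error rates."""
    return 1 - (p0_given_excited + p1_given_ground) / 2


def geometric_mean_fidelity(fidelities):
    """Eq. (3), written for any number of qubits (N = 5 in the text)."""
    return math.prod(fidelities) ** (1 / len(fidelities))


# Hypothetical (P(0|pi), P(1|theta)) pairs for five qubits.
error_rates = [(0.12, 0.04), (0.10, 0.06), (0.15, 0.05), (0.09, 0.03), (0.14, 0.08)]
per_qubit = [qubit_fidelity(p0, p1) for p0, p1 in error_rates]
f_gm = geometric_mean_fidelity(per_qubit)
print(per_qubit, f_gm)
```

Because the geometric mean is a product, a single poorly discriminated qubit pulls $F_{GM}$ down more strongly than it would an arithmetic mean, which makes it a conservative figure of merit for multi-qubit readout.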
We perform quantization of the models to derive the optimal representation for inputs, weights, and activations. Training with the QAT approach uses the same hyperparameters. The input, weight, and activation bit-widths are varied from 8 bits to 2 bits, and their effect on $F_{GM}$ is plotted in Figure 3(b) for the model $512 \times 64 \times 5$. It can be observed that $F_{GM}$ degrades for input quantization below 4 bits. The GM fidelity is found to be $\sim 0.7$ for an input quantization of 2 bits. We therefore conclude that the input requires at least a 4-bit representation for the model to faithfully discriminate the states of five qubits. We can also infer that, for equal bit-widths of activations and weights, there is no significant difference in $F_{GM}$ with the change of input width.

Fig. 3: Effect of various neural network parameters on the readout fidelity. (a) Fidelity for various neural network architectures for input feature sizes of 1024 and 512. The horizontal axis represents the dimension of the hidden layers. (b) Impact on GM fidelity while varying the input bit quantization size. (c) Effect of mixed quantization of weights and activations on fidelity. (d) Impact of mixed quantization of weights and activations with binarized input. The blue curve represents the variation of fidelity with activation bit-width for single-bit quantized weights. The red curve shows the variation of fidelity with weight bit-width for single-bit quantized activations. (e) Effect of model parameters and depth on the readout fidelity. (f) Cross-fidelity matrix for the quantized neural network. For panels (b), (c), (d) and (f), the network architecture is chosen to be $512 \times 64 \times 5$.

Figure 3(c) shows the variation of $F_{GM}$ for mixed representations of weights and activations ranging from 8 bits to 2 bits.
The fidelity does not change significantly under this mixed representation of weights and activations, following the trend of Figure 3(b). We have also trained a binary quantized model, which exhibits a greater drop in accuracy (4–21%). The performance of this model is shown in Figure 3(d). Figure 3(e) shows the effect of the number of parameters of the network on $F_{GM}$ for different hidden layer sizes. It can be observed that, when the number of parameters is greater than $10^4$, there is no significant drop in fidelity for any of the reported numbers of hidden layers. So, for lower latency at a target fidelity, one can choose a network with fewer hidden layers.

Based on the aforementioned observations, we select a quantization combination of 4, 2, and 2 bits for input, weight, and activation, respectively. The resulting cross-fidelity matrix is shown in Figure 3(f). It can be observed that the QNN performs better than the matched filter discriminator shown in Figure 2(f).

### B. FPGA Acceleration

Low-latency readout is the primary requirement of quantum applications, for which the dataflow-architecture-based framework of FINN-R [29] becomes the preferred choice for our work. FINN-R takes the model as input in Open Neural Network Exchange (ONNX) format with FINN-specific metadata embedded, which can be exported using the Brevitas library in the PyTorch environment. The model is then transformed step-wise into a streaming dataflow graph, with each node represented as a Xilinx High Level Synthesis (HLS) callable function. Nodes which do not have equivalent HLS functions would have to run on the Processing System (PS) of the FPGA, resulting in increased latency. FINN-R implements quantization and matrix multiplication of low-precision data using Matrix-Vector-Threshold Units (MVTU). One such unit is used for each layer of the QNN.
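The role of an MVTU can be illustrated with a behavioural model: an integer matrix-vector product followed by a comparison of each accumulator against a set of thresholds, where the number of thresholds reached becomes the quantized activation. This is only a software sketch under that assumption; the function name `mvtu`, the tiny weight matrix and the threshold values are hypothetical and do not reproduce the HLS code generated by FINN-R.

```python
def mvtu(weights, thresholds, x):
    """Behavioural sketch of one matrix-vector-threshold layer.

    weights:    one row of integer weights per output node
    thresholds: one ascending list of integer thresholds per output node
                (3 thresholds give a 2-bit activation in the range 0..3)
    x:          integer input vector
    """
    out = []
    for row, th in zip(weights, thresholds):
        acc = sum(w * v for w, v in zip(row, x))   # integer multiply-accumulate
        out.append(sum(acc >= t for t in th))      # count thresholds reached
    return out


# Hypothetical layer: 3 inputs, 2 output nodes, 2-bit output activations.
W = [[1, -1, 1], [-1, -1, 1]]
T = [[0, 2, 4], [0, 2, 4]]
print(mvtu(W, T, [3, 1, 2]))  # [3, 0]
```

In this picture, the scaling, batch normalization and activation that follow a linear layer are assumed to be absorbed into the per-node thresholds, so the hardware only needs integer accumulation and comparisons.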
Each MVTU consists of several Processing Elements (PE), each of which can have multiple input lanes, similar to a Single Instruction Multiple Data (SIMD) architecture. More details on the FINN-R flow are given in the Appendix.

Trained and quantized models are adapted for varying levels of parallelization, constrained by resource availability, HLS conversion capabilities, and AXI-Stream connection width. Layers exceeding a single MVTU are time-multiplexed, which affects the computation latency. A significant hurdle in achieving full parallelism for moderate-size models is the limitation imposed by the connection width of the AXI-Stream interconnect [35]. To address this constraint, we have designed a novel architecture that maximally utilizes the inherent parallelism of the FPGA.

The modified hardware architecture is derived from the NN with the configuration $512 \times 64 \times 5$ from the earlier subsection. The first hidden layer of the model, consisting of 64 nodes, is split into 8 equal segments of 8 nodes each. The 512 nodes of the input layer are connected to all the segments, effectively giving 64 nodes in the first hidden layer. These segments can run in parallel on the FPGA without the need for time-multiplexing the computation, thereby reducing the total latency. Thus, the connection requirement for the first hidden layer comes down from 32768 in total to only 4096 per segment. The outputs of all these segments are concatenated by a *Concat* layer. The model architecture is shown in Figure 4(a).

The generated ONNX model is, however, not fully convertible to a streaming-dataflow-compliant representation by the default FINN-R flow, due to the presence of the *Concat* layer and the non-uniform multiplication nodes before it. To obtain a fully dataflow-convertible model, we have inserted uniform *QuantIdentity* layers before the *Concat* layer and introduced some model graph modification steps into the default transformation steps of the FINN-R flow.
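The segmentation described above does not change what the first layer computes: splitting the 64 output nodes into 8 groups of 8 and concatenating the partial results gives the same vector as the single $512 \times 64$ layer. The sketch below checks this with randomly generated integers standing in for 2-bit weights and 4-bit inputs; all values and helper names are hypothetical, and the batch normalization, activation and *QuantIdentity* steps are left out.

```python
import random


def matvec(W, x):
    """Full layer: one dot product per output node."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def segmented_matvec(W, x, n_segments):
    """Same layer evaluated as n_segments independent blocks of rows,
    followed by concatenation (the role of the Concat layer)."""
    rows_per_segment = len(W) // n_segments
    out = []
    for s in range(n_segments):
        segment = W[s * rows_per_segment:(s + 1) * rows_per_segment]
        out.extend(matvec(segment, x))  # each segment could run in parallel
    return out


random.seed(0)
W = [[random.choice([-2, -1, 0, 1]) for _ in range(512)] for _ in range(64)]  # 2-bit signed
x = [random.randrange(16) for _ in range(512)]                                # 4-bit unsigned

full = matvec(W, x)
split = segmented_matvec(W, x, 8)
connections_total = len(W) * len(x)                 # 64 * 512
connections_per_segment = (len(W) // 8) * len(x)    # 8 * 512
print(full == split, connections_total, connections_per_segment)
```

The total number of multiply-accumulate operations is unchanged; what changes is that each segment is small enough to be unrolled as its own unit, which is where the latency saving comes from.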
Figure 4(b) shows the FINN-R generated NN design after the modifications. This novel architecture fully parallelizes the FPGA implementation of the model and reduces the latency, as it does not need time-multiplexing. This approach can be very effective in achieving ultra-low latency on FPGAs when working with large NNs.

As an alternative approach to attain low latency, we have also implemented a deeper NN in which piece-wise processing of the input is performed [26]. This way, only the last layer contributes to the latency of state discrimination. This gives the flexibility of having a deeper NN with small hidden layers. The architecture for the model of size $256 \times 128 \times 128 \times 128 \times 128 \times 5$ is shown in Figure 4(c).

## IV. RESULTS AND DISCUSSION

### A. Performance and Resource utilization

To demonstrate the effectiveness of our approach, we implemented the NN model of Ref. [17]. This network architecture consists of an input dimension of 1000 samples followed by 3 hidden layers of 1000, 500, and 250 nodes and 32 output nodes, one for each possible state. The total number of learnable parameters is 1,634,782, with each parameter requiring 4 bytes of storage in floating-point representation. This translates into about 50 Mb of storage just to save the weights and biases. Following this standard floating-point approach, the implementation exceeds the resource limits of most of the available FPGA devices, as reported in Ref. [25]. However, we adopt the methodology described in the previous section to implement the model in Brevitas and subsequently convert it to a dataflow graph using FINN-R. The input, weight and activation quantization used is 4, 2, and 2 bits, respectively. The model achieves a GM fidelity of 90.40%, which is only 0.9% below the one reported in Ref. [17]. This reduction is within the acceptable range reported by many other quantized implementations of standard models [29].
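The parameter count and storage figure quoted above can be reproduced with a short calculation, assuming a fully connected network in which every layer has one weight per connection and one bias per output node. The 2-bit figure at the end is a rough estimate that treats every parameter as stored in 2 bits, which is an assumption made here for illustration rather than a number from the original work.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


base_model = [1000, 1000, 500, 250, 32]
n_params = count_parameters(base_model)

float_megabits = n_params * 32 / 1e6    # 4 bytes per parameter
two_bit_megabits = n_params * 2 / 1e6   # assumption: all parameters at 2 bits

print(n_params)                    # 1634782
print(round(float_megabits, 1))    # 52.3
print(round(two_bit_megabits, 1))  # 3.3
```

The floating-point result of roughly 52 Mb is consistent with the approximately 50 Mb stated in the text, and the same helper gives 657,413 parameters for the $1024 \times 512 \times 256 \times 5$ configuration.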
Latency and resource utilization for various NN model architectures and quantizations are compared in Table I. The resource utilization of the base model (Arch-1) on the RFSoC XCZU28DR is only 39%, still leaving enough room for the implementation of other logic. It can be observed that reducing the input size and the hidden layers has a major impact on resources and improves latency drastically, with a minimal drop in fidelity. Arch-4 to Arch-9 have almost the same number of parameters but differ in architecture, and show a similar pattern in resource utilization, except for the binarized model Arch-6, which uses less than 5% of the LUTs. Arch-7, Arch-8 and Arch-9 have higher utilization as these architectures have been optimized for higher performance. The novel approach of splitting the first hidden layer of Arch-5 ($512 \times 64 \times 5$) into 8 parallel segments results in Arch-7, reducing the latency from 33 cycles to 19 cycles, an improvement of 42%. There is, however, a significant jump in the resources consumed, which might be the result of the extra *QuantIdentity* layers introduced before the *Concat* layer and the increased level of parallelism. This architecture strikes the right balance between complexity and performance. Binarized models are fast, but they show a significant drop in fidelity. It is difficult to draw patterns in the maximum operable frequency of the design based on model architecture, as this largely depends on the Xilinx Vivado tool, with only a few handles to control the outcome.

For applications like single-qubit characterisation, we have also implemented an ultra-low-latency SVM-based state discriminator on FPGA. It offers a better decision boundary when compared to MF, with an ultralight footprint on the hardware. Further details on the SVM implementation can be found in the Appendix.

Fig. 4: (a) Software model of the QNN $((512 \times 8) \times 8 \times 5)$ for achieving maximum parallelism. The first hidden layer consists of 8 equal segments (Seg 1 to Seg 8), each having 8 nodes in a *Linear* layer followed by *Batch Normalization* (shown as BN) and a *Rectified Linear Unit* (shown as ReLU). The output of each segment (1x8) is concatenated using a *Concat* layer, giving a resultant size of 1x64. (b) Fully parallel hardware architecture of the QNN. Each segment of the model shown in (a) is implemented as an MVTU, all of which run in parallel on hardware. A PE is shown in the inset, consisting of various computation blocks. (c) Piecewise layered QNN architecture $256 \times 128 \times 128 \times 128 \times 128 \times 5$. Each layer processes a piece of the input signal along with the output from the previous layer. The input size for the first layer of the model is 1x256, and the other layers are fed an input of size 1x128, of which 1x64 is from the previous layer and 1x64 from the input segment. The last section of the input signal trace is fed only to the last layer. Hence, only the last layer contributes to the latency of the network.

### B. Comparison

As an alternative to traditional signal processing, recent works have employed NNs for multi-qubit state discrimination due to their robustness to signal perturbations and their learning capability. Reuer *et al.* [26] describe a feed-forward neural network for realizing an RL agent and state discriminator. The network consists of 7 hidden layers, with each layer having 20 neurons. The feed-forward network is designed for a single qubit and is capable of generating decisions based on the readout measurement. Maurya *et al.* [25] have implemented a Deep Neural Network (DNN)-based discriminator for multiplexed readout of qubits, employing an MF for dimensionality reduction before feeding the data to a very light DNN. This adds a requirement of qubit-specific demodulation and signal processing, limiting the scalability of the solution. Table II summarises the latency, resources and number of learnable parameters of our networks in comparison to the state-of-the-art.
The latency of the FNN of Reuer *et al.* [26] is 48 ns, while Maurya *et al.* [25] have not reported latency. Our NN state discriminators based on Arch-7, Arch-8 and Arch-9 implemented on an FPGA yield total latencies of 20 cycles, 12 cycles and 10 cycles, respectively, including 1 cycle for the box-car operation that provides the input to the NN. Thus, Arch-7, Arch-8, and Arch-9 result in a latency of 50 ns, 35.16 ns, and 26.1 ns, respectively, which are either comparable to or better than the existing works. The total numbers of learnable parameters of the NNs reported in Ref. [26] and Ref. [25] are 1891 and 1112, respectively. Compared to these two networks, our NN implementation has 18 to 29 times more learnable parameters, which makes it more robust and versatile when deployed for more complex readout scenarios.

TABLE I: Resource utilization and latency comparison of various NN architectures

| Model Arch | LUTs (% util) | FFs (% util) | # of Parameters | Quantization (IN/W/A) | $F_{GM}$ | Max Freq (MHz) | Latency (Cycles) | Latency (ns) |
|---|---|---|---|---|---|---|---|---|
| 1000x1000x500x250x32 (Arch-1) | 167054 (39.28) | 135088 (15.88) | 1,634,782 | 4/2/2 | 90.40 | 243 | 924 | 3802.26 |
| 1024x512x256x5 (Arch-2) | 58336 (13.72) | 46654 (5.48) | 657,413 | 4/2/2 | 90.39 | 275 | 336 | 1219.68 |
| 512x128x32x5 (Arch-3) | 38689 (9.10) | 33358 (3.92) | 69,957 | 4/2/2 | 89.85 | 261 | 51 | 195.33 |
| 512x64x32x5 (Arch-4) | 38087 (8.96) | 46064 (5.41) | 35,077 | 4/2/2 | 90.11 | 350 | 40 | 114.4 |
| 512x64x5 (Arch-5) | 43959 (10.37) | 53252 (6.26) | 33,217 | 4/2/2 | 89.91 | 260 | 33 | 127.05 |
| 512x64x5 (Arch-6) | 18302 (4.30) | 15197 (1.78) | 33,217 | 4/1/1 | 80.1 | 440 | 17 | 38.59 |
| ((512x8)x8)x5 (Arch-7) | 107026 (25.16) | 60696 (7.13) | 33,295 | 4/2/2 | 89.72 | 403 | 19 | 47.12 |
| 256x128x128x128x128x5 (Arch-8) | 110351 (25.94) | 123514 (14.52) | 42,124 | 4/2/4 | 89.78 | 341 | 11 | 32.23 |
| 256x128x128x128x128x5 (Arch-9) | 104527 (24.58) | 95945 (11.28) | 42,124 | 4/1/4 | 89.07 | 374 | 9 | 24.03 |

TABLE II: Latency comparison of NN-based discriminator with the state-of-the-art

| | Discriminator | # of Parameters | NN Latency (ns) | Resources (LUTs) | Readout Type |
|---|---|---|---|---|---|
| Reuer et al. [26] | 8-point Box-Car + NN | 1891 | 48 | Not Reported | Single |
| Maurya et al. [25] | Demodulation + MF + NN | 1112 | Not Reported | 17917 | Multiplexed |
| This Work | 2-point Box-Car + NN (Arch-7) | 33295 | 50 | 107026 | Multiplexed |
| | 2-point Box-Car + NN (Arch-9) | 42124 | 24.03 | 104527 | Multiplexed |

In Ref. [25] and Ref. [17], the number of output nodes equals the total number of possible state combinations for $N$ qubits, which is $2^N$. In contrast, we have reduced the number of output nodes to $N$ by using BCEWithLogitsLoss as the loss function. For example, the earlier works require 1024 output nodes for a 10-qubit frequency-multiplexed readout, whereas our NN would need only 10 nodes. This makes scaling our network to larger qubit systems practically feasible.

Our methodology is scalable and can implement large DNNs, as shown by implementing the model reported in Ref. [17] with 1.6 million parameters. By employing QAT, we have shown that even a model quantized down to 2 bits sees only a small drop in fidelity while providing ultra-low-latency state discrimination, which is critical for realizing QEC, as evident from Arch-7, Arch-8, and Arch-9. Though Arch-7 strikes a balance between complexity and latency, there might be cases which need deeper networks. In such cases, Arch-8 and Arch-9 can be utilised, which provide better latency regardless of the depth of the network.

## V. CONCLUSION

Here, we have presented neural network accelerators for state discrimination of frequency-multiplexed qubit readout traces. We showed an integrated approach to deploy scalable neural networks onto FPGAs. This approach offers flexibility, automation, and faster design turnaround time. Adopting this methodology, we have proposed ultra-low-latency architectures whose latencies are below 50 ns.
For the first time, we have demonstrated low-latency neural network architectures that do not require any qubit-specific signal processing on the FPGA. As the number of qubits in multiplexed readout continues to grow, our approach, with its automation and flexibility, offers a practical realisation of NN-based state discriminators for quantum computing.

## REFERENCES

- [1] S. Boixo, S. V. Isakov, V. N. Smelyanskiy, R. Babbush, N. Ding, Z. Jiang, M. J. Bremner, J. M. Martinis, and H. Neven, "Characterizing quantum supremacy in near-term devices," *Nature Physics*, vol. 14, no. 6, pp. 595–600, 2018.
- [2] P. W. Shor, "Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer," *SIAM Review*, vol. 41, no. 2, pp. 303–332, 1999.
- [3] L. K. Grover, "A fast quantum mechanical algorithm for database search," in *Proceedings of the twenty-eighth annual ACM symposium on Theory of computing*, 1996, pp. 212–219.
- [4] M. Schuld, I. Sinayskiy, and F. Petruccione, "An introduction to quantum machine learning," *Contemporary Physics*, vol. 56, no. 2, pp. 172–185, 2015.
- [5] P. J. O'Malley, R. Babbush, I. D. Kivlichan, J. Romero, J. R. McClean, R. Barends, J. Kelly, P. Roushan, A. Tranter, N. Ding *et al.*, "Scalable quantum simulation of molecular energies," *Physical Review X*, vol. 6, no. 3, p. 031007, 2016.
- [6] R. Barends, J. Kelly, A. Megrant, A. Veitia, D. Sank, E. Jeffrey, T. C. White, J. Mutus, A. G. Fowler, B. Campbell *et al.*, "Superconducting quantum circuits at the surface code threshold for fault tolerance," *Nature*, vol. 508, no. 7497, pp. 500–503, 2014.
- [7] D. A. Lidar and T. A. Brun, *Quantum Error Correction*. Cambridge University Press, 2013.
- [8] G. Q. AI, "Exponential suppression of bit or phase errors with cyclic error correction," *Nature*, vol. 595, no. 7867, pp. 383–387, 2021.
- [9] A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N.
Cleland, "Surface codes: Towards practical large-scale quantum computation," *Physical Review A*, vol. 86, no. 3, p. 032324, 2012. - [10] E. A. Sete, J. M. Martinis, and A. N. Korotkov, "Quantum theory of a bandpass purcell filter for qubit readout," *Physical Review A*, vol. 92, no. 1, p. 012325, 2015. - [11] C. Macklin, K. O'brien, D. Hover, M. Schwartz, V. Bolkhovsky, X. Zhang, W. Oliver, and I. Siddiqi, "A near-quantum-limited josephson traveling-wave parametric amplifier," *Science*, vol. 350, no. 6258, pp. 307–310, 2015. - [12] L. Chen, H.-X. Li, Y. Lu, C. W. Warren, C. J. Križan, S. Kosen, M. Rommel, S. Ahmed, A. Osman, J. Biznárová, A. Fadavi Roudsari, B. Lienhard, M. Caputo, K. Grigoras, L. Grönberg, J. Govenius, A. F. Kockum, P. Delsing, J. Bylander, and G. Tancredi, "Transmon qubit readout fidelity at the threshold for quantum error correction without a quantum-limited amplifier," npj Quantum Information, vol. 9, no. 1, pp. 1–7, 2023. [Online]. Available: https://www.nature.com/articles/s41534-023-00689-6 - [13] F. Swiadek, R. Shillito, P. Magnard, A. Remm, C. Hellings, N. Lacroix, Q. Ficheux, D. C. Zanuz, G. J. Norris, A. Blais, S. Krinner, and A. Wallraff, "Enhancing Dispersive Readout of Superconducting Qubits Through Dynamic Control of the Dispersive Shift: Experiment and Theory," arXiv preprint arXiv:2307.07765, 2023. [Online]. Available: http://arxiv.org/abs/2307.07765 - [14] C. A. Ryan, B. R. Johnson, J. M. Gambetta, J. M. Chow, M. P. da Silva, O. E. Dial, and T. A. Ohki, "Tomography via correlation of noisy measurement records," *Physical Review* A, vol. 91, no. 2, p. 022118, 2015. [Online]. Available: https: //link.aps.org/doi/10.1103/PhysRevA.91.022118 - [15] E. Magesan, J. M. Gambetta, A. D. Córcoles, and J. M. Chow, "Machine learning for discriminating quantum measurement trajectories and improving readout," *Physical review letters*, vol. 114, no. 20, p. 200501, 2015. - [16] R. Navarathna, T. Jones, T. Moghaddam, A. Kulikov, R. Beriwal, M. 
Jerger, P. Pakkiam, and A. Fedorov, "Neural networks for on-the-fly single-shot state classification," *Applied Physics Letters*, vol. 119, no. 11, p. 114003, 2021. [Online]. Available: https://doi.org/10.1063/5.0065011
- [17] B. Lienhard, A. Vepsäläinen, L. C. Govia, C. R. Hoffer, J. Y. Qiu, D. Ristè, M. Ware, D. Kim, R. Winik, A. Melville *et al.*, "Deep-neural-network discrimination of multiplexed superconducting-qubit states," *Physical Review Applied*, vol. 17, no. 1, p. 014024, 2022.
- [18] A. Seif, K. A. Landsman, N. M. Linke, C. Figgatt, C. Monroe, and M. Hafezi, "Machine learning assisted readout of trapped-ion qubits," *Journal of Physics B: Atomic, Molecular and Optical Physics*, vol. 51, no. 17, p. 174006, 2018.
- [19] Z.-H. Ding, J.-M. Cui, Y.-F. Huang, C.-F. Li, T. Tu, and G.-C. Guo, "Fast high-fidelity readout of a single trapped-ion qubit via machine-learning methods," *Physical Review Applied*, vol. 12, no. 1, p. 014038, 2019.
- [20] Y. Matsumoto, T. Fujita, A. Ludwig, A. D. Wieck, K. Komatani, and A. Oiwa, "Noise-robust classification of single-shot electron spin readouts using a deep neural network," *npj Quantum Information*, vol. 7, no. 1, p. 136, 2021.
- [21] L. Stefanazzi, K. Treptow, N. Wilcer, C. Stoughton, C. Bradford, S. Uemura, S. Zorzetti, S. Montella, G. Cancelo, S. Sussman *et al.*, "The qick (quantum instrumentation control kit): Readout and control for qubits and detectors," *Review of Scientific Instruments*, vol. 93, no. 4, 2022.
- [22] K. H. Park, Y. S. Yap, Y. P. Tan, C. Hufnagel, L. H. Nguyen, K. H. Lau, P. Bore, S. Efthymiou, S. Carrazza, R. P. Budoyo *et al.*, "Icarus-q: Integrated control and readout unit for scalable quantum processors," *Review of Scientific Instruments*, vol. 93, no. 10, 2022.
- [23] M. O. Tholén, R. Borgani, G. R. Di Carlo, A. Bengtsson, C. Križan, M. Kudra, G. Tancredi, J. Bylander, P. Delsing, S.
Gasparinetti *et al.*, "Measurement and control of a superconducting quantum processor with a fully integrated radio-frequency system on a chip," *Review of Scientific Instruments*, vol. 93, no. 10, 2022.
- [24] U. Singhal, S. Kalipatnapu, P. K. Gautam, S. Majumder, V. V. L. Pabbisetty, S. Jandhyala, V. Singh, and C. S. Thakur, "Sq-cars: A scalable quantum control and readout system," *IEEE Transactions on Instrumentation and Measurement*, 2023.
- [25] S. Maurya, C. N. Mude, W. D. Oliver, B. Lienhard, and S. Tannu, "Scaling qubit readout with hardware efficient machine learning architectures," in *Proceedings of the 50th Annual International Symposium on Computer Architecture*, 2023, pp. 1–13.
- [26] K. Reuer, J. Landgraf, T. Fösel, J. O'Sullivan, L. Beltrán, A. Akin, G. J. Norris, A. Remm, M. Kerschbaum, J.-C. Besse *et al.*, "Realizing a deep reinforcement learning agent for real-time quantum feedback," *Nature Communications*, vol. 14, no. 1, p. 7138, 2023.
- [27] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, "A survey of quantization methods for efficient neural network inference," in *Low-Power Computer Vision*. Chapman and Hall/CRC, 2022, pp. 291–326.
- [28] M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. Van Baalen, and T. Blankevoort, "A white paper on neural network quantization," arXiv preprint arXiv:2106.08295, 2021.
- [29] M. Blott, T. B. Preußer, N. J. Fraser, G. Gambardella, K. O'brien, Y. Umuroglu, M. Leeser, and K. Vissers, "Finn-r: An end-to-end deep-learning framework for fast exploration of quantized neural networks," *ACM Transactions on Reconfigurable Technology and Systems (TRETS)*, vol. 11, no. 3, pp. 1–23, 2018.
- [30] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," *Journal of Machine Learning Research*, vol. 18, no. 187, pp. 1–30, 2018.
- [31] A.
Pappalardo, "Xilinx/brevitas," 2023. [Online]. Available: https://doi.org/10.5281/zenodo.3333552
- [32] J. Heinsoo, C. K. Andersen, A. Remm, S. Krinner, T. Walter, Y. Salathé, S. Gasparinetti, J.-C. Besse, A. Potočnik, A. Wallraff, and C. Eichler, "Rapid high-fidelity multiplexed readout of superconducting qubits," *Phys. Rev. Appl.*, vol. 10, p. 034040, Sep 2018. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevApplied.10.034040
- [33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in pytorch," in *NIPS-W*, 2017.
- [34] W. Sung, S. Shin, and K. Hwang, "Resiliency of deep neural networks under quantization," arXiv preprint arXiv:1511.06488, 2015.
- [35] Xilinx, "AXI4-Stream Interconnect v1.1," https://docs.amd.com/v/u/en-US/pg035\_axis\_interconnect, 2024, [Online; accessed 20-Apr-2022].
- [36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
- [37] F. Hamanaka, T. Odan, K. Kise, and T. Van Chu, "An exploration of state-of-the-art automation frameworks for fpga-based dnn acceleration," *IEEE Access*, vol. 11, pp. 5701–5713, 2023.
- [38] P. Krantz, A. Bengtsson, M. Simoen, S. Gustavsson, V. Shumeiko, W. Oliver, C. Wilson, P. Delsing, and J. Bylander, "Single-shot readout of a superconducting qubit using a josephson parametric oscillator," *Nature Communications*, vol. 7, no. 1, p. 11417, 2016.
- [39] G. Turin, "An introduction to matched filters," *IRE Transactions on Information Theory*, vol. 6, no. 3, pp. 311–329, 1960.
- [40] C. A. Ryan, B. R. Johnson, J. M. Gambetta, J. M. Chow, M. P. Da Silva, O. E. Dial, and T. A. Ohki, "Tomography via correlation of noisy measurement records," *Physical Review A*, vol. 91, no. 2, p. 022118, 2015.
- [41] J. Heinsoo, C. K. Andersen, A. Remm, S. Krinner, T. Walter, Y. Salathé, S. Gasparinetti, J.-C. Besse, A. Potočnik, A.
Wallraff *et al.*, "Rapid high-fidelity multiplexed readout of superconducting qubits," *Physical Review Applied*, vol. 10, no. 3, p. 034040, 2018.
- [42] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler *et al.*, "Api design for machine learning software: experiences from the scikit-learn project," arXiv preprint arXiv:1309.0238, 2013.
- [43] L. Xiang, Z. Zong, Z. Sun, Z. Zhan, Y. Fei, Z. Dong, C. Run, Z. Jia, P. Duan, J. Wu *et al.*, "Simultaneous feedback and feedforward control and its application to realize a random walk on the bloch sphere in an xmon-superconducting-qubit system," *Physical Review Applied*, vol. 14, no. 1, p. 014099, 2020.
- [44] Y. Salathé, P. Kurpiers, T. Karg, C. Lang, C. K. Andersen, A. Akin, S. Krinner, C. Eichler, and A. Wallraff, "Low-latency digital signal processing for feedback and feedforward in quantum computing and communication," *Physical Review Applied*, vol. 9, no. 3, p. 034011, 2018.
- [45] Y. Yang, Z. Shen, X. Zhu, Z. Wang, G. Zhang, J. Zhou, X. Jiang, C. Deng, and S. Liu, "Fpga-based electronic system for the control and readout of superconducting quantum processors," *Review of Scientific Instruments*, vol. 93, no. 7, 2022.
- [46] C. Guo, J. Lin, L.-C. Han, N. Li, L.-H. Sun, F.-T. Liang, D.-D. Li, Y.-H. Li, M. Gong, Y. Xu *et al.*, "Low-latency readout electronics for dynamic superconducting quantum computing," *AIP Advances*, vol. 12, no. 4, 2022.

# VI. APPENDIX

#### A. NN-Accelerator Design Methodology

All the models are trained and quantized using PyTorch 1.12.1 and Brevitas 0.9.1. The activation function used is *ReLU*. We use the *Adam* [36] optimizer with a weight decay of $10^{-3}$. The learning rate is $10^{-3}$ and the batch size is 1024. While performing QAT, the *Linear* and *ReLU* layers are replaced by their equivalent quantization layers, *QuantLinear* and *QuantReLU*, respectively.
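To make the effect of swapping *Linear* and *ReLU* for quantized layers more concrete, the following is a minimal, framework-free sketch of uniform quantization of weights and activations. It is not the Brevitas implementation: the function names, the per-tensor max-based scale, and the 8-bit width are illustrative assumptions, and the gradient handling that QAT needs during training (typically a straight-through estimator) is not shown.

```python
def quantize_signed(values, bit_width=8):
    """Map floats to signed integers on a uniform grid (per-tensor scale)."""
    q_max = 2 ** (bit_width - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    ints = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return ints, scale


def quant_relu(values, bit_width=8):
    """ReLU followed by unsigned uniform quantization of the activations."""
    rectified = [max(0.0, v) for v in values]
    q_max = 2 ** bit_width - 1                    # 255 for 8 bits
    max_val = max(rectified)
    scale = max_val / q_max if max_val > 0 else 1.0
    ints = [min(q_max, round(v / scale)) for v in rectified]
    return ints, scale


def dequantize(ints, scale):
    """Return the float values that the stored integers represent."""
    return [i * scale for i in ints]


# Hypothetical weights and pre-activations, for illustration only.
weights = [0.50, -0.25, 0.125, -1.0]
w_ints, w_scale = quantize_signed(weights)
a_ints, a_scale = quant_relu([-1.0, 0.0, 2.0])
```

In this sketch the hardware would only ever see the integer lists; the scales are constants that can be folded into later stages.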
1) FINN-R Flow: The usual practice for implementing NNs on an FPGA is either a hand-crafted custom architecture [26] or High-Level Synthesis (HLS) [25]. These two approaches limit the flexibility in choosing the neural network, both in terms of architecture and size. Moreover, the design turnaround time for a custom architecture is long. This calls for an integrated approach that offers automation and flexibility in implementing a QNN on an FPGA.

Fig. 5: Flowchart representing the end-to-end flow of FINN-R. The double-bordered rectangles show the modifications relative to the default FINN-R flow.

Vitis-AI and FINN-R are the leading automation frameworks [37] for mapping QNNs onto FPGAs. Both use quantized neural networks and fixed-point arithmetic to generate the hardware design. They differ in how they implement the quantized neural network on the FPGA: as an overlay-based architecture or as a dataflow-based architecture. The overlay-based architectures adopted by Vitis-AI are very scalable because they use off-chip memory for model parameter storage. However, their performance may not be as high as that of dataflow architectures. FINN-R follows a dataflow architecture and implements the model using on-chip memory, which results in lower latencies than overlay-based implementations. Since low-latency readout is an essential requirement of quantum technologies, this dataflow framework is the preferred choice for our work.

The end-to-end flow of FINN-R is shown in Figure 5. FINN-R takes the model as input in ONNX format with FINN-specific metadata embedded, which can be exported using the Brevitas library in the PyTorch environment. The model is then transformed step-wise into a streaming dataflow graph, represented in an Intermediate Representation (IR). After replacing the nodes with HLS-callable functions, the flow applies user-defined parallelism or derives these configurations from a given throughput target.
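As a rough illustration of what such a parallelism setting controls, a fully connected layer in a FINN-style dataflow design is commonly described by a number of processing elements (PE) and a number of input lanes per element (SIMD). A layer with `rows` outputs and `cols` inputs then needs roughly `(rows / pe) * (cols / simd)` cycles per input vector. The sketch below evaluates this estimate; the layer size and clock period are hypothetical values, not taken from this work, and pipeline and FIFO overheads are ignored.

```python
def folded_cycles(rows, cols, pe, simd):
    """Estimated cycles per input vector for one folded matrix-vector layer."""
    if rows % pe or cols % simd:
        # Assumed constraint: the folding factors divide the layer dimensions.
        raise ValueError("pe must divide rows and simd must divide cols")
    return (rows // pe) * (cols // simd)


def layer_latency_ns(rows, cols, pe, simd, clock_period_ns):
    """Convert the cycle estimate into time for a given clock period."""
    return folded_cycles(rows, cols, pe, simd) * clock_period_ns


# Hypothetical 16 x 32 layer at a hypothetical 2 ns clock period.
fully_parallel = folded_cycles(16, 32, pe=16, simd=32)   # one cycle, most hardware
time_multiplexed = folded_cycles(16, 32, pe=4, simd=8)   # 16 cycles, less hardware
latency = layer_latency_ns(16, 32, pe=4, simd=8, clock_period_ns=2.0)
```

Larger `pe` and `simd` values shorten the estimate at the cost of more multipliers and on-chip memory ports, which is the trade-off a throughput target resolves automatically.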
FINN-R implements quantization and matrix multiplication of low-precision data using Matrix-Vector-Threshold Units (MVTU). One such unit is used for each layer of the quantized neural network. Each unit consists of several Processing Elements (PE), and each PE can have multiple input lanes, as in a Single Instruction Multiple Data (SIMD) architecture. The computation can be time-multiplexed to save hardware resources, or the implementation can dedicate a PE and/or SIMD lane to each computation, resulting in a faster implementation at the cost of higher hardware resource consumption. This provides the flexibility to balance hardware resource consumption against latency. The final step uses the Xilinx Vivado back-end to generate the Verilog design for the targeted FPGA device. The design is exported as an Intellectual Property (IP) block, which can be modified and integrated into any design as required.

The hardware architectures of the proposed designs are implemented on the Xilinx RFSoC ZCU111 using Xilinx Vivado 2022.2. The synthesis strategy was set for high performance, with low latency as the primary goal.

#### B. SVM-based State Discriminator

Common single-qubit state discriminators are the boxcar filter (BF) [38] and the matched filter (MF) [39]–[41]. Both have been widely used at the software and the hardware level. The MF is preferred over the BF because it maximises the signal-to-noise ratio (SNR). Additionally, the BF is not immune to additive stationary noise. However, compared to the BF and MF, the Support Vector Machine (SVM) offers better decision boundaries for both single-qubit state discrimination and FDM readout with qubit-specific processing [15], [17]. The incoming stream of I and Q data from the RF-ADC is digitally demodulated at each qubit-specific Intermediate Frequency (IF) over the readout integration time.
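The digital demodulation step can be sketched as follows: each complex I/Q sample is multiplied by a complex exponential at the qubit's IF, and the products are averaged over the integration window. This is a floating-point illustration only. The sampling rate, tone frequencies, and helper names are assumptions made for the example, not parameters of the hardware described here.

```python
import cmath
import math


def demodulate(i_samples, q_samples, if_hz, sample_rate_hz):
    """Mix an I/Q trace down by if_hz and average over the whole window."""
    if len(i_samples) != len(q_samples) or not i_samples:
        raise ValueError("I and Q traces must be non-empty and equally long")
    acc = 0j
    for n, (i_val, q_val) in enumerate(zip(i_samples, q_samples)):
        phase = -2.0 * math.pi * if_hz * n / sample_rate_hz
        acc += complex(i_val, q_val) * cmath.exp(1j * phase)
    return acc / len(i_samples)


def make_tone(freq_hz, sample_rate_hz, n_samples, amplitude=1.0, phase=0.0):
    """Synthetic readout tone used only to exercise the demodulator."""
    angles = [2.0 * math.pi * freq_hz * n / sample_rate_hz + phase
              for n in range(n_samples)]
    return ([amplitude * math.cos(a) for a in angles],
            [amplitude * math.sin(a) for a in angles])


FS = 1.0e9  # hypothetical sampling rate
i_trace, q_trace = make_tone(50e6, FS, 500, amplitude=0.8, phase=0.3)
matched = demodulate(i_trace, q_trace, 50e6, FS)      # recovers 0.8 * exp(0.3j)
mismatched = demodulate(i_trace, q_trace, 100e6, FS)  # averages to about zero
```

Demodulating at the wrong IF averages the tone away, which is why each qubit in a frequency-multiplexed readout gets its own demodulation frequency.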
The training data comprised 1000 random samples from each of the 32 possible state combinations, with a vector size of 512 for each of the I and Q data, corresponding to a $1\,\mu\mathrm{s}$ readout trace. The Linear Support Vector Classification (LinearSVC) implementation [42] was used to realise the SVM. Following the training, we obtained the floating-point weights and biases that were used with the test data. The floating-point versions of the SVM outperformed their MF counterparts in terms of fidelity per qubit, and the overall geometric mean fidelity improved by 1.53 %.

TABLE III: Readout fidelities of SVM discriminators in comparison to the case of a matched filter

| Discriminator | $F_{GM}$ |
|----------------------------------|--------|
| Matched Filter (Float) | 0.8846 |
| SVM (Float) | 0.8982 |
| SVM (2 Multiplier + 8-bit quant) | 0.8980 |
| SVM (1 Multiplier + 8-bit quant) | 0.8985 |

Fig. 6: Hardware architecture of the SVM-based state discriminator. It takes the incoming 8-bit I and Q inputs from the RF-ADC and performs multiply-accumulate (MAC) operations over the readout integration time. Thus, the resulting processing latency is the computation time required to process the last input sample. It has five SVM modules, one for each qubit.

The weights were quantized to 8 bits, along with the input and demodulation parameters. The quantized implementation comes in two flavours: one uses separate multipliers for demodulation and for weight multiplication, and the other uses a single multiplier fed with fused weight and demodulation coefficients computed *a priori*, reducing the multiplier count by one per qubit. Table III shows the readout fidelities for this work compared to matched filters and different SVM implementations. The hardware implementation of the SVM is shown in Figure 6.
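The single-multiplier variant is possible because demodulation and the linear SVM weighting are both linear operations, so their coefficients can be multiplied together offline. The floating-point sketch below shows one way this fusion could be written, assuming a linear SVM with separate weights for the demodulated I and Q components; the function names, weights, and frequencies are hypothetical, and the 8-bit rounding used in hardware is not modelled.

```python
import math


def two_stage_score(i_s, q_s, w_i, w_q, bias, if_hz, sample_rate_hz):
    """Demodulate each sample first, then apply the SVM weights."""
    score = bias
    for n, (i_val, q_val) in enumerate(zip(i_s, q_s)):
        angle = 2.0 * math.pi * if_hz * n / sample_rate_hz
        c, s = math.cos(angle), math.sin(angle)
        demod_i = i_val * c + q_val * s
        demod_q = q_val * c - i_val * s
        score += w_i[n] * demod_i + w_q[n] * demod_q
    return score


def fuse_coefficients(w_i, w_q, if_hz, sample_rate_hz):
    """Precompute one coefficient per raw I sample and one per raw Q sample."""
    fused_i, fused_q = [], []
    for n, (wi, wq) in enumerate(zip(w_i, w_q)):
        angle = 2.0 * math.pi * if_hz * n / sample_rate_hz
        c, s = math.cos(angle), math.sin(angle)
        fused_i.append(wi * c - wq * s)
        fused_q.append(wi * s + wq * c)
    return fused_i, fused_q


def fused_score(i_s, q_s, fused_i, fused_q, bias):
    """Single multiply-accumulate pass over the raw samples."""
    return bias + sum(i_val * a + q_val * b
                      for i_val, q_val, a, b in zip(i_s, q_s, fused_i, fused_q))


# Hypothetical trace and weights, for illustration only.
N = 64
i_trace = [math.cos(0.3 * n) for n in range(N)]
q_trace = [math.sin(0.3 * n) for n in range(N)]
w_i = [0.01 * (n % 7 - 3) for n in range(N)]
w_q = [0.02 * (n % 5 - 2) for n in range(N)]
fused_i, fused_q = fuse_coefficients(w_i, w_q, if_hz=50e6, sample_rate_hz=1e9)
score_a = two_stage_score(i_trace, q_trace, w_i, w_q, 0.1, 50e6, 1e9)
score_b = fused_score(i_trace, q_trace, fused_i, fused_q, 0.1)
```

In floating point the two scores agree up to rounding error. With quantized coefficients the two paths round at different points, so small differences in the resulting fidelity are to be expected.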
Our SVM implementation benefits from quantized weights and computation, and it outperforms other state-of-the-art traditional signal-processing-based discriminators in terms of latency and resource utilization.

TABLE IV: Latency comparison of the SVM-based discriminator with the state of the art

| | Discriminator | Processing latency, multi-cycle (ns) | Processing latency, single-cycle (ns) | # LUTs | Readout type |
|---------------------|--------------------|--------------|--------------|--------------|--------------|
| Xiang et al. [43] | Demodulation + BF | 32 | Not reported | Not reported | Single |
| Salathé et al. [44] | Demodulation + BF | 30 | 5.3 | 509 | Single |
| Yang et al. [45] | Demodulation + BF | 20 | Not reported | Not reported | Single |
| Guo et al. [46] | Demodulation + MF | 24 | Not reported | Not reported | Single |
| Tholén et al. [23] | Demodulation + MF | 10 | Not reported | Not reported | Single |
| This work | Demodulation + SVM | 5.74 | 3.67 | 1675 | Multiplexed |

A comparison with other state-of-the-art work is given in Table IV. Salathé *et al.* [44] proposed a method using demodulation and thresholding, achieving a core processing latency of 30 ns for the discriminator. Tholén *et al.* [23] present an integrated solution on an RFSoC that includes an MF state discriminator employing pre-stored samples, giving a readout latency of 10 ns. Guo *et al.* [46] presented demodulation and matched filtering with a readout latency of 24 ns. The quantized SVM implemented on the FPGA yields a discriminator latency of 5.74 ns, and its resource utilization is frugal, with only 1675 LUTs used for the 5-qubit system.