Page 106 - Kaleidoscope Academic Conference Proceedings 2024


2024 ITU Kaleidoscope Academic Conference

userspace has two sections:
1. The 'My Software' section, where we make or store the software.
2. The XRT section, which is the communication link between the Host and the Alveo card.

The OS kernel has the Xilinx Open Computing Language (XOCL) for parallel programming on the Alveo (FPGA). The matrix multiplication is written in C++ in a .cpp file. This file is compiled with the Vitis compiler (V++), which produces two files, mmult.xo (matrix multiplication) and vadd.xo (vector addition); these are the kernels, or functions, defined in the .cpp file. From them we create the xclbin, a lower-level machine language file that contains the two definitions of mmult and vadd. This xclbin file is used as an overlay file through the PYNQ library in Python, where the matrices to be multiplied are defined. The matrices are first reshaped to the shapes defined in the .cpp file (tiling of matrices) and then ported to the Alveo card using the sync_to_device() function of the PYNQ library. The matrix multiplications are performed on the Alveo card using the mmult and vadd functions, and the result is ported back to the Host Processor using the sync_from_device() function of the PYNQ library (7).

5. ARCHITECTURE SPECIFICATIONS AND THEORETICAL EQUATIONS

5.1 CPU Architecture Specification

AMD EPYC 7742
Number of Cores - 64
RAM of Processor - 256 GB
Number of Threads - 64 cores * 2 threads/core = 128 threads
Base Clock Speed - 2.25 GHz
FLOPS = 4608 GFLOPS (double-precision performance)

$T = (2.1701 \times 10^{-4} \cdot 2^{3n} + (2^{2n} \cdot (2^{n} - 1))) \cdot d$  (1)

Here, $n$ is the number of qubits (i.e. the matrix size will be $2^{n}$), and $d$ is the depth (number of matrices). The expression in parentheses gives the computational runtime for the CPU at depth 1 and at any matrix size.

5.2 GPU Architecture Specification

RTX A4000
Number of Cores - 6144
RAM of Processor - 16 GB
Number of Threads - 6,144 cores * 32 threads/core = 196,608 threads
Base Clock Speed - 1420 MHz
FLOPS = 299.5 GFLOPS (double-precision performance)

$T = (3.3388 \times 10^{-3} \cdot 2^{3n} + (2^{2n} \cdot (2^{n} - 1))) \cdot d$  (2)

5.3 Alveo Architecture Specification

ALVEO U55C
Look-up tables (LUTs) - 1,304K
Registers - 2,607K
DSP Slices - 9,024
RAM of Processor - 16 GB High Bandwidth Memory (HBM2)
Base Clock Speed - 300 MHz
FLOPS = 500-600 GFLOPS (approximately)

$T = ((2 \times 10^{-3} \text{ to } 1.6666 \times 10^{-3}) \cdot 2^{3n} + (2^{2n} \cdot (2^{n} - 1))) \cdot d$  (3)

6. RESULTS

Experimental findings:
Runtime equations for the 3 architectures from curve fit

$T = (3.1181 \times 10^{-9} \cdot 2^{3n} + (2^{2n} \cdot (2^{n} - 1))) \cdot d$  (4)

Figure 7 – Performance of CPU

The coefficient $a$ (the multiplier of the $2^{3n}$ term) is independent of the matrix size and depth in the case of the CPU. This is evident from Figure 8, where we plot the runtime against the number of qubits for depth 11 and find that the value of $a$ is $3.12324502 \times 10^{-9}$, which is almost equal to the value obtained at depth 1.

Figure 8 – Performance of CPU
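The theoretical runtime equations (1) to (3) share one form and differ only in the leading coefficient, which appears to equal the reciprocal of the quoted GFLOPS figure (for example 1/4608 ≈ 2.1701e-4 for the CPU). The following is a minimal sketch of that model, assuming n denotes the number of qubits and d the depth; the names `theoretical_runtime` and `PEAK_GFLOPS` are hypothetical and do not come from the paper.

```python
def theoretical_runtime(gflops: float, n_qubits: int, depth: int) -> float:
    """Evaluate (a * 2^(3n) + 2^(2n) * (2^n - 1)) * d with a = 1 / gflops.

    Assumption: the leading coefficient is the reciprocal of the quoted
    GFLOPS figure, which matches the constants printed in the equations.
    """
    a = 1.0 / gflops
    size = 2 ** n_qubits  # matrix dimension 2^n
    return (a * size**3 + size**2 * (size - 1)) * depth


# GFLOPS figures quoted in the architecture specifications.
# For the Alveo card the text gives a range, so both ends are listed.
PEAK_GFLOPS = {
    "CPU (AMD EPYC 7742)": 4608.0,
    "GPU (RTX A4000)": 299.5,
    "Alveo U55C (low end)": 500.0,
    "Alveo U55C (high end)": 600.0,
}

for name, gflops in PEAK_GFLOPS.items():
    print(f"{name}: a = {1.0 / gflops:.4e}, "
          f"model value at n=5, d=1: {theoretical_runtime(gflops, 5, 1):.6f}")
```

Because the depth enters only as a final multiplier, the model value at depth d is exactly d times the value at depth 1 for the same number of qubits.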

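The idea of obtaining the leading coefficient a from a curve fit can be illustrated as follows. This is a minimal sketch, not the authors' fitting procedure: it assumes the model form of the runtime equations, uses closed-form least squares through the origin, and runs on synthetic runtimes generated from an arbitrary value a = 3.0e-9 rather than on measured data. The names `model_runtime` and `fit_coefficient` are hypothetical.

```python
def model_runtime(a: float, n_qubits: int, depth: int) -> float:
    """Model form assumed here: (a * 2^(3n) + 2^(2n) * (2^n - 1)) * d."""
    size = 2 ** n_qubits
    return (a * size**3 + size**2 * (size - 1)) * depth


def fit_coefficient(n_values, runtimes, depth: int) -> float:
    """Least-squares estimate of a for a fixed depth.

    Rearranging the model gives y = a * x with
    x = 2^(3n) and y = T / d - 2^(2n) * (2^n - 1),
    so the best-fit slope through the origin is sum(x*y) / sum(x*x).
    """
    num = 0.0
    den = 0.0
    for n, t in zip(n_values, runtimes):
        size = 2 ** n
        x = float(size**3)
        y = t / depth - size**2 * (size - 1)
        num += x * y
        den += x * x
    return num / den


# Synthetic data only: a_true is an arbitrary illustrative value.
a_true = 3.0e-9
qubits = list(range(1, 11))
for depth in (1, 11):
    synthetic = [model_runtime(a_true, n, depth) for n in qubits]
    a_fit = fit_coefficient(qubits, synthetic, depth)
    print(f"depth {depth}: fitted a = {a_fit:.4e}")
```

Fitting at two depths and comparing the recovered values mirrors the check described in the text, where the coefficient at depth 11 is compared with the one obtained at depth 1.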

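The tiling step of the Alveo flow described earlier (reshaping the matrices into fixed-size tiles, multiplying tiles and accumulating the partial products) can be sketched on the host with NumPy. This is only an illustration under stated assumptions: it does not use PYNQ or the card, the tile size `TILE` is hypothetical, and the Python functions `mmult` and `vadd` merely stand in for the hardware kernels of the same names.

```python
import numpy as np

TILE = 4  # hypothetical tile edge; the real value is fixed in the .cpp kernel


def mmult(a_tile: np.ndarray, b_tile: np.ndarray) -> np.ndarray:
    """Stand-in for the mmult kernel: multiply two tiles."""
    return a_tile @ b_tile


def vadd(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Stand-in for the vadd kernel: element-wise addition."""
    return x + y


def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = TILE) -> np.ndarray:
    """Multiply two square matrices tile by tile."""
    n = a.shape[0]
    assert a.shape == (n, n) and b.shape == (n, n) and n % tile == 0
    blocks = n // tile
    dtype = np.result_type(a, b)
    # Reshape to (block_row, block_col, tile, tile) so a_t[i, k] is one tile.
    a_t = a.reshape(blocks, tile, blocks, tile).swapaxes(1, 2)
    b_t = b.reshape(blocks, tile, blocks, tile).swapaxes(1, 2)
    c_t = np.zeros((blocks, blocks, tile, tile), dtype=dtype)
    for i in range(blocks):
        for j in range(blocks):
            acc = np.zeros((tile, tile), dtype=dtype)
            for k in range(blocks):
                acc = vadd(acc, mmult(a_t[i, k], b_t[k, j]))
            c_t[i, j] = acc
    # Undo the tiling to recover the (n, n) result.
    return c_t.swapaxes(1, 2).reshape(n, n)


rng = np.random.default_rng(0)
a = rng.random((8, 8))
b = rng.random((8, 8))
print(np.allclose(tiled_matmul(a, b), a @ b))
```

In the actual flow the tiles would be placed in device buffers and synchronised to and from the card around the kernel calls; that part is omitted here because it needs the hardware and the xclbin overlay.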