Page 106 - Kaleidoscope Academic Conference Proceedings 2024

2024 ITU Kaleidoscope Academic Conference




userspace has two sections:
1. The 'My Software' section, where we make or store the software.
2. The XRT section, which is the communication link between the Host and the Alveo card.

The OS kernel has the Xilinx Open Computing Language (XOCL) for parallel programming on the Alveo (FPGA). The matrix multiplication is written in a .cpp file in C++, a high-level programming language. This file is compiled with the Vitis compiler (V++), which produces two files, mmult.xo (matrix multiplication) and vadd.xo (vector addition); these are the kernels, or functions, defined in the .cpp file. From them we create the xclbin, a lower-level machine language file that contains the two definitions of mmult and vadd. The xclbin file is used as an overlay file through the PYNQ library in Python, where the matrices to be multiplied are defined. The matrices are first reshaped to the shapes defined in the .cpp file, a step called tiling of matrices, and are then ported to the ALVEO card using the sync_to_device() function of the pynq library. The matrix multiplications are performed on the ALVEO card using the mmult and vadd functions, and the result is ported back to the Host Processor using the sync_from_device() function of the overlay method of the PYNQ library (7).

5. ARCHITECTURE SPECIFICATIONS AND THEORETICAL EQUATIONS

5.1 CPU Architecture Specification

AMD EPYC 7742
Number of Cores - 64
RAM of Processor - 256 GB
Number of Threads - 64 cores * 2 threads/core = 128 threads
Base Clock Speed - 2.25 GHz
FLOPS = 4608 GFLOPS (double-precision performance)

$$T_{\mathrm{CPU}} = \left(2.1701 \times 10^{-4} \cdot 2^{3n} + 2^{2n}\,(2^{n} - 1)\right) \cdot d \qquad (1)$$

Here, $n$ is the number of qubits (i.e., the matrix size will be $2^{n}$), and $d$ is the depth (number of matrices). $T_{\mathrm{CPU}}$ gives the computational runtime for the CPU at depth 1 and at any matrix size.

5.2 GPU Architecture Specification

RTX A4000
Number of Cores - 6144
RAM of Processor - 16 GB
Number of Threads - 6,144 cores * 32 threads/core = 196,608 threads
Base Clock Speed - 1420 MHz
FLOPS = 299.5 GFLOPS (double-precision performance)

$$T_{\mathrm{GPU}} = \left(3.3388 \times 10^{-3} \cdot 2^{3n} + 2^{2n}\,(2^{n} - 1)\right) \cdot d \qquad (2)$$

5.3 Alveo Architecture Specification

ALVEO U55C
Look-up tables (LUTs) - 1,304K
Registers - 2,607K
DSP Slices - 9,024
RAM of Processor - 16 GB High Bandwidth Memory (HBM2)
Base Clock Speed - 300 MHz
FLOPS = 500-600 GFLOPS (approximately)

$$T_{\mathrm{FPGA}} = \left(\left(2 \times 10^{-3} \text{ to } 1.6666 \times 10^{-3}\right) \cdot 2^{3n} + 2^{2n}\,(2^{n} - 1)\right) \cdot d \qquad (3)$$

6. RESULTS

Experimental findings:
Runtime equations for the three architectures from curve fit

$$T_{\mathrm{CPU}}^{\mathrm{fit}} = \left(3.1181 \times 10^{-9} \cdot 2^{3n} + 2^{2n}\,(2^{n} - 1)\right) \cdot d \qquad (4)$$

The coefficient $a$ is independent of the matrix size and depth in the case of the CPU. This is evident from Figure 8, where we plot the runtime versus the number of qubits for depth 11 and find that the value of $a$ is $3.12324502 \times 10^{-9}$, which is almost equal to the value obtained at depth 1.

Figure 7 – Performance of CPU

Figure 8 – Performance of CPU
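
The runtime equations in this paper all share one form, with only the leading coefficient changing between architectures, so they can be evaluated with one helper. The following is a minimal Python sketch, not the authors' code: the function name `runtime_model` and the `COEFFICIENTS` dictionary are illustrative names introduced here, the formula is implemented exactly as printed, and no time unit is assumed because the text does not state one. The theoretical CPU and GPU coefficients are approximately the reciprocals of the quoted GFLOPS figures (1/4608 and 1/299.5), and the Alveo range corresponds to 1/500 and 1/600.

```python
def runtime_model(a: float, n: int, d: int) -> float:
    """Evaluate (a * 2^(3n) + 2^(2n) * (2^n - 1)) * d as printed in the equations.

    a: architecture-dependent coefficient (illustrative parameter name)
    n: number of qubits, so each matrix is 2^n x 2^n
    d: depth, i.e. the number of matrices
    """
    size = 2 ** n                        # matrix dimension 2^n
    multiplications = size ** 3          # 2^(3n) scalar multiplications
    additions = size ** 2 * (size - 1)   # 2^(2n) * (2^n - 1) scalar additions
    return (a * multiplications + additions) * d


# Coefficients copied from the equations in the text; the key names are hypothetical.
COEFFICIENTS = {
    "cpu_theory": 2.1701e-4,        # Equation (1)
    "gpu_theory": 3.3388e-3,        # Equation (2)
    "fpga_theory_high": 2.0e-3,     # Equation (3), upper end of the range
    "fpga_theory_low": 1.6666e-3,   # Equation (3), lower end of the range
    "cpu_fit": 3.1181e-9,           # Equation (4), curve fit
}

if __name__ == "__main__":
    # Print the model value for a small example: 3 qubits, depth 1.
    for name, a in COEFFICIENTS.items():
        print(f"{name}: {runtime_model(a, n=3, d=1)}")
```

Because the depth `d` multiplies the whole expression, doubling the depth doubles the modelled value for any fixed `n` and `a`.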


