Page 104 - Kaleidoscope Academic Conference Proceedings 2024

2024 ITU Kaleidoscope Academic Conference























Figure 2 – Runtime in units vs Quantum Circuit Depth Complexity

Figure 3 – Quantum Circuit and its Quantum Emulations are Matrix-Vector and
Matrix-Matrix Multiplications

of $2 \times 1$, which scales up exponentially as we increase the number of
qubits $n$, i.e. the matrix size ($2^{n} \times 2^{n}$) and the state vector
size ($2^{n} \times 1$). Once the code has been optimized on the backend,
further performance gains require changing the classical hardware
architecture, i.e. moving to GPUs and ALVEO cards, the HPC cards available on
our end (3). In other research works, CMOS circuit emulators for quantum
computing have been proposed (4), but these hardware implementations are not
yet commercially available. In the near term, using classical computing
hardware to emulate quantum computation remains a viable solution.
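
To make the matrix and state vector sizes above concrete, the following is a minimal sketch of state-vector emulation in plain Python, not the implementation benchmarked in this paper. The function names (`kron`, `matvec`, `layer_matrix`, `emulate`), the symbol `n_qubits`, and the choice of a Hadamard gate on every qubit as the circuit layer are assumptions made only for illustration.

```python
import math

def kron(a, b):
    """Kronecker product of two dense matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def matvec(m, v):
    """Dense matrix-vector product, i.e. one emulated circuit layer."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

# Single-qubit Hadamard gate (2 x 2), used here only as an example gate.
S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]

def layer_matrix(n_qubits):
    """Build the full 2^n x 2^n operator that applies H to every qubit."""
    m = [[1.0]]
    for _ in range(n_qubits):
        m = kron(m, H)
    return m

def emulate(n_qubits, depth):
    """Apply `depth` identical layers to |0...0> and return the final state."""
    dim = 2 ** n_qubits
    state = [1.0] + [0.0] * (dim - 1)   # state vector of size 2^n x 1
    layer = layer_matrix(n_qubits)      # gate matrix of size 2^n x 2^n
    for _ in range(depth):              # one matrix-vector product per layer
        state = matvec(layer, state)
    return state
```

Under these assumptions, `emulate(3, 2)` builds an 8 x 8 layer matrix and performs two matrix-vector products on an 8-element state vector. Each additional qubit doubles both dimensions of the matrix, which is the exponential growth described above.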
The depth of the quantum circuit is equivalent to the number of matrices used
in the multiplications in quantum emulations. On CPU, the runtime grows
linearly with the quantum circuit depth, i.e. the matrix multiplication depth.
Two questions follow. How much smaller is the slope of runtime versus depth on
the GPU and ALVEO cards? And how does the runtime complexity vary with the
number of qubits on ALVEO and GPU, given that it is of cubic order on CPU?
Even if the order of complexity remains the same on the accelerator cards, the
exact equation of the complexity will have lower values on GPU and ALVEO
cards, owing to the customization of the hardware architecture to the
application on ALVEO cards and to parallelism on GPU cards. This paper aims to
benchmark this behaviour and to establish the exact mathematical equation of
the runtime for variable qubit size and quantum circuit depth, which can then
be used to establish a bottleneck for the qubit size and the quantum circuit
depth on a present-day supercomputer for quantum emulations. We now move on to
the exact dataflow for matrix multiplications on CPU, GPU, and ALVEO cards (5).

A clear pictorial representation of the correspondence between quantum
emulations and actual quantum circuits is shown in Fig. 3.

2.  MATRIX MULTIPLICATION ON CPU

On the CPU, the data (matrix elements) is first loaded from CPU RAM (Random
Access Memory) into the L1 cache over multiple data buses for efficient
parallel computation.

Figure 4 – Internal Architecture of CPU

Once the required matrix elements are in the cache, they can be loaded into
CPU registers, which are small, fast storage locations directly accessible by
the CPU cores. The CPU's instruction decoder and execution units handle the
loading of data from cache into registers. This process is typically
controlled by assembly-level instructions generated by the compiler or
software. Once the matrix elements are loaded into CPU registers, the actual
multiplication operation can begin. The EPYC 7742 CPU features multiple cores,
each capable of executing instructions independently. These cores can work in
parallel, allowing for efficient processing of matrix multiplication tasks.
The CPU's SIMD (Single Instruction, Multiple Data) units can be leveraged for
parallel computation. SIMD instructions enable the execution of the same
operation on multiple data elements simultaneously, which is beneficial for
matrix multiplication. The CPU executes instructions generated by the software
or compiler to