



perform the matrix multiplication algorithm. This involves a combination of arithmetic operations (e.g., addition and multiplication) and data-movement instructions. Once the matrix multiplication is complete, the result matrix is stored back into CPU registers. Depending on the application's requirements, the results may need to be written back to RAM. This involves a process similar to loading data from RAM, where the CPU's cache hierarchy is used to manage data movement efficiently. For these operations, we use the Python library NumPy.
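As a minimal sketch of this CPU path (the matrix sizes and the use of numpy.matmul are our assumptions; the text states only that NumPy is used):

    import numpy as np

    # Hypothetical sizes, chosen for illustration only
    N = 1024
    A = np.random.rand(N, N)
    B = np.random.rand(N, N)

    # NumPy dispatches the multiplication to an optimized BLAS routine;
    # loads from RAM, register-level arithmetic, and write-back through
    # the cache hierarchy are handled by the library and the hardware.
    C = np.matmul(A, B)   # equivalently: C = A @ B
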
3.  MATRIX MULTIPLICATION ON GPU


                   Figure 5 – Internal Architecture of GPU

We use an RTX A4000 series GPU. First, we need to transfer the matrices from the system's RAM (main memory) to the GPU's dedicated memory, known as Video RAM (VRAM). This transfer typically involves using functions provided by GPU-accelerated programming frameworks and libraries such as CUDA (Compute Unified Device Architecture) or cuBLAS (CUDA Basic Linear Algebra Subprograms). The CUDA programming model, for instance, provides functions like cudaMemcpy() to transfer data between the host (CPU) and device (GPU) memories efficiently.
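A host-side sketch of this staging step, assuming single-precision N x N matrices (the function name stage_inputs and the layout are ours; cudaMalloc and cudaMemcpy are the CUDA runtime calls the text refers to):

    // Hypothetical staging of two N x N float matrices; h_A and h_B
    // are assumed to already hold the input data in host RAM.
    #include <cuda_runtime.h>

    void stage_inputs(const float *h_A, const float *h_B,
                      float **d_A, float **d_B, float **d_C, int N)
    {
        size_t bytes = (size_t)N * N * sizeof(float);

        // Allocate device memory (VRAM) for the inputs and the result.
        cudaMalloc((void **)d_A, bytes);
        cudaMalloc((void **)d_B, bytes);
        cudaMalloc((void **)d_C, bytes);

        // Copy the input matrices from host RAM to device VRAM.
        cudaMemcpy(*d_A, h_A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(*d_B, h_B, bytes, cudaMemcpyHostToDevice);
    }
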
Matrix multiplication on a GPU is typically implemented as a kernel function, a small program executed in parallel by many threads on the GPU (6). A CUDA kernel specifically designed to perform matrix multiplication is used, and each GPU thread executes a portion of the matrix multiplication operation in parallel. CUDA provides grid and block structures to organize these threads into a grid of thread blocks, which are executed concurrently on the GPU's streaming multiprocessors (SMs). Once the matrices are loaded into GPU memory and the kernel is ready, the kernel is launched from the CPU, initiating parallel execution of the matrix multiplication operation on the GPU. The number of threads per block and the number of blocks per grid are chosen based on the size of the matrix and the GPU's architecture to achieve optimal parallelism. Each thread block is scheduled onto an SM for execution; within each SM, multiple thread blocks can be processed concurrently, with each block using the SM's resources efficiently. The CUDA runtime manages the scheduling and execution of thread blocks across the GPU's SMs, maximizing parallelism and throughput. Once the matrix multiplication is complete, the result matrix is transferred from GPU memory back to CPU memory, again using functions such as cudaMemcpy(). Overall, performing matrix multiplication on an RTX A4000 GPU involves leveraging its massively parallel architecture, using the CUDA programming model for kernel execution, and efficiently managing data movement between the CPU and GPU memories. This approach allows significant acceleration of matrix operations compared to traditional CPU-based computation, especially for large matrices.
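Under our assumptions (a naive one-element-per-thread kernel and a 16 x 16 block size; the paper does not disclose its actual kernel or launch configuration), the kernel and launch steps could look like:

    // Hypothetical naive kernel: each thread computes one element of C.
    __global__ void matmul(const float *A, const float *B, float *C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }

    // Hypothetical launch wrapper; d_A, d_B, d_C are the device buffers
    // from the staging sketch above, and h_C is host memory for the result.
    void run_matmul(const float *d_A, const float *d_B, float *d_C,
                    float *h_C, int N)
    {
        dim3 block(16, 16);   // 16 x 16 threads per block is our assumption
        dim3 grid((N + block.x - 1) / block.x,
                  (N + block.y - 1) / block.y);

        matmul<<<grid, block>>>(d_A, d_B, d_C, N);
        cudaDeviceSynchronize();   // wait for the kernel to finish

        // Transfer the result matrix from VRAM back to host RAM.
        cudaMemcpy(h_C, d_C, (size_t)N * N * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
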




4.  MATRIX MULTIPLICATION ON ALVEO ACCELERATOR CARD

Figure 6 – Internal Architecture of ALVEO

We have used the Alveo U55C accelerator card, which contains 16 GB of second-generation High Bandwidth Memory (HBM2) for performing the required mathematical operations. The Alveo can directly access this memory for fast, chained command execution. The shell section shown in the figure is reserved by the Alveo card to host functionalities such as the PCIe control kernel, the Xilinx Run Time (XRT) driver, and status registers. The rest of the Alveo, outside the shell, is the user-programmable region. In OS