optimizing the green utilization of resources throughout the AI lifecycle. It will promote green energy consumption patterns in computing clusters such as data centers and reduce waste and carbon emissions.


2.3     Future Work

Data Collection: Data collection will be expanded to cover more energy and environmental parameters, as well as more detailed equipment operating information, so as to provide more comprehensive data support.
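
As a minimal illustration, and only under the assumption that the cluster's NVIDIA GPUs expose power and utilization counters through the nvidia-smi command-line tool, the sketch below periodically samples per-GPU power draw, utilization, temperature, and memory use and appends them to a log. The field selection, file path, and sampling interval are illustrative choices, not part of the use case.

    # Illustrative sketch: periodically sample per-GPU power and utilization
    # with nvidia-smi and append the readings to a CSV file. Field choices,
    # file path, and sampling interval are assumptions for illustration only.
    import csv, subprocess, time
    from datetime import datetime, timezone

    QUERY = "index,power.draw,utilization.gpu,temperature.gpu,memory.used"

    def sample_gpus():
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        ts = datetime.now(timezone.utc).isoformat()
        return [[ts] + [v.strip() for v in line.split(",")]
                for line in out.strip().splitlines()]

    def collect(path="gpu_energy_log.csv", interval_s=60):
        with open(path, "a", newline="") as f:
            writer = csv.writer(f)
            while True:
                writer.writerows(sample_gpus())
                f.flush()
                time.sleep(interval_s)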

Single-Node Resource Sharing: Research will focus on more efficient resource-sharing algorithms that expand the scope of AI scheduling, including enhanced GPU virtualization capabilities. The goal is to raise GPU utilization and minimize idle waste while maintaining task performance. Drawing on our operational experience at the Institute of AI, research teams in domains such as Computer Vision (CV), Natural Language Processing (NLP), and Speech have already virtualized a portion of their GPU resources. This GPU-sharing approach has increased the availability of GPUs for prototype development and debugging by at least 150%, without purchasing additional processors.
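
The bookkeeping behind such sharing can be pictured with a simplified model in which tasks request fractions of a GPU's compute and memory and an allocator packs them onto physical devices. The sketch below illustrates that accounting only, using a simple best-fit heuristic; it is not the virtualization layer actually deployed at the Institute of AI.

    # Simplified model of fractional GPU sharing on one node: each task asks
    # for a fraction of GPU compute and memory, and the allocator packs tasks
    # onto physical GPUs as long as both fractions still fit. Illustration of
    # the accounting only, not the virtualization layer described above.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class SharedGPU:
        index: int
        free_compute: float = 1.0   # fraction of compute still unallocated
        free_memory: float = 1.0    # fraction of device memory still unallocated
        tasks: list = field(default_factory=list)

        def fits(self, compute: float, memory: float) -> bool:
            return compute <= self.free_compute and memory <= self.free_memory

        def allocate(self, task_id: str, compute: float, memory: float) -> None:
            self.free_compute -= compute
            self.free_memory -= memory
            self.tasks.append(task_id)

    def place_task(gpus: list[SharedGPU], task_id: str,
                   compute: float, memory: float) -> Optional[int]:
        """Place a fractional-GPU task on the fullest GPU that can still hold
        it, keeping other GPUs whole for larger jobs (best-fit heuristic)."""
        candidates = [g for g in gpus if g.fits(compute, memory)]
        if not candidates:
            return None
        best = min(candidates, key=lambda g: g.free_compute)
        best.allocate(task_id, compute, memory)
        return best.index
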
Single-Cluster Resource Scheduling: Monitoring and task data will be treated as the state space, and scheduling optimization strategies as the action space. Technologies such as reinforcement learning and large language models will be used to build intelligent scheduling strategies that better suit complex and variable task requirements and further improve resource utilization within a single cluster. The de facto standard for cluster coordination is Kubernetes, which is widely adopted by leading AI companies such as OpenAI [12]. Building on Kubernetes, we have open-sourced a scheduling algorithm known as the Fragmentation Gradient Descent (FGD) policy [13]. This algorithm makes scheduling decisions within hundreds of milliseconds on clusters of more than 1,200 servers. Simulations have shown that FGD outperforms traditional bin-packing algorithms in GPU-sharing scenarios, reducing unallocated GPUs by up to 49% [14].
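
The intuition behind FGD can be sketched as follows: a node's fragmentation is the expected amount of its leftover GPU capacity that could not serve a task drawn from the observed demand distribution, and each arriving task is placed on the node whose fragmentation grows the least. The sketch below is a much-simplified illustration of that idea; the open-source policy [13] runs as a Kubernetes scheduler plugin and also accounts for CPU, memory, and GPU models, and the demand distribution used here is assumed for the example.

    # Much-simplified sketch of fragmentation-aware placement in the spirit
    # of the FGD policy [13]: a node's fragmentation is the expected free GPU
    # capacity that cannot serve a task drawn from the demand distribution,
    # and each task goes to the node whose fragmentation increases the least.
    from typing import Optional

    # Assumed demand distribution: (gpu_fraction_requested, probability).
    DEMANDS = [(0.25, 0.4), (0.5, 0.3), (1.0, 0.3)]

    def fragmentation(free_per_gpu: list[float]) -> float:
        """Expected free GPU capacity on this node unusable by a random task."""
        frag = 0.0
        for demand, prob in DEMANDS:
            unusable = sum(f for f in free_per_gpu if f < demand)
            frag += prob * unusable
        return frag

    def fgd_place(nodes: dict[str, list[float]], request: float) -> Optional[str]:
        """Pick the node whose fragmentation grows least if it hosts `request`."""
        best_node, best_delta = None, float("inf")
        for name, free in nodes.items():
            # Hypothetically place the request on the emptiest GPU that fits.
            fitting = [i for i, f in enumerate(free) if f >= request]
            if not fitting:
                continue
            i = max(fitting, key=lambda j: free[j])
            after = free.copy()
            after[i] -= request
            delta = fragmentation(after) - fragmentation(free)
            if delta < best_delta:
                best_node, best_delta = name, delta
        return best_node

    # Example: two nodes, each with two GPUs and some capacity already in use.
    nodes = {"node-a": [0.5, 1.0], "node-b": [0.75, 0.75]}
    print(fgd_place(nodes, 0.5))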

Multi-Cluster Collaborative Scheduling: With the help of technologies such as multi-agent large models, a more intelligent cross-cluster collaboration mechanism will be constructed. Based on the real-time status of each cluster and on task priorities, resources will be allocated optimally across clusters, advancing the overall green and low-carbon goals.
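
One possible shape for such a mechanism, sketched here under assumed field names and weights, is a placement rule that ranks candidate clusters by spare capacity, queue pressure, and the real-time carbon intensity of their local grid, with task priority deciding how strongly speed is traded against carbon.

    # Hypothetical sketch of a cross-cluster placement rule: rank candidate
    # clusters by a weighted mix of spare capacity, queue pressure, and the
    # carbon intensity of their local grid, letting high-priority tasks trade
    # carbon for speed. All field names and weights are assumptions.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ClusterStatus:
        name: str
        free_gpus: int
        queued_jobs: int
        carbon_gco2_per_kwh: float   # real-time grid carbon intensity

    def choose_cluster(clusters: list[ClusterStatus], gpus_needed: int,
                       priority: float) -> Optional[ClusterStatus]:
        """Higher `priority` (0..1) weights speed over carbon footprint."""
        feasible = [c for c in clusters if c.free_gpus >= gpus_needed]
        if not feasible:
            return None

        def cost(c: ClusterStatus) -> float:
            wait = c.queued_jobs / max(c.free_gpus, 1)      # crude wait proxy
            carbon = c.carbon_gco2_per_kwh / 1000.0         # rough normalization
            return priority * wait + (1.0 - priority) * carbon

        return min(feasible, key=cost)

    clusters = [
        ClusterStatus("east", free_gpus=32, queued_jobs=40, carbon_gco2_per_kwh=520),
        ClusterStatus("west", free_gpus=8, queued_jobs=2, carbon_gco2_per_kwh=90),
    ]
    print(choose_cluster(clusters, gpus_needed=4, priority=0.3).name)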

Cluster Scheduling Result Interpretability: Multi-modal generative large models will be utilized to present the scheduling decision-making process to users in an understandable form, enhancing trust in artificial intelligence scheduling results and facilitating subsequent strategy adjustment and optimization.
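
A minimal sketch of this direction, assuming only that a scheduling decision can be serialized and handed to some generative model behind a placeholder ask_model function, is shown below; the record fields are hypothetical and no specific model API is implied.

    # Illustrative sketch: turn a scheduling decision record into a natural-
    # language explanation request for a generative model. The record fields
    # are hypothetical, and `ask_model` is a placeholder for whatever
    # inference endpoint is actually used.
    import json

    def explain_decision(decision: dict, ask_model) -> str:
        prompt = (
            "Explain the following GPU scheduling decision to a cluster user "
            "in plain language, including why the chosen node reduces "
            "fragmentation and energy use:\n" + json.dumps(decision, indent=2)
        )
        return ask_model(prompt)

    decision = {
        "task": "nlp-train-0421",
        "chosen_node": "node-17",
        "alternatives_considered": 42,
        "fragmentation_delta": -0.08,
        "estimated_power_w": 310,
    }

    # Example with a stand-in model that just echoes the prompt length.
    print(explain_decision(decision, lambda p: f"(reply to {len(p)}-char prompt)"))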


3       Use Case Requirements


REQ-01: It is required to monitor multi-dimensional resource usage, profile typical job resource requirements, and collect environmental and infrastructural data from AI clusters.






