Page 220 - AI for Good Innovate for Impact


optimizing the green utilization of resources throughout the AI lifecycle. It will promote green
energy consumption patterns in computing clusters such as data centers and reduce waste
and carbon emissions.

2.3 Future Work

Data Collection: The scope of data collection will be expanded to cover more energy and
environmental parameters, as well as detailed equipment operation information, to provide
more comprehensive data support.

Single-Node Resource Sharing: Research will focus on developing more efficient resource-sharing
algorithms to expand the scope of AI scheduling, including enhancements in GPU
virtualization capabilities. The goal is to raise GPU utilization and minimize idle capacity
without degrading task performance. In our operational experience at the Institute of AI,
research teams across domains such as Computer Vision (CV), Natural Language Processing
(NLP), and Speech have already virtualized a portion of their GPU resources. This GPU-sharing
approach has increased the availability of GPUs for prototype development and debugging
by at least 150%, without the need for additional processor purchases.

Single-Cluster Resource Scheduling: Monitoring and task data will be treated as the
state space, and scheduling optimization strategies as the action space.
Technologies such as reinforcement learning and large language models will be used to
build intelligent scheduling strategies that adapt better to complex and variable
task requirements and further improve resource utilization efficiency within a single
cluster. The de facto standard for cluster coordination is Kubernetes, which is widely adopted
by leading AI companies such as OpenAI [12]. Building on Kubernetes, we have open-sourced
a scheduling algorithm known as the Fragmentation Gradient Descent (FGD) policy [13]. This
algorithm enables highly efficient scheduling decisions within hundreds of milliseconds on
clusters comprising over 1,200 servers. Simulations have demonstrated that FGD outperforms
traditional bin-packing algorithms in GPU-sharing scenarios, reducing unallocated GPUs by
up to 49% [14].
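
The idea behind fragmentation-aware placement can be illustrated with a minimal sketch. It is not the open-sourced FGD implementation: it models GPUs only (no CPU or memory), measures GPU capacity in an assumed milli-GPU unit, and all names (`Node`, `fragmentation`, `schedule`, the `workload` mapping of demand to popularity) are hypothetical. Each candidate placement is scored by how much it increases the expected amount of free GPU capacity that typical tasks could not use, and the placement with the smallest increase is chosen.

```python
from dataclasses import dataclass

FULL_GPU = 1000  # one whole GPU, in milli-GPU units (assumed unit)


@dataclass
class Node:
    name: str
    free: list[int]  # free milli-GPU on each GPU of this node


def can_fit(free, demand):
    """A fractional task needs one GPU with enough room;
    a whole-GPU task needs enough completely free GPUs."""
    if demand < FULL_GPU:
        return any(f >= demand for f in free)
    return sum(1 for f in free if f == FULL_GPU) * FULL_GPU >= demand


def fragmentation(free, demand):
    """Free capacity on the node that a task with this demand cannot use."""
    if not can_fit(free, demand):
        return sum(free)  # nothing on this node is usable by the task
    threshold = min(demand, FULL_GPU)
    return sum(f for f in free if f < threshold)


def expected_fragmentation(free, workload):
    """Popularity-weighted fragmentation; workload maps demand -> probability."""
    return sum(p * fragmentation(free, d) for d, p in workload.items())


def placements(free, demand):
    """Yield every per-GPU free list that results from placing the task."""
    if demand < FULL_GPU:
        for i, f in enumerate(free):
            if f >= demand:
                yield free[:i] + [f - demand] + free[i + 1:]
    else:
        need = demand // FULL_GPU  # assumes demand is a multiple of FULL_GPU
        idx = [i for i, f in enumerate(free) if f == FULL_GPU][:need]
        if len(idx) == need:
            yield [0 if i in idx else f for i, f in enumerate(free)]


def schedule(nodes, demand, workload):
    """Place the task where expected fragmentation grows the least.
    Returns the chosen node name, or None if no node can host the task."""
    best = None
    for node in nodes:
        before = expected_fragmentation(node.free, workload)
        for after in placements(node.free, demand):
            delta = expected_fragmentation(after, workload) - before
            if best is None or delta < best[0]:
                best = (delta, node, after)
    if best is None:
        return None
    _, node, after = best
    node.free = after
    return node.name
```

For example, given a node with two free GPUs and a node with one half-used GPU, a half-GPU task goes to the half-used GPU, because that choice leaves no partial GPU stranded for later whole-GPU tasks.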

Multi-Cluster Collaborative Scheduling: Technologies such as multi-agent large models
will be used to build a more intelligent cross-cluster collaboration mechanism.
Resources will be allocated across clusters according to the real-time status of each
cluster and the priority of each task, advancing the overall green and low-carbon
goals.
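
As a rough illustration of allocating by cluster status and task priority, the sketch below uses a simple greedy rule as a stand-in for the multi-agent mechanism envisioned above. Everything in it is an assumption made for illustration: the `Cluster` and `Task` fields, the use of carbon intensity as the real-time status signal, and the convention that a larger `priority` value means a more urgent task.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    free_gpus: int
    carbon_intensity: float  # gCO2 per kWh; hypothetical real-time status field


@dataclass
class Task:
    name: str
    gpus: int
    priority: int  # larger means more urgent (assumed convention)


def allocate(tasks, clusters):
    """Greedy cross-cluster placement: serve tasks in priority order and
    send each one to the lowest-carbon cluster that still has room.
    Returns a dict mapping task name -> cluster name (None if unplaced)."""
    placement = {}
    for task in sorted(tasks, key=lambda t: -t.priority):
        feasible = [c for c in clusters if c.free_gpus >= task.gpus]
        if not feasible:
            placement[task.name] = None
            continue
        target = min(feasible, key=lambda c: c.carbon_intensity)
        target.free_gpus -= task.gpus  # reserve the capacity
        placement[task.name] = target.name
    return placement
```

A greedy rule like this is easy to reason about but ignores data locality, migration cost, and future arrivals, which is where a learned or agent-based policy would be expected to do better.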

Cluster Scheduling Result Interpretability: Multi-modal generative large models will be
utilized to present the scheduling decision-making process to users in an understandable
form, enhancing trust in artificial intelligence scheduling results and facilitating subsequent
strategy adjustment and optimization.

3 Use Case Requirements

REQ-01: It is required to monitor multi-dimensional resource usage, profile typical job resource
requirements, and collect environmental and infrastructural data from AI clusters.
