                      (continued)

                       Item: Model Training and Fine-Tuning
                       Details: Models include reinforcement learning agents and large language models trained on
                       extensive cluster resource and task operation data to learn optimal scheduling policies.
                       Fine-tuning involves adapting models to dynamic multi-cluster environments, incorporating
                       real-time monitoring data, and improving interpretability through multi-modal generative
                       models. The Fragmentation Gradient Descent (FGD) algorithm is used for efficient GPU-sharing
                       job scheduling, validated through simulation and real cluster deployments (see the sketch
                       after this table).

                       Item: Testbeds or Pilot Deployments
                       Details:
                       •  https://github.com/hkust-adsl/kubernetes-scheduler-simulator [4]
                       •  https://github.com/qzweng/clusterdata/tree/master [5]
                       •  https://www.usenix.org/conference/atc23/presentation/weng [6]
                       •  https://www.usenix.org/conference/nsdi22/presentation/weng [7]
                       •  https://sc20.supercomputing.org/proceedings/tech_paper/tech_paper_pages/pap211.html [8]
                       The links above describe pilot deployments around scheduling GPU-sharing workloads, including
                       a novel measure of fragmentation that statistically quantifies the extent of GPU fragmentation
                       caused by different sources, and Metis, a general-purpose scheduler built on deep reinforcement
                       learning (RL) techniques, among other resources.



                      2      Use Case Description


                       2.1     Description

                      The soaring popularity of generative AI models, such as ChatGPT [9], Midjourney [10], and
                      DeepSeek [11], has enriched our lives and created a wealth of new employment opportunities.
                       However, fueling AI requires substantial computational power and energy, usually supplied
                       by AI clusters: data centers equipped with advanced AI processors such as GPUs.

                      A recent report from the International Telecommunication Union (ITU) on international
                      standards for AI and its environmental impact [12] highlights that energy consumption is driven
                      primarily by the intensive processing required for training and inference of large AI models.
                      These models are so energy-demanding that the computational power needed to support
                      AI's expansion has been doubling around every 100 days since 2010 [13]. Beyond energy
                      requirements, the vast quantities of fresh water needed to cool AI processors further strain local
                      water resources and ecosystems [14]. This underscores the growing environmental footprint
                      of AI and the urgent need to adopt efficient management practices for AI clusters.

                       Intended Use: Use artificial intelligence to intelligently manage and coordinate the
                       scheduling of multiple computing clusters, promoting the green development of technology,
                       improving resource efficiency, and reducing energy consumption and carbon emissions.

                       Problem to Solve: As computing clusters grow in number and scale, how can resources such as
                       energy, computing power, and bandwidth be managed across multiple clusters, and how can
                       large volumes of computing tasks be scheduled and run efficiently?
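
                       No particular algorithm is prescribed here; as a minimal sketch of what coordinated
                       multi-cluster scheduling can look like, the hypothetical Python code below scores candidate
                       clusters by a weighted combination of energy cost, spare accelerator capacity, and
                       data-transfer time, and greedily assigns each task to the best-scoring cluster. The class
                       names, weights, and the greedy policy are illustrative assumptions, not part of this use
                       case's specification.

from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Cluster:
    name: str
    free_gpus: int            # available accelerators
    energy_cost: float        # relative energy/carbon cost per GPU-hour
    bandwidth_gbps: float     # bandwidth to the task's data source

@dataclass
class Task:
    name: str
    gpus: int                 # accelerators requested
    data_gb: float            # input data to transfer

def score(cluster: Cluster, task: Task,
          w_energy: float = 0.5, w_compute: float = 0.3, w_net: float = 0.2) -> float:
    """Higher is better: prefer low energy cost, spare capacity, and short transfer time."""
    if cluster.free_gpus < task.gpus:
        return float("-inf")                       # cluster cannot host the task at all
    transfer_hours = task.data_gb / (cluster.bandwidth_gbps * 450.0)  # ~450 GB/h per Gbps
    return (-w_energy * cluster.energy_cost * task.gpus
            + w_compute * (cluster.free_gpus - task.gpus)
            - w_net * transfer_hours)

def assign(tasks: List[Task], clusters: List[Cluster]) -> Dict[str, Optional[str]]:
    """Greedy assignment: each task goes to the currently best-scoring cluster."""
    placement: Dict[str, Optional[str]] = {}
    for task in tasks:
        best = max(clusters, key=lambda c: score(c, task))
        if score(best, task) == float("-inf"):
            placement[task.name] = None            # no cluster can host it right now
            continue
        best.free_gpus -= task.gpus                # reserve the capacity
        placement[task.name] = best.name
    return placement

if __name__ == "__main__":
    clusters = [Cluster("hydro-dc", 8, energy_cost=0.4, bandwidth_gbps=10.0),
                Cluster("coal-dc", 16, energy_cost=1.0, bandwidth_gbps=40.0)]
    tasks = [Task("train-llm", gpus=8, data_gb=2000.0), Task("infer-batch", gpus=2, data_gb=50.0)]
    print(assign(tasks, clusters))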






