                      (continued)

                       Item: Model Training and Fine-Tuning
                       Details: Models include reinforcement learning agents and large language models trained on
                       extensive cluster resource and task operation data to learn optimal scheduling policies.
                       Fine-tuning involves adapting models to dynamic multi-cluster environments, incorporating
                       real-time monitoring data, and improving interpretability through multi-modal generative
                       models. The Fragmentation Gradient Descent (FGD) algorithm is used for efficient GPU-sharing
                       job scheduling, validated through simulation and real cluster deployments (see the sketch
                       after this table).

                       Item: Testbeds or Pilot Deployments
                       Details:
                       •  https://github.com/hkust-adsl/kubernetes-scheduler-simulator [4]
                       •  https://github.com/qzweng/clusterdata/tree/master [5]
                       •  https://www.usenix.org/conference/atc23/presentation/weng [6]
                       •  https://www.usenix.org/conference/nsdi22/presentation/weng [7]
                       •  https://sc20.supercomputing.org/proceedings/tech_paper/tech_paper_pages/pap211.html [8]
                       The links above describe pilot deployments around scheduling GPU-sharing workloads, including
                       a novel measure of fragmentation that statistically quantifies the extent of GPU fragmentation
                       caused by different sources, and Metis, a general-purpose scheduler built on deep reinforcement
                       learning (RL) techniques, among other resources.



                      2      Use Case Description


                       2.1     Description

                      The soaring popularity of generative AI models, such as ChatGPT [9], Midjourney [10], and
                      DeepSeek [11], has enriched our lives and created a wealth of new employment opportunities.
                       However, fueling AI requires substantial computational power and energy, usually supplied
                       by AI clusters: data centers equipped with advanced AI processors such as GPUs.

                      A recent report from the International Telecommunication Union (ITU) on international
                      standards for AI and its environmental impact [12] highlights that energy consumption is driven
                      primarily by the intensive processing required for training and inference of large AI models.
                      These models are so energy-demanding that the computational power needed to support
                      AI's expansion has been doubling around every 100 days since 2010 [13]. Beyond energy
                      requirements, the vast quantities of fresh water needed to cool AI processors further strain local
                      water resources and ecosystems [14]. This underscores the growing environmental footprint
                      of AI and the urgent need to adopt efficient management practices for AI clusters.

                       Intended Use: Use artificial intelligence to intelligently manage and coordinate the
                       scheduling of multiple computing clusters, promoting the green development of technology,
                       improving resource efficiency, and reducing energy consumption and carbon emissions.

                       Problem to Solve: As computing clusters grow in number and scale, how can resources such as
                       energy, computing power, and bandwidth be managed across multiple clusters, and how can
                       large volumes of computing tasks be scheduled and run efficiently?
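
                       No particular algorithm is prescribed here; as a minimal sketch of what coordinated
                       multi-cluster scheduling can look like, the hypothetical Python code below scores candidate
                       clusters by a weighted combination of energy cost, spare accelerator capacity, and
                       data-transfer time, and greedily assigns each task to the best-scoring cluster. The class
                       names, weights, and the greedy policy are illustrative assumptions, not part of this use
                       case's specification.

from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Cluster:
    name: str
    free_gpus: int            # available accelerators
    energy_cost: float        # relative energy/carbon cost per GPU-hour
    bandwidth_gbps: float     # bandwidth to the task's data source

@dataclass
class Task:
    name: str
    gpus: int                 # accelerators requested
    data_gb: float            # input data to transfer

def score(cluster: Cluster, task: Task,
          w_energy: float = 0.5, w_compute: float = 0.3, w_net: float = 0.2) -> float:
    """Higher is better: prefer low energy cost, spare capacity, and short transfer time."""
    if cluster.free_gpus < task.gpus:
        return float("-inf")                       # cluster cannot host the task at all
    transfer_hours = task.data_gb / (cluster.bandwidth_gbps * 450.0)  # ~450 GB/h per Gbps
    return (-w_energy * cluster.energy_cost * task.gpus
            + w_compute * (cluster.free_gpus - task.gpus)
            - w_net * transfer_hours)

def assign(tasks: List[Task], clusters: List[Cluster]) -> Dict[str, Optional[str]]:
    """Greedy assignment: each task goes to the currently best-scoring cluster."""
    placement: Dict[str, Optional[str]] = {}
    for task in tasks:
        best = max(clusters, key=lambda c: score(c, task))
        if score(best, task) == float("-inf"):
            placement[task.name] = None            # no cluster can host it right now
            continue
        best.free_gpus -= task.gpus                # reserve the capacity
        placement[task.name] = best.name
    return placement

if __name__ == "__main__":
    clusters = [Cluster("hydro-dc", 8, energy_cost=0.4, bandwidth_gbps=10.0),
                Cluster("coal-dc", 16, energy_cost=1.0, bandwidth_gbps=40.0)]
    tasks = [Task("train-llm", gpus=8, data_gb=2000.0), Task("infer-batch", gpus=2, data_gb=50.0)]
    print(assign(tasks, clusters))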






