AI for Good Innovate for Impact



               4.2-Climate Change

               Use Case 8: AI for green multi-cluster: Intelligent management
               towards green and low-carbon, large-scale multi-clusters











               Organization: Institute of Artificial Intelligence (TeleAI), China Telecom
               Country: China

               Contact Person:

                    Primary: Qizhen Weng, wengqzh@chinatelecom.cn
                    Secondary: Yuankai Fan, fanyk1@chinatelecom.cn


               1  Use Case Summary Table

                Item                  Details

                Category              Climate Change/Natural Disaster
                Problem Addressed     Using AI to efficiently manage computing resources and schedule
                                      jobs across multiple large-scale AI clusters. This AI-driven approach
                                      aims to reduce energy consumption and carbon emissions, surpasses
                                      traditional rule-based multi-cluster management methods in
                                      adaptability and effectiveness, and can explain its decisions to
                                      humans via multi-modal generative models.
                Key Aspects of Solution •  AI-powered job scheduling that improves efficiency through
                                         real-time multi-dimensional resource monitoring.
                                      •  Employing GPU multiplexing and shared resource scheduling
                                         algorithms to enhance cluster utilization.
                                      •  Cross-cluster coordination via multiple AI agents towards overall
                                         green and low-carbon goals.
                                      •  Employing multi-modal generative large models to provide
                                         interpretable explanations of scheduling decisions, enhancing
                                         transparency for end-users.

                Technology Keywords   Job Scheduling, Multi-Cluster Management, Multi-Modal Generative
                                      Models, GPU Multiplexing, Kubernetes-based Cluster Coordination

                Data Availability     Private; Public
                                      1)  https://github.com/alibaba/clusterdata [1]
                                      2)  https://github.com/InternLM/AcmeTrace [2]
                                      3)  https://github.com/ml-energy/zeus [3]

                Metadata (Type of Data) Text data (job resource requirements and performance logs, energy
                                      consumption metrics, energy source information, etc.) and images
                                      (cluster metrics represented through graphs)
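
To make the scheduling idea in the table concrete, the following is a minimal sketch of carbon-aware job placement across several clusters. It is not the method used in this use case: a simple greedy heuristic stands in for the AI-driven scheduler, and every name and number here (`Cluster`, `Job`, `carbon_intensity`, `watts_per_gpu`, `place_jobs`) is a hypothetical choice made for illustration. Fractional GPU requests are used as a rough stand-in for GPU multiplexing.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cluster:
    name: str
    free_gpus: float         # fractional values model shared (multiplexed) GPUs
    carbon_intensity: float  # gCO2e per kWh of the cluster's energy source (assumed)
    watts_per_gpu: float     # average power draw of one busy GPU (assumed)


@dataclass
class Job:
    name: str
    gpus: float   # may be fractional, e.g. 0.5 for half of a shared GPU
    hours: float  # expected run time


def estimated_emissions(job: Job, cluster: Cluster) -> float:
    """Estimated gCO2e if `job` runs on `cluster` (simple linear model)."""
    energy_kwh = job.gpus * cluster.watts_per_gpu * job.hours / 1000.0
    return energy_kwh * cluster.carbon_intensity


def place_jobs(jobs: List[Job], clusters: List[Cluster]) -> Dict[str, Optional[str]]:
    """Greedy placement: largest jobs first, each sent to the feasible
    cluster with the lowest estimated emissions. Jobs that fit nowhere
    map to None. Mutates `free_gpus` on the chosen clusters."""
    placement: Dict[str, Optional[str]] = {}
    for job in sorted(jobs, key=lambda j: j.gpus, reverse=True):
        feasible = [c for c in clusters if c.free_gpus >= job.gpus]
        if not feasible:
            placement[job.name] = None
            continue
        best = min(feasible, key=lambda c: estimated_emissions(job, c))
        best.free_gpus -= job.gpus
        placement[job.name] = best.name
    return placement
```

Called with a small low-carbon cluster and a larger high-carbon one, this sketch fills the low-carbon cluster first and spills the remaining jobs over to the other. A learned or multi-agent scheduler, as described in the table, would replace the greedy rule in `place_jobs`.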









