Page 330 - AI for Good Innovate for Impact



                      2      Use Case Description


                      2.1     Description


                      Global network failures occur frequently. Resolving them involves multi-paragraph
                      troubleshooting procedures, poor planning, and long processing times when intelligent
                      agents execute the response, all of which seriously affect user experience. A one-day
                      network outage incurs a loss of 43 billion US dollars.

                      The solution is composed of the following key components:

                      (1)  Cloud-Network operation Agentic AI: Pioneering a series of LLMs for the cloud-network
                           operation, covering key domains including emergency response, change management,
                           and alarm monitoring. Take the emergency response LLM as an example:
                           •  Automated data collection algorithm: Based on historical fault handling reports from
                              emergency incidents, the algorithm extracts data such as fault symptoms (including
                              alarms, logs, phenomenon descriptions, and device configurations), fault automation
                              tools, and fault handling processes. This data is then stored as a knowledge graph to
                              support Retrieval-Augmented Generation (RAG) for agent intent orchestration. 
                           •  Adoption of knowledge distillation (KD): After data cleaning and preprocessing,
                              the DeepSeek-R1 (DS-R1) model is used to fill in the reasoning processes underlying
                              fault localization and handling. The augmented data serves as cold-start fine-tuning
                              material, enabling the Qwen model to generate step-by-step reasoning.
                           •  Model fine-tuning with RL-GRPO: Reinforcement learning with GRPO is applied to
                              further improve the accuracy of intent orchestration. The end result is a
                              Cloud-Network Emergency Reasoning Model (CNERM) with a smaller parameter count
                              that can generate fault-handling thought processes. For the agent intent
                              orchestration scenario, four reward functions were designed to ensure proper
                              model convergence: a step format reward, a step count reward, a step repetition
                              penalty, and an R1 correctness score.
                           •  Multi-turn dialogue strategy implementation: Upon detecting a fault, the intelligent
                              agent automatically generates an optimal fault-handling process using the CNERM.
                              Users can refine this plan through multi-turn dialogues. The DS model extracts
                              parameters required for API calls, guiding users to fill in missing information. For
                              critical steps in fault resolution, such as configuration deployment to key devices,
                              where incorrect operations could cause major outages, manual confirmation is
                              required before execution to ensure 100% accuracy.
                      (2)  Standardization of O&M system interfaces: A universal MCP interface, built on the
                           cloud-network operation low-code platform, connects cloud-network operation data
                           and tools.
                      (3)  Digital employees reduce human uncertainty: Built on the Cloud-Network Agentic AI
                           framework, they monitor over 100,000 network nodes 24/7, replacing manual labor and
                           ensuring consistent, error-free operations.
                      (4)  Large-small model collaboration framework: Rapidly establish an emergency war room
                           to standardize emergency processes. The system automatically creates group chats and
                           invites digital employees, which detect incidents and autonomously invoke emergency
                           response agents. Small models aggregate massive log alarms, while large models
                           perform root cause localization within minutes—achieving the 1-5-10 capability: 1 minute
                           for anomaly detection, 5 minutes for root cause localization, and 10 minutes to resolve
                           faults and restore services. Digital employees handle the entire process automatically and
                           generate intelligent fault-handling reports.
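
The four reward signals named in item (1) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration, since the source does not specify these details: the `Step N: text` plan format, the step-count bounds, the penalty weight, and the use of overlap with a reference plan as a stand-in for the R1 correctness score. All function names are hypothetical.

```python
import re
from typing import List

# Assumed plan format: one step per line, shaped like "Step 3: restart the service".
STEP_PATTERN = re.compile(r"^Step (\d+): (.+)$")


def parse_steps(plan: str) -> List[str]:
    """Return normalized step descriptions from lines matching the assumed format."""
    steps = []
    for line in plan.strip().splitlines():
        match = STEP_PATTERN.match(line.strip())
        if match:
            steps.append(match.group(2).strip().lower())
    return steps


def step_format_reward(plan: str) -> float:
    """1.0 if every non-empty line follows the assumed 'Step N: text' format, else 0.0."""
    lines = [ln.strip() for ln in plan.strip().splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1.0 if all(STEP_PATTERN.match(ln) for ln in lines) else 0.0


def step_count_reward(plan: str, min_steps: int = 2, max_steps: int = 8) -> float:
    """1.0 if the number of steps falls inside the (assumed) acceptable range."""
    count = len(parse_steps(plan))
    return 1.0 if min_steps <= count <= max_steps else 0.0


def step_repetition_penalty(plan: str, weight: float = 0.5) -> float:
    """Negative score proportional to the number of duplicated steps."""
    steps = parse_steps(plan)
    duplicates = len(steps) - len(set(steps))
    return -weight * duplicates


def correctness_score(plan: str, reference: str) -> float:
    """Fraction of reference steps that appear in the plan (stand-in for the R1 score)."""
    reference_steps = parse_steps(reference)
    if not reference_steps:
        return 0.0
    produced = set(parse_steps(plan))
    hits = sum(1 for step in reference_steps if step in produced)
    return hits / len(reference_steps)


def total_reward(plan: str, reference: str) -> float:
    """Unweighted sum of the four signals; real weights would need tuning."""
    return (
        step_format_reward(plan)
        + step_count_reward(plan)
        + step_repetition_penalty(plan)
        + correctness_score(plan, reference)
    )
```

In this sketch, a plan that is well formatted and covers the reference steps but repeats one step would score lower than the same plan without the repetition, which is the kind of pressure the repetition penalty is meant to apply during fine-tuning.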
                      It is critical to realize the whole process control of cloud-network operation emergency events
                      to achieve timely fault discovery, accurate positioning and well-documented disposal. Through





