Page 330 - AI for Good Innovate for Impact


2 Use Case Description

2.1 Description

Global network failures occur frequently. Handling them involves multi-step troubleshooting,
poor planning, and long processing times when intelligent agents execute, all of which seriously
affect user experience. A one-day network outage incurs a loss of 43 billion US dollars.

The solution is composed of the following key components:

(1) Cloud-Network Operation Agentic AI: Pioneering a series of LLMs for cloud-network
operation, covering key domains including emergency response, change management,
and alarm monitoring. Take the emergency response LLM as an example:
• Automated data collection algorithm: Based on historical fault handling reports from
emergency incidents, the algorithm extracts data such as fault symptoms (including
alarms, logs, phenomenon descriptions, and device configurations), fault automation
tools, and fault handling processes. This data is then stored as a knowledge graph to
support Retrieval-Augmented Generation (RAG) for agent intent orchestration.
• Adoption of knowledge distillation (KD) technology: After data cleaning and preprocessing, the DeepSeek-R1
(DS-R1) model is employed to complete the reasoning processes underlying fault
localization and handling. This augmented data serves as cold-start fine-tuning
material, enabling the Qwen model to generate step-by-step reasoning.
• Application of RL-GRPO technology: The model is fine-tuned with reinforcement learning
to further enhance the accuracy of intent orchestration. For the agent intent
orchestration scenario, four reward functions were designed to ensure proper model
convergence: a step format reward, a step count reward, a step repetition penalty,
and an R1 correctness score. The end result is a Cloud-Network Emergency Reasoning
Model (CNERM) with a smaller parameter count, capable of generating fault-handling
thought processes.
• Multi-turn dialogue strategy implementation: Upon detecting a fault, the intelligent
agent automatically generates an optimal fault-handling process using the CNERM.
Users can refine this plan through multi-turn dialogues. The DS model extracts
parameters required for API calls, guiding users to fill in missing information. For
critical steps in fault resolution, such as configuration deployment to key devices,
where incorrect operations could cause major outages, manual confirmation is
required before execution to ensure 100% accuracy.
(2) Standardization of O&M system interfaces: Built on the cloud-network operation
low-code platform, a universal Model Context Protocol (MCP) connects cloud-network
operation data and tools.
(3) Digital employees reduce human uncertainty: Built on the Cloud-Network Agentic AI
framework, they monitor over 100,000 network nodes 24/7, replacing manual labor and
ensuring consistent, error-free operations.
(4) Large-small model collaboration framework: Rapidly establish an emergency war room
to standardize emergency processes. The system automatically creates group chats and
invites digital employees, which detect incidents and autonomously invoke emergency
response agents. Small models aggregate massive log alarms, while large models
perform root cause localization within minutes—achieving the 1-5-10 capability: 1 minute
for anomaly detection, 5 minutes for root cause localization, and 10 minutes to resolve
faults and restore services. Digital employees handle the entire process automatically and
generate intelligent fault-handling reports.
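
As an illustration of the four reward signals listed for agent intent orchestration in item (1), the following is a minimal Python sketch. The plan format ("Step N: action"), the function names, the step-count bounds, the weights, and the exact-match correctness check are all assumptions made for illustration. The source does not specify how each reward is computed, and it describes the correctness score as R1-based, which this sketch replaces with a simple comparison against reference steps.

```python
import re
from typing import List, Tuple

# Hypothetical plan format: one "Step N: action" per line (an assumption).
STEP_PATTERN = re.compile(r"^Step (\d+): (.+)$")


def parse_steps(plan: str) -> List[str]:
    """Return the action text of every well-formed 'Step N: action' line."""
    steps = []
    for line in plan.strip().splitlines():
        match = STEP_PATTERN.match(line.strip())
        if match:
            steps.append(match.group(2).strip())
    return steps


def step_format_reward(plan: str) -> float:
    """Fraction of non-empty lines that follow the assumed step format."""
    lines = [ln.strip() for ln in plan.strip().splitlines() if ln.strip()]
    if not lines:
        return 0.0
    well_formed = sum(1 for ln in lines if STEP_PATTERN.match(ln))
    return well_formed / len(lines)


def step_count_reward(plan: str, min_steps: int = 2, max_steps: int = 8) -> float:
    """1.0 if the number of steps lies within the assumed bounds, else 0.0."""
    n = len(parse_steps(plan))
    return 1.0 if min_steps <= n <= max_steps else 0.0


def step_repetition_penalty(plan: str) -> float:
    """Non-positive value: minus the share of steps that duplicate an earlier one."""
    steps = [s.lower() for s in parse_steps(plan)]
    if not steps:
        return 0.0
    duplicates = len(steps) - len(set(steps))
    return -duplicates / len(steps)


def correctness_score(plan: str, reference_steps: List[str]) -> float:
    """Share of reference steps found in the plan.

    Stand-in for the R1 correctness score named in the text; the real
    scoring method is not described in the source.
    """
    if not reference_steps:
        return 0.0
    steps = {s.lower() for s in parse_steps(plan)}
    hits = sum(1 for ref in reference_steps if ref.lower() in steps)
    return hits / len(reference_steps)


def total_reward(
    plan: str,
    reference_steps: List[str],
    weights: Tuple[float, float, float, float] = (0.2, 0.2, 0.2, 0.4),
) -> float:
    """Weighted sum of the four signals; the weights are illustrative only."""
    w_fmt, w_cnt, w_rep, w_cor = weights
    return (
        w_fmt * step_format_reward(plan)
        + w_cnt * step_count_reward(plan)
        + w_rep * step_repetition_penalty(plan)
        + w_cor * correctness_score(plan, reference_steps)
    )


if __name__ == "__main__":
    plan = "Step 1: check alarms\nStep 2: isolate device\nStep 3: check alarms"
    print(total_reward(plan, ["check alarms", "isolate device"]))
```

In this sketch the repetition penalty is the only negative term, so a plan that repeats a step scores lower than an otherwise identical plan without the repeat. A group-relative method such as GRPO would compare such scalar rewards across several sampled plans for the same fault.
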
End-to-end control of cloud-network operation emergency events is critical for
achieving timely fault discovery, accurate localization, and well-documented handling. Through
