Work item: Q.NMP-LLMT
Subject/title: Network monitoring parameters for large language model training
Status: Under study
Approval process: -
Type of work item: Recommendation
Version: New
Equivalent number: -
Timing: Q1-2027 (Medium priority)
Liaison: ITU-T SG13, ITU-T SG21
Supporting members: State Grid Corporation of China; CAICT; China Telecom; China Unicom
Summary:
In the large language model (LLM) training scenario, as parameter scales grow toward the trillion level, the demand for computing power rises sharply. Distributed training techniques are commonly employed to partition models and data, which requires high-speed interconnection of GPU clusters on the scale of tens of thousands of units. The high-performance network that connects these clusters directly determines the communication efficiency among intelligent computing nodes, and thereby the overall throughput and performance of the cluster. This proposal introduces a series of parameters for monitoring network performance during LLM training, with the aim of improving LLM training efficiency.
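For illustration only, the following Python sketch shows how a monitoring agent might periodically sample per-link network parameters of the kind this work item would standardize. All metric names, value ranges, and the congestion thresholds below are assumptions made for the example; they are not the parameter set defined in Q.NMP-LLMT, and the random values stand in for a real telemetry source such as switch or NIC counters.

# Hypothetical sketch of per-link monitoring during distributed LLM training.
# Field names and thresholds are illustrative, not the Recommendation's set.
from dataclasses import dataclass
import random  # stands in for a real telemetry source (e.g. switch counters)
import time

@dataclass
class LinkMetrics:
    link_id: str
    bandwidth_utilization: float  # fraction of link capacity in use, 0.0-1.0
    rtt_us: float                 # round-trip time in microseconds
    packet_loss_rate: float       # fraction of packets dropped
    ecn_marked_ratio: float       # fraction of packets ECN-marked (congestion)

def sample_link(link_id: str) -> LinkMetrics:
    """Poll one fabric link; random values stand in for real counters."""
    return LinkMetrics(
        link_id=link_id,
        bandwidth_utilization=random.uniform(0.3, 0.95),
        rtt_us=random.uniform(5.0, 50.0),
        packet_loss_rate=random.uniform(0.0, 1e-4),
        ecn_marked_ratio=random.uniform(0.0, 0.1),
    )

def monitor(links, interval_s=1.0, rounds=3):
    """Sample every link each round and flag likely congestion hotspots."""
    for _ in range(rounds):
        for link in links:
            m = sample_link(link)
            congested = m.ecn_marked_ratio > 0.05 or m.packet_loss_rate > 1e-5
            print(f"{m.link_id}: util={m.bandwidth_utilization:.2f} "
                  f"rtt={m.rtt_us:.1f}us congested={congested}")
        time.sleep(interval_s)

if __name__ == "__main__":
    monitor([f"spine0-leaf{i}" for i in range(4)])

In practice such metrics would be exported from the interconnect fabric itself and correlated with collective-communication phases of the training job, so that congestion on a link can be tied to the training steps it slows down.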
Comment: -
Reference(s):
Historic references:
Contact(s):
ITU-T A.5 justification(s):
First registration in the WP: 2025-12-19 12:43:47
Last update: 2025-12-19 12:54:57