Page 696 - AI for Good Innovate for Impact
need to be output in a structured data format (JSON schema with keywords/subject labels
and a confidence level).
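As an illustration of the structured output described above, the following minimal Python sketch serializes keyword/subject labels with a confidence level to JSON. The field names (`keyword`, `confidence`, `labels`) and the [0, 1] confidence range are assumptions made for this example; the requirement does not fix a schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Label:
    keyword: str        # keyword or subject label (hypothetical field name)
    confidence: float   # confidence level, assumed here to lie in [0, 1]


def to_structured_output(labels: List[Label]) -> str:
    """Serialize labels to a JSON string, rejecting out-of-range confidences."""
    for label in labels:
        if not 0.0 <= label.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {label.confidence}")
    return json.dumps({"labels": [asdict(label) for label in labels]},
                      ensure_ascii=False)


if __name__ == "__main__":
    print(to_structured_output([Label("birthday", 0.94), Label("fireworks", 0.81)]))
```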
REQ-02: High Fidelity Motion Video Generation[1]
The system must generate 20-second videos (720p at 25 fps, H.264 encoding) using a fused
Diffusion + Diffusion Transformer (DiT) architecture that implements an end-to-end generation
pipeline[2]. Technical requirements include: (1) a layered diffusion strategy (Latent Diffusion)
to reduce computational complexity; (2) an optical-flow Long Short-Term Memory (LSTM) module
to ensure inter-frame coherence (Structural Similarity Index Measure (SSIM) ≥ 0.85); and
(3) integrated audio-video synchronization controllers (based on Mel-Frequency Cepstral
Coefficient (MFCC) feature alignment). The generation process must meet real-time constraints
(end-to-end delay ≤ 10 seconds per video), and the output must pass automated validation
(content relevance score ≥ 90%, evaluated with the Tencent Video Multimethod Assessment
Fusion (VMAF) tool).[3]
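The numeric thresholds in REQ-02 can be collected into a single acceptance check. The sketch below is only an illustration: it assumes the metrics (frame count, lowest inter-frame SSIM, end-to-end delay, relevance score) have already been measured elsewhere, and the names `ClipMetrics` and `req02_violations` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Thresholds copied from REQ-02; all names below are illustrative.
DURATION_S = 20
FPS = 25
MIN_SSIM = 0.85
MAX_DELAY_S = 10.0
MIN_RELEVANCE = 0.90


@dataclass
class ClipMetrics:
    frame_count: int
    min_interframe_ssim: float  # lowest SSIM between consecutive frames
    delay_s: float              # measured end-to-end generation delay
    relevance: float            # content relevance score in [0, 1]


def req02_violations(m: ClipMetrics) -> List[str]:
    """Return the REQ-02 thresholds a clip fails (empty list means pass)."""
    problems = []
    expected_frames = DURATION_S * FPS  # 20 s * 25 fps = 500 frames
    if m.frame_count != expected_frames:
        problems.append(f"frame count {m.frame_count} != {expected_frames}")
    if m.min_interframe_ssim < MIN_SSIM:
        problems.append(f"SSIM {m.min_interframe_ssim} < {MIN_SSIM}")
    if m.delay_s > MAX_DELAY_S:
        problems.append(f"delay {m.delay_s} s > {MAX_DELAY_S} s")
    if m.relevance < MIN_RELEVANCE:
        problems.append(f"relevance {m.relevance} < {MIN_RELEVANCE}")
    return problems
```

A clip with 500 frames, a minimum SSIM of 0.91, a delay of 8.2 seconds, and a relevance score of 0.93 yields an empty list, so it passes every threshold.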
REQ-03: Operator-level Deployment Compatibility
The solution must be compatible with the technical specifications of the video ringtone
platforms of China's three major operators (China Telecom, China Mobile, and China Unicom).
The core capabilities include: (1) a dynamic transcoding engine that converts input video to
each operator's target format (e.g., H.265 at a 2 Mbps bit rate for China Mobile); (2) a
terminal adaptation layer that provides native playback on Android 9+ (ExoPlayer Software
Development Kit (SDK)) and iOS 13+ (AVFoundation); and (3) a Quality of Service (QoS)
guarantee mechanism: first-frame loading time ≤ 1.5 seconds (on a network with a 100 ms
Round-Trip Time (RTT)) and frame loss rate ≤ 0.1%.
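One possible shape for the dynamic transcoding step is sketched below, assuming the engine is driven by the ffmpeg command-line tool (an assumption; the requirement does not name a transcoder). Only the China Mobile example profile comes from the text; the `PROFILES` table, the `transcode_command` helper, and the audio pass-through choice are hypothetical.

```python
from typing import Dict, List

# Only the China Mobile example (H.265, 2 Mbps) is taken from REQ-03.
# Profiles for the other operators would have to come from their own specs.
PROFILES: Dict[str, Dict[str, str]] = {
    "china_mobile": {"codec": "libx265", "bitrate": "2M"},
}


def transcode_command(src: str, dst: str, operator: str) -> List[str]:
    """Build (but do not run) an ffmpeg command for the operator's profile."""
    profile = PROFILES[operator]  # raises KeyError if no profile is defined
    return [
        "ffmpeg", "-i", src,
        "-c:v", profile["codec"],    # video encoder
        "-b:v", profile["bitrate"],  # target video bit rate
        "-c:a", "copy",              # assumption: audio is passed through as-is
        dst,
    ]


if __name__ == "__main__":
    print(" ".join(transcode_command("input.mp4", "output.mp4", "china_mobile")))
```

The returned list can be handed to `subprocess.run` unchanged, which avoids shell quoting issues with file names.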
4 Sequence Diagram

