Metadata (Type of Data): Visual
Model Training and Fine-Tuning: In the initialization phase, a 3D body model is created through Parametric Human Model Creation, using the SMPL-X [2] parametric human model. To create the SMPL-X model from image sequences, we use the method in [11]. The initial body model is shared with the receiving end before transmission. For human motion capture, we extract temporally consistent 3D human pose and shape from monocular video with enhanced spatio-temporal context, by extracting body-aware deep features [3] from individual frames and simultaneously predicting initial per-frame estimates of body pose, shape, and camera pose using a standard method [5]. The final motion-capture output is presented in JSON format and transmitted for rendering at the other end (see the sketch after this table).
Testbeds or Pilot Deployments: An initial PoC and the related experience are published in [1] in the references section.
Code repositories: N/A
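The table states only that the per-frame capture output is serialized as JSON before transmission. A minimal sketch of what such a message could look like, assuming each frame is reduced to the SMPL-X root orientation, root translation, and 21 axis-angle body-joint rotations; the field names, rounding, and one-message-per-frame granularity are illustrative assumptions, not taken from the source:

```python
import json

import numpy as np


def encode_pose_frame(frame_idx: int,
                      global_orient: np.ndarray,  # (3,) root orientation, axis-angle
                      transl: np.ndarray,         # (3,) root translation
                      body_pose: np.ndarray) -> bytes:  # (21, 3) SMPL-X body joints
    """Pack one frame of motion-capture output into a compact JSON message.

    Shape parameters (betas) are deliberately absent: per the table, the
    initial body model is shared with the receiving end before transmission,
    so only per-frame pose needs to travel.
    """
    msg = {
        "frame": frame_idx,
        "global_orient": np.round(global_orient, 4).tolist(),
        "transl": np.round(transl, 4).tolist(),
        "body_pose": np.round(body_pose, 4).ravel().tolist(),
    }
    return json.dumps(msg, separators=(",", ":")).encode("utf-8")


# Dummy frame, useful only to gauge message size.
payload = encode_pose_frame(0, np.zeros(3), np.zeros(3), np.zeros((21, 3)))
print(f"{len(payload)} bytes per frame")
```

Even with rounding, such a message stays in the hundreds of bytes; at an illustrative 30 frames per second that amounts to tens of kilobits per second, which is what makes pose-level semantics so much lighter than video or volumetric streams.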
2 Use Case Description
2.1 Description
A lack of good teachers and expert trainers is one of the hindrances to quality education and training in many developing and underdeveloped nations [5]. The present use case addresses this gap using AI-native semantic communications. Remote teaching using telepresence robots has been considered, but it suffers from infrastructure costs and challenges [6], lacks a human touch, and does not allow the exchange of non-verbal body cues the way realistic face-to-face communication does. Demand is therefore shifting towards 3D telepresence [7], with the expectation of bringing more realism into teacher-student interaction. 3D holoportation, as described in [8], is not scalable, requires huge bandwidth, and is prohibitively costly for any democratized usage because of its dedicated infrastructure requirements. Inspired by the forward-looking education roadmaps of countries like India, which embrace augmented reality (AR) and virtual reality (VR) glasses [9], we therefore propose the following solution, described through a remote-teaching scenario.
As shown in Fig. 1, the teacher stands in front of a simple RGB camera attached to a computer that streams live visuals and audio of the teacher to an edge computer natively integrated with the 5G/6G network service. The edge computer runs an AI algorithm that extracts the teacher's 3D body posture in real time at a specified framerate. The extracted posture is encoded as semantic information of the body pose and transmitted over the network to the remote school's computing device, which is connected to the students' VR glasses. The computing device at the school runs a local AR engine that is preloaded with a parametric 3D avatar of the teacher. The semantic information of the live body posture received by the computing device in the school is decoded and transferred to the 3D avatar. As a result, each student sees the remote teacher's avatar in situ, as if the teacher were present in the classroom.
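On the school side, the decode-and-retarget step just described could be sketched as follows. The UDP transport, port number, and `AvatarStub` interface are assumptions made for illustration; the source specifies only that the received semantic pose information is decoded and transferred to the preloaded 3D avatar of the teacher.

```python
import json
import socket


class AvatarStub:
    """Stand-in for the local AR engine's avatar handle (hypothetical API)."""

    def apply_pose(self, global_orient, transl, body_pose):
        # Retarget the decoded parameters onto the preloaded SMPL-X avatar.
        ...


def receive_semantic_stream(avatar: AvatarStub,
                            host: str = "0.0.0.0",
                            port: int = 9000) -> None:
    """Blocking receive loop: one JSON pose frame per datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        data, _addr = sock.recvfrom(4096)
        msg = json.loads(data)
        avatar.apply_pose(msg["global_orient"], msg["transl"], msg["body_pose"])


# receive_semantic_stream(AvatarStub())  # run on the school's computing device
```

A datagram transport is assumed here because a late pose frame is better dropped than replayed; the use case itself does not prescribe a protocol.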
We call this ‘Semantic Live Streaming’. Since the teacher does not need to see the students in
3D, a camera attached to the computing device in the school can transmit back the view of the
class to the teacher through conventional real-time streaming. The teacher does not need a

