
has been developed to simultaneously measure the effect of the system on socio-emotional parameters such as presence or empathy, and on technical aspects such as visual quality [17].

Although 360 video technology has obvious limitations, since it only provides three Degrees of Freedom (DoF) of movement, it is enough to elicit a high degree of spatial and social presence [17]. This sense of presence increases if the user feels part of the scene, that is, if the remote participants address him or her. In a study with elderly people with neuronal degradation, we have also seen that this high sense of presence is maintained even in the face of processes of cognitive degradation, severe dependency, or symptoms of depression [18].
                                                              The use of video-based avatars provides very high levels of
           2.2  Move: embodied interaction                    spatial presence and self-presence, significantly improving
                                                              on solutions based on VR controllers [22], and even those
In commercial VR solutions, the representation of the person within the immersive environment (the avatar) is usually implemented through virtual hands. Each virtual hand mimics the movement of the real hand, either when it interacts with a VR game controller or by using hand detection and tracking algorithms on the images from the cameras integrated in the Head-Mounted Display (HMD). As an alternative, we propose the use of video-based avatars², where a camera integrated in the HMD captures video egocentrically (i.e. from the point of view of the user), and the silhouette of the hand, or of the body itself, is detected and segmented to integrate it into the immersive scene [19], as shown in Figure 3.

Figure 3 – Video-based avatars. Left: A person wearing an HMD with an attached egocentric camera, so that her body is captured by the camera, segmented, and integrated within a virtual environment. Right: The virtual environment as perceived by the HMD user, where she can see her own body.
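
As a minimal illustration of this segment-and-composite flow, the following Python/OpenCV sketch thresholds the egocentric camera image to obtain the silhouette and overlays it on the rendered virtual frame. The HSV range and function names are illustrative placeholders, not values taken from the prototype.

    # Sketch (not the prototype's actual code) of the per-frame flow for a
    # video-based avatar: threshold the egocentric camera image to get the
    # user's silhouette, then composite it over the rendered virtual scene.
    import cv2
    import numpy as np

    # Hypothetical HSV range for the body/skin pixels; a real system would
    # calibrate these values (or use the deep-learning variant shown later).
    LOWER_HSV = np.array([0, 40, 60], dtype=np.uint8)
    UPPER_HSV = np.array([25, 255, 255], dtype=np.uint8)

    def segment_silhouette(camera_bgr):
        """Return a binary mask (255 = user) via simple color thresholding."""
        hsv = cv2.cvtColor(camera_bgr, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, LOWER_HSV, UPPER_HSV)
        kernel = np.ones((5, 5), np.uint8)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove speckles
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
        return mask

    def composite(virtual_bgr, camera_bgr, mask):
        """Overlay the segmented body pixels onto the rendered virtual frame."""
        out = virtual_bgr.copy()
        out[mask > 0] = camera_bgr[mask > 0]
        return out

The same compositing step applies regardless of how the mask is produced, which is what allows the segmentation algorithm to be swapped freely, as described next.
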
To validate our proposal, we have developed a prototype in which we integrate an HMD with a stereoscopic camera, properly calibrated, so that the image captured by the camera is displayed in the correct position within the immersive environment. This image, in turn, is sent to a server at the edge, where the image segmentation algorithm runs. The final result is sent back to the HMD and projected onto the virtual scene. The system works with a photon-to-photon latency below 100 milliseconds when using transmission over WiFi without collisions with other users [20].
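
A client-side view of this offload loop could look like the sketch below, assuming a hypothetical HTTP endpoint on the edge server that accepts a JPEG-encoded frame and returns the segmentation mask as a PNG. The URL, encodings, and timeout are illustrative assumptions, and the measured round trip covers only the network and processing share of the photon-to-photon budget; camera capture and display scan-out add the rest.

    # Client-side sketch of the edge offload loop, assuming a hypothetical
    # HTTP endpoint that takes a JPEG frame and returns the mask as PNG.
    # The URL, encodings, and timeout are illustrative, not from the paper.
    import time
    import cv2
    import numpy as np
    import requests

    EDGE_URL = "http://edge.example.net/segment"  # placeholder address

    def offload_segmentation(frame_bgr):
        """Send one frame to the edge; return (mask, round_trip_seconds)."""
        ok, jpeg = cv2.imencode(".jpg", frame_bgr)
        if not ok:
            raise RuntimeError("JPEG encoding failed")
        t0 = time.perf_counter()
        resp = requests.post(EDGE_URL, data=jpeg.tobytes(),
                             headers={"Content-Type": "image/jpeg"},
                             timeout=0.2)  # keep within the latency budget
        rtt = time.perf_counter() - t0
        mask = cv2.imdecode(np.frombuffer(resp.content, np.uint8),
                            cv2.IMREAD_GRAYSCALE)
        return mask, rtt
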

This architecture allows us to deploy different algorithms for the processing of the scene without increasing the computational load on the HMD. Thus, for the segmentation of the person's silhouette, both simple color-based algorithms and algorithms based on semantic segmentation by deep learning [21] have been used. It is also possible to include in the scene some objects from the local environment with which the user can interact, such as keyboards [22], pens to take notes [17], or tools to perform physical tasks [23].
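
For the deep-learning variant, a pretrained semantic segmentation network can produce the silhouette mask while the rest of the pipeline stays unchanged. The sketch below uses torchvision's DeepLabV3 purely as a stand-in, since the specific model used in [21] is not detailed here.

    # Sketch of the deep-learning variant, using torchvision's DeepLabV3 as
    # a stand-in model (the paper does not name the network used in [21]).
    import torch
    from torchvision.models.segmentation import (
        deeplabv3_mobilenet_v3_large, DeepLabV3_MobileNet_V3_Large_Weights)

    weights = DeepLabV3_MobileNet_V3_Large_Weights.DEFAULT
    model = deeplabv3_mobilenet_v3_large(weights=weights).eval()
    preprocess = weights.transforms()

    @torch.no_grad()
    def person_mask(frame_rgb):
        """Binary mask (255 = person) for an H x W x 3 uint8 RGB frame."""
        chw = torch.from_numpy(frame_rgb).permute(2, 0, 1)  # HWC -> CHW
        logits = model(preprocess(chw).unsqueeze(0))["out"][0]
        labels = logits.argmax(0)  # per-pixel class ids
        return (labels == 15).byte().mul(255).numpy()  # 15 = "person" (VOC)

Note that the returned mask is at the network's working resolution, so it would be resized back to the camera frame before the compositing step shown earlier.
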
The use of video-based avatars provides very high levels of spatial presence and self-presence, significantly improving on solutions based on VR controllers [22], and even on those based on hand tracking [20]. It should also be borne in mind that the implementation of video-based avatars is technically more complex, and there is still room for improvement in both execution time and segmentation accuracy. Additionally, the segmentation of elements of the local scene allows interaction with physical objects. This interaction can be used, for example, in training programs that require some manual work [23].

2.3 Face: visual communication

The combination of the two elements shown so far, Visit and Move, allows a user to feel present in a remote environment and to interact both with the people who are there and with the objects in their environment. The next step is to add other potential remote users to the scene, who will be represented by their avatars [16]. In the distributed reality context, these avatars should be real-time representations of other users, typically captured by an array of 3D depth-sensing cameras³ and rendered as photo-realistic avatars.

Figure 4 – Alternatives for the implementation of real-time capture and representation of avatars, from capture (complexity) through transport (bit rate) to rendering (complexity):
  Multi-camera View+Depth  →  RGB+D Video     →  Free-Viewpoint Video Rendering
  Point Cloud Capture      →  PCC Video       →  Point Cloud Rendering
  Digital Person Modeling  →  Animation Data  →  Digital Person Animation
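
To make the point-cloud row concrete: since a view+depth camera provides a distance estimate for every color pixel, a single frame can be back-projected into a colored point cloud given the camera intrinsics. The sketch below assumes a simple pinhole model with placeholder intrinsics; a real system would use the calibration of each camera in the array and then compress the fused cloud (the PCC step above) before transmission.

    # Sketch: back-project one view+depth frame into a colored point cloud
    # using a pinhole camera model. The intrinsics below are placeholders;
    # a real system would use the calibration of each camera in the array.
    import numpy as np

    FX, FY = 525.0, 525.0  # hypothetical focal lengths (pixels)
    CX, CY = 319.5, 239.5  # hypothetical principal point (pixels)

    def backproject(rgb, depth_m):
        """Map H x W color + H x W depth (meters) to N x 3 points and colors."""
        h, w = depth_m.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        valid = depth_m > 0  # skip pixels with no depth estimate
        z = depth_m[valid]
        x = (u[valid] - CX) * z / FX  # pinhole back-projection
        y = (v[valid] - CY) * z / FY
        return np.stack([x, y, z], axis=1), rgb[valid]
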
There are different approaches to capturing and transmitting these types of avatars. Each of them has different requirements in terms of the computational complexity of the capture and rendering process, as well as the bit rate of

² We use avatar to describe the representation of the user within the immersive environment, regardless of whether this representation is generated as Computer-Graphic Imagery (CGI) animation or is a video capture of the user inserted in the scene. We use the term video-based avatar for the latter case.

³ 3D depth-sensing cameras, or view+depth cameras, capture, for each color pixel, an estimate of the distance of that pixel to the camera (i.e. its depth).



