
has been developed to simultaneously measure the effect of the system on socio-emotional parameters such as presence or empathy, and on technical aspects such as visual quality [17].

Although 360 video technology has obvious limitations, since it only provides three Degrees of Freedom (DoF) of movement, it is enough to elicit a high degree of spatial and social presence [17]. This sense of presence increases if the user feels part of the scene, that is, if the remote participants address him or her. In a study with elderly people with neuronal degradation, we have also seen that this high sense of presence is maintained even in the face of processes of cognitive degradation, severe dependency, or symptoms of depression [18].
                                                              The use of video-based avatars provides very high levels of
           2.2  Move: embodied interaction                    spatial presence and self-presence, significantly improving
                                                              on solutions based on VR controllers [22], and even those
In commercial VR solutions, the representation of the person within the immersive environment (the avatar) is usually implemented through virtual hands. Each virtual hand mimics the movement of the real hand, either when it interacts with a VR game controller or by using hand detection and tracking algorithms on the images from the cameras integrated in the Head-Mounted Display (HMD). As an alternative, we propose the use of video-based avatars², where a camera integrated in the HMD captures video egocentrically (i.e. from the point of view of the user), and the silhouette of the hand, or of the body itself, is detected and segmented to integrate it into the immersive scene [19], as shown in Figure 3.

Figure 3 – Video-based avatars. Left: A person wearing an HMD with an attached egocentric camera, so that her body is captured by the camera, segmented, and integrated within a virtual environment. Right: The virtual environment as perceived by the HMD user, where she can see her own body.
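
As a minimal illustration of this segment-and-composite flow, the following Python/OpenCV sketch thresholds the egocentric camera image to obtain the silhouette and overlays it on the rendered virtual frame. The HSV range and function names are illustrative placeholders, not values taken from the prototype.

    # Sketch (not the prototype's actual code) of the per-frame flow for a
    # video-based avatar: threshold the egocentric camera image to get the
    # user's silhouette, then composite it over the rendered virtual scene.
    import cv2
    import numpy as np

    # Hypothetical HSV range for the body/skin pixels; a real system would
    # calibrate these values (or use the deep-learning variant shown later).
    LOWER_HSV = np.array([0, 40, 60], dtype=np.uint8)
    UPPER_HSV = np.array([25, 255, 255], dtype=np.uint8)

    def segment_silhouette(camera_bgr):
        """Return a binary mask (255 = user) via simple color thresholding."""
        hsv = cv2.cvtColor(camera_bgr, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, LOWER_HSV, UPPER_HSV)
        kernel = np.ones((5, 5), np.uint8)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove speckles
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
        return mask

    def composite(virtual_bgr, camera_bgr, mask):
        """Overlay the segmented body pixels onto the rendered virtual frame."""
        out = virtual_bgr.copy()
        out[mask > 0] = camera_bgr[mask > 0]
        return out

The same compositing step applies regardless of how the mask is produced, which is what allows the segmentation algorithm to be swapped freely, as described next.
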
To validate our proposal, we have developed a prototype in which we integrate an HMD with a stereoscopic camera, properly calibrated, so that the image captured by the camera is displayed in the correct position within the immersive environment. This image, in turn, is sent to a server at the edge, where the image segmentation algorithm runs. The final result is sent back to the HMD and projected onto the virtual scene. The system works with a photon-to-photon latency below 100 milliseconds when using transmission over WiFi without collisions with other users [20].
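
A client-side view of this offload loop could look like the sketch below, assuming a hypothetical HTTP endpoint on the edge server that accepts a JPEG-encoded frame and returns the segmentation mask as a PNG. The URL, encodings, and timeout are illustrative assumptions, and the measured round trip covers only the network and processing share of the photon-to-photon budget; camera capture and display scan-out add the rest.

    # Client-side sketch of the edge offload loop, assuming a hypothetical
    # HTTP endpoint that takes a JPEG frame and returns the mask as PNG.
    # The URL, encodings, and timeout are illustrative, not from the paper.
    import time
    import cv2
    import numpy as np
    import requests

    EDGE_URL = "http://edge.example.net/segment"  # placeholder address

    def offload_segmentation(frame_bgr):
        """Send one frame to the edge; return (mask, round_trip_seconds)."""
        ok, jpeg = cv2.imencode(".jpg", frame_bgr)
        if not ok:
            raise RuntimeError("JPEG encoding failed")
        t0 = time.perf_counter()
        resp = requests.post(EDGE_URL, data=jpeg.tobytes(),
                             headers={"Content-Type": "image/jpeg"},
                             timeout=0.2)  # keep within the latency budget
        rtt = time.perf_counter() - t0
        mask = cv2.imdecode(np.frombuffer(resp.content, np.uint8),
                            cv2.IMREAD_GRAYSCALE)
        return mask, rtt
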

This architecture allows us to deploy different algorithms for the processing of the scene without increasing the computational load on the HMD. Thus, for the segmentation of the person's silhouette, both simple color-based algorithms and algorithms based on semantic segmentation by deep learning [21] have been used. It is also possible to include in the scene some objects from the local environment with which the user can interact, such as keyboards [22], pens to take notes [17], or tools to perform physical tasks [23].
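
For the deep-learning variant, a pretrained semantic segmentation network can produce the silhouette mask while the rest of the pipeline stays unchanged. The sketch below uses torchvision's DeepLabV3 purely as a stand-in, since the specific model used in [21] is not detailed here.

    # Sketch of the deep-learning variant, using torchvision's DeepLabV3 as
    # a stand-in model (the paper does not name the network used in [21]).
    import torch
    from torchvision.models.segmentation import (
        deeplabv3_mobilenet_v3_large, DeepLabV3_MobileNet_V3_Large_Weights)

    weights = DeepLabV3_MobileNet_V3_Large_Weights.DEFAULT
    model = deeplabv3_mobilenet_v3_large(weights=weights).eval()
    preprocess = weights.transforms()

    @torch.no_grad()
    def person_mask(frame_rgb):
        """Binary mask (255 = person) for an H x W x 3 uint8 RGB frame."""
        chw = torch.from_numpy(frame_rgb).permute(2, 0, 1)  # HWC -> CHW
        logits = model(preprocess(chw).unsqueeze(0))["out"][0]
        labels = logits.argmax(0)  # per-pixel class ids
        return (labels == 15).byte().mul(255).numpy()  # 15 = "person" (VOC)

Note that the returned mask is at the network's working resolution, so it would be resized back to the camera frame before the compositing step shown earlier.
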
The use of video-based avatars provides very high levels of spatial presence and self-presence, significantly improving on solutions based on VR controllers [22], and even on those based on hand tracking [20]. It should also be borne in mind that the implementation of video-based avatars is technically more complex, and there is still room for improvement in both execution time and segmentation accuracy. Additionally, the segmentation of elements of the local scene allows interaction with physical objects. This interaction can be used, for example, in training programs that require some manual work [23].

2.3 Face: visual communication

The combination of the two elements shown so far, Visit and Move, allows a user to feel present in a remote environment and to interact both with the people who are there and with the objects in their environment. The next step is to add other potential remote users to the scene, who will be represented by their avatars [16]. In the distributed reality context, these avatars should be real-time representations of other users, typically captured by an array of 3D depth-sensing cameras³ and rendered as photo-realistic avatars.

Figure 4 – Alternatives for the implementation of real-time capture and representation of avatars, from capture (complexity) through transport (bit rate) to rendering (complexity):
  Multi-camera View+Depth  →  RGB+D Video     →  Free-Viewpoint Video Rendering
  Point Cloud Capture      →  PCC Video       →  Point Cloud Rendering
  Digital Person Modeling  →  Animation Data  →  Digital Person Animation
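
To make the point-cloud row concrete: since a view+depth camera provides a distance estimate for every color pixel, a single frame can be back-projected into a colored point cloud given the camera intrinsics. The sketch below assumes a simple pinhole model with placeholder intrinsics; a real system would use the calibration of each camera in the array and then compress the fused cloud (the PCC step above) before transmission.

    # Sketch: back-project one view+depth frame into a colored point cloud
    # using a pinhole camera model. The intrinsics below are placeholders;
    # a real system would use the calibration of each camera in the array.
    import numpy as np

    FX, FY = 525.0, 525.0  # hypothetical focal lengths (pixels)
    CX, CY = 319.5, 239.5  # hypothetical principal point (pixels)

    def backproject(rgb, depth_m):
        """Map H x W color + H x W depth (meters) to N x 3 points and colors."""
        h, w = depth_m.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        valid = depth_m > 0  # skip pixels with no depth estimate
        z = depth_m[valid]
        x = (u[valid] - CX) * z / FX  # pinhole back-projection
        y = (v[valid] - CY) * z / FY
        return np.stack([x, y, z], axis=1), rgb[valid]
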
There are different approaches to capturing and transmitting these types of avatars. Each of them has different requirements in terms of the computational complexity of the capture and rendering process, as well as the bit rate of

² We use avatar to describe the representation of the user within the immersive environment, regardless of whether this representation is generated as Computer-Graphic Imagery (CGI) animation or is a video capture of the user inserted in the scene. We use the term video-based avatar for the latter case.

³ 3D depth-sensing cameras, or view+depth cameras, capture, for each color pixel, an estimate of the distance of that pixel to the camera (i.e. its depth).



