
ADVANCING IMAGE TRANSFER THROUGH SEMANTIC-AIDED APPROACHES: A MULTI-MODAL EXPLORATION

                 Dawood Aziz Zargar¹; Hashim Aijaz¹; Nargis Fayaz²; Brejesh Lall²
                     ¹ National Institute of Technology, Srinagar, India
                     ² Indian Institute of Technology, Delhi, India


                               ABSTRACT

Semantic Communication (SC), in contrast to conventional communication, prioritizes meaning over raw data, thereby minimizing errors. In this paper, we advance image transfer using semantic-aided approaches. Leveraging deep learning models such as Bootstrapping Language-Image Pre-Training and Text-to-Image Diffusion Models, we initially utilize captions as a single mode to reduce data transfer size and convey image content. However, recognizing the structural importance of images, we introduce a second mode, favoring line art for its efficacy in depicting image structure. Our findings highlight the potential of multi-modal approaches to improve SC systems for various applications in the 6G era.

  Keywords - Bootstrapping language-image pre-training, canny edge, caption generation, depth map, line art, multi-modality, semantic communication, text-to-image diffusion models

                          1.  INTRODUCTION
Wireless communication, evolving from 1G to 5G, faces the need for the next phase of evolution to meet emerging technology demands, including higher data rates to support multimedia content and immersive experiences, ultra-reliable low-latency communication for real-time applications like autonomous vehicles and remote surgery, massive machine-type communication for the Internet of Things, network slicing for service customization, energy efficiency and sustainability, enhanced security and privacy measures, and integration of emerging technologies like AI and edge computing. By 2030, 5G may become obsolete, requiring a shift toward the next generation of wireless communication to ensure continued technological advancement. Semantic Communication (SC) prioritizes meaning over raw data, addressing many of 5G's limitations [1] [2]. Despite many challenges [3], innovative frameworks like semantic fidelity and symbol representation enhance efficiency in natural language processing and neural network training.

Our work aims to contribute to the paradigm shift towards SC in wireless communication beyond 5G. Leveraging AI and multi-modality, we optimize semantic-aided image transfer, bridging the semantic gap for enhanced understanding and efficiency. By integrating AI algorithms like Convolutional Neural Networks and Transformers, we facilitate targeted information transfer and reduce bandwidth usage through concise captions, complemented by modes like line art for enriched semantic richness. This approach sets the stage for efficient data transmission and advancements in communication technologies.
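The line-art mode mentioned above acts as a structural complement to the caption. As a minimal illustration of how such a structural mode can be obtained and serialized, the sketch below extracts a Canny edge map (one of the cues listed in the keywords) with OpenCV; the function name, thresholds, and PNG packing are illustrative assumptions, not the exact configuration used in this paper.

    import cv2  # OpenCV; thresholds and PNG packing below are illustrative assumptions

    def structural_mode(image_path: str, low: int = 100, high: int = 200) -> bytes:
        """Extract a Canny edge map as a compact structural representation of an image."""
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        edges = cv2.Canny(gray, low, high)      # binary edge map, 0 or 255 per pixel
        ok, buf = cv2.imencode(".png", edges)   # a sparse edge map compresses well as PNG
        return buf.tobytes()                    # payload for the structural channel

    payload = structural_mode("example.jpg")    # "example.jpg" is a placeholder path
    print(len(payload), "bytes for the edge map")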
In [4], the authors introduce a novel text-to-image generation framework that uses a Siamese structure to distill semantic commonalities from linguistic descriptions and enhances visual-semantic embedding methods to preserve unique semantic diversities, demonstrating its effectiveness and significance through extensive experimentation. Moreover, in [5], the authors implement an SC-based end-to-end image transmission system and discuss potential design considerations for developing SC systems alongside physical channel characteristics. A pre-trained Generative Adversarial Network is employed at the receiver for the transmission task, reconstructing realistic images based on the semantically segmented images received as input. In these works, only a single modality is considered, i.e., either a text description or a segmented image. The question that naturally arises is: can we do better, and will the integration of multiple modalities help in image generation at the receiver?

In recent years, vision-language pre-training has advanced, yet it struggles to balance understanding and generation tasks. Bootstrapping Language-Image Pre-Training (BLIP) [6] introduces a Multi-modal Mixture of Encoder-Decoder (MED) architecture that transitions seamlessly between tasks and expands the applications of vision-language technology. BLIP's Captioning and Filtering (CapFilt) method leverages noisy web data, improving pre-training quality and achieving state-of-the-art performance. These advancements are crucial for future communication technologies, especially in 6G, where efficient data transmission is vital. BLIP's exceptional performance in vision-language tasks is pivotal to our work on advancing image transfer through semantic-aided methods. Using BLIP, we extract image captions and convert them into bits for the swift data transmission critical to 6G. Through BLIP integration, we enhance communication channels, speeding up vision-language tasks.
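To make the caption mode concrete, the sketch below uses the Hugging Face transformers implementation of BLIP to caption an image and pack the caption into a bit string. The checkpoint name, generation length, and plain UTF-8 bit packing are illustrative assumptions rather than the exact pipeline of this paper.

    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    # Checkpoint and UTF-8 bit packing are illustrative choices, not the paper's exact setup.
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def caption_to_bits(image_path: str) -> tuple[str, str]:
        """Caption an image with BLIP and serialize the caption as a bit string."""
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        caption = processor.decode(out[0], skip_special_tokens=True)
        bits = "".join(f"{byte:08b}" for byte in caption.encode("utf-8"))
        return caption, bits

    caption, bits = caption_to_bits("example.jpg")   # "example.jpg" is a placeholder path
    print(caption, "->", len(bits), "bits")          # far fewer bits than the raw image

Even a caption of a few dozen words occupies only a few hundred bytes once encoded, which is the bandwidth saving the caption mode targets.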
The authors in [7] explore the potential of Text-to-Image (T2I) adapters for semantic-aided image transfer in multi-modal communication. T2I adapters were initially designed to control text-based image generation; the authors suggest a new application, using them to convey both image semantics and textual descriptions. Semantic-aided image transfer reduces bandwidth demands, prioritizing semantic content over raw