ADVANCING IMAGE TRANSFER THROUGH SEMANTIC-AIDED APPROACHES: A
MULTI-MODAL EXPLORATION
Dawood Aziz Zargar²; Hashim Aijaz¹; Nargis Fayaz¹; Brejesh Lall²
¹ National Institute of Technology, Srinagar, India
² Indian Institute of Technology, Delhi, India
ABSTRACT

Semantic Communication (SC), in contrast to conventional communication, prioritizes meaning over raw data, thereby minimizing errors. In this paper, we advance image transfer using semantic-aided approaches. Leveraging deep learning models such as Bootstrapping Language-Image Pre-Training and Text-to-Image Diffusion Models, we initially utilize captions as a single mode to reduce data transfer size and convey image content. However, recognizing the structural importance of images, we introduce a second mode, favoring line art for its efficacy in depicting image structure. Our findings highlight the potential of multi-modal approaches to improve SC systems for various applications in the 6G era.

Keywords - Bootstrapping language-image pre-training, canny edge, caption generation, depth map, line art, multi-modality, semantic communication, text-to-image diffusion models

1. INTRODUCTION

Wireless communication, evolving from 1G to 5G, faces the need for the next phase of evolution to meet emerging technology demands, including higher data rates to support multimedia content and immersive experiences, ultra-reliable low-latency communication for real-time applications like autonomous vehicles and remote surgery, massive machine-type communication for the Internet of Things, network slicing for service customization, energy efficiency and sustainability, enhanced security and privacy measures, and integration of emerging technologies like AI and edge computing. By 2030, 5G may become obsolete, requiring a shift toward the next generation of wireless communication to ensure continued technological advancement. Semantic Communication (SC) prioritizes meaning over raw data, addressing many of 5G's limitations [1] [2]. Despite many challenges [3], innovative frameworks like semantic fidelity and symbol representation enhance efficiency in natural language processing and neural network training.

Our work aims to contribute to the paradigm shift towards SC in wireless communication beyond 5G. Leveraging AI and multi-modality, we optimize semantic-aided image transfer, bridging the semantic gap for enhanced understanding and efficiency. By integrating AI algorithms like Convolutional Neural Networks and Transformers, we facilitate targeted information transfer and reduce bandwidth usage through concise captions, complemented by modes like line art for enriched semantic richness. This approach sets the stage for efficient data transmission and advancements in communication technologies. In [4], the authors introduce a novel text-to-image generation framework utilizing a Siamese structure to distill semantic commonalities from linguistic descriptions and enhanced visual-semantic embedding methods to preserve unique semantic diversities, highlighting its effectiveness and significance through extensive experimentation. Moreover, in [5], the authors implement an SC-based end-to-end image transmission system, discussing potential design considerations for developing SC systems alongside physical channel characteristics. A pre-trained Generative Adversarial Network is employed at the receiver for the transmission task, reconstructing realistic images from the semantically segmented images received as input. In these works, only a single modality is considered, i.e., either a text description or a segmented image. The question that comes to one's mind is: how can we do better, and will the integration of multiple modalities help in image generation at the receiver?
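To make the structural second mode concrete, the minimal sketch below extracts a Canny edge map as a stand-in for line art (Canny edges and line art both appear among the modes considered in this work) and compares its encoded size with that of the raw image. OpenCV, the thresholds, and the file name are illustrative assumptions rather than the exact extraction pipeline used here.

import cv2

# Load the source image at the transmitter (placeholder path).
image = cv2.imread("input.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Canny edge map as a compact structural mode (stand-in for line art).
edges = cv2.Canny(gray, threshold1=100, threshold2=200)

# Compare the payload sizes of the structural mode and the raw image.
_, edge_png = cv2.imencode(".png", edges)
_, raw_jpg = cv2.imencode(".jpg", image)
print(f"edge map: {edge_png.nbytes} bytes, raw image: {raw_jpg.nbytes} bytes")

In practice, such a structural mode would be transmitted alongside the caption, giving the receiver both semantic and structural cues for reconstruction.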
In recent years, vision-language pre-training has advanced, yet it struggles to balance understanding and generation tasks. Bootstrapping Language-Image Pre-Training (BLIP) [6] introduces a Multi-modal Mixture of Encoder-Decoder (MED) architecture that transitions seamlessly between tasks and expands the applications of vision-language technology. BLIP's Captioning and Filtering (CapFilt) method leverages noisy web data, enhancing pre-training quality and achieving state-of-the-art performance. These advancements are crucial for future communication technologies, especially in 6G, where efficient data transmission is vital. BLIP's exceptional performance in vision-language tasks is pivotal for our work advancing image transfer through semantic-aided methods. Using BLIP, we extract image captions and convert them into bits for the swift data transmission critical to 6G. Through BLIP integration, we enhance communication channels, speeding up vision-language tasks.
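A minimal sketch of this caption-extraction step is given below, assuming the publicly available Salesforce/blip-image-captioning-base checkpoint from the Hugging Face transformers library and a placeholder input path (neither is specified by this paper); the UTF-8 packing simply stands in for whatever source coding the transmitter applies before channel coding.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a pre-trained BLIP captioning model (assumed checkpoint).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Generate a concise caption for the image to be transmitted.
image = Image.open("input.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)

# Serialize the caption into a bit string for transmission over the channel.
payload_bits = "".join(f"{byte:08b}" for byte in caption.encode("utf-8"))
print(caption, "->", len(payload_bits), "bits")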
The authors in [7] explore the potential of Text-to-Image (T2I) adapters for semantic-aided image transfer in multi-modal communication. Initially focused on text-based image generation control, they suggested a new application: using T2I adapters to convey both image semantics and textual descriptions. Semantic-aided image transfer reduces bandwidth demands, prioritizing semantic content over raw