pixel data. T2I-Adapters align text-to-image models with control signals, enhancing efficiency. Extracting semantics from source images, including object outlines and shapes, and combining them with textual descriptions further improves transmission efficiency. At the receiving end, a decoder reconstructs the image from these representations, offering a promising alternative for faster, more efficient communication in bandwidth-restricted environments.

The major contributions of this paper are:

• This paper proposes a novel multi-modal approach for semantic-aided image transfer in the context of 6G communication. Our work leverages deep learning models, including BLIP for image captioning and T2I-Adapters for controllable image generation, to achieve efficient and semantically rich image transfer.

• We investigate the efficacy of various second modalities, such as line art, canny edge, and depth maps, to complement textual descriptions (captions). Our findings indicate that line art strikes the optimal balance between data reduction and image fidelity.

2. SYSTEM MODEL

2.1 Architecture of Image Captioning

The image-to-text architecture of the BLIP model incorporates a Vision Transformer (ViT) as the image encoder. The ViT divides images into patches and appends a special classification [CLS] token to capture global features. For a unified model with both understanding and generation capabilities, BLIP introduces the Multimodal mixture of Encoder-Decoder (MED), offering three functionalities:

1. Unimodal Encoder: This component separately encodes image and text, utilizing a Bidirectional Encoder Representations from Transformers (BERT) based text encoder with a [CLS] token for sentence summarization.

2. Image-Grounded Text Encoder: Injecting visual information, it inserts cross-attention (CA) layers between the self-attention (SA) layers, appending an [Encode] token whose output embedding serves as the image-text representation (a block-level sketch follows this list).

3. Image-Grounded Text Decoder: Replacing the bidirectional self-attention layers with causal self-attention layers, it uses a [Decode] token to signal the start of a sequence.
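To make the image-grounded text encoder of item 2 concrete, the following is a minimal PyTorch sketch of a single MED-style block, with cross-attention to the ViT patch embeddings inserted between the text self-attention and the feed-forward network; the dimensions, pre-norm layout, and token counts are illustrative assumptions rather than BLIP's exact implementation.

```python
import torch
import torch.nn as nn

class ImageGroundedTextBlock(nn.Module):
    """One block of the image-grounded text encoder:
    text self-attention -> cross-attention to image patches -> feed-forward."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, image_embeds):
        # bidirectional self-attention over the text tokens ([Encode] + caption)
        h = self.norm1(text)
        text = text + self.self_attn(h, h, h)[0]
        # cross-attention: text queries attend to the ViT patch embeddings
        h = self.norm2(text)
        text = text + self.cross_attn(h, image_embeds, image_embeds)[0]
        # position-wise feed-forward network
        return text + self.ffn(self.norm3(text))

block = ImageGroundedTextBlock()
text_tokens = torch.randn(1, 16, 768)        # [Encode] token + caption tokens
patch_embeds = torch.randn(1, 197, 768)      # ViT [CLS] + 196 patch embeddings
print(block(text_tokens, patch_embeds).shape)    # torch.Size([1, 16, 768])
```

Replacing the bidirectional self-attention with a causal mask and feeding [Decode] tokens turns the same block into the image-grounded text decoder of item 3.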
During pre-training, the model optimizes three distinct objectives, which are combined in the sketch that follows the list:

1. Image-Text Contrastive Loss (ITC): Aligning the feature spaces of the visual and text transformers, encouraging similar representations for positive image-text pairs.

2. Image-Text Matching Loss (ITM): Learning a multimodal representation that captures fine-grained alignment between vision and language, predicting whether image-text pairs are positive or negative.

3. Language Modeling Loss (LM): Generating textual descriptions from images, optimizing a cross-entropy loss for maximum likelihood of the text.
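A minimal sketch of how the three objectives can be combined is given below; the plain in-batch contrastive formulation, equal loss weights, temperature value, and tensor shapes are simplifying assumptions, and BLIP's momentum distillation and hard-negative mining are omitted.

```python
import torch
import torch.nn.functional as F

def blip_pretraining_loss(img_feat, txt_feat, itm_logits, itm_labels,
                          lm_logits, lm_labels, temperature=0.07):
    """ITC + ITM + LM combined with equal weights.
    img_feat, txt_feat: L2-normalized [CLS] features, shape (B, D).
    itm_logits: (B, 2) match/no-match scores; itm_labels: (B,).
    lm_logits: (B, T, V) decoder logits; lm_labels: (B, T) shifted targets."""
    # ITC: symmetric in-batch contrastive loss over image-text similarities
    sim = img_feat @ txt_feat.t() / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    itc = 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))
    # ITM: binary classification of matched vs. mismatched image-text pairs
    itm = F.cross_entropy(itm_logits, itm_labels)
    # LM: autoregressive cross-entropy over the caption tokens
    lm = F.cross_entropy(lm_logits.flatten(0, 1), lm_labels.flatten(), ignore_index=-100)
    return itc + itm + lm
```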
BLIP's image-to-text architecture transmits captions instead of images, reducing data volume, conserving bandwidth, and expediting transmission, aided by CapFilt modules for enhanced text quality. CapFilt consists of a captioner to generate synthetic captions for web images and a filter to remove noisy image-text pairs. Both the captioner and filter are initialized from the same pre-trained MED model and fine-tuned on a small-scale human-annotated dataset. The filtered and captioned image-text pairs form a new dataset, which is used for pre-training the BLIP model, resulting in improved image-to-text understanding and generation.
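The CapFilt bootstrapping step can be summarized by the sketch below; captioner and filter_fn are hypothetical callables standing in for the two fine-tuned MED heads, and keeping both the web text and the synthetic caption whenever the filter accepts them is a simplification of the procedure described above.

```python
def bootstrap_dataset(web_pairs, human_pairs, captioner, filter_fn):
    """CapFilt sketch: caption web images, keep only the pairs the filter
    judges as matching, then merge with the human-annotated pairs."""
    bootstrapped = []
    for image, web_text in web_pairs:
        synthetic = captioner(image)             # synthetic caption for the web image
        for text in (web_text, synthetic):
            if filter_fn(image, text):           # drop noisy image-text pairs
                bootstrapped.append((image, text))
    return bootstrapped + list(human_pairs)      # dataset used to pre-train BLIP
```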
Figure 1 – Examples showcasing the results of the image captioning model

The outcomes at this stage are:

(i) Result of Image Captioning Model: The image captioning architecture generates textual descriptions of images using deep learning models. The output is a descriptive caption that accurately represents the content of the image, as shown in Figure 1.
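For reference, captions like those in Figure 1 can be produced with an off-the-shelf BLIP checkpoint; the snippet below assumes the Hugging Face transformers implementation and the Salesforce/blip-image-captioning-base checkpoint, which need not match the exact model used in our experiments.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a publicly available BLIP captioning checkpoint (assumed for illustration).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")            # image to be transmitted
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)                                              # caption sent over the channel
```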
(ii) Testing Using BiLingual Evaluation Understudy (BLEU) Score: We assessed the image captioning model on the Flickr8k validation set, achieving a BLEU score of 0.69, indicating substantial similarity with human captions.
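The BLEU evaluation can be reproduced along the lines of the sketch below; the use of NLTK's corpus_bleu with its default 4-gram weights, simple whitespace tokenization, and the placeholder captions are assumptions, since the exact BLEU variant is not specified here.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Placeholder data: per image, the human reference captions and the generated caption.
human_captions = [["a dog runs across the sand", "a brown dog running on a beach"]]
generated_captions = ["a dog running on the beach"]

references = [[ref.split() for ref in refs] for refs in human_captions]
hypotheses = [cap.split() for cap in generated_captions]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")
```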
blocks composed of convolutional layers and residual
blocks are used to extract condition features from various
1. Image-Text Contrastive Loss (ITC): Aligning feature
input maps, such as sketches, depth maps, semantic
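At the receiver, the transmitted caption and condition map can be combined along the lines of the sketch below; it assumes the diffusers library's T2I-Adapter support, and the sketch adapter and SD 1.5 checkpoint names are illustrative choices rather than the exact models used in this work.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

# Sketch-conditioned adapter paired with an SD 1.5 base model (checkpoint names assumed).
adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_sketch_sd15v2",
                                     torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

condition = Image.open("line_art.png").convert("L").resize((512, 512))  # received map
caption = "a dog running on the beach"                                  # received caption
image = pipe(prompt=caption, image=condition, num_inference_steps=30).images[0]
image.save("reconstructed.png")
```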
2.2.1 Model Architecture

• Feature Extraction Blocks: Four feature extraction blocks composed of convolutional layers and residual blocks are used to extract condition features from various input maps, such as sketches, depth maps, semantic segmentation maps, and key poses.

• Downsample Blocks: Three downsample blocks reduce the input resolution from 512×512 to 64×64 using a pixel unshuffle operation, allowing efficient feature extraction at multiple scales, as sketched below.
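A minimal PyTorch sketch of such an adapter backbone is given below; it is not the released T2I-Adapter implementation. In this sketch the pixel unshuffle performs the 512×512 to 64×64 reduction, three strided convolutions act as the downsample blocks between the four feature extraction blocks, and the channel widths are assumptions chosen to match a typical SD UNet.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class AdapterBackboneSketch(nn.Module):
    """Pixel unshuffle plus four feature extraction levels with three downsamples."""
    def __init__(self, cond_channels=3, widths=(320, 640, 1280, 1280)):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(8)               # 512x512 -> 64x64
        self.stem = nn.Conv2d(cond_channels * 64, widths[0], 3, padding=1)
        self.blocks = nn.ModuleList(
            nn.Sequential(ResidualBlock(w), ResidualBlock(w)) for w in widths
        )
        self.downs = nn.ModuleList(
            nn.Conv2d(widths[i], widths[i + 1], 3, stride=2, padding=1)
            for i in range(len(widths) - 1)
        )

    def forward(self, cond):
        x = self.stem(self.unshuffle(cond))
        features = []
        for i, block in enumerate(self.blocks):
            x = block(x)
            features.append(x)          # injected into the matching UNet encoder level
            if i < len(self.downs):
                x = self.downs[i](x)
        return features

feats = AdapterBackboneSketch()(torch.randn(1, 3, 512, 512))
print([tuple(f.shape) for f in feats])   # 64x64, 32x32, 16x16 and 8x8 feature maps
```

In the T2I-Adapter design, these multi-scale features are added to the corresponding UNet encoder features during denoising, so the condition steers generation while the diffusion model itself stays frozen.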