pixel data. T2I-Adapters align text-to-image models with control signals, enhancing efficiency. Extracting semantics from source images, including object outlines and shapes, and combining them with textual descriptions further improves transmission efficiency. At the receiving end, a decoder reconstructs the image from these representations, offering a promising alternative for faster, more efficient communication in bandwidth-restricted environments.
The major contributions of this paper are:

  • This paper proposes a novel multi-modal approach for semantic-aided image transfer in the context of 6G communication. Our work leverages deep learning models, including BLIP for image captioning and T2I-Adapters for controllable image generation, aiming to achieve efficient and semantically rich image transfer.

  • We investigate the efficacy of various second modes, such as line art, canny edge, and depth map, to complement textual descriptions (captions). Our findings indicate that line art strikes the optimal balance between data reduction and image fidelity.

                         2. SYSTEM MODEL

2.1 Architecture of Image Captioning

The image-to-text architecture of the BLIP model incorporates a Vision Transformer (ViT) as the image encoder. The ViT divides images into patches and appends a special classification [CLS] token to capture global features. For a unified model with both understanding and generation capabilities, BLIP introduces the Multimodal mixture of Encoder-Decoder (MED), offering three functionalities:
 1. Unimodal Encoder: This component encodes image and text separately, utilizing a Bidirectional Encoder Representations from Transformers (BERT) based text encoder with a [CLS] token for sentence summarization.

 2. Image-Grounded Text Encoder: Injecting visual information, it inserts cross-attention (CA) layers between self-attention (SA) layers and appends an [Encode] token for the image-text representation.

 3. Image-Grounded Text Decoder: Replacing bidirectional self-attention layers with causal self-attention layers, it uses [Decode] tokens for sequence signaling (see the mask sketch after this list).
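The practical difference between the image-grounded text encoder and decoder is the attention mask: the encoder attends bidirectionally over the whole caption, while the decoder restricts each token to attend only to earlier tokens. The following is a minimal PyTorch illustration of the two mask patterns, not BLIP's actual implementation:

    import torch

    seq_len = 6  # length of a toy caption

    # Bidirectional mask (image-grounded text encoder): every token may
    # attend to every other token, so the mask is all ones.
    bidirectional_mask = torch.ones(seq_len, seq_len)

    # Causal mask (image-grounded text decoder): token i may only attend
    # to tokens 0..i, giving a lower-triangular pattern.
    causal_mask = torch.tril(torch.ones(seq_len, seq_len))

    print(bidirectional_mask)
    print(causal_mask)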
During pre-training, the model optimizes three distinct objectives (a toy sketch of the combined loss follows the list):

 1. Image-Text Contrastive Loss (ITC): Aligning the feature spaces of the visual and text transformers, encouraging similar representations for positive image-text pairs.

 2. Image-Text Matching Loss (ITM): Learning a multimodal representation that captures fine-grained alignment between vision and language by predicting positive/negative pairs.

 3. Language Modeling Loss (LM): Generating textual descriptions from images, optimizing a cross-entropy loss for maximum likelihood of the text.
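The three objectives are summed into a single training loss. The sketch below assembles such a combined objective on toy tensors; it is illustrative only, and the toy dimensions, temperature value, and loss heads are assumptions rather than BLIP's actual training code:

    import torch
    import torch.nn.functional as F

    batch, dim, vocab, seq = 4, 256, 1000, 12

    # Toy unimodal embeddings for ITC (image and text [CLS] features).
    img_emb = F.normalize(torch.randn(batch, dim), dim=-1)
    txt_emb = F.normalize(torch.randn(batch, dim), dim=-1)
    sim = img_emb @ txt_emb.t() / 0.07  # similarity matrix with temperature
    targets = torch.arange(batch)       # matching pairs lie on the diagonal
    itc_loss = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

    # Toy ITM head output: binary match / non-match prediction per pair.
    itm_logits = torch.randn(batch, 2)
    itm_labels = torch.randint(0, 2, (batch,))
    itm_loss = F.cross_entropy(itm_logits, itm_labels)

    # Toy LM head output: token prediction over the caption.
    lm_logits = torch.randn(batch, seq, vocab)
    lm_labels = torch.randint(0, vocab, (batch, seq))
    lm_loss = F.cross_entropy(lm_logits.view(-1, vocab), lm_labels.view(-1))

    total_loss = itc_loss + itm_loss + lm_loss  # joint pre-training objective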
BLIP's image-to-text architecture transmits captions instead of images, reducing data volume, conserving bandwidth, and expediting transmission, aided by CapFilt modules for enhanced text quality. CapFilt consists of a captioner that generates synthetic captions for web images and a filter that removes noisy image-text pairs. Both the captioner and the filter are initialized from the same pre-trained MED model and fine-tuned on a small-scale human-annotated dataset. The filtered and captioned image-text pairs form a new dataset, which is used for pre-training the BLIP model, resulting in improved image-to-text understanding and generation.
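A schematic view of this bootstrapping step is sketched below; the captioner, filter, and data arguments are hypothetical stand-ins rather than BLIP's actual CapFilt code:

    def capfilt(web_pairs, captioner, filter_fn, human_pairs):
        """Bootstrap a cleaner pre-training set from noisy web image-text pairs.

        web_pairs:   iterable of (image, web_text) scraped pairs
        captioner:   callable image -> synthetic caption (fine-tuned MED decoder)
        filter_fn:   callable (image, text) -> bool, True if the pair is kept
        human_pairs: small human-annotated set, kept as-is
        """
        bootstrapped = list(human_pairs)
        for image, web_text in web_pairs:
            synthetic = captioner(image)
            # Keep whichever texts the filter judges to match the image.
            for text in (web_text, synthetic):
                if filter_fn(image, text):
                    bootstrapped.append((image, text))
        return bootstrapped  # used to pre-train a new BLIP model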
Figure 1 – Examples showcasing the results of the image captioning model

The outcomes at this stage are:

(i) Result of Image Captioning Model: The image captioning architecture generates textual descriptions of images using deep learning models. The output is a descriptive caption that accurately represents the content of the image, as shown in Figure 1.

(ii) Testing Using BiLingual Evaluation Understudy (BLEU) Score: We assessed the image captioning model on the Flickr 8k validation set, achieving a BLEU score of 0.69, indicating substantial similarity with human captions (an illustrative evaluation sketch follows the list).
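As an illustration of how such an evaluation can be run, the sketch below generates a caption with a publicly available BLIP checkpoint and scores hypotheses against reference captions with corpus-level BLEU; the checkpoint name, image path, and reference captions are placeholders, not the exact setup used in this work:

    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration
    from nltk.translate.bleu_score import corpus_bleu

    # Load a public BLIP captioning checkpoint (placeholder choice).
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    # Generate a caption for one validation image (path is a placeholder).
    image = Image.open("flickr8k/example.jpg").convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    caption = processor.decode(model.generate(**inputs)[0], skip_special_tokens=True)

    # Score generated captions against human references with corpus BLEU.
    references = [[r.split() for r in ["a dog runs across the grass",
                                       "a brown dog is running on a lawn"]]]
    hypotheses = [caption.split()]
    print(caption, corpus_bleu(references, hypotheses))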
2.2 T2I-Adapter

The T2I-Adapter enhances text-to-image diffusion models by adding structural and color guidance during image synthesis. Integrated with the Stable Diffusion (SD) model, which includes an autoencoder and a UNet denoiser, the T2I-Adapter improves image generation quality.
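One way to reproduce this setup at the receiver is the T2I-Adapter integration in the diffusers library. The sketch below conditions Stable Diffusion on a received caption together with a compact sketch/line-art map; the checkpoint names and file paths are indicative examples under the diffusers API, not a prescribed configuration:

    import torch
    from PIL import Image
    from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

    # Load a sketch-conditioned adapter and attach it to a Stable Diffusion
    # pipeline (checkpoint names are examples of publicly available weights).
    adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_sketch_sd15v2",
                                         torch_dtype=torch.float16)
    pipe = StableDiffusionAdapterPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
    ).to("cuda")

    # The receiver holds only the transmitted caption and the compact second
    # mode (here a line-art map), from which the image is reconstructed.
    caption = "a brown dog running across a grassy field"     # received text
    sketch = Image.open("received_sketch.png").convert("L")   # received line art (placeholder path)

    reconstructed = pipe(caption, image=sketch).images[0]
    reconstructed.save("reconstructed.png")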
2.2.1 Model Architecture

  • Feature Extraction Blocks: Four feature extraction blocks composed of convolutional layers and residual blocks are used to extract condition features from various input maps, such as sketches, depth maps, semantic segmentation maps, and key poses.

  • Downsample Blocks: Three downsample blocks reduce the input resolution from 512×512 to 64×64 using a pixel unshuffle operation, allowing efficient feature extraction at multiple scales (a minimal sketch of the pixel unshuffle step follows this list).
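As a quick illustration of the pixel unshuffle step (an illustrative PyTorch snippet, not the adapter's actual code), a downscale factor of 8 maps a 512×512 condition map to a 64×64 grid while moving spatial detail into the channel dimension:

    import torch
    import torch.nn as nn

    # A 1-channel 512x512 condition map (e.g. a line-art or sketch map).
    condition = torch.randn(1, 1, 512, 512)

    # Pixel unshuffle with downscale factor 8: spatial size shrinks by 8x in
    # each dimension, channels grow by 8*8, preserving all pixel values.
    unshuffle = nn.PixelUnshuffle(downscale_factor=8)
    features = unshuffle(condition)

    print(features.shape)  # torch.Size([1, 64, 64, 64])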



