Towards Semantic Equivalence of Tokenization in Multimodal LLM

1National University of Singapore   2Skywork AI, Singapore   3Nanyang Technological University

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities on vision-language tasks. A crux of MLLMs lies in vision tokenization, which efficiently transforms input visual signals into feature representations that are most beneficial for LLMs. However, existing vision tokenizers, essential for semantic alignment between vision and language, remain problematic: they aggressively fragment the visual input, corrupting its semantic integrity. To address this, this paper proposes a dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. Equipped with SeTok, the proposed MLLM (Setokim) demonstrates significantly superior performance across various tasks, as evidenced by our experimental results.
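
To make the clustering step concrete, below is a minimal sketch (our own simplification, not the released SeTok implementation): it greedily groups patch features by cosine similarity against a threshold, so the number of clusters, and hence vision tokens, adapts to image complexity. The threshold tau and the greedy assignment rule are illustrative assumptions.

import torch
import torch.nn.functional as F

def dynamic_cluster(patch_feats: torch.Tensor, tau: float = 0.7):
    """Greedily group patch features (N, D) into a variable number of clusters.

    A patch joins the most similar existing cluster when the cosine similarity
    between its feature and that cluster's (normalized) running mean exceeds
    tau; otherwise it starts a new cluster, so the cluster count adapts to
    image complexity.
    """
    feats = F.normalize(patch_feats, dim=-1)
    centers, members = [], []                      # normalized running means, member indices
    for i, f in enumerate(feats):
        if centers:
            sims = torch.stack([torch.dot(f, c) for c in centers])
            j = int(sims.argmax())
            if sims[j] > tau:
                members[j].append(i)
                centers[j] = F.normalize(feats[members[j]].mean(dim=0), dim=0)
                continue
        centers.append(f.clone())
        members.append([i])
    assign = torch.zeros(len(feats), dtype=torch.long)
    for j, idx in enumerate(members):
        assign[torch.tensor(idx)] = j
    return assign, torch.stack(centers)            # (N,) assignments, (K, D) centers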


Figure 1: Comparison of how existing MLLMs tokenize an input image via (a) patchifying the image, (b) a codebook, and (c) cluster merging. In (d), we show four language-driven vision tasks enhanced with semantic-equivalent vision tokens, where regions of the same color denote a single vision token.

Method

(a) The Vision Cluster automatically groups visual features from the input image into a dynamic number of visual clusters. (b) The Vision Merger aggregates the visual embeddings within each cluster, rather than merely using the cluster centers as the definitive vision tokens. (c) The core backbone is an LLM. (d) The Vision Decoder decodes realistic images from the tokenized vision tokens. (e) The Mask Decoder takes the vision tokens as input to decode object masks.
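
As a rough illustration of the Vision Merger, the sketch below aggregates each cluster's member features with a single cross-attention layer, using the cluster center as the query, so the resulting token carries more than the center alone. The module structure, dimensions, and hyper-parameters are our assumptions rather than the exact architecture of the paper.

import torch
import torch.nn as nn

class VisionMerger(nn.Module):
    """Merge each cluster's member features into a single vision token.

    The cluster center serves as the query of one cross-attention layer over
    that cluster's member features, so the token encodes more than the center
    alone. Dimensions and depth are illustrative.
    """
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor, assign: torch.Tensor, centers: torch.Tensor):
        # feats: (N, D) patch features; assign: (N,) cluster ids; centers: (K, D)
        tokens = []
        for k in range(centers.size(0)):
            members = feats[assign == k].unsqueeze(0)   # (1, M_k, D) member features
            query = centers[k].view(1, 1, -1)           # (1, 1, D) cluster center as query
            merged, _ = self.attn(query, members, members)
            tokens.append(self.norm(merged + query).squeeze(0).squeeze(0))
        return torch.stack(tokens)                      # (K, D) vision tokens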


Figure 2: Overview of Setokim. The visual embeddings extracted by a vision encoder are tokenized into vision tokens by SeTok and combined with text tokens as input to the LLM for vision-language understanding. The output vision embeddings from the LLM are fed into a vision decoder and a mask decoder to generate realistic images and semantic segmentation masks, respectively.

Training

We train Setokim in three distinct phases to endow it with robust vision understanding and task execution capabilities.

  • Step-I: Tokenizer Pretraining. This stage endows the tokenizer with the capability of tokenizing the input image into semantically complete and independent tokens that capture both the low-frequency semantic features and the high-frequency pixel features.
  • Step-II: Multimodal Pretraining. In this phase, we enable the LLM to perform interleaved understanding and generation for vision-language tasks. On the one hand, we adopt a next-token-prediction cross-entropy loss for textual content generation. On the other hand, we employ an embedding regression loss to train the LLM to generate vision tokens, which are supervised to reconstruct the features of the pre-trained vision tokenizer with a Mean Squared Error (MSE) loss (see the loss sketch after this list).
  • Step-III: Instruction Tuning. We perform multimodal instruction tuning by fine-tuning the LLM with a LoRA module on public datasets covering fine-grained visual QA, image generation and editing, as well as text-rich grounded datasets.
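
The Step-II objective above can be sketched as a next-token cross-entropy loss on text plus an MSE regression of the LLM-predicted vision embeddings onto the pre-trained tokenizer's features. The weighting factor lambda_v and the tensor shapes below are assumptions, not values reported in the paper.

import torch.nn.functional as F

def multimodal_pretraining_loss(text_logits, text_labels,
                                pred_vision_embeds, target_vision_embeds,
                                lambda_v: float = 1.0):
    """Step-II objective sketch: text cross-entropy + vision embedding regression.

    text_logits: (B, T, V) next-token logits; text_labels: (B, T) token ids with
    ignored positions set to -100. pred_vision_embeds / target_vision_embeds:
    (B, K, D) LLM-predicted vs. pre-trained SeTok vision embeddings.
    """
    ce = F.cross_entropy(text_logits.flatten(0, 1), text_labels.flatten(), ignore_index=-100)
    mse = F.mse_loss(pred_vision_embeds, target_vision_embeds)
    return ce + lambda_v * mse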


Qualitative Analysis


• Visual Understanding and Generation


Figure 3: Qualitative results on image understanding and generation. Words marked in green are key elements in the questions and answers.


• Visual Segmentation


Figure 4: Visualizations of segmentation results compared with GLaMM and Osprey.


• Visual Editing


Figure 5: Qualitative comparison between MLLMs on image editing. Setokim excels at adhering to instructions while preserving low-level image details.


• Visual Tokens


Figure 6: Visualizations of the vision tokens.


• Reconstructions


Figure 7: Image reconstruction results from the vision tokens by the denoising U-Net.


BibTeX

@article{wu2024towards,
  title={Towards Semantic Equivalence of Tokenization in Multimodal LLM},
  author={Wu, Shengqiong and Fei, Hao and Li, Xiangtai and Ji, Jiayi and Zhang, Hanwang and Chua, Tat-Seng and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2406.05127},
  year={2024}
}