Towards Semantic Equivalence of Tokenization in Multimodal LLM

1National University of Singapore   2Skywork AI, Singapore   3Nanyang Technological University

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in vision-language tasks. A crux of MLLMs lies in vision tokenization, which transforms input visual signals into the feature representations most beneficial for LLMs. However, existing vision tokenizers, essential for semantic alignment between vision and language, remain problematic: they aggressively fragment the visual input, corrupting the semantic integrity of visual content. To address this, this paper proposes a dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens according to image complexity. The resulting vision tokens preserve semantic integrity and capture both low-frequency and high-frequency visual features. Equipped with SeTok, the proposed MLLM (Setokim) demonstrates superior performance across various tasks, as evidenced by our experimental results.


Figure 1: Comparison of visual input tokenization across existing MLLMs: (a) patch-level continuous tokens, (b) patch-level discrete tokens, (c) learnable query tokens, and (d) semantic-equivalent continuous tokens (ours). In (e), we show four language-driven vision tasks enhanced by semantic-equivalent vision tokens; in the token masks, regions of the same color belong to a single vision token.

Method

SeTok tokenizes the visual features extracted from an image by a vision encoder into semantically equivalent vision tokens. These tokens are then fed into a detokenizer to reconstruct the image and are simultaneously used for concept-level image-text alignment.


Figure 2: Overview of SeTok.
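
To make the grouping step concrete, below is a minimal PyTorch sketch of clustering patch features into a variable number of vision tokens. The greedy cosine-similarity threshold used here, as well as the names `cluster_tokenize` and `tau` and the feature sizes, are illustrative assumptions and not the exact clustering algorithm of SeTok.

import torch
import torch.nn.functional as F

def cluster_tokenize(patch_feats: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Group patch features [N, D] into a variable number of semantic tokens.

    Hypothetical greedy scheme: a patch joins the most similar existing cluster
    if its cosine similarity to that cluster's center exceeds `tau`, otherwise
    it seeds a new cluster, so the token count adapts to image complexity.
    """
    feats = F.normalize(patch_feats, dim=-1)
    centers, members = [], []                        # cluster centers and member indices
    for i, f in enumerate(feats):
        if centers:
            sims = torch.stack(centers) @ f          # cosine similarity to each center
            j = int(sims.argmax())
            if sims[j] > tau:
                members[j].append(i)
                # update the running (re-normalized) center of cluster j
                centers[j] = F.normalize(feats[members[j]].mean(dim=0), dim=-1)
                continue
        centers.append(f)
        members.append([i])
    # one vision token per cluster: mean-pool the original (unnormalized) features
    return torch.stack([patch_feats[idx].mean(dim=0) for idx in members])

# e.g. 576 ViT patch features of width 1024 -> a handful of semantic vision tokens
vision_tokens = cluster_tokenize(torch.randn(576, 1024))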

Upon acquiring SeTok, we integrate it with a pre-trained LLM to construct an MLLM, i.e., Setokim. The overall framework is depicted in Figure 3. The input image is tokenized into a sequence of semantic-equivalent vision tokens by SeTok, which are then concatenated with text tokens to form a unified multimodal sequence. To distinguish between modalities and facilitate visual content generation, two special tokens, '[Img]' and '[/Img]', are introduced to mark the beginning and end of the visual sequence, respectively. The backbone LLM processes this multimodal sequence to perform multimodal understanding and generation. The output vision tokens are fed into the visual detokenizer to restore images. To exploit their spatial and semantic encoding, we further incorporate a lightweight mask decoder that takes the generated vision tokens as input and yields the referring mask.


Figure 3: Overview of SeTokim.
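
The following sketch illustrates how the projected vision tokens can be wrapped by the two special tokens and concatenated with text embeddings before entering the LLM backbone. The component names (`project_vision`, `embed_text`), vocabulary sizes, and placeholder ids for '[Img]' and '[/Img]' are assumptions for illustration, not the released Setokim implementation.

import torch
import torch.nn as nn

# --- hypothetical components; names and sizes are illustrative only ---
llm_dim = 4096
project_vision = nn.Linear(1024, llm_dim)     # maps SeTok tokens into the LLM space
embed_text = nn.Embedding(32002, llm_dim)     # LLM word embeddings (+2 added tokens)
IMG_START, IMG_END = 32000, 32001             # ids assumed for '[Img]' and '[/Img]'

def build_multimodal_sequence(vision_tokens: torch.Tensor,
                              text_ids: torch.Tensor) -> torch.Tensor:
    """Wrap projected vision tokens with [Img]...[/Img] and prepend to the text.

    vision_tokens: [num_semantic_tokens, 1024] from SeTok
    text_ids:      [num_text_tokens] token ids of the prompt
    returns:       [seq_len, llm_dim] input embeddings for the LLM backbone
    """
    img_span = torch.cat([
        embed_text(torch.tensor([IMG_START])),     # '[Img]'
        project_vision(vision_tokens),             # semantic vision tokens
        embed_text(torch.tensor([IMG_END])),       # '[/Img]'
    ], dim=0)
    return torch.cat([img_span, embed_text(text_ids)], dim=0)

seq = build_multimodal_sequence(torch.randn(7, 1024),
                                torch.randint(0, 32000, (16,)))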

Training

We train Setokim in three distinct phases to endow it with robust vision understanding and task execution capabilities.

  • Step-I: Tokenizer Pretraining. This stage endows the tokenizer with the ability to tokenize the input image into semantically complete and independent tokens that capture both low-frequency semantic features and high-frequency pixel features.
  • Step-II: Multimodal Pretraining. In this phase, we equip the LLM with interleaved understanding and generation capabilities for vision-language tasks.
  • Step-III: Instruction Tuning. We perform multimodal instruction tuning by fine-tuning the LLM with a LoRA module on public datasets covering fine-grained visual QA, image generation and editing, and text-rich grounded datasets (a minimal LoRA setup is sketched below).
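
For Step-III, here is a minimal sketch of attaching a LoRA adapter to an LLM backbone with the `peft` library; the base model id, rank, and target modules are illustrative placeholders rather than the paper's exact configuration.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; Setokim's actual backbone and LoRA hyperparameters
# are defined by the paper, not by this sketch.
base = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

lora_cfg = LoraConfig(
    r=16,                                  # low-rank dimension (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # only the LoRA weights are trainable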


Qualitative Analysis


• Visual Understanding and Generation


Figure 4: Qualitative results on image understanding and generation. Words marked in green are key elements in the questions and answers.


• Visual Segmentation


Figure 5: Visualization of segmentation results compared with GLaMM and Osprey.


• Visual Editing


Figure 6: Qualitative comparison between MLLMs on image editing. Setokim excels at adhering to instructions while preserving low-level image details.


• Visual Tokens


Figure 7: Visualization of the visual tokens.


• Reconstructions


Figure 8: Image reconstruction results from visual tokens via the denoising U-Net.


BibTeX

@article{wu2024towards,
  title={Towards Semantic Equivalence of Tokenization in Multimodal LLM},
  author={Wu, Shengqiong and Fei, Hao and Li, Xiangtai and Ji, Jiayi and Zhang, Hanwang and Chua, Tat-Seng and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2406.05127},
  year={2024}
}