Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in vision-language tasks. A crux of MLLMs lies in vision tokenization: efficiently transforming input visual signals into feature representations that are most beneficial for the LLM. However, existing vision tokenizers, which are essential for semantic alignment between vision and language, remain problematic: they aggressively fragment the visual input, corrupting its semantic integrity. To address this, this paper proposes a dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens according to image complexity. The resulting vision tokens preserve semantic integrity and capture both low-frequency and high-frequency visual features. Equipped with SeTok, the proposed MLLM (Setokim) demonstrates superior performance across various tasks, as evidenced by our experimental results.
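For intuition, below is a minimal sketch of how a similarity-threshold clustering of patch features can yield a variable number of semantic tokens. The function name, thresholding rule, and mean-pooling here are illustrative assumptions, not the exact clustering algorithm used in SeTok.

import torch
import torch.nn.functional as F

def dynamic_cluster_tokenize(patch_feats: torch.Tensor, sim_threshold: float = 0.6):
    """Group patch features (N, D) into a variable number K of semantic tokens.

    Greedy threshold clustering: a patch opens a new cluster center when it is
    not sufficiently similar to any existing center, so K grows with image
    complexity instead of being fixed in advance. Sketch only, not SeTok's
    actual clustering criterion.
    """
    feats = F.normalize(patch_feats, dim=-1)           # work in cosine space
    centers = [feats[0]]                               # seed with the first patch
    for f in feats[1:]:
        sims = torch.stack([f @ c for c in centers])
        if sims.max() < sim_threshold:                 # no existing center is close enough
            centers.append(f)                          # open a new semantic unit
    centers = torch.stack(centers)                     # (K, D)

    assign = (feats @ centers.T).argmax(dim=-1)        # (N,) nearest-center assignment
    tokens = torch.stack([patch_feats[assign == k].mean(dim=0)
                          for k in range(centers.size(0))])
    return tokens, assign                              # (K, D) tokens, (N,) token mask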
Figure 1: Comparison of how existing MLLMs tokenize visual inputs: (a) patch-level continuous tokens, (b) patch-level discrete tokens, (c) learnable query tokens, and (d) semantic-equivalent continuous tokens (ours). (e) Four language-driven vision tasks enhanced with semantic-equivalent vision tokens; in the token masks, regions of the same color belong to a single vision token.
SeTok tokenizes the visual features extracted from an image by a vision encoder into semantically equivalent vision tokens, which are then fed into a detokenizer to reconstruct the image and are simultaneously used for concept-level image-text alignment.
Figure 2: Overview of SeTok.
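The description above involves two training signals for SeTok: image reconstruction through the detokenizer and concept-level image-text alignment. The following is a hedged sketch of how such a combined objective might look; the one-to-one token-concept matching, the plain pixel reconstruction loss (standing in for the detokenizer's actual reconstruction objective), and the loss weighting are assumptions for illustration.

import torch
import torch.nn.functional as F

def setok_training_loss(vision_tokens, concept_embeds, recon_image, target_image,
                        temperature: float = 0.07, recon_weight: float = 1.0):
    """Combine concept-level alignment with image reconstruction (sketch).

    vision_tokens:  (K, D) outputs of SeTok for one image
    concept_embeds: (K, D) text-side concept embeddings, assumed matched 1:1
    """
    v = F.normalize(vision_tokens, dim=-1)
    t = F.normalize(concept_embeds, dim=-1)
    logits = v @ t.T / temperature                      # token-to-concept similarities
    labels = torch.arange(v.size(0), device=v.device)
    align_loss = F.cross_entropy(logits, labels)        # concept-level image-text alignment

    recon_loss = F.mse_loss(recon_image, target_image)  # stand-in for the detokenizer objective
    return align_loss + recon_weight * recon_loss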
After obtaining SeTok, we integrate it with a pre-trained LLM to construct an MLLM, i.e., Setokim. The overall framework is depicted in Figure 3. The input image is tokenized into a sequence of semantic-equivalent vision tokens by SeTok, which are concatenated with the text tokens to form a unified multimodal sequence. To distinguish between modalities and facilitate visual content generation, two special tokens, '[Img]' and '[/Img]', mark the beginning and end of the visual sequence, respectively. The backbone LLM then processes this multimodal sequence to perform multimodal understanding and generation, and the output vision tokens are fed into the visual detokenizer to restore images. To exploit the spatial and semantic encoding of these tokens, we further attach a lightweight mask decoder that takes the generated vision tokens as input and yields the referring mask.
Figure 3: Overview of Setokim.
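As a rough illustration of the sequence construction described above, the snippet below assembles the '[Img]' / '[/Img]'-delimited vision tokens and the text embeddings into one multimodal input for the LLM backbone; tensor names and the exact interleaving order are assumptions.

import torch

def build_multimodal_sequence(text_embeds, vision_tokens, img_start, img_end):
    """Interleave modalities before feeding the LLM backbone (sketch).

    text_embeds:   (T, D) embedded text tokens
    vision_tokens: (K, D) SeTok outputs, where K varies per image
    img_start/img_end: (D,) embeddings of the special '[Img]' / '[/Img]' tokens
    """
    visual_span = torch.cat([img_start.unsqueeze(0),
                             vision_tokens,
                             img_end.unsqueeze(0)], dim=0)
    # Image-first layout shown here; interleaved or text-first layouts work the same way.
    return torch.cat([visual_span, text_embeds], dim=0)  # (T + K + 2, D)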
We train Setokim in three distinct phases to endow it with robust vision understanding and task execution capabilities.
Figure 4: Qualitative results on image understanding and generation. Words marked in green are key elements of the questions and answers.
Figure 5: Visualization of segmentation results compared with GLaMM and Osprey.
Figure 6: Qualitative comparison between MLLMs on image editing. Setokim excels at following instructions and preserving low-level image details.
Figure 7: Visualizations of the visual tokens.
Figure 8: Image reconstruction results from visual tokens via the denoising U-Net.
@article{wu2024towards,
title={Towards Semantic Equivalence of Tokenization in Multimodal LLM},
author={Wu, Shengqiong and Fei, Hao and Li, Xiangtai and Ji, Jiayi and Zhang, Hanwang and Chua, Tat-Seng and Yan, Shuicheng},
journal={arXiv preprint arXiv:2406.05127},
year={2024}
}