Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in vision-language tasks. A crux of MLLMs lies in vision tokenization: efficiently transforming input visual signals into feature representations that are most beneficial for LLMs. However, existing vision tokenizers, essential for semantic alignment between vision and language, remain problematic: they aggressively fragment the visual input, corrupting its semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. Equipped with SeTok, the proposed MLLM (Setokim) demonstrates significantly superior performance across various tasks, as evidenced by our experimental results.
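To make the idea concrete, the following is a minimal sketch of what such a dynamic clustering step could look like: patch features are greedily grouped into semantic units, and the number of resulting tokens varies with image complexity rather than being fixed. The function name, the running-mean update, and the cosine-similarity threshold are all illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def dynamic_cluster(features: np.ndarray, sim_threshold: float = 0.7) -> np.ndarray:
    """Group L patch features of shape (L, D) into a variable number of clusters.

    A patch joins the first existing cluster whose running-mean center has
    cosine similarity >= sim_threshold; otherwise it seeds a new cluster.
    Returns one mean-pooled token per cluster, shape (K, D) with 1 <= K <= L.
    """
    norm = lambda x: x / (np.linalg.norm(x) + 1e-8)
    centers, members = [], []
    for f in features:
        fn = norm(f)
        for k, c in enumerate(centers):
            if float(fn @ norm(c)) >= sim_threshold:
                members[k].append(f)
                centers[k] = np.mean(members[k], axis=0)  # update running mean
                break
        else:  # no sufficiently similar cluster found: start a new one
            centers.append(f.copy())
            members.append([f])
    return np.stack(centers)  # (K, D): one "semantic token" per cluster

# A complex image (many distinct features) yields many tokens; a uniform one, few.
tokens = dynamic_cluster(np.random.randn(196, 64).astype(np.float32))
```

Because the cluster count K is data-dependent, downstream components (e.g., the LLM input sequence) must handle variable-length vision-token sequences.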
Figure 1: Comparisons of how existing MLLMs tokenize an input image via (a) Patchifying image, (b) Codebook, and (c) Cluster Merger. In (d), we show four language-driven vision tasks enhanced with semantic-equivalent vision tokens, where regions with the same color denote a vision token.
(a) The Vision Cluster automatically groups visual features from the input image into a dynamic number of visual clusters. (b) The Vision Merger aggregates visual embeddings rather than merely using cluster centers as the definitive vision tokens. (c) The core backbone is an LLM. (d) The Vision Decoder decodes realistic images by taking the tokenized vision tokens as input. (e) The Mask Decoder takes the vision tokens as input to decode the object masks.
Figure 2: The overview of Setokim. The visual embeddings extracted by a vision encoder are tokenized into vision tokens by SeTok and combined with text tokens as input to the LLM for vision-language understanding. The output vision embeddings from the LLM are fed into a vision decoder and a mask decoder to generate realistic images and semantic segmentation masks, respectively.
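The dataflow described in this overview can be sketched at a high level as follows. All module arguments are placeholders standing in for the actual components (vision tokenizer, LLM backbone, and the two decoders); this is an assumed wiring diagram in code form, not the real implementation.

```python
import numpy as np

def setokim_forward(image_feats, text_tokens, setok, llm, vision_dec, mask_dec):
    """Hypothetical forward pass mirroring the Figure-2 pipeline."""
    vis_tokens = setok(image_feats)                  # SeTok: (L, D) -> (K, D), K variable
    seq = np.concatenate([vis_tokens, text_tokens])  # vision tokens + text tokens
    hidden = llm(seq)                                # LLM backbone over the joint sequence
    out_vis = hidden[: len(vis_tokens)]              # output vision embeddings from the LLM
    image = vision_dec(out_vis)                      # realistic image generation
    mask = mask_dec(out_vis)                         # semantic segmentation mask
    return image, mask
```

With identity stubs for every module, the sketch simply routes the vision-token slice of the LLM output to both decoders, which is the essential structural point of the figure.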
We train Setokim in three distinct phases to endow it with robust vision understanding and task execution capabilities.
Figure 3: Qualitative results on image understanding and generation. The words marked in green are key elements in questions and answers.
Figure 4: Visualizations of segmentation results compared with GLaMM and Osprey.
Figure 5: Qualitative comparison between MLLMs for the image editing. Setokim excels in adhering to instructions and preserving low-level image details.
Figure 6: Visualizations of the visual tokens.
Figure 7: Image reconstruction results from visual tokens via the denoising U-Net.
@article{wu2024towards,
title={Towards Semantic Equivalence of Tokenization in Multimodal LLM},
author={Wu, Shengqiong and Fei, Hao and Li, Xiangtai and Ji, Jiayi and Zhang, Hanwang and Chua, Tat-Seng and Yan, Shuicheng},
journal={arXiv preprint arXiv:2406.05127},
year={2024}
}