Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling

Abstract

Existing research on multimodal relation extraction (MRE) faces two co-existing challenges: internal-information over-utilization and external-information under-exploitation. To address both, we propose a novel framework that simultaneously performs internal-information screening and external-information exploiting. First, we represent the fine-grained semantic structures of the input image and text with visual and textual scene graphs, which are then fused into a unified cross-modal graph (CMG). Based on the CMG, we perform structure refinement guided by the graph information bottleneck principle, actively denoising less-informative features. Next, we perform topic modeling over the input image and text, incorporating latent multimodal topic features to enrich the context. On the benchmark MRE dataset, our system significantly outperforms the current best model. Further in-depth analyses reveal the great potential of our method for the MRE task.

Motivation

Current methods fail to sufficiently harness the feature sources from two information perspectives, which hinders further MRE development:

Internal-information over-utilization. Prior research shows that only parts of the text are useful for relation inference, and that visual sources do not always play a positive role in MRE. Fine-grained feature screening over both the internal image and text features is therefore needed.

External-information under-exploitation. Even when the text is supplemented with visual sources, MRE can still suffer from information deficiency, in particular when the visual features offer little (or even negative) utility. More external semantic information should therefore be exploited for MRE.

Method

We thus propose a novel framework for improving MRE, which consists of five parts:

Scene Graph Generation. The model takes as input an image `I` and text `T`, as well as the subject entity `v_s` and the object entity `v_o`. We represent `I` and `T` with a visual scene graph (VSG) and a textual scene graph (TSG), respectively.

Cross-modal Graph Construction. The VSG and TSG are assembled into a cross-modal graph (CMG), which is then encoded with a graph encoder.
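To make the construction step concrete, here is a minimal sketch of merging two scene graphs into one cross-modal graph. It is an illustration under stated assumptions, not the released implementation: the `SceneGraph` class, the `build_cmg` function, the `v:`/`t:` node prefixes, the toy 2-dimensional node features, and the rule of linking a visual node to a textual node when their cosine similarity exceeds `threshold` are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class SceneGraph:
    """Toy scene graph: node name -> feature vector, plus a set of edges."""
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_cmg(vsg, tsg, threshold=0.8):
    """Merge a VSG and a TSG, adding cross-modal edges between similar nodes.

    The similarity-threshold linking rule is an assumption of this sketch.
    """
    cmg = SceneGraph()
    # Prefix node names so visual and textual nodes never collide.
    for name, feat in vsg.nodes.items():
        cmg.nodes[f"v:{name}"] = feat
    for name, feat in tsg.nodes.items():
        cmg.nodes[f"t:{name}"] = feat
    # Keep every intra-modal edge from the two source graphs.
    cmg.edges |= {(f"v:{a}", f"v:{b}") for a, b in vsg.edges}
    cmg.edges |= {(f"t:{a}", f"t:{b}") for a, b in tsg.edges}
    # Add a cross-modal edge for each sufficiently similar (visual, textual) pair.
    for v_name, v_feat in vsg.nodes.items():
        for t_name, t_feat in tsg.nodes.items():
            if cosine(v_feat, t_feat) > threshold:
                cmg.edges.add((f"v:{v_name}", f"t:{t_name}"))
    return cmg


vsg = SceneGraph({"man": [1.0, 0.0], "guitar": [0.0, 1.0]}, {("man", "guitar")})
tsg = SceneGraph({"singer": [0.9, 0.1], "stage": [-1.0, 0.2]}, {("singer", "stage")})
cmg = build_cmg(vsg, tsg)
```

In this toy input only the `man` and `singer` nodes are similar enough to be linked, so the merged graph keeps the two original edges and gains one cross-modal edge.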

GIB-guided Feature Refinement. We perform GIB-guided feature refinement (GENE) over the CMG for internal-information screening, i.e., node filtering and edge adjusting, which results in a structurally compact backbone graph.
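The two refinement operations, node filtering and edge adjusting, can be illustrated with a deliberately simplified sketch. In the framework the decisions are learned under the graph information bottleneck objective; here the relevance scores are simply supplied by hand and compared against fixed cut-offs. The function name `refine_graph`, the score dictionaries, and the thresholds `node_tau` and `edge_tau` are hypothetical and serve only to show what a structurally compact backbone graph looks like.

```python
def refine_graph(nodes, edges, node_score, edge_score, node_tau=0.5, edge_tau=0.5):
    """Return the backbone graph left after node filtering and edge adjusting.

    Hard thresholding stands in for the learned, GIB-guided decisions;
    this substitution is an assumption of the sketch.
    """
    # Node filtering: drop nodes whose relevance score is below node_tau.
    kept_nodes = {n for n in nodes if node_score.get(n, 0.0) >= node_tau}
    # Edge adjusting: keep an edge only if both endpoints survive
    # and the edge's own score reaches edge_tau.
    kept_edges = {
        (a, b)
        for (a, b) in edges
        if a in kept_nodes and b in kept_nodes
        and edge_score.get((a, b), 0.0) >= edge_tau
    }
    return kept_nodes, kept_edges


nodes = {"v:man", "v:guitar", "v:tree", "t:singer"}
edges = {
    ("v:man", "v:guitar"),
    ("v:man", "v:tree"),
    ("v:man", "t:singer"),
    ("v:guitar", "t:singer"),
}
node_score = {"v:man": 0.9, "v:guitar": 0.7, "v:tree": 0.1, "t:singer": 0.95}
edge_score = {
    ("v:man", "v:guitar"): 0.8,
    ("v:man", "v:tree"): 0.9,
    ("v:man", "t:singer"): 0.9,
    ("v:guitar", "t:singer"): 0.2,
}
kept_nodes, kept_edges = refine_graph(nodes, edges, node_score, edge_score)
```

Note that the edge to `v:tree` is removed even though its own score is high, because one of its endpoints was filtered out; the low-scoring edge between `v:guitar` and `t:singer` is removed while both of its nodes remain.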

Multimodal Topic Integration. The multimodal topic features induced by the latent multimodal topic model (LAMO) are integrated, via an attention operation, into the compressed feature representation obtained in the previous step, for external-information exploitation.
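As a rough illustration of attention-based integration, the sketch below scores a set of topic vectors against a context vector, normalizes the scores with a softmax, and appends the weighted topic summary to the context. The scaled dot-product scoring, the concatenation at the end, and the names `attend_topics`, `context`, and `topics` are assumptions made for this example; the source only states that an attention operation is used.

```python
from math import exp, sqrt


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_topics(context, topics):
    """Enrich a context vector with an attention-weighted sum of topic vectors.

    Scaled dot-product scoring and concatenation are assumptions of this sketch.
    """
    dim = len(context)
    # Score each topic vector against the context.
    scores = [
        sum(c * t for c, t in zip(context, topic)) / sqrt(dim) for topic in topics
    ]
    weights = softmax(scores)
    # Weighted sum of topic vectors, computed one dimension at a time.
    summary = [
        sum(w * topic[i] for w, topic in zip(weights, topics)) for i in range(dim)
    ]
    # Concatenate the original context with the topic summary.
    return context + summary, weights


context = [1.0, 0.0]
topics = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
enriched, weights = attend_topics(context, topics)
```

The topic most aligned with the context receives the largest weight, so the enriched representation leans toward topics that agree with the refined graph features.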

Inference. The decoder predicts the relation label `Y` based on the enriched features.

Experiment

Experimental results. Here are the main results, ablation study and some analyses:

Multimodal topic-keywords. Here are some textual and visual topic-keywords induced by the latent multimodal topic model (LAMO).

BibTeX

@inproceedings{WUAcl23MMRE,
  author    = {Shengqiong Wu and Hao Fei and Yixin Cao and Lidong Bing and Tat-Seng Chua},
  title     = {Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling},
  booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics},
  year      = {2023},
}