MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering
Findings of EMNLP 2020
Amir Mazaheri, Mubarak Shah
2019 – 2020
A multimodal fusion transformer with BERT encodings that achieves state-of-the-art results on the TVQA dataset.
Introduces MMFT-BERT (Multimodal Fusion Transformer with BERT encodings) for Visual Question Answering over video. The model processes the video and text modalities both separately and jointly, using BERT encodings and a novel fusion method. It achieves state-of-the-art results on the TVQA dataset. The work also introduces TVQA-Visual, a diagnostic subset for analyzing how the model handles both modalities.