Video Fill In the Blank using LR/RL LSTMs with Spatial-Temporal Attentions
ICCV 2017
Amir Mazaheri, Dong Zhang, Mubarak Shah
2016 – 2017
Bidirectional LSTMs with spatial-temporal attention to predict missing words in video descriptions.
Tackles the Video-Fill-In-the-Blank (VFIB) problem: given a video and a description with one word missing, predict that word. The proposed framework encodes the sentence fragments around the blank with two LSTMs, one reading left to right and one reading right to left, combined with an external memory. The video is encoded with spatial and temporal attention models, which select the discriminative visual features used to predict the missing word.
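To illustrate the two ideas named above, here is a minimal sketch in plain Python, not the paper's implementation. It shows how a sentence can be split into a left fragment (read left to right) and a right fragment (read right to left) around the blank, and how soft temporal attention can weight per-frame features. The function names, the `_____` blank token, and the dot-product scoring are all assumptions made for illustration. The LSTMs, the external memory, and the spatial attention are omitted.

```python
import math


def split_around_blank(tokens, blank="_____"):
    """Split a tokenized sentence at the blank (hypothetical helper).

    Returns the left fragment in reading order and the right fragment
    reversed, i.e. in the order a right-to-left encoder would consume it.
    """
    i = tokens.index(blank)
    return tokens[:i], tokens[i + 1:][::-1]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(query, frame_features):
    """Soft attention over frames using dot-product scores (an assumption;
    a learned scoring function would normally be used instead).

    Returns the attention weights and the weighted average of the frames.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, attended


# Toy inputs: a sentence with a blank and three 2-D "frame features".
tokens = "a man is _____ a guitar".split()
left, right = split_around_blank(tokens)

query = [1.0, 0.0]  # stand-in for a text encoding
frames = [[0.1, 0.9], [2.0, 0.0], [0.5, 0.5]]
weights, attended = temporal_attention(query, frames)
```

In this toy setup the second frame is the one most aligned with the query, so it receives the largest attention weight. In a full model the query would come from the text encoders rather than being set by hand.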