Research Repository

Multimodal Deep Features Fusion For Video Memorability Prediction

Leyva, Roberto and Doctor, Faiyaz and Garcia Seco De Herrera, Alba and Sahab, Sohail (2019) Multimodal Deep Features Fusion For Video Memorability Prediction. In: MediaEval, 2019-10-27 - 2019-10-29, Sophia Antipolis, France. (In Press)

mediaEval2019.pdf - Submitted Version


This paper describes a multimodal feature fusion approach for predicting short-term and long-term video memorability, where the goal is to design a system that automatically predicts scores reflecting the probability of a video being remembered. The approach performs early fusion of text, image, and video features. Text features are extracted using a Convolutional Neural Network (CNN), image features using an FBResNet152 pre-trained on ImageNet, and video features using a 3DResNet152 pre-trained on Kinetics 400. We use Fisher Vectors to obtain a single vector associated with each video, which overcomes the need for a non-fixed global vector representation for handling temporal information. The fusion approach demonstrates good predictive performance and regression superiority in terms of correlation over standard features.
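
As an illustration of what early fusion means in practice, the following minimal sketch concatenates one fixed-length vector per modality for each video and fits a single regressor on the fused vector. It is not the authors' implementation: the feature dimensions, the random stand-in features, the L2 normalisation step, the ridge regressor, and the function names (`l2_normalize`, `early_fusion`, `fit_ridge`, `predict`) are all assumptions made for the example. In the paper the per-video vectors come from the CNN, FBResNet152, and 3DResNet152 extractors with Fisher Vector encoding, none of which is reproduced here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so no modality dominates by magnitude."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def early_fusion(text_feats, image_feats, video_feats):
    """Concatenate the per-video vectors of each modality into one vector."""
    parts = [
        l2_normalize(np.asarray(p, dtype=np.float64))
        for p in (text_feats, image_feats, video_feats)
    ]
    return np.concatenate(parts, axis=1)


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression (assumed stand-in regressor).

    A constant column is appended for the bias term; for simplicity the
    bias is regularised together with the other weights.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def predict(X, w):
    """Apply the fitted weights (including bias) to fused features."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w


# Hypothetical data: 8 videos with made-up feature dimensions.
rng = np.random.default_rng(0)
n_videos = 8
text = rng.normal(size=(n_videos, 16))    # stand-in for text features
image = rng.normal(size=(n_videos, 32))   # stand-in for image features
video = rng.normal(size=(n_videos, 32))   # stand-in for video features
scores = rng.uniform(0.0, 1.0, size=n_videos)  # stand-in memorability scores

fused = early_fusion(text, image, video)  # one 80-dimensional vector per video
weights = fit_ridge(fused, scores)
predicted = predict(fused, weights)
```

The point of the sketch is the ordering: the modalities are combined before any regressor sees them, so a single model learns from all three at once. A late-fusion design would instead train one model per modality and combine their predictions.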

Item Type: Conference or Workshop Item (Paper)
Additional Information: Published proceedings: CEUR Workshop Proceedings
Divisions: Faculty of Science and Health
Faculty of Science and Health > Computer Science and Electronic Engineering, School of
SWORD Depositor: Elements
Depositing User: Elements
Date Deposited: 27 Jan 2020 10:52
Last Modified: 23 Sep 2022 19:37
