Research Repository

Multimodal deep features fusion for video memorability prediction

Leyva, R and Doctor, F and Seco de Herrera, AG and Sahab, S (2019) Multimodal deep features fusion for video memorability prediction. In: UNSPECIFIED, ? - ?.

mediaEval2019.pdf - Submitted Version

Abstract

This paper describes a multimodal feature fusion approach for predicting short-term and long-term video memorability, where the goal is to design a system that automatically predicts scores reflecting the probability of a video being remembered. The approach performs early fusion of text, image, and video features. Text features are extracted using a Convolutional Neural Network (CNN), image features are extracted with an FBResNet152 pre-trained on ImageNet, and video features are extracted using a 3DResNet152 pre-trained on Kinetics 400. We use Fisher Vectors to obtain a single fixed-length vector for each video, which avoids the need for a variable-length global representation to handle temporal information. The fusion approach demonstrates good predictive performance and superior regression correlation compared with standard features.
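The encoding idea in the abstract can be sketched as follows: frame-level descriptors of a video are summarised by gradients of a Gaussian Mixture Model (the standard Fisher Vector), yielding one fixed-length vector per video regardless of its frame count, which is then concatenated with the other modality features (early fusion). This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, GMM size, and the random placeholder features standing in for the CNN text, FBResNet152 image, and 3DResNet152 video descriptors are all assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode a variable-length set of frame descriptors as one fixed-length
    Fisher Vector (gradients w.r.t. diagonal-GMM means and variances)."""
    X = np.atleast_2d(descriptors)
    N = X.shape[0]
    Q = gmm.predict_proba(X)                       # (N, K) posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (Q[:, :, None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_var = (Q[:, :, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])   # length 2*K*d
    fv = np.sign(fv) * np.sqrt(np.abs(fv))         # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)       # L2 normalisation

rng = np.random.default_rng(0)
pool = rng.normal(size=(500, 8))                   # hypothetical training descriptors
gmm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(pool)

video_frames = rng.normal(size=(37, 8))            # one video, any number of frames
fv_video = fisher_vector(video_frames, gmm)        # fixed 2*4*8 = 64 dims

text_feat = rng.normal(size=16)                    # placeholder CNN text feature
image_feat = rng.normal(size=32)                   # placeholder FBResNet152 feature
fused = np.concatenate([text_feat, image_feat, fv_video])  # early fusion
print(fused.shape)
```

A videos with 37 frames and one with 370 frames both map to the same 64-dimensional Fisher Vector, which is what makes simple concatenation-based early fusion possible across modalities.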

Item Type: Conference or Workshop Item (Paper)
Additional Information: Published proceedings: CEUR Workshop Proceedings
Divisions: Faculty of Science and Health > Computer Science and Electronic Engineering, School of
Depositing User: Elements
Date Deposited: 27 Jan 2020 10:52
Last Modified: 05 Apr 2021 18:15
URI: http://repository.essex.ac.uk/id/eprint/26580
