Research Ideas for the Facebook Hateful Memes Challenge

Dec 11, 2020 · Abhishek Das, Japsimar Singh Wahi
Hateful Memes Challenge @ NeurIPS 2020
Abstract
We propose two research ideas which, when integrated into a multimodal model, aim to learn the context behind the combination of image and text used in a meme. Our first idea is to use image captioning as a medium for introducing outside-world knowledge into the model. The highly confident error cases of the multimodal baselines show that the models tend to rely more on the text modality for predictions. Our focus in using this approach is to find a deeper relationship between the text and image modalities: we bring in the visual modality by generating its “actual caption” and, in parallel, send the image representation along with the pre-extracted caption representation to the concatenation step. Moreover, comparing the “actual caption” with the meme’s “pre-extracted caption” helps determine whether the two are aligned, because in many cases a hateful image is turned benign simply by stating what is actually happening in the image.

Our second approach is to use sentiment analysis on both the image and text modalities. Instead of only using multimodal representations obtained from pre-trained neural networks, we also include the unimodal sentiments to enrich the features. The intuition behind this idea is that current pre-trained representations, such as VisualBERT and ViLBERT, are trained with the objective of predicting the semantic correlation between image and text, but semantic information is difficult to capture and may not be enough for solving our task. We therefore include high-level features such as text and image sentiment, since sentiment analysis is a related but relatively simple task.
Date: Dec 11, 2020, 6:00 PM
Event: Hateful Memes Challenge @ NeurIPS 2020
Location: Virtual

Talk Overview

This was a contributed talk presented at the Hateful Memes Challenge session @ NeurIPS 2020.

Key Research Ideas

  1. Object Detection-based Image Captioning: Using image captioning to introduce outside-world knowledge and find a deeper relationship between the text and image modalities (see the first sketch after this list).

  2. Sentiment Analysis on Both Modalities: Including high-level features like text and image sentiments to enrich the multimodal representations (see the second sketch after this list).
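
The first idea could be wired up roughly as follows. This is a minimal sketch, not the project’s actual implementation: it assumes the image features, the embedding of the meme’s pre-extracted caption (the overlaid text), and the embedding of an “actual caption” produced by an off-the-shelf image captioner are all precomputed, and the `CaptionAwareFusion` module, its dimensions, and the cosine-similarity alignment signal are illustrative choices.

```python
# Minimal sketch (assumptions, not the released code): fuse image features,
# the meme's pre-extracted caption, and a generated "actual caption",
# plus a crude alignment score between the two captions.
import torch
import torch.nn as nn


class CaptionAwareFusion(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512):
        super().__init__()
        # Project each precomputed feature into a shared space.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.meme_txt_proj = nn.Linear(txt_dim, hidden)
        self.gen_cap_proj = nn.Linear(txt_dim, hidden)
        # Classify over the concatenated features plus the alignment score.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # hateful vs. not hateful
        )

    def forward(self, img_feat, meme_txt_feat, gen_cap_feat):
        img = self.img_proj(img_feat)
        meme = self.meme_txt_proj(meme_txt_feat)
        cap = self.gen_cap_proj(gen_cap_feat)
        # Cosine similarity as a rough "is the meme text aligned with what
        # is actually shown in the image?" signal.
        align = torch.cosine_similarity(meme, cap, dim=-1).unsqueeze(-1)
        fused = torch.cat([img, meme, cap, align], dim=-1)
        return self.classifier(fused)


# Toy usage with random tensors standing in for real extractor outputs.
model = CaptionAwareFusion()
logits = model(torch.randn(4, 2048), torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```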

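The second idea, enriching a pre-trained multimodal representation with unimodal sentiment cues, could look roughly like this. Again a minimal sketch under assumptions: the pooled multimodal feature (e.g., from a VisualBERT- or ViLBERT-style encoder) and the per-modality sentiment scores are assumed to come from off-the-shelf models run beforehand, and `SentimentEnrichedClassifier` and its dimensions are hypothetical.

```python
# Minimal sketch (assumptions, not the released code): concatenate a pooled
# multimodal embedding with text and image sentiment scores before classifying.
import torch
import torch.nn as nn


class SentimentEnrichedClassifier(nn.Module):
    def __init__(self, mm_dim=768, txt_sent_dim=3, img_sent_dim=3, hidden=256):
        super().__init__()
        # txt_sent_dim / img_sent_dim: e.g., (negative, neutral, positive)
        # scores from off-the-shelf text and visual sentiment models.
        self.net = nn.Sequential(
            nn.Linear(mm_dim + txt_sent_dim + img_sent_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, 2),  # hateful vs. not hateful
        )

    def forward(self, mm_feat, txt_sentiment, img_sentiment):
        # Combine the semantic multimodal feature with high-level sentiment
        # cues from each modality.
        fused = torch.cat([mm_feat, txt_sentiment, img_sentiment], dim=-1)
        return self.net(fused)


# Toy usage with random stand-in features.
clf = SentimentEnrichedClassifier()
out = clf(
    torch.randn(4, 768),
    torch.softmax(torch.randn(4, 3), -1),
    torch.softmax(torch.randn(4, 3), -1),
)
print(out.shape)  # torch.Size([4, 2])
```
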
This talk is based on our project Detecting Hate Speech in Multi-modal Memes.
