HEMM: Holistic Evaluation of Multimodal Foundation Models

1Carnegie Mellon University 2KJSCE

Abstract

Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly being leveraged in a range of real-world domains. It is challenging to characterize and study progress in multimodal foundation modeling, given the range of possible modeling decisions, tasks, and downstream domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) as a framework to systematically evaluate the capabilities of multimodal foundation models across three comprehensive dimensions: basic skills, information flow, and real-world use cases. Basic skills are internal multimodal abilities required to solve problems, such as learning interactions, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge. Information flow studies how multimodal content changes during a task through querying, translation, editing, and fusion. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and interactive agent applications. Overall, HEMM includes 30 datasets and enables a comprehensive evaluation of multimodal models. Using HEMM, our results (1) summarize the performance of individual models across multimodal tasks, and (2) distill broader performance trends regarding different design decisions in multimodal models (e.g., scale, frozen unimodal encoders, adapters, instruction tuning). Our experiments suggest promising research directions for the community (TODO). HEMM is publicly available at \url{anon} and encourages community involvement in its expansion.

HEMM Overview

HEMM is an evaluation framework that characterizes multimodal models along several dimensions (size, architecture, pretraining objective, fine-tuning objective, training data) and emphasizes holistic benchmarking of these models at three disentangled levels (a code sketch of this categorization follows the list below):

  • Basic skills
    Benchmarking models' abilities to address multimodal problems, such as multimodal interactions, multimodal alignment, reasoning across compositional features, and integration of external knowledge.
  • Information flow
    Benchmarking models' abilities to transform multimodal information during tasks such as querying, translation, editing, and fusion.
  • Use cases
    Benchmarking models' abilities to perform on real-world problems from multimedia, affective computing, natural sciences, healthcare, and human-computer interaction.
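To make this categorization concrete, here is a minimal sketch, in Python, of how a benchmark task could be tagged along the three levels. All names (`Skill`, `InformationFlow`, `UseCase`, `TaskProfile`) and the example tagging of VQA-RAD are illustrative assumptions rather than the released HEMM code.

```python
# Minimal sketch (illustrative, not the released HEMM code) of tagging a
# benchmark task along HEMM's three levels.
from dataclasses import dataclass
from enum import Enum


class Skill(Enum):
    """Basic skills: internal abilities needed to solve a task."""
    INTERACTIONS = "multimodal interactions"
    ALIGNMENT = "fine-grained alignment"
    REASONING = "multi-step reasoning"
    KNOWLEDGE = "external knowledge"


class InformationFlow(Enum):
    """How multimodal content changes during the task."""
    QUERYING = "querying"
    TRANSLATION = "translation"
    EDITING = "editing"
    FUSION = "fusion"


class UseCase(Enum):
    """Real-world domains covered by HEMM."""
    MULTIMEDIA = "multimedia"
    AFFECT = "affective computing"
    SCIENCE = "natural sciences"
    HEALTH = "healthcare"
    HCI = "human-computer interaction"


@dataclass
class TaskProfile:
    """Characterization of one dataset along the three levels."""
    name: str
    skills: list[Skill]
    flow: InformationFlow
    use_case: UseCase


# Example tagging (illustrative, not the paper's official annotation):
vqa_rad = TaskProfile(
    name="VQA-RAD",
    skills=[Skill.ALIGNMENT, Skill.KNOWLEDGE],
    flow=InformationFlow.QUERYING,
    use_case=UseCase.HEALTH,
)
```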

Figure 1. HEMM benchmark overview.

Key Challenges

Based on the holistic evaluation of multimodal models in HEMM, we identify several key challenges that multimodal models face in real-world applications:

  • Challenging datasets
    Health, HCI, and science datasets are relatively difficult use cases for multimodal foundation models.
  • Multimodal interactions
    Models perform better on redundant interactions but struggle when visual information is not directly referenced by the text.
  • Reasoning, fine-grained alignment, and knowledge
    We need better datasets that test complex reasoning and fine-grained alignment: current ones do not sufficiently challenge today's models, which show no significant performance differences between tasks that do and do not require reasoning or fine-grained alignment.
  • Model and data size
    Training on diverse data sources improves over pretraining only on images and captions. The tasks that show the most improvement, iNaturalist and MemeCap, are knowledge-intensive and require complex reasoning.
  • Model architecture and training
    Aligning frozen pre-trained language and vision models outperforms end-to-end multimodal learning, and instruction-tuned models perform better than those with only supervised fine-tuning.

Multimodal Models in HEMM

In HEMM, 11 multimodal models with different architectures and training methods are included. These models are considered state-of-the-art in multimodal research and are evaluated across the three levels of HEMM. The full list of models, ordered from smallest to largest, is as follows (an illustrative evaluation interface is sketched after the list):

  • KOSMOS-2
  • Open-Flamingo
  • Instruct-BLIP
  • Llama-Adapter
  • mPLUG-OWL
  • Fuyu-8B
  • BLIP-2
  • mini-GPT-4
  • EMU
  • Gemini
  • GPT-4-vision
TODO: add a radial plot.
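Because the models above expose very different native interfaces, one practical way to evaluate them uniformly is to place each behind a common wrapper. The sketch below shows such an interface; `MultimodalModel`, `BLIP2Wrapper`, and `run_model` are hypothetical names, not the released HEMM API.

```python
# Minimal sketch (assumed interface, not the released HEMM API) of a uniform
# wrapper so heterogeneous multimodal models can be queried the same way.
from abc import ABC, abstractmethod
from typing import Iterable, List, Tuple


class MultimodalModel(ABC):
    """Hypothetical common interface over the models listed above."""

    @abstractmethod
    def generate(self, prompt: str, image_path: str) -> str:
        """Return the model's text answer to an image-plus-text query."""


class BLIP2Wrapper(MultimodalModel):
    """Illustrative wrapper; checkpoint loading and inference are omitted."""

    def generate(self, prompt: str, image_path: str) -> str:
        # Placeholder: load the image, run the underlying BLIP-2 checkpoint,
        # and decode its text output.
        raise NotImplementedError


def run_model(model: MultimodalModel,
              samples: Iterable[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Collect (prediction, reference) pairs over (prompt, image_path, reference) samples."""
    return [(model.generate(prompt, path), ref) for prompt, path, ref in samples]
```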

Multimodal Datasets in HEMM

HEMM includes 30 multimodal datasets that span a wide range of domains and tasks. These datasets are used to evaluate the performance of multimodal models across the three levels of HEMM. The full list of datasets, grouped by use case, is as follows (a score-aggregation sketch follows the list):

  • Multimedia
    Winoground, IRFL, NLVR2, GQA, NLVR, Nocaps, VCR, Flickr30k, OK-VQA, VisualGenome, MM-IMDb, VQA v1
  • Affect
    Hateful Memes, Face Emotion, Memotion, MemeCap, New York Cartoon
  • Science
    RESISC45, UC Merced, iNaturalist, Decimer, ScienceQA
  • Health
    OpenPath, VQA-RAD, SLAKE, PathVQA
  • HCI
    Screen2Words, Enrico
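As a rough illustration of how per-dataset scores roll up into the use-case-level view in the figure below, here is a minimal aggregation sketch; the `USE_CASES` mapping mirrors the list above, while the function name and the scores in the usage comment are placeholders.

```python
# Minimal sketch (illustrative, not the official evaluation script) of
# aggregating per-dataset scores into per-use-case averages.
from collections import defaultdict
from statistics import mean

# Dataset groupings mirror the list above.
USE_CASES = {
    "Multimedia": ["Winoground", "IRFL", "NLVR2", "GQA", "NLVR", "Nocaps",
                   "VCR", "Flickr30k", "OK-VQA", "VisualGenome", "MM-IMDb",
                   "VQA v1"],
    "Affect": ["Hateful Memes", "Face Emotion", "Memotion", "MemeCap",
               "New York Cartoon"],
    "Science": ["RESISC45", "UC Merced", "iNaturalist", "Decimer",
                "ScienceQA"],
    "Health": ["OpenPath", "VQA-RAD", "SLAKE", "PathVQA"],
    "HCI": ["Screen2Words", "Enrico"],
}


def summarize_by_use_case(per_dataset_scores: dict) -> dict:
    """Average per-dataset scores within each use case (datasets without scores are skipped)."""
    grouped = defaultdict(list)
    for use_case, datasets in USE_CASES.items():
        for name in datasets:
            if name in per_dataset_scores:
                grouped[use_case].append(per_dataset_scores[name])
    return {uc: mean(scores) for uc, scores in grouped.items() if scores}


# Usage with placeholder scores:
# summarize_by_use_case({"GQA": 0.61, "VQA-RAD": 0.38})
```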

Figure 3. Model performance on 30 datasets in HEMM, grouped by use case.

BibTeX