HEMM: Holistic Evaluation of Multimodal Foundation Models

1Carnegie Mellon University 2KJSCE

Abstract

Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly being leveraged in a range of real-world domains. It is challenging to characterize and study progress in multimodal foundation modeling, given the range of possible modeling decisions, tasks, and downstream domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) as a framework to systematically evaluate the capabilities of multimodal foundation models across three comprehensive dimensions: basic skills, information flow, and real-world use cases. Basic skills are internal multimodal abilities required to solve problems, such as learning interactions, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge. Information flow studies how multimodal content changes during a task through querying, translation, editing, and fusion. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and interactive agent applications. Overall, HEMM includes 30 datasets and enables a comprehensive evaluation of multimodal models. Using HEMM, our results (1) summarize the performance of individual models across multimodal tasks, and (2) distill broader performance trends regarding different design decisions in multimodal models (e.g., scale, frozen unimodal encoders, adapters, instruction tuning). Our experiments suggest promising research directions for the community (TODO). HEMM is publicly available at \url{anon} and encourages community involvement in its expansion.

HEMM Overview

HEMM is an evaluation framework that characterizes multimodal models along several dimensions (size, architecture, pretraining objective, fine-tuning objective, training data) and emphasizes holistic benchmarking of these models at three disentangled levels (a code sketch of this categorization follows the list below):

  • Basic skills
    Benchmarking models' abilities to address multimodal problems, such as multimodal interactions, multimodal alignment, reasoning across compositional features, and integration of external knowledge.
  • Information flow
    Benchmarking models' abilities to transform multimodal information during tasks such as querying, translation, editing, and fusion.
  • Use cases
    Benchmarking models' abilities to perform on real-world problems from multimedia, affective computing, natural sciences, healthcare, and human-computer interaction.
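To make this categorization concrete, here is a minimal sketch, in Python, of how a benchmark task could be tagged along the three levels. All names (`Skill`, `InformationFlow`, `UseCase`, `TaskProfile`) and the example tagging of VQA-RAD are illustrative assumptions rather than the released HEMM code.

```python
# Minimal sketch (illustrative, not the released HEMM code) of tagging a
# benchmark task along HEMM's three levels.
from dataclasses import dataclass
from enum import Enum


class Skill(Enum):
    """Basic skills: internal abilities needed to solve a task."""
    INTERACTIONS = "multimodal interactions"
    ALIGNMENT = "fine-grained alignment"
    REASONING = "multi-step reasoning"
    KNOWLEDGE = "external knowledge"


class InformationFlow(Enum):
    """How multimodal content changes during the task."""
    QUERYING = "querying"
    TRANSLATION = "translation"
    EDITING = "editing"
    FUSION = "fusion"


class UseCase(Enum):
    """Real-world domains covered by HEMM."""
    MULTIMEDIA = "multimedia"
    AFFECT = "affective computing"
    SCIENCE = "natural sciences"
    HEALTH = "healthcare"
    HCI = "human-computer interaction"


@dataclass
class TaskProfile:
    """Characterization of one dataset along the three levels."""
    name: str
    skills: list[Skill]
    flow: InformationFlow
    use_case: UseCase


# Example tagging (illustrative, not the paper's official annotation):
vqa_rad = TaskProfile(
    name="VQA-RAD",
    skills=[Skill.ALIGNMENT, Skill.KNOWLEDGE],
    flow=InformationFlow.QUERYING,
    use_case=UseCase.HEALTH,
)
```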

Figure 1. HEMM benchmark overview.

Key Challenges

Based on the holistic evaluation of multimodal models in HEMM, we identify several key challenges that multimodal models face in real-world applications:

  • Challenging datasets
    Health, HCI, and science datasets are relatively difficult use cases for multimodal foundation models.
  • Multimodal interactions
    Models perform better on redundant interactions but struggle when visual information is not directly referenced by the text.
  • Reasoning, fine-grained alignment, and knowledge
    We need better datasets that test complex reasoning and fine-grained alignment: current ones do not sufficiently challenge today's models, which show no significant performance differences between tasks that do and do not require reasoning or fine-grained alignment.
  • Model and data size
    Training on diverse data sources improves over pretraining only on images and captions. The tasks that show the most improvement, iNaturalist and MemeCap, are knowledge-intensive and require complex reasoning.
  • Model architecture and training
    Aligning frozen pre-trained language and vision models outperforms end-to-end multimodal learning, and instruction-tuned models perform better than those with only supervised fine-tuning.

Multimodal Models in HEMM

In HEMM, 11 multimodal models with different architectures and training methods are included. These models are considered state-of-the-art in multimodal research and are evaluated across the three levels of HEMM. The full list of models, ordered from smallest to largest, is as follows (an illustrative evaluation interface is sketched after the list):

  • KOSMOS-2
  • Open-Flamingo
  • Instruct-BLIP
  • Llama-Adapter
  • mPLUG-OWL
  • Fuyu-8B
  • BLIP-2
  • mini-GPT-4
  • EMU
  • Gemini
  • GPT-4-vision
TODO: add a radial plot.
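Because the models above expose very different native interfaces, one practical way to evaluate them uniformly is to place each behind a common wrapper. The sketch below shows such an interface; `MultimodalModel`, `BLIP2Wrapper`, and `run_model` are hypothetical names, not the released HEMM API.

```python
# Minimal sketch (assumed interface, not the released HEMM API) of a uniform
# wrapper so heterogeneous multimodal models can be queried the same way.
from abc import ABC, abstractmethod
from typing import Iterable, List, Tuple


class MultimodalModel(ABC):
    """Hypothetical common interface over the models listed above."""

    @abstractmethod
    def generate(self, prompt: str, image_path: str) -> str:
        """Return the model's text answer to an image-plus-text query."""


class BLIP2Wrapper(MultimodalModel):
    """Illustrative wrapper; checkpoint loading and inference are omitted."""

    def generate(self, prompt: str, image_path: str) -> str:
        # Placeholder: load the image, run the underlying BLIP-2 checkpoint,
        # and decode its text output.
        raise NotImplementedError


def run_model(model: MultimodalModel,
              samples: Iterable[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Collect (prediction, reference) pairs over (prompt, image_path, reference) samples."""
    return [(model.generate(prompt, path), ref) for prompt, path, ref in samples]
```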

Multimodal Datasets in HEMM

HEMM includes 30 multimodal datasets that span a wide range of domains and tasks. These datasets are used to evaluate the performance of multimodal models across the three levels of HEMM. The full list of datasets, grouped by use case, is as follows (a score-aggregation sketch follows the list):

  • Multimedia
    Winoground, IRFL, NLVR2, GQA, NLVR, Nocaps, VCR, Flickr30k, OK-VQA, VisualGenome, MM-IMDb, VQA v1
  • Affect
    Hateful Memes, Face Emotion, Memotion, MemeCap, New York Cartoon
  • Science
    RESISC45, UC Merced, iNaturalist, Decimer, ScienceQA
  • Health
    OpenPath, VQA-RAD, SLAKE, PathVQA
  • HCI
    Screen2Words, Enrico
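As a rough illustration of how per-dataset scores roll up into the use-case-level view in the figure below, here is a minimal aggregation sketch; the `USE_CASES` mapping mirrors the list above, while the function name and the scores in the usage comment are placeholders.

```python
# Minimal sketch (illustrative, not the official evaluation script) of
# aggregating per-dataset scores into per-use-case averages.
from collections import defaultdict
from statistics import mean

# Dataset groupings mirror the list above.
USE_CASES = {
    "Multimedia": ["Winoground", "IRFL", "NLVR2", "GQA", "NLVR", "Nocaps",
                   "VCR", "Flickr30k", "OK-VQA", "VisualGenome", "MM-IMDb",
                   "VQA v1"],
    "Affect": ["Hateful Memes", "Face Emotion", "Memotion", "MemeCap",
               "New York Cartoon"],
    "Science": ["RESISC45", "UC Merced", "iNaturalist", "Decimer",
                "ScienceQA"],
    "Health": ["OpenPath", "VQA-RAD", "SLAKE", "PathVQA"],
    "HCI": ["Screen2Words", "Enrico"],
}


def summarize_by_use_case(per_dataset_scores: dict) -> dict:
    """Average per-dataset scores within each use case (datasets without scores are skipped)."""
    grouped = defaultdict(list)
    for use_case, datasets in USE_CASES.items():
        for name in datasets:
            if name in per_dataset_scores:
                grouped[use_case].append(per_dataset_scores[name])
    return {uc: mean(scores) for uc, scores in grouped.items() if scores}


# Usage with placeholder scores:
# summarize_by_use_case({"GQA": 0.61, "VQA-RAD": 0.38})
```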

Figure 3. Model performance on 30 datasets in HEMM, grouped by use case.

BibTeX