Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly being used in a range of real-world domains. Characterizing and studying progress in multimodal foundation models is challenging given the range of possible modeling decisions, tasks, and downstream domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM), a framework to systematically evaluate the capabilities of multimodal foundation models across three comprehensive dimensions: basic skills, information flow, and real-world use cases. Basic skills are internal multimodal abilities required to solve problems, such as learning interactions, fine-grained alignment, multi-step reasoning, and handling external knowledge. Information flow studies how multimodal content changes during a task through querying, translation, editing, and fusion. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and interactive agent applications. Overall, HEMM includes 30 datasets and enables a comprehensive evaluation of multimodal models. Using HEMM, our results (1) summarize the performance of individual models across multimodal tasks, and (2) distill broader performance trends regarding different design decisions in multimodal models (e.g., scale, frozen unimodal encoders, adapters, instruction tuning). Our experiments suggest the following promising research directions for the community: TODO. HEMM is publicly available at \url{anon}, and we encourage community involvement in its expansion.
HEMM is an evaluation framework that characterizes multimodal models along several dimensions (size, architecture, pretraining objective, fine-tuning objective, training data) and emphasizes holistic benchmarking of these models at three disentangled levels: basic skills, information flow, and real-world use cases.
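To make the three-level structure concrete, below is a minimal sketch of how a HEMM-style evaluation loop could be organized: every dataset is tagged with its level and category, and a model is scored on each one. The `Dataset` class, `evaluate` function, and all names are hypothetical illustrations, not the released HEMM API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Dataset:
    name: str
    level: str       # "basic_skills" | "information_flow" | "use_cases"
    category: str    # e.g., "reasoning", "fusion", "healthcare"
    score: Callable[[str], float]  # runs the model on the dataset, returns a metric

def evaluate(model_name: str, datasets: List[Dataset]) -> Dict[Tuple[str, str, str], float]:
    """Score one model on every dataset, keyed by (level, category, dataset)."""
    return {(d.level, d.category, d.name): d.score(model_name) for d in datasets}

if __name__ == "__main__":
    # Toy stand-in: a single use-case dataset with a dummy scoring function.
    toy = [Dataset("toy_vqa", "use_cases", "multimedia", lambda m: 0.5)]
    print(evaluate("toy-model", toy))
```

Keying results by (level, category, dataset) makes it straightforward to aggregate performance at any of the three levels afterwards.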
Figure 1. HEMM benchmark overview.
Based on the holistic evaluation of multimodal models in HEMM, we identify several key challenges that multimodal models face in real-world applications:
HEMM includes 11 multimodal models with different architectures and training methods. These models are considered state-of-the-art in multimodal research and are evaluated across the three levels of HEMM. The full list of models, ordered from smallest to largest, is as follows:
TODO: add a radial plot.
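As a starting point for the radial plot above, here is a minimal matplotlib sketch; the dimension names and scores are illustrative placeholders, not HEMM results.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder axes and scores; replace with HEMM dimensions and model results.
dims = ["interactions", "alignment", "reasoning", "knowledge", "fusion"]
scores = {"model-A": [0.6, 0.7, 0.4, 0.5, 0.8],
          "model-B": [0.8, 0.5, 0.6, 0.7, 0.6]}

angles = np.linspace(0, 2 * np.pi, len(dims), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, vals in scores.items():
    vals = vals + vals[:1]  # close the polygon
    ax.plot(angles, vals, label=name)
    ax.fill(angles, vals, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(dims)
ax.set_ylim(0, 1)
ax.legend(loc="lower right")
plt.savefig("radial_plot.png", dpi=200)
```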
HEMM includes 30 multimodal datasets spanning a wide range of domains and tasks. These datasets are used to evaluate the performance of multimodal models across the three levels of HEMM. The full list of datasets, grouped by use case, is as follows:
Figure 3. Model performance on the 30 datasets in HEMM, grouped by use case.
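To illustrate how a figure like Figure 3 can be produced, the sketch below aggregates per-dataset scores into per-use-case means; the dataset names, use-case mapping, and scores are hypothetical placeholders.

```python
from collections import defaultdict
from statistics import mean

# Placeholder dataset-to-use-case mapping and per-dataset scores.
use_case_of = {"toy_vqa": "multimedia", "toy_meme": "affective_computing"}
scores = {"toy_vqa": 0.62, "toy_meme": 0.48}

# Group scores by use case, then average within each group.
by_use_case = defaultdict(list)
for dataset, score in scores.items():
    by_use_case[use_case_of[dataset]].append(score)

summary = {uc: mean(vals) for uc, vals in by_use_case.items()}
print(summary)  # e.g., {'multimedia': 0.62, 'affective_computing': 0.48}
```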