
Human-MME Benchmark

A holistic evaluation suite that pushes Multimodal Large Language Models to understand, reason, and interact within human-centric scenarios.

📄 Read the paper · 🤗 Download dataset · 💻 GitHub repository

Why Human-MME?

Multimodal agents are increasingly deployed alongside people. Human-MME was created to systematically evaluate these systems across the full spectrum of human-centric perception, understanding, and interaction.

The benchmark covers eight complementary capability axes — from fine-grained human recognition to multimodal perception of relationships and emotions — offering a dependable lens into real-world readiness for MLLMs.

Explore documentation

Holistic Coverage

Eight evaluation dimensions stress-test MLLMs on recognition, understanding, reasoning, and safety around humans.

Human-Centric Scenarios

Curated image-question pairs reflect daily life, complex interactions, and diverse demographics.

Model-Agnostic

Plug in any MLLM by extending a simple interface and immediately obtain comparable metrics.

Open Evaluation

Share results through the leaderboard to benchmark progress transparently across the community.

Evaluation Dimensions

Human-MME assesses MLLMs with 19,450 carefully designed examples that target the following capability axes. Each dimension is evaluated through both multiple-choice and free-form tasks.

FU — Fine-grained Understanding

Identify subtle human attributes such as age, attire, pose, or accessories from complex scenes.

BU — Broad Understanding

Comprehend the overall context, activity, or setting of human-centered imagery.

HU — Human Understanding

Interpret individuals' intentions, relationships, and social roles.

MIU — Multimodal Interaction Understanding

Reason about interactions between humans and surrounding objects or environments.

MPR — Multimodal Perception & Reasoning

Blend visual cues with commonsense knowledge to answer "how" and "why" questions.

ID — Identity Association

Match consistent identities across images and textual descriptions.

CD — Cultural Diversity

Demonstrate fairness and inclusivity across cultures, regions, and demographics.

ED — Emotional Diversity

Recognize and reason about a wide range of affective states and expressions.
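
Each dimension is probed with both multiple-choice and free-form questions. Purely as an illustration (the real schema, field names, and file paths are defined by the released dataset on Hugging Face), one item of each format might look like the following Python sketch.

    # Hypothetical item layouts, for illustration only; consult the released
    # dataset for the actual schema, field names, and image paths.
    multiple_choice_item = {
        "dimension": "ED",                    # one of the eight axes above
        "image": "images/example.jpg",        # placeholder path
        "question": "Which emotion does the person on the left express?",
        "options": ["A. Joy", "B. Anger", "C. Surprise", "D. Sadness"],
        "answer": "C",
    }
    free_form_item = {
        "dimension": "MPR",
        "image": "images/example.jpg",
        "question": "Why might the two people be shaking hands?",
        "answer": "They appear to be greeting each other after a meeting.",
    }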

Get Started in Minutes

Run the benchmark locally or in the cloud with only a few commands.

  1. Clone the repository
    git clone https://github.com/Yuan-Hou/Human-MME.git
  2. Install dependencies
    python -m venv .env
    source .env/bin/activate
    pip install -r requirements.txt
  3. Download benchmark data
    Retrieve Human-MME_data.zip, extract it to the project root, and ensure the provided directory structure is preserved.
  4. Plug in your model
    Implement a new class under mllm_models/ by extending BaseModel, then register it in benchmark.py (a minimal sketch follows these steps).
  5. Benchmark
    python benchmark.py --model_name YourModelName
    Resume an interrupted run with --continuing and compute metrics with --calc_metrics.
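
Step 4 is the only code you write yourself. The sketch below is a minimal, hypothetical adapter: the import path, class name, and answer method are assumptions, so check the actual BaseModel definition under mllm_models/ and mirror its method names.

    # mllm_models/your_model.py  (hypothetical file; align with the real BaseModel API)
    from mllm_models.base_model import BaseModel  # import path is an assumption

    class YourModelName(BaseModel):
        """Adapter that forwards benchmark queries to your MLLM."""

        def __init__(self):
            super().__init__()
            # Load model weights or set up an API client here.

        def answer(self, image_path: str, question: str) -> str:
            # Run your MLLM on the image plus the question text and
            # return a plain-text answer. The method name and signature
            # are assumptions; match them to the actual BaseModel.
            raise NotImplementedError

Once the class is registered in benchmark.py, the commands from step 5 run it unchanged.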
Follow the full tutorial

Submit to the Leaderboard

After evaluation, open a pull request that adds results/result_YourModelName.json. The maintainers verify submissions before publishing them to the official leaderboard.

Leaderboard metrics aggregate the eight capability axes above to give a single score that summarizes human-centric performance.
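
As a rough illustration, assuming an unweighted mean over the eight axes (the official aggregation script may weight or normalize differently), the overall score could be computed like this:

    # Hypothetical aggregation sketch: unweighted mean over the eight axes.
    # Replace the placeholder values with per-dimension scores from
    # results/result_YourModelName.json; the official script may differ.
    dimension_scores = {
        "FU": 0.0, "BU": 0.0, "HU": 0.0, "MIU": 0.0,
        "MPR": 0.0, "ID": 0.0, "CD": 0.0, "ED": 0.0,
    }
    overall = sum(dimension_scores.values()) / len(dimension_scores)
    print(f"Average score: {overall:.1f}")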

Top Open-Source Models

Model            Avg.
GLM-4.5V         76.0
Qwen2.5-VL-72B   72.8
GLM-4.1V-9B      69.1

Resources

Paper

Delve into the benchmark design, annotation process, and evaluation insights.

Read on arXiv

Dataset

Download high-quality images, questions, and labels curated for human-centric reasoning.

View on Hugging Face

Community

Join discussions, share results, and collaborate on advancing human-aware multimodal systems.

Open GitHub Issues