EgoNormia

Can VLMs make normative decisions in physical social interactions?

Input Modality Types:

  • Blind: Models receive only the questions with no visual input
  • Pipeline: Models receive text-only descriptions of the scene (generated by Gemini 1.5 Flash)
  • Video: Models receive both the video (sampled at 1 fps, with the frames concatenated into a single image) and the questions
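
The Video modality's preprocessing (1 fps frames joined into one image) can be illustrated with a small sketch. This is not the benchmark's actual pipeline: the function names, the toy grayscale frame representation (nested lists of pixel values), and the side-by-side layout are all assumptions made for illustration, since the page only says the frames are "concatenated into a single image".

```python
from typing import List, Sequence

Frame = List[List[int]]  # hypothetical toy frame: rows of grayscale pixel values


def sample_indices(num_frames: int, native_fps: float, target_fps: float = 1.0) -> List[int]:
    """Indices of the frames to keep when downsampling a clip to target_fps."""
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Number of source frames between two kept frames (never less than one).
    step = max(native_fps / target_fps, 1.0)
    indices = []
    k = 0
    while int(k * step) < num_frames:
        indices.append(int(k * step))
        k += 1
    return indices


def concat_horizontal(frames: Sequence[Frame]) -> Frame:
    """Join equally tall frames side by side (assumed layout) into one wide image."""
    if not frames:
        return []
    height = len(frames[0])
    if any(len(f) != height for f in frames):
        raise ValueError("all frames must have the same height")
    return [[px for f in frames for px in f[r]] for r in range(height)]


def video_to_single_image(frames: Sequence[Frame], native_fps: float) -> Frame:
    """Keep one frame per second of video, then merge the kept frames into one image."""
    kept = [frames[i] for i in sample_indices(len(frames), native_fps)]
    return concat_horizontal(kept)


# Toy 3-second clip at 30 fps; each 2x2 frame is filled with its own index.
clip = [[[i, i], [i, i]] for i in range(90)]
image = video_to_single_image(clip, native_fps=30)
print(image)  # [[0, 0, 30, 30, 60, 60], [0, 0, 30, 30, 60, 60]]
```

In the toy clip, frames 0, 30, and 60 are kept (one per second) and placed left to right. A real pipeline would decode actual video frames and might tile them in a grid instead.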

Model                Organization     Modality   Both   Act   Jus   Sen
Human                Human            Video      92.4  92.4  92.4  85.1
Gemini 1.5 Pro       Google           Video      45.3  51.9  47.8  61.1
Gemini 2.0 Thinking  Google           Video      42.7  51.7  45.3  57.3
Gemini 1.5 Flash     Google           Video      41.7  46.5  44.3  54.4
o3-mini              OpenAI           Pipeline   41.5  45.7  45.2  65.0
Qwen2.5 VL           Alibaba          Video      41.5  48.3  43.8  62.8
GPT-4o               OpenAI           Video      39.8  45.1  44.8  59.6
Gemini 2.0 Flash     Google           Video      38.9  49.6  41.3  60.0
Gemini 2.0 Thinking  Google           Pipeline   37.5  46.3  42.1  58.8
Deepseek R1          Deepseek         Pipeline   36.5  42.9  40.0  61.0
Claude 3.5 Sonnet    Anthropic        Video      36.0  43.5  41.0  59.3
InternVL 2.5         Shanghai AI Lab  Pipeline   32.7  40.9  38.0  62.5
Gemini 1.5 Pro       Google           Pipeline   30.7  37.3  34.8  64.0
Claude 3.5 Sonnet    Anthropic        Pipeline   23.9  36.7  33.5  61.2
Gemini 1.5 Pro       Google           Blind      21.2  24.6  23.6  54.0
GPT-4o               OpenAI           Pipeline   21.0  23.7  23.5  66.0
GPT-4o               OpenAI           Blind      17.7  19.9  19.9  55.9
Deepseek R1          Deepseek         Blind      16.1  19.4  17.1  27.3
InternVL 2.5         Shanghai AI Lab  Blind      15.3  18.3  17.4  55.4
InternVL 2.5         Shanghai AI Lab  Video      15.1  18.7  17.6  50.7
o3-mini              OpenAI           Blind      15.0  16.8  17.1  51.9
Gemini 1.5 Flash     Google           Pipeline   14.7  17.7  16.7  54.2
Gemini 1.5 Flash     Google           Blind      12.2  15.0  14.1  46.6
Llama 3.2            Meta             Video       2.2  19.9  10.1  54.7