EgoNormia
Can VLMs make normative decisions in physical social interactions?
Input Modality Types:
- Blind: Models receive only the questions with no visual input
- Pipeline: Models receive text-only descriptions of the scene (generated by Gemini 1.5 Flash)
- Video: Models receive the questions together with the video, sampled at 1 fps with the frames concatenated into a single image
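The Video setting can be sketched in a few lines: pick roughly one frame per second, then join the picked frames into one image. This is only an illustrative sketch. The function names, the nested-list frame representation, and the side-by-side (horizontal) layout are assumptions; the page does not say how the frames are arranged in the concatenated image.

```python
def sample_frame_indices(num_frames, native_fps, target_fps=1.0):
    """Return indices of frames to keep so about `target_fps` frames per second remain."""
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    step = native_fps / target_fps            # source frames per kept frame
    duration_s = num_frames / native_fps      # clip length in seconds
    count = max(1, int(duration_s * target_fps))
    return [min(num_frames - 1, int(round(i * step))) for i in range(count)]


def concat_horizontally(frames):
    """Join equally tall frames side by side (assumed layout).

    Each frame is a list of rows; each row is a list of pixel values.
    """
    if not frames:
        return []
    height = len(frames[0])
    if any(len(frame) != height for frame in frames):
        raise ValueError("all frames must have the same height")
    # Row r of the result is row r of every frame, left to right.
    return [[px for frame in frames for px in frame[r]] for r in range(height)]


# Toy 3-second clip at 30 fps: frame i is a 2x2 image filled with the value i.
clip = [[[i, i], [i, i]] for i in range(90)]
kept = sample_frame_indices(len(clip), native_fps=30.0)
strip = concat_horizontally([clip[i] for i in kept])
```

With a real video the frames would come from a decoder and an image library would do the concatenation; the index arithmetic stays the same.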
Model | Modality | Both | Action | Justification | Sensibility |
---|---|---|---|---|---|
Human | Video | 92.4 | 92.4 | 92.4 | 85.1 |
![]() ![]() | Video | 45.3 | 51.9 | 47.8 | 61.1 |
![]() ![]() | Video | 42.7 | 51.7 | 45.3 | 57.3 |
![]() ![]() | Video | 41.7 | 46.5 | 44.3 | 54.4 |
![]() ![]() | Pipeline | 41.5 | 45.7 | 45.2 | 65.0 |
![]() ![]() | Video | 41.5 | 48.3 | 43.8 | 62.8 |
![]() ![]() | Video | 39.8 | 45.1 | 44.8 | 59.6 |
![]() ![]() | Video | 38.9 | 49.6 | 41.3 | 60.0 |
![]() ![]() | Pipeline | 37.5 | 46.3 | 42.1 | 58.8 |
![]() ![]() | Pipeline | 36.5 | 42.9 | 40.0 | 61.0 |
![]() ![]() | Video | 36.0 | 43.5 | 41.0 | 59.3 |
InternVL 2.5 | Pipeline | 32.7 | 40.9 | 38.0 | 62.5 |
![]() ![]() | Pipeline | 30.7 | 37.3 | 34.8 | 64.0 |
![]() ![]() | Pipeline | 23.9 | 36.7 | 33.5 | 61.2 |
![]() ![]() | Blind | 21.2 | 24.6 | 23.6 | 54.0 |
![]() ![]() | Pipeline | 21.0 | 23.7 | 23.5 | 66.0 |
![]() ![]() | Blind | 17.7 | 19.9 | 19.9 | 55.9 |
![]() ![]() | Blind | 16.1 | 19.4 | 17.1 | 27.3 |
InternVL 2.5 | Blind | 15.3 | 18.3 | 17.4 | 55.4 |
InternVL 2.5 | Video | 15.1 | 18.7 | 17.6 | 50.7 |
![]() ![]() | Blind | 15.0 | 16.8 | 17.1 | 51.9 |
![]() ![]() | Pipeline | 14.7 | 17.7 | 16.7 | 54.2 |
![]() ![]() | Blind | 12.2 | 15.0 | 14.1 | 46.6 |
![]() ![]() | Video | 2.2 | 19.9 | 10.1 | 54.7 |
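
One plausible reading of the Both column is that an item counts only when the model gets both the action and its justification right. Under that reading, Both can never exceed either of the two individual scores, which matches every row above. The sketch below shows that computation under this assumption; the record format and the function name are hypothetical, and the sensibility score is left out because this page does not describe how it is computed.

```python
def score(records):
    """Compute percentage accuracies from per-item correctness flags.

    `records` is a list of (action_correct, justification_correct) booleans.
    Assumption: "Both" requires both flags to be True on the same item.
    """
    n = len(records)
    if n == 0:
        return {"both": 0.0, "action": 0.0, "justification": 0.0}
    action = sum(1 for a, _ in records if a)
    justification = sum(1 for _, j in records if j)
    both = sum(1 for a, j in records if a and j)
    return {
        "both": 100.0 * both / n,
        "action": 100.0 * action / n,
        "justification": 100.0 * justification / n,
    }


# Four toy items covering every combination of right and wrong answers.
toy = [(True, True), (True, False), (False, True), (False, False)]
result = score(toy)
```

A large gap between the joint score and the two individual scores, as in the last row, would then mean the correct actions and correct justifications mostly fall on different items.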