Researchers compared tightly matched pairs of large language models (LLMs) and vision-language models (VLMs) in a strictly text-only setting, isolating the effect of multimodal training history on alignment with humans during natural reading. Alignment was evaluated on a human natural-reading dataset that combines whole-cortex fMRI responses with synchronized eye-tracking records of saccades. The findings indicate that multimodal pretraining may not provide a uniform, global advantage in human alignment. Instead, language-internal representations remain a key factor in modeling human text processing. A selective VLM advantage did appear for sentences with stronger visual semantic content, and it was supported by both the fMRI and the eye-movement alignment results. This suggests that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.
Featured on AI Radar: VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading