A Critique of Cross-modal Vector Space Alignment for Capturing Referential Semantics

Sven Eichholtz

Università della Svizzera italiana (USI)

The linguistic outputs and internal representations of Large Language Models (LLMs) are said to lack referential grounding in the entities they putatively refer to. Anders Søgaard recently proposed that progress on the grounding problem can be made by developing an AI system that matches vector representations of words encoded by an LLM to visual vector representations of objects encoded by a computer Vision Model (VM). There is experimental evidence that the vector space of linguistic representations generated by an LLM can be geometrically isomorphic to the vector space of visual representations generated by a VM, such that the two spaces can be aligned by means of a linear mapping. Søgaard argues that an AI system that acquires such a mapping can ground expressions in non-linguistic representations of their referents, thereby obtaining referential semantics. This paper offers a critique of the proposal. I consider the details of Søgaard’s proposal and experiment and offer three objections to the claim that such an AI system could capture referential semantics. First, I argue that the system does not learn reference under the suggested methods for aligning the vector spaces, because these methods introduce what I call semantic contamination into the system. Second, I argue that a linear mapping between vector spaces does not show that the system’s vector representations are referentially grounded, because geometric isomorphism between vector spaces is a structural resemblance relation, which is neither necessary nor sufficient for fixing representation or reference. Third, I argue that the system’s vector representations cannot stand in the right historical-normative relations to their putative referents, relations that are commonly held to be necessary for fixing representation. I conclude that developing an AI system that integrates cross-modal vector space alignment is a dead end for making progress on the grounding problem and for finding a way for machines to grasp referential semantics.

Chair: Szymon Sapalski

Time: September 6th, 15:20-15:50

Location: SR 1.006