Information-theoretic and machine learning methods for semantic categorization


Of the incredible amount of structure in the world, a small subset is privileged enough to be carved into semantic categories, i.e., partitions that are encoded and transmitted via language. What makes a useful partition? Why do different cultures sometimes partition the world differently? Characterizing the constrained variation and universals in human semantic categorization remains a foundational question for cognitive science. One prominent approach argues that semantic categories across languages are shaped by pressure to communicate efficiently, typically captured by a tradeoff between cognitive economy and communicative accuracy. This idea has recently been formulated and tested using tools from machine learning and information theory, in particular the Information Bottleneck (IB) principle (Tishby et al., 1999; Zaslavsky et al., 2018). The IB framework for semantic categorization makes formal predictions about the fine-grained structure and evolution of semantic categories, and has been directly connected to the coordination of grounded semantic categories in multi-agent reinforcement learning settings. Here, we review the framework and its empirical evidence across languages and semantic domains, and contextualize it with respect to alternative formulations of efficient communication. We focus on how this theoretical framework may address important questions including: Are attested languages shaped by pressures toward optimal communication? To what extent can this principle predict the constrained semantic variation observed across languages? Which assumptions about communication and cognition provide a better explanation of typological data? We demonstrate the application of this framework with a concrete example from the color domain.
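The economy-accuracy tradeoff mentioned above can be sketched numerically. In IB terms, a naming system q(w|m) is scored by its complexity I(M;W) (the cognitive economy term) and its accuracy I(W;U) (how much the word reveals about the referent), with efficient systems minimizing I(M;W) - β·I(W;U). The toy example below, with entirely hypothetical distributions over four meanings and two words, is a minimal illustration of computing these two quantities; it is not the fitting procedure used in the empirical work reviewed here.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution p(x, y) given as a 2D array."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # skip zero cells: 0 * log(0/...) contributes 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Hypothetical setup: 4 meanings with a need distribution p(m),
# and a deterministic 2-word naming system q(w | m).
p_m = np.array([0.4, 0.3, 0.2, 0.1])
q_w_given_m = np.array([[1.0, 0.0],   # meanings 0-1 -> word A
                        [1.0, 0.0],
                        [0.0, 1.0],   # meanings 2-3 -> word B
                        [0.0, 1.0]])

# Complexity I(M;W): how much the word encodes about the speaker's meaning.
joint_mw = p_m[:, None] * q_w_given_m          # p(m, w)
complexity = mutual_information(joint_mw)

# Hypothetical noisy mapping p(u | m) from meanings to referents,
# with some noise crossing the word-category boundary.
p_u_given_m = np.array([[0.8, 0.2, 0.0, 0.0],
                        [0.2, 0.7, 0.1, 0.0],
                        [0.0, 0.1, 0.7, 0.2],
                        [0.0, 0.0, 0.2, 0.8]])

# Accuracy I(W;U): p(w, u) = sum_m p(m) q(w|m) p(u|m).
joint_wu = q_w_given_m.T @ (p_m[:, None] * p_u_given_m)
accuracy = mutual_information(joint_wu)

# IB-style objective (to be minimized), with a tradeoff parameter beta >= 1.
beta = 1.1
ib_objective = complexity - beta * accuracy
```

Because the encoder lumps meanings into two categories and the meaning-to-referent channel is noisy, accuracy comes out strictly below complexity here: some information about the referent is lost. Sweeping beta traces out the tradeoff curve against which attested naming systems are compared in the IB framework.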

The Oxford Handbook of Approaches to Language Evolution