Authors:
(1) Raphaël Millière, Department of Philosophy, Macquarie University ([email protected]);
(2) Cameron Buckner, Department of Philosophy, University of Houston ([email protected]).
Table of Links
Abstract and 1 Introduction
2. A primer on LLMs
2.1. Historical foundations
2.2. Transformer-based LLMs
3. Interface with classic philosophical issues
3.1. Compositionality
3.2. Nativism and language acquisition
3.3. Language understanding and grounding
3.4. World models
3.5. Transmission of cultural knowledge and linguistic scaffolding
4. Conclusion, Glossary, and References
Abstract
Large language models like GPT-4 have achieved remarkable proficiency in a broad spectrum of language-based tasks, some of which are traditionally associated with hallmarks of human intelligence. This has prompted ongoing disagreements about the extent to which we can meaningfully ascribe any kind of linguistic or cognitive competence to language models. Such questions have deep philosophical roots, echoing longstanding debates about the status of artificial neural networks as cognitive models. This article–the first part of two companion papers–serves both as a primer on language models for philosophers, and as an opinionated survey of their significance in relation to classic debates in the philosophy of cognitive science, artificial intelligence, and linguistics. We cover topics such as compositionality, language acquisition, semantic competence, grounding, world models, and the transmission of cultural knowledge. We argue that the success of language models challenges several long-held assumptions about artificial neural networks. However, we also highlight the need for further empirical investigation to better understand their internal mechanisms. This sets the stage for the companion paper (Part II), which turns to novel empirical methods for probing the inner workings of language models, and new philosophical questions prompted by their latest developments.
1. Introduction
Deep learning has catalyzed a significant shift in artificial intelligence over the past decade, leading up to the development of Large Language Models (LLMs). The reported achievements of LLMs, often heralded for their ability to perform a wide array of language-based tasks with unprecedented proficiency, have captured the attention of both the academic community and the public at large. State-of-the-art LLMs like GPT-4 are even claimed to exhibit “sparks of general intelligence” (Bubeck et al. 2023). They can produce essays and dialogue responses that often surpass the quality of an average undergraduate student’s work (Herbold et al. 2023); they achieve better scores than most humans on a variety of AP tests for college credit and rank in the 80-99th percentile on graduate admissions tests like the GRE or LSAT (OpenAI 2023a); their programming proficiency “favorably compares to the average software engineer’s ability” (Bubeck et al. 2023, Savelka, Agarwal, An, Bogart & Sakr 2023); they can solve many difficult mathematical problems (Zhou et al. 2023)–even phrasing their solution in the form of a Shakespearean sonnet, if prompted to do so. LLMs also form the backbone of multimodal systems that can answer advanced questions about visual inputs (OpenAI 2023b) or generate images that satisfy complex compositional relations based on linguistic descriptions (Betker et al. 2023).[1] While the released version of GPT-4 was intentionally hobbled to be unable to perfectly imitate humans–to mitigate plagiarism, deceit, and unsafe behavior–it nevertheless still managed to produce responses that were indistinguishable from those written by humans at least 30% of the time when assessed on a two-person version of the Turing test for intelligence (Jones & Bergen 2023). This rate exceeds the threshold established by Turing himself for the test: that computer programs in the 21st century should imitate humans so convincingly that an average interrogator would have less than a 70% chance of identifying them as non-human after five minutes of questioning (Turing 1950).
To philosophers who have been thinking about artificial intelligence for many years, GPT-4 can seem like a thought experiment come to life–albeit one that calls into question the link between intelligence and behavior. As early as 1981, Ned Block imagined a hypothetical system–today commonly called “Blockhead”–that exhibited behaviors indistinguishable from an adult human’s, yet was not considered intelligent.[2] Block’s challenge focused on the way in which the system produced its responses to inputs. In particular, Blockhead’s responses were imagined to have been explicitly preprogrammed by a “very large and clever team [of human researchers] working for a very long time, with a very large grant and a lot of mechanical help,” to devise optimal answers to any potential question the judge might ask (Block 1981, p. 20). In other words, Blockhead answers questions not by understanding the inputs and processing them flexibly and efficiently, but rather by simply retrieving and regurgitating the answers from its gargantuan memory, like a lookup operation in a hash table. The consensus among philosophers is that such a system would not qualify as intelligent. In fact, many classes in the philosophy of artificial intelligence begin with the position Block and others called “psychologism:” intelligence does not merely depend on the observable behavioral dispositions of a system, but also on the nature and complexity of internal information processing mechanisms that drive these behavioral dispositions.
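To make the contrast vivid, consider the following toy sketch of Blockhead-style question answering (the table entries are invented placeholders, not drawn from Block's paper): every response is retrieved from a pre-stored table keyed on the exact input, with no processing of the question's structure or meaning, and any unanticipated phrasing falls outside the table entirely.

```python
# A made-up, minimal illustration of answering by pure lookup. Nothing here is
# computed from the content of the question; the system only succeeds when the
# exact string was anticipated in advance.

blockhead_table = {
    "What is the capital of France?": "Paris.",
    "Can you write a sonnet about prime numbers?": "Shall I compare thee to a composite? ...",
}

def blockhead_answer(question: str) -> str:
    """Return a preprogrammed answer if this exact question was anticipated."""
    return blockhead_table.get(question, "No stored answer for this question.")

print(blockhead_answer("What is the capital of France?"))   # retrieves the stored string
print(blockhead_answer("What's the capital of France?"))    # any unanticipated phrasing falls through
```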
In fact, many of GPT-4’s feats may be produced by a similarly inefficient and inflexible memory retrieval operation. GPT-4’s training set likely encompasses trillions of tokens in millions of textual documents, a significant subset of the whole internet.[3] Its training set includes dialogues generated by hundreds of millions of individual humans and hundreds of thousands of academic publications covering potential question-answer pairs. Empirical studies have discovered that the many-layered architecture of DNNs grants them an astounding capacity to memorize their training data, which can allow them to retrieve the right answers to millions of randomly-labeled data points in artificially-constructed datasets where we know a priori there are no abstract principles governing the correct answers (Zhang et al. 2021). This suggests that GPT-4’s responses could be generated by approximately–and, in some cases, exactly–reproducing samples from its training data.[4] If this were all they could do, LLMs like GPT-4 would simply be Blockheads come to life. Compare this to a human student who had found a test’s answer key on the Internet and reproduced its answers without any deeper understanding; such regurgitation would not be good evidence that the student was intelligent. For these reasons, “data contamination”–when the training set contains the very question on which the LLM’s abilities are assessed–is considered a serious concern in any report of an LLM’s performance, and many think it must be ruled out by default when comparing human and LLM performance (Aiyappa et al. 2023). Moreover, GPT-4’s pre-training and fine-tuning require an investment in computation on a scale available only to well-funded corporations and national governments–a process which begins to look quite inefficient when compared to the data and energy consumed by the squishy, 20-watt engine between our ears before it generates similarly sophisticated output.
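As a rough illustration of how contamination is sometimes screened for in practice, the sketch below implements one simple heuristic: flagging an evaluation item if any of its long word n-grams appears verbatim in the training corpus. The corpus and test item are invented placeholders, and real contamination audits used in LLM evaluations are considerably more sophisticated than this.

```python
# A minimal contamination heuristic: verbatim n-gram overlap between an
# evaluation item and a training corpus. Invented toy data for illustration.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of all n-word sequences in a lowercased text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(test_item: str, training_corpus: str, n: int = 8) -> bool:
    """Flag the test item if it shares any verbatim n-gram with the corpus."""
    return bool(ngrams(test_item, n) & ngrams(training_corpus, n))

training_corpus = "... the quick brown fox jumps over the lazy dog near the old stone bridge ..."
test_item = "The quick brown fox jumps over the lazy dog near the old stone bridge."
print(looks_contaminated(test_item, training_corpus))  # True: verbatim overlap detected
```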
In this opinionated review paper, we argue that LLMs are more than mere Blockheads; but this skeptical interpretation of LLMs serves as a useful foil to develop a subtler view. While LLMs can simply regurgitate large sections of their prompts or training sets, they are also capable of flexibly blending patterns from their training data to produce genuinely novel outputs. Many empiricist philosophers have defended the idea that sufficiently flexible copying of abstract patterns from previous experience could form the basis not only of intelligence, but of full-blown creativity and rational decision-making (Baier 2002, Hume 1978, Buckner 2023); and much scientific research has emphasized that the kind of flexible generalization that can be achieved by interpolating vectors in the semantic spaces acquired by these models may explain why these systems often appear more efficient, resilient, and capable than systems based on rules and symbols (Smolensky 1988, Smolensky et al. 2022a). A useful framework for exploring the philosophical significance of such LLMs, then, might be to treat the worry that they are merely unintelligent, inefficient Blockheads as a null hypothesis, and survey the empirical evidence that can be mustered to refute it.[5]
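The following toy example gestures at what interpolating vectors in a semantic space amounts to computationally: a linear blend of two stored representations yields a genuinely new point lying between them, rather than an exact copy of either. The three-dimensional vectors are invented for illustration; real model embeddings typically have hundreds or thousands of dimensions and are learned from data rather than written by hand.

```python
import numpy as np

# Toy illustration of interpolation in a learned vector space (made-up vectors).
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.7]),
}

def interpolate(a: np.ndarray, b: np.ndarray, alpha: float) -> np.ndarray:
    """Linearly interpolate between two points in the embedding space."""
    return (1.0 - alpha) * a + alpha * b

novel_point = interpolate(embeddings["king"], embeddings["queen"], 0.5)
print(novel_point)  # [0.9 0.5 0.4] -- a new point between the two stored vectors
```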
We adopt that approach here, and use it to provide a brief introduction to the architecture, achievements, and philosophical questions surrounding state-of-the-art LLMs such as GPT-4. There has, in our opinion, never been a more important time for philosophers from a variety of backgrounds–but especially philosophy of mind, philosophy of language, epistemology, and philosophy of science–to engage with foundational questions about artificial intelligence. Here, we aim to provide a wide range of those philosophers (and philosophically-inclined researchers from other disciplines) with an opinionated survey that can help them to overcome the barriers imposed by the technical complexity of these systems and the ludicrous pace of recent research achievements.
[1] GPT-4V (OpenAI 2023b) is a single multimodal model that can take both text and images as input; by contrast, DALL-E 3 (Betker et al. 2023) is a distinct text-to-image model that can be seamlessly prompted by GPT-4–an example of model ensembling using natural language as a universal interface (Zeng et al. 2022). While officially available information about GPT-4, GPT-4V and DALL-E 3 is scarce, they are widely believed to use a Transformer architecture as backbone to encode linguistic information, like similar multimodal models and virtually all LLMs (Brown et al. 2020, Touvron et al. 2023, Ramesh et al. 2022, Alayrac et al. 2022).
[2] Key technical terms in this paper are highlighted in red and defined in the glossary. In the electronic version, these terms are interactive and link directly to their respective glossary entries.
[3] While details about the training data of GPT-4 are not publicly available, we can turn to other LLMs for clues. For example, PaLM 2 has 340 billion parameters and was trained on 3.6 trillion tokens (Anil et al. 2023), while the largest version of Llama 2 has 70 billion parameters and was trained on 2 trillion tokens (Touvron et al. 2023). GPT-4 is rumored to have well over a trillion parameters (Karhade 2023).
[4] This concern is highlighted by lawsuits against OpenAI, notably from the New York Times (Grynbaum & Mac 2023). These cases document instances where LLMs like GPT-4 have been shown to reproduce substantial portions of copyrighted text verbatim, raising questions about the originality of their outputs.
[5] Such a method of taking a deflationary explanation for data as a null hypothesis and attempting to refute it with empirical evidence has been a mainstay of comparative psychology for more than a century, in the form of Morgan’s Canon (Buckner 2017, Sober 1998). As DNN-based systems approach the complexity of an animal brain, it may be useful to take lessons from comparative psychology in arbitrating fair comparisons to human intelligence (Buckner 2021). In comparative psychology, standard deflationary explanations for data include reflexes, innate-releasing mechanisms, and simple operant conditioning. Here, we suggest that simple deflationary explanations for an AI-inspired version of Morgan’s Canon include Blockhead-style memory lookup.