Did you know we're all mind readers? Humans develop what's called "theory of mind" (ToM) around the age of four, enabling us to hold views and beliefs about our own mental states and the mental states of others. It's one of those things that, well, makes us human. However, a recent paper by Strachan et al. explores this capability in the context of AI, and its findings have some interesting implications. The authors compared the performance of a series of LLMs (GPT-3.5, GPT-4, and LLaMA2) against around 1,900 humans across a battery of tests designed to probe ToM. They found that GPT-4 performed at or above human level in every task bar one (designed to test recognition of faux pas). My own cursory testing suggests that GPT-4o may even be able to demonstrate ToM across the board. While we can't conclude that these models actually have ToM (in a cognitive sense), they certainly seem to behave as if they do. So what does this mean in practice? These LLMs open the door to exciting potential applications across areas such as mental health and decision support. Ultimately, the paper's conclusions give us a moment to stop and think about the state of our technology: to be impressed by what we've accomplished, and to acknowledge the risks that should be considered as we move forward.
Maybe not quite in the telepathic sense (and certainly not with 100% accuracy); however, we humans possess what some psychologists call theory of mind (ToM) – the cognitive capacity to represent (i.e., hold views, beliefs, etc., about) our own mental states and the mental states of others. It's the closest to mind-reading we can get, and it is fundamental to how we navigate social life.
The term “theory of mind” was coined in the 1970s by Premack and Woodruff [1], and is closely related to concepts such as theory-theory, folk psychology, and intentionality. It can be demonstrated simply by an experiment typically given to young children: the Sally-Anne task [2]. See if you can follow along.
Sally walks into a room, holding a basket, which she sets down. Already in the room is Anne, who has a box. Sally places a chocolate in her basket and then (for some reason – let’s say she hears an ice-cream truck outside) leaves the room. Anne, perhaps much preferring warm chocolate to brain-freeze, takes the chocolate from the basket and hides it in her box. Sally returns to the room.
Where will Sally look for the chocolate?
The answer seems pretty obvious, but it isn't all that simple. While most neurotypical four-year-olds do tend to get the answer right (the basket!), three-year-olds typically think that Sally will look in the box – which is, of course, where they know the chocolate is. The difference (and this generally develops around the age of four) is an understanding that others can hold false beliefs about things. That their mental states can be different from our own.
We can even infer layers upon layers of mental states. For example, I might be able to think about what you think that I think about you, and appreciate that it doesn’t necessarily add up with what I actually think about you. Rest assured, if you’re taking the time to read this, I think highly of you indeed!
While ToM has also been explored in certain other animals, a recent paper by Strachan et al. [3] propels the exploration of ToM beyond the biological altogether. In it, the authors examine whether large language models (LLMs) like GPT-4 and LLaMA2 exhibit ToM – a question with significant practical implications.
It’s worth first taking a moment to touch on the methodology and findings of the paper. The authors compared a sample of around 1,900 humans against GPT-3.5, GPT-4, and LLaMA2, putting them through a battery of tasks designed to probe ToM – specifically: (a) the hinting task, which tests whether indirect speech is understood both for its intended meaning and for the action it is trying to provoke; (b) the false belief task, like the Sally-Anne task above; (c) recognition of faux pas; (d) the strange stories task, which probes reasoning about things like manipulation, misdirection and misunderstanding; and (e) an irony comprehension test.
So how do the models stack up?
Note: Out of curiosity, I tested GPT-4o on the faux pas task outlined by the authors and found that it handled the situation with ease. This is just a single data point and hardly conclusive; however, if you want to explore this for yourself, you can try the same and see whether you can replicate my findings.
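If you do want to try it, here is a minimal sketch of the sort of test you could run, using the Sally-Anne-style false belief vignette from earlier rather than the authors’ exact faux pas stimuli. It assumes the OpenAI Python client and an API key; the vignette wording and the model name are my own illustrative choices, not the paper’s materials.

```python
# A minimal sketch (not the authors' exact stimuli): posing a Sally-Anne-style
# false belief vignette to GPT-4o via the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

VIGNETTE = (
    "Sally puts a chocolate in her basket and leaves the room. "
    "While she is away, Anne moves the chocolate from the basket into her box. "
    "Sally returns. Where will Sally look for the chocolate, and why?"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer in one or two sentences."},
        {"role": "user", "content": VIGNETTE},
    ],
)

# A ToM-consistent answer should point to the basket, citing Sally's (false) belief.
print(response.choices[0].message.content)
```

If the model reliably answers “the basket” and explains that Sally doesn’t know the chocolate was moved, it is passing the same kind of first-order false belief test that three-year-olds tend to fail.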
Notwithstanding the gap that LLaMA2 has to cover, it is clear that LLMs can meet or exceed human performance on standard tests of ToM. At the rate at which LLMs are developing, I expect it won’t be long before other models start to match GPT-4’s performance. GPT-4o, at least from my cursory prompting, already seems to demonstrate ToM according to the standards set in this paper. Our latest LLMs may, in fact, be mind-readers in their own right.
One might protest that we’re getting ahead of ourselves here. After all, isn’t “mind reading” what recommendation algorithms already do? Take, for example, the predictive drafting algorithm that conveniently suggests the most likely next few words of the email you’ve just started writing. By building up a model of our behaviour, it can extrapolate and often accurately guess what comes next – surely on par with, if not better than, what a human might manage.
The difference is that ToM within the context of an LLM (and ultimately within the more generalised algorithms we will invariably see as AI agents develop) is markedly more domain-general. A narrow model built to predict the next few words of text does exactly one thing well; an AI that displays ToM across a variety of test tasks is demonstrating something far broader, and that makes it a much more significant achievement.
With that said, we should be wary of how easy it is to slip into the bias of anthropomorphism: of giving AI human qualities where they don’t actually exist. One method for addressing this bias is to apply Morgan's Canon, a rule of parsimony from comparative psychology. If a behaviour can be accounted for by a process simpler than human-like cognition or intelligence, then that simpler explanation is the one to favour. Indeed, the authors of the paper themselves state that “[w]hile LLMs are designed to emulate human-like responses, this does not mean that this analogy extends to the underlying cognition giving rise to those responses.” This is worth keeping in mind; however, for the purposes of this article, it is not the cognition that matters but the behaviour, and it is on that behavioural foundation that we can build novel technologies.
If, as the paper suggests, LLMs can at least emulate ToM, this opens the door to a number of interesting applications. These opportunities stem from richer interaction between humans and AI; some may be nearer-term, whilst others will depend on further development and the integration of various systems. I touch on two of them below.
Researchers have been thinking about practical uses of AI in healthcare since at least the early 1970s, with Stanford’s development of MYCIN – a simple inference engine that was designed (but never rolled out) to identify bacteria and recommend antibiotics [4]. Today, the most obvious use-cases continue to leverage algorithmic classification capacities: one of the most widely written-about examples of anticipated disruption – if you’ve looked at the news in the past five years you’ll have read about this at least once – has been the question of whether AI will one day replace radiologists in the world of diagnostic imaging [5].
AI that exhibits ToM, however, opens up an entirely new world of applications in areas requiring greater patient engagement. One of the most evident examples of this is in the context of mental health; specifically, through the deployment of tailor-made LLMs and LLM-powered companion AIs.
I’ve previously written on companionship AIs such as Replika, which are already on the market. These fascinating implementations of LLMs can convincingly replicate the colloquial language and texting style of humans. They talk and write like everyday versions of us. Replika, which uses models that appear to demonstrate rudimentary ToM (based on the minuscule sample size of me running a couple of the false belief and faux pas tasks), has already shown positive impacts in mitigating loneliness and suicide risk in early studies [6]. Building on this, more deliberate applications of such LLMs could, through continued interaction, detect and model patterns of behaviour and act as custom, accessible first ports of call for any necessary intervention. To the extent that they can model ToM, they could interpret what lies beneath the surface of such conversation and behaviour, enabling more accurate and useful analysis. They could then, for example, support better choice architecture and be trained to give the right nudges to users, both within the context of certain therapies and more generally.
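To give a flavour of what that below-the-surface reading might look like, here is a toy sketch, entirely of my own construction, that asks a model to separate what a user says from what they may actually feel. The transcript, the JSON keys and the model choice are illustrative assumptions, not any existing product’s pipeline.

```python
# A toy sketch (my own illustration, not any product's actual pipeline): asking an
# LLM to contrast a user's stated position with their likely underlying state.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

TRANSCRIPT = (
    "User: I told my sister I was fine with her skipping my birthday.\n"
    "User: Honestly, though, I've barely left the flat since."
)

PROMPT = (
    "Read the conversation below. Respond in JSON with the keys "
    "'stated_position', 'likely_underlying_state', and 'follow_up_suggested' "
    "(true/false), describing the gap between what the user says and what "
    "they may actually feel.\n\n" + TRANSCRIPT
)

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # ask for machine-readable output
    messages=[{"role": "user", "content": PROMPT}],
)

print(json.dumps(json.loads(response.choices[0].message.content), indent=2))
```

Requesting JSON keeps the interpretation machine-readable, so it could feed into whatever escalation or nudging logic sits around the conversation.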
A second universe of use cases, of particular interest to those using AI in commercial and legal spaces, is deployment in the context of decision support. While most, if not all, algorithms are inherently decision-making tools, LLMs that are able to demonstrate ToM may offer significant benefits when used as decision-making and negotiation support tools in a fuller sense. Studies have started to explore how LLMs negotiate against humans [7] and amongst themselves [8]. While we appear to be racing towards a future where AI agents might negotiate (just as we do) with us, against us, or against one another in lieu of us, an intermediary step that ToM-demonstrating LLMs could enable is negotiation strategy development.
One of the most fundamental elements of any negotiation is information. It is incredibly rare for a negotiation to take place in a world of perfect information and, in reality, there tend to be incentives on every side of the table to withhold information in order to maintain a bargaining advantage. Much of a negotiation is therefore about what is left unsaid, what is deduced (calling someone’s bluff, for example) and the impact of intangibles such as a party's underlying biases or proclivities. LLMs that demonstrate ToM should be better able to model the outcomes of actions within such a negotiation, allowing humans to offload some strategic planning decisions. For those without deep pockets, there are benefits to having this sort of technology at hand as an alternative to paying an expensive law firm for the same work. The first port of call might therefore be a negotiation advisor LLM fully apprised of the details of the negotiation at hand, fed as much information as is available online and offline about the personalities, contracts, etc., that are responsible for and relevant to the process. These LLMs would not only be able to process large quantities of relevant data; they might also consider the nuances of how various approaches may affect the individuals across the bargaining table.

Similar benefits could arise in less contentious decision-making spaces. Think back to the Sally-Anne task and the importance of holding in mind what someone else knows about the world, independently of what you know. In any decision-making process, AI that demonstrates ToM could anticipate and describe the perspectives of different stakeholders, aiding a more balanced and better-informed process.
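To make this concrete, here is a rough sketch of how such a negotiation advisor might be framed with an off-the-shelf LLM today. The counterparty profile, the options and the model name are placeholders I’ve invented for illustration; a real deployment would plainly demand far more careful handling of confidential information.

```python
# A rough, illustrative sketch of a "negotiation advisor" prompt. Every detail
# below (the counterparty profile, the options) is an invented placeholder.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

ADVISOR_ROLE = (
    "You are advising one side of a commercial negotiation. For each option "
    "listed, explain (1) how the counterparty is likely to interpret it, given "
    "what they know and do not know, and (2) what it might reveal about our "
    "own position that we would rather keep unsaid."
)

BRIEF = (
    "Counterparty: a regional distributor, risk-averse, publicly burned by a "
    "supplier's late deliveries last year. They do not know our factory is "
    "running at 80% capacity.\n\n"
    "Options under consideration:\n"
    "A. Open with a 10% volume discount in exchange for a two-year exclusivity term.\n"
    "B. Offer delivery-time guarantees with penalty clauses instead of a discount."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": ADVISOR_ROLE},
        {"role": "user", "content": BRIEF},
    ],
)
print(response.choices[0].message.content)
```

The specific answer matters less than the framing: the model is being asked to reason explicitly about what the other side believes, knows and doesn’t know, which is exactly the behaviour the ToM tests are designed to measure.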
These two examples barely scratch the surface of what might be possible with LLMs that demonstrate ToM. Even so, they aren’t without their own embedded risks. More ‘human’-like AI might foster dependence and vulnerability. Without appropriate controls, users could place undue trust in models that have no real ‘code of conduct’ or clear ethic. Furthermore, AI capable of understanding and predicting human emotions and intentions could be used to manipulate individuals' decisions and behaviours subtly and effectively, posing risks to personal autonomy and freedom. The nudge could be in the wrong direction. While these concerns shouldn't have us slamming on the brakes, we should nevertheless be cautious about, say, the deployment of companion AI developed by privately-owned, for-profit companies.
Another persistent risk is bias. If an LLM deployed in a healthcare context has been trained on data that contains implicit biases against certain racial or socioeconomic groups, it could misinterpret the symptoms or pain levels articulated by these patients, leading to suboptimal and unfair treatment recommendations. This could add strain to the system rather than alleviate it. Bias might also rear its ugly head in the context of decision-making support. It may be baked into the underlying algorithms and propagate through these applications: skewed assumptions about a person’s views and likely behaviour would make any such tool less reliable. COMPAS, an algorithm designed to assist judges and probation officers in assessing recidivism (i.e., the likelihood that someone who has been convicted of a crime will re-offend), offers a well-cited cautionary tale. Researchers found that while white defendants were predicted to be less risky than they actually were, black defendants were twice as likely as their white counterparts to be misclassified as being at higher risk of violent recidivism [9].
This is not to say that we should not explore these spaces; it pays to be pragmatic. These tools will be used for such purposes regardless, so they ought to be deployed with the right guardrails and training in place to lead to the best outcomes.
Whether in healthcare, law, education, or elsewhere, models displaying ToM clearly have the power to significantly enhance our lives and even change our societies. What’s more, the technology itself is here and now. Whether or not we say an LLM has ToM has no impact on its actual capabilities. Rather, these conclusions enable us to stop and think about the state of our technology: to be impressed by what we’ve accomplished, and to acknowledge the risks that should be considered as we move forward. These developments are noteworthy because of how they can help us re-conceptualise, or imagine new ways of interacting with, the technology at our disposal.
And these LLMs are likely just the starting point. If neural nets with ToM can be integrated into embodied robots (or, more generally, agent systems with perceptual abilities), they may be able to judge more reliably the social impact of the actions they take in the physical world. Combined with decision-making capabilities, they could be empowered to act preemptively and might, sooner than we realize, begin to behave as humans do.
[1] Premack, D., & Woodruff, G. (1978). Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4), 515–526. doi:10.1017/S0140525X00076512
[2] Nebreda, A., Shpakivska-Bilan, D., Camara, C., & Susi, G. (n.d.). The Social Machine: Artificial Intelligence (AI) Approaches to Theory of Mind. In The Theory of Mind Under Scrutiny (pp. 681–722). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-46742-4_22
[3] Strachan, J. W. A., Albergo, D., Borghini, G., et al. (2024). Testing theory of mind in large language models and humans. Nature Human Behaviour. https://doi.org/10.1038/s41562-024-01882-z
[4] Davenport, T., & Kalakota, R. (2019). The potential for artificial intelligence in healthcare. Future Healthcare Journal, 6(2), 94–98. https://doi.org/10.7861/futurehosp.6-2-94
[5] Cacciamani, G. E., Sanford, D. I., Chu, T. N., Kaneko, M., De Castro Abreu, A. L., Duddalwar, V., & Gill, I. S. (2023). Is artificial intelligence replacing our radiology stars? Not yet! European Urology Open Science, 48, 14–16. https://doi.org/10.1016/j.euros.2022.09.024
[6] Maples, B., Cerit, M., Vishwanath, A., & Pea, R. (2024). Loneliness and suicide mitigation for students using GPT3-enabled chatbots. npj Mental Health Research, 3(1). https://doi.org/10.1038/s44184-023-00047-6
[7] Schneider, J., Haag, S., & Kruse, L. C. (2023). Negotiating with LLMS: Prompt Hacks, Skill Gaps, and Reasoning Deficits. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2312.03720
[8] Bianchi, F., Chia, P. J., Yuksekgonul, M., Tagliabue, J., Jurafsky, D., & Zou, J. (2024). How well can LLMs negotiate? NegotiationArena Platform and analysis. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2402.05863
[9] Larson, J., Mattu, S., Kirchner, L., & Angwin, J. (2016, May 23). How we analyzed the COMPAS recidivism algorithm. ProPublica. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm