Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

Matton, Katie; Ness, Robert Osazuwa; Guttag, John; Kıcıman, Emre

arXiv:2504.14150 (cs)

[Submitted on 19 Apr 2025]

Abstract: Large language models (LLMs) are capable of generating plausible explanations of how they arrived at an answer to a question. However, these explanations can misrepresent the model's "reasoning" process, i.e., they can be unfaithful. This, in turn, can lead to over-trust and misuse. We introduce a new approach for measuring the faithfulness of LLM explanations. First, we provide a rigorous definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level concepts in the input question that purportedly influenced the model. We define faithfulness in terms of the difference between the set of concepts that LLM explanations imply are influential and the set that truly are. Second, we present a novel method for estimating faithfulness that is based on: (1) using an auxiliary LLM to modify the values of concepts within model inputs to create realistic counterfactuals, and (2) using a Bayesian hierarchical model to quantify the causal effects of concepts at both the example- and dataset-level. Our experiments show that our method can be used to quantify and discover interpretable patterns of unfaithfulness. On a social bias task, we uncover cases where LLM explanations hide the influence of social bias. On a medical question answering task, we uncover cases where LLM explanations provide misleading claims about which pieces of evidence influenced the model's decisions.
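
To make the set-based definition in the abstract concrete, the following is a minimal sketch in Python. It is not the authors' estimator: the paper uses LLM-generated counterfactuals and a Bayesian hierarchical model, whereas this sketch assumes per-concept causal effects have already been estimated somehow. The concept names, effect values, the `threshold` cutoff, and the Jaccard-style `agreement` score are all hypothetical choices made only for illustration.

```python
from typing import Dict, Set


def influential_concepts(effects: Dict[str, float], threshold: float) -> Set[str]:
    """Return concepts whose estimated causal effect magnitude meets the threshold."""
    return {concept for concept, effect in effects.items() if abs(effect) >= threshold}


def faithfulness_gap(
    implied: Set[str], effects: Dict[str, float], threshold: float = 0.1
) -> dict:
    """Compare concepts an explanation implies are influential with those that are.

    `hidden` holds concepts that affect the answer but are absent from the
    explanation; `overstated` holds concepts the explanation cites that have
    little measured effect. `agreement` is a Jaccard overlap (an assumption of
    this sketch, not a quantity defined in the abstract).
    """
    actual = influential_concepts(effects, threshold)
    union = actual | implied
    agreement = 1.0 if not union else len(actual & implied) / len(union)
    return {
        "hidden": actual - implied,
        "overstated": implied - actual,
        "agreement": agreement,
    }


# Toy, made-up numbers: effect of changing each concept on the model's answer.
effects = {"gender": 0.35, "symptom_duration": 0.02, "lab_result": 0.40}
# Concepts the model's explanation claims it relied on (also made up).
implied = {"symptom_duration", "lab_result"}

result = faithfulness_gap(implied, effects)
print(result)
```

In this toy case the explanation omits `gender` even though it has a large effect, which mirrors the kind of hidden-bias pattern the abstract reports, and it cites `symptom_duration` even though its measured effect is negligible.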

Comments:	61 pages, 14 figures, 36 tables
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2504.14150 [cs.CL]
	(or arXiv:2504.14150v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2504.14150

Submission history

From: Katie Matton
[v1] Sat, 19 Apr 2025 02:51:20 UTC (4,981 KB)
