BRIDGING THE PERFORMANCE-INTERPRETABILITY GAP IN LARGE LANGUAGE MODELS: SPARSE AUTOENCODER DISENTANGLEMENT WITH PERFORMANCE ANCHORING
Abstract
The high performance of modern transformer-based language models has come at the expense of transparency. As these systems are deployed in increasingly sensitive domains, such as legal document analysis and medical decision support, the inability to understand why they reach a given output is a significant practical and ethical drawback. Classical post hoc explanation methods often fail to deliver faithful explanations, while efforts to build intrinsic interpretability into models typically degrade task performance. This article presents a framework that challenges this perceived trade-off. We introduce AnchorX, a method that trains sparse autoencoders (SAEs) on the intermediate activations of encoder and decoder transformer models, combined with a performance-anchoring regularizer that directly penalizes divergence from the original model's logits during feature learning. Rather than treating interpretability as a post hoc process, the algorithm learns a dictionary of monosemantic features while preserving the model's behavior through a composite loss that balances reconstruction fidelity, sparsity, and task anchoring. Across seven NLP benchmarks, including GLUE and SuperGLUE subsets as well as domain-specific biomedical and legal text, AnchorX-equipped models retain 98.7% of baseline performance on average and improve faithfulness metrics by 31-47% over strong baselines such as Integrated Gradients, SHAP, and attention rollout. Human subject studies further confirm the improved plausibility and usefulness of the derived explanations. Analysis of the learned dictionaries reveals consistent, human-interpretable concepts ranging from syntactic structures to abstract reasoning patterns, some of which generalize across model sizes. Notably, some sparse features appear to act as regularizers in their own right, and a few yield slight performance improvements on out-of-distribution samples. The work contributes a replicable technical method and a broader point: with appropriate engineering, interpretability need not be purchased at the cost of capability. We discuss limitations concerning computational overhead at frontier scales and the long-standing difficulty of achieving perfectly monosemantic representations. Future directions include online dictionary adaptation and the incorporation of causal intervention methods. This line of work, pursued in our group over several years, suggests that models that are both powerful and understandable are closer than previously thought.
Keywords: Large Language Models (LLMs), Sparse Autoencoders (SAEs), Mechanistic Interpretability, Performance Anchoring, Dictionary Learning, Monosemanticity, Natural Language Processing (NLP).
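To make the composite objective described in the abstract concrete, one plausible form of the AnchorX training loss is sketched below in our own notation (the abstract gives no explicit formula, so the weights $\lambda_{\text{sparse}}$, $\lambda_{\text{anchor}}$ and the choice of KL divergence as the logit-matching term are illustrative assumptions rather than the paper's definitive formulation):

\[
\mathcal{L}_{\text{AnchorX}} \;=\; \underbrace{\lVert \mathbf{h} - \hat{\mathbf{h}} \rVert_2^2}_{\text{reconstruction fidelity}} \;+\; \lambda_{\text{sparse}} \underbrace{\lVert \mathbf{z} \rVert_1}_{\text{sparsity}} \;+\; \lambda_{\text{anchor}} \underbrace{D_{\mathrm{KL}}\!\big( p_{\theta}(\mathbf{y} \mid \mathbf{x}) \,\big\Vert\, \hat{p}_{\theta}(\mathbf{y} \mid \mathbf{x}) \big)}_{\text{performance anchoring}}
\]

where $\mathbf{h}$ is an intermediate activation of the frozen transformer, $\mathbf{z}$ the sparse code produced by the SAE encoder, $\hat{\mathbf{h}}$ the SAE reconstruction, and $\hat{p}_{\theta}$ the output distribution obtained when $\hat{\mathbf{h}}$ is patched back into the forward pass; the final term directly penalizes any mismatch with the original model's logits.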
