Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

Jiaqi Weng, Han Zheng, Hanyu Zhang, Qinqin He, Jialing Tao, Hui Xue, Zhixuan Chu, Xiting Wang

Published: 2025/9/11

Abstract

Increasing deployment of large language models (LLMs) in real-world applications raises significant safety concerns. Most existing safety research focuses on evaluating LLM outputs or specific safety tasks, limiting its ability to address broader, undefined risks. Sparse Autoencoders (SAEs) facilitate interpretability research that clarifies model behavior by explaining single-meaning atomic features decomposed from entangled signals. However, prior applications of SAEs do not interpret features with fine-grained safety-related concepts, and thus inadequately address safety-critical behaviors such as generating toxic responses and violating safety regulations. Rigorous safety analysis requires extracting a rich and diverse set of safety-relevant features that effectively capture these high-risk behaviors, yet doing so faces two challenges: identifying the SAEs with the greatest potential for generating safety concept-specific neurons, and the prohibitively high cost of detailed feature explanation. In this paper, we propose Safe-SAIL, a framework for interpreting SAE features within LLMs to advance mechanistic understanding in safety domains. Our approach systematically identifies the SAE with the best concept-specific interpretability, explains safety-related neurons, and introduces efficient strategies to scale up the interpretation process. We will release a comprehensive toolkit, including SAE checkpoints and human-readable neuron explanations, which supports empirical analysis of safety risks to promote research on LLM safety.
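To make the SAE decomposition mentioned above concrete, the following is a minimal illustrative sketch, not the paper's implementation: a sparse autoencoder that maps a dense LLM activation vector to a wider, mostly-zero feature vector and reconstructs it, trained with an L1 sparsity penalty. All dimensions, hyperparameters, and names below are assumptions for illustration only.

```python
# Minimal sparse autoencoder sketch (illustrative; not the Safe-SAIL implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # dense activation -> overcomplete features
        self.decoder = nn.Linear(d_features, d_model)   # sparse features -> reconstruction

    def forward(self, x: torch.Tensor):
        features = F.relu(self.encoder(x))               # non-negative, mostly-zero feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error keeps features faithful to the original activation;
    # the L1 term drives most features to zero so individual features stay interpretable.
    return F.mse_loss(reconstruction, x) + l1_coeff * features.abs().mean()

# Usage: encode a batch of (placeholder) residual-stream activations and inspect
# which features fire, e.g. when probing for safety-related concepts.
sae = SparseAutoencoder()
acts = torch.randn(4, 768)                               # stand-in for real LLM activations
feats, recon = sae(acts)
print(sae_loss(acts, recon, feats))
```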
