From Human Mind to Artificial Intelligence: Advancing AI Value Alignment Through Psychological Theories*

Jin Shaoxiong1,2,3, Liu Chao**1,2,3

Journal of Psychological Science ›› 2025, Vol. 48 ›› Issue (4): 782-791. DOI: 10.16719/j.cnki.1671-6981.20250402

Computational Modeling and Artificial Intelligence

Abstract

AI value alignment is one of the core problems in AI safety. Existing alignment methods have many shortcomings, and introducing psychological theory can help address the value alignment problem. This paper first reviews the mainstream technical approaches to AI value alignment and summarizes the manifestations and causes of alignment failures. It then analyzes psychological theories of value formation and moral decision-making and draws out the implications these theories hold for AI value alignment. Finally, it examines pathways for applying psychological theories to AI value alignment at four levels: alignment targets, motivational mechanisms, cognitive-ability mechanisms, and the evolution of social behavior. The paper argues for embedding psychological mechanisms into AI architectures in order to build intelligent systems that are more trustworthy and better aligned with human values.

Abstract

In recent years, the field of artificial intelligence (AI) has witnessed unprecedented growth, characterized by major advancements in cognitive intelligence, perceptual processing, and decision-making capabilities. These technological breakthroughs have driven the widespread adoption of AI systems across a wide range of sectors, including healthcare, education, finance, and transportation. As a result, AI has become instrumental in improving operational efficiency, enhancing accuracy, and fostering innovation. There is little doubt that such developments have significantly boosted human productivity and convenience.
However, the increasing sophistication and autonomy of AI technologies have also introduced a variety of societal risks and ethical concerns. Among the most pressing of these are challenges related to AI safety and the alignment of AI behavior with human values. For instance, AI systems have been found to perpetuate bias in recruitment decisions, produce offensive or harmful content during interactions with users, and even pose existential threats in high-stakes domains such as autonomous weapons. These examples reflect growing anxieties about the potential misalignment between AI behavior and the ethical principles upheld by human societies. If left unaddressed, such misalignment could lead to consequences that undermine social trust and moral norms.
In response to these challenges, the concept of AI value alignment has emerged as a central concern within the broader field of AI safety research. AI value alignment refers to the development of AI systems whose goals, behaviors, and decision-making processes are consistent with the values, preferences, and ethical standards of individuals or society as a whole. Technically, several value alignment methodologies have been proposed, including reinforcement learning from human feedback (RLHF), inverse reinforcement learning (IRL), and constitutional AI. These approaches aim to incorporate normative constraints into the training process, thereby steering AI systems toward behavior that is both desirable and predictable. While promising in many respects, such methods face significant limitations. In particular, aligned AI systems often exhibit reduced adaptability when faced with novel scenarios and suffer from poor interpretability, making it difficult to trace or understand the reasoning behind their decisions. These limitations highlight the insufficiency of a purely engineering-driven approach and suggest the necessity of incorporating broader, interdisciplinary perspectives.
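The reward-modelling step underlying RLHF can be made concrete with a short sketch. The Python/PyTorch fragment below is our own illustration rather than anything described in the paper: a small network is fitted to synthetic pairwise "human preference" data with a Bradley-Terry style loss, so that preferred responses receive higher scalar rewards than rejected ones. All names, dimensions, and the synthetic data are hypothetical placeholders.

```python
# Minimal sketch of RLHF-style reward modelling (illustrative assumptions only).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response's feature vector to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one reward per example

def preference_loss(model: RewardModel, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximise P(preferred > rejected) = sigmoid(r_p - r_r).
    return -torch.nn.functional.logsigmoid(model(preferred) - model(rejected)).mean()

# Toy training loop on synthetic preference pairs (stand-ins for human judgements).
torch.manual_seed(0)
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    preferred = torch.randn(64, 16) + 0.5   # features of "preferred" responses
    rejected = torch.randn(64, 16) - 0.5    # features of "rejected" responses
    loss = preference_loss(model, preferred, rejected)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In a full RLHF pipeline, the learned reward would then steer policy optimization; this sketch covers only the preference-to-reward step that the normative constraints are built on.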
One promising approach is to integrate insights from psychology (the scientific study of human behavior, cognition, and moral reasoning) into the research and development of AI value alignment. Psychological theories provide robust conceptual tools for understanding how humans construct values, make moral judgments, and resolve ethical dilemmas in complex social contexts. Rather than merely replicating the surface-level patterns of human behavior, AI systems informed by these insights can embody internal mechanisms analogous to those involved in human moral cognition. Thus, true value alignment requires more than behavioral mimicry; it demands a form of cognitive and ethical compatibility between artificial agents and the human mind, particularly in terms of value judgment and moral decision-making processes.
This paper explores how psychological science can contribute to advancing AI value alignment. It reviews core psychological theories concerning the formation of moral values, dual-process models of moral reasoning, and the roles of emotion and social context in ethical decision-making. Building on these foundations, we propose conceptual frameworks that include the construction of a unified moral cognitive space capable of integrating diverse human values, and the development of dual-system moral architectures that emulate the interaction between intuitive and deliberative reasoning in human moral cognition. To ground these ideas in practice, we use altruistic behavior—a central and complex moral phenomenon—as a case study, examining how its psychological underpinnings could be modeled in AI systems to promote socially aligned decision-making.
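To make the dual-system idea more tangible, the following minimal Python sketch is our own illustration, not the authors' architecture: a fast "intuitive" lookup answers familiar cases, while a slower "deliberative" cost-benefit evaluation handles novel ones. The rules, weights, and case descriptions are hypothetical placeholders.

```python
# Toy dual-process moral decision procedure (illustrative assumptions only).
from dataclasses import dataclass
from typing import Optional

@dataclass
class MoralCase:
    description: str
    expected_benefit: float  # benefit estimate, arbitrary units
    expected_harm: float     # harm estimate, arbitrary units

# System-1-style cached responses: fast, pattern-matched, no explicit weighing.
INTUITIVE_RULES = {
    "harm an innocent person deliberately": "impermissible",
    "donate spare resources to someone in need": "permissible",
}

def intuitive_judgement(case: MoralCase) -> Optional[str]:
    return INTUITIVE_RULES.get(case.description)

def deliberative_judgement(case: MoralCase, harm_weight: float = 2.0) -> str:
    # System-2-style weighing; harms count more than equal-sized benefits
    # (the weight of 2.0 is an arbitrary modelling choice for this sketch).
    score = case.expected_benefit - harm_weight * case.expected_harm
    return "permissible" if score > 0 else "impermissible"

def moral_decision(case: MoralCase) -> str:
    # Intuition answers when it recognises the case; otherwise deliberate.
    fast = intuitive_judgement(case)
    return fast if fast is not None else deliberative_judgement(case)

if __name__ == "__main__":
    novel = MoralCase("divert a runaway trolley", expected_benefit=5.0, expected_harm=1.0)
    print(moral_decision(novel))   # no cached rule -> deliberation -> "permissible"
    known = MoralCase("donate spare resources to someone in need", 1.0, 0.0)
    print(moral_decision(known))   # cached intuitive response -> "permissible"
```

The interaction shown here (intuition first, deliberation as fallback) is only one possible arbitration scheme; richer architectures could let the two systems run in parallel and resolve conflicts explicitly.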
By bridging AI safety research with psychological theory, this work seeks to support the development of more interpretable, robust, and ethically aware AI systems. Such interdisciplinary integration is not only timely, but also essential to ensure that the evolution of AI technologies remains aligned with the fundamental values of human society.

Key words

artificial intelligence / value alignment / theory of mind / moral decision-making / altruism

Cite this article

Jin Shaoxiong, Liu Chao. From Human Mind to Artificial Intelligence: Advancing AI Value Alignment Through Psychological Theories[J]. Journal of Psychological Science. 2025, 48(4): 782-791 https://doi.org/10.16719/j.cnki.1671-6981.20250402


Funding

*This research was supported by the Science and Technology Innovation 2030 Major Project (2021ZD0200500), the National Natural Science Foundation of China (32441109, 32271092, 32130045), the Beijing Municipal Science and Technology Major Project (Z241100001324005), and an Open Project of the State Key Laboratory of General Artificial Intelligence (SKLAGI20240P06).
