Semantic Stability, Response Strategies, and Bias Analysis of Generative Artificial Intelligence in Psychological Health Education

Mu Yi, Li Qiang, Wang Zhen, Zhang Lidan, Chen Yu

Journal of Psychological Science ›› 2026, Vol. 49 ›› Issue (2) : 258-270. DOI: 10.16719/j.cnki.1671-6981.20260201
Computational Modeling and Artificial Intelligence


Abstract

Generative artificial intelligence (AI) holds transformative potential for addressing persistent limitations in traditional psychological health education systems, particularly constraints on accessibility, uneven distribution of resources, and the lack of personalized support. However, critical concerns remain about the reliability, interpretability, and fairness of large language models (LLMs), especially in high-stakes scenarios such as psychological guidance.

This study employed a word embedding-based Comprehensive Semantic Behavioral Analysis Framework (CSBAF) to systematically evaluate the semantic consistency, response strategies, and systemic bias of LLMs in psychological health education contexts. Grounded in the theory of verbal behavior, the framework conceptualizes AI-generated language as both informational content and social action. By integrating semantic structure analysis with contextual strategy evaluation over iterative interactions, the framework offered advantages over traditional evaluation criteria such as content accuracy, providing a deeper behavioral perspective on AI performance in psychologically sensitive domains. To operationalize this framework, we used DeepSeek as the primary model and conducted comparative testing with ChatGPT and Doubao to assess cross-model generalizability. The evaluation was based on 21 structured prompt templates adapted from established psychological education handbooks, covering key themes including depression, anxiety, general health, substance use, meaning and existence, lifestyle, and interpersonal relationships. Each model was evaluated under three sampling configurations obtained by adjusting the temperature and top_p sampling parameters. For the semantic consistency assessment, responses were transformed into vector representations using Chinese word embeddings. Semantic similarity across 30 repeated dialogue iterations was quantified with the Frobenius norm and visualized using dimensionality reduction techniques (PCA and t-SNE). Clustering analysis was employed to identify and characterize the distinct response strategies exhibited by each model. In addition, expert-based evaluation was employed to systematically assess the primary model across six dimensions: accuracy, clarity, relevance, empathy, engagement, and ethical considerations, with all assessments situated within the contextual frames of gender and ethnicity.
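The consistency pipeline described above can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: random vectors stand in for the Chinese word embeddings, responses are assumed to be pooled into fixed-shape matrices, and all function names (`frobenius_distance`, `mean_pairwise_distance`, `pca_2d`) are hypothetical. It shows only the two core measurements: mean pairwise Frobenius-norm distance over 30 repeated responses (lower means more semantically stable) and an SVD-based 2-D PCA projection for visualization.

```python
import numpy as np

rng = np.random.default_rng(0)

def frobenius_distance(A, B):
    # Frobenius norm of the difference between two response
    # embedding matrices (rows = tokens, columns = embedding dims).
    return float(np.linalg.norm(A - B, ord="fro"))

def mean_pairwise_distance(mats):
    # Mean pairwise Frobenius distance across repeated responses;
    # a smaller value indicates higher semantic stability.
    n = len(mats)
    dists = [frobenius_distance(mats[i], mats[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

def pca_2d(X):
    # Center the rows and project onto the top two right singular
    # vectors (equivalent to 2-component PCA).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Toy stand-in for 30 repeated responses to one prompt, each pooled
# to a 5-token x 8-dimension embedding matrix.
responses = [rng.normal(size=(5, 8)) for _ in range(30)]
stability = mean_pairwise_distance(responses)

# Mean-pool each response to a sentence vector, then project to 2-D.
pooled = np.stack([m.mean(axis=0) for m in responses])   # shape (30, 8)
coords = pca_2d(pooled)                                  # shape (30, 2)
```

In a real pipeline the matrices would come from an embedding model, responses of unequal length would need pooling or alignment before subtraction, and t-SNE (e.g., scikit-learn's `TSNE`) could replace or complement the PCA step for visualization.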

This study yielded three principal findings regarding the performance of LLMs in multi-turn psychological dialogue scenarios. First, in terms of semantic structural similarity, the primary model showed a strong correlation between response patterns and question types. Although the semantic distribution shifted structurally as sampling parameters were adjusted, question type affected semantic stability more than parameter variation did. Cross-model comparisons showed that parameter settings played a major role in generative patterns; nonetheless, for certain question types, the prompts remained the dominant factor influencing semantic behavior. Second, in terms of response strategies, each model showed relatively stable and distinguishable strategic preferences for specific question types, and these tendencies were closely related to model architecture and parameter settings. Third, in the bias analysis, male-context prompts were more likely to elicit information-focused responses, while female-context prompts triggered more emotionally expressive outputs. These results suggest the presence of implicit social role tendencies in LLMs.

In summary, these findings validate the practical potential of LLMs for augmenting psychological health education. Future research should further investigate how generative AI could be integrated into human-AI collaborative systems to better support educational practice.

Key words

generative artificial intelligence / psychological health education / semantic behavior / large language models / word embeddings

Cite this article

Mu Yi, Li Qiang, Wang Zhen, et al. Semantic Stability, Response Strategies, and Bias Analysis of Generative Artificial Intelligence in Psychological Health Education [J]. Journal of Psychological Science, 2026, 49(2): 258-270. https://doi.org/10.16719/j.cnki.1671-6981.20260201
