PDF(2822 KB)
Human Intelligence-Inspired Testing for the Developmental Stages of Artificial General Intelligence: From General to Applicable
Peng Yujia, He Xinyi, Xie Hongzhao, Xiao Xizhi, Wang Yuxi, Zhu Songchun, Zhang Zhenliang
Journal of Psychological Science ›› 2026, Vol. 49 ›› Issue (2) : 271-281.
The rapid advancement of artificial intelligence (AI) is profoundly reshaping society, presenting unprecedented opportunities for the development of Artificial General Intelligence (AGI). While generative pre-trained models (e.g., the GPT series) demonstrate remarkable generalization in specialized domains, they remain narrow AI systems with substantial gaps on the path to AGI. Our previous work proposed that AGI demands adaptability to dynamic embodied physical and social interactive (DEPSI) environments, characterized by infinite-task handling, autonomous task generation, and value-driven decision-making. However, translating abstract AGI definitions into practical testing frameworks remains a critical challenge. Here, we propose a human intelligence-inspired developmental testing framework for AGI that assesses its progression from general to applicable capabilities.
First, in the general stage, AGI is expected to demonstrate cross-domain foundational cognitive abilities, such as common sense reasoning and adaptive learning, analogous to early childhood intelligence development (ages 0-6). By collecting and analyzing human developmental data, this study establishes a series of general tests to measure an AI system’s "cognitive age." Specifically, eight representative tasks were selected and implemented in a UE5-based virtual environment, including organizing a suitcase, tidying a desk, and solving puzzles, which cover the cognitive and motor skills expected of 5-6-year-olds. The environment features realistic domestic settings (e.g., kitchens and bedrooms) with interactive objects (e.g., appliances and furniture) and social agents (e.g., family members and teachers) to assess both physical reasoning and social intelligence. A human-user interface, incorporating VR and motion tracking, enables direct comparisons between AI and human performance. Four multimodal large models (GPT-4o, Claude-3.5, Qwen, and Doubao) were tested after being equipped with perception and action modules to interface with the virtual environment. Each task was repeated 10-15 times, with average scores computed for evaluation.
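The evaluation protocol above (each task repeated 10-15 times, with average scores computed) can be sketched as a simple scoring loop. This is a minimal illustration, not the authors' implementation: the task names are a subset of the eight tasks mentioned, the `run_task` callable stands in for a real trial in the UE5 environment, and the stub scores are placeholders on the 0-100 scale used in the text.

```python
from statistics import mean

# Subset of the eight representative tasks described in the text.
TASKS = ["organize_suitcase", "tidy_desk", "solve_puzzle"]

def evaluate_model(run_task, n_repeats=10):
    """Run each task n_repeats times and return the mean score per task.

    `run_task(task)` is an assumed callable that executes one trial in the
    virtual environment and returns a score on a 0-100 scale.
    """
    results = {}
    for task in TASKS:
        scores = [run_task(task) for _ in range(n_repeats)]
        results[task] = mean(scores)
    return results

# Usage with a stub scorer standing in for the real environment:
stub_scores = {"organize_suitcase": 28, "tidy_desk": 35, "solve_puzzle": 12}
report = evaluate_model(lambda t: stub_scores[t], n_repeats=10)
```

Averaging over repeated trials, as the study does, reduces the variance introduced by the models' stochastic outputs before task scores are compared against human developmental norms.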
Key findings reveal critical limitations in current AI systems. A common limitation lies in their constrained embodied performance. While models approached baseline competence (30/100) in simpler tasks, such as understanding button functions, they struggled in complex, physically interactive tasks, including puzzle-solving and room cleaning. GPT-4o emerged as the strongest performer, leading in five tasks, but still exhibited significant shortcomings in motor coordination. Similarly, the models excelled in language-heavy tasks (e.g., selecting gifts) but underperformed in spatial and sequential-action tasks. This reflects their training bias toward static text/image data rather than dynamic, embodied interaction. The study concludes that current large language models, without specialized adaptation, lack the embodied intelligence required for human-like task execution. Future advancements must prioritize real-time sensory feedback, interactive learning, and improved physical simulation to bridge this gap.
Building upon this foundation of general abilities, we introduce a three-phase AGI testing framework: General-Specialized-Applicable (GSA). The specialized phase emphasizes autonomous learning and skill refinement in specific domains (e.g., Go, mathematics), enabling AI to tackle complex problem-solving and knowledge integration, much like human adolescents mastering specialized subjects. It is noteworthy that general and specialized capabilities are not mutually exclusive but exhibit a synergistic, spiral progression in AGI development. General capabilities form the foundational "operating system" of an agent, enabling cross-domain knowledge transfer and adaptive learning. Conversely, advancements in specialized domains refine this system through novel cognitive patterns and problem-solving methods. This bidirectional reinforcement creates a "general-specialized" spiral trajectory of AGI development. Looking back, traditional AI approaches often bypass general capabilities, focusing narrowly on specialized tasks (e.g., chess). To address this, we advocate a "layered development, dynamic balance" strategy: first achieving threshold general competence, then cultivating prioritized specialized skills while establishing feedback mechanisms to generalize domain insights. This approach prevents premature specialization ("ability silos") while ensuring practical utility, enabling continuous breakthroughs in both generality and expertise.
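The "layered development, dynamic balance" strategy can be made concrete as a phase-gating rule: an agent advances to specialized training only after reaching threshold general competence, and counts as applicable only once its specialized skills also clear a bar. The sketch below is illustrative only; the specialized threshold of 60 and the score ranges are assumptions, while the general threshold of 30 echoes the baseline-competence figure cited earlier.

```python
GENERAL_THRESHOLD = 30  # baseline general competence on a 0-100 scale

def next_phase(general_score, specialized_scores):
    """Decide the current GSA phase from hypothetical capability scores.

    Enforces the ordering described in the text: general competence first,
    then prioritized specialized skills, then real-world applicability.
    `specialized_scores` maps domain names (e.g., "go", "math") to scores.
    """
    if general_score < GENERAL_THRESHOLD:
        return "general"        # keep building cross-domain foundations
    if not specialized_scores or min(specialized_scores.values()) < 60:
        return "specialized"    # refine prioritized domain skills
    return "applicable"        # ready for real-world integration tests
```

In a fuller model, the spiral progression would also feed specialized gains back into the general score, rather than treating the phases as a one-way pipeline.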
Finally, the applicable phase evaluates AGI’s generalization ability in real-world environments and industrial applications (e.g., robotics, autonomous driving), verifying whether it can seamlessly integrate into human society and serve practical needs.
Overall, the GSA framework aims to provide a systematic, human development-inspired standard for AGI evaluation, guiding development toward intelligence that can safely coexist with and benefit humanity. Beyond proposing a standardized AGI assessment, the framework may also foster trust by ensuring alignment with human-centric values and practical applicability, advancing AGI toward safe and meaningful social integration.
Keywords: artificial intelligence / artificial general intelligence / cognitive development / artificial intelligence evaluation / embodied AI
We thank Zhang Chi, Li Jiaqi, Zheng Zilong, Niu Lixing, and Fan Lifeng for their contributions to testing; Zhao Shiyun, Lu Yujie, and Liu Mingyuan for their contributions to the general task tests; Zhu Aiju, Xie Lubin, and Han Jiaheng for data collection; Fu Yuqiu and Zhou Shangbo for data analysis; and Chen Zhen for figure preparation.