PDF(2822 KB)
Human Intelligence-Inspired Testing for the Developmental Stages of Artificial General Intelligence: From General to Applicable
Peng Yujia, He Xinyi, Xie Hongzhao, Xiao Xizhi, Wang Yuxi, Zhu Songchun, Zhang Zhenliang
Journal of Psychological Science ›› 2026, Vol. 49 ›› Issue (2) : 271-281.
The rapid advancement of artificial intelligence (AI) is profoundly reshaping society, presenting unprecedented opportunities for the development of Artificial General Intelligence (AGI). While generative pre-trained models (e.g., the GPT series) demonstrate remarkable generalization in specialized domains, they remain narrow AI systems with substantial gaps on the path to AGI. Our previous work proposed that AGI demands adaptability to dynamic embodied physical and social interactive (DEPSI) environments, characterized by infinite-task handling, autonomous task generation, and value-driven decision-making. However, translating abstract AGI definitions into practical testing frameworks remains a critical challenge. Here, we propose a human intelligence-inspired developmental testing framework for AGI that assesses its progression from general to applicable capabilities.
First, in the general stage, AGI is expected to demonstrate cross-domain foundational cognitive abilities, such as common sense reasoning and adaptive learning, analogous to early childhood intelligence development (ages 0-6). By collecting and analyzing human developmental data, this study establishes a series of general tests to measure an AI system’s "cognitive age." Specifically, eight representative tasks were selected and implemented in a UE5-based virtual environment, including organizing a suitcase, tidying a desk, and solving puzzles, which cover the cognitive and motor skills expected of 5-6-year-olds. The environment features realistic domestic settings (e.g., kitchens and bedrooms) with interactive objects (e.g., appliances and furniture) and social agents (e.g., family members and teachers) to assess both physical reasoning and social intelligence. A human-user interface, incorporating VR and motion tracking, enables direct comparisons between AI and human performance. Four multimodal large models (GPT-4o, Claude-3.5, Qwen, and Doubao) were tested after being equipped with perception and action modules to interface with the virtual environment. Each task was repeated 10-15 times, with average scores computed for evaluation.
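The evaluation protocol above (each task repeated 10-15 times, with average scores computed) can be sketched as a simple scoring loop. This is a minimal illustration, not the authors' implementation: the task names are a subset of the eight tasks mentioned, the `run_task` callable stands in for a real trial in the UE5 environment, and the stub scores are placeholders on the 0-100 scale used in the text.

```python
from statistics import mean

# Subset of the eight representative tasks described in the text.
TASKS = ["organize_suitcase", "tidy_desk", "solve_puzzle"]

def evaluate_model(run_task, n_repeats=10):
    """Run each task n_repeats times and return the mean score per task.

    `run_task(task)` is an assumed callable that executes one trial in the
    virtual environment and returns a score on a 0-100 scale.
    """
    results = {}
    for task in TASKS:
        scores = [run_task(task) for _ in range(n_repeats)]
        results[task] = mean(scores)
    return results

# Usage with a stub scorer standing in for the real environment:
stub_scores = {"organize_suitcase": 28, "tidy_desk": 35, "solve_puzzle": 12}
report = evaluate_model(lambda t: stub_scores[t], n_repeats=10)
```

Averaging over repeated trials, as the study does, reduces the variance introduced by the models' stochastic outputs before task scores are compared against human developmental norms.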
Key findings reveal critical limitations in current AI systems. A common limitation lies in their constrained embodied performance. While models approached baseline competence (30/100) in simpler tasks, such as understanding button functions, they struggled in complex, physically interactive tasks, including puzzle-solving and room cleaning. GPT-4o emerged as the strongest performer, leading in five tasks, but still exhibited significant shortcomings in motor coordination. Similarly, the models excelled in language-heavy tasks (e.g., selecting gifts) but underperformed in spatial and sequential-action tasks. This reflects their training bias toward static text/image data rather than dynamic, embodied interaction. The study concludes that current large language models, without specialized adaptation, lack the embodied intelligence required for human-like task execution. Future advancements must prioritize real-time sensory feedback, interactive learning, and improved physical simulation to bridge this gap.
Building upon this foundation of general abilities, we introduce a three-phase AGI testing framework: General-Specialized-Applicable (GSA). The specialized phase emphasizes autonomous learning and skill refinement in specific domains (e.g., Go, mathematics), enabling AI to tackle complex problem-solving and knowledge integration, much like human adolescents mastering specialized subjects. It is noteworthy that general and specialized capabilities are not mutually exclusive but exhibit a synergistic, spiral progression in AGI development. General capabilities form the foundational "operating system" of an agent, enabling cross-domain knowledge transfer and adaptive learning. Conversely, advancements in specialized domains refine this system through novel cognitive patterns and problem-solving methods. This bidirectional reinforcement creates a "general-specialized" spiral trajectory of AGI development. Looking back, traditional AI approaches often bypass general capabilities, focusing narrowly on specialized tasks (e.g., chess). To address this, we advocate a "layered development, dynamic balance" strategy: first achieving threshold general competence, then cultivating prioritized specialized skills while establishing feedback mechanisms to generalize domain insights. This approach prevents premature specialization ("ability silos") while ensuring practical utility, enabling continuous breakthroughs in both generality and expertise.
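The "layered development, dynamic balance" strategy can be made concrete as a phase-gating rule: an agent advances to specialized training only after reaching threshold general competence, and counts as applicable only once its specialized skills also clear a bar. The sketch below is illustrative only; the specialized threshold of 60 and the score ranges are assumptions, while the general threshold of 30 echoes the baseline-competence figure cited earlier.

```python
GENERAL_THRESHOLD = 30  # baseline general competence on a 0-100 scale

def next_phase(general_score, specialized_scores):
    """Decide the current GSA phase from hypothetical capability scores.

    Enforces the ordering described in the text: general competence first,
    then prioritized specialized skills, then real-world applicability.
    `specialized_scores` maps domain names (e.g., "go", "math") to scores.
    """
    if general_score < GENERAL_THRESHOLD:
        return "general"        # keep building cross-domain foundations
    if not specialized_scores or min(specialized_scores.values()) < 60:
        return "specialized"    # refine prioritized domain skills
    return "applicable"        # ready for real-world integration tests
```

In a fuller model, the spiral progression would also feed specialized gains back into the general score, rather than treating the phases as a one-way pipeline.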
Finally, the applicable phase evaluates AGI’s generalization ability in real-world environments and industrial applications (e.g., robotics, autonomous driving), verifying whether it can seamlessly integrate into human society and serve practical needs.
Overall, the GSA framework aims to provide a systematic, human development-inspired standard for AGI evaluation, guiding development toward intelligence that can safely coexist with and benefit humanity. Beyond proposing a standardized AGI assessment, the framework may also foster trust by ensuring alignment with human-centric values and practical applicability, advancing AGI toward safe and meaningful social integration.
Keywords: artificial intelligence / artificial general intelligence / cognitive development / artificial intelligence evaluation / embodied AI
We thank Zhang Chi, Li Jiaqi, Zheng Zilong, Niu Lixing, and Fan Lifeng for their contributions to testing; Zhao Shiyun, Lu Yujie, and Liu Mingyuan for their contributions to the general task tests; Zhu Aiju, Xie Lubin, and Han Jiaheng for data collection; Fu Yuqiu and Zhou Shangbo for data analysis; and Chen Zhen for figure preparation.