Journal of Psychological Science ›› 2025, Vol. 48 ›› Issue (4): 814-825. DOI: 10.16719/j.cnki.1671-6981.20250405
Computational Modeling and Artificial Intelligence


The Performance of Deep Convolutional Neural Networks in Face Recognition and the Comparison with the Human Visual System

  • Cheng Yuhui1, Shen Tianyu1, Lu Zitong2, Yuan Xiangyong3,4, Jiang Yi3,4

Abstract

Face recognition is a core cognitive ability in human social interaction. In recent years, deep convolutional neural networks (DCNNs) have shown remarkable power in modeling and understanding face processing, offering a new perspective for investigating the behavioral performance and neural mechanisms of human face recognition. This paper therefore systematically reviews the similarities and differences between DCNNs and humans in face recognition from three angles: recognition ability, behavioral effects, and neural mechanisms. (1) First, do DCNNs possess face recognition abilities comparable to those of humans? We assess DCNN performance on face recognition tasks with respect to facial attributes such as identity, gender, and emotion. (2) Second, although DCNNs achieve excellent recognition accuracy, are their processing strategies consistent with human behavioral mechanisms? Drawing on classic face-processing effects (e.g., the inversion, own-race, and familiarity effects), we analyze the similarities and differences between DCNN and human processing strategies. (3) Further, do the internal representations of DCNNs resemble the neural mechanisms of human face processing? In terms of hierarchical structure and functional specialization, we compare their representational schemes with the neural basis of the human face recognition system. Current models still have limitations in robustness and generalization, interpretability, and fidelity to the biological visual system; future research could further explore their potential integration with multimodal networks and generative adversarial networks.

Extended Abstract

Face recognition is a fundamental cognitive function that plays a crucial role in human social interaction, as the human brain exhibits a remarkable sensitivity to facial stimuli. For decades, psychologists, cognitive neuroscientists, and computer vision researchers have been dedicated to uncovering the behavioral and neural mechanisms underlying face processing. Existing studies have demonstrated that humans process facial information differently from other objects, supporting the existence of highly specialized mechanisms for face perception. In particular, the fusiform face area (FFA) in the human brain has been identified as a specialized region for face recognition, and numerous face-selective neurons have been observed in the temporal lobe of macaques. In recent years, Deep Convolutional Neural Networks (DCNNs) have demonstrated remarkable performance in modeling and understanding face processing, providing new computational perspectives for exploring the neural mechanisms underlying face recognition. DCNNs are a class of artificial neural networks that have achieved impressive performance in visual recognition tasks, including face recognition. These models typically begin by applying a series of convolutional and pooling operations to extract increasingly abstract features, which are then passed through one or more fully connected layers to perform classification tasks. Consequently, there has been a growing interest in investigating the applications of DCNNs in face recognition.
First, this review examines the performance of DCNNs in identifying key facial attributes. Although most DCNNs are trained only for face identity tasks, they can still infer social information such as gender and expression. In addition, this review also discusses the similarities and differences between DCNNs and humans in well-known face processing phenomena, such as the inversion, own-race, and familiarity effects. Evidence suggests that DCNNs can produce face-specific cognitive effects similar to those observed in humans. To better understand the computational validity of DCNNs, this review compares their internal representations with the neural mechanisms involved in human face recognition. On the one hand, this paper analyzes the hierarchical processing architecture that emerges in trained DCNNs and evaluates its correspondence with the hierarchical structure of the human visual system, spanning from early visual areas (e.g., V1-V4) to higher-level face-selective regions such as the FFA. On the other hand, this review further discusses evidence for brain-like functional specialization within DCNNs, examining whether units selective to different facial attributes can be mapped onto the functionally specialized cortical areas observed in neuroimaging and electrophysiological studies.
Lastly, this paper highlights several limitations of current models and outlines promising directions for future research. First, although DCNNs excel at face recognition, they remain far less robust than humans when faced with challenges such as viewpoint shifts, image distortions, adversarial perturbations, and limited training data. Second, although DCNNs exhibit behavioral effects like those observed in humans, there are multiple possible explanations for the mechanisms underlying these phenomena. The DCNN models examined in different studies often vary in architecture, task objectives, and training datasets, which may limit the comparability of their results. Third, the extent to which current models capture essential features of the biological visual system remains unclear. Specifically, many DCNNs are purely feedforward architectures and lack critical elements such as recurrent processing, top-down feedback, and dynamic attentional modulation, all of which are fundamental characteristics of the human visual system. Fourth, current neural network models focus primarily on the perceptual stage of face recognition. Future research should aim to incorporate semantic-level processing to more fully capture the complexity of human face perception. Fifth, Generative Adversarial Networks (GANs), which have recently attracted considerable attention, are powerful tools for generating diverse facial stimuli, enabling more controlled and flexible investigations of face perception. Integrating GANs with DCNNs has also enhanced our understanding of the mechanisms underlying facial representation, making this a promising direction for future research.
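The abstract describes the typical DCNN pipeline as a stack of convolution and pooling operations followed by fully connected classification layers. The sketch below illustrates one such stage in plain numpy; it is not any model from the reviewed studies — the 32×32 input, four 3×3 filters, and the 10-identity output are illustrative assumptions, and the random, untrained weights are for shape demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, k):
    """Valid 2-D convolution (strictly, cross-correlation, as in most DCNN libraries)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling."""
    H2, W2 = x.shape[0] // s, x.shape[1] // s
    return x[:H2 * s, :W2 * s].reshape(H2, s, W2, s).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A toy 32x32 grayscale "face" image and four 3x3 filters (all hypothetical).
img = rng.standard_normal((32, 32))
filters = rng.standard_normal((4, 3, 3)) * 0.1

# Stage 1: convolution -> nonlinearity -> pooling yields 4 feature maps of 15x15.
maps = np.stack([max_pool(relu(conv2d(img, f))) for f in filters])

# Stage 2: flatten and pass through a fully connected layer for, say, 10 identities.
features = maps.reshape(-1)                       # 4 * 15 * 15 = 900 features
W_fc = rng.standard_normal((10, features.size)) * 0.01
probs = softmax(W_fc @ features)

print(maps.shape)   # (4, 15, 15)
```

Stacking several such stages is what produces the "increasingly abstract features" the abstract refers to, with the final fully connected layer performing the classification.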


Key words

face recognition / convolutional neural network / fusiform face area / hierarchical structure / functional specialization

Cite This Article
Cheng Yuhui, Shen Tianyu, Lu Zitong, Yuan Xiangyong, Jiang Yi. The performance of deep convolutional neural networks in face recognition and the comparison with the human visual system. Journal of Psychological Science, 2025, 48(4): 814-825. https://doi.org/10.16719/j.cnki.1671-6981.20250405


Funding

* This research was supported by the Young Scientists Fund of the National Natural Science Foundation of China (32400864), a research start-up grant for introduced talent at Nanjing Normal University (184080H201A45), and the Youth Project of the National Social Science Fund of China (23CYY048).
