The Role of Speaker’s Identity Information in Spoken Word Processing

Yin Shuqi; Aierken Mukaidaisi; Shen Taiyu; Li Li; Yu Keke; Wang Ruiming

doi:10.16719/j.cnki.1671-6981.20260306

Journal of Psychological Science, 2026, Vol. 49, Issue 3: 565-575.

General Psychology, Experimental Psychology & Ergonomics

Yin Shuqi¹, Aierken Mukaidaisi¹, Shen Taiyu¹, Li Li², Yu Keke¹,**, Wang Ruiming¹,**

Abstract

Taking a speaker’s identity information into account yields a more socially grounded and ecologically valid account of how spoken words are processed. However, whether and how this information affects spoken word processing remains controversial. The abstractionist view (in both its early and developmental forms) and the episodic view disagree on this issue, and previous studies have used different experimental tasks that support different views. Based on our analysis of these studies, we propose that each view may best explain a different process within spoken word processing, so the role of speaker’s identity information needs to be examined in tasks that require different processing depths. The present study therefore asked whether and how speaker’s identity information affects lexical access and conceptual comprehension in spoken word processing. Addressing these questions can help us better understand spoken word processing.

The present study comprised two behavioral experiments that used the classic long-term repetition priming paradigm to minimize possible interference from explicit experimental tasks. Experiment 1 used a lexical decision task to examine whether and how speaker’s identity information affects lexical access in spoken word processing. Eighty-eight participants were recruited for the experiments and randomly assigned to one of two groups (consistent vs. inconsistent speaker identity). The experiment contained a learning phase and a test phase. In the consistent group, participants heard stimuli spoken by a male speaker in both phases; in the inconsistent group, they heard stimuli spoken by a male speaker in the learning phase and by a female speaker in the test phase. The materials consisted of 36 real words (e.g., “/yi1fu2/”, meaning “clothes”) and 36 pseudowords (i.e., pronounceable but meaningless nonwords, e.g., “/ju4hong2/”). Participants judged whether each auditory item was a real word or a pseudoword. Experiment 2 used a category decision task to examine whether and how speaker’s identity information affects conceptual comprehension in spoken word processing. The participants and design were the same as in Experiment 1, with 36 biological words (e.g., “/xiao3cao3/”, meaning “grass”) and 36 non-biological words (e.g., “/qian1bi3/”, meaning “pencil”) as materials. Participants judged whether each auditory word was biological or non-biological.
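The two-phase, two-group design described above can be sketched as a small trial-list builder. This is only an illustration under stated assumptions, not the authors' procedure: the function name `build_trials`, the placeholder item labels, and the even split of items into learned and unlearned halves are hypothetical, since the abstract gives only the item counts, the two phases, and the speaker assigned to each phase per group.

```python
import random

# Speaker heard in each phase, per group (as described in the abstract).
SPEAKERS = {
    "consistent": {"learning": "male", "test": "male"},
    "inconsistent": {"learning": "male", "test": "female"},
}


def build_trials(items, group, seed=0):
    """Return (learning_trials, test_trials) for one participant.

    ASSUMPTION: a random half of the items is presented in the learning
    phase ("learned") and every item appears at test; the abstract does
    not state the actual split or any counterbalancing.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    learned = set(shuffled[: len(shuffled) // 2])

    voices = SPEAKERS[group]
    # Learning phase: only the "learned" items, in shuffled order.
    learning = [
        {"item": item, "speaker": voices["learning"]}
        for item in shuffled
        if item in learned
    ]
    # Test phase: all items, each tagged as learned or unlearned.
    test = [
        {
            "item": item,
            "speaker": voices["test"],
            "status": "learned" if item in learned else "unlearned",
        }
        for item in items
    ]
    rng.shuffle(test)
    return learning, test


# Hypothetical placeholder labels standing in for the 36 + 36 stimuli.
items = [f"word_{i:02d}" for i in range(36)] + [f"pseudo_{i:02d}" for i in range(36)]
learning, test = build_trials(items, "inconsistent")
```

In this sketch the only difference between the groups is the test-phase speaker, so any group difference restricted to learned items would reflect the match between the learning-phase and test-phase voice.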

In Experiment 1, performance on learned words was better than on unlearned words, indicating a stable repetition effect. More importantly, in the overall analysis (real words and pseudowords combined), accuracy for learned words was significantly higher in the consistent condition than in the inconsistent condition, whereas for unlearned words the two conditions did not differ significantly. Further analysis showed that pseudowords followed the same pattern as the overall analysis, but for real words there were no significant differences between the consistent and inconsistent conditions in either accuracy or reaction time, for learned or unlearned words. In Experiment 2, response times were significantly shorter for learned words than for unlearned words, again indicating a repetition effect. In contrast to Experiment 1, however, accuracy was significantly higher in the consistent condition than in the inconsistent condition for unlearned words, with no such difference for learned words.

Speaker’s identity information influences spoken word processing differently depending on the process involved. The facilitation from speaker identity consistency for learned words in the lexical decision task suggests that the representation of the speaker’s identity is integrated with linguistic information and affects lexical access as a unit, supporting the episodic view. In contrast, the facilitation from speaker identity consistency for unlearned words in the category decision task suggests that the speaker’s identity and linguistic information are represented separately and affect conceptual comprehension independently, supporting the developmental abstractionist view. Integrating the developmental abstractionist and episodic views therefore offers a fuller account of spoken word processing.

Key words

spoken word processing / identity information / linguistic information / lexical access / conceptual comprehension

Cite this article

Yin Shuqi, Aierken Mukaidaisi, Shen Taiyu, et al. The Role of Speaker’s Identity Information in Spoken Word Processing[J]. Journal of Psychological Science, 2026, 49(3): 565-575. https://doi.org/10.16719/j.cnki.1671-6981.20260306
