Development and Application of a Cognitive Diagnostic Model for Handling Missing Responses and Random Guessing*

Li Xiaopei, Peng Siwei, Wang Qin, Cai Yan

Journal of Psychological Science, 2026, Vol. 49, Issue (1): 207-224. DOI: 10.16719/j.cnki.1671-6981.20260119

Statistics, Measurement and Methods


Abstract

In operational testing, examinees' abnormal responses, especially missing responses and random guessing, often bias parameter estimation and undermine the accuracy and fairness of test results. However, modeling research on abnormal responses within the cognitive diagnosis field remains very limited. To address this gap, this study makes a first attempt to jointly model item response tree (IRTree) models and cognitive diagnostic models, developing a new cognitive diagnostic model, the IRTree-LCDM, which simultaneously accounts for the effects of missing responses and random guessing. To evaluate the new model's performance and its behavior on empirical data, the study combines Monte Carlo simulation with real-data analysis. The simulation results show that the IRTree-LCDM achieves good parameter estimation accuracy under all experimental conditions. Moreover, compared with traditional cognitive diagnostic models, the IRTree-LCDM classifies examinees more accurately: the mean classification correct rate for individual attributes exceeds 0.946, and the mean pattern classification correct rate reaches 0.783. In addition, the IRTree-LCDM fits the empirical data better and yields more reasonable estimates of examinees' attribute mastery patterns. These results indicate that the IRTree-LCDM has substantial value for handling abnormal responses.


With the advancement in psychological and educational testing, researchers have increasingly focused not only on measuring the abilities or traits of test takers, but also on assessing their mastery of specific knowledge structures. As a result, cognitive diagnostic assessment has become a major focus within the fields of psychological and educational measurement. In practice, however, both general and cognitive diagnostic tests frequently reveal abnormal response patterns from test takers, including missing responses and random guessing, which can be attributed to either individual characteristics or item properties. These abnormal responses can introduce biases in parameter estimation, thereby threatening the reliability and validity of the tests. Addressing these common abnormal response patterns is crucial for accurate data analysis. While much of the existing research on abnormal responses has been concentrated within the Item Response Theory (IRT) framework, there is a notable lack of work in the cognitive diagnosis domain, which remains in its early stages of development. Inspired by the IRTree framework, this study develops a novel cognitive diagnostic model that simultaneously accounts for missing responses and random guessing. This innovative model seeks to enhance the representation of abnormal response patterns within cognitive diagnostic assessments, offering significant implications for future research.
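
The abstract does not reproduce the model equations. As a rough sketch only, assuming the standard log-linear CDM (LCDM) response kernel and a tree with one node for responding versus omitting and one for solution versus guessing behavior (the published IRTree-LCDM specification may differ), the joint structure could take the form

    \pi_{ij}^{(r)} = \mathrm{logit}^{-1}(\theta_i^{(r)} - b_j^{(r)}), \qquad
    \pi_{ij}^{(s)} = \mathrm{logit}^{-1}(\theta_i^{(s)} - b_j^{(s)}),

    P_j(\boldsymbol{\alpha}_i) = \frac{\exp\{\lambda_{j,0} + \boldsymbol{\lambda}_j^{\top}\mathbf{h}(\boldsymbol{\alpha}_i,\mathbf{q}_j)\}}{1 + \exp\{\lambda_{j,0} + \boldsymbol{\lambda}_j^{\top}\mathbf{h}(\boldsymbol{\alpha}_i,\mathbf{q}_j)\}},

    P(X_{ij} = \text{missing}) = 1 - \pi_{ij}^{(r)}, \qquad
    P(X_{ij} = 1) = \pi_{ij}^{(r)}\bigl[\pi_{ij}^{(s)} P_j(\boldsymbol{\alpha}_i) + (1 - \pi_{ij}^{(s)})\, g_j\bigr],

where \pi_{ij}^{(r)} and \pi_{ij}^{(s)} are logistic response and solution propensities, P_j(\boldsymbol{\alpha}_i) is the LCDM probability of a solution-based correct response given attribute profile \alpha_i and Q-matrix row q_j, and g_j is the probability of a correct random guess (e.g., 1/m for an m-option item).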

The paper begins with a comprehensive review of relevant concepts, theories, and prior research. It then details the modeling approach and framework of the new model, including the prior settings for the parameters and the Markov chain Monte Carlo (MCMC) estimation method. A 3×2×2×4 four-factor experimental design is employed, varying the proportion of missing responses (2.5%, 5%, 10%), the proportion of random guessing (2.5%, 5%), the sample size (1000, 1500), and the handling method (IRTree-LCDM, LCDM-FCS, LCDM-CIM, LCDM-ZR). This simulation study evaluates the parameter estimation accuracy and robustness of the new model and compares its attribute classification accuracy with traditional cognitive diagnostic models using different methods to handle missing values (i.e., full conditional specification, corrected item mean imputation, and zero replacement). Finally, the new model is applied to real data from the 8th-grade mathematics test of TIMSS 2019. The fit of the new model is compared with that of traditional cognitive diagnostic models, and typical test takers are analyzed to illustrate the advantages and practical value of the new model.
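
For concreteness, a minimal sketch of the crossed simulation grid is shown below; the names and looping structure are illustrative assumptions, not the authors' code.

    # Hypothetical sketch of the 3x2x2x4 simulation design described above.
    from itertools import product

    MISS_RATES = (0.025, 0.05, 0.10)   # proportions of missing responses
    GUESS_RATES = (0.025, 0.05)        # proportions of random guessing
    SAMPLE_SIZES = (1000, 1500)        # numbers of simulated examinees
    METHODS = ("IRTree-LCDM", "LCDM-FCS", "LCDM-CIM", "LCDM-ZR")

    conditions = list(product(MISS_RATES, GUESS_RATES, SAMPLE_SIZES, METHODS))
    assert len(conditions) == 3 * 2 * 2 * 4  # 48 crossed cells

    for miss_rate, guess_rate, n, method in conditions:
        pass  # simulate responses, inject missingness and guessing, fit `method`, score recovery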

Results show that: (1) Compared to the traditional LCDM using FCS, CIM, and ZR to handle missing data, the newly developed IRTree-LCDM exhibits superior parameter estimation and diagnostic precision. The average Attribute Classification Correct Rate (ACCR) for test takers exceeds 0.946, while the average Pattern Classification Correct Rate (PCCR) reaches 0.783. (2) The proportion of abnormal responses affects attribute- and pattern-level classification accuracy: the higher the proportion of abnormal responses, the lower the accuracy. Nevertheless, compared to the traditional LCDM (with FCS, CIM, or ZR imputation), the new model shows clear advantages in handling missing responses and random guessing. (3) Compared to the traditional LCDM (with ZR imputation), the IRTree-LCDM performs better on the real test data, providing more reasonable estimates of test takers' attribute mastery patterns.
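
As a hedged illustration of the two accuracy criteria, the function below is one plausible reading of how ACCR and PCCR are computed from true versus estimated attribute profiles; it is not the authors' scoring code.

    import numpy as np

    def accr_pccr(alpha_true, alpha_hat):
        # Both inputs: 0/1 arrays of shape (n_examinees, n_attributes).
        match = np.asarray(alpha_true) == np.asarray(alpha_hat)
        accr = match.mean(axis=0)          # per-attribute classification correct rate
        pccr = match.all(axis=1).mean()    # proportion of fully correct patterns
        return accr, pccr

    # Toy check: 4 examinees, 3 attributes; one attribute misclassified for examinee 2.
    true_a = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
    est_a  = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
    accr, pccr = accr_pccr(true_a, est_a)  # accr = [1.0, 1.0, 0.75], pccr = 0.75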

In conclusion, the IRTree-LCDM demonstrates significant value for handling abnormal responses in cognitive diagnostic assessment.

Key words

cognitive diagnosis / item response tree model / item response theory / missing responses / random guessing

Cite This Article

Li Xiaopei, Peng Siwei, Wang Qin, et al. Cognitive Diagnostic Model for Missing Responses and Random Guessing[J]. Journal of Psychological Science, 2026, 49(1): 207-224. https://doi.org/10.16719/j.cnki.1671-6981.20260119


Funding

*National Natural Science Foundation of China (62467002)
National Natural Science Foundation of China (32160203)
National Natural Science Foundation of China (62167004)
National Natural Science Foundation of China (32300942)
