Speed Difference Detection Based on Response Time Data

Xin Yunxi, Qin Chunying, Dong Shenghong, Yu Xiaofeng

Journal of Psychological Science, 2026, Vol. 49, Issue (2): 473-484. DOI: 10.16719/j.cnki.1671-6981.20260219
Psychological statistics, Psychometrics & Methods


Abstract

Response time data are increasingly recognized for their potential to reveal the pace and conduct of examinees, offering valuable insights in educational and psychological assessment. Unusually rapid test completion may indicate irregular behavior, such as prior knowledge of certain test items obtained before the test. Current research indicates that the signed likelihood ratio (SLR) test outperforms other methods in maintaining Type I error rates and achieving high statistical power. This paper compares the SLR with two novel test statistics designed to detect speed discrepancies.

Using response time data, we developed two Bayesian-inspired statistics to assess variations in test-taking speed. To gauge the efficacy of these statistics in detecting prior knowledge of test items, we begin with a well-known real-world data set that has been analyzed in previous item-preknowledge studies, allowing us to benchmark our findings against the existing literature. Employing the signed likelihood ratio (SLR), the Bayes factor (BF), and the posterior probability (PP) approaches, we scrutinize each examinee's response time data. Based on the items flagged in the data set, the test is divided into two parts: the set of normal items and the set of flagged items. To examine more directly whether the examinees marked as aberrant in the original data set show speed differences during the test, the Form 1 data are analyzed as follows: (1) after excluding records with missing information and the marked aberrant records, the lognormal response time model is fitted to obtain the item parameters of each item; (2) after excluding examinees with missing data, the speed parameters of the 41 marked examinees are estimated on all 170 items of the test, on the 106 non-leaked items, and on the 64 leaked items, and the resulting estimates indicate that the examinees identified by the three methods all show speed differences; (3) after excluding examinees with missing data, all 1,624 examinees are analyzed with SLR, BF, and PP, and speed parameters are estimated for the marked aberrant examinees.
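To make steps (1)-(2) concrete, the following is a minimal sketch (in Python, which the paper itself does not provide) of speed estimation on the two item sets under the lognormal response time model, and of an SLR-type z statistic built from the gap between the two estimates. The item parameters alpha (time discrimination) and beta (time intensity) are treated as known, matching the study design; the standardization uses the usual asymptotic variance of the weighted estimator and is an assumption about the statistic's exact form.

```python
import numpy as np

def tau_hat(log_t, alpha, beta):
    """MLE of the speed parameter tau on one item set under the
    lognormal RT model ln T_j ~ N(beta_j - tau, alpha_j ** -2),
    with known item parameters. Returns estimate and variance."""
    w = alpha ** 2                      # precision weight per item
    est = np.sum(w * (beta - log_t)) / np.sum(w)
    return est, 1.0 / np.sum(w)

def slr_statistic(log_t, alpha, beta, flagged):
    """Speed-difference z statistic: standardized gap between the
    speed estimated on flagged (leaked) items and on normal items;
    approximately N(0, 1) when speed is constant across both sets."""
    f = np.asarray(flagged, dtype=bool)
    tau_f, var_f = tau_hat(log_t[f], alpha[f], beta[f])
    tau_n, var_n = tau_hat(log_t[~f], alpha[~f], beta[~f])
    return (tau_f - tau_n) / np.sqrt(var_f + var_n)
```

Large positive values of the statistic point to an examinee who is systematically faster on the flagged items than on the rest of the test.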

The outcomes are then compared with the examinees marked as aberrant in the original data set. Interestingly, all three methods take a more conservative stance than the original annotations: BF, SLR, and PP flag 13, 11, and 9 examinees, respectively. Moreover, the detection sets are nested, with the BF set containing both the SLR and PP sets, and the SLR set containing the PP set. This suggests that PP applies the most stringent criterion for flagging aberrant examinee behavior, while BF is comparatively the most lenient.
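The two Bayesian-inspired quantities can be sketched in the same setting. Under a conjugate normal prior on speed, the posterior of the speed gap between the two item sets is itself normal, so a PP-type quantity (the posterior probability of being faster on the flagged set) has a closed form, and a point-null Bayes factor can be approximated with the Savage-Dickey density ratio. This is one standard construction rather than the paper's exact definitions, and the prior variance prior_var is an illustrative assumption.

```python
import numpy as np
from scipy.stats import norm

def posterior_tau(log_t, alpha, beta, prior_var=1.0):
    """Conjugate normal posterior of tau on one item set, given a
    N(0, prior_var) prior on speed and known item parameters."""
    w = alpha ** 2
    prec = np.sum(w) + 1.0 / prior_var
    mean = np.sum(w * (beta - log_t)) / prec
    return mean, 1.0 / prec

def pp_and_bf(log_t, alpha, beta, flagged, prior_var=1.0):
    """PP: posterior probability of being faster on flagged items.
    BF: Savage-Dickey Bayes factor for delta = tau_f - tau_n != 0."""
    f = np.asarray(flagged, dtype=bool)
    m_f, v_f = posterior_tau(log_t[f], alpha[f], beta[f], prior_var)
    m_n, v_n = posterior_tau(log_t[~f], alpha[~f], beta[~f], prior_var)
    m_d, v_d = m_f - m_n, v_f + v_n     # posterior of the speed gap
    pp = norm.sf(0.0, loc=m_d, scale=np.sqrt(v_d))  # P(gap > 0 | data)
    # prior of the gap under H1: difference of two independent priors
    bf10 = (norm.pdf(0.0, 0.0, np.sqrt(2.0 * prior_var))
            / norm.pdf(0.0, m_d, np.sqrt(v_d)))
    return pp, bf10
```

Because the three quantities rank examinees on closely related evidence, detection sets nested in the way reported above arise naturally once each method's decision threshold is fixed.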

Building on these findings, we designed a simulation study to further appraise the performance of the proposed methods. The results indicate that examinees with prior knowledge exhibit distinct response speeds on leaked versus normal items. The sensitivity of the three statistics varies, with BF the most responsive to speed differences and PP the least. Targeted simulation experiments assess, under diverse conditions, the impact of the magnitude of the speed difference induced by item preknowledge, the prevalence of such preknowledge among examinees, and the proportion of known items within the test. A comprehensive comparison reveals that (1) all three methods effectively control Type I error rates; (2) a medium speed difference (U = .35 to .50) allows high detection accuracy; (3) to be identified reliably, examinees must have prior knowledge of at least 20% of the items; and (4) since the item parameters are treated as known in this study, the prevalence of preknowledge in the population is expected to have minimal impact on the results. The newly developed statistics demonstrate robust performance in detecting response speed differences during the examination process.
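The simulation design can be sketched along the same lines: generate lognormal response times, shift an examinee's speed upward on the leaked items, and tabulate rejection rates over many replications. The sketch below reuses slr_statistic from the first code block; the parameter ranges, the 20% leaked-item share, and the shift of 0.4 are illustrative values, not the paper's exact conditions.

```python
import numpy as np

rng = np.random.default_rng(7)
n_items, n_reps, u = 60, 2000, 0.4        # u: speed shift on leaked items
alpha = rng.uniform(1.5, 2.5, n_items)    # known time discriminations
beta = rng.uniform(3.5, 4.5, n_items)     # known time intensities
flagged = np.zeros(n_items, dtype=bool)
flagged[: int(0.2 * n_items)] = True      # 20% of items are leaked

def simulate_z(shift):
    """One replication: draw an examinee, speed them up by `shift`
    on the leaked items, and return the SLR-type z statistic."""
    tau = rng.normal(0.0, 1.0)
    speed = tau + shift * flagged
    log_t = rng.normal(beta - speed, 1.0 / alpha)
    return slr_statistic(log_t, alpha, beta, flagged)  # defined above

z_null = np.array([simulate_z(0.0) for _ in range(n_reps)])
z_alt = np.array([simulate_z(u) for _ in range(n_reps)])
crit = 1.645                              # one-sided 5% critical value
print("Type I error:", np.mean(z_null > crit))
print("power at u = 0.4:", np.mean(z_alt > crit))
```

Under this setup the empirical Type I error stays near the nominal 5% level, and power grows with both the size of the shift and the share of leaked items, mirroring the pattern reported in the simulation study.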

Key words

response time / speed / posterior probability / difference detection / Bayes factor

Cite this article

Xin Yunxi, Qin Chunying, Dong Shenghong, et al. Speed Difference Detection Based on Response Time Data[J]. Journal of Psychological Science, 2026, 49(2): 473-484. https://doi.org/10.16719/j.cnki.1671-6981.20260219
