A key step in the effective application of cognitive diagnosis models (cognitive diagnosis model, CDM) is to check the fit between the model and the test items. Although item fit statistics from item response theory (IRT) have been extended to CDMs, their performance in the CDM context still lacks systematic comparison. Through simulation experiments, this study compared the Type I error rates and statistical power of χ2, G2, S-χ2, z(r), z(l), and Stone-Q1. The results showed that, considering Type I error rate and power together, z(r) and z(l) performed best when the generating model was the ACDM; when the generating model was DINA or DINO, z(r) performed best in high-quality tests, whereas χ2 and G2 performed better in low-quality tests. Finally, an empirical data analysis further examined the practical performance of these item fit methods.
Abstract
The goal of cognitive diagnosis models (CDMs) is to classify examinees into latent classes with different attribute patterns, providing diagnostic information about whether a student has mastered a set of skills or attributes. Compared with unidimensional item response theory (IRT) models, CDMs provide a more detailed assessment of students' strengths and weaknesses. Although CDMs were originally developed in the field of educational assessment, they are now also used to evaluate other types of constructs, such as psychological disorders and context-based abilities. As with any model-based assessment, a key step in implementing a CDM is to check model-data fit, that is, the consistency between model predictions and observed data. Only when the model fits the data can the estimated model parameters be interpreted reliably. Item fit evaluates the fit between each item and the model, which helps identify aberrant items; deleting or modifying such items improves the model-data fit of the entire test.
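As one concrete illustration (not necessarily the exact variant examined in this study), Orlando and Thissen's (2000) S-χ2 statistic for item j compares observed and model-expected proportions correct within raw summed-score groups:

\[ S\text{-}\chi^2_j = \sum_{k=1}^{n-1} N_k \frac{(O_{jk} - E_{jk})^2}{E_{jk}(1 - E_{jk})} \]

where k indexes the summed-score groups of an n-item test, N_k is the number of examinees in group k, O_{jk} is the observed proportion correct on item j in group k, and E_{jk} is the model-predicted proportion; large values signal item misfit.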
At present, several item fit statistics commonly used in IRT have been extended to CDMs. However, no study has systematically compared the performance of these item fit indices in the CDM context. In this study, we compared the performance of χ2, G2, S-χ2, z(r), z(l), and Stone-Q1 in CDMs, investigating the Type I error rate and power of these statistics through a simulation study. The manipulated factors included sample size (N = 500, 1000), generating model (DINA, DINO, and ACDM), fitting model (DINA, DINO, and ACDM), test length (30 and 60 items), test quality (high and low), and significance level (.01 and .05). The test measured five attributes. For the high-quality and low-quality tests, the guessing and slip parameters of the three generating models were randomly drawn from the uniform distributions U(.05, .15) and U(.15, .25), respectively.
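To make the data-generating step concrete, the following is a minimal Python/NumPy sketch for the DINA high-quality condition. Only the five attributes, test length, sample size, and guess/slip distributions follow the design described above; the Q-matrix construction and the uniform attribute distribution are illustrative assumptions, and all names are hypothetical rather than taken from the study's code.

import numpy as np

rng = np.random.default_rng(seed=0)
K, J, N = 5, 30, 500  # attributes, items, examinees

# Q-matrix: each item measures 1-3 randomly chosen attributes (an assumption).
Q = np.zeros((J, K), dtype=int)
for j in range(J):
    measured = rng.choice(K, size=rng.integers(1, 4), replace=False)
    Q[j, measured] = 1

# Attribute patterns drawn uniformly over the 2^K profiles (an assumption).
alpha = rng.integers(0, 2, size=(N, K))

# High-quality condition: guessing and slip parameters ~ U(.05, .15);
# the low-quality condition would instead use U(.15, .25).
g = rng.uniform(0.05, 0.15, size=J)
s = rng.uniform(0.05, 0.15, size=J)

# DINA item response function: eta_ij = 1 iff examinee i masters every
# attribute required by item j; P(X_ij = 1) = 1 - s_j if eta_ij = 1, else g_j.
eta = (alpha @ Q.T == Q.sum(axis=1)).astype(int)  # N x J mastery indicator
p = np.where(eta == 1, 1 - s, g)                  # response probabilities
X = rng.binomial(1, p)                            # simulated 0/1 responses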
The simulation results showed that, in terms of Type I error, z(r) and z(l) performed best under all conditions. In terms of power, when the generating model was the ACDM, z(r) and z(l) had the highest average power under all conditions. When the generating model was DINA or DINO, χ2 and G2 had higher power in the low-quality tests, whereas z(r) had the highest power in the high-quality tests. In short, considering Type I error and power together, z(r) and z(l) performed best when the data fit the ACDM; when the data fit the DINA or DINO model, χ2 and G2 performed best in low-quality tests, whereas z(r) performed best among all methods in high-quality tests.
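For orientation, a generic Pearson-type χ2 item fit statistic of the kind compared above groups examinees by latent attribute class and contrasts observed with model-predicted proportions correct; the formulation below is one common version, not necessarily the exact χ2 variant used in this study:

\[ \chi^2_j = \sum_{l=1}^{2^K} N_l \frac{(O_{jl} - E_{jl})^2}{E_{jl}(1 - E_{jl})} \]

where l indexes the 2^K attribute classes, N_l is the number of examinees classified into class l, O_{jl} is the observed proportion correct on item j in class l, and E_{jl} is the model-implied probability. G2 replaces the Pearson discrepancy with a likelihood-ratio discrepancy over the same groups.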
This study only investigated conditions with five attributes, whereas real tests may measure more; future research should therefore examine the influence of the number of attributes. Lastly, person fit assessment is also an important step in cognitive diagnostic testing, as it can help identify aberrant responses from individual students. More studies on person fit in cognitive diagnosis models are needed.
Key words
CDM /
item fit /
Type I error rate /
power
Funding
*This research was supported by the Guizhou Provincial Science and Technology Program (黔科合基础-ZK[2021]一般123), the Guizhou Province University Humanities and Social Sciences Research Project (2020QN018), and the Guizhou Normal University 2019 Doctoral Research Start-up Project (GZNUD[2019] No. 27).