Two-Step Differential Item Functioning Detection Procedures without A Priori Information

doi:10.16719/j.cnki.1671-6981.20240328

Abstract

Differential item functioning (DIF) analysis plays a crucial role in determining the fairness and validity of educational assessments. Most traditional DIF analysis techniques rely on predefined anchor items that are assumed to be DIF-free. If these anchor items themselves exhibit DIF, the results may be biased or even misleading. To address this issue, this article proposes a two-step DIF detection procedure inspired by the DIF detection method without prior information proposed by Yuan et al. (2021).
The difficulty-difference quantile-quantile (D-QQ) plot is a scatterplot with the observed between-group differences in item difficulty on the vertical axis and, on the horizontal axis, the theoretical differences obtained by Monte Carlo methods under the null hypothesis that the test contains no DIF items. If the test contains no DIF items, the observed differences in item difficulty between the reference and target groups should match those from the Monte Carlo simulation, and the points of the D-QQ plot will fall on a straight line. In the first step, the items that fall on this line are selected as anchor items. In the second step, these anchor items are combined with a traditional DIF method for the actual analysis. The two-step procedure proposed in this article therefore covers a family of methods: combining the anchor items selected in the first step with the Mantel-Haenszel (MH) method gives the two-step MH method, and combining them with the Wald method gives the two-step Wald method.
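As a rough illustration of how the coordinates of a D-QQ plot could be computed, the Python sketch below sorts the observed difficulty differences and pairs them with Monte Carlo averages of sorted null draws. It is a simplified sketch, not the procedure of Yuan et al. (2021): the function name `dqq_points`, the inputs `diff_obs` and `se_diff`, and the choice to draw null differences from independent normal distributions with the items' standard errors are all assumptions made for illustration, and the differences are assumed to be already expressed on a comparable scale.

```python
import numpy as np

def dqq_points(diff_obs, se_diff, n_rep=2000, seed=0):
    """Return (theoretical, observed_sorted, order) for a D-QQ style plot.

    diff_obs : observed difficulty differences (target minus reference), one per item
    se_diff  : standard errors of those differences (hypothetical input)
    Assumption: under the null hypothesis each difference is N(0, se_j^2).
    """
    rng = np.random.default_rng(seed)
    diff_obs = np.asarray(diff_obs, dtype=float)
    se_diff = np.asarray(se_diff, dtype=float)

    # Monte Carlo draws under the null: one row per replication, one column per item.
    null_draws = rng.normal(0.0, se_diff, size=(n_rep, diff_obs.size))

    # Average the order statistics across replications to get theoretical quantiles.
    theoretical = np.sort(null_draws, axis=1).mean(axis=0)

    # Sort the observed differences, remembering which item each point belongs to.
    order = np.argsort(diff_obs)
    observed_sorted = diff_obs[order]
    return theoretical, observed_sorted, order
```

Plotting `observed_sorted` against `theoretical` gives one point per item; items whose points lie close to the line would be the candidates for anchor items, while points that depart from it suggest DIF.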
Two empirical studies applied these DIF detection methods to data from the 2012 Program for International Student Assessment (PISA) mathematics test and from a language proficiency test for first-year middle school students in a region of China, demonstrating the practical application of the two-step DIF detection procedures.
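To make the second step more concrete, the Python sketch below computes the standard Mantel-Haenszel common odds ratio and the MH D-DIF index for one studied item, matching examinees on their total score over the anchor items only. The function name `mh_d_dif`, the 0/1 group coding, and the decision to match on the anchor score alone are assumptions for illustration; the article's exact implementation of the two-step MH method may differ.

```python
import numpy as np

def mh_d_dif(responses, group, item, anchor_items):
    """Mantel-Haenszel odds ratio and MH D-DIF for one item.

    responses    : (n_persons, n_items) array of 0/1 scores
    group        : length n_persons array, 0 = reference, 1 = focal (target)
    item         : column index of the studied item
    anchor_items : column indices of the anchor items (assumed DIF-free)
    """
    responses = np.asarray(responses)
    group = np.asarray(group)

    # Matching variable: total score on the anchor items only (an assumption here).
    score = responses[:, anchor_items].sum(axis=1)

    num = 0.0
    den = 0.0
    for s in np.unique(score):
        in_stratum = score == s
        y = responses[in_stratum, item]
        g = group[in_stratum]
        a = np.sum((g == 0) & (y == 1))  # reference, correct
        b = np.sum((g == 0) & (y == 0))  # reference, incorrect
        c = np.sum((g == 1) & (y == 1))  # focal, correct
        d = np.sum((g == 1) & (y == 0))  # focal, incorrect
        n = a + b + c + d
        num += a * d / n
        den += b * c / n

    if num == 0 or den == 0:
        # The odds ratio is undefined or degenerate for this item.
        return float("nan"), float("nan")

    alpha = num / den
    # ETS delta scale: negative values indicate the item favors the reference group.
    return alpha, -2.35 * np.log(alpha)
```

An odds ratio near 1 (MH D-DIF near 0) suggests no DIF on the studied item once examinees are matched on the anchor score.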
A Monte Carlo simulation study was conducted to compare five DIF detection methods: the two-step MH method, the two-step Wald method, the original MH and Wald methods, and the Graphical DIF Detection Method based on the Relative Change of Item Difficulty Difference (RCD). The empirical Type I error rate and statistical power of these methods were compared under various combinations of the number of examinees, the number of test items, the pattern of DIF occurrence, and DIF size. In addition, the impact of the number of anchor items on the two-step MH method, the two-step Wald method, and the RCD method was explored by simulation.
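The two evaluation criteria named above can be computed from simulation output in a few lines. The Python sketch below assumes a hypothetical boolean matrix `flagged` (one row per replication, one column per item) recording which items a method flagged, and a boolean vector `is_dif` marking the items simulated with DIF; both names and the averaging over items are illustrative assumptions rather than the article's exact definitions.

```python
import numpy as np

def error_rate_and_power(flagged, is_dif):
    """Average empirical Type I error rate and power across replications.

    flagged : (n_rep, n_items) boolean array, True where an item was flagged as DIF
    is_dif  : length n_items boolean array, True for items simulated with DIF
    """
    flagged = np.asarray(flagged, dtype=bool)
    is_dif = np.asarray(is_dif, dtype=bool)

    # Type I error: proportion of flags among items that truly have no DIF.
    type1 = flagged[:, ~is_dif].mean() if (~is_dif).any() else float("nan")

    # Power: proportion of flags among items that truly have DIF.
    power = flagged[:, is_dif].mean() if is_dif.any() else float("nan")
    return float(type1), float(power)
```

A method is usually judged by whether its Type I error rate stays near the nominal level while its power is as high as possible.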
The simulation results indicated that sample size and DIF magnitude had little effect on the average empirical Type I error rate of the DIF detection methods but markedly affected statistical power: the larger the sample size and the larger the DIF, the higher the average statistical power. The study also showed that when the test length was between 20 and 40 items, selecting the middle four items on the x = y line of the D-QQ plot as anchor items gave desirable results. The two-step DIF procedures performed best with respect to empirical Type I error rate and statistical power, even when half of the items exhibited DIF. The RCD method performed well under most conditions, though its Type I error rates were slightly inflated when the test contained DIF items favoring both the reference and target groups (the balanced condition). In contrast, the MH and Wald methods with purification were ineffective in detecting DIF items when 10 of the 20 items favored one group. However, when the sample size of each group was less than 2000 under the balanced condition, the statistical power of the two-step methods could be slightly lower than that of the original MH and Wald methods. The empirical studies confirmed the feasibility of the new approach for detecting DIF in real data and demonstrated that the D-QQ plot allows a visual evaluation of the plausibility of each method.
The two-step DIF detection procedures proposed in this article offer a more effective and reliable solution for ensuring the fairness of educational tests. The proposed methods, based on the idea of the DIF detection method without prior information, allow for the identification of anchor items by visual inspection, instead of relying on predefined a priori information. The results of both simulation and empirical studies demonstrate the effectiveness of these approaches and their superiority over existing methods. The statistical rationale underlying the proposed approaches is sound and has the potential for broader applications.

Key words: differential item functioning, D-QQ plot, graphical test, two-step DIF detection procedure

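Abstract (Chinese version): Traditional DIF detection methods rely on prior information to specify anchor items, and misspecified anchor items can produce misleading results. This study proposes a two-step DIF detection procedure that selects anchor items with the data-driven difficulty-difference QQ plot (D-QQ plot) and then tests for DIF by combining those anchors with traditional methods. Two empirical studies illustrate the applicability of the new procedure to fairness testing of real tests and its visualization advantages. A simulation study further shows that when half of the test items exhibit DIF, the two-step procedure combines high statistical power with a low Type I error rate if the DIF items favor only one group; if the DIF items favor both groups, it controls the Type I error rate better than the RCD method.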

Key words (Chinese version): differential item functioning, D-QQ plot graphical test, two-step DIF detection procedure

Han Yuting, Yuan Kehai, Liu Hongyun. Two-Step Differential Item Functioning Detection Procedures without A Priori Information[J]. Journal of Psychological Science, 2024, 47(3): 734-743.

韩雨婷, 袁克海, 刘红云. 无需先验信息的两步项目功能差异检验方法[J]. 心理科学, 2024, 47(3): 734-743.

References

[1] 曹亦薇. (2003). 项目功能差异在跨文化人格问卷分析中的应用. 心理学报, 35(1), 120-126.
[2] 关丹丹, 乔辉, 陈康, 韩奕帆. (2019). 全国高考英语试题的城乡项目功能差异分析. 心理学探新, 39(1), 64-69.
[3] 郭聪颖, 边玉芳. (2013). 题组项目功能差异(DIF)检验方法的应用探索. 心理学探新, 33(5), 423-429.
[4] 林岳卿, 方积乾. (2011). 多维IRT与单维IRT在多维量表中应用的差异. 中国卫生统计, 28(3), 226-228.
[5] 刘文, 边玉芳, 陈玲丽, 马文超. (2010). 马洛-克罗恩社会赞许性量表在跨文化研究中的项目功能差异检验. 心理科学, 33(6), 1473-1476.
[6] 骆方, 张厚粲. (2006). 检验项目功能差异的两类方法——CFA和IRT的比较. 心理学探新, 26(1), 74-78.
[7] 漆书青, 戴海崎, 丁树良. (2002). 现代教育与心理测量学原理. 高等教育出版社.
[8] 魏丹, 张丹慧, 刘红云. (2020). 基于多维题组反应模型的项目功能差异检验探究. 心理科学, 43(1), 206-214.
[9] 余跃, 杜文久, 周娟, 秦菊香. (2016). LP方法及其与三种常用DIF检测方法的比较. 心理科学, 39(3), 720-726.
[10] 张龙, 涂冬波. (2015). 多级计分题项目功能差异常用检测方法及比较. 江西师范大学学报(自然科学版), 39(5), 441-448.
[11] 郑蝉金, 郭聪颖, 边玉芳. (2011). 变通的题组项目功能差异检验方法在篇章阅读测验中的应用. 心理学报, 43(7), 830-835.
[12] American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
[13] Barnett, V., & Lewis, T. (1994). Outliers in statistical data. Wiley.
[14] Bechger, T. M., & Maris, G. (2015). A statistical test for differential item pair functioning. Psychometrika, 80(2), 317-340.
[15] Bond, T. G., & Fox, C. M. (2013). Applying the Rasch model: Fundamental measurement in the human sciences. Psychology Press.
[16] Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31(2), 144-152.
[17] Cai, L. (2017). flexMIRT® Version 3.51: Flexible multilevel multidimensional item analysis and test scoring. Vector Psychometric Group.
[18] Candell, G. L., & Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12(3), 253-260.
[19] Cao M. Y., Tay L., & Liu Y. W. (2017). A Monte Carlo study of an iterative Wald test procedure for DIF analysis. Educational and Psychological Measurement, 77(1), 104-118.
[20] Clauser B., Mazor K., & Hambleton R. K. (1993). The effects of purification of matching criterion on the identification of DIF using the Mantel-Haenszel procedure. Applied Measurement in Education, 6(4), 269-279.
[21] Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31-44.
[22] DeMars, C. E. (2011). An analytic comparison of effect sizes for differential item functioning. Applied Measurement in Education, 24(3), 189-209.
[23] Fidalgo A. M., Mellenbergh G. J., & Muñiz J. (2000). Effects of amount of DIF, test length, and purification type on robustness and power of Mantel-Haenszel procedures. Methods of Psychological Research, 5(3), 43-53.
[24] Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29(4), 278-295.
[25] Fischer, G. H., & Molenaar, I. W. (1995). Rasch models: Foundations, recent developments, and applications. Springer.
[26] French, B. F., & Maller, S. J. (2007). Iterative purification and effect size use with logistic regression for differential item functioning detection. Educational and Psychological Measurement, 67(3), 373-393.
[27] Frick H., Strobl C., & Zeileis A. (2015). Rasch mixture models for DIF detection: A comparison of old and new score specifications. Educational and Psychological Measurement, 75(2), 208-234.
[28] Halpern, D. F. (2000). Sex differences in cognitive abilities. Lawrence Erlbaum Associates Publishers.
[29] Hansen M., Cai L., Monroe S., & Li Z. (2016). Limited-information goodness-of-fit testing of diagnostic classification item response models. British Journal of Mathematical and Statistical Psychology, 69(3), 225-252.
[30] Holland, P. W., & Thayer, D. T. (1986). Differential item performance and the Mantel-Haenszel statistic. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
[31] Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Lawrence Erlbaum Associates.
[32] Hyde, J. S., & Linn, M. C. (1988). Gender differences in verbal ability: A meta-analysis. Psychological Bulletin, 104(1), 53-69.
[33] Jöreskog, K. G., & Goldberger, A. S. (1975). Estimation of a model with multiple indicators and multiple causes of a single latent variable. Journal of the American Statistical Association, 70(351), 631-639.
[34] Kopf J., Zeileis A., & Strobl C. (2015a). A framework for anchor methods and an iterative forward approach for DIF detection. Applied Psychological Measurement, 39(2), 83-103.
[35] Kopf J., Zeileis A., & Strobl C. (2015b). Anchor selection strategies for DIF analysis: Review, assessment, and new approaches. Educational and Psychological Measurement, 75(1), 22-56.
[36] Lord, F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum Associates.
[37] Magis D., Béland S., Tuerlinckx F., & De Boeck P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847-862.
[38] Magis, D., & De Boeck, P. (2012). A robust outlier approach to prevent type I error inflation in differential item functioning. Educational and Psychological Measurement, 72(2), 291-311.
[39] Magis, D., & Facon, B. (2013). Item purification does not always improve DIF detection: A counterexample with Angoff's delta plot. Educational and Psychological Measurement, 73(2), 293-311.
[40] May, H. (2006). A multilevel Bayesian item response theory method for scaling socioeconomic status in international studies of education. Journal of Educational and Behavioral Statistics, 31(1), 63-79.
[41] Muthén, B. (1985). A method for studying the homogeneity of test items with respect to other relevant variables. Journal of Educational Statistics, 10(2), 121-132.
[42] Navas-Ara, M. J., & Gómez-Benito, J. (2002). Effects of ability scale purification on the identification of DIF. European Journal of Psychological Assessment, 18(1), 9-15.
[43] OECD. (2014). PISA 2012 Technical Report. OECD Publishing.
[44] Roussos L. A., Schnipke D. L., & Pashley P. J. (1999). A generalized formula for the Mantel-Haenszel differential item functioning parameter. Journal of Educational and Behavioral Statistics, 24(3), 293-322.
[45] Shaywitz B. A., Shaywitz S. E., Pugh K. R., Constable R. T., Skudlarski P., Fulbright R. K., & Gore J. C. (1995). Sex differences in the functional organization of the brain for language. Nature, 373(6515), 607-609.
[46] Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159-194.
[47] Shih, C. L., & Wang, W. C. (2009). Differential item functioning detection using the multiple indicators, multiple causes method with a pure short anchor. Applied Psychological Measurement, 33(3), 184-199.
[48] Sinharay S., Dorans N. J., Grant M. C., Blew E. O., & Knorr C. M. (2006). Using past data to enhance small-sample DIF estimation: A Bayesian approach. ETS Research Report, 2006(1), i-38.
[49] Soares T. M., Gonçalves F. B., & Gamerman D. (2009). An integrated Bayesian model for DIF analysis. Journal of Educational and Behavioral Statistics, 34(3), 348-377.
[50] Steiger, J. H., & Lind, J. C. (1980). Statistically based tests for the number of common factors. Paper presented at the Annual Meeting of the Psychometric Society, Iowa City, IA.
[51] Tay L., Meade A. W., & Cao M. Y. (2015). An overview and practical guide to IRT measurement equivalence analysis. Organizational Research Methods, 18(1), 3-46.
[52] Thissen D., Steinberg L., & Gerrard M. (1986). Beyond group-mean differences: The concept of item bias. Psychological Bulletin, 99(1), 118-128.
[53] Thissen D., Steinberg L., & Wainer H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Lawrence Erlbaum Associates.
[54] Tutz, G., & Schauberger, G. (2015). A penalty approach to differential item functioning in Rasch models. Psychometrika, 80(1), 21-43.
[55] Wang W. C., Shih C. L., & Yang C. C. (2009). The MIMIC method with scale purification for detecting differential item functioning. Educational and Psychological Measurement, 69(5), 713-731.
[56] Wang, W. C., & Su, Y. H. (2004). Effects of average signed area between two item characteristic curves and test purification procedures on the DIF detection via the Mantel-Haenszel method. Applied Measurement in Education, 17(2), 113-144.
[57] Woods C. M., Cai L., & Wang M. (2013). The Langer-improved Wald test for DIF testing with multiple groups: Evaluation and comparison to two-group IRT. Educational and Psychological Measurement, 73(3), 532-547.
[58] Xu J., Paek I., & Xia Y. (2017). Investigating the behaviors of M2 and RMSEA2 in fitting a unidimensional model to multidimensional data. Applied Psychological Measurement, 41(8), 632-644.
[59] Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125-145.
[60] Yuan K. H., Liu H. Y., & Han Y. T. (2021). Differential item functioning analysis without a priori information on anchor items: QQ plots and graphical test. Psychometrika, 86(2), 345-377.
[61] Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337-347). Routledge.
[62] Zwick, R., & Thayer, D. T. (2002). Application of an empirical Bayes enhancement of Mantel-Haenszel differential item functioning analysis to a computerized adaptive test. Applied Psychological Measurement, 26(1), 57-76.
[63] Zwick R., Thayer D. T., & Lewis C. (2000). Using loss functions for DIF detection: An empirical Bayes approach. Journal of Educational and Behavioral Statistics, 25(2), 225-247.