Journal of Psychological Science ›› 2024, Vol. 47 ›› Issue (3): 734-743. DOI: 10.16719/j.cnki.1671-6981.20240328

• Statistics, Measurement, and Methods •

Two-Step Differential Item Functioning Detection Procedures without A Priori Information

Han Yuting1, Yuan Kehai2,3, Liu Hongyun4,5   

  1 School of Psychology, Beijing Language and Culture University, Beijing, 100083;
  2 School of Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023;
  3 Department of Psychology, University of Notre Dame, IN, 46556, USA;
  4 Faculty of Psychology, Beijing Normal University, Beijing, 100875;
  5 Beijing Key Laboratory of Applied Experimental Psychology, National Demonstration Center for Experimental Psychology Education (Beijing Normal University), Beijing, 100875
  • Online: 2024-05-20 Published: 2024-05-15

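Abstract (translated from Chinese): Traditional differential item functioning (DIF) detection methods rely on prior information to specify anchor items, and misspecified anchor items can produce misleading results. This study proposes a two-step DIF detection procedure that first selects anchor items with a data-driven difficulty-difference QQ plot (D-QQ plot) and then tests for DIF with traditional methods. Two empirical studies illustrate the applicability of the new method to fairness evaluation in real tests and its advantages in visualization. A simulation study further shows that, when half of the test items exhibit DIF, the two-step procedure combines high statistical power with a low Type I error rate if the DIF items all favor one group; if the DIF items favor each of the two groups, it controls the Type I error rate better than the RCD method.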

Abstract: Differential item functioning (DIF) analysis plays a crucial role in determining the fairness and validity of educational assessments. Most traditional DIF analysis techniques rely on pre-defined anchor items that are assumed to be DIF-free. Unfortunately, if these anchor items themselves exhibit DIF, the results may be biased or even misleading. To address this issue, this article proposes a novel two-step DIF detection process inspired by the DIF detection method without prior information proposed by Yuan et al. (2021).
The difficulty-difference quantile-quantile (D-QQ) plot is a scatterplot that shows, on the vertical axis, the observed between-group differences in item difficulty and, on the horizontal axis, the theoretical differences obtained by Monte Carlo methods under the null hypothesis that the test contains no DIF items. If the test contains no DIF items, the observed differences in item difficulty between the reference and target groups should match those from the Monte Carlo simulation, and the points of the D-QQ plot will fall on a line. In the first step, the items that fall on this line are selected as anchor items. In the second step, these anchor items are used with a traditional DIF method for the actual analysis. The two-step procedure therefore covers a family of methods: combining the anchor items selected in the first step with the Mantel-Haenszel (MH) method gives the two-step MH method, and combining them with the Wald method gives the two-step Wald method.
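The construction of the D-QQ plot coordinates can be illustrated with a minimal sketch. It is not the authors' implementation: the crude logit-of-proportion-correct difficulty estimate, the median centering of the differences, the Rasch-type null simulation with standard normal abilities, and all function names (`logit_difficulty`, `simulate_rasch`, `centered_differences`, `dqq_points`) are assumptions made here for illustration only.

```python
import math
import random
import statistics


def logit_difficulty(responses):
    """Crude per-item difficulty: negative logit of the proportion correct (assumption)."""
    n = len(responses)
    out = []
    for j in range(len(responses[0])):
        p = sum(row[j] for row in responses) / n
        p = min(max(p, 0.5 / n), 1 - 0.5 / n)  # keep the logit finite
        out.append(-math.log(p / (1 - p)))
    return out


def simulate_rasch(difficulties, n_examinees, rng):
    """Simulate 0/1 responses from a Rasch-type model with N(0, 1) abilities (assumption)."""
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)
        data.append([1 if rng.random() < 1 / (1 + math.exp(-(theta - b))) else 0
                     for b in difficulties])
    return data


def centered_differences(ref, foc):
    """Target-minus-reference difficulty differences, centered at their median."""
    d = [f - r for r, f in zip(logit_difficulty(ref), logit_difficulty(foc))]
    med = statistics.median(d)
    return [x - med for x in d]


def dqq_points(ref, foc, n_rep=200, seed=0):
    """Return (item index, theoretical difference, observed difference) triples.

    Theoretical values are average order statistics of the differences in data
    simulated under the null hypothesis (both groups share one set of difficulties).
    """
    rng = random.Random(seed)
    observed = centered_differences(ref, foc)
    order = sorted(range(len(observed)), key=lambda j: observed[j])
    pooled = [(r + f) / 2 for r, f in zip(logit_difficulty(ref), logit_difficulty(foc))]
    sums = [0.0] * len(pooled)
    for _ in range(n_rep):
        sim_ref = simulate_rasch(pooled, len(ref), rng)
        sim_foc = simulate_rasch(pooled, len(foc), rng)
        for k, v in enumerate(sorted(centered_differences(sim_ref, sim_foc))):
            sums[k] += v
    theoretical = [s / n_rep for s in sums]
    return [(j, theoretical[k], observed[j]) for k, j in enumerate(order)]
```

Plotting the second element of each triple against the third gives a QQ-style display; under this sketch's assumptions, items without DIF should lie close to a straight line, while items with DIF should depart from it at the ends.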
Two empirical studies applied these DIF detection methods to data from the 2012 Programme for International Student Assessment (PISA) mathematics test and from a language proficiency test for first-year middle school students in a region of China, demonstrating the practical application of the two-step DIF detection procedures.
A Monte Carlo simulation study compared five DIF detection methods: the two-step MH method, the two-step Wald method, the original MH and Wald methods, and the Graphical DIF Detection Method based on the Relative Change of Item Difficulty Difference (RCD). The empirical Type I error rate and statistical power of these methods were compared under various combinations of the number of examinees, the number of test items, the pattern of DIF occurrence, and DIF size. In addition, the simulation explored how the number of anchor items affects the two-step MH method, the two-step Wald method, and the RCD method.
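As an illustration of the two evaluation criteria, the sketch below computes an empirical Type I error rate (the share of DIF-free items that are flagged) and empirical power (the share of true DIF items that are flagged), pooled over replications. The function name `error_rates` and the input layout are hypothetical; the study's exact aggregation across conditions is not specified here.

```python
def error_rates(flags_by_rep, true_dif):
    """Pool flags over replications.

    flags_by_rep: one list of booleans per replication (True = item flagged as DIF).
    true_dif:     list of booleans marking which items truly have DIF.
    Returns (empirical Type I error rate, empirical power).
    """
    false_pos = hits = n_null = n_dif = 0
    for flags in flags_by_rep:
        for flagged, is_dif in zip(flags, true_dif):
            if is_dif:
                n_dif += 1
                hits += bool(flagged)
            else:
                n_null += 1
                false_pos += bool(flagged)
    type1 = false_pos / n_null if n_null else float("nan")
    power = hits / n_dif if n_dif else float("nan")
    return type1, power
```

For example, with two DIF items and two DIF-free items, flags of `[True, False, False, False]` and `[True, True, True, False]` in two replications give a Type I error rate of 0.25 and power of 0.75.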
The simulation results indicated that sample size and DIF magnitude had little effect on the average empirical Type I error rate of the DIF detection methods but significantly affected statistical power: the larger the sample size and the DIF magnitude, the higher the average power. When the test length was between 20 and 40 items, selecting the middle four items on the x = y line of the D-QQ plot as anchor items gave desirable results. The two-step DIF procedures performed best with respect to empirical Type I error rate and statistical power, even when half of the items exhibited DIF. The RCD method performed well under most conditions, although its Type I error rates were slightly inflated when the test contained DIF items favoring both the reference and target groups (the balanced condition). The MH and Wald methods with purification were ineffective in detecting DIF items when 10 of the 20 items favored one group. However, under the balanced condition with fewer than 2000 examinees per group, the statistical power of the two-step methods might be slightly lower than that of the original MH and Wald methods. The empirical studies confirmed the feasibility of the new approach for detecting DIF in real data and demonstrated that the D-QQ plot allows a visual evaluation of the plausibility of each method.
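The "middle four items on the x = y line" rule can be sketched as follows. In the article, anchor items are identified by visual inspection of the D-QQ plot; the numeric tolerance `tol`, the function name `middle_anchor_items`, and the `(item, theoretical, observed)` tuple layout are assumptions introduced here to make the idea concrete.

```python
def middle_anchor_items(points, n_anchor=4, tol=0.2):
    """Pick the middle `n_anchor` items among those lying close to the x = y line.

    points: iterable of (item index, theoretical difference, observed difference).
    tol:    hypothetical cutoff on |observed - theoretical| standing in for
            visual inspection of the plot.
    """
    ordered = sorted(points, key=lambda p: p[1])  # order along the horizontal axis
    on_line = [item for item, theo, obs in ordered if abs(obs - theo) <= tol]
    if len(on_line) <= n_anchor:
        return on_line
    start = (len(on_line) - n_anchor) // 2
    return on_line[start:start + n_anchor]
```

Taking the middle of the on-line run, rather than its ends, keeps the anchors away from the tails of the plot, where items with DIF are most likely to sit.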
The two-step DIF detection procedures proposed in this article offer a more effective and reliable way to ensure the fairness of educational tests. Building on the idea of DIF detection without prior information, the proposed methods identify anchor items by visual inspection instead of relying on a priori information. The results of both the simulation and the empirical studies demonstrate the effectiveness of these approaches and their superiority over existing methods. The statistical rationale underlying the proposed approaches is sound and has the potential for broader applications.

Key words: differential item functioning, D-QQ plot, graphical test, two-step DIF detection procedure