Multiple hypothesis testing in DIFdetect
An issue that comes up with DIFdetect is that of multiple testing. When many hypotheses are tested, there is a good chance that at least some of the null hypotheses will be rejected incorrectly.
We offer here some comments regarding this issue with respect to DIF detection and we provide tools for looking at this issue with respect to non-uniform DIF.
With respect to uniform DIF detection, the 10% change-in-coefficient criterion explained earlier is somewhat arbitrary. The change in the coefficient when the group variable is included indicates how strongly that group affects the relationship between ability and the responses to the item in question. Because this criterion measures the size of an effect rather than testing a hypothesis, uniform DIF detection is not prone to problems with multiple testing.
Non-uniform DIF detection, on the other hand, does use a more traditional hypothesis-testing approach. We see DIF detection as a hypothesis-generating rather than a hypothesis-confirming activity. That is, the results from DIFdetect identify items that should receive careful scrutiny by content experts. In our view, multiple hypothesis testing may therefore not be a significant limitation.
DIFdetect does provide a means for assessing the importance of multiple testing.
The decision on whether to adjust for testing multiple hypotheses comes down to an economic question about the costs of flagging items as having DIF. In educational testing settings, large item banks are the rule, so a low threshold for flagging DIF is acceptable. This is compatible with making no adjustment for multiple comparisons.
In medical settings, on the other hand, often the only items available are those that are already in wide use, and tests are often not altered for years. Finding many biased test items thus challenges the findings of numerous published studies, so the threshold for identifying items with DIF should be much higher. This is compatible with substantial adjustments for multiple comparisons.
In our own work, we have tried a number of approaches to this question. Please refer to Gerald van Belle's June 2002 monthly rule for Statistical Rules of Thumb, accessible through http://www.seattlecrc.org/vbelle/monthlyrule.htm, for a more complete discussion of this issue, as well as a discussion of graphical approaches from Schweder and Spjotvoll, 'Plots of P-values to evaluate many tests simultaneously,' Biometrika, 69(3) (1982), 493-502.
When we have compared several different techniques for adjusting for multiple comparisons (e.g., Bonferroni correction, Sidak correction, and Holm correction), we have found almost no differences in which items are flagged as having or not having DIF. Thus, there is no reason not to use Bonferroni, which is computationally the easiest.
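To illustrate how the three corrections named above can agree, here is a minimal Python sketch. It is not part of DIFdetect; the function names and the p-values are hypothetical and chosen only for illustration. Each function returns, for each item, whether its non-uniform DIF p-value would be flagged at a family-wise alpha of 0.05.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Flag p-values at or below alpha / m (Bonferroni)."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


def sidak_flags(p_values, alpha=0.05):
    """Flag p-values at or below 1 - (1 - alpha)^(1/m) (Sidak)."""
    m = len(p_values)
    threshold = 1.0 - (1.0 - alpha) ** (1.0 / m)
    return [p <= threshold for p in p_values]


def holm_flags(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value (k = 0, 1, ...)
    with alpha / (m - k) and stop at the first one that fails."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    flags = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            flags[idx] = True
        else:
            break  # all larger p-values are not flagged either
    return flags


# Hypothetical p-values for a 10-item test (illustration only, not real data).
example_p = [0.0004, 0.0030, 0.0120, 0.0410, 0.2500,
             0.6100, 0.0009, 0.0800, 0.4700, 0.9300]

if __name__ == "__main__":
    for name, fn in [("Bonferroni", bonferroni_flags),
                     ("Sidak", sidak_flags),
                     ("Holm", holm_flags)]:
        flagged = [i + 1 for i, f in enumerate(fn(example_p)) if f]
        print(name, "flags items", flagged)
```

For these made-up values all three procedures flag the same three items (1, 2, and 7). That is only an illustration of how close the thresholds are for a test of this size, not evidence about any real data set, but it is in line with preferring Bonferroni as the simplest of the three.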
To apply the Bonferroni correction with DIFdetect, take the usual alpha value (0.05) and divide it by the number of items in the test. The result should then be used as the alpha level for non-uniform DIF identification in DIFdetect.
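The arithmetic is simple enough to do by hand, but a short Python sketch makes it explicit. The helper name `bonferroni_alpha` and the 20-item test are assumptions made for illustration; the helper is not a DIFdetect command, and the resulting value still has to be entered as the alpha level in DIFdetect itself.

```python
def bonferroni_alpha(n_items, alpha=0.05):
    """Return the per-item alpha level: alpha divided by the number of items."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return alpha / n_items


if __name__ == "__main__":
    # Hypothetical 20-item test: 0.05 / 20
    print(f"{bonferroni_alpha(20):.4f}")  # prints 0.0025
```

So for a test with 20 items, one would use 0.0025 rather than 0.05 as the alpha level for non-uniform DIF.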