Obtaining a common scale for IRT item parameters using separate versus concurrent estimation in the common item nonequivalent groups equating design

Bradley A. Hanson, ACT, Inc.
Anton A. Béguin, University of Twente

Paper presented at the Annual Meeting of the National Council on Measurement in Education (Montreal, April, 1999)

A revised version of this paper is available as ACT Research Report 99-8

Abstract: This paper used simulation to study the performance of separate versus concurrent IRT item parameter estimation in a common item equating design for two forms of a test with 60 dichotomous items. Four factors were considered: 1) program (MULTILOG versus BILOG-MG), 2) sample size per form (3000 versus 1000), 3) number of common items (20 versus 10), and 4) equivalent versus nonequivalent groups taking the two forms (no mean difference versus a mean difference of 1 standard deviation). In addition, four methods of item parameter scaling were used in the separate estimation condition: two item characteristic curve methods (Stocking-Lord and Haebara), and two moment methods (Mean/Mean and Mean/Sigma). Expected results were obtained for the effects of sample size (less error for the larger sample size), number of common items (less error with more common items), and group differences (less error with equivalent groups than with nonequivalent groups). In the case of nonequivalent groups, concurrent estimation tended to produce less error than separate estimation using BILOG-MG, but the opposite occurred using MULTILOG. The item characteristic curve methods of parameter scaling produced substantially lower error than the moment methods. BILOG-MG and MULTILOG tended to perform similarly, except for the interaction noted above for the concurrent estimates in the case of nonequivalent groups. Conclusions from a limited simulation study such as this need to be drawn very cautiously, but the magnitude of the differences in error between the characteristic curve and moment methods of item parameter scaling suggests a preference for the characteristic curve methods, which is consistent with previous research. The results also indicate that if common items are available with randomly equivalent groups, it is beneficial to perform an item parameter scaling (using a characteristic curve method) even though it is not strictly needed. Although concurrent estimation resulted in less error than separate estimation more often than not, it is concluded that the results of this study, together with other research performed on this topic, are not sufficient to recommend that concurrent estimation always be preferred to separate estimation.
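
As a rough illustration of the two moment methods named in the abstract, the sketch below computes the slope A and intercept B of the linear scale transformation from common-item estimates. It follows the usual textbook definitions of Mean/Sigma and Mean/Mean and is not necessarily the exact computation used in the paper. The function names, argument names, and the convention that the "new" form is placed on the "base" form's scale are assumptions made for illustration.

```python
from statistics import mean, pstdev


def mean_sigma(b_new, b_base):
    """Mean/Sigma: slope and intercept from common-item difficulties only."""
    A = pstdev(b_base) / pstdev(b_new)
    B = mean(b_base) - A * mean(b_new)
    return A, B


def mean_mean(a_new, b_new, a_base, b_base):
    """Mean/Mean: slope from mean discriminations, intercept from mean difficulties."""
    A = mean(a_new) / mean(a_base)
    B = mean(b_base) - A * mean(b_new)
    return A, B


def rescale(a, b, A, B):
    """Place new-form estimates on the base scale: a* = a / A, b* = A * b + B."""
    return [ai / A for ai in a], [A * bi + B for bi in b]
```

Both methods use only the first two moments of the common-item estimates, whereas the characteristic curve methods (Stocking-Lord and Haebara) choose A and B by minimizing differences between item or test characteristic curves, which requires numerical optimization and is not sketched here.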


