Grading Techniques/Inter Rater Reliability

Calculating inter-rater reliability (IRR) provides an estimate of the degree of agreement between different graders using the same rubric. A well-designed, objective rubric should result in a high IRR (approaching 1), whereas a poorly designed, ambiguous one will result in a low IRR (approaching 0 or −1, depending on the method used).
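To make the scale concrete, the short Python sketch below (not from the original page; the rubric scores are invented) compares two graders whose scores agree on every assignment with a pair whose scores are unrelated, using Cohen's kappa from scikit-learn as one common IRR statistic.

```python
# A minimal sketch, assuming hypothetical rubric scores (levels 0-4) for ten assignments.
from sklearn.metrics import cohen_kappa_score

grader_a = [4, 3, 3, 2, 4, 1, 0, 2, 3, 4]
grader_b = [4, 3, 3, 2, 4, 1, 0, 2, 3, 4]   # agrees with grader_a on every assignment
grader_c = [0, 4, 1, 3, 0, 2, 4, 1, 0, 2]   # bears no relation to grader_a's scores

print(cohen_kappa_score(grader_a, grader_b))  # 1.0: perfect agreement
print(cohen_kappa_score(grader_a, grader_c))  # negative: at or below chance-level agreement
```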

All graders who will be grading assignments with your rubric should take part in the IRR assessment; if they do not, the IRR estimate will not reflect everyone whose interpretations determine the final grades. As a result, the rubric may not be adequately evaluated before it is used to grade the assignments of a whole class.

There are various techniques for computing IRR estimates, and the best one to use depends on the situation[6]. When you obtain data from three or more graders, it is generally best to use an extension of Scott's Pi statistic[7] or to compute the arithmetic mean of kappa[8] (a statistic used in IRR analysis[9]). There are no cast-iron guidelines for an acceptable level of agreement, but popular benchmarks for high agreement using kappa are 0.75[10] and 0.8[11]. Hallgren[6] provides a detailed overview of these procedures.
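The sketch below shows one way these two multi-grader approaches could be computed; it is illustrative only, with made-up scores. It uses Fleiss' kappa (commonly identified as the multi-rater extension of Scott's Pi) from statsmodels and the arithmetic mean of pairwise Cohen's kappas from scikit-learn; your own data layout and tooling may differ.

```python
# A sketch with hypothetical data: rows = assignments, columns = graders,
# cell values = rubric levels (0-4).
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

scores = np.array([
    [4, 4, 3],
    [2, 2, 2],
    [3, 3, 4],
    [1, 1, 1],
    [0, 1, 0],
    [4, 4, 4],
    [2, 3, 2],
    [3, 3, 3],
])

# Fleiss' kappa expects a subjects-by-categories table of counts,
# which aggregate_raters builds from the raw scores.
table, _ = aggregate_raters(scores)
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))

# Arithmetic mean of Cohen's kappa over every pair of graders.
pairwise = [
    cohen_kappa_score(scores[:, i], scores[:, j])
    for i, j in combinations(range(scores.shape[1]), 2)
]
print("Mean pairwise kappa:", np.mean(pairwise))
```

Either value can then be compared against the benchmarks cited above (0.75 or 0.8) to judge whether the rubric yields acceptably consistent grading.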