Grading Techniques/Inter Rater Reliability

Calculating inter-rater reliability (IRR) provides an estimate of the degree of agreement between different graders using the same rubric. A well-designed, objective rubric should result in a high IRR (approaching 1), whereas a poorly designed, ambiguous one will result in a low IRR (approaching 0 or −1, depending on the method used).
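To make the scale concrete, the short Python sketch below (not from the original page; the rubric scores are invented) compares two graders whose scores agree on every assignment with a pair whose scores are unrelated, using Cohen's kappa from scikit-learn as one common IRR statistic.

```python
# A minimal sketch, assuming hypothetical rubric scores (levels 0-4) for ten assignments.
from sklearn.metrics import cohen_kappa_score

grader_a = [4, 3, 3, 2, 4, 1, 0, 2, 3, 4]
grader_b = [4, 3, 3, 2, 4, 1, 0, 2, 3, 4]   # agrees with grader_a on every assignment
grader_c = [0, 4, 1, 3, 0, 2, 4, 1, 0, 2]   # bears no relation to grader_a's scores

print(cohen_kappa_score(grader_a, grader_b))  # 1.0: perfect agreement
print(cohen_kappa_score(grader_a, grader_c))  # negative: at or below chance-level agreement
```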

All graders who will be grading assignments with your rubric should take part in the IRR assessment; if they do not, the IRR estimate will not reflect everyone whose interpretations determine the final grades. As a result, the rubric may not be adequately evaluated before it is used to grade the assignments of a whole class.

There are various techniques for computing IRR estimates, and the best one to use depends on the situation[6]. When you obtain data from three or more graders, it is generally best to use an extension of Scott's Pi statistic[7] or to compute the arithmetic mean of kappa[8] (a statistic used in IRR analysis[9]). There are no cast-iron guidelines for an acceptable level of agreement, but popular benchmarks for high agreement using kappa are 0.75[10] and 0.8[11]. Hallgren[6] provides a detailed overview of these procedures.
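The sketch below shows one way these two multi-grader approaches could be computed; it is illustrative only, with made-up scores. It uses Fleiss' kappa (commonly identified as the multi-rater extension of Scott's Pi) from statsmodels and the arithmetic mean of pairwise Cohen's kappas from scikit-learn; your own data layout and tooling may differ.

```python
# A sketch with hypothetical data: rows = assignments, columns = graders,
# cell values = rubric levels (0-4).
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

scores = np.array([
    [4, 4, 3],
    [2, 2, 2],
    [3, 3, 4],
    [1, 1, 1],
    [0, 1, 0],
    [4, 4, 4],
    [2, 3, 2],
    [3, 3, 3],
])

# Fleiss' kappa expects a subjects-by-categories table of counts,
# which aggregate_raters builds from the raw scores.
table, _ = aggregate_raters(scores)
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))

# Arithmetic mean of Cohen's kappa over every pair of graders.
pairwise = [
    cohen_kappa_score(scores[:, i], scores[:, j])
    for i, j in combinations(range(scores.shape[1]), 2)
]
print("Mean pairwise kappa:", np.mean(pairwise))
```

Either value can then be compared against the benchmarks cited above (0.75 or 0.8) to judge whether the rubric yields acceptably consistent grading.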