There are two common ways to interpret and calculate reliability in the comparative judgement literature.

## Internal consistency

Internal consistency is generally referred to as **Scale Separation Reliability** (*SSR*) in the comparative judgement literature. *SSR* is interpreted similarly to Cronbach's alpha.

- *SSR* > .7 indicates acceptable internal consistency.
- *SSR* > .8 indicates good internal consistency.
- *SSR* > .9 indicates very good internal consistency.

We can use the nmmBTm package in R to calculate *SSR*.
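To make the quantity concrete, here is a minimal sketch in Python of the usual separation-reliability arithmetic: the "true" variance is taken to be the observed variance of the estimated scores minus the mean squared standard error, and *SSR* is the ratio of true to observed variance. The scores and standard errors below are hypothetical, standing in for the output of a fitted Bradley-Terry model.

```python
import statistics

def ssr(scores, standard_errors):
    """Scale Separation Reliability (sketch).

    Observed variance of the score estimates, minus the mean squared
    standard error, expressed as a proportion of the observed variance.
    """
    observed_var = statistics.pvariance(scores)
    mean_sq_error = statistics.mean(se ** 2 for se in standard_errors)
    return (observed_var - mean_sq_error) / observed_var

# Hypothetical score estimates and standard errors from a model fit
scores = [-1.2, -0.4, 0.1, 0.6, 1.3]
errors = [0.3, 0.25, 0.25, 0.3, 0.35]
print(round(ssr(scores, errors), 3))
```

Note the dependence on model-based standard errors: the measure is only as trustworthy as the fitted model, which is one reason it is considered opaque.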

Researchers should be aware that *SSR* is a contentious and arguably opaque measure, and we recommend reporting it alongside inter-rater reliability.

## Inter-rater reliability

Inter-rater reliability, *r*, is based on calculating the **Pearson product-moment correlation coefficient between two sets of scores** generated from the judgements of two **independent groups of experts**. It provides a more robust and transparent measure of reliability than *SSR*, but the downside is that you need to double the number of judgements and judges in order to calculate it.

In practice we calculate *r* using a **split-halves method**. The expert judges are randomly assigned to two groups, and artefact scores are calculated from the judgements of each group. The correlation coefficient between the two sets of scores is then calculated. We repeat this process, typically around 100 times, and take the median correlation coefficient as our estimate of inter-rater reliability.

We can use the nmmBTm package in R to estimate inter-rater reliability.