To illustrate the statistics discussed in my previous post, here are a couple of fully worked examples. To show the calculations, real data is used and the actual scores for each rider in a class are shown, although neither the riders nor their horses are named. In future blogs, individual rider scores will not be shown, only the summary statistics for the judges. Note also that judges’ actual names will not be used anywhere in this blog. Judges will simply be identified by their judging position (C, M, B, etc.), their level (A, B, C, etc.) and any other characteristic that may relate to their level of competence (e.g. JE [Judge Educator], JM [Judge Mentor], etc.).
For each of these worked examples there is a brief commentary to help readers understand the results. This commentary will not be included in future blogs: only the statistics will be provided, and any conclusions that readers draw from them are theirs alone.
Example 1: An Elementary 3.2 Class at Clarendon, NSW
This example sets out the statistical results for a class where only two judges are involved.
The statistical analysis of these results would indicate a reasonably high level of alignment between the two judges.
The Score Range for each judge (5.972 and 5.883) is similar.
The Average Score Difference is 2.897 percentage points. In this example, the Average Scale Factor is also 2.897. Since the two figures are equal, the differences all run in the same direction: the Judge at C was consistently marking somewhat higher than the Judge at B.
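The arithmetic behind these three statistics can be sketched in a few lines of Python. This is only an illustration: the scores are made up (they are not the Clarendon results), and I am assuming that the Average Score Difference is the mean of the absolute per-horse differences while the Average Scale Factor is the mean of the signed differences (C minus B). The function names are my own.

```python
# Hypothetical percentage scores for five horses (NOT the Clarendon data).
judge_c = [68.0, 65.5, 70.0, 63.0, 66.5]
judge_b = [65.0, 63.0, 66.5, 61.0, 63.5]


def score_range(scores):
    """Highest score minus lowest score awarded by one judge."""
    return max(scores) - min(scores)


def signed_differences(a, b):
    """Per-horse difference a - b, keeping the sign."""
    return [x - y for x, y in zip(a, b)]


def average_score_difference(a, b):
    """Mean of the absolute per-horse differences (assumed definition)."""
    diffs = signed_differences(a, b)
    return sum(abs(d) for d in diffs) / len(diffs)


def average_scale_factor(a, b):
    """Mean of the signed per-horse differences (assumed definition)."""
    diffs = signed_differences(a, b)
    return sum(diffs) / len(diffs)


print("Score Range C:", score_range(judge_c))
print("Score Range B:", score_range(judge_b))
print("Average Score Difference:", average_score_difference(judge_c, judge_b))
print("Average Scale Factor:", average_scale_factor(judge_c, judge_b))
```

Under these assumed definitions the two averages coincide only when one judge is at or above the other for every horse. As soon as the judges swap places on even one horse, the signed average falls below the absolute average.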
A Score Correlation of .930 and a Ranking Correlation of .919 would imply quite a strong alignment between the two judges in both scoring and ranking.
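For readers who want to reproduce correlations like these, here is a minimal sketch. I am assuming that the Score Correlation is the Pearson coefficient of the two judges' scores and that the Ranking Correlation is the Spearman coefficient (the Pearson coefficient of their placings); the data and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists.

    Raises ZeroDivisionError if either list has no variation.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(scores):
    """Placings: 1 = highest score; tied scores share the average placing."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of horses tied with the horse at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the placings."""
    return pearson(ranks(x), ranks(y))


# Hypothetical percentage scores for five horses (not real results).
judge_c = [68.0, 65.5, 70.0, 63.0, 66.5]
judge_b = [65.0, 64.0, 66.5, 61.0, 63.5]

print("Score Correlation:", round(pearson(judge_c, judge_b), 3))
print("Ranking Correlation:", round(spearman(judge_c, judge_b), 3))
```

The ranking version ignores how far apart the scores are and looks only at the order of the horses, which is why the two coefficients can differ for the same class.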
The Score Difference Graph shows that the largest score difference between the judges was 4.444 percentage points, with the smallest difference being 1.528 percentage points.
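The data behind a graph of this kind is simply the per-horse differences sorted from largest to smallest. A rough sketch, using made-up scores and a hypothetical helper name, and assuming the graph plots absolute differences:

```python
def ranked_differences(horses, a, b):
    """Return (horse, absolute difference) pairs, largest difference first."""
    pairs = [(h, abs(x - y)) for h, x, y in zip(horses, a, b)]
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# Hypothetical percentage scores for five horses (not real results).
horses = ["Horse1", "Horse2", "Horse3", "Horse4", "Horse5"]
judge_c = [68.0, 65.5, 70.0, 63.0, 66.5]
judge_b = [65.0, 64.0, 66.5, 61.0, 63.5]

ranked = ranked_differences(horses, judge_c, judge_b)
for horse, diff in ranked:
    print(f"{horse}: {diff:.3f}")

largest, smallest = ranked[0], ranked[-1]
```

The first and last entries of the sorted list are the largest and smallest differences quoted in the commentary.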
Example 2: An Advanced 5.1 Class at Clarendon, NSW
This is another example of a class where only two judges are involved.
The statistical analysis of these results would indicate clear differences between the two judges.
The Score Ranges differ markedly: the Judge at C used a range of 13.940 percentage points, while the Judge at B used a much smaller range of 3.637.
The Average Score Difference is 4.697, which is fairly large. The Average Scale Factor is -4.048, which shows that, on average, the Judge at C tended to mark somewhat lower than the Judge at B.
A Score Correlation of .515 and a Ranking Correlation of .468 suggest a somewhat weak alignment between the two judges in both scoring and ranking.
The Score Difference Graph shows that the largest score difference between the judges was 10.455 percentage points, with differences above 6 percentage points for two more horses before falling to the smallest difference of 1.818 percentage points.
Example 3: The Grand Prix Freestyle at a Nationals Championship, Sydney, Australia
This is an example of a class where there were 5 judges officiating. Although this number of judges generates a fairly large amount of data, it is still “just manageable” to look at the individual pairings of judges when analysing judging outcomes. Given the volume of data being analysed, this will not be a fully worked example, but the matrices showing the relationship between each pair of judges will be shown.
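To give an idea of how a pairwise matrix of this kind can be built, here is a small sketch. The scores, the helper names and the choice of statistic (mean absolute difference) are assumptions for illustration only. Any symmetric two-judge statistic could be passed in the same way.

```python
from itertools import combinations

# Hypothetical percentage scores for three horses from five judges
# (not the real results of the class).
scores = {
    "E": [70.0, 66.0, 72.0],
    "H": [69.0, 67.0, 71.0],
    "C": [71.0, 65.0, 73.0],
    "M": [70.0, 66.0, 70.0],
    "B": [68.0, 66.0, 72.0],
}


def mean_abs_diff(a, b):
    """Mean of the absolute per-horse differences between two judges."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pairwise_matrix(scores, stat):
    """Apply a symmetric two-judge statistic to every pair of judges."""
    matrix = {judge: {} for judge in scores}
    for a, b in combinations(scores, 2):
        value = stat(scores[a], scores[b])
        matrix[a][b] = value
        matrix[b][a] = value  # symmetric, so fill both halves
    return matrix


matrix = pairwise_matrix(scores, mean_abs_diff)
for judge, row in matrix.items():
    print(judge, {other: round(v, 3) for other, v in row.items()})
```

With 5 judges there are 10 distinct pairs, which is why the matrices are still readable at this size.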
This analysis shows a reasonably good alignment between the judges who officiated at this class.
All judges had a fairly similar Score Range although the judge at M had a somewhat smaller range (6.625 percentage points) than the other judges.
The Score Correlations were also reasonably good, with every pair of judges having a coefficient above .800.
The Ranking Correlations were fairly good too. Although the rankings of the judges at H and M were very highly correlated (.904), the rankings of the judge at H were not so well correlated with those of the judge at C (.523) or the judge at B (.609).
The Scale Factor calculated for each pair of judges was quite small, indicating that all judges were essentially marking to the same scale. The average Score Differences were also fairly low, with only the difference between the judge at C and the judge at M (2.159 percentage points) indicating some degree of difference in their overall scoring.
Even the Maximum Score Differences (showing the maximum difference between each pair of judges in relation to any horse in the class) were reasonably good. The largest difference (4.125 percentage points) was between the judges at E and B (the horse concerned was Horse4), while there was also a 4.00 percentage point difference between the judge at H and the judge at B (the horse concerned was Horse10).
The Score Differences Graph clearly shows all the differences (ranked from the largest to the smallest) for every pair of judges over each of the horses in the class.
Example 4: The Grand Prix Class at the Rio Olympics, Brazil
Given that there were 59 horses and riders in the Grand Prix class at Rio, this example will not be fully worked either, since the amount of raw data to be displayed would be very large. There were also 7 judges officiating at this event, so the data has to be summarised further to keep the analysis compact. With only a few judges it is possible to set out the Scale Factors and Score Differences between each judge and every other judge individually; in this Rio example, the Scale Factors and Score Differences for each judge are instead compared with the average of all the other judges combined.
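The "each judge against the average of all the others" comparison can be sketched as follows. As before, the scores are invented, the names are my own, and I am assuming the Scale Factor is the mean signed difference and the Score Difference the mean absolute difference.

```python
def versus_others(scores):
    """Compare each judge with the average of all the other judges.

    Returns, per judge, the mean signed difference ("scale_factor") and the
    mean absolute difference ("avg_score_difference"), both assumed definitions.
    """
    judges = list(scores)
    n_horses = len(scores[judges[0]])
    summary = {}
    for judge in judges:
        others = [j for j in judges if j != judge]
        diffs = []
        for h in range(n_horses):
            others_avg = sum(scores[j][h] for j in others) / len(others)
            diffs.append(scores[judge][h] - others_avg)
        summary[judge] = {
            "scale_factor": sum(diffs) / n_horses,
            "avg_score_difference": sum(abs(d) for d in diffs) / n_horses,
        }
    return summary


# Hypothetical percentage scores for two horses from three judges.
scores = {
    "C": [70.0, 60.0],
    "M": [68.0, 62.0],
    "B": [66.0, 64.0],
}

summary = versus_others(scores)
for judge, stats in summary.items():
    print(judge, stats)
```

This gives one row per judge rather than one per pair, so 7 judges produce 7 rows instead of 21 pairings.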
By now the meaning of such statistics as Score Range, Scale Factor, Score Difference, Score Correlation and Ranking Correlation should be well understood.
This analysis shows, on all measures, a very high level of alignment between the judges who officiated at the Grand Prix class at the Rio Olympics.
The Score Ranges used by the judges were quite similar. The Scale Factors were quite small, indicating that all judges were, on average, using a similar scale of marks, and the average Score Difference between each judge and the other judges was also quite small (all average differences were less than 1.3 percentage points).
The Score and Ranking Correlations between every pair of judges were all above .90, indicating that the judges were quite strongly aligned with each other in their marking.
Finally, the Score Differences Graph shows that there were no particularly large mark differences between any judge and the average of all the other judges across all horses. In fact, for essentially half of the horses (29 out of 59) in the competition, no judge had more than a 1 percentage point difference from the average of all the other judges.
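A count like "29 out of 59" can be obtained by finding, for each horse, the largest gap between any one judge and the average of the other judges, and then counting the horses where that gap stays within the threshold. The sketch below uses invented scores and a hypothetical function name.

```python
def horses_within(scores, threshold=1.0):
    """Count horses where no judge is more than `threshold` percentage
    points away from the average of all the other judges."""
    judges = list(scores)
    n_horses = len(scores[judges[0]])
    count = 0
    for h in range(n_horses):
        marks = [scores[j][h] for j in judges]
        total = sum(marks)
        # For each judge, compare their mark with the mean of the others.
        worst = max(abs(m - (total - m) / (len(marks) - 1)) for m in marks)
        if worst <= threshold:
            count += 1
    return count


# Hypothetical percentage scores for three horses from three judges.
scores = {
    "C": [70.0, 70.0, 65.0],
    "M": [70.5, 66.0, 65.0],
    "B": [70.0, 68.0, 65.0],
}

print(horses_within(scores), "of 3 horses within 1 percentage point")
```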