The analysis of dressage judging results undertaken in this blog will simply encompass a small number of key statistics, as described below. A couple of fully worked examples will also be provided in my next blog to further clarify exactly how these statistics can be interpreted.
Note that the following explanation of correlation is presented in “reader friendly” terms. Anyone wanting a more rigorous understanding of the concept and methodology should read some of the more technical resources available on the internet.
Score Correlation: Essentially, the score correlation coefficient shows the strength of the association between two variables. In our case, the variables are the scores awarded to the riders in a class by each of the judges. A calculated correlation coefficient will always lie in the range of -1 to +1. A coefficient close to +1 indicates that there is a strong positive association between the two variables (sets of scores) being compared. A coefficient close to -1 indicates that there is a strong negative association between the variables, ie. in our case that would mean that the judges’ scores actually ran in the opposite direction to each other. A coefficient close to zero would imply little or no relationship at all between the two judges’ scores.
So in comparing the scores of two judges, what is a “good” correlation as against a “poor” correlation? There is no definitive answer to that question. A rule of thumb might be a scale similar to dressage scoring itself, where a coefficient of .90 to .99 is very good; .80 to .89 is good; .70 to .79 is fairly good, etc., and anything less than .50 is insufficient (or worse). However, I acknowledge that it might be argued that adopting such a scale involves some value judgement. Suffice it to say, then, that the higher the coefficient, the stronger the association; the lower the coefficient, the weaker the association. Incidentally, in my experience it is quite rare to find a negative correlation, although I have come across them from time to time in my analyses.
It should also be understood that a high correlation does not mean that the scores have to be identical. Clearly, if one judge’s scores were always exactly the same as the other judge’s, the correlation coefficient would be +1.
However, the Pearson correlation coefficients calculated in my analysis are actually indicating the strength of a linear relationship between the variables in the form of:
Judge1_score = α + β × Judge2_score
So if, for example, one judge always gave, say, 5 marks more than the other judge to every rider (ie. α = 5; β = 1) the correlation coefficient would still be +1. Similarly, if one judge gave every rider a score that was, say, 10% higher than the other judge (ie. α = 0; β = 1.1) again the correlation coefficient would still be +1. The same would be the case if one judge’s score was always 10% higher plus 5 (ie. α = 5; β = 1.1).
That is why simply “eyeballing” the score differences on the published results sheet for the class does not always give a good indication of how aligned the judges actually have been in their judging! Incidentally, one would normally expect there to be some variation in judges’ scores — stemming from the fact that there are often two or more judges scoring a rider from different viewpoints around the arena.
Ranking Correlation: The ranking correlation coefficient shows the strength of the relationship between the rankings of riders produced by two judges. The principles of calculating and interpreting this coefficient are exactly the same as for the score correlation coefficient, except that it is applied to rankings rather than scores.
[Note for technical readers: The Spearman rank correlation methodology is used to find the correlation between ranked (non-parametric) data. Where there are tied rankings, as is often the case with dressage scores, the Spearman correlation coefficient is calculated as the Pearson correlation coefficient with fractional rankings applied to the data. To explain this latter point, we are used to dressage data being ranked as “equals” eg. if we had two riders tied in second place, the rankings on the results sheet would be displayed as 1, 2, 2, 4, etc. That is the ranking that Excel would generate using its RANK.EQ function. Fractional ranking of the same data would take the form of 1, 2.5, 2.5, 4, etc. and is the ranking that Excel would generate using its RANK.AVG function.]
Score Range: Judges are encouraged to make full use of the range of scores, both when judging each of the movements within a test and in the scores they award to riders across the class. This ensures that there is adequate differentiation between the riders’ performances. In fact, this is a factor taken into consideration when assessing judge appointments and upgrades.
This blog will not report on “within test” mark ranges. However, it will report on the score range, that is, the difference between the highest and lowest scores awarded by each judge across a class. Clearly the range will depend upon the variation in rider performances within the class but, at the least, one might expect that for any class, multiple judges should probably have fairly similar score ranges.
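The statistic itself is simply the highest score minus the lowest, per judge. A minimal sketch, with invented judges and scores:

```python
# Score range per judge: highest score minus lowest across a class.
# Judge labels and scores are invented for illustration.
scores_by_judge = {
    "Judge at C": [71.2, 68.0, 64.5, 62.8, 60.1],
    "Judge at E": [70.5, 67.4, 66.0, 63.2, 62.9],
}
score_range = {judge: round(max(s) - min(s), 1)
               for judge, s in scores_by_judge.items()}
print(score_range)  # the judge at C has used a noticeably wider range
```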
Scale Factor: As pointed out in the discussion on score correlation, judges might give fairly well-aligned scores to riders while one judge is judging more “generously” than the other. So it is possible to end up with a situation similar to the score correlation example above, where the pattern of judges’ scores was very similar except that one judge gave 5 marks more to every rider. In other words, they were effectively using somewhat different marking scales. Presumably this is something that might need to be addressed by one or other of the judges, but it does not necessarily mean that the judges’ scores are unaligned.
I recall one Novice class where one judge consistently gave around 5 to 7 percentage points more than the other, so the results looked very odd indeed. However, those judges had a score correlation coefficient of .91 and a ranking correlation coefficient of .90, indicating that there was in fact a good “underlying” alignment.
There are a number of ways of estimating the extent of any Scale Factor. The statistic I generate simply sums the signed score differences between the two judges (positive where one judge has scored a rider higher than the other, negative where they have scored lower) and takes the average. A large value for this statistic indicates that, on average, one judge is scoring on a much higher or lower scale than the other.
Note that where there are more than two judges, this statistic is generated for each pair of judges and displayed in the form of a matrix.
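As a sketch of both the statistic and the pairwise matrix, the snippet below computes the Scale Factor for each pair of three invented judges, one of whom is scoring roughly 5 marks above another:

```python
# Scale Factor for each pair of judges: the average of the signed
# score differences. Judges and scores are invented for illustration.
from itertools import combinations

scores = {
    "C": [68.0, 65.0, 70.0, 63.0],
    "E": [73.0, 70.5, 74.5, 68.0],  # scoring roughly 5 higher than C
    "M": [67.5, 66.0, 69.5, 63.5],
}

def scale_factor(a, b):
    # average of the signed score differences
    return sum(x - y for x, y in zip(a, b)) / len(a)

matrix = {f"{p} vs {q}": scale_factor(scores[p], scores[q])
          for p, q in combinations(scores, 2)}
print(matrix)  # "C vs E" is -5.0: E is scoring about 5 marks higher
```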
Score Difference: Whereas the Scale Factor statistic averages the actual score differences (some of which are negative and some positive), the Score Difference statistic uses the absolute differences in scores, that is, it indicates the size of the score differences between the judges regardless of whether the scores being compared are higher or lower.
The statistic I generate simply sums the absolute score differences between the two judges and takes the average. A large value for this statistic indicates that there are at least some significant differences in the marks the judges are giving to riders.
Again, where there are more than two judges, this statistic is generated for each pair of judges and displayed in the form of a matrix.
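The contrast between the two statistics is worth seeing side by side. In the invented example below, the signed differences nearly cancel out (a small Scale Factor), yet the absolute differences reveal the true size of the disagreements:

```python
# Scale Factor (signed average) versus Score Difference (absolute
# average) for one pair of judges. Scores are invented.
judge_c = [68.0, 65.0, 70.0, 63.0]
judge_m = [67.5, 66.0, 69.5, 63.5]

mean_signed = sum(x - y for x, y in zip(judge_c, judge_m)) / len(judge_c)
mean_abs = sum(abs(x - y) for x, y in zip(judge_c, judge_m)) / len(judge_c)

print(mean_signed, mean_abs)  # -0.125 versus 0.625
```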
Maximum and Minimum Score Differences: The Score Difference statistic described above shows the average of the absolute score differences across an entire class. However, it is also interesting to see how judges have varied in their scoring of individual horses. The Maximum Score Difference shows the largest absolute score difference a pair of judges have had in relation to any horse in the class while the Minimum Score Difference shows the smallest absolute score difference in relation to any horse. Note that while the Maximum Score Difference will always be shown in the analysis, the Minimum Score Difference may or may not be shown.
Again, where there are more than two judges, this statistic is generated for each pair of judges and displayed in the form of a matrix, so it is possible to see how each judge has varied from every other individual judge.
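For one pair of judges, these two statistics are simply the largest and smallest of the per-horse absolute differences. A minimal sketch with invented scores for five horses:

```python
# Maximum and Minimum Score Differences for one pair of judges.
# Scores are invented for illustration.
judge_c = [68.0, 65.0, 70.0, 63.0, 71.5]
judge_e = [67.0, 66.5, 69.8, 65.5, 71.0]

abs_diffs = [round(abs(x - y), 1) for x, y in zip(judge_c, judge_e)]
print(max(abs_diffs), min(abs_diffs))  # 2.5 and 0.2
```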
Score Differences Graph: When looking at score differences it is interesting to see the size of the difference the judges have had for every individual horse in the class (using the absolute Score Difference data above), ranked from the largest to the smallest. Judges would presumably want their differences to be relatively small and, where they do have larger differences, for those to apply to only a small number of horses. The Score Differences Graph shows very clearly the pattern of score differences between the judges across all the horses in the class.
Where there are a small number of judges involved (ie. up to 5), the lines plotted on the graph will show the differences between each pair of judges across all the horses in a class, ordered from the largest to the smallest difference. However, if there are a large number of judges per class (eg. more than 5) the score differences plotted on the graph will show the difference between a particular judge’s scores and the average of all the other judges. This is simply done to reduce the number of data series needing to be plotted on the graph. Where this latter approach is used, it will be made clear in the analysis.
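The data behind the graph can be sketched as follows: for a small panel, one sorted series of absolute differences per pair of judges; for a large panel, each judge compared against the average of all the others. Again, the judges and scores below are invented:

```python
# Data series for the Score Differences Graph. For a small panel,
# one series per pair of judges; for a large panel, each judge
# versus the average of the others. All scores are invented.
from itertools import combinations

scores = {
    "C": [68.0, 65.0, 70.0, 63.0],
    "E": [67.0, 66.5, 69.8, 65.5],
    "M": [67.5, 66.0, 69.5, 63.5],
}

def diff_series(a, b):
    # absolute difference per horse, ordered largest to smallest
    return sorted((round(abs(x - y), 1) for x, y in zip(a, b)),
                  reverse=True)

pair_series = {f"{p}-{q}": diff_series(scores[p], scores[q])
               for p, q in combinations(scores, 2)}

def series_vs_others(judge):
    others = [s for j, s in scores.items() if j != judge]
    avg_others = [sum(col) / len(col) for col in zip(*others)]
    return diff_series(scores[judge], avg_others)

print(pair_series["C-E"])  # [2.5, 1.5, 1.0, 0.2]
```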
Correlation Coefficient Colour Codes: As discussed in the sections above dealing with score and ranking correlation coefficients, all that can be said is that the closer a coefficient is to -1 or +1, the stronger the relationship; as coefficients tend towards zero, the relationship becomes weaker. Especially when many judges are involved, and the coefficients between each pair of judges are therefore presented in the form of a matrix, I may use a set of “colour codes” to help readers interpret the coefficients more easily. Where used, this is done more to help readers differentiate between individual coefficients within the matrix than to make any value judgement about judging performance. The colour code used is: