I recently had a very interesting discussion with a European friend – a Grand Prix dressage rider, judge and someone who has worked closely with the FEI in Lausanne. Our conversation covered lots of interesting things but we also discussed a couple of things which touch on this blog: (i) whether judges should have any concerns about their judging being analysed and (ii) how much emphasis should be placed on whether judges are aligned or not in the scores they have awarded.
I might add that our little discussion group also included another European rider and a judge from Victoria, so a range of views were heard. In this blog I am not going to publicly air the views of my conversation partners, not because they might necessarily differ from mine, but simply because it was a private discussion. However, I think the two matters mentioned above are very interesting, so here are my views on them.
- Should judges object to statistical analysis?
I think this issue is pretty clear. Of course they should not object to such analysis. There are a number of reasons why:
- As was emphasised at the time this blog was created, the statistical analysis is simply another way of looking at data that is already in the public domain. At no stage is any “new data” created. Anyone with a calculator on their smartphone or access to spreadsheet software could do similar calculations themselves. This blog simply does those calculations for them and hopefully provides a more “accessible” overview of the judges’ performance.
- The marks that judges give to riders, at least in all official classes, are published for all to see, essentially for the rest of eternity (nothing disappears once it has been on the web). In Australia, those results are searchable and available on the Equestrian Australia website, for example. If the rider performances assessed by judges are a permanent record, then why not summary statistics that provide an overview of judge performances?
But really, the key factor is the opinion judges have of their own performance. If a judge is proud of their performance, believes they did a really good job and gave true “value for money” to riders paying hefty entry fees, that is all that matters. They need not feel concerned about their performances being scrutinised. If, on the other hand, they don’t want their performances scrutinised because at some level they are “ashamed” or “uncertain” of them, then maybe they should just give up judging.
As it is, this blog has never named judges, describing them only by their level and any other attributes such as Judge Educator, etc. Of course, any interested reader of the blog could easily go to the published results and identify the judges concerned, should they so wish. In fact, if judges feel comfortable with their performances, as hopefully they should, there really shouldn’t be any reason for this blog not to name them. However, there are no plans at this moment to do so. After all, those names would probably only be meaningful to Australian readers, not to my readers in Europe, the Americas and elsewhere. These latter readers are no doubt simply interested in the standard of judging in Australia.
- Is judge alignment important?
Now this really is the $64,000 question! Certainly, the statistical analysis in this blog is entirely focused on the alignment between the two or more judges involved in the judging of a class. It presents such values as the relative score ranges, average, maximum and minimum score differences, the distribution of those differences, score correlations and ranking correlations.
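To make the statistics listed above concrete, here is a minimal sketch of how they might be computed for two judges of one class. The scores are invented for illustration, and the calculations use only plain Python (a spreadsheet, as noted earlier, would do just as well):

```python
# Sketch: alignment statistics for two judges' scores in one class.
# All scores below are invented for illustration only.

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    # Pearson correlation of two equal-length score lists.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    # Rank 1 = highest score; tied scores share the average rank.
    order = sorted(range(len(xs)), key=lambda i: -xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

judge_c = [68.5, 64.2, 71.0, 62.8, 66.1]   # judge at C (invented)
judge_b = [66.0, 65.5, 69.2, 60.4, 67.3]   # judge at B (invented)

diffs = [c - b for c, b in zip(judge_c, judge_b)]
print("average difference:", round(mean(diffs), 2))
print("max abs difference:", max(abs(d) for d in diffs))
print("score correlation:", round(pearson(judge_c, judge_b), 3))
print("ranking correlation:", round(pearson(ranks(judge_c), ranks(judge_b)), 3))
```

The ranking correlation here is simply the Pearson correlation applied to the judges’ rank orders (i.e. a Spearman-style rank correlation), which is why two judges can correlate well on scores yet still place riders in a different order.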
However, as was emphasised at the time this blog was created, even if the statistics indicate clear differences between judges scoring a class, at no stage has it ever been inferred that a particular judge was right or wrong, or even that they all might be wrong (or perhaps even that they all were right?). Certainly, as the volume of statistics accumulates, it can well become clear which judge (or judges) appear to be more “consistently out of step” with their colleagues but again, does that mean that they are more likely to have “gotten it wrong”? After all, one man, Galileo, differed from the entire orthodoxy of the Catholic Church and risked execution, but was later confirmed to have been right!
So the question I am addressing in this blog post has nothing to do with right or wrong. The issue is really one of looking at the virtue, or lack thereof, of alignment.
I believe there are a number of reasons why two or more officials judging a class may come up with different marks. Clearly, one reason is the different view of a test that judges have at different positions around the arena. They do see different things. Anyone watching judge training videos (such as Janet Foy’s “On the Levels”) would be familiar with her statements such as “that would be a 6 from the front, a 7 from the side”. So individual movement marks will vary, but do these in a sense cancel out over the course of a test so that the overall marks are still fairly well aligned? It is not uncommon to hear judges account for differences in their marks by saying: “well of course, I was judging the test from C while you were at B”. However, if that really does introduce a systematic bias into judge scoring, should we then take this to the logical extreme and start to have doubts or reservations when two or more judges do get very similar scores for riders in a class?
I believe another reason for differences between judges (and one which I consider would be more likely to introduce a systematic component) is that there are so many elements in the way of going of horse and rider that need to be assessed, movement by movement. Faced with such an array of elements it would not be surprising for judges to give different “weightings” to those elements. Clearly, there are likely to be synergies or relationships between those elements (e.g. if a horse is neither supple over the back nor in a round frame, how can it get the “compression” it needs to attain collection?) but there is still plenty of scope for multiple judges, in their scores, to systematically give different emphasis to different elements.
This leads us to the nub of this discussion! So if multiple judges do have scores that are not aligned, due to their arena positions or because they apply different weightings or for any other reason, is that necessarily bad? Maybe the averaging out of those judges’ scores actually produces the fairest outcome for the riders! In which case, “non-alignment” could, in certain cases, be a virtue rather than a sin.
However, this raises very interesting questions for judge education and judge examination (i.e. for the appointment or promotion of judges). In judge education, shadow judging, examination for promotion and so forth, the overwhelming emphasis is on “alignment”. In Australia, for example, in a judge’s practical examination for promotion, the candidate judge effectively incurs “penalty points” for every score for every movement or collective that differs from that of the Judge Examiner by more than a certain amount. Too many penalty points and the candidate is failed! So clearly, as far as the Judge Administrators are concerned, non-alignment is definitely a sin! The orthodoxy must prevail!
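The penalty-point scheme described above can be sketched in a few lines. To be clear, the tolerance, pass threshold and marks here are all invented for illustration; the actual Equestrian Australia examination rules may well differ:

```python
# Hypothetical sketch of an alignment-based marking scheme for a
# practical judging exam. The tolerance, threshold and marks are
# invented; the real Equestrian Australia scheme may differ.

TOLERANCE = 0.5       # assumed allowed deviation per movement
MAX_PENALTIES = 5     # assumed pass/fail threshold

def penalty_points(candidate_marks, examiner_marks, tolerance=TOLERANCE):
    """One penalty point for every movement where the candidate's mark
    differs from the Judge Examiner's by more than the tolerance."""
    return sum(1 for c, e in zip(candidate_marks, examiner_marks)
               if abs(c - e) > tolerance)

examiner  = [7.0, 6.5, 7.5, 6.0, 8.0, 6.5]   # invented marks
candidate = [7.0, 7.5, 7.0, 5.0, 8.0, 6.0]   # invented marks

points = penalty_points(candidate, examiner)
print("penalty points:", points,
      "->", "pass" if points <= MAX_PENALTIES else "fail")
```

Whatever the exact numbers, the structure makes the point: the candidate is scored purely on closeness to one examiner’s marks, so the orthodoxy of that single examiner defines “correct”.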
However, this stance by the Judge Administrators leads on to its own worrying question. As mentioned above, to date, this blog has not named judges in the analysis, describing them only by judging level and any other characteristic, such as being a Judge Educator or Judge Mentor.
It is early days in the compilation of my statistics but already some interesting observations are being hinted at by the figures. Clearly more data needs to be collected to be more definitive but it is interesting that some very clear examples of non-alignment have occurred where Judge Educators and/or Judge Mentors have been judging together! So if you were, for example, an aspiring judge who is shadow judging with one or other of these judges, your success could vary considerably according to which judge happens to be mentoring you. Your choice of Judge Mentor itself could determine whether you pass or fail. Is that the way Judge Administrators really want the system to operate? Hopefully not!