Magnus Carlsen holds the record for the highest rating in history and has been World Chess Champion since 2013. Hans Niemann is a young grandmaster who has stirred controversy in recent years, especially in the last two weeks, after he beat Carlsen at the Sinquefield Cup and was accused of cheating. That episode made headlines in newspapers around the world. Since then there has been much speculation on the subject, some defending Niemann, others casting doubt on his integrity.
Among the many people who have opined on the matter are some of the world's best players, including Carlsen and Nakamura, and some of the world's leading experts on cheating, including Kenneth Regan, professor of computer science at the University at Buffalo and International Master, featured on the ChessBase website as "The world's greatest expert on cheating detection in Chess". However, the conclusion that Regan presents is objectively incorrect, as we will demonstrate below.
On 09/05/2022, the day after Niemann's victory against Carlsen at the Sinquefield Cup, I analyzed the game between them, focusing exclusively on the moves and on technical aspects related to Chess, and came to the same conclusion as Regan: there is no evidence of the use of engines; Niemann's play is indistinguishable from that of a human rated 2700. I left the matter there and considered it closed, understanding that Niemann was innocent. But after Carlsen's conduct in the online game on 9/19/2022, I thought I should take a deeper look, because Carlsen would not have acted like that if he were not feeling deeply outraged and wronged. At the same time, Niemann's conduct seemed inappropriate to me. If I were suspected of cheating in a similar situation and my opponent abandoned the game over it, the way Carlsen did, I would refuse the point and return it to him. But Niemann simply accepted Carlsen's point, without protest. This rekindled my doubts and I decided to investigate further.
For those interested in the technical analysis of the game, you can download it here (PGN, PDF).
The suspicions against Niemann can be divided into two groups:
1. Using engines.
2. Unauthorized access to Carlsen's training.
In the case of item 2, the accusations are more speculative, so I prefer to refrain from commenting.
Regarding item 1, there is some weak evidence and some strong evidence. Let's look at the strong evidence, chief among which is the difference between Niemann's results in tournaments with DGT boards and in tournaments without them.
A DGT board is an electronic board with a sensor-equipped surface and a computerized structure that recognizes the moves played and transmits them to a computer via USB (or serial port, among others). In recent years, a large part of the most important tournaments have used this type of board, due to the ease of transmitting the games in real time to the Internet and TV. At the same time, this type of board can give rise to doubts about the possibility of cheating.
How could one use a DGT board to cheat? That I cannot answer, but the fact that the moves are transmitted in real time to the Internet certainly favors the direct use of this information, without it having to be manually entered into some device. Elon Musk even commented on a hypothesis of how he thought the transmission of moves would be possible. It was unclear whether he was being ironic, but similar alternatives are certainly possible.
How the cheating scheme might be constructed does not matter much here. What we will examine is the point raised by Carlsen: that Niemann's results are much better in tournaments in which DGT boards were used.
The website https://www.chessdom.com/new-allegations-within-niemann-carlsen-case-hans-niemann-performs-much-better-with-live-dgt-boards/ provides some data on this, but it is possible that some of that data is incorrect. For example: the chart shows Niemann with a 2893 performance at the "USCF K12 Grade National" in 2019, but Niemann's name does not appear on the official website for that event (https://www.uschess.org/results/2019/k12/).
I have individually checked each of the results cited on this site, and also checked the FIDE ratings and the number of games in each event, in order to analyze the probability that the results in DGT events are not merely better by chance than the results in other events.
In all, Niemann played 20 events between March 2019 and November 2020, covering a total of 158 games. The table below summarizes this situation:
The first column indicates the name of the event, the second the number of games played in that event, and the third the performance rating Niemann obtained on the United States Chess Federation (USCF) scale. Then comes the "DGT" column, which indicates whether the event used a DGT board (1) or not (0). The last column gives Niemann's performance rating for each event, this time on the scale used by the International Chess Federation (FIDE). I could not find information on whether a DGT board was used in the Mumbai event, so I did not include that championship in the calculations.
Upon a quick look at this table, the differences already stand out. Niemann's typical performance rating in events with DGT boards is about 200 points higher than in events without them. But is this difference statistically significant? Could it be interpreted as an indication of fraud?
To investigate this hypothesis, we ran some tests for contrast between means, and the difference is statistically significant, but there are details that need to be considered. For example: the average performance with DGT is 2610 and without DGT is 2404. If you removed the 2893 result from the DGT group, the mean of this group would decrease, narrowing the difference, so the probability that the samples show a statistically significant difference should also decrease. If you removed the 2074 result from the other group, the mean of that group would increase, also narrowing the difference between the groups, and again the probability of a statistically significant difference should decrease. But this is not what happens. Before removing these results, Student's t-test indicates a 99.985% probability that the entity that played on the DGT boards does not have the same strength as the entity that played on the conventional boards; if you remove these two outliers from the groups, instead of decreasing, this probability increases to 99.990%. This exposes a distortion in Student's test.
This technical detail deserves a slightly deeper comment, just to be clear. Removing the value 2893 causes the group mean to decrease, narrowing the gap, but it also causes the dispersion among the remaining elements to decrease, making the relative distance (measured in standard deviations) wider; and the proportion in which this widening occurs is greater than the proportion in which the group mean decreases, because the standard deviation is determined by the sum of the squares of the differences (exponent 2), while the mean is determined by the values themselves (exponent 1). Therefore the insertion or removal of an outlier usually affects the dispersion more than the central tendency.
The problem in this case is that the presence of the element with value 2893 "stretched" the dispersion in the direction in which that element was, but did not necessarily produce a symmetric effect in the opposite direction. In fact, there is no reason to assume that it would produce a symmetric effect in the opposite direction. But the assumption of normality, which is adopted when applying these tests, implies symmetry of the distribution, so that the widening produced on one side should cause an equal widening on the opposite side, even if there is no outlier in that position to explain such an effect. Obviously this is a flaw in the theory, and this flaw cannot be lost sight of when examining the problem.
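To make this concrete, here is a minimal sketch in Python, using made-up numbers rather than the actual performance data: removing a high outlier from the stronger group narrows the gap between the means, yet Welch's t-test becomes more significant, because the outlier inflated that group's variance more than it raised its mean.

```python
# Minimal sketch with hypothetical ratings (not Niemann's actual data):
# removing the outliers narrows the gap between the group means, yet the
# p-value of Welch's t-test shrinks, i.e. significance INCREASES.
from scipy import stats

dgt    = [2893, 2650, 2600, 2580, 2560, 2540, 2500, 2480]   # high outlier: 2893
no_dgt = [2074, 2450, 2430, 2420, 2400, 2390, 2380, 2360]   # low outlier: 2074

with_outliers    = stats.ttest_ind(dgt, no_dgt, equal_var=False)
without_outliers = stats.ttest_ind(dgt[1:], no_dgt[1:], equal_var=False)

print(f"with outliers:    p = {with_outliers.pvalue:.5f}")
print(f"without outliers: p = {without_outliers.pvalue:.5f}")  # smaller, despite the narrower gap
```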
To better understand this inconsistency, let's look at an analogous error in which the effect is most evident. The Sharpe ratio is used to evaluate the quality of an investment based on risk-adjusted return. Suppose an investment fund has been producing an average annual return of 12%, with a benchmark of 2% and a volatility of 20%; its annual Sharpe ratio is then approximately 0.5. Suddenly this fund hits a good trade and produces a 200% profit in one month. Obviously this is good and should imply an increase in the Sharpe ratio. But this event also increases volatility, and the impact on the historical volatility is greater than on the historical return, for the reasons mentioned in the previous two paragraphs. So the 200% profit ends up reducing the Sharpe ratio, penalizing the fund's performance, when the correct thing to do would be to reward it.
To deal with this problem, the Sortino ratio was created, which measures upside and downside volatility separately. In the Sortino ratio, upside volatility does not penalize the result, which is more logical and more useful for a correct measure of risk-adjusted performance.
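The effect is easy to reproduce. The sketch below uses a deterministic, hypothetical series of monthly returns calibrated to roughly 12% annual return and 20% volatility against a 2% benchmark (none of this is real fund data); appending a single +200% month lowers the Sharpe ratio while raising the Sortino ratio.

```python
import numpy as np

# Hypothetical fund: 10 years of alternating monthly returns giving
# ~12% p.a. mean return and ~20% p.a. volatility (deterministic, for clarity)
monthly = np.tile([0.0677, -0.0477], 60)   # 120 months, mean ~1%/month
rf_m = 0.02 / 12                            # 2% p.a. benchmark, monthly

def sharpe(r):
    ex = r - rf_m
    return ex.mean() / ex.std(ddof=1) * np.sqrt(12)        # annualized

def sortino(r):
    ex = r - rf_m
    downside = np.sqrt((np.minimum(ex, 0.0) ** 2).mean())  # downside deviation only
    return ex.mean() / downside * np.sqrt(12)

boosted = np.append(monthly, 2.00)          # one +200% month

print(f"Sharpe : {sharpe(monthly):.2f} -> {sharpe(boosted):.2f}")    # ~0.50 -> ~0.45 (penalized)
print(f"Sortino: {sortino(monthly):.2f} -> {sortino(boosted):.2f}")  # ~0.83 -> ~2.5  (rewarded)
```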
There are several other flaws in the Sharpe ratio (and in the Sortino ratio), but it is not our purpose to discuss them here. The important point is that the error that occurs in Student's t when applied here is analogous to the error in the Sharpe ratio: a skew on one side of the distribution is assumed to imply a symmetric skew on the opposite side, without there being any element in that position to justify such an interpretation.
In any case, 99.985% and 99.990% are both very suggestive results, as are any percentages at that level, so even if the t-test is giving a slightly skewed result, it does not affect the Boolean interpretation of whether Niemann is innocent of the charges. The probability of fraud may not be exactly 99.990% or 99.985%, but it is arguably high, much higher than the probability that no fraud occurred.
To further investigate this disparity, we also used the Kolmogorov-Smirnov, Anderson-Darling and Tukey HSD tests.
The purpose of Student's t-test is to compare two samples of normally distributed data and estimate the probability that the observed differences are due to chance. For example: whether the average height of men differs from the average height of women, or whether the average air temperature in January differs from that in February. We know that some women are taller than some men, but the question is whether there is a difference in the average height of all women compared to the average height of all men, or whether the two distributions share the same parameters. For this, Student's t-test applies very well, because the populations are very large and the shapes of their distributions closely resemble a normal distribution within the range of -2 to +2 standard deviations. Thus we can see that although some women are taller than some men and vice versa, the average height of men is greater. Similarly, we can see that although some Brazilians are taller than some Argentines and vice versa, when the averages of the two groups are compared, there is no statistically significant difference.
But in the case we are analyzing, of chess performance ratings, the situation is more complex, because the samples are small and we do not have accurate information about the shapes of the distributions, which makes it more difficult to check whether the difference between the results in games on DGT boards and the results in games on conventional boards is statistically significant. Therefore it may be necessary to use other tools.
The Kolmogorov-Smirnov test, compared to Student's t-test, has the advantage of being more sensitive to the morphology of the distributions and less dependent on the assumption of normality; on the other hand, it is based on the maximum distance between the two empirical distribution curves at their point of greatest separation, rather than being a global measure that takes all points into account, as Chi-squared and RMS measures are.
The Anderson-Darling test has the advantage of being shape-sensitive, like the Kolmogorov-Smirnov, while weighting the presence of outliers in heavy tails better, making it one of the most appropriate for this specific purpose.
The HSD in Tukey's test stands for "Honestly Significant Difference", but in this specific case the name does not reflect reality very well, because Tukey's statistic is robust, whereas here we need precisely to measure the effect of the outliers.
When applying these tests for contrast between means, the results were as follows:
· Student's t: 99.985%
· Kolmogorov-Smirnov: 99.992%
· Anderson-Darling: 99.972%
· Tukey HSD: 99.981%
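For reference, this battery of tests is straightforward to reproduce with SciPy. The sketch below uses the same placeholder values as before, not the actual performance data from the table (note that tukey_hsd requires SciPy 1.8 or newer):

```python
# Sketch of the test battery with SciPy; dgt / no_dgt would hold the
# per-event performance ratings (placeholder values shown here).
from scipy import stats

dgt    = [2893, 2650, 2600, 2580, 2560, 2540, 2500, 2480]
no_dgt = [2074, 2450, 2430, 2420, 2400, 2390, 2380, 2360]

print(stats.ttest_ind(dgt, no_dgt, equal_var=False))  # Welch's t-test
print(stats.ks_2samp(dgt, no_dgt))                    # Kolmogorov-Smirnov
print(stats.anderson_ksamp([dgt, no_dgt]))            # k-sample Anderson-Darling
print(stats.tukey_hsd(dgt, no_dgt))                   # Tukey HSD (SciPy >= 1.8)
```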
The dispersion in these samples shows a standard deviation of 334 in the DGT group and 318 in the group without DGT (the deviations measured across event performances are 119 and 113, which correspond, at the level of individual games, to the values just quoted). Tukey's biweight for dispersion indicates 166 and 319, respectively, filtering out the value 2893 as an outlier. The average historical dispersion of players in general is about 377, and in the specific case of Niemann's history it is 341. So the analyzed events of 2019 and 2020 represent the typical dispersion well, with one additional detail: if you remove the 2893 result, the dispersion in the DGT games drops to about half. This abnormal narrowing in variability may be an additional symptom to be considered. An analysis of variance shows that if you remove the 2893 outlier, the probability that the samples with and without DGT have equal dispersion is less than 0.1%.
Using the FIDE ratings instead of the USCF ratings, the difference decreases from 206 to 175 points, but the final conclusion does not change. All the tests performed for differences between means, differences between robust measures of central tendency, and differences between variances confirm a statistically significant difference at the 0.001 level.
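The dispersion comparison can likewise be sketched with a standard equality-of-variance test; again the values below are placeholders, not the actual data:

```python
# Sketch of the dispersion comparison; Levene's test asks whether
# the two groups share the same variability (placeholder values).
from scipy import stats

dgt_trimmed = [2650, 2600, 2580, 2560, 2540, 2500, 2480]   # 2893 removed as outlier
no_dgt      = [2450, 2430, 2420, 2400, 2390, 2380, 2360]

print(stats.levene(dgt_trimmed, no_dgt))   # a low p-value would indicate unequal dispersion
```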
The result can also be easily observed in a graph:
The red pawns represent the performances obtained in events in which a DGT board was used, and the green kings those in events with traditional boards. The dotted line indicates the overall average performance (2505).
Another analysis worth making concerns the evolution of rating as a function of age. Chess prodigies present a characteristic curve that is very similar among all of them. This can be observed from Morphy, Capablanca and Reshevsky (see the book "Chess, the 2022 best players of all time, two new rating systems" at https://www.saturnov.org/livro/rating) to new talents such as Firouzja, Wei, Duda, Gukesh and Erigaisi, passing through Fischer, Kasparov, Kamsky, Leko, Carlsen, etc.
In Carlsen's case, this curve looks like this (FIDE rating):
The rating rises very rapidly until approaching an asymptotic limit, dips a bit after reaching a peak, and then continues to decay slowly over the decades. Later on, we will see some more examples.
Carlsen was born in late November 1990. From age 11 to 15, his FIDE rating showed very rapid growth. From age 15 to 20, he had a slower, but still very consistent growth. Around the age of 23, his rating stabilized and his strength began to decline.
Now let's look at Hans Niemann's evolution curve:
The overall behavior is very different. The curve is a staircase with a few steps and some oscillations of relatively large amplitude. Between early 2016 and mid-2018 there was virtually no evolution in the rating: 30 months with an almost unchanged rating. Suddenly, in 6 months (mid-2018 to early 2019), it rose from an average level of 2300 to an average level of 2460, and again remained stable for another 2 years. Then (2021) it rose again.
This is quite abnormal behavior among world elite players, who generally grow very fast until age 15 and then continue a less rapid growth until age 20. There is a progressive deceleration in the rate of growth as a function of age until around age 25, when a slow decline begins. Because it is slow, this decline in playing strength can be masked by rating inflation (more details in the book cited a few paragraphs above). But during the initial growth phase, the pace is very fast. Carlsen went from 2050 to 2550 in 4 years, then from 2550 to 2850 in 8 years. The evolution of Niemann's rating is very different: it speeds up, slows down for years, then speeds up again for a few months, then slows down again for years, and again speeds up for a few months. This is extremely unusual and unnatural. It is not just different from Carlsen; it is different from more than 95% of elite players.
Capablanca's rating, for example, during his period of rise, showed behavior very similar to Carlsen's, with the difference that Capablanca's rating began to decline after the peak, because the calculation method used by Rod Edwards (link below) is immune to inflation, while the FIDE rating currently shows about 5.7 points of inflation per year. This is why Carlsen's rating has remained stable since 2013, although his strength is waning. There is also an evolution in the understanding of the game that makes players stronger each year, but it does not keep up with inflation: it adds about 1.4 points per year, so if a rating remains stable, a real decline of about 4.3 points per year in playing strength is occurring.
Source: http://www.edochess.ca/players/p864.html
Besides the age-related rating evolution curves of prodigies, we can also observe the curves of players in general, which also differ dramatically from Niemann's rating curve. This is because there is a typical variation in the evolution of cognitive ability as a function of age, and all performances that depend on cognitive abilities approximately follow this variation.
The graph below shows the evolution of rating as a function of age for 396,311 players on FIDE's September 2022 list:
The information shown in the chart above is not of the same kind as in the other charts. The chart above plots the ratings and ages of 396,311 different players, showing how rating relates to age across players. The other graphs show the evolution of rating as a function of age for the same person over a lifetime. Although these are not exactly the same information, when large samples are considered they end up exhibiting statistically very similar behavior.
The next graph shows a comparison of the evolution of ratings as a function of age for some of chess's recent prodigies:
The same players as in the chart above (except Carlsen, who already has a large exclusive chart) are presented below in separate charts, to make it easier to see the shapes of the curves:
As can be seen, although some curves, such as those of Wei and So, are somewhat different from the pattern of the others, they preserve some of the most important general properties, such as increasing monotonicity (if discretized into biennial intervals) until age 23; a sketch of this check follows below. Almost all of the younger players who occupy the top positions in the world rankings exhibit this characteristic in their growth curves. An exception is the young Abdusattorov, who remained virtually unchanged in rating for 30 months between the ages of 7 and 9. This is extremely unusual. Then he showed an increase of 270 rating points in just 2 months, at age 11. This can happen if the child has his chess studies interrupted, as happened to Reshevsky and Kamsky, for example. But these are rare cases, and they do not show major drops or inconsistencies in results. Niemann's case is different, because not only is his evolution curve different from the standard, it also presents drops and instabilities.
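As an illustration of the kind of check described above, here is a small hypothetical helper (the function and the sample history are mine, not taken from any rating database) that discretizes a rating history into two-year bins and tests whether the bin averages rise monotonically up to age 23:

```python
# Hypothetical helper: discretize an (age, rating) history into two-year
# bins and check whether the bin averages rise monotonically up to age 23.
from statistics import mean

def is_monotonic_rise(history, cutoff_age=23, bin_years=2):
    """history: list of (age, rating) points for one player."""
    bins = {}
    for age, rating in history:
        if age <= cutoff_age:
            bins.setdefault(int(age) // bin_years, []).append(rating)
    means = [mean(bins[k]) for k in sorted(bins)]
    return all(a <= b for a, b in zip(means, means[1:]))

# A Carlsen-like curve passes; a flat-then-jump staircase may fail.
print(is_monotonic_rise([(11, 2050), (13, 2350), (15, 2550), (19, 2770), (23, 2860)]))  # True
```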
The next chart shows the evolution of intellectual ability as a function of age, based on WPPSI, WISC, and WAIS scores from age 3 to 91, with extrapolations outside this range:
Therefore the set of statistical results shows that Hans Niemann's playing strength is about 200 rating points higher when the events use DGT boards (March 2019 to November 2020), a difference that in this context is statistically significant at the 0.0002 level. Furthermore, Niemann's FIDE rating evolution is different from that of other world elite players and of players in general, with very unusual particularities in the rate of rating evolution as a function of age, not behaving like the typical curves of natural evolution of cognitive ability.
I would like to believe that Niemann is another great chess talent, and I even defended him in my preliminary comments, right after the first game, due to the absence of evidence against him. But I must acknowledge that my initial interpretation was naive and incorrect. I presented arguments in Niemann's defense based on the fact that he makes many moves different from those indicated by the engines, and this really does show that he does not use engines for all his moves, maybe not even for 50% of them. But the statistical analysis presented above is very strong and almost conclusive. The fact that he does not use engines on all his moves makes it harder to detect signs of fraud, because his game is very similar to that of a human player rated 2700, with errors that engines do not make and with logical moves that humans would play but engines would avoid due to the concrete calculation of variations. Looking exclusively at the quality of the moves in his games, there is no evidence of cheating. On the other hand, when we look at the differences between Niemann's performances in games with and without DGT boards, a fairly clear anomaly begins to emerge; and when we examine his strength evolution curve, more inconsistencies with the hypothesis of innocence emerge, leading to a sad conclusion, but one that seems to be the only one that harmonizes with the empirically measured hard facts.
Since the conclusion posted on Ken Regan's ChessBase website (https://en.chessbase.com/post/is-hans-niemann-cheating-world-renowned-expert-ken-regan-analyzes) is opposite to mine, I thought I should look into some more relevant details.
Although Regan is a competent person, the analysis he presents is incomplete and the conclusion is probably incorrect, just like the first conclusion I presented in my analysis of 09/05/2022, when I relied exclusively on the analysis of the moves of the games, both Niemann's and his opponents'. From that perspective, the conclusion is the same as the one Regan presented: Carlsen played slightly below his normal strength and Niemann presented a "normal" game of a human GM rated 2700. This suggests that there was nothing unusual, no hint of cheating. The big problem is that Niemann's rating evolution curve is outrageously abnormal, and that part of the problem is not covered in Regan's analysis.
Niemann himself admitted to using engines in competitions on several occasions, in 2015 and in 2019. Looking at the chart of his rating history, we can pinpoint some of these events precisely, but there were not just two, as he said. Two produced fairly obvious anomalies, but beyond these there is clear contamination in the entire history.
In 2014 his rating was near 2000 (point A) and dropped below 1880 (B), then rose within 3 months to 2280 (C), but soon fell to 2050 (D). Then there was another of these abnormal rises within a few months, this time reaching almost 2350 (E). From then on it oscillated, until in mid-2018 there was another rise (F) within a few months, and again it oscillated (G-H). Finally, starting in 2021, we can see a gradual rise (H) that looks very much like natural growth and could even be interpreted as such, were it not for all the other precedents. But considering the facts as they stand, the likely interpretation is as follows:
1. In 2014 Niemann was close to reaching 2000, a magic number that people want to reach and surpass. He got there, but soon afterwards his rating dropped below 1900, and apparently he could not handle that drop. That is when he decided to start cheating. We have no way of knowing whether he reached 2000 without using an engine, but in principle let's assume that he only started using engines in 2015.
2. From 1880 his rating rose to 2280 in 3 months, but obviously when he stopped cheating he could not stay at that level and went back down, to 2050. There was no way to stay at ~2300 without cheating continuously or concretely improving his game, and improving one's game is a slow and costly process.
3. His rating went from 2050 to 2350 in 3 months, and he had already "learned" that if he stopped using engines, as on the two previous occasions, his rating would drop by hundreds of points. So he decided to stick with the scheme: playing one tournament without an engine and another with an engine, to avoid the big drops and big rises that could raise suspicion.
4. After keeping his rating stable for some time, Niemann decided to go up another step in mid-2018, this time to around 2450. He also refined his method: instead of going up in steps, he went up a gentle ramp, which arouses less suspicion, and so he did from 2021 on.
In the 1980s-1990s (and certainly before), 'post-mortem' analyses were common: players finished the game and analyzed it right there, in the tournament room or in a nearby space reserved for this, discussing their plans, their ideas, their doubts. This made it clear how deeply each player had analyzed the game, how well each understood it, whether he had won by luck or by merit, whether he had grasped the subtleties and complexities of the position, etc. Nowadays the game ends and each player goes off to analyze alone on his cell phone or laptop. That 'post-mortem' socialization played an important anti-fraud role, even before there were computers strong enough to carry out this kind of fraud. A simple way to investigate Niemann's real strength would be for a group of GMs to analyze some games together with him, evaluating the quality of his suggestions, how well he understands strategic concepts, etc. Or he could be invited to be a commentator at some important event, to check how good his comments are.
Niemann spent about 30 months (2016.1 to 2018.6) with his rating hovering around 2300, then went up to 2460 using engines. Since then Niemann claims to have stopped using engines, but instead of his rating returning to 2300, where it had been, it went even higher. The pattern of his rating dropping about 200 points each time he stopped using engines is quite clear. If he really had stopped using engines, his rating would be expected to drop again. Regan's hypothesis of Niemann's innocence is inconsistent with this and other important facts.
If the results in events with DGT boards produced performance ratings 200 points higher, and the number of games with DGT was similar to the number without, then when Niemann supposedly stopped using engines his rating should have dropped by at least 100 points. This never happened; his rating went up even more after he supposedly stopped. A really hard situation to explain.
When I say that he should have dropped "at least 100 points" instead of "about 100 points", it is because he may have used engines both in events with DGT and in tournaments without DGT, but in the DGT cases he was able to take better advantage of it. So even in events without DGT his rating was perhaps inflated, though to a lesser extent than in events with DGT.
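To make the arithmetic explicit: if roughly half of the games were played at a performance level about 200 points above a player's clean level r, and the other half at r, the blended rating settles near the average of the two regimes, (r + (r + 200)) / 2 = r + 100. Removing the inflated half should therefore cost on the order of 100 points, and more if the supposedly clean half was itself partly inflated.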
None of these facts are visible when one tries to investigate the case solely on the basis of analyzing the moves of the games, because there are no overt symptoms of engine use in his games. This only makes the detection of cheating more difficult, because there are versions of Lc0 rated around 2700 that make "human" mistakes which old engines (circa 1998) like Fritz 5, Hiarcs 6, Shredder 5 and Gandalf 6 do not make, let alone the newer engines. Engines like Lc0 play with a style indistinguishable from that of humans, including making typical human mistakes and making "logical" choices that some super-strong engines would not. This makes it possible to use engines without it being "visible" in an analysis based solely on the play. Nor can one rule out the alternative that a person uses engines on only some moves, not all, which would also make detection more difficult.
For these reasons, Regan's analysis leads to an unfortunately wrong conclusion. I say "unfortunately" because it is sad that this kind of cheating happens in Chess. When Felipe Neto was accused of using engines, my assessment of the case was completely different, because Chess for him is a game, like a game of truco (a popular card game in Brazil). Cheating at truco can be a game between friends, but cheating in a serious sports competition, where official titles, money and recognition are at stake, is much more serious.
I remember an account given to me in 1998 by my friend Marcos Seman, of a game he played against his friend Álvaro Pena. They were playing a tournament and theirs was the last game in progress; there was no one else in the room. Seman was in the inferior position, and at a certain moment he made his move, pressed the clock, and got up to go to the bathroom. When he returned, he noticed that the clocks were stopped, with Mr. Pena looking at the board. This made Seman furious; he hit the clock hard to start it again, thinking his opponent was analyzing without his own time running. Then Mr. Pena stopped the clocks again and explained that he was resigning. Seman, very embarrassed, asked why he was resigning, and whether he had done something to offend him, which was not his intention. Mr. Pena then explained that while Seman was in the bathroom, he had touched a Rook, intending to move it, but then saw that he would lose a piece and the game. So he decided to resign. No one was present to see that he had touched the piece, but he knew he had touched it and was obliged to move it, and this was enough for him to honor his commitment to the truth.
On another occasion, in a much older and more famous event, the London tournament of 1922, 16 great players were competing, and 4 of them had a chance of first place: Capablanca, Alekhine, Vidmar and Rubinstein. With 3 rounds to go, Capablanca and Vidmar were paired against each other. In those days, games that were not finished within the regulation time (2:30h for 40 moves) were adjourned and continued the next day. The player to move wrote down his "secret move" without anyone seeing it and placed it in an envelope, which was sealed, signed by both players, and kept in the custody of the arbiter. This prevented the opponent from gaining an advantage by analyzing the position for hours, since there was no way of knowing which move had been played. The next day, the envelope was opened, the secret move executed, and the game continued where it had left off. This happened in the 13th-round game between Capablanca and Vidmar, which was adjourned. Soon afterwards, in an informal conversation, Vidmar told Capablanca that he intended to resign. The next day, Capablanca did not show up. The arbiter started the round, the clocks began to run, and Capablanca did not arrive. After almost an hour of waiting, with Capablanca's time almost running out, which would have meant losing the game, Vidmar stopped the clocks and declared to the arbiter that he was resigning, handing the point to Capablanca, who went on to win the tournament.
It is a shame that the current stories about Chess are so different.