Sep 21, 2022
Niemann vs. Carlsen: an objective analysis of the facts
A detailed study of the biggest controversy in the chess world in recent decades
Special Thanks to Tamara Rodrigues
Translated to English by Felipe Rodrigues
Magnus Carlsen holds the record for the highest rating in history and has been World Chess Champion since 2013. Hans Niemann is a young Grandmaster who has been causing controversy in recent years, especially in the last two weeks, after beating Carlsen in the Sinquefield Cup and being accused of cheating. This episode made headlines in newspapers around the world. Since then, there has been much speculation on the matter, with some defending Niemann and others casting doubt on his integrity.
Among the many people who have given their opinion on the matter are some of the best players in the world, including Carlsen, Aronian, Nakamura, Shirov and Nepomniachtchi, and some of the leading experts on fraud, notably Kenneth Regan, professor of computer science at the University at Buffalo, who holds a doctorate in mathematics from the University of Oxford and is an International Master, featured on the ChessBase website as “The world's greatest expert on cheating detection in Chess”. However, the conclusion that Regan presents is objectively incorrect, as we will demonstrate below.
On 09/05/2022, the day after Niemann's victory against Carlsen in the Sinquefield Cup, I analyzed the game between them, focusing exclusively on the moves and on technical aspects of chess, and I came to the same conclusion as Regan: there is no evidence of the use of engines, and Niemann's play is indistinguishable from that of a human rated 2700. I closed the matter at that point, understanding that Niemann was innocent. But after Carlsen's conduct in the 9/19/2022 online game, I thought I should delve deeper into the analysis, because Carlsen wouldn't act that way unless he felt deeply outraged and wronged. At the same time, Niemann's conduct struck me as inappropriate. If I were suspected of cheating in a similar situation and my opponent abandoned the game, as Carlsen did, I would decline the point and hand it over to him. But Niemann simply accepted the point, without any dispute. This rekindled my doubts and I decided to investigate further.
For those interested in the technical analysis of the game, you can download it here (PGN, PDF).
The suspicions against Niemann can be divided into two groups:
1. Use of engines.
2. Unauthorized access to Carlsen's training.
In the case of item 2, the accusations are more speculative, so I prefer to refrain from commenting.
Regarding item 1, there are weak and strong indications. Let's analyze the strong ones, emphasizing a point that has not been discussed so far and that may be crucial to unraveling what is happening: the evolution of the rating as a function of age. We look at what this evolution is like for the world's elite young players, for players in general, and for Niemann.
Our analysis will be divided into the following parts:
1. Asymmetry of results in events with DGT board.
2. Evolution of the FIDE rating over time.
3. Analysis of game play, videos and other evidence.
A DGT board is an electronic board with a touch-sensitive surface and a computerized structure that recognizes the moves executed and transmits them to a computer via USB (or serial port or equivalent). In recent years, most of the major tournaments have used this type of board, because it makes real-time transmission of games to the Internet and TV easy. At the same time, this type of board can raise doubts about the possibility of cheating.
The process by which a person could use a DGT board to cheat will not be discussed here. Elon Musk even commented on a hypothesis about how moves might be transmitted. It was unclear whether he was being ironic, but similar alternatives are certainly possible.
In this first part of the analysis, we examine the claim that Niemann's results are better in tournaments where DGT boards were used. The website https://www.chessdom.com/new-allegations-within-niemann-carlsen-case-hans-niemann-performs-much-better-with-live-dgt-boards/ provides some data on this. Later we will see another source with similar data, against which these results will be compared, so we will call this table “A” and the other “B”.
Assuming that the data presented in table A is correct, some hypothesis tests were performed to investigate whether Niemann's performances at live broadcast events were different from his performances at non-live broadcast events. The results were:
Student's t test: 99.987%
[In the introduction to the appendix there is a brief explanation of the meaning of these numbers and the usefulness of these tests]
Therefore, these results are practically conclusive: there is a statistically significant difference at a level of 0.0002. However, there are some errors and possible distortions in table A. For example, the table states that Niemann achieved a 2893 performance in the “USCF K12 Grade National” in 2019, but Niemann's name does not appear on the official website of this event: https://www.uschess.org/results/2019/k12/. Therefore, the possibility of excluding this event from the analysis was considered, and the results with and without this event were compared. A more detailed discussion of this has been included in the appendix, so as not to interrupt the flow of the text.
Other, possibly more serious, errors in that table are pointed out, and the supposedly correct information is presented, in this table (B): tinyurl.com/bwaucm78
Comparing tables A and B, we have the following:
Table B, source: tinyurl.com/bwaucm78
The question of which table is closer to the truth has been debated in this thread: https://twitter.com/thestrongchess/status/1568813904750411776. Weighing the arguments on both sides, table B appears to have been crafted much more carefully, has fewer errors and is open to public review by others. However, table A is better known and is influencing more people, because it was widely publicized on a highly visible website. With this article, we also hope to correct this distortion.
In columns “Q” and “R” of the worksheet above, several errors are pointed out where table A incorrectly marked whether a DGT board was used at an event. In addition, several events from March 2019 to November 2020 in which Niemann participated were omitted from table A. In table B, care was even taken to separate the games from rounds 2, 3 and 8 of the “US Masters 2019”, which were broadcast live, from the other games of the same event, which were not.
On the other hand, table B includes 7 events in which Niemann scored 100%, and the performance rating cannot be calculated when the score is 0 or 100%, because it implies a division by 0 or a logarithm of 0. Table A includes only 1 event with this issue. We included a technical note on this in the appendix. The performance ratings in table B, as in table A, are calculated with the formula proposed by FIDE, which is not appropriate, as we also discuss in the appendix.
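To make the division-by-zero issue concrete, here is the logistic (Elo-style) form of the performance rating. It is given only as an illustrative assumption: FIDE publishes a lookup table rather than this closed form, and the formula the appendix discusses may differ. R_a is the average opponent rating and p is the fraction of points scored:

```latex
R_p = R_a + 400 \log_{10}\!\left(\frac{p}{1-p}\right), \qquad 0 < p < 1

\lim_{p \to 1^-} R_p = +\infty, \qquad \lim_{p \to 0^+} R_p = -\infty
```

At p = 1 the fraction divides by zero, and at p = 0 the logarithm is taken of zero, so no finite performance rating exists for a 100% or 0% score under this form.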
The graph below shows the variation in the sizes of disparities produced by the FIDE formula for calculating rating performance as a function of the percentages of points obtained. Although in most cases these differences are less than 5 points, when the percentages approach 100% the errors can exceed 175 rating points. In fact, for 100% the error can be infinite, depending on how the draw weight is handled:
Therefore, both tables have errors and distortions, but in table A the problems are much more numerous and more serious, so it is more appropriate to refer to table B.
The difference between the USCF performance ratings in events with and without DGT (or equivalent devices) shown in table A is 206 points. If we use FIDE ratings instead of USCF, it would still be 182 points. But in table B this difference is only 28.5 points. However, the calculations presented in table B were not performed in the most appropriate way, due to distortions when the score is 100%. So I redid the calculations as follows: all 90 games with live broadcast were placed in one group (BC) and all 83 games without live broadcast in another group (NBC). Then the performance in each of these groups was calculated. This procedure eliminates divisions by 0 without arbitrary adjustments for events with 100% wins. While it corrects the division-by-zero problem, it introduces some other, minor, distortions. The result is a performance rating of 2544.3 in the group with DGT and 2501.5 without DGT.
Although the result is not the 28.5 points indicated in table B, the difference is much smaller than indicated in table A. Considering that there are many outliers far above and far below the average, it is more appropriate to use Tukey's biweight (or Andrews' wave) instead of the arithmetic mean. In this case, the results are 2554.1 and 2508.9. Therefore, if table B is correct about the games in which a DGT board was used, the difference is only 45.2 points, instead of 206 points. Out of curiosity, if we use the FIDE method, the results are 2546.5 and 2500.9.
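The pooling step described above can be sketched as follows. This is a minimal sketch, assuming a logistic (Elo-style) performance formula; the article does not state which formula was used for the 2544.3 and 2501.5 figures, and the function name and sample data are hypothetical.

```python
import math


def pooled_performance(games):
    """Performance rating over one pooled group of games.

    `games` is a list of (opponent_rating, result) pairs, where result is
    1 for a win, 0.5 for a draw and 0 for a loss. The logistic formula used
    here is an assumption, not necessarily the one used in the article.
    """
    n = len(games)
    avg_opponent = sum(rating for rating, _ in games) / n
    score = sum(result for _, result in games) / n
    if not 0.0 < score < 1.0:
        # Still undefined if the whole pooled group is 0% or 100%.
        raise ValueError("performance rating undefined for a 0% or 100% score")
    return avg_opponent + 400.0 * math.log10(score / (1.0 - score))


# Hypothetical data: one event won with 100% plus one ordinary event.
perfect_event = [(2300, 1), (2300, 1)]
ordinary_event = [(2500, 0.5), (2500, 0)]
pooled = pooled_performance(perfect_event + ordinary_event)
```

Computed alone, the perfect event would raise an error, but once pooled with other games the score fraction falls strictly between 0 and 1 and the calculation goes through.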
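For readers unfamiliar with robust averages, here is a minimal sketch of a one-step Tukey biweight location estimate. The tuning constant c = 6 and the use of the median absolute deviation as the scale are common conventions and are assumptions here; the article does not say which variant or settings were used, and the sample values are hypothetical.

```python
import statistics


def biweight_location(values, c=6.0):
    """One-step Tukey biweight estimate of the center of `values`.

    Points further than c * MAD from the median receive zero weight, so
    extreme outliers do not pull the estimate the way they pull the mean.
    """
    values = list(values)
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    if mad == 0:
        return center  # no spread around the median: nothing to reweight
    num = 0.0
    den = 0.0
    for v in values:
        u = (v - center) / (c * mad)
        if abs(u) < 1.0:
            w = (1.0 - u * u) ** 2
            num += (v - center) * w
            den += w
    return center + num / den


# Hypothetical performance ratings with one extreme outlier.
sample = [2500, 2510, 2490, 2505, 2495, 3200]
robust_center = biweight_location(sample)
plain_mean = statistics.mean(sample)
```

In this sample the arithmetic mean is dragged above 2600 by the single outlier, while the biweight estimate stays close to 2500.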
But there is still a problem, because tests of contrast between means, as well as analysis of variance, cannot be performed on samples with only 1 element, and if we include the results of all games with DGT in one group and all without DGT in the other group, there will be only 1 performance in each group. Therefore, it is inevitable to apply some boundary condition to deal with divisions by 0 and assign some plausible performance rating in cases of events with 100% wins, or eliminate these events from the calculation. But eliminating these events would produce a skewed result, because in 5 of the 7 events there was no live broadcast. So I decided to investigate what the hypothesis tests would look like assuming that the performance ratings indicated in table B are reasonably close to the correct values, or at least not so far from correct as to alter the inference. This leads us to the following results:
Student's t: 67.42%
If you remove the 7 events with 100% wins, the results are:
Student's t: 99.69%
Therefore, where there was a high degree of certainty (based on table A), there is an inconclusive situation (based on table B). The differences observed between events with DGT and without DGT are not statistically significant.
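The comparison between the two groups of per-event performance ratings can be sketched with the classic pooled-variance Student's t statistic. This is a minimal sketch with hypothetical data and function names; the article does not state whether its percentages are one-sided or two-sided, so only the statistic and the degrees of freedom are computed here.

```python
import math
import statistics


def student_t(group_a, group_b):
    """Pooled-variance Student's t statistic and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / std_err, df


# Hypothetical per-event values, chosen only to make the arithmetic easy.
t_stat, dof = student_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

Turning the statistic into a confidence percentage requires the cumulative distribution of Student's t with `dof` degrees of freedom, which a statistics library can supply. Note that each group needs at least two values, which is the reason a single pooled performance per group cannot be tested.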
This is not a point in favor of Niemann's innocence. It just shows that one of the arguments used against him has no statistical validity. Furthermore, as the probabilities of guilt and innocence are complementary, when you reduce the probabilities of guilt, you are automatically increasing the probabilities of innocence. In this context, if the differences between results with and without DGT were being seen as evidence against Niemann, and this “evidence” needs to be discarded, the balance that was tipping against him returns to a position closer to neutrality.
In this part, we examine how Niemann's FIDE rating evolves as a function of age and compare it with the evolution of other chess prodigies. Among the young players of the world elite, a characteristic curve can be observed that represents the evolution of playing strength as a function of age, and this curve is very similar among all of them. This can be seen from Morphy, Capablanca and Reshevsky (see the book “Chess, the 2022 best players of all time, two new rating systems” at https://www.saturnov.org/livro/rating) to new talents such as Firouzja, Wei, Duda, Gukesh and Erigaisi, passing through Fischer, Kasparov, Kamsky, Leko, Carlsen etc.
In the case of Carlsen, the curve looks like this (FIDE rating):
The player evolves very quickly until approaching an asymptotic threshold, then decays a bit after reaching peak and continues to decay slowly over the decades. We will see some more examples later.
Now let's look at Hans Niemann's evolution curve:
The general behavior is very different. Instead of a smooth curve that gradually decelerates, the evolution curve for Niemann has some steps and some oscillations with relatively large amplitude. Between the beginning of 2016 and mid-2018, there was practically no evolution in the rating, with 30 months with practically unchanged rating. Suddenly, in 6 months (mid 2018 to early 2019) it rose from an average level of 2300 to an average level of 2460, and again it remained stable for another 2 years. Then (2021), it grew again.
This is quite abnormal behavior among world elite players, who generally grow very fast until age 15 and then continue to grow less rapidly until age 20. There is a progressive deceleration in the growth rate as a function of age, until around 25 years of age, when a slow decrease begins. This decrease in playing strength can be masked by rating inflation (more details in the book cited a few paragraphs above), because it is slow. But during the initial phase of growth, the pace is very fast. Carlsen went up from 2050 to 2550 in 4 years. Then he went from 2550 to 2850 in 8 years. The evolution of Niemann's rating is very different: it accelerates, decelerates for years, then accelerates again for a few months, then again decelerates for years, and again accelerates for a few months. This is extremely unusual and unnatural. It's not just different from Carlsen's; it is very different from that of more than 95% of elite players. Later, we will present some studies quantitatively evaluating this disparity.
Capablanca's rating, for example, during his ascension period, behaved very similarly to Carlsen's, with the difference that Capablanca's rating began to decline after the peak because the calculation method used by Rob Edwards (link below) is immune to inflation, while the FIDE rating currently features around 5.7 points of inflation each year. That's why Carlsen's rating has remained stable since 2013, although his strength is decreasing. There's also an evolution in understanding the game that makes players stronger every year, but that doesn't keep up with inflation. This evolution in understanding the game increases by about 1.4 points each year, so if the rating remains stable, it means that there is a real decline of 4.3 points per year in game strength.
In addition to the rating evolution curves as a function of age for child prodigies, we can observe the curves for players in general, which also differ dramatically from Niemann's rating curve. This is because there is a typical variation in the evolution of cognitive ability as a function of age, and all performances that depend on cognitive abilities closely follow this variation.
The graph below shows the evolution of ratings as a function of age for 396,311 players in FIDE's September 2022 list:
The variables represented by x and y axes in the above graph are not the same as in the previous and subsequent graphs. Here we can see how the rating is related to age for 396,311 different people. The other graphs show the evolution of the rating as a function of age for the same person throughout life. Although they are not exactly the same information, when considering large samples of data, they end up assuming statistically very similar behaviors.
The next chart shows the evolutions of some of the recent chess prodigies:
The same players as in the chart above (except for Carlsen, who already has a unique chart) are shown below in separate tables, to make it easier to see the shapes of the curves:
As can be seen, although some curves, such as those of Wei and So, are a little different from the pattern of the others, they preserve some of the most important general properties, such as increasing monotonicity (if discretized at biennial intervals) until age 23. Almost all the youngest players who occupy the first positions in the world ranking have this characteristic in their growth curves. An exception is the young Abdusattorov, who remained virtually unchanged in rating for 30 months, between the ages of 7 and 9. This is extremely unusual. Then he showed an increase of 270 rating points in just 2 months, at the age of 11. This can happen if the child has his Chess studies interrupted, as happened with Reshevsky and Kamsky, for example. But these are rare cases, and do not show large drops or inconsistencies in the results. Niemann's case is different, because in addition to his evolution curve being very non-standard, it also presents drops and instabilities.
To further investigate these differences from a quantitative perspective, I proposed a theoretical model of adjustment in which the rating of each player evolves with age according to a logistic function of the type:
Where “e” is Euler's number 2.71828..., “x” is the age in years with decimals, and “R” is the rating as a function of age. The values of the parameters “a, b, c” are determined so as to minimize the sum of squares of the distances between the theoretical curve and the empirical values for each player.
I then tested whether this model really did provide a good fit to the experimental data, and the results were above expectations. The model describes very well how each player's rating behaves over time until it reaches its peak. But my model does not consider the rating decline after age 30-40. A complementary fit for this purpose would be work for another article or book.
The values of the parameters “a, b, c” vary from one player to another, but the shape of the best-fit curve is very similar for all of them. Some players follow the theoretical curve very closely, while others are not so close, but all curves belong to the same morphological class, matching the upper part of a logistic function. The one glaring exception is Niemann's case, where the best fit is almost a straight line, and the distances of the points from the curve are much greater than those of any other player. The following are the curves for Carlsen and Niemann:
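As an illustration of this kind of fit, here is a minimal sketch that assumes the common three-parameter logistic form R(x) = c / (1 + a·e^(−b·x)). That exact form is an assumption chosen so that “c” is the asymptote, and the fitting routine below (a scan over candidate values of c, each with a linearized regression for a and b) is a rough stand-in for a true least-squares optimizer, not the author's program. All names and data are hypothetical.

```python
import math


def logistic_rating(x, a, b, c):
    """Assumed model: rating at age x, approaching c as x grows."""
    return c / (1.0 + a * math.exp(-b * x))


def sum_of_squares(ages, ratings, a, b, c):
    return sum((logistic_rating(x, a, b, c) - r) ** 2
               for x, r in zip(ages, ratings))


def fit_logistic(ages, ratings, c_candidates):
    """Return (error, a, b, c) with the smallest sum of squares found.

    For a fixed c, ln(c / R - 1) = ln(a) - b * x is linear in x, so a and b
    come from ordinary linear regression on the transformed ratings.
    """
    best = None
    n = len(ages)
    mean_x = sum(ages) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    for c in c_candidates:
        if c <= max(ratings):
            continue  # the transform needs every rating below c
        ys = [math.log(c / r - 1.0) for r in ratings]
        mean_y = sum(ys) / n
        slope = sum((x - mean_x) * (y - mean_y)
                    for x, y in zip(ages, ys)) / sxx
        a, b = math.exp(mean_y - slope * mean_x), -slope
        err = sum_of_squares(ages, ratings, a, b, c)
        if best is None or err < best[0]:
            best = (err, a, b, c)
    return best


# Synthetic, noise-free ratings generated from known parameters.
ages = [float(age) for age in range(8, 21)]
ratings = [logistic_rating(x, 50.0, 0.5, 2850.0) for x in ages]
err, a_fit, b_fit, c_fit = fit_logistic(ages, ratings, range(2700, 3001, 10))
```

On this synthetic data the scan recovers the generating parameters. On real rating histories the fitted parameters would depend on the optimizer and on how the data are sampled.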
Other players' graphs can be downloaded here:
As can be seen in Carlsen's case, and the same holds for all other players with “normal” evolution, the best-fit curve approaches a horizontal asymptote close to the maximum rating achieved. In the case of Niemann it is almost a sloping straight line. In addition to the visually evident disparities, some objective measurements were taken to investigate the probability that the observed differences are random.
The chi-square goodness-of-fit test is useful for measuring how well a theoretical curve fits the experimental data. For all players it indicates an almost 100% probability that the theoretical model describes the evolution of the rating as a function of age, except for Niemann, for whom the probability is 61%. For comparison, among the 13 other players analyzed, the second worst fit was Firouzja's: 99.99999951%. The best fits show 100% because floating-point operations in Python are limited to 15-16 significant figures. It would be possible to get more decimals using the mpmath library, but since the program is already implemented using numpy and scipy, and since 99.9999999999999% is sufficient for our purposes, we don't need more than 15 significant figures. It is just worth clarifying that “100%” here does not actually mean 100%, but something above 99.9999999999999%. It is also important to clarify that most of these calculations carry much greater uncertainty than these numbers suggest. An indicated probability of 99.99% does not mean that the true probability is that high. DNA tests, for example, which indicate a 99.9999% probability that a man is the father of a child, disregard the 0.4% probability that this man has a twin, among other sources of overestimation. Therefore an indicated probability of 99.9999999999999% should be interpreted as “very high”, but not necessarily as high as the number itself.
Another test performed was on the size of the typical error in curve fitting. This measure also provides important data for analysis. The smallest fit error was Karjakin's: 1.78. The largest fit error (except Niemann's) was Firouzja's: 13.34. In Niemann's case, the error was 212.41, about 16 times greater than the largest fit error observed among the other players.
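The floating-point limit mentioned above is easy to reproduce. The sketch below uses only the standard library; the tail probability of 1e-17 is a hypothetical value chosen for illustration, and the standard `decimal` module stands in for an arbitrary-precision library.

```python
import sys
from decimal import Decimal

# A hypothetical tail probability far smaller than double precision resolves
# when it is subtracted from 1.
tail = 1e-17

# In double precision the subtraction rounds back to exactly 1.0,
# which is what gets displayed as "100%".
as_float = 1.0 - tail

# Decimal arithmetic keeps the digits that the float lost.
exact = Decimal(1) - Decimal("1e-17")

# Number of decimal digits a float can always represent faithfully.
reliable_digits = sys.float_info.dig
```

The tail probability itself (1e-17) is stored without any trouble as a float; only its complement is lost. Reporting the small tail value instead of the near-100% figure is one way to avoid the issue without extra libraries.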
Dynamic Time Warping (DTW) is a tool well suited to measuring similarity between time series. Most similarity measures, such as chi-squared and Kolmogorov-Smirnov, consider only the vertical distance (on the y-axis), and this sometimes does not give a correct assessment of the separation between curves, especially in cases like this, where ratings can lag if a person goes many months without playing tournaments. Ding Liren, for example, was hampered by this between the ages of 13 and 15, leaving a long horizontal line in his record. In such situations, the use of DTW allows for more accurate and reliable measurements. The DTW values obtained for the 13 players considered, except Niemann, were between a minimum of 1507 (for Deac) and a maximum of 2297 (for Caruana). In the case of Niemann, this value was 3571, much higher than the highest observed among the other players. The following graphic summarizes this situation:
On the other hand, the dissimilarities measured by the discrete Fréchet distance, which has properties similar to those of DTW, did not show such a big difference between Niemann's case and those of other players. Although the dissimilarity between Niemann's rating evolution curve and the model was the greatest among all the players considered, he was not isolated from the group, as in the other cases, as can be seen in the next graph:
There are also other metrics that can be used in situations like this, including the measurement of the area of the region that separates the curves, which indicates not only the size of the disparities, but also the length of time the disparity has remained. But in this case, it would be useless, as the effects produced by the use of the engine produce large, short-lived local anomalies, as can be seen in the events that occurred at 12 and 16 years of age, and in this case these metrics do not help to reveal whether there was use of an engine, and may even confuse more than clarify. The use of Partial Curve Mapping (PCM), for example, can help to deal with cases in which each x-value corresponds to more than one y-value (hysteretic curves), but does not contribute to the investigation of the phenomenon we wish to analyze.
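The two alignment-based measures discussed above, DTW and the discrete Fréchet distance, can be sketched in a few lines. This is a minimal sketch, assuming both series are plain lists of ratings compared by absolute difference; the article does not give its sampling or normalization, so the values produced here are not comparable to the figures it quotes. Function names and data are hypothetical.

```python
def dtw_distance(s, t):
    """Dynamic Time Warping: smallest total gap over all monotone alignments."""
    n, m = len(s), len(t)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def discrete_frechet(s, t):
    """Discrete Frechet distance: smallest achievable maximum gap."""
    n, m = len(s), len(t)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            gap = abs(s[i] - t[j])
            if i == 0 and j == 0:
                ca[i][j] = gap
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], gap)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], gap)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1],
                                   ca[i][j - 1]), gap)
    return ca[n - 1][m - 1]


# A plateau (months without rated games) costs nothing under DTW.
with_plateau = [2000, 2100, 2100, 2100, 2200]
without_plateau = [2000, 2100, 2200]
plateau_cost = dtw_distance(with_plateau, without_plateau)

# Two spikes against a flat line: DTW adds them up, Frechet keeps the largest.
flat = [0, 0, 0, 0, 0]
spiky = [0, 5, 0, 5, 0]
dtw_spiky = dtw_distance(spiky, flat)
frechet_spiky = discrete_frechet(spiky, flat)
```

Because DTW sums the gaps along the alignment while the Fréchet distance keeps only the largest one, repeated deviations raise the DTW value but leave the Fréchet value unchanged.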
Another interesting point to consider is the projection of the maximum rating that the person should reach (if he is younger than 30-35 years old) or has already reached (if he is older than 30-35 years old). This is determined by parameter “c” of the adjustment function. In Carlsen's case, the value of parameter c is 2862, a little lower than the maximum rating he actually reached (2882). For Caruana, the parameter c is 2837, and the maximum he reached was 2844. For Firouzja the projection is 2821, which he has not yet reached, but is likely to exceed. For now, Firouzja's maximum is 2810. As the rating curves fluctuate above and below the theoretical curve, it is likely that most players will reach their best moment some 20 points above the value of parameter c. For Deac, the projection was 2664, which he has already surpassed with some ease, reaching 2710, being one of those who obtained a rating above the value of parameter c for his growth curve. In the case of Niemann, the projection for him is 3103, and as most arrive a little above the c parameter, the expectation for Niemann, if his rating were real, would be around 3120.
The rating evolution curves as a function of age make it possible to investigate an extensive list of quantitative properties that reveal very pronounced anomalies. The presence of just one of these anomalies would not be a relevant indication that there is a problem to be investigated, but in Niemann's history all the measurements considered signaled large anomalies. For some of them the probability can be estimated directly; for others it can be estimated indirectly, based on the absence of occurrences of similar magnitude among the other players and on the number of standard deviations by which Niemann's case departs from the rest.
With the exception of the discrete Fréchet distance, all metrics considered to verify the evolution of the rating as a function of age indicated more than 99% probability that Niemann's rating evolution curve does not follow the normal course, and in some cases this probability was above 99.9%.
This set of anomalies in the evolution of ratings as a function of age provides a clear idea of the disparity between Niemann's case and those of other young talents of the world's elite. In addition, player ratings and ages are available from official sources at https://ratings.fide.com/, and the probability that there are errors in this data is very low, almost 0, unlike the more controversial information on DGT, which depends on the accuracy of data from uncertain sources, in which inaccuracies and inconsistencies have already been found.
This does not exclude the possibility that Niemann is simply a person of unusual development. But if that is the case, concrete evidence of this would need to be presented.
The next graph shows the evolution of intellectual capacity as a function of age, based on WPPSI, WISC and WAIS scores between 3 and up to 91 years, with extrapolations outside this range. The curve is similar to the evolution of the rating as a function of age, as the cognitive processes necessary for good performance in chess share several latent traits necessary for good performance in cognitive tests, albeit with different weights:
This graph is representative of the population average, but the curve is different for people with well above average IQs. Lasker, for example, at age 57 was almost at the peak of his strength, just 50 rating points (and about 3 IQ points) below his all-time high. Philidor is the most notable example of this, having remained the world's strongest chess player until nearly 70 years of age, and Kortschnoj is a good recent example. Therefore, the point at which intellectual capacity begins to decrease is not the same for all people. At the highest intellectual levels, maintenance near the peak can go on for decades.
So after the age of 30, the shape of this curve would not apply to Niemann or other elite players. But in the age group that precedes 30 years, the curve is very similar for all people, which makes it a good model for the case at hand. This does not mean that all prodigies will be at the same level at every age before 30, but they will all follow a similar ratio according to their individual maximum, and most should show a slower reduction after 30.
There are more details to consider, including some live games, some online games, some interviews, etc. There is no way to exhaust all possibilities, but we try to list some important points:
Interview with GM Alejandro Ramírez: https://www.youtube.com/watch?v=xxWs8vy-GKU. In this video, Alejandro seizes an excellent opportunity to test Niemann's skills. After the game between Lagrave and Niemann, which ended in a draw, Alejandro asks how Niemann intended to answer if Lagrave had played 26.Rxf5 instead of 26.c4. The position chosen by Alejandro for this question is very appropriate, because there is a long sequence of difficult moves that Niemann could not have left out of his analysis when allowing 26.Rxf5. If Niemann allowed 26.Rxf5, he should have had something in mind to play against it, and should have remembered it and responded immediately when asked. However, Niemann fumbled and found practically none of the moves in a long sequence, showing that he had simply overlooked one of the most natural and most important continuations. Of course, this could happen to anyone, and it is normal to make a mistake under the tension of the moment, so it would be interesting to have a systematic and controlled experiment in which Niemann was questioned by a group of GMs. This would be a great way to save Niemann's reputation if he's being honest, or to unmask him if he's cheating.
As my friend Felipe Rodrigues commented, there are several videos of live streams in which Niemann plays with real-time transmission and achieves impressive results, reaching 3000 in blitz and bullet. This would be very difficult to achieve through cheating. Niemann even executes many pre-moves, which would not be possible with the help of engines.
There are videos in which Niemann publicly admits to having benefited from the use of engines in online games, when he was 12 years old and when he was 16 years old. But according to Chess.com (https://twitter.com/chesscom/status/1568010971616100352), the number of times Niemann has used engines is greater and covers a longer period of time than he admitted. In addition, the study that we presented above on the variation of the rating as a function of age shows several very strong indications in over-the-board (OTB) events.
Regarding the divergence between Niemann's statement about having stopped using engines and Chess.com's position that Niemann used engines more times than he admits, our study on the evolution of the rating as a function of age helps shed light on this issue. The graph below shows the evolution of Niemann's rating, this time with some specific points marked with a red ellipse, and with two horizontal lines:
In 2014 Niemann's rating was close to 2000 (point A) and dropped below 1880 (B), then it rose in 3 months to 2280 (C), but soon dropped to 2050 (D). Then there was another one of these abnormal rises in a few months, this time reaching almost 2350 (E). From then on, it fluctuated, until in mid-2018 there was another rise (F), in a few months, and again it fluctuated (GH). Finally, from 2021 onwards, we can observe a gradual rise (H) that looks a lot like a natural growth and could even be interpreted as such if there were not all other precedents. But considering the facts as they were, the likely interpretation is as follows:
In 2014 Niemann was close to reaching 2000, a magic number that people want to reach and surpass. He got there, but soon after he dropped below 1900, and apparently he couldn't stand it and didn't know how to deal with that drop. That's when he decided to start cheating. We have no way of knowing if he reached 2000 without using an engine, but, in principle, let's assume that he only started using engines in 2015.
From 1880 he went up to 2280 in 3 months, but obviously when he stopped cheating, he couldn't stay at that level and went back down, to 2050. There was no way to stay at ~2300 unless he kept cheating continuously or genuinely improved his game. Improving one's game is a time-consuming and costly process.
He went up from 2050 to 2350 in 3 months, and he had already “learned” that if he stopped using engines, as he did on two previous occasions, his rating would drop hundreds of points. So he decided to keep the scheme: he played one tournament without an engine and another with an engine, thus avoiding those big drops and big rises that could raise suspicion.
After keeping the rating stable for some time, Niemann decided to go up one more notch, in mid-2018, this time to around 2450. With that he also improved his method, and instead of climbing stairs, he went up a gentle ramp, which arouses less suspicion, and has done so since 2021.
Therefore, the analysis of Niemann's history in OTB events corroborates Chess.com's position, showing that there are strong indications of the use of engines not only in online events but also in OTB events, especially at the ages he confessed to, 12 and 16 years. Even the confession may have been strategic, intended to mitigate his situation by confessing early, before the anomalies in his FIDE rating at those specific moments were discovered. However, his confession is quite clear: he says that he used engines only in online events, not in live events. But the FIDE rating is based exclusively on live events, and the symptoms of manipulation of results are visible precisely in the FIDE rating, which puts Niemann in a delicate situation.
The set of statistical results shows that suspicions related to possible differences between events with DGT and without DGT do not provide evidence to support suspected fraud.
On the other hand, the evolution of Niemann's FIDE rating is different from the evolution of the rating of other players of the world elite and of other players in general, with very unusual particularities in the rate of evolution of the rating as a function of age, not behaving like the typical curves observed in other players nor as generic variables that accompany the natural evolution of cognitive ability.
Niemann spent about 30 months (2016.1 to 2018.6) with his rating hovering close to 2300, then it rose to 2460 in just 3 months. From then on, Niemann claims to have stopped using engines, but instead of his rating returning to 2300, it went up even more.
None of these facts are visible when the case is investigated exclusively through analysis of the game moves, because there are no obvious symptoms of engine use in his games. Fraud detection by this route is difficult, because there are versions of Lc0 with a 2700 rating that make “human” mistakes that older engines (circa 1998) such as Fritz 5, Hiarcs 6, Shredder 5 and Gandalf 6 do not make, let alone the latest engines. Engines like Lc0 can play in a style indistinguishable from that of humans, including making typical human mistakes and making “logical” choices that some super-strong engines don't. This makes it possible to use engines without this being “visible” in an analysis based exclusively on game moves. Nor can one rule out the possibility that a person uses engines in only some moves, not all, which would also make detection more difficult.
For these reasons, Regan's analysis leads to an unfortunately wrong conclusion. I say “unfortunately” because it's sad that this kind of cheating happens in Chess. When Felipe Neto was accused of using engines, my assessment of the case was completely different, because for him Chess is just a game, like a game of truco. “Stealing” in truco may be a joke between friends, but cheating in a serious sports competition, in which official titles, money, recognition, etc. are at stake, is much more serious.
I would like to believe that Niemann is yet another great Chess talent. I even defended him in my preliminary remarks, right after the first game, due to the absence of evidence against him. But I must admit that my initial interpretation was naive and incorrect. I presented arguments in defense of Niemann based on the fact that he plays many moves that differ from those indicated by the engines, and this really shows that he does not use engines in all moves, perhaps not even in 50% of them. But the statistical analysis presented above is very strong and practically conclusive. The fact that he doesn't use engines in every move makes it more difficult to detect evidence of fraud, because his games are very similar to those of a human player rated 2700, with mistakes that engines don't make and with logical moves that a human would play but that engines would avoid through concrete calculation of variations. Analyzing exclusively the quality of the moves in his games, there are no indications of fraud. On the other hand, when we examine the evolution curve of his playing strength, inconsistencies with the hypothesis of innocence emerge, leading to a sad conclusion, but perhaps the one that best harmonizes with the empirically measured facts.
I remember a story told by my friend Marcos Seman in 1998, about a game he played against his friend Álvaro Pena. They were playing a tournament and their game was the last one still in progress; there was no one else in the event room. Seman was in an inferior position, and at one point he made his move, pressed the clock and got up to go to the bathroom. When he returned, he noticed that the clocks were stopped, with Mr. Pena looking at the board. This made Seman furious. He hit the clock hard to start it again, since his opponent seemed to be analyzing without spending his own time. Mr. Pena then stopped the clocks again and explained that he had resigned. Seman was very uncomfortable and asked why he had resigned, and whether he had done something to offend him, which had not been his intention. Mr. Pena explained that while Seman was in the bathroom he had touched a rook, intending to move it, but then saw that he would lose a piece and the game. So he decided to resign. No one was present to see that he had touched the piece, but he knew that he had and that he was obliged to move it, and for him that was enough to honor his commitment to the truth.
On another occasion, in a much older and more famous event, the London tournament of 1922, 16 great players were competing, 4 of them with a chance of taking first place: Capablanca, Alekhine, Vidmar and Rubinstein. With 3 rounds remaining in the event, Capablanca and Vidmar were paired. In those days, games that were not completed in regulation time (2:30h for 40 moves) were adjourned and continued the next day. The player to move recorded his “secret move” without anyone seeing it and placed it in an envelope, which was sealed, signed by both players and kept in the custody of the arbiter. This prevented the opponent from gaining an advantage by analyzing the position for hours, as there was no way to know which move had been made. The next day, the envelope was opened, the secret move was played, and the game continued where it had left off. This happened in the 13th round game between Capablanca and Vidmar, which was adjourned. Soon afterwards, in an informal conversation, Vidmar informed Capablanca that he intended to resign. The following day, Capablanca did not show up. The arbiter started the round, the clocks were started, and Capablanca did not arrive. After almost 1 hour of waiting, Capablanca's time had almost run out and he would have lost, so Vidmar stopped the clocks and declared to the arbiter that he had resigned, handing the point to Capablanca, who became champion of the tournament.
It's a shame that the current Chess stories are so different.
About tests of contrast between means
The purpose of Student's t test is to compare two samples of normally distributed data and to estimate the probability that the observed differences are due to chance. For example, it can be used to check whether the average height of men is different from the average height of women, or whether the average air temperature in January is different from that of February. We know that some women are taller than some men and vice versa, but the question is whether there is a difference between the average height of all women and the average height of all men, or whether the two distributions have the same parameters. Student's t test applies very well here, because the populations are very numerous and the shape of their distributions closely resembles a normal distribution within the range of -2 to +2 standard deviations. So we can see that, although some women are taller than some men, the average height of men is greater. In the same way, we can verify that although some Brazilians are taller than some Argentines and vice versa, when the averages of the two groups are considered there is no statistically significant difference, as shown in the two graphs below.
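For reference, the statistic behind the comparison described above can be written out. This is the standard pooled-variance form of the two-sample Student's t; the symbols (sample means, sample standard deviations and sample sizes of groups 1 and 2) are notation introduced here for illustration, not taken from the article.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
```

Under the hypotheses of normality and equal means, t follows a Student distribution with n_1 + n_2 - 2 degrees of freedom, and the probability quoted by the test is read from that distribution.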
But in the example we are analyzing, of Chess performance ratings, the situation is more complex, because the samples are small and we do not have accurate information about the shape of these distributions, which makes it more difficult to verify if the difference between the results obtained between the games with and without DGT boards is statistically significant. Therefore, it may be necessary to use other tools.
Compared to Student's t test, the Kolmogorov-Smirnov test has the advantage of being more sensitive to the morphology of the distributions and less dependent on the hypothesis of normality. On the other hand, it considers only the maximum distance between the two curves, at their point of greatest separation, rather than being a global measurement that takes all points into account, as Chi-Square and RMS measurements do.
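The "maximum distance between curves" mentioned above can be made concrete with a minimal sketch of the two-sample Kolmogorov-Smirnov statistic. The function name `ks_statistic` and the two small samples are hypothetical, chosen only so the result can be checked by hand; they are not the article's data, and the sketch computes only the statistic D, not its significance level.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical cumulative distributions."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Hypothetical performance ratings, for illustration only.
group_a = [2400, 2450, 2500, 2550]
group_b = [2500, 2550, 2600, 2650]
d_statistic = ks_statistic(group_a, group_b)
print(d_statistic)  # 0.5
```

Because D depends on a single point of greatest separation, two pairs of samples with very different overall shapes can share the same D, which is the limitation noted in the text.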
The Anderson-Darling test has the advantage of being shape-sensitive, like the Kolmogorov-Smirnov test, in addition to better weighting the presence of outliers in dense tails, which makes it one of the most suitable for this specific purpose.
Tukey's HSD can only be applied to samples of the same size, and Tukey-Kramer is a post-hoc test that needs the results of an ANOVA, but it is possible to develop custom tests based essentially on the principles of Tukey's test, without the prior need for an ANOVA or MANOVA. To deal with this problem, I designed a new test, which consists of measuring the difference between each element of one group and each element of the other group, instead of considering only the averages of all elements of each group. However, the standardization of the results and the interpretation of the method have not yet taken a definitive form, so I thought it best to remove this part. Furthermore, as this is a new metric, before putting it to use I intend to publish a work devoted exclusively to describing how it was developed, how it should be applied, what advantages it offers compared to existing tools, etc.
About removing outliers when applying Student's t test
If we removed the result 2893 from the DGT group, the mean of that group would decrease, narrowing the difference, so the probability that the samples show a statistically significant difference should fall. If we removed the result 2077 from the other group, the mean of that group would increase, also narrowing the difference between the groups and, consequently, the probability of a statistically significant difference should again fall. But that's not what happens. Before removing these results, Student's t test indicates a 99.987% probability that the entity that played on DGT boards does not have the same strength as the entity that played on conventional boards, and if we remove these two outliers from the groups, instead of decreasing, this probability increases to 99.992%. This demonstrates a distortion in Student's t test.
This technical detail needs to be discussed in a little more depth. Removing the value 2893 makes the group mean decrease, narrowing the difference, but it also makes the dispersion among the remaining elements decrease, which widens the relative distance (measured in standard deviations). The proportion in which this widening happens is greater than the proportion in which the group mean decreases, because the standard deviation is determined by the sum of the squares of the deviations (exponent 2), while the mean is determined by the values themselves (exponent 1). Therefore, inserting or removing an outlier usually affects dispersion more than central tendency.
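The effect described above can be reproduced with a small sketch. The two groups below are invented for illustration and are not the article's DGT and non-DGT samples; `student_t` is a hypothetical helper implementing the usual pooled-variance t statistic. Each group contains one extreme value lying on the side that enlarges the gap between the means.

```python
import math
import statistics

def student_t(a, b):
    """Pooled-variance two-sample t statistic (classic Student's t)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb))

# Hypothetical data: the last value of `higher` and the first of `lower` are outliers.
higher = [2600, 2620, 2640, 2660, 2680, 2900]
lower = [2200, 2500, 2520, 2540, 2560, 2580]

gap_before = statistics.fmean(higher) - statistics.fmean(lower)
t_before = student_t(higher, lower)

# Drop both outliers: the gap between the means shrinks...
trimmed_high, trimmed_low = higher[:-1], lower[1:]
gap_after = statistics.fmean(trimmed_high) - statistics.fmean(trimmed_low)
# ...but the dispersion shrinks faster, so t grows.
t_after = student_t(trimmed_high, trimmed_low)

print(f"gap {gap_before:.0f} -> {gap_after:.0f}, t {t_before:.2f} -> {t_after:.2f}")
```

In this toy case the difference between the means is halved while the t statistic almost doubles, because the squared deviations contributed by the outliers dominated the pooled variance.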
The problem in this case is that the presence of the element with value 2893 “stretched” the scatter in the direction this element was, but did not necessarily produce a symmetrical effect in the opposite direction. In fact, there is no reason to suppose that it would produce a symmetrical effect in the opposite direction. However, the normality hypothesis, which is adopted when applying these tests, implies symmetry of the distribution, so that the enlargement produced on one side should cause an equal enlargement on the opposite side, even if there is no outlier in that position that explains such an effect. Obviously this is a flaw in the theory, and this flaw cannot be lost sight of when examining the problem.
To better understand this inconsistency, let's analyze an analogous error in which the effect is more evident. The Sharpe Ratio is used to assess the quality of an investment based on risk-adjusted returns. Suppose an investment fund has been producing an average annual return of 12%, with a benchmark of 2% and a volatility of 20%; its annual Sharpe Ratio is then approximately (12% - 2%) / 20% = 0.5. Suddenly this fund hits a good trade and produces a 200% profit in 1 month. Obviously this is good and should imply an increase in the Sharpe Ratio. But the event also increases volatility, and its impact on historical volatility is greater than its impact on historical profitability, for the reasons given in the two previous paragraphs. As a result, the 200% profit can end up reducing the Sharpe Ratio, penalizing the fund's performance when it would be right to reward it.
To deal with this problem, the Sortino index was created, which separately measures positive volatility and negative volatility. In the Sortino index, positive volatility does not penalize the index, which is more logical and more useful for a correct measure of risk-adjusted performance.
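The contrast between the two indices can be illustrated with a short sketch. Everything here is an assumption made for illustration: the ten-year monthly history is synthetic (built so that it reproduces the 12% return, 2% benchmark and 20% volatility of the example), returns are annualized by simple multiplication, and the function names are hypothetical. With a much shorter history the same windfall can raise the Sharpe Ratio instead, so the direction of the effect depends on the length of the track record.

```python
import math
import statistics

RISK_FREE_ANNUAL = 0.02  # the 2% benchmark from the example

def annual_sharpe(monthly_returns):
    """Excess annual return divided by total annualized volatility."""
    annual_return = statistics.fmean(monthly_returns) * 12
    annual_vol = statistics.pstdev(monthly_returns) * math.sqrt(12)
    return (annual_return - RISK_FREE_ANNUAL) / annual_vol

def annual_sortino(monthly_returns):
    """Same numerator, but only returns below the target count as risk."""
    target = RISK_FREE_ANNUAL / 12
    downside_sq = [min(0.0, r - target) ** 2 for r in monthly_returns]
    annual_downside = math.sqrt(statistics.fmean(downside_sq)) * math.sqrt(12)
    return (statistics.fmean(monthly_returns) * 12 - RISK_FREE_ANNUAL) / annual_downside

# Synthetic 10-year history: 1% mean monthly return, alternating +/- swing,
# with the swing chosen so that annualized volatility is 20%.
swing = 0.20 / math.sqrt(12)
history = [0.01 + swing, 0.01 - swing] * 60
with_windfall = history + [2.0]  # one extra month with a 200% profit

sharpe_before, sharpe_after = annual_sharpe(history), annual_sharpe(with_windfall)
sortino_before, sortino_after = annual_sortino(history), annual_sortino(with_windfall)

print(f"Sharpe:  {sharpe_before:.3f} -> {sharpe_after:.3f}")
print(f"Sortino: {sortino_before:.3f} -> {sortino_after:.3f}")
```

Under these assumptions the Sharpe Ratio falls after the windfall while the Sortino index rises, since the large positive month adds nothing to the downside deviation.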
There are several other flaws in the Sharpe Ratio (and the Sortino Ratio), but it is not our purpose to discuss these points here. The important thing is that the error that occurred in Student's t when applied here is analogous to the error cited in the Sharpe index, in which a distortion on one side of the distribution does not imply the occurrence of another symmetric distortion on the opposite side, without there being any element in that position that justifies this interpretation.
About inaccuracies in the FIDE formula for rating performance
The conceptual bases on which the Elo system was created are very well founded, resting on the following principle: over a large number of games, if player A beats player B in a ratio of 3:1 and player B beats player C in a ratio of 5:1, then A must beat C in a ratio of 15:1, since 3x5=15, and the same applies to any other ratios. From there, a mathematical model is developed to relate a rating on a logarithmic scale to the ratio of winning probabilities, according to the formula presented below:
RP = RM + 400 × log10(p / (1 - p))
Where “RP” is the performance rating, “RM” is the average rating of the opponents, and “p” is the fraction of points obtained.
Thus, if the player scores 100% of the points, the term “1-p” is equal to 0, causing a division by zero. At the opposite extreme, if the player scores 0 points, “log(0)” occurs. Neither log(0) nor division by 0 is a defined operation in ordinary arithmetic.
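A minimal sketch of these two points, assuming the common logistic form of the model, RP = RM + 400·log10(p / (1 - p)). The function names are hypothetical, and FIDE's published conversion table need not coincide with this closed form. The sketch checks the multiplicative ratio principle (3:1 and 5:1 giving 15:1) and shows where the formula breaks down.

```python
import math

def expected_score(rating_diff):
    """Expected score fraction for a player rated `rating_diff` points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def rating_diff_from_score(p):
    """Rating difference implied by a score fraction p; defined only for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        # p = 1 would divide by zero, p = 0 would take log10(0).
        raise ValueError("performance is undefined for a 0% or 100% score")
    return 400.0 * math.log10(p / (1.0 - p))

def performance_rating(avg_opponent_rating, p):
    return avg_opponent_rating + rating_diff_from_score(p)

# Ratio principle: A beats B 3:1 (p = 3/4), B beats C 5:1 (p = 5/6).
diff_ab = rating_diff_from_score(3 / 4)
diff_bc = rating_diff_from_score(5 / 6)
# Rating differences add, so the odds multiply: A should beat C 15:1 (p = 15/16).
p_ac = expected_score(diff_ab + diff_bc)
print(round(p_ac, 6), 15 / 16)

try:
    performance_rating(2500, 1.0)
    perfect_score_defined = True
except ValueError:
    perfect_score_defined = False
```

Raising an error for p = 0 or p = 1 is one possible design choice; it keeps the singularity visible instead of silently substituting a ceiling such as 800 points.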
To circumvent this problem, FIDE adopts a table for calculating performance ratings with a ceiling of 800 points. This value may be different in different situations. Some use 736 (which actually corresponds to 99.5%), or 677 (which corresponds to 99%), or 400 (which would be the result of linearization of the function). In table B, for example, one of the problems is that in each event a different value was used to add to the average rating in cases where the player scored 100%.
This patch used by FIDE is useful from a practical point of view, but it is inappropriate from a mathematical and logical point of view, because it means that if a player with a rating of 2810 plays against another with a rating of 2000, the probability of success of the 2810 player is greater than 100%. This is obviously wrong. Another interpretation would be that 800 points establishes an asymptotic threshold, so that all differences greater than 800 points would indicate a 100% winning probability. But this also creates inconsistencies. Just think of a game between two engines rated 3800 and 3000: the difference is 800 points, but the percentage of draws is high, so the interpretation that a difference of 800 rating points corresponds to a 100% probability of victory produces serious distortions.
Part of the problem is the quantitative and conceptual significance of the draw in Arpad Elo's model. Elo himself had already recognized it as a distortion in his 1978 book, and it remains a difficult issue to deal with, because the Elo system is basically the application of Rasch's dichotomous model (1 or 0) as a measure of Chess performance, while the possible outcomes in Chess are trichotomous (win, loss, draw). Elo comments in his book that he tried to use a trinomial distribution, but without much success, so he ended up using a dichotomous model with 1 for victory and 0 for defeat, assigning 0.5 to the draw and thus treating it as the arithmetic mean between victory and defeat. This interpretation has been shown to be inaccurate, as Elo's initial studies on the topic already anticipated, but it has not been properly corrected.
Arpad Elo understood that the draw weight should not be exactly 0.5, but it was a reasonable approximation, especially considering the meager computational resources of the time. So even though his system was incomplete and had some distortions, it represented an important advance compared to previous methods of rating measurement, such as the old Harkness system from the 1950s. Also, the draw weight is not the same for all ratings and all time controls, as I demonstrate in my book on this topic. For higher ratings or for longer time controls, the probability of a draw increases for the same rating difference, which is quite logical and intuitive, but the formula used by FIDE is inconsistent with this fact, as are the formulas of Glicko2, Sonas, etc., which improve some details of the original Elo system but fail to fix this and other issues. There is Miguel Ballicora's Ordo system, which alleviates the problem but does not correct it.
The formula used by the USCF is essentially the same as that of FIDE, therefore subject to the same distortions and limitations. There are minor differences, such as the criteria for determining the value of constant k in rating updates, which imply a slightly higher average USCF rating. The way USCF treats the constant k is conceptually better compared to FIDE (produces less distortion and in a smoother way), but numerically the difference is small and ultimately both formulas are affected in similar proportions.
One of the important virtues of the Elo system, when compared to other sports rankings, is that it preserves its consistency over any rating range: a difference of Δ points indicates the same probability of success in any region of the scale. That is, a player at 2600 beats a player at 2500 in the same proportion as a player at 1600 beats a player at 1500, and this proportion is roughly maintained in any other rating range. There are minor anomalies due to the draw issue mentioned above, and a few other minor distortions, but basically that's about it. However, for this property to be maintained and for the system to be consistent, some care is needed in the rating update process, in performance rating calculations, etc., and the way in which FIDE determines the values of the constant k does not harmonize with the original proposal by Elo, implying distortions. In addition, the FIDE formula for calculating performance ratings generates other inconsistencies. The USCF solution is also not ideal, but has less distortion. Mark Glickman's model is comparatively better and has been adopted by online gaming platforms with relative success, although it also has points that could be improved. There are other alternatives that are more accurate and more complete than these, some of which are applied in other areas, such as the Financial Market, for example for ranking genotypes in processes for optimizing investment strategies.
For these and other reasons, careful statistical analysis should not use the performance rating formula recommended by FIDE. If the purpose were to calculate the “official” rating, the use of the FIDE method would be mandatory, regardless of distortions. But if the goal is a correct statistical calculation, then it is necessary to approach the question through a consistent and well-calibrated model.
At this juncture, there are different ways of dealing with the situation in which a player scores 100% of the points. One is simply following the FIDE method, but being aware of the distortions. There is also a suggestion by Ken Thompson (co-creator of UNIX and pioneer of tablebases) to deal with these singularities, which unfortunately also has inconsistencies. There are Bayesian alternatives to use a different ceiling according to the number of games played and/or according to the dispersion in the results. One can also consider the weighted average between the player's current rating and the performance rating, balancing the weights according to the number of games in the event and the total number of games the player has played, among other alternatives.
In the specific case of this article, the solution that seemed most appropriate to me was to treat the set of all games with DGT as if they were part of a single event with 90 games, and the same for the 83 without DGT. In this way we can eliminate divisions by 0, without having to make arbitrary adjustments or botches.
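The pooling approach can be sketched as follows. The event list is invented for illustration (it is not the article's 90 DGT or 83 non-DGT games), the function names are hypothetical, and the logistic form RP = RM + 400·log10(p / (1 - p)) is assumed. Using the game-weighted average of the opponents' ratings for the pooled event is also a simplifying assumption.

```python
import math

def performance_rating(avg_opponent_rating, score, games):
    """Logistic performance rating; returns None when the score is 0% or 100%."""
    p = score / games
    if not 0.0 < p < 1.0:
        return None  # singular case: log10(0) or division by zero
    return avg_opponent_rating + 400.0 * math.log10(p / (1.0 - p))

def pooled_performance(events):
    """Treat several events as one: sum games and points, weight opponents by games."""
    total_games = sum(games for _, _, games in events)
    total_score = sum(score for _, score, _ in events)
    avg_opponent = sum(rating * games for rating, _, games in events) / total_games
    return performance_rating(avg_opponent, total_score, total_games)

# Hypothetical events: (average opponent rating, points scored, games played).
events = [
    (2400, 5.0, 5),  # perfect score: no finite per-event performance
    (2500, 6.0, 9),
    (2450, 4.5, 9),
]

per_event = [performance_rating(*event) for event in events]
pooled = pooled_performance(events)
print(per_event[0], round(pooled, 1))
```

The perfect 5/5 event has no finite performance rating on its own, but once it is merged with the other events the pooled score fraction lies strictly between 0 and 1, and no ceiling value has to be chosen.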