# Surprising, but banal coincidences

## Errors in judgment lead us to see incredible phenomena in certain coincidences and to seek impossible explanations for them.

It is true that, on average, in a primary school, the faster pupils read, the taller they are. Should we conclude that learning to read makes you grow taller? More seriously, a statistical study by Franz Messerli of Columbia University, conducted with all the necessary methodological rigor and published in 2012 in the New England Journal of Medicine, established a close correlation between a country’s per capita chocolate consumption and the number of Nobel Prizes it has won per million inhabitants. Should we conclude that researchers must eat chocolate to increase their chances of being awarded the famous prize?

Isn’t the explanation rather that the social and educational system of rich countries favors good research, and therefore Nobel Prizes, and that this same wealth makes chocolate affordable to all? The two facts are indeed linked, but only because they are consequences of the same cause, not because one implies the other. When two facts are correlated, it does not mean that one is the consequence of the other; sometimes a third factor simply causes them both. The search for causal links between facts must be carried out with caution.
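The role of a common cause can be illustrated with a small simulation. The sketch below is purely hypothetical: the names `wealth`, `chocolate` and `nobels`, the Gaussian noise and the number of simulated countries are assumptions made for illustration and have nothing to do with Messerli’s data. Two quantities are each driven by the same hidden factor and have no direct link; they nevertheless come out clearly correlated, and the correlation vanishes once the effect of the common factor is removed.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate_common_cause(n_countries=5000, seed=0):
    """Two quantities driven by one hidden factor, with no direct link between them."""
    rng = random.Random(seed)
    wealth = [rng.gauss(0, 1) for _ in range(n_countries)]
    chocolate = [w + rng.gauss(0, 1) for w in wealth]  # depends on wealth only
    nobels = [w + rng.gauss(0, 1) for w in wealth]     # depends on wealth only
    raw = pearson(chocolate, nobels)
    # Subtract the (simulated, hence known) effect of wealth from both quantities.
    adjusted = pearson([c - w for c, w in zip(chocolate, wealth)],
                       [n - w for n, w in zip(nobels, wealth)])
    return raw, adjusted

raw, adjusted = simulate_common_cause()
print(f"raw correlation: {raw:.2f}, after removing wealth: {adjusted:.2f}")
```

In real data the common factor is not known so conveniently, which is precisely why causal conclusions call for caution.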

# Pure luck?

Apart from those situations where, despite everything, the identified correlation has a satisfactory explanation involving a third factor, such as age or economic wealth, there are cases where, ultimately, the only explanation is chance.

The few examples given by the figure below make it possible to understand what this means. These illusory correlations were painstakingly collected by Tyler Vigen when he was a student at Harvard Law School in Cambridge, Massachusetts. He now works at the Boston Consulting Group and has published a book with his amusing findings (see bibliography and sites www.tylervigen.com/spurious-correlations or http://tylervigen.com/old-version.html).

Particularly disturbing is the sight, on the same graph, of the curve showing year after year the number of suicides by hanging or suffocation in the United States and the curve showing year after year American spending on scientific research. When one of the curves goes down or up, the other follows it in parallel. The synchronization is almost perfect. In mathematical terms, it translates into a correlation coefficient of 0.992082, close to 1, the maximum possible. Does this establish that there is a real connection between the two sets of numbers?

This is so unlikely that we have to face the facts: it is only a matter of chance, or as they say, a coincidence. The only reasonable way to explain the parallelism of the two curves and the many cases of the same type proposed by Tyler Vigen is as follows:

- He collected a very large number of statistical series.

- He was able to compare them systematically, which allowed him to find pairs with similar shapes, hence correlation coefficients close to 1.

- We should not be surprised: this was inevitable given the very large number of numerical series taken into account.

The sheer amount of raw data is the explanation, and Tyler Vigen has detailed his method: after gathering thousands of numerical series, he handed them to his computer so that it would systematically search for pairs of series with high correlation coefficients; he then asked the students of his faculty to vote for the pairs of curves they found most spectacular among those preselected by the computer.
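The effect of such a systematic search is easy to reproduce. The following sketch does not use Tyler Vigen’s data or code: it assumes, for illustration only, 200 independent random walks of 10 points each (standing in for unrelated yearly statistics) and looks for the most strongly correlated pair among the 19,900 possible ones.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def random_walk(rng, length):
    """A series whose successive values drift by independent Gaussian steps."""
    value, walk = 0.0, []
    for _ in range(length):
        value += rng.gauss(0, 1)
        walk.append(value)
    return walk

def dredge(n_series=200, length=10, seed=0):
    """Return the most correlated pair among independent random series."""
    rng = random.Random(seed)
    series = [random_walk(rng, length) for _ in range(n_series)]
    best = max(combinations(range(n_series), 2),
               key=lambda pair: abs(pearson(series[pair[0]], series[pair[1]])))
    return best, pearson(series[best[0]], series[best[1]])

best_pair, best_r = dredge()
print(f"series {best_pair[0]} and {best_pair[1]}: r = {best_r:.4f}")
```

Although every series is generated independently of the others, the best pair shows a correlation coefficient very close to 1 in absolute value, simply because so many pairs were tried.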

This procedure is known as data dredging. It is important to be aware of the dangers that such processing creates, especially today, when collecting and exploiting colossal amounts of information of all kinds has become easy and widespread. The young discipline of data mining could unwittingly fall victim to these illusory correlations: without precautions, it can mistake them for real links between data series that are in reality completely independent.

# A victim: medical research

Medical research is frequently confronted with the problem of such questionable correlations. In a 2011 article, Stanley Young and Alan Karr of the American Institute for Statistical Studies cited 12 published studies that allegedly established links between vitamin consumption and the occurrence of cancer. Some of these studies had been carried out with a placebo-controlled, double-blind protocol, and therefore appeared serious. Yet when they were repeated, in some cases multiple times, the new results contradicted the original ones. Besides the great temptation to round off the figures a little so that they point clearly in one direction and allow an article to be published, it may simply be that the authors of these irreproducible studies fell victim to an illusory correlation of the kind Tyler Vigen has spotted so many of.

This misleading chance resulting from the exploration of too many combinations is a statistical error of judgment. We are mistaken in believing that an observation is significant and must therefore be explained, whereas it had to occur (it or another of the same type) because of the large number of hypotheses considered, sometimes collectively by all the research teams, in the quest for statistical regularities.

# Data snooping

This illusion is close to the one known as data snooping, which can be used for deliberate deception. To illustrate how it works, Halbert White of the University of California at San Diego suggested the following method by which a financial newspaper could increase its number of subscribers.

The first week, the newspaper sends a free copy to 20,000 people; half of the copies sent state that the stock index (e.g. the Dow Jones) will go up, while the other half state that it will go down. Depending on whether the index has actually risen or fallen, the newspaper sends a second free issue the following week to the 10,000 addresses that received the correct forecast, with half of the copies announcing that the index will go up over the coming week and the other half that it will go down. In the third week, the newspaper sends 5,000 free issues to those who have received the right forecast two weeks in a row, and so on.

Each time, the newspaper stresses that it has correctly forecast the trend for several weeks and encloses a subscription form. After 10 weeks, only about 20 copies remain to be sent, but few potential readers wait for 10 correct forecasts in a row before being convinced that the newspaper can predict the trend of the stock index. Consequently, a good number of new subscriptions will have been taken out as the mailings progress.
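The arithmetic of the scheme can be checked in a few lines. The function below (its name and parameters are ours, chosen for illustration) simply halves the list of addresses each week, keeping those that happened to receive the forecast that came true.

```python
def remaining_recipients(initial=20_000, weeks=10):
    """Number of addresses that have seen only correct forecasts after each week."""
    counts = []
    recipients = initial
    for _ in range(weeks):
        recipients //= 2  # only half of the copies carried the forecast that came true
        counts.append(recipients)
    return counts

counts = remaining_recipients()
print(counts)  # [10000, 5000, 2500, 1250, 625, 312, 156, 78, 39, 19]
```

After 10 weeks, 19 addresses, about 20 as stated above, have received nothing but correct forecasts, without the newspaper having any predictive ability whatsoever.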

Another case of data snooping, in the processing of financial data, leads to bitter disillusionment. We look for rules of the type “If today the price of share A rises and the price of share B falls and …, then the price of share X will rise tomorrow”. We write a large number of them; we may even write all the rules of this type with 10 elements in their premises. The best rules are then selected using a series of past data. We eliminate all the rules that were wrong on the past data, and retain only those that always gave a good prediction, or those that were right most often. We will then inevitably end up with a small set of rules which, had they been applied to the test data, would have made a lot of money … and yet which will yield nothing in the weeks after they are put to use for buying and selling.

Of course, researchers are familiar with the trap, and methods have been developed to avoid falling victim to the illusion. For example, before searching for the right rules, we can try to evaluate the probability that, when the data are drawn at random, one of the rules in the list under consideration works purely by chance, and we make sure that this probability is close to 0.
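The disillusionment can be reproduced on data that are random by construction. The sketch below is a deliberately simplified assumption, not an actual trading method: each hypothetical rule looks at three shares out of 20 (rather than 10 elements) and predicts tomorrow’s move of share X from the product of their moves today. All moves are independent coin flips, so no rule can have any real predictive value.

```python
import random
from itertools import combinations

def make_days(rng, n_days, n_stocks):
    """Each day: random up/down moves (+1/-1) today, and an independent move of X tomorrow."""
    return [([rng.choice((-1, 1)) for _ in range(n_stocks)], rng.choice((-1, 1)))
            for _ in range(n_days)]

def accuracy(rule, days):
    """A rule (i, j, k, sign) predicts X tomorrow as sign * move_i * move_j * move_k."""
    i, j, k, sign = rule
    hits = sum(sign * moves[i] * moves[j] * moves[k] == target
               for moves, target in days)
    return hits / len(days)

def snoop(n_stocks=20, n_past=60, n_future=2000, seed=0):
    """Select the best rule on past data, then measure it on fresh data."""
    rng = random.Random(seed)
    past = make_days(rng, n_past, n_stocks)
    future = make_days(rng, n_future, n_stocks)
    rules = [(i, j, k, sign)
             for i, j, k in combinations(range(n_stocks), 3)
             for sign in (-1, 1)]
    best = max(rules, key=lambda rule: accuracy(rule, past))
    return accuracy(best, past), accuracy(best, future)

past_acc, future_acc = snoop()
print(f"best rule: {past_acc:.0%} correct on past data, {future_acc:.0%} afterwards")
```

Among the 2,280 rules tried, the best one looks impressive on the 60 past days, yet on new data it does no better than a coin toss. Running such a search on random data is one way of estimating how often a rule succeeds by chance alone.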

# Lottery rigged?

Who would consider it normal for a lottery to draw the same series of 6 numbers twice, a few days apart? Yet this is what happened on September 10, 2009 in the Bulgarian Lotto. The series 4, 15, 23, 24, 35, 42 is nothing out of the ordinary, except that it had already been drawn 6 days before, on September 4, by the same Bulgarian Lotto. The coincidence seemed so incredible that Sports Minister Svilen Neikov called for an investigation. No cheating was detected, and moreover both draws had taken place in front of television cameras. The event was considered so extraordinary that the press all over the world reported it.

Perhaps more surprisingly, a year later, on September 21 and October 16, 2010, less than a month apart, the Israeli Lotto drew the series 13, 14, 26, 32, 33, 36 twice, again causing worldwide astonishment. As David Hand explained in his article “Not so strange coincidences” (Pour la Science n° 438, April 2014), if we take into account the number of lotteries in the world and the number of draws each of them holds, often several times a week, our astonishment should cease.

Specifically, consider the Bulgarian Lotto, where 6 numbers are drawn between 1 and 49, which gives a probability of winning of 1/13,983,816. How many draws are needed before the event “two of them give the same result” has a probability greater than 50% of occurring? The answer is 4,404.

The calculation is analogous to the one made for the famous birthday paradox, according to which as soon as 23 people are gathered, the probability that two of them share the same birthday exceeds 50%. This shows that the events perceived as extraordinary in the Bulgarian and Israeli lotteries are in fact bound to happen. It is not the occurrence of identical draws that is improbable, but the reverse: if all the draws were always different, we should be surprised and look for an explanation.
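Both the figure of 4,404 draws and the threshold of 23 people can be checked with the same short computation. The sketch below assumes that all outcomes are equally likely and that draws are independent; the function name is ours. It multiplies, draw after draw, the probability that the new result differs from all the previous ones, and stops as soon as the probability of at least one repetition exceeds the threshold.

```python
from math import comb

def draws_for_repeat(outcomes, threshold=0.5):
    """Smallest number of draws for which P(at least two identical results) > threshold."""
    p_all_distinct = 1.0
    n = 0
    while 1.0 - p_all_distinct <= threshold:
        # The (n+1)-th draw must avoid the n results already seen.
        p_all_distinct *= 1.0 - n / outcomes
        n += 1
    return n

lotto_outcomes = comb(49, 6)             # 13,983,816 possible series of 6 numbers out of 49
print(draws_for_repeat(lotto_outcomes))  # 4404
print(draws_for_repeat(365))             # 23
```

At two draws a week, 4,404 draws represent a little over 42 years for a single lottery, and there are many lotteries in the world.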

Another source of astonishment comes from rare events, such as plane crashes, occurring in close succession. Journalists like to invoke an alleged “law of series”, unknown to mathematicians, which is supposed to explain these clusters, considered both highly improbable and yet accounted for by this elusive law … Jacques Chirac put it simply: “Troubles always fly in squadrons!” Some analyses attempt to identify the origin of this belief by speaking of an “excessive expectation of spreading out” (see my article “Our vision of chance is quite hazardous”, Pour la Science n° 293, March 2002): our intuition wrongly tells us that, for example, the dates of plane crashes should be evenly spaced, whereas statistics show us otherwise.

This excessive expectation of spreading out has been clearly analyzed in the specific case of aircraft accidents by Élise Janvresse and Thierry de la Rue, of the University of Rouen (see box 3 on page 111).
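A simple simulation shows how far random dates are from being evenly spaced. It is only an illustrative sketch, not the analysis of Élise Janvresse and Thierry de la Rue: we assume 10 events per year, each falling at a uniformly random moment, independently of the others, and we measure the shortest interval between two consecutive events.

```python
import random

def smallest_gap(n_events, period_days, rng):
    """Shortest interval between consecutive events placed uniformly at random."""
    dates = sorted(rng.uniform(0, period_days) for _ in range(n_events))
    return min(later - earlier for earlier, later in zip(dates, dates[1:]))

def average_smallest_gap(n_events=10, period_days=365, trials=10_000, seed=0):
    """Average of the shortest interval over many simulated years."""
    rng = random.Random(seed)
    total = sum(smallest_gap(n_events, period_days, rng) for _ in range(trials))
    return total / trials

even_spacing = 365 / 10
typical_min_gap = average_smallest_gap()
print(f"evenly spaced events: {even_spacing:.1f} days apart")
print(f"average shortest gap between random events: {typical_min_gap:.1f} days")
```

Where intuition expects intervals of about 36 days, the two closest events of a typical simulated year are only a few days apart, with no “law of series” involved.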

# Very different series, but with the same statistic

Our common sense fails when dealing with probabilities, and we are surprised in cases where there is no need to be. Using the word “coincidence” explains nothing, or else it brings in strange ideas such as Carl Gustav Jung’s synchronicity or Rupert Sheldrake’s morphic fields, whose existence scientists have sought in vain to prove (see for example www.skeptics.qc.ca/dictionary/).

This difficulty our minds have in perceiving probabilities correctly has only recently been highlighted. We are tempted to think that if data sets have similar statistical parameters, they must necessarily resemble each other; Justin Matejka and George Fitzmaurice of Autodesk Research in Toronto have deliberately created magnificent examples showing that this is wrong.

To generate their series, they developed a technique that allows them to play easily with the shapes of the graphs created. Thus, the figure in box 4 opposite shows a collection of 13 series of pairs of numbers (x, y) which, despite their very different graphs (each pair (x, y) representing a point in the plane), share the same five parameters: mean of x, standard deviation of x, mean of y, standard deviation of y, and correlation between x and y.
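The phenomenon can be verified on a much older and smaller example. The code below does not use the 13 series of Matejka and Fitzmaurice, which are not reproduced here, but two of the four data sets of Anscombe’s classic quartet (1973): in the first, the points are scattered around a straight line; in the second, they follow a smooth curve. The function `summary` is our own helper computing the five parameters listed above, rounded to two decimal places.

```python
def summary(xs, ys, digits=2):
    """Mean of x, std of x, mean of y, std of y, correlation of x and y (rounded)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    stats = (mean_x, (ss_x / (n - 1)) ** 0.5,
             mean_y, (ss_y / (n - 1)) ** 0.5,
             ss_xy / (ss_x * ss_y) ** 0.5)
    return tuple(round(s, digits) for s in stats)

# Anscombe's quartet, data sets I and II (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

print(summary(x, y_linear))  # (9.0, 3.32, 7.5, 2.03, 0.82)
print(summary(x, y_curved))  # (9.0, 3.32, 7.5, 2.03, 0.82)
```

The two summaries are identical to two decimal places, although the two clouds of points look nothing alike once plotted.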

# Simple and unexpected

Still with the aim of understanding and classifying the situations where our mind is surprised when it should not be, let us now look in detail at a less well-known case, linked to a fairly recent theory. A poor understanding of what is probable because it is simple leads us to perceive certain events as surprising when they are not, and therefore to believe that we are in the presence of miraculous coincidences lacking a good explanation, whereas the situation is banal.

The most general notion of simplicity is that which comes from algorithmic information theory. As we do not always have a good understanding of it, it leads us to see complex and unexpected things when there are in fact only simple things. This algorithmic information theory, or Kolmogorov’s theory of complexity, measures the complexity of an object by the size of the smallest program that generates it. It applies to digital objects, or objects capable of being represented digitally, such as images, sounds, films, and to most objects in the real world, if only their appearance is taken into account.

The theory suggests that among objects using the same number of bits of information (e.g. images of one million pixels), the simplest, i.e. those with the lowest Kolmogorov complexity, are those that we will meet the most frequently. This follows from a theorem by Leonid Levin according to which the higher the Kolmogorov complexity of an object or structure, the less likely it is that this object or structure is produced by a machine taken at random.

If one thinks of the physical world as a kind of large system of interactions carrying out a form of computation that produces the objects and structures found in it, then it is natural to think that the objects present in the world appear there with a probability linked to their Levin measure, and therefore that the simplest in the sense of Kolmogorov complexity are the most frequent. Details on this idea are given in an article by Hector Zenil and myself (see bibliography). Note that this idea is consistent with intuition: in a digital black-and-white image representing a photo of the real world, we are more likely to encounter the series of 10 white pixels, bbbbbbbbbb, which has a low Kolmogorov complexity, than the specific series bnbbnnbnbb, whose Kolmogorov complexity is greater.
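Kolmogorov complexity cannot be computed exactly, but the size of a compressed file gives a rough upper bound on it, since the compressed data together with the decompression program form one particular program producing the object. The sketch below relies on that approximation, which is an assumption of ours and not the method of the article cited above. It compares a row of 10,000 white pixels with a row of 10,000 pixels chosen at random, using the same letters as in the text (b presumably for a white pixel, n for a black one).

```python
import random
import zlib

def compressed_size(pixels: str) -> int:
    """Size in bytes of the zlib-compressed row: a crude upper bound on its complexity."""
    return len(zlib.compress(pixels.encode("ascii"), 9))

rng = random.Random(0)
uniform_row = "b" * 10_000
random_row = "".join(rng.choice("bn") for _ in range(10_000))

print("uniform row:", compressed_size(uniform_row), "bytes")
print("random row: ", compressed_size(random_row), "bytes")
```

The uniform row shrinks to a few dozen bytes, while the random row cannot be reduced much below the 10,000 bits of information it contains, that is, about 1,250 bytes.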

A star is spherical, as are many fruits. The section of a tree trunk is circular, as are the wheels of our vehicles. A wheat stalk is perfectly straight, like the edges of a crystal. The surface of a lake is flat, like the skin of many animals when viewed up close. All of this corresponds to simple forms in the sense of Kolmogorov’s theory. In these cases, we therefore do not try to find a common origin for two objects that are spherical, rectilinear, or flat. Simple shapes do not necessarily have a common origin; their simplicity is enough to explain why they are found everywhere. So far, so good.

On the other hand, some objects or shapes that Kolmogorov’s theory of complexity identifies as simple are not perceived as such by our immediate judgment. A structure like that of the Sierpiński triangle (see the figure above) appears complex to us, although it is not, since very short programs generate it. If we often meet this shape in nature, especially on the surface of seashells, it is because this structure is simple, like a sphere or a line segment.
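To see how short a program generating this structure can be, here is one possible example among many, based on a well-known property: the cell in column x and row y belongs to the triangle exactly when the binary representations of x and y have no 1 bit in common.

```python
def sierpinski(size):
    """Rows of a Sierpiński triangle: cell (x, y) is filled when x and y share no 1 bit."""
    return ["".join("#" if (x & y) == 0 else " " for x in range(size))
            for y in range(size)]

print("\n".join(sierpinski(16)))
```

The rule fits on a single line, and increasing `size` to a larger power of 2 reveals ever finer levels of the same pattern without making the program any longer.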

We should therefore not marvel at these multiple unrelated encounters, nor above all imagine that they come from a sort of secret functioning of the Universe which remains to be understood. As with the sphere, their simplicity is the explanation for their frequent appearance.

What is true for the Sierpiński triangle is also true for the Fibonacci sequence, which some people so often, and mistakenly, marvel at. It is also true for the golden ratio, which arouses numerological superstition, and for a multitude of fractal forms that are simple and that one should therefore not be surprised to find. It is normal for objects of low Kolmogorov complexity to be found everywhere. To seek a deep explanation for the repeated presence of the Fibonacci sequence in all kinds of natural or artificial objects is as naive as to seek a common explanation for the presence of long rectilinear segments in trees, in the lines drawn by geological layers, in stalagmites, or in the sky when a meteorite enters the Earth’s atmosphere.

We easily perceive the simplicity of some shapes, but for others we need to think about the theory that allows us to understand simplicity. If we succeed, we will be less inclined to want to explain what we perceive as coincidences. Here, as with the identical double Lotto draws or the parallel curves showing illusory correlations, we must avoid looking for common causes for what, logically, does not need them.