An explanation for Benford’s law
Benford’s law, which relates to the first significant digit of numbers, has lost its mystery. At the same time, it has been generalized and, thus, has become more effective in detecting fraudulent data.
Benford’s law, or the “law of the first significant digit”, continues to intrigue, to fascinate, and to give rise to studies and applications. Over 130 scientific papers have been published on the subject over the past five years (see http://www.benfordonline.net).
Some find it mysterious, while others believe they understand its nature and offer explanations. It has been generalized and used to test data and spot fraud. We present here some recent ideas on this strange and exciting subject.
The law of the first digit
The numbers we encounter when measuring the populations of cities, the distances between stars, or the prices of products in a large supermarket exhibit a surprising property. In such series, the proportion of numbers whose first significant digit is 1 is greater than the proportion whose first significant digit is 2, itself greater than the proportion whose first significant digit is 3, and so on.
The law of the first significant digit states precisely that, in a general context and absent particular reasons to the contrary, the probabilities of encountering the various digits at the head of numbers are respectively:
p(1) = 30.1%
p(2) = 17.6%
p(3) = 12.5%
p(4) = 9.7%
p(5) = 7.9%
p(6) = 6.7%
p(7) = 5.8%
p(8) = 5.1%
p(9) = 4.6%.
Although today referred to as Benford’s law, this law was first formulated by the Canadian astronomer Simon Newcomb in 1881. He had noticed that the first pages of logarithm tables were more worn than the following ones. His article went unnoticed until, 57 years later, the American physicist Frank Benford also observed the uneven wear of the pages of numerical tables. Newcomb’s and Benford’s articles propose the same formula: the probability of encountering the digit c as the first significant digit of a number is, according to them, log₁₀(c + 1) − log₁₀(c), where the logarithm used is the decimal (base-10) logarithm, which, recall, satisfies:
log₁₀(1) = 0;
log₁₀(10ⁿ) = n for any integer n;
log₁₀(ab) = log₁₀(a) + log₁₀(b);
log₁₀(a/b) = log₁₀(a) − log₁₀(b).
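To make the formula concrete, here is a minimal Python sketch (our illustration, not part of Newcomb’s or Benford’s articles) that reproduces the table of probabilities above:

```python
import math

# First-digit probabilities predicted by Benford's law:
# p(c) = log10(c + 1) - log10(c), for c = 1, ..., 9.
for c in range(1, 10):
    p = math.log10(c + 1) - math.log10(c)
    print(f"p({c}) = {100 * p:.1f}%")  # prints 30.1%, 17.6%, ..., 4.6%
```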
Not all statistical series obey Benford’s law. The height of adult humans measured in centimeters starts, with rare exceptions, with the digit 1 and therefore does not obey it. Likewise, motor vehicle registration numbers are, in each country, most often evenly distributed: as many numbers start with 1 as with 2, etc. For the law to manifest itself, it seems necessary that the numbers of the series under consideration take values spanning several orders of magnitude (as is the case for city sizes), and that they be fairly regularly spread (see Box 1 for further details).
For some numerical sequences, Benford’s law is not merely conjectured but proven. This is the case for the powers of 2 (2, 4, 8, 16, 32, …). It has been shown that, in the limit, the proportion of powers of 2 whose first digit is 1 is exactly log₁₀(2) − log₁₀(1), that the proportion of powers of 2 whose first digit is 2 is log₁₀(3) − log₁₀(2), etc.
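This can be checked empirically in a few lines of Python; the bound of 5,000 exponents below is an arbitrary choice made for the sketch:

```python
import math
from collections import Counter

# Count the first significant digits of 2^1, ..., 2^5000 and compare
# the observed proportions with log10(c + 1) - log10(c).
counts = Counter(int(str(2 ** n)[0]) for n in range(1, 5001))
for c in range(1, 10):
    observed = counts[c] / 5000
    predicted = math.log10(c + 1) - math.log10(c)
    print(f"digit {c}: observed {observed:.4f}, predicted {predicted:.4f}")
```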
The expected uniform spread
Can we formulate a simple explanation of what we observe, and of what can be proved for certain numerical sequences? It seems so. The one we now present convinces most of those who make the effort to understand it. It may seem a little complicated, but it is the most general intuitive justification known, and what it suggests is useful, as we shall see.
The explanation rests on a first law which is rarely stated, no doubt because it is considered too simple, and which we will call “the law of uniform spreading of the fractional part”. Before formulating it, let us fix some notation that will help us express things efficiently. Let r be a real number (for example r = 2.71828…). Its integer part is the largest integer less than or equal to r, which we will denote by ⌊r⌋. Its fractional part is r − ⌊r⌋ and will be denoted {r}: thus ⌊2.71828⌋ = 2 and {2.71828} = 0.71828. In plain language, the integer part is what lies before the decimal point, and the fractional part is what lies after it.
The fractional part spreading law
If one chooses real numbers r at random in an interval several units wide (for example between 0 and 20), and if the law giving the probability of falling on each possible value is fairly regular and spread out, then the fractional parts of the numbers r will be, more or less, evenly distributed between 0 and 1.
Consider, for example, the overall grade average (out of 20) of students in a school. We find numbers like: 10.54; 12.43; 7.23; 11.97; 12.41; 13.80; 16.55… whose fractional parts are: 0.54; 0.43; 0.23; 0.97; 0.41; 0.80; 0.55 …
The grades will not necessarily be uniformly spread over the whole range from 0 to 20, and it is even likely that many of them will cluster around 10 or 11. What the law of uniform spreading asserts, on the other hand, is that the fractional parts of these grades will be well spread between 0 and 1. In particular, there will be about as many fractional parts starting with 0 (just after the decimal point) as fractional parts starting with 1, as fractional parts starting with 2, etc. With no doubt small variations around 1/10, the proportion of grades will be the same in each of the 10 categories.
More generally, if we take two numbers a and b between 0 and 1 with a < b, the proportion of grades whose fractional part lies between a and b is equal to b − a, the length of the interval [a, b]. In our example, the proportion of student grades whose fractional part lies between 0.25 and 0.40 is approximately 15%.
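A small simulation illustrates this; the Gaussian model of the grades below is an arbitrary assumption made only for the sketch:

```python
import random

# Simulate 100,000 grades (out of 20) clustered around 10.5, then check
# that the proportion of fractional parts falling in [0.25, 0.40) is
# close to 0.40 - 0.25 = 15%, as the spreading law predicts.
random.seed(1)
grades = [min(20, max(0, random.gauss(10.5, 3))) for _ in range(100_000)]
fracs = [g - int(g) for g in grades]  # int(g) is the integer part here
prop = sum(0.25 <= x < 0.40 for x in fracs) / len(fracs)
print(f"proportion in [0.25, 0.40): {100 * prop:.1f}%")  # about 15%
```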
The explanation of this law is that, except in special cases, the fractional parts of the numbers are not concentrated in the same region of the interval [0, 1] on every unit interval. If there are more numbers between 12.3 and 12.4 than elsewhere, the same will not be true (barring exceptions) between 15.3 and 15.4. Thus the possible irregularities in density over the 20 intervals between two consecutive integers more or less compensate for one another, which evens out the series of fractional parts, which can be seen as a sort of average of what happens on each of the 20 unit intervals. A graphical interpretation of this idea is provided in Box 3, and an analogy with the belief in the fairness of a lottery in Box 2.
Besides this intuitive and somewhat vague idea, there are of course precise formulations of mathematical conditions which ensure that the deviation from uniformity is small, or even zero; interested readers can consult the articles in the bibliography.
The origin of this law, whose central importance we will come to understand, is difficult to trace. In 1912, Henri Poincaré expressed an equivalent principle without stating a precise theorem. Concerning the third decimal of the numbers found in a table of logarithms, he explained that one observes “that the ten digits 0, 1, 2, 3, …, 9 are equally distributed on this list and therefore the probability that this third decimal is even is equal to 1/2”. For him, this statement of uniformity was obvious; he wrote: “[…] an invincible instinct leads to think so.”
The idea of the uniform spreading of fractional parts was taken up by the Croatian-born mathematician William Feller in his famous 1966 treatise on probability theory, but, as has been noted since, in trying to turn it into a mathematical statement he made a mistake. Without knowing either Poincaré’s or Feller’s text, and this time with an exact statement and proof, Nicolas Gauvrit and I rediscovered the idea in 2008, at the same time as another, different and correct, precise statement was proposed independently by the German researchers Lutz Dümbgen and Christoph Leuenberger. The impossibility of classifying mathematical results for easy retrieval, as one classifies words in a dictionary, often means that results which do not clearly belong to a specific area of mathematics are discovered independently more than once.
A justification of Benford’s law
Thanks to this spreading law, which is both intuitive and formalizable, we will obtain a natural and simple justification of Benford’s law. The idea is simply to apply the decimal logarithm to the previous law… and think a little.
We take the previous statement and apply it not to the numbers r of the series considered, but to their decimal logarithm, log₁₀(r). If we choose real numbers r at random over a wide range covering several orders of magnitude (for example between 1 and 10²⁰), and the law which indicates the probability of falling on one of the possible values is fairly regular and spread out, then the fractional parts of the decimal logarithms of the numbers, i.e. the {log₁₀(r)}, will be roughly evenly distributed between 0 and 1.
We don’t see it straight away, but what we have just stated is Benford’s law (or, more accurately, a more powerful law sometimes referred to as “continuous Benford’s law”). Indeed, asserting that c is the first significant digit of the number r is equivalent to stating that log₁₀(c) ≤ {log₁₀(r)} < log₁₀(c + 1), which will be justified a little later.
The fractional parts of the images under log₁₀ of the numbers r whose first significant digit is c therefore occupy, within the interval [0, 1], an interval of length log₁₀(c + 1) − log₁₀(c), which means, if we assume the uniform distribution, that their proportion is log₁₀(c + 1) − log₁₀(c). This is exactly what Benford’s law says about the first significant digit in base 10!
Continuous Benford’s law is actually a little more powerful than the law that mentions only significant digits in base 10. It makes it possible, for example, to derive an analogous statement about the first significant digit in any number base.
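As an illustration, here is a short sketch of what such a statement gives; base 8 is an arbitrary choice made for the example:

```python
import math

# First-digit probabilities of Benford's law in base 8 (an arbitrary
# example): p_8(c) = log_8(c + 1) - log_8(c) for c = 1, ..., 7.
b = 8
for c in range(1, b):
    p = math.log(c + 1, b) - math.log(c, b)
    print(f"p_{b}({c}) = {100 * p:.1f}%")
```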
The equivalence used above can be justified rigorously, but an example will make it clear. Let us take r = 7234. We have:
{log₁₀(7234)} = log₁₀(7234) − ⌊log₁₀(7234)⌋
= log₁₀(7234) − ⌊3.8593…⌋ = log₁₀(7234) − 3
= log₁₀(7234) − log₁₀(10³) = log₁₀(7234/1000)
= log₁₀(7.234).
Since log₁₀ is an increasing function, we have: log₁₀(7) < log₁₀(7.234) < log₁₀(8).
Subtracting ⌊log₁₀(r)⌋ from log₁₀(r) amounts to dividing r by a power of 10 so as to bring it between 1 and 10, and the interval [log₁₀(c), log₁₀(c + 1)) into which {log₁₀(r)} then falls determines the first significant digit c. Hence the stated equivalence.
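This equivalence is easy to make operational. The sketch below (the helper name is ours, for illustration) recovers the mantissa, hence the first significant digit, from the fractional part of the logarithm; in Python, % 1 computes the fractional part, including for negative logarithms:

```python
import math

def first_digit(r: float) -> int:
    """First significant digit of r > 0, read off {log10(r)}."""
    mantissa = 10 ** (math.log10(r) % 1)  # a number in [1, 10)
    return int(mantissa)

print(first_digit(7234))    # 7
print(first_digit(0.0042))  # 4
```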
We want proof!
As before, the stated principle, i.e. continuous Benford’s law, is a bit vague, and it should be made precise, together with the cases where it does not apply, using proven mathematical results.
Such mathematical statements exist; they are a little complicated to formulate (see the article by Michel Valadier cited in the bibliography). They consist in formulating hypotheses expressing the idea of spreading and regularity and, depending on how strictly these hypotheses are imposed, they assert that the series of numbers satisfies Benford’s law with a precision that the theorem specifies.
One has to be a little careful with the informal formulation of Benford’s law; it is only the translation of an “invincible” intuition, as Poincaré wrote, but an intuition nonetheless. Everything depends on what we call a “wide range covering several orders of magnitude” and a “spread-out and regular law”. Nevertheless, the informal statement shows why the law is so often verified, at least approximately.
The informal law also explains another long-noted property of Benford’s law: as the size of a series of numbers increases, the proportions do not always tend towards the stated probabilities log₁₀(c + 1) − log₁₀(c). The explanation is clear: if the numbers follow a given precise law, the compensation between the unit intervals, when we pass to the fractional parts of the logarithms, will be perfect only in exceptional cases; hence, by multiplying the data, we converge not to a perfectly uniform distribution on [0, 1], but to an approximately uniform one.
Understanding is the best way to make progress, and here it pays off. Identifying the origin of Benford’s law suggests a simple way of generalizing it: replace the function log₁₀ by another function of the same type, i.e. increasing and continuous.
If f is an increasing continuous function, if we choose real numbers r at random over a wide range, and if the law giving the probability of falling on each possible value is fairly regular and spread out, then the fractional parts of the f(r) will be roughly evenly distributed between 0 and 1.
Here again, theorems are possible. Unfortunately, for the functions f that we may consider (for example f(x) = √x, f(x) = x², f(x) = exp(x)), there is no simple translation of the general law in terms of the first significant digit. These generalizations are therefore not as spectacular as the law obtained with f(x) = log₁₀(x). They are useful, however, because they provide fraud-detection tools.
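The following sketch illustrates the general law; the log-uniform model of the data and the sample size are arbitrary choices made for the illustration:

```python
import math
import random

# Fractional parts of f(r) for various increasing continuous f, on data
# spread over several orders of magnitude (log-uniform on [1, 10^6]).
# Each row of proportions should be close to 0.2 per fifth of [0, 1).
random.seed(2)
data = [10 ** random.uniform(0, 6) for _ in range(50_000)]
functions = {"log10": math.log10, "sqrt": math.sqrt, "square": lambda x: x * x}
for name, f in functions.items():
    fracs = [f(r) % 1 for r in data]
    bins = [0] * 5
    for x in fracs:
        bins[int(5 * x)] += 1
    print(name, [round(b / len(fracs), 3) for b in bins])
```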
Using general Benford’s law to detect fraud
Benford’s law has been used many times to detect fraud. A recent book is even devoted to this subject (Mark Nigrini, Benford’s Law: Applications for Forensic Accounting, Auditing, and Fraud Detection, Wiley, 2012).
The principle is simple: if they are spread regularly over several orders of magnitude, the numbers appearing in accounts or statistics should, barring special reasons, obey Benford’s law. Someone inventing numbers, by contrast, may well produce about as many starting with 1 as with 2, 3, etc. Unaware of the property expressed by Benford’s law, the forger will not respect it. If suspicious data fail to conform to Benford’s law, one concludes that they have probably been tampered with.
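Here is a hedged sketch of this basic screening test (the helper name and the choice of a chi-square distance are our illustration): it compares observed first-digit frequencies with Benford’s predictions, the 5% critical value for 8 degrees of freedom being about 15.5.

```python
import math

def benford_chi2(numbers):
    """Chi-square distance between first-digit counts and Benford's law."""
    counts = [0] * 9
    for x in numbers:
        if x == 0:
            continue  # zero has no first significant digit
        counts[int(f"{abs(x):e}"[0]) - 1] += 1  # first digit via sci notation
    n = sum(counts)
    chi2 = 0.0
    for c in range(1, 10):
        expected = n * (math.log10(c + 1) - math.log10(c))
        chi2 += (counts[c - 1] - expected) ** 2 / expected
    return chi2  # large values (e.g. > 15.5) make the data suspicious
```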
This has happened in financial and tax auditing, and in science, where such tests have exposed data from fraudulent experiments. Benford’s law has also been used to distinguish artificial digital images from natural ones, and to identify which images in a series contained hidden data (steganography).
A problem appears, however. With Benford’s law so widely discussed, data riggers may learn of it. They could then take care to produce invented data that respects it, slipping past statistical screening tools based on Benford’s law. The general law makes it possible to counter this risk: by using the variants based on various functions f, the falsified data can still be identified.
To test this method, a series of experiments was carried out by a group of researchers gathered around Nicolas Gauvrit. We describe one of these experiments here. Human productions of pseudo-random data were examined in four contexts where Benford’s law of significant digits is observed.
A group of 169 adults, recruited via social media or email, took part in this experiment. Their ages ranged from 13 to 73. Participants were randomly divided into four groups, each working with data:
-(a) on the population of the 5,000 most populous American cities,
-(b) on mathematical constants from the tables of Simon Plouffe,
-(c) on the distances in light years between the Earth and the nearest visible stars,
-(d) on the number of tuberculosis cases by country for the year 2012.
In each group, participants were told that a series of 30 numbers had been randomly selected from the actual data, and that they were to attempt to produce what they believed to be an analogous and plausible series of 30 numbers. Series of 30 numbers from actual databases were also constructed.
For each series of 30 values of r (fabricated or real), we examined the distributions of the fractional parts of f(r), with f(r) = log₁₀(r), f(r) = πr² and f(r) = √r. The deviation from general Benford’s law was measured by a classical method (the Kolmogorov-Smirnov statistic). As expected, human-made data conform less well to Benford’s law than real data (see Box 4).
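The measurement itself is easy to reproduce. The sketch below (the helper name is ours, not the authors’ code) computes the one-sample Kolmogorov-Smirnov distance between the fractional parts of f(r) and the uniform distribution on [0, 1]:

```python
import math

def ks_deviation(numbers, f):
    """KS distance between the fractional parts of f(r) and uniformity."""
    fracs = sorted(f(r) % 1 for r in numbers)
    n = len(fracs)
    return max(max((i + 1) / n - x, x - i / n)
               for i, x in enumerate(fracs))

# e.g. ks_deviation(series_of_30_values, math.log10): a fabricated
# series should typically show a larger deviation than a real one.
```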
Out of the 12 comparisons (3 functions, 4 data types), only one exception was noted, coming from the data of Simon Plouffe’s table of constants tested against general Benford’s law for the function f(r) = πr²: in this case, humans are on average more compliant with Benford’s law than the real data!
The conclusion is clear: humans are caught out by general Benford’s law when they attempt to fabricate fake data.
Other results of the study also suggest that, depending on the type of data whose authenticity is to be checked, some choices of the function replacing log₁₀ in general Benford’s law are preferable to others. Understanding why will matter for building detection tools that no fraudster can escape. One thing is already certain: thanks to general Benford’s law, the arsenal of digital cheat hunters has gained a powerful new weapon.