Draw an elephant with a single parameter
Reproduce any animal with finite numbers using a single function
Suppose we have data and we look for a simple explanation for this data, or at least a model that allows us to reproduce it. We will start, for example, from measurement values which, for x = 0, 1, …, n, give y₀, y₁, …, yₙ. If we find a function f which satisfies f (0) = y₀, f (1) = y₁, …, f (n) = yₙ, we will be quite satisfied. We will use f to predict f (n + 1), f (n + 2), … and we will analyze f to understand the data studied and perhaps to formulate a theory explaining them.
Obviously, we are looking for the simplest possible functions f. Polynomials, i.e. functions of the form f (x) = a₀ + a₁x + a₂x² + … + aₙxⁿ are among the simplest. We will, therefore, be quite satisfied if we find a polynomial that reproduces y₀, y₁, …, yₙ.
However, the mathematical world is generous and often allows us to do whatever we want. A nice result, called “Lagrange interpolation theorem”, indicates that whatever the data y₀, y₁, …, yₙ, there is a polynomial of degree at most n which gives y₀, y₁, …, yₙ for x = 0, 1, …, n. But this polynomial is overfitting: we give ourselves too many parameters, here the coefficients of the polynomial, and what we find is not really of interest because we were sure to get there!
Interpolation:
Imagine that we want to find a polynomial P(x) which takes the value 7 for x=0, the value 2 for x=1, and the value 18 for x=2. Here is a well-known method that allows us to construct it and which generalizes to any number of data.
We write: P(x)=7(x — 1)(x — 2)/[(0–1)(0–2)]+2(x — 0)(x — 2)/[(1–0 )(1–2)]+18(x — 0)(x — 1)/[(2–0) (2–1)].
The first term is zero for x=1 and x=2, the second for x=0 and x=2, the third for x=0 and x=1. Therefore, only the first term is not zero for x=0, only the second is not zero for x=1, only the third is not zero for x=2.
In addition, factors 7, 2, or 18 at the start of each term ensure that the non-zero term for x=0 takes the value 7, for x=1 takes the value 2, for x=2 takes the value 18. C ‘ is what we were looking for. Developing and simplifying, we find: P (x) = 21 x2 / 2 –31 x / 2 + 7.
We can easily verify that P (0) = 7, P (1) = 2, P (2) = 18.
The method is generalized and allows, if y₀, y₁, …, yₙ are given values, to construct without difficulty a polynomial which will take the value y₀ for x=0, y₁ for x=1, …, yₙ for x=n. This is the “Lagrange interpolation method”.
It shows that allowing a lot of parameters, here the coefficients of the polynomial, allows without difficulty to pass a polynomial through the desired values, whatever they are.
However, in general, the function found will oscillate a lot, whereas we would prefer a function as regular and stable as possible: its adjustment to the desired values is exact, but unsatisfactory. There are better methods for approaching the desired values by limiting the oscillations of the result function.
Keep in mind that interpolating exactly is easy, but that it generally does not satisfy the desire to find smooth and regular laws.
This logic of the Lagrange interpolation is pushed to the absurd in the results presented in the main text: the method discovered by Steven Piantadosi, much more powerful than the Lagrange interpolation, allows, by changing a single parameter p of a function fₚ(x), to obtain that for any values y₀, y₁, …, yₙ, we have: fₚ (0) = y0, fₚ(1) = y1, …, fₚ(n) = yₙ.
Also, the calculation of the Lagrange interpolation polynomial will only very rarely allow us to anticipate what happens for n + 1, n + 2, etc., and this polynomial will not be useful on a theoretical level. Fortunately, it is possible that the polynomial found is of low degree. If this is the case, it means that its discovery would have resulted from taking into account fewer data (as much as the degree of the polynomial, plus one unit) and therefore that the polynomial finds the other data. In such a situation, it will not be absurd to believe that the calculated polynomial makes it possible to anticipate f(n + 1), f (n + 2), etc.
The risk of over-adjustment is a significant problem in the area of deep learning in artificial intelligence, as it leads to neural networks that are well suited to a large dataset … and nothing else.
About the choice of parameters in a model, the Hungarian-born mathematician John von Neumann made a funny remark: “With four parameters, I can make an elephant, and with five, I can move its horn. “
A single parameter is sufficient for any adjustment:
A recent article by Steven Piantadosi,
from the Department of Psychology at the University of California at Berkeley, showed that von Neumann could have replaced the number “four” in his sentence with the number “one”. The article is, of course, a game, but what it shows, by implementing fairly simple mathematics, is remarkable and should serve as a warning: a single parameter allows you to adjust a function so that it gives back a thousand data or a billion if you do it right.
The statement demonstrated by Steven Piantadosi is as follows.
- We set a precision e, for example, 1/1000, which means that we will seek to have equalities with an error less than e.
- There exists a function fₚ(x) depending on a real parameter p, such that, whatever the data y₀, y₁, …, yₙ with finite number, there exists a value p of the parameter for which fₚ(0) = Y₀, fₚ(1) = y₁, …, fₚ(n) = yₙ, with an error less than e for each equality.
Clearly, by varying the parameter p, the function fₚ(x) adjusts with precision e to any set of data fixed in advance, however large it may be.
Steven Piantadosi’s miraculous function is: fₚ(x)=sin²(2ᵏˣarcsin(℘∠p)). Recall that arcsin(x) is the inverse function of sin(x), which means that for all x between — ≠/2 and ≠/2 , y=sin(x) ⇔ x=arcsin(y).
The value k that appears depends on the precision e desired, but once e is fixed, k is fixed too. Obviously, the researcher calculated the parameter p which gives an elephant.
One parameter for the elephant
Steven Piantadosi proposed the function: fₚ(x)=sin²(2ᵏˣarcsin(℘∠p)), where the real number p is a parameter and showed that it has an extraordinary property of adjustment. The value of k is fixed once and for all and indicates the precision with which the data will be approached. We consider for example values y₀, y₁, …, yₙ such that the coordinate points (0, y₀), (1, y₁), …, (n, yₙ) draw an elephant, tightening the abscissa and drawing fairly large points.
Steven Piantadosi’s method then makes it possible to determine the value of the parameter p so that: fₚ(0)=y0, fₚ(1) = y1, …, fₚ(n)= yₙ, in other words, so that the graph of fₚ(x) draws an elephant. A single parameter is enough to have an elephant! Of course, you can get all the designs you want by changing the value of p.
Laurent Boué wrote programs available on the internet that allow the calculation of p for the desired drawings and did the calculations for a bird and a turtle.
Where does the exploit come from?
Understanding in detail how the function fₚ(x) achieves this feat, and how to go about finding the right values for the parameter p, is a math game that requires a bit of computation. However, the general idea is simple: we will hide in the parameter p which, because it is a real number, can contain an infinity of information, the numbers y₀, y₁, …, yₙ. Then, the function fₚ(x) for x=0, 1, …, n will make them go up one by one: when x will be worth 0, y₀ will go up, when x will be worth 1, y₁ will go up, etc.
Box 3 explains that this rise in approximate values of y₀, y₁, …, yₙ takes place indirectly thanks to the 2x function. When we apply the 2x function to a real number written in base 2, for example, x=0.1100111011112, it brings up all the figures to the left: 2x=1.100111011112 (the 2 in subscript indicates that it is of writing in base 2).
A first miraculous function
Let D be the function such that D(x) = 2x for 0 ≤ x <1/2 and D(x)=2x — 1 for 1/2 ≤ x <1. This can write: D(x)=2x mod 1 .
The function x mod 1, sometimes called the fractional part of x, keeps only what is after the decimal point: 2.37 becomes 0.37; 219.2222 becomes 0.2222.
Applied to a number written in binary, for example x=0.110111101₂, the function D(x) gives a number with the same binary development as x, except that the first digit after the decimal point has disappeared: D (0.1101111012) = 0.101111012.
The explanation is simple. On the one hand, 2x shifts the binary development of a digit to the left (it’s like base 10 when you multiply a number by 10), and therefore 0.1101111012 becomes 1.101111012; on the other hand, 2x mod 1 removes the 1 in front of the decimal point if there is one.
Now consider the function Dk obtained by applying k times the function D: Dk (x) = D (D (… (D (x)…)), k times. The function Dk (x) slides the numbers binaries of x of k places to the left and eliminates the first k digits: Dk (x) = 2k x mod 1. For example, D4 (0.110111101₂) = 0.111101₂.
Let be numbers y₀, y₁, …, including between 0 and 1, each written in base 2 with r digits after the decimal point. Example for r=5 and n = 2: y0 = 0.101112; y₁ = 0.11001₂ ; y₂ = 0.01011₂
Let p = 0.1011111001010112 obtained by putting the digits of y₀, y₁ and y₂ one behind the other. We have :
p = D⁰(p) = 0.1011111001010112
D¹(p) = 0.011111001010112
D²(p) = 0.11111001010112
D³(p) = 0.111001010112
D⁴(p) = 0.111001010112
D⁵(p) = 0.11001010112, etc.Thus, D⁰(p) corresponds to y⁰ on the first 5 binary digits, D⁵(p) corresponds to y₁ on the first 5 binary digits, D¹⁰(p) corresponds to y² on the first 5 binary digits, etc.
This process therefore places in the binary digits of p the binary digits of the data y₀, y₁, …, yₙ with a precision of r binary digits for each, which makes it possible, by applying r times function D , have successively y₀, y₁, …, yₙ each time with an accuracy of r binary digits.
In summary, p contains, hidden in its binary digits, the numbers y₀, y₁, …, yₙ with a precision of r binary digits for each, and the function D, applied repeatedly to p, finds these hidden numbers.
Let now be the function Eₚ defined by Eₚ(x)=Dʳˣ(p), whose variable is x and which depends on the parameter p. We have :
-Eₚ(0)=y0 with a precision of r binary digits,
-Eₚ(1)=y1 with a precision of r binary digits, …,
-Eₚ(n)=yn with a precision of r binary digits.By choosing the parameter p well, the function Eₚ(x) is therefore capable of giving the numbers y₀, y₁, …, yₙ with a precision of r binary digits. If the data is not between 0 and 1, we come back to it by dividing everything by a fairly large number.
Thus, by properly adjusting its parameter p, the function Eₚ(x) is sufficient to find any finite set of data. This is not the result announced concerning the function fₚ(x)=sin²(2ᵏˣarcsin(℘∠p)), but we are almost there.
This property is equivalent, for base 2, to what we know well in base 10: multiplying by 10 shifts all decimals by one digit to the left. For example, 10 × 3.1415926… = 31.415926…
By using the function that removes what is before the decimal point, the function (x mod 1), sometimes called “fractional part of x”, we then have a way to bring up the binary digits of a given number x by erasing those which are placed at the head. Let us indeed put D(x)=2x mod 1, and take for example x = 0.1100111011110101₂. We then have:
-D(x) = 0.100111011110101₂,
- D(D(x)) = 0.00111011110101₂,
- D(D(D(x))) = 0.0111011110101₂, etc.
By exploiting this means of hiding information in a real parameter p, as well as some other tricks, we arrive at the function which, with the right choice of p, gives y₀ for x = 0, y₁ for x = 1, … , yₙ for x = n with a precision fixed in advance. This shows that a single parameter p and a single function fₚ(x) allow to approach all the data that we want, whatever their number, as long as it is finished!
The function of Piantadosi
Given what has been said already, on the D function, we now consider the three functions:
ϕ(x) = sin²(2 ≠ x);
ϕ⁻¹(x) = (arcsin℘ -x) / 2 ≠;
L(x) = 4x (1-x).By definition, ϕ(ϕ⁻¹(x))=x for 0 ≤ x ≤ 1 and ϕ⁻¹(ϕ (x))=x for 0 ≤ x ≤ 1/4.
Here are the inevitable calculations to arrive at the miracle.
ϕ (D (x))=ϕ(2x)=sin²(4≠x)=4sin²(2≠x)(1-sin²(2 ≠ x)) = L (ϕ (x)) (we used sin (2a) = 2 sin a cos a, cos² a=1 -sin² a and sin(x+2≠)=sin(x)).
Taking x = ϕ⁻¹ (p), i.e. p=ϕ(x), we have: ϕ(D(ϕ⁻¹(p)))=L(p), from which we deduce: ϕ(Dᵏ (ϕ⁻¹(p))) = Lᵏ(p).
In other words, to know the effect of L applied k times to p, just apply ϕ⁻¹ to p, then apply k times D, then apply ϕ to the result.
We will now understand which parameter p to choose so that fₚ(x) gives y₀, y₁, …, yₙ for x = 0, 1, …, n. We first note that fₚ(x)=sin²(2ᵏˣarcsin (√ -p)) = ϕ (Dᵏˣ(ϕ⁻¹(p))) = Lᵏˣ(p).
If we want fₚ(x) to give approximately y₀, y₁, …, yₙ for x = 0, 1, …, n, then it suffices that Dᵏˣ(ϕ⁻¹(p)) gives approximately ϕ⁻¹(y₀), ϕ⁻¹(y₁), …, ϕ⁻¹(yn) for x = 0, 1, …, n. According to what has been already told, it suffices therefore that ϕ⁻¹(p) be made of k binary digits of ϕ⁻¹(y₀), followed by k binary digits of ϕ⁻¹(y₁), …, followed of k binary digits of ϕ⁻¹(yₙ).
Conclusion: the parameter p we are looking for must be the image by ϕ of k binary digits of ϕ⁻¹(y₀), followed by k binary digits ofϕ⁻¹(y₁), …, followed by k binary digits of ϕ⁻¹(yₙ). The larger k is, the more precisely the yᵢ will be given by fₚ(0), fₚ(1), …, fₚ(n).
Take for example y₀ = 0.7, y₁ = 0.4, y₂ = 0.5
and k = 12. We calculate ϕ⁻¹(0.7), ϕ⁻¹(0.4) and ϕ⁻¹(0.5) with 12 significant binary digits:ϕ⁻¹(0.7)= 0.0010100001102;
ϕ⁻¹(0.4)= 0.0001101111102;
ϕ⁻¹(0.5)= 0.0010000000002.We put the binary digits found one after the other:
q = 0.001010000110000110111110001000000000₂=
0.157741434872150421142578.We calculate p = ϕ(q) = 0.6999652447 … and we have what is expected:
fₚ(0)= 0.6999652447, or about 0.7,
fₚ(1)= 0.3996355201, or about 0.4,
fₚ(2) = 0.4999999999, or about 0.5.
Thus, no need for four parameters to draw an elephant: with a single parameter, we can arrange, and the same function will always be used, by changing p, to draw a fish, a turtle, a bird, or all that we want.
Number of parameters and complexity of models
We sometimes measure the complexity of a model by counting the number of parameters used by the model. This is a procedure that cannot be justified without additional details on the nature of the model. Indeed, the function fₚ(x) of Steven Piantadosi shows that we can always be reduced to a single parameter. If we want to give a meaning to the counting of the parameters of a model, we must detail the constraints that we impose on the model, for example, “only involve polynomials” which will prohibit accepting the function fₚ(x) or one of its variants.
It should be emphasized that the article by Steven Piantadosi does not claim to propose a new method for constructing models approximating data using a function with a single parameter. The article is only a math game that exploits the ability of a real number to hold an infinite amount of information. What makes the article particularly interesting and makes it amazing is the overall simplicity of the method. I do not believe that we published before the work of Steven Piantadosi such an elementary method allowing to determine and effectively calculate a parameter p which accumulates information in unlimited number, and to bring back what we hid in p in using only one function constructed from the elementary functions of classical analysis (sin, arcsin, exponential, etc.).
The disappointed illusion of over-adjustment.
Laurent Boué took 95 values from the American S&P 500 stock market index. He calculated the value of p which allows the miraculous function of Piantadosi,fₚ(x)=sin²(2ᵏˣarcsin(℘-p)), to restore these values. He then looked at how it worked for the next values in the index (beyond the green dotted vertical bar).
If the approximation method fₚ(x) was a good way to discover a hidden law in the variations of this stock market index, the following values for the calculated p would have been close to those of the real data.
Not surprisingly, this is not at all the case. In the drawing, the red curve is what fₚ(x) gives and the blue dots are the actual values. The function fₚ(x) allows, by changing p, to adjust to any set of data, but it is by no means a means of discovering hidden laws. This is a perfect example of unnecessary overfitting.
Physicists use real numbers, but never take more than 20 or 30 decimals into account. So they will not make sense of models like the one suggested by the prodigious function of Steven Piantadosi, which only works if one accepts to manipulate hundreds of significant figures.
If we look closely at the function fₚ(x) when the parameter p has been set, we can see that it oscillates violently and that the values it takes at whole numbers 0, 1, …, n do not are in no way representative of what happens for x between 0 and 1, between 1 and 2, between 2 and 3, etc.
If we draw (as much as we can do) the entire plot of the functionfₚ(x) between 0 and n, we do not see at all the image produced by the drawing when we only retain the values of fₚ(x) for x = 0, 1, …, n. We see a tight zigzag which darkens almost the entire rectangle [0, n] × [0, 1]. The plot of the function fₚ(x) passes very close to the coordinate points (0, y0), (1, y1), …, (n, yₙ), but it does so without limiting its variations, whereas this is what we expect from a good interpolation function: we want the latter to be as smooth as possible.
Let us quote the comments of Steven Piantadosi concerning his miraculous function:
The existence of such a simple function with such freedom of behavior illustrates a more fundamental problem. It shows that the complexity of a model cannot be assessed by counting its parameters.
More generally, the non-critical use of a parameter counting approach ignores that a single parameter with real value potentially contains an unlimited amount of information since, to fully specify a real number, an infinite number of bits must to be, which incidentally is problematic in itself. Another notable fact is that the set of real numbers which can be described with a finite number of bits (for example by having them produced by a Turing machine) is countable. This set of real numbers therefore has a zero measure in the set of all real numbers. We also know that there are injective functions (i.e. not taking twice the same value) from ℝⁿ (space of dimension n) to ℝ. This confirms that the number of parameters in a model cannot be a reasonable measure of its complexity.
Universality
The function fₚ(x)=sin²(2ᵏˣarcsin(℘-p)) can be qualified as universal since it adapts to any series of data y0, y1, …, yₙ. It is not the first universal object to be encountered in mathematics.
The simplest is probably the number ≠. We conjecture that it is a universe number and the calculation of several thousands of billions of its decimals confirms it: each possible sequence of decimal digits (for example 243, 21145 or 929292), as long as we want, would thus appear in its decimal development. If ≠ is really a universe number, we can see it as a function which, when we fix an integer n, displays for example 1 million decimal digits starting from the n-th decimal. With this million digits, we can draw an image of 1000 × 1000 pixels with 10 gray levels. If ≠ is a universe number, then for each image of this format, your portrait for example, there exists at least one integer n such that image number n taken from ≠ produces it. In this sense, ≠ would be at least as universal as fₚ(x), since capable of producing all the images of the indicated format.
However, the amazing function of Piantadosi has a double advantage over ≠:
(1)-we are certain that it is universal, it is not a conjecture;
(2)-we know how to calculate the value of the parameter p necessary to obtain data y0, y1, …, yₙ, and this calculation is feasible, which is not the case for n giving your portrait in ≠.
Another universal object: Riemann’s Zeta function
Another completely unexpected universal object was encountered in mathematics in 1975 by the Russian mathematician Sergei Voronin. It concerns the famous Riemann zeta function.
This function noted ζ (s) is defined, for any real number s> 1, by the formula: ζ(s) =1+1/2s+1/3s+1/4s+…
We can give it a meaning for all the values of s, including when s is a complex number (except in s = 1). We can see it as a function which, at any point on the plane (complex numbers can be assimilated to points on the plane), associates another point on the plane. This function plays a central role in number theory and, in particular, the position of the values of s which cancels it is linked to the distribution of prime numbers.
When we consider functions of this type, which produce complex numbers from complex numbers, the functions playing the role of differentiable functions are called “holomorphic functions”. The usual functions like xⁿ, sin (x), cos(x), log(x), and all those that we obtain by combining them are of course holomorphic.
In the part of the plane corresponding to a vertical band B of points of coordinates (x, y) with 1/2 <x <1, Sergei Voronin demonstrated an incredible property of universality of ζ (s). For any part P of this band without hole and compact (that is to say of finite size, and which contains the limit of all its convergent sequences), and for any holomorphic function f which does not cancel in P, there exists a real number r such that ζ (s + r) takes the same values in P as f, except e (e, as before, is a precision that is fixed once and for all).
In other words: whatever the holomorphic function f defined on an area P without a hole in the band B, the function ζ takes almost the same values as f when we examine its values in a version shifted upwards from P. It is a bit as if ζ (s), on its own, could imitate all the holomorphic functions! In ≠, there are probably all finite sequences of decimals, but in ζ (s), we are sure that there are all holomorphic functions! Note that Sergei Voronin’s result has been extended to other functions than the ζ (s) function.
Another fascinating universal object has already been mentioned in this section, the Rado universal graph. This infinite graph indeed contains as subgraphs all finite graphs and all infinite graphs; moreover, it has the property that when we remove a finite part, what remains is identical to what we started with.
The infinite world of mathematics conceals wonders and gives astonishments far beyond, at least it is my point of view of mathematician, of all that one can find in the physical world which, in comparison, sometimes appears dull and boring!