The puzzle of small effects
Sometimes small differences are given a meaning they don’t have. Distinguishing small, statistically significant effects from those arising from chance alone is a challenge!
“Engineers have more boys, nurses have more girls”, “Violent men have more boys”, “Attractive individuals have more daughters”… We owe these headlines to Satoshi Kanazawa, an evolutionary psychologist at the London School of Economics. The author is admittedly controversial, but he has nevertheless published these unusual “findings” in serious scientific journals. One of his latest articles, published in 2017, is entitled “Individuals are all the less attractive the older their father was at the time of conception”. What do you think?
The specialists have delivered their verdict: the statistical analyses on which these articles were based have been invalidated, for example because of sampling bias. These “discoveries” are thus not significant and may be nothing more than the product of chance: they would never have been published if their statistical significance had been correctly assessed.
Should we therefore reject this work wholesale? No; indeed, some of Satoshi Kanazawa’s hypotheses rest on theories recognized by the scientific community. However, the bulldozer of media coverage has crushed the nuance and caution essential to interpreting the small statistical differences the psychologist reported. These fall under what are called small effects. To what extent do they make statistical sense? How should we interpret results that are not significant (but that some insist on interpreting anyway)? What can be the consequences of misinterpreting these small effects?
Before answering, let’s see what “statistically significant” and “standard deviation” mean, two central concepts in the study of small effects. We will use the example of the study claiming that “attractive people have more daughters”. The standard deviation measures the dispersion of the random variable under study (the difference in the proportions of girls between two categories of parents, the beautiful and the others) around its mean value. The larger the standard deviation, the greater the dispersion of the values found. How do we estimate it? The standard deviation is the square root of the variance; for a proportion p estimated from a sample of size n, it equals √(p(1−p)/n), and is therefore inversely proportional to the square root of the sample size. It is thus in our interest to choose large samples to reduce measurement uncertainty. For example, for a sample of 100 couples, the standard deviation of the proportion of girls is about 5%. For a sample of 3,000 couples, it is only 0.9%.
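For readers who wish to check these figures, here is a minimal sketch in Python of the calculation; the sample sizes and the proportion of one half are taken from the example above:

```python
import math

def proportion_sd(p, n):
    """Standard deviation of an observed proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

# Proportion of girls close to one half, as in the examples above
print(f"100 couples:   sd = {proportion_sd(0.5, 100):.1%}")    # 5.0%
print(f"3,000 couples: sd = {proportion_sd(0.5, 3000):.1%}")   # 0.9%
```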
In the case that interests us, we seek to test the following hypothesis: the proportion of girls is higher in the group of attractive parents than in the group of parents considered less beautiful. When can we say that the result found is statistically significant? The observed difference must be far enough from what would be plausible if parents considered beautiful had the same chance of having a daughter as the other parents (about one in two). Conversely, the difference in proportions between two groups is not statistically significant when sampling fluctuations alone could plausibly produce it, even in the absence of any real difference between the two groups.
Take the simpler example of a sequence of 20 coin flips that yields 8 heads and 12 tails. The observed proportion of heads is 40%, with a standard deviation of 11%. Since this estimate is not far enough from 50% (half heads, half tails), the result can be attributed to chance and is therefore not significant. In the studies discussed here, we will approximate the distribution of the random variable by a Gaussian, and we will simply state, without detailing the calculation, that a difference is significant if it exceeds its standard deviation multiplied by 1.96.
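The coin-flip example can be checked the same way; this sketch simply compares the deviation from 50% to 1.96 standard deviations:

```python
import math

n, heads = 20, 8
p_hat = heads / n                      # observed proportion of heads: 40%
sd = math.sqrt(0.5 * 0.5 / n)          # sd under the fair-coin hypothesis: ~11%
z = (p_hat - 0.5) / sd                 # standardized deviation from 50%

print(f"p_hat = {p_hat:.0%}, sd = {sd:.1%}, |z| = {abs(z):.2f}")
print("significant" if abs(z) > 1.96 else "not significant: attributable to chance")
```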
Girl or boy?
Let’s take a closer look at Kanazawa’s analysis. The subjects’ beauty was rated on a scale of 1 to 5 and the sex of their children recorded. For the roughly 3,000 parents studied, Kanazawa reported a difference in proportions of 8%, which seemed significant: the proportion of girls was 52% for the most attractive parents (rated 1), against 44% on average for the other four categories (rated 2 to 5). In fact, comparing the first category to the other four is only one of many possible approaches. We could, for example, have compared the two most beautiful groups to the two least beautiful ones: this time, the significance disappears.
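To see how the choice of grouping can create or destroy significance, here is an illustrative two-proportion test. The published article does not give the exact size of each beauty category, so the group sizes and the proportions for the second grouping below are hypothetical:

```python
import math

def z_difference(p1, n1, p2, n2):
    """z statistic for the difference between two observed proportions."""
    sd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / sd

# Grouping A: category 1 (52% girls) vs. categories 2-5 (44% girls),
# with a hypothetical 300 / 2,700 split of the ~3,000 parents
z_a = z_difference(0.52, 300, 0.44, 2700)

# Grouping B: categories 1-2 vs. 4-5 (hypothetical proportions and sizes)
z_b = z_difference(0.50, 1200, 0.48, 1200)

for name, z in [("A", z_a), ("B", z_b)]:
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"grouping {name}: z = {z:.2f} -> {verdict}")
```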
This is a fine example of a suggestive sociological result that is nevertheless not statistically significant: it could very well be the product of chance, even though it seems to support the proposed model. Faced with this type of statistical problem, we must take into account the magnitude of the effects we expect to find.
The effects studied here are expected to be small, as multiple studies of variations in the girl-to-boy ratio at birth attest. This ratio varies by about 1% (the probability of having a daughter moving, for example, from 48.5% to 49.5%) depending on various factors: ethnic group, the parents’ age, birth order, the mother’s weight, the couple’s status and the season of birth. Socioeconomic conditions, especially poverty and undernourishment, have a stronger influence, reaching 3%, because male fetuses are more fragile.
Based on these scientific data, one would expect the effect of parental beauty on the girl-to-boy ratio at birth to be less than 1%, comparable to the commonly observed variations. Let us check whether this is indeed the case using two statistical approaches: so-called frequentist analysis and Bayesian analysis.
Frequentist analysis
In the first approach, we set out hypotheses, then treat the data statistically to find out whether they are more compatible with one of them: that hypothesis may then be declared true. In the second, we incorporate prior information (here, that the effect of parental beauty on the sex ratio at birth cannot be large) in the form of a distribution of plausible values of the effect. The information drawn from the experiment is the distribution conditioned on, and therefore modified by, the observation; it is called the posterior distribution: a new distribution of plausible effects.
Returning to the Kanazawa study, we first followed a frequentist method to estimate the probability of a daughter being born as a function of the parents’ beauty. We estimated a 4.7% difference in probability between the two groups, with a standard deviation of 4.3%, which remains consistent with Kanazawa’s result. With these values, we can calculate the confidence interval that contains the true value with 95% probability; in percentage points, it is [-3.9; 13.3]. How should this confidence interval be interpreted statistically?
It contains the value zero, which corresponds to parental beauty having no effect on the sex ratio at birth. Our estimate of 4.7% is therefore not significant, and the statistical analyses must continue before any conclusion can be drawn (if one is even possible!). And outside the bounds of the 95% confidence interval, what happens? What does the remaining 5% probability represent?
For the observed effect to be statistically significant, it would have to exceed 1.96 times the standard deviation, i.e. 8.4% (the confidence interval would then not contain 0). Expressed differently: the probability of wrongly rejecting the zero value is 5%.
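In code, the interval and the significance threshold come straight from the estimate and its standard deviation (the interval [-3.9; 13.3] quoted above corresponds to rounding the factor 1.96 up to 2):

```python
estimate, sd = 4.7, 4.3                 # difference in % of girls and its standard deviation

half_width = 1.96 * sd                  # ~8.4%: the significance threshold
low, high = estimate - half_width, estimate + half_width
print(f"95% confidence interval: [{low:.1f}; {high:.1f}]")   # ~[-3.7; 13.1]

if low < 0 < high:
    print("zero is inside the interval: the estimate is not significant")
```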
But is it reasonable to treat effects as large as 8.4% as significant? Of course not. This is called a magnitude error: the study is constructed in such a way that any statistically significant result overestimates the true effect (which cannot exceed 1%). To this are added sign errors, when the estimate found has the opposite sign to the true effect (here: “handsome parents have more boys than girls”). Indeed, the effect can be of two types: positive, if the “beautiful” parents have more daughters than the others, and negative, if they have fewer.
To illustrate the probabilities associated with these errors, consider four scenarios, each based on a standard deviation of 4.3%. They show that a study with this sample size (approximately 3,000 couples) is not suited to estimating small effects of the order of 1%, in particular because the standard deviation is so high. This is why studies of the sex distribution of newborns use much larger samples, drawn from large demographic databases numbering over one million individuals.
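A small simulation makes these magnitude and sign errors concrete. Assume, plausibly given the literature, a true effect of 1 point measured with a standard deviation of 4.3 points, and look only at the simulated studies that happen to reach significance:

```python
import random, statistics

random.seed(1)
TRUE_EFFECT, SD = 1.0, 4.3    # assumed 1% true effect, study-level sd of 4.3%

estimates = [random.gauss(TRUE_EFFECT, SD) for _ in range(100_000)]
significant = [e for e in estimates if abs(e) > 1.96 * SD]

power = len(significant) / len(estimates)
sign_errors = sum(e < 0 for e in significant) / len(significant)
exaggeration = statistics.mean(abs(e) for e in significant) / TRUE_EFFECT

print(f"power: {power:.1%}")                    # barely above the 5% false-positive floor
print(f"wrong-sign results: {sign_errors:.0%}") # significant but negative
print(f"typical exaggeration: {exaggeration:.1f}x the true effect")
```

Under these assumptions, the rare significant results overstate the true effect roughly ninefold, and a sizable fraction of them even carry the wrong sign.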
Bayesian analysis
We summarize everything already known about the effect to be analyzed, drawing on external sources, in a prior distribution. This distribution is then modified (conditioned) by the result of the experiment to give the posterior distribution. If we knew nothing in advance, we could take a non-informative prior, and the result would match that of the frequentist approach: the posterior distribution would be approximately Gaussian, with mean 4.7% and standard deviation 4.3%, corresponding to a probability of about 86% that the real effect is positive. In general, the more the prior distribution is concentrated around zero (the assumption that the real effect on the gender difference is small), the closer the posterior probability is to 50%.
Let us choose, for example, a Gaussian (bell-shaped) prior centered on zero, shaped so that the real difference in the probability of having a daughter (depending on whether the parents are beautiful or not) is close to zero, with probabilities of 50%, 90% and 94% of lying in the intervals [-0.3; 0.3], [-1; 1] and [-3; 3] respectively. Why center the distribution on zero? Because we have no prior view on the sign of the real difference in the probability of a daughter’s birth according to the parents’ beauty.
In the next step, we calculate, from this prior distribution and the data, the posterior distribution of the effect. In summary, the posterior distribution gives a probability of only 58% that the effect is positive (beautiful parents have more daughters), including a 45% probability that this positive difference is less than 1%. This analysis depends, but only a little, on the prior distribution; for example, widening the distribution around zero increases the probability that the real effect is positive by only 7 points (to 65%). Changing the family of distributions also has little effect on the results: the real effects remain weak, which the data confirm.
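The Bayesian update itself is a short calculation when both the prior and the likelihood are Gaussian. In this sketch, the prior standard deviation of 0.45 is our assumption, chosen so that about half of the prior mass falls in [-0.3; 0.3] as described above; with it, the posterior probability of a positive effect lands near the 58% reported:

```python
import math

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def posterior(prior_mean, prior_sd, est, est_sd):
    """Conjugate Gaussian update: precision-weighted average of prior and data."""
    w0, w1 = 1 / prior_sd**2, 1 / est_sd**2
    mean = (w0 * prior_mean + w1 * est) / (w0 + w1)
    return mean, math.sqrt(1 / (w0 + w1))

est, est_sd = 4.7, 4.3   # frequentist estimate and its standard deviation

# Non-informative prior: the posterior is just the likelihood
print(f"flat prior:  P(effect > 0) = {normal_cdf(est / est_sd):.0%}")   # ~86%

# Informative prior concentrated near zero (sd = 0.45 is an assumption)
m, s = posterior(0.0, 0.45, est, est_sd)
print(f"tight prior: P(effect > 0) = {normal_cdf(m / s):.0%}")          # ~55%
```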
The scientific ideal in inference about a quantity (here, the link between parental beauty and the sex ratio at birth) is a description of the uncertainty, summarized here by a probability distribution. Researchers can collect data, or creatively analyze data that has already been published (as Kanazawa did), and publish their findings. Meta-analyses can then review all of these results together; they smooth out the variations inherent in these small-sample studies, in which the probability of a positive effect may rise from 50% to 58%, then perhaps fall back to 38%, and so on.
How can reliable data be recognized? By collecting more and more of it. Each year, the American magazine People publishes a list of the 50 most beautiful celebrities in the world. We recorded the sex of their children for the issues published between 1995 and 2000. In 1995, for example, 32 girls were born against 24 boys, i.e. 57.1% girls, 8.6 points more than in the general population (48.5%): a result in agreement with the Kanazawa hypothesis. But with a standard deviation of 6.7%, the estimate of 8.6% is not statistically significant. To check, we compared this result with those of the following years. Between 1995 and 2000, the most beautiful people according to People had 157 daughters out of a total of 329 children, i.e. 47.7% girls (with a standard deviation of 2.8%), only 0.8 points lower than the figure for the general population. We cannot conclude…
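The People tally is easy to reproduce. This sketch tests each count against the general-population proportion of 48.5% girls:

```python
import math

BASELINE = 0.485                        # share of girls in the general population
samples = {"1995 alone": (32, 56),      # (girls, total children)
           "1995-2000": (157, 329)}

for label, (girls, total) in samples.items():
    p = girls / total
    sd = math.sqrt(BASELINE * (1 - BASELINE) / total)
    z = (p - BASELINE) / sd
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"{label}: {p:.1%} girls, sd = {sd:.1%}, z = {z:+.2f} -> {verdict}")
```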
Why spend time studying statistical errors that no one has noticed? For two reasons. First, results that seem to make sense without being statistically significant are the most problematic. Second, certain media outlets and scientific publications, through their interest in particular sociological subjects and their selection of results, bias research in the social sciences.
Indeed, Kanazawa’s results immediately attracted media interest, including on the blogs of The New York Times. Publication in a peer-reviewed journal seemed a sufficient guarantee to dispel any doubts.
A deafening noise
What’s more, the reported effect kept growing. The estimate of 4.7% that we made (statistically not significant) rose to 8% in Kanazawa’s analysis (comparing the most beautiful group to the average of the four least attractive groups), reached 26% after an additional study introducing other corrections, then climbed to 36% for reasons that remain unclear!
This inflation surprised us, the final figure being 10 to 100 times higher than all the girl-boy ratio variations published in the literature. We concluded that in this study the noise (spurious variation) outweighed the relevant signal. Statistical power is the ability of a study to detect a difference when one exists; all else being equal, larger samples give more power. So if we want to say something about effects of the order of 1%, we have every interest in starting from relevant data and carrying out tests that exploit the data well. This example illustrates that underpowered studies are unlikely to reach statistical significance and, more importantly, overestimate the size of any effects they do detect. In other words, in such studies the noise drowns out the signal, that is, the effect under observation.
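Power is easy to quantify under the Gaussian approximation used throughout. For a true effect of 1 point, a study with a standard deviation of 4.3 points has almost no chance of a significant result, whereas a million-birth database settles the question; the 0.1-point standard deviation below is our rough assumption, corresponding to comparing two groups of about 500,000 births each:

```python
import math

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(true_effect, sd):
    """Probability that |estimate| > 1.96*sd when estimate ~ N(true_effect, sd)."""
    return (1 - normal_cdf(1.96 - true_effect / sd)) + normal_cdf(-1.96 - true_effect / sd)

print(f"sd = 4.3% (~3,000 couples):    power = {power(1.0, 4.3):.0%}")   # ~6%
print(f"sd = 0.1% (~1,000,000 births): power = {power(1.0, 0.1):.0%}")   # ~100%
```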
How can sociology escape this type of problem? Today, most subjects in sociology have already been sifted through, and researchers are thus reduced to studying small effects. The sex ratio at birth is a social subject close to our everyday concerns. Presented as a “politically incorrect truth”, the Kanazawa result, because it concerns births, touches on sensitive issues such as abortion, parental leave, and the roles of men and women in society.
We have seen that underpowered studies produce erratic results, sometimes statistically significant, but more often merely intuitively appealing. This is one of the weak points of evolutionary psychology: it interprets random results without acknowledging the fragility of the explanations it gives. For example: people considered attractive are more likely to be healthy, wealthy and from dominant ethnic groups, and more generally to have characteristics valued by society. They would thus hold power, which, according to some sociological theories, benefits men more than women. It would therefore be “natural” for attractive parents to have more boys. We do not claim this is true; we are simply saying that it could be, but that one could just as easily construct an argument concluding that they have more daughters… which is not without its difficulties!
Finding differences where there are none
Compare these two statements: “Beautiful parents have more daughters” and “It is not proven that beautiful parents have more or fewer daughters”. No doubt the first, sensational one would make more headlines! Were the editors of the serious journals where Kanazawa’s claims appeared themselves influenced? No doubt, and for two possible reasons. On the one hand, statistical errors are sometimes difficult to detect, even for specialists. On the other hand, statistical significance depends heavily on sample size when the effects tested are small: with a large enough sample, one can almost always detect a small effect as statistically significant, whereas studies carried out on cohorts that are too small invite abusive interpretations.
The study of the sex ratio at birth is not new. In his book Probability, Statistics and Truth, published in 1957, Richard von Mises examined this ratio for births in Vienna in 1907 and 1908, and found less variation than chance alone would predict; he attributed this to different sex distributions across ethnic groups. In fact, the uncertainty in the measurements was neither more nor less than that of pure chance. What should be done in the face of this urge to find differences where there are none? To avoid such biases, one must show that the observed results reflect real effects independent of how the samples were selected, and find a biological argument supporting the idea that effects of the order of 1% matter.
When small effects must be estimated, statisticians should keep a critical eye on the estimates obtained. The analysis methods themselves are not free of methodological flaws: frequentist calculations take no account of effect sizes, and Bayesian analyses, often relying on mostly Gaussian prior distributions, frequently ignore issues of statistical power. Hence the importance of correctly quantifying the uncertainties, in particular by calculating statistics on the sign and magnitude of the effects. There is no doubt that the exchange of methods and ideas will pave the way to better estimates of small effects.