Peter Bancel's simulation of Questionable Research Practices

#2 · Chris 2018-06-20, 03:01 PM Unregistered

I think this is an interesting attempt to determine whether the significant results indicated by meta-analyses could be explained by a combination of different "questionable research practices" (QRPs). Essentially it's a more elaborate version of previous attempts to estimate the "file drawer effect". Computational methods are used to simulate the effect of several QRPs acting together, and to test whether they could account for the observed results of meta-analyses. The observed results are (1) the frequency distribution of different p-values for the studies covered by the meta-analysis - in this case represented by 6 "bins" covering different ranges of p-values - and (2) other calculated statistics - in this case, only the effect size.
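To make the "bins" idea concrete, here is a minimal sketch of how a set of study p-values could be reduced to a 6-bin frequency distribution. The bin boundaries in `EDGES` and the example p-values are hypothetical, since the actual boundaries used in the preprint are not given here, and the p-values are made up rather than taken from any Ganzfeld database.

```python
from bisect import bisect_right

# Hypothetical interior bin edges (assumption): 5 edges split [0, 1] into 6 bins.
EDGES = [0.01, 0.05, 0.1, 0.3, 0.6]

def bin_counts(p_values, edges=EDGES):
    """Count how many p-values fall into each bin defined by the edges."""
    counts = [0] * (len(edges) + 1)
    for p in p_values:
        # bisect_right gives the index of the bin that contains p
        counts[bisect_right(edges, p)] += 1
    return counts

# Made-up p-values for illustration only.
example_p_values = [0.003, 0.04, 0.04, 0.2, 0.45, 0.9]
counts = bin_counts(example_p_values)
print(counts)
```

A simulation of QRPs would produce its own set of counts in the same bins, and the comparison between simulated and observed counts is what the method then tests.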

The method works by finding the optimal combination of QRPs, and working out an overall p-value for the hypothesis that the observed results could be explained by that combination. The smaller this p-value, the more implausible it is that the results can be explained in this way.

The claim is that the way in which the p-value frequency distribution is divided up into "bins", together with the use of other statistics, makes the method better able to discriminate between genuine effects and effects of QRPs. So - in contrast to the conclusion of Bierman, Spottiswoode and Bijl - the overall p-value for the Ganzfeld studies is 0.053, which is almost significant at the 5% level. That indicates that, even assuming the optimal combination of QRPs, the observed data are unlikely to have arisen by chance.

This seems like a potentially useful technique, but there are a couple of things I am concerned about. One is a technical concern - from the description in the preprint it sounds as though the p-values obtained from the frequencies for different ranges of p-values, and the p-value for the effect size, are combined as though they were statistically independent. I don't see why they should be statistically independent, so it looks as though this may artificially reduce the overall p-value, making the QRPs seem a less plausible explanation of the observed results.
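The independence concern can be illustrated with a small simulation. This is only a sketch: it uses Fisher's method as a stand-in for "combining p-values as though independent" (the preprint's actual combination rule may differ), and the correlation `rho` between the two test statistics is a hypothetical parameter. Both statistics are generated under the null, so any combined p-value below 0.05 is a false positive.

```python
import math
import random

def upper_tail_p(z):
    """One-sided p-value for a standard normal test statistic."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def fisher_two(p1, p2):
    """Fisher's combined p-value for two p-values, valid only under independence.

    -2*ln(p1*p2) is chi-square with 4 degrees of freedom, whose upper tail
    has the closed form q * (1 - ln q) with q = p1 * p2.
    """
    q = p1 * p2
    return q * (1.0 - math.log(q))

def false_positive_rate(rho, n_sims=20000, alpha=0.05, seed=1):
    """Fraction of null simulations in which the combined p-value is below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        z1 = rng.gauss(0.0, 1.0)
        # z2 is standard normal with correlation rho to z1 (rho is an assumption)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        if fisher_two(upper_tail_p(z1), upper_tail_p(z2)) < alpha:
            hits += 1
    return hits / n_sims

rate_independent = false_positive_rate(rho=0.0)
rate_dependent = false_positive_rate(rho=0.9)
print(rate_independent, rate_dependent)
```

With `rho=0.0` the false positive rate should sit close to the nominal 5%, while with strongly correlated statistics it comes out noticeably higher. That is the sense in which treating dependent quantities as independent can make the overall p-value artificially small.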

More seriously, when the author says "a broad set of QRPs fails to account for the Ganzfeld data, even if these are used in maximal combination and are adopted by researchers at frequencies approaching 100%", it's important to understand that the 100% frequency doesn't mean the QRPs are attaining their greatest possible size. It means they are attaining the (arbitrary) size deemed to be plausible when modelling the QRPs. For example, the publication bias model is that studies with a p-value of 0.3 or more would be published with a probability of only about 0.5, while studies with a very small p-value would almost always be published, with a smooth transition between the two. That is the maximum amount of publication bias considered in the model. But, of course, in theory there could be a greater degree of publication bias than this.
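As a rough illustration of a publication bias model of this kind, the sketch below applies a smooth selection curve to studies simulated with no true effect. Only the two endpoints come from the description above (probability about 0.5 for p of 0.3 or more, close to 1 for very small p). The exponential shape, the `floor` and `scale` parameters and the use of z-scores as the study outcome are all assumptions made for the example, not the preprint's actual model.

```python
import math
import random

def publication_probability(p, floor=0.5, scale=0.05):
    """Hypothetical smooth selection curve: 1 at p = 0, about `floor` for p >= 0.3."""
    return floor + (1.0 - floor) * math.exp(-p / scale)

def simulate_published_mean_z(n_studies=50000, floor=0.5, seed=1):
    """Simulate null studies and return (mean z of published, fraction published)."""
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        z = rng.gauss(0.0, 1.0)                   # no true effect
        p = 0.5 * math.erfc(z / math.sqrt(2.0))   # one-sided p-value
        if rng.random() < publication_probability(p, floor=floor):
            published.append(z)
    return sum(published) / len(published), len(published) / n_studies

mean_z, fraction_published = simulate_published_mean_z()
print(mean_z, fraction_published)
```

Even with no real effect, the mean z-score of the published studies comes out positive. Lowering the hypothetical `floor` below 0.5 corresponds to a stronger publication bias than the maximum described above, and would push the published mean higher still.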

So ultimately this model still depends on a subjective assessment of how common QRPs are. What would be interesting would be a model which didn't restrict the strength of QRPs in this way. Perhaps that might show that the predictions based on QRPs were still inconsistent with observations. I doubt the current model would be powerful enough to say that. Perhaps Peter Bancel is right that the method could be made more powerful, but I think in that case the issue of statistical dependence between the observed quantities would need to be considered carefully.
