Louie Savva's thesis


To explain the apparent problems with the Time-Reversed Interference results:

For these three studies, Savva presents the results in tables containing mean values and standard deviations of the normalised reaction times. With the normalisation used, the reaction times for each participant have mean 0 and standard deviation 1.
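As far as I can tell, the normalisation is a per-participant z-score. A minimal sketch of what I take that to mean (my reconstruction, not code from the thesis):

```python
import numpy as np

def normalise(reaction_times):
    """Per-participant normalisation: shift and scale one participant's
    reaction times so that they have mean 0 and standard deviation 1."""
    rts = np.asarray(reaction_times, dtype=float)
    return (rts - rts.mean()) / rts.std()
```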

From the values shown in the tables, and particularly those for the third study, where most of the standard deviations are around 1, it appears that these means and standard deviations are for the whole population of trials in each category (typically there will be about 20 trials in each category for each participant).

Savva's statistical analysis is done partly using analysis of variance and partly using t tests, and these are based on means and standard deviations per participant for each category, not per trial. The average of the per-participant means will be close to the overall average of all the trials, but of course the standard deviation will be much smaller, because of the effect of averaging over about 20 individual trials in each case. So we can't reproduce those statistical calculations from the values in the tables.

The odd thing is that where Savva does a t test, if we do the same calculation using the standard deviations in the tables, the result is numerically similar. But that is not the appropriate standard deviation to use. We should use the much smaller standard deviation per participant, not the standard deviation per trial. Unless I am missing something, this means that Savva's t tests are seriously underestimating the significance of the results. Those t tests apply to the result in the expected direction in the first study (which he found to be just non-significant) and the result in the opposite direction in the second study (which on his calculation would be significant in the unexpected direction).
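To illustrate the size of the discrepancy, here is a toy calculation on simulated data (my own sketch - the participant and trial counts are plausible placeholders, not figures from the thesis). It assumes, as suggested above, that the tabulated per-trial standard deviation (about 1) was plugged into the t formula where the much smaller standard deviation of the per-participant means belongs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_part, n_trials = 50, 20   # hypothetical: 50 participants, ~20 trials per category

# normalised RTs with a small true difference between the two categories
cong = rng.normal(-0.05, 1.0, (n_part, n_trials))
incong = rng.normal(+0.05, 1.0, (n_part, n_trials))

diff = incong.mean() - cong.mean()          # difference between category means
sd_trial = cong.std(ddof=1)                 # ~1: the per-trial SD shown in the tables
sd_part = cong.mean(axis=1).std(ddof=1)     # ~1/sqrt(20): SD of per-participant means

se = lambda sd: sd * np.sqrt(2.0 / n_part)  # same two-sample t formula throughout
print(diff / se(sd_trial))                  # understated t, using the per-trial SD
print(diff / se(sd_part))                   # appropriate t, roughly sqrt(20) larger
```

With about 20 trials per category, the inappropriate standard deviation inflates the denominator of the t statistic by a factor of roughly √20 ≈ 4.5, which is why the significance would be seriously underestimated.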

Unfortunately that isn't the only problem. In the table for the second study, for the second reaction time (that is, for the conventional effect), the average times are found to be -1.2 for congruent trials and 1.2 for incongruent trials. The problem is that the times for each participant are supposed to have been normalised so that their standard deviation is 1. If you start out with a set of numbers whose mean is 0 and whose standard deviation is 1, it's arithmetically impossible to split them up into two subsets whose means are less than -1 and greater than 1 respectively: by the decomposition of variance, the overall variance is at least the weighted average of the squared subset means, so subset means beyond ±1 would force the overall standard deviation above 1. So the result in the table seems impossible.

The only way of explaining this that I can think of is if the two sets of reaction times for each participant weren't normalised separately, but instead were all combined and normalised using a common mean and standard deviation. Re-reading the methods section, the description of the normalisation is not very specific, but it does read as if the reaction times were all normalised together before being separated into two sets - first and second. And the figures in the table seem consistent with this idea. (But what I can't understand in that case is why the normalised means of the first and second reaction times should be equal, which they seem approximately to be in every category. Unless for some reason after the normalisation they were further shifted to make them equal.)

I don't like normalising the data like this, because it tends to complicate the behaviour of the statistics and can lead to artefacts. In this case it could clearly lead to an artefactual effect like the one observed. Because if in a particular run there is a preponderance of congruent trials, the conventional effect will tend to push the second reaction time down. Therefore it will push the overall average of both reaction times down. And if that average is being used to normalise the data, the result will be to push the normalised first reaction times up! This would be consistent with what Savva found in the second study (and perhaps also the one statistic in the third study which would have been significant in the opposite direction to that expected).
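Here is a crude simulation of that mechanism (my own sketch, with a deliberately exaggerated conventional effect and short runs so that the artefact stands out clearly from the noise):

```python
import numpy as np

rng = np.random.default_rng(1)
z_cong, z_incong = [], []
for _ in range(5000):                            # many short runs
    cong = rng.random(10) < 0.5                  # random trial types, 10 trials per run
    rt1 = rng.normal(500, 30, 10)                # first RT: no anomalous effect at all
    rt2 = rng.normal(500, 30, 10) - 200 * cong   # exaggerated conventional effect on RT2
    pooled = np.concatenate([rt1, rt2])          # both RTs normalised together...
    z1 = (rt1 - pooled.mean()) / pooled.std()    # ...with a common mean and SD
    z_cong.extend(z1[cong])
    z_incong.extend(z1[~cong])

# positive: the normalised first RTs come out slower on congruent trials,
# a pure artefact of the common normalisation
print(np.mean(z_cong) - np.mean(z_incong))
```

The first reaction times contain no effect whatsoever, yet after the common normalisation they are reliably slower on congruent trials - the opposite direction to the psi prediction, just as in Savva's second study.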

So in summary, it looks as though there is a problem with the statistics in at least the first two of these three studies, which may have caused the significance of the observations to be seriously underestimated. On the other hand, there is clearly a way in which the observation in the first study can be explained as a statistical artefact. And if it's true that the normalisation was done in the way suggested above, the second and third studies may be vulnerable to another statistical artefact.
Thankfully the next group of studies - Psi-Timing - is much more straightforward.

In Psi-Timing I, participants did a series of 36 trials, in which they pressed the space bar of a keyboard to generate a pseudo-random number between 1 and 6 based on the computer's clock. The computer then generated a further pseudo-random number to act as a target. The participants were again divided into two groups - arachnophobes (26) and non-arachnophobes (24). If the numbers matched, a neutral picture was displayed. If they didn't, a picture of a spider was displayed.
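The thesis's exact clock-based algorithm isn't reproduced here, but a minimal sketch of how such a scheme might work (the function name and the millisecond resolution are my assumptions):

```python
import time

def space_bar_number(n_outcomes=6):
    """Sample the computer's clock at the instant of the key press and
    reduce it modulo the number of outcomes (one plausible reading of
    a clock-based pseudo-random number generator)."""
    return time.time_ns() // 1_000_000 % n_outcomes + 1
```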

The psi hypothesis was that the arachnophobes would score better than the others. In fact the result was that the arachnophobes scored worse than the others, and the total hit rate for all participants was also under chance expectation.

Psi-Timing II was simpler. There were 30 participants and 25 trials per run. No spiders were involved. Instead of the computer generating a target, the participant pressed the space bar twice, and a hit was registered if matching random numbers were produced (this time the random numbers were between 1 and 5). A hit was rewarded with the sound of a bell, and a miss with a buzzer.

This was the one study of the series that produced a significant result. The hit rate was 23.2%, compared with chance expectation of 20%, which was significant at p=0.017 (by exact binomial calculation; Savva calculated 0.01 using a normal approximation). That was due in part to four participants whose scores were individually significant, including one who scored 13 out of 25 (p=0.00037).
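Those figures are straightforward to check with an exact binomial tail calculation (30 participants × 25 trials = 750 trials, and 23.2% of 750 is 174 hits):

```python
from scipy.stats import binom

# one-tailed exact p-value for 174 hits in 750 trials at chance p = 0.2
print(binom.sf(173, 750, 0.2))   # ≈ 0.017

# the best individual score: 13 hits out of 25 trials
print(binom.sf(12, 25, 0.2))     # ≈ 0.00037
```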

The final study of the series, Psi-Timing III, was a larger replication of the second one, with 50 participants. But in contrast the hit rate was well under chance expectation at only 18.5% (p=0.917).

As the second of these three studies did achieve significance at p=0.017, Savva's description of his tests - "none of which provide evidence for any kind of paranormal functioning" - doesn't seem particularly accurate. His justification for saying this is apparently that the second and third tests - and particularly the significant second test - were low-powered: his estimates of their power were 29% and 40% respectively, meaning that in the presence of a genuine psi effect they would produce significant results less than a third and less than half of the time respectively.

While it's true that from a Bayesian point of view low power does reduce the strength of the conclusion that can be drawn from a significant result, I don't think Savva's conclusion about these two studies is justified:
"given the fact that both studies are severely under-powered, the over-riding conclusion from both studies must be that they are “scientifically useless” and thus all we can safely conclude is that no strong psi-timing effect is in evidence."

If we do plug the p=0.017 result into Bayes's Theorem, together with Savva's power estimate of 29% and a neutral prior assumption that the likelihood of psi existing is 50-50, we obtain a probability of 94% that psi exists.* That compares with a probability of 98% if the study had been properly powered at, say, 80% and had achieved the same level of significance. So the lack of power makes only a few percentage points' difference.

(* Fortuitously, someone asked on another thread about a form of Bayes's Theorem in terms of the significance level and the power that I'd posted there. That made me realise that this probability of 94% is wrong, because I used the observed p value together with Savva's power value. For consistency, I should either have used the significance level, or else recalculated the power as the probability of attaining the observed result or better. I think it's fairer to do the former, which gives the probability of psi existing given that the observation was significant at 0.05 or better. This gives an 85% probability that psi exists, compared with 94% if the study had been properly powered at 80%.
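For concreteness, the form of Bayes's Theorem used in these calculations can be written as a one-liner (my formulation of the calculation described above, with a 50-50 prior):

```python
def p_psi(power, alpha, prior=0.5):
    """Probability that psi exists, given a result significant at level
    alpha, from the power of the test and a prior probability of psi."""
    return power * prior / (power * prior + alpha * (1 - prior))

print(p_psi(0.29, 0.017))  # ≈ 0.94 - the original, inconsistent calculation
print(p_psi(0.80, 0.017))  # ≈ 0.98
print(p_psi(0.29, 0.05))   # ≈ 0.85 - corrected, using the 0.05 significance level
print(p_psi(0.80, 0.05))   # ≈ 0.94
```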

Alternatively, instead of using significance level and power we can calculate directly the probabilities of the observed number of hits - 174 out of 750 - given the null hypothesis and the psi hypothesis. This actually gives about the same result as the calculation above - a probability of 85% of psi existing given the observed number of hits (using for the psi hypothesis a hit probability of 0.216, that is the same factor above chance expectation of 0.2 as Braud's and Shafer's hit probability of 0.18 was above 0.167.)
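And the direct version, comparing the binomial likelihoods of the observed 174 hits under the two hypotheses (again with a 50-50 prior):

```python
from scipy.stats import binom

L_null = binom.pmf(174, 750, 0.2)    # likelihood under chance
L_psi = binom.pmf(174, 750, 0.216)   # likelihood under the psi hypothesis
print(L_psi / (L_psi + L_null))      # ≈ 0.86, close to the 85% above
```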

So the lack of power in Savva's study does make somewhat more of a difference, in Bayesian terms, than I had thought. But I still think it's far from fair to say that it makes the study "scientifically useless.")
The final study - Insect Death-Avoidance - is perhaps the simplest of all. Each trial consisted of placing an insect near the centre of a disc-shaped chamber. After a minute, a photograph was taken showing the insect's position in the chamber, and after another minute one half or the other of the chamber was randomly designated as the death zone. If the insect was in the fatal half it was killed. Otherwise it was preserved.

The experiment was performed with one group of 35 ants and also with several generations of beetles. In each generation the beetles that had survived the experiment were allowed to breed in order to produce the next experimental group. This continued for 5 generations, giving a total of 650 beetles, and 685 experimental subjects including the ants.

The psi hypothesis was that more than half of the insects would survive. But in fact, only 337 of the 685 insects survived - about 49.2%. So the final study provided no evidence of psi.

I thought the idea about breeding from the survivors was interesting, but it seems a bit presumptuous to think that five generations of this kind of artificial selection would make an appreciable difference after hundreds of millions of years of natural selection!
(2019-08-06, 08:54 PM)Chris Wrote: I thought the idea about breeding from the survivors was interesting, but it seems a bit presumptuous to think that five generations of this kind of artificial selection would make an appreciable difference after hundreds of millions of years of natural selection!
You could say the same for Sheldrake's efforts to test morphic resonance with successive batches of chicks. But it's also presumptuous to think that, if psi exists, beetles (and chickens) would exhibit it to an appreciable degree, or that it would work in a way that allows an insect to comprehend psychic signals from a mammal.
(2019-08-06, 10:56 PM)Will Wrote: You could say the same for Sheldrake's efforts to test morphic resonance with successive batches of chicks. But it's also presumptuous to think that, if psi exists, beetles (and chickens) would exhibit it to an appreciable degree, or that it would work in a way that allows an insect to comprehend psychic signals from a mammal.

I'd have thought Sheldrake's experiment was different, as (if I understand correctly) the idea was to test a direct psi influence on successive batches of unrelated chicks, rather than trying to select for psi ability during the breeding process.

I think Savva's hypothesis was that as precognition would confer an evolutionary advantage, natural selection would tend to favour its development. But I don't see why he would have expected his artificial selection process to be so much more powerful than the kind of selection that occurs in the wild - or indeed any more powerful.

I agree that psi in insects is rather a special area, and experiments on beetles won't necessarily tell us much about psi in general.
(2019-08-07, 06:27 AM)Chris Wrote: I'd have thought Sheldrake's experiment was different, as (if I understand correctly) the idea was to test a direct psi influence on successive batches of unrelated chicks, rather than trying to select for psi ability during the breeding process.
I believe that's right. My bad.


I'd not heard of Savva before stumbling onto this thread, and I am woefully ill-equipped to assess statistics, but my general impression is that, to paraphrase Martin Gardner about another researcher, Savva's reasons for turning on parapsychology seem as questionable as some of the more spurious reasons for engaging with the field in the first place.
To conclude -

I think three of the ten tests in the thesis clearly have to be excluded, for the reasons given above:
(1) Presentiment I, because, owing to the variations between participants, the power of the statistical test is practically zero.
(2) Time-Reversed Interference I, because the effect being sought can be produced by a statistical artefact.
(3) Indirect Test of an Exceptional Precognitive Claim, because (although it gives a statistically significant result) it tells us nothing about whether David Mandell's pictures are precognitive.

There are also apparently serious problems with another three:
(1) Presentiment II, because the test is not stated to have been fixed before the data were examined, and in any case there is insufficient power to give a significant result even for the normal (non-precognitive) reaction to the stimulus.
(2) Time-Reversed Interference II and III, because again there may be scope for a statistical artefact - this time in the opposite direction to the effect being sought - and also because there seem to be errors in the statistical analysis.

That leaves the three Psi-Timing tests and the Insect Death-Avoidance test.

The second of the Psi-Timing tests produced a significant result. The extent to which a negative conclusion about the existence of psi can be drawn from the non-significant results of the other three depends on their statistical power - that is, the probability of their giving a significant result if psi exists.

It is difficult to estimate statistical power for parapsychology experiments, because there is no agreement that the effects exist, let alone agreement about how strong they are if they do exist.

Savva estimated the power of the Psi-Timing tests on the basis of a similar experiment published by Braud and Shafer (1989), in which there was a hit rate of 18%, compared with chance expectation of 16.7%.

For Psi-Timing I, to estimate the power, he took as his psi hypothesis that the arachnophobes (who experienced the negative stimulus of a spider picture in response to a miss) should score at 20%, and the non-arachnophobes (for whom the stimulus would be ineffective) should score at chance. From that he estimated the power at about 60%. That seems over-optimistic to me. It seems more reasonable to assume that the arachnophobes would score at 18% like the subjects in Braud's and Shafer's experiment, all of whom experienced feedback in the forms of bells and buzzers. On that assumption, the power of the experiment plummets to about 20%.

Using the 18% hit rate for Psi-Timing II and III, I get, from exact binomial calculations, powers of 28.0% and 40.3%, similar to the figures calculated by Savva using a normal approximation.

For the Insect Death-Avoidance test, I can't see an estimate of power in the thesis, probably because there's a lack of successful similar psi experiments to compare it with. But if we consider a psi hypothesis in which the survival probability is boosted from 50% to something similar to the success rates in Daryl Bem's binary precognition studies - that is about 52.5% - then the power of the insect test is found to be 35.5%.
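For anyone who wants to check these power figures, here is a sketch of the exact binomial calculation (assuming a one-tailed test at the 0.05 level; the trial totals of 750, 1250 and 685 are as given above):

```python
from scipy.stats import binom

def exact_power(n, p0, p1, alpha=0.05):
    """Power of a one-tailed exact binomial test: find the smallest hit
    count that is significant under the chance rate p0, then take the
    probability of reaching it under the psi-hypothesis rate p1."""
    crit = 0
    while binom.sf(crit - 1, n, p0) > alpha:   # P(X >= crit | p0) still above alpha
        crit += 1
    return binom.sf(crit - 1, n, p1)           # P(X >= crit | p1)

print(exact_power(750, 0.2, 0.216))    # Psi-Timing II:  ≈ 0.28
print(exact_power(1250, 0.2, 0.216))   # Psi-Timing III: ≈ 0.40
print(exact_power(685, 0.5, 0.525))    # Insect test:    ≈ 0.355
```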

With power estimates in the range 20-40%, these tests are severely under-powered (as Savva says of Psi-Timing II and III), and it is difficult to see the unsuccessful tests as significant evidence against the existence of psi.
A very insightful and useful analysis, Chris. Thank you for taking the time to deconstruct this thesis.
(2019-08-06, 08:21 PM)Chris Wrote: If we do plug the p=0.017 result into Bayes's Theorem, together with Savva's power estimate of 29% and a neutral prior assumption that the likelihood of psi existing is 50-50, we obtain a probability of 94% that psi exists.* That compares with a probability of 98% if the study had been properly powered at, say, 80% and had achieved the same level of significance. So the lack of power makes only a few percentage points' difference.

(* Fortuitously, someone asked on another thread about a form of Bayes's Theorem in terms of the significance level and the power that I'd posted there. That made me realise that this probability of 94% is wrong, because I used the observed p value together with Savva's power value. For consistency, I should either have used the significance level, or else recalculated the power as the probability of attaining the observed result or better. I think it's fairer to do the former, which gives the probability of psi existing given that the observation was significant at 0.05 or better. This gives an 85% probability that psi exists, compared with 94% if the study had been properly powered at 80%.

Alternatively, instead of using significance level and power we can calculate directly the probabilities of the observed number of hits - 174 out of 750 - given the null hypothesis and the psi hypothesis. This actually gives about the same result as the calculation above - a probability of 85% of psi existing given the observed number of hits (using for the psi hypothesis a hit probability of 0.216, that is the same factor above chance expectation of 0.2 as Braud's and Shafer's hit probability of 0.18 was above 0.167.)

So the lack of power in Savva's study does make somewhat more of a difference, in Bayesian terms, than I had thought. But I still think it's far from fair to say that it makes the study "scientifically useless.")

Just to let people know about the correction of an error in this post.
