VERITAC Revisited

#1 · Chris 2018-09-02, 06:32 PM Unregistered

I posted a link a few days ago to some pieces about parapsychology by the psychologist Eliot Slater. They included a review of Hansel's "ESP. A Scientific Evaluation" (1966), in which Slater endorsed Hansel's sceptical conclusions. The review includes this statement (this appears to be Slater's view rather than Hansel's; comments by Hansel quoted elsewhere read as though he was unaware of the results of the study, even though the report seems to have been published in 1963):

"Very inadequate use has been made of mechanical and electronic recording systems to guard against trickery or error. Apparently there has been only one really satisfactory experiment when such use was made, by the U.S. Air Force Research Laboratories with VERITAC, in which 37 subjects completed a total of 55,500 trials. Neither the group as a whole, nor any single member of it displayed any evidence of ESP in the form of clairvoyance, in precognition, or in general extra-sensory perception."
http://eliotslater.org/index.php/psychia...evaluation

Later, in the introduction of a paper discussing subsequent experiments with random number generators, Martin Gardner made a similar comment about the VERITAC study (Scientific American, vol. 233, pp. 114-119 (1975)):

"When results were analyzed, using the familiar chi-square test for statistical significance, there was no deviation from chance either for the entire group or for any individual. Nor were there significant differences in the scores of sheep and goats."
https://www.jstor.org/stable/24949922

I'm not sure how widely available the VERITAC report was when these comments were published. It is a typescript, and I assume only a small number of copies were produced. But now it's available on the website of the US Defense Technical Information Center:
http://www.dtic.mil/dtic/tr/fulltext/u2/414809.pdf

As Slater said, 37 subjects completed their trials. (Another 8 failed to complete, and their data were not included, which could potentially bias the overall results, though the report says only 2 gave up through "loss of interest".) This was a forced-choice experiment in which the targets were the digits 0 to 9 (so on the no-psi hypothesis the probability of success was 0.1). There were three modes: clairvoyance, precognition and telepathy. Each subject did 5 runs of 100 trials in each mode. A "sheep-goat" classification was also applied, based on belief in psi and evidence of psi-conducive characteristics, judged from interviews with the subjects. Of the 37 subjects, there were 18 "sheep" and 19 "goats".
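As a quick check on those figures, here is a minimal Python sketch of the design arithmetic. The constant and function names are illustrative labels only and do not come from the report; the numbers are the ones quoted above.

```python
import math

# Design figures as quoted in this post (names are illustrative).
SUBJECTS = 37
MODES = 3            # clairvoyance, precognition, telepathy
RUNS_PER_MODE = 5
TRIALS_PER_RUN = 100
P_HIT = 0.1          # ten equally likely digits under the no-psi hypothesis

trials_per_subject = MODES * RUNS_PER_MODE * TRIALS_PER_RUN  # 1500
total_trials = SUBJECTS * trials_per_subject                 # 55500, matching Slater's figure


def chance_mean_sd(n_trials, p=P_HIT):
    """Mean and standard deviation of the hit count under pure chance (binomial)."""
    return n_trials * p, math.sqrt(n_trials * p * (1 - p))


# One run of 100 trials: mean 10 hits, standard deviation 3.
# One subject's 1500 trials: mean 150 hits, standard deviation about 11.6.
print(total_trials, chance_mean_sd(TRIALS_PER_RUN), chance_mean_sd(trials_per_subject))
```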

I think the comments above give a rather misleading impression of what the report actually says. A number of different statistical tests were applied. The study used quite a stringent criterion for statistical significance, p < 0.01, which interestingly the report says was the norm in parapsychology at the time (of course, p < 0.05 is the norm these days). Despite what Slater and Gardner wrote, some of the results were found to be significant at this level.

(1) The study used several measures of individual performance. For each subject, there was the overall score from all three modes combined, and total scores for the three modes separately. There was also a chi-squared statistic calculated from the scores for each run of 100 trials. This combined the effects of scores that were greater than chance expectation and scores that were smaller (i.e. "psi-missing"). This chi-squared measure, too, was calculated both for all three modes combined, and for the three modes separately.
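To illustrate how a chi-squared measure of this kind picks up deviations in both directions, here is a minimal Python sketch. It assumes the statistic is the sum over runs of the squared deviation from the chance mean, divided by the binomial variance, and referred to a chi-squared distribution with one degree of freedom per run; that form is an assumption for illustration, not necessarily the report's exact formula. The run scores are hypothetical, and `chi2_sf` is a small helper written for this example.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df.

    Uses the recurrence Q(a+1, y) = Q(a, y) + y**a * exp(-y) / gamma(a+1)
    for the regularised upper incomplete gamma function, with y = x / 2.
    """
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def run_chi2(scores, n=100, p=0.1):
    """Sum of squared standardised deviations of run scores from chance."""
    mean = n * p
    var = n * p * (1 - p)
    return sum((s - mean) ** 2 / var for s in scores)


# Hypothetical scores for five runs of 100 trials (not data from the report).
# The total is 50, exactly the chance expectation, yet the statistic is large
# because high and low runs both contribute.
scores = [14, 6, 15, 5, 10]
stat = run_chi2(scores)
print(sum(scores), stat, chi2_sf(stat, df=len(scores)))
```

The point of the example is only that a total score at chance can coexist with a large chi-squared value when above-chance and below-chance runs are mixed.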

It's a bit difficult to know what Gardner meant by "no deviation from chance ... for any individual", because obviously, if a study has enough subjects, some individual will reach any specified deviation from chance purely by luck. The question is whether the number of deviations is more than would be expected by chance, and that's a question about the whole group, not about any individual.

Certainly, in the VERITAC study, some of the individual scores were significant at the p < 0.01 level. For overall scores from all three modes, it was found that 2 of the 37 subjects scored above chance at the p < 0.01 level (i.e. using a one-tailed test). The authors noted that another subject scored below chance at the same level (i.e. using a one-tailed test for "psi-missing"). Two of the individual chi-squared statistics were also significant at p < 0.01 - one for the precognition mode, and another for the telepathy mode.
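One rough way to frame the "more than expected by chance" question for the overall scores is a binomial calculation. The sketch below assumes the 37 subjects are independent and that, under the null hypothesis, each has exactly a 0.01 chance of a one-tailed result at this level; with discrete scores the true chance is somewhat lower, so this is only an approximation. The function name is illustrative.

```python
from math import comb


def prob_at_least(k, n, alpha):
    """P(at least k of n independent tests are significant at level alpha)."""
    return 1.0 - sum(
        comb(n, j) * alpha ** j * (1 - alpha) ** (n - j) for j in range(k)
    )


n_subjects = 37
alpha = 0.01

expected = n_subjects * alpha                    # 0.37 subjects expected by chance
p_two_or_more = prob_at_least(2, n_subjects, alpha)
print(expected, p_two_or_more)
```

Under these assumptions the probability of two or more such subjects comes out a little above 0.05, so on its own the count of high scorers is suggestive at best and does not meet the study's own p < 0.01 criterion.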

(2) For the group as a whole, the total scores were considered for all three modes combined, and for each mode separately. Totals for "sheep" and "goats" were also calculated for each mode. None of these total scores differed from expectation significantly at the p < 0.01 level.

However, chi-squared statistics were also calculated from the scores for each subject, again combining the deviations from chance expectation in both directions. This was done for all three modes combined, for the three modes separately, and for the "sheep" and "goats" separately in each mode. There were also chi-squared tests of goodness of fit to the binomial distribution for each mode.

Again, one of these chi-squared tests did produce a significant result. This was the one for the 18 "sheep" in the telepathy mode. The p value calculated was 0.00865. This was the result of deviations from chance in both directions (i.e., "psi-missing" as well as scores above expectation), which occurred in similar proportions - in fact the overall score was below expectation, though not significantly so.

So it's not true to say that this study produced no statistically significant results, even though the bar for significance was set rather high, at p < 0.01. Instead, it's a question of whether there were more significant results than would have been expected by chance. Certainly there were quite a few different statistical tests, and it can be argued that there were enough to produce this number of significant results by chance. Of course, the separation of the subjects into "sheep" and "goats" itself multiplies the number of tests (and probably the fact that the "sheep" showed both psi success and "psi-missing" isn't what would have been expected).

But on the other hand, if the criterion for significance had been set at today's conventional level of p < 0.05, the chi-squared statistic would have been significant (at p = 0.0432) for all the subjects combined in the telepathy mode, not just for the "sheep". (Edit: Further erroneous comment removed.)

In summary, the VERITAC study wasn't really the straightforward triumph for psi scepticism that it's been presented as. Instead, as so often in the field, the conclusion to be drawn from it is a matter of messy post-hoc interpretation.

#2 · Chris 2018-09-05, 11:57 AM Unregistered

(2018-09-02, 06:32 PM)Chris Wrote: However, chi-squared statistics were also calculated from the scores for each subject, again combining the deviations from chance expectation in both directions. This was done for all three modes combined, for the three modes separately, and for the "sheep" and "goats" separately in each mode. There were also chi-squared tests of goodness of fit to the binomial distribution for each mode.

Again, one of these chi-squared tests did produce a significant result. This was the one for the 18 "sheep" in the telepathy mode. The p value calculated was 0.00865. This was the result of deviations from chance in both directions (i.e., "psi-missing" as well as scores above expectation), which occurred in similar proportions - in fact the overall score was below expectation, though not significantly so.

That's the result the authors published, but their calculation of the chi-squared statistic is implicitly approximating the binomial probability distribution using a normal distribution. I suspect the result could be quite badly biased. (The erroneous comment I deleted from my post above was an estimate based on the same approximation, and it turned out to overestimate the significance by quite a lot.)

#3 · Chris 2018-09-06, 10:34 AM Unregistered

(2018-09-05, 11:57 AM)Chris Wrote: That's the result the authors published, but their calculation of the chi-squared statistic is implicitly approximating the binomial probability distribution using a normal distribution. I suspect the result could be quite badly biased. (The erroneous comment I deleted from my post above was an estimate based on the same approximation, and it turned out to overestimate the significance by quite a lot.)

I did some exact calculations of p values for the sum of squares of deviations from expectation. The p values were higher than those obtained by approximating the binomial distribution with a normal one, but not by much. One of the values that the report found to be (just) significant at p < 0.01 did become (just) non-significant: the value reflecting the deviations for subject number 1 in the general telepathy tests.
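For anyone who wants to try a comparison of this kind, here is a minimal Python sketch of one way it could be set up. It is not the calculation described above, only an illustration under stated assumptions: each run score is treated as binomial (100 trials, p = 0.1), the statistic is the sum over runs of the squared deviation from 10, and binomial probabilities below 1e-18 are dropped to keep the convolution small. All function names are illustrative.

```python
import math
from collections import defaultdict


def binom_pmf(k, n, p):
    """Exact binomial probability of k hits in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def exact_tail(observed_ss, runs=5, n=100, p=0.1):
    """P(sum over runs of (score - n*p)**2 >= observed_ss) under the binomial null.

    Assumes n*p is a whole number so squared deviations are integers.
    """
    mean = round(n * p)
    single = defaultdict(float)      # distribution of one run's squared deviation
    for k in range(n + 1):
        pm = binom_pmf(k, n, p)
        if pm > 1e-18:               # drop negligible mass (assumption, see lead-in)
            single[(k - mean) ** 2] += pm
    dist = {0: 1.0}
    for _ in range(runs):            # convolve one run at a time
        new = defaultdict(float)
        for s, ps in dist.items():
            for d, pd in single.items():
                new[s + d] += ps * pd
        dist = new
    return sum(pr for s, pr in dist.items() if s >= observed_ss)


def chi2_sf(x, df):
    """Upper-tail chi-squared probability for integer df (incomplete gamma recurrence)."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def compare(observed_ss, runs=5, n=100, p=0.1):
    """Return (exact p value, chi-squared approximation) for a sum of squared deviations."""
    var = n * p * (1 - p)
    return exact_tail(observed_ss, runs, n, p), chi2_sf(observed_ss / var, runs)


# Hypothetical sum of squared deviations for five runs (not a value from the report).
print(compare(150))
```

The first number returned is the exact tail probability and the second is what the normal (chi-squared) approximation gives for the same statistic; how far apart they are depends on the value chosen and the number of runs.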

However, to complicate matters, the p values given in the report are somewhat different from the ones given by online chi-squared calculators. Of course, in the 1960s they would have had to use tables to estimate the p values, and it seems this wasn't done accurately.

#4 · Chris 2018-09-07, 10:54 AM Unregistered

(2018-09-02, 06:32 PM)Chris Wrote: I posted a link a few days ago to some pieces about parapsychology by the psychologist Eliot Slater. They included a review of Hansel's "ESP. A Scientific Evaluation" (1966), in which Slater endorsed Hansel's sceptical conclusions. The review includes this statement (this appears to be Slater's view rather than Hansel's; comments by Hansel quoted elsewhere read as though he was unaware of the results of the study, even though the report seems to have been published in 1963):

"Very inadequate use has been made of mechanical and electronic recording systems to guard against trickery or error. Apparently there has been only one really satisfactory experiment when such use was made, by the U.S. Air Force Research Laboratories with VERITAC, in which 37 subjects completed a total of 55,500 trials. Neither the group as a whole, nor any single member of it displayed any evidence of ESP in the form of clairvoyance, in precognition, or in general extra-sensory perception."
http://eliotslater.org/index.php/psychia...evaluation

Actually this is closely based on what Hansel wrote in the book. The comments of his I had seen quoted elsewhere come from the final paragraph of the book. Apparently Hansel was suggesting there that additional trials with VERITAC might settle the question of ESP one way or another. I don't know whether there were ever any further VERITAC trials.

#5 · Chris 2018-09-10, 04:01 PM Unregistered

One question about VERITAC that might well be relevant is how it generated its random numbers. Were they "true" random numbers based on physical measurements of some kind, or only pseudo-random numbers produced by a computational algorithm? From online sources, it seems that both would have been possible in 1963 (Alan Turing, for example, having designed a process to produce random numbers from electrical noise as early as 1951:
https://medium.freecodecamp.org/a-brief-...98737f5b6c ).

I can't see the answer to that question in the VERITAC report.

In the 1980 version of his critique of parapsychology, Hansel asks why the results obtained by Helmut Schmidt with random number generators are different from those of the VERITAC study (which as noted above he characterises, a bit inaccurately, as having produced no evidence at all for anomalous phenomena). He notes that "It might be said that Schmidt's machine was dependent on the indeterminacy of quantum processes and that VERITAC was not." But whether that means that Hansel knew that VERITAC was not, or only that he thought it might be suggested that it was not, I don't know.

Typoz · Typoz 2018-09-11, 06:03 AM

(2018-09-10, 04:01 PM)Chris Wrote: (Alan Turing, for example, having designed a process to produce random numbers from electrical noise as early as 1951:
https://medium.freecodecamp.org/a-brief-...98737f5b6c ).

That article makes a useful point:

Quote: But Turing’s random number instruction was maddening for programmers at the time because it created too much uncertainty in an environment that was already so unpredictable. We expect consistency from our software, but programs that used the instruction could never be run in any consistently repeatable way, which made them nearly impossible to test.

That is indeed the case. A particular set of input values may uncover an error or bug in a piece of software, and repeatability is necessary in order to fully understand and correct it. Actually, the same applies to any kind of data, not just random numbers: identifying suitable test data for a particular condition is a large part of the testing process. In that respect, pseudo-random numbers find a valid place in the testing stages of software development. But, having done that, one then does want to move on to running the program with real data, whether that relates to finance or psychic phenomena.
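As a small illustration of the repeatability point, here is a Python sketch using the standard `random` module. The function names are made up for the example, and the digit targets simply mirror the 0 to 9 targets discussed in this thread; nothing here reflects how VERITAC itself generated its numbers.

```python
import random


def make_test_targets(n, seed):
    """Repeatable pseudo-random digit targets: the same seed gives the same list."""
    rng = random.Random(seed)
    return [rng.randrange(10) for _ in range(n)]


def make_live_targets(n):
    """Non-repeatable targets drawn from the operating system's entropy source."""
    rng = random.SystemRandom()
    return [rng.randrange(10) for _ in range(n)]


# During testing, a failing case can be replayed exactly by reusing the seed.
first = make_test_targets(100, seed=1963)
replay = make_test_targets(100, seed=1963)
print(first == replay)

# For a real run, the sequence cannot be regenerated from a seed.
live = make_live_targets(100)
```

Note that `random.SystemRandom` relies on whatever entropy the operating system provides, which is not necessarily a physical noise source of the kind Turing used.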
