I posted a link a few days ago to some pieces about parapsychology by the psychiatrist Eliot Slater. They included a review of Hansel's "ESP: A Scientific Evaluation" (1966), in which Slater endorsed Hansel's sceptical conclusions. The review includes the following statement, which appears to be Slater's own view rather than Hansel's (comments by Hansel quoted elsewhere read as though he was unaware of the study's results, even though the report seems to have been published in 1963):

"Very inadequate use has been made of mechanical and electronic recording systems to guard against trickery or error. Apparently there has been only one really satisfactory experiment when such use was made, by the U.S. Air Force Research Laboratories with VERITAC, in which 37 subjects completed a total of 55,500 trials. Neither the group as a whole, nor any single member of it displayed any evidence of ESP in the form of clairvoyance, in precognition, or in general extra-sensory perception."

http://eliotslater.org/index.php/psychia...evaluation

Later, in the introduction of a paper discussing subsequent experiments with random number generators, Martin Gardner made a similar comment about the VERITAC study (Scientific American, vol. 233, pp. 114-119 (1975)):

"When results were analyzed, using the familiar chi-square test for statistical significance, there was no deviation from chance either for the entire group or for any individual. Nor were there significant differences in the scores of sheep and goats."

https://www.jstor.org/stable/24949922

I'm not sure how widely available the VERITAC report was when these comments were published. It is a typescript, and I assume only a small number of copies were produced. But now it's available on the website of the US Defense Technical Information Center:

http://www.dtic.mil/dtic/tr/fulltext/u2/414809.pdf

As Slater said, 37 subjects completed their trials. Another 8 failed to complete and their data were excluded, which could potentially bias the overall results, though the report says only 2 gave up through "loss of interest". This was a forced-choice experiment in which the targets were the digits from 0 to 9 (so on the no-psi hypothesis the probability of success was 0.1). There were three modes: clairvoyance, precognition and telepathy. Each subject did 5 runs of 100 trials in each mode. A "sheep-goat" classification was also applied, based on belief in psi and evidence of psi-conducive characteristics, judged from interviews with the subjects. Of the 37 subjects, there were 18 "sheep" and 19 "goats".
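As a quick check on the design figures above, here is the arithmetic as a small Python sketch. The variable names are mine rather than the report's, and the standard deviation assumes independent trials with a fixed hit probability of 0.1.

```python
import math

SUBJECTS = 37          # subjects who completed all their trials
MODES = 3              # clairvoyance, precognition, telepathy
RUNS_PER_MODE = 5
TRIALS_PER_RUN = 100
P_HIT = 0.1            # chance of guessing one of ten digits

trials_per_subject_mode = RUNS_PER_MODE * TRIALS_PER_RUN
total_trials = SUBJECTS * MODES * trials_per_subject_mode

# Chance expectation and spread for one subject in one mode (binomial model).
expected_hits = trials_per_subject_mode * P_HIT
sd_hits = math.sqrt(trials_per_subject_mode * P_HIT * (1 - P_HIT))

print(total_trials)                      # 55500
print(expected_hits, round(sd_hits, 2))  # 50.0 6.71
```

The total agrees with the 55,500 trials quoted by Slater.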

I think the comments above give a rather misleading impression of what the report actually says. A number of different statistical tests were applied. The study used quite a stringent criterion for statistical significance, p < 0.01, which interestingly the report says was the norm in parapsychology at the time (of course, p < 0.05 is the norm these days). Despite what Slater and Gardner wrote, some of the results were found to be significant at this level.

(1) The study used several measures of individual performance. For each subject, there was the overall score from all three modes combined, and total scores for the three modes separately. There was also a chi-squared statistic calculated from the scores for each run of 100 trials. This combined the effects of scores that were greater than chance expectation and scores that were smaller (i.e. "psi-missing"). This chi-squared measure, too, was calculated both for all three modes combined, and for the three modes separately.
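To illustrate how a run-based chi-squared measure can pick up deviations in both directions, here is a minimal sketch. I am assuming the statistic is a sum of squared standardized deviations, one term per run of 100 trials; the exact formula in the report may differ, and the scores below are invented for illustration.

```python
P_HIT = 0.1
TRIALS_PER_RUN = 100

def run_chi_squared(run_scores):
    """Sum of squared standardized deviations, one term per run (assumed form)."""
    expected = TRIALS_PER_RUN * P_HIT                # 10 hits per run
    variance = TRIALS_PER_RUN * P_HIT * (1 - P_HIT)  # binomial variance, 9
    return sum((s - expected) ** 2 / variance for s in run_scores)

# Hypothetical subject: one high run, one low run, three runs exactly at chance.
scores = [16, 4, 10, 10, 10]
total = sum(scores)            # 50, exactly the chance expectation for 500 trials
stat = run_chi_squared(scores)

print(total, stat)
```

The total score here sits exactly at chance, yet the chi-squared value is well above zero, because the high run and the low run both add to it instead of cancelling.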

It's a bit difficult to know what Gardner meant by "no deviation from chance ... for any individual", because with enough subjects, some individual will reach any specified deviation from chance by luck alone. The real question is whether the number of such deviations is greater than would be expected by chance, and that's a question about the whole group, not about any individual.

Certainly, in the VERITAC study, some of the individual scores were significant at the p < 0.01 level. For overall scores from all three modes, it was found that 2 of the 37 subjects scored above chance at the p < 0.01 level (i.e. using a one-tailed test). The authors noted that another subject scored below chance at the same level (i.e. using a one-tailed test for "psi-missing"). Two of the individual chi-squared statistics were also significant at p < 0.01 - one for the precognition mode, and another for the telepathy mode.
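To put a rough number on "more than would be expected by chance", here is a sketch of the binomial calculation for the overall scores. It assumes the 37 subjects are independent and that each has exactly a 0.01 chance of reaching the one-tailed criterion under the no-psi hypothesis; with discrete scores the true tail probability will be a little different, so treat the result as approximate.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

N_SUBJECTS = 37
ALPHA = 0.01   # one-tailed criterion used in the study

expected_count = N_SUBJECTS * ALPHA                  # about 0.37 subjects
p_two_or_more = prob_at_least(2, N_SUBJECTS, ALPHA)  # roughly 0.05

print(round(expected_count, 2), round(p_two_or_more, 3))
```

On these assumptions, two or more high scorers out of 37 is uncommon but not extraordinary, which is why the count of significant individuals needs to be judged at the group level.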

(2) For the group as a whole, the total scores were considered for all three modes combined, and for each mode separately. Totals for "sheep" and "goats" were also calculated for each mode. None of these total scores differed from expectation significantly at the p < 0.01 level.

However, chi-squared statistics were also calculated from the scores for each subject, again combining the deviations from chance expectation in both directions. This was done for all three modes combined, for the three modes separately, and for the "sheep" and "goats" separately in each mode. There were also chi-squared tests of goodness of fit to the binomial distribution for each mode.

Again, one of these chi-squared tests did produce a significant result. This was the one for the 18 "sheep" in the telepathy mode. The p value calculated was 0.00865. This was the result of deviations from chance in both directions (i.e., "psi-missing" as well as scores above expectation), which occurred in similar proportions - in fact the overall score was below expectation, though not significantly so.
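Here is a sketch of how a group-level chi-squared of this kind could be significant or near-significant even when the total score is below expectation. I am assuming one squared standardized deviation per subject (500 trials per mode, so 18 degrees of freedom for 18 subjects); the report's exact procedure may differ, and the scores are invented, not taken from the report.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of chi-squared; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total

TRIALS, P_HIT = 500, 0.1
expected = TRIALS * P_HIT                 # 50 hits per subject per mode
variance = TRIALS * P_HIT * (1 - P_HIT)   # 45

# Hypothetical scores for 18 subjects, scattered above and below 50.
scores = [62, 38, 55, 41, 60, 39, 50, 47, 58,
          36, 52, 44, 61, 40, 49, 45, 57, 42]

group_total = sum(scores)                 # below 18 * 50 = 900
stat = sum((s - expected) ** 2 / variance for s in scores)
p_value = chi2_sf_even_df(stat, len(scores))

print(group_total, round(stat, 2), p_value)
```

The total score only measures the net direction of the deviations, whereas the chi-squared statistic measures their overall size, so the two can point to quite different conclusions.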

So it's not true to say that this study produced no statistically significant results, even though the bar for significance was set rather high, at p < 0.01. Instead, it's a question of whether there were more significant results than would have been expected by chance. Certainly there were quite a few different statistical tests, and it can be argued that there were enough to produce this number of significant results by chance. Of course, the separation of the subjects into "sheep" and "goats" itself multiplies the number of tests. And a mixture of psi success and "psi-missing" among the "sheep" is probably not what would have been predicted in advance.
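The multiple-testing point can be sketched numerically. The tally below is my own count of the group-level tests as described in this post, not a figure from the report, and the calculation assumes the tests are independent, which they clearly are not (the same data feed several of them). It is only a rough guide.

```python
ALPHA = 0.01

# Rough tally of group-level tests, counted from the description in this post.
tally = {
    "total scores: all modes combined + each mode": 4,
    "total scores: sheep and goats in each mode": 6,
    "chi-squared: all modes combined + each mode": 4,
    "chi-squared: sheep and goats in each mode": 6,
    "chi-squared goodness of fit, each mode": 3,
}
n_tests = sum(tally.values())

expected_false_positives = n_tests * ALPHA
p_at_least_one = 1 - (1 - ALPHA) ** n_tests   # assumes independent tests
bonferroni_threshold = ALPHA / n_tests        # per-test level keeping the overall rate near ALPHA

print(n_tests, round(expected_false_positives, 2), round(p_at_least_one, 3))
print(bonferroni_threshold)
```

On this rough count, at least one result significant at p < 0.01 would turn up by chance in something like a fifth of such studies, before the individual-level tests are even considered.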

But on the other hand, if the criterion for significance had been set at today's conventional level of p < 0.05, the chi-squared statistic would have been significant (at p = 0.0432) for all the subjects combined in the telepathy mode, not just for the "sheep". (Edit: Further erroneous comment removed.)

In summary, the VERITAC study wasn't really the straightforward triumph for psi scepticism that it's been presented as. Instead, as so often in the field, the conclusion to be drawn from it is a matter of messy post-hoc interpretation.