In another thread, fls has raised the question of bias owing to subjects not completing the prescribed number of trials in Caroline Watt's 2014 dream precognition study (Journal of Parapsychology, 78(1): 115-125).

The same paper raises another interesting question. The overall hit rate, scored by whether the item ranked highest by the judge was the target rather than a decoy, was significant even after the inclusion of the less successful trials performed by the drop-outs: 30.6%, giving a p value of 0.04 (one-tailed).

Watt also did some post hoc analysis based on the judges' percentage ratings of targets and decoys, which reflected their similarity with the subjects' dreams. She made three comparisons: (1) between the ratings of targets, and the ratings of decoys, (2) between the highest ratings in trials for which the target was chosen, and the highest ratings in trials for which a decoy was chosen, and (3) between the ratio of the highest ratings to the average of other ratings in trials for which the target was chosen, and the same thing for trials in which a decoy was chosen. She used the Mann-Whitney test, which compares two sample sets by combining them into a single ranked set, and checking whether the average ranks differ significantly between the two original sets. None of these tests yielded a significant result, and Watt concluded that this didn't support the precognition hypothesis, and might indicate that non-psi factors were at work.
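To make the mechanics of the test concrete, here is a minimal sketch of the rank-pooling step described above. The function names are my own and the ratings are made up purely for illustration (they are not Watt's data); tied ratings share the mean of the ranks they occupy, which is the usual convention.

```python
from statistics import mean

def average_ranks(values):
    """Ranks 1..n in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one (a tie).
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (mean rank of a, mean rank of b, U statistic for a)."""
    ranks = average_ranks(list(a) + list(b))  # pool both samples into one ranked set
    ranks_a, ranks_b = ranks[:len(a)], ranks[len(a):]
    u_a = sum(ranks_a) - len(a) * (len(a) + 1) / 2
    return mean(ranks_a), mean(ranks_b), u_a

# Made-up percentage ratings, for illustration only (not Watt's data).
target_ratings = [60, 35, 80, 20]
decoy_ratings = [30, 35, 10, 50, 25, 40]
print(mann_whitney_u(target_ratings, decoy_ratings))
```

The test then asks whether a gap this large between the two mean ranks would be surprising if both samples came from the same distribution.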

I'm not convinced that the first of these comparisons tells us very much. Watt quotes p = 0.16, which is certainly very different from the original p value based on hit rate, 0.015. But the former came from a two-tailed test, whereas the latter was one-tailed. If one-tailed tests are used for both (halving 0.16 to 0.08, assuming the difference was in the predicted direction), and if the results from drop-outs are included in the hit rate, the p values are not so very different: 0.08 for the ratings comparison and 0.04 for the hit rate. (Note that revised figures for the Mann-Whitney tests weren't included in Watt's note on the drop-out question.)
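For anyone who wants to check the arithmetic, the two kinds of p value can be sketched as follows. The function names are mine, the halving rule assumes a symmetric test statistic (as in the normal approximation to Mann-Whitney), and the hit counts in the last line are hypothetical, with a chance rate of 0.25 that assumes one target among four options.

```python
from math import comb

def binomial_upper_tail(hits, trials, chance):
    """One-tailed exact p value: P(X >= hits) for X ~ Binomial(trials, chance)."""
    return sum(comb(trials, k) * chance**k * (1 - chance)**(trials - k)
               for k in range(hits, trials + 1))

def one_tailed_from_two_tailed(p_two_tailed, in_predicted_direction=True):
    """Convert a two-tailed p value from a symmetric test to a one-tailed one."""
    half = p_two_tailed / 2
    # If the effect went the wrong way, the one-tailed p value is large, not small.
    return half if in_predicted_direction else 1 - half

# The conversion applied to the two-tailed figure quoted for comparison (1).
print(one_tailed_from_two_tailed(0.16))

# Hypothetical counts, for illustration only (not taken from the paper).
print(binomial_upper_tail(hits=30, trials=100, chance=0.25))
```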

In contrast, the differences for the second and third Mann-Whitney tests are nowhere near significant. This seems paradoxical, and looks hard to reconcile with a precognitive effect. As the Mann-Whitney test is based on rankings, and is often discussed in terms of median values, I wondered whether it might be insensitive to a precognitive effect that manifested itself only in a small percentage of trials - say 10%. But some simple model calculations suggest that's not the case. I also wondered whether systematic differences between the size of the ratings given by the two judges might decrease the significance of the Mann-Whitney statistic (there was only one judge per trial in this study). Watt doesn't say that she tested for such differences, and I think in principle they could affect the results. But I find it difficult to believe they could explain such a large divergence between the different measures of success.

However, it seems to me that this pattern (a significant hit rate with no difference in the second and third comparisons) is exactly what we should expect to see if the mechanism were not precognition, but a psychokinetic effect influencing the random selection of the target. I don't think that would tend to produce any differences in comparisons (2) and (3) - though I'm happy to be corrected if that's wrong. This raises the interesting possibility that Watt's post hoc comparisons are not pointing to a non-psi explanation, but to a different modality of psi. It would be interesting to know whether this idea could be tested using data from other precognition studies.
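The contrast between the two mechanisms can be explored with a toy simulation. This is only a sketch under assumptions of my own, not the model calculations mentioned above and not anything from Watt's paper: four options per trial, ratings drawn uniformly at random, an effect in 10% of trials, a "precog" mode in which the target's rating is boosted, and a "pk" mode in which the ratings are left alone but the target selection lands on the top-rated option. The z scores are normal-approximation Mann-Whitney statistics for comparisons (1), (2) and (3).

```python
import math
import random
from statistics import mean

def mann_whitney_z(a, b):
    """z score (normal approximation) for 'values in a tend to exceed values in b'.

    No tie correction: the simulated ratings are continuous, so ties are negligible.
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = sum(r for r, (_, g) in enumerate(pooled, start=1) if g == 0)
    n_a, n_b = len(a), len(b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return (u_a - n_a * n_b / 2) / math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)

def simulate(mode, n_trials=20000, n_clips=4, strength=0.10, seed=1):
    """Toy trials; mode is 'null', 'precog' or 'pk' (all hypothetical models)."""
    rng = random.Random(seed)
    target_r, decoy_r = [], []
    top_hit, top_miss, ratio_hit, ratio_miss = [], [], [], []
    for _ in range(n_trials):
        ratings = [rng.uniform(1, 100) for _ in range(n_clips)]
        target = rng.randrange(n_clips)      # fair random target selection
        affected = rng.random() < strength   # does the effect act on this trial?
        if mode == "precog" and affected:
            ratings[target] = rng.uniform(70, 100)   # target's rating is boosted
        elif mode == "pk" and affected:
            target = ratings.index(max(ratings))     # selection lands on top-rated option
        top = max(ratings)
        top_i = ratings.index(top)
        ratio = top / mean(ratings[:top_i] + ratings[top_i + 1:])
        hit = target == top_i
        target_r.append(ratings[target])
        decoy_r.extend(r for i, r in enumerate(ratings) if i != target)
        (top_hit if hit else top_miss).append(top)
        (ratio_hit if hit else ratio_miss).append(ratio)
    return {
        "hit_rate": len(top_hit) / n_trials,
        "z1": mann_whitney_z(target_r, decoy_r),      # comparison (1)
        "z2": mann_whitney_z(top_hit, top_miss),      # comparison (2)
        "z3": mann_whitney_z(ratio_hit, ratio_miss),  # comparison (3)
    }

for mode in ("null", "precog", "pk"):
    print(mode, simulate(mode))
```

In the "pk" mode, whether a trial is a hit is independent of the size of the ratings by construction, so comparisons (2) and (3) have nothing to detect, while the hit rate and comparison (1) are both pushed up. How large z2 and z3 come out in the "precog" mode depends on the boost assumed, so the sketch is a way of trying out such assumptions rather than evidence for either mechanism.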
