Statistical Significance


^ That Slate piece was just referenced in their write-up on the state of science today: http://www.slate.com/articles/health_and...roken.html
(2017-10-06, 02:07 PM)jkmac Wrote: We've discussed that already. And like it or not, this seems to be a fact of life with these phenomena. They are not consistent, and people need to deal with it.

I know I'm not a professional researcher, but I am a professional engineer, and I don't believe that the results in test number N+1 negate the results found in test N.

OTOH, if multiple test runs with the same procedure show data that, when combined with meta-analysis techniques, negate each other, now you have something to talk about.

But I don't think that's what we see in the cases in question, is it? We see indicative results, which vary in magnitude, some of which support the claim and others which do not, but when taken in aggregate show non-trivial support of the claims. 

In other words, if the data in conflict with the claim is not significant enough to raise the p-value enough to negate the claim, the effect is demonstrated. Right?

Two steps forward and one back still results in forward motion by one step.
You are talking about true results and false results. There are two kinds of true results - true positives and true negatives. And there are two kinds of false results - false positives and false negatives. Mostly researchers are interested in finding true positives, although they should also be interested in finding true negatives (whole 'nother discussion). What you mention above is how we go about distinguishing between true positives and false positives and false negatives. I excluded true negatives, because we are talking about whether a claim can be supported with significance testing, not whether it can be disproven.

The use of a p-value to determine whether a study is positive or negative is the same as using a diagnostic test to decide if a patient has a particular condition (conceptually and mathematically). So I'm going to explain it in those terms.

The p-value represents the specificity of the test. That is, it tells us how many false positives (and true negatives) we can expect when the patient doesn't have the condition (there is no psi). The p-value doesn't tell us about true positives. True positives are determined by the sensitivity of the test, which in the case of significance testing with p-values is something called "power". Sensitivity (power) tells us how many positive tests (significant studies) we can expect when the patient has the condition (there is psi). "Power" depends upon the size of the effect and the size of the study (and the sensitivity which we can ignore for now). When we quote the p-value, we give no information about true positives and false negatives - the two things we are most interested in if we are trying to prove a claim or determine if a patient has a condition. That is, the p-value doesn't tell us what we want to know - what is the probability that I have found the effect I am looking for, in this study, or what is the probability that this patient has the condition?

Medicine uses the Likelihood Ratio to include information about true positives and false negatives, which is a form of the Bayes' Factor discussed in the link provided earlier. If we do the same thing to the idea of significance testing, we discover that we falsely think a significant study supports the claim more often than the p-value suggests. That is, a p-value of 0.000001 doesn't mean that there is only a one in a million chance that the effect was due to chance - it could be one in three for all we know. That makes a huge difference when it comes to treating patients. When it comes to psi, misclassifications in whether or not a study is a true-positive or a false-positive (and true-negative/false-negative), would give the appearance of inconsistency.
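
To make the diagnostic-test analogy concrete, here is a minimal sketch in Python. It assumes a study is called "positive" when p falls below a fixed threshold alpha, so alpha plays the role of the false-positive rate and power plays the role of sensitivity. The function names and the prior values are hypothetical, chosen only for illustration, and nothing here comes from an actual psi or medical dataset.

```python
def positive_predictive_value(prior, alpha, power):
    """P(effect is real | study is significant), by Bayes' rule.

    prior: assumed probability that the effect exists before the study
    alpha: false-positive rate (chance of significance when no effect exists)
    power: sensitivity (chance of significance when the effect exists)
    """
    true_pos = prior * power
    false_pos = (1.0 - prior) * alpha
    return true_pos / (true_pos + false_pos)


def positive_likelihood_ratio(alpha, power):
    """Likelihood ratio of a significant result: sensitivity / (1 - specificity)."""
    return power / alpha


if __name__ == "__main__":
    # Illustrative priors only; they are assumptions, not estimates.
    for prior in (0.5, 0.1, 0.01):
        ppv = positive_predictive_value(prior, alpha=0.05, power=0.5)
        print(f"prior={prior:.2f}  P(real effect | significant)={ppv:.3f}")
```

With the same alpha and power, the probability that a significant study reflects a real effect falls sharply as the prior falls, which is the sense in which the significance level alone does not say how likely the effect is to be real.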

The situation you describe arises all the time in medical testing - the result of N+1 testing overturns the results of N testing. And this happens when the Likelihood Ratio of the N+1 diagnostic test is better than the LR of the N diagnostic test. With respect to the Bayes' Factor, this would reflect the results under a high Bayes' factor supplanting the results under a low Bayes' Factor.
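
The "N+1 overturns N" situation can be sketched as a sequential odds update. This is a rough illustration that assumes the two tests are conditionally independent given the true state; the likelihood ratios (3.0 for a weak positive, 0.05 for a strong negative) and the 10% starting probability are made-up values, and update_probability is a hypothetical helper name.

```python
def update_probability(prior_prob, likelihood_ratio):
    """Convert probability to odds, multiply by the likelihood ratio, convert back."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    p0 = 0.10                               # assumed starting probability
    p_after_n = update_probability(p0, 3.0)  # test N: weak positive (LR = 3)
    p_after_n1 = update_probability(p_after_n, 0.05)  # test N+1: strong negative (LR = 0.05)
    print(f"start={p0:.3f}  after N={p_after_n:.3f}  after N+1={p_after_n1:.3f}")
```

In this sketch the second result dominates because its likelihood ratio is much further from 1 than the first, so the later test pulls the probability below where it started.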

Linda
(This post was last modified: 2017-10-06, 04:41 PM by fls.)
(2017-10-06, 04:13 PM)fls Wrote: [quoted]
So here's the deal: I've already said that I'm not an expert in research. Given that, I can't help but think that almost everything you are saying, assuming it is true Wink, is based on tests on a more deterministic system. I.e., the thing you are testing doesn't change behavior significantly, and thus it behaves in the same way the second time you test it. Given these limitations, it is certainly reasonable to think that a second test could shed light on a previous test's results.

I can't help but think, however, that your experience in medical testing is corrupting your expectations and assumptions in regard to testing this sort of system. I'm suggesting that how one must interpret psi test data might be quite different due to the nature of the thing. I can't make any pronouncements to that effect, it's just an instinct I have.

I wish we had someone here with deep experience collecting and analyzing this sort of test data (specifically psi) who could comment and set me (or you) straight. Smile

Until then, I'll probably need to keep my inexpert opinion/instinct on the matter to myself. (which will probably make lots of people happy)  Thumbs Up
(This post was last modified: 2017-10-06, 05:30 PM by jkmac.)
(2017-10-06, 05:28 PM)jkmac Wrote: So here's the deal: I've already said that I'm not an expert in research. Given that, I can't help but think that almost everything you are saying, assuming it is true Wink, is based on tests on a more deterministic system. I.e., the thing you are testing doesn't change behavior significantly, and thus it behaves in the same way the second time you test it. Given these limitations, it is certainly reasonable to think that a second test could shed light on a previous test's results.

Not really. The stuff we are testing in medicine acts much the same way psi acts.

Quote:I can't help but think, however, that your experience in medical testing is corrupting your expectations and assumptions in regard to testing this sort of system. I'm suggesting that how one must interpret psi test data might be quite different due to the nature of the thing. I can't make any pronouncements to that effect, it's just an instinct I have.

I have experience both in psi and in medicine. I suspect you don't have a realistic understanding of medicine (if you think that the thing we are testing doesn't change behavior significantly and behaves in the same way the second time you test it). 

Quote:I wish we had someone here with deep experience collecting and analyzing this sort of test data (specifically psi) who could comment and set me (or you) straight. Smile

We have had other people with expertise in collecting and analyzing this sort of test data come by on the previous Skeptic forums. Unfortunately, you would be unlikely to find them helpful, given that they weren't proponents and they agreed with me. Wink

Linda
(This post was last modified: 2017-10-06, 06:45 PM by fls.)
(2017-10-06, 05:28 PM)jkmac Wrote: [quoted]

What you describe is known as a loophole. In other words psi is too tricky to pin down so let's just assume it's all true.
(This post was last modified: 2017-10-06, 11:27 PM by Steve001.)
It often seems that psi researchers are quick to shout eureka when they should be a great deal more reserved in expressing positive conclusions. Here's a non-psi example of counting chickens before they've hatched. Start at 7 minutes specifically, or just watch the whole vid.
Are the Fundamental Constants Changing?
http://psiencequest.net/forums/thread-421.html

Quote:Three things the p value can't tell you about your hypothesis.
Statistics can be confusing, especially when you look under the hood at the mathematical engines that underlie it. That's why we use statistical software to do so much of the work for us, and why we use tools like p-values to help us make sense of what our data are saying.

The p-value is used in basic statistics, linear models, reliability, multivariate analysis, and many other methods. It's a concept every introductory statistics student and every Lean Six Sigma Green Belt learns at the start.  But it's frequently misinterpreted.

Andrew Gelman, director of the Applied Statistics Center at Columbia University, wrote a blog post that contains (amidst other interesting discussion) a good explanation of what a p-value is and, probably even more important, is NOT: 
 
 
"A p-value is the probability of seeing something as extreme as was observed, if the model were true."

In hypothesis testing, when your p-value is less than the alpha level you selected (typically 0.05), you'd reject the null hypothesis in favor of the alternative hypothesis.

Let's say we do a 2-sample t-test to assess the difference between the mean strength of steel from two mills. The null hypothesis says the two means are equal; the alternative hypothesis states that they are not equal. 

If we get a p-value of 0.02 and we're using 0.05 as our alpha level, we would reject the hypothesis that the population means are equal.

But here are three things we can't say based on the p-value:
 
 
  1. "There is 2% probability no difference exists, and 98% probability it does." 
    In fact, the p-value only says that IF the null hypothesis were true, we would see a difference as large or larger than this one only 2% of the time. If this seems confusing, just keep in mind that the p-value doesn't tell you anything directly about what you're observing, it tells you about your odds of observing it. 
  2. "Since we have a low p-value, this difference is important." 
    A p-value can tell you that a difference is statistically significant, but it tells you nothing about the size or magnitude of the difference.
  3. "The p-value is low, so the alternative hypothesis is true."
    A low p-value can give us statistical evidence to support rejecting the null hypothesis, but it does not prove that the alternative hypothesis is true. If you use an alpha level of 0.05, there's a 5% chance you will incorrectly reject the null hypothesis.
Does this mean that quality practitioners and others shouldn't use p-values?  Of course not--the p-value is a very useful tool!  We just need to be careful about how we interpret the p-value, and particularly careful about how we explain its significance to others.
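
The steel-mill example above can be explored with a small simulation. This is only a sketch: to stay within the Python standard library it swaps the article's 2-sample t-test for a two-sample z-test with a known standard deviation, and the sample size, sigma, trial count, and function names are all assumptions made for illustration. It estimates how often p < alpha occurs when the two mills truly have the same mean, and how often when they truly differ.

```python
import random
from statistics import NormalDist, fmean


def two_sample_z_pvalue(a, b, sigma):
    """Two-sided p-value for equal means, assuming a known common sigma."""
    se = sigma * (1.0 / len(a) + 1.0 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def rejection_rate(true_diff, n=30, sigma=1.0, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated studies with p < alpha for a given true difference."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        mill_a = [rng.gauss(0.0, sigma) for _ in range(n)]
        mill_b = [rng.gauss(true_diff, sigma) for _ in range(n)]
        if two_sample_z_pvalue(mill_a, mill_b, sigma) < alpha:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    print("rejection rate, no true difference:", rejection_rate(0.0))
    print("rejection rate, true difference of 1 sigma:", rejection_rate(1.0))
```

The first rate should land near alpha, because it is computed only over studies where the null hypothesis is true. The second rate is the power at that effect size. Neither number on its own says how likely a given significant result is to be a false alarm; that also depends on how often the null is true to begin with.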
 
(2017-10-06, 11:37 PM)Steve001 Wrote: [quoted]
A low p-value can give us statistical evidence to support rejecting the null hypothesis, but it does not prove that the alternative hypothesis is true. If you use an alpha level of 0.05, there's a 5% chance you will incorrectly reject the null hypothesis.

That's not true, but to be fair to Andrew Gelman, he didn't say that - it came from Eston Martz, who writes a blog for a software company that sells a statistics program.
(2017-10-06, 04:13 PM)fls Wrote: The p-value represents the specificity of the test. That is, it tells us how many false positives (and true negatives) we can expect when the patient doesn't have the condition (there is no psi). The p-value doesn't tell us about true positives. True positives are determined by the sensitivity of the test, which in the case of significance testing with p-values is something called "power". Sensitivity (power) tells us how many positive tests (significant studies) we can expect when the patient has the condition (there is psi). "Power" depends upon the size of the effect and the size of the study (and the sensitivity which we can ignore for now). When we quote the p-value, we give no information about true positives and false negatives - the two things we are most interested in if we are trying to prove a claim or determine if a patient has a condition. That is, the p-value doesn't tell us what we want to know - what is the probability that I have found the effect I am looking for, in this study, or what is the probability that this patient has the condition?

Chris, I'd be interested in your comments on the above. To me, it seems confused. Supposedly, "power" "tells us how many positive tests (significant studies) we can expect", but at the same time, p-values "give no information about true positives" - so how do we judge a study as "significant" in the absence of a p-value?
(2017-10-06, 06:45 PM)fls Wrote: We have had other people with expertise in collecting and analyzing this sort of test data come by on the previous Skeptic forums. Unfortunately, you would be unlikely to find them helpful, given that they weren't proponents and they agreed with me. Wink

Linda

I'm trying hard not to be rude right now, because if you look at my post, I pretty clearly state that either one of us might need to be set straight. Confused

Your wink doesn't change the fact that, the way I see it, you are being passive-aggressive.
(2017-10-06, 10:50 PM)Steve001 Wrote: What you describe is known as a loophole. In other words psi is too tricky to pin down so let's just assume it's all true.

Do you work at being this annoying or does it come naturally to you?
