The other day I was reading an interviewer's answer on Quora to the question “What-is-a-typical-data-scientist-interview-like”. He wrote: “What is P-Value ? – I expect candidates to know to explain to me what a P-Value is and what P-Value means (even at 4am…)”. That pretty much sums up how important it is to understand the p-value.
There are plenty of definitions on the web already, yet I have always had difficulty understanding its significance. I believe many others from a non-statistical background can relate to this. So let me offer a more intuitive explanation of the p-value.
Consider two groups: a control group and an experimental group. The experimental group is a sample drawn from a population on which an experiment is carried out; it is then compared with the control group, which does not receive the change being tested. The difference between the groups is expressed as a test statistic, for example the t-statistic of a t-test or the F-statistic of an F-test.
Null hypothesis: there is no difference between the two groups.
Alternative hypothesis: there is a real difference between the two groups.
Now we assume that the null hypothesis is true, i.e. there is no difference between the two groups, and run the experiment on the experimental group. We then check whether the effect observed on that group is significant. Let’s take some real-world examples to understand what’s happening:
- Measuring the effect of a new treatment against the best existing treatment.
- Checking whether a new ad placement on a website produces more clicks than the previous placement.
In the first case, patients make up the groups on which the effect is observed, while in the second case it is user clicks that are observed.
This is where the p-value comes in. How do we know that the effect on the group is not just a matter of chance? Well, we don’t, so we calculate how likely it is that chance alone would produce an effect at least as large as the one we observed.
P-value simplified: if you repeated the experiment over and over again with the same sample size (experimental group), what percentage of the time would you see a difference this large in the experimental group purely by chance? Or, to say the same thing another way, what percentage of the time would you see results at least as extreme as the ones you actually observed? (Mind it!! All of this assumes the null hypothesis is true.)
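To make the "repeat the experiment over and over" idea concrete, here is a minimal sketch of a permutation test in Python. It is one simple way to estimate a p-value, not the only one: if the null hypothesis is true, the group labels carry no meaning, so we shuffle them many times and count how often the shuffled difference is at least as large as the observed one. The function name `permutation_p_value` and the numbers in both groups are made up for illustration.

```python
import random


def permutation_p_value(control, experimental, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    n_control = len(control)
    n_experimental = len(experimental)

    # Difference actually observed between the two groups.
    observed = abs(sum(experimental) / n_experimental - sum(control) / n_control)

    # Under the null hypothesis the labels are interchangeable,
    # so pool everything and reshuffle the labels repeatedly.
    pooled = list(control) + list(experimental)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        fake_control = pooled[:n_control]
        fake_experimental = pooled[n_control:]
        diff = abs(sum(fake_experimental) / n_experimental
                   - sum(fake_control) / n_control)
        if diff >= observed:
            extreme += 1

    # Share of "repeated experiments" at least as extreme as the real one.
    return extreme / n_permutations


# Hypothetical daily click counts (illustrative numbers only).
control = [12, 15, 11, 14, 13, 12, 16, 13]
experimental = [16, 18, 15, 17, 19, 16, 20, 17]

p = permutation_p_value(control, experimental)
print(f"estimated p-value: {p:.4f}")
```

The returned fraction is exactly the "percentage of the time" described above: the share of label shuffles in which chance alone produced a difference as large as the real one.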
A p-value greater than 0.05 means that, if the null hypothesis were true, a difference at least as large as the one observed would show up by chance in more than 1 out of 20 repetitions of the experiment, so the evidence against the null hypothesis is weak. The value 0.05 is the threshold typically used; it is known as the level of significance (α). So in a regression problem, you want the p-value of a variable to be below 0.05 (ideally well below it) before treating that variable as significant.
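The meaning of α = 0.05 can be checked with a small simulation. The sketch below makes several assumptions that are not part of the discussion above: both groups are drawn from the same normal distribution (so the null hypothesis is true by construction), the standard deviation is known, and a two-sample z-test is used for simplicity. The function names are made up for illustration. Under those assumptions, roughly 1 experiment in 20 should still come out "significant" purely by chance.

```python
import random
from statistics import NormalDist, mean


def two_sample_z_p_value(a, b, sigma):
    """Two-sided p-value of a z-test, assuming a known common sigma."""
    standard_error = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_experiments=2000, n=30, alpha=0.05, seed=1):
    """Share of null-is-true experiments that still give p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: no real difference.
        control = [rng.gauss(0, 1) for _ in range(n)]
        experimental = [rng.gauss(0, 1) for _ in range(n)]
        if two_sample_z_p_value(control, experimental, sigma=1.0) < alpha:
            hits += 1
    return hits / n_experiments


rate = false_positive_rate()
print(f"share of experiments with p < 0.05 under the null: {rate:.3f}")
```

The printed share should land close to 0.05, which is what the level of significance promises: it caps how often you will call a difference significant when there is none.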