Before we talk about the p-value, let's first talk about hypothesis testing.
Whenever we work with data, we are most likely working with a sample from a population, for a simple reason: it is rarely feasible to study the whole population.
Let's say it is stated that the average height of the Indian population is 5.5 feet. Is it feasible to collect the height of every Indian and calculate the average to check whether the statement/hypothesis is right? Not really.
So what we do instead is collect a random sample that includes the heights of people from different parts of India and, based on that sample, see whether our hypothesis holds. In other words, we perform a statistical hypothesis test to decide whether the data/sample at hand sufficiently supports our hypothesis.
It is important to note that if we do random sampling multiple times, we will most likely get a different sample each time. So the average height calculated from a sample will change with the sample itself. Hence, whatever statement we make about the average height of the population will carry some uncertainty, i.e., it will be subject to a probability.
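To see this sampling variability concretely, here is a minimal sketch in Python. The "population" below is simulated; the normal distribution, its parameters, and the sample size are assumptions for illustration, not real height data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "population": one million heights (in feet) centred on 5.5.
population = rng.normal(loc=5.5, scale=0.3, size=1_000_000)

# Draw a few random samples; the sample mean moves around from draw to draw.
for i in range(3):
    sample = rng.choice(population, size=100, replace=False)
    print(f"sample {i + 1}: average height = {sample.mean():.3f} feet")
```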
In statistics terminology, the statement being tested is called the null hypothesis. In our case it is: the average height of the Indian population is 5.5 feet.
In most cases, the average height calculated from the sample will not be exactly 5.5 feet. So the question is whether the difference between the sample average and 5.5 feet is due to chance or is significant.
Now let's come to the p-value.
In null-hypothesis significance testing, the p-value is the probability of obtaining statistical test results at least as extreme as the result observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis. [Wikipedia p-value]
The question is: how small should a p-value be for us to reject the null hypothesis?
The cutoff is defined by the level of significance (alpha), which is the probability of rejecting the null hypothesis when the null hypothesis is actually true. In other words, the level of significance is the probability of mistakenly rejecting the null hypothesis, so the lower the cutoff, the lower the probability of such an error.
We usually reject the null hypothesis when the p-value is less than or equal to alpha. For our height example, if the p-value is less than or equal to alpha, we say the difference we observed between the sample average and 5.5 feet is not due to chance and is statistically significant.
Note: Typically alpha is taken as 5% or 0.05.
For testing the hypothesis, we calculate a test statistic, which is simply a function of the sample observations that reduces the data to a single value.
Since the test statistic is calculated from the sample observations, its value will change if the sample changes. In other words, the test statistic is a random variable.
The p-value derived from the test statistic is therefore also a random variable, which means it follows a probability distribution.
Mathematically, a p-value is the probability of observing a test statistic at least as extreme as the one observed, under the assumption that the null hypothesis is true [Wikipedia p-value].
Consider an observed test statistic $t$ from an unknown distribution $T$. Then the p-value $p$ is the prior probability of observing a test-statistic value at least as "extreme" as $t$ if the null hypothesis $H_{0}$ were true [Wikipedia p-value]. That is:
$p= Pr(T\geq t\mid H_{0})$ for extreme values on the right (one-sided right-tailed test).
For example, when testing whether the average height of Indians is 5.5 feet, the alternative hypothesis here is that the height is greater than 5.5 feet.
Similarly, $p= Pr(T\leq t\mid H_{0})$ for extreme values on the left (one-sided left-tailed test).
Here the alternative hypothesis is that the height is less than 5.5 feet.
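As a minimal sketch of both ideas, the test statistic as a single number computed from the sample, and the one-sided p-values derived from it, here is how a one-sample t-test could look in Python. The sample itself is simulated, so its values are assumptions for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heights = rng.normal(loc=5.6, scale=0.3, size=50)  # hypothetical sample (feet)

mu0 = 5.5  # hypothesised population mean under H0
n = len(heights)

# One-sample t-statistic: a function that reduces the sample to a single value.
t_stat = (heights.mean() - mu0) / (heights.std(ddof=1) / np.sqrt(n))

# p = Pr(T >= t | H0): right-tailed test (alternative: mean > 5.5 feet)
p_right = stats.t.sf(t_stat, df=n - 1)

# p = Pr(T <= t | H0): left-tailed test (alternative: mean < 5.5 feet)
p_left = stats.t.cdf(t_stat, df=n - 1)

print(f"t = {t_stat:.3f}, right-tailed p = {p_right:.4f}, left-tailed p = {p_left:.4f}")
```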
Why has the question of reliability been raised?
A lot of research, especially in medicine, psychology, and the social sciences, has gone through a reproducibility crisis: results obtained in the original studies have failed to replicate. And many of these results were derived based on the p-value.
Let's consider a real-life example. We have a dataset on breast cancer. Among patients who have undergone surgery, we want to check whether there's a significant difference in the average (median) age of patients who are alive and those who are dead. In this case, our null hypothesis is that there's no significant difference between the average age of patients who are alive and those who are dead.
After running Mood's median test we obtain a p-value of 0.9684. If we select the level of significance alpha as 0.05, then $p\text{-value} \gg \alpha$, which means such an extreme observed outcome is highly likely under the null hypothesis. We can conclude that, among patients who have undergone surgery, there's no significant difference between the average age of those who are alive and those who are dead.
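A sketch of how such a test could be run in Python: scipy implements Mood's median test as scipy.stats.median_test. The file name and column names below ("breast_cancer.csv", "Status", "Age") are assumptions for illustration, not the actual data layout:

```python
import pandas as pd
from scipy.stats import median_test

# Hypothetical layout: a DataFrame with an 'Age' column and a 'Status'
# column ('Alive' / 'Dead'), already filtered to patients who had surgery.
df = pd.read_csv("breast_cancer.csv")  # the path is an assumption

age_alive = df.loc[df["Status"] == "Alive", "Age"]
age_dead = df.loc[df["Status"] == "Dead", "Age"]

# Mood's median test: H0 says both groups share the same median age.
stat, p_value, grand_median, table = median_test(age_alive, age_dead)
print(f"p-value = {p_value:.4f}")
```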
Now there's an interesting property of the p-value: if the null hypothesis is true, the distribution of the p-value is uniform. And it makes sense: if the null hypothesis is true, the probability of getting a p-value below 5% (alpha) is 5%, the probability of getting one below 10% is 10%, and so on. That is, if the null hypothesis is true, the probability of a 5% error is exactly 5%. But in reality it may not be possible to repeat the experiment 100 times to obtain the distribution of the p-value and check whether it is uniform to confirm the results. So what we do is use the bootstrap method to compute the distribution of the p-value.
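Before turning to the bootstrap, a quick simulation can illustrate this uniformity property. Below, two samples are repeatedly drawn from the same distribution, so the null hypothesis of a two-sample t-test is true by construction; the distribution, sample sizes, and number of repetitions are all assumptions for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Under H0 both groups come from the same distribution,
# so any difference between them is pure chance.
p_values = np.array([
    stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).pvalue
    for _ in range(10_000)
])

# If the p-value is uniform, about 5% should fall below 0.05, 10% below 0.10, ...
print(f"Pr(p <= 0.05) = {(p_values <= 0.05).mean():.3f}")
print(f"Pr(p <= 0.10) = {(p_values <= 0.10).mean():.3f}")
```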
Bootstrap is a method of sampling with replacement from the original data.
So let's say we have observations: 1,2,3,4,5. An example of a bootstrap sample can be 1,2,2,3,4.
In this way, we can construct many such bootstrap samples. So, to obtain the distribution of the p-value for our difference-in-average-age problem, we generate many bootstrap samples from the original data, run the test on each of them, and compute its p-value, giving us many p-values. I have generated $10^{5}$ bootstrap samples, so we have $10^{5}$ corresponding p-values.
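A sketch of this procedure follows; the helper name bootstrap_p_values is hypothetical, and the default number of resamples is kept smaller than the $10^{5}$ used above so the sketch runs quickly:

```python
import numpy as np
from scipy.stats import median_test

def bootstrap_p_values(age_alive, age_dead, n_boot=10_000, seed=2):
    """Resample each group with replacement and recompute the test p-value."""
    rng = np.random.default_rng(seed)
    age_alive = np.asarray(age_alive)
    age_dead = np.asarray(age_dead)
    p_values = np.empty(n_boot)
    for i in range(n_boot):
        # A bootstrap sample has the same size as the original group.
        boot_alive = rng.choice(age_alive, size=len(age_alive), replace=True)
        boot_dead = rng.choice(age_dead, size=len(age_dead), replace=True)
        _, p_values[i], _, _ = median_test(boot_alive, boot_dead)
    return p_values

# Usage, given the two age groups from before:
# p_values = bootstrap_p_values(age_alive, age_dead)
```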
From the resulting distribution of p-values, we can see it is approximately uniform.
Now what if the distribution of the p-value is not uniform?
Next, we compare the average age of patients at various stages of tumor development.
We check whether there's a significant difference in the average age of patients at different stages of tumor development. After running the test on the observed data, we obtain a p-value of 0.02405. If we consider alpha as 0.05, then $p\text{-value} < \alpha$, so we can choose to reject the null hypothesis and conclude there's a significant difference in the average age of patients with different tumor stages.
But things get interesting if we observe the distribution of the bootstrap p-values. Approximately 71% of the p-values lie below 0.05 (alpha); that is, in 71% of the cases our p-value will be less than the level of significance. The curve in the plot is the best-fit beta distribution, with an expected value of 0.1026, while the mean computed directly from the bootstrap p-values is 0.0748. So even though the expected value of the p-value distribution is around 0.1, which is greater than alpha (0.05), we will end up rejecting the null hypothesis approximately 71% of the time at a 5% level of significance, and this can lead to serious consequences.
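These summaries can be reproduced from the bootstrap p-values along the following lines; the function name summarize_p_values is hypothetical, and its input is assumed to be an array of bootstrap p-values such as the one produced by the earlier sketch:

```python
import numpy as np
from scipy import stats

def summarize_p_values(p_values, alpha=0.05):
    """Report the rejection rate and the best-fit beta distribution."""
    p_values = np.asarray(p_values)
    print(f"Fraction below alpha: {(p_values < alpha).mean():.1%}")
    # Clip away from 0 and 1 so the beta fit never evaluates log(0).
    clipped = np.clip(p_values, 1e-6, 1 - 1e-6)
    a, b, _, _ = stats.beta.fit(clipped, floc=0, fscale=1)
    # The mean (expected value) of a Beta(a, b) distribution is a / (a + b).
    print(f"Best-fit Beta({a:.3f}, {b:.3f}), expected p-value = {a / (a + b):.4f}")
```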
In conclusion, making decisions based only on the p-value can have serious consequences. If you want to use a p-value, consider a cutoff much smaller than 0.05. In 2018, a group of statisticians led by Daniel Benjamin proposed adopting 0.005 as the standard value for statistical significance worldwide [Wikipedia p-value]. Also remember that rejecting the null hypothesis does not necessarily mean the alternative hypothesis is true.
Even in cases where there's no evidence of significance, we cannot say we accept the null hypothesis or that the null hypothesis is true. Just because we cannot prove two things are different does not mean they are necessarily the same. Remember, the absence of evidence is not the same as evidence of absence.
Along with the p-value, confidence intervals should be considered, as they also tell us how reliable the estimates are. Even after using multiple methods for testing significance, we might obtain flawed results because of wrong assumptions, inappropriate sampling techniques for the data, and so on; in that case, the methods themselves become unreliable.
To dive more into the p-value issues and possible solutions, I suggest the article "A critical evaluation of the current 'p-value controversy'", in which we also shed some light on the distribution of the p-value. In case you want to dive deeper, "How to interpret p-value histogram" is an interesting read. If you want to look into the theoretical aspects, Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications by Nassim Nicholas Taleb is an excellent book.
One more possible solution is the Bayesian one: in the Bayesian setting, we can talk about the credible interval, which contains the parameter with a known probability. But Bayesian methods have their own challenges, especially when it comes to the prior distribution. We will keep the exploration of the Bayesian solution for the future.