Suppose you are working with a medical dataset containing information on patients who regularly visit a clinic.
For every patient, we have multiple observations on several health parameters, such as blood glucose level and blood pressure.
We are interested in the relationship between these health parameters and forced vital capacity (FVC).
FVC is the maximum amount of air a person can forcibly exhale from their lungs after fully inhaling.
A diminished FVC value is a sign of several conditions, including chronic obstructive pulmonary disease (COPD), chronic bronchitis, emphysema, and bronchiectasis.
It is therefore important to understand what relationship the other health parameters, such as blood pressure, share with FVC.
When we think about modelling the relationship between variables, the first thing that comes to mind is a regression setup.
If we build the simplest possible regression model, the regression coefficients of the different health parameters are the same for all patients. This model will not give us a complete picture of the health of individual patients. Such a model is called a pooled model.
On the other hand, if we build a separate model for each patient, we get information about every patient separately. Here we treat every patient as a distinct entity, but that may not be realistic: patients may share something in common, and assuming that every patient is completely different can lead to an inaccurate picture. We may also run into practical problems, such as not having sufficient data on some patients to fit a regression model we are confident in. This model is called an unpooled model.
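In code, the two extremes look like the following minimal sketch (assuming a data frame df with columns y, X1, X2 and a patient identifier id; the names are illustrative):

# Pooled model: one set of coefficients shared by all patients
pooled = lm(y ~ X1 + X2, data = df)
# Unpooled model: a separate fit for every patient
unpooled = lapply(split(df, df$id), function(d) lm(y ~ X1 + X2, data = d))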
Here we will use partial pooling. Specifically, we will assume that the regression coefficients differ across patients, as in the unpooled case, but that they share a common part. We can model this by assuming that each coefficient comes from a common group-level distribution.
* Bayesian hierarchical linear regression is a statistical model that allows us to analyse data with a hierarchical or nested structure.
* In this model, the data are assumed to come from multiple levels, with each level having its own set of predictor values and outcomes. For example, in our medical study, the patients may be nested within hospitals, and each patient has characteristics (e.g., age, sex, blood pressure) that influence their FVC.
* The hierarchical structure allows for the estimation of both individual-level and group-level effects, which can improve the accuracy of the estimates and account for the variability within and between groups. The Bayesian approach allows prior knowledge and uncertainty to be incorporated into the model, which can improve the interpretability of the results and provide more robust inferences.
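Concretely, with j indexing the level (patient) and i the observation within it, the partially pooled model has the form

y_ij = beta0_j + beta1_j * X1_ij + beta2_j * X2_ij + eps_ij,   eps_ij ~ N(0, sigma^2),

where each coefficient is drawn from a common group-level distribution (e.g., beta0_j ~ N(mubeta0, sigmabeta0^2)), and the group-level parameters themselves receive priors. This is essentially the structure that the simulation and the ABC fit below follow, although the data generation uses heavier-tailed draws for some of the coefficients.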
For illustration, we will work with simulated data.
We assume 5 levels, or classes; in the analogy with the FVC case, the levels correspond to patients.
We generate 100 observations in total across the 5 levels.
The observations are distributed equally across the levels, so we have 20 observations per level.
set.seed(1779)
n = 100
level = 5                       # number of classes in the data
k = n / level                   # number of observations in each class
mu1 = rcauchy(level)            # class-specific means of X1
mu2 = rnorm(level)              # class-specific means of X2
sigma1 = 5
sigma2 = 1
X1 = c()
X2 = c()
for (i in 1:level)
{
  idx = ((i - 1) * k + 1):(i * k)   # observations belonging to class i
  X1[idx] = rnorm(k, mu1[i], sigma1)
  X2[idx] = rnorm(k, mu2[i], sigma2)
}
eps = rnorm(n)                  # observation-level noise
mubeta0 = rnorm(1, 8, 1)        # group-level means for the coefficients
mubeta1 = rnorm(1, 5, 3)
mubeta2 = rnorm(1, 3, 2)
print('Beta0 values are:')
beta0 = abs(rnorm(level, mubeta0)); beta0
print('Beta1 values are:')
beta1 = rcauchy(level, mubeta1); beta1
print('Beta2 values are:')
beta2 = rnorm(level, mubeta2, 3); beta2
X1beta = c()
X2beta = c()
for (i in 1:level)
{
  idx = ((i - 1) * k + 1):(i * k)
  X1beta[idx] = beta0[i] + beta1[i] * X1[idx]   # intercept plus X1 effect for class i
  X2beta[idx] = beta2[i] * X2[idx]              # X2 effect for class i
}
y = X1beta + X2beta + eps
data = cbind(y, X1, X2)
[1] "Beta0 Values are:"
[1] "Beta1 Values are:"
[1] "Beta2 Values are:"
# we plot the data
par(mfrow=c(1,3))
plot(X1,y)
plot(X2,y)
plot(X1,X2)
# fit a separate regression for every level (the unpooled model)
mod = list()
for (i in 1:level)
{
  idx = ((i - 1) * k + 1):(i * k)
  mod[[i]] = lm(y[idx] ~ X1[idx] + X2[idx])
}
# collecting all the coefficients (for different classes)
beta1_hat=c()
beta0_hat=c()
beta2_hat=c()
for(i in 1:level)
{
beta0_hat[i]= mod[[i]]$coefficients[1]
beta1_hat[i]= mod[[i]]$coefficients[2]
beta2_hat[i]= mod[[i]]$coefficients[3]
}
# Now we use the above coefficients to construct priors for the
# Bayesian hierarchical model
m_beta0 = mean(beta0_hat)
m_beta1 = mean(beta1_hat)
m_beta2 = mean(beta2_hat)
sd_beta0 = sd(beta0_hat)
sd_beta1 = sd(beta1_hat)
sd_beta2 = sd(beta2_hat)
Approximate Bayesian Computation (ABC) is a statistical method that allows for the inference of complex models when likelihood functions are difficult or impossible to calculate. ABC works by simulating data from a model using a set of parameters sampled from a prior distribution. These simulated data are then compared to the actual observed data, and the parameters that produce simulations that are most similar to the observed data are retained. This process is repeated many times to generate a posterior distribution of the parameters that are consistent with the observed data.
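As a toy illustration of this rejection scheme, the following sketch infers the mean of a normal sample (the data, prior, and threshold are all illustrative choices):

# Toy ABC rejection sampler for the mean of a normal sample
obs = rnorm(50, mean = 2)                     # "observed" data
accepted = c()
for (s in 1:10^4)
{
  theta = rnorm(1, 0, 10)                     # draw a candidate from the prior
  sim_data = rnorm(50, theta)                 # simulate data under the candidate
  if (abs(mean(sim_data) - mean(obs)) < 0.1)  # compare summary statistics
    accepted = c(accepted, theta)             # retain candidates that match closely
}
hist(accepted)                                # approximate posterior for the mean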
# Now we place priors on our betas.
# Assuming we do not have much of an idea about the parameters,
# we select some vague priors.
sim = 10^7
# matrices (level x sim) to store the generated coefficients
beta2_gen = beta1_gen = beta0_gen = matrix(0, nrow = level, ncol = sim)
thresh = 1                      # acceptance threshold on the distance
# preallocate the distances between generated and fitted coefficients
diff_b2 = diff_b1 = diff_b0 = numeric(sim)
for (i in 1:sim)
{
  # draw group-level means and standard deviations from vague hyperpriors
  mubeta0 = rnorm(1, m_beta0, 30)
  sigmabeta0 = abs(rnorm(1, sd_beta0, 10))
  mubeta1 = rnorm(1, m_beta1, 10)
  sigmabeta1 = abs(rnorm(1, sd_beta1, 10))
  mubeta2 = rnorm(1, m_beta2, 30)
  sigmabeta2 = abs(rnorm(1, sd_beta2, 30))
  # draw level-specific coefficients from the group distributions
  for (j in 1:level)
  {
    beta0_gen[j, i] = rnorm(1, mubeta0, sigmabeta0)
    beta1_gen[j, i] = rnorm(1, mubeta1, sigmabeta1)
    beta2_gen[j, i] = rnorm(1, mubeta2, sigmabeta2)
  }
  # Euclidean distance between generated and fitted coefficients
  diff_b0[i] = sqrt(sum((beta0_gen[, i] - beta0_hat)^2))
  diff_b1[i] = sqrt(sum((beta1_gen[, i] - beta1_hat)^2))
  diff_b2[i] = sqrt(sum((beta2_gen[, i] - beta2_hat)^2))
}
index_b0 = which(diff_b0 < thresh)   # accepted simulations for beta0
index_b1 = which(diff_b1 < thresh)   # accepted simulations for beta1
index_b2 = which(diff_b2 < thresh)   # accepted simulations for beta2
m_beta0 = apply(beta0_gen[, index_b0], 1, mean)
m_beta1 = apply(beta1_gen[, index_b1], 1, mean)
m_beta2 = apply(beta2_gen[, index_b2], 1, mean)
print("Number of posterior samples obtained for Beta0 are: ");length(index_b0)
print("Number of posterior samples obtained for Beta1 are: ");length(index_b1)
print("Number of posterior samples obtained for Beta2 are: ");length(index_b2)
[1] "Number of posterior samples obtained for Beta0 are: "
[1] "Number of posterior samples obtained for Beta1 are: "
[1] "Number of posterior samples obtained for Beta2 are: "
m_beta0 # posterior means of Beta0 for each level
beta0   # original Beta0 coefficients
m_beta1 # posterior means of Beta1 for each level
beta1   # original Beta1 coefficients
m_beta2 # posterior means of Beta2 for each level
beta2   # original Beta2 coefficients
# The means of the posterior samples based on the threshold are close to the
# original values of the betas, except for beta2.
# Convert the posterior samples for the 5 levels (matrices) into vector form
gen_beta0 = as.vector(beta0_gen[, index_b0])
gen_beta1 = as.vector(beta1_gen[, index_b1])
gen_beta2 = as.vector(beta2_gen[, index_b2])
beta0.samples = t(beta0_gen[, index_b0])
beta1.samples = t(beta1_gen[, index_b1])
beta2.samples = t(beta2_gen[, index_b2])
library(ggplot2)
# Density plots of the posterior samples
ggplot(data.frame(beta = gen_beta0), aes(x = beta)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  labs(title = "Density plot of beta0")
ggplot(data.frame(beta = gen_beta1), aes(x = beta)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  labs(title = "Density plot of beta1")
ggplot(data.frame(beta = gen_beta2), aes(x = beta)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  labs(title = "Density plot of beta2")
# Box plots of the posterior samples by level.
# The vectors stack the levels within each accepted simulation,
# so the class labels cycle 1:level down the vector.
ggplot(data.frame(beta = gen_beta0,
                  class = factor(rep(1:level, times = length(gen_beta0) / level))),
       aes(y = beta, x = class)) +
  geom_boxplot(fill = "lightblue", alpha = 0.5) +
  facet_wrap(~ class) +
  labs(title = "Box plot of beta0 by level")
ggplot(data.frame(beta = gen_beta1,
                  class = factor(rep(1:level, times = length(gen_beta1) / level))),
       aes(y = beta, x = class)) +
  geom_boxplot(fill = "lightblue", alpha = 0.5) +
  facet_wrap(~ class) +
  labs(title = "Box plot of beta1 by level")
ggplot(data.frame(beta = gen_beta2,
                  class = factor(rep(1:level, times = length(gen_beta2) / level))),
       aes(y = beta, x = class)) +
  geom_boxplot(fill = "lightblue", alpha = 0.5) +
  facet_wrap(~ class) +
  labs(title = "Box plot of beta2 by level")
# Trace plots of the posterior samples
ggplot(data.frame(iter = seq_along(gen_beta0), beta = gen_beta0),
       aes(x = iter, y = beta)) +
  geom_line() +
  labs(title = "Trace plot of beta0")
ggplot(data.frame(iter = seq_along(gen_beta1), beta = gen_beta1),
       aes(x = iter, y = beta)) +
  geom_line() +
  labs(title = "Trace plot of beta1")
ggplot(data.frame(iter = seq_along(gen_beta2), beta = gen_beta2),
       aes(x = iter, y = beta)) +
  geom_line() +
  labs(title = "Trace plot of beta2")
# Generate predicted values over the range of the predictor.
# beta0 and beta1 have different sets of accepted simulations, so their
# components cannot be combined; only the beta1 * X1 part is used here.
y.predict = matrix(0, nrow = n, ncol = nrow(beta1.samples))
for (i in 1:nrow(beta1.samples))
{
  for (j in 1:level)
  {
    idx = ((j - 1) * k + 1):(j * k)
    y.predict[idx, i] = beta1.samples[i, j] * X1[idx]
  }
}
# Compute the mean and standard deviation of the predicted values for each x value
y.mean = apply(y.predict, 1, mean)
y.sd = apply(y.predict, 1, sd)
# Compute the 95% credible interval for each x value
ci.lower = y.mean - 1.96*y.sd
ci.upper = y.mean + 1.96*y.sd
# Plot the data with the posterior-mean line and a shaded 95% credible band
ord = order(X1)   # sort by X1 so the line and band draw cleanly
ggplot(data.frame(x = X1[ord], y = y[ord],
                  lo = ci.lower[ord], hi = ci.upper[ord], fit = y.mean[ord]),
       aes(x = x, y = y)) +
  geom_point() +
  geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.3) +
  geom_line(aes(y = fit), color = "red", linewidth = 1) +
  labs(title = "95% credible band for the regression using X1 as predictor",
       x = "X1", y = "y")
# Repeat for X2: generate predicted values using only the beta2 * X2 component
y.predict = matrix(0, nrow = n, ncol = nrow(beta2.samples))
for (i in 1:nrow(beta2.samples))
{
  for (j in 1:level)
  {
    idx = ((j - 1) * k + 1):(j * k)
    y.predict[idx, i] = beta2.samples[i, j] * X2[idx]
  }
}
# Compute the mean and standard deviation of the predicted values for each x value
y.mean = apply(y.predict, 1, mean)
y.sd = apply(y.predict, 1, sd)
# Compute the 95% credible interval for each x value
ci.lower = y.mean - 1.96*y.sd
ci.upper = y.mean + 1.96*y.sd
# Plot the data with the posterior-mean line and a shaded 95% credible band
ord = order(X2)
ggplot(data.frame(x = X2[ord], y = y[ord],
                  lo = ci.lower[ord], hi = ci.upper[ord], fit = y.mean[ord]),
       aes(x = x, y = y)) +
  geom_point() +
  geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.3) +
  geom_line(aes(y = fit), color = "red", linewidth = 1) +
  labs(title = "95% credible band for the regression using X2 as predictor",
       x = "X2", y = "y")
* From the 95% credible plots above, we can see that the regression line generated by the Bayesian hierarchical model for variable X1 captures the pattern in the data reasonably well.
* We had already seen visually that no clear relationship exists between X2 and y, so the credible plot for X2 comes with a very wide 95% credible interval.
* If you observed carefully, the numbers of posterior samples accepted differ across the betas, which makes it difficult to draw joint inferences across the variables. This is one of the problems with using ABC.
* Very few posterior samples for the betas were accepted, even after generating 1e+07 simulations.
Both problems can be approached by using different thresholds for the different betas and by tuning the prior parameters of the beta coefficients; one option is sketched below.
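For instance, instead of a hard cutoff we can accept a fixed fraction of the closest simulations for each coefficient separately (a sketch; the 0.1% fraction is an illustrative choice). This also guarantees the same number of accepted draws for every beta.

# Accept the closest 0.1% of simulations for each coefficient separately
index_b0 = which(diff_b0 <= quantile(diff_b0, 0.001))
index_b1 = which(diff_b1 <= quantile(diff_b1, 0.001))
index_b2 = which(diff_b2 <= quantile(diff_b2, 0.001))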
* Of course, the code can be made simpler.
* What about Non-linear Models?