# Back to the Basics: Probit Regression | by Akif Mustafa | Nov, 2023


## A crucial method in binary outcome analysis

Whenever we face a task that involves analyzing binary outcomes, we often think of logistic regression as the go-to method. That’s why most articles about binary outcome regression focus exclusively on logistic regression. However, logistic regression is not the only option available. There are other methods, such as the Linear Probability Model (LPM), Probit regression, and Complementary Log-Log (Cloglog) regression. Unfortunately, few articles on these topics are available online.

The Linear Probability Model is rarely used because it fits a straight line and so cannot capture the curvilinear relationship between a binary outcome and independent variables. I have discussed Cloglog regression in a previous article. While there are some articles on Probit regression available on the internet, they tend to be technical and difficult for non-technical readers to understand. In this article, we will explain the basic principles of Probit regression and its applications and compare it with logistic regression.

This is how a relationship between a binary outcome variable and an independent variable typically looks:

The curve you see is called an S-shaped curve or sigmoid curve. If we closely observe this plot, we’ll notice that it resembles the cumulative distribution function (CDF) of a random variable. Therefore, it makes sense to use a CDF to model the relationship between a binary outcome variable and independent variables. The two most commonly used CDFs are those of the logistic and the normal distributions. Logistic regression utilizes the logistic CDF, given by the following equation:
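Written with an intercept β1 and a slope β2, the notation this article uses later, the logistic CDF takes the form below. The symbols P_i (probability of the outcome for the ith individual) and X_i (the independent variable) are assumed names for this sketch.

$$
P_i = \frac{1}{1 + e^{-(\beta_1 + \beta_2 X_i)}}
$$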

In Probit regression, we utilize the cumulative distribution function (CDF) of the normal distribution. Naturally, we can simply replace the logistic CDF with the normal CDF to get the equation of Probit regression:
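With the same assumed symbols (P_i for the probability, X_i for the independent variable, β1 and β2 for the intercept and slope), the Probit equation can be written as:

$$
P_i = \Phi(\beta_1 + \beta_2 X_i)
$$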

Where Φ() represents the cumulative distribution function of the standard normal distribution.

We could memorise this equation, but that would not help us understand how Probit regression actually works. Therefore, we will adopt a different approach to gain a better understanding.

Let us say we have data on the weight and depression status of a sample of 1000 individuals. Our objective is to examine the relationship between weight and depression using Probit regression. (Download the data from this link.)

To provide some intuition, let’s imagine that whether an individual (the ith individual) will experience depression or not depends on an unobservable latent variable, denoted as Ai. This latent variable is influenced by one or more independent variables. In our scenario, the weight of an individual determines the value of the latent variable. The probability of experiencing depression increases as the latent variable increases.
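In the β1, β2 notation used in the rest of this article, this dependence of the latent variable on weight can be written as a linear equation. Here X_i, standing for the weight of the ith individual, is an assumed symbol.

$$
A_i = \beta_1 + \beta_2 X_i
$$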

The question is, since Ai is an unobserved latent variable, how do we estimate the parameters of the above equation? Well, if we assume that it is normally distributed with the same mean and variance, we will be able to obtain some information regarding the latent variable and estimate the model parameters. I will explain the equations in more detail later, but first, let’s perform some practical calculations.

Coming back to our data: let us calculate the probability of depression for each weight and tabulate it. For example, there are 7 people with a weight of 40 kg, and 1 of them has depression, so the probability of depression for weight 40 is 1/7 = 0.14286. If we do this for all weights, we will get this table:
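The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not the article's actual computation: the records are synthetic stand-ins generated on the spot (the real dataset is not reproduced here), and the names `records`, `totals`, `cases` and `prob_by_weight` are hypothetical.

```python
import random
from collections import defaultdict
from statistics import NormalDist

# Synthetic stand-in for the article's dataset (NOT the real data):
# 1000 individuals with integer weights in kg and a 0/1 depression flag
# drawn from a probit model with illustrative coefficients.
random.seed(0)
records = []
for _ in range(1000):
    weight = random.randint(40, 110)
    p_true = NormalDist().cdf(-1.6 + 0.025 * weight)
    depressed = 1 if random.random() < p_true else 0
    records.append((weight, depressed))

# Count individuals and depression cases for every distinct weight.
totals = defaultdict(int)
cases = defaultdict(int)
for weight, depressed in records:
    totals[weight] += 1
    cases[weight] += depressed

# Observed probability of depression in each weight group.
prob_by_weight = {w: cases[w] / totals[w] for w in sorted(totals)}

# Show the first few rows of the resulting table.
for w in list(prob_by_weight)[:5]:
    print(f"{w} kg: {cases[w]}/{totals[w]} = {prob_by_weight[w]:.5f}")
```

Each row is simply "cases divided by group size", the same arithmetic as the 1/7 example above.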

Now, how do we get the values of the latent variable? We know that the normal CDF gives the probability P(Z ≤ z) for a given value z. The inverse CDF of the normal distribution does the opposite: it gives the value z for a given probability. In this case, we already have the probability values, which means we can determine the corresponding value of the latent variable by using the inverse CDF of the normal distribution. [Note: The inverse normal CDF function is available in almost every statistical software package, including Excel.]
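As a small sketch of this step, Python's standard library exposes the inverse normal CDF as `statistics.NormalDist.inv_cdf`. The helper name `normit` is hypothetical. Returning `None` for probabilities of exactly 0 or 1 is a choice made for this sketch, since such groups have no finite z-score.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def normit(p):
    """Return the z-score whose cumulative probability is p.

    Probabilities of exactly 0 or 1 have no finite z-score, so this
    sketch returns None for them instead of raising an error.
    """
    if p <= 0.0 or p >= 1.0:
        return None
    return std_normal.inv_cdf(p)

# The 40 kg group from the text: 1 case of depression out of 7 people.
p_40 = 1 / 7
z_40 = normit(p_40)
print(f"p = {p_40:.5f} -> normit = {z_40:.4f}")

# Round trip: applying the CDF to the normit recovers the probability.
print(f"CDF(normit) = {std_normal.cdf(z_40):.5f}")
```

For the 40 kg group this gives a latent value of about -1.07, which is negative because the observed probability is below 0.5.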

This unobserved latent variable Ai is known as the normal equivalent deviate (n.e.d.), or simply the normit. Looking closely, it is nothing but the z-score associated with each observed probability. Once we have the estimated Ai, estimating β1 and β2 is relatively simple: we can run a simple linear regression of Ai on our independent variable (weight).
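The regression step can be sketched as follows. The grouped proportions below are hypothetical round numbers chosen only for illustration; they are not the article's table. The sketch uses plain, unweighted least squares, matching the "simple linear regression" described above.

```python
from statistics import NormalDist

# Hypothetical grouped data (NOT the article's table): weight in kg and
# the observed proportion with depression in each weight group.
weights = [40, 50, 60, 70, 80, 90, 100]
proportions = [0.28, 0.37, 0.47, 0.57, 0.67, 0.76, 0.83]

# Step 1: convert each proportion into a normit with the inverse normal CDF.
std_normal = NormalDist()
normits = [std_normal.inv_cdf(p) for p in proportions]

# Step 2: ordinary least squares of the normits on weight.
n = len(weights)
mean_x = sum(weights) / n
mean_a = sum(normits) / n
s_xy = sum((x - mean_x) * (a - mean_a) for x, a in zip(weights, normits))
s_xx = sum((x - mean_x) ** 2 for x in weights)

beta2 = s_xy / s_xx               # slope: change in z-score per kg
beta1 = mean_a - beta2 * mean_x   # intercept

print(f"A_i = {beta1:.4f} + {beta2:.4f} * weight")
```

Groups with an observed proportion of exactly 0 or 1 would need to be dropped or adjusted before this step, because their normits are not finite.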

The coefficient of weight, 0.0256, gives us the change in the z-score of the outcome variable (depression) associated with a one-unit change in weight. Specifically, a 1 kg increase in weight is associated with an increase of approximately 0.0256 in the z-score that determines the probability of depression. We can calculate the probability of depression for any weight using the standard normal distribution. For example, for weight 70,

Ai = -1.61279 + (0.02565)*70

Ai = 0.1828

The probability associated with a z-score of 0.1828, i.e. P(Z ≤ 0.1828), is 0.57; that is, the predicted probability of depression for weight 70 is 0.57.
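The same calculation can be checked with a short sketch that plugs the coefficients quoted above into the standard normal CDF. The function name `predict_probability` is a hypothetical helper, not part of any library.

```python
from statistics import NormalDist

# Coefficients quoted in the text (intercept and slope for weight in kg).
beta1 = -1.61279
beta2 = 0.02565

def predict_probability(weight_kg):
    """Predicted probability of depression: Phi(beta1 + beta2 * weight)."""
    latent = beta1 + beta2 * weight_kg
    return NormalDist().cdf(latent)

latent_70 = beta1 + beta2 * 70
print(f"A_i for 70 kg = {latent_70:.4f}")
print(f"P(depression | 70 kg) = {predict_probability(70):.2f}")
```

With these rounded coefficients the latent value comes out a hair below the 0.1828 quoted above, a rounding difference that does not change the predicted probability of 0.57.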

It is quite reasonable to say that the above explanation was an oversimplification of a moderately complex method. It is also important to note that it is just an illustration of the basic principle behind the use of cumulative normal distribution in Probit regression. Now, let us have a look at the mathematical equations.

## Mathematical Structure

We discussed earlier that there exists a latent variable, Ai, that is determined by the predictor variables. It is logical to assume that there exists a critical or threshold value (Ai_c) of the latent variable such that if Ai exceeds Ai_c, the individual will have depression; otherwise, he/she will not. Given the assumption of normality, the probability that Ai_c is less than or equal to Ai, which is the probability of depression, can be calculated from the standardized normal CDF:
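In symbols, depression occurs when the threshold does not exceed the latent variable, so the probability can be written as below. This is a reconstruction in the article's notation; Y_i (the 0/1 depression indicator), X_i (weight) and the superscript form of the threshold Ai_c are assumed symbols.

$$
P_i = P(Y_i = 1 \mid X_i) = P(A_i^{c} \le A_i) = P(Z_i \le \beta_1 + \beta_2 X_i) = F(\beta_1 + \beta_2 X_i)
$$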

Where Zi is the standard normal variable, i.e., Z ∼ N(0, 1), and F is the standard normal CDF.

The information related to the latent variable and β1 and β2 can be obtained by taking the inverse of the above equation:
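Assuming the probability is written as P_i = F(β1 + β2 X_i), with F the standard normal CDF and X_i the weight (assumed symbols), applying the inverse CDF to both sides gives:

$$
A_i = F^{-1}(P_i) = \beta_1 + \beta_2 X_i
$$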

The inverse CDF of the standardized normal distribution is used when we want to obtain the value of Z for a given probability value.

Now, the estimation process of β1, β2, and Ai depends on whether we have grouped data or individual-level ungrouped data.

When we have grouped data, it is easy to calculate the probabilities. In our depression example, the initial data is ungrouped, i.e. each record holds an individual’s weight and depression status (1 or 0). The total sample size was 1000, but we grouped the data by weight, which gave 71 groups, and calculated the probability of depression in each weight group.

However, when the data is ungrouped, the Maximum Likelihood Estimation (MLE) method is utilized to estimate the model parameters. The figure below shows the Probit regression on our ungrouped data (n = 1000):
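To make the idea of maximum likelihood concrete, here is a minimal sketch that fits a one-predictor Probit model by Fisher scoring using only the Python standard library. Everything in it is an assumption for illustration: the data are synthetic (not the article's dataset), the "true" coefficients are illustrative, and `fit_probit` is a hypothetical helper rather than a library function. In practice a statistics package would be used instead.

```python
import random
from statistics import NormalDist

std_normal = NormalDist()

# Synthetic individual-level data (NOT the article's dataset): weights in kg
# and a 0/1 outcome drawn from a probit model with illustrative coefficients.
random.seed(42)
TRUE_B1, TRUE_B2 = -1.6, 0.025
weights = [random.randint(40, 110) for _ in range(1000)]
depressed = [
    1 if random.random() < std_normal.cdf(TRUE_B1 + TRUE_B2 * w) else 0
    for w in weights
]

def fit_probit(x, y, max_iter=25, tol=1e-10):
    """Maximise the probit log-likelihood for P(y=1|x) = Phi(b1 + b2*x).

    Uses Fisher scoring: each step solves (expected information) * step
    = score for the two parameters. Returns the estimates (b1, b2).
    """
    b1, b2 = 0.0, 0.0
    for _ in range(max_iter):
        g1 = g2 = 0.0            # score (gradient of the log-likelihood)
        i11 = i12 = i22 = 0.0    # expected information matrix entries
        for xi, yi in zip(x, y):
            eta = b1 + b2 * xi
            # Clamp the probability away from 0 and 1 for numerical safety.
            p = min(max(std_normal.cdf(eta), 1e-12), 1.0 - 1e-12)
            dens = std_normal.pdf(eta)
            var = p * (1.0 - p)
            u = (yi - p) * dens / var
            w = dens * dens / var
            g1 += u
            g2 += u * xi
            i11 += w
            i12 += w * xi
            i22 += w * xi * xi
        # Solve the 2x2 linear system by hand.
        det = i11 * i22 - i12 * i12
        step1 = (i22 * g1 - i12 * g2) / det
        step2 = (i11 * g2 - i12 * g1) / det
        b1 += step1
        b2 += step2
        if abs(step1) + abs(step2) < tol:
            break
    return b1, b2

b1_hat, b2_hat = fit_probit(weights, depressed)
print(f"intercept = {b1_hat:.4f}, weight coefficient = {b2_hat:.4f}")
```

Unlike the grouped approach, this works directly on the 0/1 outcomes, so no observed proportions (and no problematic groups with a proportion of 0 or 1) are needed.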

It can be observed that the coefficient of weight is very close to what we estimated with the grouped data.

Now that we have grasped the concept of Probit regression and are familiar (hopefully) with logistic regression, the question arises: which model is preferable? Which model performs better under different conditions? Well, both models are quite similar in their application and yield comparable results (in terms of predicted probabilities). The only minor distinction lies in their sensitivity to extreme values. Let’s take a closer look at both models: