# Binary logistic regression in R

So we must sample the observations in approximately equal proportions to get better models. In doing so, we will put the rest of inputData not used for training into the testData validation sample. Next, it is desirable to find the information value of the variables to get an idea of how valuable they are in explaining the dependent variable ABOVE50K.
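The sampling step above can be sketched in base R. This is a minimal illustration, assuming an `inputData` data frame with a 0/1 column `ABOVE50K`; the toy data generated at the top stands in for the real dataset.

```r
# Toy stand-in for the article's inputData; the real data has more columns.
set.seed(100)  # for repeatability of samples
inputData <- data.frame(AGE      = sample(18:70, 500, replace = TRUE),
                        ABOVE50K = rbinom(500, 1, 0.1))

input_ones  <- inputData[inputData$ABOVE50K == 1, ]  # all 1's
input_zeros <- inputData[inputData$ABOVE50K == 0, ]  # all 0's

# 70% of the 1's go to training; draw the SAME number of 0's so the
# training sample has events and non-events in equal proportion.
n_train    <- floor(0.7 * nrow(input_ones))
ones_rows  <- sample(seq_len(nrow(input_ones)),  n_train)
zeros_rows <- sample(seq_len(nrow(input_zeros)), n_train)

trainingData <- rbind(input_ones[ones_rows, ], input_zeros[zeros_rows, ])

# Everything not used for training becomes the testData validation sample.
testData <- rbind(input_ones[-ones_rows, ], input_zeros[-zeros_rows, ])
```

Because the same count is drawn from each class, the training sample is balanced even though the full data is not.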

Optionally, we can create WOE equivalents for all categorical variables. This is an optional step; for simplicity, it is not run in this analysis. A quick note about the plogis function: when we use the predict function on this model, it will predict the log odds of the Y variable. This is not what we ultimately want, because the predicted values may not lie within the 0 to 1 range as expected.
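For reference, the WOE idea can be sketched as follows. This is a hypothetical helper (the article does not run this step): for each category, WOE is the log ratio of the category's share of events to its share of non-events, and summing `(share_events - share_nonevents) * WOE` gives the variable's information value.

```r
# Sketch of WOE and information value for one categorical variable.
# woe_table() is a hypothetical helper, not part of the article's analysis.
woe_table <- function(x, y) {
  tab <- table(x, y)                              # categories vs 0/1 outcome
  pct_event    <- tab[, "1"] / sum(tab[, "1"])    # share of events per level
  pct_nonevent <- tab[, "0"] / sum(tab[, "0"])    # share of non-events per level
  woe <- log(pct_event / pct_nonevent)
  iv  <- sum((pct_event - pct_nonevent) * woe)    # information value
  list(woe = woe, iv = iv)
}

set.seed(9)
x <- sample(c("A", "B", "C"), 500, replace = TRUE)          # toy categories
y <- rbinom(500, 1, ifelse(x == "A", 0.3, 0.1))             # toy outcome
res <- woe_table(x, y)
```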

So, to convert them into prediction probability scores that are bound between 0 and 1, we use plogis. The default cutoff prediction probability score is 0.5. But sometimes, tuning the probability cutoff can improve the accuracy in both the development and validation samples. Let's compute the optimal score that minimizes the misclassification error for the above model. The summary of logitMod gives the beta coefficients, standard error, z value and p value. If your model has categorical variables with multiple levels, you will find a row entry for each category of that variable.
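The cutoff search can be sketched with a simple grid: classify at each candidate cutoff, measure the misclassification error, and keep the minimizer. The toy scores below stand in for the model's predicted probabilities.

```r
# Sketch: search a grid of cutoffs for the lowest misclassification error.
set.seed(1)
actual         <- rbinom(200, 1, 0.3)                    # toy 0/1 outcomes
predicted_prob <- ifelse(actual == 1,
                         rbeta(200, 4, 2),               # scores for events
                         rbeta(200, 2, 4))               # scores for non-events

cutoffs  <- seq(0.01, 0.99, by = 0.01)
misclass <- sapply(cutoffs, function(k) {
  mean(ifelse(predicted_prob >= k, 1, 0) != actual)      # error at cutoff k
})
opt_cutoff <- cutoffs[which.min(misclass)]
```

Since 0.5 is on the grid, the optimal cutoff can never do worse than the default on the sample it was tuned on.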

That is because each individual category is treated as an independent binary variable by glm. As in linear regression, we should check for multicollinearity in the model. As seen below, all X variables in the model have VIF well below 4. The lower the misclassification error, the better your model. The Receiver Operating Characteristic (ROC) curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0.
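The VIF check is commonly done with `car::vif()`; the quantity itself can be sketched in base R by regressing each predictor on the others and computing 1/(1 − R²). The data frame `X` below is a toy example with one deliberately collinear column.

```r
# Base-R sketch of VIF: for each predictor, regress it on the rest and
# take 1 / (1 - R^2). Values above ~4 flag multicollinearity.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(42)
X <- data.frame(a = rnorm(100), b = rnorm(100))
X$c <- X$a + rnorm(100, sd = 0.1)   # c is nearly collinear with a

vifs <- vif_manual(X)               # expect large VIF for a and c, ~1 for b
```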

The above equation can be modeled using glm by setting the family argument to "binomial". But we are more interested in the probability of the event than in the log odds of the event. So, the predicted values from the above model, i.e. the log odds of the event, need to be converted to the probability of the event. This conversion is achieved using the plogis function, as shown below when we build logit models and predict.
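A minimal end-to-end sketch of this step, on simulated data rather than the article's dataset:

```r
# Fit a logit model on toy data, predict log odds, and map them to
# probabilities with plogis().
set.seed(7)
df <- data.frame(x = rnorm(300))
df$y <- rbinom(300, 1, plogis(-1 + 2 * df$x))   # simulate a binary outcome

logitMod <- glm(y ~ x, data = df, family = binomial(link = "logit"))

log_odds <- predict(logitMod, newdata = df)     # linear predictor: log odds
probs    <- plogis(log_odds)                    # now bounded in (0, 1)

# predict(..., type = "response") applies the same transformation internally.
```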

In this process, we will check for class bias, create training and test samples, build the logit model and validate it on the test sample. Ideally, the proportion of events and non-events in the Y variable should be approximately the same. Clearly, there is a class bias, a condition observed when the proportion of events is much smaller than the proportion of non-events.
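The class-bias check itself is a one-line table of the response. The toy label below, with roughly 8% events, stands in for ABOVE50K.

```r
# Sketch: inspect the class balance of the response variable.
set.seed(3)
y <- rbinom(1000, 1, 0.08)   # toy 0/1 label with ~8% events

table(y)                     # raw counts of non-events (0) and events (1)
prop.table(table(y))         # proportions; a large gap signals class bias
```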
