Credit Risk & Logistic Regression

I thought I’d do a little bit of analysis to showcase some credit risk analysis, using logistic regression. I’ve pulled this sample loan data from a DataCamp course on credit risk. It’s a cool class, you should check it out if you have time. In a later post, I will try this same analysis with a neural network to see if it has better predictive capabilities.

Here’s a look at the data. You can see we have all kinds of valuable information we can use in determining whether someone is likely to default - people with high rates and low credit scores are more likely to default, while people who own their homes and have had long term employment grades are less likely to default. We also have a column called loan_status, which is a boolean value indicating whether that particular borrower has defaulted.

head(data)

##   loan_status loan_amnt int_rate grade emp_length home_ownership
## 1           0      5000    10.65     B         10           RENT
## 2           0      2400       NA     C         25           RENT
## 3           0     10000    13.49     C         13           RENT
## 4           0      5000       NA     A          3           RENT
## 5           0      3000       NA     E          9           RENT
## 6           0     12000    12.69     B         11            OWN
##   annual_inc age
## 1      24000  33
## 2      12252  31
## 3      49200  24
## 4      36000  39
## 5      48000  24
## 6      75000  28

Logistic Regression

If you’re unfamiliar with logistic regression, that’s alright. What it does (in broad strokes) is allow you to predict a value between 1 and 0, and provide you with a degree of certainty. For example, if we ran a logistic regression on a bunch of variables, and then found relevant coefficients, we could use the features of a particular borrower to determine what level of risk they have. A lender could take appropriate measures with someone with a very low chance (0.02) of default by providing them with lower rates, or by simply not lending to someone with a very high chance of default (0.99).

Now we should tidy up some of the data and get rid of any rows with NAs. There are more efficient ways of dealing with this problem, but for our purposes we only lose about 3,000 observations, bringing us to about 25,000 observations. This is enough by most measures to build a rudimentary predictive model.

#Filter out rows with any NAs.
data <- data[complete.cases(data),]
head(data)

##   loan_status loan_amnt int_rate grade emp_length home_ownership
## 1           0      5000    10.65     B         10           RENT
## 3           0     10000    13.49     C         13           RENT
## 6           0     12000    12.69     B         11            OWN
## 7           1      9000    13.49     C          0           RENT
## 8           0      3000     9.91     B          3           RENT
## 9           1     10000    10.65     B          3           RENT
##   annual_inc age
## 1      24000  33
## 3      49200  24
## 6      75000  28
## 7      30000  22
## 8      15000  22
## 9     100000  28

Let’s split our dataset into two pieces, 60/40. This allows us to design a model with the 60% dataset and test it on the 40% dataset. If I was performing more exploratory analysis, I’d split the 40 in half, one for messing around in and the other for testing, but I’m pretty much skipping right to the modeling for now.

set.seed(9090)
bound <- floor((nrow(data)/4)*3)
data <- data[sample(nrow(data)),]
train <- data[1:bound,]
test <- data[(bound+1):nrow(data),]

Now to the preliminary model. We can use the R’s built-in functions to handle this. See below a summary of the output.

model <- glm(loan_status ~., family = binomial(link='logit'), data = train)
summary(model)

## 
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1467  -0.5358  -0.4416  -0.3374   3.3591  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.006e+00  2.151e-01 -13.976  < 2e-16 ***
## loan_amnt           -2.241e-06  4.141e-06  -0.541  0.58839    
## int_rate             9.058e-02  2.301e-02   3.936 8.29e-05 ***
## gradeB               3.338e-01  1.084e-01   3.080  0.00207 ** 
## gradeC               4.932e-01  1.569e-01   3.143  0.00167 ** 
## gradeD               5.809e-01  1.995e-01   2.911  0.00360 ** 
## gradeE               5.946e-01  2.505e-01   2.374  0.01760 *  
## gradeF               8.550e-01  3.343e-01   2.558  0.01053 *  
## gradeG               1.242e+00  4.367e-01   2.844  0.00446 ** 
## emp_length           5.405e-03  3.655e-03   1.479  0.13920    
## home_ownershipOTHER  7.172e-01  3.331e-01   2.153  0.03130 *  
## home_ownershipOWN   -1.000e-01  9.608e-02  -1.041  0.29795    
## home_ownershipRENT  -1.647e-02  5.329e-02  -0.309  0.75723    
## annual_inc          -5.325e-06  7.722e-07  -6.896 5.36e-12 ***
## age                 -5.048e-03  3.911e-03  -1.291  0.19685    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13297  on 19177  degrees of freedom
## Residual deviance: 12761  on 19163  degrees of freedom
## AIC: 12791
## 
## Number of Fisher Scoring iterations: 5

Now, we test our accuracy. How well does our model predict loan status? The below code spits out the accuracy when we test our model on the test dataframe, and we get a result of 89%. Not bad!

fit <- predict(model, newdata = test, type = "response")
fit <- ifelse(fit > 0.5, 1, 0)
error <- mean(fit != test$loan_status)
print(paste( "Accuracy is: ", 1 - error))

## [1] "Accuracy is:  0.893477240732051"

This lovely blogpost recommends plotting the true positive vs. false positives. The code for that is below. The plot shows a nearly straight line, which means we really aren’t especially predictive - the output at the bottom of 0.66 similarly shows the same. We’d like this value to be closer to one to indicate predictive ability.

library(ROCR)
p <- predict(model, newdata = test, type = "response")
pr <- prediction(p, test$loan_status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure="auc")
auc <- auc@y.values[[1]]
auc

## [1] 0.6606138

But we’ve skipped a couple of important steps in modeling. The model summary shows a litany of variables that really aren’t that predictive; we need to take them out. We’re going to leave anything with a . or any number of asterisks (*) in, because they are significant. A 10% significant will suffice for me.

model2 <- glm(loan_status ~ int_rate + grade + emp_length +
                (home_ownership=='OTHER') + annual_inc + age, 
              family = binomial(link='logit'), 
              data = train)
summary(model2)

## 
## Call:
## glm(formula = loan_status ~ int_rate + grade + emp_length + (home_ownership == 
##     "OTHER") + annual_inc + age, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1355  -0.5361  -0.4424  -0.3373   3.3724  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -3.030e+00  2.117e-01 -14.309  < 2e-16 ***
## int_rate                       8.976e-02  2.300e-02   3.903 9.51e-05 ***
## gradeB                         3.324e-01  1.082e-01   3.071  0.00213 ** 
## gradeC                         4.950e-01  1.569e-01   3.155  0.00161 ** 
## gradeD                         5.807e-01  1.994e-01   2.912  0.00359 ** 
## gradeE                         5.941e-01  2.501e-01   2.375  0.01755 *  
## gradeF                         8.522e-01  3.338e-01   2.553  0.01067 *  
## gradeG                         1.230e+00  4.358e-01   2.822  0.00478 ** 
## emp_length                     5.473e-03  3.593e-03   1.523  0.12767    
## home_ownership == "OTHER"TRUE  7.319e-01  3.316e-01   2.208  0.02728 *  
## annual_inc                    -5.386e-06  6.847e-07  -7.867 3.65e-15 ***
## age                           -5.070e-03  3.911e-03  -1.296  0.19490    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13297  on 19177  degrees of freedom
## Residual deviance: 12762  on 19166  degrees of freedom
## AIC: 12786
## 
## Number of Fisher Scoring iterations: 5

Now to test accuracy. It basically has yielded no meaningful change in predictive ability - but then again, that’s hard to do. All we’ve done is create a more parsimonious model in line with current thinking in statistics and econometrics.

fit <- predict(model2, newdata = test, type = "response")
fit <- ifelse(fit > 0.5, 1, 0)
error <- mean(fit != test$loan_status)
print(paste( "Accuracy is: ", 1 - error))

## [1] "Accuracy is:  0.893477240732051"

Finally, we plot the cure we showed before. Again, no real difference, but we can feel better that we have a smaller model with less “junk” floating around.

library(ROCR)
p <- predict(model2, newdata = test, type = "response")
pr <- prediction(p, test$loan_status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure="auc")
auc <- auc@y.values[[1]]
auc

## [1] 0.661003

Thanks for taking the time to read this post. Check out later posts where I use neural networks to look at this same dataset. It’ll be fun for the whole family!

Cameron's Blog

Finance, statistics, data, and other trash

About ∘ Blog ∘ Links ∘ Github ∘ CV

Credit Risk & Logistic Regression

Logistic Regression