To begin, I use pacman::p_load() to install (if needed) and load the packages required for this analysis, including the tidyverse package.
# load required packages
pacman::p_load(tidyverse, mfx, InformationValue, car, caret)
# read the data from the working directory and store it as hr_data
hr_data <- read_csv("datasets/HR_comma_sep.csv",
                    show_col_types = FALSE)
# Print the structure of the dataframe
glimpse(hr_data)
## Rows: 14,999
## Columns: 10
## $ satisfaction_level <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, …
## $ last_evaluation <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85, …
## $ number_project <dbl> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2, 2, 2, …
## $ average_montly_hours <dbl> 157, 262, 272, 223, 159, 153, 247, 259, 224, 142…
## $ time_spend_company <dbl> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3, 3, 3, …
## $ Work_accident <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ left <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ promotion_last_5years <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Department <chr> "sales", "sales", "sales", "sales", "sales", "sa…
## $ salary <chr> "low", "medium", "medium", "low", "low", "low", …
# convert the variables Work_accident, left, promotion_last_5years, Department, and salary into factors
hr_data <- hr_data %>%
mutate(Work_accident = as.factor(Work_accident),
left = as.factor(left),
promotion_last_5years = as.factor(promotion_last_5years),
Department = as.factor(Department),
salary = as.factor(salary))
# print new variables and their class
sapply(hr_data, class)
## satisfaction_level last_evaluation number_project
## "numeric" "numeric" "numeric"
## average_montly_hours time_spend_company Work_accident
## "numeric" "numeric" "factor"
## left promotion_last_5years Department
## "factor" "factor" "factor"
## salary
## "factor"
The glimpse(hr_data) output shows inconsistent naming of variables: Work_accident and Department start with an upper-case letter while the others do not. To streamline later stages, I will convert all variable names to lower case.
# print variable names
names(hr_data)
## [1] "satisfaction_level" "last_evaluation" "number_project"
## [4] "average_montly_hours" "time_spend_company" "Work_accident"
## [7] "left" "promotion_last_5years" "Department"
## [10] "salary"
# rename all variables to lower case
hr_data <- hr_data %>%
rename_all(tolower)
# check for rows with NAs
hr_data[!complete.cases(hr_data),]
## # A tibble: 0 × 10
## # … with 10 variables: satisfaction_level <dbl>, last_evaluation <dbl>,
## # number_project <dbl>, average_montly_hours <dbl>, time_spend_company <dbl>,
## # work_accident <fct>, left <fct>, promotion_last_5years <fct>,
## # department <fct>, salary <fct>
# No missing data
The left variable is a factor with two levels: 0 if the employee stayed at the company and 1 if they left. For an easier workflow, I will recode these levels to no and yes.
hr_data <- hr_data %>%
mutate(left = if_else(left == 0, "no", "yes")) %>%
mutate(left = as.factor(left))
# print the first 3 rows of the dataframe
hr_data %>%
head(3)
## # A tibble: 3 × 10
## satisfa…¹ last_…² numbe…³ avera…⁴ time_…⁵ work_…⁶ left promo…⁷ depar…⁸ salary
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <fct> <fct> <fct>
## 1 0.38 0.53 2 157 3 0 yes 0 sales low
## 2 0.8 0.86 5 262 6 0 yes 0 sales medium
## 3 0.11 0.88 7 272 4 0 yes 0 sales medium
## # … with abbreviated variable names ¹satisfaction_level, ²last_evaluation,
## # ³number_project, ⁴average_montly_hours, ⁵time_spend_company,
## # ⁶work_accident, ⁷promotion_last_5years, ⁸department
# unique observations in the left variable
unique(hr_data$left)
## [1] yes no
## Levels: no yes
Employees who left the company are now labeled yes and those who stayed no. For the numeric variables, I will calculate the mean of each variable for the two groups. These means can be useful metrics to explain employee retention.
# summarize the numeric variables by the left variable using group means (EDA)
hr_data %>%
group_by(left) %>%
summarise(mean_satisfaction_level = mean(satisfaction_level),
mean_last_evaluation = mean(last_evaluation),
mean_number_project = mean(number_project),
mean_average_monthly_hours = mean(average_montly_hours),
mean_time_spend_company = mean(time_spend_company))
## # A tibble: 2 × 6
## left mean_satisfaction_level mean_last_evaluation mean_numb…¹ mean_…² mean_…³
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 no 0.667 0.715 3.79 199. 3.38
## 2 yes 0.440 0.718 3.86 207. 3.88
## # … with abbreviated variable names ¹mean_number_project,
## # ²mean_average_monthly_hours, ³mean_time_spend_company
There is a large difference in mean_satisfaction_level between employees who left and those who stayed: those who left had a satisfaction level lower by 0.2267116 (0.6668096 - 0.4400980). This could explain why they left, but further statistical analysis is necessary to test whether the observed difference is statistically significant.
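As a quick check, a two-sample t-test can assess whether the difference in mean satisfaction is statistically significant. This is a minimal sketch using base R's t.test(); it was not part of the original pipeline:
# Welch two-sample t-test of satisfaction_level between leavers and stayers
t.test(satisfaction_level ~ left, data = hr_data)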
There is no clear impact of the variable
last_evaluation on employee retention since the means for
both groups are nearly equal.
The same applies to mean_number_project: the difference is very small, but a statistical test is needed to confirm whether it is significant.
The difference in mean_average_monthly_hours between employees who left and those who stayed is also substantial: on average, those who left worked 8.359 more hours per month than those who stayed.
For the categorical variables (i.e., work_accident, department, and salary), I will take a different approach to understand their impact on employee retention.
# count the number of employees in each salary category and group by the left variable
hr_data %>%
dplyr::select(left, salary) %>%
group_by(left) %>%
table()
## salary
## left high low medium
## no 1155 5144 5129
## yes 82 2172 1317
# bar chart showing impact of employees salaries on retention
ggplot(data = hr_data)+
geom_bar(aes(x = salary, fill = left), position = "dodge", alpha = 0.8)+
labs(title = "Employee Retention by Salary Category", y = "Count")+
theme_classic()
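A chi-squared test of independence can formally test whether salary category and retention are associated. This is a minimal sketch using base R's chisq.test(), not part of the original analysis:
# chi-squared test of independence between salary and left
chisq.test(xtabs(~ salary + left, data = hr_data))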
# count the number of employees in each department category and group by the left variable
hr_data %>%
dplyr::select(department, left) %>%
group_by(left) %>%
table()
## left
## department no yes
## accounting 563 204
## hr 524 215
## IT 954 273
## management 539 91
## marketing 655 203
## product_mng 704 198
## RandD 666 121
## sales 3126 1014
## support 1674 555
## technical 2023 697
# Achieving the same using base R xtabs function
xtabs(~department+ left, hr_data)
## left
## department no yes
## accounting 563 204
## hr 524 215
## IT 954 273
## management 539 91
## marketing 655 203
## product_mng 704 198
## RandD 666 121
## sales 3126 1014
## support 1674 555
## technical 2023 697
# bar chart showing the relationship between employee department and retention
hr_data %>%
ggplot(aes(department, fill = left))+
geom_bar(position = "dodge", alpha = 0.8)+
theme_classic()+
theme(axis.text.x = element_text(angle = 90))+
labs(title = "Employee Retention by Department")
Next, I examine the relationship between promotion_last_5years and employee retention.
# recode promotion_last_5years to "promoted" / "not promoted" so the visualization is easier to read
hr_data <- hr_data %>%
mutate(promotion_last_5years = if_else(promotion_last_5years == 1, "promoted", "not promoted"))
# print the first 3 rows
head(hr_data, 3)
## # A tibble: 3 × 10
## satisfa…¹ last_…² numbe…³ avera…⁴ time_…⁵ work_…⁶ left promo…⁷ depar…⁸ salary
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <chr> <fct> <fct>
## 1 0.38 0.53 2 157 3 0 yes not pr… sales low
## 2 0.8 0.86 5 262 6 0 yes not pr… sales medium
## 3 0.11 0.88 7 272 4 0 yes not pr… sales medium
## # … with abbreviated variable names ¹satisfaction_level, ²last_evaluation,
## # ³number_project, ⁴average_montly_hours, ⁵time_spend_company,
## # ⁶work_accident, ⁷promotion_last_5years, ⁸department
# count the number of employees in each promotion category and group by whether they left or not
xtabs(~ promotion_last_5years + left, data = hr_data)
## left
## promotion_last_5years no yes
## not promoted 11128 3552
## promoted 300 19
# visualizing the above table
hr_data %>%
ggplot(aes(promotion_last_5years, fill = left))+
geom_bar(position = "dodge", alpha = 0.8)+
theme_classic()+
  labs(title = "Employee Retention by Promotion in the Last 5 Years")
From the table and bar chart above, not receiving a promotion appears strongly associated with leaving: of the 3571 employees who left the company, 3552 (about 99 percent) had not received a promotion in the last five years. This suggests that the variable promotion_last_5years has a direct impact on employee retention.
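To put these counts on a comparable scale, row proportions give the leave rate within each promotion group; a small sketch using base R:
# leave rate within each promotion group (rows sum to 1)
# leavers are about 24 percent of the non-promoted group but only about 6 percent of the promoted group
prop.table(xtabs(~ promotion_last_5years + left, data = hr_data), margin = 1)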
Based on this exploratory analysis, it is safe to say that the
variables satisfaction_level,
average_monthly_hours, promotion_last_5years,
and salary have a direct impact on employee retention.
Using these variables, I will proceed to building a logistic regression
model to predict employee retention.
Logistic regression uses the method of maximum likelihood estimation to fit an equation of the form

log[p(X) / (1 - p(X))] = B0 + B1X1 + B2X2 + … + BnXn + u,

where p(X) = Pr(D = 1 | X) is the probability that the dependent variable takes the value 1. The right-hand side of the equation therefore predicts the logit (log odds) of the dependent variable taking the value 1. In my case, I am predicting the probability of leaving (1) using the variables satisfaction_level, average_monthly_hours, promotion_last_5years, and salary.
Forming an equation to represent this:
logit(left) = B0 + B1 × satisfaction_level + B2 × average_monthly_hours + B3 × promotion_last_5years + B4 × salary
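Because the right-hand side is a log-odds value, the logistic (inverse logit) function converts it back to a probability. A minimal sketch with a hypothetical logit value, using base R's plogis() and qlogis():
# convert a hypothetical log-odds value to a probability and back
logit <- -2          # hypothetical value of the linear predictor
p <- plogis(logit)   # 1 / (1 + exp(-logit)) = 0.1192029
qlogis(p)            # log(p / (1 - p)) recovers -2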
Next, I select the necessary variables and assign the data to model_data.
# select variables for use in modeling
model_data <- hr_data %>%
dplyr::select(left, satisfaction_level, average_montly_hours, promotion_last_5years, salary)
# print first six rows
head(model_data)
## # A tibble: 6 × 5
## left satisfaction_level average_montly_hours promotion_last_5years salary
## <fct> <dbl> <dbl> <chr> <fct>
## 1 yes 0.38 157 not promoted low
## 2 yes 0.8 262 not promoted medium
## 3 yes 0.11 272 not promoted medium
## 4 yes 0.72 223 not promoted low
## 5 yes 0.37 159 not promoted low
## 6 yes 0.41 153 not promoted low
# view the structure of new dataset
str(model_data)
## tibble [14,999 × 5] (S3: tbl_df/tbl/data.frame)
## $ left : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ satisfaction_level : num [1:14999] 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ average_montly_hours : num [1:14999] 157 262 272 223 159 153 247 259 224 142 ...
## $ promotion_last_5years: chr [1:14999] "not promoted" "not promoted" "not promoted" "not promoted" ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
# manipulating the variables promotion_last_5years and left into 1's and 0's
model_data <- model_data %>%
mutate(promotion_last_5years = if_else(promotion_last_5years == "promoted", 1,0)) %>%
mutate(left = if_else(left == "yes", 1,0))
# print first six rows
head(model_data)
## # A tibble: 6 × 5
## left satisfaction_level average_montly_hours promotion_last_5years salary
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 1 0.38 157 0 low
## 2 1 0.8 262 0 medium
## 3 1 0.11 272 0 medium
## 4 1 0.72 223 0 low
## 5 1 0.37 159 0 low
## 6 1 0.41 153 0 low
# check variable types
sapply(model_data, class)
## left satisfaction_level average_montly_hours
## "numeric" "numeric" "numeric"
## promotion_last_5years salary
## "numeric" "factor"
# convert the variables promotion_last_5years and left into factors
model_data <- model_data %>%
mutate(left = as.factor(left)) %>%
mutate(promotion_last_5years = as.factor(promotion_last_5years))
# print first six rows
head(model_data)
## # A tibble: 6 × 5
## left satisfaction_level average_montly_hours promotion_last_5years salary
## <fct> <dbl> <dbl> <fct> <fct>
## 1 1 0.38 157 0 low
## 2 1 0.8 262 0 medium
## 3 1 0.11 272 0 medium
## 4 1 0.72 223 0 low
## 5 1 0.37 159 0 low
## 6 1 0.41 153 0 low
# check variable types
sapply(model_data, class)
## left satisfaction_level average_montly_hours
## "factor" "numeric" "numeric"
## promotion_last_5years salary
## "factor" "factor"
# check correct data types
str(model_data)
## tibble [14,999 × 5] (S3: tbl_df/tbl/data.frame)
## $ left : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ satisfaction_level : num [1:14999] 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ average_montly_hours : num [1:14999] 157 262 272 223 159 153 247 259 224 142 ...
## $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
# loading fastDummies package for efficient dummy variable creation
library(fastDummies)
# Create dummies using the fastDummies package
model_data <- dummy_cols(model_data, select_columns = "salary")
head(model_data)
## # A tibble: 6 × 8
## left satisfaction_level average_mont…¹ promo…² salary salar…³ salar…⁴ salar…⁵
## <fct> <dbl> <dbl> <fct> <fct> <int> <int> <int>
## 1 1 0.38 157 0 low 0 1 0
## 2 1 0.8 262 0 medium 0 0 1
## 3 1 0.11 272 0 medium 0 0 1
## 4 1 0.72 223 0 low 0 1 0
## 5 1 0.37 159 0 low 0 1 0
## 6 1 0.41 153 0 low 0 1 0
## # … with abbreviated variable names ¹average_montly_hours,
## # ²promotion_last_5years, ³salary_high, ⁴salary_low, ⁵salary_medium
# modify the created dummies into factors
model_data <- model_data %>%
mutate(salary_high = as.factor(salary_high),
salary_low = as.factor(salary_low),
salary_medium = as.factor(salary_medium))
# print first six rows
head(model_data)
## # A tibble: 6 × 8
## left satisfaction_level average_mont…¹ promo…² salary salar…³ salar…⁴ salar…⁵
## <fct> <dbl> <dbl> <fct> <fct> <fct> <fct> <fct>
## 1 1 0.38 157 0 low 0 1 0
## 2 1 0.8 262 0 medium 0 0 1
## 3 1 0.11 272 0 medium 0 0 1
## 4 1 0.72 223 0 low 0 1 0
## 5 1 0.37 159 0 low 0 1 0
## 6 1 0.41 153 0 low 0 1 0
## # … with abbreviated variable names ¹average_montly_hours,
## # ²promotion_last_5years, ³salary_high, ⁴salary_low, ⁵salary_medium
# deselect the column salary
model_data <- model_data %>%
dplyr::select(-salary)
# Check for correct data types
str(model_data)
## tibble [14,999 × 7] (S3: tbl_df/tbl/data.frame)
## $ left : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ satisfaction_level : num [1:14999] 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ average_montly_hours : num [1:14999] 157 262 272 223 159 153 247 259 224 142 ...
## $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ salary_high : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ salary_low : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 2 2 2 ...
## $ salary_medium : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
# To make my results reproducible
set.seed(1)
# use 70 percent of the dataset as the training set and the remaining 30 percent as the testing set
sample <- sample(c(TRUE, FALSE), nrow(model_data), replace = TRUE, prob = c(0.7, 0.3))
# assign train set to train
train <- model_data[sample, ]
# assign test set to test
test <- model_data[!sample, ]
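Because each row is assigned independently with the given probabilities, the realized split is only approximately 70/30. A quick sanity check (a sketch; exact values depend on the seed):
# check the realized train/test proportions
nrow(train) / nrow(model_data)
nrow(test) / nrow(model_data)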
# fitting the logistic regression model. I omit one dummy variable (salary_medium) to avoid the dummy variable trap
logistic_model <- glm(left ~ satisfaction_level + average_montly_hours +
                        promotion_last_5years + salary_high + salary_low,
                      data = train, family = "binomial")
# Disable the scientific notation for model summary
options(scipen = 999)
# call model summary
summary(logistic_model)
##
## Call:
## glm(formula = left ~ satisfaction_level + average_montly_hours +
## promotion_last_5years + salary_high + salary_low, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6359 -0.7018 -0.4813 -0.1888 2.7585
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2300893 0.1214909 1.894 0.0582 .
## satisfaction_level -3.7577705 0.1058720 -35.494 < 0.0000000000000002 ***
## average_montly_hours 0.0026961 0.0004882 5.523 0.0000000334 ***
## promotion_last_5years1 -1.2414339 0.2807097 -4.422 0.0000097573 ***
## salary_high1 -1.3149267 0.1456135 -9.030 < 0.0000000000000002 ***
## salary_low1 0.4807762 0.0526416 9.133 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11508.1 on 10474 degrees of freedom
## Residual deviance: 9617.1 on 10469 degrees of freedom
## AIC: 9629.1
##
## Number of Fisher Scoring iterations: 5
# loading the mfx package to calculate marginal effects
library(mfx)
# Calculate coefficient marginal effects
marginals <- logitmfx(logistic_model, data = train)
# print marginals
marginals
## Call:
## logitmfx(formula = logistic_model, data = train)
##
## Marginal Effects:
## dF/dx Std. Err. z P>|z|
## satisfaction_level -0.585600940 0.016121827 -36.3235 < 0.00000000000000022
## average_montly_hours 0.000420151 0.000076546 5.4888 0.000000040456854
## promotion_last_5years1 -0.131127097 0.017862274 -7.3410 0.000000000000212
## salary_high1 -0.143944516 0.010079452 -14.2810 < 0.00000000000000022
## salary_low1 0.075207864 0.008243888 9.1229 < 0.00000000000000022
##
## satisfaction_level ***
## average_montly_hours ***
## promotion_last_5years1 ***
## salary_high1 ***
## salary_low1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## dF/dx is for discrete change for the following variables:
##
## [1] "promotion_last_5years1" "salary_high1" "salary_low1"
# combining results from the two outputs
results <- data.frame(variable = c("constant", "satisfaction_level", "average_montly_hours", "promotion_last_5years", "salary_high", "salary_low"),
coefficient_estimate = c( 0.230089265, -3.757770499, 0.002696088, -1.241433934, -1.314926658, 0.480776151),
marginal_effect = c(NA, -0.585600940, 0.000420151, -0.131127097, -0.143944516, 0.075207864),
P_value = c(0.0582, 0.0000000000000002, 0.0000000334, 0.0000097573, 0.0000000000000002, 0.0000000000000002))
# print results
as_tibble(results)
## # A tibble: 6 × 4
## variable coefficient_estimate marginal_effect P_value
## <chr> <dbl> <dbl> <dbl>
## 1 constant 0.230 NA 5.82e- 2
## 2 satisfaction_level -3.76 -0.586 2 e-16
## 3 average_montly_hours 0.00270 0.000420 3.34e- 8
## 4 promotion_last_5years -1.24 -0.131 9.76e- 6
## 5 salary_high -1.31 -0.144 2 e-16
## 6 salary_low 0.481 0.0752 2 e-16
A coefficient in a logit model gives the change in the log odds (logit) of the outcome per one-unit change in the corresponding independent variable, holding the other variables constant.
The marginal effect of an independent variable gives the change in the predicted probability that left = 1 caused by a one-unit increase in that variable, holding the other independent variables constant (logitmfx evaluates these at the means of the regressors by default).
We can therefore interpret the above logit model as follows:
Holding all other factors constant, every unit increase in an employee’s satisfaction level decreases the log odds (logit) of leaving the company by 3.757770499. Additionally, a unit increase in satisfaction level reduces the probability of an employee leaving the company by 0.585600940 (58.56%), holding other factors constant. satisfaction_level is statistically significant at the 5% level and is therefore a good predictor of employee retention.
Holding all other factors constant, every unit increase in an employee’s average monthly hours increases the log odds of leaving the company by 0.002696088. Likewise, each additional average monthly hour increases the probability of leaving the company by 0.000420151 (0.042%), holding other factors constant. average_monthly_hours is statistically significant at the 5% level and is therefore a good predictor of employee retention.
Holding all other factors constant, employees who received a promotion in the last 5 years have a log odds of leaving that is lower by 1.241433934. Additionally, their probability of leaving the company is lower by 0.131127097 (13.11%), ceteris paribus. promotion_last_5years is statistically significant at the 5% level and is therefore a good predictor of employee retention.
Holding all other factors constant, employees who receive a high salary have a log odds of leaving that is lower by 1.314926658, and their probability of leaving the company is lower by 0.143944516 (14.39%). salary_high is statistically significant at the 5% level and is therefore a good predictor of employee retention.
Holding all other factors constant, employees who receive a low salary have a log odds of leaving that is higher by 0.480776151, and their probability of leaving the company is higher by 0.075207864 (7.52%). salary_low is statistically significant at the 5% level and is therefore a good predictor of employee retention.
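An equivalent way to read these coefficients is as odds ratios: exponentiating each estimate gives the multiplicative change in the odds of leaving. This sketch is not part of the original output:
# exponentiate the coefficients to obtain odds ratios
exp(coef(logistic_model))
# e.g. exp(-1.3149267) is about 0.27: a high salary multiplies the
# odds of leaving by roughly 0.27, i.e. cuts them by about 73 percent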
R2 is the metric typically used to measure goodness of fit in linear regression. In logistic regression there is no such R2 value; as an alternative, we calculate McFadden’s R2, which ranges from 0 to just below 1. Values very close to 0 indicate that the model has little predictive power, while values over 0.40 often signify that a model fits the data well.
To compute McFadden’s R2, we use the pR2() function from the pscl package.
# Calculate the pR2
pscl::pR2(logistic_model)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.1643235
# Alternatively, using base R functions
with(summary(logistic_model), 1-deviance/null.deviance)
## [1] 0.1643235
The variables satisfaction_level, average_montly_hours, promotion_last_5years, and salary explain about 16.43 percent of the variation in the probability of an employee leaving the company, leaving the rest (83.57 percent) to be explained by variables not in the model. This pseudo R2 is quite low, indicating weak predictive power of the model. This calls for model improvement, but that is not the major interest of this project.
Using the varImp() function from the caret package, I can also determine the importance of each explanatory variable used in the model.
# calculate variable importance
caret::varImp(logistic_model)
## Overall
## satisfaction_level 35.493516
## average_montly_hours 5.522690
## promotion_last_5years1 4.422483
## salary_high1 9.030253
## salary_low1 9.133005
To check for multicollinearity among the explanatory variables, I compute variance inflation factors (VIFs).
# utilize the vif function from the car package
car::vif(logistic_model)
## satisfaction_level average_montly_hours promotion_last_5years
## 1.005144 1.002352 1.002419
## salary_high salary_low
## 1.049310 1.051234
A VIF score above 4 or 5 often denotes strong multicollinearity between an explanatory variable and the other explanatory variables, while a VIF of 1 indicates no correlation with the other explanatory variables (Kim, 2019).
Given that none of my explanatory variables have a VIF greater than 4, I can infer that multicollinearity is not a problem in my model.
# Defining two new employees
new <- data.frame(satisfaction_level = 0.64,
                  average_montly_hours = 250,
                  promotion_last_5years = c(1, 0),
                  salary_high = c(0, 1),
                  salary_low = c(1, 0),
                  salary_medium = 0)
# changing categorical variables to factors
new <- new %>%
mutate(promotion_last_5years = as.factor(promotion_last_5years),
salary_high = as.factor(salary_high),
salary_low = as.factor(salary_low),
salary_medium = as.factor(salary_medium))
# view new
head(new)
## satisfaction_level average_montly_hours promotion_last_5years salary_high
## 1 0.64 250 1 0
## 2 0.64 250 0 1
## salary_low salary_medium
## 1 1 0
## 2 0 0
# predict probability of leaving
predict(logistic_model, new, type = "response")
## 1 2
## 0.09436042 0.05647693
The predicted probability of leaving for an employee with a satisfaction level of 0.64 and 250 average monthly hours who received a promotion in the last five years and earns a low salary is 0.09436042 (about 9.44 percent). Conversely, for an employee with the same satisfaction level and average monthly hours who has not received a promotion in the last five years and earns a high salary, it is 0.05647693 (about 5.65 percent).
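As a sanity check, the first of these predictions can be reproduced by hand from the coefficient estimates: sum the terms of the linear predictor, then map the logit to a probability with plogis(). A minimal sketch:
# reproduce the first predicted probability from the fitted coefficients
b <- coef(logistic_model)
logit_1 <- b["(Intercept)"] +
  b["satisfaction_level"] * 0.64 +
  b["average_montly_hours"] * 250 +
  b["promotion_last_5years1"] * 1 +   # promoted
  b["salary_high1"] * 0 +
  b["salary_low1"] * 1                # low salary
plogis(logit_1)   # ~0.0944, matching predict() above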
I can therefore use the model to calculate the probability of leaving for each employee in the test data created earlier.
# calculate probabilities using the test data
predicted <- predict(logistic_model, test, type = "response")
# view the first 10 probabilities
head(predicted, 10)
## 1 2 3 4 5 6 7 8
## 0.1988493 0.3971585 0.7312595 0.4322768 0.3661528 0.1776000 0.1917521 0.7422713
## 9 10
## 0.3811379 0.1440041
Using the optimalCutoff() function from the InformationValue package, I can determine the optimal probability cutoff to use in order to maximize the model’s classification accuracy.
# load the InformationValue package
library(InformationValue)
# find the optimal cutoff probability to use
optimal <- optimalCutoff(test$left, predicted)[1]
# print optimal
optimal
## [1] 0.6200252
The results show that 0.6200252 is the optimal probability cutoff: an employee with a predicted probability of at least 0.6200252 is classified as likely to leave the organization, whereas one with a predicted probability below 0.6200252 is classified as likely to stay.
Using the function misClassError() from the
InformationValue package, I can also determine the overall
misclassification error, which is the proportion of all incorrect
classifications.
# calculate misclassification error
misClassError(test$left, predicted, threshold = optimal)
## [1] 0.1954
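This error can be unpacked into a confusion table by classifying each test observation with the cutoff; a sketch using base R:
# classify test observations with the optimal cutoff and cross-tabulate
predicted_class <- ifelse(predicted >= optimal, 1, 0)
table(actual = test$left, predicted = predicted_class)
# the off-diagonal share of this table is the 0.1954 misclassification error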
Last but not least, I can plot the ROC curve, which shows the proportion of true positives the model correctly predicts as the prediction probability cutoff is lowered from 1 to 0. The higher the area under the curve (AUC), the more accurately the model predicts outcomes (Kim & Hwang, 2020). Once more, this step uses the InformationValue package.
# plot the ROC curve
plotROC(test$left, predicted)
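The AUC can also be reported as a single number. Assuming the AUROC() function in InformationValue accepts the same arguments as plotROC() above (a sketch):
# numeric area under the ROC curve for the test-set predictions
InformationValue::AUROC(test$left, predicted)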
According to the findings above, job satisfaction is by far the most crucial aspect to consider when developing changes that promote staff retention. In fact, the analysis revealed that a unit increase in an employee’s level of satisfaction lowers that employee’s likelihood of quitting the organization by 58.56%. The management of this organization may increase employee satisfaction by paying competitive salaries, acknowledging achievements, listening to problems, providing more incentives, and being open and honest. Management also needs to be cognizant of the time that workers spend working for the organization: employees should have adequate time for personal growth, and overworking them only makes them more likely to leave. Finally, the organization should have a defined promotion process in place so that staff members never have to wonder when their next pay raise could be coming.
Appendix figures: regression output; pictorial view of the dataset.
Kim, J. H. (2019). Multicollinearity and misleading statistical results. Korean Journal of Anesthesiology, 72(6), 558–569.
Kim, J., & Hwang, I. C. (2020). Drawing guidelines for receiver operating characteristic curve in preparation of manuscripts. Journal of Korean Medical Science, 35(24).