This assignment examines customer churn data from several telecommunication companies. The goal is to determine what makes a customer more or less likely to churn so that these companies can better understand how to retain customers. Supervised learning techniques will be utilized in the model-building process, followed by ROC and bagging AUC analysis to ultimately determine the final model.
This data set is a small portion of a telecommunication company's customer data, created to investigate service quality and improve customer retention. There are 1,000 observations and 14 variables, which are described below. While the categorical variables are later recoded as factor variables for analysis purposes, we describe the data set as it originally stands.
url ="https://chloewinters79.github.io/STA551/Data/churn_output.csv"
churn = read.csv(url, header = TRUE)
Now that we are familiar with our starting data set, we need to go through EDA and data cleaning so the final working data set meets the requirements of the assignment.
First, we check for missing values. As the output shows, none of the variables have any missing observations, which means we do not need to impute values or remove observations for excessive missingness.
Sex Marital_Status Term Phone_service
0 0 0 0
International_plan Voice_mail_plan Multiple_line Internet_service
0 0 0 0
Technical_support Streaming_Videos Agreement_period Monthly_Charges
0 0 0 0
Total_Charges Churn
0 0
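The producing code was not shown in the output above; the per-variable zero counts are consistent with a base R call along these lines (a sketch on a hypothetical toy frame, not the actual churn data):

```r
# Hypothetical stand-in for the churn data; the real check would use churn itself
toy <- data.frame(Term = c(1, 2, 3), Churn = c("No", "Yes", "No"))
# Count missing values per column; all zeros mirrors the output above
colSums(is.na(toy))
```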
The next step is to make sure all the variables are properly categorized for our analysis, so the appropriate variables are converted to factor variables to simplify their handling when we move into model building.
# Convert Appropriate Variables to Factors
cat_vars <- c("Sex", "Marital_Status", "Phone_service", "International_plan",
"Voice_mail_plan", "Multiple_line", "Internet_service",
"Technical_support", "Streaming_Videos", "Agreement_period", "Churn")
churn[cat_vars] <- lapply(churn[cat_vars], factor)
Moving on to the summary statistics, this output is actually very important, because the original output showed that the International_plan variable had both "yes" and "Yes" observations that were being treated as separate levels, so we wrote code to combine them since they represent the same response.
Looking at the rest of the summary statistics, Term, Monthly_Charges, and Total_Charges all appear to be right-skewed, indicating that most customers have shorter service terms, lower monthly bills, and lower total charges, with a smaller number of customers showing much higher values. This non-normality could affect models like logistic regression and perceptron, which are sensitive to scale and distribution. Applying standardization or log transformation (particularly for Total_Charges) could help stabilize variance and improve model performance. The categorical variables look well-balanced overall, though the Churn variable shows a moderate imbalance (about 26% churned), which should be monitored to ensure fair model evaluation. Overall, most variables are suitable for modeling after basic preprocessing and scaling.
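As a concrete illustration of the suggested transformation (a sketch on synthetic right-skewed values, not the actual churn data), a log1p transform compresses the long right tail:

```r
# Hypothetical right-skewed charges, roughly shaped like Total_Charges
set.seed(42)
charges <- rexp(1000, rate = 1 / 2000)
charges_log <- log1p(charges)      # log(1 + x): safe even at zero charges
# Under right skew the mean is pulled above the median
mean(charges) > median(charges)    # TRUE
```

In practice the transform would be applied with something like `mutate(Total_Charges_log = log1p(Total_Charges))` before modeling.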
# Summary Statistics
# For categorical variables
for (v in cat_vars) {
print(v)
print(table(churn[[v]]))
}
[1] "Sex"
Female Male
486 514
[1] "Marital_Status"
Married Single
735 265
[1] "Phone_service"
No Yes
89 911
[1] "International_plan"
No Yes
429 571
[1] "Voice_mail_plan"
No Yes
442 558
[1] "Multiple_line"
No No phone Yes
467 89 444
[1] "Internet_service"
Cable DSL Fiber optic No Internet
171 280 341 208
[1] "Technical_support"
No No internet Yes
476 208 316
[1] "Streaming_Videos"
No No internet Yes
382 208 410
[1] "Agreement_period"
Monthly contract One year contract Two year contract
539 208 253
[1] "Churn"
No Yes
741 259
# For numerical variables
num_vars <- c("Term", "Monthly_Charges", "Total_Charges")
summary(churn[num_vars])
      Term      Monthly_Charges  Total_Charges
Min. : 0.0 Min. : 18.95 Min. : 5.0
1st Qu.: 8.0 1st Qu.: 40.16 1st Qu.: 334.8
Median :30.0 Median : 74.72 Median :1442.3
Mean :32.8 Mean : 66.64 Mean :2351.6
3rd Qu.:57.0 3rd Qu.: 90.88 3rd Qu.:4016.8
Max. :72.0 Max. :116.25 Max. :8476.5
# Fix International
churn$International_plan <- ifelse(tolower(churn$International_plan) == "yes", "Yes",
ifelse(tolower(churn$International_plan) == "no", "No",
churn$International_plan))
churn$International_plan <- factor(churn$International_plan)
Looking at our binary variable Churn, we see a skew towards No, meaning a majority of customers are not churning, which is ideal for the company. Considering the size of the data set, the lower number of Yes observations is not concerning: despite the nearly 3:1 ratio of No to Yes, we still have plenty of data to work with, and this imbalance should not meaningfully impact our analysis.
No Yes
0.741 0.259
ggplot(churn, aes(x = Churn, fill = Churn)) +
geom_bar() +
labs(title = "Distribution of Customer Churn") +
theme_minimal()
The relationships between the feature variables and churn are interesting. For most categorical variables, the churn proportions match fairly well across factor levels, with no obvious differences. The exception is typically the levels containing a "No internet" option, which show much lower churn than the other levels of the same variable. Additionally, in the Agreement_period variable we see less churn the longer the agreement runs. The numerical variables show more separation: customers who churn have shorter terms and higher monthly charges than those who do not, while total charges do not differ much between churners and non-churners.
# Explore Relationships Between Features and Churn
# Numeric vs Target
churn %>%
dplyr::select(all_of(num_vars), Churn) %>%
pivot_longer(cols = all_of(num_vars), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Churn, y = Value, fill = Churn)) +
geom_boxplot(outlier.color = "red") +
facet_wrap(~Variable, scales = "free") +
labs(title = "Numeric Feature Distributions by Churn Status") +
theme_minimal()
# Select categorical variables excluding Churn
cat_vars_plot <- cat_vars[-length(cat_vars)]
# Set up 2x3 grid
par(mfrow = c(2, 3), mar = c(5, 4, 2, 1))
for (v in cat_vars_plot) {
counts <- table(churn[[v]], churn$Churn)
prop <- prop.table(counts, 1) # row proportions
barplot(t(prop), col = c("blue", "red"), beside = TRUE,
main = v, ylab = "Proportion", legend.text = TRUE)
}
Looking at the correlation between the numerical variables, there is high correlation between Term and Total_Charges, and moderate correlation between Total_Charges and Monthly_Charges. These make sense in the context of the data: the longer a customer is with the company, the longer their term and the more they will have paid overall. So while this correlation is high, it is contextually reasonable and does not appear to be a major cause for concern at the moment, just something to keep in mind.
num_data <- churn %>%
select(all_of(num_vars)) %>%
drop_na()
# Compute correlation matrix
cor_mat <- cor(num_data)
# Melt correlation matrix for ggplot
cor_melt <- melt(cor_mat)
# Heatmap
ggplot(cor_melt, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Correlation") +
geom_text(aes(label = round(value, 2)), color = "black", size = 4) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(title = "Correlation Heatmap of Numeric Variables", x = "", y = "")
Now that the exploratory data analysis has been conducted and nothing drastically concerning requires specific transformations at this point, we can move on to our analysis, starting with our question.
In this assignment we investigate which variables help predict the likelihood of a customer churning. The assignment covers three types of models: logistic regression, perceptron, and decision tree. The project will utilize training and testing data sets along with bagging.
We start by building our logistic models. Three models will be built, and once all are in hand we will use ROC analysis to pick the best. The three models are a full, a reduced, and a stepwise model, each with a brief interpretation.
Starting with the full model, we also split the data into training and testing sets to be used later. The full model included all predictor variables available in the data set. Among these, Term, Internet_serviceNo Internet, Agreement_periodOne year contract, Agreement_periodTwo year contract, and Total_Charges were statistically significant. The coefficient signs indicate that shorter customer terms, shorter agreement periods, and higher total charges increase the likelihood of churn, while having no internet service is associated with lower churn. Other variables, including gender, marital status, and phone/streaming plans, were not significant, suggesting they have little independent effect on churn after accounting for the other factors. Some factor levels have NA coefficients because of multicollinearity; keeping them in the model does no harm, it just means those levels cannot be interpreted in context in a final model.
# Split data into training and testing
set.seed(123)
train_index <- sample(1:nrow(churn), 0.7 * nrow(churn))
train_data <- churn[train_index, ]
test_data <- churn[-train_index, ]
# Full logistic model including all variables
fullModel <- glm(Churn ~ Sex + Marital_Status + Term + Phone_service + International_plan +
Voice_mail_plan + Multiple_line + Internet_service + Technical_support +
Streaming_Videos + Agreement_period + Monthly_Charges + Total_Charges,
data = train_data, family = binomial)
summary(fullModel)
Call:
glm(formula = Churn ~ Sex + Marital_Status + Term + Phone_service +
International_plan + Voice_mail_plan + Multiple_line + Internet_service +
Technical_support + Streaming_Videos + Agreement_period +
Monthly_Charges + Total_Charges, family = binomial, data = train_data)
Coefficients: (3 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3437542 1.0826865 0.318 0.750863
SexMale -0.0606378 0.2143552 -0.283 0.777265
Marital_StatusSingle 0.0215681 0.4431041 0.049 0.961178
Term -0.0755545 0.0203023 -3.721 0.000198 ***
Phone_serviceYes 0.0049314 0.6449226 0.008 0.993899
International_planYes 0.1005081 0.2608868 0.385 0.700048
Voice_mail_planYes -0.1599207 0.2177671 -0.734 0.462726
Multiple_lineNo phone NA NA NA NA
Multiple_lineYes -0.1588082 0.2612356 -0.608 0.543246
Internet_serviceDSL -0.5316876 0.4725579 -1.125 0.260535
Internet_serviceFiber optic -0.1704813 0.2768196 -0.616 0.537989
Internet_serviceNo Internet -1.9314335 0.9331285 -2.070 0.038467 *
Technical_supportNo internet NA NA NA NA
Technical_supportYes -0.4083748 0.2781549 -1.468 0.142062
Streaming_VideosNo internet NA NA NA NA
Streaming_VideosYes 0.3112777 0.3466962 0.898 0.369271
Agreement_periodOne year contract -1.6102100 0.3742255 -4.303 1.69e-05 ***
Agreement_periodTwo year contract -2.1326236 0.5178134 -4.119 3.81e-05 ***
Monthly_Charges 0.0029681 0.0164428 0.181 0.856753
Total_Charges 0.0006610 0.0002203 3.000 0.002696 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 782.84 on 699 degrees of freedom
Residual deviance: 571.58 on 683 degrees of freedom
AIC: 605.58
Number of Fisher Scoring iterations: 6
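To make a logistic coefficient concrete, exponentiating it gives an odds ratio. A quick hand calculation using the Term estimate from the summary above (not a rerun of the model):

```r
# Term coefficient taken from the full-model summary above
b_term <- -0.0755545
odds_ratio <- exp(b_term)
round(odds_ratio, 3)   # 0.927: each extra unit of Term multiplies churn odds by ~0.93
```

So each additional month of tenure is associated with roughly a 7% reduction in the odds of churn, holding the other predictors fixed.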
Next we look at our reduced model, which kept Term, Internet_service, Agreement_period, and Total_Charges as predictors. In this model, significant predictors included Term, Internet_serviceNo Internet, Agreement_periodOne year contract, Agreement_periodTwo year contract, and Total_Charges, indicating that customers with shorter terms, shorter agreement periods, and higher total charges are more likely to churn, while customers with no internet service are less likely to churn. Internet_serviceDSL showed marginal significance, and the other levels of internet service were not significant.
# Reduced logistic model
reducedModel <- glm(Churn ~ Term + Internet_service + Agreement_period + Total_Charges,
data = train_data, family = binomial)
summary(reducedModel)
Call:
glm(formula = Churn ~ Term + Internet_service + Agreement_period +
Total_Charges, family = binomial, data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.4884859 0.2608144 1.873 0.061079 .
Term -0.0764982 0.0187508 -4.080 4.51e-05 ***
Internet_serviceDSL -0.5861887 0.3179624 -1.844 0.065245 .
Internet_serviceFiber optic -0.1125347 0.2651529 -0.424 0.671264
Internet_serviceNo Internet -2.0459393 0.4580323 -4.467 7.94e-06 ***
Agreement_periodOne year contract -1.7066788 0.3624799 -4.708 2.50e-06 ***
Agreement_periodTwo year contract -2.2908746 0.4988898 -4.592 4.39e-06 ***
Total_Charges 0.0006786 0.0001919 3.536 0.000406 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 782.84 on 699 degrees of freedom
Residual deviance: 576.85 on 692 degrees of freedom
AIC: 592.85
Number of Fisher Scoring iterations: 6
Finally, we have our stepwise model, which included Term, Technical_support, Agreement_period, Monthly_Charges, and Total_Charges. Significant predictors were Term, Technical_supportNo internet, Agreement_periodOne year contract, Agreement_periodTwo year contract, and Total_Charges; Technical_supportYes and Monthly_Charges were marginally significant. The model suggests that shorter terms, shorter agreement periods, and higher total charges increase the likelihood of churn, with some evidence that customers lacking technical support churn more, while the remaining variables have little effect.
# Stepwise logistic model (both directions)
stepModel <- stepAIC(fullModel, direction = "both", trace = FALSE)
summary(stepModel)
Call:
glm(formula = Churn ~ Term + Technical_support + Agreement_period +
Monthly_Charges + Total_Charges, family = binomial, data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5002374 0.5020732 -0.996 0.319083
Term -0.0758036 0.0199369 -3.802 0.000143 ***
Technical_supportNo internet -1.3090452 0.5147591 -2.543 0.010990 *
Technical_supportYes -0.4513076 0.2581836 -1.748 0.080462 .
Agreement_periodOne year contract -1.6119374 0.3696002 -4.361 1.29e-05 ***
Agreement_periodTwo year contract -2.1458239 0.5081868 -4.223 2.42e-05 ***
Monthly_Charges 0.0118295 0.0067427 1.754 0.079358 .
Total_Charges 0.0006424 0.0002156 2.980 0.002884 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 782.84 on 699 degrees of freedom
Residual deviance: 574.84 on 692 degrees of freedom
AIC: 590.84
Number of Fisher Scoring iterations: 6
Moving on to our perceptron models, we will look at two models, since the perceptron does not have a natural stepwise selection process the way logistic regression does. Thus, for this analysis a full and a reduced model are created. Before that, the numeric variables need to be scaled so they can be used properly in the perceptron models.
# Scale numeric variables
churn_nn <- churn %>%
mutate(
Term_scale = (Term - min(Term)) / (max(Term) - min(Term)),
Monthly_Charges_scale = (Monthly_Charges - min(Monthly_Charges)) /
(max(Monthly_Charges) - min(Monthly_Charges)),
Total_Charges_scale = (Total_Charges - min(Total_Charges)) /
(max(Total_Charges) - min(Total_Charges))
)
# Drop original numeric columns
churn_nn <- churn_nn[, !(names(churn_nn) %in% c("Term", "Monthly_Charges", "Total_Charges"))]
# Convert Churn to numeric
churn_nn$Churn_num <- ifelse(churn_nn$Churn == "Yes", 1, 0)
# Convert categorical variables to dummy variables
cat_vars <- setdiff(colnames(churn_nn), c("Term_scale", "Monthly_Charges_scale", "Total_Charges_scale", "Churn", "Churn_num"))
churn_nn_dummy <- churn_nn %>%
mutate(across(all_of(cat_vars), as.factor)) %>%
model.matrix(~ . - 1, data = .) %>%
as.data.frame()
# Clean column names
colnames(churn_nn_dummy) <- make.names(colnames(churn_nn_dummy))
# Add numeric target back
churn_nn_dummy$Churn_num <- churn_nn$Churn_num
After the necessary conversions and cleaning we can build and comment on our full model. The full model includes all predictors: sex, marital status, term, phone service, international plan, voice mail plan, multiple line status, internet service type, technical support, streaming videos, agreement period, monthly charges, and total charges. All numeric variables were scaled and categorical variables were converted to dummy variables. In this model, monthly charges and streaming videos (No internet) carry the larger positive hidden-layer weights among the genuine features, but the dummy-coded response (ChurnYes) carries by far the largest weight, and the connection from the hidden layer to the output is strongly positive. Overall, the model captures relationships across all features, although the influence of each feature varies.
# Full Model
NN_vars_full <- setdiff(colnames(churn_nn_dummy), "Churn_num")
nn_formula_full <- as.formula(paste("Churn_num ~", paste(NN_vars_full, collapse = "+")))
NN_full <- neuralnet(nn_formula_full, data = churn_nn_dummy, hidden = 1,
act.fct = "logistic", linear.output = FALSE)
NN_full$result.matrix
                                                       [,1]
error 0.012213442
reached.threshold 0.009195539
steps 104.000000000
Intercept.to.1layhid1 -1.765595987
SexFemale.to.1layhid1 -0.245926262
SexMale.to.1layhid1 -0.253686033
Marital_StatusSingle.to.1layhid1 0.005971395
Phone_serviceYes.to.1layhid1 -2.304524453
International_planYes.to.1layhid1 -0.312527421
Voice_mail_planYes.to.1layhid1 -0.155468189
Multiple_lineNo.phone..to.1layhid1 -2.503953944
Multiple_lineYes.to.1layhid1 -0.079376838
Internet_serviceDSL.to.1layhid1 -0.466524555
Internet_serviceFiber.optic.to.1layhid1 -0.379924253
Internet_serviceNo.Internet.to.1layhid1 -1.005602624
Technical_supportNo.internet..to.1layhid1 -0.757467851
Technical_supportYes.to.1layhid1 -0.430049673
Streaming_VideosNo.internet..to.1layhid1 0.550296136
Streaming_VideosYes.to.1layhid1 -0.048011527
Agreement_periodOne.year.contract.to.1layhid1 -0.517383521
Agreement_periodTwo.year.contract.to.1layhid1 -0.140213517
ChurnYes.to.1layhid1 11.214773271
Term_scale.to.1layhid1 -1.201428910
Monthly_Charges_scale.to.1layhid1 0.842362162
Total_Charges_scale.to.1layhid1 -0.614272770
Intercept.to.Churn_num -5.433009144
1layhid1.to.Churn_num 10.572450268
The reduced model includes the most important predictors identified from prior logistic modeling: term, total charges, and agreement period (one-year and two-year contracts). In this simplified network, the hidden layer weights indicate strong positive and negative influences for term and total charges, with agreement period showing moderate positive influence. The weight from the hidden layer to the output is strongly negative, emphasizing the impact of these core predictors on churn probability. This reduced model highlights the most critical factors for churn while simplifying the network structure, making it easier to interpret compared to the full model.
# Reduced Model
NN_vars_reduced <- c("Term_scale", "Total_Charges_scale", "Agreement_periodOne.year.contract", "Agreement_periodTwo.year.contract")
nn_formula_reduced <- as.formula(paste("Churn_num ~", paste(NN_vars_reduced, collapse = "+")))
NN_reduced <- neuralnet(nn_formula_reduced, data = churn_nn_dummy, hidden = 1,
act.fct = "logistic", linear.output = FALSE)
NN_reduced$result.matrix
                                                       [,1]
error 7.586771e+01
reached.threshold 9.862302e-03
steps 1.199000e+03
Intercept.to.1layhid1 -1.911221e-01
Term_scale.to.1layhid1 6.331412e+00
Total_Charges_scale.to.1layhid1 -7.560879e+00
Agreement_periodOne.year.contract.to.1layhid1 1.660183e+00
Agreement_periodTwo.year.contract.to.1layhid1 3.433707e+00
Intercept.to.Churn_num 2.620568e+00
1layhid1.to.Churn_num -5.669395e+00
For the decision tree model we are going to analyze three different options. The first is a non-penalized model with no extra cost for false negatives or false positives. We then look at a model that penalizes false negatives, followed by a model that penalizes false positives.
This first model is non-penalized, meaning no extra cost is applied when the model produces a false positive or false negative. Looking at the model breakdown, the first split is on the agreement period of the contract. The next split is on whether monthly charges are under 95, followed by whether the term is greater than or equal to 12. Interestingly, we then return to a monthly charge split, this time checking whether it is under 44. Finally, there are two total charge splits: first whether total charges are under 45, and then whether they are greater than or equal to 629.
tree.builder = function(in.data, fp, fn, purity){
tree = rpart(Churn ~ .,
data = in.data,
na.action = na.rpart,
method = "class",
model = FALSE,
x = FALSE,
y = TRUE,
parms = list(
loss = matrix(c(0, fp, fn, 0), ncol = 2, byrow = TRUE),
split = purity),
control = rpart.control(
minsplit = 10,
minbucket = 10,
cp = 0.01,
xval = 10)
)
}
gini.tree.1.1 <- tree.builder(train_data, 1, 1, "gini")
rpart.plot(gini.tree.1.1, main = "Churn Tree: Gini Non-Penalized")
Now that the non-penalized model has been created, it is time to look at a model penalized for false negatives. For this data set, a false negative means the model incorrectly identifies a customer as loyal when they actually churn. This model also starts with the split on the length of the agreement period, then produces a left-sided and a right-sided tree. The left side splits on whether total charges are under 3933, followed by a term greater than or equal to 58, then a split on multiple lines, then again on term, then two total charge splits and an internet service split. The right side splits first on technical support, followed by three total charge splits: greater than or equal to 218, less than 132, and finally greater than or equal to 47.
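The penalties enter rpart through the loss matrix built inside tree.builder above. A quick sketch of the convention, assuming the factor levels are ordered c("No", "Yes"): rows index the true class and columns the predicted class, so position [1, 2] prices a false positive and [2, 1] a false negative.

```r
# Loss matrix convention (assuming class order c("No", "Yes")):
# rows = true class, columns = predicted class
fp <- 1; fn <- 10
L <- matrix(c(0, fp,    # true "No":  0 if predicted No, fp if predicted Yes
              fn, 0),   # true "Yes": fn if predicted No, 0 if predicted Yes
            ncol = 2, byrow = TRUE)
L[1, 2]  # cost of a false positive
L[2, 1]  # cost of a false negative
```

With fn = 10, the tree treats one missed churner as ten times as costly as one wrongly flagged loyal customer, which pushes the splits toward catching churners.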
# Penalized: FN = 10 times FP
gini.tree.1.10 <- tree.builder(train_data, 1, 10, "gini")
# Plot
rpart.plot(gini.tree.1.10, main = "Churn Tree: Gini Penalized (FN = 10x)")
Finishing out our model creation, we look at a model penalized for false positives. For this data set, a false positive means the model incorrectly flags a customer as a churn risk when they would actually stay loyal. This model again splits first on agreement length, followed by two monthly charge splits: under 95, then greater than or equal to 101. The final split is on total charges greater than or equal to 2256.
# Penalized: FP = 10 times FN
gini.tree.10.1 <- tree.builder(train_data, 10, 1, "gini")
rpart.plot(gini.tree.10.1, main = "Churn Tree: Gini Penalized (FP = 10x)")
That is the third and final decision tree model we will be looking at for this data set. The next section assesses both the candidate models from last week's assignment and the three candidate models from above, allowing us to determine the best candidate model for each of the three model types.
ROC analysis and AUC will be used to determine which candidate models are the best among those created this week and last week. We start with the logistic models, then the perceptron models, and finish with the decision trees.
Looking at the results for our logistic models, the stepwise model narrowly beats the full model in AUC, which in turn narrowly beats the reduced model. When selecting final models, AUC matters, but so does keeping the model as small as reasonably possible. The stepwise model is smaller than the full model, comparable in size to the reduced model, and has the highest AUC, which is ideal. Taking these factors together, it makes sense to use the stepwise model as the final logistic regression model.
# --- Predictions for each logistic model ---
pred_full <- predict(fullModel, newdata = test_data, type = "response")
pred_reduced <- predict(reducedModel, newdata = test_data, type = "response")
pred_step <- predict(stepModel, newdata = test_data, type = "response")
# --- ROC and AUC for each ---
roc_full <- roc(test_data$Churn, pred_full, levels = c("No", "Yes"))
roc_reduced <- roc(test_data$Churn, pred_reduced, levels = c("No", "Yes"))
roc_step <- roc(test_data$Churn, pred_step, levels = c("No", "Yes"))
auc_full <- auc(roc_full)
auc_reduced <- auc(roc_reduced)
auc_step <- auc(roc_step)
# --- AUC summary table ---
AUC_logistic <- data.frame(
Model = c("Full Logistic", "Reduced Logistic", "Stepwise Logistic"),
AUC = c(auc_full, auc_reduced, auc_step)
)
AUC_logistic
              Model       AUC
1 Full Logistic 0.8042817
2 Reduced Logistic 0.7915127
3 Stepwise Logistic 0.8081395
plot(roc_full, col = "blue", lwd = 2, main = "ROC Curves for Logistic Models")
plot(roc_reduced, col = "red", lwd = 2, lty = 2, add = TRUE)
plot(roc_step, col = "darkgreen", lwd = 2, lty = 3, add = TRUE)
abline(0, 1, lty = 2, col = "gray")
legend("bottomright",
legend = c(
paste("Full (AUC =", round(auc_full, 3), ")"),
paste("Reduced (AUC =", round(auc_reduced, 3), ")"),
paste("Stepwise (AUC =", round(auc_step, 3), ")")
),
col = c("blue", "red", "darkgreen"), lty = 1:3, lwd = 2, bty = "n")
Moving on to the perceptron models, we have only two to compare: the full and the reduced model. The output shows something uncommon: a perfect 1.0 AUC and a right-angled ROC curve. While an AUC of 1 sounds ideal, it is nearly impossible in practice and almost always signals a problem. Some troubleshooting went on behind the scenes in an attempt to obtain a more realistic AUC; the issue appears to be data leakage, and indeed the full model's weight matrix lists the dummy-coded response (ChurnYes) among the inputs, letting the network predict the target from itself. Given this unnatural AUC value, and the fact that such a complex full model is rarely ideal anyway, it makes sense not to move forward with it. Thankfully, the reduced model still has a good AUC value and will serve as the final perceptron model going forward.
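One plausible remedy, assuming the perfect AUC stems from the dummy-coded response (ChurnYes) surviving into the predictor set: exclude every encoding of the response, not just Churn_num, when building the formula. A sketch on a hypothetical toy frame standing in for churn_nn_dummy:

```r
# Toy stand-in for churn_nn_dummy: the leaked ChurnYes column mirrors the target
toy_dummy <- data.frame(Term_scale = runif(10), ChurnYes = rbinom(10, 1, 0.3))
toy_dummy$Churn_num <- toy_dummy$ChurnYes
# Drop all response columns before assembling the predictor list
NN_vars_clean <- setdiff(colnames(toy_dummy), c("Churn_num", "ChurnYes"))
NN_vars_clean  # only genuine predictors remain
```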
# Split churn_nn_dummy into train/test
set.seed(123)
train_index <- sample(1:nrow(churn_nn_dummy), 0.7 * nrow(churn_nn_dummy))
train_nn <- churn_nn_dummy[train_index, ]
test_nn <- churn_nn_dummy[-train_index, ]
# Variables for full and reduced perceptron
NN_vars_full <- setdiff(colnames(train_nn), "Churn_num")
NN_vars_reduced <- c("Term_scale", "Total_Charges_scale",
"Agreement_periodOne.year.contract",
"Agreement_periodTwo.year.contract")
# Train full perceptron
NN_full <- neuralnet(as.formula(paste("Churn_num ~", paste(NN_vars_full, collapse="+"))),
data = train_nn, hidden = 1,
act.fct = "logistic", linear.output = FALSE)
# Train reduced perceptron
NN_reduced <- neuralnet(as.formula(paste("Churn_num ~", paste(NN_vars_reduced, collapse="+"))),
data = train_nn, hidden = 1,
act.fct = "logistic", linear.output = FALSE)
# Predictions on test set
NN_full_pred <- compute(NN_full, test_nn[, NN_vars_full])$net.result
NN_reduced_pred <- compute(NN_reduced, test_nn[, NN_vars_reduced])$net.result
# ROC and AUC
roc_NN_full <- roc(test_nn$Churn_num, NN_full_pred)
roc_NN_reduced <- roc(test_nn$Churn_num, NN_reduced_pred)
auc_NN_full <- auc(roc_NN_full)
auc_NN_reduced <- auc(roc_NN_reduced)
# --- AUC summary table ---
AUC_perceptron <- data.frame(
Model = c("Full Perceptron", "Reduced Perceptron"),
AUC = c(auc_NN_full, auc_NN_reduced)
)
AUC_perceptron
               Model       AUC
1 Full Perceptron 1.0000000
2 Reduced Perceptron 0.7410074
# --- Plot ROC curves ---
plot(roc_NN_full, col = "purple", lwd = 2, main = "ROC Curves for Perceptron Models (Test Data)")
plot(roc_NN_reduced, col = "orange", lwd = 2, lty = 2, add = TRUE)
abline(0, 1, lty = 2, col = "gray")
legend("bottomright",
legend = c(
paste("Full (AUC =", round(auc_NN_full, 3), ")"),
paste("Reduced (AUC =", round(auc_NN_reduced, 3), ")")
),
col = c("purple", "orange"), lty = 1:2, lwd = 2, bty = "n")
Finally, it is time to look at the decision tree models. This final model decision is slightly different: while we have three models, we will make six ROC/AUC comparisons, because each of the three models has a Gini and an information version. At their core the two versions are the same; they just use slightly different impurity criteria when growing the tree. Looking at the output, the model with the best AUC of the six is the information-based false negative model, shown on the graph as info.1.10. The false negative model will therefore be the final decision tree model for the remaining analysis.
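The two splitting criteria mentioned above can be written out directly (standard formulas, not code from this assignment): Gini impurity is 1 - sum(p^2) and information (entropy) is -sum(p * log2(p)). Evaluated at the overall churn proportions from earlier:

```r
gini    <- function(p) 1 - sum(p^2)          # Gini impurity
entropy <- function(p) -sum(p * log2(p))     # information (entropy) in bits
p <- c(No = 0.741, Yes = 0.259)              # overall churn proportions
gini(p)      # ~0.384: impurity of the root node before any split
entropy(p)   # ~0.825 bits
```

Both criteria are maximized at a 50/50 split and zero for a pure node, which is why trees grown under the two rules usually differ only in a few splits.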
SenSpe <- function(in.data, fp, fn, purity){
cutoff = seq(0, 1, length = 20)
model = tree.builder(in.data, fp, fn, purity)
pred = predict(model, newdata = in.data, type = "prob")
# Initialize matrix for Sensitivity and Specificity
senspe.mtx = matrix(0, ncol = length(cutoff), nrow = 2)
for (i in 1:length(cutoff)){
pred.out = ifelse(pred[,"Yes"] >= cutoff[i], "Yes", "No")
TP = sum(pred.out == "Yes" & in.data$Churn == "Yes")
TN = sum(pred.out == "No" & in.data$Churn == "No")
FP = sum(pred.out == "Yes" & in.data$Churn == "No")
FN = sum(pred.out == "No" & in.data$Churn == "Yes")
senspe.mtx[1,i] = TP / (TP + FN) # Sensitivity
senspe.mtx[2,i] = TN / (TN + FP) # Specificity
}
# ROC and AUC calculation
ROCobj <- roc(in.data$Churn == "Yes", pred[,"Yes"])
AUC = auc(ROCobj)
list(senspe.mtx = senspe.mtx, AUC = round(AUC, 3))
}
# Non-penalized
giniROC11 = SenSpe(in.data = train_data, fp = 1, fn = 1, purity = "gini")
infoROC11 = SenSpe(in.data = train_data, fp = 1, fn = 1, purity = "information")
# Penalize false negatives (customers leaving)
giniROC110 = SenSpe(in.data = train_data, fp = 1, fn = 10, purity = "gini")
infoROC110 = SenSpe(in.data = train_data, fp = 1, fn = 10, purity = "information")
# Penalize false positives (customers incorrectly flagged as leaving)
giniROC101 = SenSpe(in.data = train_data, fp = 10, fn = 1, purity = "gini")
infoROC101 = SenSpe(in.data = train_data, fp = 10, fn = 1, purity = "information")
par(pty = "s") # square plot
colors = c("#008B8B", "#00008B", "#8B008B", "#8B0000", "#8B8B00", "#8B4500")
plot(1 - giniROC11$senspe.mtx[2,], giniROC11$senspe.mtx[1,],
type = "l",
xlim = c(0, 1),
ylim = c(0, 1),
xlab = "1 - Specificity: FPR",
ylab = "Sensitivity: TPR",
col = colors[1],
lwd = 2,
main = "ROC Curves of Decision Trees",
cex.main = 0.9,
col.main = "navy")
abline(0, 1, lty = 2, col = "orchid4", lwd = 2)
lines(1 - infoROC11$senspe.mtx[2,], infoROC11$senspe.mtx[1,],
col = colors[2], lwd = 2, lty = 2)
lines(1 - giniROC110$senspe.mtx[2,], giniROC110$senspe.mtx[1,],
col = colors[3], lwd = 2)
lines(1 - infoROC110$senspe.mtx[2,], infoROC110$senspe.mtx[1,],
col = colors[4], lwd = 2, lty = 2)
lines(1 - giniROC101$senspe.mtx[2,], giniROC101$senspe.mtx[1,],
col = colors[5], lwd = 2, lty = 4)
lines(1 - infoROC101$senspe.mtx[2,], infoROC101$senspe.mtx[1,],
col = colors[6], lwd = 2, lty = 2)
legend("bottomright",
c(paste("gini.1.1, AUC =", giniROC11$AUC),
paste("info.1.1, AUC =", infoROC11$AUC),
paste("gini.1.10, AUC =", giniROC110$AUC),
paste("info.1.10, AUC =", infoROC110$AUC),
paste("gini.10.1, AUC =", giniROC101$AUC),
paste("info.10.1, AUC =", infoROC101$AUC)),
col = colors,
lty = rep(1:2, 3),
lwd = rep(2, 6),
cex = 0.8,
bty = "n")
# Create AUC summary table
AUC_results <- data.frame(
Model = c("Gini (1,1)", "Gini (1,10)", "Gini (10,1)",
"Info (1,1)", "Info (1,10)", "Info (10,1)"),
AUC = c(giniROC11$AUC, giniROC110$AUC, giniROC101$AUC,
infoROC11$AUC, infoROC110$AUC, infoROC101$AUC)
)
AUC_results
        Model   AUC
1  Gini (1,1) 0.843
2 Gini (1,10) 0.810
3 Gini (10,1) 0.786
4  Info (1,1) 0.843
5 Info (1,10) 0.894
6 Info (10,1) 0.854
The ROC analysis concludes with the selection of each model type's final, optimal model; among the trees, the information-gain model with false negatives penalized 10:1 achieves the highest AUC (0.894). To determine the overall best model from the three finalists, the bootstrap ("bagging") AUC method will be used in the next section.
Now that the final model from each section has been selected, it is time to compare these three models to each other and pick the optimal final model. To do this, the bootstrap (bagging) AUC method will be utilized: each of the three models is evaluated on repeated bootstrap resamples, producing a distribution of AUC values for each. The output also includes a 95% confidence interval and the mean AUC for each model, all of which will be analyzed to determine the optimal final model.
Starting with the sampling distribution histograms, it appears that the perceptron model has the lowest AUC distribution of the three models, so it is eliminated first. The decision tree and logistic models have similar distributions, so the 95% confidence intervals and mean AUCs must be examined to differentiate them. The logistic stepwise model has a mean AUC of 0.8439 and a 95% CI of [0.8122, 0.8736], while the decision tree model has a mean AUC of 0.8444 and a 95% CI of [0.8043, 0.8791]. At first glance the decision tree model may appear best given its slightly higher mean AUC, but the logistic stepwise model's AUC is nearly identical and its confidence interval is narrower, which is preferable. Either model could reasonably be selected as the final model; that decision is discussed in the final results and conclusion.
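As a quick sanity check of the claim that the logistic interval is narrower, the interval widths can be compared directly from the CI endpoints reported in the output below (this is a standalone arithmetic sketch, not part of the analysis pipeline):

```r
# Width of each model's 95% bootstrap percentile CI (endpoints from the output)
logistic_width <- 0.8736398 - 0.8122397  # ~0.061
tree_width     <- 0.8790614 - 0.8043273  # ~0.075
logistic_width < tree_width              # the logistic CI is narrower
```

The logistic interval is roughly 0.013 narrower, consistent with the lower-variance argument made for the logistic model.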
set.seed(123)
# Number of bootstrap resamples
B <- 1000
# Function to perform bootstrap AUC evaluation
bootstrap_auc <- function(model, data, model_type, model_name = "Model", response_col = "Churn") {
auc_vec <- numeric(B)
sample.size <- nrow(data)
for (k in 1:B) {
boot.id <- sample(1:sample.size, sample.size, replace = TRUE)
boot.sample <- data[boot.id, ]
if (model_type == "logistic") {
preds <- predict(model, newdata = boot.sample, type = "response")
} else if (model_type == "tree") {
preds <- predict(model, newdata = boot.sample, type = "prob")[, "Yes"]
} else if (model_type == "nn") {
# Preprocessing to match NN training
boot.sample$Term_scale <- scale(boot.sample$Term)
boot.sample$Total_Charges_scale <- scale(boot.sample$Total_Charges)
boot.sample$Agreement_periodOne.year.contract <- ifelse(boot.sample$Agreement_period == "One year contract", 1, 0)
boot.sample$Agreement_periodTwo.year.contract <- ifelse(boot.sample$Agreement_period == "Two year contract", 1, 0)
newdata <- boot.sample[, c("Term_scale",
"Total_Charges_scale",
"Agreement_periodOne.year.contract",
"Agreement_periodTwo.year.contract"), drop = FALSE]
preds <- compute(model, newdata)$net.result
preds <- as.vector(preds)
}
ROCobj <- roc(boot.sample[[response_col]], preds, levels = c("No", "Yes"), quiet = TRUE)
auc_vec[k] <- as.numeric(auc(ROCobj))
}
# Print mean AUC with model name
cat(paste0("Mean Bootstrap AUC for ", model_name, ": ", round(mean(auc_vec), 4), "\n"))
return(auc_vec)
}
# Run bootstrapping for each final model
btAUC_logistic <- bootstrap_auc(stepModel, train_data, model_type = "logistic", model_name = "Logistic Stepwise")
btAUC_nn <- bootstrap_auc(NN_reduced, train_data, model_type = "nn", model_name = "Reduced Perceptron")
btAUC_tree <- bootstrap_auc(gini.tree.1.1, train_data, model_type = "tree", model_name = "Decision Tree FN Info")
Mean Bootstrap AUC for Logistic Stepwise: 0.8439
Mean Bootstrap AUC for Reduced Perceptron: 0.698
Mean Bootstrap AUC for Decision Tree FN Info: 0.8444
# Combine results into a data frame for plotting
boot_df <- data.frame(
AUC = c(btAUC_logistic, btAUC_nn, btAUC_tree),
Model = factor(rep(c("Stepwise Logistic", "Reduced Perceptron", "Decision Tree FN"),
each = B))
)
# Plot AUC distributions
ggplot(boot_df, aes(x = AUC, fill = Model)) +
geom_histogram(alpha = 0.6, position = "identity", bins = 30) +
facet_wrap(~Model, scales = "free_y") +
theme_minimal(base_size = 13) +
labs(title = "Bootstrap Sampling Distributions of AUCs (Final Models)",
x = "AUC", y = "Frequency") +
theme(legend.position = "none")
# Compute 95% Confidence Intervals
logistic_ci <- quantile(btAUC_logistic, c(0.025, 0.975))
nn_ci <- quantile(btAUC_nn, c(0.025, 0.975))
tree_ci <- quantile(btAUC_tree, c(0.025, 0.975))
cat("95% Bootstrap CI for Stepwise Logistic:", logistic_ci, "\n")
cat("95% Bootstrap CI for Reduced Perceptron:", nn_ci, "\n")
cat("95% Bootstrap CI for Decision Tree FN:", tree_ci, "\n")
95% Bootstrap CI for Stepwise Logistic: 0.8122397 0.8736398
95% Bootstrap CI for Reduced Perceptron: 0.6630222 0.7336177
95% Bootstrap CI for Decision Tree FN: 0.8043273 0.8790614
After analyzing the three final models and examining their ROC and bootstrap AUC output, my determination is that the optimal final model is the logistic stepwise model. While it has a slightly smaller mean AUC than the decision tree model, the decision tree model has a wider confidence interval. Additionally, in a practical sense, logistic models are known for having less variance than decision tree models and are more stable to work with. As a reminder, here is a contextualized explanation of that model. For every 1-unit increase in Term, the log-odds of churn decrease by 0.0758, meaning the odds of churn are multiplied by approximately 0.93 (a 7% decrease). Customers with no internet have log-odds of churn that are 1.309 lower than the reference group, while those with technical support have log-odds that are 0.451 lower. Having a one-year contract reduces the log-odds of churn by 1.612, and a two-year contract reduces them by 2.146, compared to customers on month-to-month agreements. For each 1-unit increase in Monthly_Charges, the log-odds of churn increase by 0.0118, and for each 1-unit increase in Total_Charges, they increase by 0.00064. Comparing this final optimal model to the other models, there is considerable overlap in the selected variables, which strengthens our confidence in the significance of these variables and in this being a strong optimal model.
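To make the coefficient interpretation above concrete, each log-odds coefficient can be exponentiated to obtain an odds ratio. A minimal sketch using the coefficient values reported above (entered by hand, not extracted from the fitted model object):

```r
# Convert log-odds coefficients (values reported above) to odds ratios via exp()
coefs <- c(Term = -0.0758, One_year_contract = -1.612,
           Two_year_contract = -2.146, Monthly_Charges = 0.0118)
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
# e.g. exp(-0.0758) is approximately 0.927: each additional unit of Term
# multiplies the odds of churn by about 0.93, a roughly 7% decrease
```

In practice the same table would come from `exp(coef(stepModel))` on the fitted model, which avoids transcription error.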
It should be noted that no project is without fault, and added context could lead to different conclusions. As discussed, the logistic and decision tree models performed very similarly in the final analysis, and there were arguments for either to be the final model; the decision was ultimately mine. This model, however, is not being presented to anyone without a background in statistics. If I were presenting this to the telecommunication companies and knew they had very limited statistical knowledge, I most likely would have selected the decision tree model. While it has a slightly wider confidence interval, a tree is more straightforward and interpretable than a log-odds stepwise model for those with limited statistical training. When a model is created for non-statisticians to use and understand, a slightly weaker but straightforward model is sometimes better than a slightly stronger complex one, because a model its audience cannot understand or properly implement is of no use to them. Since there is no such scenario tied to this report, it was my decision to use the model I identified as stronger, despite its complexity for a non-statistician.