First, we install and load the packages “rpart” and “rpart.plot”, which we will use to build and visualize classification trees.
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
Next, load the MBA application dataset, which contains information about applicants’ GPA, GMAT scores, and admission decisions.
# Load the MBA application dataset
# This dataset contains information about applicants' GPA, GMAT scores, and admission decisions.
file_path <- "https://jiantongwang.com/MBA_Admission.csv"
MBA.data <- read.csv(file = file_path, header=T)
str(MBA.data)
## 'data.frame': 250 obs. of 3 variables:
## $ GPA : num 2.89 3.51 2.77 2.48 2.81 ...
## $ GMAT : num 426 566 301 333 403 ...
## $ Decision: chr "notadmit" "admit" "admit" "notadmit" ...
By default, the variable “Decision” is read in as a character variable, so we convert it into a factor.
MBA.data$Decision <- as.factor(MBA.data$Decision)
str(MBA.data) # take a look at the structure of the dataset
## 'data.frame': 250 obs. of 3 variables:
## $ GPA : num 2.89 3.51 2.77 2.48 2.81 ...
## $ GMAT : num 426 566 301 333 403 ...
## $ Decision: Factor w/ 2 levels "admit","notadmit": 2 1 1 2 2 1 2 1 1 2 ...
We split the dataset into training and testing data. To ensure reproducibility, we set a random seed using the set.seed() function.
set.seed(1234) # ensure reproducibility
training_index <- sample(1 : 250, 200) # randomly select 80% of the dataset as training data, leaving 20% as testing data.
MBA.training <- MBA.data[training_index, ] # 80% training data
MBA.testing <- MBA.data[-training_index, ] # 20% testing data
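Before plotting or modeling, a quick sanity check on the split can be helpful (a sketch; the calls below simply confirm the partition sizes and class balance):

```r
# Sanity-check the 80/20 split defined above.
nrow(MBA.training)            # 200 rows of training data
nrow(MBA.testing)             # 50 rows of testing data
table(MBA.training$Decision)  # class balance in the training set
table(MBA.testing$Decision)   # class balance in the testing set
```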
Create a scatter plot to visualize the joint distribution of GPA and GMAT scores, colored by admission status.
# Define colors for each category
colors <- c("admit" = "red", "notadmit" = "black") # Red for 'admit', Black for 'notadmit'
shapes <- c("admit" = 19, "notadmit" = 15) # Circle for 'admit', square for 'notadmit'
# Create the scatter plot
plot(MBA.training$GPA ~ MBA.training$GMAT,
col = colors[MBA.training$Decision],
main = "MBA Admission",
xlab = "GMAT",
ylab = "GPA",
ylim = c(2,4),
pch = shapes[MBA.training$Decision])
# Add legend
legend("bottomright",
legend = names(colors),
col = colors,
pch = shapes,
title = "Admission Status")
We can use the rpart() function to build a classification tree model, and then visualize the fitted tree with the prp() function.
MBA_rpart0 <- rpart(formula = Decision ~ .,
data = MBA.training,
method = "class") # we use all the variables (GPA, GMAT) as predictors.
prp(MBA_rpart0)
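rpart() also runs an internal cross-validation over the complexity parameter (cp), which can be used to prune the tree. A minimal sketch, assuming the fitted model MBA_rpart0 from above (the object name MBA_rpart_pruned is our own):

```r
# Inspect the cross-validated error (xerror) at each cp value,
# then prune the tree at the cp with the lowest xerror.
printcp(MBA_rpart0)
best_cp <- MBA_rpart0$cptable[which.min(MBA_rpart0$cptable[, "xerror"]), "CP"]
MBA_rpart_pruned <- prune(MBA_rpart0, cp = best_cp)
prp(MBA_rpart_pruned)  # visualize the pruned tree
```

Pruning guards against overfitting: a tree grown too deep memorizes the training data and generalizes poorly to new applicants.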
We can use the predict() function to make predictions with our fitted tree model. Setting the argument type = “prob” returns the predicted probability of each class; setting type = “class” returns the predicted class using a cut-off threshold of 0.5 (majority vote).
pred_test_prob <- predict(MBA_rpart0, MBA.testing, type = "prob")
head(pred_test_prob)
## admit notadmit
## 1 0.1875000 0.81250000
## 3 0.0000000 1.00000000
## 5 0.1875000 0.81250000
## 13 0.9607843 0.03921569
## 27 0.9607843 0.03921569
## 32 0.9607843 0.03921569
pred_test_class <- predict(MBA_rpart0, MBA.testing, type = "class")
head(pred_test_class)
## 1 3 5 13 27 32
## notadmit notadmit notadmit admit admit admit
## Levels: admit notadmit
# Generate a confusion matrix to compare predicted and actual admission decisions
table(pred_test_class, MBA.testing$Decision)
##
## pred_test_class admit notadmit
## admit 26 4
## notadmit 3 17
Accuracy: The proportion of correctly classified observations out of all observations.
Accuracy: \(\frac{26 + 17}{26 + 4 + 3 + 17} = \frac{43}{50}=0.86\)
Misclassification Rate: The proportion of incorrectly classified observations out of all observations.
Misclassification Rate: \(\frac{3 + 4}{26 + 4 + 3 + 17} = \frac{7}{50}=0.14\)
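These two quantities can also be computed directly from the confusion matrix in R (a short sketch; conf_mat is our own name for the table above):

```r
# Accuracy and misclassification rate from the confusion matrix.
conf_mat <- table(pred_test_class, MBA.testing$Decision)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)  # correct / total
misclass_rate <- 1 - accuracy
accuracy       # 0.86
misclass_rate  # 0.14
```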
Data Preparation: Always ensure variables are correctly formatted (e.g., categorical variables as factors) and split data into training and testing sets for robust model evaluation.
Visualization: Use scatter plots to explore data patterns and relationships, which can guide feature selection and model design.
Model Fitting and Prediction Performance: Use rpart() to fit classification trees and predict() to generate predictions on new data. The confusion matrix provides a clear summary of prediction performance on the testing set.