First, we install and load the packages “rpart” and “rpart.plot”, which we will use to build and visualize classification trees.
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
Next, load the MBA application dataset, which contains information about applicants’ GPA, GMAT scores, and admission decisions.
# Load the MBA application dataset
# This dataset contains information about applicants' GPA, GMAT scores, and admission decisions.
file_path <- "https://jiantongwang.com/MBA_Admission.csv"
MBA.data <- read.csv(file = file_path, header=T)
str(MBA.data)
## 'data.frame': 250 obs. of 3 variables:
## $ GPA : num 2.89 3.51 2.77 2.48 2.81 ...
## $ GMAT : num 426 566 301 333 403 ...
## $ Decision: chr "notadmit" "admit" "admit" "notadmit" ...
By default, the variable “Decision” is read in as a character variable, so we convert it into a factor.
MBA.data$Decision <- as.factor(MBA.data$Decision)
str(MBA.data) # take a look at the structure of the dataset
## 'data.frame': 250 obs. of 3 variables:
## $ GPA : num 2.89 3.51 2.77 2.48 2.81 ...
## $ GMAT : num 426 566 301 333 403 ...
## $ Decision: Factor w/ 2 levels "admit","notadmit": 2 1 1 2 2 1 2 1 1 2 ...
We split the dataset into training and testing data. To ensure reproducibility, we set a random seed using the set.seed() function.
set.seed(1234) # ensure reproducibility
training_index <- sample(1 : 250, 200) # randomly select 80% of the dataset as training data, leaving 20% as testing data.
MBA.training <- MBA.data[training_index, ] # 80% training data
MBA.testing <- MBA.data[-training_index, ] # 20% testing data
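Before plotting or modeling, a quick sanity check on the split can be helpful (a sketch; the calls below simply confirm the partition sizes and class balance):

```r
# Sanity-check the 80/20 split defined above.
nrow(MBA.training)            # 200 rows of training data
nrow(MBA.testing)             # 50 rows of testing data
table(MBA.training$Decision)  # class balance in the training set
table(MBA.testing$Decision)   # class balance in the testing set
```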
Create a scatter plot to visualize the joint distribution of GPA and GMAT scores, colored by admission status.
# Define colors for each category
colors <- c("admit" = "red", "notadmit" = "black") # Red for 'admit', Black for 'notadmit'
shapes <- c("admit" = 19, "notadmit" = 15) # Circle for 'admit', square for 'notadmit'
# Create the scatter plot
plot(MBA.training$GPA ~ MBA.training$GMAT,
col = colors[MBA.training$Decision],
main = "MBA Admission",
xlab = "GMAT",
ylab = "GPA",
ylim = c(2,4),
pch = shapes[MBA.training$Decision])
# Add legend
legend("bottomright",
legend = names(colors),
col = colors,
pch = shapes,
title = "Admission Status")
We can use the rpart() function to build a classification tree model, and then visualize the fitted tree with the prp() function.
MBA_rpart0 <- rpart(formula = Decision ~ .,
data = MBA.training,
method = "class") # we use all the variables (GPA, GMAT) as predictors.
prp(MBA_rpart0)
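rpart() also runs an internal cross-validation over the complexity parameter (cp), which can be used to prune the tree. A minimal sketch, assuming the fitted model MBA_rpart0 from above (the object name MBA_rpart_pruned is our own):

```r
# Inspect the cross-validated error (xerror) at each cp value,
# then prune the tree at the cp with the lowest xerror.
printcp(MBA_rpart0)
best_cp <- MBA_rpart0$cptable[which.min(MBA_rpart0$cptable[, "xerror"]), "CP"]
MBA_rpart_pruned <- prune(MBA_rpart0, cp = best_cp)
prp(MBA_rpart_pruned)  # visualize the pruned tree
```

Pruning guards against overfitting: a tree grown too deep memorizes the training data and generalizes poorly to new applicants.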
We can use the predict() function to make predictions with our fitted tree model. Setting the argument type = “prob” returns the predicted probability of each class; setting type = “class” returns the predicted class using a cut-off threshold of 0.5 (majority vote).
pred_test_prob <- predict(MBA_rpart0, MBA.testing, type = "prob")
head(pred_test_prob)
## admit notadmit
## 1 0.1875000 0.81250000
## 3 0.0000000 1.00000000
## 5 0.1875000 0.81250000
## 13 0.9607843 0.03921569
## 27 0.9607843 0.03921569
## 32 0.9607843 0.03921569
pred_test_class <- predict(MBA_rpart0, MBA.testing, type = "class")
head(pred_test_class)
## 1 3 5 13 27 32
## notadmit notadmit notadmit admit admit admit
## Levels: admit notadmit
# Generate a confusion matrix to compare predicted and actual admission decisions
table(pred_test_class, MBA.testing$Decision)
##
## pred_test_class admit notadmit
## admit 26 4
## notadmit 3 17
Accuracy: The proportion of correctly classified observations out of all observations.
Accuracy: \(\frac{26 + 17}{26 + 4 + 3 + 17} = \frac{43}{50}=0.86\)
Misclassification Rate: The proportion of incorrectly classified observations out of all observations.
Misclassification Rate: \(\frac{3 + 4}{26 + 4 + 3 + 17} = \frac{7}{50}=0.14\)
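These two quantities can also be computed directly from the confusion matrix in R (a short sketch; conf_mat is our own name for the table above):

```r
# Accuracy and misclassification rate from the confusion matrix.
conf_mat <- table(pred_test_class, MBA.testing$Decision)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)  # correct / total
misclass_rate <- 1 - accuracy
accuracy       # 0.86
misclass_rate  # 0.14
```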
Data Preparation: Always ensure variables are correctly formatted (e.g., categorical variables as factors) and split data into training and testing sets for robust model evaluation.
Visualization: Use scatter plots to explore data patterns and relationships, which can guide feature selection and model design.
Model Fitting and Prediction Performance: Use rpart() to fit classification trees and predict() to generate predictions on new data. The confusion matrix provides a clear summary of prediction performance on the testing set.