Why this project
This originated as a teaching-and-baselining exercise: before using advanced methods, we should understand how robust classic classifiers behave on a messy, real‑world dataset. The goal was to quantify strengths/weaknesses across bias–variance regimes, get trustworthy reference metrics (Accuracy, AUC), and build intuition about which weather variables actually matter for short‑term temperature changes.
- Audience: students and practitioners — from basic stats to ML engineers.
- Design principles: consistent splits, transparent pre‑processing, clear metrics, and reproducibility.
- Outcome: Random Forest emerged as the most reliable baseline (highest AUC), with wind direction and afternoon temperature/humidity key signals alongside sunshine/evaporation.
Data & preprocessing
- Target: WarmerTomorrow (1 if tomorrow’s max temp exceeds today’s; else 0).
- Sampling: 10 random locations from 49; 5,000 daily rows sampled; missing values removed; class balance ≈ 1055 : 953 (warm vs not), i.e., reasonably balanced.
- Predictors: Sunshine, Evaporation, MinTemp/MaxTemp/Temp3pm, Humidity3pm, wind-direction features, etc. Date/location retained to test importance but excluded from some models.
- Split: 70% train / 30% test, reproducible seed.
Potential leakage check: ensure today’s max temperature doesn’t trivially encode tomorrow by construction; keep lag definitions clean.
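One way to run this check (a sketch on a toy frame; dplyr and the Location/Day/MaxTemp column names are assumed to match the real data):

```r
library(dplyr)

# Toy frame standing in for WAUS (column names assumed from the dataset)
toy <- data.frame(
  Location       = c(1, 1, 1, 2, 2),
  Day            = c(1, 2, 3, 1, 2),
  MaxTemp        = c(20, 22, 21, 30, 29),
  WarmerTomorrow = c(1, 0, NA, 0, NA)
)

# Recompute the target from the lag definition, within each location
check <- toy %>%
  arrange(Location, Day) %>%
  group_by(Location) %>%
  mutate(NextMax    = lead(MaxTemp),
         Recomputed = as.integer(NextMax > MaxTemp)) %>%
  ungroup()

# Agreement should be 1 on rows where tomorrow's max is observed;
# anything much lower suggests the label was built differently.
agreement <- mean(check$Recomputed == check$WarmerTomorrow, na.rm = TRUE)
```

If the agreement rate is essentially 1, the label matches the stated lag definition and today's MaxTemp is a legitimate predictor rather than a leak.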
Models compared
- Decision Tree (tree): interpretable baseline; pruned via cross-validation.
- Naive Bayes (e1071): strong with conditionally independent evidence; surprisingly competitive.
- Bagging (adabag): reduces variance by aggregating bootstrapped trees.
- Boosting (adabag + rpart): focuses on hard cases; good AUC on this data.
- Random Forest (randomForest): best overall balance of accuracy/AUC and variable importance stability.
- Neural Net (neuralnet): tiny MLP on a compact subset of features.
Results (test set)
| Classifier | Accuracy | AUC | Notes |
|---|---|---|---|
| Decision Tree | 0.612 | 0.665 | Improves to 0.625 after pruning (CV). |
| Naive Bayes | 0.648 | 0.685 | Simple, fast, decent baseline. |
| Bagging | 0.625 | 0.703 | Variance reduction; solid AUC. |
| Boosting | 0.658 | 0.729 | Best single‑tree ensemble on AUC aside from RF. |
| Random Forest | 0.668 | 0.744 | Top performer overall. |
| Neural Net (small) | ≈0.660 | — | Comparable accuracy; RF still ahead on AUC. |
Metrics reproduced from the original analysis; minor variation is expected if you re‑sample or change pre‑processing.
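To gauge how much of that variation is plain sampling noise, a binomial confidence interval on the test accuracy is a quick check (a sketch; the test-set size of ~600 rows is an assumption, 30% of ~2,000 complete cases):

```r
# 95% CI for the Random Forest's 0.668 test accuracy (n assumed ~600)
n <- 600
correct <- round(0.668 * n)
ci <- binom.test(correct, n)$conf.int
round(ci, 3)  # interval spans several points either side of 0.668
```

Gaps between models smaller than this interval width should not be over-interpreted on a single split.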
ROC curves & AUC
We compute ROC curves with ROCR from model posterior probabilities/scores. On this dataset the bagging and boosting curves overlap substantially; Random Forest dominates the upper‑left area, translating to the best AUC.
Feature importance
Across tree‑based models, the consistently influential variables were:
- Sunshine, Temp3pm, Humidity3pm, Evaporation, MaxTemp, MinTemp
- Wind directions: WindDir9am, WindDir3pm, WindGustDir
Date/location fields contributed little and can be dropped in production to avoid spurious associations.
My classifier (parsimonious RF)
Based on importance, I trained a compact Random Forest using only: Sunshine, Evaporation, Temp3pm, Humidity3pm, MinTemp. This matched the broader RF baseline (≈66–67% accuracy) while remaining simpler to explain and faster to score.
- Why RF here? Stable across re‑samples, handles non‑linear interactions, and gives robust importance estimates.
- Tuning: consider an mtry grid search via caret::train, and calibrate the classification threshold at the ROC Youden point.
Full R code (reproducible)
Drop this into an R script. It mirrors the original analysis and adds optional tuning + calibration. Requires packages: tree, e1071, ROCR, randomForest, adabag, rpart, neuralnet, caret, pROC.
rm(list = ls())
library(tree)
library(e1071)
library(ROCR)
library(randomForest)
library(adabag)
library(rpart)
library(neuralnet)
library(caret)
library(pROC)
set.seed(31179762)
WAUS <- read.csv("WarmerTomorrow2022.csv")
L <- sample(1:49, 10, replace = FALSE)   # pick 10 of the 49 locations
WAUS <- WAUS[WAUS$Location %in% L, ]
WAUS <- WAUS[sample(nrow(WAUS), 5000, replace = FALSE), ]
WAUS <- WAUS[complete.cases(WAUS), ]
# target as factor
WAUS$WarmerTomorrow <- factor(WAUS$WarmerTomorrow)
# split 70/30
set.seed(31179762)
idx <- sample(1:nrow(WAUS), 0.7*nrow(WAUS))
WAUS.train <- WAUS[idx,]
WAUS.test <- WAUS[-idx,]
# Decision Tree
WAUS.dt.fit <- tree(WarmerTomorrow ~ ., data = WAUS.train)
WAUS.dt.pred <- predict(WAUS.dt.fit, WAUS.test, type = "class")
WAUS.dt.vec <- predict(WAUS.dt.fit, WAUS.test, type = "vector")
cm.dt <- caret::confusionMatrix(WAUS.dt.pred, WAUS.test$WarmerTomorrow, positive = "1")
roc.dt <- ROCR::performance(ROCR::prediction(WAUS.dt.vec[,2], WAUS.test$WarmerTomorrow), "tpr","fpr")
# Naive Bayes
WAUS.nb.fit <- naiveBayes(WarmerTomorrow ~ ., data = WAUS.train)
WAUS.nb.pred <- predict(WAUS.nb.fit, WAUS.test, type = "class")
WAUS.nb.vec <- predict(WAUS.nb.fit, WAUS.test, type = "raw")
cm.nb <- caret::confusionMatrix(WAUS.nb.pred, WAUS.test$WarmerTomorrow, positive = "1")
roc.nb <- ROCR::performance(ROCR::prediction(WAUS.nb.vec[,2], WAUS.test$WarmerTomorrow), "tpr","fpr")
# Bagging
WAUS.bag.fit <- adabag::bagging(WarmerTomorrow ~ ., data = WAUS.train)
WAUS.bag.pred <- predict(WAUS.bag.fit, WAUS.test)
cm.bag <- WAUS.bag.pred$confusion
roc.bag <- ROCR::performance(ROCR::prediction(WAUS.bag.pred$prob[,2], WAUS.test$WarmerTomorrow), "tpr","fpr")
# Boosting
WAUS.boost.fit <- adabag::boosting(WarmerTomorrow ~ ., data = WAUS.train)
WAUS.boost.pred <- predict(WAUS.boost.fit, newdata = WAUS.test)
cm.boost <- WAUS.boost.pred$confusion
roc.boost <- ROCR::performance(ROCR::prediction(WAUS.boost.pred$prob[,2], WAUS.test$WarmerTomorrow), "tpr","fpr")
# Random Forest
WAUS.rf.fit <- randomForest(WarmerTomorrow ~ ., data = WAUS.train)
WAUS.rf.pred <- predict(WAUS.rf.fit, WAUS.test)
WAUS.rf.prob <- predict(WAUS.rf.fit, WAUS.test, type = "prob")
cm.rf <- caret::confusionMatrix(factor(WAUS.rf.pred), WAUS.test$WarmerTomorrow, positive = "1")
roc.rf <- ROCR::performance(ROCR::prediction(WAUS.rf.prob[,2], WAUS.test$WarmerTomorrow), "tpr","fpr")
# AUC helper
auc_of <- function(pred){ as.numeric(ROCR::performance(pred, "auc")@y.values) }
auc.dt <- auc_of(ROCR::prediction(WAUS.dt.vec[,2], WAUS.test$WarmerTomorrow))
auc.nb <- auc_of(ROCR::prediction(WAUS.nb.vec[,2], WAUS.test$WarmerTomorrow))
auc.bag <- auc_of(ROCR::prediction(WAUS.bag.pred$prob[,2], WAUS.test$WarmerTomorrow))
auc.boost <- auc_of(ROCR::prediction(WAUS.boost.pred$prob[,2], WAUS.test$WarmerTomorrow))
auc.rf <- auc_of(ROCR::prediction(WAUS.rf.prob[,2], WAUS.test$WarmerTomorrow))
# Plot ROC
plot(roc.dt, col = "red"); abline(0,1)
plot(roc.nb, add=TRUE, col = "orange")
plot(roc.bag, add=TRUE, col = "green")
plot(roc.boost, add=TRUE, col = "blue")
plot(roc.rf, add=TRUE, col = "purple")
legend("bottomright", legend=c("Decision Tree","Naive Bayes","Bagging","Boosting","Random Forest"), fill=c("red","orange","green","blue","purple"))
# Best tree via cross‑validation pruning
WAUS.dt.cv <- cv.tree(WAUS.dt.fit, FUN = prune.misclass)
WAUS.dt.pruned <- prune.misclass(WAUS.dt.fit, best = 4)  # size chosen from the CV curve
WAUS.dt.pr.pred <- predict(WAUS.dt.pruned, WAUS.test, type = "class")
cm.dt.pr <- caret::confusionMatrix(factor(WAUS.dt.pr.pred), WAUS.test$WarmerTomorrow, positive = "1")
# My compact RF
WAUS.my.fit <- randomForest(WarmerTomorrow ~ Sunshine + Evaporation + MinTemp + Temp3pm + Humidity3pm, data=WAUS.train)
WAUS.my.pred <- predict(WAUS.my.fit, WAUS.test)
cm.my <- table(Predicted_Class = WAUS.my.pred, Actual_Class = WAUS.test$WarmerTomorrow)
# Optional: caret tuning for RF (mtry grid)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2, classProbs = TRUE, summaryFunction = twoClassSummary)
WAUS.train$WarmerTomorrow2 <- factor(ifelse(WAUS.train$WarmerTomorrow == "1", "yes", "no"), levels = c("no", "yes"))
set.seed(31179762)
rf.tuned <- train(WarmerTomorrow2 ~ Sunshine + Evaporation + MinTemp + Temp3pm + Humidity3pm,
data = WAUS.train,
method = "rf",
metric = "ROC",
trControl = ctrl,
tuneGrid = data.frame(mtry = c(2,3,4)))
print(rf.tuned)
# Threshold calibration using Youden J (pROC)
rf_probs <- predict(WAUS.my.fit, WAUS.test, type = "prob")[,2]
roc_obj <- pROC::roc(WAUS.test$WarmerTomorrow, rf_probs)
best_coords <- pROC::coords(roc_obj, "best", ret = c("threshold","sensitivity","specificity"), best.method = "youden")
best_coords
# Variable importance
importance(WAUS.rf.fit)
varImpPlot(WAUS.rf.fit)
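To act on the Youden threshold from the pROC::coords step above, classify positive when the predicted probability clears it rather than the default 0.5 (a self-contained sketch with toy scores; in the script, the inputs would be rf_probs and the threshold returned by coords):

```r
set.seed(1)
probs  <- runif(20)                                           # stand-in for rf_probs
actual <- factor(as.integer(probs > 0.5), levels = c(0, 1))   # toy labels
thr    <- 0.45                                                # stand-in for the Youden threshold

# Classify positive when the score clears the calibrated threshold
pred <- factor(as.integer(probs > thr), levels = c(0, 1))
table(Predicted = pred, Actual = actual)
```

Lowering the threshold below 0.5 trades specificity for sensitivity, which the Youden point balances on this ROC curve.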