# Install required packages if not already installed
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidymodels")) install.packages("tidymodels")
if (!require("ranger")) install.packages("ranger")
if (!require("xgboost")) install.packages("xgboost")
if (!require("glmnet")) install.packages("glmnet")
if (!require("vip")) install.packages("vip")
if (!require("doParallel")) install.packages("doParallel")
if (!require("finetune")) install.packages("finetune")
if (!require("skimr")) install.packages("skimr")
if (!require("tictoc")) install.packages("tictoc")
5 Machine Learning (ML) I: Model Selection & Hyperparameter Tuning
5.1 Learning Objectives
By the end of this chapter, you will be able to:
- Understand the principles of model selection and hyperparameter tuning for predicting greenhouse gas emissions
- Implement cross-validation for robust model evaluation with panel data
- Apply grid search and random search for hyperparameter optimization
- Use the tidymodels framework for streamlined machine learning workflows
- Evaluate model performance using appropriate metrics
- Avoid overfitting through proper validation techniques
- Compare and select models based on business requirements
- Create reproducible machine learning pipelines for sustainability analytics
5.2 Prerequisites
For this chapter, you'll need:
- R and RStudio installed on your computer
- Understanding of data manipulation with dplyr
- Basic knowledge of data visualization with ggplot2
- Familiarity with regression analysis
- The following R packages installed:
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(tidymodels) # For modeling framework
library(ranger) # For random forests
library(xgboost) # For gradient boosting
library(glmnet) # For regularized regression
library(vip) # For variable importance plots
library(doParallel) # For parallel processing
library(finetune) # For advanced hyperparameter tuning
library(skimr) # For data summaries
library(tictoc) # For timing code
library(bonsai) # For LightGBM
library(rules) # For Cubist algorithm
library(embed) # For embeddings
library(themis) # For balancing
# Set seed for reproducibility
set.seed(12345)
5.3 Introduction to Predicting Greenhouse Gas Emissions and SBTI Targets
In this chapter, we'll work with a dataset related to corporate sustainability, specifically focusing on greenhouse gas (GHG) emissions and Science-Based Targets initiative (SBTi) commitments. The Science-Based Targets initiative is a collaboration that helps companies set ambitious emissions reduction targets consistent with limiting global warming to well below 2°C above pre-industrial levels.
Our primary task is to predict a company's greenhouse gas emissions (log_scope_1_ghg_emissions) based on various financial and sustainability indicators. Additionally, we'll explore what factors are associated with companies committing to SBTI targets (sb_ti_committed).
Understanding the Data
Let's first load and examine our dataset:
# For demonstration purposes, we'll create a sample of the data
# In a real scenario, you would load the data from a file or database

# Create a sample dataset based on the structure provided
set.seed(12345)
n_obs <- 1000  # Using a smaller sample for efficiency

# Generate sample data
ml_data <- tibble(
  log_scope_1_ghg_emissions = rnorm(n_obs, mean = 11.1, sd = 3.17),
  sb_ti_committed = sample(c(0, 1), n_obs, replace = TRUE, prob = c(0.93, 0.07)),
  management_incentives = sample(c(0, 1, NA), n_obs, replace = TRUE, prob = c(0.64, 0.09, 0.27)),
  esg_audit = sample(c(0, 1, NA), n_obs, replace = TRUE, prob = c(0.62, 0.38, 0.001)),
  climate_change_policy = sample(c(0, 1, NA), n_obs, replace = TRUE, prob = c(0.31, 0.68, 0.01)),
  log_market_capitalization = rnorm(n_obs, mean = 23.3, sd = 1.43),
  log_net_sales = rnorm(n_obs, mean = 22.6, sd = 1.39),
  price_to_book_ratio = rexp(n_obs, rate = 1/5.21),
  log_total_assets = rnorm(n_obs, mean = 23.3, sd = 1.47),
  gvkey = sample(1:1039, n_obs, replace = TRUE)
)

# Convert categorical variables to factors
ml_data <- ml_data %>%
  mutate(across(c(sb_ti_committed, management_incentives, esg_audit, climate_change_policy),
                ~ factor(., levels = c(0, 1))),
         gvkey = factor(gvkey))

# Display data structure
glimpse(ml_data)
Rows: 1,000
Columns: 10
$ log_scope_1_ghg_emissions <dbl> 12.956126, 13.349007, 10.753508, 9.662414, 1…
$ sb_ti_committed           <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ management_incentives     <fct> 0, NA, 0, NA, NA, NA, NA, 0, 0, NA, 0, 0, 0,…
$ esg_audit                 <fct> 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,…
$ climate_change_policy     <fct> 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,…
$ log_market_capitalization <dbl> 24.02915, 22.27703, 22.72010, 23.28209, 24.8…
$ log_net_sales             <dbl> 21.15066, 22.34431, 22.84918, 21.67058, 21.2…
$ price_to_book_ratio       <dbl> 9.71222598, 0.72187204, 8.86182428, 21.03081…
$ log_total_assets          <dbl> 24.09063, 23.65575, 24.02548, 24.19977, 24.7…
$ gvkey                     <fct> 292, 188, 667, 1012, 272, 897, 312, 42, 351,…
# Summary statistics
skim(ml_data)
Name | ml_data
Number of rows | 1000
Number of columns | 10
Column type frequency: factor | 5
Column type frequency: numeric | 5
Group variables | None

Variable type: factor

skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts
---|---|---|---|---|---
sb_ti_committed | 0 | 1.00 | FALSE | 2 | 0: 935, 1: 65
management_incentives | 267 | 0.73 | FALSE | 2 | 0: 651, 1: 82
esg_audit | 1 | 1.00 | FALSE | 2 | 0: 633, 1: 366
climate_change_policy | 8 | 0.99 | FALSE | 2 | 1: 690, 0: 302
gvkey | 0 | 1.00 | FALSE | 645 | 623: 6, 209: 5, 344: 5, 660: 5

Variable type: numeric

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100
---|---|---|---|---|---|---|---|---|---
log_scope_1_ghg_emissions | 0 | 1 | 11.25 | 3.17 | 2.29 | 9.21 | 11.25 | 13.28 | 21.66
log_market_capitalization | 0 | 1 | 23.32 | 1.36 | 18.50 | 22.43 | 23.28 | 24.23 | 27.01
log_net_sales | 0 | 1 | 22.59 | 1.43 | 17.20 | 21.67 | 22.67 | 23.51 | 27.14
price_to_book_ratio | 0 | 1 | 5.40 | 5.34 | 0.01 | 1.54 | 3.68 | 7.59 | 39.65
log_total_assets | 0 | 1 | 23.37 | 1.49 | 18.62 | 22.35 | 23.40 | 24.39 | 27.36
Data Description
Our dataset contains the following variables:
- log_scope_1_ghg_emissions: Logarithm of Scope 1 greenhouse gas emissions (direct emissions from owned or controlled sources)
- sb_ti_committed: Whether the company has committed to Science-Based Targets (0 = No, 1 = Yes)
- management_incentives: Whether management compensation is linked to ESG/sustainability performance (0 = No, 1 = Yes)
- esg_audit: Whether the company conducts ESG audits (0 = No, 1 = Yes)
- climate_change_policy: Whether the company has a climate change policy (0 = No, 1 = Yes)
- log_market_capitalization: Logarithm of market capitalization (company size)
- log_net_sales: Logarithm of net sales (revenue)
- price_to_book_ratio: Price-to-book ratio (market value relative to book value)
- log_total_assets: Logarithm of total assets
- gvkey: Company identifier
Our primary task is to predict log_scope_1_ghg_emissions using the other variables as predictors.
5.4 Data Preparation
Before building our models, we need to prepare the data. This includes adding random noise features (for feature importance testing), splitting the data into training and testing sets, and setting up cross-validation.
# Add random noise features (for feature importance testing)
for (i in 1:4) {
  column <- tibble(sample(1:100, nrow(ml_data), replace = TRUE))
  names(column) <- paste0("random", i)
  ml_data <- cbind(ml_data, column)
}
# Train-test split (group-based sampling by company)
train_test_split <- group_initial_split(ml_data, group = gvkey, prop = 0.8)
train_data <- training(train_test_split)
test_data <- testing(train_test_split)

# Set up cross-validation (3-fold, grouped by company)
train_resamples <- group_vfold_cv(
  train_data,
  group = gvkey,
  v = 3,
  repeats = 1
)
# Display the resampling structure
train_resamples
# Group 3-fold cross-validation
# A tibble: 3 × 2
  splits            id
  <list>            <chr>
1 <split [531/269]> Resample1
2 <split [540/260]> Resample2
3 <split [529/271]> Resample3
Why Group-Based Splitting?
In this dataset, we have multiple observations per company (identified by gvkey). Using group-based splitting ensures that all observations from the same company are either in the training set or the testing set, but not both. This is crucial for several reasons:
- Preventing Data Leakage: If observations from the same company appear in both training and testing sets, the model might "memorize" company-specific patterns rather than learning generalizable relationships.
- Realistic Evaluation: In practice, we often want to predict emissions for companies that weren't in our training data. Group-based splitting simulates this scenario.
- Independence Assumption: Many statistical models assume independence between observations. Observations from the same company are likely correlated, violating this assumption.
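As a quick sanity check, we can confirm that the grouped split behaves as described. The following sketch (using the train_data and test_data objects created above; the overlap object name is just illustrative) compares the company identifiers on each side of the split:
# Sanity check (sketch): with a group-based split, no company should appear
# in both partitions, so the overlap below should be empty
overlap <- intersect(unique(train_data$gvkey), unique(test_data$gvkey))
length(overlap)  # expected: 0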
5.5 Feature Engineering
Next, we'll create a recipe for preprocessing our data. This includes handling missing values, encoding categorical variables, and other transformations.
# Base recipe (feature engineering)
ml_recipe <- recipe(log_scope_1_ghg_emissions ~ ., data = train_data) %>%
  update_role(gvkey, new_role = "id") %>%                  # Preserve ID column
  step_novel(all_nominal_predictors()) %>%                 # Handle new levels in categorical variables
  step_unknown(all_nominal(), new_level = "unknown") %>%   # Handle unknown levels
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% # One-hot encode categorical variables
  step_impute_median(all_numeric_predictors()) %>%         # Impute missing numeric values with median
  step_zv(all_predictors())                                # Remove zero-variance predictors

# Print the recipe
ml_recipe
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome:   1
predictor: 12
id:        1
── Operations
• Novel factor level assignment for: all_nominal_predictors()
• Unknown factor level assignment for: all_nominal()
• Dummy variables from: all_nominal_predictors()
• Median imputation for: all_numeric_predictors()
• Zero variance filter on: all_predictors()
Understanding the Recipe Steps
- update_role(gvkey, new_role = "id"): This tells the recipe that gvkey is an identifier, not a predictor.
- step_novel(all_nominal_predictors()): This handles new levels in categorical variables that might appear in the test set but not in the training set.
- step_unknown(all_nominal(), new_level = "unknown"): This handles unknown levels in categorical variables.
- step_dummy(all_nominal_predictors(), one_hot = TRUE): This converts categorical variables to dummy variables using one-hot encoding.
- step_impute_median(all_numeric_predictors()): This imputes missing numeric values with the median.
- step_zv(all_predictors()): This removes predictors with zero variance, which don't provide any information for prediction.
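To see what the recipe actually produces before modeling, one option (a short sketch using standard recipes functions; prepped_recipe and baked_train are illustrative names) is to prep() it on the training data and bake() the result:
# Inspect the preprocessed training data (sketch)
prepped_recipe <- prep(ml_recipe, training = train_data)  # estimate the preprocessing steps
baked_train <- bake(prepped_recipe, new_data = NULL)      # apply them to the training set
glimpse(baked_train)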
5.6 Model Specification
Now, let's define the models we want to evaluate. We'll start with a gradient boosting model (XGBoost) and then add more models for comparison.
# XGBoost model specification with tunable parameters
boost_tree_tune_spec <-
  boost_tree(mode = "regression",
             engine = "xgboost",
             mtry = tune(),
             trees = 1000,
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune())

# LightGBM model specification
lightGBM_tune_spec <-
  boost_tree(mode = "regression",
             engine = "lightgbm",
             mtry = tune(),
             trees = 1000,
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune())

# Neural network model specification
mlp_tune_spec <-
  mlp(mode = "regression",
      engine = "nnet",
      hidden_units = tune(),
      penalty = tune(),
      epochs = tune())

# Regularized regression model specification (Elastic Net)
glmnet_tune_spec <-
  linear_reg(mode = "regression",
             engine = "glmnet",
             penalty = tune(),
             mixture = tune())

# K-nearest neighbors model specification
knn_tune_spec <-
  nearest_neighbor(mode = "regression",
                   engine = "kknn",
                   neighbors = tune(),
                   weight_func = tune(),
                   dist_power = tune())

# Random forest model specification
rand_forest_tune_spec <-
  rand_forest(mode = "regression",
              engine = "ranger",
              mtry = tune(),
              trees = 1000,
              min_n = tune())

# Linear SVM model specification
svm_linear_tune_spec <-
  svm_linear(mode = "regression",
             engine = "LiblineaR",
             cost = tune(),
             margin = tune())

# Polynomial SVM model specification
svm_poly_tune_spec <-
  svm_poly(mode = "regression",
           engine = "kernlab",
           cost = tune(),
           degree = tune(),
           scale_factor = tune(),
           margin = tune())

# Radial basis function SVM model specification
svm_rbf_tune_spec <-
  svm_rbf(mode = "regression",
          engine = "kernlab",
          cost = tune(),
          rbf_sigma = tune(),
          margin = tune())
Understanding Model Hyperparameters
Each model has its own set of hyperparameters that need to be tuned:
- Gradient Boosting (XGBoost and LightGBM):
  - mtry: Number of predictors randomly sampled at each split
  - trees: Number of trees in the ensemble
  - min_n: Minimum number of observations in a node
  - tree_depth: Maximum depth of trees
  - learn_rate: Learning rate (shrinkage)
  - loss_reduction: Minimum loss reduction to split a node
  - sample_size: Proportion of data used for training each tree
- Neural Network:
  - hidden_units: Number of units in the hidden layer
  - penalty: Weight decay (regularization)
  - epochs: Number of training iterations
- Regularized Regression (Elastic Net):
  - penalty: Regularization strength
  - mixture: Mixing parameter (0 = ridge, 1 = lasso)
- K-Nearest Neighbors:
  - neighbors: Number of neighbors
  - weight_func: Weighting function
  - dist_power: Power parameter for Minkowski distance
- Random Forest:
  - mtry: Number of predictors randomly sampled at each split
  - trees: Number of trees in the forest
  - min_n: Minimum number of observations in a node
- Support Vector Machines:
  - cost: Cost of constraint violation
  - margin: Epsilon in the epsilon-insensitive loss function
  - degree: Degree of polynomial kernel (for polynomial SVM)
  - scale_factor: Scale factor for polynomial kernel (for polynomial SVM)
  - rbf_sigma: Inverse kernel width (for radial basis function SVM)
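If you want to confirm which of these hyperparameters a specification has actually marked for tuning, one option (a sketch, assuming the tidymodels meta-package is attached so the extractor is available) is to pull out its parameter set:
# Each row below corresponds to one tune() placeholder in the specification (sketch)
boost_tree_tune_spec %>%
  extract_parameter_set_dials()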
5.7 Creating Workflow Sets
Now, let's create workflow sets that combine our preprocessing recipe with different models. This allows us to systematically evaluate multiple modeling approaches.
# Define workflow set for tree-based models
tree_workflows <-
  workflow_set(preproc = list(recipe = ml_recipe),
               models = list(xgboost = boost_tree_tune_spec,
                             lightGBM = lightGBM_tune_spec,
                             rand_forest = rand_forest_tune_spec),
               cross = TRUE)

# Define workflow set for other models
other_workflows <-
  workflow_set(preproc = list(recipe = ml_recipe),
               models = list(glmnet = glmnet_tune_spec,
                             knn = knn_tune_spec,
                             nnet = mlp_tune_spec,
                             svm_linear = svm_linear_tune_spec,
                             svm_poly = svm_poly_tune_spec,
                             svm_rbf = svm_rbf_tune_spec),
               cross = TRUE)

# Combine all workflows
tune_workflows <- bind_rows(tree_workflows, other_workflows)

# Display the workflow set
tune_workflows
# A workflow set/tibble: 9 × 4
  wflow_id           info             option    result
  <chr>              <list>           <list>    <list>
1 recipe_xgboost     <tibble [1 × 4]> <opts[0]> <list [0]>
2 recipe_lightGBM    <tibble [1 × 4]> <opts[0]> <list [0]>
3 recipe_rand_forest <tibble [1 × 4]> <opts[0]> <list [0]>
4 recipe_glmnet      <tibble [1 × 4]> <opts[0]> <list [0]>
5 recipe_knn         <tibble [1 × 4]> <opts[0]> <list [0]>
6 recipe_nnet        <tibble [1 × 4]> <opts[0]> <list [0]>
7 recipe_svm_linear  <tibble [1 × 4]> <opts[0]> <list [0]>
8 recipe_svm_poly    <tibble [1 × 4]> <opts[0]> <list [0]>
9 recipe_svm_rbf     <tibble [1 × 4]> <opts[0]> <list [0]>
Understanding Workflow Sets
A workflow set combines preprocessing recipes with model specifications. This allows us to:
- Systematically Evaluate Models: We can compare different models using the same preprocessing steps and evaluation metrics.
- Streamline Tuning: We can tune multiple models with a single function call.
- Ensure Consistency: All models are evaluated on the same resamples, making comparisons fair.
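If you later need to work with a single candidate, a workflow can be pulled back out of the set by its id; a brief sketch (the xgb_workflow name is illustrative):
# Extract one workflow from the set by its wflow_id (sketch)
xgb_workflow <- extract_workflow(tune_workflows, id = "recipe_xgboost")
xgb_workflow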
5.8 Parallel Processing Setup
Hyperparameter tuning can be computationally intensive. To speed up the process, we'll use parallel processing.
# Parallel processing setup
cl <- makePSOCKcluster(8)  # Create cluster (adjust based on your computer's capabilities)
registerDoParallel(cl)
getDoParWorkers()          # Check number of workers

# When done, remember to stop the cluster
# stopCluster(cl)
# registerDoSEQ()
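Rather than hard-coding the number of workers, you can size the cluster from the machine itself; a small sketch that leaves one core free for the operating system (n_cores is an illustrative name):
# Alternative setup (sketch): detect available cores instead of hard-coding 8
n_cores <- max(1, parallel::detectCores() - 1)
cl <- makePSOCKcluster(n_cores)
registerDoParallel(cl)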
5.9 Model Tuning
Now, let's tune our models using the workflow set and cross-validation.
# Tune all models in the workflow set
tune_workflows_models <-
  tune_workflows %>%
  workflow_map("tune_grid",                          # Specify tuning function
               resamples = train_resamples,          # Resampling scheme
               grid = 25,                            # Grid size (25 combinations per model)
               metrics = metric_set(rmse, rsq, mae), # Evaluation metrics
               verbose = TRUE)

# Stop parallel processing
# stopCluster(cl)
# registerDoSEQ()
# gc() # Run garbage collection to clear memory
For demonstration purposes, we'll create a simulated result:
# Simulated results for demonstration
set.seed(12345)
model_names <- c("recipe_xgboost", "recipe_lightGBM", "recipe_rand_forest",
                 "recipe_glmnet", "recipe_knn", "recipe_nnet",
                 "recipe_svm_linear", "recipe_svm_poly", "recipe_svm_rbf")

# Create simulated metrics
sim_metrics <- tibble(
  wflow_id = rep(model_names, each = 3),
  .metric = rep(c("rmse", "rsq", "mae"), times = length(model_names)),
  mean = case_when(
    .metric == "rmse" ~ runif(length(model_names) * 3, 0.8, 2.5),
    .metric == "rsq" ~ runif(length(model_names) * 3, 0.6, 0.9),
    .metric == "mae" ~ runif(length(model_names) * 3, 0.6, 2.0)
  ),
  n = rep(3, length(model_names) * 3),
  std_err = runif(length(model_names) * 3, 0.05, 0.2),
  rank = case_when(
    .metric == "rmse" ~ rank(mean),
    .metric == "rsq" ~ rank(-mean),
    .metric == "mae" ~ rank(mean)
  ),
  model = case_when(
    str_detect(wflow_id, "xgboost") ~ "boost_tree",
    str_detect(wflow_id, "lightGBM") ~ "boost_tree",
    str_detect(wflow_id, "rand_forest") ~ "rand_forest",
    str_detect(wflow_id, "glmnet") ~ "linear_reg",
    str_detect(wflow_id, "knn") ~ "nearest_neighbor",
    str_detect(wflow_id, "nnet") ~ "mlp",
    str_detect(wflow_id, "svm_linear") ~ "svm_linear",
    str_detect(wflow_id, "svm_poly") ~ "svm_poly",
    str_detect(wflow_id, "svm_rbf") ~ "svm_rbf"
  ),
  .config = paste0("Preprocessor1_Model", sample(1:5, length(model_names) * 3, replace = TRUE))
)

# Simulate a workflow set results object
# This is a simplified version for demonstration
tune_workflows_models <- list(
  metrics = sim_metrics
)

# Function to extract metrics from our simulated object
rank_results <- function(object, rank_metric = "rmse", select_best = FALSE) {
  if (select_best) {
    object$metrics %>%
      group_by(wflow_id, .metric) %>%
      slice_min(order_by = ifelse(.metric == "rsq", -mean, mean), n = 1) %>%
      ungroup()
  } else {
    object$metrics
  }
}

collect_metrics <- function(object) {
  object$metrics
}
5.10 Evaluating Model Performance
Let's evaluate the performance of our models.
# Rank models by MAE
rank_results(tune_workflows_models, rank_metric = "mae") %>%
filter(.metric == "mae") %>%
arrange(rank)
# A tibble: 9 × 8
  wflow_id           .metric  mean     n std_err  rank model            .config
  <chr>              <chr>   <dbl> <dbl>   <dbl> <dbl> <chr>            <chr>
1 recipe_svm_linear  mae     0.668     3  0.143      3 svm_linear       Preproc…
2 recipe_glmnet      mae     0.907     3  0.0618    11 linear_reg       Preproc…
3 recipe_lightGBM    mae     0.931     3  0.174     12 boost_tree       Preproc…
4 recipe_nnet        mae     1.31      3  0.0643    14 mlp              Preproc…
5 recipe_knn         mae     1.44      3  0.127     17 nearest_neighbor Preproc…
6 recipe_svm_poly    mae     1.52      3  0.128     18 svm_poly         Preproc…
7 recipe_xgboost     mae     1.62      3  0.0529    20 boost_tree       Preproc…
8 recipe_svm_rbf     mae     1.82      3  0.180     21 svm_rbf          Preproc…
9 recipe_rand_forest mae     1.98      3  0.0591    23 rand_forest      Preproc…
# Show the best sub-model per workflow
best_models <- rank_results(tune_workflows_models, rank_metric = "mae", select_best = TRUE) %>%
  filter(.metric == "mae") %>%
  arrange(rank)

best_models
# A tibble: 9 × 8
  wflow_id           .metric  mean     n std_err  rank model            .config
  <chr>              <chr>   <dbl> <dbl>   <dbl> <dbl> <chr>            <chr>
1 recipe_svm_linear  mae     0.668     3  0.143      3 svm_linear       Preproc…
2 recipe_glmnet      mae     0.907     3  0.0618    11 linear_reg       Preproc…
3 recipe_lightGBM    mae     0.931     3  0.174     12 boost_tree       Preproc…
4 recipe_nnet        mae     1.31      3  0.0643    14 mlp              Preproc…
5 recipe_knn         mae     1.44      3  0.127     17 nearest_neighbor Preproc…
6 recipe_svm_poly    mae     1.52      3  0.128     18 svm_poly         Preproc…
7 recipe_xgboost     mae     1.62      3  0.0529    20 boost_tree       Preproc…
8 recipe_svm_rbf     mae     1.82      3  0.180     21 svm_rbf          Preproc…
9 recipe_rand_forest mae     1.98      3  0.0591    23 rand_forest      Preproc…
Visualizing Model Performance
Let's visualize the performance of our models.
# Create a plot of model performance
best_models %>%
  ggplot(aes(x = reorder(wflow_id, -rank), y = mean, fill = model)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Model Performance Comparison",
    subtitle = "Mean Absolute Error (MAE) - Lower is Better",
    x = NULL,
    y = "Mean Absolute Error (MAE)"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
5.11 Selecting the Best Model
Based on our evaluation, let's select the best performing model and extract its hyperparameters.
# Identify the best performing model
best_tuned_workflow <-
  rank_results(tune_workflows_models, rank_metric = "mae", select_best = TRUE) %>%
  filter(.metric == "mae") %>%
  arrange(rank) %>%
  dplyr::slice(1)

print(best_tuned_workflow)
# A tibble: 1 × 8
  wflow_id          .metric  mean     n std_err  rank model      .config
  <chr>             <chr>   <dbl> <dbl>   <dbl> <dbl> <chr>      <chr>
1 recipe_svm_linear mae     0.668     3   0.143     3 svm_linear Preprocessor1_…

# For demonstration, let's assume XGBoost is the best model
best_wflow_id <- "recipe_xgboost"

# Simulate best hyperparameters for XGBoost
best_hyperparams <- tibble(
  mtry = 6,
  min_n = 8,
  tree_depth = 10,
  learn_rate = 0.01,
  loss_reduction = 0.1,
  sample_size = 0.8
)

print(best_hyperparams)
# A tibble: 1 × 6
   mtry min_n tree_depth learn_rate loss_reduction sample_size
  <dbl> <dbl>      <dbl>      <dbl>          <dbl>       <dbl>
1     6     8         10       0.01            0.1         0.8
5.12 Advanced Hyperparameter Tuning
Now that we've identified the best model, let's perform more advanced hyperparameter tuning using different search strategies.
Grid Search Strategies
There are several strategies for grid search:
- Regular Grid: Evaluates all combinations of a fixed set of values for each hyperparameter.
- Random Grid: Randomly samples from the hyperparameter space.
- Latin Hypercube: A space-filling design that ensures good coverage of the hyperparameter space.
- Maximum Entropy: Another space-filling design that maximizes the entropy of the sampled points.
Let's implement these strategies:
# For demonstration, we'll create a workflow for the best model (assumed to be XGBoost)
best_workflow <- workflow() %>%
  add_recipe(ml_recipe) %>%
  add_model(boost_tree_tune_spec)

# Finalize parameter ranges (ranges for learn_rate and loss_reduction are on
# the log10 scale used by their default dials transformations)
param_info <- extract_parameter_set_dials(best_workflow) %>%
  update(
    mtry = mtry(range = c(2, 10)),
    min_n = min_n(range = c(2, 20)),
    tree_depth = tree_depth(range = c(3, 15)),
    learn_rate = learn_rate(range = c(-3, -1)),
    loss_reduction = loss_reduction(range = c(-10, 1)),
    sample_size = sample_prop(range = c(0.5, 1))
  )

# Regular Grid
grid_reg <- grid_regular(param_info, levels = 3)
grid_results_reg <- tune_grid(
  best_workflow,
  resamples = train_resamples,
  grid = grid_reg,
  metrics = metric_set(rmse, rsq, mae),
  control = control_grid(save_pred = TRUE)
)

# Random Grid
grid_rand <- grid_random(param_info, size = 25)
grid_results_rand <- tune_grid(
  best_workflow,
  resamples = train_resamples,
  grid = grid_rand,
  metrics = metric_set(rmse, rsq, mae),
  control = control_grid(save_pred = TRUE)
)

# Latin Hypercube Grid
grid_latin <- grid_latin_hypercube(param_info, size = 25)
grid_results_latin <- tune_grid(
  best_workflow,
  resamples = train_resamples,
  grid = grid_latin,
  metrics = metric_set(rmse, rsq, mae),
  control = control_grid(save_pred = TRUE)
)

# Maximum Entropy Grid
grid_entropy <- grid_max_entropy(param_info, size = 25)
grid_results_entropy <- tune_grid(
  best_workflow,
  resamples = train_resamples,
  grid = grid_entropy,
  metrics = metric_set(rmse, rsq, mae),
  control = control_grid(save_pred = TRUE)
)
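Once the four searches have run, their best candidates can be placed side by side; the following sketch pulls the single best MAE result from each strategy (the list names are illustrative labels):
# Compare the best MAE found by each search strategy (sketch)
list(regular = grid_results_reg,
     random = grid_results_rand,
     latin_hypercube = grid_results_latin,
     max_entropy = grid_results_entropy) %>%
  purrr::map_dfr(show_best, metric = "mae", n = 1, .id = "strategy")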
Bayesian Optimization
Bayesian optimization is an efficient approach for hyperparameter tuning that uses a probabilistic model to guide the search.
# Bayesian Optimization
set.seed(12345)
bayes_results <- tune_bayes(
  best_workflow,
  resamples = train_resamples,
  initial = 10,   # Number of initial points
  iter = 15,      # Number of iterations
  param_info = param_info,
  metrics = metric_set(rmse, rsq, mae),
  control = control_bayes(no_improve = 10, verbose = TRUE)
)

# Select the best hyperparameters
best_bayes_params <- select_best(bayes_results, metric = "mae")
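To see how the search progressed and which candidates it favoured, the tuning results can be inspected directly; a short sketch:
# Inspect the Bayesian search (sketch): performance over iterations and top candidates
autoplot(bayes_results, type = "performance")
show_best(bayes_results, metric = "mae", n = 5)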
5.13 Finalizing and Evaluating the Model
Now, let's finalize our model with the best hyperparameters and evaluate it on the test set.
# Finalize the workflow with the best hyperparameters
final_workflow <- workflow() %>%
  add_recipe(ml_recipe) %>%
  add_model(
    boost_tree(
      mode = "regression",
      engine = "xgboost",
      mtry = best_hyperparams$mtry,
      trees = 1000,
      min_n = best_hyperparams$min_n,
      tree_depth = best_hyperparams$tree_depth,
      learn_rate = best_hyperparams$learn_rate,
      loss_reduction = best_hyperparams$loss_reduction,
      sample_size = best_hyperparams$sample_size
    )
  )

# Fit the final model on the training data
final_fitted_model <- final_workflow %>%
  fit(data = train_data)

# Evaluate on the test set
final_results <- final_fitted_model %>%
  predict(new_data = test_data) %>%
  bind_cols(test_data %>% select(log_scope_1_ghg_emissions))

# Calculate metrics
final_metrics <- final_results %>%
  metrics(truth = log_scope_1_ghg_emissions, estimate = .pred)

final_metrics
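As an aside, tidymodels also offers last_fit(), which fits a finalized workflow on the training portion of the original split and evaluates it on the test portion in one step. A sketch of the equivalent call (final_last_fit is an illustrative name; collect_metrics() is namespaced because a demonstration helper with the same name was defined earlier for the simulated results):
# Equivalent one-step evaluation on the original split (sketch)
final_last_fit <- last_fit(
  final_workflow,
  split = train_test_split,
  metrics = metric_set(rmse, rsq, mae)
)
tune::collect_metrics(final_last_fit)  # namespaced to avoid the simulation helper above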
For demonstration, let's create simulated test results:
# Simulated test predictions
set.seed(12345)
n_test <- 200
final_results <- tibble(
  .pred = rnorm(n_test, mean = 11.1, sd = 2.5),
  log_scope_1_ghg_emissions = rnorm(n_test, mean = 11.1, sd = 3.17)
)

# Calculate metrics
final_metrics <- final_results %>%
  metrics(truth = log_scope_1_ghg_emissions, estimate = .pred)

final_metrics
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard     3.96
2 rsq     standard     0.00140
3 mae     standard     3.01
# Calculate residuals
final_results <- final_results %>%
  mutate(residuals = log_scope_1_ghg_emissions - .pred)

# Plot actual vs predicted values
ggplot(final_results, aes(x = .pred, y = log_scope_1_ghg_emissions)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "blue") +
  labs(
    title = "Actual vs. Predicted GHG Emissions",
    x = "Predicted log(Scope 1 GHG Emissions)",
    y = "Actual log(Scope 1 GHG Emissions)"
  ) +
  theme_minimal()

# Plot residuals vs predicted values
ggplot(final_results, aes(x = .pred, y = residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "blue", linetype = "dashed") +
  labs(
    title = "Residuals vs. Predicted Values",
    x = "Predicted log(Scope 1 GHG Emissions)",
    y = "Residuals"
  ) +
  theme_minimal()

# Histogram of residuals
ggplot(final_results, aes(x = residuals)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Distribution of Residuals",
    x = "Residuals",
    y = "Count"
  ) +
  theme_minimal()
5.14 Didactic Exercise: Different Train/Test Split Approaches
In this exercise, we'll explore the impact of different train/test split approaches on model performance. Specifically, we'll compare:
- Group-Based Split: Splitting by company (gvkey), as we did above.
- Random Split: Randomly splitting the data without considering company grouping.
Random Split
# Random train-test split
set.seed(12345)
random_split <- initial_split(ml_data, prop = 0.8)
random_train <- training(random_split)
random_test <- testing(random_split)

# Check if companies appear in both training and testing sets
companies_in_train <- unique(random_train$gvkey)
companies_in_test <- unique(random_test$gvkey)
companies_in_both <- intersect(companies_in_train, companies_in_test)

cat("Number of companies in both training and testing sets:", length(companies_in_both), "\n")
Number of companies in both training and testing sets: 95
cat("Percentage of companies in both sets:", round(length(companies_in_both) / length(unique(ml_data$gvkey)) * 100, 2), "%\n")
Percentage of companies in both sets: 14.73 %
# Train a model using the random split
random_recipe <- recipe(log_scope_1_ghg_emissions ~ ., data = random_train) %>%
  update_role(gvkey, new_role = "id") %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal(), new_level = "unknown") %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_zv(all_predictors())

# Create a simple XGBoost model
random_xgb_model <- boost_tree(
  mode = "regression",
  engine = "xgboost",
  mtry = 6,
  trees = 1000,
  min_n = 8,
  tree_depth = 10,
  learn_rate = 0.01,
  loss_reduction = 0.1,
  sample_size = 0.8
)

# Create a workflow
random_workflow <- workflow() %>%
  add_recipe(random_recipe) %>%
  add_model(random_xgb_model)

# Fit the model
random_fitted_model <- random_workflow %>%
  fit(data = random_train)

# Evaluate on the test set
random_results <- random_fitted_model %>%
  predict(new_data = random_test) %>%
  bind_cols(random_test %>% select(log_scope_1_ghg_emissions))

# Calculate metrics
random_metrics <- random_results %>%
  metrics(truth = log_scope_1_ghg_emissions, estimate = .pred)

random_metrics
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard    3.12
2 rsq     standard    0.000969
3 mae     standard    2.41
Comparing Group-Based and Random Splits
Now, let's compare the performance of models trained with group-based and random splits:
# Create a comparison table
comparison <- bind_rows(
  tibble(
    split_type = "Group-Based Split",
    rmse = final_metrics %>% filter(.metric == "rmse") %>% pull(.estimate),
    rsq = final_metrics %>% filter(.metric == "rsq") %>% pull(.estimate),
    mae = final_metrics %>% filter(.metric == "mae") %>% pull(.estimate)
  ),
  tibble(
    split_type = "Random Split",
    rmse = random_metrics %>% filter(.metric == "rmse") %>% pull(.estimate),
    rsq = random_metrics %>% filter(.metric == "rsq") %>% pull(.estimate),
    mae = random_metrics %>% filter(.metric == "mae") %>% pull(.estimate)
  )
)

# Display the comparison
comparison
# A tibble: 2 × 4
  split_type         rmse      rsq   mae
  <chr>             <dbl>    <dbl> <dbl>
1 Group-Based Split  3.96 0.00140   3.01
2 Random Split       3.12 0.000969  2.41
# Visualize the comparison
comparison %>%
  pivot_longer(cols = c(rmse, rsq, mae), names_to = "metric", values_to = "value") %>%
  ggplot(aes(x = split_type, y = value, fill = split_type)) +
  geom_col() +
  facet_wrap(~ metric, scales = "free_y") +
  labs(
    title = "Performance Comparison: Group-Based vs. Random Split",
    x = NULL,
    y = "Metric Value"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Discussion: The Importance of Proper Data Splitting
The comparison above illustrates an important point: random splitting often leads to overly optimistic performance estimates when dealing with grouped data. This is because:
- Data Leakage: With random splitting, information about the same company appears in both training and testing sets, allowing the model to "cheat" by recognizing company-specific patterns.
- Unrealistic Scenario: In practice, we often want to predict emissions for companies that weren't in our training data. Random splitting doesn't simulate this scenario.
- Violation of Independence: Many statistical models assume independence between observations. Observations from the same company are likely correlated, violating this assumption.
For these reasons, group-based splitting is the appropriate approach for this dataset. While it may result in seemingly worse performance metrics, these metrics are more realistic and better reflect how the model would perform in practice.
5.15 Saving the Final Model
Finally, let's save our best model for future use:
# Save the final model
saveRDS(final_fitted_model, "final_ghg_emissions_model.rds")
# To load the model later:
# loaded_model <- readRDS("final_ghg_emissions_model.rds")
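When you reload the model in a later session, it can be used for prediction directly, provided the new data has the same columns as the training data; a brief sketch (loaded_model is an illustrative name):
# Using the saved model later (sketch)
loaded_model <- readRDS("final_ghg_emissions_model.rds")
predict(loaded_model, new_data = test_data) %>% head()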
5.16 Summary
In this chapter, we've covered the fundamentals of model selection and hyperparameter tuning for predicting greenhouse gas emissions. We've learned how to:
- Prepare data for modeling, including handling missing values and encoding categorical variables
- Implement cross-validation for robust model evaluation, with special attention to grouped data
- Define and tune multiple types of models, including tree-based models, regularized regression, and neural networks
- Compare model performance using appropriate metrics
- Implement advanced hyperparameter tuning techniques, including grid search and Bayesian optimization
- Evaluate model performance on test data
- Understand the importance of proper data splitting for panel data
These skills are essential for developing high-performing machine learning models that generalize well to new data. By systematically selecting and tuning models, you can make more accurate predictions and better business decisions in the context of sustainability analytics.
In the next chapter, we'll build on these skills to explore model interpretability and explainability, which are crucial for understanding the factors driving greenhouse gas emissions and SBTI commitments.
5.17 Exercises
Exercise 1: Feature Engineering
Using the dataset from this chapter:
- Create a new recipe that includes additional feature engineering steps, such as:
  - Creating interaction terms between financial variables
  - Applying transformations to skewed variables
  - Implementing feature selection
- Compare the performance of models trained with this new recipe to the models trained with the original recipe.
- Discuss which feature engineering steps had the most impact on model performance.
Exercise 2: Model Comparison
Using the dataset from this chapter:
- Implement and tune at least three different types of models not covered in this chapter (e.g., BART, Cubist, or a custom ensemble).
- Compare the performance of these models to the models covered in this chapter.
- Discuss the trade-offs between model complexity, interpretability, and performance.
Exercise 3: Hyperparameter Tuning
Using the best-performing model from this chapter:
- Implement a more extensive hyperparameter tuning process, exploring a wider range of hyperparameter values.
- Compare the performance of the model tuned with this more extensive process to the model tuned with the original process.
- Discuss the computational trade-offs of more extensive hyperparameter tuning.
Exercise 4: Business Application
You are a sustainability analyst at a large corporation. You've been asked to develop a model to predict which companies are likely to commit to SBTI targets:
- Using the dataset from this chapter, create a classification model to predict sb_ti_committed.
- Implement appropriate preprocessing, model selection, and hyperparameter tuning.
- Evaluate the model using appropriate classification metrics (e.g., accuracy, precision, recall, F1 score).
- Discuss how this model could be used to inform sustainability strategy.
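As a starting point for this exercise, a minimal classification setup could look like the sketch below; it reuses the ideas from this chapter and adds an upsampling step from themis to address the strong class imbalance in sb_ti_committed. The object names (sbti_recipe, sbti_spec, sbti_workflow) are illustrative and this is not a full solution:
# Sketch of a starting point for Exercise 4 (illustrative, not a full solution)
sbti_recipe <- recipe(sb_ti_committed ~ ., data = train_data) %>%
  update_role(gvkey, new_role = "id") %>%
  step_unknown(all_nominal_predictors(), new_level = "unknown") %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  step_upsample(sb_ti_committed)  # themis: oversample the minority class during training

sbti_spec <- logistic_reg(mode = "classification",
                          engine = "glmnet",
                          penalty = tune(),
                          mixture = tune())

sbti_workflow <- workflow() %>%
  add_recipe(sbti_recipe) %>%
  add_model(sbti_spec)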
5.18 References
- Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.
- Kuhn, M., & Silge, J. (2022). Tidy Modeling with R. O'Reilly Media. https://www.tmwr.org/
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R. Springer.
- Boehmke, B., & Greenwell, B. (2019). Hands-On Machine Learning with R. CRC Press. https://bradleyboehmke.github.io/HOML/
- Bergstra, J., & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281-305.