knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
library(tidyverse)
library(caret)
library(recipes)
library(earth)
library(ipred)
library(e1071)
library(ranger)
library(rpart)
library(rpart.plot)
library(skimr)
library(corrplot)
library(stringr)
library(stopwords)
library(snakecase)
library(gridExtra)
library(dplyr)
library(kableExtra)

Introduction



Music has been a long-standing form of entertainment throughout human history, and some songs are embraced by the general population more than others. This project sought to understand the underlying factors that could explain these preferences from a scientific standpoint. Various machine learning algorithms were applied to different audio features in an attempt to predict how likely a song is to reach the top charts.

Being able to identify the elements that contribute to making a song popular could be beneficial to the music industry, and the implications of such findings could be crucial moving forward.

The dataset used for this project is sourced from Kaggle. It contains audio statistics of the top 2000 tracks on Spotify, covering songs released from 1956 to 2019 by notable artists such as Queen, The Beatles, and Guns N' Roses.

Data Investigation

There are 1958 unique titles, 731 artists, and 149 genres in this dataset. The dataset contains 14 columns, each describing the track and its qualities, including audio features such as Danceability, BPM, Liveness, and Valence (positivity).

For details on how the audio features are graded, refer to the Evaluation Guideline provided by The Echo Nest, now acquired by Spotify.

The audio features are defined as follows:

report <- data.frame(Variable = c("Title","Artist", "Genre", "Year", "BPM", "Energy", 
                                "Danceability", "Loudness", "Liveness", "Valence", "Length", 
                                "Acoustic", "Speechiness", "Popularity"),

           Definition = c("Name of the Track",
                          "Name of the Artist",
                          "Genre of the track",
                          "Release year of the track",
                          "The tempo of the song, beats per minute",
                          "The energy of a song- the higher the value, the more energetic the song",
                          "The higher the value, the easier it is to dance to the song",
                          "The higher the value, the louder the song",
                          "The higher the value, the more likely the song is a live recording",
                          "The higher the value, the more positive mood for the song",
                          "The duration of the song (in seconds)",
                          "The higher the value the more acoustic the song is",
                          "The higher the value the more spoken words the song contains",
                          "The higher the value the more popular the song")
                    ) 


kbl(report, escape = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) 
Variable      Definition
Title         Name of the Track
Artist        Name of the Artist
Genre         Genre of the track
Year          Release year of the track
BPM           The tempo of the song, beats per minute
Energy        The energy of a song; the higher the value, the more energetic the song
Danceability  The higher the value, the easier it is to dance to the song
Loudness      The higher the value, the louder the song
Liveness      The higher the value, the more likely the song is a live recording
Valence       The higher the value, the more positive the mood of the song
Length        The duration of the song (in seconds)
Acoustic      The higher the value, the more acoustic the song is
Speechiness   The higher the value, the more spoken words the song contains
Popularity    The higher the value, the more popular the song

A quick glance at the summary statistics below shows that some hit songs share identical titles. For the Artist variable, Queen had the most top hits, followed by The Beatles and Coldplay. As for the Genre variable, the most common genres among the top hits were album rock, adult standards, and dutch pop. None of the variables have missing values.

# Importing data
Spotify <- read_csv("Datasets/Spotify-2000.csv")

# Quick cleaning
Spotify <- Spotify %>%
  # Remove the index column
  select(-c(Index)) %>%
  # Convert character columns into factors
  mutate_if(is.character,as.factor) %>%
  # Rename with cleaner names
  rename(Genre = `Top Genre`,
         Length = `Length (Duration)`,
         BPM = `Beats Per Minute (BPM)`,
         Loudness = `Loudness (dB)`)

summary(Spotify)
##                Title                     Artist                  Genre     
##  Feeling Good     :   3   Queen             :  37   album rock      : 413  
##  Hallelujah       :   3   The Beatles       :  36   adult standards : 123  
##  One              :   3   Coldplay          :  27   dutch pop       :  88  
##  Always On My Mind:   2   U2                :  26   alternative rock:  86  
##  Amsterdam        :   2   The Rolling Stones:  24   dance pop       :  83  
##  Behind Blue Eyes :   2   Bruce Springsteen :  23   dutch indie     :  75  
##  (Other)          :1979   (Other)           :1821   (Other)         :1126  
##       Year           BPM            Energy        Danceability  
##  Min.   :1956   Min.   : 37.0   Min.   :  3.00   Min.   :10.00  
##  1st Qu.:1979   1st Qu.: 99.0   1st Qu.: 42.00   1st Qu.:43.00  
##  Median :1993   Median :119.0   Median : 61.00   Median :53.00  
##  Mean   :1993   Mean   :120.2   Mean   : 59.68   Mean   :53.24  
##  3rd Qu.:2007   3rd Qu.:136.0   3rd Qu.: 78.00   3rd Qu.:64.00  
##  Max.   :2019   Max.   :206.0   Max.   :100.00   Max.   :96.00  
##                                                                 
##     Loudness          Liveness        Valence          Length      
##  Min.   :-27.000   Min.   : 2.00   Min.   : 3.00   Min.   :  93.0  
##  1st Qu.:-11.000   1st Qu.: 9.00   1st Qu.:29.00   1st Qu.: 212.0  
##  Median : -8.000   Median :12.00   Median :47.00   Median : 245.0  
##  Mean   : -9.009   Mean   :19.01   Mean   :49.41   Mean   : 262.4  
##  3rd Qu.: -6.000   3rd Qu.:23.00   3rd Qu.:69.75   3rd Qu.: 289.0  
##  Max.   : -2.000   Max.   :99.00   Max.   :99.00   Max.   :1412.0  
##                                                                    
##   Acousticness    Speechiness       Popularity    
##  Min.   : 0.00   Min.   : 2.000   Min.   : 11.00  
##  1st Qu.: 3.00   1st Qu.: 3.000   1st Qu.: 49.25  
##  Median :18.00   Median : 4.000   Median : 62.00  
##  Mean   :28.86   Mean   : 4.995   Mean   : 59.53  
##  3rd Qu.:50.00   3rd Qu.: 5.000   3rd Qu.: 71.00  
##  Max.   :99.00   Max.   :55.000   Max.   :100.00  
## 
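The unique counts quoted at the start of this section can be reproduced from the cleaned data (a quick sketch using dplyr's n_distinct; output omitted):

# Count the distinct titles, artists, and genres
c(Titles = n_distinct(Spotify$Title),
  Artists = n_distinct(Spotify$Artist),
  Genres = n_distinct(Spotify$Genre))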

Data Preprocessing

The three categorical variables Genre, Title, and Artist required unique solutions to extract features:

Genre

Having all 149 genres as dummy variables would be problematic. As such, related genres were combined. The resulting groupings can be found below (for reproducibility purposes).

# Combine genre
combine_sheet <- read_csv("Datasets/CMSC (Genre).csv") # Pre-made combine sheet

# Mutate dataset to use the combined genres
Spotify <- Spotify %>%
  mutate(Combined_Genre = case_when(
    Genre %in% combine_sheet$`Rock/Metal` ~ "Rock_Metal",
    Genre %in% combine_sheet$Alternative ~ "Alternative",
    Genre %in% combine_sheet$Punk ~ "Punk",
    Genre %in% combine_sheet$`Folk/Indie` ~ "Folk_Indie",
    Genre %in% combine_sheet$Classic ~ "Classic",
    Genre %in% combine_sheet$Soul ~ "Soul",
    Genre %in% combine_sheet$Dance ~ "Dance",
    Genre %in% combine_sheet$Jazz ~ "Jazz",
    Genre %in% combine_sheet$`Singer-Songwriter` ~ "Singer_Songwriter",
    Genre %in% combine_sheet$`Country` ~ "Country",
    Genre %in% combine_sheet$`(Hip)Pop` ~ "Hip_Pop",
    TRUE ~ "Other"
  )) %>%
  mutate(Combined_Genre = as.factor(Combined_Genre))

detail <- data.frame(Combined = colnames(combine_sheet),
           Genres = apply(combine_sheet,2, function(x) paste(x[!is.na(x)], collapse = ", ")))

detail$Combined = cell_spec(detail$Combined, bold = TRUE)
# detail$Genres = cell_spec(detail$Genres, bold = TRUE)

kbl(detail, escape = FALSE, row.names = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) 
(Hip)Pop: dutch pop, pop, europop, art pop, britpop, german pop, australian pop, chamber pop, baroque pop, brill building pop, canadian pop, acoustic pop, belgian pop, barbadian pop, bubblegum pop, electropop, italian pop, austro pop, bow pop, candy pop, nederpop, new wave pop, afro pop, danish pop, irish pop, la pop, operatic pop, uk pop, dutch pop, europop, dutch hip pop, detroit hip pop, east coast hip pop, alternative hip pop, atl hip pop, hip pop
Rock/Metal: album rock, alternative rock, glam rock, modern rock, art rock, irish rock, dutch rock, belgian rock, celtic rock, candian rock, hard rock, rock-and-roll, soft rock, yacht rock, blues rock, garage rock, australian rock, canadian rock, alternative metal, glam metal, dutch metal
Alternative: alternative rock, alternative pop rock, alternative hip hop, british alternative rock, alternative pop, australian alternative rock, german alternative rock, latin alternative
Punk: g punk, punk, celtic punk, cyberpunk, pop punk
Folk/Indie: folk, candian folk, modern folk rock, brithish folk, folk-pop, australian indie folk, indie anthem-folk, dutch indie, indie pop, alaska indeie, icelandic indie
Classic: classic rock, classic uk pop, classic soul, classic schalger, classic canadian rock, classic italian pop, classic soundtrack
Soul: british soul, classic soul, neo soul, chicago soul
Dance: dance pop, dance rock, alternative dance, eurodance, australian dance
Country: arkansas country, classic country pop, contemporary country
Jazz: acid jazz, contemporary vocal jazz, latin jazz
Singer-Songwriter: scottish singer-songwriter, british singer-songwriter, irish singer-songwriter

Title

Since most titles are unique, converting titles into dummy variables as-is would also be problematic. Hence, Natural Language Processing (NLP) techniques were applied to create new features:

1. The first technique creates a count feature that keeps track of the number of words in a title.

# Create a new variable, length of title
Spotify <- Spotify %>%
  mutate(Title_Word_Count = str_count(Title, pattern = "\\w+"))

2. The second technique, Bag of Words (BoW), creates a dummy variable indicating whether a particular word appears in the title.

# Split data
set.seed(888)
# Split data
Spotify_index <- createDataPartition(Spotify$Popularity, p = 0.8, list = FALSE)
Spotify_train <- Spotify[Spotify_index,]
Spotify_test <- Spotify[-Spotify_index,] 

# Create 25 most common words

# Firstly, we will have to lower case titles
Spotify_train <- Spotify_train %>% mutate(Title = str_to_lower(Title))
Spotify_test  <- Spotify_test  %>% mutate(Title = str_to_lower(Title))

# Secondly, we find which words do we want to create a dummy for, from TRAINING
word_count <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(word_count) <- c('word', 'count')
black_list <- stopwords()
for (title in strsplit(Spotify_train$Title, "[- ,\\(\\)\"]")) { # Split on common separators 
  for (word in title) {
    if (!(word %in% black_list)) { # Filter out common stopwords (default stopwords library list for now)
      if (word %in% word_count$word) {
        word_count$count[which(word_count$word == word)] <- word_count$count[which(word_count$word == word)] + 1
      } else {
        word_count <- rbind(word_count, data.frame(word = word, count = 1))
      }
    }
  }
}

word_count <- arrange(word_count, desc(count)) # Arrange by count
word_count <- word_count[-c(1),] # Remove the most common token: the empty string produced by splitting
significant_words <- head(word_count,25)$word # Just take most common 25 for now, can tune later

# Thirdly,for each of the word in the list, we create a dummy variable

# Helper function to check if word is in a title (have to be custom made since the split above is also custom)
check_word <- function(title, word) {
  word_list <- strsplit(title, "[- ,\\(\\)\"]")[[1]]
  return(word %in% word_list)
}
check_word_expanded <- function(word, titles) {
     sapply(titles, check_word, word = word)
}

word_mutate <- function(data, word_list) {
  new_data <- data
  for (word in word_list) {
    col_name <- to_any_case(paste("contains", word),case = "snake")
    new_data <- new_data %>% 
      mutate("{col_name}" := ifelse(check_word_expanded(word, Title), 1, 0))
  }
  return(new_data)
}

# Apply the function to the training and testing datasets
Spotify_train <- word_mutate(Spotify_train, significant_words)
Spotify_test <- word_mutate(Spotify_test, significant_words)

There are three drawbacks to the BoW technique, two of which are directly addressed here.

  1. BoW tends to create too many features; as such, the BoW features are reduced to the 25 most common words.

  2. BoW includes words that don't contribute any meaning, called stopwords. The R package stopwords was used to filter out the common stopwords. However, the list of stopwords should ideally be context-specific, which could be a point of future research (see the sketch after this list).

  3. BoW doesn't account for sentence structure and context. However, since the texts are short titles, this is less of a concern. NLP is an advanced subject that could warrant further exploration; due to time constraints, this is as far as this research goes.
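As a sketch of how the second drawback might be addressed, the default stopword list could be extended with domain-specific words before the counting loop above; the extra words below are hypothetical examples, not part of the original analysis:

# Hypothetical song-title stopwords appended to the default list
black_list <- c(stopwords(), "remix", "version")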

Artist

With 731 unique artists in the dataset, it would be impractical to turn every artist into a dummy variable. Instead, dummy variables were created for the 25 artists with the most appearances in the dataset.

# Create 25 most repeated artists
# First create a list of 25 most repeated artists from TRAINING data set
top_25_artists <- Spotify_train %>% group_by(Artist) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  head(25) # Top 25 for now, tune later

# Create function to create dummy variable to each artist given a list
artist_mutate <- function(data, artist_list) {
  new_data <- data
  for (artist in artist_list) {
    col_name <- to_any_case(paste("By", artist),case = "snake")
    new_data <- new_data %>% 
      dplyr::mutate("{col_name}" := ifelse(artist == Artist, 1, 0))
  }
  return(new_data)
}

# Apply the function on both train and test dataset
Spotify_train <- artist_mutate(Spotify_train, top_25_artists$Artist)
Spotify_test <- artist_mutate(Spotify_test, top_25_artists$Artist)

Note: Some of the preprocessing steps performed for Title and Artist are too complex for the R package recipes. Precautions have been taken to prevent data leakage when splitting the data into training and testing sets. However, creating these features before building the recipe means that some data leakage is present at the CV level.

To remark, potential expansion of this topic includes:

  1. Using better NLP techniques

  2. Choosing the optimal number of words in BoW

  3. Choosing the optimal number of artist dummies

  4. Preventing data leakage at the CV level

Numerical variables

Since none of the numerical variables are zero-variance or near-zero-variance, few preprocessing steps were required. The only task was to center and scale the variables.

# We deselect Title, Artist, and Genre
Spotify_train <- select(Spotify_train, !c(Title, Artist, Genre))
Spotify_test <- select(Spotify_test, !c(Title, Artist, Genre))

# preprocessing
spotify_recipe <- recipe(Popularity ~. , data = Spotify_train)   # create the recipe for blueprint

# spotify_recipe$var_info   # check the types and roles of variables
# nearZeroVar(Spotify_train, saveMetrics = TRUE)   # The majority of the custom dummies are nzv, but that is to be expected for premade dummies

blueprint <- spotify_recipe %>%
  step_center(all_numeric_predictors(), -starts_with("by"), -starts_with("contains")) %>%   # center all numeric features except response and premade dummies
  step_scale(all_numeric_predictors(), -starts_with("by"), -starts_with("contains")) %>%    # scale all numeric features except response and premade dummies
  step_dummy(all_nominal_predictors(), one_hot = FALSE) # create dummy variables for nominal categorical features

# blueprint
prepare <- prep(blueprint, training = Spotify_train)   # estimate feature engineering parameters from training set
baked_train <- bake(prepare, new_data = Spotify_train)  # apply blueprint to training set
baked_test <- bake(prepare, new_data = Spotify_test)    # apply blueprint to test set

Prediction Models

Eight supervised learning models were fitted to the training data and their performance compared using cross-validation (CV).

set.seed(888)
# We choose the CV setting of 5 fold and 5 repeats
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

The optimal tuning parameters and the cross-validated Root-Mean-Square Error (RMSE) for each model are reported in the RMSE Table. The best-performing model is then applied to the testing data.
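For reference, with \(\hat{y}_i\) denoting the model's prediction for the \(i\)-th of \(n\) observed values \(y_i\),

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \]

so lower values indicate more accurate predictions.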

RMSE Table

# Restore the object
mlr_cv <- readRDS(file = "Datasets/mlr_cv.rds")
knn_cv <- readRDS(file = "Datasets/knn_cv.rds")
ridge_cv <- readRDS(file = "Datasets/ridge_cv.rds")
lasso_cv <- readRDS(file = "Datasets/lasso_cv.rds")
mars_cv <- readRDS(file = "Datasets/mars_cv.rds")
tree_cv <- readRDS(file = "Datasets/tree_cv.rds")
bag_cv <- readRDS(file = "Datasets/bag_cv.rds")
rf_cv <- readRDS(file = "Datasets/rf_cv.rds")

report <- data.frame(models = c("MLR","KNN", "Ridge", "Lasso", "MARS", "Regression tree", "Bagged tree", "Random forests"),
           bestTunes = c("N/A", # MLR tune
                         paste("k =", knn_cv$bestTune$k), # KNN tune
                         paste("lambda =", ridge_cv$bestTune$lambda), # Ridge tune
                         paste("lambda =", lasso_cv$bestTune$lambda), # Lasso tune
                         paste("degree =", mars_cv$bestTune$degree, "; nprune =", mars_cv$bestTune$nprune), # MARS tune
                         paste("cp =", tree_cv$bestTune$cp), # Regression tree tune
                         "N/A", # Bagged tune
                         paste("mtry =", rf_cv$bestTune$mtry)), # Random forest tune
           RMSE = c(min(mlr_cv$results$RMSE), # MLR CV RMSE
                    min(knn_cv$results$RMSE), # KNN CV RMSE
                    min(ridge_cv$results$RMSE), # Ridge CV RMSE
                    min(lasso_cv$results$RMSE), # Lasso CV RMSE
                    min(mars_cv$results$RMSE), # MARS CV RMSE
                    min(tree_cv$results$RMSE), # Regression tree CV RMSE
                    min(bag_cv$results$RMSE), # Bagged CV RMSE
                    min(rf_cv$results$RMSE))) # Random forest CV RMSE

report <- report %>% mutate_if(is.numeric, format, digits=7)
# knitr::kable(report)

report$RMSE = cell_spec(report$RMSE, bold = TRUE, color = "white", 
                        background = ifelse(report$RMSE == min(report$RMSE), "green", "red"))
report$models = cell_spec(report$models, bold = TRUE)

kbl(report, escape = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) 
models           bestTunes                    RMSE
MLR              N/A                          12.38717
KNN              k = 15                       13.35766
Ridge            lambda = 1.07226722201032    12.36088
Lasso            lambda = 0.114975699539774   12.34086
MARS             degree = 1; nprune = 23      12.73216
Regression tree  cp = 0.00980155093512361     13.54065
Bagged tree      N/A                          12.37812
Random forests   mtry = 29                    12.32998
# fit final model
rffit <- ranger(Popularity ~ .,
                data = baked_train,
                num.trees = 500,
                mtry = rf_cv$bestTune$mtry,
                splitrule = "variance",   # use "gini" for classificaton
                min.node.size = 2,
                importance = "impurity")

# variable importance
data.frame(Overall = rffit$variable.importance) %>% arrange(desc(Overall)) %>% head(10)
##                            Overall
## Year                      41825.47
## Length                    26250.90
## Loudness                  25239.72
## Danceability              24789.98
## BPM                       23461.68
## Energy                    23410.70
## Valence                   23339.80
## Liveness                  22651.44
## Acousticness              20196.02
## Combined_Genre_Folk_Indie 18272.76
# R-Squared of the final model
rffit$r.squared
## [1] 0.2793658
preds_rf <- predict(rffit, data = baked_test, type = "response")  # predictions on test set
pred_error_est_rffit <- sqrt(mean((preds_rf$predictions - baked_test$Popularity)^2))  # test set RMSE
pred_error_est_rffit
## [1] 12.81495

MLR

set.seed(888)
################### MLR ###################
mlr_cv <- train(blueprint,
                 data = Spotify_train, 
                 method = "lm",
                 trControl = cv_specs,
                 metric = "RMSE")
# Save an object to a file
saveRDS(mlr_cv, file = "Datasets/mlr_cv.rds")

min(mlr_cv$results$RMSE) # MLR CV RMSE

Multiple linear regression (MLR) is a parametric approach that assumes a linear relationship between the response and the features.

Note: Attempting to predict the response for a value of the predictor that lies outside the range of the data is NOT recommended. This is called extrapolation, and the predictions tend to be unreliable.
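For interpretation, the final linear model can also be refitted directly on the preprocessed training set (a minimal sketch assuming the baked_train object created earlier; output omitted):

# Ordinary least squares on the baked training data
mlr_fit <- lm(Popularity ~ ., data = baked_train)
summary(mlr_fit)   # coefficient estimates and their significance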

KNN

set.seed(888)
################### KNN ###################
k_grid <- expand.grid(k = seq(1, 15, 1))   # for KNN regression

knn_cv <- train(blueprint,
                 data = Spotify_train, 
                 method = "knn",
                 trControl = cv_specs,
                 tuneGrid = k_grid,
                 metric = "RMSE")

# Save an object to a file
saveRDS(knn_cv, file = "Datasets/knn_cv.rds")

min(knn_cv$results$RMSE) # KNN CV RMSE

K Nearest Neighbors (KNN) is a non-parametric method that approximates the association between the independent variables and the continuous outcome by averaging the observations in the same neighborhood. This method becomes impractical as the dimension increases (i.e., when there are many independent variables).

Smaller values of K yield noisier fits in which individual observations have a higher influence on the result, while larger values yield smoother decision boundaries, which means lower variance but increased bias.

Note: KNN does not have a specialized training phase; it simply stores all the training samples in memory and uses them directly for prediction. KNN can be very sensitive to the scale of the data because it relies on computing distances: features on a larger scale can dominate the calculated distances and produce poor results. It is thus advised to scale the data before running KNN.
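The bias-variance trade-off across the searched grid can be visualized with caret's built-in plotting method (a quick sketch):

ggplot(knn_cv)   # CV RMSE for each value of k in k_grid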

Ridge

set.seed(888)
################### Ridge Regression ###################
lambda_grid <- 10^seq(-3, 3, length = 100)   # grid of lambda values to search over

ridge_cv <- train(blueprint,
                  data = Spotify_train,
                  method = "glmnet",   # for ridge regression
                  trControl = cv_specs,
                  tuneGrid = expand.grid(alpha = 0, lambda = lambda_grid),  # alpha = 0 implements ridge regression
                  metric = "RMSE")
# Save an object to a file
saveRDS(ridge_cv, file = "Datasets/ridge_cv.rds")

min(ridge_cv$results$RMSE) # Ridge CV RMSE

Ridge regression is a model tuning method used to analyze data that suffers from multicollinearity. When multicollinearity occurs, the least-squares estimates are unbiased, but their variances are large, which can result in predicted values that are far from the actual values.

Ridge is useful for this dataset because many independent variables in the model are correlated (see the correlation sketch after this list):

  • Danceability is affected by several other variables such as loudness, energy, and BPM.

  • Loudness and liveness are also affected by valence; songs with high valence sound more positive (happy, cheerful, euphoric), while songs with low valence sound more negative (sad, depressed, angry).

  • Valence can also affect danceability; additionally, songs with low valence tend to be higher on speechiness and acousticness.

  • Certain artists tend to make similar music (the composition of the song).
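These associations can be eyeballed with a correlation plot (a minimal sketch using the corrplot package loaded at the start; the column names follow the cleaned training set):

# Pairwise correlations among the numeric audio features
num_features <- Spotify_train %>%
  select(Year, BPM, Energy, Danceability, Loudness, Liveness,
         Valence, Length, Acousticness, Speechiness, Popularity)
corrplot(cor(num_features), type = "upper")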

As lambda increases, the variance drops at the cost of some added bias. The drawback is that ridge doesn't select variables; it includes all of the variables in the final model.
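That ridge retains every predictor can be verified by extracting the coefficients at the chosen lambda (a sketch; glmnet's coef method takes the penalty value through its s argument):

# All ridge coefficients are shrunk but remain nonzero
ridge_coefs <- coef(ridge_cv$finalModel, s = ridge_cv$bestTune$lambda)
sum(ridge_coefs != 0)   # count of nonzero coefficients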

g1 <- ggplot(ridge_cv)   # lambda vs. RMSE plot
g2 <- ggplot(ridge_cv) + xlim(c(0,2))    # a closer look at lambda vs. RMSE plot
grid.arrange(g1, g2, ncol = 2)

Lasso

set.seed(888)
################### LASSO Regression ###################
lasso_cv <- train(blueprint,
                  data = Spotify_train,
                  method = "glmnet",   # for lasso
                  trControl = cv_specs,
                  tuneGrid = expand.grid(alpha = 1, lambda = lambda_grid),  # alpha = 1 implements lasso
                  metric = "RMSE")
# Save an object to a file
saveRDS(lasso_cv, file = "Datasets/lasso_cv.rds")

min(lasso_cv$results$RMSE) # Lasso CV RMSE

Lasso regression shrinks the coefficient estimates towards zero and, unlike ridge, has the effect of setting coefficients exactly equal to zero when lambda is large enough. When lambda is small, the result is essentially the least-squares estimate; as lambda increases, shrinkage occurs, and variables whose coefficients reach zero can be thrown away.

Lasso's major advantage is that it combines shrinkage with variable selection. Interpreting the coefficients at the optimal lambda:

  • With all predictors at their baseline (centered) values, the predicted popularity of a song is 59.75 on a 100-point scale (the model intercept).

  • Danceability, loudness, and speechiness are positively associated with popularity.

  • Year, energy, liveness, valence, length, and acousticness are negatively associated with popularity.

  • The coefficients for BPM and title_word_count are 0, so these variables are thrown away.

  • Being by Coldplay or Ed Sheeran is positively associated with popularity: the predicted popularity increases by 3.93 for Coldplay and by 6.71 for Ed Sheeran.

Changing the value of lambda controls the penalty term: the higher the value, the bigger the penalty, and the more the magnitudes of the coefficients are reduced.
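The coefficients behind the bullets above can be extracted the same way as for ridge (a sketch; the zero entries mark the variables lasso discards):

# Coefficients at the CV-selected lambda; zeros are dropped variables
lasso_coefs <- coef(lasso_cv$finalModel, s = lasso_cv$bestTune$lambda)
lasso_coefs[lasso_coefs[, 1] != 0, , drop = FALSE]   # nonzero rows only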

g3 <- ggplot(lasso_cv)   # lambda vs. RMSE plot
g4 <- ggplot(lasso_cv) + xlim(c(0,2))    # a closer look at lambda vs. RMSE plot
grid.arrange(g3, g4, ncol = 2)

MARS

set.seed(888)
################### MARS ################### 
param_grid_mars <- expand.grid(degree = 1:3, nprune = seq(1, 100, length.out = 10))   # for MARS

mars_cv <- train(blueprint,
                 data = Spotify_train,
                 method = "earth",
                 trControl = cv_specs,
                 tuneGrid = param_grid_mars,
                 metric = "RMSE")
# Save an object to a file
saveRDS(mars_cv, file = "Datasets/mars_cv.rds")

min(mars_cv$results$RMSE) # MARS CV RMSE

Multivariate Adaptive Regression Splines (MARS) seek to capture non-linear relationships between the features and the outcome by adding knots that break the regression fit into piecewise functions.

The two tuning parameters are degree, which indicates the maximum degree of interaction, and nprune, which controls the pruning of knots that don't contribute to the fit.

The optimal degree for this model is 1, which indicates that there is no general interaction effect. The optimal pruning parameter is 23.
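The selected basis (hinge) functions can be listed directly from the fitted earth object (a quick sketch; output omitted):

summary(mars_cv$finalModel)   # retained hinge terms and their coefficients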

Regression Tree

set.seed(888)
################### Regression Tree ################### 
tree_cv <- train(blueprint,
                 data = Spotify_train,
                 method = "rpart",
                 trControl = cv_specs,
                 tuneLength = 20,
                 metric = "RMSE")
# Save an object to a file
saveRDS(tree_cv, file = "Datasets/tree_cv.rds")

min(tree_cv$results$RMSE) # Regression tree CV RMSE

A regression tree splits the observations based on the features; the factors that most significantly divide the outcome can be observed through this method.

The tuning parameter for this model is cp, which controls the complexity of the model.

Looking at the graph below, the model uses the folk/indie genre as a key split when predicting popularity. In addition, some artists are flagged by the model as scoring lower than average in popularity. Lastly, louder, older, and shorter songs tend to score better.

rpart.plot(tree_cv$finalModel)
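The cross-validated RMSE across the cp grid can be inspected the same way as for the other tuned models (a quick sketch):

ggplot(tree_cv)   # CV RMSE for each candidate cp value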

Bagged

set.seed(888)
################### Bagged ################### 
# Tutorial https://bradleyboehmke.github.io/HOML/bagging.html section 10.4
bag_cv <- train(blueprint,
                 data = Spotify_train, 
                 method = "treebag",
                 trControl = cv_specs,
                 nbagg = 500,
                 control = rpart.control(minsplit = 2, cp = 0, xval = 0),
                 metric = "RMSE")
# Save an object to a file
saveRDS(bag_cv, file = "Datasets/bag_cv.rds")
library(dplyr) # Prevent plyr from masking dplyr functions

min(bag_cv$results$RMSE) # Bagged CV RMSE

Bagging uses the bootstrap to create multiple datasets, each a resample of the original dataset. For each bootstrap sample, a regression tree is built, and the bagging model averages the results of those trees. In this implementation, 500 trees were generated.
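To make the averaging concrete, below is a minimal manual bagging sketch under the same settings, assuming the baked_train and baked_test objects created earlier (25 trees instead of 500 to keep it quick; an illustration, not the fitted model above):

# Grow trees on bootstrap resamples and average their predictions
set.seed(888)
n_bags <- 25
boot_preds <- sapply(1:n_bags, function(i) {
  boot_idx <- sample(nrow(baked_train), replace = TRUE)   # bootstrap sample
  fit <- rpart(Popularity ~ ., data = baked_train[boot_idx, ],
               control = rpart.control(minsplit = 2, cp = 0, xval = 0))
  predict(fit, newdata = baked_test)   # each tree predicts the test set
})
bagged_preds <- rowMeans(boot_preds)   # bagged prediction = average over trees
sqrt(mean((bagged_preds - baked_test$Popularity)^2))   # test RMSE of the sketch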

The 10 most important variables using the bagging model are listed below:

# Top 10 most important variables
varImp(bag_cv$finalModel) %>% arrange(desc(Overall)) %>% head(10)
##                   Overall
## BPM              415.6968
## Energy           404.1048
## Danceability     388.6923
## Year             382.2472
## Loudness         292.1364
## Liveness         259.4494
## Valence          214.4778
## Length           197.3110
## Acousticness     165.8394
## Title_Word_Count 104.5483

Random Forests

set.seed(888)
################### Random Forests ################### 
param_grid_rf <- expand.grid(mtry = seq(1, ncol(baked_train) - 1, 1), # for random forests # sequence of 1 to number of predictors
                          splitrule = "variance",  # "gini" for classification
                          min.node.size = 2)       # for each tree

rf_cv <- train(blueprint,
               data = Spotify_train,
               method = "ranger",
               trControl = cv_specs,
               tuneGrid = param_grid_rf,
               importance = "permutation", # needed to use varImp, check https://stackoverflow.com/questions/37279964/variable-importance-with-ranger
               metric = "RMSE")
# Save an object to a file
saveRDS(rf_cv, file = "Datasets/rf_cv.rds")

min(rf_cv$results$RMSE) # Random forest CV RMSE

Random forests seek to improve on the regression tree by growing multiple trees via bootstrapping. Unlike bagging, however, a random forest limits each split to a random subset of the predictors, which decorrelates the individual trees.

The mtry parameter controls the number of predictors considered at each split. The 10 most important variables according to the random forest model are listed below:

# Top 10 most important variables
varImp(rf_cv)$importance %>% arrange(desc(Overall)) %>% head(10)
##                              Overall
## Year                      100.000000
## Loudness                   76.817406
## Combined_Genre_Folk_Indie  44.861046
## Energy                     28.940713
## Danceability               21.703908
## Acousticness               20.009137
## Valence                    13.477981
## Length                      9.633971
## Combined_Genre_Hip_Pop      9.291288
## by_boudewijn_de_groot       8.990787

Conclusion

This project utilized various machine learning algorithms to predict the popularity of a song. Based on the CV RMSE results, the random forest performed best for this task, with a prediction error (RMSE) of 12.81495 on the testing data.

While it is hard to interpret the exact effect of variables in a random forest, the model provides variable importance scores. Some of the most important features for popularity that an artist can control are length, loudness, danceability, and BPM.

Other research on the same topic reports mixed results. Reiman and Ornell (2018) were unsuccessful in confirming whether predicting the popularity of a song is possible. At the other end of the spectrum, Herremans (2014) could predict with 81% accuracy whether a song would make it into the top 10.

It will be up to future research to determine whether these variables are truly the most important. The results from the other models suggest that older, louder, more danceable, and shorter songs tend to score better. In addition, the final model has an \(R^2\) of about 0.28, which indicates that there is plenty of room for improvement.