knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(recipes)
library(earth)
library(ipred)
library(e1071)
library(ranger)
library(rpart)
library(rpart.plot)
library(skimr)
library(corrplot)
library(stringr)
library(stopwords)
library(snakecase)
library(gridExtra)
library(dplyr)
library(kableExtra)
Music has been a long-standing form of entertainment throughout humanity's history, and some songs are embraced by the general population more than others. This project sought to understand the underlying factors that could explain these preferences from a scientific standpoint. Various machine learning algorithms were applied to different audio features in an attempt to predict how likely a song is to make it to the top charts.
Being able to identify the elements that contribute to making a song popular could be beneficial, and potentially crucial, to the music industry moving forward.
The dataset used for this project is sourced from Kaggle. It contains audio statistics of the top 2000 tracks on Spotify, covering songs released from 1956 to 2019 by notable and famous artists like Queen, The Beatles, Guns N' Roses, etc.
There are 1958 unique titles, 731 artists, and 149 genres in this dataset. The dataset contains 14 columns, each describing the track and its qualities, including audio features like Danceability, BPM, Liveness, Valence (Positivity), and many more.
For details on how the audio features are graded, refer to the Evaluation Guideline provided by The Echo Nest, now acquired by Spotify.
The audio features are defined as follows:
report <- data.frame(Variable = c("Title", "Artist", "Genre", "Year", "BPM", "Energy",
                                  "Danceability", "Loudness", "Liveness", "Valence", "Length",
                                  "Acoustic", "Speechiness", "Popularity"),
Definition = c("Name of the Track",
"Name of the Artist",
"Genre of the track",
"Release year of the track",
"The tempo of the song, beats per minute",
"The energy of a song- the higher the value, the more energetic the song",
"The higher the value, the easier it is to dance to the song",
"The higher the value, the louder the song",
"The higher the value, the more likely the song is a live recording",
"The higher the value, the more positive mood for the song",
"The duration of the song (in seconds)",
"The higher the value the more acoustic the song is",
"The higher the value the more spoken words the song contains",
"The higher the value the more popular the song")
)
kbl(report, escape = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
Variable | Definition |
---|---|
Title | Name of the Track |
Artist | Name of the Artist |
Genre | Genre of the track |
Year | Release year of the track |
BPM | The tempo of the song, beats per minute |
Energy | The energy of a song- the higher the value, the more energetic the song |
Danceability | The higher the value, the easier it is to dance to the song |
Loudness | The higher the value, the louder the song |
Liveness | The higher the value, the more likely the song is a live recording |
Valence | The higher the value, the more positive mood for the song |
Length | The duration of the song (in seconds) |
Acoustic | The higher the value the more acoustic the song is |
Speechiness | The higher the value the more spoken words the song contains |
Popularity | The higher the value the more popular the song |
With a quick glance at the summary statistics below, there are hit songs with identical titles. For the Artist variable, Queen had the most top hit songs, followed by The Beatles and Coldplay. As for the Genre variable, the top hit songs fell into album rock, adult standards, and dutch pop. None of the variables have missing values.
# Importing data
Spotify <- read_csv("Datasets/Spotify-2000.csv")

# Quick cleaning
Spotify <- Spotify %>%
  # Remove the index column
  select(-c(Index)) %>%
  # Convert character into factor
  mutate_if(is.character, as.factor) %>%
  # Rename with cleaner names
  rename(Genre = `Top Genre`,
         Length = `Length (Duration)`,
         BPM = `Beats Per Minute (BPM)`,
         Loudness = `Loudness (dB)`)

summary(Spotify)
## Title Artist Genre
## Feeling Good : 3 Queen : 37 album rock : 413
## Hallelujah : 3 The Beatles : 36 adult standards : 123
## One : 3 Coldplay : 27 dutch pop : 88
## Always On My Mind: 2 U2 : 26 alternative rock: 86
## Amsterdam : 2 The Rolling Stones: 24 dance pop : 83
## Behind Blue Eyes : 2 Bruce Springsteen : 23 dutch indie : 75
## (Other) :1979 (Other) :1821 (Other) :1126
## Year BPM Energy Danceability
## Min. :1956 Min. : 37.0 Min. : 3.00 Min. :10.00
## 1st Qu.:1979 1st Qu.: 99.0 1st Qu.: 42.00 1st Qu.:43.00
## Median :1993 Median :119.0 Median : 61.00 Median :53.00
## Mean :1993 Mean :120.2 Mean : 59.68 Mean :53.24
## 3rd Qu.:2007 3rd Qu.:136.0 3rd Qu.: 78.00 3rd Qu.:64.00
## Max. :2019 Max. :206.0 Max. :100.00 Max. :96.00
##
## Loudness Liveness Valence Length
## Min. :-27.000 Min. : 2.00 Min. : 3.00 Min. : 93.0
## 1st Qu.:-11.000 1st Qu.: 9.00 1st Qu.:29.00 1st Qu.: 212.0
## Median : -8.000 Median :12.00 Median :47.00 Median : 245.0
## Mean : -9.009 Mean :19.01 Mean :49.41 Mean : 262.4
## 3rd Qu.: -6.000 3rd Qu.:23.00 3rd Qu.:69.75 3rd Qu.: 289.0
## Max. : -2.000 Max. :99.00 Max. :99.00 Max. :1412.0
##
## Acousticness Speechiness Popularity
## Min. : 0.00 Min. : 2.000 Min. : 11.00
## 1st Qu.: 3.00 1st Qu.: 3.000 1st Qu.: 49.25
## Median :18.00 Median : 4.000 Median : 62.00
## Mean :28.86 Mean : 4.995 Mean : 59.53
## 3rd Qu.:50.00 3rd Qu.: 5.000 3rd Qu.: 71.00
## Max. :99.00 Max. :55.000 Max. :100.00
##
The three categorical variables Genre, Title, and Artist required unique solutions to extract features.
Having all 149 genres as dummy variables would be problematic. As such, related genres were combined. The resulting combinations of genres can be found below (for reproducibility purposes).
# Combine genre
combine_sheet <- read_csv("Datasets/CMSC (Genre).csv") # Pre-made combine sheet

# Mutate dataset to use the combined genres
Spotify <- Spotify %>%
  mutate(Combined_Genre = case_when(
    Genre %in% combine_sheet$`Rock/Metal` ~ "Rock_Metal",
    Genre %in% combine_sheet$Alternative ~ "Alternative",
    Genre %in% combine_sheet$Punk ~ "Punk",
    Genre %in% combine_sheet$`Folk/Indie` ~ "Folk_Indie",
    Genre %in% combine_sheet$Classic ~ "Classic",
    Genre %in% combine_sheet$Soul ~ "Soul",
    Genre %in% combine_sheet$Dance ~ "Dance",
    Genre %in% combine_sheet$Jazz ~ "Jazz",
    Genre %in% combine_sheet$`Singer-Songwriter` ~ "Singer_Songwriter",
    Genre %in% combine_sheet$Country ~ "Country",
    Genre %in% combine_sheet$`(Hip)Pop` ~ "Hip_Pop",
    TRUE ~ "Other"
  )) %>%
  mutate(Combined_Genre = as.factor(Combined_Genre))
detail <- data.frame(Combined = colnames(combine_sheet),
                     Genres = apply(combine_sheet, 2, function(x) paste(x[!is.na(x)], collapse = ", ")))
detail$Combined = cell_spec(detail$Combined, bold = TRUE)
# detail$Genres = cell_spec(detail$Genres, bold = TRUE)
kbl(detail, escape = FALSE, row.names = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
Combined | Genres |
---|---|
(Hip)Pop | dutch pop, pop, europop, art pop, britpop, german pop, australian pop, chamber pop, baroque pop, brill building pop, canadian pop, acoustic pop, belgian pop, barbadian pop, bubblegum pop, electropop, italian pop, austro pop, bow pop, candy pop, nederpop, new wave pop, afro pop, danish pop, irish pop, la pop, operatic pop, uk pop, dutch pop, europop, dutch hip pop, detroit hip pop, east coast hip pop, alternative hip pop, atl hip pop, hip pop |
Rock/Metal | album rock, alternative rock, glam rock, modern rock, art rock, irish rock, dutch rock, belgian rock, celtic rock, candian rock, hard rock, rock-and-roll, soft rock, yacht rock, blues rock, garage rock, australian rock, canadian rock, alternative metal, glam metal, dutch metal |
Alternative | alternative rock, alternative pop rock, alternative hip hop, british alternative rock, alternative pop, australian alternative rock, german alternative rock, latin alternative |
Punk | g punk, punk, celtic punk, cyberpunk, pop punk |
Folk/Indie | folk, candian folk, modern folk rock, brithish folk, folk-pop, australian indie folk, indie anthem-folk, dutch indie, indie pop, alaska indeie, icelandic indie |
Classic | classic rock, classic uk pop, classic soul, classic schalger, classic canadian rock, classic italian pop, classic soundtrack |
Soul | british soul, classic soul, neo soul, chicago soul |
Dance | dance pop, dance rock, alternative dance, eurodance, australian dance |
Country | arkansas country, classic country pop, contemporary country |
Jazz | acid jazz, contemporary vocal jazz, latin jazz |
Singer-Songwriter | scottish singer-songwriter, british singer-songwriter, irish singer-songwriter |
With most titles being unique, converting titles into dummy variables as-is would also be problematic. Hence, Natural Language Processing (NLP) techniques were applied to create new features:
1. The first technique creates a count feature that keeps track of the number of words in a title.
# Create a new variable, length of title
Spotify <- Spotify %>%
  mutate(Title_Word_Count = str_count(Title, pattern = "\\w+"))
2. The second technique, Bag of Words (BoW), creates a dummy variable indicating whether a particular word appears in the title.
# Split data
set.seed(888)
Spotify_index <- createDataPartition(Spotify$Popularity, p = 0.8, list = FALSE)
Spotify_train <- Spotify[Spotify_index, ]
Spotify_test <- Spotify[-Spotify_index, ]

# Create 25 most common words
# Firstly, we have to lower-case the titles
Spotify_train <- Spotify_train %>% mutate(Title = str_to_lower(Title))
Spotify_test <- Spotify_test %>% mutate(Title = str_to_lower(Title))

# Secondly, we find which words we want to create a dummy for, from TRAINING
word_count <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(word_count) <- c('word', 'count')
black_list <- stopwords()
for (title in strsplit(Spotify_train$Title, "[- ,\\(\\)\"]")) { # Split on common separators
  for (word in title) {
    if (!(word %in% black_list)) { # Filter out common words, just the stopwords library for now
      if (word %in% word_count$word) {
        word_count$count[which(word_count$word == word)] <- word_count$count[which(word_count$word == word)] + 1
      } else {
        word_count <- rbind(word_count, data.frame(word = word, count = 1))
      }
    }
  }
}
word_count <- arrange(word_count, desc(count)) # Arrange by count
word_count <- word_count[-c(1), ] # Remove most common: white space
significant_words <- head(word_count, 25)$word # Just take the most common 25 for now, can tune later

# Thirdly, for each word in the list, we create a dummy variable
# Helper function to check if a word is in a title (custom made since the split above is also custom)
check_word <- function(title, word) {
  word_list <- strsplit(title, "[- ,\\(\\)\"]")[[1]]
  return(word %in% word_list)
}
check_word_expanded <- function(word, titles) {
  sapply(titles, check_word, word = word)
}
word_mutate <- function(data, word_list) {
  new_data <- data
  for (word in word_list) {
    col_name <- to_any_case(paste("contains", word), case = "snake")
    new_data <- new_data %>%
      mutate("{col_name}" := ifelse(check_word_expanded(word, Title), 1, 0))
  }
  return(new_data)
}

# Apply function onto training and testing datasets
Spotify_train <- word_mutate(Spotify_train, significant_words)
Spotify_test <- word_mutate(Spotify_test, significant_words)
There are three drawbacks to the BoW technique, two of which are directly addressed here:
1. BoW tends to create too many features; as such, we reduce the BoW to the 25 most common words.
2. BoW includes words that don't contribute any meaning, called stopwords. The R package stopwords was used to filter out all the common stopwords. However, the list of stopwords should be context-specific, and that could be a point of future research (see the sketch below).
3. BoW doesn't account for sentence structure and context. However, since the texts are short titles, this is less of a concern. NLP is an advanced subject and could warrant further exploration; due to time constraints, this is as far as this research will go.
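As a sketch of what a context-specific stopword list could look like, the default list can be extended before running the BoW loop above. The extra words here are hypothetical examples of song-title filler, not terms derived from this dataset:

```r
# A minimal sketch: extend the default stopword list with domain-specific
# filler words (hypothetical examples, not derived from the data).
library(stopwords)

custom_stopwords <- c(stopwords(), "remastered", "version", "edit", "feat")

# The BoW loop above would then use this as its blacklist:
black_list <- custom_stopwords
```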
With 731 unique artists in the dataset, it would be redundant to turn every artist into a dummy variable. Instead, dummy variables were created for the 25 artists with the most appearances in the dataset.
# Create 25 most repeated artists
# First, create a list of the 25 most repeated artists from the TRAINING data set
top_25_artists <- Spotify_train %>% group_by(Artist) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  head(25) # Top 25 for now, tune later

# Create a function that adds a dummy variable for each artist in a given list
artist_mutate <- function(data, artist_list) {
  new_data <- data
  for (artist in artist_list) {
    col_name <- to_any_case(paste("By", artist), case = "snake")
    new_data <- new_data %>%
      dplyr::mutate("{col_name}" := ifelse(artist == Artist, 1, 0))
  }
  return(new_data)
}

# Apply the function on both train and test datasets
Spotify_train <- artist_mutate(Spotify_train, top_25_artists$Artist)
Spotify_test <- artist_mutate(Spotify_test, top_25_artists$Artist)
Note: Some of the preprocessing steps performed for Title and Artist are too complex for the R package recipes. Precautions have been taken to prevent data leakage when splitting the data into training and testing sets. However, creating these features before making the recipe means that data leakage is present at the CV level.
To remark, potential expansions of this topic include:
- Using better NLP techniques
- Choosing the optimal number of words in BoW
- Choosing the optimal number of artist dummies
- Preventing data leakage at the CV level (see the sketch below)
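One possible remedy for the CV-level leakage is the textrecipes package (an assumption: it was not used in this project), which moves tokenization and word selection inside the recipe so they are re-estimated within each CV fold. A rough sketch, assuming Title is still a raw text column at this point:

```r
# A rough sketch of fold-safe title features with textrecipes (not used
# in this project). Because these steps live inside the recipe, the word
# list is re-learned within each CV fold and never sees held-out data.
library(textrecipes)

leakage_free_recipe <- recipe(Popularity ~ ., data = Spotify_train) %>%
  step_tokenize(Title) %>%                     # split titles into words
  step_stopwords(Title) %>%                    # drop common stopwords
  step_tokenfilter(Title, max_tokens = 25) %>% # keep the 25 most frequent words
  step_tf(Title)                               # one term-frequency column per word
```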
Since none of the numeric variables are zero-variance or near-zero-variance, not many preprocessing steps were required. The only task was to center and scale the variables.
# We deselect Title, Artist, and Genre
Spotify_train <- select(Spotify_train, !c(Title, Artist, Genre))
Spotify_test <- select(Spotify_test, !c(Title, Artist, Genre))

# Preprocessing
spotify_recipe <- recipe(Popularity ~ ., data = Spotify_train) # create the recipe for blueprint

# spotify_recipe$var_info # check the types and roles of variables
# nearZeroVar(Spotify_train, saveMetrics = TRUE) # The majority of the custom dummies are nzv, but that is to be expected for premade dummies

blueprint <- spotify_recipe %>%
  step_center(all_numeric_predictors(), -starts_with("by"), -starts_with("contains")) %>% # center all numeric features except response and premade dummies
  step_scale(all_numeric_predictors(), -starts_with("by"), -starts_with("contains")) %>% # scale all numeric features except response and premade dummies
  step_dummy(all_nominal_predictors(), one_hot = FALSE) # create dummy variables for nominal categorical features
# blueprint

prepare <- prep(blueprint, training = Spotify_train) # estimate feature engineering parameters from training set
baked_train <- bake(prepare, new_data = Spotify_train) # apply blueprint to training set
baked_test <- bake(prepare, new_data = Spotify_test) # apply blueprint to test set
Eight supervised learning models were fitted on the training data and their performance compared using cross-validation (CV).
set.seed(888)
# We choose the CV setting of 5 folds and 5 repeats
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
The optimal tuning parameters and the CV Root-Mean-Square Error (RMSE) for each model are reported in the RMSE table. The best-performing model is then applied to the testing data.
# Restore the objects
mlr_cv <- readRDS(file = "Datasets/mlr_cv.rds")
knn_cv <- readRDS(file = "Datasets/knn_cv.rds")
ridge_cv <- readRDS(file = "Datasets/ridge_cv.rds")
lasso_cv <- readRDS(file = "Datasets/lasso_cv.rds")
mars_cv <- readRDS(file = "Datasets/mars_cv.rds")
tree_cv <- readRDS(file = "Datasets/tree_cv.rds")
bag_cv <- readRDS(file = "Datasets/bag_cv.rds")
rf_cv <- readRDS(file = "Datasets/rf_cv.rds")

report <- data.frame(models = c("MLR", "KNN", "Ridge", "Lasso", "MARS", "Regression tree", "Bagged tree", "Random forests"),
                     bestTunes = c("N/A", # MLR tune
                                   paste("k =", knn_cv$bestTune$k), # KNN tune
                                   paste("lambda =", ridge_cv$bestTune$lambda), # Ridge tune
                                   paste("lambda =", lasso_cv$bestTune$lambda), # Lasso tune
                                   paste("degree =", mars_cv$bestTune$degree, "; nprune =", mars_cv$bestTune$nprune), # MARS tune
                                   paste("cp =", tree_cv$bestTune$cp), # Regression tree tune
                                   "N/A", # Bagged tune
                                   paste("mtry =", rf_cv$bestTune$mtry)), # Random forest tune
                     RMSE = c(min(mlr_cv$results$RMSE), # MLR CV RMSE
                              min(knn_cv$results$RMSE), # KNN CV RMSE
                              min(ridge_cv$results$RMSE), # Ridge CV RMSE
                              min(lasso_cv$results$RMSE), # Lasso CV RMSE
                              min(mars_cv$results$RMSE), # MARS CV RMSE
                              min(tree_cv$results$RMSE), # Regression tree CV RMSE
                              min(bag_cv$results$RMSE), # Bagged CV RMSE
                              min(rf_cv$results$RMSE))) # Random forest CV RMSE

report <- report %>% mutate_if(is.numeric, format, digits = 7)
report$RMSE = cell_spec(report$RMSE, bold = TRUE, color = "white",
                        background = ifelse(report$RMSE == min(report$RMSE), "green", "red"))
report$models = cell_spec(report$models, bold = TRUE)

kbl(report, escape = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
models | bestTunes | RMSE |
---|---|---|
MLR | N/A | 12.38717 |
KNN | k = 15 | 13.35766 |
Ridge | lambda = 1.07226722201032 | 12.36088 |
Lasso | lambda = 0.114975699539774 | 12.34086 |
MARS | degree = 1 ; nprune = 23 | 12.73216 |
Regression tree | cp = 0.00980155093512361 | 13.54065 |
Bagged tree | N/A | 12.37812 |
Random forests | mtry = 29 | 12.32998 |
# Fit final model
rffit <- ranger(Popularity ~ .,
                data = baked_train,
                num.trees = 500,
                mtry = rf_cv$bestTune$mtry,
                splitrule = "variance", # use "gini" for classification
                min.node.size = 2,
                importance = "impurity")

# Variable importance
data.frame(Overall = rffit$variable.importance) %>% arrange(desc(Overall)) %>% head(10)
## Overall
## Year 41825.47
## Length 26250.90
## Loudness 25239.72
## Danceability 24789.98
## BPM 23461.68
## Energy 23410.70
## Valence 23339.80
## Liveness 22651.44
## Acousticness 20196.02
## Combined_Genre_Folk_Indie 18272.76
# R-squared of the final model
rffit$r.squared
## [1] 0.2793658
preds_rf <- predict(rffit, data = baked_test, type = "response") # predictions on test set
pred_error_est_rffit <- sqrt(mean((preds_rf$predictions - baked_test$Popularity)^2)) # test set RMSE
pred_error_est_rffit
## [1] 12.81495
set.seed(888)
################### MLR ###################
mlr_cv <- train(blueprint,
                data = Spotify_train,
                method = "lm",
                trControl = cv_specs,
                metric = "RMSE")
# Save an object to a file
saveRDS(mlr_cv, file = "Datasets/mlr_cv.rds")
min(mlr_cv$results$RMSE) # MLR CV RMSE
Multiple linear regression (MLR) is a parametric approach that assumes a linear relationship between the response and the features.
Note: Attempting to predict the response for a value of the predictor that lies outside the range of the data is NOT recommended. This is called extrapolation, and the predictions tend to be unreliable.
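To illustrate the extrapolation warning, here is a hypothetical single-predictor sketch (not the full MLR above), predicting inside and far outside the observed 1956-2019 range of Year:

```r
# A hypothetical illustration of extrapolation risk using a simple
# one-predictor linear model (not the project's fitted MLR).
toy_fit <- lm(Popularity ~ Year, data = Spotify_train)

predict(toy_fit, newdata = data.frame(Year = c(1990, 2050)))
# The 1990 prediction is interpolation; the 2050 prediction extends the
# fitted line beyond any observed data and should not be trusted.
```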
set.seed(888)
################### KNN ###################
k_grid <- expand.grid(k = seq(1, 15, 1)) # for KNN regression

knn_cv <- train(blueprint,
                data = Spotify_train,
                method = "knn",
                trControl = cv_specs,
                tuneGrid = k_grid,
                metric = "RMSE")
# Save an object to a file
saveRDS(knn_cv, file = "Datasets/knn_cv.rds")
min(knn_cv$results$RMSE) # KNN CV RMSE
K Nearest Neighbors (KNN) is a non-parametric method that approximates the association between the independent variables and the continuous outcome by averaging the observations in the same neighborhood. This method becomes impractical as the dimension increases (when there are many independent variables).
Smaller values of K can be noisy, letting individual observations exert a higher influence on the result, while larger values produce smoother decision boundaries, which means lower variance but increased bias.
Note: KNN does not have a specialized training phase; it simply stores all the training samples in memory and uses them at prediction time. KNN can be very sensitive to the scale of the data as it relies on computing distances: features measured on a larger scale contribute disproportionately large distances and can drown out the rest. It is thus advised to scale the data before running KNN.
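A toy illustration of this sensitivity, using two hypothetical songs and made-up scale values (not rows from the dataset):

```r
# Hypothetical songs: a large gap in Length (measured in seconds) versus
# a moderate gap in Danceability (a 0-100 score).
song_a <- c(Length = 250, Danceability = 40)
song_b <- c(Length = 400, Danceability = 90)

# Unscaled Euclidean distance: dominated by the 150-second Length gap
sqrt(sum((song_a - song_b)^2))

# After dividing by (assumed, illustrative) standard deviations, the
# Danceability gap dominates instead
scaled_diff <- (song_a - song_b) / c(Length = 60, Danceability = 15)
sqrt(sum(scaled_diff^2))
```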
set.seed(888)
################### Ridge Regression ###################
lambda_grid <- 10^seq(-3, 3, length = 100) # grid of lambda values to search over

ridge_cv <- train(blueprint,
                  data = Spotify_train,
                  method = "glmnet", # for ridge regression
                  trControl = cv_specs,
                  tuneGrid = expand.grid(alpha = 0, lambda = lambda_grid), # alpha = 0 implements ridge regression
                  metric = "RMSE")
# Save an object to a file
saveRDS(ridge_cv, file = "Datasets/ridge_cv.rds")
min(ridge_cv$results$RMSE) # Ridge CV RMSE
Ridge regression is a model tuning method used to analyze data that suffers from multicollinearity. When multicollinearity occurs, least-squares estimates are unbiased but their variances are large, which can result in predicted values being far from the actual values.
Ridge is useful for this dataset because many independent variables in the model are correlated:
- Danceability is affected by several other variables such as loudness, energy, and BPM.
- Loudness and liveness are also affected by valence; songs with high valence sound more positive (happy, cheerful, euphoric) while songs with low valence sound more negative (sad, depressed, angry).
- Valence can also affect danceability; additionally, songs with low valence tend to be higher on speechiness and acousticness.
- Certain artists tend to make similar music (the composition of the song).

As lambda increases, the variance drops, at the cost of some added bias. The drawback is that ridge doesn't select variables; it includes all of the variables in the final model (see the sketch after the plot below).
g1 <- ggplot(ridge_cv) # lambda vs. RMSE plot
g2 <- ggplot(ridge_cv) + xlim(c(0, 2)) # a closer look at lambda vs. RMSE
grid.arrange(g1, g2, ncol = 2)
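To verify that ridge keeps every variable, the coefficients at the CV-chosen lambda can be pulled out of the underlying glmnet fit; a short sketch, assuming (as caret documents) that the fit is stored in ridge_cv$finalModel:

```r
# A sketch: extract ridge coefficients at the CV-chosen lambda from the
# glmnet fit that caret stores in ridge_cv$finalModel.
ridge_coefs <- as.matrix(coef(ridge_cv$finalModel, s = ridge_cv$bestTune$lambda))

sum(ridge_coefs != 0) # every predictor keeps a (possibly tiny) nonzero coefficient
head(ridge_coefs)     # inspect the first few coefficients
```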
set.seed(888)
################### LASSO Regression ###################
lasso_cv <- train(blueprint,
                  data = Spotify_train,
                  method = "glmnet", # for lasso
                  trControl = cv_specs,
                  tuneGrid = expand.grid(alpha = 1, lambda = lambda_grid), # alpha = 1 implements lasso
                  metric = "RMSE")
# Save an object to a file
saveRDS(lasso_cv, file = "Datasets/lasso_cv.rds")
min(lasso_cv$results$RMSE) # Lasso CV RMSE
Lasso regression shrinks the coefficient estimates towards zero and, unlike ridge, it can set coefficients exactly equal to zero when lambda is large enough. When lambda is small, the result is essentially the least-squares estimate; as lambda increases, shrinkage occurs and variables with zero coefficients can be thrown away.
The major advantage is that lasso combines both shrinkage and variable selection. From the fitted model (see the sketch below for how the coefficients can be extracted):
- Holding all other variables constant, the popularity of any song is 59.75 on a 100-point scale.
- Danceability, loudness, and speechiness are positively correlated with popularity.
- Year, energy, liveness, valence, length, and acousticness are negatively correlated with popularity.
- The coefficients for BPM and title_word_count are 0, so they are thrown away.
- Songs by Coldplay or Ed Sheeran are positively associated with popularity: popularity increases by 3.93 if the song is by Coldplay and by 6.71 if it is by Ed Sheeran.
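A sketch of how the coefficients quoted above can be recovered, again assuming the glmnet fit sits in lasso_cv$finalModel:

```r
# A sketch: extract lasso coefficients at the CV-chosen lambda; rows with
# a zero coefficient are the variables the lasso has thrown away.
lasso_coefs <- as.matrix(coef(lasso_cv$finalModel, s = lasso_cv$bestTune$lambda))

lasso_coefs[lasso_coefs[, 1] != 0, , drop = FALSE] # the surviving terms
```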
By changing the value of lambda we control the penalty term: the higher the value, the bigger the penalty, and therefore the more the magnitude of the coefficients is reduced.
g3 <- ggplot(lasso_cv) # lambda vs. RMSE plot
g4 <- ggplot(lasso_cv) + xlim(c(0, 2)) # a closer look at lambda vs. RMSE
grid.arrange(g3, g4, ncol = 2)
set.seed(888)
################### MARS ###################
param_grid_mars <- expand.grid(degree = 1:3, nprune = seq(1, 100, length.out = 10)) # for MARS

mars_cv <- train(blueprint,
                 data = Spotify_train,
                 method = "earth",
                 trControl = cv_specs,
                 tuneGrid = param_grid_mars,
                 metric = "RMSE")
# Save an object to a file
saveRDS(mars_cv, file = "Datasets/mars_cv.rds")
min(mars_cv$results$RMSE) # MARS CV RMSE
Multivariate Adaptive Regression Splines (MARS) seek to capture non-linear relationships between the features and the outcome by adding knots that break the regression fit into piecewise functions.
The two tuning parameters are degree, which indicates the maximum degree of interaction, and nprune, which controls how many terms survive the pruning of knots that don't contribute to the fit.
The optimal degree for this model is 1, which indicates that there is no interaction effect in general. The optimal pruning parameter is 23.
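The retained hinge functions can be inspected directly from the earth object caret stores; a brief sketch:

```r
# A sketch: inspect the pruned MARS model (the retained basis functions
# and their coefficients) from the earth fit stored in mars_cv$finalModel.
summary(mars_cv$finalModel)
```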
set.seed(888)
################### Regression Tree ###################
tree_cv <- train(blueprint,
                 data = Spotify_train,
                 method = "rpart",
                 trControl = cv_specs,
                 tuneLength = 20,
                 metric = "RMSE")
# Save an object to a file
saveRDS(tree_cv, file = "Datasets/tree_cv.rds")
min(tree_cv$results$RMSE) # Regression tree CV RMSE
A regression tree splits the observations based on the features, so the factors that most significantly divide the outcome can be observed through this method.
The tuning parameter for this model is cp, which controls the complexity of the model.
Looking at the graph below, the model predicted folk/indie songs to score lower in popularity. In addition, some artists are flagged by the model to score lower than average. Lastly, louder, older, and shorter songs tend to score better.
rpart.plot(tree_cv$finalModel)
set.seed(888)
################### Bagged ###################
# Tutorial https://bradleyboehmke.github.io/HOML/bagging.html section 10.4
bag_cv <- train(blueprint,
                data = Spotify_train,
                method = "treebag",
                trControl = cv_specs,
                nbagg = 500,
                control = rpart.control(minsplit = 2, cp = 0, xval = 0),
                metric = "RMSE")
# Save an object to a file
saveRDS(bag_cv, file = "Datasets/bag_cv.rds")
library(dplyr) # Prevent plyr from overwriting dplyr
min(bag_cv$results$RMSE) # Bagged CV RMSE
Bagging uses the bootstrap to create multiple datasets, each a resampled version of the original dataset. For each bootstrap sample, a regression tree is built, and the bagging model averages the results of the trees. In this implementation, 500 trees were generated.
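To make the mechanics concrete, here is a minimal hand-rolled sketch of the same idea (the treebag method above does this, plus further details, internally; 25 trees are used here just to keep the sketch fast):

```r
# A minimal hand-rolled sketch of bagging: bootstrap, fit a deep tree on
# each sample, then average the trees' predictions.
library(rpart)

set.seed(888)
n_trees <- 25
preds <- matrix(NA, nrow = nrow(baked_test), ncol = n_trees)

for (b in seq_len(n_trees)) {
  boot_idx <- sample(nrow(baked_train), replace = TRUE) # bootstrap sample
  tree_b <- rpart(Popularity ~ ., data = baked_train[boot_idx, ],
                  control = rpart.control(minsplit = 2, cp = 0, xval = 0))
  preds[, b] <- predict(tree_b, newdata = baked_test)
}

bagged_preds <- rowMeans(preds)                       # average over trees
sqrt(mean((bagged_preds - baked_test$Popularity)^2))  # test RMSE
```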
The 10 most important variables using the bagging model are listed below:
# Top 10 most important variables
varImp(bag_cv$finalModel) %>% arrange(desc(Overall)) %>% head(10)
## Overall
## BPM 415.6968
## Energy 404.1048
## Danceability 388.6923
## Year 382.2472
## Loudness 292.1364
## Liveness 259.4494
## Valence 214.4778
## Length 197.3110
## Acousticness 165.8394
## Title_Word_Count 104.5483
set.seed(888)
################### Random Forests ###################
param_grid_rf <- expand.grid(mtry = seq(1, ncol(baked_train) - 1, 1), # sequence from 1 to the number of predictors
                             splitrule = "variance", # "gini" for classification
                             min.node.size = 2) # for each tree

rf_cv <- train(blueprint,
               data = Spotify_train,
               method = "ranger",
               trControl = cv_specs,
               tuneGrid = param_grid_rf,
               importance = "permutation", # needed to use varImp, see https://stackoverflow.com/questions/37279964/variable-importance-with-ranger
               metric = "RMSE")
# Save an object to a file
saveRDS(rf_cv, file = "Datasets/rf_cv.rds")
min(rf_cv$results$RMSE) # Random forest CV RMSE
Random forests seek to improve on the regression tree by growing multiple trees on bootstrap samples. Unlike bagging, however, random forests also limit the number of candidate predictors considered at each split, which decorrelates the trees.
The mtry parameter controls that number of predictors. The 10 most important variables according to the random forest model are listed below:
# Top 10 most important variables
varImp(rf_cv)$importance %>% arrange(desc(Overall)) %>% head(10)
## Overall
## Year 100.000000
## Loudness 76.817406
## Combined_Genre_Folk_Indie 44.861046
## Energy 28.940713
## Danceability 21.703908
## Acousticness 20.009137
## Valence 13.477981
## Length 9.633971
## Combined_Genre_Hip_Pop 9.291288
## by_boudewijn_de_groot 8.990787
This project utilized various machine learning algorithms to predict the popularity of a song. Based on the CV RMSE results, the random forest framework performed best for this task. The model has a prediction error (test RMSE) of 12.81495 when fitted to the testing data.
While it is hard to interpret the exact effect of individual variables in a random forest, the model does provide variable importance. Some of the most important popularity-related features that an artist can control are length, loudness, danceability, and BPM.
Other research on the same topic provides mixed results. Reiman and Ornell (2018) were unsuccessful in confirming whether predicting the popularity of a song is possible. At the other end of the spectrum, Herremans (2014) could predict with 81% accuracy whether a song would make it to the top 10.
It will be for future research to confirm whether these variables are indeed the most important. The results from the other models suggest that older, louder, more danceable, and shorter songs tend to score better. In addition, the final model has an \(R^2\) of about 0.28, which indicates that there is plenty of room for improvement.