Ways to recommend songs based on song features

To calculate similar songs based on features of a given song, we used the data available here. Note: This dataset appears to only have data for songs up to 2021 or 2022. Some options (that I found) for calculating distances between songs are:

  • Cosine similarity
  • Manhattan distance
  • Euclidean distance
  • Minkowski distance

I decided to use cosine similarity because it measures the angle between two vectors, that is, how they point relative to each other. Therefore, even if two songs differ a lot in the magnitudes of their features, they can still count as similar and be recommended as long as their features together point in a similar direction. Cosine similarity also ignores 0-to-0 matches and seems to be very popular in recommendation systems, though that might mostly be true for text analysis, where magnitudes usually vary a lot. Regardless, I used cosine similarity even though a distance measure that takes magnitude into account might actually work better here, since the values are already on a scale of 0-1.
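To make the angle-versus-magnitude trade-off concrete, here is a minimal sketch in plain Python. The three feature vectors and the helper names (cosine_similarity_pair, euclidean_distance) are made up for illustration and are not from the dataset or the project code.

```python
import math

def cosine_similarity_pair(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical feature vectors (made-up values, not from the dataset)
quiet_song = [0.2, 0.1, 0.3]
loud_song = [0.6, 0.3, 0.9]    # same direction as quiet_song, 3x the magnitude
other_song = [0.3, 0.1, 0.1]   # closer in magnitude, different direction

sim_scaled = cosine_similarity_pair(quiet_song, loud_song)
sim_other = cosine_similarity_pair(quiet_song, other_song)
dist_scaled = euclidean_distance(quiet_song, loud_song)
dist_other = euclidean_distance(quiet_song, other_song)
```

With these values, cosine similarity ranks loud_song as the better match for quiet_song, while Euclidean distance ranks other_song as closer. That is exactly the disagreement to keep in mind when the features are already scaled to 0-1.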

I decided to use the features acousticness, danceability, duration_ms, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, and valence to calculate the distances between songs:

feature_cols=['acousticness', 'danceability', 'duration_ms', 'energy',
              'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
              'speechiness', 'tempo', 'valence']

Cosine similarity here compares each song's feature vector with every other song's and produces a song-to-song matrix of similarity scores.

To begin, we should rescale the values within each feature to a common 0-1 range (min-max normalization) so that no single feature, such as duration_ms or tempo, dominates just because of its units, and the distances can be calculated more accurately:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# fit_transform returns a NumPy array (not a DataFrame), one row per song
normalized_df = scaler.fit_transform(df[feature_cols])
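As a rough illustration of what min-max scaling does to a single feature column, here is a small sketch in plain Python. The function name min_max_scale and the tempo values are hypothetical, and the handling of a constant column is just one reasonable choice for the sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: mapped to 0.0 here by choice to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical tempo values in BPM (made up for illustration)
tempo = [60.0, 120.0, 180.0]
scaled_tempo = min_max_scale(tempo)
```

The smallest tempo maps to 0, the largest to 1, and everything else falls proportionally in between, so tempo ends up on the same scale as features like danceability.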

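We should drop duplicate song names so that each name maps to exactly one row; otherwise a lookup by name could return several indices. Note that calling drop_duplicates() on this Series compares its values (the row indices, which are already unique), so the filter has to be applied to the index labels instead:

indices = pd.Series(df.index, index=df['name'])
indices = indices[~indices.index.duplicated(keep='first')]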
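To see why deduplicating a name-to-index lookup depends on whether you compare values or index labels, here is a toy sketch. The DataFrame toy_df and its song names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two different rows share the name "Intro"
toy_df = pd.DataFrame({"name": ["Intro", "Hello", "Intro"]})

lookup = pd.Series(toy_df.index, index=toy_df["name"])

# drop_duplicates() compares the values (0, 1, 2), which are all unique,
# so both "Intro" entries survive
by_value = lookup.drop_duplicates()

# Filtering on duplicated index labels keeps only the first row per name
by_name = lookup[~lookup.index.duplicated(keep="first")]

n_by_value = len(by_value)
n_by_name = len(by_name)
first_intro = int(by_name["Intro"])
```

Only the label-based filter leaves a lookup where each name returns a single row index. Which duplicate to keep (first, last, most popular) is a separate decision this sketch does not settle.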

Now we create a function that returns the top n songs given a song name:

from sklearn.metrics.pairwise import cosine_similarity
cosine = cosine_similarity(normalized_df)

def generate_recommendation(name, n, model_type=cosine):
    """
    Purpose: Function for song recommendations
    Inputs: song title, number of recommendations to give, and similarity matrix to use
    Output: list of recommended songs
    """
    # Get the row position of the given song (assumes df has the default 0..N-1 index)
    index = indices[name]
    # Pair every song's position with its similarity to the given song
    score = list(enumerate(model_type[index]))
    # Sort from most to least similar
    similarity_score = sorted(score, key=lambda x: x[1], reverse=True)
    # Skip position 0 (normally the given song itself) and keep the next n songs
    similarity_score = similarity_score[1:n+1]
    top_songs_index = [i[0] for i in similarity_score]
    # Look up the names of the top-n recommended songs
    top_songs = df['name'].iloc[top_songs_index]
    return top_songs.tolist()

cosine = cosine_similarity(normalized_df) creates the similarity matrix, and the function looks up the row of cosine similarity scores between the given song and every other song. We then sort that row by similarity score (x[1]) in descending order and take positions 1 through n, which are the top n songs, since position 0 is just the given song itself (best similarity score). The names of those n songs are then returned.
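The sort-then-slice step is easier to follow on a tiny example. The similarity row below is made up (it stands in for one row of the similarity matrix, for a song at index 0).

```python
# Hypothetical similarity row for the song at index 0 (made-up scores)
similarity_row = [1.0, 0.2, 0.9, 0.5]
n = 2

# Pair each song index with its score, then sort from most to least similar
score = list(enumerate(similarity_row))
ranked = sorted(score, key=lambda x: x[1], reverse=True)

# Skip position 0 (the query song itself) and keep the next n entries
top_n = ranked[1:n + 1]
top_n_indices = [i for i, _ in top_n]
```

Here the two recommendations are the songs at indices 2 and 3. One caveat: if another song has exactly the same features as the query, it ties at the top score, and position 0 is then not guaranteed to be the query itself, so filtering out the query by index would be a more robust variant.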

Note: Due to a memory problem when using cosine similarity on sparse matrices, I limited the songs used to calculate the similarity matrix to the 30,000 most popular ones.
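For context on the memory issue, a dense 30,000 × 30,000 matrix of 64-bit floats takes about 7.2 GB. One possible workaround, which is not what I used here, is to compute only the one row of similarities needed per request. The sketch below uses NumPy; the function name top_n_similar and the small feature matrix are hypothetical.

```python
import numpy as np

def top_n_similar(features, query_idx, n):
    """Indices of the n rows most cosine-similar to row query_idx."""
    # Normalise each row to unit length so a dot product equals cosine similarity
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.where(norms == 0, 1.0, norms)
    # One row of the similarity matrix: shape (num_songs,), not (num_songs, num_songs)
    sims = unit @ unit[query_idx]
    # Exclude the query song itself
    sims[query_idx] = -np.inf
    # Sort descending and keep the first n indices
    return np.argsort(-sims)[:n].tolist()

# Hypothetical scaled feature matrix (made-up values)
features = np.array([
    [0.2, 0.1, 0.3],
    [0.6, 0.3, 0.9],
    [0.3, 0.1, 0.1],
    [0.0, 1.0, 0.0],
])
recommended = top_n_similar(features, query_idx=0, n=2)
```

This trades a little compute per request for not having to hold the full song-to-song matrix in memory, so it would not need the 30,000-song cap for memory reasons alone.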

Next, we just implement this function in our API, and the song-feature part of the content-based recommendation system is good to go.

Recommending songs based on user features

A difficulty in doing this is finding how user features affect song preferences, as we don't have any (or enough) data on our website. Therefore, we had to, again, get the necessary data from somewhere else. I decided to try to predict what genres users would like based on their demographics. I chose a simple artificial neural network (ANN) to predict how much they would enjoy each genre using their age, gender, location, education, height, weight, and number of siblings.

However, the accuracy of this model is not very good, so there are definitely better ways to model this, or we could have used the data in a different way. For example, it might have been better to use a neural network as a classifier that gives the probability of a user liking a genre, with a rating of 4 or 5 classified as like and 1-3 classified as not like. Height, weight, education, and number of siblings also likely do not affect genre preference significantly, so we could have eliminated them as predictors.
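The like / not-like relabelling idea could be sketched as follows. This is only the label-preparation step, not a model. The function name to_like_label, the threshold parameter, and the sample ratings are hypothetical.

```python
def to_like_label(rating, threshold=4):
    """Map a 1-5 rating to 1 (like) if rating >= threshold, else 0 (not like)."""
    return 1 if rating >= threshold else 0

# Hypothetical genre ratings for one user (made-up values)
ratings = {"rock": 5, "pop": 3, "jazz": 4, "metal": 1}
labels = {genre: to_like_label(r) for genre, r in ratings.items()}
```

A classifier trained on labels like these would output a probability of liking each genre instead of a predicted rating. Whether that actually improves accuracy on this data is untested.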

After saving the model, we just had to load it in the API and use its predictions of which genres the user would like to guide the song recommendations.