Data Analysis and Insights from the Spotify Dataset

In this article, I will lead you through the journey of building a content-based recommender system using the rich Spotify dataset. We will explore the step-by-step process, complemented by practical examples and insightful visualizations.

While the initial steps of collecting, preprocessing, and analyzing data lay a solid foundation, the ultimate goal is to create a recommendation system. This system will harmonize with your musical preferences, suggesting tracks based on genres and descriptions.

In the world of music and recommendation systems, model deployment holds significant importance. Our primary objective is to enhance the music listening experience. As in other areas of AI, deployment choices fall into two key methods: batch recommendation and real-time recommendation.

Batch Recommendation:

In the context of recommendation systems, batch recommendation involves generating suggestions or recommendations in predefined batches or offline. This means that recommendations are computed and prepared ahead of time based on historical data and user preferences. Users receive a set of recommendations at specific intervals, and these recommendations remain consistent until the next batch is generated. This method is often employed when real-time processing is not a requirement, such as in email campaigns or periodic content updates.

Real-time Recommendation:

Real-time recommendation, on the other hand, is focused on providing immediate, on-the-fly suggestions to users as they interact with a platform or application. This method requires the system to process and analyze data in real time, responding to user actions or preferences as they happen. It is commonly used in applications where the user experience relies on immediate, dynamic recommendations, like e-commerce websites, streaming platforms, or social media feeds.
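
To make the difference between the two methods concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and for illustration only: the HISTORY and CATALOG data, and the helper names favorite_genre, batch_recommend, and realtime_recommend. It assumes a very simple rule (recommend tracks from the user's most-played genre) and is not how Spotify or any particular platform works.

```python
from collections import defaultdict

# Hypothetical listening history: (user, genre) pairs.
HISTORY = [("ana", "rock"), ("ana", "rock"), ("ana", "jazz"), ("ben", "pop")]
# Hypothetical catalog: genre -> tracks, already ordered by popularity.
CATALOG = {"rock": ["Track A", "Track B"], "jazz": ["Track C"], "pop": ["Track D"]}


def favorite_genre(history, user):
    """Return the genre the user has played most often, or None."""
    counts = defaultdict(int)
    for name, genre in history:
        if name == user:
            counts[genre] += 1
    return max(counts, key=counts.get) if counts else None


def batch_recommend(history, catalog):
    """Batch: precompute one recommendation list per user ahead of time."""
    users = {name for name, _ in history}
    return {user: catalog.get(favorite_genre(history, user), []) for user in users}


def realtime_recommend(history, catalog, user, recent_events):
    """Real-time: fold the latest actions into the history, then recompute."""
    updated = history + recent_events
    return catalog.get(favorite_genre(updated, user), [])


# Batch output stays fixed until the next scheduled run.
precomputed = batch_recommend(HISTORY, CATALOG)
# Real-time output reacts to what the user has just played.
live = realtime_recommend(HISTORY, CATALOG, "ben", [("ben", "jazz"), ("ben", "jazz")])

print(precomputed["ben"])
print(live)
```

In the batch case, ben keeps receiving pop tracks until the next run, while the real-time call already reflects the two jazz plays.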

Problem Statement:

In Spotify’s vast music library, we encounter a common dilemma: the search for the ideal song becomes a daunting task amidst a diverse range of genres, artists, and tracks. Our mission is to improve the music discovery and listening experience for Spotify users by delivering personalized recommendations that align with their individual musical tastes.

To meet this challenge, our objective is to construct a robust content-based recommender system using the wealth of data available in the Spotify dataset. This system aims to decipher the intricacies of genres and descriptions to curate recommendations that resonate with each listener. However, the journey toward this goal is replete with hurdles, ranging from data exploration and preprocessing to data analysis and visualization.

As we navigate this musical data landscape, our path is illuminated by questions: How can we effectively handle missing values, eliminate duplicates, and ensure data quality? How can we identify the most popular artists and songs within this vast dataset? How can we visualize the insights gained from this musical treasure trove? These questions beckon us to unravel the complexities of building a content-based recommender system that will redefine how Spotify users uncover their favorite tunes.


To overcome the challenge of enhancing the music discovery and listening experience for Spotify users, we propose the creation of a content-based recommender system. This system will leverage the Spotify dataset, focusing on genres and descriptions to offer personalized music recommendations.

1. Data Exploration and Preprocessing:

  • Import the dataset into a Pandas dataframe.
  • Cleanse the data by handling missing values, removing duplicates, and ensuring data quality.
  • Introduce a new column to align with business needs.

2. Data Analysis:

  • Conduct artist popularity analysis, identifying the most popular artists.
  • Visualize artist popularity using a pie chart.
  • Identify and display the most popular song, visualizing it with a scatter plot.
  • Visualize the distribution of song popularity using a histogram.

3. Building a Song Recommendation System:

  • Define a popularity threshold for highly popular songs.
  • Filter and sort songs based on popularity.
  • Select the top song as a recommendation and provide details such as track name, artist, and popularity score.


In this example, I will make use of the Spotify Tracks dataset obtained from Kaggle. This dataset provides a rich repository of Spotify tracks across 125 diverse genres, offering extensive audio feature data. The analysis aims to uncover meaningful connections and insights related to music preferences, genres, and descriptive attributes.

Reading the Spotify Dataset into a Dataframe.

# Read the Spotify dataset into a dataframe
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

spotify_dataset = pd.read_csv(r"C:\Users\suthesh.a\Downloads\spotify_dataset.csv")
print(spotify_dataset.head(10))


  • First, we load the Python libraries required for the analysis.
  • These are pandas for data analysis and manipulation, and matplotlib and seaborn for data visualization.
  • Next, we load the Spotify CSV dataset into a dataframe using the pandas library.
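
Before cleaning the data, it can help to take a quick look at its size, column types, and missing values. The following is a minimal sketch on a hypothetical three-row sample. The SAMPLE_CSV contents and the name sample are made up for illustration and are not taken from the Kaggle file; only the column names follow the ones used in this article.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for spotify_dataset.csv.
SAMPLE_CSV = """track_name,artists,album_name,popularity
Song A,Artist X,Album 1,73
Song B,,Album 2,55
Song C,Artist Y,,12
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(sample.shape)           # (rows, columns)
print(sample.dtypes)          # column types inferred by pandas
print(sample.isnull().sum())  # number of missing values per column
```

The same three calls can be run on spotify_dataset right after loading it.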

Replacing the Missing values:

# Default values for the columns that may contain missing values.

filling_missing_values = {"artists": "unknown artist",
                          "album_name": "unknown album name",
                          "track_name": "unknown track_name"}

# Replace the missing values with the defaults, then count any that remain.

replace_missing_values = spotify_dataset.fillna(value=filling_missing_values)
missing_value_check = replace_missing_values.isnull().sum()


  • The dataset may contain many missing values in its columns; we replace these empty cells with substitute values before the analysis.
  • In this analysis, I filled the missing cells with default values using the fillna method.
  • To check whether any missing values remain in the dataset, we can use the isnull() function.
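
As a small illustration of the fillna and isnull() steps, here is a sketch on a hypothetical two-row DataFrame. The sample data and the names defaults, filled, missing_before, and missing_after are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample with one missing artist and one missing album name.
sample = pd.DataFrame({
    "artists": ["Artist X", None],
    "album_name": [None, "Album 2"],
    "track_name": ["Song A", "Song B"],
    "popularity": [73, 55],
})

# One default per column; columns not listed here are left untouched.
defaults = {
    "artists": "unknown artist",
    "album_name": "unknown album name",
    "track_name": "unknown track_name",
}

# fillna returns a new DataFrame; the original is not modified.
filled = sample.fillna(value=defaults)

missing_before = int(sample.isnull().sum().sum())
missing_after = int(filled.isnull().sum().sum())
print(missing_before, missing_after)
```

Because fillna returns a new DataFrame, any later step that should benefit from the filled values needs to use the returned object rather than the original one.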

Checking for Duplicates:

# Check for duplicates, then drop them.
duplicate_check = spotify_dataset.duplicated()
spotify_dataset = spotify_dataset.drop_duplicates()


  • duplicate_check = spotify_dataset.duplicated(): This line of code creates a Boolean Series called duplicate_check that indicates whether each row in the spotify_dataset DataFrame is a duplicate of a previous row. Each value in the Series is True if the corresponding row is a duplicate and False if it’s not.
  • spotify_dataset = spotify_dataset.drop_duplicates(): This line of code drops the duplicate rows from the DataFrame, keeping the first occurrence of each row. It is called on the DataFrame rather than on duplicate_check, because duplicate_check only holds True/False flags.
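
The behavior of duplicated() and drop_duplicates() is easy to see on a tiny example. The sketch below uses a hypothetical three-row DataFrame; the names sample, duplicate_mask, deduplicated, and by_key are chosen for illustration.

```python
import pandas as pd

# Hypothetical sample in which the third row repeats the first.
sample = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song A"],
    "artists": ["Artist X", "Artist Y", "Artist X"],
    "popularity": [73, 55, 73],
})

# True for every row that repeats an earlier row.
duplicate_mask = sample.duplicated()

# Drop fully identical rows, keeping the first occurrence.
deduplicated = sample.drop_duplicates()

# Optionally treat rows as duplicates when only some columns match.
by_key = sample.drop_duplicates(subset=["track_name", "artists"])

print(duplicate_mask.tolist())
print(len(sample), len(deduplicated), len(by_key))
```

Whether to deduplicate on all columns or only on a key such as track name and artist depends on what counts as "the same song" for the analysis.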

Step-2 Performing Data analysis:

Artist popularity analysis:

# Artist popularity analysis:
artist_popularity = spotify_dataset.groupby('artists')['popularity'].sum().reset_index()

artist_popularity_sorted = artist_popularity.sort_values(by='popularity', ascending=False)
print(artist_popularity_sorted)


  • The dataset is grouped by the ‘artists’ column using the groupby() function. This helps in aggregating the data for each unique artist in the subset.
  • The total popularity of each artist is calculated by summing up the ‘popularity’ values within each group using the sum() function.
  • The results are stored in the ‘artist_popularity’ DataFrame, which now contains the aggregated popularity values for each artist within the subset.
  • To identify the most popular artists, the ‘artist_popularity’ DataFrame is sorted in descending order based on the ‘popularity’ column using the sort_values() function. The result, named ‘artist_popularity_sorted’, provides a ranked list of artists based on their popularity within the limited dataset.

The above output is the list of Top artists sorted by popularity.
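
To show the group, sum, and sort steps in isolation, here is a minimal sketch on a hypothetical four-row sample. The data is invented; the variable names mirror the ones used in this article.

```python
import pandas as pd

# Hypothetical sample: Artist X appears twice.
sample = pd.DataFrame({
    "artists": ["Artist X", "Artist Y", "Artist X", "Artist Z"],
    "popularity": [70, 55, 20, 80],
})

# Sum the popularity of every track for each artist.
artist_popularity = sample.groupby("artists")["popularity"].sum().reset_index()

# Rank artists from the highest total to the lowest.
artist_popularity_sorted = artist_popularity.sort_values(by="popularity", ascending=False)

print(artist_popularity_sorted.head(3))
```

Note that a sum rewards artists with many tracks: Artist X ranks first here with a total of 90, although Artist Z has the single most popular track. Using mean() instead of sum() would be one alternative, depending on the question being asked.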

Visualizing Artist Popularity Using a Pie Chart:

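# Visualizing the dataset.
plt.pie(artist_popularity_sorted['popularity'], labels=artist_popularity_sorted['artists'])
plt.title('Popularity Pie Chart')
plt.show()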


  • The ‘plt.pie()’ function from the Matplotlib library is used to create a pie chart visualization.
  • The popularity data of artists, previously sorted by their total popularity, is used to plot the slices of the pie chart.
  • The ‘labels’ parameter of ‘plt.pie()’ is set to the artist names from the sorted popularity data, which provides the labels for each slice.
  • The ‘plt.title()’ function assigns the title “Popularity Pie Chart” to the chart.
  • Finally, ‘plt.show()’ displays the generated pie chart with the popularity distribution of artists in the dataset.
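
A pie chart with one slice per artist becomes hard to read when the dataset holds a large number of artists. One possible way around this, sketched here under assumptions, is to keep only the top few artists and fold the rest into a single "Other" slice. The helper top_n_with_other, the sample totals, and the output file name are hypothetical and not part of the original analysis.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd


def top_n_with_other(totals: pd.Series, n: int) -> pd.Series:
    """Keep the n largest values and sum the remainder into 'Other'."""
    ranked = totals.sort_values(ascending=False)
    top = ranked.head(n)
    rest = ranked.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({"Other": rest})])
    return top


# Hypothetical total popularity per artist.
totals = pd.Series({
    "Artist X": 90, "Artist Y": 55, "Artist Z": 80, "Artist W": 10, "Artist V": 5,
})
slices = top_n_with_other(totals, n=3)

fig, ax = plt.subplots()
ax.pie(slices.values, labels=slices.index, autopct="%1.1f%%")
ax.set_title("Popularity Pie Chart")
fig.savefig("artist_popularity_pie.png")
```

With the real data, totals could be built from artist_popularity_sorted, for example by setting ‘artists’ as the index and selecting the ‘popularity’ column.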

Identifying and Displaying the Most Popular Song in the Spotify Dataset:

#Determine the Song popularity.
song_popularity = spotify_dataset.groupby([‘track_name’,’artists’])[‘popularity’].sum().reset_index()
song_count= spotify_dataset.groupby([‘track_name’,’artists’]).size().reset_index(name=’count’)
# print(song_count)
# print(song_popularity)#merge the Dataframe
song_stats=pd.merge(song_popularity,song_count,on=[‘track_name’,’artists’])most_popular_song = song_stats.sort_values(by=[‘popularity’, ‘count’], ascending=False).iloc[0]print(“Most Popular Song:”)
print(“Track Name:”, most_popular_song[‘track_name’])
print(“Artists:”, most_popular_song[‘artists’])
print(“Total Popularity:”, most_popular_song[‘popularity’])
print(“Total Listeners:”, most_popular_song[‘count’])


  • The above code segment starts by grouping the Spotify dataset based on both the ‘track_name’ and ‘artists’ columns to calculate song popularity and count.
  • Two DataFrames, ‘song_popularity’ and ‘song_count’, are generated by aggregating the data.
  • The ‘song_stats’ DataFrame is then created by merging ‘song_popularity’ and ‘song_count’ using ‘track_name’ and ‘artists’ as the common columns.
  • The code identifies the most popular song by sorting ‘song_stats’ based on both popularity and count in descending order, and then selects the top row.
  • Finally, the details of the most popular song, including its track name, artists, total popularity, and total listeners, are printed to the console. Note that ‘count’ is the number of rows in the dataset for that track and artist pair, which the code labels as total listeners.

The above output is the track name of the most popular song, and it displays the artist’s name, popularity count and Total listeners.
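
The same statistics can also be computed in one pass with pandas named aggregation, which avoids the separate merge step. This is an alternative sketch rather than the method used above, and the sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample: "Song A" appears in two rows.
sample = pd.DataFrame({
    "track_name": ["Song A", "Song A", "Song B", "Song C"],
    "artists": ["Artist X", "Artist X", "Artist Y", "Artist Z"],
    "popularity": [60, 60, 90, 40],
})

# Compute the popularity sum and the row count per song in a single groupby.
song_stats = (
    sample.groupby(["track_name", "artists"])
    .agg(popularity=("popularity", "sum"), count=("popularity", "size"))
    .reset_index()
)

most_popular_song = song_stats.sort_values(
    by=["popularity", "count"], ascending=False
).iloc[0]

print(most_popular_song["track_name"], most_popular_song["popularity"], most_popular_song["count"])
```

In this sample, "Song A" wins with a total of 120 because it appears in two rows, even though "Song B" has the highest single score (90). If a song can appear several times in the data, summing popularity favors repeated entries, so it is worth checking whether a sum or a maximum better matches the intended meaning of "most popular".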

Visualize the most popular song using scatter plot.

# Visualize the most popular song using a scatter plot.
plt.scatter(most_popular_song['count'], most_popular_song['popularity'], color='blue', label='Most Popular Song')
plt.xlabel('Number of Times Heard')
plt.ylabel('Popularity')
plt.title('Scatter Plot of Most Popular Song')
plt.legend()
plt.show()


  • plt.scatter(most_popular_song['count'], most_popular_song['popularity'], color='blue', label='Most Popular Song'): This line creates a scatter plot with the x-axis representing the number of times the most popular song has been heard (count) and the y-axis representing its popularity.
  • Because only one song is plotted, the chart contains a single blue dot, labeled as the most popular song.
  • plt.xlabel('Number of Times Heard'): Sets the label for the x-axis as “Number of Times Heard”.
  • plt.ylabel('Popularity'): Sets the label for the y-axis as “Popularity”.
  • plt.title('Scatter Plot of Most Popular Song'): Sets the title of the plot as “Scatter Plot of Most Popular Song”.
  • plt.legend(): Displays the legend in the plot, indicating that the blue dot represents the most popular song.
  • plt.show(): Finally, this command displays the scatter plot.
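
A scatter plot with a single point gives little context. As a possible variation, the sketch below plots every song in gray and highlights the most popular one in blue. The song_stats values, the names top and others, and the output file name are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-song statistics, shaped like song_stats in this article.
song_stats = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C"],
    "count": [2, 1, 1],
    "popularity": [120, 90, 40],
})

# Pick the top song, then separate it from the rest.
top = song_stats.sort_values(by=["popularity", "count"], ascending=False).iloc[0]
others = song_stats.drop(index=top.name)

fig, ax = plt.subplots()
ax.scatter(others["count"], others["popularity"], color="gray", label="Other Songs")
ax.scatter([top["count"]], [top["popularity"]], color="blue", label="Most Popular Song")
ax.set_xlabel("Number of Times Heard")
ax.set_ylabel("Popularity")
ax.set_title("Scatter Plot of Song Popularity")
ax.legend()
fig.savefig("most_popular_song_scatter.png")
```

Plotting the other songs alongside the top one makes it easier to judge how far ahead the most popular song really is.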

Visualizing Song Popularity Distribution with a Histogram:

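# Visualize the dataset using a histogram.
column_to_plot = spotify_dataset['popularity']
num_bins = 5
plt.hist(column_to_plot, bins=num_bins, edgecolor='black')
plt.xlabel('Popularity')
plt.ylabel('Frequency')
plt.title('Popularity Histogram')
plt.show()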

  • column_to_plot = spotify_dataset['popularity']: Selects the ‘popularity’ column from the dataset as the data to be plotted.
  • num_bins = 5: Defines the number of bins (intervals) to categorize the data. In this case, it’s set to 5.
  • plt.hist(column_to_plot, bins=num_bins, edgecolor='black'): Creates the histogram plot using the ‘popularity’ data with the specified number of bins and black edges around the bars.
  • plt.xlabel('Popularity'): Sets the label for the x-axis as “Popularity”.
  • plt.ylabel('Frequency'): Sets the label for the y-axis as “Frequency”, representing how many songs fall into each bin.
  • plt.title('Popularity Histogram'): Sets the title of the plot as “Popularity Histogram”.
  • plt.show(): Displays the histogram plot.
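
To see what the bins contain without drawing the chart, the counts can be computed directly with numpy.histogram, which splits the value range into equal-width bins in the same way when given a bin count. The sketch below uses ten hypothetical popularity scores; the data and variable names are for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical popularity scores between 0 and 100.
popularity = pd.Series([0, 5, 18, 22, 40, 41, 59, 63, 80, 100])
num_bins = 5

# counts[i] is the number of songs between bin_edges[i] and bin_edges[i + 1].
counts, bin_edges = np.histogram(popularity, bins=num_bins)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count} songs")
```

With five bins over the range 0 to 100, each bin is 20 points wide. Every bin includes its left edge and excludes its right edge, except the last bin, which includes both.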

Building a Song Recommendation System based on Popularity:

# Recommend a song name based on popularity.
high_popularity_threshold = 80
high_popularity_songs = spotify_dataset[spotify_dataset['popularity'] >= high_popularity_threshold]
sorted_high_popularity_songs = high_popularity_songs.sort_values(by='popularity', ascending=False)
recommended_song = sorted_high_popularity_songs.iloc[0]
print("Recommended Song:")
print("Track Name:", recommended_song['track_name'])
print("Artist:", recommended_song['artists'])
print("Popularity:", recommended_song['popularity'])


  • High Popularity Threshold: A threshold value (here, 80) is set to determine what is considered a “highly popular” song.
  • Filtering High Popularity Songs: The dataset is filtered to retain only songs with popularity scores equal to or above the defined threshold.
  • Sorting by Popularity: The filtered songs are then sorted in descending order based on their popularity scores, ensuring that the most popular songs come first.
  • Extracting Recommendation: The song at the top of the sorted list, representing the most popular song among those meeting the threshold, is selected as the recommended song.
  • Printing Recommendation: The details of the recommended song are printed, including its track name, artist, and popularity score.

Output:

Recommended Song:
Track Name: Hold on
Artist: Chord street
Popularity: 82
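
The same filter-and-sort logic can be wrapped in a small function that returns several recommendations and copes with the case where no song reaches the threshold. This is a sketch under assumptions: the function name recommend_popular, its parameters, and the sample data are hypothetical.

```python
import pandas as pd


def recommend_popular(songs: pd.DataFrame, threshold: int = 80, top_n: int = 1) -> pd.DataFrame:
    """Return up to top_n songs with popularity >= threshold, most popular first."""
    popular = songs[songs["popularity"] >= threshold]
    ranked = popular.sort_values(by="popularity", ascending=False)
    # head() returns an empty DataFrame when nothing qualifies,
    # whereas iloc[0] would raise an IndexError.
    return ranked.head(top_n)[["track_name", "artists", "popularity"]]


# Hypothetical sample data.
sample = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C"],
    "artists": ["Artist X", "Artist Y", "Artist Z"],
    "popularity": [82, 91, 40],
})

top_two = recommend_popular(sample, threshold=80, top_n=2)
none_found = recommend_popular(sample, threshold=95)

print(top_two)
print(none_found.empty)
```

Returning a DataFrame instead of a single row leaves it to the caller to decide how to present the result, or what to do when it is empty.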


While this approach may not suit every scenario, it effectively addresses our requirements. It’s essential to acknowledge that this solution may have its limitations and unexplored opportunities. Your valuable input and suggestions on this methodology and the article are warmly welcomed, as they can contribute to further enhancements and refinements.
