groovydata

Analyzing the Data of the Spotify Regional Top 200 Song Charts to Predict Success

By Amanuel Awoke, Ferzam Mohammad, and Josue Velasquez


Image from The Guardian

Introduction

Motivation

The music industry has changed a lot in the last decade with the introduction of streaming services like Apple Music and Spotify. Services like Spotify allow users to stream music for personal consumption, either for free or for a subscription fee. These services have made it easier to consume music and have increased opportunities for people to start producing music, but they have also changed how musicians make money. Whenever a user listens to a song on a streaming service, the service typically keeps track of the number of “streams” that song has. Music artists are then paid a small amount based on the number of streams their music has accumulated. Given how little these artists are paid from streaming services, maximizing the revenue made from a song is valuable for those looking to release music on these services. Stream count also determines where a song stands in the streaming services’ popularity lists, and making it onto their top 100 or 200 songs is a factor considered in whether these songs are added to global, official top songs charts such as the Billboard 200.

Our group thought it would be interesting to see if we could predict how popular a song might be given different features of the song (e.g. how fast or slow it is, its mood, how many listeners its artist already gets on average, etc.). If we can estimate how many plays a song will get, we can also estimate how much money it will make on a streaming service. Much like the Moneyball scenario, it’s possible that artists are focusing on producing music that meets criteria which they think make a song popular when, in reality, they should be focusing on other aspects of their music. Understanding what components of a song make it popular would help artists figure out the best way to produce music in order to make money off of these streaming services.

The Moneyball story demonstrated the importance of data science in producing a strong baseball team, and while music is different from sports, our project should hopefully reflect similar data science practices in order to reach a valuable conclusion. It may be relatively straightforward to conclude whether a song by Taylor Swift will end up on the top 200 chart given her “incredibly loyal fanbase” of over 40 million people, but maybe there are other characteristics shared among popular songs that help make a song more popular. Data science practices help us here by giving us tools to identify characteristics in a song, clarify how those characteristics might relate to stream count, and determine whether any elements should be focused on when producing music.

From this point forward when we use the word “track” it is synonymous with “song.”

Goal Hypothesis

Are there traits of a song that can be used to determine future success? If so, what are they?

Defining Success

We are defining the success of a track by its appearance on the Top 200, as well as its ranking on the Top 200 (the higher the better).

Collect Data

This is the first step in the data science lifecycle, where we identify and gather information. We gather data from the Spotify Charts Regional Top 200 to identify which songs had the highest stream counts on the global daily chart (the one the download URLs below point at), sampling the first day of each month from January 1st, 2017 through December 1st, 2017. For each track, Spotify Charts provides its stream count, its Top 200 rank, and the artist(s) who created it. Spotify Charts already compiles the data into CSV files, so it isn’t necessary to scrape the website directly. If you want to download one yourself, select a date in the dropdown at the top right of the website, then select “Download to CSV” further up. The pandas method read_csv() was used to load the CSV files into dataframes.

import os
import pandas as pd
import requests
from bs4 import BeautifulSoup
import spotipy
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Since the download URLs of the CSV files follow a consistent pattern based on the date they record, we used a loop to build the links and later download all of the files.

# Collect links from spotify charts top 200 streams per day
ref_str = "https://spotifycharts.com/regional/global/daily/"
ref_arr = []


for year in range(2017, 2018):
    date = ""
    
    endingMonth = 12
    if year == 2020:
        endingMonth = 10
        
    for month in range (1, endingMonth + 1):

        if int(month) < 10:
            month = "0" + str(month)

        date = str(year) + "-" + str(month) + "-" + "01" + "/download"
        date = ref_str + date
        ref_arr.append(date)

ref_arr
['https://spotifycharts.com/regional/global/daily/2017-01-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-02-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-03-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-04-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-05-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-06-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-07-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-08-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-09-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-10-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-11-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-12-01/download']
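As a more compact sketch of the same idea, pandas can generate the first day of each month directly. The helper name monthly_chart_urls and its default arguments are our own illustration and are not part of the original notebook.

```python
import pandas as pd

BASE_URL = "https://spotifycharts.com/regional/global/daily/"

def monthly_chart_urls(start="2017-01-01", end="2017-12-01"):
    """Build one download URL per month, using the first day of each month."""
    # freq="MS" means "month start", so this yields 2017-01-01, 2017-02-01, ...
    first_days = pd.date_range(start=start, end=end, freq="MS")
    return [BASE_URL + day.strftime("%Y-%m-%d") + "/download" for day in first_days]

urls = monthly_chart_urls()
print(len(urls))
```

The slice [48:58] used later to recover the date still works on these strings, since the base URL is unchanged.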
#Loop downloading and appending of dataframes 

df = pd.DataFrame(columns =['position', 'track_name', 'artist', 'streams', 'url', 'date'] )
#make dir to save to
path = "sheets"
folderExists = False
try:
    os.mkdir(path)
except FileExistsError:
    print ("Folder already exists")
    folderExists = True

for i in ref_arr:

    #String manipulation to read from the correct csv files
    date = i[48:58]
    fileName = "regional-global-daily-" + date + ".csv"
    if not folderExists:
        #Only hit the network when the sheets folder was just created
        print("Downloading... " + fileName)
        r = requests.get(i, allow_redirects = True)
        #Write straight into the sheets folder and close the file when done
        with open(path + "/" + fileName, "wb") as f:
            f.write(r.content)

    df_new = pd.read_csv(path + "/" + fileName)
    df_new.columns= ['position', 'track_name', 'artist', 'streams', 'url']
    df_new['date'] = date
    
    df_new = df_new.iloc[1:] #deletes junk row from csv conversion
    df = pd.concat([df, df_new]) #DataFrame.append was removed in pandas 2.0

print("Done")
df = df.reset_index() # Sets index back to being the regular 0-based index. This is really helpful when trying to add more to the dataframe later, because otherwise there are lots of duplicate indices
df['streams'] = df['streams'].astype(int) #streams are a string of a num, must wrap as type int always
Folder already exists
Done
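The junk row dropped with iloc[1:] suggests that each file carries an extra line above the real header. As a hedged sketch, and assuming that layout (one note line, then the header row), read_csv can skip the line while parsing so that streams is read as a number from the start. The miniature file below is a hypothetical stand-in: its header names and URLs are invented for illustration, and only the first two track rows are borrowed from the chart shown in this post.

```python
import io
import pandas as pd

# Hypothetical miniature of a downloaded chart file: one note line, then the header.
raw_csv = io.StringIO(
    "Note: illustrative placeholder line above the header\n"
    "Position,Track Name,Artist,Streams,URL\n"
    "1,Starboy,The Weeknd,3135625,https://open.spotify.com/track/EXAMPLE1\n"
    "2,Closer,The Chainsmokers,3015525,https://open.spotify.com/track/EXAMPLE2\n"
)

# skiprows=1 drops the note line, so the second line becomes the header.
chart = pd.read_csv(raw_csv, skiprows=1)
chart.columns = ['position', 'track_name', 'artist', 'streams', 'url']
print(chart.dtypes)
```

With the header parsed correctly, no string-to-int conversion of streams is needed afterwards. It is worth inspecting a real file before relying on this.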

Wrangling data into dataframe

df
index position track_name artist streams url date
0 1 1 Starboy The Weeknd 3135625 https://open.spotify.com/track/5aAx2yezTd8zXrk... 2017-01-01
1 2 2 Closer The Chainsmokers 3015525 https://open.spotify.com/track/7BKLCZ1jbUBVqRi... 2017-01-01
2 3 3 Let Me Love You DJ Snake 2545384 https://open.spotify.com/track/4pdPtRcBmOSQDlJ... 2017-01-01
3 4 4 Rockabye (feat. Sean Paul & Anne-Marie) Clean Bandit 2356604 https://open.spotify.com/track/5knuzwU65gJK7IF... 2017-01-01
4 5 5 One Dance Drake 2259887 https://open.spotify.com/track/1xznGGDReH1oQq0... 2017-01-01
... ... ... ... ... ... ... ...
2395 196 196 Rockabye (feat. Sean Paul & Anne-Marie) Clean Bandit 552118 https://open.spotify.com/track/5knuzwU65gJK7IF... 2017-12-01
2396 197 197 Rake It Up (feat. Nicki Minaj) Yo Gotti 551576 https://open.spotify.com/track/4knL4iPxPOZjQzT... 2017-12-01
2397 198 198 New Freezer (feat. Kendrick Lamar) Rich The Kid 550167 https://open.spotify.com/track/4pYZLpX23Vx8rwD... 2017-12-01
2398 199 199 All Night Steve Aoki 548039 https://open.spotify.com/track/5mAxA6Q1SIym6dP... 2017-12-01
2399 200 200 113 Booba 546878 https://open.spotify.com/track/6xqAP7kpdgCy8lE... 2017-12-01

2400 rows × 7 columns

Data Processing

Spotipy is a lightweight Python library for the Spotify Web API. We use it to retrieve more detailed data for tracks now that their names have been retrieved from the Spotify Top 200. We must first authenticate our usage of the API using a Spotify account.

import os
import spotipy
from spotipy.oauth2 import SpotifyOAuth


# Credentials are read from environment variables so that the client secret
# is never published alongside the notebook
SPOTIPY_CLIENT_ID = os.environ["SPOTIPY_CLIENT_ID"]
SPOTIPY_CLIENT_SECRET = os.environ["SPOTIPY_CLIENT_SECRET"]
SPOTIPY_CLIENT_REDIRECT = "https://amanuelawoke.com/groovydata/"

scope = "user-library-read"

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope=scope, client_id=SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET, redirect_uri=SPOTIPY_CLIENT_REDIRECT))


We’re going to start by using the Spotify API to get more information about all the tracks we found in the Top 200 chart for the timeframe we described above. The Spotify API gives us the ability to get “audio features” for a song given the track id that Spotify creates for every song. These “audio features” include characteristics like loudness, positivity, danceability, how energetic the song is, the speed of the song, and a few other similar characteristics that have been determined by Spotify using their own machine learning algorithms.

First, we need an id for every song and artist in our dataframe to be able to query the Spotify API for a specific track or artist. Here, we get track and artist ids, and we also query the audio features of each track id. We do all of this in a single pass over the dataframe because a large number of queries through the Spotify API takes time. For testing, we cached the resulting dataframe rather than rebuilding it every time.

import xlsxwriter
import openpyxl

artist_id_list = []
track_id_list = []
popularity_index_list = []
follower_count_list = []
audio_features_df = pd.DataFrame()

#if cached df exists dont search again, else search again
if not os.path.exists("cached_df.xlsx"):
    #Take each song and lookup its audio features, then create a dataframe for them
    print("Searching...")
    for index, row in df.iterrows():
        trackName = row['track_name']
        track_id = ""
        artist_id = ""
        # We need to check if our track_name received was a nan value. Idk how these got in here, but there are nans
        if(type(trackName) == str):
            #delimit with +'s for spotipy search query
            trackNameWithoutSpaces = '+'.join(trackName.split())
            searchQuery = sp.search(trackNameWithoutSpaces, 1, 0)
            if (len(searchQuery['tracks']['items']) != 0):
                
                track_object = searchQuery['tracks']['items'][0]
                track_id = track_object['id']
                track_id_list.append(track_id)

                #if there are several artists, return the first artist
                artist_object = track_object['artists'][0] if type(track_object['artists']) is list else track_object['artists']
                artist_id = artist_object['id']
                artist_id_list.append(artist_id)

    
                artist_object_real = sp.artist(artist_id)
                followers_object = artist_object_real['followers']
                followers_value = followers_object['total']
                follower_count_list.append(followers_value)
                popularity_value = artist_object_real['popularity']
                popularity_index_list.append(popularity_value)

            # If our query returned nothing then append a nan in the place of artist and track for this entry
            else:
                artist_id_list.append(np.nan)
                track_id_list.append(np.nan)
                
                popularity_index_list.append(np.nan)
                follower_count_list.append(np.nan)

        # If we had stored a nan, then just plan to append a nan in this position
        else:
            artist_id_list.append(np.nan)
            track_id_list.append(np.nan)
            
            popularity_index_list.append(np.nan)
            follower_count_list.append(np.nan)
       
        #Defining audio features as nan to begin    
        audiofeatures = {'duration_ms' : np.nan, 'key' : np.nan, 'mode' : np.nan, 'time_signature' : np.nan, 'acousticness' : np.nan, 'danceability' : np.nan, 'energy' : np.nan, 'instrumentalness' : np.nan, 'liveness' : np.nan, 'loudness' : np.nan, 'speechiness' : np.nan, 'valence' : np.nan, 'tempo' : np.nan, 'id' : np.nan, 'uri' : np.nan, 'track_href' : np.nan, 'analysis_url' : np.nan, 'type' : np.nan, }

        # If we successfully found a track when we did our search, then get the audio features for that
        if (track_id != ""):
            audiofeatures = sp.audio_features(track_id)[0]
        #Append the audio features
        audio_features_df = pd.concat([audio_features_df, pd.DataFrame([audiofeatures])], ignore_index=True) #DataFrame.append was removed in pandas 2.0

    #adds artist id list 
    audio_features_df['artist_id'] = artist_id_list 
    audio_features_df['popularity_index'] = popularity_index_list
    audio_features_df['follower_count'] = follower_count_list

    # Store the created data frame into the cache (ExcelWriter.save() was removed in pandas 2.0)
    audio_features_df.to_excel('cached_df.xlsx', sheet_name='Sheet1', engine='openpyxl')
    
else: #access the cached df if it exists
 
    print("Cached dataframe found.")
    audio_features_df = pd.read_excel("cached_df.xlsx", engine = "openpyxl")
    audio_features_df.drop(["Unnamed: 0"], axis=1, inplace=True) #drop the old index column that to_excel saved alongside the data

audio_features_df
Cached dataframe found.
acousticness analysis_url danceability duration_ms energy id instrumentalness key liveness loudness ... speechiness tempo time_signature track_href type uri valence artist_id popularity_index follower_count
0 0.14100 https://api.spotify.com/v1/audio-analysis/7MXV... 0.679 230453.0 0.587 7MXVkk9YMctZqd1Srtv4MB 0.000006 7.0 0.137 -7.015 ... 0.2760 186.003 4.0 https://api.spotify.com/v1/tracks/7MXVkk9YMctZ... audio_features spotify:track:7MXVkk9YMctZqd1Srtv4MB 0.486 1Xyo4u8uXC1ZmMpatF05PJ 94.0 26720759.0
1 0.41400 https://api.spotify.com/v1/audio-analysis/7BKL... 0.748 244960.0 0.524 7BKLCZ1jbUBVqRi2FVlTVw 0.000000 8.0 0.111 -5.599 ... 0.0338 95.010 4.0 https://api.spotify.com/v1/tracks/7BKLCZ1jbUBV... audio_features spotify:track:7BKLCZ1jbUBVqRi2FVlTVw 0.661 69GGBxA162lTqCwzJG5jLp 84.0 17093912.0
2 0.23500 https://api.spotify.com/v1/audio-analysis/3ibK... 0.656 256733.0 0.578 3ibKnFDaa3GhpPGlOUj7ff 0.000000 7.0 0.118 -8.970 ... 0.0922 94.514 4.0 https://api.spotify.com/v1/tracks/3ibKnFDaa3Gh... audio_features spotify:track:3ibKnFDaa3GhpPGlOUj7ff 0.556 20s0P9QLxGqKuCsGwFsp7w 69.0 2055274.0
3 0.40600 https://api.spotify.com/v1/audio-analysis/5knu... 0.720 251088.0 0.763 5knuzwU65gJK7IF5yJsuaW 0.000000 9.0 0.180 -4.068 ... 0.0523 101.965 4.0 https://api.spotify.com/v1/tracks/5knuzwU65gJK... audio_features spotify:track:5knuzwU65gJK7IF5yJsuaW 0.742 6MDME20pz9RveH9rEXvrOM 80.0 4092589.0
4 0.00776 https://api.spotify.com/v1/audio-analysis/1zi7... 0.792 173987.0 0.625 1zi7xx7UVEFkmKfv06H8x0 0.001800 1.0 0.329 -5.609 ... 0.0536 103.967 4.0 https://api.spotify.com/v1/tracks/1zi7xx7UVEFk... audio_features spotify:track:1zi7xx7UVEFkmKfv06H8x0 0.370 3TVXtAsR1Inumwj472S9r4 96.0 51374698.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2395 0.40600 https://api.spotify.com/v1/audio-analysis/5knu... 0.720 251088.0 0.763 5knuzwU65gJK7IF5yJsuaW 0.000000 9.0 0.180 -4.068 ... 0.0523 101.965 4.0 https://api.spotify.com/v1/tracks/5knuzwU65gJK... audio_features spotify:track:5knuzwU65gJK7IF5yJsuaW 0.742 6MDME20pz9RveH9rEXvrOM 80.0 4092589.0
2396 0.02200 https://api.spotify.com/v1/audio-analysis/4knL... 0.910 276333.0 0.444 4knL4iPxPOZjQzTUlELGSY 0.000000 1.0 0.137 -8.126 ... 0.3440 149.953 4.0 https://api.spotify.com/v1/tracks/4knL4iPxPOZj... audio_features spotify:track:4knL4iPxPOZjQzTUlELGSY 0.530 6Ha4aES39QiVjR0L2lwuwq 75.0 3109571.0
2397 0.04050 https://api.spotify.com/v1/audio-analysis/2EgB... 0.884 191938.0 0.698 2EgB4n6XyBsuNUbuarr4eG 0.000000 0.0 0.195 -9.101 ... 0.3640 140.068 4.0 https://api.spotify.com/v1/tracks/2EgB4n6XyBsu... audio_features spotify:track:2EgB4n6XyBsuNUbuarr4eG 0.575 1pPmIToKXyGdsCF6LmqLmI 78.0 2419234.0
2398 0.00410 https://api.spotify.com/v1/audio-analysis/0dXN... 0.538 197640.0 0.804 0dXNQ8dckG4eYfEtq9zcva 0.000000 8.0 0.330 -5.194 ... 0.0358 144.992 4.0 https://api.spotify.com/v1/tracks/0dXNQ8dckG4e... audio_features spotify:track:0dXNQ8dckG4eYfEtq9zcva 0.507 7gAppWoH7pcYmphCVTXkzs 76.0 4082406.0
2399 0.00805 https://api.spotify.com/v1/audio-analysis/0leV... 0.740 266672.0 0.510 0leVyLipY7A8ruhkIBqc0E 0.000375 9.0 0.128 -8.042 ... 0.0780 141.534 5.0 https://api.spotify.com/v1/tracks/0leVyLipY7A8... audio_features spotify:track:0leVyLipY7A8ruhkIBqc0E 0.089 0JOxt5QOwq0czoJxvSc5hS 70.0 168927.0

2400 rows × 21 columns
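Since the number of API queries is the slow part, one way to cut it down is to request audio features in batches rather than one track at a time. To our understanding, spotipy's audio_features accepts a list of track ids (up to 100 per call), but check this against the library documentation. The sketch below shows only the batching logic. chunked, fetch_audio_features and fake_fetch are hypothetical names, and fake_fetch stands in for the real API call so the example runs offline.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_audio_features(track_ids, fetch_batch, batch_size=100):
    """Call `fetch_batch` once per chunk and flatten the results in order."""
    features = []
    for batch in chunked(track_ids, batch_size):
        features.extend(fetch_batch(batch))
    return features

# Stand-in for the API call (assumption): with a real client one would pass
# something like sp.audio_features here instead.
def fake_fetch(batch):
    return [{"id": track_id} for track_id in batch]

ids = ["id" + str(n) for n in range(250)]
features = fetch_audio_features(ids, fake_fetch)
print(len(features))
```

For the 2400 rows here, batching would turn 2400 audio-feature requests into 24.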


#Append audio features to master dataframe
df['track_id'] = audio_features_df['id']
df['duration_ms'] = audio_features_df['duration_ms']
df['acousticness'] = audio_features_df['acousticness']
df['danceability'] = audio_features_df['danceability']
df['energy'] = audio_features_df['energy']
df['instrumentalness'] = audio_features_df['instrumentalness']
df['liveness'] = audio_features_df['liveness']
df['loudness'] = audio_features_df['loudness']
df['speechiness'] = audio_features_df['speechiness']
df['valence'] = audio_features_df['valence']
df['tempo'] = audio_features_df['tempo']
df['artist_id'] = audio_features_df['artist_id']
df['popularity_index'] = audio_features_df['popularity_index']
df['follower_count'] = audio_features_df['follower_count']

df = df.drop(columns='index')
df
position track_name artist streams url date track_id duration_ms acousticness danceability energy instrumentalness liveness loudness speechiness valence tempo artist_id popularity_index follower_count
0 1 Starboy The Weeknd 3135625 https://open.spotify.com/track/5aAx2yezTd8zXrk... 2017-01-01 7MXVkk9YMctZqd1Srtv4MB 230453.0 0.14100 0.679 0.587 0.000006 0.137 -7.015 0.2760 0.486 186.003 1Xyo4u8uXC1ZmMpatF05PJ 94.0 26720759.0
1 2 Closer The Chainsmokers 3015525 https://open.spotify.com/track/7BKLCZ1jbUBVqRi... 2017-01-01 7BKLCZ1jbUBVqRi2FVlTVw 244960.0 0.41400 0.748 0.524 0.000000 0.111 -5.599 0.0338 0.661 95.010 69GGBxA162lTqCwzJG5jLp 84.0 17093912.0
2 3 Let Me Love You DJ Snake 2545384 https://open.spotify.com/track/4pdPtRcBmOSQDlJ... 2017-01-01 3ibKnFDaa3GhpPGlOUj7ff 256733.0 0.23500 0.656 0.578 0.000000 0.118 -8.970 0.0922 0.556 94.514 20s0P9QLxGqKuCsGwFsp7w 69.0 2055274.0
3 4 Rockabye (feat. Sean Paul & Anne-Marie) Clean Bandit 2356604 https://open.spotify.com/track/5knuzwU65gJK7IF... 2017-01-01 5knuzwU65gJK7IF5yJsuaW 251088.0 0.40600 0.720 0.763 0.000000 0.180 -4.068 0.0523 0.742 101.965 6MDME20pz9RveH9rEXvrOM 80.0 4092589.0
4 5 One Dance Drake 2259887 https://open.spotify.com/track/1xznGGDReH1oQq0... 2017-01-01 1zi7xx7UVEFkmKfv06H8x0 173987.0 0.00776 0.792 0.625 0.001800 0.329 -5.609 0.0536 0.370 103.967 3TVXtAsR1Inumwj472S9r4 96.0 51374698.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2395 196 Rockabye (feat. Sean Paul & Anne-Marie) Clean Bandit 552118 https://open.spotify.com/track/5knuzwU65gJK7IF... 2017-12-01 5knuzwU65gJK7IF5yJsuaW 251088.0 0.40600 0.720 0.763 0.000000 0.180 -4.068 0.0523 0.742 101.965 6MDME20pz9RveH9rEXvrOM 80.0 4092589.0
2396 197 Rake It Up (feat. Nicki Minaj) Yo Gotti 551576 https://open.spotify.com/track/4knL4iPxPOZjQzT... 2017-12-01 4knL4iPxPOZjQzTUlELGSY 276333.0 0.02200 0.910 0.444 0.000000 0.137 -8.126 0.3440 0.530 149.953 6Ha4aES39QiVjR0L2lwuwq 75.0 3109571.0
2397 198 New Freezer (feat. Kendrick Lamar) Rich The Kid 550167 https://open.spotify.com/track/4pYZLpX23Vx8rwD... 2017-12-01 2EgB4n6XyBsuNUbuarr4eG 191938.0 0.04050 0.884 0.698 0.000000 0.195 -9.101 0.3640 0.575 140.068 1pPmIToKXyGdsCF6LmqLmI 78.0 2419234.0
2398 199 All Night Steve Aoki 548039 https://open.spotify.com/track/5mAxA6Q1SIym6dP... 2017-12-01 0dXNQ8dckG4eYfEtq9zcva 197640.0 0.00410 0.538 0.804 0.000000 0.330 -5.194 0.0358 0.507 144.992 7gAppWoH7pcYmphCVTXkzs 76.0 4082406.0
2399 200 113 Booba 546878 https://open.spotify.com/track/6xqAP7kpdgCy8lE... 2017-12-01 0leVyLipY7A8ruhkIBqc0E 266672.0 0.00805 0.740 0.510 0.000375 0.128 -8.042 0.0780 0.089 141.534 0JOxt5QOwq0czoJxvSc5hS 70.0 168927.0

2400 rows × 20 columns
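In the table above, the id embedded in the url column does not always match the track_id found by searching on the track name. For Starboy, the url begins with 5aAx2... while track_id is 7MXVkk9YMctZqd1Srtv4MB. This suggests that the first search hit can be a different recording of the song. As a hedged sketch, the id can instead be read straight from the chart URL. track_id_from_url is a hypothetical helper, and the example URL uses a made-up id.

```python
from urllib.parse import urlparse

def track_id_from_url(url):
    """Return the id from an open.spotify.com track URL, or None if it does not fit."""
    if not isinstance(url, str):
        return None  # e.g. NaN entries in the dataframe
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "track" and parts[1]:
        return parts[1]
    return None

example_url = "https://open.spotify.com/track/EXAMPLETRACKID"  # hypothetical id
print(track_id_from_url(example_url))
```

Applied to the dataframe, this would look like df['url'].apply(track_id_from_url), and it would avoid one search request per row.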

# Fixing types because some numeric values were stored as strings
df['streams'] = df['streams'].astype(float)
df['position'] = df['position'].astype(int)

Data Visualization and Analysis

Song Properties

We’ve now gathered and manipulated valuable data for each track for each day recorded. The key elements are the following:

The following details define the patterns and properties of music, the way they sound, and what mood they instill:

The following are more extraneous details for identifying tracks in the data wrangling:

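Using this data, we begin trying to observe what traits of a song bring success. First, we observe that the means of repeated random samples of stream counts are approximately normally distributed, with the sample mean most likely falling between 1 and 1.05 million. The individual stream counts themselves are right-skewed, as the gap between the population mean (about 1.02 million) and median (about 0.72 million) printed below shows.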

#Histogram takes 500 random tracks, takes the average of all their streams, then does this 100 times

from scipy.stats import normaltest
from numpy.random import seed
from numpy.random import randn

alpha = 0.05
data = []
for i in range(0,100):
    data.append(np.mean(df['streams'].sample(n=500)))
plt.hist(data)
plt.title("Frequency Distribution of Streams")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
print("Population mean: ", df['streams'].mean())
print("Population median: ", df['streams'].median())
print("Population STDDEV: ", df['streams'].std())
Population mean:  1023582.30625
Population median:  719465.0
Population STDDEV:  804478.6072499221

Figure: “Frequency Distribution of Streams” (histogram of sample means)
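The cell above imports normaltest and sets alpha but never uses them. A minimal sketch of how they might be applied follows. It runs on synthetic, right-skewed numbers that stand in for df['streams'] (an assumption made so the example works without the chart data), so its printed results say nothing about the real charts.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)

# Synthetic stand-in for df['streams']: a right-skewed (lognormal) population.
streams = rng.lognormal(mean=13.5, sigma=0.6, size=2400)

# 100 sample means, each over 500 values drawn without replacement,
# mirroring the sampling loop in the cell above.
sample_means = np.array(
    [rng.choice(streams, size=500, replace=False).mean() for _ in range(100)]
)

alpha = 0.05
_, p_raw = normaltest(streams)
_, p_means = normaltest(sample_means)

# A p-value below alpha rejects the hypothesis that the data is normal.
print("raw values consistent with normal:", p_raw > alpha)
print("sample means consistent with normal:", p_means > alpha)
```

Running the same two tests on df['streams'] and on the data list from the cell above would show whether the bell shape belongs to the stream counts or only to their sample means.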

Our goal is to determine if there are certain values of song properties that result in extremely high or low success. We create a dataframe that only saves the entry of a song at its peak stream count in the Top 200, meaning we are comparing all the peaks.

# Creating version of table with no duplicates, keeping each song's entry with the highest stream count (its peak). It is a fair representation of success.

no_dupes_df = df.copy()
no_dupes_df = no_dupes_df.sort_values('streams', ascending=False).drop_duplicates(['artist', 'duration_ms', 'acousticness', 'danceability', 'energy'], keep='first') 
no_dupes_df
position track_name artist streams url date track_id duration_ms acousticness danceability energy instrumentalness liveness loudness speechiness valence tempo artist_id popularity_index follower_count
200 1 Shape of You Ed Sheeran 7549041.0 https://open.spotify.com/track/7qiZfU4dY1lWllz... 2017-02-01 7qiZfU4dY1lWllzX7mPBI3 233713.0 0.5810 0.825 0.652 0.000000 0.0931 -3.183 0.0802 0.931 95.977 6eUKZXaKkcviH0Ku9w2n3V 91.0 73345259.0
1000 1 Despacito - Remix Luis Fonsi 7332260.0 https://open.spotify.com/track/5CtI0qwDJkDQGwX... 2017-06-01 6rPO02ozF3bM7NnOV4h6s2 228827.0 0.2280 0.653 0.816 0.000000 0.0967 -4.353 0.1670 0.816 178.085 4V8Sr092TqfHkfAA5fXXqG 78.0 9035487.0
2000 1 rockstar Post Malone 5755610.0 https://open.spotify.com/track/7wGoVu4Dady5GV0... 2017-11-01 7ytR5pFWmSjzHJIeQkgog4 181733.0 0.2470 0.746 0.690 0.000000 0.1010 -7.956 0.1640 0.497 89.977 4r63FhuTkUYltbVAg5TQnk 93.0 5174251.0
1600 1 Look What You Made Me Do Taylor Swift 5547962.0 https://open.spotify.com/track/6uFsE1JgZ20EXyU... 2017-09-01 1P17dC1amhFzptugyAO7Il 211853.0 0.2040 0.766 0.709 0.000014 0.1260 -6.471 0.1230 0.506 128.070 06HL4z0CvFAxyc27GXpf02 97.0 34579892.0
1001 2 I'm the One DJ Khaled 5208996.0 https://open.spotify.com/track/72Q0FQQo32KJloi... 2017-06-01 1jYiIOC5d6soxkJP81fxq2 288877.0 0.0533 0.599 0.667 0.000000 0.1340 -4.267 0.0367 0.817 80.984 0QHgL1lAIqAw0HtD7YldmP 82.0 5405048.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
193 194 Famous Kanye West 336134.0 https://open.spotify.com/track/19a3JfW8BQwqHWU... 2017-01-01 19a3JfW8BQwqHWUMbcqSx8 196040.0 0.0711 0.465 0.735 0.000000 0.0975 -3.715 0.1170 0.409 173.935 5K4W6rqBFWDnAN6FQUkS6x 90.0 12912141.0
196 197 Oh Lord MiC LOWRY 331792.0 https://open.spotify.com/track/1sTUEdVO85YU8Ym... 2017-01-01 1ISsiC4Fw6f96kZQegLGiJ 198253.0 0.4070 0.493 0.738 0.000000 0.1300 -6.921 0.2620 0.219 176.071 6fOMl44jA4Sp5b9PpYCkzz 84.0 4600363.0
197 198 Superstition - Single Version Stevie Wonder 331376.0 https://open.spotify.com/track/5lXcSvHRVjQJ3LB... 2017-01-01 1h2xVEoJORqrg71HocgqXd 245493.0 0.0380 0.633 0.634 0.006400 0.0385 -12.115 0.0725 0.872 100.499 7guDJrEfX3qb6FEbdPA5qi 80.0 4654921.0
198 199 Secrets The Weeknd 331233.0 https://open.spotify.com/track/3DX4Y0egvc0slLc... 2017-01-01 1NhPKVLsHhFUHIOZ32QnS2 224693.0 0.0717 0.516 0.764 0.000000 0.1150 -6.223 0.0366 0.376 148.021 5Pwc4xIPtQLFEnJriah9YJ 83.0 11061770.0
199 200 Ni**as In Paris JAY-Z 325951.0 https://open.spotify.com/track/2KpCpk6HjXXLb7n... 2017-01-01 4Li2WHPkuyCdtmokzW2007 219333.0 0.1270 0.789 0.858 0.000000 0.3490 -5.542 0.3110 0.775 140.022 3nFkdlSjzX9mRTtwJOzDYB 85.0 5812536.0

657 rows × 20 columns
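Our definition of success includes both appearing on the Top 200 and the rank reached. A groupby can therefore keep each track's peak entry and also count how often the track charted. This is a sketch on a small made-up table, not the chart data. The column chart_appearances is a name we introduce for illustration, and tracks are keyed by track_name and artist rather than by the audio-feature columns used above.

```python
import pandas as pd

# Made-up miniature of the chart dataframe.
toy = pd.DataFrame({
    "track_name": ["A", "A", "B", "B", "B"],
    "artist": ["x", "x", "y", "y", "y"],
    "date": ["2017-01-01", "2017-02-01", "2017-01-01", "2017-02-01", "2017-03-01"],
    "position": [5, 2, 10, 12, 30],
    "streams": [100.0, 300.0, 80.0, 70.0, 40.0],
})

keys = ["track_name", "artist"]

# Row label of the highest-stream entry for each track, then the rows themselves.
peak_rows = toy.loc[toy.groupby(keys)["streams"].idxmax()]

# Number of sampled chart dates on which each track appeared.
appearances = toy.groupby(keys).size().rename("chart_appearances").reset_index()

summary = peak_rows.merge(appearances, on=keys)
print(summary)
```

Unlike drop_duplicates, this keeps the appearance count, which is another possible measure of success.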

Next, we plot stream count against each song property in turn.

plt.scatter(no_dupes_df['popularity_index'], no_dupes_df['streams'])
plt.title('Streams in Relation to Popularity')
plt.xlabel('popularity value')
plt.ylabel('streams in millions')
print("Mean of popularity index: " + str(no_dupes_df['popularity_index'].mean()))
print("Median of popularity index: " + str(no_dupes_df['popularity_index'].median()))
print("STDDEV of popularity index: " + str(no_dupes_df['popularity_index'].std()))
Mean of popularity index: 81.84579439252336
Median of popularity index: 83.0
STDDEV of popularity index: 10.20413589979866

Figure: “Streams in Relation to Popularity” (scatter plot)

The data appears to cluster around the mean, so we decided to check whether the popularity index was normally distributed.

data = []
for i in range(0,100):
    data.append(np.mean(df['popularity_index'].sample(n=1000)))
plt.hist(data)
plt.title("Frequency Distribution of Popularity Indices")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
Text(0, 0.5, 'Frequency')

Figure: “Frequency Distribution of Popularity Indices” (histogram of sample means)

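The means of repeated samples of the artist popularity index do appear to be approximately normally distributed, although the exact shape varies from sample to sample. Note that this describes the sample means rather than the individual values. It also appears that the majority of songs on the Top 200 come from artists with a popularity of around 80 (mean about 81.8, median 83).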

plt.scatter(no_dupes_df['follower_count'], no_dupes_df['streams'])
plt.title('Streams in Relation to Follower Count')
plt.xlabel('number of artist followers in tens of million')
plt.ylabel('streams in millions')
print("Mean of follower count: " + str(no_dupes_df['follower_count'].mean()))
print("Median of follower count: " + str(no_dupes_df['follower_count'].median()))
print("STDDEV of follower count: " + str(no_dupes_df['follower_count'].std()))
Mean of follower count: 13744629.682242991
Median of follower count: 7748023.0
STDDEV of follower count: 16091903.457271803

Figure: “Streams in Relation to Follower Count” (scatter plot)

plt.scatter(no_dupes_df['duration_ms'], no_dupes_df['streams'])
plt.title('Streams in Relation to Song Duration')
plt.xlabel('duration in milliseconds')
plt.ylabel('streams in millions')
print("Mean of song length in seconds: " + str(no_dupes_df['duration_ms'].mean() / 1000))
print("Median of song length in seconds: " + str(no_dupes_df['duration_ms'].median() / 1000))
print("STDDEV of song length in seconds: " + str(no_dupes_df['duration_ms'].std() / 1000))
Mean of song length in seconds: 215.89356386292835
Median of song length in seconds: 213.981
STDDEV of song length in seconds: 41.844669610266095

Figure: “Streams in Relation to Song Duration” (scatter plot)

plt.scatter(no_dupes_df['acousticness'],no_dupes_df['streams'])
plt.title('Streams in Relation to Acousticness')
plt.xlabel('acousticness scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of acousticness index: " + str(no_dupes_df['acousticness'].mean()))
print("Median of acousticness index: " + str(no_dupes_df['acousticness'].median()))
print("STDDEV of acousticness index: " + str(no_dupes_df['acousticness'].std()))
Mean of acousticness index: 0.21230590903426794
Median of acousticness index: 0.11
STDDEV of acousticness index: 0.23844285456931844

Figure: “Streams in Relation to Acousticness” (scatter plot)

plt.scatter(no_dupes_df['danceability'],no_dupes_df['streams'])
plt.title('Streams in Relation to Danceability')
plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of danceability index: " + str(no_dupes_df['danceability'].mean()))
print("Median of danceability index: " + str(no_dupes_df['danceability'].median()))
print("STDDEV of danceability index: " + str(no_dupes_df['danceability'].std()))
Mean of danceability index: 0.6818423676012461
Median of danceability index: 0.695
STDDEV of danceability index: 0.13576219623892008

Figure: “Streams in Relation to Danceability” (scatter plot)

plt.scatter(no_dupes_df['energy'],no_dupes_df['streams'])
plt.title('Streams in Relation to Energy')
plt.xlabel('energy scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of energy index: " + str(no_dupes_df['energy'].mean()))
print("Median of energy index: " + str(no_dupes_df['energy'].median()))
print("STDDEV of energy index: " + str(no_dupes_df['energy'].std()))
Mean of energy index: 0.6355397507788161
Median of energy index: 0.6515
STDDEV of energy index: 0.17854747086125813

Figure: “Streams in Relation to Energy” (scatter plot)

plt.scatter(no_dupes_df['instrumentalness'],no_dupes_df['streams'])
plt.title('Streams in Relation to Instrumentalness')
plt.xlabel('instrumentalness scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of instrumentalness index: " + str(no_dupes_df['instrumentalness'].mean()))
print("Median of instrumentalness index: " + str(no_dupes_df['instrumentalness'].median()))
print("STDDEV of instrumentalness index: " + str(no_dupes_df['instrumentalness'].std()))
Mean of instrumentalness index: 0.013712031915887851
Median of instrumentalness index: 0.0
STDDEV of instrumentalness index: 0.08112995046942596

Figure: “Streams in Relation to Instrumentalness” (scatter plot)

plt.scatter(no_dupes_df['liveness'],no_dupes_df['streams'])
plt.title('Streams in Relation to Liveness')
plt.xlabel('liveness scale of 0-1') 
plt.ylabel('streams in millions')
print("Mean of liveness index: " + str(no_dupes_df['liveness'].mean()))
print("Median of liveness index: " + str(no_dupes_df['liveness'].median()))
print("STDDEV of liveness index: " + str(no_dupes_df['liveness'].std()))
Mean of liveness index: 0.1735563862928349
Median of liveness index: 0.123
STDDEV of liveness index: 0.12771847354589183

Figure: “Streams in Relation to Liveness” (scatter plot)

plt.scatter(no_dupes_df['loudness'],no_dupes_df['streams'])
plt.title('Streams in Relation to Loudness')
plt.xlabel('loudness in decibels')
plt.ylabel('streams in millions')
print("Mean of volume: " + str(no_dupes_df['loudness'].mean()))
print("Median of volume: " + str(no_dupes_df['loudness'].median()))
print("STDDEV of volume: " + str(no_dupes_df['loudness'].std()))
Mean of volume: -6.436602803738317
Median of volume: -5.992
STDDEV of volume: 2.930078470544615

Figure: “Streams in Relation to Loudness” (scatter plot)

Lots of points appear to be around the mean volume, so let’s check and see if this data is normally distributed.

data = []
for i in range(0,100):
    data.append(np.mean(df['loudness'].sample(n=100)))
plt.hist(data)
plt.title("Frequency Distribution of Average Song Volumes")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
Text(0, 0.5, 'Frequency')

Figure: “Frequency Distribution of Average Song Volumes” (histogram of sample means)

The means of repeated samples of track loudness appear to be approximately normally distributed, so the mean of a reasonably large sample should fall close to the population mean. For the de-duplicated tracks, the statistics printed above give a mean of about -6.4 decibels with a standard deviation of around 2.9 decibels.

plt.scatter(no_dupes_df['speechiness'], no_dupes_df['streams'])
plt.title('Streams in Relation to Speechiness')
plt.xlabel('speechiness scale of 0-.5')
plt.ylabel('streams in millions')
print("Mean of speechiness index: " + str(no_dupes_df['speechiness'].mean()))
print("Median of speechiness index: " + str(no_dupes_df['speechiness'].median()))
print("STDDEV of speechiness index: " + str(no_dupes_df['speechiness'].std()))
Mean of speechiness index: 0.1167436137071651
Median of speechiness index: 0.0678
STDDEV of speechiness index: 0.11056578687587934

svg

plt.scatter(no_dupes_df['valence'],no_dupes_df['streams'])
plt.title('Streams in Relation to Valence')
plt.xlabel('valence scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of valence index: " + str(no_dupes_df['valence'].mean()))
print("Median of valence index: " + str(no_dupes_df['valence'].median()))
print("STDDEV of valence index: " + str(no_dupes_df['valence'].std()))
Mean of valence index: 0.48936448598130844
Median of valence index: 0.48
STDDEV of valence index: 0.23614792618729277

svg

plt.scatter(no_dupes_df['tempo'],no_dupes_df['streams'])
plt.title('Streams in Relation to Tempo')
plt.xlabel('tempo scale of 0-200 beats per minute')
plt.ylabel('streams in millions')
print("Mean of tempo: " + str(no_dupes_df['tempo'].mean()))
print("Median of tempo: " + str(no_dupes_df['tempo'].median()))
print("STDDEV of tempo: " + str(no_dupes_df['tempo'].std()))
Mean of tempo: 121.39503894080995
Median of tempo: 119.9425
STDDEV of tempo: 29.609868539398786

svg
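The mean, median, and standard deviation printed for each feature above can also be collected into a single table. This is a minimal sketch, assuming a small frame built from five rows of the January chart; `sample` and `summary` are illustrative names and not variables from the original notebook.

```python
import pandas as pd

# Five rows copied from the January chart; only three feature columns are kept here
sample = pd.DataFrame({
    "loudness": [-7.015, -5.599, -8.970, -4.068, -5.609],
    "valence": [0.486, 0.661, 0.556, 0.742, 0.370],
    "tempo": [186.003, 95.010, 94.514, 101.965, 103.967],
})

# One row per statistic, one column per feature
summary = sample.agg(["mean", "median", "std"])
print(summary)
```

Applied to `no_dupes_df` with the full list of feature columns, the same call would give the figures printed cell by cell above in one place.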

We now create a correlation matrix to see the pairwise relationships between all numeric values. The matrix summarizes the scatter plots above.

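# Pairwise correlations between the numeric columns only (numeric_only requires pandas >= 1.5)
corr = no_dupes_df.corr(numeric_only=True)
# Hide the upper triangle, which mirrors the lower one
mask = np.triu(np.ones_like(corr, dtype=bool))

f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# vmax=.3 caps the color scale, so every correlation above .3 gets the same color
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})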
<AxesSubplot:>

svg
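A heatmap is easy to scan but hard to read exact values from, so it can help to list each feature's correlation with streams in order of strength. The sketch below is an illustration only: the data is synthetic, the column names mirror the dataframe above, and the dependence of `streams` on the other columns is an assumption chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# ASSUMPTION: synthetic data in which streams depend strongly on follower_count,
# weakly on loudness, and not at all on tempo
frame = pd.DataFrame({
    "artist": ["hypothetical artist"] * n,   # non-numeric column, excluded below
    "follower_count": rng.normal(size=n),
    "loudness": rng.normal(size=n),
    "tempo": rng.normal(size=n),
})
frame["streams"] = 3 * frame["follower_count"] + 0.5 * frame["loudness"] + rng.normal(size=n)

# Keep numeric columns only, then read off the 'streams' column of the matrix
corr = frame.select_dtypes(include="number").corr()
by_strength = corr["streams"].drop("streams").sort_values(key=abs, ascending=False)
print(by_strength)
```

Sorting by absolute value keeps strong negative correlations near the top of the list instead of burying them at the bottom.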

After seeing this, we had a couple of ideas. It is possible that the features we are currently tracking work together to make a song popular, but it is also possible that we are missing other important features. Looking back at the most popular songs across our entire dataframe, we noticed that the majority of artists were well known or already accomplished. While it is obvious that an artist’s “followers” (or typical listeners) will increase the number of streams a song gets, it would be interesting to know whether the number of typical listeners is more important than all the other aspects of the song.

Top 10

Here we look specifically at the traits of the songs at ranks 1-10. The song in each of these positions changes from chart to chart, so the same rank will appear at several different x-axis values.

top10s = df.loc[df['position'] <= 10]
# Colors and legend labels shared by every top 10 plot, one entry per rank
color_list = ['r', 'orange', 'yellow', 'lime',  'cyan', 'b', 'brown' , 'violet', 'purple', 'black']
top10_legend = ['Rank 1', 'Rank 2', 'Rank 3', 'Rank 4', 'Rank 5', 'Rank 6','Rank 7','Rank 8','Rank 9','Rank 10']

# Scatter one feature against streams for every top 10 row, colored by chart rank
def plotTop10(name):
    for index, row in top10s.iterrows():
        # Use the row's own rank for the color instead of relying on row order
        plt.scatter(row[name], row['streams'], color=color_list[int(row['position']) - 1])


top10s.head()
position track_name artist streams url date track_id duration_ms acousticness danceability energy instrumentalness liveness loudness speechiness valence tempo artist_id popularity_index follower_count
0 1 Starboy The Weeknd 3135625.0 https://open.spotify.com/track/5aAx2yezTd8zXrk... 2017-01-01 7MXVkk9YMctZqd1Srtv4MB 230453.0 0.14100 0.679 0.587 0.000006 0.137 -7.015 0.2760 0.486 186.003 1Xyo4u8uXC1ZmMpatF05PJ 94.0 26720759.0
1 2 Closer The Chainsmokers 3015525.0 https://open.spotify.com/track/7BKLCZ1jbUBVqRi... 2017-01-01 7BKLCZ1jbUBVqRi2FVlTVw 244960.0 0.41400 0.748 0.524 0.000000 0.111 -5.599 0.0338 0.661 95.010 69GGBxA162lTqCwzJG5jLp 84.0 17093912.0
2 3 Let Me Love You DJ Snake 2545384.0 https://open.spotify.com/track/4pdPtRcBmOSQDlJ... 2017-01-01 3ibKnFDaa3GhpPGlOUj7ff 256733.0 0.23500 0.656 0.578 0.000000 0.118 -8.970 0.0922 0.556 94.514 20s0P9QLxGqKuCsGwFsp7w 69.0 2055274.0
3 4 Rockabye (feat. Sean Paul & Anne-Marie) Clean Bandit 2356604.0 https://open.spotify.com/track/5knuzwU65gJK7IF... 2017-01-01 5knuzwU65gJK7IF5yJsuaW 251088.0 0.40600 0.720 0.763 0.000000 0.180 -4.068 0.0523 0.742 101.965 6MDME20pz9RveH9rEXvrOM 80.0 4092589.0
4 5 One Dance Drake 2259887.0 https://open.spotify.com/track/1xznGGDReH1oQq0... 2017-01-01 1zi7xx7UVEFkmKfv06H8x0 173987.0 0.00776 0.792 0.625 0.001800 0.329 -5.609 0.0536 0.370 103.967 3TVXtAsR1Inumwj472S9r4 96.0 51374698.0

plotTop10('popularity_index')
plt.title('Top 10 Streams in Relation to Popularity of Artist')
plt.xlabel('Artist Popularity Value')
plt.ylabel('streams in millions')
plt.legend(top10_legend)

<matplotlib.legend.Legend at 0x23bdc07df10>

svg

plotTop10('follower_count')
plt.title('Top 10 Streams in Relation to Follower Count')
plt.xlabel('Follower Count in 10 millions')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
<matplotlib.legend.Legend at 0x23bd3a15760>

svg

This graph seems to indicate that the majority of songs within the top 10 positions come from artists with a follower count of less than 30 million, but the sample size here is small.
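A claim like this can be checked numerically instead of by eye. The following is a minimal sketch, assuming only the five January rows shown by `top10s.head()` above; `top_rows`, `threshold`, and `share_below` are illustrative names, and five rows are far too few to support a conclusion.

```python
import pandas as pd

# The five January rows shown by top10s.head() above
top_rows = pd.DataFrame({
    "artist": ["The Weeknd", "The Chainsmokers", "DJ Snake", "Clean Bandit", "Drake"],
    "follower_count": [26720759, 17093912, 2055274, 4092589, 51374698],
})

threshold = 30_000_000
# The mean of a boolean column is the fraction of rows where the condition holds
share_below = (top_rows["follower_count"] < threshold).mean()
print(f"{share_below:.0%} of these rows fall below {threshold:,} followers")
```

Running the same expression on the full `top10s` frame would give the share for every top 10 entry in the data.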

plotTop10('duration_ms')
plt.title('Top 10 Streams in Relation to Duration')
plt.xlabel('duration in milliseconds')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
<matplotlib.legend.Legend at 0x23bd3a91af0>

svg

This graph shows how the duration of a song in milliseconds compares to the number of streams that song received. Here we are only using the first 10 pieces of data from our dataframe. It shows that the songs with the most streams in this set of data are longer than 240,000 ms, or 4 minutes. This is surprising, because the average song is usually around 3 minutes and 30 seconds or less.
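Durations in milliseconds are hard to judge at a glance, so converting them to minutes and seconds makes the 4-minute threshold easier to check. This sketch uses the `duration_ms` values of the five January rows shown by `top10s.head()` above; `to_minutes_seconds` is a hypothetical helper introduced for illustration.

```python
# duration_ms values of the five January rows shown by top10s.head() above
durations_ms = [230453, 244960, 256733, 251088, 173987]

def to_minutes_seconds(duration_ms):
    """Format a duration in milliseconds as M:SS, dropping fractional seconds."""
    total_seconds = duration_ms // 1000
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"

formatted = [to_minutes_seconds(ms) for ms in durations_ms]
# Fraction of these rows longer than 4 minutes (240,000 ms)
over_four_minutes = sum(ms > 240_000 for ms in durations_ms) / len(durations_ms)
print(formatted)
print(over_four_minutes)
```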

plotTop10('acousticness')
plt.title('Top 10 Streams in Relation to Acousticness')
plt.xlabel('acousticness scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
<matplotlib.legend.Legend at 0x23bd369fac0>

svg

This graph displays a confidence score for how likely it is that a song is acoustic (with a value of 1 being very likely that the song is acoustic) compared to the number of streams the song has. All of the confidence scores are less than .5, which indicates most of these songs are probably not acoustic.


plotTop10('energy')
plt.title('Top 10 Streams in Relation to Energy')
plt.xlabel('energy scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(top10_legend)

<matplotlib.legend.Legend at 0x23bd3e10970>

svg

This graph shows how the “energy” of a song, or generally how noisy and fast the song is, compares to the number of streams for the top 10 songs on the 1st of January. Here, we see that the songs with the most streams are around or above .6 on the energy scale (a higher score means the song is higher energy).

plotTop10('danceability')
plt.title('Top 10 Streams in Relation to Danceability')
plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
<matplotlib.legend.Legend at 0x23bdfe06d00>

svg

This graph shows how “danceable” a song is, using a value provided to us by the Spotify API, compared to the number of streams that song got. Danceability is measured as a value from 0 to 1, where 1 is most danceable. This graph appears to be similar to the graph describing energy, so the two values may have been determined using similar characteristics (i.e. both are measuring how upbeat or fast a song is).


plotTop10('loudness')
plt.title('Top 10 Streams in Relation to Loudness')
plt.xlabel('loudness in decibels')
plt.ylabel('streams in millions')
plt.legend(top10_legend)


<matplotlib.legend.Legend at 0x23bd41c5b20>

svg

This graph describes the average volume of each track in our top 10s data set compared to the number of streams each song had. It appears to trend similarly to the last two graphs, indicating that the volume of a track may be correlated with how danceable or energetic a song is.

plotTop10('valence')
plt.title('Top 10 Streams in Relation to Valence')
plt.xlabel('valence scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
<matplotlib.legend.Legend at 0x23bdd6c9580>

svg

This graph describes the “valence” of a song compared to the number of streams it got. Valence is described as the “positivity” of a song where “Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry),” according to the Spotify API reference. The reference does not describe how this value is determined, but our data seems to show there may be a correlation between valence and the number of streams a song is getting among the number 1 songs. However, this graph does not take into account the other features of the songs. It may be worth considering songs where every feature except this one is held constant, so that we can see whether there is a correlation between this value and the number of streams.
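One common way to approximate "holding the other features constant" is a multiple linear regression, where the coefficient on valence is estimated alongside the other features. The sketch below is not the method used in this analysis. It is an illustration on synthetic data, under the assumption that streams depend on loudness only while valence is merely correlated with loudness.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# ASSUMPTION: synthetic data where streams depend on loudness only,
# while valence is merely correlated with loudness
loudness = rng.normal(size=n)
valence = 0.7 * loudness + rng.normal(scale=0.7, size=n)
streams = 2.0 * loudness + rng.normal(size=n)

# Looking at valence on its own suggests a relationship with streams
simple_corr = np.corrcoef(valence, streams)[0, 1]

# Least-squares fit of streams on an intercept, valence, and loudness
X = np.column_stack([np.ones(n), valence, loudness])
coef, *_ = np.linalg.lstsq(X, streams, rcond=None)
intercept, valence_coef, loudness_coef = coef

print(f"correlation of valence with streams: {simple_corr:.2f}")
print(f"valence coefficient with loudness in the model: {valence_coef:.2f}")
print(f"loudness coefficient: {loudness_coef:.2f}")
```

In this constructed example the valence coefficient shrinks toward zero once loudness is in the model, which is the pattern to look for if an apparent valence effect is really carried by another feature.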


plotTop10('tempo')
plt.title('Top 10 Streams in Relation to Beats Per Minute (BPM)')
plt.xlabel('tempo scale of 0-200 Beats Per Minute (BPM)')
plt.ylabel('streams in millions')
plt.legend(top10_legend)


<matplotlib.legend.Legend at 0x23bd35843d0>

svg

This graph describes the tempo of a song compared to the number of streams that song has. Given our dataset, it is unclear whether there is a correlation between the tempo of a song and the number of streams it gets.

Correlation within the Top 10

# Same heatmap as before, restricted to rows in the top 10 positions
# (numeric_only requires pandas >= 1.5)
corr = top10s.corr(numeric_only=True)
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
<AxesSubplot:>

svg

This correlation matrix of the top 10 is much more polarized than the top 200 correlation matrix. Values that had a weak positive or weak negative correlation are now more strongly correlated, and some are strongly correlated. This suggests that the highly correlated traits have a stronger influence within the top 10.
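One caveat when comparing the two matrices is that correlations computed from fewer rows vary more by chance alone. The sketch below illustrates this with pairs of unrelated random variables. The row counts are assumptions based on 12 monthly charts (12 * 10 top 10 rows versus 12 * 200 rows overall), and `correlation_spread` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(3)

def correlation_spread(n_rows, n_trials=500):
    """Standard deviation of the sample correlation between two unrelated variables."""
    correlations = []
    for _ in range(n_trials):
        x = rng.normal(size=n_rows)
        y = rng.normal(size=n_rows)
        correlations.append(np.corrcoef(x, y)[0, 1])
    return float(np.std(correlations))

# ASSUMPTION: 12 monthly charts, so 12 * 10 top 10 rows versus 12 * 200 rows overall
spread_top10 = correlation_spread(120)
spread_top200 = correlation_spread(2400)
print(f"spread with 120 rows:  {spread_top10:.3f}")
print(f"spread with 2400 rows: {spread_top200:.3f}")
```

Because the same song often holds a top 10 position for several months, the effective sample size in the top 10 is likely smaller still, so part of the stronger coloring may be noise and not a stronger influence.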

Unique Relationships

There appeared to be a potential relationship between valence and the number of streams a song was getting, both in our correlation chart and in how rank 1 songs performed in our graph of the top 10 tracks each month, so it might be interesting to look at the other features of songs with a high valence (above .4).

highValenceTracks = df.loc[df['valence'] > .4]
highValenceTracks.sort_values('streams', ascending=False).head(10)
position track_name artist streams url date track_id duration_ms acousticness danceability energy instrumentalness liveness loudness speechiness valence tempo artist_id popularity_index follower_count
200 1 Shape of You Ed Sheeran 7549041.0 https://open.spotify.com/track/7qiZfU4dY1lWllz... 2017-02-01 7qiZfU4dY1lWllzX7mPBI3 233713.0 0.581 0.825 0.652 0.000000 0.0931 -3.183 0.0802 0.931 95.977 6eUKZXaKkcviH0Ku9w2n3V 91.0 73345259.0
1000 1 Despacito - Remix Luis Fonsi 7332260.0 https://open.spotify.com/track/5CtI0qwDJkDQGwX... 2017-06-01 6rPO02ozF3bM7NnOV4h6s2 228827.0 0.228 0.653 0.816 0.000000 0.0967 -4.353 0.1670 0.816 178.085 4V8Sr092TqfHkfAA5fXXqG 78.0 9035487.0
400 1 Shape of You Ed Sheeran 7201132.0 https://open.spotify.com/track/7qiZfU4dY1lWllz... 2017-03-01 7qiZfU4dY1lWllzX7mPBI3 233713.0 0.581 0.825 0.652 0.000000 0.0931 -3.183 0.0802 0.931 95.977 6eUKZXaKkcviH0Ku9w2n3V 91.0 73345259.0
600 1 Shape of You Ed Sheeran 6815498.0 https://open.spotify.com/track/7qiZfU4dY1lWllz... 2017-04-01 7qiZfU4dY1lWllzX7mPBI3 233713.0 0.581 0.825 0.652 0.000000 0.0931 -3.183 0.0802 0.931 95.977 6eUKZXaKkcviH0Ku9w2n3V 91.0 73345259.0
1200 1 Despacito - Remix Luis Fonsi 6398530.0 https://open.spotify.com/track/5CtI0qwDJkDQGwX... 2017-07-01 6rPO02ozF3bM7NnOV4h6s2 228827.0 0.228 0.653 0.816 0.000000 0.0967 -4.353 0.1670 0.816 178.085 4V8Sr092TqfHkfAA5fXXqG 78.0 9035487.0
800 1 Despacito - Remix Luis Fonsi 6360737.0 https://open.spotify.com/track/5CtI0qwDJkDQGwX... 2017-05-01 6rPO02ozF3bM7NnOV4h6s2 228827.0 0.228 0.653 0.816 0.000000 0.0967 -4.353 0.1670 0.816 178.085 4V8Sr092TqfHkfAA5fXXqG 78.0 9035487.0
2000 1 rockstar Post Malone 5755610.0 https://open.spotify.com/track/7wGoVu4Dady5GV0... 2017-11-01 7ytR5pFWmSjzHJIeQkgog4 181733.0 0.247 0.746 0.690 0.000000 0.1010 -7.956 0.1640 0.497 89.977 4r63FhuTkUYltbVAg5TQnk 93.0 5174251.0
1800 1 rockstar Post Malone 5649503.0 https://open.spotify.com/track/1OmcAT5Y8eg5bUP... 2017-10-01 7ytR5pFWmSjzHJIeQkgog4 181733.0 0.247 0.746 0.690 0.000000 0.1010 -7.956 0.1640 0.497 89.977 4r63FhuTkUYltbVAg5TQnk 93.0 5174251.0
1600 1 Look What You Made Me Do Taylor Swift 5547962.0 https://open.spotify.com/track/6uFsE1JgZ20EXyU... 2017-09-01 1P17dC1amhFzptugyAO7Il 211853.0 0.204 0.766 0.709 0.000014 0.1260 -6.471 0.1230 0.506 128.070 06HL4z0CvFAxyc27GXpf02 97.0 34579892.0
2200 1 rockstar Post Malone 5528701.0 https://open.spotify.com/track/7wGoVu4Dady5GV0... 2017-12-01 7ytR5pFWmSjzHJIeQkgog4 181733.0 0.247 0.746 0.690 0.000000 0.1010 -7.956 0.1640 0.497 89.977 4r63FhuTkUYltbVAg5TQnk 93.0 5174251.0

We have duplicate pieces of data, so let’s remove the duplicates for this test. We’re going to keep the version of each song that has the most streams.

# After sorting by streams in descending order, keep='first' keeps the highest-stream version of each song
highValenceTracks = highValenceTracks.sort_values('streams', ascending=False).drop_duplicates(['artist', 'duration_ms', 'acousticness', 'danceability', 'energy'], keep='first')
highValenceTracks.sort_values('streams', ascending=False).head(10)
position track_name artist streams url date track_id duration_ms acousticness danceability energy instrumentalness liveness loudness speechiness valence tempo artist_id popularity_index follower_count
200 1 Shape of You Ed Sheeran 7549041.0 https://open.spotify.com/track/7qiZfU4dY1lWllz... 2017-02-01 7qiZfU4dY1lWllzX7mPBI3 233713.0 0.581000 0.825 0.652 0.000000 0.0931 -3.183 0.0802 0.931 95.977 6eUKZXaKkcviH0Ku9w2n3V 91.0 73345259.0
1000 1 Despacito - Remix Luis Fonsi 7332260.0 https://open.spotify.com/track/5CtI0qwDJkDQGwX... 2017-06-01 6rPO02ozF3bM7NnOV4h6s2 228827.0 0.228000 0.653 0.816 0.000000 0.0967 -4.353 0.1670 0.816 178.085 4V8Sr092TqfHkfAA5fXXqG 78.0 9035487.0
2000 1 rockstar Post Malone 5755610.0 https://open.spotify.com/track/7wGoVu4Dady5GV0... 2017-11-01 7ytR5pFWmSjzHJIeQkgog4 181733.0 0.247000 0.746 0.690 0.000000 0.1010 -7.956 0.1640 0.497 89.977 4r63FhuTkUYltbVAg5TQnk 93.0 5174251.0
1600 1 Look What You Made Me Do Taylor Swift 5547962.0 https://open.spotify.com/track/6uFsE1JgZ20EXyU... 2017-09-01 1P17dC1amhFzptugyAO7Il 211853.0 0.204000 0.766 0.709 0.000014 0.1260 -6.471 0.1230 0.506 128.070 06HL4z0CvFAxyc27GXpf02 97.0 34579892.0
1001 2 I'm the One DJ Khaled 5208996.0 https://open.spotify.com/track/72Q0FQQo32KJloi... 2017-06-01 1jYiIOC5d6soxkJP81fxq2 288877.0 0.053300 0.599 0.667 0.000000 0.1340 -4.267 0.0367 0.817 80.984 0QHgL1lAIqAw0HtD7YldmP 82.0 5405048.0
401 2 Something Just Like This The Chainsmokers 4581789.0 https://open.spotify.com/track/6RUKPb4LETWmmr3... 2017-03-01 6RUKPb4LETWmmr3iAEQktW 247160.0 0.049800 0.617 0.635 0.000014 0.1640 -6.769 0.0317 0.446 103.019 69GGBxA162lTqCwzJG5jLp 84.0 17093912.0
1201 2 Wild Thoughts (feat. Rihanna & Bryson Tiller) DJ Khaled 4558126.0 https://open.spotify.com/track/1OAh8uOEOvTDqkK... 2017-07-01 45XhKYRRkyeqoW3teSOkCM 204664.0 0.028700 0.613 0.681 0.000000 0.1260 -3.089 0.0778 0.619 97.621 0QHgL1lAIqAw0HtD7YldmP 82.0 5405048.0
402 3 It Ain't Me (with Selena Gomez) Kygo 4529714.0 https://open.spotify.com/track/3eR23VReFzcdmS7... 2017-03-01 2jRGYG8U5bJzWOH6FLuzvO 192000.0 0.016100 0.713 0.658 0.000138 0.0607 -5.362 0.0748 0.539 115.024 23fqKkggKUBHNkbKtXEls4 86.0 6975385.0
803 4 HUMBLE. Kendrick Lamar 4371886.0 https://open.spotify.com/track/7KXjTSCq5nL1LoY... 2017-05-01 7KXjTSCq5nL1LoYtL7XAwS 177000.0 0.000282 0.908 0.621 0.000054 0.0958 -6.638 0.1020 0.421 150.011 2YZyLoL8N0Wb9xBt1NhZWg 87.0 16028806.0
2002 3 New Rules Dua Lipa 3758506.0 https://open.spotify.com/track/2ekn2ttSfGqwhha... 2017-11-01 2ekn2ttSfGqwhhate0LSR0 209320.0 0.002610 0.762 0.700 0.000016 0.1530 -6.021 0.0694 0.608 116.073 6M2wZ9GZgrQXHCFfjv46we 93.0 21442792.0
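The deduplication step can be reproduced on a handful of rows to confirm which version of each song survives. This is a minimal sketch using six rows copied from the first table above. It assumes that the pair (artist, track_name) identifies a song, whereas the analysis keys on artist and audio features, and it shows a `groupby` alternative for comparison.

```python
import pandas as pd

# Monthly rows copied from the table above (a subset of columns)
rows = pd.DataFrame({
    "track_name": ["Shape of You", "Despacito - Remix", "Shape of You",
                   "Shape of You", "Despacito - Remix", "Despacito - Remix"],
    "artist": ["Ed Sheeran", "Luis Fonsi", "Ed Sheeran",
               "Ed Sheeran", "Luis Fonsi", "Luis Fonsi"],
    "date": ["2017-02-01", "2017-06-01", "2017-03-01",
             "2017-04-01", "2017-07-01", "2017-05-01"],
    "streams": [7549041, 7332260, 7201132, 6815498, 6398530, 6360737],
})

# ASSUMPTION: (artist, track_name) identifies a song; the analysis keys on audio features instead
keys = ["artist", "track_name"]

# Sort so the best month comes first, then keep the first row per song
by_sort = rows.sort_values("streams", ascending=False).drop_duplicates(keys, keep="first")

# Alternative: select the row holding the maximum stream count for each song
by_group = rows.loc[rows.groupby(keys)["streams"].idxmax()]

print(by_sort[["track_name", "date", "streams"]])
```

Both approaches keep the February row for Shape of You and the June row for Despacito - Remix, which matches the deduplicated table above.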

Here are the tracks with the highest valence values in our list of high-valence songs.

highValenceTracks = highValenceTracks.sort_values('valence', ascending=False)
highValenceTracks.head()
position track_name artist streams url date track_id duration_ms acousticness danceability energy instrumentalness liveness loudness speechiness valence tempo artist_id popularity_index follower_count
1006 7 There's Nothing Holdin' Me Back Shawn Mendes 3093935.0 https://open.spotify.com/track/79cuOz3SPQTuFrp... 2017-06-01 7JJmb5XwzOO8jgpou264Ml 199440.0 0.380 0.866 0.813 0.000000 0.0779 -4.063 0.0554 0.969 121.998 7n2wHs1TKAczGzO7Dd2rGr 92.0 30441601.0
124 125 Pumped Up Kicks Foster The People 467384.0 https://open.spotify.com/track/7w87IxuO7BDcJ3Y... 2017-01-01 7w87IxuO7BDcJ3YUqCyMTT 239600.0 0.145 0.733 0.710 0.115000 0.0956 -5.849 0.0292 0.965 127.975 7gP3bB2nilZXLfPHJhMdvc 76.0 3059575.0
2366 167 Feliz Navidad José Feliciano 631358.0 https://open.spotify.com/track/7taXf5odg9xCAZE... 2017-12-01 0oPdaY4dXtc3ZsaG17V972 182067.0 0.550 0.513 0.831 0.000000 0.3360 -9.004 0.0383 0.963 148.837 7K78lVZ8XzkjfRSI7570FF 76.0 211150.0
130 131 Happy - From "Despicable Me 2" Pharrell Williams 453426.0 https://open.spotify.com/track/5b88tNINg4Q4nrR... 2017-01-01 60nZcImufyMA1MKQY3dcCH 232720.0 0.219 0.647 0.822 0.000000 0.0908 -4.662 0.1830 0.962 160.019 2RdwBSPQiwcmiDo9kixcl8 80.0 3359324.0
1315 116 Skrt On Me (feat. Nicki Minaj) Calvin Harris 625504.0 https://open.spotify.com/track/7iDxZ5Cd0Yg08d4... 2017-07-01 7iDxZ5Cd0Yg08d4fI5WVtG 228267.0 0.169 0.713 0.889 0.000058 0.1690 -3.870 0.0376 0.960 101.977 7CajNmpbOovFoOoasH2HaY 86.0 20962353.0

Here’s a plot displaying the number of streams for each song with a high valence compared to its valence.

plt.scatter(highValenceTracks['valence'], highValenceTracks['streams'])
plt.title("Streams Compared to Valence for Songs with Valence > .4")
plt.xlabel("Valence Value From 0 to 1")
plt.ylabel("Streams in Millions")
Text(0, 0.5, 'Streams in Millions')

svg

We no longer see the relationship we were seeing earlier between valence and the number of streams. Maybe the relationship that leads to more streams is a combination of these features together. It might be worth trying to see if there is a relationship between streams and a combination of features like valence and loudness or valence and danceability.

Let’s try categorizing our data into groups with different levels of valence. This allows us to bound valence, which helps us treat it more like a constant. Then, we can see how other features compare to streams when valence is held within certain levels.

highValenceTracks = df.loc[(df['valence'] >= .5) & (df['valence'] < .8)]
veryHighValenceTracks = df.loc[df['valence'] >= .8]
lowValenceTracks = df.loc[(df['valence'] < .5) & (df['valence'] >= .3)]
veryLowValenceTracks = df.loc[df['valence'] < .3]  # strict bound, so a valence of exactly .3 falls only in the low group
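The four groups can also be built in one step with `pandas.cut`, which rules out a value landing in two groups at a boundary. This is a sketch under stated assumptions: the valence values are illustrative (some match rows shown earlier, while 0.299, 0.3 and 0.8 are added as boundary cases), and the group labels are names chosen for this example.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: illustrative valence values, including the boundary cases 0.3 and 0.8
valence = pd.Series([0.931, 0.497, 0.506, 0.299, 0.3, 0.8])

# right=False makes each bin include its lower edge and exclude its upper edge,
# so every value lands in exactly one group
groups = pd.cut(
    valence,
    bins=[-np.inf, 0.3, 0.5, 0.8, np.inf],
    labels=["very low", "low", "high", "very high"],
    right=False,
)
print(groups.astype(str).tolist())
```

With a column like this added to the dataframe, each scatter plot can loop over `df.groupby(...)` instead of keeping four separate frames in sync.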

First, we can try looking at how danceability performs within the different bounded groups of valence. We will color code the valence groups so that we can easily see which tracks have a high, medium-high, medium-low, or low valence.

# Plotting danceability against streams for each valence group to see if these two values work together to impact stream counts
plt.scatter(veryLowValenceTracks['danceability'],veryLowValenceTracks['streams'], color="blue")
plt.scatter(lowValenceTracks['danceability'],lowValenceTracks['streams'], color="turquoise")
plt.scatter(highValenceTracks['danceability'],highValenceTracks['streams'], color="yellow")
plt.scatter(veryHighValenceTracks['danceability'],veryHighValenceTracks['streams'], color="red")
plt.title('Danceability and streams for songs with bounded valences')
plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(['Song with valence < .3', 'Song with valence < .5 and > .3', 'Song with valence > .5 and < .8', 'Song with valence > .8'])
<matplotlib.legend.Legend at 0x23bd47c7340>

svg

This graph displays how the danceability of a song compares to the number of streams it has, with each song colored by its valence group. While there is little indication of a linear correlation, the songs with the most streams all have a danceability above .5.

Using this set of data where valence is color coded by groups of values, let’s try to plot other features against streams and see if valence and another feature have any effect on the number of streams. We can try considering follower count next, as that value appeared to have a slight positive correlation with streams in our correlation chart.

# Plotting songs with varying levels of followers within certain valence levels against streams to see if these two values work together to impact stream counts
plt.scatter(veryLowValenceTracks['follower_count'],veryLowValenceTracks['streams'], color="blue")
plt.scatter(lowValenceTracks['follower_count'],lowValenceTracks['streams'], color="turquoise")
plt.scatter(highValenceTracks['follower_count'],highValenceTracks['streams'], color="yellow")
plt.scatter(veryHighValenceTracks['follower_count'],veryHighValenceTracks['streams'], color="red")
plt.title('Follower Count and Streams for Songs With Different Valence Bounds')
plt.xlabel('Follower count in tens of millions of followers')
plt.ylabel('Streams in millions')
plt.legend(['Song with valence < .3', 'Song with valence < .5 and > .3', 'Song with valence > .5 and < .8', 'Song with valence > .8'])
<matplotlib.legend.Legend at 0x23bd4839e50>

svg

This graph shows that the majority of songs in our entire set of data sit in the bottom left corner of the graph, and this cluster includes varying levels of valence. This suggests that follower count, taken together with bounded levels of valence, has little to do with the number of streams.

Another variable that appeared to have a slight correlation with streams was the loudness of a song. We can try to make a plot similar to our previous two plots where we instead measure loudness on the x-axis.

# Plotting songs with varying levels of loudness within certain valence levels against streams to see if these two values work together to impact stream counts
plt.scatter(veryLowValenceTracks['loudness'],veryLowValenceTracks['streams'], color="blue")
plt.scatter(lowValenceTracks['loudness'],lowValenceTracks['streams'], color="turquoise")
plt.scatter(highValenceTracks['loudness'],highValenceTracks['streams'], color="yellow")
plt.scatter(veryHighValenceTracks['loudness'],veryHighValenceTracks['streams'], color="red")
plt.title('Loudness and Streams for Songs With Different Valence Bounds')
plt.xlabel('Loudness in Decibels')
plt.ylabel('Streams in Millions')
plt.legend(['Song with valence < .3', 'Song with valence < .5 and > .3', 'Song with valence > .5 and < .8', 'Song with valence > .8'])
<matplotlib.legend.Legend at 0x23bd48b6c10>

svg

While streams appear to have little correlation with loudness and valence, it does appear that valence and loudness are correlated with each other. We start to see higher valences as songs get louder, which seems relatively intuitive, as “negative” songs are probably quieter and more somber. It may also be noteworthy that the songs that did very well all had a loudness above -15 decibels.

These last three graphs seem to indicate that bounded levels of valence in conjunction with varying levels of the other features we measured have little to do with stream count. We did find that songs which performed exceptionally well had a loudness above -15 decibels or a danceability greater than .6, but it was otherwise hard to see relationships between our variables and streams.

Insight and Conclusion

In the final step of the data cycle, we draw conclusions based off our analysis to inform decisions made based on the data.

To return to our question: are there traits of a song that can be used to determine future success? If so, what are they?

We were unable to conclude whether there were specific traits of a song that directly determined future success or high stream counts. However, there were a few noteworthy trends in our data. Among our findings, we saw that the sample means of song loudness across our dataset are approximately normally distributed around a mean of about -6 decibels, and that loudness has a standard deviation of about 3 decibels. Most songs on the top 200 chart in our data therefore sit within a few decibels of this level. The loudness of songs brings up an interesting reminder of the Loudness War, a trend toward ever-louder recordings that is often traced back as far as the 1940s. During the loudness war, even though increasing the loudness of a song ultimately reduced its fidelity (fine details), louder releases were often perceived as sounding better. Our result may be an echo of this cultural trend, or perhaps people simply like their music loud, even if at a lower quality.

Our initial plot of popularity index against streams indicated that most top 200 songs came from artists with an index of around 80 (mean of 81), and our correlation plot seemed to support that higher artist popularity indices were positively correlated with streams. It is unsurprising that artist popularity correlates with streams. What is noteworthy is that our correlation plot indicated that artist popularity on its own had only a weak positive correlation with streams. We also noted that, while the average number of followers for an artist was skewed by unusually high values in the top 1 or 2 positions, the median number of followers for an artist in the top 200 was 7,748,023.

After analyzing several features individually and seeing little direct correlation with stream values, we decided to consider how different factors working together might impact streams. We looked at individual features within the top 10 positions of our charts to see if we could find any trends, as these songs were the most successful of all the songs in our data. Within these, we initially thought we saw a positive correlation between valence and stream counts, but it was unclear whether there was a distinctive relationship. We plotted a few other features with bounded valence values to investigate whether any of these combinations could produce a clearer correlation. The other features we chose to measure with bounded valence values were ones which appeared to be positively correlated with streams according to our correlation chart, but which were also measured independently from valence (i.e. they were not determined using the same song characteristics). Unfortunately, it appeared that there was no correlation between streams and the combination of valence with the other factors we considered (danceability, follower count, and loudness).

No single feature appeared to be the determining factor for a song’s success. Future research on how combinations of different features affect streams may be valuable, since our correlation plot was consistent with this conclusion. It would also be worth looking into what factors influence artist popularity. Access to data from a larger time period could also help in finding interesting trends in how people listen to music over time.