By Amanuel Awoke, Ferzam Mohammad, and Josue Velasquez
The music industry has changed a lot in the last decade with the introduction of streaming services like Apple Music and Spotify. These services let users stream music on demand, either for free or for a subscription fee. They have made it easier to consume music and have increased opportunities for people to start producing music, but they have also changed how musicians make money. Whenever a user listens to a song on a streaming service, the service typically keeps track of the number of “streams” that song has. Music artists are then paid a small amount based on the number of streams they have accumulated. Given how little artists are paid per stream, maximizing the revenue made from a song is valuable for those releasing music on these services. Stream count also determines where a song stands in the streaming services’ popularity lists, and making it onto their top 100 or 200 songs is a factor in whether a song is added to official charts, e.g. the Billboard 200.
Our group thought it would be interesting to see if we could predict how popular a song might be given different features of the song (e.g. how fast or slow it is, its mood, how many listeners the artist already gets on average, etc.). If we can estimate how many plays a song will get, we can estimate how much money it will make on a streaming service. Much like the Moneyball scenario, it’s possible that artists are focusing on producing music that meets criteria which they think make a song popular when, in reality, they should be focusing on other aspects of their music. Understanding what components of a song make it popular would help artists figure out the best way to produce music in order to make money from these streaming services.
The Moneyball story demonstrated the importance of data science in producing a strong baseball team, and while music is different from sports, our project should reflect similar data science practices in order to reach a valuable conclusion. It may be relatively straightforward to conclude whether a song by Taylor Swift will end up on the top 200 chart given her “incredibly loyal fanbase” of over 40 million people, but there may be other characteristics shared by popular songs that point to factors which help make a song more popular. Data science practices help us here by giving us tools to identify characteristics of a song, clarify how those characteristics relate to stream count, and determine whether any elements should be focused on when producing music.
From this point forward when we use the word “track” it is synonymous with “song.”
Are there traits of a song that can be used to determine future success? If so, what are they?
We are defining the success of a track by its appearance on the Top 200, as well as its ranking on the Top 200 (the higher the better).
This is the first step in the data science lifecycle, where we identify and gather information. We gather data from the Spotify Charts Regional Top 200 to identify which songs had the highest stream counts, taking the daily chart for the first day of each month from January 1st, 2017 through December 1st, 2017. Note that the download links below point at the “global” region of the charts. Spotify Charts provides the tracks with the highest stream counts, their top 200 rank, and the artist(s) who created each song. Spotify Charts already compiles the data into downloadable CSV files, so it isn’t necessary to scrape the website directly. To download one yourself, select a date in the dropdown at the top right of the website, then select “Download to CSV” just above it. The pandas function read_csv() was used to load the CSV files into dataframes.
import os
import pandas as pd
import requests
from bs4 import BeautifulSoup
import spotipy
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Since the download URLs follow a consistent pattern based on the chart date, we used a loop to build the links and then download all of the CSV files.
# Collect links to the Spotify Charts top 200 for the first day of each month
ref_str = "https://spotifycharts.com/regional/global/daily/"
ref_arr = []
for year in range(2017, 2018):
    endingMonth = 12
    if year == 2020:
        endingMonth = 10
    for month in range(1, endingMonth + 1):
        # Zero-pad the month so the date reads YYYY-MM-01
        date = f"{year}-{month:02d}-01/download"
        ref_arr.append(ref_str + date)
ref_arr
['https://spotifycharts.com/regional/global/daily/2017-01-01/download',
'https://spotifycharts.com/regional/global/daily/2017-02-01/download',
'https://spotifycharts.com/regional/global/daily/2017-03-01/download',
'https://spotifycharts.com/regional/global/daily/2017-04-01/download',
'https://spotifycharts.com/regional/global/daily/2017-05-01/download',
'https://spotifycharts.com/regional/global/daily/2017-06-01/download',
'https://spotifycharts.com/regional/global/daily/2017-07-01/download',
'https://spotifycharts.com/regional/global/daily/2017-08-01/download',
'https://spotifycharts.com/regional/global/daily/2017-09-01/download',
'https://spotifycharts.com/regional/global/daily/2017-10-01/download',
'https://spotifycharts.com/regional/global/daily/2017-11-01/download',
'https://spotifycharts.com/regional/global/daily/2017-12-01/download']
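The same list of links could also be produced by a small reusable function. The sketch below is one possible variant rather than the code we ran; the name monthly_chart_urls and its parameters are illustrative choices, and it assumes the URL pattern shown in the output above.

```python
BASE_URL = "https://spotifycharts.com/regional/global/daily/"

def monthly_chart_urls(year, last_month=12, base_url=BASE_URL):
    """Return one download URL per month, dated the 1st of each month."""
    return [
        f"{base_url}{year}-{month:02d}-01/download"
        for month in range(1, last_month + 1)
    ]

urls = monthly_chart_urls(2017)
print(len(urls))
print(urls[0])
```

Passing a smaller last_month covers a partial year without a special case inside the loop.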
# Download each chart (if needed) and append it to one dataframe
df = pd.DataFrame(columns=['position', 'track_name', 'artist', 'streams', 'url', 'date'])
# Make a directory to save the CSV files to
path = "sheets"
folderExists = False
try:
    os.mkdir(path)
except FileExistsError:
    print("Folder already exists")
    folderExists = True
for i in ref_arr:
    # String manipulation to read from the correct csv files
    date = i[48:58]
    fileName = "regional-global-daily-" + date + ".csv"
    if not folderExists:
        # Only hit the network when the files have not been downloaded yet
        print("Downloading... " + fileName)
        r = requests.get(i, allow_redirects=True)
        with open(os.path.join(path, fileName), "wb") as f:
            f.write(r.content)
    df_new = pd.read_csv(os.path.join(path, fileName))
    df_new.columns = ['position', 'track_name', 'artist', 'streams', 'url']
    df_new['date'] = date
    df_new = df_new.iloc[1:]  # deletes junk row from csv conversion
    df = pd.concat([df, df_new])  # DataFrame.append was removed in pandas 2.0
print("Done")
df = df.reset_index()  # Sets index back to being the regular 0-based index. This is really helpful when trying to add more to the dataframe later, because otherwise there are lots of duplicate indices
df['streams'] = df['streams'].astype(int)  # streams are read in as strings, so convert them to int
Folder already exists
Done
df
| index | position | track_name | artist | streams | url | date |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | Starboy | The Weeknd | 3135625 | https://open.spotify.com/track/5aAx2yezTd8zXrk... | 2017-01-01 |
1 | 2 | 2 | Closer | The Chainsmokers | 3015525 | https://open.spotify.com/track/7BKLCZ1jbUBVqRi... | 2017-01-01 |
2 | 3 | 3 | Let Me Love You | DJ Snake | 2545384 | https://open.spotify.com/track/4pdPtRcBmOSQDlJ... | 2017-01-01 |
3 | 4 | 4 | Rockabye (feat. Sean Paul & Anne-Marie) | Clean Bandit | 2356604 | https://open.spotify.com/track/5knuzwU65gJK7IF... | 2017-01-01 |
4 | 5 | 5 | One Dance | Drake | 2259887 | https://open.spotify.com/track/1xznGGDReH1oQq0... | 2017-01-01 |
... | ... | ... | ... | ... | ... | ... | ... |
2395 | 196 | 196 | Rockabye (feat. Sean Paul & Anne-Marie) | Clean Bandit | 552118 | https://open.spotify.com/track/5knuzwU65gJK7IF... | 2017-12-01 |
2396 | 197 | 197 | Rake It Up (feat. Nicki Minaj) | Yo Gotti | 551576 | https://open.spotify.com/track/4knL4iPxPOZjQzT... | 2017-12-01 |
2397 | 198 | 198 | New Freezer (feat. Kendrick Lamar) | Rich The Kid | 550167 | https://open.spotify.com/track/4pYZLpX23Vx8rwD... | 2017-12-01 |
2398 | 199 | 199 | All Night | Steve Aoki | 548039 | https://open.spotify.com/track/5mAxA6Q1SIym6dP... | 2017-12-01 |
2399 | 200 | 200 | 113 | Booba | 546878 | https://open.spotify.com/track/6xqAP7kpdgCy8lE... | 2017-12-01 |
2400 rows × 7 columns
Spotipy is a lightweight Python library for the Spotify Web API. We use it to retrieve more detailed data for the tracks now that their names have been retrieved from the Spotify Top 200. We must first authenticate our usage of the API using a Spotify account.
import os
import spotipy
from spotipy.oauth2 import SpotifyOAuth
from spotipy.oauth2 import SpotifyClientCredentials
# Read the app credentials from environment variables instead of hardcoding secrets in the notebook
SPOTIPY_CLIENT_ID = os.environ["SPOTIPY_CLIENT_ID"]
SPOTIPY_CLIENT_SECRET = os.environ["SPOTIPY_CLIENT_SECRET"]
SPOTIPY_CLIENT_REDIRECT = "https://amanuelawoke.com/groovydata/"
scope = "user-library-read"
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope=scope, client_id=SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET, redirect_uri=SPOTIPY_CLIENT_REDIRECT))
We’re going to start by using the Spotify API to get more information about all the tracks we found in the top 200 chart for the timeframe described above. The Spotify API lets us get “audio features” for a song given the track id that Spotify creates for every song. These “audio features” include characteristics like loudness, positivity (valence), danceability, how energetic the song is, the speed of the song, and a few other similar characteristics that Spotify has determined using its own machine learning algorithms.
First, we need an id for every song and artist in our dataframe to be able to query the Spotify API for a specific track or artist. Here, we look up track and artist ids, and we also query the audio features of each track id. We do all of this in a single pass, because a large number of queries through the Spotify API can take time. For testing, we cached the resulting dataframe rather than rebuilding it every time.
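Each chart row already carries a url column, and those links appear to have the form https://open.spotify.com/track/ followed by the track id. As an alternative to searching by track name, the id could likely be read straight from the link, which would also avoid matching a different track that happens to share a title. The sketch below assumes that URL layout; track_id_from_url is a hypothetical helper and the id in the example is a placeholder, not a real track.

```python
from urllib.parse import urlparse

def track_id_from_url(url):
    """Return the id from an open.spotify.com track link, or None for any other input."""
    if not isinstance(url, str):
        return None  # e.g. NaN rows
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) == 2 and parts[0] == "track":
        return parts[1]
    return None

example = track_id_from_url("https://open.spotify.com/track/PLACEHOLDER_TRACK_ID")
print(example)
```

On the dataframe this could be applied with df['url'].map(track_id_from_url), leaving None wherever the link is missing or has an unexpected shape.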
import openpyxl
artist_id_list = []
track_id_list = []
popularity_index_list = []
follower_count_list = []
audio_features_rows = []
audio_features_df = pd.DataFrame()
# If a cached dataframe exists, don't search again; otherwise query the API
if not os.path.exists("cached_df.xlsx"):
    # Take each song and look up its audio features, then create a dataframe for them
    print("Searching...")
    for index, row in df.iterrows():
        trackName = row['track_name']
        track_id = ""
        artist_id = ""
        # Some track names are NaN rather than strings, so check before searching
        if isinstance(trackName, str):
            # Delimit with +'s for the spotipy search query
            trackNameWithoutSpaces = '+'.join(trackName.split())
            searchQuery = sp.search(trackNameWithoutSpaces, 1, 0)
            if len(searchQuery['tracks']['items']) != 0:
                track_object = searchQuery['tracks']['items'][0]
                track_id = track_object['id']
                track_id_list.append(track_id)
                # If there are several artists, use the first artist
                artist_object = track_object['artists'][0] if type(track_object['artists']) is list else track_object['artists']
                artist_id = artist_object['id']
                artist_id_list.append(artist_id)
                artist_object_real = sp.artist(artist_id)
                follower_count_list.append(artist_object_real['followers']['total'])
                popularity_index_list.append(artist_object_real['popularity'])
            # If our query returned nothing, append a NaN in place of artist and track for this entry
            else:
                artist_id_list.append(np.nan)
                track_id_list.append(np.nan)
                popularity_index_list.append(np.nan)
                follower_count_list.append(np.nan)
        # If the track name itself was a NaN, append a NaN in this position
        else:
            artist_id_list.append(np.nan)
            track_id_list.append(np.nan)
            popularity_index_list.append(np.nan)
            follower_count_list.append(np.nan)
        # Define the audio features as NaN to begin
        audiofeatures = {'duration_ms' : np.nan, 'key' : np.nan, 'mode' : np.nan, 'time_signature' : np.nan, 'acousticness' : np.nan, 'danceability' : np.nan, 'energy' : np.nan, 'instrumentalness' : np.nan, 'liveness' : np.nan, 'loudness' : np.nan, 'speechiness' : np.nan, 'valence' : np.nan, 'tempo' : np.nan, 'id' : np.nan, 'uri' : np.nan, 'track_href' : np.nan, 'analysis_url' : np.nan, 'type' : np.nan, }
        # If the search found a track, get its audio features (keep the NaN defaults if none come back)
        if track_id != "":
            audiofeatures = sp.audio_features(track_id)[0] or audiofeatures
        audio_features_rows.append(audiofeatures)
    # Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
    audio_features_df = pd.DataFrame(audio_features_rows)
    # Add the artist-level columns
    audio_features_df['artist_id'] = artist_id_list
    audio_features_df['popularity_index'] = popularity_index_list
    audio_features_df['follower_count'] = follower_count_list
    # Store the created dataframe in the cache
    audio_features_df.to_excel('cached_df.xlsx', sheet_name='Sheet1', engine='openpyxl')
else:  # Access the cached df if it exists
    print("Cached dataframe found.")
    audio_features_df = pd.read_excel("cached_df.xlsx", engine="openpyxl")
    audio_features_df.drop(["Unnamed: 0"], axis=1, inplace=True)  # drop the index column that to_excel wrote into the cache
audio_features_df
Cached dataframe found.
| acousticness | analysis_url | danceability | duration_ms | energy | id | instrumentalness | key | liveness | loudness | ... | speechiness | tempo | time_signature | track_href | type | uri | valence | artist_id | popularity_index | follower_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.14100 | https://api.spotify.com/v1/audio-analysis/7MXV... | 0.679 | 230453.0 | 0.587 | 7MXVkk9YMctZqd1Srtv4MB | 0.000006 | 7.0 | 0.137 | -7.015 | ... | 0.2760 | 186.003 | 4.0 | https://api.spotify.com/v1/tracks/7MXVkk9YMctZ... | audio_features | spotify:track:7MXVkk9YMctZqd1Srtv4MB | 0.486 | 1Xyo4u8uXC1ZmMpatF05PJ | 94.0 | 26720759.0 |
1 | 0.41400 | https://api.spotify.com/v1/audio-analysis/7BKL... | 0.748 | 244960.0 | 0.524 | 7BKLCZ1jbUBVqRi2FVlTVw | 0.000000 | 8.0 | 0.111 | -5.599 | ... | 0.0338 | 95.010 | 4.0 | https://api.spotify.com/v1/tracks/7BKLCZ1jbUBV... | audio_features | spotify:track:7BKLCZ1jbUBVqRi2FVlTVw | 0.661 | 69GGBxA162lTqCwzJG5jLp | 84.0 | 17093912.0 |
2 | 0.23500 | https://api.spotify.com/v1/audio-analysis/3ibK... | 0.656 | 256733.0 | 0.578 | 3ibKnFDaa3GhpPGlOUj7ff | 0.000000 | 7.0 | 0.118 | -8.970 | ... | 0.0922 | 94.514 | 4.0 | https://api.spotify.com/v1/tracks/3ibKnFDaa3Gh... | audio_features | spotify:track:3ibKnFDaa3GhpPGlOUj7ff | 0.556 | 20s0P9QLxGqKuCsGwFsp7w | 69.0 | 2055274.0 |
3 | 0.40600 | https://api.spotify.com/v1/audio-analysis/5knu... | 0.720 | 251088.0 | 0.763 | 5knuzwU65gJK7IF5yJsuaW | 0.000000 | 9.0 | 0.180 | -4.068 | ... | 0.0523 | 101.965 | 4.0 | https://api.spotify.com/v1/tracks/5knuzwU65gJK... | audio_features | spotify:track:5knuzwU65gJK7IF5yJsuaW | 0.742 | 6MDME20pz9RveH9rEXvrOM | 80.0 | 4092589.0 |
4 | 0.00776 | https://api.spotify.com/v1/audio-analysis/1zi7... | 0.792 | 173987.0 | 0.625 | 1zi7xx7UVEFkmKfv06H8x0 | 0.001800 | 1.0 | 0.329 | -5.609 | ... | 0.0536 | 103.967 | 4.0 | https://api.spotify.com/v1/tracks/1zi7xx7UVEFk... | audio_features | spotify:track:1zi7xx7UVEFkmKfv06H8x0 | 0.370 | 3TVXtAsR1Inumwj472S9r4 | 96.0 | 51374698.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2395 | 0.40600 | https://api.spotify.com/v1/audio-analysis/5knu... | 0.720 | 251088.0 | 0.763 | 5knuzwU65gJK7IF5yJsuaW | 0.000000 | 9.0 | 0.180 | -4.068 | ... | 0.0523 | 101.965 | 4.0 | https://api.spotify.com/v1/tracks/5knuzwU65gJK... | audio_features | spotify:track:5knuzwU65gJK7IF5yJsuaW | 0.742 | 6MDME20pz9RveH9rEXvrOM | 80.0 | 4092589.0 |
2396 | 0.02200 | https://api.spotify.com/v1/audio-analysis/4knL... | 0.910 | 276333.0 | 0.444 | 4knL4iPxPOZjQzTUlELGSY | 0.000000 | 1.0 | 0.137 | -8.126 | ... | 0.3440 | 149.953 | 4.0 | https://api.spotify.com/v1/tracks/4knL4iPxPOZj... | audio_features | spotify:track:4knL4iPxPOZjQzTUlELGSY | 0.530 | 6Ha4aES39QiVjR0L2lwuwq | 75.0 | 3109571.0 |
2397 | 0.04050 | https://api.spotify.com/v1/audio-analysis/2EgB... | 0.884 | 191938.0 | 0.698 | 2EgB4n6XyBsuNUbuarr4eG | 0.000000 | 0.0 | 0.195 | -9.101 | ... | 0.3640 | 140.068 | 4.0 | https://api.spotify.com/v1/tracks/2EgB4n6XyBsu... | audio_features | spotify:track:2EgB4n6XyBsuNUbuarr4eG | 0.575 | 1pPmIToKXyGdsCF6LmqLmI | 78.0 | 2419234.0 |
2398 | 0.00410 | https://api.spotify.com/v1/audio-analysis/0dXN... | 0.538 | 197640.0 | 0.804 | 0dXNQ8dckG4eYfEtq9zcva | 0.000000 | 8.0 | 0.330 | -5.194 | ... | 0.0358 | 144.992 | 4.0 | https://api.spotify.com/v1/tracks/0dXNQ8dckG4e... | audio_features | spotify:track:0dXNQ8dckG4eYfEtq9zcva | 0.507 | 7gAppWoH7pcYmphCVTXkzs | 76.0 | 4082406.0 |
2399 | 0.00805 | https://api.spotify.com/v1/audio-analysis/0leV... | 0.740 | 266672.0 | 0.510 | 0leVyLipY7A8ruhkIBqc0E | 0.000375 | 9.0 | 0.128 | -8.042 | ... | 0.0780 | 141.534 | 5.0 | https://api.spotify.com/v1/tracks/0leVyLipY7A8... | audio_features | spotify:track:0leVyLipY7A8ruhkIBqc0E | 0.089 | 0JOxt5QOwq0czoJxvSc5hS | 70.0 | 168927.0 |
2400 rows × 21 columns
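Looking up audio features one track at a time means one request per row, which is part of why the search is slow. As far as we know, Spotify's audio-features endpoint accepts up to 100 ids per request and sp.audio_features also accepts a list of ids, so the lookups could be batched. The sketch below shows only the batching logic: chunked and fetch_in_batches are hypothetical helpers, and fake_fetch is a stand-in for the real API call so the example runs without credentials.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_in_batches(ids, fetch, size=100):
    """Call `fetch` once per batch and collect the results in input order."""
    results = []
    for batch in chunked(ids, size):
        results.extend(fetch(batch))
    return results

# Stand-in for the real API call, so the sketch runs offline.
calls = []
def fake_fetch(batch):
    calls.append(len(batch))
    return [{"id": track_id} for track_id in batch]

ids = [f"id{i}" for i in range(250)]
features = fetch_in_batches(ids, fake_fetch)
print(calls)
print(len(features))
```

Since many tracks appear on the chart in several months, de-duplicating the ids before batching would reduce the number of requests further.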
#Append audio features to master dataframe
df['track_id'] = audio_features_df['id']
df['duration_ms'] = audio_features_df['duration_ms']
df['acousticness'] = audio_features_df['acousticness']
df['danceability'] = audio_features_df['danceability']
df['energy'] = audio_features_df['energy']
df['instrumentalness'] = audio_features_df['instrumentalness']
df['liveness'] = audio_features_df['liveness']
df['loudness'] = audio_features_df['loudness']
df['speechiness'] = audio_features_df['speechiness']
df['valence'] = audio_features_df['valence']
df['tempo'] = audio_features_df['tempo']
df['artist_id'] = audio_features_df['artist_id']
df['popularity_index'] = audio_features_df['popularity_index']
df['follower_count'] = audio_features_df['follower_count']
df = df.drop(columns='index')
df
| position | track_name | artist | streams | url | date | track_id | duration_ms | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | valence | tempo | artist_id | popularity_index | follower_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Starboy | The Weeknd | 3135625 | https://open.spotify.com/track/5aAx2yezTd8zXrk... | 2017-01-01 | 7MXVkk9YMctZqd1Srtv4MB | 230453.0 | 0.14100 | 0.679 | 0.587 | 0.000006 | 0.137 | -7.015 | 0.2760 | 0.486 | 186.003 | 1Xyo4u8uXC1ZmMpatF05PJ | 94.0 | 26720759.0 |
1 | 2 | Closer | The Chainsmokers | 3015525 | https://open.spotify.com/track/7BKLCZ1jbUBVqRi... | 2017-01-01 | 7BKLCZ1jbUBVqRi2FVlTVw | 244960.0 | 0.41400 | 0.748 | 0.524 | 0.000000 | 0.111 | -5.599 | 0.0338 | 0.661 | 95.010 | 69GGBxA162lTqCwzJG5jLp | 84.0 | 17093912.0 |
2 | 3 | Let Me Love You | DJ Snake | 2545384 | https://open.spotify.com/track/4pdPtRcBmOSQDlJ... | 2017-01-01 | 3ibKnFDaa3GhpPGlOUj7ff | 256733.0 | 0.23500 | 0.656 | 0.578 | 0.000000 | 0.118 | -8.970 | 0.0922 | 0.556 | 94.514 | 20s0P9QLxGqKuCsGwFsp7w | 69.0 | 2055274.0 |
3 | 4 | Rockabye (feat. Sean Paul & Anne-Marie) | Clean Bandit | 2356604 | https://open.spotify.com/track/5knuzwU65gJK7IF... | 2017-01-01 | 5knuzwU65gJK7IF5yJsuaW | 251088.0 | 0.40600 | 0.720 | 0.763 | 0.000000 | 0.180 | -4.068 | 0.0523 | 0.742 | 101.965 | 6MDME20pz9RveH9rEXvrOM | 80.0 | 4092589.0 |
4 | 5 | One Dance | Drake | 2259887 | https://open.spotify.com/track/1xznGGDReH1oQq0... | 2017-01-01 | 1zi7xx7UVEFkmKfv06H8x0 | 173987.0 | 0.00776 | 0.792 | 0.625 | 0.001800 | 0.329 | -5.609 | 0.0536 | 0.370 | 103.967 | 3TVXtAsR1Inumwj472S9r4 | 96.0 | 51374698.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2395 | 196 | Rockabye (feat. Sean Paul & Anne-Marie) | Clean Bandit | 552118 | https://open.spotify.com/track/5knuzwU65gJK7IF... | 2017-12-01 | 5knuzwU65gJK7IF5yJsuaW | 251088.0 | 0.40600 | 0.720 | 0.763 | 0.000000 | 0.180 | -4.068 | 0.0523 | 0.742 | 101.965 | 6MDME20pz9RveH9rEXvrOM | 80.0 | 4092589.0 |
2396 | 197 | Rake It Up (feat. Nicki Minaj) | Yo Gotti | 551576 | https://open.spotify.com/track/4knL4iPxPOZjQzT... | 2017-12-01 | 4knL4iPxPOZjQzTUlELGSY | 276333.0 | 0.02200 | 0.910 | 0.444 | 0.000000 | 0.137 | -8.126 | 0.3440 | 0.530 | 149.953 | 6Ha4aES39QiVjR0L2lwuwq | 75.0 | 3109571.0 |
2397 | 198 | New Freezer (feat. Kendrick Lamar) | Rich The Kid | 550167 | https://open.spotify.com/track/4pYZLpX23Vx8rwD... | 2017-12-01 | 2EgB4n6XyBsuNUbuarr4eG | 191938.0 | 0.04050 | 0.884 | 0.698 | 0.000000 | 0.195 | -9.101 | 0.3640 | 0.575 | 140.068 | 1pPmIToKXyGdsCF6LmqLmI | 78.0 | 2419234.0 |
2398 | 199 | All Night | Steve Aoki | 548039 | https://open.spotify.com/track/5mAxA6Q1SIym6dP... | 2017-12-01 | 0dXNQ8dckG4eYfEtq9zcva | 197640.0 | 0.00410 | 0.538 | 0.804 | 0.000000 | 0.330 | -5.194 | 0.0358 | 0.507 | 144.992 | 7gAppWoH7pcYmphCVTXkzs | 76.0 | 4082406.0 |
2399 | 200 | 113 | Booba | 546878 | https://open.spotify.com/track/6xqAP7kpdgCy8lE... | 2017-12-01 | 0leVyLipY7A8ruhkIBqc0E | 266672.0 | 0.00805 | 0.740 | 0.510 | 0.000375 | 0.128 | -8.042 | 0.0780 | 0.089 | 141.534 | 0JOxt5QOwq0czoJxvSc5hS | 70.0 | 168927.0 |
2400 rows × 20 columns
# Fix types: some numeric values are stored as strings and need to be converted to numbers
df['streams'] = df['streams'].astype(float)
df['position'] = df['position'].astype(int)
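We’ve now gathered and manipulated valuable data for each track for each day recorded. The key elements are the chart columns position, track_name, artist, streams, and date, together with the artist-level popularity_index and follower_count.
The following columns describe the patterns and properties of the music, the way it sounds, and the mood it instills: duration_ms, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, valence, and tempo.
The following columns are more extraneous details used to identify tracks during data wrangling: url, track_id, and artist_id.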
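Using this data, we begin trying to observe what traits of a song bring success. First we look at stream counts. The histogram below plots the means of repeated random samples. These sample means are approximately normally distributed and mostly fall between about 1 and 1.05 million streams, close to the population mean of roughly 1.02 million. The individual stream counts themselves are right-skewed: the median (about 0.72 million) sits well below the mean.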
# Take 500 random chart entries, average their streams, and repeat this 100 times; the histogram shows those 100 sample means
from scipy.stats import normaltest
from numpy.random import seed
from numpy.random import randn
alpha = 0.05
data = []
for i in range(0,100):
    data.append(np.mean(df['streams'].sample(n=500)))
plt.hist(data)
plt.title("Frequency Distribution of Streams")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
print("Population mean: ", df['streams'].mean())
print("Population median: ", df['streams'].median())
print("Population STDDEV: ", df['streams'].std())
Population mean: 1023582.30625
Population median: 719465.0
Population STDDEV: 804478.6072499221
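The histogram above is built from sample means rather than from individual stream counts, and sample means tend to look bell-shaped even when the underlying data is skewed. The sketch below illustrates this with synthetic, deliberately skewed data (an exponential distribution, chosen purely for illustration; it is not the Spotify data). The sizes 2400, 500, and 100 mirror the ones used in this section.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic right-skewed "population": most values are small, a few are very large.
population = [random.expovariate(1.0) for _ in range(2400)]

# 100 sample means, each from 500 values drawn without replacement
sample_means = [statistics.mean(random.sample(population, 500)) for _ in range(100)]

pop_mean = statistics.mean(population)
pop_median = statistics.median(population)
print("population mean:      ", round(pop_mean, 3))
print("population median:    ", round(pop_median, 3))
print("mean of sample means: ", round(statistics.mean(sample_means), 3))
print("stdev of sample means:", round(statistics.stdev(sample_means), 3))
```

In a run like this the median falls below the mean, which signals skew in the raw values, while the sample means sit in a narrow band around the population mean. A bell-shaped histogram of sample means therefore says little about the shape of the raw data.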
Our goal is to determine if there are certain values of song properties that result in extremely high or low success. We create a dataframe that only saves the entry of a song at its peak stream count in the Top 200, meaning we are comparing all the peaks.
# Create a version of the table with no duplicate songs, keeping each song's entry with its highest stream count (its peak). It is a fair representation of success.
no_dupes_df = df.copy()
no_dupes_df = no_dupes_df.sort_values('streams', ascending=False).drop_duplicates(['artist', 'duration_ms', 'acousticness', 'danceability', 'energy'], keep='first')
no_dupes_df
| position | track_name | artist | streams | url | date | track_id | duration_ms | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | valence | tempo | artist_id | popularity_index | follower_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
200 | 1 | Shape of You | Ed Sheeran | 7549041.0 | https://open.spotify.com/track/7qiZfU4dY1lWllz... | 2017-02-01 | 7qiZfU4dY1lWllzX7mPBI3 | 233713.0 | 0.5810 | 0.825 | 0.652 | 0.000000 | 0.0931 | -3.183 | 0.0802 | 0.931 | 95.977 | 6eUKZXaKkcviH0Ku9w2n3V | 91.0 | 73345259.0 |
1000 | 1 | Despacito - Remix | Luis Fonsi | 7332260.0 | https://open.spotify.com/track/5CtI0qwDJkDQGwX... | 2017-06-01 | 6rPO02ozF3bM7NnOV4h6s2 | 228827.0 | 0.2280 | 0.653 | 0.816 | 0.000000 | 0.0967 | -4.353 | 0.1670 | 0.816 | 178.085 | 4V8Sr092TqfHkfAA5fXXqG | 78.0 | 9035487.0 |
2000 | 1 | rockstar | Post Malone | 5755610.0 | https://open.spotify.com/track/7wGoVu4Dady5GV0... | 2017-11-01 | 7ytR5pFWmSjzHJIeQkgog4 | 181733.0 | 0.2470 | 0.746 | 0.690 | 0.000000 | 0.1010 | -7.956 | 0.1640 | 0.497 | 89.977 | 4r63FhuTkUYltbVAg5TQnk | 93.0 | 5174251.0 |
1600 | 1 | Look What You Made Me Do | Taylor Swift | 5547962.0 | https://open.spotify.com/track/6uFsE1JgZ20EXyU... | 2017-09-01 | 1P17dC1amhFzptugyAO7Il | 211853.0 | 0.2040 | 0.766 | 0.709 | 0.000014 | 0.1260 | -6.471 | 0.1230 | 0.506 | 128.070 | 06HL4z0CvFAxyc27GXpf02 | 97.0 | 34579892.0 |
1001 | 2 | I'm the One | DJ Khaled | 5208996.0 | https://open.spotify.com/track/72Q0FQQo32KJloi... | 2017-06-01 | 1jYiIOC5d6soxkJP81fxq2 | 288877.0 | 0.0533 | 0.599 | 0.667 | 0.000000 | 0.1340 | -4.267 | 0.0367 | 0.817 | 80.984 | 0QHgL1lAIqAw0HtD7YldmP | 82.0 | 5405048.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
193 | 194 | Famous | Kanye West | 336134.0 | https://open.spotify.com/track/19a3JfW8BQwqHWU... | 2017-01-01 | 19a3JfW8BQwqHWUMbcqSx8 | 196040.0 | 0.0711 | 0.465 | 0.735 | 0.000000 | 0.0975 | -3.715 | 0.1170 | 0.409 | 173.935 | 5K4W6rqBFWDnAN6FQUkS6x | 90.0 | 12912141.0 |
196 | 197 | Oh Lord | MiC LOWRY | 331792.0 | https://open.spotify.com/track/1sTUEdVO85YU8Ym... | 2017-01-01 | 1ISsiC4Fw6f96kZQegLGiJ | 198253.0 | 0.4070 | 0.493 | 0.738 | 0.000000 | 0.1300 | -6.921 | 0.2620 | 0.219 | 176.071 | 6fOMl44jA4Sp5b9PpYCkzz | 84.0 | 4600363.0 |
197 | 198 | Superstition - Single Version | Stevie Wonder | 331376.0 | https://open.spotify.com/track/5lXcSvHRVjQJ3LB... | 2017-01-01 | 1h2xVEoJORqrg71HocgqXd | 245493.0 | 0.0380 | 0.633 | 0.634 | 0.006400 | 0.0385 | -12.115 | 0.0725 | 0.872 | 100.499 | 7guDJrEfX3qb6FEbdPA5qi | 80.0 | 4654921.0 |
198 | 199 | Secrets | The Weeknd | 331233.0 | https://open.spotify.com/track/3DX4Y0egvc0slLc... | 2017-01-01 | 1NhPKVLsHhFUHIOZ32QnS2 | 224693.0 | 0.0717 | 0.516 | 0.764 | 0.000000 | 0.1150 | -6.223 | 0.0366 | 0.376 | 148.021 | 5Pwc4xIPtQLFEnJriah9YJ | 83.0 | 11061770.0 |
199 | 200 | Ni**as In Paris | JAY-Z | 325951.0 | https://open.spotify.com/track/2KpCpk6HjXXLb7n... | 2017-01-01 | 4Li2WHPkuyCdtmokzW2007 | 219333.0 | 0.1270 | 0.789 | 0.858 | 0.000000 | 0.3490 | -5.542 | 0.3110 | 0.775 | 140.022 | 3nFkdlSjzX9mRTtwJOzDYB | 85.0 | 5812536.0 |
657 rows × 20 columns
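To make the de-duplication step concrete, here is a toy example with made-up rows (the track names and numbers are invented for illustration). Sorting by streams in descending order and then keeping the first occurrence of each track retains the row where that track peaked.

```python
import pandas as pd

toy = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song A", "Song B"],
    "artist":     ["Artist 1", "Artist 2", "Artist 1", "Artist 2"],
    "streams":    [100, 300, 250, 200],
    "date":       ["2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01"],
})

# Highest stream counts first, then keep the first (peak) row per track
peaks = (
    toy.sort_values("streams", ascending=False)
       .drop_duplicates(["track_name", "artist"], keep="first")
)
print(peaks)
```

Our notebook keys on the artist plus several audio features instead of the track name, but the idea is the same.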
Now we will visualize the relationship between stream count and each song property.
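A correlation coefficient can give a rough numeric summary of each of these relationships. The sketch below uses the pandas corrwith method on a small invented table; the values are made up, and on the real data one would presumably pass no_dupes_df and its feature columns instead. Pearson correlation only captures linear association, so it complements the plots rather than replacing them.

```python
import pandas as pd

toy = pd.DataFrame({
    "streams":      [100, 200, 300, 400, 500],
    "danceability": [0.50, 0.55, 0.60, 0.65, 0.70],  # rises steadily with streams
    "tempo":        [120, 90, 150, 100, 130],        # no clear pattern
})

features = ["danceability", "tempo"]
# Pearson correlation of each feature column with the streams column
correlations = toy[features].corrwith(toy["streams"]).sort_values(ascending=False)
print(correlations)
```

Features whose coefficient is near zero show little linear relationship with stream count, though a nonlinear pattern could still be visible in a scatter plot.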
plt.scatter(no_dupes_df['popularity_index'], no_dupes_df['streams'])
plt.title('Streams in Relation to Popularity')
plt.xlabel('popularity value')
plt.ylabel('streams in millions')
print("Mean of popularity index: " + str(no_dupes_df['popularity_index'].mean()))
print("Median of popularity index: " + str(no_dupes_df['popularity_index'].median()))
print("STDDEV of popularity index: " + str(no_dupes_df['popularity_index'].std()))
Mean of popularity index: 81.84579439252336
Median of popularity index: 83.0
STDDEV of popularity index: 10.20413589979866
The data appears to cluster around the mean, so we decided to check whether sample means of the popularity index were normally distributed.
# Average the popularity index over 1000 random chart entries, repeated 100 times
data = []
for i in range(0,100):
    data.append(np.mean(df['popularity_index'].sample(n=1000)))
plt.hist(data)
plt.title("Frequency Distribution of Popularity Indices")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
Text(0, 0.5, 'Frequency')
The sample means of the artist popularity index do appear to be approximately normally distributed, although the exact shape varies from run to run. It also appears that the majority of songs on the top 200 come from artists with a popularity of around 80.
plt.scatter(no_dupes_df['follower_count'], no_dupes_df['streams'])
plt.title('Streams in Relation to Follower Count')
plt.xlabel('number of artist followers in tens of millions')
plt.ylabel('streams in millions')
print("Mean of follower count: " + str(no_dupes_df['follower_count'].mean()))
print("Median of follower count: " + str(no_dupes_df['follower_count'].median()))
print("STDDEV of follower count: " + str(no_dupes_df['follower_count'].std()))
Mean of follower count: 13744629.682242991
Median of follower count: 7748023.0
STDDEV of follower count: 16091903.457271803
plt.scatter(no_dupes_df['duration_ms'], no_dupes_df['streams'])
plt.title('Streams in Relation to Song Duration')
plt.xlabel('duration in milliseconds')
plt.ylabel('streams in millions')
print("Mean of song length in seconds: " + str(no_dupes_df['duration_ms'].mean() / 1000))
print("Median of song length in seconds: " + str(no_dupes_df['duration_ms'].median() / 1000))
print("STDDEV of song length in seconds: " + str(no_dupes_df['duration_ms'].std() / 1000))
Mean of song length in seconds: 215.89356386292835
Median of song length in seconds: 213.981
STDDEV of song length in seconds: 41.844669610266095
plt.scatter(no_dupes_df['acousticness'],no_dupes_df['streams'])
plt.title('Streams in Relation to Acousticness')
plt.xlabel('acousticness scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of acousticness index: " + str(no_dupes_df['acousticness'].mean()))
print("Median of acousticness index: " + str(no_dupes_df['acousticness'].median()))
print("STDDEV of acousticness index: " + str(no_dupes_df['acousticness'].std()))
Mean of acousticness index: 0.21230590903426794
Median of acousticness index: 0.11
STDDEV of acousticness index: 0.23844285456931844
plt.scatter(no_dupes_df['danceability'],no_dupes_df['streams'])
plt.title('Streams in Relation to Danceability')
plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of danceability index: " + str(no_dupes_df['danceability'].mean()))
print("Median of danceability index: " + str(no_dupes_df['danceability'].median()))
print("STDDEV of danceability index: " + str(no_dupes_df['danceability'].std()))
Mean of danceability index: 0.6818423676012461
Median of danceability index: 0.695
STDDEV of danceability index: 0.13576219623892008
plt.scatter(no_dupes_df['energy'],no_dupes_df['streams'])
plt.title('Streams in Relation to Energy')
plt.xlabel('energy scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of energy index: " + str(no_dupes_df['energy'].mean()))
print("Median of energy index: " + str(no_dupes_df['energy'].median()))
print("STDDEV of energy index: " + str(no_dupes_df['energy'].std()))
Mean of energy index: 0.6355397507788161
Median of energy index: 0.6515
STDDEV of energy index: 0.17854747086125813
plt.scatter(no_dupes_df['instrumentalness'],no_dupes_df['streams'])
plt.title('Streams in Relation to Instrumentalness')
plt.xlabel('instrumentalness scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of instrumentalness index: " + str(no_dupes_df['instrumentalness'].mean()))
print("Median of instrumentalness index: " + str(no_dupes_df['instrumentalness'].median()))
print("STDDEV of instrumentalness index: " + str(no_dupes_df['instrumentalness'].std()))
Mean of instrumentalness index: 0.013712031915887851
Median of instrumentalness index: 0.0
STDDEV of instrumentalness index: 0.08112995046942596
plt.scatter(no_dupes_df['liveness'],no_dupes_df['streams'])
plt.title('Streams in Relation to Liveness')
plt.xlabel('liveness scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of liveness index: " + str(no_dupes_df['liveness'].mean()))
print("Median of liveness index: " + str(no_dupes_df['liveness'].median()))
print("STDDEV of liveness index: " + str(no_dupes_df['liveness'].std()))
Mean of liveness index: 0.1735563862928349
Median of liveness index: 0.123
STDDEV of liveness index: 0.12771847354589183
plt.scatter(no_dupes_df['loudness'],no_dupes_df['streams'])
plt.title('Streams in Relation to Loudness')
plt.xlabel('loudness in decibels')
plt.ylabel('streams in millions')
print("Mean of volume: " + str(no_dupes_df['loudness'].mean()))
print("Median of volume: " + str(no_dupes_df['loudness'].median()))
print("STDDEV of volume: " + str(no_dupes_df['loudness'].std()))
Mean of volume: -6.436602803738317
Median of volume: -5.992
STDDEV of volume: 2.930078470544615
Lots of points appear to be around the mean volume, so let’s check and see if this data is normally distributed.
data = []
# Draw 100 random samples of 100 tracks each and record each sample's mean loudness
for i in range(0, 100):
    data.append(np.mean(df['loudness'].sample(n=100)))
plt.hist(data)
plt.title("Frequency Distribution of Average Song Volumes")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
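The means of these random samples appear to be approximately normally distributed. This is what the central limit theorem predicts for sample means, and it tells us that a typical sample mean falls close to the population mean. It does not by itself show that the loudness of individual tracks is normally distributed. From the statistics above, the mean loudness is about -6 decibels, with a standard deviation of around 3 decibels.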
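To make the distinction between the raw values and their sample means concrete, here is a minimal sketch. It assumes a made-up, deliberately skewed `loudness` array (not the notebook's data) and a hypothetical `skewness` helper. It repeats the sampling procedure above and compares how symmetric the raw values and the sample means are.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in for df['loudness']: a deliberately left-skewed column.
loudness = -rng.gamma(shape=2.0, scale=3.0, size=5000)


def skewness(values):
    """Sample skewness: roughly 0 for symmetric data, negative for a long left tail."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float((centered ** 3).mean() / values.std() ** 3)


# Means of 100 random samples of 100 tracks each, as in the cell above.
sample_means = [
    rng.choice(loudness, size=100, replace=False).mean() for _ in range(100)
]

raw_skew = skewness(loudness)
means_skew = skewness(sample_means)
print(f"skewness of raw values:   {raw_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
```

In this toy setup the sample means come out far more symmetric than the raw values, so a bell-shaped histogram of sample means is expected even when the underlying column is skewed. A histogram of the raw column is the more direct check of normality.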
plt.scatter(no_dupes_df['speechiness'], no_dupes_df['streams'])
plt.title('Streams in Relation to Speechiness')
plt.xlabel('speechiness scale of 0-.5')
plt.ylabel('streams in millions')
print("Mean of speechiness index: " + str(no_dupes_df['speechiness'].mean()))
print("Median of speechiness index: " + str(no_dupes_df['speechiness'].median()))
print("STDDEV of speechiness index: " + str(no_dupes_df['speechiness'].std()))
Mean of speechiness index: 0.1167436137071651
Median of speechiness index: 0.0678
STDDEV of speechiness index: 0.11056578687587934
plt.scatter(no_dupes_df['valence'],no_dupes_df['streams'])
plt.title('Streams in Relation to Valence')
plt.xlabel('valence scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of valence index: " + str(no_dupes_df['valence'].mean()))
print("Median of valence index: " + str(no_dupes_df['valence'].median()))
print("STDDEV of valence index: " + str(no_dupes_df['valence'].std()))
Mean of valence index: 0.48936448598130844
Median of valence index: 0.48
STDDEV of valence index: 0.23614792618729277
plt.scatter(no_dupes_df['tempo'],no_dupes_df['streams'])
plt.title('Streams in Relation to Tempo')
plt.xlabel('tempo scale of 0-200 beats per minute')
plt.ylabel('streams in millions')
print("Mean of tempo: " + str(no_dupes_df['tempo'].mean()))
print("Median of tempo: " + str(no_dupes_df['tempo'].median()))
print("STDDEV of tempo: " + str(no_dupes_df['tempo'].std()))
Mean of tempo: 121.39503894080995
Median of tempo: 119.9425
STDDEV of tempo: 29.609868539398786
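The mean, median, and standard deviation printed for each feature above could also be gathered into a single table. The sketch below uses a small made-up dataframe (`toy`, standing in for `no_dupes_df`) and pandas' `agg`, so treat it as an illustration rather than the code that produced the numbers above.

```python
import pandas as pd

# Made-up values; only the column names follow the notebook.
toy = pd.DataFrame({
    "energy": [0.5, 0.6, 0.7, 0.8],
    "tempo": [90.0, 120.0, 120.0, 150.0],
})

# One row per statistic, one column per feature.
summary = toy[["energy", "tempo"]].agg(["mean", "median", "std"])
print(summary)
```

Adding more feature names to the column list extends the table without repeating the three `print` calls per feature.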
Next, we create a correlation matrix to see the pairwise relationships between all of the numeric values. It summarizes the scatter plots above in a single chart.
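# Restrict to numeric columns so text columns (track name, URL, etc.) are not passed to corr()
corr = no_dupes_df.select_dtypes(include='number').corr()
# Mask the upper triangle, which mirrors the lower one
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})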
After seeing this, we had a couple of ideas. It is possible that the features we are currently tracking work together to make a song popular, but it is also possible that we are missing other important features. Looking back at the most popular songs across our entire dataframe, we noticed that the majority of artists were well known or already accomplished. While it is unsurprising that an artist’s “followers” (or typical listeners) increase the number of streams a song gets, it would be interesting to know whether the number of typical listeners matters more than all of the other aspects of the song.
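One way to compare features on this question is to read the `streams` column of the correlation matrix and rank the features by the strength of their correlation. The sketch below does this on a synthetic dataframe (`toy_df`, with made-up values in which streams are constructed to depend on follower count), so its output says nothing about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in for no_dupes_df: column names follow the notebook, values are made up.
followers = rng.lognormal(mean=15, sigma=1, size=n)
toy_df = pd.DataFrame({
    "follower_count": followers,
    "valence": rng.uniform(0, 1, size=n),
    "tempo": rng.normal(120, 30, size=n),
    "artist": ["someone"] * n,  # non-numeric column, excluded below
})
toy_df["streams"] = 0.02 * followers + rng.normal(0, 50_000, size=n)

# Correlation of every numeric feature with streams, strongest first.
corr = toy_df.select_dtypes(include="number").corr()
with_streams = corr["streams"].drop("streams")
order = with_streams.abs().sort_values(ascending=False).index
ranked = with_streams.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top instead of pushing them to the bottom of the list.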
Here we look at the traits of the songs at ranks 1-10. The song holding each rank changes from one chart date to the next, so the same rank can appear several times in the plots below, with different values each time.
top10s = df.loc[df['position'] <= 10]
# Lists for the colors and legend, defined once to avoid redundant code
color_list = ['r', 'orange', 'yellow', 'lime', 'cyan', 'b', 'brown' , 'violet', 'purple', 'black']
top10_legend = ['Rank 1', 'Rank 2', 'Rank 3', 'Rank 4', 'Rank 5', 'Rank 6','Rank 7','Rank 8','Rank 9','Rank 10']
# Helper to avoid repeating the plotting code: scatter one feature against streams, colored by rank
def plotTop10(name):
    for index, row in top10s.iterrows():
        # Pick the color from the row's chart position (1-10) rather than from its order in the dataframe
        plt.scatter(row[name], row['streams'], color=color_list[int(row['position']) - 1])
top10s.head()
index | position | track_name | artist | streams | url | date | track_id | duration_ms | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | valence | tempo | artist_id | popularity_index | follower_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Starboy | The Weeknd | 3135625.0 | https://open.spotify.com/track/5aAx2yezTd8zXrk... | 2017-01-01 | 7MXVkk9YMctZqd1Srtv4MB | 230453.0 | 0.14100 | 0.679 | 0.587 | 0.000006 | 0.137 | -7.015 | 0.2760 | 0.486 | 186.003 | 1Xyo4u8uXC1ZmMpatF05PJ | 94.0 | 26720759.0 |
1 | 2 | Closer | The Chainsmokers | 3015525.0 | https://open.spotify.com/track/7BKLCZ1jbUBVqRi... | 2017-01-01 | 7BKLCZ1jbUBVqRi2FVlTVw | 244960.0 | 0.41400 | 0.748 | 0.524 | 0.000000 | 0.111 | -5.599 | 0.0338 | 0.661 | 95.010 | 69GGBxA162lTqCwzJG5jLp | 84.0 | 17093912.0 |
2 | 3 | Let Me Love You | DJ Snake | 2545384.0 | https://open.spotify.com/track/4pdPtRcBmOSQDlJ... | 2017-01-01 | 3ibKnFDaa3GhpPGlOUj7ff | 256733.0 | 0.23500 | 0.656 | 0.578 | 0.000000 | 0.118 | -8.970 | 0.0922 | 0.556 | 94.514 | 20s0P9QLxGqKuCsGwFsp7w | 69.0 | 2055274.0 |
3 | 4 | Rockabye (feat. Sean Paul & Anne-Marie) | Clean Bandit | 2356604.0 | https://open.spotify.com/track/5knuzwU65gJK7IF... | 2017-01-01 | 5knuzwU65gJK7IF5yJsuaW | 251088.0 | 0.40600 | 0.720 | 0.763 | 0.000000 | 0.180 | -4.068 | 0.0523 | 0.742 | 101.965 | 6MDME20pz9RveH9rEXvrOM | 80.0 | 4092589.0 |
4 | 5 | One Dance | Drake | 2259887.0 | https://open.spotify.com/track/1xznGGDReH1oQq0... | 2017-01-01 | 1zi7xx7UVEFkmKfv06H8x0 | 173987.0 | 0.00776 | 0.792 | 0.625 | 0.001800 | 0.329 | -5.609 | 0.0536 | 0.370 | 103.967 | 3TVXtAsR1Inumwj472S9r4 | 96.0 | 51374698.0 |
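Because each rank is held by different songs on different chart dates, a per-rank summary can complement the scatter plots that follow. The sketch below assumes a tiny made-up dataframe (`toy_df`, with two chart dates and three ranks) and uses `groupby` to take the median of each feature per rank. It is an illustration and not part of the original analysis.

```python
import pandas as pd

# Made-up stand-in for df: two chart dates, three ranks each.
toy_df = pd.DataFrame({
    "date": ["2017-01-01"] * 3 + ["2017-02-01"] * 3,
    "position": [1, 2, 3, 1, 2, 3],
    "streams": [3_000_000, 2_500_000, 2_000_000, 7_000_000, 3_500_000, 3_000_000],
    "valence": [0.49, 0.66, 0.56, 0.93, 0.45, 0.60],
})

# Keep the top ranks, then summarize each rank across all chart dates.
top = toy_df.loc[toy_df["position"] <= 2]
per_rank = top.groupby("position")[["streams", "valence"]].median()
print(per_rank)
```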
plotTop10('popularity_index')
plt.title('Top 10 Streams in Relation to Popularity of Artist')
plt.xlabel('Artist Popularity Value')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
plotTop10('follower_count')
plt.title('Top 10 Streams in Relation to Follower Count')
plt.xlabel('Follower Count in 10 millions')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
This graph seems to indicate that the majority of songs within the top 10 positions come from artists with a follower count of less than 30 million, but the sample size here is small.
plotTop10('duration_ms')
plt.title('Top 10 Streams in Relation to Duration')
plt.xlabel('duration in milliseconds')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
This graph shows how the duration of a song in milliseconds compares to the number of streams that song received, using only the first 10 rows of our dataframe. It shows that the songs with the most streams in this set of data are longer than 240,000 ms, or 4 minutes. This is surprising, because the average song is usually around 3 minutes and 30 seconds or less.
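Durations in milliseconds are hard to read at a glance, so a small conversion helper can be useful. The function below (`ms_to_min_sec`, a hypothetical name that does not appear in the notebook) rounds to the nearest second and formats the result as minutes and seconds. The example durations are taken from the `duration_ms` column in the table above.

```python
def ms_to_min_sec(duration_ms):
    """Format a duration in milliseconds as 'M:SS', rounded to the nearest second."""
    total_seconds = round(duration_ms / 1000)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


print(ms_to_min_sec(240000))  # the 4-minute threshold mentioned above
print(ms_to_min_sec(244960))  # "Closer"
print(ms_to_min_sec(173987))  # "One Dance"
```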
plotTop10('acousticness')
plt.title('Top 10 Streams in Relation to Acousticness')
plt.xlabel('acousticness scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
This graph displays a confidence score for how likely it is that a song is acoustic (with a value of 1 being very likely that the song is acoustic) compared to the number of streams the song has. All of the confidence scores are less than .5, which indicates most of these songs are probably not acoustic.
plotTop10('energy')
plt.title('Top 10 Streams in Relation to Energy')
plt.xlabel('energy scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
This graph shows how the “energy” of a song, or generally how noisy and fast the song is, compares to the number of streams for the top 10 songs on the 1st of January. Here, we see that the songs with the most streams are around or above .6 on the energy scale (a higher score means the song is higher energy).
plotTop10('danceability')
plt.title('Top 10 Streams in Relation to Danceability')
plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
This graph shows how “danceable” a song is, using a value provided to us by the Spotify API, compared to the number of streams that song got. Danceability is measured as a value from 0 to 1, where 1 is most danceable. This graph appears to be similar to the graph describing energy, so the two values may have been determined using similar characteristics (i.e. both are measuring how upbeat or fast a song is).
plotTop10('loudness')
plt.title('Top 10 Streams in Relation to Loudness')
plt.xlabel('loudness in decibels')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
This graph describes the average volume of each track in our top 10s data set compared to the number of streams each song had. It appears to trend similarly to the last two graphs, indicating that the volume of a track may be correlated with how danceable or energetic a song is.
plotTop10('valence')
plt.title('Top 10 Streams in Relation to Valence')
plt.xlabel('valence scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
This graph describes the “valence” of a song compared to the number of streams it got. Valence is described as the “positivity” of a song, where “Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry),” according to the Spotify API reference. The reference does not describe how this value is determined, but our data seems to show there may be a correlation between valence and the number of streams among the rank 1 songs. However, this graph does not take the songs’ other features into account. It may be worth considering songs whose other features are held roughly constant, so that we can see whether valence on its own is correlated with the number of streams.
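One common way to approximate “holding the other features constant” is a multiple linear regression, which this notebook does not perform. The sketch below is only an illustration of that idea on synthetic data. Here `loudness`, `valence`, and `streams` are made-up arrays in which streams depend on loudness alone, while valence is merely correlated with loudness.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Synthetic data: streams are driven by loudness only; valence just tracks loudness.
loudness = rng.normal(-6, 3, size=n)
valence = np.clip(0.5 + 0.05 * (loudness + 6) + rng.normal(0, 0.15, size=n), 0, 1)
streams = 1_000_000 + 100_000 * (loudness + 6) + rng.normal(0, 200_000, size=n)


def standardize(x):
    """Rescale to mean 0 and standard deviation 1 so coefficients are comparable."""
    return (x - x.mean()) / x.std()


# Simple (pairwise) correlation between valence and streams.
simple_r = np.corrcoef(valence, streams)[0, 1]

# Least-squares fit of streams on an intercept, valence, and loudness together.
X = np.column_stack([np.ones(n), standardize(valence), standardize(loudness)])
coef, *_ = np.linalg.lstsq(X, standardize(streams), rcond=None)

print(f"pairwise correlation, valence vs streams: {simple_r:.2f}")
print(f"regression coefficient for valence:       {coef[1]:.2f}")
print(f"regression coefficient for loudness:      {coef[2]:.2f}")
```

In this constructed example valence looks related to streams on its own, but its coefficient shrinks toward zero once loudness is included. That is the kind of effect a single scatter plot cannot rule out.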
plotTop10('tempo')
plt.title('Top 10 Streams in Relation to Beats Per Minute (BPM)')
plt.xlabel('tempo scale of 0-200 Beats Per Minute (BPM)')
plt.ylabel('streams in millions')
plt.legend(top10_legend)
This graph describes the tempo of a song compared to the number of streams that song has. Given our dataset, it is unclear whether there is a correlation between the tempo of a song and the number of streams it gets.
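# Restrict to numeric columns so text columns (track name, URL, etc.) are not passed to corr()
corr = top10s.select_dtypes(include='number').corr()
# Mask the upper triangle, which mirrors the lower one
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})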
This correlation matrix for the top 10 is much more polarized than the top 200 correlation matrix. Pairs of values that had a weak positive or weak negative correlation are now more strongly correlated. This may indicate that the highly correlated traits have a stronger influence within the top 10, although the top 10 subset is also much smaller, so its correlations are noisier.
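Part of the stronger coloring in a small subset can come from sample size alone. The simulation below is a sketch under stated assumptions. It draws pairs of completely unrelated random columns and measures how much their correlation coefficient varies for about 120 rows versus about 2,400 rows. Those row counts are assumptions based on twelve monthly charts of 10 and 200 tracks, not figures reported in the notebook.

```python
import numpy as np

rng = np.random.default_rng(4)


def spread_of_r(n_rows, n_trials=500):
    """Standard deviation of Pearson's r between two unrelated columns of n_rows values."""
    rs = []
    for _ in range(n_trials):
        x = rng.normal(size=n_rows)
        y = rng.normal(size=n_rows)
        rs.append(np.corrcoef(x, y)[0, 1])
    return float(np.std(rs))


small = spread_of_r(120)    # assumed size of the top-10 subset
large = spread_of_r(2400)   # assumed size of the full top-200 data
print(f"typical spread of r with 120 rows:  {small:.3f}")
print(f"typical spread of r with 2400 rows: {large:.3f}")
```

Since the true correlation here is zero by construction, any difference between the two spreads is due to sample size. It is worth keeping this in mind when comparing the two heatmaps.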
There appeared to be a potential relationship between valence and the number of streams a song was getting, both in our correlation chart and in how rank 1 songs performed in our graph of the top 10 tracks each month, so it might be interesting to look at what the different features are like for songs with a high valence (above .4).
highValenceTracks = df.loc[df['valence'] > .4]
highValenceTracks.sort_values('streams', ascending=False).head(10)
index | position | track_name | artist | streams | url | date | track_id | duration_ms | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | valence | tempo | artist_id | popularity_index | follower_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
200 | 1 | Shape of You | Ed Sheeran | 7549041.0 | https://open.spotify.com/track/7qiZfU4dY1lWllz... | 2017-02-01 | 7qiZfU4dY1lWllzX7mPBI3 | 233713.0 | 0.581 | 0.825 | 0.652 | 0.000000 | 0.0931 | -3.183 | 0.0802 | 0.931 | 95.977 | 6eUKZXaKkcviH0Ku9w2n3V | 91.0 | 73345259.0 |
1000 | 1 | Despacito - Remix | Luis Fonsi | 7332260.0 | https://open.spotify.com/track/5CtI0qwDJkDQGwX... | 2017-06-01 | 6rPO02ozF3bM7NnOV4h6s2 | 228827.0 | 0.228 | 0.653 | 0.816 | 0.000000 | 0.0967 | -4.353 | 0.1670 | 0.816 | 178.085 | 4V8Sr092TqfHkfAA5fXXqG | 78.0 | 9035487.0 |
400 | 1 | Shape of You | Ed Sheeran | 7201132.0 | https://open.spotify.com/track/7qiZfU4dY1lWllz... | 2017-03-01 | 7qiZfU4dY1lWllzX7mPBI3 | 233713.0 | 0.581 | 0.825 | 0.652 | 0.000000 | 0.0931 | -3.183 | 0.0802 | 0.931 | 95.977 | 6eUKZXaKkcviH0Ku9w2n3V | 91.0 | 73345259.0 |
600 | 1 | Shape of You | Ed Sheeran | 6815498.0 | https://open.spotify.com/track/7qiZfU4dY1lWllz... | 2017-04-01 | 7qiZfU4dY1lWllzX7mPBI3 | 233713.0 | 0.581 | 0.825 | 0.652 | 0.000000 | 0.0931 | -3.183 | 0.0802 | 0.931 | 95.977 | 6eUKZXaKkcviH0Ku9w2n3V | 91.0 | 73345259.0 |
1200 | 1 | Despacito - Remix | Luis Fonsi | 6398530.0 | https://open.spotify.com/track/5CtI0qwDJkDQGwX... | 2017-07-01 | 6rPO02ozF3bM7NnOV4h6s2 | 228827.0 | 0.228 | 0.653 | 0.816 | 0.000000 | 0.0967 | -4.353 | 0.1670 | 0.816 | 178.085 | 4V8Sr092TqfHkfAA5fXXqG | 78.0 | 9035487.0 |
800 | 1 | Despacito - Remix | Luis Fonsi | 6360737.0 | https://open.spotify.com/track/5CtI0qwDJkDQGwX... | 2017-05-01 | 6rPO02ozF3bM7NnOV4h6s2 | 228827.0 | 0.228 | 0.653 | 0.816 | 0.000000 | 0.0967 | -4.353 | 0.1670 | 0.816 | 178.085 | 4V8Sr092TqfHkfAA5fXXqG | 78.0 | 9035487.0 |
2000 | 1 | rockstar | Post Malone | 5755610.0 | https://open.spotify.com/track/7wGoVu4Dady5GV0... | 2017-11-01 | 7ytR5pFWmSjzHJIeQkgog4 | 181733.0 | 0.247 | 0.746 | 0.690 | 0.000000 | 0.1010 | -7.956 | 0.1640 | 0.497 | 89.977 | 4r63FhuTkUYltbVAg5TQnk | 93.0 | 5174251.0 |
1800 | 1 | rockstar | Post Malone | 5649503.0 | https://open.spotify.com/track/1OmcAT5Y8eg5bUP... | 2017-10-01 | 7ytR5pFWmSjzHJIeQkgog4 | 181733.0 | 0.247 | 0.746 | 0.690 | 0.000000 | 0.1010 | -7.956 | 0.1640 | 0.497 | 89.977 | 4r63FhuTkUYltbVAg5TQnk | 93.0 | 5174251.0 |
1600 | 1 | Look What You Made Me Do | Taylor Swift | 5547962.0 | https://open.spotify.com/track/6uFsE1JgZ20EXyU... | 2017-09-01 | 1P17dC1amhFzptugyAO7Il | 211853.0 | 0.204 | 0.766 | 0.709 | 0.000014 | 0.1260 | -6.471 | 0.1230 | 0.506 | 128.070 | 06HL4z0CvFAxyc27GXpf02 | 97.0 | 34579892.0 |
2200 | 1 | rockstar | Post Malone | 5528701.0 | https://open.spotify.com/track/7wGoVu4Dady5GV0... | 2017-12-01 | 7ytR5pFWmSjzHJIeQkgog4 | 181733.0 | 0.247 | 0.746 | 0.690 | 0.000000 | 0.1010 | -7.956 | 0.1640 | 0.497 | 89.977 | 4r63FhuTkUYltbVAg5TQnk | 93.0 | 5174251.0 |
We have duplicate pieces of data, so let’s remove the duplicates for this test. We’re going to keep the version of each song that has the most streams.
highValenceTracks = highValenceTracks.sort_values('streams', ascending=False).drop_duplicates(['artist', 'duration_ms', 'acousticness', 'danceability', 'energy'], keep='first') # After sorting by streams in descending order, keep='first' keeps the highest-stream row for each song
highValenceTracks.sort_values('streams', ascending=False).head(10)
index | position | track_name | artist | streams | url | date | track_id | duration_ms | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | valence | tempo | artist_id | popularity_index | follower_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
200 | 1 | Shape of You | Ed Sheeran | 7549041.0 | https://open.spotify.com/track/7qiZfU4dY1lWllz... | 2017-02-01 | 7qiZfU4dY1lWllzX7mPBI3 | 233713.0 | 0.581000 | 0.825 | 0.652 | 0.000000 | 0.0931 | -3.183 | 0.0802 | 0.931 | 95.977 | 6eUKZXaKkcviH0Ku9w2n3V | 91.0 | 73345259.0 |
1000 | 1 | Despacito - Remix | Luis Fonsi | 7332260.0 | https://open.spotify.com/track/5CtI0qwDJkDQGwX... | 2017-06-01 | 6rPO02ozF3bM7NnOV4h6s2 | 228827.0 | 0.228000 | 0.653 | 0.816 | 0.000000 | 0.0967 | -4.353 | 0.1670 | 0.816 | 178.085 | 4V8Sr092TqfHkfAA5fXXqG | 78.0 | 9035487.0 |
2000 | 1 | rockstar | Post Malone | 5755610.0 | https://open.spotify.com/track/7wGoVu4Dady5GV0... | 2017-11-01 | 7ytR5pFWmSjzHJIeQkgog4 | 181733.0 | 0.247000 | 0.746 | 0.690 | 0.000000 | 0.1010 | -7.956 | 0.1640 | 0.497 | 89.977 | 4r63FhuTkUYltbVAg5TQnk | 93.0 | 5174251.0 |
1600 | 1 | Look What You Made Me Do | Taylor Swift | 5547962.0 | https://open.spotify.com/track/6uFsE1JgZ20EXyU... | 2017-09-01 | 1P17dC1amhFzptugyAO7Il | 211853.0 | 0.204000 | 0.766 | 0.709 | 0.000014 | 0.1260 | -6.471 | 0.1230 | 0.506 | 128.070 | 06HL4z0CvFAxyc27GXpf02 | 97.0 | 34579892.0 |
1001 | 2 | I'm the One | DJ Khaled | 5208996.0 | https://open.spotify.com/track/72Q0FQQo32KJloi... | 2017-06-01 | 1jYiIOC5d6soxkJP81fxq2 | 288877.0 | 0.053300 | 0.599 | 0.667 | 0.000000 | 0.1340 | -4.267 | 0.0367 | 0.817 | 80.984 | 0QHgL1lAIqAw0HtD7YldmP | 82.0 | 5405048.0 |
401 | 2 | Something Just Like This | The Chainsmokers | 4581789.0 | https://open.spotify.com/track/6RUKPb4LETWmmr3... | 2017-03-01 | 6RUKPb4LETWmmr3iAEQktW | 247160.0 | 0.049800 | 0.617 | 0.635 | 0.000014 | 0.1640 | -6.769 | 0.0317 | 0.446 | 103.019 | 69GGBxA162lTqCwzJG5jLp | 84.0 | 17093912.0 |
1201 | 2 | Wild Thoughts (feat. Rihanna & Bryson Tiller) | DJ Khaled | 4558126.0 | https://open.spotify.com/track/1OAh8uOEOvTDqkK... | 2017-07-01 | 45XhKYRRkyeqoW3teSOkCM | 204664.0 | 0.028700 | 0.613 | 0.681 | 0.000000 | 0.1260 | -3.089 | 0.0778 | 0.619 | 97.621 | 0QHgL1lAIqAw0HtD7YldmP | 82.0 | 5405048.0 |
402 | 3 | It Ain't Me (with Selena Gomez) | Kygo | 4529714.0 | https://open.spotify.com/track/3eR23VReFzcdmS7... | 2017-03-01 | 2jRGYG8U5bJzWOH6FLuzvO | 192000.0 | 0.016100 | 0.713 | 0.658 | 0.000138 | 0.0607 | -5.362 | 0.0748 | 0.539 | 115.024 | 23fqKkggKUBHNkbKtXEls4 | 86.0 | 6975385.0 |
803 | 4 | HUMBLE. | Kendrick Lamar | 4371886.0 | https://open.spotify.com/track/7KXjTSCq5nL1LoY... | 2017-05-01 | 7KXjTSCq5nL1LoYtL7XAwS | 177000.0 | 0.000282 | 0.908 | 0.621 | 0.000054 | 0.0958 | -6.638 | 0.1020 | 0.421 | 150.011 | 2YZyLoL8N0Wb9xBt1NhZWg | 87.0 | 16028806.0 |
2002 | 3 | New Rules | Dua Lipa | 3758506.0 | https://open.spotify.com/track/2ekn2ttSfGqwhha... | 2017-11-01 | 2ekn2ttSfGqwhhate0LSR0 | 209320.0 | 0.002610 | 0.762 | 0.700 | 0.000016 | 0.1530 | -6.021 | 0.0694 | 0.608 | 116.073 | 6M2wZ9GZgrQXHCFfjv46we | 93.0 | 21442792.0 |
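The de-duplication step relies on the interaction between sorting and `keep='first'`. The sketch below shows that behavior on a tiny made-up dataframe (`toy`, keyed on `artist` and `track_name` for brevity, not on the audio-feature columns used above). It also shows a `groupby`/`idxmax` variant that selects the same rows.

```python
import pandas as pd

# Made-up chart rows: "Song A" appears on three chart dates.
toy = pd.DataFrame({
    "track_name": ["Song A", "Song A", "Song B", "Song A"],
    "artist": ["X", "X", "Y", "X"],
    "date": ["2017-02-01", "2017-03-01", "2017-02-01", "2017-04-01"],
    "streams": [7_500_000, 7_200_000, 5_000_000, 6_800_000],
})

# Sort so the highest-stream row of each song comes first, then keep that first row.
deduped = (
    toy.sort_values("streams", ascending=False)
       .drop_duplicates(["artist", "track_name"], keep="first")
)

# Alternative: look up the row label of the maximum streams within each song.
via_idxmax = toy.loc[toy.groupby(["artist", "track_name"])["streams"].idxmax()]

print(deduped)
```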
Here are the first few tracks from our list of songs with high valences, sorted from the highest valence down.
highValenceTracks = highValenceTracks.sort_values('valence', ascending=False)
highValenceTracks.head()
index | position | track_name | artist | streams | url | date | track_id | duration_ms | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | valence | tempo | artist_id | popularity_index | follower_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1006 | 7 | There's Nothing Holdin' Me Back | Shawn Mendes | 3093935.0 | https://open.spotify.com/track/79cuOz3SPQTuFrp... | 2017-06-01 | 7JJmb5XwzOO8jgpou264Ml | 199440.0 | 0.380 | 0.866 | 0.813 | 0.000000 | 0.0779 | -4.063 | 0.0554 | 0.969 | 121.998 | 7n2wHs1TKAczGzO7Dd2rGr | 92.0 | 30441601.0 |
124 | 125 | Pumped Up Kicks | Foster The People | 467384.0 | https://open.spotify.com/track/7w87IxuO7BDcJ3Y... | 2017-01-01 | 7w87IxuO7BDcJ3YUqCyMTT | 239600.0 | 0.145 | 0.733 | 0.710 | 0.115000 | 0.0956 | -5.849 | 0.0292 | 0.965 | 127.975 | 7gP3bB2nilZXLfPHJhMdvc | 76.0 | 3059575.0 |
2366 | 167 | Feliz Navidad | José Feliciano | 631358.0 | https://open.spotify.com/track/7taXf5odg9xCAZE... | 2017-12-01 | 0oPdaY4dXtc3ZsaG17V972 | 182067.0 | 0.550 | 0.513 | 0.831 | 0.000000 | 0.3360 | -9.004 | 0.0383 | 0.963 | 148.837 | 7K78lVZ8XzkjfRSI7570FF | 76.0 | 211150.0 |
130 | 131 | Happy - From "Despicable Me 2" | Pharrell Williams | 453426.0 | https://open.spotify.com/track/5b88tNINg4Q4nrR... | 2017-01-01 | 60nZcImufyMA1MKQY3dcCH | 232720.0 | 0.219 | 0.647 | 0.822 | 0.000000 | 0.0908 | -4.662 | 0.1830 | 0.962 | 160.019 | 2RdwBSPQiwcmiDo9kixcl8 | 80.0 | 3359324.0 |
1315 | 116 | Skrt On Me (feat. Nicki Minaj) | Calvin Harris | 625504.0 | https://open.spotify.com/track/7iDxZ5Cd0Yg08d4... | 2017-07-01 | 7iDxZ5Cd0Yg08d4fI5WVtG | 228267.0 | 0.169 | 0.713 | 0.889 | 0.000058 | 0.1690 | -3.870 | 0.0376 | 0.960 | 101.977 | 7CajNmpbOovFoOoasH2HaY | 86.0 | 20962353.0 |
Here’s a plot displaying the number of streams for each song with a high valence compared to its valence.
plt.scatter(highValenceTracks['valence'], highValenceTracks['streams'])
plt.title("Streams Compared to Valence for Songs With Valence > .4")
plt.xlabel("Valence Value From 0 to 1")
plt.ylabel("Streams in Millions")
We no longer see the relationship between valence and number of streams that we saw earlier. Maybe the relationship that leads to more streams involves a combination of these features. It might be worth checking whether there is a relationship between streams and a combination of features like valence and loudness or valence and danceability.
Let’s try categorizing our data into groups with different levels of valence. This allows us to bound valence, which helps us treat it more like a constant. Then we can see how other features compare to streams when valence is held within certain levels.
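# Four non-overlapping valence groups: below .3, .3 up to .5, .5 up to .8, and .8 or above
highValenceTracks = df.loc[(df['valence'] >= .5) & (df['valence'] < .8)]
veryHighValenceTracks = df.loc[df['valence'] >= .8]
lowValenceTracks = df.loc[(df['valence'] < .5) & (df['valence'] >= .3)]
veryLowValenceTracks = df.loc[df['valence'] < .3]  # strict bound so a valence of exactly .3 falls only in the low group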
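The same kind of grouping can be expressed as a single labeled column with `pd.cut`, which guarantees that every track lands in exactly one group. The sketch below uses a made-up `toy` dataframe and hypothetical group labels. The top bin edge is set to infinity so that a valence of exactly 1.0 would still fall in the last group.

```python
import pandas as pd

# Made-up valence values, including ones that sit exactly on a boundary.
toy = pd.DataFrame({"valence": [0.10, 0.30, 0.45, 0.50, 0.79, 0.80, 0.97]})

# Left-closed bins: [0, .3), [.3, .5), [.5, .8), [.8, inf)
bins = [0.0, 0.3, 0.5, 0.8, float("inf")]
labels = ["very low", "low", "high", "very high"]
toy["valence_group"] = pd.cut(toy["valence"], bins=bins, labels=labels, right=False)

print(toy)
```

With a single group column, each scatter plot can loop over the groups instead of repeating one `plt.scatter` call per dataframe.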
First, we can try looking at how danceability performs within the different bounded groups of valence. We will color code the valence groups so that we can easily see which tracks have a high, medium-high, medium-low, or low valence.
# Plotting danceability against streams, colored by valence group, to see if these two values work together to impact stream counts
plt.scatter(veryLowValenceTracks['danceability'],veryLowValenceTracks['streams'], color="blue")
plt.scatter(lowValenceTracks['danceability'],lowValenceTracks['streams'], color="turquoise")
plt.scatter(highValenceTracks['danceability'],highValenceTracks['streams'], color="yellow")
plt.scatter(veryHighValenceTracks['danceability'],veryHighValenceTracks['streams'], color="red")
plt.title('Danceability and streams for songs with bounded valences')
plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(['Song with valence < .3', 'Song with valence < .5 and > .3', 'Song with valence > .5 and < .8', 'Song with valence > .8'])
This graph displays how the danceability of a song compares to the number of streams it has, with each point colored by its valence group. While there is little indication of a linear correlation, the songs with the most streams all have a danceability above .5.
Using this set of data where valence is color coded by groups of values, let’s try to plot other features against streams and see if valence and another feature have any effect on the number of streams. We can try considering follower count next, as that value appeared to have a slight positive correlation with streams in our correlation chart.
# Plotting songs with varying levels of followers within certain valence levels against streams to see if these two values work together to impact stream counts
plt.scatter(veryLowValenceTracks['follower_count'],veryLowValenceTracks['streams'], color="blue")
plt.scatter(lowValenceTracks['follower_count'],lowValenceTracks['streams'], color="turquoise")
plt.scatter(highValenceTracks['follower_count'],highValenceTracks['streams'], color="yellow")
plt.scatter(veryHighValenceTracks['follower_count'],veryHighValenceTracks['streams'], color="red")
plt.title('Follower Count and Streams for Songs With Different Valence Bounds')
plt.xlabel('Follower count in tens of millions of followers')
plt.ylabel('Streams in millions')
plt.legend(['Song with valence < .3', 'Song with valence < .5 and > .3', 'Song with valence > .5 and < .8', 'Song with valence > .8'])
This graph shows that the majority of songs in our entire set of data sit in the bottom left corner of the graph, and this cluster includes varying levels of valence. This suggests that follower count, taken within bounded levels of valence, has little to do with the number of streams.
Another variable that appeared to have a slight correlation with streams was the loudness of a song. We can try to make a plot similar to our previous two plots where we instead measure loudness on the x-axis.
# Plotting songs with varying levels of loudness within certain valence levels against streams to see if these two values work together to impact stream counts
plt.scatter(veryLowValenceTracks['loudness'],veryLowValenceTracks['streams'], color="blue")
plt.scatter(lowValenceTracks['loudness'],lowValenceTracks['streams'], color="turquoise")
plt.scatter(highValenceTracks['loudness'],highValenceTracks['streams'], color="yellow")
plt.scatter(veryHighValenceTracks['loudness'],veryHighValenceTracks['streams'], color="red")
plt.title('Loudness and Streams for Songs With Different Valence Bounds')
plt.xlabel('Loudness in Decibels')
plt.ylabel('Streams in Millions')
plt.legend(['Song with valence < .3', 'Song with valence < .5 and > .3', 'Song with valence > .5 and < .8', 'Song with valence > .8'])
While streams appear to have little correlation with loudness and valence, it does appear that valence and loudness are correlated with each other. We start to see higher valences as songs get louder, which seems relatively intuitive, as “negative” songs are probably quieter and more somber. It may also be noteworthy that the songs that did very well all had a loudness above -15 decibels.
These last three graphs seem to indicate that bounded levels of valence in conjunction with varying levels of the other features we measured have little to do with stream count. We did find that songs which performed exceptionally well had a loudness of above -15 decibels or had a danceability of greater than .6, but it was hard to see other relationships between our variables and streams otherwise.
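The visual impression from these color-coded plots could also be checked numerically by computing the correlation between a feature and streams separately within each valence group. The sketch below does this on synthetic data (`toy`, in which streams are unrelated to the features by construction) with a simplified two-group split, so it illustrates the mechanics only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 400

# Synthetic stand-in for df: streams are generated independently of the features.
toy = pd.DataFrame({
    "valence": rng.uniform(0, 1, size=n),
    "loudness": rng.normal(-6, 3, size=n),
})
toy["streams"] = rng.normal(1_000_000, 300_000, size=n)
toy["valence_group"] = np.where(toy["valence"] >= 0.5, "high", "low")

# Pearson correlation between loudness and streams inside each valence group.
per_group_r = {
    name: group["loudness"].corr(group["streams"])
    for name, group in toy.groupby("valence_group")
}
for name, r in per_group_r.items():
    print(f"{name} valence: r = {r:.2f}")
```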
In the final step of the data cycle, we draw conclusions based off our analysis to inform decisions made based on the data.
To return to our question: are there traits of a song that can be used to determine future success? If so, what are they?
We were unable to conclude whether there were specific traits of a song that directly determined future success or high stream counts. However, there were a few noteworthy trends in our data. Among our findings, we saw that the mean loudness of random samples of songs from our dataset is approximately normally distributed around a mean of about -6 decibels, and that loudness has a standard deviation of about 3 decibels. This suggests that most songs on the top 200 chart in our data sit within a few decibels of this level, although we did not test the distribution of individual tracks directly. The loudness of these songs is reminiscent of the loudness war, a trend toward ever louder releases that has been traced back as far as the 1940s. Even though increasing the loudness of a song ultimately reduced its fidelity (fine details), louder tracks were often preferred. Our result may be an echo of this cultural trend, or perhaps people simply like their music loud, even at a lower quality.
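To put numbers on what a mean of about -6 decibels and a standard deviation of about 3 decibels would imply, the sketch below uses Python's `statistics.NormalDist` with the rounded summary statistics printed earlier. It rests on the assumption that loudness is normally distributed, so the resulting share is only as good as that assumption.

```python
from statistics import NormalDist

# Rounded from the printed summary statistics for loudness.
mean_db, std_db = -6.44, 2.93

# Assumption: loudness follows a normal distribution with these parameters.
dist = NormalDist(mu=mean_db, sigma=std_db)

# Range within one standard deviation of the mean, and the share of songs it would cover.
low, high = mean_db - std_db, mean_db + std_db
share = dist.cdf(high) - dist.cdf(low)

print(f"one standard deviation range: {low:.2f} dB to {high:.2f} dB")
print(f"share of songs in that range under normality: {share:.1%}")
```

Under this assumption roughly two thirds of songs would fall between about -9.4 and -3.5 decibels. Comparing that figure with the actual share in the data would be one way to test the assumption.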
Our initial plotting of popularity index against streams indicated that most top 200 songs came from artists with an index of around 80 (mean of 81), and our correlation plot seemed to support that higher artist popularity indices were positively correlated with streams. It is unsurprising that artist popularity has a correlation with streams, but what is noteworthy is that our correlation plot seemed to indicate that even artist popularity alone had a weak positive correlation with streams. We noted that while the average number of followers for an artist was skewed by inconsistently high values in the top 1 or 2 positions, the median value of followers for an artist in the top 200 was 7,748,023.
After analyzing several features individually and seeing little direct correlation with stream values, we decided to consider how different factors working together might impact streams. We looked at individual features within the top 10 positions of our charts to see if we could find any trends, as these songs were the most successful of all the songs in our data. Within these, we initially thought we saw a positive correlation between valence and stream counts, but it was unclear whether there was a distinct relationship. We then plotted a few other features with bounded valence values to investigate whether any of these combinations could produce a clearer correlation. The other features we chose to measure with bounded valence values were ones which appeared to be positively correlated with streams according to our correlation chart, but were also measured independently from valence (i.e. they were not determined using the same song characteristics). Unfortunately, there appeared to be no correlation between streams and the combination of valence with the other factors we considered (danceability, follower count, and loudness).
No single feature appeared to be the determining factor for a song’s success, a conclusion that our correlation plot also seemed to support. Future research on how combinations of different features affect streams could be valuable. It would also be worth looking into what factors influence artist popularity. Access to data from a longer time period could also help in finding interesting trends in how people listen to music over time.