groovydata

Analyzing the Data of the Spotify Regional Top 200 Song Charts to Predict Success

By Amanuel Awoke, Ferzam Mohammad, and Josue Velasquez

Introduction

Motivation

The music industry has changed a lot in the last decade with the introduction of streaming services like Apple Music and Spotify. Services like Spotify allow users to stream music for personal consumption, either for free or for a subscription fee. These services have made it easier to consume music and have increased opportunities for people to start producing music, but they have also changed how musicians make money. Whenever a user listens to a song on a streaming service, the service typically keeps track of the number of “streams” that song has. Music artists are then paid a small amount based on the number of streams their music has accumulated. Given how little these artists are paid by streaming services, maximizing the revenue made from a song is valuable for those looking to release music on these services. Stream count also indicates where a song stands in the streaming services’ popularity lists, and making it onto their top 100 or 200 songs is a factor considered in whether these songs are added to official top songs charts such as the Billboard 200.

Our group thought it would be interesting to see if we could try to make predictions for how popular a song might be given different features for a song (e.g. how fast or slow a song is, the mood of the song, how many listeners an artist already gets on average, etc.). If we can indicate how many plays a song will get, we can give a prediction for how much money a song will make on a streaming service. Much like the Moneyball scenario, it’s possible that artists are focusing on producing music that meets criteria which they think makes a song popular when, in reality, they should be focusing on other aspects of their music. Understanding what components of a song make it popular would help artists figure out the best way to produce music in order to make money off of these streaming services.

The Moneyball story demonstrated the importance of data science in producing a strong baseball team, and while music is different from sports, our project should hopefully reflect similar data science practices in order to reach a valuable conclusion. It may be relatively straightforward to conclude whether a song by Taylor Swift will end up on the top 200 chart given her “incredibly loyal fanbase” of over 40 million people, but maybe there are other characteristics shared by popular songs that could indicate factors which help make a song more popular. Data science practices help us here by giving us tools to identify characteristics in a song, clarify how those characteristics might relate to stream count, and determine whether any elements should be focused on when producing music.

From this point forward when we use the word “track” it is synonymous with “song.”

Goal Hypothesis

Are there traits of a song that can be used to determine future success? If so, what are they?

Defining Success

We are defining the success of a track by its appearance on the Top 200, as well as its ranking on the Top 200 (the higher the better).

Collect Data

This is the first step in the data science lifecycle, where we identify and gather information. We gather data from the Spotify Charts Regional Top 200 to identify which songs had the highest stream counts, sampling the chart on the first day of each month from January 1st, 2017 through December 1st, 2017. Note that the download links built below point to the global chart (regional/global). Spotify Charts provides the tracks with the highest stream counts, their Top 200 rank, and the artist(s) who created each song. Spotify Charts already compiles the data into CSV files, so it isn’t necessary to scrape the website directly. If you want to download one yourself, select a date in the dropdown at the top right of the website, then select “Download to CSV” further up. The pandas method read_csv() was used to load the CSV files into dataframes.

import os
import pandas as pd
import requests
from bs4 import BeautifulSoup
import spotipy
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Since the download URLs of the CSV files follow a consistent pattern based on the date they record, we used a loop to build the links and then download all of the files.

# Collect links from spotify charts top 200 streams per day
ref_str = "https://spotifycharts.com/regional/global/daily/"
ref_arr = []


for year in range(2017, 2018):
    date = ""
    
    endingMonth = 12
    if year == 2020:
        endingMonth = 10
        
    for month in range (1, endingMonth + 1):

        if int(month) < 10:
            month = "0" + str(month)

        date = str(year) + "-" + str(month) + "-" + "01" + "/download"
        date = ref_str + date
        ref_arr.append(date)

ref_arr

['https://spotifycharts.com/regional/global/daily/2017-01-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-02-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-03-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-04-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-05-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-06-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-07-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-08-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-09-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-10-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-11-01/download',
 'https://spotifycharts.com/regional/global/daily/2017-12-01/download']
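
The same list can be built more compactly with a format specifier that zero-pads the month. This is only a minimal sketch, not the code used for the project: monthly_chart_urls is a hypothetical helper name, and it assumes, as above, that only the first day of each month is sampled.

```python
BASE_URL = "https://spotifycharts.com/regional/global/daily/"

def monthly_chart_urls(year, last_month=12):
    """Build the download URL for the first day of each month in `year`.

    Hypothetical helper for illustration; `last_month` lets a partial
    year stop early.
    """
    return [
        f"{BASE_URL}{year}-{month:02d}-01/download"  # :02d pads 1 to "01"
        for month in range(1, last_month + 1)
    ]

urls = monthly_chart_urls(2017)
print(len(urls))
print(urls[0])
```

Because the base URL is unchanged, the date still sits at characters 48 through 57 of each link, which is what the download loop relies on.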

#Download each monthly CSV (if it is not already saved) and combine them into one dataframe

columns = ['position', 'track_name', 'artist', 'streams', 'url']
frames = []

#make dir to save to
path = "sheets"
try:
    os.mkdir(path)
except FileExistsError:
    print("Folder already exists")

for link in ref_arr:
    #String manipulation to read from the correct csv files
    date = link[48:58]
    fileName = "regional-global-daily-" + date + ".csv"
    filePath = os.path.join(path, fileName)

    #Only download files that are missing, so reruns do not hit the site again
    if not os.path.exists(filePath):
        print("Downloading... " + fileName)
        r = requests.get(link, allow_redirects=True)
        with open(filePath, "wb") as f:
            f.write(r.content)

    df_new = pd.read_csv(filePath)
    df_new.columns = columns
    df_new['date'] = date

    df_new = df_new.iloc[1:] #deletes junk row from csv conversion
    frames.append(df_new)

#DataFrame.append was removed in pandas 2.0, so concatenate all frames at once
df = pd.concat(frames)
print("Done")
df = df.reset_index() # Sets index back to being the regular 0-based index. This is really helpful when trying to add more to the dataframe later, because otherwise there are lots of duplicate indices
df['streams'] = df['streams'].astype(int) #streams are a string of a num, must wrap as type int always

Folder already exists
Done
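
The "junk row" removed with iloc[1:] can be understood with a small standard-library sketch. It assumes a file layout in which a note line precedes the real header row; that layout, the sample text, and the placeholder URL are all assumptions made for illustration, not the contents of an actual download.

```python
import csv
import io

# Hypothetical file contents: a note line, the real header, then data rows.
sample = io.StringIO(
    ',,,"Note about how the figures are generated",\n'
    'Position,Track Name,Artist,Streams,URL\n'
    '1,Starboy,The Weeknd,3135625,https://open.spotify.com/track/EXAMPLE\n'
)

next(sample)                         # skip the note line so the real header is read
rows = list(csv.DictReader(sample))  # each row becomes a dict keyed by header name

first = rows[0]
stream_count = int(first["Streams"])  # CSV values arrive as strings
print(first["Track Name"], stream_count)
```

Under the same assumption, passing skiprows=1 to read_csv would let pandas use the real header directly, instead of reading it as a data row and dropping it afterwards.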

Wrangling data into dataframe

df

	index	position	track_name	artist	streams	url	date
0	1	1	Starboy	The Weeknd	3135625	https://open.spotify.com/track/5aAx2yezTd8zXrk...	2017-01-01
1	2	2	Closer	The Chainsmokers	3015525	https://open.spotify.com/track/7BKLCZ1jbUBVqRi...	2017-01-01
2	3	3	Let Me Love You	DJ Snake	2545384	https://open.spotify.com/track/4pdPtRcBmOSQDlJ...	2017-01-01
3	4	4	Rockabye (feat. Sean Paul & Anne-Marie)	Clean Bandit	2356604	https://open.spotify.com/track/5knuzwU65gJK7IF...	2017-01-01
4	5	5	One Dance	Drake	2259887	https://open.spotify.com/track/1xznGGDReH1oQq0...	2017-01-01
...	...	...	...	...	...	...	...
2395	196	196	Rockabye (feat. Sean Paul & Anne-Marie)	Clean Bandit	552118	https://open.spotify.com/track/5knuzwU65gJK7IF...	2017-12-01
2396	197	197	Rake It Up (feat. Nicki Minaj)	Yo Gotti	551576	https://open.spotify.com/track/4knL4iPxPOZjQzT...	2017-12-01
2397	198	198	New Freezer (feat. Kendrick Lamar)	Rich The Kid	550167	https://open.spotify.com/track/4pYZLpX23Vx8rwD...	2017-12-01
2398	199	199	All Night	Steve Aoki	548039	https://open.spotify.com/track/5mAxA6Q1SIym6dP...	2017-12-01
2399	200	200	113	Booba	546878	https://open.spotify.com/track/6xqAP7kpdgCy8lE...	2017-12-01

2400 rows × 7 columns

Data Processing

Spotipy is a lightweight Python library for the Spotify Web API used to retrieve more detailed data for tracks now that their names have been retrieved from the Spotify Top 200. We must first authenticate our usage of the API using a Spotify Account.

import os
import spotipy
from spotipy.oauth2 import SpotifyOAuth

# Credentials are read from environment variables so that the client secret
# is never published alongside the code
SPOTIPY_CLIENT_ID = os.environ["SPOTIPY_CLIENT_ID"]
SPOTIPY_CLIENT_SECRET = os.environ["SPOTIPY_CLIENT_SECRET"]
SPOTIPY_CLIENT_REDIRECT = "https://amanuelawoke.com/groovydata/"

scope = "user-library-read"

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope=scope, client_id=SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET, redirect_uri=SPOTIPY_CLIENT_REDIRECT))

We’re going to start by using the Spotify API to get more information about all the tracks we found in the top 200’s chart for the timeframe we described above. The Spotify API gives us the ability to get “audio features” from a song given a track id that Spotify creates for every song. These “audio features” include characteristics like loudness, positivity, danceability, how energetic the song is, the speed of the song, and a couple other similar characteristics that have been determined by Spotify using their own machine learning algorithms.

First, we need an ID for every song and artist in our dataframe so that we can query the Spotify API for a specific track or artist. Here, we get the track and artist IDs, and we also query the audio features for each track ID. We do all of this in a single pass because a large number of queries through the Spotify API can take a long time. For testing, we cached the resulting dataframe rather than rebuilding it every time.
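
One way to reduce the number of requests is to send track IDs in batches, since the audio features endpoint accepts several IDs per call (up to 100, according to the Spotify Web API documentation). The sketch below is not what the project code does; chunked and fetch_audio_features are hypothetical helper names, and sp is assumed to be the authenticated client created above.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_audio_features(sp, track_ids, batch_size=100):
    """Request audio features in batches instead of one track at a time.

    `sp` is assumed to be an authenticated spotipy.Spotify client.
    """
    features = []
    for batch in chunked(track_ids, batch_size):
        # One request per batch; the result has one entry per requested ID
        features.extend(sp.audio_features(batch))
    return features

# The batching logic itself can be checked without network access
batches = list(chunked(list(range(250)), 100))
print([len(batch) for batch in batches])
```

The artist lookups would still need their own requests, so this only addresses the audio features part of the loop.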

import xlsxwriter
import openpyxl

artist_id_list = []
track_id_list = []
popularity_index_list = []
follower_count_list = []
audio_features_df = pd.DataFrame()

#if cached df exists dont search again, else search again
if not os.path.exists("cached_df.xlsx"):
    #Take each song and lookup its audio features, then create a dataframe for them
    print("Searching...")
    for index, row in df.iterrows():
        trackName = row['track_name']
        track_id = ""
        artist_id = ""
        # We need to check if our track_name received was a nan value. Idk how these got in here, but there are nans
        if(type(trackName) == str):
            #delimit with +'s for spotipy search query
            trackNameWithoutSpaces = '+'.join(trackName.split())
            searchQuery = sp.search(trackNameWithoutSpaces, 1, 0)
            if (len(searchQuery['tracks']['items']) != 0):
                
                track_object = searchQuery['tracks']['items'][0]
                track_id = track_object['id']
                track_id_list.append(track_id)

                #if there are several artists, return the first artist
                artist_object = track_object['artists'][0] if type(track_object['artists']) is list else track_object['artists']
                artist_id = artist_object['id']
                artist_id_list.append(artist_id)

    
                artist_object_real = sp.artist(artist_id)
                followers_object = artist_object_real['followers']
                followers_value = followers_object['total']
                follower_count_list.append(followers_value)
                popularity_value = artist_object_real['popularity']
                popularity_index_list.append(popularity_value)

            # If our query returned nothing then append a nan in the place of artist and track for this entry
            else:
                artist_id_list.append(np.nan)
                track_id_list.append(np.nan)
                
                popularity_index_list.append(np.nan)
                follower_count_list.append(np.nan)

        # If we had stored a nan, then just plan to append a nan in this position
        else:
            artist_id_list.append(np.nan)
            track_id_list.append(np.nan)
            
            popularity_index_list.append(np.nan)
            follower_count_list.append(np.nan)
       
        #Defining audio features as nan to begin    
        audiofeatures = {'duration_ms' : np.nan, 'key' : np.nan, 'mode' : np.nan, 'time_signature' : np.nan, 'acousticness' : np.nan, 'danceability' : np.nan, 'energy' : np.nan, 'instrumentalness' : np.nan, 'liveness' : np.nan, 'loudness' : np.nan, 'speechiness' : np.nan, 'valence' : np.nan, 'tempo' : np.nan, 'id' : np.nan, 'uri' : np.nan, 'track_href' : np.nan, 'analysis_url' : np.nan, 'type' : np.nan, }

        # If we successfully found a track when we did our search, then get the audio features for that
        if (track_id != ""):
            audiofeatures = sp.audio_features(track_id)[0]
        #Append the audio features (DataFrame.append was removed in pandas 2.0, so use concat)
        audio_features_df = pd.concat([audio_features_df, pd.DataFrame([audiofeatures])], ignore_index=True)

    #adds artist id list 
    audio_features_df['artist_id'] = artist_id_list 
    audio_features_df['popularity_index'] = popularity_index_list
    audio_features_df['follower_count'] = follower_count_list

    # Store the created data frame into the cache
    audio_features_df.to_excel('cached_df.xlsx', sheet_name='Sheet1', engine='openpyxl')
    
else: #access the cached df if it exists

    print("Cached dataframe found.")
    audio_features_df = pd.read_excel("cached_df.xlsx", engine = "openpyxl")
    audio_features_df.drop(["Unnamed: 0"], axis=1, inplace=True) #drop the saved index column that to_excel wrote alongside the data

audio_features_df

Cached dataframe found.

	acousticness	analysis_url	danceability	duration_ms	energy	id	instrumentalness	key	liveness	loudness	...	speechiness	tempo	time_signature	track_href	type	uri	valence	artist_id	popularity_index	follower_count
0	0.14100	https://api.spotify.com/v1/audio-analysis/7MXV...	0.679	230453.0	0.587	7MXVkk9YMctZqd1Srtv4MB	0.000006	7.0	0.137	-7.015	...	0.2760	186.003	4.0	https://api.spotify.com/v1/tracks/7MXVkk9YMctZ...	audio_features	spotify:track:7MXVkk9YMctZqd1Srtv4MB	0.486	1Xyo4u8uXC1ZmMpatF05PJ	94.0	26720759.0
1	0.41400	https://api.spotify.com/v1/audio-analysis/7BKL...	0.748	244960.0	0.524	7BKLCZ1jbUBVqRi2FVlTVw	0.000000	8.0	0.111	-5.599	...	0.0338	95.010	4.0	https://api.spotify.com/v1/tracks/7BKLCZ1jbUBV...	audio_features	spotify:track:7BKLCZ1jbUBVqRi2FVlTVw	0.661	69GGBxA162lTqCwzJG5jLp	84.0	17093912.0
2	0.23500	https://api.spotify.com/v1/audio-analysis/3ibK...	0.656	256733.0	0.578	3ibKnFDaa3GhpPGlOUj7ff	0.000000	7.0	0.118	-8.970	...	0.0922	94.514	4.0	https://api.spotify.com/v1/tracks/3ibKnFDaa3Gh...	audio_features	spotify:track:3ibKnFDaa3GhpPGlOUj7ff	0.556	20s0P9QLxGqKuCsGwFsp7w	69.0	2055274.0
3	0.40600	https://api.spotify.com/v1/audio-analysis/5knu...	0.720	251088.0	0.763	5knuzwU65gJK7IF5yJsuaW	0.000000	9.0	0.180	-4.068	...	0.0523	101.965	4.0	https://api.spotify.com/v1/tracks/5knuzwU65gJK...	audio_features	spotify:track:5knuzwU65gJK7IF5yJsuaW	0.742	6MDME20pz9RveH9rEXvrOM	80.0	4092589.0
4	0.00776	https://api.spotify.com/v1/audio-analysis/1zi7...	0.792	173987.0	0.625	1zi7xx7UVEFkmKfv06H8x0	0.001800	1.0	0.329	-5.609	...	0.0536	103.967	4.0	https://api.spotify.com/v1/tracks/1zi7xx7UVEFk...	audio_features	spotify:track:1zi7xx7UVEFkmKfv06H8x0	0.370	3TVXtAsR1Inumwj472S9r4	96.0	51374698.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2395	0.40600	https://api.spotify.com/v1/audio-analysis/5knu...	0.720	251088.0	0.763	5knuzwU65gJK7IF5yJsuaW	0.000000	9.0	0.180	-4.068	...	0.0523	101.965	4.0	https://api.spotify.com/v1/tracks/5knuzwU65gJK...	audio_features	spotify:track:5knuzwU65gJK7IF5yJsuaW	0.742	6MDME20pz9RveH9rEXvrOM	80.0	4092589.0
2396	0.02200	https://api.spotify.com/v1/audio-analysis/4knL...	0.910	276333.0	0.444	4knL4iPxPOZjQzTUlELGSY	0.000000	1.0	0.137	-8.126	...	0.3440	149.953	4.0	https://api.spotify.com/v1/tracks/4knL4iPxPOZj...	audio_features	spotify:track:4knL4iPxPOZjQzTUlELGSY	0.530	6Ha4aES39QiVjR0L2lwuwq	75.0	3109571.0
2397	0.04050	https://api.spotify.com/v1/audio-analysis/2EgB...	0.884	191938.0	0.698	2EgB4n6XyBsuNUbuarr4eG	0.000000	0.0	0.195	-9.101	...	0.3640	140.068	4.0	https://api.spotify.com/v1/tracks/2EgB4n6XyBsu...	audio_features	spotify:track:2EgB4n6XyBsuNUbuarr4eG	0.575	1pPmIToKXyGdsCF6LmqLmI	78.0	2419234.0
2398	0.00410	https://api.spotify.com/v1/audio-analysis/0dXN...	0.538	197640.0	0.804	0dXNQ8dckG4eYfEtq9zcva	0.000000	8.0	0.330	-5.194	...	0.0358	144.992	4.0	https://api.spotify.com/v1/tracks/0dXNQ8dckG4e...	audio_features	spotify:track:0dXNQ8dckG4eYfEtq9zcva	0.507	7gAppWoH7pcYmphCVTXkzs	76.0	4082406.0
2399	0.00805	https://api.spotify.com/v1/audio-analysis/0leV...	0.740	266672.0	0.510	0leVyLipY7A8ruhkIBqc0E	0.000375	9.0	0.128	-8.042	...	0.0780	141.534	5.0	https://api.spotify.com/v1/tracks/0leVyLipY7A8...	audio_features	spotify:track:0leVyLipY7A8ruhkIBqc0E	0.089	0JOxt5QOwq0czoJxvSc5hS	70.0	168927.0

2400 rows × 21 columns
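
Searching by track name can return a different recording than the one that charted: in the tables above, the first chart URL begins with 5aAx2yezTd8zXrk, while the ID found for that row is 7MXVkk9YMctZqd1Srtv4MB. Since every chart row already carries a track URL, one alternative is to read the ID straight from the URL path. This is a sketch of that idea rather than the approach used here; track_id_from_url is a hypothetical helper, and the example URL is assembled from an ID shown in the table above purely for illustration.

```python
from urllib.parse import urlparse

def track_id_from_url(url):
    """Return the track ID from an open.spotify.com track URL, or None."""
    if not isinstance(url, str):  # e.g. NaN entries in the dataframe
        return None
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "track" and parts[1]:
        return parts[1]
    return None

example_url = "https://open.spotify.com/track/7MXVkk9YMctZqd1Srtv4MB"
print(track_id_from_url(example_url))
```

Applied to the dataframe, this could look like df['url'].map(track_id_from_url), which would avoid one search request per row.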

#Append audio features to master dataframe
df['track_id'] = audio_features_df['id']
df['duration_ms'] = audio_features_df['duration_ms']
df['acousticness'] = audio_features_df['acousticness']
df['danceability'] = audio_features_df['danceability']
df['energy'] = audio_features_df['energy']
df['instrumentalness'] = audio_features_df['instrumentalness']
df['liveness'] = audio_features_df['liveness']
df['loudness'] = audio_features_df['loudness']
df['speechiness'] = audio_features_df['speechiness']
df['valence'] = audio_features_df['valence']
df['tempo'] = audio_features_df['tempo']
df['artist_id'] = audio_features_df['artist_id']
df['popularity_index'] = audio_features_df['popularity_index']
df['follower_count'] = audio_features_df['follower_count']

df = df.drop(columns='index')
df

	position	track_name	artist	streams	url	date	track_id	duration_ms	acousticness	danceability	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo	artist_id	popularity_index	follower_count
0	1	Starboy	The Weeknd	3135625	https://open.spotify.com/track/5aAx2yezTd8zXrk...	2017-01-01	7MXVkk9YMctZqd1Srtv4MB	230453.0	0.14100	0.679	0.587	0.000006	0.137	-7.015	0.2760	0.486	186.003	1Xyo4u8uXC1ZmMpatF05PJ	94.0	26720759.0
1	2	Closer	The Chainsmokers	3015525	https://open.spotify.com/track/7BKLCZ1jbUBVqRi...	2017-01-01	7BKLCZ1jbUBVqRi2FVlTVw	244960.0	0.41400	0.748	0.524	0.000000	0.111	-5.599	0.0338	0.661	95.010	69GGBxA162lTqCwzJG5jLp	84.0	17093912.0
2	3	Let Me Love You	DJ Snake	2545384	https://open.spotify.com/track/4pdPtRcBmOSQDlJ...	2017-01-01	3ibKnFDaa3GhpPGlOUj7ff	256733.0	0.23500	0.656	0.578	0.000000	0.118	-8.970	0.0922	0.556	94.514	20s0P9QLxGqKuCsGwFsp7w	69.0	2055274.0
3	4	Rockabye (feat. Sean Paul & Anne-Marie)	Clean Bandit	2356604	https://open.spotify.com/track/5knuzwU65gJK7IF...	2017-01-01	5knuzwU65gJK7IF5yJsuaW	251088.0	0.40600	0.720	0.763	0.000000	0.180	-4.068	0.0523	0.742	101.965	6MDME20pz9RveH9rEXvrOM	80.0	4092589.0
4	5	One Dance	Drake	2259887	https://open.spotify.com/track/1xznGGDReH1oQq0...	2017-01-01	1zi7xx7UVEFkmKfv06H8x0	173987.0	0.00776	0.792	0.625	0.001800	0.329	-5.609	0.0536	0.370	103.967	3TVXtAsR1Inumwj472S9r4	96.0	51374698.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2395	196	Rockabye (feat. Sean Paul & Anne-Marie)	Clean Bandit	552118	https://open.spotify.com/track/5knuzwU65gJK7IF...	2017-12-01	5knuzwU65gJK7IF5yJsuaW	251088.0	0.40600	0.720	0.763	0.000000	0.180	-4.068	0.0523	0.742	101.965	6MDME20pz9RveH9rEXvrOM	80.0	4092589.0
2396	197	Rake It Up (feat. Nicki Minaj)	Yo Gotti	551576	https://open.spotify.com/track/4knL4iPxPOZjQzT...	2017-12-01	4knL4iPxPOZjQzTUlELGSY	276333.0	0.02200	0.910	0.444	0.000000	0.137	-8.126	0.3440	0.530	149.953	6Ha4aES39QiVjR0L2lwuwq	75.0	3109571.0
2397	198	New Freezer (feat. Kendrick Lamar)	Rich The Kid	550167	https://open.spotify.com/track/4pYZLpX23Vx8rwD...	2017-12-01	2EgB4n6XyBsuNUbuarr4eG	191938.0	0.04050	0.884	0.698	0.000000	0.195	-9.101	0.3640	0.575	140.068	1pPmIToKXyGdsCF6LmqLmI	78.0	2419234.0
2398	199	All Night	Steve Aoki	548039	https://open.spotify.com/track/5mAxA6Q1SIym6dP...	2017-12-01	0dXNQ8dckG4eYfEtq9zcva	197640.0	0.00410	0.538	0.804	0.000000	0.330	-5.194	0.0358	0.507	144.992	7gAppWoH7pcYmphCVTXkzs	76.0	4082406.0
2399	200	113	Booba	546878	https://open.spotify.com/track/6xqAP7kpdgCy8lE...	2017-12-01	0leVyLipY7A8ruhkIBqc0E	266672.0	0.00805	0.740	0.510	0.000375	0.128	-8.042	0.0780	0.089	141.534	0JOxt5QOwq0czoJxvSc5hS	70.0	168927.0

2400 rows × 20 columns

# Fixing types because some values stored as strings should be treated as numbers
df['streams'] = df['streams'].astype(float)
df['position'] = df['position'].astype(int)

Data Visualization and Analysis

Song Properties

We’ve now gathered and manipulated valuable data for each track for each day recorded. The key elements are the following:

Track Name
Artist Name
Stream Count
Popularity value (0-100, larger is better)
Number of Followers
Position/Rank on Top 200 (smaller is better)
Date on the Top 200 (most songs stay for many days)

The following details define the patterns and properties of music, the way they sound, and what mood they instill:

Duration
- The duration of the track in milliseconds.
Acousticness
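- A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
Danceability
- Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.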
Energy
- Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
Instrumentalness
- Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
Liveness
- Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
Loudness
- The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
Speechiness
- Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
Valence
- A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
Tempo
- The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
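
The thresholds quoted in the speechiness and instrumentalness descriptions can be written down as small helpers. This is a minimal sketch: the function names and the labels "speech", "mixed", and "music" are our own shorthand, and how to treat values exactly on a boundary is an assumption, since the descriptions only say "above", "between", and "below".

```python
def classify_speechiness(value):
    """Map a speechiness score to the three ranges described above."""
    if value > 0.66:
        return "speech"   # probably made entirely of spoken words
    if value >= 0.33:
        return "mixed"    # may contain both music and speech, e.g. rap
    return "music"        # most likely music and other non-speech audio

def is_instrumental(value):
    """Values above 0.5 are intended to represent instrumental tracks."""
    return value > 0.5

# Speechiness values taken from rows of the audio features table above
print(classify_speechiness(0.2760))
print(classify_speechiness(0.3440))
print(is_instrumental(0.000006))
```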

The following are more extraneous details for identifying tracks in the data wrangling:

Track ID
Artist ID
URL

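Using this data, we begin trying to observe what traits of a song bring success. First, we look at the distribution of stream counts. The histogram below plots the means of repeated random samples, which are approximately normally distributed and mostly fall between 1 and 1.05 million. The raw stream counts themselves are right-skewed: the population median (about 0.72 million) is well below the population mean (about 1.02 million).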

#Histogram takes 500 random tracks, takes the average of all their streams, then does this 100 times

from scipy.stats import normaltest
from numpy.random import seed
from numpy.random import randn

alpha = 0.05
data = []
for i in range(0,100):
    data.append(np.mean(df['streams'].sample(n=500)))
plt.hist(data)
plt.title("Frequency Distribution of Streams")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
print("Population mean: ", df['streams'].mean())
print("Population median: ", df['streams'].median())
print("Population STDDEV: ", df['streams'].std())

Population mean:  1023582.30625
Population median:  719465.0
Population STDDEV:  804478.6072499221
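
A histogram of sample means tends toward a bell shape even when the underlying values are skewed, which is the central limit theorem at work. The sketch below illustrates this with synthetic data using only the standard library. The log-normal population and its parameters are assumptions chosen to produce right-skewed values; they are not fitted to the chart data.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic, right-skewed stand-in for stream counts (not the chart data)
population = [rng.lognormvariate(13.5, 0.8) for _ in range(2400)]

# Same procedure as above: 100 samples of 500 values, keeping each sample mean
sample_means = [
    statistics.mean(rng.sample(population, 500)) for _ in range(100)
]

print("population mean:       ", statistics.mean(population))
print("population median:     ", statistics.median(population))
print("spread of population:  ", statistics.stdev(population))
print("spread of sample means:", statistics.stdev(sample_means))
```

The sample means cluster tightly around the population mean regardless of the skew, so this kind of histogram says more about the mean than about the shape of the raw data.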

[Figure: Frequency Distribution of Streams]

Our goal is to determine if there are certain values of song properties that result in extremely high or low success. We create a dataframe that only saves the entry of a song at its peak stream count in the Top 200, meaning we are comparing all the peaks.

# Creating version of table with no duplicates, keeping each song's entry with its highest stream count. It is a fair representation of success.

no_dupes_df = df.copy()
no_dupes_df = no_dupes_df.sort_values('streams', ascending=False).drop_duplicates(['artist', 'duration_ms', 'acousticness', 'danceability', 'energy'], keep='first') 
no_dupes_df

	position	track_name	artist	streams	url	date	track_id	duration_ms	acousticness	danceability	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo	artist_id	popularity_index	follower_count
200	1	Shape of You	Ed Sheeran	7549041.0	https://open.spotify.com/track/7qiZfU4dY1lWllz...	2017-02-01	7qiZfU4dY1lWllzX7mPBI3	233713.0	0.5810	0.825	0.652	0.000000	0.0931	-3.183	0.0802	0.931	95.977	6eUKZXaKkcviH0Ku9w2n3V	91.0	73345259.0
1000	1	Despacito - Remix	Luis Fonsi	7332260.0	https://open.spotify.com/track/5CtI0qwDJkDQGwX...	2017-06-01	6rPO02ozF3bM7NnOV4h6s2	228827.0	0.2280	0.653	0.816	0.000000	0.0967	-4.353	0.1670	0.816	178.085	4V8Sr092TqfHkfAA5fXXqG	78.0	9035487.0
2000	1	rockstar	Post Malone	5755610.0	https://open.spotify.com/track/7wGoVu4Dady5GV0...	2017-11-01	7ytR5pFWmSjzHJIeQkgog4	181733.0	0.2470	0.746	0.690	0.000000	0.1010	-7.956	0.1640	0.497	89.977	4r63FhuTkUYltbVAg5TQnk	93.0	5174251.0
1600	1	Look What You Made Me Do	Taylor Swift	5547962.0	https://open.spotify.com/track/6uFsE1JgZ20EXyU...	2017-09-01	1P17dC1amhFzptugyAO7Il	211853.0	0.2040	0.766	0.709	0.000014	0.1260	-6.471	0.1230	0.506	128.070	06HL4z0CvFAxyc27GXpf02	97.0	34579892.0
1001	2	I'm the One	DJ Khaled	5208996.0	https://open.spotify.com/track/72Q0FQQo32KJloi...	2017-06-01	1jYiIOC5d6soxkJP81fxq2	288877.0	0.0533	0.599	0.667	0.000000	0.1340	-4.267	0.0367	0.817	80.984	0QHgL1lAIqAw0HtD7YldmP	82.0	5405048.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
193	194	Famous	Kanye West	336134.0	https://open.spotify.com/track/19a3JfW8BQwqHWU...	2017-01-01	19a3JfW8BQwqHWUMbcqSx8	196040.0	0.0711	0.465	0.735	0.000000	0.0975	-3.715	0.1170	0.409	173.935	5K4W6rqBFWDnAN6FQUkS6x	90.0	12912141.0
196	197	Oh Lord	MiC LOWRY	331792.0	https://open.spotify.com/track/1sTUEdVO85YU8Ym...	2017-01-01	1ISsiC4Fw6f96kZQegLGiJ	198253.0	0.4070	0.493	0.738	0.000000	0.1300	-6.921	0.2620	0.219	176.071	6fOMl44jA4Sp5b9PpYCkzz	84.0	4600363.0
197	198	Superstition - Single Version	Stevie Wonder	331376.0	https://open.spotify.com/track/5lXcSvHRVjQJ3LB...	2017-01-01	1h2xVEoJORqrg71HocgqXd	245493.0	0.0380	0.633	0.634	0.006400	0.0385	-12.115	0.0725	0.872	100.499	7guDJrEfX3qb6FEbdPA5qi	80.0	4654921.0
198	199	Secrets	The Weeknd	331233.0	https://open.spotify.com/track/3DX4Y0egvc0slLc...	2017-01-01	1NhPKVLsHhFUHIOZ32QnS2	224693.0	0.0717	0.516	0.764	0.000000	0.1150	-6.223	0.0366	0.376	148.021	5Pwc4xIPtQLFEnJriah9YJ	83.0	11061770.0
199	200	Ni**as In Paris	JAY-Z	325951.0	https://open.spotify.com/track/2KpCpk6HjXXLb7n...	2017-01-01	4Li2WHPkuyCdtmokzW2007	219333.0	0.1270	0.789	0.858	0.000000	0.3490	-5.542	0.3110	0.775	140.022	3nFkdlSjzX9mRTtwJOzDYB	85.0	5812536.0

657 rows × 20 columns
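
The idea of keeping one row per song at its peak can be shown on a few rows copied from the chart table above. This standard-library sketch keys songs on track name and artist, which is a simplification we chose for illustration; the cell above instead identifies duplicates by artist and audio features.

```python
rows = [
    {"track_name": "Starboy", "artist": "The Weeknd",
     "date": "2017-01-01", "streams": 3135625},
    {"track_name": "Rockabye (feat. Sean Paul & Anne-Marie)", "artist": "Clean Bandit",
     "date": "2017-01-01", "streams": 2356604},
    {"track_name": "Rockabye (feat. Sean Paul & Anne-Marie)", "artist": "Clean Bandit",
     "date": "2017-12-01", "streams": 552118},
]

peak = {}
for row in rows:
    key = (row["track_name"], row["artist"])
    # Keep the row only if it is the first or the highest seen for this song
    if key not in peak or row["streams"] > peak[key]["streams"]:
        peak[key] = row

for row in peak.values():
    print(row["track_name"], row["date"], row["streams"])
```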

Now we will visualize the relationship between stream count and each song property.

plt.scatter(no_dupes_df['popularity_index'], no_dupes_df['streams'])
plt.title('Streams in Relation to Popularity')
plt.xlabel('popularity value')
plt.ylabel('streams in millions')
print("Mean of popularity index: " + str(no_dupes_df['popularity_index'].mean()))
print("Median of popularity index: " + str(no_dupes_df['popularity_index'].median()))
print("STDDEV of popularity index: " + str(no_dupes_df['popularity_index'].std()))

Mean of popularity index: 81.84579439252336
Median of popularity index: 83.0
STDDEV of popularity index: 10.20413589979866

[Figure: Streams in Relation to Popularity]
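
A scatter plot can be complemented with a correlation coefficient, which summarizes the strength of a linear relationship in a single number. The sketch below computes Pearson's r by hand on the five highest-streamed rows shown in the de-duplicated table above. Five rows are far too few to draw conclusions from, so this only demonstrates the calculation; pearson_r is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# popularity_index and streams of the first five rows of no_dupes_df above
popularity = [91, 78, 93, 97, 82]
top_streams = [7549041, 7332260, 5755610, 5547962, 5208996]

r = pearson_r(popularity, top_streams)
print(round(r, 3))
```

On the full dataframe, the same quantity is available as no_dupes_df['popularity_index'].corr(no_dupes_df['streams']).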

The data appears to cluster around the mean, so we decided to check whether the popularity index was normally distributed.

data = []
for i in range(0,100):
    data.append(np.mean(df['popularity_index'].sample(n=1000)))
plt.hist(data)
plt.title("Frequency Distribution of Popularity Indices")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")

Text(0, 0.5, 'Frequency')

[Figure: Frequency Distribution of Popularity Indices]

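The sample means of the artist popularity index appear to be roughly normally distributed, though the exact shape varies with the sample drawn. Because this histogram plots sample means rather than raw values, it reflects the behavior of the mean more than the shape of the underlying data. It does appear that the majority of songs on the top 200 come from artists with a popularity of around 80.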

plt.scatter(no_dupes_df['follower_count'], no_dupes_df['streams'])
plt.title('Streams in Relation to Follower Count')
plt.xlabel('number of artist followers in tens of millions')
plt.ylabel('streams in millions')
print("Mean of follower count: " + str(no_dupes_df['follower_count'].mean()))
print("Median of follower count: " + str(no_dupes_df['follower_count'].median()))
print("STDDEV of follower count: " + str(no_dupes_df['follower_count'].std()))

Mean of follower count: 13744629.682242991
Median of follower count: 7748023.0
STDDEV of follower count: 16091903.457271803

[Figure: Streams in Relation to Follower Count]

plt.scatter(no_dupes_df['duration_ms'], no_dupes_df['streams'])
plt.title('Streams in Relation to Song Duration')
plt.xlabel('duration in milliseconds')
plt.ylabel('streams in millions')
print("Mean of song length in seconds: " + str(no_dupes_df['duration_ms'].mean() / 1000))
print("Median of song length in seconds: " + str(no_dupes_df['duration_ms'].median() / 1000))
print("STDDEV of song length in seconds: " + str(no_dupes_df['duration_ms'].std() / 1000))

Mean of song length in seconds: 215.89356386292835
Median of song length in seconds: 213.981
STDDEV of song length in seconds: 41.844669610266095

[Figure: Streams in Relation to Song Duration]

plt.scatter(no_dupes_df['acousticness'],no_dupes_df['streams'])
plt.title('Streams in Relation to Acousticness')
plt.xlabel('acousticness scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of acousticness index: " + str(no_dupes_df['acousticness'].mean()))
print("Median of acousticness index: " + str(no_dupes_df['acousticness'].median()))
print("STDDEV of acousticness index: " + str(no_dupes_df['acousticness'].std()))

Mean of acousticness index: 0.21230590903426794
Median of acousticness index: 0.11
STDDEV of acousticness index: 0.23844285456931844

[Figure: Streams in Relation to Acousticness]

plt.scatter(no_dupes_df['danceability'],no_dupes_df['streams'])
plt.title('Streams in Relation to Danceability')
plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of danceability index: " + str(no_dupes_df['danceability'].mean()))
print("Median of danceability index: " + str(no_dupes_df['danceability'].median()))
print("STDDEV of danceability index: " + str(no_dupes_df['danceability'].std()))

Mean of danceability index: 0.6818423676012461
Median of danceability index: 0.695
STDDEV of danceability index: 0.13576219623892008

[Figure: Streams in Relation to Danceability]

plt.scatter(no_dupes_df['energy'],no_dupes_df['streams'])
plt.title('Streams in Relation to Energy')
plt.xlabel('energy scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of energy index: " + str(no_dupes_df['energy'].mean()))
print("Median of energy index: " + str(no_dupes_df['energy'].median()))
print("STDDEV of energy index: " + str(no_dupes_df['energy'].std()))

Mean of energy index: 0.6355397507788161
Median of energy index: 0.6515
STDDEV of energy index: 0.17854747086125813

[Figure: Streams in Relation to Energy]

plt.scatter(no_dupes_df['instrumentalness'],no_dupes_df['streams'])
plt.title('Streams in Relation to Instrumentalness')
plt.xlabel('instrumentalness scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of instrumentalness index: " + str(no_dupes_df['instrumentalness'].mean()))
print("Median of instrumentalness index: " + str(no_dupes_df['instrumentalness'].median()))
print("STDDEV of instrumentalness index: " + str(no_dupes_df['instrumentalness'].std()))

Mean of instrumentalness index: 0.013712031915887851
Median of instrumentalness index: 0.0
STDDEV of instrumentalness index: 0.08112995046942596

[Figure: Streams in Relation to Instrumentalness]

plt.scatter(no_dupes_df['liveness'],no_dupes_df['streams'])
plt.title('Streams in Relation to Liveness')
plt.xlabel('liveness scale of 0-1') 
plt.ylabel('streams in millions')
print("Mean of liveness index: " + str(no_dupes_df['liveness'].mean()))
print("Median of liveness index: " + str(no_dupes_df['liveness'].median()))
print("STDDEV of liveness index: " + str(no_dupes_df['liveness'].std()))

Mean of liveness index: 0.1735563862928349
Median of liveness index: 0.123
STDDEV of liveness index: 0.12771847354589183

[Figure: Streams in Relation to Liveness]

plt.scatter(no_dupes_df['loudness'],no_dupes_df['streams'])
plt.title('Streams in Relation to Loudness')
plt.xlabel('loudness in decibels')
plt.ylabel('streams in millions')
print("Mean of volume: " + str(no_dupes_df['loudness'].mean()))
print("Median of volume: " + str(no_dupes_df['loudness'].median()))
print("STDDEV of volume: " + str(no_dupes_df['loudness'].std()))

Mean of volume: -6.436602803738317
Median of volume: -5.992
STDDEV of volume: 2.930078470544615

[Figure: Streams in Relation to Loudness]

Lots of points appear to be around the mean volume, so let’s draw repeated random samples of 100 tracks and look at how the mean loudness of those samples is distributed.

data = []
for i in range(0,100):
    data.append(np.mean(df['loudness'].sample(n=100)))
plt.hist(data)
plt.title("Frequency Distribution of Average Song Volumes")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")

[Figure: Frequency Distribution of Average Song Volumes]

The sample means appear to be approximately normally distributed and are centered near the population mean of about -6 decibels, which is what the central limit theorem predicts for the means of large samples. Note that this describes the sample means rather than the individual tracks, whose loudness values have a standard deviation of around 3 decibels.
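A histogram of sample means tends to look roughly normal for almost any underlying data, so it can help to inspect the raw values separately. The sketch below is only an illustration: it uses synthetic, deliberately skewed loudness-like values (not our chart data), and the variable names and parameters are assumptions. It compares the skewness of the raw values with the skewness of 100-track sample means.

```python
import numpy as np
import pandas as pd

# Synthetic, left-skewed "loudness" values in decibels (NOT the real chart data).
rng = np.random.default_rng(0)
loudness = pd.Series(-3.0 - rng.gamma(shape=1.5, scale=2.3, size=5000))

# Means of repeated random samples of 100 tracks, mirroring the loop above.
sample_means = pd.Series(
    [loudness.sample(n=100, random_state=i).mean() for i in range(500)]
)

raw_skew = loudness.skew()        # strongly negative: the raw values are skewed
means_skew = sample_means.skew()  # much closer to 0: the sample means look normal

print(f"skew of raw values:   {raw_skew:.2f}")
print(f"skew of sample means: {means_skew:.2f}")

# The spread of the sample means shrinks roughly like sigma / sqrt(n).
print(f"std of sample means:  {sample_means.std():.2f}")
print(f"sigma / sqrt(100):    {loudness.std() / 10:.2f}")
```

If the raw values were themselves close to normal, both skewness figures would be near zero; here only the sample means are.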

plt.scatter(no_dupes_df['speechiness'], no_dupes_df['streams'])
plt.title('Streams in Relation to Speechiness')
plt.xlabel('speechiness scale of 0-.5')
plt.ylabel('streams in millions')
print("Mean of speechiness index: " + str(no_dupes_df['speechiness'].mean()))
print("Median of speechiness index: " + str(no_dupes_df['speechiness'].median()))
print("STDDEV of speechiness index: " + str(no_dupes_df['speechiness'].std()))

Mean of speechiness index: 0.1167436137071651
Median of speechiness index: 0.0678
STDDEV of speechiness index: 0.11056578687587934

[Figure: Streams in Relation to Speechiness]

plt.scatter(no_dupes_df['valence'],no_dupes_df['streams'])
plt.title('Streams in Relation to Valence')
plt.xlabel('valence scale of 0-1')
plt.ylabel('streams in millions')
print("Mean of valence index: " + str(no_dupes_df['valence'].mean()))
print("Median of valence index: " + str(no_dupes_df['valence'].median()))
print("STDDEV of valence index: " + str(no_dupes_df['valence'].std()))

Mean of valence index: 0.48936448598130844
Median of valence index: 0.48
STDDEV of valence index: 0.23614792618729277

[Figure: Streams in Relation to Valence]

plt.scatter(no_dupes_df['tempo'],no_dupes_df['streams'])
plt.title('Streams in Relation to Tempo')
plt.xlabel('tempo scale of 0-200 beats per minute')
plt.ylabel('streams in millions')
print("Mean of tempo: " + str(no_dupes_df['tempo'].mean()))
print("Median of tempo: " + str(no_dupes_df['tempo'].median()))
print("STDDEV of tempo: " + str(no_dupes_df['tempo'].std()))

Mean of tempo: 121.39503894080995
Median of tempo: 119.9425
STDDEV of tempo: 29.609868539398786

[Figure: Streams in Relation to Tempo]

Next, we create a correlation matrix to see the pairwise relationships between all of the numeric features. It summarizes the scatter plots above in a single chart.

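# Correlate numeric columns only; text columns such as track_name and url are skipped
corr = no_dupes_df.select_dtypes(include='number').corr()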
mask = np.triu(np.ones_like(corr, dtype=bool))

f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

[Figure: correlation heatmap of the numeric features across the Top 200 data]
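A heatmap is easier to read alongside the numbers behind it. As a minimal sketch, assuming a small made-up table whose column names follow our dataframe (the values are invented), the features can be ranked by the strength of their correlation with streams:

```python
import pandas as pd

# Tiny made-up table; the column names follow our dataframe, the values do not.
toy = pd.DataFrame({
    "track_name": ["a", "b", "c", "d", "e"],
    "streams": [1.0, 2.0, 3.0, 4.0, 5.0],
    "loudness": [-9.0, -8.0, -6.5, -6.0, -4.0],
    "acousticness": [0.8, 0.5, 0.6, 0.2, 0.1],
})

# Keep numeric columns only, so text columns such as track_name are ignored.
corr = toy.select_dtypes(include="number").corr()

# Correlation of every feature with streams, strongest first by absolute value.
by_strength = corr["streams"].drop("streams").sort_values(key=abs, ascending=False)
print(by_strength)
```

Sorting by absolute value keeps strong negative correlations near the top instead of burying them at the bottom of the list.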

After seeing this, we had a couple of ideas. It is possible that the features we are currently tracking work together to make a song popular, but it is also possible that we are missing other important features. Looking back at the most popular songs across our entire dataframe, we noticed that the majority of artists were well known or already accomplished. While it is obvious that an artist’s “followers” (or typical listeners) will increase the number of streams a song gets, it would be interesting to know whether the number of typical listeners is more important than all of the other aspects of the song.

Top 10

Here we look specifically at the traits of the songs at ranks 1-10. The song holding each rank changes from one chart date to the next, so the same rank (color) will appear at several different x-axis values.

top10s = df.loc[df['position'] <= 10]
#lists for legend to remove redundant code
color_list = ['r', 'orange', 'yellow', 'lime',  'cyan', 'b', 'brown' , 'violet', 'purple', 'black']
top10_legend = ['Rank 1', 'Rank 2', 'Rank 3', 'Rank 4', 'Rank 5', 'Rank 6','Rank 7','Rank 8','Rank 9','Rank 10']

#helper to remove redundant plotting code; one scatter call per rank keeps the legend colors aligned with the ranks
def plotTop10(name):
    for rank in range(1, 11):
        rows = top10s.loc[top10s['position'] == rank]
        plt.scatter(rows[name], rows['streams'], color=color_list[rank - 1])


top10s.head()

	position	track_name	artist	streams	url	date	track_id	duration_ms	acousticness	danceability	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo	artist_id	popularity_index	follower_count
0	1	Starboy	The Weeknd	3135625.0	https://open.spotify.com/track/5aAx2yezTd8zXrk...	2017-01-01	7MXVkk9YMctZqd1Srtv4MB	230453.0	0.14100	0.679	0.587	0.000006	0.137	-7.015	0.2760	0.486	186.003	1Xyo4u8uXC1ZmMpatF05PJ	94.0	26720759.0
1	2	Closer	The Chainsmokers	3015525.0	https://open.spotify.com/track/7BKLCZ1jbUBVqRi...	2017-01-01	7BKLCZ1jbUBVqRi2FVlTVw	244960.0	0.41400	0.748	0.524	0.000000	0.111	-5.599	0.0338	0.661	95.010	69GGBxA162lTqCwzJG5jLp	84.0	17093912.0
2	3	Let Me Love You	DJ Snake	2545384.0	https://open.spotify.com/track/4pdPtRcBmOSQDlJ...	2017-01-01	3ibKnFDaa3GhpPGlOUj7ff	256733.0	0.23500	0.656	0.578	0.000000	0.118	-8.970	0.0922	0.556	94.514	20s0P9QLxGqKuCsGwFsp7w	69.0	2055274.0
3	4	Rockabye (feat. Sean Paul & Anne-Marie)	Clean Bandit	2356604.0	https://open.spotify.com/track/5knuzwU65gJK7IF...	2017-01-01	5knuzwU65gJK7IF5yJsuaW	251088.0	0.40600	0.720	0.763	0.000000	0.180	-4.068	0.0523	0.742	101.965	6MDME20pz9RveH9rEXvrOM	80.0	4092589.0
4	5	One Dance	Drake	2259887.0	https://open.spotify.com/track/1xznGGDReH1oQq0...	2017-01-01	1zi7xx7UVEFkmKfv06H8x0	173987.0	0.00776	0.792	0.625	0.001800	0.329	-5.609	0.0536	0.370	103.967	3TVXtAsR1Inumwj472S9r4	96.0	51374698.0

plotTop10('popularity_index')
plt.title('Top 10 Streams in Relation to Popularity of Artist')
plt.xlabel('Artist Popularity Value')
plt.ylabel('streams in millions')
plt.legend(top10_legend)

[Figure: Top 10 Streams in Relation to Popularity of Artist]

plotTop10('follower_count')
plt.title('Top 10 Streams in Relation to Follower Count')
plt.xlabel('Follower Count in 10 millions')
plt.ylabel('streams in millions')
plt.legend(top10_legend)

[Figure: Top 10 Streams in Relation to Follower Count]

This graph seems to indicate that the majority of songs within the top 10 positions come from artists with a follower count of less than 30 million, but the sample size here is small.
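A claim like this can also be checked numerically rather than by eye. The sketch below assumes a handful of hypothetical top 10 rows (the follower counts are invented for illustration) and computes the share of rows below the 30 million mark:

```python
import pandas as pd

# Hypothetical top 10 rows; the follower counts are invented for illustration.
top10_toy = pd.DataFrame({
    "position": [1, 2, 3, 4, 5],
    "follower_count": [73_000_000, 9_000_000, 5_000_000, 34_000_000, 17_000_000],
})

# A boolean mask averages to the fraction of rows where the condition holds.
below_30m = top10_toy["follower_count"] < 30_000_000
share_below = below_30m.mean()

print(f"{share_below:.0%} of these rows come from artists under 30M followers")
```

Because the same artist can hold several top 10 slots, dropping duplicate artists first would answer a slightly different question (share of artists rather than share of chart entries).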

plotTop10('duration_ms')
plt.title('Top 10 Streams in Relation to Duration')
plt.xlabel('duration in milliseconds')
plt.ylabel('streams in millions')
plt.legend(top10_legend)

[Figure: Top 10 Streams in Relation to Duration]

This graph shows how the duration of a song in milliseconds compares to the number of streams that song received, using only the tracks in the top 10 positions of our dataframe. The songs with the most streams in this set are longer than 240000 ms, or 4 minutes. This is surprising, because the mean song length across our data is about 216 seconds, or roughly 3 minutes and 36 seconds.
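Durations in milliseconds are hard to read at a glance, so a small conversion helper can make these comparisons easier. The function name below is our own and purely illustrative; the example values are the 240000 ms threshold, the mean duration printed earlier, and the duration of the first row of the top 10 table.

```python
def ms_to_min_sec(duration_ms):
    """Format a duration in milliseconds as minutes:seconds."""
    total_seconds = round(duration_ms / 1000)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"

print(ms_to_min_sec(240000))     # the 4 minute threshold
print(ms_to_min_sec(215893.56))  # mean song length in our data
print(ms_to_min_sec(230453.0))   # "Starboy", from the table above
```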

plotTop10('acousticness')
plt.title('Top 10 Streams in Relation to Acousticness')
plt.xlabel('acousticness scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(top10_legend)

[Figure: Top 10 Streams in Relation to Acousticness]

This graph displays a confidence score for how likely it is that a song is acoustic (with a value of 1 being very likely that the song is acoustic) compared to the number of streams the song has. All of the confidence scores are less than .5, which indicates most of these songs are probably not acoustic.

plotTop10('energy')
plt.title('Top 10 Streams in Relation to Energy')
plt.xlabel('energy scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(top10_legend)

[Figure: Top 10 Streams in Relation to Energy]

This graph shows how the “energy” of a song, or generally how noisy and fast the song is, compares to the number of streams for the songs in the top 10 positions. Here, we see that the songs with the most streams are around or above .6 on the energy scale (a higher score means the song is higher energy).

plotTop10('danceability')
plt.title('Top 10 Streams in Relation to Danceability')
plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(top10_legend)

[Figure: Top 10 Streams in Relation to Danceability]

This graph shows how “danceable” a song is, using a value provided to us by the Spotify API, compared to the number of streams that song got. Danceability is measured as a value from 0 to 1, where 1 is most danceable. This graph appears to be similar to the graph describing energy, so the two values may have been determined using similar characteristics (i.e. both are measuring how upbeat or fast a song is).

plotTop10('loudness')
plt.title('Top 10 Streams in Relation to Loudness')
plt.xlabel('loudness in decibels')
plt.ylabel('streams in millions')
plt.legend(top10_legend)

[Figure: Top 10 Streams in Relation to Loudness]

This graph describes the average volume of each track in our top 10s data set compared to the number of streams each song had. It appears to trend similarly to the last two graphs, indicating that the volume of a track may be correlated with how danceable or energetic a song is.

plotTop10('valence')
plt.title('Top 10 Streams in Relation to Valence')
plt.xlabel('valence scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(top10_legend)

[Figure: Top 10 Streams in Relation to Valence]

This graph describes the “valence” of a song compared to the number of streams it got. Valence is described as the “positivity” of a song where “Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry),” according to the Spotify API reference. The reference does not describe how this value is determined, but our data seems to show there may be a correlation between valence and the number of streams a song is getting among the rank 1 songs. However, this graph does not take into account the other features of the songs. It may be worth considering songs where every feature except this one is held roughly constant, so that we can see whether valence on its own is correlated with the number of streams.
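One way to approximate “holding the other features constant” is a partial correlation: fit a linear model of both valence and streams on the other features, then correlate what is left over. This is a technique we did not use in our analysis, so the sketch below is only an assumption-laden illustration on synthetic data, and `partial_corr` is a helper name we made up.

```python
import numpy as np

def partial_corr(x, y, controls):
    """Correlation between x and y after linearly removing the controls."""
    design = np.column_stack([np.ones(len(x)), controls])
    # Residuals of least-squares fits of x and y on the control features.
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

# Synthetic data (NOT our chart data): streams depend on loudness only,
# while valence merely moves together with loudness.
rng = np.random.default_rng(1)
loudness = rng.normal(-6.0, 3.0, size=2000)
valence = 0.5 + 0.05 * (loudness + 6.0) + rng.normal(0.0, 0.1, size=2000)
streams = 2.0 + 0.3 * (loudness + 6.0) + rng.normal(0.0, 0.5, size=2000)

raw = np.corrcoef(valence, streams)[0, 1]
partial = partial_corr(valence, streams, loudness)

print(f"raw correlation:     {raw:.2f}")
print(f"partial correlation: {partial:.2f}")
```

In this constructed example the raw correlation between valence and streams is clearly positive, but it nearly vanishes once loudness is accounted for, which is the kind of effect a single scatter plot cannot reveal.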

plotTop10('tempo')
plt.title('Top 10 Streams in Relation to Beats Per Minute (BPM)')
plt.xlabel('tempo scale of 0-200 Beats Per Minute (BPM)')
plt.ylabel('streams in millions')
plt.legend(top10_legend)

[Figure: Top 10 Streams in Relation to Beats Per Minute (BPM)]

This graph describes the tempo of a song compared to the number of streams that song has. Given our dataset, it is unclear whether there is a correlation between the tempo of a song and the number of streams it gets.

Correlation within the Top 10

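# Correlate numeric columns only; text columns such as track_name and url are skipped
corr = top10s.select_dtypes(include='number').corr()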
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

[Figure: correlation heatmap of the numeric features within the Top 10]

This correlation matrix of the Top 10 is much more polarized than the Top 200 correlation matrix. Features that had a weak positive or weak negative correlation across the Top 200 are more strongly correlated within the Top 10. This suggests that the highly correlated traits have a stronger influence in the Top 10, although the Top 10 is also a much smaller sample.

Unique Relationships

There appeared to be a potential relationship between valence and the number of streams a song was getting, both in our correlation chart and in how rank 1 songs performed in our graph of the top 10 tracks each month. It might therefore be interesting to look at the other features of songs with a high valence (above .4).

highValenceTracks = df.loc[df['valence'] > .4]
highValenceTracks.sort_values('streams', ascending=False).head(10)

	position	track_name	artist	streams	url	date	track_id	duration_ms	acousticness	danceability	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo	artist_id	popularity_index	follower_count
200	1	Shape of You	Ed Sheeran	7549041.0	https://open.spotify.com/track/7qiZfU4dY1lWllz...	2017-02-01	7qiZfU4dY1lWllzX7mPBI3	233713.0	0.581	0.825	0.652	0.000000	0.0931	-3.183	0.0802	0.931	95.977	6eUKZXaKkcviH0Ku9w2n3V	91.0	73345259.0
1000	1	Despacito - Remix	Luis Fonsi	7332260.0	https://open.spotify.com/track/5CtI0qwDJkDQGwX...	2017-06-01	6rPO02ozF3bM7NnOV4h6s2	228827.0	0.228	0.653	0.816	0.000000	0.0967	-4.353	0.1670	0.816	178.085	4V8Sr092TqfHkfAA5fXXqG	78.0	9035487.0
400	1	Shape of You	Ed Sheeran	7201132.0	https://open.spotify.com/track/7qiZfU4dY1lWllz...	2017-03-01	7qiZfU4dY1lWllzX7mPBI3	233713.0	0.581	0.825	0.652	0.000000	0.0931	-3.183	0.0802	0.931	95.977	6eUKZXaKkcviH0Ku9w2n3V	91.0	73345259.0
600	1	Shape of You	Ed Sheeran	6815498.0	https://open.spotify.com/track/7qiZfU4dY1lWllz...	2017-04-01	7qiZfU4dY1lWllzX7mPBI3	233713.0	0.581	0.825	0.652	0.000000	0.0931	-3.183	0.0802	0.931	95.977	6eUKZXaKkcviH0Ku9w2n3V	91.0	73345259.0
1200	1	Despacito - Remix	Luis Fonsi	6398530.0	https://open.spotify.com/track/5CtI0qwDJkDQGwX...	2017-07-01	6rPO02ozF3bM7NnOV4h6s2	228827.0	0.228	0.653	0.816	0.000000	0.0967	-4.353	0.1670	0.816	178.085	4V8Sr092TqfHkfAA5fXXqG	78.0	9035487.0
800	1	Despacito - Remix	Luis Fonsi	6360737.0	https://open.spotify.com/track/5CtI0qwDJkDQGwX...	2017-05-01	6rPO02ozF3bM7NnOV4h6s2	228827.0	0.228	0.653	0.816	0.000000	0.0967	-4.353	0.1670	0.816	178.085	4V8Sr092TqfHkfAA5fXXqG	78.0	9035487.0
2000	1	rockstar	Post Malone	5755610.0	https://open.spotify.com/track/7wGoVu4Dady5GV0...	2017-11-01	7ytR5pFWmSjzHJIeQkgog4	181733.0	0.247	0.746	0.690	0.000000	0.1010	-7.956	0.1640	0.497	89.977	4r63FhuTkUYltbVAg5TQnk	93.0	5174251.0
1800	1	rockstar	Post Malone	5649503.0	https://open.spotify.com/track/1OmcAT5Y8eg5bUP...	2017-10-01	7ytR5pFWmSjzHJIeQkgog4	181733.0	0.247	0.746	0.690	0.000000	0.1010	-7.956	0.1640	0.497	89.977	4r63FhuTkUYltbVAg5TQnk	93.0	5174251.0
1600	1	Look What You Made Me Do	Taylor Swift	5547962.0	https://open.spotify.com/track/6uFsE1JgZ20EXyU...	2017-09-01	1P17dC1amhFzptugyAO7Il	211853.0	0.204	0.766	0.709	0.000014	0.1260	-6.471	0.1230	0.506	128.070	06HL4z0CvFAxyc27GXpf02	97.0	34579892.0
2200	1	rockstar	Post Malone	5528701.0	https://open.spotify.com/track/7wGoVu4Dady5GV0...	2017-12-01	7ytR5pFWmSjzHJIeQkgog4	181733.0	0.247	0.746	0.690	0.000000	0.1010	-7.956	0.1640	0.497	89.977	4r63FhuTkUYltbVAg5TQnk	93.0	5174251.0

We have duplicate pieces of data, so let’s remove the duplicates for this test. We’re going to keep the version of each song that has the most streams.

# Sorting by streams (descending) and keeping the first duplicate retains each song's highest-stream entry
highValenceTracks = highValenceTracks.sort_values('streams', ascending=False).drop_duplicates(['artist', 'duration_ms', 'acousticness', 'danceability', 'energy'], keep='first')
highValenceTracks.sort_values('streams', ascending=False).head(10)

	position	track_name	artist	streams	url	date	track_id	duration_ms	acousticness	danceability	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo	artist_id	popularity_index	follower_count
200	1	Shape of You	Ed Sheeran	7549041.0	https://open.spotify.com/track/7qiZfU4dY1lWllz...	2017-02-01	7qiZfU4dY1lWllzX7mPBI3	233713.0	0.581000	0.825	0.652	0.000000	0.0931	-3.183	0.0802	0.931	95.977	6eUKZXaKkcviH0Ku9w2n3V	91.0	73345259.0
1000	1	Despacito - Remix	Luis Fonsi	7332260.0	https://open.spotify.com/track/5CtI0qwDJkDQGwX...	2017-06-01	6rPO02ozF3bM7NnOV4h6s2	228827.0	0.228000	0.653	0.816	0.000000	0.0967	-4.353	0.1670	0.816	178.085	4V8Sr092TqfHkfAA5fXXqG	78.0	9035487.0
2000	1	rockstar	Post Malone	5755610.0	https://open.spotify.com/track/7wGoVu4Dady5GV0...	2017-11-01	7ytR5pFWmSjzHJIeQkgog4	181733.0	0.247000	0.746	0.690	0.000000	0.1010	-7.956	0.1640	0.497	89.977	4r63FhuTkUYltbVAg5TQnk	93.0	5174251.0
1600	1	Look What You Made Me Do	Taylor Swift	5547962.0	https://open.spotify.com/track/6uFsE1JgZ20EXyU...	2017-09-01	1P17dC1amhFzptugyAO7Il	211853.0	0.204000	0.766	0.709	0.000014	0.1260	-6.471	0.1230	0.506	128.070	06HL4z0CvFAxyc27GXpf02	97.0	34579892.0
1001	2	I'm the One	DJ Khaled	5208996.0	https://open.spotify.com/track/72Q0FQQo32KJloi...	2017-06-01	1jYiIOC5d6soxkJP81fxq2	288877.0	0.053300	0.599	0.667	0.000000	0.1340	-4.267	0.0367	0.817	80.984	0QHgL1lAIqAw0HtD7YldmP	82.0	5405048.0
401	2	Something Just Like This	The Chainsmokers	4581789.0	https://open.spotify.com/track/6RUKPb4LETWmmr3...	2017-03-01	6RUKPb4LETWmmr3iAEQktW	247160.0	0.049800	0.617	0.635	0.000014	0.1640	-6.769	0.0317	0.446	103.019	69GGBxA162lTqCwzJG5jLp	84.0	17093912.0
1201	2	Wild Thoughts (feat. Rihanna & Bryson Tiller)	DJ Khaled	4558126.0	https://open.spotify.com/track/1OAh8uOEOvTDqkK...	2017-07-01	45XhKYRRkyeqoW3teSOkCM	204664.0	0.028700	0.613	0.681	0.000000	0.1260	-3.089	0.0778	0.619	97.621	0QHgL1lAIqAw0HtD7YldmP	82.0	5405048.0
402	3	It Ain't Me (with Selena Gomez)	Kygo	4529714.0	https://open.spotify.com/track/3eR23VReFzcdmS7...	2017-03-01	2jRGYG8U5bJzWOH6FLuzvO	192000.0	0.016100	0.713	0.658	0.000138	0.0607	-5.362	0.0748	0.539	115.024	23fqKkggKUBHNkbKtXEls4	86.0	6975385.0
803	4	HUMBLE.	Kendrick Lamar	4371886.0	https://open.spotify.com/track/7KXjTSCq5nL1LoY...	2017-05-01	7KXjTSCq5nL1LoYtL7XAwS	177000.0	0.000282	0.908	0.621	0.000054	0.0958	-6.638	0.1020	0.421	150.011	2YZyLoL8N0Wb9xBt1NhZWg	87.0	16028806.0
2002	3	New Rules	Dua Lipa	3758506.0	https://open.spotify.com/track/2ekn2ttSfGqwhha...	2017-11-01	2ekn2ttSfGqwhhate0LSR0	209320.0	0.002610	0.762	0.700	0.000016	0.1530	-6.021	0.0694	0.608	116.073	6M2wZ9GZgrQXHCFfjv46we	93.0	21442792.0
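The de-duplication step can be checked on a few rows. The sketch below reuses four rows from the first table above (trimmed to three columns) and shows two ways to keep each song's highest-stream entry. Using `track_id` as the duplicate key is an assumption made for brevity; the notebook itself keys on artist and audio features.

```python
import pandas as pd

# Four rows taken from the table above, trimmed to three columns.
rows = pd.DataFrame({
    "track_id": [
        "7qiZfU4dY1lWllzX7mPBI3", "6rPO02ozF3bM7NnOV4h6s2",
        "7qiZfU4dY1lWllzX7mPBI3", "6rPO02ozF3bM7NnOV4h6s2",
    ],
    "date": ["2017-02-01", "2017-06-01", "2017-03-01", "2017-07-01"],
    "streams": [7549041.0, 7332260.0, 7201132.0, 6398530.0],
})

# Option 1: sort descending, then keep the first (highest-stream) row per track.
peak_a = rows.sort_values("streams", ascending=False).drop_duplicates(
    "track_id", keep="first"
)

# Option 2: look up the row label of the maximum streams within each track.
peak_b = rows.loc[rows.groupby("track_id")["streams"].idxmax()]

print(peak_a)
print(peak_b)
```

Both options select the same two rows; the second makes the “highest streams per song” intent explicit and does not depend on sort order.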

Here are the first few tracks from our list of songs with high valences, sorted from the highest valence down.

highValenceTracks = highValenceTracks.sort_values('valence', ascending=False)
highValenceTracks.head()

	position	track_name	artist	streams	url	date	track_id	duration_ms	acousticness	danceability	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo	artist_id	popularity_index	follower_count
1006	7	There's Nothing Holdin' Me Back	Shawn Mendes	3093935.0	https://open.spotify.com/track/79cuOz3SPQTuFrp...	2017-06-01	7JJmb5XwzOO8jgpou264Ml	199440.0	0.380	0.866	0.813	0.000000	0.0779	-4.063	0.0554	0.969	121.998	7n2wHs1TKAczGzO7Dd2rGr	92.0	30441601.0
124	125	Pumped Up Kicks	Foster The People	467384.0	https://open.spotify.com/track/7w87IxuO7BDcJ3Y...	2017-01-01	7w87IxuO7BDcJ3YUqCyMTT	239600.0	0.145	0.733	0.710	0.115000	0.0956	-5.849	0.0292	0.965	127.975	7gP3bB2nilZXLfPHJhMdvc	76.0	3059575.0
2366	167	Feliz Navidad	José Feliciano	631358.0	https://open.spotify.com/track/7taXf5odg9xCAZE...	2017-12-01	0oPdaY4dXtc3ZsaG17V972	182067.0	0.550	0.513	0.831	0.000000	0.3360	-9.004	0.0383	0.963	148.837	7K78lVZ8XzkjfRSI7570FF	76.0	211150.0
130	131	Happy - From "Despicable Me 2"	Pharrell Williams	453426.0	https://open.spotify.com/track/5b88tNINg4Q4nrR...	2017-01-01	60nZcImufyMA1MKQY3dcCH	232720.0	0.219	0.647	0.822	0.000000	0.0908	-4.662	0.1830	0.962	160.019	2RdwBSPQiwcmiDo9kixcl8	80.0	3359324.0
1315	116	Skrt On Me (feat. Nicki Minaj)	Calvin Harris	625504.0	https://open.spotify.com/track/7iDxZ5Cd0Yg08d4...	2017-07-01	7iDxZ5Cd0Yg08d4fI5WVtG	228267.0	0.169	0.713	0.889	0.000058	0.1690	-3.870	0.0376	0.960	101.977	7CajNmpbOovFoOoasH2HaY	86.0	20962353.0

Here’s a plot of the number of streams against valence for each song with a high valence.

plt.scatter(highValenceTracks['valence'], highValenceTracks['streams'])
plt.title("Streams Compared to Valence for Songs with Valence > .4")
plt.xlabel("Valence Value From 0 to 1")
plt.ylabel("Streams in Millions")

[Figure: Streams Compared to Valence for Songs with Valence > .4]

We no longer see the relationship between valence and number of streams that we were seeing earlier. Maybe the relationship that leads to more streams comes from a combination of these features. It might be worth checking whether there is a relationship between streams and a combination of features like valence and loudness or valence and danceability.

Let’s try categorizing our data into groups with different levels of valence. This allows us to bound valence, which helps us treat it more like a constant. Then, we can see how other features compare to streams when valence is held within certain levels.

highValenceTracks = df.loc[(df['valence'] >= .5) & (df['valence'] < .8)]
veryHighValenceTracks = df.loc[df['valence'] >= .8]
lowValenceTracks = df.loc[(df['valence'] < .5) & (df['valence'] >= .3)]
veryLowValenceTracks = df.loc[df['valence'] < .3]  # strict bound so that .3 falls only in the low group
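An alternative way to build these groups is `pd.cut`, which assigns each track to exactly one band and so avoids tracks on a boundary being counted twice. This is a sketch on a few hand-picked valence values rather than our data, and the band labels are our own naming.

```python
import pandas as pd

# Hand-picked valence values that sit on and around the band boundaries.
valence = pd.Series([0.05, 0.30, 0.49, 0.50, 0.79, 0.80, 0.97])

# right=False makes every bin closed on the left: [0, .3), [.3, .5), [.5, .8), [.8, 1.01).
# The top edge is set just above 1 so that a valence of exactly 1.0 is not dropped.
bands = pd.cut(
    valence,
    bins=[0.0, 0.3, 0.5, 0.8, 1.01],
    labels=["very low", "low", "high", "very high"],
    right=False,
)

print(pd.DataFrame({"valence": valence, "band": bands}))
```

The resulting column can be used directly with `groupby`, or to pick a color per band when plotting.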

First, we can try looking at how danceability performs with different bounded groups of valence. We will color code the valence groups, so that we can easily see which tracks have a high, medium-high, medium-low, or low valence.

# Plotting danceability against streams, colored by valence group, to see if these two values work together to impact stream counts
plt.scatter(veryLowValenceTracks['danceability'],veryLowValenceTracks['streams'], color="blue")
plt.scatter(lowValenceTracks['danceability'],lowValenceTracks['streams'], color="turquoise")
plt.scatter(highValenceTracks['danceability'],highValenceTracks['streams'], color="yellow")
plt.scatter(veryHighValenceTracks['danceability'],veryHighValenceTracks['streams'], color="red")
plt.title('Danceability and streams for songs with bounded valences')
plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')
plt.legend(['Song with valence < .3', 'Song with valence >= .3 and < .5', 'Song with valence >= .5 and < .8', 'Song with valence >= .8'])

[Figure: Danceability and streams for songs with bounded valences]

This graph displays how the danceability of a song compares to the number of streams it has, with songs color coded by valence group. While there is little indication of a linear correlation, it appears that the songs with the most streams all have a danceability greater than .5.

Using this set of data where valence is color coded by groups of values, let’s try to plot other features against streams and see if valence and another feature have any effect on the number of streams. We can try considering follower count next, as that value appeared to have a slight positive correlation with streams in our correlation chart.

# Plotting songs with varying levels of followers within certain valence levels against streams to see if these two values work together to impact stream counts
plt.scatter(veryLowValenceTracks['follower_count'],veryLowValenceTracks['streams'], color="blue")
plt.scatter(lowValenceTracks['follower_count'],lowValenceTracks['streams'], color="turquoise")
plt.scatter(highValenceTracks['follower_count'],highValenceTracks['streams'], color="yellow")
plt.scatter(veryHighValenceTracks['follower_count'],veryHighValenceTracks['streams'], color="red")
plt.title('Follower Count and Streams for Songs With Different Valence Bounds')
plt.xlabel('Follower count in tens of millions of followers')
plt.ylabel('Streams in millions')
plt.legend(['Song with valence < .3', 'Song with valence >= .3 and < .5', 'Song with valence >= .5 and < .8', 'Song with valence >= .8'])

[Figure: Follower Count and Streams for Songs With Different Valence Bounds]

This graph shows that the majority of songs in our entire set of data sit in the bottom left corner of the graph, and this cluster includes varying levels of valence. This suggests that follower count combined with bounded levels of valence has little to do with the number of streams.

Another variable that appeared to have a slight correlation with streams was the loudness of a song. We can try to make a plot similar to our previous two plots where we instead measure loudness on the x-axis.

# Plotting songs with varying levels of loudness within certain valence levels against streams to see if these two values work together to impact stream counts
plt.scatter(veryLowValenceTracks['loudness'],veryLowValenceTracks['streams'], color="blue")
plt.scatter(lowValenceTracks['loudness'],lowValenceTracks['streams'], color="turquoise")
plt.scatter(highValenceTracks['loudness'],highValenceTracks['streams'], color="yellow")
plt.scatter(veryHighValenceTracks['loudness'],veryHighValenceTracks['streams'], color="red")
plt.title('Loudness and Streams for Songs With Different Valence Bounds')
plt.xlabel('Loudness in Decibels')
plt.ylabel('Streams in Millions')
plt.legend(['Song with valence < .3', 'Song with valence >= .3 and < .5', 'Song with valence >= .5 and < .8', 'Song with valence >= .8'])

[Figure: Loudness and Streams for Songs With Different Valence Bounds]

While streams appear to have little correlation with loudness and valence, it does appear that valence and loudness are correlated with each other. We start to see higher valences as songs get louder, which seems relatively intuitive, as “negative” songs are probably quieter and more somber. It may also be noteworthy that the songs that did very well all had a loudness above -15 decibels.

These last three graphs seem to indicate that bounded levels of valence in conjunction with varying levels of the other features we measured have little to do with stream count. We did find that songs which performed exceptionally well had a loudness above -15 decibels or a danceability greater than .6, but it was hard to see any other relationships between our variables and streams.
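The visual impression from these color-coded plots can be backed up with one number per valence group. The sketch below is an assumption-based illustration on synthetic data in which streams are unrelated to the features by construction; it computes the correlation between loudness and streams separately within each valence band.

```python
import numpy as np
import pandas as pd

# Synthetic data (NOT our chart data); streams are unrelated to the features by construction.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "valence": rng.uniform(0.0, 1.0, size=400),
    "loudness": rng.normal(-6.0, 3.0, size=400),
    "streams": rng.lognormal(mean=13.5, sigma=0.6, size=400),
})

# Assign each track to one valence band (left-closed bins).
toy["valence_band"] = pd.cut(
    toy["valence"],
    bins=[0.0, 0.3, 0.5, 0.8, 1.01],
    labels=["very low", "low", "high", "very high"],
    right=False,
)

# Correlation between loudness and streams inside each band.
per_band = pd.Series({
    band: group["loudness"].corr(group["streams"])
    for band, group in toy.groupby("valence_band", observed=True)
})

print(per_band.round(2))
```

Values near zero in every band would match what the plots suggest; a band with a clearly larger value would be worth a closer look.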

Insight and Conclusion

In the final step of the data cycle, we draw conclusions from our analysis to inform decisions based on the data.

To return to our original question: Are there traits of a song that can be used to determine future success? If so, what are they?

We were unable to conclude whether specific traits of a song directly determine future success or high stream counts. However, there were a few noteworthy trends in our data. Among our findings, the means of random samples of song volumes were approximately normally distributed around a mean of about -6 decibels, and the volumes of individual songs had a standard deviation of about 3 decibels. Most songs on the top 200 chart across our data therefore sit within a few decibels of this volume. The loudness of songs brings up an interesting reminder of the Loudness War, a trend toward ever-louder recordings that was reported as early as the 1940s. During the loudness war, levels kept increasing even though raising the loudness of a song ultimately reduced its fidelity (fine details) and drew objections from many critics. This may be an echo of that cultural trend, or perhaps people simply like their music loud, even at a lower quality.

Our initial plotting of popularity index against streams indicated that most top 200 songs came from artists with an index of around 80 (mean of 81), and our correlation plot seemed to support that higher artist popularity indices were positively correlated with streams. It is unsurprising that artist popularity correlates with streams, but it is noteworthy that our correlation plot indicated only a weak positive correlation, even for artist popularity on its own. We noted that while the average number of followers for an artist was skewed by inconsistently high values in the top 1 or 2 positions, the median value of followers for an artist in the top 200 was 7,748,023.
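The gap between mean and median follower counts is easy to see on a few rows. As a small illustration, the sketch below uses the five follower counts shown in the `top10s.head()` output earlier (not the full dataset, so these are not the figures quoted above).

```python
import pandas as pd

# Follower counts of the five artists shown in the top10s.head() output.
followers = pd.Series(
    [26720759.0, 17093912.0, 2055274.0, 4092589.0, 51374698.0],
    index=["The Weeknd", "The Chainsmokers", "DJ Snake", "Clean Bandit", "Drake"],
)

mean_followers = followers.mean()
median_followers = followers.median()

# One very large value pulls the mean upward, while the median barely moves.
print(f"mean:   {mean_followers:,.0f}")
print(f"median: {median_followers:,.0f}")
```

This is why the median is the safer summary when a handful of artists have far more followers than everyone else.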

After analyzing several features individually and seeing little direct correlation with stream values, we decided to consider how different factors working together might impact streams. We looked at individual features within the top 10 positions of our charts to see if we could find any trends there, as these songs were the most successful of all the songs in our data. Within these, we initially thought we saw a positive correlation between valence and stream counts, but it was unclear if there was a distinctive relationship. We then plotted a few other features with bounded valence values to investigate whether any of these combinations could produce a clearer correlation. The other features we chose to measure with bounded valence values were ones which appeared to be positively correlated with streams according to our correlation chart, but were also measured independently from valence (i.e. they were not determined using the same song characteristics). Unfortunately, it appeared that there was no correlation between streams and the combination of valence with the other factors we considered (danceability, follower count, and volume).

No single feature appeared to be the determining factor for a song’s success, a conclusion that our correlation plot also seemed to support. Future research on how combinations of different features affect streams would therefore be valuable. It would also be worth looking into what factors influence artist popularity. Access to data from a larger time period could also help in finding interesting trends in how people listen to music over time.