STA107 Analysis of Song Durations

Analysis of Song Durations

Context of the Data

A sample of STA107 students in our class participated in the discussion board “Music to Support Mental Health and Well-being”, and each named a song that they find uplifting and/or inspiring, along with its duration (converted to seconds).

In your small groups, you will explore the distribution of song durations. The data set is stored in “Songs.csv” and contains a list of 22 songs. You will also obtain the frequency distribution of words in the lyrics of the selected songs and practice making word clouds. The lyrics are stored in “lyrics.txt”, which contains the lyrics of 24 songs. Where song lyrics were not in English, English translations were obtained.

List of variables in the Songs data set (with an example value for each):

•  song: Here Comes the Sun

•  artist: The Beatles

•  duration:  185 seconds

You will consider the following data investigations:

The histogram shows that the distribution of song durations is approximately symmetric, with a peak around 4 minutes.

For central tendency, the mean duration is about 3.94 minutes, which is close to the peak of the histogram, and the median is about 3.91 minutes. For spread, the minimum is about 1.68 minutes and the maximum is about 5.43 minutes. The minimum lies farther from the median than the maximum does, so we conclude that the data are slightly skewed to the left but mostly symmetric.

The boxplot shows that the distribution of song durations is fairly symmetric: the median line sits near the center of the box, and there are no outliers. The median is 3.91 minutes, which is very close to the mean of about 3.94 minutes, suggesting that the data are symmetric around the center. The box spans the interquartile range (IQR), from the first quartile (about 3.42 minutes) to the third quartile (about 4.58 minutes), so the middle 50% of the songs have durations between these values. The whiskers extend to the minimum of about 1.68 minutes and the maximum of about 5.43 minutes, which gives the range of the song durations.
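
The "no outliers" reading can be checked by hand with the common 1.5 × IQR rule. The sketch below is only an illustration: it uses the quartiles reported by favstats in the summary table rather than the raw data, and R's boxplot() computes its hinges slightly differently, so the fences it draws may not match exactly. The variable names are chosen here for illustration.

```r
# Quartiles and extremes as reported by favstats(duration_min), in minutes
q1 <- 3.416667
q3 <- 4.575
min_dur <- 1.683333
max_dur <- 5.433333

iqr <- q3 - q1                  # width of the box
lower_fence <- q1 - 1.5 * iqr   # values below this would be flagged
upper_fence <- q3 + 1.5 * iqr   # values above this would be flagged

print(round(c(IQR = iqr, lower = lower_fence, upper = upper_fence), 3))

# TRUE when neither extreme falls outside the fences
no_outliers <- min_dur >= lower_fence && max_dur <= upper_fence
print(no_outliers)
```

Under these assumptions the shortest song sits just inside the lower fence, which agrees with the boxplot showing no outliers.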

The QQ-plot shows that the distribution of song durations is approximately normal. The data points align closely with the red line, except for one point near -2 that deviates slightly from the others. There is no significant deviation from the theoretical line, which indicates that song durations are approximately normally distributed.

The empirical rule states that, for a normal distribution, about 68% of the data lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three. For the song durations, one standard deviation from the mean spans 3.05 to 4.83 minutes, and two standard deviations span 2.17 to 5.71 minutes, which contains about 95% of the data. Three standard deviations span 1.28 to 6.60 minutes, which contains 100% of the data. We conclude that the observed proportions are close to the expected values, so the empirical rule holds well for these data, with all data points lying within three standard deviations of the mean.
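
The interval endpoints follow directly from the mean and standard deviation in the summary table. This is a minimal sketch that recomputes them and lists the coverage a normal model would predict; the names mean_dur, sd_dur, and intervals are illustrative and do not come from the original analysis.

```r
# Mean and standard deviation as reported by favstats(duration_min), in minutes
mean_dur <- 3.938636
sd_dur <- 0.8864856

k <- 1:3  # number of standard deviations from the mean
intervals <- data.frame(
  k = k,
  lower = mean_dur - k * sd_dur,
  upper = mean_dur + k * sd_dur,
  expected = pnorm(k) - pnorm(-k)  # coverage predicted by a normal model
)
print(round(intervals, 2))

# With the raw data loaded, the observed share within k SDs would be, e.g.:
# mean(abs(duration_min - mean_dur) <= 2 * sd_dur)
```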

The middle 95% of songs have durations between 2.201156 minutes and 5.676116 minutes.
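
These two bounds appear to be consistent with the 2.5th and 97.5th percentiles of a normal model fitted to the sample mean and standard deviation. The code that produced them is not shown, so treat the following as an assumed reconstruction rather than the original computation.

```r
# Mean and standard deviation as reported by favstats(duration_min), in minutes
mean_dur <- 3.938636
sd_dur <- 0.8864856

# Middle 95% under a normal model: the 2.5th and 97.5th percentiles
middle_95 <- qnorm(c(0.025, 0.975), mean = mean_dur, sd = sd_dur)
print(round(middle_95, 4))
```

An alternative that makes no normality assumption would be quantile(duration_min, c(0.025, 0.975)) on the raw data, though with only 22 songs those percentiles are estimated roughly.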

We estimated the sampling distribution of X̄ (the sample mean). The estimated mean of the sampling distribution is 2.942116 minutes, and its estimated standard deviation (the standard error of the sample mean) is 0.1826416 minutes.
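
As a rough cross-check on the spread of the sampling distribution, the usual formula for the standard error of a sample mean, s / sqrt(n), can be applied to the values in the summary table. This is a sketch for comparison only; se_theory is an illustrative name.

```r
# Sample standard deviation and sample size as reported by favstats(duration_min)
sd_dur <- 0.8864856
n <- 22

# Formula-based standard error of the sample mean
se_theory <- sd_dur / sqrt(n)
print(round(se_theory, 4))
```

The result is close to the resampling-based value of 0.1826416 minutes, which is what one would expect when both are estimating the same quantity.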

A 95% confidence interval for the population mean song duration (in minutes) is given by the middle 95% of the bootstrap distribution: 3.58 to 4.29. Based on resampling from our data, we are 95% confident that the true population mean song duration lies between 3.58 and 4.29 minutes.
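
A percentile bootstrap interval of this kind can be sketched in base R as follows. The durations vector is a hypothetical placeholder, not the class data, so the output will not reproduce 3.58 to 4.29; the seed and the 5000 resamples are also arbitrary choices. The original analysis may well have used mosaic's resampling helpers instead.

```r
set.seed(107)  # arbitrary seed so the sketch is reproducible

# Hypothetical placeholder values; replace with duration_min from Songs.csv
durations <- c(2.5, 3.1, 3.4, 3.6, 3.9, 4.0, 4.2, 4.6, 5.0, 5.4)

# Resample with replacement and record the mean of each resample
boot_means <- replicate(5000, mean(sample(durations, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap means
ci <- quantile(boot_means, probs = c(0.025, 0.975))
print(ci)

# Spread of the bootstrap means, an estimate of the standard error
print(sd(boot_means))
```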

The histogram shows that the estimated sampling distribution of sample means for song duration is approximately normal. The distribution is centered on the mean, marked by the red line, which falls between 3.9 and 4.0 minutes. Most of the sample means cluster around this value, and the green dashed lines mark the bounds of the confidence interval.

Enjoy making sense of song durations data :)

Load the Libraries

library(tidyverse)
library(mosaic)
library(knitr)
library(RColorBrewer)
library(wordcloud)
library(wordcloud2)
library(tokenizers)
library(tm)

Load Songs Data Set

songs_data <- read.csv("Songs.csv")

# Convert duration from seconds to minutes
songs_data$duration_min <- songs_data$duration / 60

# Make the columns (e.g. duration_min) available by name
attach(songs_data)

Summary Statistics for Song Durations (in minutes)

kable(favstats(duration_min))

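|      min|       Q1|   median|    Q3|      max|     mean|        sd|  n| missing|
|--------:|--------:|--------:|-----:|--------:|--------:|---------:|--:|-------:|
| 1.683333| 3.416667| 3.908333| 4.575| 5.433333| 3.938636| 0.8864856| 22|       0|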



