STA442H1: Methods of Applied Statistics I
Homework 3
Due March 8, 2024 4:59pm ET
Please submit your assignment as a PDF through Crowdmark. Ideally, make an R Markdown document and output a PDF that shows your work. Please include your code (with comments!) with your assignment. Please list any references you used with a full citation. A friendly reminder that AI should not be used to help with this assignment. Feel free to discuss the homework questions with others, but your work and write-up should be your own. Ask questions about the homework on Piazza and/or in the TA and instructor office hours.
Download the (astronomy!) data file M sigma.csv, which is posted on Quercus in the homework section (this example data set was compiled by Hible et al 2017). Each row in the data set is an observation of a galaxy, and each galaxy is measured only once. The columns are as follows:
logsigma → log σ/σ0, where σ is the central velocity dispersion of stars in the central region of the galaxy and σ0 = 200 km/s
errlogsigma → measurement error for above
logMbh → log M•/M⊙, where M• is the mass of the galaxy’s central supermassive black hole and M⊙ is one solar mass
errlogMbh → measurement error for above
Type → categorical variable for the type of galaxy
1. For the following questions, define the covariate as x = log σ/σ0 and the response as y = log M•/M⊙. Ignore the measurement uncertainties in x and ignore the type of galaxy.
(a) Fit a linear regression of y on x using lm. Make a plot that shows the fit and the 90% confidence region, and show the output from the summary statistics. You may use the errors in y, but this is not required. (3 marks)
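For reference, one common form of the pointwise confidence band for the fitted mean line is sketched below. It assumes an unweighted least-squares fit to n galaxies; the symbols \hat{\sigma}, \bar{x} and S_{xx} are introduced here for illustration and do not appear elsewhere in this handout.

```latex
% Pointwise 90% confidence band for the mean response at covariate value x,
% assuming an unweighted simple linear regression on n observations.
\[
\hat{y}(x) = \hat{\beta}_0 + \hat{\beta}_1 x ,
\qquad
\hat{y}(x) \;\pm\; t_{0.95,\,n-2}\;\hat{\sigma}
\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}} ,
\]
\[
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 ,
\qquad
\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n}
\bigl(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\bigr)^2 .
\]
```

If you choose to weight the fit by the errors in y, the band takes a different (weighted) form, so treat the expression above only as the unweighted case.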
(b) Set up a linear regression using a Bayesian approach that will estimate the posterior distribution for the intercept and slope parameters (β0, β1). Modify the Metropolis code we wrote in class on Feb. 14 to accept a vector of parameters. Write a target function for the linear regression. Define a Gaussian likelihood assuming that the standard deviations of the errors in y are known. Define the prior distribution and justify your choice. If you would like to define an informative prior, then see the seminal papers on the M•–σ relation, e.g., Ferrarese & Merritt 2000 and Gebhardt et al. 2000, among others. (5 marks)
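As a notational aid, the model described in this part can be written out as follows. This is a sketch under stated assumptions: s_i denotes the known standard deviation of the error in y_i (taken here to be the errlogMbh column), \pi(\beta_0, \beta_1) stands for whatever prior you choose, and the proposal is assumed to be symmetric, as in a standard Metropolis sampler.

```latex
% Gaussian likelihood with known, observation-specific standard deviations s_i
\[
y_i \mid \beta_0, \beta_1 \;\sim\; \mathcal{N}\!\bigl(\beta_0 + \beta_1 x_i,\; s_i^2\bigr),
\qquad i = 1, \dots, n .
\]
% Log of the (unnormalized) target: log-prior plus log-likelihood
\[
\log p(\beta_0, \beta_1 \mid y)
= \log \pi(\beta_0, \beta_1)
- \frac{1}{2} \sum_{i=1}^{n} \frac{\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2}{s_i^2}
+ \text{const} .
\]
% Metropolis acceptance probability for a symmetric proposal,
% with current state \theta = (\beta_0, \beta_1) and proposed state \theta^*
\[
\alpha = \min\!\left\{ 1,\; \frac{p(\theta^{*} \mid y)}{p(\theta \mid y)} \right\} .
\]
```

Working with the log of the target, and comparing log α to the log of a uniform draw, avoids numerical underflow when n is large.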
(c) Run the Bayesian analysis using your code from the previous question. Ensure that your effective sample size is at least n_eff = 500. Report the summary statistics for (β0, β1), and compare to the fit using lm. Discuss whether you think the chains have converged to the target distribution using diagnostics and traceplots. (6 marks)
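For reference, a standard definition of the effective sample size for a single chain is given below. Here N (the number of retained draws) and \rho_k (the lag-k autocorrelation of the chain for a given parameter) are symbols introduced for illustration. Software packages estimate this quantity with a truncated sum and may differ in the details.

```latex
% Effective sample size of N correlated draws with lag-k autocorrelation \rho_k
\[
n_{\mathrm{eff}} = \frac{N}{1 + 2 \sum_{k=1}^{\infty} \rho_k} .
\]
```

The value is computed separately for β0 and β1, so both should meet the threshold.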
(d) Plot the inferred linear relationship from your Bayesian analysis using the mean of the posterior samples. Add to this figure 100 lines (use a semi-transparent colour) from the posterior samples, to give some indication of the uncertainty in the fit. (2 marks)
(e) Plot again the mean linear relationship along with the inferred 90% credible region. The credible region for a function (e.g., a line) is often defined pointwise: for each x, it is the credible interval for the predicted y at that x. (3 marks)
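One way to make the pointwise definition concrete is sketched below. It assumes T posterior draws indexed by t and an equal-tailed interval; both the notation \mu^{(t)}(x) and the equal-tailed choice are assumptions made for illustration, and other 90% intervals (e.g., highest posterior density) are also possible.

```latex
% Value of the line at x under the t-th posterior draw
\[
\mu^{(t)}(x) = \beta_0^{(t)} + \beta_1^{(t)} x , \qquad t = 1, \dots, T .
\]
% Equal-tailed 90% pointwise credible interval at x:
% the 5th and 95th percentiles of the draws \mu^{(1)}(x), ..., \mu^{(T)}(x)
\[
\Bigl[\, q_{0.05}\bigl\{\mu^{(t)}(x)\bigr\}_{t=1}^{T} ,\;
        q_{0.95}\bigl\{\mu^{(t)}(x)\bigr\}_{t=1}^{T} \,\Bigr] .
\]
```

Evaluating this interval on a grid of x values and joining the endpoints gives the shaded region around the mean line.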