Project 1
Use the material in the zip file project01.zip in Learn to write your report. Open the project01.Rproj project in RStudio and add all your function definitions to the code.R file. Write your report using report.Rmd. You must upload the following three files as part of this assignment: code.R, report.html, and report.Rmd. All functions should be in the code.R file. Document your functions using the same style as the predefined functions in the code.R file.
The main text in your report should be a coherent presentation of theory and a discussion of methods and results, showing code for code chunks that perform computations and analysis but not code for code chunks that generate figures or tables. Use the echo=TRUE and echo=FALSE to control what code is visible in your report.
1 Confidence interval approximation
Suppose we observe a dataset Y = {Y1,...,Yn} that follows an Exponential distribution:
Yi ≥ Exp(⁄), i = 1, . . . , n.
The Exponential probability density function (pdf) is given by:
p(y|⁄) = ⁄e≠⁄y, y> 0.
The maximum likelihood estimator (MLE) for ⁄ is
1. Create a function CI to compute confidence intervals for ⁄. The function should take y (observed data) and – (confidence level) as input and return the upper and lower confidence intervals.
2. Create a function called estimate_coverage to estimate the empirical coverage of the confidence interval using simulations. Your function should take arguments, CI, N (the number of simulation replications to use for the coverage estimate), alpha (1-alpha is the intended coverage probability), n (the sample size), and lambda (the true lambda values).
3. Use your function estimate_coverage to estimate the coverage of the construction of confidence intervals. To investigate how sensitive the results are to the precise values of ⁄ and n, run the coverage estimation for different combinations of model parameters ⁄ and n (fix N = 10000 and – = 0.1).
4. Present your results of estimated coverage in two plots side by side: as a function of ⁄ for fixed n = 20 and as a function of n for fixed ⁄ = 2. Discuss the plots in regard to whether the coverage of the intervals achieves the desired 90% confidence level, and if not, identify under which cases and provide a suggestion as to why. What alternative methods could be used to construct more accurate confidence intervals?
2 Bayesian inference for air quality prediction
The airquality dataset available in base R comprises records of daily air quality measurements in New York. For each month and day, the dataset contains:
• Ozone: Mean ozone concentration (ppb)
• Temp: Maximum daily temperature (°F)
• Wind: Wind speed (mph)
• Solar.R: Solar radiation (lang)
Your goal is to model ozone concentration (Ozone) as a function of temperature (Temp) while accounting for heteroscedasticity in the variance using the model
yi ≥ Normal(—1 + —2xi, —3 + —4x2
i )
We use the following parametrization
◊ = (—1, —2, log(—3), log(—4)).
1. Start by loading the airquality dataset. Then, create a new dataset, omitting the days with missing values. Use ‘ggplot’ to create scatter plots of Ozone vs. Temp and examine the variance pattern for different months. Comment on the choice of the model above and the use of the parametrization.
2. With the help of dnorm and the dlogexp function (see the code.R file for documentation), define a function log_prior_density with arguments theta and params, where theta is the ◊ parameter vector, and params is the vector of “ parameters (see below). Your function should evaluate the logarithm of the joint prior density p(◊) for the four ◊i parameters. Use independent prior distributions as follows:
◊1 ≥ Normal(0, “1),
◊2 ≥ Normal(1, “2),
◊3 ≥ LogExp(“3),
◊4 ≥ LogExp(“4),
3. Define a function log_like, taking arguments theta, x, and y, that evaluates the observation log likelihood p(y|◊) for the model defined above.
4. Define a function log_posterior_density with arguments theta, x, y, and params, which evaluates the logarithm of the posterior density p(◊|y), apart from some unevaluated normalisation constant.
5. Define a function posterior_mode with arguments theta_start, x, y, and params, that uses optim, log_posterior_density, and the airquality dataset to find the mode µ of the log-posterior-density and evaluates the Hessian at the mode as well as the inverse of the negated Hessian, S.
6. Let all “i = 1, i = 1, 2, 3, 4, and use posterior_mode to evaluate the inverse of the negated Hessian at the mode in order to obtain a multivariate Normal approximation Normal(µ,S) to the posterior distribution for ◊. Use start values ◊ = 0.
7. Define a function do_importance that performs importance sampling taking arguments N (the number of samples to generate), mu (the mean vector for the importance distribution), and S (the covariance matrix), and other additional parameters that are needed by the function code. The function should output a data.frame with five columns, beta1, beta2, beta3, beta4, and log_weights, containing the —i samples and normalised log-importance-weights, so that sum(exp(log_weights)) is 1. Use the log_sum_exp function (see the code.R file for documentation) to compute the needed normalisation information.
8. Use your defined functions to compute an importance sample of size N = 10000. Compute and plot the empirical weighted CDFs by first sorting the sampled parameter values, then computing cumulative sums of their normalized importance weights. Use ggplot2 to visualize the weighted and unweighted empirical distributions for each — value and comment on the results.
9. Comment on how the variance assumption impacts the inference. What assumptions in the current model might be unrealistic? How could they be relaxed or extended?