STAT3405/STAT4066 Department of Mathematics and Statistics

Important: This assignment is assessed. Your work for this assignment must be submitted by 9:00pm on Sunday, 3 November 2024.

The expectations for a submission are:

  • The questions are answered in complete sentences. Marks will be awarded for the correctness of the answers and for giving them in complete sentences.
  • The answers to the questions are submitted via LMS.
  • Numerical answers are rounded to an appropriate number of digits.
  • Code used to answer the questions is submitted as an attached R notebook (with extension .Rmd) in your LMS submission.
Marks will be awarded for
  • the correctness of the answers and that they are given in complete sentences;
  • the correctness of the code, i.e. how easy it is to read it; and
  • how easy it is to run your code, i.e. to turn your R notebook into a PDF file.
You may receive comments on the efficiency of your code, but there are no marks for efficiency. Unless special consideration is granted, any student failing to submit work by the deadline will receive a penalty for late submission (as described in the unit outline).

AI: Be reminded that the use of AI is not permitted for this assessment.

Plagiarism: You are encouraged to discuss assignments with other students and to solve problems together. However, the work that you submit must be your sole effort (i.e. not copied from anyone else). If you are found guilty of plagiarism you may be penalised. You are reminded of the University’s policy on ‘Academic Conduct’ and ‘Academic Misconduct’ (including plagiarism): 

http://www.student.uwa.edu.au/learning/resources/ace/conduct

Various material at the following URL might be helpful too:
https://www.uwa.edu.au/students/study-success/studysmarter

Task 1. Here we revisit Task 1 from the second assignment.

Recall, the file Golf.csv, available from LMS, contains the number of attempts (m) and successes (y) of golf putts, by distance from the hole in feet (distance), for a sample of professional golfers.

After downloading this file to the directory in which your R notebook is, you should be able to read the file using the following command:

FOO <- read.csv("Golf.csv")

For this exercise we will model the observed y_i as realisations of independent binomially distributed random variables Y_i, i = 1, . . . , 19, where the success probability depends on the distance from the hole. We will denote this distance by x_i below, but it is the variable distance in the data file.

In this exercise we will consider the following model for these data:

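\[
\begin{aligned}
Y_i \mid m_i, p_i &\sim \mathrm{Bin}(m_i, p_i), \quad i = 1, \dots, n \\
\operatorname{logit}(p_i) &= \beta_0 + \beta_1 \log(x_i) \\
\beta_0 &\sim \text{some suitable prior} \\
\beta_1 &\sim \text{some suitable prior}
\end{aligned}
\]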

Here β0 and β1 are two regression parameters and we will refer to them jointly as β.

(a) What is the interpretation of β1 in this model?

Hint: Consider how the odds change when the distance to the hole doubles.
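
The hint can be made concrete with a short derivation from the model equation. The notation odds(x) for p/(1 − p) at distance x is introduced here only for illustration and does not appear elsewhere in this assignment.

```latex
\[
\mathrm{odds}(x) = \frac{p}{1 - p}
  = \exp\{\beta_0 + \beta_1 \log x\}
  = e^{\beta_0}\, x^{\beta_1},
\qquad
\frac{\mathrm{odds}(2x)}{\mathrm{odds}(x)}
  = \frac{e^{\beta_0} (2x)^{\beta_1}}{e^{\beta_0} x^{\beta_1}}
  = 2^{\beta_1}.
\]
```

The ratio does not depend on x, so under this model the multiplicative change in the odds is the same for every doubling of the distance.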

(b) Implement the above model in your preferred probabilistic programming language. In the answer that you write into the submission window you should clearly state the priors that you put on β0 and β1. The code must be contained within the R notebook that you submit.

(c) What are your Bayesian estimates for β0 and β1?
(d) Produce a plot of the observed proportion of successful putts and superimpose the regression line that you have fitted. The plot should include distances from the hole between 0 and 22 feet. Submit this plot as part of the task. (That is, the plot should be included in the submission window and the code in your R notebook should produce the plot that you submit.)

Comment in a sentence or two whether you think the model is adequate.
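
A plot of this kind could be sketched in R roughly as follows. The data frame golf and the values b0 and b1 are made-up placeholders so that the snippet runs on its own; they are not the contents of Golf.csv and not fitted estimates. The helper name fitted_prob is likewise hypothetical.

```r
# Made-up stand-in for Golf.csv (NOT the real data).
golf <- data.frame(
  distance = c(2, 5, 10, 15, 20),
  m        = c(100, 100, 100, 100, 100),
  y        = c(90, 60, 30, 15, 10)
)

# Placeholder values standing in for estimates of beta0 and beta1.
b0 <- 3
b1 <- -2

# Success probability implied by logit(p) = b0 + b1 * log(x).
fitted_prob <- function(x, b0, b1) plogis(b0 + b1 * log(x))

# The grid starts just above 0 because log(0) is -Inf.
grid <- seq(0.05, 22, length.out = 200)

# Observed proportions as points, fitted curve as a line.
plot(golf$distance, golf$y / golf$m,
     xlim = c(0, 22), ylim = c(0, 1),
     xlab = "Distance from the hole (feet)",
     ylab = "Proportion of successful putts")
lines(grid, fitted_prob(grid, b0, b1))
```

With real output, b0 and b1 would be replaced by summaries of the posterior draws, for example posterior means.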

(e) Use the test quantity


to perform posterior predictive checks to assess the fit of the model. That is, expand your code such that for each sample of β0 and β1
• a replicate data set y^rep from the posterior predictive distribution is drawn,
• T(y, β) is evaluated; and
• T(y^rep, β) is evaluated.
These calculations should be performed in Stan or BUGS and not by post-processing in R.

The relevant code must be contained within the R notebook that you submit.

In the submission window state your estimate for the Bayesian p-value P[T(y^rep, β) ≥ T(y, β) | y].

Based on this posterior predictive check, do you think the model is suitable for these data? Discuss in a sentence or two.

Task 2. The complete data set on the survey that was done on bicycle and other vehicular traffic in the neighbourhood of the campus of the University of California, Berkeley, is available on LMS in the file bicycles.csv.

Remember, these data are counts of bicycles and other vehicles in one hour in each of 10 city blocks in each of six categories. That is, sixty city blocks were selected at random; each block was observed for one hour, and the numbers of bicycles and other vehicles travelling along that block were recorded. The sampling was stratified into six types of city blocks: busy, fairly busy and residential streets (streets were classified before the data were gathered), and with and without bike routes. The data for two of the residential blocks were lost.

After downloading this file to the directory in which your R notebook is, you should be able to read the file using the following command:

BAR <- read.csv("bicycles.csv")

The data frame BAR should now contain the following variables:

Type       the type of the street, a factor with the levels “Busy”, “FairlyBusy” and “Residential”.
BikeRoute  does the street have a bike route? A factor with the levels “yes” and “no”.
Bicycles   the number of bicycles observed.
Other      the number of other vehicles observed.

(a) Define the following indicator variables:

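\[
x_{i1} = \begin{cases} 1 & \text{if observation } i \text{ was taken on a street with a bike route} \\ 0 & \text{otherwise} \end{cases}
\]
\[
x_{i2} = \begin{cases} 1 & \text{if observation } i \text{ was taken on a fairly busy street} \\ 0 & \text{otherwise} \end{cases}
\]
\[
x_{i3} = \begin{cases} 1 & \text{if observation } i \text{ was taken on a busy street} \\ 0 & \text{otherwise} \end{cases}
\]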

Write down R commands that calculate the vectors of observed x1, x2 and x3. Also write down a command that determines m_i, the total number of observed vehicles in each street.
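
One possible way to write such commands is sketched below. The four-row data frame is a made-up stand-in for bicycles.csv so that the snippet runs on its own; only the column names and factor levels are taken from the variable description above.

```r
# Made-up stand-in for bicycles.csv (NOT the real data).
BAR <- data.frame(
  Type      = c("Busy", "FairlyBusy", "Residential", "Busy"),
  BikeRoute = c("yes", "no", "yes", "no"),
  Bicycles  = c(10, 5, 8, 3),
  Other     = c(90, 45, 12, 97)
)

# Indicator variables: the comparison gives TRUE/FALSE, as.integer() maps to 1/0.
# This works whether the columns are stored as character or as factor.
x1 <- as.integer(BAR$BikeRoute == "yes")    # street has a bike route
x2 <- as.integer(BAR$Type == "FairlyBusy")  # fairly busy street
x3 <- as.integer(BAR$Type == "Busy")        # busy street

# Total number of vehicles observed in each street.
m <- BAR$Bicycles + BAR$Other
```

Residential streets without a bike route are the baseline here, since all three indicators are 0 for them.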

(b) Let Yi denote the number of bicycles observed and consider the following hierarchical generalised linear model:


(1) Implement the above model in your preferred probabilistic programming language. In the answer that you write into the submission window you should clearly state the prior that you put on σα. The code must be contained within the R notebook that you submit.

(2) What are your Bayesian estimates for β0, β1, β2, β3, β4, β5 and σα?

(3) Looking at the signs of the estimates β1, β2, β3, β4 and β5, are they what you would have expected? Do these estimates make sense? Comment briefly.

(4) Based on this model, what is your Bayesian estimate for the odds-ratio of a vehicle in a busy street with a bike route being a bicycle compared to a busy street without a bike route?
(5) Assume you select a new busy street without a bike route, which is similar to those in your sample, and observe 200 vehicles. Include a plot of the posterior predictive distribution of the number of bicycles in these 200 vehicles in the submission window. Based on this plot, what number of bicycles are you most likely to observe?

Task 3. Here we revisit Task 1 from the second set of computer lab problems and Task 2 from the fourth set of computer lab problems.

Recall, the file pregnancies.csv, available from LMS, contains the information on women who got pregnant under planned pregnancies. The women were classified as smokers and nonsmokers and the cycle in which each woman fell pregnant was recorded. The data file contains the tabulated data.

After downloading this file to the directory in which your R notebook is, you should be able to read the file using the following command:

BAZ <- read.csv("pregnancies.csv")

The aim of this task is to explore whether a beta-geometric model is appropriate for these data. To do so, and to handle the issue of censoring more easily, we will treat the data as following a multinomial distribution Mult_13(π, N), where the vector of probabilities π is determined by a geometric model.

Then we will model y^S and y^NS as realisations of random vectors Y^S and Y^NS that follow multinomial distributions and are independent of each other. The full model specification is:


where p(x; α, β) and F̄(x; α, β) denote, respectively, the probability mass function and the complementary cumulative distribution function of the beta-geometric distribution with parameters α and β. In other words, if X ∼ BetaGeom(α, β) then p(x; α, β) = P[X = x] and F̄(x; α, β) = P[X > x] = 1 − P[X ≤ x].
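
These two functions can be checked numerically with a short R sketch. It assumes the common parameterisation in which the per-cycle success probability follows a Beta(α, β) distribution and X counts cycles starting at 1, so that P[X = x] = B(α + 1, β + x − 1)/B(α, β) and P[X > x] = B(α, β + x)/B(α, β). The text above does not state the parameterisation, so this is an assumption, and the function names dbetageom and sbetageom are hypothetical. The layout of the 13 cells (cycles 1 to 12 plus one cell for more than 12) is also assumed.

```r
# Beta-geometric pmf, P[X = x] for x = 1, 2, ... (assumed parameterisation).
# Computed on the log scale with lbeta() for numerical stability.
dbetageom <- function(x, alpha, beta) {
  exp(lbeta(alpha + 1, beta + x - 1) - lbeta(alpha, beta))
}

# Complementary cdf, P[X > x].
sbetageom <- function(x, alpha, beta) {
  exp(lbeta(alpha, beta + x) - lbeta(alpha, beta))
}

# Arbitrary illustrative parameter values (not estimates).
alpha <- 2
beta  <- 3

# Assumed cell layout: cycles 1..12, then a final cell for "more than 12".
prob <- c(dbetageom(1:12, alpha, beta), sbetageom(12, alpha, beta))

# The 13 cell probabilities should sum to 1 up to rounding error.
all.equal(sum(prob), 1)
```

Working with lbeta() rather than beta() avoids underflow when α or β is large.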

As each component of a multinomial random vector has marginally a binomial distribution, we might consider using a Pearson’s χ²-style statistic to test whether our model is adequate. Specifically, consider the test quantity:



