CS275P: Graphical Models & Statistical Learning Homework 7

Homework 7: Collaborative Filtering & Variational Autoencoders

Question 1: (40 points)

The MovieLens dataset (http://movielens.org) contains ratings for M movies, recorded as integers between 1 and 5, for a community of N users. Most users have rated only a few of the available movies, so the data is stored as a sparse M × N matrix X, where M is the number of movies and N is the number of users. For this assignment we have extracted a small subset of the overall database, containing M = 500 movie titles and N = 943 users.

The training and test data represent the same users, but each matrix has a different set of observed, non-zero user-movie rating pairs. You will learn factor analysis and PCA models from the training ratings, and use them to predict test ratings.

Let $x_{ij}$ be the rating that user i gives to movie j. Not all user-movie pairs are observed in the training data: let $r_{ij} = 1$ if $x_{ij}$ is observed, and $r_{ij} = 0$ otherwise. The set of observed ratings for user i is then $x_i^o = \{x_{ij} \mid r_{ij} = 1\}$. Factor analysis explains the ratings $x_i$ for each user i via a K-dimensional latent vector $z_i \in \mathbb{R}^K$:

$$p(z_i) = \mathrm{Norm}(z_i \mid 0, I_K), \qquad p(x_i \mid z_i) = \mathrm{Norm}(x_i \mid W z_i + m, V).$$

Here, W is an M × K factor-loading matrix, m is an M × 1 mean vector, and V is an M × M diagonal covariance matrix. Given a matrix of partially observed ratings $x^o$, the EM algorithm finds maximum likelihood estimates of the parameters W, m, V, and the corresponding posterior marginals for the hidden variables z. We have provided an implementation of the EM algorithm for learning factor analysis models from sparse observations like these.
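For intuition about what the E-step computes, here is a minimal sketch of the posterior over one user's latent vector given only that user's observed ratings. It uses the standard Gaussian conditioning result, with covariance $(I_K + W_o^\top V_o^{-1} W_o)^{-1}$, where the subscript o selects the rows for observed movies. The function name `posterior_z` and its argument layout are assumptions made for illustration. This is not the provided EM code.

```python
import numpy as np

def posterior_z(x, r, W, m, v):
    """Posterior mean and covariance of z given the observed entries of x.

    x: (M,) ratings for one user, r: (M,) 0/1 observation mask,
    W: (M, K) loadings, m: (M,) mean, v: (M,) diagonal of V.
    All names here are hypothetical, chosen for this sketch.
    """
    obs = r.astype(bool)
    W_o = W[obs]                     # rows of W for the observed movies
    scaled = W_o / v[obs, None]      # V_o^{-1} W_o, exploiting the diagonal V
    K = W.shape[1]
    cov = np.linalg.inv(np.eye(K) + W_o.T @ scaled)
    mean = cov @ (scaled.T @ (x[obs] - m[obs]))
    return mean, cov
```

With no observed ratings the posterior reduces to the prior Norm(0, I_K), which is a quick sanity check on any implementation.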

In this question, we evaluate the empirical performance of several methods for predicting movie ratings. We compute the test root mean square error (RMSE) using only the ratings that were not observed in the training set:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N_h} \sum_{i=1}^{N} \sum_{j \in H_i} \left(x_{ij} - \hat{x}_{ij}\right)^2}$$

Here, $H_i$ is the set of test movies for user i, $N_h$ is the total number of ratings in the test dataset, and $\hat{x}_{ij}$ is the rating predicted by the model under evaluation.
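The test RMSE restricted to held-out entries could be computed along the following lines. The function name `masked_rmse` and the use of dense arrays with a 0/1 test mask are assumptions for this sketch; the course's demonstration code may store the data differently.

```python
import numpy as np

def masked_rmse(X_true, X_pred, mask):
    """RMSE over the entries where mask is nonzero (the held-out test ratings).

    X_true, X_pred, mask: arrays of the same shape. Hypothetical helper.
    """
    mask = mask.astype(bool)
    err = X_true[mask] - X_pred[mask]      # only test entries contribute
    return float(np.sqrt(np.mean(err ** 2)))
```

Entries outside the mask are ignored entirely, so placeholder zeros for unobserved ratings do not affect the score.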

a) As a simple baseline, for each movie in the corpus, compute the average of the observed training ratings $x^o$ across all users. Then for each test item, simply predict the mean rating of the corresponding movie. Calculate the test RMSE for this method.

b) Next, we consider a simple heuristic method for dimensionality reduction with sparse data. First, fill in the missing entries of the training movie rating matrix using the mean predictions from part (a). Apply principal component analysis (PCA) to this matrix using the scikit-learn Python implementation (see demonstration code for details). Find low-dimensional representations $z_i$ for each user by using the top K principal components, for each K ∈ {1, 2, ..., 15}, and use these to reconstruct the missing ratings $x^h$. Plot RMSE versus K.

Is this reconstruction better than the baseline?

c) For the heuristic dimensionality reduction method of part (b), what should the performance approach as K → M, the number of movies?

d) Run the provided EM factor analysis code to estimate low-dimensional representations $z_i$, the factor matrix W, mean vector m, and variances V. For each K ∈ {1, 2, ..., 15}, run EM for 100 training iterations. After training, use these estimated quantities to reconstruct the missing ratings. Plot RMSE versus K on the same axes as the PCA plot from part (b). How do the two methods (Factor Analysis and PCA) compare? What choice of K leads to the best performance?
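The mean-fill baseline of part (a) and the heuristic PCA reconstruction of part (b) might be sketched as follows. This sketch makes several assumptions: ratings are held in dense M × N arrays (movies in rows, users in columns) with a 0/1 mask R, movies with no training ratings fall back to the global mean, and PCA is done with a plain NumPy SVD as a stand-in for the scikit-learn implementation the assignment asks for. All function names are hypothetical.

```python
import numpy as np

def movie_means(X, R):
    """Mean observed training rating per movie (rows = movies)."""
    counts = R.sum(axis=1)
    sums = (X * R).sum(axis=1)
    global_mean = sums.sum() / counts.sum()
    # Assumption: movies with no observed ratings use the global mean.
    return np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)

def mean_fill(X, R):
    """Replace unobserved entries with the corresponding movie mean."""
    return np.where(R > 0, X, movie_means(X, R)[:, None])

def pca_reconstruct(X_filled, K):
    """Rank-K PCA reconstruction, treating each user (column) as one sample."""
    A = X_filled.T                          # (N, M): one row per user
    mu = A.mean(axis=0)
    _, _, Vt = np.linalg.svd(A - mu, full_matrices=False)
    Z = (A - mu) @ Vt[:K].T                 # (N, K) low-dimensional codes
    return (Z @ Vt[:K] + mu).T              # back to (M, N)
```

A typical use would be `X_hat = pca_reconstruct(mean_fill(X_train, R_train), K)` for each K, scoring `X_hat` only on the held-out test entries.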

Question 2: (60 points)

In this question, we will learn generative models that represent data $x_i$ of dimension M by low-dimensional vectors $z_i \in \mathbb{R}^K$, where K ≪ M. In our experiments, the data are images of handwritten digits with M pixels, from the MNIST dataset. We will learn variational autoencoders (VAEs) that model images $x_i$ as a non-linear function of their embedding $z_i$:

$$p(z_i) = \mathrm{Norm}(z_i \mid 0, I_K), \qquad p(x_i \mid z_i) = \prod_{j=1}^{M} \mathrm{Bernoulli}(x_{ij} \mid \mu_j(z_i; \theta)).$$

Here $x_{ij}$ is the value of pixel j in image i, whose mean $\mu_j$ is determined by a deep neural network with parameters θ. As a baseline, we will also consider probabilistic PCA (PPCA) models that assume the data mean is a linear function of $z_i$. PPCA is a special case of the factor analysis model from Question 1 where $V = \sigma^2 I_M$.
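To make the PPCA baseline concrete, here is a minimal sketch of the closed-form maximum likelihood fit (the Tipping and Bishop solution) together with sampling of image means from the fitted model. It assumes the data is a dense N × M array with K < M. The names `ppca_ml` and `ppca_sample_means` are hypothetical, and the provided ppca.py may be organized differently.

```python
import numpy as np

def ppca_ml(X, K):
    """Closed-form ML estimates (W, m, sigma2) for PPCA. Rows of X are data points."""
    N, M = X.shape
    m = X.mean(axis=0)
    S = (X - m).T @ (X - m) / N               # (M, M) sample covariance
    evals, evecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]
    sigma2 = evals[K:].mean()                 # average variance of discarded directions
    W = evecs[:, :K] * np.sqrt(np.maximum(evals[:K] - sigma2, 0.0))
    return W, m, sigma2

def ppca_sample_means(W, m, n, rng):
    """Draw z ~ Norm(0, I_K) and return the decoded means W z + m, one per row."""
    Z = rng.standard_normal((n, W.shape[1]))
    return Z @ W.T + m
```

For example, `ppca_sample_means(W, m, 25, np.random.default_rng(0))` would give 25 decoded means that can be reshaped into images for plotting.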

a) As a simple baseline, train two PPCA models with latent space dimensions K = 2, 50. Use the maximum likelihood PPCA training algorithm implemented in ppca.py.

b) For each of the two PPCA models, plot the means of the reconstructions of 7 images from the MNIST dataset.

c) Generate and plot 25 samples from each of the two PPCA models. You can generate each sample by first drawing a "code" $z_i$ from the Gaussian prior in the latent space, and then using the decode(...) function to transform $z_i$ into an image $x_i = W z_i + m$.

d) Train two VAE models with latent space dimensions K = 2, 50. For each VAE, run the stochastic gradient training algorithm for 50 epochs using the Adam optimizer, with a learning rate of 0.001 and a batch size of 128. For each VAE model, plot the average training loss per epoch.

e) For each of the two VAE models, plot the means of the reconstructions of 7 images from the MNIST dataset. Briefly discuss how the VAE reconstructions compare to the PPCA reconstructions from part (b).

f) Generate and plot 25 samples from each of the two VAE models. You can generate each sample by first drawing a "code" $z_i$ from the Gaussian prior in the latent space, and then using the decode(...) function to transform $z_i$ into an image $x_i = \mu(z_i; \theta)$. Briefly discuss how the VAE samples compare to the PPCA samples from part (c).

g) For the PPCA and VAE models with K = 2 latent dimensions, use plot_2d_clusterings to create a plot of the latent encodings for each model, color-coded based on the digit label.

Compare how well the PPCA and VAE models cluster digits in the latent space.
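The training loss plotted in part (d) is commonly the negative evidence lower bound (ELBO). The assignment does not specify the encoder, so the sketch below assumes a diagonal Gaussian encoder q(z | x) with mean `z_mean` and log-variance `z_logvar`, and assumes `mu_x` was obtained by decoding one reparameterized sample of z. The function name and argument layout are hypothetical, and NumPy is used only to show the arithmetic. Real training would need an automatic differentiation framework.

```python
import numpy as np

def negative_elbo(x, mu_x, z_mean, z_logvar, eps=1e-7):
    """Batch-averaged negative ELBO for a Bernoulli decoder and Gaussian prior.

    x: (B, M) pixel values in [0, 1]; mu_x: (B, M) decoder Bernoulli means;
    z_mean, z_logvar: (B, K) parameters of the assumed diagonal Gaussian encoder.
    """
    mu_x = np.clip(mu_x, eps, 1 - eps)        # avoid log(0)
    # Bernoulli log-likelihood, summed over the M pixels of each image.
    log_lik = np.sum(x * np.log(mu_x) + (1 - x) * np.log(1 - mu_x), axis=1)
    # KL(q(z | x) || Norm(0, I_K)) in closed form, summed over latent dimensions.
    kl = 0.5 * np.sum(np.exp(z_logvar) + z_mean ** 2 - 1.0 - z_logvar, axis=1)
    return float(np.mean(kl - log_lik))
```

The KL term vanishes when the encoder output equals the prior, so in that case the loss is just the reconstruction cross-entropy.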
