Homework 1: Generative Models & Decision Theory
CS275P: Machine Learning with Generative Models
Question 1: (20 points)
A dentist wants to diagnose whether a patient has a cavity (C = 1) or not (C = 0), where P(C = 1) = 0.2. She asks the patient whether they have a toothache (A = 1), which is more likely if they have a cavity: P(A = 1 | C = 1) = 0.6, but P(A = 1 | C = 0) = 0.1. She also tests whether a dental tool catches the tooth (T = 1), and knows that P(T = 1 | C = 1) = 0.9 but P(T = 1 | C = 0) = 0.2. Assume that A and T are conditionally independent given C.
a) Draw a directed graphical model defining the joint distribution of C, A, and T.
b) What is P(T = 1), the probability of the dental tool catching?
c) What is P(C = 1 | T = 1), the probability of a cavity given that the dental tool catches?
d) What is P(C = 1 | T = 1, A = 0), the probability of a cavity given that the dental tool catches but there is no toothache?
e) Are the random variables A and T independent? Justify your answer mathematically.
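One way to check hand calculations for parts (b) through (e) is to enumerate the eight joint configurations of C, A, and T and sum the relevant entries. The sketch below assumes the factorization P(C) P(A | C) P(T | C) implied by the conditional independence statement; the helper names `bernoulli`, `joint`, `prob`, and `conditional` are illustrative choices and not part of the assignment.

```python
from itertools import product

# Probabilities taken from the problem statement.
P_C1 = 0.2
P_A1_GIVEN_C = {1: 0.6, 0: 0.1}
P_T1_GIVEN_C = {1: 0.9, 0: 0.2}


def bernoulli(p_one, value):
    """Probability that a binary variable with P(1) = p_one equals value."""
    return p_one if value == 1 else 1.0 - p_one


def joint(c, a, t):
    """P(C=c, A=a, T=t) = P(c) P(a | c) P(t | c)."""
    return (bernoulli(P_C1, c)
            * bernoulli(P_A1_GIVEN_C[c], a)
            * bernoulli(P_T1_GIVEN_C[c], t))


def prob(**fixed):
    """Sum the joint over every assignment consistent with the fixed values."""
    total = 0.0
    for c, a, t in product((0, 1), repeat=3):
        assignment = {"c": c, "a": a, "t": t}
        if all(assignment[name] == value for name, value in fixed.items()):
            total += joint(c, a, t)
    return total


def conditional(query, evidence):
    """P(query | evidence), both given as dicts of variable name -> value."""
    return prob(**query, **evidence) / prob(**evidence)


if __name__ == "__main__":
    print("P(T=1)           =", prob(t=1))
    print("P(C=1 | T=1)     =", conditional({"c": 1}, {"t": 1}))
    print("P(C=1 | T=1,A=0) =", conditional({"c": 1}, {"t": 1, "a": 0}))
    # A and T are independent only if these two numbers agree.
    print("P(A=1) P(T=1)    =", prob(a=1) * prob(t=1))
    print("P(A=1, T=1)      =", prob(a=1, t=1))
```

Brute-force enumeration is only practical because there are three binary variables; it serves as a numerical check and does not replace the mathematical justification requested in part (e).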
Question 2: (50 points)
We consider data (produced by a sophisticated simulator) like that which would be collected by a gamma telescope observing high-energy particles. The raw data, “showers” of particles on a planar detector, have been converted into 10 continuous features as outlined here: http://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. Our goal is the binary classification of the “primary” gamma signals of scientific interest from background hadronic shower events. The D = 10 continuous features for each of the N = 15,216 training examples are stored in an N × D matrix train. The class labels are stored in an N × 1 vector trainLabels, where primary gamma signals have label 1 and background events have label 0. Similarly, test data is stored in test and testLabels.
Let xnd ∈ R be the value of feature d for training example n, and tn ∈ {0, 1} the class label for example n. We will build a “naive Bayes” classifier, which when all errors are equally costly predicts observation n to be a gamma signal when p(tn = 1 | xn) > p(tn = 0 | xn), and a background event otherwise. Using Bayes rule, this classifier is equivalent to one that predicts tn = 1 if and only if
$$\frac{p(t_n = 1)\, p(x_n \mid t_n = 1)}{p(x_n)} > \frac{p(t_n = 0)\, p(x_n \mid t_n = 0)}{p(x_n)},$$
or equivalently
$$\ln p(t_n = 1) + \ln p(x_n \mid t_n = 1) > \ln p(t_n = 0) + \ln p(x_n \mid t_n = 0). \tag{1}$$
In this equation, p(tn) is the prior probability of gamma signals and background events.
The conditional probability density function p(xn | tn) describes the distribution of the D = 10 features, which we assume depends on the type of event. We make two simplifying assumptions about these densities: the features xnd are conditionally independent given tn, and their distributions are Gaussian. Thus:
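$$p(x_n \mid t_n = 1) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{1d}^2}} \exp\left\{ -\frac{(x_{nd} - \mu_{1d})^2}{2\sigma_{1d}^2} \right\}, \tag{2}$$
$$p(x_n \mid t_n = 0) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{0d}^2}} \exp\left\{ -\frac{(x_{nd} - \mu_{0d})^2}{2\sigma_{0d}^2} \right\}. \tag{3}$$
Given $t_n = 1$, $x_{nd}$ is Gaussian with mean $\mu_{1d}$ and variance $\sigma_{1d}^2$. Given $t_n = 0$, $x_{nd}$ is Gaussian with mean $\mu_{0d}$ and variance $\sigma_{0d}^2$. There are a total of 2D mean parameters and 2D variance parameters, since every feature $x_{nd}$ has a distinct distribution for each of the two classes.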
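To illustrate why the log form matters numerically, the sketch below evaluates the class-conditional log density as a sum of D one-dimensional Gaussian log densities and compares it with the logarithm of a literal product of densities. The function names and the parameter values for D = 3 are hypothetical and chosen only for demonstration; they are not estimates from the MAGIC data.

```python
import math


def log_class_conditional(x, means, variances):
    """ln p(x | t) for one example under the independent-Gaussian assumption.

    x, means, and variances are length-D sequences. The result is a sum of D
    one-dimensional Gaussian log densities and never calls exp().
    """
    total = 0.0
    for x_d, mu_d, var_d in zip(x, means, variances):
        total += -0.5 * math.log(2.0 * math.pi * var_d)
        total += -0.5 * (x_d - mu_d) ** 2 / var_d
    return total


def class_conditional_direct(x, means, variances):
    """p(x | t) as a literal product of Gaussian densities (for checking only)."""
    density = 1.0
    for x_d, mu_d, var_d in zip(x, means, variances):
        density *= (math.exp(-0.5 * (x_d - mu_d) ** 2 / var_d)
                    / math.sqrt(2.0 * math.pi * var_d))
    return density


if __name__ == "__main__":
    # Hypothetical parameters for D = 3 features.
    x = [0.3, -1.2, 2.0]
    mu = [0.0, -1.0, 1.5]
    var = [1.0, 0.5, 2.0]
    print(log_class_conditional(x, mu, var))
    print(math.log(class_conditional_direct(x, mu, var)))

    # Far from the mean the direct product underflows to 0.0,
    # while the log form stays finite.
    far = [1000.0, 1000.0, 1000.0]
    print(class_conditional_direct(far, mu, var), log_class_conditional(far, mu, var))
```

Near the mean the two computations agree to floating-point precision; far from it, only the log form remains usable, because taking the logarithm of an underflowed product would fail.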
a) Derive equations for ln p(xn | tn = 1) and ln p(xn | tn = 0), the (natural) logarithms of the conditional probability density functions in Equations (2,3). For numerical robustness, simplify your answer so that it does not involve the exponential function.
b) Implement code that computes maximum likelihood (ML) estimates of the parameters of the Gaussian naive Bayes model: the class-conditional feature means $\mu_{1d}$, $\mu_{0d}$, the class-conditional feature variances $\sigma_{1d}^2$, $\sigma_{0d}^2$, and the priors q = p(tn = 0), 1 − q = p(tn = 1). Hint: The demo code shows how to compute the ML estimates of the feature means.
c) Using your results from part (a), and the ML parameter estimates from part (b), write code that evaluates the log-likelihood ratio ln p(xn | tn = 1) − ln p(xn | tn = 0) for every test data point. Using these “confidence” scores, compute and plot an ROC curve to evaluate your classifier’s test performance.
d) Suppose that the frequencies of the classes are as in the training data, and that all errors are equally costly. Determine the optimal Bayesian classification rule. What are the true positive rate and false positive rate of this rule on the test data?
e) Suppose that the frequencies of the classes are as in the training data, and that it is 50 times more costly to classify signals as background (missed detections) than to classify background as signals (false alarms). Determine the optimal Bayesian classification rule. What are the true positive rate and false positive rate of this rule on the test data?
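The pipeline in parts (b) through (e) can be sketched end to end on synthetic data. Everything below is an assumption made for illustration: the two-feature generator `synthetic` stands in for the MAGIC features, the function names are hypothetical, and the variance estimate divides by the class count (the ML estimate, not the unbiased one). The threshold follows from comparing expected losses: predict 1 when cost_miss · p(t = 1) p(x | t = 1) exceeds cost_false_alarm · p(t = 0) p(x | t = 0), which is a threshold on the log-likelihood ratio.

```python
import math
import random


def fit_gaussian_nb(rows, labels):
    """ML estimates: per-class feature means, variances (divide by N_c), priors."""
    params = {}
    for cls in (0, 1):
        members = [r for r, t in zip(rows, labels) if t == cls]
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(zip(*members), means)]
        params[cls] = {"prior": n / len(rows), "mean": means, "var": variances}
    return params


def log_likelihood_ratio(x, params):
    """ln p(x | t=1) - ln p(x | t=0) under the fitted model."""
    def log_density(cls):
        p = params[cls]
        return sum(-0.5 * math.log(2 * math.pi * s2) - 0.5 * (v - m) ** 2 / s2
                   for v, m, s2 in zip(x, p["mean"], p["var"]))
    return log_density(1) - log_density(0)


def rates(scores, labels, threshold):
    """(TPR, FPR) of the rule 'predict 1 when score > threshold'."""
    tp = sum(1 for s, t in zip(scores, labels) if t == 1 and s > threshold)
    fp = sum(1 for s, t in zip(scores, labels) if t == 0 and s > threshold)
    n_pos = sum(labels)
    return tp / n_pos, fp / (len(labels) - n_pos)


def roc_points(scores, labels):
    """ROC curve as (FPR, TPR) pairs, sweeping the threshold over all scores."""
    thresholds = [math.inf] + sorted(set(scores), reverse=True) + [-math.inf]
    return [rates(scores, labels, th)[::-1] for th in thresholds]


def bayes_threshold(prior1, cost_miss=1.0, cost_false_alarm=1.0):
    """Log-likelihood-ratio threshold that minimizes expected loss."""
    return math.log(cost_false_alarm * (1.0 - prior1) / (cost_miss * prior1))


def synthetic(n, rng):
    """Hypothetical 2-feature stand-in for the real data, with made-up parameters."""
    rows, labels = [], []
    for _ in range(n):
        t = 1 if rng.random() < 0.65 else 0
        mean = (1.0, -0.5) if t == 1 else (-1.0, 0.5)
        rows.append([rng.gauss(mean[0], 1.0), rng.gauss(mean[1], 2.0)])
        labels.append(t)
    return rows, labels


if __name__ == "__main__":
    rng = random.Random(0)
    train, train_labels = synthetic(2000, rng)
    test, test_labels = synthetic(1000, rng)
    params = fit_gaussian_nb(train, train_labels)
    scores = [log_likelihood_ratio(x, params) for x in test]
    curve = roc_points(scores, test_labels)  # pass these points to any plotting tool
    equal = bayes_threshold(params[1]["prior"])
    skewed = bayes_threshold(params[1]["prior"], cost_miss=50.0)
    print("equal costs   (TPR, FPR):", rates(scores, test_labels, equal))
    print("50x miss cost (TPR, FPR):", rates(scores, test_labels, skewed))
```

Raising the cost of missed detections lowers the threshold, so the operating point moves up and to the right along the same ROC curve. The curve itself does not depend on the priors or the costs; only the chosen point on it does.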
Question 3: (30 points)
You’ve been asked to test the performance of a batch of newly fabricated processors. If the processors were correctly manufactured (class T = 0), the time X to complete your test suite is exponentially distributed with mean 1. If the equipment at the factory malfunctions (class T = 1), the time X is exponentially distributed with mean 50. Recall that the exponential probability density function equals p(x) = θ e^{−θx} for x ≥ 0, where E[X] = 1/θ.
You must decide whether or not this batch of processors was correctly manufactured. For the scenarios in the three parts below, it is possible to show that the optimal Bayesian classifier predicts T = 0 if x ≤ c, and predicts T = 1 if x > c, for some constant c. The value of c depends on the test time distributions, the prior probabilities of the two classes, and the assumed loss function. You need to determine the optimal c in each case.
a) Suppose that a new fabrication process has just been deployed, and the probability that the factory manufactures correctly functioning processors is only P(T = 0) = 0.5. What threshold c of the observed test suite time X = x maximizes the probability that your prediction is correct?
b) Suppose that after some improvements to the new fabrication process, the probability that the factory manufactures correctly functioning processors increases to P(T = 0) = 0.99. What threshold c of the observed test suite time X = x maximizes the probability that your prediction is correct?
c) Market research suggests that the loss (or cost) of a missed detection (predicting T = 0 when the processor is actually defective) is 500 times greater than the loss of a false alarm (predicting T = 1 when the processor was correctly manufactured). Assuming again that P(T = 0) = 0.99, what threshold c of the observed test suite time X = x minimizes the expected loss?
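The three scenarios differ only in the prior and the loss, so a single function can cover them. The sketch below assumes the rule "predict T = 1 when cost_miss · P(T = 1) p(x | T = 1) exceeds cost_false_alarm · P(T = 0) p(x | T = 0)" and solves the resulting inequality for x; it also assumes the defective mean is larger than the good mean, so that the inequality has the form x > c. The function and parameter names are hypothetical, and `expected_loss` is included only to check the threshold numerically.

```python
import math


def optimal_threshold(prior_good, mean_good=1.0, mean_bad=50.0,
                      cost_miss=1.0, cost_false_alarm=1.0):
    """Threshold c such that predicting T=1 exactly when x > c minimizes expected loss.

    With densities theta * exp(-theta * x), taking logs of the decision
    inequality makes it linear in x, which gives a closed form for c.
    Assumes mean_bad > mean_good.
    """
    theta_good, theta_bad = 1.0 / mean_good, 1.0 / mean_bad
    prior_bad = 1.0 - prior_good
    log_ratio = math.log((cost_false_alarm * prior_good * theta_good)
                         / (cost_miss * prior_bad * theta_bad))
    c = log_ratio / (theta_good - theta_bad)
    # x is non-negative, so a negative c means "always predict T=1".
    return max(c, 0.0)


def expected_loss(c, prior_good, mean_good=1.0, mean_bad=50.0,
                  cost_miss=1.0, cost_false_alarm=1.0):
    """Expected loss of the rule 'predict T=1 when x > c'."""
    false_alarm = prior_good * math.exp(-c / mean_good)              # good, but x > c
    miss = (1.0 - prior_good) * (1.0 - math.exp(-c / mean_bad))      # bad, but x <= c
    return cost_false_alarm * false_alarm + cost_miss * miss


if __name__ == "__main__":
    scenarios = {
        "a": dict(prior_good=0.5),
        "b": dict(prior_good=0.99),
        "c": dict(prior_good=0.99, cost_miss=500.0),
    }
    for name, kwargs in scenarios.items():
        c = optimal_threshold(**kwargs)
        print(name, "threshold:", c, "expected loss:", expected_loss(c, **kwargs))
```

With equal costs the expected loss equals the probability of error, so parts (a) and (b) are the special case cost_miss = cost_false_alarm = 1. Comparing `expected_loss` at c and at nearby values is a quick sanity check on a hand-derived threshold.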