Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due
Theoretical Foundations of Machine Learning - Homework #3
Homework Policy: Working in groups is fine, but every student must submit their own writeup. Please write the members of your group on your solutions. There is no strict limit to the size of the group but we may find it a bit suspicious if there are more than 4 to a team. Questions labelled with (Challenge) are not strictly required, but you’ll get some participation credit if you have something interesting to add, even if it’s only a partial answer.
1) Rademacher Complexity Identities. Problem 3.4 from the book, parts (a) and (b).
2) Model Selection. Assume our space X := {x ∈ R d : kxk2 ≤ 1}, Y := {−1, 1}, we have an unknown distribution D on X × Y our loss is the hinge loss `(y 0 , y) = max(0, 1−yy0 ). With a slight abuse of notation we will write `(w,(x, y)) to mean `((w>x), y). Our class of functions H are linear thresholds H := {hw(x) = sign(w>x) : w ∈ R d}. The risk of w (that is, of hw) is R(w) := E(x,y)∼D[`(w,(x, y))]. Let R∗ lin := infw∈Rd R(w) be the optimal risk for linear threshold functions. Given m samples (x1, y1), . . . ,(xm, ym) then the empirical risk is Rˆm(w) := 1 m Pm i=1 `(w,(xi , yi)).
In high-dimension, it is often the case that the optimal w∗ , that is the minimizer of R(·), will have to be very large, or possibly grow without bound. In other words, the margin may be tiny or even infinitesimally small. On the other hand it is typical that a slight amount of regularization can help. Let us assume that
inf w∈Rd R(w) + λkwk 2 → R ∗ lin as λ → 0. (1)
This suggests that it is a good idea to regularize our objective when we have limited data. Let wm = arg min w Rˆm(w) + λmkwk 2 and w∗ m = arg min w R(w) + λmkwk 2
for some choice of regularization parameter λm (decaying with m and to be determined), and note that the arg mins are unique since the objectives are strictly convex. Let’s analyze the performance of wm, and figure out how to select the sequence λm.
(a) Prove that both kwmk 2 ≤ 1 λm and kw∗ mk 2 ≤ 1 λm . (Hint: note that 0 is a possible choice for w)
(b) Let m be fixed. Prove that with probability at least 1 − δ, |R(wm) − Rˆm(wm)| ≤ (m, λm, δ) and |R(w∗ m) − Rˆm(w∗ m)| ≤ (m, λm, δ)
for a suitable choice of function (·, ·, ·) (and notice that (·, ·, ·) need not depend on d).
(c) Now prove the same statement but uniformly over m. That is, find a function (·, ·, ·) such that with probability at least 1 − δ we have for any m that
|R(wm) − Rˆm(wm)| ≤ (m, λm, δ) and |R(w∗ m) − Rˆm(w∗ m)| ≤ (m, λm, δ). (Hint: Union bound followed by choosing a reasonable sequence δm, so that δ = P∞ m=1 δm. It is likely you will get additional log m terms.) 1
(d) For the choice of (·, ·, ·) above, prove that with probability 1 − δ it holds that for all m R(wm) ≤ R(wm) + λmkwmk 2 ≤ R(w∗ m) + λmkw∗ mk 2 + 2(m, λm, δ).
(Hint: the actual form of (·, ·, ·) doesn’t matter, as long is it works in part (c))
(e) Find a suitable choice for the sequence λm that depends only on m, and prove that for this sequence it holds that R(wm) → R∗ lin as m → ∞ with high probability. (Hint: Your solution should combine equation 1 and part (d).)
What we are showing here is that our regularization algorithm is consistent, i.e. our chosen hypothesis wm will approach optimality as the size of our sample grows towards infinity. 2