Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due
CS 7545: Machine Learning Theory Problem Set 2
Problem 1
Problem.
1. Graded. Is it possible for an ERM hypothesis hˆ ∈ H on a set S ⊆ X to have LbS(hˆ) = 0 but L(hˆ) = 1? Why or why not? Would this be overfitting or underfitting? What is the role of the complexity of H in this situation?
2. Graded. Let H be a hypothesis class, hˆ ∈ H be an ERM hypothesis for a sample S, and h ? = arg infh∈H L(h). Show that ES∼Dm[LbS(hˆ)] ≤ L(h ? ) ≤ ES∼Dm[L(hˆ)]. (1)
3. Ungraded, optional. Here are some resources if you would like to study the proof of the nofree-lunch theorem. Most sources prove a simpler version; the one from class takes a bit more work. I highly recommend this illustrated proof. For a textbook version, see Section 5.1 “The No-Free-Lunch Theorem” in Understanding Machine Learning: From Theory to Algorithms. (Don’t worry about the PAC-learning stuff in Corollary 5.2).
Problem 2
Problem. In lecture we showed the following one-sided uniform convergence generalization bound: for H containing functions h : X → {−1, 1} such that |H| < ∞ and any δ > 0, with probability at least 1 − δ over S ∼ Dm, the following holds for all h ∈ H:
L(h) ≤ LbS(h) + r log |H| + log 1/δ 2m . (2)
This bound shows that the estimation error of the empirical risk minimizer goes to zero with 1/ √ m, which is often called the “slow rate”. Interestingly, we did not use any properties of H besides its size. One may wonder whether there is any advantage to choosing a hypothesis class H0 which is the same size as H but contains “better” functions. In fact, it turns out that if all the functions in H0 have sufficiently low generalization error, we can achieve a “fast rate” of 1/m.
In order to prove the fast rate bound, we need a more sophisticated concentration bound than Hoeffding’s inequality. Let us state Bernstein’s inequality: Let Z1, . . . , Zm be i.i.d. random variables with zero mean such that |Zi | ≤ C and Var(Zi) ≤ D for all i. Then for all > 0,
Pr " 1 m Xm i=1 Zi ≥ # ≤ exp −(m2 )/2 D + (C)/3 . (3) 1. Graded. Let H contain functions h : X → {−1, 1} with |H| < ∞. Suppose there exists a function q : R → R such that L(h) ≤ q(m) for any h ∈ H and S ⊆ X with |S| = m. Use Bernstein’s inequality to prove that for any δ > 0, with probability at least 1−δ over S ∼ Dm, the following holds for all h ∈ H: L(h) ≤ LbS(h) + r 2(log |H| + log 1/δ)q(m) m + 2(log |H| + log 1/δ) 3m . (4)
Hint. First, define the Zi ’s and compute C and D. Then, the rest will be similar to the proof of our original bound, but with Bernstein instead of Hoeffding. It’s OK if you get different constants than are listed here.
2. Graded. How should q(m) scale in order to obtain the fast rate? It is sufficient to give an answer like q(m) = O(?) and explain your reasoning.
3. Ungraded, optional. A more general form of Bernstein’s inequality holds for a very large class of distributions called subexponential distributions. These distributions are roughly characterized by having heavier tails than a Gaussian – they decay with e −x instead of e −x 2 – and they come up often in machine learning theory. If you would like to learn more, read sections 2.7 and 2.8 of High-Dimensional Probability.
Problem 3
Problem. In this problem, we will prove a classical bound on the Rademacher complexity of neural networks. Suppose the input space is X = R n and we have a training set S = {(xi , yi)} m i=1. Let φ : R → R be an L-Lipschitz activation function such that φ(0) = 0 (e.g., the ReLU function). Define the class of neural networks of depth 2 ≤ j ≤ D and width H with `1-bounded weights recursively as
Fj := ( x 7→ X H k=1 wkφ(fk(x)) : fk ∈ Fj−1, kwk1 ≤ Bj ) . (5)
Here, φ is applied elementwise, i.e., φ(x) = (φ(x1), . . . , φ(xn)).
1. Graded. Define F1 := {x 7→ hw, xi : kwk1 ≤ B1} and suppose kxik∞ ≤ C for all 1 ≤ i ≤ m. Prove that RS(F1) ≤ B1C r 2 log 2n m . (6) Hint. Use Hölder’s inequality and Massart’s lemma.
2. Graded. Prove that RS(Fj ) ≤ 2LBjRS(Fj−1) for 2 ≤ j ≤ D. Hint. Use Hölder’s inequality and Talagrand’s contraction lemma. You may use part (4) without proof.
3. Graded. Use parts (1) and (2) to show an upper bound on the Rademacher complexity of RS(FD). (You must use parts (1) and (2)).
4. Ungraded, optional. Prove that if a function class G contains the zero function, then Eσ " sup g∈G 1 m Xm i=1 σig(xi) # ≤ 2RS(G). (7)
Problem 4
Problem. Suppose A ⊆ R m.
1. Graded. Prove that R(A + b) = R(A) where A + b = {a + b : a ∈ A} for any b ∈ R m.
2. Graded. Prove that R(cA) = |c|R(A) where cA = {c · a : a ∈ A} for any c ∈ R.
3. Graded. In lecture we proved the following one-sided uniform convergence generalization bound: for F containing functions f : X → [0, 1] and any δ > 0, with probability at least 1−δ over S ∼ Dm, the following holds for all f ∈ F:
L(f) ≤ LbS(f) + 2R(F) + r log 1/δ 2m . (8)
However, to show a bound on the estimation error of ERM we actually needed a two-sided bound, on supf∈F L(f) − LbS(f) . Use parts (1) and (2) to prove one. (You must use parts (1) and (2)).
4. Challenge, optional, 1 point extra credit. Let S ∼ Dm and suppose F contains functions f : X → [0, 1]. Prove the symmetrization lower bound, also called the desymmetrization inequality:
1 2 R(F) − r log 2 2m ≤ ES " sup f∈F L(f) − LbS(f) # . (9)
Problem 5
Problem. In lecture we studied the growth function for classes of functions taking values in the set {−1, 1}, but the same definition applies to classes of functions taking values in the finite set Y. In this case, ΠH(m) ≤ |Y|m (analogous to 2 m in the original setup).
1. Graded. Let H1 ⊆ {h : X → Y1} and H2 ⊆ {h : X → Y2} be function classes and let H3 ⊆ {h : X × X → Y1 × Y2} such that H3 = {(h1, h2) : h1 ∈ H1, h2 ∈ H2}. Show that ΠH3 (m) = ΠH1 (m) · ΠH2 (m). (10)
2. Graded. Let H1 ⊆ {h : X → Y1} and H2 ⊆ {h : Y1 → Y2} be function classes and let H3 ⊆ {h : X → Y2} such that H3 = {h2 ◦ h1 : h1 ∈ H1, h2 ∈ H2}. Show that ΠH3 (m) ≤ ΠH1 (m) · ΠH2 (m). (11)
3. Ungraded, optional. Prove that (2) is tight, i.e., exhibit X , Y1, Y2, H1, H2, m such that ΠH3 (m) = ΠH1 (m) · ΠH2 (m). Hint. You can take |X | = m = 1.
Problem 6
Problem.
1. Graded. What is the VC-dimension of a union of k intervals on the real line?
2. Graded. What is the VC-dimension of axis-aligned hyperrectangles in R n?
3. Graded. A simplex in R n is the intersection of n + 1 halfspaces (not necessarily bounded). Prove that the VC-dimension of simplices in R n is O(n 2 log n). Hint. Use the VC-dimension of halfspaces in R n.
4. Challenge, optional, 1 extra credit point. Prove the best lower bound you can on the VC-dimension of simplices in R n. You will receive the extra credit point if you either (i) prove a lower bound of Ω(n) and show a reasonable attempt at improving it, or (ii) prove a lower bound better than Ω(n).