CS281B/Stat241B Statistical Learning Theory Homework Assignment 1


1. (Perceptron algorithm) Implement the perceptron algorithm. Consider the following two data sets, which are labelled according to a linear threshold function. (Define $e_i \in \{0,1\}^d$ as the unit vector with the $i$th component equal to 1.)
(a) Choose $x_i \in \{0,1\}^{d+1}$ with $x_i = e_i + e_{d+1}$, $y_i = 1$ for $i = 1, \dots, d$, and $x_{d+1} = e_{d+1}$, $y_{d+1} = -1$.
(b) Construct a set $S_n \subset \{0,1\}^{2n}$ as follows. Set $S_1 = \{01, 10, 11\}$, and for $i \geq 1$, $S_{i+1} = \{x01 : x \in S_i\} \cup \{11\cdots 10,\ 00\cdots 011\}$.
For each element $b = (b_1 \cdots b_{2n})$ of $S_n$, define an associated Boolean value $v_b \in \{0,1\}$ as
$$v_b = b_{2n} \wedge \bigl(b_{2n-1} \vee \bigl(b_{2n-2} \wedge \bigl(b_{2n-3} \vee (\cdots (b_2 \wedge b_1) \cdots)\bigr)\bigr)\bigr),$$
where $\wedge$ denotes conjunction ($b_1 \wedge b_2$ is 1 only for $b_1 = b_2 = 1$) and $\vee$ denotes disjunction ($b_1 \vee b_2$ is 0 only for $b_1 = b_2 = 0$). Finally, set $\{(x_1, y_1), \dots, (x_{2n+1}, y_{2n+1})\} = \{((b, 1),\ 2v_b - 1) : b \in S_n\}$.
Plot the number of updates made by the perceptron algorithm on these two data sets, as a function of $n$ for $d = 2n$. Comment on the difference and explain why it occurs. (A sketch of one possible implementation follows.)
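For reference, here is a minimal sketch of the experiment in Python. It is one possible implementation, not a prescribed one; NumPy and Matplotlib are assumed, and all function names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def perceptron_updates(X, y, max_passes=100_000):
    """Cycle through (X, y) until a clean pass; return the number of updates.
    No separate bias term: both data sets carry a constant final coordinate."""
    w = np.zeros(X.shape[1])
    updates = 0
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:          # mistake (or tie): update
                w += yi * xi
                updates += 1
                mistakes += 1
        if mistakes == 0:                   # converged on this data set
            return updates
    return updates                          # generous cap for slow runs

def dataset_a(d):
    """Part (a): x_i = e_i + e_{d+1}, y_i = +1; x_{d+1} = e_{d+1}, y_{d+1} = -1."""
    X = np.eye(d + 1)
    X[:d, d] = 1.0                          # add e_{d+1} to each e_i
    y = np.r_[np.ones(d), -1.0]
    return X, y

def dataset_b(n):
    """Part (b): the strings of S_n, each with label 2 v_b - 1 and an appended 1."""
    S = ['01', '10', '11']                  # S_1
    for _ in range(n - 1):                  # build S_2, ..., S_n
        L = len(S[0])
        S = [s + '01' for s in S] + ['1' * (L + 1) + '0', '0' * L + '11']
    X, y = [], []
    for s in S:
        bits = [int(c) for c in s]          # bits[0] is b_1, bits[-1] is b_{2n}
        v = bits[0]                         # innermost term b_2 AND b_1 starts at b_1
        for j in range(1, len(bits)):       # AND at even positions, OR at odd ones
            v = (bits[j] & v) if j % 2 == 1 else (bits[j] | v)
        X.append(bits + [1])                # the pair (b, 1)
        y.append(2 * v - 1)
    return np.array(X, float), np.array(y, float)

ns = range(1, 9)
plt.plot(ns, [perceptron_updates(*dataset_a(2 * n)) for n in ns], label='(a)')
plt.plot(ns, [perceptron_updates(*dataset_b(n)) for n in ns], label='(b)')
plt.xlabel('n'); plt.ylabel('perceptron updates'); plt.legend(); plt.show()
```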
2. (VC-dimension and lower bounds on risk in pattern classification) We say that a class $\mathcal{F}$ of $\{\pm 1\}$-valued functions defined on $\mathcal{X}$ shatters a set $\{x_1, \dots, x_n\} \subseteq \mathcal{X}$ if
$$\{(f(x_1), \dots, f(x_n)) : f \in \mathcal{F}\} = \{\pm 1\}^n,$$
that is, if $\mathcal{F}$ can compute all $2^n$ dichotomies of the set. The Vapnik-Chervonenkis dimension of $\mathcal{F}$ is
$$d_{VC}(\mathcal{F}) = \max\left\{n : \mathcal{F} \text{ shatters some } \{x_1, \dots, x_n\} \subseteq \mathcal{X}\right\}.$$
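As a quick illustration of the definition (an aside, not part of the problem), shattering can be checked by brute force for a finite pool of functions; a minimal Python sketch, with illustrative names:

```python
from itertools import product

def shatters(functions, points):
    """True iff {(f(x_1), ..., f(x_n)) : f in functions} = {-1, +1}^n."""
    realized = {tuple(f(x) for x in points) for f in functions}
    return all(d in realized for d in product((-1, 1), repeat=len(points)))

# Threshold functions sign(x - t) on the line realize both dichotomies of a
# single point, but never the dichotomy (+1, -1) of an ordered pair of points,
# so their VC dimension is 1.
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in (-1.0, 0.0, 1.0, 2.0)]
print(shatters(thresholds, [0.5]))        # True
print(shatters(thresholds, [0.5, 1.5]))   # False
```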
We have seen the following minimax lower bound on expected risk for the class of linear threshold functions on $\mathbb{R}^d$.

Theorem 1 For any classification rule $f_n$ and any $n > 1$, there is a probability distribution $P$ on $\mathcal{X} \times \{\pm 1\}$ for which some $f \in \mathcal{F}$ has $R(f) = 0$ but
$$\mathbb{E} R(f_n) \geq \frac{\min(n, d) - 1}{2n} \left(1 - \frac{1}{n}\right)^n.$$
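For intuition (a remark added here, not part of the original statement), the trailing factor is bounded below by a constant for $n \geq 2$, so the bound is of order $\min(n, d)/n$:

```latex
% (1 - 1/n)^n increases to e^{-1} and equals 1/4 at n = 2, so for n >= 2:
\left(1 - \frac{1}{n}\right)^{n} \ge \frac{1}{4}
\quad\Longrightarrow\quad
\mathbb{E} R(f_n) \ge \frac{\min(n, d) - 1}{8n}
  = \Omega\!\left(\frac{\min(n, d)}{n}\right).
```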
(a) Show that this result remains true for $\mathcal{F}$ an arbitrary set of functions with $d_{VC}(\mathcal{F}) \geq d$.
(b) Hence, prove minimax lower bounds on expected risk for the following classes $\mathcal{F}$ of binary-valued functions.
i. Indicators of boxes in $\mathbb{R}^d$, $\mathcal{F} = \left\{\theta_{B(a,b)} : a, b \in \mathbb{R}^d\right\}$, where
$$\theta_S(x) = \begin{cases} 1 & \text{if } x \in S, \\ -1 & \text{otherwise,} \end{cases} \qquad B(a, b) = \left\{x \in \mathbb{R}^d : \forall i,\ a_i \leq x_i \leq b_i\right\}.$$
ii. Indicators of unions of intervals in $\mathbb{R}$, $\mathcal{F} = \{\theta_U : U \text{ is a union of up to } k \text{ closed intervals}\}$.
iii. Indicators of convex sets in $\mathbb{R}^2$, $\mathcal{F} = \left\{\theta_S : S \subseteq \mathbb{R}^2 \text{ and } S \text{ convex}\right\}$.
(c) Consider the empirical risk minimization strategy for indicators of boxes in $\mathbb{R}^d$ that, given training data $(x_1, y_1), \dots, (x_n, y_n)$, predicts using $f_n = \theta_{B_{\min}}$, where
$$B_{\min} = \bigcap \left\{ B(a, b) : \forall i,\ \theta_{B(a,b)}(x_i) = y_i \right\}.$$
Describe a distribution $P$ on $\mathbb{R}^d \times \{\pm 1\}$ for which the marginal distribution on $\mathbb{R}^d$ is continuous, some box indicator $f$ has $R(f) = 0$, and this distribution exhibits the bound of Theorem 1 for this ERM strategy: $\mathbb{E} R(\theta_{B_{\min}}) = \Omega(d/n)$.
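As an implementation aside (not asked for by the problem): when the sample is consistent with some box, $B_{\min}$ is the axis-aligned bounding box of the positively labelled points, so the ERM predictor is cheap to compute. A minimal Python sketch with illustrative names:

```python
import numpy as np

def erm_box(X, y):
    """B_min for realizable data: the smallest axis-aligned box containing
    the positive examples (per-coordinate min and max of those points)."""
    pos = X[y == 1]
    return pos.min(axis=0), pos.max(axis=0)

def predict(a, b, X):
    """theta_{B(a,b)}: +1 inside the box, -1 outside."""
    inside = np.all((X >= a) & (X <= b), axis=1)
    return np.where(inside, 1, -1)
```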

3. (Kernel density estimation) Suppose $X_1, \dots, X_n$ are i.i.d. according to some continuous probability distribution on $\mathbb{R}$, with density $f$. The kernel density estimate

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)$$
is used to estimate the unknown $f$. Here, $K$ is a probability density function with zero mean and $h > 0$ is a bandwidth parameter. Suppose that we use the $L_1$ error to measure the quality of the density estimate $\hat{f}$,
$$J(\hat{f}) = \int \left| f(x) - \hat{f}(x) \right| dx.$$
Show that, with probability at least $1 - \delta$,
$$\left| J(\hat{f}) - \mathbb{E} J(\hat{f}) \right| \leq \sqrt{\frac{2 \log(2/\delta)}{n}}.$$
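A minimal sketch of the estimator itself in Python (the Gaussian kernel and the bandwidth value are illustrative choices; the problem leaves both unspecified):

```python
import numpy as np

def kde(x, samples, h):
    """f_hat(x) = (1 / (n h)) * sum_i K((x - X_i) / h),
    with K taken to be the standard normal density for illustration."""
    u = (np.asarray(x, float)[..., None] - samples) / h   # shape (..., n)
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)        # Gaussian kernel values
    return K.mean(axis=-1) / h

# Example: estimate a standard normal density from 500 samples.
rng = np.random.default_rng(0)
samples = rng.standard_normal(500)
grid = np.linspace(-4.0, 4.0, 200)
f_hat = kde(grid, samples, h=0.3)
```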

