EECS 598 Theoretical Foundations of Machine Learning - Homework #4



Homework Policy: Working in groups is fine, but every student must submit their own writeup. Please write the members of your group on your solutions. There is no strict limit to the size of the group but we may find it a bit suspicious if there are more than 4 to a team. Questions labelled with (Challenge) are not strictly required, but you’ll get some participation credit if you have something interesting to add, even if it’s only a partial answer. 

1) Solving Feasibility Problems with Perceptron. We can define a special class of Linear Programming Feasibility (LFP) problems as follows. Given a set of m vectors $a_1, \dots, a_m \in \mathbb{R}^d$, we want to find a feasible solution, which is a vector $y \in \mathbb{R}^d$ such that $a_i \cdot y > 0$ for all $i = 1, \dots, m$. For any vector x, we can define $\nu(x) := \min_i \frac{a_i \cdot x}{\|a_i\|_2 \|x\|_2}$, which is positive if x is a feasible solution. Define the wiggle room of the feasibility problem as $\nu^* := \sup_{x \in \mathbb{R}^d} \nu(x)$.

Assume we are given an LFP defined by $a_1, \dots, a_m \in \mathbb{R}^d$ with a positive wiggle room $\nu^* > 0$. Using Perceptron as a subroutine, give an algorithm that finds a feasible solution requiring no more than $\frac{1}{(\nu^*)^2}$ updates to Perceptron. Prove that your solution works.
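For concreteness, here is a minimal Python sketch of one natural way to wire Perceptron to such a feasibility problem: treat each $a_i$ as a positively labeled example and run Perceptron updates until every constraint is satisfied. The function name and update budget are illustrative, not a prescribed interface.

```python
import numpy as np

def perceptron_lfp(A, max_updates=100_000):
    """Sketch: rows of A are the vectors a_1, ..., a_m. Treat each as a
    positively labeled example; update on any violated constraint until
    a_i . y > 0 holds for all i. The classical Perceptron mistake bound
    with margin nu* is what should cap the number of updates."""
    A = np.asarray(A, dtype=float)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)  # normalize each a_i
    y = np.zeros(A.shape[1])
    for _ in range(max_updates):
        violated = np.flatnonzero(A @ y <= 0)   # constraints not yet satisfied
        if violated.size == 0:
            return y                            # feasible solution found
        y += A[violated[0]]                     # Perceptron update
    raise RuntimeError("update budget exhausted")
```

The proof obligation is then to relate the number of updates this loop makes to the mistake bound of Perceptron run with margin $\nu^*$.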

2) Tuning Parameters. We are going to imagine we have some algorithm A with a performance bound that depends on some input values (which cannot be adjusted) and some tuning parameters (which can be optimized). We will use Greek letters (α, η, ζ, etc.) for the tuning parameters and capital letters (T, D, N, etc.) for inputs. We would like the bound to be the tightest possible, up to multiplicative constants. For each of the following, tune the parameters to obtain the optimal bound. Using big-Oh notation to hide constants is fine, but you must not ignore the dependence on the input parameters. For example, assuming M, T > 0, imagine we have a performance guarantee of the form:

$$\mathrm{Performance}(A; M, T, \epsilon) \;\le\; M\epsilon + \frac{T}{\epsilon} \tag{1}$$

and we know $\epsilon > 0$. Then by optimizing the above expression with respect to the free parameter we can set $\epsilon = \sqrt{T/M}$. With this value we obtain $\mathrm{Performance}(A; M, T, \epsilon) = O(\sqrt{MT})$. NOTE: I didn't have to make up this problem, I actually pulled all the bounds below from different papers!
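To spell out the one optimization step this example relies on (assuming the form $M\epsilon + T/\epsilon$ of Eq. (1)): setting the derivative in $\epsilon$ to zero balances the two terms,

$$\frac{d}{d\epsilon}\left(M\epsilon + \frac{T}{\epsilon}\right) = M - \frac{T}{\epsilon^2} = 0 \quad\Longrightarrow\quad \epsilon = \sqrt{T/M}, \qquad M\sqrt{T/M} + \frac{T}{\sqrt{T/M}} = 2\sqrt{MT}.$$

The same balance-the-terms heuristic (set the competing terms equal and solve for the parameter) is the working tool for parts (a)-(d) below.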

(a) $\mathrm{Performance}(A; T, N, \eta) \le \dfrac{\log N + \eta T}{1 - \exp(-\eta)}$

(b) $\mathrm{Performance}(A; T, \eta) \le \max(T\eta,\, \eta^{-2})$ (a numerical sanity check of this tuning appears after part (d))

(c) $\mathrm{Performance}(A; T, \eta) \le \frac{T}{\eta} + \exp(\eta)$. (Note: you needn't obtain the optimal choice of η here or the tightest possible bound, but try to tune in order to get a bound that is o(T) – i.e. the bound should grow strictly slower than linearly in T.)

(d) $\mathrm{Performance}(A; N, T, \eta, \epsilon) \le \frac{T\epsilon}{\eta} + \frac{N}{\epsilon} + T\eta$
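When an analytic tuning is in doubt, it can be checked numerically. For instance, for part (b), balancing $T\eta = \eta^{-2}$ suggests $\eta = T^{-1/3}$ and a bound of order $T^{2/3}$. Here is a small Python sketch of such a check (the constants and grid are illustrative only, not part of the problem):

```python
import numpy as np

# Sanity check for part (b): bound(eta) = max(T*eta, eta**-2).
# Balancing the terms gives eta = T**(-1/3) and a bound of T**(2/3);
# a grid search over eta should agree up to grid resolution.
T = 10_000
etas = np.logspace(-4, 1, 2000)
bounds = np.maximum(T * etas, etas ** -2.0)
best = etas[np.argmin(bounds)]
print(f"grid-search eta ~ {best:.4g},  analytic eta = {T ** (-1/3):.4g}")
print(f"grid-search min bound ~ {bounds.min():.4g},  analytic bound = {T ** (2/3):.4g}")
```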

3) The Doubling Trick. Let's imagine that an online learning algorithm that runs in T rounds has the bound in Eq. (1). By optimally tuning $\epsilon$ we obtain a bound of the form $O(\sqrt{TM})$. The problem with this approach is that it requires us to know T in advance. Is there a way around this issue?

Imagine constructing a modified algorithm $A'$ that does the following iterative procedure. $A'$ starts with an initial parameter $\epsilon_1$, implements algorithm A using this parameter for $T_1$ rounds, then adjusts the parameter to $\epsilon_2$, runs A for $T_2$ rounds, and so forth. Let's say $\epsilon$ gets updated k times, where $T_1 + T_2 + \dots + T_k = T$. You can also assume that $\mathrm{Performance}(A'; T, M) = \sum_{i=1}^k \mathrm{Performance}(A; T_i, M, \epsilon_i)$.

Can you construct a schedule of pairs $(\epsilon_i, T_i)$ that achieves the same bound (up to a multiplicative constant factor) as the optimally tuned bound (namely, $O(\sqrt{MT})$), even though T is unknown in advance? In other words, you want to choose the sequence $T_1, T_2, \dots$, with the associated $\epsilon_1, \epsilon_2, \dots$, so that whenever the game truly ends, at a previously unknown T, the bound $\mathrm{Performance}(A'; T, M) = O(\sqrt{MT})$ will always hold. (Note: this is referred to as a “doubling trick”.)
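Without giving away the analysis, here is a minimal Python sketch of the epoch structure such a schedule might take, assuming the $M\epsilon + T/\epsilon$ form of Eq. (1) (so that $\epsilon_i = \sqrt{T_i/M}$ is the tuned choice for a known horizon $T_i$) and a hypothetical black box run_A(T, eps) that plays T rounds of A:

```python
import math

def doubling_schedule(run_A, M, true_horizon):
    """Run A in epochs of doubling length T_i = 2**i, each with the
    parameter eps_i tuned as if T_i were the whole game. true_horizon
    stands in for the unknown stopping time: the schedule only uses it
    to know when the game ends, never to pick a parameter."""
    t, i = 0, 0
    while t < true_horizon:
        T_i = 2 ** i
        eps_i = math.sqrt(T_i / M)                 # optimal tuning for horizon T_i
        run_A(min(T_i, true_horizon - t), eps_i)   # final epoch may be cut short
        t += T_i
        i += 1
```

The intended punchline is that the per-epoch bounds form a geometric series whose sum is still $O(\sqrt{MT})$; verifying this is the heart of the problem.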

4) Exponential Weights Algorithm with a Prior. The Exponential Weights Algorithm had the initial weights $w^1_1, \dots, w^1_n$ all set to 1. What if instead we imagine an algorithm $A'$ where we set these weights according to $w^1_i = p_i$, where $\vec{p}$ is some distribution (i.e. $p_i \ge 0$ for all i and $\sum_i p_i = 1$)? We will do the same multiplicative update rule as before.

Prove that with this modified algorithm we achieve the following bound: for any expert i we have that 

$$\mathrm{Loss}_T(A') \;\le\; \frac{\log p_i^{-1} + \eta \, \mathrm{Loss}_T(\text{expert } i)}{1 - e^{-\eta}}$$
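To make the modified algorithm concrete, here is a minimal Python sketch, assuming per-round losses arrive as a $T \times n$ array and measuring $\mathrm{Loss}_T(A')$ as the expected loss of the normalized-weight mixture (your course notes may use a different loss model; adapt accordingly):

```python
import numpy as np

def exponential_weights_with_prior(p, losses, eta):
    """EWA started from a prior: initialize w^1_i = p_i, then apply the
    usual multiplicative update w^{t+1}_i = w^t_i * exp(-eta * loss^t_i)."""
    losses = np.asarray(losses, dtype=float)   # shape (T, n)
    w = np.asarray(p, dtype=float).copy()      # initial weights = the prior
    total_loss = 0.0
    for loss_t in losses:
        probs = w / w.sum()                    # play the normalized weights
        total_loss += probs @ loss_t           # expected loss this round
        w *= np.exp(-eta * loss_t)             # multiplicative update
    return total_loss
```

Note how the uniform prior $p_i = 1/n$ recovers the usual $\log n$ term, since then $\log p_i^{-1} = \log n$.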

5) Subsets as Experts. We saw that when we wanted to do “predictive sorting” it’s not easy to apply the halving algorithm to the class of all permutations (rankings) as experts. But this isn’t the case for all classes of “complex experts”. 

Imagine a setting where we have N experts, but our goal is not to choose one of them but rather k < N of them on each round! We can imagine having a “hyperexpert” for each subset $S \subset [N]$ with $|S| = k$, of which there are clearly $\binom{N}{k}$. Let $\mathcal{S}^N_k$ be the set of all k-sized subsets of $[N]$. On round t, each “base” expert i suffers loss $\ell^t_i$, which implies that the hyperexpert S suffers loss

$$\mathrm{loss}^t(S) := \sum_{i \in S} \ell^t_i,$$

that is, the hyperexpert loss is additive across the base experts in the subset. Our aim is to run the Exponential Weights Algorithm on these subset hyperexperts. With this well-defined loss on each hyperexpert and a given parameter η, we can define the weight update $w^{t+1}_S = w^t_S \exp(-\eta \, \mathrm{loss}^t(S))$.

In many scenarios in which we are dealing with hyperexperts, it's suitable to compute the “projected” weights for each base expert i. That is, assume we can run our algorithm by simply knowing $u^t_i := \sum_{S \in \mathcal{S}^N_k : i \in S} w^t_S$. In other words, if our weights define a distribution over $\mathcal{S}^N_k$, then the value $u^t_i$ corresponds to the (unnormalized) marginal probability of i being in a randomly drawn subset. Can we obtain these values efficiently? If so, how?

Hint: Given a vector of positive values $v_{1:n} := \langle v_1, \dots, v_n \rangle$, we can define $\mathrm{SumProd}(v_{1:n}, k) := \sum_{S \in \mathcal{S}^n_k} \prod_{i \in S} v_i$.

Naively this requires $O\big(\binom{n}{k}\big)$ computation, but perhaps there is a faster way to compute SumProd?
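In case it helps to experiment, here is a standard dynamic program for this quantity: $\mathrm{SumProd}(v_{1:n}, k)$ is exactly the k-th elementary symmetric polynomial $e_k(v)$, computable in $O(nk)$ time. This is offered as a sketch, not as the intended solution:

```python
def sumprod(v, k):
    """SumProd(v, k) = sum over all k-subsets S of prod_{i in S} v_i,
    i.e. the k-th elementary symmetric polynomial e_k(v)."""
    e = [1.0] + [0.0] * k      # e[j] = e_j of the prefix of v processed so far
    for v_i in v:
        # sweep j downward so each v_i contributes to a subset at most once
        for j in range(k, 0, -1):
            e[j] += v_i * e[j - 1]
    return e[k]

# Example: v = [2, 3, 4], k = 2 gives 2*3 + 2*4 + 3*4 = 26
assert sumprod([2, 3, 4], 2) == 26
```

From here it is natural to ask whether something like $u^t_i = v_i \cdot \mathrm{SumProd}(v_{-i}, k-1)$, with $v_j := \exp(-\eta \sum_{s<t} \ell^s_j)$ and $v_{-i}$ the vector with coordinate i removed, gives the projected weights; checking (and exploiting) this is the substance of the problem.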

6) Reduction to the Halving Algorithm. (Challenge) In the 0-noise expert setting, where there is one expert that makes no mistakes, we showed that a simple strategy, the Halving Algorithm, is guaranteed to make no more than $\log_2 N$ mistakes. But what if we know that the best expert will make some errors, but definitely no more than k mistakes? Try to find a master algorithm, via a reduction to the 0-noise case, such that the number of mistakes of the algorithm is never more than

$$\max\left\{\, m : m \le \log_2 N + \log_2 \binom{m}{\le k} \,\right\}$$

where $\binom{m}{\le k}$ is the sum of binomial coefficients $\binom{m}{0} + \binom{m}{1} + \dots + \binom{m}{k}$. (Hint: You need to find a large set of hyperexperts on which to perform the halving algorithm. You may need to argue that, without loss of generality, the majority of these hyperexperts are incorrect on every round.)
