CS7545, Spring 2024: Machine Learning Theory - Homework #3


Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due


CS7545, Spring 2024: Machine Learning Theory - Homework #3 

1) The Doubling Trick. In class, we proved the regret of OGD. In the last step of the proof, we showed that the regret is upper bounded by (Last line of Step 2 in the proof): R POGD T ≤ ηG2T 2 + D2 2η . 

By optimally tuning η as η = D G √ T , we obtain a bound of the form DG√ T. The problem with this approach is that it requires us to know T in advance. Is there a way around this issue? 

Imagine constructing a modified algorithm A′ that does the following iterative procedure. A′ starts with an initial parameter η1, implements the P-OGD algorithm using this step size for T1 rounds, then adjusts the parameter to η2, and runs P-OGD for T2 rounds, and so forth. Let’s say η gets updated k times, where T1 + T2 + . . . + Tk = T. 

Can you construct a schedule for the sequence of (ηi , Ti) that achieves the same order of bound as the optimally tuned bound (namely, O(DG√ T)), even though T is unknown in advance? In other words, you want to choose the sequence of T1, T2, . . ., with the associated η1, η2, . . . so that whenever the online learning procedure truly ends, at a previously unknown T, the bound RA T = O( √ MT) will always hold. (Hint: this is referred to as a “doubling trick”.) 

2) Dynamic Regret. In Lecture 16, we learned how to get the regret guarantee of convex functions by P-OGD. Note that in regret, the compatetor w∗ is fixed. In this problem, we consider a generalized notation of regret: dynamic regret, which is defined as Regret T := X T t=1 ft (wt) − X T t=1 ft(w∗ t ) 

where w∗ t ∈ K and PT t=2 ∥w∗ t − w∗ t−1∥2 ≤ PT . For convenience we assume 0 ∈ K. Can you get the regret guarantee in terms of G, D, T, and PT ? What is the optimal η in this scenario? 

Hint: 1) you can start from RegretT ≤ X T t=1 η 2 G 2 + X T t=1 ∥wt − w∗ t ∥ 2 2 − ∥wt+1 − w∗ t ∥ 2 2 2η , 

and consider how to use the definitions of D and PT to finish the telescoping sum. 2) When tuning η in the final step, consider the upper bound as a function of η, and then try to find the optimal configuration. 

3) Tuning Parameters. We are going to imagine we have some algorithm A with a performance bound that depends on some input values (which can not be adjusted) and some tuning parameters (which can be optimized). We will use greek letters (α, η, ζ, etc.) for the tuning parameters and capital letters (T, D, N, etc.) for inputs. We would like the bound to be the tightest possible, up to multiplicative constants. For each of the following, tune the parameters to obtain the optimal bound. Using big-Oh notation is fine to hide constants, but you must not ignore the dependence on the input parameters. For example, assuming M, T > 0, imagine we have a performance guarantee of the form: 

Performance(A; M, T, ϵ) ≤ M ϵ + T ϵ (1) 

and we know ϵ > 0. Then by optimizing the above expression with respect to the free parameter we can set ϵ = q T M . With this value we obtain Performance(A; M, T, ϵ) = O( √ MT) 

NOTE: I didn’t have to make up this problem, I actually pulled all the bounds below from different papers! 

(a) Performance(A; T, η) ≤ max(T η, η−2 ) 

(b) Performance(A; T, η) ≤ T η + exp(η). (Note: you needn’t obtain the optimal choice of η here or the tightest possible bound, but try to tune in order to get a bound that is o(T) – i.e. the bound should grow strictly slower than linear in T.)

(c) Performance(A; N, T, η, ϵ) ≤ T ϵ η + N ϵ + T η 

(d) Performance(A; N, T, η, ϵ) ≤ log N η + ηT ϵ 2 + ϵT. 

(e) Performance(A; T, N, η) ≤ log N+ηT 1−exp(−η) 

4) Online Non-Convex Optimization. Sometimes our nice assumptions don’t always hold. But maybe things will still work out just fine. For the rest of this problem assume that X ⊂ R n is the learner’s decision set, and the learner observes a sequence of functions f1, f2, . . . , fT mapping X → R. The regret of an algorithm choosing a sequence of x1, x2, . . . is defined in the usual way: 

Regret T := X T t=1 ft (xt) − min x∈X X T t=1 ft(x) 

Wouldn’t it ruin your lovely day if the functions ft were not convex? Maybe the only two conditions you can guarantee is that the functions ft are bounded (say in [0, 1] ) and are 1-Lipschitz: they satisfy that |ft(x) − ft (x ′ )| ≤ ∥x − x ′∥2 . Prove that, assuming X is convex and bounded, there exists a randomized algorithm with a reasonable expected-regret bound. Something like E [Regret T ] ≤ √ nT log T would be admirable. (Hint: Always good to ask the experts for ideas. And you needn’t worry about efficiency.)

发表评论

电子邮件地址不会被公开。 必填项已用*标注