Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due
CS 521 — ML and Compilers
Spring Semester 2025
Tuesday, Apr 29, 2025
This is a take-home quiz. It is open-book, open-notes, and open-computer.
No LLM assistance allowed.
Print your name and NetID below. Also write your netid at the bottom of each subsequent page.
1. (20 points) AD for Activation Functions
Automatic differentiation gives a computational interpretation of the abstract ideas of Calculus. For the following activation functions (each activation follows the corresponding PyTorch semantics), we will study the existence of derivatives for all inputs:
(a) out = sigmoid(in).
(b) out = Softplus(in).
(c) out = LeakyReLU(in, negative slope=0.02).
(d) out = heaviside(in) (hint: this is the step function).
(e) out = abs(in).
For each of the functions above (4pts per function):
• Manually derive the expression for the derivative using simple numerical operators (+, -, *, /) and primitive functions (e.g., exp, log, sin, cos). Show your derivation.
• Investigate whether the original function is differentiable at every input point. If yes, explain why. If no, explain the reasons and specific points of non-differentiability.
2. (30 points) Reverse-mode Automatic Differentiation
We will work on the following code for reverse mode AD.
Consider the function: sigmoid(0.75 * ln(x1/2) + x1 + ln(3*x2)).
Break down the expression into the intermediate representation that computes each atomic expression and stores it in temporary variables (you can use jaxpr or a similar IR for this task, like in MP3; its meaning should be clear from the use).
(a) (20pt) Write down the forward and reverse mode derivative computations. Note: The basic approach from ”Baydin et al. Automatic Differentiation in Machine Learning: a Survey” (exemplified also in the Appendix of MP3) is sufficient. You are also free to use the JAX approach we discussed in the lectures as an alternative.
(b) (10pt) Assume that x1 = 1, x2 = 0.5. Use the calculated gradients to find the maxi mum disturbance δ to x1 that would keep the output within 0.05 of the one that the computation would produce, i.e. f(x1, x2) − f(x1 + δ, x2) < 0.05. Show your derivation.
Hints: (1) For correctness, compare your results to symbolic derivation that e.g. Wolfram Alpha produces. (2) For sensitivity, recall a general (approximate) relationship between f(v + δ) − f(v) and the derivative f
′
(v) for a small δ.
3. (25 points) Quantization and Pruning For each statement that follows, answer Yes/No followed by 1-2 sentences justifying why you selected that answer (5pt each):
(a) Unstructured pruning yields the highest level of accuracy, but savings on commodity hardware are typically small.
(b) Inference on networks that have reduced-precision weights (e.g., fp16, int5, int6) requires the hardware device to natively support operations in that quantized format.
(c) The following vector can be the result of a 2:4 structured pruning procedure: [ 2.5, 0, 2.5, 0, 0, 0, 2, 0.5, 1.5, 0, 0, 0 ].
(d) Quantization of activations typically requires a data type with a greater range than quantization of weights.
(e) Bfloat numerical representation has approximately the same range but reduced precision compared to the IEEE float32 numerical representation.
4. (25 points) Accuracy/Performance Tradeoffs
In this problem, we study the key properties of the Pareto frontiers. Each point in this problem will represent a program with a different configuration.
(a) In Figures 1-3 identify the points that represent the optimal tradeoffs between the metrics for performance and result quality. Circle each Pareto-optimal point and connect them with lines. Hint: Pay attention to the axes in each problem (higher-better or lower better) as you search for the Pareto-optimal cost-benefit points.
(b) In Figure 4, three Pareto frontiers were generated using different search techniques A, B, C. Based on these results, pairwise compare each of these techniques and either identify a better one or explain why the comparison is inconclusive.
Figure 1: Speedup vs Accuracy Loss.
Figure 2: Execution time vs Sensitivity.
Figure 3: Energy Consumed vs Accuracy.
Figure 4: Comparing Search Methods.
Figure 5: The description of the probabilistically reliable system. Each box is the component. The arrows represent the data flow of results produced in earlier components and used in the following component(s).
5. (20 points; Extra Credit) Probabilistic Programming
Figure 5 presents a system consisting of components that work in parallel and serially.
Each component C1-C3, and C6-C8 has the same probability of failure pf = 0.1. The com ponents C4 and C9 have a probability of failure 0.01.
The monitoring systems C4 and C9 receive the results of the three components and select the correct result if at least one of the parallel components produced the correct result.
Write a probabilistic program in WebPPL that will help you compute the answers to the following questions:
(a) What is the probability that the system produces the correct result?
(b) What is the probability that the component C3 was reliable if we know that the final result was computed reliably?
(c) If the system failed to produce the correct result, which component should engineers inspect first?
Present the tables that the probabilistic program generates for these queries.
Hint: As a reminder, WebPPL is a probabilistic programming language embedded in JavaScript; you can run it from your web browser from http://webppl.org/.