CSE590: Supercomputing Homework #1
Task 1. [ 250 Points ] Multithreading Floyd-Warshall’s All-Pairs Shortest Path (APSP) Algorithm.
Consider a directed graph G = (V, E), where V = {v1, v2, . . . , vn}, and each edge (vi, vj) is labeled with a real-valued length l(vi, vj). If (vi, vj) ∉ E, l(vi, vj) is assumed to be +∞. The length of a path is defined as the sum of the lengths of the edges on the path, taken in order. For each pair vi, vj ∈ V, δ[i, j] is defined to be the smallest of the lengths of all paths going from vi to vj. We assume that G does not have any negative-length cycle.
Floyd-Warshall’s algorithm for computing the length of the shortest path between every pair of vertices of G has a standard iterative implementation, shown in Figure 1. The input is an n × n matrix D[1 . . . n, 1 . . . n] in which, for every pair i, j ∈ [1, n], D[i, j] is initialized to l(vi, vj). After the algorithm terminates, D[i, j] = δ[i, j] for all i, j ∈ [1, n].
In this task we will consider a number of recursive divide-and-conquer implementations of Floyd-Warshall’s algorithm, shown in Figures 2, 4 and 5. For simplicity we will assume n to be a power of 2. Each of these implementations is launched by calling AFW (X, X, X), where X points to D[1 . . . n, 1 . . . n] initialized with the edge lengths of G as described in the previous paragraph. In general, for each F ∈ {A, B, C, D}, function FFW / Floop-FW accepts X ≡ D[i1 . . . i2, j1 . . . j2], U ≡ D[i1 . . . i2, k1 . . . k2] and V ≡ D[k1 . . . k2, j1 . . . j2] as inputs, where i1, i2, j1, j2, k1, k2 ∈ [1, n] with i2 − i1 = j2 − j1 = k2 − k1 ≥ 0. Each update D[i, j] ← min (D[i, j], D[i, k] + D[k, j]) applied by FFW (X, U, V ) / Floop-FW (X, U, V ) updates D[i, j] ∈ X using D[i, k] ∈ U and D[k, j] ∈ V , where i1 ≤ i ≤ i2, j1 ≤ j ≤ j2 and k1 ≤ k ≤ k2. To reduce the overhead of recursion, each recursive function FFW (X, U, V ) switches to an iterative function Floop-FW (X, U, V ), shown in Figure 3, once the problem size becomes small, yet still large enough that the overhead of recursing down to that size does not dominate the cost of solving it. We want to avoid all determinacy races in this task.
Loop-Floyd-Warshall-APSP(D, n)
1. for k ← 1 to n do
2. for i ← 1 to n do
3. for j ← 1 to n do
4. D[i, j] ← min (D[i, j], D[i, k] + D[k, j])
Figure 1: [Iterative Implementation] Looping code implementing Floyd-Warshall’s APSP algorithm.
(a) [ 15 Points ] Figure 1 shows the standard serial iterative implementation of Floyd-Warshall’s APSP algorithm. Explain how you would parallelize this implementation by only replacing the serial for loops with parallel for loops. Compute the work, span and parallelism of your parallel implementation.
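For concreteness, a minimal sketch of what such a loop-level parallelization might look like in Cilk is given below; it is only an illustration of the construct, not the analysis the question asks for. The flat row-major layout and the function name are illustrative assumptions, not prescribed by the assignment. The outer k loop is kept serial because phase k depends on the results of phase k − 1. Note also that in phase k the entries of row k and column k are rewritten with their existing values (since D[k, k] ≥ 0 when there is no negative cycle); whether this constitutes a determinacy race, and how to eliminate it, is part of what you should reason about.

    #include <cilk/cilk.h>
    #include <algorithm>

    // Illustrative loop-parallel Floyd-Warshall: D is an n x n row-major
    // matrix of path lengths. The k loop stays serial; the i and j loops
    // become parallel loops (replace cilk_for by for to recover Figure 1).
    void loop_fw_cilk(double *D, int n) {
        for (int k = 0; k < n; ++k) {
            cilk_for (int i = 0; i < n; ++i) {
                cilk_for (int j = 0; j < n; ++j) {
                    D[i * n + j] = std::min(D[i * n + j],
                                            D[i * n + k] + D[k * n + j]);
                }
            }
        }
    }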
AFW (X, U, V ) {possibly X ≡ U ≡ V }
1. if X is an m × m matrix then Aloop-FW (X, U, V ) else
2. AFW (X11, U11, V11)
3. parallel: AFW (X12, U11, V12), AFW (X21, U21, V11)
4. AFW (X22, U21, V12)
5. AFW (X22, U22, V22)
6. parallel: AFW (X21, U22, V21), AFW (X12, U12, V22)
7. AFW (X11, U12, V21)
Figure 2: [Recursive Implementation 1] The initial call is AFW (X, X, X), where X points to D[1 . . . n, 1 . . . n] and n is assumed to be a power of 2. By X11, X12, X21 and X22 we denote the top-left, top-right, bottom-left and bottom-right quadrants of X, respectively.
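The sequencing and the "parallel:" steps of Figure 2 map naturally onto cilk_spawn / cilk_sync. Below is a minimal sketch of that skeleton under some illustrative assumptions that are not part of the assignment: D is a flat row-major array, a submatrix is identified by the coordinates of its top-left corner, and A_loop_fw is the Figure 3 base case (a sketch of it follows Figure 3).

    #include <cilk/cilk.h>

    // Illustrative only: a submatrix of the global n x n matrix D is named
    // by its top-left corner (r, c); all submatrices here are size x size.
    struct Sub { int r, c; };

    void A_loop_fw(double *D, int n, Sub X, Sub U, Sub V, int size);  // Figure 3 base case

    void A_fw(double *D, int n, Sub X, Sub U, Sub V, int size, int m) {
        if (size <= m) { A_loop_fw(D, n, X, U, V, size); return; }
        int h = size / 2;
        // Quadrant (q, p) of a submatrix S starts at (S.r + q*h, S.c + p*h),
        // so quad(S,0,0) = S11, quad(S,0,1) = S12, quad(S,1,0) = S21, quad(S,1,1) = S22.
        auto quad = [&](Sub S, int q, int p) { return Sub{S.r + q * h, S.c + p * h}; };

        A_fw(D, n, quad(X,0,0), quad(U,0,0), quad(V,0,0), h, m);            // step 2
        cilk_spawn A_fw(D, n, quad(X,0,1), quad(U,0,0), quad(V,0,1), h, m); // step 3
        A_fw(D, n, quad(X,1,0), quad(U,1,0), quad(V,0,0), h, m);
        cilk_sync;
        A_fw(D, n, quad(X,1,1), quad(U,1,0), quad(V,0,1), h, m);            // step 4
        A_fw(D, n, quad(X,1,1), quad(U,1,1), quad(V,1,1), h, m);            // step 5
        cilk_spawn A_fw(D, n, quad(X,1,0), quad(U,1,1), quad(V,1,0), h, m); // step 6
        A_fw(D, n, quad(X,0,1), quad(U,0,1), quad(V,1,1), h, m);
        cilk_sync;
        A_fw(D, n, quad(X,0,0), quad(U,0,1), quad(V,1,0), h, m);            // step 7
    }

The skeletons of Figures 4 and 5 follow the same pattern, with the spawned calls dispatching to the B, C and D variants instead.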
Floop-FW (X, U, V ) {F ∈ {A, B, C, D}}
1. let i1, i2, j1, j2, k1, k2 ∈ [1, n] with i2 − i1 = j2 − j1 = k2 − k1 ≥ 0 be indices such that X ≡ D[i1 . . . i2, j1 . . . j2], U ≡ D[i1 . . . i2, k1 . . . k2] and V ≡ D[k1 . . . k2, j1 . . . j2].
2. for k ← k1 to k2 do
3. for i ← i1 to i2 do
4. for j ← j1 to j2 do
5. D[i, j] ← min (D[i, j], D[i, k] + D[k, j])
Figure 3: Looping base case for the recursive implementation of Floyd-Warshall’s APSP algorithm.
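Under the same illustrative conventions as the skeleton above (flat row-major D, submatrices named by their top-left corners; these are assumptions, not part of the assignment), the looping base case of Figure 3 might be written as follows. Note that U shares its rows with X and V shares its columns with X, so only the k offsets differ.

    #include <algorithm>

    struct Sub { int r, c; };  // repeated here so this sketch stands alone

    // Figure 3 base case: update X = D[X.r .. X.r+size-1, X.c .. X.c+size-1]
    // using U (same rows as X, columns starting at U.c) and V (rows starting
    // at V.r, same columns as X). D is the global n x n row-major matrix.
    void A_loop_fw(double *D, int n, Sub X, Sub U, Sub V, int size) {
        for (int k = 0; k < size; ++k)
            for (int i = 0; i < size; ++i)
                for (int j = 0; j < size; ++j) {
                    double via_k = D[(U.r + i) * n + (U.c + k)]
                                 + D[(V.r + k) * n + (V.c + j)];
                    double &d = D[(X.r + i) * n + (X.c + j)];
                    d = std::min(d, via_k);
                }
    }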
(b) [ 60 Points ] Assuming m to be a (small) constant independent of n, compute the work, span and parallelism of each of the three recursive divide-and-conquer implementations of Floyd-Warshall’s APSP given in Figures 2, 4 and 5.
(c) [ 15 Points ] Observe that Aloop-FW , Bloop-FW , Cloop-FW , and Dloop-FW must all implement the same triply nested for loop shown in Figure 3. Explain, however, why Bloop-FW and Cloop-FW are more amenable to optimization than Aloop-FW , and why Dloop-FW is the most optimizable of the four, even when none of them is parallelized.
(d) [ 30 Points ] Parallelize Aloop-FW , Bloop-FW , Cloop-FW , and Dloop-FW using only parallel for loops. Compute the work, span and parallelism of each of those four functions.
(e) [ 40 Points ] Implement the standard iterative Floyd-Warshall’s APSP (from Figure 1) and the three recursive versions (from Figures 2, 4 and 5) using Cilk. Optimize the codes as much as possible. For each of the three recursive implementations find the value of m (the switching
AFW (X, U, V ) {possibly X ≡ U ≡ V }
1. if X is an m × m matrix then Aloop-FW (X, U, V ) else
2. AFW (X11, U11, V11)
3. parallel: BFW (X12, U11, V12), CFW (X21, U21, V11)
4. AFW (X22, U21, V12)
5. AFW (X22, U22, V22)
6. parallel: BFW (X21, U22, V21), CFW (X12, U12, V22)
7. AFW (X11, U12, V21)
BFW (X, U, V ) {X and U are disjoint, but possibly X ≡ V }
1. if X is an m × m matrix then Bloop-FW (X, U, V ) else
2. parallel: BFW (X11, U11, V11), BFW (X12, U11, V12)
3. parallel: BFW (X21, U21, V11), BFW (X22, U21, V12)
4. parallel: BFW (X21, U22, V21), BFW (X22, U22, V22)
5. parallel: BFW (X11, U12, V21), BFW (X12, U12, V22)
CFW (X, U, V ) {X and V are disjoint, but possibly X ≡ U}
1. if X is an m × m matrix then Cloop-FW (X, U, V ) else
2. parallel: CFW (X11, U11, V11), CFW (X21, U21, V11)
3. parallel: CFW (X12, U11, V12), CFW (X22, U21, V12)
4. parallel: CFW (X12, U12, V22), CFW (X22, U22, V22)
5. parallel: CFW (X11, U12, V21), CFW (X21, U22, V21)
Figure 4: [Recursive Implementation 2] The initial call is AFW (X, X, X), where X points to D[1 . . . n, 1 . . . n] and n is assumed to be a power of 2. By X11, X12, X21 and X22 we denote the top-left, top-right, bottom-left and bottom-right quadrants of X, respectively.
point) that gives you the best running time. Use n = 2^13, and try each power of 2 from 2^0 to 2^13 as the value of m. Produce a table or a graph for each recursive implementation showing how the running time varies as you change m.
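A possible shape for this experiment is sketched below: it sweeps the switching point m over powers of 2 and times only the APSP computation, re-initializing the matrix before each run. The names init_matrix and recursive_fw are placeholders for your own routines; nothing here is prescribed by the assignment.

    #include <chrono>
    #include <cstdio>

    void init_matrix(double *D, int n);          // placeholder: load edge lengths into D
    void recursive_fw(double *D, int n, int m);  // placeholder: one of Figures 2, 4 or 5

    // Hypothetical driver for the m-sweep: reports one wall-clock time per m.
    void sweep_switching_point(double *D, int n) {
        for (int m = 1; m <= n; m *= 2) {
            init_matrix(D, n);                          // excluded from the timing
            auto t0 = std::chrono::steady_clock::now();
            recursive_fw(D, n, m);                      // only the algorithm is timed
            auto t1 = std::chrono::steady_clock::now();
            double secs = std::chrono::duration<double>(t1 - t0).count();
            std::printf("m = %d: %.3f s\n", m, secs);
        }
    }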
(f) [ 20 Points ] Run all four of your implementations from part (e) on all cores. Plot the running times of each implementation as you vary n from 2^4 to 2^13 (i.e., consider only powers of 2).
AFW (X, U, V ) {X ≡ U ≡ V }
1. if X is an m × m matrix then Aloop-FW (X, U, V ) else
2. AFW (X11, U11, V11)
3. parallel: BFW (X12, U11, V12), CFW (X21, U21, V11)
4. DFW (X22, U21, V12)
5. AFW (X22, U22, V22)
6. parallel: BFW (X21, U22, V21), CFW (X12, U12, V22)
7. DFW (X11, U12, V21)
BFW (X, U, V ) {X and U are disjoint, but X ≡ V }
1. if X is an m × m matrix then Bloop-FW (X, U, V ) else
2. parallel: BFW (X11, U11, V11), BFW (X12, U11, V12)
3. parallel: DFW (X21, U21, V11), DFW (X22, U21, V12)
4. parallel: BFW (X21, U22, V21), BFW (X22, U22, V22)
5. parallel: DFW (X11, U12, V21), DFW (X12, U12, V22)
CFW (X, U, V ) {X and V are disjoint, but X ≡ U}
1. if X is an m × m matrix then Cloop-FW (X, U, V ) else
2. parallel: CFW (X11, U11, V11), CFW (X21, U21, V11)
3. parallel: DFW (X12, U11, V12), DFW (X22, U21, V12)
4. parallel: CFW (X12, U12, V22), CFW (X22, U22, V22)
5. parallel: DFW (X11, U12, V21), DFW (X21, U22, V21)
DFW (X, U, V ) {X, U and V are disjoint}
1. if X is an m × m matrix then Dloop-FW (X, U, V ) else
2. parallel: DFW (X11, U11, V11), DFW (X12, U11, V12), DFW (X21, U21, V11), DFW (X22, U21, V12)
3. parallel: DFW (X11, U12, V21), DFW (X12, U12, V22), DFW (X21, U22, V21), DFW (X22, U22, V22)
Figure 5: [Recursive Implementation 3] The initial call is AFW (X, X, X), where X points to D[1 . . . n, 1 . . . n] and n is assumed to be a power of 2. By X11, X12, X21 and X22 we denote the top-left, top-right, bottom-left and bottom-right quadrants of X, respectively.
(g) [ 20 Points ] For each of your four implementations from part (e) generate a Cilkview scalability plot by fixing n at 2^13.
(h) [ 40 Points ] Repeat parts (e)–(g) using OpenMP, but instead of a Cilkview scalability plot generate a standard strong scalability plot for n = 2^13.
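For the OpenMP port, the loop-level parallelism of part (a) can be expressed with a parallel-for pragma, as in the sketch below; the flat row-major layout and function name are again illustrative assumptions, and the same caveat about phase-k row and column updates applies as in the earlier Cilk sketch. The cilk_spawn / cilk_sync pairs of the recursive skeletons correspond to OpenMP task and taskwait constructs.

    #include <algorithm>

    // Illustrative OpenMP version of the loop-parallel iterative algorithm:
    // the k loop stays serial; the i and j loops are collapsed and parallelized.
    void loop_fw_openmp(double *D, int n) {
        for (int k = 0; k < n; ++k) {
            #pragma omp parallel for collapse(2) schedule(static)
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < n; ++j)
                    D[i * n + j] = std::min(D[i * n + j],
                                            D[i * n + k] + D[k * n + j]);
        }
    }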
(i) [ 10 Points ] Consider the recursive implementation from Figure 5. Can you improve its parallelism even further by using extra space for storing intermediate values the way we did for recursive matrix multiplication? Show your analysis.
APPENDIX 1: What to Turn in
One compressed archive file (e.g., zip, tar.gz) containing the following items.
– Source code, makefiles and job scripts.
– A PDF document containing all answers and plots.
APPENDIX 2: Things to Remember
– Please never run anything that takes more than a minute or uses multiple cores on TACC login nodes. TACC policy strictly prohibits such usage. They reserve the right to suspend your account if you do so. All runs must be submitted as jobs to compute nodes (even when you use Cilkview or PAPI).
– Please store all data in your work folder ($WORK), and not in your home folder ($HOME).
– When measuring running times please exclude the time needed for reading the input and writing the output. Measure only the time needed by the algorithm.