CSE638: Advanced Algorithms Homework #1
Serial-BFS( G, s, d )
(Inputs are an unweighted directed graph G with vertex set G[V ], and a source vertex s ∈ G[V ]. For any vertex u ∈ G[V ], Γ(u) denotes the set of vertices adjacent to u. The output will be returned in d, where for each u ∈ G[V ], d[u] will be set to the shortest distance (i.e., number of edges on the shortest path) from s to u.)
1. for each u ∈ G[V ] do d[u] ← +∞ {initialize all distances to +∞}
2. d[s] ← 0 {the source vertex is at distance 0}
3. Q ← ∅ {start with an empty FIFO queue Q}
4. Q.enque( s ) {enqueue the source vertex}
5. while Q ≠ ∅ do {iterate until the queue becomes empty}
6. u ← Q.deque( ) {dequeue the first vertex u from the queue}
7. for each v ∈ Γ(u) do {consider each vertex adjacent to u}
8. if d[v] = +∞ then {if that adjacent vertex v has not yet been visited}
9. d[v] ← d[u] + 1 {distance to v is 1 more than that to u}
10. Q.enque( v ) {enqueue v for future exploration}
Figure 1: Serial breadth-first search (BFS) on a graph.
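For concreteness, the following is a minimal C++ sketch of Serial-BFS, assuming the graph is stored as 0-based adjacency lists; the names (serialBFS, adj) are illustrative only and are not prescribed by this assignment. The line comments refer to the numbered lines of Figure 1.

#include <limits>
#include <queue>
#include <vector>

// Minimal serial BFS over a graph stored as 0-based adjacency lists.
// adj[u] lists the vertices adjacent to u; d[u] receives the BFS level of u,
// or "infinity" (here: the largest long long) if u is unreachable from s.
std::vector<long long> serialBFS( const std::vector< std::vector<int> > &adj, int s )
{
    const long long INF = std::numeric_limits<long long>::max( );
    std::vector<long long> d( adj.size( ), INF );   // line 1: initialize all distances to +infinity
    d[ s ] = 0;                                     // line 2: the source vertex is at distance 0
    std::queue<int> Q;                              // line 3: start with an empty FIFO queue
    Q.push( s );                                    // line 4: enqueue the source vertex
    while ( !Q.empty( ) ) {                         // line 5: iterate until the queue becomes empty
        int u = Q.front( ); Q.pop( );               // line 6: dequeue the first vertex u
        for ( int v : adj[ u ] )                    // line 7: consider each vertex adjacent to u
            if ( d[ v ] == INF ) {                  // line 8: v has not yet been visited
                d[ v ] = d[ u ] + 1;                // line 9: distance to v is 1 more than to u
                Q.push( v );                        // line 10: enqueue v for future exploration
            }
    }
    return d;
}

A vector-of-vectors adjacency list is convenient for a first implementation; a compressed (CSR-style) layout is more cache-friendly for the larger test graphs in Appendix 1.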
Task 1. [ 200 Points ] Parallel BFS with Work Stealing.
(a) [ 10 Points ] Implement the serial BFS algorithm (Serial-BFS) given in Figure 1.
(b) [ 5 Points ] Consider the parallel BFS algorithm (Parallel-BFS) given in Figure 2. Explain how the algorithm may give rise to race conditions, and why the output will still be correct.
(c) [ 15 Points ] Show that in any given iteration of the while loop in Parallel-BFS, the same vertex may appear multiple times in Qin, but not more than once in any given queue of Qin. Modify the algorithm so that exactly one of those multiple instances of the same vertex is expanded in lines 4–7 of Parallel-BFS-Thread. You are allowed to use only O (1) additional time per instance for this modification. Explain how and why your modification works. Also prove that the same vertex cannot appear in Qin in two different iterations of the while loop.
(d) [ 20 Points ] Let q_l be the total number of entries in all input queues (i.e., Qin.q[i]’s) at the start of the exploration of a particular BFS level l. Prove that w.h.p. in p, after the first Θ( p ( log p + log log ( q_l / Min-Steal-Size ) ) log ( q_l / Min-Steal-Size ) ) steal attempts in the entire system, all steal attempts will fail.
(e) [ 10 Points ] Explain why cp log p is a good choice for Max-Steal-Attempts, where c > 1 is a constant.
(f) [ 30 Points ] Observe that in any given BFS level each thread is either waiting to be launched, or exploring vertices, or trying to steal, or waiting at the sync point. Now for an input graph G of diameter D with n vertices, m edges, and maximum degree ∆, use the observation above to (upper) bound the parallel running time T_p of the entire algorithm. Assume that Max-Steal-Attempts = Θ (p log p), and the algorithm has been modified to avoid the exploration of duplicate vertices (following part (c)). Give a lower bound on the size of the input graph which makes the algorithm work-optimal.
(g) [ 15 Points ] Modify the algorithm so that ∆ disappears from T_p while the other terms remain unchanged. Such modifications are useful for handling scale-free graphs more efficiently; these graphs arise frequently in real-world scenarios (e.g., the web graph, social network graphs, biological interaction networks, etc.).
(h) [ 5 Points ] Consider lines 10–13 of Parallel-BFS (Figure 2), and suggest modifications to reduce the time a thread waits to be launched. How does that change the bounds you proved in parts (f) and (g)?
(i) [ 35 Points ] Implement the algorithms from parts (f) and (g) without the modification for avoiding duplicate exploration. Run on a machine with at least 8 cores (same for parts (j) and (k)), and empirically find and report the best value for Min-Steal-Size and also for Max-Steal-Attempts. Optimize your code as much as possible.
(j) [ 35 Points ] Create a table that compares the running times of your serial implementation from part (a) and the parallel implementations from part (i) using all cores. For each input file (in Appendix 1) create a separate row in the table showing the running times of all three algorithms as well as the speedup factors of the two parallel implementations w.r.t. (a).
(k) [ 20 Points ] Generate a strong scalability plot (see slides 15–18 of lecture 5) for each of the two parallel implementations from part (i) using the wikipedia graph (see Appendix 1) as input.
Task 2. [ Optional, No Point ] Lock-Free Parallel BFS.
(a) [ No Point ] Observe that in lines 9–17 of Parallel-BFS-Thread (Figure 3) the thief uses a lock to make its own queue unavailable to other thieves while it is trying to steal, and uses another lock to restrict access to its victim’s queue segment while a steal is in progress. Can you eliminate all locks from this algorithm without making it incorrect? You are not allowed to use any atomic instructions either. Prove the correctness of your algorithm, and also that it terminates.
(b) [ No Point ] Implement and optimize your algorithm from part (a), and repeat parts (j) and (k) of Task 1 with this implementation.
Parallel-BFS( G, s, d )
(Inputs are an unweighted directed graph G with vertex set G[V ], and a source vertex s ∈ G[V ]. For any vertex u ∈ G[V ], Γ(u) denotes the set of vertices adjacent to u. The output will be returned in d, where for each u ∈ G[V ], d[u] will be set to the shortest distance (i.e., number of edges on the shortest path) from s to u.)
1. parallel for each u ∈ G[V ] do d[u] ← +∞ {initialize all distances to +∞}
2. d[s] ← 0 {the source vertex is at distance 0}
3. p ← #processing cores
4. Qin ← collection of p empty FIFO queues Qin.q[1], Qin.q[2], . . . , Qin.q[p] {Qin will hold vertices in the current BFS level}
5. Qout ← collection of p empty FIFO queues Qout.q[1], Qout.q[2], . . . , Qout.q[p] {vertices in the next BFS level generated from Qin will be stored in Qout}
6. S ← collection of p segment pointers (global) {for 1 ≤ i ≤ p, S[i] will point to the queue segment currently being explored by thread i}
7. Qin.q[1].enque( s ) {start with BFS level 0 by enqueueing the source vertex}
8. while Qin ≠ ∅ do {iterate until Qin (i.e., the current BFS level) is empty}
9. parallel for i = 1 to p do S[i] ← entire Qin.q[i] as a single segment {thread i will start with Qin.q[i]}
10. for i = 1 to p − 1 do {from vertices in Qin generate vertices
11. spawn Parallel-BFS-Thread( i, G, Qout.q[i], d ) in the next BFS level by launching p − 1
12. Parallel-BFS-Thread( p, G, Qout.q[p], d ) threads concurrent to the current thread}
13. sync {wait until all vertices in Qin have been processed}
14. Qin ⇔ Qout {swap the roles of Qin and Qout}
15. Qout ← ∅ {empty the queues in Qout}
Figure 2: Parallel breadth-first search (BFS) on a graph.
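As an aside, lines 10–13 of Parallel-BFS map directly onto Cilk Plus spawn/sync keywords (the same dialect used by the checksum code in Appendix 1). The toy sketch below is illustrative only; levelWork merely stands in for Parallel-BFS-Thread, and p is hard-coded.

#include <cilk/cilk.h>
#include <cstdio>

// Stand-in for Parallel-BFS-Thread: in the real algorithm this explores S[i].
static void levelWork( int i ) { std::printf( "thread %d exploring its queue segment\n", i ); }

int main( )
{
    int p = 8;                              // line 3: #processing cores (hard-coded for this toy)
    for ( int i = 1; i <= p - 1; i++ )
        cilk_spawn levelWork( i );          // lines 10-11: launch p - 1 concurrent threads
    levelWork( p );                         // line 12: the current thread handles the p-th queue
    cilk_sync;                              // line 13: wait until the whole BFS level is done
    return 0;
}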
Parallel-BFS-Thread( i, G, Qo, d )
(Inputs are the id i ∈ [1, p] of the current thread, and an unweighted directed graph G with vertex set G[V ]. For any vertex u ∈ G[V ], Γ(u) denotes the set of vertices adjacent to u. For each such vertex v the correct BFS level will be stored in d[v]. All vertices in the next level of BFS discovered by the current thread will be put in the output queue Qo, which is a single queue used exclusively by the current thread. We assume that S[1 : p] is a globally accessible array of queue segment identifiers, where S[i] keeps track of the input queue segment currently being explored by thread i.)
1. while ( True ) do
2. while ( S[i] ≠ ∅ ) do {while the input segment is not empty}
3. u ← S[i].extract( ) {extract the next vertex from the segment}
4. for each v ∈ Γ(u) do {consider each vertex adjacent to u}
5. if d[v] = +∞ then {if that adjacent vertex v has not yet been visited}
6. d[v] ← d[u] + 1 {distance to v is 1 more than that to u}
7. Qo.enque( v ) {enqueue v in the output queue for future exploration}
8. t ← 0 {count number of steal attempts}
9. Lock( i ) {lock self until done with stealing}
10. while ( S[i] = ∅ ) and ( t < Max-Steal-Attempts ) do {try to steal ≤ Max-Steal-Attempts times}
11. r ← Rand( 1, p ) {pick a random victim}
12. if Try-Lock( r ) then {if able to secure exclusive access to the victim}
13. if ( |S[r]| > Min-Steal-Size ) then {if victim’s current queue segment is not too small}
14. S[i] ← second half of S[r] {steal second half of work from victim’s segment}
15. Unlock( r ) {give up exclusive access}
16. t ← t + 1 {done with one more steal attempt}
17. Unlock( i ) {make self available to thieves}
18. if ( S[i] = ∅ ) then break {terminate thread if all steal attempts fail}
Figure 3: Parallel breadth-first search (BFS) on a graph.
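To make the locking protocol in lines 8–17 of Parallel-BFS-Thread concrete, here is an illustrative-only C++ sketch using std::mutex. The Segment type, the array sizes, and all names are assumptions rather than part of the assignment; note also the added r != i guard, since std::mutex::try_lock on a lock the thread already holds is undefined behavior, whereas the pseudocode's Try-Lock on one's own lock would simply fail.

#include <mutex>
#include <random>
#include <vector>

// Hypothetical segment descriptor: thread i explores queue entries head..tail-1.
struct Segment {
    int head = 0, tail = 0;
    int size( ) const { return tail - head; }
};

constexpr int MAX_THREADS = 64;                  // illustrative upper bound on p (threads use indices 1..p)
std::vector<Segment>    S( MAX_THREADS );        // S[i]: segment currently owned by thread i
std::vector<std::mutex> lockOf( MAX_THREADS );   // lockOf[i]: guards S[i] (Lock/Try-Lock/Unlock of Figure 3)

// One stealing phase of thread i (lines 8-17 of Parallel-BFS-Thread).
void tryToSteal( int i, int p, int maxStealAttempts, int minStealSize )
{
    std::mt19937 gen( std::random_device{ }( ) );
    std::uniform_int_distribution<int> pick( 1, p );          // line 11: random victim in [1, p]
    int t = 0;                                                 // line 8: count steal attempts
    lockOf[ i ].lock( );                                       // line 9: lock self until done with stealing
    while ( S[ i ].size( ) == 0 && t < maxStealAttempts ) {    // line 10
        int r = pick( gen );
        if ( r != i && lockOf[ r ].try_lock( ) ) {             // line 12: skip self, then Try-Lock( r )
            if ( S[ r ].size( ) > minStealSize ) {              // line 13: victim's segment is not too small
                int mid = S[ r ].head + S[ r ].size( ) / 2;
                S[ i ] = { mid, S[ r ].tail };                  // line 14: thief takes the second half
                S[ r ].tail = mid;                              //          victim keeps the first half
            }
            lockOf[ r ].unlock( );                              // line 15: give up exclusive access
        }
        t = t + 1;                                              // line 16: one more attempt done
    }
    lockOf[ i ].unlock( );                                      // line 17: make self available to thieves
}

Holding lockOf[i] for the entire phase mirrors line 9 of the figure: it keeps other thieves from splitting S[i] while thread i is itself in the middle of acquiring new work.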
APPENDIX 1: Input/Output Format
Your code must read from standard input and write to standard output.
– Input Format: The first line will contain three integers giving the number of vertices (n), number of edges (m), and the number of source vertices (r), respectively. Each of the next m lines will contain two integers u and v (1 ≤ u, v ≤ n) denoting a directed edge from vertex u to vertex v. The edges will be sorted in nondecreasing order of the first vertex. Each of the next r lines will contain one integer s (1 ≤ s ≤ n) giving the index of a source vertex. (A minimal parsing sketch is given after this list.)
– Output Format: The output will consist of r lines – one for each source vertex. Line i (1 ≤ i ≤ r) will contain two integers d_i and c_i, where d_i is the maximum BFS level of a vertex from the i-th source vertex given in the input file, and c_i is the checksum value as computed by the following function.
// Computes a very simple checksum by adding all d values.
// When some d[ i ] = infinity, replaces that infinity with nVertices
// during checksum calculation, i.e., assumes that d[ i ] = nVertices.
unsigned long long computeChecksum( void ) {
    cilk::reducer_opadd< unsigned long long > chksum;
    cilk_for ( int i = 0; i < nVertices; i++ )
        chksum += ( d[ i ] < nVertices ) ? d[ i ] : nVertices;   // any d[ i ] >= nVertices represents infinity
    return chksum.get_value( );
}
– Sample Input/Output: /work/01905/rezaul/CSE638/HW1/samples on Lonestar.
– Test Input/Output (Table 1): /work/01905/rezaul/CSE638/HW1/turn-in on Lonestar.
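For the input format described above, a possible reading routine is sketched below; the names (Graph, readGraph) are illustrative and not required by the assignment. It packs the edges into a compressed adjacency (CSR-style) layout via a degree-counting pass.

#include <cstdio>
#include <utility>
#include <vector>

// Illustrative CSR-style container: out-neighbors of vertex u (1-based)
// occupy adj[ offset[u] .. offset[u+1] - 1 ].
struct Graph {
    int n = 0;
    std::vector<int> offset, adj;
};

// Reads "n m r", then m edges "u v", then r source vertices, from standard input.
Graph readGraph( std::vector<int> &sources )
{
    int n, m, r;
    if ( std::scanf( "%d %d %d", &n, &m, &r ) != 3 ) return Graph( );
    Graph G;
    G.n = n;
    G.offset.assign( n + 2, 0 );
    G.adj.resize( m );
    std::vector< std::pair<int,int> > edges( m );
    for ( int e = 0; e < m; e++ ) {
        int u, v;
        std::scanf( "%d %d", &u, &v );            // directed edge u -> v, vertices in [1, n]
        edges[ e ] = std::make_pair( u, v );
        G.offset[ u + 1 ]++;                      // count out-degree of u
    }
    for ( int u = 1; u <= n; u++ )
        G.offset[ u + 1 ] += G.offset[ u ];       // prefix sums give the row offsets
    std::vector<int> pos( G.offset.begin( ), G.offset.end( ) );
    for ( int e = 0; e < m; e++ )
        G.adj[ pos[ edges[ e ].first ]++ ] = edges[ e ].second;
    sources.resize( r );
    for ( int i = 0; i < r; i++ )
        std::scanf( "%d", &sources[ i ] );        // the r source vertices
    return G;
}

Buffering all m edges as pairs keeps the sketch short but is memory-hungry; for the billion-edge RMAT1B input a streaming fill (exploiting the sorted edge order) avoids the temporary edge array.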
Graph     | Description                                              | n     | m
----------|----------------------------------------------------------|-------|------
cage15    | DNA electrophoresis, 15 monomers in polymer              | 5.2M  | 99.2M
cage14    | DNA electrophoresis, 14 monomers in polymer              | 15.1M | 27.1M
freescale | Large circuit, Freescale Semiconductor                   | 3.4M  | 18.9M
wikipedia | Gleich/wikipedia-20070206                                | 3.6M  | 45M
kkt-power | Optimal power flow, nonlinear optimization (KKT)         | 2M    | 8.1M
RMAT100M  | RMAT graph generated using the Graph-500 RMAT generator  | 10M   | 100M
RMAT1B    | RMAT graph generated using the Graph-500 RMAT generator  | 10M   | 1B
Table 1: Input graphs with #vertices (n) and #edges (m). Note that 1M = 10^6 and 1B = 10^9.
APPENDIX 2: What to Turn in
One compressed archive file (e.g., zip, tar.gz) containing the following items.
– Source code, makefiles and job scripts.
– A PDF document containing all answers and plots.
– Output generated for the input files in /work/01905/rezaul/CSE638/HW1/turn-in/ on Lonestar. If the name of the input file is xxxxx-in.txt, please name the output files as xxxxx-1a-out.txt, xxxxx-1f-out.txt, xxxxx-1g-out.txt and xxxxx-2b-out.txt for tasks 1(a), 1(f), 1(g) and 2(b), respectively.
APPENDIX 3: Things to Remember
– Please never run anything that takes more than a minute or uses multiple cores on TACC login nodes. TACC policy strictly prohibits such usage. They reserve the right to suspend your account if you do so. All runs must be submitted as jobs to compute nodes (even when you use Cilkview or PAPI).
– Please store all data in your work folder ($WORK), and not in your home folder ($HOME).
– When measuring running times please exclude the time needed for reading the input and writing the output. Measure only the time needed by the algorithm.
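For instance, a minimal timing wrapper of the following sort (the names are illustrative) measures only the algorithm:

#include <chrono>

// Runs the BFS implementation being measured and returns its elapsed time in
// seconds, excluding input parsing and output writing (which happen elsewhere).
template < typename F >
double timeAlgorithm( F runBFS )
{
    auto t0 = std::chrono::steady_clock::now( );
    runBFS( );                                                    // only the algorithm itself
    auto t1 = std::chrono::steady_clock::now( );
    return std::chrono::duration< double >( t1 - t0 ).count( );
}

For example, double secs = timeAlgorithm( [&]( ) { serialBFS( adj, s ); } ); called after the input has been read and before the output is written, using the serialBFS sketch from earlier.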