Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due
CSCI-UA.0480-051: Parallel Computing
Total: 100 points
Problem 1
a. [10 points] There is a problem with the following code. Please explain the problem in 1-2 lines. Can we fix this problem? If yes, show how (either by writing few lines of code or by explaining what you will do). If no, explain why not.
#pragma omp parallel for for
(int i = 1; i < N; i++)
{
A[i] = B[i] – A[i – 1]; //A and B are arrays of int
}
b. [5] Hyperthreading core can execute several instructions at the same time. Superscalar core can also execute several instructions at the same time. What is the main difference then between the two? The answer must not exceed 1-2 lines.
c. [5] Does a hyperthreading core need to be also superscalar? Explain.
d. [5] Does a superscalar core need to have branch prediction? Explain.
Problem 2
Assume the following three processes in MPI and the operations they do at each time step of the program. Also assume that the reduction operation is MPI_MIN. The ranks of the three processes (0, 1, and 2) are their ranks in MPI_COMM_WORLD, and all the APIs called in the table below use MPI_COMM_WORLD.
Time |
Process 0 |
Process 1 |
Process 2 |
0 |
x = 6; y = 7; |
x = 8; y = 9; |
x = 5; y = 6; |
1 |
MPI_Allreduce(&x, &y, …) |
MPI_Allreduce(&y, &x, …) |
MPI_Allreduce(&x, &y, …) |
2 |
non-MPI code not affecting x andy |
MPI_Send(&y, 1, MPI_INT, 2, 0, …) |
MPI_Recv(&y, 1, MPI_INT, 1, 0, …) |
3 |
MPI_Allreduce(&y, &x, …) |
MPI_Allreduce(&x, &y, …) |
MPI_Allreduce(&x, &y, …) |
|
Process 0 |
Process 1 |
Process 2 |
x |
|
|
|
y |
|
|
|
[6] After Step 2:
|
Process 0 |
Process 1 |
Process 2 |
x |
|
|
|
y |
|
|
|
[6] After Step 3:
|
Process 0 |
Process 1 |
Process 2 |
x |
|
|
|
y |
|
|
|
a. [5] Suppose we have three communicators, including MPI_COMM_WORLD. How many ranks does each of the above three processes have? Explain in no more than two lines.
b. [7] If we have a quad-core processor, can these three processes use all the cores at the same time? Justify in 1-2 lines. Assume each one of the four cores has superscalar capability but no hyperthreading capability. Of course, the code shown above for each process is just a small part of the full code of the process (i.e. the full code of each process is bigger than what is shown in this problem).
Problem 3
Suppose we have the following code snippet. Assume all variables and arrays have been declared and initialized earlier.
1. #pragma omp parallel |
2. { |
3. |
4. for (i = 1 ; i < M; i += 2 ) |
5. { |
6. D[i] = x * A[i] + x * B[i]; |
7. } |
8. |
9. #pragma omp for |
10. for (i = 0; i < N; i++) |
11. { |
12. C[i] = k * D[i]; |
13. } |
14. } // end omp parallel |
a. [10] Assume M = N. What do we have to write in line 3 above to get as much performance as possible from the above code? You must not have race condition and at the same time get the best performance. No need to write any explanations, just the line of code.
b. [10] Repeat problem a, but assume M = 2N.
c. [5] Based on your answer in b, do we need to write anything in line 8 to make the code even faster? If no, then just say no. If yes, then write the line you will include in line 8.
d. [5] If M = 16, N = 8, and the multicore on which we run the code has 16 cores. How many threads do you think OpenMP runtime will create when executing line 1? Justify in no more than two lines. Assume no other processes/threads are using the multicore except the above code.
Problem 4
[15] Choose one answer only from the multiple choices for each problem. If you pick more than one answer, you will lose the grade of the corresponding problem.
1. Suppose a block, when launching a GPU kernel, is composed of 33 threads. How many warps will be created for this block?
a. 1 b. 2 c. 4 d.5 e. Depends on the total number of blocks.
2. Which of the following is NOT an advantage of warps?
a. reduces complexity of the GPU hardware
b. reduces penalty of thread divergence
c. decouples the dependence of the number of threads in a block on the number of SPs in the SM.
3. If we have a local variable in the GPU kernel, it will surely go to registers.
a. True b. False c. Depends on the compute capability of the GPU.
4. More than one block from two different kernels can be assigned to the same SM at the same time.
a. True b. False c. Depends on the compute capability of the GPU.
5. If a critical section in OpenMP looks like { x += b;} what is the best mutual exclusion technique? Assume no other critical sections exist in the program.
a. lock b. critical section c. atomic
6. If a critical section in OpenMP looks like { x += c; y+=b; } what is the best mutual exclusion technique? Assume no other critical sections exist in the program.
a. lock b. critical section c. atomic
7. Sometimes, in OpenMP, we cannot know the number of critical sections in a program by just looking at the source code.
a. True b. False
8. Suppose we launch a kernel as follows: kernelname <<<4,4>>>(argument list …);
And suppose the kernel has a global variable x. How many instances of x will be created?
a. 16 b. 4 c. 8 d. 1 e. 2
9. An algorithm of O(n) can be faster than an algorithm of O(n2).
a. True b. False
10. An algorithm of O(n2) can be faster than an algorithm of O(n).
a. True b. False
11. Even though MPI is a distributed memory programming model, it can still run on a multicore processor, which is a shared memory hardware.
a. True b. False
12. Even though OpenMP is a shared memory programming model, it can still run on a distributed memory hardware system.
a. True b. False
13. Even though locks are faster than critical sections in OpenMP, sometimes we must use critical sections.
a. True b. False
14. Non-determinism exists in MPI, OpenMP, and CUDA.
a. True b. False
15. The less the synchronization points in your parallel program, the less sensitive your program will be to load-imbalance.
a. True b. False