CS131 Parallel and Distributed Systems
LAB 2: OpenMP (in C++)
Part A:
In this lab you will parallelize the same edge detection algorithm as in Lab 1, but using OpenMP instead of std::thread. The algorithm takes a grayscale image as input and outputs an image with edge outlines. You will be provided with a sequential skeleton program to give you a head start; parallelize it in the different ways described below.
Part A1:
Use OpenMP parallel for with static loop scheduling in the edge detection code. Record which chunks each OpenMP thread processed: store the thread ID and the starting row of each chunk in a variable, and print this information after the output is written to the PGM file (example: Thread 1 -> Processing Chunk starting at Row 50).
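The bookkeeping described above could be sketched as follows. This is only an illustration under stated assumptions: the function name record_static_chunks and its parameters are hypothetical and not part of the skeleton, the row loop is assumed to start at row 0, and the actual edge detection work is left as a comment.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not part of the skeleton): runs a row loop with
// static scheduling and returns, for each thread, the starting rows of
// the chunks that thread processed.
std::vector<std::vector<int>> record_static_chunks(int rows, int chunk,
                                                   int thread_count) {
    assert(rows >= 0 && chunk > 0 && thread_count > 0);
    // One slot per thread, so each thread writes only to its own vector.
    std::vector<std::vector<int>> starts(thread_count);

    #pragma omp parallel for schedule(static, chunk) num_threads(thread_count)
    for (int row = 0; row < rows; ++row) {
        int tid = 0;  // declared inside the loop, so it is private
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        // Assumption: the loop starts at row 0, so every chunk begins at a
        // multiple of `chunk`.
        if (row % chunk == 0) {
            starts[tid].push_back(row);
        }
        // ... edge detection for this row would go here ...
    }
    return starts;
}
```

After the PGM file has been written, the returned vectors can be walked sequentially to print lines such as "Thread 1 -> Processing Chunk starting at Row 50", which keeps all I/O outside the parallel region.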
Part A2:
Use dynamic loop scheduling for edge detection. OpenMP uses chunks in a similar way to how we used them in Lab 1. All chunks have the same size, except for the last chunk, which may be smaller. Again, print the information about the threads and the starting points of the chunks that they processed.
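To make the chunk arithmetic concrete, here is a small sketch that lists where each chunk starts and how many rows it covers. The helper name chunk_bounds is hypothetical, and the sketch assumes the loop runs over rows 0 to rows - 1.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: splits `rows` rows into chunks of `chunk` rows and
// returns (start_row, row_count) for each chunk.
std::vector<std::pair<int, int>> chunk_bounds(int rows, int chunk) {
    assert(rows >= 0 && chunk > 0);
    std::vector<std::pair<int, int>> bounds;
    for (int start = 0; start < rows; start += chunk) {
        // The last chunk is clipped to the rows that remain.
        bounds.emplace_back(start, std::min(chunk, rows - start));
    }
    return bounds;
}
```

For example, 170 rows with a chunk size of 70 give chunks starting at rows 0, 70, and 140, where the last one covers only 30 rows.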
Part A3:
Add SIMD processing to your A1 code (using the omp simd pragma) to parallelize the reduction loops using vector instructions. Number of threads and chunk size parameters are not needed in this part.
Part B (Analysis):
Report the performance on Openlab of your implementations from Parts A1, A2, and A3 of this lab. Be careful when timing your code: measure only the time spent processing the image. Use std::chrono::high_resolution_clock for timing the code (see the discussion).
Report the performance for all 5 test cases with 2, 4, and 8 threads. Do this for chunk sizes of 25 and 70.
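One possible way to time only the processing step is sketched below. The helper name time_ms is hypothetical; the idea is simply to read the clock immediately before and after the work to be measured, so that file reading and writing stay outside the measured interval.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in milliseconds. Only the call to `work` is timed.
template <typename Work>
double time_ms(Work&& work) {
    auto begin = std::chrono::high_resolution_clock::now();
    work();
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count();
}
```

The callable passed in would wrap just the edge detection loop, for example a lambda that calls the processing function on an image that has already been loaded.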
A work sharing construct:
#pragma omp parallel for schedule (type [,chunk]), private (list), shared (list)
Use schedule type STATIC for part A1 and DYNAMIC for part A2. Private variables in the list will be private to each thread.
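A minimal sketch of the work sharing construct with dynamic scheduling and explicit private and shared lists might look like this. The function name rows_to_threads_dynamic and the owner vector are hypothetical and serve only to show which thread handled which row.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: processes `rows` rows with dynamic scheduling and
// returns, for each row, the ID of the thread that handled it.
std::vector<int> rows_to_threads_dynamic(int rows, int chunk) {
    assert(rows >= 0 && chunk > 0);
    std::vector<int> owner(rows, -1);
    int tid = 0;  // declared outside the loop, so it is listed as private

    #pragma omp parallel for schedule(dynamic, chunk) private(tid) shared(owner)
    for (int row = 0; row < rows; ++row) {
#ifdef _OPENMP
        tid = omp_get_thread_num();
#else
        tid = 0;  // built without OpenMP: a single thread does all rows
#endif
        owner[row] = tid;  // each row is written by exactly one thread
    }
    return owner;
}
```

With dynamic scheduling, which thread picks up which chunk can change from run to run, but all rows within one chunk are handled by the same thread.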
#pragma omp simd reduction (op:variable)
Use the + operator as op, and as variable the name of the variable in which the result of the reduction is accumulated, to parallelize the loop with SIMD instructions.
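As an illustration of the simd reduction pragma, the sketch below sums element-wise products of two arrays. It assumes that the reduction loops in the edge detection code accumulate a sum of pixel values multiplied by weights; the function name weighted_sum and its parameters are hypothetical, and the skeleton's actual loops may look different.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: sums the products of two equally sized arrays, the
// kind of accumulation a convolution mask performs for one pixel.
int weighted_sum(const std::vector<int>& pixels, const std::vector<int>& mask) {
    assert(pixels.size() == mask.size());
    const int n = static_cast<int>(pixels.size());
    int sum = 0;
    // Each SIMD lane keeps a partial sum; the partial sums are combined
    // with + when the loop finishes.
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += pixels[i] * mask[i];
    }
    return sum;
}
```

Without the reduction clause, vectorizing the loop would not be safe, because every iteration updates the same variable sum.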
NOTE:
· Skeletons and the dataset for the lab are available on Canvas.
Compiling and Running OpenMP program:
To compile you must use gcc/local available from the module system.
$ module load gcc/local
To compile use the -std=c++11 -fopenmp flags like so:
$ g++ -std=c++11 -fopenmp Implementation.cpp -o Implementation
It is highly recommended to use the compilation flag -Wall to detect errors early, like so:
$ g++ -Wall -std=c++11 -fopenmp Implementation.cpp -o Implementation
To check for data races in your code, use the following flags (more about them in the discussion):
$ g++ -Wall -std=c++11 -fopenmp Implementation.cpp -o Implementation -fsanitize=thread -fPIE -pie -g
If there are any possible data races detected in the code, they will be shown as warnings on execution.
To set thread affinity so that each thread runs on a different core/processor, use the environment variables below (more about this in the discussion).
To set the number of threads:
- OMP_NUM_THREADS
To bind threads to specific CPUs when compiling with GCC:
- GOMP_CPU_AFFINITY
The variable should contain a space-separated or comma-separated list of CPUs. This list may contain different kinds of entries: either single CPU numbers in any order, a range of CPUs (M-N) or a range with some stride (M-N:S). CPU numbers are zero based. For example, GOMP_CPU_AFFINITY="0 3 1-2 4-15:2" will bind the initial thread to CPU 0, the second to CPU 3, the third to CPU 1, the fourth to CPU 2, the fifth to CPU 4, the sixth through tenth to CPUs 6, 8, 10, 12, and 14 respectively and then start assigning back from the beginning of the list. GOMP_CPU_AFFINITY=0 binds all threads to CPU 0.
Examples:
## Bind threads 0,2 to cores 0,1 and threads 1,3 to cores 8,9.
$ export OMP_NUM_THREADS=4
$ export GOMP_CPU_AFFINITY=0,8,1,9
## Bind threads 0,1,2,3 to cores 8,9,10,11 respectively.
$ export OMP_NUM_THREADS=4
$ export GOMP_CPU_AFFINITY=8-15
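To check your understanding of the list format, the following sketch expands an affinity list into the order in which CPUs are assigned to threads. The function expand_affinity is a hypothetical helper written for this illustration; it is not part of GCC or OpenMP, and it assumes a well-formed list with positive strides.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of GCC or OpenMP): expands an affinity list
// such as "0 3 1-2 4-15:2" into the CPU numbers in binding order.
std::vector<int> expand_affinity(std::string list) {
    for (char& c : list) {
        if (c == ',') c = ' ';  // treat commas like spaces
    }
    std::vector<int> cpus;
    std::istringstream entries(list);
    std::string entry;
    while (entries >> entry) {
        std::size_t dash = entry.find('-');
        std::size_t colon = entry.find(':');
        int first = std::stoi(entry.substr(0, dash));
        int last = first;  // a single CPU number is a range of length one
        int stride = 1;
        if (dash != std::string::npos) {
            last = std::stoi(entry.substr(dash + 1, colon - dash - 1));
        }
        if (colon != std::string::npos) {
            stride = std::stoi(entry.substr(colon + 1));
        }
        assert(stride > 0);
        for (int cpu = first; cpu <= last; cpu += stride) {
            cpus.push_back(cpu);
        }
    }
    return cpus;
}
```

Thread i is then bound to entry i of the expanded list, wrapping around to the beginning when there are more threads than entries.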
Useful OpenMP Functions:
- Use the OpenMP function omp_get_num_threads() to get the number of threads in use. Call this function inside a parallel region; outside of one it returns 1.
- To get the identifier of a running thread use the OpenMP function omp_get_thread_num(). The function returns an integer. You can assign the returned integer to a variable var_thread_id and later use the variable to control the execution flow of your program, e.g. if ( var_thread_id == 0 ){} else {}.
- You may need to create variables private to an OpenMP thread (e.g. var_thread_id). First declare the var_thread_id variable, then use the #pragma omp parallel for private(var_thread_id) directive. This gives each thread its own copy, so you do not need to worry about threads overwriting each other's var_thread_id.
- Do not use I/O statements inside the parallel region.
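The functions above can be tried out with a short sketch like the one below. The struct TeamSizes and the function query_team_sizes are hypothetical names used only for this illustration, and the function is assumed to be called from sequential code.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical result type: the team size seen outside and inside a
// parallel region.
struct TeamSizes {
    int outside;
    int inside;
};

// Hypothetical helper, assumed to be called from sequential code.
TeamSizes query_team_sizes() {
    TeamSizes sizes{1, 1};  // values used when built without OpenMP
#ifdef _OPENMP
    sizes.outside = omp_get_num_threads();  // sequential part: returns 1
    #pragma omp parallel
    {
        // Declared inside the region, so every thread has its own copy.
        int var_thread_id = omp_get_thread_num();
        if (var_thread_id == 0) {
            sizes.inside = omp_get_num_threads();  // only thread 0 writes
        }
    }
#endif
    return sizes;
}
```

The values are returned rather than printed inside the region, so any output can happen after the parallel region has finished.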
Submit via Canvas in 2 parts: YOU MUST USE the file names below:
1. Part A1, A2, A3 in the same ImplementationA.cpp file.
2. Part B as a PDF file named Analysis.pdf
Point Breakdown:
Part A Implementation: 70 pts,
Part B Analysis: 30 pts.
USEFUL LINKS:
http://en.wikipedia.org/wiki/OpenMP (very easy and useful)
https://computing.llnl.gov/tutorials/openMP/ (much more complex but complete)
http://openmp.org/mp-documents/omp-hands-on-SC08.pdf (very good tutorial)
http://openmp.org/mp-documents/OpenMP4.0.0.Examples.pdf (very detailed examples)
NOTE: The TA cannot solve the exercises for you, but he can answer your questions!