CS131 Parallel and Distributed Systems​ LAB 3

Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due

CS131 Parallel and Distributed Systems

LAB 3:  MPI (C++)

NOTE: Skeletons will be provided for Part B only. Data files will be provided for all problems.

NOTE: Use MPI_Scatter to distribute the data.  

Part A (An image histogram using Tarry’s algorithm):

You are given two matrices, D and A. D is a grayscale image with 8-bit pixels. A is a matrix describing the DS connectivity (an adjacency bit matrix). Row i in the matrix describes nodes that node i is connected to (a 1 in position A[i,j] means node j have a direct, bidirectional link). A node can send messages only to its neighbors (as per the adjacency matrix), assume that otherwise it does not know how many nodes are in the system. 

You goal is to compute a “intensity” histogram of D in a distributed way. The computation is initiated by node 2. Use Tarry’s algorithm to visit every node and collect partial histograms of its neighbors.

Your implementation will take three command line arguments (in this order): Grayscale PGM image, path to text file with the adjacency matrix and output file name.

Use the same grayscale PGM images from Lab 1 and 2. You can use the code from those labs to read images into matrices.

A sample text file with adjacency matrix with 4 nodes will look as follows:

0

1

0

1

1

0

1

0

0

1

0

1

1

0

1

0

In this case, Node 1 is connected to 2 and 4, Node 2 is connected to 1 and 3, Node 3 is connected to 2 and 4, and Node 4 is connected to 1 and 3.

Histogram will calculate number of pixels that have intensities for every value between 0 and 255.

Print histogram output into the output file where each line will represent the # of pixels for each intensity. Therefore, the output file MUST have 256 lines only.

Part B1 (Word Frequency: Reduction):

Implement a simple and parallel version of an algorithm to count the word frequency using MPI.  To aggregate your results in the Master process you must use MPI_Reduce.

int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

Part B2 (Word Frequency: Ring topology):

Use the same implementation of the algorithm as PartB1 but instead of using MPI_reduce, pass the partial result between the tasks as messages. You will use a point-to-point communication between “adjacent” tasks, e.g. i and i-1.

Look at the provided skeleton for CL args and for choosing between part B1 and B2.

Timing (Part B1 and B2):

Report the timings for Part B1 and B2 with 2, 4, 8, and 16 processes. Be careful when timing your code, you want to time only the processing time, not reading the input. Only the Master process should time the execution, not the worker threads. Create a graph from that timing data and explain what they show.

Also explain:

1.     What is the topology used by MPI to implements its Reduce?

2.     How does the ring topology affect the runtime?

3.     Is there a case where the ring topology may be faster? 

4.     How many messages are passed between machines when program is run with 2, 4, 8, and 16 processes?

See below for the submission instructions. How to time the implementation will be discussed in the discussion.

NOTE: See the MPI How-To guide on the assignment web page before running your implementations on openlab servers.

Submit via Canvas:  YOU MUST USE the file names below

1.       Implementation of your C++ program: ImplementationA.cpp and ImplementationB.cpp (includes both B1 and B2)

2.       Analysis of the timing for Part B1 and B2: Timing.pdf

Point Breakdown:

Part A: Implementation -> 40pts

Part B1 and B2: Implementation -> 50pts (25pts each), Timing -> 10pts

USEFUL LINKS:

MPI Tutorial   

NOTE: The TA cannot solve the exercises for you, but he can answer your questions!

发表评论

电子邮件地址不会被公开。 必填项已用*标注