

CMPEN 431 (Fall 2024 Final Exam)

Question-1 (20 points):

While anti- and output dependences can be handled through renaming, true dependences (also called flow dependences) cannot. Observing this, researchers at ABC (a hardware company that implements the MIPS ISA) proposed a strategy named WNM (Who Needs Memory?). The goal of this strategy is to predict the value that a load operation would return and to continue execution with the predicted value; that is, the pipeline proceeds with the value predicted by WNM while the load operation still executes in the background. If, later in execution, the value read from memory (the ground truth) differs from the predicted one, the pipeline is flushed and execution restarts using the value read from memory.
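The core mechanism can be sketched in a few lines of Python (a toy model; the last-value table below is only one hypothetical way to fill in the "black box"):

```python
# Toy model of WNM-style load value prediction. A correct prediction lets
# the pipeline proceed without waiting for memory; a misprediction costs
# a pipeline flush. The predictor here is a hypothetical "last value" table.

def run_wnm(loads):
    """loads: list of (address, actual_value) pairs in program order.
    Returns (correct_predictions, flushes)."""
    table = {}                           # address -> last value seen
    correct = flushes = 0
    for addr, actual in loads:
        predicted = table.get(addr, 0)   # predict 0 on first touch
        if predicted == actual:
            correct += 1                 # pipeline never stalled
        else:
            flushes += 1                 # squash, restart with actual value
        table[addr] = actual             # train predictor in background
    return correct, flushes

# A load that keeps returning the same value predicts perfectly after the
# first access; a load whose value changes every time flushes every time.
stable   = [(0x100, 7)] * 5
changing = [(0x200, v) for v in range(5)]
print(run_wnm(stable))      # (4, 1): one cold miss, then all correct
print(run_wnm(changing))    # (1, 4): only the first (0) is predicted
```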

(a) Since WNM reads the value from the memory in the background, wouldn’t its performance be bounded by memory latency anyway, that is, why do these researchers think WNM will bring any performance benefits in practice?

(b) Explain the changes in the MIPS architecture needed to implement WNM, using a detailed diagram that shows all relevant components (e.g., pipeline registers, control signals, etc.). In your diagram, show the "prediction logic" of WNM as a "black box", and include the logic needed to handle mispredictions as well.

(c) Discuss three different strategies to implement the "black box" (prediction logic) in (b).

(d) Focusing on one of the strategies you came up with in (c), suggest a metric to quantify the effectiveness of WNM, and explain (using your metric) how well your strategy would work for the load instructions in the following high-level program code, assuming that both A and B are initialized to all 0s (note that the loop body contains 3 assignment statements):

K = 0;
DO i = 1 to 100
{ K = K + A(i) + B(i);  A(i) = A(i) + 1;  B(i) = B(i) + A(i) }
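The loop above can be instrumented to show what value each static load site actually returns on every iteration, which is exactly what a value predictor would observe (a sketch; the load-site labels such as "A1" are names invented here, and the load of K could well stay in a register in real code):

```python
# Execute the loop while recording the value returned by each static load
# site. A per-site "last value" predictor succeeds exactly when a site
# keeps returning the same value on every iteration.

N = 100
A = [0] * (N + 1)   # 1-indexed arrays, initialized to all 0s
B = [0] * (N + 1)
K = 0
seen = {}           # load site -> set of values that site ever returned

def load(site, value):
    seen.setdefault(site, set()).add(value)
    return value

for i in range(1, N + 1):
    K = load("K", K) + load("A1", A[i]) + load("B1", B[i])
    A[i] = load("A2", A[i]) + 1
    B[i] = load("B3", B[i]) + load("A3", A[i])

# Because A(i) and B(i) are still 0 when the first statement runs,
# K never changes; every load site returns one single value throughout.
print(seen)
```

Note the key property this exposes: although i advances through 100 different addresses, each static load site is perfectly predictable, since each one returns the same value on all 100 iterations.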

(e) In general, what types of programs would benefit most from WNM?

(f) Why do you think they named this approach WNM?

Question-2 (20 points):

Consider the complete memory hierarchy. You have a paged memory system. The processor has 256 MB of physical memory and a 2 GB virtual address space with 2 KB pages. The L1 cache is a 128 KB, 4-way set-associative cache with 64-byte cache lines. The L2 cache is a 2 MB, 16-way set-associative cache with 128-byte lines. Address translation is supported by a 4-entry TLB. Each page table entry is 32 bits. Both caches in the system employ the LRU policy for line replacement.

(a) Show the breakdown of the fields of virtual address and physical address. Also show the physical address as interpreted by the L1 cache, and the physical address as interpreted by the L2 cache. Clearly mark the number of bits in each address and the number of bits in each field.

(b) What is the number of L2 cache lines per page? What is the number of entries in the page table? What is the page table size in terms of pages?
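The field widths asked for in (a) and (b) all follow mechanically from the parameters stated above; a quick arithmetic check (a sketch):

```python
# Derive every address-field width and page-table figure from the stated
# parameters: 2 GB virtual space, 256 MB physical memory, 2 KB pages,
# 32-bit PTEs, 128 KB 4-way/64 B L1, 2 MB 16-way/128 B L2.
from math import log2

KB, MB, GB = 2**10, 2**20, 2**30

va_bits     = int(log2(2 * GB))      # virtual address width:  31 bits
pa_bits     = int(log2(256 * MB))    # physical address width: 28 bits
page_offset = int(log2(2 * KB))      # 11 bits
vpn_bits    = va_bits - page_offset  # 20 bits
ppn_bits    = pa_bits - page_offset  # 17 bits

def cache_fields(size, ways, line, addr_bits):
    """Return (offset, index, tag) widths for a set-associative cache."""
    sets = size // (line * ways)
    offset = int(log2(line))
    index = int(log2(sets))
    return offset, index, addr_bits - index - offset

l1 = cache_fields(128 * KB, 4, 64, pa_bits)    # (6, 9, 13)
l2 = cache_fields(2 * MB, 16, 128, pa_bits)    # (7, 10, 11)

pt_entries        = 2 ** vpn_bits              # one PTE per virtual page
pt_pages          = (pt_entries * 4) // (2 * KB)   # 4 MB table -> 2048 pages
l2_lines_per_page = (2 * KB) // 128            # 16 L2 lines per page

print(va_bits, pa_bits, vpn_bits, ppn_bits)    # 31 28 20 17
print(l1, l2)                                  # (6, 9, 13) (7, 10, 11)
print(pt_pages, l2_lines_per_page)             # 2048 16
```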

(c) If a program has the following cache line access pattern (in decimal):

0, 512, 2048, 2560, 3072, 3584, 4096, 1024, 3072, 4096, 3584, 2048,

indicate, for each access, whether it is an L1 hit/miss or L2 hit/miss (note that the addresses in the sequence are cache line addresses; they do not include the offset bits within a cache line).
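The sequence can be checked with a small LRU simulator (a sketch under two assumptions that the question leaves implicit: each decimal number is an address in units of 64-byte L1 lines, so the L2, whose lines are twice as large, looks up addr // 2; and the L2 is probed only on an L1 miss):

```python
# Small set-associative LRU cache simulator. With 512 L1 sets, every
# address in the sequence (all multiples of 512) maps to L1 set 0,
# so the 4-way L1 suffers conflict misses by design.
from collections import OrderedDict

class SetAssocCache:
    def __init__(self, sets, ways):
        self.sets = [OrderedDict() for _ in range(sets)]
        self.ways = ways
    def access(self, line):
        s = self.sets[line % len(self.sets)]
        if line in s:
            s.move_to_end(line)          # refresh LRU position
            return True
        if len(s) == self.ways:
            s.popitem(last=False)        # evict least recently used
        s[line] = None
        return False

l1 = SetAssocCache(sets=512, ways=4)     # 128 KB / (64 B * 4 ways)
l2 = SetAssocCache(sets=1024, ways=16)   # 2 MB / (128 B * 16 ways)

results = []
for a in [0, 512, 2048, 2560, 3072, 3584, 4096, 1024, 3072, 4096, 3584, 2048]:
    if l1.access(a):
        results.append("L1 hit")
    elif l2.access(a // 2):              # assumption: L2 line = 2 L1 lines
        results.append("L1 miss, L2 hit")
    else:
        results.append("L1 miss, L2 miss")
print(results)
```

Under these assumptions the first eight accesses are cold misses in both levels, the repeated 3072, 4096, and 3584 hit in L1, and the final 2048 (evicted from L1 by the conflict pattern) comes back as an L2 hit.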

(d) Is it possible for this system to address the L1 cache concurrently with the address translation provided by the TLB? Justify your answer using the address breakdowns shown in part (a). What would be the benefit of doing so?

Question-3 (20 points):

Consider a superscalar architecture with two pipelined units, one for load/store instructions and one for ALU instructions (including branches and jumps). The following two tables indicate the order of execution of two threads, A and B, and the latencies mandated by dependences between instructions. For example, A2 and A3 should execute after A1, there should be at least four cycles between the execution of A3 and A5, and at least two cycles between A4 and A5. In other words, the tables indicate the schedule when the instructions of each thread are executed with no multithreading.

time | Load/store pipeline | ALU pipeline
t    | A1                  |
t+1  | A3                  | A2
t+2  |                     |
t+3  | A4                  |
t+4  |                     |
t+5  |                     |
t+6  | A5                  | A6
t+7  | A7                  |


time | Load/store pipeline | ALU pipeline
t    | B1                  |
t+1  | B3                  | B2
t+2  | B4                  |
t+3  |                     |
t+4  | B6                  | B5
t+5  | B7                  |
t+6  | B8                  |
t+7  |                     |

Note that an instruction that executes on one pipeline (for example, A1 on the load/store pipeline) cannot execute on the other pipeline. Also, instructions cannot be issued out of order (OoO).

Show the execution schedule for the two threads on the two pipelines assuming the following multithreaded architectures:

(a) Fine grain multithreading (start from thread A, switch between threads A and B in every cycle)

time | Load/store pipeline | ALU pipeline
t    |                     |
t+1  |                     |
t+2  |                     |
t+3  |                     |
t+4  |                     |
t+5  |                     |
t+6  |                     |
t+7  |                     |
t+8  |                     |
t+9  |                     |
t+10 |                     |
t+11 |                     |
t+12 |                     |
t+13 |                     |
t+14 |                     |

(b) Simultaneous multithreading (with priority always given to thread A)

time | Load/store pipeline | ALU pipeline
t    |                     |
t+1  |                     |
t+2  |                     |
t+3  |                     |
t+4  |                     |
t+5  |                     |
t+6  |                     |
t+7  |                     |
t+8  |                     |
t+9  |                     |
t+10 |                     |
t+11 |                     |
t+12 |                     |
t+13 |                     |
t+14 |                     |

Question-4 (20 points):

A pipelined architecture has the following characteristics:

* The pipeline has 18 stages, each stage normally taking 1 cycle.

* The address of the branch target is computed in stage 9.

* The branch decision is made in stage 13.

* There is only an L1 cache; consequently, all L1 misses go directly to off-chip DRAM.

An architect, at computer company DEF, is considering two different enhancements to improve the execution of the branch instructions in this architecture:

Enhancement-1: With the help of additional logic, move the branch decision to the 8th stage and the branch address calculation to the 6th stage.

Enhancement-2: Employ a branch predictor with an accuracy of 85%. The predictor is accessed in stage 3.

These enhancements are iso-cost, and given a budget constraint, our architect can implement only one of these enhancements.

(a) Which of these proposed enhancements is better? Justify your answer quantitatively (use a simplified pipeline diagram if needed).
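A back-of-the-envelope comparison (a sketch; it assumes the cost of a branch is the number of bubble cycles before the correct next instruction is fetched, and that the stage-3 predictor still leaves 2 bubbles even when it predicts correctly):

```python
# Expected bubble cycles per branch under each option. With an N-stage
# resolution point, (N - 1) fetched-but-wrong (or stalled) slots are
# wasted before the correct instruction can be fetched.

base = 13 - 1                    # baseline: decision in stage 13 -> 12 bubbles
e1   = 8 - 1                     # Enhancement-1: decision in stage 8 -> 7 bubbles

acc = 0.85                       # Enhancement-2: 85% prediction accuracy
mispredict_penalty = 13 - 1      # wrong prediction still resolves in stage 13
correct_penalty    = 3 - 1       # predictor answers after stage 3 (assumption)
e2 = acc * correct_penalty + (1 - acc) * mispredict_penalty

print(base, e1, e2)              # 12 7 3.5 -> Enhancement-2 wins per branch
```

The exact numbers shift with the assumptions (e.g., whether a correct not-taken prediction costs anything), but the gap between 7 and roughly 3.5 bubbles per branch is what a quantitative answer to (a) should expose.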

(b) Another architect on the same team is focusing on memory accesses in this architecture and developing a new type of cache-like structure. This on-chip structure, called WNC (Who Needs Cache?), is like an L1 cache but is managed entirely in software (i.e., by program code or the compiler), and our architect is considering substituting WNC for the L1 cache. Explain what type of ISA support is needed to use WNC (recall that conventional L1 caches are entirely managed in hardware, i.e., data movements in and out of L1 are decided by hardware based on data access patterns).
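One possible shape of that ISA support, as a toy Python model (a sketch; the instruction names wnc_fill and wnc_evict are hypothetical, invented here to illustrate the idea of software-directed data movement):

```python
# Toy model of a software-managed on-chip buffer. Because hardware no
# longer decides what lives on chip, the ISA must expose explicit
# instructions that move lines between main memory and the WNC.

class WNC:
    def __init__(self, capacity_lines, memory):
        self.capacity = capacity_lines
        self.memory = memory      # backing store: line address -> data
        self.lines = {}           # on-chip contents, chosen by software

    def wnc_fill(self, addr):     # hypothetical ISA instruction
        if len(self.lines) >= self.capacity:
            raise RuntimeError("software must wnc_evict something first")
        self.lines[addr] = self.memory[addr]

    def wnc_evict(self, addr):    # hypothetical ISA instruction (writeback)
        self.memory[addr] = self.lines.pop(addr)

    def load(self, addr):
        # A load that misses the WNC is a software bug, not a hardware miss:
        # the compiler was responsible for filling the line beforehand.
        return self.lines[addr]

mem = {0: "a", 64: "b"}
w = WNC(capacity_lines=1, memory=mem)
w.wnc_fill(0)                    # compiler-inserted fill before the load
assert w.load(0) == "a"
w.wnc_evict(0)                   # compiler-inserted writeback to free space
w.wnc_fill(64)
```

The model makes the contrast with a hardware cache concrete: there are no tags, no miss handling, and no replacement logic in hardware; all placement and replacement decisions become compiler-scheduled instructions.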

(c) What are the advantages and drawbacks of the WNC architecture with respect to a conventional L1 cache? List the advantages and drawbacks explicitly.

(d) When WNC was tested, it showed mixed results with respect to conventional L1 caches (i.e., for some data access patterns WNC outperformed the L1 cache, whereas the L1 cache performed better for some other access patterns). Motivated by this observation, another architect in the same company came up with the idea of accommodating both L1 and WNC in the same architecture. Explain, for this new architecture, how data lookup, placement, and replacement can be implemented, i.e., how will data flow among WNC, the L1 cache, and main memory?

Question-5 (20 points):

Observing the staggering costs associated with the production of their computer chips, a startup company GHI is proposing the radical idea of building chips that are not 100% reliable. Specifically, in this new architecture, named WNR (Who Needs Reliability?), any bit in the system can flip at any time for no reason (i.e., it can go from 1 to 0, or vice versa).

(a) For each of the following scenarios, explain the worst possible outcome from the perspective of the program code, i.e., what can happen in the worst case:

(i) A bit in branch predictor is changed

(ii) A bit in L1 cache is changed (consider bit changes in tag array and data array separately)

(iii) A bit in one of the pipeline registers is changed

(iv) A bit in one of the registers is changed

(v) A bit in one of the entries of a reservation station in an out-of-order (OoO) machine is changed
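A toy illustration of why a single flipped bit can matter very differently depending on where it lands (a sketch; the structure the bit belongs to determines whether the worst case is a performance loss or silent data corruption):

```python
# Flipping bit k of a binary value changes it by exactly 2**k. The same
# physical event has very different consequences depending on whether the
# bit encodes data, an address tag, an opcode, or a prediction.

def flip_bit(value, k):
    return value ^ (1 << k)

data = 100
print(flip_bit(data, 0))    # 101: an off-by-one in a data value
print(flip_bit(data, 30))   # 1073741924: a huge change from one bit

# Contrast for part (a): a flipped bit in a branch predictor only changes
# a *prediction* (the pipeline recovers, costing cycles), whereas the same
# flip in a register, a tag, or a pipeline latch silently corrupts
# architectural state and can change program results.
```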

(b) Can you imagine any scenario in which the WNR architecture would be useful/desirable? Elaborate.

Question-6 (20 points):

Assume a 4-core system, where each core is 2-way SMT and has a private L1 cache, followed by a shared bus to main memory. Assume that all cache lines (blocks) are initially invalid. Using the MSI protocol, simulate the coherence state of cache lines FOO and BAR. Assume that threads 0 and 1 are mapped to core 0, threads 2 and 3 are mapped to core 1, thread 4 is mapped to core 2, and thread 5 is mapped to core 3. Variables A and B are mapped to cache line FOO, and variables C and D are mapped to cache line BAR.
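A minimal MSI state machine for a single line across the four private L1s (a sketch; the access trace at the bottom is hypothetical, since the exam's actual per-thread access sequence is not reproduced here):

```python
# MSI coherence for one cache line over a snooping bus.
# States per cache: 'M' (modified), 'S' (shared), 'I' (invalid).
# A read miss issues BusRd (an M owner downgrades to S);
# a write issues BusRdX/invalidate (all other copies go to I).

CORE_OF = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 3}   # thread -> core mapping

def msi(accesses, n_cores=4):
    state = ['I'] * n_cores          # per-core state of this line
    history = []
    for thread, op in accesses:      # op is 'R' or 'W'
        c = CORE_OF[thread]
        if op == 'R':
            if state[c] == 'I':      # read miss -> BusRd
                for o in range(n_cores):
                    if o != c and state[o] == 'M':
                        state[o] = 'S'   # owner writes back, downgrades
                state[c] = 'S'
        else:                        # write -> BusRdX, invalidate others
            for o in range(n_cores):
                if o != c:
                    state[o] = 'I'
            state[c] = 'M'
        history.append(tuple(state))
    return history

# Hypothetical trace for line FOO: thread 0 writes A, thread 4 reads A,
# thread 5 writes B (A and B share the line, so they share one state).
print(msi([(0, 'W'), (4, 'R'), (5, 'W')]))
```

Note that threads 0 and 1 (and likewise 2 and 3) share a core and hence a cache, so accesses from SMT siblings never cause coherence traffic between each other; only cross-core accesses change states.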

