Wednesday, December 4, 2013

Exercise 4.9

This Exercise is Half Done, Please Ignore

To answer the following questions we will be using the ProcSim simulator.

r1-r4 are mapped to the temporary registers t1-t4, respectively
r16 is mapped to s7

This is my sample asm code for running my simulation:

#init registers
.register $t1 1
.register $t2 2
.register $t3 3
.register $t4 4
.register $s7 200

main: sw $t4, -100($s7)
  slt $t1, $t2, $t3
exit:

4.9.1 What is the value of the instruction word?






a)
10101110111011001111111110011100
op     rs    rt    immediate
101011-10111-01100-1111111110011100
(-100 as a 16-bit two's-complement immediate is 1111111110011100)
b)
00000001010010110100100000101010
op     rs    rt    rd    shamt funct
000000-01010-01011-01001-00000-101010
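
To sanity-check these encodings, here is a small Python sketch (my own check, not part of ProcSim) that assembles the two instructions into their 32-bit words. The register numbers follow the standard MIPS numbering ($t1=9, $t2=10, $t3=11, $t4=12, $s7=23).

def i_type(op, rs, rt, imm):
    # I-format: op(6) | rs(5) | rt(5) | immediate(16, two's complement)
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def r_type(op, rs, rt, rd, shamt, funct):
    # R-format: op(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

sw_word  = i_type(0b101011, rs=23, rt=12, imm=-100)                       # sw  $t4, -100($s7)
slt_word = r_type(0b000000, rs=10, rt=11, rd=9, shamt=0, funct=0b101010)  # slt $t1, $t2, $t3

print(f"{sw_word:032b}")   # 10101110111011001111111110011100
print(f"{slt_word:032b}")  # 00000001010010110100100000101010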

4.9.2 What is the register number supplied to the register file’s “Read register 1” input? Is this register actually read? How about “Read register 2”?

a) 10111 (the rs field, $s7/r16), as shown in the machine-code breakdown above. Yes, it is actually read: SW needs the base register to form the address. "Read register 2" is supplied with 01100 (the rt field, $t4/r4), which is also read, since SW has to read the value it stores.

b) 01010 (the rs field, $t2/r2), which is read. "Read register 2" is supplied with 01011 (the rt field, $t3/r3), which is also read; both are operands of the SLT comparison.

4.9.3 What is the register number supplied to the register file’s “Write register” input? Is this register actually written?

a) 01100 (the rt field), as shown in the machine-code breakdown above. The register is not actually written, since RegWrite is deasserted for SW.

b) 01001 (the rd field, $t1/r1). Yes, it is written with the 0/1 result of SLT.
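
As a quick illustration of where those numbers come from, this Python snippet (same assumptions as the sketch above) pulls the rs/rt/rd fields back out of the two instruction words; rs feeds "Read register 1", rt feeds "Read register 2", and the RegDst mux picks rt or rd for "Write register".

def reg_fields(word):
    return {
        "rs": (word >> 21) & 0x1F,  # goes to "Read register 1"
        "rt": (word >> 16) & 0x1F,  # goes to "Read register 2" (and "Write register" when RegDst=0)
        "rd": (word >> 11) & 0x1F,  # goes to "Write register" when RegDst=1
    }

print(reg_fields(0b10101110111011001111111110011100))  # sw:  rs=23, rt=12 (rd=31 is just immediate bits)
print(reg_fields(0b00000001010010110100100000101010))  # slt: rs=10, rt=11, rd=9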

Exercise 4.7




4.7.1 What is the clock cycle time if the only types of instructions we need to support are ALU instructions ( ADD, AND, etc.)?
a) 200+90+20+90+20 = 420ps
b) 750+300+50+250+50 = 1400ps

4.7.2 What is the clock cycle time if we only have to support LW instructions?
a) 200+90+20+90+250+20 = 670ps
b) 750+300+50+250+500+50 = 1900ps

4.7.3 What is the clock cycle time if we must support ADD, BEQ, LW, and SW instructions?

The cycle times are the same as above (670 ps for a, 1900 ps for b); adding branch support does not lengthen the critical path. For implementation a, the branch-target hardware (sign-extend, shift-left-2, add, mux) adds up to 15+10+70+20 = 115 ps, which is less than the data memory latency, so LW remains the longest path.
Implementation b works out the same way: 100+5+200+20 = 325 ps, again less than its data memory latency.
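
Here is a minimal Python sketch of the clocking argument: the single-cycle clock has to cover the slowest instruction path, so the cycle time is the max of the path sums. The addends are the ones used in the answers above; the component labels for implementation a follow the breakdowns in this post, and anything beyond that is my assumption for illustration.

def cycle_time(paths):
    # The clock period is set by the slowest supported instruction path.
    return max(sum(latencies) for latencies in paths.values())

impl_a = {
    "ALU": [200, 90, 20, 90, 20],       # I-Mem, Regs, Mux, ALU, Mux (4.7.1)
    "LW":  [200, 90, 20, 90, 250, 20],  # same path with D-Mem added (4.7.2)
}
impl_b = {
    "ALU": [750, 300, 50, 250, 50],
    "LW":  [750, 300, 50, 250, 500, 50],
}

print(cycle_time(impl_a))  # 670 ps  -> LW sets the clock
print(cycle_time(impl_b))  # 1900 ps -> LW sets the clock

# 4.7.3: the branch-target hardware is shorter than the memory access it
# replaces on the path, so BEQ never becomes the longest instruction.
print(sum([15, 10, 70, 20]))   # 115 ps for implementation a
print(sum([100, 5, 200, 20]))  # 325 ps for implementation b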

For the remaining problems in this exercise, assume that there are no pipeline stalls and that the breakdown of executed instructions is given by the instruction mix in the exercise.
For these problems I am going to refer to the instruction-mix chart from OpenCourseWare.
4.7.4 In what fraction of all cycles is the data memory used?
Data memory is used by SW and LW, since those are the instructions that write to and read from memory.
a) 25+10 = 35%
b) 30+20 = 50%
4.7.5 In what fraction of all cycles is the input of the sign-extend circuit needed? What is this circuit doing in cycles in which its input is not needed?
In the hardwired control table, ExtSel (the control signal for the sign extender) is asserted for ALUi, ALUiu, LW, SW, and BEQ. The only instruction that doesn't use it is ADD, because ADD operates entirely on register values and has no constant or immediate associated with it.
a) 20%
b) 30%
Normally a control signal would tell the resource whether to activate, but in the figure associated with these problems no such signal exists, so we must assume the sign-extend circuit operates every cycle. In the cycles where its input is not needed, it still performs its operation, but the output is simply ignored.

4.7.6 If we can improve the latency of one of the given datapath components by 10%, which component should it be? What is the speedup from this improvement?
There are two prime contenders here.  The first is Instruction memory, since it is used every cycle.  The second is Data Memory, since it has the longest latency.

We will assume 100 instructions.

a)
I-Mem: 200 ps, used by 100% of instructions
D-Mem: 250 ps, used by 35% of instructions

I-Mem improved by 10%: 200 * 0.9 = 180 ps; 180 * 100 = 18000
D-Mem improved by 10%: 250 * 0.9 = 225 ps; 225 * 35 = 7875

I-Mem baseline: 200 * 100 = 20000, so 20000 - 18000 = 2000 ps saved
D-Mem baseline: 250 * 35 = 8750, so 8750 - 7875 = 875 ps saved

For a, the component to improve would be the Instruction memory.

b)
I-Mem: 750 ps, used by 100% of instructions
D-Mem: 500 ps, used by 50% of instructions

For this one, instruction memory is both the highest-latency component and the one used by every instruction, so without even doing the math it is the one that gives the greatest improvement. Here is the math anyway:

I-Mem improved by 10%: 750 * 0.9 = 675 ps; 675 * 100 = 67500
I-Mem baseline: 750 * 100 = 75000, so 75000 - 67500 = 7500 ps saved
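
The comparison above boils down to "latency saved per use, times how often the component is used." Here is a hedged Python sketch of that calculation, using the 100-instruction assumption and the usage fractions from 4.7.4 (the 50% D-Mem line for implementation b is my own addition, following the same method):

def savings(latency_ps, usage_fraction, instructions=100, improvement=0.10):
    # Total latency removed: 10% of the component's latency, counted only on
    # the instructions that actually use the component.
    return latency_ps * improvement * usage_fraction * instructions

# Implementation a
print(savings(200, 1.00))  # I-Mem: 2000 ps saved per 100 instructions
print(savings(250, 0.35))  # D-Mem:  875 ps saved per 100 instructions

# Implementation b
print(savings(750, 1.00))  # I-Mem: 7500 ps saved per 100 instructions
print(savings(500, 0.50))  # D-Mem: 2500 ps saved per 100 instructions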

Monday, December 2, 2013

Exercise 4.2


4.2.1 Which existing blocks (if any) can be used for this instruction?
a) SEQ is a Boolean operation that returns 1 (true) if the two registers are equal and 0 (false) otherwise.
Existing blocks: register file, MUX, ALU.
b) LWI loads the contents of the memory location whose address is the sum of two register values.
Existing blocks: register file, MUX, ALU, data memory.
Figure Below


4.2.2 Which  new  functional  blocks  (if  any)  do  we  need  for  this instruction?
a) a MUX after the ALU's Zero output, to select the Boolean 0 or 1 result
b) nothing
Figure Below


4.2.3 What new signals do we need (if any) from the control unit to support this instruction?
a) a new control signal to operate the new MUX
b) nothing
Figure Below


In the following three problems, assume that we are starting with the datapath from Figure 4.2, where the I-Mem, Add, Mux, ALU, Regs, D-Mem, and Control blocks have latencies of 400 ps, 100 ps, 30 ps, 120 ps, 200 ps, 350 ps, and 100 ps, respectively, and costs of 1000, 30, 10, 100, 200, 2000, and 500, respectively.
Costs:
Instruction memory: 1000
Registers: 200
ALU: 100
Data memory: 2000
Add: 30 * 2 = 60
Mux: 10 * 3 = 30
Control: 500

Total cost: 3890
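
A quick Python check of that total, using the unit costs listed in the problem and the block counts above (two adders, three muxes):

unit_cost = {"I-Mem": 1000, "Add": 30, "Mux": 10, "ALU": 100,
             "Regs": 200, "D-Mem": 2000, "Control": 500}
count = {"I-Mem": 1, "Add": 2, "Mux": 3, "ALU": 1,
         "Regs": 1, "D-Mem": 1, "Control": 1}

total_cost = sum(unit_cost[block] * count[block] for block in unit_cost)
print(total_cost)  # 3890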


4.2.4 What is the clock cycle time with and without this improvement?
Critical Path is PC->Instruction Mem->Registers->Mux->ALU->D-Mem->Mux
400+200+30+120+350+30 = 1130ps


a. ALU latency +300: 1130+300 = 1430ps
b. Control latency +100 ps: at 200 ps the control block now matches the registers' latency and still sits off the critical path, so the cycle time stays at 1130 ps.
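
A small sketch of the cycle-time sums, following the critical path listed above; only the ALU latency changes in case a, and the Control block sits off this path:

def critical_path(i_mem=400, regs=200, mux=30, alu=120, d_mem=350):
    # PC -> Instruction Mem -> Registers -> Mux -> ALU -> D-Mem -> Mux
    return i_mem + regs + mux + alu + d_mem + mux

print(critical_path())               # 1130 ps baseline
print(critical_path(alu=120 + 300))  # 1430 ps with the slower MUL-capable ALU (case a)
# Case b changes only the Control latency, which is off this path: still 1130 ps.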


4.2.5 What is the speedup achieved by adding this improvement?
a. No speedup occurs; in fact, the cycle time is lengthened. The benefit is that adding MUL to the instruction set means 5% fewer instructions need to be executed.
So, if we assume 1000 instructions at 1130 ps, the baseline run time is 1,130,000 ps.
With the improvement, 950 instructions at 1430 ps gives a run time of 1,358,500 ps.


That is a performance decrease of 228,500 ps, not an increase. The increase in cycle time is not recovered by the decrease in instruction count.


b.  The change does not affect cycle time, therefore no speedup is achieved.


4.2.6 Compare the cost/performance ratio with and without this improvement.
I will increase the instruction count to 1 million.
Cost/performance = cost / performance = cost / (1 / execution time) = cost * execution time.
Baseline: 1,000,000 * 1130 ps = 1,130,000,000 ps = 0.00113 s
3890 / (1 / 0.00113) ≈ 4.4


a. Cost: 3890 + 600 = 4490
1,000,000 * 0.95 * 1430 ps = 1,358,500,000 ps = 0.0013585 s
4490 / (1 / 0.0013585) ≈ 6.1


b. Cost: 3890 - 400 = 3490
1,000,000 * 1130 ps = 1,130,000,000 ps = 0.00113 s
3490 / (1 / 0.00113) ≈ 3.94
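
Here is a short Python sketch of the cost/performance numbers above (cost divided by performance, i.e. cost times execution time); the 1 million instruction count, the 5% instruction reduction, and the +600 / -400 cost deltas are the ones used in the answers:

def cost_per_performance(cost, instructions, cycle_ps):
    exec_time_s = instructions * cycle_ps * 1e-12  # ps -> seconds
    return cost * exec_time_s                      # cost / (1 / exec_time)

print(cost_per_performance(3890, 1_000_000, 1130))        # ~4.40 baseline
print(cost_per_performance(3890 + 600, 950_000, 1430))    # ~6.10 improvement a
print(cost_per_performance(3890 - 400, 1_000_000, 1130))  # ~3.94 improvement b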


Improvement a is not an improvement at all. The increase in cycle time is never made up by the decrease in instruction count, and once the cost of the change is included, the system is a fair amount worse than our baseline. To make this an actual improvement, MUL would have to be used much more frequently.


Improvement b is a true improvement.  With no sacrifice to cycle time, we have a reduced cost. Seems like an obvious improvement.