Pipelined Parallelism | 6.16. Pipelining & Parallelism

Electron Tube








We are going to take a look at something really interesting: the trade-off between supply voltage and power on the one hand, and supply voltage and throughput on the other. The aim of this video and the next video is to look at methods to reduce power consumption in a CMOS circuit. Power consumption is really important because most CMOS circuits are now used in mobile platforms. They are battery-powered, and the amount of energy they consume determines how often someone has to recharge their battery.

But first we have to define exactly what we are talking about, because just saying that we talk about power is a little bit misleading. Imagine that you have a circuit, let's say an adder, for example, and this adder works at a power dissipation level of P and is producing outputs at a throughput of S. Now imagine that somebody else says that they have produced a circuit which is lower power than ours, so its power dissipation is P over 2. This is not enough information, because we also have to know what throughput they are getting. If their throughput is S over 2, then they have actually done nothing to improve the design. Why is that? Because we could simply use two of their circuits in parallel to produce double their throughput at double their power, which brings us right back to S and P. They have not made a low-power design. They have just cut the design in half.

So when we talk about power, we should compare powers for the same throughput, or otherwise compare the product of power and the inverse of throughput, which is the power-delay product. The inverse of throughput is the time it takes to produce a single output sample. When we multiply that by power, we get the amount of energy required to produce one output sample, and this is the best measure of power we have.
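To make the comparison above concrete, here is a minimal sketch of the energy-per-sample metric. The function name and the numbers (10 mW, 100 million samples per second) are hypothetical, chosen only for illustration; the only thing taken from the transcript is the idea that energy per sample is power multiplied by the inverse of throughput.

```python
def energy_per_sample(power_w: float, throughput_sps: float) -> float:
    """Energy per output sample in joules: power times (1 / throughput)."""
    if throughput_sps <= 0:
        raise ValueError("throughput must be positive")
    return power_w / throughput_sps


# Hypothetical numbers, not from the transcript.
P = 10e-3   # our adder dissipates 10 mW
S = 100e6   # our adder produces 100 million samples per second

ours = energy_per_sample(P, S)
theirs = energy_per_sample(P / 2, S / 2)                       # "half power" design
two_of_theirs = energy_per_sample(2 * (P / 2), 2 * (S / 2))    # two of them in parallel

print(f"ours:          {ours:.3e} J/sample")
print(f"theirs:        {theirs:.3e} J/sample")
print(f"two of theirs: {two_of_theirs:.3e} J/sample")
```

All three lines should report the same energy per sample, which is the point of the argument: halving power and throughput together does not make a design any more efficient.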
Now, if we look at dynamic power, or active power, P dynamic is equal to alpha C F VDD squared. We have several different options to reduce power dissipation, but the best option is to reduce the supply voltage, VDD. The reason is that the dependence of power on supply voltage is quadratic, so we get more bang for our buck from doing this. But we also have to notice that the operating frequency, which is inversely proportional to the propagation delay in the critical path, is directly proportional to VDD as well. So reducing your supply voltage will also reduce your operating frequency, which decreases your throughput. You have to think about it carefully.

Now consider a circuit that is producing outputs at a throughput of S when it is operating at a frequency F and a supply level VDD. In this case, it is consuming power at a level P. Now assume we use two of these circuits in parallel with each other and combine their outputs together. We would be getting double the throughput, 2S, at the same frequency, using the same supply, and consuming double the power, 2P, because each of these two networks is going to consume P.

Now assume for the parallel circuit here that we don't actually want a throughput of 2S. We just want to preserve the original throughput of S. That's all we need: an output of S samples per second. That means we can operate these two parallel circuits at half the frequency, F over 2. If F is going to give us 2S, then F over 2 is going to give us S. But F is directly proportional to the power supply, so we can actually reduce the power supply by half and still get the same throughput.

So what happened to power here? Power was originally 2P, so what happened to it? We have to look at that. Power for one of the blocks was equal to C F VDD squared, so power for the two blocks was 2 C F VDD squared.
Now we have reduced the operating frequency by half, so 2C times F over 2, and we have also reduced the supply voltage by half, so VDD over 2, all squared. Then we get C F VDD squared over 4, so power is actually now P over 4. We have managed to reduce the power by a factor of 4 relative to the original single circuit, and the cost that we pay is area. We are using two parallel circuits to produce the same throughput, so we are paying more area, but we are getting much less power. This is kind of counterintuitive, because more area means more capacitance to switch, which should mean more power. But the fact that we managed to reduce the frequency had a more profound effect on power dissipation, because it let us reduce the supply voltage.

Now, is the price that we are paying only in area? Because if that's the case, why not use 4 of these and reduce power by a factor of 16? Why not use 6 of these and reduce power by a factor of 36? There has to be a fundamental limit on how much parallelism we can use, on the degree to which we can parallelize our circuit. Of course, the cost of area is one aspect, but there's another aspect. The main thing that helped us reduce power here is that the supply dropped. If we use three blocks in parallel, we reduce the supply to a third; with four, to a fourth; and so on. Notice that as we reduce the supply, we also reduce the available noise margins, so that's our fundamental limit. If we keep reducing the supply, noise is not being reduced along with it, and eventually we will hit a limit where our noise margins are no longer effective.

There's a similar approach to power minimization using pipelining. Assume that you have a pipeline of logic blocks, so you have three logic blocks, 1, 2, and 3, and they are initially only I/O pipelined, so there are only pipeline registers on the input and the output. Now this pipeline operates at a frequency
F, producing a throughput of S, operating from a supply VDD, and consuming a power P, which is equal to C F VDD squared, times alpha, of course. Now, what is C in this case? C is the total amount of switched capacitance at the outputs of each of the logic blocks. It is the total capacitance that is being switched on average in this pipeline.

Now assume that we perform pipelining on this network, so we perform internal pipelining, and also assume that each of these blocks has equal delay. This is the best case for pipelining, because with equal delays it reduces our critical path by the maximum amount. If we assume that in the original pipeline we had a critical path delay of T, we now have a critical path delay of T over 3, which means that the operating frequency has more or less tripled. It should be a little bit less than triple, because we have the overhead of setup time and TCQ, the clock-to-Q delay. But ignoring setup time and TCQ, it should have tripled, which means that the throughput has also tripled using the same supply. The amount of power dissipation is now going to be a little bit different, because we are operating at a different frequency: power is going to be C plus delta, times 3F, times VDD squared, where delta is the additional capacitance that we are switching because of the internal pipeline registers that we added.

Now, we're going to ignore this delta for now; it's not a big deal. Let's just assume that we are not interested in operating this circuit at 3S. We just want the original throughput of S out of the circuit. That means we can reduce the operating frequency from 3F down to F, which means we can reduce the power supply from VDD to VDD over 3. If we check the value of power dissipation, it is going to be
P equals C times F, because we reduced our operating frequency down to F, times VDD over 3, all squared, which means the power dissipation has been reduced by a factor of 9, or one over three squared. So we are basically seeing the same quadratic improvement that we saw from parallelism, but we are seeing it at less cost, because we are not multiplying the area. The area has not been doubled or tripled in this case. We just have the area of the additional pipeline registers, which is much less significant than parallelizing the combinational logic blocks.

But we have the same limitation that we have in parallelism: we have to be careful about how much we reduce the supply voltage, because noise margins are going to be reduced along with it. With pipelining we also have an additional cost, which is the cost in latency, the number of cycles that go by until we see the first output. Latency is not usually a huge issue, especially when you have a latency of only a few cycles, but in a large system where the latencies of multiple blocks can add up, this could have an impact, especially on real-time applications like video chatting, for example. So you also have to be careful with the cost in latency.
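The two scaling arguments in this transcript, parallelism and pipelining, can be sketched numerically. This is a rough illustration under the transcript's own simplifications: frequency is taken as exactly proportional to VDD, setup time and TCQ are ignored, and nothing here models the noise-margin floor that limits how far the supply can drop. The function names, the operating point, and the 10% value for delta are all hypothetical.

```python
def dynamic_power(alpha: float, c: float, f: float, vdd: float) -> float:
    """Dynamic power P = alpha * C * F * VDD^2, as quoted in the transcript."""
    return alpha * c * f * vdd ** 2


def parallel_power(n: int, alpha: float, c: float, f: float, vdd: float) -> float:
    """n identical copies, each clocked at f/n from a supply of vdd/n.

    Total throughput matches one copy running at f and vdd.
    """
    return n * dynamic_power(alpha, c, f / n, vdd / n)


def pipelined_power(n: int, alpha: float, c: float, f: float, vdd: float,
                    delta: float = 0.0) -> float:
    """n equal-delay stages, clock kept at f, supply lowered to vdd/n.

    delta is the extra switched capacitance of the added pipeline registers.
    """
    return dynamic_power(alpha, c + delta, f, vdd / n)


# Hypothetical operating point, chosen only for illustration.
ALPHA, C, F, VDD = 0.1, 1e-12, 1e9, 1.0
base = dynamic_power(ALPHA, C, F, VDD)

for n in (2, 4, 6):
    ratio = base / parallel_power(n, ALPHA, C, F, VDD)
    print(f"{n} parallel copies: power reduced by a factor of {ratio:.1f}")

ratio = base / pipelined_power(3, ALPHA, C, F, VDD)
print(f"3-stage pipeline, delta ignored: factor of {ratio:.1f}")

ratio = base / pipelined_power(3, ALPHA, C, F, VDD, delta=0.1 * C)
print(f"3-stage pipeline, delta = 10% of C: factor of {ratio:.1f}")
```

With delta set to zero the pipelined case reproduces the factor of 9 from the transcript; a nonzero delta shows how the register overhead eats into that gain.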
