# POWER-AWARE DESIGN OF AN 8-BIT PIPELINING ASYNCHRONOUS ANT-BASED CLA USING DATA TRANSITION DETECTION<sup>§</sup>

Chua-Chin Wang<sup>†</sup>, Ching-Li Lee, and Pai-Li Liu

Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan 80424

email: ccwang@ee.nsysu.edu.tw

#### ABSTRACT

A high-speed and low-power 8-bit carry-lookahead adder (CLA) using two-phase all-N-transistor (ANT) blocks, which are arranged in a PLA design style with power-aware pipelining, is presented. The pull-up charging and pull-down discharging of the transistor arrays of the PLA are accelerated by inserting two feedback MOS transistors between the evaluation NMOS blocks and the outputs. The addition of two 8-bit binary numbers is executed in 2 cycles. The proposed power-aware pipelining design methodology uses a simple data transition detection circuit to shut down the processing stages whose inputs are identical in two consecutive operations. The methodology is shown to be suitable for long adders as well, and the dynamic power consumption is reduced by more than 50% at every process corner.

#### 1. INTRODUCTION

Fast adders are key elements in digital circuits, including multipliers and DSP chips. Many efforts have been focused on the improvement of adder designs [2], [3], [5]. CMOS dynamic logic has been recognized as one of the promising options for reaching GHz operation in adder design [1]. Other logic families suffer from different difficulties, which were addressed in [3]. However, the major trade-off of these prior GHz logic circuits is high power consumption, which is not a tolerable price to pay in recent mobile technologies. These circuits unavoidably consume power even when they are in a stand-by condition. We therefore propose a power-aware PLA-like structure to improve our high-speed all-N-transistor (ANT) function block [3], [5], [6].
An 8-bit CLA using ANTs, which are arranged in the power-aware PLA-like structure and asynchronously triggered, is implemented to verify the power reduction as well as the preservation of high speed. A simple but effective data transition detection (DTD) circuit is proposed to resolve the power consumption problem. The major advantage of the power-aware design methodology is that it remains robust for long data words, e.g., 64-bit binary data. The power reduction is simulated to be more than 50% compared to the prior works.

#### 2. POWER-AWARE HIGH-SPEED 8-BIT CLA

#### 2.1. All-N-transistor (ANT) function unit

Although the N-block dynamic logic intrinsically possesses high speed [1], it is not good enough for operation in the gigahertz range. The reasons are: firstly, the slopes of the clock's edges must be gentle, and secondly, the number of stacks in the evaluation N-block severely affects the size of all of the transistors in the unit. Hence, a modified dynamic logic, ANT [6], has been proposed, as shown in Fig. 1. The feature of this modification is the feedback transistor pair, P3 and N3, between the evaluation block and the output.

- 1). When clk = 0, P1 is on and the gate of P2 is precharged to $V_{dd}$. Then, P2 is off and N4 is off. This makes the output stay at the previous state.
- 2). When clk = 1 and the N-block is evaluated to be "pass", the charge at node a should theoretically be discharged to ground through the N-block and N1. Note that N4 is on and N2 is also on at the beginning. If the previous state of the output is high, then N3 will be turned on via N4. This means that N3 provides another fast discharging path for the charge at node a.
- 3). When clk = 1, the previous state of the output is low, and the N-block is evaluated to be "pass", the voltage at node a starts to drop. When $|V_a - V_{dd}| > |V_{tp}|$, P3 will be on such that the gate of N3 will be charged to $V_{dd}$. Not only will the charge at node a be discharged faster, but the output will also be charged high via P2 and N4.
- 4). When clk = 1 and the N-block is evaluated to be "stop", the charge at node a should be kept if the previous state of the output is low. There will be no discharging path for node a because N3 will be off via N4. If the previous state is high, the output will be grounded via N4 and N2 before the voltage at node a starts to drop.

Figure 1: ANT logic

<sup>§</sup> This research was partially supported by National Health Research Institute under grant NHRI-EX93-9319EI and National Science Council under grant NSC 92-2220-E-110-001 and 92-2220-E-110-004.

Summarizing 2). and 3). above, the output will be high when the N-block is evaluated "pass", i.e., "1", during clk = 1. By 4)., the output will be low when the N-block is evaluated "stop", i.e., "0", during clk = 1. The function of the ANT logic block is thus correct and non-inverting. Restated, P3 and N3, respectively, provide an extra charging path and an extra discharging path such that the speed of the evaluation can be accelerated.

In addition to the previous discharging path problem, one of the reasons why other high-speed logic cannot run correctly given clocks with short rise time or fall time is that the size of each transistor cannot be tuned properly. The sizing of the transistors in the ANT, besides those in the N-block, drastically affects the speed. We have carried out several simulations to find the best figure of merit for the sizing of each transistor in Fig. 1 using TSMC 0.25 $\mu$m 1P5M CMOS technology.

#### 2.2. PLA-styled 8-bit CLA design

The formulation of an 8-b CLA is represented by the following equations:

$$S_{i} = C_{i-1} \oplus P_{i}$$

$$C_{i} = G_{i-1} + P_{i-1}G_{i-2} + P_{i-1}P_{i-2}G_{i-3} + \dots + P_{i-1}P_{i-2}\dots P_{1}P_{0}C_{0} \tag{1}$$

where $A_i, B_i$, $i = 0, \ldots, 7$, are inputs, and $P_i$, $G_i$ are the propagate and generate signals, respectively, with $P_i = A_i \oplus B_i$ and $G_i = A_i \cdot B_i$. If the $P_i$'s and $G_i$'s are produced by combinational logic function blocks before they are fed into the function blocks for the $S_i$'s and $C_i$'s, then Eqn. (1) implies that a two-level AND-OR logic function block is a possible solution to achieve high-speed operation. Thus, the PLA-styled design is suitable for such a function block. A conceptual PLA-styled design for the CLA is shown in Fig. 2.

Figure 2: PLA-styled CLA

A typical PLA consists of an AND array and an OR array. It is well known that the series NMOS transistors in the evaluation block of NAND or AND gates produce long discharging delays which subsequently slow down the entire circuit. We can take advantage of the non-inverting feature of the ANT logic to utilize a NOT-OR-NOT-OR configuration instead of the typical AND-OR style, where the two OR planes are made of ANT logic blocks. This also minimizes the series transistor count in the evaluation block. The OR array is made of the ANT logic with a predefined evaluation block. The inputs to the first OR array are the inverted $P_i$ (propagate) and $G_i$ (generate) signals, which are also produced by other ANT logic units as shown in Fig. 3. Note that we define the propagate signals in a different way from the traditional $P_i = A_i + B_i$ because $P_i = A_i \oplus B_i$ can be reused to generate the sum term, i.e., $S_i$.

Figure 3: GP block

#### 2.3. Speed and area analysis

Speed: The critical path of an adder resides in the generation of the carry signals, i.e., $C_8$ in the 8-bit adder.
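
Because the critical path runs through the carry terms, it may help to see the carry expression of Eqn. (1) evaluated directly. The following is a minimal behavioral sketch in Python, not the transistor-level PLA. The function name `cla_add` and its parameters are hypothetical, and the sketch assumes that $C_0$ is the carry-in and that $C_i$ is the carry entering bit position $i$, so each sum bit is formed from $P_i$ and the carry entering that bit.

```python
def cla_add(a: int, b: int, c0: int = 0, n: int = 8):
    """Behavioral n-bit carry-lookahead addition (illustrative sketch only).

    Assumption: C_0 is the carry-in and C_i is the carry into bit i.
    Returns (sum_bits_as_int, carry_out).
    """
    # Propagate and generate signals: P_i = A_i xor B_i, G_i = A_i and B_i
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]

    carries = [c0 & 1]
    for i in range(1, n + 1):
        terms = []
        # Product terms P_{i-1} ... P_{k+1} G_k for k = 0 .. i-1
        for k in range(i):
            t = g[k]
            for j in range(k + 1, i):
                t &= p[j]
            terms.append(t)
        # Last product term P_{i-1} ... P_0 C_0
        t = c0 & 1
        for j in range(i):
            t &= p[j]
        terms.append(t)
        # Second level: OR of all product terms
        carries.append(int(any(terms)))

    # Sum bit i combines P_i with the carry entering bit i
    s = 0
    for i in range(n):
        s |= (p[i] ^ carries[i]) << i
    return s, carries[n]
```

Each carry is computed as a flat two-level sum of products rather than by rippling, which mirrors the two-plane structure described above. As a quick check, `cla_add(200, 100)` returns `(44, 1)`, since 300 is 44 plus a carry out of the eighth bit.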
After the binary data are ready, the generation of the $P_i$'s and $G_i$'s by the ANT logic takes the high half of a full cycle. That is, the results of the GP blocks will be ready when clk is low. The inverted $P_i$'s and $G_i$'s will then be fed into the first OR plane of the ANT-based PLA. The inverted outputs of the first OR plane will be presented to the second OR plane in the high half of the second cycle. The final $C_i$ results are then ready in the low half of the second cycle. Right after the $C_i$'s are generated, they are inverted and fed into the $S_i$ function blocks. Another half cycle is then required to produce all of the $S_i$'s. The final result will be latched after 2 cycles.

Area: As for the transistor count of the PLA-styled implementation of the CLA using ANT logic, an analytic form is obtained after careful derivation. In short, if an *n*-bit CLA is to be realized by our methodology, the transistor count can be computed as:

$$T_{total} = \frac{1}{6}(n+1)(n+2)(n+3) + 5n(n+1) + 50n + 3$$

Figure 4: Power-aware circuitry

Figure 5: Regulator in the power-aware circuitry

#### 2.4. Data transition detection (DTD)

A simple way to improve the power efficiency is to "deny" the current fed into those function units whose input data are identical in two consecutive operation cycles. The dynamic power of the CMOS logic elements will hence be drastically reduced. Take the ANT block shown in Fig. 1 as an example. Assume the N-block is composed of two cascaded NMOS transistors to constitute an AND gate. The probability that the data inputs of two consecutive operation cycles are identical is 25%, which implies a significant portion of the power consumption. Hence, a monitoring circuit, called the data transition detector (DTD), as shown in Fig. 4, is proposed to meet the low-power demand.
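
The gating idea and the 25% figure can be illustrated with a short behavioral sketch in Python. It is only a software analogy under stated assumptions, not the DTD circuit itself: the names `identical_input_probability` and `GatedAdder` are hypothetical, inputs are assumed to be independent and uniformly random, and the comparison is done on whole operands rather than bit by bit.

```python
from itertools import product


def identical_input_probability(num_inputs: int) -> float:
    """Probability that two consecutive input vectors are identical.

    Assumption: every input bit is independent and uniformly random.
    """
    vectors = list(product((0, 1), repeat=num_inputs))
    same = sum(1 for prev, cur in product(vectors, repeat=2) if prev == cur)
    return same / len(vectors) ** 2


class GatedAdder:
    """Adder model that is only re-evaluated when the operands change."""

    def __init__(self, n: int = 8):
        self.n = n
        self.mask = (1 << n) - 1
        self.prev = None       # operands seen in the previous operation
        self.result = None     # held output, as in a latched stage
        self.evaluations = 0   # how many times the adder was "clocked"

    def apply(self, a: int, b: int):
        operands = (a & self.mask, b & self.mask)
        if operands != self.prev:
            # Data transition detected: trigger one evaluation
            self.evaluations += 1
            total = operands[0] + operands[1]
            self.result = (total & self.mask, total >> self.n)
            self.prev = operands
        # Identical operands: no evaluation, the previous result is held
        return self.result
```

For a two-input block, `identical_input_probability(2)` evaluates to 0.25, matching the figure quoted above. Feeding the same operand pair to `GatedAdder.apply` twice increments `evaluations` only once, which is the activity that the gating is meant to suppress.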
The DTD design is based on an important observation: the state switching of either $A_i$ or $B_i$ will cause a series of state switches with regard to $P_j$, $G_j$, and $C_j$, $\forall j \geq i$. Hence, an early state transition detection of the lower bits can be used to determine whether the computation of the higher bits is required or not. The DTD carries out the monitoring mechanism and triggers the addition operations asynchronously depending on the comparison of the previous operands and the current operands.

Figure 6: Data flow block diagram of the proposed CLA

The DTD is composed of three blocks: 3 stages of delay chains to generate phase-shifted data pulses, a clock generator, and a voltage regulator. As shown in Fig. 4, $D_0$ is propagated through 3 delay stages, each of which comprises 4 cascaded inverters, to generate $D_1$, $D_2$, and $D_3$. $D_i$, $i = 0, \ldots, 3$, are inverted respectively to generate $\overline{D_i}$, $i = 0, \ldots, 3$. Since the 8-bit PLA-styled CLA needs 2 cycles to complete an addition, we conclude that $(D_3 D_2 D_1 D_0) = 0001, 0111, 1110, 1000$ are the states required to generate two consecutive clock cycles for the required addition. Hence, the clock generator for the corresponding ANT logic, e.g., C1 and SUM0, is carried out. The generated internal clock is set to 100 MHz.

However, the most critical part of the proposed DTD is the sensitivity of the strobe duration with respect to the power variation. One of the most efficient approaches to avoid the unstable power supply is to employ step-down bandgap-referenced voltage regulators to supply a temperature-independent reference voltage, $V_{ref}$, to the rest of the circuitry. Referring to Fig. 5, the regulator is composed of $A_{OP3}$, PM61, and a resistor string. The generated internal voltage for the DTD is a very stable $V_{int} = V_{dd} - V_{thp}$.

#### 3. PERFORMANCE SIMULATIONS AND COMPARISON

The data flow block diagram of the proposed 8-bit power-aware PLA-styled ANT-based CLA is shown in Fig. 6. The detailed schematic and layout of the CLA implemented in the TSMC (Taiwan Semiconductor Manufacturing Company) 0.25 $\mu$m 1P5M CMOS process are shown in Figs. 7 and 8, respectively. An example of the output waveform of the 8-bit power-aware PLA-styled CLA using ANT logic, shown in Fig. 9, illustrates that the result of an addition appears after two cycles when VDD is coupled with a 1 MHz sine wave noise possessing 10% VDD amplitude. The characteristics of the proposed power-aware CLA are tabulated in Table 1. To reveal the power-saving advantage of the proposed power-aware design, two 8-bit adders are implemented by [6] and by the proposed design, respectively, using the same CMOS process. The power reduction of the power-aware design is shown in Table 2. The simulations are carried out by the HSPICE Monte Carlo method with sweep = 30.

Figure 7: Schematic of the proposed CLA

Figure 8: Layout of the proposed CLA

| | proposed CLA |
|-------------------|--------------------------------|
| highest data rate | 100 MHz |
| avg. power | 9.7 mW |
| area | $1.16 \times 0.75\ \text{mm}^2$ |
| transistor count | 2250 |

Table 1: Characteristics of the proposed power-aware 8-bit adder

| data rate | [6] | power-aware | reduction |
|-----------|-------|-------------|-----------|
| 50 MHz | 35 mW | 17.4 mW | 50% |
| 20 MHz | 35 mW | 9.7 mW | 70% |
| 10 MHz | 35 mW | 4.9 mW | 85% |
| 5 MHz | 35 mW | 3.3 mW | 90% |

Table 2: Power reduction by using the power-aware DTD circuitry

Figure 9: Post-layout simulation result

#### 4. CONCLUSION

We propose a power-aware high-speed PLA-styled ANT logic design for the implementation of adders. A novel but simple DTD circuit is used to monitor the switching activity of the input data such that unnecessary power consumption is avoided. Not only is the correctness of the function in the gigahertz range preserved, but the power dissipation is also reduced. The PLA-styled ANT-based structure using only one clock makes the result of an 8-bit adder appear in two cycles, or that of a hierarchical 64-bit adder in 4 cycles.

#### 5. REFERENCES

- [1] R. X. Gu and M. I. Elmasry, "All-N-logic high-speed true-single-phase dynamic CMOS logic," *IEEE J. of Solid-State Circuits*, vol. 31, no. 2, pp. 221-229, Feb. 1996.
- [2] R. Rogenmoser and Q. Huang, "An 800-MHz 1-$\mu$m CMOS pipelined 8-b adder using true single-phase clocked logic-flip-flops," *IEEE J. of Solid-State Circuits*, vol. 31, no. 3, pp. 401-409, Mar. 1996.
- [3] C.-C. Wang, C.-J. Huang, and K.-C. Tsai, "A 1.0 GHz 0.6-$\mu$m 8-bit carry lookahead adder using PLA-styled all-N-transistor logic," *IEEE Trans. on Circuits and Systems, Part II: Analog and Digital Signal Processing*, vol. 47, no. 2, pp. 133-135, Feb. 2000.
- [4] Z. Wang, G. A. Jullien, W. C. Miller, J. Wang, and S. S. Bizzan, "Fast adders using enhanced multiple-output domino logic," *IEEE J. of Solid-State Circuits*, vol. 32, no. 2, pp. 206-214, Feb. 1997.
- [5] C.-C. Wang, Y.-L. Tseng, P.-M. Lee, R.-C. Lee, and C.-J. Huang, "A 1.25 GHz 32-bit tree-structured carry lookahead adder using modified ANT logic," *IEEE Trans. on Circuits and Systems - I: Fundamental Theory and Applications*, vol. 50, no. 9, pp. 1208-1216, Sep. 2003.
- [6] C.-C. Wang, C.-F. Wu, and K.-C. Tsai, "A 1.0 GHz 64-bit High-Speed Comparator Using ANT Dynamic Logic with Two-Phase Clocking," *IEE Proceedings - Computers and Digital Techniques*, vol. 145, no. 6, pp. 433-436, Nov. 1998.