

# LOW VOLTAGE, LOW POWER (5:2) COMPRESSOR CELL FOR FAST ARITHMETIC CIRCUITS

Jiangmin Gu and Chip-Hong Chang

School of Electrical and Electronic Engineering, Nanyang Technological University  
Nanyang Avenue, Singapore 639798

## ABSTRACT

This paper presents a new (5:2) compressor circuit capable of operating at ultra-low voltages. Its power efficacy is derived from the novel transistor-level design of a composite XOR-XNOR gate. The new circuit eliminates the weak logic and threshold voltage drop problems, which are the main factors limiting the performance of pass-transistor-based circuits at low supply voltages. The proposed (5:2) compressor has been designed with special consideration for output drivability to ensure that it functions reliably at low voltages when these cells are employed in tree-structured multipliers and multiply-accumulators. Simulation results show that the proposed (5:2) compressor is able to function at a supply voltage as low as 0.7V, and outperforms other (5:2) compressors constructed with various combinations of recently reported low-power logic cells.

## 1. INTRODUCTION

Fast arithmetic computation cells, including adders and multipliers, are the most frequently and widely used circuits in VLSI systems. Microprocessors and digital signal processors rely on the efficient implementation of generic arithmetic logic units and floating point units to execute dedicated algorithms such as convolution and filtering [1-4]. In most of these applications, multipliers are the critical and obligatory component dictating the overall circuit performance when constrained by power consumption and computation speed. As VLSI technologies move towards the deep-submicron regime, the most prominent means of achieving power efficacy is lowering the power supply voltage. Therefore, it is imperative to explore circuit design techniques that achieve high power efficacy in arithmetic circuits at ultra-low supply voltages.

Fast multipliers are generally composed of three sub-functions: *partial product generation*, *partial product accumulation* and *carry-propagating addition* [2, 5]. In the partial product generation circuit, Booth encoding is often used to reduce the number of partial products. A summation tree, called the Carry Save Adder (CSA) tree, is used in the second function to further reduce the partial products to two. The last function is normally fulfilled by a fast carry-propagate adder, such as a carry look-ahead adder or a carry-skip adder. Early CSA tree designs use (3:2) counters, or full adders, for the partial product accumulation, in which three equally weighted bits are combined to produce two output bits. The (4:2) compressors are more popular nowadays because they form a regular structure of interconnected cells. Higher-input compressors such as (5:2), (6:2), etc., have also been studied and increasingly employed in high-precision multipliers to achieve greater performance.
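The carry-save reduction described above can be illustrated with a minimal behavioral sketch. It is not part of the original design flow: the function names `carry_save_add` and `reduce_to_two` are hypothetical, and operands are modeled as non-negative Python integers rather than as wired bit columns.

```python
from typing import List, Tuple


def carry_save_add(a: int, b: int, c: int) -> Tuple[int, int]:
    """Bitwise (3:2) counter: three operands in, a sum word and a carry word out."""
    sum_word = a ^ b ^ c
    # The majority of each bit column is the carry, weighted one bit position higher.
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def reduce_to_two(operands: List[int]) -> Tuple[int, int]:
    """Reduce a list of operands to two words using layers of (3:2) counters."""
    ops = list(operands)
    while len(ops) > 2:
        next_ops: List[int] = []
        i = 0
        # Every complete group of three operands becomes two operands.
        while i + 3 <= len(ops):
            s, c = carry_save_add(ops[i], ops[i + 1], ops[i + 2])
            next_ops += [s, c]
            i += 3
        next_ops += ops[i:]  # leftover operands pass through to the next layer
        ops = next_ops
    while len(ops) < 2:
        ops.append(0)
    return ops[0], ops[1]
```

The two returned words still have to be added by a carry-propagate adder, which corresponds to the third sub-function of the multiplier.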

There has been increasing interest in using fast (5:2) compressors for large word-size multipliers and multiply-accumulators [1, 2]. In this paper, we investigate several fast (5:2) compressor architectures and devise a new circuit that is built around composite XOR-XNOR gates and multiplexers. Various CMOS logic styles for implementing these primitive cells at transistor level have been studied and a new design is proposed. The proposed fast (5:2) compressor architecture, designed with the new composite gate and multiplexer cells, is able to operate below 1 volt with a power efficacy in the fJ range. Meanwhile, the driving capability is assured by simulating the circuit in an environment similar to its use in a tree-structured accumulator.

## 2. (5:2) COMPRESSOR ARCHITECTURES

The block diagram of a (5:2) compressor is shown in Fig. 1, which has seven inputs and four outputs. Five of the inputs are the primary inputs  $x_1, x_2, x_3, x_4$  and  $x_5$ , and the other two inputs,  $c_{in1}$  and  $c_{in2}$  receive their values from the neighboring compressor of one binary bit order lower in significance. All the seven inputs have the same weight. The (5:2) compressor generates an output, *sum* of the same weight as the inputs, and three outputs, *carry*,  $c_{out1}$  and  $c_{out2}$  weighted one binary order higher. The  $c_{out1}$  and  $c_{out2}$  are fed to the neighboring compressor of higher significance.



Figure 1. Block diagram
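As an illustration of how such cells are chained across bit positions, the following behavioral sketch compresses five operands into a sum word and a carry word. It assumes that each slice's $c_{in1}$ and $c_{in2}$ are driven by $c_{out1}$ and $c_{out2}$ of the next lower slice, and it models the cell as three cascaded full adders; the names `compressor_5_2` and `compress_words` are hypothetical.

```python
from typing import List, Tuple


def full_adder(a: int, b: int, c: int) -> Tuple[int, int]:
    """Return (sum, carry) of three equally weighted bits."""
    return a ^ b ^ c, (a & b) | (c & (a ^ b))


def compressor_5_2(x: List[int], cin1: int, cin2: int) -> Tuple[int, int, int, int]:
    """Behavioral (5:2) compressor: returns (sum, carry, cout1, cout2)."""
    s1, cout1 = full_adder(x[0], x[1], x[2])
    s2, cout2 = full_adder(s1, x[3], cin1)
    total, carry = full_adder(s2, x[4], cin2)
    return total, carry, cout1, cout2


def compress_words(words: List[int], width: int) -> Tuple[int, int, int]:
    """Compress five width-bit operands into (sum_word, carry_word, overflow)."""
    assert len(words) == 5
    sum_word = carry_word = 0
    cin1 = cin2 = 0  # the least significant slice has no lower neighbor
    for i in range(width):
        bits = [(w >> i) & 1 for w in words]
        s, c, cin1, cin2 = compressor_5_2(bits, cin1, cin2)
        sum_word |= s << i
        carry_word |= c << (i + 1)  # carry is weighted one binary order higher
    # cout1 and cout2 of the top slice each carry a weight of 2**width.
    overflow = (cin1 + cin2) << width
    return sum_word, carry_word, overflow
```

Under these assumptions, the sum of the five operands equals `sum_word + carry_word + overflow`, so only one carry-propagate addition remains.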

A simple implementation of the (5:2) compressor is to cascade three (3,2) parallel counters in a hierarchical structure, as shown in Fig. 2. Since a (3,2) parallel counter is equivalent to a full adder, this architecture has a critical path delay of  $6\Delta_{xor}$ , where  $\Delta_{xor}$  is the delay of an XOR gate.

A faster (5:2) compressor is shown in Fig. 3. This architecture is proposed in [1], which uses a different method to generate  $c_{out1}$  and  $c_{out2}$ . It is claimed to have a delay of  $4\Delta_{xor}$ .

Fig. 4 shows another architecture of a (5:2) compressor [2]. Careful analysis shows that this design has a critical path delay of  $4\Delta_{xor} + \Delta_{mux}$  as opposed to the  $5\Delta_{xor}$  reported in [2]. The style and structure of the circuit share some common attributes with the recently published structural designs of full adders [4,5] and (4:2) compressors [2,6].



Figure 2. (5:2) compressor based on cascaded (3,2) counters



Figure 3. (5:2) compressor architecture of [1]



Figure 4. (5:2) compressor architecture of [2]

In spite of structural differences of various implementations, the formulae to generate the output signals are essentially derived from the basic architecture of Fig. 2. Each full adder can be logically expressed as:

$$s_{FA} = a \oplus b \oplus c \quad (1)$$

$$c_{FA} = (a \oplus b) \cdot c + \overline{(a \oplus b)} \cdot a \quad (2)$$

where  $a$ ,  $b$ , and  $c$  are the primary inputs and  $s_{FA}$  and  $c_{FA}$  are the primary outputs of the full adder.
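A quick exhaustive check, given here as a sketch rather than as part of the original analysis, confirms that the multiplexer form of Eq. (2) agrees with the familiar majority function and that the pair $(s_{FA}, c_{FA})$ encodes the arithmetic sum of the inputs. The function names are hypothetical.

```python
from itertools import product


def fa_sum(a: int, b: int, c: int) -> int:
    """Eq. (1): three-input XOR."""
    return a ^ b ^ c


def fa_carry_mux(a: int, b: int, c: int) -> int:
    """Eq. (2): a XOR b selects c when it is 1, and a otherwise."""
    return c if (a ^ b) else a


def fa_carry_majority(a: int, b: int, c: int) -> int:
    """Conventional majority form of the full adder carry."""
    return (a & b) | (a & c) | (b & c)


def full_adder_is_consistent() -> bool:
    """Check all eight input combinations."""
    for a, b, c in product((0, 1), repeat=3):
        if fa_carry_mux(a, b, c) != fa_carry_majority(a, b, c):
            return False
        if a + b + c != fa_sum(a, b, c) + 2 * fa_carry_mux(a, b, c):
            return False
    return True
```

The multiplexer form is of interest because the select signal $a \oplus b$ is already needed to compute the sum.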

It follows that the outputs and the internal nodes of Fig. 2 can be expressed by the following set of equations.

$$s_1 = x_1 \oplus x_2 \oplus x_3 \quad (3)$$

$$c_{out1} = (x_1 \oplus x_2) \cdot x_3 + \overline{(x_1 \oplus x_2)} \cdot x_1 \quad (4)$$

$$s_2 = s_1 \oplus x_4 \oplus c_{in1} \quad (5)$$

$$= x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus c_{in1}$$

$$c_{out2} = (x_4 \oplus s_1) \cdot c_{in1} + \overline{(x_4 \oplus s_1)} \cdot x_4 \quad (6)$$

$$= (x_1 \oplus x_2 \oplus x_3 \oplus x_4) \cdot c_{in1} + \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4)} \cdot x_4$$

$$sum = s_2 \oplus x_5 \oplus c_{in2} \quad (7)$$

$$= x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5 \oplus c_{in1} \oplus c_{in2}$$

$$carry = (x_5 \oplus s_2) \cdot c_{in2} + \overline{(x_5 \oplus s_2)} \cdot x_5 \quad (8)$$

$$= (x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5 \oplus c_{in1}) \cdot c_{in2}$$

$$+ \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5 \oplus c_{in1})} \cdot x_5$$
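Equations (3) to (8) can be checked against the arithmetic definition of the compressor. The sketch below is an illustrative bit-level model, not the authors' verification flow, and its function names are hypothetical. It evaluates all 128 input combinations and tests that the seven inputs sum to $sum + 2\,(carry + c_{out1} + c_{out2})$.

```python
from itertools import product
from typing import Tuple


def mux_carry(a: int, b: int, c: int) -> int:
    """Carry in the form of Eq. (2): (a XOR b) selects c, otherwise a."""
    return c if (a ^ b) else a


def compressor_5_2(x1: int, x2: int, x3: int, x4: int, x5: int,
                   cin1: int, cin2: int) -> Tuple[int, int, int, int]:
    """Return (sum, carry, cout1, cout2) following Eqs. (3) to (8)."""
    s1 = x1 ^ x2 ^ x3                 # Eq. (3)
    cout1 = mux_carry(x1, x2, x3)     # Eq. (4)
    s2 = s1 ^ x4 ^ cin1               # Eq. (5)
    cout2 = mux_carry(x4, s1, cin1)   # Eq. (6)
    total = s2 ^ x5 ^ cin2            # Eq. (7)
    carry = mux_carry(x5, s2, cin2)   # Eq. (8)
    return total, carry, cout1, cout2


def check_all_inputs() -> bool:
    """True if every input pattern satisfies the weighted-sum identity."""
    for bits in product((0, 1), repeat=7):
        total, carry, cout1, cout2 = compressor_5_2(*bits)
        if sum(bits) != total + 2 * (carry + cout1 + cout2):
            return False
    return True
```

Note that $c_{out1}$ depends only on $x_1$, $x_2$ and $x_3$, so no carry ripples through more than two adjacent cells.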

Based on the above formulae, it is conjectured that lowering the critical path delay of the (5:2) compressor to  $3\Delta_{xor}$  or lower is almost impossible. However, it is quite feasible to explore different logic design styles at transistor level to achieve significantly improved low-power and high-speed (5:2) compressor cells for instantiation at architectural level. For example, a dual-rail (5:2) compressor is proposed in [2] for the architecture of Fig. 4, where the XOR gates are implemented as dual-rail multiplexers to improve the performance.

## 3. PROPOSED CIRCUITS LEVEL DESIGN

Having defined the notion of the (5:2) compressor, our objective is to develop novel circuits at transistor level that leverage advanced CMOS technologies to realize a low voltage, low power (5:2) compressor cell with sustainable speed and acceptable area cost. It is obvious from Figs. 3 and 4 that the primitive logic elements are the XOR-XNOR gate and the multiplexer. Since the compressors are normally cascaded in a tree structure to reduce the height of the partial product matrix, it is imperative that the outputs of the compressor have enough drivability.

Fig. 5 shows three circuit implementations of the composite XOR-XNOR cell. Circuit (a) consumes very low power [4]. However, it generates a weak logic '1' when the primary inputs are '11', which prevents it from functioning reliably at low supply voltages. Circuit (b) is able to operate at low voltages, but it is not power efficient [4]. Both Circuits (a) and (b) use an inverter to generate the xor/xnor signals. Therefore, these complementary outputs are heavily skewed in time. Fig. 5(c) shows the schematic and layout of our proposed XOR-XNOR cell. The output signals are generated simultaneously with balanced delay. It is able to operate at very low voltages because the weak logic problem caused by the pass transistors is circumvented by the pair of feedback transistors.

Fig. 6 shows the circuit implementation of a simple XOR gate and its layout. It is used in some blocks of the architecture of Fig. 4 where only the exclusive-or output is needed.

Fig. 7 shows different circuit implementations of the multiplexer. Circuit (a) is a dual-rail multiplexer, where complementary pairs of primary inputs and outputs need to be generated [2]. Although it generates full-swing outputs, its pass transistor structure will not provide adequate drivability if many such circuits are cascaded, particularly at low supply voltages. Therefore, if this circuit is used at the output port of the compressor, inverters are needed to strengthen the signals. Circuit (b) uses the dual-rail multiplexer in (a) to construct the XOR-XNOR gate, which can be used for the fully multiplexer-based implementation of the (5:2) compressor. Circuit (c) is a widely used multiplexer for the carry output generation in fast full adders and multi-input compressors [4,5]. A buffer is added to enhance its driving capability. Fig. 7(d) shows a CMOS-style multiplexer circuit and its layout. Since it is designed in the complementary CMOS logic style, it has inherent robustness against supply voltage scaling and transistor sizing. Compared with Circuit (c), Circuit (d) maintains the required output drivability with one less inverter. The simpler and more regular layout of the proposed circuit outweighs the penalty of having two more transistors than Circuit (c).



Figure 5. XOR-XNOR cells



Figure 6. XOR cell for output

Fig. 8 is the complete layout of our new fast (5:2) compressor based on the architecture of Fig. 4, which covers a silicon area of  $33\mu\text{m} \times 17\mu\text{m}$ . The composite XOR-XNOR cells are implemented with the proposed circuit of Fig. 5(c), except that the cell producing the final *sum* output uses the circuit of Fig. 6. The multiplexers are implemented with the circuit of Fig. 7(d).

## 4. SIMULATIONS AND ANALYSIS

Five different (5:2) compressors, including ours, are simulated and their performances in terms of average power consumption, worst case delay and power efficacy (average power  $\times$  worst case delay) are compared. Table 1 shows the configurations of these circuits, all of which are guaranteed to have sufficient drivability for fair comparison. Compressors 1 and 2 use the architecture of Fig. 3 [1]. Compressor 3 is a fully multiplexer-based implementation with each primary output strengthened by an inverter. Compressors 4 and 5 use the architecture of Fig. 4. The difference is that Compressor 4 uses logic gates and multiplexer circuits from the existing literature [4,5], whereas Compressor 5 is our proposed circuit presented in Section 3.
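The power efficacy metric and the percentage comparisons used in this section can be written out as a small sketch. The helper names are hypothetical and the code carries no measured data. It only makes the unit bookkeeping explicit: since $1\,\mu\text{W} \times 1\,\text{ns} = 10^{-15}\,\text{J}$, a power in $\mu$W multiplied by a delay in ns gives the power-delay product directly in fJ.

```python
def power_delay_product_fj(avg_power_uw: float, worst_delay_ns: float) -> float:
    """Power-delay product in fJ from average power (uW) and worst case delay (ns).

    1 uW * 1 ns = 1e-6 W * 1e-9 s = 1e-15 J = 1 fJ, so no scaling factor is needed.
    """
    return avg_power_uw * worst_delay_ns


def percent_less(ours: float, other: float) -> float:
    """By what percentage 'ours' is lower than 'other' (positive means lower)."""
    return (other - ours) / other * 100.0
```

For example, `percent_less(75.0, 100.0)` evaluates to 25.0, meaning the first quantity is 25% lower than the second.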



Figure 7. MUX cells



Figure 8. Layout of the proposed (5:2) compressor

Table 1. Configurations of the five simulated (5:2) compressors

| No. | Structure | MUX       | XOR-XNOR  | XOR       |
|-----|-----------|-----------|-----------|-----------|
| 1   | Fig. 3    | Fig. 7(c) | Fig. 5(a) | Fig. 5(a) |
| 2   | Fig. 3    | Fig. 7(c) | Fig. 5(b) | Fig. 6    |
| 3   | Fig. 4    | Fig. 7(a) | Fig. 7(a) | Fig. 7(a) |
| 4   | Fig. 4    | Fig. 7(c) | Fig. 5(b) | Fig. 6    |
| 5   | Fig. 4    | Fig. 7(d) | Fig. 5(c) | Fig. 6    |

All the simulations and layouts target the latest Chartered Semiconductor CSM  $0.18\mu\text{m}$  CMOS technology. Therefore, the circuits are designed and optimized based on this process model. The simulations are performed with Nassda HSim 2.0 with the option “HSIMSPEED” set to “0”. This setting is the slowest to simulate but gives the highest accuracy, with results comparable to HSPICE simulation.

Each input and output pin is connected through a buffer, which provides a realistic simulation environment reflecting the compressor operation in actual applications. The 1024 input data are randomly generated by MATLAB. The delay is measured from the earliest input signal reaching 50% of the supply voltage to the latest output signal reaching 50% of the supply voltage during each transition. The worst case delay is the largest delay among all input data for each applied voltage. When the supply voltage is larger than 1.0V, a computation rate of 100MHz is used. When the supply voltage is equal to or lower than 1.0V, a computation rate of 10MHz is used. The power consumed by the additional buffers is excluded from the average power consumption calculation.
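The delay definition above can be expressed as a short post-processing sketch. It assumes that each signal is available as sampled (time, voltage) lists for a single transition and that the 50% crossing is located by linear interpolation between samples. These assumptions and the function names are illustrative and do not describe how the simulator reports delays.

```python
from typing import List, Sequence, Tuple

Waveform = Tuple[Sequence[float], Sequence[float]]  # (times, voltages)


def crossing_time(t: Sequence[float], v: Sequence[float], vdd: float) -> float:
    """First time the waveform crosses vdd / 2, by linear interpolation."""
    half = vdd / 2.0
    for i in range(1, len(t)):
        lo, hi = v[i - 1], v[i]
        # A crossing lies in this interval if the samples straddle the 50% level.
        if lo != hi and (lo - half) * (hi - half) <= 0:
            frac = (half - lo) / (hi - lo)
            return t[i - 1] + frac * (t[i] - t[i - 1])
    raise ValueError("waveform never crosses 50% of the supply voltage")


def transition_delay(inputs: List[Waveform], outputs: List[Waveform],
                     vdd: float) -> float:
    """Earliest input 50% crossing to latest output 50% crossing."""
    t_in = min(crossing_time(t, v, vdd) for t, v in inputs)
    t_out = max(crossing_time(t, v, vdd) for t, v in outputs)
    return t_out - t_in


def worst_case_delay(delays: Sequence[float]) -> float:
    """Largest delay over all applied input data at one supply voltage."""
    return max(delays)
```

Signals that do not switch during a transition raise `ValueError` in this sketch, so a caller would pass only the inputs and outputs that actually toggle.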

The simulation results show that all circuits except Compressor 1 are able to work at a supply voltage as low as 0.7V. The minimum voltage at which Compressor 1 is still operable is 1.2V. The malfunction at lower voltages is due to the threshold voltage drop of the XOR-XNOR circuit of Fig. 5(a). Figs. 9 to 11 show the power consumption ( $\mu$ W), worst case delay (ns) and the power-delay product (fJ) over the voltage range of 0.7V to 3.3V. The proposed circuit consumes the least power at voltages below 2.5V. For example, it consumes 13% to 22.7% less power than the other circuits at 1.8V and 10.7% to 23.7% less at 0.8V. The delay profile is better than that of Compressors 1, 2 and 4 in low voltage operation, and is comparable to that of Compressor 3. More importantly, its power efficacy, measured by the power-delay product, outperforms the other circuits at supply voltages below 1.8V: its power-delay product is 1.5% to 10.5% less than that of the other circuits at 1.8V and 14.1% to 24.6% less at 0.8V. Due to the superior power efficacy at low voltages, our proposed (5:2) compressor is particularly suitable for future low power applications where sub-1V power supplies will become part of the mainstream technologies.

## 5. CONCLUSION

Various architectures of the (5:2) compressor are analyzed. Different CMOS logic design styles for implementing the primitive modules of the compressor at transistor level are explored. Two new circuits are proposed, one for the complex logic gate that co-generates the XOR-XNOR outputs and the other for the multiplexer. A novel, cascadable (5:2) compressor with a regular layout is constructed from these composite gate and multiplexer circuits. Simulation results show that the new (5:2) compressor is capable of functioning down to 0.7V while still maintaining superior power efficacy. Therefore, it is an excellent cell library component for realizing future high-speed, low power arithmetic logic units at sub-1V supply voltages.

## 6. REFERENCES

- [1] O. Kwon, K. Nowka, and E.E. Swartzlander, "A 16-bit x 16-bit MAC design using fast 5:2 compressor," in *Proc. IEEE Int. Conf. on Application-Specific Systems, Architectures, and Processors*, pp. 235-243, 2000.
- [2] K. Prasad and K.K. Parhi, "Low-power 4-2 and 5-2 compressors," in *Proc. of the 35th Asilomar Conf. on Signals, Systems and Computers*, vol. 1, pp. 129-133, 2001.
- [3] P.J. Song and G. De Micheli, "Circuit and architecture trade-offs for high-speed multiplication," *IEEE J. of Solid-State Circuits*, vol. 26, no. 9, pp. 1184-1198, Sept. 1991.
- [4] A.M. Shams, T.K. Darwish, and M.A. Bayoumi, "Performance analysis of low-power 1-bit CMOS full adder cells," *IEEE Trans. on VLSI Syst.*, vol. 10, no. 1, pp. 20-29, 2002.
- [5] A.M. Shams and M.A. Bayoumi, "A structured approach for designing low power adders," in *Proc. of the 31st Asilomar Conf. on Signals, Systems & Computers*, vol. 1, pp. 757-761, 1997.
- [6] D. Radhakrishnan and A.P. Preethy, "Low power CMOS pass logic 4-2 compressor for high-speed multiplication," in *Proc. of the 43rd IEEE Midwest Symp. on Circuits and Systems*, vol. 3, pp. 1296-1298, 2000.



Figure 9. Power consumption ( $\mu$ W)



Figure 10. Worst case delay (ns)



Figure 11. Power delay product (fJ)