# A 35 $\mu$W 1.1 V GATE ARRAY 8×8 IDCT PROCESSOR FOR VIDEO-TELEPHONY

Roberto Rambaldi, Alessandro Uguzzoni, and Roberto Guerrieri

DEIS - Università di Bologna - Italy; Central R&D - ST Microelectronics - Italy

# ABSTRACT

We have designed and fabricated a low power IC that performs the inverse 8×8 DCT according to the CCITT precision specifications and is suitable for portable video communication devices. Several design techniques have been used to reduce the power: a fast algorithm, an architecture that can exploit input signal correlation, and a large amount of parallelism. The chip is fabricated in a triple metal 0.5 $\mu$m Gate Array CMOS technology. The maximum throughput is 400 Kpix/s at 1.1 V, and 27 Mpix/s at 3.3 V. The measured power consumption is 35 $\mu$W for typical image sequences in color QCIF format at 10 frames/s with a 1.1 V power supply, making this device ideal for low power portable applications.

# 1. INTRODUCTION

The increasing demand for multimedia services, and in particular videoconferencing, both for the PC market and for portable devices, has boosted interest in efficient video compression techniques that allow real time communication over low cost, low bandwidth channels such as telephone lines. Several standards [3] describe the functionalities of these systems, which typically require large amounts of computational power. Very fast microprocessors can now deliver the performance needed to address some of these tasks in software through dedicated instruction sets and high clock speeds, but this solution is not viable for high performance portable systems, which must be power efficient [4].

In this paper we address this problem and describe the design and implementation of a low power ASIC macrocell for video decompression that computes the  $8 \times 8$  IDCT, as required by a single chip H.263 decoder under development. The paper presents the IDCT operation, the chip architecture, the design methodology and the test results from the fabricated prototype.

# 2. THE IDCT

Most block based video compression techniques exploit the spatial and temporal redundancy of image sequences by differential frame coding, possibly augmented by motion compensation, followed by transform coding, quantization and entropy coding. The better the image transform packs the energy content of the differential blocks into a small set of coefficients, the more effective the latter compression steps are for a given quality loss. For practical images, the Discrete Cosine Transform (DCT) has near-optimal packing properties [], and it is widely used in compression algorithms, while the dual transform (IDCT) represents the bulk of the computation in decompression systems. The mathematical definition of the IDCT, generally applied to  $8 \times 8$  data blocks, is the following:

$$X_{i,j} = \sum_{u,v=0}^{7} Y_{u,v} k_u k_v \cos\left[(2i+1)u\frac{\pi}{16}\right] \cos\left[(2j+1)v\frac{\pi}{16}\right] \tag{1}$$

where

$$k_n = \begin{cases} \frac{1}{2\sqrt{2}} & n = 0\\ \frac{1}{2} & n \neq 0 \end{cases} \qquad \text{for each } 0 \le i, j < 8$$

and  $X_{i,j}$  and  $Y_{u,v}$  are the values in the space and frequency domains respectively.
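As a point of reference, Equation 1 can be evaluated literally. The following is a minimal sketch, not the processor's algorithm; the function names `k` and `idct_8x8_direct` are illustrative choices that do not appear in the paper. It makes the cost of the direct form visible: 64 multiply-accumulate terms for each of the 64 output pixels.

```python
import math

N = 8  # block size used throughout the paper

def k(n):
    """Normalization factor k_n from Equation 1."""
    return 1.0 / (2.0 * math.sqrt(2.0)) if n == 0 else 0.5

def idct_8x8_direct(Y):
    """Evaluate Equation 1 term by term: X[i][j] from the frequency block Y[u][v]."""
    X = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = 0.0
            for u in range(N):
                cu = k(u) * math.cos((2 * i + 1) * u * math.pi / 16)
                for v in range(N):
                    cv = k(v) * math.cos((2 * j + 1) * v * math.pi / 16)
                    acc += Y[u][v] * cu * cv
            X[i][j] = acc
    return X

# A block holding only a DC coefficient decodes to a flat block of value Y[0][0] / 8.
Y = [[0.0] * N for _ in range(N)]
Y[0][0] = 80.0
X = idct_8x8_direct(Y)
```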

For reasons of stability in closed loop predictive coding schemes [3], finite precision IDCT operations are required to have very high accuracy. In addition, real time execution requires very large computational power: rough estimates using Equation 1 are 50-500 MOPs (1 multiply or add = 1 operation) for video-conference decoding, and over 3 GOPs for VGA MPEG coding. These figures can be cut down by a factor of up to 6 or 7 using one of the several fast algorithms ([6, 8] and many others) that have been developed by exploiting symmetry properties of the DCT and IDCT kernel matrices. One of the most efficient solutions consists in rewriting Equation 1 as the matrix product

$$X = CYC^{t} = C(CY^{t})^{t} = (\mathcal{ABCDE})((\mathcal{ABCDE})Y^{t})^{t}$$

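where $C$ is the $8 \times 8$ matrix of the one-dimensional transform, with elements $C_{i,u} = k_u \cos\left[(2i+1)u\frac{\pi}{16}\right]$, and the factor matrices $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, $\mathcal{D}$ and $\mathcal{E}$ each have a smaller set of non-zero elements.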
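The row-column form of the matrix product can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the dense matrix $C$ built from the kernel of Equation 1 rather than the sparse factors, whose entries are not given in the text, and the helper names are hypothetical. It shows that two passes of the same 1-D transform with a transposition in between reproduce the 2-D result.

```python
import math

N = 8

# C[i][u] = k_u * cos((2i+1) u pi / 16), the 1-D kernel taken from Equation 1.
C = [[(1 / (2 * math.sqrt(2)) if u == 0 else 0.5)
      * math.cos((2 * i + 1) * u * math.pi / 16)
      for u in range(N)] for i in range(N)]

def matmul(A, B):
    """Plain N x N matrix product."""
    return [[sum(A[r][k] * B[k][c] for k in range(N)) for c in range(N)]
            for r in range(N)]

def transpose(M):
    return [list(row) for row in zip(*M)]

def idct_8x8_separable(Y):
    """Two passes of the same 1-D transform with a transposition in between."""
    first_pass = matmul(C, transpose(Y))       # C Y^t
    return matmul(C, transpose(first_pass))    # C (C Y^t)^t

# Example input: a few low-frequency coefficients, all other entries zero.
Y = [[0.0] * N for _ in range(N)]
Y[0][0], Y[0][1], Y[2][1] = 120.0, -35.0, 14.0
X = idct_8x8_separable(Y)
```

Each pass costs 8 × 64 multiplications with the dense matrix, so the separable form already needs 1024 multiplications per block instead of the 4096 of the direct double sum, before any sparse factorization is applied.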

This approach reduces the number of additions and multiplications by isolating common subexpressions that can be used more than once to get the final results. Other methods include distributed arithmetic and recursive decomposition [9], with "computationally" efficient solutions that reduce the amount of brute force computation at the cost of irregular algorithms that are not always well suited for hardware implementation. Many architectures have been proposed ([10, 11, 12, 13, 14] and others), with emphasis on different resources and performance requirements. Because it suits a low power, low voltage parallel design, we chose a separable method with one-dimensional transforms implemented according to the 14-multiplier algorithm illustrated in the signal flow graph of Figure 1, which has good properties of compactness, regularity and low latency.
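To illustrate what "isolating common subexpressions" means, the sketch below applies the simplest symmetry of the kernel, $\cos[(2(7-i)+1)u\frac{\pi}{16}] = (-1)^u \cos[(2i+1)u\frac{\pi}{16}]$, to the 1-D transform. This is not the 14-multiplier flow graph of Figure 1, which is not reproduced in the text; it is only a first even/odd split, with hypothetical function names, that halves the multiplications from 64 to 32.

```python
import math

def k(n):
    """Normalization factor k_n from Equation 1."""
    return 1 / (2 * math.sqrt(2)) if n == 0 else 0.5

def term(y, u, i):
    """One product of the 1-D kernel: k_u * y[u] * cos((2i+1) u pi / 16)."""
    return k(u) * y[u] * math.cos((2 * i + 1) * u * math.pi / 16)

def idct8_direct(y):
    """Reference 1-D IDCT: 8 products for each of the 8 outputs."""
    return [sum(term(y, u, i) for u in range(8)) for i in range(8)]

def idct8_even_odd(y):
    """1-D IDCT that computes products for outputs 0..3 only and reuses them."""
    x = [0.0] * 8
    for i in range(4):
        even = sum(term(y, u, i) for u in (0, 2, 4, 6))
        odd = sum(term(y, u, i) for u in (1, 3, 5, 7))
        x[i] = even + odd        # output i
        x[7 - i] = even - odd    # mirrored output shares both partial sums
    return x

y = [10.0, -3.0, 5.0, 0.0, 2.0, 7.0, -1.0, 4.0]
reference = idct8_direct(y)
fast = idct8_even_odd(y)
```

Fast algorithms such as those cited above continue this kind of sharing recursively, which is how the multiplier count drops far below 32.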

# 3. THE IDCT ARCHITECTURE

Very low power consumption can be obtained mainly by reducing the power supply voltage. For this reason, the design of the chip was driven by the need to maximize parallelism, so that the architecture is fast enough to sustain the desired throughput at the lowest possible supply voltage. Moreover, the number of sequential operations in a single cycle should be kept as small as possible, to reduce both the critical path and the spurious switching activity. At the same time, the number of pipeline stages and the amount of control logic should be minimized, since they do not contribute directly to the computation but are a source of large power consumption.

From these considerations, we have implemented a completely parallel, single-cycle 8-point IDCT processor as the core for the separable 2-D transform. To design the data path, the signal flow graph of Figure 1 has been re-mapped onto a more regular structure by moving the operators around and using the associative and distributive properties of sums and products. This helped hardware compaction and reduced the length of the interconnection paths, while the high number of wire crossings was not a problem from the layout viewpoint in a three-metal-layer technology.

By processing 8 pixels at a time, throughput increases and the clock frequency can be reduced, while the fully parallel structure virtually eliminates control overhead and generates a very regular layout. Moreover, parallelism helps to reduce power consumption by partitioning the circuit into sections activated by correlated data flows. In the design of the stand-alone macrocell, we could not use the same amount of parallelism in the I/O interfaces because of the external pin count limitation, so serial-to-parallel and parallel-to-serial converters had to be implemented to support single pixel I/O. To minimize power consumption, both interfaces use a ping-pong scheme that reduces the switching activity in the memory elements by a factor of 9 with respect to a shift register solution.
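One way to see where a reduction of this size can come from is to count register write events in the two serial-to-parallel schemes. The accounting below is an assumption made for illustration, not the authors' model: the shift register is taken to be 8 stages followed by an 8-word holding register loaded once per row, and the ping-pong interface is taken to strobe each pixel once into its final slot of one of two banks. Write events are only a proxy for switching activity, which also depends on the data.

```python
def shift_register_writes(n_pixels, width=8):
    """Assumed model: a width-stage shift register plus a parallel holding register."""
    writes = 0
    for p in range(1, n_pixels + 1):
        writes += width          # every stage is rewritten on each shift
        if p % width == 0:
            writes += width      # the full row is copied to the holding register
    return writes

def ping_pong_writes(n_pixels, width=8):
    """Assumed model: two banks, each pixel written once into its final position."""
    return n_pixels              # one register written per incoming pixel

pixels = 64 * 100                # one hundred 8x8 blocks
ratio = shift_register_writes(pixels) / ping_pong_writes(pixels)
print(ratio)
```

Under these assumptions the ratio is 72 writes against 8 per row of pixels; whether this is the accounting behind the figure quoted above is not stated in the text.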

To allow for continuous pipelined I/O and higher throughput, and to minimize the amount of memory on chip, we used two IDCT cores (Figure 2) instead of time multiplexing a single core and buffering the I/O. A special architecture for the memory (2.3 Kbit) was implemented to perform a fully pipelined, address-transparent matrix transposition, working on 144 bits in parallel.

Area reduction and further architecture speedup were achieved through careful Booth encoding of the hardwired multipliers and extensive optimization of the carry-save adder network in the core of the processor. The control logic is limited to very few gates, which address the memory and the I/O registers. To reduce power consumption, this unit drives already decoded strobing signals to address the memory elements, with the advantage of reducing total wire capacitance and minimizing internal switching frequency.
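Because the multipliers are hardwired, each cosine coefficient is a fixed constant, and recoding it with signed digits reduces the number of shifted terms that must be summed. The sketch below uses canonical signed-digit (CSD) recoding, a close relative of Booth recoding chosen here for illustration; the paper does not say which recoding was used, and the 12-bit coefficient scaling (`FRAC_BITS`) and function names are assumptions.

```python
import math

def csd_digits(n):
    """Canonical signed-digit form of a positive integer, least significant digit first."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)      # +1 if n mod 4 == 1, -1 if n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def multiply_by_constant(x, digits):
    """Shift-and-add/subtract product using only the non-zero digits of the constant."""
    return sum(d * (x << i) for i, d in enumerate(digits) if d)

FRAC_BITS = 12                                   # assumed coefficient word length
c = round(math.cos(math.pi / 16) * (1 << FRAC_BITS))
digits = csd_digits(c)
binary_terms = bin(c).count("1")                 # partial products without recoding
csd_terms = sum(1 for d in digits if d)          # partial products after recoding
print(c, binary_terms, csd_terms)
```

Fewer non-zero digits means fewer rows entering the carry-save adder network, which is where the area and delay savings of a recoded constant multiplier come from.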

# 4. DESIGN METHODOLOGY

The framework of the architectural optimization was completed by an extensive set of bit-accurate software simulations that validated the CCITT precision requirements (Table 1). Several combinations of internal word-lengths and partial rounding techniques were analyzed, resulting in a minimum datapath width of 18 bits.
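A bit-accurate simulation of this kind can be sketched as follows. This is a simplified stand-in under stated assumptions, not the authors' simulator: it uses a plain row-column transform with integer coefficients, a single rounding point between the two passes, Python's `random` module instead of the generator prescribed by the precision specification, and only two of the error measures. The parameters `coef_bits` and `frac_bits` are hypothetical knobs for the word-length exploration.

```python
import math
import random

N = 8
C = [[(1 / (2 * math.sqrt(2)) if u == 0 else 0.5)
      * math.cos((2 * i + 1) * u * math.pi / 16)
      for u in range(N)] for i in range(N)]

def transpose(M):
    return [list(r) for r in zip(*M)]

def apply(A, M):
    """Matrix product A @ M for N x N lists (floats or ints)."""
    return [[sum(A[r][k] * M[k][c] for k in range(N)) for c in range(N)]
            for r in range(N)]

def idct_float(Y):
    return apply(C, transpose(apply(C, transpose(Y))))

def idct_fixed(Y, coef_bits, frac_bits):
    """Row-column IDCT with integer coefficients and a rounded intermediate word."""
    Cq = [[round(c * (1 << coef_bits)) for c in row] for row in C]
    def rshift_round(v, s):
        return (v + (1 << (s - 1))) >> s          # round half up, arithmetic shift
    s1 = coef_bits - frac_bits                    # keep frac_bits after pass 1
    t = [[rshift_round(v, s1) for v in row] for row in apply(Cq, transpose(Y))]
    s2 = coef_bits + frac_bits                    # back to integers after pass 2
    return [[rshift_round(v, s2) for v in row] for row in apply(Cq, transpose(t))]

def error_stats(coef_bits, frac_bits, blocks=50, seed=1):
    """Peak and overall mean square error against a rounded floating-point IDCT."""
    rng = random.Random(seed)
    Ct = transpose(C)
    peak, sq_sum = 0, 0
    for _ in range(blocks):
        pixels = [[rng.randint(-256, 255) for _ in range(N)] for _ in range(N)]
        # Forward DCT (Y = C^t X C), rounded to integer input coefficients.
        Y = [[math.floor(v + 0.5) for v in row] for row in apply(apply(Ct, pixels), C)]
        ref = [[math.floor(v + 0.5) for v in row] for row in idct_float(Y)]
        out = idct_fixed(Y, coef_bits, frac_bits)
        for i in range(N):
            for j in range(N):
                e = abs(out[i][j] - ref[i][j])
                peak = max(peak, e)
                sq_sum += e * e
    return peak, sq_sum / (blocks * N * N)

peak, mse = error_stats(coef_bits=16, frac_bits=6)
print(peak, mse)
```

Sweeping `coef_bits` and `frac_bits` and comparing the resulting statistics with the limits of Table 1 is the kind of search that leads to a minimum datapath width.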

Since architecture and system-level strategies are the key factors in the reduction of power consumption, estimating power early in the design cycle has been a strong point of our design methodology. At the software level, quick modeling and simulation help define the algorithm and the main features of the architecture by estimating, in addition to the precision, the amount of hardware required and the switching activity of a large part of the nets of the processor. With a rough estimate of power consumption available, it is possible to span a large design space relatively fast and to optimize the architecture by taking advantage of the characteristics of the input data. At the hardware Front-End level, a novel design methodology that estimates the power dissipation before the Back-End phase of the design cycle [1] lets designers optimize the final architecture of the processor and verify the figures estimated in the first step, which is then repeated if necessary before advancing further in the design flow.

Figure 1: Signal Flow Graph of the 8 point IDCT.

For our project, the software analysis has been very important for the algorithmic study and for exploiting the high correlation of the input data, which was obtained from the video-conferencing software developed in the VLSI laboratory at DEIS. In particular, we found that the IDCT is statistically characterized at the input by a predominance of zero blocks, rows and columns, with some nonzero coefficients located in the same positions (typically close to the DC values), while the output is strongly autocorrelated, both spatially and temporally. For this reason we chose to focus on an architecture dominated by combinational logic, which is exposed to some glitching but has the great advantage of consuming virtually no power most of the time because of very limited signal transitions. The rest of the design, and in particular the network of adders and multipliers, could then be optimized for power at the floor plan level, where a more accurate estimation was possible.

Regarding circuit design, our objective has been to run the circuit at the desired speed at 1.1 V, in order to take advantage of the elimination of short circuit current that occurs when  $V_{dd} < V_{Tn} + |V_{Tp}|$  (1.3 V in our technology). The main disadvantage is the long transition time of the devices, which is particularly critical in a block with deep combinational logic such as the 1D-IDCT core. This problem was overcome with appropriate buffering and careful selection of the standard cells used in the design, avoiding cells with too many stacked transistors.
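The short circuit condition can be checked directly against the threshold voltages listed in Table 3. As a brief worked step:

$$V_{Tn} + |V_{Tp}| = 0.65\,\mathrm{V} + 0.65\,\mathrm{V} = 1.3\,\mathrm{V} > V_{dd} = 1.1\,\mathrm{V}$$

Since the supply is lower than the sum of the two thresholds, no input voltage can hold the pull-up and pull-down networks of a static CMOS gate above threshold at the same time during a transition.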

# 5. RESULTS AND CONCLUSIONS

The IDCT processor was fabricated using a 0.5 $\mu$m single-poly, triple metal CMOS Gate Array technology from SGS-Thomson Microelectronics. The photograph of the chip is shown in Figure 2, and its characteristics are reported in Table 3.

Figure 2: Block diagram of the architecture.

Figure 4 shows the results of the power measurements. With a typical stream of data, the chip dissipates 35 $\mu$W running at 400 kHz, which is equivalent to 10 MOPs/$\mu$W at a supply voltage of 1.1 V. This speed is sufficient to perform the IDCT of QCIF images (176×144) at 10 frames/s in color (15 in B/W), which is the target application for video-conferencing over telephone lines. 30-Hz 640×480 color VGA throughput can already be achieved at 1.8 V (8.5 mW), while at 3.3 V the speed exceeds 27 Mpix/s with a consumption of 67 mW. This is more than sufficient for real-time SuperVGA MPEG decoding. Deeper pipelining of the design would boost the performance by a factor larger than 4, but at the cost of a largely increased power dissipation. In fact, Figure 4 shows that, thanks to the architectural design, power consumption is directly correlated with input activity, being much higher for uncorrelated (random) input, which is not a real working condition. On the other hand, standby consumption, measured with constant zero input, is limited to 10 $\mu$W at 1.1 V but is always present. Note that this represents 28% of the total dissipation and is generated by sequential logic that is roughly equivalent, from the dissipation viewpoint, to one level of pipeline in the core processor.

An additional increase in energy efficiency comes from the reduction of the effective input gate capacitance of the CMOS logic gates when the power supply approaches twice the threshold voltage. Figure 3 plots the measured effective capacitance of the IDCT module as a function of the supply voltage. It shows that the benefit of scaling the supply beyond twice the threshold voltage is more than quadratic and accounts for an additional 15% power reduction at a 1 V power supply. These capacitance values have been computed as

$$C_{eff} = \frac{P}{f_{clk} V_{dd}^2} \tag{2}$$

where $P$ is the total power consumption, $f_{clk}$ is the clock frequency and $V_{dd}$ is the supply voltage.
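As a worked instance of Equation 2, the sketch below evaluates it at the two typical-data operating points quoted in this section. The function name is illustrative, and the clock frequency at 3.3 V is taken as 27 MHz from the throughput of 1 pixel/clock. The two points are measured at different clock rates, so this is not a reproduction of Figure 3, which is taken at a fixed 400 kHz.

```python
def effective_capacitance(power_w, f_clk_hz, vdd_v):
    """Equation 2: C_eff = P / (f_clk * Vdd^2), in farads."""
    return power_w / (f_clk_hz * vdd_v ** 2)

c_low = effective_capacitance(35e-6, 400e3, 1.1)    # 35 uW at 400 kHz, 1.1 V
c_high = effective_capacitance(67e-3, 27e6, 3.3)    # 67 mW at 27 MHz, 3.3 V
print(f"{c_low * 1e12:.1f} pF at 1.1 V, {c_high * 1e12:.1f} pF at 3.3 V")
```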

The typical case consumption of 67 mW at 27 Mpix/s compares very favorably with the figures reported in the literature, and in particular with the Toshiba design in [14], which was the lowest power IDCT processor we knew of: 150 mW at 40 Mpix/s at 2 V in 0.5 $\mu$m custom CMOS, which we can extrapolate to 100 mW at 27 Mpix/s.
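The throughput requirements and the extrapolation quoted in this section can be checked with a few lines of arithmetic. The sketch assumes 4:2:0 chroma subsampling for color (1.5 samples per pixel), as used by H.263, and assumes that power scales linearly with pixel rate at a fixed supply voltage for the extrapolation; the function name is hypothetical.

```python
def pixel_rate(width, height, fps, samples_per_pixel):
    """Samples per second; 1.0 for luminance only, 1.5 for 4:2:0 color (assumed)."""
    return width * height * samples_per_pixel * fps

qcif_color = pixel_rate(176, 144, 10, 1.5)   # color QCIF at 10 frames/s
qcif_bw = pixel_rate(176, 144, 15, 1.0)      # B/W QCIF at 15 frames/s
vga_color = pixel_rate(640, 480, 30, 1.5)    # 30-Hz color VGA

# Linear scaling of the 150 mW, 40 Mpix/s figure of [14] down to 27 Mpix/s.
extrapolated_mw = 150 * 27 / 40

print(qcif_color, qcif_bw, vga_color, extrapolated_mw)
```

Both QCIF cases come out just under the 400 Kpix/s available at 1.1 V, and the VGA case stays below the 27 Mpix/s reached at 3.3 V.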

Our IDCT processor is currently being integrated in a single chip decoder for H.263 video-conferencing, using standard cells in a 0.35 $\mu$m CMOS technology. In this latter implementation, power is further reduced thanks to the removal of the I/O interfaces and to additional system level optimizations; the resulting area of the macrocell is below 4 $\text{mm}^2$.

# 6. ACKNOWLEDGEMENTS

The authors wish to thank Prof. G. Baccarani for his encouragement and advice, and M. Borgatti and L. Bolcioni for helpful discussions during this work.

# 7. REFERENCES

- [1] R. Zafalon, C. Guardiani, M. C. Rossi, R. Rambaldi "Forward Power Annotation on Physical Layout Floorplan" IEEE 1996 Custom Integrated Circuit Conference, May 1996, pp. 389-392.
- [2] R. Guerrieri, M. Borgatti, L. Bolcioni "Sub 1-Volt Operation of CMOS Devices for Very-Low Power Circuits: Possibilities and Limits" Low Power/Low Voltage Workshop (ESSCIRC'97), September 1997.
- [3] ITU-T Recommendation H.263 "Line Transmission of Non Telephone Signals: Video Coding for Low Bitrate Communication" The International Telecommunication Union, Geneva 1996.
- [4] A. P. Chandrakasan, S. Sheng, R. W. Brodersen "Low Power CMOS Digital Design" IEEE Journal of Solid State Circuits, Vol. 27, N. 4, April 1992, pp. 473-484.
- [5] A. P. Chandrakasan, A. Burstein, R. W. Brodersen "A Low-Power Chipset for a Portable Multimedia I/O Terminal" IEEE Journal of Solid State Circuits, Vol. 29, N. 12, December 1994, pp. 1415-1428.
- [6] B. G. Lee: "A New Algorithm to Compute the Discrete Cosine Transform" IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-32, No. 6, pp. 1243-1245, December 1984.
- [7] C. Loeffler, A. Ligtenberg, G. Moschytz: "Practical, Fast 1-D DCT Algorithm with 11 Multiplications" Proceedings of ICASSP-89, Vol. 2, pp. 988-991.
- [8] N. I. Cho, I. D. Yun, S. U. Lee: "On the Regular Structure of the Fast 2-D DCT Algorithm" IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, Vol. 40, No.4, April 1993, pp. 259-266.
- [9] P. Lee, F. Y. Huang: "Restructured Recursive DCT and DST Algorithms" IEEE Transactions on Signal Processing, Vol. 42, No. 7, July 1994, pp. 1600-1609.

- [10] O. Duardo, et al.: "Architecture and Implementation of ICs for a DSC-HDTV Video Decoder System" IEEE Micro, October 1992, pp. 22-27
- [11] S. Uramoto, et al.: "A 100-MHz 2-D discrete cosine transform core processor" IEEE Journal of Solid State Circuits, Vol. 27, No. 4, pp.492-499, April 1992.
- [12] P. A. Ruetz, Po Tong: "A 160-Mpixel/s IDCT Processor for HDTV" IEEE Micro, October 1992, pp. 28-32.
- [13] Y. F. Jang, et al.: "A 0.8um 100-MHz 2-D DCT Core Processor" IEEE Transactions on Consumer Electronics, Vol. 40, No. 3, August 1994, pp. 703-709.
- [14] M. Matsui, et al.: "A 200 MHz 13 mm2 2-D DCT Macrocell Using Sense-Amplifying Pipeline Flip-Flop Scheme" IEEE Journal of Solid State Circuits, Vol. 29, No. 12, December 1994, pp. 1482-1489.

| Error type          | Limit  | Result |
|---------------------|--------|--------|
| Peak                | 1      | 1      |
| Pixel Mean          | 0.0200 | 0.0094 |
| Pixel Mean Square   | 0.0600 | 0.0164 |
| Overall Mean        | 0.0138 | 0.0026 |
| Overall Mean Square | 0.0150 | 0.0150 |

Table 1: CCITT precision specifications and results.



| Feature          | Value                                  |
|------------------|----------------------------------------|
| Technology       | 0.5 $\mu$m 3-metal CMOS Sea of Gates   |
| $V_{tn}$         | 0.65 V                                 |
| $V_{tp}$         | 0.65 V                                 |
| Transistor Count | 200K (70K for Memory)                  |
| Dissipation      | 35 $\mu$W @ 1.1 V @ 400 kHz            |
| Data Format      | IN: 12 bit DCT; OUT: 9 bit pixel       |
| Throughput       | 1 pixel/clock                          |
| Latency          | 80 cycles                              |
| Maximum Speed    | 27 MHz @ 3.3 V                         |

Table 3: Features of the Processor



Figure 3: Gate Capacitance Reduction. Average switched capacitance (nF) calculated with typical input data and a clock rate of 400 kHz.



Figure 4: Maximum speed and power consumption. Power is shown with random, typical, and always zero input data, at room temperature and 400 kHz.