# IMAGE PROCESSING SYSTEM USING A PROGRAMMABLE TRANSFORM IMAGER

Jungwon Lee, A. Bandyopadhyay, I. Faik Baskaya, Ryan Robucci, and Paul Hasler

School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332

## ABSTRACT

We propose an image processing system built around an analog transform imager chip. The transform imager is a CMOS imager that computes block transforms using floating-gate circuits. Because it is highly programmable, it can perform various block transforms such as the Walsh-Hadamard transform, the discrete sine transform (DST), and the discrete cosine transform (DCT). As an application, we investigate how much computation and power the imager saves in a baseline JPEG compression system. The 48x40 transform imager provides 144MOPS/s in an area of $1.03mm^2$ in $0.5\mu m$ CMOS technology, and its fill factor of 46% is much higher than that of other focal-plane imagers. Measured transformed images and reconstructed images are presented to demonstrate the operation.

## 1. LOW POWER SIGNAL PROCESSING

The prevalence of mobile systems in daily life creates a growing demand for low-power signal processing systems. Conventional digital-only signal processing has limitations in meeting this demand. Analog computational circuits, on the other hand, offer advantages in power consumption and real-time signal processing. There is great potential in integrating both types of circuits into a cooperative analog-digital signal processing (CADSP) system [1].

A transform imager blends the advantage of focal-plane imagers and the advantage of active pixel sensors (APS), so it can do significant amounts of signal processing with large imaging arrays [2]. An image sensor with a multiplier is implemented with only two transistors in the transform imager, which makes it possible not only to reduce the chip area but also to reduce power consumption.

In this work, a low-power reconfigurable digital image processing system using the transform imager was designed. To demonstrate one potential application, a baseline JPEG compression system was implemented. Section 2 describes the transform imager, which is the analog core of the system. Section 3 presents JPEG compression and 3D DCT video compression and estimates the reduction in computation and power. Section 4 reports the experimental results. Section 5 concludes the paper with a summary.



**Fig. 1**. Transform imager architecture: The chip performs arbitrary separable block transforms. The basis functions are programmed on-chip using floating-gate circuits.

## 2. PROGRAMMABLE TRANSFORM IMAGER

Fig. 1 shows the block diagram of our single-chip CMOS imager. This imager architecture can compute arbitrary separable 2D linear operations. These operations are expressed as two matrix multiplications on the image,

$$Y = A^T P B \tag{1}$$

where P is the image matrix of pixels, Y is the computed output matrix, and A and B are the transform kernels [3]. Each pixel consists of a photodiode sensor element and an analog multiplier, implemented as a differential pair that uses the photodiode as its current source. Effectively, each pixel element gives a current output that is the product of the light intensity and a predefined input voltage. A vector of input voltages comes from the circuitry to the image array and is shared by each column of the image array. Along each column, outputs are tied together to obtain a summation of currents. This means that the output of each column is a sum of products; in particular, it is the inner product of the voltage vector with the light intensities along the column. Therefore, the first matrix multiplication, $A^T P$, takes place at the pixel array itself. The output currents from the pixel array are supplied to the multiplier array for the second multiplication (with B). For the purpose of illustration in this paper, a transform imager chip that has 48x40 pixels and no analog computational array was fabricated. Imagers with higher resolution and the analog computing array have already been fabricated and are being tested.

**Fig. 2.** Die micrograph: This 48x40 transform imager was designed in $0.5\mu m$ N-well CMOS technology. The chip area is $1.03mm^2$.

| Parameter           | Value                   |
|---------------------|-------------------------|
| Technology          | $0.5 \mu m$ N-well CMOS |
| Array size          | 48 x 40                 |
| Pixel size          | 13.5 µm x 13.5 µm       |
| Fill factor         | 46%                     |
| Transistor count    | 10.7K                   |
| Frame rate          | 20 fps                  |
| Resolution of pixel | 6 bits                  |
| Frequency response  | DC to 100 kHz           |
| Power consumption   | $16.5 \mu W$/frame      |

**Table 1**. Summary of the chip characteristics
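The separable transform of Eq. (1) can be illustrated in software. The following is a minimal sketch, not the chip's implementation: it assumes an orthonormal DCT-II kernel and picks A = B = Cᵀ so that Eq. (1) yields the 2D DCT of an 8x8 block. The helper names (`dct_matrix`, `separable_transform`) are hypothetical.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix C; row k holds the k-th basis function."""
    c = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        c.append([scale * math.cos(math.pi * (2 * m + 1) * k / (2 * n))
                  for m in range(n)])
    return c

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    # Each output entry is an inner product, like the summed column currents.
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def separable_transform(a, p, b):
    """Y = A^T P B, as in Eq. (1)."""
    return matmul(matmul(transpose(a), p), b)

N = 8
C = dct_matrix(N)
A = B = transpose(C)                 # assumed kernels: Y = C P C^T (2D DCT)
P = [[10.0] * N for _ in range(N)]   # uniformly lit block (illustrative data)
Y = separable_transform(A, P, B)

# Because C is orthonormal, the block is recovered with P = A Y B^T.
P_rec = matmul(matmul(A, Y), transpose(B))
```

For a uniformly lit block, all of the energy lands in the DC coefficient `Y[0][0]` and the remaining coefficients are zero up to rounding error.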

The pixel array is subdivided into blocks, and different stored kernels can be applied to each block. The difference of the multiplier outputs (currents) is integrated in a row-parallel manner, and the analog voltage output is then multiplexed out. Sample-and-hold circuits are used for reading one row at a time for each block. This process is repeated for all the rows after the necessary kernel has been placed onto the selected block for multiplication.

Fig. 2 shows a die micrograph of the fabricated 48x40 transform imager, and Table 1 briefly summarizes the performance of the chip. The tested DCT transformed images are shown in Fig. 3.

**Fig. 3**. DCT results: (a) Original images (b) 8x8 block 2D DCT and DCT coefficients for one block

As an indication of multiplication accuracy, a gain error measurement of the imager array was performed by illuminating the chip with nearly uniform light. The mean gain was 2.12 nA/V, which should be constant across the imager for constant illumination. The average error was 1.58%. This is a conservative measure of expected performance, since currents were measured off chip in a noisy environment. On-chip processing will benefit from an even better signal-to-noise ratio.

A traditional system would have to read out each pixel value, perform an analog-to-digital conversion, and store the image. The data would then be processed by digital circuitry to perform the matrix multiplication. By contrast, performing the matrix multiplication in the pixel array is much more efficient in terms of area, speed, and power.

Our CMOS imager builds on our earlier work, which demonstrated the viability of the pixel element and computed simple single-block transforms, including a single-block DCT, on a fabricated IC [2].

## 3. JPEG COMPRESSION SYSTEM WITH A TRANSFORM IMAGER

**Fig. 4.** Top level view of our JPEG compression system used as an application for signal processing. (a) Conventional approach (b) Our proposed system (c) Tested system (d) Ideal single chip system

Baseline JPEG compression requires computation of a 2D DCT, quantization, and run-length coding followed by Huffman coding [3]. Fig. 4(a) shows a conventional JPEG implementation, in which everything except the imager and the analog-to-digital converter (ADC) is computed in the digital chip. Our proposed system is shown in Fig. 4(b). The DCT block is one of the most computationally intensive blocks in the JPEG system [4]. This block is moved to the programmable transform imager to reduce power consumption. Fig. 4(c) shows the system we tested to prove our approach and concept. The analog computing array is modelled in MATLAB. An FPGA interfaces between the transform imager and MATLAB and also provides control signals to the imager. Fig. 4(d) is an ideal low-power system. The ADC is merged into the transform imager, so that all analog computations are implemented in a single chip. This solution is currently under development and will be presented in a future publication.
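The digital stages that remain once the DCT has moved to the imager can be sketched as follows. This is a simplified illustration rather than the tested system: it assumes a single uniform quantization step (baseline JPEG uses a per-coefficient table), omits Huffman coding, and uses hypothetical helper names and example coefficients.

```python
def quantize(block, step):
    """Uniform quantization (assumption; JPEG uses a quantization table)."""
    return [[int(round(v / step)) for v in row] for row in block]

def zigzag_indices(n=8):
    """(row, col) pairs in zig-zag order, lowest frequencies first."""
    def key(rc):
        s = rc[0] + rc[1]
        # Odd diagonals run top-right to bottom-left, even ones the reverse.
        return (s, rc[0] if s % 2 else -rc[0])
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)

def run_length(ac):
    """Encode AC coefficients as (zero_run, value); (0, 0) marks end of block."""
    pairs, run = [], 0
    for v in ac:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append((0, 0))  # simplified end-of-block marker
    return pairs

# Illustrative DCT coefficients for one 8x8 block (made-up values).
coeffs = [[0.0] * 8 for _ in range(8)]
coeffs[0][0], coeffs[0][1], coeffs[2][0] = 80.0, -24.0, 9.0

q = quantize(coeffs, step=8)
scan = [q[r][c] for r, c in zigzag_indices()]
dc, ac_pairs = scan[0], run_length(scan[1:])
```

Because most quantized high-frequency coefficients are zero, the zig-zag scan groups them into long runs that the run-length stage encodes compactly.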

An implementation of the 8x8 2D DCT based on the fast symmetry-based 1D DCT needs 208 multiplications and 464 additions, and even the fast FFT-based 1D DCT method requires 80 multiplications and 464 additions [5]. By using the transform imager, we can significantly reduce the number of digital functional units. The key concern here is the amount of power that can be saved by this reduction. It is not easy to estimate the total power of 2D DCT digital implementations directly from the number of reduced operations because there are several ways to implement the transform with various architectures. However, even DCT chips optimized for low power consume a considerable amount of power, ranging from about 10mW to 140mW [6],[7]. Usually, fixed implementations consume less power, and reconfigurable implementations consume more. A transform imager consumes less than 5mW. Considering that the transform imager also includes the pixel elements, the reduction in power consumption is significant.
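The quoted operation counts are consistent with a row-column decomposition of the 8x8 2D DCT into sixteen 1D DCTs. The sketch below reproduces them under that assumption; the per-1D-DCT counts (13 or 5 multiplications, 29 additions) are inferred by dividing the quoted totals by 16 and are not stated in the text.

```python
def dct2d_ops(n, mults_1d, adds_1d):
    """Row-column 2D DCT of an n x n block: n row DCTs plus n column DCTs."""
    num_1d = 2 * n
    return num_1d * mults_1d, num_1d * adds_1d

# Assumed per-1D-DCT costs, inferred from the totals quoted above.
symmetry_based = dct2d_ops(8, mults_1d=13, adds_1d=29)
fft_based = dct2d_ops(8, mults_1d=5, adds_1d=29)

print(symmetry_based)  # (208, 464)
print(fft_based)       # (80, 464)
```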

**Fig. 5**. Haar transform results: (a) Original image (b) Haar transform (before reordering) (c) Reconstructed image.

**Fig. 6.** Image sequences (20fps): (a) A 16x16 Walsh-Hadamard transformed image sequence and a reconstructed image sequence (b) A 16x16 DCT transformed image sequence and a reconstructed image sequence.

Because the proposed system reduces the computation of the 2D DCT, a 3D DCT implementation is a good candidate for using it. Video compression based on the 3D DCT is not widely used because the 3D DCT is highly computationally intensive. Direct implementation of the 8x8x8 3D DCT requires 12,288 multiplications and 12,096 additions, and 960 multiplications and 5,568 additions are required if the FFT-based 1D DCT is used [5]. The transform imager can eliminate 2/3 of the total operations because the 2D DCT is already computed. Another problem of the 3D DCT based compression algorithm is the memory requirement. To compute one block of the 8x8x8 3D DCT, 8 frames of data must be stored in a buffer, which requires a large amount of memory. After the DCT, most of the energy is concentrated in the low frequencies, so only a few DCT coefficients are significant, depending on the compression ratio. Therefore, if only the top-left, low-frequency coefficients of each 8x8 block are read and transferred to the digital processing chip, the buffer memory and the bandwidth between the imager and the digital chip can be significantly reduced. Because reading the image simply means sending address signals to the imager, this selective reading is very easy to implement.
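The 2/3 figure follows from counting 1D DCTs, as the worked arithmetic below shows. It assumes a separable 8x8x8 3D DCT, with the per-1D-DCT costs of the fast method (5 multiplications, 29 additions) inferred from the quoted totals rather than stated in the text.

$$
\begin{aligned}
\text{1D DCTs per } 8\times8\times8 \text{ block} &= 3 \times 8 \times 8 = 192 \\
\text{direct multiplications} &= 192 \times 64 = 12{,}288 \\
\text{fast multiplications} &= 192 \times 5 = 960 \\
\text{fast additions} &= 192 \times 29 = 5{,}568 \\
\text{1D DCTs computed on the imager (two spatial dimensions)} &= 2 \times 8 \times 8 = 128 \\
\text{fraction removed from the digital chip} &= 128 / 192 = 2/3
\end{aligned}
$$

Only the 64 temporal 1D DCTs per block remain for the digital processor.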

The transform imager is reconfigurable not only in that the chip can be used for various image processing applications but also in that it makes it easy for the analog chip to calibrate its parameters. The transform imager multiplies its inputs, so calibrating the parameters that serve as the multiplication coefficients is important to avoid calculation errors. A printed circuit board was designed for programming the imager and capturing images. This board is controlled by an Altera Stratix FPGA with a 32.768 MHz clock. Because computation in the analog transform imager is very fast compared to the digital system, the readout speed is constrained by the test setup, such as the interface speed and the FPGA clock rate. The test images are focused onto the pixel array using a multiple-lens system, an LCD (1024x768 resolution), and a DC-regulated light source. We have built other imagers, ranging from 14x14 to 512x480 resolution; here we use a 48x40 imager and a 72x64 imager so that the transform results are easy to analyze.

## 4. EXPERIMENTAL RESULTS

### 4.1. Measured images

A transform imager is capable of computing block transforms, as described in Section 2. Because of its programmability, we can implement various block transforms such as the DCT, the DST, the Hadamard transform, and the Haar transform. We show the Walsh-Hadamard transform, the DCT, and the Haar transform as examples. The DCT has been a widely used block transform for years, and the Haar transform can be used for wavelet-based compression, which is similar in principle to JPEG2000.

Fig. 5 shows the Haar transform results. We programmed the transform imager with the 8x8 Haar transform. Fig. 5(b) shows the Haar transform before reordering, and Fig. 5(c) is a reconstructed image that demonstrates the operation of the chip. Fig. 6(a) shows the 16x16 Walsh-Hadamard transform output sequence of the imager at 20fps and the corresponding reconstructed sequence. The 16x16 DCT output sequence and its reconstructed sequence are shown in Fig. 6(b). Still images are used to create the input sequences to the imager, so all movement is global translation.
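The transform-and-reconstruct procedure behind Fig. 6(a) can be mimicked in software. The sketch below is an idealized, noise-free illustration, not the measurement pipeline: it assumes a Sylvester-constructed Hadamard kernel (natural ordering, not sequency ordering) used as both A and B in Eq. (1), and a made-up 6-bit test pattern.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

N = 16
H = hadamard(N)  # symmetric, and H H = N I

# Made-up 6-bit test pattern standing in for one 16x16 image block.
P = [[(3 * r + 5 * c) % 64 for c in range(N)] for r in range(N)]

# Forward transform: Y = H^T P H = H P H, since H is symmetric.
Y = matmul(matmul(H, P), H)

# Inverse transform: H Y H = N^2 P, so divide by N^2 to recover the block.
P_rec = [[v // (N * N) for v in row] for row in matmul(matmul(H, Y), H)]
```

With integer arithmetic the reconstruction is exact; on the chip, gain error and noise would make it approximate.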

### 4.2. Performance of the transform imager

Based on the bandwidth of a pixel, the peak computing power can be calculated [8]. Computing one 8x8 DCT coefficient requires 16 multiplications and 14 additions, so one 48x40 image requires 48x40x(16+14) operations in total. With a maximum pixel bandwidth of 100 kHz, $10\ \mu s$ is needed to process one column and $400\ \mu s$ for the whole image. This results in 144MOPS/s, that is, 144 million analog operations per second. The same calculation gives the peak computing power of the 128x128 transform imager: it requires 128x128x(16+14) operations and takes 1.28 ms to finish the DCT calculation for one image, so its peak computing power is 384MOPS/s. The fill factor of the imager is 46%, which is much higher than that of other focal-plane imagers such as a neuromorphic imager [9] and a SIMD-based digital focal-plane imager [10]. The power consumption is $16.5\ \mu W$/frame, measured only for the internal circuitry, excluding pads.
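The peak-throughput arithmetic above can be reproduced with a few lines. This is a sketch of the calculation as described in the text, with a hypothetical function name; it assumes one column is processed per pixel-bandwidth period and 16+14 operations per output coefficient.

```python
def peak_mops(rows, cols, pixel_bandwidth_hz=100_000, ops_per_coeff=16 + 14):
    """Peak analog throughput in millions of operations per second."""
    total_ops = rows * cols * ops_per_coeff
    frame_time_s = cols / pixel_bandwidth_hz  # one column per bandwidth period
    return total_ops / frame_time_s / 1e6

small = peak_mops(rows=48, cols=40)     # about 144 (400 us per image)
large = peak_mops(rows=128, cols=128)   # about 384 (1.28 ms per image)
print(round(small), round(large))
```

Because the column count cancels, the peak rate grows with the number of rows processed in parallel, which is why the 128x128 imager reaches a higher figure at the same pixel bandwidth.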

## 5. CONCLUSION

In this paper, we proposed a reconfigurable image processing system using an analog transform imager. The proposed system is highly reconfigurable and consumes little power, so it can be adapted to many image processing applications. We presented a JPEG compression system as an application. The transform imager, when programmed to perform a DCT, eliminates a large number of digital functional units and improves power efficiency. A Haar transformed image, a Walsh-Hadamard transformed image sequence, and a DCT transformed image sequence were shown with reconstructed images to demonstrate the operation and the performance of the system.

## 6. REFERENCES

- [1] P. Hasler and D. V. Anderson, "Cooperative analog-digital signal processing," in *Acoustics, Speech, and Signal Processing, 2002 IEEE International Conference on*, 2002, vol. 4, pp. 3972–3975.
- [2] P. Hasler, A. Bandyopadhyay, and D. V. Anderson, "High-fill factor imagers for neuromorphic processing enabled by floating-gate circuits," *International Journal of Signal Processing*, vol. 103, no. 5, Pt 1, pp. 2273–2281, 2003.
- [3] R. C. Gonzalez and R. E. Woods, *Digital Image Processing*, Pearson Education, 2002.
- [4] N. Narasimhan, V. Srinivasan, M. Vootukuru, J. Walrath, S. Govindarajan, and R. Vemuri, "Rapid prototyping of reconfigurable coprocessors," in Acoustics, Speech, and Signal Processing, 2002 IEEE International Conference on, 1996, pp. 303–312.
- [5] R. Westwater and B. Furht, *Real-time Video Compression: Techniques and Algorithms*, Kluwer Academic Publishers, 1997.
- [6] J. Park, S. Kwon, and K. Roy, "Low power reconfigurable dct design based on sharing multiplication," in *Acoustics, Speech, and Signal Processing, 2002 IEEE International Conference on*, 2002, vol. 3, pp. 3116–3119.
- [7] M. Martina, A. Molino, and F. Vacca, "Reconfigurable and low power 2d-dct ip for ubiquitous multimedia streaming," in *Multimedia and Expo, 2002. ICME '02. Proceedings. 2002 IEEE International Conference on*, 2002, vol. 2, pp. 26–29.
- [8] R. Carmona-Galan, F. Jimenez-Garrido, C. M. Dominguez-Mata, R. Dominguez-Castro, S. E. Meana, I. Petras, and A. Rodriguez-Vazquez, "Second-order neural core for bioinspired focal-plane dynamic image processing in cmos," *IEEE Transactions on Circuits and Systems I*, vol. 51, no. 5, pp. 913–925, May 2004.
- [9] A. A. Stocker, "Analog vlsi focal-plane array with dynamic connections for the estimation of piecewise-smooth optical flow," *IEEE Transactions on Circuits and Systems I*, vol. 51, no. 5, pp. 963–973, May 2004.
- [10] R. H. Robinson and D. S. Wills, "Design of an integrated focal plane architecture for efficient image processing," in *15th International Conference on Parallel and Distributed Computing Systems*, 2002, vol. 171, pp. 128–135.