Chair: Konstantinos Konstantinides, Hewlett-Packard Laboratories (USA)
An-Yeu Wu, University of Maryland (USA)
K.J. Ray Liu, University of Maryland (USA)
In most low-power VLSI designs, the supply voltage is reduced to lower the total power consumption. However, device speed degrades as the supply voltage goes down. In this paper, we propose new algorithmic-level techniques, based on the multirate approach, for compensating for the increased delays. We show how to compute most of the discrete sinusoidal transforms from decimated low-speed sequences with a reasonable, linear hardware overhead. For a decimation factor of two, the overall power consumption can be reduced to about one-third of that of the original design. The resulting multirate low-power architectures are regular, modular, and free of global communications, properties that are well suited to VLSI implementation. The proposed architectures can also be applied to very high-speed block transforms where only low-speed operators are required.
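The trade-off described above can be illustrated with the standard dynamic-power model for CMOS, P = C_eff * V^2 * f. The sketch below is not taken from the paper: the function name, the `overhead` parameter and the example voltage ratio are assumptions chosen only to show how a lower clock rate per operator leaves room for a lower supply voltage.

```python
def relative_power(decimation, voltage_ratio, overhead=1.0):
    """Dynamic CMOS power relative to the original design.

    Model: P = C_eff * V^2 * f. Assumptions (not from the abstract):
    - each operator is clocked at f / decimation,
    - the switched capacitance grows to overhead * decimation times
      the original (hypothetical model of the linear hardware overhead),
    - voltage_ratio is V_new / V_original.
    """
    capacitance_ratio = overhead * decimation
    frequency_ratio = 1.0 / decimation
    return capacitance_ratio * voltage_ratio ** 2 * frequency_ratio


# Illustrative values only: decimation by two with the supply
# scaled to 60% of its original value.
print(relative_power(decimation=2, voltage_ratio=0.6))
```

In this simplified model the capacitance increase and the frequency decrease cancel, so the saving comes entirely from the squared voltage term; the actual figure reported by the authors depends on their hardware overhead and voltage scaling.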
E. Scopa, DEIS Universita di Bologna (ITALY)
A. Leone, DEIS Universita di Bologna (ITALY)
R. Guerrieri, DEIS Universita di Bologna (ITALY)
G. Baccarani, DEIS Universita di Bologna (ITALY)
A low-power architecture for the 2-D DCT is presented. It has been designed for portable H.261-compliant video-telephone applications, but most of the results and considerations also apply to MPEG systems. The presence of quantization in the coding process is exploited by adapting the precision of the DCT calculations to the quantization noise level. The proposed architecture can dynamically control power consumption by reducing the precision to the minimum required level and turning off sub-systems when they are not needed for the computation. Compared with a standard implementation, power consumption is reduced by a factor of between 7 and 10, without appreciable degradation of the transmission quality.
Javier Bruguera, University of Santiago de Compostela (SPAIN)
Tomas Lang, University of California-Irvine (USA)
We present a VLSI architecture for the evaluation of the (8x8)-point 2-D DCT with on-line arithmetic. The use of on-line arithmetic, in combination with an algorithm based on FCT and matrix multiplication, reduces the total hardware while maintaining a data rate and a latency similar to those of approaches based on distributed or parallel arithmetic. The architecture has been integrated in a chip using a 1 μm CMOS technology, occupying an area of 56.7 mm^2.
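For readers unfamiliar with the matrix form of the transform, the (8x8)-point 2-D DCT can be written as Y = C X C^T, where C is the 1-D DCT matrix. The following is a minimal floating-point sketch of that formulation only; it does not model the FCT factorization or the on-line arithmetic used in the chip, and all function names are hypothetical.

```python
import math


def dct_matrix(n=8):
    """Orthonormal DCT-II matrix C (1-D DCT of a column x is C times x)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([
            scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
            for i in range(n)
        ])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct2d(block):
    """Separable 2-D DCT of a square block: Y = C X C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))
```

A constant block is a convenient sanity check: all of its energy should land in the DC coefficient `dct2d(block)[0][0]`.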
Heonchul Park, Samsung Electronics Co. Ltd. (KOREA)
Jae-Chul Son, Samsung Electronics Co. Ltd. (KOREA)
Seong-Rae Cho, Samsung Electronics Co. Ltd. (KOREA)
An area-efficient VLSI architecture for a fast Huffman decoder that can support HDTV rates is proposed. Huffman coding has been widely used to reduce storage and channel bandwidth, and several image compression standards, such as JPEG and MPEG, require Huffman decoding to be performed in real time with high throughput. The proposed VLSI architecture for the Huffman decoder requires fewer comparators and a smaller data rotator to emulate a barrel shifter. It can decode up to 17 bits per cycle and employs a 40 MHz clock, which can support HDTV rates. Thus, it can decode a Huffman-coded sequence at up to 680 Mbps at peak. Compared with the parallel implementation in [3], which requires up to 1460 PEs and has a throughput of 10 Mbps, the proposed architecture is a single-PE design with competitive processing power. It requires 25% of the area of the known single-PE design in [3].
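The peak rate quoted above follows directly from the decode width and the clock frequency. A small check, using a hypothetical helper name:

```python
def peak_throughput_mbps(bits_per_cycle, clock_mhz):
    """Peak decode rate in Mbit/s: bits per cycle times cycles per microsecond."""
    return bits_per_cycle * clock_mhz


# 17 bits per cycle at 40 MHz, the figures given in the abstract.
print(peak_throughput_mbps(17, 40))
```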
Alain Pirson, Thomson Consumer Electronics Components (FRANCE)
Fathy Yassa, Thomson Consumer Electronics Components (FRANCE)
Philippe Paul, Thomson Consumer Electronics Components (FRANCE)
Barth Canfield, TC Electronics (USA)
Friedrich Rominger, Thomson Consumer Electronics (GERMANY)
Andreas Graf, Thomson Consumer Electronics (GERMANY)
Detlef Teichner, Thomson Consumer Electronics (GERMANY)
This paper describes a programmable motion estimation processor that applies a block matching technique to large search windows. Developed in the context of an MPEG-2 video encoder, its use can be extended to any application where fast motion estimation is required. Its high performance (17 Gops peak) and its ability to work in parallel make it well suited to real-time applications such as video compression. The Block Matching Processor (BMP) consists of a CPU associated with several specific units, including a fast motion estimator, a DRAM interface, I/O ports and some on-chip memory. This approach combines the flexibility of a CPU with the efficiency of dedicated hardware. A DRAM controller minimizes the impact of data transfer on the computing power.
D. Charlot, Thomson Consumer Electronics Components (FRANCE)
J.-M. Bard, Thomson Consumer Electronics Components (FRANCE)
B. Canfield, Thomson Consumer Electronics (USA)
C. Cuney, Thomson Consumer Electronics Components (FRANCE)
A. Graf, Thomson Consumer Electronics (GERMANY)
A. Pirson, Thomson Consumer Electronics (FRANCE)
D. Teichner, Thomson Consumer Electronics (USA)
F. Yassa, Thomson Consumer Electronics (FRANCE)
In this paper, we describe the architecture of a hierarchical motion estimation processor for the MPEG-2 encoding standard. This processor can also be used in HDTV applications. Motion estimation is performed in two steps: first at full-pixel, then at half-pixel resolution. Several modes are possible, depending on the image type (I, P or B in MPEG terminology; frame-based or field-based). The processor itself decides which mode is best. The architecture is based on a RISC controller, external DRAMs to store anchor frames, and specific hardware for processing the distortions. The architecture was chosen to achieve high performance, programmability and high memory bandwidth.
Avidan Akerib, A.S.P. Solutions Ltd. (ISRAEL)
Rutie Adar, A.S.P. Solutions Ltd. (ISRAEL)
This article presents a new methodology, based on the Associative Signal Processing (A.S.P.) approach, for real-time parallel image processing. The architecture is fully programmable and can implement a wide range of color image processing, computer vision and multimedia algorithms at much faster than video rate. The approach is based on an array of thousands of processors, each of which is simply an "intelligent" memory word that can identify itself against a value and change its content accordingly. Benchmark results show that, when an "intelligent" word (processor) is assigned to each pixel in the image, a computational power of several hundred billion instructions per second is obtained. A chip based on this approach was developed by A.S.P. Solutions Ltd. The chip, called XIUM (tm), includes 1024 processors, each with 72 "intelligent" bits, has a computational power of 1 BIPS, and can identify patterns at a rate of 20 billion patterns per second. A commercial chip with 2 BIPS performance, the XIUM (tm)-II, is now in the final stage of development.
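The "intelligent" memory word described above can be pictured as a masked compare followed by a masked write, applied to every word at once. The sketch below is a sequential software model of that idea under assumed semantics; the function name, the mask arguments and the bit layout are hypothetical and are not taken from the XIUM design.

```python
def associative_step(words, compare_value, compare_mask, write_value, write_mask):
    """One compare-then-write pass over all words.

    Each word checks whether its masked bits equal the broadcast
    compare value; words that match overwrite their write-masked bits
    with the broadcast write value. Hardware would do this for every
    word simultaneously; the loop here only models the result.
    """
    result = []
    for word in words:
        matched = (word & compare_mask) == (compare_value & compare_mask)
        if matched:
            word = (word & ~write_mask) | (write_value & write_mask)
        result.append(word)
    return result


# Set bit 3 in every word whose two low bits are 01.
print(associative_step([0b0001, 0b0011, 0b0101], 0b01, 0b11, 0b1000, 0b1000))
```

With one word per pixel, an operation such as thresholding becomes a short sequence of these broadcast steps whose length depends on the word width rather than on the number of pixels.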
Chin-Liang Wang, National Tsing Hua University (REPUBLIC OF CHINA)
Ker-Min Chen, National Tsing Hua University (REPUBLIC OF CHINA)
Jin-Min Hsiung, National Tsing Hua University (REPUBLIC OF CHINA)
This paper presents a new systolic VLSI architecture to realize the full-search block matching algorithm for motion estimation. The architecture has an efficiency of 100 percent and a throughput of one motion vector per $n^2$ cycles, where $n \times n$ is the reference block size. Compared with existing VLSI motion estimators with the same efficiency and throughput, the proposed one not only offers flexibility in changing the reference block size and the tracking range, but also requires no additional control circuitry to determine the motion vectors. These features make it useful for a wide range of applications.
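As background, the full-search block matching algorithm itself can be sketched in software as an exhaustive search over all displacements within the tracking range. This is a plain sequential illustration, not the systolic architecture; the sum of absolute differences (SAD) is assumed as the matching criterion, and the names `current`, `previous`, `full_search` and the parameter `p` (tracking range) are hypothetical.

```python
def sad(current, previous, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `current`
    at (bx, by) and the block of `previous` displaced by (dx, dy)."""
    return sum(
        abs(current[by + y][bx + x] - previous[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(current, previous, bx, by, n, p):
    """Return (dx, dy, sad) of the best match over displacements in [-p, p]."""
    height, width = len(previous), len(previous[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            # Skip candidates that fall outside the previous frame.
            inside_x = 0 <= bx + dx and bx + dx + n <= width
            inside_y = 0 <= by + dy and by + dy + n <= height
            if not (inside_x and inside_y):
                continue
            cost = sad(current, previous, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

Each candidate costs $n^2$ absolute differences and there are up to $(2p+1)^2$ candidates per block, which is why dedicated array architectures are attractive for this algorithm.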
Mei-Cheng Lu, National Chiao Tung University (REPUBLIC OF CHINA)
Chen-Yi Lee, National Chiao Tung University (REPUBLIC OF CHINA)
This paper presents a new VLSI architecture for the full-search block matching algorithm. The proposed architecture has two specific features: (1) it has a processor element (PE) array that provides sufficient computational power, in which the PEs work in a semi-systolic style, and (2) it contains stream memory banks that provide a scheduled data flow to reduce idle operations within the PE array. By exploiting broadcasting and local data communications, the hardware efficiency of the proposed architecture can reach 100%, which outperforms the systolic-array solutions found in the literature. This efficient VLSI architecture is then demonstrated by a 16x16 motion estimation processor design whose speed can reach 100 MHz in a 0.8 μm double-metal CMOS process.
Hangu Yeo, University of Wisconsin (USA)
Yu Hen Hu, University of Wisconsin (USA)
In this paper, we propose a modular systolic array architecture for the full-search block matching motion estimation algorithm (FBMA). With this novel architecture, we are able to generate a motion vector for every reference block in raster scan order while achieving 100% processor utilization and a high throughput rate. Furthermore, we devise a scheme that reduces the I/O pin count by sharing memory units, which results in low memory bandwidth. The architecture is scalable in that it can easily be adapted to handle larger search ranges and different block sizes without increasing the effective latency.