Session: DISPS-P2
Time: 3:30 - 5:30, Thursday, May 10, 2001
Location: Exhibit Hall Area 4
Title: Video Processing Systems
Chair: Jose Fridman

3:30, DISPS-P2.1
MULTI-PORT INTERCONNECTION NETWORKS FOR RADIX-R ALGORITHMS
J. TAKALA, T. JÄRVINEN, P. SALMELA, D. AKOPIAN
In array processors, complex data reordering is often needed to realize the interconnection topologies between the computational nodes in algorithms. Several important algorithms, e.g., discrete trigonometric transforms and Viterbi decoding, can be represented in a radix-R form where the principal topology is stride by R permutation. In this paper, a general factorialization of stride permutations is derived, which can be mapped onto register-based structures for constructing area-efficient multi-port interconnection networks. The networks can be modified to support several stride permutations and sequence sizes.

3:30, DISPS-P2.2
INTEGER SINUSOIDAL TRANSFORMS BASED ON LIFTING FACTORIZATION
Y. ZENG, G. BI, Z. LIN
A general method is proposed to factor discrete W transform (DWT) into lifting steps and additions. Then, based on the relationships among various types of discrete sinusoidal transforms, other types of transforms such as discrete Fourier transform (DFT) and discrete cosine transform (DCT) are factored into lifting steps and additions. After approximating the lifting matrices, we get various types of new integer discrete transforms such as IntDWT, IntDFT and IntDCT which are floating-point multiplication free. Transforms which map integer to integer are also proposed. Fast algorithms are given for the new transforms and their computational complexities are analyzed. Based on polynomial transform and index mapping, multi-dimensional integer transforms are presented with especially low computational complexity.

3:30, DISPS-P2.3
A PARTIAL-RESULT-REUSE ARCHITECTURE AND ITS DESIGN TECHNIQUE FOR MORPHOLOGICAL OPERATIONS
S. CHIEN, S. MA, L. CHEN
This paper proposes a new cost-effective architecture for mathematical morphology named Partial-Result-Reuse (PRR) architecture. For a lot of real-time applications of mathematical morphology, the hardware implementation is necessary; however, the hardware cost of almost existing morphology architectures is too high when dealing with large structuring elements. With partial-result-reuse concept and self-affinity property of general structuring elements, the proposed architecture is more cost-effective and more general than existing morphology architectures. In addition, the PRR architecture is very easy to generate. It can deal with morphological operations with arbitrary structuring elements and can be used for other semi-group operations. Simulation shows this architecture can dramatically reduce hardware cost of morphological operations with all kinds of structuring elements.

3:30, DISPS-P2.4
AN INTEGRATED SYSTOLIC ARRAY DESIGN FOR VIDEO COMPRESSION
P. TAI, C. LIU, S. HUANG, J. WANG
This paper presents an integrated systolic array design for implementing full-search block matching, 2-D discrete wavelet transform, and full-search vector quantization on the same VLSI architecture. The above three functions are prime essential but with large amount of computation in video compression. To meet real-time requirement, many systolic arrays are designed for individually performing each of them. In fact, these functions contain the similar computational procedure. The matrix-vector product forms of them are quite similar. In this paper, with carefully extracting the common computation component, an integrated systolic array design that can perform above three functions is presented. A utilization of 100% to 97% is achieved for executing full-search block matching and full-search vector quantization. When performing 2-D discrete wavelet transform, the utilization is about 32%. The proposed integrated architecture spends lower hardware cost and has a regular hardware structure. It befits the VLSI implementation for video compression.

3:30, DISPS-P2.5
DESIGN OF RNS-BASED DISTRIBUTED ARITHMETIC DWT FILTERBANKS
J. RAMIREZ, A. GARCIA, U. MEYER BAESE, F. TAYLOR, P. FERNANDEZ, A. LLORIS
The need for both speed and increased precision in modern digital signal processing (DSP) applications represents a serious implementation obstacle. This paper explores the arithmetic benefits provided by the residue number system (RNS) for the design of such systems. Concretely, the fusion of the RNS with the popular distributed arithmetic (DA) is considered for the implementation of a discrete wavelet transform (DWT) filter bank. An exhaustive comparison of the advantages of RNS-DA over the traditional two¡¯s complement design, 2C-DA, is carried out for field-programmable logic (FPL), and cell-based ASIC technologies. The results show that the reported RNS-DA methodology, compared to a traditional 2C-DA design, enjoys a significant performance advantage that increases with precision.

3:30, DISPS-P2.6
QUANTIZATION EFFECT ON VLSI IMPLEMENTATIONS FOR THE 9/7 DWT FILTERS.
V. SPILIOTOPOULOS, N. ZERVAS, Y. ANDREOPOULOS, G. ANAGNOSTOPOULOS, C. GOUTIS
In this paper, two basic approaches for implementing the 9/7 Filtering Unit, used in the Discrete Wavelet Transform, are addressed. The first is the lifting scheme approach and the second is the conventional, convolutional filter approach. Two architectures are examined for each approach, a simple – straightforward one and an optimized one, substituting the multipliers used for scaling with shift – add operations. The quantizations of the constants used in the calculations and the selection of the data path bit-width are thoroughly explored. Experimental results for several quantizations and for the different hardware architectures of the 9/7 filtering Units are given.

3:30, DISPS-P2.7
A FAST MOTION ESTIMATION ALGORITHM EQUIVALENT TO EXHAUSTIVE SEARCH
M. GHARAVI-ALKHANSARI
In this paper, a fast algorithm is proposed for block motion estimation for video sequences. The proposed algorithm is proven to be equivalent to exhaustive search. In a multiresolution approach, it uses mathematically derived threshold to prune search candidates whose low-resolution versions are too far from the low resolution version of the block for which a best match is sought. Experimental results show that speed ups of around 36, compared to full search, may be achieved, for some typical test video sequences. This is the fastest full-search-equivalent motion estimation reported in the literature to date, and has speed ups comparable to inexact fast motion estimation methods.

3:30, DISPS-P2.8
CACHE CONSCIOUS WALSH-HADAMARD TRANSFORM
N. PARK, V. PRASANNA
The Walsh-Hadamard Transform (WHT) is an important tool in signal processing because of its simplicity. One major problem with the existing packages is their poor scalability to larger problems. In computing large size WHT, non-unit stride accesses result in poor cache performance leading to severe degradation in performance. In this paper, we develop an efficient cache friendly technique that drastically enhances the performance of large size WHT. In our approach, data reorganization is performed between computation stages to reduce cache pollution. We develop an efficient search algorithm to determine the optimal factorization tree depending on the problem size and stride access in the decomposition. We have developed a package and demonstrated improved performance. For example, on Alpha 21264 and MIPS R10000, our technique results in upto 180% performance improvement over the state-of-the-art package. The proposed optimization is applicable to other signal transforms and is portable across various platforms.

3:30, DISPS-P2.9
A DHT-BASED FFT/IFFT PROCESSOR FOR VDSL TRANSCEIVERS
C. WANG, C. CHANG
This paper presents a new VLSI architecture for computing the N-point discrete Fourier transform (DFT) of real data and the corresponding inverse (IDFT) based on the discrete Hartley transform, where N is a power of two. The architecture includes two real multipliers, three real adders, six memory-based buffers, two ROM¡¦s, and some simple logic circuits, making itself suitable for single-chip implementation. It is capable of evaluating one DFT sample or one IDFT sample every [log2(N)+1]/2 clock cycles on average. Under 0.35 um CMOS technology, the proposed design can operate at a clock rate of 100 MHz to reach a throughput of 20M transform samples per second for N=512. The processing speed will be higher if more advanced CMOS technology is adopted to implement the same circuit. Such low-complexity and high-throughput feature supports that the proposed design is well suited for use in discrete multitone based very high-speed digital subscriber line transceivers.

3:30, DISPS-P2.10
AN APPROACH FOR ENABLING DCT/IDCT ENERGY REDUCTION SCALABILITY IN MPEG-2 VIDEO CODECS
R. HENNING, C. CHAKRABARTI
It would be desirable, in terms of energy conservation, to use a low complexity approximate algorithm to do all DCT and IDCT computation in an MPEG-2 video codec. However, there is a significant quality penalty associated with this approach that may not always be acceptable. A practical algorithmic method is studied here for achieving scalable energy reduction during DCT and IDCT computation in MPEG-2 video codecs at the expense of reasonable amounts of quality. For example, by applying exact and approximate DCT/IDCT algorithms appropriately, the energy consumption of DCT and IDCT execution in two video codecs communicating with one another can be reduced by 8% for quality reduction of 0.4 dB average PSNR, 14% for 0.8 dB reduction, or 22% for 1.4 dB reduction.