DISPS-2.1

Hybrid Multiplier/CORDIC Unit for Online Handwriting Recognition
Stephen McInerney (DSP Group, Dept. of Electronic & Electrical Engineering, University College Dublin), Richard B Reilly (Dept. of Electronic & Electrical Engineering, University College Dublin)

Traditionally, Online Handwriting Recognition (OHR) implementations use general-purpose processor architectures. The pre-processing step of OHR comprises regular array-based tasks such as normalisation, feature extraction and segmentation. Standard processor architectures cannot, however, efficiently support the varied arithmetic operations required by pre-processing. These tasks would seem ideally suited for custom hardware acceleration. CORDIC offers all the required elementary functions for pre-processing but is inefficient for linear mode operations (multiplication/division) due to its serial nature. A hybrid Multiplier/CORDIC architecture is proposed in which a fast iterative multiplier/MAC shares hardware with a serial CORDIC unit. This multiplier retires 6 bits per cycle with minor additional hardware requirements. This hybrid offers improved general performance for signal-processing applications and is targeted at the pre-processing task of OHR. Performance results are included. http://wwwdsp.ucd.ie
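
As a rough illustration of why linear-mode CORDIC is slow compared with a dedicated multiplier, the sketch below shows textbook CORDIC in circular rotation mode (sine/cosine) and in linear mode (multiplication). It is a floating-point software model under stated assumptions, not the authors' hardware: the function names and the iteration count are hypothetical, and the linear-mode loop resolves roughly one bit of the result per iteration.

```python
import math


def cordic_rotate(angle, iterations=24):
    """Return (cos(angle), sin(angle)) for |angle| <= pi/2 via circular CORDIC."""
    # Elementary rotation angles atan(2^-i).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Pre-compensate the constant gain introduced by the micro-rotations.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate towards z == 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y


def cordic_multiply(x, z, iterations=24):
    """Approximate x * z for |z| < 2 via linear-mode CORDIC (shift-and-add)."""
    y = 0.0
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # drive the residual z towards 0
        y += d * x * 2.0 ** -i
        z -= d * 2.0 ** -i
    return y
```

For example, `cordic_rotate(0.5)` approximates `(cos 0.5, sin 0.5)` and `cordic_multiply(3.0, 0.75)` approximates 2.25, each after 24 shift-and-add iterations.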

DISPS-2.2

Low-Power Bit-Serial Viterbi Decoder for Next Generation Wide-Band CDMA Systems
Hiroshi Suzuki (Kawasaki Steel Corporation, LSI Division), Yun-Nan Chang, Keshab K Parhi (University of Minnesota, Minneapolis)

This paper presents a low-power bit-serial Viterbi decoder chip with the coding rate r=1/3 and the constraint length K=9 (256 states). This chip has been implemented using 0.5 µm three-layer metal CMOS technology and is targeted for high-speed convolutional decoding for next generation wireless applications such as wide-band CDMA mobile systems and wireless ATM LANs. The chip is expected to operate at 20 Mbps under 3.3 V and at 2 Mbps under 1.8 V. The Add-Compare-Select (ACS) units have been designed using bit-serial arithmetic, which has made it feasible to execute 256 ACS operations in parallel. For trace-back operations, we have developed a novel power-efficient trace-back scheme and an application-specific memory, which was designed considering that 256 bits should be written simultaneously for write operations but only one bit needs to be accessed for read operations. We have estimated that the chip dissipates only 10 mW at 2 Mbps operation under 1.8 V.
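
The memory access pattern described above (256 decision bits written per trellis step, one bit read per trace-back step) can be sketched in software as follows. This is a word-level model, not the chip's bit-serial design; the state numbering (newest input bit in the least significant position) and the caller-supplied `branch_metric` function are assumptions made for illustration, since the abstract does not give the generator polynomials.

```python
NUM_STATES = 256  # 2^(K-1) states for constraint length K = 9


def acs(metric_a, branch_a, metric_b, branch_b):
    """Add-compare-select for one state: return (survivor metric, decision bit)."""
    cand_a = metric_a + branch_a
    cand_b = metric_b + branch_b
    if cand_a <= cand_b:
        return cand_a, 0
    return cand_b, 1


def acs_step(path_metrics, branch_metric):
    """One trellis step: produce 256 new metrics and 256 decision bits.

    branch_metric(prev_state, next_state) is a hypothetical cost function.
    """
    new_metrics = [0] * NUM_STATES
    decisions = [0] * NUM_STATES
    for ns in range(NUM_STATES):
        p0 = ns >> 1                    # predecessor whose oldest bit is 0
        p1 = p0 | (NUM_STATES >> 1)     # predecessor whose oldest bit is 1
        new_metrics[ns], decisions[ns] = acs(
            path_metrics[p0], branch_metric(p0, ns),
            path_metrics[p1], branch_metric(p1, ns),
        )
    return new_metrics, decisions


def trace_back(decision_rows, final_state):
    """Walk the survivor path backwards, reading one decision bit per step."""
    state = final_state
    bits = []
    for decisions in reversed(decision_rows):
        bits.append(state & 1)          # input bit that led into this state
        state = (state >> 1) | (decisions[state] << 7)
    bits.reverse()
    return bits
```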

DISPS-2.3

A Highly-scalable Symmetric/Asymmetric FIR Processor
Wei-Lung Liu, Oscal T.-C. Chen, Heng-Chou Chen (National Chung Cheng University), Hsun-Chang Hsieh (Industrial Technology Research Institute)

Based on the radix-4 Booth algorithm, we developed a highly scalable symmetric/asymmetric finite impulse response (FIR) architecture which comprises a pre-processing unit, data latches, configurable connection units, double Booth decoders, coefficient registers, a path control unit, and a post-processing unit. In order to achieve scalability, the configurable connection units between data latches and the double Booth decoders have been effectively addressed. The precision of filter coefficients is adjustable by using a path control unit. The double Booth decoding is efficiently implemented. In particular, the proposed architecture employs only data-path controls to accomplish the scalable operations without changing word lengths and components of data latches and filter taps. A practical FIR processor, which can accommodate dynamic ranges of 8 and 16 bits of input data and filter coefficients, was implemented by using the COMPASS 5V cell library in the TSMC 0.6 µm CMOS technology. This processor supports ten different operation modes of asymmetric, symmetric, and anti-symmetric filter coefficients at 64, 63, 32, or 16 taps for various industrial applications.
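
For readers unfamiliar with the radix-4 Booth algorithm on which the architecture is based, a minimal sketch of the standard recoding follows. It shows only the textbook recoding of a two's-complement multiplier into digits from {-2, -1, 0, 1, 2}; the "double Booth decoder" of the paper is not described in the abstract and is not modelled here. Function names are hypothetical.

```python
def booth_radix4_digits(multiplier, bits):
    """Recode a `bits`-bit two's-complement value into radix-4 Booth digits.

    Returns digits in {-2, -1, 0, 1, 2}, least significant first.
    `bits` must be even.
    """
    if bits % 2:
        raise ValueError("bits must be even")
    u = multiplier & ((1 << bits) - 1)   # two's-complement bit pattern
    digits = []
    prev = 0                             # implicit bit to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(a, b, bits=8):
    """Multiply a by the `bits`-bit signed value b using the recoded digits."""
    return sum(d * a * 4 ** k for k, d in enumerate(booth_radix4_digits(b, bits)))
```

Each digit selects 0, ±a or ±2a, so an 8-bit coefficient needs four partial products and a 16-bit coefficient needs eight.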

DISPS-2.4

A NOVEL MEMORY-BASED FFT PROCESSOR FOR DMT/OFDM APPLICATIONS
Ching-Hsien Chang, Chin-Liang Wang, Yu-Tai Chang (Department of Electrical Engineering, National Tsing Hua University Hsinchu, Taiwan 300, Republic of China)

This paper presents a novel VLSI architecture for computing the N-point discrete Fourier transform (DFT) based on a radix-2 fast algorithm, where N is a power of two. The architecture consists of one complex multiplier, two complex adders, and some special memory units. It can compute one transform sample every log2(N)+1 clock cycles on average. For the case of N=512, the chip area required is about 5742 µm × 5222 µm and the throughput is up to 4M transform samples per second under 0.6 µm CMOS technology. Such area-time performance makes the proposed design rather attractive for use in long-length DFT applications, such as ADSL and OFDM systems.
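
As a quick check of these figures, the cycle count for N=512 and the clock rate it implies can be worked out as below. The clock frequency is not stated in the abstract; it is an inference that assumes the quoted peak throughput is reached at the average cycle count.

```latex
\[
\log_2 N + 1 = \log_2 512 + 1 = 10 \ \text{cycles per transform sample},
\qquad
f_{\mathrm{clk}} \approx 10 \times 4 \times 10^{6}\ \mathrm{s^{-1}} = 40\ \mathrm{MHz}.
\]
```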

DISPS-2.5

Synthesis of Array Architectures for Block Matching Motion Estimation: Design Exploration using the tool DG2VHDL
John Bonk (Naval Systems, Electronic Design Laboratory , Raytheon), Andrew Stone, Elias S Manolakos (Communications and Digital Signal Processing (CDSP) Center for Research and Graduate Studies, Northeastern University)

In this paper we present a design case study using DG2VHDL, a tool which bridges the gap between an abstract graphical description of a DSP algorithm and its concrete hardware description language (HDL) representation. DG2VHDL automatically translates a Dependence Graph (DG) into a synthesizable, behavioral VHDL entity that can be input to industrial strength behavioral compilers for producing silicon implementations of the algorithm (FPGAs, ASICs). Full Search Block Matching Motion Estimation was selected for its current applications (MPEG, HDTV, Video Conferencing) as well as for the richness of literature and architectural exploration over the last decade. We will not only demonstrate here that the behavioral VHDL code produced automatically by the tool leads, after behavioral synthesis, to an efficient distributed memory and control modular array architecture, but will also provide comparative statistics for several new FS-BMA architectures derived for real-time motion estimation.
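
For reference, the computation that the array architectures implement can be sketched in a few lines. The sketch assumes the sum of absolute differences (SAD) as the matching criterion, which the abstract does not specify; the function names and the frame representation (lists of rows) are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block and a displaced block."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, p):
    """Exhaustively test every displacement in [-p, p] x [-p, p].

    Returns ((dx, dy), cost) for the best match that lies inside the frame,
    or None if no candidate block fits.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            inside = (0 <= by + dy and by + dy + n <= height and
                      0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[1]:
                best = ((dx, dy), cost)
    return best
```

The four nested loops have no data-dependent control flow, which is what makes the algorithm a natural fit for a regular dependence graph.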

DISPS-2.6

A High-Throughput, Low Power Architecture and Its VLSI Implementation for DFT/IDFT Computation
Wei-Ren Shiue, Shen-Fu Hsiao (Inst. Compt. Eng., NSYSU, Taiwan)

A recursive algorithm for computation of both forward and backward DFT has been proposed where the common entries in the decomposed matrices are factored out in order to reduce the number of multipliers needed during implementation. The derived algorithm is essentially a band-matrix-vector multiplication with a matrix bandwidth of 3. By exploiting the heterogeneous dependency graphs for the matrix-vector multiplication and using an efficient mapping technique, only logN adders and logN-1 multipliers are needed to compute the DFT of size N, a great saving over a recently proposed systolic architecture which calls for 3logN adders and 3logN multipliers. Furthermore, due to the simplicity and regularity of the architectures, it is possible to design a low-power processor by turning off hardware components that have no operation to perform at the appropriate time steps. VLSI implementation of the DFT/IDFT processor with distributed FSM for timing control is also presented.

DISPS-2.7

NOVEL MAPPING OF A LINEAR QR ARCHITECTURE
Gaye Lightbody (Queen's University Belfast, Northern Ireland), Richard L Walke (Defence Evaluation and Research Agency, Malvern, England), Roger F Woods, John V McCanny (Queen's University Belfast, Northern Ireland)

This paper presents a novel architecture mapping technique which was essential in the design of a QR array which forms the core processor of a single chip adaptive beamforming system. The mapping technique assigns a QR triangular array of 2m^2+3m+1 cells down onto a linear architecture of m+1 processors. The mapping results in a linear systolic architecture with one hundred percent hardware utilisation, local interconnects and individual processors for boundary and internal cell operations. In addition, this paper highlights the effect latency has on the validity of the linear architecture.
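
The cell count factors neatly, which suggests how full utilisation becomes possible. The reading below is an interpretation, not a statement from the abstract: if the cells are shared evenly, each of the m+1 processors handles 2m+1 cell operations per update of the array.

```latex
\[
2m^2 + 3m + 1 = (2m + 1)(m + 1),
\qquad
\frac{2m^2 + 3m + 1}{m + 1} = 2m + 1 \ \text{cells per processor}.
\]
```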

DISPS-2.8

An Unrestrictedly Parallel Scheme for Ultra-High-Rate Reprogrammable Huffman Coding
Robert A Freking, Keshab K Parhi (Dept. of Electrical and Computer Engineering, University of Minnesota)

This paper proposes a comprehensive method for overcoming the inherently serial nature of variable-length near-entropy coding to obtain unrestrictedly parallel realizations of Huffman compression. A codestream rearrangement technique together with a symbol-stream order-recovery procedure form a concurrent approach capable of exceeding all previously attainable coderate figures. Furthermore, the method is noteworthy for achieving 100% hardware utilization with no coderate overhead while maintaining data output in a traditional streamed format. To further this endeavor, bit-serial encoder and decoder designs that possess compelling speed and area advantages are developed for service as parallel processing elements. However, both are suitable in more general contexts as well. The decoder, in particular, is optimally fast. The encoder and decoder designs are programmable, thus suggesting the appropriateness of the composite approach for a general-purpose ultra-high-speed codec. Benefits for low-power and variable-rate applications are briefly discussed.
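
The serial dependency that the paper sets out to overcome is visible in a plain Huffman codec: the decoder cannot know where the next codeword starts until it has finished the current one. The sketch below is ordinary serial Huffman coding with hypothetical function names; it does not implement the codestream rearrangement or order-recovery techniques of the paper.

```python
import heapq
from collections import Counter


def build_huffman_code(text):
    """Return a {symbol: bit string} prefix code built from symbol frequencies."""
    freq = Counter(text)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, counter, merged))
        counter += 1
    return heap[0][2]


def encode(text, code):
    """Concatenate the codewords of all symbols into one bit string."""
    return "".join(code[s] for s in text)


def decode(bits, code):
    """Decode bit by bit; each codeword boundary depends on the previous one."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```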

DISPS-2.9

FLEXIBLE VIDEO COMPRESSION SYSTEMS USING AN ANALOG VECTOR QUANTIZATION CHIP
Stefano Rovetta, Rodolfo Zunino (DIBE - Genoa University)

Vector quantization systems are usually based on digital implementation of the core operations. In this paper, video compression systems exploiting an analog implementation of vector quantization are presented. The main advantages of analog design are exploited, obtaining notable performance when compared to other solutions found in the literature. The circuit features a very modular, completely parallel internal architecture. Many circuits can be easily connected to obtain a larger codebook size and a larger vector dimension. Synthesis of codebooks is also described.
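
The core operation that the chip performs in analog form is a nearest-codeword search. A digital sketch follows, assuming squared Euclidean distance (the abstract does not name the distance measure). The second function illustrates, under the same assumption, how several codebook partitions could be combined to form a larger codebook; the function names and the per-chip partitioning scheme are hypothetical.

```python
def nearest_codeword(vector, codebook):
    """Return (index, squared distance) of the codeword closest to `vector`."""
    best_index, best_dist = -1, float("inf")
    for index, codeword in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist


def nearest_across_chips(vector, chip_codebooks):
    """Search several codebook partitions and keep the overall winner.

    Returns (chip number, index within that chip's codebook, squared distance).
    """
    best = (-1, -1, float("inf"))
    for chip, codebook in enumerate(chip_codebooks):
        index, dist = nearest_codeword(vector, codebook)
        if dist < best[2]:
            best = (chip, index, dist)
    return best
```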


Last Update: February 4, 1999 Ingo Höntsch