Chair: Teresa Meng, Stanford University (USA)
F. Lorenzelli, University of California at Los Angeles (USA)
K. Yao, University of California at Los Angeles (USA)
In this paper we consider the algorithm for SVD updating based on Jacobi rotations. To overcome the trade-off between accuracy and updating rate intrinsic to the original algorithm, we propose two schemes that improve overall performance when the rate of change of the data is high. In the "variable forgetting factor" approach, the effective width of the observation window adjusts to the data nonstationarity. The former scheme ensures closeness to convergence at all times, while the latter adapts the response to data variation. We consider applications of the SVD updating algorithm to the speech processing tasks of segmentation, adaptive parameter estimation, and glottal closure detection.
Kalavai J. Raghunath, University of Minnesota (USA)
Keshab K. Parhi, University of Minnesota (USA)
Recently, a new pipelinable PSTAR-RLS algorithm was developed. It was shown to be an effective alternative to the QRD-RLS algorithm when high speeds are required. Using the folding technique, a 4-tap PSTAR-RLS algorithm was implemented on a single VLSI chip. All operations in the chip are bit-level pipelined. With a 1.2 µm CMOS technology, this chip is expected to run at 100 MHz. Arithmetic operators based on a redundant number system were used for a performance advantage. Apart from a wafer-scale implementation, this is the first-ever single-chip ASIC implementation of an RLS adaptive filter.
S. Freeman, University of Michigan (USA)
M. O'Donnell, University of Michigan (USA)
A versatile signal processor has been designed that can perform multiple rotations, multiplications, and additions within one clock cycle. The computational elements of this processor include four pipelined CORDIC rotators, two pipelined fast multipliers, and two adders. A combination of register files, SRAM, and ROM provides on-chip storage for coefficients, running sums, and programs. The chip architecture and its applicability to complex-valued signal processing tasks are discussed.
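As background for the CORDIC rotators mentioned above, the following is a minimal floating-point sketch of a rotation-mode CORDIC iteration. It illustrates the general shift-and-add technique only and is not the chip's datapath; the function name `cordic_rotate` and the iteration count are assumptions made for illustration.

```python
import math

def cordic_rotate(x, y, angle, iterations=24):
    """Rotate (x, y) by `angle` radians using CORDIC micro-rotations.

    Illustrative sketch: converges for |angle| up to about 1.74 rad.
    Hardware versions replace the multiplications by 2**-i with shifts.
    """
    # Elementary angles atan(2^-i) and the accumulated gain of all steps.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0  # rotate toward zero residual
        shift = 2.0 ** -i
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * angles[i]
    # Compensate for the constant gain introduced by the micro-rotations.
    return x / gain, y / gain
```

Calling `cordic_rotate(1.0, 0.0, math.pi / 6)` should return values close to the cosine and sine of 30 degrees, with accuracy set by the number of iterations.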
Amer El Helwani, France Telecom-CNET (FRANCE)
Patrice Le Scan, France Telecom-CNET (FRANCE)
Within the context of acoustic echo cancellation, the convergence rate of the NLMS adaptive filtering algorithm is not sufficient when the input signal is strongly correlated (as a speech signal is). The Generalized Multi Delay Frequency-domain (GMDF) algorithm allows faster convergence and faster tracking of variations in the echo path to be identified. In these applications the filter may be several thousand taps long at a sampling rate of 16 kHz, which makes real-time implementation of the algorithm difficult. This paper describes an optimized VLSI architecture of a dedicated circuit designed to perform the computationally intensive tasks of the GMDF algorithm in real time. The architecture can handle filters with long impulse responses.
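For reference, the time-domain NLMS baseline that the abstract compares against can be sketched as follows. This is not the GMDF algorithm; it is a minimal illustration of the NLMS update, and the step size `mu`, the regularizer `eps`, the three-tap echo path, and the white-noise input in `demo` are all assumptions chosen so the sketch converges quickly.

```python
import random

def nlms_identify(x, d, num_taps, mu=0.5, eps=1e-8):
    """Adapt FIR weights w so that the filtered input x tracks d."""
    w = [0.0] * num_taps
    buf = [0.0] * num_taps  # most recent input samples, newest first
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))   # filter output
        e = dn - y                                   # a priori error
        norm = eps + sum(bi * bi for bi in buf)      # input energy
        # Normalized LMS update: step scaled by the input energy.
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
    return w

def demo():
    """Identify a hypothetical 3-tap echo path from white-noise input."""
    rng = random.Random(0)
    echo_path = [0.5, -0.3, 0.1]  # assumed values, for illustration only
    x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    d, buf = [], [0.0] * len(echo_path)
    for xn in x:
        buf = [xn] + buf[:-1]
        d.append(sum(h * b for h, b in zip(echo_path, buf)))
    return nlms_identify(x, d, len(echo_path)), echo_path
```

With a white input the weights converge rapidly; with a strongly correlated input such as speech, the same update converges much more slowly, which is the limitation the abstract refers to.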
Hiroyuki Kawai, Mitsubishi Electric Corporation (JAPAN)
Yoshitugu Inoue, Mitsubishi Electric Corporation (JAPAN)
Robert Streitenberger, Mitsubishi Electric Corporation (JAPAN)
Masahiko Yoshimoto, Mitsubishi Electric Corporation (JAPAN)
This paper presents the architecture of a newly developed, highly parallel DSP suited for real-time image recognition. The programmable DSP was designed for a variety of image recognition systems, such as computer vision systems and character recognition systems. The DSP consists of functional units optimized for image recognition: a SIMD processing core, a hierarchical bus, an Address Generation Unit, Data Memories, a DMAC, a Link Unit, and a Control Unit. The DSP can perform 5x5 spatial filtering of a 512x512 image within 13.1 ms. When the DSP is applied to a Japanese character recognition system, a speed of 924 characters/sec can be achieved for feature extraction and feature vector matching. The DSP can be integrated on a single 14.5 x 14.5 $mm^2$ chip using 0.5 um CMOS technology. In this paper, the key features of the architecture and the new techniques enabling efficient operation of the eight parallel processing units are described. An estimate of the performance of the DSP is also presented.
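The spatial filtering figure quoted above can be turned into a rough throughput estimate. The sketch below assumes one multiply-accumulate (MAC) per kernel tap per output pixel and ignores border handling and data movement; the function name is hypothetical and the resulting rate is only an order-of-magnitude reading of the abstract's numbers.

```python
def spatial_filter_mac_rate(width, height, kernel, seconds):
    """Return (total MACs, MACs per second) for a kernel x kernel filter.

    Assumes one MAC per tap per output pixel, with no border savings.
    """
    macs = width * height * kernel * kernel
    return macs, macs / seconds

# Figures taken from the abstract: 5x5 filter, 512x512 image, 13.1 ms.
total_macs, rate = spatial_filter_mac_rate(512, 512, 5, 13.1e-3)
per_unit_rate = rate / 8  # spread over the eight parallel processing units
```

Under these assumptions the filter needs 6,553,600 MACs, or roughly 5 x 10^8 MACs per second overall.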
Shen-Fu Hsiao, National Sun Yat-Sen University (TAIWAN)
The Jacobi method has been used on special-purpose multi-processor VLSI systems for parallel singular value decomposition (SVD) of dense matrices, and CORDIC processors are often used as the basic processing elements to implement the two-sided rotations, the fundamental operations in the Jacobi method. Recently, generalizations of the original CORDIC algorithm to multi-dimensional spaces have been used in the SVD of complex matrices to achieve faster computation. A further speed-up by a factor of more than 2 can be gained by gradually refining the resolution of the CORDIC algorithms used in the Jacobi method.
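The two-sided rotation at the heart of the Jacobi SVD method can be illustrated on a real 2x2 matrix. The sketch below uses the textbook two-step approach (a rotation that symmetrizes the matrix, followed by a symmetric Jacobi rotation) in floating point; it is not the CORDIC-based or multi-dimensional variant discussed in the abstract, and the helper names are assumptions.

```python
import math

def rot(theta):
    """2x2 plane rotation matrix for angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(p, q):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(m):
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]

def two_sided_rotation(a):
    """Return (left, d, right) with left^T * a * right = d diagonal.

    The diagonal entries of d may be negative; their absolute values
    are the singular values of a.
    """
    (a11, a12), (a21, a22) = a
    # Step 1: rotation angle t that makes rot(t)^T * a symmetric.
    t = math.atan2(a21 - a12, a11 + a22)
    b = matmul(transpose(rot(t)), a)
    # Step 2: Jacobi angle u that diagonalizes the symmetric matrix b.
    u = 0.5 * math.atan2(2.0 * b[0][1], b[0][0] - b[1][1])
    left, right = rot(t + u), rot(u)
    d = matmul(matmul(transpose(left), a), right)
    return left, d, right
```

In a parallel Jacobi array, each processing element applies such a rotation pair to its own 2x2 block, and the two angles are exactly the quantities a CORDIC unit would evaluate and apply.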
Finn T. Moller, Aalborg University
Jack B. Andersen, Grundfos Electronics
Hans R. Jensen, Aalborg University (DENMARK)
Ole Olsen, Aalborg University (DENMARK)
Flemming K. Fink, Aalborg University (DENMARK)
This paper describes PSEUDEC, a dedicated co-processor, and the rationale behind its design. The final goal of our work is to present an advanced digital hearing aid based on parameterized transformation of speech (PARTRAN) as a single-chip solution with low power consumption. The subset of PARTRAN implemented by PSEUDEC performs PSEUdo DEComposition of a 12th-order LPC polynomial. An adapted algorithm displays improved dynamic range compared to a conventional solution suited for DSPs, calculating the amplitude spectrum rather than the power spectrum. Highly pipelined CORDIC units optimized for the application replace complex multiplication, trigonometric operations (for $e^{jw}$), and square root (for the 2-norm of a complex vector), exploiting the power of CORDIC operations in advanced DSP algorithms. PSEUDEC uses ON-LINE arithmetic for efficient implementation of operators and for efficient inter-operator communication. PSEUDEC has been implemented using ordinary standard cells.
Daniel Rabideau, United States Air Force
Allan Steinhardt, MIT Lincoln Laboratory (USA)
Subspace tracking is an integral part of many high-resolution adaptive array methods. Unfortunately, the high computational complexity and non-parallel nature of traditional subspace tracking algorithms have deterred their use in real-time systems. In this paper we discuss parallel mappings of the Fast Subspace Tracking Algorithm. The serial complexity of this algorithm is already among the lowest: O(Nr) for N channels and an r-dimensional subspace. We show that even greater reductions in effective complexity can be achieved by mapping our algorithm onto multiple processors. Near-linear speedup is obtained on machines spanning the range from fine-grain systolic arrays to coarse-grain, commercially available MPPs.
Minoru Okamoto, Matsushita Electric Ind. Co. Ltd.
Toshihiro Ishikawa, Matsushita Communication Ind. Co. Ltd.
Shinichi Mauri, Matsushita Electric Ind. Co. Ltd.
Masayuki Yamaski, Matsushita Electric Ind. Co. Ltd.
Katsuhiko Ueda, Matsushita Electric Ind. Co. Ltd.
Nobuo Asano, Matsushita Communication Ind. Co. Ltd. (JAPAN)
Mitsuru Uesugi, Matsushita Communication Ind. Co. Ltd. (JAPAN)
Yoshiko Saitoh, Matsushita Communication Ind. Co. Ltd. (JAPAN)
Yukihiro Fujimoto, Matsushita Communication Ind. Co. Ltd. (JAPAN)
Susumu Furushima, Matsushita Communication Ind. Co. Ltd. (JAPAN)
A new DSP architecture for the equalization, channel coding/decoding, and encryption/decryption required by GSM hand-portable terminals is presented. In the DSP, which is called EQCHAN (Equalizer and Channel coding/decoding processor), these tasks are handled by common units, namely the data processing unit (DPU) and the bit manipulation unit (BMU). The LSI that contains EQCHAN was designed using 0.8 um CMOS technology, and its die size is 123 $mm^2$. The LSI consumes 60 mW at 3.6 V in continuous communication mode, which is low enough for a portable terminal. In this paper, we describe the detailed architecture of EQCHAN.
Thomas P. Kelliher, Westminster College
Eric S. Gayles, Pennsylvania State University (USA)
Robert M. Owens, Pennsylvania State University (USA)
Mary Jane Irwin, Pennsylvania State University (USA)
The Micro-Grain Array Processor (MGAP) is a family of two-dimensional, micro-grained array processors. The processor cell architecture is extremely compact and simple, ensuring fine granularity, a very high processor density, and programming flexibility. Flexibility is maintained through a programmable interconnect which clusters array cells into larger computational units. In this paper, we discuss the design and optimization issues of MGAP-2 at both the processor array and system levels. Various design strategies and tradeoffs are being investigated at both levels. The reader will see how lessons learned from building and using MGAP-1 have been applied in this new design effort. We also describe our MGAP programming environment and an application example: the two-dimensional discrete cosine transform, a powerful image compression tool.
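To make the application example concrete, the two-dimensional discrete cosine transform can be computed separably, as a 1-D transform over rows followed by a 1-D transform over columns. The sketch below is a direct, unoptimized orthonormal DCT-II in plain Python; it says nothing about how the transform is mapped onto the MGAP array, and the function names are assumptions.

```python
import math

def dct_1d(v):
    """Orthonormal DCT-II of a sequence v (direct O(n^2) evaluation)."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n)
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """2-D DCT of a rectangular block: rows first, then columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]  # transform each column
    return [list(r) for r in zip(*cols)]          # transpose back
```

For a constant 8x8 block, all of the energy lands in the single DC coefficient, which is the property that makes the transform useful for image compression.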