Authors:
Stephen McInerney,
Richard B Reilly,
Page (NA) Paper number 1470
Abstract:
Traditionally, Online Handwriting Recognition (OHR) implementations
use general-purpose processor architectures. The pre-processing step
of OHR comprises regular array-based tasks such as normalisation, feature
extraction and segmentation. Standard processor architectures cannot,
however, efficiently support the varied arithmetic operations required
by pre-processing. These tasks would seem ideally suited for custom
hardware acceleration. CORDIC offers all the required elementary functions
for pre-processing but is inefficient for linear mode operations (multiplication/division)
due to its serial nature. A hybrid Multiplier/CORDIC architecture is
proposed in which a fast iterative multiplier/MAC shares hardware with
a serial CORDIC unit. This multiplier retires 6b/cycle with minor additional
hardware requirements. This hybrid offers improved general performance
for signal-processing applications and is targeted at the pre-processing
task of OHR. Performance results are included. http://wwwdsp.ucd.ie
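
The contrast between CORDIC's circular and linear modes can be illustrated with a minimal floating-point sketch. It is not the hybrid Multiplier/CORDIC hardware described in the abstract; the function names, iteration count, and use of floats instead of fixed-point shifts are assumptions made for illustration. Note that the linear mode resolves roughly one bit of the product per iteration, which is the serial behaviour the abstract identifies as inefficient for multiplication.

```python
import math

def cordic_rotate(angle, iterations=16):
    """Rotation-mode CORDIC: approximate (cos(angle), sin(angle)) for |angle| <= pi/2."""
    # Elementary rotation angles atan(2^-i).
    atans = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Accumulated gain of the micro-rotations, compensated in the start vector.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate towards z == 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atans[i]
    return x, y

def cordic_multiply(x, z, iterations=16):
    """Linear-mode CORDIC: approximate x * z for |z| <= 2, one shift-add per iteration."""
    y = 0.0
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        y += d * x * 2.0 ** -i               # add or subtract a shifted copy of x
        z -= d * 2.0 ** -i                   # consume the matching weight of z
    return y
```

With 16 iterations, both functions agree with the exact values to roughly four decimal places; a hardware version would replace the powers of two with shifts.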
Authors:
Hiroshi Suzuki,
Yun-Nan Chang,
Keshab K Parhi,
Page (NA) Paper number 1788
Abstract:
This paper presents a low-power bit-serial Viterbi decoder chip with
the coding rate r=1/3 and the constraint length K=9 (256 states). This
chip has been implemented using 0.5µm three-layer metal CMOS technology
and is targeted for high speed convolutional decoding for next generation
wireless applications such as wide-band CDMA mobile systems and wireless
ATM LANs. The chip is expected to operate at 20Mbps under 3.3V and
at 2Mbps under 1.8V. The Add-Compare-Select (ACS) units have been designed
using bit-serial arithmetic, which has made it feasible to execute
256 ACS operations in parallel. For trace-back operations, we have
developed a novel power-efficient trace-back scheme and an application-specific
memory, which was designed considering that 256 bits should be written
simultaneously for write operations but only one bit needs to be accessed
for read operations. We have estimated that the chip dissipates only
10mW at 2Mbps operation under 1.8V.
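
As a rough illustration of the Add-Compare-Select operation mentioned above, the sketch below performs one trellis step in ordinary integer arithmetic. It is not the chip's bit-serial design: the function name, the data layout, and the toy two-state trellis in the usage note are assumptions. Each state produces one decision bit per step, which is why a 256-state decoder writes 256 bits at once while trace-back reads a single bit per step.

```python
def acs_step(path_metrics, predecessors, branch_metrics):
    """One add-compare-select step over all trellis states.

    path_metrics[s]        -- accumulated metric of state s before the step
    predecessors[s]        -- pair (p0, p1) of states that can lead into s
    branch_metrics[(p, s)] -- cost of the transition p -> s for this symbol
    Returns (new_metrics, decisions); decisions[s] is 0 if p0 survived, 1 if p1.
    """
    new_metrics, decisions = [], []
    for state, (p0, p1) in enumerate(predecessors):
        # Add: extend both candidate paths into this state.
        cand0 = path_metrics[p0] + branch_metrics[(p0, state)]
        cand1 = path_metrics[p1] + branch_metrics[(p1, state)]
        # Compare and select: keep the cheaper path and record which one won.
        if cand0 <= cand1:
            new_metrics.append(cand0)
            decisions.append(0)
        else:
            new_metrics.append(cand1)
            decisions.append(1)
    return new_metrics, decisions
```

For example, with `path_metrics = [1, 2]`, `predecessors = [(0, 1), (0, 1)]` and `branch_metrics = {(0, 0): 0, (1, 0): 3, (0, 1): 2, (1, 1): 0}`, the step returns `([1, 2], [0, 1])`.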
Authors:
Wei-Lung Liu,
Oscal T.-C. Chen,
Page (NA) Paper number 5092
Abstract:
Based on the radix-4 Booth algorithm, we developed a highly-scaleable
symmetric/asymmetric finite impulse response (FIR) architecture which
comprises a pre-processing unit, data latches, configurable connection
units, double Booth decoders, coefficient registers, a path control
unit, and a post-processing unit. In order to achieve scaleability,
the configurable connection units between data latches and the double
Booth decoders have been effectively addressed. The precision of filter
coefficients is adjustable by using a path control unit. The double
Booth decoding is efficiently implemented. In particular, the proposed
architecture employs only data-path controls to accomplish the scaleable
operations without changing word lengths and components of data latches
and filter taps. A practical FIR processor, which can accommodate dynamic
ranges of 8 and 16 bits of input data and filter coefficients, was
implemented by using the COMPASS 5V cell library in the TSMC 0.6µm
CMOS technology. This processor supports ten different operation modes
of asymmetric, symmetric, and anti-symmetric filter coefficients at
64, 63, 32, or 16 taps for various industrial applications.
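
For readers unfamiliar with the radix-4 Booth algorithm on which the architecture is based, the following sketch shows the standard recoding of a two's-complement multiplier into digits from {-2, -1, 0, 1, 2}. It illustrates the textbook recoding only, not the paper's double Booth decoder or its data-path controls; the function names and the default 8-bit width are assumptions.

```python
def booth_radix4_digits(multiplier, bits):
    """Recode a two's-complement `bits`-wide multiplier into radix-4 Booth digits.

    Digits are returned least significant first; each lies in {-2, -1, 0, 1, 2}.
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even for radix-4 recoding")
    pattern = multiplier & ((1 << bits) - 1)    # two's-complement bit pattern
    digits = []
    prev = 0                                    # implicit bit to the right of the LSB
    for i in range(0, bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)       # digit = -2*b1 + b0 + prev
        prev = b1
    return digits

def booth_multiply(a, b, bits=8):
    """Multiply a by b by summing one partial product per Booth digit of b."""
    return sum(d * a * 4 ** k for k, d in enumerate(booth_radix4_digits(b, bits)))
```

An 8-bit multiplier yields four digits, so a product needs four partial products instead of eight; each partial product is 0, ±a, or ±2a, all obtainable by shifting and negation.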
Authors:
Ching-Hsien Chang, Department of Electrical Engineering, National Tsing Hua University Hsinchu, Taiwan 300, Republic of China (China)
Chin-Liang Wang, Department of Electrical Engineering, National Tsing Hua University Hsinchu, Taiwan 300, Republic of China (China)
Yu-Tai Chang, Department of Electrical Engineering, National Tsing Hua University Hsinchu, Taiwan 300, Republic of China (China)
Page (NA) Paper number 1505
Abstract:
This paper presents a novel VLSI architecture for computing the N-point
discrete Fourier transform (DFT) based on a radix-2 fast algorithm,
where N is a power of two. The architecture consists of one complex
multiplier, two complex adders, and some special memory units. It can
compute one transform sample every log2(N)+1 clock cycles on average.
For the case of N=512, the chip area required is about 5742µm x 5222µm
and the throughput is up to 4M transform samples per second under
0.6µm CMOS technology. Such area-time performance makes the proposed
design rather attractive for use in long-length DFT applications, such
as ADSL and OFDM systems.
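
The throughput figures quoted above can be related by simple arithmetic. The sketch below assumes that throughput equals the clock frequency divided by the average cycles per sample; the implied clock rate is an inference from the quoted numbers, not a figure stated in the abstract, and the function names are hypothetical.

```python
def cycles_per_sample(n):
    """Average clock cycles per transform sample, log2(N) + 1, for N a power of two."""
    stages = n.bit_length() - 1
    if n < 2 or n != 1 << stages:
        raise ValueError("N must be a power of two")
    return stages + 1

def implied_clock_hz(n, samples_per_second):
    """Clock rate needed to sustain the given sample rate (assumed relationship)."""
    return cycles_per_sample(n) * samples_per_second
```

For N=512 this gives 10 cycles per sample, so 4M samples per second would correspond to a clock of about 40 MHz under the stated assumption.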
Authors:
John Bonk,
Andrew Stone,
Elias S Manolakos,
Page (NA) Paper number 2210
Abstract:
In this paper we present a design case study using DG2VHDL, a tool
which bridges the gap between an abstract graphical description of
a DSP algorithm and its concrete hardware description language (HDL)
representation. DG2VHDL automatically translates a Dependence Graph
(DG) into a synthesizable, behavioral VHDL entity that can be input
to industrial strength behavioral compilers for producing silicon implementations
of the algorithm (FPGAs, ASICs). Full Search Block Matching Motion
Estimation was selected for its current applications (MPEG, HDTV, Video
Conferencing) as well as for the richness of literature and architectural
exploration over the last decade. We will not only demonstrate here
that the behavioral VHDL code produced automatically by the tool leads,
after behavioral synthesis, to an efficient distributed memory and
control modular array architecture, but will also provide comparative
statistics for several new FS-BMA architectures derived for real-time
motion estimation.
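
Full Search Block Matching itself can be summarised in a few lines of software. The sketch below is a plain sequential version using the sum of absolute differences (SAD) as the matching cost; it is not the DG2VHDL output or any of the array architectures from the paper, and the function name, argument layout, and choice of SAD are assumptions.

```python
def full_search(cur, ref, top, left, block, search):
    """Exhaustively search ref for the block of cur at (top, left).

    cur, ref -- 2-D lists of pixel values with identical dimensions
    block    -- block side length in pixels
    search   -- maximum displacement tested in each direction
    Returns (best_sad, dy, dx) for the displacement with the smallest SAD.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall partly outside the reference frame.
            if y < 0 or x < 0 or y + block > height or x + block > width:
                continue
            cost = sum(
                abs(cur[top + r][left + c] - ref[y + r][x + c])
                for r in range(block)
                for c in range(block)
            )
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best
```

Every candidate displacement is evaluated independently with the same regular inner loop, which is the property that makes the algorithm a natural fit for dependence-graph descriptions and modular array implementations.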
Authors:
Shen-Fu Hsiao, Inst. Compt. Eng., NSYSU, Taiwan (Taiwan)
Wei-Ren Shiue, Inst. Compt. Eng., NSYSU, Taiwan (Taiwan)
Page (NA) Paper number 1673
Abstract:
A recursive algorithm for computation of both forward and backward
DFT has been proposed where the common entries in the decomposed matrices
are factored out in order to reduce the number of multipliers needed
during implementation. The derived algorithm is essentially the band-matrix-vector
multiplication with matrix bandwidth of 3. By exploiting the heterogeneous
dependency graphs for the matrix-vector multiplication and using an
efficient mapping technique, only logN adders and logN-1 multipliers
are needed to compute the DFT of size N, a great saving from a recently
proposed systolic architecture which calls for 3logN adders and 3logN
multipliers. Furthermore, due to the simplicity and regularity of the
architectures, it is possible to design a low-power processor by turning
off idle hardware components at the appropriate time steps. VLSI
implementation of the DFT/IDFT processor with distributed FSM for timing
control is also presented.
Authors:
Gaye Lightbody, QUEEN'S UNIVERSITY BELFAST, NORTHERN IRELAND (Ireland)
Richard L. Walke, DEFENSE EVALUATION AND RESEARCH AGENCY, MALVERN, ENGLAND (U.K.)
Roger F. Woods, QUEEN'S UNIVERSITY BELFAST, NORTHERN IRELAND (Ireland)
John V. McCanny, QUEEN'S UNIVERSITY BELFAST, NORTHERN IRELAND (Ireland)
Page (NA) Paper number 2357
Abstract:
This paper presents a novel architecture mapping technique which was
essential in the design of a QR array which forms the core processor
of a single chip adaptive beamforming system. The technique
maps a QR triangular array of 2m^2+3m+1 cells onto a linear
architecture of m+1 processors. The mapping results in a linear systolic
architecture with one hundred percent hardware utilisation, local interconnects
and individual processors for boundary and internal cell operations.
In addition, this paper highlights the effect latency has on the validity
of the linear architecture.
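
The cell and processor counts quoted above are consistent with full utilisation, as a short check shows. The sketch assumes the triangular array has 2m+1 rows, which matches the cell count since (2m+1)(2m+2)/2 = 2m^2+3m+1; that row count and the even split per processor are inferences from the numbers, not the paper's actual schedule, and the function names are hypothetical.

```python
def qr_cell_count(m):
    """Number of cells in the QR triangular array, as given in the abstract."""
    return 2 * m * m + 3 * m + 1

def triangle_cell_count(rows):
    """Cells in a full triangular array with the given number of rows."""
    return rows * (rows + 1) // 2

def cells_per_processor(m):
    """Cells each of the m+1 linear processors must handle for an even split."""
    cells, processors = qr_cell_count(m), m + 1
    if cells % processors != 0:
        raise ValueError("cells do not divide evenly among processors")
    return cells // processors
```

Because 2m^2+3m+1 factors as (2m+1)(m+1), each of the m+1 processors can be assigned exactly 2m+1 cell operations, leaving no processor idle.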
Authors:
Robert A Freking,
Keshab K Parhi,
Page (NA) Paper number 2100
Abstract:
This paper proposes a comprehensive method for overcoming the inherently
serial nature of variable-length near-entropy coding to obtain unrestrictedly
parallel realizations of Huffman compression. A codestream rearrangement
technique together with a symbol-stream order-recovery procedure form
a concurrent approach capable of exceeding all previously attainable
coderate figures. Furthermore, the method is noteworthy for achieving
100% hardware utilization with no coderate overhead while maintaining
data output in a traditional streamed format. To further this endeavor,
bit-serial encoder and decoder designs that possess compelling speed
and area advantages are developed for service as parallel processing
elements. However, both are suitable in more general contexts as well.
The decoder, in particular, is optimally fast. The encoder and decoder
designs are programmable, thus suggesting the appropriateness of the
composite approach for a general-purpose ultra-high-speed codec. Benefits
for low-power and variable-rate applications are briefly discussed.
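
The serial dependence that the method sets out to overcome is visible in even the simplest variable-length decoder. The sketch below is a conventional sequential encoder and decoder, not the parallel scheme or bit-serial designs proposed in the paper; the code table is a hypothetical prefix code chosen only for illustration.

```python
# Hypothetical prefix code, chosen only for illustration.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

def encode(symbols, code=CODE):
    """Concatenate the variable-length codewords of the input symbols."""
    return "".join(code[s] for s in symbols)

def decode(bits, code=CODE):
    """Decode a bit string one bit at a time.

    The start of each codeword is known only after the previous codeword has
    been fully decoded, which is what makes the process inherently serial.
    """
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete codeword")
    return symbols
```

Splitting the bit string at an arbitrary position would leave a second decoder without a known codeword boundary, which is why parallel decoding requires some rearrangement of the codestream.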
Authors:
Stefano Rovetta,
Rodolfo Zunino,
Page (NA) Paper number 1394
Abstract:
Vector quantization systems are usually based on digital implementation
of the core operations. In this paper, video compression systems exploiting
an analog implementation of vector quantization are presented. The
main advantages of analog design are exploited, obtaining notable performance
compared with other solutions found in the literature. The circuit
features a very modular, completely parallel internal architecture.
Many circuits can be easily connected to obtain a larger codebook size
and a larger vector dimension. Synthesis of codebooks is also described.
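
The core operation that such a circuit evaluates in parallel is a nearest-codeword search. The sketch below is a sequential digital version using squared Euclidean distance; the distance measure, function names, and example codebook are assumptions and do not describe the analog circuit itself.

```python
def nearest_codeword(vector, codebook):
    """Return the index of the codeword closest to vector (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, codeword in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def quantize(vectors, codebook):
    """Replace each input vector by the index of its nearest codeword."""
    return [nearest_codeword(v, codebook) for v in vectors]
```

For example, with the codebook `[(0, 0), (10, 10), (0, 10)]`, the vectors `[(1, 1), (9, 8), (2, 9)]` quantize to indices `[0, 1, 2]`. Each distance computation is independent of the others, so the search parallelises across codewords, and enlarging the codebook only adds more such units.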