Authors:
Jeff Y.F. Hsieh,
Teresa H.Y. Meng,
Page (NA) Paper number 2292
Abstract:
A low-power, large-scale parallel digital video (DV) [1] encoder architecture
for a single-chip digital CMOS video camera is discussed in this paper.
This architecture is based on the single-chip CMOS camera MPEG-2 encoder
architecture proposed in [2] with an emphasis on formatting and streaming
of the compressed data. The architecture proposed here supports the
625/25 format of 720x576 pixels per frame. When clocked at 40 MHz,
this architecture delivers a processing performance of 1.8 billion
operations per second (BOPS) capable of supporting a frame rate of
25 fps as well as additional image enhancement processing. Low power
consumption is achieved by the use of a parallel architecture and low-power
circuit design techniques. When implemented in a 0.2 micron CMOS technology
at a 1.5 V supply voltage, the parallel architecture consumes 45 mW
providing a power efficiency of 40 billion operations per second per
Watt.
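The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the per-cycle and per-pixel values are derived here for illustration and are not claims made by the authors (the per-pixel figure assumes one sample per pixel and ignores chroma subsampling).

```python
# Figures quoted in the abstract.
ops_per_second = 1.8e9        # 1.8 BOPS
power_watts = 45e-3           # 45 mW at 1.5 V
clock_hz = 40e6               # 40 MHz
width, height, fps = 720, 576, 25   # 625/25 format

# Power efficiency in billions of operations per second per watt.
efficiency_bops_per_watt = ops_per_second / power_watts / 1e9

# Derived values (not stated in the abstract): the degree of parallelism
# implied by the clock rate, and a rough per-pixel operation budget.
ops_per_cycle = ops_per_second / clock_hz
pixels_per_second = width * height * fps
ops_per_pixel = ops_per_second / pixels_per_second

print(efficiency_bops_per_watt, ops_per_cycle, pixels_per_second, ops_per_pixel)
```

The efficiency works out to the 40 BOPS/W quoted in the abstract, and the clock rate implies about 45 operations per cycle.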
Authors:
Tetsuro Takizawa,
Junji Tajime,
Hidenobu Harasaki,
Page (NA) Paper number 1842
Abstract:
This paper proposes an efficient memory mapping and a frame memory
compression method for an HDTV decoder LSI using Direct Rambus(TM) DRAM (DRDRAM).
DRDRAM is employed to achieve the high memory bandwidth required for HDTV
decoding at minimum memory cost. The proposed memory mapping achieves
memory bandwidth sufficient for HDTV decoding even in the worst
case, and the LSI requires no costly line buffers for format
conversion. The proposed frame memory compression method halves the memory
cost and achieves HDTV decoding with a single 64 Mb DRDRAM chip
without loss of memory access efficiency. Simulation results show that
SNR degradation is 0.1 to 2 dB in the worst frame and no visible degradation
is perceived except for a resolution chart sequence.
Authors:
Takafumi Morifuji,
Yoshinori Takeuchi,
Masaharu Imai,
Page (NA) Paper number 2238
Abstract:
We present the architecture of a General Purpose Numerical Processor (GPNP).
A processor with this architecture is programmable and can run a wide variety
of numerical and digital signal processing tasks.
Flexibility and high performance are achieved through multiple functional
units and parallelism in their data transfers. In simulation, a prototype GPNP
with five functional units operates at a 33 MHz clock frequency;
its size corresponds to 230 k transistors plus 34.5 kbytes of
on-chip memory.
Authors:
Shai Rubin,
Moshe Levinger,
Randall R Pratt,
William P Moore,
Page (NA) Paper number 1435
Abstract:
Test-program generators play a key role in hardware functional verification
of large scale processors. However, in the DSP domain,
full-blown test-program generators are much less widely used, mainly due
to the limited resources (time and money) available when developing
such systems. This paper describes a work-model for the fast, low cost
construction of a test-program generator for DSPs. The core technology
uses Genesys, a known test program generator that, until now, has been
used for the verification of large scale processor families, such as
PowerPC and x86. We developed the model while using Genesys for verification
of the IBM C54XDSP, a recently-announced fixed-point DSP. The case
study shows that it is possible to build a full test-program generator
in a very short time and thus achieve better verification coverage
in spite of the shorter development time.
Authors:
Darko Kirovski,
Miodrag Potkonjak,
Page (NA) Paper number 2421
Abstract:
Rapid prototyping and the development of in-circuit and FPGA-based emulators
as key accelerators for fast time-to-market have resulted in a need
for fast error correction mechanisms. Once an error is diagnosed, fabricated
or emulated prototypes require engineering change (EC) that is quick and as
flexible as possible. However, this problem has recently initiated
research activity mainly in the logic synthesis domain. We introduce
the first set of EC protocols for behavioral synthesis. The protocols
support both the pre- and post-processing EC paradigms. In addition,
instead of developing special algorithms for EC, which is the adopted
research model, we show as a key contribution that protocols
which facilitate constraint manipulation of the initial design specification
remove the need to develop specialized EC algorithms. The
EC process is performed using the standard optimization algorithms
on the modified design. Nevertheless, as shown on a number of behavioral
synthesis tasks, including resource assignment, design partitioning,
and operation scheduling, the approach provides variable and guaranteed
flexibility for incremental synthesis with minimal hardware overhead.
Authors:
Thomas Zeitlhofer,
Bernhard R Wess,
Page (NA) Paper number 1952
Abstract:
In this paper, we describe a new and efficient approach to solving the
scheduling problem for VLIW architectures. The scheduling times of
the operations are used as the problem's parameters. This, in conjunction
with a pruning technique based on critical path analysis, leads to a
significant reduction of search space complexity. A genetic algorithm
is used to search for valid schedules of a given length. The genetic
algorithm uses a fitness vector that guides the genetic operators crossover
and mutation, resulting in fast convergence towards near-perfect solutions.
The proposed method is also applicable to the problem of register allocation
by using a different fitness function. Another advantage of the genetic
algorithm approach is that usually a great number of equally performing
schedules are obtained, allowing for further optimization subject to
arbitrary constraints.
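To illustrate how critical path analysis can shrink the search space when start times are the parameters, here is a minimal sketch. It assumes the pruning amounts to ASAP/ALAP start-time windows for a target schedule length, which the abstract does not confirm. The function names, the scalar penalty (the abstract mentions a fitness vector), and the `units` issue-slot limit are all hypothetical.

```python
def time_windows(latency, deps, length):
    """Return {op: (earliest_start, latest_start)} for a schedule of `length` cycles.

    `deps` holds pairs (a, b) meaning b consumes the result of a.
    The dependence graph is assumed to be acyclic.
    """
    preds = {op: [] for op in latency}
    succs = {op: [] for op in latency}
    for a, b in deps:
        succs[a].append(b)
        preds[b].append(a)
    asap, alap = {}, {}

    def earliest(op):
        # An operation can start once all predecessors have finished.
        if op not in asap:
            asap[op] = max((earliest(p) + latency[p] for p in preds[op]), default=0)
        return asap[op]

    def latest(op):
        # An operation must finish before its earliest-constrained successor starts.
        if op not in alap:
            alap[op] = min((latest(s) for s in succs[op]), default=length) - latency[op]
        return alap[op]

    return {op: (earliest(op), latest(op)) for op in latency}


def violations(start, latency, deps, units):
    """Scalar penalty: broken dependences plus cycles issuing more than `units` ops."""
    broken = sum(1 for a, b in deps if start[b] < start[a] + latency[a])
    issued = {}
    for t in start.values():
        issued[t] = issued.get(t, 0) + 1
    return broken + sum(max(0, n - units) for n in issued.values())


latency = {"a": 1, "b": 2, "c": 1, "d": 1}
deps = [("a", "c"), ("b", "c"), ("c", "d")]
windows = time_windows(latency, deps, length=4)
print(windows)
print(violations({"a": 0, "b": 0, "c": 2, "d": 3}, latency, deps, units=1))
print(violations({"a": 1, "b": 0, "c": 2, "d": 3}, latency, deps, units=1))
```

In this toy example only operation `a` keeps any freedom once the windows are computed, so a search over start times would need to explore two candidates rather than every combination of cycles.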
Authors:
Timothy W O'Neil,
Sissades Tongsima,
Edwin H.M. Sha,
Page (NA) Paper number 2023
Abstract:
Many iterative or recursive applications commonly found in DSP and
image processing can be represented by data-flow graphs
(DFGs). Such a graph is then used to perform DFG scheduling, where the
starting times for executing the application's individual tasks are
determined. The minimum length of time required to execute all tasks
once is called the schedule length of the DFG. A great deal of research
has been done attempting to optimize such applications by applying
various graph transformation techniques to the DFG in order to minimize
this schedule length. One of the most effective of these techniques
is retiming. In this paper, we demonstrate that the traditional retiming
technique does not always achieve optimal schedules and propose a new
graph-transformation technique, extended retiming, which will. We will
also present an algorithm for finding an extended retiming which transforms
a DFG into one with minimal schedule length. Finally, we will demonstrate
a constant-time algorithm which verifies the existence of a retimed
DFG with the minimum schedule length.
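The following sketch shows traditional retiming on a three-node loop and why it can fall short of the optimum. It implements only the standard rule (an edge u -> v with d delays becomes d + r(v) - r(u)), not the extended retiming proposed in the paper. The graph, node times, and retiming values are made up for illustration.

```python
from functools import lru_cache


def retime(edges, r):
    """Traditional retiming: edge (u, v, d) gets d + r[v] - r[u] delays."""
    return [(u, v, d + r[v] - r[u]) for (u, v, d) in edges]


def schedule_length(times, edges):
    """Longest total computation time along any path of zero-delay edges.

    Assumes every cycle of the DFG carries at least one delay, so the
    zero-delay subgraph is acyclic.
    """
    succ = {n: [] for n in times}
    for u, v, d in edges:
        if d < 0:
            raise ValueError("illegal retiming: negative delay count")
        if d == 0:
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(n):
        return times[n] + max((longest_from(m) for m in succ[n]), default=0)

    return max(longest_from(n) for n in times)


# Hypothetical loop A -> B -> C -> A with two delays on the back edge.
times = {"A": 1, "B": 1, "C": 1}
edges = [("A", "B", 0), ("B", "C", 0), ("C", "A", 2)]
r = {"A": 0, "B": 1, "C": 1}   # moves one delay from C -> A onto A -> B

before = schedule_length(times, edges)
after = schedule_length(times, retime(edges, r))
print(before, after)
```

Retiming shortens the schedule from 3 to 2 time units here. The loop has 3 units of work and 2 delays, giving a bound of 3/2 per iteration, but with two whole delays spread over three edges at least one edge is always left without a delay, so no integer retiming of this graph gets below 2.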
Authors:
Hung-Ying Tyan,
Yu Hen Hu,
Page (NA) Paper number 2213
Abstract:
A novel iterative algorithm is proposed to compute the theoretical
minimum initiation interval of a given recurrent algorithm when there
is a known, fixed inter-module communication delay. Specifically, for
a twin-module implementation problem, a novel representation called
necessary initiation interval is introduced to facilitate the development
of an iterative algorithm which yields both the minimum initiation
interval and the corresponding cut set of the cyclic iterative computational
dependence graph (ICDG). The convergence of this iterative algorithm
in finite iterations is also proved.
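As a point of reference for what the minimum initiation interval means, the sketch below evaluates a brute-force bound for one given two-module assignment. It is not the iterative algorithm of the paper. It assumes that each edge crossing between modules adds a fixed `comm_delay` to the computation time of every cycle it lies on; that model, the example graph, and all names are assumptions for illustration.

```python
from fractions import Fraction


def simple_cycles(nodes, edges):
    """Yield every simple cycle once, as a list of edges (small graphs only)."""
    out = {n: [] for n in nodes}
    for e in edges:
        out[e[0]].append(e)
    order = {n: i for i, n in enumerate(nodes)}

    def walk(start, node, path, seen):
        for e in out[node]:
            nxt = e[1]
            if nxt == start:
                yield path + [e]
            elif nxt not in seen and order[nxt] > order[start]:
                # Only visit nodes ordered after the start, so each cycle
                # is reported from its lowest-ordered node exactly once.
                yield from walk(start, nxt, path + [e], seen | {nxt})

    for s in nodes:
        yield from walk(s, s, [], {s})


def initiation_interval_bound(times, edges, module, comm_delay):
    """Max over cycles of (work + crossing penalties) / delays.

    Every cycle is assumed to carry at least one delay.
    """
    best = Fraction(0)
    for cycle in simple_cycles(list(times), edges):
        work = sum(times[u] for u, v, d in cycle)
        hops = sum(1 for u, v, d in cycle if module[u] != module[v])
        delays = sum(d for u, v, d in cycle)
        best = max(best, Fraction(work + hops * comm_delay, delays))
    return best


times = {"A": 2, "B": 3, "C": 1}
edges = [("A", "B", 0), ("B", "A", 1), ("B", "C", 0), ("C", "B", 2)]

one_module = initiation_interval_bound(times, edges, {"A": 0, "B": 0, "C": 0}, 2)
cut_at_c = initiation_interval_bound(times, edges, {"A": 0, "B": 0, "C": 1}, 2)
cut_at_a = initiation_interval_bound(times, edges, {"A": 1, "B": 0, "C": 0}, 2)
print(one_module, cut_at_c, cut_at_a)
```

In this example, cutting the graph through the non-critical loop leaves the bound unchanged, while cutting through the critical loop raises it, which is why the choice of cut set matters.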
Authors:
Shiro Kobayashi,
Gerhard P Fettweis,
Page (NA) Paper number 1717
Abstract:
A new approach for implementing block-floating-point arithmetic is
proposed. The approach aims to preserve the least-significant bits
(LSBs) to improve signal processing quality. The preservation of LSBs
is automatically and perfectly done by hardware. Simulation
results of the proposed block-floating-point implementation show
improved SNRs over conventional block-floating-point implementations,
as expected. For the same number of bits per data word
in memory, SNRs better than those of floating-point are also observed.
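For context, the sketch below shows conventional block-floating-point quantization, where one exponent is shared by a whole block, and where the least-significant bits of small samples get lost. It does not implement the LSB-preserving hardware scheme proposed in the paper. The function name, word length, and sample values are hypothetical.

```python
import math


def bfp_quantize(block, mantissa_bits):
    """Quantize a block with one shared exponent (conventional scheme).

    Returns the reconstructed values and the shared exponent.
    """
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return [0.0] * len(block), 0
    # Shared exponent chosen so that peak / 2**exp lies in [0.5, 1).
    exp = math.floor(math.log2(peak)) + 1
    scale = 2.0 ** (mantissa_bits - 1 - exp)
    lim = 2 ** (mantissa_bits - 1) - 1
    # Round each sample to a signed integer mantissa, saturating at the limits.
    mant = [max(-lim - 1, min(lim, round(x * scale))) for x in block]
    return [m / scale for m in mant], exp


# A large sample in the block forces a coarse step on the small ones.
mixed, mixed_exp = bfp_quantize([0.75, -0.01, 0.002], mantissa_bits=8)
# The same small samples in their own block keep far more precision.
small, small_exp = bfp_quantize([-0.01, 0.002], mantissa_bits=8)
print(mixed, mixed_exp)
print(small, small_exp)
```

With the 0.75 sample present, the 0.002 sample is rounded to zero; in a block of its own it survives with only a small rounding error. That loss of low-order bits is the effect the abstract sets out to avoid.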
Authors:
Louis R Litwin,
Thomas J Endres,
Samir N Hulyalkar,
Michael D Zoltowski,
Page (NA) Paper number 1245
Abstract:
One of the most popular blind equalization techniques is the Constant
Modulus Algorithm (CMA), which has gained popularity in the literature
and in practice because of its LMS-like complexity and its robustness
to non-ideal, but practical, conditions. Although CMA has been well studied
in the literature, these analyses have typically implemented the algorithm
using "infinite" precision arithmetic. The motivation for this paper
is a VLSI implementation of a high data rate, fractionally spaced,
linear forward equalizer whose taps are adjusted using CMA. In this
paper we examine how implementing CMA using finite bit precisions affects
the algorithm's performance.
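One way finite precision can affect CMA is sketched below: a single tap update is computed and the stored taps are then rounded to a fixed number of fractional bits. The update is the standard stochastic-gradient form for y = sum(w_k * x_k). Quantizing only the stored taps, the word lengths, and all input values are assumptions for illustration; the paper's actual fixed-point model is not described in the abstract.

```python
def quantize(z, frac_bits):
    """Round real and imaginary parts to `frac_bits` fractional bits, saturating to [-1, 1)."""
    step = 2.0 ** -frac_bits

    def q(v):
        v = round(v / step) * step
        return max(-1.0, min(1.0 - step, v))

    return complex(q(z.real), q(z.imag))


def cma_step(w, x, mu, gamma, frac_bits=None):
    """One CMA update: w <- w - mu * (|y|^2 - gamma) * y * conj(x)."""
    y = sum(wk * xk for wk, xk in zip(w, x))
    err = (abs(y) ** 2 - gamma) * y
    w_new = [wk - mu * err * xk.conjugate() for wk, xk in zip(w, x)]
    if frac_bits is not None:
        # Assumed model: only the stored taps are quantized.
        w_new = [quantize(wk, frac_bits) for wk in w_new]
    return w_new, y


w0 = [0.5 + 0j, 0j]          # hypothetical two-tap equalizer
x0 = [1 + 0j, 0.5 + 0j]      # hypothetical regressor
mu, gamma = 2.0 ** -6, 1.0

coarse, _ = cma_step(w0, x0, mu, gamma, frac_bits=4)
fine, y0 = cma_step(w0, x0, mu, gamma, frac_bits=12)
print(coarse)
print(fine)
```

With 4 fractional bits the correction is smaller than half a quantization step and is rounded away, so the taps stay where they were. With 12 bits the same update is retained. This stalling is one example of how word length and step size interact in a fixed-point implementation.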