Session: DISPS-L3
Time: 9:30 - 11:30, Thursday, May 10, 2001
Location: Room 250 D
Title: Design and Implementation-Programmable Processors
Chair: Wanda Gass

9:30, DISPS-L3.1
RAPID PROTOTYPING OF MULTI-DSP SYSTEMS BASED ON ACCURATE PERFORMANCE ESTIMATION
B. RINNER, B. RUPRECHTER, M. SCHMID
Developing a parallel application is tedious and more complex than developing a single-processor solution. We have developed PEPSY, a prototyping environment for multi-DSP systems, whose primary goal is to automate the design and implementation of parallel DSP applications. Given an extended data flow graph of the DSP application and a description of the target multi-processor system, PEPSY automatically maps and schedules the DSP application onto the multi-processor system and generates complete code for each processor. PEPSY excels at accurate performance estimation, so the design goals of the parallel application can be verified prior to its implementation. With PEPSY, a DSP application can be parallelized onto various processors within minutes.

9:50, DISPS-L3.2
A BLOCK PRIORITY BASED INSTRUCTION CACHING SCHEME FOR MULTIMEDIA PROCESSORS
J. KANG, W. SUNG
This paper proposes a new instruction caching scheme that uses block priority information and is mainly targeted at embedded multimedia processors. The block priority information is obtained by profiling application programs. The goal of the caching scheme is to keep more important code blocks in the cache longer, using block priority information that programmers provide by analyzing the profiling results of multimedia applications. In addition to the caching scheme, methods for determining the priority of each code block are developed, and their performance is evaluated using real multimedia applications. The experimental results show that the cache miss ratio can be reduced to as little as about half that of the normal LRU replacement scheme, although the improvement depends on the cache size.
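The idea of keeping higher-priority code blocks resident longer can be sketched as follows. This is only an illustration under stated assumptions, not the authors' scheme: the cache is modeled as fully associative, the class `PriorityLRUCache`, the helper `count_misses`, the priority table, and the access trace are all hypothetical, and the victim is simply the least recently used block among those with the lowest priority.

```python
from collections import OrderedDict


class PriorityLRUCache:
    """Fully associative cache model (hypothetical, for illustration only).

    On a miss with a full cache, the victim is the least recently used
    block among the resident blocks with the lowest priority.
    """

    def __init__(self, capacity, priority):
        self.capacity = capacity
        self.priority = priority      # block id -> int, higher = more important
        self.lines = OrderedDict()    # iteration order: least recently used first
        self.misses = 0

    def access(self, block):
        """Access one block; return True on a hit, False on a miss."""
        if block in self.lines:
            self.lines.move_to_end(block)   # mark as most recently used
            return True
        self.misses += 1
        if len(self.lines) >= self.capacity:
            lowest = min(self.priority.get(b, 0) for b in self.lines)
            # First match in LRU order is the oldest block of lowest priority.
            victim = next(b for b in self.lines
                          if self.priority.get(b, 0) == lowest)
            del self.lines[victim]
        self.lines[block] = None
        return False


def count_misses(trace, capacity, priority):
    """Replay a block access trace and return the number of misses."""
    cache = PriorityLRUCache(capacity, priority)
    for block in trace:
        cache.access(block)
    return cache.misses


# Assumed trace: blocks 0 and 1 form a hot loop, blocks 10..13 are used once.
trace = [0, 1, 10, 11, 0, 1, 12, 13, 0, 1]
plain_lru_misses = count_misses(trace, 3, {})            # all priorities equal
priority_misses = count_misses(trace, 3, {0: 1, 1: 1})   # hot blocks favored
```

With an empty priority table the model degenerates to plain LRU, which makes the two runs directly comparable on the same trace.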

10:10, DISPS-L3.3
VARIABLE PARTITIONING FOR DUAL MEMORY BANK DSPS
R. LEUPERS, D. KOTTE
DSPs with dual memory banks offer high memory bandwidth, which is required for high-performance applications. However, such DSP architectures pose problems for C compilers, most of which cannot partition program variables between memory banks. As a consequence, time-consuming assembly programming is required for efficient coding of time-critical algorithms. This paper presents a new technique for automatic variable partitioning between memory banks in compilers, which leads to higher utilization of the available memory bandwidth in the generated machine code. We present experimental results obtained by integrating the proposed technique into an existing C compiler for the AMS Gepard, an industrial DSP core.
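The abstract does not describe the partitioning algorithm itself. As a rough illustration of the general problem, the sketch below uses a simple greedy heuristic that is an assumption here, not the authors' technique: each pair of variables carries a weight counting how often the two could be loaded in parallel, and each variable is placed in the bank that separates it from the partners it most often pairs with. The function names, bank labels 0 and 1, and the example weights are hypothetical.

```python
def partition_variables(pair_weights):
    """Greedily assign variables to bank 0 or bank 1.

    pair_weights maps (a, b) tuples to the number of times variables a and b
    could be accessed in the same cycle if they sat in different banks.
    """
    totals = {}
    for (a, b), w in pair_weights.items():
        totals[a] = totals.get(a, 0) + w
        totals[b] = totals.get(b, 0) + w
    # Place the most heavily constrained variables first (ties by name).
    order = sorted(totals, key=lambda v: (-totals[v], v))

    bank = {}
    for v in order:
        # cost[k] = parallel accesses lost if v joins bank k.
        cost = [0, 0]
        for (a, b), w in pair_weights.items():
            other = b if a == v else a if b == v else None
            if other in bank:
                cost[bank[other]] += w
        bank[v] = 0 if cost[0] <= cost[1] else 1
    return bank


def split_weight(pair_weights, bank):
    """Total weight of pairs whose variables ended up in different banks."""
    return sum(w for (a, b), w in pair_weights.items() if bank[a] != bank[b])


# Assumed example: 'a'/'b' and 'c'/'d' are often used together, 'a'/'c' rarely.
weights = {("a", "b"): 10, ("c", "d"): 8, ("a", "c"): 1}
banks = partition_variables(weights)
achieved = split_weight(weights, banks)
```

A greedy pass like this is not guaranteed to be optimal in general; it only shows how access pairs can drive the bank assignment.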

10:30, DISPS-L3.4
LOW POWER SHOWDOWN: COMPARISON OF FIVE DSP PLATFORMS IMPLEMENTING AN LPC SPEECH CODEC
D. HWANG, C. MITTELSTEADT, I. VERBAUWHEDE
An identical LPC speech coder has been implemented on a set of signal-processing-specific implementation platforms. The main goal of this experiment was to compare energy consumption. In addition, area/memory requirements and design time are compared. The coder was first designed in floating-point C. Then, the fixed-point word lengths were determined. Depending on the platform, either compiled code was generated, assembly code was written, or a Verilog/VHDL design was created. The platforms reported in this paper include the DSP processors TI C55x, TI C54x, and TI C6x and the design environments Ocapi and A|RT designer. Energy consumption ranges from 2 microJ to 288 microJ per speech frame. After scaling to the same technology, our results indicate that the lowest-power DSP processor (TI C55x) still consumes a factor of 4 more energy than an application-specific processor. Results of this experiment can be found at: www.ee.ucla.edu/~ingrid/ee213a/index.html

10:50, DISPS-L3.5
PERFORMANCE ANALYSIS OF LOW BIT RATE H.26L VIDEO ENCODER
A. HALLAPURO, V. LAPPALAINEN, T. HÄMÄLÄINEN
A new video encoder proposal, H.26L, is compared against H.263 and H.263+. The comparison analyzes both computational complexity and compression performance. Moreover, the possible trade-offs between complexity and compression performance within H.26L are presented. Experimental comparisons with H.263 and H.263+ show that H.26L reduces the output bit rate by about 30% at the same quality. The computation time increases roughly threefold compared to H.263, resulting in an encoding speed of 3-6 fps for QCIF sequences on a 400 MHz Pentium III processor. Real-time operation can be achieved by applying additional algorithmic and platform-specific optimizations.

11:10, DISPS-L3.6
INTEGER CODE GENERATION FOR THE TMS320C62X
M. COORS, H. KEDING, O. LUETHJE, H. MEYR
This paper presents a methodology that enables the generation of C62x-optimized fixed-point C code from a floating-point description. The FRIDGE design environment transforms floating-point ANSI-C code with local fixed-point annotations into an internal bit-true representation. From this representation we generate C62x-optimized integer C code using the code transformation techniques illustrated in this paper. A benchmark is presented comparing the efficiency of the generated code with C67x C code, C62x floating-point emulation, and generic integer ANSI-C code.
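To make the floating-point to integer mapping concrete, the sketch below emulates a generic 16-bit Q15 fixed-point format with plain integers. It is not FRIDGE output and not the paper's transformation; the Q15 format, the helper names, and the sample values are assumptions chosen only to show how a fractional value becomes an integer and how a product is rescaled.

```python
Q15_FRAC_BITS = 15
Q15_MIN = -(1 << Q15_FRAC_BITS)        # -32768 represents -1.0
Q15_MAX = (1 << Q15_FRAC_BITS) - 1     #  32767 represents just under +1.0


def to_q15(x):
    """Quantize a float in [-1, 1) to a saturated 16-bit Q15 integer."""
    scaled = round(x * (1 << Q15_FRAC_BITS))
    return max(Q15_MIN, min(Q15_MAX, scaled))


def from_q15(q):
    """Convert a Q15 integer back to a float."""
    return q / (1 << Q15_FRAC_BITS)


def mul_q15(a, b):
    """Multiply two Q15 values.

    The full product has 30 fractional bits, so it is shifted right by 15
    to return to Q15, then saturated (needed for -1.0 * -1.0).
    """
    product = (a * b) >> Q15_FRAC_BITS
    return max(Q15_MIN, min(Q15_MAX, product))


half = to_q15(0.5)             # 16384
minus_quarter = to_q15(-0.25)  # -8192
result = mul_q15(half, minus_quarter)
result_as_float = from_q15(result)   # -0.125
```

In generated integer code the word length and the position of the binary point would come from the fixed-point annotations rather than from a single global format as assumed here.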