

# DSP-BASED SIGNAL PROCESSING FOR OFDM TRANSMISSION

*M. Schöbinger, S. R. Meier*

Infineon Technologies AG,  
Corporate Development,  
D-81609 Munich, Germany.

[Matthias.Schoebinger, Stefan.R.Meier]@infineon.com

## ABSTRACT

A demonstrator for OFDM transmission based on a programmable DSP (TMS320C6201) is described. Even the rather moderate sampling rates realized, up to 10 Msamples/s, still pose a considerable challenge for state-of-the-art DSPs, both in terms of the required computational power and in terms of synchronizing the internal processing with the I/O interface to a real-time environment. It is illustrated that SW development under stringent resource constraints requires analysis and partitioning of the algorithms in a manner very similar to the mapping strategies necessary in an ASIC design for either cost-sensitive or extremely challenging applications. The demonstrator development therefore provides a sound basis for a subsequent design of such ASICs.

## 1. INTRODUCTION

Flexible, programmable implementations of transmission systems are desirable in an early phase of system development and standardization. However, real-time behavior is essential to allow a thorough characterization of the transmission channel. Even in the phase of market deployment, flexibility is mandatory in order to react to varying market requirements. With the appearance of powerful DSPs such as Texas Instruments' (TI) TMS320C6201, SW solutions using only a few DSPs have already become feasible for systems with sample rates in the 10 MHz range. However, apart from the more frequent problems with bottlenecks in memory and I/O bandwidth, full flexibility can only be preserved if the required computational power is well below the peak performance of the device.

For high-performance systems this requires careful analysis of the properties inherent in the algorithms, such as regularity and repeated access to the same data, in order to identify those algorithms which allow block processing and can be efficiently implemented as assembler routines. But even for the remaining algorithms, implementations with a small number of operations are mandatory (pure C programming is not possible). This approach resembles the strategy of choosing algorithms and architectures which are particularly suited for efficient silicon implementations in an ASIC design when silicon cost, power consumption and/or battery lifetime are a major concern [1]. Of course, due to technological progress subsequent DSP generations provide even higher performance levels. However, we are mainly interested in the relative performance requirements of the different functional units in order to pinpoint the optimization potential for a later HW-oriented implementation.

These similarities between the DSP approach and optimized chip design enable a seamless transfer of system and algorithmic IP from the SW solution to a HW/SW implementation based on a DSP core, which provides flexibility only where necessary, and optimized hardware accelerators for computation-intensive tasks. Implementation of such accelerators by means of parametrizable macros can already start in parallel with the SW development. In this way even full-custom implementations utilizing the powerful data-path generators now available are feasible despite stringent time-to-market requirements [2]. The essential prerequisite for this approach is the early consideration of HW implementation aspects.

We illustrate this approach with the example of a demonstrator for OFDM transmission based on TI's TMS320C6201. Due to the powerful peripherals and the 64 Kbyte on-chip memory of this device the DSP-specific overhead for memory and I/O transfers is almost negligible. Therefore, the partitioning of the algorithms is already meaningful for a later chip implementation. However, it is interesting to note that even for a system with only 10 MHz sampling rate the flexibility with respect to the choice of possible parameters is very restricted.

## 2. POTENTIAL OF OFDM

Among the effects which disturb high data-rate signals in transmission channels, multi-path propagation is the determining effect in radio propagation, whereas wire-line systems are dominated by dispersion effects. To counteract inter-symbol interference, either adaptive equalization has to be applied in a single-carrier system or the symbol duration should be chosen significantly longer than the channel impulse response. As an increased symbol duration reduces the achievable data rate, parallel transmission in a larger number of frequency-division sub-channels should be used [3,4]. With orthogonal sub-carriers the required signal processing in transmitter and receiver can be realized with a discrete Fourier transform (IDFT/DFT). In OFDM (Orthogonal Frequency Division Multiplex) based transmission systems the input data stream is serial-to-parallel converted in order to modulate N narrow-band carriers in parallel. In a frequency-selective fading environment, the number of sub-channels N is chosen in such a way that the fading process in each sub-channel can be considered as a flat fading process, i.e. not frequency-selective. The resulting block of N samples in the time domain is then extended with a cyclic prefix of the length of the maximum channel impulse response as a guard interval. As the symbol rate for each sub-channel is low, extending the symbols with the cyclic prefix should waste no more than ten percent of the bandwidth. This holds for typical channels with multi-path propagation or the interference present in common frequency networks, but not in cases where dispersion dominates the channel characteristics, as in xDSL systems. Because of the typically long channel impulse responses in dispersive channels, Cioffi et al. proposed the use of an additional time domain equalizer (TEQ) to shorten the channel impulse response so that it fits into the guard interval [5,6].

Fig. 1 Block Diagram of an OFDM Transceiver
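The effect of the cyclic prefix can be illustrated with a minimal sketch. It is not the demonstrator code: the block size, guard length, channel taps and all function names are illustrative assumptions, and a naive DFT stands in for the FFT. As long as the channel impulse response is no longer than the guard interval plus one sample, each received sub-carrier equals the transmitted symbol multiplied by one complex channel coefficient.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT/IDFT; a real system would use an FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def ofdm_modulate(symbols, n_guard):
    """IDFT of one block of sub-carrier symbols, then prepend the cyclic prefix."""
    block = dft(symbols, inverse=True)
    return block[len(block) - n_guard:] + block

def channel(tx, taps):
    """Linear convolution with the channel impulse response (truncated to len(tx))."""
    return [sum(taps[j] * tx[i - j] for j in range(len(taps)) if i - j >= 0)
            for i in range(len(tx))]

def ofdm_demodulate(rx, n, n_guard):
    """Drop the guard interval and transform back to the sub-carrier domain."""
    return dft(rx[n_guard:n_guard + n])

# Illustrative parameters (assumptions, not the demonstrator's values).
N, N_GUARD = 8, 3
qpsk = [1+1j, 1-1j, -1+1j, -1-1j, 1+1j, -1-1j, 1-1j, -1+1j]
taps = [1.0, 0.5, 0.25]          # channel memory fits into the guard interval

tx = ofdm_modulate(qpsk, N_GUARD)
received = ofdm_demodulate(channel(tx, taps), N, N_GUARD)

# Channel transfer function on the sub-carrier grid.
H = dft(taps + [0.0] * (N - len(taps)))
flat_per_carrier = all(abs(r - h * s) < 1e-9
                       for r, h, s in zip(received, H, qpsk))
```

In this sketch `flat_per_carrier` is true, which is the per-sub-carrier flat fading described above: the linear convolution with the channel acts like a circular convolution on the block.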

In both cases discussed (dispersive channels, typically with TEQ, as well as channels with multi-path propagation) the guard interval prevents inter-symbol interference. However, the notches in the transfer function (frequency domain) caused by the frequency-selective fading have to be compensated by an equalizer performing one complex-valued multiplication on each sample in the frequency domain at the output of the FFT in the receiver. In the special case of a simple m-PSK signal with differential encoding over the adjacent sub-carriers this equalization is not necessary [7] (Fig. 1).

Typical applications for multi-carrier transmission systems are terrestrial broadcasting (DAB / DVB-T), ADSL, wireless LANs (HIPERLAN / IEEE 802.11a), communication over power lines and possibly future broadband wireless access systems.

## 3. FLEXIBLE DEMONSTRATOR FOR OFDM TRANSMISSION

The main building blocks of an OFDM-based transmission system are illustrated in Fig. 1. To evaluate the performance of such a transmission system over channels that are not well known, a demonstrator has been developed. The blocks for OFDM signal processing enclosed by the broken lines are best suited for an implementation on a programmable DSP. IFFT/FFT, I/Q-mixing, I/Q-demodulation and the filters for interpolation and decimation by a factor $v$ are the most computation-intensive tasks. The filters are used to avoid aliasing due to the insufficiently attenuated side lobes appearing in OFDM transmission as well as to suppress out-of-band perturbations in the receiver. Filter requirements are considerable since the width of the transition region between pass band and stop band is directly related to the number of unusable carriers, which reduces the net data rate. The availability of the necessary assembler routines suggested an FIR filter implementation for the demonstrator. The A/D and D/A converters use a sampling clock of 10 MHz. An analog front-end unit contains low-pass filtering, some amplifiers and the receive/transmit switching. This partitioning allows the use of various external up/down-converters to experiment with diverse types of channels. Two FPGAs contain the MAC-layer functionality and the interfacing to the Reed-Solomon ICs (FEC). Two EPLDs interface the A/D and D/A converters to the DSP and pass several control signals between the DSP and the MAC-layer functionality in the FPGAs, especially for timing synchronization purposes.

As the demonstrator was planned to evaluate realization aspects for a cost-efficient solution, special care was taken to identify the possible bottlenecks in terms of cost:

- A/D conversion is limited by cost aspects to not more than 10 bit (8 bit effective) converters, which requires some care in limiting the peak-to-average ratio of the OFDM signal prior to DA conversion since zero-overhead clipping is not supported by the DSP.
- Channel equalization has to be as simple as possible or should be avoided. In the demonstrator QPSK is applied in conjunction with differential encoding over the sub-carriers which allows an operation without any channel equalization.
- Differential encoding also allows operation without carrier offset compensation.
- Repeating a reference OFDM-block frequently is sufficient to implement the clock-synchronization as a simple time tracking for the start of the reference symbol.
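The equalizer-free operation listed above can be sketched as follows. The bit-to-phase mapping, the reference sub-carrier value and the test channel are assumptions made for illustration only; the source states only that QPSK with differential encoding over the sub-carriers is used. Because the information sits in the phase step between adjacent sub-carriers, a channel whose phase changes slowly from one sub-carrier to the next does not disturb the decision.

```python
import cmath

# Assumed Gray mapping of bit pairs to phase steps in multiples of pi/2.
PHASE_STEP = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}
BITS_OF_STEP = {step: bits for bits, step in PHASE_STEP.items()}

def diff_encode(bit_pairs):
    """Encode each bit pair as a phase step relative to the previous sub-carrier."""
    carriers = [1 + 0j]                      # reference sub-carrier (assumption)
    for pair in bit_pairs:
        rotation = cmath.exp(1j * cmath.pi / 2 * PHASE_STEP[pair])
        carriers.append(carriers[-1] * rotation)
    return carriers

def diff_decode(carriers):
    """Recover bit pairs from the phase difference of adjacent sub-carriers."""
    pairs = []
    for prev, cur in zip(carriers, carriers[1:]):
        phase = cmath.phase(cur * prev.conjugate())
        step = round(phase / (cmath.pi / 2)) % 4
        pairs.append(BITS_OF_STEP[step])
    return pairs

bits = [(0, 0), (0, 1), (1, 1), (1, 0), (1, 1), (0, 1)]
sent = diff_encode(bits)

# Hypothetical channel: amplitude and phase drift slowly across the sub-carriers.
faded = [s * (0.5 + 0.1 * k) * cmath.exp(1j * (0.7 + 0.1 * k))
         for k, s in enumerate(sent)]

decoded = diff_decode(faded)                 # no equalizer applied
```

Here `decoded` matches `bits` even though neither the amplitude nor the absolute phase of the channel has been corrected. The scheme breaks down once the channel phase changes by more than π/4 between neighbouring sub-carriers.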

The complete OFDM digital signal processing is done on only two TMS320C6201 DSPs [8], one for transmitting (TX) and one for receiving (RX). We will show in the subsequent sections that this device supports a demonstrator transmitting and receiving a net data rate of 1 Mbit/s (2 Mbit/s including redundancy, signaling overhead and preambles) employing OFDM signals with QPSK-modulated sub-carriers in a 1 MHz band at a center frequency of 2.5 MHz.

## 4. DSP-BASED SIGNAL PROCESSING IN THE OFDM DEMONSTRATOR

The TMS320C6x family of 32-bit digital signal processors is based on Texas Instruments' VelociTI™ architecture, which is a very long instruction word (VLIW) architecture [8]. As far as the required computational power is concerned, integer multiplications are the most important operations for the presented demonstrator. Therefore, the throughput rate is essentially determined by the two parallel integer multipliers, each processing 16-bit half-words. For the 200 MHz version of the TMS320C6201 this results in a peak performance of $400 \cdot 10^6$ Mul/s. In addition, the 16-bit resolution of the multipliers determines the quantization noise of the implementation.

### 4.1 Data processing

Despite the high peak performance of the TMS320C6201, computational resources had to be used sparingly, as a sampling rate of 10 Msamples/s at the output of the interpolation filter and in the I/Q-mixing stage of the transmitter (and at the input of I/Q-demodulation and decimation filter in the receiver) is still a challenge. The use of optimized assembler routines is unavoidable to reach the required throughput rates in our demonstrator, even though the interpolation/decimation filter can be implemented by the well-known polyphase structure with a sampling rate of only $10/v$ MHz for each of the $v$ polyphase filters. Table 1 shows that for large block size $M$ both available FIR filter routines ($K$ taps each) approach the number of cycles expected if the CPU performs two multiplications (one in each data path) per cycle. However, these performance figures are achieved at the expense of a very restricted choice of parameters (number of filter coefficients in multiples of 4 or 8 for FIR4 and FIR8, respectively). In contrast, for a radix-2 FFT (bit reversal not included) the required number of cycles for large block size $N$ is twice the estimate obtained by assuming that one complex multiplication is realized by four real multiplications. This result is related to the less favorable number of memory accesses per sample compared to FIR filters.
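The polyphase structure mentioned above can be sketched in a few lines. This is a generic textbook formulation under assumed names and toy coefficients, not the TI assembler routine. It shows that interpolation by $v$ with a prototype FIR filter gives the same result as $v$ short sub-filters running at the low sampling rate, whose outputs are multiplexed into the high-rate stream.

```python
def fir(x, h):
    """Causal FIR filter with zero initial state."""
    return [sum(h[j] * x[n - j] for j in range(len(h)) if n - j >= 0)
            for n in range(len(x))]

def interpolate_direct(x, h, v):
    """Insert v-1 zeros after every sample, then filter at the high rate."""
    up = []
    for sample in x:
        up.append(sample)
        up.extend([0] * (v - 1))
    return fir(up, h)

def interpolate_polyphase(x, h, v):
    """Run the v sub-filters h[p::v] at the low rate and multiplex the results."""
    branches = [fir(x, h[p::v]) for p in range(v)]
    # High-rate output sample m*v + p is low-rate sample m of branch p.
    return [branches[p][m] for m in range(len(x)) for p in range(v)]

# Toy data (assumptions): v = 4 and an 8-tap prototype, i.e. 2 taps per branch.
v = 4
h = [1, 2, 3, 4, 4, 3, 2, 1]
x = [5, -1, 0, 7, 2, -3]

y_direct = interpolate_direct(x, h, v)
y_poly = interpolate_polyphase(x, h, v)
```

Both lists are identical, but the polyphase form never multiplies by the inserted zeros, so each sub-filter only has to run at $1/v$ of the output rate.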

|             | Number of cycles for optimized TI routines                                             | Estimated number of cycles based on multiplications only |
|-------------|----------------------------------------------------------------------------------------|----------------------------------------------------------|
| <b>FIR8</b> | $M \cdot K/2 + 13$<br>→ $O(M \cdot K/2)$                                               | $M \cdot K/2$                                            |
| <b>FIR4</b> | $M \cdot (K+8)/2 + 6$<br>→ $O(M \cdot K/2)$                                            | $M \cdot K/2$                                            |
| <b>FFT</b>  | $\mathrm{ld}(N) \cdot (4N/2 + 7) + 9 + N/4$<br>→ $O(\mathrm{ld}(N) \cdot 2 \cdot N)$   | $\mathrm{ld}(N) \cdot N$                                 |

Table 1. Number of clock cycles for TI's optimized assembler routines (16-bit operands); $\mathrm{ld}$ denotes the base-2 logarithm.
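The cycle counts of Table 1 can be evaluated numerically with the short sketch below. It reads the logarithm in the FFT row as base 2 and uses hypothetical function names. The last function is a rough asymptotic estimate that is not taken from the source: it assumes $2v$ real-valued polyphase branches (real and imaginary part), a 10 MHz high sampling rate, a 200 MHz CPU clock and two multiplications per cycle, and it ignores all per-call overhead.

```python
def fir8_cycles(m, k):
    """FIR8 row of Table 1: block size m, k taps (k a multiple of 8)."""
    return m * k // 2 + 13

def fir4_cycles(m, k):
    """FIR4 row of Table 1: block size m, k taps (k a multiple of 4)."""
    return m * (k + 8) // 2 + 6

def fft_cycles(n):
    """FFT row of Table 1 for a power-of-two block size n."""
    log2_n = n.bit_length() - 1
    return log2_n * (4 * n // 2 + 7) + 9 + n // 4

def fft_estimate(n):
    """Estimate from multiplications only: log2(n) * n cycles."""
    return (n.bit_length() - 1) * n

def polyphase_load(v, taps_per_branch, f_sample_hz=10e6, f_cpu_hz=200e6):
    """Asymptotic CPU share of 2*v real polyphase branches (overhead ignored)."""
    cycles_per_second = 2 * v * (f_sample_hz / v) * taps_per_branch / 2
    return cycles_per_second / f_cpu_hz

ratio_fft = fft_cycles(1024) / fft_estimate(1024)      # close to 2
overhead_fir8 = fir8_cycles(1000, 8) - 1000 * 8 // 2   # 13 cycles per call
load = polyphase_load(v=8, taps_per_branch=8)          # 0.4 under these assumptions
```

For $N = 1024$ the FFT routine needs 20815 cycles against an estimate of 10240, which reproduces the factor of two stated in the text, whereas the FIR8 routine is within a constant 13 cycles of its estimate.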

### 4.2 Data structure

Block processing and half-word processing can be exploited to reduce the memory bandwidth. Passing through the signal processing chain, complex data have to be transferred from a block buffer corresponding to the low sampling rate at the transmitter input (LR buffer) to a buffer corresponding to the high sampling rate (HR buffer) at the D/A converter, and vice versa in the receiver. The size of the LR buffer is $M = (N + N_{guard} + K - 1)$ 32-bit words (OFDM block size $N$, size of cyclic prefix $N_{guard}$, $K$ filter taps), where the $K-1$ additional complex samples are necessary due to the sliding-window processing of the filters. For an interpolation/decimation factor of $v$ the HR buffer comprises $v \cdot M$ 32-bit words. In order to avoid additional latencies due to memory conflicts between subsequent functional units, a block buffer is provided for every intermediate result. The 64 Kbyte of available on-chip data memory of the TMS320C6201 turned out to be sufficient to support this approach, so that the performance figures are not influenced by memory management problems.
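The buffer sizes follow directly from the expressions above. The sketch below evaluates them for one hypothetical parameter set ($N = 256$, $N_{guard} = 32$, $K = 8$, $v = 8$). These values and the function names are assumptions chosen for illustration, since the source does not list the exact configuration or the number of intermediate buffers.

```python
BYTES_PER_WORD = 4                  # one complex sample per 32-bit word
ON_CHIP_DATA_BYTES = 64 * 1024      # 64 Kbyte on-chip data memory

def lr_buffer_words(n, n_guard, k):
    """LR buffer: M = N + N_guard + K - 1 words."""
    return n + n_guard + k - 1

def hr_buffer_words(n, n_guard, k, v):
    """HR buffer: v * M words."""
    return v * lr_buffer_words(n, n_guard, k)

# Hypothetical parameters, not taken from the source.
n, n_guard, k, v = 256, 32, 8, 8

lr_bytes = lr_buffer_words(n, n_guard, k) * BYTES_PER_WORD
hr_bytes = hr_buffer_words(n, n_guard, k, v) * BYTES_PER_WORD
share_of_memory = (lr_bytes + hr_bytes) / ON_CHIP_DATA_BYTES
```

With these assumed values one LR buffer takes 1180 bytes and one HR buffer 9440 bytes, so a single LR/HR pair occupies roughly 16% of the on-chip data memory. Since a separate buffer is kept for every intermediate result, the total grows with the number of processing stages.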

Additional overhead results from the fact that the arrangement of input and output data of the available assembler routines is predetermined. Processing of complex data (IFFT/FFT, I/Q-mixing, ...) requires interleaving of the 16-bit real and imaginary parts in such a way that both parts are stored as half-words in the same 32-bit word. In contrast, the real and imaginary parts of the samples are treated separately during the interpolation process. Therefore, de-interleaving of real and imaginary parts is necessary prior to the filtering with the polyphase structure.

After the interpolation filter, the real and imaginary parts are arranged in $v$ blocks each, containing the results of one polyphase filter. These results have to be multiplexed into the final data stream. Moreover, interleaving of real and imaginary parts is again required prior to I/Q-mixing. Whereas de-interleaving has to be performed at the low sample rate, the interleaving has to be realized at the high data rate. This called for an additional assembler routine, which has been provided by TI's application experts for this project. This routine utilizes the CPU's capability of accessing two data words in parallel, i.e. sorting of $M$ complex samples is accomplished in $M$ cycles.

In the receiver de-interleaving of the complex data after I/Q-demodulation and de-multiplexing of the data stream into  $2v$  blocks of input data for the polyphase filters is the critical task.
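The data rearrangement described in this section can be sketched as follows. The placement of the real part in the upper half-word and the function names are assumptions. The source only states that both 16-bit parts share one 32-bit word for complex processing and are held in separate blocks for filtering.

```python
def pack_complex(re, im):
    """Store two signed 16-bit values as half-words of one 32-bit word.

    Assumption: real part in the upper half-word, imaginary part in the lower.
    """
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)

def to_signed16(half_word):
    """Interpret a 16-bit pattern as a two's-complement value."""
    return half_word - 0x10000 if half_word & 0x8000 else half_word

def unpack_complex(word):
    """Split a 32-bit word back into signed real and imaginary parts."""
    return to_signed16((word >> 16) & 0xFFFF), to_signed16(word & 0xFFFF)

def interleave(re_block, im_block):
    """R to C: merge separate real/imaginary blocks into packed complex words."""
    return [pack_complex(r, i) for r, i in zip(re_block, im_block)]

def deinterleave(words):
    """C to R: split packed complex words into separate real/imaginary blocks."""
    pairs = [unpack_complex(w) for w in words]
    return [p[0] for p in pairs], [p[1] for p in pairs]

re_block = [100, -200, 32767, -32768]
im_block = [-1, 0, 12345, -12345]

packed = interleave(re_block, im_block)
re_out, im_out = deinterleave(packed)
```

The round trip returns the original blocks, including the negative values. On the DSP the same sorting is done by dedicated routines, since at the high sample rate it is part of the critical path.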

### 4.3 DSP interfaces

Apart from the relatively high peak performance, the hardware peripherals are very important to resolve the I/O bottleneck with respect to interfacing to a real-time digital transmission system. The following key features of the TMS320C6201 became indispensable to relieve the CPU as much as possible of non-arithmetic tasks [9]:

- efficient memory controllers,
- a powerful DMA controller supporting four full channels plus one auxiliary channel in parallel, used for background data transfer between the on-chip peripherals and the internal data memory without CPU intervention,
- two efficient Multi Channel Buffered Serial Ports (MCBSP),
- an easy-to-handle Host Port Interface (HPI).

In the standard operation mode after startup and initialization, data is read in and written out via the MCBSPs, while the transfer between on-chip memory and MCBSPs is done over independent DMA channels. Meeting the real-time specifications is critical in the sense that under no circumstances may a sample be lost, especially at the 10 MHz sampling rate. The synchronization between the DSP and the FPGAs uses the FRAME-SYNC signals of the MCBSPs. As the OFDM symbol interpolated to the 10 Msamples/s rate contains more samples than an MCBSP frame can handle, an additional interrupt signal is used to mark the FRAME-SYNC.

## 5. RESULTS

For a decimation factor of $v=8$, resulting in a symbol rate of $10/v=1.25$ MBaud, Fig. 2 shows the distribution of the work load among the different functional units of the receiver, which is slightly more complex than the transmitter due to the differential decoding and the necessary merging of the v polyphase filter results to obtain the decimated data. Comparison with the expected number of cycles shows that the overhead due to memory conflicts is virtually negligible. 44% of the computational load is required for the polyphase filters of the decimation block. This is a consequence of the very restricted parameter space for the FIR filters. The minimum size of the polyphase filters is 4 or 8 taps (for FIR4 and FIR8, respectively), resulting in 4v or 8v filter coefficients, respectively. As already mentioned above, filter requirements are considerable. Therefore, 32 taps might not be adequate and 64 taps have been chosen. Note the overhead of about 13% for the required rearrangement of data before (complex to real data, C to R) and after filtering (R to C). The R to C sorting contains the necessary merging of the v polyphase filter results, which is realized as C code with only an acceptable loss of performance.

During verification of the prototype it turned out that the 8% of unused cycles (see Fig. 2) is not sufficient to guarantee a timely reaction of the CPU to interrupts generated by the DMA processes. Therefore, the transmission frequency was restricted to 2.5 MHz, in contrast to the original intention to allow an adjustment of the center frequency. In this special case the I/Q-demodulation requires only trivial multiplications of the received data stream by 0, +1, 0, -1. As a result, the input blocks for half of the polyphase filters are zero and filtering is not necessary in these cases. With these simplifications a sufficient reserve of the order of 30% unused cycles has been achieved.
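The simplification can be made concrete with a short sketch. With a 2.5 MHz carrier and 10 MHz sampling the carrier frequency is exactly a quarter of the sampling rate, so the mixing sequences contain only 0, +1 and -1. The sign convention (multiplication by $e^{-j\pi n/2}$) and the function names are assumptions. The sketch also counts how many of the $2v$ polyphase input blocks contain only zeros.

```python
COS_SEQ = (1, 0, -1, 0)    # cos(pi*n/2) for n = 0, 1, 2, 3
SIN_SEQ = (0, -1, 0, 1)    # -sin(pi*n/2); sign convention is an assumption

def iq_demodulate_fs4(samples):
    """Mix a real signal sampled at four times its carrier frequency to baseband."""
    i = [s * COS_SEQ[n % 4] for n, s in enumerate(samples)]
    q = [s * SIN_SEQ[n % 4] for n, s in enumerate(samples)]
    return i, q

def zero_input_blocks(i, q, v):
    """Count polyphase input blocks (v for I, v for Q) that are identically zero."""
    blocks = [i[p::v] for p in range(v)] + [q[p::v] for p in range(v)]
    return sum(1 for block in blocks if not any(block))

samples = list(range(1, 17))           # arbitrary non-zero test data
i_part, q_part = iq_demodulate_fs4(samples)
empty = zero_input_blocks(i_part, q_part, v=8)
```

For $v = 8$ the count is 8 out of 16 blocks: the odd-numbered branches of the in-phase path and the even-numbered branches of the quadrature path never see a non-zero sample, so the corresponding filters can be skipped.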

For the relevant block sizes N (128 ... 1024) the required number of cycles for the FFT is already close to the asymptotic performance illustrated in Table 1. It uses about 11% of the CPU resources including bit reversal. Apart from the logarithmic dependence of the FFT part on N, the distribution of the computational load is independent of N for a fixed data rate. However, with increasing N the latency inherent in OFDM transmission becomes larger and the required storage space also increases. In addition, the SNR degrades with increasing N due to quantization effects in the in-place calculation of the available FFT routine with 16-bit accuracy. However, for a block size below N=512, SNR values of order 40 dB are guaranteed.

Larger interpolation/decimation factors are not reasonable due to the decreasing data rate. A smaller v, resulting in a higher data rate, would be possible. For example, doubling the data rate with unchanged sub-carrier spacing requires doubling of N. The number of cycles for filtering is almost unchanged (twice the sample rate but only half the number of coefficients for the same percentage of unused carriers) and the cycle count for the FFT increases by only around 10%. However, in order to guarantee the independence of the results from memory management problems, an increase of the data rate was not considered due to the doubling of the required storage space. Verification of the prototypes was completed by connecting them via the analog interface over a characteristic channel with adjustable attenuation and with frequency-dependent attenuation emulating some typical notch situations (as caused by multi-path propagation). Such a connection was utilized to perform file transfers between two PCs connected over Ethernet interfaces to the demonstrators.

Fig. 2 Distribution of computational load in the receiver

## 6. CONCLUSION

A DSP-based demonstrator for OFDM transmission providing a net data rate of 1 Mbit/s at 10 MHz sampling rate has been presented. With this demonstrator early exploration of transmission characteristics was possible even though flexibility with respect to the choice of possible parameters is very restricted due to the large computational requirements. A partitioning exploiting the properties inherent in the algorithms was necessary to achieve the required performance level. This partitioning would be virtually identical for a succeeding ASIC implementation. In addition to the possible cost reduction such an ASIC solution will cover a broader range of data rates and optimized word lengths. Moreover, the overhead for bit manipulations (clipping, rounding, slicing) and sorting of data can be drastically reduced and alternative implementations of the different functional units (e. g. instead of the FIR filters) are possible.

## 7. ACKNOWLEDGEMENT

The authors would like to thank N. Bretzke and S. Kirmser for their support in programming the DSP.

## 8. REFERENCES

- [1] T. G. Noll, E. De Man, "Pushing the Performance Limits due to Power Dissipation of Future ULSI Chips", 1992 IEEE Int. Symp. on Circuits & Systems, pp. 1652-1655 (1992).
- [2] S. R. Meier, M. Schöbinger, "Efficient and Reusable Time-Sharing Architectures for Equalizer Structures", IEEE 2000 Custom Integrated Circuits Conference, pp. 477-480 (2000).
- [3] R. W. Chang et al., "A theoretical study of performance of an orthogonal multiplexing data transmission scheme", IEEE Trans. Comm., Vol. COM-16, pp. 524-540, Aug. 1968.
- [4] S. B. Weinstein et al., "Data transmission by frequency division multiplexing using the Discrete Fourier Transform", IEEE Trans. Comm., Vol. COM-19, pp. 628-634, Oct. 1971.
- [5] J. S. Chow, J. C. Tu, J. M. Cioffi, "A discrete multitone transceiver system for HDSL applications", IEEE J. Selected Areas in Communications, Vol. 9, No. 6, pp. 895-908 (1991).
- [6] J. S. Chow, J. C. Tu, J. M. Cioffi, "Performance Evaluation of a Multichannel Transceiver System for ADSL and VHDSL Services", IEEE J. Selected Areas in Communications, Vol. 9, No. 6, pp. 909-919 (1991).
- [7] European Telecommunications Standards Institute, "Inventory of broadband radio technologies and techniques", Technical Report, Broadband Radio Access Networks (BRAN), TR 101 173 V1.1.1 (1998-05).
- [8] N. Seshan, "High VelociTI Processing", IEEE Signal Processing Magazine (3), pp. 86-101, 117 (1998).
- [9] Texas Instruments, "TMS320C6201, DIGITAL SIGNAL PROCESSORS, Datasheet".

TI, VelociTI and 320 are trademarks of Texas Instruments Incorporated.