# < 30 mW RECTANGULAR-TO-POLAR CONVERSION PROCESSOR IN 802.11AD POLAR TRANSMITTER

Chunshu Li, Andre Bourdoux, Marian Verhelst, Yanxiang Huang, Min Li, Liesbet Van Der Perre, Sofie Pollin

IMEC, Kapeldreef-75, Leuven, B-3001, Belgium

### **ABSTRACT**

This paper presents an energy-efficient digital signal processor (DSP) for rectangular-to-polar conversion in an 802.11ad polar transmitter operating in the 60 GHz band. First, system simulations of a complete transmission chain are evaluated in terms of error vector magnitude and output spectrum, which allows the design requirements on the DSP block to be optimized systematically. Second, algorithm and architecture co-optimization of the DSP block is explored to minimize the power consumption. Finally, the proposed DSP is synthesized in $28\ \mathrm{nm}$ CMOS technology, providing a throughput of 7.04 giga-samples per second with a power consumption of $28\ \mathrm{mW}$ and an area of $0.01\ \mathrm{mm}^2$.

*Index Terms*— Polar transmitter, 802.11ad, rectangular-to-polar conversion, digital signal processing

### 1. INTRODUCTION

In contrast to the scarce spectrum available below 10 GHz, the mm-wave band, e.g., the 7 GHz of unlicensed bandwidth around 60 GHz, still offers a large amount of spectrum with little commercial usage. Standards such as 802.11ad [1] have been developed to exploit this wide spectrum and provide data rates of up to 6.75 Gbit/s in wireless personal area networks (WPANs).

Transmission at 60 GHz covers a much shorter distance for a given transmit power, mainly because of the high free-space path loss. Phased-array antennas are typically employed to overcome these losses [2]. However, the cost of the analog front-ends increases in proportion to the number of antenna paths. This drastically increases the power consumption, especially the share of the power amplifier (PA), which is the most power-hungry block. Improving the PA power efficiency is therefore critical to reducing the power cost of a 60 GHz transmitter. Most 60 GHz PAs operate in class-A linear mode [2]-[4] because of the variable-envelope modulations that are required for high data rates and high spectral efficiency. As a result, the typical PA power efficiency is below 5%, although record values of up to 30% have been achieved [2]. To improve the PA power efficiency, the PA needs to operate in its nonlinear region, where its efficiency peaks. The polar architecture is an attractive solution that allows the PA to operate in saturation without duplicating the signal path or using power combiners. As shown in Fig. 1, the phase (PH) signal goes to the PA, while the amplitude (AM) signal is applied to the PA through a separate modulation path and combined with the PH signal by modulating the supply.



Fig. 1. Polar transmitter

One issue in these supply-modulated polar systems is that the PA linearity depends on the supply voltage [5]. For CMOS devices, the voltage gain is usually a strong function of the drain-source potential, so the PA gain changes with the supply voltage, which contributes to AM-AM distortion. In addition, the voltage-dependent capacitances in active devices are highly nonlinear. The supply voltage changes the bias conditions of these capacitances, resulting in a PH shift that depends on the overall output impedance and causes AM-PM distortion.

In [6], we proposed a new transmitter architecture in which the polar concept is extended to the whole transmitter rather than only the radio frequency (RF) domain, as shown in Fig. 2. The polar conversion is performed with digital signal processing. The AM signal can then digitally modulate an RF digital-to-analog converter (DAC) that works as a variable-size PA. This avoids modulating the supply and also eliminates the need for an additional RF limiter and AM detection circuits, which would introduce extra nonlinearity and bandwidth limitations.

Although the digital polar transmitter has many advantages, the design challenges of the DSP need to be analyzed and tackled. For the 802.11ad application, the DSP has to run at a very high speed that depends on the required oversampling factor (OSF) in Fig. 2. In [7], we previously derived a 50 mW budget for the extra digital processing required by the polar operation in order to maintain the power advantage of polar transmission. This 50 mW DSP budget needs to cover I/Q to AM/PH conversion at 1760 × OSF M symbols per second (SPS).

Fig. 2. Block diagram of digital intensive 60 GHz polar transmitter

Motivated by the above analysis, this paper first conducts system simulations of a complete 802.11ad transmission chain to systematically optimize the design requirements on the DSP block, as presented in Section 2. Section 3 then presents the algorithm and architecture co-optimization of the DSP block to minimize the power consumption, and reports the synthesis results of the proposed DSP in $28\ \mathrm{nm}$ CMOS technology. The results show that the designed DSP provides 7.04 GSPS throughput with a power consumption of $28\ \mathrm{mW}$ and an area of $0.01\ \mathrm{mm}^2$. Section 4 concludes the paper.

### 2. OPTIMIZATION OF DSP DESIGN REQUIREMENTS

In this section, a complete 802.11ad transmission chain with QAM-16 modulation is modeled in Matlab to optimize the design requirements on the DSP conversion block, based on the error vector magnitude (EVM) and power spectral density (PSD) of the output signal. Specifically, the OSF in Fig. 2 and the quantization accuracy of the DSP conversion block are determined. The 802.11ad standard specifies an EVM of -21 dB for single-carrier QAM-16 modulation. To account for the variations in deeply scaled 28 nm CMOS technology, -31 dB is taken as the design goal, which leaves a design margin of 10 dB.

The polar signals can be mathematically computed from the cartesian signals as follows,

$$\begin{aligned} A(t) &= \sqrt{I(t)^2 + Q(t)^2}, \\ \phi(t) &= \arctan\left(\frac{Q(t)}{I(t)}\right). \end{aligned} \tag{1}$$
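As a floating-point reference for Eq. (1), the conversion can be sketched in a few lines of Python. This is only an illustrative model, not the hardware implementation discussed later; the function names are hypothetical, and `math.atan2` is used as an assumption so that the phase is resolved over all four quadrants, which the plain arctangent of Q/I does not do by itself.

```python
import math

def rect_to_polar(i, q):
    """Return (amplitude, phase in radians) for one I/Q sample."""
    amplitude = math.hypot(i, q)   # sqrt(i**2 + q**2)
    phase = math.atan2(q, i)       # four-quadrant version of arctan(q / i)
    return amplitude, phase

def polar_to_rect(amplitude, phase):
    """Inverse mapping, useful for checking the conversion."""
    return amplitude * math.cos(phase), amplitude * math.sin(phase)

# Round trip for a sample in the second quadrant.
am, ph = rect_to_polar(-1.0, 1.0)
i_back, q_back = polar_to_rect(am, ph)
```

The round trip recovers the original sample up to floating-point error, which makes this pair a convenient golden reference when checking a fixed-point or CORDIC-based model.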

The non-linear transformation from rectangular to polar signals broadens the signal spectrum. To avoid spectral overlap caused by this expansion, the rectangular signal needs to be upsampled and digitally filtered before it is converted to polar form. The first alias after oversampling appears at an offset equal to the sampling frequency. For a symbol rate of 1760 MSPS (according to the IEEE 802.11ad standard), an OSF of at least 6 is normally required to keep the first alias out of the 802.11ad RF band, which spans 57 GHz to 66 GHz [7]. This is a challenging task for the DSP circuitry, which would have to perform I/Q to AM/PH conversion at 10560 (1760 × 6) MSPS.

To relax this requirement, this work explores using an OSF of 4 and suppressing the alias below the spectrum mask with the help of the analog Butterworth baseband filter in the PH path of Fig. 2. The passband width and order of this analog baseband filter trade off output EVM against alias rejection: a wider passband or a lower filter order keeps more of the significant signal content in the transmitter output, which leads to a better EVM, but suppresses the alias less, which may violate the spectrum mask. Extensive simulations show that a passband width of 2 GHz and a filter order of 2 give a good tradeoff. Fig. 3 presents the EVM as a function of the AM and PH quantization accuracies after conversion, with 1) an OSF of 4, 2) a raised-cosine digital filter, 3) 7-bit input I/Q signals, and 4) a 2nd-order analog Butterworth filter with a 2 GHz passband width.



**Fig. 3**. EVM with different quantization accuracies of AM and PH with OSF of 4



Fig. 4. PSD of output signal with OSF of 4

Note that although several quantization accuracies in Fig. 3 achieve the -31 dB EVM, those with fewer AM bits should be chosen, because they ease the layout when routing the digital AM bit wires to the RF DAC.

The point at the corner of the curve (6-bit AM and 7-bit PH) gives the best tradeoff between AM and PH quantization accuracies. Simulation with this quantization accuracy shows -32 dB suppression of the first alias, which resides at 6.16 GHz. Fig. 4 shows the simulated in-band PSD of the output signal with the chosen bit resolution. Both the in-band PSD and the alias rejection comply with the spectrum mask.
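The effect of the AM and PH word lengths on EVM can be illustrated with a much simplified sketch. The model below is an assumption made purely for illustration: it uses Gaussian random samples instead of an 802.11ad waveform and omits oversampling, the raised-cosine filter and the analog Butterworth filter, so its numbers are not comparable to Fig. 3. All function names are hypothetical.

```python
import cmath
import math
import random

def quantize(value, bits, full_scale):
    """Round a value in [0, full_scale] to one of 2**bits uniform levels."""
    levels = (1 << bits) - 1
    code = round(value / full_scale * levels)
    return code * full_scale / levels

def polar_quantization_evm_db(samples, am_bits, ph_bits):
    """EVM (dB) caused only by quantizing amplitude and phase."""
    am_max = max(abs(s) for s in samples)
    ph_step = 2 * math.pi / (1 << ph_bits)   # phase wraps, so 2**bits steps
    err_power = 0.0
    ref_power = 0.0
    for s in samples:
        am = quantize(abs(s), am_bits, am_max)
        ph = round(cmath.phase(s) / ph_step) * ph_step
        reconstructed = cmath.rect(am, ph)
        err_power += abs(reconstructed - s) ** 2
        ref_power += abs(s) ** 2
    return 10 * math.log10(err_power / ref_power)

random.seed(0)
samples = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20000)]
evm_6_7 = polar_quantization_evm_db(samples, am_bits=6, ph_bits=7)
evm_3_3 = polar_quantization_evm_db(samples, am_bits=3, ph_bits=3)
```

Sweeping `am_bits` and `ph_bits` over a grid in this way gives a surface of the same kind as Fig. 3, from which a corner point can be picked.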

### 3. DESIGN AND IMPLEMENTATION OF DSP CONVERSION BLOCK

In this section, the details of the DSP conversion block, both the applied algorithm and implementation architecture, will be presented.

Eq. (1) involves several complex computations, e.g., square root, trigonometric functions and division. The COordinate Rotation DIgital Computer (CORDIC), first described in 1959 by Jack E. Volder [8], is an efficient algorithm for such computations. Its basic concept is to decompose the desired rotation angle into a sequence of predefined elementary rotation angles, each of which can be implemented with simple shift-and-add operations. CORDIC has three modes, circular, linear and hyperbolic, which handle trigonometric, division and square-root operations, respectively.



Fig. 5. Flow diagram of a circular CORDIC

The flow diagram of a circular CORDIC is shown in Fig. 5. To obtain AM and PH for polar conversion, the circular CORDIC works in vectoring mode. In this mode, the direction of each rotation ($d_i$ in Fig. 5) is determined by the sign of $y_i$: a positive $y_i$ leads to a clockwise $i$-th rotation and a negative $y_i$ to a counterclockwise one. In this way, the vector continuously approaches the positive x axis. The final $x_i$ value approximates the required AM signal, up to the constant CORDIC gain, and the sum of the rotation angles (the $z_i$ column in Fig. 5) gives the PH value.
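The vectoring-mode iteration described above can be sketched as follows. This is a floating-point illustration under stated assumptions: the hardware uses fixed-point shift-and-add operations, whereas here the multiplication by `2.0 ** -i` stands in for a right shift, the iteration count `n_iter` is a free parameter not taken from the paper, and the input is assumed to lie in the right half-plane (`x >= 0`).

```python
import math

def cordic_vectoring(x, y, n_iter=16):
    """Circular CORDIC, vectoring mode, for x >= 0.

    Returns (amplitude, phase) with the CORDIC gain removed.
    """
    z = 0.0
    gain = 1.0
    for i in range(n_iter):
        t = 2.0 ** -i                      # shift by i bits in hardware
        d = -1.0 if y >= 0 else 1.0        # positive y -> clockwise rotation
        x, y = x - d * y * t, y + d * x * t
        z -= d * math.atan(t)              # accumulate the rotated angle
        gain *= math.sqrt(1.0 + t * t)     # each micro-rotation stretches the vector
    return x / gain, z

amplitude, phase = cordic_vectoring(3.0, 4.0)
```

After `n_iter` iterations the residual angle is bounded by `atan(2 ** -(n_iter - 1))`, so each extra iteration adds roughly one bit of phase accuracy.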

For the explored DSP conversion block, the PH signal itself is not required; what is needed are the sine and cosine of the PH value. Four candidate methods for obtaining them are explored in this work.

Fig. 6 presents the first two candidate structures. Both use a circular CORDIC in vectoring mode to obtain the required AM signal and the PH signal for further calculation. Candidate 1 employs an additional circular CORDIC in rotation mode to obtain $(\sin\theta, \cos\theta)$. The working principle is shown in Fig. 7: whenever the vectoring CORDIC makes a rotation, the rotation-mode CORDIC, which starts from a unit vector, rotates by the same angle in the opposite direction. The rotation-mode CORDIC therefore ends up with the vector $(\sin\theta, \cos\theta)$. Since the PH signal is implicit in this structure and no longer needed explicitly, the z column in Fig. 5 can be removed from both CORDICs. This structure hence needs 4 calculation columns in total. Candidate 2 uses a look-up table (LUT), indexed by the PH value from the vectoring CORDIC, to generate the required $(\sin\theta, \cos\theta)$. With a 7-bit PH resolution, the LUT needs $2^7 \times 7 \times 2 = 1792$ bits. Besides the area cost, a bigger challenge for this LUT-based method is the required throughput of 7.04 GSPS (1.76 GSPS × 4). The read access time of a state-of-the-art ROM is reported to be 0.72 ns [9], whereas one sample period at 7.04 GSPS is only about 0.14 ns, so the ROM is far too slow for the design requirement.
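The principle of candidate 1 can be sketched with two coupled CORDICs that share the same rotation decisions. The sketch below is a floating-point illustration, not the implemented datapath: the iteration count, the pre-scaling of the unit vector by the inverse CORDIC gain, and the final gain correction of the amplitude are assumptions made for this example. The four state variables `x`, `y`, `u`, `v` correspond to the four calculation columns, and no angle accumulator is needed.

```python
import math

def cordic_gain(n_iter):
    """Product of the stretch factors of n_iter micro-rotations."""
    k = 1.0
    for i in range(n_iter):
        k *= math.sqrt(1.0 + 4.0 ** -i)
    return k

def candidate1(i_in, q_in, n_iter=16):
    """Return (am, cos_theta, sin_theta) for i_in >= 0."""
    k = cordic_gain(n_iter)
    x, y = i_in, q_in        # vectoring CORDIC, driven toward the +x axis
    u, v = 1.0 / k, 0.0      # rotation CORDIC, pre-scaled unit vector
    for n in range(n_iter):
        t = 2.0 ** -n
        s = 1.0 if y >= 0 else -1.0
        x, y = x + s * y * t, y - s * x * t   # rotate input by -s*atan(t)
        u, v = u - s * v * t, v + s * u * t   # rotate unit vector by +s*atan(t)
    return x / k, u, v

am, cos_theta, sin_theta = candidate1(3.0, 4.0)
```

Because the second CORDIC only mirrors the sign decisions of the first, both can advance in lockstep within the same pipeline stage.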



Fig. 6. (a) Candidate 1 (b) candidate 2 for polar conversion



**Fig. 7**. Working mechanism of candidate 1

Another two candidates for the conversion block are presented in Fig. 8. These two candidates use CORDIC to perform division. The linear CORDIC in vectoring mode transforms $(x, y, z)$ to $(x, 0, z + y/x)$, which can be used to compute a division. During the iterations, x remains fixed. Therefore, two calculation columns are needed to calculate either $\sin\theta$ or $\cos\theta$. In total, 6 calculation columns are needed for candidate 3. Compared to candidate 1, besides the extra 2 columns, another drawback is the longer latency to generate the final $(\sin\theta, \cos\theta)$, since the circular and linear CORDICs in candidate 3 have to work in series. Candidate 4 computes AM directly, using two 7-bit multiplications, a 14-bit addition and a hyperbolic CORDIC to calculate the square root. The complexity of candidate 4 is therefore higher than that of the others.

Fig. 8. (a) Candidate 3 (b) candidate 4 for polar conversion
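The division performed by the linear CORDIC in candidates 3 and 4 can be sketched as below. This is a floating-point illustration with assumed parameters: the iteration count and the choice to start the shift index at 0 (which gives a convergence range of |y/x| ≤ 2) are not taken from the paper. Only `y` and `z` change during the iterations, which corresponds to the two calculation columns per quotient.

```python
def linear_cordic_divide(y, x, n_iter=16):
    """Approximate y / x for x > 0 and |y / x| <= 2 (linear CORDIC, vectoring)."""
    z = 0.0
    for i in range(n_iter):
        t = 2.0 ** -i                  # shift by i bits in hardware
        d = 1.0 if y >= 0 else -1.0    # drive y toward zero
        y -= d * x * t
        z += d * t                     # z accumulates the quotient
    return z

sin_theta = linear_cordic_divide(4.0, 5.0)   # Q / AM
cos_theta = linear_cordic_divide(3.0, 5.0)   # I / AM
```

Since the divisor here is the amplitude produced by the circular CORDIC, the division can only start once that result is available, which is the source of the extra latency noted above.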



**Fig. 9.** Pipeline scheme with mixed register and latch sequential elements

Based on the above analysis, the first candidate is chosen for implementation. To achieve the 7.04 GSPS throughput, the implemented DSP conversion block has four parallel paths, each operating at 1.76 GSPS. Each path implements the 4 calculation columns of the candidate-1 architecture as a pipeline in which each stage performs one rotation. All input symbols are first folded into the first domain at the head of the pipeline and unfolded at its end. To reduce the area, power and timing cost of the sequential logic in the pipeline, level-triggered latches are used for the internal sequential elements. To avoid the hold-time failures common in latch-based designs, adjacent latches are triggered at opposite clock levels, as shown in Fig. 9.
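The folding and unfolding steps can be illustrated with a short sketch. It assumes that the "first domain" refers to the first quadrant, which the paper does not state explicitly, and it uses library functions as a stand-in for the CORDIC core; all names are hypothetical.

```python
import math

def fold(i, q):
    """Map (i, q) into the first quadrant and remember the original signs."""
    return abs(i), abs(q), i < 0, q < 0

def unfold(cos_t, sin_t, i_neg, q_neg):
    """Restore the signs of cosine and sine for the original quadrant."""
    return (-cos_t if i_neg else cos_t), (-sin_t if q_neg else sin_t)

def convert(i, q):
    """Fold, convert in the first quadrant, then unfold."""
    fi, fq, i_neg, q_neg = fold(i, q)
    am = math.hypot(fi, fq)            # stand-in for the CORDIC core
    theta = math.atan2(fq, fi)
    cos_t, sin_t = unfold(math.cos(theta), math.sin(theta), i_neg, q_neg)
    return am, cos_t, sin_t

am, cos_t, sin_t = convert(-3.0, 4.0)
```

Folding leaves the amplitude unchanged and only affects the signs of the sine and cosine outputs, so the two sign flags are the only information that has to travel alongside the data through the pipeline.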

In addition, this two-phase latch scheme allows maximum time borrowing: data can depart the first latch on the rising/falling edge of the clock but does not have to set up until the falling/rising edge of the clock at the receiving latch. If one half-cycle or pipeline stage has too much logic, it can borrow time from the next half-cycle or stage, and time borrowing can accumulate across multiple cycles. Time borrowing has two benefits, especially for this high-speed DSP in deeply scaled ($28\ \mathrm{nm}$) technology: 1.) It shortens the design time, because stage balancing takes place automatically rather than requiring changes to the pipeline architecture to explicitly move functions from one stage to another. This is especially useful for timing convergence in the designed 1.76 GSPS pipeline, in which each stage is constrained to half a cycle (3.52 GHz). The synthesis results show that, with 10 ps of time borrowing allowed per stage, timing converges easily.

2.) The other benefit is opportunistic time borrowing. Even if the pipeline is carefully balanced at design time, the delay of each stage can vary in the fabricated chip because of process and environmental variations, which become more severe in smaller technology nodes. In a system capable of time borrowing, a slow stage can opportunistically borrow time from faster ones and average out some of the variation.
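How borrowed time propagates and accumulates through a latch pipeline can be sketched with a simple first-order model. The model is an assumption made for illustration: it ignores setup time, clock skew and latch propagation delay, and the stage delays in the example are invented numbers, not synthesis results. Only the half-cycle budget (one period of 3.52 GHz) and the 10 ps borrowing limit come from the text.

```python
HALF_CYCLE_PS = 1e6 / 3520.0   # one period of 3.52 GHz, about 284.1 ps
MAX_BORROW_PS = 10.0           # borrowing allowed per stage

def check_time_borrowing(stage_delays_ps, half_cycle_ps, max_borrow_ps):
    """Return the time borrowed after each stage, or None on a violation."""
    borrowed = 0.0
    trace = []
    for delay in stage_delays_ps:
        # Data that left the previous latch late eats into this stage's
        # budget; any overshoot is borrowed from the following stage.
        borrowed = max(0.0, borrowed + delay - half_cycle_ps)
        if borrowed > max_borrow_ps:
            return None
        trace.append(borrowed)
    return trace

# A slow first stage is absorbed by the faster stages that follow it.
ok_trace = check_time_borrowing([290.0, 280.0, 282.0], HALF_CYCLE_PS, MAX_BORROW_PS)
# Two slow stages in a row accumulate more borrowing than allowed.
bad_trace = check_time_borrowing([292.0, 292.0], HALF_CYCLE_PS, MAX_BORROW_PS)
```

With edge-triggered registers the first example would already fail, since its first stage exceeds the half-cycle budget on its own.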

The sequential elements at the head and end of the pipeline still use registers, which simplifies timing analysis when the block is combined with external logic.

Synthesis using Cadence RTL Compiler and TSMC 28 nm technology reports an area of $0.01\ \mathrm{mm}^2$ for the proposed design. Netlist simulation with realistic inputs reports a power consumption of 28 mW, including logic and clock tree. As no other polar conversion block operating at a comparable speed is available as a benchmark, Table 1 compares related works operating at lower frequencies using the normalized energy $E_{norm}$, which represents the energy cost of converting each bit for each MHz. For a fair comparison, all energy numbers are normalized to $28\ \mathrm{nm}$ technology. The comparison shows that the proposed design has the lowest normalized energy, 0.57 pJ/bit/MHz versus 3.47 to 9.26 pJ/bit/MHz for the related works.

Table 1. Performance comparisons

|                          | [10]         | [11]         | [12]   | This work |
|--------------------------|--------------|--------------|--------|-----------|
| Technology               | $0.25~\mu m$ | $0.25~\mu m$ | 45 nm  | 28 nm     |
| Input bit width          | 14           | 13           | 16     | 7         |
| Max. freq. (MHz)         | 406          | 430          | 1140   | 7040      |
| Power (mW)               | 470          | 276          | 101.63 | 28        |
| $E_{norm}$ (pJ/bit/MHz)  | 9.26         | 5.51         | 3.47   | 0.57      |
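The $E_{norm}$ row can be approximately reproduced from the other rows of Table 1. The sketch below assumes that the technology normalization scales energy linearly with feature size; the paper does not state the scaling rule, so this is an inference from the tabulated numbers, and the function name is hypothetical. Under this assumption the values for [10], [12] and this work match the table after rounding, while [11] comes out at about 5.53 instead of the tabulated 5.51.

```python
def e_norm_pj(power_mw, freq_mhz, bits, node_nm, ref_node_nm=28.0):
    """Energy per bit per conversion in pJ, scaled to the reference node.

    Assumption: energy scales linearly with feature size.
    """
    energy_pj = power_mw / (freq_mhz * bits) * 1e3   # mW / MHz = nJ, then to pJ
    return energy_pj * ref_node_nm / node_nm

table = {
    "[10]": e_norm_pj(470.0, 406.0, 14, 250.0),
    "[11]": e_norm_pj(276.0, 430.0, 13, 250.0),
    "[12]": e_norm_pj(101.63, 1140.0, 16, 45.0),
    "This work": e_norm_pj(28.0, 7040.0, 7, 28.0),
}
```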

### 4. CONCLUSIONS

An area- and energy-efficient architecture for rectangular-to-polar conversion in an 802.11ad polar transmitter is presented in this paper. Systematic optimizations are first explored to minimize the design requirements on the DSP conversion block. An efficient latch-based pipelined CORDIC is then studied to provide the required 7.04 GSPS throughput within 30 mW. The synthesis results compare favorably with previously reported architectures.

### 5. REFERENCES

- [1] IEEE Std  $802.11ad^{TM}$ -2012.
- [2] V. Vidojkovic, et al., "A low-power radio chipset in 40nm LP CMOS with beamforming for 60GHz high-data-rate wireless communication," *In Solid-State Circuits Conference Digest of Technical Papers (ISSCC)*, 2013, IEEE International, pages 236-237 (2013).
- [3] W. Chen, et al., "A 60GHz-Band 2×2 Phased-Array Transmitter in 65nm CMOS," *In Solid-State Circuits Conference Digest of Technical Papers (ISSCC)*, 2013, IEEE International, pages 42-43 (2010).
- [4] M. Nariman, et al., "A Compact Millimeter-Wave Energy Transmission System for Wireless Applications," *RFIC Symposium*, pp. 407-410, Jun. 2013.
- [5] B. Razavi, "RF Microelectronics 2E," *Prentice Hall*, ISBN-13: 978-0137134731, 2011.
- [6] K. Khalaf, et al., "A digitally modulated 60GHz polar transmitter in 40nm CMOS," *RFIC Symposium*, pp. 159-162, Jun. 2014.
- [7] C. Li, et al., "Opportunities and Challenges of Digital Signal Processing in Deeply Technology-Scaled Transceivers," *Journal of Signal Processing Systems*, August, 2014.
- [8] J.E. Volder, "The CORDIC Trigonometric Computing Technique," *Electronic Computers, IRE Transactions on*, pp. 330-334, 1959.
- [9] Y. Umemoto, et al., "28 nm 50% power-reducing contacted mask read only memory macro with 0.72-ns read access time using 2T pair bitcell and dynamic column source bias control technique," *Very Large Scale Integration Systems, IEEE Transactions on*, vol. 22, no. 3, pp. 575-584, 2014.
- [10] D.D. Hwang, et al., "A 400-MHz processor for the conversion of rectangular to polar coordinates in 0.25-μm CMOS," *Solid-State Circuits, IEEE Journal of*, vol. 38, pp. 1771-1775, 2003.
- [11] A.G.M. Strollo, et al., "A 430 MHz, 280 mW Processor for the Conversion of Cartesian to Polar Coordinates in 0.25 μm CMOS," *Solid-State Circuits, IEEE Journal of*, vol. 43, pp. 2503-2513, 2008.
- [12] Z. Bi, et al., "Full custom datapath of 16-bit CORDIC," *Advanced Computational Intelligence (ICACI), 2012 IEEE Fifth International Conference on*, pp. 993-998, 2012.