



# SEMICONDUCTOR IP CORE FOR ULTRA LOW POWER MPEG-4 VIDEO DECODE IN SYSTEM-ON-SILICON

*J. Dunlop<sup>1</sup>, A. Simpson<sup>1</sup>, S. Masud<sup>2</sup>, M. Wylie<sup>1</sup>, J. Cochrane<sup>1</sup>, R. Kinkead<sup>1</sup>*

<sup>1</sup>Amphion Semiconductor Ltd, 50 Malone Road, Belfast BT9 5BS, Northern Ireland, UK

<sup>2</sup>Dept of Computer Science, Lahore University of Management Sciences,  
Lahore Cantt. 54792, Pakistan\*

## ABSTRACT

An ultra-low-power semiconductor intellectual property core for MPEG-4, based on a hardware-accelerated architecture, has been developed. It covers the simple profile of the video decoding algorithm and can provide motion-picture-quality video at up to CIF resolution. The implementation is based on hardware acceleration of the compute-intensive operations, with an embedded RISC processor acting purely as a host controller. The architecture comprises custom hardware designs for lookup-table decoders, bit-stream parsing, discrete cosine transforms, motion compensation and colour-space conversion. The hardware-software co-design approach results in high efficiency in both area and performance. The design has been validated on an FPGA-based development board with an LCD panel for visual demonstration of real-time decoded streaming video sequences. The MPEG-4 video decoder core has been ported to 130 nm ASIC technology using system-level integration techniques, where the power dissipation is around 10 mW. The design is ideally suited to high-volume system-on-chip solutions for a wide range of wireless multimedia communication applications.

## 1.0 INTRODUCTION

The upcoming MPEG-4 standard [1] encompasses a range of features and complex bit-streams. A hardware implementation of this multimedia standard is compute-intensive because it involves a number of coding algorithms, profiles and bit-rates. A hardware-software co-design approach therefore becomes a necessity to provide the flexibility, control and speed-up required.

It has been shown that an efficient implementation of MPEG-4 has to rely on extending the instruction set of a general-purpose RISC processor and adding hardware acceleration [2, 3]. The hardware acceleration modules can be designed to tackle the most compute-intensive tasks, leaving the processor free to manage data flow and memory reads and writes. Several of the hardware acceleration modules have to be used concurrently to achieve real-time video performance.

The DSP chips and RISC processors available on the market generally target a narrow segment of applications. Some have been optimized for audio, video or image processing, but they become inefficient or waste resources on general-purpose tasks such as bit-stream parsing. The exemplar implementation presented in this paper is based on the ARCTangent processor [4]. It was chosen for this design because it can be configured for various hardware and software options, and the synthesisable code contains only the useful parts of the processor with appropriate bus-widths. The accelerator modules presented can, however, be used with a range of processor cores.

Low development cost, low power consumption, flexibility and upgradeability are the most desirable characteristics of a mobile multimedia system. The MPEG-4 core presented here has been designed to offer an efficient solution in line with these requirements. This video decoder is ideal for wireless applications that require real-time video decode and demand a power- and area-efficient solution. It also provides advantages for high-end solutions such as video conferencing and Internet video streaming. The core is available in both ASIC and FPGA versions that have been handcrafted to deliver full functionality within minimal system resources.

The rest of the paper describes the development of the core and the resultant performance metrics. The architecture of the system, along with the accelerator cores developed, is described in section 2. Section 3 presents the implementation results, including a demonstrator on a Xilinx Virtex device. This is followed by conclusions in section 4.

\* The entire work was done at Amphion Semiconductor Ltd., 50 Malone Road, Belfast, UK

## 2.0 DECODER ARCHITECTURE

The **CS6750** MPEG-4 Decoder is a highly integrated application-specific silicon core for low bit-rate (up to 384 kbps) video decode. It is fully compliant with **ISO/IEC14496-2** Video simple profile levels 0 through 3, with integrated error resilience and image post-processing. The CS6750 is designed to handle standalone MPEG-4 decoding functions and thus free the remaining system for other tasks such as transport layer control, audio and LCD control. The CS6750 accepts pure **ISO/IEC14496-2** video elementary streams, for up to four individually signalled streams, at bit-rates reaching 384 kbps. The decoder can operate stand-alone and decode video initialization data from the input stream. If multiple video objects are required, an external host interface is provided to enable video object and layer configuration.

The output YCbCr streams can either be provided as raw streams from the decode process or pass through optional post-processing stages such as de-blocking, de-ringing and RGB colour-space conversion. The schematic of the system is shown in Figure 1.



Figure 1: Schematic of Amphion's MPEG-4 Video Decoder System-on-Silicon

The individual blocks of the architecture are described below:

### 2.1 Controller

The central controller is responsible for directing the use of the decoded input video stream, feeding the pixel generation block, performing motion compensation, controlling object display and supporting the MPEG-4 error resilience tools.

It is supported by Code and Data RAM, and its executable is downloaded via a host controller or state machine on power-up. Enhanced code builds can therefore be created for the decoder even after the silicon has been committed. The sequencer code can be highly optimized and closely coupled to the hardware design, and requested feature updates can be applied easily.

### 2.2 Video Bitstream Processing

The MPEG-4 video bitstream is handled by this block. Bit-aligned operations, along with Variable Length Code (VLC) decoding, are completed in a minimum number of cycles to support the video decode operation. Traditionally these tasks are awkward and cycle-consuming for software methods but ideal for an optimized hardware solution. This block can also be programmed with selected buffer fill levels at which it signals or wakes the rest of the core. Error logging enables the core to continue consuming codes until a suitable point is reached at which to verify whether a problem has arisen. A major advantage of hardware-based bit-parsing and pixel generation is that the worst-case performance requirements can be predicted accurately.

The VLC table-decode instruction operates in two clock cycles if data is available in the bit-to-byte buffer, which parses the bit-stream. A block-decode instruction is provided which extracts an entire 8x8 (64-coefficient) DCT block from the bit-stream. This improves the efficiency of the design by eliminating the multiple instructions otherwise needed to decode one DCT value at a time. A de-multiplexer interface is provided in this block which allows decoding of multiple streams. Up to four objects are feasible within Simple Profile.
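The table-driven decode described above can be illustrated in software. The following is a minimal sketch only, assuming a toy four-symbol prefix code (`TOY_CODES` is hypothetical and is not an MPEG-4 table) and a byte buffer standing in for the bit-to-byte buffer; it does not describe the CS6750 hardware itself. The idea is to peek a fixed number of bits, use them as an index into a lookup table, and consume only as many bits as the matched code is long.

```python
class BitReader:
    """MSB-first bit reader over a bytes object (stand-in for the bit-to-byte buffer)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, in bits

    def peek(self, n: int) -> int:
        """Return the next n bits without consuming them (zero-padded past the end)."""
        value = 0
        for i in range(n):
            bit_index = self.pos + i
            byte_index = bit_index >> 3
            bit = 0
            if byte_index < len(self.data):
                bit = (self.data[byte_index] >> (7 - (bit_index & 7))) & 1
            value = (value << 1) | bit
        return value

    def skip(self, n: int) -> None:
        """Consume n bits."""
        self.pos += n


# Hypothetical toy prefix code for illustration only (NOT an MPEG-4 VLC table).
TOY_CODES = {"1": "A", "01": "B", "001": "C", "000": "D"}
MAX_LEN = 3  # length of the longest code word


def build_lut(codes, max_len):
    """Expand each code word into every max_len-bit index that starts with it."""
    lut = [None] * (1 << max_len)
    for code, symbol in codes.items():
        pad = max_len - len(code)
        base = int(code, 2) << pad
        for fill in range(1 << pad):
            lut[base + fill] = (symbol, len(code))
    return lut


def decode_symbols(data: bytes, count: int):
    """Decode `count` symbols: one table lookup and one skip per symbol."""
    reader = BitReader(data)
    lut = build_lut(TOY_CODES, MAX_LEN)
    symbols = []
    for _ in range(count):
        entry = lut[reader.peek(MAX_LEN)]
        if entry is None:
            raise ValueError("invalid code word in bit-stream")
        symbol, length = entry
        reader.skip(length)
        symbols.append(symbol)
    return symbols
```

For example, the bit pattern `1 01 001 000 1` packed into the bytes `0xA4 0x40` decodes to the symbols A, B, C, D, A. A block-decode operation would, in the same spirit, repeat this lookup until a whole block of coefficients has been extracted.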

### 2.3 Pixel Generation

This high-performance unit performs the Inverse Scan, Inverse Quantisation and Inverse DCT for the six 8x8 DCT-encoded pixel blocks (4xY, Cr and Cb) of an Intra or Difference macroblock. The unit is capable of streaming data through continuously, scanning, multiplying and transforming an entire macroblock of 6x8x8 DCT coefficients into a 6x8x8 block of pixel samples. The IDCT values are produced continuously, one per clock cycle.
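The three stages named above can be sketched as a software reference model for one 8x8 block. This is an illustration under stated assumptions, not the hardware data path: the inverse quantiser is reduced to a multiplication by a single hypothetical step size `step` (the quantiser defined in the standard is more involved), the scan is the classic zigzag order, and the IDCT is the direct floating-point form rather than a fast or fixed-point one.

```python
import math

N = 8


def zigzag_order(n=N):
    """Return the (row, col) positions in classic zigzag scan order."""
    order = []
    for s in range(2 * n - 1):
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even anti-diagonals are walked bottom-left to top-right, odd ones the other way.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def inverse_scan(levels):
    """Place 64 scanned values back into an 8x8 block."""
    block = [[0] * N for _ in range(N)]
    for value, (r, c) in zip(levels, zigzag_order()):
        block[r][c] = value
    return block


def inverse_quantise(block, step):
    """Simplified placeholder quantiser: multiply every level by one step size."""
    return [[level * step for level in row] for row in block]


def idct_8x8(coeffs):
    """Direct 2-D inverse DCT of an 8x8 coefficient block."""
    def scale(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            total = 0.0
            for u in range(N):
                for v in range(N):
                    total += (scale(u) * scale(v) * coeffs[u][v]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[x][y] = total / 4.0
    return out


def reconstruct_block(levels, step):
    """Inverse scan, inverse quantise and inverse transform one block."""
    samples = idct_8x8(inverse_quantise(inverse_scan(levels), step))
    return [[int(round(s)) for s in row] for row in samples]
```

As a sanity check, a block containing only a DC level of 100 with a step size of 8 reconstructs to a flat block in which every sample equals 100, since the DC term contributes one eighth of its dequantised value to each sample.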

### 2.4 Motion Compensation

This module has been designed primarily to cope with the data movement demands of the motion compensation stage. Control software receives motion vectors from the bitstream parsing unit described in section 2.2 and calculates a final motion vector for each macroblock. The hardware core uses these final motion vectors to read pixel values from the reference frame, interpolates them, adds the difference-block values obtained directly from the Pixel Generation core, clips the result and finally writes it back to the current frame. The module handles all Simple profile macroblock types, and the control software typically does not wait for the end of this operation. Configuration information for this module is set through a simple register interface. Bus mastering capability is integrated to enable stand-alone operation, and an optional direct SDRAM interface can be integrated for some applications. Currently the module handles all MPEG-4 Simple profile tools, but the hardware design allows the integration of higher profile tools such as those of the Advanced Simple profile.
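The read, interpolate, add and clip sequence described above can be modelled in a few lines. The sketch below assumes half-sample motion vectors with bilinear interpolation and a `rounding` control bit, and it clamps reads at the frame edge as a simplification; the function names and the edge handling are illustrative assumptions and are not taken from the CS6750 design.

```python
def clip8(value):
    """Clamp a value to the 8-bit pixel range."""
    return max(0, min(255, value))


def predict_block(ref, x0, y0, mv_x, mv_y, size=8, rounding=0):
    """Fetch a size x size prediction from `ref` using a half-sample motion vector.

    mv_x and mv_y are in half-sample units; reads outside the frame are
    clamped to the nearest edge pixel (a simplification for this sketch).
    """
    height, width = len(ref), len(ref[0])
    ix, iy = x0 + (mv_x >> 1), y0 + (mv_y >> 1)  # integer part (floor)
    hx, hy = mv_x & 1, mv_y & 1                  # half-sample flags

    def px(r, c):
        return ref[min(max(r, 0), height - 1)][min(max(c, 0), width - 1)]

    block = []
    for r in range(size):
        row = []
        for c in range(size):
            a = px(iy + r, ix + c)
            if not hx and not hy:
                value = a
            elif hx and not hy:
                value = (a + px(iy + r, ix + c + 1) + 1 - rounding) >> 1
            elif hy and not hx:
                value = (a + px(iy + r + 1, ix + c) + 1 - rounding) >> 1
            else:
                value = (a + px(iy + r, ix + c + 1) + px(iy + r + 1, ix + c)
                         + px(iy + r + 1, ix + c + 1) + 2 - rounding) >> 2
            row.append(value)
        block.append(row)
    return block


def reconstruct(pred, residual):
    """Add the difference block to the prediction and clip to 8 bits."""
    return [[clip8(p + d) for p, d in zip(p_row, d_row)]
            for p_row, d_row in zip(pred, residual)]
```

With a small reference frame `ref[r][c] = 40*r + 10*c`, a vector of one half-sample in x yields averaged neighbours such as `(0 + 10 + 1) >> 1 = 5`, and the clipping step keeps sums such as `250 + 10` within the 8-bit range.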

### 2.5 Post-Processing

This Post-Processing hardware accelerator performs de-blocking and de-ringing algorithms based on those outlined in Appendix F of the **ISO/IEC14496-2** specification. These Post-Processing algorithms have been optimized for a hardware solution and are selectively enabled per video object. The decoder architecture ensures that enabling both levels of post processing has minimal effect on the clock rate requirements of the decoder.

Once pixel data has been smoothed, it is sent through the Colour-Space conversion block and if necessary, to the picture out port.

### 2.6 Colour-Space Conversion and Output Control

Colour-Space conversion performs linear transformation from YCbCr to RGB. The core uses 10-bit coefficients for all calculations with rounding carried out at the output to maintain the 8-bit accuracy for pixels.
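A fixed-point conversion of this kind can be sketched as follows. The paper does not give the coefficient values, so the sketch assumes the widely used video-range ITU-R BT.601 matrix scaled by 256; the resulting integer coefficients (298, 409, 100, 208, 516) all fit within 10 bits, and rounding is applied once at the output before clipping to 8 bits. The function name and the choice of matrix are assumptions for illustration.

```python
def clip8(value):
    """Clamp a value to the 8-bit pixel range."""
    return max(0, min(255, value))


def ycbcr_to_rgb(y, cb, cr):
    """Convert one video-range BT.601 YCbCr sample to 8-bit RGB.

    Coefficients are the usual matrix entries scaled by 256 (assumed values,
    each fitting in 10 bits); adding 128 before the shift rounds to nearest.
    """
    c = y - 16
    d = cb - 128
    e = cr - 128
    r = (298 * c + 409 * e + 128) >> 8
    g = (298 * c - 100 * d - 208 * e + 128) >> 8
    b = (298 * c + 516 * d + 128) >> 8
    return clip8(r), clip8(g), clip8(b)
```

Under these assumptions, video black `(16, 128, 128)` maps to `(0, 0, 0)`, video white `(235, 128, 128)` maps to `(255, 255, 255)`, and small negative intermediate values are removed by the final clip.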

Since the decoder is used in a wide range of system configurations, the picture frames are sent to a relatively simple but flexible interface for ease of portability. The order of objects to be displayed can be selected or implied by the object identities. Once fully decoded, the complete frame is burst out through the port, while the signal-acknowledge handshake enables the receiving system to control the data display rate.

Context information identifies the object currently being shown, and type information identifies the pixel component (Y, Cb, Cr or R, G, B). Row-complete and end-of-frame outputs are provided for interfacing.

## 3.0 IMPLEMENTATION RESULTS

The design of the hardware accelerators was captured in Verilog. Synopsys tools were used to synthesize the complete design, including all accelerator blocks, the ARC processor and memory cores, in 130 nm technology. The gate count excluding memories is 110 K gates. The design can work at over 150 MHz, although a real-time frame rate at QCIF resolution is achieved at 12 MHz for simple profile level 1. These performance metrics compare very favourably with previously published results [6]. The additional speed-up can be used to advantage in simultaneous processing of multiple input streams.
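The clock figure can be put into perspective with a rough per-macroblock cycle budget. The calculation below is an illustrative estimate only: it assumes the 15 frames per second quoted in the conclusion for QCIF applies to the 12 MHz operating point, and it ignores any stalls or memory wait states.

```python
def macroblocks_per_frame(width, height, mb_size=16):
    """Number of 16x16 macroblocks needed to cover a frame."""
    cols = (width + mb_size - 1) // mb_size
    rows = (height + mb_size - 1) // mb_size
    return cols * rows


def cycles_per_macroblock(clock_hz, width, height, fps):
    """Average clock cycles available to decode one macroblock."""
    return clock_hz / (macroblocks_per_frame(width, height) * fps)


# QCIF is 176x144 pixels; 12 MHz and 15 frames/s are the assumed operating point.
budget = cycles_per_macroblock(12_000_000, 176, 144, 15)
```

A QCIF frame holds 11 x 9 = 99 macroblocks, so 15 frames per second corresponds to 1485 macroblocks per second and, at 12 MHz, an average budget of roughly 8080 cycles per macroblock.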

The ModelSim tool was used for functional and post-synthesis simulations, and the Power Theatre tool provided power estimates. The design methodology adopted results in a highly portable design in terms of choice of target technology, integration, scalability and testing.

The CS6750 [5] memory requirements are split between local register/buffer requirements, Controller code and data RAM, and the Frame memory store. The internal memory requirement for all accelerator blocks in the complete MPEG-4 decoder core implementation stands at around 27 Kbits. The Frame and Program Memories require additional memory cells, which are specified by the user during implementation.

A demonstrator of the CS6750 has been built using the ARCangel-2 development board, shown in Figure 2. The system is based around a Xilinx Virtex-1000 chip. The design occupies approximately 8074 slices and utilises 26 internal Block-RAMs, without the Motion Compensation accelerator module. A clock speed of 12.5 MHz was achieved in this build, which meets the requirements of the different conformance test streams. The processor's internal UART was used for data input.

The software part of the implementation, written in ANSI C, is pure control code. It can play a large part in defining the quality of the implementation, enables system integration and provides a route for enhancements to the algorithms, such as the adoption of other video tools. A full codec can similarly be built from a selection of the accelerators and modifications to the control software. The hardware accelerators have been designed to be AMBA bus compliant. Additionally, each accelerator exists as a stand-alone module that can be integrated with existing video solutions where application speed-up is required. A low-power mode can be enabled through software by module shutdowns and stand-alone accelerator operation.



Figure 2: The Exemplar Implementation of MPEG-4 Video Decoder



The hardware/software co-design approach adopted here has resulted in a balanced product, with the key cycle-consuming elements in dedicated hardware while the remaining algorithms are left in software for enhancement and optional refinement. The arrangement also requires simpler control communication with downstream and upstream processes such as the de-multiplexer and display system. Memory usage is low because the processor is involved only in control operations. A major advantage of this approach is low power consumption, because the core frequency is reduced along with memory and associated bus accesses. A single processor core can be used for the entire system, so only one set of instruction and data memories and buses is needed. The same processor can be used to support other algorithms, as the remaining control software is low in processing demands. The scheme is portable to different platforms because the hardware modules have been developed with industry-standard bus interfaces such as AMBA and the control software is ANSI compliant. Further flexibility is possible by replacing some accelerator modules with equivalent software to suit platforms with spare computing power.

## 4.0 CONCLUSION

An implementation of an MPEG-4 video decoder is presented which is based entirely on hardware acceleration of compute-intensive operations, with an embedded RISC processor acting simply as a host controller. The ARCTangent-A4 used in this implementation serves as an architectural proof of concept using a minimal processor configuration. Other simple (or not so simple) processors may replace the ARC, such as the Altera NIOS, Xilinx MicroBlaze, ARM7, ARM9 or in-house custom cores. The hardware-accelerated approach is ideal for existing customer software solutions that require speed-ups for integration reasons or to meet higher level or profile requirements. All modules are AMBA compliant but can be ported to other bus systems, so they can easily be integrated with a system processor as co-processors or on any system bus.

The implementation comprises custom hardware designs for lookup-table decoders, bit-stream parsing, discrete cosine transforms and colour-space conversion. The hardware-software co-design approach results in high efficiency in both area and performance and also provides flexibility coupled with mix-and-match capability. The design has been validated on an FPGA-based development board with an LCD panel for visual demonstration of real-time decoded streaming video sequences. The MPEG-4 Video Decoder core has been ported to 130 nm ASIC technology. The system complexity in this technology is 110 K gates, with a power consumption of less than 10 mW for 15 frames per second at QCIF resolution. The fully processed RGB output is available from the decoder. Work is already underway to provide similar MPEG-4 platform solutions for the Encoder and for higher profiles. The core is ideally suited to high-volume system-on-a-chip (SoC) solutions for a wide range of wireless multimedia communications applications.

## 5.0 REFERENCES

- [1] J. Kneip, B. Schmale, H. Möller, R. Bosch, 'Applying and Implementing the MPEG-4 Multimedia Standard', IEEE Micro, Nov-Dec 1999, pp 64-74
- [2] M. Berekovic, H.-J. Stolberg, M.B. Kulaczewski, P. Pirsch, 'Instruction Set Extensions for MPEG-4 Video', Journal of VLSI Signal Processing, 1999, Vol. 23, pp 27-49
- [3] M. Berekovic, H.-J. Stolberg, P. Pirsch, 'A Programmable Co-Processor for MPEG-4 Video', IEEE International Conference on Acoustics Speech and Signal Processing, May 2001, pp 1021-1024
- [4] ARC Processor Cores, <http://www.arc.com>, on 29/10/02
- [5] CS6750 Product Brief, Amphion Semiconductor Ltd., <http://www.amphion.com> on 29/10/02
- [6] M. Takahashi, T. Nishikawa et al., 'A 60-MHz 240-mW MPEG-4 Videophone LSI with 16-Mb Embedded DRAM', IEEE Journal of Solid-State Circuits, Vol. 35, No. 11, Nov 2000, pp 1713-1721