# IMPLEMENTATION OF HIGH PERFORMANCE INTRA PREDICTOR IN H.264

Jinwook Kim, Eungu Jung, Eonpyo Hong, Sanghoon Kwak and Dongsoo Har

Gwangju Institute of Science and Technology (GIST), Gwangju, Republic of Korea

{dbjinuk, egjugn, ephong77, shkwak, hardon}@gist.ac.kr

# ABSTRACT

Intra prediction in H.264 video coding involves so many operations that it complicates the hardware. An intra predictor must therefore perform fast operations within a limited hardware area, yet it is difficult to satisfy the timing constraint with a small area. Our approach to meeting both the area limit and the timing constraint is to increase the utilization of hardware components and to reduce operational redundancy. In this paper, we propose a high-performance architecture for the intra predictor that is faster than the previous architecture while using about the same hardware area.

#### **1. INTRODUCTION**

H.264 is the most recent standard for video coding. It compresses original images, which carry abundant information, in three stages. The first stage is prediction, which forms the closest possible estimate of the current image frame from previous image frames or from neighboring pixels in the current frame. Prediction consists of inter prediction and intra prediction. Intra prediction predicts the current block by exploiting the similarity between neighboring pixels within an image frame: each predicted pixel is an extension of the neighboring pixels along the direction given by the prediction mode. After prediction, the predicted frame is subtracted from the current frame, and only the residual values are passed to the next stage. The second stage is the discrete cosine transform (DCT) and quantization, and the last stage is entropy coding.

In this paper, we propose a high-performance, small-area hardware architecture for intra prediction in H.264. The rest of this paper is organized as follows. Section 2 describes the intra prediction algorithm, Section 3 presents our hardware architecture, and Section 4 reports the implementation results. Finally, conclusions are given in Section 5.

#### **2. INTRA PREDICTION MODES**

The basic prediction unit of intra mode is a macroblock, which is represented by one 16x16 pixel block for the luminance and two 8x8 pixel blocks for the chrominance.

Fig. 1. 4x4 (i4mb) luminance prediction modes in intra prediction of H.264. The arrows in this figure indicate the direction of prediction in each mode. For modes 3–8, the predicted samples are formed from a *weighted average* of the prediction samples A–M (neighbor pixels).

Fig. 2. 16x16 (i16mb) luminance prediction modes in intra prediction of H.264.

The prediction unit of the "i4mb" type for the luminance is a 4x4 pixel block. Although the basic unit of each luminance mode is a 16x16 pixel block, a macroblock is divided into sixteen 4x4 pixel blocks for more precise prediction. The intra predictor predicts the current 4x4 pixel block as an extension of the neighboring pixels in the direction of each mode. Fig. 1 shows the nine "i4mb" modes and their extension directions.
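As a rough illustration of "extension of neighboring pixels", the following Python sketch models three of the nine 4x4 modes in software. The function names and the list-based pixel representation are assumptions made for this example only; the mode formulas follow the H.264 standard, with A–D above the block, E–H above-right, and I–L to the left, as in Fig. 1.

```python
# Illustrative software model only; function names are hypothetical.
# Neighbor naming follows Fig. 1: A-D above, E-H above-right, I-L left.

def predict_vertical(top):
    """Mode 0: every row copies the four pixels above the block (A..D)."""
    return [list(top[:4]) for _ in range(4)]

def predict_horizontal(left):
    """Mode 1: every row is filled with the pixel to its left (I..L)."""
    return [[left[y]] * 4 for y in range(4)]

def predict_diag_down_left(top8):
    """Mode 3: (1, 2, 1) weighted average of A..H along the down-left diagonal."""
    p = list(top8) + [top8[7]]  # repeat H so the last tap reads H twice
    return [[(p[x + y] + 2 * p[x + y + 1] + p[x + y + 2] + 2) >> 2
             for x in range(4)]
            for y in range(4)]
```

For example, with `top8 = [10, 20, 30, 40, 50, 60, 70, 80]`, `predict_diag_down_left(top8)` gives 20 at the top-left position, from (10 + 2·20 + 30 + 2) >> 2, and 78 at the bottom-right position, from (70 + 3·80 + 2) >> 2.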

The prediction unit of the "i16mb" type for the luminance is a 16x16 pixel block, which the intra predictor predicts in a way similar to the "i4mb" type. Fig. 2 shows the four "i16mb" modes and their directions.
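The 16x16 case can be sketched in the same way. The snippet below is a minimal software model, assuming the mode names "vertical", "horizontal", and "dc" from the H.264 standard (the text above does not name the modes); the fourth mode, plane prediction, is omitted for brevity. The function name and argument layout are hypothetical.

```python
# Illustrative software model only; predict_16x16 is a hypothetical helper.

def predict_16x16(mode, top, left):
    """Predict a 16x16 luma block from 16 pixels above and 16 pixels to the left."""
    n = 16
    if mode == "vertical":      # each column extends the pixel above it
        return [list(top) for _ in range(n)]
    if mode == "horizontal":    # each row extends the pixel to its left
        return [[left[y]] * n for y in range(n)]
    if mode == "dc":            # rounded mean of the 32 neighboring pixels
        dc = (sum(top) + sum(left) + n) >> 5
        return [[dc] * n for _ in range(n)]
    raise ValueError("plane mode and unknown modes are not modeled here")
```

The DC case assumes that both the upper and the left neighbors are available; the standard defines fallbacks for blocks at picture edges, which this sketch does not cover.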

There are four prediction modes for the chrominance, and their prediction unit is an 8x8 pixel block. The intra predictor predicts the 8x8 pixel block in a way similar to the luminance prediction.

#### **3. PROPOSED ARCHITECTURE**

Fig. 3 shows the circuit proposed in Yu-Wen Huang's paper [3], which predicts one row of a 4x4 pixel block per clock cycle. It has four separate circuits, each generating one predicted pixel value. This architecture needs four 13-to-1 multiplexers, twelve adders, four round-and-shift circuits, and four clip circuits to predict four pixels.

Our architecture improves hardware performance in two respects. First, we use the "carry in" input of each first-step adder for the rounding operation, which eliminates additional logic circuits. Only one "adding one" operation is required per pair of input operands, and it can be performed by setting "carry in" as the two operands enter the adder. As a result, the separate circuits for the rounding operation are eliminated.

The second respect is the reduction of operational redundancy. As shown in Fig. 3, the "B + C", "C + D", and "D + E" operations are each executed twice. To eliminate these redundant operations, our intra predictor divides the operations into small units, as shown in Fig. 4. The first units are "A + B + 1", "B + C + 1", "C + D + 1", "D + E + 1", "E + F + 1", "F + G + 1", "G + H + 1", and "H + H + 1". The second units are sums of two adjacent first units: "(A + B + 1) + (B + C + 1)", "(B + C + 1) + (C + D + 1)", "(C + D + 1) + (D + E + 1)", "(D + E + 1) + (E + F + 1)", "(E + F + 1) + (F + G + 1)", "(F + G + 1) + (G + H + 1)", and "(G + H + 1) + (H + H + 1)". As a result, we need only fifteen adders and four multiplexers to predict one mode. Although we eliminated the input multiplexers, four additional multiplexers are required at the output. Shift operations are performed by adjusting the bit positions of the wires. The input operands do not change during the prediction of one mode, but there are two groups of input operands for all the "i4mb" directional modes: "A, B, C, D, E, F, G, H" and "L, K, J, I, M, A, B, C, D". Therefore, we need nine additional 2-to-1 multiplexers.
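The arithmetic behind the two-step adder arrangement can be checked with a small software model. The Python sketch below is not the hardware description; its function names are hypothetical, and the mapping of second units to output pixels assumes the diagonal-down-left formula of the H.264 standard. It relies on the identity (A + B + 1) + (B + C + 1) = A + 2B + C + 2, so shifting a second unit right by two bits yields the rounded three-tap weighted average without a separate rounding circuit.

```python
# Illustrative software model of the shared adder tree; names are hypothetical.

def first_units(p):
    """Eight first-step sums p[i] + p[i+1] + 1; the '+ 1' models the carry-in.
    The last unit is H + H + 1."""
    ext = list(p) + [p[-1]]
    return [ext[i] + ext[i + 1] + 1 for i in range(8)]

def second_units(s1):
    """Seven second-step sums of adjacent first units."""
    return [s1[i] + s1[i + 1] for i in range(7)]

def diag_down_left_shared(p):
    """4x4 diagonal-down-left block from the seven shared second units."""
    s2 = second_units(first_units(p))
    return [[s2[x + y] >> 2 for x in range(4)] for y in range(4)]

def diag_down_left_direct(p):
    """Reference: evaluate the (1, 2, 1) filter separately for every pixel."""
    ext = list(p) + [p[-1]]
    return [[(ext[x + y] + 2 * ext[x + y + 1] + ext[x + y + 2] + 2) >> 2
             for x in range(4)]
            for y in range(4)]
```

In this model, eight first units plus seven second units account for fifteen additions, and all sixteen pixels of the block are read from the same seven second units. Shifting a first unit right by one bit likewise gives the rounded two-tap average (A + B + 1) >> 1 that other directional modes use.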

#### **4. IMPLEMENTATION RESULTS**

The proposed architecture was implemented on an FPGA (Virtex-II) using Synplify Pro 8.2 and simulated with ModelSim 6.0. The implementation results are shown in Table I.



Fig. 4. Proposed architecture. The upper eight adders are the first-step adders, and the lower seven adders are the second-step adders.

TABLE I. IMPLEMENTATION RESULT OF THE INTRA PREDICTOR

| Equivalent gate count | Maximum operating frequency |
|-----------------------|-----------------------------|
| 38,368                | 81.3 MHz                    |

#### **5. CONCLUSIONS**

We proposed a high-performance hardware architecture for the prediction value calculation block that reduces redundant or unnecessary hardware use. Our architecture can predict a 4x4 pixel block in one clock cycle using about the same hardware area as Huang's architecture, which predicts four pixels in one clock cycle.

#### **6. REFERENCES**

- [1] Joint Video Team, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, May 2003.
- [2] Iain E. G. Richardson, H.264 and MPEG-4 Video Compression, John Wiley & Sons Ltd, 2003.
- [3] Yu-Wen Huang, Bing-Yu Hsieh, Tung-Chien Chen, and Liang-Gee Chen, "Analysis, Fast Algorithm, and VLSI Architecture Design for H.264/AVC Intra Frame Coder", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 3, pp. 378-401, March 2005.