iOS video development: H.264 hardware encoding


Series of articles:
iOS video development (I): video capture
iOS video development (II): H.264 hardware encoding
iOS video development (III): H.264 hardware decoding
iOS video development (IV): YUV data explained in plain terms

The previous article, iOS video development (I): video capture, showed how to capture video data from the iOS camera. Raw captured video is large, which makes it impractical to store or send over a network, so it has to be compressed first, much as you would pack a large file into a rar archive before sending it to someone. Compressing video data is called encoding, and H.264 is one video coding format. Since iOS 8.0, Apple has opened up the VideoToolbox framework, which gives developers straightforward access to H.264 hardware encoding. This article covers H.264 hardware encoding on iOS in four parts:
1. Basic concepts of video coding
2. How VideoToolbox hardware encoding works and its workflow
3. Implementing hardware encoding in code
4. Summary and demo

Basic concepts

Why can video data be compressed? Because it contains redundancy. In most videos, the picture at one moment is very similar to the picture just before it. So instead of storing every picture in full, we can keep one complete picture and, for the pictures that follow, record only what changed. On playback, the other pictures are reconstructed from the complete picture plus the recorded changes. Recording and storing the differences is encoding; reconstructing the pictures from those differences is decoding.
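As a toy illustration of that idea, the sketch below keeps one frame in full and describes the next frame only as a list of changed pixels. This is not how H.264 itself works (it predicts blocks and transforms the residual); the names PixelChange, diff_frames and apply_changes are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One changed pixel: where it is and its new value. */
typedef struct {
    size_t  index;
    uint8_t value;
} PixelChange;

/* "Encode": record the pixels of `cur` that differ from `prev`.
 * Writes at most `cap` changes into `out` and returns the total number
 * of differing pixels (which may exceed `cap`). */
size_t diff_frames(const uint8_t *prev, const uint8_t *cur, size_t n,
                   PixelChange *out, size_t cap)
{
    assert(prev != NULL && cur != NULL);
    assert(out != NULL || cap == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (prev[i] != cur[i]) {
            if (count < cap) {
                out[count].index = i;
                out[count].value = cur[i];
            }
            count++;
        }
    }
    return count;
}

/* "Decode": turn the previous frame into the current one, in place. */
void apply_changes(uint8_t *frame, size_t n,
                   const PixelChange *changes, size_t count)
{
    assert(frame != NULL);
    for (size_t i = 0; i < count; i++) {
        if (changes[i].index < n) {
            frame[changes[i].index] = changes[i].value;
        }
    }
}
```

When two consecutive frames differ in only a few pixels, the change list is far smaller than a second full frame, which is the redundancy that inter-frame coding exploits.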

H.264 is a video coding standard. I won't go into detail here; see an encyclopedia entry for background.
H.264 defines three types of frames:

  • I frame: a frame that is encoded completely on its own, also known as a key frame
  • P frame: a frame encoded with reference to a preceding I or P frame; it contains only the differences
  • B frame: a frame encoded with reference to both preceding and following frames

The core algorithms in H.264 are intra-frame compression and inter-frame compression. Intra-frame compression produces I frames; inter-frame compression produces P frames and B frames.
A raw H.264 stream is a sequence of NALUs (NAL units), one after another. NALU = start code + NAL type + video data
The start code marks the beginning of a NALU. It must be "00 00 00 01" or "00 00 01".
The NALU types are as follows:

Type     Description
0        Unspecified
1        Coded slice of a non-IDR picture, without data partitioning
2        Coded slice data partition A of a non-IDR picture
3        Coded slice data partition B of a non-IDR picture
4        Coded slice data partition C of a non-IDR picture
5        Coded slice of an IDR picture
6        Supplemental enhancement information (SEI)
7        Sequence parameter set (SPS)
8        Picture parameter set (PPS)
9        Access unit delimiter
10       End of sequence
11       End of stream
12       Filler data
13       Sequence parameter set extension
14       Prefix NAL unit
15       Subset sequence parameter set
16 – 18  Reserved
19       Coded slice of an auxiliary coded picture, without data partitioning
20       Coded slice extension
21 – 23  Reserved
24 – 31  Unspecified

In practice we usually deal with only four types: 1, 5, 7 and 8. Type 5 is a slice of an IDR picture, that is, a key frame (I frame). The SPS and PPS (types 7 and 8) must come before the I frame. Type 1 is a P frame or B frame.
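To make the NALU layout concrete, here is a minimal sketch in plain C that scans a byte stream for start codes and reads the type of each NALU. It relies on the fact that the type is stored in the low 5 bits of the first byte after the start code. The function names are hypothetical and are not part of VideoToolbox.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find the next start code (00 00 01, optionally preceded by an extra 00)
 * at or after `from`. Returns the offset of the first byte after the
 * start code, or `len` if there is none. */
size_t next_nal_payload(const uint8_t *buf, size_t len, size_t from)
{
    assert(buf != NULL);
    for (size_t i = from; i + 3 <= len; i++) {
        if (buf[i] == 0x00 && buf[i + 1] == 0x00 && buf[i + 2] == 0x01) {
            return i + 3;
        }
    }
    return len;
}

/* The NAL unit type is the low 5 bits of the NAL header byte. */
int nal_unit_type(uint8_t header_byte)
{
    return header_byte & 0x1F;
}

/* Collect the types of the NAL units in the buffer, up to `cap` entries.
 * Returns how many types were written. */
size_t list_nal_types(const uint8_t *buf, size_t len, int *types, size_t cap)
{
    assert(types != NULL || cap == 0);
    size_t count = 0;
    size_t pos = next_nal_payload(buf, len, 0);
    while (pos < len && count < cap) {
        types[count++] = nal_unit_type(buf[pos]);
        pos = next_nal_payload(buf, len, pos);
    }
    return count;
}
```

For a stream that starts with a key frame, the list typically begins with 7, 8, 5 (SPS, PPS, IDR slice) and continues with 1 for the following P or B frames.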

Frame rate: measured in fps (frames per second), the number of pictures shown per second. The higher the value, the smoother the motion.
Bit rate: measured in bps (bits per second), the amount of video data output per second. The higher the value, the clearer the picture.
Resolution: the pixel dimensions of the video picture, such as the common 720P and 1080P.
Key frame interval: how often a key frame is encoded.
Software encoding: encoding on the CPU. Performance is relatively poor.
Hardware encoding: encoding on dedicated hardware such as a GPU, DSP, FPGA or ASIC instead of the CPU. Performance is good.
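To get a feel for how these numbers relate, the sketch below computes the bit rate of uncompressed video and compares it with an encoded bit rate. It assumes 8-bit YUV 4:2:0 data (12 bits per pixel on average), which is an assumption of this example rather than something the definitions above require; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bits per second of uncompressed 8-bit YUV 4:2:0 video:
 * 8 bits of luma per pixel plus two chroma planes at quarter resolution,
 * which averages 12 bits per pixel. */
uint64_t raw_yuv420_bits_per_second(uint32_t width, uint32_t height, uint32_t fps)
{
    assert(width % 2 == 0 && height % 2 == 0); /* 4:2:0 needs even dimensions */
    return (uint64_t)width * height * 12u * fps;
}

/* How many times smaller the encoded stream is than the raw one. */
double compression_ratio(uint64_t raw_bps, uint64_t encoded_bps)
{
    assert(encoded_bps > 0);
    return (double)raw_bps / (double)encoded_bps;
}
```

For example, 1280x720 at 30 fps comes to 331,776,000 bps uncompressed, so encoding it at 1,000,000 bps means the encoder has to shrink the data by a factor of roughly 330.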

Implementing H.264 hardware encoding with VideoToolbox

On iOS 8.0 and above, VideoToolbox provides hardware encoding and decoding of video data. The basic VideoToolbox data structures are:

  • CVPixelBufferRef/CVImageBufferRef: holds image data before encoding and after decoding. The two names refer to the same type
  • CMTime: a timestamp, expressed as a fraction with a 64-bit value over a 32-bit timescale
  • CMBlockBufferRef: holds the data output by the encoder
  • CMFormatDescriptionRef/CMVideoFormatDescriptionRef: describes the image storage layout, codec and so on. The two names refer to the same type
  • CMSampleBufferRef: the container for video images both before and after encoding and decoding

    Figure: CMSampleBuffer data structure before and after encoding and decoding

Basic steps

1. Create the encoder with VTCompressionSessionCreate
2. Set encoder properties with VTSessionSetProperty
3. After setting the properties, call VTCompressionSessionPrepareToEncodeFrames to prepare for encoding
4. Feed in the captured video data by calling VTCompressionSessionEncodeFrame
5. Receive the encoded data and process it
6. Call VTCompressionSessionCompleteFrames to finish any pending frames and stop encoding
7. Call VTCompressionSessionInvalidate to destroy the encoder

1. Create encoder

VTCompressionSessionCreate creates a video encoding session. It takes 10 parameters. Here is Apple's declaration of the API:


    VT_EXPORT OSStatus
    VTCompressionSessionCreate(
        CM_NULLABLE CFAllocatorRef                          allocator,
        int32_t                                             width,
        int32_t                                             height,
        CMVideoCodecType                                    codecType,
        CM_NULLABLE CFDictionaryRef                         encoderSpecification,
        CM_NULLABLE CFDictionaryRef                         sourceImageBufferAttributes,
        CM_NULLABLE CFAllocatorRef                          compressedDataAllocator,
        CM_NULLABLE VTCompressionOutputCallback             outputCallback,
        void * CM_NULLABLE                                  outputCallbackRefCon,
        CM_RETURNS_RETAINED_PARAMETER CM_NULLABLE VTCompressionSessionRef * CM_NONNULL compressionSessionOut)

allocator: the memory allocator for the session. Pass NULL to use the default allocator
width, height: the width and height of the video frames, in pixels. If the encoder does not support this size, it may change it
codecType: the codec type, an enumeration value such as kCMVideoCodecType_H264
encoderSpecification: selects a specific encoder. Pass NULL to let VideoToolbox choose one automatically
sourceImageBufferAttributes: attributes for the source pixel buffers. If this parameter is set, VideoToolbox creates a pixel buffer pool with these attributes; pass NULL if you do not need a buffer pool
compressedDataAllocator: the memory allocator for the compressed data. Pass NULL to use the default allocator
outputCallback: the callback function that receives the encoded output
outputCallbackRefCon: a user-defined pointer passed through to the callback. We usually pass self so that the callback can reach the methods and properties of the current class
compressionSessionOut: receives the encoder handle; pass a pointer to a VTCompressionSessionRef variable


OSStatus status = VTCompressionSessionCreate(NULL, 180, 320, kCMVideoCodecType_H264, NULL, NULL, NULL, encodeOutputDataCallback, (__bridge void *)(self), &_compressionSessionRef);

2. Set encoder properties & prepare for encoding

After the encoder is created, every encoder property is set by calling VTSessionSetProperty.

kVTCompressionPropertyKey_AverageBitRate: sets the average bit rate of the encoded output, in bps. It is a target rather than a hard limit, so the actual bit rate fluctuates around it. The VideoToolbox framework only supports ABR mode. Encoders commonly use one of four rate control methods:

  • CBR (Constant Bit Rate): encodes at a constant bit rate. When there is motion, the only way to keep the bit rate constant is to raise QP and spend fewer bits, so image quality drops. When the scene is still, quality improves again, so quality is unstable. This method prioritizes bit rate (bandwidth).
  • VBR (Variable Bit Rate): the bit rate varies with the complexity of the image, so coding efficiency is relatively high and motion produces few mosaic artifacts. The rate control algorithm chooses the bit rate based on image content: simple content gets fewer bits and complex content gets more, which preserves quality while still taking bandwidth limits into account. This method prioritizes image quality.
  • CVBR (Constrained Variable Bit Rate): an improved form of VBR. The constraint is that either the maximum bit rate or the average bit rate is held constant. It combines the advantages of the two methods above: when the image is still it saves bandwidth, and when motion occurs it spends the bandwidth saved earlier to keep quality as high as possible, balancing bandwidth and image quality.
  • ABR (Average Bit Rate): keeps the average bit rate over a period of time within a range around the configured value. It can be seen as a compromise between VBR and CBR.
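Because the configured average bit rate is a target rather than a hard limit, it can help to measure what the encoder actually produces. The following is a minimal sketch, assuming you collect the size in bytes of each encoded frame yourself and that the frame rate is constant; average_bitrate_bps is a hypothetical helper, not a VideoToolbox function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Average bit rate, in bits per second, of `count` encoded frames whose
 * sizes in bytes are given in `frame_bytes`, at a constant `fps`. */
double average_bitrate_bps(const size_t *frame_bytes, size_t count, double fps)
{
    assert(frame_bytes != NULL && count > 0 && fps > 0.0);
    uint64_t total_bytes = 0;
    for (size_t i = 0; i < count; i++) {
        total_bytes += frame_bytes[i];
    }
    double seconds = (double)count / fps; /* time covered by these frames */
    return (double)total_bytes * 8.0 / seconds;
}
```

Computing this over a sliding window of the most recent second or two shows how far the output drifts from the configured value when the scene changes.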

kVTCompressionPropertyKey_ProfileLevel: sets the H.264 profile and level. H.264 has four main profiles: BP, EP, MP and HP

BP (Baseline Profile): basic profile. Supports I/P frames, progressive video only, and CAVLC only. Main applications: real-time video communication such as video telephony, video conferencing and wireless communication
EP (Extended Profile): extended profile. Supports I/P/B/SP/SI frames, and CAVLC only
MP (Main Profile): mainstream profile. Supports I/P/B frames, progressive and interlaced video, and both CAVLC and CABAC. Main applications: digital broadcast television and digital video storage
HP (High Profile): high profile. Builds on the Main profile by adding 8×8 intra prediction, custom quantization, lossless video coding and more YUV formats. Used in broadcast television and storage

There are many levels, so they are not listed here; see a reference on H.264 profiles and levels. Common choices on iPhone are:

  • Live broadcast:
    Low definition Baseline Level 1.3
    Standard definition Baseline Level 3
    Half HD Baseline Level 3.1
    Full HD Baseline Level 4.1
  • Storage media:
    Low definition Main Level 1.3
    Standard definition Main Level 3
    Half HD Main Level 3.1
    Full HD Main Level 4.1
  • HD storage:
    Half HD High Level 3.1
    Full HD High Level 4.1

kVTCompressionPropertyKey_RealTime: whether to encode and output in real time
kVTCompressionPropertyKey_AllowFrameReordering: whether to generate B frames. B frames are available in the Main and High profiles, not in Baseline
kVTCompressionPropertyKey_MaxKeyFrameInterval, kVTCompressionPropertyKey_MaxKeyFrameIntervalDuration: configure the key frame (I-frame) interval, in frames and in seconds respectively

After setting the encoder properties, call VTCompressionSessionPrepareToEncodeFrames to prepare for encoding.


// Set the average bit rate to 512 kbps (512 * 1024 bps)
OSStatus status = VTSessionSetProperty(_compressionSessionRef, kVTCompressionPropertyKey_AverageBitRate, (__bridge CFTypeRef)@(512 * 1024));
// Set the ProfileLevel to Baseline 3.1
status = VTSessionSetProperty(_compressionSessionRef, kVTCompressionPropertyKey_ProfileLevel, kVTProfileLevel_H264_Baseline_3_1);
// Encode and output in real time (avoids latency)
status = VTSessionSetProperty(_compressionSessionRef, kVTCompressionPropertyKey_RealTime, kCFBooleanTrue);
// Configure whether to generate B frames
status = VTSessionSetProperty(_compressionSessionRef, kVTCompressionPropertyKey_AllowFrameReordering, self.videoEncodeParam.allowFrameReordering ? kCFBooleanTrue : kCFBooleanFalse);
// Maximum key frame interval in frames: frameRate x maxKeyFrameInterval seconds, e.g. 15 fps x 240 s = 3600 frames
status = VTSessionSetProperty(_compressionSessionRef, kVTCompressionPropertyKey_MaxKeyFrameInterval, (__bridge CFTypeRef)@(self.videoEncodeParam.frameRate * self.videoEncodeParam.maxKeyFrameInterval));
// Maximum key frame interval in seconds, e.g. at least one I frame every 240 seconds
status = VTSessionSetProperty(_compressionSessionRef, kVTCompressionPropertyKey_MaxKeyFrameIntervalDuration, (__bridge CFTypeRef)@(self.videoEncodeParam.maxKeyFrameInterval));
// Prepare the encoder for encoding
status = VTCompressionSessionPrepareToEncodeFrames(_compressionSessionRef);
if (noErr != status) {
    NSLog(@"VEVideoEncoder::VTCompressionSessionPrepareToEncodeFrames Error : %d!", (int)status);
}
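The two key frame settings limit the same thing in different units. As I understand Apple's header documentation, when both are set both limits are enforced, so a key frame is required every X frames or every Y seconds, whichever comes first. The sketch below only models that rule and the frames = frame rate x seconds arithmetic; it is not VideoToolbox's implementation, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Frame-count limit that corresponds to a duration limit at a given frame rate. */
int32_t key_frame_interval_frames(int32_t fps, int32_t seconds)
{
    assert(fps > 0 && seconds > 0);
    return fps * seconds;
}

/* True when either limit has been reached since the last key frame.
 * A limit of 0 is treated here as "no limit". */
bool key_frame_due(int32_t frames_since_key, double seconds_since_key,
                   int32_t max_frames, double max_seconds)
{
    assert(frames_since_key >= 0 && seconds_since_key >= 0.0);
    bool frame_limit_hit = max_frames > 0 && frames_since_key >= max_frames;
    bool time_limit_hit  = max_seconds > 0.0 && seconds_since_key >= max_seconds;
    return frame_limit_hit || time_limit_hit;
}
```

With 15 fps and 240 seconds, key_frame_interval_frames gives 3600 frames, which matches the value passed for the frame-count limit in the code above.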

3. Input video data to be encoded

The video data to be encoded is passed to the encoder by calling VTCompressionSessionEncodeFrame.


    VT_EXPORT OSStatus
    VTCompressionSessionEncodeFrame(
        CM_NONNULL VTCompressionSessionRef  session,
        CM_NONNULL CVImageBufferRef         imageBuffer,
        CMTime                              presentationTimeStamp,
        CMTime                              duration, // may be kCMTimeInvalid
        CM_NULLABLE CFDictionaryRef         frameProperties,
        void * CM_NULLABLE                  sourceFrameRefCon,
        VTEncodeInfoFlags * CM_NULLABLE     infoFlagsOut )

session: the encoder handle obtained when the encoder was created
imageBuffer: the YUV data. Video captured from the iOS camera arrives as CMSampleBufferRef, so we need to extract a CVImageBufferRef from it for encoding. Use CMSampleBufferGetImageBuffer to get the imageBuffer from the sampleBuffer.
presentationTimeStamp: the timestamp of this frame, as a CMTime. Each timestamp passed to a session must be greater than the previous one
duration: the duration of this frame. If there is no duration information, pass kCMTimeInvalid
frameProperties: properties for this frame. This is where we can force the frame to be encoded as an I frame
sourceFrameRefCon: a custom pointer for this frame that is passed through to the output callback
infoFlagsOut: receives information about the encoding operation. Pass NULL if not needed


// Get CVImageBufferRef
CVImageBufferRef imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
// Use the capture timestamp as the presentation timestamp of this frame
CMTime presentationTimeStamp = CMSampleBufferGetPresentationTimeStamp(sampleBuffer);
// Set whether to force this frame to be an I frame
NSDictionary *frameProperties = @{(__bridge NSString *)kVTEncodeFrameOptionKey_ForceKeyFrame: @(forceKeyFrame)};
// Input data to be encoded
OSStatus status = VTCompressionSessionEncodeFrame(_compressionSessionRef, imageBuffer, presentationTimeStamp, kCMTimeInvalid, (__bridge CFDictionaryRef)frameProperties, NULL, NULL);
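VTCompressionSessionEncodeFrame takes its timestamps as CMTime values, which store time as a fraction (a value divided by a timescale) rather than as a floating-point number of seconds. The sketch below imitates only that core idea in plain C so the arithmetic is visible. RationalTime is a hypothetical stand-in, not the real CMTime, which also carries flags and an epoch, and the comparison ignores integer overflow for very large values.

```c
#include <assert.h>
#include <stdint.h>

/* A stand-in for the core idea of CMTime: time = value / timescale seconds. */
typedef struct {
    int64_t value;
    int32_t timescale;
} RationalTime;

/* Timestamp of frame number `frame_index` at a constant frame rate. */
RationalTime frame_timestamp(int64_t frame_index, int32_t fps)
{
    assert(frame_index >= 0 && fps > 0);
    RationalTime t = { frame_index, fps };
    return t;
}

/* Convert to seconds for display; comparisons should stay in integers. */
double rational_time_seconds(RationalTime t)
{
    assert(t.timescale > 0);
    return (double)t.value / (double)t.timescale;
}

/* Compare two times by cross-multiplication, so no rounding is involved.
 * Returns -1, 0 or 1. */
int rational_time_compare(RationalTime a, RationalTime b)
{
    assert(a.timescale > 0 && b.timescale > 0);
    int64_t lhs = a.value * (int64_t)b.timescale;
    int64_t rhs = b.value * (int64_t)a.timescale;
    return (lhs > rhs) - (lhs < rhs);
}
```

With a timescale equal to the frame rate, frame n simply has value n, so consecutive frames get strictly increasing timestamps without accumulating rounding error.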

4. Obtain and process the encoded data

The encoded data is delivered through the callback function passed to VTCompressionSessionCreate. The encoded data and the basic information for the frame are in a CMSampleBufferRef. If the frame is a key frame, we also need to extract the SPS and PPS, and each piece of data is returned with a start code prepended.


// Output callback registered in VTCompressionSessionCreate
void encodeOutputDataCallback(void *outputCallbackRefCon, void *sourceFrameRefCon, OSStatus status, VTEncodeInfoFlags infoFlags, CMSampleBufferRef sampleBuffer)
{
    // Skip frames that failed to encode or whose data is not ready
    if (noErr != status || NULL == sampleBuffer || !CMSampleBufferDataIsReady(sampleBuffer)) {
        NSLog(@"VEVideoEncoder::encodeOutputDataCallback Error : %d!", (int)status);
        return;
    }

    VEVideoEncoder *encoder = (__bridge VEVideoEncoder *)outputCallbackRefCon;
    // Start code
    const char header[] = "\x00\x00\x00\x01";
    size_t headerLen = (sizeof header) - 1;
    NSData *headerData = [NSData dataWithBytes:header length:headerLen];

    // A frame is a key frame when its attachments do not contain kCMSampleAttachmentKey_NotSync
    CFArrayRef attachments = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer, true);
    bool isKeyFrame = !CFDictionaryContainsKey((CFDictionaryRef)CFArrayGetValueAtIndex(attachments, 0), (const void *)kCMSampleAttachmentKey_NotSync);

    if (isKeyFrame) {
        NSLog(@"VEVideoEncoder::Encoded a key frame");
        CMFormatDescriptionRef formatDescriptionRef = CMSampleBufferGetFormatDescription(sampleBuffer);
        // SPS (index 0) and PPS (index 1) must be sent before the key frame
        size_t sParameterSetSize, sParameterSetCount;
        const uint8_t *sParameterSet;
        OSStatus spsStatus = CMVideoFormatDescriptionGetH264ParameterSetAtIndex(formatDescriptionRef, 0, &sParameterSet, &sParameterSetSize, &sParameterSetCount, NULL);
        size_t pParameterSetSize, pParameterSetCount;
        const uint8_t *pParameterSet;
        OSStatus ppsStatus = CMVideoFormatDescriptionGetH264ParameterSetAtIndex(formatDescriptionRef, 1, &pParameterSet, &pParameterSetSize, &pParameterSetCount, NULL);
        if (noErr == spsStatus && noErr == ppsStatus) {
            // SPS data plus start code forms a NALU
            NSData *sps = [NSData dataWithBytes:sParameterSet length:sParameterSetSize];
            NSMutableData *spsData = [NSMutableData data];
            [spsData appendData:headerData];
            [spsData appendData:sps];
            // Pass the data to the upper layer through the delegate
            if ([encoder.delegate respondsToSelector:@selector(videoEncodeOutputDataCallback:isKeyFrame:)]) {
                [encoder.delegate videoEncodeOutputDataCallback:spsData isKeyFrame:isKeyFrame];
            }
            // PPS data plus start code forms a NALU
            NSData *pps = [NSData dataWithBytes:pParameterSet length:pParameterSetSize];
            NSMutableData *ppsData = [NSMutableData data];
            [ppsData appendData:headerData];
            [ppsData appendData:pps];
            if ([encoder.delegate respondsToSelector:@selector(videoEncodeOutputDataCallback:isKeyFrame:)]) {
                [encoder.delegate videoEncodeOutputDataCallback:ppsData isKeyFrame:isKeyFrame];
            }
        }
    }

    // Get frame data
    CMBlockBufferRef blockBuffer = CMSampleBufferGetDataBuffer(sampleBuffer);
    size_t length, totalLength;
    char *dataPointer;
    status = CMBlockBufferGetDataPointer(blockBuffer, 0, &length, &totalLength, &dataPointer);
    if (noErr != status) {
        NSLog(@"VEVideoEncoder::CMBlockBufferGetDataPointer Error : %d!", (int)status);
        return;
    }

    size_t bufferOffset = 0;
    static const size_t avcHeaderLength = 4;
    // Each NAL unit in the buffer is prefixed with a 4-byte big-endian length, not a start code
    while (bufferOffset + avcHeaderLength < totalLength) {
        // Read NAL unit length
        uint32_t nalUnitLength = 0;
        memcpy(&nalUnitLength, dataPointer + bufferOffset, avcHeaderLength);
        // Convert from big-endian to host byte order
        nalUnitLength = CFSwapInt32BigToHost(nalUnitLength);
        NSData *frameData = [[NSData alloc] initWithBytes:(dataPointer + bufferOffset + avcHeaderLength) length:nalUnitLength];
        // Replace the length prefix with a start code
        NSMutableData *outputFrameData = [NSMutableData data];
        [outputFrameData appendData:headerData];
        [outputFrameData appendData:frameData];
        bufferOffset += avcHeaderLength + nalUnitLength;
        if ([encoder.delegate respondsToSelector:@selector(videoEncodeOutputDataCallback:isKeyFrame:)]) {
            [encoder.delegate videoEncodeOutputDataCallback:outputFrameData isKeyFrame:isKeyFrame];
        }
    }
}
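The loop above turns length-prefixed NAL units into start-code-prefixed ones. The same conversion can be sketched in plain C without any Apple frameworks, which makes it easier to test in isolation. This sketch assumes a 4-byte length prefix, as the code above does; avcc_to_annexb is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert a buffer of length-prefixed NAL units (4-byte big-endian length
 * before each unit) into start-code form (00 00 00 01 before each unit).
 * Both prefixes are 4 bytes long, so the output has the same size as the
 * input. Returns the number of NAL units converted, or -1 if the input is
 * malformed or `dst` is too small. */
int avcc_to_annexb(const uint8_t *src, size_t src_len,
                   uint8_t *dst, size_t dst_cap)
{
    static const uint8_t start_code[4] = { 0x00, 0x00, 0x00, 0x01 };
    assert(src != NULL && dst != NULL);
    if (dst_cap < src_len) {
        return -1;
    }

    size_t offset = 0;
    int units = 0;
    while (offset < src_len) {
        if (src_len - offset < 4) {
            return -1; /* truncated length field */
        }
        uint32_t nal_len = ((uint32_t)src[offset] << 24) |
                           ((uint32_t)src[offset + 1] << 16) |
                           ((uint32_t)src[offset + 2] << 8) |
                            (uint32_t)src[offset + 3];
        if (nal_len > src_len - offset - 4) {
            return -1; /* truncated payload */
        }
        memcpy(dst + offset, start_code, 4);
        memcpy(dst + offset + 4, src + offset + 4, nal_len);
        offset += 4 + (size_t)nal_len;
        units++;
    }
    return units;
}
```

Unlike the callback, this version validates every length against the remaining buffer before copying, so a corrupted length field cannot cause a read past the end of the data.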

5. Stop encoding


OSStatus status = VTCompressionSessionCompleteFrames(_compressionSessionRef, kCMTimeInvalid);

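6. Release the encoder


// Invalidate the session, release our reference, then clear the handle
VTCompressionSessionInvalidate(_compressionSessionRef);
CFRelease(_compressionSessionRef);
_compressionSessionRef = NULL;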

Pitfalls and summary

While working on the codec I found that at 720P and a bit rate of 1 Mbps, the picture broke up into heavy mosaic whenever it shook, and lowering the bit rate made it worse. At first I assumed an encoder property was missing or set incorrectly, but after a lot of searching I could not find the cause. Next I suspected that in ABR mode the bit rate could not rise quickly enough when the picture went from still to shaking. That seemed plausible, but when I printed the encoded bit rate it did rise when the picture shook, so that idea was wrong too. Finally I noticed that the camera was capturing at 720P, that is 1280x720, and I had configured the encoder with the same 1280x720 width and height, while the picture I actually decoded and displayed was only 320x180. I tried setting the encoding width and height to 320x180 and the mosaic problem went away! I am still a beginner in encoding and decoding, so if any expert knows the underlying principle, I would be glad to hear it.
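One rough way to reason about this observation is to look at how many encoded bits are available per pixel per frame. This is only a heuristic and not a confirmed explanation of the behaviour above. The 15 fps frame rate in the example is an assumption, and bits_per_pixel is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Average number of encoded bits available per pixel per frame. */
double bits_per_pixel(uint64_t bitrate_bps, uint32_t width, uint32_t height, uint32_t fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    return (double)bitrate_bps / ((double)width * height * fps);
}
```

At 1,000,000 bps and 15 fps, 1280x720 leaves about 0.07 bits per pixel while 320x180 leaves about 1.16, sixteen times more, so at the same bit rate the smaller picture has far more room to encode motion without visible blocking.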

The next article will use VideoToolbox to implement hardware decoding and will also cover the relevant background on YUV data.
Demo address:

Tags: iOS rtc

Posted by stephaneey on Fri, 20 May 2022 18:11:20 +0300