PIM-JPEG


2021-11 by Daniel Ng

Background

JPEG processing is part of a larger project, PIM-PREP. For more information, see the DPU-GPU-DL-PREP doc.

JPEG Introduction

JPEG is a digital image encoding standard which uses lossy compression to store images. Neural networks frequently use JPEG images as training datasets, especially for image/object recognition tasks. Since JPEG stores image data in an encoded format, the data has to be decoded before it can be used for training.

In binary, a JPEG file is a sequence of marker segments followed by the encoded image data. In short, the start of the file holds multiple markers such as Start Of Image (which indicates that this is a JPEG file), Start Of Frame (which holds information such as the image width and height), and Start Of Scan (which marks the beginning of the encoded image data). After reading the information from the markers, the encoded image data can be decoded to retrieve the RGB values of each pixel of the image. The decoding process is as follows:

First, the Huffman codes are derived from the Huffman Tables that are read from the markers. Using the Huffman codes, the encoded image data can be decoded bit by bit. Decoded data is grouped into 8 x 8 pixel blocks called Minimum Coded Units (MCUs). Therefore, a 1920 x 1080 grayscale image has 32400 MCUs, and a colored image has 97200 MCUs since it has 3 color channels.
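To make the bit-by-bit decoding concrete, here is a minimal C sketch of decoding one Huffman symbol against a code table derived from a DHT marker. The BitReader type, the table layout, and the function names are illustrative only (they are not the PIM-JPEG API), and byte stuffing of 0xFF bytes in the scan data is ignored for brevity:

  #include <stdint.h>

  /* Illustrative types, not the PIM-JPEG data structures. */
  typedef struct { const uint8_t *data; uint32_t bitpos; } BitReader;
  typedef struct { uint16_t code; uint8_t length; uint8_t symbol; } HuffEntry;

  /* Read one bit of scan data, most significant bit first. */
  static int read_bit(BitReader *br) {
      int bit = (br->data[br->bitpos >> 3] >> (7 - (br->bitpos & 7))) & 1;
      br->bitpos++;
      return bit;
  }

  /* Append bits one at a time until the accumulated code matches a table
     entry. JPEG Huffman codes are at most 16 bits long. */
  static int decode_symbol(BitReader *br, const HuffEntry *table, int n) {
      uint16_t code = 0;
      for (int len = 1; len <= 16; len++) {
          code = (uint16_t)((code << 1) | read_bit(br));
          for (int i = 0; i < n; i++)
              if (table[i].length == len && table[i].code == code)
                  return table[i].symbol;          /* matched a code word */
      }
      return -1;                                   /* invalid bitstream   */
  }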

After decoding into MCUs, each MCU can be individually dequantized. This is a simple step in which each value in each MCU is multiplied by its corresponding entry in the Quantization Table read from the markers.
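As a rough sketch (assuming 16-bit signed coefficients and 16-bit table entries, which may differ from the actual repo), dequantizing one 8 x 8 block is just an element-wise multiply:

  #include <stdint.h>

  /* Multiply each of the 64 coefficients in a block by the matching
     quantization table entry (both arrays must use the same ordering). */
  static void dequantize_block(int16_t block[64], const uint16_t qtable[64]) {
      for (int i = 0; i < 64; i++)
          block[i] = (int16_t)(block[i] * qtable[i]);
  }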

The dequantized MCUs then undergo the inverse discrete cosine transform (IDCT). At this stage, each value in an MCU is a coefficient describing a cosine function oscillating at a particular frequency; the IDCT sums these cosine functions to transform each MCU back into a finite sequence of data points, i.e. the original values of each color channel for each pixel.
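For reference, the 8 x 8 inverse DCT defined in the JPEG standard (ITU-T T.81) reconstructs each spatial sample f(x, y) from the dequantized coefficients F(u, v) as

  f(x,y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u)\, C(v)\, F(u,v)\,
           \cos\left[ \frac{(2x+1)u\pi}{16} \right] \cos\left[ \frac{(2y+1)v\pi}{16} \right],
  \quad C(0) = \frac{1}{\sqrt{2}}, \quad C(k) = 1 \text{ for } k > 0.

Evaluating these cosines directly requires floating point, which is why the DPU implementation approximates the transform with integer arithmetic (see the JPEG Decoding section below).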

Lastly, JPEG stores its 3 color channels as YCbCr, so these have to be converted to RGB values before the image can be viewed as a BMP file or used for machine learning.
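The standard JFIF conversion, with all values in the 0-255 range, is:

  R = Y + 1.402 (Cr - 128)
  G = Y - 0.344136 (Cb - 128) - 0.714136 (Cr - 128)
  B = Y + 1.772 (Cb - 128)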

For more detailed information please refer to the following links:

  • http://vip.sugovica.hu/Sardi/kepnezo/JPEG%20File%20Layout%20and%20Format.htm
  • https://www.w3.org/Graphics/JPEG/itu-t81.pdf
  • https://www.youtube.com/playlist?list=PLpsTn9TA_Q8VMDyOPrDKmSJYt1DLgDZU4

Image Preprocessing Tasks

The most common image preprocessing tasks are the ones used in AlexNet, an early and prominent convolutional neural network. In AlexNet [1], training JPEG images are first decoded and then down-sampled to 256 x 256 pixels. This is done by first isotropically scaling down the image so that its shorter side is 256 pixels, and then cropping out the central 256 x 256 patch of the resulting image. After that, the average RGB values over the entire training set are calculated and subtracted from each pixel of all down-sampled images. Following this, the training set is artificially enlarged by extracting random 224 x 224 patches (and their horizontal reflections) from the 256 x 256 images. Lastly, Principal Component Analysis (PCA) is performed on the set of RGB values across the entire training set, and random multiples of the found principal components are added to each training image, with the magnitudes of the random multiples drawn from a Gaussian distribution.

Since DPUs do not have hardware support for floating point operations, certain preprocessing tasks cannot be implemented efficiently on the DPUs. The following lists the preprocessing tasks that are feasible on DPUs:

  • JPEG decoding
  • Calculating sum of RGB values (for future subtraction of average RGB values)
  • Image cropping
  • Image horizontal flipping
  • Image scaling by whole-number factors

Scaling by any other factor is inefficient on the DPU because it would involve floating point operations. However, the DPUs can still reduce an arbitrary image to 256 x 256 by first cropping and then scaling. For example, to reduce a 1000 x 667 image to 256 x 256, the central 512 x 512 patch of the original image can be cropped out and then scaled down by a factor of 2 to 256 x 256. Note that this loses some quality compared to the conventional method of scaling first and then cropping.
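As a small illustration of how the crop size could be chosen (the function name and interface are made up for this sketch, not taken from the repo): pick the largest power-of-2 factor k such that a (k * 256) x (k * 256) patch still fits inside the image, crop that central patch, and scale it down by k.

  /* Largest power-of-2 factor k such that a (k*target) x (k*target) central
     crop fits inside a w x h image. For a 1000 x 667 image and target 256
     this returns 2: crop the central 512 x 512, then scale down by 2. */
  static int choose_crop_factor(int w, int h, int target) {
      int k = 1;
      while (2 * k * target <= w && 2 * k * target <= h)
          k *= 2;
      return k;
  }

For images whose shorter side is already below 256 pixels this returns 1, and some other strategy (such as fractional scaling on the host) would be needed.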

The following lists the computationally intractable preprocessing tasks for DPUs with today’s known algorithms:

  • Fractional scaling
  • Performing PCA
  • Drawing random values from a Gaussian distribution
  • Image rotation

These tasks are considered intractable because they all rely heavily on floating point operations. Therefore, our goal is to implement the feasible preprocessing tasks efficiently enough that DPUs can perform them faster than CPUs, and ideally faster than the GPU training rate.

JPEG Decoding

Most of the feasible image preprocessing tasks have been implemented on the DPUs and can be found in the PIM-JPEG repo. The JPEG decoding process follows the steps outlined in the JPEG Introduction section. DPU programs excel when multiple tasklets can be used, so the implementation aims to use multiple tasklets as often as possible. For this example, let's assume that 1 DPU with 8 tasklets is used to decode 1 JPEG.

When the program starts, a serial task is performed in which tasklet 0 reads all the markers, which contain important information such as the Quantization Tables, Huffman Tables, and image metadata. Tasklets 1-7 wait for tasklet 0 to finish. After that, all 8 tasklets start decoding the image data, with each tasklet allotted its own section of the encoded Huffman data to decode.

The problem is that, since the data is entropy-coded, there is no simple way to determine where one MCU ends and another begins. It is therefore very likely that a tasklet will begin decoding in the middle of an MCU and produce incorrect MCUs. However, JPEG decoding has an interesting property: each tasklet's output automatically synchronizes with the true bitstream after around 5 to 20 MCUs. This means that, for each tasklet, after a small number of incorrect MCUs, all remaining MCUs will be decoded correctly.

To illustrate the fix for this issue, assume that tasklet 3 has decoded 100 MCUs and the first 7 are incorrect. Tasklet 2 then overflows into tasklet 3's allotted section and keeps decoding. Whatever tasklet 2 decodes in tasklet 3's section must be correct, so tasklet 2 only needs to decode 7 MCUs before it reaches the correct MCUs already produced by tasklet 3, at which point it can stop. In general, the first several MCUs decoded by tasklet i will be incorrect, so tasklet i - 1 has to overflow into tasklet i's section to synchronize. Finally, since each tasklet has its own designated memory space, tasklet 0 does one pass through all the memory spaces and concatenates the decoded MCUs into a contiguous memory space. For more information on how this strategy/algorithm works, please refer to the following article. This concludes the decode-Huffman-bitstream portion of the program.
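A heavily simplified per-tasklet sketch of this strategy is shown below. It uses the UPMEM SDK's me() and barrier primitives, but the decoding helpers (decode_next_mcu, section_exhausted, mcu_equal) and the buffers are placeholders standing in for the actual PIM-JPEG data structures, so this is pseudocode-style C rather than the repository's implementation:

  #include <defs.h>       /* me(), NR_TASKLETS            */
  #include <barrier.h>    /* BARRIER_INIT, barrier_wait   */

  #define MAX_MCUS_PER_SECTION 4096        /* illustrative bound only      */

  BARRIER_INIT(sync_barrier, NR_TASKLETS);

  /* Placeholders standing in for the real decoder state and helpers.     */
  typedef struct { short coeff[64]; } MCU;
  MCU out[NR_TASKLETS][MAX_MCUS_PER_SECTION];
  int count[NR_TASKLETS];                  /* MCUs decoded per tasklet     */
  MCU decode_next_mcu(int tasklet, int overflow);
  int section_exhausted(int tasklet);
  int mcu_equal(const MCU *a, const MCU *b);

  void decode_section(void) {
      int t = me();

      /* Phase 1: each tasklet speculatively decodes its own byte range.
         If the range starts mid-MCU, the first few MCUs will be wrong.   */
      while (!section_exhausted(t))
          out[t][count[t]++] = decode_next_mcu(t, 0);
      barrier_wait(&sync_barrier);

      /* Phase 2: overflow into the next tasklet's range. Our MCUs there
         are guaranteed correct, so keep overwriting the next tasklet's
         output until one of ours matches one of its MCUs (sync point).   */
      if (t + 1 < NR_TASKLETS) {
          int k = 0;
          MCU m = decode_next_mcu(t, 1);
          while (!mcu_equal(&m, &out[t + 1][k])) {
              out[t + 1][k++] = m;         /* replace an incorrect MCU    */
              m = decode_next_mcu(t, 1);
          }
      }
      barrier_wait(&sync_barrier);
      /* Tasklet 0 then concatenates out[0..NR_TASKLETS-1] into one
         contiguous buffer; every MCU is valid at this point.             */
  }

Because each tasklet synchronizes within roughly 5 to 20 MCUs, the overflow work in phase 2 is small compared to the bulk decoding in phase 1, which keeps the parallel speed-up close to the number of tasklets.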

After that, each tasklet handles the dequantization, IDCT, and YCbCr to RGB conversion of a roughly equal number of MCUs. Dequantization is simple, as each of the 64 values in each MCU just has to be multiplied by the corresponding value of the quantization table. For the IDCT, since the DPU does not support floating point operations, the cosine function cannot be evaluated directly; instead, an integer-only algorithm such as the AAN (Arai-Agui-Nakajima) fast IDCT can be used to approximate the results of a floating-point IDCT. Lastly, the YCbCr values are converted to RGB using the standard conversion formula. BMP files can then be written from the RGB values to validate that decoding was performed correctly, and further preprocessing such as scaling and cropping can be carried out.
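For example, the color conversion can be done entirely in integer arithmetic by pre-scaling the JFIF coefficients by 2^16 and shifting the products back down. The sketch below shows this common fixed-point trick; the exact constants and rounding are not necessarily those used in the PIM-JPEG repo:

  #include <stdint.h>

  static uint8_t clamp_u8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

  /* Fixed-point YCbCr -> RGB: the JFIF coefficients 1.402, 0.344136,
     0.714136, and 1.772 are rounded to 16.16 fixed point. An arithmetic
     right shift of negative values is assumed.                          */
  static void ycbcr_to_rgb(int y, int cb, int cr,
                           uint8_t *r, uint8_t *g, uint8_t *b) {
      cb -= 128;
      cr -= 128;
      *r = clamp_u8(y + ((91881 * cr) >> 16));               /* 1.402  * 65536 */
      *g = clamp_u8(y - ((22554 * cb + 46802 * cr) >> 16));   /* 0.3441, 0.7141 */
      *b = clamp_u8(y + ((116130 * cb) >> 16));               /* 1.772  * 65536 */
  }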

JPEG Scaling and Cropping

Image scaling can be done by factors that are powers of 2. Scaling by a factor of 2 is simple: each 2 x 2 block of pixels in the original image becomes a single pixel whose value is the average of the RGB values of those 4 pixels. Scaling by a factor of 4, 8, 16 and so on just repeats this process.
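A sketch of the 2x downscale for a single channel is shown below; the flat-array interface is illustrative, whereas the real code operates on MCU-organized data in DPU memory:

  #include <stdint.h>

  /* Halve a w x h single-channel image: each output pixel is the integer
     average of the corresponding 2 x 2 block of input pixels.            */
  static void downscale_by_2(const uint8_t *src, int w, int h, uint8_t *dst) {
      int ow = w / 2, oh = h / 2;
      for (int y = 0; y < oh; y++)
          for (int x = 0; x < ow; x++) {
              int sum = src[(2 * y)     * w + 2 * x] + src[(2 * y)     * w + 2 * x + 1]
                      + src[(2 * y + 1) * w + 2 * x] + src[(2 * y + 1) * w + 2 * x + 1];
              dst[y * ow + x] = (uint8_t)(sum / 4);
          }
  }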

Image cropping and horizontal flipping are even more straightforward, as they just rearrange the positions of the pixels within the memory space.

Future Tasks/Improvements

  1. Implement scaling by factors of powers of 3 (3, 9, 27…) and other factors if necessary. Should be similar to factors of powers of 2
  2. Find a way to implement the tasks listed as intractable in the Image Preprocessing Tasks section
  3. Improve the algorithm
    • Instead of decoding first and then scaling and cropping, we can combine these steps to save memory and time
  4. Implement decoding of multiple JPEGs on 1 DPU
  5. Investigate the optimal number of tasklets for PIM-JPEG. Perhaps DPUs decoding different numbers of JPEGs have different optimal numbers of tasklets
  6. Investigate whether the loss in quality of cropping and then scaling vs scaling and then cropping affects the neural network training results
  7. Measure % of total time spent on serial vs parallel components of the program

Reference

[1] AlexNet: Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems. 1097-1105.