PIM-PREP

Overview

Accelerating the Preprocessing Data Pipeline of Deep Learning Applications Through Processing In Memory:

Data preprocessing pipelines in deep learning (DL) applications have become a major bottleneck that directly affects the overall performance of model training. Preprocessing tasks are normally executed by CPUs, which must steadily feed the transformed or augmented data to the GPUs that perform the core training computations. Many preprocessing tasks in current DL models are memory-bound, and the CPUs executing them cannot sustain a continuous flow of data to the GPUs, so the GPUs sit idle instead of performing training computations. NVIDIA's Data Loading Library (DALI) partly mitigates the shortcomings of the preprocessing pipeline by offloading some preprocessing tasks to the GPUs. However, using GPUs for preprocessing leaves less GPU compute and memory capacity available for training. Previous work has also demonstrated that even with DALI, pipeline stalls persist for many applications.

DPUs (DRAM Processing Units) have been shown to outperform both CPUs and GPUs on memory-bound tasks. This project aims to investigate the preprocessing pipelines of DL applications and identify tasks that are viable candidates for offloading to DPUs. We plan to analyze several DL models across multiple tasks and training sets. For example, in image classification, preprocessing consists of multiple operations including image decoding, resizing, and cropping; other DL tasks, such as object detection or audio classification, use different preprocessing operations. By first providing DPU implementations of several standalone preprocessing workloads and analyzing their characteristics, we can assess their viability for DPU offloading. Once these standalone workloads are comprehensively assessed, we plan to test our DPU implementations holistically by integrating them into their corresponding DL tasks and evaluating our solution on a server configuration consisting of multiple CPUs, GPUs, and DPUs. We plan to compare our solution with previous work that uses DALI, and with CoorDL (a CPU-only solution that improves on DALI). We also plan to compare against a configuration that simply adds more CPUs and GPUs for preprocessing, without offloading to DPUs or using custom solutions such as CoorDL; such a "naive" configuration gives a more comprehensive basis for analyzing performance/cost trade-offs.
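
To make the image-classification example concrete, the sketch below shows what a typical preprocessing chain (decode, resize/crop, normalize with random mirroring) might look like when expressed with DALI's fn API. The operator choices, the 224x224 crop size, and the ImageNet-style mean/std values are illustrative assumptions, not part of this project's implementation:

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types


# Illustrative only: a common ImageNet-style preprocessing chain.
@pipeline_def
def classification_preprocessing(image_dir):
    jpegs, labels = fn.readers.file(file_root=image_dir, random_shuffle=True)
    images = fn.decoders.image(jpegs, device='cpu')            # JPEG decoding
    images = fn.random_resized_crop(images, size=(224, 224))   # resize + random crop
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        mirror=fn.random.coin_flip())                          # normalize + random horizontal flip
    return images, labels

Each of these operators is a candidate for characterization as a standalone workload before considering it for DPU offloading.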

Compiling DALI

Detailed instructions can be found in the DALI documentation. You need CUDA and Docker installed. The following commands build a debug version of DALI and install the resulting Python wheel.

git clone --recursive https://github.com/NVIDIA/DALI
cd DALI/docker
CUDA_VERSION=11.0 CMAKE_BUILD_TYPE=Debug ./build.sh
cd ../build-docker-Debug-110_x86_64/
python3 -m pip install nvidia_dali_cuda110-1.4.0.dev0-12345-py3-none-manylinux2014_x86_64.whl
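
The exact wheel filename depends on the DALI version and the build number generated by the build. To confirm the installation, a quick check (assuming the package exposes __version__, as recent DALI releases do) is:

python3 -c "import nvidia.dali; print(nvidia.dali.__version__)"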

The following code (test.py) runs a simple example to compare running the Flip operation on the CPU and the GPU:

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
from timeit import default_timer as timer

# replace DALI_DIR with the path to the cloned DALI repository
image_dir = "DALI_DIR/docs/examples/data/images"


@pipeline_def
def cpu_flip_pipeline():
    jpegs, labels = fn.readers.file(
        file_root=image_dir, random_shuffle=True, initial_fill=21)
    images = fn.decoders.image(jpegs, device='cpu')  # decode JPEGs on the CPU
    images = fn.flip(images, horizontal=1)           # horizontal flip on the CPU

    return images, labels


@pipeline_def
def gpu_flip_pipeline():
    jpegs, labels = fn.readers.file(
        file_root=image_dir, random_shuffle=True, initial_fill=21)
    # 'mixed' decodes JPEGs with the GPU-accelerated decoder; the flip then also runs on the GPU
    images = fn.decoders.image(jpegs, device='mixed')
    images = fn.flip(images.gpu(), horizontal=1)

    return images, labels


def speedtest(pipeline, batch, n_threads):
    pipe = pipeline(batch_size=batch, num_threads=n_threads, device_id=0)
    pipe.build()
    # warmup
    for _ in range(5):
        pipe.run()
    # test
    n_test = 10
    t_start = timer()
    for _ in range(n_test):
        pipe.run()
    t = timer() - t_start
    print("Speed: {} imgs/s".format((n_test * batch)/t))


def main():
    print("This is the beginning...")

    test_batch_size = 128
    speedtest(cpu_flip_pipeline, test_batch_size, 4)
    speedtest(gpu_flip_pipeline, test_batch_size, 4)


if __name__ == "__main__":
    main()

Run the code with:

python3 test.py

You should see output similar to the following (the first speed is from the CPU pipeline, the second from the GPU pipeline; the absolute numbers depend on your hardware):

This is the beginning...
read 21 files from 2 directories
Speed: 467.83428747655535 imgs/s
read 21 files from 2 directories
Speed: 1559.1448009312685 imgs/s