Massively parallel GPUs, such as NVIDIA's GTX series, are nowadays the de facto standard for accelerating data-parallel functions such as image transforms and motion estimation. In image & video compression, however, the last encoding stage - entropy coding - is typically pushed back to the host CPU, resulting in costly transfers of uncompressed data between the accelerator and host memory. The inherent data dependencies in lossless compression algorithms make them considerably harder to parallelize for data-parallel architectures than the inherently data-parallel transforms.
My research on compression algorithms focuses on novel approaches to the design and implementation of parallel algorithms for entropy coding. Understanding data dependencies helps in designing parallel lossless compression algorithms that execute efficiently on modern GPGPUs. A simple and interesting example is parallel variable-length encoding (Huffman coding), where we achieved a speed-up of 30-50x on a GTX285 through fine-grain parallelization, re-ordering/re-combining algorithm operations, and the use of parallel primitives in place of the default serial operations. The attached paper presents an efficient way of dealing with data dependencies in the VLE and RLE algorithms, and shows two parallel algorithms specifically designed for efficient execution on a massively parallel architecture such as NVIDIA's GPGPUs. (Poster)
Novel massively parallel architectures require novel algorithm design strategies!
Parallelization of fundamental compression algorithms for massively parallel architectures:
- PAVLE - Variable-Length Encoding (VLE) methods, such as Huffman coding, construct codewords on the basis of symbol probabilities and then replace the input data with the matching codes. Although optimal prefix codes can be efficiently computed and assigned to the input data in parallel, fine-grain parallelization is complicated by data dependencies in computing destination memory locations and in storing fractional-word-length data (RAW data hazards). However, since codeword matching can be done in one parallel step, we can resolve the data dependencies in VLE by running a prefix sum (scan) on the assigned codeword lengths. The scan produces bit offsets into the destination array, which are then used to find the destination memory location and the target bit position inside the word. Fine-grain parallelization of VLE also introduces data races, which occur when adjacent codes are written to the same memory location by different threads. By using hardware support for atomic bit-wise operations for parallel bit I/O, we enable threads performing bit-level manipulations on the same memory locations to execute without race conditions, completing the VLE encoding process entirely in parallel.
For more details see: (Parallel Variable-Length Encoding on GPGPUs PDF, Code). Many thanks to Professor Glenn R. Luecke for his feedback and advice on how to improve the paper!
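The scan-based offset computation can be illustrated with a small sequential Python sketch. The codebook here is hypothetical, and every loop is written sequentially for clarity; on the GPU, codeword matching, the prefix sum, and the bit-level writes each run as parallel steps, with atomic OR preventing races on shared words:

```python
# Sequential sketch of the parallel VLE scheme described above.
# Hypothetical codebook: symbol -> (bit string, codeword length in bits).
codebook = {'a': ('0', 1), 'b': ('10', 2), 'c': ('11', 2)}

def vle_encode(data, codebook, word_bits=32):
    codes = [codebook[s] for s in data]        # codeword matching: one parallel step
    lengths = [length for _, length in codes]
    # exclusive prefix sum (scan) over codeword lengths -> per-symbol bit offsets
    offsets = [0]
    for length in lengths[:-1]:
        offsets.append(offsets[-1] + length)
    total_bits = offsets[-1] + lengths[-1]
    words = [0] * ((total_bits + word_bits - 1) // word_bits)
    for (bits, _), off in zip(codes, offsets):
        # each "thread" ORs its codeword into the destination word(s);
        # on the GPU this is an atomicOr, so threads sharing a word cannot race
        for i, b in enumerate(bits):
            if b == '1':
                pos = off + i
                words[pos // word_bits] |= 1 << (word_bits - 1 - pos % word_bits)
    return words, total_bits
```

For example, encoding "abc" concatenates the codes 0, 10, 11 into the 5-bit stream 01011, packed into the high bits of the first 32-bit word.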
- PARLE - Run-Length Encoding (RLE) is a simple yet frequently used lossless compression method. RLE scans the input data and stores each input symbol together with the length of its run. A parallel reduction could be used to sum up the number of times each symbol appears in its run. Instead of accumulating the number of occurrences per symbol, however, we identify the elements at the start/end of each run and generate indicator flags. The indices of the flagged elements are then used to compute, in parallel, the number of elements lying between these locations; the resulting count corresponds to the number of times an element appeared in its run. For more details see (Fine-Grain Parallelization of Entropy Coding on GPGPUs PDF)
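The flag-and-scan idea can be sketched sequentially in Python; each loop below corresponds to one data-parallel step (flagging, scan, scatter) on the GPU:

```python
# Sequential sketch of the flag-and-scan RLE scheme described above.
def parallel_style_rle(data):
    n = len(data)
    # 1) one "thread" per element: flag the start of each run
    flags = [1 if i == 0 or data[i] != data[i - 1] else 0 for i in range(n)]
    # 2) inclusive prefix sum (scan) over the flags -> output index of each run
    scan, acc = [], 0
    for f in flags:
        acc += f
        scan.append(acc)
    num_runs = scan[-1] if n else 0
    # 3) scatter the position of each run head by its scanned index;
    #    run length = distance to the next run head
    heads = [0] * num_runs
    for i in range(n):
        if flags[i]:
            heads[scan[i] - 1] = i
    heads.append(n)
    return [(data[heads[k]], heads[k + 1] - heads[k]) for k in range(num_runs)]
```

For instance, the input "aaabbc" yields the flags 1,0,0,1,0,1, run heads at positions 0, 3, 5, and the runs ('a', 3), ('b', 2), ('c', 1).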
- PARC - Arithmetic coding is something of a black-magic compression algorithm: it represents the whole input data stream by a single real number in the interval [0, 1). As the input data stream becomes longer, the interval required to represent it gets smaller and smaller, and the number of bits needed to specify the interval increases. Successive symbols in the input stream reduce this interval in accordance with their probabilities. By allowing fractional-bit codeword lengths, arithmetic coding approaches the theoretical entropy bound on compression efficiency, and thus provides better compression ratios than Huffman coding on input data with highly skewed symbol probabilities. We implemented this algorithm using a simple SPMD approach, since the bit I/O posed implementation challenges for finer-grain parallelization. For more information see: (ParSim@EuroMPI08 paper: Using Arithmetic Coding for Reduction of Resulting Simulation Data Size on Massively Parallel GPGPUs PDF).
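The interval-narrowing step can be illustrated in Python with exact rational arithmetic and a hypothetical two-symbol model. A real coder works with finite precision and renormalizes the interval, emitting bits incrementally - exactly the bit I/O that makes fine-grain parallelization hard:

```python
from fractions import Fraction

# Illustrative interval narrowing of arithmetic coding (exact arithmetic).
# Hypothetical model: symbol -> cumulative probability range [low, high).
model = {'a': (Fraction(0), Fraction(3, 4)),
         'b': (Fraction(3, 4), Fraction(1))}

def ac_interval(data, model):
    low, high = Fraction(0), Fraction(1)
    for s in data:
        span = high - low
        s_low, s_high = model[s]
        # narrow [low, high) to the sub-interval assigned to symbol s;
        # a more probable symbol shrinks the interval less (fewer bits)
        low, high = low + span * s_low, low + span * s_high
    return low, high
```

Encoding "ab" narrows [0, 1) first to [0, 3/4) and then to [9/16, 3/4); any number in the final interval identifies the input stream.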
Besides basic research, I had the great pleasure of supervising a few excellent students on several courses and theses. It all started when I had to come up with some exercises for a new parallel programming course (for yesterday ;). As I love practical work, we introduced a new format that included hands-on project work on the most modern NVIDIA GPGPUs of the time. It turned out to be a complete success, and in 2008/2009 alone we implemented tons of cool CUDA/GPU pilot projects - ranging from steganography (hiding information in images), BZIP compression (MTF, AC/Huffman & Burrows-Wheeler Transform), and JPEG encoding (DCT, quantization, Huffman) to a rock field simulation.
These projects were an excellent starting point for further exploration, and several larger student projects/theses followed:
- Re-configurable GPGPU Encoder - Design and implementation of a re-configurable GPGPU codec supporting compile-time selection of the compression algorithms to be used in subsequent encoder passes (a selection of front-end transformations such as spatial prediction, DCT, and quantization, and back-end compression schemes such as Huffman coding, AC, and RLE). It builds on the optimized GPU compression algorithms described above and includes a very efficient Huffman encoder, optimized to the last bit by Tjark. Completed as an MSc thesis by Tjark Bringewat, University of Stuttgart, under my guidance.
- CUJ2K - Converts BMP images into the JPEG2000 format using your NVIDIA GPU. The JPEG2000 encoder is an extremely complex beast featuring a wavelet transform, a complicated entropy coder (based on binary arithmetic coding), adaptive quantization, rate control, and many more features. This complete implementation of a baseline JPEG2000 encoder was finished in less than 6 months as a student project by a team of 4 brilliant guys under my guidance (a.k.a. agile project management ;) at the University of Stuttgart. For more details, please see the preview at the CUDA Zone or the full code & docs at the project page CUJ2K
- Visual Codec Simulator - VCS is a graphical tool for the design and simulation of complete codecs. It started as a small seminar project but grew into a very nice & useful tool thanks to the excellent coding skills of Armin Cont. We designed it to support design-time reconfiguration, an arbitrary number of codec stages, selection of different/multiple compression algorithms for each coding block, and specification of coding-block characteristics for simulation on different target architectures, e.g. CPU/GPU. This project is a work in progress and we are looking for more people interested in collaborating on different parts of the project. If this is something for you, please contact me at firstname.lastname@example.org
Some interesting links:
- A Concise Introduction to Data Compression by David Salomon - A witty walk through the jungle of compression algorithms
- data-compression.info - The most extensive website on data compression algorithms
- Arithmetic Coding - An excellent intro to AC by Arturo Campos
Related CUDA/GPU projects:
- Dirac - An open-source codec, one of the first CUDA implementations of algorithms used in codecs (focused on the wavelet transform).
- Motion Estimation@MIT - H.264 Motion Estimation in CUDA