Massively parallel GPUs, such as NVIDIA's GTX series, are nowadays the de facto standard for accelerating data-parallel functions such as image transforms and motion estimation. However, in the area of image and video compression, the last encoding stage, entropy coding, is typically pushed back to the host CPU, resulting in costly transfers of uncompressed data between the accelerator and host memory. The inherent data dependencies in lossless compression algorithms make them considerably harder to parallelize for data-parallel architectures than the inherently data-parallel transforms.
My research on compression algorithms has focused on novel approaches to the design and implementation of parallel algorithms for entropy coding. Understanding data dependencies helps in designing parallel lossless compression algorithms that can be executed efficiently on modern GPGPUs. A simple and interesting example is parallel variable-length encoding (Huffman coding), where we achieved a speedup of 30-50x on a GTX 285 through fine-grain parallelization, reordering/recombining algorithm operations, and the use of parallel primitives instead of the default serial algorithm operations. The attached paper presents an efficient way of dealing with data dependencies in the VLE and RLE algorithms, and shows two parallel algorithms specifically designed for efficient execution on a massively parallel architecture such as NVIDIA's GPGPUs. (Poster)
Novel massively parallel architectures require novel algorithm design strategies!
Parallelization of fundamental compression algorithms for massively parallel architectures:
- PAVLE: Variable-Length Encoding (VLE) methods, such as Huffman coding, construct codewords on the basis of symbol probabilities, and then replace input data with the matching codes. Although optimal prefix codes can be efficiently computed and assigned to input data in parallel, a difficulty for fine-grain parallelization arises from data dependencies in computing destination memory locations and storing fractional-word-length data (RAW data hazards). However, since codeword matching can be done in one parallel step, we can resolve the data dependencies in VLE by using a prefix sum on the assigned codeword lengths. The scan produces bit offsets into the destination array, which are then used to find the destination memory location and the target bit position inside the word. Fine-grain parallelization of VLE introduces the problem of data races that occur when adjacent codes are written to the same memory location by different threads. By using hardware support for atomic bitwise operations for parallel bit I/O, we enable threads performing bit-level manipulations on the same memory locations to execute without race conditions, and complete the VLE encoding process in parallel.
For more details see: Parallel Variable-Length Encoding on GPGPUs (PDF, Code). Many thanks to Professor Glenn R. Luecke for his feedback and advice on how to improve the paper!
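The scan-then-scatter scheme above can be illustrated with a small sequential Python sketch. It emulates the three parallel steps (codeword lookup, exclusive prefix sum for bit offsets, atomic-OR bit scatter); the function and variable names are illustrative, not taken from the paper:

```python
# Sequential sketch of the parallel VLE scheme: an exclusive prefix sum
# over codeword lengths yields each symbol's destination bit offset, and
# word-level ORs (atomicOr on the GPU) merge codes sharing a 32-bit word.
from itertools import accumulate

def parallel_vle_sketch(symbols, codebook):
    # Step 1 (one parallel step on the GPU): per-symbol code lookup.
    codes   = [codebook[s][0] for s in symbols]   # integer codeword values
    lengths = [codebook[s][1] for s in symbols]   # codeword bit lengths

    # Step 2 (parallel scan on the GPU): exclusive prefix sum -> bit offsets.
    offsets = [0] + list(accumulate(lengths))[:-1]

    # Step 3 (parallel scatter with atomicOr on the GPU): pack bits into words.
    total_bits = offsets[-1] + lengths[-1]
    out = [0] * ((total_bits + 31) // 32)
    for code, length, off in zip(codes, lengths, offsets):
        for i in range(length):                   # emit bits MSB-first
            bit = (code >> (length - 1 - i)) & 1
            pos = off + i
            out[pos // 32] |= bit << (31 - pos % 32)  # OR is race-free
    return out, total_bits

# Toy Huffman codebook: symbol -> (code value, code length in bits)
book = {"a": (0b0, 1), "b": (0b10, 2), "c": (0b11, 2)}
words, nbits = parallel_vle_sketch("aabc", book)   # bitstream 0|0|10|11
```

On the GPU each symbol is handled by one thread, so the inner bit loop disappears and only the atomic OR serializes writes to a shared word.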
- PARLE: Run-Length Encoding (RLE) is a simple yet frequently used lossless compression method. RLE scans the input data and stores each input symbol together with the length of its run. A parallel primitive, reduction, could be used to sum up the number of times each symbol appears in its run. However, instead of accumulating the number of occurrences for each symbol in parallel, we identify the elements at the start/end of each run and generate indicator flags. The indexes of the flagged elements are then used to compute, in parallel, the total number of elements between these locations: the resulting count corresponds to the number of times an element appeared in its run. For more details see Fine-Grain Parallelization of Entropy Coding on GPGPUs (PDF).
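A minimal sequential Python sketch of this flag-based approach (names are illustrative; each list comprehension stands in for a parallel map, scan, or compaction pass on the GPU):

```python
# Flag-based parallel RLE sketch: mark run starts, compact the flagged
# indices, then take differences of adjacent start positions to obtain
# run lengths without any per-symbol accumulation.
def parallel_rle_sketch(data):
    n = len(data)
    if n == 0:
        return []
    # Step 1 (parallel map): flag elements that start a new run.
    flags = [1 if i == 0 or data[i] != data[i - 1] else 0 for i in range(n)]
    # Step 2 (parallel scan + compaction): indices of the run starts.
    starts = [i for i, f in enumerate(flags) if f]
    # Step 3 (parallel): run length = distance to the next run start.
    ends = starts[1:] + [n]
    return [(data[s], e - s) for s, e in zip(starts, ends)]

pairs = parallel_rle_sketch("aaabbc")   # three runs: a*3, b*2, c*1
```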
- PARC: Arithmetic coding is something of a black-magic compression algorithm: it represents the whole input data stream by one and only one real number in the interval [0, 1). As the input data stream becomes longer, the interval required to represent it gets smaller and smaller, and the number of bits needed to specify the interval increases. Successive symbols in the input data stream reduce this interval in accordance with their probabilities. By allowing fractional codeword lengths, arithmetic coding attains the theoretical entropy bound on compression efficiency, and thus provides better compression ratios than Huffman coding on input data with highly skewed symbol probabilities. We implemented this algorithm using a simple SPMD approach, since the bit I/O posed implementation challenges. For more information see the ParSim@EuroMPI08 paper: Using Arithmetic Coding for Reduction of Resulting Simulation Data Size on Massively Parallel GPGPUs (PDF).
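The interval-narrowing idea can be shown with a toy Python example using exact fractions (no renormalization or bit output, which is where the real implementation difficulty lies; the model and names are illustrative only):

```python
# Toy illustration of arithmetic coding's interval narrowing: each symbol
# shrinks the current sub-interval of [0, 1) in proportion to its
# probability, so the final width is the product of symbol probabilities
# and needs about -log2(width) bits to specify.
from fractions import Fraction

def narrow_interval(symbols, model):
    # model: symbol -> (cumulative probability, probability)
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        cum, p = model[s]
        low += width * cum      # shift to the symbol's sub-interval
        width *= p              # shrink by the symbol's probability
    return low, width           # any number in [low, low+width) encodes the stream

# Skewed model: P(a) = 3/4, P(b) = 1/4 -> cumulative probabilities 0 and 3/4.
model = {"a": (Fraction(0), Fraction(3, 4)),
         "b": (Fraction(3, 4), Fraction(1, 4))}
low, width = narrow_interval("aab", model)   # interval [27/64, 36/64)
```

With this skewed model the three-symbol stream needs only about 2.8 bits (width 9/64), whereas Huffman coding must spend at least one whole bit per symbol.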
Besides basic research, I had the great pleasure of supervising a few excellent students on several courses and theses. It all started when I had to come up with some exercises for a new parallel programming course (needed for yesterday ;). As I love practical work, we introduced a new format that included hands-on project work on the most modern NVIDIA GPGPUs at the time. It turned out to be a complete success, and in 2008/2009 alone we implemented tons of cool CUDA/GPU pilot projects, ranging from steganography (hiding information in images), BZIP compression (MTF, AC/Huffman & Burrows-Wheeler Transform), and JPEG encoding (DCT, quantization, Huffman) to a rock field simulation.
These projects were an excellent starting point for further exploration, and they grew into several larger student projects and theses:
- Reconfigurable GPGPU Encoder: Design and implementation of a reconfigurable GPGPU codec supporting compile-time selection of the compression algorithms to be used in subsequent encoder passes (a selection of front-end transformations such as spatial prediction, DCT, and quantization, and back-end compression schemes such as Huffman coding, AC, and RLE). It builds on the optimized GPU compression algorithms described above, and includes a very efficient implementation of the Huffman encoder, optimized to the limit by Tjark. Completed as an MSc thesis by Tjark Bringewat, University of Stuttgart, under my guidance.
- CUJ2K: Converts BMP images into JPEG2000 format with your NVIDIA GPU. The JPEG2000 encoder is an extremely complex beast, featuring a wavelet transform, a complicated entropy coder (based on binary arithmetic coding), adaptive quantization, rate control, and many more features. This is a complete baseline JPEG2000 encoder implemented in less than 6 months as a student project by a team of 4 brilliant guys under my guidance (a.k.a. agile project management ;) at the University of Stuttgart. For more details, please see the preview at the CUDA Zone or the full code & docs at the project page CUJ2K.
- Visual Codec Simulator: VCS is a graphical tool for the design and simulation of complete codecs that started as a small seminar project but grew into a very nice and useful tool thanks to the excellent coding skills of Armin Cont. We designed it to support design-time reconfiguration, an arbitrary number of codec stages, selection of different/multiple compression algorithms for each coding block, and specification of coding-block characteristics for simulation on different target architectures, e.g. CPU/GPU. This project is a work in progress and we are looking for more people interested in collaborating on different parts of the project. If this is something for you, please contact me at ana.balevic@gmail.com.
Some interesting links:
Compression stuff:
- A Concise Introduction to Data Compression by David Salomon: A witty walk through the jungle of compression algorithms.
- datacompression.info: The most extensive website on data compression algorithms.
- Arithmetic Coding: An excellent intro to AC by Arturo Campos.
Parallel programming:
Related CUDA/GPU projects:
- Dirac: Open-source codec; one of the first CUDA implementations of algorithms used in codecs (with a focus on the wavelet transform).
- Motion Estimation@MIT: H.264 motion estimation in CUDA.
