How to compare GPU vs CPU performance for fast image processing

I.   INTRODUCTION

Over the past decade, there have been many technical advances in GPUs (graphics processing units), so they can successfully compete with established solutions (for example, CPUs, or central processing units) and be used for a wide range of tasks, including fast image processing.

In this article, we will discuss the capabilities of GPUs and CPUs for performing fast image processing tasks. We will compare two processors and show the advantages of GPU over CPU, as well as explain why image processing on a GPU can be more efficient when compared to similar CPU-based solutions.

In addition, we will go through some common misconceptions that prevent people from using a GPU for fast image processing tasks.

II. ABOUT FAST IMAGE PROCESSING ALGORITHMS

For the purposes of this article, we'll focus specifically on fast image processing algorithms that share three characteristics: locality, good potential for parallelization, and modest precision requirements.

 Here’s a brief description of each characteristic:

  • Locality. Each pixel is calculated based on a limited number of neighboring pixels.
  • Good potential for parallelization. Each pixel does not depend on the data from the other processed pixels, so tasks can be processed in parallel.
  • 16/32-bit precision arithmetic. Typically, 32-bit floating point arithmetic is sufficient for image processing and a 16-bit integer data type is sufficient for storage.
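To make these three characteristics concrete, here is a minimal sketch of a typical fast image processing operation, a 3×3 mean filter. It uses NumPy (function and variable names are our own illustration, not from any particular library): each output pixel depends only on its 3×3 neighborhood (locality), no output pixel depends on any other output pixel (parallelizability), and 32-bit floats for arithmetic with 16-bit integers for storage are sufficient.

```python
import numpy as np

def box_blur_3x3(img):
    """3x3 mean filter: each output pixel depends only on its 3x3
    neighborhood (locality) and on no other output pixel, so in
    principle every pixel could be computed in parallel."""
    padded = np.pad(img.astype(np.float32), 1, mode="edge")
    out = np.zeros(img.shape, dtype=np.float32)
    h, w = img.shape
    # Sum the 9 shifted views of the padded image, then average.
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    # 32-bit float arithmetic, 16-bit integer storage.
    return (out / 9.0).astype(np.uint16)

img = np.full((4, 4), 900, dtype=np.uint16)
print(box_blur_3x3(img)[0, 0])  # 900: blurring a constant image changes nothing
```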

Important criteria for fast image processing

The key criteria for fast image processing are:

  • Performance

Maximum performance of fast image processing can be achieved in two ways: either by increasing hardware resources (specifically, the number of processors), or by optimizing the software code. When comparing the capabilities of GPU and CPU, GPU outperforms CPU in the price-to-performance ratio. It’s possible to realize the full potential of a GPU only with parallelization and thorough multilevel (both low-level and high-level) algorithm optimization.

  • Image processing quality

Another important criterion is image processing quality. Several different algorithms may exist for the exact same image processing operation, differing in resource intensity and in the quality of the result. Multilevel optimization is especially important for resource-intensive algorithms, where it yields substantial performance benefits. Once multilevel optimization is applied, advanced algorithms can return results within a reasonable time, comparable to the speed of fast but crude algorithms.

  • Latency

A GPU has an architecture that allows parallel processing of the pixels within a single image, which reduces latency (the time it takes to process one image). A CPU offers comparatively high latency, since parallelism on a CPU is typically implemented at the level of frames, tiles, or image lines rather than individual pixels.

III. GPU vs. CPU: KEY DIFFERENCES

Let's have a look at the key differences between GPU and CPU.

1.   The number of threads on a CPU and GPU

CPU architecture is designed so that each physical CPU core can execute two threads on two virtual cores (simultaneous multithreading). Each thread executes its instructions independently.

At the same time, the number of GPU threads is tens or hundreds of times greater, since these processors use the SIMT (single instruction, multiple threads) programming model, in which a group of threads (usually 32, called a warp) executes the same instruction. Thus a warp can be considered the GPU's equivalent of a single CPU thread.
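The scale of this difference is easy to see with some arithmetic. The sketch below (plain Python, not a real kernel launch; the 16×16 block size is an illustrative choice, and the warp size of 32 is the value mentioned above) counts how many threads and warps a full-HD image would occupy if one GPU thread handled one pixel:

```python
def launch_geometry(width, height, block_x=16, block_y=16, warp_size=32):
    """Illustrative arithmetic only (not a real CUDA call): how many
    threads, blocks, and 32-thread warps an image needs if one GPU
    thread processes one pixel, with 16x16-thread blocks."""
    blocks_x = (width + block_x - 1) // block_x    # round up to cover edges
    blocks_y = (height + block_y - 1) // block_y
    threads = blocks_x * blocks_y * block_x * block_y
    warps = threads // warp_size  # each warp issues one instruction for 32 threads
    return threads, warps

threads, warps = launch_geometry(1920, 1080)
print(threads, warps)  # 2088960 threads in 65280 warps for one full-HD frame
```

Compare those two million threads with the few dozen threads a multi-core CPU runs: this is why image processing, with its per-pixel independence, maps so naturally onto a GPU.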

2. Thread implementation on CPU and GPU

One more difference between GPUs and CPUs is how they hide instruction latency.

A CPU uses out-of-order execution for this purpose, whereas a GPU hides latency through genuine thread rotation, issuing instructions from a different group of threads each cycle. The GPU's method is cheaper to implement in hardware, but it requires the algorithm to be parallel and the load to be high.

Thus it follows that many image processing algorithms are ideal for implementation on a GPU.

IV. ADVANTAGES OF GPU OVER CPU

  • Our own lab research has shown that if we compare ideally optimized software for GPU and for CPU (with AVX2 instructions), the GPU advantage is tremendous: for hardware of the same production year, GPU peak performance is around ten times higher than CPU peak performance for 32-bit and 16-bit data types. The GPU memory subsystem bandwidth is significantly higher as well.
[Figure: GPU vs. CPU peak performance ratio for float]
  • If we make the comparison against non-optimized CPU software without AVX2 instructions, the GPU performance advantage can reach 50–100 times.
  • All modern GPUs are equipped with shared memory — memory simultaneously available to all the cores of one multiprocessor, which is essentially a software-controlled cache. This is ideal for algorithms with a high degree of locality. The bandwidth of shared memory is several times higher than that of a CPU's L1 cache.
  • Another important feature of a GPU compared to a CPU is that the number of available registers can change dynamically (from 64 to 256 per thread), thereby reducing the load on the memory subsystem. By comparison, x86 and x64 architectures provide 16 general-purpose registers and 16 AVX registers per thread.
  • A GPU has several hardware modules for simultaneously executing completely different tasks: image processing (ISP) on Jetson, asynchronous copy to and from the GPU, computation on the GPU, video encoding and decoding (NVENC and NVDEC), tensor cores for neural networks, and OpenGL, DirectX, and Vulkan for rendering.

Still, all these advantages of a GPU over a CPU involve a high demand for parallelism of algorithms. While tens of threads are sufficient for maximum CPU load, tens of thousands are required to fully load a GPU.

Embedded applications

Another type of task to consider is embedded solutions. In this case, GPUs are competing with specialized devices such as FPGAs (Field-Programmable Gate Arrays) and ASICs (Application-Specific Integrated Circuits).

The main advantage of GPUs over these devices is significantly greater flexibility. A GPU is a serious alternative for some embedded applications where powerful multi-core CPUs cannot meet size and power-budget requirements.

V. USER MISCONCEPTIONS

1. Users have no experience with GPUs, so they try to solve their problems with CPUs

One of the main user misconceptions is associated with the fact that 10 years ago GPUs were considered inappropriate for high-performance tasks.

But technologies are developing rapidly, and while GPU image processing integrates well with CPU processing, the best results are achieved when fast image processing is done on a GPU.

2. Multiple data copy to GPU and back kills performance

This is another bias among users regarding GPU image processing.

As it turns out, this is a misconception as well: the best solution is to implement all processing on the GPU within one task. The source data is copied to the GPU just once, and the computation results are returned to the CPU only at the end of the pipeline; all intermediate data remains on the GPU. Copying can also be performed asynchronously, in parallel with computations on the next or previous frame.
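The "copy once, keep intermediates on the device" pattern can be sketched as follows. This is a hypothetical skeleton, not a real GPU API: `to_device`/`to_host` stand in for actual transfers (e.g. cudaMemcpy), and the kernels are placeholders, so the flow stays runnable on a CPU.

```python
import numpy as np

def to_device(arr):
    """Stand-in for the single upload at the start of the pipeline."""
    return arr.astype(np.float32)

def denoise(d):
    """Placeholder 'kernel'; its result stays device-side."""
    return d

def sharpen(d):
    """Another placeholder 'kernel' chained on the device."""
    return d * 1.0

def to_host(d):
    """Stand-in for the single download at the end of the pipeline."""
    return d

frame = np.zeros((1080, 1920), dtype=np.uint16)
# One upload, N processing stages, one download -- no round trips between stages.
result = to_host(sharpen(denoise(to_device(frame))))
```

The point of the structure is that however many stages the pipeline has, the PCIe bus is crossed exactly twice per frame.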

3. Small shared memory capacity, which is just 96 KB for each multiprocessor

Despite its small capacity, 96 KB of shared memory per multiprocessor can be sufficient if it is managed efficiently — this is the essence of software optimization for CUDA and OpenCL. It is not possible to simply port code from a CPU to a GPU without taking the specifics of the GPU architecture into account.
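A quick budget check shows why 96 KB usually suffices for local algorithms. The sketch below (our own illustrative arithmetic; the 96 KB figure is the one quoted above and varies by GPU model) verifies that a processing tile plus the halo of neighboring pixels a local filter needs fits comfortably in shared memory:

```python
def tile_fits(tile_w, tile_h, halo, bytes_per_px, shared_kb=96):
    """Check whether a tile plus its halo (the extra border pixels a
    local filter reads) fits in one multiprocessor's shared memory.
    96 KB is the figure from the text; real limits depend on the GPU."""
    needed = (tile_w + 2 * halo) * (tile_h + 2 * halo) * bytes_per_px
    return needed <= shared_kb * 1024, needed

# 64x64 tile, 2-pixel halo (enough for a 5x5 filter), 32-bit floats:
ok, needed = tile_fits(64, 64, halo=2, bytes_per_px=4)
print(ok, needed)  # True 18496 -- well under the 96 KB (98304-byte) budget
```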

4. Insufficient size of the global GPU memory for complex tasks

This is a real constraint. It is addressed, first, by manufacturers, who release new GPUs with larger memory sizes, and second, by implementing a memory manager that reuses GPU global memory.
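The reuse idea can be sketched as a size-bucketed buffer pool. This is a minimal illustration in plain Python (the class and method names are ours): buffers are returned to the pool instead of being freed, so a steady-state pipeline stops allocating after the first frame. In real code, `acquire`/`release` would wrap cudaMalloc/cudaFree; here bytearrays stand in so the sketch stays runnable.

```python
class BufferPool:
    """Minimal reuse-based memory manager: idle buffers are pooled
    by size instead of freed, so repeated frames recycle memory."""

    def __init__(self):
        self._free = {}  # size -> list of idle buffers

    def acquire(self, size):
        bucket = self._free.get(size)
        if bucket:
            return bucket.pop()      # reuse an idle buffer
        return bytearray(size)       # stand-in for cudaMalloc

    def release(self, buf):
        # Return the buffer to its size bucket instead of freeing it.
        self._free.setdefault(len(buf), []).append(buf)

pool = BufferPool()
a = pool.acquire(1024)
pool.release(a)
b = pool.acquire(1024)
assert a is b  # the second frame reuses the first frame's buffer
```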

5. Libraries for processing on the CPU use parallel computing as well

CPUs can indeed work in parallel, through vector instructions such as AVX or via multithreading (for example, with OpenMP). In most cases, however, parallelization is done in the simplest way: each frame is processed in a separate thread, and the code for processing one frame remains sequential. Using vector instructions brings the complexity of writing and maintaining code for different architectures, processor models, and systems. Vendor-specific libraries, such as Intel IPP, are highly optimized; issues arise when the required functionality is missing from them and you have to use third-party open-source or proprietary libraries, which can lack optimization.
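The "simplest way" described above — one frame per thread, sequential code inside each frame — looks like this in Python (a hypothetical sketch; `process_frame` is a placeholder for a real per-frame pipeline):

```python
from concurrent.futures import ThreadPoolExecutor

def process_frame(frame_id):
    """Stand-in for a sequential per-frame processing pipeline."""
    return frame_id * 2

frames = range(8)
# Frame-level parallelism: each frame goes to its own worker thread,
# but nothing inside process_frame runs in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_frame, frames))

print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

This improves throughput (frames per second) but does nothing for latency, since each individual frame is still processed by a single sequential thread.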

Another aspect which is negatively affecting the performance of mainstream libraries is the widespread adoption of cloud computing. In most cases, it’s much cheaper for a developer to purchase additional capacity in the cloud than to develop optimized libraries. Customers request quick product development, so developers are forced to use relatively simple solutions which aren’t the most effective.

Modern industrial cameras generate video streams with extremely high data rates, which often rules out transmitting the data over the network to the cloud for processing, so a local PC is usually used to process the video stream from the camera. That computer must deliver the required performance and, more importantly, it must be purchased in the early stages of the project. Solution performance depends on both hardware and software, so the choice of hardware should also be considered at the initial stages. If mainstream hardware is sufficient, any software can be used; if expensive hardware is required, its cost rises quickly, making optimized software essential for an acceptable price-performance ratio.

Processing data from industrial video cameras involves a constant load. The load level is determined by the algorithms used and camera bitrate. The image processing system should be designed at the initial stages of the project in order to cope with the load within a guaranteed margin, otherwise it will be impossible to process the streams without data loss. This is a key difference from web systems, where the load is unbalanced.
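The constant-load sizing described above reduces to simple arithmetic: the processing rate must exceed the camera's data rate by a guaranteed margin. A sketch (our own illustration; the 1.5× margin is an arbitrary example, not a standard):

```python
def has_margin(cam_mb_per_s, proc_mb_per_s, margin=1.5):
    """Constant-load sizing: the processor must sustain the camera's
    data rate with a safety margin, or frames will eventually be lost.
    The 1.5x margin is an illustrative choice."""
    return proc_mb_per_s >= cam_mb_per_s * margin

# e.g. a 4K @ 60 fps, 8-bit monochrome camera: 3840*2160*60 bytes/s ~ 498 MB/s
cam = 3840 * 2160 * 60 / 1e6
print(has_margin(cam, proc_mb_per_s=1000))  # True: 1000 >= ~746 MB/s required
```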

[Figure: CPU vs. GPU advantages]

VI. SUMMARY

Summing up, we come to the following conclusions:

1. A GPU is an excellent alternative to a CPU for solving complex fast image processing tasks.

2. The performance of optimized image processing solutions on a GPU is much higher than on a CPU. As a confirmation, we suggest that you refer to other articles on the Fastvideo blog, which describe other use cases and benchmarks on different GPUs for commonly used image processing and compression algorithms.

3. GPU architecture allows parallel processing of image pixels which, in turn, leads to a reduction of the processing time for a single image (latency).

4. High GPU performance software can reduce hardware cost in such systems, and high energy efficiency reduces power consumption. The cost of ownership of GPU-based image processing systems is lower than that of systems based on CPU only.

5. A GPU has the flexibility, high performance, and low power consumption required to compete with highly specialized FPGA / ASIC solutions for mobile and embedded applications.

6. Combining the capabilities of CUDA / OpenCL and hardware tensor cores can significantly increase performance for tasks using neural networks.

Read the full version of the article, with in-depth technical details, on the Fastvideo blog:

https://www.fastcompression.com/blog/gpu-vs-cpu-fast-image-processing.htm


Stan Tarnavskii

Senior Scientific Advisor at Macquarie University

Excellent article and important knowledge. Thank you for sharing.

Sami Varjo

Computational Imaging Expert

In the mobile world the game is changing. Sometimes even just waiting to transfer data to the GPU and back is too much. In general the GPU takes just too much juice, and OEM guys do not want to spend an electron more than necessary. Other DSP dies are taking some ground from the GPU - yet it is so much easier to do GPU than, for example, HVX... But as said - things vary with the platform you have to live on...

Natalia Mira Serna

Industry Specialist Solutions Searcher, Sales Manager 🏭👀| STEM Talent Girl Mentor🙋🏻♀️| Speaker🎤| Smart Manufacturing Certified💡🤖| Computer Vision Lover 📷❤️

Great and interesting article. It is important to know how we will deploy our computer vision systems and one of the most important facts are the tools we are going to use preventing future optimization complications.

Anusree Sarkar

Computer Scientist 2, Adobe DVA (Color team) | Ex-Qualcomm | BITS Pilani

Great article. Precise and useful for getting a high-level idea of the benefits of GPUs over CPUs in the image processing field. One thing I would like to add, though, is that these days there is a preference for using a DSP over a GPU, as GPUs tend to be power-hungry for real-time processing on embedded processors.
