Grayscale BMP Images: A Practical Yet Gentle CUDA Tutorial for Beginners

Last updated on September 21, 2025

In this blog post, we will implement a simple yet effective grayscale converter in both plain C and CUDA, compare their performance, and see the power of parallel processing.

First, we will briefly go through the BMP file format, the C implementation, and the grayscale formula. Then we will cover some basic CUDA concepts and see how they shape our CUDA version. Finally, we will set up timers and run experiments on large images to measure the performance gap and see the power of parallelization. References and resources used in this post are listed at the end, along with runnable source code in the appendix.

This post assumes that you have a CUDA-capable computer with gcc and nvcc (the CUDA compiler driver) installed.

Understanding the BMP Format

We choose BMP as our input and output format for the following reasons:

The BMP format is simple and easy to understand. Apart from some headers at the beginning, it simply stores all pixels sequentially, in blue, green, and red order. It supports uncompressed pixel data, and each row is only padded up to a multiple of 4 bytes, so an image like our 1024-pixel-wide example (3072 bytes per row) has no padding at all. This lets us focus on pixel manipulation. That’s why it is named the Bit Map (BMP) format.
The BMP format is supported by most image viewers, although it is a pretty old format. So we can easily inspect our results, with nothing hidden inside.

A clear explanation of the BMP format can be found in The BMP File Format. A BMP file is mainly composed of 4 parts: Header, InfoHeader, ColorTable, and Pixels.

As we are simply grayscaling the image, we mainly pay attention to 3 values in the non-pixel part: DataOffset in Header (where the pixel data starts), Width in InfoHeader (the number of pixels per row), and Height in InfoHeader (the number of rows). We also confirm that the file is an uncompressed 24-bit image by checking BPP in InfoHeader (bits per pixel, expecting 24 for 24-bit RGB, i.e., 3 bytes per pixel stored in BGR order) and Compression in InfoHeader (0 for no compression). For simplicity, the code in this post treats the pixel data as Width × Height × 3 contiguous bytes, which holds when Width × 3 is a multiple of 4 and the rows therefore need no padding. Finally, we can safely copy the non-pixel part from the input image to the output image, as the file size and other metadata do not change during grayscaling.

You can try implementing this part yourself, as a gentle recap of C file I/O and pointer manipulation. BMP files are little-endian, so be careful when reading integers. You can also refer to the header-parsing code at the beginning of main() in the Appendix Source Code: Plain C Version. This part is shared between the plain C implementation and the CUDA C implementation, as CUDA C is a superset of standard C.
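As a starting point for that exercise, here is a minimal sketch of reading the fields mentioned above from the first 54 bytes of a file (the 14-byte Header followed by the 40-byte InfoHeader). The BmpInfo struct and the names read_le32(), read_le16(), parse_bmp_header(), and row_stride_24bpp() are hypothetical helpers made up for this illustration; the byte offsets are the same ones used by the appendix code.

```c
#include <stdint.h>

/* Hypothetical container for the header fields this post cares about. */
typedef struct {
    uint32_t data_offset; /* Where the pixel data starts. */
    uint32_t width;       /* Pixels per row. */
    uint32_t height;      /* Number of rows. */
    uint16_t bpp;         /* Bits per pixel, 24 expected. */
    uint32_t compression; /* 0 means uncompressed. */
} BmpInfo;

/* Assemble a little-endian 32-bit value byte by byte, independent of host endianness. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Assemble a little-endian 16-bit value. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Parse the first 54 bytes of a BMP file (Header + InfoHeader). */
static BmpInfo parse_bmp_header(const unsigned char *buf)
{
    BmpInfo info;
    info.data_offset = read_le32(buf + 10);
    info.width = read_le32(buf + 18);
    info.height = read_le32(buf + 22);
    info.bpp = read_le16(buf + 28);
    info.compression = read_le32(buf + 30);
    return info;
}

/* Bytes per stored row of a 24-bit image, rounded up to a multiple of 4. */
static uint32_t row_stride_24bpp(uint32_t width)
{
    return (width * 3u + 3u) & ~3u;
}
```

With this sketch, row_stride_24bpp(1024) gives 3072, which equals 1024 × 3, so the example image has no padding. A width of 12325 gives 36976, one byte more than 12325 × 3, so a more general reader would have to skip that byte at the end of every row.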

Here we prepared a 1024 × 1024 BMP file as an example, adapted from unsplash.com and converted by FreeConvert.com.

Example input image

The main difference between the two grayscale programs lies in memory management and pixel processing, covered in the following sections (Plain C Implementation and CUDA C Implementation, respectively).

Plain C Implementation (CPU Baseline)

The plain C implementation is quite intuitive: load all pixels into memory with a single read, process them, and write them back to the output file with a single write. Reading or writing pixel by pixel, or even row by row, would harm performance, since each fread() and fwrite() call incurs overhead.

So our code looks like this:

uint32_t pix_num = width * height;
uint32_t img_size = pix_num * 3u;
unsigned char *img = (unsigned char *)malloc(img_size);

(void)!fread(img, 1, img_size, fpi);

for (int32_t y = 0; y < height; y++)
{
  for (int32_t x = 0; x < width; x++)
  {
      int32_t i = (y * width + x);
      unsigned char *p = &img[i * 3]; // One pixel, BGR.
      unsigned char gray = grayscale(p[2], p[1], p[0]);
      p[0] = gray;
      p[1] = gray;
      p[2] = gray;
  }
}

Here each unsigned char is regarded as a byte. Note that BMP stores pixels in blue–green–red (BGR) order. The grayscale function grayscale() looks like this:

static inline unsigned char grayscale(unsigned char r, unsigned char g, unsigned char b)
{
    unsigned int y = 77u * r + 150u * g + 29u * b + 128u; /* +128 for rounding */
    return (unsigned char)(y >> 8);
}

The grayscale formula is a fixed-point integer approximation of the ITU-R Recommendation BT.601 luma transform. The original formula is Y = 0.299 * R + 0.587 * G + 0.114 * B. Instead of using floating point, each coefficient is scaled by 256 and rounded to an integer: 0.299 * 256 = 76.544 ≈ 77, 0.587 * 256 = 150.272 ≈ 150, and 0.114 * 256 = 29.184 ≈ 29. The three integers sum to exactly 256, so white stays white. The formula becomes Y = (77 * R + 150 * G + 29 * B + 128) / 256. The right shift by 8 performs the division by 256, and adding 128 first is like adding 0.5 before truncating a floating-point number, so the result is rounded to the nearest integer. This integer-based method avoids integer-to-float conversions and floating-point arithmetic, which typically makes it faster.
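To get a feeling for how close the approximation is, the following sketch compares the fixed-point formula with a floating-point reference. The names gray_fixed(), gray_float(), and max_abs_diff() are made up for this illustration; gray_fixed() uses the same arithmetic as grayscale() above.

```c
/* Fixed-point BT.601 luma, same arithmetic as grayscale() in this post. */
static unsigned char gray_fixed(unsigned char r, unsigned char g, unsigned char b)
{
    unsigned int y = 77u * r + 150u * g + 29u * b + 128u; /* +128 for rounding */
    return (unsigned char)(y >> 8);
}

/* Floating-point reference, rounded to the nearest integer. */
static unsigned char gray_float(unsigned char r, unsigned char g, unsigned char b)
{
    double y = 0.299 * r + 0.587 * g + 0.114 * b;
    return (unsigned char)(y + 0.5);
}

/* Largest absolute difference between the two versions over all 2^24 colors. */
static int max_abs_diff(void)
{
    int worst = 0;
    for (int r = 0; r < 256; r++)
        for (int g = 0; g < 256; g++)
            for (int b = 0; b < 256; b++) {
                int d = (int)gray_fixed((unsigned char)r, (unsigned char)g, (unsigned char)b) -
                        (int)gray_float((unsigned char)r, (unsigned char)g, (unsigned char)b);
                if (d < 0)
                    d = -d;
                if (d > worst)
                    worst = d;
            }
    return worst;
}
```

For pure red (255, 0, 0), the fixed-point version yields 77 while the floating-point reference yields 76, and max_abs_diff() returns 1. In other words, the two versions never differ by more than one gray level, which is invisible in practice.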

This grayscale formula is also used in our CUDA version.

Other than the grayscale formula, nothing in this C snippet should be too obscure. Here we ignore the return value of fread() for brevity ((void)!fread(...)), but checking for errors is always a good habit. You can compile it with the simple command gcc -o scaler ./scaler.c, and play around with BMP images. Its output for our 1024 × 1024 example is shown below:

Example output image from pure C

Full code is available in the Appendix Source Code: Plain C Version.
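If you prefer not to ignore the return value of fread(), a small wrapper is enough. This is only a sketch: read_exact() is a hypothetical helper name, not part of the appendix code.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Read exactly `size` bytes from `fp` into `dst`.
 * Returns 0 on success, or -1 on a short read (end of file or I/O error).
 */
static int read_exact(FILE *fp, void *dst, size_t size)
{
    size_t got = fread(dst, 1, size, fp);
    return (got == size) ? 0 : -1;
}
```

A call such as read_exact(fpi, img, img_size) could then replace (void)!fread(img, 1, img_size, fpi), with the program printing an error and exiting when the result is not 0. A truncated BMP file would be reported instead of being processed silently.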

Now, let’s step into the CUDA world!

CUDA Basics in a Nutshell

Host and Device

In CUDA programming, every CUDA-capable GPU is called a device. Its counterpart, the CPU, is referred to as the host. You need to tell the compiler which functions run on the host and which run on the device, typically with function qualifiers such as __host__, __global__, and __device__.

The __host__ qualifier means that the function is called from the host and executed on the host. It is the default if no qualifier is specified, which keeps CUDA C compatible with traditional C function declarations.

The __global__ qualifier means that the function is called from the host but executed on the device, conceptually acting as the bridge between host and device. Such a CUDA C function is typically called a kernel, or kernel function.

The __device__ qualifier, as its name suggests, indicates that the function is called from, and executed on, the device.

Thus, our CUDA kernel will look something like:

__global__ void grayscaleKernel(...)
{
    ...
}

Threads, Blocks, and Grids

When a kernel is launched, it runs in parallel across many threads on the GPU. (Note that a CUDA thread is different from an operating system thread: it is a much more lightweight execution context on the GPU.) This is a typical example of single-program multiple-data (SPMD). When the host calls a kernel, it launches a grid, and the threads of that grid are organized into blocks. A block commonly contains 128 to 1024 threads. Early CUDA GPUs capped this at 512, but modern GPUs typically allow up to 1024.

As mentioned, CUDA follows the SPMD model: every thread runs the same kernel code, but on different data. So how does a thread know which data to process? This is determined by blockIdx and threadIdx.

The following is a kernel that adds two vectors (A and B) together and writes the result to C:

__global__ void vecAddKernel(int* A, int* B, int* C, int vec_size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < vec_size)
    {
        C[i] = A[i] + B[i];
    }
}

Each thread finds the data it should process by computing an index from blockIdx and threadIdx. These values are assigned by the CUDA runtime, and together they identify each thread uniquely within the grid. Think of it like a multi-page spreadsheet: blockIdx picks the page, blockDim tells you how many rows each page has, and threadIdx picks the row, so every thread can go straight to its own part. Although we do not assign blockIdx and threadIdx manually, there are two things that we do need to manage: the number of blocks in the grid and the number of threads per block.

The blockDim represents how many threads are in a block. When we call the kernel from the host, we need to tell the CUDA runtime how many threads there are per block, and how many blocks there are in the grid.

int threads_per_block = 256; // blockDim
int blocks_per_grid = 4;     // gridDim
vecAddKernel<<<blocks_per_grid, threads_per_block>>>(A, B, C, 1000);

Here threads_per_block becomes blockDim in the kernel. Together with blocks_per_grid, it tells the CUDA runtime to start 4 blocks in the grid, each containing 256 threads. Note the triple angle brackets <<< and >>>, which distinguish a kernel launch from a regular function call. As 256 × 4 = 1024 > 1000, there are enough threads to cover the 1000-element vector. If a thread’s index is beyond the vector length, the i < vec_size check makes that thread do nothing. A more idiomatic way to define blocks_per_grid is to derive it from the problem size:

int vec_size = 1000;
int threads_per_block = 256;                                                  // blockDim
int blocks_per_grid = (vec_size + threads_per_block - 1) / threads_per_block; // gridDim
vecAddKernel<<<blocks_per_grid, threads_per_block>>>(A, B, C, vec_size);
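To see why the rounded-up division and the bounds check work together, the launch can be emulated serially on the CPU. This is a plain C sketch, not CUDA code: blocks_for() and emulate_launch() are hypothetical names, and the two loops stand in for the blocks and threads that the GPU would run in parallel.

```c
/* Number of blocks needed so that blocks * threads_per_block >= n (ceiling division). */
static int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

/*
 * Serial emulation of a 1D kernel launch over n elements.
 * hits[i] is incremented once for every thread that maps to element i.
 * Returns the number of threads that fail the bounds check and do nothing.
 */
static int emulate_launch(int n, int threads_per_block, int *hits)
{
    int idle = 0;
    int blocks = blocks_for(n, threads_per_block);
    for (int block_idx = 0; block_idx < blocks; block_idx++) {
        for (int thread_idx = 0; thread_idx < threads_per_block; thread_idx++) {
            /* Same formula as blockDim.x * blockIdx.x + threadIdx.x in the kernel. */
            int i = threads_per_block * block_idx + thread_idx;
            if (i < n)
                hits[i]++;
            else
                idle++;
        }
    }
    return idle;
}
```

For n = 1000 and 256 threads per block, blocks_for() returns 4, every element is visited exactly once, and emulate_launch() reports 24 idle threads, which are the 1024 - 1000 threads caught by the bounds check.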

You may have noticed the .x in the kernel’s index computation and wondered why the integer threads_per_block becomes blockDim.x rather than just blockDim. This is because CUDA also accepts multidimensional block dimensions (and grid dimensions). For example, dim3 blockDim(16, 8, 3); gives each block 16 × 8 × 3 = 384 threads, and the kernel can compute indices from the .x, .y, and .z components of blockDim, blockIdx, and threadIdx. When a plain integer is passed, it sets the x dimension and the others default to 1. This is very convenient for inputs that are conceptually 2D or 3D.
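The relation between a 3D thread index and a linear thread number inside a block can be written out in plain C. The Dim3 struct below is a hypothetical stand-in for CUDA's dim3 so that the sketch compiles without nvcc; the flattening order (x fastest, then y, then z) is the one CUDA uses for thread IDs within a block.

```c
/* Hypothetical stand-in for CUDA's dim3, so this compiles as plain C. */
typedef struct {
    unsigned x, y, z;
} Dim3;

/* Total number of threads in a block of the given shape. */
static unsigned threads_in_block(Dim3 dim)
{
    return dim.x * dim.y * dim.z;
}

/* Linear ID of the thread at index `idx` in a block of shape `dim`: x varies fastest. */
static unsigned flat_thread_id(Dim3 idx, Dim3 dim)
{
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}
```

For a block of shape (16, 8, 3), threads_in_block() gives 384, and the last thread (15, 7, 2) gets the linear ID 383.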

CUDA Memory Model

CUDA devices have their own memory, typically GDDR6 or GDDR6X on NVIDIA GeForce RTX 40 Series graphics cards. It plays the same role as the DRAM in your computer (DDR4 or DDR5 in mainstream laptops and desktops), but it is attached to the GPU and managed through the CUDA runtime. So if we want a CUDA device to do some computation, we first need to allocate device memory (similar to how you’d use malloc() in C) and copy data into it. Likewise, we need to copy the results back from device memory once the computation is finished. These steps are done by calling cudaMalloc() and cudaMemcpy().

So in the big picture, if we need to use CUDA devices to help us, we need to perform at least the following steps:

Allocate the required device memory;
Copy the input data from host memory to device memory;
Launch the kernel;
Copy the results from device memory back to host memory.

You might wonder whether this computing model is efficient enough, since it involves many memory operations that look costly. We will find out once we add timers to them.

CUDA C Implementation (GPU Version)

Now we are fully prepared to implement the CUDA version. Recall that CUDA C is a superset of plain C, so most parts (such as BMP header parsing and file I/O) will be the same. All we need to consider is how to write a CUDA kernel, and how to launch it.

Please feel free to give it a shot yourself. And if you are ready, let’s see our minimalist CUDA kernel:

/*
 * The CUDA kernel for pixel grayscaling. Each thread handles a specific pixel.
 */
__global__ void grayscaleKernel(unsigned char *__restrict__ img, int pix_num)
{
    int pix = (blockDim.x * blockIdx.x + threadIdx.x);
    if (pix >= pix_num)
        return;

    int i = pix * 3;
    unsigned int y = 77u * img[i + 2] + 150u * img[i + 1] + 29u * img[i] + 128u; /* +128 for rounding, as in grayscale() */
    unsigned char gray = (unsigned char)(y >> 8);
    img[i] = gray;
    img[i + 1] = gray;
    img[i + 2] = gray;
}

You can observe that it follows our vecAddKernel() example: first compute the data index, return if it is out of range, and then process the data. The grayscale formula is exactly the same as in our plain C version. Each thread processes exactly one pixel.

To call our grayscale kernel, we need to set up memory and launch dimensions for it. To make it as parallel as possible, we hand the entire image to the CUDA device. Recall that the maximum number of threads per block (blockDim) is typically 1024, so we use that value, falling back to the image width for images narrower than 1024 pixels. Then blocks_per_grid (or gridDim) is computed as (pix_num + threads_per_block - 1) / threads_per_block, which rounds up because / on integers in C is integer division. So our CUDA code for calling the kernel (including preparation and clean-up) looks like:

uint32_t pix_num = width * height;
uint32_t img_size = pix_num * 3u;
unsigned char *img_h = (unsigned char *)malloc(img_size);

(void)!fread(img_h, 1, img_size, fpi);

unsigned char *img_d;
cudaMalloc((void **)&img_d, img_size);
cudaMemcpy(img_d, img_h, img_size, cudaMemcpyHostToDevice);

int threads_per_block = (width > 1024) ? 1024 : width;
int blocks_per_grid = (pix_num + threads_per_block - 1) / threads_per_block;
grayscaleKernel<<<blocks_per_grid, threads_per_block>>>(img_d, pix_num);

cudaMemcpy(img_h, img_d, img_size, cudaMemcpyDeviceToHost);
cudaFree(img_d);

// Don't forget to free `img_h` after writing output.

It’s common CUDA practice to append _h and _d to variable names to indicate host or device memory. Also note that, with the plain cudaMalloc() approach used here, any pointer passed into a kernel must point to device memory, such as img_d in this case.

Similarly, simply run nvcc -o scaler_cu ./scaler.cu to compile and try it with BMP files! The output should be exactly the same as the C version.

Full code is available at Appendix Source Code: CUDA C Version.

Experimental Setup and Results

Now that we have two well-functioning grayscale programs, we can measure their performance with timers and compare the results.

Timer Setup

Following our minimalism philosophy, we simply record the start and end time of an operation, for example:

double t_s = now_sec(); // Start time.
cudaMemcpy(img_d, img_h, img_size, cudaMemcpyHostToDevice);
double t_e = now_sec(); // End time.
double t_memcpy_h2d = t_e - t_s;

printf("Memcpy Host to Device: %.4f s\n", t_memcpy_h2d);

With now_sec() defined as:

#include <time.h>

static inline double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
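Single measurements can be noisy, so one common refinement is to repeat an operation a few times and keep the fastest run. The sketch below shows the idea for a CPU grayscale pass. It is not how the numbers in this post were obtained: Image, gray_pass(), and time_best_of() are hypothetical names, and the _POSIX_C_SOURCE definition is only an assumption to make clock_gettime() visible under strict -std=c99 or -std=c11.

```c
#define _POSIX_C_SOURCE 199309L /* Assumption: exposes clock_gettime() under strict -std flags. */
#include <stddef.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical workload descriptor: a BGR pixel buffer and its pixel count. */
typedef struct {
    unsigned char *pix;
    size_t pix_num;
} Image;

/* One grayscale pass, same arithmetic as grayscale(). Re-running it leaves gray pixels unchanged. */
static void gray_pass(void *arg)
{
    Image *img = (Image *)arg;
    for (size_t i = 0; i < img->pix_num; i++) {
        unsigned char *p = &img->pix[i * 3];
        unsigned int y = 77u * p[2] + 150u * p[1] + 29u * p[0] + 128u;
        p[0] = p[1] = p[2] = (unsigned char)(y >> 8);
    }
}

/* Run fn(arg) `reps` times; return the shortest duration in seconds, or -1.0 if reps < 1. */
static double time_best_of(void (*fn)(void *), void *arg, int reps)
{
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        double t_s = now_sec();
        fn(arg);
        double t = now_sec() - t_s;
        if (best < 0.0 || t < best)
            best = t;
    }
    return best;
}
```

A call like time_best_of(gray_pass, &img, 5) would report the fastest of five passes. For GPU work the timed function would also have to wait for the device to finish before returning, otherwise only the launch is measured.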

For the pure C version, we measured: 1) total time, 2) file input time, 3) pixel processing time, and 4) file output time. Full code is available at Appendix Source Code: Plain C Version.

For the CUDA C version, we measured: 1) total time, 2) file input time, 3) memory copy time from host to device, 4) pixel processing time, 5) memory copy time from device to host, and 6) file output time. Full code is available at Appendix Source Code: CUDA C Version.

Note: since kernel launches are asynchronous, we call cudaDeviceSynchronize() before stopping the timer to ensure accurate measurements.

Compiler Optimizations and Flags

The default commands gcc -o scaler ./scaler.c and nvcc -o scaler_cu ./scaler.cu of course work fine, but since we are measuring performance, we also want to enable aggressive compiler optimizations for both of them. This is how we compile them:

nvcc -O3 -Xcompiler "-march=native -mtune=native -pipe" -gencode arch=compute_89,code=sm_89 -o scaler ./scaler.c
nvcc -O3 -Xcompiler "-march=native -mtune=native -pipe" -gencode arch=compute_89,code=sm_89 -o scaler_cu ./scaler.cu

We are using nvcc for both pure C version and CUDA version, to avoid any compiler version mismatches. Since CUDA C is a superset of C, nvcc can also compile our scaler.c without issues. This is a breakdown of these flags:

-O3 requests the highest optimization level. nvcc documents this option as the host-code optimization level, while device code is optimized by default;
-Xcompiler passes flags directly to the host compiler (gcc in this case):
- -march=native enables all instruction sets supported by the build machine's CPU;
- -mtune=native tunes the generated code for the build machine's CPU within the selected instruction set;
- -pipe uses pipes rather than temporary files during compilation, which can speed up builds slightly.
-gencode specifies the CUDA architecture for code generation. sm_89 stands for the Ada Lovelace architecture used by NVIDIA GeForce RTX 40 Series GPUs, which makes it the device-side counterpart of -march=native -mtune=native. Remember to update it if you run on a different device.

Experimental Results and Analysis

The following table gives a typical result comparison when running on our 1024 × 1024 example (all times in seconds; n/a marks stages that do not exist in the plain C version):

Version	Read	H2D	Compute	D2H	Write	Total
Plain C	0.0096	n/a	0.0193	n/a	0.0097	0.0599
CUDA C	0.0094	0.0008	0.0013	0.0002	0.0095	0.1621

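We also ran both versions on a 12K image. It is not shown here because it occupies around 351 MB in BMP format and would significantly slow down page loading. You can get it from this beautiful picture of a forest path, which has a resolution of 12325 × 9979 (remember to download the original size and convert it to BMP format). Note that 12325 × 3 bytes is not a multiple of 4, so each row of this file carries 1 byte of padding that the simple reader in this post does not account for. The timings are still representative, but the output will not be pixel-exact.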

Here is the result of the 12K image (all times in seconds):

Version	Read	H2D	Compute	D2H	Write	Total
Plain C	0.8983	n/a	2.2435	n/a	1.1176	6.6233
CUDA C	0.9539	0.0260	0.0014	0.0213	1.1110	2.2435

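We can clearly see a crossover between CPU and GPU performance. For the small image (1024 × 1024), the CPU version finishes sooner overall, even though the GPU kernel itself is faster (0.0013 s versus 0.0193 s). The fixed costs of using the GPU dominate here: memory transfers, kernel launch, and time that falls outside the listed stages, which presumably includes CUDA runtime initialization and device memory allocation.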

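As the image grows larger, the power of parallelism begins to show. With more pixels, kernel throughput and device memory bandwidth amortize the kernel launch and transfer overheads. The GPU compute time also barely increases (0.0013 s to 0.0014 s) although the pixel count grows by a factor of about 117. This suggests that the small-image measurement is dominated by fixed launch overhead, and that the larger image keeps far more of the device's computing units busy.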

You may recall the question raised in the CUDA Memory Model section about whether this model is efficient. Here we can see an important property of GPU memory: it is optimized for high bandwidth rather than low latency. This contributes to the GPU's advantage on large images, and it is also why GPUs excel when there’s enough data to keep thousands of threads busy.

After all, grayscale conversion is not a compute-bound task: each pixel needs only a few integer operations, so the bottleneck lies in memory I/O and data transfers. This experiment is a good example of a memory-bound workload.
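The memory-bound claim can be checked with some back-of-the-envelope arithmetic on the 12K results. The helpers below (image_bytes(), gb_per_sec(), and speedup()) are hypothetical names for this sketch, and the derived figures take the reported timings at face value.

```c
#include <stdint.h>

/* Bytes of pixel data in a 24-bit image, ignoring any row padding. */
static uint64_t image_bytes(uint64_t width, uint64_t height)
{
    return width * height * 3u;
}

/* Throughput in GB/s, with 1 GB = 1e9 bytes. */
static double gb_per_sec(double bytes, double seconds)
{
    return bytes / seconds / 1e9;
}

/* How many times faster the second duration is than the first. */
static double speedup(double slow_sec, double fast_sec)
{
    return slow_sec / fast_sec;
}
```

The 12325 × 9979 image holds 368,973,525 bytes of pixel data. Moving it to the device in 0.0260 s corresponds to roughly 14 GB/s. The kernel reads and writes every byte once, so 0.0014 s corresponds to roughly 527 GB/s of device memory traffic. The compute stage alone is about 1600 times faster on the GPU (2.2435 s versus 0.0014 s), while the end-to-end total improves by a factor of about 3 (6.6233 s versus 2.2435 s), because file reading and writing cost the same in both versions.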

In a more advanced setting, such as processing many BMP images, the CUDA version can be further optimized with pipelining: GPU compute can overlap with transfers while the CPU loads the next image and saves the previous one. This kind of overlap relies on the GPU working asynchronously from the CPU, and it would further widen the performance gap between the CPU-only and CUDA versions. Other improvements such as multiple streams, pinned memory, event-based control, and error handling, along with many other interesting topics, will be covered in future posts.

In short, CPUs handle small workloads more efficiently due to lower overhead, but GPUs clearly dominate once the workload size grows large enough to utilize their parallelism.

References

Wen-mei W. Hwu, David B. Kirk and Izzat El Hajj. 2023. Programming Massively Parallel Processors: A Hands-on Approach (4th ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Online Contents:

Used Images:

Appendix

Source Code: Plain C Version

A measured version with timers can be found here.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

const size_t HEADER_STR = 0x00; // Starting byte of BITMAPFILEHEADER.
const size_t HEADER_END = 0x0D;
const size_t HEADER_SIZE = HEADER_END - HEADER_STR + 1;
const size_t COMMON_INFO_SIZE = 40; // Common part of BITMAPINFOHEADER.

/*
 * Little-endian 32-bit integer reader.
 */
static inline uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Little-endian 16-bit integer reader.
 */
static inline uint16_t le16(const unsigned char *p)
{
    return (uint16_t)p[0] | ((uint16_t)p[1] << 8);
}

/*
 * Grayscaler with fast integer (BT.601).
 */
static inline unsigned char grayscale(unsigned char r, unsigned char g, unsigned char b)
{
    unsigned int y = 77u * r + 150u * g + 29u * b + 128u; /* +128 for rounding */
    return (unsigned char)(y >> 8);
}

int main(void)
{
    FILE *fpi = fopen("source.bmp", "rb");
    FILE *fpo = fopen("output.bmp", "wb");

    if (!fpi || !fpo)
        return 1;

    // Load header and common info header.
    unsigned char header[HEADER_SIZE];
    unsigned char common_info[COMMON_INFO_SIZE];

    (void)!fread(header, 1, HEADER_SIZE, fpi);
    (void)!fread(common_info, 1, COMMON_INFO_SIZE, fpi);
    uint32_t data_offset = le32(&header[10]);
    uint32_t remain_info_size = data_offset - HEADER_SIZE - COMMON_INFO_SIZE;
    uint32_t width = le32(&common_info[4]);
    uint32_t height = le32(&common_info[8]);
    uint16_t bpp = le16(&common_info[14]);         // Bits per pixel.
    uint32_t compression = le32(&common_info[16]); // Compression (0 = uncompressed).

    if (bpp != 24 || compression != 0)
    {
        fprintf(stderr, "Unsupported BMP: expecting 24 bpp, uncompressed.\n");
        return 1;
    }

    // Load remaining info header.
    unsigned char remain_info[remain_info_size];
    (void)!fread(remain_info, 1, remain_info_size, fpi);

    // Write header and info header to output image.
    fwrite(header, 1, HEADER_SIZE, fpo);
    fwrite(common_info, 1, COMMON_INFO_SIZE, fpo);
    fwrite(remain_info, 1, remain_info_size, fpo);

    uint32_t pix_num = width * height;
    uint32_t img_size = pix_num * 3u;
    unsigned char *img = (unsigned char *)malloc(img_size);

    (void)!fread(img, 1, img_size, fpi);

    for (int32_t y = 0; y < height; y++)
    {
        for (int32_t x = 0; x < width; x++)
        {
            int32_t i = (y * width + x);
            unsigned char *p = &img[i * 3]; // One pixel, BGR.
            unsigned char gray = grayscale(p[2], p[1], p[0]);
            p[0] = gray;
            p[1] = gray;
            p[2] = gray;
        }
    }

    fwrite(img, 1, img_size, fpo);

    free(img);
    fclose(fpi);
    fclose(fpo);

    return 0;
}

Source Code: CUDA C Version

A measured version with timers can be found here.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

const size_t HEADER_STR = 0x00; // Starting byte of BITMAPFILEHEADER.
const size_t HEADER_END = 0x0D;
const size_t HEADER_SIZE = HEADER_END - HEADER_STR + 1;
const size_t COMMON_INFO_SIZE = 40; // Common part of BITMAPINFOHEADER.

/*
 * Little-endian 32-bit integer reader.
 */
static inline uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Little-endian 16-bit integer reader.
 */
static inline uint16_t le16(const unsigned char *p)
{
    return (uint16_t)p[0] | ((uint16_t)p[1] << 8);
}

/*
 * The CUDA kernel for pixel grayscaling. Each thread handles a specific pixel.
 */
__global__ void grayscaleKernel(unsigned char *__restrict__ img, int pix_num)
{
    int pix = (blockDim.x * blockIdx.x + threadIdx.x);
    if (pix >= pix_num)
        return;

    int i = pix * 3;
    unsigned int y = 77u * img[i + 2] + 150u * img[i + 1] + 29u * img[i] + 128u; /* +128 for rounding, as in the C version */
    unsigned char gray = (unsigned char)(y >> 8);
    img[i] = gray;
    img[i + 1] = gray;
    img[i + 2] = gray;
}

int main(void)
{
    FILE *fpi = fopen("source.bmp", "rb");
    FILE *fpo = fopen("output_cuda.bmp", "wb");

    if (!fpi || !fpo)
        return 1;

    // Load header and common info header.
    unsigned char header[HEADER_SIZE];
    unsigned char common_info[COMMON_INFO_SIZE];

    (void)!fread(header, 1, HEADER_SIZE, fpi);
    (void)!fread(common_info, 1, COMMON_INFO_SIZE, fpi);
    uint32_t data_offset = le32(&header[10]);
    uint32_t remain_info_size = data_offset - HEADER_SIZE - COMMON_INFO_SIZE;
    uint32_t width = le32(&common_info[4]);
    uint32_t height = le32(&common_info[8]);
    uint16_t bpp = le16(&common_info[14]);         // Bits per pixel.
    uint32_t compression = le32(&common_info[16]); // Compression (0 = uncompressed).

    if (bpp != 24 || compression != 0)
    {
        fprintf(stderr, "Unsupported BMP: expecting 24 bpp, uncompressed.\n");
        return 1;
    }

    // Load remaining info header.
    unsigned char remain_info[remain_info_size];
    (void)!fread(remain_info, 1, remain_info_size, fpi);

    // Write header and info header to output image.
    fwrite(header, 1, HEADER_SIZE, fpo);
    fwrite(common_info, 1, COMMON_INFO_SIZE, fpo);
    fwrite(remain_info, 1, remain_info_size, fpo);

    uint32_t pix_num = width * height;
    uint32_t img_size = pix_num * 3u;
    unsigned char *img_h = (unsigned char *)malloc(img_size);

    (void)!fread(img_h, 1, img_size, fpi);

    unsigned char *img_d;

    cudaMalloc((void **)&img_d, img_size);
    cudaMemcpy(img_d, img_h, img_size, cudaMemcpyHostToDevice);

    int threads_per_block = (width > 1024) ? 1024 : width;
    int blocks_per_grid = (pix_num + threads_per_block - 1) / threads_per_block;
    grayscaleKernel<<<blocks_per_grid, threads_per_block>>>(img_d, pix_num);

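    cudaError_t err = cudaGetLastError(); // Reports kernel launch errors, if any.
    if (err != cudaSuccess)
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));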

    cudaMemcpy(img_h, img_d, img_size, cudaMemcpyDeviceToHost);

    cudaFree(img_d);

    fwrite(img_h, 1, img_size, fpo);

    free(img_h);
    fclose(fpi);
    fclose(fpo);

    return 0;
}

Programming Language

#C #CUDA #Parallel Programming #BMP #SPMD

https://blog.lingkang.dev/2025/09/19/cuda-grayscale/

Author

Lingkang

Posted on

September 19, 2025
