Parallel Programming – Part 1

I recently started the parallel programming course on Udacity, and this blog is a record of my journey through it. My fascination with new and upcoming technologies goes back to my high school days, and I have been exploring them ever since. This is one such journey.

A GPU (Graphics Processing Unit) is built to deliver the best computation per watt. It achieves this by spending less hardware on control logic and more on computation. GPUs are designed for high throughput, while CPUs are designed with latency in mind. The following image shows the architectural difference between a CPU and a GPU.

Previously, GPUs were used for visual applications like graphics rendering, but nowadays we also use them for advanced workloads such as computational fluid dynamics, molecular dynamics, genetics, electrodynamics and data analysis.

The Udacity course teaches parallel programming with image processing as the application. So, basically, I’m learning image processing and parallel processing together.

I’m using PyCUDA, as I’m more comfortable with Python than C. I won’t go into the details of installing CUDA and PyCUDA, as you can look that up online. You can find the PyCUDA tutorial here.

The first problem set was to convert an RGB image into grayscale.
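To make the conversion concrete before moving to the GPU, here is a small CPU-only sketch of the per-pixel formula. The weights are the ones the kernel in this post uses (the common ITU-R BT.601 luma weights); the helper name `gray_value` is my own and is not part of the course material.

```python
# Weights applied to the R, G and B channels (same numbers as in the kernel).
WEIGHTS = (0.299, 0.587, 0.114)


def gray_value(r, g, b):
    """Return the grayscale byte for one RGB pixel, truncating like a C cast."""
    return int(r * WEIGHTS[0] + g * WEIGHTS[1] + b * WEIGHTS[2])


print(gray_value(255, 0, 0))  # pure red   -> 76
print(gray_value(0, 255, 0))  # pure green -> 149
print(gray_value(0, 0, 255))  # pure blue  -> 29
```

Green contributes the most and blue the least, which roughly matches how bright each colour looks to the eye.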

The Code

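```python
import sys

import numpy
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
from PIL import Image
from pycuda.compiler import SourceModule

THREADS_PER_BLOCK = 256  # stays under the 1024-threads-per-block limit


def rgbtogray(image):
    # Force 3 channels so every pixel is exactly R, G, B.
    h_imagePixel = numpy.array(image.convert("RGB"))
    width, height = image.size
    totalPixel = width * height

    # Copy the pixels from host (CPU) memory to device (GPU) memory.
    h_imagePixel = h_imagePixel.astype(numpy.uint8)
    d_imagePixel = cuda.mem_alloc(h_imagePixel.nbytes)
    d_outPixel = cuda.mem_alloc(h_imagePixel.nbytes)
    cuda.memcpy_htod(d_imagePixel, h_imagePixel)

    # Each thread converts one pixel.
    mod = SourceModule("""
    __global__ void gray(const unsigned char *d_imagePixel,
                         unsigned char *d_outPixel, int totalPixel)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= totalPixel) return;  // skip threads past the last pixel
        unsigned char rgb = d_imagePixel[0+idx*3]*0.299f
                          + d_imagePixel[1+idx*3]*0.587f
                          + d_imagePixel[2+idx*3]*0.114f;
        d_outPixel[0+idx*3] = rgb;
        d_outPixel[1+idx*3] = rgb;
        d_outPixel[2+idx*3] = rgb;
    }
    """)
    func = mod.get_function("gray")
    # Round the block count up so every pixel gets a thread.
    blocks = (totalPixel + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
    func(d_imagePixel, d_outPixel, numpy.int32(totalPixel),
         block=(THREADS_PER_BLOCK, 1, 1), grid=(blocks, 1, 1))

    # Copy the result back to the host and rebuild the image.
    h_imageOut = numpy.empty_like(h_imagePixel)
    cuda.memcpy_dtoh(h_imageOut, d_outPixel)
    return Image.fromarray(h_imageOut)


image = Image.open(sys.argv[-1])
image = rgbtogray(image)
image.save("./out.jpg")
```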
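A note on the `block` and `grid` arguments: a block is limited to 1024 threads on current NVIDIA GPUs, so the block size cannot simply follow the image width once images get wider than that. A common pattern is to pick a fixed block size and round the number of blocks up. The sketch below shows only that arithmetic in plain Python; `launch_config` is a hypothetical helper of mine, not a PyCUDA function, and 256 is just a typical choice.

```python
def launch_config(total_pixels, threads_per_block=256):
    """Return (block, grid) tuples that give every pixel its own thread."""
    # Ceiling division: round up so a partial last block is still launched.
    blocks = (total_pixels + threads_per_block - 1) // threads_per_block
    return (threads_per_block, 1, 1), (blocks, 1, 1)


block, grid = launch_config(1000)
print(block, grid)  # (256, 1, 1) (4, 1, 1)
```

With 1000 pixels this launches 4 x 256 = 1024 threads, so the last 24 threads have no pixel to work on. The kernel has to skip any index beyond the last pixel, otherwise those threads read and write past the end of the buffers.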

The Output

The conversion looks as shown above.

These are a few important points to take note of.

1. numpy.array(image) converts the image into a 3D array of shape (height, width, channels), with 3 channels for an RGB image.
2. memcpy_htod() copies memory from the CPU (host) to the GPU (device). Keep in mind that on the device the data is a flat one-dimensional array.
3. Each thread computes its own global index as blockIdx.x * blockDim.x + threadIdx.x. See the image below.
4. Thus, in the SourceModule we use 0+idx*3, 1+idx*3 and 2+idx*3 to read the R, G and B values of pixel idx.
5. Again, memory is stored as a 1D array, which means the RGB values of a pixel sit in consecutive memory locations.

So, that's it about PyCUDA and parallel programming for now.
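The indexing in points 3 to 5 can be tried out without a GPU. The following is a minimal plain-Python sketch, assuming a made-up 2x2 image and one block per image row purely for illustration; `pixel_at` and `global_index` are names I chose for this example.

```python
# A tiny 2x2 RGB image as nested lists: rows -> pixels -> [R, G, B].
image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [10, 20, 30]],
]
height, width = len(image), len(image[0])

# Flatten row by row, the way the bytes sit in device memory.
flat = [channel for row in image for pixel in row for channel in pixel]


def pixel_at(flat, idx):
    """Read the (R, G, B) triple of pixel idx from the flat buffer."""
    return flat[0 + idx * 3], flat[1 + idx * 3], flat[2 + idx * 3]


def global_index(block_idx, block_dim, thread_idx):
    """Plain-Python mirror of blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


# The pixel at row 1, column 1 is pixel number row * width + col = 3.
idx = global_index(block_idx=1, block_dim=width, thread_idx=1)
print(idx, pixel_at(flat, idx))  # 3 (10, 20, 30)
```

The same arithmetic holds for any block size, as long as the blocks together cover every pixel index.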

2 thoughts on “Parallel Programming – Part 1”

1. Karthik Chellappan says:

So when you’re writing kernels in PyCUDA, you write them in a sort of C style?