Reduction using CUDA

In this assignment, you will implement the reduction program covered in class for an array of 20K integer elements using only one thread block. Because the maximum thread block size is 1K threads, each thread starts the tree-based reduction with its own 20 elements and performs the reduction steps until the number of elements in a level is 1K. After that, the code shown in class takes over.
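
The two phases described above can be sketched as follows. This is only a minimal sketch, not the code shown in class: it assumes "20K" means 20 * 1024 = 20480 and "1K" means 1024, it assumes the reduction operator is integer addition, and the names reduceShared and reduceOnGpu are hypothetical. Each thread reads its 20 elements with a stride of 1024, which is one possible way to assign elements to threads. Error checking of the CUDA calls is left out to keep the sketch short.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 20 * 1024;             // assumption: "20K" = 20480 elements
constexpr int THREADS = 1024;            // assumption: "1K" = 1024 threads
constexpr int PER_THREAD = N / THREADS;  // 20 elements per thread

// Hypothetical kernel name; must be launched with exactly one block of THREADS threads.
__global__ void reduceShared(const int *in, int *out) {
    __shared__ int partial[THREADS];
    int tid = threadIdx.x;

    // Phase 1: each thread adds up its own 20 elements.
    // Thread tid reads in[tid], in[tid + 1024], ... so that neighbouring
    // threads read neighbouring addresses.
    int sum = 0;
    for (int k = 0; k < PER_THREAD; ++k) {
        sum += in[tid + k * THREADS];
    }
    partial[tid] = sum;
    __syncthreads();

    // Phase 2: tree-based reduction over the 1024 partial sums.
    for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();  // every thread reaches this barrier in every step
    }

    if (tid == 0) {
        *out = partial[0];
    }
}

// Hypothetical host helper: copies the input to the device, runs the kernel
// in a single block, and returns the sum. h_in must hold N elements.
int reduceOnGpu(const std::vector<int> &h_in) {
    int *d_in = nullptr;
    int *d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_in, N * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), N * sizeof(int), cudaMemcpyHostToDevice);
    reduceShared<<<1, THREADS>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

A simple sanity check is to call reduceOnGpu on a vector of N ones and confirm that the result equals N.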

Implement the program both with and without shared memory, report the timing for both versions, and compare each against the time taken by a sequential CPU implementation in terms of speedup.
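
One possible way to set up the timing comparison is sketched below, under several assumptions: "20K" is taken as 20480 and "1K" as 1024, the reduction is an integer sum, and the names reduceGlobal, gpuSumGlobal, cpuSum and speedup are hypothetical. The kernel here is a version without shared memory that keeps its partial sums in a global-memory scratch buffer. The kernel is timed with CUDA events and the CPU loop with std::chrono, and speedup is taken as CPU time divided by GPU time. Whether memory transfers should be included in the GPU time is not specified in the assignment; this sketch times the kernel only.

```cuda
#include <chrono>
#include <numeric>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 20 * 1024;   // assumption: "20K" = 20480 elements
constexpr int THREADS = 1024;  // assumption: "1K" = 1024 threads

// Version without shared memory: partial sums live in global memory (scratch).
// Must be launched with exactly one block of THREADS threads.
__global__ void reduceGlobal(const int *in, int *scratch) {
    int tid = threadIdx.x;
    int sum = 0;
    for (int i = tid; i < N; i += THREADS) {
        sum += in[i];  // each thread adds up its own 20 elements
    }
    scratch[tid] = sum;
    __syncthreads();
    for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            scratch[tid] += scratch[tid + stride];
        }
        __syncthreads();
    }
    // The final sum is left in scratch[0].
}

// Runs the kernel once and returns the sum; the kernel time in milliseconds
// is written to *kernelMs. h_in must hold N elements.
int gpuSumGlobal(const std::vector<int> &h_in, float *kernelMs) {
    int *d_in = nullptr;
    int *d_scratch = nullptr;
    int h_out = 0;
    cudaMalloc(&d_in, N * sizeof(int));
    cudaMalloc(&d_scratch, THREADS * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), N * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    reduceGlobal<<<1, THREADS>>>(d_in, d_scratch);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished
    cudaEventElapsedTime(kernelMs, start, stop);

    cudaMemcpy(&h_out, d_scratch, sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_scratch);
    return h_out;
}

// Sequential CPU reference; the elapsed time in milliseconds goes to *cpuMs.
int cpuSum(const std::vector<int> &v, double *cpuMs) {
    auto t0 = std::chrono::steady_clock::now();
    int sum = std::accumulate(v.begin(), v.end(), 0);
    auto t1 = std::chrono::steady_clock::now();
    *cpuMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return sum;
}

// Speedup of the GPU version over the CPU version.
double speedup(double cpuMs, double gpuMs) {
    return cpuMs / gpuMs;
}
```

The first kernel launch in a process typically carries one-time startup overhead, so it may be worth running the kernel once untimed and averaging several timed runs. The same event-based timing can be wrapped around the shared-memory kernel to obtain the second measurement.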