The Missouri S&T CS GPU Cluster - page 16 / 18

Tips for speedy code

Have the kernel use the whole card - Use a multiple of 32 threads per block (the warp size) and at least as many blocks as multiprocessors (30 on the Tesla C1060s, each with 8 cores, for 240 cores in total).
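
As a rough illustration of the launch-size advice, the sketch below queries the device rather than hard-coding its numbers. The helpers `roundUpToWarp` and `chooseBlocks` and the `scale` kernel are hypothetical names made up for this example; only `cudaGetDeviceProperties` and the `warpSize` / `multiProcessorCount` fields come from the CUDA runtime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: round a requested block size up to a multiple of the warp size.
int roundUpToWarp(int threads, int warpSize) {
    return ((threads + warpSize - 1) / warpSize) * warpSize;
}

// Hypothetical helper: enough blocks to cover n elements,
// but never fewer than one block per multiprocessor.
int chooseBlocks(int n, int threadsPerBlock, int numSMs) {
    int needed = (n + threadsPerBlock - 1) / threadsPerBlock;
    return needed > numSMs ? needed : numSMs;
}

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // guard: surplus threads do nothing
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return 1;
    }
    const int n = 1 << 20;
    int threads = roundUpToWarp(200, prop.warpSize);   // 224 when warpSize is 32
    int blocks  = chooseBlocks(n, threads, prop.multiProcessorCount);

    float *d = NULL;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<blocks, threads>>>(d, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
    printf("%d blocks of %d threads on %d multiprocessors\n",
           blocks, threads, prop.multiProcessorCount);
    return 0;
}
```

In practice several blocks per multiprocessor usually hide memory latency better than exactly one, so the multiprocessor count is best read as a lower bound.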

Access global memory properly. Coalescing - Memory reads by consecutive threads are combined by the hardware into a small number of wide memory reads.
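
A minimal sketch of the difference between a coalesced and a strided access pattern follows. The kernel names, the driver `runCopies`, and the sizes are assumptions chosen for illustration; the actual slowdown of the strided version depends on the card and is not measured here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Coalesced: thread i touches element i, so each warp reads one contiguous span.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads are `stride` elements apart, so the same
// amount of useful data costs many more memory transactions.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

// Hypothetical driver: runs both kernels and checks the full copy on the host.
bool runCopies(int n, int stride) {
    std::vector<float> h(n), r(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *in = NULL, *out = NULL;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(in, &h[0], n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256, blocks = (n + threads - 1) / threads;
    copyStrided<<<blocks, threads>>>(in, out, n, stride);   // touches every stride-th element
    copyCoalesced<<<blocks, threads>>>(in, out, n);         // touches every element
    cudaMemcpy(&r[0], out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(in);
    cudaFree(out);
    return r == h;
}

int main() {
    bool ok = runCopies(1 << 20, 16);
    printf(ok ? "copies match\n" : "mismatch or CUDA error\n");
    return ok ? 0 : 1;
}
```

Timing the two kernels separately (for example with a profiler) is one way to see how much the access pattern matters on a given card.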

Avoid shared memory bank conflicts.
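
One common way to avoid bank conflicts is to pad a shared-memory tile by one column, sketched here with a matrix transpose. The sketch assumes a tile width equal to the number of banks (16 on compute capability 1.x cards such as the C1060; later hardware has 32 banks and would need a different tile size) and a matrix width that is a multiple of `TILE`. The names `transpose` and `transposeMatches` are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16   // assumed equal to the number of shared memory banks

// Transposes a square matrix whose width is a multiple of TILE.
__global__ void transpose(const float *in, float *out, int width) {
    // The +1 padding shifts each row by one bank, so a column of the tile
    // is spread over TILE different banks instead of landing in one.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise write
    __syncthreads();                                        // tile fully loaded

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];    // column-wise read
}

// Hypothetical driver: transposes on the device and verifies on the host.
bool transposeMatches(int width) {
    const int n = width * width;
    std::vector<float> h(n), r(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *in = NULL, *out = NULL;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(in, &h[0], n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(width / TILE, width / TILE);
    transpose<<<grid, block>>>(in, out, width);
    cudaMemcpy(&r[0], out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);

    for (int row = 0; row < width; ++row)
        for (int col = 0; col < width; ++col)
            if (r[col * width + row] != h[row * width + col]) return false;
    return true;
}

int main() {
    bool ok = transposeMatches(64);
    printf(ok ? "transpose verified\n" : "mismatch or CUDA error\n");
    return ok ? 0 : 1;
}
```

Without the padding, the column-wise read would have a stride of `TILE` words, which maps every thread of a half-warp to the same bank on 16-bank hardware.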

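Have as few divergent branches as possible - conditionals and loops whose outcome differs between threads of the same warp force the warp to run each path in turn.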
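
The following sketch contrasts a condition that differs between neighbouring threads with one that is the same for a whole warp. The two kernels intentionally compute different results; the point is only the granularity of the condition. All names are hypothetical, and the block size is assumed to be a multiple of the warp size.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Divergent: even and odd threads of the same warp take different paths,
// so the warp executes both paths one after the other.
__global__ void perThreadBranch(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) data[i] += 1.0f;
    else                      data[i] -= 1.0f;
}

// Uniform: every thread of a warp evaluates the condition the same way,
// so each warp executes only one of the two paths.
__global__ void perWarpBranch(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0) data[i] += 1.0f;
    else                                   data[i] -= 1.0f;
}

// Hypothetical driver: runs both kernels on zeroed data and checks the sum of their effects.
bool runBranches(int blocks, int threads) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    const int n = blocks * threads;
    float *d = NULL;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    perThreadBranch<<<blocks, threads>>>(d);
    perWarpBranch<<<blocks, threads>>>(d);

    std::vector<float> r(n);
    cudaMemcpy(&r[0], d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < n; ++i) {
        int t = i % threads;   // thread index within its block
        float expected = (t % 2 == 0 ? 1.0f : -1.0f)
                       + ((t / prop.warpSize) % 2 == 0 ? 1.0f : -1.0f);
        if (r[i] != expected) return false;
    }
    return true;
}

int main() {
    bool ok = runBranches(30, 64);
    printf(ok ? "results verified\n" : "mismatch or CUDA error\n");
    return ok ? 0 : 1;
}
```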

Have small loops unrolled.
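
A small sketch of loop unrolling with `#pragma unroll`, which asks the compiler to unroll a loop whose trip count is known at compile time. The kernel `sumItems`, the driver `runSum`, and the value of `ITEMS` are assumptions for illustration; the compiler may already unroll such a loop on its own.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define ITEMS 4   // small compile-time trip count (illustrative value)

// Each thread adds up ITEMS values spaced totalThreads apart.
__global__ void sumItems(const float *in, float *out, int totalThreads) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < ITEMS; ++k)
        acc += in[i + k * totalThreads];   // consecutive threads read consecutive addresses
    out[i] = acc;
}

// Hypothetical driver: fills the input with ones, so every output should equal ITEMS.
bool runSum(int blocks, int threads) {
    const int total = blocks * threads;
    std::vector<float> h(ITEMS * total, 1.0f), r(total);

    float *in = NULL, *out = NULL;
    cudaMalloc(&in, h.size() * sizeof(float));
    cudaMalloc(&out, total * sizeof(float));
    cudaMemcpy(in, &h[0], h.size() * sizeof(float), cudaMemcpyHostToDevice);

    sumItems<<<blocks, threads>>>(in, out, total);
    cudaMemcpy(&r[0], out, total * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);

    for (int i = 0; i < total; ++i)
        if (r[i] != (float)ITEMS) return false;
    return true;
}

int main() {
    bool ok = runSum(30, 128);
    printf(ok ? "sums verified\n" : "mismatch or CUDA error\n");
    return ok ? 0 : 1;
}
```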

Have no unnecessary __syncthreads() calls.
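
To illustrate which barriers are necessary, the sketch below reverses each block's slice of an array through shared memory. It needs exactly one `__syncthreads()`, between the write to shared memory and the read of another thread's element. The names `reverseInBlock` and `runReverse` and the block size are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 128   // threads per block (illustrative value)

__global__ void reverseInBlock(const float *in, float *out) {
    __shared__ float buf[BLOCK];
    int base = blockIdx.x * BLOCK;

    buf[threadIdx.x] = in[base + threadIdx.x];
    __syncthreads();   // needed: the next line reads an element written by another thread

    out[base + threadIdx.x] = buf[BLOCK - 1 - threadIdx.x];
    // No second barrier: buf is not written again, and each thread
    // writes only its own element of out.
}

// Hypothetical driver: reverses each block-sized slice and verifies on the host.
bool runReverse(int blocks) {
    const int n = blocks * BLOCK;
    std::vector<float> h(n), r(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *in = NULL, *out = NULL;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(in, &h[0], n * sizeof(float), cudaMemcpyHostToDevice);

    reverseInBlock<<<blocks, BLOCK>>>(in, out);
    cudaMemcpy(&r[0], out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);

    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < BLOCK; ++t)
            if (r[b * BLOCK + t] != h[b * BLOCK + (BLOCK - 1 - t)]) return false;
    return true;
}

int main() {
    bool ok = runReverse(30);
    printf(ok ? "reversal verified\n" : "mismatch or CUDA error\n");
    return ok ? 0 : 1;
}
```

A barrier would become necessary again only if the kernel went on to overwrite `buf` while other threads might still be reading it.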

See the CUDA Programming Guide for further discussion on all of the above.
