Vasily Volkov (UC Berkeley): Unrolling parallel loops
Loop unrolling is not only good for sequential programs; it has similarly dramatic effects in highly parallel code. See Unrolling parallel loops (local copy), and also #pragma unroll in the NVIDIA CUDA programming guide.
Some bullet points of the presentation:
- More resources consumed per thread
- Note: each load costs 2 arithmetic instructions
  - 32 banks vs 32 streaming processors
  - But the banks run at half the clock rate
- These 3 loads are 6x more expensive than 1 FMA (3 loads × 2 instruction slots each, versus 1 slot for the FMA)
Conclusion:
- Simple optimization technique
- Resembles loop unrolling
- Often results in 2x speedup
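To make the idea concrete, here is a minimal CPU-side sketch in C++ of what unrolling a parallel loop can look like. It is not taken from the presentation: the SAXPY-style update, the function names and the constant `kUnroll` are assumptions chosen only for illustration. The outer loop over `tid` stands in for the GPU thread grid. In a real CUDA kernel that loop would disappear, and the short inner loops over `k` are the kind of fixed-trip-count loops one would mark with `#pragma unroll`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline: each simulated "thread" produces exactly one output element.
void saxpy_one_per_thread(float a, const std::vector<float>& x,
                          std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    for (std::size_t tid = 0; tid < n; ++tid) {  // stands in for the thread grid
        y[tid] = a * x[tid] + y[tid];
    }
}

// Hypothetical unroll factor: outputs computed per simulated thread.
constexpr std::size_t kUnroll = 4;

// Unrolled: fewer simulated threads, each handling kUnroll strided elements.
// The loads are issued first and are independent of one another, so they can
// overlap before the arithmetic runs, at the cost of more registers per thread.
void saxpy_four_per_thread(float a, const std::vector<float>& x,
                           std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const std::size_t threads = (n + kUnroll - 1) / kUnroll;  // ceiling division
    for (std::size_t tid = 0; tid < threads; ++tid) {  // stands in for the thread grid
        float xr[kUnroll] = {};
        float yr[kUnroll] = {};
        // Phase 1: independent loads into per-thread "registers".
        for (std::size_t k = 0; k < kUnroll; ++k) {
            const std::size_t idx = tid + k * threads;
            if (idx < n) {
                xr[k] = x[idx];
                yr[k] = y[idx];
            }
        }
        // Phase 2: independent multiply-adds and stores.
        for (std::size_t k = 0; k < kUnroll; ++k) {
            const std::size_t idx = tid + k * threads;
            if (idx < n) {
                y[idx] = a * xr[k] + yr[k];
            }
        }
    }
}
```

Both functions compute the same result. The second only changes how the work is divided: a quarter as many threads, each with four independent operations in flight. Whether this pays off on a given GPU depends on register pressure and occupancy, so the speedup has to be measured rather than assumed.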
See Vasily Volkov.
Cédric Augonnet, Samuel Thibault and Raymond Namyst call Vasily Volkov a "CUDA-hero" in How to get portable performance on accelerator-based platforms without the agonizing pain.
In a similar vein, Dr. Mark Harris describes the beneficial effect of unrolling in parallel reduction.