Parallel Programming
by Rodion Martynov
1. Grids
1.1. Blocks 2D
1.1.1. Threads 3D
1.1.1.1. Registers
1.1.2. Shared memory
1.1.3. Warps
1.2. Global memory
1.2.1. Burst copying
1.3. Constant memory
1.4. Textures
2. High level strategies
2.1. Maximize arithmetic intensity
2.1.1. Coalesced global memory access
3. Atomics
3.1. atomicAdd
3.2. atomicMin
3.3. atomicXor
4. Control Divergence
4.1. Caused by "if" statements
4.2. Boundary conditions
5. Parallel Computation Patterns
5.1. Convolution
5.2. Reduce
5.3. Prefix Sum (Scan)
6. Communication Patterns
6.1. Map
6.2. Gather
6.3. Scatter
6.4. Stencil
6.5. Transpose
7. Parallel Computation Patterns
7.1. Reduce
7.2. Scan
7.2.1. Inclusive/Exclusive
7.2.2. Hillis-Steele scan (step-efficient)
7.2.3. Blelloch scan (work-efficient)
7.2.4. Segmented scan
7.3. Histogram
7.3.1. Atomics
7.3.2. Per thread, then reduce
7.3.3. Sort by key, then reduce
7.3.3.1. Thrust library
7.4. Compact
7.4.1. Input
7.4.2. Predicate
7.4.3. Output
7.4.3.1. Sparse
7.4.3.2. Dense
7.4.4. Allocate
7.4.4.1. Clipping
7.4.5. Thrust example
7.4.5.1. stream_compaction.cu