Dissertation

1. GPU Base

1.1. Power

1.1.1. J. Lee, Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling

1.2. Workload

1.2.1. Y. Jiao, Power and Performance Characterization of Computational Kernels on the GPU

1.2.2. M. Burtscher, A quantitative study of irregular programs on GPUs

1.2.3. H. Wong, Demystifying GPU microarchitecture through microbenchmarking

1.3. Synchronization

1.3.1. W. Feng, To GPU Synchronize or Not GPU Synchronize?

1.3.1.1. atomicAdd is sensitive to the number of threads

1.3.1.2. __syncthreads is insensitive to the number of threads

1.3.1.3. __threadfence is only effective when the number of threads is small (see the microbenchmark sketch below)
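
These sensitivity findings are easy to probe with a small microbenchmark. Below is a minimal CUDA sketch (not Feng's original benchmark; ITERS and the clock64()-based timing are illustrative choices) that times each primitive at increasing block sizes:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS 4096

// Contended atomic: serialization grows with the number of threads.
__global__ void atomic_kernel(int *counter, long long *cycles) {
    long long start = clock64();
    for (int i = 0; i < ITERS; ++i)
        atomicAdd(counter, 1);
    if (threadIdx.x == 0)
        *cycles = clock64() - start;
}

// Intra-block barrier: cost stays roughly flat in the thread count.
__global__ void sync_kernel(long long *cycles) {
    long long start = clock64();
    for (int i = 0; i < ITERS; ++i)
        __syncthreads();
    if (threadIdx.x == 0)
        *cycles = clock64() - start;
}

// Memory fence: waits on outstanding writes, cheap only at low thread counts.
__global__ void fence_kernel(long long *cycles) {
    long long start = clock64();
    for (int i = 0; i < ITERS; ++i)
        __threadfence();
    if (threadIdx.x == 0)
        *cycles = clock64() - start;
}

int main() {
    int *counter; long long *cycles; long long host;
    cudaMalloc(&counter, sizeof(int));
    cudaMalloc(&cycles, sizeof(long long));
    for (int threads = 32; threads <= 1024; threads *= 2) {
        cudaMemset(counter, 0, sizeof(int));
        atomic_kernel<<<1, threads>>>(counter, cycles);
        cudaMemcpy(&host, cycles, sizeof(host), cudaMemcpyDeviceToHost);
        printf("threads=%4d  atomicAdd:     %lld cycles\n", threads, host);
        sync_kernel<<<1, threads>>>(cycles);
        cudaMemcpy(&host, cycles, sizeof(host), cudaMemcpyDeviceToHost);
        printf("threads=%4d  __syncthreads: %lld cycles\n", threads, host);
        fence_kernel<<<1, threads>>>(cycles);
        cudaMemcpy(&host, cycles, sizeof(host), cudaMemcpyDeviceToHost);
        printf("threads=%4d  __threadfence: %lld cycles\n", threads, host);
    }
    cudaFree(counter); cudaFree(cycles);
    return 0;
}
```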

1.3.2. S. Xiao, Inter-Block GPU Communication via Fast Barrier Synchronization

1.3.3. M. Elteir, Performance Characterization and Optimization of Atomic Operations on AMD GPUs

1.4. Multi-GPU

1.4.1. D. Schaa, Exploring the multiple-GPU design space

1.5. Optimization

1.5.1. S. Ryoo, Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

1.5.2. S. Ryoo, Program optimization space pruning for a multithreaded GPU

1.5.2.1. facts

1.5.2.1.1. The optimization space of resource configurations is large

1.5.2.1.2. There is a tradeoff between TLP and ILP

1.5.2.1.3. The basic approach: raise occupancy and reduce the dynamic instruction count

1.5.2.2. motivation

1.5.2.2.1. derive metrics for compiler optimization

1.5.2.3. optimization

1.5.2.3.1. provide enough warps to hide the stalling effects of long-latency and blocking operations

1.5.2.3.2. redistribution of work across threads and thread blocks

1.5.2.3.3. reduce the dynamic instruction count per thread

1.5.2.3.4. intra-thread parallelism

1.5.2.3.5. resource balancing (see the sketch below)
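
As an illustration of the TLP/ILP tradeoff and of redistributing work across threads, here is a hedged CUDA sketch (the saxpy workload and COARSEN factor are illustrative, not from the paper): each thread handles several independent elements, which exposes ILP and amortizes per-thread setup cost over more useful work.

```cuda
#include <cuda_runtime.h>

#define N (1 << 22)
#define COARSEN 4

// Baseline: one element per thread (maximum TLP, more total instructions).
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Coarsened: each thread handles COARSEN independent elements at a
// grid-wide stride, so accesses stay coalesced. The unrolled loads are
// independent and can overlap in flight (ILP), and index arithmetic is
// amortized over COARSEN updates (fewer dynamic instructions per element).
__global__ void saxpy_coarsened(float a, const float *x, float *y, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    #pragma unroll
    for (int k = 0; k < COARSEN; ++k) {
        int i = tid + k * stride;
        if (i < n) y[i] = a * x[i] + y[i];
    }
}

int main() {
    float *x, *y;
    cudaMalloc(&x, N * sizeof(float));
    cudaMalloc(&y, N * sizeof(float));
    int threads = 256;
    saxpy<<<(N + threads - 1) / threads, threads>>>(2.0f, x, y, N);
    saxpy_coarsened<<<(N / COARSEN + threads - 1) / threads, threads>>>(2.0f, x, y, N);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(y);
    return 0;
}
```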

1.5.2.4. metrics

1.5.2.4.1. efficiency

1.5.2.4.2. utilization

1.5.2.5. methodology

1.5.2.5.1. Find the Pareto-optimal subset, then evaluate only the configurations in that set (formalized below)
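
A hedged formalization of the pruning step (the precise definitions of efficiency and utilization are those in the paper; the dominance notation is standard, not the paper's):

```latex
\[
c_1 \text{ dominates } c_2 \iff
\mathrm{eff}(c_1) \ge \mathrm{eff}(c_2) \;\wedge\;
\mathrm{util}(c_1) \ge \mathrm{util}(c_2),
\]
with at least one inequality strict. The Pareto-optimal subset of the
configuration space $C$,
\[
P = \{\, c \in C \mid \nexists\, c' \in C \text{ such that } c' \text{ dominates } c \,\},
\]
is the only set of configurations that has to be evaluated.
```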

1.5.2.6. comments

1.5.2.6.1. by Baghsorkhi

1.5.2.6.2. by Kothapalli

1.6. Hardware details

1.6.1. M. Gebhart, Energy-efficient mechanisms for managing thread context in throughput processors

1.6.1.1. Reduces the energy overhead of register file accesses and thread scheduling

1.6.1.2. register file cache

1.6.1.3. warp scheduler

2. Simulator

2.1. S. Lee, Parallel GPU architecture simulation framework exploiting work allocation unit parallelism

2.2. functional

2.2.1. Barra

2.2.1.1. commented by Sara

2.3. cycle-accurate

2.3.1. GPGPU-Sim

2.3.1.1. A. Ariel, Visualizing complex dynamics in many-core accelerator architectures

2.3.2. simulation-time reduction techniques

2.3.2.1. Reducing input set

2.3.2.2. Truncated Execution

2.3.2.3. Sampling

3. GPU Modeling

3.1. Performance

3.1.1. S. Baghsorkhi, An adaptive performance modeling tool for GPU architectures

3.1.1.1. commented by Sim

3.1.2. Hong's trilogy

3.1.2.1. S. Hong, Modeling performance and power for energy-efficient GPGPU computing

3.1.2.1.1. commented by J. Chen

3.1.2.2. S. Hong, An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness (core quantities sketched below)

3.1.2.2.1. commented by Kothapalli, who clarifies the differences
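
For reference, a hedged sketch of the model's two central quantities (notation loosely follows the paper; $N$ is the number of active warps per SM):

```latex
\[
\mathrm{CWP} = \min\!\left(\frac{\mathit{Mem\_cycles} + \mathit{Comp\_cycles}}{\mathit{Comp\_cycles}},\, N\right),
\qquad
\mathrm{MWP} = \min\!\left(\frac{\mathit{Mem\_L}}{\mathit{Departure\_delay}},\, \mathrm{MWP}_{\mathit{peak\_BW}},\, N\right)
\]
```

Roughly: when MWP >= CWP there are enough concurrent memory requests to hide memory latency (computation-bound); when CWP > MWP execution is memory-bound and total cycles are dominated by the memory waiting periods.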

3.1.3. Parameterized

3.1.3.1. A. Resios, GPU performance prediction using parametrized models

3.1.3.2. G. Marin, Cross-architecture performance predictions for scientific applications using parameterized models

3.1.3.3. Y. Zhang, A quantitative performance analysis model for GPU architectures

3.1.3.3.1. commented by Sim

3.1.4. H. Jia, GPURoofline: A Model for Guiding Performance Optimizations on GPUs

3.1.5. L. Yuan (袁良), A GPU computing model based on the latency hiding factor

3.1.6. compiler based

3.1.6.1. PTX level

3.1.6.1.1. J. Sim, A performance analysis framework for identifying potential benefits in GPGPU applications (GPUPerf)

3.1.6.2. source code level

3.1.6.2.1. S. Baghsorkhi, Efficient Performance Evaluation for Highly Multi-threaded Graphics Processors

3.1.6.2.2. K. Kothapalli, A performance prediction model for the CUDA GPGPU platform

3.1.7. trace based

3.1.7.1. S. Baghsorkhi, Analytical Performance Prediction for Evaluation and Tuning of GPGPU Applications

3.1.7.1.1. Because of the complexity of parallel programs, an isolated architectural event carries no meaning by itself

3.1.7.1.2. Attaches context to the source code

3.1.7.1.3. When a cache miss occurs, how do we know which instruction actually caused it?

3.1.7.1.4. Methodology

3.1.7.2. J. Lai, TEG: GPU Performance Estimation Using a Timing Model

3.1.7.2.1. Proposes a method for "simulating" a program's execution time

3.2. Power

3.2.1. S. Hong, An integrated GPU power and performance model

3.2.2. X. Ma, Statistical Power Consumption Analysis and Modeling for GPU-based Computing

3.3. Control Flow

3.3.1. Z. Cui, An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization

3.4. Application Specific

3.4.1. J. Choi, Model-driven autotuning of sparse matrix-vector multiply on GPUs

3.4.2. S. Larsen, A Memory Model for Scientific Algorithms on Graphics Processors

3.4.3. J. Meng, Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs

3.4.3.1. commented by Kothapalli

3.4.3.2. related work

3.4.4. J. Meng, GROPHECY: GPU Performance Projection from CPU Code Skeletons

3.4.5. Y. Dotsenko, Auto-tuning of fast Fourier transform on graphics processors

3.4.6. On the Robust Mapping of Dynamic Programming onto a Graphics Processing Unit

3.5. Synchronization

3.5.1. J. Gómez-Luna, Performance modeling of atomic additions on GPU scratchpad memory

3.6. Machine Learning

3.6.1. V. Pallipuram, Evaluation of GPU Architectures Using Spiking Neural Networks

3.7. Stochastic

3.7.1. J. Chen, Tree structured analysis on GPU power study

3.7.1.1. power model

3.7.1.2. Statistics

3.7.1.2.1. Correlation analysis

3.8. Analytical

4. Uncategorized

4.1. BFS

4.1.1. D. Merrill, High Performance and Scalable GPU Graph Traversal

4.1.2. G. Liu, FlexBFS: A Parallelism-aware Implementation of Breadth-First Search on GPU

4.1.2.1. Reads the size of the next frontier and uses it to configure the next kernel launch (sketched below)
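
A minimal sketch of that scheme (not FlexBFS itself; the CSR layout and buffer handling are illustrative): the host reads the next frontier's size back after every level and derives the next launch's grid from it.

```cuda
#include <cuda_runtime.h>

// Expand one BFS level from a CSR graph (row_offsets/col_indices).
__global__ void bfs_level(const int *row_offsets, const int *col_indices,
                          const int *frontier, int frontier_size,
                          int *next_frontier, int *next_size,
                          int *dist, int level) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;
    int u = frontier[i];
    for (int e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
        int v = col_indices[e];
        // atomicCAS claims unvisited vertices so each enters the frontier once.
        if (atomicCAS(&dist[v], -1, level + 1) == -1)
            next_frontier[atomicAdd(next_size, 1)] = v;
    }
}

// Host driver: each level's launch geometry is configured from the
// frontier size read back after the previous level.
void bfs(const int *d_row, const int *d_col, int n, int src,
         int *d_frontier, int *d_next, int *d_dist) {
    int *d_next_size, zero = 0, zero_dist = 0;
    cudaMalloc(&d_next_size, sizeof(int));
    cudaMemset(d_dist, 0xFF, n * sizeof(int));           // all distances = -1
    cudaMemcpy(d_dist + src, &zero_dist, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_frontier, &src, sizeof(int), cudaMemcpyHostToDevice);
    for (int level = 0, frontier_size = 1; frontier_size > 0; ++level) {
        cudaMemcpy(d_next_size, &zero, sizeof(int), cudaMemcpyHostToDevice);
        int threads = 256, blocks = (frontier_size + threads - 1) / threads;
        bfs_level<<<blocks, threads>>>(d_row, d_col, d_frontier, frontier_size,
                                       d_next, d_next_size, d_dist, level);
        cudaMemcpy(&frontier_size, d_next_size, sizeof(int),
                   cudaMemcpyDeviceToHost);              // size of next frontier
        int *tmp = d_frontier; d_frontier = d_next; d_next = tmp;
    }
    cudaFree(d_next_size);
}
```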

4.1.3. R. Nasre, Data-driven versus Topology-driven Irregular Computations on GPUs

4.1.4. L. Luo, An effective GPU implementation of breadth-first search

4.1.5. Y. Liang, Modeling the Locality in Graph Traversals

4.1.5.1. Establishes the relationship between locality and graph connectivity

4.1.5.2. Proposes a metric called vertex distance

4.1.6. Y. Xia, Topologically adaptive parallel breadth-first search on multicore processors

4.1.6.1. motivation

4.1.6.1.1. Proposes a model to estimate the scalability of BFS with respect to a given graph

4.1.6.1.2. The model is proposed to show why flexible thread counts are necessary, linking topology to scalability

4.1.6.2. facts

4.1.6.2.1. all nodes at the same level must be explored prior to any node in the next level.

4.1.6.2.2. Why do more threads increase synchronization overhead?

4.1.6.2.3. Ideally the barrier overhead is 0, but under load imbalance memory access time is only partially hidden, so this component becomes nonzero

4.1.7. S. Hong, Efficient Parallel Graph Exploration on Multi-Core CPU and GPU

4.1.7.1. methodology

4.1.7.1.1. CPU-centric, with the GPU as an assist

4.1.7.2. facts

4.1.7.2.1. two methods

4.1.7.3. contribution

4.1.7.3.1. The method takes the characteristics of small-world networks into account

4.1.7.3.2. Replaces queue accesses with a read-based method

4.2. Multi-core

4.2.1. M. Suleman, Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs

4.2.1.1. Controls a program's thread count according to its runtime behavior, improving performance and reducing energy consumption

4.2.1.1.1. If the program is limited by data synchronization, adding threads not only wastes power but also degrades performance

4.2.1.1.2. If the program is limited by off-chip bandwidth, adding threads does not improve performance and only increases power consumption

4.2.2. R. Saavedra-Barrera, An Analytical Solution for a Markov Chain Modeling Multithreaded Execution
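
The headline result of this classic model, in a hedged paraphrase (R = run length between stalls, L = stall latency, C = context-switch cost, N = number of threads): utilization grows linearly in N until enough threads exist to cover the latency, then saturates at a level set only by the switch overhead.

```latex
\[
U(N) =
\begin{cases}
\dfrac{N R}{R + C + L}, & N < N_{\mathrm{sat}},\\[2ex]
\dfrac{R}{R + C}, & N \ge N_{\mathrm{sat}},
\end{cases}
\qquad
N_{\mathrm{sat}} = \frac{R + C + L}{R + C}.
\]
```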

4.3. Modeling

4.3.1. S. Williams, Roofline: An Insightful Visual Performance Model for Multicore Architectures
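
The model reduces to a single bound: attainable throughput is capped either by the compute ceiling or by bandwidth times operational intensity $I$ (flops per byte moved):

```latex
\[
\text{Attainable GFLOP/s} = \min\bigl(\text{Peak GFLOP/s},\; \text{Peak bandwidth} \times I\bigr)
\]
```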

4.4. Irregular Computation

4.4.1. M. Burtscher, Implementing and Tuning Irregular Programs on GPUs

4.4.1.1. Presents 13 general optimization principles for efficiently implementing irregular GPU codes

5. Idea

5.1. dynamic thread configuration

5.1.1. Adjust the thread count dynamically according to the program's synchronization behavior and bandwidth utilization (see the sketch below)
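
A purely illustrative sketch of the idea (the KernelProfile fields and every threshold are hypothetical, not measured quantities): a host-side controller that picks the next launch's thread count from the synchronization and bandwidth behavior observed in a profiling run, following the two regimes identified by Suleman et al.

```cuda
#include <algorithm>

// Hypothetical profile of one kernel run; how these fields are measured
// (performance counters, instrumentation) is left open.
struct KernelProfile {
    double bandwidth_util;  // fraction of peak off-chip bandwidth used
    double sync_fraction;   // fraction of time spent in synchronization
};

int choose_thread_count(int current, const KernelProfile &p) {
    if (p.bandwidth_util > 0.9)   // bandwidth-bound: more threads only add power
        return current;
    if (p.sync_fraction > 0.5)    // sync-bound: more threads can hurt performance
        return std::max(current / 2, 32);
    // Latency-bound: add threads to hide memory latency, up to the block limit.
    return std::min(current * 2, 1024);
}
```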

6. Approach

6.1. Profiling approaches

6.1.1. Static: analyze the PTX or binary instruction stream

6.1.2. Dynamic: collect performance-counter information at run time

6.2. Machine Learning

6.2.1. E. Ipek, An approach to performance prediction for parallel applications

6.2.2. B. Lee, Methods of Inference and Learning for Performance Modeling of Parallel Applications

6.3. Stochastic

6.3.1. R. Saavedra-Barrera, An Analytical Solution for a Markov Chain Modeling Multithreaded Execution

7. Instrumentation

7.1. source level

7.1.1. M. Boyer, Automated Dynamic Analysis of CUDA Programs

7.2. binary level

7.2.1. Ocelot

7.2.1.1. G. Diamos, Ocelot : A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems

7.2.1.1.1. method

7.2.1.1.2. motivation

7.2.1.1.3. JIT techniques

7.2.1.2. G. Diamos, Translating GPU Binaries to Tiered SIMD Architectures with Ocelot

7.2.1.2.1. Allows architectures other than NVIDIA GPUs to leverage the parallelism in PTX programs

7.2.1.2.2. A binary translation framework

7.2.1.3. N. Farooqui, Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures

7.2.1.4. N. Farooqui, A framework for dynamically instrumenting GPU compute applications within GPU Ocelot

7.2.1.4.1. Covers many implementation details

7.2.1.5. Workload

7.2.1.5.1. A. Kerr, A characterization and analysis of PTX kernels

7.2.1.5.2. A. Kerr, Modeling GPU-CPU workloads and systems