Resources
Books
- Computer Architecture: A Quantitative Approach – Hennessy & Patterson
- Computer Organization and Design – Patterson & Hennessy
- Modern Operating Systems – Tanenbaum
- Systems Performance – Gregg
- Introduction to Parallel Computing – Grama et al.
- The Art of Multiprocessor Programming – Herlihy & Shavit
- Programming Massively Parallel Processors – Kirk & Hwu
- Parallel Programming in C with MPI and OpenMP – Quinn
- Numerical Linear Algebra – Trefethen & Bau
- Optimizing Compilers for Modern Architectures – Allen & Kennedy
- GPU Gems 3
- Hacker's Delight – H. Warren
Standards, specifications & documentation
- C++ Standard (working draft)
- C++ Reference
- OpenMP Specification
- MPI Standard
- OpenSHMEM Specification
- Kokkos Documentation
Papers
- What Every Computer Scientist Should Know About Floating-Point Arithmetic – D. Goldberg
- What Every Programmer Should Know About Memory – U. Drepper
- Scalable Communication Protocols for Dynamic Sparse Data Exchange – T. Hoefler et al. (slides)
- Scientific Benchmarking of Parallel Computing Systems – T. Hoefler et al. (slides, recorded talk)
- Anatomy of High-Performance Matrix Multiplication – K. Goto & R. Van de Geijn
- Demystifying the Characteristics of High Bandwidth Memory for Real-Time Systems – K. Asifuzzaman et al.
- TBA
Talks
- Think Parallel - B. A. Lelbach
- Multidimensional C++ - B. A. Lelbach
- Block-based Parallel Programming - B. A. Lelbach
- More Speed & Simplicity: Practical Data-Oriented Design in C++ - V. Romeo
- The C++ Execution Model
- How to Tame Packs, std::tuple, and the Wily std::integer_sequence - A. Alexandrescu
- What Every Programmer Should Know about How CPUs Work - M. Godbolt
- *(char*)0 = 0; What Does the C++ Programmer Intend With This Code? - J.F. Bastien
Websites, blogs & articles
- HPC Wiki
- Chips & Cheese
- Daniel Lemire
- Matt Godbolt's Advent of Compiler Optimizations (AoCO) 2025
- Optimizing parallel reductions in CUDA
- Parallel Prefix Sum (Scan) with CUDA
- Inside NVIDIA GPUs: Anatomy of high performance matmul kernels
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Outperforming cuBLAS on H100: a Worklog
- Notes About Nvidia GPU Shared Memory Banks
- A (Draft) Taxonomy of SIMD Usage
- TBA