Lab 1: HPC Toolchains & Performance Profiling
A reminder on compilers, build systems, development tools and profilers for HPC.
The material for this lab is available on the following GitHub repo.
CMake build system
- Write a minimal `CMakeLists.txt` file for the `vector` code.
- Improve your build system by splitting the `CMakeLists.txt` into multiple files in the directory hierarchy. Use CMake best practices.
Relevant resources for this exercise:
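A minimal starting point for the first exercise might look like the sketch below (the source file name and C++ standard are assumptions about the `vector` code, not taken from the repository):

```cmake
cmake_minimum_required(VERSION 3.16)
project(vector LANGUAGES CXX)

# Hypothetical source layout; replace src/main.cpp with the real files.
add_executable(vector src/main.cpp)
target_compile_features(vector PRIVATE cxx_std_17)
```

For the second exercise, a common best practice is one `CMakeLists.txt` per directory, pulled in with `add_subdirectory()`, using per-target commands (`target_include_directories`, `target_link_libraries`) rather than global variables.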
Compiler toolchains and flags
- Using the GCC compiler, build `vector` into an executable without any optimization flags. What do you notice in the displayed values of vector `v3`?
- Add the compilation flags `-Wall`, `-Wextra` and `-Werror`. Fix all the warnings emitted by the compiler and recompile the program.
- Run the corrected program and record the time displayed.
- Compile different versions of the code using the flags `-O1`, `-O2`, `-O3` and `-Ofast`. Run each of them and record their execution times. What do you see? Compare the generated assembly.
- Repeat with a different compiler (e.g. LLVM Clang, Intel ICX, etc.). What differences do you notice?
Compilation passes
- Compare the flags included with `-O2` on GCC 11.x and on GCC 12.x. Which flags differ?
- Compare the flags included with `-Ofast` on GCC and on LLVM Clang. Which flags differ?
- Study the program in the `saxpy` directory, then compile it with the `-fdump-tree-all` flag. Run the `ls` command. What do you see? What does it correspond to? How many files do you see?
- Delete the files that have appeared, and recompile with the `-O1` flag. What do you observe? Are any files missing compared to the previous compilation?
Code sanitizers
Valgrind is an open-source programming tool for debugging, profiling and identifying memory leaks. (see Wikipedia page)
AddressSanitizer (or ASan) is an open-source programming tool that detects memory corruption bugs such as buffer overflows or accesses to a dangling pointer (use-after-free). AddressSanitizer is based on compiler instrumentation and directly mapped shadow memory. AddressSanitizer is currently implemented in Clang (starting from version 3.1), GCC (starting from version 4.8), Xcode (starting from version 7.x) and MSVC (widely available starting from version 16.9). (see Wikipedia page)
- Use Valgrind on the `saxpy` code. What do you see?
- Use ASan on the `saxpy` code. What do you observe? Fix the problem(s).
- What are `dhat`, `memcheck`, `cachegrind`, `callgrind` and `massif`?
- What are `kcachegrind` and `massif-visualizer`?
Debuggers
Sequential debugging
Study the program in the `bugs` directory, then compile it with the `-g3` flag. Run the program using the GNU Debugger (GDB).
GDB features interactive help. Start by browsing the help menu, typing `help` to get a list of commands. Then type `help` followed by a command name to obtain information about it.
- To locate the source of the first bug, type `run` (or `r`) under GDB. Quit (`quit`), correct the error and recompile the program.
- Proceed in the same way to correct the next bug. Use the `backtrace` (or `bt`) command to display the call stack and obtain more information on the source of the error.
- Identify the next error after recompiling the program. What is the problem? Can it be corrected?
- We're now going to solve the last bug with other basic GDB functions. Start the debugger without using the `run` command for the moment. Set a breakpoint on the `launch_fibonacci()` function using `break launch_fibonacci` (or `b launch_fibonacci`). Now run the program and try to print the value of `fibo_values->max`. What do you see? Use `up` to position yourself before the function call, then `list` to display the lines of code around the breakpoint. Now enter the command `print fibo_values`. Correct the problem and recompile the program.
The sequence is incorrect. We should obtain the following numbers:
$$
F_0 = 0,\ F_1 = 1,\ F_2 = 1,\ F_3 = 2,\ F_4 = 3,\ F_5 = 5,\ F_6 = 8,\ \ldots
$$
To pinpoint the source of the error, we will display the values in the sequence step by step, monitoring changes to the `fibo_values->result` variable.
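One way to monitor those changes (a sketch; the commands are standard GDB, and the variable name follows the text above) is a watchpoint:

```gdb
(gdb) break launch_fibonacci
(gdb) run
(gdb) watch fibo_values->result
(gdb) continue
```

Each `continue` now stops as soon as `fibo_values->result` is modified, printing the old and new values.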
Parallel debugging
There are debuggers specifically designed for the needs of parallel programs. Examples include TotalView and Linaro DDT. Although these programs are very powerful, they are neither free nor open-source.
Nevertheless, it is possible to use GDB to debug small-scale parallel programs.
For multi-threaded programs, simply use the `info threads` command in GDB. We can choose to view a particular thread with the command `thread <thread_number>`.
Tip
For MPI programs, one trick is to launch a separate GDB instance attached to each rank.
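One common form of this trick (assuming an X session with `xterm` available, and `mol-dyn` as a placeholder program name) opens one interactive GDB window per MPI rank:

```shell
mpirun -np 4 xterm -e gdb ./mol-dyn
```

Type `run` in each window. Without X, an alternative is to make each rank print its PID and sleep at startup, then attach from another terminal with `gdb -p <pid>`.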
- Start from the original `mol-dyn` code and parallelize the program's most expensive function using OpenMP or Pthreads.
- If necessary, use the above methods to debug your parallel code on `mol-dyn`.
Performance profiling with GNU gprof, Linux perf and hotspot
Gprof is a GNU Binary Utilities program for code profiling.
When compiling and linking source code with GCC, simply add the `-pg` option; when run, the instrumented program generates a `gmon.out` file containing the profiling information.
You can then use Gprof to read this file, passing it the executable and the `gmon.out` file.
Perf (sometimes called perf_events or perf-tools, originally Performance Counters for Linux, PCL) is a performance analysis tool for Linux, available in the Linux kernel since version 2.6.31 in 2009.
The userspace controlling utility, named perf, is accessed from the command line and provides a number of subcommands; it is capable of statistical profiling of the entire system (both kernel and userspace code).
Hotspot is a GUI for the Linux Perf profiler that replaces the `perf report` command (see its GitHub page).
The mol-dyn code supplied is a model of a molecular dynamics simulation in a gas (interaction between gas molecules).
To compile, there are three preset sizes to choose from:
| Preset | CMake command | Number of particles |
|---|---|---|
| mini | `-DNPART=MINI` | 1372 |
| medium | `-DNPART=MEDIUM` | 4000 |
| maxi | `-DNPART=MAXI` | 13500 |
- Use Gprof on the `mol-dyn` code. What is the most computationally-intensive function?
- Use Perf and Hotspot on the code. What is the most computationally-intensive function?
- What is a hotspot in the context of performance profiling?
STREAM benchmarks
Given the code provided in the lab 1 repository, you are to measure the performance of a very basic OpenMP implementation of the STREAM bandwidth benchmarks.
Performance metrics
- Modify the `src/bin/main.cpp` file in order to actually measure something (e.g. execution time) and extract some performance data. You can use standard C or C++ clocks, CPU clock cycles, or any other unit of time that you think makes sense. You can also swap out the minimal provided code and use a dedicated benchmarking library instead, e.g. Google Benchmark, nanobench, or Catch2.
- Derive some meaningful metrics (e.g., memory bandwidth) from your raw measurements.
- Plot the obtained data (e.g., using a Python script).
Scalability
- Measure the strong scaling speedup and weak scaling efficiency of the STREAM benchmarks. Write a simple script to do it.
- Plot the obtained data.
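As a reminder of the standard definitions (with $T_1$ the one-process time and $T_p$ the $p$-process time):

$$
S(p) = \frac{T_1}{T_p} \quad \text{(strong scaling: fixed total problem size)}, \qquad
E(p) = \frac{T_1}{T_p} \quad \text{(weak scaling: fixed problem size per process)}
$$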