StarPU Handbook
17. Benchmarking StarPU

Some interesting benchmarks are installed among examples in $STARPU_PATH/lib/starpu/examples/. Make sure to try various schedulers, for instance STARPU_SCHED=dmda.
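A scheduler comparison can be scripted as a small loop. The following is a minimal sketch, assuming a POSIX shell and that cholesky_implicit is installed in the examples directory. The policy list (eager, ws, dmda) and the fallback value for STARPU_PATH are illustrative choices, not recommendations.

```shell
# Illustrative sweep: run one benchmark under several scheduling policies.
# The fallback path and the policy list are assumptions; adjust to your install.
STARPU_PATH="${STARPU_PATH:-/usr/local}"
BENCH="$STARPU_PATH/lib/starpu/examples/cholesky_implicit"
SCHEDULERS="eager ws dmda"
ran=0

for sched in $SCHEDULERS; do
    if [ -x "$BENCH" ]; then
        echo "== STARPU_SCHED=$sched =="
        # The variable is set for this single invocation only.
        STARPU_SCHED="$sched" "$BENCH" -size $((960*20)) -nblocks 20
        ran=$((ran + 1))
    else
        echo "skipped $sched: $BENCH not found or not executable" >&2
    fi
done
```

Setting STARPU_SCHED on the command line, rather than exporting it, keeps each run independent of the previous one.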

17.1 Task Size Overhead

This benchmark gives a glimpse into how long a task should be (in µs) for the StarPU overhead to remain low enough to preserve efficiency. Running tasks_size_overhead.sh generates a plot of the speedup obtained with tasks of various sizes, depending on the number of CPUs being used.

17.2 Data Transfer Latency

local_pingpong performs a ping-pong between the first two CUDA nodes, and prints the measured latency.

17.3 Matrix-Matrix Multiplication

sgemm and dgemm perform a blocked matrix-matrix multiplication using BLAS and cuBLAS. They print the achieved performance in GFlop/s.

17.4 Cholesky Factorization

cholesky_* perform a Cholesky factorization (single precision). They use different dependency primitives.

17.5 LU Factorization

lu_* perform an LU factorization. They use different dependency primitives.

17.6 Simulated Benchmarks

Simulated benchmarks are convenient if you want to experiment with CPU-GPU scheduling without actually having a GPU at hand. This can be done with the SimGrid version of StarPU. First install the SimGrid simulator from https://simgrid.org/. We tested with SimGrid versions 3.11 to 3.16 and 3.18 to 3.30; versions 3.25 and above need to be configured with -Denable_msg=ON. Other versions may have compatibility issues; 3.17 notably does not build at all. MPI simulation does not work with version 3.22. Then configure StarPU with --enable-simgrid, rebuild it, and install it. You can then simulate the performance of a few virtualized systems shipped with StarPU: attila, mirage, idgraf, and sirocco.

For instance:

$ export STARPU_PERF_MODEL_DIR=$STARPU_PATH/share/starpu/perfmodels/sampling
$ export STARPU_HOSTNAME=attila
$ $STARPU_PATH/lib/starpu/examples/cholesky_implicit -size $((960*20)) -nblocks 20

This shows the performance of the Cholesky factorization on the simulated attila system. It is worth trying different matrix sizes and schedulers.

Performance models are available for cholesky_*, lu_*, and *gemm with block sizes 320, 640, or 960 (plus 1440 for sirocco), and for stencil with block sizes 128x128x128, 192x192x192, and 256x256x256.
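One way to explore these models is to sweep the block size while keeping the number of blocks fixed. The sketch below is only an illustration under stated assumptions: it reuses the attila settings shown above, the build_cmd helper is hypothetical, and the fallback STARPU_PATH is a placeholder. It prints each command and runs it only when the binary is present.

```shell
# Hypothetical helper: print the cholesky_implicit command line for a given
# block size and block count (matrix size = block size * number of blocks).
build_cmd() {
    block=$1
    nblocks=$2
    echo "$STARPU_PATH/lib/starpu/examples/cholesky_implicit -size $((block * nblocks)) -nblocks $nblocks"
}

STARPU_PATH="${STARPU_PATH:-/usr/local}"   # placeholder default, an assumption
export STARPU_PERF_MODEL_DIR="$STARPU_PATH/share/starpu/perfmodels/sampling"
export STARPU_HOSTNAME=attila

# Block sizes for which the text lists performance models (1440 is sirocco-only).
for block in 320 640 960; do
    cmd=$(build_cmd "$block" 20)
    echo "$cmd"
    # Run only if the example binary is actually installed.
    # Note: this simple splitting assumes STARPU_PATH contains no spaces.
    if [ -x "${cmd%% *}" ]; then
        $cmd
    fi
done
```

With 20 blocks, the three iterations request matrix sizes of 6400, 12800, and 19200, the last one matching the example command above.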

See the chapter SimGrid Support for more information.