Parallel STL Tutorial and Benchmarks

If you are programming or have already developed with C++, there is a good chance that you have used Standard Template Library (STL) containers and algorithms in your codes. In that case, you can easily boost the performance of your existing codes with parallel algorithms introduced in C++17. The good news is you do not have to wait until the support for the parallel algorithm being added to the C++ compiler of your choice.

The Intel’s Parallel STL is a fairly complete implementation of the C++ standard library algorithms with support for execution policies, as specified in ISO/IEC 14882:2017 standard, AKA C++17. It is a standalone header-only library available for free on GitHub. It can work with any C++11 compiler that works with Intel’s Threading Building Blocks (TBB), which is also available for free. In addition, if you want to use non-standard vectorization (unsequenced policies), your compiler should support OpenMP 4.0 SIMD constructs. Intel have offered to donate their implementation to both GCC and Clang.

This is a tutorial/benchmark for exploring Parallel STL library on SHARCNet clusters.

List of Benchmarks

Building and Installation

Insatlling Dependencies

This tutorial depends on the following software:

The CMake scripts configured in way that if it cannot find the above dependencies it tries to download and build them automatically. So, there is no need to do anything if they are missing.

Building using CMake

On Compute Canada national plateforms (e.g. graham, cedar, etc.), you can download and install the tutorial in your home directory using the following commands:

$ cd ~/scratch
$ git clone https://git.sharcnet.ca/asobhani/parallelstl_tutorial.git
$ cd parallelstl_tutorial
$ mkdir build && cd build
$ make

If you like to install all the benchmarks into ~/bin so they can be run from anywhere, you can run the following command at the end:

$ make install


After buliding, you will find all the benchmarks in build/src dirctory and you can use their names to run them. To have a reliable benchmark result on clusters you need to submit a job or using an interactive session:

$ salloc --time=0:5:0 --mem=5G --ntasks=1 --cpus-per-task=32 --account==def-someuser
salloc: Pending job allocation 12025281
salloc: job 12025281 queued and waiting for resources
salloc: job 12025281 has been allocated resources
salloc: Granted job allocation 12025281
salloc: Waiting for resource configuration
salloc: Nodes gra415 are ready for job
[asobhani@gra415 build2]$ src/perf_sort
2019-02-25 11:18:31
Running src/perf_sort
Run on (32 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x32)
  L1 Instruction 32K (x32)
  L2 Unified 256K (x32)
  L3 Unified 40960K (x2)
Benchmark                                      Time           CPU Iterations
STL_Sort/10000000                        1161081 us    1161074 us          1
PSTL_Sort/pstl_seq/10000000/real_time    1036589 us    1036567 us          1
PSTL_Sort/pstl_par/10000000/real_time      15631 us      15621 us         36
$ _

Using --help shows all the available options for running the benchmark:

$ src/perf_sort --help
benchmark [--benchmark_list_tests={true|false}]

Interpreting the Results

The benchmark is tested a couple of times with different iteration counts to determine the correct count to use. After it finds an iteration count that causes the benchmark to run for a significant enough period of time it reports only that run in the last column (Iterations), and not the test runs before it. The Time is the average wall time per iteration. The CPU is the average CPU time per iteration. By default the CPU time is used when determining the amount of iterations to run.