High-Performance Tensor Transposition (HPTT) C++ Library
A C++ library for high-performance multi-threaded tensor transpositions.

Introduction

HPTT supports tensor transpositions of the general form:

  \[ \mathcal{B}_{\pi(i_0,i_1,...,i_{d-1})} \gets \alpha \cdot \mathcal{A}_{i_0,i_1,...,i_{d-1}} + \beta \cdot \mathcal{B}_{\pi(i_0,i_1,...,i_{d-1})}, \]

where $\alpha$ and $\beta$ are scalars and $\mathcal{A}$ and $\mathcal{B}$ are $d$-dimensional tensors (i.e., multi-dimensional arrays).

HPTT assumes a column-major data layout, thus indices are stored from left to right (e.g., $i_0$ is the stride-1 index in $\mathcal{A}_{i_0,i_1,...,i_{d-1}}$).
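To make the formula and the layout convention concrete, here is a minimal, unoptimized reference sketch in plain C++. It is not HPTT's implementation, and the function name reference_transpose is hypothetical. It assumes that perm[j] names the dimension of $\mathcal{A}$ that becomes dimension j of $\mathcal{B}$, so the extent of $\mathcal{B}$ along j is size[perm[j]].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive reference for B_{pi(i)} <- alpha * A_i + beta * B_{pi(i)} with
// column-major storage (index 0 has stride 1). Illustration only.
// Assumed convention: perm[j] is the dimension of A that becomes dimension j of B.
void reference_transpose(const std::vector<int>& perm,
                         const std::vector<int>& sizeA,
                         float alpha, const std::vector<float>& A,
                         float beta, std::vector<float>& B)
{
    const std::size_t dim = sizeA.size();
    assert(perm.size() == dim);

    // Column-major strides of A: strideA[0] = 1, strideA[k] = strideA[k-1] * sizeA[k-1].
    std::vector<std::size_t> strideA(dim, 1);
    for (std::size_t k = 1; k < dim; ++k)
        strideA[k] = strideA[k - 1] * sizeA[k - 1];

    // Column-major strides of B, whose extents are sizeB[j] = sizeA[perm[j]].
    std::vector<std::size_t> strideB(dim, 1);
    for (std::size_t j = 1; j < dim; ++j)
        strideB[j] = strideB[j - 1] * sizeA[perm[j - 1]];

    std::size_t total = 1;
    for (std::size_t k = 0; k < dim; ++k) total *= sizeA[k];
    assert(A.size() == total && B.size() == total);

    std::vector<int> idx(dim, 0);  // multi-index into A
    for (std::size_t n = 0; n < total; ++n) {
        std::size_t offA = 0, offB = 0;
        for (std::size_t k = 0; k < dim; ++k) offA += idx[k] * strideA[k];
        for (std::size_t j = 0; j < dim; ++j) offB += idx[perm[j]] * strideB[j];
        B[offB] = alpha * A[offA] + beta * B[offB];

        // Advance idx like an odometer, fastest (stride-1) index first.
        for (std::size_t k = 0; k < dim; ++k) {
            if (++idx[k] < sizeA[k]) break;
            idx[k] = 0;
        }
    }
}
```

For a 2x3 matrix stored column-major as {1,2,3,4,5,6}, calling this sketch with perm = {1,0}, alpha = 1 and beta = 0 yields {1,3,5,2,4,6}, i.e., the ordinary matrix transpose.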

Key Features

  • Multi-threading support
  • Explicit vectorization
  • Auto-tuning (akin to FFTW)
    • Loop order
    • Parallelization
  • Multi-architecture support
    • Explicitly vectorized kernels for AVX and ARM
  • Support for float, double, complex and double complex data types
  • Can operate on sub-tensors
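
The sub-tensor feature can be pictured as a window into a larger allocation: the window has its own extents, while strides are derived from the extents of the enclosing tensor. The helper below is a hypothetical sketch of that addressing in plain C++ for a column-major layout. The names subtensor_offset, start and outerSize are illustrative and are not part of HPTT's API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear offset of element `idx` of a sub-tensor embedded in a larger
// column-major tensor. Strides come from the enclosing (outer) extents;
// `start` is the multi-index of the sub-tensor's first element.
std::size_t subtensor_offset(const std::vector<int>& idx,
                             const std::vector<int>& start,
                             const std::vector<int>& outerSize)
{
    assert(idx.size() == outerSize.size() && start.size() == outerSize.size());
    std::size_t offset = 0, stride = 1;
    for (std::size_t k = 0; k < outerSize.size(); ++k) {
        assert(start[k] + idx[k] < outerSize[k]);  // stay inside the outer tensor
        offset += static_cast<std::size_t>(start[k] + idx[k]) * stride;
        stride *= static_cast<std::size_t>(outerSize[k]);
    }
    return offset;
}
```

In a 4x5 outer tensor, a window starting at (1,2) has its first element at offset 1 + 2*4 = 9; consecutive elements along the first index stay contiguous, while a step along the second index jumps by the outer extent 4.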

Requirements

You must have a working C++ compiler with C++11 support. I have tested HPTT with:

  • Intel's ICPC 15.0.3, 16.0.3, 17.0.2
  • GNU g++ 5.4, 6.2, 6.3
  • clang++ 3.8, 3.9

Install

Clone the repository into a desired directory and change to that location:

git clone https://github.com/springer13/hptt.git
cd hptt
export CXX=<desired compiler>

Now you have several options to build the desired version of the library:

make avx
make arm
make scalar

This should create 'libhptt.so' inside the ./lib folder.

Getting Started

In general HPTT is used as follows:

#include <hptt.h>

// allocate tensors
float *A = ...
float *B = ...

// specify permutation and size
const int dim = 6;
int perm[dim] = {5,2,0,4,1,3};
int size[dim] = {48,28,48,28,28,28}; // one extent per dimension; the last value (28) is an assumed example

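// scaling factors and thread count (example values)
float alpha = 1.0f;
float beta = 0.0f;
int numThreads = 1;

// create a plan (shared_ptr)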
auto plan = hptt::create_plan( perm, dim, 
                               alpha, A, size, NULL, 
                               beta,  B, NULL, 
                               hptt::ESTIMATE, numThreads);

// execute the transposition
plan->execute();

The example above does not use any auto-tuning; it relies solely on HPTT's performance model. To activate auto-tuning, use hptt::MEASURE or hptt::PATIENT instead of hptt::ESTIMATE.
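
The idea behind measurement-based tuning, as opposed to a model-based estimate, can be illustrated generically: run each candidate and keep the fastest. The following is a minimal sketch of that idea in plain C++, not HPTT's tuner; pick_fastest and the candidate closures are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Runs every candidate `repeats` times and returns the index of the one with
// the smallest best-case wall-clock time. Generic illustration of
// measurement-based tuning; candidates could be, e.g., different loop orders.
std::size_t pick_fastest(const std::vector<std::function<void()>>& candidates,
                         int repeats = 3)
{
    assert(!candidates.empty() && repeats > 0);
    using clock = std::chrono::steady_clock;
    std::size_t best = 0;
    double bestTime = std::numeric_limits<double>::infinity();
    for (std::size_t c = 0; c < candidates.size(); ++c) {
        double fastest = std::numeric_limits<double>::infinity();
        for (int r = 0; r < repeats; ++r) {
            const auto t0 = clock::now();
            candidates[c]();
            const std::chrono::duration<double> dt = clock::now() - t0;
            if (dt.count() < fastest) fastest = dt.count();
        }
        if (fastest < bestTime) { bestTime = fastest; best = c; }
    }
    return best;
}
```

Taking the minimum over several repetitions reduces the influence of timing noise; the trade-off is that tuning time grows with the number of candidates and repetitions.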

For additional information, please refer to the hptt::Transpose class or to hptt::create_plan().

An extensive example is provided here: ./benchmark/benchmark.cpp.

Benchmark

The benchmark is the same as the original TTC benchmark for tensor transpositions.

You can compile the benchmark via:

cd benchmark
make

Before running the benchmark, please modify the number of threads and the thread affinity within the benchmark.sh file. To run the benchmark just use:

./benchmark.sh

This will create a file named hptt_benchmark.dat containing all the runtime information of HPTT and the reference implementation.

Citation

If you want to refer to HPTT as part of a research paper, please cite the following article:

@inproceedings{hptt2017,
author = {Springer, Paul and Su, Tong and Bientinesi, Paolo},
title = {{HPTT}: {A} {H}igh-{P}erformance {T}ensor {T}ransposition {C}++ {L}ibrary},
booktitle = {Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming},
series = {ARRAY 2017},
year = {2017},
isbn = {978-1-4503-5069-3},
location = {Barcelona, Spain},
pages = {56--62},
numpages = {7},
url = {http://doi.acm.org/10.1145/3091966.3091968},
doi = {10.1145/3091966.3091968},
acmid = {3091968},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {High-Performance Computing, autotuning, multidimensional transposition, tensor transposition, tensors, vectorization},
}