Recent generations of CPUs, and GPUs in particular, require data-parallel codes for full efficiency. Data parallelism requires that the same sequence of operations is applied to different input data. CPUs and GPUs can thus reduce the necessary hardware for instruction decoding and scheduling in favor of more arithmetic and logic units, which execute the same instructions synchronously. On CPU architectures this is implemented via SIMD registers and instructions. A single SIMD register can store N values and a single SIMD instruction can execute N operations on those values. On GPU architectures N threads run in perfect sync, fed by a single instruction decoder/scheduler. Each thread has local memory and a given index to calculate the offsets in memory for loads and stores.
Current C++ compilers can automatically transform scalar code into SIMD instructions (auto-vectorization). However, the compiler must reconstruct an intrinsic property of the algorithm that was lost when the developer wrote a purely scalar implementation in C++. Consequently, C++ compilers cannot transform every piece of code into its most efficient data-parallel variant. Larger data-parallel loops in particular, spanning multiple functions or even translation units, will often not be transformed into efficient SIMD code.
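As an illustration of the kind of code auto-vectorization handles well, here is a minimal sketch of a loop whose iterations are independent of each other. The function name scale_add and its parameters are hypothetical and chosen only for this example. Whether a given compiler actually emits SIMD instructions for it depends on the optimization flags and the target.

```cpp
#include <cstddef>

// Hypothetical example: out[i] = a * x[i] + y[i] for n elements.
// Every iteration reads and writes only its own index, which is the
// data-parallel property an auto-vectorizer has to rediscover.
void scale_add(float a, const float *x, const float *y, float *out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a * x[i] + y[i];
    }
}
```

Once such a loop body calls into other functions or other translation units, the compiler has much less information to work with, which is the situation described above.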
The Vc library provides the missing link. Its types enable explicitly stating data-parallel operations on multiple values. The parallelism is therefore added via the type system. Competing approaches state the parallelism via new control structures and consequently new semantics inside the body of these control structures.
If you are new to vectorization, please read the following part and make sure you understand it:
A SIMD vector is closer to a std::array than to a dynamically resizable std::vector.
You can modify a function to use vector types and thus implement horizontal vectorization. The original scalar function could look like this (if you are confused about the adjective "scalar" in this context, note that the function mathematically does a vector to vector transformation; the computer, however, computes it with scalar instructions, i.e. one value per operand):
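A minimal sketch of such a scalar function, assuming the 3D vector is passed as three float references and has non-zero length:

```cpp
#include <cmath>

// Scalar version: normalizes one 3D vector (x, y, z) in place.
// Assumption: the vector has non-zero length, so d is never zero.
void normalize(float &x, float &y, float &z)
{
    const float d = std::sqrt(x * x + y * y + z * z);
    x /= d;
    y /= d;
    z /= d;
}
```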
To vectorize the normalize function with Vc, the types must be substituted by their Vc counterparts and math functions must use the Vc implementation (which is, by default, also imported into the std namespace):
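The type substitution can be sketched without the library. The Float4 struct below is a hypothetical four-lane stand-in, written only so that the example compiles on its own. With Vc, the type would be Vc::float_v and the square root would be Vc::sqrt, and the operations would map to SIMD instructions instead of loops.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for a SIMD vector type such as Vc::float_v:
// four float lanes with element-wise arithmetic.
struct Float4 {
    std::array<float, 4> lane;
};

inline Float4 operator*(const Float4 &a, const Float4 &b)
{
    Float4 r{};
    for (std::size_t i = 0; i < 4; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

inline Float4 operator+(const Float4 &a, const Float4 &b)
{
    Float4 r{};
    for (std::size_t i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

inline Float4 &operator/=(Float4 &a, const Float4 &b)
{
    for (std::size_t i = 0; i < 4; ++i) a.lane[i] /= b.lane[i];
    return a;
}

// Element-wise square root, playing the role of the library's sqrt.
inline Float4 sqrt(const Float4 &a)
{
    Float4 r{};
    for (std::size_t i = 0; i < 4; ++i) r.lane[i] = std::sqrt(a.lane[i]);
    return r;
}

// Same body as the scalar function; only the types changed.
// x holds the x components of four different 3D vectors, and so on.
inline void normalize(Float4 &x, Float4 &y, Float4 &z)
{
    const Float4 d = sqrt(x * x + y * y + z * z);
    x /= d;
    y /= d;
    z /= d;
}
```

Note the data layout this implies: each argument holds the same component of several 3D vectors, rather than one argument holding a whole 3D vector.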
The latter function is able to normalize four 3D vectors when compiled for SSE in the same time the former function normalizes one 3D vector.
For completeness, note that you can optimize the division in the normalize function further: compute the reciprocal d_inv of the vector length once, then multiply x, y, and z with d_inv, which is considerably faster than three divisions.
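The reciprocal optimization can be sketched as follows. It is shown for plain float to keep the example self-contained, and it assumes a non-zero vector length. The same restructuring applies to the vectorized function.

```cpp
#include <cmath>

// Normalizes one 3D vector in place using a single division.
// Assumption: the vector has non-zero length.
void normalize(float &x, float &y, float &z)
{
    // One division to obtain the reciprocal of the length ...
    const float d_inv = 1.0f / std::sqrt(x * x + y * y + z * z);
    // ... followed by three multiplications instead of three divisions.
    x *= d_inv;
    y *= d_inv;
    z *= d_inv;
}
```

Multiplying by a reciprocal can differ from a true division in the last bit of the result, which is usually acceptable for normalization.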
As you can probably see, the new challenge with Vc is designing good data structures that support horizontal vectorization. Depending on the problem at hand, this may become the main focus of design (it does not have to be, though).
If you do not know what alignment is, and why it is important, read on, otherwise skip to Tools. Normally the alignment of data is an implementation detail left to the compiler. Until C++11, the language did not even have any (official) means to query or modify alignment.
Most data types require more than one byte for storage. Thus, even most atomic data types span several locations in memory. E.g. if you have a pointer to float, the address stored in this pointer just determines the first of the four bytes of the float. Naively, one could think that any address (which belongs to the process) can be used to store such a float. While this is true for some architectures, others may terminate the process when a misaligned pointer is dereferenced. The natural alignment for atomic data types typically is the same as their size. Thus the address of a float object should always be a multiple of 4 bytes.
Alignment becomes more important for SIMD data types.
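Since C++11, alignment can be queried with alignof and requested with alignas. The sketch below uses both. The helper is_aligned and the struct FloatBlock are hypothetical names for this example, and the value 16 assumes a 128-bit SSE register; wider SIMD targets need larger values.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if the address p is a multiple of alignment.
inline bool is_aligned(const void *p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Four floats that together fill one 128-bit register.
// alignas(16) asks the compiler to place every object of this type
// on a 16-byte boundary, which is stricter than the natural alignment
// of its float members.
struct alignas(16) FloatBlock {
    float lane[4];
};

static_assert(alignof(FloatBlock) == 16, "FloatBlock must be 16-byte aligned");
```

For example, is_aligned(&f, alignof(float)) holds for any ordinary float object f, and is_aligned(&b, 16) holds for any FloatBlock object b that the compiler places on the stack or in static storage.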
Vc provides several classes and functions to get alignment right.
Overloaded new and delete operators, returning correctly aligned pointers to the heap.
Replacements for malloc and free. They can be used to allocate any type of memory with an abstract alignment restriction: Vc::MallocAlignment. Note that (like malloc) the memory is only allocated and not initialized. If you allocate memory for a type that has a constructor, use the placement new syntax to initialize the memory.
Vc::Allocator, an allocator for STL containers that returns memory correctly aligned for T. STL containers will already default to Vc::Allocator for Vc::Vector<T>. For all other composite types you want to use, you can use the Vc_DECLARE_ALLOCATOR convenience macro to set it as the default.
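The allocate-then-construct pattern mentioned for the malloc replacement can be sketched with standard C++ only. This example uses std::malloc and std::align as stand-ins rather than the Vc allocation functions, and Particle, AlignedParticle, make_aligned_particle and destroy_aligned_particle are hypothetical names.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Hypothetical type with a constructor, so raw memory is not enough.
struct Particle {
    float x, y, z, w;
    Particle() : x(0.f), y(0.f), z(0.f), w(1.f) {}
};

// raw is the block returned by std::malloc (needed for std::free);
// object points to the aligned, constructed Particle inside that block.
struct AlignedParticle {
    void *raw;
    Particle *object;
};

// alignment must be a power of two.
inline AlignedParticle make_aligned_particle(std::size_t alignment)
{
    // Over-allocate so that an aligned address always fits in the block.
    std::size_t space = sizeof(Particle) + alignment;
    void *raw = std::malloc(space);
    if (raw == nullptr) return {nullptr, nullptr};

    void *p = raw;
    void *aligned = std::align(alignment, sizeof(Particle), p, space);
    if (aligned == nullptr) {
        std::free(raw);
        return {nullptr, nullptr};
    }
    // The memory is only allocated so far; placement new runs the constructor.
    Particle *object = new (aligned) Particle();
    return {raw, object};
}

inline void destroy_aligned_particle(AlignedParticle &h)
{
    if (h.object != nullptr) h.object->~Particle(); // explicit destructor call
    std::free(h.raw);
    h.raw = nullptr;
    h.object = nullptr;
}
```

Memory obtained this way must be released in two steps, as in destroy_aligned_particle: first the explicit destructor call, then the deallocation of the original block.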