1 files changed, 110 insertions, 0 deletions
diff --git a/doc/routines.md b/doc/routines.md
new file mode 100644
index 00000000..c5e14907
--- /dev/null
+++ b/doc/routines.md
@@ -0,0 +1,110 @@
+CLBlast: Supported routines overview
+================
+
+This document describes which routines are supported in CLBlast. For other information about CLBlast, see the [main README](../README.md).
+
+Full API documentation is available in a separate [API documentation file](api.md).
+
+
+Supported types
+-------------
+
+The different data-types supported by the library are:
+
+* __S:__ Single-precision 32-bit floating-point (`float`).
+* __D:__ Double-precision 64-bit floating-point (`double`).
+* __C:__ Complex single-precision 2x32-bit floating-point (`std::complex<float>`).
+* __Z:__ Complex double-precision 2x64-bit floating-point (`std::complex<double>`).
+* __H:__ Half-precision 16-bit floating-point (`cl_half`). See section 'Half precision' below for more information.
+
+
+Supported routines
+-------------
+
+CLBlast supports almost all the Netlib BLAS routines plus a couple of extra non-BLAS routines. The supported BLAS routines are marked with '✔' in the following tables. Routines marked with '-' do not exist: they are not part of BLAS at all.
+
+| Level-1  | S | D | C | Z | H |
+| ---------|---|---|---|---|---|
+| xSWAP    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xSCAL    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xCOPY    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xAXPY    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xDOT     | ✔ | ✔ | - | - | ✔ |
+| xDOTU    | - | - | ✔ | ✔ | - |
+| xDOTC    | - | - | ✔ | ✔ | - |
+| xNRM2    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xASUM    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| IxAMAX   | ✔ | ✔ | ✔ | ✔ | ✔ |
+
+| Level-2  | S | D | C | Z | H |
+| ---------|---|---|---|---|---|
+| xGEMV    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xGBMV    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xHEMV    | - | - | ✔ | ✔ | - |
+| xHBMV    | - | - | ✔ | ✔ | - |
+| xHPMV    | - | - | ✔ | ✔ | - |
+| xSYMV    | ✔ | ✔ | - | - | ✔ |
+| xSBMV    | ✔ | ✔ | - | - | ✔ |
+| xSPMV    | ✔ | ✔ | - | - | ✔ |
+| xTRMV    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xTBMV    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xTPMV    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xGER     | ✔ | ✔ | - | - | ✔ |
+| xGERU    | - | - | ✔ | ✔ | - |
+| xGERC    | - | - | ✔ | ✔ | - |
+| xHER     | - | - | ✔ | ✔ | - |
+| xHPR     | - | - | ✔ | ✔ | - |
+| xHER2    | - | - | ✔ | ✔ | - |
+| xHPR2    | - | - | ✔ | ✔ | - |
+| xSYR     | ✔ | ✔ | - | - | ✔ |
+| xSPR     | ✔ | ✔ | - | - | ✔ |
+| xSYR2    | ✔ | ✔ | - | - | ✔ |
+| xSPR2    | ✔ | ✔ | - | - | ✔ |
+| xTRSV    | ✔ | ✔ | ✔ | ✔ |   |
+
+| Level-3  | S | D | C | Z | H |
+| ---------|---|---|---|---|---|
+| xGEMM    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xSYMM    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xHEMM    | - | - | ✔ | ✔ | - |
+| xSYRK    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xHERK    | - | - | ✔ | ✔ | - |
+| xSYR2K   | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xHER2K   | - | - | ✔ | ✔ | - |
+| xTRMM    | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xTRSM    | ✔ | ✔ | ✔ | ✔ |   |
+
+Furthermore, there are also batched versions of BLAS routines available, processing multiple smaller computations in one go for better performance:
+
+| Batched             | S | D | C | Z | H |
+| --------------------|---|---|---|---|---|
+| xAXPYBATCHED        | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xGEMMBATCHED        | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xGEMMSTRIDEDBATCHED | ✔ | ✔ | ✔ | ✔ | ✔ |
+
+In addition, some extra non-BLAS routines are also supported by CLBlast, classified as level-X. They are experimental and should be used with care:
+
+| Level-X    | S | D | C | Z | H |
+| -----------|---|---|---|---|---|
+| xSUM       | ✔ | ✔ | ✔ | ✔ | ✔ | (Similar to xASUM, but not absolute)
+| IxAMIN     | ✔ | ✔ | ✔ | ✔ | ✔ | (Similar to IxAMAX, but minimum instead of maximum)
+| IxMAX      | ✔ | ✔ | ✔ | ✔ | ✔ | (Similar to IxAMAX, but not absolute)
+| IxMIN      | ✔ | ✔ | ✔ | ✔ | ✔ | (Similar to IxAMAX, but not absolute and minimum instead of maximum)
+| xHAD       | ✔ | ✔ | ✔ | ✔ | ✔ | (Hadamard product)
+| xOMATCOPY  | ✔ | ✔ | ✔ | ✔ | ✔ | (Out-of-place copying/transposing/scaling of matrices)
+| xIM2COL    | ✔ | ✔ | ✔ | ✔ | ✔ | (Image to column transform as used to express convolution as GEMM)
+
+Some less commonly used BLAS routines are not yet supported yet by CLBlast. They are xROTG, xROTMG, xROT, xROTM, xTBSV, and xTPSV.
+
+
+Half precision (fp16)
+-------------
+
+The half-precision fp16 format is a 16-bits floating-point data-type. Some OpenCL devices support the `cl_khr_fp16` extension, reducing storage and bandwidth requirements by a factor 2 compared to single-precision floating-point. In case the hardware also accelerates arithmetic on half-precision data-types, this can also greatly improve compute performance of e.g. level-3 routines such as GEMM. Devices which can benefit from this are among others Intel GPUs, ARM Mali GPUs, and NVIDIA's latest Pascal GPUs. Half-precision is in particular interest for the deep-learning community, in which convolutional neural networks can be processed much faster at a minor accuracy loss.
+
+Since there is no half-precision data-type in C or C++, OpenCL provides the `cl_half` type for the host device. Unfortunately, internally this translates to a 16-bits integer, so computations on the host using this data-type should be avoided. For convenience, CLBlast provides the `clblast_half.h` header (C99 and C++ compatible), defining the `half` type as a short-hand to `cl_half` and the following basic functions:
+
+* `half FloatToHalf(const float value)`: Converts a 32-bits floating-point value to a 16-bits floating-point value.
+* `float HalfToFloat(const half value)`: Converts a 16-bits floating-point value to a 32-bits floating-point value.
+
+The [samples/haxpy.c](../samples/haxpy.c) example shows how to use these convenience functions when calling the half-precision BLAS routine HAXPY.