author     Cedric Nugteren <web@cedricnugteren.nl>    2016-05-30 11:11:28 +0200
committer  Cedric Nugteren <web@cedricnugteren.nl>    2016-05-30 11:11:28 +0200
commit     61105e38100d323ea270f2cbee0a824d401eaa77 (patch)
tree       a6f8af9f6e75b57870bfce119f037093a46d2e9c /README.md
parent     182d2cffa163688e2ae08d5d526f8eb63914b6ac (diff)
parent     03182f9d07533f795a498936391da744d982e8e2 (diff)
Merge branch 'half_precision' into development
Diffstat (limited to 'README.md')
-rw-r--r--  README.md  133
1 file changed, 77 insertions(+), 56 deletions(-)
diff --git a/README.md b/README.md
index e4564c26..51c282a3 100644
--- a/README.md
+++ b/README.md
@@ -20,6 +20,7 @@ Use CLBlast instead of clBLAS:
* When you are still running on OpenCL 1.1 hardware.
* When you value an organized and modern C++ codebase.
* When you target Intel CPUs and GPUs or embedded devices.
+* When you can benefit from the increased performance of half-precision fp16 data-types.
Use CLBlast instead of cuBLAS:
@@ -127,7 +128,7 @@ If your device is not (yet) among this list or if you want to tune CLBlast for s
cmake -DTUNERS=ON ..
-Note that CLBlast's tuners are based on the CLTune auto-tuning library, which has to be installed separately (version 1.7.0 or higher). CLTune is available from GitHub.
+Note that CLBlast's tuners are based on the CLTune auto-tuning library, which has to be installed separately (version 2.3.1 or higher). CLTune is available from GitHub.
Compiling with `-DTUNERS=ON` will generate a number of tuners, each named `clblast_tuner_xxxxx`, in which `xxxxx` corresponds to a `.opencl` kernel file as found in `src/kernels`. These kernels correspond to routines (e.g. `xgemm`) or to common pre-processing or post-processing kernels (`copy` and `transpose`). Running such a tuner will test a number of parameter-value combinations on your device and report which one gave the best performance. Running `make alltuners` runs all tuners for all precisions in one go. You can set the default device and platform for `alltuners` by setting the `DEFAULT_DEVICE` and `DEFAULT_PLATFORM` environment variables before running CMake.
@@ -177,64 +178,70 @@ These graphs can be generated automatically on your own device. First, compile C
Supported routines
-------------
-CLBlast is in active development but already supports almost all the BLAS routines. The supported routines are marked with '✔' in the following tables. Routines marked with '-' do not exist: they are not part of BLAS at all.
-
-| Level-1 | S | D | C | Z |
-| ---------|---|---|---|---|
-| xSWAP | ✔ | ✔ | ✔ | ✔ |
-| xSCAL | ✔ | ✔ | ✔ | ✔ |
-| xCOPY | ✔ | ✔ | ✔ | ✔ |
-| xAXPY | ✔ | ✔ | ✔ | ✔ |
-| xDOT | ✔ | ✔ | - | - |
-| xDOTU | - | - | ✔ | ✔ |
-| xDOTC | - | - | ✔ | ✔ |
-| xNRM2 | ✔ | ✔ | ✔ | ✔ |
-| xASUM | ✔ | ✔ | ✔ | ✔ |
-| IxAMAX | ✔ | ✔ | ✔ | ✔ |
-
-| Level-2 | S | D | C | Z |
-| ---------|---|---|---|---|
-| xGEMV | ✔ | ✔ | ✔ | ✔ |
-| xGBMV | ✔ | ✔ | ✔ | ✔ |
-| xHEMV | - | - | ✔ | ✔ |
-| xHBMV | - | - | ✔ | ✔ |
-| xHPMV | - | - | ✔ | ✔ |
-| xSYMV | ✔ | ✔ | - | - |
-| xSBMV | ✔ | ✔ | - | - |
-| xSPMV | ✔ | ✔ | - | - |
-| xTRMV | ✔ | ✔ | ✔ | ✔ |
-| xTBMV | ✔ | ✔ | ✔ | ✔ |
-| xTPMV | ✔ | ✔ | ✔ | ✔ |
-| xGER | ✔ | ✔ | - | - |
-| xGERU | - | - | ✔ | ✔ |
-| xGERC | - | - | ✔ | ✔ |
-| xHER | - | - | ✔ | ✔ |
-| xHPR | - | - | ✔ | ✔ |
-| xHER2 | - | - | ✔ | ✔ |
-| xHPR2 | - | - | ✔ | ✔ |
-| xSYR | ✔ | ✔ | - | - |
-| xSPR | ✔ | ✔ | - | - |
-| xSYR2 | ✔ | ✔ | - | - |
-| xSPR2 | ✔ | ✔ | - | - |
-
-| Level-3 | S | D | C | Z |
-| ---------|---|---|---|---|
-| xGEMM | ✔ | ✔ | ✔ | ✔ |
-| xSYMM | ✔ | ✔ | ✔ | ✔ |
-| xHEMM | - | - | ✔ | ✔ |
-| xSYRK | ✔ | ✔ | ✔ | ✔ |
-| xHERK | - | - | ✔ | ✔ |
-| xSYR2K | ✔ | ✔ | ✔ | ✔ |
-| xHER2K | - | - | ✔ | ✔ |
-| xTRMM | ✔ | ✔ | ✔ | ✔ |
+CLBlast is in active development but already supports almost all the BLAS routines. The supported routines are marked with '✔' in the following tables. Routines marked with '-' do not exist: they are not part of BLAS at all. The different data-types supported by the library are:
+
+* __S:__ Single-precision 32-bit floating-point (`float`).
+* __D:__ Double-precision 64-bit floating-point (`double`).
+* __C:__ Complex single-precision 2x32-bit floating-point (`std::complex<float>`).
+* __Z:__ Complex double-precision 2x64-bit floating-point (`std::complex<double>`).
+* __H:__ Half-precision 16-bit floating-point (`cl_half`). See section 'Half precision' for more information.
+
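To make the precision letters concrete: in the C++ interface (header `clblast.h`) the precision is selected through the host data-type of the arguments. The following is a minimal sketch for illustration only (it assumes an installed CLBlast, a working OpenCL set-up, and omits error checking; it is not one of the shipped samples), running SAXPY, the single-precision entry of the Level-1 table below:

```cpp
// Sketch: SAXPY ('S' = single-precision) through CLBlast's C++ interface.
#define CL_USE_DEPRECATED_OPENCL_1_2_APIS  // silence clCreateCommandQueue deprecation warnings
#include <vector>
#include <CL/cl.h>
#include <clblast.h>

int main() {
  // Plain OpenCL boilerplate: first platform, first device, a context and a queue
  cl_platform_id platform; clGetPlatformIDs(1, &platform, nullptr);
  cl_device_id device; clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr);
  cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
  cl_command_queue queue = clCreateCommandQueue(context, device, 0, nullptr);

  // Host data in single precision (the 'S' column of the tables below)
  const size_t n = 1024;
  std::vector<float> host_x(n, 1.0f), host_y(n, 2.0f);

  // Device buffers, filled with the host data
  cl_mem x = clCreateBuffer(context, CL_MEM_READ_ONLY, n * sizeof(float), nullptr, nullptr);
  cl_mem y = clCreateBuffer(context, CL_MEM_READ_WRITE, n * sizeof(float), nullptr, nullptr);
  clEnqueueWriteBuffer(queue, x, CL_TRUE, 0, n * sizeof(float), host_x.data(), 0, nullptr, nullptr);
  clEnqueueWriteBuffer(queue, y, CL_TRUE, 0, n * sizeof(float), host_y.data(), 0, nullptr, nullptr);

  // SAXPY: y = alpha * x + y, dispatched to the single-precision kernel
  cl_event event = nullptr;
  auto status = clblast::Axpy(n, 2.0f,
                              x, 0, 1,
                              y, 0, 1,
                              &queue, &event);
  if (status == clblast::StatusCode::kSuccess) { clWaitForEvents(1, &event); clReleaseEvent(event); }

  // Retrieve the result: every element of y is now 4.0f
  clEnqueueReadBuffer(queue, y, CL_TRUE, 0, n * sizeof(float), host_y.data(), 0, nullptr, nullptr);

  clReleaseMemObject(x); clReleaseMemObject(y);
  clReleaseCommandQueue(queue); clReleaseContext(context);
  return 0;
}
```

Substituting `double`, `std::complex<float>`, `std::complex<double>` or `half` host data (and matching buffers and alpha) selects the D, C, Z or H variant of the same routine.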
+| Level-1 | S | D | C | Z | H |
+| ---------|---|---|---|---|---|
+| xSWAP | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xSCAL | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xCOPY | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xAXPY | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xDOT | ✔ | ✔ | - | - | ✔ |
+| xDOTU | - | - | ✔ | ✔ | - |
+| xDOTC | - | - | ✔ | ✔ | - |
+| xNRM2 | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xASUM | ✔ | ✔ | ✔ | ✔ | ✔ |
+| IxAMAX | ✔ | ✔ | ✔ | ✔ | ✔ |
+
+| Level-2 | S | D | C | Z | H |
+| ---------|---|---|---|---|---|
+| xGEMV | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xGBMV | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xHEMV | - | - | ✔ | ✔ | - |
+| xHBMV | - | - | ✔ | ✔ | - |
+| xHPMV | - | - | ✔ | ✔ | - |
+| xSYMV | ✔ | ✔ | - | - | ✔ |
+| xSBMV | ✔ | ✔ | - | - | ✔ |
+| xSPMV | ✔ | ✔ | - | - | ✔ |
+| xTRMV | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xTBMV | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xTPMV | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xGER | ✔ | ✔ | - | - | ✔ |
+| xGERU | - | - | ✔ | ✔ | - |
+| xGERC | - | - | ✔ | ✔ | - |
+| xHER | - | - | ✔ | ✔ | - |
+| xHPR | - | - | ✔ | ✔ | - |
+| xHER2 | - | - | ✔ | ✔ | - |
+| xHPR2 | - | - | ✔ | ✔ | - |
+| xSYR | ✔ | ✔ | - | - | ✔ |
+| xSPR | ✔ | ✔ | - | - | ✔ |
+| xSYR2 | ✔ | ✔ | - | - | ✔ |
+| xSPR2 | ✔ | ✔ | - | - | ✔ |
+
+| Level-3 | S | D | C | Z | H |
+| ---------|---|---|---|---|---|
+| xGEMM | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xSYMM | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xHEMM | - | - | ✔ | ✔ | - |
+| xSYRK | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xHERK | - | - | ✔ | ✔ | - |
+| xSYR2K | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xHER2K | - | - | ✔ | ✔ | - |
+| xTRMM | ✔ | ✔ | ✔ | ✔ | ✔ |
In addition, some non-BLAS routines are also supported by CLBlast. They are experimental and should be used with care:
-| Additional | S | D | C | Z |
-| -----------|---|---|---|---|
-| xSUM | ✔ | ✔ | ✔ | ✔ |
-| IxMAX | ✔ | ✔ | ✔ | ✔ |
-| IxMIN | ✔ | ✔ | ✔ | ✔ |
+| Additional | S | D | C | Z | H |
+| -----------|---|---|---|---|---|
+| xSUM | ✔ | ✔ | ✔ | ✔ | ✔ |
+| IxMAX | ✔ | ✔ | ✔ | ✔ | ✔ |
+| IxMIN | ✔ | ✔ | ✔ | ✔ | ✔ |
Some BLAS routines are not supported yet by CLBlast. They are shown in the following table:
@@ -250,6 +257,19 @@ Some BLAS routines are not supported yet by CLBlast. They are shown in the follo
| xTRSM | | | | |
+Half precision (fp16)
+-------------
+
+The half-precision fp16 format is a 16-bit floating-point data-type. Some OpenCL devices support the `cl_khr_fp16` extension, reducing storage and bandwidth requirements by a factor of 2 compared to single-precision floating-point. If the hardware also accelerates arithmetic on half-precision data-types, this can greatly improve the compute performance of level-3 routines such as GEMM as well. Devices which can benefit from this include Intel GPUs, ARM Mali GPUs, and NVIDIA's latest Pascal GPUs. Half-precision is of particular interest to the deep-learning community, in which convolutional neural networks can be processed much faster at a minor loss of accuracy.
+
+Since there is no half-precision data-type in C or C++, OpenCL provides the `cl_half` type for the host. Unfortunately, this internally translates to a 16-bit integer, so computations on the host using this data-type should be avoided. For convenience, CLBlast provides the `clblast_half.h` header (C99 and C++ compatible), defining the `half` type as a shorthand for `cl_half` along with the following basic functions:
+
+* `half FloatToHalf(const float value)`: Converts a 32-bit floating-point value to a 16-bit floating-point value.
+* `float HalfToFloat(const half value)`: Converts a 16-bit floating-point value to a 32-bit floating-point value.
+
+The `/samples` folder contains examples of how to use these convenience functions when calling one of the half-precision BLAS routines.
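As a small host-side sketch (for illustration only, not one of the shipped samples: it only exercises the `clblast_half.h` helpers described above and leaves out the OpenCL device code), converting data to and from fp16 looks as follows:

```cpp
// Sketch: host-side fp16 handling with the clblast_half.h helpers.
// The actual HAXPY/HGEMM calls (see the /samples folder) additionally require
// an OpenCL context, a queue and device buffers holding the 'half' data.
#include <cstdio>
#include <vector>
#include <CL/cl.h>
#include <clblast_half.h>

int main() {
  std::vector<float> values = {1.0f, 2.5f, -0.125f};

  // Convert to 16-bit floats before uploading to a device buffer: on the host,
  // 'half' is stored as a 16-bit integer, so it is a storage format only.
  std::vector<half> values_fp16(values.size());
  for (size_t i = 0; i < values.size(); ++i) {
    values_fp16[i] = FloatToHalf(values[i]);
  }

  // ...call a half-precision routine such as HAXPY or HGEMM on the device...

  // Convert back to 32-bit floats for further processing on the host
  for (size_t i = 0; i < values_fp16.size(); ++i) {
    printf("%zu: %.3f\n", i, HalfToFloat(values_fp16[i]));
  }
  return 0;
}
```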
+
+
Contributing
-------------
@@ -270,6 +290,7 @@ Tuning and testing on a variety of OpenCL devices was made possible by:
* [dividiti](http://www.dividiti.com)
* [SURFsara HPC center](http://www.surfsara.com)
+
Support us
-------------