path: root/README.md
Diffstat (limited to 'README.md')
-rw-r--r--  README.md  230
1 file changed, 120 insertions(+), 110 deletions(-)
diff --git a/README.md b/README.md
index e4564c26..ddd841e2 100644
--- a/README.md
+++ b/README.md
@@ -2,11 +2,14 @@
CLBlast: The tuned OpenCL BLAS library
================
-[![Build Status](https://travis-ci.org/CNugteren/CLBlast.svg?branch=master)](https://travis-ci.org/CNugteren/CLBlast)
+| | master | development |
+|-----|-----|-----|
+| Linux/OS X | [![Build Status](https://travis-ci.org/CNugteren/CLBlast.svg?branch=master)](https://travis-ci.org/CNugteren/CLBlast/branches) | [![Build Status](https://travis-ci.org/CNugteren/CLBlast.svg?branch=development)](https://travis-ci.org/CNugteren/CLBlast/branches) |
+| Windows | [![Build Status](https://ci.appveyor.com/api/projects/status/github/cnugteren/clblast?branch=master&svg=true)](https://ci.appveyor.com/project/CNugteren/clblast) | [![Build Status](https://ci.appveyor.com/api/projects/status/github/cnugteren/clblast?branch=development&svg=true)](https://ci.appveyor.com/project/CNugteren/clblast) |
CLBlast is a modern, lightweight, performant and tunable OpenCL BLAS library written in C++11. It is designed to leverage the full performance potential of a wide variety of OpenCL devices from different vendors, including desktop and laptop GPUs, embedded GPUs, and other accelerators. CLBlast implements BLAS routines: basic linear algebra subprograms operating on vectors and matrices.
-__Note that the CLBlast library is actively being developed, and might not be mature enough for production environments__. This preview-version doesn't support the less commonly used routines yet: they will be added in due time. It also lacks extensive tuning on some common OpenCL platforms: __out-of-the-box performance on some devices might be poor__. See below for more details (and how to tune yourself).
+This preview-version is not yet tuned for all OpenCL devices: __out-of-the-box performance on some devices might be poor__. See below for a list of already tuned devices and instructions on how to tune yourself and contribute to future releases of the CLBlast library.
Why CLBlast and not clBLAS or cuBLAS?
@@ -16,21 +19,22 @@ Use CLBlast instead of clBLAS:
* When you care about achieving maximum performance.
* When you want to be able to inspect the BLAS kernels or easily customize them to your needs.
-* When you run on exotic OpenCL devices which you need to tune yourself.
+* When you run on exotic OpenCL devices for which you need to tune yourself.
* When you are still running on OpenCL 1.1 hardware.
* When you value an organized and modern C++ codebase.
* When you target Intel CPUs and GPUs or embedded devices.
+* When you can benefit from the increased performance of half-precision fp16 data-types.
Use CLBlast instead of cuBLAS:
* When you want your code to run on devices other than NVIDIA CUDA-enabled GPUs.
-* When you want to tune for a specific configuration (e.g. rectangular matrix-sizes)
+* When you want to tune for a specific configuration (e.g. rectangular matrix-sizes).
* When you sleep better if you know that the library you use is open-source.
+* When you are using OpenCL rather than CUDA.
When not to use CLBlast:
* When you run on NVIDIA's CUDA-enabled GPUs only and can benefit from cuBLAS's assembly-level tuned kernels.
-* When you need those BLAS routines that are not yet supported by CLBlast.
Compilation and installation
@@ -52,14 +56,6 @@ The pre-requisites for compilation of CLBlast are:
- Intel OpenCL
- Beignet
-Furthermore, to build the (optional) correctness and performance tests, another BLAS library is needed to serve as a reference. This can be either:
-
-* The OpenCL BLAS library [clBLAS](http://github.com/clMathLibraries/clBLAS) (maintained by AMD)
-* A regular CPU Netlib BLAS library, e.g.:
- - OpenBLAS
- - BLIS
- - Accelerate
-
An example of an out-of-source build using a command-line compiler and make (starting from the root of the CLBlast folder):
mkdir build
@@ -90,7 +86,9 @@ Or alternatively the plain C version:
#include <clblast_c.h>
-Afterwards, any of CLBlast's routines can be called directly: there is no need to initialize the library. The available routines and the required arguments are described in the `clblast.h` include file and the included [API documentation](doc/clblast.md). Additionally, a couple of stand-alone example programs are included in `samples/`.
+Afterwards, any of CLBlast's routines can be called directly: there is no need to initialize the library. The available routines and the required arguments are described in the above-mentioned include files and the included [API documentation](doc/clblast.md). Additionally, a couple of stand-alone example programs are included in the `samples` subfolder. They can optionally be compiled using the CMake infrastructure of CLBlast by providing the `-DSAMPLES=ON` flag, for example as follows:
+
+ cmake -DSAMPLES=ON ..
Using the tuners (optional)
@@ -99,6 +97,7 @@ Using the tuners (optional)
The CLBlast library will be tuned in the future for the most commonly used OpenCL devices. This pre-release of CLBlast is only tuned for a limited number of devices, in particular those with the following `CL_DEVICE_NAME` values:
* NVIDIA GPUs:
+ - GRID K520
- GeForce GTX 480
- GeForce GTX 680
- GeForce GTX 750 Ti
@@ -111,8 +110,10 @@ The CLBlast library will be tuned in the future for the most commonly used OpenC
- Tahiti
- Hawaii
- Pitcairn
- - R9 M370X
+ - Radeon R9 M370X Compute Engine
* Intel GPUs:
+ - HD Graphics Haswell Ultrabook GT2 Mobile
+ - HD Graphics Skylake ULT GT2
- Iris
- Iris Pro
* Intel CPUs:
@@ -123,15 +124,15 @@ The CLBlast library will be tuned in the future for the most commonly used OpenC
- ARM Mali-T628 GPU
- Intel MIC
-If your device is not (yet) among this list or if you want to tune CLBlast for specific parameters (e.g. rectangular matrix sizes), you should compile the library with the optional tuners:
+If your device is not (yet) among this list or if you want to tune CLBlast for specific parameters (e.g. rectangular matrix sizes), you should compile the library with the optional tuners by specifying `-DTUNERS=ON`, for example as follows:
cmake -DTUNERS=ON ..
-Note that CLBlast's tuners are based on the CLTune auto-tuning library, which has to be installed separately (version 1.7.0 or higher). CLTune is available from GitHub.
+Note that CLBlast's tuners are based on the [CLTune auto-tuning library](https://github.com/CNugteren/CLTune), which has to be installed separately (requires version 2.3.1 or higher).
Compiling with `-DTUNERS=ON` will generate a number of tuners, each named `clblast_tuner_xxxxx`, in which `xxxxx` corresponds to a `.opencl` kernel file as found in `src/kernels`. These kernels correspond to routines (e.g. `xgemm`) or to common pre-processing or post-processing kernels (`copy` and `transpose`). Running such a tuner will test a number of parameter-value combinations on your device and report which one gave the best performance. Running `make alltuners` runs all tuners for all precisions in one go. You can set the default device and platform for `alltuners` by setting the `DEFAULT_DEVICE` and `DEFAULT_PLATFORM` environmental variables before running CMake.
-The tuners output a JSON-file with the results. The best results need to be added to `include/internal/database/xxxxx.h` in the appropriate section. However, this can be done automatically based on the JSON-data using a Python script in `scripts/database/database.py`. If you want the found parameters to be included in future releases of CLBlast, please attach the JSON files to the corresponding issue on GitHub or [email the main author](http://www.cedricnugteren.nl).
+The tuners output a JSON-file with the results. The best results need to be added to `src/database/kernels/xxxxx.hpp` in the appropriate section. However, this can be done automatically based on the JSON-data using a Python script in `scripts/database/database.py`. If you want the found parameters to be included in future releases of CLBlast, please attach the JSON files to the corresponding issue on GitHub or [email the main author](http://www.cedricnugteren.nl).
In summary, tuning the entire library for your device can be done as follows (starting from the root of the CLBlast folder):
@@ -144,110 +145,125 @@ In summary, tuning the entire library for your device can be done as follows (st
make
-Compiling the correctness and performance tests (optional)
+Compiling the correctness tests (optional)
-------------
-To make sure CLBlast is working correctly on your device (recommended), compile with the tests enabled:
+To make sure CLBlast is working correctly on your device (recommended), compile with the tests enabled by specifying `-DTESTS=ON`, for example as follows:
cmake -DTESTS=ON ..
-Afterwards, executables in the form of `clblast_test_xxxxx` are available, in which `xxxxx` is the name of a routine (e.g. `xgemm`). Note that CLBlast is best tested against [clBLAS](http://github.com/clMathLibraries/clBLAS) for correctness. If the library clBLAS is not installed on your system, it will use a regular CPU BLAS library to test against. If both are present, setting the command-line option `-clblas 1` or `-cblas 1` will select the library to test against for the `clblast_test_xxxxx` executables.
+To build these tests, another BLAS library is needed to serve as a reference. This can be either:
-With the `-DTESTS=ON` flag, additional performance tests are compiled. These come in the form of client executables named `clblast_client_xxxxx`, in which `xxxxx` is the name of a routine (e.g. `xgemm`). These clients take a bunch of configuration options and directly run CLBlast in a head-to-head performance test against clBLAS and/or a CPU BLAS library.
+* The OpenCL BLAS library [clBLAS](http://github.com/clMathLibraries/clBLAS) (maintained by AMD)
+* A regular CPU Netlib BLAS library, e.g.:
+ - OpenBLAS
+ - BLIS
+ - Accelerate
+Afterwards, executables in the form of `clblast_test_xxxxx` are available, in which `xxxxx` is the name of a routine (e.g. `xgemm`). Note that CLBlast is tested for correctness against [clBLAS](http://github.com/clMathLibraries/clBLAS) and/or a regular CPU BLAS library. If both are installed on your system, setting the command-line option `-clblas 1` or `-cblas 1` will select the library to test against for the `clblast_test_xxxxx` executables. All tests have a `-verbose` option to enable additional diagnostic output. They also have a `-full_test` option to increase coverage further.
-Performance remarks
+All tests can be run directly together in one go through the `make alltests` target or using CTest (`make test` or `ctest`). In the latter case the output is less verbose. In both cases you can set the default device and platform to a non-zero value by setting the `DEFAULT_DEVICE` and `DEFAULT_PLATFORM` environmental variables before running CMake.
+
+
+Compiling the performance tests/clients (optional)
-------------
-The CLBlast library provides pre-tuned parameter-values for a number of OpenCL devices. If your device is not among these, then out-of-the-box performance might be poor. Even if the device is included performance might be poor in some cases: __the preview version is not thoroughly tested for performance yet__. See above under `Using the tuners` to find out how to tune for your device.
+To test the performance of CLBlast and compare optionally against [clBLAS](http://github.com/clMathLibraries/clBLAS) or a CPU BLAS library (see above for requirements), compile with the clients enabled by specifying `-DCLIENTS=ON`, for example as follows:
-The folder `doc/performance` contains some PDF files with performance results on tested devices. Performance is compared against a tuned version of the clBLAS library. The graphs of the level-3 routines (Xgemm, Xsymm, Xsyrk) show the strong points of CLBlast:
+ cmake -DCLIENTS=ON ..
-* The library reaches a high peak performance for large matrix sizes, in some cases a factor 2 more than clBLAS.
-* The performance for non-power of 2 values (e.g. 1000) is roughly equal to power of 2 cases (e.g. 1024). This is not the case for clBLAS, which sometimes shows a drop of a factor 2.
-* The performance is also constant for different layouts and transpose options. Again, this is not the case for clBLAS.
+The performance tests come in the form of client executables named `clblast_client_xxxxx`, in which `xxxxx` is the name of a routine (e.g. `xgemm`). These clients take a bunch of configuration options and directly run CLBlast in a head-to-head performance test, optionally against clBLAS and/or a CPU BLAS library. You can use the command-line options `-clblas 1` or `-cblas 1` to select a library to test against.
-The graphs also show the current weak points of CLBlast: for small sizes the benefit is minimal or non-existent, and for some specific configurations clBLAS is still faster.
+The folder `doc/performance` contains some PDF files with performance results on tested devices. In this case, performance is compared against a tuned version of the clBLAS library. These graphs can be generated automatically on your own device. First, compile CLBlast with the clients enabled. Then, make sure your installation of the reference clBLAS is performance-tuned by running the `tune` executable. Finally, run one of the graph-scripts found in `scripts/graphs` using R. For example, to generate the Xgemm PDF on device 1 of platform 0 from the `build` subdirectory:
-These graphs can be generated automatically on your own device. First, compile CLBlast with the tests enabled. Then, make sure your installation of the reference clBLAS is performance-tuned by running the `tune` executable. Finally, run one of the graph-scripts found in `test/performance/graphs` using R. For example, to generate the Xgemm PDF on device 1 of platform 0:
+ Rscript ../scripts/graphs/xgemm.r 0 1
- Rscript path/to/test/performance/graphs/xgemm.r 0 1
+Note that the CLBlast library provides pre-tuned parameter-values for some devices only: if your device is not among these, then out-of-the-box performance might be poor. See above under `Using the tuners` to find out how to tune for your device.
Supported routines
-------------
-CLBlast is in active development but already supports almost all the BLAS routines. The supported routines are marked with '✔' in the following tables. Routines marked with '-' do not exist: they are not part of BLAS at all.
-
-| Level-1 | S | D | C | Z |
-| ---------|---|---|---|---|
-| xSWAP | ✔ | ✔ | ✔ | ✔ |
-| xSCAL | ✔ | ✔ | ✔ | ✔ |
-| xCOPY | ✔ | ✔ | ✔ | ✔ |
-| xAXPY | ✔ | ✔ | ✔ | ✔ |
-| xDOT | ✔ | ✔ | - | - |
-| xDOTU | - | - | ✔ | ✔ |
-| xDOTC | - | - | ✔ | ✔ |
-| xNRM2 | ✔ | ✔ | ✔ | ✔ |
-| xASUM | ✔ | ✔ | ✔ | ✔ |
-| IxAMAX | ✔ | ✔ | ✔ | ✔ |
-
-| Level-2 | S | D | C | Z |
-| ---------|---|---|---|---|
-| xGEMV | ✔ | ✔ | ✔ | ✔ |
-| xGBMV | ✔ | ✔ | ✔ | ✔ |
-| xHEMV | - | - | ✔ | ✔ |
-| xHBMV | - | - | ✔ | ✔ |
-| xHPMV | - | - | ✔ | ✔ |
-| xSYMV | ✔ | ✔ | - | - |
-| xSBMV | ✔ | ✔ | - | - |
-| xSPMV | ✔ | ✔ | - | - |
-| xTRMV | ✔ | ✔ | ✔ | ✔ |
-| xTBMV | ✔ | ✔ | ✔ | ✔ |
-| xTPMV | ✔ | ✔ | ✔ | ✔ |
-| xGER | ✔ | ✔ | - | - |
-| xGERU | - | - | ✔ | ✔ |
-| xGERC | - | - | ✔ | ✔ |
-| xHER | - | - | ✔ | ✔ |
-| xHPR | - | - | ✔ | ✔ |
-| xHER2 | - | - | ✔ | ✔ |
-| xHPR2 | - | - | ✔ | ✔ |
-| xSYR | ✔ | ✔ | - | - |
-| xSPR | ✔ | ✔ | - | - |
-| xSYR2 | ✔ | ✔ | - | - |
-| xSPR2 | ✔ | ✔ | - | - |
-
-| Level-3 | S | D | C | Z |
-| ---------|---|---|---|---|
-| xGEMM | ✔ | ✔ | ✔ | ✔ |
-| xSYMM | ✔ | ✔ | ✔ | ✔ |
-| xHEMM | - | - | ✔ | ✔ |
-| xSYRK | ✔ | ✔ | ✔ | ✔ |
-| xHERK | - | - | ✔ | ✔ |
-| xSYR2K | ✔ | ✔ | ✔ | ✔ |
-| xHER2K | - | - | ✔ | ✔ |
-| xTRMM | ✔ | ✔ | ✔ | ✔ |
-
-In addition, some non-BLAS routines are also supported by CLBlast. They are experimental and should be used with care:
-
-| Additional | S | D | C | Z |
-| -----------|---|---|---|---|
-| xSUM | ✔ | ✔ | ✔ | ✔ |
-| IxMAX | ✔ | ✔ | ✔ | ✔ |
-| IxMIN | ✔ | ✔ | ✔ | ✔ |
-
-Some BLAS routines are not supported yet by CLBlast. They are shown in the following table:
-
-| Unsupported | S | D | C | Z |
-| ------------|---|---|---|---|
-| xROTG | | | - | - |
-| xROTMG | | | - | - |
-| xROT | | | - | - |
-| xROTM | | | - | - |
-| xTRSV | | | | |
-| xTBSV | | | | |
-| xTPSV | | | | |
-| xTRSM | | | | |
+CLBlast supports almost all the Netlib BLAS routines plus a couple of extra non-BLAS routines. The supported BLAS routines are marked with '✔' in the following tables. Routines marked with '-' do not exist: they are not part of BLAS at all. The different data-types supported by the library are:
+
+* __S:__ Single-precision 32-bit floating-point (`float`).
+* __D:__ Double-precision 64-bit floating-point (`double`).
+* __C:__ Complex single-precision 2x32-bit floating-point (`std::complex<float>`).
+* __Z:__ Complex double-precision 2x64-bit floating-point (`std::complex<double>`).
+* __H:__ Half-precision 16-bit floating-point (`cl_half`). See section 'Half precision' for more information.
+
+| Level-1 | S | D | C | Z | H |
+| ---------|---|---|---|---|---|
+| xSWAP | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xSCAL | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xCOPY | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xAXPY | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xDOT | ✔ | ✔ | - | - | ✔ |
+| xDOTU | - | - | ✔ | ✔ | - |
+| xDOTC | - | - | ✔ | ✔ | - |
+| xNRM2 | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xASUM | ✔ | ✔ | ✔ | ✔ | ✔ |
+| IxAMAX | ✔ | ✔ | ✔ | ✔ | ✔ |
+
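To make the semantics concrete, here is a plain-C reference sketch of what the level-1 AXPY routine computes (`y = alpha * x + y`). The function name `saxpy_reference` is made up for this illustration and is not part of CLBlast's API, which performs the equivalent computation on an OpenCL device with configurable strides:

```c
#include <stddef.h>

/* Reference semantics of the level-1 AXPY routine for unit increments:
   y[i] = alpha * x[i] + y[i]. Illustration only, not CLBlast code. */
void saxpy_reference(const size_t n, const float alpha,
                     const float *x, float *y) {
  for (size_t i = 0; i < n; ++i) {
    y[i] = alpha * x[i] + y[i];
  }
}
```

For example, with `alpha = 2`, `x = {1, 2, 3}` and `y = {4, 5, 6}`, the result is `y = {6, 9, 12}`.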
+| Level-2 | S | D | C | Z | H |
+| ---------|---|---|---|---|---|
+| xGEMV | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xGBMV | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xHEMV | - | - | ✔ | ✔ | - |
+| xHBMV | - | - | ✔ | ✔ | - |
+| xHPMV | - | - | ✔ | ✔ | - |
+| xSYMV | ✔ | ✔ | - | - | ✔ |
+| xSBMV | ✔ | ✔ | - | - | ✔ |
+| xSPMV | ✔ | ✔ | - | - | ✔ |
+| xTRMV | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xTBMV | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xTPMV | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xGER | ✔ | ✔ | - | - | ✔ |
+| xGERU | - | - | ✔ | ✔ | - |
+| xGERC | - | - | ✔ | ✔ | - |
+| xHER | - | - | ✔ | ✔ | - |
+| xHPR | - | - | ✔ | ✔ | - |
+| xHER2 | - | - | ✔ | ✔ | - |
+| xHPR2 | - | - | ✔ | ✔ | - |
+| xSYR | ✔ | ✔ | - | - | ✔ |
+| xSPR | ✔ | ✔ | - | - | ✔ |
+| xSYR2 | ✔ | ✔ | - | - | ✔ |
+| xSPR2 | ✔ | ✔ | - | - | ✔ |
+
+| Level-3 | S | D | C | Z | H |
+| ---------|---|---|---|---|---|
+| xGEMM | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xSYMM | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xHEMM | - | - | ✔ | ✔ | - |
+| xSYRK | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xHERK | - | - | ✔ | ✔ | - |
+| xSYR2K | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xHER2K | - | - | ✔ | ✔ | - |
+| xTRMM | ✔ | ✔ | ✔ | ✔ | ✔ |
+
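For the level-3 routines, the core computation of GEMM can be sketched in plain C for the simplest case (column-major storage, no transposing). The name `sgemm_reference` is illustrative only and not CLBlast's API:

```c
#include <stddef.h>

/* Reference semantics of GEMM: C = alpha * A * B + beta * C, with
   A m-by-k, B k-by-n, and C m-by-n, all stored column-major without
   transposing. The *_ld arguments are the leading dimensions (column
   strides). Illustration only, not CLBlast code. */
void sgemm_reference(const size_t m, const size_t n, const size_t k,
                     const float alpha, const float *a, const size_t a_ld,
                     const float *b, const size_t b_ld,
                     const float beta, float *c, const size_t c_ld) {
  for (size_t j = 0; j < n; ++j) {
    for (size_t i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (size_t l = 0; l < k; ++l) {
        acc += a[i + l * a_ld] * b[l + j * b_ld];
      }
      c[i + j * c_ld] = alpha * acc + beta * c[i + j * c_ld];
    }
  }
}
```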
+In addition, some extra non-BLAS routines are also supported by CLBlast, classified as level-X. They are experimental and should be used with care:
+
+| Level-X | S | D | C | Z | H |
+| -----------|---|---|---|---|---|
+| xSUM | ✔ | ✔ | ✔ | ✔ | ✔ |
+| IxMAX | ✔ | ✔ | ✔ | ✔ | ✔ |
+| IxMIN | ✔ | ✔ | ✔ | ✔ | ✔ |
+| xOMATCOPY | ✔ | ✔ | ✔ | ✔ | ✔ |
+
+Some less commonly used BLAS routines are not yet supported by CLBlast: xROTG, xROTMG, xROT, xROTM, xTRSV, xTBSV, xTPSV, and xTRSM.
+
+
+Half precision (fp16)
+-------------
+
+The half-precision fp16 format is a 16-bit floating-point data-type. Some OpenCL devices support the `cl_khr_fp16` extension, which reduces storage and bandwidth requirements by a factor 2 compared to single-precision floating-point. If the hardware also accelerates arithmetic on half-precision data-types, this can greatly improve the compute performance of e.g. level-3 routines such as GEMM. Devices which can benefit from this include Intel GPUs, ARM Mali GPUs, and NVIDIA's latest Pascal GPUs. Half-precision is of particular interest to the deep-learning community, in which convolutional neural networks can be processed much faster at a minor accuracy loss.
+
+Since there is no half-precision data-type in C or C++, OpenCL provides the `cl_half` type for the host device. Unfortunately, internally this translates to a 16-bit integer, so computations on the host using this data-type should be avoided. For convenience, CLBlast provides the `clblast_half.h` header (C99 and C++ compatible), defining the `half` type as a short-hand for `cl_half` and the following basic functions:
+
+* `half FloatToHalf(const float value)`: Converts a 32-bit floating-point value to a 16-bit floating-point value.
+* `float HalfToFloat(const half value)`: Converts a 16-bit floating-point value to a 32-bit floating-point value.
+
+The `samples/haxpy.c` example shows how to use these convenience functions when calling the half-precision BLAS routine HAXPY.
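The conversion behind such helpers can be illustrated with a standalone sketch. Note that this is a simplified re-implementation for illustration only (round-toward-zero, subnormals flushed to zero), not the code from `clblast_half.h`; the `half_bits` type and the `...Sketch` function names are made up for this example:

```c
#include <stdint.h>
#include <string.h>

/* The 16 raw bits of an fp16 value (1 sign, 5 exponent, 10 mantissa). */
typedef uint16_t half_bits;

/* Converts fp32 to fp16 by truncating the mantissa. Simplified:
   subnormal results are flushed to zero, overflow becomes infinity. */
half_bits FloatToHalfSketch(const float value) {
  uint32_t f;
  memcpy(&f, &value, sizeof(f));
  const uint32_t sign = (f >> 16) & 0x8000u;
  const int32_t exponent = (int32_t)((f >> 23) & 0xFFu) - 127 + 15;
  const uint32_t mantissa = (f >> 13) & 0x3FFu;
  if (exponent <= 0) { return (half_bits)sign; }            /* flush to zero */
  if (exponent >= 31) { return (half_bits)(sign | 0x7C00u); } /* infinity */
  return (half_bits)(sign | ((uint32_t)exponent << 10) | mantissa);
}

/* Converts fp16 back to fp32 (exact for normal values). */
float HalfToFloatSketch(const half_bits value) {
  const uint32_t sign = ((uint32_t)value & 0x8000u) << 16;
  const uint32_t exponent = ((uint32_t)value >> 10) & 0x1Fu;
  const uint32_t mantissa = (uint32_t)value & 0x3FFu;
  uint32_t f;
  if (exponent == 0) { f = sign; }                          /* zero */
  else if (exponent == 31) { f = sign | 0x7F800000u | (mantissa << 13); }
  else { f = sign | ((exponent - 15 + 127) << 23) | (mantissa << 13); }
  float out;
  memcpy(&out, &f, sizeof(out));
  return out;
}
```

Round-tripping a value that is exactly representable in fp16 (e.g. `1.0f` or `-2.5f`) returns it unchanged, which is why host-side conversion plus device-side fp16 arithmetic loses accuracy only in the arithmetic itself for such inputs.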
Contributing
@@ -257,7 +273,7 @@ Contributions are welcome in the form of tuning results for OpenCL devices previ
The contributing authors (code, pull requests, testing) so far are:
-* [Cedric Nugteren](http://www.cedricnugteren.nl)
+* [Cedric Nugteren](http://www.cedricnugteren.nl) - main author
* [Anton Lokhmotov](https://github.com/psyhtest)
* [Dragan Djuric](https://github.com/blueberry)
* [Marco Hutter](https://github.com/gpus)
@@ -270,14 +286,8 @@ Tuning and testing on a variety of OpenCL devices was made possible by:
* [dividiti](http://www.dividiti.com)
* [SURFsara HPC center](http://www.surfsara.com)
+
Support us
-------------
This project started in March 2015 as an evenings and weekends free-time project next to a full-time job for Cedric Nugteren. If you are in the position to support the project by OpenCL-hardware donations or otherwise, please find contact information on the [website of the main author](http://www.cedricnugteren.nl).
-
-
-To-do list before release of version 1.0
--------------
-
-- Add half-precision routines (e.g. HGEMM)
-- Add API documentation