author    Cedric Nugteren <web@cedricnugteren.nl>  2018-05-19 17:54:27 +0200
committer Cedric Nugteren <web@cedricnugteren.nl>  2018-05-19 17:54:27 +0200
commit    cbcd4ff7e8e21584a9a1f405c9f4cb979a73b718 (patch)
tree      4a131ed480dc4f496a211453f95adfebaf3f6336 /doc
parent    e057a9186a1ed0a169fcf4db7a2598d08f530834 (diff)
parent    507d7bc729eff888dd499e937bf1a636cbdee75b (diff)
Merge branch 'master' into CLBlast-267-convgemm
Diffstat (limited to 'doc')
-rw-r--r--  doc/details_gemm.md | 27
-rw-r--r--  doc/tuning.md       |  4
2 files changed, 28 insertions(+), 3 deletions(-)
diff --git a/doc/details_gemm.md b/doc/details_gemm.md
new file mode 100644
index 00000000..d4666abb
--- /dev/null
+++ b/doc/details_gemm.md
@@ -0,0 +1,27 @@
+CLBlast: Details on the GEMM routine and kernel
+================
+
+This document gives a bit more detail on how the GEMM routine is organised and implemented. For other information about CLBlast, see the [main README](../README.md).
+
+
+GEMM: Two approaches
+-------------
+
+CLBlast implements two approaches to GEMM, a direct and an indirect one:
+
+* Direct GEMM: Computing GEMM using a single generic kernel which handles all cases (e.g. any matrix size).
+* Indirect GEMM: Computing GEMM using multiple kernels: the main GEMM kernel plus a few pre-processing and post-processing kernels. The main kernel makes several assumptions (e.g. sizes need to be multiples of 32), which the other kernels ensure are satisfied. The main kernel is often faster than the generic kernel of the direct approach, but the cost of the pre-processing and post-processing kernels can be high for small sizes or on particular devices. A sketch of how a routine might dispatch between the two approaches follows this list.
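+
+As a rough illustration, the choice between the two approaches amounts to a size-based dispatch. The sketch below is not CLBlast's actual routine code: the function names and the `min_indirect_size` threshold are hypothetical stand-ins (in CLBlast the cut-off point is itself determined by the routine-level tuner, see the tuning documentation):
+
+```c++
+#include <cstddef>
+#include <cstdio>
+
+// Hypothetical stand-ins for the two code paths (illustrative only)
+void GemmDirect(std::size_t m, std::size_t n, std::size_t k) {
+  std::printf("direct GEMM: %zu x %zu x %zu\n", m, n, k);
+}
+void GemmIndirect(std::size_t m, std::size_t n, std::size_t k) {
+  std::printf("indirect GEMM: %zu x %zu x %zu\n", m, n, k);
+}
+
+// Size-based dispatch: small problems avoid the pre/post-processing overhead
+// of the indirect approach, large problems amortise it.
+void Gemm(std::size_t m, std::size_t n, std::size_t k,
+          std::size_t min_indirect_size) {  // tuned, device-specific cut-off
+  const auto threshold = min_indirect_size * min_indirect_size * min_indirect_size;
+  if (m * n * k < threshold) { GemmDirect(m, n, k); }
+  else { GemmIndirect(m, n, k); }
+}
+
+int main() {
+  Gemm(32, 32, 32, 384);        // small: takes the direct path
+  Gemm(2048, 2048, 2048, 384);  // large: takes the indirect path
+}
+```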
+
+
+GEMM: Indirect approach
+-------------
+
+Similar to the work of Matsumoto et al. ("Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs"), the main GEMM kernel makes several assumptions about its input arguments, which are established by pre-processing and post-processing kernels: for example, that the matrix sizes are multiples of the work-group sizes, that the offsets are zero, and that matrix B is transposed. This is a good solution for larger problem sizes, since the O(n^2) data movement is typically cheaper than the O(n^3) computation, but the hidden constant starts to play a role for smaller n. Therefore, a single-kernel direct version is also available for those cases; it shares most of the design and parameters discussed below. A sketch of the padding idea behind the pre-processing follows.
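+
+To make the trade-off concrete, the pre-processing step essentially pads and reorders the inputs so that the main kernel never sees awkward sizes. The following host-side C++ sketch shows the padding idea only; it is illustrative (in CLBlast the real work happens in OpenCL kernels on the device, and the helper names here are made up):
+
+```c++
+#include <cstddef>
+#include <vector>
+
+// Round x up to the next multiple of 'multiple' (e.g. a work-group tile size)
+std::size_t CeilToMultiple(std::size_t x, std::size_t multiple) {
+  return ((x + multiple - 1) / multiple) * multiple;
+}
+
+// Copy an m-by-n column-major matrix into a zero-padded buffer whose
+// dimensions are multiples of the tile sizes. This is the O(n^2) data
+// movement that lets the O(n^3) main kernel skip all boundary checks.
+std::vector<float> PadMatrix(const std::vector<float> &src,
+                             std::size_t m, std::size_t n,
+                             std::size_t tile_m, std::size_t tile_n) {
+  const auto padded_m = CeilToMultiple(m, tile_m);
+  std::vector<float> dst(padded_m * CeilToMultiple(n, tile_n), 0.0f);
+  for (std::size_t col = 0; col < n; ++col) {
+    for (std::size_t row = 0; row < m; ++row) {
+      dst[col * padded_m + row] = src[col * m + row];  // copy; rest stays zero
+    }
+  }
+  return dst;
+}
+```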
+
+The main kernel has 14 different tuning parameters, some of which are illustrated in figure 1 of the [CLBlast paper](https://arxiv.org/pdf/1705.05249). Among other things, the parameters define the work-group sizes in two dimensions (MWG, NWG), the 2D register tiling configuration (MWI, NWI), the vector widths of the two input matrices (VWM, VWN), the loop unroll factor (KWI), and whether and how to use local memory. A summary of the full parameter set is sketched below.
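+
+For reference, the parameter set can be written down as a plain struct. The grouping and comments below follow the presentation in the CLBlast paper; the default values are only illustrative placeholders, and the authoritative definitions live in the kernel sources:
+
+```c++
+// Tuning parameters of the main (indirect) GEMM kernel, grouped as in the
+// CLBlast paper. Values shown are illustrative, not tuned results.
+struct XgemmParameters {
+  int MWG = 64, NWG = 64, KWG = 32;  // tile sizes per work-group in m/n/k
+  int MDIMC = 8, NDIMC = 8;          // work-group (thread) dimensions
+  int MDIMA = 8, NDIMB = 8;          // re-shaped tile sizes for loading A and B
+  int KWI = 2;                       // unroll factor of the inner k-loop
+  int VWM = 4, VWN = 4;              // vector widths for matrices A and B
+  int STRM = 0, STRN = 0;            // strided or contiguous thread access
+  int SA = 1, SB = 1;                // cache tiles of A/B in local memory?
+};
+```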
+
+
+GEMM: Direct approach
+-------------
+
+This is a single-kernel approach that shares many of its parameters with the indirect kernel. One difference is that the kernel contains checks for incomplete tiles in the m/n/k dimensions, which occur depending on the tuning parameters and the matrix sizes. These incomplete tiles run a different part of the code, since they cannot, for example, benefit from vectorisation. Another difference is that there are dedicated kernels for each combination of transpose requirements on A and B: NN, NT, TN, and TT (N for non-transposed, T for transposed). A sketch of the incomplete-tile handling follows.
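+
+The boundary handling can be pictured as follows. This C++ fragment is a simplified sketch of the control flow only, not the actual OpenCL kernel code, and the names are made up for illustration:
+
+```c++
+#include <cstddef>
+
+// Sketch: a tile lying entirely within the matrix takes the fast, vectorised
+// path; a tile crossing the m or n edge falls back to scalar code with
+// per-element bounds checks.
+void ProcessTile(std::size_t tile_m, std::size_t tile_n,  // tile indices
+                 std::size_t m, std::size_t n,            // matrix sizes
+                 std::size_t tile_size) {                 // tuning parameter
+  const bool complete_m = (tile_m + 1) * tile_size <= m;
+  const bool complete_n = (tile_n + 1) * tile_size <= n;
+  if (complete_m && complete_n) {
+    // fast path: full tile, vectorised loads and stores
+  } else {
+    // slow path: guard every access with (row < m && col < n)
+  }
+}
+```
\ No newline at end of file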
diff --git a/doc/tuning.md b/doc/tuning.md
index 60ad2cc7..b5186ac6 100644
--- a/doc/tuning.md
+++ b/doc/tuning.md
@@ -82,7 +82,7 @@ Compiling with `-DTUNERS=ON` will generate a number of tuners, each named `clbla
The kernels `gemm` and `gemm_direct` have too many parameters to explore. Therefore, they will run in two stages: a first stage with a fixed limited number of parameter combinations, and a second stage with a random selection from a much larger search space. The random fraction is determined by the `fraction` argument on the command-line.
-There are also several routine-level tuners. They tune inter-kernel parameters and should only be run after the kernels are tuned. An example is the GEMM routine tuner, which determines when to use the direct or the in-direct GEMM kernel.
+There are also several routine-level tuners. They tune inter-kernel parameters and should only be run after the kernels are tuned. However, they do automatically pick up kernel tuning results from the current folder if there are any. An example is the GEMM routine tuner, which determines when to use the direct or the in-direct GEMM kernel.
Using the tuning results
@@ -100,8 +100,6 @@ In summary, tuning the entire library for your device can be done as follows (st
python ../scripts/database/database.py . ..
make
-After the kernels are tuned, you can run the `clblast_tuner_routine_xgemm` tuner to optimize the high-level GEMM routine, i.e. selecting which method to use: the direct kernel or the in-direct kernel.
-
Tuning using the API (advanced users only)
-------------