Added documentation on the convgemm routine

author: Cedric Nugteren <web@cedricnugteren.nl> 2019-01-19 15:44:19 +0100
committer: Cedric Nugteren <web@cedricnugteren.nl> 2019-01-19 15:44:19 +0100
commit: 11f4c7dd936146f9b4f165d8ef69bafa3a33ad26 (patch)
tree: bb8a7aa8493e3447b3544b9832cecb678c3087d6
parent: c42e48068bfd51a19f9918f20b43be94c0e3dcf2 (diff)
3 files changed, 24 insertions, 1 deletions
diff --git a/README.md b/README.md
index f633f74b..f07177b1 100644
--- a/README.md
+++ b/README.md
@@ -79,6 +79,7 @@ More detailed documentation is available in separate files:
 * [Testing the library for correctness](doc/testing.md)
 * [Bindings / wrappers for other languages](doc/bindings.md)
 * [More details on the GEMM kernel](doc/details_gemm.md)
+* [More details on the convolution implementation](doc/details_conv.md)
 * [Glossary with some terms explained](doc/glossary.md)
 * [Frequently asked questions (FAQ) and their answers](doc/faq.md)
 
diff --git a/ROADMAP.md b/ROADMAP.md
index c932015f..c1db9850 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -20,6 +20,6 @@ This file gives an overview of the main features planned for addition to CLBlast
 | [#228](https://github.com/CNugteren/CLBlast/issues/228)        | Mar-Apr '18 | CNugteren | ✔      | Improving performance for Qualcomm Adreno GPUs |
 | [#270](https://github.com/CNugteren/CLBlast/issues/270)        | Oct '18     | CNugteren | ✔      | Implement col2im |
 | -                                                              | ??          | CNugteren |        | Add support for OpenCL image buffers |
-| [#267](https://github.com/CNugteren/CLBlast/issues/267)        | ??          | CNugteren | WIP    | Merge im2col and GEMM into a direct kernel |
+| [#267](https://github.com/CNugteren/CLBlast/issues/267)        | Jan '19     | vbkaisetsu| ✔      | Merge im2col and GEMM into a direct kernel |
 | [#136](https://github.com/CNugteren/CLBlast/issues/136)        | ??          | CNugteren |        | Implement xAXPBY and xSET |
 | [#169](https://github.com/CNugteren/CLBlast/issues/169)        | ??          | dividiti  |        | Problem-specific tuning parameter selection |
diff --git a/doc/details_conv.md b/doc/details_conv.md
new file mode 100644
index 00000000..65e18e70
--- /dev/null
+++ b/doc/details_conv.md
@@ -0,0 +1,22 @@
+CLBlast: Details on the CONVGEMM routine
+================
+
+This document gives a bit more detail on how the CONVGEMM routine is organised and implemented. For other information about CLBlast, see the [main README](../README.md).
+
+
+CONVGEMM: Two approaches
+-------------
+
+CLBlast implements two approaches to batched convolutions using GEMM, either through im2col or stand-alone:
+
+* `ConvGemmMethod::kWithIm2Col`: first runs a batched version of im2col to write the data into a temporary buffer, and then runs a batched version of GEMM. The implementation follows the regular im2col and GEMM kernels in CLBlast, but it is implemented as a separate kernel so that unneeded features can be stripped out and some optimizations can be made. It uses the tuning parameters of the regular im2col and GEMM kernels.
+
+* `ConvGemmMethod::kSingleKernel`: a single-kernel approach that loads the data the same way the im2col transformation would, so the im2col kernel is no longer needed and no large intermediate buffer is required. It uses a separate set of tuning parameters, and can be tuned using the `clblast_tuner_xconvgemm` binary.
+
+
+CONVGEMM: Selecting which approach to use
+-------------
+
+Since CONVGEMM is a relatively new and experimental feature, selection of the approach is hard-coded in [xconvgemm.hpp on line 32](../src/routines/levelx/xconvgemm.hpp:32), but can be changed there in a single place.
+
+The main drawback of the `ConvGemmMethod::kWithIm2Col` approach is its extra memory usage, but depending on the device and settings, it might be faster than the `ConvGemmMethod::kSingleKernel` approach. The latter has the additional advantage of having its own tuning parameters, so it can be fine-tuned for a specific use-case somewhat better than the two-kernel approach with im2col.
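
To illustrate the two approaches described in `doc/details_conv.md`, here is a minimal pure-Python sketch. It is not CLBlast code: the function names (`im2col`, `gemm`, `conv_with_im2col`, `conv_single_pass`) are hypothetical, and it assumes a single image in CHW layout, stride 1, no padding or dilation, and cross-correlation rather than a flipped kernel. The first path materialises the im2col matrix before multiplying; the second computes the same source indices on the fly, so it needs no intermediate buffer.

```python
def im2col(image, channels, height, width, kh, kw):
    """Unfold a flat CHW image into a (channels*kh*kw) x (out_h*out_w) matrix."""
    out_h, out_w = height - kh + 1, width - kw + 1
    rows = []
    for c in range(channels):
        for i in range(kh):
            for j in range(kw):
                # One row per (channel, kernel-row, kernel-col) triple.
                rows.append([image[(c * height + y + i) * width + x + j]
                             for y in range(out_h) for x in range(out_w)])
    return rows


def gemm(a, b):
    """Plain matrix multiply of nested lists: (m x k) times (k x n)."""
    return [[sum(a_row[p] * b[p][col] for p in range(len(b)))
             for col in range(len(b[0]))]
            for a_row in a]


def conv_with_im2col(image, weights, channels, height, width, kh, kw):
    """Two-step approach: build the temporary im2col matrix, then multiply."""
    return gemm(weights, im2col(image, channels, height, width, kh, kw))


def conv_single_pass(image, weights, channels, height, width, kh, kw):
    """Single-pass approach: index the image as im2col would, without a buffer."""
    out_h, out_w = height - kh + 1, width - kw + 1
    result = []
    for w_row in weights:
        out = []
        for y in range(out_h):
            for x in range(out_w):
                acc = 0
                for p, w in enumerate(w_row):
                    # Decode the flat weight index into (channel, i, j).
                    c, rem = divmod(p, kh * kw)
                    i, j = divmod(rem, kw)
                    acc += w * image[(c * height + y + i) * width + x + j]
                out.append(acc)
        result.append(out)
    return result
```

Both functions take `weights` as one row of `channels * kh * kw` values per output feature map and return one row of `out_h * out_w` values per feature map. In this sketch the temporary matrix is `kh * kw` times the size of the output for each input channel, which is the memory overhead the single-kernel approach avoids.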