author    Cedric Nugteren <web@cedricnugteren.nl>  2018-03-10 17:02:38 +0100
committer Cedric Nugteren <web@cedricnugteren.nl>  2018-03-10 17:02:38 +0100
commit    49b02ec194c631054a690e897886e3a7339192a1 (patch)
tree      a3a8136f1e9d600b2bb1a6b798893848dd1c6ce7
parent    86455841d1027e56ea4fee7d93ee42a95894db4d (diff)
Added initial glossary
-rw-r--r--  README.md        1
-rw-r--r--  doc/glossary.md  14
2 files changed, 15 insertions, 0 deletions
diff --git a/README.md b/README.md
index 3c8ceee7..2084e51e 100644
--- a/README.md
+++ b/README.md
@@ -78,6 +78,7 @@ More detailed documentation is available in separate files:
* [Tuning for better performance](doc/tuning.md)
* [Testing the library for correctness](doc/testing.md)
* [Bindings / wrappers for other languages](doc/bindings.md)
+* [Glossary with some terms explained](doc/glossary.md)
Known issues
diff --git a/doc/glossary.md b/doc/glossary.md
new file mode 100644
index 00000000..821ffc69
--- /dev/null
+++ b/doc/glossary.md
@@ -0,0 +1,14 @@
+CLBlast: Glossary
+=================
+
+This document explains some terms that are commonly used in the CLBlast documentation and code. For other information about CLBlast, see the [main README](../README.md).
+
+* __BLAS__: The set of 'Basic Linear Algebra Subprograms'.
+* __Netlib BLAS__: The official BLAS API definition, with __CBLAS__ providing the C headers.
+* __OpenCL__: The 'Open Computing Language', a Khronos standard for heterogeneous and parallel computing, e.g. on GPUs.
+* __kernel__: An OpenCL parallel program that runs on the target device.
+* __clBLAS__: Another OpenCL BLAS library, maintained by AMD.
+* __cuBLAS__: The main CUDA BLAS library, maintained by NVIDIA.
+* __GEMM__: The 'GEneral Matrix Multiplication' routine.
+* __Direct GEMM__: Computing GEMM with a single generic kernel that handles all cases (e.g. all kinds of matrix sizes); a naive sketch of such a kernel appears below the diff.
+* __Indirect GEMM__: Computing GEMM with multiple kernels: the main GEMM kernel plus a few pre-processing and post-processing kernels. The main kernel makes several assumptions (e.g. sizes need to be multiples of 32), which the other kernels ensure are satisfied (the padding arithmetic is sketched below the diff). The main kernel is often faster than the generic kernel of the direct approach, but the cost of the pre-processing and post-processing kernels can be significant for small sizes or on particular devices.
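
To make the __kernel__, __GEMM__ and __Direct GEMM__ entries concrete, here is a minimal sketch of what a single generic GEMM kernel can look like in OpenCL C. This is illustrative only and is not CLBlast's actual code; the kernel name `naive_sgemm`, the one-work-item-per-output mapping, and the row-major layout are assumptions made for the example.

```c
// Illustrative sketch only, not CLBlast's kernel.
// Computes C = alpha * A * B + beta * C for row-major matrices of any size.
__kernel void naive_sgemm(const int M, const int N, const int K,
                          const float alpha, const float beta,
                          __global const float* A,   // M x K, row-major
                          __global const float* B,   // K x N, row-major
                          __global float* C) {       // M x N, row-major
  const int row = get_global_id(0);  // one work-item per element of C
  const int col = get_global_id(1);
  if (row < M && col < N) {
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
      acc += A[row * K + k] * B[k * N + col];
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
  }
}
```

Because it makes no assumptions about sizes, a kernel like this handles every case, which is the essence of the direct approach; a tuned main kernel trades that generality for speed.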
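The __Indirect GEMM__ entry mentions rounding sizes up to a multiple (32 in its example). Below is a small host-side sketch of that padding arithmetic in plain C; the helper `ceil_to_multiple` is hypothetical, and the multiple-of-32 value is only the example from the glossary, as the real requirement depends on the device and the tuning parameters.

```c
// Illustrative sketch only: the size rounding the indirect approach relies on.
#include <stdio.h>

static int ceil_to_multiple(int value, int multiple) {
  return ((value + multiple - 1) / multiple) * multiple;
}

int main(void) {
  const int m = 100, n = 70, k = 33;
  // Pre-processing kernels would copy/pad A, B and C into buffers of these
  // padded sizes so the main kernel's assumptions hold; a post-processing
  // kernel then copies the valid m x n part of the result back out.
  printf("padded sizes: %d x %d x %d\n",
         ceil_to_multiple(m, 32), ceil_to_multiple(n, 32),
         ceil_to_multiple(k, 32));
  return 0;
}
```

This also shows why the indirect approach can lose for small sizes: padding 100 x 70 x 33 up to 128 x 96 x 64 means a noticeable fraction of the work and memory traffic is spent on padding rather than on the actual result.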