CHANGELOG


Development version (next version)
- Fixed a pointer error in pyclblast on ARM
- Added tuned parameters for many devices (see doc/tuning.md)

Version 1.6.0
- Modifications to improve performance on Qualcomm Adreno GPUs:
  * Unique database entries for specific Adreno devices
  * Toggle OpenCL kernel compilation options for Adreno
  * New preprocessor directive RELAX_WORKGROUP_SIZE
- Fixed a bug in handling of #undef in CLBlast loop unrolling and array-to-register mapping functions
- Fixed a bug in XAMAX/XAMIN routines related to inadvertently including the increment and offset in the result
- Fixed a bug in XAMAX/XAMIN routines that would cause only the real part of a complex number to be taken into account
- Fixed a bug that caused tests to not properly do integer-output testing (for XAMAX/XAMIN)
- Fixed a minor issue with the expected input buffer size in the TRMV/TBMV/TPMV/TRSV routines
- Fixed an issue with crashes on Android related to calling clReleaseProgram
- Fixed two small issues in the plotting script
- Fixed a documentation bug in the 'ld' requirements
- Enabled GitHub Actions CI builds for testing and releasing
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)

Version 1.5.3
- Fixed a correctness issue with DGEMM on SM 7.5 Turing GPUs
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)
- Updated the samples from cl.hpp to the new opencl.hpp header
- Changed the complex sum routine to return the complex sum instead of the absolute complex sum

Version 1.5.2
- Changed XAMAX/XAMIN to be more likely to return the first rather than the last min/max index; updated the API docs
- Added batched routines to pyclblast
- Added CLBLAST_VERSION_MAJOR/MINOR/PATCH defines in headers to store version numbering
- Several small improvements to the benchmark script (thanks to 'baryluk')
- Fixed a bug in the caching when using a context with multiple devices
- Fixed a bug in the tuners related to the global workgroup size not being a multiple of the local workgroup size
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)

Version 1.5.1
- Implemented single-kernel version of convolution as GEMM
- Now catches all exceptions thrown by the tuners
- Fixed a bug in the ISAMIN kernel
- Fixed an out-of-bounds read/write in the XHAD routine (thanks to etomzak)
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)

Version 1.5.0
- Added support for shuffle instructions for NVIDIA GPUs (thanks to 'tyler-utah')
- Added an option to compile the Netlib API with static OpenCL device and context (-DNETLIB_PERSISTENT_OPENCL=ON)
- Added a FAQ page to the documentation
- The tuners now check beforehand for invalid local thread sizes and skip those completely
- Made the tuning API (OverrideParameters) more flexible, disregarding superfluous parameters
- Fixed an issue with conjugate transpose not being executed in certain cases for, among others, XOMATCOPY
- Fixed an issue with AMD GPUs and the new GEMMK == 1 kernel
- Fixed an issue with the preprocessor and the new GEMMK == 1 kernel
- Fixed an issue for unequal MWG and NWG and the new GEMMK == 1 kernel
- Fixed an issue for certain parameters for AXPY's 'XaxpyFaster' kernel
- Various minor fixes and enhancements
- Added non-BLAS routines:
  * SCONVGEMM/DCONVGEMM/HCONVGEMM (convolution as im2col followed by batched GEMM)
  * SCOL2IM/DCOL2IM/CCOL2IM/ZCOL2IM/HCOL2IM (col2im transform as used in machine learning)

Version 1.4.1
- Fixed an access violation under Windows upon releasing the OpenCL program when the driver is already unloaded
- Fixed an issue with double cl_program release in the CLBlast caching system
- Added tuned parameters for various devices (see doc/tuning.md)

Version 1.4.0
- Added Python interface to CLBlast 'PyCLBlast'
- Added CLBlast to Ubuntu PPA and macOS Homebrew package managers
- Added an API to run the tuners programmatically without any I/O
- Improved the performance potential by adding a second tunable GEMM kernel with 2D register tiling
- Added support for Intel-specific subgroup shuffling extensions for faster GEMM on Intel GPUs
- Re-added a local memory size constraint to the tuners
- The routine tuners now automatically pick up tuning results from disk from the kernel tuners
- Updated and reorganised the CLBlast documentation
- Added a 'canary' region to check for overflows in the tuner and tests (inspired by clARMOR)
- Added an option to test against and compare performance with Intel's MKL
- Fixed an access violation when compiled with Visual Studio upon releasing the OpenCL program
- Fixed incorrect releasing of the OpenCL program resulting in segfaults / access violations
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)
- Added non-BLAS level-1 routines:
  * SHAD/DHAD/CHAD/ZHAD/HHAD (Hadamard element-wise vector-vector product)

Version 1.3.0
- Re-designed and integrated the auto-tuner, no more dependency on CLTune
- Made it possible to override the tuning parameters in the clients straight from JSON tuning files
- Added OpenCL pre-processor to unroll loops and perform array-to-register promotions for compilers
  which don't do this themselves (ARM Mali) - greatly improves performance on these platforms
- Added first tuners for the TRSV (block size) and TRSM (invert kernel) routines
- Added an optional argument to the GEMM routine to provide a pre-allocated temporary buffer
- Fixed an issue with a crashing/hanging AMD APP compiler with the TRSM routine (invert kernel)
- Improved compilation time by splitting the tuning database into multiple compilation units
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added the RetrieveParameters function to the API to be able to inspect the tuning parameters
- Added a strided-batched (not part of the BLAS standard) routine, faster but less generic compared
  to the existing xGEMMBATCHED routines:
  * SGEMMSTRIDEDBATCHED/DGEMMSTRIDEDBATCHED/CGEMMSTRIDEDBATCHED/ZGEMMSTRIDEDBATCHED/HGEMMSTRIDEDBATCHED

Version 1.2.0
- Fixed a bug in the TRSM/TRSV routines due to missing synchronisations after GEMM/GEMV calls
- Fixed a bug in TRSM when using the a-offset argument
- Added a CUDA API to CLBlast:
  * The library and kernels can be compiled with the CUDA driver API and NVRTC (requires CUDA 7.5)
  * Two CUDA API sample programs are added: SGEMM and DAXPY
  * All correctness tests and performance clients work on CUDA like they did for OpenCL
- Kernels are now cached based on their tuning parameters: fits the use-case of 'OverrideParameters'
- Cross-compiling for Android is now supported using CMake; instructions are added to the README
- Improved performance for small GEMM problems by going from 3 to 1 optional temporary buffers
- GEMM kernel selection (direct vs indirect) is now done automatically using a new tuner
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)

Version 1.1.0
- The tuning database now has defaults per architecture (e.g. NVIDIA Kepler SM3.5, AMD Fiji)
- The tuning database now has a dictionary to translate vendor/device names to a common set
- The tuners can now distinguish between different AMD GPU board names of the same architecture
- The tuners can now use particle-swarm optimisation to search more efficiently (thanks to 'mcian')
- Improved performance for small problems on NVIDIA hardware by caching the device name
- Further improved compilation time of database.cpp
- Added a small diagnostics helper executable
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added non-BLAS routines:
  * SIM2COL/DIM2COL/CIM2COL/ZIM2COL/HIM2COL (im2col transform as used to express convolution as GEMM)

Version 1.0.1
- Fixed a bug in the direct version of the GEMM kernel

Version 1.0.0
- Fixed a bug in the TRSM routine for alpha != 1
- Fixed a bug in the cache related to multi-device contexts (thanks to 'kpot')
- Fixed a bug in the direct version of the GEMM kernel
- Fixed several warnings for MSVC and Clang
- Added support for Mesa Clover and AMD's ROCm by making the inline keyword optional in kernels
- Performance reports are now external at https://cnugteren.github.io/clblast
- Greatly improved compilation time of database.cpp
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added non-BLAS level-1 routines:
  * iSAMIN/iDAMIN/iCAMIN/iZAMIN (absolute minimum version of the ixAMAX BLAS routines)

Version 0.11.0
- Improved the internal program source and binary caches for scalability and speed (thanks to 'intelfx')
- Fixed a bug that caused the binary to be re-created even if it was in the cache
- Fixed a bug when using offsets in the direct version of the GEMM kernels
- Fixed a missing cl_khr_fp64 when running double-precision on Intel CPUs
- Fixed tests on Apple's CPU OpenCL implementation; still not fast but correct at least
- Fixed bugs in the half-precision routines HTBMV/HTPMV/HTRMV/HSYR2K/HTRMM
- Tests now also exit with an error code when OpenCL errors or compilation errors occur
- Tests now also check for the L2 error in case of half-precision
- Clients can now test against cuBLAS on NVIDIA systems for performance comparisons (-DCUBLAS=ON)
- Replaced the R graph scripts with Python/Matplotlib scripts
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added the OverrideParameters function to the API to be able to supply custom tuning parameters
- Added triangular solver (level-2 & level-3) routines:
  * STRSV/DTRSV/CTRSV/ZTRSV (experimental, un-optimized)
  * STRSM/DTRSM/CTRSM/ZTRSM (experimental, un-optimized)
- Added batched (not part of the BLAS standard) routines:
  * SAXPYBATCHED/DAXPYBATCHED/CAXPYBATCHED/ZAXPYBATCHED/HAXPYBATCHED (batched version of AXPY)
  * SGEMMBATCHED/DGEMMBATCHED/CGEMMBATCHED/ZGEMMBATCHED/HGEMMBATCHED (batched version of GEMM)

Version 0.10.0
- Updated to version 8.0 of the CLCudaAPI C++11 OpenCL header
- Changed the enums in the C API to avoid potential name clashes with external code
- Added a Netlib CBLAS compatible API (not recommended for full control over performance)
- Greatly improved the way exceptions are handled in the library (thanks to 'intelfx')
- Improved performance of GEMM kernels for small sizes by using a direct single-kernel implementation
- Fixed a bug in the tests and samples related to waiting for an invalid event
- Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters
- Fixed a bug in the TRMM routine that would overwrite input data before consuming everything
- Added support for compilation under Visual Studio 2013 (MSVC++ 12.0)
- Added an option to set OpenCL compiler options through the env variable CLBLAST_BUILD_OPTIONS
- Added an option to run tuned kernels multiple times to average execution times
- Added an option to build a static version of the library
- Made it possible to use the command-line environment variables everywhere and without re-running CMake
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)

Version 0.9.0
- Updated to version 6.0 of the CLCudaAPI C++11 OpenCL header
- Significantly improved performance of rotated GEMV computations
- Improved performance of unseen/un-tuned devices by a better default tuning parameter selection
- Fixed the MSVC dllimport and dllexport declarations
- Fixed memory leaks related to events not being released
- Fixed a bug with a size_t and cl_ulong mismatch on 32-bit systems
- Fixed a bug related to the cache and retrieval of programs based on the OpenCL context
- Fixed a performance issue (caused by fp16 support) by optimizing alpha/beta parameter passing to kernels
- Fixed a bug in the OpenCL kernels: now placing __kernel before __attribute__
- Fixed a bug in level-3 routines when beta is zero and matrix C contains NaNs
- Added an option (-warm_up) to do a warm-up run before timing in the performance clients
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)

Version 0.8.0
- Added support for half-precision floating-point (fp16) in the library
- Made it possible to compile the performance tests (clients) separately from the correctness tests
- Made a reference BLAS and head-to-head performance comparison optional in the clients
- Increased the verbosity of the "-verbose" option in the correctness tests
- Refactored the host code for better compilation times and fewer lines of code
- Added Appveyor continuous integration and increased coverage of the Travis builds
- Improved the API documentation
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added half-precision routines:
  * Level-1: HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN
  * Level-2: HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV/HGER/HSYR/HSPR/HSYR2/HSPR2
  * Level-3: HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM
- Added non-BLAS routines:
  * SOMATCOPY/DOMATCOPY/COMATCOPY/ZOMATCOPY/HOMATCOPY (matrix copy, scaling, and/or transpose)

Version 0.7.1
- Improved performance of large power-of-2 xGEMM kernels for AMD GPUs
- Fixed a bug in the xGEMM routine related to the event being set incorrectly
- Made MSVC link the run-time libraries statically

Version 0.7.0
- Added exports to be able to create a DLL on Windows (thanks to Marco Hutter)
- Made the library thread-safe
- Performance and correctness tests can now (on top of clBLAS) be performed against CPU BLAS libraries
- Fixed the use of events within the library
- Changed the enum parameters to match the raw values of the cblas standard
- Fixed the cache of previously compiled binaries and added a function to fill or clear it
- Various minor fixes and enhancements
- Added a preliminary version of the API documentation
- Added additional sample programs
- Added tuned parameters for various devices (see README)
- Added level-1 routines:
  * SNRM2/DNRM2/ScNRM2/DzNRM2
  * SASUM/DASUM/ScASUM/DzASUM
  * SSUM/DSUM/ScSUM/DzSUM (non-absolute version of the above xASUM BLAS routines)
  * iSAMAX/iDAMAX/iCAMAX/iZAMAX
  * iSMAX/iDMAX/iCMAX/iZMAX (non-absolute version of the above ixAMAX BLAS routines)
  * iSMIN/iDMIN/iCMIN/iZMIN (non-absolute minimum version of the above ixAMAX BLAS routines)

Version 0.6.0
- Added support for MSVC (Visual Studio) 2015
- Added tuned parameters for various devices (see README)
- Now automatically generates C++ code from JSON tuning results
- Added level-2 routines:
  * SGER/DGER
  * CGERU/ZGERU
  * CGERC/ZGERC
  * CHER/ZHER
  * CHPR/ZHPR
  * CHER2/ZHER2
  * CHPR2/ZHPR2
  * CSYR/ZSYR
  * CSPR/ZSPR
  * CSYR2/ZSYR2
  * CSPR2/ZSPR2

Version 0.5.0
- Improved structure and performance of level-2 routines (xSYMV/xHEMV)
- Reduced compilation time of level-3 OpenCL kernels
- Added level-1 routines:
  * SSWAP/DSWAP/CSWAP/ZSWAP
  * SSCAL/DSCAL/CSCAL/ZSCAL
  * SCOPY/DCOPY/CCOPY/ZCOPY
  * SDOT/DDOT
  * CDOTU/ZDOTU
  * CDOTC/ZDOTC
- Added level-2 routines:
  * SGBMV/DGBMV/CGBMV/ZGBMV
  * CHBMV/ZHBMV
  * CHPMV/ZHPMV
  * SSBMV/DSBMV
  * SSPMV/DSPMV
  * STRMV/DTRMV/CTRMV/ZTRMV
  * STBMV/DTBMV/CTBMV/ZTBMV
  * STPMV/DTPMV/CTPMV/ZTPMV

Version 0.4.0
- Now using the Claduc C++11 interface to OpenCL
- Added plain C API for increased compatibility (clblast_c.h)
- Re-organized tuner infrastructure and added JSON output
- Removed clBLAS sources, it should now be installed separately for testing
- Added Travis continuous integration
- Added level-2 routines:
  * CHEMV/ZHEMV
  * SSYMV/DSYMV

Version 0.3.0
- Re-organized test/client infrastructure to avoid code duplication
- Added an optional bypass for pre/post-processing kernels in level-3 routines
- Significantly improved performance of level-3 routines on AMD GPUs
- Added level-3 routines:
  * CHEMM/ZHEMM
  * SSYRK/DSYRK/CSYRK/ZSYRK
  * CHERK/ZHERK
  * SSYR2K/DSYR2K/CSYR2K/ZSYR2K
  * CHER2K/ZHER2K
  * STRMM/DTRMM/CTRMM/ZTRMM

Version 0.2.0
- Added support for complex conjugate transpose
- Several host-code performance improvements
- Improved testing infrastructure and coverage
- Added level-2 routines:
  * SGEMV/DGEMV/CGEMV/ZGEMV
- Added level-3 routines:
  * CGEMM/ZGEMM
  * CSYMM/ZSYMM

Version 0.1.0
- Initial preview version release to GitHub
- Supported level-1 routines:
  * SAXPY/DAXPY/CAXPY/ZAXPY
- Supported level-3 routines:
  * SGEMM/DGEMM
  * SSYMM/DSYMM