summaryrefslogtreecommitdiff
path: root/CHANGELOG
blob: 48305f03dc026ccdee515e0da77ed2a26ef6a88b (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
Development version (next release)
- Updated to version 8.0 of the CLCudaAPI C++11 OpenCL header
- Changed the enums in the C API to avoid potential name clashes with external code
- Greatly improved the way exceptions are handled in the library (thanks to 'intelfx')
- Improved performance of GEMM kernels for small sizes by using a direct single-kernel implementation
- Fixed a bug in the tests and samples related to waiting for an invalid event
- Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters
- Added support for compilation under Visual Studio 2013 (MSVC++ 12.0)
- Added an option to set OpenCL compiler options through the env variable CLBLAST_BUILD_OPTIONS
- Added an option to run tuned kernels multiple times to average execution times
- Added an option to build a static version of the library
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)

Version 0.9.0
- Updated to version 6.0 of the CLCudaAPI C++11 OpenCL header
- Improved performance significantly of rotated GEMV computations
- Improved performance of unseen/un-tuned devices by a better default tuning parameter selection
- Fixed proper MSVC dllimport and dllexport declarations
- Fixed memory leaks related to events not being released
- Fixed a bug with a size_t and cl_ulong mismatch on 32-bit systems
- Fixed a bug related to the cache and retrieval of programs based on the OpenCL context
- Fixed a performance issue (caused by fp16 support) by optimizing alpha/beta parameter passing to kernels
- Fixed a bug in the OpenCL kernels: now placing __kernel before __attribute__
- Fixed a bug in level-3 routines when beta is zero and matrix C contains NaNs
- Added an option (-warm_up) to do a warm-up run before timing in the performance clients
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)

Version 0.8.0
- Added support for half-precision floating-point (fp16) in the library
- Made it possible to compile the performance tests (clients) separately from the correctness tests
- Made a reference BLAS and head-to-head performance comparison optional in the clients
- Increased the verbosity of the "-verbose" option in the correctness tests
- Refactored the host code for better compilation times and fewer lines of code
- Added Appveyor continuous integration and increased coverage of the Travis builds
- Improved the API documentation
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added half-precision routines:
  * Level-1: HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN
  * Level-2: HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV/HGER/HSYR/HSPR/HSYR2/HSPR2
  * Level-3: HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM
- Added non-BLAS routines:
  * SOMATCOPY/DOMATCOPY/COMATCOPY/ZOMATCOPY/HOMATCOPY (matrix copy, scaling, and/or transpose)

Version 0.7.1
- Improved performance of large power-of-2 xGEMM kernels for AMD GPUs
- Fixed a bug in the xGEMM routine related to the event incorrectly set
- Made MSVC link the run-time libraries statically

Version 0.7.0
- Added exports to be able to create a DLL on Windows (thanks to Marco Hutter)
- Made the library thread-safe
- Performance and correctness tests can now (on top of clBLAS) be performed against CPU BLAS libraries
- Fixed the use of events within the library
- Changed the enum parameters to match the raw values of the cblas standard
- Fixed the cache of previously compiled binaries and added a function to fill or clear it
- Various minor fixes and enhancements
- Added a preliminary version of the API documentation
- Added additional sample programs
- Added tuned parameters for various devices (see README)
- Added level-1 routines:
  * SNRM2/DNRM2/ScNRM2/DzNRM2
  * SASUM/DASUM/ScASUM/DzASUM
  * SSUM/DSUM/ScSUM/DzSUM (non-absolute version of the above xASUM BLAS routines)
  * iSAMAX/iDAMAX/iCAMAX/iZAMAX
  * iSMAX/iDMAX/iCMAX/iZMAX (non-absolute version of the above ixAMAX BLAS routines)
  * iSMIN/iDMIN/iCMIN/iZMIN (non-absolute minimum version of the above ixAMAX BLAS routines)

Version 0.6.0
- Added support for MSVC (Visual Studio) 2015
- Added tuned parameters for various devices (see README)
- Now automatically generates C++ code from JSON tuning results
- Added level-2 routines:
  * SGER/DGER
  * CGERU/ZGERU
  * CGERC/ZGERC
  * CHER/ZHER
  * CHPR/ZHPR
  * CHER2/ZHER2
  * CHPR2/ZHPR2
  * CSYR/ZSYR
  * CSPR/ZSPR
  * CSYR2/ZSYR2
  * CSPR2/ZSPR2

Version 0.5.0
- Improved structure and performance of level-2 routines (xSYMV/xHEMV)
- Reduced compilation time of level-3 OpenCL kernels
- Added level-1 routines:
  * SSWAP/DSWAP/CSWAP/ZSWAP
  * SSCAL/DSCAL/CSCAL/ZSCAL
  * SCOPY/DCOPY/CCOPY/ZCOPY
  * SDOT/DDOT
  * CDOTU/ZDOTU
  * CDOTC/ZDOTC
- Added level-2 routines:
  * SGBMV/DGBMV/CGBMV/ZGBMV
  * CHBMV/ZHBMV
  * CHPMV/ZHPMV
  * SSBMV/DSBMV
  * SSPMV/DSPMV
  * STRMV/DTRMV/CTRMV/ZTRMV
  * STBMV/DTBMV/CTBMV/ZTBMV
  * STPMV/DTPMV/CTPMV/ZTPMV

Version 0.4.0
- Now using the Claduc C++11 interface to OpenCL
- Added plain C API for increased compatibility (clblast_c.h)
- Re-organized tuner infrastructure and added JSON output
- Removed clBLAS sources, it should now be installed separately for testing
- Added Travis continuous integration
- Added level-2 routines:
  * CHEMV/ZHEMV
  * SSYMV/DSYMV

Version 0.3.0
- Re-organized test/client infrastructure to avoid code duplication
- Added an optional bypass for pre/post-processing kernels in level-3 routines
- Significantly improved performance of level-3 routines on AMD GPUs
- Added level-3 routines:
  * CHEMM/ZHEMM
  * SSYRK/DSYRK/CSYRK/ZSYRK
  * CHERK/ZHERK
  * SSYR2K/DSYR2K/CSYR2K/ZSYR2K
  * CHER2K/ZHER2K
  * STRMM/DTRMM/CTRMM/ZTRMM

Version 0.2.0
- Added support for complex conjugate transpose
- Several host-code performance improvements
- Improved testing infrastructure and coverage
- Added level-2 routines:
  * SGEMV/DGEMV/CGEMV/ZGEMV
- Added level-3 routines:
  * CGEMM/ZGEMM
  * CSYMM/ZSYMM

Version 0.1.0
- Initial preview version release to GitHub
- Supported level-1 routines:
  * SAXPY/DAXPY/CAXPY/ZAXPY
- Supported level-3 routines:
  * SGEMM/DGEMM
  * SSYMM/DSYMM