summaryrefslogtreecommitdiff
path: root/CHANGELOG
blob: 3c258bdf360ebe319e735d78a0e8d393d6f79ffb (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
Development version (next release)
- Updated to version 6.0 of the CLCudaAPI C++11 OpenCL header

Version 0.8.0
- Added support for half-precision floating-point (fp16) in the library
- Made it possible to compile the performance tests (clients) separately from the correctness tests
- Made a reference BLAS and head-to-head performance comparison optional in the clients
- Increased the verbosity of the "-verbose" option in the correctness tests
- Refactored the host code for better compilation times and fewer lines of code
- Added Appveyor continuous integration and increased coverage of the Travis builds
- Improved the API documentation
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added half-precision routines:
  * Level-1: HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN
  * Level-2: HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV/HGER/HSYR/HSPR/HSYR2/HSPR2
  * Level-3: HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM
- Added non-BLAS routines:
  * SOMATCOPY/DOMATCOPY/COMATCOPY/ZOMATCOPY/HOMATCOPY (matrix copy, scaling, and/or transpose)

Version 0.7.1
- Improved performance of large power-of-2 xGEMM kernels for AMD GPUs
- Fixed a bug in the xGEMM routine related to the event incorrectly set
- Made MSVC link the run-time libraries statically

Version 0.7.0
- Added exports to be able to create a DLL on Windows (thanks to Marco Hutter)
- Made the library thread-safe
- Performance and correctness tests can now (on top of clBLAS) be performed against CPU BLAS libraries
- Fixed the use of events within the library
- Changed the enum parameters to match the raw values of the cblas standard
- Fixed the cache of previously compiled binaries and added a function to fill or clear it
- Various minor fixes and enhancements
- Added a preliminary version of the API documentation
- Added additional sample programs
- Added tuned parameters for various devices (see README)
- Added level-1 routines:
  * SNRM2/DNRM2/ScNRM2/DzNRM2
  * SASUM/DASUM/ScASUM/DzASUM
  * SSUM/DSUM/ScSUM/DzSUM (non-absolute version of the above xASUM BLAS routines)
  * iSAMAX/iDAMAX/iCAMAX/iZAMAX
  * iSMAX/iDMAX/iCMAX/iZMAX (non-absolute version of the above ixAMAX BLAS routines)
  * iSMIN/iDMIN/iCMIN/iZMIN (non-absolute minimum version of the above ixAMAX BLAS routines)

Version 0.6.0
- Added support for MSVC (Visual Studio) 2015
- Added tuned parameters for various devices (see README)
- Now automatically generates C++ code from JSON tuning results
- Added level-2 routines:
  * SGER/DGER
  * CGERU/ZGERU
  * CGERC/ZGERC
  * CHER/ZHER
  * CHPR/ZHPR
  * CHER2/ZHER2
  * CHPR2/ZHPR2
  * CSYR/ZSYR
  * CSPR/ZSPR
  * CSYR2/ZSYR2
  * CSPR2/ZSPR2

Version 0.5.0
- Improved structure and performance of level-2 routines (xSYMV/xHEMV)
- Reduced compilation time of level-3 OpenCL kernels
- Added level-1 routines:
  * SSWAP/DSWAP/CSWAP/ZSWAP
  * SSCAL/DSCAL/CSCAL/ZSCAL
  * SCOPY/DCOPY/CCOPY/ZCOPY
  * SDOT/DDOT
  * CDOTU/ZDOTU
  * CDOTC/ZDOTC
- Added level-2 routines:
  * SGBMV/DGBMV/CGBMV/ZGBMV
  * CHBMV/ZHBMV
  * CHPMV/ZHPMV
  * SSBMV/DSBMV
  * SSPMV/DSPMV
  * STRMV/DTRMV/CTRMV/ZTRMV
  * STBMV/DTBMV/CTBMV/ZTBMV
  * STPMV/DTPMV/CTPMV/ZTPMV

Version 0.4.0
- Now using the Claduc C++11 interface to OpenCL
- Added plain C API for increased compatibility (clblast_c.h)
- Re-organized tuner infrastructure and added JSON output
- Removed clBLAS sources, it should now be installed separately for testing
- Added Travis continuous integration
- Added level-2 routines:
  * CHEMV/ZHEMV
  * SSYMV/DSYMV

Version 0.3.0
- Re-organized test/client infrastructure to avoid code duplication
- Added an optional bypass for pre/post-processing kernels in level-3 routines
- Significantly improved performance of level-3 routines on AMD GPUs
- Added level-3 routines:
  * CHEMM/ZHEMM
  * SSYRK/DSYRK/CSYRK/ZSYRK
  * CHERK/ZHERK
  * SSYR2K/DSYR2K/CSYR2K/ZSYR2K
  * CHER2K/ZHER2K
  * STRMM/DTRMM/CTRMM/ZTRMM

Version 0.2.0
- Added support for complex conjugate transpose
- Several host-code performance improvements
- Improved testing infrastructure and coverage
- Added level-2 routines:
  * SGEMV/DGEMV/CGEMV/ZGEMV
- Added level-3 routines:
  * CGEMM/ZGEMM
  * CSYMM/ZSYMM

Version 0.1.0
- Initial preview version release to GitHub
- Supported level-1 routines:
  * SAXPY/DAXPY/CAXPY/ZAXPY
- Supported level-3 routines:
  * SGEMM/DGEMM
  * SSYMM/DSYMM