Age | Commit message | Author |
|
|
|
|
|
Added version number defines
|
|
numbering
|
|
Fixed tuners global workgroup size
|
|
|
|
the tuners
|
|
|
|
PyCLBlast: add missing batched routines
|
|
|
|
|
|
|
|
Move queue creation out of the tuner loop
|
|
|
|
Change amax/amin behaviour
|
|
|
|
|
|
|
|
|
|
|
|
Catches all exceptions of the tuners
|
|
|
|
Reduced number of TestMatrix calls for the batched xgemm routines.
|
|
Replace the looped test by a single one with the offset of the last batch.
|
|
Replace the looped test by a single one with the maximal found offset.
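The two commit messages above describe collapsing a per-batch buffer check into a single check. The following is a minimal sketch of that idea, not the actual TestMatrix code: the function names, the `offsets` list, and the fixed `matrix_size` are all hypothetical, and the sketch assumes every batch needs the same number of elements and that offsets are non-negative.

```python
def fits_looped(buffer_size, offsets, matrix_size):
    """Check every batch separately: one bounds check per batch."""
    return all(off + matrix_size <= buffer_size for off in offsets)


def fits_last(buffer_size, offsets, matrix_size):
    """Single check using the offset of the last batch only.
    Only equivalent to the loop if the last offset is also the largest."""
    return offsets[-1] + matrix_size <= buffer_size


def fits_max(buffer_size, offsets, matrix_size):
    """Single check using the largest offset found across all batches.
    Equivalent to the loop when all batches have the same size."""
    return max(offsets) + matrix_size <= buffer_size


# Hypothetical batch layout: the last batch does not have the largest offset.
offsets = [0, 64, 32]
matrix_size = 32
```

Under these assumptions, the maximal-offset check agrees with the looped check for any buffer size, whereas the last-offset check can accept a buffer that is too small when the offsets are not in increasing order (here, a buffer of 80 elements).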
|
|
|
|
Fix out-of-bounds read/write in XhadFaster
|
|
Fix an error in XhadFaster where data would be written beyond the end of zgm.
The kernel loop assumed that there was always enough work for each thread to
process WPT items, but this was not enforced. It's possible to detect the
overflow with the "canary" buffer regions, but for SHAD, kCanarySize must be
~500 (much larger than the normal 127).
This commit may improve the performance of XhadFaster, since the kernel was
performing 2x work in some cases (once over real data, once over garbage).
Courtesy of Codeplay Software Ltd.
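The class of bug described in this commit message can be illustrated with a small host-side simulation. This is a sketch under stated assumptions, not the XhadFaster kernel: the function `hadamard_sim`, the contiguous index layout, the sentinel value, and the canary size are all hypothetical, and the simulation only computes an element-wise product. It shows how a loop that always processes WPT items per thread runs past the end of the output when the element count is not a multiple of WPT, and how a canary region after the buffer exposes the overflow.

```python
import math

CANARY = -1.0  # hypothetical sentinel value placed after the real data


def hadamard_sim(x, y, n, wpt, guarded, canary_size=8):
    """Simulate threads that each process `wpt` consecutive elements.

    Returns (result, canary): the first n outputs and the canary region.
    Assumes canary_size >= wpt - 1 so the unguarded overrun stays inside
    the padded buffers instead of raising an IndexError.
    """
    z = [0.0] * n + [CANARY] * canary_size
    # Pad the inputs so unguarded reads land on padding ("garbage") data.
    xp = list(x) + [0.0] * canary_size
    yp = list(y) + [0.0] * canary_size
    num_threads = math.ceil(n / wpt)
    for tid in range(num_threads):
        for w in range(wpt):
            i = tid * wpt + w
            if guarded and i >= n:
                break  # bounds check: skip work past the end of the data
            z[i] = xp[i] * yp[i]
    return z[:n], z[n:]


n, wpt = 10, 4  # 10 is not a multiple of 4, so the last thread overruns
x = [float(i + 1) for i in range(n)]
y = [2.0] * n

safe_result, safe_canary = hadamard_sim(x, y, n, wpt, guarded=True)
bad_result, bad_canary = hadamard_sim(x, y, n, wpt, guarded=False)
```

With the guard enabled the canary region is left untouched; without it, the last simulated thread overwrites the first two canary entries, while the first `n` results are identical in both cases.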
|
|
Fixed a bug in the absolute-min index kernel
|
|
|
|
Intel shuffle extension fix
|
|
|
|
|
|
|
|
|
|
Remove assert for extension not available in macOS
|
|
The cl_nv_device_attribute_query extension is not available on the
Apple platform. This caused failures during debug builds at runtime.
|
|
|
|
|
|
CNugteren/CLBlast-334-pyclblast-half-precision-support
PyCLBlast half precision support
|
|
|
|
|
|
|
|
|
|
Convolution with single kernel
|
|
|
|
|
|
strided-batched-GEMM routine
|
|
|
|
|