path: root/src
Age | Commit message | Author
2018-10-14 | Merge pull request #319 from CNugteren/convgemm_multi_kernel | Cedric Nugteren
    First im2col+GEMM implementation of convolution
2018-10-13 | Made tuning API more flexible: disregards any extra parameter values | Cedric Nugteren
2018-10-10 | Fixed pre-processor warnings related to the subgroup shuffling | Cedric Nugteren
2018-09-16 | Merge branch 'master' into convgemm_multi_kernel | Cedric Nugteren
2018-09-15 | Fixed an MSVC compilation error due to large strings | Cedric Nugteren
2018-09-15 | Fixed an MSVC compilation error due to large strings | Cedric Nugteren
2018-09-15 | Disabled Intel subgroup shuffling for double-precision | Cedric Nugteren
2018-09-15 | Fixed issues with GEMMK=1 kernel and the pre-processor | Cedric Nugteren
2018-09-07 | Added xCONVGEMM as im2col plus a batched GEMM kernel | Cedric Nugteren
2018-08-13 | Made last operation in TRSV and TRSM asynchronous, making the events not null | Cedric Nugteren
2018-08-13 | Small refactoring of events in TRSV substitution routine | Cedric Nugteren
2018-08-07 | Name change of setting to NETLIB_PERSISTENT_OPENCL | Cedric Nugteren
2018-08-05 | Added an option to compile the Netlib API with static OpenCL device and context | Cedric Nugteren
2018-08-02 | Merge pull request #309 from CNugteren/CLBlast-306-omatcopy-conjugate | Cedric Nugteren
    Fixes bug in conjugate transpose not being executed
2018-07-31 | Merge pull request #308 from CNugteren/CLBlast-301-weird-AMD-Hainan-bug | Cedric Nugteren
    Added workaround for AMD Southern Islands GPU issue
2018-07-31 | Fixed issue with not performing complex conjugation under certain cases when transposing | Cedric Nugteren
2018-07-31 | Updated the tuning results for Intel IvyBridge M GT2 | Cedric Nugteren
2018-07-29 | Fixed a wrong event issue causing error -57 | Cedric Nugteren
2018-07-29 | Removed complex numbers support for CONVGEMM | Cedric Nugteren
2018-07-29 | Merge branch 'master' into CLBlast-267-convgemm | Cedric Nugteren
2018-07-28 | Added print statements to indicate the 4 stages of GEMM tuning | Cedric Nugteren
2018-07-28 | The tuners now also check for valid local thread configurations and skip invalid ones completely, saving compilation time | Cedric Nugteren
2018-07-28 | Disabled the use of staggered indices on AMD GPUs for the new GEMMK == 1 kernels to improve performance | Cedric Nugteren
2018-07-27 | Fixed an issue with AMD GPUs and the new GEMMK == 1 kernel | Cedric Nugteren
2018-07-27 | Fixed a bug: forgot to initialize the shared pointer for the null kernel | Cedric Nugteren
2018-07-27 | Renamed AMD SI workaround defines | Cedric Nugteren
2018-07-25 | Added workaround for weird AMD SI Hainan bug | Cedric Nugteren
2018-07-25 | Added code to report the average tuning results | Cedric Nugteren
2018-07-23 | Merge pull request #297 from tyler-utah/master | Cedric Nugteren
    inline PTX to support subgroup shuffle for Nvidia GPUs
2018-07-16 | moved a two-line macro to a single line | Tyler Sorensen
2018-07-14 | forgot to add test cases back in, oops | Tyler Sorensen
2018-07-14 | Applied feedback from Cedric from first pull request | Tyler Sorensen
2018-07-13 | Added tuning results for Intel i5-4970S | Cedric Nugteren
2018-07-13 | Added device-name removal code to handle POCL naming convention | Cedric Nugteren
2018-07-13 | Added tuning results for GeForce GTX 1070 Ti | Cedric Nugteren
2018-07-13 | Added tuning results for HD Graphics 6000 Broadwell GT3 | Cedric Nugteren
2018-07-11 | restored some of the changed tuning files for xgemm | Tyler Sorensen
2018-07-11 | added inline ptx to support shuffle on Nvidia GPUs | Tyler Sorensen
2018-07-06 | Eliminate a temporary Program object | Alastair Murray
    This was causing a crash for me because the temporary Program destructor called clReleaseProgram on the cl_program with Program, and then clBuildProgram was called on the same cl_program (belonging to the Program owned by the shared_ptr, but it's the same cl_program).
2018-06-28 | Disabled calls to clReleaseProgram under Windows to avoid segfaults when the OpenCL driver unloads first | Cedric Nugteren
2018-06-03 | Merge branch 'master' into CLBlast-267-convgemm | Cedric Nugteren
2018-06-03 | Fixes for CUDA version of CLBlast | Cedric Nugteren
2018-06-01 | Fixes for Apple OpenCL CPU implementation which requires a LWGS of 1 when barriers are present | Cedric Nugteren
2018-05-31 | Added error-checking for half-empty local work group sizes; fixed a minor TRSV global worksize issue | Cedric Nugteren
2018-05-31 | Some potential fixes for error -54 when launching TRSV and TRSM kernels | Cedric Nugteren
2018-05-30 | Widened Apple OpenCL check, added way to debug too-large-workgroups issue | Cedric Nugteren
2018-05-29 | Added Apple OpenCL TRSV block size override; removed failing old Intel GPU test from README | Cedric Nugteren
2018-05-27 | Merge pull request #287 from CNugteren/apple-opencl-limitations-fixes | Cedric Nugteren
    Apple opencl limitations for TRSV/TRSM now return not-implemented status
2018-05-27 | Added a check to return 'NotImplemented' error code in case of systems with < 16 LWGS for TRSV and TRSM | Cedric Nugteren
2018-05-27 | Made FillMatrix and FillVector functions take a configurable local workgroup size | Cedric Nugteren