Age | Commit message | Author |
|
|
Fixes bug in conjugate transpose not being executed
|
|
Added workaround for AMD Southern Islands GPU issue
|
|
transposing
|
|
invalid ones completely, saving compilation time
|
|
kernels to improve performance
|
|
inline PTX to support subgroup shuffle for Nvidia GPUs
|
|
This was causing a crash for me because the temporary Program's destructor called
clReleaseProgram on its cl_program, and then clBuildProgram was called on that
same cl_program (it belongs to the Program owned by the shared_ptr, but it is
the same underlying cl_program).
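The failure mode described above can be reproduced without an OpenCL driver. The following is a minimal sketch, not the code from this commit: `FakeClProgram`, `FakeRelease`, `FakeBuild`, `NaiveProgram` and `SharedProgram` are hypothetical stand-ins for `cl_program`, `clReleaseProgram`, `clBuildProgram` and the `Program` wrapper. It assumes the wrapper holds a raw handle and releases it in its destructor, and it shows one possible remedy (shared ownership of the handle), which is not necessarily the fix that was applied here.

```cpp
#include <memory>

// Hypothetical stand-in for a cl_program handle: it only tracks a reference
// count, so the sketch runs without OpenCL.
struct FakeClProgram {
  int ref_count = 1;      // a freshly created handle holds one reference
  bool released = false;  // set once the reference count reaches zero
};

// Stand-in for clReleaseProgram: drops one reference.
inline void FakeRelease(FakeClProgram* p) {
  if (--p->ref_count == 0) { p->released = true; }
}

// Stand-in for clBuildProgram: succeeds only on a handle that is still alive.
inline bool FakeBuild(const FakeClProgram* p) { return !p->released; }

// Problematic wrapper: each copy releases the same raw handle on destruction.
class NaiveProgram {
 public:
  explicit NaiveProgram(FakeClProgram* raw) : raw_(raw) {}
  ~NaiveProgram() { FakeRelease(raw_); }
  FakeClProgram* raw() const { return raw_; }
 private:
  FakeClProgram* raw_;
};

// Safer wrapper: the handle is released once, when the last copy goes away.
class SharedProgram {
 public:
  explicit SharedProgram(FakeClProgram* raw) : raw_(raw, FakeRelease) {}
  FakeClProgram* raw() const { return raw_.get(); }
 private:
  std::shared_ptr<FakeClProgram> raw_;
};

// Returns whether a build still works after a temporary copy was destroyed.
inline bool BuildAfterTemporaryNaive() {
  FakeClProgram handle;
  auto owner = std::make_shared<NaiveProgram>(&handle);
  { NaiveProgram temporary(*owner); }  // destructor releases the shared handle
  return FakeBuild(owner->raw());      // handle is already gone
}

inline bool BuildAfterTemporaryShared() {
  FakeClProgram handle;
  auto owner = std::make_shared<SharedProgram>(&handle);
  { SharedProgram temporary(*owner); }  // only drops one shared owner
  return FakeBuild(owner->raw());       // handle is still alive
}
```

With the naive wrapper the build step sees a released handle, while the shared-ownership wrapper keeps it alive until the last owner is destroyed. In real OpenCL code, retaining the handle with `clRetainProgram` on copy would be another way to get the same effect.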
|
|
OpenCL driver unloads first
|
|
barriers are present
|
|
TRSV global worksize issue
|
|
test from README
|
|
Apple OpenCL limitations for TRSV/TRSM now return not-implemented status
|
|
< 16 LWGS for TRSV and TRSM
|
|
size
|
|
and standard-deviation
|
|
capture other parts of the kernel code
|
|
approach for convgemm
|