Age | Commit message | Author |
|
|
|
|
|
and kernels
|
|
|
|
|
|
enabling better memory performance
|
|
by routine implementation
|
|
support empty LWS
|
|
|
|
|
|
Do not hardcode the knowledge about "A and C col-major, B row-major".
This allows for easier reuse of the DoGemm() routine with different
kernels.
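A minimal sketch of the idea, assuming each kernel describes the layouts it expects and the routine derives the required transposes from that description. `KernelLayout`, `NeedsTranspose` and `kColColRowLayout` are hypothetical names for illustration, not the identifiers used in the actual DoGemm() code:

```cpp
#include <cassert>

// Hypothetical description of the matrix layouts one GEMM kernel expects.
struct KernelLayout {
  bool a_row_major;
  bool b_row_major;
  bool c_row_major;
};

// A matrix needs a transpose step exactly when the layout it is stored in
// differs from the layout the selected kernel wants.
inline bool NeedsTranspose(bool stored_row_major, bool wanted_row_major) {
  return stored_row_major != wanted_row_major;
}

// The layout named in the commit message: A and C col-major, B row-major.
// Here it is data attached to a kernel rather than an assumption baked into the routine.
constexpr KernelLayout kColColRowLayout{false, true, false};
```

With this shape, supporting a kernel with other preferences would only mean supplying a different `KernelLayout` value; the routine body stays the same.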
|
|
though
|
|
slow reference kernel as a place-holder
|
|
|
|
Make sure the passed types are large enough.
|
|
|
|
Make sure all out parameters that are passed to functions such
as clGetDeviceInfo are large enough to contain the replies.
|
|
case of fp16 arguments are cast on host and in kernel
|
|
|
|
OpenCL context
|
|
verbose print statements to the cache
|
|
compilation and kernel execution to screen
|
|
|
|
|
|
In LocalMemUsage(), a first call to clGetKernelWorkGroupInfo queries the
number of bytes needed to store the CL_KERNEL_LOCAL_MEM_SIZE result.
However, the variable actually passed is declared as "auto result = size_t",
which is 4 bytes in 32-bit mode, regardless of the size returned by that
first call. The spec states that the result is a cl_ulong, which is 8 bytes.
To prevent stack corruption, make sure we in fact pass a cl_ulong.
Also adjust all callers to take the changed type into account.
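The query-then-fetch pattern described above can be sketched without an OpenCL installation. `MockGetInfo` is a hypothetical stand-in that mimics the calling convention of clGetKernelWorkGroupInfo (reply size reported through a pointer, reply copied into a caller-supplied buffer), and `GetInfoChecked` is an illustrative helper, not the function from this commit:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Hypothetical stand-in for an OpenCL-style info query. The reply is always a
// 64-bit integer, as cl_ulong is for CL_KERNEL_LOCAL_MEM_SIZE.
inline int MockGetInfo(std::size_t value_size, void* value, std::size_t* size_ret) {
  assert(value != nullptr || size_ret != nullptr);  // caller must ask for something
  const std::uint64_t reply = 4096;  // made-up local memory size in bytes
  if (size_ret != nullptr) { *size_ret = sizeof(reply); }
  if (value != nullptr) {
    if (value_size < sizeof(reply)) { return -1; }  // buffer too small for the reply
    std::memcpy(value, &reply, sizeof(reply));
  }
  return 0;
}

// Fetches the reply into a T, but only after checking that T has exactly the
// size the first call reported. A 4-byte T is rejected instead of being overrun.
template <typename T>
T GetInfoChecked() {
  std::size_t bytes = 0;
  if (MockGetInfo(0, nullptr, &bytes) != 0) { throw std::runtime_error("size query failed"); }
  if (bytes != sizeof(T)) { throw std::length_error("out parameter has the wrong size"); }
  T result{};
  if (MockGetInfo(sizeof(T), &result, nullptr) != 0) { throw std::runtime_error("query failed"); }
  return result;
}
```

Calling `GetInfoChecked<std::uint64_t>()` succeeds, while `GetInfoChecked<std::uint32_t>()` throws before any write happens. A real OpenCL implementation is not obliged to perform such a check on the caller's behalf, which is why the type of the out parameter has to be right at the call site.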
|
|
|
|
declspec(dllimport) when not building the library
|
|
|
|
|
|
|
|
(thanks to OursDesCavernes)
|
|
|
|
|
|
|
|
|
|
templated function
|
|
them directly now
|
|
class
|
|
functions in a separate file
|
|
and/or transposing
|
|
|
|
and renamed files and functions appropriately
|
|
|
|
|
|
GPUs
|
|
single-precision
|
|
|
|
|
|
|
|
kernels
|