Age | Commit message | Author | |
---|---|---|---|
2016-09-12 | Split the XGEMM kernel further up: now in 3 parts. This is done because MSVC can't handle long strings | Cedric Nugteren | |
2016-09-12 | Added XgemvFastRot and Xgemm 16-bit tuning results: just defaults which are now automatically taken from 32-bit if there are no entries at all | Cedric Nugteren | |
2016-09-11 | Complete re-write of the database script. Changed Pandas for the much faster and more convenient plain JSON/dict data-type | Cedric Nugteren | |
2016-09-10 | Updated database based on exhaustive tuning results for GEMM for the R9 M370X GPU | Cedric Nugteren | |
2016-09-10 | Updated the database script to remove duplicate entries: keeps only the best-performing cases for a specific parameter combination | Cedric Nugteren | |
2016-09-06 | Split GEMM tuning in two parts: a small set of tuning parameters which is explored exhaustively and a larger set which is explored randomly | Cedric Nugteren | |
2016-09-04 | The GEMM kernel no longer adds beta*C in case beta is zero; this would cause problems if C contains NaNs | Cedric Nugteren | |
2016-09-03 | Added tuning results for Intel Broadwell 5500 GT2 GPU | Cedric Nugteren | |
2016-09-03 | Updated tuning results for Haswell GT2 Mobile GPU; fixed database script to handle duplicate entries of different runs | Cedric Nugteren | |
2016-08-27 | test/correctness: read platform and device from environment | Ivan Shapovalov | |
Support passing environment variables CLBLAST_PLATFORM and CLBLAST_DEVICE instead of -platform and -device arguments to test executables. This is for `ctest`. | |||
2016-08-22 | Merge branch 'database_defaults' into development | Cedric Nugteren | |
2016-08-21 | Also changed the default-default for unknown device types to use the same method as for known device groups | Cedric Nugteren | |
2016-08-21 | Increased the ratio of GEMM tuning results to explore; reduced the tuning search space to have a better chance to evaluate more likely parameter combinations | Cedric Nugteren | |
2016-08-20 | Merge branch 'master' of https://github.com/dvasschemacq/CLBlast into dvasschemacq-master | Cedric Nugteren | |
Conflicts: src/kernels/level1/xaxpy.opencl, src/kernels/level2/xgemv.opencl, src/kernels/level2/xgemv_fast.opencl, src/kernels/level2/xger.opencl, src/kernels/level2/xher.opencl, src/kernels/level2/xher2.opencl, src/kernels/level3/xgemm_part2.opencl | |||
2016-08-18 | Adapt opencl files for 1.1 OpenCL | D. Van Assche | |
In OpenCL 1.1 __kernel has to be before __attribute__, at least with Vivante compiler. | |||
2016-08-15 | Updated the database script to calculate the relative best performance of tuning results common for a device/vendor type | Cedric Nugteren | |
2016-07-25 | Removed all old tuning results for the XgemvFastRot kernel; re-added for a couple of devices | Cedric Nugteren | |
2016-07-25 | Moved the XgemvFast and XgemvFastRot tuning database into a separate file | Cedric Nugteren | |
2016-07-24 | Merge branch 'development' into gemv_performance | Cedric Nugteren | |
2016-07-24 | Minor improvements after merging in groundwork for custom tuning parameters and kernels | Cedric Nugteren | |
2016-07-23 | Fixed a bug in the new XgemvFastRot kernel related to local memory size | Cedric Nugteren | |
2016-07-23 | Further improvements to the XgemvFastRot kernel, properly enables coalescing now | Cedric Nugteren | |
2016-07-23 | Improved the XgemvFastRot kernel by tiled loading of the input matrix A, enabling better memory performance | Cedric Nugteren | |
2016-07-22 | clblast::Database, clblast::Routine: implement "database overlays" provided by routine implementation | Ivan Shapovalov | |
2016-07-22 | clblast::RunKernel, cl::Kernel: unify variants with/without waitForEvents, support empty LWS | Ivan Shapovalov | |
2016-07-22 | cl::Kernel: skip NULL entries in waitForEvents | Ivan Shapovalov | |
2016-07-22 | clblast::RunKernel, cl::Kernel: take const vector as waitForEvents | Ivan Shapovalov | |
2016-07-22 | xgemm: do not hardcode kernel requirements for internal matrix layout | Ivan Shapovalov | |
Do not hardcode the knowledge about "A and C col-major, B row-major". This allows for easier reuse of the DoGemm() routine with different kernels. | |||
2016-07-16 | Fixed some more types and type conversions in the clpp11 interface to OpenCL | Cedric Nugteren | |
2016-07-16 | Merge pull request #80 from gcp/getdevinfo_fixes | Cedric Nugteren | |
Make sure the passed types are large enough. | |||
2016-07-16 | Removed an unused variable from the copy-transpose-pad function | Cedric Nugteren | |
2016-07-13 | Make sure the passed types are large enough. | Gian-Carlo Pascutto | |
Make sure all out parameters that are passed to functions such as clGetDeviceInfo are large enough to contain the replies. | |||
2016-07-10 | Now passing alpha/beta to the kernel as arguments as before fp16 support; in case of fp16 arguments are cast on host and in kernel | Cedric Nugteren | |
2016-07-10 | Added tuning results for AMD Oland and for Intel Graphics HD 530 | Cedric Nugteren | |
2016-07-10 | Fixed a bug related to the cache and retrieval of programs based on the OpenCL context | Cedric Nugteren | |
2016-07-08 | Cache now compares cl_context instead of a pointer to a context; added verbose print statements to the cache | Cedric Nugteren | |
2016-07-06 | Added a VERBOSE mode to debug performance: now prints details about compilation and kernel execution to screen | Cedric Nugteren | |
2016-07-06 | Added an option to the performance clients to do a warm-up run before timing | Cedric Nugteren | |
2016-07-03 | Added tuning results for GTX670, GTX750, and GTX1070 (thanks to gcp) | Cedric Nugteren | |
2016-07-02 | Ensure clGetKernelWorkGroupInfo return value fits. | Gian-Carlo Pascutto | |
In LocalMemUsage(), there's a first call to clGetKernelWorkGroupInfo to get the "bytes" amount needed to store the result from CL_KERNEL_LOCAL_MEM_SIZE. However, the actual value passed is an "auto result = size_t", which in 32-bit mode is 4 bytes, regardless of the previous return value. The spec describes that it will actually be a cl_ulong which is 8 bytes. To prevent stack corruption, make sure we are in fact passing a cl_ulong. Also adjust all callers to take the changed type into account. | |||
2016-07-02 | Fixed some memory leaks related to events not properly cleaned-up | Cedric Nugteren | |
2016-06-30 | Added declspec(dllexport) to ClearCache and FillCache, and added declspec(dllimport) when not building the library | Cedric Nugteren | |
2016-06-29 | Updated to version 6.0 of the CLCudaAPI header | Cedric Nugteren | |
2016-06-28 | Made it possible to build the clients and tests on Windows using Visual Studio | CNugteren | |
2016-06-27 | Fixes for the AppVeyor Windows build | Cedric Nugteren | |
2016-06-19 | Added tuning results for 'Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile' (thanks to OursDesCavernes) | Cedric Nugteren | |
2016-06-19 | Renamed all C++ source files to .cpp to match the .hpp extension better | Cedric Nugteren | |
2016-06-18 | Moved all headers into the source tree, changed headers to .hpp extension | Cedric Nugteren | |
2016-06-18 | Clean-up of the routine class, moved RunKernel to the routine/common file | Cedric Nugteren | |
2016-06-18 | Removed the template from the Routine base-class | Cedric Nugteren | |