Age | Commit message | Author
2016-07-26 | Fixed issues related to the recent changes in the Xgemm infrastructure | Cedric Nugteren
2016-07-26 | Merge branch 'development' into gemm_direct | Cedric Nugteren
2016-07-26 | Merge branch 'gemv_performance' into development | Cedric Nugteren
2016-07-25 | Removed all old tuning results for the XgemvFastRot kernel; re-added for a couple of devices | Cedric Nugteren
2016-07-25 | Moved the XgemvFast and XgemvFastRot tuning database into a separate file | Cedric Nugteren
2016-07-24 | Merge branch 'development' into gemv_performance | Cedric Nugteren
2016-07-24 | Minor improvements after merging in groundwork for custom tuning parameters and kernels | Cedric Nugteren
2016-07-24 | Merge pull request #84 from intelfx/device-specific-kernels | Cedric Nugteren
    Groundwork for device-specific routines
2016-07-24 | Refactored the Python database script: separated functionality in modules, now complies to the PEP8 style, added proper command-line argument parsing, and cleaned-up | Cedric Nugteren
2016-07-23 | Fixed a bug in the new XgemvFastRot kernel related to local memory size | Cedric Nugteren
2016-07-23 | Further improvements to the XgemvFastRot kernel, properly enables coalescing now | Cedric Nugteren
2016-07-23 | Improved the XgemvFastRot kernel by tiled loading of the input matrix A, enabling better memory performance | Cedric Nugteren
2016-07-22 | clblast::Database, clblast::Routine: implement "database overlays" provided by routine implementation | Ivan Shapovalov
2016-07-22 | clblast::RunKernel, cl::Kernel: unify variants with/without waitForEvents, support empty LWS | Ivan Shapovalov
2016-07-22 | cl::Kernel: skip NULL entries in waitForEvents | Ivan Shapovalov
2016-07-22 | clblast::RunKernel, cl::Kernel: take const vector as waitForEvents | Ivan Shapovalov
2016-07-22 | xgemm: do not hardcode kernel requirements for internal matrix layout | Ivan Shapovalov
    Do not hardcode the knowledge about "A and C col-major, B row-major". This allows for easier reuse of the DoGemm() routine with different kernels.
2016-07-22 | CMakeLists.txt: use ${clblast_SOURCE_DIR} instead of ${CMAKE_SOURCE_DIR} | Ivan Shapovalov
2016-07-17 | Improved the GEMM direct kernel by adding register blocking. Still not fast though | Cedric Nugteren
2016-07-16 | Created infrastructure to support a direct GEMM kernel; added correct but slow reference kernel as a place-holder | Cedric Nugteren
2016-07-16 | Fixed some more types and type conversions in the clpp11 interface to OpenCL | Cedric Nugteren
2016-07-16 | Merge pull request #80 from gcp/getdevinfo_fixes | Cedric Nugteren
    Make sure the passed types are large enough.
2016-07-16 | Removed an unused variable from the copy-transpose-pad function | Cedric Nugteren
2016-07-13 | Make sure the passed types are large enough. | Gian-Carlo Pascutto
    Make sure all out parameters that are passed to functions such as clGetDeviceInfo are large enough to contain the replies.
2016-07-10 | Now passing alpha/beta to the kernel as arguments as before fp16 support; in case of fp16 arguments are cast on host and in kernel | Cedric Nugteren
2016-07-10 | Added tuning results for AMD Oland and for Intel Graphics HD 530 | Cedric Nugteren
2016-07-10 | Fixed a bug related to the cache and retrieval of programs based on the OpenCL context | Cedric Nugteren
2016-07-08 | Cache now compares cl_context instead of a pointer to a context; added verbose print statements to the cache | Cedric Nugteren
2016-07-06 | Added a VERBOSE mode to debug performance: now prints details about compilation and kernel execution to screen | Cedric Nugteren
2016-07-06 | Added an option to the performance clients to do a warm-up run before timing | Cedric Nugteren
2016-07-04 | Fixed a linking issue with the tuners on Visual Studio | CNugteren
2016-07-03 | Added tuning results for GTX670, GTX750, and GTX1070 (thanks to gcp) | Cedric Nugteren
2016-07-03 | Merge pull request #76 from gcp/fix_local_mem_size | Cedric Nugteren
    Fixes clGetKernelWorkGroupInfo to work well with both 32-bit and 64-bit systems
2016-07-02 | Ensure clGetKernelWorkGroupInfo return value fits. | Gian-Carlo Pascutto
    In LocalMemUsage(), there's a first call to clGetKernelWorkGroupInfo to get the "bytes" amount needed to store the result from CL_KERNEL_LOCAL_MEM_SIZE. However, the actual value passed is an "auto result = size_t", which in 32-bit mode is 4 bytes, regardless of the previous return value. The spec describes that it will actually be a cl_ulong which is 8 bytes. To prevent stack corruption, make sure we are in fact passing a cl_ulong. Also adjust all callers to take the changed type into account.
2016-07-02 | Prints the current pandas version and reports the minimum required version | Cedric Nugteren
2016-07-02 | Fixed some memory leaks related to events not properly cleaned-up | Cedric Nugteren
2016-06-30 | Added declspec(dllexport) to ClearCache and FillCache, and added declspec(dllimport) when not building the library | Cedric Nugteren
2016-06-29 | Updated to version 6.0 of the CLCudaAPI header | Cedric Nugteren
2016-06-28 | Prepared the changelog for the next release | Cedric Nugteren
2016-06-28 | Updated to version 0.8.0 | Cedric Nugteren
2016-06-28 | Changed the AppVeyor buildscript to use nmake instead of 'cmake --build' (2) | Cedric Nugteren
2016-06-28 | Changed the AppVeyor buildscript to use nmake instead of 'cmake --build' | Cedric Nugteren
2016-06-28 | Fixes bug in AppVeyor with install directory (2) | Cedric Nugteren
2016-06-28 | Fixes bug in AppVeyor with install directory | Cedric Nugteren
2016-06-28 | Added configuration for AppVeyor to keep the results of the builds as an 'artifact' | Cedric Nugteren
2016-06-28 | Made it possible to build the clients and tests on Windows using Visual Studio | CNugteren
2016-06-28 | Made it possible to build the OMATCOPY test and client in case only clBLAS is present | CNugteren
2016-06-27 | Updated the README in various places | Cedric Nugteren
2016-06-27 | Fixes for the AppVeyor Windows build | Cedric Nugteren
2016-06-27 | Added vcvarsall to AppVeyor and added AppVeyor icons to README | Cedric Nugteren