path: root/src
Age | Commit message | Author
2023-01-21 | Add tuning results for Adreno 730 | Cedric Nugteren
2023-01-17 | Updated according to feedback from CNugteren | Angus, Alexander
2023-01-03 | implemented changes to boost Adreno performance according to https://jira-dc.qualcomm.com/jira/browse/OSR-8731 | Angus, Alexander
2022-09-22 | Update PyCLBlast version number | Cedric Nugteren
2022-06-24 | Fix typo in comment | Cedric Nugteren
Resolves https://github.com/CNugteren/CLBlast/issues/440
2022-05-23 | Fix API inconsistency in cupp11.hpp | Cedric Nugteren
The function `CopyToAsync` has an optional event argument in the OpenCL version, which is used in CLBlast. This makes the code not compile at all if CUDA (through `cupp11.hpp`) is used as the backend. This issue was found by a CLBlast user and reported privately by email. This PR should fix that.
2022-05-16 | Merge pull request #432 from justingra/sum-fix | Cedric Nugteren
sum fix
2022-04-25 | Add tuning results for Adreno 540 | Cedric Nugteren
2022-04-25 | Add tuning results for Radeon RX 6500 XT | Cedric Nugteren
2022-04-25 | Add tuning results for Radeon RX 6800 XT | Cedric Nugteren
2022-04-22 | sum fix | Justin Graham
2022-04-13 | android.hpp: custom header guard of _clang_ | danyougle
In order not to have ambiguous definitions, exclude the functions for other compilers
2021-08-27 | Add Quadro T2000 tuning parameters for the Tesla T4 | Cedric Nugteren
2021-08-27 | Remove Tesla T4 tuning results | Cedric Nugteren
2021-08-19 | Add tuning results for NVIDIA Tesla V100 | Cedric Nugteren
2021-08-19 | Add tuning results for NVIDIA Tesla T4 | Cedric Nugteren
2021-08-19 | Add tuning results for NVIDIA Quadro T2000 | Cedric Nugteren
2021-08-19 | Add tuning results for NVIDIA Quadro GV100 | Cedric Nugteren
2021-08-19 | Add tuning results for Intel Core i9-9980HK | Cedric Nugteren
2021-08-19 | Add tuning results for NVIDIA A100 | Cedric Nugteren
2021-05-22 | Fix issue with printing out-of-bounds local/global sizes for level 1 tuners | Cedric Nugteren
2021-03-13 | set the correct flop count for xgemm | JishinMaster
2021-02-05 | Fix Windows paths in pyclblast | Cedric Nugteren
2021-02-04 | Added second Windows library path | Cedric Nugteren
2021-01-30 | Add library path for Windows as well | Cedric Nugteren
2021-01-29 | Add library dir on Linux for pyclblast | Cedric Nugteren
2021-01-21 | Update pyclblast package version number | Cedric Nugteren
2021-01-20 | Use reference types to prevent unnecessary copying | Jerry James
2020-10-10 | Add tuning results for TITAN RTX | Cedric Nugteren
2020-10-10 | Add tuning results for Radeon RX Vega | Cedric Nugteren
2020-06-07 | Add a cautionary note in Program::GetIR and mention the fix in CHANGELOG | Pradeep Garigipati
2020-06-05 | Fix Program::GetIR to handle programs with multiple devices | Pradeep Garigipati
2020-05-11 | Increase display width of the local/global sizes | Cedric Nugteren
2020-05-10 | Made sure that the global workgroup size is a multiple of the local size in the tuners | Cedric Nugteren
2020-05-10 | Added logging of local/global workgroup sizes when run the tuners | Cedric Nugteren
2020-05-10 | Updated PyCLBlast version number | Cedric Nugteren
2020-05-10 | Added a sample to demonstrate a batched routine | Cedric Nugteren
2020-05-10 | Added pyclblast bindings for the 3 batched routines | Cedric Nugteren
2020-05-03 | Move queue creation out of the tuner loop | Cedric Nugteren
2020-03-08 | Made it more likely (but no guarantees) for amax/amin to return the first index | Cedric Nugteren
2020-03-08 | Silenced a new OpenCL warning message | Cedric Nugteren
2020-02-17 | Catches all exceptions of the tuners | Cedric Nugteren
2019-12-09 | Reduce TestMatrix calls for xgemmstridedbatched. | Tarmo Räntilä
Replace the looped test by a single one with the offset of the last batch.
2019-12-09 | Reduce TestMatrix calls for xgemmbatched. | Tarmo Räntilä
Replace the looped test by a single one with the maximal found offset.
2019-09-04 | Fix out-of-bounds read/write in XhadFaster | etomzak
Fix an error in XhadFaster where data would be written beyond the end of zgm. The kernel loop assumed that there was always enough work for each thread to process WPT items, but this was not enforced. It's possible to detect the overflow with the "canary" buffer regions, but for SHAD, kCanarySize must be ~500 (much larger than the normal 127). This commit may improve the performance of XhadFaster, since the kernel was performing 2x work in some cases (once over real data, once over garbage). Courtesy of Codeplay Software Ltd.
2019-05-19 | Fixed a bug in the absolute-min index kernel | Cedric Nugteren
2019-05-11 | Added a function to set the OpenCL kernel standard, either 1.1 or 1.2 | Cedric Nugteren
2019-05-08 | Changed back to cl_intel_subgroups as suggested | Cedric Nugteren
2019-05-07 | Added a host-code check to make sure the avc_motion_estimation is available | Cedric Nugteren
2019-05-07 | Enabled avc_motion_estimation extension for Intel subgroup shuffling | Cedric Nugteren