Age | Commit message | Author | |
---|---|---|---|
2023-01-21 | Add tuning results for Intel FPGA emulation device | Cedric Nugteren | |
2023-01-21 | Add tuning results for Radeon Pro 450 | Cedric Nugteren | |
2023-01-21 | Add tuning results for Adreno 740 | Cedric Nugteren | |
2023-01-21 | Add tuning results for Adreno 730 | Cedric Nugteren | |
2023-01-17 | Updated according to feedback from CNugteren | Angus, Alexander | |
2023-01-03 | implemented changes to boost Adreno performance according to https://jira-dc.qualcomm.com/jira/browse/OSR-8731 | Angus, Alexander | |
2022-09-22 | Update PyCLBlast version number | Cedric Nugteren | |
2022-06-24 | Fix typo in comment | Cedric Nugteren | |
Resolves https://github.com/CNugteren/CLBlast/issues/440 | |||
2022-05-23 | Fix API inconsistency in cupp11.hpp | Cedric Nugteren | |
The function `CopyToAsync` has an optional event argument in the OpenCL version, which is used in CLBlast. This makes the code not compile at all if CUDA (through `cupp11.hpp`) is used as the backend. This issue was found by a CLBlast user and reported privately by email. This PR should fix that. | |||
2022-05-16 | Merge pull request #432 from justingra/sum-fix | Cedric Nugteren | |
sum fix | |||
2022-04-25 | Add tuning results for Adreno 540 | Cedric Nugteren | |
2022-04-25 | Add tuning results for Radeon RX 6500 XT | Cedric Nugteren | |
2022-04-25 | Add tuning results for Radeon RX 6800 XT | Cedric Nugteren | |
2022-04-22 | sum fix | Justin Graham | |
2022-04-13 | android.hpp: custom header guard of _clang_ | danyougle | |
In order not to have ambiguous definitions, exclude the functions for other compilers | |||
2021-08-27 | Add Quadro T2000 tuning parameters for the Tesla T4 | Cedric Nugteren | |
2021-08-27 | Remove Tesla T4 tuning results | Cedric Nugteren | |
2021-08-19 | Add tuning results for NVIDIA Tesla V100 | Cedric Nugteren | |
2021-08-19 | Add tuning results for NVIDIA Tesla T4 | Cedric Nugteren | |
2021-08-19 | Add tuning results for NVIDIA Quadro T2000 | Cedric Nugteren | |
2021-08-19 | Add tuning results for NVIDIA Quadro GV100 | Cedric Nugteren | |
2021-08-19 | Add tuning results for Intel Core i9-9980HK | Cedric Nugteren | |
2021-08-19 | Add tuning results for NVIDIA A100 | Cedric Nugteren | |
2021-05-22 | Fix issue with printing out-of-bounds local/global sizes for level 1 tuners | Cedric Nugteren | |
2021-03-13 | set the correct flop count for xgemm | JishinMaster | |
2021-02-05 | Fix Windows paths in pyclblast | Cedric Nugteren | |
2021-02-04 | Added second Windows library path | Cedric Nugteren | |
2021-01-30 | Add library path for Windows as well | Cedric Nugteren | |
2021-01-29 | Add library dir on Linux for pyclblast | Cedric Nugteren | |
2021-01-21 | Update pyclblast package version number | Cedric Nugteren | |
2021-01-20 | Use reference types to prevent unnecessary copying | Jerry James | |
2020-10-10 | Add tuning results for TITAN RTX | Cedric Nugteren | |
2020-10-10 | Add tuning results for Radeon RX Vega | Cedric Nugteren | |
2020-06-07 | Add a cautionary note in Program::GetIR and mention the fix in CHANGELOG | Pradeep Garigipati | |
2020-06-05 | Fix Program::GetIR to handle programs with multiple devices | Pradeep Garigipati | |
2020-05-11 | Increase display width of the local/global sizes | Cedric Nugteren | |
2020-05-10 | Made sure that the global workgroup size is a multiple of the local size in the tuners | Cedric Nugteren | |
2020-05-10 | Added logging of local/global workgroup sizes when running the tuners | Cedric Nugteren | |
2020-05-10 | Updated PyCLBlast version number | Cedric Nugteren | |
2020-05-10 | Added a sample to demonstrate a batched routine | Cedric Nugteren | |
2020-05-10 | Added pyclblast bindings for the 3 batched routines | Cedric Nugteren | |
2020-05-03 | Move queue creation out of the tuner loop | Cedric Nugteren | |
2020-03-08 | Made it more likely (but no guarantees) for amax/amin to return the first index | Cedric Nugteren | |
2020-03-08 | Silenced a new OpenCL warning message | Cedric Nugteren | |
2020-02-17 | Catches all exceptions of the tuners | Cedric Nugteren | |
2019-12-09 | Reduce TestMatrix calls for xgemmstridedbatched. | Tarmo Räntilä | |
Replace the looped test by a single one with the offset of the last batch. | |||
2019-12-09 | Reduce TestMatrix calls for xgemmbatched. | Tarmo Räntilä | |
Replace the looped test by a single one with the maximal found offset. | |||
2019-09-04 | Fix out-of-bounds read/write in XhadFaster | etomzak | |
Fix an error in XhadFaster where data would be written beyond the end of zgm. The kernel loop assumed that there was always enough work for each thread to process WPT items, but this was not enforced. It's possible to detect the overflow with the "canary" buffer regions, but for SHAD, kCanarySize must be ~500 (much larger than the normal 127). This commit may improve the performance of XhadFaster, since the kernel was performing 2x work in some cases (once over real data, once over garbage). Courtesy of Codeplay Software Ltd. | |||
2019-05-19 | Fixed a bug in the absolute-min index kernel | Cedric Nugteren | |
2019-05-11 | Added a function to set the OpenCL kernel standard, either 1.1 or 1.2 | Cedric Nugteren | |