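I am using the advice given here to choose the optimal GPU for my algorithm: https://stackoverflow.com/a/33488953/5371117
On my MacBook Pro I query the devices with boost::compute::system::devices();, which returns the following list of devices: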
Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
Intel(R) UHD Graphics 630
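AMD Radeon Pro 560X Compute Engine
I want to use the AMD Radeon Pro 560X Compute Engine for my purpose, but when I iterate to find the device with the maximum rating, where rating = CL_DEVICE_MAX_CLOCK_FREQUENCY * CL_DEVICE_MAX_COMPUTE_UNITS, I get the following results: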
Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz,
freq: 2600, compute units: 12, rating:31200
Intel(R) UHD Graphics 630,
freq: 1150, units: 24, rating:27600
AMD Radeon Pro 560X Compute Engine,
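freq: 300, units: 16, rating:4800
The AMD GPU has the lowest rating. I also looked into the specs, and it seems to me that CL_DEVICE_MAX_CLOCK_FREQUENCY is not returning the correct value.
According to the AMD chip specs https://www.amd.com/en/products/graphics/radeon-rx-560x, my AMD GPU has a base frequency of 1175 MHz, not 300 MHz.
According to the Intel chip specs https://en.wikichip.org/wiki/intel/uhd_graphics/630, my Intel GPU has a base frequency of 300 MHz, not 1150 MHz, although its boost frequency is 1150 MHz.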
// needs <boost/compute/system.hpp>, <iostream>, <utility> and <vector>
std::vector<boost::compute::device> devices = boost::compute::system::devices();
std::pair<boost::compute::device, ai::int64> suitableDevice{};
for(auto& device: devices)
{
    // rating = clock frequency (MHz) * number of compute units
    auto rating = device.clock_frequency() * device.compute_units();
    std::cout << device.name() << ", freq: " << device.clock_frequency() << ", units: " << device.compute_units() << ", rating:" << rating << std::endl;
    if(suitableDevice.second < rating) // keep the device with the highest rating so far
    {
        suitableDevice.first = device;
        suitableDevice.second = rating;
    }
}
Am I doing anything wrong?
Answered 2019-12-30 15:13:57
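Unfortunately, these properties are only directly comparable within a single implementation (same hardware manufacturer, same operating system).
My recommendation would be:
First, filter out anything with a device type other than CL_DEVICE_TYPE_GPU (unless no GPUs are available, in which case you may want to fall back to the CPU).
Then check whether any of the remaining devices return true for the CL_DEVICE_HOST_UNIFIED_MEMORY property. These will be integrated GPUs, which are typically slower than discrete GPUs unless you are bound by data transfer speeds, in which case they may be faster. So you will want to prefer one type over the other.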
Answered 2020-01-03 18:58:23
This code returns the device with the highest floating-point performance:
select_device_with_most_flops(find_devices());
And this returns the device with the most memory:
select_device_with_most_memory(find_devices());
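First, find_devices() returns a vector of all OpenCL devices in the system. select_device_with_most_memory() is straightforward and uses getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>().
Floating-point performance is given by this equation: FLOPs/s = cores/CU * CUs * IPC * clock frequency.
select_device_with_most_flops() is more difficult, because OpenCL only provides the number of compute units (CUs). For a CPU this is the number of threads; for a GPU it has to be multiplied by the number of stream processors / CUDA cores per CU, which differs between Nvidia, AMD and Intel and between their microarchitectures, and is usually between 4 and 128. Luckily, the vendor is available via getInfo<CL_DEVICE_VENDOR>(). So from the vendor and the number of CUs, the number of cores per CU can be figured out.
The next part is the FP32 IPC, or instructions per clock. For most GPUs this is 2, while for recent CPUs it is 32; see https://en.wikipedia.org/wiki/FLOPS?oldformat=true#FLOPs_per_cycle_for_various_processors. There is no way to determine the IPC directly in OpenCL, so the 32 for CPUs is just a guess. One could use the device name and a lookup table to be more accurate. getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU evaluates to true if the device is a GPU.
The last part is the clock frequency. OpenCL provides the base clock frequency in MHz via getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(). The device can boost to higher frequencies, so this is again an approximation.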
All of this together gives an estimate of the floating-point performance. The full code is shown below:
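// Assumes the OpenCL C++ bindings plus <string> and <vector>, with `using namespace std;` and `using namespace cl;`.
// print(), print_error() and print_device_list() are helper functions that are not shown here.
typedef unsigned int uint;
string trim(const string s) { // removes whitespace characters from beginning and end of string s
    const int l = (int)s.length();
    int a=0, b=l-1;
    char c;
    while(a<l && ((c=s.at(a))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) a++;
    while(b>a && ((c=s.at(b))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) b--;
    return s.substr(a, 1+b-a);
}
bool contains(const string s, const string match) { // true if s contains the substring match
    return s.find(match)!=string::npos;
}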
vector<Device> find_devices() {
    vector<Platform> platforms; // get all platforms (drivers)
    vector<Device> devices_available;
    vector<Device> devices; // get all devices of all platforms
    Platform::get(&platforms);
    if(platforms.size()==0) print_error("There are no OpenCL devices available. Make sure that the OpenCL 1.2 Runtime for your device is installed. For GPUs it comes by default with the graphics driver, for CPUs it has to be installed separately.");
    for(uint i=0; i<(uint)platforms.size(); i++) {
        devices_available.clear();
        platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &devices_available); // CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU
        if(devices_available.size()==0) continue; // no device of type device_type found in platform i
        for(uint j=0; j<(uint)devices_available.size(); j++) devices.push_back(devices_available[j]);
    }
    print_device_list(devices);
    return devices;
}
Device select_device_with_most_flops(const vector<Device> devices) { // return device with best floating-point performance
    float best_value = 0.0f;
    uint best_i = 0; // index of fastest device
    for(uint i=0; i<(uint)devices.size(); i++) { // find device with highest (estimated) floating point performance
        const Device d = devices[i];
        //const string device_name = trim(d.getInfo<CL_DEVICE_NAME>());
        const string device_vendor = trim(d.getInfo<CL_DEVICE_VENDOR>()); // is either Nvidia, AMD or Intel
        const uint device_compute_units = (uint)d.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
        const bool device_is_gpu = d.getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU;
        const uint device_ipc = device_is_gpu?2u:32u; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs
        const uint nvidia = (uint)(contains(device_vendor, "NVIDIA")||contains(device_vendor, "vidia"))*(device_compute_units<=30u?128u:64u); // Nvidia GPUs usually have 128 cores/CU, except Volta/Turing (>30 CUs) which have 64 cores/CU
        const uint amd = (uint)(contains(device_vendor, "AMD")||contains(device_vendor, "ADVANCED")||contains(device_vendor, "dvanced"))*(device_is_gpu?64u:1u); // AMD GCN GPUs usually have 64 cores/CU, AMD CPUs have 1 core/CU
        const uint intel = (uint)(contains(device_vendor, "INTEL")||contains(device_vendor, "ntel"))*(device_is_gpu?8u:1u); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs have 1 core/CU
        const uint device_cores = device_compute_units*(nvidia+amd+intel);
        const uint device_clock_frequency = (uint)d.getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
        const float device_tflops = 1E-6f*(float)device_cores*(float)device_ipc*(float)device_clock_frequency; // estimated device floating point performance in TeraFLOPs/s
        if(device_tflops>best_value) {
            best_value = device_tflops;
            best_i = i; // remember index of fastest device
        }
    }
    return devices[best_i];
}
Device select_device_with_most_memory(const vector<Device> devices) { // return device with largest memory capacity
    float best_value = 0.0f;
    uint best_i = 0; // index of device with the most memory
    for(uint i=0; i<(uint)devices.size(); i++) { // find device with the largest global memory
        const Device d = devices[i];
        const float device_memory = 1E-3f*(float)(d.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>()/1048576ull); // in GB
        if(device_memory>best_value) {
            best_value = device_memory;
            best_i = i; // remember index of device with the most memory
        }
    }
    return devices[best_i];
}
Device select_device_with_id(const vector<Device> devices, const int id) { // return device with the given index
    if(id>=0&&id<(int)devices.size()) {
        return devices[id];
    } else {
        print("Your selected device ID ("+to_string(id)+") is wrong.");
        return devices[0]; // fallback so that every path returns a device and the compiler does not warn
    }
}
Update: I have now included an improved version of this in a lightweight OpenCL wrapper. It correctly calculates the FLOPs for all CPUs and GPUs of the last decade or so: https://github.com/ProjectPhysX/OpenCL-Wrapper
https://stackoverflow.com/questions/59527959