Benchmarking double- vs. single-precision matrix multiplication on a Titan RTX

I am trying to understand the performance difference between single and double precision on our GPU workstation. The workstation has two TITAN RTX GPUs, but I run the benchmark on a single Titan RTX. I measure performance with cuBLAS matrix-matrix multiplication, multiplying 8192x8192 matrices filled with random floats or doubles. To make sure the error is not on my side, I also repeated the measurement in Python with the cupy library, and the results are very similar.

The results are about 75 ms per multiplication for floats and about 2,000 ms for doubles. On an older GPU this would make sense, since 75*32 = 2,400 ≈ 2,000, so double precision would be about 32 times slower, as expected from the table at https://docs.nvidia. However, my GPU has compute capability 7.5, so I expected double precision to be only about 2 times slower.

Other information: Ubuntu 18 LTS, nvcc 10.2, driver 440.82.

Here is the CUDA code:

    #include <iostream>
    #include <chrono>
    #include <string>
    #include <cuda_runtime.h>
    #include "cublas_v2.h"
    #include <math.h>
    #include <stdio.h>
    #include <cuda.h>
    #include <device_functions.h>
    #include <sstream>
    #include <time.h>

    // Bit-mixing helper used to derive an RNG seed from several inputs.
    unsigned long mix(unsigned long a, unsigned long b, unsigned long c)
    {
        a=a-b;  a=a-c;  a=a^(c >> 13);
        b=b-c;  b=b-a;  b=b^(a << 8);
        c=c-a;  c=c-b;  c=c^(b >> 13);
        a=a-b;  a=a-c;  a=a^(c >> 12);
        b=b-c;  b=b-a;  b=b^(a << 16);
        c=c-a;  c=c-b;  c=c^(b >> 5);
        a=a-b;  a=a-c;  a=a^(c >> 3);
        b=b-c;  b=b-a;  b=b^(a << 10);
        c=c-a;  c=c-b;  c=c^(b >> 15);
        return c;
    }

    using namespace std;

    int main()
    {
        int deviceCount;
        cudaGetDeviceCount(&deviceCount);
        cudaDeviceProp deviceProp;
        cublasStatus_t err;
        cudaGetDeviceProperties(&deviceProp, 0);
        printf("Detected %d devices \n", deviceCount);
        // sharedMemPerBlock is a size_t, so it is printed with %zu.
        printf("Device %d has compute capability %d.%d:\n\t maxshmem %zu. \n\t maxthreads per block %d. \n\t max threads dim %d. %d. %d.\n ",
               0, deviceProp.major, deviceProp.minor,
               deviceProp.sharedMemPerBlock, deviceProp.maxThreadsPerBlock,
               deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1], deviceProp.maxThreadsDim[2]);

        // CUDA events used for timing on the device.
        cudaEvent_t start_d, stop_d;
        cudaEventCreate(&start_d);
        cudaEventCreate(&stop_d);

        // RNG initialization
        unsigned long seed = mix(clock(), time(NULL), 0);
        srand(seed);

        int N=8192;
        int Nloops=2;
        // ... (the listing is cut off here in the post)
        }
    }
