Speed up R matrix computation with minimal effort.
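There are two ways to speed up R:
1. Use the Intel compiler
2. Use a faster matrix library
I saw no significant speedup from the first approach, so this post covers the second one, which in the benchmarks below at least doubles the speed of the BLAS-heavy matrix operations.
I am using R-2.13.1, with GotoBLAS as the matrix library.
According to the following link,
http://r.789695.n4.nabble.com/configure-can-t-find-dgemm-in-MKL10-td920212.html
GotoBLAS is faster than Intel MKL. Reportedly, GotoBLAS is also faster than ATLAS.
The steps are as follows:
(1) Create a shell script that sets the compiler flags and runs configure: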
export FFLAGS="-march=native -O3"
export CFLAGS="-march=native -O3 -DMKL_ILP64"
export CXXFLAGS="-march=native -O3 -DMKL_ILP64"
export FCFLAGS="-march=native -O3"
./configure --enable-R-shlib --enable-BLAS-shlib --with-blas --with-lapack --prefix=/net/dumbo/home/zhanxw/software/Rmkl
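Then build and install with make and make install.
(2) Download GotoBLAS and run 'make' in its source directory. The resulting BLAS library file is named 'libgoto2.so'.
(3) Create a symbolic link. The lib directory of the R installation (e.g. /lib64/R/lib) already contains R's default BLAS shared library, libRblas.so. Replace that file with a symbolic link pointing to libgoto2.so.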
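Step (3) can be sketched as a small shell function. This is only an illustration: the function name link_goto_blas and the backup name libRblas.so.orig are my own choices and do not come from R or GotoBLAS, and both paths have to be supplied by the caller.

```shell
#!/bin/sh
# Sketch: replace R's bundled libRblas.so with a symlink to GotoBLAS.
# link_goto_blas and the ".orig" backup name are hypothetical, chosen for this example.
link_goto_blas() {
    r_lib_dir=$1      # directory holding libRblas.so, e.g. /lib64/R/lib
    goto_lib=$2       # full path to the libgoto2.so built in step (2)

    # Refuse to continue if either input is missing.
    [ -e "$r_lib_dir/libRblas.so" ] || { echo "no libRblas.so in $r_lib_dir" >&2; return 1; }
    [ -e "$goto_lib" ] || { echo "cannot find $goto_lib" >&2; return 1; }

    # Keep the original library so the change can be undone (first run only).
    if [ ! -e "$r_lib_dir/libRblas.so.orig" ]; then
        mv "$r_lib_dir/libRblas.so" "$r_lib_dir/libRblas.so.orig" || return 1
    fi

    # -f overwrites an existing libRblas.so, e.g. a symlink left by a previous run.
    ln -sf "$goto_lib" "$r_lib_dir/libRblas.so"
}

# Example call (adjust both paths to your own system):
# link_goto_blas /lib64/R/lib /path/to/GotoBLAS2/libgoto2.so
```

Keeping the original file around means the change can be reverted by moving libRblas.so.orig back to libRblas.so.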
After these 3 steps, R will use GotoBLAS as its matrix library. The benchmark results on our server are as follows:
# GCC + default BLAS
   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.764666666666666
2400x2400 normal distributed random matrix ^1000____ (sec):  0.596666666666666
Sorting of 7,000,000 random values__________________ (sec):  0.833333333333333
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  4.425
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  2.30366666666667
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.13650194597564

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.778666666666666
Eigenvalues of a 640x640 random matrix______________ (sec):  1.406
Determinant of a 2500x2500 random matrix____________ (sec):  2.28733333333334
Cholesky decomposition of a 3000x3000 matrix________ (sec):  2.02366666666667
Inverse of a 1600x1600 random matrix________________ (sec):  1.933
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.76516531172197

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  1.06166666666667
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.601666666666669
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.56866666666667
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.757666666666661
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.595000000000013
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.785128552514896

Total time for all 15 tests_________________________ (sec):  22.9366666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  1.16349747864837
                      --- End of test ---
# GCC + GotoBLAS(GCC)
   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.776333333333333
2400x2400 normal distributed random matrix ^1000____ (sec):  0.597
Sorting of 7,000,000 random values__________________ (sec):  0.838
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.376333333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.293
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.558725402933605

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.785666666666668
Eigenvalues of a 640x640 random matrix______________ (sec):  2.092
Determinant of a 2500x2500 random matrix____________ (sec):  0.303666666666667
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.292999999999999
Inverse of a 1600x1600 random matrix________________ (sec):  0.396333333333331
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.455580734019386

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  1.07166666666667
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.608999999999999
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.848
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.675666666666665
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.591000000000001
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.761149272082565

Total time for all 15 tests_________________________ (sec):  12.5466666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.578643662905733
                      --- End of test ---
# ICC + built-in BLAS
   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.722333333333333
2400x2400 normal distributed random matrix ^1000____ (sec):  0.398
Sorting of 7,000,000 random values__________________ (sec):  0.853333333333333
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  23.2723333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  9.48066666666666
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.80121303632586

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.919666666666667
Eigenvalues of a 640x640 random matrix______________ (sec):  1.01100000000001
Determinant of a 2500x2500 random matrix____________ (sec):  4.84600000000001
Cholesky decomposition of a 3000x3000 matrix________ (sec):  3.71033333333332
Inverse of a 1600x1600 random matrix________________ (sec):  6.53100000000001
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  2.62935462784594

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.825333333333333
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.588666666666654
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.65866666666667
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.665000000000001
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.55400000000003
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.686183322572556

Total time for all 15 tests_________________________ (sec):  57.0363333333334
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  1.4812151139281
                      --- End of test ---
# ICC + GotoBLAS(ICC)
   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.738666666666667
2400x2400 normal distributed random matrix ^1000____ (sec):  0.388000000000001
Sorting of 7,000,000 random values__________________ (sec):  0.857333333333333
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.633333333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.537666666666667
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.631245051729315

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.938333333333333
Eigenvalues of a 640x640 random matrix______________ (sec):  5.53166666666667
Determinant of a 2500x2500 random matrix____________ (sec):  0.957666666666666
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.601000000000001
Inverse of a 1600x1600 random matrix________________ (sec):  1.741
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.16088739499808

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.813
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.591333333333334
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.663
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.669333333333332
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  4.883
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.13162201708511

Total time for all 15 tests_________________________ (sec):  22.5443333333333
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.939499363744844
                      --- End of test ---
Comparing the four combinations of GCC/ICC with R's built-in BLAS/GotoBLAS, GCC+GotoBLAS is the fastest on our server.
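To put a number on the gain, the per-test speedup can be computed as the default-BLAS time divided by the GotoBLAS time. Below is a minimal sketch for the GCC builds. The helper name speedup is my own, and the timings are copied from the "GCC + default BLAS" and "GCC + GotoBLAS(GCC)" runs above.

```shell
#!/bin/sh
# speedup BASELINE_SECONDS NEW_SECONDS
# Prints baseline/new rounded to one decimal place (hypothetical helper for this post).
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", base / new }'
}

# First argument: GCC + default BLAS; second argument: GCC + GotoBLAS(GCC).
speedup 4.425            0.376333333333333   # 2800x2800 cross-product
speedup 2.30366666666667 0.293               # linear regression, 3000x3000
speedup 2.28733333333334 0.303666666666667   # determinant, 2500x2500
speedup 2.02366666666667 0.292999999999999   # Cholesky, 3000x3000
speedup 1.933            0.396333333333331   # inverse, 1600x1600
```

For these five BLAS-heavy tests the ratios come out at roughly 5x to 12x. Tests that do not go through BLAS, such as sorting or the FFT, are essentially unchanged, and the eigenvalue test is actually slower with GotoBLAS in this run (1.406 s vs 2.092 s).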
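Note:
LAPACK is built on top of BLAS, so there is no need to change libRlapack.so. You can verify this with 'nm -g libRlapack.so', which lists dgemm_ with type 'U' (meaning the function is not implemented in that file), and with 'ldd libRlapack.so', which shows that it loads libRblas.so.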
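The check described in the note can be scripted. The following is a sketch only: is_undefined is a hypothetical helper name, and it assumes the usual nm output layout in which an undefined symbol appears as a line with type U followed by the symbol name.

```shell
#!/bin/sh
# is_undefined SYMBOL
# Reads `nm -g` output on stdin and succeeds if SYMBOL is listed with type "U",
# i.e. it is referenced by the library but defined somewhere else.
# (Hypothetical helper; relies on undefined entries having no address column.)
is_undefined() {
    awk -v sym="$1" '$1 == "U" && $2 == sym { found = 1 } END { exit !found }'
}

# Example usage, run from the directory that contains libRlapack.so:
# nm -g libRlapack.so | is_undefined dgemm_ && echo "dgemm_ is provided by another library"
# ldd libRlapack.so | grep libRblas
```

If the first command prints its message and the second shows a libRblas.so line, LAPACK resolves its BLAS calls through libRblas.so and therefore picks up GotoBLAS through the symlink.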
Other resources: