让R更快

让R更快
Speed up R

用R 3.0.1来做基准,用Rscript运行R-benchmark-25.R计时。

1) 基准时间:

197.16u 1.74s 199.71r 1703184kB 0 Rscript R-benchmark-25.R

之后下载OpenBLAS,编译,然后把编译好的libopenblas.so 替换R自带的libRblas.so.
2) 测试单线程的速度:

Single thread OpenBLAS

105.83u 108.53s 58.04r 3072256kB 0 Rscript R-benchmark-25.R

 

3) 下面试试四个线程:

Four threaded OpenBLAS

export OPENBLAS_NUM_THREADS=4

79.39u 23.67s 58.54r 2122688kB 0 Rscript R-benchmark-25.R
可以看出使用四个线程和一个线程耗时差不多一样。但如果细看矩阵计算时间(这里忽略),四个线程还是能提速不少的。
但是考虑到使用方便,一个线程在大多数情况下应该就够用了。

加速R的矩阵运算(Speed up R matrix computation)

Speed up R matrix computation with smallest effort.

给R提速有两个方法:
1. 使用Intel compiler
2. 使用更快的矩阵运算库

其中我使用第一个方法并没有看到显著的速度提升,所以这里介绍第2种方法,保证矩阵运算至少提速2倍。
我使用的是R-2.13.1版本,矩阵库使用GotoBLAS。
根据下面这个链接,
http://r.789695.n4.nabble.com/configure-can-t-find-dgemm-in-MKL10-td920212.html
GotoBLAS比Intel MKL快。据说,GotoBLAS比ATLAS也要快。

具体步骤如下:
(1)建立一个shell 源文件:

export FFLAGS="-march=native -O3"
export CFLAGS="-march=native -O3 -DMKL_ILP64"
export CXXFLAGS="-march=native -O3 -DMKL_ILP64"
export FCFLAGS="-march=native -O3"

./configure --enable-R-shlib --enable-BLAS-shlib --with-blas --with-lapack --prefix=/net/dumbo/home/zhanxw/software/Rmkl                  

之后用make, make install安装。
(2)下载GotoBLAS,在源目录’make’即可,得到的BLAS库文件名是’libgoto2.so’
(3)建立符号链接。在R安装目录下e.g. /lib64/R/lib,已经有一个R默认的BLAS动态连接库libRblas.so,把这个改成链接到libgoto2.so的符号链接。

这3步之后,R就会使用GotoBLAS作为矩阵运算库。在我们的服务器上,benchmark结果如下:
# GCC + default BLAS

   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.764666666666666
2400x2400 normal distributed random matrix ^1000____ (sec):  0.596666666666666
Sorting of 7,000,000 random values__________________ (sec):  0.833333333333333
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  4.425
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  2.30366666666667
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.13650194597564

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.778666666666666
Eigenvalues of a 640x640 random matrix______________ (sec):  1.406
Determinant of a 2500x2500 random matrix____________ (sec):  2.28733333333334
Cholesky decomposition of a 3000x3000 matrix________ (sec):  2.02366666666667
Inverse of a 1600x1600 random matrix________________ (sec):  1.933
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  1.76516531172197

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  1.06166666666667
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.601666666666669
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.56866666666667
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.757666666666661
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.595000000000013
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.785128552514896


Total time for all 15 tests_________________________ (sec):  22.9366666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  1.16349747864837
                      --- End of test ---

# GCC + GotoBLAS(GCC)

  R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.776333333333333
2400x2400 normal distributed random matrix ^1000____ (sec):  0.597
Sorting of 7,000,000 random values__________________ (sec):  0.838
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.376333333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.293
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.558725402933605

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.785666666666668
Eigenvalues of a 640x640 random matrix______________ (sec):  2.092
Determinant of a 2500x2500 random matrix____________ (sec):  0.303666666666667
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.292999999999999
Inverse of a 1600x1600 random matrix________________ (sec):  0.396333333333331
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.455580734019386

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  1.07166666666667
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.608999999999999
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.848
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.675666666666665
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.591000000000001
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.761149272082565


Total time for all 15 tests_________________________ (sec):  12.5466666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.578643662905733
                      --- End of test ---


# ICC + build-in BLAS

   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.722333333333333
2400x2400 normal distributed random matrix ^1000____ (sec):  0.398
Sorting of 7,000,000 random values__________________ (sec):  0.853333333333333
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  23.2723333333333 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  9.48066666666666 
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.80121303632586 

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.919666666666667 
Eigenvalues of a 640x640 random matrix______________ (sec):  1.01100000000001 
Determinant of a 2500x2500 random matrix____________ (sec):  4.84600000000001 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  3.71033333333332 
Inverse of a 1600x1600 random matrix________________ (sec):  6.53100000000001 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  2.62935462784594 

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.825333333333333 
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.588666666666654 
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.65866666666667 
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.665000000000001 
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.55400000000003 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.686183322572556 


Total time for all 15 tests_________________________ (sec):  57.0363333333334 
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  1.4812151139281 
                      --- End of test ---

# ICC + GotoBLAS(ICC)

   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.738666666666667
2400x2400 normal distributed random matrix ^1000____ (sec):  0.388000000000001
Sorting of 7,000,000 random values__________________ (sec):  0.857333333333333 
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.633333333333333 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.537666666666667 
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.631245051729315 

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.938333333333333 
Eigenvalues of a 640x640 random matrix______________ (sec):  5.53166666666667 
Determinant of a 2500x2500 random matrix____________ (sec):  0.957666666666666 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.601000000000001 
Inverse of a 1600x1600 random matrix________________ (sec):  1.741 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  1.16088739499808 

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.813 
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.591333333333334 
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.663 
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.669333333333332 
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  4.883 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  1.13162201708511 


Total time for all 15 tests_________________________ (sec):  22.5443333333333 
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.939499363744844 
                      --- End of test ---

通过比较GCC/ICC 与 R自带的BLAS/GotoBLAS的4种组合,在我们的服务器系统下GCC+GotoBLAS最快。

注:
LAPACK是对BLAS的再次封装,因此我们不需要改变libRlapack.so。这一点可以通过’nm -g libRlapack.so’,查看dgemm_的定义为‘U’(说明这个函数没有在该文件中实现),而通过’ldd libRlapack.so’可以发现它会调用libRblas.so

其他资源: