使用Intel Compiler Suite和Intel MKL编译64位R

Compiling 64bit R using Intel Compiler (icc/ifort) and Intel Math Kernel Library (MKL).
通过Intel的编译器和Intel MKL,我们得到运行速度最快的R系统(比上一篇介绍的 R+GotoBlas 还快一点点)。

下载,安装Intel Parallel Studio,这个包括Intel C compiler (icc), C++ Compiler (icpc), Fortran compiler(ifort):
http://software.intel.com/en-us/articles/intel-parallel-studio-xe/

下载,安装Intel Math Kernel Library
http://software.intel.com/en-us/articles/intel-mkl/
Intel的这两个软件对于非商业用途是免费的。

然后需要下载R的源代码:
http://cran.cnr.berkeley.edu/

解压缩R之后,在其目录下建立bash文件来指定编译的方式(R本身是使用静态链接还是动态链接库?安装路径?)。
具体方式可以在这个脚本的末尾部分找到,大家可以自己按需要修改。
注:在我的比较下,使用动态链接的BLAS库与静态链接库相比不会损失速度;使用动态链接库的优点是可以方便的换用不同BLAS库。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
source /home/zhanxw/intel/composer_xe_2011_sp1.6.233/bin/iccvars.sh intel64                                                                                  
source /home/zhanxw/intel/composer_xe_2011_sp1.6.233/bin/ifortvars.sh intel64
source /home/zhanxw/intel/composer_xe_2011_sp1.6.233/mkl/bin/mklvars.sh intel64
 
export CC=icc
export CFLAGS="-O3 -wd188 -mieee-fp"
export F77=ifort
export FFLAGS="-O3 -mieee-fp"
export CXX=icpc
export CXXFLAGS="-O3"
export FC=ifort
export FCFLAGS="-O3 -mieee-fp"
export ICC_LIBS=/home/zhanxw/intel/composer_xe_2011_sp1.6.233/compiler/lib/intel64
export IFC_LIBS=/home/zhanxw/intel/composer_xe_2011_sp1.6.233/compiler/lib/intel64
export SHLIB_CXXLD=icpc
export SHLIB_CXXLDFLAGS=-shared
 
MKL_LIB_PATH=/home/zhanxw/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
export LD_LIBRARY_PATH=$MKL_LIB_PATH
 
OMP_NUM_THREADS=8
 
export LDFLAGS="-L${MKL_LIB_PATH},-Bdirect,--hash-style=both,-Wl,-O1 -L$ICC_LIBS -L$IFC_LIBS -L/usr/local/lib"
 
export SHLIB_LDFLAGS="-lpthread"
export MAIN_LDFLAGS="-lpthread"
 
MKL="-L${MKL_LIB_PATH} -lmkl_blas95 -lmkl_lapack95  -Wl,--start-group -lmkl_intel -lmkl_intel_thread -lmkl_core -Wl,--end-group -openmp -lpthread"
 
OMP_NUM_THREADS=8
 
MKL="-L${MKL_LIB_PATH} -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread"
#static linked library of R                                                                                                                                  
#./configure --with-blas="$MKL"  --with-lapack="$MKL" --prefix=/net/dumbo/home/zhanxw/software/Rmkl                                                          
 
# dynamic linked library of: R and BLAS                                                                                                                      
#./configure --enable-R-shlib --enable-BLAS-shlib --with-blas="$MKL"  --with-lapack="$MKL" --prefix=/net/dumbo/home/zhanxw/software/Rmkl                     
 
#dynamic linked library of: BLAS                                                                                                                             
./configure --enable-BLAS-shlib --with-blas="$MKL"  --with-lapack="$MKL" --prefix=/net/dumbo/home/zhanxw/software/Rmkl

之后用make; make install即可。
使用同样的R-benchmark脚本,结果如下:
Intel Compiler (ICC+Ifort) and Intel MKL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3
 
   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.719666666666667
2400x2400 normal distributed random matrix ^1000____ (sec):  0.394333333333333
Sorting of 7,000,000 random values__________________ (sec):  0.861
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.709
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.448
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.611437229773395
 
   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.907666666666668
Eigenvalues of a 640x640 random matrix______________ (sec):  0.613000000000001
Determinant of a 2500x2500 random matrix____________ (sec):  0.493333333333333
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.334333333333332
Inverse of a 1600x1600 random matrix________________ (sec):  0.611666666666667
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.569777440099831
 
   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.82
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.535999999999999
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.64933333333334
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.683666666666667
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.828000000000003
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.774276714018349
 
 
Total time for all 15 tests_________________________ (sec):  11.609
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.646126830621363
                      --- End of test ---

Intel Compiler(ICC+Ifort) + GotoBlas2(Compiled by ICC/Ifort)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3
 
   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.715333333333333
2400x2400 normal distributed random matrix ^1000____ (sec):  0.41
Sorting of 7,000,000 random values__________________ (sec):  0.862666666666666
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.829333333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.554666666666667
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.690382674196494
 
   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.922333333333332
Eigenvalues of a 640x640 random matrix______________ (sec):  0.681333333333333
Determinant of a 2500x2500 random matrix____________ (sec):  0.511666666666667
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.433333333333332
Inverse of a 1600x1600 random matrix________________ (sec):  0.594333333333331
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.591732764155743
 
   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.835999999999999
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.545000000000002
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.66133333333333
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.695666666666665
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.585000000000001
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.698105585240407
 
 
Total time for all 15 tests_________________________ (sec):  11.838
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.658231817116501
                      --- End of test ---

常见的错误:
在编译R的时候,我们用–with-blas=”$MKL”来制定Intel MKL的位置 (网上其他文章的做法),但如果$MKL的值不正确,R无法正常链接MKL。我们需要检查configure的输出或者文件config.log,要确保这两项的检查都是yes:
checking for dgemm_ in -L/home/zhanxw/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread… yes
checking whether double complex BLAS can be used… yes
checking whether the BLAS is complete… yes

值得指出的是在链接Intel库时,LP64 和 ILP64是不同的。在我的机器上,错误的制定ILP64,例如-lmkl_intel_ilp64,会导致R无法使用MKL,因为使用ILP64编译的程序会crash(在configure脚本里,这个文件是conftest)

config.log是非常有用的文件,它包括的configure检查系统环境时相关信息,通过这个文件并结合configure(本质是一个shell script),可以帮助我们确定R是否可以,或者为什么不可以链接MKL库。

另外,使用shared BLAS库的时候R会检查zgeev_,并检查不到MKL,这个R“有意”的结果。因为动态的MKL库会包含LAPACK的信息。如果介意这方面的速度损失,可以使用静态链接的方式。

Updated (2011-10-05):

Similar idea in the PPT format:

R_BLAS-Sachdeva

加速R的矩阵运算(Speed up R matrix computation)

Speed up R matrix computation with smallest effort.

给R提速有两个方法:
1. 使用Intel compiler
2. 使用更快的矩阵运算库

其中我使用第一个方法并没有看到显著的速度提升,所以这里介绍第2种方法,保证矩阵运算至少提速2倍。
我使用的是R-2.13.1版本,矩阵库使用GotoBLAS。
根据下面这个链接,
http://r.789695.n4.nabble.com/configure-can-t-find-dgemm-in-MKL10-td920212.html
GotoBLAS比Intel MKL快。据说,GotoBLAS比ATLAS也要快。

具体步骤如下:
(1)建立一个shell 源文件:

1
2
3
4
5
6
export FFLAGS="-march=native -O3"
export CFLAGS="-march=native -O3 -DMKL_ILP64"
export CXXFLAGS="-march=native -O3 -DMKL_ILP64"
export FCFLAGS="-march=native -O3"
 
./configure --enable-R-shlib --enable-BLAS-shlib --with-blas --with-lapack --prefix=/net/dumbo/home/zhanxw/software/Rmkl                 

之后用make, make install安装。
(2)下载GotoBLAS,在源目录’make’即可,得到的BLAS库文件名是’libgoto2.so’
(3)建立符号链接。在R安装目录下e.g. /lib64/R/lib,已经有一个R默认的BLAS动态连接库libRblas.so,把这个改成链接到libgoto2.so的符号链接。

这3步之后,R就会使用GotoBLAS作为矩阵运算库。在我们的服务器上,benchmark结果如下:
# GCC + default BLAS

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3
 
   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.764666666666666
2400x2400 normal distributed random matrix ^1000____ (sec):  0.596666666666666
Sorting of 7,000,000 random values__________________ (sec):  0.833333333333333
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  4.425
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  2.30366666666667
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.13650194597564
 
   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.778666666666666
Eigenvalues of a 640x640 random matrix______________ (sec):  1.406
Determinant of a 2500x2500 random matrix____________ (sec):  2.28733333333334
Cholesky decomposition of a 3000x3000 matrix________ (sec):  2.02366666666667
Inverse of a 1600x1600 random matrix________________ (sec):  1.933
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  1.76516531172197
 
   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  1.06166666666667
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.601666666666669
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.56866666666667
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.757666666666661
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.595000000000013
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.785128552514896
 
 
Total time for all 15 tests_________________________ (sec):  22.9366666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  1.16349747864837
                      --- End of test ---

# GCC + GotoBLAS(GCC)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
  R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3
 
   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.776333333333333
2400x2400 normal distributed random matrix ^1000____ (sec):  0.597
Sorting of 7,000,000 random values__________________ (sec):  0.838
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.376333333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.293
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.558725402933605
 
   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.785666666666668
Eigenvalues of a 640x640 random matrix______________ (sec):  2.092
Determinant of a 2500x2500 random matrix____________ (sec):  0.303666666666667
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.292999999999999
Inverse of a 1600x1600 random matrix________________ (sec):  0.396333333333331
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.455580734019386
 
   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  1.07166666666667
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.608999999999999
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.848
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.675666666666665
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.591000000000001
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.761149272082565
 
 
Total time for all 15 tests_________________________ (sec):  12.5466666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.578643662905733
                      --- End of test ---

# ICC + build-in BLAS

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3
 
   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.722333333333333
2400x2400 normal distributed random matrix ^1000____ (sec):  0.398
Sorting of 7,000,000 random values__________________ (sec):  0.853333333333333
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  23.2723333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  9.48066666666666
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.80121303632586
 
   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.919666666666667
Eigenvalues of a 640x640 random matrix______________ (sec):  1.01100000000001
Determinant of a 2500x2500 random matrix____________ (sec):  4.84600000000001
Cholesky decomposition of a 3000x3000 matrix________ (sec):  3.71033333333332
Inverse of a 1600x1600 random matrix________________ (sec):  6.53100000000001
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  2.62935462784594
 
   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.825333333333333
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.588666666666654
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.65866666666667
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.665000000000001
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.55400000000003
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.686183322572556
 
 
Total time for all 15 tests_________________________ (sec):  57.0363333333334
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  1.4812151139281
                      --- End of test ---

# ICC + GotoBLAS(ICC)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3
 
   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.738666666666667
2400x2400 normal distributed random matrix ^1000____ (sec):  0.388000000000001
Sorting of 7,000,000 random values__________________ (sec):  0.857333333333333
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.633333333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.537666666666667
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.631245051729315
 
   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.938333333333333
Eigenvalues of a 640x640 random matrix______________ (sec):  5.53166666666667
Determinant of a 2500x2500 random matrix____________ (sec):  0.957666666666666
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.601000000000001
Inverse of a 1600x1600 random matrix________________ (sec):  1.741
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  1.16088739499808
 
   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.813
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.591333333333334
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.663
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.669333333333332
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  4.883
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  1.13162201708511
 
 
Total time for all 15 tests_________________________ (sec):  22.5443333333333
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.939499363744844
                      --- End of test ---

通过比较GCC/ICC 与 R自带的BLAS/GotoBLAS的4种组合,在我们的服务器系统下GCC+GotoBLAS最快。

注:
LAPACK是对BLAS的再次封装,因此我们不需要改变libRlapack.so。这一点可以通过’nm -g libRlapack.so’,查看dgemm_的定义为‘U’(说明这个函数没有在该文件中实现),而通过’ldd libRlapack.so’可以发现它会调用libRblas.so

其他资源: