Compiling 64bit R using Intel Compiler (icc/ifort) and Intel Math Kernel Library (MKL).
通过Intel的编译器和Intel MKL,我们得到运行速度最快的R系统(比上一篇介绍的 R+GotoBlas 还快一点点)。
下载,安装Intel Parallel Studio,这个包括Intel C compiler (icc), C++ Compiler (icpc), Fortran compiler(ifort):
http://software.intel.com/en-us/articles/intel-parallel-studio-xe/
下载,安装Intel Math Kernel Library
http://software.intel.com/en-us/articles/intel-mkl/
Intel的这两个软件对于非商业用途是免费的。
然后需要下载R的源代码:
http://cran.cnr.berkeley.edu/
解压缩R之后,在其目录下建立bash文件来指定编译的方式(R本身是使用静态链接还是动态链接库?安装路径?)。
具体方式可以在这个脚本的末尾部分找到,大家可以自己按需要修改。
注:在我的比较下,使用动态链接的BLAS库与静态链接库相比不会损失速度;使用动态链接库的优点是可以方便的换用不同BLAS库。
source /home/zhanxw/intel/composer_xe_2011_sp1.6.233/bin/iccvars.sh intel64 source /home/zhanxw/intel/composer_xe_2011_sp1.6.233/bin/ifortvars.sh intel64 source /home/zhanxw/intel/composer_xe_2011_sp1.6.233/mkl/bin/mklvars.sh intel64 export CC=icc export CFLAGS="-O3 -wd188 -mieee-fp" export F77=ifort export FFLAGS="-O3 -mieee-fp" export CXX=icpc export CXXFLAGS="-O3" export FC=ifort export FCFLAGS="-O3 -mieee-fp" export ICC_LIBS=/home/zhanxw/intel/composer_xe_2011_sp1.6.233/compiler/lib/intel64 export IFC_LIBS=/home/zhanxw/intel/composer_xe_2011_sp1.6.233/compiler/lib/intel64 export SHLIB_CXXLD=icpc export SHLIB_CXXLDFLAGS=-shared MKL_LIB_PATH=/home/zhanxw/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64 export LD_LIBRARY_PATH=$MKL_LIB_PATH OMP_NUM_THREADS=8 export LDFLAGS="-L${MKL_LIB_PATH},-Bdirect,--hash-style=both,-Wl,-O1 -L$ICC_LIBS -L$IFC_LIBS -L/usr/local/lib" export SHLIB_LDFLAGS="-lpthread" export MAIN_LDFLAGS="-lpthread" MKL="-L${MKL_LIB_PATH} -lmkl_blas95 -lmkl_lapack95 -Wl,--start-group -lmkl_intel -lmkl_intel_thread -lmkl_core -Wl,--end-group -openmp -lpthread" OMP_NUM_THREADS=8 MKL="-L${MKL_LIB_PATH} -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread" #static linked library of R #./configure --with-blas="$MKL" --with-lapack="$MKL" --prefix=/net/dumbo/home/zhanxw/software/Rmkl # dynamic linked library of: R and BLAS #./configure --enable-R-shlib --enable-BLAS-shlib --with-blas="$MKL" --with-lapack="$MKL" --prefix=/net/dumbo/home/zhanxw/software/Rmkl #dynamic linked library of: BLAS ./configure --enable-BLAS-shlib --with-blas="$MKL" --with-lapack="$MKL" --prefix=/net/dumbo/home/zhanxw/software/Rmkl
之后用make; make install即可。
使用同样的R-benchmark脚本,结果如下:
Intel Compiler (ICC+Ifort) and Intel MKL
R Benchmark 2.5 =============== Number of times each test is run__________________________: 3 I. Matrix calculation --------------------- Creation, transp., deformation of a 2500x2500 matrix (sec): 0.719666666666667 2400x2400 normal distributed random matrix ^1000____ (sec): 0.394333333333333 Sorting of 7,000,000 random values__________________ (sec): 0.861 2800x2800 cross-product matrix (b = a' * a)_________ (sec): 0.709 Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 0.448 -------------------------------------------- Trimmed geom. mean (2 extremes eliminated): 0.611437229773395 II. Matrix functions -------------------- FFT over 2,400,000 random values____________________ (sec): 0.907666666666668 Eigenvalues of a 640x640 random matrix______________ (sec): 0.613000000000001 Determinant of a 2500x2500 random matrix____________ (sec): 0.493333333333333 Cholesky decomposition of a 3000x3000 matrix________ (sec): 0.334333333333332 Inverse of a 1600x1600 random matrix________________ (sec): 0.611666666666667 -------------------------------------------- Trimmed geom. mean (2 extremes eliminated): 0.569777440099831 III. Programmation ------------------ 3,500,000 Fibonacci numbers calculation (vector calc)(sec): 0.82 Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec): 0.535999999999999 Grand common divisors of 400,000 pairs (recursion)__ (sec): 2.64933333333334 Creation of a 500x500 Toeplitz matrix (loops)_______ (sec): 0.683666666666667 Escoufier's method on a 45x45 matrix (mixed)________ (sec): 0.828000000000003 -------------------------------------------- Trimmed geom. mean (2 extremes eliminated): 0.774276714018349 Total time for all 15 tests_________________________ (sec): 11.609 Overall mean (sum of I, II and III trimmed means/3)_ (sec): 0.646126830621363 --- End of test ---
Intel Compiler(ICC+Ifort) + GotoBlas2(Compiled by ICC/Ifort)
R Benchmark 2.5 =============== Number of times each test is run__________________________: 3 I. Matrix calculation --------------------- Creation, transp., deformation of a 2500x2500 matrix (sec): 0.715333333333333 2400x2400 normal distributed random matrix ^1000____ (sec): 0.41 Sorting of 7,000,000 random values__________________ (sec): 0.862666666666666 2800x2800 cross-product matrix (b = a' * a)_________ (sec): 0.829333333333333 Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 0.554666666666667 -------------------------------------------- Trimmed geom. mean (2 extremes eliminated): 0.690382674196494 II. Matrix functions -------------------- FFT over 2,400,000 random values____________________ (sec): 0.922333333333332 Eigenvalues of a 640x640 random matrix______________ (sec): 0.681333333333333 Determinant of a 2500x2500 random matrix____________ (sec): 0.511666666666667 Cholesky decomposition of a 3000x3000 matrix________ (sec): 0.433333333333332 Inverse of a 1600x1600 random matrix________________ (sec): 0.594333333333331 -------------------------------------------- Trimmed geom. mean (2 extremes eliminated): 0.591732764155743 III. Programmation ------------------ 3,500,000 Fibonacci numbers calculation (vector calc)(sec): 0.835999999999999 Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec): 0.545000000000002 Grand common divisors of 400,000 pairs (recursion)__ (sec): 2.66133333333333 Creation of a 500x500 Toeplitz matrix (loops)_______ (sec): 0.695666666666665 Escoufier's method on a 45x45 matrix (mixed)________ (sec): 0.585000000000001 -------------------------------------------- Trimmed geom. mean (2 extremes eliminated): 0.698105585240407 Total time for all 15 tests_________________________ (sec): 11.838 Overall mean (sum of I, II and III trimmed means/3)_ (sec): 0.658231817116501 --- End of test ---
常见的错误:
在编译R的时候,我们用–with-blas=”$MKL”来制定Intel MKL的位置 (网上其他文章的做法),但如果$MKL的值不正确,R无法正常链接MKL。我们需要检查configure的输出或者文件config.log,要确保这两项的检查都是yes:
checking for dgemm_ in -L/home/zhanxw/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread… yes
checking whether double complex BLAS can be used… yes
checking whether the BLAS is complete… yes
值得指出的是在链接Intel库时,LP64 和 ILP64是不同的。在我的机器上,错误的制定ILP64,例如-lmkl_intel_ilp64,会导致R无法使用MKL,因为使用ILP64编译的程序会crash(在configure脚本里,这个文件是conftest)
config.log是非常有用的文件,它包括的configure检查系统环境时相关信息,通过这个文件并结合configure(本质是一个shell script),可以帮助我们确定R是否可以,或者为什么不可以链接MKL库。
另外,使用shared BLAS库的时候R会检查zgeev_,并检查不到MKL,这个R“有意”的结果。因为动态的MKL库会包含LAPACK的信息。如果介意这方面的速度损失,可以使用静态链接的方式。
Updated (2011-10-05):
Similar idea in the PPT format: