r – Page 2 – Little Tail

Installing R on Solaris

Posted August 30, 2012 (updated June 16, 2015) by zhanxw

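Update (2015-06-16):
The method described here may no longer work. At least, it did not work when I tried it with a Solaris 10 VirtualBox image.

Install Solaris 10 and install R under Solaris
An R package I wrote recently, vcf2geno, fails to compile under Solaris 10.
To fix this, I decided to reproduce the Solaris 10 test environment used by CRAN.
What follows is a running log of the steps.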

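1. Install VirtualBox
Download and install it.

2. Install Solaris 10 and the VirtualBox Guest Additions
Be sure to partition the disk manually and give /tmp at least 5 GB. The default /tmp is only 512 MB, which makes it hard to install Solaris Studio (step 3).
Once the VirtualBox Guest Additions are installed, higher screen resolutions become available in Solaris.

3. Install Solaris Studio 12.3
Download, unpack, and install it (this is very slow).

4. Install OpenCSW
Solaris has nothing like Ubuntu's apt or Fedora's yum built in, but OpenCSW provides a similar tool.
Follow the "Getting Started" section of the OpenCSW manual.
Then use pkgutil to install tetex, gcc4g++, iconv, and readline.

5. Download R
Get the source code from the R home page and unpack it.

6. Set the environment variables and install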
export CC=suncc
export CFLAGS="-xO5 -xc99 -xlibmil -nofstore"
export CPICFLAGS=-Kpic
export F77=sunf95
export FFLAGS="-O5 -libmil -nofstore"
export FPICFLAGS=-Kpic
export CXX="sunCC -library=stlport4"
export CXXFLAGS="-xO5 -xlibmil -nofstore -features=tmplrefstatic"
export CXXPICFLAGS=-Kpic
export FC=sunf95
export FCFLAGS=$FFLAGS
export FCPICFLAGS=-Kpic
export LDFLAGS=-L/opt/sunstudio12.1/rtlibs/amd64
export SHLIB_LDFLAGS=-shared
export SHLIB_CXXLDFLAGS=-G
export SHLIB_FCLDFLAGS=-G
export SAFE_FFLAGS="-O5 -libmil"

export CPPFLAGS='-I/opt/csw/include -I/opt/csw/include/readline'
export LDFLAGS='-L/opt/sunstudio12.1/rtlibs/amd64 -L/opt/csw/lib'

export PATH=/usr/xpg4/bin:$PATH
export PATH=/usr/sfw/bin/:$PATH

Then run, in order:
./configure
gmake   # GNU make on Solaris
gmake install

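Lessons learned:

Solaris ships with very little software; common tools have to be downloaded from OpenCSW.

Paths on Solaris differ a lot from Linux: programs are often not in /usr/bin, /usr/local/bin, and so on. glocate helps to find files quickly; see [2].

Main references:
[1] http://cran.r-project.org/doc/manuals/R-admin.html#Solaris
[2] The locate tool on Solaris.
[3] http://www.opencsw.org/get-it/packages/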

Compiling 64-bit R with the Intel Compiler Suite and Intel MKL

Posted September 11, 2011 (updated October 5, 2011) by zhanxw

Compiling 64bit R using Intel Compiler (icc/ifort) and Intel Math Kernel Library (MKL).
With the Intel compilers and Intel MKL we get the fastest R build so far (slightly faster than the R + GotoBLAS build described in the previous post).

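Download and install Intel Parallel Studio, which includes the Intel C compiler (icc), C++ compiler (icpc), and Fortran compiler (ifort):
http://software.intel.com/en-us/articles/intel-parallel-studio-xe/

Download and install the Intel Math Kernel Library:
http://software.intel.com/en-us/articles/intel-mkl/
Both of these Intel products are free for non-commercial use.

Next, download the R source code:
http://cran.cnr.berkeley.edu/

After unpacking R, create a bash script in the source directory that specifies how to build it (should R itself be statically linked or built as a shared library? where should it be installed?).
These choices are made at the end of the script below; adapt them to your needs.
Note: in my comparison, a dynamically linked BLAS library loses no speed compared with a statically linked one. The advantage of dynamic linking is that it is easy to swap in a different BLAS library.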

source /home/zhanxw/intel/composer_xe_2011_sp1.6.233/bin/iccvars.sh intel64                                                                                   
source /home/zhanxw/intel/composer_xe_2011_sp1.6.233/bin/ifortvars.sh intel64
source /home/zhanxw/intel/composer_xe_2011_sp1.6.233/mkl/bin/mklvars.sh intel64

export CC=icc
export CFLAGS="-O3 -wd188 -mieee-fp"
export F77=ifort
export FFLAGS="-O3 -mieee-fp"
export CXX=icpc
export CXXFLAGS="-O3"
export FC=ifort
export FCFLAGS="-O3 -mieee-fp"
export ICC_LIBS=/home/zhanxw/intel/composer_xe_2011_sp1.6.233/compiler/lib/intel64
export IFC_LIBS=/home/zhanxw/intel/composer_xe_2011_sp1.6.233/compiler/lib/intel64
export SHLIB_CXXLD=icpc
export SHLIB_CXXLDFLAGS=-shared

MKL_LIB_PATH=/home/zhanxw/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
export LD_LIBRARY_PATH=$MKL_LIB_PATH

OMP_NUM_THREADS=8

export LDFLAGS="-L${MKL_LIB_PATH},-Bdirect,--hash-style=both,-Wl,-O1 -L$ICC_LIBS -L$IFC_LIBS -L/usr/local/lib"

export SHLIB_LDFLAGS="-lpthread"
export MAIN_LDFLAGS="-lpthread"

# Overridden by the definition below; kept for reference only.
# MKL="-L${MKL_LIB_PATH} -lmkl_blas95 -lmkl_lapack95  -Wl,--start-group -lmkl_intel -lmkl_intel_thread -lmkl_core -Wl,--end-group -openmp -lpthread"

# LP64 interface with threaded MKL; this is the value passed to configure.
MKL="-L${MKL_LIB_PATH} -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread"
#static linked library of R                                                                                                                                   
#./configure --with-blas="$MKL"  --with-lapack="$MKL" --prefix=/net/dumbo/home/zhanxw/software/Rmkl                                                           

# dynamic linked library of: R and BLAS                                                                                                                       
#./configure --enable-R-shlib --enable-BLAS-shlib --with-blas="$MKL"  --with-lapack="$MKL" --prefix=/net/dumbo/home/zhanxw/software/Rmkl                      

#dynamic linked library of: BLAS                                                                                                                              
./configure --enable-BLAS-shlib --with-blas="$MKL"  --with-lapack="$MKL" --prefix=/net/dumbo/home/zhanxw/software/Rmkl

Then run make; make install.
Using the same R-benchmark script, the results are as follows:
Intel Compiler (ICC+Ifort) and Intel MKL

   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.719666666666667 
2400x2400 normal distributed random matrix ^1000____ (sec):  0.394333333333333 
Sorting of 7,000,000 random values__________________ (sec):  0.861 
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.709 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.448 
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.611437229773395 

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.907666666666668 
Eigenvalues of a 640x640 random matrix______________ (sec):  0.613000000000001 
Determinant of a 2500x2500 random matrix____________ (sec):  0.493333333333333 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.334333333333332 
Inverse of a 1600x1600 random matrix________________ (sec):  0.611666666666667 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.569777440099831 

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.82 
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.535999999999999 
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.64933333333334 
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.683666666666667 
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.828000000000003 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.774276714018349 


Total time for all 15 tests_________________________ (sec):  11.609 
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.646126830621363 
                      --- End of test ---

Intel Compiler(ICC+Ifort) + GotoBlas2(Compiled by ICC/Ifort)

   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.715333333333333
2400x2400 normal distributed random matrix ^1000____ (sec):  0.41
Sorting of 7,000,000 random values__________________ (sec):  0.862666666666666
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.829333333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.554666666666667
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.690382674196494

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.922333333333332
Eigenvalues of a 640x640 random matrix______________ (sec):  0.681333333333333
Determinant of a 2500x2500 random matrix____________ (sec):  0.511666666666667
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.433333333333332
Inverse of a 1600x1600 random matrix________________ (sec):  0.594333333333331
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.591732764155743

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.835999999999999
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.545000000000002
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.66133333333333
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.695666666666665
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.585000000000001
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.698105585240407


Total time for all 15 tests_________________________ (sec):  11.838
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.658231817116501
                      --- End of test ---

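Common errors:
When building R we pass --with-blas="$MKL" to tell configure where Intel MKL is (as other articles online do), but if the value of $MKL is wrong, R cannot link against MKL. Check the output of configure, or the file config.log, and make sure all of the following checks report yes:
checking for dgemm_ in -L/home/zhanxw/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread... yes
checking whether double complex BLAS can be used... yes
checking whether the BLAS is complete... yes

Note that LP64 and ILP64 are different when linking the Intel libraries. On my machine, wrongly specifying ILP64, e.g. -lmkl_intel_ilp64, prevents R from using MKL, because the test program built against ILP64 crashes (in the configure script this program is called conftest).

config.log is a very useful file. It records what configure found while probing the system, and together with configure itself (which is just a shell script) it helps to determine whether R can link against MKL, or why it cannot.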
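
The three checks can also be verified programmatically. The following is only a sketch: it assumes the terminal output of configure was saved to a text file (the name configure.out is hypothetical), and the helper blas_check_results() is an illustrative name, not part of R or of configure.

```r
# Assumption: configure's terminal output was saved beforehand, e.g. with
#   ./configure ... 2>&1 | tee configure.out
# "configure.out" is a hypothetical file name chosen for this sketch.
blas_check_results <- function(lines) {
  patterns <- c(
    dgemm   = "checking for dgemm_ in ",
    complex = "checking whether double complex BLAS can be used",
    full    = "checking whether the BLAS is complete"
  )
  vapply(patterns, function(p) {
    hit <- grep(p, lines, fixed = TRUE, value = TRUE)
    # TRUE only if the check was run and every matching line ends in "yes"
    length(hit) > 0 && all(grepl("yes\\s*$", hit))
  }, logical(1))
}

# Demo on made-up lines that mimic the format shown above
demo <- c(
  "checking for dgemm_ in -L/path/to/mkl -lmkl_intel_lp64 -lmkl_core... yes",
  "checking whether double complex BLAS can be used... no",
  "checking whether the BLAS is complete... yes"
)
print(blas_check_results(demo))

# Real use would look like: blas_check_results(readLines("configure.out"))
```

Any FALSE in the result means the corresponding check is worth looking up in config.log.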

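Also, when a shared BLAS library is used, R checks for zgeev_ and does not detect MKL as the LAPACK library. This is intentional on R's part, because the dynamic MKL library also contains the LAPACK routines. If the resulting speed loss matters to you, use static linking instead.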
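
After installation, the running R session can report which BLAS and LAPACK it ended up with. This is a sketch under one assumption: La_library() and the "BLAS" entry of extSoftVersion() only exist in R 3.4.0 and later, which is much newer than the R version used in this post, so the helper (linalg_info is an illustrative name) falls back to La_version() alone on older builds.

```r
# Report the linear algebra libraries used by the running R session.
linalg_info <- function() {
  info <- list(lapack_version = La_version())
  if (getRversion() >= "3.4.0") {
    # Paths of the shared libraries in use; may be empty on some platforms
    info$lapack_library <- La_library()
    info$blas_library   <- unname(extSoftVersion()["BLAS"])
  }
  info
}

str(linalg_info())
```

If the reported BLAS path still points at R's own libRblas rather than MKL, the configure checks above are the first place to look.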

Updated (2011-10-05):

Similar idea in the PPT format:

R_BLAS-Sachdeva

Speed up R matrix computation

Posted September 11, 2011 by zhanxw

Speed up R matrix computation with smallest effort.

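There are two ways to make R faster:
1. Use the Intel compiler
2. Use a faster matrix library

I did not see a significant speedup with the first approach, so this post describes the second, which should make matrix operations at least twice as fast.
I use R-2.13.1 with GotoBLAS as the matrix library.
According to the thread below,
http://r.789695.n4.nabble.com/configure-can-t-find-dgemm-in-MKL10-td920212.html
GotoBLAS is faster than Intel MKL. GotoBLAS is reportedly also faster than ATLAS.

The steps are:
(1) Create a shell script to source: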

export FFLAGS="-march=native -O3"
export CFLAGS="-march=native -O3 -DMKL_ILP64"
export CXXFLAGS="-march=native -O3 -DMKL_ILP64"
export FCFLAGS="-march=native -O3"

./configure --enable-R-shlib --enable-BLAS-shlib --with-blas --with-lapack --prefix=/net/dumbo/home/zhanxw/software/Rmkl

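Then install with make and make install.
(2) Download GotoBLAS and run 'make' in its source directory. The resulting BLAS library file is called 'libgoto2.so'.
(3) Create a symbolic link. The R installation directory, e.g. /lib64/R/lib, already contains R's default shared BLAS library, libRblas.so. Replace it with a symbolic link that points to libgoto2.so.

After these three steps R uses GotoBLAS as its matrix library. On our server the benchmark results are as follows: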
# GCC + default BLAS

   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.764666666666666
2400x2400 normal distributed random matrix ^1000____ (sec):  0.596666666666666
Sorting of 7,000,000 random values__________________ (sec):  0.833333333333333
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  4.425
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  2.30366666666667
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.13650194597564

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.778666666666666
Eigenvalues of a 640x640 random matrix______________ (sec):  1.406
Determinant of a 2500x2500 random matrix____________ (sec):  2.28733333333334
Cholesky decomposition of a 3000x3000 matrix________ (sec):  2.02366666666667
Inverse of a 1600x1600 random matrix________________ (sec):  1.933
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  1.76516531172197

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  1.06166666666667
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.601666666666669
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.56866666666667
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.757666666666661
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.595000000000013
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.785128552514896


Total time for all 15 tests_________________________ (sec):  22.9366666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  1.16349747864837
                      --- End of test ---

# GCC + GotoBLAS(GCC)

  R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.776333333333333
2400x2400 normal distributed random matrix ^1000____ (sec):  0.597
Sorting of 7,000,000 random values__________________ (sec):  0.838
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.376333333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.293
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.558725402933605

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.785666666666668
Eigenvalues of a 640x640 random matrix______________ (sec):  2.092
Determinant of a 2500x2500 random matrix____________ (sec):  0.303666666666667
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.292999999999999
Inverse of a 1600x1600 random matrix________________ (sec):  0.396333333333331
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.455580734019386

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  1.07166666666667
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.608999999999999
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.848
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.675666666666665
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.591000000000001
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.761149272082565


Total time for all 15 tests_________________________ (sec):  12.5466666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.578643662905733
                      --- End of test ---

# ICC + built-in BLAS

   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.722333333333333
2400x2400 normal distributed random matrix ^1000____ (sec):  0.398
Sorting of 7,000,000 random values__________________ (sec):  0.853333333333333
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  23.2723333333333 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  9.48066666666666 
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.80121303632586 

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.919666666666667 
Eigenvalues of a 640x640 random matrix______________ (sec):  1.01100000000001 
Determinant of a 2500x2500 random matrix____________ (sec):  4.84600000000001 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  3.71033333333332 
Inverse of a 1600x1600 random matrix________________ (sec):  6.53100000000001 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  2.62935462784594 

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.825333333333333 
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.588666666666654 
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.65866666666667 
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.665000000000001 
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.55400000000003 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.686183322572556 


Total time for all 15 tests_________________________ (sec):  57.0363333333334 
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  1.4812151139281 
                      --- End of test ---

# ICC + GotoBLAS(ICC)

   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.738666666666667
2400x2400 normal distributed random matrix ^1000____ (sec):  0.388000000000001
Sorting of 7,000,000 random values__________________ (sec):  0.857333333333333 
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.633333333333333 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.537666666666667 
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.631245051729315 

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.938333333333333 
Eigenvalues of a 640x640 random matrix______________ (sec):  5.53166666666667 
Determinant of a 2500x2500 random matrix____________ (sec):  0.957666666666666 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.601000000000001 
Inverse of a 1600x1600 random matrix________________ (sec):  1.741 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  1.16088739499808 

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.813 
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.591333333333334 
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.663 
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.669333333333332 
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  4.883 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  1.13162201708511 


Total time for all 15 tests_________________________ (sec):  22.5443333333333 
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.939499363744844 
                      --- End of test ---

Comparing the four combinations of GCC/ICC with R's built-in BLAS/GotoBLAS, GCC + GotoBLAS is the fastest on our server.
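
For a quick spot check after swapping the BLAS library, a few BLAS-bound operations can be timed directly instead of running the whole benchmark. The sketch below is not the R Benchmark 2.5 script. The function name time_linalg and the matrix size are my own choices, and the timings are only comparable between runs on the same machine.

```r
# Time a handful of BLAS/LAPACK-bound operations on an n x n random matrix.
time_linalg <- function(n = 500, reps = 3) {
  set.seed(1)
  a <- matrix(rnorm(n * n), n, n)
  # Mean elapsed wall-clock time over `reps` runs of a zero-argument function
  elapsed <- function(op) {
    mean(replicate(reps, system.time(op())[["elapsed"]]))
  }
  c(
    crossprod   = elapsed(function() crossprod(a)),        # a' * a
    solve       = elapsed(function() solve(a)),            # matrix inverse
    determinant = elapsed(function() det(a)),
    cholesky    = elapsed(function() chol(crossprod(a)))
  )
}

print(time_linalg())
```

Run it once with the default libRblas.so and once with the symbolic link in place. The cross-product and inverse rows are the ones where the tables above show the largest differences.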

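Note:
LAPACK is built on top of BLAS, so libRlapack.so does not need to change. To verify this, run 'nm -g libRlapack.so' and see that dgemm_ is marked 'U' (undefined, meaning the function is not implemented in that file), and run 'ldd libRlapack.so' to see that it depends on libRblas.so.

Other resources:

Compiling 64-bit R 2.10 with the Intel compiler

Compiling R with Intel compiler version 11

How to link the Intel MKL library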

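How to debug R code

Posted May 31, 2011 by zhanxw

Tricks for debugging R code

Many R users complain that R code is hard to debug. Personally I find R at least a bit better than Perl, because R is well documented and its source code is readable. In short, R's interface is simple: it has neither a debugger as powerful as Visual Studio's nor debugging commands as flexible as GDB's (see my posts "GDB tips" and "GDB tips (2)"). I have collected the following five debugging methods, each suited to different situations. That said, it is still best to write bug-free code in the first place.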

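1. Traditional debugging functions
traceback(), debug(), trace(), browser(), recover()
traceback() prints the call stack after an error has terminated execution.
debug() sets a breakpoint on a function: when the function is called, it runs step by step so that we can follow it by hand, though less flexibly than in gdb.
trace() inserts extra debugging code into a function. For example, trace(sum) reports every call to sum. Another example:
## arrange to call the browser on entering and exiting
## function f
trace("f", quote(browser(skipCalls=4)), exit = quote(browser(skipCalls=4)))
This calls browser() when f is entered and when it exits. skipCalls=4 tells browser() to skip the four internal calls added by trace() when it reports where it was called from.
browser(): usually inserted into code or passed as an argument (as above). When it is called, the user can inspect variables. Type c to continue, n to execute the next statement, and Q to quit.
recover(): similar to browser() and also invoked by a call. The difference is that the user can choose which frame (stack depth) to inspect.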
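
Since browser() and recover() need an interactive session, here is a small non-interactive sketch of trace() that simply prints the arguments on each call. The function f is a made-up example, not code from the references.

```r
# A toy function to trace (hypothetical example)
f <- function(a, b) if (a > b) a else b

# Run the tracer expression on entry to every call of f, without editing f.
# print = FALSE suppresses trace()'s own "Tracing ..." line for each call.
trace("f", tracer = quote(cat("f called with a =", a, "b =", b, "\n")),
      print = FALSE)

f(1, 2)
f(5, 3)

# Remove the tracing code and restore the original function
untrace("f")
```

The same pattern works with quote(browser()) as the tracer when stepping through the function interactively is preferred.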

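2. Even more traditional: print() and cat()
Use print() to show the value of a variable at the point of interest.
For simple cases cat() has simpler syntax, e.g. cat("x=", x)

3. Set options(error=...)
When an error occurs, we want R to stop running the remaining code and enter a debugging mode of our choice.
In an interactive R session, set:
options(error=recover)
With Rscript, i.e. on the command line, the following saves the error information to a file:
options(error = quote({dump.frames(to.file=TRUE); q()}))

When debugging is done, restore the default with:
options(error = NULL)

Here is an example, taken from one of the references listed at the end.
The failing code: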

x <- 1:5
y <- x + rnorm(length(x),0,1)
f <- function(x,y) {
  y <- c(y,1)
  lm(y~x)
}

To debug it, we enter:

options(error=recover)

> f(x,y)
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'x')

Enter a frame number, or 0 to exit   

1: f(x, y)
2: lm(y ~ x)
3: eval(mf, parent.frame())
4: eval(expr, envir, enclos)
5: model.frame(formula = y ~ x, drop.unused.levels = TRUE)
6: model.frame.default(formula = y ~ x, drop.unused.levels = TRUE)

Selection: 1
Called from: eval(expr, envir, enclos)
Browse[1]> x
[1] 1 2 3 4 5
Browse[1]> y
[1] 1.6591197 0.5939368 4.3371049 4.4754027 5.9862130 1.0000000

Inspecting the values of x and y reveals the problem.
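
For the Rscript case, the dumped frames can be examined later in an interactive session. The following sketch spells out both halves. It assumes the default dump name, so dump.frames() writes last.dump.rda into the working directory, and q(status = 1) is my addition to signal failure to the calling shell.

```r
# --- At the top of the batch script run with Rscript ---
# On any error: save all frames to last.dump.rda, then quit with a
# non-zero exit status.
options(error = quote({
  dump.frames(dumpto = "last.dump", to.file = TRUE)
  q(status = 1)
}))

# --- Later, in an interactive R session started in the same directory ---
# load("last.dump.rda")   # restores the object `last.dump`
# debugger(last.dump)     # offers the same frame menu that recover() shows
```

debugger() lets you pick a frame and inspect its variables, much like the recover() session shown above, but after the failed job has already ended.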

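4. Set breakpoints with setBreakpoint()
Since R 2.10 there are two more debugging functions, findLineNum() and setBreakpoint().
With breakpoints we can run the code at full speed up to the part that might be wrong. (With only debug() we would have to step through R statements by hand, and with recover() after the error we would have to work backwards to find its cause.) This greatly speeds up debugging.

Here is an example that sets a breakpoint at line 3: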

x <- " f <- function(a, b) {
             if (a > b)  {
                 a
             } else {
                 b
             }
         }"


eval(parse(text=x))  # Normally you'd use source() to read a file...

findLineNum("<text>#3")   # <text> is a dummy filename used by parse(text=)

#This will print
#f step 2,3,2 in <environment: R_GlobalEnv>

#and you can use

setBreakpoint("<text>#3")

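5. Debugging inside *apply functions:
Anyone who has used R knows that errors inside loops are hard to track down. Because explicit loops are slow in R, we often use sapply(), lapply(), and so on instead of for loops. When these functions fail, they never say which element caused the error. Here are some ways to deal with that:

Use the try() function.
For example: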

> x <- as.list(-2:2)
> x[[2]] <- "what?!?"
> ## using sapply
> sapply(x, function(x) 1/x)
Error in 1/x : non-numeric argument to binary operator
# What about using try()?
> sapply(x, function(x) try(1/x))
Error in 1/x : non-numeric argument to binary operator
[1] "-0.5"                                                    
[2] "Error in 1/x : non-numeric argument to binary operator\n"
[3] "Inf"                                                     
[4] "1"                                                       
[5] "0.5"
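
With try() the result is coerced to character, and the position of the failing element still has to be read off by eye. A variation, sketched here with tryCatch() and a loop over indices rather than values, reports the index directly. The helper name safe_inverse is made up for illustration.

```r
x <- as.list(-2:2)
x[[2]] <- "what?!?"

# Iterate over indices so the error handler knows which element failed
safe_inverse <- function(i) {
  tryCatch(
    1 / x[[i]],
    error = function(e) {
      message(sprintf("element %d failed: %s", i, conditionMessage(e)))
      NA_real_   # placeholder so the result stays numeric
    }
  )
}

res <- vapply(seq_along(x), safe_inverse, numeric(1))
print(res)

# Indices of the elements that raised an error
failed <- which(is.na(res))
print(failed)
```

The NA placeholder would be ambiguous if the computation itself could return NA. In that case, returning a list with a separate error flag is safer.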

Third-party packages can also help:
foreach(.verbose = TRUE): I could not get this to work, though foreach is still a powerful tool
plyr(.inform = TRUE)
An example with the plyr package:

> laply(x, function(x) 1/x, .inform = TRUE)

Error in 1/x : non-numeric argument to binary operator
Error: with piece 2: 
[1] "what?!?"

As an aside: when running install.packages() in R, you are only asked to choose a repository (mirror) the first time. To choose a different mirror later, run:
options("repos"=c(CRAN="@CRAN@"))

The pages I consulted are listed below:
[1] Getting the state of variables after an error occurs in R
[2] What is your favorite R debugging trick?
[3] Debugging lapply/sapply calls
[4] R script line numbers at error?

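How to test the distribution of one-dimensional data

Posted May 9, 2011 by zhanxw

This post shows how to analyze a univariate random variable in R: how to identify the type of its distribution, how to estimate the distribution's parameters, and how to use hypothesis tests to check the distribution type.
How to find, fit, and test the distribution of a univariate variable in R
Univariate random variables come up often, for example as the response of a linear model. There we usually need to check whether the response is normally distributed in order to decide whether the model should use Y directly, log(Y), or some other transformation.
This post mainly follows [1]. I cover some basic methods, but readers should consult the original for more detail.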

1. Plot the density and the CDF

Histogram: hist(x)
Density plot: plot(density(x))
CDF plot: plot(ecdf(x))

Check for normality:

z= (x-mean(x))/sd(x)
qqnorm(z)
abline(0,1)

Other distributions can be checked in a similar way (draw a sample from the theoretical distribution, then use qqplot):

x.wei <- rweibull(200, shape=2.1, scale=1.1)
x.teo <- rweibull(200, shape=2.1, scale=1.0)
qqplot(x.teo, x.wei)
abline(0,1)

http://www.statsoft.com/textbook/distribution-fitting/

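2. Guess the distribution type from its moments
Standardize the data, compute the first four moments, and compare them with the common distributions listed on the following page to guess which distribution it is:
NIST 1.3.5.11. Measures of Skewness and Kurtosis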
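
The moment calculation can be sketched in a few lines. The function name sample_shape is illustrative. It uses the simple moment definitions of skewness and kurtosis (mean of z^3 and z^4 with the standard deviation computed by dividing by n), which is one of several conventions in use.

```r
# Mean, variance, skewness and kurtosis of a numeric sample
sample_shape <- function(x) {
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))   # standard deviation with divisor n
  z <- (x - m) / s             # standardized values
  c(mean = m, variance = var(x),
    skewness = mean(z^3), kurtosis = mean(z^4))
}

set.seed(123)
# Theory: a normal distribution has skewness 0 and kurtosis 3
print(sample_shape(rnorm(1000)))
# Theory: an exponential distribution has skewness 2 and kurtosis 9
print(sample_shape(rexp(1000)))
```

Sample values fluctuate around the theoretical ones, so this is a way to narrow down candidates rather than a test.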

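3. Estimate the distribution parameters
Once the distribution type is known, its parameters can be estimated. The common approaches are the method of moments and maximum likelihood.
The method of moments is simpler and only needs the mean and var functions, but the estimates may be biased.
For maximum likelihood there are:
1) mle() in the stats4 package
2) fitdistr() in the MASS package
Method 1) is more low-level but works for any distribution. Method 2) is easy to use and needs only a single command for Gamma, Weibull, Normal, and similar distributions, for example: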

fitdistr(x.norm,"normal") ## fitting gaussian pdf parameters
      mean          sd
   9.9355373    2.0101691
  (0.1421404)  (0.1005085)
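
For comparison, here is a sketch of method 1), mle() from stats4, fitted to a simulated Weibull sample. Optimizing over the logarithms of the parameters is my own choice to keep shape and scale positive without box constraints. The sample and the starting values are illustrative.

```r
library(stats4)

set.seed(42)
x.wei <- rweibull(200, shape = 2.1, scale = 1.1)

# Negative log-likelihood, parametrized by log(shape) and log(scale)
nll <- function(log_shape = 0, log_scale = 0) {
  -sum(dweibull(x.wei, shape = exp(log_shape), scale = exp(log_scale),
                log = TRUE))
}

fit <- mle(nll, start = list(log_shape = 0, log_scale = 0))

# Transform the estimates back to the original scale
est <- exp(coef(fit))
names(est) <- c("shape", "scale")
print(est)
```

The extra work compared with fitdistr() is writing the likelihood by hand, which is also what makes the approach usable for distributions that fitdistr() does not know.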

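4. Check whether the distribution fits
Before running goodness-of-fit tests, first plot the histogram together with the theoretical density.
Then a chi-square test can serve as the goodness-of-fit test. Specifically:
i) For Poisson, binomial, and negative binomial distributions, use the goodfit function in the vcd package.
ii) For a general distribution, bin the data, use the chi-square formula to measure the difference between the observed and expected counts, and then compute the p-value.
iii) For a general distribution, the Kolmogorov-Smirnov test can also be used.

An example of the third case: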

> x.wei <- rweibull(n=200, shape=2.1, scale = 1.1)
> ks.test(x.wei, "pweibull", shape=2, scale= 1)

	One-sample Kolmogorov-Smirnov test

data:  x.wei 
D = 0.1042, p-value = 0.02591
alternative hypothesis: two-sided
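
Case ii), the binned chi-square test, can be sketched as follows for a normal model. The sample, the number of bins and the equal-probability binning are all illustrative choices, and subtracting one degree of freedom per estimated parameter is the usual approximation rather than an exact result when the parameters are estimated from the raw data.

```r
set.seed(1)
x <- rnorm(500, mean = 10, sd = 2)

# Fit the normal model by its sample mean and standard deviation
mu <- mean(x)
s  <- sd(x)

# k bins with equal probability under the fitted model
# (the outer breaks are -Inf and Inf, so every observation falls in a bin)
k <- 10
breaks   <- qnorm(seq(0, 1, length.out = k + 1), mean = mu, sd = s)
observed <- as.vector(table(cut(x, breaks)))
expected <- rep(length(x) / k, k)

# Pearson chi-square statistic and its approximate p-value
stat    <- sum((observed - expected)^2 / expected)
df      <- k - 1 - 2            # two parameters were estimated
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

cat("X-squared =", stat, " df =", df, " p-value =", p_value, "\n")
```

A small p-value would suggest that the fitted normal model does not describe the data well.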

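In particular, we often need to check whether data are normally distributed.
The most common choice is the Shapiro-Wilk test: shapiro.test()
In addition, the R package nortest provides five more normality tests:
i) Shapiro-Francia test: sf.test()
ii) Anderson-Darling test: ad.test()
iii) Cramer-von Mises test: cvm.test()
iv) Lilliefors test: lillie.test(), suited to small samples from a normal distribution with unknown parameters
v) Pearson chi-square test: pearson.test()
These five tests differ in subtle ways, so choose among them with care.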
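
A short sketch of these tests on simulated data: one sample that really is normal and one that is clearly skewed. The samples are made up for illustration, and the nortest call only runs if that package happens to be installed.

```r
set.seed(7)
x_norm <- rnorm(100, mean = 10, sd = 2)   # normal sample
x_skew <- rexp(100, rate = 1)             # strongly right-skewed sample

# Shapiro-Wilk test from base R (package stats)
p_norm <- shapiro.test(x_norm)$p.value
p_skew <- shapiro.test(x_skew)$p.value
cat("normal sample p-value:     ", p_norm, "\n")
cat("exponential sample p-value:", p_skew, "\n")

# Anderson-Darling test from nortest, if available
if (requireNamespace("nortest", quietly = TRUE)) {
  print(nortest::ad.test(x_skew))
}
```

A small p-value rejects normality. A large one only means that the test found no evidence against it.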

References:
[1] http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf