May 2011 – Little Tail

How to debug R code

May 31, 2011 zhanxw

Tricks about how to debug R code

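Many R users complain that R code is hard to debug. Personally, I find R at least a little better than Perl, because R is well documented and its source code is readable. To keep it short: R's interface is simple, with neither a debugger as powerful as Visual Studio's nor debugging commands as flexible as GDB's (see my posts "GDB usage tips" and "GDB usage tips (2)"). I have collected the following five debugging approaches, each suited to a different situation. That said, the best approach is still to write bug-free code in the first place.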

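1. Classic debugging functions
traceback(), debug(), trace(), browser(), recover()
traceback() prints the call stack after an error has aborted execution.
debug() sets a breakpoint on a function: whenever that function is called, R steps through it one statement at a time so we can follow along by hand, although this is less flexible than gdb. undebug() turns it off again.
trace() effectively inserts extra debugging code into a function. For example, trace(sum) prints the call every time sum is called. Another example:
## arrange to call the browser on entering and exiting
## function f
trace("f", quote(browser(skipCalls=4)), exit = quote(browser(skipCalls=4)))
This opens browser() on entering and on leaving f. Note that skipCalls=4 does not skip the first four invocations of f; it tells browser() how many calling frames to skip when it reports the calling context (the "Called from:" line).
browser(): usually passed as an argument (as above) or inserted into the code. When it is called, execution pauses and the user can inspect variables. Type c to continue, n to run the next statement, and Q to quit.
recover(): like browser(), it is called from other code. The difference is that the user can choose which frame (stack depth) to inspect.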

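2. The even more classic debugging functions: print() and cat()
Use print() to show the value of each variable at the point where it is used.
For simple cases cat() is handier because its syntax is simpler, for example cat("x=", x)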

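3. Setting options(error=…)
When an error occurs, we want R to stop executing the remaining code and drop into a debugging mode of our choice.
In an interactive R session, set:
options(error=recover)
With Rscript, i.e. in command-line mode, the following saves the state at the time of the error to a file (last.dump.rda by default), which can be inspected later with load() and debugger():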
options(error = quote({dump.frames(to.file=TRUE); q()}))

When you are done debugging, restore the default with:
options(error = NULL)

Here is an example (source: see the reference list at the end):
The failing scenario:

x <- 1:5
y <- x + rnorm(length(x),0,1)
f <- function(x,y) {
  y <- c(y,1)
  lm(y~x)
}

To debug it, enter:

options(error=recover)

> f(x,y)
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'x')

Enter a frame number, or 0 to exit   

1: f(x, y)
2: lm(y ~ x)
3: eval(mf, parent.frame())
4: eval(expr, envir, enclos)
5: model.frame(formula = y ~ x, drop.unused.levels = TRUE)
6: model.frame.default(formula = y ~ x, drop.unused.levels = TRUE)

Selection: 1
Called from: eval(expr, envir, enclos)
Browse[1]> x
[1] 1 2 3 4 5
Browse[1]> y
[1] 1.6591197 0.5939368 4.3371049 4.4754027 5.9862130 1.0000000

Inspecting x and y reveals the problem: y has six elements but x has only five.

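4. Setting breakpoints with setBreakpoint()
Since R 2.10 there are two more debugging functions: findLineNum() and setBreakpoint().
With breakpoints we can run the code at full speed up to the part that is likely to be wrong. (With debug() alone we would have to step through the R statements by hand; with recover() after an error we would have to work backwards to find the cause.) This speeds up debugging considerably.
Source: see the reference list at the end.

The following example shows how to set a breakpoint at line 3: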

x <- " f <- function(a, b) {
             if (a > b)  {
                 a
             } else {
                 b
             }
         }"


eval(parse(text=x))  # Normally you'd use source() to read a file...

findLineNum("<text>#3")   # <text> is a dummy filename used by parse(text=)

#This will print
#f step 2,3,2 in <environment: R_GlobalEnv>

#and you can use

setBreakpoint("<text>#3")

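5. Debugging inside the *apply functions
Anyone who has used R knows that errors inside loops are hard to track down. Because explicit loops are slow in R, we often use sapply(), lapply() and friends instead of for loops, and when these functions fail they never say which element caused the error. Here are some ways to deal with that.

Use the try() function (source: see the reference list at the end).
An example: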

> x <- as.list(-2:2)
> x[[2]] <- "what?!?"
> ## using sapply
> sapply(x, function(x) 1/x)
Error in 1/x : non-numeric argument to binary operator
# How about using try()?
> sapply(x, function(x) try(1/x))
Error in 1/x : non-numeric argument to binary operator
[1] "-0.5"                                                    
[2] "Error in 1/x : non-numeric argument to binary operator\n"
[3] "Inf"                                                     
[4] "1"                                                       
[5] "0.5"

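Third-party packages can help as well (source: see the reference list at the end):
foreach(.verbose=TRUE): I could not get this one to work, but foreach is still a powerful tool
plyr functions with .inform=TRUE
An example with the plyr package: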

> laply(x, function(x) 1/x, .inform = TRUE)

Error in 1/x : non-numeric argument to binary operator
Error: with piece 2: 
[1] "what?"

As an aside: when you run install.packages() in R, you can only pick the repo (mirror) the first time. What if you want to choose a different mirror later? Run this:
options(repos = c(CRAN = "@CRAN@"))

Finally, the pages I consulted are listed below:
[1] Getting the state of variables after an error occurs in R
[2] What is your favorite R debugging trick?
[3] Debugging lapply/sapply calls
[4] R script line numbers at error?

How to call C/C++ code from Python

May 26, 2011 zhanxw

How to mix C/C++ code in Python

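This post describes a simple, manual way to use C/C++ code from Python, based on the ctypes module. Other ways of mixing Python with C/C++ include SWIG and Boost.Python. The former requires writing an interface file, and the latter requires the large and intricate Boost library. Those two are probably better suited to more complex cases, so only the ctypes approach is covered here.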

Mixing in C/C++ code takes three steps:
1. Wrap the C/C++ functions behind a C interface
2. Compile the C/C++ code and pack it into a shared library
3. Import the shared library in Python

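First, some background. My C++ class GenomeSequence uses templates and memory-mapped files. It is a class for accessing a genome sequence: if the sequence is GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACT… and our object is gs, then gs[0] = 'G', gs[1] = 'A', and so on. The relevant functions are excerpted below: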

class GenomeSequence : public genomeSequenceArray
{
public:
    /// Simple constructor - no implicit file open
    GenomeSequence();
    /// set the reference name that will be used in open()
    /// \param referenceFilename the name of the reference fasta file to open
    /// \return false for success, true otherwise
    ///
    /// \sa open()
    bool setReferenceName(std::string referenceFilename);
    /// return the number of bases represented in this reference
    /// \return count of bases
    genomeIndex_t   getNumberBases() const
    {
        return getElementCount();
    }
    inline char operator[](genomeIndex_t index) const
    {
        uint8_t val;
        if (index < getNumberBases())
        {
            if ((index&1)==0)
            {
                val = ((uint8_t *) data)[index>>1] & 0xf;
            }
            else
            {
                val = (((uint8_t *) data)[index>>1] & 0xf0) >> 4;
            }
        }
        else
        {
            val = baseNIndex;
        }
        val = isColorSpace() ? int2colorSpace[val] : int2base[val];
        return val;
    }
    /* ........... more codes omitted ................ */
};

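These details do not really matter, though; what matters is how to wrap them. We write a file GenomeSequence_wrap.cpp that wraps the four functions above (plus close()) behind a C interface. The source is: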

#include "GenomeSequence.h"
#include <cstdio>
#include <string>

extern "C" {
    GenomeSequence* GenomeSequence_new() { return new GenomeSequence(); }

    // Follows setReferenceName(): returns false on success, true on failure.
    bool GenomeSequence_setReferenceName(GenomeSequence* gs, const char* s) {
        if (!gs || !s) return true;
        std::string str = s;
        //printf("Loading %s ...\n", s);
        bool failed = gs->setReferenceName(str);  // call once, keep the result
        if (!failed) {
            gs->open();
        } else {
            printf("Loading FAIL\n");
        }
        return failed;
    }

    void GenomeSequence_close(GenomeSequence* gs) { if (gs) gs->close(); }

    int GenomeSequence_getNumBase(GenomeSequence* gs) {
        if (!gs) {
            printf("invalid gs\n");
            return -1;
        }
        return (gs->getNumberBases());
    }

    char GenomeSequence_getBase(GenomeSequence* gs, unsigned int i) {
        if (gs) {
            return (*gs)[i];
        }
        return '\0';  // invalid handle: every path must return a value
    }
}

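The second step is compilation. Remember to compile each C/C++ file with -fPIC, and to link everything into a shared library at the end. The relevant fragment of the Makefile is: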

lib:
	g++ -c -fPIC -I./lib GenomeSequence_wrap.cpp
	g++ -shared -Wl,-soname,libstatgen.so -o libstatgen.so  lib/*.o lib/samtools/k*.o lib/samtools/bgzf.o *.o

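The last step is to write a wrapper class in Python. The first lines import ctypes and load the library; after that, the wrapped functions are simply called through the library object.
Note: in GenomeSequence.__getitem__ I use the technique described in the post "How to extend a Python container class", which makes it possible to index the array flexibly.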

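from ctypes import cdll, c_bool, c_char, c_char_p, c_int, c_uint, c_void_p

lib = cdll.LoadLibrary("./libstatgen.so")

# Declare the C signatures. Without them ctypes assumes int everywhere,
# which truncates the object pointer on 64-bit platforms.
lib.GenomeSequence_new.restype = c_void_p
lib.GenomeSequence_setReferenceName.argtypes = [c_void_p, c_char_p]
lib.GenomeSequence_setReferenceName.restype = c_bool
lib.GenomeSequence_close.argtypes = [c_void_p]
lib.GenomeSequence_close.restype = None
lib.GenomeSequence_getNumBase.argtypes = [c_void_p]
lib.GenomeSequence_getNumBase.restype = c_int
lib.GenomeSequence_getBase.argtypes = [c_void_p, c_uint]
lib.GenomeSequence_getBase.restype = c_char

class GenomeSequence:
    def __init__(self):
        self.obj = lib.GenomeSequence_new()
    def open(self, filename):
        # char* parameters take bytes, not str, in Python 3
        lib.GenomeSequence_setReferenceName(self.obj, filename.encode())
    def __len__(self):
        return lib.GenomeSequence_getNumBase(self.obj)
    def __getitem__(self, key):
        if isinstance(key, int):
            return self.at(key)
        elif isinstance(key, slice):
            return ''.join([self[x] for x in range(*key.indices(len(self)))])
        elif isinstance(key, tuple):
            return ''.join([self[i] for i in key])
        raise TypeError("unsupported index type")

    def at(self, i):
        # a c_char result arrives as a one-byte bytes object
        return lib.GenomeSequence_getBase(self.obj, i).decode("ascii")
    def close(self):
        lib.GenomeSequence_close(self.obj)

if __name__ == '__main__':
    gs = GenomeSequence()
    gs.open("/home/zhanxw/statgen/src/karma/test/phiX.fa")
    print(len(gs))
    seq = [gs.at(i) for i in range(60)]
    print(''.join(seq))
    print(gs[0:10], gs[20:30])
    print(gs[0:10, 20:30])
    print(gs[-10:])
    gs.close()
    print("DONE")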
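
One detail is worth spelling out: unless told otherwise, ctypes treats every argument and return value as a C int, so pointer handles and char* strings need explicit declarations. The following is a minimal sketch, not the code from this post. So that it runs without libstatgen.so, it uses strlen, malloc and free from the C standard library, assumed to be reachable through ctypes.util.find_library("c") on a POSIX system.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX system. find_library may return None, in which case
# CDLL(None) exposes the symbols already loaded into the Python process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# void *malloc(size_t size); void free(void *ptr)
libc.malloc.argtypes = [ctypes.c_size_t]
libc.malloc.restype = ctypes.c_void_p   # keep the full pointer width
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None

# char* parameters take bytes objects (not str) in Python 3.
length = libc.strlen(b"GAGTTTTATC")

# An opaque handle travels through Python as a plain integer address
# (a NULL pointer would come back as None).
handle = libc.malloc(16)
handle_is_valid = handle is not None
libc.free(handle)

print(length, handle_is_valid)
```

For the wrapper in this post, the same idea would mean giving GenomeSequence_new a c_void_p return type and declaring the char* parameter of GenomeSequence_setReferenceName as c_char_p.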

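This post mainly follows [1] and essentially repeats its steps. The point of writing out the code here was to further confirm that ctypes can flexibly handle simple data types such as int and char* between C/C++ and Python.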

[1] Calling C/C++ from python?

How to extend a Python container class

May 26, 2011 zhanxw

How to extend Python container class (using some idiom)

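This post assumes there is already an array-like data structure written in C++, where v.getBase(unsigned int i) returns the value of array v at index i. We would like to use Python's flexible slicing, e.g. 1:10, 1:10:2, -10:-5, to specify indices. In Python such flexible subscripts come in three forms:

1. An integer: v[1]
2. A slice object: v[1:10]
3. A tuple object: v[1:10, 20:30]

All three kinds of object are passed to __getitem__(self, key) as the key parameter. Following [1] and [2], I found that the code below handles all of these cases concisely: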

 
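class ContainerClass:
    def __getitem__(self, key):
        if isinstance(key, int):
            # single index: fetch one element from the C++ array v
            return chr(v.getBase(key))
        elif isinstance(key, slice):
            # slice: expand to concrete indices (requires __len__)
            return ''.join([self[x] for x in range(*key.indices(len(self)))])
        elif isinstance(key, tuple):
            # tuple of integers and/or slices: handle each part recursively
            return ''.join([self[i] for i in key])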

Note:
This is only a code fragment. The complete code is in another post: How to call C/C++ code from Python.
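
To try the idiom without the C++ backend, here is a self-contained sketch in which a plain Python string stands in for the C++ array. The class name SequenceView and its data attribute are made up for illustration. It is written for Python 3, so range replaces xrange, and it adds handling for negative integer indices, which the fragment above does not have.

```python
class SequenceView:
    """Toy container: a Python string stands in for the C++ array."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, key):
        if isinstance(key, int):
            # Normalize negative indices, then bounds-check.
            if key < 0:
                key += len(self)
            if not 0 <= key < len(self):
                raise IndexError("index out of range")
            return self.data[key]
        elif isinstance(key, slice):
            # slice.indices() resolves omitted and negative bounds.
            return ''.join(self[i] for i in range(*key.indices(len(self))))
        elif isinstance(key, tuple):
            # v[a:b, c:d] arrives as a tuple of slices (or integers).
            return ''.join(self[k] for k in key)
        raise TypeError("unsupported index type: %r" % type(key))


v = SequenceView("GAGTTTTATCGCTTCCATGA")
print(v[1])           # A
print(v[0:10:2])      # GGTTT
print(v[0:3, 10:13])  # GAGGCT
print(v[-5:])         # CATGA
```

Because the slice and tuple branches call self[...] recursively, only the integer branch ever touches the underlying storage, which is what makes the idiom easy to put on top of a single C accessor function.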

[1] Python Data Model:
http://docs.python.org/reference/datamodel.html
[2] Python in a Nutshell:
http://books.google.com/books?id=JnR9hQA3SncC&pg=PA110&lpg=PA110&dq=python+slice+object+idiom&source=bl&ots=Jb1XIv_71t&sig=-_NHkwycfC8yipkc4Tl_e4sruKc&hl=en&ei=uRXfTcr5Jsro0QHa4sG5Cg&sa=X&oi=book_result&ct=result&resnum=10&ved=0CF0Q6AEwCQ#v=onepage&q=python%20slice%20object%20idiom&f=false

How to test the distribution of univariate data

May 9, 2011 zhanxw

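This post describes how to analyze a univariate random variable with R: how to identify the type of distribution, how to estimate its parameters, and how to use hypothesis tests to check the distribution type.
How to find, fit, and test the distribution of a univariate variable in R
Univariate random variables come up all the time. For the response of a linear model, for example, we usually need to check whether it is normally distributed in order to decide whether to use Y directly, log(Y), or some other transformation.
This post mainly follows [1]. I cover some basic methods, but readers should consult the original for more detail.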

1. Plot the histogram, density and CDF

Histogram: hist(x)
Density plot: plot(density(x))
CDF plot: plot(ecdf(x))

Check whether the data are normally distributed:

z= (x-mean(x))/sd(x)
qqnorm(z)
abline(0,1)

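Other distributions can be checked in a similar way (first draw a sample from the theoretical distribution, then compare the two with qqplot):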

x.wei <- rweibull(200, shape=2.1, scale=1.1)
x.teo <- rweibull(200, shape=2.1, scale=1.0)
qqplot(x.teo, x.wei)
abline(0,1)

http://www.statsoft.com/textbook/distribution-fitting/

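2. Guess the distribution type from the moments
The idea is to standardize the data, compute the first four moments, and compare them with the values for common distributions listed on the page below to guess which distribution it is: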
NIST 1.3.5.11. Measures of Skewness and Kurtosis

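3. Estimate the distribution parameters
Once the distribution type is known, its parameters can be estimated. The common methods are the method of moments and maximum likelihood.
The method of moments is relatively simple and can be done with the mean and var functions, but the resulting estimates may be biased.
For maximum likelihood there are:
1) mle() in the stats4 package
2) fitdistr() in the MASS package
Method 1) is more low-level but works for any distribution; method 2) is easier to use and needs only one command for the gamma, Weibull, normal and similar distributions, for example: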

fitdistr(x.norm,"normal") ## fitting gaussian pdf parameters 
      mean          sd    
  9.9355373   2.0101691 
 (0.1421404) (0.1005085)

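4. Check whether the distribution fits
Before running goodness-of-fit tests, it helps to plot the histogram together with the theoretical density.
After that, a chi-square test can be used as a goodness-of-fit test. Specifically:
i) For the Poisson, binomial and negative binomial distributions, use the goodfit function in the vcd package.
ii) For a general distribution, bin the variable, use the chi-square formula to measure the difference between the observed counts and the theoretical counts, and then compute the p-value.
iii) For a general distribution, the Kolmogorov-Smirnov test can also be used.

Here is an example of the third case: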

> x.wei <- rweibull(n=200, shape=2.1, scale = 1.1)
> ks.test(x.wei, "pweibull", shape=2, scale= 1)

	One-sample Kolmogorov-Smirnov test

data:  x.wei 
D = 0.1042, p-value = 0.02591
alternative hypothesis: two-sided

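In particular, we often need to check whether the data are normally distributed.
The most commonly used test is the Shapiro-Wilk test: shapiro.test()
In addition, the R package nortest provides five more normality tests:
i) Shapiro-Francia test: sf.test()
ii) Anderson-Darling test: ad.test()
iii) Cramer-von Mises test: cvm.test()
iv) Lilliefors test: lillie.test(), suited to small samples from a normal distribution with unknown parameters
v) Pearson chi-square test: pearson.test()
These five tests differ in subtle ways, so choose among them with care.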

References:
[1] http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf