Statistics – Page 2 – Little Tail

Interesting slides from PyCon 2013

March 24, 2013 zhanxw

This year PyCon was held in Santa Clara, California. I did not attend, but I followed it as closely as ever.
Below I share a few interesting slide decks that are relevant to my work.

1. BeautifulCode
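Raymond Hettinger is a master of Python. His code is very concise and distinctly Pythonic.
In these slides he presents many concise idioms that are common in Python, including how to use iterators and list comprehensions.
The parts I found most interesting are defaultdict (no more need for dict.get()), decorators (which modify functions, by analogy with CSS), and context managers (clean acquisition and release of resources).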
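The three idioms mentioned above can be sketched in a few lines. This is a minimal illustration, not code from the slides; the names group_by_first_letter, count_calls, square, and tracked_resource are made up for the example.

```python
from collections import defaultdict
from contextlib import contextmanager
from functools import wraps


# defaultdict: group words by first letter without dict.get() or key checks.
def group_by_first_letter(words):
    groups = defaultdict(list)
    for word in words:
        groups[word[0]].append(word)
    return dict(groups)


# Decorator: wrap a function so that every call is counted.
def count_calls(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def square(x):
    return x * x


# Context manager: the release step runs even if the body raises.
@contextmanager
def tracked_resource(log):
    log.append("acquire")
    try:
        yield log
    finally:
        log.append("release")
```

Used as `with tracked_resource(log): ...`, the log always ends with "release", which is the "clean acquisition and release" point made above.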

Original link

2. Python: A “Toy” Language

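David Beazley is a Python author who has given detailed explanations of the Python GIL and previously wrote slides on combining generators into a workflow.
He now teaches Python in Chicago. He is good at teaching Python and at coming up with entertaining examples, and this talk is no exception: he describes how to use Python with a Shapeoko (including an Arduino) to assemble and run a CNC machine (a computer-controlled mill?). The example shows that Python can be useful well beyond ordinary computer programs.

Original link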

3. Awesome Big Data Algorithms

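The author teaches at MSU. The background of this blog is soil genetics. Because genetic data from soil are more complex and far larger than human DNA alone, big-data algorithms are needed. The slides introduce three algorithms: skip lists, HyperLogLog, and Bloom filters (together with the CountMin sketch).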

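A skip list is a data structure built on linked lists. Compared with a balanced binary tree, its advantage is better support for concurrent operations. In essence, a skip list is a layered linked list. At the bottom level the elements are kept in sorted order. At each higher level an index over part of the lower-level elements is built (chosen at random). This makes lookup very efficient: start at the top level and work down until a short sequential scan at the bottom level. The whole lookup takes O(log N) expected time.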
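The layered structure described above can be sketched as follows. This is a minimal, single-threaded illustration (no deletion, no concurrency, duplicates allowed); the class and parameter names are assumptions made for the example and do not come from the slides.

```python
import random


class _Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # forward[i] is the next node on level i


class SkipList:
    MAX_LEVEL = 16

    def __init__(self, p=0.5, seed=None):
        self.p = p                      # chance of promoting a node one level up
        self.level = 1                  # number of levels currently in use
        self.head = _Node(None, self.MAX_LEVEL)
        self._rng = random.Random(seed)

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < self.p:
            level += 1
        return level

    def insert(self, key):
        # update[i] is the last node on level i whose key is smaller than key
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        level = self._random_level()
        self.level = max(self.level, level)
        new = _Node(key, level)
        for i in range(level):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def __contains__(self, key):
        # Start at the top level and drop down whenever the next key is too large.
        node = self.head
        for i in reversed(range(self.level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

    def __iter__(self):
        node = self.head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]
```

Iterating over the structure walks the bottom level, so the keys come out in sorted order regardless of insertion order.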

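HyperLogLog is an algorithm for big data that estimates the cardinality of a very large set (that is, how many distinct elements the set contains). The rough idea is to hash every input and split each hash value into two parts: the low bits select a bucket, and in the high bits we count the run of consecutive zeros; each bucket remembers the longest run it has seen. For example, for 011000|01 the high bits end in 3 consecutive zeros and the low bits are 01, which selects bucket 1. Because the probability of observing three zeros in a row is about 1/8, bucket 1 suggests that roughly 8 distinct values were hashed. Since the low bits can take the four values 0, 1, 2, 3, the overall cardinality can be estimated by averaging the four bucket estimates (LogLog uses a geometric mean, HyperLogLog a harmonic mean) and scaling by the number of buckets. HyperLogLog also has a few correction constants that make the estimate more accurate.
This blog post explains the HyperLogLog algorithm in detail, with good illustrations.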
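The bucket-and-zero-run idea can be sketched like this. It is a simplified illustration under stated assumptions: SHA-1 truncated to 64 bits stands in for the hash function, the bias constant is the usual one for 128 or more buckets, and only the small-range (linear counting) correction is included. The names hash64 and HyperLogLog are chosen for the example.

```python
import hashlib
import math


def hash64(item):
    """Deterministic 64-bit hash of an item's string form (SHA-1, truncated)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                      # number of low bits used as bucket index (use p >= 7)
        self.m = 1 << p                 # number of buckets
        self.registers = [0] * self.m   # per bucket: longest zero run seen, plus one

    def add(self, item):
        h = hash64(item)
        bucket = h & (self.m - 1)       # low p bits select the bucket
        rest = h >> self.p              # remaining high bits
        width = 64 - self.p
        # rank = number of trailing zeros in the high bits, plus one
        rank = (rest & -rest).bit_length() if rest else width + 1
        if rank > self.registers[bucket]:
            self.registers[bucket] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1.0 + 1.079 / m)   # bias constant, valid for m >= 128
        # harmonic mean of the per-bucket estimates 2**rank, scaled by m
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)   # small-range correction
        return raw
```

Adding the same item twice never changes a register, so duplicates do not affect the estimate; with 1024 buckets the relative error is typically a few percent.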

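The Bloom filter and the CountMin sketch are two different algorithms, but they are closely related. What they share is that both need a family of independent hash functions. They differ in the problem they solve. With a Bloom filter, the preprocessing stage evaluates every hash function on each input and sets a mark at each resulting position. To test later whether a given input has been seen, we only need to check whether all the positions given by its hash values are marked. A Bloom filter can give false positives but never false negatives. It can thus be viewed as a data structure that answers whether an item is present or absent (whether its frequency is at least 1). The CountMin sketch goes one step further and estimates the frequency of an input (not just whether it is at least 1). The idea is to increment a counter at each hashed position of the sketch (where a Bloom filter only records yes or no). To estimate a frequency, take the minimum of the counters found at the item's hashed positions.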
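The relationship between the two structures can be sketched side by side. This is a minimal illustration that assumes salted SHA-1 digests as the family of hash functions; the helper bucket_indices and the default sizes are choices made for the example, not values from the slides.

```python
import hashlib


def bucket_indices(item, k, width):
    """Return k positions in range(width), one per salted hash of the item."""
    out = []
    for salt in range(k):
        digest = hashlib.sha1(("%d:%s" % (salt, item)).encode("utf-8")).digest()
        out.append(int.from_bytes(digest[:8], "big") % width)
    return out


class BloomFilter:
    def __init__(self, width=1024, k=4):
        self.width, self.k = width, k
        self.bits = [False] * width

    def add(self, item):
        for idx in bucket_indices(item, self.k, self.width):
            self.bits[idx] = True          # mark only: yes or no

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[idx] for idx in bucket_indices(item, self.k, self.width))


class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]   # one row per hash function

    def add(self, item, count=1):
        for row, idx in zip(self.rows, bucket_indices(item, self.depth, self.width)):
            row[idx] += count              # count instead of mark

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        indices = bucket_indices(item, self.depth, self.width)
        return min(row[idx] for row, idx in zip(self.rows, indices))
```

The estimate never falls below the true frequency, which mirrors the Bloom filter's guarantee of no false negatives.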

Original link

4. Why you should use Python 3 for text processing

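This talk focuses on Python 3, not the Python 2.7.3 that I currently use. In the Python community, though, good features are sometimes backported from version 3 to version 2.
The talk highlights features available in Python 3, such as ChainMap, startswith (with a tuple as the argument), Unicode support, the textwrap module (convenient text layout), and the email module.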
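A few of the features listed above in action. This is a small sketch that assumes Python 3.3 or later (ChainMap); the dictionaries, the is_web_url helper, and the sample sentence are invented for illustration.

```python
from collections import ChainMap
import textwrap

# ChainMap: look a key up in several dicts in order, without copying them.
defaults = {"color": "red", "user": "guest"}
overrides = {"user": "zhanxw"}
settings = ChainMap(overrides, defaults)


# startswith accepts a tuple of prefixes and returns True if any of them matches.
def is_web_url(text):
    return text.startswith(("http://", "https://"))


# str is Unicode text in Python 3, so len() counts characters, not bytes.
n_chars = len("幻灯")

# textwrap.fill re-flows a paragraph to a maximum line width.
paragraph = "Python 3 treats str as Unicode text, which simplifies text processing."
wrapped = textwrap.fill(paragraph, width=30)
```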

Original link

Two common mistakes when writing statistics software

November 27, 2012 zhanxw

While writing a piece of software recently, I noticed two common mistakes:

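1. Handling missing data

Missing data may be encoded as NA, ".", or -9, so first make sure that the software reads the numbers in correctly.

Then, in the model, should the missing observations be dropped or imputed? If imputed, to the mean or to a random value?

The author of the software must make these choices explicitly.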
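The two decisions above (how missing values are encoded, and whether to drop or impute them) can be made explicit in code. A minimal sketch, assuming the three encodings named in the text plus the empty string; the function names are hypothetical.

```python
# Text codes treated as missing; "" (empty field) is an added assumption.
MISSING_TOKENS = {"NA", ".", "-9", ""}


def parse_values(tokens, missing=MISSING_TOKENS):
    """Convert text fields to floats, mapping missing-value codes to None."""
    return [None if t.strip() in missing else float(t) for t in tokens]


def drop_missing(values):
    """Policy 1: discard missing observations."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Policy 2: replace missing observations with the mean of the observed ones."""
    observed = drop_missing(values)
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```

Keeping the missing-value codes in one place makes it harder for a stray -9 to slip into the model as a real measurement.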

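2. Dimensions in numerical computation

Statistics involves a lot of numerical computation, such as matrix multiplication.

Papers often omit the dimensions involved in such operations, for example the number of rows and columns of a matrix.

The software should therefore add more bounds checks, which keeps the program from crashing.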
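A dimension check of the kind suggested above might look like this. It is a sketch on plain nested lists; real code would more likely check the shapes of its matrix library's objects. The names shape and matmul are chosen for the example.

```python
def shape(matrix):
    """Return (rows, cols), raising ValueError for empty or ragged input."""
    if not matrix or not matrix[0]:
        raise ValueError("matrix must have at least one row and one column")
    cols = len(matrix[0])
    if any(len(row) != cols for row in matrix):
        raise ValueError("matrix rows have unequal lengths")
    return len(matrix), cols


def matmul(a, b):
    """Multiply two matrices after checking that the inner dimensions agree."""
    (n, k), (k2, m) = shape(a), shape(b)
    if k != k2:
        raise ValueError("cannot multiply %dx%d by %dx%d" % (n, k, k2, m))
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]
```

Failing early with a message that names both shapes is usually easier to debug than a crash deep inside the numerical routine.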

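How to make a scientific poster

November 14, 2012 (updated November 21, 2012) zhanxw

Making a good poster calls for Illustrator. Put simply, every non-text element of the poster (formulas, diagrams, tables) should be a vector graphic, and the text itself should be typeset in Illustrator.

If, like me, you have never used Illustrator, watch this YouTube tutorial series, which covers the details of Illustrator in eight lessons aimed specifically at scientific posters.

The following links are all useful: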
UMichigan library link

Illustrator Creating Text

Scientific Poster Design

A good pattern of color

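Compiling 64-bit R with the Intel Compiler Suite and Intel MKL

September 11, 2011 (updated October 5, 2011) zhanxw

Compiling 64-bit R using the Intel compilers (icc/ifort) and the Intel Math Kernel Library (MKL).
With the Intel compilers and Intel MKL we get the fastest R build (slightly faster than the R + GotoBLAS build described in the previous post).

Download and install Intel Parallel Studio, which includes the Intel C compiler (icc), C++ compiler (icpc), and Fortran compiler (ifort):
http://software.intel.com/en-us/articles/intel-parallel-studio-xe/

Download and install the Intel Math Kernel Library:
http://software.intel.com/en-us/articles/intel-mkl/
Both Intel products are free for non-commercial use.

Then download the R source code:
http://cran.cnr.berkeley.edu/

After unpacking R, create a bash script in its directory that specifies how to build it (should R itself be linked statically or built as a shared library? where should it be installed?).
The configure options are at the end of the script below; adjust them to your needs.
Note: in my comparison, a dynamically linked BLAS library was no slower than a statically linked one, and dynamic linking has the advantage that a different BLAS library can easily be swapped in.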

source /home/zhanxw/intel/composer_xe_2011_sp1.6.233/bin/iccvars.sh intel64                                                                                   
source /home/zhanxw/intel/composer_xe_2011_sp1.6.233/bin/ifortvars.sh intel64
source /home/zhanxw/intel/composer_xe_2011_sp1.6.233/mkl/bin/mklvars.sh intel64

export CC=icc
export CFLAGS="-O3 -wd188 -mieee-fp"
export F77=ifort
export FFLAGS="-O3 -mieee-fp"
export CXX=icpc
export CXXFLAGS="-O3"
export FC=ifort
export FCFLAGS="-O3 -mieee-fp"
export ICC_LIBS=/home/zhanxw/intel/composer_xe_2011_sp1.6.233/compiler/lib/intel64
export IFC_LIBS=/home/zhanxw/intel/composer_xe_2011_sp1.6.233/compiler/lib/intel64
export SHLIB_CXXLD=icpc
export SHLIB_CXXLDFLAGS=-shared

MKL_LIB_PATH=/home/zhanxw/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
export LD_LIBRARY_PATH=$MKL_LIB_PATH

OMP_NUM_THREADS=8

export LDFLAGS="-L${MKL_LIB_PATH},-Bdirect,--hash-style=both,-Wl,-O1 -L$ICC_LIBS -L$IFC_LIBS -L/usr/local/lib"

export SHLIB_LDFLAGS="-lpthread"
export MAIN_LDFLAGS="-lpthread"

# Earlier link line using the BLAS95/LAPACK95 wrappers; kept for reference, superseded by the MKL definition below
# MKL="-L${MKL_LIB_PATH} -lmkl_blas95 -lmkl_lapack95  -Wl,--start-group -lmkl_intel -lmkl_intel_thread -lmkl_core -Wl,--end-group -openmp -lpthread"

MKL="-L${MKL_LIB_PATH} -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread"
#static linked library of R                                                                                                                                   
#./configure --with-blas="$MKL"  --with-lapack="$MKL" --prefix=/net/dumbo/home/zhanxw/software/Rmkl                                                           

# dynamic linked library of: R and BLAS                                                                                                                       
#./configure --enable-R-shlib --enable-BLAS-shlib --with-blas="$MKL"  --with-lapack="$MKL" --prefix=/net/dumbo/home/zhanxw/software/Rmkl                      

#dynamic linked library of: BLAS                                                                                                                              
./configure --enable-BLAS-shlib --with-blas="$MKL"  --with-lapack="$MKL" --prefix=/net/dumbo/home/zhanxw/software/Rmkl

Then run make; make install.
Running the same R-benchmark script gives the following results:
Intel Compiler (ICC+Ifort) and Intel MKL

   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.719666666666667 
2400x2400 normal distributed random matrix ^1000____ (sec):  0.394333333333333 
Sorting of 7,000,000 random values__________________ (sec):  0.861 
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.709 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.448 
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.611437229773395 

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.907666666666668 
Eigenvalues of a 640x640 random matrix______________ (sec):  0.613000000000001 
Determinant of a 2500x2500 random matrix____________ (sec):  0.493333333333333 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.334333333333332 
Inverse of a 1600x1600 random matrix________________ (sec):  0.611666666666667 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.569777440099831 

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.82 
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.535999999999999 
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.64933333333334 
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.683666666666667 
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.828000000000003 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.774276714018349 


Total time for all 15 tests_________________________ (sec):  11.609 
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.646126830621363 
                      --- End of test ---

Intel Compiler(ICC+Ifort) + GotoBlas2(Compiled by ICC/Ifort)

   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.715333333333333
2400x2400 normal distributed random matrix ^1000____ (sec):  0.41
Sorting of 7,000,000 random values__________________ (sec):  0.862666666666666
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.829333333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.554666666666667
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.690382674196494

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.922333333333332
Eigenvalues of a 640x640 random matrix______________ (sec):  0.681333333333333
Determinant of a 2500x2500 random matrix____________ (sec):  0.511666666666667
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.433333333333332
Inverse of a 1600x1600 random matrix________________ (sec):  0.594333333333331
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.591732764155743

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.835999999999999
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.545000000000002
Grand common divisors of 400,000 pairs (recursion)__ (sec):  2.66133333333333
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.695666666666665
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.585000000000001
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.698105585240407


Total time for all 15 tests_________________________ (sec):  11.838
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.658231817116501
                      --- End of test ---

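Common mistakes:
When compiling R, we pass --with-blas="$MKL" to tell configure where Intel MKL is (as other articles online do), but if the value of $MKL is wrong, R cannot link against MKL. Check the output of configure, or the file config.log, and make sure the following checks all report yes:
checking for dgemm_ in -L/home/zhanxw/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread... yes
checking whether double complex BLAS can be used... yes
checking whether the BLAS is complete... yes

Note that LP64 and ILP64 are different when linking against the Intel libraries. On my machine, wrongly specifying ILP64 (for example -lmkl_intel_ilp64) prevents R from using MKL, because the test program compiled against ILP64 crashes (in the configure script this program is conftest).

config.log is a very useful file. It records what configure found while probing the system. Reading it alongside configure (which is just a shell script) helps determine whether R can link against the MKL library and, if not, why.

Also, when a shared BLAS library is used, R checks for zgeev_ and does not detect MKL. This is intentional on R's part, because the dynamic MKL library also contains LAPACK. If the resulting speed loss matters to you, use static linking instead.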

Updated (2011-10-05):

Similar idea in the PPT format:

R_BLAS-Sachdeva

Shopping List

August 1, 2011 (updated July 18, 2012) zhanxw

Some books and lenses worth buying, collected here for reference:

Books:

Lenses:

Photography accessories:

Gifts:

R Graphics

July 28, 2011 (updated September 11, 2011) zhanxw

Plot legend outside of the plot area

Primitive method (base graphics):

Method 1: hard-code the layout (https://stat.ethz.ch/pipermail/r-help/2007-May/132466.html)

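# Two stacked panels: panel 2 (top, short) holds the legend, panel 1 (bottom, tall) holds the plot
layout(matrix(c(2,1), byrow = TRUE), heights = c(2,10))
par(mar=c(5,3,0,2))
plot(rnorm(100))
grid(10,10)
# Move on to the legend panel and draw the legend in a unit coordinate system
plot.new()
par(mar=c(0,0,0,0))
plot.window(c(0,1), c(0,1))
lsize = legend(0.5, .5, "text", pch='o')
box("figure")
box(, lty = 2)
box("plot", col = "red")
points(rnorm(100), type="n")
# layout.show(2) : display the layout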

method 2, refer to:

Setting xpd = NA keeps the legend from being clipped at the edge of the plot area. You then have to place the legend manually, which makes plotting less pleasant, but the following example works:

data(iris)
par(xpd = NA, mai=rep(.5,4), mfrow=c(2,2))
for(i in 1:4)
boxplot(iris[,i]~iris[,5], main=colnames(iris)[i], col=2:4, xaxt="n")

width <- 6 # width of the plot device in inches (it should be 7 inches by
# default but Windows does not implement this correctly)
leg <- legend(0, 0, legend=levels(iris$Species), fill=2:4, hor=TRUE, plot=FALSE)
xcoord <- grconvertX(width/2, "inches", "user")-leg$rect$w/2
ycoord <- grconvertY(0, "inches", "user")+leg$rect$h

legend(xcoord, ycoord, legend=levels(iris$Species), fill=2:4, bty="n", hor=TRUE)

package ggplot2:

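Example:

# Updated for current ggplot2: opts()/theme_rect()/theme_blank() were replaced by
# theme()/element_rect()/element_blank(), and scale_*_manual takes values=.
library(ggplot2)
dat <- data.frame(x = rnorm(1000))  # dat is not defined in the original post; hypothetical sample
ggplot(dat, aes(x = x)) +
  geom_histogram(aes(y = after_stat(density), fill = "Histogram"), binwidth = 0.5) +
  stat_function(fun = dnorm, aes(colour = "Density")) +
  scale_x_continuous("x", limits = c(-4, 4)) +
  scale_y_continuous("Frequency") +
  scale_fill_manual(name = "", values = "blue") +
  scale_colour_manual(name = "", values = "red") +
  ggtitle("Histogram with Overlay") +
  theme(legend.key = element_rect(fill = "white", colour = "white"),
        legend.background = element_blank())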

package Lattice:

Specify space = "bottom" or space = "right" in the auto.key or key parameter

Example:

> xyplot( lat ~ long , data = quakes, key = list( points = list( col=c("orange","white","blue")), text = list( c("a","b","c") ), space="bottom"))
> xyplot( lat ~ long , data = quakes, key = list( points = list( col=c("orange","white","blue")), text = list( c("a","b","c") ), space="right"))
> barchart(yield ~ variety | site, data = barley,
groups = year, layout = c(1,6), stack = TRUE,
auto.key = list(space = "right"),
ylab = "Barley Yield (bushels/acre)",
scales = list(x = list(rot = 45)))

https://stat.ethz.ch/pipermail/r-help/2005-July/075953.html


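How to debug R code

May 31, 2011 zhanxw

Tricks for debugging R code

Many R users complain that R code is hard to debug. Personally I find R at least somewhat better than Perl, since R is well documented and its source code is readable. In short, R's interface is simple: it has neither a debugger as powerful as Visual Studio's nor debugging commands as flexible as GDB's (see my earlier posts on GDB tips, parts one and two). I have collected the following five debugging methods for different situations. That said, the best approach is still to write bug-free code in the first place.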

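1. Traditional debugging functions
traceback(), debug(), trace(), browser(), recover()
traceback() prints the call stack after an error has aborted execution.
debug() sets a breakpoint on a function: whenever that function is called it runs in single-step mode, so we can follow it by hand, although less flexibly than in gdb.
trace() in effect inserts extra debugging code into a function. For example, trace(sum) reports every call to sum together with its arguments. As another example,
## arrange to call the browser on entering and exiting
## function f
trace("f", quote(browser(skipCalls=4)), exit = quote(browser(skipCalls=4)))
invokes browser() on entering and on exiting f. skipCalls=4 tells browser() how many calls to skip when it reports the calling context, so the frames added by trace() itself are not shown.
browser(): usually passed as an argument (as in the trace() call above). When it is invoked, the user can inspect variables. Enter c to continue, n to run the next statement, and Q to quit.
recover(): like browser(), it is invoked by other code. The difference is that the user can choose which frame (stack depth) to inspect.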

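2. Even more traditional debugging functions: print() and cat()
Use print() to print the value of each variable at the point of the call.
In simpler cases use cat(), whose syntax is simpler, for example cat("x=", x)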

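3. Setting options(error=…)
We want R to stop executing subsequent code when an error occurs and to enter a debugging mode of our choice.
In an interactive R session, set:
options(error=recover)
With Rscript, i.e. on the command line, the following saves the error information to a file:
options(error = quote({dump.frames(to.file=TRUE); q()}))

After debugging, restore the default with:
options(error = NULL)

Here is an example (source):
The failing code: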

x <- 1:5
y <- x + rnorm(length(x),0,1)
f <- function(x,y) {
  y <- c(y,1)
  lm(y~x)
}

To debug it, enter:

options(error=recover)

> f(x,y)
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'x')

Enter a frame number, or 0 to exit   

1: f(x, y)
2: lm(y ~ x)
3: eval(mf, parent.frame())
4: eval(expr, envir, enclos)
5: model.frame(formula = y ~ x, drop.unused.levels = TRUE)
6: model.frame.default(formula = y ~ x, drop.unused.levels = TRUE)

Selection: 1
Called from: eval(expr, envir, enclos)
Browse[1]> x
[1] 1 2 3 4 5
Browse[1]> y
[1] 1.6591197 0.5939368 4.3371049 4.4754027 5.9862130 1.0000000

Inspecting the values of x and y reveals the problem: y has six elements while x has five.

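4. Setting breakpoints with setBreakpoint()
Since R 2.10 there are two more debugging functions: findLineNum() and setBreakpoint().
With breakpoints we can run the code at full speed up to the part that may be faulty. (With debug() alone we would have to single-step through the R statements by hand, and with recover() after an error we would have to reason backwards to find the cause.) This speeds up debugging considerably.
Source

This example shows how to set a breakpoint at line 3: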

x <- " f <- function(a, b) {
             if (a > b)  {
                 a
             } else {
                 b
             }
         }"


eval(parse(text=x))  # Normally you'd use source() to read a file...

findLineNum("<text>#3")   # <text> is a dummy filename used by parse(text=)

#This will print
#f step 2,3,2 in <environment: R_GlobalEnv>

#and you can use

setBreakpoint("<text>#3")

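5. Debugging inside *apply functions:
Anyone who has used R knows that errors inside loops are hard to track down. Because loops in R are slow, we often replace for loops with sapply(), lapply(), and so on, and when these functions fail they never say which element caused the error. The following approaches help:

Use the try() function (source).
For example: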

> x <- as.list(-2:2)
> x[[2]] <- "what?!?"
> ## using sapply
> sapply(x, function(x) 1/x)
Error in 1/x : non-numeric argument to binary operator
# What happens if we use try()?
> sapply(x, function(x) try(1/x))
Error in 1/x : non-numeric argument to binary operator
[1] "-0.5"                                                    
[2] "Error in 1/x : non-numeric argument to binary operator\n"
[3] "Inf"                                                     
[4] "1"                                                       
[5] "0.5"

Third-party packages also work (source):
foreach(.verbose = TRUE): I could not get this one to work, but foreach is still a powerful tool
plyr(.inform = TRUE)
An example with the plyr package:

> laply(x, function(x) 1/x, .inform = TRUE)

Error in 1/x : non-numeric argument to binary operator
Error: with piece 2: 
[1] "what?!?"

As an aside, when you run install.packages() in R, you are asked to choose a repo (mirror) only the first time. To choose a different mirror later, run:
options(repos = c(CRAN = "@CRAN@"))

Finally, the web pages I consulted:
【1】Getting the state of variables after an error occurs in R
【2】What is your favorite R debugging trick?
【3】Debugging lapply/sapply calls
【4】R script line numbers at error?

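How to check the distribution of univariate data

May 9, 2011 zhanxw

This post shows how to analyze a univariate random variable in R: how to identify the type of its distribution, how to estimate the distribution parameters, and how to use hypothesis tests to check the distribution type.
How to find, fit, and test the distribution of a univariate variable in R?
We often meet univariate random variables, such as the response in a linear model, and we usually need to check whether the variable is normally distributed in order to decide whether the model should use Y directly, log(Y), or some other transformation.
This post mainly follows 【1】. I cover some basic methods but recommend the original for more details.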

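1. Plot the density and the CDF

Histogram: hist(x)
Density plot: plot(density(x))
CDF plot: plot(ecdf(x))

Check for normality:

z= (x-mean(x))/sd(x)
qqnorm(z)
abline(0,1)

Other distributions can be checked in a similar way (first draw a sample from the theoretical distribution, then compare the two samples with qqplot):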

x.wei <- rweibull(200, shape=2.1, scale=1.1)
x.teo <- rweibull(200, shape=2.1, scale=1.0)
qqplot(x.teo, x.wei)
abline(0,1)

http://www.statsoft.com/textbook/distribution-fitting/

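2. Guessing the distribution type from moments
Standardize the data, compute the first four moments, and compare them with the common distributions listed on the following page to guess which distribution it is: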
NIST 1.3.5.11. Measures of Skewness and Kurtosis
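The moment computation in this step can be sketched as follows. It uses the plain population formulas (dividing by n), which is an assumption; some references apply small-sample corrections. The function name standardized_moments is made up for the example.

```python
def standardized_moments(xs):
    """Return mean, variance, skewness, and kurtosis using population (1/n) formulas."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # second central moment
    if m2 == 0:
        raise ValueError("all values are equal; skewness and kurtosis are undefined")
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,   # 0 for symmetric data
        "kurtosis": m4 / m2 ** 2,     # 3 for a normal distribution
    }
```

A skewness near 0 with a kurtosis near 3 is consistent with a normal distribution; a uniform distribution has kurtosis 1.8 and an exponential distribution has skewness 2.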

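3. Estimating distribution parameters
Once the distribution type is known we can estimate its parameters. The common approaches are the method of moments and maximum likelihood.
The method of moments is relatively simple and can be computed with the mean and var functions, but the estimates may be biased.
For maximum likelihood there are
1) mle() in the stats4 package
2) fitdistr() in the MASS package
Method 1) is more low-level but works for any distribution. Method 2) is simple to use and needs only one command for the Gamma, Weibull, Normal, and similar distributions, for example:

fitdistr(x.norm,"normal") ## fitting gaussian pdf parameters
      mean          sd
   9.9355373    2.0101691
  (0.1421404)  (0.1005085)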

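4. Checking whether the distribution fits
Before running goodness-of-fit tests, first plot the histogram together with the theoretical density.
Then a chi-square test can serve as the goodness-of-fit test. Specifically:
i) For the Poisson, binomial, and negative binomial distributions, use the goodfit function in the vcd package.
ii) For a general distribution, bin the variable, use the chi-square formula to compare the observed counts with the theoretical counts, and compute the p-value.
iii) For a general distribution, the Kolmogorov-Smirnov test can also be used.

An example of the third case: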

> x.wei <- rweibull(n=200, shape=2.1, scale = 1.1)
> ks.test(x.wei, "pweibull", shape=2, scale= 1)

	One-sample Kolmogorov-Smirnov test

data:  x.wei 
D = 0.1042, p-value = 0.02591
alternative hypothesis: two-sided
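The D value reported by ks.test above is the largest vertical distance between the empirical CDF of the sample and the hypothesized CDF. A minimal sketch of that computation follows; it computes only the statistic, not the p-value, and the function names are chosen for the example.

```python
import math


def weibull_cdf(x, shape, scale):
    """CDF of the Weibull distribution with the given shape and scale."""
    return 1.0 - math.exp(-((x / scale) ** shape)) if x > 0 else 0.0


def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic D = sup |ECDF(x) - cdf(x)|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The ECDF jumps from (i-1)/n to i/n at x; check the gap on both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d
```

For example, `ks_statistic(sample, lambda x: weibull_cdf(x, 2.0, 1.0))` measures how far a sample is from the Weibull distribution with shape 2 and scale 1, the null hypothesis used in the R call above.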

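In particular, we often need to check whether data are normally distributed.
The most common test is the Shapiro-Wilk test: shapiro.test()
In addition, the R package nortest provides five more normality tests:
i) Shapiro-Francia test: sf.test()
ii) Anderson-Darling test: ad.test()
iii) Cramer-von Mises test: cvm.test()
iv) Lilliefors test: lillie.test(), suitable for small samples from a normal distribution with unknown parameters
v) Pearson chi-square test: pearson.test()
These five tests differ in subtle ways, so choose among them with care.

References: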
【1】 http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf

Exact Logistic Regression

February 21, 2011 (updated February 22, 2011) zhanxw

I briefly explain why one would be interested in implementing exact logistic regression, then provide C++ and R code.

1. Why exact test?

We want a precise measure of how likely or unlikely the observed data are. In the classic example of a 2×2 table without covariates, especially when the table has cells with very few (<5) occurrences, large-sample theory cannot give an accurate approximation, so Fisher's exact test is often applied.
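To make the 2×2 case concrete, here is a minimal sketch of a two-sided Fisher exact test. It assumes Python 3.8 or later (for math.comb) and the common convention of summing the probabilities of all tables that are no more likely than the observed one; it is not the C++ implementation offered in this post.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k,
        # with all row and column totals held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables at least as extreme (no more probable) than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))
```

For the table [[3, 1], [1, 3]] the possible top-left counts 0 to 4 have probabilities 1/70, 16/70, 36/70, 16/70, 1/70, so the two-sided p-value is 34/70, far from significant despite the apparent association.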

2. Why exact logistic regression?

Fisher's exact test cannot be applied to a logistic model. For example, when the model contains covariates, we want BOTH to estimate the effect size and to obtain its exact p-value. In this case only exact logistic regression provides a solution. The theoretical background is given in reference 【1】.

My implementation:

Download: exactLogisticRegression.tar

Pros:
1. I verified the results against SAS.
2. The speed is comparable to, or faster than, SAS.

Cons:
1. Only one parameter of interest, conditioning on all other parameters, is supported for now.
2. I have not implemented the confidence limits, as that part is a bit more tedious.

R binding: mypackage_1.0.tar

See mypackage/R/rcpp_hello_world.R, where I wrote an R function that wraps the C++ function.

The R binding relies on the Rcpp package, which greatly reduces the work of exchanging data (matrices, lists, vectors) between C++ and R. A quick tutorial can be found on the Rcpp homepage (http://dirk.eddelbuettel.com/code/rcpp.html). For experienced Rcpp users, the quick-reference document (http://dirk.eddelbuettel.com/code/rcpp/Rcpp-quickref.pdf) is helpful.

【1】 Exact Logistic Regression, Robert E. Derr, SAS Institute Inc., Cary, NC http://support.sas.com/rnd/app/da/new/daexactlogistic.html

【2】Rcpp: Seamless R and C++ Integration, dirk.eddelbuettel.com/code/rcpp.html

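Laplacian Eigenmap: R code and results

February 8, 2011 (updated February 10, 2011) zhanxw

Using matrix decompositions to analyze graphs is a fascinating topic. When the technique can be used to improve on PCA, or for dimensionality reduction, there is reason to expect interesting results. This post mainly follows reference 【1】 and Pluskid's blog 【2】. Reference 【1】 gives the derivation (the objective function is a quadratic form, and so is the constraint) and a detailed algorithm whose key step is a generalized eigenvector problem (Wikipedia has a very short introduction). In theory it can be solved with the methods in Chapter 8 of Golub's Matrix Computations (which I have not read, and I have forgotten most of my numerical methods course), but I did not use them. Reference 【2】 is fluent and concise, a good introduction.

In brief, after building the adjacency matrix with the k-nearest-neighbor or [latex] \epsilon [/latex] method, and building the weight matrix with either the simple-minded (0 or 1) weights or the heat kernel, the problem is to solve: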

$$ L f = \lambda D f \quad \text{s.t.} \quad f^T D f = 1 \quad \text{and} \quad f^T D \mathbf{1} = 0 $$

We already know that D is a diagonal matrix, so it suffices to solve $$ D^{-1} L f = \lambda f $$.

To satisfy the constraints, we only need to apply the following transformations to each [latex]f[/latex]:

$$ f^T D f = 1 \Rightarrow f = f / \sqrt{\sum_i f_i^2 d_i} $$

$$ f^T D \mathbf{1} = 0 \Rightarrow f = f - \frac{ \sum_i f_i d_i } { \sum_i d_i} $$
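The two transformations can be sketched in code. This is an illustration on plain lists, where d holds the diagonal of D; it is not taken from LaplacianEigenmap.R. One detail is an assumption on my part: the shift is applied before the scaling, so that both constraints hold at the same time. The function name apply_constraints is made up for the example.

```python
import math


def apply_constraints(f, d):
    """Shift and scale f so that f^T D 1 = 0 and f^T D f = 1, with D = diag(d)."""
    total = sum(d)
    # Weighted centering: subtract sum_i(f_i d_i) / sum_i(d_i) from every entry.
    shift = sum(fi * di for fi, di in zip(f, d)) / total
    centered = [fi - shift for fi in f]
    # Weighted normalization: divide by sqrt(sum_i(f_i^2 d_i)).
    norm = math.sqrt(sum(fi * fi * di for fi, di in zip(centered, d)))
    if norm == 0:
        raise ValueError("f is constant, so it cannot satisfy both constraints")
    return [fi / norm for fi in centered]
```

Scaling a centered vector keeps it centered, which is why this order satisfies both constraints, whereas shifting after scaling would generally break the unit-norm condition.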

R code

LaplacianEigenmap.R Link

LaplacianEigenmapTest.R Link

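Results:

Taking the eigenvectors that correspond to the two smallest nonzero eigenvalues [latex] \lambda_1, \lambda_2 [/latex] from [latex] 0 = \lambda_0 \le \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n [/latex], and following the paper, I get the results below:

Note that the pattern here does not match the paper, but my checks (that these are indeed eigenvalues and that the constraints hold) indicate that my computation should be correct.

Also, in the figure on Pluskid's blog the Laplacian Eigenmap result is a ribbon. I suspect it may have used the eigenvectors corresponding to [latex] \lambda_0, \lambda_1 [/latex].

References: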

【1】Mikhail Belkin and Partha Niyogi, “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation,” Neural Computation 15, no. 6 (June 2003): 1373-1396.

【2】漫谈 Clustering (番外篇): Dimensionality Reduction http://blog.pluskid.org/?p=290