BENCHMARK

In the webpage, we provide information about benchmarks between SEQMINER, VariantAnnotation and GEMINI. Our benchmark was performed on Intel Xeon 5530 with 256 Gigabyte memory and enough hard drive space. This document provides necessary instructions in order to help the readers replicate our results.

We provide data and source codes. In general, we describe commonly used data sets (e.g. reference genome, gene definition files et al) in Data Set, and we describe benchmark specific data sets and source codes in Benchmark Directory. After visiting Benchmark Directory, you will see a directory lists: data (benchmark specific data sets), integration (source codes where we benchmarked data integration), extraction (source codes where we benchmarked information retrieval). We will introduce the details of each directories in subsequent sections.

Data sets

We provide all data sets or provide instructions on how to obtain them in this website. The data are categorized into two categories: (1) Common data sets (2) Benchmark specific data sets.

Common data sets

To download the following files or other commonly used data sets (e.g. reference genome, gene definition files et al), please refer to this description webpage here. The files listed below are frequently used during benchmarking.

Chromosome 1 genotype file from the 1000 Genome Project

Tabix index

dbNSFP database. It include Polyphen scores and SIFT scores.

Human reference genome build 37.

Index file for the reference genome.

Gene definition file.

Benchmark specific data sets

The following files can be downloaded directly.

100 random selected genomic regions on chromosome 1

File link: http://www.zhanxw.com/seqminer/benchmark/data/rand.chr1.range.100

100 random selected genes on chromosome 1

File link: http://www.zhanxw.com/seqminer/benchmark/data/rand.chr1.gene.100

Data integration

The directory integration includes source codes for data integration. We provide relevant codes for SEQMINER, VariantAnnotation (va) and GEMINI.

In seqminer folder, we provide annotate.R source file. It demonstration how to use SEQMINER to annotate genetic variants. The command line is:

Rscript annotation.R

In va folder, we provide the source code annotate.R . It uses VariantAnnotation to annotate variants in the chromosome 1 genotype file of VCF format. Ideally, VariantAnnotation performs best when it can annotate all variants in one batch. However, due to our memory limit (256G memory), we let VariantAnnotation annotate 20,000 variants per batch. The following command line was from the provided cmd.sh file:

Rscript annotate.R ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz anno.chr1.chunk20000.vcf 20000

NOTE: the last parameter 20000 specifies 20,000 variants per batch. In our computational environment, we have also tried 5,000 or 10,000 variant per batch and found the performance of VariantAnnotation degraded. We also noted that specifying a larger number than 20,000 will break VariantAnnotation due to memory limitation.

In gemini folder, we built GEMINI database projects and import annotation information. The annotation was performed by SnpEff with success and VEP without success. After annotation, we used the following command to import compressed VCF file (see import.sh):

gemini load -v snpEff.annotated.wgs.vcf.gz -t snpEff --cores 8 snpEff.annotated.wgs.db

Information retrieval

Benchmarking of the information retrieval step was provided in the extract folder.

We benchmarked each software (SEQMINER and VariantAnnotation) in two settings:

(1) extract nonsynonymous variants from 100 regions;

(2) extract nonsynonymous variants from 100 genes.

These two settings correspond to two directories, “chr1.nonsyn.gene” and “chr1.nonsyn.range”.

Similar to the Data Integration section, we provide source codes under each directory.

In seqminer folder, to replicate our benchmark results in the Table 1 in our SEQMINER manuscript in the first setting (extracting non synonymous variants from 100 regions), please download the script at:

http://www.zhanxw.com/seqminer/benchmark/extraction/seqminer/chr1.nonsyn.range/script.oneByone.R

to your local seqminer/chr1.nonsyn.range directory.

Make sure the vcfFile variable in the source codes exists, and then run the following command to extract 1000 regions:

Rscript script.oneByone.R

In va folder, its directory structure mimic the seqminer settings. To extract 100 randomly selected region, you will change current directory to va/chr1.nonsyn.range, and execute Rscript script.oneByone.R.

Contact

Please contact Xiaowei Zhan zhanxw@gmail.com or Dajiang Liu dajiang.liu@outlook.com for comments or suggestions.

Last update: November 13, 2014