Data Set

The datasets used in our benchmarks are listed below. They are all publicly available resources, so you are welcome to try them out.

Chromosome 1 VCF file

ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz

Chromosome 1 genotype VCF file from the 1000 Genomes Project.

File link: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz

Tabix index link: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi

Whole genome VCF file

  1. ALL.wgs.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz

    This is the whole-genome VCF file from the 1000 Genomes Project. The file size is 142G. To obtain this file, download the per-chromosome VCF files from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/, combine them, and use tabix to create the index.

  2. ALL.wgs.phase1_release_v3.20101123.snps_indels_svs.genotypes.bcf.gz

    This is the whole-genome file from the 1000 Genomes Project in BCF format. The file size is 131G. To obtain this file, download the per-chromosome VCF files from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/, combine them into a BCF file, and use bcftools to create the index.
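
The steps for assembling the two whole-genome files can be sketched as follows. This is a minimal illustration, not the exact procedure used for the benchmarks: it only builds the download URLs and the command lines, and runs nothing. The per-chromosome file names are assumed to follow the chromosome 1 naming pattern, the chromosome list is assumed to cover autosomes 1 to 22 only, and `bcftools concat` is an assumed choice for the "combine" step, which the text above does not specify. Please check the FTP directory listing before relying on these names.

```python
"""Sketch: build the commands for assembling the whole-genome VCF and BCF files."""

BASE_URL = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/"

# Assumption: every per-chromosome file follows the chromosome 1 naming pattern.
NAME_PATTERN = (
    "ALL.chr{chrom}.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
)

# Output names taken from the list above.
WGS_VCF = "ALL.wgs.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
WGS_BCF = "ALL.wgs.phase1_release_v3.20101123.snps_indels_svs.genotypes.bcf.gz"

# Assumption: autosomes only; extend this list after checking the FTP listing.
CHROMOSOMES = [str(i) for i in range(1, 23)]


def chromosome_file(chrom):
    """Return the assumed local file name for one chromosome."""
    return NAME_PATTERN.format(chrom=chrom)


def chromosome_url(chrom):
    """Return the assumed download URL for one chromosome."""
    return BASE_URL + chromosome_file(chrom)


def build_commands(chromosomes=CHROMOSOMES):
    """Return the command lines (as argument lists) in the order to run them."""
    inputs = [chromosome_file(c) for c in chromosomes]
    return [
        # Combine the per-chromosome files into one bgzip-compressed VCF.
        ["bcftools", "concat", "-O", "z", "-o", WGS_VCF] + inputs,
        # Create the tabix index for the combined VCF.
        ["tabix", "-p", "vcf", WGS_VCF],
        # Combine the per-chromosome files into one compressed BCF.
        ["bcftools", "concat", "-O", "b", "-o", WGS_BCF] + inputs,
        # Create the bcftools index for the combined BCF.
        ["bcftools", "index", WGS_BCF],
    ]


if __name__ == "__main__":
    for chrom in CHROMOSOMES:
        print(chromosome_url(chrom))
    for command in build_commands():
        print(" ".join(command))
```

Each returned command is a plain argument list, so it can be reviewed first and then passed to `subprocess.run` once the input files have been downloaded. Note that the combined files are large (142G and 131G as stated above), so sufficient disk space is needed.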

dbNSFP data set

hg19_ljb_all.txt.gz

This file is from the dbNSFP database. It includes PolyPhen scores and SIFT scores.

File link: http://qbrc.swmed.edu/zhanxw/seqminer/data/hg19_ljb_all.txt.gz

Human reference genome

Human reference genome build 37 in the FASTA format.

File link: http://qbrc.swmed.edu/zhanxw/seqminer/data/human.g1k.v37.fa

Index file link: http://qbrc.swmed.edu/zhanxw/seqminer/data/human.g1k.v37.fa.fai

Human reference genome build 37 with decoy sequences in the FASTA format (Detail1, Detail2).

File link: http://qbrc.swmed.edu/zhanxw/seqminer/data/hs37d5.fa

Index file link: http://qbrc.swmed.edu/zhanxw/seqminer/data/hs37d5.fa.fai

UCSC Known Genes

knownGene.txt.gz

UCSC gene definition file in the knownGene format (Details) for NCBI genome build 37.

File link: http://qbrc.swmed.edu/zhanxw/seqminer/data/knownGene.txt.gz

UCSC RefFlat Genes

refFlat_hg19.txt.gz

UCSC gene definition file in the refFlat format (Details).

File link: http://qbrc.swmed.edu/zhanxw/seqminer/data/refFlat_hg19.txt.gz

Gencode Genes

refFlat.gencode.v19.gz

Gencode gene definition version 19 in the refFlat format (Details). We also have previous versions of the gene files and can provide them upon request.

File link: http://qbrc.swmed.edu/zhanxw/seqminer/data/refFlat.gencode.v19.gz

Contact

Please contact Xiaowei Zhan (zhanxw@gmail.com) or Dajiang Liu (dajiang.liu@outlook.com) with comments or suggestions.

Last update: November 13, 2014