SeqMiner Quick Guide

To receive email notification when manual updates, please sign up

SeqMiner (formerly, vcf2geno) is an R package that helps you efficiently extracting sequencing data in the VCF format files.

SeqMiner requires VCF files indexed by Tabix.

NOTE: To use gene based extraction, VCF need to be annotated by SeqAnno

We have provided an examplar VCF file in this package. You can use the following command to find its location on your computer:

library(vcf2geno)
fileName = system.file("vcf/all.anno.filtered.extract.vcf.gz", package = "vcf2geno")
fileName

## [1] "/Library/Frameworks/R.framework/Versions/2.15/Resources/library/vcf2geno/vcf/all.anno.filtered.extract.vcf.gz"

This manual demonstrates a quick guide on how to use SeqMiner.

Extract gentoype matrix from VCF file

Extract genotype matrix from 1:196621007-196716634, and only focused on Nonsynonymous variants:

cfh <- readVCFToMatrixByRange(fileName, "1:196621007-196716634", "Nonsynonymous")
cfh

##             NA12286 NA12341 NA12342
## 1:196642233       1       2       0
## 1:196659237       0       2       1

In this example, we use the fileName for the VCF file path, 1:196621007-196716634 for the genomic range, and Nonsynonymous for the variants type. Extraction results are stored in the variable cfh. Its row names are variant positions, while column names are sample IDs.

Similarly, we can extract genotype matrix per gene:

geneFile = system.file("vcf/refFlat_hg19.txt.gz", package = "vcf2geno")
geneFile

## [1] "/Library/Frameworks/R.framework/Versions/2.15/Resources/library/vcf2geno/vcf/refFlat_hg19.txt.gz"

cfh <- readVCFToMatrixByGene(fileName, geneFile, "CFH", "Synonymous")
cfh

##             NA12286 NA12341 NA12342
## 1:196654324       1       2       1
## 1:196682947       2       2       1
## 1:196654324       1       2       1

Note here we require a gene file in refFlat format. We have prepared several gene definition files in refFlat format using NCBI human build 37. From this website, you can obtain these resources files.

The supported annotation variant type are:

# large strucutre variant
StructuralVariation

# indels
Stop_Gain
Stop_Loss
Start_Gain
Start_Loss
Frameshift        /* Indel length is not divisible by 3 */                                          
CodonGain         /* Insertion length is divisible by 3 */                                          
CodonLoss         /* Deletion length is divisible by 3 */                                           
CodonRegion       /* Just say the variant is in the Coding Region used in Structrual Varition*/
Insertion
Deletion

# single variants
Nonsynonymous
Synonymous
Essential_Splice_Site
Normal_Splice_Site
Utr5
Utr3
Exon
Intron
Upstream
Downstream
SNV           
Noncoding
Monomorphic

Extract arbitrary fields from VCF file

For advanced data extraction from VCF file, there are two functions: readVCFToListByRange and readVCFToListByGene. Their usage are similar to the previous two functions: readVCFToMatrixByRange and readVCFToMatrixByGene but the results are organized in list.

result <- readVCFToListByGene(fileName, geneFile, "CFH", "Synonymous", c("CHROM", 
    "ID"), c("AC", "AN"), "GT")

## extractStringArray: [1] AC
## extractStringArray: [2] AN
## extractStringArray: [1] GT
## Dump 9 elements:
## s[0] = "0/1"
## s[1] = "1/1"
## s[2] = "0/1"
## s[3] = "1/1"
## s[4] = "1/1"
## s[5] = "0/1"
## s[6] = "0/1"
## s[7] = "1/1"
## s[8] = "0/1"

result

## $CHROM
## [1] "1" "1" "1"
## 
## $ID
## [1] "." "." "."
## 
## $AC
## [1] "4" "5" "4"
## 
## $AN
## [1] "6" "6" "6"
## 
## $GT
##      [,1]  [,2]  [,3] 
## [1,] "0/1" "1/1" "0/1"
## [2,] "1/1" "1/1" "1/1"
## [3,] "0/1" "0/1" "0/1"
## 
## $sampleId
## [1] "NA12286" "NA12341" "NA12342"

There are 7 parameters in readVCFToListByGene: 1. VCF file path 2. Gene definition file (refFlat foramt) 3. Character array of gene names (can be one gene or more than one gene names) 4. Annotation type 5. VCF field to extract (explain later) 6. VCF tags in the INFO field to extract (explain later) 7. VCF tags in the individual field to extract (explain later)

The last three parameter are closely related to VCF format. In the VCF data lines, there are 8 fixed fields per record: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO. These field, as well as the FORMAT field can be specifed in the 5th parameter. For example, specify c(“CHROM”, “POS”) will extract the genomic position.

In the VCF INFO field, there may be information including variant depth (DP), variant allele counts (AC), total allele counts (AN). To extract those information, use the 6th paramter.

In the VCF genotype fileds, which are from the 10th column to the last column per record, it contains the individual level information. Mostly commonly used information are genotype (GT), genotype depth (GD) and genotype quality (GQ). To extract these, specify c(“GT”, “GD”, “GQ”) in the 7th parameter.

The extraction results are stored in result in our example. All values are characters for simplicity. The extraction results is a list, and it contains CHROM, ID, AC, AN, GT and sampleId. CHROM and ID are speicified in the 5th parameter; AC and AN are speicified by the 6th parameter; GT are specified in the last parameter. We also list sample IDs in the sampleId part.

Applying rare-variant analysis

SeqMiner can greatly reduce your effort to process huge sequencing files in VCF format. That indicates your valuable time can be spared on more important things (e.g. investigating causal variants).

Here we demonstrate how to apply SKAT method to study CFH gene. The genotype file we used is extracted from 1000 Genomes Project. It contains 85 individuals in CEU population and the variants are all located in the CFH gene region. We will use simulated phenotype in this demonstration.

# Extract genotype matrix
vcf.1000g <- system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "vcf2geno")
geneFile = system.file("vcf/refFlat_hg19.txt.gz", package = "vcf2geno")
geneName <- "CFH"
cfh <- readVCFToMatrixByGene(vcf.1000g, geneFile, geneName, "Nonsynonymous")
# now cfh is a marker by people matrix
dim(cfh)

## [1] 10 85


# Simulation phenotype under the null model
pheno <- rnorm(ncol(cfh))

# Apply SKAT method
library(SKAT)
obj <- SKAT_Null_Model(pheno ~ 1, out_type = "C")
SKAT(t(cfh), obj)$p.value

## Warning: Genotypes of some variants are not the number of minor alleles!

## [1] 0.0001301

Another way to extract genotype is readVCFToListByGene function, which provides more options to extract genotypical information from VCF file. In the following example, we extract chromosome name, chromosomal position, reference allele, annotation and genotype:

# Extract genotype matrix
vcf.1000g <- system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "vcf2geno")
geneFile = system.file("vcf/refFlat_hg19.txt.gz", package = "vcf2geno")
geneName <- "CFH"

cfh <- readVCFToListByGene(vcf.1000g, geneFile, geneName, "Nonsynonymous", c("CHR", 
    "POS", "REF", "ALT"), c("ANNO"), c("GT"))

dim(cfh$GT)

## [1] 85 10


# Convert 0|0, 0|1, 1|1 style genotype to 0, 1, 2 geno is a people by
# marker matrix
geno <- matrix(NA, nr = nrow(cfh$GT), ncol(cfh$GT))
geno[cfh$GT == "0|0"] = 0
geno[cfh$GT == "0|1"] = 1
geno[cfh$GT == "1|1"] = 2

# Apply SKAT method
library(SKAT)
obj <- SKAT_Null_Model(pheno ~ 1, out_type = "C")
SKAT(geno, obj)$p.value

## Warning: 5 SNPs with either high missing rates or no-variation are
## excluded!

## Warning: The missing genotype rate is 0.025882. Imputation is applied.

## [1] 0.0002317

Contact

SeqMiner is developed by Xiaowei Zhan and Dajiang Liu. We welcome your questions and feedbacks.