Variant Filtering

So I have a few VCF files that have been mapped and I wanted to take a look at the data. Here is some information on how that has come along thus far. I’m still trying to determine if we are doing the right thing by chunking up the data and the consequences of doing it this way. That being said, here is some information on what we have. This is based upon


February 16, 2023

Merging VCF Files

The first thing we need to to is zip and then index the individual vcf files. As a reminder, these files have been created by individual freebayes SNP calling of chunks of individuals partitioned into groups based upon the first letter of the popualtion from which they were collected. So, in the following examples, the file R.vcf contains all the individuals in populations starting with the letter R. When there are not that many individuals in a group, they are clustered—below we have JK.vcf which has all the individuals from populations starting with the letters J and K.

So, lets compress these raw files first.

% bgzip JK.vcf
% bgzip P.vcf 
% bgzip R.vcf 
% bgzip S.vcf 
% bgzip W.vcf
% ls -alh
total 70032
drwxr-xr-x   7 rodney  staff   224B Feb 16 14:22 .
drwxr-xr-x  21 rodney  staff   672B Feb 16 14:10 ..
-rw-r--r--   1 rodney  staff   3.1M Feb 16 14:20 JK.vcf.gz
-rw-r--r--   1 rodney  staff   5.8M Feb 16 14:22 P.vcf.gz
-rw-r--r--   1 rodney  staff   4.6M Feb 16 14:22 R.vcf.gz
-rw-r--r--   1 rodney  staff    14M Feb 16 14:22 S.vcf.gz
-rw-r--r--   1 rodney  staff   5.8M Feb 16 14:22 W.vcf.gz

Now we need to index these files.

% tabix -p vcf JK.vcf.gz 
% tabix -p vcf P.vcf.gz 
% tabix -p vcf R.vcf.gz
% tabix -p vcf S.vcf.gz
% tabix -p vcf W.vcf.gz

Then we can merge the files together into a single file. Here, the output file indicates the populations that contribute to it.

 % vcf-merge JK.vcf.gz P.vcf.gz R.vcf.gz S.vcf.gz W.vcf.gz > JKPRSW.vcf

So here are a few things that I’ve found that provide some insights into the data we are getting. Here is the sequencing depth per individual for only the SNP sites who have a minimum quality score of 20.

% vcftools --vcf JKPRSW.vcf --remove-indels --minQ 20 --depth -c > depth.txt

This produced a distribution of sequencing depths for these individuals.

Figure 1 Average sequencing depth across loci within each of the 302 individuals.

Alternatively—and perhaps more important for us—we can look at the per-site depths. The goal here is to retain sites that have enough sequencing depth such that we are confident that it is a good site but not too much such that it is a reapeaty region in the geniome. For simplicity, I’ll drop the lower and upper standard deviation.

Figure 2 Average sequencing depth per site across 302 individuals.

For simplicity, I’m just going to take the inner-50% of the data, dropping off the lower and upper quantiles.

quantile( site_depth$SUM_DEPTH,
          prob = c(0.25,0.75))
25% 75% 
110 964 

This leaves us with 5047 sites.

Now, we can go back to vcftools and use these values to select only sites with a quality score exceeding 20, and are in the middle 50 percentile of the depth coverage and have a minor allele frequency of at least 5%.

% vcftools --vcf JKPRSW.vcf --remove-indels --min-meanDP 110 --max-meanDP 964 --minQ 20 --maf 0.05  --min-alleles 2 --max-alleles 2 --freq
After filtering, kept 302 out of 302 Individuals
Outputting Frequency Statistics...
After filtering, kept 120 out of a possible 22604 Sites
Run Time = 1.00 seconds

So, we have 120 loci to work with from this subset.