This is what you expect a typical SNP to look like:
There are three clearly distinguishable clusters representing the three possible genotypes at that locus (TT, TC and CC).
But some SNPs look like this:
What is going on here?
Well, the last plot is a SNP from a complex genomic region known as KIR. Genes in this region exhibit great allelic copy number diversity. Consequently, SNP probes in this region can bind to several copies of a particular allele, which leads to noisy multi-cluster signals, such as the one pictured here. Moreover, since little is known about the KIR region, the SNP probes may not always bind to the expected locations within those genes, which further complicates the interpretation of the signal.
Wouldn’t it be nice if we could make some sense out of these SNPs to utilise this data for large scale association studies?
One naive approach would be to assume that there is a straight one-to-one mapping between SNP clusters and the copy number of a single gene. This idea works well for detecting regions of common copy number variation, but in our experience of the KIR region the clusters are hard to distinguish and cannot be mapped to copy number variation of a single gene. To explain what is really going on, we need to resort to a different technology.
Ideally, we would like to fully sequence the KIR region in a large number of individuals. But because of great sequence similarity in this region, very long reads would be required for correct assembly. However, we have a more targeted, cheaper and readily available technology at our disposal for measuring copy number variation: quantitative Polymerase Chain Reaction (qPCR).
The idea is simple and, we found, can work remarkably well: first do qPCR in a subset of samples, then use supervised classification to link qPCR copy numbers to SNP patterns.
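To make the two steps concrete, here is a minimal sketch in Python. It assumes each sample is summarised by a pair of SNP probe intensities and that the qPCR subset carries copy number group labels written in the same "a-b" style used in this post. The k-nearest-neighbours classifier, the function name `knn_predict` and all of the numbers are illustrative assumptions chosen for brevity, not a description of the exact method or data in the paper.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k labelled samples closest to `query`."""
    # Rank the qPCR-labelled samples by Euclidean distance in SNP signal space.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Step 1: the subset of samples with both SNP intensities and qPCR copy numbers.
# These values are made up for illustration only.
qpcr_points = [
    (0.90, 0.10), (1.00, 0.20), (0.80, 0.15),
    (0.50, 0.50), (0.55, 0.45), (0.45, 0.55),
    (0.10, 0.90), (0.20, 1.00), (0.15, 0.80),
]
qpcr_labels = ["2-0"] * 3 + ["1-1"] * 3 + ["0-2"] * 3

# Step 2: predict copy number groups for samples that only have SNP data.
snp_only = [(0.85, 0.12), (0.50, 0.52), (0.12, 0.95)]
predictions = [knn_predict(qpcr_points, qpcr_labels, p) for p in snp_only]
print(predictions)
```

In practice you would want to estimate the misclassification rate, for example by cross-validation within the qPCR subset, before carrying the predicted copy numbers into an association test.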
This is the approach we developed in our recently published BMC Genomics paper and applied to test the association of KIR3DL1/3DS1 copy number with type 1 diabetes (T1D).
Notice, however, that certain qPCR samples lie within the wrong SNP copy number cluster. For example, samples with a qPCR copy number of 0-2 lie in the SNP cluster 1-1. Here, we attribute the error to imperfect linkage disequilibrium between the tagging SNP and target genes: this SNP does not in fact lie in the KIR3DL1 or KIR3DS1 genes but in the neighbouring gene KIR2DL4*005, an allele which undergoes copy number variation along with KIR3DL1/3DS1.
This idea of imputing KIR genes from tagging SNPs in the region is something that other groups are also researching. From attending ASHG 2013, we know of the ambitious ongoing work by Gil McVean and collaborators (poster 1919W) at Oxford to extend this approach to all KIR genes. We are very interested in seeing the outcomes of their research (or, for that matter, that of anyone else who is imputing KIR copy number from SNP data).
In the immediate future (until long read sequencing becomes sufficiently cheap), we would like to see similar hybrid qPCR/SNP approaches applied more widely to leverage existing SNP datasets, so that non-genotypable regions like KIR can be assessed more thoroughly and with sufficient power.
We hope that our work might inspire you to revisit your GWAS SNP data and carefully select samples on which to do qPCR, so that you can conduct a similar analysis for regions of common copy number variation. We would recommend preferentially selecting samples for qPCR from the smaller SNP clouds, since these are likely to correlate with rarer copy number groups (for example the 3-0 group above). This could achieve a better prediction rate with a smaller number of samples (as we suggest in Figure 4 of our paper).
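One simple way to favour the smaller clouds is sketched below. It assumes you already have a rough cluster assignment for every sample (for instance from an unsupervised clustering of the SNP signal) and draws the same quota from each cluster, so that small clusters are sampled at a much higher rate than large ones. The function name `select_for_qpcr`, the cluster names and the sample IDs are all hypothetical, and this equal-quota rule is only one possible selection scheme, not necessarily the one evaluated in the paper.

```python
import random
from collections import defaultdict


def select_for_qpcr(cluster_of, per_cluster, seed=0):
    """Pick up to `per_cluster` samples from every SNP cluster."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    members = defaultdict(list)
    for sample_id, cluster in cluster_of.items():
        members[cluster].append(sample_id)

    chosen = []
    for cluster in sorted(members):
        ids = sorted(members[cluster])
        # Clusters smaller than the quota are taken in full.
        chosen.extend(rng.sample(ids, min(per_cluster, len(ids))))
    return chosen


# Made-up example: one large cloud (A) and two small ones (B and C).
cluster_of = {f"s{i}": "A" for i in range(50)}
cluster_of.update({f"s{i}": "B" for i in range(50, 58)})
cluster_of.update({f"s{i}": "C" for i in range(58, 61)})

chosen = select_for_qpcr(cluster_of, per_cluster=5)
print(len(chosen), "samples selected for qPCR")
```

With a quota of five, the three-sample cluster C is typed in full while only a tenth of cluster A is, which is the kind of enrichment for rare copy number groups suggested above.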
In particular, it would be great to see adoption of this approach in KIR association studies which have so far been hindered by embarrassingly small sample sizes (especially when large case-control ImmunoChip cohorts are already available).