Nature GeNetics
doi:10.1038/ng.3638
ONLINE METHODS
CNV calling in exome sequencing data of 60,642 individuals. XHMM was
run as previously described
17
. Briefly, GATK DepthOfCoverage was employed
to calculate mean per-base coverage (counting unique fragments based on reads
mapping with quality >20), across 219,437 targets (including 7,439 and 708 on
chromosomes X and Y, respectively, and 9 on the mitochondrial genome). To
accommodate the variety of exome capture methods used across the various
component projects, these targets were liberally defined as the Illumina ICE
v1 targets plus GENCODE v19 coding regions, both padded by 2 bp, from
which the unique set of relevant ‘exome targets’ was finalized. A total of 31,769
of these targets were subsequently filtered out before CNV calling, 21,072 for
having mean sequencing depth (across all samples) <10×, 8,875 for having
low-complexity sequence (as defined by RepeatMasker) over >25% of its span,
225 for having GC content <10% or >90%, 1,582 for covering <10 bp, and 15 for
spanning >10 kb. The resulting sample × target read depth matrix was scaled
by mean-centering the targets, after which principal-component analysis of
the full matrix was performed; note that, with the LAPACK implementation
in XHMM, this still required 800 GB of RAM and ~1 month of computation
time. For data normalization, the top 388 principal components (those with
variance >70% of the mean variance across all components) were removed
from the data to account for systematic biases at the target or sample level,
such as GC content or sequencing batch effects. Subsequently, three targets
were removed for still having high variance after normalization (s.d. >50),
and sample-level z scores were calculated (with absolute values capped at 40).
CNVs were called using the Viterbi HMM algorithm with default XHMM
parameters, and XHMM CNV quality scores were calculated as previously
described using the forward–backward HMM algorithm and modifications as
previously described. In addition, all called CNVs were statistically genotyped
across all samples using the same XHMM quality scores and output as a single
uniformly called VCF file.
Quality control of CNV data. In total, we attempted CNV calling for 60,642
of the 60,706 (99.9%) ExAC samples, with the remainder either having failed
calling for low overall read depth or not included because of upstream data
access issues. The CNVs output by XHMM were first frequency filtered to
remove common CNVs, that is, those seen more than 600 times (>1%), defined
as overlapping more than 50% of their respective targets. On the basis of previ-
ous work
17
, we retained only CNVs with quality scores greater than or equal to
60. We removed any individual having a CNV count greater than 3 s.d. above
the mean, that is, 24 CNVs (n = 775 samples removed). Thus, our final data
set consisted of 59,898 individuals and 126,771 CNVs overlapping GENCODE
autosomal protein-coding genes.
Filtering of genes. Of the 20,345 GENCODE v19 genes labeled as protein cod-
ing, we limited our analyses to the set of 19,430 genes occurring on autosomes,
with CNVs on sex chromosomes removed because of technical issues. Next,
we removed any gene where half or more of its targets were filtered out dur-
ing CNV calling (1,068 genes). We further removed genes having unusually
low (<30×) or high (>200×) mean coverage (944 genes). Using data from a
recent report on CNV from whole-genome sequencing data for 849 genomes
sequenced from the 1000 Genomes Project
36
, we removed any gene known to
be multiallelic (735 genes). Finally, we removed any gene in which there existed
any CNV with a frequency greater than 0.5% (1,193 genes). This filtering
yielded a final set of 15,734 genes for all subsequent genic analyses.
Assessment of CNV quality in parent–child trios. To assess overall CNV
quality, we used 241 previously described
37,38
parent–offspring trios from
Bulgaria to confirm that apparent de novo rates and parent-to-child transmis-
sion broadly conformed to expectations of random Mendelian segregation
(note that the offspring had a diagnosis of schizophrenia and were not part of
the primary ExAC data set, which included only unrelated individuals). Poor
sensitivity would result in severely reduced transmission statistics, whereas
poor specificity would induce many false positive CNV calls and increased rates
of de novo CNV. Through reasonable estimates of transmission and de novo
events, we could infer high specificity and sensitivity for the CNV calls
overall. Defining CNV transmission as implemented in the PLINK/Seq cnv-
denovo command
17
, we assessed whether the rate of transmission for CNVs
converged to the expected Mendelian rate of 50% across a range of quality
score thresholds. Using the recommended quality score cutoff (SQ ≥ 60),
median per-trio CNV transmission rates were at the expected 50%, with
the aggregate transmission rate across CNVs in all trios falling to 43% (44%
for deletions, 42% for duplications). These rates excluded situations where
an offspring’s CNV was neither confidently called as deleted or duplicated
(SQ ≥ 60) nor confidently called as diploid (DQ ≥ 60). Including these more
uncertain events and conservatively counting them as non-transmissions
resulted in aggregate transmission rates of 32%. Nevertheless, these results
remain consistent with high specificity, as confirmed by a low mean of 0.058
de novo CNVs per trio (half of which were over 1 kb in length and spanned
five or more exons), which only increased to 0.13 de novo CNVs per trio
when treating uncertain events in the parents as diploid. Indeed, a comparable
de novo CNV rate of 0.051 was found in a larger version of this cohort (622
trios) using genotyping arrays
38
.
Gene- and exon-specific copy number calls. We defined gene-specific copy
number state for each individual, assessing the probability of a CNV occurring
anywhere between the transcriptional start and end sites of a gene. Specifically,
this analysis was performed by defining the genomic interval spanned by each
gene and then using the sample × target matrix of z scores to statistically geno-
type these gene regions across all samples. This genotyping procedure yielded a
VCF file containing key copy number metrics, including those corresponding to
the probability that an individual is confidently called diploid for the extent of
the gene or, alternatively, has some deletion or duplication therein. All of these
probability-derived metrics were calculated using the forward–backward
HMM algorithm modified to efficiently calculate posterior probabilities
across all targets in a gene, analogous to genotyping across all targets in a
particular called CNV region. Although XHMM performs exome-wide cor-
rection for both regional and individual read depth variability, we found that
increased sample read depth was still correlated with increased numbers of
CNVs (Supplementary Fig. 1). In the absence of large-scale validation efforts
and given the focus on CNVs that were rare at any particular locus, it was not
feasible to easily normalize out this effect. However, we did account for poten-
tial confounders, such as gene size and read depth, in calculating gene-specific
diploid quality (by defining a threshold for diploid quality of 3 s.d. below
the mean for all individuals). Using this approach, we obtained confidence
measures for deletion, duplication, and diploid status for every individual at
every gene. We further employed the same strategy to call exon-specific copy
number states, again starting with genic exons and overlapping these with
all targets at which read depth was calculated and normalized; note that this
analysis typically included a single target per exon, but for a small proportion
of exons two or more targets were included because of the slight difference
in the definition of target regions for CNV calling and GENCODE exonic
regions. Genic CNV counts derived from this procedure correlated with the
number of loss-of-function variants in a gene (Supplementary Fig. 7).
Creating genic CNV intolerance scores. For the 15,734 genes that survived
quality control, we constructed genic measures of intolerance for all CNVs
and separately for deletions and duplications. In the absence of a high-quality
mutation model for CNVs, we employed an empirical approach incorporating
genomic information. From a set of 9,396 unique pairs of segmental duplica-
tions on the same chromosome downloaded from the UCSC Genome Browser,
we created a subset of 2,790 non-redundant pairs, requiring that the genomic
intervals between them be less than 80% overlapping and less than 4 Mb in
length. We identified a significant increase in the number of CNVs in genes
within these regions (Supplementary Fig. 1), so we included this factor in
predicting CNV frequency. Ultimately, we calculated genic intolerance from
the residuals of a logistic regression of CNV frequency on gene length, read
depth, GC content, sequence complexity, and the number of pairs of segmen-
tal duplications the gene is located between, along with higher-order terms.
We next calculated z scores such that positive values represent a lower fre-
quency of CNV (more intolerance), winsorizing the negative tail at 5%.
Stratifying CNVs by genic content affected. We stratified CNVs by the
number of genes and exons for which they were called (confidently) to affect
dosage. Specifically, we defined single-gene CNVs as those with a gene-specific
npg
© 2016 Nature America, Inc. All rights reserved.
npg
© 2016 Nature America, Inc. All rights reserved.