Transcriptome-wide association study (TWAS) using PredictDB Depression Genes and Networks (DGN) weights
Genetic Epidemiology Research on Adult Health and Aging (GERA) Europeans
All genomic positions are from GRCh37.
TWAS results for 10 traits are presented in the figures and tables below. Traits and trait categories are listed and defined as follows:
Platelet count (PLT)
Red blood cell indices (RBC): red blood cell count (RBC), hematocrit (HCT), hemoglobin (HGB), mean corpuscular volume (MCV), and red cell distribution width (RDW)
White blood cell indices (WBC): white blood cell count (WBC), monocyte (MONO), neutrophil (NEUTRO), and lymphocyte (LYMPH)
Figure 1. TWAS-GWAS mirror plot of genes and variants within the locus
Note: Top figure displays TWAS significant genes and any additional non-significant genes reported from GWAS, bottom figure displays GWAS variants. In the TWAS plot, "reported in GWAS" means that the GERA TWAS gene was reported in the GWAS catalog as the assigned gene for a single variant signal associated with the phenotype category, often based on physical proximity. In the GWAS plot, "reported in GWAS" means that the GERA GWAS variant was reported in the GWAS catalog as a single variant signal associated with the phenotype category. Marginal TWAS displays results of gene-trait associations. Conditional TWAS displays results of gene-trait associations, conditional on reported GWAS variants at the locus (conditional results only available for significant TWAS genes).
Figure 2a. TWAS-GWAS mirror locus-zoom plot
Note: Top panel displays predicted expression correlation between index TWAS gene and other genes at the locus. Bottom panel displays LD between the index SNP and other SNPs at the locus. Lines connect genes to their predictive model variants. Color scale for genes denotes the degree of predicted expression correlation with the index gene. Color scale for SNPs and solid lines denotes the degree of LD with the index SNP. Dashed red line in the top panel denotes TWAS p-value threshold = 4.37e-7, and in bottom panel denotes GWAS p-value threshold = 5.0e-8
Figure 2b. Network visualization of TWAS results
Sentinel TWAS gene is indicated by a star, all other genes are squares. Sentinel GWAS variant is indicated by a triangle, all other variants are circles. Color scale of all lines and shapes is based on correlation with the index gene or index variant. Line thickness corresponds to the model weight. Solid line indicates a positive direction of effect and dashed line indicates a negative direction. Size of the shape corresponds to the size of the -log10(p-value).
Figure 3. Mirror plot of GERA TWAS and meta-analysis TWAS
Note: Top figure displays TWAS significant genes and any additional non-significant genes reported from GWAS, bottom figure displays the same genes from TWAS meta-analysis of ARIC, WHI, and BioMe. In both plots, "reported in GWAS" means that the TWAS gene was reported in the GWAS catalog as the assigned gene for a single variant signal associated with the phenotype category, often based on physical proximity.
Figure 4. Comparison of TWAS results from DGN reference panel to results from secondary reference panels
Note: DGN = Depression Genes and Networks, GWB = GTEx whole blood, GTL = GTEx EBV transformed lymphocytes, MSA = MESA monocytes; each represents a gene expression reference panel. The figure in each tab displays a mirror plot of GERA results using DGN reference panel versus GERA results using a secondary reference panel (GWB, GTL, or MSA).
Table 1. Overall trait-specific TWAS results from primary and secondary reference panels within the locus
Note: DGN = Depression Genes and Networks, GWB = GTEx whole blood, GTL = GTEx EBV transformed lymphocytes, MSA = MESA monocytes; each represents a gene expression reference panel.
Significant gene-trait associations highlighted in green. HLA genes / MHC regions / single SNP models highlighted in red.
MHC region is defined as GRCh37; chr6:28,477,797-33,448,354. Single SNP model indicates that the predictive expression model for the gene contained only a single SNP.
Table 2. Overall TWAS results for other traits in the trait category from primary and secondary reference panels within the locus
Note: DGN = Depression Genes and Networks, GWB = GTEx whole blood, GTL = GTEx EBV transformed lymphocytes, MSA = MESA monocytes; each represents a gene expression reference panel.
Significant gene-trait associations highlighted in green, HLA genes / MHC regions / single SNP models highlighted in red.
MHC region is defined as GRCh37; chr6:28,477,797-33,448,354. Single SNP model indicates that the predictive expression model for the gene contained only a single SNP.
Table 3. Reported GWAS variants within the locus (all traits in the category)
Table row is highlighted in red if the SNP is not in GERA imputation data
Positions reported in Figure 1 highlighted in green
GWAS
Genes
:
matching significant TWAS genes are highlighted in green
matching non-significant TWAS genes are highlighted in yellow
included in DGN but not predicted in GERA are highlighted in orange
included in DGN but no predictive model was fit are highlighted in red
not included in DGN (i.e. we have no info in this analysis) are not highlighted
Table 4. GERA GWAS results displayed in Figure 1 as variants "Reported in GWAS"
Note: Results listed below are from trait specific GWAS
Table 5. Gene expression prediction model weights
Note: "LD" refers to the LD between the most significant rsid at the locus with the variant listed in the row. "Corr" refers to the Pearson correlation between the most significant gene at the locus and the gene listed in the row.
Methods
A preprint of the manuscript describing these results and detailing the methods will be available on bioRxiv soon
. All cohorts included in the analysis are described individually below. We analyze 10 hematological phenotypes (platelet count, red blood cell count, hematocrit, hemoglobin, mean corpuscular volume, red cell distribution width, white blood cell count, monocyte count, neutrophil count, and lymphocyte count) across all cohorts.
PrediXcan method:
PrediXcan (26258848) is a gene-based association test that prioritizes genes which are likely to be causal for the phenotype. It implements an elastic net based method for selecting variants associated with gene expression in a given reference panel, and then uses those variants to predict gene expression in a cohort with only genotype data. We downloaded the PrediXcan software (see URLs) along with its prepackaged weights for gene expression data from PredictDB (see URLs). Weights for gene expression using RNA sequencing data were obtained from the Genotype-Tissue Expression project (23715323) (whole blood, genes= ; and EBV transformed lymphocytes, genes=; n=), Depression Genes and Networks (24092820) (whole blood, genes=11538, n=922), and Multi-Ethnic Study of Atherosclerosis (Europeans only, monocytes, genes=, n=) (23900078). Imputed genotypes for all cohorts were filtered for imputation quality based on R2 > 0.3; variants not meeting this threshold were excluded from the analysis. We use DGN as our primary reference panel for all TWAS analyses as it is the largest single whole blood RNA-seq dataset.
Included cohorts:
These TWAS analyses were limited to self-reported white or European ancestry participants, for easy comparability with the DGN European ancestry eQTL panel, including input of LD information into the R Shiny application (see R Shiny Methods), and with the largest single-ancestry blood cell trait GWAS.
Genetic Epidemiology Research on Adult Health and Aging (GERA)
The GERA cohort includes over 100,000 adults who are members of the Kaiser Permanente Medical Care Plan, Northern California Region (KPNC) and consented to research on the genetic and environmental factors that affect health and disease, linking together clinical data from electronic health records, survey data on demographic and behavioral factors, and environmental data with genetic data. The GERA cohort was formed by including all self-reported racial and ethnic minority participants with saliva samples (19%); the remaining participants were drawn sequentially and randomly from non-Hispanic White participants (81%). Genotyping was completed as previously described (26092718) using 4 different custom Affymetrix Axiom arrays with ethnic-specific content to increase genomic coverage. Genotype data were imputed to 1000 Genomes Phase 3. Principal components analysis was used to characterize genetic structure in this European sample (26092716). Hematological measures were extracted from medical records. In individuals with multiple measurements, the first visit with complete white blood cell differential (if any) was used for each participant. Otherwise, the first visit was used. In total, 54,542 participants with hematological measures were included in the analysis.
Women's Health Initiative (WHI)
WHI originally enrolled 161,808 women aged 50-79 between 1993 and 1998 at 40 centers across the US, including both a clinical trial (including three trials for hormone therapy, dietary modification, and calcium/vitamin D) and an observational study arm (9492970). WHI recruited a socio-demographically diverse population representative of US women in this age range. Two WHI extension studies conducted additional follow-up on consenting women from 2005-2010 and 2010-2015. Genotyping was available on some WHI participants through the WHI SNP Health Association Resource (SHARe) resource, which used the Affymetrix 6.0 array (~906,600 SNPs, 946,000 copy number variation probes) and on other participants through the MEGA array (https://www.biorxiv.org/content/10.1101/188094v2). Imputation and association analysis was performed separately in individuals with Affymetrix only, MEGA only, and both Affymetrix and MEGA data. For variants with both Affymetrix and MEGA genotypes available, MEGA genotypes were used. In total, 18,100 self-reported white women with hematological phenotypes were included. Six sub-cohorts from the WHI study were included in the meta-analysis and phenotypes were not collected uniformly across the cohorts. Sample size information for each phenotype is contained in (Supplementary Table 5).
Atherosclerosis Risk in Communities (ARIC)
The ARIC study was initiated in 1987 and recruited participants age 45-64 years from 4 field centers (Forsyth County, NC; Jackson, MS; northwestern suburbs of Minneapolis, MN; Washington County, MD) in order to study cardiovascular disease and its risk factors (2646917), including the participants of self-reported European ancestry included here. Standardized physical examinations and interviewer-administered questionnaires were conducted at baseline (1987-89), three triennial follow-up examinations, a fifth examination in 2011-13, and a sixth exam in 2016-2017. Genotyping was performed through the CARe consortium Affymetrix 6.0 array (20400780). ARIC EA genotype data were imputed to Haplotype Reference Consortium (HRC). In total, 9,345 European ancestry participants with hematological phenotypes were included in the analysis. All phenotypes were adjusted for study site, age, age squared, sex, and top ten PCs and were inverse normalized.
BioMe
The Mount Sinai BioMe Biobank, founded in September 2007, is an ongoing, broadly consented EHR-linked bio- and data repository that enrolls participants non-selectively from the Mount Sinai Medical Center patient population. The BioMe Biobank draws from a population of over 70,000 inpatient and 800,000 outpatient visits annually from over 30 broadly selected clinical sites of the Mount Sinai Medical Center (MSMC). As of September 2020, BioMe has enrolled more than 50,000 patients that represent a broad racial, ethnic and socioeconomic diversity with a distinct and population-specific disease burden, characteristic of the communities served by Mount Sinai Hospital. BioMe participants are predominantly of African (AA, 24%), Hispanic/Latino (HL, 35%), European (EA, 32%), and other ancestry (OA, 10%). All blood cell phenotype data, as well as demographic variables, were extracted from the patients' EHRs. Genotyping was performed using the Illumina GSA array (~640.000 variants) and genotype data were imputed using the '1000G Phase 3 v5' reference panel. In total, 8,455 European ancestry participants with hematological phenotypes were included in the analysis. All phenotypes were adjusted for study site, age, age squared, sex, and top ten PCs and were inverse normalized. The BioMe Biobank Program operates under a Mount Sinai Institutional Review Board-approved research protocol. All study participants provided written informed consent.
Conditional analysis:
For each statistically significant TWAS gene-trait association, the effect of predicted gene expression was conditioned on a set of previously reported GWAS sentinel variants from [32888494] meeting the following criteria: 1) the sentinel variant fell within a 1Mb region of the TWAS gene, 2) the trait with which the GWAS variant was associated matched the TWAS analytical trait or was within the same trait category as the analytical trait (platelets, red blood cell indices {hematocrit, hemoglobin, mean corpuscular volume, red blood cell count, red blood cell distribution width}, white blood cell indices {white blood cell count, neutrophils, monocytes, lymphocytes}), and 3) the GWAS variant met an imputation quality threshold of R2 > 0.3. We used a modified version of the cpgen R package (see cpgen Methods) to perform the conditional analysis, accounting for a PLINK KING-robust kinship matrix (20926424), which used only genotyped variants and excluded those with minor allele frequency less than 5% and those missing more than 1% of SNPs.
Meta-analysis and replication with ARIC, WHI, BioMe:
In order to replicate the conditionally significant gene-trait association, we tested each association via a meta-analysis of the ARIC, WHI, and BioMe cohorts. As described above, PrediXcan was used to facilitate gene expression imputation and association in each cohort separately, and the meta-analysis association test was conducted using METAL (20616382).
Replication of the GERA significant gene-trait associations was performed using meta-analyzed TWAS results from ARIC, WHI, and BioMe. Nine gene-trait associations remained statistically significant after conditional analysis; for this set of genes, we defined a Bonferroni-corrected statistically significant replication threshold at p-value < 5.56 X 10-3. For the fine-mapping analysis, statistical significance of replicated genes was qualified based on two different thresholds -- a stringent threshold Bonferroni-corrected for all 239 statistically significant TWAS gene-trait associations at p-value < 2.09 X 10-4, and a more lenient threshold at p-value < 0.05.
FOCUS:
We used the Fine-mapping Of CaUsal gene Sets (FOCUS) (30926970) software to fine-map TWAS statistics at genomic risk regions. As input, we used GWAS summary data from GERA along with eQTL weights from PredictDB Depression Genes and Networks whole blood data, and an LD reference panel from 1000 Genomes phase 3. The software outputs a credible set of genes at each locus which can be used to explain observed genomic risk.
Fine-mapping loci and locus categories:
Fine-mapping loci refers to fine-mapping analysis of trait-specific genomic locations that contain, and are centered at, sentinel TWAS genes. That is, we take the set of trait-specific statistically significant TWAS genes, select the most significant gene in the set (the sentinel gene), and assign it to a locus along with any other statistically significant TWAS genes within a 1Mb window of the sentinel gene. We then select the next most significant TWAS gene which has not yet been assigned to a locus and continue in this fashion until all statistically significant TWAS genes have been assigned to a locus.
We define locus categories based on whether the locus contains a single gene or multiple genes and whether the locus replicates in TWAS meta-analysis at either a lenient or strict threshold. Thus, locus categories are defined as follows: 1=single gene locus, strict replication (p < 2.09E-04); 2=single gene locus, replication (p < 0.05); 3=single gene locus, no replication; 4=multi gene locus, strict replication (p < 2.09E-04); 5=multi gene locus, replication (p < 0.05); 6=multi gene locus, no replication.
R Shiny:
We use R's convenient Shiny package to produce the web application which displays our GERA TWAS results. The IdeogramTrack (https://rdrr.io/bioc/Gviz/man/IdeogramTrack-class.html) uses Genome Reference Consortium Human Build 37 (GRCh37) and UCSC cytogentic bands from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/. All GERA TWAS results were produced using PrediXcan as described above, all GERA GWAS results were produced using XXX, and GERA conditional analysis results were produced using cpgen as described below. Known GWAS sentinel variants were obtained from PMID: 32888494. Model weights and model variants were taken from our primary DGN reference panel from PredictDB (or secondary reference panels GWB, GTL, or MSA from PredictDB). Correlation of predicted expression among genes at the locus was calculated using R's cor() function, and LD among variants was computed using plink --r2 (https://zzz.bwh.harvard.edu/plink/ld.shtml). We used ggplot2() to produce all figures, except the network visualization used visNetwork(). Tables were produced using the DT package (https://www.rdocumentation.org/packages/DT/versions/0.16).
cpgen:
We used the R package cpgen to perform conditional analysis of TWAS-significant genes, while accounting for a KING kinship matrix. However, cpgen is designed in such a way that it performs eigenvalue decomposition on the cohort sample for every function call. Since we had 239 TWAS-significant associations, this would have required eigenvalue decomposition on a sample of N ~ 55,000 for each of those 239 associations, a computationally burdensome calculation. Thus, we slightly modified the cpgen script. Specifically, we computed the eigenvalue decomposition on the GERA sample outside of the cpgen script (for each phenotype), and then subsequently loaded the appropriate eigenvectors and eigenvalues into the program, modifying the script so that it could take these eigenvectors and eigenvalues as input.