Objective

We took part in the Exposome Data Challenge to examine whether we could identify exposure environments with high sexual dimorphism in human development. For answering this questions we used our implementation of causal random forest (teff).

Rationale

Boys and girls develop differently. For instance, their immune response to infections differ from an early age, their brains grow at different rates, and the prevalence of numerous common diseases, like obesity, is also different. Boys are typically more susceptible to obesity than girls. Given the contrasting paths of development, it is remarkable that epidemiological studies rarely consider sex as an effect of interest rather than a confounder. Exposomic studies, in particular, are characterized by the acquisition of massive amounts of data at individual and population levels. An important goal of these studies is to inform the likely conditions for which a given public health intervention would be optimal, such that the best intervention is applied at the right time to the right population. However, as the main difference between individuals is sex, exposomic studies aiming at improving precision medicine and precision public health cannot do without considering how environmental risk factors affect sexual dimorphism in development and disease.

From a mechanistic context, studying the factors that increase sexual dimorphic outcomes of disease can offer important insights into its etiology, and inform of possible interventions and targeted treatments. Important advancements have been made on studying sex-related risk factors for diseases like cancer, Alzheimer’s and automimmune diseases. However, a relevant component of these age-related diseases is hormonal regulation. Studying sex differences in preteens offers not only the opportunity for identifying targeted treatments for early-age illnesses but also to explore disease mechanisms unlikely influenced by sex hormones.

Finally, form a biological perspective, methods determining the environmental landscape that promotes sexual-dimorphisms could foster evolutionary studies of sex-biased selection based on exposomic data in humans and other species.

Hypothesis

We hypothesized that a co-ocurrence of environmental conditions can be identified from exposomic data such that the differences in BMI between boys and girls is highest. We further hipothesized that the individuals who belonged to such exposure environment could be characterized by specific patterns of gene expression and methylation relevant to BMI differences. To assess these hypotheses, we used random causal forest on exposomic data to determine the exposure environment of high dimorphism, to which individuals could be classified. The hypotheses were also explored for asthma and IQ.

Methods

Individuals who developed a complex disease are likely exposed to multiple environmental risk factors. For diseases with differences in risk or severity between sexes, one could expect that, under the same exposure environment, individual patients progress differently depending on their sex. We are interested on identifying the exposure environment, e.g. a set of exposures, for which the effect of sex is high on a clinical outcome. However, from an exposomic perspective, each individual is characterized by a unique exposomic profile, and direct assessment of the sex effect on the outcome under the entire profile is not possible. The question of what the outcome would be if we could switch an individual for another with opposite sex but keep the same exposure profile, and thus be able to compute the difference in outcome between sexes in the profile, belongs to the realm of causal inference. To our knowledge, we are the first to cast the assessment of sexual dimorphism in clinical outcomes as a causal inference problem; and herein lies the novelty of our approach.

Our approach aims to estimate for each exposure profile the effect of sex on BMI. We thus define the effect of sex \(\tau(\mathbf{p})\) for a set of exposure values \(\mathbf{p}\) as the expectation of the difference in the outcome \(D\) between males (0) and females (1) \[ \tau(\mathbf{p}) = E[D_j(1)- D_j(0) | \mathbf{P}_j = \mathbf{p}] \]where \(\mathbf{P}_j\) is the random variable that corresponds to the set of \(i\) exposure residuals of an individual \(j\), and \(\mathbf{p} = \{x'_{ji}\}\) is observed vector of exposure residuals, adjusted by covariates. The apostrophe indicates residuals and the subscript \(i\) indexes the exposures. Therefore, for a girl with exposure residuals \(\mathbf{p}\), her effect of being female on BMI is the expected difference between the observed BMI \(D_j(1)\) and the inferred \(D_j(0)\); that is, the BMI had a boy been observed with \(\mathbf{p}\). Similarly, \(\tau(\mathbf{p})\) for boy is the expected difference between the inferred \(D_j(1)\) and the observed \(D_j(0)\). We refer to \(\tau(\mathbf{p})\) as the asocciated sexual dimorphism of the exposure environment \(\mathbf{p}\). We used random causal forest (RCF) to directly estimate \(\tau(\mathbf{p})\). A main advantage of RCF is that it can estimate the confidence interval (CI) for \(\tau(\mathbf{p})\).

Our approach is based on two steps: the first one selects informative exposures of sexual dimorphism and the second applies RCF to determine the exposure environment for which the differences in phenotypes between sexes is highest.

In the first step, we aimed to select the individual exposures that would contribute to modify the effect of sex on a phenotype, such as BMI. We, therefore, tested the association between phenotype (\(D\)) and the the interaction between the exposures (\(x_i\)) and sex (\(t={0,1}\)) using the linear model

\[ D_{j} = \alpha_{i} + \beta_{i} (x_{ji} \times t_j) + \sum_{r=1...k} \gamma_{ri} C_{jir} + \epsilon_{ji} \]

where \(D_{j}\) is the phenotype for individual \(j\) with exposure \(i\) (\(x_{ji}\)) and sex \(t_i\). \(C_{jir}\) is the \(r\) covariate. \(\beta_{i}\) is the effect of interest that measures the association between the phenotype and the interaction between exposure \(i\) and sex. Significant exposures at nominal value were selected as informative exposures of sexual dimorphism and used in a random causal forest analysis to determine the exposure environment of highest sexual dimorphism for the phenotype.

For the second step, we adapted our recent implementation and development of RCF in the package teff that allows the extraction profiles from feature data (exposome) for which the effect of a condition (sex) on an outcome (phenotype) is maximized. We applied RCF to the levels of selected exposures. In RCF, a random train-set of \(80\%\) individuals is drawn to grow the forest. The remaining \(20\%\) of subjects are set aside to infer \(\tau(\mathbf{p})\) and thier \(95\%\) CIs according to the RCF predictor (Athey, et al. Ann. Statist. 47(2): 1148-1178 ). The application of these procedures was implemented in the function profile of teff.

In the package teff, individuals with significant CI are used to create exposure profiles of highly positive (female > male) and highly negative (female < male) sexual dimorphisms. The positive profile consists of a binary values across exposures that indicate whether each exposure showed higher or lower than average levels across individuals with CI higher than 0. Likewise the negative profile is based on individuals with CI lower than 0.New individuals with exposure data can thus be targeted and classified. For targeted individuals, the package can test whether the association between sex and the outcome is indeed significantly different across profiles. Furthermore, the classification of individuals into exposure profiles of high sexual dimorphism can be used in down-stream analysis of gene expression and methylation data to determine the most likely molecular mechanisms affected by the exposure environment. Further applications of the package can be found in teff

Results

Our analysis uses only R/Bioconductor packages and our implementation of the random causal forest package teff, also in R.

library(parallel)
library(ggplot2)
library(ggthemes)
library(ggpubr)
library(arm)
library(limma)
library(EnhancedVolcano)
library(meta)
library(clusterProfiler)
library(minfi)
library(teff)

Phenotype data

Data from the Exposome Data Challenge were downloaded into the local directory ./data, from which they were loaded into the R session. We first inspected the exposome data

load("./data/exposome.rdata")

code <- read.csv("./data/codebook.csv", sep=";")
rownames(code) <- code[,1]

We observed that birth weight and zBMI were normally distributed measurements.

par(mfrow=c(1,2))
hist(phenotype[,"e3_bw"], main="e3_bw", xlab="")
hist(phenotype[,"hs_zbmi_who"], main ="hs_zbmi_who", xlab="")

And asthma was dichotomous

table(phenotype[,"hs_asthma"])
## 
##    0    1 
## 1159  142

However, neuropsychological measurements did not normaly distributed and, therefore, we dichotomized them, calling individuals with measurements lower than the 20th percentile.

par(mfrow=c(1, 2))
hist(phenotype[,"hs_correct_raven"], main="hs_correct_raven", xlab="")
hist(phenotype[,"hs_Gen_Tot"], main="hs_Gen_Tot", xlab="")

phenotype[,"hs_correct_raven"] <- as.numeric(phenotype[,"hs_correct_raven"] < quantile(phenotype[,"hs_correct_raven"], 0.20))

phenotype[,"hs_Gen_Tot"] <- as.numeric(phenotype[,"hs_Gen_Tot"] < quantile(phenotype[,"hs_Gen_Tot"], 0.20))

In addition, we considered obesity as an outcome derived from the hs_bmi_c_cat variable for individuals in category 4.

table(phenotype[,"hs_bmi_c_cat"])
## 
##   1   2   3   4 
##  13 904 253 131
phenotype[,"hs_bmi_c_cat"] <- as.numeric(phenotype[,"hs_bmi_c_cat"]==4)

We then asked the extent to which the phenotypes in the data set were sexually dimorphic. We thus assessed their association with sex by linear models, whose link function depended on the distributional nature of the phenotype.

dims <- lapply(names(phenotype)[2:7], function(ph)
{
  mod <- "gaussian" 
  
  if(length(table(phenotype[[ph]]))==2) 
     mod <- "binomial"
  summary(glm(phenotype[[ph]] ~ covariates$e3_sex_None, family = mod))$coefficients[2, c(1,4)]
})

dims <- do.call(rbind, dims)
rownames(dims) <- names(phenotype)[2:7]

dims
##                      Estimate     Pr(>|t|)
## e3_bw            155.23191027 3.660523e-08
## hs_asthma          0.50134167 6.592082e-03
## hs_zbmi_who        0.10139352 1.252016e-01
## hs_correct_raven   0.05637744 7.010998e-01
## hs_Gen_Tot        -0.20191949 1.492246e-01
## hs_bmi_c_cat       0.42595146 2.482838e-02

As expected, positive significant associations were found for boys with birth weight, as well as for asthma and obesity frequencies. We confirmed tha sex-corrected phenotypes, such as zBMI and neuropsichological measures, were not associated with sex.

ggp <- lapply(names(phenotype)[c(2,4)], function(ph){
  df <- data.frame(phenotype=phenotype[[ph]], 
                   sex=relevel(covariates$e3_sex_None, ref=2))
  p <- ggplot(df, aes(sex, phenotype))+
   theme(legend.position = "none")+
   theme_classic()+
   geom_jitter(width = 0.05, col="darkgrey", size=1, alpha=0.1)+
   stat_summary(fun.data="mean_cl_normal", 
                 geom="errorbar", width=0.25, color=c("orange", "blue"))+
   geom_boxplot(col="black", alpha=0.01, width=0.1)+
   ylab(ph)
  p
})

ggarrange(ggp[[1]],ggp[[2]], labels = c(""," "), ncol = 2, nrow = 1)

The plots show the inferred CI for the mean of birth weight and zBMI (colored) together with a box plot of quartiles. Birth weight was lower for females, but zBMI was no different between sexes.

Exposomic and Transcriptomic match

We then loaded the gene expression data and matched the individuals with transcriptomic, exoposomic and phenotipic data.

load("./data/genexpr.Rdata")

genesIDs <-  Biobase::fData(genexpr)
exprs <-  Biobase::exprs(genexpr)
phenotype_expression <-  Biobase::pData(genexpr)

Individuals with data in all data sets were selected.

comnames <- intersect(rownames(phenotype), rownames(phenotype_expression)) 

covs <- covariates[comnames, ]
eff <- phenotype[comnames,]
phenotype_expression <- phenotype_expression[comnames, ]
exprs <- exprs[,comnames]
expos <- exposome[comnames, ]

zBMI

We analyzed normalized body mass index (zBMI) from the variable hs_zbmi_who. Note that while there are no overall differences of zBMI between sexes, we are interested in finding the set of exposures that would make this difference high. Therefore, in the first step of our analyses, we selected the exposures that significantly modulated the effect of sex on the phenotype, assessing the interaction between the exposures and sex on BMI. While we had confirmed that this phenotype did not associated with sex differences, we sill adjusted for covariates that included sex, exposure levels, height and weight, in addition to cohort, year of birth, and age among others.

colnames(covs)[-c(1,3)]
##  [1] "h_cohort"          "e3_yearbir_None"   "h_mbmi_None"      
##  [4] "hs_wgtgain_None"   "e3_gac_None"       "h_age_None"       
##  [7] "h_edumc_None"      "h_native_None"     "h_parity_None"    
## [10] "hs_child_age_None" "hs_c_height_None"  "hs_c_weight_None"

We selected exposures with nominal significant associations for their interaction with sex.

# sex explicitly encoded in variable t (males:0, females: 1)
# exposure encoded in variable exp
# phenotype encoded in variable eff
# covariated encoded in data.frame cov (exluding ID and sex)

#phenotye name
ph <- "hs_zbmi_who"

texp <- sapply(names(expos)[-1], function(xx){
  dat4int <- data.frame(t = -as.numeric(covs$e3_sex_None )+2,
                      eff = eff[,ph],
                      exp = as.numeric(expos[,xx]),
                      covs[-c(1,3)])

 
  fa <- formula(paste0("eff ~ t:exp +",
                     paste0(names(dat4int)[-2], collapse="+")))

  summary(glm (fa, data=dat4int))$coeff["t:exp", c(1,4)]
})

texp <-t(texp)

texpsig <- texp[texp[,2]<0.05,]

dd <- data.frame(labels=code[rownames(texpsig),"labels"], texpsig, check.names = FALSE)

dd[order(dd[,3]),]

We observed 31 (nominally) significant exposures whose interaction with sex associated with zBMI. We observed exposures from different families. Most represented families included meteorological, natural spaces, organochlorines and phthalates.

tb  <- table(code[rownames(texpsig),"family"])
tball  <- table(code[,"family"])
sort(round(tb/tball[names(tb)],3), decreasing = TRUE)
## 
##                             Meteorological 
##                                      0.500 
##                             Natural Spaces 
##                                      0.333 
##                            Organochlorines 
##                                      0.333 
##                                 Phthalates 
##                                      0.273 
##                                  Lifestyle 
##                                      0.128 
##                              Air Pollution 
##                                      0.125 
## Per- and polyfluoroalkyl substances (PFAS) 
##                                      0.100 
##                                    Phenols 
##                                      0.071 
##                                     Metals 
##                                      0.050

The top association was for consumption of organic food (hs_org_food_Ter) whose association with zBMI reduction was stronger in boys than girls.

df <- data.frame(phenotype=eff[,ph], 
                 sex=relevel(covs$e3_sex_None, ref=2),
                 exposure=expos[,"hs_org_food_Ter"])

ggline(df, x = "exposure", y = 'phenotype', 
       add = "mean_se", color = "sex", palette = c("orange", "blue"), 
       xlab="Organic food", main="", ylab="zBMI")  

We then applied RCF on these 31 exposures to infer the effect of sex on zBMI for individual exposure profiles. The implementation of the algorithm automatically splits the data into a test (20%) and and training set (80%). The forest is grown on the training set while causal inferences are performed on the test set. The benefit of RCF is that the effect of sex is estimated with confidence intervals for individual profiles, allowing identification of exposure environments with highly positive or negative sexual dimorphisms. Application of RCF was performed with the function predicteff.

#format data into model matrices (to account for factor variables)
phenored <- data.frame(eff=eff[,ph],
                       t=-as.numeric(covs$e3_sex_None )+2, 
                       covs[-c(1,3)])
names(phenored)[3:ncol(phenored)] <- paste0("cov", 3:ncol(phenored))

fa <- formula(paste0("~", paste0(names(phenored), collapse="+")))
mod <- model.matrix(fa, data = phenored)[,-1]

fa <- formula(paste0("~", paste0(names(expos), collapse="+")))
modexp <- model.matrix(fa, data = expos)[,-1]

#data for teff
data4teff <- list(features=data.frame(modexp), teffdata = data.frame(mod))

#appluy RCF
pred <- predicteff(data4teff, featuresinf = rownames(texpsig))

ctrl <- list(lb=c("Male", "Female"), wht="bottomright", whs = "topleft")
plotPredict(pred, lb="Associated sexual dimorphism", ctrl.plot=ctrl)

For the test set of 201 individuals, we observed 42 for which their exposure profiles significanlty associated with negative dimorphism in zBMI (M>F).

table(pred$cu<0)
## 
## FALSE  TRUE 
##   155    46

Within these environments, boys would likely have higher zBMI than girls, supporting the higher vulnerability of boys to obesity. RCF also scores the exposure features that contributed most to the forests. We observed, in particular, that PM10, was the variable with highest importance.

imp <- data.frame(labels=code[ pred$featurenames[,1],"labels"], importance=pred$featurenames[,2])

head(imp)
plot(pred$featurenames[,2], ylab="Exposure importance", pch=16)
for(i in 1:4)
  text(i+2.3, pred$featurenames[i,2]+0.001,
       code[ pred$featurenames[i,1],"labels"], cex=0.5)

We then extracted a common exposure environment of negative sexual dimorphism, using the averages of the top 70% of important exposures across the 42 individuals with significant negative dimorphism in zBMI. All the individuals in the original dataset were classified on whether they belonged to the exposure environment with a tolerance of 70% across selected exposures. We then tested whether the classification of individuals into the environment of negative dimorphism modulated the effect of sex on BMI. All these processes are applied in the function target.

pred <- predicteff(data4teff, featuresinf = rownames(texpsig), profile = TRUE, quant = 0.7)
tar <- target(data4teff, pred, effect = "negative", mat = 0.7,
              lb = code[pred$featurenames$featurenames,"labels"])

We observed that 217 individuals from 1007 were targeted by the environment of negative dimorphism, independently of sex.

expenv <- tar$classification
ef <- data4teff$teffdata$eff
tr <- data4teff$teffdata$t

tb <- table(tr,expenv)

tb
##    expenv
## tr    0   1
##   0 419 115
##   1 371 102
chisq.test(tb)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tb
## X-squared = 1.4945e-29, df = 1, p-value = 1

For the selected parameters of the targeting, we obtained twelve exposures that defined the common exposure environment of negative dimorphism in zBMI.

o.prof <- pred$profile$profnegative
names(o.prof) <- colnames(pred$profile$profnegative)
o.dir <- rep("-", length(o.prof))
o.dir[o.prof] <- "+"
imp <- pred$featurenames[complete.cases(pred$featurenames),]
row.names(imp) <- imp[,1]        
  
df <- data.frame(code[names(o.prof),c("family", "labels", "period")], importance=imp[names(o.prof),2],
                 "M>F"=o.dir, check.names = FALSE)

df

We also observed that the association of zBMI with the interaction between the exposure environment and sex was highly significant.

tar
## object of class: tarteff 
## 
## classification into 
##   negative treatment effect: 1
##   neutral: 0
##  
##   0   1 
## 790 217 
## 
## interaction fitted model: gaussian
##     Estimate   Std. Error      t value     Pr(>|t|) 
## -0.524140672  0.177261958 -2.956870600  0.003180596

Illustrated by the interaction plot of estrimated means with 95% CI

plotTarget(tar, 
           labs=c("Sexual dimorphism", "zBMI", "Sex", "Male", "Female"), 
           labeff= c("Neutral", "Negative"))

and the boxplot of zBMI quartiles

boxPlot(tar, 
           labs=c("Sexual dimorphism", "zBMI", "Sex", "Male", "Female"), 
        lg="topright",
        labeff= c("Neutral", "Negative"))

Obesity

We then tested whether consistent environmental exposures could be also found in Obesity. We follow the same analysis scheme as before.

ph <- "hs_bmi_c_cat"

texp <- sapply(names(expos)[-1], function(xx){
  dat4int <- data.frame(t = -as.numeric(covs$e3_sex_None )+2,
                      eff = eff[,ph],
                      exp = as.numeric(expos[,xx]),
                      covs[-c(1,3)])

 
  fa <- formula(paste0("eff ~ t:exp +",
                     paste0(names(dat4int)[-2], collapse="+")))

  summary(bayesglm (fa, data=dat4int, family="binomial"))$coeff["t:exp", c(1,4)]
})

texp <-t(texp)

texpsig <- texp[texp[,2]<0.05,]

data.frame(labels=code[rownames(texpsig),"labels"], texpsig, check.names = FALSE)

For obesity we found a more reduces set of environmental exposures that modulated the effect of sex. Most of the exposures related to meteorological conditions. Nonetheless, we identified a set of individuals with significant dimporhims in obesity according to these exposures. Again, only significant associations with negative dimorphisms were found, where boys had higher frequency of obesity than girls.

phenored <- data.frame(eff=eff[,ph],
                       t=-as.numeric(covs$e3_sex_None )+2, 
                       covs[-c(1,3)])

names(phenored)[3:ncol(phenored)] <- paste0("cov", 3:ncol(phenored))


fa <- formula(paste0("~", paste0(names(phenored),
                            collapse="+")))
mod <- model.matrix(fa, data = phenored)[,-1]

fa <- formula(paste0("~", paste0(names(expos),
                            collapse="+")))
modexp <- model.matrix(fa, data = expos)[,-1]


data4teff <- list(features=data.frame(modexp), teffdata = data.frame(mod))

predob <- predicteff(data4teff, featuresinf = rownames(texpsig))

ctrl <- list(lb=c("Male", "Female"), wht="bottomright", whs = "topleft")
plotPredict(predob, lb="Associated sexual dimorphism", ctrl.plot=ctrl)

For this case, we could not confirm the modulation between obesity and sex by the exposure environment of negative dimorphism in obesity.

predob <- predicteff(data4teff, featuresinf = rownames(texpsig), profile = TRUE, quant = 0.7)
tarob <- target(data4teff, predob, effect = "negative", mat = 0.7,
              model = "binomial", 
              lb = code[predob$featurenames[,1],"labels"])

tarob
## object of class: tarteff 
## 
## classification into 
##   negative treatment effect: 1
##   neutral: 0
##  
##   0   1 
## 689 318 
## 
## interaction fitted model: binomial
##   Estimate Std. Error    z value   Pr(>|z|) 
## -0.8705103  0.5126598 -1.6980274  0.0895026

However, we did confirm the significant modulation of the association between obesity and sex by the exposure environment of negative dimorphism in zBMI.

tarob <- target(data4teff, pred, effect = "negative", mat = 0.7,
              model = "binomial", 
              lb = code[pred$featurenames[,1],"labels"])

tarob
## object of class: tarteff 
## 
## classification into 
##   negative treatment effect: 1
##   neutral: 0
##  
##   0   1 
## 803 204 
## 
## interaction fitted model: binomial
##    Estimate  Std. Error     z value    Pr(>|z|) 
## -1.80292925  0.79959305 -2.25480854  0.02414535

Transcriptomic and methylomic associations for exposure enviroment of BMI dimporphism

We asked whether the classification of individuals into the exposure environment of negative dimorphism in zBMI could be associated to molecular data that point towards mechanisms of zBMI increments and their difference between sexes. We performed differential expression analysis for the exposure environment, using limma, adjusting for the covariates previously used in RCF modeling.

phenored <- data.frame(ex=tar$classification,
                       t=data4teff$teffdata$t, 
                       covs[,-c(1,3)])

fa <- formula(paste0("~ ", paste0(names(phenored)[-c(5,11)],
                                  collapse="+")))

#estimate associations
mod <- model.matrix(fa, data = phenored)
fit <- lmFit(exprs, mod)
fit <- eBayes(fit)

tt <- topTable(fit, coef="ex", number=Inf)
tt <- data.frame(genes=genesIDs[rownames(tt), "GeneSymbol_Affy"], tt)

head(tt[,c(1,4,5)])

We observed one significant association at transcriptome wide level. Remarkably this corresponded to METTL3, a subunit of N6-adenosine-methyltransferase, the enzyme that most abundantly methylates mRNA, forming N6-methyladenosine (m6A). While m6A regulation is found present in developmental processes, active research indicates its role in obesity, cancer, type 2 diabetes mellitus (T2DM), infertility, and developmental arrest (PMID: 28256005). In particular, the m6A is reversed by the fat mass and obesity-associated protein (FTO), a likely mechanism for obesity (PubMed: 28002401). In addition, m6A is the main modification of the RNA X-inactive specific transcript (XIST), that leads to the chromosome X inactivation in females.

EnhancedVolcano(tt, lab = tt$genes, 
                selectLab  = c("METTL3","SLC2A6"), 
                x = 'logFC', y = 'P.Value', 
                xlim=c(-0.2, 0.2),  
                pCutoff = 0.05/nrow(tt),
                labSize = 4.0,
                labCol = 'black',
                labFace = 'bold',
                boxedLabels = TRUE,
                legendPosition = 'right',
                drawConnectors = TRUE,
                widthConnectors = 1,
                colConnectors = 'black',
                title = "Exposure Environment",
                subtitle = "Differential expression",
                hline = 0.01)  

We also performed an enrichment analysis of 287 genes whose P-value of association was nominally significant. Interestingly, we found four pathways related to sugar transport activity that were significantly enriched.

tt <- topTable(fit, coef="ex", number=Inf)
selgenes <- rownames(tt)[tt$P.Value<0.01]
selgenes <- unique(genesIDs[selgenes, "EntrezeGeneID_Affy"])
selgenes <- selgenes[!selgenes%in%c("NA", "")]
selgenes <- sapply(strsplit(selgenes, ";"), function(x) x[[1]])

#run enrichment in GO
GO <- enrichGO(gene = selgenes, 'org.Hs.eg.db', ont="MF", pvalueCutoff=0.05, pAdjustMethod="BH")

dotplot(GO)

These erichment associations were led by 4 genes

wgenes <- unlist(strsplit(GO$geneID[1], "/"))
gn <- genesIDs[genesIDs[, "EntrezeGeneID_Affy"]%in%wgenes, "GeneSymbol_Affy"]
gn <- unique(gn)
gn
## [1] "SLC5A9" "SLC2A8" "SLC2A6" "SLC5A1"

all encode glucose transporters, namely GLUT9, GLUT8, GLUT6 and SGLT1; respectively.

We then axplored the association between exposure environment of negative dimorphism and methylation.

load("./data/methy.Rdata")

genesIDsmet <-  rowData(methy)

phenored <- data.frame(ex=tar$classification,
                       t=data4teff$teffdata$t, 
                       covs[,-c(1,3)])

fa <- formula(paste0("~",
                     paste0(names(phenored)[-c(5,11)],
                            collapse="+")))


mod <- model.matrix(fa, data = phenored)

cnm <- intersect(colnames(methy), rownames(mod))

met <- getBeta(methy[, cnm])

lmfit <- lmFit(met, design = mod[cnm,])
lmFite <- eBayes(lmfit)
tab <- topTable(lmFite, n = Inf, coef = "ex")

resmet <- data.frame(tab[,c(1,4,5)], gene=genesIDsmet[rownames(tab),"UCSC_RefGene_Name"])

head(resmet, 4)
EnhancedVolcano(tab, lab = rownames(tab), 
                selectLab  = c("cg16406960", "cg20839991", "cg16190350"),
                x = 'logFC', y = 'P.Value', 
                xlim = c(-0.075, 0.075),  
                pCutoff = 3e-07,
                labSize = 4.0,
                labCol = 'black',
                labFace = 'bold',
                boxedLabels = TRUE,
                legendPosition = 'right',
                drawConnectors = TRUE,
                widthConnectors = 1,
                colConnectors = 'black',
                title = "Exposure Environment",
                subtitle = "Differential methylation")  

We observed two significant associations at methylome level in the genes GNLY and HLA-DPB2. GNLY enocodes granulysin an antimicrobial peptide relased by cytolytic T cells (CTSs) and natural killer (NK) cells upon antigen stimulation. Granulysin acting as alarmin promotes immune response and imflamation (PMCID: PMC2981473). HLA-DPB2 belongs to the major Histocompatibility Complex, Class II, DP Beta 2; and is a integral part of normal human immune response. While the significant associations involved immune response, we did not find any significant enriched pathways for nominally significant associations.

Given the association between the exposure environment with METTL3 levels and the pathway of glucose transporter genes, we asked whether there were CpG cites of glocose transporters whose methylation associated with both the exposure environment and the levels of METTL3.

pad <- lapply(unique(gn), function(nn)
{
  loccpgs <- grep(nn, genesIDsmet[,"UCSC_RefGene_Name"])
  cpgs <- rownames(genesIDsmet[loccpgs,]) 
  tab[cpgs,c(1,4)]
})


pad <- do.call(rbind, pad)
METTL3 <- rownames(tt)[1]

modmettl3 <- mod[cnm,]
modmettl3[, "ex"] <- exprs[METTL3,cnm]

lmfit <- lmFit(met, design = modmettl3)
lmFitetabmettl3 <- eBayes(lmfit)
tabmettl3 <- topTable(lmFitetabmettl3, n = Inf, coef = "ex")


padmettl3 <- lapply(unique(gn), function(nn)
{
  loccpgs <- grep(nn, genesIDsmet[,"UCSC_RefGene_Name"])
  cpgs <- rownames(genesIDsmet[loccpgs,]) 
  tabmettl3[cpgs,c(1,4)]
})


padmettl3 <- do.call(rbind, padmettl3)
plot(-log10(pad[, "P.Value"]), -log10(padmettl3[, "P.Value"]),pch=16, ylab="-log10(P) association with METTL3 levels", xlab="-log10(P) association with exposure enviroment", main="Methlylation at glucose transporters")

text(1.9, max(-log10(padmettl3[, "P.Value"])),"cg16190350 at SLC2A6", cex=0.7)

cbind(pad["cg16190350",], padmettl3["cg16190350",]) 
par(mfrow=c(1,2))

plot(exprs[METTL3,cnm], met["cg16190350",cnm], pch=16, ylab="Methylation at SLC2A6", xlab="Expression at METTL3" ,col=2*tar$classification[cnm]+1)

abline(lm(met["cg16190350",cnm]~exprs[METTL3,cnm]))

legend("topright", legend=c("neutral dim", "negative dim"), col=c("black", "green"), pch=16, cex=0.7)


SLC2A6 <- rownames(genesIDs[genesIDs[, "GeneSymbol_Affy"]%in%"SLC2A6",])

plot(exprs[METTL3,cnm], exprs[SLC2A6,cnm], pch=16, ylab="Expression at SLC2A6", xlab="Expression at METTL3" ,col=2*tar$classification[cnm]+1)

abline(lm(exprs[SLC2A6,cnm]~exprs[METTL3,cnm]))

legend("topright", legend=c("neutral dim", "negative dim"), col=c("black", "green"), pch=16, cex=0.7)

We observed that increased expression of METTL3 associated with both expression of SLC2A6 and demethylation at its promoter at cg16190350. SLC2A6 enocodes for the clucose transported GLUT6 that, while it does not seem to regulate glocose uptake, it has been shown to be upregulated by inflamatory stimuli and involved in the activation of inflamatory macrophages M1 (PMID: 30700586, PMID: 30431159), a common inflamation during obesity that leads to insulin resintance (PMID: 29765169). While METTL3 is involved in multiple functions (PMC7339281), it has been found that its upregulation both stimulates M1 polarization (PMID: 31365297) and adipogenesis (PMC6066751), by its association with WTAP and METTL14 in a RNA N6-adenosine methyltransferase complex.

Given the reported links among, inflammation, METTL3 and SLC2A6 expression, we used the inferred levels of immune cells in blood to test their associations with zBMI, the interaction of exposure environment and sex, and with the expression of SLC2A6 and METTL3 and their interaction with sex.

cellnames <- c("NK_6", "Bcell_6", "CD4T_6", "CD8T_6", "Gran_6", "Mono_6") 

expenv <- tar$classification
nms <- names(expenv)
bmi <- eff[nms,"hs_zbmi_who"]
sex <-  2-as.numeric(covs[nms,"e3_sex_None"])

df <- data.frame(bmi, sex, expenv, covs[nms,c(2,12)], SLC2A6=exprs[SLC2A6,], METTL3=exprs[METTL3,])

#association of immune cells with bmi
a1 <- sapply(cellnames, function(nn)
{  
  x <- phenotype_expression[, nn]
  dd <- data.frame(x, df)
  summary(glm(x ~ bmi+sex+h_cohort+hs_child_age_None, data=dd ))$coeff["bmi",c(1,4)]
})

#association of immune cells with exposureenvironment
a2 <- sapply(cellnames, function(nn)
{  
  x <- phenotype_expression[, nn]
  dd <- data.frame(x, df)
  summary(glm(x ~ expenv*sex+h_cohort+hs_child_age_None, data=dd ))$coeff["expenv:sex",c(1,4)]
})

#association of immune cells with SLC2A6
a3 <- sapply(cellnames, function(nn)
{  
  x <- phenotype_expression[, nn]
  dd <- data.frame(x, df)
  summary(glm(x ~ SLC2A6+sex+h_cohort+hs_child_age_None, data=dd ))$coeff["SLC2A6",c(1,4)]
})

#association of immune cells with SLC2A6
a4 <- sapply(cellnames, function(nn)
{  
  x <- phenotype_expression[, nn]
  dd <- data.frame(x, df)
  summary(glm(x ~ METTL3+sex+h_cohort+hs_child_age_None, data=dd ))$coeff["METTL3",c(1,4)]
})

#association of immune cells with SLC2A6
a5 <- sapply(cellnames, function(nn)
{  
  x <- phenotype_expression[, nn]
  dd <- data.frame(x, df)
  summary(glm(x ~ SLC2A6*sex+h_cohort+hs_child_age_None, data=dd ))$coeff["SLC2A6:sex",c(1,4)]
})

#association of immune cells with SLC2A6
a6 <- sapply(cellnames, function(nn)
{  
  x <- phenotype_expression[, nn]
  dd <- data.frame(x, df)
  summary(glm(x ~ METTL3*sex+h_cohort+hs_child_age_None, data=dd ))$coeff["METTL3:sex",c(1,4)]
})

imnres <- t(round(rbind(a1,a2,a3,a4,a5,a6),4))

colnames(imnres) <- rep("P",ncol(imnres))
colnames(imnres)[c(1,3,5,7,9, 11)] <- c("BMI", "envExp*sex", "SLC2A6", "METTL3","SLC2A6*sex", "METTL3*sex" )

data.frame(imnres, check.names = FALSE)

We confirmed the higher presence of monocytes (macrophage precursors) in individuals with high zBMI, in addition to granulocytes. However, T cell abundance was reduced with zBMI. We observed that the exposure environment modulated the effect of sex on the abundance of CD8 T cells. The levels of monocytes increased with SLC2A6. Finally we found that the effect of METTL3 levels of NK cells was significantly different between boys and girls.

Conclusions

We successfully applied a causal model approach to study the sexual dimorphism in BMI of children. The modeling identified a specific profile of exposures within the exposome, for which the boys had significanlty higher zBMI than girls. We therefore were able to target individuals into a 12-exposure environment, which included air pollutants like PM10, organochlorines, phenols, phthalates and meteorological conditions such as temperature, humidity and UV light. Our analyses do not focused on the individual effect of these exposures on BMI but on how their joint conditioning defines an environment for which BMI dimorphism is high. However, it is interesting that the exposures by themselves have been previously linked with BMI differences (see for instance: PMID: 30703612, PMID: 29347948, PMCID: PMC4056217), some with a modifying effect of sex (PMID: 30538977).

Targeting individuals into the exposure environment of negative BMI dimorphism is important for identifying the exposure conditions where boys would be more susceptible to gain weight than girls. While precision public health interventions are on sight, further work is needed to translate the exposure environment into a standardized quantity, as independent as possible from covariates. In our method a fair amount of adjustment has been done on the exposures and the phenotype which can limit the method’s standardization for precision medicine practice. However, we found that the targeting was important to explore the molecular biological response to the exposure environment, offering important insights into the role of biological functions relating adipogenesis, immune and inflammatory response.

Our method for causal model inference of the effect of sex was based on random causal forest. The advantage of the approach is that it offers interval estimations for the effect of sex on the outcome for each individual, given their exposomic profiles. Inferences are unbiased and honest (Athey, et al. Ann. Statist. 47(2): 1148-1178 ). As such, individuals with significant estimated dimorphisms can be used to construct a consensus profile to which new individuals can be targeted. Here, we used the targeting on the entire data set, including the 80% of individuals used to grow the forest but not in the causal inferences. We confirmed that the targeting selected a group of negative dimorphism for zBMI. This can be considered an internal validation of the method. External validation with an independent exposomic study would be needed to validate the exposure environment. In relation to the ability of the methodology to identify reproducible groups of high sexual dimorphism; elsewhere, we have externally validated highly dimorphic groups using not exposomic but transcriptomic data. Note that the internal validation of an exposure environment is by no means granted, as we observed when attempting to construct an exposure environment for obesity. In addition, the several meaningful biological correlates of the environmental targeting support its relevance.

The exposure environment that we identify depends on the the selection of exposures at two stages: The first one for causal modeling and the second one for targeting. We are aware that the initial selection of informative exposures can have an important effect on the RCF performance. We observed that not preforming a feature selection before RCF yields no identification of individuals with high dimorphisms. This is likely due to the over-specification of individuals by the exposures in the entire exposome. Changes into the selection criteria (threshold) in the interaction association analysis can have an important effect. Note that the most important variable in random causal modeling (PM10) is not the most significant in the interaction associations; and therefore there is the possibility of not including it by being more stringent with the selection. The selection for the targeting depends on selecting the important variables involved in RCF. We selected the variables in the top 70% importance quantile to improve the targeting. While no optimization of the parameter was performed, we observed that the threshold is suitable to identify exposure environments of other dimorphisms like asthma and IQ (see Annex).

Annex: Other phenotypes

We applied our analysis strategy with no modifications to asthma and IQ, where we identified exposure environments of negative and positive dimorphisms, respectively.

Asthma

ph <- "hs_asthma"

texp <- sapply(names(expos)[-1], function(xx){
  dat4int <- data.frame(t = -as.numeric(covs$e3_sex_None )+2,
                      eff = eff[,ph],
                      exp = as.numeric(expos[,xx]),
                      covs[-c(1,3)])

 
  fa <- formula(paste0("eff ~ t:exp +",
                     paste0(names(dat4int)[-2], collapse="+")))

  summary(glm (fa, data=dat4int, family="binomial"))$coeff["t:exp", c(1,4)]
})

texp <-t(texp)

texpsig <- texp[texp[,2]<0.05,]

data.frame(labels=code[rownames(texpsig),"labels"], texpsig, check.names = FALSE)
phenored <- data.frame(eff=eff[,ph],
                       t=-as.numeric(covs$e3_sex_None )+2, 
                       covs[-c(1,3)])

names(phenored)[3:ncol(phenored)] <- paste0("cov", 3:ncol(phenored))


fa <- formula(paste0("~", paste0(names(phenored),
                            collapse="+")))
mod <- model.matrix(fa, data = phenored)[,-1]

fa <- formula(paste0("~", paste0(names(expos),
                            collapse="+")))
modexp <- model.matrix(fa, data = expos)[,-1]


data4teff <- list(features=data.frame(modexp), teffdata = data.frame(mod))

pred <- predicteff(data4teff, featuresinf = rownames(texpsig))


ctrl <- list(lb=c("Male", "Female"), wht="bottomright", whs = "topleft")

plotPredict(pred, lb="Associated sexual dimorphism", ctrl.plot=ctrl)

pred <- predicteff(data4teff, featuresinf = rownames(texpsig), profile = TRUE, quant = 0.7)
tar <- target(data4teff, pred, effect = "negative", mat = 0.7,
              model = "binomial",
              lb = code[pred$featurenames[,1],"labels"])

tar
## object of class: tarteff 
## 
## classification into 
##   negative treatment effect: 1
##   neutral: 0
##  
##   0   1 
## 773 234 
## 
## interaction fitted model: binomial
##     Estimate   Std. Error      z value     Pr(>|z|) 
## -2.074627541  0.673716588 -3.079377262  0.002074338
o.prof <- pred$profile$profnegative
names(o.prof) <- colnames(pred$profile$profnegative)
o.dir <- rep("-", length(o.prof))
o.dir[o.prof] <- "+"
imp <- pred$featurenames[complete.cases(pred$featurenames),]
row.names(imp) <- imp[,1]        
  
df <- data.frame(code[names(o.prof),c("family", "labels", "period")], importance=imp[names(o.prof),2],
                 "M>F"=o.dir, check.names = FALSE)

df

IQ

ph <- "hs_Gen_Tot"

texp <- sapply(names(expos)[-1], function(xx){
  dat4int <- data.frame(t = -as.numeric(covs$e3_sex_None )+2,
                      eff = eff[,ph],
                      exp = as.numeric(expos[,xx]),
                      covs[-c(1,3)])

 
  fa <- formula(paste0("eff ~ t:exp +",
                     paste0(names(dat4int)[-2], collapse="+")))

  summary(glm (fa, data=dat4int, family="binomial"))$coeff["t:exp", c(1,4)]
})

texp <-t(texp)

texpsig <- texp[texp[,2]<0.05,]

data.frame(labels=code[rownames(texpsig),"labels"], texpsig, check.names = FALSE)
phenored <- data.frame(eff = eff[,ph],
                       t = -as.numeric(covs$e3_sex_None )+2, 
                       covs[-c(1,3)])

names(phenored)[3:ncol(phenored)] <- paste0("cov", 3:ncol(phenored))


fa <- formula(paste0("~", paste0(names(phenored),
                            collapse="+")))
mod <- model.matrix(fa, data = phenored)[,-1]

fa <- formula(paste0("~", paste0(names(expos),
                            collapse="+")))
modexp <- model.matrix(fa, data = expos)[,-1]


data4teff <- list(features=data.frame(modexp), teffdata = data.frame(mod))

pred <- predicteff(data4teff, featuresinf = rownames(texpsig))

ctrl <- list(lb=c("Male", "Female"), wht="bottomright", whs = "topleft")
plotPredict(pred, lb="Associated sexual dimorphism", ctrl.plot=ctrl)

pred <- predicteff(data4teff, featuresinf = rownames(texpsig), profile = TRUE, quant = 0.7)
tar <- target(data4teff, pred, effect = "positive", mat = 0.7,
              model = "binomial", 
              lb = code[pred$featurenames[,1],"labels"])

tar
## object of class: tarteff 
## 
## classification into 
##   positive treatment effect: 1
##  
##   0   1 
## 916  91 
## 
## interaction fitted model: binomial
##    Estimate  Std. Error     z value    Pr(>|z|) 
## 1.815167769 0.584988987 3.102909302 0.001916284
o.prof <- pred$profile$profnegative
names(o.prof) <- colnames(pred$profile$profnegative)
o.dir <- rep("-", length(o.prof))
o.dir[o.prof] <- "+"
imp <- pred$featurenames[complete.cases(pred$featurenames),]
row.names(imp) <- imp[,1]        
  
df <- data.frame(code[names(o.prof),c("family", "labels", "period")], importance=imp[names(o.prof),2],
                 "M>F"=o.dir, check.names = FALSE)

df